Machine Learning Algorithms for Classification
Rob Schapire Princeton University
Machine Learning
studies how to automatically learn to make accurate
predictions based on past observations
classification problems:
  classify examples into given set of categories
[figure: labeled training examples -> machine learning algorithm -> classification rule; new example -> classification rule -> predicted classification]
Examples of Classification Problems
text categorization (e.g., spam filtering)
fraud detection
optical character recognition
machine vision (e.g., face detection)
natural-language processing (e.g., spoken language understanding)
market segmentation (e.g., predict if customer will respond to promotion)
bioinformatics (e.g., classify proteins according to their function)
Characteristics of Modern Machine Learning
primary goal: highly accurate predictions on test data
goal is not to uncover underlying truth
methods should be general purpose, fully automatic and "off-the-shelf"
  however, in practice, incorporation of prior, human knowledge is crucial
rich interplay between theory and practice
emphasis on methods that can handle large datasets
Why Use Machine Learning?
advantages:
  often much more accurate than human-crafted rules (since data driven)
  humans often incapable of expressing what they know (e.g., rules of English, or how to recognize letters), but can easily classify examples
  don't need a human expert or programmer
  automatic method to search for hypotheses explaining data
  cheap and flexible: can apply to any learning task
disadvantages:
  need a lot of labeled data
  error prone: usually impossible to get perfect accuracy
This Talk
machine learning algorithms:
  decision trees
  conditions for successful learning
  boosting
  support-vector machines
others not covered:
  neural networks
  nearest neighbor algorithms
  Naive Bayes
  bagging
  random forests
practicalities of using machine learning algorithms
Decision Trees
Example: Good versus Evil
problem: identify people as good or bad from their appearance
            sex     mask  cape  tie  ears  smokes  class
training data
  batman    male    yes   yes   no   yes   no      Good
  robin     male    yes   yes   no   no    no      Good
  alfred    male    no    no    yes  no    no      Good
  penguin   male    no    no    yes  no    yes     Bad
  catwoman  female  yes   no    no   yes   no      Bad
  joker     male    no    no    no   no    no      Bad
test data
  batgirl   female  yes   yes   no   yes   no      ??
  riddler   male    yes   no    no   no    no      ??
A Decision Tree Classifier
tie?
  no: cape?
    no: bad
    yes: good
  yes: smokes?
    no: good
    yes: bad
How to Build Decision Trees
choose rule to split on
divide data using splitting rule into disjoint subsets
repeat recursively for each subset
stop when leaves are (almost) "pure"
[figure: splitting {batman, robin, alfred, penguin, catwoman, joker} on tie]
  tie = no: batman, robin, catwoman, joker
  tie = yes: alfred, penguin
How to Choose the Splitting Rule
key problem: choosing best rule to split on
[figure: two candidate splits of {batman, robin, alfred, penguin, catwoman, joker}]
  split on tie:
    no: batman, robin, catwoman, joker
    yes: alfred, penguin
  split on cape:
    no: alfred, penguin, catwoman, joker
    yes: batman, robin
idea: choose rule that leads to greatest increase in "purity"
How to Measure Purity
want (im)purity function to look like this:
[figure: impurity as a function of $p$, zero at $p = 0$ and $p = 1$, largest at $p = 1/2$]
($p$ = fraction of positive examples)
commonly used impurity measures:
  entropy: $-p \ln p - (1 - p) \ln(1 - p)$
  Gini index: $p(1 - p)$
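The two impurity measures can be tried out on the Good-versus-Evil training data. The sketch below is only an illustration: it assumes that a split is scored by the size-weighted average impurity of its subsets (a common convention that the slides do not spell out), and the names `TRAIN` and `split_impurity` are made up for this example.

```python
import math

# Training rows from the Good-versus-Evil table (only three features kept).
TRAIN = {
    "batman":   {"cape": "yes", "tie": "no",  "smokes": "no",  "class": "good"},
    "robin":    {"cape": "yes", "tie": "no",  "smokes": "no",  "class": "good"},
    "alfred":   {"cape": "no",  "tie": "yes", "smokes": "no",  "class": "good"},
    "penguin":  {"cape": "no",  "tie": "yes", "smokes": "yes", "class": "bad"},
    "catwoman": {"cape": "no",  "tie": "no",  "smokes": "no",  "class": "bad"},
    "joker":    {"cape": "no",  "tie": "no",  "smokes": "no",  "class": "bad"},
}

def entropy(p):
    """Entropy impurity -p ln p - (1-p) ln(1-p), taking 0 ln 0 as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def gini(p):
    """Gini index p(1-p)."""
    return p * (1 - p)

def split_impurity(feature, impurity):
    """Assumed scoring rule: size-weighted average impurity after the split."""
    total = 0.0
    for value in ("yes", "no"):
        subset = [row for row in TRAIN.values() if row[feature] == value]
        if not subset:
            continue
        p = sum(row["class"] == "good" for row in subset) / len(subset)
        total += len(subset) / len(TRAIN) * impurity(p)
    return total

# Lower impurity after the split means a larger increase in purity.
best_feature = min(("cape", "tie", "smokes"), key=lambda f: split_impurity(f, gini))
```

Under this scoring rule, splitting on cape leaves purer subsets than splitting on tie, because the cape = yes subset contains only good examples.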
Kinds of Error Rates
training error = fraction of training examples misclassified
test error = fraction of test examples misclassified
generalization error = probability of misclassifying new random example
A Possible Classifier
[figure: a large decision tree with many nested tests on mask, smokes, sex, cape, ears and tie]
perfectly classifies training data
BUT: intuitively, overly complex
Another Possible Classifier
mask?
  no: bad
  yes: good
overly simple
doesn't even fit available data
Tree Size versus Accuracy
[figure: accuracy versus tree size, on training data and on test data; error versus tree size, train and test]
trees must be big enough to fit training data (so that "true" patterns are fully captured)
BUT: trees that are too big may overfit (capture noise or spurious patterns in the data)
significant problem: can't tell best tree size from training error
Overfitting Example
fitting points with a polynomial
[figure: three panels]
  underfit (degree = 1)
  ideal fit (degree = 3)
  overfit (degree = 20)
Building an Accurate Classifier
for good test performance, need:
  enough training examples
  good performance on training set
  classifier that is not too "complex" ("Occam's razor")
classifiers should be "as simple as possible, but no simpler"
"simplicity" closely related to prior expectations
measure "complexity" by:
  number of bits needed to write down
  number of parameters
  VC-dimension
Example
Training data:
[figure: training data]
Good and Bad Classifiers
Good:
  sufficient data
  low training error
  simple classifier
Bad:
  insufficient data
  training error too high
  classifier too complex
Theory
can prove:
  (generalization error) $\le$ (training error) $+ \tilde{O}\left(\sqrt{d/m}\right)$
with high probability, where
  $d$ = VC-dimension
  $m$ = number of training examples
Controlling Tree Size
typical approach: build very large tree that fully fits training data, then prune back
pruning strategies:
  grow on just part of training data, then find pruning with minimum error on held out part
  find pruning that minimizes (training error) + constant $\cdot$ (tree size)
Decision Trees
best known:
  C4.5 (Quinlan)
  CART (Breiman, Friedman, Olshen & Stone)
very fast to train and evaluate
relatively easy to interpret
but: accuracy often not state-of-the-art
Boosting
Example: Spam Filtering
problem: filter out spam (junk email)
gather large collection of examples of spam and non-spam:
  From: yoav@att.com        "Rob, can you review a paper"    non-spam
  From: xa412@hotmail.com   "Earn money without working"     spam
goal: have computer learn from examples to distinguish spam from non-spam
main observation:
  easy to find "rules of thumb" that are "often" correct
    If "v1agr@" occurs in message, then predict "spam"
  hard to find single rule that is very highly accurate
The Boosting Approach
devise computer program for deriving rough rules of thumb
apply procedure to subset of emails
obtain rule of thumb
apply to 2nd subset of emails
obtain 2nd rule of thumb
repeat T times
Details
how to choose examples on each round?
  concentrate on "hardest" examples (those most often misclassified by previous rules of thumb)
how to combine rules of thumb into single prediction rule?
  take (weighted) majority vote of rules of thumb
Boosting
boosting = general method of converting rough rules of thumb into highly accurate prediction rule
technically:
  assume given "weak" learning algorithm that can consistently find classifiers ("rules of thumb") at least slightly better than random, say, accuracy of 55%
  given sufficient data, a boosting algorithm can provably construct single classifier with very high accuracy, say, 99%
AdaBoost
given training examples $(x_i, y_i)$ where $y_i \in \{-1, +1\}$
initialize $D_1$ = uniform distribution on training examples
for $t = 1, \ldots, T$:
  train weak classifier ("rule of thumb") $h_t$ on $D_t$
  choose $\alpha_t > 0$
  compute new distribution $D_{t+1}$:
    for each example $i$: multiply $D_t(x_i)$ by
      $e^{-\alpha_t}$ ($< 1$) if $y_i = h_t(x_i)$
      $e^{\alpha_t}$ ($> 1$) if $y_i \ne h_t(x_i)$
    renormalize
output final classifier: $H_{\text{final}}(x) = \text{sign}\left(\sum_t \alpha_t h_t(x)\right)$
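The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the talk's own code: the weak learner is assumed to be a one-dimensional threshold stump, and the weight is set to $\alpha_t = \frac{1}{2} \ln\frac{1 - \epsilon_t}{\epsilon_t}$, the standard AdaBoost choice (the slide itself only says "choose $\alpha_t > 0$"). The function names are hypothetical.

```python
import math

def stump_predict(thr, sign, x):
    """Threshold stump: predicts `sign` when x >= thr, otherwise -sign."""
    return sign if x >= thr else -sign

def train_stump(xs, ys, dist):
    """Assumed weak learner: the stump with the lowest weighted error under dist."""
    best = None
    for thr in sorted(set(xs)):
        for sign in (+1, -1):
            err = sum(d for x, y, d in zip(xs, ys, dist)
                      if stump_predict(thr, sign, x) != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best

def adaboost(xs, ys, rounds):
    """Returns a list of (alpha_t, threshold, sign) triples."""
    m = len(xs)
    dist = [1.0 / m] * m                      # D_1: uniform distribution
    ensemble = []
    for _ in range(rounds):
        err, thr, sign = train_stump(xs, ys, dist)
        if err >= 0.5:                        # no better than random: stop
            break
        err = max(err, 1e-10)                 # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)   # standard choice (assumption)
        ensemble.append((alpha, thr, sign))
        # Multiply by e^{-alpha} when correct, e^{+alpha} when wrong, renormalize.
        dist = [d * math.exp(-alpha * y * stump_predict(thr, sign, x))
                for x, y, d in zip(xs, ys, dist)]
        z = sum(dist)
        dist = [d / z for d in dist]
    return ensemble

def predict(ensemble, x):
    """H_final(x): sign of the alpha-weighted vote."""
    score = sum(a * stump_predict(t, s, x) for a, t, s in ensemble)
    return 1 if score >= 0 else -1
```

For example, with `xs = list(range(10))` and labels that are +1 only on the interval 3..6, no single stump fits the data, but `adaboost(xs, ys, 3)` combines three stumps into a vote that classifies every training point correctly.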
Toy Example
weak classifiers = vertical or horizontal half-planes
[figure: training points with initial distribution $D_1$]
Round 1
$h_1$: $\epsilon_1 = 0.30$, $\alpha_1 = 0.42$
[figure: weak classifier $h_1$ and reweighted distribution $D_2$]
Round 2
$h_2$: $\epsilon_2 = 0.21$, $\alpha_2 = 0.65$
[figure: weak classifier $h_2$ and reweighted distribution $D_3$]
Round 3
$h_3$: $\epsilon_3 = 0.14$, $\alpha_3 = 0.92$
[figure: weak classifier $h_3$]
Final Classifier
$H_{\text{final}} = \text{sign}(0.42\, h_1 + 0.65\, h_2 + 0.92\, h_3)$
[figure: the three half-planes combined into the final classifier]
Theory: Training Error
weak learning assumption: each weak classifier at least slightly better than random
  i.e., (error of $h_t$ on $D_t$) $\le 1/2 - \gamma$ for some $\gamma > 0$
given this assumption, can prove:
  training error($H_{\text{final}}$) $\le e^{-2\gamma^2 T}$
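To get a feel for the bound, it helps to plug in numbers. The sketch below uses the 55% accuracy figure from the earlier slide, that is $\gamma = 0.05$, and asks how many rounds the bound needs to fall below 1%. The function names and the 1% target are chosen for illustration only.

```python
import math

def training_error_bound(gamma, rounds):
    """Upper bound e^{-2 gamma^2 T} on the training error of H_final."""
    return math.exp(-2.0 * gamma ** 2 * rounds)

def rounds_needed(gamma, target):
    """Smallest T for which the bound drops to `target` or below."""
    return math.ceil(math.log(1.0 / target) / (2.0 * gamma ** 2))

# Weak classifiers with 55% accuracy have edge gamma = 0.05.
t_needed = rounds_needed(0.05, 0.01)
```

The bound decreases exponentially in $T$, but with a small edge the exponent $2\gamma^2$ is tiny, so it can take several hundred rounds before the guarantee becomes meaningful.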
How Will Test Error Behave? (A First Guess)
[figure: train and test error versus # of rounds (T), from 20 to 100 rounds]
expect:
  training error to continue to drop (or reach zero)
  test error to increase when $H_{\text{final}}$ becomes "too complex"
    "Occam's razor"
    overfitting
    hard to know when to stop training
Actual Typical Run
[figure: train error, test error and C4.5 test error versus # of rounds (T), from 10 to 1000; boosting C4.5 on "letter" dataset]
test error does not increase, even after 1000 rounds
  (total size 2,000,000 nodes)
test error continues to drop even after training error is zero

  # rounds      5     100   1000
  train error   0.0   0.0   0.0
  test error    8.4   3.3   3.1

Occam's razor wrongly predicts "simpler" rule is better
The Margins Explanation
key idea:
  training error only measures whether classifications are right or wrong
  should also consider confidence of classifications
recall: $H_{\text{final}}$ is weighted majority vote of weak classifiers
measure confidence by margin = strength of the vote
empirical evidence and mathematical proof that:
  large margins $\Rightarrow$ better generalization error (regardless of number of rounds)
  boosting tends to increase margins of training examples (given weak learning assumption)
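One way to make "strength of the vote" concrete is the normalized margin $y \sum_t \alpha_t h_t(x) / \sum_t \alpha_t$, which lies in $[-1, +1]$ and is positive exactly when the weighted vote is correct. That normalization is the usual definition in the boosting literature and is assumed here, since the slide does not give a formula; the weights below are illustrative.

```python
def normalized_margin(alphas, votes, label):
    """label * (weighted vote) / (total weight); assumed normalization."""
    total = sum(alphas)
    return label * sum(a * h for a, h in zip(alphas, votes)) / total

# Illustrative vote weights for three weak classifiers.
alphas = [0.42, 0.65, 0.92]

# All three weak classifiers agree with the true label: margin 1.
unanimous = normalized_margin(alphas, [+1, +1, +1], +1)

# The heaviest classifier disagrees: still correct, but with a small margin.
narrow = normalized_margin(alphas, [+1, +1, -1], +1)
```

Both examples count as correct for training error, yet the second is decided by a slim majority, which is the distinction the margins explanation relies on.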
Application: Human-computer Spoken Dialogue
[with Rahim, Di Fabbrizio, Dutton, Gupta, Hollister & Riccardi]
application: automatic "store front" or "help desk" for AT&T Labs' Natural Voices business
caller can request demo, pricing information, technical support, sales agent, etc.
interactive dialogue
How It Works
[figure: dialogue loop: Human -> raw utterance -> automatic speech recognizer -> text -> natural language understanding -> predicted category -> dialogue manager -> text response -> text-to-speech -> computer speech -> Human]
NLU's job: classify caller utterances into 24 categories (demo, sales rep, pricing info, yes, no, etc.)
weak classifiers: test for presence of word or phrase
Application: Detecting Faces
[Viola & Jones]
problem: find faces in photograph or movie
weak classifiers: detect light/dark rectangles in image
many clever tricks to make extremely fast and accurate
Boosting
fast (but not quite as fast as other methods)
simple and easy to program
flexible: can combine with any learning algorithm, e.g.
  C4.5
  very simple rules of thumb
provable guarantees
state-of-the-art accuracy
tends not to overfit (but occasionally does)
many applications
Support-Vector Machines
Geometry of SVMs
given linearly separable data
margin = distance to separating hyperplane
choose hyperplane that maximizes minimum margin
intuitively:
  want to separate +'s from -'s as much as possible
  margin = measure of confidence
Theoretical Justification
let $\delta$ = minimum margin, $R$ = radius of enclosing sphere
then VC-dim $\le (R/\delta)^2$
so larger margins $\Rightarrow$ lower "complexity"
independent of number of dimensions
in contrast, unconstrained hyperplanes in $\mathbb{R}^n$ have VC-dim = (# parameters) = $n + 1$
Finding the Maximum Margin Hyperplane
examples $(x_i, y_i)$ where $x_i \in \mathbb{R}^n$, $y_i \in \{-1, +1\}$
find hyperplane $v \cdot x = 0$ with $\|v\| = 1$
margin = $y (v \cdot x)$
maximize: $\delta$
subject to: $y_i (v \cdot x_i) \ge \delta$ and $\|v\| = 1$
set $w = v/\delta$ $\Rightarrow$ $\|w\| = 1/\delta$
minimize: $\frac{1}{2} \|w\|^2$
subject to: $y_i (w \cdot x_i) \ge 1$
Convex Dual
form Lagrangian, set $\partial/\partial w = 0$:
  $w = \sum_i \alpha_i y_i x_i$
get quadratic program:
  maximize: $\sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$
  subject to: $\alpha_i \ge 0$
$\alpha_i$ = Lagrange multiplier
  $\alpha_i > 0$ $\Rightarrow$ support vector
key points:
  optimal $w$ is linear combination of support vectors
  dependence on $x_i$'s only through inner products
  maximization problem is convex with no local maxima
What If Not Linearly Separable?
answer #1: penalize each point by distance from margin 1, i.e., minimize:
  $\frac{1}{2} \|w\|^2 + \text{constant} \cdot \sum_i \max\{0, 1 - y_i (w \cdot x_i)\}$
answer #2: map into higher dimensional space in which data becomes linearly separable
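The objective in answer #1 is easy to evaluate directly. The sketch below computes it for a given $w$; the function names, the example points and the value of the constant are illustrative assumptions, and no optimization is attempted.

```python
def dot(a, b):
    """Ordinary inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def soft_margin_objective(w, examples, constant):
    """(1/2)||w||^2 + constant * sum_i max{0, 1 - y_i (w . x_i)}."""
    penalty = sum(max(0.0, 1.0 - y * dot(w, x)) for x, y in examples)
    return 0.5 * dot(w, w) + constant * penalty

# Two points beyond margin 1 (no penalty) and one correctly classified
# point that sits inside the margin (penalty 0.5).
examples = [((2.0, 0.0), +1), ((-2.0, 0.0), -1), ((0.5, 0.0), +1)]
value = soft_margin_objective((1.0, 0.0), examples, constant=1.0)
```

Note that the third point is classified correctly and is still penalized: the penalty measures the shortfall from margin 1, not just misclassification.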
Example
not linearly separable
map $x = (x_1, x_2) \mapsto \Phi(x) = (1, x_1, x_2, x_1 x_2, x_1^2, x_2^2)$
hyperplane in mapped space has form
  $a + b x_1 + c x_2 + d x_1 x_2 + e x_1^2 + f x_2^2 = 0$
= conic in original space
linearly separable in mapped space
Why Mapping to High Dimensions Is Dumb
can carry idea further
  e.g., add all terms up to degree $d$
  then $n$ dimensions mapped to $O(n^d)$ dimensions
  huge blow-up in dimensionality
statistical problem: amount of data needed often proportional to number of dimensions ("curse of dimensionality")
computational problem: very expensive in time and memory to work in high dimensions
How SVMs Avoid Both Problems
statistically, may not hurt since VC-dimension independent of number of dimensions ($(R/\delta)^2$)
computationally, only need to be able to compute inner products $\Phi(x) \cdot \Phi(z)$
sometimes can do very efficiently using kernels
Example (continued)
modify $\Phi$ slightly:
  $x = (x_1, x_2) \mapsto \Phi(x) = (1, \sqrt{2} x_1, \sqrt{2} x_2, \sqrt{2} x_1 x_2, x_1^2, x_2^2)$
then
  $\Phi(x) \cdot \Phi(z) = 1 + 2 x_1 z_1 + 2 x_2 z_2 + 2 x_1 x_2 z_1 z_2 + x_1^2 z_1^2 + x_2^2 z_2^2$
  $= (1 + x_1 z_1 + x_2 z_2)^2$
  $= (1 + x \cdot z)^2$
simply use in place of usual inner product
in general, for polynomial of degree $d$, use $(1 + x \cdot z)^d$
very efficient, even though finding hyperplane in $O(n^d)$ dimensions
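The identity in this example can be checked numerically. The sketch below builds the six-dimensional map explicitly and compares its inner product with $(1 + x \cdot z)^2$ computed in the original two dimensions; the function names are chosen for this illustration.

```python
import math

def phi(x):
    """Explicit degree-2 feature map for a two-dimensional point."""
    x1, x2 = x
    r = math.sqrt(2.0)
    return (1.0, r * x1, r * x2, r * x1 * x2, x1 * x1, x2 * x2)

def dot(a, b):
    """Ordinary inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def poly_kernel(x, z, degree=2):
    """(1 + x . z)^degree, computed without ever forming phi."""
    return (1.0 + dot(x, z)) ** degree

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))   # inner product in the mapped space
implicit = poly_kernel(x, z)     # same value from the original space
```

The kernel needs one inner product in $n$ dimensions and one power, while the explicit route grows with the number of mapped dimensions, which is the point of the slide.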
Kernels
kernel = function $K$ for computing $K(x, z) = \Phi(x) \cdot \Phi(z)$
permits efficient computation of SVMs in very high dimensions
$K$ can be any symmetric, positive semi-definite function (Mercer's theorem)
some kernels:
  polynomials
  Gaussian: $\exp\left(-\|x - z\|^2 / (2\sigma^2)\right)$
  defined over structures (trees, strings, sequences, etc.)
evaluation:
  $w \cdot \Phi(x) = \sum_i \alpha_i y_i \, \Phi(x_i) \cdot \Phi(x) = \sum_i \alpha_i y_i K(x_i, x)$
time depends on # support vectors
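The evaluation formula can be sketched as a loop over the support vectors. In the example below the Gaussian kernel is written with a width parameter `sigma`, and the two support vectors and their multipliers are hypothetical values picked by hand, not the solution of an actual quadratic program.

```python
import math

def gaussian_kernel(x, z, sigma=1.0):
    """exp(-||x - z||^2 / (2 sigma^2)); sigma is an assumed width parameter."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def decision_value(support, x, kernel=gaussian_kernel):
    """sum_i alpha_i y_i K(x_i, x) over the support vectors only."""
    return sum(alpha * y * kernel(xi, x) for xi, y, alpha in support)

# Hypothetical support vectors as (x_i, y_i, alpha_i) triples.
support = [((0.0, 0.0), +1, 1.0), ((2.0, 0.0), -1, 1.0)]

near_positive = decision_value(support, (0.0, 0.0))
midpoint = decision_value(support, (1.0, 0.0))
```

The cost of one prediction is one kernel evaluation per support vector, which is why the slide notes that time depends on their number.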
SVMs versus Boosting
both are large-margin classifiers (although with slightly different definitions of margin)
both work in very high dimensional spaces (in boosting, dimensions correspond to weak classifiers)
but different tricks are used:
  SVMs use kernel trick
  boosting relies on weak learner to select one dimension (i.e., weak classifier) to add to combined classifier
Application: Text Categorization
[Joachims]
goal: classify text documents
  e.g.: spam filtering
  e.g.: categorize news articles by topic
need to represent text documents as vectors in $\mathbb{R}^n$:
  one dimension for each word in vocabulary
  value = # times word occurred in particular document
  (many variations)
kernels don't help much
performance state of the art
Application: Recognizing Handwritten Characters
[Cortes & Vapnik]
examples are 16 x 16 pixel images, viewed as vectors in $\mathbb{R}^{256}$
[figure: sample digit images: 7, 7, 4, 8, 0, 1, 4]
kernels help:

  degree   error (%)   dimensions
  1        12.0        256
  2        4.7         33000
  3        4.4         $10^6$
  4        4.3         $10^9$
  5        4.3         $10^{12}$
  6        4.2         $10^{14}$
  7        4.3         $10^{16}$
  human    2.5

to choose best degree:
  train SVM for each degree
  choose one with minimum VC-dimension, about $(R/\delta)^2$
SVMs
fast algorithms now available, but not so simple to program (but good packages available)
state-of-the-art accuracy
power and flexibility from kernels
theoretical justification
many applications
Other Machine Learning Problem Areas
supervised learning
  classification
  regression: predict real-valued labels
  rare class / cost-sensitive learning
unsupervised: no labels
  clustering
  density estimation
semi-supervised
  in practice, unlabeled examples much cheaper than labeled examples
  how to take advantage of both labeled and unlabeled examples
  active learning: how to carefully select which unlabeled examples to have labeled
on-line learning: getting one example at a time
Practicalities
Getting Data
more is more
want training data to be like test data
use your knowledge of problem to know where to get training data, and what to expect test data to be like
Choosing Features
use your knowledge to know what features would be helpful for learning
redundancy in features is okay, and often helpful
  most modern algorithms do not require independent features
too many features?
  could use feature selection methods
  usually preferable to use algorithm designed to handle large feature sets
Choosing an Algorithm
first step: identify appropriate learning paradigm
  classification? regression?
  labeled, unlabeled or a mix?
  class proportions heavily skewed?
  goal to predict probabilities? rank instances?
  is interpretability of the results important?
  (keep in mind, no guarantees)
in general, no learning algorithm dominates all others on all problems
  SVMs and boosting decision trees (as well as other tree ensemble methods) seem to be best off-the-shelf algorithms
  even so, for some problems, difference in performance among these can be large, and sometimes, much simpler methods do better
Choosing an Algorithm (cont.)
sometimes, one particular algorithm seems to naturally fit problem, but often, best approach is to try many algorithms
  use knowledge of problem and algorithms to guide decisions, e.g., in choice of weak learner, kernel, etc.
  usually, don't know what will work until you try
  be sure to try simple stuff
  some packages (e.g., weka) make it easy to try many algorithms, though implementations are not always optimal
Testing Performance
does it work? which algorithm is best?
train on part of available data, and test on rest
  if dataset is large (say, in the 1000s), can simply set aside 1000 random examples as test
  otherwise, use 10-fold cross validation
    break dataset randomly into 10 parts
    in turn, use each block as a test set, training on other 9 blocks
repeat many times
use same train/test splits for each algorithm
might be natural split (e.g., train on data from 2004-06, test on data from 2007)
  however, can confound results: bad performance because of algorithm, or change of distribution?
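The 10-fold procedure above, with the same train/test splits reused for each algorithm, could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the toy 1-D dataset, and the two toy learners (a majority-label baseline and a 1-nearest-neighbor rule) are hypothetical and not part of the talk.

```python
import random

def k_fold_splits(n_examples, k=10, seed=0):
    """Randomly break indices 0..n_examples-1 into k blocks and return
    a list of (train_indices, test_indices) pairs, one per block."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    blocks = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        train = [j for b, block in enumerate(blocks) if b != i for j in block]
        splits.append((train, blocks[i]))
    return splits

def cross_validated_error(train_fn, xs, ys, splits):
    """Average test error over the given splits.
    train_fn(train_x, train_y) must return a predict(x) function."""
    errors = []
    for train, test in splits:
        predict = train_fn([xs[j] for j in train], [ys[j] for j in train])
        mistakes = sum(predict(xs[j]) != ys[j] for j in test)
        errors.append(mistakes / len(test))
    return sum(errors) / len(errors)

def majority_learner(train_x, train_y):
    """Hypothetical 'simple stuff' baseline: always predict the most common label."""
    majority = max(set(train_y), key=train_y.count)
    return lambda x: majority

def nearest_neighbor_learner(train_x, train_y):
    """Hypothetical 1-nearest-neighbor rule for 1-D inputs."""
    def predict(x):
        nearest = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
        return train_y[nearest]
    return predict

# Toy data (an assumption for illustration): label is 1 when x > 0.5.
xs = [i / 100 for i in range(100)]
ys = [int(x > 0.5) for x in xs]

# The splits are built once and shared, so every algorithm sees the same folds.
splits = k_fold_splits(len(xs), k=10, seed=0)
for name, learner in [("majority", majority_learner),
                      ("1-NN", nearest_neighbor_learner)]:
    print(name, cross_validated_error(learner, xs, ys, splits))
```

To "repeat many times", one option is to call `k_fold_splits` with several different seeds and average the resulting errors, still passing the same list of splits to every algorithm.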
Selecting Parameters
sometimes, theory can guide setting of parameters, possibly based on statistics measurable on training set
other times, need to use trial and test, as before
danger: trying too many combinations can lead to overfitting the test set
  break data into train, validation and test sets
  set parameters using validation set
  measure performance on test set for selected parameter settings
  or do cross-validation within cross-validation
trying many parameter settings is also very computationally expensive
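A minimal sketch of the train/validation/test recipe might look like the following. Everything here is an assumption made for illustration: the helper names, the noisy toy dataset, and the use of a k-nearest-neighbor rule whose parameter k is being tuned. The point it tries to show is that the test set is consulted only once, after the parameter has been chosen on the validation set.

```python
import random

def split_three_ways(n_examples, n_valid, n_test, seed=0):
    """Randomly split indices into disjoint train, validation and test lists."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    test = indices[:n_test]
    valid = indices[n_test:n_test + n_valid]
    train = indices[n_test + n_valid:]
    return train, valid, test

def error_rate(predict, xs, ys, indices):
    """Fraction of the given examples that predict() gets wrong."""
    return sum(predict(xs[j]) != ys[j] for j in indices) / len(indices)

def select_parameter(train_fn, settings, xs, ys, train, valid, test):
    """Pick the setting with lowest validation error, then report its test error.
    train_fn(train_x, train_y, setting) must return a predict(x) function."""
    train_x = [xs[j] for j in train]
    train_y = [ys[j] for j in train]
    best = None
    for setting in settings:
        predict = train_fn(train_x, train_y, setting)
        valid_error = error_rate(predict, xs, ys, valid)
        if best is None or valid_error < best[0]:
            best = (valid_error, setting, predict)
    valid_error, setting, predict = best
    # The test set is used only here, for the single selected setting.
    return setting, valid_error, error_rate(predict, xs, ys, test)

def knn_learner(train_x, train_y, k):
    """Hypothetical k-nearest-neighbor rule for 1-D inputs and 0/1 labels."""
    def predict(x):
        nearest = sorted(range(len(train_x)), key=lambda j: abs(train_x[j] - x))[:k]
        return int(2 * sum(train_y[j] for j in nearest) > k)
    return predict

# Toy data (an assumption): threshold at 0.5 with roughly 10% of labels flipped.
rng = random.Random(1)
xs = [rng.random() for _ in range(300)]
ys = [int(x > 0.5) ^ int(rng.random() < 0.1) for x in xs]

train, valid, test = split_three_ways(len(xs), n_valid=75, n_test=75)
setting, valid_error, test_error = select_parameter(
    knn_learner, [1, 3, 9, 27], xs, ys, train, valid, test)
print("chosen k:", setting, "validation error:", valid_error, "test error:", test_error)
```

The validation error of the chosen setting tends to be optimistic because it was used for the selection, which is why the separate test error is the number to report.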
Running Experiments
automate everything
  write one script that does everything at the push of a single button
  fewer errors
  easy to re-run (for instance, if computer crashes in middle of experiment)
  have explicit, scientific record in script of exact experiments that were executed
if running many experiments:
  put result of each experiment in a separate file
  use script to scan for next experiment to run based on which files have or have not already been created
  makes it very easy to re-start if computer crashes
  easy to run many experiments in parallel if have multiple processors/computers
  also need script to automatically gather and compile results
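One way the file-per-experiment idea could be sketched is shown below. The file layout (one JSON file per experiment name), the function names, and the placeholder experiments are all assumptions for illustration. The sketch does no locking, so two parallel workers could occasionally pick up the same experiment; a real setup would need some way to claim an experiment first.

```python
import json
import os
import tempfile

def result_path(results_dir, name):
    """Each experiment gets its own result file, named after the experiment."""
    return os.path.join(results_dir, name + ".json")

def pending_experiments(results_dir, names):
    """Experiments whose result file has not been created yet."""
    return [n for n in names if not os.path.exists(result_path(results_dir, n))]

def run_all(results_dir, experiments):
    """experiments maps a name to a zero-argument function returning a
    JSON-serializable result. Already finished experiments are skipped,
    so re-running after a crash simply continues where it stopped."""
    os.makedirs(results_dir, exist_ok=True)
    for name in pending_experiments(results_dir, list(experiments)):
        result = experiments[name]()
        tmp_path = result_path(results_dir, name) + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(result, f)
        # Rename only after the file is complete, so a half-written
        # result is never mistaken for a finished experiment.
        os.replace(tmp_path, result_path(results_dir, name))

def gather_results(results_dir, names):
    """Collect all results that exist so far into one dictionary."""
    results = {}
    for name in names:
        path = result_path(results_dir, name)
        if os.path.exists(path):
            with open(path) as f:
                results[name] = json.load(f)
    return results

# Usage with placeholder experiments (hypothetical names and values).
demo_experiments = {
    "stump_fold0": lambda: {"test_error": 0.25},
    "stump_fold1": lambda: {"test_error": 0.5},
}
with tempfile.TemporaryDirectory() as demo_dir:
    run_all(demo_dir, demo_experiments)
    print(gather_results(demo_dir, list(demo_experiments)))
```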
If Writing Your Own Code
R and MATLAB are great for easy coding, but for speed, may need C or Java
debugging machine learning algorithms is very tricky
  hard to tell if working, since don't know what to expect
  run on small cases where can figure out answer by hand
  test each module/subroutine separately
  compare to other implementations (written by others, or written in a different language)
  compare to theory or published results
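As a small illustration of checking a module against cases worked out by hand, consider a threshold rule ("decision stump") on one feature. The functions and the four-point datasets below are hypothetical examples, not code from the talk; the expected answers in the assertions were computed by hand by counting mistakes for each candidate threshold.

```python
def stump_mistakes(xs, ys, threshold):
    """Training mistakes of the rule 'predict 1 if x > threshold, else 0'."""
    return sum((x > threshold) != bool(y) for x, y in zip(xs, ys))

def best_threshold(xs, ys):
    """Return (threshold, mistakes) with the fewest training mistakes.
    Candidates are midpoints between consecutive distinct feature values;
    ties go to the smallest threshold."""
    values = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return min(((t, stump_mistakes(xs, ys, t)) for t in candidates),
               key=lambda pair: pair[1])

# Hand-worked case 1: labels switch from 0 to 1 between x=2 and x=3,
# so the threshold 2.5 separates them with no mistakes.
assert best_threshold([1, 2, 3, 4], [0, 0, 1, 1]) == (2.5, 0)

# Hand-worked case 2: alternating labels. Thresholds 1.5, 2.5, 3.5 make
# 1, 2 and 1 mistakes respectively, so the first of the tied best is 1.5.
assert stump_mistakes([1, 2, 3, 4], [0, 1, 0, 1], 2.5) == 2
assert best_threshold([1, 2, 3, 4], [0, 1, 0, 1]) == (1.5, 1)
```

Testing `stump_mistakes` separately from `best_threshold` follows the advice to check each subroutine on its own before trusting the whole.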
Summary
central issues in machine learning:
  avoidance of overfitting
  balance between simplicity and fit to data
machine learning algorithms:
  decision trees
  boosting
  SVMs
  many not covered
looked at practicalities of using machine learning methods
will see more in lab
Further reading on machine learning in general:
Ethem Alpaydin. Introduction to Machine Learning. MIT Press, 2004.
Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
Richard O. Duda, Peter E. Hart and David G. Stork. Pattern Classification (2nd ed.). Wiley, 2000.
Trevor Hastie, Robert Tibshirani and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.
Tom M. Mitchell. Machine Learning. McGraw Hill, 1997.
Vladimir N. Vapnik. Statistical Learning Theory. Wiley, 1998.
Decision trees:
Leo Breiman, Jerome H. Friedman, Richard A. Olshen and Charles J. Stone. Classification and Regression Trees. Wadsworth & Brooks, 1984.
J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
Boosting:
Ron Meir and Gunnar Rätsch. An Introduction to Boosting and Leveraging. In Advanced Lectures on Machine Learning (LNAI 2600), 2003. www-ee.technion.ac.il/~rmeir/Publications/MeiRae03.pdf
Robert E. Schapire. The boosting approach to machine learning: An overview. In Nonlinear Estimation and Classification, Springer, 2003. www.cs.princeton.edu/~schapire/boost.html
Support-vector machines:
Christopher J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. research.microsoft.com/~cburges/papers/SVMTutorial.pdf
Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000. www.support-vector.net

































