

Machine Learning Algorithms for Classification
Rob Schapire Princeton University

Machine Learning
studies how to automatically learn to make accurate predictions based on past observations

classification problems: classify examples into given set of categories

[Diagram: labeled training examples -> machine learning algorithm -> classification rule; new example -> classification rule -> predicted classification]

Examples of Classification Problems
text categorization (eg, spam filtering)
fraud detection
optical character recognition
machine vision (eg, face detection)
natural-language processing (eg, spoken language understanding)
market segmentation (eg: predict if customer will respond to promotion)
bioinformatics (eg, classify proteins according to their function)

Characteristics of Modern Machine Learning

primary goal: highly accurate predictions on test data
  goal is not to uncover underlying truth
methods should be general purpose, fully automatic and off-the-shelf
  however, in practice, incorporation of prior, human knowledge is crucial
rich interplay between theory and practice
emphasis on methods that can handle large datasets

Why Use Machine Learning?
advantages:
  often much more accurate than human-crafted rules (since data driven)
  humans often incapable of expressing what they know (eg, rules of English, or how to recognize letters), but can easily classify examples
  don't need a human expert or programmer
  automatic method to search for hypotheses explaining data
  cheap and flexible: can apply to any learning task
disadvantages:
  need a lot of labeled data
  error prone: usually impossible to get perfect accuracy

This Talk
machine learning algorithms:
  decision trees
  conditions for successful learning
  boosting
  support-vector machines
others not covered:
  neural networks
  nearest neighbor algorithms
  Naive Bayes
  bagging
  random forests
practicalities of using machine learning algorithms

Decision Trees

Example: Good versus Evil
problem: identify people as good or bad from their appearance

            sex     mask  cape  tie  ears  smokes  class
training data
batman      male    yes   yes   no   yes   no      Good
robin       male    yes   yes   no   no    no      Good
alfred      male    no    no    yes  no    no      Good
penguin     male    no    no    yes  no    yes     Bad
catwoman    female  yes   no    no   yes   no      Bad
joker       male    no    no    no   no    no      Bad
test data
batgirl     female  yes   yes   no   yes   no      ??
riddler     male    yes   no    no   no    no      ??

A Decision Tree Classifier

tie?
  no:  cape?
         no:  bad
         yes: good
  yes: smokes?
         no:  good
         yes: bad
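The tree above can be read as a pair of nested tests. The sketch below is one hypothetical way to encode it in Python; the function name, the dictionary layout, and the "yes"/"no" string encoding are assumptions made for illustration, and only the three features the tree actually tests are included.

```python
# Hypothetical encoding of the slide's tree; feature values are the strings "yes"/"no".
def classify(person):
    if person["tie"] == "no":
        # No tie: the cape test decides.
        return "Good" if person["cape"] == "yes" else "Bad"
    # Tie: the smokes test decides.
    return "Bad" if person["smokes"] == "yes" else "Good"

# Rows transcribed from the Good versus Evil table (only the tested features).
TRAINING = {
    "batman":   ({"tie": "no",  "cape": "yes", "smokes": "no"},  "Good"),
    "robin":    ({"tie": "no",  "cape": "yes", "smokes": "no"},  "Good"),
    "alfred":   ({"tie": "yes", "cape": "no",  "smokes": "no"},  "Good"),
    "penguin":  ({"tie": "yes", "cape": "no",  "smokes": "yes"}, "Bad"),
    "catwoman": ({"tie": "no",  "cape": "no",  "smokes": "no"},  "Bad"),
    "joker":    ({"tie": "no",  "cape": "no",  "smokes": "no"},  "Bad"),
}
TEST = {
    "batgirl": {"tie": "no", "cape": "yes", "smokes": "no"},
    "riddler": {"tie": "no", "cape": "no",  "smokes": "no"},
}

# Names of training examples the tree gets wrong (none for this tree).
training_errors = [name for name, (features, label) in TRAINING.items()
                   if classify(features) != label]
predictions = {name: classify(features) for name, features in TEST.items()}
print(training_errors)  # []
print(predictions)      # {'batgirl': 'Good', 'riddler': 'Bad'}
```

On this data the tree fits all six training examples and labels batgirl Good and riddler Bad.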

How to Build Decision Trees
choose rule to split on
divide data using splitting rule into disjoint subsets
repeat recursively for each subset
stop when leaves are (almost) pure

{batman, robin, alfred, penguin, catwoman, joker}
  tie = no:  {batman, robin, catwoman, joker}
  tie = yes: {alfred, penguin}

How to Choose the Splitting Rule
key problem: choosing best rule to split on:

{batman, robin, alfred, penguin, catwoman, joker}
  tie = no:  {batman, robin, catwoman, joker}
  tie = yes: {alfred, penguin}

{batman, robin, alfred, penguin, catwoman, joker}
  cape = no:  {alfred, penguin, catwoman, joker}
  cape = yes: {batman, robin}

idea: choose rule that leads to greatest increase in purity

How to Measure Purity
want (im)purity function to look like this:
[Figure: impurity as a function of p = fraction of positive examples; zero at p = 0 and p = 1, largest at p = 1/2]

commonly used impurity measures:
  entropy: $-p \ln p - (1 - p) \ln(1 - p)$
  Gini index: $p(1 - p)$
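As a rough illustration of how these impurity measures drive the choice of splitting rule, the sketch below scores every feature of the Good versus Evil training data by the size-weighted impurity of the subsets it produces. The helper names (`split_impurity`, `DATA`, `FEATURES`) and the weighting by subset size are assumptions for the sketch, not something the slides spell out.

```python
import math

def entropy(p):
    # Entropy impurity of a subset whose fraction of positive examples is p.
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def gini(p):
    # Gini index of a subset whose fraction of positive examples is p.
    return p * (1 - p)

FEATURES = ["sex", "mask", "cape", "tie", "ears", "smokes"]
# (feature values in FEATURES order, is_good), transcribed from the training table.
DATA = [
    (("male",   "yes", "yes", "no",  "yes", "no"),  True),   # batman
    (("male",   "yes", "yes", "no",  "no",  "no"),  True),   # robin
    (("male",   "no",  "no",  "yes", "no",  "no"),  True),   # alfred
    (("male",   "no",  "no",  "yes", "no",  "yes"), False),  # penguin
    (("female", "yes", "no",  "no",  "yes", "no"),  False),  # catwoman
    (("male",   "no",  "no",  "no",  "no",  "no"),  False),  # joker
]

def split_impurity(rows, feature, impurity=gini):
    # Size-weighted average impurity of the subsets produced by splitting on `feature`.
    index = FEATURES.index(feature)
    groups = {}
    for values, is_good in rows:
        groups.setdefault(values[index], []).append(is_good)
    return sum(len(g) / len(rows) * impurity(sum(g) / len(g))
               for g in groups.values())

# Lowest weighted impurity means the greatest increase in purity.
best_feature = min(FEATURES, key=lambda f: split_impurity(DATA, f))
print(best_feature)  # cape
```

With either measure, splitting on cape gives the purest subsets here, while splitting on tie leaves both subsets half good and half bad.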

Kinds of Error Rates

training error = fraction of training examples misclassified
test error = fraction of test examples misclassified
generalization error = probability of misclassifying new random example

A Possible Classifier

[Figure: a large decision tree with tests on mask, smokes, sex, cape, ears and tie, and leaves labeled good or bad]

perfectly classifies training data
BUT: intuitively, overly complex

Another Possible Classifier

mask?
  no:  bad
  yes: good

overly simple
doesn't even fit available data

Tree Size versus Accuracy

[Figure: accuracy (and error) versus tree size, with one curve on training data and one on test data]

trees must be big enough to fit training data (so that true patterns are fully captured)
BUT: trees that are too big may overfit (capture noise or spurious patterns in the data)
significant problem: can't tell best tree size from training error

Overfitting Example

fitting points with a polynomial

[Figure: three fits of the same points] underfit (degree = 1), ideal fit (degree = 3), overfit (degree = 20)

Building an Accurate Classifier
for good test performance, need:
  enough training examples
  good performance on training set
  classifier that is not too complex (Occam's razor)
classifiers should be as simple as possible, but no simpler
simplicity closely related to prior expectations
measure complexity by:
  number of bits needed to write down
  number of parameters
  VC-dimension

Example
Training data:
[Figure: training data]

Good and Bad Classifiers
Good:
  sufficient data, low training error, simple classifier
Bad:
  insufficient data
  training error too high
  classifier too complex

Theory
can prove:
  (generalization error) $\le$ (training error) $+ \tilde{O}\left(\sqrt{d/m}\right)$
with high probability
  d = VC-dimension
  m = number of training examples

Controlling Tree Size
typical approach: build very large tree that fully fits training data, then prune back
pruning strategies:
  grow on just part of training data, then find pruning with minimum error on held out part
  find pruning that minimizes (training error) + constant $\cdot$ (tree size)
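The second pruning strategy amounts to scoring each candidate pruning by its training error plus a penalty proportional to its size. A minimal sketch, assuming the candidate prunings have already been summarized as (tree size, training error) pairs; the numbers below are made up for illustration and the function name is hypothetical.

```python
def best_pruning(candidates, penalty):
    # candidates: list of (tree_size, training_error) pairs, one per candidate pruning.
    # Returns the pair minimizing training_error + penalty * tree_size.
    return min(candidates, key=lambda c: c[1] + penalty * c[0])

# Made-up candidates: bigger trees fit the training data better.
CANDIDATES = [(1, 0.40), (3, 0.20), (7, 0.10), (15, 0.05), (31, 0.00)]

print(best_pruning(CANDIDATES, penalty=0.0))   # (31, 0.0)
print(best_pruning(CANDIDATES, penalty=0.01))  # (7, 0.1)
print(best_pruning(CANDIDATES, penalty=1.0))   # (1, 0.4)
```

With no penalty the full tree wins; a larger constant pushes the choice toward smaller trees, so the constant itself is something to tune, for example on held-out data.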

Decision Trees
best known:
  C4.5 (Quinlan)
  CART (Breiman, Friedman, Olshen & Stone)
very fast to train and evaluate
relatively easy to interpret
but: accuracy often not state-of-the-art

Boosting

Example: Spam Filtering
problem: filter out spam (junk email)
gather large collection of examples of spam and non-spam:
  From: yoav@att.com       Rob, can you review a paper    non-spam
  From: xa412@hotmail.com  Earn money without working     spam
goal: have computer learn from examples to distinguish spam from non-spam
main observation:
  easy to find rules of thumb that are often correct
    If v1agr@ occurs in message, then predict spam
  hard to find single rule that is very highly accurate

The Boosting Approach
devise computer program for deriving rough rules of thumb
apply procedure to subset of emails
obtain rule of thumb
apply to 2nd subset of emails
obtain 2nd rule of thumb
repeat T times

Details
how to choose examples on each round?
  concentrate on hardest examples (those most often misclassified by previous rules of thumb)
how to combine rules of thumb into single prediction rule?
  take (weighted) majority vote of rules of thumb

Boosting
boosting = general method of converting rough rules of thumb into highly accurate prediction rule
technically:
  assume given weak learning algorithm that can consistently find classifiers (rules of thumb) at least slightly better than random, say, accuracy $\ge$ 55%
  given sufficient data, a boosting algorithm can provably construct single classifier with very high accuracy, say, 99%

AdaBoost
given training examples $(x_i, y_i)$ where $y_i \in \{-1, +1\}$
initialize $D_1$ = uniform distribution on training examples
for $t = 1, \ldots, T$:
  train weak classifier (rule of thumb) $h_t$ on $D_t$
  choose $\alpha_t > 0$
  compute new distribution $D_{t+1}$:
    for each example i: multiply $D_t(x_i)$ by
      $e^{-\alpha_t}$ ($< 1$) if $y_i = h_t(x_i)$
      $e^{\alpha_t}$ ($> 1$) if $y_i \ne h_t(x_i)$
    renormalize
output final classifier $H_{\text{final}}(x) = \text{sign}\left(\sum_t \alpha_t h_t(x)\right)$
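The loop above is short enough to sketch directly. The version below is a minimal illustration rather than a reference implementation: it assumes one-dimensional inputs, uses threshold stumps as the weak classifiers, and picks $\alpha_t = \frac{1}{2}\ln\frac{1 - \epsilon_t}{\epsilon_t}$ (the usual AdaBoost choice, which the slide leaves unspecified), where $\epsilon_t$ is the weighted error of $h_t$. The dataset and all function names are made up for the example.

```python
import math

def stump_predict(stump, x):
    # A stump is (threshold, left_label): predict left_label below the threshold.
    threshold, left_label = stump
    return left_label if x < threshold else -left_label

def best_stump(xs, ys, weights):
    # Weak learner: exhaustively pick the stump with the lowest weighted error.
    thresholds = [min(xs) - 0.5] + [x + 0.5 for x in sorted(set(xs))]
    best, best_err = None, float("inf")
    for threshold in thresholds:
        for left_label in (+1, -1):
            stump = (threshold, left_label)
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(stump, x) != y)
            if err < best_err:
                best, best_err = stump, err
    return best, best_err

def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n                       # D_1: uniform distribution
    ensemble = []
    for _ in range(rounds):
        stump, err = best_stump(xs, ys, weights)  # train h_t on D_t
        err = min(max(err, 1e-12), 1 - 1e-12)     # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)   # assumed choice of alpha_t
        ensemble.append((alpha, stump))
        # Shrink weights of correct examples, grow weights of mistakes.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]    # renormalize to get D_{t+1}
    return ensemble

def predict(ensemble, x):
    # H_final(x): sign of the alpha-weighted vote of the weak classifiers.
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Made-up data that no single stump classifies perfectly.
XS = list(range(10))
YS = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(XS, YS, rounds=3)
mistakes = sum(predict(ensemble, x) != y for x, y in zip(XS, YS))
```

On this data each stump alone misclassifies at least three of the ten points, but the weighted vote of three stumps classifies all of them correctly.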

Toy Example
[Figure: initial distribution D1 over the training points]
weak classifiers = vertical or horizontal half-planes

Round 1
[Figure: weak classifier h1 and the reweighted distribution D2]
$\epsilon_1 = 0.30$, $\alpha_1 = 0.42$

Round 2
[Figure: weak classifier h2 and the reweighted distribution D3]
$\epsilon_2 = 0.21$, $\alpha_2 = 0.65$

Round 3
[Figure: weak classifier h3]
$\epsilon_3 = 0.14$, $\alpha_3 = 0.92$

Final Classifier
$H_{\text{final}} = \text{sign}(0.42\,h_1 + 0.65\,h_2 + 0.92\,h_3)$
[Figure: the three weak classifiers combined into the final decision regions]

Theory: Training Error
weak learning assumption: each weak classifier at least slightly better than random
  ie, (error of $h_t$ on $D_t$) $\le 1/2 - \gamma$ for some $\gamma > 0$
given this assumption, can prove:
  training error($H_{\text{final}}$) $\le e^{-2\gamma^2 T}$
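To get a feel for how quickly the bound $e^{-2\gamma^2 T}$ shrinks, the sketch below evaluates it and computes how many rounds push it below $1/m$. Since training error on $m$ examples is always a multiple of $1/m$, a bound below $1/m$ forces it to zero. The function names and the values $\gamma = 0.1$ and $m = 1000$ are chosen only for illustration.

```python
import math

def training_error_bound(gamma, rounds):
    # Upper bound exp(-2 * gamma^2 * T) on the training error of H_final.
    return math.exp(-2 * gamma ** 2 * rounds)

def rounds_for_zero_training_error(gamma, num_examples):
    # Smallest integer T with exp(-2 * gamma^2 * T) < 1 / m,
    # ie the smallest integer strictly greater than ln(m) / (2 * gamma^2).
    return math.floor(math.log(num_examples) / (2 * gamma ** 2)) + 1

T = rounds_for_zero_training_error(gamma=0.1, num_examples=1000)
print(T)  # 346
```

So with an edge of only 0.1 over random guessing, a few hundred rounds are enough, under the weak learning assumption, to drive training error on 1000 examples to zero.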

How Will Test Error Behave? (A First Guess)
[Figure: hypothetical train and test error versus # of rounds (T), from 20 to 100 rounds]
expect:
  training error to continue to drop (or reach zero)
  test error to increase when $H_{\text{final}}$ becomes too complex
    Occam's razor
    overfitting
      hard to know when to stop training

Actual Typical Run
[Figure: train and test error versus # of rounds (T) from 10 to 1000, with the C4.5 test error shown for comparison]
(boosting C4.5 on letter dataset)

test error does not increase, even after 1000 rounds
  (total size 2,000,000 nodes)
test error continues to drop even after training error is zero

# rounds      5     100   1000
train error   0.0   0.0   0.0
test error    8.4   3.3   3.1

Occam's razor wrongly predicts simpler rule is better

The Margins Explanation
key idea:
  training error only measures whether classifications are right or wrong
  should also consider confidence of classifications
recall: $H_{\text{final}}$ is weighted majority vote of weak classifiers
measure confidence by margin = strength of the vote
empirical evidence and mathematical proof that:
  large margins $\Rightarrow$ better generalization error (regardless of number of rounds)
  boosting tends to increase margins of training examples (given weak learning assumption)
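One common way to make "strength of the vote" concrete is the normalized margin: the weighted fraction of weak classifiers voting for the correct label minus the weighted fraction voting against it. The sketch below assumes that definition; the weights 0.42, 0.65 and 0.92 are used purely as illustrative vote weights, and the function name is hypothetical.

```python
def margin(alphas, votes, label):
    # Normalized margin in [-1, +1]: positive exactly when the weighted
    # majority vote agrees with the true label, larger when more confident.
    return label * sum(a * h for a, h in zip(alphas, votes)) / sum(alphas)

ALPHAS = [0.42, 0.65, 0.92]

unanimous = margin(ALPHAS, (+1, +1, +1), label=+1)  # every weak classifier correct
narrow = margin(ALPHAS, (+1, +1, -1), label=+1)     # correct, but only barely
wrong = margin(ALPHAS, (-1, -1, +1), label=+1)      # weighted vote is incorrect
```

Two examples can both be classified correctly (and so contribute the same to training error) while having very different margins, which is the distinction the margins explanation relies on.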

Application: Human-computer Spoken Dialogue
[with Rahim, Di Fabbrizio, Dutton, Gupta, Hollister & Riccardi]
application: automatic store front or help desk for AT&T Labs' Natural Voices business
caller can request demo, pricing information, technical support, sales agent, etc
interactive dialogue

How It Works
[Diagram: Human -> raw utterance -> automatic speech recognizer -> text -> natural language understanding -> predicted category -> dialogue manager -> text response -> text-to-speech -> computer speech -> Human]
NLU's job: classify caller utterances into 24 categories (demo, sales rep, pricing info, yes, no, etc)
weak classifiers: test for presence of word or phrase

Application: Detecting Faces
[Viola & Jones]

problem: find faces in photograph or movie

weak classifiers: detect light/dark rectangles in image

many clever tricks to make extremely fast and accurate

Boosting
fast (but not quite as fast as other methods)
simple and easy to program
flexible: can combine with any learning algorithm, eg
  C4.5
  very simple rules of thumb
provable guarantees
state-of-the-art accuracy
many applications
tends not to overfit (but occasionally does)

Support-Vector Machines

Geometry of SVMs
given linearly separable data
margin = distance to separating hyperplane
choose hyperplane that maximizes minimum margin
intuitively:
  want to separate +'s from -'s as much as possible
  margin = measure of confidence

Theoretical Justification
let $\gamma$ = minimum margin
    $R$ = radius of enclosing sphere
then VC-dim $\le (R/\gamma)^2$
so larger margins $\Rightarrow$ lower complexity
independent of number of dimensions
in contrast, unconstrained hyperplanes in $\mathbb{R}^n$ have VC-dim = (# parameters) = $n + 1$

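Finding the Maximum Margin Hyperplane
examples $(x_i, y_i)$ where $x_i \in \mathbb{R}^n$, $y_i \in \{-1, +1\}$
find hyperplane $v \cdot x = 0$ with $\|v\| = 1$
margin = $y (v \cdot x)$
maximize: $\gamma$
  subject to: $y_i (v \cdot x_i) \ge \gamma$ and $\|v\| = 1$
set $w = v / \gamma$ $\Rightarrow$ $\|w\| = 1 / \gamma$
minimize: $\frac{1}{2} \|w\|^2$
  subject to: $y_i (w \cdot x_i) \ge 1$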
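The change of variables from $v$ to $w = v/\gamma$ can be checked numerically. The sketch below does not solve the optimization problem; it only measures the minimum margin of a given hyperplane through the origin on a small made-up dataset, then verifies that rescaling by $1/\gamma$ makes the tightest constraint equal to 1. All names and data points are assumptions for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def min_margin(v, examples):
    # Smallest value of y * (v . x) / ||v|| over the examples, ie the
    # minimum margin of the hyperplane v . x = 0 after normalizing v.
    norm = math.sqrt(dot(v, v))
    return min(y * dot(v, x) / norm for x, y in examples)

# Made-up linearly separable data in the plane.
EXAMPLES = [((2.0, 2.0), +1), ((3.0, 1.0), +1),
            ((-1.0, -2.0), -1), ((-2.0, -1.0), -1)]

gamma = min_margin((1.0, 1.0), EXAMPLES)       # margin of the diagonal direction
axis_margin = min_margin((1.0, 0.0), EXAMPLES)  # margin of a worse direction

# Rescale the unit-length v by 1 / gamma: now y_i (w . x_i) >= 1 for all i,
# with equality for the closest example, and ||w|| = 1 / gamma.
unit_v = (1 / math.sqrt(2), 1 / math.sqrt(2))
w = tuple(c / gamma for c in unit_v)
tightest = min(y * dot(w, x) for x, y in EXAMPLES)
```

Here the diagonal direction has a larger minimum margin than the axis-aligned one, and maximizing $\gamma$ over unit vectors is the same as minimizing $\|w\|$ under the rescaled constraints.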

Convex Dual
form Lagrangian, set $\partial / \partial w = 0$:
  $w = \sum_i \alpha_i y_i x_i$
get quadratic program:
  maximize: $\sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$
  subject to: $\alpha_i \ge 0$
$\alpha_i$ = Lagrange multiplier
  $\alpha_i > 0$ $\Rightarrow$ support vector
key points:
  optimal w is linear combination of support vectors
  dependence on $x_i$'s only through inner products
  maximization problem is convex with no local maxima

What If Not Linearly Separable?
answer 1: penalize each point by distance from margin 1, ie, minimize:
  $\frac{1}{2} \|w\|^2 + \text{constant} \cdot \sum_i \max\{0, 1 - y_i (w \cdot x_i)\}$
answer 2: map into higher dimensional space in which data becomes linearly separable
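The penalized objective in answer 1 is easy to evaluate for a fixed $w$. The sketch below only computes the objective, it does not minimize it; the data, the function names, and the value of the constant (called `c` here) are illustrative assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_margin_objective(w, examples, c):
    # 0.5 * ||w||^2 + c * sum_i max(0, 1 - y_i (w . x_i)).
    # Points with margin at least 1 contribute nothing to the sum.
    penalty = sum(max(0.0, 1.0 - y * dot(w, x)) for x, y in examples)
    return 0.5 * dot(w, w) + c * penalty

# Made-up data: the third point lies on the wrong side of the hyperplane.
EXAMPLES = [((2.0, 0.0), +1), ((-2.0, 0.0), -1), ((0.5, 0.0), -1)]

objective = soft_margin_objective((1.0, 0.0), EXAMPLES, c=1.0)
print(objective)  # 2.0
```

The first two points have margin 2 and incur no penalty; the third has $y (w \cdot x) = -0.5$ and is charged $1.5$, giving $0.5 + 1.5 = 2.0$ in total.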

Example
not linearly separable
map $x = (x_1, x_2) \mapsto \Phi(x) = (1, x_1, x_2, x_1 x_2, x_1^2, x_2^2)$
hyperplane in mapped space has form
  $a + b x_1 + c x_2 + d x_1 x_2 + e x_1^2 + f x_2^2 = 0$
  = conic in original space
linearly separable in mapped space

Why Mapping to High Dimensions Is Dumb
can carry idea further
  eg, add all terms up to degree d
  then n dimensions mapped to $O(n^d)$ dimensions
  huge blow-up in dimensionality
statistical problem: amount of data needed often proportional to number of dimensions (curse of dimensionality)
computational problem: very expensive in time and memory to work in high dimensions

How SVMs Avoid Both Problems
statistically, may not hurt since VC-dimension independent of number of dimensions ($(R/\gamma)^2$)
computationally, only need to be able to compute inner products $\Phi(x) \cdot \Phi(z)$
  sometimes can do very efficiently using kernels

Example (continued)
modify $\Phi$ slightly:
  $x = (x_1, x_2) \mapsto \Phi(x) = (1, \sqrt{2} x_1, \sqrt{2} x_2, \sqrt{2} x_1 x_2, x_1^2, x_2^2)$
then
  $\Phi(x) \cdot \Phi(z) = 1 + 2 x_1 z_1 + 2 x_2 z_2 + 2 x_1 x_2 z_1 z_2 + x_1^2 z_1^2 + x_2^2 z_2^2$
  $= (1 + x_1 z_1 + x_2 z_2)^2$
  $= (1 + x \cdot z)^2$
simply use in place of usual inner product
in general, for polynomial of degree d, use $(1 + x \cdot z)^d$
very efficient, even though finding hyperplane in $O(n^d)$ dimensions
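The identity between the explicit inner product in the six-dimensional mapped space and the degree-2 polynomial kernel can be checked numerically. A small sketch, with `phi` and `poly_kernel` as hypothetical names for the mapping and the kernel:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    # Explicit degree-2 feature map with sqrt(2) scaling on the cross terms.
    x1, x2 = x
    r2 = math.sqrt(2)
    return (1.0, r2 * x1, r2 * x2, r2 * x1 * x2, x1 ** 2, x2 ** 2)

def poly_kernel(x, z, degree=2):
    # Same inner product computed without ever forming phi(x) or phi(z).
    return (1 + dot(x, z)) ** degree

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))   # six multiplications in the mapped space
implicit = poly_kernel(x, z)     # one inner product in the original space
```

For these two points $x \cdot z = 1$, so both computations give $(1 + 1)^2 = 4$ up to floating-point rounding; the kernel version never touches the mapped space.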

Kernels
kernel = function K for computing
  $K(x, z) = \Phi(x) \cdot \Phi(z)$
permits efficient computation of SVMs in very high dimensions
K can be any symmetric, positive semi-definite function (Mercer's theorem)
some kernels:
  polynomials
  Gaussian: $\exp(-\|x - z\|^2 / 2\sigma^2)$
  defined over structures (trees, strings, sequences, etc)
evaluation:
  $w \cdot \Phi(x) = \sum_i \alpha_i y_i \, \Phi(x_i) \cdot \Phi(x) = \sum_i \alpha_i y_i K(x_i, x)$
time depends on # support vectors
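The evaluation formula says a trained SVM can classify a new point using only kernel evaluations against its support vectors. The sketch below assumes the multipliers $\alpha_i$ have already been found by some solver; the two support vectors and their multipliers are invented for illustration, and the Gaussian kernel uses the common $2\sigma^2$ parameterization.

```python
import math

def gaussian_kernel(x, z, sigma=1.0):
    # exp(-||x - z||^2 / (2 sigma^2)); equals 1 when x == z.
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2 * sigma ** 2))

def decision_value(support_vectors, x, kernel=gaussian_kernel):
    # sum_i alpha_i * y_i * K(x_i, x); the predicted class is its sign.
    # Cost is one kernel evaluation per support vector.
    return sum(alpha * y * kernel(x_i, x) for alpha, y, x_i in support_vectors)

# Invented (alpha_i, y_i, x_i) triples standing in for a solved dual problem.
SUPPORT_VECTORS = [(1.0, +1, (1.0, 1.0)), (1.0, -1, (-1.0, -1.0))]

near_positive = decision_value(SUPPORT_VECTORS, (0.9, 1.2))
near_negative = decision_value(SUPPORT_VECTORS, (-1.0, -1.0))
```

A point near the positive support vector gets a positive value and a point near the negative one gets a negative value, without $w$ or $\Phi$ ever being formed explicitly.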

SVMs versus Boosting
both are large-margin classifiers (although with slightly different definitions of margin)
both work in very high dimensional spaces (in boosting, dimensions correspond to weak classifiers)
but different tricks are used:
  SVMs use kernel trick
  boosting relies on weak learner to select one dimension (ie, weak classifier) to add to combined classifier

Application: Text Categorization
[Joachims]
goal: classify text documents
  eg: spam filtering
  eg: categorize news articles by topic
need to represent text documents as vectors in $\mathbb{R}^n$:
  one dimension for each word in vocabulary
  value = # times word occurred in particular document
  (many variations)
kernels don't help much
performance state of the art
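The word-count representation described above takes only a few lines. A minimal sketch, assuming whitespace tokenization and lowercasing (real systems usually handle punctuation and weighting more carefully); the vocabulary and function name are hypothetical.

```python
from collections import Counter

def bag_of_words(document, vocabulary):
    # One dimension per vocabulary word; value = number of occurrences.
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

VOCABULARY = ["earn", "money", "paper", "review", "working"]

spam_vector = bag_of_words("Earn money without working", VOCABULARY)
ham_vector = bag_of_words("Rob, can you review a paper", VOCABULARY)
print(spam_vector)  # [1, 1, 0, 0, 1]
print(ham_vector)   # [0, 0, 1, 1, 0]
```

With a realistic vocabulary these vectors have tens of thousands of dimensions and are mostly zeros, which fits the observation that a plain linear SVM already works well on text.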

Application: Recognizing Handwritten Characters
[Cortes & Vapnik]
examples are 16 × 16 pixel images, viewed as vectors in $\mathbb{R}^{256}$
[Figure: sample handwritten digit images: 7, 7, 4, 8, 0, 1, 4]
kernels help:

degree   error   dimensions
1        12.0    256
2        4.7     33000
3        4.4     10^6
4        4.3     10^9
5        4.3     10^12
6        4.2     10^14
7        4.3     10^16
human    2.5

to choose best degree:
  train SVM for each degree
  choose one with minimum VC-dimension $\approx (R/\gamma)^2$

SVMs
fast algorithms now available, but not so simple to program (but good packages available)
state-of-the-art accuracy
theoretical justification
many applications
power and flexibility from kernels

Other Machine Learning Problem Areas
supervised learning
  classification
  regression: predict real-valued labels
  rare class / cost-sensitive learning
unsupervised: no labels
  clustering
  density estimation
semi-supervised
  in practice, unlabeled examples much cheaper than labeled examples
  how to take advantage of both labeled and unlabeled examples
  active learning: how to carefully select which unlabeled examples to have labeled
on-line learning: getting one example at a time

Practicalities

Getting Data

more is more

want training data to be like test data

use your knowledge of problem to know where to get training data, and what to expect test data to be like

Choosing Features

use your knowledge to know what features would be helpful for learning

redundancy in features is okay, and often helpful

  most modern algorithms do not require independent features

too many features?

  could use feature selection methods
  usually preferable to use algorithm designed to handle large feature sets

Choosing an Algorithm
first step: identify appropriate learning paradigm

  classification? regression?
  labeled, unlabeled or a mix?
  class proportions heavily skewed?
  goal to predict probabilities? rank instances?
  is interpretability of the results important?

keep in mind, no guarantees

in general, no learning algorithm dominates all others on all problems

SVMs and boosted decision trees (as well as other tree ensemble methods) seem to be the best off-the-shelf algorithms

even so, for some problems, the difference in performance among these can be large, and sometimes much simpler methods do better

Choosing an Algorithm (cont.)

sometimes, one particular algorithm seems to naturally fit the problem, but often, the best approach is to try many algorithms

  use knowledge of problem and algorithms to guide decisions (e.g., in choice of weak learner, kernel, etc.)
  usually, don't know what will work until you try
  be sure to try simple stuff
  some packages (e.g., Weka) make it easy to try many algorithms, though implementations are not always optimal

Testing Performance
does it work? which algorithm is best?

train on part of available data, and test on rest

  if dataset is large (say, in 1000s), can simply set aside about 1000 random examples as test
  otherwise, use 10-fold cross validation:
    break dataset randomly into 10 parts
    in turn, use each block as a test set, training on other 9 blocks

repeat many times

use same train/test splits for each algorithm

might be natural split (e.g., train on data from 2004-06, test on data from 2007)

  however, can confound results: bad performance because of algorithm, or change of distribution?
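A minimal sketch of 10-fold cross validation with splits that can be reused across algorithms, assuming a learner is any function that takes training examples and labels and returns a prediction function. The names and the majority-class baseline are illustrative, not part of the talk.

```python
import random
from collections import Counter

def k_fold_splits(n_examples, k=10, seed=0):
    """Randomly break indices into k blocks; return (train, test) index pairs."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)   # fixed seed: identical splits every call
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train, folds[i]))
    return splits

def cross_validated_error(train_fn, examples, labels, splits):
    """Fraction of examples misclassified when each block is held out in turn."""
    mistakes = 0
    for train_idx, test_idx in splits:
        predict = train_fn([examples[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        mistakes += sum(predict(examples[i]) != labels[i] for i in test_idx)
    return mistakes / len(examples)

def majority_class_learner(train_examples, train_labels):
    """Very simple baseline: always predict the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda example: majority

examples = list(range(25))
labels = [0] * 15 + [1] * 10
splits = k_fold_splits(len(examples), k=10, seed=0)   # pass these to every algorithm
baseline_error = cross_validated_error(majority_class_learner, examples, labels, splits)
```

Computing `splits` once and passing the same list to every learner keeps the comparison between algorithms on identical train/test partitions.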

Selecting Parameters
sometimes, theory can guide setting of parameters, possibly based on statistics measurable on training set

other times, need to use trial and test, as before

danger: trying too many combinations can lead to overfitting the test set

  break data into train, validation and test sets
  set parameters using validation set
  measure performance on test set for selected parameter settings
  or do cross-validation within cross-validation

trying many parameter settings is also very computationally expensive
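The train/validation/test procedure can be sketched as below. The 60/20/20 split proportions, the toy 1-D data, and the k-nearest-neighbor learner used as the algorithm being tuned are all assumptions chosen only to keep the example self-contained.

```python
import random

def three_way_split(n_examples, frac_train=0.6, frac_val=0.2, seed=0):
    """Randomly partition indices into train, validation and test sets."""
    idx = list(range(n_examples))
    random.Random(seed).shuffle(idx)
    n_train = round(frac_train * n_examples)
    n_val = round(frac_val * n_examples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def error_rate(predict, xs, ys, idx):
    """Fraction of the indexed examples that are misclassified."""
    return sum(predict(xs[i]) != ys[i] for i in idx) / len(idx)

def knn_learner(k, train_xs, train_ys):
    """Toy learner with one parameter k: majority vote of the k nearest points."""
    def predict(x):
        nearest = sorted(range(len(train_xs)),
                         key=lambda i: abs(train_xs[i] - x))[:k]
        votes = sum(train_ys[i] for i in nearest)
        return 1 if 2 * votes > len(nearest) else 0
    return predict

def select_parameter(train_fn, settings, xs, ys, train_idx, val_idx):
    """Return (setting, validation error, predictor) with lowest validation error."""
    best = None
    for setting in settings:
        predict = train_fn(setting, [xs[i] for i in train_idx],
                           [ys[i] for i in train_idx])
        err = error_rate(predict, xs, ys, val_idx)
        if best is None or err < best[1]:   # ties keep the earlier setting
            best = (setting, err, predict)
    return best

xs = [i / 50 for i in range(50)]
ys = [int(x >= 0.5) for x in xs]
train_idx, val_idx, test_idx = three_way_split(len(xs))
best_k, val_err, predictor = select_parameter(knn_learner, [1, 3, 5],
                                              xs, ys, train_idx, val_idx)
test_err = error_rate(predictor, xs, ys, test_idx)   # test set touched only once
```

The test set is used once, after the parameter has been fixed, so the reported error is not biased by the search over settings.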

Running Experiments
automate everything

  write one script that does everything at the push of a single button
  fewer errors
  easy to re-run (for instance, if computer crashes in middle of experiment)
  have explicit, scientific record in script of exact experiments that were executed

if running many experiments:

  put result of each experiment in a separate file
  use script to scan for next experiment to run based on which files have or have not already been created
  makes it very easy to re-start if computer crashes
  easy to run many experiments in parallel if have multiple processors/computers
  also need script to automatically gather and compile results
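One way such a script might look, assuming each experiment is a named zero-argument function returning a JSON-serializable result. The file layout and all names are hypothetical. This sketch does no locking, so truly parallel workers would need an extra claim step to avoid running the same experiment twice.

```python
import json
from pathlib import Path

def pending_experiments(experiments, results_dir):
    """Names of experiments whose result file has not been created yet."""
    results_dir = Path(results_dir)
    return [name for name in experiments
            if not (results_dir / f"{name}.json").exists()]

def run_all(experiments, results_dir):
    """Run every pending experiment, writing one result file per experiment."""
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    for name in pending_experiments(experiments, results_dir):
        result = experiments[name]()
        tmp = results_dir / f"{name}.json.tmp"
        tmp.write_text(json.dumps(result))
        # Rename only after the write finishes, so a crash never leaves
        # a half-written file that looks like a completed experiment.
        tmp.replace(results_dir / f"{name}.json")

def gather_results(results_dir):
    """Collect all finished results into one dictionary keyed by experiment name."""
    return {path.stem: json.loads(path.read_text())
            for path in sorted(Path(results_dir).glob("*.json"))}
```

Re-running `run_all` after a crash skips everything that already has a result file and resumes with the first missing one.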

If Writing Your Own Code

R and MATLAB are great for easy coding, but for speed, may need C or Java

debugging machine learning algorithms is very tricky

  hard to tell if working, since don't know what to expect
  run on small cases where can figure out answer by hand
  test each module/subroutine separately
  compare to other implementations (written by others, or written in different language)
  compare to theory or published results

Summary

central issues in machine learning:

  avoidance of overfitting
  balance between simplicity and fit to data

machine learning algorithms:

  decision trees
  boosting
  SVMs
  many not covered

looked at practicalities of using machine learning methods (will see more in lab)

Further reading

on machine learning in general:

  Ethem Alpaydin. Introduction to Machine Learning. MIT Press, 2004.
  Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
  Richard O. Duda, Peter E. Hart and David G. Stork. Pattern Classification (2nd ed.). Wiley, 2000.
  Trevor Hastie, Robert Tibshirani and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.
  Tom M. Mitchell. Machine Learning. McGraw Hill, 1997.
  Vladimir N. Vapnik. Statistical Learning Theory. Wiley, 1998.

decision trees:

  Leo Breiman, Jerome H. Friedman, Richard A. Olshen and Charles J. Stone. Classification and Regression Trees. Wadsworth & Brooks, 1984.
  J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

boosting:

  Ron Meir and Gunnar Rätsch. An Introduction to Boosting and Leveraging. In Advanced Lectures on Machine Learning (LNAI 2600), 2003. www-ee.technion.ac.il/~rmeir/Publications/MeiRae03.pdf
  Robert E. Schapire. The boosting approach to machine learning: An overview. In Nonlinear Estimation and Classification, Springer, 2003. www.cs.princeton.edu/~schapire/boost.html

support-vector machines:

  Christopher J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. research.microsoft.com/~cburges/papers/SVMTutorial.pdf
  Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000. www.support-vector.net
