
Ensemble Methods

• An ensemble method constructs a set of base classifiers from the training data
– Also called an ensemble or a classifier combination
• It predicts the class label of a previously unseen record by aggregating the predictions made by the multiple base classifiers

© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004


General Idea

Original training data D

Step 1: Create multiple data sets D1, D2, ..., Dt-1, Dt
Step 2: Build multiple classifiers C1, C2, ..., Ct-1, Ct
Step 3: Combine the classifiers into a single ensemble classifier C*
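A minimal sketch of these three steps in Python, assuming scikit-learn-style base classifiers and NumPy (library choices are illustrative, not prescribed by the slides); bootstrap sampling is used only as one example of a data-set creation strategy, and class labels are assumed to be non-negative integers:

```python
# Sketch of the general ensemble idea: create t data sets, build t
# classifiers, combine them by majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_ensemble(X, y, t=25, seed=0):
    rng = np.random.default_rng(seed)
    classifiers = []
    for _ in range(t):
        # Step 1: create data set D_i (here: a bootstrap sample)
        idx = rng.integers(0, len(X), size=len(X))
        # Step 2: build classifier C_i on D_i
        classifiers.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return classifiers

def predict_ensemble(classifiers, X):
    # Step 3: combine the classifiers into C* by majority vote
    votes = np.stack([c.predict(X) for c in classifiers])
    # assumes class labels are non-negative integers
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
```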



Why does it work?

• Suppose there are 25 base classifiers
– Each classifier has an error rate ε = 0.35
– Assume the classifiers are independent
– The majority-vote ensemble makes a wrong prediction only when at least 13 of the 25 base classifiers are wrong, so its error rate is

$$\sum_{i=13}^{25} \binom{25}{i}\,\varepsilon^{i}\,(1-\varepsilon)^{25-i} \approx 0.06$$



Methods

• By manipulating the training data set: a classifier is built on each sampled subset of the training data set
– Two such ensemble methods: bagging (bootstrap aggregating) and boosting



Characteristics

• Ensemble methods work better with unstable classifiers
– Base classifiers that are sensitive to minor perturbations in the training set, for example decision trees and ANNs
– The variability among training examples is one of the primary sources of error in a classifier



Bias-Variance Decomposition

• Consider the trajectory of a projectile launched at a particular angle with a particular force. The observed distance from the target can be decomposed into three components:

$$d_{f,\theta}(y, t) = \text{Bias}_{\theta} + \text{Variance}_{f} + \text{Noise}_{t}$$

– The force (f) and the angle (θ) are the parameters of the launch
– Suppose the target is t, but the projectile hits the ground at y, a distance d away from t
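For reference, the same three terms appear in the standard decomposition of expected squared error for a model trained on a random training set D; this general form is a standard result and is not taken from the slides themselves:

```latex
% y = f(x) + \varepsilon, and \hat{f}_D is the model learned from training set D
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat{f}_D(x)\big)^2\right]
  = \underbrace{\big(f(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\right]}_{\text{Variance}}
  + \underbrace{\sigma_\varepsilon^2}_{\text{Noise}}
```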



Two Decision Trees (1)



Two Decision Trees (2)

• Bias: the stronger the assumptions a classifier makes about the nature of its decision boundary, the larger the classifier's bias
– A smaller tree makes stronger assumptions
– With too strong a bias, the algorithm cannot learn the target concept
• Variance: variability in the training data affects the expected error, because different compositions of the training set may lead to different decision boundaries
• Intrinsic noise in the target class
– The target class for some domains can be non-deterministic
– The same attribute values may appear with different class labels
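The trade-off can be seen empirically. A small sketch (scikit-learn and its synthetic two-moons data are assumptions used only for illustration) that compares how much a depth-limited tree and a fully grown tree change across bootstrap resamples of the same training set:

```python
# Shallow tree: strong assumptions, high bias, low variance.
# Fully grown tree: weak assumptions, low bias, high variance.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_test, y_test = make_moons(n_samples=1000, noise=0.3, random_state=1)

for depth in (1, None):
    preds = []
    for _ in range(30):                       # 30 bootstrap resamples
        idx = rng.integers(0, len(X), size=len(X))
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
        preds.append(tree.fit(X[idx], y[idx]).predict(X_test))
    preds = np.array(preds)
    majority = preds.mean(axis=0).round()
    variability = (preds != majority).mean()  # disagreement with the vote
    print(f"max_depth={depth}: vote accuracy={(majority == y_test).mean():.2f}, "
          f"variability across resamples={variability:.2f}")
```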



Bagging

• Sampling with replacement

Original Data      1  2  3  4  5  6  7  8  9  10
Bagging (Round 1)  7  8  10 8  2  5  10 10 5  9
Bagging (Round 2)  1  4  9  1  2  3  2  7  3  2
Bagging (Round 3)  1  8  5  10 5  5  9  6  3  7

• Build a classifier on each bootstrap sample

• Each training record has probability $1 - (1 - 1/n)^n$ of being included in a given bootstrap sample. Since $(1 - 1/n)^n \rightarrow e^{-1} \approx 0.368$ as $n$ grows, a bootstrap sample $D_i$ contains about 63.2% of the distinct training records when $n$ is large.
Bagging Algorithm
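The pseudocode box on the original slide is not reproduced in this extract. In outline: for i = 1, ..., k, draw a bootstrap sample D_i of size n, train a base classifier C_i on D_i, and classify a test record by aggregating the predictions of C_1, ..., C_k. A compact sketch using scikit-learn's built-in implementation (the library and the synthetic data are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 25 trees, each trained on a bootstrap sample of the training set
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                        random_state=0).fit(X_tr, y_tr)
single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("single tree :", single.score(X_te, y_te))
print("bagged trees:", bag.score(X_te, y_te))
```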



A Bagging Example (1)

• Consider a one-level binary decision tree (a decision stump) of the form x <= k, where the split point k is chosen to minimize the entropy
• Without bagging, the best decision stump splits at
– x <= 0.35 or x >= 0.75
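A sketch of the split-point search for such a stump; the one-dimensional data set below is only an illustration (the actual data appear in a figure that is not reproduced here), chosen so that two splits tie for the minimum entropy:

```python
import numpy as np

# Illustrative 1-D data set (not taken from the original figure)
x = np.arange(0.1, 1.05, 0.1)
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])

def entropy(labels):
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels == 1)
    return 0.0 if p in (0.0, 1.0) else -(p*np.log2(p) + (1-p)*np.log2(1-p))

def best_split(x, y):
    # Candidate split points lie midway between consecutive x values;
    # pick the one minimizing the weighted entropy of the two children.
    candidates = (x[:-1] + x[1:]) / 2
    def weighted_entropy(k):
        left, right = y[x <= k], y[x > k]
        return (len(left)*entropy(left) + len(right)*entropy(right)) / len(y)
    return min(candidates, key=weighted_entropy)

print(best_split(x, y))   # about 0.35 here (0.75 ties with the same entropy)
```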



A Bagging Example (2)



A Bagging Example (3)



A Bagging Example (4)



Summary on Bagging

• Bagging improves generalization error by reducing the variance of the base classifiers
• Its effectiveness depends on the stability of the base classifier
• If the base classifier is unstable, bagging helps to reduce the errors associated with random fluctuations in the training data
• If the base classifier is stable, the error of the ensemble is primarily caused by the bias of the base classifier; bagging may then even increase the error, because each bootstrap sample contains roughly 37% fewer distinct training records
Boosting

• An iterative procedure that adaptively changes the distribution of the training data in order to focus on previously misclassified records
– Initially, all N records are assigned equal weights
– Unlike bagging, the weights may change at the end of each boosting round
– The weights can be used by the base classifier to learn a model that is biased toward the higher-weight examples



Boosting

• Records that are wrongly classified will have their weights increased
• Records that are classified correctly will have their weights decreased

Original Data       1  2  3  4  5  6  7  8  9  10
Boosting (Round 1)  7  3  2  8  7  9  4  10 6  3
Boosting (Round 2)  5  4  9  4  2  5  1  7  4  2
Boosting (Round 3)  4  4  8  10 4  5  4  6  3  4

• Example 4 is hard to classify
• Its weight is increased, therefore it is more likely to be chosen again in subsequent rounds



Example: AdaBoost

• Base classifiers: C1, C2, …, CT
• Error rate of classifier Ci:

$$\varepsilon_i = \sum_{j=1}^{N} w_j \,\delta\big(C_i(x_j) \neq y_j\big)$$

where δ(·) equals 1 if its argument is true and 0 otherwise
• Importance of a classifier:

$$\alpha_i = \frac{1}{2}\ln\!\left(\frac{1-\varepsilon_i}{\varepsilon_i}\right)$$



Example: AdaBoost

• Weight update:

$$w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times \begin{cases} e^{-\alpha_j} & \text{if } C_j(x_i) = y_i \\ e^{\alpha_j} & \text{if } C_j(x_i) \neq y_i \end{cases}$$

where $Z_j$ is the normalization factor that ensures $\sum_i w_i^{(j+1)} = 1$
• If any intermediate round produces an error rate higher than 50%, the weights are reverted to 1/N and the resampling procedure is repeated
• Classification:

$$C^{*}(x) = \arg\max_{y} \sum_{j=1}^{T} \alpha_j \,\delta\big(C_j(x) = y\big)$$
A Boosting Example (1)

• Consider again the one-level binary decision tree x <= k, where the split point k is chosen to minimize the entropy (the same setting as in the bagging example)
• Without an ensemble, the best single decision stump splits at
– x <= 0.35 or x >= 0.75



A Boosting Example (2)



A Boosting Example (3)

