
Ensemble Methods

• An ensemble method constructs a set of base classifiers from the training data
– Also called an ensemble or a classifier combination
• It predicts the class label of a previously unseen record by aggregating the predictions made by the multiple base classifiers

© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004


General Idea

Original training data D

Step 1: Create multiple data sets D1, D2, ..., Dt-1, Dt
Step 2: Build multiple classifiers C1, C2, ..., Ct-1, Ct
Step 3: Combine the classifiers into a single ensemble classifier C*
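A minimal sketch of these three steps in Python, assuming scikit-learn-style base classifiers and NumPy (library choices are illustrative, not prescribed by the slides); bootstrap sampling is used only as one example of a data-set creation strategy, and class labels are assumed to be non-negative integers:

```python
# Sketch of the general ensemble idea: create t data sets, build t
# classifiers, combine them by majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_ensemble(X, y, t=25, seed=0):
    rng = np.random.default_rng(seed)
    classifiers = []
    for _ in range(t):
        # Step 1: create data set D_i (here: a bootstrap sample)
        idx = rng.integers(0, len(X), size=len(X))
        # Step 2: build classifier C_i on D_i
        classifiers.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return classifiers

def predict_ensemble(classifiers, X):
    # Step 3: combine the classifiers into C* by majority vote
    votes = np.stack([c.predict(X) for c in classifiers])
    # assumes class labels are non-negative integers
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
```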



Why does it work?

• Suppose there are 25 base classifiers
– Each classifier has an error rate ε = 0.35
– Assume the classifiers are independent
– The majority-vote ensemble makes a wrong prediction only when at least 13 of the 25 base classifiers are wrong, so its error rate is

$$\sum_{i=13}^{25} \binom{25}{i}\,\varepsilon^{i}\,(1-\varepsilon)^{25-i} \approx 0.06$$



Methods

• By manipulating the training data set: a classifier is built on each sampled subset of the training data set
– Two such ensemble methods: bagging (bootstrap aggregating) and boosting



Characteristics

• Ensemble methods work better with unstable classifiers
– Base classifiers that are sensitive to minor perturbations in the training set, for example decision trees and ANNs
– The variability among training examples is one of the primary sources of error in a classifier



Bias-Variance Decomposition

• Consider the trajectory of a projectile launched at a particular angle with a particular force. The observed distance from the target can be decomposed into three components:

$$d_{f,\theta}(y, t) = \text{Bias}_{\theta} + \text{Variance}_{f} + \text{Noise}_{t}$$

– The force (f) and the angle (θ) are the parameters of the launch
– Suppose the target is t, but the projectile hits the ground at y, a distance d away from t
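For reference, the same three terms appear in the standard decomposition of expected squared error for a model trained on a random training set D; this general form is a standard result and is not taken from the slides themselves:

```latex
% y = f(x) + \varepsilon, and \hat{f}_D is the model learned from training set D
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat{f}_D(x)\big)^2\right]
  = \underbrace{\big(f(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\right]}_{\text{Variance}}
  + \underbrace{\sigma_\varepsilon^2}_{\text{Noise}}
```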



Two Decision Trees (1)



Two Decision Trees (2)

• Bias: the stronger the assumptions a classifier makes about the nature of its decision boundary, the larger the classifier's bias
– A smaller tree makes stronger assumptions
– With too strong a bias, the algorithm cannot learn the target concept
• Variance: variability in the training data affects the expected error, because different compositions of the training set may lead to different decision boundaries
• Intrinsic noise in the target class
– The target class for some domains can be non-deterministic
– The same attribute values may appear with different class labels
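The trade-off can be seen empirically. A small sketch (scikit-learn and its synthetic two-moons data are assumptions used only for illustration) that compares how much a depth-limited tree and a fully grown tree change across bootstrap resamples of the same training set:

```python
# Shallow tree: strong assumptions, high bias, low variance.
# Fully grown tree: weak assumptions, low bias, high variance.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_test, y_test = make_moons(n_samples=1000, noise=0.3, random_state=1)

for depth in (1, None):
    preds = []
    for _ in range(30):                       # 30 bootstrap resamples
        idx = rng.integers(0, len(X), size=len(X))
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
        preds.append(tree.fit(X[idx], y[idx]).predict(X_test))
    preds = np.array(preds)
    majority = preds.mean(axis=0).round()
    variability = (preds != majority).mean()  # disagreement with the vote
    print(f"max_depth={depth}: vote accuracy={(majority == y_test).mean():.2f}, "
          f"variability across resamples={variability:.2f}")
```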



Bagging

• Sampling with replacement

Original Data      1  2  3  4  5  6  7  8  9  10
Bagging (Round 1)  7  8  10 8  2  5  10 10 5  9
Bagging (Round 2)  1  4  9  1  2  3  2  7  3  2
Bagging (Round 3)  1  8  5  10 5  5  9  6  3  7

• Build a classifier on each bootstrap sample

• Each training record has probability $1 - (1 - 1/n)^n$ of being included in a given bootstrap sample. Since $(1 - 1/n)^n \rightarrow e^{-1} \approx 0.368$ as $n$ grows, a bootstrap sample $D_i$ contains about 63.2% of the distinct training records when $n$ is large.
Bagging Algorithm
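The pseudocode box on the original slide is not reproduced in this extract. In outline: for i = 1, ..., k, draw a bootstrap sample D_i of size n, train a base classifier C_i on D_i, and classify a test record by aggregating the predictions of C_1, ..., C_k. A compact sketch using scikit-learn's built-in implementation (the library and the synthetic data are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 25 trees, each trained on a bootstrap sample of the training set
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                        random_state=0).fit(X_tr, y_tr)
single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("single tree :", single.score(X_te, y_te))
print("bagged trees:", bag.score(X_te, y_te))
```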



A Bagging Example (1)

• Consider a one-level binary decision tree (a decision stump) of the form x <= k, where the split point k is chosen to minimize the entropy
• Without bagging, the best decision stump splits at
– x <= 0.35 or x >= 0.75
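A sketch of the split-point search for such a stump; the one-dimensional data set below is only an illustration (the actual data appear in a figure that is not reproduced here), chosen so that two splits tie for the minimum entropy:

```python
import numpy as np

# Illustrative 1-D data set (not taken from the original figure)
x = np.arange(0.1, 1.05, 0.1)
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])

def entropy(labels):
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels == 1)
    return 0.0 if p in (0.0, 1.0) else -(p*np.log2(p) + (1-p)*np.log2(1-p))

def best_split(x, y):
    # Candidate split points lie midway between consecutive x values;
    # pick the one minimizing the weighted entropy of the two children.
    candidates = (x[:-1] + x[1:]) / 2
    def weighted_entropy(k):
        left, right = y[x <= k], y[x > k]
        return (len(left)*entropy(left) + len(right)*entropy(right)) / len(y)
    return min(candidates, key=weighted_entropy)

print(best_split(x, y))   # about 0.35 here (0.75 ties with the same entropy)
```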



A Bagging Example (2)



A Bagging Example (3)



A Bagging Example (4)



Summary on Bagging

• Bagging improves generalization error by reducing the variance of the base classifiers
• Its effectiveness depends on the stability of the base classifier
• If the base classifier is unstable, bagging helps to reduce the errors associated with random fluctuations in the training data
• If the base classifier is stable, the error of the ensemble is primarily caused by the bias of the base classifier; bagging may then even increase the error, because each bootstrap sample contains roughly 37% fewer distinct training records
Boosting

• An iterative procedure that adaptively changes the distribution of the training data in order to focus on previously misclassified records
– Initially, all N records are assigned equal weights
– Unlike bagging, the weights may change at the end of each boosting round
– The weights can be used by the base classifier to learn a model that is biased toward the higher-weight examples



Boosting

• Records that are wrongly classified will have their weights increased
• Records that are classified correctly will have their weights decreased

Original Data       1  2  3  4  5  6  7  8  9  10
Boosting (Round 1)  7  3  2  8  7  9  4  10 6  3
Boosting (Round 2)  5  4  9  4  2  5  1  7  4  2
Boosting (Round 3)  4  4  8  10 4  5  4  6  3  4

• Example 4 is hard to classify
• Its weight is increased, therefore it is more likely to be chosen again in subsequent rounds



Example: AdaBoost

• Base classifiers: C1, C2, …, CT
• Error rate of classifier Ci:

$$\varepsilon_i = \sum_{j=1}^{N} w_j \,\delta\big(C_i(x_j) \neq y_j\big)$$

where δ(·) equals 1 if its argument is true and 0 otherwise
• Importance of a classifier:

$$\alpha_i = \frac{1}{2}\ln\!\left(\frac{1-\varepsilon_i}{\varepsilon_i}\right)$$



Example: AdaBoost

• Weight update:

$$w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times \begin{cases} e^{-\alpha_j} & \text{if } C_j(x_i) = y_i \\ e^{\alpha_j} & \text{if } C_j(x_i) \neq y_i \end{cases}$$

where $Z_j$ is the normalization factor that ensures $\sum_i w_i^{(j+1)} = 1$
• If any intermediate round produces an error rate higher than 50%, the weights are reverted to 1/N and the resampling procedure is repeated
• Classification:

$$C^{*}(x) = \arg\max_{y} \sum_{j=1}^{T} \alpha_j \,\delta\big(C_j(x) = y\big)$$
A Boosting Example (1)

• Consider again the one-level binary decision tree x <= k, where the split point k is chosen to minimize the entropy (the same setting as in the bagging example)
• Without an ensemble, the best single decision stump splits at
– x <= 0.35 or x >= 0.75



A Boosting Example (2)



A Boosting Example (3)

