Chapter 8
Ensemble Learning
Introduction
Bagging
Boosting
AdaBoost
Training Data
Example
What is the average house price?
From the population F, draw a sample L = (x1, x2, …, xn) and compute its average u.
Question: How reliable is u? What is the standard error of u? What is its confidence interval?
Solution: the bootstrap.
An example:
Original sample X = (3.12, 0, 1.57, 19.67, 0.22, 2.20), mean = 4.46
Bootstrap sample X1 = (1.57, 0.22, 19.67, 0, 0, 2.20, 3.12), mean = 4.13
Bootstrap sample X2 = (0, 2.20, 2.20, 2.20, 19.67, 1.57), mean = 4.64
Bootstrap sample X3 = (0.22, 3.12, 1.57, 3.12, 2.20, 0.22), mean = 1.74
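To make the bootstrap recipe concrete, here is a minimal Python sketch (my own illustration; the number of resamples B and the random seed are arbitrary choices, and the data are the six values from the example):

# Estimate the standard error of the sample mean u by resampling with replacement.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([3.12, 0.0, 1.57, 19.67, 0.22, 2.20])   # original sample, mean ~ 4.46

B = 1000                                              # number of bootstrap resamples
boot_means = np.array([rng.choice(X, size=len(X), replace=True).mean()
                       for _ in range(B)])

print("sample mean u:", X.mean())
print("bootstrap estimate of standard error:", boot_means.std(ddof=1))
print("95% percentile confidence interval:", np.percentile(boot_means, [2.5, 97.5]))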
Bagging: each training set is generated by sampling from the original data (indices 1–8) with replacement.

Original:        1 2 3 4 5 6 7 8
Training set 1:  2 7 8 3 7 6 3 1
Training set 2:  7 8 5 6 4 2 7 1
Training set 3:  3 6 2 7 5 6 2 2
Training set 4:  4 5 1 4 6 4 3 8
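The same resampling step is the heart of bagging. Below is a sketch of a bagged ensemble: bootstrap training sets are drawn exactly as in the table, one decision tree is fit per set, and predictions are combined by majority vote. The dataset, base learner, and ensemble size are placeholder choices, not part of the original slides:

# Bagging sketch: bootstrap training sets + majority vote over decision trees.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, random_state=0)

models = []
for b in range(25):                                   # 25 bootstrap training sets
    idx = rng.integers(0, len(X), size=len(X))        # sample indices with replacement
    models.append(DecisionTreeClassifier(random_state=b).fit(X[idx], y[idx]))

votes = np.stack([m.predict(X) for m in models])      # one row of predictions per model
majority = (votes.mean(axis=0) > 0.5).astype(int)     # majority vote for 0/1 labels
print("bagged training accuracy:", (majority == y).mean())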
Boosting: a family of methods.
Classifiers are produced sequentially.
Each classifier depends on the previous one and focuses on the previous one's errors.
Examples that were incorrectly predicted by earlier classifiers are chosen more often or weighted more heavily (see the sketch below).
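To make the reweighting idea concrete, here is a minimal, generic boosting-style loop; it is not yet AdaBoost, and the "double the weight" rule, base learner, and data are placeholder assumptions for illustration only:

# Generic boosting-style loop: each round fits a weak learner on weighted data
# and increases the weights of the examples it gets wrong.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
w = np.full(len(y), 1.0 / len(y))          # start with uniform sample weights

learners = []
for t in range(5):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    wrong = stump.predict(X) != y
    w[wrong] *= 2.0                         # focus the next round on the errors
    w /= w.sum()                            # renormalize to a distribution
    learners.append(stump)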
What is AdaBoost?
AdaBoost = Adaptive Boosting.
A learning algorithm that builds a strong classifier from many weak ones.
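For a quick practical feel, the snippet below uses scikit-learn's AdaBoostClassifier, which boosts shallow decision trees by default; the dataset and the n_estimators value are arbitrary choices for this demo:

# Off-the-shelf AdaBoost: many weak stumps combined into a strong classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))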
Weak classifiers, each only slightly better than random:
$h_1(x) \in \{-1, +1\},\; h_2(x) \in \{-1, +1\},\; \ldots,\; h_T(x) \in \{-1, +1\}$
Strong classifier:
$H_T(x) = \operatorname{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
How to
– define features?
– select beneficial features?
– train the weak classifiers?
– manage (weight) the training samples?
– associate a weight with each weak classifier?
Schematic: train Weak Classifier 1; increase the weights of the misclassified examples; train Weak Classifier 2; increase the weights again; train Weak Classifier 3; and so on. The final classifier is a combination of the weak classifiers.
The AdaBoost algorithm:
Given: $(x_1, y_1), \ldots, (x_m, y_m)$, where $x_i \in X$ and $y_i \in \{-1, +1\}$
Initialization: $D_1(i) = \frac{1}{m}$, $i = 1, \ldots, m$
For $t = 1, \ldots, T$:
• Find the classifier $h_t : X \to \{-1, +1\}$ that minimizes the error with respect to $D_t$, i.e., $h_t = \arg\min_{h_j} \epsilon_j$, where $\epsilon_j = \sum_{i=1}^{m} D_t(i)\,[y_i \neq h_j(x_i)]$
• Weight the classifier: $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$
• Update the distribution: $D_{t+1}(i) = \frac{D_t(i)\,\exp[-\alpha_t y_i h_t(x_i)]}{Z_t}$, where $Z_t$ is a normalization factor
Output the final classifier: $H(x) = \operatorname{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
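A from-scratch sketch of this loop, using exhaustive decision stumps as the weak classifiers h_t. Only the three update formulas come from the algorithm above; the helper names, the stump search, and the toy data are invented for illustration:

# Minimal AdaBoost with decision stumps as weak classifiers.
import numpy as np

def stump_predict(X, j, thr, polarity):
    """Decision stump: predict +1 on one side of a threshold, -1 on the other."""
    return np.where(polarity * (X[:, j] - thr) >= 0, 1, -1)

def best_stump(X, y, D):
    """Exhaustively pick the stump with the lowest weighted error under D."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = stump_predict(X, j, thr, polarity)
                err = np.sum(D * (pred != y))
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def adaboost(X, y, T=10):
    m = len(y)
    D = np.full(m, 1.0 / m)                        # D_1(i) = 1/m
    ensemble = []
    for _ in range(T):
        err, j, thr, pol = best_stump(X, y, D)     # h_t minimizes the weighted error
        err = np.clip(err, 1e-12, 1 - 1e-12)       # keep the log well defined
        alpha = 0.5 * np.log((1 - err) / err)      # alpha_t = 1/2 ln((1-eps_t)/eps_t)
        pred = stump_predict(X, j, thr, pol)
        D = D * np.exp(-alpha * y * pred)          # upweight mistakes, downweight hits
        D = D / D.sum()                            # divide by Z_t (normalization)
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def predict(ensemble, X):
    score = sum(a * stump_predict(X, j, thr, pol) for a, j, thr, pol in ensemble)
    return np.where(score >= 0, 1, -1)             # H(x) = sign(sum_t alpha_t h_t(x))

# Toy usage on a 1-D pattern that no single stump can fit.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1, 1])
print(predict(adaboost(X, y, T=10), X))            # combined prediction of the stumps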
How to determine $\alpha_t$?
Final classifier: $H(x) = \operatorname{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
Minimize the exponential loss: $\ell_{\exp}(H) = E_{x,y}\!\left[e^{-yH(x)}\right]$
Define $H_t(x) = H_{t-1}(x) + \alpha_t h_t(x)$ with $H_0(x) = 0$; then $H(x) = H_T(x)$.
$E_{x,y}\!\left[e^{-yH_t(x)}\right] = E_x\!\left[E_y\!\left[e^{-y[H_{t-1}(x) + \alpha_t h_t(x)]} \mid x\right]\right] = e^{-\alpha_t}\,P(y = h_t(x)) + e^{\alpha_t}\,P(y \neq h_t(x))$
Setting $\frac{\partial E_{x,y}[e^{-yH_t(x)}]}{\partial \alpha_t} = 0$ gives
$\alpha_t = \frac{1}{2}\ln\frac{P(y = h_t(x))}{P(y \neq h_t(x))} = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$, where $P(x_i, y_i) = D_t(i)$ and $\epsilon_t = P(\text{error}) = \sum_{i=1}^{m} D_t(i)\,[y_i \neq h_t(x_i)]$.
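For completeness, the optimization step the slide skips can be written out (a standard calculation, added here for clarity):

\[
\frac{\partial}{\partial \alpha_t}\Big[e^{-\alpha_t} P(y = h_t(x)) + e^{\alpha_t} P(y \neq h_t(x))\Big]
= -e^{-\alpha_t}(1 - \epsilon_t) + e^{\alpha_t}\,\epsilon_t = 0
\;\Longrightarrow\;
e^{2\alpha_t} = \frac{1 - \epsilon_t}{\epsilon_t}
\;\Longrightarrow\;
\alpha_t = \frac{1}{2}\ln\frac{1 - \epsilon_t}{\epsilon_t}.
\]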
How to determine $h_t$ and the distribution $D_{t+1}$?
Final classifier: $H(x) = \operatorname{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$; minimize $\ell_{\exp}(H) = E_{x,y}\!\left[e^{-yH(x)}\right]$, with $H_t(x) = H_{t-1}(x) + \alpha_t h_t(x)$, $H_0(x) = 0$, and $H(x) = H_T(x)$.
$E_{x,y}\!\left[e^{-yH_t(x)}\right] = E_{x,y}\!\left[e^{-yH_{t-1}(x)}\, e^{-y\alpha_t h_t(x)}\right] \approx E_{x,y}\!\left[e^{-yH_{t-1}(x)}\left(1 - y\alpha_t h_t(x) + \tfrac{1}{2}\alpha_t^2 y^2 h_t^2(x)\right)\right]$
$h_t = \arg\min_h E_{x,y}\!\left[e^{-yH_{t-1}(x)}\left(1 - y\alpha_t h(x) + \tfrac{1}{2}\alpha_t^2\right)\right]$, using $y^2 h^2(x) = 1$
$h_t = \arg\min_h E_x\!\left[E_y\!\left[e^{-yH_{t-1}(x)}\left(1 - y\alpha_t h(x) + \tfrac{1}{2}\alpha_t^2\right) \,\middle|\, x\right]\right]$
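A bridging step the slides leave implicit (added for clarity, following the standard derivation): the terms that do not involve $h$ are constant in the minimization, so

\[
h_t = \arg\max_h\, E_x\!\left[E_y\!\left[e^{-yH_{t-1}(x)}\, y\, h(x) \,\middle|\, x\right]\right],
\]

and since $h(x) \in \{-1, +1\}$, the maximum is attained pointwise by choosing $h(x)$ with the same sign as the inner conditional expectation, which gives the expression below.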
The minimizer is
$h_t(x) = \operatorname{sign}\!\left(E_{x,y \sim e^{-yH_{t-1}(x)} P(y|x)}\!\left[\,y \mid x\,\right]\right) = \operatorname{sign}\!\left(P_{x,y \sim e^{-yH_{t-1}(x)} P(y|x)}(y = 1 \mid x) - P_{x,y \sim e^{-yH_{t-1}(x)} P(y|x)}(y = -1 \mid x)\right)$
In other words, at time $t$ the pair $(x, y)$ is effectively drawn from the reweighted distribution $e^{-yH_{t-1}(x)}\, P(y|x)$.
At time $t$: $x, y \sim e^{-yH_{t-1}(x)}\, P(y|x)$
At time $1$: $x, y \sim P(y|x)$, with $P(y_i|x_i) = \frac{1}{m} = D_1(i)$
At time $t+1$: $x, y \sim e^{-yH_t(x)}\, P(y|x) \propto D_t(i)\, e^{-\alpha_t y h_t(x)}$
Hence the distribution update: $D_{t+1}(i) = \frac{D_t(i)\,\exp[-\alpha_t y_i h_t(x_i)]}{Z_t}$, where $Z_t$ is a normalization factor.
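As a quick numeric sanity check of this update (my own illustration, not from the slides): under $D_{t+1}$, the classifier $h_t$ has weighted error exactly 1/2, so the next round is forced to find a different weak classifier. Toy numbers are invented:

# With alpha_t = 1/2 ln((1-eps)/eps), the updated distribution D_{t+1}
# gives h_t a weighted error of exactly 0.5.
import numpy as np

rng = np.random.default_rng(0)
m = 10
y = rng.choice([-1, 1], size=m)              # true labels
h = y.copy(); h[:3] *= -1                    # h_t misclassifies the first 3 examples
D = np.full(m, 1.0 / m)                      # D_t (uniform here for simplicity)

eps = np.sum(D * (h != y))                   # eps_t = 0.3
alpha = 0.5 * np.log((1 - eps) / eps)
D_next = D * np.exp(-alpha * y * h)
D_next /= D_next.sum()                       # Z_t normalization

print(np.sum(D_next * (h != y)))             # prints 0.5 (up to floating point)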
Successful applications of ensembles: the Netflix Prize.
Any questions?