
• When combining multiple independent and diverse decisions,
each of which is more accurate than random guessing
• Random errors tend to cancel each other out, while correct decisions are
reinforced

• Example
• Assume you have 3 independent and diverse binary
classification methods, each with 0.6 accuracy. What happens
when we combine all of them with majority voting?
• P(majority correct) = C(3,3)*0.6*0.6*0.6 + C(3,2)*0.6*0.6*0.4 = 0.216 + 0.432 = 0.648
• Although classifiers are not really independent in the real world

• Bagging: resample training data

• Random forest

• Boosting: reweight training data + weighted models

• AdaBoost

• Gradient Boosting

• Stacking: blending weak learners


• Bootstrap
• Draw n’ out of n data instances (n’ < n), usually with replacement
• Bootstrap aggregating (bagging)
• Repeat the bootstrap m times
• Train a model on each sampled dataset
• Each model could be a weak learner
• Combine the models to make predictions
• The combination tends to form a strong learner


[Figure: bagging workflow. Training data (size = n) is randomly sampled with replacement into Data 1, Data 2, …, Data m (each of size n’); Learner 1, …, Learner m are trained to produce Model 1, …, Model m, which are combined by majority voting/averaging into the final model]
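
A minimal bagging sketch (an assumed implementation, not the slides' code), using shallow scikit-learn decision trees as the weak learners; m and n’ are illustrative values:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X, y = make_classification(n_samples=500, random_state=0)

    m, n_prime = 25, 400                 # number of bootstrap samples and sample size (assumed)
    models = []
    for _ in range(m):
        idx = rng.choice(len(X), size=n_prime, replace=True)   # bootstrap: sample with replacement
        models.append(DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx]))  # weak learner

    # Majority voting over the m models
    votes = np.stack([model.predict(X) for model in models])   # shape (m, n_samples)
    y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
    print("Bagging training accuracy:", (y_pred == y).mean())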
• Random forest = bagging + randomized feature set
• Build many decision tree classifiers (or regressors)
• Each tree is trained on a subset of the training data (bagging)
• Each tree uses a subset of the features
• Combine the predictions of the trees (e.g., averaging or majority voting)
• Price: more computation per prediction
• However, RF is highly parallelizable during both training and testing

[Figure: a single decision tree trained on the full dataset, shown for comparison]
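
As a sketch of the same idea with an off-the-shelf implementation (assumed usage, not from the slides), scikit-learn's RandomForestClassifier combines bagging with per-split feature subsampling via max_features, and n_jobs=-1 exploits the parallelism noted above:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, random_state=0)

    # Many trees, each trained on a bootstrap sample and restricted to a
    # random subset of features at every split.
    rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                n_jobs=-1, random_state=0)   # n_jobs=-1: train trees in parallel
    print("CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())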
• “Weighted” combination of models
• Assign different weights to different samples
• New models pay more attention to the incorrectly
predicted instances
• Weight each classifier and combine them
• K: number of base learners, specified by the user
• Pseudocode for training:

    Set uniform weights for each instance   # i.e., w_i^0 = 1/n
    for k = 1 to K:
        Train f_k by minimizing the (weighted) error
        Compute the weighted error of the training instances using f_k
        Set α_k, the weight of f_k, based on the weighted error
        Set w_i^k, the weight of each instance, based on the ensemble prediction

• Set the weight of f_k based on the weighted error

    α_k = 0.5 · log((1 − err_k) / err_k)

    err_k = ∑_i w_i^{k−1} · δ(f_k(x_i), y_i), where δ = 1 if f_k(x_i) ≠ y_i and 0 otherwise

[Plot: α_k versus err_k; α_k is positive when err_k < 0.5, zero at err_k = 0.5, and negative when err_k > 0.5]

• Set the weight of each instance based on the ensemble prediction

    w_i^k = w_i^{k−1} · exp(−α_k · y_i · ŷ_i) / Z^k

    Z^k is the normalization term such that ∑_i w_i^k sums to one
• Prediction as a weighted sum of the base learners

    ŷ_i = α_1 · f_1(x_i) + α_2 · f_2(x_i) + ⋯ + α_K · f_K(x_i)

• Predicted class = sign(ŷ_i)
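
A hand-rolled AdaBoost sketch following the pseudocode above (an assumed implementation, not the slides' code); it uses decision stumps as weak learners, and in the weight update the exponent uses the new learner's prediction f_k(x_i), which is equivalent to re-weighting by the cumulative ensemble score:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y01 = make_classification(n_samples=500, random_state=0)
    y = 2 * y01 - 1                        # labels in {-1, +1} so sign() gives the class

    K = 20                                 # number of base learners (assumed value)
    n = len(X)
    w = np.full(n, 1.0 / n)                # uniform instance weights: w_i = 1/n
    learners, alphas = [], []

    for k in range(K):
        # Train f_k by minimizing the weighted error (decision stump as the weak learner)
        f_k = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = f_k.predict(X)

        # Weighted error and learner weight alpha_k = 0.5 * log((1 - err) / err)
        err = np.clip(np.sum(w * (pred != y)), 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)

        # Re-weight instances: misclassified points get larger weights
        w = w * np.exp(-alpha * y * pred)
        w = w / w.sum()                    # normalization so the weights sum to one

        learners.append(f_k)
        alphas.append(alpha)

    # Prediction: sign of the weighted sum of the base learners
    scores = sum(a * f.predict(X) for a, f in zip(alphas, learners))
    print("AdaBoost training accuracy:", (np.sign(scores) == y).mean())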


• Generate ŷ = f_1(x) + f_2(x) + ⋯ + f_K(x) in a step-wise
manner
• Similar to AdaBoost
• In each stage, introduce a weak learner to compensate for the
shortcomings of the existing weak learners
• In AdaBoost, shortcomings are identified by high-weight data
points
• In gradient boosting, shortcomings are identified by gradients


• Given D = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}
• Train a model f_1 to fit D, and let F = f_1
• Train a model f_2 to fit the residuals given the features
• I.e., fitting (x_1, y_1 − F(x_1)), (x_2, y_2 − F(x_2)), …, (x_n, y_n − F(x_n))
• Let F = f_1 + f_2
• Repeat the process to get f_3, f_4, …
• F(x_i) = f_1(x_i) + f_2(x_i) + ⋯
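
A minimal residual-fitting sketch of these steps (assumed, not the slides' code), using small regression trees as the f_k; note that fitting f_1 to the data is the same as fitting the residuals of an initial F = 0:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.tree import DecisionTreeRegressor

    X, y = make_regression(n_samples=500, noise=10.0, random_state=0)

    n_stages = 50                          # number of boosting stages (assumed value)
    F = np.zeros(len(X))                   # current ensemble prediction F(x_i), starting from 0
    models = []

    for _ in range(n_stages):
        residual = y - F                   # shortcomings of the current ensemble
        f_k = DecisionTreeRegressor(max_depth=2).fit(X, residual)   # fit (x_i, y_i - F(x_i))
        models.append(f_k)
        F = F + f_k.predict(X)             # F = f_1 + f_2 + ... + f_k

    print("Training MSE:", np.mean((y - F) ** 2))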
• Loss function: L = (1/2) · ∑_i (y_i − F(x_i))^2
• Algorithm
• Initially, F(x_i) = f_1(x_i)
• Gradient of L with respect to F(x_i):

    ∂L / ∂F(x_i) = (y_i − F(x_i)) · (−1) = F(x_i) − y_i

• Update by gradient descent until the termination condition is met:

    F^{k+1}(x_i) = F^k(x_i) − ∂L / ∂F(x_i)
                 = F^k(x_i) + (y_i − F^k(x_i))
                 = f_1(x_i) + ⋯ + f_k(x_i) + f_{k+1}(x_i)
• For regression with square loss
• The gradient is equivalent to the negative residual
• Updating the model F(x_i) by gradient descent is equivalent to
adding the residual
• Using the concept of “gradient” instead of “residuals”
allows us to consider other loss functions

• If F = f_1 + ⋯ + f_m, we
• fit a function f_{m+1} to the negative gradient, and
• set F = f_1 + ⋯ + f_m + f_{m+1}
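
A short sketch of this generalization (assumed, not from the slides): only the pseudo-target that f_{m+1} fits changes with the loss; for absolute loss the negative gradient is sign(y_i − F(x_i)) rather than the residual itself. The standardization of y and the 0.1 step size are illustrative choices to keep the fixed-step update sensible:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.tree import DecisionTreeRegressor

    def neg_gradient(y, F, loss="squared"):
        # Negative gradient of the loss with respect to F(x_i)
        if loss == "squared":              # L = 1/2 * sum (y_i - F(x_i))^2  ->  y_i - F(x_i)
            return y - F
        if loss == "absolute":             # L = sum |y_i - F(x_i)|          ->  sign(y_i - F(x_i))
            return np.sign(y - F)
        raise ValueError(loss)

    X, y = make_regression(n_samples=500, noise=10.0, random_state=0)
    y = (y - y.mean()) / y.std()           # standardize the target for this sketch
    F = np.zeros(len(X))
    for _ in range(50):
        target = neg_gradient(y, F, loss="absolute")       # pseudo-target for f_{m+1}
        f_new = DecisionTreeRegressor(max_depth=2).fit(X, target)
        F = F + 0.1 * f_new.predict(X)                     # small fixed step size (assumed)
    print("Training MAE:", np.mean(np.abs(y - F)))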


• A decision tree classifier (or regressor) builds one complex
tree to make decisions
• Gradient boosting builds many simple trees and makes
decisions based on an ensemble of the trees
• Random forest builds many trees; these trees are
independent of each other
• Gradient boosting builds trees one by one; each new
tree tries to “correct” the predictions of the previous trees
[Figure: stacking. Model 1, Model 2, …, Model m are trained with features and labels from training data part 1; their predictions ŷ_1, ŷ_2, …, ŷ_m are treated as the input features to the ensemble model, which is trained with labels from training data part 2 and produces the final output]
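
A minimal stacking sketch matching the two-part split in the figure (an assumed implementation; scikit-learn's StackingClassifier offers a cross-validated variant): base learners are trained on part 1, and a simple ensemble (meta) model is trained on their predictions with labels from part 2:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, random_state=0)
    # Part 1 trains the base learners; part 2 trains the ensemble (meta) model.
    X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=0)

    base_learners = [DecisionTreeClassifier(max_depth=3, random_state=0),
                     KNeighborsClassifier(),
                     LogisticRegression(max_iter=1000)]
    for model in base_learners:
        model.fit(X1, y1)                              # training data part 1

    # Treat the base learners' outputs y_hat_1 ... y_hat_m as input features.
    meta_features = np.column_stack([m.predict_proba(X2)[:, 1] for m in base_learners])
    ensemble_model = LogisticRegression().fit(meta_features, y2)   # labels from part 2
    print("Ensemble model training accuracy:", ensemble_model.score(meta_features, y2))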
• Ensemble to improve the base learners

• Bagging: resample training data

• Random forest

• Boosting: iteratively create new models to compensate for the
old models

• AdaBoost, gradient boosting

• Stacking: blending weak learners
