
• When combining multiple independent and diverse decisions,
each of which is more accurate than random guessing
• Random errors tend to cancel each other out, while correct decisions are
reinforced

• Example
• Assume you have 3 independent and diverse binary
classification methods, each with 0.6 accuracy. What happens
when we combine all of them with majority voting?
• P(majority correct) = C(3,3)*0.6*0.6*0.6 + C(3,2)*0.6*0.6*0.4 = 0.216 + 0.432 = 0.648
• Although classifiers are not really independent in the real world

• Bagging: resample training data

• Random forest

• Boosting: reweight training data + weighted models

• AdaBoost

• Gradient Boosting

• Stacking: blending weak learners


• Bootstrap
• Draw n’ out of n data instances (n’ < n), usually with replacement
• Bootstrap aggregating (bagging)
• Repeat the bootstrap m times
• Train a model on each sampled dataset
• Each model could be a weak learner
• Combine the models to make predictions
• The combination tends to form a strong learner


[Figure: bagging workflow. Training data (size = n) is randomly sampled with replacement into Data 1, Data 2, …, Data m (each of size n’); Learner 1, …, Learner m are trained to produce Model 1, …, Model m, which are combined by majority voting/averaging into the final model]
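
A minimal bagging sketch (an assumed implementation, not the slides' code), using shallow scikit-learn decision trees as the weak learners; m and n’ are illustrative values:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X, y = make_classification(n_samples=500, random_state=0)

    m, n_prime = 25, 400                 # number of bootstrap samples and sample size (assumed)
    models = []
    for _ in range(m):
        idx = rng.choice(len(X), size=n_prime, replace=True)   # bootstrap: sample with replacement
        models.append(DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx]))  # weak learner

    # Majority voting over the m models
    votes = np.stack([model.predict(X) for model in models])   # shape (m, n_samples)
    y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
    print("Bagging training accuracy:", (y_pred == y).mean())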
• Random forest = bagging + randomized feature set
• Build many decision tree classifiers (or regressors)
• Each tree is trained on a subset of the training data (bagging)
• Each tree uses a subset of the features
• Combine the predictions of the trees (e.g., averaging or majority voting)
• Price: more computation per prediction
• However, RF is highly parallelizable during both training and testing

[Figure: a single decision tree trained on the full dataset, shown for comparison]
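
As a sketch of the same idea with an off-the-shelf implementation (assumed usage, not from the slides), scikit-learn's RandomForestClassifier combines bagging with per-split feature subsampling via max_features, and n_jobs=-1 exploits the parallelism noted above:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, random_state=0)

    # Many trees, each trained on a bootstrap sample and restricted to a
    # random subset of features at every split.
    rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                n_jobs=-1, random_state=0)   # n_jobs=-1: train trees in parallel
    print("CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())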
• “Weighted” combination of models
• Assign different weights to different samples
• New models pay more attention to the incorrectly
predicted instances
• Weight each classifier and combine them
• K: number of base learners, specified by the user
• Pseudocode for training:

    Set uniform weights for each instance   # i.e., w_i^0 = 1/n
    for k = 1 to K:
        Train f_k by minimizing the (weighted) error
        Compute the weighted error of the training instances using f_k
        Set α_k, the weight of f_k, based on the weighted error
        Set w_i^k, the weight of each instance, based on the ensemble prediction

• Set the weight of f_k based on the weighted error

    α_k = 0.5 · log((1 − err_k) / err_k)

    err_k = ∑_i w_i^{k−1} · δ(f_k(x_i), y_i), where δ = 1 if f_k(x_i) ≠ y_i and 0 otherwise

[Plot: α_k versus err_k; α_k is positive when err_k < 0.5, zero at err_k = 0.5, and negative when err_k > 0.5]

• Set the weight of each instance based on the ensemble prediction

    w_i^k = w_i^{k−1} · exp(−α_k · y_i · ŷ_i) / Z^k

    Z^k is the normalization term such that ∑_i w_i^k sums to one
• Prediction as a weighted sum of the base learners

    ŷ_i = α_1 · f_1(x_i) + α_2 · f_2(x_i) + ⋯ + α_K · f_K(x_i)

• Predicted class = sign(ŷ_i)
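
A hand-rolled AdaBoost sketch following the pseudocode above (an assumed implementation, not the slides' code); it uses decision stumps as weak learners, and in the weight update the exponent uses the new learner's prediction f_k(x_i), which is equivalent to re-weighting by the cumulative ensemble score:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y01 = make_classification(n_samples=500, random_state=0)
    y = 2 * y01 - 1                        # labels in {-1, +1} so sign() gives the class

    K = 20                                 # number of base learners (assumed value)
    n = len(X)
    w = np.full(n, 1.0 / n)                # uniform instance weights: w_i = 1/n
    learners, alphas = [], []

    for k in range(K):
        # Train f_k by minimizing the weighted error (decision stump as the weak learner)
        f_k = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = f_k.predict(X)

        # Weighted error and learner weight alpha_k = 0.5 * log((1 - err) / err)
        err = np.clip(np.sum(w * (pred != y)), 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)

        # Re-weight instances: misclassified points get larger weights
        w = w * np.exp(-alpha * y * pred)
        w = w / w.sum()                    # normalization so the weights sum to one

        learners.append(f_k)
        alphas.append(alpha)

    # Prediction: sign of the weighted sum of the base learners
    scores = sum(a * f.predict(X) for a, f in zip(alphas, learners))
    print("AdaBoost training accuracy:", (np.sign(scores) == y).mean())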


• Generate ŷ = f_1(x) + f_2(x) + ⋯ + f_K(x) in a step-wise
manner
• Similar to AdaBoost
• In each stage, introduce a weak learner to compensate for the
shortcomings of the existing weak learners
• In AdaBoost, shortcomings are identified by high-weight data
points
• In gradient boosting, shortcomings are identified by gradients


• Given D = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}
• Train a model f_1 to fit D, and let F = f_1
• Train a model f_2 to fit the residuals given the features
• I.e., fitting (x_1, y_1 − F(x_1)), (x_2, y_2 − F(x_2)), …, (x_n, y_n − F(x_n))
• Let F = f_1 + f_2
• Repeat the process to get f_3, f_4, …
• F(x_i) = f_1(x_i) + f_2(x_i) + ⋯
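
A minimal residual-fitting sketch of these steps (assumed, not the slides' code), using small regression trees as the f_k; note that fitting f_1 to the data is the same as fitting the residuals of an initial F = 0:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.tree import DecisionTreeRegressor

    X, y = make_regression(n_samples=500, noise=10.0, random_state=0)

    n_stages = 50                          # number of boosting stages (assumed value)
    F = np.zeros(len(X))                   # current ensemble prediction F(x_i), starting from 0
    models = []

    for _ in range(n_stages):
        residual = y - F                   # shortcomings of the current ensemble
        f_k = DecisionTreeRegressor(max_depth=2).fit(X, residual)   # fit (x_i, y_i - F(x_i))
        models.append(f_k)
        F = F + f_k.predict(X)             # F = f_1 + f_2 + ... + f_k

    print("Training MSE:", np.mean((y - F) ** 2))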
• Loss function: L = (1/2) · ∑_i (y_i − F(x_i))^2
• Algorithm
• Initially, F(x_i) = f_1(x_i)
• Gradient of L with respect to F(x_i):

    ∂L / ∂F(x_i) = (y_i − F(x_i)) · (−1) = F(x_i) − y_i

• Update by gradient descent until the termination condition is met:

    F^{k+1}(x_i) = F^k(x_i) − ∂L / ∂F(x_i)
                 = F^k(x_i) + (y_i − F^k(x_i))
                 = f_1(x_i) + ⋯ + f_k(x_i) + f_{k+1}(x_i)
• For regression with square loss
• The gradient is equivalent to the negative residual
• Updating the model F(x_i) by gradient descent is equivalent to
adding the residual
• Using the concept of “gradient” instead of “residuals”
allows us to consider other loss functions

• If F = f_1 + ⋯ + f_m, we
• fit a function f_{m+1} to the negative gradient, and
• set F = f_1 + ⋯ + f_m + f_{m+1}
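
A short sketch of this generalization (assumed, not from the slides): only the pseudo-target that f_{m+1} fits changes with the loss; for absolute loss the negative gradient is sign(y_i − F(x_i)) rather than the residual itself. The standardization of y and the 0.1 step size are illustrative choices to keep the fixed-step update sensible:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.tree import DecisionTreeRegressor

    def neg_gradient(y, F, loss="squared"):
        # Negative gradient of the loss with respect to F(x_i)
        if loss == "squared":              # L = 1/2 * sum (y_i - F(x_i))^2  ->  y_i - F(x_i)
            return y - F
        if loss == "absolute":             # L = sum |y_i - F(x_i)|          ->  sign(y_i - F(x_i))
            return np.sign(y - F)
        raise ValueError(loss)

    X, y = make_regression(n_samples=500, noise=10.0, random_state=0)
    y = (y - y.mean()) / y.std()           # standardize the target for this sketch
    F = np.zeros(len(X))
    for _ in range(50):
        target = neg_gradient(y, F, loss="absolute")       # pseudo-target for f_{m+1}
        f_new = DecisionTreeRegressor(max_depth=2).fit(X, target)
        F = F + 0.1 * f_new.predict(X)                     # small fixed step size (assumed)
    print("Training MAE:", np.mean(np.abs(y - F)))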


• A decision tree classifier (or regressor) builds one complex
tree to make decisions
• Gradient boosting builds many simple trees and makes
decisions based on an ensemble of the trees
• Random forest builds many trees; these trees are
independent of each other
• Gradient boosting builds trees one by one; each new
tree tries to “correct” the predictions of the previous trees
[Figure: stacking. Model 1, Model 2, …, Model m are trained with features and labels from training data part 1; their predictions ŷ_1, ŷ_2, …, ŷ_m are treated as the input features to the ensemble model, which is trained with labels from training data part 2 and produces the final output]
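
A minimal stacking sketch matching the two-part split in the figure (an assumed implementation; scikit-learn's StackingClassifier offers a cross-validated variant): base learners are trained on part 1, and a simple ensemble (meta) model is trained on their predictions with labels from part 2:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, random_state=0)
    # Part 1 trains the base learners; part 2 trains the ensemble (meta) model.
    X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=0)

    base_learners = [DecisionTreeClassifier(max_depth=3, random_state=0),
                     KNeighborsClassifier(),
                     LogisticRegression(max_iter=1000)]
    for model in base_learners:
        model.fit(X1, y1)                              # training data part 1

    # Treat the base learners' outputs y_hat_1 ... y_hat_m as input features.
    meta_features = np.column_stack([m.predict_proba(X2)[:, 1] for m in base_learners])
    ensemble_model = LogisticRegression().fit(meta_features, y2)   # labels from part 2
    print("Ensemble model training accuracy:", ensemble_model.score(meta_features, y2))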
• Ensemble to improve the base learners

• Bagging: resample training data

• Random forest

• Boosting: iteratively create new models to compensate for the
old models

• AdaBoost, gradient boosting

• Stacking: blending weak learners
