
Bias/Variance Tradeoff

Model Loss (Error)

• Squared loss of model on test case i:

      (Learn(x_i, D) − Truth(x_i))^2

• Expected prediction error:

      E_D[ (Learn(x, D) − Truth(x))^2 ]

Bias/Variance Decomposition

      E_D[ (L(x, D) − T(x))^2 ] = Noise^2 + Bias^2 + Variance

• Noise^2 = lower bound on performance
• Bias^2 = expected error due to model mismatch
• Variance = variation due to train sample and randomization
Bias2

• Low bias
  – linear regression applied to linear data
  – 2nd degree polynomial applied to quadratic data
  – ANN with many hidden units trained to completion
• High bias
  – constant function
  – linear regression applied to non-linear data
  – ANN with few hidden units applied to non-linear data

Variance

• Low variance
  – constant function
  – model independent of training data
  – model depends on stable measures of data
    • mean
    • median
• High variance
  – high degree polynomial
  – ANN with many hidden units trained to completion

Sources of Variance in Supervised Learning

• noise in targets or input attributes
• bias (model mismatch)
• training sample
• randomness in learning algorithm
  – neural net weight initialization
• randomized subsetting of train set:
  – cross validation, train and early stopping set

Bias/Variance Tradeoff

• (bias2 + variance) is what counts for prediction
• Often:
  – low bias => high variance
  – low variance => high bias
• Tradeoff:
  – bias2 vs. variance

Bias/Variance Tradeoff

Figure: Duda, Hart, Stork, “Pattern Classification”, 2nd edition, 2001

Bias/Variance Tradeoff

Figure: Hastie, Tibshirani, Friedman, “Elements of Statistical Learning”, 2001

Reduce Variance Without Increasing Bias

• Averaging reduces variance:

      Var(X̄) = Var(X) / N

• One problem:
  – only one train set
  – where do multiple models come from?

Bagging: Bootstrap Aggregation

• Leo Breiman (1994)
• Bootstrap Sample:
  – draw sample of size |D| with replacement from D
• Train L_i( BootstrapSample_i(D) )
• Average models to reduce model variance

      Regression:     L_bagging = average of the L_i
      Classification: L_bagging = Plurality(L_i)
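
A minimal sketch of the bagging recipe above, under assumptions not stated on the slide (synthetic regression data, scikit-learn decision trees as the base learner L_i): draw bootstrap samples of size |D|, train one model per sample, and average the predictions.

    # Hedged bagging sketch: bootstrap samples + averaged regression trees.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 200)          # noisy 1-D regression data

    def bagged_trees(X, y, n_models=25):
        n, models = len(X), []
        for _ in range(n_models):
            idx = rng.integers(0, n, size=n)               # bootstrap sample: size |D|, with replacement
            models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
        return models

    def bagged_predict(models, X_new):
        # Regression: L_bagging is the average of the individual L_i
        return np.mean([m.predict(X_new) for m in models], axis=0)

    models = bagged_trees(X, y)
    print(bagged_predict(models, np.array([[0.0], [1.5]])))

For classification one would instead take a plurality vote (or average predicted class probabilities) over the models.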

Bagging

• Best case:

      Var(Bagging(L(x, D))) = Variance(L(x, D)) / N

• In practice:
  – models are correlated, so reduction is smaller than 1/N
  – variance of models trained on fewer training cases is usually somewhat larger
  – stable learning methods have low variance to begin with, so bagging may not help much

Bagging Results

Breiman, “Bagging Predictors”, Berkeley Statistics Department TR#421, 1994
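
A quick numeric illustration of the best case and of the correlation caveat above, using synthetic numbers rather than trained models (the variances and the correlation of 0.5 are arbitrary choices for the example):

    # Averaging N independent estimates divides variance by N; correlated
    # estimates (as bagged models tend to be) are reduced less.
    import numpy as np

    rng = np.random.default_rng(0)
    N, trials = 25, 100_000

    independent = rng.normal(0, 1, size=(trials, N))
    print("independent:", independent.mean(axis=1).var())      # close to 1/25 = 0.04

    shared = rng.normal(0, 1, size=(trials, 1))                 # common component -> correlation 0.5
    correlated = np.sqrt(0.5) * shared + np.sqrt(0.5) * rng.normal(0, 1, size=(trials, N))
    print("correlated:", correlated.mean(axis=1).var())         # much larger: about 0.5 + 0.5/25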

How Many Bootstrap Samples?

Breiman, “Bagging Predictors”, Berkeley Statistics Department TR#421, 1994

More bagging results


Bagging with cross validation

• Train neural networks using 4-fold CV
  – Train on 3 folds, earlystop on the fourth
  – At the end you have 4 neural nets
• How to make predictions on new examples?
  – Train a neural network until the mean earlystopping point
  – Average the predictions from the four neural networks (sketched below)
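
A hedged sketch of the second option (average the fold models). It uses scikit-learn's MLPRegressor on synthetic data; note that MLPRegressor's built-in early_stopping holds out its own internal validation split rather than literally early-stopping on the held-out CV fold, so this only approximates the procedure on the slide.

    # Train one network per CV fold, then average the four networks' predictions.
    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(400, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.1, 400)

    fold_models = []
    for train_idx, val_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(X):
        net = MLPRegressor(hidden_layer_sizes=(32,), early_stopping=True,
                           max_iter=2000, random_state=0)
        net.fit(X[train_idx], y[train_idx])                 # val_idx is the held-out fold
        fold_models.append(net)

    X_new = rng.normal(size=(3, 5))
    print(np.mean([m.predict(X_new) for m in fold_models], axis=0))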

Can Bagging Hurt?

• Each base classifier is trained on less data
  – Only about 63.2% of the data points are in any bootstrap sample
    (a given point appears in a bootstrap sample with probability 1 − (1 − 1/|D|)^|D| ≈ 1 − 1/e ≈ 0.632)
• However the final model has seen all the data
  – On average a point will be in >50% of the bootstrap samples

Reduce Bias2 and Decrease Variance?

• Bagging reduces variance by averaging
• Bagging has little effect on bias
• Can we average and reduce bias?
• Yes: Boosting

Boosting

• Freund & Schapire:
  – theory for “weak learners” in late 80s
• Weak Learner: performance on any train set is slightly better than chance prediction
• intended to answer a theoretical question, not as a practical way to improve learning
• tested in mid 90s using not-so-weak learners
• works anyway!

Boosting

• Weight all training samples equally
• Train model on train set
• Compute error of model on train set
• Increase weights on train cases model gets wrong!
• Train new model on re-weighted train set
• Re-compute errors on weighted train set
• Increase weights again on cases model gets wrong
• Repeat until tired (100+ iterations)
• Final model: weighted prediction of each model
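
A concrete, hedged instance of this loop is AdaBoost. The sketch below uses the standard AdaBoost.M1-style weight update with decision stumps on synthetic data; the exact update used in these slides (the 1/(2·Error_i) rule discussed later) differs slightly in form, and the data and number of rounds are assumed.

    # AdaBoost-style boosting sketch: labels in {-1, +1}, decision stumps as weak learners.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 2))
    y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)            # simple nonlinear target

    n, n_rounds = len(X), 50
    w = np.full(n, 1.0 / n)                               # weight all training samples equally
    models, alphas = [], []

    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w[pred != y]) / np.sum(w)            # weighted train-set error
        if err == 0 or err >= 0.5:                        # weak-learner condition violated
            break
        alpha = 0.5 * np.log((1 - err) / err)             # this model's weight in the final vote
        w *= np.exp(-alpha * y * pred)                    # grow weights on cases it gets wrong
        w /= w.sum()
        models.append(stump)
        alphas.append(alpha)

    def boosted_predict(X_new):
        # Final model: weighted (signed) vote of all the models
        return np.sign(sum(a * m.predict(X_new) for a, m in zip(alphas, models)))

    print("train accuracy:", (boosted_predict(X) == y).mean())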

Boosting

• Initialization
• Iteration
• Final Model

Boosting: Initialization

Boosting: Iteration

Boosting: Prediction

Weight updates

• Weights for incorrect instances are multiplied by 1/(2·Error_i)
  – Small train set errors cause weights to grow by several orders of magnitude
• Total weight of misclassified examples is 0.5
• Total weight of correctly classified examples is 0.5

Reweighting vs Resampling

• Example weights might be harder to deal with
  – Some learning methods can’t use weights on examples
  – Many common packages don’t support weights on the train set
• We can resample instead:
  – Draw a bootstrap sample from the data where the probability of drawing each example is proportional to its weight
• Reweighting usually works better but resampling is easier to implement
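
A small sketch of the resampling alternative described above: instead of handing weights to the learner, draw a weighted bootstrap sample. The weights here are random placeholders standing in for boosting weights.

    # Weighted resampling: draw examples with probability proportional to their weight.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 10
    w = rng.random(n)
    w /= w.sum()                                    # boosting weights, normalized to sum to 1

    idx = rng.choice(n, size=n, replace=True, p=w)  # weighted bootstrap sample (indices into the train set)
    print(idx)
    # A learner that cannot accept example weights is then trained on X[idx], y[idx].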

Boosting Performance

Boosting vs. Bagging

• Bagging doesn’t work so well with stable models. Boosting might still help.
• Boosting might hurt performance on noisy datasets. Bagging doesn’t have this problem.
• In practice bagging almost always helps.

Boosting vs. Bagging

• On average, boosting helps more than bagging, but it is also more common for boosting to hurt performance.
• The weights grow exponentially. Code must be written carefully (store log of weights, …).
• Bagging is easier to parallelize.

Bagged Decision Trees

 Draw 100 bootstrap samples of data
 Train trees on each sample -> 100 trees
 Un-weighted average prediction of trees

Average prediction
(0.23 + 0.19 + 0.34 + 0.22 + 0.26 + … + 0.31) / # Trees = 0.24

 Marriage made in heaven. Highly under-rated!
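
An illustrative sketch of the recipe above for classification, with assumed synthetic data: 100 bootstrap samples, one unpruned tree per sample, and an un-weighted average of the trees' predictions (the fraction of trees voting for class 1, in the spirit of the 0.23, 0.19, ... -> 0.24 average shown).

    # Bagged decision trees: 100 bootstrap samples -> 100 trees -> averaged vote.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(400, 3))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    n = len(X)
    trees = []
    for _ in range(100):
        idx = rng.integers(0, n, size=n)                   # one bootstrap sample per tree
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

    x_new = rng.normal(size=(1, 3))
    p = np.mean([t.predict(x_new)[0] for t in trees])      # fraction of trees predicting class 1
    print(f"averaged prediction: {p:.2f}")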

Random Forests (Bagged Trees++)

 Draw 1000+ bootstrap samples of data
 Draw sample of available attributes at each split
 Train trees on each sample/attribute set -> 1000+ trees
 Un-weighted average prediction of trees

Average prediction
(0.23 + 0.19 + 0.34 + 0.22 + 0.26 + … + 0.31) / # Trees = 0.24

Model Averaging

• Almost always helps
• Often easy to do
• Models shouldn’t be too similar
• Models should all have pretty good performance (not too many lemons)
• When averaging, favor low bias, high variance
• Models can individually overfit
• Not just in ML
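
A hedged scikit-learn sketch of the Random Forests slide above: bootstrap samples plus a random subset of attributes considered at each split (max_features), with the forest averaging the trees. The dataset and parameter values are assumptions, not from the slides.

    # Random forest: bagged trees plus per-split attribute subsampling.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))
    y = X[:, 0] - 2 * X[:, 1] + np.sin(X[:, 2]) + rng.normal(0, 0.1, 500)

    forest = RandomForestRegressor(
        n_estimators=1000,        # 1000+ bootstrap samples -> 1000+ trees
        max_features="sqrt",      # random subset of attributes tried at each split
        bootstrap=True,
        oob_score=True,           # out-of-bag estimate (see the last slide)
        random_state=0,
    ).fit(X, y)

    print("OOB R^2 estimate:", forest.oob_score_)
    print(forest.predict(X[:3]))  # un-weighted average prediction of the trees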

Out of Bag Samples

• With bagging, each model is trained on about 63% of the training sample
• That means each model does not use 37% of the data
• Treat these as test points!
  – Backfitting in trees
  – Pseudo cross validation
  – Early stopping sets
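
A hedged sketch of using the out-of-bag points as free test points (pseudo cross validation), with assumed synthetic data and bagged regression trees: each tree is evaluated only on the roughly 37% of examples its bootstrap sample left out.

    # Out-of-bag error estimate for a bagged ensemble.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(300, 1))
    y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 300)

    n, n_models = len(X), 50
    oob_sum = np.zeros(n)        # running sum of OOB predictions per training point
    oob_count = np.zeros(n)      # number of models for which the point was out of bag

    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)                 # bootstrap sample (~63% unique points)
        oob = np.setdiff1d(np.arange(n), idx)            # the ~37% this model never saw
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])
        oob_sum[oob] += tree.predict(X[oob])
        oob_count[oob] += 1

    mask = oob_count > 0
    oob_pred = oob_sum[mask] / oob_count[mask]
    print("OOB mean squared error:", np.mean((oob_pred - y[mask]) ** 2))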

