Ensemble learning
Gergely Lukács
Pázmány Péter Catholic University
Faculty of Information Technology
Budapest, Hungary
lukacs@itk.ppke.hu
Ensemble learning
Combining multiple models
The basic idea
Bias-variance decomposition
Bagging
Bagging with costs
Randomization
Rotation forests
Boosting
AdaBoost, the power of boosting
Interpretable ensembles
Option trees, alternating decision trees
Stacking
2
Combining multiple models
• Basic idea of “meta” learning schemes:
build different “experts” and let them vote
• Advantage:
– often improves predictive performance
• Disadvantage:
– usually produces output that is very hard to
analyze
– but: there are approaches that aim to produce a
single comprehensible structure
3
Bias-variance decomposition
• Imagine several different, but equally good, training data sets
• A learning algorithm is biased for a particular input if,
when trained on each of these data sets, it is
systematically incorrect when predicting the correct output for that input
– the machine learning method doesn’t (perfectly) match
the problem
• a learning algorithm has high variance for a
particular input if it predicts different output values
when trained on different training sets
4
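The decomposition can be made concrete by simulation. The sketch below (not part of the original slides) draws many training sets from the same source, fits the same learner to each, and measures the systematic error (squared bias) and the spread of predictions (variance) on a fixed test grid; the data-generating function, the scikit-learn tree learner and all sample sizes are illustrative assumptions.

  # Illustrative simulation: estimate bias and variance of a learner by
  # training it on many data sets drawn from the same source.
  import numpy as np
  from sklearn.tree import DecisionTreeRegressor

  rng = np.random.default_rng(0)

  def true_f(x):
      return np.sin(3 * x)

  def sample_training_set(n=50):
      x = rng.uniform(0, 2, size=(n, 1))
      y = true_f(x).ravel() + rng.normal(scale=0.3, size=n)
      return x, y

  x_test = np.linspace(0, 2, 100).reshape(-1, 1)
  predictions = []
  for _ in range(200):                                   # many "equally good" training sets
      x_train, y_train = sample_training_set()
      model = DecisionTreeRegressor(max_depth=3).fit(x_train, y_train)
      predictions.append(model.predict(x_test))
  predictions = np.array(predictions)

  bias_sq = np.mean((predictions.mean(axis=0) - true_f(x_test).ravel()) ** 2)  # systematic error
  variance = np.mean(predictions.var(axis=0))            # spread across training sets
  print(f"bias^2 = {bias_sq:.3f}  variance = {variance:.3f}")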
Bagging
• Combining predictions by voting/averaging
– Simplest way
– Each model receives equal weight
• “Idealized” version:
– Sample several training sets of size n
(instead of just having one training set of size n)
– Build a classifier for each training set
– Combine the classifiers’ predictions
• If the learning scheme is unstable,
bagging almost always improves performance
– Small change in training data can make a big change in the model
(e.g. decision trees)
5
More on bagging
• Bagging reduces variance by
voting/averaging, thus reducing the overall
expected error
• Problem: we only have one dataset!
• Solution: generate new datasets of size n
by sampling with replacement from original
dataset
– Bagging = bootstrap aggregating
6
Bagging classifiers
model generation
  Let n be the number of instances in the training data.
  For each of t iterations:
    Sample n instances with replacement from training set.
    Apply the learning algorithm to the sample.
    Store the resulting model.
classification
  For each of the t models:
    Predict class of instance using model.
  Return class that has been predicted most often.
7
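A minimal Python sketch of this pseudocode, assuming NumPy arrays X, y and scikit-learn decision trees as the base learner; function names and parameter values are illustrative.

  # Minimal sketch of the bagging pseudocode above.
  import numpy as np
  from collections import Counter
  from sklearn.tree import DecisionTreeClassifier

  def bagging_fit(X, y, t=25, random_state=0):
      """Model generation: t classifiers, each built on a bootstrap sample."""
      rng = np.random.default_rng(random_state)
      n = len(X)
      models = []
      for _ in range(t):
          idx = rng.integers(0, n, size=n)        # sample n instances with replacement
          models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
      return models

  def bagging_predict(models, x):
      """Classification: return the class predicted most often."""
      votes = [m.predict(x.reshape(1, -1))[0] for m in models]
      return Counter(votes).most_common(1)[0][0]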
More on bagging
• Theoretically, bagging would reduce variance without
changing bias
– In practice, bagging can reduce both bias and variance
• For high-bias classifiers, it can reduce bias
• For high-variance classifiers, it can reduce variance
– In the case of classification there are pathological
situations where the overall error might increase
– Usually, the more classifiers the better
– Can help a lot if data is noisy
• Can also be applied to numeric prediction
8
Bagging with costs
• Bagging produces good probability estimates when,
instead of voting, the individual classifiers'
probability estimates are averaged
• With a cost matrix, these averaged estimates can be used to make
minimum-expected-cost predictions
9
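A possible sketch of this cost-sensitive variant, assuming base models with a predict_proba method (e.g. the bagged trees from the previous sketch), classes encoded as 0..K-1, and a user-supplied cost matrix.

  # Average the members' probability estimates and pick the class with
  # minimum expected cost.
  import numpy as np

  def predict_min_expected_cost(models, x, cost_matrix):
      """cost_matrix[i, j] = cost of predicting class j when the true class is i."""
      probs = np.mean([m.predict_proba(x.reshape(1, -1))[0] for m in models], axis=0)
      expected_cost = probs @ cost_matrix         # expected cost of predicting each class
      return int(np.argmin(expected_cost))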
Randomization
Can randomize learning algorithm instead of input
Some algorithms already have a random
component: e.g. initial weights in neural net
Most algorithms can be randomized, e.g. greedy
algorithms:
Pick from the N best options at random instead of always taking the single best
More generally applicable than bagging: e.g.
random subsets in nearest-neighbor scheme
Can be combined with bagging
10
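An illustrative sketch of the random-subset idea for a nearest-neighbour scheme; the subset size, ensemble size and the scikit-learn base learner are assumptions, not part of the slides.

  # Random-subspace ensemble: each member sees a random subset of attributes.
  import numpy as np
  from collections import Counter
  from sklearn.neighbors import KNeighborsClassifier

  def random_subspace_fit(X, y, t=25, n_features=None, random_state=0):
      rng = np.random.default_rng(random_state)
      d = X.shape[1]
      n_features = n_features or max(1, d // 2)   # illustrative subset size
      members = []
      for _ in range(t):
          feats = rng.choice(d, size=n_features, replace=False)
          members.append((feats, KNeighborsClassifier().fit(X[:, feats], y)))
      return members

  def random_subspace_predict(members, x):
      votes = [m.predict(x[feats].reshape(1, -1))[0] for feats, m in members]
      return Counter(votes).most_common(1)[0][0]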
Rotation forests
Bagging creates ensembles of accurate classifiers with
relatively low diversity
Bootstrap sampling creates training sets with a distribution similar to that of the original data
11
Rotation forests
Combine random attribute sets, bagging and principal
components to generate an ensemble of decision trees
An iteration involves
Randomly dividing the input attributes into k disjoint
subsets
Applying PCA to each of the k subsets in turn
Learning a decision tree from the data projected onto the resulting principal directions
12
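A rough sketch of a single rotation-forest iteration under the description above: split the attributes into k disjoint subsets, run PCA on each, assemble the principal directions into one rotation matrix, and learn a tree on the rotated data. The bootstrap sampling and per-subset sampling details of the published algorithm are omitted here.

  # One simplified rotation-forest iteration.
  import numpy as np
  from sklearn.decomposition import PCA
  from sklearn.tree import DecisionTreeClassifier

  def rotation_iteration(X, y, k=3, random_state=0):
      rng = np.random.default_rng(random_state)
      d = X.shape[1]
      subsets = np.array_split(rng.permutation(d), k)   # k disjoint attribute subsets
      rotation = np.zeros((d, d))
      for subset in subsets:
          pca = PCA().fit(X[:, subset])                 # principal directions of this subset
          rotation[np.ix_(subset, subset)] = pca.components_.T
      tree = DecisionTreeClassifier().fit(X @ rotation, y)
      return rotation, tree                             # predict with tree.predict(X_new @ rotation)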
Boosting
• Combining models is beneficial if
– Models significantly different from each other
– Each model performs well for a reasonable
percentage of data
• Boosting: explicitly seeks models that
complement one another
13
Boosting 2
• Similar to Bagging
– Voting (classification), averaging (numeric prediction)
– Models of the same type (e.g. decision trees)
• Different from bagging
– Iterative, models not built separately
– New model is encouraged to become expert for instances
classified incorrectly by earlier models
– Intuitive justification: models should be experts that
complement each other
– Boosting weights a model’s contribution by its confidence
• There are several variants of this algorithm
14
Boosting illustration
Classifier 1
15
Boosting illustration
Weights
Increased
16
Boosting illustration
Classifier 2
17
Boosting illustration
Weights
Increased
18
Boosting illustration
Classifier 3
19
Boosting illustration
Final classifier is
a combination of
single classifiers
20
AdaBoost.M1
Model Generation
  Assign equal weight to each training instance.
  For each of t iterations:
    Apply learning algorithm to weighted dataset and store resulting model.
    Compute error e of model on weighted dataset and store error.
    If e equal to zero, or e greater or equal to 0.5:
      Terminate model generation.
    For each instance in dataset:
      If instance classified correctly by model:
        Multiply weight of instance by e / (1 - e).
    Normalize weights of all instances.
Classification
  Assign weight of zero to all classes.
  For each of the t (or fewer) models:
    Add -log(e / (1 - e)) to weight of class predicted by model.
  Return class with highest weight.
21
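A direct Python sketch of the AdaBoost.M1 pseudocode above, assuming NumPy arrays X, y and scikit-learn decision stumps as the weak learner; names and the choice of weak learner are illustrative.

  # Direct sketch of AdaBoost.M1.
  import numpy as np
  from sklearn.tree import DecisionTreeClassifier

  def adaboost_m1_fit(X, y, t=20):
      n = len(X)
      weights = np.full(n, 1.0 / n)                 # equal weight to each training instance
      models, errors = [], []
      for _ in range(t):
          model = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=weights)
          wrong = model.predict(X) != y
          e = weights[wrong].sum() / weights.sum()  # weighted error on the dataset
          if e == 0 or e >= 0.5:                    # terminate model generation
              break
          models.append(model)
          errors.append(e)
          weights[~wrong] *= e / (1 - e)            # shrink weights of correctly classified instances
          weights /= weights.sum()                  # renormalize
      return models, errors

  def adaboost_m1_predict(models, errors, x, classes):
      class_weight = dict.fromkeys(classes, 0.0)
      for model, e in zip(models, errors):
          pred = model.predict(x.reshape(1, -1))[0]
          class_weight[pred] += -np.log(e / (1 - e))
      return max(class_weight, key=class_weight.get)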
Comments on AdaBoost.M1
During model generation:
– Weights are renormalized so that their sum stays
the same as before:
weight ← weight * (sum of old weights) / (sum of new weights)
During classification:
– weight = -log(e / (1 - e)), thus
– e close to 0 => high weight
– e close to 0.5 => low weight
22
More on boosting I
Boosting needs weights … but
Can adapt learning algorithm ... or
Can apply boosting without weights
resample with probability determined by weights
disadvantage: not all instances are used
advantage: if error > 0.5, can resample again
Stems from computational learning theory
Theoretical result:
training error decreases exponentially
Also:
works if base classifiers are not too complex, and
their error doesn’t become too large too quickly
23
More on boosting II
Continue boosting after training error = 0?
Puzzling fact:
generalization error continues to decrease!
Seems to contradict Occam’s Razor
Explanation:
consider margin (confidence), not error
Difference between estimated probability for true class and
nearest other class (between –1 and 1)
Boosting works with weak learners
only condition: error doesn’t exceed 0.5
In practice, boosting sometimes overfits (in contrast to bagging)
24
Additive regression I
Boosting: greedy algorithm for fitting additive models
Boosting: forward stagewise additive modeling
Starts with an empty ensemble
Incorporates new members sequentially
Each new member maximizes the predictive performance of the ensemble as a
whole (without changing previous members)
Same kind of algorithm for numeric prediction:
– Build a standard regression model (e.g. a tree)
– Gather the residuals, learn a model predicting the residuals (e.g. a tree),
and repeat
To predict, simply sum up individual predictions from all
models
25
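A minimal sketch of this residual-fitting procedure, assuming scikit-learn regression trees as the base model; tree depth and the number of stages are illustrative and would normally be chosen by cross-validation.

  # Forward stagewise additive regression: each new tree is fitted to the
  # residuals of the ensemble built so far.
  import numpy as np
  from sklearn.tree import DecisionTreeRegressor

  def additive_regression_fit(X, y, stages=10, max_depth=2):
      models, residual = [], y.astype(float)
      for _ in range(stages):
          model = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
          residual = residual - model.predict(X)    # what the ensemble still gets wrong
          models.append(model)
      return models

  def additive_regression_predict(models, X):
      return np.sum([m.predict(X) for m in models], axis=0)   # sum of individual predictions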
Additive regression II
Minimizes squared error of ensemble if base learner minimizes
squared error
Doesn't make sense to use it with standard (multiple, i.e. multi-
attribute) linear regression:
the sum of linear regression models is again a linear regression model
Can use it with simple linear regression – regression based on a
single attribute – to build a multiple linear regression model
Use cross-validation to decide when to stop
26
Option trees
Ensembles are not interpretable
Can we generate a single model?
One possibility: “cloning” the ensemble by using lots of artificial data that is labelled by the ensemble
27
Example
Can be learned by modifying tree learner:
Create option node if there are several equally promising splits
(within user-specified interval)
When pruning, error at option node is average error of options
28
Alternating decision trees
Can also grow option tree by incrementally adding nodes to it
Structure called alternating decision tree, with splitter nodes
and prediction nodes
Prediction nodes are leaves if no splitter nodes have been added to them (yet)
29
More on stacking
• If base learners can output probabilities it’s better to use
those as input to meta learner
• Which algorithm to use to generate meta learner?
– In principle, any learning scheme can be applied
– David Wolpert: “relatively global, smooth” model
• Base learners do most of the work
• Reduces risk of overfitting
• Stacking can also be applied to numeric prediction
32
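A minimal stacking sketch, assuming scikit-learn base learners: out-of-fold class probabilities of the level-0 models become the input attributes of a simple ("relatively global, smooth") level-1 model, here logistic regression. The particular learners and the 5-fold split are illustrative choices.

  # Stacking: level-0 probabilities as inputs to a level-1 meta learner.
  import numpy as np
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_predict
  from sklearn.naive_bayes import GaussianNB
  from sklearn.tree import DecisionTreeClassifier

  def stacking_fit(X, y):
      base = [DecisionTreeClassifier(max_depth=3), GaussianNB()]
      # out-of-fold predictions avoid leaking training labels to the meta learner
      meta_X = np.hstack([cross_val_predict(m, X, y, cv=5, method="predict_proba")
                          for m in base])
      meta = LogisticRegression(max_iter=1000).fit(meta_X, y)
      base = [m.fit(X, y) for m in base]            # refit level-0 models on all the data
      return base, meta

  def stacking_predict(base, meta, X):
      meta_X = np.hstack([m.predict_proba(X) for m in base])
      return meta.predict(meta_X)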
Comparison
Bagging: Data 1, Data 2, …, Data m – independent bootstrap samples of the training data
Boosting: Data 1 = Data 2 = … = Data m – the same data, reweighted in every iteration
Stacking: different learning algorithms, combined by a meta learner
33