
Data Mining

Ensemble learning
Gergely Lukács
Pázmány Péter Catholic University
Faculty of Information Technology
Budapest, Hungary
lukacs@itk.ppke.hu
Ensemble learning

• Combining multiple models
  – The basic idea
• Bias-variance decomposition
• Bagging
  – Bagging with costs
• Randomization
  – Rotation forests
• Boosting
  – AdaBoost, the power of boosting
• Interpretable ensembles
  – Option trees, alternating decision trees
• Stacking
Combining multiple models
• Basic idea of “meta” learning schemes:
build different “experts” and let them vote
• Advantage:
– often improves predictive performance
• Disadvantage:
– usually produces output that is very hard to
analyze
– but: there are approaches that aim to produce a
single comprehensible structure
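As an illustration of the basic idea, here is a minimal sketch (assuming scikit-learn; the three base learners are an arbitrary choice, not prescribed by the slides) in which three different "experts" are combined by majority vote:

from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# three different "experts"
experts = [
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("nb", GaussianNB()),
    ("knn", KNeighborsClassifier()),
]
# voting="hard": each expert casts one vote, the majority class wins
ensemble = VotingClassifier(estimators=experts, voting="hard")
# ensemble.fit(X_train, y_train); ensemble.predict(X_test)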
Bias-variance decomposition
• Imagine several different, but equally good, training data sets
• A learning algorithm is biased for a particular input if, when
  trained on each of these data sets, it is systematically incorrect
  when predicting the correct output for that input
  – the machine learning method doesn’t (perfectly) match the problem
• A learning algorithm has high variance for a particular input if it
  predicts different output values when trained on different
  training sets
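To make the decomposition concrete, a small simulation sketch (an illustration of my own, not from the slides, assuming scikit-learn and a numeric-prediction setting where bias and variance are easiest to measure): the same tree learner is trained on many training sets drawn from the same source, and we inspect its predictions at one fixed input.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def true_f(x):
    return np.sin(3 * x)                       # assumed "true" target function

rng = np.random.default_rng(0)
x_test = np.array([[0.5]])                     # one fixed test input
preds = []
for _ in range(200):                           # 200 equally good training sets
    X = rng.uniform(-1, 1, size=(50, 1))
    y = true_f(X[:, 0]) + rng.normal(0, 0.3, 50)   # noisy observations
    model = DecisionTreeRegressor(max_depth=2).fit(X, y)
    preds.append(model.predict(x_test)[0])

preds = np.array(preds)
bias_sq = (preds.mean() - true_f(0.5)) ** 2    # systematic error at this input
variance = preds.var()                         # spread across training sets
print(f"bias^2 = {bias_sq:.3f}, variance = {variance:.3f}")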
Bagging
• Combining predictions by voting/averaging
– Simplest way
– Each model receives equal weight
• “Idealized” version:
– Sample several training sets of size n
(instead of just having one training set of size n)
– Build a classifier for each training set
– Combine the classifiers’ predictions
• If the learning scheme is unstable, bagging
  almost always improves performance
  – Small change in the training data can make a big change in the
    model (e.g. decision trees)
More on bagging
• Bagging reduces variance by
voting/averaging, thus reducing the overall
expected error
• Problem: we only have one dataset!
• Solution: generate new datasets of size n
by sampling with replacement from original
dataset
– Bagging = Bootstrap aggregating
Bagging classifiers

model generation
  Let n be the number of instances in the training data.
  For each of t iterations:
    Sample n instances with replacement from training set.
    Apply the learning algorithm to the sample.
    Store the resulting model.

classification
  For each of the t models:
    Predict class of instance using model.
  Return class that has been predicted most often.
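The same procedure as a runnable Python sketch (assuming scikit-learn decision trees as the learning algorithm, numpy arrays as input and integer class labels):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, t=25, seed=0):
    # model generation: t models, each built on a bootstrap sample of size n
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(t):
        idx = rng.integers(0, n, size=n)     # sample n instances with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # classification: return the class predicted most often (majority vote)
    votes = np.array([m.predict(X) for m in models])        # shape (t, n_test)
    # assumes integer class labels 0..k-1
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])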
More on bagging
• Theoretically, bagging would reduce variance without
changing bias
– In practice, bagging can reduce both bias and variance
• For high-bias classifiers, it can reduce bias
• For high-variance classifiers, it can reduce variance
– In the case of classification there are pathological
situations where the overall error might increase
– Usually, the more classifiers the better
– Can help a lot if data is noisy
• Can also be applied to numeric prediction

Bagging with costs
• Bagging produces good probability estimates
  – instead of voting, the individual classifiers' probability
    estimates are averaged
• Can use this with the minimum expected cost approach for
  learning problems with costs
• MetaCost re-labels the training data using bagging with costs and
  then builds a single tree
  – produces a model that can be interpreted!
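A sketch of the minimum expected cost step (the 2-class cost matrix is a made-up example; the models are assumed to be bagged classifiers that all expose predict_proba with the same class ordering, e.g. those from the earlier bagging sketch):

import numpy as np

# Hypothetical cost matrix: cost[i, j] = cost of predicting class j when the
# true class is i (here, missing class 1 is five times worse than the reverse).
cost = np.array([[0.0, 1.0],
                 [5.0, 0.0]])

def min_expected_cost_predict(models, X):
    # average the individual classifiers' probability estimates ...
    probs = np.mean([m.predict_proba(X) for m in models], axis=0)  # shape (n, 2)
    # ... then pick the class with minimum expected cost instead of voting
    expected = probs @ cost     # expected[i, j] = expected cost of predicting j
    return expected.argmin(axis=1)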
Randomization
• Can randomize the learning algorithm instead of the input
• Some algorithms already have a random component:
  e.g. initial weights in a neural net
• Most algorithms can be randomized, e.g. greedy algorithms:
  – pick from the N best options at random instead of always
    picking the best option
  – e.g.: attribute selection in decision trees
• More generally applicable than bagging:
  e.g. random subsets in a nearest-neighbor scheme
• Can be combined with bagging
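For instance, the random-subset nearest-neighbor idea can be sketched with scikit-learn's BaggingClassifier used purely as a randomizer, with bootstrap sampling switched off; the parameter values are illustrative:

from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Each nearest-neighbor member sees a random half of the attributes and
# all training instances, so diversity comes only from randomization.
subspace_knn = BaggingClassifier(
    KNeighborsClassifier(n_neighbors=5),
    n_estimators=20,
    max_features=0.5,      # random attribute subset per ensemble member
    bootstrap=False,       # no resampling of instances
    random_state=0,
)
# subspace_knn.fit(X_train, y_train); subspace_knn.predict(X_test)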
Rotation forests
• Bagging creates ensembles of accurate classifiers with relatively
  low diversity
  – bootstrap sampling creates training sets with a distribution that
    resembles the original data
• Randomness in the learning algorithm increases diversity but
  sacrifices accuracy of the individual ensemble members
• Rotation forests have the goal of creating accurate and diverse
  ensemble members
Rotation forests
• Combine random attribute sets, bagging and principal components
  to generate an ensemble of decision trees
• An iteration involves
  – randomly dividing the input attributes into k disjoint subsets
  – applying PCA to each of the k subsets in turn
  – learning a decision tree from the k sets of PCA directions
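A simplified sketch of one such iteration (assuming scikit-learn and numpy; the full rotation forest algorithm additionally draws a bootstrap sample before each PCA, which is omitted here):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

def rotation_tree(X, y, k=3, seed=0):
    # One ensemble member: split the attributes into k disjoint subsets,
    # run PCA on each subset, and learn a tree in the rotated space.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[1])
    subsets = np.array_split(perm, k)             # k disjoint attribute subsets
    pcas = [PCA().fit(X[:, s]) for s in subsets]  # PCA directions per subset
    X_rot = np.hstack([p.transform(X[:, s]) for p, s in zip(pcas, subsets)])
    tree = DecisionTreeClassifier(random_state=seed).fit(X_rot, y)
    return subsets, pcas, tree

def rotation_predict(subsets, pcas, tree, X):
    X_rot = np.hstack([p.transform(X[:, s]) for p, s in zip(pcas, subsets)])
    return tree.predict(X_rot)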
Boosting
• Combining models beneficial, if
– Models significantly different from each other
– Each model performs well for a reasonable
percentage of data
• Boosting: explicitly seeks models that
  complement one another
Boosting 2
• Similar to Bagging
– Voting (classification), averaging (numeric prediction)
– Models of the same type (e.g. decision trees)
• Different from bagging
– Iterative, models not built separately
– New model is encouraged to become expert for instances
classified incorrectly by earlier models
– Intuitive justification: models should be experts that
complement each other
– Boosting weights a model’s contribution by its confidence
• There are several variants of this algorithm

Boosting illustration
(figure sequence, images omitted: Classifier 1 → weights of misclassified
instances increased → Classifier 2 → weights increased again →
Classifier 3 → the final classifier is a combination of the single classifiers)
AdaBoost.M1

Model Generation
  Assign equal weight to each training instance.
  For each of t iterations:
    Apply learning algorithm to weighted dataset and
      store resulting model.
    Compute error e of model on dataset and store error.
    If e equal to zero, or e greater or equal to 0.5:
      Terminate model generation.
    For each instance in dataset:
      If instance classified correctly by model:
        Multiply weight of instance by e / (1 - e).
    Normalize weight for all instances.

Classification
  Assign weight of zero to all classes.
  For each of the t (or less) models:
    Add -log(e/(1-e)) to weight of class model predicts.
  Return class with highest weight.
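The same algorithm as a compact Python sketch (assuming scikit-learn decision stumps as the weak learners and numpy arrays as input; the termination rule is slightly simplified):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1_fit(X, y, t=10):
    n = len(X)
    w = np.full(n, 1.0 / n)                      # equal weight to each instance
    models, alphas = [], []
    for _ in range(t):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        wrong = stump.predict(X) != y
        e = w[wrong].sum() / w.sum()             # weighted error of this model
        if e == 0 or e >= 0.5:                   # simplified termination rule
            break
        w[~wrong] *= e / (1 - e)                 # shrink weights of correct instances
        w /= w.sum()                             # renormalize
        models.append(stump)
        alphas.append(-np.log(e / (1 - e)))      # model's voting weight
    return models, alphas

def adaboost_m1_predict(models, alphas, X, classes):
    # weighted vote: add -log(e/(1-e)) to the weight of the predicted class
    scores = np.zeros((len(X), len(classes)))
    for m, a in zip(models, alphas):
        pred = m.predict(X)
        for j, c in enumerate(classes):
            scores[pred == c, j] += a
    return np.asarray(classes)[scores.argmax(axis=1)]

# usage sketch: models, alphas = adaboost_m1_fit(X_train, y_train)
#               y_hat = adaboost_m1_predict(models, alphas, X_test, np.unique(y_train))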
Comments on AdaBoost.M1
• During model generation:
  – weights are renormalized (their sum is kept the same as before):
    new weight = weight * (sum of old weights) / (sum of new weights)
• During classification:
  – weight of a model = -log(e/(1-e)), thus
    • e close to 0 => high weight
    • e close to 0.5 => low weight
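For example, using natural logarithms: a model with error e = 0.1 receives weight -log(0.1/0.9) = log 9 ≈ 2.2 in the vote, while a model with e = 0.4 receives only -log(0.4/0.6) = log 1.5 ≈ 0.4, so more accurate models count for considerably more.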
More on boosting I
• Boosting needs weights … but
  – can adapt learning algorithm ... or
  – can apply boosting without weights:
    resample with probability determined by weights
    • disadvantage: not all instances are used
    • advantage: if error > 0.5, can resample again
• Stems from computational learning theory
• Theoretical result:
  training error decreases exponentially
• Also: works if base classifiers are not too complex, and
  their error doesn’t become too large too quickly
More on boosting II
• Continue boosting after training error = 0?
• Puzzling fact:
  generalization error continues to decrease!
  – seems to contradict Occam’s Razor
• Explanation:
  consider margin (confidence), not error
  – difference between estimated probability for true class and
    nearest other class (between –1 and 1)
• Boosting works with weak learners
  – only condition: error doesn’t exceed 0.5
• In practice, boosting sometimes overfits (in contrast to bagging)
Additive regression I
• Boosting: greedy algorithm for fitting additive models
• Boosting is forward stagewise additive modeling:
  – starts with an empty ensemble
  – incorporates new members sequentially
  – each new member maximizes the predictive performance of the
    ensemble as a whole (without changing previous members)
• Same kind of algorithm for numeric prediction:
  – build a standard regression model (e.g. a tree)
  – gather the residuals, learn a model predicting the residuals
    (e.g. another tree), and repeat
• To predict, simply sum up the individual predictions from all
  models
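A minimal sketch of the numeric-prediction variant (assuming scikit-learn regression trees as the base model; this is essentially gradient boosting with squared error):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def additive_regression_fit(X, y, n_models=5, max_depth=3):
    # forward stagewise: each new tree is fitted to the residuals
    # left by the ensemble built so far
    models = []
    residual = y.astype(float).copy()
    for _ in range(n_models):
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        models.append(tree)
        residual -= tree.predict(X)          # what the ensemble still gets wrong
    return models

def additive_regression_predict(models, X):
    # prediction is simply the sum of the individual models' predictions
    return np.sum([m.predict(X) for m in models], axis=0)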
Additive regression II
• Minimizes the squared error of the ensemble if the base learner
  minimizes squared error
• Doesn’t make sense to use it with standard (multiple, i.e. multi-
  attribute) linear regression:
  – the sum of linear regression models is again a linear regression
    model
• Can use it with simple linear regression – regression based on a
  single attribute – to build a multiple linear regression model
• Use cross-validation to decide when to stop
Option trees
• Ensembles are not interpretable
• Can we generate a single model?
  – One possibility: “cloning” the ensemble – generate lots of
    artificial data, label it with the ensemble, and create a single
    model from this artificial data
  – Another possibility: generating a single structure that
    represents the ensemble in compact fashion
• Option tree: decision tree with option nodes
  – Idea: follow all possible branches at an option node
  – Predictions from the different branches are merged by voting
    or by averaging probability estimates
Example
(figure omitted: an example option tree)

• Can be learned by modifying the tree learner:
  – create an option node if there are several equally promising
    splits (within a user-specified interval)
  – when pruning, the error at an option node is the average error
    of the options
Alternating decision trees
• Can also grow an option tree by incrementally adding nodes to it
• The resulting structure is called an alternating decision tree,
  with splitter nodes and prediction nodes
  – prediction nodes are leaves if no splitter nodes have been added
    to them yet
  – the standard alternating tree applies to 2-class problems
  – to obtain a prediction, filter the instance down all applicable
    branches and sum the predictions
    • predict one class or the other depending on whether the sum
      is positive or negative
Example
(figure omitted: alternating decision tree for the weather data)

Instance: outlook=sunny, temperature=hot, humidity=normal, windy=false
Sum of the applicable prediction values:
(-0.225) + 0.213 + (-0.430) + (-0.331) = -0.773 < 0  =>  play = yes
Stacking
• Uses meta learner instead of voting to combine
predictions of base learners
– Predictions of
• base learners (level-0 models) are used as input for
• meta learner (level-1 model)
– Level-1 instance has as many attributes as there are level-
0 learners
• Base learners: typically different learning schemes
• Predictions on training data can’t be used to generate
data for level-1 model!
– Cross-validation-like scheme is employed
• Hard to analyze theoretically: “black magic”

More on stacking
• If base learners can output probabilities it’s better to use
those as input to meta learner
• Which algorithm to use to generate meta learner?
– In principle, any learning scheme can be applied
– David Wolpert: “relatively global, smooth” model
• Base learners do most of the work
• Reduces risk of overfitting
• Stacking can also be applied to numeric prediction

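A minimal stacking sketch (assuming scikit-learn; the particular level-0 and level-1 learners are illustrative choices): out-of-fold probability estimates of the level-0 models, produced by cross_val_predict, form the level-1 training data, and a "relatively global, smooth" logistic regression serves as the meta learner.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def stacking_fit(X, y):
    level0 = [DecisionTreeClassifier(random_state=0), KNeighborsClassifier()]
    # out-of-fold probability estimates -> level-1 training data
    meta_X = np.hstack([
        cross_val_predict(m, X, y, cv=5, method="predict_proba") for m in level0
    ])
    level1 = LogisticRegression().fit(meta_X, y)
    level0 = [m.fit(X, y) for m in level0]     # refit level-0 models on all data
    return level0, level1

def stacking_predict(level0, level1, X):
    meta_X = np.hstack([m.predict_proba(X) for m in level0])
    return level1.predict(meta_X)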
Comparison

Bagging                      Boosting                     Stacking
Data1 ≠ Data2 ≠ … ≠ Data m   Data1 ≠ Data2 ≠ … ≠ Data m   Data1 = Data2 = … = Data m
(resample training data)     (reweight training data)
Learner1 = Learner2 = …      Learner1 = Learner2 = …      Learner1 ≠ Learner2 ≠ …
= Learner m                  = Learner m                  ≠ Learner m
Vote                         Weighted vote                Level-1 model
