
Data Mining

Ensemble learning
Gergely Lukács
Pázmány Péter Catholic University
Faculty of Information Technology
Budapest, Hungary
lukacs@itk.ppke.hu
Ensemble learning

• Combining multiple models
  – The basic idea
• Bias-variance decomposition
• Bagging
  – Bagging with costs
• Randomization
  – Rotation forests
• Boosting
  – AdaBoost, the power of boosting
• Interpretable ensembles
  – Option trees, alternating decision trees
• Stacking
Combining multiple models
• Basic idea of “meta” learning schemes:
build different “experts” and let them vote
• Advantage:
– often improves predictive performance
• Disadvantage:
– usually produces output that is very hard to
analyze
– but: there are approaches that aim to produce a
single comprehensible structure
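As an illustration of the basic idea, here is a minimal sketch (assuming scikit-learn; the three base learners are an arbitrary choice, not prescribed by the slides) in which three different "experts" are combined by majority vote:

from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# three different "experts"
experts = [
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("nb", GaussianNB()),
    ("knn", KNeighborsClassifier()),
]
# voting="hard": each expert casts one vote, the majority class wins
ensemble = VotingClassifier(estimators=experts, voting="hard")
# ensemble.fit(X_train, y_train); ensemble.predict(X_test)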
Bias-variance decomposition
• Imagine several different, but equally good, training data sets
• A learning algorithm is biased for a particular input if, when
  trained on each of these data sets, it is systematically incorrect
  when predicting the correct output for that input
  – the machine learning method doesn’t (perfectly) match the problem
• A learning algorithm has high variance for a particular input if it
  predicts different output values when trained on different
  training sets
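To make the decomposition concrete, a small simulation sketch (an illustration of my own, not from the slides, assuming scikit-learn and a numeric-prediction setting where bias and variance are easiest to measure): the same tree learner is trained on many training sets drawn from the same source, and we inspect its predictions at one fixed input.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def true_f(x):
    return np.sin(3 * x)                       # assumed "true" target function

rng = np.random.default_rng(0)
x_test = np.array([[0.5]])                     # one fixed test input
preds = []
for _ in range(200):                           # 200 equally good training sets
    X = rng.uniform(-1, 1, size=(50, 1))
    y = true_f(X[:, 0]) + rng.normal(0, 0.3, 50)   # noisy observations
    model = DecisionTreeRegressor(max_depth=2).fit(X, y)
    preds.append(model.predict(x_test)[0])

preds = np.array(preds)
bias_sq = (preds.mean() - true_f(0.5)) ** 2    # systematic error at this input
variance = preds.var()                         # spread across training sets
print(f"bias^2 = {bias_sq:.3f}, variance = {variance:.3f}")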
Bagging
• Combining predictions by voting/averaging
– Simplest way
– Each model receives equal weight
• “Idealized” version:
– Sample several training sets of size n
(instead of just having one training set of size n)
– Build a classifier for each training set
– Combine the classifiers’ predictions
• If the learning scheme is unstable, bagging
  almost always improves performance
  – Small change in the training data can make a big change in the
    model (e.g. decision trees)
More on bagging
• Bagging reduces variance by
voting/averaging, thus reducing the overall
expected error
• Problem: we only have one dataset!
• Solution: generate new datasets of size n
by sampling with replacement from original
dataset
– Bagging = Bootstrap aggregating
Bagging classifiers

model generation
  Let n be the number of instances in the training data.
  For each of t iterations:
    Sample n instances with replacement from training set.
    Apply the learning algorithm to the sample.
    Store the resulting model.

classification
  For each of the t models:
    Predict class of instance using model.
  Return class that has been predicted most often.
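The same procedure as a runnable Python sketch (assuming scikit-learn decision trees as the learning algorithm, numpy arrays as input and integer class labels):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, t=25, seed=0):
    # model generation: t models, each built on a bootstrap sample of size n
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(t):
        idx = rng.integers(0, n, size=n)     # sample n instances with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # classification: return the class predicted most often (majority vote)
    votes = np.array([m.predict(X) for m in models])        # shape (t, n_test)
    # assumes integer class labels 0..k-1
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])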
More on bagging
• Theoretically, bagging would reduce variance without
changing bias
– In practice, bagging can reduce both bias and variance
• For high-bias classifiers, it can reduce bias
• For high-variance classifiers, it can reduce variance
– In the case of classification there are pathological
situations where the overall error might increase
– Usually, the more classifiers the better
– Can help a lot if data is noisy
• Can also be applied to numeric prediction

Bagging with costs
• Bagging produces good probability estimates
  – instead of voting, the individual classifiers' probability
    estimates are averaged
• Can use this with the minimum expected cost approach for
  learning problems with costs
• MetaCost re-labels the training data using bagging with costs and
  then builds a single tree
  – produces a model that can be interpreted!
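A sketch of the minimum expected cost step (the 2-class cost matrix is a made-up example; the models are assumed to be bagged classifiers that all expose predict_proba with the same class ordering, e.g. those from the earlier bagging sketch):

import numpy as np

# Hypothetical cost matrix: cost[i, j] = cost of predicting class j when the
# true class is i (here, missing class 1 is five times worse than the reverse).
cost = np.array([[0.0, 1.0],
                 [5.0, 0.0]])

def min_expected_cost_predict(models, X):
    # average the individual classifiers' probability estimates ...
    probs = np.mean([m.predict_proba(X) for m in models], axis=0)  # shape (n, 2)
    # ... then pick the class with minimum expected cost instead of voting
    expected = probs @ cost     # expected[i, j] = expected cost of predicting j
    return expected.argmin(axis=1)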
Randomization
• Can randomize the learning algorithm instead of the input
• Some algorithms already have a random component:
  e.g. initial weights in a neural net
• Most algorithms can be randomized, e.g. greedy algorithms:
  – pick from the N best options at random instead of always
    picking the best option
  – e.g.: attribute selection in decision trees
• More generally applicable than bagging:
  e.g. random subsets in a nearest-neighbor scheme
• Can be combined with bagging
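For instance, the random-subset nearest-neighbor idea can be sketched with scikit-learn's BaggingClassifier used purely as a randomizer, with bootstrap sampling switched off; the parameter values are illustrative:

from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Each nearest-neighbor member sees a random half of the attributes and
# all training instances, so diversity comes only from randomization.
subspace_knn = BaggingClassifier(
    KNeighborsClassifier(n_neighbors=5),
    n_estimators=20,
    max_features=0.5,      # random attribute subset per ensemble member
    bootstrap=False,       # no resampling of instances
    random_state=0,
)
# subspace_knn.fit(X_train, y_train); subspace_knn.predict(X_test)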
Rotation forests
• Bagging creates ensembles of accurate classifiers with relatively
  low diversity
  – bootstrap sampling creates training sets with a distribution that
    resembles the original data
• Randomness in the learning algorithm increases diversity but
  sacrifices accuracy of the individual ensemble members
• Rotation forests have the goal of creating accurate and diverse
  ensemble members
Rotation forests
• Combine random attribute sets, bagging and principal components
  to generate an ensemble of decision trees
• An iteration involves
  – randomly dividing the input attributes into k disjoint subsets
  – applying PCA to each of the k subsets in turn
  – learning a decision tree from the k sets of PCA directions
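A simplified sketch of one such iteration (assuming scikit-learn and numpy; the full rotation forest algorithm additionally draws a bootstrap sample before each PCA, which is omitted here):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

def rotation_tree(X, y, k=3, seed=0):
    # One ensemble member: split the attributes into k disjoint subsets,
    # run PCA on each subset, and learn a tree in the rotated space.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[1])
    subsets = np.array_split(perm, k)             # k disjoint attribute subsets
    pcas = [PCA().fit(X[:, s]) for s in subsets]  # PCA directions per subset
    X_rot = np.hstack([p.transform(X[:, s]) for p, s in zip(pcas, subsets)])
    tree = DecisionTreeClassifier(random_state=seed).fit(X_rot, y)
    return subsets, pcas, tree

def rotation_predict(subsets, pcas, tree, X):
    X_rot = np.hstack([p.transform(X[:, s]) for p, s in zip(pcas, subsets)])
    return tree.predict(X_rot)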
Boosting
• Combining models beneficial, if
– Models significantly different from each other
– Each model performs well for a reasonable
percentage of data
• Boosting: explicitly seeks models that
  complement one another
Boosting 2
• Similar to Bagging
– Voting (classification), averaging (numeric prediction)
– Models of the same type (e.g. decision trees)
• Different from bagging
– Iterative, models not built separately
– New model is encouraged to become expert for instances
classified incorrectly by earlier models
– Intuitive justification: models should be experts that
complement each other
– Boosting weights a model’s contribution by its confidence
• There are several variants of this algorithm

Boosting illustration
(figure sequence, images omitted: Classifier 1 → weights of misclassified
instances increased → Classifier 2 → weights increased again →
Classifier 3 → the final classifier is a combination of the single classifiers)
AdaBoost.M1

Model Generation
  Assign equal weight to each training instance.
  For each of t iterations:
    Apply learning algorithm to weighted dataset and
      store resulting model.
    Compute error e of model on dataset and store error.
    If e equal to zero, or e greater or equal to 0.5:
      Terminate model generation.
    For each instance in dataset:
      If instance classified correctly by model:
        Multiply weight of instance by e / (1 - e).
    Normalize weight for all instances.

Classification
  Assign weight of zero to all classes.
  For each of the t (or less) models:
    Add -log(e/(1-e)) to weight of class model predicts.
  Return class with highest weight.
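The same algorithm as a compact Python sketch (assuming scikit-learn decision stumps as the weak learners and numpy arrays as input; the termination rule is slightly simplified):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1_fit(X, y, t=10):
    n = len(X)
    w = np.full(n, 1.0 / n)                      # equal weight to each instance
    models, alphas = [], []
    for _ in range(t):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        wrong = stump.predict(X) != y
        e = w[wrong].sum() / w.sum()             # weighted error of this model
        if e == 0 or e >= 0.5:                   # simplified termination rule
            break
        w[~wrong] *= e / (1 - e)                 # shrink weights of correct instances
        w /= w.sum()                             # renormalize
        models.append(stump)
        alphas.append(-np.log(e / (1 - e)))      # model's voting weight
    return models, alphas

def adaboost_m1_predict(models, alphas, X, classes):
    # weighted vote: add -log(e/(1-e)) to the weight of the predicted class
    scores = np.zeros((len(X), len(classes)))
    for m, a in zip(models, alphas):
        pred = m.predict(X)
        for j, c in enumerate(classes):
            scores[pred == c, j] += a
    return np.asarray(classes)[scores.argmax(axis=1)]

# usage sketch: models, alphas = adaboost_m1_fit(X_train, y_train)
#               y_hat = adaboost_m1_predict(models, alphas, X_test, np.unique(y_train))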
Comments on AdaBoost.M1
• During model generation:
  – weights are renormalized (their sum is kept the same as before):
    new weight = weight * (sum of old weights) / (sum of new weights)
• During classification:
  – weight of a model = -log(e/(1-e)), thus
    • e close to 0 => high weight
    • e close to 0.5 => low weight
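For example, using natural logarithms: a model with error e = 0.1 receives weight -log(0.1/0.9) = log 9 ≈ 2.2 in the vote, while a model with e = 0.4 receives only -log(0.4/0.6) = log 1.5 ≈ 0.4, so more accurate models count for considerably more.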
More on boosting I
• Boosting needs weights … but
  – can adapt learning algorithm ... or
  – can apply boosting without weights:
    resample with probability determined by weights
    • disadvantage: not all instances are used
    • advantage: if error > 0.5, can resample again
• Stems from computational learning theory
• Theoretical result:
  training error decreases exponentially
• Also: works if base classifiers are not too complex, and
  their error doesn’t become too large too quickly
More on boosting II
• Continue boosting after training error = 0?
• Puzzling fact:
  generalization error continues to decrease!
  – seems to contradict Occam’s Razor
• Explanation:
  consider margin (confidence), not error
  – difference between estimated probability for true class and
    nearest other class (between –1 and 1)
• Boosting works with weak learners
  – only condition: error doesn’t exceed 0.5
• In practice, boosting sometimes overfits (in contrast to bagging)
Additive regression I
• Boosting: greedy algorithm for fitting additive models
• Boosting is forward stagewise additive modeling:
  – starts with an empty ensemble
  – incorporates new members sequentially
  – each new member maximizes the predictive performance of the
    ensemble as a whole (without changing previous members)
• Same kind of algorithm for numeric prediction:
  – build a standard regression model (e.g. a tree)
  – gather the residuals, learn a model predicting the residuals
    (e.g. another tree), and repeat
• To predict, simply sum up the individual predictions from all
  models
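A minimal sketch of the numeric-prediction variant (assuming scikit-learn regression trees as the base model; this is essentially gradient boosting with squared error):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def additive_regression_fit(X, y, n_models=5, max_depth=3):
    # forward stagewise: each new tree is fitted to the residuals
    # left by the ensemble built so far
    models = []
    residual = y.astype(float).copy()
    for _ in range(n_models):
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        models.append(tree)
        residual -= tree.predict(X)          # what the ensemble still gets wrong
    return models

def additive_regression_predict(models, X):
    # prediction is simply the sum of the individual models' predictions
    return np.sum([m.predict(X) for m in models], axis=0)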
Additive regression II
• Minimizes the squared error of the ensemble if the base learner
  minimizes squared error
• Doesn’t make sense to use it with standard (multiple, i.e. multi-
  attribute) linear regression:
  – the sum of linear regression models is again a linear regression
    model
• Can use it with simple linear regression – regression based on a
  single attribute – to build a multiple linear regression model
• Use cross-validation to decide when to stop
Option trees
• Ensembles are not interpretable
• Can we generate a single model?
  – One possibility: “cloning” the ensemble – generate lots of
    artificial data, label it with the ensemble, and create a single
    model from this artificial data
  – Another possibility: generating a single structure that
    represents the ensemble in compact fashion
• Option tree: decision tree with option nodes
  – Idea: follow all possible branches at an option node
  – Predictions from the different branches are merged by voting
    or by averaging probability estimates
Example
(figure omitted: an example option tree)

• Can be learned by modifying the tree learner:
  – create an option node if there are several equally promising
    splits (within a user-specified interval)
  – when pruning, the error at an option node is the average error
    of the options
Alternating decision trees
• Can also grow an option tree by incrementally adding nodes to it
• The resulting structure is called an alternating decision tree,
  with splitter nodes and prediction nodes
  – prediction nodes are leaves if no splitter nodes have been added
    to them yet
  – the standard alternating tree applies to 2-class problems
  – to obtain a prediction, filter the instance down all applicable
    branches and sum the predictions
    • predict one class or the other depending on whether the sum
      is positive or negative
Example
(figure omitted: alternating decision tree for the weather data)

Instance: outlook=sunny, temperature=hot, humidity=normal, windy=false
Sum of the applicable prediction values:
(-0.225) + 0.213 + (-0.430) + (-0.331) = -0.773 < 0  =>  play = yes
Stacking
• Uses meta learner instead of voting to combine
predictions of base learners
– Predictions of
• base learners (level-0 models) are used as input for
• meta learner (level-1 model)
– Level-1 instance has as many attributes as there are level-
0 learners
• Base learners: typically different learning schemes
• Predictions on training data can’t be used to generate
data for level-1 model!
– Cross-validation-like scheme is employed
• Hard to analyze theoretically: “black magic”

More on stacking
• If base learners can output probabilities it’s better to use
those as input to meta learner
• Which algorithm to use to generate meta learner?
– In principle, any learning scheme can be applied
– David Wolpert: “relatively global, smooth” model
• Base learners do most of the work
• Reduces risk of overfitting
• Stacking can also be applied to numeric prediction

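A minimal stacking sketch (assuming scikit-learn; the particular level-0 and level-1 learners are illustrative choices): out-of-fold probability estimates of the level-0 models, produced by cross_val_predict, form the level-1 training data, and a "relatively global, smooth" logistic regression serves as the meta learner.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def stacking_fit(X, y):
    level0 = [DecisionTreeClassifier(random_state=0), KNeighborsClassifier()]
    # out-of-fold probability estimates -> level-1 training data
    meta_X = np.hstack([
        cross_val_predict(m, X, y, cv=5, method="predict_proba") for m in level0
    ])
    level1 = LogisticRegression().fit(meta_X, y)
    level0 = [m.fit(X, y) for m in level0]     # refit level-0 models on all data
    return level0, level1

def stacking_predict(level0, level1, X):
    meta_X = np.hstack([m.predict_proba(X) for m in level0])
    return level1.predict(meta_X)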
Comparison

Bagging                      Boosting                     Stacking
Data1 ≠ Data2 ≠ … ≠ Data m   Data1 ≠ Data2 ≠ … ≠ Data m   Data1 = Data2 = … = Data m
(resample training data)     (reweight training data)
Learner1 = Learner2 = …      Learner1 = Learner2 = …      Learner1 ≠ Learner2 ≠ …
= Learner m                  = Learner m                  ≠ Learner m
Vote                         Weighted vote                Level-1 model
