
Ensemble Techniques

1
Ensemble Methods

Construct a set of base classifiers learned from the training data

Predict the class label of test records by combining the predictions made by multiple classifiers (e.g., by taking a majority vote)

2
Example: Why Do Ensemble Methods Work?

3
Necessary Conditions for Ensemble Methods

Ensemble Methods work better than a single base classifier if:

1. All base classifiers are independent of each other
2. All base classifiers perform better than random guessing
   (error rate < 0.5 for binary classification)

[Figure: classification error for an ensemble of 25 base classifiers, assuming their errors are uncorrelated.]
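This can be checked numerically. A minimal Python sketch (not from the slides) that computes the error of a majority vote over 25 independent base classifiers, each with error rate eps:

    from math import comb

    def ensemble_error(eps, n=25):
        # the ensemble errs when a majority (13 or more of 25) of the base classifiers err
        k_min = n // 2 + 1
        return sum(comb(n, k) * eps**k * (1 - eps)**(n - k) for k in range(k_min, n + 1))

    print(ensemble_error(0.35))   # ~0.06, much lower than the base error rate of 0.35
    print(ensemble_error(0.50))   # 0.5: no improvement when the base classifiers only guess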

4
Rationale for Ensemble Learning

Ensemble Methods work best with unstable base classifiers
– Classifiers that are sensitive to minor perturbations in the training set, due to high model complexity
– Examples: unpruned decision trees, ANNs, …

5
Bias-Variance Decomposition

Analogous problem: reaching a target y by firing projectiles from x (regression problem)

For classification, the generalization error of model m can be given by:

\mathrm{gen.error}(m) = c_1 + \mathrm{bias}(m) + c_2 \times \mathrm{variance}(m)


6
In general,
- Bias contributes to the training error; a complex model has low bias.
- Variance contributes to the future (generalization) error; a complex model has high variance.
- Bagging reduces the variance of the base classifiers.
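To make the last point concrete, here is a small numpy sketch (illustrative only; it uses a deliberately unstable 1-nearest-neighbour predictor as the base model) showing that averaging models trained on bootstrap samples lowers the variance of the prediction:

    import numpy as np

    rng = np.random.default_rng(0)

    def noisy_sample(n=30):
        # toy regression data: y = sin(x) + noise
        x = rng.uniform(0, np.pi, n)
        y = np.sin(x) + rng.normal(0, 0.3, n)
        return x, y

    def single_model_prediction(x, y, x0=1.5):
        # a high-variance "model": 1-nearest-neighbour prediction at x0
        return y[np.argmin(np.abs(x - x0))]

    def bagged_prediction(x, y, x0=1.5, B=25):
        # average the 1-NN predictions of B models fit on bootstrap samples
        preds = []
        for _ in range(B):
            idx = rng.integers(0, len(x), len(x))   # sampling with replacement
            preds.append(single_model_prediction(x[idx], y[idx], x0))
        return np.mean(preds)

    single = [single_model_prediction(*noisy_sample()) for _ in range(500)]
    bagged = [bagged_prediction(*noisy_sample()) for _ in range(500)]
    print("variance of single model:", np.var(single))
    print("variance of bagged model:", np.var(bagged))   # noticeably smaller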

7
Bias-Variance Trade-off and Overfitting

[Figure: bias-variance trade-off curve, showing the underfitting and overfitting regions.]

Ensemble methods try to reduce the variance of complex models (with low bias) by aggregating the responses of multiple base classifiers
8
General Approach of Ensemble Learning

Using a majority vote or a weighted majority vote (weighted according to each classifier's accuracy or relevance)

11
Constructing Ensemble Classifiers

By manipulating the training set
– Example: bagging, boosting, random forests

By manipulating the input features
– Example: random forests

By manipulating the class labels
– Example: error-correcting output coding

By manipulating the learning algorithm
– Example: injecting randomness into the initial weights of an ANN

12
Bagging (Bootstrap AGGregating)

Bootstrap sampling: sampling with replacement

Original Data 1 2 3 4 5 6 7 8 9 10
Bagging (Round 1) 7 8 10 8 2 5 10 10 5 9
Bagging (Round 2) 1 4 9 1 2 3 2 7 3 2
Bagging (Round 3) 1 8 5 10 5 5 9 6 3 7

Build classifier on each bootstrap sample

Probability of a training instance being selected in a bootstrap sample is:
➢ 1 – (1 – 1/n)^n   (n: number of training instances)
➢ ≈ 0.632 when n is large
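A quick numerical check of this limit (a small sketch, not part of the slides):

    # probability that a given training instance appears at least once in a bootstrap sample
    for n in (10, 100, 1000, 100000):
        print(n, round(1 - (1 - 1/n) ** n, 4))   # approaches 1 - 1/e ≈ 0.632 as n grows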
13
Bagging Algorithm
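The bagging pseudocode appears as a figure on the original slide. A minimal Python sketch of the same procedure, assuming a scikit-learn-style base learner (here a decision tree) and class labels in {-1, +1} so that the majority vote is just the sign of the summed predictions:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def bagging_fit(X, y, T=25):
        # train T base classifiers, each on a bootstrap sample of (X, y)
        n = len(X)
        models = []
        for _ in range(T):
            idx = np.random.randint(0, n, n)   # sample n indices with replacement
            models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return models

    def bagging_predict(models, X):
        # majority vote: sign of the sum of the +1/-1 base predictions
        return np.sign(sum(m.predict(X) for m in models))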

14
Bagging Example

Consider a 1-dimensional data set:

Original Data:
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1

Classifier is a decision stump (decision tree of size 1)
– Decision rule: x ≤ k versus x > k
– Split point k is chosen based on entropy

[Figure: decision stump that tests x ≤ k; the True branch predicts yleft, the False branch predicts yright.]
15
Bagging Example

Bagging Round 1:
x 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9
y 1 1 1 1 -1 -1 -1 -1 1 1
Learned stump: x ≤ 0.35 → y = 1; x > 0.35 → y = -1

Bagging Round 2:
x 0.1 0.2 0.3 0.4 0.5 0.5 0.9 1 1 1
y 1 1 1 -1 -1 -1 1 1 1 1

Bagging Round 3:
x 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1 -1 1 1

Bagging Round 4:
x 0.1 0.1 0.2 0.4 0.4 0.5 0.5 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1 -1 1 1

Bagging Round 5:
x 0.1 0.1 0.2 0.5 0.6 0.6 0.6 1 1 1
y 1 1 1 -1 -1 -1 -1 1 1 1

16
Bagging Example

Bagging Round 1:
x 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9
y 1 1 1 1 -1 -1 -1 -1 1 1
Learned stump: x ≤ 0.35 → y = 1; x > 0.35 → y = -1

Bagging Round 2:
x 0.1 0.2 0.3 0.4 0.5 0.5 0.9 1 1 1
y 1 1 1 -1 -1 -1 1 1 1 1
Learned stump: x ≤ 0.7 → y = 1; x > 0.7 → y = 1

Bagging Round 3:
x 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1 -1 1 1
Learned stump: x ≤ 0.35 → y = 1; x > 0.35 → y = -1

Bagging Round 4:
x 0.1 0.1 0.2 0.4 0.4 0.5 0.5 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1 -1 1 1
Learned stump: x ≤ 0.3 → y = 1; x > 0.3 → y = -1

Bagging Round 5:
x 0.1 0.1 0.2 0.5 0.6 0.6 0.6 1 1 1
y 1 1 1 -1 -1 -1 -1 1 1 1
Learned stump: x ≤ 0.35 → y = 1; x > 0.35 → y = -1

17
Bagging Example

Bagging Round 6:
x 0.2 0.4 0.5 0.6 0.7 0.7 0.7 0.8 0.9 1
y 1 -1 -1 -1 -1 -1 -1 1 1 1
Learned stump: x ≤ 0.75 → y = -1; x > 0.75 → y = 1

Bagging Round 7:
x 0.1 0.4 0.4 0.6 0.7 0.8 0.9 0.9 0.9 1
y 1 -1 -1 -1 -1 1 1 1 1 1
Learned stump: x ≤ 0.75 → y = -1; x > 0.75 → y = 1

Bagging Round 8:
x 0.1 0.2 0.5 0.5 0.5 0.7 0.7 0.8 0.9 1
y 1 1 -1 -1 -1 -1 -1 1 1 1
Learned stump: x ≤ 0.75 → y = -1; x > 0.75 → y = 1

Bagging Round 9:
x 0.1 0.3 0.4 0.4 0.6 0.7 0.7 0.8 1 1
y 1 1 -1 -1 -1 -1 -1 1 1 1
Learned stump: x ≤ 0.75 → y = -1; x > 0.75 → y = 1

Bagging Round 10:
x 0.1 0.1 0.1 0.1 0.3 0.3 0.8 0.8 0.9 0.9
y 1 1 1 1 1 1 1 1 1 1
Learned stump: x ≤ 0.05 → y = 1; x > 0.05 → y = 1

18
Bagging Example

Summary of Trained Decision Stumps:

Round Split Point Left Class Right Class
1 0.35 1 -1
2 0.7 1 1
3 0.35 1 -1
4 0.3 1 -1
5 0.35 1 -1
6 0.75 -1 1
7 0.75 -1 1
8 0.75 -1 1
9 0.75 -1 1
10 0.05 1 1

19
Bagging Example
Use majority vote (sign of the sum of predictions) to determine the class of the ensemble classifier
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 1 1 1 -1 -1 -1 -1 -1 -1 -1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 -1 -1 -1 -1 -1 -1 -1
4 1 1 1 -1 -1 -1 -1 -1 -1 -1
5 1 1 1 -1 -1 -1 -1 -1 -1 -1
6 -1 -1 -1 -1 -1 -1 -1 1 1 1
7 -1 -1 -1 -1 -1 -1 -1 1 1 1
8 -1 -1 -1 -1 -1 -1 -1 1 1 1
9 -1 -1 -1 -1 -1 -1 -1 1 1 1
10 1 1 1 1 1 1 1 1 1 1
Sum 2 2 2 -6 -6 -6 -6 2 2 2
Predicted class (sign) 1 1 1 -1 -1 -1 -1 1 1 1

Bagging can also increase the complexity (representation capacity) of simple classifiers such as decision stumps
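The Sum and Sign rows above can be reproduced directly from the ten trained stumps (a short sketch using the split points and class labels from the summary table):

    # (split point, left class, right class) for the 10 trained stumps
    stumps = [(0.35, 1, -1), (0.7, 1, 1), (0.35, 1, -1), (0.3, 1, -1), (0.35, 1, -1),
              (0.75, -1, 1), (0.75, -1, 1), (0.75, -1, 1), (0.75, -1, 1), (0.05, 1, 1)]

    def stump_predict(stump, x):
        split, left, right = stump
        return left if x <= split else right

    for x in [round(0.1 * i, 1) for i in range(1, 11)]:
        votes = sum(stump_predict(s, x) for s in stumps)   # sum of the +1/-1 predictions
        print(x, votes, 1 if votes > 0 else -1)            # matches the Sum / Sign rows above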
20
Boosting

An iterative procedure to adaptively change the distribution of the training data by focusing more on previously misclassified records
– Initially, all N records are assigned equal weights
– Unlike bagging, weights may change at the end of each boosting round

21
Boosting

Records that are wrongly classified will have their weights increased
Records that are classified correctly will have their weights decreased
Original Data 1 2 3 4 5 6 7 8 9 10
Boosting (Round 1) 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2) 5 4 9 4 2 5 1 7 4 2
Boosting (Round 3) 4 4 8 10 4 5 4 6 3 4

• Example 4 is hard to classify
• Its weight is increased, so it is more likely to be chosen again in subsequent rounds
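A minimal sketch of this weighted resampling step (the specific weights below are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    indices = np.arange(1, 11)          # the 10 training records
    weights = np.full(10, 0.1)          # round 1: equal weights
    weights[3] = 0.4                    # suppose record 4 was misclassified and up-weighted
    weights /= weights.sum()            # sampling probabilities must sum to 1

    sample = rng.choice(indices, size=10, replace=True, p=weights)
    print(sample)                       # record 4 tends to appear more often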

22
Boosting

In the first round, equal weights (1/N) are assigned to each training instance
After a classifier Ci is learned, the weights are adjusted so that the subsequent classifier Ci+1 "pays more attention" to data that were misclassified by Ci
The final boosted classifier C* combines the votes of the individual classifiers
– The weight of each classifier's vote is a function of its accuracy
AdaBoost is a popular boosting algorithm

23
AdaBoost (Adaptive Boosting)

Input:
– Training set D containing N instances
– T rounds
– A classification learning scheme
Output:
– A composite model

24
Adaboost: Training Phase

Training data D contains N labeled instances (X1, y1), (X2, y2), (X3, y3), …, (XN, yN)
Initially, each instance is assigned an equal weight of 1/N
To generate T base classifiers, we need T rounds (iterations)
In round i, data from D are sampled with replacement to form Di (of size N)
Each instance's chance of being selected in the next rounds depends on its weight
– Each new sample is generated directly from the training data D with sampling probabilities given by the current weights; these weights are never zero

25
Adaboost: Training Phase

Base classifier Ci is derived from training sample Di
The error of Ci is then estimated
The weights of the training instances are adjusted depending on how they were classified
– Correctly classified: decrease weight
– Incorrectly classified: increase weight
An instance's weight indicates how hard it is to classify (the harder, the higher the weight)

26
Adaboost: Testing Phase
The lower a classifier's error rate, the more accurate it is, and therefore the higher its voting weight should be
The weight of classifier Ci's vote is

\alpha_i = \frac{1}{2} \ln\left(\frac{1 - \epsilon_i}{\epsilon_i}\right)

Testing:
– For each class c, sum the weights of the classifiers that assigned class c to X (the unseen instance)
– The class with the highest sum is the WINNER!

C^*(x_{test}) = \arg\max_y \sum_{i=1}^{T} \alpha_i \, \delta\big(C_i(x_{test}) = y\big)

27
Example: Error and Classifier Weight in
AdaBoost
Base classifiers: C1, C2, …, CT

Error rate (i = index of classifier, j = index of instance):

\epsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j \, \delta\big(C_i(x_j) \neq y_j\big)

Importance of a classifier:

\alpha_i = \frac{1}{2} \ln\left(\frac{1 - \epsilon_i}{\epsilon_i}\right)
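Evaluating these two formulas for a hypothetical base classifier (uniform weights of 0.1, with three of the ten records misclassified; the values are chosen to mirror the worked example that follows):

    import math

    w    = [0.1] * 10                                    # uniform instance weights
    y    = [ 1,  1,  1, -1, -1, -1, -1,  1,  1,  1]      # true labels
    pred = [-1, -1, -1, -1, -1, -1, -1,  1,  1,  1]      # classifier misclassifies the first three

    N = len(y)
    eps = sum(wj for wj, yj, pj in zip(w, y, pred) if pj != yj) / N
    alpha = 0.5 * math.log((1 - eps) / eps)
    print(eps, alpha)   # 0.03 and ~1.738, the same alpha as round 1 of the example below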
28
Example: Data Instance Weight
in AdaBoost
Assume: N training instances in D, T rounds; (xj, yj) are the training data, and Ci, αi are the classifier and classifier weight of the i-th round, respectively.
Weight update on all training data in D:

w_j^{(i+1)} = \frac{w_j^{(i)}}{Z_i} \times \begin{cases} \exp(-\alpha_i) & \text{if } C_i(x_j) = y_j \\ \exp(\alpha_i) & \text{if } C_i(x_j) \neq y_j \end{cases}

where Z_i is the normalization factor

C^*(x_{test}) = \arg\max_y \sum_{i=1}^{T} \alpha_i \, \delta\big(C_i(x_{test}) = y\big)

29
AdaBoost Algorithm
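The full algorithm is shown as a figure on the original slide. Below is a minimal Python sketch of an AdaBoost training loop consistent with the description above, with two hedges: it fits each stump using instance weights instead of resampling (a common equivalent variant), and it uses the standard weighted error eps = Σ_j w_j δ(·) with weights summing to one rather than the 1/N form of the earlier slide. Labels are assumed to be in {-1, +1}.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, T=10):
        # return a list of (alpha, classifier) pairs; y must be in {-1, +1}
        N = len(y)
        w = np.full(N, 1.0 / N)                 # equal initial weights
        ensemble = []
        for _ in range(T):
            stump = DecisionTreeClassifier(max_depth=1)   # decision stump as base classifier
            stump.fit(X, y, sample_weight=w)
            pred = stump.predict(X)

            miss = (pred != y)
            eps = np.sum(w * miss)              # weighted error rate
            if eps >= 0.5:                      # worse than random guessing: reset and skip
                w = np.full(N, 1.0 / N)
                continue
            alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-10))

            # increase weights of misclassified records, decrease the others, then normalize
            w = w * np.exp(np.where(miss, alpha, -alpha))
            w /= w.sum()
            ensemble.append((alpha, stump))
        return ensemble

    def adaboost_predict(ensemble, X):
        # weighted vote: sign of the alpha-weighted sum of base predictions
        scores = sum(alpha * clf.predict(X) for alpha, clf in ensemble)
        return np.sign(scores)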

30
AdaBoost Example

Consider a 1-dimensional data set:

Original Data:
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1

Classifier is a decision stump
– Decision rule: x ≤ k versus x > k
– Split point k is chosen based on entropy

[Figure: decision stump that tests x ≤ k; the True branch predicts yleft, the False branch predicts yright.]
31
AdaBoost Example

Training sets for the first 3 boosting rounds:


Boosting Round 1:
x 0.1 0.4 0.5 0.6 0.6 0.7 0.7 0.7 0.8 1
y 1 -1 -1 -1 -1 -1 -1 -1 1 1

Boosting Round 2:
x 0.1 0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3
y 1 1 1 1 1 1 1 1 1 1

Boosting Round 3:
x 0.2 0.2 0.4 0.4 0.4 0.4 0.5 0.6 0.6 0.7
y 1 1 -1 -1 -1 -1 -1 -1 -1 -1

Summary:
Round Split Point Left Class Right Class alpha
1 0.75 -1 1 1.738
2 0.05 1 1 2.7784
3 0.3 1 -1 4.1195
32
AdaBoost Example

Instance weights in each boosting round:
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
2 0.311 0.311 0.311 0.01 0.01 0.01 0.01 0.01 0.01 0.01
3 0.029 0.029 0.029 0.228 0.228 0.228 0.228 0.009 0.009 0.009

Base classifier predictions and weighted vote:
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 -1 -1 -1 -1 -1 -1 -1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 -1 -1 -1 -1 -1 -1 -1
Sum (α-weighted) 5.16 5.16 5.16 -3.08 -3.08 -3.08 -3.08 0.397 0.397 0.397
Predicted class (sign) 1 1 1 -1 -1 -1 -1 1 1 1

33
Illustrating AdaBoost

[Figure: data points for training with their initial weights (0.1 for each region); original data: +++ - - - - - ++. After boosting round 1 (B1), the weights become 0.0094, 0.0094, 0.4623 and the round-1 classifier predicts +++ - - - - - - -, with α = 1.9459.]

34
Illustrating AdaBoost
[Figure: boosting rounds 1 to 3. Round 1 (B1): weights 0.0094, 0.0094, 0.4623, predictions +++ - - - - - - -, α = 1.9459. Round 2 (B2): weights 0.3037, 0.0009, 0.0422, predictions - - - - - - - - ++, α = 2.9323. Round 3 (B3): weights 0.0276, 0.1819, 0.0038, predictions ++++++++++, α = 3.8744. Overall ensemble: +++ - - - - - ++.]
35
Random Forests
• Random forest is a modified form of bagging that creates ensembles of independent decision trees.
• Random forests essentially perform bagging over the predictor space as well, building a collection of less correlated trees.
• A random subset of predictors is chosen when defining each tree.
• This increases the stability of the algorithm and tackles the correlation problems that arise from a greedy search for the best split at each node.
• It reduces the variance of the overall estimator at the cost of an equal or higher bias.
36
Random Forest

• Each classifier in the ensemble is a decision tree classifier, generated using a random selection of attributes at each node to determine the split.
• During classification, each tree votes and the most popular class is returned.

37
Algorithm
– Choose T, the number of trees to grow
– Choose m < M (M is the total number of features), the number of features used to find the best split at each node
– For each tree:
◆ Choose a training set by sampling N times (N is the number of training examples) with replacement from the training set
◆ At each node, randomly choose m features and calculate the best split
◆ Trees are fully grown and not pruned
– Use majority voting among all the trees (a minimal sketch follows below)
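A minimal sketch of this procedure (assuming scikit-learn's DecisionTreeClassifier, whose max_features option performs the random per-node feature selection, and non-negative integer class labels for the vote counting):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def random_forest_fit(X, y, T=100, m=None):
        # grow T unpruned trees, each on a bootstrap sample, using m random features per split
        n, M = X.shape
        m = m or int(np.sqrt(M))                          # a common default choice for m
        trees = []
        for _ in range(T):
            idx = np.random.randint(0, n, n)              # bootstrap sample of size N
            trees.append(DecisionTreeClassifier(max_features=m).fit(X[idx], y[idx]))
        return trees

    def random_forest_predict(trees, X):
        # majority vote among all trees (labels assumed to be non-negative integers)
        votes = np.array([t.predict(X) for t in trees])   # shape (T, n_samples)
        return np.array([np.bincount(col).argmax() for col in votes.T.astype(int)])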

38
Tuning Random Forests

Multiple hyper-parameters to tune:

1. the number of predictors to randomly select at each split

2. the total number of trees in the ensemble

3. the minimum leaf node size

In theory, each tree in the random forest is fully grown, but in practice this can be computationally expensive
– Imposing a minimum node size is not unusual (see the sketch below)
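In scikit-learn, for example, these three hyper-parameters correspond to max_features, n_estimators, and min_samples_leaf (the values below are illustrative, not recommendations):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    rf = RandomForestClassifier(
        n_estimators=200,       # 2. total number of trees in the ensemble
        max_features="sqrt",    # 1. number of predictors randomly selected at each split
        min_samples_leaf=5,     # 3. minimum leaf node size (limits how far each tree is grown)
        random_state=0,
    )
    print(cross_val_score(rf, X, y, cv=5).mean())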

39
Methods to construct Random Forest

Two main methods to construct a random forest:
– Forest-RI (random input selection): randomly select, at each node, F attributes as candidates for the split at that node. The CART methodology is used to grow the trees to maximum size.
– Forest-RC (random linear combinations): creates new attributes (features) that are linear combinations of the existing attributes (reduces the correlation between individual classifiers).

40
Random Forest

Comparable in accuracy to AdaBoost, but more robust to errors and outliers
Faster than bagging or boosting, since it is insensitive to the number of attributes selected for consideration at each split

41
