
Ensemble Techniques

1
Ensemble Methods

Construct a set of base classifiers learned from the training data

Predict the class label of test records by combining the predictions made by multiple classifiers (e.g., by taking a majority vote)

2
Example: Why Do Ensemble Methods Work?

3
Necessary Conditions for Ensemble Methods

Ensemble Methods work better than a single base classifier if:

1. All base classifiers are independent of each other
2. All base classifiers perform better than random guessing
   (error rate < 0.5 for binary classification)

[Figure: classification error for an ensemble of 25 base classifiers, assuming their errors are uncorrelated.]
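This can be checked numerically. A minimal Python sketch (not from the slides) that computes the error of a majority vote over 25 independent base classifiers, each with error rate eps:

    from math import comb

    def ensemble_error(eps, n=25):
        # the ensemble errs when a majority (13 or more of 25) of the base classifiers err
        k_min = n // 2 + 1
        return sum(comb(n, k) * eps**k * (1 - eps)**(n - k) for k in range(k_min, n + 1))

    print(ensemble_error(0.35))   # ~0.06, much lower than the base error rate of 0.35
    print(ensemble_error(0.50))   # 0.5: no improvement when the base classifiers only guess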

4
Rationale for Ensemble Learning

Ensemble Methods work best with unstable base classifiers
– Classifiers that are sensitive to minor perturbations in the training set, due to high model complexity
– Examples: unpruned decision trees, ANNs, …

5
Bias-Variance Decomposition

Analogous problem: reaching a target y by firing projectiles from x (regression problem)

For classification, the generalization error of model m can be given by:

\mathrm{gen.error}(m) = c_1 + \mathrm{bias}(m) + c_2 \times \mathrm{variance}(m)


6
In general,
- Bias contributes to the training error; a complex model has low bias.
- Variance contributes to the future (generalization) error; a complex model has high variance.
- Bagging reduces the variance of the base classifiers.
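To make the last point concrete, here is a small numpy sketch (illustrative only; it uses a deliberately unstable 1-nearest-neighbour predictor as the base model) showing that averaging models trained on bootstrap samples lowers the variance of the prediction:

    import numpy as np

    rng = np.random.default_rng(0)

    def noisy_sample(n=30):
        # toy regression data: y = sin(x) + noise
        x = rng.uniform(0, np.pi, n)
        y = np.sin(x) + rng.normal(0, 0.3, n)
        return x, y

    def single_model_prediction(x, y, x0=1.5):
        # a high-variance "model": 1-nearest-neighbour prediction at x0
        return y[np.argmin(np.abs(x - x0))]

    def bagged_prediction(x, y, x0=1.5, B=25):
        # average the 1-NN predictions of B models fit on bootstrap samples
        preds = []
        for _ in range(B):
            idx = rng.integers(0, len(x), len(x))   # sampling with replacement
            preds.append(single_model_prediction(x[idx], y[idx], x0))
        return np.mean(preds)

    single = [single_model_prediction(*noisy_sample()) for _ in range(500)]
    bagged = [bagged_prediction(*noisy_sample()) for _ in range(500)]
    print("variance of single model:", np.var(single))
    print("variance of bagged model:", np.var(bagged))   # noticeably smaller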

7
Bias-Variance Trade-off and Overfitting

[Figure: bias-variance trade-off curve, showing the underfitting and overfitting regions.]

Ensemble methods try to reduce the variance of complex models (with low bias) by aggregating the responses of multiple base classifiers
8
General Approach of Ensemble Learning

Using a majority vote or a weighted majority vote (weighted according to each classifier's accuracy or relevance)

11
Constructing Ensemble Classifiers

By manipulating the training set
– Example: bagging, boosting, random forests

By manipulating the input features
– Example: random forests

By manipulating the class labels
– Example: error-correcting output coding

By manipulating the learning algorithm
– Example: injecting randomness into the initial weights of an ANN

12
Bagging (Bootstrap AGGregating)

Bootstrap sampling: sampling with replacement

Original Data 1 2 3 4 5 6 7 8 9 10
Bagging (Round 1) 7 8 10 8 2 5 10 10 5 9
Bagging (Round 2) 1 4 9 1 2 3 2 7 3 2
Bagging (Round 3) 1 8 5 10 5 5 9 6 3 7

Build classifier on each bootstrap sample

Probability of a training instance being selected in a bootstrap sample is:
➢ 1 – (1 – 1/n)^n   (n: number of training instances)
➢ ≈ 0.632 when n is large
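A quick numerical check of this limit (a small sketch, not part of the slides):

    # probability that a given training instance appears at least once in a bootstrap sample
    for n in (10, 100, 1000, 100000):
        print(n, round(1 - (1 - 1/n) ** n, 4))   # approaches 1 - 1/e ≈ 0.632 as n grows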
13
Bagging Algorithm
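The bagging pseudocode appears as a figure on the original slide. A minimal Python sketch of the same procedure, assuming a scikit-learn-style base learner (here a decision tree) and class labels in {-1, +1} so that the majority vote is just the sign of the summed predictions:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def bagging_fit(X, y, T=25):
        # train T base classifiers, each on a bootstrap sample of (X, y)
        n = len(X)
        models = []
        for _ in range(T):
            idx = np.random.randint(0, n, n)   # sample n indices with replacement
            models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return models

    def bagging_predict(models, X):
        # majority vote: sign of the sum of the +1/-1 base predictions
        return np.sign(sum(m.predict(X) for m in models))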

14
Bagging Example

Consider a 1-dimensional data set:

Original Data:
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1

Classifier is a decision stump (decision tree of size 1)
– Decision rule: x ≤ k versus x > k
– Split point k is chosen based on entropy

[Figure: decision stump that tests x ≤ k; the True branch predicts yleft, the False branch predicts yright.]
15
Bagging Example

Bagging Round 1:
x 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9
y 1 1 1 1 -1 -1 -1 -1 1 1
Learned stump: x ≤ 0.35 → y = 1; x > 0.35 → y = -1

Bagging Round 2:
x 0.1 0.2 0.3 0.4 0.5 0.5 0.9 1 1 1
y 1 1 1 -1 -1 -1 1 1 1 1

Bagging Round 3:
x 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1 -1 1 1

Bagging Round 4:
x 0.1 0.1 0.2 0.4 0.4 0.5 0.5 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1 -1 1 1

Bagging Round 5:
x 0.1 0.1 0.2 0.5 0.6 0.6 0.6 1 1 1
y 1 1 1 -1 -1 -1 -1 1 1 1

16
Bagging Example

Bagging Round 1:
x 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9
y 1 1 1 1 -1 -1 -1 -1 1 1
Learned stump: x ≤ 0.35 → y = 1; x > 0.35 → y = -1

Bagging Round 2:
x 0.1 0.2 0.3 0.4 0.5 0.5 0.9 1 1 1
y 1 1 1 -1 -1 -1 1 1 1 1
Learned stump: x ≤ 0.7 → y = 1; x > 0.7 → y = 1

Bagging Round 3:
x 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1 -1 1 1
Learned stump: x ≤ 0.35 → y = 1; x > 0.35 → y = -1

Bagging Round 4:
x 0.1 0.1 0.2 0.4 0.4 0.5 0.5 0.7 0.8 0.9
y 1 1 1 -1 -1 -1 -1 -1 1 1
Learned stump: x ≤ 0.3 → y = 1; x > 0.3 → y = -1

Bagging Round 5:
x 0.1 0.1 0.2 0.5 0.6 0.6 0.6 1 1 1
y 1 1 1 -1 -1 -1 -1 1 1 1
Learned stump: x ≤ 0.35 → y = 1; x > 0.35 → y = -1

17
Bagging Example

Bagging Round 6:
x 0.2 0.4 0.5 0.6 0.7 0.7 0.7 0.8 0.9 1
y 1 -1 -1 -1 -1 -1 -1 1 1 1
Learned stump: x ≤ 0.75 → y = -1; x > 0.75 → y = 1

Bagging Round 7:
x 0.1 0.4 0.4 0.6 0.7 0.8 0.9 0.9 0.9 1
y 1 -1 -1 -1 -1 1 1 1 1 1
Learned stump: x ≤ 0.75 → y = -1; x > 0.75 → y = 1

Bagging Round 8:
x 0.1 0.2 0.5 0.5 0.5 0.7 0.7 0.8 0.9 1
y 1 1 -1 -1 -1 -1 -1 1 1 1
Learned stump: x ≤ 0.75 → y = -1; x > 0.75 → y = 1

Bagging Round 9:
x 0.1 0.3 0.4 0.4 0.6 0.7 0.7 0.8 1 1
y 1 1 -1 -1 -1 -1 -1 1 1 1
Learned stump: x ≤ 0.75 → y = -1; x > 0.75 → y = 1

Bagging Round 10:
x 0.1 0.1 0.1 0.1 0.3 0.3 0.8 0.8 0.9 0.9
y 1 1 1 1 1 1 1 1 1 1
Learned stump: x ≤ 0.05 → y = 1; x > 0.05 → y = 1

18
Bagging Example

Summary of Trained Decision Stumps:

Round Split Point Left Class Right Class
1 0.35 1 -1
2 0.7 1 1
3 0.35 1 -1
4 0.3 1 -1
5 0.35 1 -1
6 0.75 -1 1
7 0.75 -1 1
8 0.75 -1 1
9 0.75 -1 1
10 0.05 1 1

19
Bagging Example
Use majority vote (sign of the sum of predictions) to determine the class of the ensemble classifier
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 1 1 1 -1 -1 -1 -1 -1 -1 -1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 -1 -1 -1 -1 -1 -1 -1
4 1 1 1 -1 -1 -1 -1 -1 -1 -1
5 1 1 1 -1 -1 -1 -1 -1 -1 -1
6 -1 -1 -1 -1 -1 -1 -1 1 1 1
7 -1 -1 -1 -1 -1 -1 -1 1 1 1
8 -1 -1 -1 -1 -1 -1 -1 1 1 1
9 -1 -1 -1 -1 -1 -1 -1 1 1 1
10 1 1 1 1 1 1 1 1 1 1
Sum 2 2 2 -6 -6 -6 -6 2 2 2
Predicted class (sign) 1 1 1 -1 -1 -1 -1 1 1 1

Bagging can also increase the complexity (representation capacity) of simple classifiers such as decision stumps
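The Sum and Sign rows above can be reproduced directly from the ten trained stumps (a short sketch using the split points and class labels from the summary table):

    # (split point, left class, right class) for the 10 trained stumps
    stumps = [(0.35, 1, -1), (0.7, 1, 1), (0.35, 1, -1), (0.3, 1, -1), (0.35, 1, -1),
              (0.75, -1, 1), (0.75, -1, 1), (0.75, -1, 1), (0.75, -1, 1), (0.05, 1, 1)]

    def stump_predict(stump, x):
        split, left, right = stump
        return left if x <= split else right

    for x in [round(0.1 * i, 1) for i in range(1, 11)]:
        votes = sum(stump_predict(s, x) for s in stumps)   # sum of the +1/-1 predictions
        print(x, votes, 1 if votes > 0 else -1)            # matches the Sum / Sign rows above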
20
Boosting

An iterative procedure to adaptively change the distribution of the training data by focusing more on previously misclassified records
– Initially, all N records are assigned equal weights
– Unlike bagging, weights may change at the end of each boosting round

21
Boosting

Records that are wrongly classified will have their weights increased
Records that are classified correctly will have their weights decreased
Original Data 1 2 3 4 5 6 7 8 9 10
Boosting (Round 1) 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2) 5 4 9 4 2 5 1 7 4 2
Boosting (Round 3) 4 4 8 10 4 5 4 6 3 4

• Example 4 is hard to classify
• Its weight is increased, so it is more likely to be chosen again in subsequent rounds
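A minimal sketch of this weighted resampling step (the specific weights below are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    indices = np.arange(1, 11)          # the 10 training records
    weights = np.full(10, 0.1)          # round 1: equal weights
    weights[3] = 0.4                    # suppose record 4 was misclassified and up-weighted
    weights /= weights.sum()            # sampling probabilities must sum to 1

    sample = rng.choice(indices, size=10, replace=True, p=weights)
    print(sample)                       # record 4 tends to appear more often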

22
Boosting

In the first round, equal weights (1/N) are assigned to each training instance
After a classifier Ci is learned, the weights are adjusted so that the subsequent classifier Ci+1 "pays more attention" to data that were misclassified by Ci
The final boosted classifier C* combines the votes of the individual classifiers
– The weight of each classifier's vote is a function of its accuracy
AdaBoost is a popular boosting algorithm

23
AdaBoost (Adaptive Boosting)

Input:
– Training set D containing N instances
– T rounds
– A classification learning scheme
Output:
– A composite model

24
Adaboost: Training Phase

Training data D contains N labeled instances (X1, y1), (X2, y2), (X3, y3), …, (XN, yN)
Initially, each instance is assigned an equal weight of 1/N
To generate T base classifiers, we need T rounds (iterations)
In round i, data from D are sampled with replacement to form Di (of size N)
Each instance's chance of being selected in the next rounds depends on its weight
– Each new sample is generated directly from the training data D with sampling probabilities given by the current weights; these weights are never zero

25
Adaboost: Training Phase

Base classifier Ci is derived from training sample Di
The error of Ci is then estimated
The weights of the training instances are adjusted depending on how they were classified
– Correctly classified: decrease weight
– Incorrectly classified: increase weight
An instance's weight indicates how hard it is to classify (the harder, the higher the weight)

26
Adaboost: Testing Phase
The lower a classifier's error rate, the more accurate it is, and therefore the higher its voting weight should be
The weight of classifier Ci's vote is

\alpha_i = \frac{1}{2} \ln\left(\frac{1 - \epsilon_i}{\epsilon_i}\right)

Testing:
– For each class c, sum the weights of the classifiers that assigned class c to X (the unseen instance)
– The class with the highest sum is the WINNER!

C^*(x_{test}) = \arg\max_y \sum_{i=1}^{T} \alpha_i \, \delta\big(C_i(x_{test}) = y\big)

27
Example: Error and Classifier Weight in
AdaBoost
Base classifiers: C1, C2, …, CT

Error rate (i = index of classifier, j = index of instance):

\epsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j \, \delta\big(C_i(x_j) \neq y_j\big)

Importance of a classifier:

\alpha_i = \frac{1}{2} \ln\left(\frac{1 - \epsilon_i}{\epsilon_i}\right)
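Evaluating these two formulas for a hypothetical base classifier (uniform weights of 0.1, with three of the ten records misclassified; the values are chosen to mirror the worked example that follows):

    import math

    w    = [0.1] * 10                                    # uniform instance weights
    y    = [ 1,  1,  1, -1, -1, -1, -1,  1,  1,  1]      # true labels
    pred = [-1, -1, -1, -1, -1, -1, -1,  1,  1,  1]      # classifier misclassifies the first three

    N = len(y)
    eps = sum(wj for wj, yj, pj in zip(w, y, pred) if pj != yj) / N
    alpha = 0.5 * math.log((1 - eps) / eps)
    print(eps, alpha)   # 0.03 and ~1.738, the same alpha as round 1 of the example below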
28
Example: Data Instance Weight
in AdaBoost
Assume: N training instances in D, T rounds; (xj, yj) are the training data, and Ci, αi are the classifier and classifier weight of the i-th round, respectively.
Weight update on all training data in D:

w_j^{(i+1)} = \frac{w_j^{(i)}}{Z_i} \times \begin{cases} \exp(-\alpha_i) & \text{if } C_i(x_j) = y_j \\ \exp(\alpha_i) & \text{if } C_i(x_j) \neq y_j \end{cases}

where Z_i is the normalization factor

C^*(x_{test}) = \arg\max_y \sum_{i=1}^{T} \alpha_i \, \delta\big(C_i(x_{test}) = y\big)

29
AdaBoost Algorithm
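The full algorithm is shown as a figure on the original slide. Below is a minimal Python sketch of an AdaBoost training loop consistent with the description above, with two hedges: it fits each stump using instance weights instead of resampling (a common equivalent variant), and it uses the standard weighted error eps = Σ_j w_j δ(·) with weights summing to one rather than the 1/N form of the earlier slide. Labels are assumed to be in {-1, +1}.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, T=10):
        # return a list of (alpha, classifier) pairs; y must be in {-1, +1}
        N = len(y)
        w = np.full(N, 1.0 / N)                 # equal initial weights
        ensemble = []
        for _ in range(T):
            stump = DecisionTreeClassifier(max_depth=1)   # decision stump as base classifier
            stump.fit(X, y, sample_weight=w)
            pred = stump.predict(X)

            miss = (pred != y)
            eps = np.sum(w * miss)              # weighted error rate
            if eps >= 0.5:                      # worse than random guessing: reset and skip
                w = np.full(N, 1.0 / N)
                continue
            alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-10))

            # increase weights of misclassified records, decrease the others, then normalize
            w = w * np.exp(np.where(miss, alpha, -alpha))
            w /= w.sum()
            ensemble.append((alpha, stump))
        return ensemble

    def adaboost_predict(ensemble, X):
        # weighted vote: sign of the alpha-weighted sum of base predictions
        scores = sum(alpha * clf.predict(X) for alpha, clf in ensemble)
        return np.sign(scores)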

30
AdaBoost Example

Consider a 1-dimensional data set:

Original Data:
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
y 1 1 1 -1 -1 -1 -1 1 1 1

Classifier is a decision stump
– Decision rule: x ≤ k versus x > k
– Split point k is chosen based on entropy

[Figure: decision stump that tests x ≤ k; the True branch predicts yleft, the False branch predicts yright.]
31
AdaBoost Example

Training sets for the first 3 boosting rounds:


Boosting Round 1:
x 0.1 0.4 0.5 0.6 0.6 0.7 0.7 0.7 0.8 1
y 1 -1 -1 -1 -1 -1 -1 -1 1 1

Boosting Round 2:
x 0.1 0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3
y 1 1 1 1 1 1 1 1 1 1

Boosting Round 3:
x 0.2 0.2 0.4 0.4 0.4 0.4 0.5 0.6 0.6 0.7
y 1 1 -1 -1 -1 -1 -1 -1 -1 -1

Summary:
Round Split Point Left Class Right Class alpha
1 0.75 -1 1 1.738
2 0.05 1 1 2.7784
3 0.3 1 -1 4.1195
32
AdaBoost Example

Instance weights in each boosting round:
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
2 0.311 0.311 0.311 0.01 0.01 0.01 0.01 0.01 0.01 0.01
3 0.029 0.029 0.029 0.228 0.228 0.228 0.228 0.009 0.009 0.009

Base classifier predictions and weighted vote:
Round x=0.1 x=0.2 x=0.3 x=0.4 x=0.5 x=0.6 x=0.7 x=0.8 x=0.9 x=1.0
1 -1 -1 -1 -1 -1 -1 -1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 -1 -1 -1 -1 -1 -1 -1
Sum (α-weighted) 5.16 5.16 5.16 -3.08 -3.08 -3.08 -3.08 0.397 0.397 0.397
Predicted class (sign) 1 1 1 -1 -1 -1 -1 1 1 1

33
Illustrating AdaBoost

[Figure: data points for training with their initial weights (0.1 for each region); original data: +++ - - - - - ++. After boosting round 1 (B1), the weights become 0.0094, 0.0094, 0.4623 and the round-1 classifier predicts +++ - - - - - - -, with α = 1.9459.]

34
Illustrating AdaBoost
[Figure: boosting rounds 1 to 3. Round 1 (B1): weights 0.0094, 0.0094, 0.4623, predictions +++ - - - - - - -, α = 1.9459. Round 2 (B2): weights 0.3037, 0.0009, 0.0422, predictions - - - - - - - - ++, α = 2.9323. Round 3 (B3): weights 0.0276, 0.1819, 0.0038, predictions ++++++++++, α = 3.8744. Overall ensemble: +++ - - - - - ++.]
35
Random Forests
• Random forest is a modified form of bagging that creates ensembles of independent decision trees.
• Random forests essentially perform bagging over the predictor space as well, building a collection of less correlated trees.
• A random subset of predictors is chosen when defining each tree.
• This increases the stability of the algorithm and tackles the correlation problems that arise from a greedy search for the best split at each node.
• It reduces the variance of the overall estimator at the cost of an equal or higher bias.
36
Random Forest

• Each classifier in the ensemble is a decision tree classifier, generated using a random selection of attributes at each node to determine the split.
• During classification, each tree votes and the most popular class is returned.

37
Algorithm
– Choose T, the number of trees to grow
– Choose m < M (M is the total number of features), the number of features used to find the best split at each node
– For each tree:
◆ Choose a training set by sampling N times (N is the number of training examples) with replacement from the training set
◆ At each node, randomly choose m features and calculate the best split
◆ Trees are fully grown and not pruned
– Use majority voting among all the trees (a minimal sketch follows below)
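A minimal sketch of this procedure (assuming scikit-learn's DecisionTreeClassifier, whose max_features option performs the random per-node feature selection, and non-negative integer class labels for the vote counting):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def random_forest_fit(X, y, T=100, m=None):
        # grow T unpruned trees, each on a bootstrap sample, using m random features per split
        n, M = X.shape
        m = m or int(np.sqrt(M))                          # a common default choice for m
        trees = []
        for _ in range(T):
            idx = np.random.randint(0, n, n)              # bootstrap sample of size N
            trees.append(DecisionTreeClassifier(max_features=m).fit(X[idx], y[idx]))
        return trees

    def random_forest_predict(trees, X):
        # majority vote among all trees (labels assumed to be non-negative integers)
        votes = np.array([t.predict(X) for t in trees])   # shape (T, n_samples)
        return np.array([np.bincount(col).argmax() for col in votes.T.astype(int)])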

38
Tuning Random Forests

Multiple hyper-parameters to tune:

1. the number of predictors to randomly select at each split

2. the total number of trees in the ensemble

3. the minimum leaf node size

In theory, each tree in the random forest is fully grown, but in practice this can be computationally expensive
– Imposing a minimum node size is not unusual (see the sketch below)
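In scikit-learn, for example, these three hyper-parameters correspond to max_features, n_estimators, and min_samples_leaf (the values below are illustrative, not recommendations):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    rf = RandomForestClassifier(
        n_estimators=200,       # 2. total number of trees in the ensemble
        max_features="sqrt",    # 1. number of predictors randomly selected at each split
        min_samples_leaf=5,     # 3. minimum leaf node size (limits how far each tree is grown)
        random_state=0,
    )
    print(cross_val_score(rf, X, y, cv=5).mean())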

39
Methods to construct Random Forest

Two main methods to construct a random forest:
– Forest-RI (random input selection): randomly select, at each node, F attributes as candidates for the split at that node. The CART methodology is used to grow the trees to maximum size.
– Forest-RC (random linear combinations): creates new attributes (features) that are linear combinations of the existing attributes (reduces the correlation between individual classifiers).

40
Random Forest

Comparable in accuracy to AdaBoost, but more robust to errors and outliers
Faster than bagging or boosting, since it is insensitive to the number of attributes selected for consideration at each split

41
