Ensemble Methods
Example: Why Do Ensemble Methods Work?
Necessary Conditions for Ensemble Methods
Rationale for Ensemble Learning
Bias-Variance Decomposition
Bias-Variance Trade-off and Overfitting
– Overfitting
– Underfitting
Constructing Ensemble Classifiers
Bagging (Bootstrap AGGregating)

Original Data:      1  2  3  4  5  6  7  8  9 10
Bagging (Round 1):  7  8 10  8  2  5 10 10  5  9
Bagging (Round 2):  1  4  9  1  2  3  2  7  3  2
Bagging (Round 3):  1  8  5 10  5  5  9  6  3  7
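The bootstrap rounds above can be generated in a few lines. A minimal sketch (variable names and the seed are illustrative; it will not reproduce the exact samples on the slide):

```python
import random

random.seed(0)  # arbitrary seed, for a repeatable illustration

original = list(range(1, 11))  # instances 1..10
N = len(original)

# Each bagging round draws N instances uniformly WITH replacement,
# so some instances appear several times and others not at all.
for r in range(1, 4):
    sample = [random.choice(original) for _ in range(N)]
    print(f"Bagging (Round {r}):", sample)
```

On average, a bootstrap sample of size N contains about 1 − (1 − 1/N)^N ≈ 63.2% of the distinct original instances; the rest are "out-of-bag".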
Bagging Example

[Figure: a decision stump tests attribute x_k; the True branch predicts y_left, the False branch predicts y_right]
Bagging Example

Bagging Round 1:
x: 0.1 0.2 0.2 0.3 0.4 0.4 0.5 0.6 0.9 0.9
y:   1   1   1   1  -1  -1  -1  -1   1   1
Stump: x <= 0.35: y = 1;  x > 0.35: y = -1

Bagging Round 2:
x: 0.1 0.2 0.3 0.4 0.5 0.5 0.9 1 1 1
y:   1   1   1  -1  -1  -1   1  1 1 1
Stump: x <= 0.7: y = 1;  x > 0.7: y = 1

Bagging Round 3:
x: 0.1 0.2 0.3 0.4 0.4 0.5 0.7 0.7 0.8 0.9
y:   1   1   1  -1  -1  -1  -1  -1   1   1
Stump: x <= 0.35: y = 1;  x > 0.35: y = -1

Bagging Round 4:
x: 0.1 0.1 0.2 0.4 0.4 0.5 0.5 0.7 0.8 0.9
y:   1   1   1  -1  -1  -1  -1  -1   1   1
Stump: x <= 0.3: y = 1;  x > 0.3: y = -1

Bagging Round 5:
x: 0.1 0.1 0.2 0.5 0.6 0.6 0.6 1 1 1
y:   1   1   1  -1  -1  -1  -1  1 1 1
Stump: x <= 0.35: y = 1;  x > 0.35: y = -1
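Each round's stump can be recovered mechanically: scan candidate thresholds (midpoints between distinct x values) and leaf labels for the lowest training error. A small sketch (`fit_stump` is an illustrative name; ties keep the first candidate found):

```python
def fit_stump(x, y):
    """Return (threshold, left_label, right_label) minimizing training error."""
    xs = sorted(set(x))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]  # midpoints
    best = None
    for t in candidates:
        for left, right in [(1, -1), (-1, 1), (1, 1), (-1, -1)]:
            errors = sum((left if xi <= t else right) != yi
                         for xi, yi in zip(x, y))
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    return best[1], best[2], best[3]

# Round 1 bootstrap sample from the slide:
x1 = [0.1, 0.2, 0.2, 0.3, 0.4, 0.4, 0.5, 0.6, 0.9, 0.9]
y1 = [1, 1, 1, 1, -1, -1, -1, -1, 1, 1]
t, left, right = fit_stump(x1, y1)  # recovers x <= 0.35: y = 1, x > 0.35: y = -1
```

Note that the best stump on Round 1 still makes two training errors (the points at x = 0.9): a single split cannot capture the +/−/+ pattern, which is exactly why an ensemble helps.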
Bagging Example

Bagging Round 6:
x: 0.2 0.4 0.5 0.6 0.7 0.7 0.7 0.8 0.9 1
y:   1  -1  -1  -1  -1  -1  -1   1   1  1
Stump: x <= 0.75: y = -1;  x > 0.75: y = 1

Bagging Round 7:
x: 0.1 0.4 0.4 0.6 0.7 0.8 0.9 0.9 0.9 1
y:   1  -1  -1  -1  -1   1   1   1   1  1
Stump: x <= 0.75: y = -1;  x > 0.75: y = 1

Bagging Round 8:
x: 0.1 0.2 0.5 0.5 0.5 0.7 0.7 0.8 0.9 1
y:   1   1  -1  -1  -1  -1  -1   1   1  1
Stump: x <= 0.75: y = -1;  x > 0.75: y = 1

Bagging Round 9:
x: 0.1 0.3 0.4 0.4 0.6 0.7 0.7 0.8 1 1
y:   1   1  -1  -1  -1  -1  -1   1  1 1
Stump: x <= 0.75: y = -1;  x > 0.75: y = 1
Bagging Example

Use majority vote (the sign of the sum of predictions) to determine the class assigned by the ensemble classifier.

Round   x=0.1  x=0.2  x=0.3  x=0.4  x=0.5  x=0.6  x=0.7  x=0.8  x=0.9  x=1.0
1         1      1      1     -1     -1     -1     -1     -1     -1     -1
2         1      1      1      1      1      1      1      1      1      1
3         1      1      1     -1     -1     -1     -1     -1     -1     -1
4         1      1      1     -1     -1     -1     -1     -1     -1     -1
5         1      1      1     -1     -1     -1     -1     -1     -1     -1
6        -1     -1     -1     -1     -1     -1     -1      1      1      1
7        -1     -1     -1     -1     -1     -1     -1      1      1      1
8        -1     -1     -1     -1     -1     -1     -1      1      1      1
9        -1     -1     -1     -1     -1     -1     -1      1      1      1
10        1      1      1      1      1      1      1      1      1      1
Sum       2      2      2     -6     -6     -6     -6      2      2      2

Predicted class (sign of Sum):
          1      1      1     -1     -1     -1     -1      1      1      1
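The vote table can be reproduced from the ten stumps. A short sketch (round 10's stump is not shown on the slides, so it is modeled here as a constant +1 classifier, which is what its row in the table implies):

```python
# Stumps from rounds 1-9 as (threshold, left_label, right_label); round 10's
# stump is not given, but its vote row predicts +1 everywhere, so it is
# represented as a constant +1 classifier.
stumps = [
    (0.35,  1, -1),   # round 1
    (0.70,  1,  1),   # round 2 (predicts +1 on both sides)
    (0.35,  1, -1),   # round 3
    (0.30,  1, -1),   # round 4
    (0.35,  1, -1),   # round 5
    (0.75, -1,  1),   # round 6
    (0.75, -1,  1),   # round 7
    (0.75, -1,  1),   # round 8
    (0.75, -1,  1),   # round 9
    (0.00,  1,  1),   # round 10 (constant +1, inferred from its vote row)
]

def ensemble_predict(x):
    """Majority vote: the sign of the sum of the stump predictions."""
    votes = sum(l if x <= t else r for t, l, r in stumps)
    return 1 if votes >= 0 else -1

row = [ensemble_predict(k / 10) for k in range(1, 11)]
print(row)   # → [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]
```

The ensemble recovers the +/−/+ structure that no individual stump can represent.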
Boosting
AdaBoost (Adaptive Boosting)
Input:
– Training set D containing N instances
– T rounds
– A classification learning scheme
Output:
– A composite model
AdaBoost: Training Phase
AdaBoost: Testing Phase

The lower a classifier's error rate, the more accurate it is, and therefore the higher its weight in the vote should be.

Weight of classifier Ci's vote:

    α_i = (1/2) ln((1 − ε_i) / ε_i)

Testing:
– For each class c, sum the weights of every classifier that assigned class c to X (unseen data)
– The class with the highest sum is the WINNER!

    C*(x_test) = arg max_y Σ_{i=1..T} α_i δ(C_i(x_test) = y)

where δ(·) equals 1 if its argument is true and 0 otherwise.
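In code, the vote weight and the weighted vote might look like this (a sketch; `classifier_weight` and `adaboost_predict` are illustrative names, and classifiers are represented as plain callables):

```python
import math

def classifier_weight(eps):
    """alpha = 0.5 * ln((1 - eps) / eps): lower error -> larger vote weight."""
    return 0.5 * math.log((1 - eps) / eps)

def adaboost_predict(x, classifiers, alphas, classes=(-1, 1)):
    """Return the class with the largest alpha-weighted sum of votes."""
    return max(classes,
               key=lambda c: sum(a for clf, a in zip(classifiers, alphas)
                                 if clf(x) == c))

# A classifier at chance level (eps = 0.5) gets zero weight:
print(classifier_weight(0.5))   # → 0.0
```

As ε_i approaches 0 the weight grows without bound, and for ε_i > 0.5 it becomes negative, which is why boosting stops (or resets) when a round's error reaches 0.5.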
Example: Error and Classifier Weight in AdaBoost

Base classifiers: C1, C2, …, CT

Error rate of classifier Ci (weighted over the N training instances):

    ε_i = (1/N) Σ_{j=1..N} w_j δ(C_i(x_j) ≠ y_j)

Importance of a classifier:

    α_i = (1/2) ln((1 − ε_i) / ε_i)
Example: Data Instance Weight in AdaBoost

Assume N training data in D, T rounds; (x_j, y_j) are the training data, and C_i, α_i are the classifier and weight of the i-th round, respectively.

Weight update on all training data in D:

    w_j^{(i+1)} = (w_j^{(i)} / Z_i) · exp(−α_i)   if C_i(x_j) = y_j
    w_j^{(i+1)} = (w_j^{(i)} / Z_i) · exp(α_i)    if C_i(x_j) ≠ y_j

where Z_i is the normalization factor (chosen so that the updated weights sum to 1).

Final classifier:

    C*(x_test) = arg max_y Σ_{i=1..T} α_i δ(C_i(x_test) = y)
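Putting the error, importance, and update formulas together, the training loop can be sketched as follows (a sketch, not a full implementation: `fit_weak(X, y, w)` is an assumed helper returning a weak classifier as a callable, and the ε = 0 and ε ≥ 0.5 cases are handled by simply stopping):

```python
import math

def adaboost_train(X, y, fit_weak, T):
    """AdaBoost training sketch. fit_weak(X, y, w) is assumed to return a
    weak classifier as a callable (e.g. a decision stump)."""
    N = len(X)
    w = [1.0 / N] * N                      # uniform initial instance weights
    classifiers, alphas = [], []
    for _ in range(T):
        C = fit_weak(X, y, w)
        # Weighted error of this round's classifier:
        eps = sum(wj for wj, xj, yj in zip(w, X, y) if C(xj) != yj)
        if eps == 0 or eps >= 0.5:         # perfect, or no better than chance
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        # Shrink weights of correctly classified instances, grow the rest,
        # then divide by Z_i so the new weights sum to 1:
        w = [wj * math.exp(-alpha if C(xj) == yj else alpha)
             for wj, xj, yj in zip(w, X, y)]
        Z = sum(w)
        w = [wj / Z for wj in w]
        classifiers.append(C)
        alphas.append(alpha)
    return classifiers, alphas
```

Each round therefore concentrates the weight on the instances the previous classifier got wrong, forcing the next weak learner to focus on them.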
AdaBoost Algorithm
AdaBoost Example

[Figure: a decision stump, as in the bagging example; the True branch predicts y_left, the False branch predicts y_right]
AdaBoost Example

Boosting Round 2:
x: 0.1 0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.3 0.3
y:   1   1   1   1   1   1   1   1   1   1

Boosting Round 3:
x: 0.2 0.2 0.4 0.4 0.4 0.4 0.5 0.6 0.6 0.7
y:   1   1  -1  -1  -1  -1  -1  -1  -1  -1

Summary:
Round  Split Point  Left Class  Right Class  alpha
1      0.75         -1           1           1.738
2      0.05          1           1           2.7784
3      0.3           1          -1           4.1195
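The weighted sums in the following prediction table follow directly from the summary table: each test point receives the α-weighted vote of the three stumps. A quick check:

```python
# (split point, left class, right class, alpha) from the summary table
rounds = [
    (0.75, -1,  1, 1.738),
    (0.05,  1,  1, 2.7784),
    (0.30,  1, -1, 4.1195),
]

def weighted_sum(x):
    """Alpha-weighted sum of the three stump predictions at x."""
    return sum(a * (l if x <= t else r) for t, l, r, a in rounds)

print(round(weighted_sum(0.1), 2))   # → 5.16   (-1.738 + 2.7784 + 4.1195)
print(round(weighted_sum(0.4), 2))   # → -3.08  (-1.738 + 2.7784 - 4.1195)
print(round(weighted_sum(0.8), 3))   # → 0.397  ( 1.738 + 2.7784 - 4.1195)
```

Note the final class at x = 0.8 is decided by a margin of only 0.397: the two later, more accurate stumps narrowly outvote the first.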
AdaBoost Example

Instance weights at the start of each round:

Round  x=0.1  x=0.2  x=0.3  x=0.4  x=0.5  x=0.6  x=0.7  x=0.8  x=0.9  x=1.0
1      0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1
2      0.311  0.311  0.311  0.01   0.01   0.01   0.01   0.01   0.01   0.01
3      0.029  0.029  0.029  0.228  0.228  0.228  0.228  0.009  0.009  0.009

Predictions of each round's stump and the α-weighted ensemble:

Round  x=0.1  x=0.2  x=0.3  x=0.4  x=0.5  x=0.6  x=0.7  x=0.8  x=0.9  x=1.0
1      -1     -1     -1     -1     -1     -1     -1      1      1      1
2       1      1      1      1      1      1      1      1      1      1
3       1      1      1     -1     -1     -1     -1     -1     -1     -1
Sum     5.16   5.16   5.16  -3.08  -3.08  -3.08  -3.08   0.397  0.397  0.397

Predicted class (sign of Sum):
        1      1      1     -1     -1     -1     -1      1      1      1
Illustrating AdaBoost

[Figure: the data points with their instance weights over three boosting rounds. B1 (Round 1): shown weights 0.0094, 0.0094, 0.4623; predictions +++ − − − − − − −; α = 1.9459. B2 (Round 2): shown weights 0.3037, 0.0009, 0.0422; predictions − − − − − − − − ++; α = 2.9323. B3 (Round 3): shown weights 0.0276, 0.1819, 0.0038; predictions all +; α = 3.8744. Overall ensemble: +++ − − − − − ++]
Random Forests
• Random forest is a modified form of bagging that creates ensembles of independent decision trees.
• Random forests essentially perform bagging over the predictor space as well, building a collection of less correlated trees.
• A random subset of the predictors is chosen when defining each tree (and each split within it).
• This increases the stability of the algorithm and tackles the correlation problem that arises from the greedy search for the best split at each node.
• It reduces the variance of the overall estimator, at the cost of an equal or slightly higher bias.
Random Forest
Algorithm
– Choose T, the number of trees to grow
– Choose m < M (M is the total number of features), the number of features used to find the best split at each node
– For each tree:
  ◆ Choose a training set by sampling N times (N is the number of training examples) with replacement from the original training set
  ◆ At each node, randomly choose m features and find the best split among them
  ◆ Grow the tree fully; do not prune
– Predict by majority vote over all the trees
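The steps above can be sketched with depth-1 trees standing in for fully grown ones (a simplification: a real random forest grows deep, unpruned trees and redraws the m features at every node; all names here are illustrative):

```python
import random

def bootstrap(X, y):
    """Draw N training examples with replacement."""
    idx = [random.randrange(len(X)) for _ in X]
    return [X[i] for i in idx], [y[i] for i in idx]

def best_split(X, y, feats):
    """Best (feature, threshold, left, right) among the given features,
    judged by training error (a stand-in for an impurity criterion)."""
    best = None
    for f in feats:
        vals = sorted({row[f] for row in X})
        for t in [(a + b) / 2 for a, b in zip(vals, vals[1:])]:
            for l, r in [(1, -1), (-1, 1)]:
                err = sum((l if row[f] <= t else r) != yi
                          for row, yi in zip(X, y))
                if best is None or err < best[0]:
                    best = (err, f, t, l, r)
    if best is None:  # all chosen features constant in this sample
        maj = 1 if sum(y) >= 0 else -1
        return feats[0], float("inf"), maj, maj
    return best[1:]

def random_forest(X, y, T, m):
    """Grow T depth-1 'trees': each on a bootstrap sample, each choosing
    its split from a random subset of m of the M features."""
    M = len(X[0])
    return [best_split(*bootstrap(X, y), random.sample(range(M), m))
            for _ in range(T)]

def predict(forest, row):
    """Majority vote over all trees."""
    votes = sum(l if row[f] <= t else r for f, t, l, r in forest)
    return 1 if votes >= 0 else -1
```

Because each tree sees a different bootstrap sample and a different feature subset, the trees' errors are less correlated than in plain bagging, which is where the extra variance reduction comes from.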
Tuning Random Forests

Methods to Construct Random Forests

Random Forest