
1

GRADIENT BOOSTED
ENSEMBLES

Jesse Davis
Ensemble Methods: Learn Multiple
Models and Combine Their Output
2

Key: Models in ensemble are “different”

[Figure: Data fed to several models whose predictions are combined: model 1 + model 2 + … + model t]

Key questions:
1. How do we generate multiple different models?
2. How do we learn the models efficiently?
Canonical Approach: Boosting
3

 Focuses on combining “weak” learners


 Learns an additive ensemble in an iterative,
stagewise manner
F(X) = α1h1(X) + α2h2(X) + … + αtht(X)

(αi is a real-valued weight; hi is a weak model, e.g., a depth-bounded tree)

 Two big ideas:


 Idea 1: Assign weights to examples to focus
attention on misclassified examples
 Idea 2: Prediction is a weighted vote based on
how accurate each hi is
Recall: AdaBoost
4

 Approach: Learn model iteratively


 Given: Fm-1(X) = α1h1(X) + α2h2(X) + … + αm-1hm-1(X)
 Add: αmhm(X) that minimizes the exponential error
E = \sum_{j=1}^{N} \exp\big(-y_j (F_{m-1}(x_j) + \alpha_m h_m(x_j))\big)

 Two big problems


 Just binary classification problems
 Specific loss function, that is, exponential loss
Gradient Tree Boosting
5

 The base algorithm is old but very hyped now


 1996: AdaBoost, the first practical boosting
algorithm [Freund et al.]
 1998: Formulate AdaBoost as gradient descent
with a special loss function [Breiman et al.]
 2000: Generalize AdaBoost to gradient boosting,
which works with any differentiable loss [Friedman et al.]
 Since: MART, XGBoost, LightGBM, BitBoost, etc.

 Will focus on least squares regression case

 Will ignore some of the mathematical details


Gradient Boosting is Popular!
6

N° of pages on Kaggle.com containing the term:

Linear models: 21,100
TensorFlow: 16,900
PyTorch: 5,500
AdaBoost: 2,290
LightGBM: 12,700
XGBoost: 17,400
Gradient Boosting Big Picture
7

 Gradient Boosting =
Gradient Descent + Boosting

 Fit an additive model (ensemble) in a greedy


forward stage-wise manner

 Each stage introduces a weak learner to


address the shortcomings of the current model

 Shortcomings are identified by gradients


Formal Definition: Gradient Boosting
for Squared Loss
8

 Given: {(x1,y1), (x2,y2),…,(xn,yn)}

 Goal: Learn function F: X ↦ Y

 Least squares objective: \ell = \frac{1}{2} \sum_{i=1}^{n} (y_i - F(x_i))^2

 Representation of F: F(x_i) = \sum_{t=1}^{T} \eta_t h_t(x_i)
Intuition
9

 Suppose we start with a simple model: F(x_i) = \bar{y}, the mean of the training targets

 We cannot change F itself in any way
(e.g., remove a tree, change a parameter)

 Ideal scenario: Add a new h such that:
F(x_1) + h(x_1) = y_1
F(x_2) + h(x_2) = y_2
…
F(x_n) + h(x_n) = y_n

Such an h won't exist, but we can approximate it


Learning h
10

 Learning h: the ideal conditions are equivalent to a regression problem on the residuals
F(x_1) + h(x_1) = y_1   ⇒   h(x_1) = y_1 − F(x_1)
F(x_2) + h(x_2) = y_2   ⇒   h(x_2) = y_2 − F(x_2)
…
F(x_n) + h(x_n) = y_n   ⇒   h(x_n) = y_n − F(x_n)
 Construct a new data set and learn h on it:
{(x_1, y_1 − F(x_1)), (x_2, y_2 − F(x_2)), …, (x_n, y_n − F(x_n))}

 Add the learned h to F

 Repeat this procedure


Pictorial Representation
11

Additive Model: F(X) = h0(X) + h1(X) + … + hm(X)

Function Space: All Decision Trees

[Figure: trees h0, h1, …, hm are added one at a time; the residual for an instance starts large and shrinks toward 0]
Connection to Gradient Descent
12

 So far: relabel each example with its residual y_i − F(x_i)

 Gradient descent: Minimize a function by
moving in the opposite direction of the gradient

 By changing F, we minimize our loss function
\ell = \frac{1}{2} \sum_i (y_i - F(x_i))^2

 View the F(x_i)s as parameters and take the derivative
\frac{\partial \ell}{\partial F(x_i)} = \frac{\partial \left( \frac{1}{2} \sum_i (y_i - F(x_i))^2 \right)}{\partial F(x_i)}
Squared Error: Can Interpret Residuals
as Negative Gradient
13

\frac{\partial \ell}{\partial F(x_i)} = \frac{\partial\, \frac{1}{2}\big[(y_1 - F(x_1))^2 + \dots + (y_i - F(x_i))^2 + \dots + (y_n - F(x_n))^2\big]}{\partial F(x_i)}

= \frac{\partial\, \frac{1}{2}(y_i - F(x_i))^2}{\partial F(x_i)}        (no other term involves F(x_i); those terms' derivatives are 0)

= F(x_i) - y_i        (by the chain rule)

Negative of the gradient: -\frac{\partial \ell}{\partial F(x_i)} = y_i - F(x_i)

Note: Negative gradient ≠ residual for all loss functions


Details on Chain Rule
14

\frac{\partial\, \frac{1}{2}(y_i - F(x_i))^2}{\partial F(x_i)} = ?

View the expression as f(g(F(x_i))) with f(u) = \frac{1}{2} u^2 and g(F(x_i)) = y_i - F(x_i)

By the chain rule: \frac{d}{dF(x_i)} f(g(F(x_i))) = f'(g(F(x_i))) \, g'(F(x_i))

f'(g(F(x_i))) = y_i - F(x_i) and g'(F(x_i)) = -1, thus

\frac{\partial\, \frac{1}{2}(y_i - F(x_i))^2}{\partial F(x_i)} = -(y_i - F(x_i)) = F(x_i) - y_i
Illustration of Loss Function
15

Each tree fits a step toward a better prediction:

[Figure: the L2 loss for regression and the log loss for classification, each decreasing toward 0 as steps are taken]
The Power of Gradient Boosting
16

 Abstract away the algorithm from the loss


function and hence the task

 Thus can plug in any differentiable loss function


and use the same algorithm
 Other regression loss functions
 Classification

 Ranking
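As a sketch of this "plug in any differentiable loss" idea (assuming scikit-learn; the helper names and learning-rate value are illustrative), the only loss-specific piece is a function that returns the negative gradient at the current predictions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def neg_gradient_squared(y, pred):
    # Squared loss: negative gradient = residual
    return y - pred

def neg_gradient_logistic(y, pred):
    # Logistic loss log(1 + exp(-2yF)) with y in {-1, +1}
    return 2.0 * y / (1.0 + np.exp(2.0 * y * pred))

def gradient_boost(X, y, neg_gradient, n_rounds=100, eta=0.1, max_depth=3):
    pred = np.zeros(len(y))
    trees = []
    for _ in range(n_rounds):
        pseudo_targets = neg_gradient(y, pred)   # shortcomings of the current model
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, pseudo_targets)
        pred += eta * h.predict(X)
        trees.append(h)
    return trees
```

Swapping neg_gradient_squared for neg_gradient_logistic switches the task from regression to classification without touching the boosting loop.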
AdaBoost vs. Gradient Boosting
17

 Similarities
 Stage-wise greedy learning of an additive model
 Focus on mispredicted examples

 Typically use decision trees

 Difference 1: Focus on mispredictions


 AdaBoost: High-weight data points
 Gradient Boosting: Gradient of loss function

 Difference 2: Generality
 AdaBoost: Just classification
 Gradient Boosting: Any differentiable loss!
18 Anatomy of a Boosting System
XGBoost
19

 A systems ML/DM paper that is hugely


successful in practice
 Wins Kaggle competitions
 Very good and easy to use implementation

 Spurred a number of follow-up papers


 LightGBM (Microsoft)
 CatBoost (Yandex)

 BitBoost (DTAI)

 Commonalities in design decisions


Key Design Features
20

 Tree structure: Internal and leaf nodes

 Criteria for evaluating splits

 Optimizing evaluations of the splits

 Add randomization

 XGBoost specific features


 Data storage
 Sparsity aware splitting
Only Consider Binary Trees!
21

 Reals: As is

 Binary: As is

 Discrete: One-hot encoding, e.g., Color = {r, y, b}:
        r  y  b
X_i,r   1  0  0
X_i,y   0  1  0
X_i,b   0  0  1

 Ordinal: Two choices
 One-hot encoding
 Convert to integers, e.g., Size = {small, medium, large} → Size = {0, 1, 2}

[Figure: example tree with binary splits, e.g., "Age < 35?" then "HasAuto?", and real-valued leaves 20, 10, 60]

Leaf Nodes Always Real-Valued
22

Value of a leaf node depends on the loss function

Let g_i = \frac{\partial \ell(F(x_i), y_i)}{\partial F(x_i)}

 Squared loss: \ell(F(x_i), y_i) = \frac{1}{2} (y_i - F(x_i))^2
Value of leaf j: w_j = -\frac{\sum_{i \in I_j} g_i}{|I_j|}, where I_j = instances sorted to leaf j

 Logistic loss: \ell(F(x_i), y_i) = \log\big(1 + e^{-2 y_i F(x_i)}\big)
Value of leaf j: w_j = \frac{\sum_{i \in I_j} -g_i}{\sum_{i \in I_j} |g_i| (2 - |g_i|)}
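A small sketch of the two leaf-value formulas above; it assumes g is a NumPy array holding the gradients g_i of the instances sorted to one leaf (array and function names are mine):

```python
import numpy as np

def leaf_value_squared_loss(g):
    # w_j = -sum(g_i) / |I_j|; since g_i = F(x_i) - y_i, this is the mean residual in the leaf
    return -np.sum(g) / len(g)

def leaf_value_logistic_loss(g):
    # w_j = sum(-g_i) / sum(|g_i| * (2 - |g_i|))
    return np.sum(-g) / np.sum(np.abs(g) * (2.0 - np.abs(g)))
```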
Predictions Always Real-Valued!
23
F(x_i) = \sum_{t=1}^{T} \eta_t h_t(x_i)

 Not problematic for regression tasks

 Threshold the raw score for classification tasks
E.g.: sign(F(x_i))

 Convert the raw score into a probability
 Train a logistic regression model to convert the raw
score to a probability (aka Platt scaling)
 Loss-function-specific conversions
E.g., logistic loss: p(y_i = 1 \mid x_i) = \frac{1}{1 + \exp(-2 F(x_i))}

Evaluating Splits: Loss Reduction
24

Split(I_P: instance set, S: split conditions)
for each s ∈ S do
    Let I_L (I_R) be the instance set of the left (right) child
    \Delta loss = \frac{(\sum_{i \in I_L} g_i)^2}{|I_L|} + \frac{(\sum_{i \in I_R} g_i)^2}{|I_R|} - \frac{(\sum_{i \in I_P} g_i)^2}{|I_P|}

Bottleneck: ~90% of training time is spent evaluating splits
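A hedged sketch of the Δloss computation for one candidate split; it assumes g holds the gradients of the instances in the parent node and left_mask marks the instances sent to the left child (names illustrative):

```python
import numpy as np

def delta_loss(g, left_mask):
    g_left, g_right = g[left_mask], g[~left_mask]   # assumes both children are non-empty
    return (g_left.sum() ** 2) / len(g_left) \
         + (g_right.sum() ** 2) / len(g_right) \
         - (g.sum() ** 2) / len(g)
```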


Trick 1: Exploit Tree Structure
25

The sums over the parent's instances were already computed when splitting the parent node (e.g., the node "Age < 35"): reuse them!

\Delta loss = \frac{(\sum_{i \in I_L} g_i)^2}{|I_L|} + \frac{(\sum_{i \in I_R} g_i)^2}{|I_R|} - \frac{(\sum_{i \in I_P} g_i)^2}{|I_P|}

Let \Sigma_P = \sum_{i \in I_P} g_i
Let \Sigma_L = \sum_{i \in I_L} g_i
Let \Sigma_R = \Sigma_P - \Sigma_L      (splits are binary, so the right sum comes for free: saves add operations!)

\Delta loss = \frac{\Sigma_L^2}{|I_L|} + \frac{\Sigma_R^2}{|I_P| - |I_L|} - \frac{\Sigma_P^2}{|I_P|}
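A sketch of this trick for one feature (names illustrative): sort once, keep a running left sum Σ_L, get Σ_R = Σ_P − Σ_L for free, and score every threshold in a single pass.

```python
import numpy as np

def best_split_one_feature(x, g):
    order = np.argsort(x)
    x_sorted, g_sorted = x[order], g[order]
    sum_p, n_p = g.sum(), len(g)                 # parent sums: already known, reused here
    sum_l, best_threshold, best_gain = 0.0, None, -np.inf
    for i in range(n_p - 1):
        sum_l += g_sorted[i]                     # running left sum
        if x_sorted[i] == x_sorted[i + 1]:
            continue                             # cannot split between equal feature values
        sum_r = sum_p - sum_l                    # binary split: right sum for free
        gain = sum_l**2 / (i + 1) + sum_r**2 / (n_p - i - 1) - sum_p**2 / n_p
        if gain > best_gain:
            best_threshold, best_gain = (x_sorted[i] + x_sorted[i + 1]) / 2, gain
    return best_threshold, best_gain
```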
Biggest Problem: Continuous Features
26

Training data (feature X_d with target Y):

X_1  …  X_d  Y
0    …  -1   5
0    …   5   -10
…    …   …   …
1    …   10  10

1. Copy the feature values: -1, 5, 95, -5, -1, …, 10
2. Sort the array: -5, -1, -1, 5, 10, …, 95
3. Try all possible splits: X_d < -1, X_d < 5, …

Problem: Lots of thresholds to try
Trick 2: Histograms
27

 Determine a limited set of split points per node

 Can evaluate these in one pass over the data

1. Pick a small number of equal-width bins (e.g., 256)
2. Pass over the data and fill the bins
3. Only consider splits at bin boundaries

Per bin, store (count, sum of g_i), e.g. for X_d:

-5 ≤ X_d < 5    5 ≤ X_d < 15    …    85 ≤ X_d ≤ 95
(22, 50)        (3, -5)         …    (6, 20)

Candidate splits: X_d < 5, X_d < 15, …
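A minimal sketch of histogram-based split finding for one feature, assuming equal-width bins and the same Δloss criterion as before (bin count and names are illustrative):

```python
import numpy as np

def best_histogram_split(x, g, n_bins=256):
    edges = np.linspace(x.min(), x.max(), n_bins + 1)          # equal-width bins
    bins = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    counts = np.bincount(bins, minlength=n_bins)               # per-bin instance count
    sums = np.bincount(bins, weights=g, minlength=n_bins)      # per-bin sum of g_i
    cum_n, cum_g = np.cumsum(counts), np.cumsum(sums)
    sum_p, n_p = g.sum(), len(g)
    best_threshold, best_gain = None, -np.inf
    for b in range(n_bins - 1):                                # splits only at bin boundaries
        n_l, n_r = cum_n[b], n_p - cum_n[b]
        if n_l == 0 or n_r == 0:
            continue
        gain = cum_g[b]**2 / n_l + (sum_p - cum_g[b])**2 / n_r - sum_p**2 / n_p
        if gain > best_gain:
            best_threshold, best_gain = edges[b + 1], gain
    return best_threshold, best_gain
```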
Add Randomization
28

 Bagging: Build a bootstrap replicate S' by sampling |S|
examples with replacement from S
(each example has a 63.2% chance of appearing in the replicate)

 Feature bagging: Randomly select subset of
columns when learning each tree
 Randomness in splits: Randomly select
subset of splits at each internal node
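A sketch of the three sources of randomness in terms of index sampling (NumPy only; the fractions are illustrative). Each example has a 1 − (1 − 1/n)^n ≈ 63.2% chance of appearing in a bootstrap replicate.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_rows(n):
    # Bagging: n row indices drawn with replacement
    return rng.integers(0, n, size=n)

def feature_subset(d, fraction=0.8):
    # Feature bagging: a random subset of columns for one tree
    return rng.choice(d, size=max(1, int(fraction * d)), replace=False)

def split_subset(candidate_splits, fraction=0.5):
    # Randomness in splits: evaluate only a random subset of candidates at a node
    k = max(1, int(fraction * len(candidate_splits)))
    chosen = rng.choice(len(candidate_splits), size=k, replace=False)
    return [candidate_splits[i] for i in chosen]
```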
XGBoost Data Representation:
(Compressed) Column Format
29

Example data (Married, Age, Type, Class):

Married  Age  Type     Class
Y        45   Sedan    -10
N        20   SUV      5
Y        30   Sport    10
Y        60   Berline  -15

Instead of the row-wise matrix format, XGBoost stores each feature as its own (compressed) column.

☺ Easier and faster to randomly select a feature
☺ Can presort continuous features
☹ More record keeping (if you presort)
Overhead with Presorting
30

Unsorted:              Presorted:
Feature  Y             Feature  Y
45       -10           20       -10
20       5             30       5
30       10            45       10
60       -15           60       -15

 Unsorted: array indices are aligned between the feature and Y,
so it is easy to look up the Y value
 Presorted: the alignment is broken, which requires storing a
pointer to the Y value for each feature entry
XGBoost: Sparsity-Aware Splitting
31

 Many entries “zero” due to one-hot encoding,


missing data, natural sparseness, etc.
 Do not store “zero” entries
[Figure: a dense feature column with many zero entries vs. a sparse representation that drops the zeros and keeps only the non-zero entries]

Fewer entries to iterate over:


Can result in 50x speed ups!
Which Details Were Skipped?
32

 Regularization to avoid overfitting


 Restrict depth of trees
 Restrict number of leaves

 Add penalty term to loss function

 Setting the learning rate η [see Friedman 2011]

 Derivations and discussion of all loss functions


33 Issues with Ensembles
Two Problems with Ensembles
34

 Problem: Over multiple models, how can I


determine which features are interesting?

Solution: Feature importances

 Problem: Ensembles have multiple models


 Predictions take longer
 Models take up more space

Solution: Model compression


Feature Importance:
Mean Decrease in Impurity
35

Importance(v) = \frac{1}{|E|} \sum_{m \in E} \sum_{n_v \in m} \frac{|S_n|}{|D|} \, Gain(v, S_n)

E = {m_1, …, m_t} is an ensemble of models
n_v is a node splitting on variable v
S_n is the set of training examples reaching node n_v
D is the training data
Gain(v, S_n) is the impurity reduction obtained by splitting S_n on v
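A sketch of this aggregation, assuming each tree is summarized as a list of (feature, number of examples reaching the node, gain) tuples; that node representation is my simplification, not a real library format.

```python
from collections import defaultdict

def mean_decrease_impurity(ensemble, n_train):
    """ensemble: list of trees; each tree is a list of (feature, n_reaching, gain) tuples."""
    importance = defaultdict(float)
    for tree in ensemble:
        for feature, n_reaching, gain in tree:
            importance[feature] += (n_reaching / n_train) * gain   # |S_n| / |D| * Gain(v, S_n)
    return {f: total / len(ensemble) for f, total in importance.items()}  # average over models
```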
Feature Importance:
Permutation Test
36

 For each variable, randomly permute its values
in the out-of-bag examples
 Out of bag: Examples not selected in the bootstrap
 Example: permuting the Type feature

Original:                           Type permuted:
Married  Age  Type     Class       Married  Age  Type     Class
N        20   SUV      N           N        20   Sport    N
Y        30   Sport    N           Y        30   Berline  N
Y        60   Berline  Y           Y        60   SUV      Y
 Measure decrease in accuracy


Model Compression
37

 Idea: Compress ensemble by mimicking its


behavior with a model that is smaller and
executes quickly
 Build a new data set D' = {(x_j', E(x_j'))}
 x_j' is an example
 E(x_j') is the ensemble's prediction for x_j'

 Train a new model M on D’


Two questions:
1. How to generate data?
2. What model?
Approach
38

 Generate data: For each example (x_j, y_j) in D,
create a new example (x_j', E(x_j')):
 Let (x_n, y_n) be (x_j, y_j)'s nearest neighbor in D
 For each feature i = 1 to d:
◼ r ~ U[0,1]
◼ If r < 0.5 then x_j,i' = x_j,i
◼ Else x_j,i' = x_n,i
 Add (x_j', E(x_j')) to D'

 Train a neural network on D'
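A sketch of the data-generation step and the mimic model, assuming scikit-learn and an ensemble object with a predict method (the 50/50 feature mixing follows the pseudocode above; everything else is illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.neural_network import MLPRegressor

def generate_mimic_data(X, ensemble, seed=0):
    """Mix each example's features with its nearest neighbor's, label with the ensemble."""
    rng = np.random.default_rng(seed)
    _, idx = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X)  # idx[:, 0] is the point itself
    X_new = X.astype(float).copy()
    for j in range(X.shape[0]):
        neighbor = X[idx[j, 1]]
        keep_own = rng.random(X.shape[1]) < 0.5       # per feature: keep own value or neighbor's
        X_new[j, ~keep_own] = neighbor[~keep_own]
    return X_new, ensemble.predict(X_new)             # labels are the ensemble's predictions

# Hypothetical usage with an already-trained ensemble:
# mimic = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500)
# mimic.fit(*generate_mimic_data(X_train, trained_ensemble))
```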
39 Applications
Web Search
40

Query: “107.7 the end”


How Helpful Is an On-the-Ball Action in
a Soccer Match?
42

A soccer match has ±1600 on-the-ball actions


Problem: 99% of actions do not directly affect the score

[Figure: pitch diagram of a pass toward the goal]

Question: How valuable is an action (e.g., pass, dribble,…)?


Contribution Rating: How Much Did an
Action Contribute to the Scoreline?
43

 Insight: An action changes the game state

 Assign a value to each game state s_i
 Value of an action: CR(s_i, a_i) = V(s_{i+1}) - V(s_i)

Example: V(s_i) = 0.01 and V(s_{i+1}) = 0.05
⇒ Value(pass) = 0.04 ≈ the pass's expected change in goal difference
Valuing a Game State
44

Intuition: Good actions either


1) Increase the short-term chance of scoring
2) Decrease the short-term chance of conceding
V(s_i) = P_scores(s_i) - P_concedes(s_i)

Estimate these probabilities from historical data:
 Game state uses the last 3 actions: s_i = {a_{i-2}, a_{i-1}, a_i}
 An action's effect is temporally limited:
   + example: a goal by either team in the next 10 actions
   − example: no goal in the next 10 actions

 Train a gradient boosted probability estimator
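A hedged sketch of the two estimators, using scikit-learn's GradientBoostingClassifier and random stand-in data; the feature matrix, label definitions, and model settings are placeholders for the real game-state features, not the published pipeline.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Stand-ins for features built from the last three actions plus game context
rng = np.random.default_rng(0)
game_states = rng.random((5000, 20))
scores_label = rng.integers(0, 2, 5000)      # 1 if a goal is scored in the next 10 actions
concedes_label = rng.integers(0, 2, 5000)    # 1 if a goal is conceded in the next 10 actions

p_scores = GradientBoostingClassifier().fit(game_states, scores_label)
p_concedes = GradientBoostingClassifier().fit(game_states, concedes_label)

def state_value(states):
    # V(s_i) = P_scores(s_i) - P_concedes(s_i)
    return p_scores.predict_proba(states)[:, 1] - p_concedes.predict_proba(states)[:, 1]

def action_value(state_before, state_after):
    # CR(s_i, a_i) = V(s_{i+1}) - V(s_i)
    return state_value(state_after) - state_value(state_before)
```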


Represent Game State Using
> 20 Features
45

3 types of features:
1) Simple: describe one action, e.g.:
   Type = Pass, Result = Success, Start location = (60, 20),
   End location = (75, 60), Body part = Foot
2) Complex: compare consecutive actions (e.g., a pass followed by a tackle):
   time difference between the actions, distance from the pass to the goal
3) Contextual: game info, e.g.:
   Home: 2; Away: 0; GD: +2; Time = 80 min
Example: Barcelona’s 3-0 goal versus
Real Madrid (Dec 23, 2017)

[Figure: the action sequence leading to the goal; the phase starts at the annotated action]


Application Scouting: Top-5 U21
players in the 2017/18 Dutch League
47

Rank  Player             Team     Age  VAEP rating  June 2018  June 2019  Price delta
1     David Neres        Ajax     21   0.62         €20m       €45m       +€25m
2     Mason Mount        Vitesse  19   0.62         €4m        €12m       +€8m
3     Frenkie de Jong    Ajax     20   0.50         €7m        €85m       +€78m
4     Steven Bergwijn    PSV      20   0.49         €12m       €35m       +€23m
5     Donny van de Beek  Ajax     21   0.47         €14m       €40m       +€26m



Summary
49

 Gradient boosting generalizes AdaBoost to work
with any differentiable loss function

 Mainly done with trees, though with some tweaks
to the standard tree learner

 Many highly performant implementations

 Widely applied to real problems


Questions?
50
51
AdaBoost from Principles of ML
for Easy Reference / Recall



Boosting
52

 Arose in the theoretical PAC learning community


 Strong PAC learner ≈ For arbitrary ε and δ, with
probability 1-δ, produces a model with error < ε
 Boosting assumes a weak learner:
 Cannot PAC learn for arbitrary ε and δ
 Models are (slightly) better than random guessing

 General approach
 Learns an additive model
 Greedily adds one model at a time
 Focuses the current model on correcting
examples that are incorrectly predicted
AdaBoost: First Practical Booster
53

 Works for binary classification problems


 Learns additive model iteratively
F(X) = α1h1(X) + α2h2(X) + … + αtht(X)

 Two big ideas:


 Idea 1: Assign weights to examples to focus
attention on misclassified examples
 Idea 2: Prediction is a weighted vote based on
how accurate each hi is
Boosting Example
54

[Figure: 2-D data set of + and − examples]

 Assume that we are going to make one axis-parallel
cut through feature space


Boosting Example
55

[Figure: the first axis-parallel split; three examples are misclassified]

 Errors: 3
 Upweight the mistakes, downweight everything
else
Boosting Example
56

[Figure: the reweighted data set after the first iteration]


Boosting Example
57

[Figure: the reweighted data set after the first iteration]

Three key questions for AdaBoost:


1. What model should we pick (in theory)?
2. How should we set α ?
3. What practical details are important?
AdaBoost Setting
58

 Given: S = {(x_j, y_j)} with j ∈ {1,…,n} and y ∈ {-1,+1}

 Learn: F(X) = α1h1(X) + α2h2(X) + … + αtht(X)

 Prediction: sign\big(\sum_t \alpha_t h_t(x_i)\big), i.e., -1 (negative) or +1 (positive)

1. What model should we pick (in theory)?
2. How should we set α?
3. What practical details are important?
What Classifier Should We Pick?
59

 Suppose our current function is:


Fm-1(X) = α1h1(X) + α2h2(X) + … + αm-1hm-1(X)

 Goal:
 Pick αmhm(X) to add to the model
 To minimize the exponential error:
E = \sum_{i=1}^{N} \exp\big(-y_i (F_{m-1}(x_i) + \alpha_m h_m(x_i))\big)


Understanding the Error:
60
Exponential Loss Function
E = \sum_{i=1}^{N} \exp\big(-y_i (F_{m-1}(x_i) + \alpha_m h_m(x_i))\big)

 Loss is small if the predicted and true label have the same sign:
sign(y) = sign(\hat{y}) ⇒ -y\hat{y} < 0
sign(y) ≠ sign(\hat{y}) ⇒ -y\hat{y} > 0

 Loss drops quickly, e.g., for the case y = 1:
e^{-(1)(-2)} = 7.4,  e^{-(1)(-1)} = 2.7

 Loss is always > 0

[Figure: exponential loss as a function of the prediction \hat{y} for the case y = 1]
What Classifier Should We Pick?
61
 Goal: Minimize E = \sum_{i=1}^{N} \exp\big(-y_i (F_{m-1}(x_i) + \alpha_m h_m(x_i))\big)

 Let w_i^{(m)} = \exp\big(-y_i F_{m-1}(x_i)\big)

This is an error term per example that is fixed, because we
cannot change classifiers 1 to m-1 in the ensemble


What Classifier Should We Pick?
62
 Goal: Minimize E = \sum_{i=1}^{N} \exp\big(-y_i (F_{m-1}(x_i) + \alpha_m h_m(x_i))\big)

 Let w_i^{(m)} = \exp\big(-y_i F_{m-1}(x_i)\big) and rewrite the error:

E = \sum_{y_i = h_m(x_i)} w_i^{(m)} e^{-\alpha_m} + \sum_{y_i \neq h_m(x_i)} w_i^{(m)} e^{\alpha_m}
(weight of correct predictions)        (weight of incorrect predictions)

E = \sum_{i=1}^{N} w_i^{(m)} e^{-\alpha_m} + (e^{\alpha_m} - e^{-\alpha_m}) \sum_{y_i \neq h_m(x_i)} w_i^{(m)}
The first term pretends all of h_m's predictions are correct and does not depend on h_m;
the h_m that minimizes the second sum therefore minimizes E!
Setting α𝑚
63

 Let W_c = \sum_{y_i = h_m(x_i)} w_i^{(m)} and W_{ic} = \sum_{y_i \neq h_m(x_i)} w_i^{(m)}

 Then E = W_c e^{-\alpha_m} + W_{ic} e^{\alpha_m}

\frac{dE}{d\alpha_m} = -W_c e^{-\alpha_m} + W_{ic} e^{\alpha_m} = 0
-W_c + W_{ic} e^{2\alpha_m} = 0
\alpha_m = \frac{1}{2} \ln \frac{W_c}{W_{ic}}
\alpha_m = \frac{1}{2} \ln \frac{1 - \varepsilon_m}{\varepsilon_m}
with \varepsilon_m = \frac{\sum_{y_i \neq h_m(x_i)} w_i^{(m)}}{\sum_i w_i^{(m)}}
AdaBoost
64

Given: S = {(x_j, y_j)} where j ∈ {1,…,n}, integer T

w_i^{(1)} = 1/n                                          // all examples start with the same weight
for t = 1 to T
    Find a classifier h_t with small weighted error ε_t = \sum_{h_t(x_i) \neq y_i} w_i^{(t)}
    if ε_t > 1/2 then break
    β_t = ε_t / (1 - ε_t)
    if h_t(x_i) = y_i then w_i^{(t+1)} = w_i^{(t)} β_t    // down-weight correct predictions
    w_i^{(t+1)} = w_i^{(t+1)} / \sum_j w_j^{(t+1)}        // normalize weights
    α_t = ln(1/β_t)
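A compact Python sketch of the pseudocode above, using a scikit-learn decision tree as the weak learner via sample weights (Approach 1 on the next slide); labels must be in {-1, +1}, and the function names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50, max_depth=1):
    """y must be in {-1, +1}; the weak learner is a depth-bounded tree trained on weighted data."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # all examples start with the same weight
    models, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=max_depth)
        h.fit(X, y, sample_weight=w)           # learner adapted to weighted instances
        pred = h.predict(X)
        eps = w[pred != y].sum()               # weighted error
        if eps >= 0.5:
            break                              # weak learner is no better than chance
        eps = max(eps, 1e-10)                  # guard against a perfect weak learner
        beta = eps / (1.0 - eps)
        w[pred == y] *= beta                   # down-weight correct predictions
        w /= w.sum()                           # normalize weights
        models.append(h)
        alphas.append(np.log(1.0 / beta))
    return models, alphas

def adaboost_predict(X, models, alphas):
    score = sum(a * h.predict(X) for a, h in zip(alphas, models))
    return np.sign(score)
```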
AdaBoost in Practice
65

 Typically use depth bounded decision tree


 Sometimes stumps: Just a single split
 Often depth 5 or 6

 Dealing with weighted instances


 Approach 1: Adapt the learner to learn from
weighted instances; trivial for decision trees:
just use weighted counts in the split criteria
 Approach 2: Sample a large (≫n) set of
unweighted instances according to the weight
distribution and run learner
