
1

GRADIENT BOOSTED
ENSEMBLES

Jesse Davis
Ensemble Methods: Learn Multiple
Models and Combine Their Output
2

Key: Models in ensemble are “different”

[Figure: Data fed to several models whose predictions are combined: model 1 + model 2 + … + model t]

Key questions:
1. How do we generate multiple different models?
2. How do we learn the models efficiently?
Canonical Approach: Boosting
3

 Focuses on combining “weak” learners


 Learns an additive ensemble in an iterative,
stagewise manner
F(X) = α1h1(X) + α2h2(X) + … + αtht(X)

(αi is a real-valued weight; hi is a weak model, e.g., a depth-bounded tree)

 Two big ideas:


 Idea 1: Assign weights to examples to focus
attention on misclassified examples
 Idea 2: Prediction is a weighted vote based on
how accurate each hi is
Recall: AdaBoost
4

 Approach: Learn model iteratively


 Given: Fm-1(X) = α1h1(X) + α2h2(X) + … + αm-1hm-1(X)
 Add: αmhm(X) that minimizes the exponential error
E = \sum_{j=1}^{N} \exp\big(-y_j (F_{m-1}(x_j) + \alpha_m h_m(x_j))\big)

 Two big problems


 Just binary classification problems
 Specific loss function, that is, exponential loss
Gradient Tree Boosting
5

 The base algorithm is old but very hyped now


 1996: AdaBoost, the first practical boosting
algorithm [Freund et al.]
 1998: Formulate AdaBoost as gradient descent
with a special loss function [Breiman et al.]
 2000: Generalize AdaBoost to gradient boosting,
which works with any differentiable loss [Friedman et al.]
 Since: MART, XGBoost, LightGBM, BitBoost, etc.

 Will focus on least squares regression case

 Will ignore some of the mathematical details


Gradient Boosting is Popular!
6

N° of pages on Kaggle.com containing the term:

Linear models: 21,100
TensorFlow: 16,900
PyTorch: 5,500
AdaBoost: 2,290
LightGBM: 12,700
XGBoost: 17,400
Gradient Boosting Big Picture
7

 Gradient Boosting =
Gradient Descent + Boosting

 Fit an additive model (ensemble) in a greedy


forward stage-wise manner

 Each stage introduces a weak learner to


address the shortcomings of the current model

 Shortcomings are identified by gradients


Formal Definition: Gradient Boosting
for Squared Loss
8

 Given: {(x1,y1), (x2,y2),…,(xn,yn)}

 Goal: Learn function F: X ↦ Y

 Least squares objective: \ell = \frac{1}{2} \sum_{i=1}^{n} (y_i - F(x_i))^2

 Representation of F: F(x_i) = \sum_{t=1}^{T} \eta_t h_t(x_i)
Intuition
9

 Suppose we start with a simple model: F(x_i) = \bar{y}, the mean of the training targets

 We cannot change F itself in any way
(e.g., remove a tree, change a parameter)

 Ideal scenario: Add a new h such that:
F(x_1) + h(x_1) = y_1
F(x_2) + h(x_2) = y_2
…
F(x_n) + h(x_n) = y_n

Such an h won't exist, but we can approximate it


Learning h
10

 Learning h: the ideal conditions are equivalent to a regression problem on the residuals
F(x_1) + h(x_1) = y_1   ⇒   h(x_1) = y_1 − F(x_1)
F(x_2) + h(x_2) = y_2   ⇒   h(x_2) = y_2 − F(x_2)
…
F(x_n) + h(x_n) = y_n   ⇒   h(x_n) = y_n − F(x_n)
 Construct a new data set and learn h on it:
{(x_1, y_1 − F(x_1)), (x_2, y_2 − F(x_2)), …, (x_n, y_n − F(x_n))}

 Add the learned h to F

 Repeat this procedure


Pictorial Representation
11

Additive Model: F(X) = h0(X) + h1(X) + … + hm(X)

Function Space: All Decision Trees

[Figure: trees h0, h1, …, hm are added one at a time; the residual for an instance starts large and shrinks toward 0]
Connection to Gradient Descent
12

 So far: relabel each example with its residual y_i − F(x_i)

 Gradient descent: Minimize a function by
moving in the opposite direction of the gradient

 By changing F, we minimize our loss function
\ell = \frac{1}{2} \sum_i (y_i - F(x_i))^2

 View the F(x_i)s as parameters and take the derivative
\frac{\partial \ell}{\partial F(x_i)} = \frac{\partial \left( \frac{1}{2} \sum_i (y_i - F(x_i))^2 \right)}{\partial F(x_i)}
Squared Error: Can Interpret Residuals
as Negative Gradient
13

\frac{\partial \ell}{\partial F(x_i)} = \frac{\partial\, \frac{1}{2}\big[(y_1 - F(x_1))^2 + \dots + (y_i - F(x_i))^2 + \dots + (y_n - F(x_n))^2\big]}{\partial F(x_i)}

= \frac{\partial\, \frac{1}{2}(y_i - F(x_i))^2}{\partial F(x_i)}        (no other term involves F(x_i); those terms' derivatives are 0)

= F(x_i) - y_i        (by the chain rule)

Negative of the gradient: -\frac{\partial \ell}{\partial F(x_i)} = y_i - F(x_i)

Note: Negative gradient ≠ residual for all loss functions


Details on Chain Rule
14

\frac{\partial\, \frac{1}{2}(y_i - F(x_i))^2}{\partial F(x_i)} = ?

View the expression as f(g(F(x_i))) with f(u) = \frac{1}{2} u^2 and g(F(x_i)) = y_i - F(x_i)

By the chain rule: \frac{d}{dF(x_i)} f(g(F(x_i))) = f'(g(F(x_i))) \, g'(F(x_i))

f'(g(F(x_i))) = y_i - F(x_i) and g'(F(x_i)) = -1, thus

\frac{\partial\, \frac{1}{2}(y_i - F(x_i))^2}{\partial F(x_i)} = -(y_i - F(x_i)) = F(x_i) - y_i
Illustration of Loss Function
15

Each tree fits a step toward a better prediction:

[Figure: the L2 loss for regression and the log loss for classification, each decreasing toward 0 as steps are taken]
The Power of Gradient Boosting
16

 Abstract away the algorithm from the loss


function and hence the task

 Thus can plug in any differentiable loss function


and use the same algorithm
 Other regression loss functions
 Classification

 Ranking
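As a sketch of this "plug in any differentiable loss" idea (assuming scikit-learn; the helper names and learning-rate value are illustrative), the only loss-specific piece is a function that returns the negative gradient at the current predictions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def neg_gradient_squared(y, pred):
    # Squared loss: negative gradient = residual
    return y - pred

def neg_gradient_logistic(y, pred):
    # Logistic loss log(1 + exp(-2yF)) with y in {-1, +1}
    return 2.0 * y / (1.0 + np.exp(2.0 * y * pred))

def gradient_boost(X, y, neg_gradient, n_rounds=100, eta=0.1, max_depth=3):
    pred = np.zeros(len(y))
    trees = []
    for _ in range(n_rounds):
        pseudo_targets = neg_gradient(y, pred)   # shortcomings of the current model
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, pseudo_targets)
        pred += eta * h.predict(X)
        trees.append(h)
    return trees
```

Swapping neg_gradient_squared for neg_gradient_logistic switches the task from regression to classification without touching the boosting loop.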
AdaBoost vs. Gradient Boosting
17

 Similarities
 Stage-wise greedy learning of an additive model
 Focus on mispredicted examples

 Typically use decision trees

 Difference 1: Focus on mispredictions


 AdaBoost: High-weight data points
 Gradient Boosting: Gradient of loss function

 Difference 2: Generality
 AdaBoost: Just classification
 Gradient Boosting: Any differentiable loss!
18 Anatomy of a Boosting System
XGBoost
19

 A systems ML/DM paper that is hugely


successful in practice
 Wins Kaggle competitions
 Very good and easy to use implementation

 Spurred a number of follow-up papers


 LightGBM (Microsoft)
 CatBoost (Yandex)

 BitBoost (DTAI)

 Commonalities in design decisions


Key Design Features
20

 Tree structure: Internal and leaf nodes

 Criteria for evaluating splits

 Optimizing evaluations of the splits

 Add randomization

 XGBoost specific features


 Data storage
 Sparsity aware splitting
Only Consider Binary Trees!
21

 Reals: As is

 Binary: As is

 Discrete: One-hot encoding, e.g., Color = {r, y, b}:
        r  y  b
X_i,r   1  0  0
X_i,y   0  1  0
X_i,b   0  0  1

 Ordinal: Two choices
 One-hot encoding
 Convert to integers, e.g., Size = {small, medium, large} → Size = {0, 1, 2}

[Figure: example tree with binary splits, e.g., "Age < 35?" then "HasAuto?", and real-valued leaves 20, 10, 60]

Leaf Nodes Always Real-Valued
22

Value of a leaf node depends on the loss function

Let g_i = \frac{\partial \ell(F(x_i), y_i)}{\partial F(x_i)}

 Squared loss: \ell(F(x_i), y_i) = \frac{1}{2} (y_i - F(x_i))^2
Value of leaf j: w_j = -\frac{\sum_{i \in I_j} g_i}{|I_j|}, where I_j = instances sorted to leaf j

 Logistic loss: \ell(F(x_i), y_i) = \log\big(1 + e^{-2 y_i F(x_i)}\big)
Value of leaf j: w_j = \frac{\sum_{i \in I_j} -g_i}{\sum_{i \in I_j} |g_i| (2 - |g_i|)}
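A small sketch of the two leaf-value formulas above; it assumes g is a NumPy array holding the gradients g_i of the instances sorted to one leaf (array and function names are mine):

```python
import numpy as np

def leaf_value_squared_loss(g):
    # w_j = -sum(g_i) / |I_j|; since g_i = F(x_i) - y_i, this is the mean residual in the leaf
    return -np.sum(g) / len(g)

def leaf_value_logistic_loss(g):
    # w_j = sum(-g_i) / sum(|g_i| * (2 - |g_i|))
    return np.sum(-g) / np.sum(np.abs(g) * (2.0 - np.abs(g)))
```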
Predictions Always Real-Valued!
23
F(x_i) = \sum_{t=1}^{T} \eta_t h_t(x_i)

 Not problematic for regression tasks

 Threshold the raw score for classification tasks
E.g.: sign(F(x_i))

 Convert the raw score into a probability
 Train a logistic regression model to convert the raw
score to a probability (aka Platt scaling)
 Loss-function-specific conversions
E.g., logistic loss: p(y_i = 1 \mid x_i) = \frac{1}{1 + \exp(-2 F(x_i))}

Evaluating Splits: Loss Reduction
24

Split(I_P: instance set, S: split conditions)
for each s ∈ S do
    Let I_L (I_R) be the instance set of the left (right) child
    \Delta loss = \frac{(\sum_{i \in I_L} g_i)^2}{|I_L|} + \frac{(\sum_{i \in I_R} g_i)^2}{|I_R|} - \frac{(\sum_{i \in I_P} g_i)^2}{|I_P|}

Bottleneck: ~90% of training time is spent evaluating splits
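A hedged sketch of the Δloss computation for one candidate split; it assumes g holds the gradients of the instances in the parent node and left_mask marks the instances sent to the left child (names illustrative):

```python
import numpy as np

def delta_loss(g, left_mask):
    g_left, g_right = g[left_mask], g[~left_mask]   # assumes both children are non-empty
    return (g_left.sum() ** 2) / len(g_left) \
         + (g_right.sum() ** 2) / len(g_right) \
         - (g.sum() ** 2) / len(g)
```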


Trick 1: Exploit Tree Structure
25

The sums over the parent's instances were already computed when splitting the parent node (e.g., the node "Age < 35"): reuse them!

\Delta loss = \frac{(\sum_{i \in I_L} g_i)^2}{|I_L|} + \frac{(\sum_{i \in I_R} g_i)^2}{|I_R|} - \frac{(\sum_{i \in I_P} g_i)^2}{|I_P|}

Let \Sigma_P = \sum_{i \in I_P} g_i
Let \Sigma_L = \sum_{i \in I_L} g_i
Let \Sigma_R = \Sigma_P - \Sigma_L      (splits are binary, so the right sum comes for free: saves add operations!)

\Delta loss = \frac{\Sigma_L^2}{|I_L|} + \frac{\Sigma_R^2}{|I_P| - |I_L|} - \frac{\Sigma_P^2}{|I_P|}
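A sketch of this trick for one feature (names illustrative): sort once, keep a running left sum Σ_L, get Σ_R = Σ_P − Σ_L for free, and score every threshold in a single pass.

```python
import numpy as np

def best_split_one_feature(x, g):
    order = np.argsort(x)
    x_sorted, g_sorted = x[order], g[order]
    sum_p, n_p = g.sum(), len(g)                 # parent sums: already known, reused here
    sum_l, best_threshold, best_gain = 0.0, None, -np.inf
    for i in range(n_p - 1):
        sum_l += g_sorted[i]                     # running left sum
        if x_sorted[i] == x_sorted[i + 1]:
            continue                             # cannot split between equal feature values
        sum_r = sum_p - sum_l                    # binary split: right sum for free
        gain = sum_l**2 / (i + 1) + sum_r**2 / (n_p - i - 1) - sum_p**2 / n_p
        if gain > best_gain:
            best_threshold, best_gain = (x_sorted[i] + x_sorted[i + 1]) / 2, gain
    return best_threshold, best_gain
```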
Biggest Problem: Continuous Features
26

Training data (feature X_d with target Y):

X_1  …  X_d  Y
0    …  -1   5
0    …   5   -10
…    …   …   …
1    …   10  10

1. Copy the feature values: -1, 5, 95, -5, -1, …, 10
2. Sort the array: -5, -1, -1, 5, 10, …, 95
3. Try all possible splits: X_d < -1, X_d < 5, …

Problem: Lots of thresholds to try
Trick 2: Histograms
27

 Determine a limited set of split points per node

 Can evaluate these in one pass over the data

1. Pick a small number of equal-width bins (e.g., 256)
2. Pass over the data and fill the bins
3. Only consider splits at bin boundaries

Per bin, store (count, sum of g_i), e.g. for X_d:

-5 ≤ X_d < 5    5 ≤ X_d < 15    …    85 ≤ X_d ≤ 95
(22, 50)        (3, -5)         …    (6, 20)

Candidate splits: X_d < 5, X_d < 15, …
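A minimal sketch of histogram-based split finding for one feature, assuming equal-width bins and the same Δloss criterion as before (bin count and names are illustrative):

```python
import numpy as np

def best_histogram_split(x, g, n_bins=256):
    edges = np.linspace(x.min(), x.max(), n_bins + 1)          # equal-width bins
    bins = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    counts = np.bincount(bins, minlength=n_bins)               # per-bin instance count
    sums = np.bincount(bins, weights=g, minlength=n_bins)      # per-bin sum of g_i
    cum_n, cum_g = np.cumsum(counts), np.cumsum(sums)
    sum_p, n_p = g.sum(), len(g)
    best_threshold, best_gain = None, -np.inf
    for b in range(n_bins - 1):                                # splits only at bin boundaries
        n_l, n_r = cum_n[b], n_p - cum_n[b]
        if n_l == 0 or n_r == 0:
            continue
        gain = cum_g[b]**2 / n_l + (sum_p - cum_g[b])**2 / n_r - sum_p**2 / n_p
        if gain > best_gain:
            best_threshold, best_gain = edges[b + 1], gain
    return best_threshold, best_gain
```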
Add Randomization
28

 Bagging: Build a bootstrap replicate S' by sampling |S|
examples with replacement from S
(each example has a 63.2% chance of appearing in the replicate)

 Feature bagging: Randomly select subset of
columns when learning each tree
 Randomness in splits: Randomly select
subset of splits at each internal node
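A sketch of the three sources of randomness in terms of index sampling (NumPy only; the fractions are illustrative). Each example has a 1 − (1 − 1/n)^n ≈ 63.2% chance of appearing in a bootstrap replicate.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_rows(n):
    # Bagging: n row indices drawn with replacement
    return rng.integers(0, n, size=n)

def feature_subset(d, fraction=0.8):
    # Feature bagging: a random subset of columns for one tree
    return rng.choice(d, size=max(1, int(fraction * d)), replace=False)

def split_subset(candidate_splits, fraction=0.5):
    # Randomness in splits: evaluate only a random subset of candidates at a node
    k = max(1, int(fraction * len(candidate_splits)))
    chosen = rng.choice(len(candidate_splits), size=k, replace=False)
    return [candidate_splits[i] for i in chosen]
```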
XGBoost Data Representation:
(Compressed) Column Format
29

Example data (Married, Age, Type, Class):

Married  Age  Type     Class
Y        45   Sedan    -10
N        20   SUV      5
Y        30   Sport    10
Y        60   Berline  -15

Instead of the row-wise matrix format, XGBoost stores each feature as its own (compressed) column.

☺ Easier and faster to randomly select a feature
☺ Can presort continuous features
☹ More record keeping (if you presort)
Overhead with Presorting
30

Unsorted:              Presorted:
Feature  Y             Feature  Y
45       -10           20       -10
20       5             30       5
30       10            45       10
60       -15           60       -15

 Unsorted: array indices are aligned between the feature and Y,
so it is easy to look up the Y value
 Presorted: the alignment is broken, which requires storing a
pointer to the Y value for each feature entry
XGBoost: Sparsity-Aware Splitting
31

 Many entries “zero” due to one-hot encoding,


missing data, natural sparseness, etc.
 Do not store “zero” entries
[Figure: a dense feature column with many zero entries vs. a sparse representation that drops the zeros and keeps only the non-zero entries]

Fewer entries to iterate over:


Can result in 50x speed ups!
Which Details Were Skipped?
32

 Regularization to avoid overfitting


 Restrict depth of trees
 Restrict number of leaves

 Add penalty term to loss function

 Setting the learning rate η [see Friedman 2011]

 Derivations and discussion of all loss functions


33 Issues with Ensembles
Two Problems with Ensembles
34

 Problem: Over multiple models, how can I


determine which features are interesting?

Solution: Feature importances

 Problem: Ensembles have multiple models


 Predictions take longer
 Models take up more space

Solution: Model compression


Feature Importance:
Mean Decrease in Impurity
35

Importance(v) = \frac{1}{|E|} \sum_{m \in E} \sum_{n_v \in m} \frac{|S_n|}{|D|} \, Gain(v, S_n)

E = {m_1, …, m_t} is an ensemble of models
n_v is a node splitting on variable v
S_n is the set of training examples reaching node n_v
D is the training data
Gain(v, S_n) is the impurity reduction obtained by splitting S_n on v
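A sketch of this aggregation, assuming each tree is summarized as a list of (feature, number of examples reaching the node, gain) tuples; that node representation is my simplification, not a real library format.

```python
from collections import defaultdict

def mean_decrease_impurity(ensemble, n_train):
    """ensemble: list of trees; each tree is a list of (feature, n_reaching, gain) tuples."""
    importance = defaultdict(float)
    for tree in ensemble:
        for feature, n_reaching, gain in tree:
            importance[feature] += (n_reaching / n_train) * gain   # |S_n| / |D| * Gain(v, S_n)
    return {f: total / len(ensemble) for f, total in importance.items()}  # average over models
```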
Feature Importance:
Permutation Test
36

 For each variable, randomly permute its values
in the out-of-bag examples
 Out of bag: Examples not selected in the bootstrap
 Example: permuting the Type feature

Original:                           Type permuted:
Married  Age  Type     Class       Married  Age  Type     Class
N        20   SUV      N           N        20   Sport    N
Y        30   Sport    N           Y        30   Berline  N
Y        60   Berline  Y           Y        60   SUV      Y
 Measure decrease in accuracy


Model Compression
37

 Idea: Compress ensemble by mimicking its


behavior with a model that is smaller and
executes quickly
 Build a new data set D' = {(x_j', E(x_j'))}
 x_j' is an example
 E(x_j') is the ensemble's prediction for x_j'

 Train a new model M on D’


Two questions:
1. How to generate data?
2. What model?
Approach
38

 Generate data: For each example (x_j, y_j) in D,
create a new example (x_j', E(x_j')):
 Let (x_n, y_n) be (x_j, y_j)'s nearest neighbor in D
 For each feature i = 1 to d:
◼ r ~ U[0,1]
◼ If r < 0.5 then x_j,i' = x_j,i
◼ Else x_j,i' = x_n,i
 Add (x_j', E(x_j')) to D'

 Train a neural network on D'
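A sketch of the data-generation step and the mimic model, assuming scikit-learn and an ensemble object with a predict method (the 50/50 feature mixing follows the pseudocode above; everything else is illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.neural_network import MLPRegressor

def generate_mimic_data(X, ensemble, seed=0):
    """Mix each example's features with its nearest neighbor's, label with the ensemble."""
    rng = np.random.default_rng(seed)
    _, idx = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X)  # idx[:, 0] is the point itself
    X_new = X.astype(float).copy()
    for j in range(X.shape[0]):
        neighbor = X[idx[j, 1]]
        keep_own = rng.random(X.shape[1]) < 0.5       # per feature: keep own value or neighbor's
        X_new[j, ~keep_own] = neighbor[~keep_own]
    return X_new, ensemble.predict(X_new)             # labels are the ensemble's predictions

# Hypothetical usage with an already-trained ensemble:
# mimic = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500)
# mimic.fit(*generate_mimic_data(X_train, trained_ensemble))
```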
39 Applications
Web Search
40

Query: “107.7 the end”


How Helpful Is an On-the-Ball Action in
a Soccer Match?
42

A soccer match has ±1600 on-the-ball actions


Problem: 99% of actions do not directly affect the score

[Figure: pitch diagram of a pass toward the goal]

Question: How valuable is an action (e.g., pass, dribble,…)?


Contribution Rating: How Much Did an
Action Contribute to the Scoreline?
43

 Insight: An action changes the game state

 Assign a value to each game state s_i
 Value of an action: CR(s_i, a_i) = V(s_{i+1}) - V(s_i)

Example: V(s_i) = 0.01 and V(s_{i+1}) = 0.05
⇒ Value(pass) = 0.04 ≈ the pass's expected change in goal difference
Valuing a Game State
44

Intuition: Good actions either


1) Increase the short-term chance of scoring
2) Decrease the short-term chance of conceding
V(s_i) = P_scores(s_i) - P_concedes(s_i)

Estimate these probabilities from historical data:
 Game state uses the last 3 actions: s_i = {a_{i-2}, a_{i-1}, a_i}
 An action's effect is temporally limited:
   + example: a goal by either team in the next 10 actions
   − example: no goal in the next 10 actions

 Train a gradient boosted probability estimator
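A hedged sketch of the two estimators, using scikit-learn's GradientBoostingClassifier and random stand-in data; the feature matrix, label definitions, and model settings are placeholders for the real game-state features, not the published pipeline.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Stand-ins for features built from the last three actions plus game context
rng = np.random.default_rng(0)
game_states = rng.random((5000, 20))
scores_label = rng.integers(0, 2, 5000)      # 1 if a goal is scored in the next 10 actions
concedes_label = rng.integers(0, 2, 5000)    # 1 if a goal is conceded in the next 10 actions

p_scores = GradientBoostingClassifier().fit(game_states, scores_label)
p_concedes = GradientBoostingClassifier().fit(game_states, concedes_label)

def state_value(states):
    # V(s_i) = P_scores(s_i) - P_concedes(s_i)
    return p_scores.predict_proba(states)[:, 1] - p_concedes.predict_proba(states)[:, 1]

def action_value(state_before, state_after):
    # CR(s_i, a_i) = V(s_{i+1}) - V(s_i)
    return state_value(state_after) - state_value(state_before)
```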


Represent Game State Using
> 20 Features
45

3 types of features:
1) Simple: describe one action, e.g.:
   Type = Pass, Result = Success, Start location = (60, 20),
   End location = (75, 60), Body part = Foot
2) Complex: compare consecutive actions (e.g., a pass followed by a tackle):
   time difference between the actions, distance from the pass to the goal
3) Contextual: game info, e.g.:
   Home: 2; Away: 0; GD: +2; Time = 80 min
Example: Barcelona’s 3-0 goal versus
Real Madrid (Dec 23, 2017)

[Figure: the action sequence leading to the goal; the phase starts at the annotated action]


Application Scouting: Top-5 U21
players in the 2017/18 Dutch League
47

Rank  Player             Team     Age  VAEP rating  June 2018  June 2019  Price delta
1     David Neres        Ajax     21   0.62         €20m       €45m       +€25m
2     Mason Mount        Vitesse  19   0.62         €4m        €12m       +€8m
3     Frenkie de Jong    Ajax     20   0.50         €7m        €85m       +€78m
4     Steven Bergwijn    PSV      20   0.49         €12m       €35m       +€23m
5     Donny van de Beek  Ajax     21   0.47         €14m       €40m       +€26m



Summary
49

 Gradient boosting generalizes AdaBoost to work
with any differentiable loss function

 Mainly done with trees, though with some tweaks
to the standard tree learner

 Many highly performant implementations

 Widely applied to real problems


Questions?
50
51
AdaBoost from Principles of ML
for Easy Reference / Recall



Boosting
52

 Arose in the theoretical PAC learning community


 Strong PAC learner ≈ For arbitrary ε and δ, with
probability 1-δ, produces a model with error < ε
 Boosting assumes a weak learner:
 Cannot PAC learn for arbitrary ε and δ
 Models are (slightly) better than random guessing

 General approach
 Learns an additive model
 Greedily adds one model at a time
 Focuses the current model on correcting
examples that are incorrectly predicted
AdaBoost: First Practical Booster
53

 Works for binary classification problems


 Learns additive model iteratively
F(X) = α1h1(X) + α2h2(X) + … + αtht(X)

 Two big ideas:


 Idea 1: Assign weights to examples to focus
attention on misclassified examples
 Idea 2: Prediction is a weighted vote based on
how accurate each hi is
Boosting Example
54

[Figure: 2-D data set of + and − examples]

 Assume that we are going to make one axis-parallel
cut through feature space


Boosting Example
55

[Figure: the first axis-parallel split; three examples are misclassified]

 Errors: 3
 Upweight the mistakes, downweight everything
else
Boosting Example
56

[Figure: the reweighted data set after the first iteration]


Boosting Example
57

[Figure: the reweighted data set after the first iteration]

Three key questions for AdaBoost:


1. What model should we pick (in theory)?
2. How should we set α ?
3. What practical details are important?
AdaBoost Setting
58

 Given: S = {(x_j, y_j)} with j ∈ {1,…,n} and y ∈ {-1,+1}

 Learn: F(X) = α1h1(X) + α2h2(X) + … + αtht(X)

 Prediction: sign\big(\sum_t \alpha_t h_t(x_i)\big), i.e., -1 (negative) or +1 (positive)

1. What model should we pick (in theory)?
2. How should we set α?
3. What practical details are important?
What Classifier Should We Pick?
59

 Suppose our current function is:


Fm-1(X) = α1h1(X) + α2h2(X) + … + αm-1hm-1(X)

 Goal:
 Pick αmhm(X) to add to the model
 To minimize the exponential error:
E = \sum_{i=1}^{N} \exp\big(-y_i (F_{m-1}(x_i) + \alpha_m h_m(x_i))\big)


Understanding the Error:
60
Exponential Loss Function
E = \sum_{i=1}^{N} \exp\big(-y_i (F_{m-1}(x_i) + \alpha_m h_m(x_i))\big)

 Loss is small if the predicted and true label have the same sign:
sign(y) = sign(\hat{y}) ⇒ -y\hat{y} < 0
sign(y) ≠ sign(\hat{y}) ⇒ -y\hat{y} > 0

 Loss drops quickly, e.g., for the case y = 1:
e^{-(1)(-2)} = 7.4,  e^{-(1)(-1)} = 2.7

 Loss is always > 0

[Figure: exponential loss as a function of the prediction \hat{y} for the case y = 1]
What Classifier Should We Pick?
61
 Goal: Minimize E = \sum_{i=1}^{N} \exp\big(-y_i (F_{m-1}(x_i) + \alpha_m h_m(x_i))\big)

 Let w_i^{(m)} = \exp\big(-y_i F_{m-1}(x_i)\big)

This is an error term per example that is fixed, because we
cannot change classifiers 1 to m-1 in the ensemble


What Classifier Should We Pick?
62
 Goal: Minimize E = \sum_{i=1}^{N} \exp\big(-y_i (F_{m-1}(x_i) + \alpha_m h_m(x_i))\big)

 Let w_i^{(m)} = \exp\big(-y_i F_{m-1}(x_i)\big) and rewrite the error:

E = \sum_{y_i = h_m(x_i)} w_i^{(m)} e^{-\alpha_m} + \sum_{y_i \neq h_m(x_i)} w_i^{(m)} e^{\alpha_m}
(weight of correct predictions)        (weight of incorrect predictions)

E = \sum_{i=1}^{N} w_i^{(m)} e^{-\alpha_m} + (e^{\alpha_m} - e^{-\alpha_m}) \sum_{y_i \neq h_m(x_i)} w_i^{(m)}
The first term pretends all of h_m's predictions are correct and does not depend on h_m;
the h_m that minimizes the second sum therefore minimizes E!
Setting α𝑚
63

 Let W_c = \sum_{y_i = h_m(x_i)} w_i^{(m)} and W_{ic} = \sum_{y_i \neq h_m(x_i)} w_i^{(m)}

 Then E = W_c e^{-\alpha_m} + W_{ic} e^{\alpha_m}

\frac{dE}{d\alpha_m} = -W_c e^{-\alpha_m} + W_{ic} e^{\alpha_m} = 0
-W_c + W_{ic} e^{2\alpha_m} = 0
\alpha_m = \frac{1}{2} \ln \frac{W_c}{W_{ic}}
\alpha_m = \frac{1}{2} \ln \frac{1 - \varepsilon_m}{\varepsilon_m}
with \varepsilon_m = \frac{\sum_{y_i \neq h_m(x_i)} w_i^{(m)}}{\sum_i w_i^{(m)}}
AdaBoost
64

Given: S = {(x_j, y_j)} where j ∈ {1,…,n}, integer T

w_i^{(1)} = 1/n                                          // all examples start with the same weight
for t = 1 to T
    Find a classifier h_t with small weighted error ε_t = \sum_{h_t(x_i) \neq y_i} w_i^{(t)}
    if ε_t > 1/2 then break
    β_t = ε_t / (1 - ε_t)
    if h_t(x_i) = y_i then w_i^{(t+1)} = w_i^{(t)} β_t    // down-weight correct predictions
    w_i^{(t+1)} = w_i^{(t+1)} / \sum_j w_j^{(t+1)}        // normalize weights
    α_t = ln(1/β_t)
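A compact Python sketch of the pseudocode above, using a scikit-learn decision tree as the weak learner via sample weights (Approach 1 on the next slide); labels must be in {-1, +1}, and the function names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50, max_depth=1):
    """y must be in {-1, +1}; the weak learner is a depth-bounded tree trained on weighted data."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # all examples start with the same weight
    models, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=max_depth)
        h.fit(X, y, sample_weight=w)           # learner adapted to weighted instances
        pred = h.predict(X)
        eps = w[pred != y].sum()               # weighted error
        if eps >= 0.5:
            break                              # weak learner is no better than chance
        eps = max(eps, 1e-10)                  # guard against a perfect weak learner
        beta = eps / (1.0 - eps)
        w[pred == y] *= beta                   # down-weight correct predictions
        w /= w.sum()                           # normalize weights
        models.append(h)
        alphas.append(np.log(1.0 / beta))
    return models, alphas

def adaboost_predict(X, models, alphas):
    score = sum(a * h.predict(X) for a, h in zip(alphas, models))
    return np.sign(score)
```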
AdaBoost in Practice
65

 Typically use depth bounded decision tree


 Sometimes stumps: Just a single split
 Often depth 5 or 6

 Dealing with weighted instances


 Approach 1: Adapt the learner to learn from
weighted instances; trivial for decision trees:
just use weighted counts in the split criteria
 Approach 2: Sample a large (≫n) set of
unweighted instances according to the weight
distribution and run learner
