Leonardo Auslender
Independent Statistical Consultant
2) Gradient Boosting.
1) Definition and algorithm.
2) Gradient-descent optimization method.
3) Innards of GB.
4) Partial Dependency Plots (PDP)
5) Case Studies.
6) On the practice of GB.
Model error can be decomposed mathematically into three components: bias, variance, and irreducible error. Let f be the true estimating function and f-hat the empirically derived function.
Start with a model M(X) and obtain, say, 80% accuracy, or 60% R2, etc. Write
Y = M(X) + G(X) + … + Z(X) + error(t).
If we can find optimal beta weights for the combined models, then the weighted ensemble fits at least as well in-sample as any single model.
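A minimal sketch of the "optimal beta weights" idea, under the assumption (for illustration only) that the component models' predictions are combined by least squares against Y; the data and models here are simulated stand-ins:

```python
# Hypothetical illustration: combine predictions of several models M, G, ..., Z
# by finding least-squares weights beta in Y = b1*M(X) + b2*G(X) + ... + error.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=200)                       # actual response
preds = np.column_stack([y + rng.normal(scale=s, size=200)   # three imperfect models
                         for s in (0.5, 1.0, 2.0)])

beta, *_ = np.linalg.lstsq(preds, y, rcond=None)   # optimal combination weights
combined = preds @ beta

def mse(p):
    return float(np.mean((y - p) ** 2))

# In-sample, the least-squares combination cannot do worse than any single model,
# because each single model is itself one feasible weight vector.
print(mse(combined), min(mse(preds[:, j]) for j in range(3)))
```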
Algorithm:
1: initialize f0 with a constant
2: for t = 1 to M do
3:   compute the negative gradient gt(x)
4:   fit a new base-learner function h(x, θt)
5:   find the best gradient-descent step-size ρt:
       ρt = arg min_ρ Σ_{i=1..N} ψ[yi, f_{t−1}(xi) + ρ h(xi, θt)]
6:   update the function estimate:
       ft ← f_{t−1} + ρt h(x, θt)
7: end for
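The steps above can be sketched in code for the squared-error loss ψ(y, f) = (y − f)²/2, where the negative gradient is simply the residual y − f. The one-split regression stump used as base learner, and the shrinkage factor, are assumptions for illustration; any weak learner works:

```python
# Minimal gradient boosting for squared-error loss, following the boxed algorithm.
import numpy as np

def fit_stump(x, g):
    """Fit a one-split stump to the negative gradient g (step 4)."""
    best = None
    for t in np.unique(x)[:-1]:
        left, right = g[x <= t], g[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda z: np.where(z <= t, lv, rv)

def gradient_boost(x, y, M=50, shrink=0.1):
    f = np.full_like(y, y.mean())        # step 1: f0 is a constant
    for _ in range(M):                   # step 2
        g = y - f                        # step 3: negative gradient = residual
        h = fit_stump(x, g)              # step 4: fit base learner
        rho = 1.0                        # step 5: closed form for L2 with stump means
        f = f + shrink * rho * h(x)      # step 6: update (with shrinkage, an add-on)
    return f

rng = np.random.default_rng(1)
x = rng.uniform(0, 6, 300)
y = np.sin(x) + rng.normal(scale=0.1, size=300)
fit = gradient_boost(x, y)
print(np.mean((y - fit) ** 2))   # training MSE shrinks as M grows
```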
We minimize the loss function itself; its expected value is called the risk. Many different loss functions are available: Gaussian (squared error), 0-1, etc.
A loss function describes the loss (or cost) associated with all possible decisions. Different decision (predictor) functions tend to lead to different types of mistakes, and the loss function tells us which type of mistakes we should be more concerned about.
The best decision function is the one that yields the lowest expected loss; the expected loss of an estimator is called its risk. The 0-1 loss assigns 0 to a correct prediction and 1 to an incorrect one.
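A tiny worked example of how the choice of loss changes which predictor is "best"; the labels and probability columns are made up for illustration:

```python
# Empirical risk (average loss) under two losses. The 0-1 loss counts mistakes;
# squared error penalizes by magnitude, so the two can rank predictors differently.
import numpy as np

y   = np.array([1, 1, 1, 0, 0])                     # true labels
p_a = np.array([0.6, 0.6, 0.6, 0.4, 0.4])           # mildly confident, never wrong
p_b = np.array([0.99, 0.99, 0.99, 0.55, 0.01])      # very confident, one mistake

def risk_01(p):   # 0-1 loss with the 0.5 threshold
    return np.mean((p >= 0.5).astype(int) != y)

def risk_l2(p):   # squared-error (Gaussian) loss
    return np.mean((y - p) ** 2)

print(risk_01(p_a), risk_01(p_b))   # a: 0.0,  b: 0.2  -> 0-1 prefers a
print(risk_l2(p_a), risk_l2(p_b))   # a: 0.16, b: ~0.0606 -> L2 prefers b
```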
[Figures: residual-fitting illustrations.]
07/10/2022 Leonardo Auslender Copyright 2004 13
The L2 (squared-error) loss penalizes symmetrically away from 0; the Huber loss penalizes less than OLS outside [-1, 1]; the Bernoulli and AdaBoost losses are very similar. Note that Y ∈ {-1, 1} in the 0-1 case here.
[Figure: convex vs. non-convex loss functions.]
The method of gradient descent is a first-order optimization algorithm based on taking small steps in the direction of the negative gradient at the current point on the curve, in order to find the (hopefully global) minimum of the loss function. If the maximum value is sought instead, the positive gradient is used and the method is called gradient ascent.
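The description above can be sketched on a one-parameter convex loss; the loss L(w) = (w − 3)² and the step size are assumptions for illustration:

```python
# Minimal gradient descent: step against the gradient until the change is tiny.
def grad(w):                  # dL/dw for L(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w, step = 0.0, 0.1
for _ in range(200):
    w_new = w - step * grad(w)        # negative gradient => descent
    if abs(w_new - w) < 1e-9:         # stop when steps become negligible
        break
    w = w_new

print(round(w, 4))   # -> 3.0, the global minimum, since this loss is convex
```

Using +step * grad(w) instead would climb the curve: gradient ascent.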
Two GB versions: one with the raw 20% event rate (M1), and one with a 50/50 mixture of events (M2). The non-GB tree (max depth 6, fit to the M1 data set) is the most biased, followed by M1. Notice that M2 stabilizes earlier than M1.
[Figure: MEAN_RESID_M1_TRN_TREES and MEAN_RESID_M2_TRN_TREES vs. iteration (0-10); mean training residuals for both models hover around zero, on the order of 1e-15.]
Comparing the full tree (depth = 6) to boosted-tree residuals by iteration.
[Figure: VAR_RESID_M1_TRN_TREES and VAR_RESID_M2_TRN_TREES vs. iteration (0-10); residual variance falls from about 0.196 toward about 0.122, versus 0.145774 for the depth-6 tree.]
Due to GB's black-box nature, these plots show the effect of a predictor on the modeled response once all other predictors have been marginalized out. (In simpler profile plots, the remaining predictors are instead fixed at a constant value, such as the mean.)
The graphs may not capture the nature of variable interactions, especially if an interaction significantly affects the model outcome.
Formally, the PDP of F(x1, …, xp) on X is E(F) taken over all variables except X. Thus, for a given value of X, the PDP is the average of the training-set predictions with X held constant.
Since GB, boosting, bagging, etc. are BLACK BOX models, use PDPs to obtain model interpretation.
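The formal definition above translates directly into code: for each grid value of the predictor of interest, hold that column constant for all training rows, predict, and average. The quadratic stand-in model is an assumption for illustration, not a fitted GB:

```python
# Minimal partial-dependence computation over a stand-in "black-box" model.
import numpy as np

def model(X):                          # hypothetical fitted model F(x1, x2)
    return X[:, 0] ** 2 + 3.0 * X[:, 1]

def partial_dependence(predict, X, col, grid):
    pd_vals = []
    for v in grid:
        Xv = X.copy()
        Xv[:, col] = v                 # hold the predictor of interest constant
        pd_vals.append(predict(Xv).mean())   # average over the other predictors
    return np.array(pd_vals)

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2))
grid = np.linspace(-2, 2, 5)
print(partial_dependence(model, X, 0, grid))   # for this model: v**2 + 3*mean(x2)
```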
Model run summaries for the two training schemes:

                          M1 (raw)        M2 (50/50)
Data set name             train           sampled50_50
Num observations          3,595           1,133
Validation data set       validata        validata50_50
Num observations          2,365           4,827
Test data set             (none)          (none)
Num observations          0               0
Dep variable              fraud           fraud
Pct Event Prior TRN       20.389          50.838
Pct Event Prior VAL       19.281          12.699
Pct Event Prior TEST      .               .
Variable            Label
FRAUD               Fraudulent activity yes/no
TOTAL_SPEND         Total spent on opticals
DOCTOR_VISITS       Total visits to a doctor
NO_CLAIMS           No. of claims made recently
MEMBER_DURATION     Membership duration
OPTOM_PRESC         Number of opticals claimed
NUM_MEMBERS         Number of members covered
Ensemble: take all model predictions at the end of M6 as predictors, run a logistic regression against the actual dependent variable, and report the results.
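This stacking step can be sketched as follows; the simulated labels and probability columns are stand-ins for the real model outputs, and plain gradient descent on the log-loss stands in for a statistics package's logistic fit:

```python
# Hypothetical stacking sketch: logistic regression of the actual dependent
# variable on the component models' predicted probabilities.
import numpy as np

def fit_logistic(P, y, iters=5000, lr=0.5):
    X = np.column_stack([np.ones(len(y)), P])   # intercept + model predictions
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))        # sigmoid
        w -= lr * X.T @ (p - y) / len(y)        # gradient of the log-loss
    return w

rng = np.random.default_rng(3)
y = (rng.uniform(size=400) < 0.2).astype(float)       # ~20% event rate, like M1
# two imperfect "model" probability columns, the first more informative
P = np.clip(np.column_stack([0.3 * y + rng.uniform(size=400) * 0.6,
                             0.1 * y + rng.uniform(size=400) * 0.8]), 0.01, 0.99)
w = fit_logistic(P, y)
print(w)   # w[1] > 0: the informative first column gets positive weight
```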
Notice the ensemble p-values (from the logistic stacking output).
Partial Dependency Plots: with the 50/50 sample the scales are shifted up.
Some evidence of over-fitting.
1) The 50/50 model yields the same ranking of observations.
2) The posterior probability ranges are vastly different, and thus the tendency to classify observations by the 0.5 threshold is too simplistic.