
Gradient Boosting.

Presented at NYC INFORMS, March 2016

Leonardo Auslender
Independent Statistical Consultant

Leonardo.Auslender ‘at’ gmail ‘dot’ com.

Topics to cover:

1) Why more techniques? Bias-variance tradeoff.
2) Gradient Boosting.
   1) Definition and algorithm.
   2) Gradient-descent optimization method.
   3) Innards of GB.
   4) Partial Dependency Plots (PDP).
   5) Case Studies.
   6) On the practice of GB.

1) Bias-Variance Trade-off

1) Why more techniques? Bias-variance tradeoff.

(A broken clock is right twice a day: variance of estimation = 0, but bias is extremely high. A thermometer that is accurate overall but reports higher temperatures at night: unbiased, but higher variance.)

Model error can be broken down mathematically into three components. Let f be the target function and f-hat the empirically derived estimate; then the expected prediction error at a point decomposes as

   E[(y - f-hat(x))^2] = Bias[f-hat(x)]^2 + Var[f-hat(x)] + irreducible error.

(Figure illustrating bias and variance. Credit: Scott Fortmann-Roe, web.)

Let X1, X2, ..., Xn be i.i.d. random variables with mean mu and variance sigma-squared.

It is well known that E(X-bar) = mu and Var(X-bar) = sigma-squared / n.

=> By just averaging estimates, we lower the variance while keeping the same bias.

=> Let us find methods that lower or stabilize the variance (at least) while keeping bias low. (A small simulation sketch follows.)
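To make the variance-reduction claim concrete, here is a minimal simulation sketch in Python (not from the original slides); the parameter values are arbitrary illustrations.

# A minimal sketch: averaging n i.i.d. estimates keeps the mean (bias)
# unchanged while shrinking the variance by a factor of 1/n.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 10.0, 2.0, 25, 100_000

single = rng.normal(mu, sigma, size=reps)                       # one estimate per replication
averaged = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)   # average of n estimates

print(single.mean(), single.var())       # ~ mu, ~ sigma^2
print(averaged.mean(), averaged.var())   # ~ mu, ~ sigma^2 / n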

2) J. Friedman: Gradient Boosting
(Salford's TreeNet, previously MART; R's gbm; SAS Gradient Boosting).

Detour: underlying idea of boosting (NOT GB).

Start with a model M(X) and obtain 80% accuracy, or 60% R-squared, etc.

Then Y = M(X) + error1. Hypothesize that error1 is still correlated with Y.

Therefore error1 = G(X) + error2, and in general error(t-1) = Z(X) + error(t) =>

Y = M(X) + G(X) + ... + Z(X) + error(t). If we find optimal beta weights to combine the models, then

Y = beta(1) M(X) + beta(2) G(X) + ... + beta(t) Z(X) + error(t).

This is an "ensemble method" with a single data set, iteratively reweighting observations according to the previous error, especially focusing on wrongly classified observations.

Philosophy: focus on the points that were most difficult to classify in the previous step by reweighting observations. (A small residual-fitting sketch follows.)
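A minimal sketch of this residual-fitting idea, using two decision-tree stumps on synthetic data; this is illustrative code of my own, not the presenter's, and the data and model choices are assumptions.

# Fit a model, fit a second model to its residuals, and combine the predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=500)

m = DecisionTreeRegressor(max_depth=1).fit(X, y)        # first weak model M(X)
resid1 = y - m.predict(X)                               # error1
g = DecisionTreeRegressor(max_depth=1).fit(X, resid1)   # G(X) fitted to error1

combined = m.predict(X) + g.predict(X)                  # Y ~ M(X) + G(X)
print(np.mean((y - m.predict(X)) ** 2), np.mean((y - combined) ** 2))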
2.1 Definition and Algorithm.

Quick description of GB using trees (GBDT).
1) Create a very small tree as the initial model, a 'weak' learner (e.g., a tree with two terminal nodes, i.e., depth = 1). 'Weak' avoids over-fitting and local minima, and produces a prediction, F1, for each observation.
2) Each tree allocates a probability of the event or a mean value to each terminal node, according to the nature of the dependent variable or target.
3) Compute the "residuals" (prediction errors) for every observation (if the target is 0-1, apply the logistic transformation, p / (1 - p), to linearize them).
4) Use the residuals as the new target variable and grow a second small tree on them (second stage of the process, same depth). To insure against overfitting, use a random sample without replacement ("stochastic gradient boosting").
5) Once the second stage is complete, the new model is the combination of the two trees, Tree1 and Tree2, with predictions F1 + gamma * F2, where gamma is a multiplier (shrinkage) factor.
6) Iterate the procedure: compute the residuals from the most recent tree, which become the target of the new model, etc.
7) In the case of a binary target variable, each tree produces at least some nodes in which the 'event' is the majority ('events' are typically more difficult to identify, since most data sets contain a very low proportion of 'events' in the usual case).
8) The final score for each observation is obtained by summing the scores (probabilities) of every tree for that observation. (A sketch with scikit-learn follows this list.)
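For concreteness, here is a hedged sketch of the procedure using scikit-learn's GradientBoostingClassifier on synthetic data; the presenter's own runs used other software (TreeNet/SAS-style tools), so the data, names, and parameter values below are illustrative assumptions, not the original setup.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=6, weights=[0.8, 0.2],
                           random_state=0)
X_trn, X_val, y_trn, y_val = train_test_split(X, y, random_state=0)

gbdt = GradientBoostingClassifier(
    max_depth=1,        # 'weak' learner: two-terminal-node stumps
    n_estimators=100,   # number of boosting iterations
    learning_rate=0.1,  # the gamma multiplier (shrinkage)
    subsample=0.5,      # sample without replacement -> stochastic gradient boosting
    random_state=0,
).fit(X_trn, y_trn)

print(gbdt.score(X_val, y_val))   # final score sums the contributions of all trees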

Why does it work? Why “gradient” and “boosting”?


More details.

Friedman's 2001 GB algorithm requires:

1) Data (Y, X), with Y of dimension (N x 1) and X of dimension (N x p).
2) A chosen number of iterations M.
3) A chosen loss function Psi(y, f) and its corresponding gradient.
4) A chosen base learner h(x, theta), say shallow trees.

Algorithm:
1: initialize f0 with a constant
2: for t = 1 to M do
3:    compute the negative gradient g_t(x)
4:    fit a new base-learner function h(x, theta_t) to the negative gradient
5:    find the best gradient-descent step size rho_t:
      rho_t = arg min_rho  Sum_{i=1..N} Psi[y_i, f_{t-1}(x_i) + rho * h(x_i, theta_t)]
6:    update the function estimate: f_t <- f_{t-1} + rho_t * h(x, theta_t)
7: end for

(All f functions are function estimates, i.e., 'hats'. A sketch in Python follows.)
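A minimal sketch of the generic algorithm for the squared-error loss, where the negative gradient is simply the residual y - f(x); the line search and base learner below are my own illustrative choices, not the original implementation.

import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, M=50, max_depth=1):
    f = np.full(len(y), y.mean())            # 1: initialize f0 with a constant
    learners, steps = [], []
    for _ in range(M):                       # 2: for t = 1 to M
        g = y - f                            # 3: negative gradient (squared loss)
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, g)   # 4: base learner
        hx = h.predict(X)
        rho = minimize_scalar(lambda r: np.sum((y - (f + r * hx)) ** 2)).x   # 5: step size
        f = f + rho * hx                     # 6: update the function estimate
        learners.append(h)
        steps.append(rho)
    return learners, steps, y.mean()

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(400, 1))
y = X[:, 0] ** 2 + rng.normal(size=400)
learners, steps, f0 = gradient_boost(X, y)
pred = f0 + sum(r * h.predict(X) for h, r in zip(learners, steps))
print(np.mean((y - pred) ** 2))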


Setting.

Hypothesize the existence of a function Y = f(X, betas, error). Change of paradigm: no MLE (e.g., logistic), but a loss function.

We minimize the loss function itself; its expected value is called the risk. Many different loss functions are available: Gaussian, 0-1, etc.

A loss function describes the loss (or cost) associated with all possible decisions. Different decision functions (predictor functions) tend to lead to different types of mistakes. The loss function tells us which type of mistakes we should be more concerned about.

For instance, when estimating demand, the decision function could be a linear equation and the loss function could be squared or absolute error.

The best decision function is the one that yields the lowest expected loss, and the expected loss of an estimator is itself called its risk. The 0-1 loss assigns 0 to a correct prediction and 1 to an incorrect one. (A small risk-comparison sketch follows.)
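A small illustrative sketch (not from the slides) showing how the choice of loss changes the best decision: under the squared-error loss the empirical risk is minimized near the mean, under the absolute-error loss near the median.

import numpy as np

rng = np.random.default_rng(3)
demand = rng.exponential(scale=10.0, size=10_000)        # skewed "demand" data

candidates = np.linspace(0, 40, 2001)
sq_risk  = [np.mean((demand - c) ** 2) for c in candidates]    # empirical risk, squared loss
abs_risk = [np.mean(np.abs(demand - c)) for c in candidates]   # empirical risk, absolute loss

print(candidates[np.argmin(sq_risk)], demand.mean())        # ~ mean
print(candidates[np.argmin(abs_risk)], np.median(demand))   # ~ median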

Key details.

Friedman's 2001 GB algorithm needs:
1) A loss function (usually determined by the nature of Y: binary, continuous, ...) (no MLE).
2) A weak learner, typically a tree stump or a spline, a marginally better classifier than random (but by how much?).

(Figure: residual fitting at successive stages.)
The L2 error penalizes symmetrically away from 0, Huber penalizes less than OLS outside [-1, 1], and the Bernoulli and AdaBoost losses are very similar. Note that Y takes values in {-1, 1} in the 0-1 case here. (A small sketch of these losses follows.)
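A small sketch (my own, using one common scaling of each loss) of the losses being compared, written as functions of the margin m = y*f(x) with y in {-1, +1}, and Huber as a function of the residual.

import numpy as np

m = np.linspace(-2.0, 2.0, 9)                  # margin values y * f(x)

squared   = (1.0 - m) ** 2                     # L2 error: symmetric penalty away from m = 1
bernoulli = np.log1p(np.exp(-2.0 * m))         # binomial deviance (one common scaling)
adaboost  = np.exp(-m)                         # exponential (AdaBoost) loss, similar shape

r = np.linspace(-3.0, 3.0, 9)                  # residuals y - f(x), for Huber
delta = 1.0
huber = np.where(np.abs(r) <= delta,
                 0.5 * r ** 2,
                 delta * (np.abs(r) - 0.5 * delta))   # penalizes less than OLS outside [-1, 1]

for name, vals in [("squared", squared), ("bernoulli", bernoulli),
                   ("adaboost", adaboost), ("huber", huber)]:
    print(name, np.round(vals, 2))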

2.2 Gradient Descent.
Gradient Descent.

Gradient descent is a method for finding the minimum of a function.

The gradient is the multivariate generalization of the derivative of a function in one dimension to many dimensions; i.e., the gradient is the vector of partial derivatives. In one dimension, the gradient is the slope of the tangent to the function.

It is easier to work with convex and "smooth" functions.

(Figure: convex vs. non-convex functions.)
"Gradient" descent

The method of gradient descent is a first-order optimization algorithm based on taking small steps in the direction of the negative gradient at a point on the curve, in order to find the (hopefully global) minimum value of the loss function. If we want to search for the maximum value instead, the positive gradient is used and the method is then called gradient ascent.

It requires a starting point, and possibly many, to avoid local minima. (A one-dimensional sketch follows.)
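A minimal one-dimensional sketch (my own example) of gradient descent on the convex function f(x) = (x - 3)^2, whose gradient is 2(x - 3).

def gradient_descent(start, step=0.1, iters=100):
    x = start
    for _ in range(iters):
        grad = 2.0 * (x - 3.0)     # gradient at the current point
        x = x - step * grad        # small step in the negative-gradient direction
    return x

print(gradient_descent(start=-10.0))   # converges near the minimum at x = 3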

2.3 Innards of Gradient Boosting.

Comparing a full tree (depth = 6) to boosted-tree residuals by iteration.

Two GB versions: one with the raw 20% of events (M1) and one with a 50/50 mixture of events (M2). The non-GB tree (referred to as maxdepth 6, for the M1 data set) is the most biased, followed by M1. Notice that M2 stabilizes earlier than M1.

(Figure: average residuals by iteration and model name in gradient boosting; the vertical line marks where the mean stabilizes. Series: MEAN_RESID_M1_TRN_TREES and MEAN_RESID_M2_TRN_TREES; tree depth-6 reference line at about 2.83E-15.)
Comparing a full tree (depth = 6) to boosted-tree residuals by iteration (continued).

Trees have the highest residual variance, followed by M2 and then M1. M2 stabilizes earlier as well. In conclusion, M2 has lower bias and higher variance than M1 in this example.
(Figure: variance of residuals by iteration in gradient boosting; the vertical line marks where the variance stabilizes. Series: VAR_RESID_M1_TRN_TREES and VAR_RESID_M2_TRN_TREES; depth-6 tree reference = 0.145774.)
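The residual diagnostics above come from the presenter's own runs; as an assumption-laden stand-in, a similar mean/variance-of-residuals trace per iteration can be produced with scikit-learn's staged predictions on synthetic data.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=3000, weights=[0.8, 0.2], random_state=0)
gb = GradientBoostingClassifier(max_depth=1, n_estimators=10, random_state=0).fit(X, y)

for it, p in enumerate(gb.staged_predict_proba(X), start=1):
    resid = y - p[:, 1]                  # residual on the probability scale
    print(f"iteration {it}: mean={resid.mean():.6f}, var={resid.var():.6f}")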

2.4 Partial Dependency Plots.

Partial Dependency Plots (PDP).

Because of GB's black-box nature, these plots show the effect of a predictor on the modeled response once all other predictors have been marginalized out; the other predictors are usually fixed at a constant value, such as their mean.

The graphs may not capture the nature of variable interactions, especially if interactions significantly affect the model outcome.

Formally, the PDP of F(x1, x2, ..., xp) on X is E(F) taken over all variables except X. Thus, for a given value of X, the PDP is the average of the training-set predictions with X kept constant at that value.

Since GB, boosting, bagging, etc. are BLACK BOX models, use PDPs to obtain model interpretation. (A sketch with scikit-learn follows.)
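A hedged sketch of one way to compute a PDP, using scikit-learn's inspection module on synthetic data (the original plots were produced with other software; return keys may differ slightly across scikit-learn versions).

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import partial_dependence

X, y = make_classification(n_samples=2000, n_features=6, random_state=0)
gb = GradientBoostingClassifier(max_depth=3, n_estimators=100, random_state=0).fit(X, y)

# Average prediction as feature 0 is held at each grid value while the
# remaining predictors are averaged over the training data.
pd_result = partial_dependence(gb, X, features=[0], grid_resolution=20)
print(pd_result["average"])   # the PDP values over the grid for feature 0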

2.5 Comparative Study.

Analytical problem to investigate.

Optical health-care insurance fraud. Longer care typically involves higher treatment costs, and the insurance company has to set up reserves immediately, as soon as a case is opened. Sometimes doctors are involved in the fraud.

The aim of the project is to predict fraudulent charges. Thus it is a classification problem, and we'll use a battery of models and compare them, with the original training sample and with a 50/50 resampled version of it.
     
 
Basic information on the original data sets:

                                 (raw)           (50/50)
  Data set name .............    train           sampled50_50
  Num. observations .........    3,595           1,133
  Validation data set .......    validata        validata50_50
  Num. observations .........    2,365           4,827
  Test data set .............    (none)          (none)
  Num. observations .........    0               0
  Dep. variable .............    fraud           fraud
  Pct event prior TRN .......    20.389          50.838
  Pct event prior VAL .......    19.281          12.699
  Pct event prior TEST ......    -               -
Variable           Label
FRAUD              Fraudulent activity (yes/no)
TOTAL_SPEND        Total spent on opticals
DOCTOR_VISITS      Total visits to a doctor
NO_CLAIMS          Number of claims made recently
MEMBER_DURATION    Membership duration
OPTOM_PRESC        Number of opticals claimed
NUM_MEMBERS        Number of members covered

Requested Models: Names & Descriptions

Model Name   Model Description
M1           Raw DATA 20 pct, maxdepth 1, num iterations 3
M2           Raw DATA 20 pct, maxdepth 1, num iterations 10
M3           Raw DATA 20 pct, maxdepth 3, num iterations 3
M4           Raw DATA 20 pct, maxdepth 3, num iterations 10
M5           Raw DATA 20 pct, maxdepth 5, num iterations 3
M6           Raw DATA 20 pct, maxdepth 5, num iterations 10

Additional information: logistic with backward selection and trees are run for M1 only; M1 – M6 are run for GB; all models are evaluated at the TRN and VAL stages. Naming convention: M#_modeling.method and sometimes M#_TRN/VAL_modeling.method. Bagging is also run on M1 but reported only for AUROC, to avoid clutter. M1 – M6 vary the depth and the number of iterations for GB, with p = 6 predictors.

For instance, M1_logistic_backward means case M1 using a logistic regression with backward selection. M1 – M6: variations in depth and number of iterations for the gradient boosting runs. (A sketch of this grid follows below.)

Ensemble: take all model predictions at the end of M6 as predictors, run a logistic regression against the actual dependent variable, and report.
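As an illustrative stand-in for the M1 – M6 grid (the actual runs used the fraud data and other software), here is a sketch with scikit-learn on synthetic data; the data set and names are hypothetical.

from itertools import product
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5960, n_features=6, weights=[0.8, 0.2],
                           random_state=0)            # stand-in for the fraud data
X_trn, X_val, y_trn, y_val = train_test_split(X, y, test_size=0.4, random_state=0)

# M1..M6: maxdepth in {1, 3, 5} crossed with num iterations in {3, 10}
for i, (depth, iters) in enumerate(product([1, 3, 5], [3, 10]), start=1):
    gb = GradientBoostingClassifier(max_depth=depth, n_estimators=iters,
                                    random_state=0).fit(X_trn, y_trn)
    auroc = roc_auc_score(y_val, gb.predict_proba(X_val)[:, 1])
    print(f"M{i}: maxdepth={depth}, iterations={iters}, AUROC={auroc:.3f}")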
(Reporting area for model coefficients and selected variables. Similar for 50/50.)
As we increase the number of iterations and the maximum depth, a larger number of variables is selected.
50/50: trees are seriously affected, not so GB.

GB and trees do not fully agree with the logistic findings. The tree methods find NO_CLAIMS to be the primary variable, while logistic seems to value all variables except DOCTOR_VISITS equally. In the 50/50 case, DOCTOR_VISITS is irrelevant and leads to quasi-complete separation.

(Ensemble p-values reporting area.)

50/50: most models are insignificant except for M6_gradient_boosting.
(Partial dependency plots reporting area.)

50/50: scales shifted up.

A very interesting, almost U-shaped relationship, conditioned on the other variables in the model.
(GOF measures reporting area.)

Probabilities shifted up for the 50/50 case.


GB certainly does not over-fit.
50/50: the overall ranking hasn't changed. Notice the decline in trees and the stability in bagging. Some evidence of over-fitting.
50/50: ranking same.

2.6 Comments on the Practice of Gradient Boosting.

Comments on GB.

1) It is not immediately apparent what a weak classifier is (for instance, by varying depth in our case). Likewise, the number of iterations is a big issue. In our simple example, M6 GB was the best performer. Still, the overall modeling benefited from ensembling all the methods, as measured by AUROC, cumulative lift, or the ensemble p-values.

2) The posterior probability ranges are vastly different, and thus the tendency to classify observations by the 0.5 threshold is too simplistic.

3) The PDPs show that different methods find distinct multivariate structures. Interestingly, the ensemble p-values show a decreasing tendency for logistic and trees and a strong S-shaped tendency for M6 GB, which could mean that M6 GB alone tends to overshoot its predictions.

4) GB is relatively unaffected by the 50/50 mixture.

Drawbacks of GB.

1) Memory requirements can be very large, especially with many iterations; a typical problem of ensemble methods.

2) A large number of iterations => slow speed in obtaining predictions => on-line scoring may require a trade-off between complexity and the time available. Once GB is learned, parallelization certainly helps.

3) No simple algorithm to capture interactions, because of the base learners.

4) No simple rules to determine gamma, the number of iterations, or the depth of the simple learner. We need to try different combinations and possibly recalibrate over time.

5) Still, it is one of the most powerful methods available.

References

Friedman, J. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5), 1189–1232. doi:10.1214/aos/1013203451

Earlier literature on combining methods:

Winkler, R.L. and Makridakis, S. (1983). The combination of forecasts. Journal of the Royal Statistical Society, Series A, 146(2), 150–157.

Makridakis, S. and Winkler, R.L. (1983). Averages of forecasts: some empirical results. Management Science, 29(9), 987–996.

Bates, J.M. and Granger, C.W.J. (1969). The combination of forecasts. Operational Research Quarterly, 20(4), 451–468.
