
Gradient Boosting.

Presented at NYC INFORMS, March 2016

Leonardo Auslender
Independent Statistical Consultant

Leonardo.Auslender ‘at’ gmail ‘dot’ com.

Topics to cover:

1) Why more techniques? Bias-variance tradeoff.
2) Gradient Boosting.
   1) Definition and algorithm.
   2) Gradient-descent optimization method.
   3) Innards of GB.
   4) Partial Dependency Plots (PDP).
   5) Case Studies.
   6) On the practice of GB.

1) Bias-Variance Trade-off

1) Why more techniques? Bias-variance tradeoff.

(A broken clock is right twice a day: variance of estimation = 0, but bias is extremely high. A thermometer that is accurate overall but reports higher temperatures at night: unbiased, but higher variance.)

Model error can be broken down mathematically into three components. Let f be the target function and f-hat the empirically derived estimate; then the expected prediction error at a point decomposes as

   E[(y - f-hat(x))^2] = Bias[f-hat(x)]^2 + Var[f-hat(x)] + irreducible error.

(Figure illustrating bias and variance. Credit: Scott Fortmann-Roe, web.)

Let X1, X2, ..., Xn be i.i.d. random variables with mean mu and variance sigma-squared.

It is well known that E(X-bar) = mu and Var(X-bar) = sigma-squared / n.

=> By just averaging estimates, we lower the variance while keeping the same bias.

=> Let us find methods that lower or stabilize the variance (at least) while keeping bias low. (A small simulation sketch follows.)
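To make the variance-reduction claim concrete, here is a minimal simulation sketch in Python (not from the original slides); the parameter values are arbitrary illustrations.

# A minimal sketch: averaging n i.i.d. estimates keeps the mean (bias)
# unchanged while shrinking the variance by a factor of 1/n.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 10.0, 2.0, 25, 100_000

single = rng.normal(mu, sigma, size=reps)                       # one estimate per replication
averaged = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)   # average of n estimates

print(single.mean(), single.var())       # ~ mu, ~ sigma^2
print(averaged.mean(), averaged.var())   # ~ mu, ~ sigma^2 / n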

2) J. Friedman: Gradient Boosting
(Salford's TreeNet, previously MART; R's gbm; SAS Gradient Boosting).

Detour: underlying idea of boosting (NOT GB).

Start with a model M(X) and obtain 80% accuracy, or 60% R-squared, etc.

Then Y = M(X) + error1. Hypothesize that error1 is still correlated with Y.

Therefore error1 = G(X) + error2, and in general error(t-1) = Z(X) + error(t) =>

Y = M(X) + G(X) + ... + Z(X) + error(t). If we find optimal beta weights to combine the models, then

Y = beta(1) M(X) + beta(2) G(X) + ... + beta(t) Z(X) + error(t).

This is an "ensemble method" with a single data set, iteratively reweighting observations according to the previous error, especially focusing on wrongly classified observations.

Philosophy: focus on the points that were most difficult to classify in the previous step by reweighting observations. (A small residual-fitting sketch follows.)
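A minimal sketch of this residual-fitting idea, using two decision-tree stumps on synthetic data; this is illustrative code of my own, not the presenter's, and the data and model choices are assumptions.

# Fit a model, fit a second model to its residuals, and combine the predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=500)

m = DecisionTreeRegressor(max_depth=1).fit(X, y)        # first weak model M(X)
resid1 = y - m.predict(X)                               # error1
g = DecisionTreeRegressor(max_depth=1).fit(X, resid1)   # G(X) fitted to error1

combined = m.predict(X) + g.predict(X)                  # Y ~ M(X) + G(X)
print(np.mean((y - m.predict(X)) ** 2), np.mean((y - combined) ** 2))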
2.1 Definition and Algorithm.

Quick description of GB using trees (GBDT).
1) Create a very small tree as the initial model, a 'weak' learner (e.g., a tree with two terminal nodes, i.e., depth = 1). 'Weak' avoids over-fitting and local minima, and produces a prediction, F1, for each observation.
2) Each tree allocates a probability of the event or a mean value to each terminal node, according to the nature of the dependent variable or target.
3) Compute the "residuals" (prediction errors) for every observation (if the target is 0-1, apply the logistic transformation, p / (1 - p), to linearize them).
4) Use the residuals as the new target variable and grow a second small tree on them (second stage of the process, same depth). To insure against overfitting, use a random sample without replacement ("stochastic gradient boosting").
5) Once the second stage is complete, the new model is the combination of the two trees, Tree1 and Tree2, with predictions F1 + gamma * F2, where gamma is a multiplier (shrinkage) factor.
6) Iterate the procedure: compute the residuals from the most recent tree, which become the target of the new model, etc.
7) In the case of a binary target variable, each tree produces at least some nodes in which the 'event' is the majority ('events' are typically more difficult to identify, since most data sets contain a very low proportion of 'events' in the usual case).
8) The final score for each observation is obtained by summing the scores (probabilities) of every tree for that observation. (A sketch with scikit-learn follows this list.)
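For concreteness, here is a hedged sketch of the procedure using scikit-learn's GradientBoostingClassifier on synthetic data; the presenter's own runs used other software (TreeNet/SAS-style tools), so the data, names, and parameter values below are illustrative assumptions, not the original setup.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=6, weights=[0.8, 0.2],
                           random_state=0)
X_trn, X_val, y_trn, y_val = train_test_split(X, y, random_state=0)

gbdt = GradientBoostingClassifier(
    max_depth=1,        # 'weak' learner: two-terminal-node stumps
    n_estimators=100,   # number of boosting iterations
    learning_rate=0.1,  # the gamma multiplier (shrinkage)
    subsample=0.5,      # sample without replacement -> stochastic gradient boosting
    random_state=0,
).fit(X_trn, y_trn)

print(gbdt.score(X_val, y_val))   # final score sums the contributions of all trees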

Why does it work? Why “gradient” and “boosting”?


More details.

Friedman's 2001 GB algorithm requires:

1) Data (Y, X), with Y of dimension (N x 1) and X of dimension (N x p).
2) A chosen number of iterations M.
3) A chosen loss function Psi(y, f) and its corresponding gradient.
4) A chosen base learner h(x, theta), say shallow trees.

Algorithm:
1: initialize f0 with a constant
2: for t = 1 to M do
3:    compute the negative gradient g_t(x)
4:    fit a new base-learner function h(x, theta_t) to the negative gradient
5:    find the best gradient-descent step size rho_t:
      rho_t = arg min_rho  Sum_{i=1..N} Psi[y_i, f_{t-1}(x_i) + rho * h(x_i, theta_t)]
6:    update the function estimate: f_t <- f_{t-1} + rho_t * h(x, theta_t)
7: end for

(All f functions are function estimates, i.e., 'hats'. A sketch in Python follows.)
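A minimal sketch of the generic algorithm for the squared-error loss, where the negative gradient is simply the residual y - f(x); the line search and base learner below are my own illustrative choices, not the original implementation.

import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, M=50, max_depth=1):
    f = np.full(len(y), y.mean())            # 1: initialize f0 with a constant
    learners, steps = [], []
    for _ in range(M):                       # 2: for t = 1 to M
        g = y - f                            # 3: negative gradient (squared loss)
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, g)   # 4: base learner
        hx = h.predict(X)
        rho = minimize_scalar(lambda r: np.sum((y - (f + r * hx)) ** 2)).x   # 5: step size
        f = f + rho * hx                     # 6: update the function estimate
        learners.append(h)
        steps.append(rho)
    return learners, steps, y.mean()

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(400, 1))
y = X[:, 0] ** 2 + rng.normal(size=400)
learners, steps, f0 = gradient_boost(X, y)
pred = f0 + sum(r * h.predict(X) for h, r in zip(learners, steps))
print(np.mean((y - pred) ** 2))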


Setting.

Hypothesize the existence of a function Y = f(X, betas, error). Change of paradigm: no MLE (e.g., logistic), but a loss function.

We minimize the loss function itself; its expected value is called the risk. Many different loss functions are available: Gaussian, 0-1, etc.

A loss function describes the loss (or cost) associated with all possible decisions. Different decision functions (predictor functions) tend to lead to different types of mistakes. The loss function tells us which type of mistakes we should be more concerned about.

For instance, when estimating demand, the decision function could be a linear equation and the loss function could be squared or absolute error.

The best decision function is the one that yields the lowest expected loss, and the expected loss of an estimator is itself called its risk. The 0-1 loss assigns 0 to a correct prediction and 1 to an incorrect one. (A small risk-comparison sketch follows.)
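A small illustrative sketch (not from the slides) showing how the choice of loss changes the best decision: under the squared-error loss the empirical risk is minimized near the mean, under the absolute-error loss near the median.

import numpy as np

rng = np.random.default_rng(3)
demand = rng.exponential(scale=10.0, size=10_000)        # skewed "demand" data

candidates = np.linspace(0, 40, 2001)
sq_risk  = [np.mean((demand - c) ** 2) for c in candidates]    # empirical risk, squared loss
abs_risk = [np.mean(np.abs(demand - c)) for c in candidates]   # empirical risk, absolute loss

print(candidates[np.argmin(sq_risk)], demand.mean())        # ~ mean
print(candidates[np.argmin(abs_risk)], np.median(demand))   # ~ median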

Key details.

Friedman's 2001 GB algorithm needs:
1) A loss function (usually determined by the nature of Y: binary, continuous, ...) (no MLE).
2) A weak learner, typically a tree stump or a spline, a marginally better classifier than random (but by how much?).

(Figure: residual fitting at successive stages.)
The L2 error penalizes symmetrically away from 0, Huber penalizes less than OLS outside [-1, 1], and the Bernoulli and AdaBoost losses are very similar. Note that Y takes values in {-1, 1} in the 0-1 case here. (A small sketch of these losses follows.)
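A small sketch (my own, using one common scaling of each loss) of the losses being compared, written as functions of the margin m = y*f(x) with y in {-1, +1}, and Huber as a function of the residual.

import numpy as np

m = np.linspace(-2.0, 2.0, 9)                  # margin values y * f(x)

squared   = (1.0 - m) ** 2                     # L2 error: symmetric penalty away from m = 1
bernoulli = np.log1p(np.exp(-2.0 * m))         # binomial deviance (one common scaling)
adaboost  = np.exp(-m)                         # exponential (AdaBoost) loss, similar shape

r = np.linspace(-3.0, 3.0, 9)                  # residuals y - f(x), for Huber
delta = 1.0
huber = np.where(np.abs(r) <= delta,
                 0.5 * r ** 2,
                 delta * (np.abs(r) - 0.5 * delta))   # penalizes less than OLS outside [-1, 1]

for name, vals in [("squared", squared), ("bernoulli", bernoulli),
                   ("adaboost", adaboost), ("huber", huber)]:
    print(name, np.round(vals, 2))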

2.2 Gradient Descent.
Gradient Descent.

Gradient descent is a method for finding the minimum of a function.

The gradient is the multivariate generalization of the derivative of a function in one dimension to many dimensions; i.e., the gradient is the vector of partial derivatives. In one dimension, the gradient is the slope of the tangent to the function.

It is easier to work with convex and "smooth" functions.

(Figure: convex vs. non-convex functions.)
"Gradient" descent

The method of gradient descent is a first-order optimization algorithm based on taking small steps in the direction of the negative gradient at a point on the curve, in order to find the (hopefully global) minimum value of the loss function. If we want to search for the maximum value instead, the positive gradient is used and the method is then called gradient ascent.

It requires a starting point, and possibly many, to avoid local minima. (A one-dimensional sketch follows.)
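A minimal one-dimensional sketch (my own example) of gradient descent on the convex function f(x) = (x - 3)^2, whose gradient is 2(x - 3).

def gradient_descent(start, step=0.1, iters=100):
    x = start
    for _ in range(iters):
        grad = 2.0 * (x - 3.0)     # gradient at the current point
        x = x - step * grad        # small step in the negative-gradient direction
    return x

print(gradient_descent(start=-10.0))   # converges near the minimum at x = 3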

2.3 Innards of Gradient Boosting.

Comparing a full tree (depth = 6) to boosted-tree residuals by iteration.

Two GB versions: one with the raw 20% of events (M1) and one with a 50/50 mixture of events (M2). The non-GB tree (referred to as maxdepth 6, for the M1 data set) is the most biased, followed by M1. Notice that M2 stabilizes earlier than M1.

(Figure: average residuals by iteration and model name in gradient boosting; the vertical line marks where the mean stabilizes. Series: MEAN_RESID_M1_TRN_TREES and MEAN_RESID_M2_TRN_TREES; tree depth-6 reference line at about 2.83E-15.)
Comparing a full tree (depth = 6) to boosted-tree residuals by iteration (continued).

Trees have the highest residual variance, followed by M2 and then M1. M2 stabilizes earlier as well. In conclusion, M2 has lower bias and higher variance than M1 in this example.
(Figure: variance of residuals by iteration in gradient boosting; the vertical line marks where the variance stabilizes. Series: VAR_RESID_M1_TRN_TREES and VAR_RESID_M2_TRN_TREES; depth-6 tree reference = 0.145774.)
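The residual diagnostics above come from the presenter's own runs; as an assumption-laden stand-in, a similar mean/variance-of-residuals trace per iteration can be produced with scikit-learn's staged predictions on synthetic data.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=3000, weights=[0.8, 0.2], random_state=0)
gb = GradientBoostingClassifier(max_depth=1, n_estimators=10, random_state=0).fit(X, y)

for it, p in enumerate(gb.staged_predict_proba(X), start=1):
    resid = y - p[:, 1]                  # residual on the probability scale
    print(f"iteration {it}: mean={resid.mean():.6f}, var={resid.var():.6f}")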

2.4 Partial Dependency Plots.

Partial Dependency Plots (PDP).

Because of GB's black-box nature, these plots show the effect of a predictor on the modeled response once all other predictors have been marginalized out; the other predictors are usually fixed at a constant value, such as their mean.

The graphs may not capture the nature of variable interactions, especially if interactions significantly affect the model outcome.

Formally, the PDP of F(x1, x2, ..., xp) on X is E(F) taken over all variables except X. Thus, for a given value of X, the PDP is the average of the training-set predictions with X kept constant at that value.

Since GB, boosting, bagging, etc. are BLACK BOX models, use PDPs to obtain model interpretation. (A sketch with scikit-learn follows.)
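A hedged sketch of one way to compute a PDP, using scikit-learn's inspection module on synthetic data (the original plots were produced with other software; return keys may differ slightly across scikit-learn versions).

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import partial_dependence

X, y = make_classification(n_samples=2000, n_features=6, random_state=0)
gb = GradientBoostingClassifier(max_depth=3, n_estimators=100, random_state=0).fit(X, y)

# Average prediction as feature 0 is held at each grid value while the
# remaining predictors are averaged over the training data.
pd_result = partial_dependence(gb, X, features=[0], grid_resolution=20)
print(pd_result["average"])   # the PDP values over the grid for feature 0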

2.5 Comparative Study.

Analytical problem to investigate.

Optical health-care insurance fraud. Longer care typically involves higher treatment costs, and the insurance company has to set up reserves immediately, as soon as a case is opened. Sometimes doctors are involved in the fraud.

The aim of the project is to predict fraudulent charges. Thus it is a classification problem, and we'll use a battery of models and compare them, with the original training sample and with a 50/50 resampled version of it.
     
 
Basic information on the original data sets:

                                 (raw)           (50/50)
  Data set name .............    train           sampled50_50
  Num. observations .........    3,595           1,133
  Validation data set .......    validata        validata50_50
  Num. observations .........    2,365           4,827
  Test data set .............    (none)          (none)
  Num. observations .........    0               0
  Dep. variable .............    fraud           fraud
  Pct event prior TRN .......    20.389          50.838
  Pct event prior VAL .......    19.281          12.699
  Pct event prior TEST ......    -               -
Variable           Label
FRAUD              Fraudulent activity (yes/no)
TOTAL_SPEND        Total spent on opticals
DOCTOR_VISITS      Total visits to a doctor
NO_CLAIMS          Number of claims made recently
MEMBER_DURATION    Membership duration
OPTOM_PRESC        Number of opticals claimed
NUM_MEMBERS        Number of members covered

Requested Models: Names & Descriptions

Model Name   Model Description
M1           Raw DATA 20 pct, maxdepth 1, num iterations 3
M2           Raw DATA 20 pct, maxdepth 1, num iterations 10
M3           Raw DATA 20 pct, maxdepth 3, num iterations 3
M4           Raw DATA 20 pct, maxdepth 3, num iterations 10
M5           Raw DATA 20 pct, maxdepth 5, num iterations 3
M6           Raw DATA 20 pct, maxdepth 5, num iterations 10

Additional information: logistic with backward selection and trees are run for M1 only; M1 – M6 are run for GB; all models are evaluated at the TRN and VAL stages. Naming convention: M#_modeling.method and sometimes M#_TRN/VAL_modeling.method. Bagging is also run on M1 but reported only for AUROC, to avoid clutter. M1 – M6 vary the depth and the number of iterations for GB, with p = 6 predictors.

For instance, M1_logistic_backward means case M1 using a logistic regression with backward selection. M1 – M6: variations in depth and number of iterations for the gradient boosting runs. (A sketch of this grid follows below.)

Ensemble: take all model predictions at the end of M6 as predictors, run a logistic regression against the actual dependent variable, and report.
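As an illustrative stand-in for the M1 – M6 grid (the actual runs used the fraud data and other software), here is a sketch with scikit-learn on synthetic data; the data set and names are hypothetical.

from itertools import product
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5960, n_features=6, weights=[0.8, 0.2],
                           random_state=0)            # stand-in for the fraud data
X_trn, X_val, y_trn, y_val = train_test_split(X, y, test_size=0.4, random_state=0)

# M1..M6: maxdepth in {1, 3, 5} crossed with num iterations in {3, 10}
for i, (depth, iters) in enumerate(product([1, 3, 5], [3, 10]), start=1):
    gb = GradientBoostingClassifier(max_depth=depth, n_estimators=iters,
                                    random_state=0).fit(X_trn, y_trn)
    auroc = roc_auc_score(y_val, gb.predict_proba(X_val)[:, 1])
    print(f"M{i}: maxdepth={depth}, iterations={iters}, AUROC={auroc:.3f}")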
(Reporting area for model coefficients and selected variables. Similar for 50/50.)
As we increase the number of iterations and the maximum depth, a larger number of variables is selected.
50/50: trees are seriously affected, not so GB.

GB and trees do not fully agree with the logistic findings. The tree methods find NO_CLAIMS to be the primary variable, while logistic seems to value all variables except DOCTOR_VISITS equally. In the 50/50 case, DOCTOR_VISITS is irrelevant and leads to quasi-complete separation.

(Ensemble p-values reporting area.)

50/50: most models are insignificant except for M6_gradient_boosting.
(Partial dependency plots reporting area.)

50/50: scales shifted up.

A very interesting, almost U-shaped relationship, conditioned on the other variables in the model.
(GOF measures reporting area.)

Probabilities shifted up for the 50/50 case.


GB certainly does not over-fit.
50/50: the overall ranking hasn't changed. Notice the decline in trees and the stability in bagging. Some evidence of over-fitting.
50/50: ranking same.

2.6 Comments on the Practice of Gradient Boosting.

Comments on GB.

1) It is not immediately apparent what a weak classifier is (for instance, by varying depth in our case). Likewise, the number of iterations is a big issue. In our simple example, M6 GB was the best performer. Still, the overall modeling benefited from ensembling all the methods, as measured by AUROC, cumulative lift, or the ensemble p-values.

2) The posterior probability ranges are vastly different, and thus the tendency to classify observations by the 0.5 threshold is too simplistic.

3) The PDPs show that different methods find distinct multivariate structures. Interestingly, the ensemble p-values show a decreasing tendency for logistic and trees and a strong S-shaped tendency for M6 GB, which could mean that M6 GB alone tends to overshoot its predictions.

4) GB is relatively unaffected by the 50/50 mixture.

Drawbacks of GB.

1) Memory requirements can be very large, especially with many iterations; a typical problem of ensemble methods.

2) A large number of iterations => slow speed in obtaining predictions => on-line scoring may require a trade-off between complexity and the time available. Once GB is learned, parallelization certainly helps.

3) No simple algorithm to capture interactions, because of the base learners.

4) No simple rules to determine gamma, the number of iterations, or the depth of the simple learner. We need to try different combinations and possibly recalibrate over time.

5) Still, it is one of the most powerful methods available.

References

Friedman, J. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5), 1189–1232. doi:10.1214/aos/1013203451

Earlier literature on combining methods:

Winkler, R.L. and Makridakis, S. (1983). The combination of forecasts. Journal of the Royal Statistical Society, Series A, 146(2), 150–157.

Makridakis, S. and Winkler, R.L. (1983). Averages of forecasts: some empirical results. Management Science, 29(9), 987–996.

Bates, J.M. and Granger, C.W.J. (1969). The combination of forecasts. Operational Research Quarterly, 20(4), 451–468.
