
Gradient Boosted Regression Trees

scikit-learn

Peter Prettenhofer (@pprett), DataRobot
Gilles Louppe (@glouppe), Université de Liège, Belgium

Motivation
Outline

1 Basics

2 Gradient Boosting

3 Gradient Boosting in scikit-learn

4 Case Study: California housing


About us

Peter
• @pprett
• Python & ML ∼ 6 years
• sklearn dev since 2010

Gilles
• @glouppe
• PhD student (Liège, Belgium)
• sklearn dev since 2011
• Chief tree hugger
Outline

1 Basics

2 Gradient Boosting

3 Gradient Boosting in scikit-learn

4 Case Study: California housing


Machine Learning 101
• Data comes as...
• A set of examples {(x_i, y_i) | 0 ≤ i < n_samples}, with
• Feature vector x ∈ R^n_features, and
• Response y ∈ R (regression) or y ∈ {−1, 1} (classification)

• Goal is to...
• Find a function ŷ = f(x)
• Such that the error L(y, ŷ) on new (unseen) x is minimal
Classification and Regression Trees [Breiman et al., 1984]

[Figure: depth-3 regression tree on the California housing data. Root split: MedInc <= 5.04; second level: MedInc <= 3.07, MedInc <= 6.82; third level: AveRooms <= 4.31, AveOccup <= 2.37, AveOccup <= 2.74, MedInc <= 7.82; leaf values: 1.62, 1.16, 2.79, 1.88, 3.39, 2.56, 3.73, 4.57]

sklearn.tree.DecisionTreeClassifier|Regressor
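
A minimal sketch (not from the original slides) of fitting such a tree with scikit-learn; the depth and the use of fetch_california_housing are illustrative:

from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor

# Shallow regression tree on the California housing data used throughout this deck
cal = fetch_california_housing()
X, y = cal.data, cal.target
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
tree.predict(X[:2])  # predicted median house values for the first two block groups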
Function approximation with Regression Trees

[Figure: regression tree fits with max_depth = 1, 3, and 20 vs. the ground truth, stamped "Deprecated"]

• Nowadays seldom used alone
• Ensembles: Random Forest, Bagging, or Boosting (see sklearn.ensemble; a short sketch follows below)
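
A rough sketch (not from the slides) of swapping the single tree for a Random Forest, reusing X, y from the sketch above; n_estimators is illustrative:

from sklearn.ensemble import RandomForestRegressor

# 100 trees, each grown on a bootstrap sample; predictions are averaged
forest = RandomForestRegressor(n_estimators=100).fit(X, y)
forest.predict(X[:2])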
Outline

1 Basics

2 Gradient Boosting

3 Gradient Boosting in scikit-learn

4 Case Study: California housing


Gradient Boosted Regression Trees

Advantages
• Heterogeneous data (features measured on different scales)
• Supports different loss functions (e.g. Huber; a short sketch follows below)
• Automatically detects (non-linear) feature interactions

Disadvantages
• Requires careful tuning
• Slow to train (but fast to predict)
• Cannot extrapolate
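
A minimal sketch (not from the slides) of picking a robust loss; X_train and y_train are placeholders for an existing training set:

from sklearn.ensemble import GradientBoostingRegressor

# Huber loss is less sensitive to outliers than plain squared error
est = GradientBoostingRegressor(loss='huber')
est.fit(X_train, y_train)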
Boosting

AdaBoost [Y. Freund & R. Schapire, 1995]
• Ensemble: each member is an expert on the errors of its predecessor
• Iteratively re-weights training examples based on errors
• Huge success: Viola-Jones Face Detector (2001)
• Freund & Schapire won the Gödel Prize in 2003

[Figure: four panels (x0 vs. x1) of a 2D toy dataset across successive boosting iterations]

sklearn.ensemble.AdaBoostClassifier|Regressor
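
A minimal sketch (not from the slides) of AdaBoost in scikit-learn; the toy dataset and n_estimators are illustrative:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification

# Each new (depth-1) tree concentrates on the examples the previous ones got wrong
X, y = make_classification(n_samples=1000, random_state=0)
ada = AdaBoostClassifier(n_estimators=100).fit(X, y)
ada.score(X, y)  # training accuracy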
Gradient Boosting [J. Friedman, 1999]

Statistical view on boosting
• Generalization of boosting to arbitrary loss functions

Residual fitting

[Figure: Ground truth ∼ tree 1 + tree 2 + tree 3; each successive tree fits what the previous trees left unexplained]

sklearn.ensemble.GradientBoostingClassifier|Regressor
Functional Gradient Descent

Least Squares Regression
• Squared loss: L(y_i, f(x_i)) = (y_i − f(x_i))²
• The residual ∼ the negative gradient −∂L(y_i, f(x_i)) / ∂f(x_i)

Steepest Descent
• Regression trees approximate the (negative) gradient
• Each tree is a successive gradient descent step (see the hand-rolled sketch below)

[Figure: common loss functions. Left: regression losses (squared error, absolute error, Huber) as a function of y − f(x). Right: classification losses (zero-one, log, exponential) as a function of y · f(x)]
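
A hand-rolled sketch (toy 1D data, squared loss, depth-1 trees) to make the gradient-step view concrete; this is not how GradientBoostingRegressor is implemented internally:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.linspace(0, 10, 200)[:, np.newaxis]
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

f = np.full_like(y, y.mean())          # start from the mean prediction
for _ in range(100):
    residual = y - f                   # ∼ negative gradient of the squared loss
    tree = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    f += 0.1 * tree.predict(X)         # shrunken gradient descent step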
Outline

1 Basics

2 Gradient Boosting

3 Gradient Boosting in scikit-learn

4 Case Study: California housing


GBRT in scikit-learn

How to use it
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> from sklearn.datasets import make_hastie_10_2
>>> X, y = make_hastie_10_2(n_samples=10000)
>>> est = GradientBoostingClassifier(n_estimators=200, max_depth=3)
>>> est.fit(X, y)
...
>>> # get predictions
>>> pred = est.predict(X)
>>> est.predict_proba(X)[0] # class probabilities
array([ 0.67, 0.33])

Implementation
• Written in pure Python/Numpy (easy to extend).
• Builds on top of sklearn.tree.DecisionTreeRegressor (Cython).
• Custom node splitter that uses pre-sorting (better for shallow trees).
Example
from sklearn.ensemble import GradientBoostingRegressor
est = GradientBoostingRegressor(n_estimators=2000, max_depth=1).fit(X, y)
for pred in est.staged_predict(X):
    plt.plot(X[:, 0], pred, color='r', alpha=0.1)

[Figure: staged GBRT (max_depth=1) predictions overlaid on the ground truth and single regression trees, annotated "High bias - low variance" and "Low bias - high variance"]
Model complexity & Overfitting
test_score = np.empty(len(est.estimators_))
for i, pred in enumerate(est.staged_predict(X_test)):
    test_score[i] = est.loss_(y_test, pred)
plt.plot(np.arange(n_estimators) + 1, test_score, label='Test')
plt.plot(np.arange(n_estimators) + 1, est.train_score_, label='Train')

[Figure: train vs. test error as a function of n_estimators; annotations mark the lowest test error and the widening train-test gap]

Regularization
GBRT provides a number of knobs to control overfitting:
• Tree structure
• Shrinkage
• Stochastic Gradient Boosting
Regularization: Tree structure
• The max_depth of the trees controls the degree of feature interactions
• Use min_samples_leaf to have a sufficient number of samples per leaf (a short sketch follows below)
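
A hedged sketch of both knobs (values are illustrative, not recommendations); X_train and y_train are placeholders:

from sklearn.ensemble import GradientBoostingRegressor

est = GradientBoostingRegressor(max_depth=4,          # limits the order of feature interactions
                                min_samples_leaf=9)   # smooths the leaf estimates
est.fit(X_train, y_train)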
Regularization: Shrinkage
• Slow learning by shrinking tree predictions with 0 < learning_rate <= 1
• A lower learning_rate requires a higher n_estimators (see the sketch after the figure)

[Figure: train/test error with and without shrinkage (learning_rate=0.1); shrinkage requires more trees but reaches a lower test error]
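
A hedged sketch of the trade-off (values are illustrative; X_train, y_train as before):

from sklearn.ensemble import GradientBoostingRegressor

est = GradientBoostingRegressor(n_estimators=1000,    # more trees...
                                learning_rate=0.1)    # ...to compensate for the shrinkage
est.fit(X_train, y_train)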
Regularization: Stochastic Gradient Boosting
• Samples: random subset of the training set (subsample)
• Features: random subset of the features (max_features)
• Improved accuracy, reduced runtime (see the sketch after the figure)

[Figure: train/test error for subsample=0.5 combined with learning_rate=0.1; subsampling alone does poorly, but combined with shrinkage it reaches an even lower test error]
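
A hedged sketch combining both forms of randomization with shrinkage (values are illustrative; X_train, y_train as before):

from sklearn.ensemble import GradientBoostingRegressor

est = GradientBoostingRegressor(n_estimators=1000,
                                learning_rate=0.1,
                                subsample=0.5,        # each tree sees half of the samples
                                max_features=0.3)     # 30% of the features per split
est.fit(X_train, y_train)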
Hyperparameter tuning

1. Set n_estimators as high as possible (e.g. 3000)

2. Tune hyperparameters via grid search.


from sklearn.grid_search import GridSearchCV
param_grid = {'learning_rate': [0.1, 0.05, 0.02, 0.01],
              'max_depth': [4, 6],
              'min_samples_leaf': [3, 5, 9, 17],
              'max_features': [1.0, 0.3, 0.1]}
est = GradientBoostingRegressor(n_estimators=3000)
gs_cv = GridSearchCV(est, param_grid).fit(X, y)
# best hyperparameter setting
gs_cv.best_params_

3. Finally, set n_estimators even higher and tune learning_rate.
Outline

1 Basics

2 Gradient Boosting

3 Gradient Boosting in scikit-learn

4 Case Study: California housing


Case Study
California Housing dataset
• Predict log(medianHouseValue)
• Block groups in the 1990 census
• 20,640 groups with 8 features (median income, median age, lat, lon, ...)
• Evaluation: mean absolute error on an 80/20 split (a setup sketch follows below)

Challenges
• Heterogeneous features
• Non-linear interactions
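
A rough setup sketch (not the authors' exact pipeline; hyperparameter values are illustrative):

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer releases
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

cal = fetch_california_housing()
X, y = cal.data, np.log(cal.target)                     # predict log(medianHouseValue)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

est = GradientBoostingRegressor(n_estimators=3000, max_depth=6, learning_rate=0.04,
                                min_samples_leaf=9, max_features=0.3)
est.fit(X_train, y_train)
mean_absolute_error(y_test, est.predict(X_test))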
Predictive accuracy & runtime

        Train time [s]   Test time [ms]      MAE

Mean                 -                -   0.4635
Ridge            0.006             0.11   0.2756
SVR               28.0          2000.00   0.1888
RF                26.3           605.00   0.1620
GBRT             192.0           439.00   0.1438
[Figure: train vs. test error of GBRT on California housing as a function of n_estimators]
Model interpretation
Which features are important?
>>> est.feature_importances_
array([ 0.01, 0.38, ...])

[Figure: relative importances of the eight features (MedInc, AveRooms, Longitude, AveOccup, Latitude, AveBedrms, Population, HouseAge)]
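
A short plotting sketch for the chart above (assumes est and cal from the setup sketch earlier):

import numpy as np
import matplotlib.pyplot as plt

imp = est.feature_importances_
order = np.argsort(imp)                                 # least to most important
plt.barh(np.arange(len(imp)), imp[order])
plt.yticks(np.arange(len(imp)), np.array(cal.feature_names)[order])
plt.xlabel('Relative importance')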
Model interpretation
What is the effect of a feature on the response?
from sklearn.ensemble import partial_dependence as pd

features = ['MedInc', 'AveOccup', 'HouseAge', 'AveRooms',
            ('AveOccup', 'HouseAge')]
fig, axs = pd.plot_partial_dependence(est, X_train, features,
                                      feature_names=names)

[Figure: partial dependence of house value on the non-location features of the California housing dataset: one-way plots for MedInc, AveOccup, HouseAge, and AveRooms, plus a two-way plot of AveOccup vs. HouseAge]
Model interpretation

Automatically detects spatial effects

[Figure: two maps (longitude vs. latitude) of the partial dependence of median house value on location]
Summary

• Flexible non-parametric classification and regression technique

• Applicable to a variety of problems

• Solid, battle-worn implementation in scikit-learn


Thanks! Questions?
Benchmarks

[Figure: error, train time, and test time of sklearn-0.15 vs. gbm across benchmark datasets: Arcene, bioresp, Boston, California, Covtype, Example 10.2, Expedia, Madelon, Solar, Spam, YahooLTRC]
Tips & Tricks 1

Input layout
Use dtype=np.float32 to avoid memory copies, and Fortran layout for a slight runtime benefit.
X = np.asfortranarray(X, dtype=np.float32)
Tips & Tricks 2

Feature interactions
GBRT automatically detects feature interactions, but explicit interaction features often help.
Trees required to approximate X1 − X2: 10 (left), 1000 (right).

[Figure: two 3D surface plots of the approximation of x − y over the unit square, with 10 trees (left) and 1000 trees (right)]
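
A minimal sketch of adding the explicit interaction as an extra column (the column indices are illustrative):

import numpy as np

# Append X1 - X2 so a single split can capture the interaction
X_aug = np.hstack([X, (X[:, 0] - X[:, 1])[:, np.newaxis]])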
Tips & Tricks 3

Categorical variables
Sklearn requires categorical variables to be encoded as numeric values. Tree-based methods work well with ordinal encoding:
df = pd.DataFrame(data={'icao': ['CRJ2', 'A380', 'B737', 'B737']})
# ordinal encoding
df_enc = pd.DataFrame(data={'icao': np.unique(df.icao, return_inverse=True)[1]})
X = np.asfortranarray(df_enc.values, dtype=np.float32)
