
Advanced Analytics in Business

Big Data Platforms & Technologies


Ensemble Modeling: Bagging and Boosting
Interpretability (Part 1)
Overview
Introduction
Combination rules and voting systems
Bagging
Boosting
Interpretability (part 1)

2
Introduction

3
Ensemble modeling: the basic motivation
Are two models better than one?

Intuitively, this makes sense: you might have two models that are each good at
predicting a certain (different) subsegment of your data set
So this seems like a good way to increase performance
We'll also see that ensembles can make models more robust to overfitting and more robust to noise
Combination of models is called an “ensemble”

https://towardsdatascience.com/the-unexpected-lesson-within-a-jelly-bean-jar-1b6de9c40cca

4
Can we have it all?
Overfitting:

Model is too specific, works great on training data but not on a new data set
E.g.: a very deep decision tree

5
Can we have it all?
We have seen early stopping and pruning

Using a strong validation setup, too


But in the end: perhaps an accuracy level we might not be happy with

6
Can we have it all?
Also consider: what if we could combine multiple linear classifiers?

7
Combination Rules and Voting Systems

8
Combination rules
Let’s say we’ve created two models
How to combine them?

True label   Model 1 (threshold: 0.54)   Model 2 (threshold: 0.50)   Ensemble?
Yes          0.80 (yes)                  0.70 (yes)                  ?
Yes          0.78 (yes)                  0.50 (yes)                  ?
Yes          0.54 (yes)                  0.50 (yes)                  ?
No           0.57 (yes)                  0.30 (no)                   ?
No           0.30 (no)                   0.70 (yes)                  ?
No           0.22 (no)                   0.40 (no)                   ?

9
Combination rules
Algebraic combination
Determine new, optimal cutoff!

True label   Model 1 (0.54)   Model 2 (0.50)   Min (0.50)   Max (0.78)   Mean (0.52)
Yes          0.80 (yes)       0.70 (yes)       0.70 (yes)   0.80 (yes)   0.75 (yes)
Yes          0.78 (yes)       0.50 (yes)       0.50 (yes)   0.78 (yes)   0.64 (yes)
Yes          0.54 (yes)       0.50 (yes)       0.50 (yes)   0.54 (no)    0.52 (yes)
No           0.57 (yes)       0.30 (no)        0.30 (no)    0.57 (no)    0.44 (no)
No           0.30 (no)        0.70 (yes)       0.30 (no)    0.70 (no)    0.50 (no)
No           0.22 (no)        0.40 (no)        0.22 (no)    0.40 (no)    0.31 (no)
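To make the combination rules concrete, a minimal sketch in Python/NumPy reproducing the min/max/mean columns above; the probabilities come straight from the table and the cutoffs are the ones in the column headers:

```python
import numpy as np

# Predicted probabilities of the two models, taken from the table above
p1 = np.array([0.80, 0.78, 0.54, 0.57, 0.30, 0.22])
p2 = np.array([0.70, 0.50, 0.50, 0.30, 0.70, 0.40])
y  = np.array([1, 1, 1, 0, 0, 0])  # true labels (yes = 1, no = 0)

# Algebraic combination rules with their new, rule-specific cutoffs
combined = {"min": np.minimum(p1, p2), "max": np.maximum(p1, p2), "mean": (p1 + p2) / 2}
cutoffs  = {"min": 0.50, "max": 0.78, "mean": 0.52}

for rule, p in combined.items():
    pred = (p >= cutoffs[rule]).astype(int)
    print(rule, pred, "accuracy:", (pred == y).mean())
```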

10
Voting
Useful when combining models: majority voting
Less sensitive to underlying probability distributions; no need for calibration or determination of an optimal new cutoff

Six models vote: four "yes", two "no" → "yes" wins (4 to 2)

What about weighted voting?

Same six models, but model 4 (a "no" voter) gets 5 votes and the others 1 each → "no" wins (5+1 to 4)

We could also go for a linear combination of the probabilities

Though again: how to determine the weights?


A learning step in itself (meta-learning)
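A minimal sketch of (weighted) majority voting over the six hypothetical models above; which model cast which vote is made up here, apart from model 4 being a "no" voter:

```python
import numpy as np

# Hypothetical votes of six models on one instance (1 = "yes", 0 = "no"): four yes, two no
votes   = np.array([1, 1, 0, 0, 1, 1])
weights = np.array([1, 1, 1, 5, 1, 1])      # weighted voting: model 4 gets 5 votes

majority = int(votes.sum() > len(votes) / 2)                 # plain majority: "yes" wins 4 to 2
weighted = int((weights * votes).sum() > weights.sum() / 2)  # weighted: "no" wins (5 + 1) to 4
print(majority, weighted)                                    # -> 1, 0
```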

11
Mixture of experts
Jordan and Jacobs’ mixture of experts (Jacobs, 1991) generates several “experts”
(classifiers) whose outputs are combined through a linear rule

The weights of this combination are determined by a “gating network”, typically trained using the
expectation maximization (EM) algorithm
But: loss of interpretability, additional production strain!

12
Stacking
Wolpert’s (Wolpert, 1992) stacked generalization (or stacking)

An ensemble of Tier 1 classifiers is first trained on a subset of the training data


Outputs of these classifiers are then used to train a Tier 2 classifier (meta-classifier) (potentially with
original features)
The underlying idea is to learn whether training data have been properly learned
For example, if a particular classifier incorrectly learned a certain region of the feature space, then the Tier 2
classifier may be able to learn this behavior
But: loss of interpretability, additional production strain!
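A minimal stacking sketch using scikit-learn's StackingClassifier; the choice of Tier 1 models and the synthetic data are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Tier 1 classifiers; their (cross-validated) outputs feed the Tier 2 meta-classifier
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
                ("svc", SVC(probability=True, random_state=42))],
    final_estimator=LogisticRegression(),  # Tier 2 meta-classifier
    passthrough=True,                      # also feed the original features to Tier 2
)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))
```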

13
Smoothing

λ × (prediction of model 1) + (1 − λ) × (prediction of model 2)

14
Bagging

15
Bagging
Bagging (bootstrap aggregating) is one of the earliest, most intuitive and perhaps the simplest ensemble based
algorithms, with a surprisingly good performance (Breiman, 1996)

The main idea is to add diversity to the classifiers


Obtained by using bootstrapped replicas of the training data: different training data subsets are randomly drawn – with replacement –
from the entire training dataset
Each training data subset is used to train a different classifier of the same type
Individual classifiers are then combined by taking a simple majority vote of their decisions

Since the training datasets may overlap substantially, additional measures can be used to increase diversity, such
as:

Using a subset of the training data for training each classifier


Using a subset of features
Using unstable classifiers
Other ideas
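A minimal bagging sketch with scikit-learn's BaggingClassifier on synthetic data, combining bootstrapping with a feature subset per classifier (parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=42)

bag = BaggingClassifier(
    DecisionTreeClassifier(),  # same (unstable) base classifier for every ensemble member
    n_estimators=100,
    max_samples=1.0,           # bootstrap sample size, as a fraction of the training set
    max_features=0.8,          # extra diversity: each classifier sees a subset of features
    bootstrap=True,            # draw with replacement
    random_state=42,
).fit(X, y)
print(bag.predict(X[:5]))      # majority vote over the 100 trees
```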

16
Bagging

17
Out-of-bag (OOB) validation
When using bagging, one can already estimate the generalization capabilities of the
ensemble model using the training data: out-of-bag (OOB) validation

When validating an instance i, only consider those models which did not have i in their bootstrap sample
A good initial validation check, though an independent test set is still required!
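A minimal sketch of OOB validation with a scikit-learn random forest; oob_score=True scores each training instance using only the trees that did not see it in their bootstrap sample:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=42)

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42).fit(X, y)
print(rf.oob_score_)  # OOB accuracy: a good first check, not a substitute for a test set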

18
Random forests: the quintessential bagging
technique
Random forests (Leo Breiman and Adele Cutler): bagging-based ensemble learning method for classification and
regression

Constructs a multitude of decision trees at training time and outputs the class that is the majority vote of the classes (classification) or the
mean prediction (regression) of the individual trees
Applies bagging, so one part of randomness comes from bootstrapping each decision tree, i.e. each decision tree sees a random bootstrap
of the training data
However, random forests use an additional piece of randomness, i.e. to select the candidate features to split on, consider a random
subset of features (sampled at every split, not once per tree!)

Random decision forests correct for decision trees’ habit of overfitting to their training set

No more pruning needed


Great performance in most cases!

19
Random forests: the quintessential bagging
technique
How many trees?
No risk of overfit, so use plenty

Depth of tree?
No pruning necessary
But one can still decide to apply some pruning or early stopping mechanisms to speed up (many techniques will do so)

Size of bootstrap?
Can be 100% (this doesn’t mean selecting all instances, as we’re drawing with replacement!)
Lower values possible given enough data points. Key is to build enough trees

M: size of subset of features?


1, 2, all (i.e. “default bagging”)?
Heuristic: max(⌊N/3⌋, 1) for regression, ⌊√N⌋ for classification (with N the number of features)

Alternative: find through cross-validation!
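A minimal sketch of tuning the feature-subset size through cross-validation with scikit-learn; the grid of candidate values is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=300, random_state=42),
    # "sqrt" = the ⌊√N⌋ classification heuristic; None = all features ("default bagging")
    param_grid={"max_features": ["sqrt", 0.33, None]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```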

20
Random forests: the quintessential bagging
technique
Thinking points: how to assign a probability? How to set the thresholds of the base
classifiers (do we need to)?

“ In contrast to the original publication, the scikit-learn implementation combines classifiers by averaging their probabilistic prediction, instead of letting each classifier vote for a single class.

Why does this virtually guarantee the same outcome?

21
Random forests: the quintessential bagging
technique
Random forests are easy to use, don’t require much configuration or preprocessing
Because you are building many trees, the ensemble will include lots of interaction effects “for free”
Good at avoiding overfitting (by design)
However… how to explain 100 trees vs. 1…
Many fun extensions, e.g. Extra Randomized Trees: also consider a random subset of the possible splitting
points, instead of only a random subset of features!
See also Maximizing Tree Diversity by Building Complete-Random Decision Trees (Liu et al., 2005)
There is even a thing such as completely randomized trees (and we’ll see an application of those soon)

22
Boosting

23
Boosting
Similar to bagging, boosting also creates an ensemble of classifiers which are then
combined

However, not using bootstrapping this time around


Instead, classifiers are added sequentially, where each new classifier aims at correcting the mistakes made by the
ensemble thus far
In short: steering the learning towards fixing the mistakes it made in a previous step
Main idea is cooperation between classifiers, rather than adding diversity

24
Boosting

25
AdaBoost
AdaBoost trains an ensemble of weak learners over a number of rounds T
At first, every instance has the same weight (D_1 = 1/N), so AdaBoost trains a normal classifier
Next, samples that were misclassified by the ensemble so far are given a heavier weight
The learner is also given a weight (α_t) depending on its accuracy and incorporated into the ensemble
AdaBoost then constructs a new learner, now incorporating the weights so far
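A minimal AdaBoost sketch with scikit-learn, boosting decision stumps over T = 50 rounds (synthetic data, illustrative parameters):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=42)

# Misclassified samples get a heavier weight each round; each weak learner gets a weight α_t
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50,
                         learning_rate=1.0, random_state=42).fit(X, y)
print(ada.estimator_weights_[:5])  # the α_t weights of the first five weak learners
```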

26
AdaBoost
Friedman et al. showed that AdaBoost can be formulated as an additive logistic regression model

Assuming logistic regression as the base (weak) learner

AdaBoost optimizes an exponential loss function
Nice mathematical solution: shows that AdaBoost is closely linked to a particular loss function

27
Gradient Boosting Machines
Friedman et al.: “what if we would want to optimize a
different loss function?”

Doesn’t work with standard additive logistic regression setup or


AdaBoost
So take a different view: instead of weighting instances (and classifiers in the ensemble) in every cycle, let every sequential classifier predict the residuals of the ensemble so far
Learning to predict the errors
Boils down to the same: predicting the errors and then adjusting accordingly
But this allows optimizing any loss function based on its gradient
Weak decision trees are used as the base learner

28
Gradient Boosting Machines
Gradient boosting will fit learners to the residual y − F_m(x)

Each F_{m+1} attempts to correct the errors of its predecessor F_m

Follows from the observation that the residuals h_m(x) for a given model are proportional to the
negative gradient of the mean squared error (MSE) loss function:

L_{MSE} = \frac{1}{n}(y - F(x))^2, \qquad -\frac{\partial L_{MSE}}{\partial F} = \frac{2}{n}(y - F(x)) = \frac{2}{n} h_m(x)

So, gradient boosting could be specialized to a gradient descent algorithm, and


generalizing it entails plugging in a different loss and its gradient:
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

LogLoss (binary classification) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]

MLogLoss (multiclass classification) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} y_{ij} \log(p_{ij})
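To make the residual-fitting view concrete, a minimal hand-rolled gradient boosting sketch for the MSE loss, with shallow regression trees as weak learners (data and parameters are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, noise=10.0, random_state=42)

learning_rate, n_rounds = 0.1, 100
F = np.full(len(y), y.mean())                  # F_0: start from the mean prediction
trees = []

for m in range(n_rounds):
    residuals = y - F                          # negative gradient of the MSE loss (up to 2/n)
    tree = DecisionTreeRegressor(max_depth=3, random_state=m).fit(X, residuals)
    F += learning_rate * tree.predict(X)       # F_{m+1} = F_m + learning_rate · h_m
    trees.append(tree)

print(np.mean((y - F) ** 2))                   # training MSE shrinks as rounds are added
```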

29
Extreme Gradient Boosting
Expansion on GBM idea by improving the loss optimization method (mathematical
improvement that speeds up training)

First, it uses the second partial derivatives of the loss function, which provide more information about the
direction of the gradients and how to get to the minimum of the loss function (gradient and Hessian needed)

As implemented by the xgboost, lightgbm and catboost packages

Very powerful techniques: wins many Kaggle competitions that deal with structured data!

But:

How about the risk of overfitting?


Explainability? We still have a model with 100 trees…

30
Extreme Gradient Boosting
Expansion on GBM idea by improving the loss optimization method (mathematical
improvement that speeds up training)

Second, it adds regularization (L1 or L2), which improves model generalization

Defined as a “control” parameter on the complexity of the model

Here defined over the depth of the tree or number of leaf nodes
Objective function combines loss and simplicity of the trees
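A minimal xgboost sketch showing the regularization knobs discussed above; parameter names follow the current scikit-learn-style xgboost API and the values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = XGBClassifier(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=4,          # controls the complexity of each tree
    reg_alpha=0.1,        # L1 regularization term
    reg_lambda=1.0,       # L2 regularization term
    eval_metric="logloss",
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```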

31
DART
Rashmi Korlakai Vinayak, Ran Gilad-Bachrach. “DART: Dropouts meet Multiple Additive
Regression Trees.” (2015)

“ XGBoost mostly combines a huge number of regression trees with a small learning
rate. In this situation, trees added early are significant and trees added late are
unimportant.

Vinayak and Gilad-Bachrach propose a new method to add dropout techniques from
the deep neural net community to boosted trees, and reported better results in some

situations.

32
Feature interaction constraints
“ Variables that appear together in a traversal path are interacting with one another, since the condition of a child node is predicated on the condition of the parent node. For example, the highlighted red path in the diagram.

When the tree depth is larger than one, many variables interact on the sole basis of minimizing training loss, and the resulting decision tree may capture a spurious relationship (noise) rather than a legitimate relationship that generalizes across different datasets. Feature interaction constraints allow users to decide which variables are allowed to interact and which are not.

[Diagram: example tree with splits x10 < -1.5, then x2 < 2 / x7 < 0.3, …, then x1 < 0.5, leading to Predict +1.3]
33
Monotonic relationship constraints
“ It is often the case in a modeling problem or project that the functional form of an
acceptable model is constrained in some way. A common type of constraint in this
situation is that certain features bear a monotonic relationship to the predicted

response.
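A minimal sketch of both constraint types in xgboost (string-encoded, as in the xgboost tutorials); which features are constrained here is purely illustrative:

```python
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

X, y = make_regression(n_samples=1000, n_features=4, random_state=42)

model = XGBRegressor(
    n_estimators=200,
    # only features {0, 1} and {2, 3} may appear together on the same decision path
    interaction_constraints="[[0, 1], [2, 3]]",
    # force a monotonically increasing effect of feature 0 and a decreasing effect of feature 2
    monotone_constraints="(1, 0, -1, 0)",
)
model.fit(X, y)
```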

34
Comparing Bagging and Boosting

35
Comparing bagging and boosting
Bagging can be done in parallel (each sub-model is built independently)
Boosting is sequential (try to add new sub-models that do well where previous model lacks)

But the trees themselves can be parallelized somewhat

Bagging decreases variance, not necessarily bias

Suitable to combine high variance low bias models (complex models)


Example algorithm: random forest
Reducing the overfit of ensembles of complex models (strength of diversity)
Hence: deep decision trees

Boosting decreases bias, not necessarily variance

Suitable to combine low variance high bias models (simple models)


Example algorithms: AdaBoost, gradient boosting machines (GBM), xgboost
Reducing the error of ensembles of simple models (strength of cooperation)
Hence: logistic regression, or non-deep, “weak” decision stumps

36
Comparing bagging and boosting
Bagging decreases variance

Boosting decreases bias

High bias ←——————————————————————→ High variance
37
Comparing bagging and boosting
This means that you need to be careful of overfitting when using boosting techniques

There is such a thing as overboosting


xgboost and others implement regularization and other strategies to protect against this
Requires hyperparameter tuning, however
In most practical use cases, similar performance can be obtained using random forests without risk of overfit

38
Comparing bagging and boosting

39
Opening the Black Box

40
Opening the black box

“ How to explain 100 trees versus 1?

Since we’ve now stepped into the domain of black-box models, we need to look at
appropriate techniques to gain understanding of the models we construct!

Model interpretability techniques:

Some of these are “native” to the concept of decision trees used in our ensembles
Others can be used for any type of model (“model agnostic” techniques)

Work at different levels:

Help to explain the model (make the model simpler)


Help to explain a feature (explain which ones are important and how they drive the outcome)
Help to explain a particular instance-level prediction

41
Feature importance

42
Feature importance
Which features are important according to the model?

Different techniques exist:

Based on mean decrease in impurity in tree based ensembles


Biased: http://explained.ai/rf-importance/index.html
Model-dependent (gradient boosted models, random forests)

Based on position in trees


Model-dependent as well

Drop feature importance


Robust, but requires retraining
Model agnostic

Permutation importance
No retraining required
Model agnostic

43
Permutation based feature importance
Start with trained model on given data set and performance measure (e.g. AUC)
Randomly permute values for feature under study
Use trained model to predict observations again
Importance = baseline performance measure – performance measure on permuted data set
Note that this takes interactions into account (as shuffling breaks interaction effects)
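A minimal sketch using scikit-learn's permutation_importance, here scored with AUC on a held-out test set (synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

# Importance = baseline score minus the score after shuffling one feature at a time
result = permutation_importance(rf, X_test, y_test, scoring="roc_auc",
                                n_repeats=10, random_state=42)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")
```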

44
Permutation based feature importance
Indicates which features are important, but not to which extent they affect outcome
Can also be used for feature selection: retrain model on top-N features (e.g. Boruta package in R)
Permuting a single feature breaks up interaction effects!
But possible to permute multiple features at once to zoom in on interaction effect (“keep together”)

Correlated features will “share” importance, which might lead to misinterpretation!


Still important to check correlated features! ( rfpimp Python package has this built-in)

Or permute multiple features together (separately)

Extensions exist which add significance values (less commonly used)

45
Feature importance

https://explained.ai/rf-importance/index.html

46
Feature importance
Check your documentation: many implementations (e.g. scikit-learn) implement Gini impurity based importance
scikit-learn 0.23.2: permutation importance finally available out of the box, but use the correct approach:
https://scikit-learn.org/stable/modules/permutation_importance.html
And also: take into account correlation effects!

Train or test set? Both have their use!


Test set: in line with evaluation concerns: e.g. the feature
importance based on training data might make us mistakenly believe that features are important for the predictions, when in
reality the model was just overfitting and the features were not important at all
Train set: learn and understand how the model has actually relied on each feature

47
Partial dependence plots

Each point on the partial dependence plot is the average vote percentage (or average
continuous prediction) across all observations

Understand how feature impacts outcome of model

48
Partial dependence plots
Option 1) Keep feature under observation as-is, impute with median and mode for others

Also possible to define a range between observed minimum and maximum


Have the trained model predict on the new dataset and plot the results over the values of the feature under
observation

49
Partial dependence plots
Option 2) For each value of the feature under study, generate N synthetic instances

Based on observed values or a range between observed minimum and maximum


By taking the values for the other features from a randomly sampled instance
Average the prediction
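A minimal sketch of a partial dependence plot with recent scikit-learn versions (the averaging over the data set corresponds to option 2 above); features and data are illustrative:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = make_regression(n_samples=1000, n_features=5, random_state=42)
rf = RandomForestRegressor(n_estimators=200, random_state=42).fit(X, y)

# Sweep features 0 and 2 over a grid and average the model's predictions over the data set
PartialDependenceDisplay.from_estimator(rf, X, features=[0, 2])
plt.show()
```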

50
Partial dependence plots

51
Partial dependence plots
Note that absence of evidence does not mean evidence of absence!

E.g. interaction effects might not show in partial dependence plot given the ceteris paribus approach (keeping
everything fixed except for the variable under observation)

y ~ x0 * x1 + x2 + noise

52
Partial dependence plots
On the other hand, this is also a benefit…

“ Why not look at the data itself to assess the impact on an outcome (i.e. by
constructing bins on a feature under observation and looking at the percentage of yes

vs. no cases per bin)?

You might infer correlations from this univariate investigation which might not be true given the presence of
interaction effects!
E.g. “sales drop for customers between 30 and 40 years old” (data) vs. “sales stay constant” (partial
dependence plot) indicates presence of interaction effects: age alone is not a sufficient explanation!
Need to inspect both!

53
Partial dependence plots
Here too, like permutation importance, it is possible to keep more than one feature as-is whilst replacing the
others with their median and mode
Harder to visualize, however (e.g. contour plots, 3d plots… infeasible for higher dimensions)

http://forestfloor.dk/ R package

54
Individual conditional expectation (ICE) plots
Similar idea to partial dependence plots
Create new instances based on the values of the feature under observation
Again, also possible to define a grid-based range over the feature under observation between its observed
minimum and maximum

55
Individual conditional expectation (ICE) plots
Every original instance now leads to multiple rows in
the modified data set
Again, we let the model predict over all these
instances
For each distinct value for the feature under
observation, we now have multiple predictions
Instead of averaging, we also plot the different lines
Finally, the lines are commonly centered
An average line can also be plotted (yellow line in
plot: similar as PDP option 2)
ICE plots well suited to show behavior of feature across data set
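A minimal ICE sketch; in recent scikit-learn versions the same display can draw one centered line per instance plus the averaged PDP line:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = make_regression(n_samples=500, n_features=5, random_state=42)
rf = RandomForestRegressor(n_estimators=200, random_state=42).fit(X, y)

# kind="both" draws one line per instance plus the average; centered=True anchors them at 0
PartialDependenceDisplay.from_estimator(rf, X, features=[0], kind="both", centered=True)
plt.show()
```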

56
Tree inspection
Feed an instance through every tree and tally which variables were used most often

Either only for the trees agreeing with the majority-vote outcome, or for all of them
Can also keep track of the splitting points per variable
Not model-agnostic!

https://medium.com/airbnb-engineering/unboxing-the-random-forest-classifier-the-threshold-distributions-22ea2bb58ea6
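A minimal sketch of such a tally for a scikit-learn random forest: feed one instance through every tree and count which features the visited split nodes test (dataset and model are illustrative):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

instance = X[:1]
counts = np.zeros(X.shape[1], dtype=int)

for tree in rf.estimators_:
    node_ids = tree.decision_path(instance).indices   # nodes visited by this instance
    for node_id in node_ids:
        feature = tree.tree_.feature[node_id]
        if feature >= 0:                               # leaves carry a negative feature index
            counts[feature] += 1

for i in counts.argsort()[::-1][:5]:
    print(f"feature {i} used in {counts[i]} splits on this instance's paths")
```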

57
LIME
Local interpretable model-agnostic explanations

“Local surrogate model”


Works on the instance level

https://github.com/marcotcr/lime

58
LIME

59
LIME

60
LIME
A simple model is trained over this perturbed data set, e.g. a Lasso regression
Regression, because the output is now the predicted probability of the black-box model: a continuous value
Lasso to keep the explanations sparse

This provides us with a simple, local decision boundary which can be easily inspected:

61
LIME
Decision boundary of “explanatory model” approximates decision boundary of black box model around the
instance under study
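A minimal sketch with the lime package on tabular data; the dataset, model and number of features shown are illustrative:

```python
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(data.data, data.target)

explainer = LimeTabularExplainer(data.data, feature_names=data.feature_names,
                                 class_names=data.target_names, mode="classification")

# Perturb the instance, weight perturbations by proximity, fit a sparse (Lasso-style) linear model
exp = explainer.explain_instance(data.data[0], rf.predict_proba, num_features=5)
print(exp.as_list())  # top local feature contributions around this instance
```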

62
LIME
Also works on non tabular data: changes how the perturbation is performed
Text: new texts are created by randomly removing words from the original text
Images: perturbing individual pixels does not make a lot of sense. Instead, groups of pixels in the image are
perturbed at once by “blanking” them (removing them from the image)
These groups are called “superpixels”, based on interconnected pixels with similar coloring
Can be found using e.g. k-means

Ribeiro et al., 2016

63
LIME
Defining the neighborhood around the instance (to determine the weights) is difficult
Distance measure or bandwidth of exponential smoothing kernel can impact results
One can also simply select k nearest neighbors around instance under study (but need to decide upon appropriate k)

Choosing the simple model is somewhat arbitrary


E.g. tuning of regularization parameter of LASSO regression can impact results
Main reason why not used that often

Main advantage is that LIME is easy to understand and works on tabular data, text and images

Was very common for a while, now less used

64
Global surrogate
Simply train an interpretable model to use whilst explaining, which hopefully is pretty
close to the original model

However:

How close is close enough?


It could happen that the interpretable model is very close for one subset of the dataset, but widely divergent
for another subset: interpretation for the simple model would not be equally good for all data points

Also remember the basics regarding model inspection (look at your confusion matrix, look
at the “most confused” instances)
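A minimal global surrogate sketch: train a shallow decision tree on the black-box model's predictions and measure how faithfully it mimics them (data and models are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=42)

black_box = GradientBoostingClassifier(random_state=42).fit(X, y)
bb_predictions = black_box.predict(X)

# Train the interpretable model on the black box's predictions, not on the true labels
surrogate = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X, bb_predictions)

# Fidelity: how close is close enough?
print("fidelity:", accuracy_score(bb_predictions, surrogate.predict(X)))
```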

65
Shapley values
Combines the best of the above: importance, partial dependence, and instance-
level explanations
Originated in game theory
Represent the fair payout for each player in a cooperative game
Can be used to measure the contribution of a feature to a model

66
Shapley values
We want to play a game. What is contribution of each player to the team?
We are interested in the marginal contribution of each player

67
Shapley values
Suppose the (hypothetical) value of the team with only player A is 6
v({A}) = 6, where v is called the characteristic function

Next, we add player B


v({A}) = 6

v({A, B}) = 10

B has a marginal contribution of 4

Next, add player C


v({A, B, C}) = 12

C has a marginal contribution of 2

68
Shapley values
Adding C before B?
v({A}) = 6

v({A, C}) = 10

v({A, C, B}) = 12

C has a marginal contribution of 4, B of 2

To be fair to C in terms of contribution payout, we need to average their contribution in all


formations of the team

Contribution does not depend on the order of players that were added before
v({A, B, C}) − v({A, B}) = v({B, A, C}) − v({B, A})

Contribution is also independent from how remaining players are added afterwards

69
Shapley values
The Shapley value of a player j is now:

\phi_j = \sum_{S \subseteq \{x_1, \ldots, x_p\} \setminus \{x_j\}} \frac{|S|! \, (p - |S| - 1)!}{p!} \left[ v(S \cup \{x_j\}) - v(S) \right]

To know the contribution of player j, we need to know their contribution to every possible sub-team
Sum over all unordered subsets we can make with the other players

|S|!: all orderings of the players already in the sub-team S

(p − |S| − 1)!: all orderings of the remaining players (p is the number of players)

p!: normalizes over the total number of possible orderings of the team

For each sub-team, take the value with player j minus the value without

70
Shapley values
How to bring this to machine learning?

The players are the features with their values; the game that is played is the prediction of an instance
x = \{x_1, \ldots, x_p\} using a model \hat{f}

v(X_S) = \int \hat{f}(x_1, \ldots, x_p) \, dP_{x \notin S} - E_X[\hat{f}(X)]

Prediction of the model using only included feature values minus average prediction

71
Shapley values
Problem: lots of possible subsets of features. Also: prediction with only included features
means retraining, or imputation with median/mode?

Monte Carlo sampling with permutation based approximation

\hat{\phi}_j = \frac{1}{M} \sum_{m=1}^{M} \left( \hat{f}(x^m_{+j}) - \hat{f}(x^m_{-j}) \right)

To calculate the Shapley value for a feature j and an instance x, we sample M times and:

Draw a random instance z from the data


Construct two new instances: one with the value of x for feature j (x^m_{+j}), and one with the value of z (x^m_{-j})

The other features are randomly chosen (permuted) across x and z


We then let the model predict for both

72
Shapley values
R F M debt age income
F and x is the feature/instance under
x 5 2 1000 100 44 2500 observation, we permute over {R, Debt}
z 10 4 500 50 36 3000 versus {M, age, income} (randomly
chosen)

M
1
^ ^ m ^ m
R F M debt age income ϕ = ∑ (f (x ) − f (x ))
j +j −j
M
m m=1
x
+j
5 2 500 100 36 3000
m
x
−j
5 4 500 100 36 3000

73
Shapley values
Start from the default (average) prediction and assess the contribution of each feature value towards the outcome
Shapley values represent the contribution of a feature to the given output for a given instance, not how much
that output would change when removing the feature
Most commonly used instance explainability technique in industry

0.1 (base rate) + 0.4 (age=65) - 0.3 (sex=F) + 0.1 (bp=180) + 0.1 (bmi=40) = 0.4
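A minimal sketch with the shap package for a tree ensemble (regression, to keep the output single-valued); the dataset is illustrative, and the first print checks the additivity shown above (base value + contributions ≈ prediction):

```python
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

explainer = shap.TreeExplainer(model)      # fast Shapley values for tree ensembles
shap_values = explainer.shap_values(X)     # one row of feature contributions per instance

# Base value + contributions for one instance ≈ the model's prediction for that instance
print(explainer.expected_value + shap_values[0].sum(), model.predict(X.iloc[[0]]))

# Global view: feature importance and direction of impact across all instances
shap.summary_plot(shap_values, X)
```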

74
Shapley values

Explaining a single instance

75
Shapley values

Features, feature values, impact on model output, across all instances

76
Shapley values

Singling out one or two features and plotting their Shapley values across instances:
comparable to a partial dependence plot

Bring in second feature to look at interaction effects

77
Closing
More reading and packages:

Fantastic book on the topic: https://christophm.github.io/interpretable-ml-book/


Good overview: https://github.com/jphall663/awesome-machine-learning-interpretability
scikit-learn : https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance.html (recent
versions also have support for partial dependence and ICE plots)
The pdp R package: https://cran.r-project.org/web/packages/pdp/pdp.pdf
The iml R Package: https://cran.r-project.org/web/packages/iml/index.html

Descriptive mAchine Learning EXplanations ( DALEX ) R Package: https://github.com/pbiecek/DALEX


eli5 for Python: https://eli5.readthedocs.io/en/latest/index.html

Skater for Python: https://github.com/datascienceinc/Skater

Or with https://github.com/parrt/random-forest-importances ( rfpimp package: recommended)


Or with https://github.com/ralphhaygood/sklearn-gbmi ( sklearn-gbmi )

pdpbox for Python: https://github.com/SauceCat/PDPbox


The vip R package: https://koalaverse.github.io/vip/index.html

Shapley Values: https://github.com/slundberg/shap


https://medium.com/@Zelros/a-brief-history-of-machine-learning-models-explainability-f1c3301be9dc
http://blog.macuyiko.com/post/2019/discovering-interaction-effects-in-ensemble-models.html and https://blog.macuyiko.com/post/2021/revisiting-
discovery-of-interaction-effects.html
http://explained.ai/rf-importance/index.html

78
Or…
Graft, Reassemble, Answer delta, Neighbour sensitivity, Training delta (GRANT) -
https://github.com/wagtaillabs/GRANT

79
Or…

80
