Introduction
Ensemble modeling: the basic motivation
Are two models better than one?
Intuitively, this makes sense: you might have two models that are each good at predicting a certain (different) subsegment of your data set
So combining them seems like a good way to increase performance
We’ll also see that ensembles can make models more robust to overfitting and to noise
Combination of models is called an “ensemble”
https://towardsdatascience.com/the-unexpected-lesson-within-a-jelly-bean-jar-1b6de9c40cca
Can we have it all?
Overfitting:
Model is too specific, works great on training data but not on a new data set
E.g.: a very deep decision tree
Can we have it all?
We have seen early stopping and pruning
Can we have it all?
Also consider: what if we could combine multiple linear classifiers?
Combination Rules and Voting Systems
Combination rules
Let’s say we’ve created two models
How to combine them?
True label   Model 1 (threshold: 0.54)   Model 2 (threshold: 0.50)   Ensemble?
Yes          0.80 (yes)                  0.70 (yes)                  ?
Yes          0.78 (yes)                  0.50 (yes)                  ?
Yes          0.54 (yes)                  0.50 (yes)                  ?
No           0.57 (yes)                  0.30 (no)                   ?
No           0.30 (no)                   0.70 (yes)                  ?
No           0.22 (no)                   0.40 (no)                   ?
Combination rules
Algebraic combination
Determine a new, optimal cutoff!
True label   Model 1 (0.54)   Model 2 (0.50)   Min (0.50)   Max (0.78)   Mean (0.52)
Yes          0.80 (yes)       0.70 (yes)       0.70 (yes)   0.80 (yes)   0.75 (yes)
Yes          0.78 (yes)       0.50 (yes)       0.50 (yes)   0.78 (yes)   0.64 (yes)
Yes          0.54 (yes)       0.50 (yes)       0.50 (yes)   0.54 (no)    0.52 (yes)
No           0.57 (yes)       0.30 (no)        0.30 (no)    0.57 (no)    0.44 (no)
No           0.30 (no)        0.70 (yes)       0.30 (no)    0.70 (no)    0.50 (no)
No           0.22 (no)        0.40 (no)        0.22 (no)    0.40 (no)    0.31 (no)
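As a concrete illustration (a minimal Python sketch, not from the slides), the three algebraic rules applied to the probabilities in the table above:

```python
import numpy as np

# Predicted probabilities of the two base models for the six instances above
p1 = np.array([0.80, 0.78, 0.54, 0.57, 0.30, 0.22])  # model 1
p2 = np.array([0.70, 0.50, 0.50, 0.30, 0.70, 0.40])  # model 2

combined = {
    "min":  np.minimum(p1, p2),
    "max":  np.maximum(p1, p2),
    "mean": (p1 + p2) / 2,
}

# New optimal cutoffs from the table, one per combination rule
cutoffs = {"min": 0.50, "max": 0.78, "mean": 0.52}

for rule, scores in combined.items():
    print(rule, (scores >= cutoffs[rule]).astype(int))  # 1 = "yes"
```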
Voting
Useful when combining models: majority voting
Less sensitive to the underlying probability distributions; no need for calibration or for determining an optimal new cutoff
E.g. six base classifiers voting: “yes” wins (4 to 2)
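A minimal majority-voting sketch using only the Python standard library:

```python
from collections import Counter

# Votes of six base classifiers for a single instance
votes = ["yes", "yes", "no", "yes", "no", "yes"]

winner, count = Counter(votes).most_common(1)[0]
print(f'"{winner}" wins ({count} to {len(votes) - count})')  # "yes" wins (4 to 2)
```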
Mixture of experts
Jordan and Jacobs’ mixture of experts (Jacobs et al., 1991) generates several “experts” (classifiers) whose outputs are combined through a linear rule
The weights of this combination are determined by a “gating network”, typically trained using the expectation-maximization (EM) algorithm
But: loss of interpretability, additional production strain!
Stacking
Wolpert’s stacked generalization, or stacking (Wolpert, 1992): a meta-learner is trained on the predictions of the base models
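A minimal stacking sketch, assuming scikit-learn’s StackingClassifier (the base and meta learners are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

# The meta-learner is trained on out-of-fold predictions of the base learners
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svm", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X, y)
```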
Smoothing
ensemble prediction = λ × (model 1 prediction) + (1 − λ) × (model 2 prediction)
Bagging
Bagging
Bagging (bootstrap aggregating) is one of the earliest, most intuitive and perhaps simplest ensemble-based algorithms, with surprisingly good performance (Breiman, 1996)
Since the training data sets may overlap substantially, additional measures can be used to increase diversity, such as also subsampling the features (see the sketch below)
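A minimal bagging sketch, assuming scikit-learn’s BaggingClassifier (in versions before 1.2 the first parameter is named base_estimator):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Each of the 50 trees sees its own bootstrap sample of the training data;
# max_features adds extra diversity by also subsampling the features
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    max_samples=1.0,   # bootstrap size 100%, drawn with replacement
    max_features=0.8,
    bootstrap=True,
    random_state=0,
)
bag.fit(X, y)
```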
Out-of-bag (OOB) validation
When using bagging, one can already estimate the generalization capabilities of the ensemble model using the training data: out-of-bag (OOB) validation
When validating an instance i, only consider those models that did not have i in their bootstrap sample
A good initial validation check, though an independent test set is still required!
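A minimal sketch of OOB validation, using scikit-learn’s oob_score option on a random forest:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# With oob_score=True, each training instance is scored only by the trees
# that did not have it in their bootstrap sample
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)  # OOB accuracy; an independent test set is still required!
```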
Random forests: the quintessential bagging technique
Random forests (Leo Breiman and Adele Cutler): bagging-based ensemble learning method for classification and
regression
Construct a multitude of decision trees at training time and output the majority vote of the individual trees’ classes (classification) or their mean prediction (regression)
Applies bagging, so one part of the randomness comes from bootstrapping: each decision tree sees a random bootstrap sample of the training data
However, random forests use an additional piece of randomness: the candidate features to split on are a random subset of the features (sampled at every split, not once per tree!)
Random decision forests correct for decision trees’ habit of overfitting to their training set
Random forests: the quintessential bagging technique
How many trees?
Adding more trees does not increase the risk of overfitting, so use plenty
Depth of the trees?
No pruning necessary
But one can still decide to apply some pruning or early stopping mechanisms to speed up training (many implementations do so)
Size of the bootstrap?
Can be 100% (this doesn’t mean selecting all instances, as we’re drawing with replacement: on average only ≈ 63% of the distinct instances appear in each sample)
Lower values are possible given enough data points; the key is to build enough trees
Random forests: the quintessential bagging technique
Thinking points: how to assign a probability? How to set the thresholds of the base
classifiers (do we need to)?
Random forests: the quintessential bagging technique
Random forests are easy to use and don’t require much configuration or preprocessing
Because you are building many trees, the ensemble captures lots of interaction effects “for free”
Good at avoiding overfitting (by design)
However… how to explain 100 trees vs. 1…
Many fun extensions, e.g. Extremely Randomized Trees (Extra-Trees): also consider a random subset of the possible splitting points, instead of only a random subset of features!
See also Maximizing Tree Diversity by Building Complete-Random Decision Trees (Liu et al., 2005)
There is even such a thing as completely randomized trees (and we’ll see an application of those soon)
Boosting
Boosting
Similar to bagging, boosting also creates an ensemble of classifiers which are then
combined
AdaBoost
AdaBoost trains an ensemble of weak learners over a number of rounds T
At first, every instance has the same weight (D₁(i) = 1/N), so AdaBoost trains a normal classifier
Next, samples that were misclassified by the ensemble so far are given a heavier weight
The learner itself is also given a weight (αₜ) depending on its accuracy and is incorporated into the ensemble
AdaBoost then constructs a new learner, now incorporating the weights so far
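A minimal sketch, assuming scikit-learn’s AdaBoostClassifier with decision stumps as the weak learners (before scikit-learn 1.2 the first parameter is named base_estimator):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Decision stumps as weak learners, reweighted over T = 100 rounds
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=100,
    random_state=0,
)
ada.fit(X, y)
print(ada.estimator_weights_[:5])  # the alpha_t of the first learners
```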
AdaBoost
Friedman et al. (2000) showed that AdaBoost can be interpreted as an additive logistic regression model
Gradient Boosting Machines
Friedman et al.: “what if we want to optimize a different loss function?”
Gradient Boosting Machines
Gradient boosting will fit learners to the residual y − F_m(x)
For the squared loss, this residual is (proportional to) the negative gradient of the loss:

L_MSE = (1/n) Σ_i (y_i − F(x_i))²,   −∂L_MSE/∂F = (2/n) (y − F(x)) = (2/n) h_m(x)

Other loss functions:

LogLoss (binary classification) = −(1/n) Σ_{i=1}^{n} [ y_i log(p_i) + (1 − y_i) log(1 − p_i) ]

MLogLoss (multiclass classification) = −(1/n) Σ_{i=1}^{n} Σ_{j=1}^{m} y_ij log(p_ij)
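A from-scratch sketch of gradient boosting for the squared loss, where each new tree is fit to the current residuals (toy data; settings are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

F = np.full_like(y, y.mean())  # F_0: constant initial prediction
for m in range(100):
    residuals = y - F                       # negative gradient of the squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    F += 0.1 * tree.predict(X)              # learning rate 0.1

print(np.mean((y - F) ** 2))  # training MSE shrinks as trees are added
```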
Extreme Gradient Boosting
Expansion on the GBM idea by improving the loss optimization method (a mathematical improvement that speeds up training)
First, it uses the second partial derivatives of the loss function (gradient and Hessian), which provide more information about the direction of the gradient and how to get to the minimum of the loss function
A very powerful technique: it wins many Kaggle competitions that deal with structured data!
But:
Extreme Gradient Boosting
Expansion on the GBM idea by improving the loss optimization method (a mathematical improvement that speeds up training)
Second, it adds a regularization term, here defined over the depth of the tree or the number of leaf nodes
The objective function combines the loss and the simplicity of the trees
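A minimal sketch, assuming the xgboost package; gamma and reg_lambda are the knobs that penalize complex trees in the objective:

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier  # pip install xgboost

X, y = make_classification(n_samples=1000, random_state=0)

# Second-order loss optimization plus explicit tree regularization:
# gamma penalizes extra leaves, reg_lambda the leaf weights
model = XGBClassifier(
    n_estimators=200,
    max_depth=4,
    learning_rate=0.1,
    gamma=1.0,
    reg_lambda=1.0,
)
model.fit(X, y)
```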
DART
Rashmi Korlakai Vinayak, Ran Gilad-Bachrach. “DART: Dropouts meet Multiple Additive
Regression Trees.” (2015)
“XGBoost mostly combines a huge number of regression trees with a small learning rate. In this situation, trees added early are significant and trees added late are unimportant. Vinayak and Gilad-Bachrach propose a new method to add dropout techniques from the deep neural net community to boosted trees, and reported better results in some situations.”
Feature interaction constraints
“Variables that appear together in a traversal path are interacting with one another, since the condition of a child node is predicated on the condition of the parent node. For example, the highlighted red path in the diagram.”
[Diagram: a decision tree with splits x10 < -1.5, x2 < 2, x7 < 0.3 and x1 < 0.5; the highlighted path marks features that interact]
Monotonic relationship constraints
“It is often the case in a modeling problem or project that the functional form of an acceptable model is constrained in some way. A common type of constraint in this situation is that certain features bear a monotonic relationship to the predicted response.”
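A minimal sketch of both constraint types, assuming the xgboost package (accepted formats can differ slightly per version):

```python
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.05, size=200)

# Feature 0 must act monotonically increasing (+1), feature 1 decreasing (-1),
# feature 2 is unconstrained (0)
mono = XGBRegressor(monotone_constraints="(1,-1,0)").fit(X, y)

# Features may only interact within the listed groups: {0, 1} and {2}
inter = XGBRegressor(interaction_constraints=[[0, 1], [2]]).fit(X, y)
```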
Comparing Bagging and Boosting
Comparing bagging and boosting
Bagging can be done in parallel (each sub-model is built independently)
Boosting is sequential (each new sub-model tries to do well where the previous ones fall short)
Comparing bagging and boosting
Bagging decreases variance; boosting primarily decreases bias
Opening the Black Box
Opening the black box
“How to explain 100 trees versus 1?”
Since we’ve now stepped into the domain of black-box models, we need to look at
appropriate techniques to gain understanding of the models we construct!
Some of these are “native” to the concept of decision trees used in our ensembles
Others can be used for any type of model (“model agnostic” techniques)
Feature importance
Feature importance
Which features are important according to the model?
Permutation importance
No retraining required
Model agnostic
Permutation based feature importance
Start with a trained model on a given data set and a performance measure (e.g. AUC)
Randomly permute the values of the feature under study
Use the trained model to predict the observations again
Importance = baseline performance measure − performance measure on the permuted data set
Note that this takes interactions into account (as shuffling breaks interaction effects)
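A minimal sketch, assuming scikit-learn’s permutation_importance:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle each feature n_repeats times and measure the drop in AUC
result = permutation_importance(rf, X_te, y_te, scoring="roc_auc",
                                n_repeats=10, random_state=0)
print(result.importances_mean)
```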
Permutation based feature importance
Indicates which features are important, but not to what extent they affect the outcome
Can also be used for feature selection: retrain the model on the top-N features (e.g. the Boruta package in R)
Permuting a single feature breaks up its interaction effects!
But it is possible to permute multiple features at once to zoom in on an interaction effect (“keep together”)
Feature importance
https://explained.ai/rf-importance/index.html
Feature importance
Check your documentation: many implementations (e.g. scikit-learn) implement Gini-impurity-based importance
scikit-learn 0.23.2: permutation importance is finally available out of the box, but use the correct approach: https://scikit-learn.org/stable/modules/permutation_importance.html
And also: take correlation effects into account!
Partial dependence plots
Each point on the partial dependence plot is the average vote percentage (or average
continuous prediction) across all observations
Partial dependence plots
Option 1) Keep feature under observation as-is, impute with median and mode for others
Partial dependence plots
Option 2) For each value of the feature under study, generate N synthetic instances
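A minimal sketch, assuming scikit-learn’s PartialDependenceDisplay (plotting requires matplotlib):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = make_regression(n_samples=500, n_features=4, random_state=0)
rf = RandomForestRegressor(random_state=0).fit(X, y)

# For each grid value of features 0 and 2, predictions are averaged over all
# observations (cf. option 2 above)
PartialDependenceDisplay.from_estimator(rf, X, features=[0, 2])
```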
Partial dependence plots
Note that absence of evidence does not mean evidence of absence!
E.g. interaction effects might not show in partial dependence plot given the ceteris paribus approach (keeping
everything fixed except for the variable under observation)
y ~ x0 * x1 + x2 + noise
Partial dependence plots
On the other hand, this is also a benefit…
“Why not look at the data itself to assess the impact on an outcome (i.e. by constructing bins on a feature under observation and looking at the percentage of yes vs. no cases per bin)?”
You might infer correlations from this univariate investigation which might not be true given the presence of
interaction effects!
E.g. “sales drop for customers between 30 and 40 years old” (data) vs. “sales stay constant” (partial
dependence plot) indicates presence of interaction effects: age alone is not a sufficient explanation!
Need to inspect both!
Partial dependence plots
Here too, like permutation importance, it is possible to keep more than one feature as-is whilst replacing the others with their median and mode
Harder to visualize, however (e.g. contour plots, 3D plots… infeasible for higher dimensions)
See also the forestfloor R package: http://forestfloor.dk/
Individual conditional expectation (ICE) plots
Similar idea to partial dependence plots
Create new instances based on the values of the feature under observation
Again, also possible to define a grid-based range over the feature under observation between its observed
minimum and maximum
Individual conditional expectation (ICE) plots
Every original instance now leads to multiple rows in the modified data set
Again, we let the model predict over all these instances
For each distinct value of the feature under observation, we now have multiple predictions
Instead of averaging, we also plot the different lines
Finally, the lines are commonly centered
An average line can also be plotted (yellow line in the plot: similar to PDP option 2)
ICE plots are well suited to show the behavior of a feature across the data set
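A minimal ICE sketch, again via scikit-learn (kind="both" and centered=True need a recent version):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = make_regression(n_samples=300, n_features=4, random_state=0)
rf = RandomForestRegressor(random_state=0).fit(X, y)

# kind="both": one centered line per instance (ICE) plus their average (PDP)
PartialDependenceDisplay.from_estimator(rf, X, features=[0],
                                        kind="both", centered=True)
```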
Tree inspection
Feed an instance through every tree and tally which variables were used most often
Either over all trees or only over those agreeing with the majority-vote outcome
Can also keep track of the splitting points per variable
Not model-agnostic!
https://medium.com/airbnb-engineering/unboxing-the-random-forest-classifier-the-threshold-distributions-22ea2bb58ea6
LIME
Local interpretable model-agnostic explanations
https://github.com/marcotcr/lime
LIME
LIME perturbs the instance under study to generate a data set of samples in its neighborhood
A simple model is trained over this perturbed data set, e.g. a Lasso regression
Regression, as the output is now the predicted probabilities of the black-box model: a continuous value
Lasso, to keep explanations sparse
This provides us with a simple, local decision boundary which can be easily inspected:
LIME
Decision boundary of “explanatory model” approximates decision boundary of black box model around the
instance under study
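A minimal sketch, assuming the lime package (feature and class names are illustrative):

```python
from lime.lime_tabular import LimeTabularExplainer  # pip install lime
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X, feature_names=[f"f{i}" for i in range(5)],
    class_names=["no", "yes"], mode="classification",
)
# Fit a sparse linear model on perturbed samples around instance 0
exp = explainer.explain_instance(X[0], model.predict_proba, num_features=3)
print(exp.as_list())
```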
LIME
Also works on non-tabular data: only how the perturbation is performed changes
Text: new texts are created by randomly removing words from the original text
Images: perturbing individual pixels does not make a lot of sense. Instead, groups of pixels in the image are perturbed at once by “blanking” them (removing them from the image)
These groups are called “superpixels”: interconnected pixels with similar coloring
Can be found using e.g. k-means
LIME
Defining the neighborhood around the instance (which determines the sample weights) is difficult
The distance measure or bandwidth of the exponential smoothing kernel can impact results
One can also simply select the k nearest neighbors around the instance under study (but then an appropriate k must be chosen)
LIME’s main advantage is that it is easy to understand and works on tabular data, text and images
Global surrogate
Simply train an interpretable model to use whilst explaining, which hopefully stays close to the original model
However:
Also remember the basics regarding model inspection (look at your confusion matrix, look at the “most confused” instances)
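A minimal global-surrogate sketch (the shallow tree depth is an illustrative choice):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X, y)

# Train the surrogate on the black box's predictions, not on the true labels
surrogate = DecisionTreeClassifier(max_depth=3)
surrogate.fit(X, black_box.predict(X))

# Fidelity: how closely does the surrogate mimic the black box?
print(surrogate.score(X, black_box.predict(X)))
```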
Shapley values
Combines the best of the above: importance, partial dependence, and instance-level explanations
Originated in game theory
Represent the fair payout for each player in a cooperative game
Can be used to measure the contribution of a feature to a model
Shapley values
We want to play a game. What is the contribution of each player to the team?
We are interested in the marginal contribution of each player
Shapley values
Suppose the (hypothetical) value of the team with only player A is 6
v({A}) = 6, where v is called the characteristic function
v({A, B}) = 10, so B’s marginal contribution after A is 10 − 6 = 4
Shapley values
Adding C before B?
v({A}) = 6
v({A, C}) = 10
v({A, C, B}) = 12
Contribution does not depend on the order of players that were added before
v({A, B, C}) − v({A, B}) = v({B, A, C}) − v({B, A})
Contribution is also independent from how remaining players are added afterwards
Shapley values
The Shapley value of a player j is now:
φ_j = Σ_{S ⊆ P∖{j}} [ |S|! (|P| − |S| − 1)! / |P|! ] (v(S ∪ {j}) − v(S))
To know the contribution of player j, we need to know their contribution to every possible sub-team
Sum over all unordered subsets we can make with other players
Shapley values
How to bring this to machine learning?
The players are the features with their values; the game that is played is the prediction of an instance x = {x₁, …, x_p} using a model f̂

v(X_S) = ∫ f̂(x₁, …, x_p) dP_{x∉S} − E_X( f̂(X) )

Prediction of the model using only the included feature values, minus the average prediction
Shapley values
Problem: lots of possible subsets of features. Also: a prediction with only the included features means retraining, or imputation with median/mode?
Instead, use a Monte Carlo approximation:

φ̂_j = (1/M) Σ_{m=1}^{M} ( f̂(x^m_{+j}) − f̂(x^m_{−j}) )

To calculate the Shapley value for a feature j and an instance x, we sample M times and:
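A minimal sketch of this sampling procedure (shapley_mc is a hypothetical helper; f can be any function mapping a feature vector to a prediction):

```python
import numpy as np

def shapley_mc(f, X, x, j, M=1000, seed=0):
    """Monte Carlo estimate of the Shapley value of feature j for instance x."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    total = 0.0
    for _ in range(M):
        z = X[rng.integers(len(X))]      # random background instance z
        order = rng.permutation(p)       # random ordering of the features
        pos = int(np.where(order == j)[0][0])
        x_plus = z.copy()                # x_{+j}: features up to and incl. j from x
        x_plus[order[:pos + 1]] = x[order[:pos + 1]]
        x_minus = x_plus.copy()          # x_{-j}: identical, but j's value from z
        x_minus[j] = z[j]
        total += f(x_plus) - f(x_minus)
    return total / M

# Toy usage with a known model in which feature 0 interacts with feature 1
X = np.random.default_rng(1).normal(size=(100, 4))
f = lambda v: v[0] * v[1] + v[2]
print(shapley_mc(f, X, X[0], j=0))
```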
Shapley values
R F M debt age income
F and x is the feature/instance under
x 5 2 1000 100 44 2500 observation, we permute over {R, Debt}
z 10 4 500 50 36 3000 versus {M, age, income} (randomly
chosen)
↓
M
1
^ ^ m ^ m
R F M debt age income ϕ = ∑ (f (x ) − f (x ))
j +j −j
M
m m=1
x
+j
5 2 500 100 36 3000
m
x
−j
5 4 500 100 36 3000
Shapley values
Start from the default (average) prediction and assess the contribution of each feature (with its value) towards the outcome
A Shapley value represents the contribution of a feature to the given output for a given instance, not by how much that output would change when removing the feature
Most commonly used instance explainability technique in industry
0.1 (base rate) + 0.4 (age=65) - 0.3 (sex=F) + 0.1 (bp=180) + 0.1 (bmi=40) = 0.4
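In practice one would typically reach for the shap package; a minimal sketch (return shapes differ slightly per shap version):

```python
import shap  # pip install shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=5, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

# TreeExplainer computes Shapley values efficiently for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # one value per instance per feature

shap.summary_plot(shap_values, X)  # importance and direction across instances
```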
Shapley values
Singling out one or two features and plotting their Shapley values across instances: comparable to a partial dependence plot
Closing
More reading and packages:
Or…
Graft, Reassemble, Answer delta, Neighbour sensitivity, Training delta (GRANT) -
https://github.com/wagtaillabs/GRANT