
Advanced Analytics in Business

Big Data Platforms & Technologies


Ensemble Modeling: Bagging and Boosting
Interpretability (Part 1)
Overview
Introduction
Combination rules and voting systems
Bagging
Boosting
Interpretability (part 1)

2
Introduction

3
Ensemble modeling: the basic motivation
Are two models better than one?

Intuitively, this makes sense: you might have two models that are each good at
predicting a certain (different) subsegment of your data set
So this seems like a good way to increase performance
We'll also see that ensembles can make models more robust to overfitting and more robust to noise
Combination of models is called an “ensemble”

https://towardsdatascience.com/the-unexpected-lesson-within-a-jelly-bean-jar-1b6de9c40cca

4
Can we have it all?
Overfitting:

Model is too specific, works great on training data but not on a new data set
E.g.: a very deep decision tree

5
Can we have it all?
We have seen early stopping and pruning

Using a strong validation setup, too


But in the end: perhaps an accuracy level we might not be happy with

6
Can we have it all?
Also consider: what if we could combine multiple linear classifiers?

7
Combination Rules and Voting Systems

8
Combination rules
Let’s say we’ve created two models
How to combine them?

True label   Model 1 (threshold: 0.54)   Model 2 (threshold: 0.50)   Ensemble?
Yes          0.80 (yes)                  0.70 (yes)                  ?
Yes          0.78 (yes)                  0.50 (yes)                  ?
Yes          0.54 (yes)                  0.50 (yes)                  ?
No           0.57 (yes)                  0.30 (no)                   ?
No           0.30 (no)                   0.70 (yes)                  ?
No           0.22 (no)                   0.40 (no)                   ?

9
Combination rules
Algebraic combination
Determine new, optimal cutoff!

True label   Model 1 (0.54)   Model 2 (0.50)   Min (0.50)   Max (0.78)   Mean (0.52)
Yes          0.80 (yes)       0.70 (yes)       0.70 (yes)   0.80 (yes)   0.75 (yes)
Yes          0.78 (yes)       0.50 (yes)       0.50 (yes)   0.78 (yes)   0.64 (yes)
Yes          0.54 (yes)       0.50 (yes)       0.50 (yes)   0.54 (no)    0.52 (yes)
No           0.57 (yes)       0.30 (no)        0.30 (no)    0.57 (no)    0.44 (no)
No           0.30 (no)        0.70 (yes)       0.30 (no)    0.70 (no)    0.50 (no)
No           0.22 (no)        0.40 (no)        0.22 (no)    0.40 (no)    0.31 (no)
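To make the combination rules concrete, a minimal sketch in Python/NumPy reproducing the min/max/mean columns above; the probabilities come straight from the table and the cutoffs are the ones in the column headers:

```python
import numpy as np

# Predicted probabilities of the two models, taken from the table above
p1 = np.array([0.80, 0.78, 0.54, 0.57, 0.30, 0.22])
p2 = np.array([0.70, 0.50, 0.50, 0.30, 0.70, 0.40])
y  = np.array([1, 1, 1, 0, 0, 0])  # true labels (yes = 1, no = 0)

# Algebraic combination rules with their new, rule-specific cutoffs
combined = {"min": np.minimum(p1, p2), "max": np.maximum(p1, p2), "mean": (p1 + p2) / 2}
cutoffs  = {"min": 0.50, "max": 0.78, "mean": 0.52}

for rule, p in combined.items():
    pred = (p >= cutoffs[rule]).astype(int)
    print(rule, pred, "accuracy:", (pred == y).mean())
```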

10
Voting
Useful when combining models: majority voting
Less sensitive to underlying probability distributions; no need for calibration or determination of an optimal new cutoff

Six models vote: four "yes", two "no" → "yes" wins (4 to 2)

What about weighted voting?

Same six models, but model 4 (a "no" voter) gets 5 votes and the others 1 each → "no" wins (5+1 to 4)

We could also go for a linear combination of the probabilities

Though again: how to determine the weights?


A learning step in itself (meta-learning)
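A minimal sketch of (weighted) majority voting over the six hypothetical models above; which model cast which vote is made up here, apart from model 4 being a "no" voter:

```python
import numpy as np

# Hypothetical votes of six models on one instance (1 = "yes", 0 = "no"): four yes, two no
votes   = np.array([1, 1, 0, 0, 1, 1])
weights = np.array([1, 1, 1, 5, 1, 1])      # weighted voting: model 4 gets 5 votes

majority = int(votes.sum() > len(votes) / 2)                 # plain majority: "yes" wins 4 to 2
weighted = int((weights * votes).sum() > weights.sum() / 2)  # weighted: "no" wins (5 + 1) to 4
print(majority, weighted)                                    # -> 1, 0
```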

11
Mixture of experts
Jordan and Jacobs’ mixture of experts (Jacobs, 1991) generates several “experts”
(classifiers) whose outputs are combined through a linear rule

The weights of this combination are determined by a “gating network”, typically trained using the
expectation maximization (EM) algorithm
But: loss of interpretability, additional production strain!

12
Stacking
Wolpert’s (Wolpert, 1992) stacked generalization (or stacking)

An ensemble of Tier 1 classifiers is first trained on a subset of the training data


Outputs of these classifiers are then used to train a Tier 2 classifier (meta-classifier) (potentially with
original features)
The underlying idea is to learn whether training data have been properly learned
For example, if a particular classifier incorrectly learned a certain region of the feature space, then the Tier 2
classifier may be able to learn this behavior
But: loss of interpretability, additional production strain!
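A minimal stacking sketch using scikit-learn's StackingClassifier; the choice of Tier 1 models and the synthetic data are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Tier 1 classifiers; their (cross-validated) outputs feed the Tier 2 meta-classifier
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
                ("svc", SVC(probability=True, random_state=42))],
    final_estimator=LogisticRegression(),  # Tier 2 meta-classifier
    passthrough=True,                      # also feed the original features to Tier 2
)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))
```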

13
Smoothing

λ × (prediction of model 1) + (1 − λ) × (prediction of model 2)

14
Bagging

15
Bagging
Bagging (bootstrap aggregating) is one of the earliest, most intuitive and perhaps the simplest ensemble based
algorithms, with a surprisingly good performance (Breiman, 1996)

The main idea is to add diversity to the classifiers


Obtained by using bootstrapped replicas of the training data: different training data subsets are randomly drawn – with replacement –
from the entire training dataset
Each training data subset is used to train a different classifier of the same type
Individual classifiers are then combined by taking a simple majority vote of their decisions

Since the training datasets may overlap substantially, additional measures can be used to increase diversity, such
as:

Using a subset of the training data for training each classifier


Using a subset of features
Using unstable classifiers
Other ideas
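A minimal bagging sketch with scikit-learn's BaggingClassifier on synthetic data, combining bootstrapping with a feature subset per classifier (parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=42)

bag = BaggingClassifier(
    DecisionTreeClassifier(),  # same (unstable) base classifier for every ensemble member
    n_estimators=100,
    max_samples=1.0,           # bootstrap sample size, as a fraction of the training set
    max_features=0.8,          # extra diversity: each classifier sees a subset of features
    bootstrap=True,            # draw with replacement
    random_state=42,
).fit(X, y)
print(bag.predict(X[:5]))      # majority vote over the 100 trees
```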

16
Bagging

17
Out-of-bag (OOB) validation
When using bagging, one can already estimate the generalization capabilities of the
ensemble model using the training data: out-of-bag (OOB) validation

When validating an instance i, only consider those models which did not have i in their bootstrap sample
A good initial validation check, though an independent test set is still required!
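A minimal sketch of OOB validation with a scikit-learn random forest; oob_score=True scores each training instance using only the trees that did not see it in their bootstrap sample:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=42)

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42).fit(X, y)
print(rf.oob_score_)  # OOB accuracy: a good first check, not a substitute for a test set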

18
Random forests: the quintessential bagging
technique
Random forests (Leo Breiman and Adele Cutler): bagging-based ensemble learning method for classification and
regression

Constructs a multitude of decision trees at training time and outputs the class that is the majority vote of the classes (classification) or the
mean prediction (regression) of the individual trees
Applies bagging, so one part of randomness comes from bootstrapping each decision tree, i.e. each decision tree sees a random bootstrap
of the training data
However, random forests use an additional piece of randomness, i.e. to select the candidate features to split on, consider a random
subset of features (sampled at every split, not once per tree!)

Random decision forests correct for decision trees’ habit of overfitting to their training set

No more pruning needed


Great performance in most cases!

19
Random forests: the quintessential bagging
technique
How many trees?
No risk of overfit, so use plenty

Depth of tree?
No pruning necessary
But one can still decide to apply some pruning or early stopping mechanisms to speed up (many techniques will do so)

Size of bootstrap?
Can be 100% (this doesn’t mean selecting all instances, as we’re drawing with replacement!)
Lower values possible given enough data points. Key is to build enough trees

M: size of subset of features?


1, 2, all (i.e. “default bagging”)?
Heuristic: max(⌊N/3⌋, 1) for regression, ⌊√N⌋ for classification (with N the number of features)

Alternative: find through cross-validation!
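A minimal sketch of tuning the feature-subset size through cross-validation with scikit-learn; the grid of candidate values is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=300, random_state=42),
    # "sqrt" = the ⌊√N⌋ classification heuristic; None = all features ("default bagging")
    param_grid={"max_features": ["sqrt", 0.33, None]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```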

20
Random forests: the quintessential bagging
technique
Thinking points: how to assign a probability? How to set the thresholds of the base
classifiers (do we need to)?

“ In contrast to the original publication, the scikit-learn implementation combines classifiers by averaging their probabilistic prediction, instead of letting each classifier vote for a single class.

Why does this virtually guarantee the same outcome?

21
Random forests: the quintessential bagging
technique
Random forests are easy to use, don’t require much configuration or preprocessing
Because you are building many trees, the ensemble will include lots of interaction effects “for free”
Good at avoiding overfitting (by design)
However… how to explain 100 trees vs. 1…
Many fun extensions, e.g. Extra Randomized Trees: also consider a random subset of the possible splitting
points, instead of only a random subset of features!
See also Maximizing Tree Diversity by Building Complete-Random Decision Trees (Liu et al., 2005)
There is even a thing such as completely randomized trees (and we’ll see an application of those soon)

22
Boosting

23
Boosting
Similar to bagging, boosting also creates an ensemble of classifiers which are then
combined

However, not using bootstrapping this time around


Instead, classifiers are added sequentially, where each new classifier aims at correcting the mistakes made by the
ensemble thus far
In short: steering the learning towards fixing the mistakes it made in a previous step
Main idea is cooperation between classifiers, rather than adding diversity

24
Boosting

25
AdaBoost
AdaBoost trains an ensemble of weak learners over a number of rounds T
At first, every instance has the same weight (D_1 = 1/N), so AdaBoost trains a normal classifier
Next, samples that were misclassified by the ensemble so far are given a heavier weight
The learner is also given a weight (α_t) depending on its accuracy and incorporated into the ensemble
AdaBoost then constructs a new learner, now incorporating the weights so far
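A minimal AdaBoost sketch with scikit-learn, boosting decision stumps over T = 50 rounds (synthetic data, illustrative parameters):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=42)

# Misclassified samples get a heavier weight each round; each weak learner gets a weight α_t
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50,
                         learning_rate=1.0, random_state=42).fit(X, y)
print(ada.estimator_weights_[:5])  # the α_t weights of the first five weak learners
```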

26
AdaBoost
Friedman et al. showed that AdaBoost can be formulated as an additive logistic regression model

Assuming logistic regression as the base (weak) learner

AdaBoost optimizes an exponential loss function
Nice mathematical solution: shows that AdaBoost is closely linked to a particular loss function

27
Gradient Boosting Machines
Friedman et al.: “what if we would want to optimize a
different loss function?”

Doesn’t work with standard additive logistic regression setup or


AdaBoost
So take a different view: instead of weighting instances (and classifiers in the ensemble) in every cycle, let every sequential classifier predict the residuals of the ensemble so far
Learning to predict the errors
Boils down to the same: predicting the errors and then adjusting accordingly
But this allows optimizing any loss function based on its gradient
Weak decision trees are used as the base learner

28
Gradient Boosting Machines
Gradient boosting will fit learners to the residual y − F_m(x)

Each F_{m+1} attempts to correct the errors of its predecessor F_m

Follows from the observation that the residuals h_m(x) for a given model are proportional to the
negative gradient of the mean squared error (MSE) loss function:

L_{MSE} = \frac{1}{n}(y - F(x))^2, \qquad -\frac{\partial L_{MSE}}{\partial F} = \frac{2}{n}(y - F(x)) = \frac{2}{n} h_m(x)

So, gradient boosting could be specialized to a gradient descent algorithm, and


generalizing it entails plugging in a different loss and its gradient:
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

LogLoss (binary classification) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]

MLogLoss (multiclass classification) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} y_{ij} \log(p_{ij})
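To make the residual-fitting view concrete, a minimal hand-rolled gradient boosting sketch for the MSE loss, with shallow regression trees as weak learners (data and parameters are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, noise=10.0, random_state=42)

learning_rate, n_rounds = 0.1, 100
F = np.full(len(y), y.mean())                  # F_0: start from the mean prediction
trees = []

for m in range(n_rounds):
    residuals = y - F                          # negative gradient of the MSE loss (up to 2/n)
    tree = DecisionTreeRegressor(max_depth=3, random_state=m).fit(X, residuals)
    F += learning_rate * tree.predict(X)       # F_{m+1} = F_m + learning_rate · h_m
    trees.append(tree)

print(np.mean((y - F) ** 2))                   # training MSE shrinks as rounds are added
```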

29
Extreme Gradient Boosting
Expansion on GBM idea by improving the loss optimization method (mathematical
improvement that speeds up training)

First, it uses the second partial derivatives of the loss function, which provide more information about the
direction of the gradients and how to get to the minimum of the loss function (gradient and Hessian needed)

As implemented by the xgboost, lightgbm and catboost packages

Very powerful techniques: wins many Kaggle competitions that deal with structured data!

But:

How about the risk of overfitting?


Explainability? We still have a model with 100 trees…

30
Extreme Gradient Boosting
Expansion on GBM idea by improving the loss optimization method (mathematical
improvement that speeds up training)

Second, it adds regularization (L1 or L2), which improves model generalization

Defined as a “control” parameter on the complexity of the model

Here defined over the depth of the tree or number of leaf nodes
Objective function combines loss and simplicity of the trees
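A minimal xgboost sketch showing the regularization knobs discussed above; parameter names follow the current scikit-learn-style xgboost API and the values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = XGBClassifier(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=4,          # controls the complexity of each tree
    reg_alpha=0.1,        # L1 regularization term
    reg_lambda=1.0,       # L2 regularization term
    eval_metric="logloss",
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```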

31
DART
Rashmi Korlakai Vinayak, Ran Gilad-Bachrach. “DART: Dropouts meet Multiple Additive
Regression Trees.” (2015)

“ XGBoost mostly combines a huge number of regression trees with a small learning
rate. In this situation, trees added early are significant and trees added late are
unimportant.

Vinayak and Gilad-Bachrach propose a new method to add dropout techniques from
the deep neural net community to boosted trees, and reported better results in some

situations.

32
Feature interaction constraints
“ Variables that appear together in a traversal path are interacting with one another, since the condition of a child node is predicated on the condition of the parent node. For example, the highlighted red path in the diagram.

When the tree depth is larger than one, many variables interact on the sole basis of minimizing training loss, and the resulting decision tree may capture a spurious relationship (noise) rather than a legitimate relationship that generalizes across different datasets. Feature interaction constraints allow users to decide which variables are allowed to interact and which are not.

[Diagram: example tree with splits x10 < -1.5, then x2 < 2 / x7 < 0.3, …, then x1 < 0.5, leading to Predict +1.3]
33
Monotonic relationship constraints
“ It is often the case in a modeling problem or project that the functional form of an
acceptable model is constrained in some way. A common type of constraint in this
situation is that certain features bear a monotonic relationship to the predicted

response.
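A minimal sketch of both constraint types in xgboost (string-encoded, as in the xgboost tutorials); which features are constrained here is purely illustrative:

```python
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

X, y = make_regression(n_samples=1000, n_features=4, random_state=42)

model = XGBRegressor(
    n_estimators=200,
    # only features {0, 1} and {2, 3} may appear together on the same decision path
    interaction_constraints="[[0, 1], [2, 3]]",
    # force a monotonically increasing effect of feature 0 and a decreasing effect of feature 2
    monotone_constraints="(1, 0, -1, 0)",
)
model.fit(X, y)
```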

34
Comparing Bagging and Boosting

35
Comparing bagging and boosting
Bagging can be done in parallel (each sub-model is built independently)
Boosting is sequential (try to add new sub-models that do well where previous model lacks)

But the trees themselves can be parallelized somewhat

Bagging decreases variance, not necessarily bias

Suitable to combine high variance low bias models (complex models)


Example algorithm: random forest
Reducing the overfit of ensembles of complex models (strength of diversity)
Hence: deep decision trees

Boosting decreases bias, not necessarily variance

Suitable to combine low variance high bias models (simple models)


Example algorithms: AdaBoost, gradient boosting machines (GBM), xgboost
Reducing the error of ensembles of simple models (strength of cooperation)
Hence: logistic regression, or non-deep, “weak” decision stumps

36
Comparing bagging and boosting
Bagging decreases variance

Boosting decreases bias

High bias ←——————————————————————→ High variance
37
Comparing bagging and boosting
This means that you need to be careful of overfitting when using boosting techniques

There is such a thing as overboosting


xgboost and others implement regularization and other strategies to protect against this
Requires hyperparameter tuning, however
In most practical use cases, similar performance can be obtained using random forests without risk of overfit

38
Comparing bagging and boosting

39
Opening the Black Box

40
Opening the black box

“ How to explain 100 trees versus 1?

Since we’ve now stepped into the domain of black-box models, we need to look at
appropriate techniques to gain understanding of the models we construct!

Model interpretability techniques:

Some of these are “native” to the concept of decision trees used in our ensembles
Others can be used for any type of model (“model agnostic” techniques)

Work at different levels:

Help to explain the model (make the model simpler)


Help to explain a feature (explain which ones are important and how they drive the outcome)
Help to explain a particular instance-level prediction

41
Feature importance

42
Feature importance
Which features are important according to the model?

Different techniques exist:

Based on mean decrease in impurity in tree based ensembles


Biased: http://explained.ai/rf-importance/index.html
Model-dependent (gradient boosted models, random forests)

Based on position in trees


Model-dependent as well

Drop feature importance


Robust, but requires retraining
Model agnostic

Permutation importance
No retraining required
Model agnostic

43
Permutation based feature importance
Start with trained model on given data set and performance measure (e.g. AUC)
Randomly permute values for feature under study
Use trained model to predict observations again
Importance = baseline performance measure – performance measure on permuted data set
Note that this takes interactions into account (as shuffling breaks interaction effects)
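A minimal sketch using scikit-learn's permutation_importance, here scored with AUC on a held-out test set (synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

# Importance = baseline score minus the score after shuffling one feature at a time
result = permutation_importance(rf, X_test, y_test, scoring="roc_auc",
                                n_repeats=10, random_state=42)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")
```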

44
Permutation based feature importance
Indicates which features are important, but not to which extent they affect outcome
Can also be used for feature selection: retrain model on top-N features (e.g. Boruta package in R)
Permuting a single feature breaks up interaction effects!
But possible to permute multiple features at once to zoom in on interaction effect (“keep together”)

Correlated features will “share” importance, which might lead to misinterpretation!


Still important to check correlated features! ( rfpimp Python package has this built-in)

Or permute multiple features together (separately)

Extensions exist which add significance values (less commonly used)

45
Feature importance

https://explained.ai/rf-importance/index.html

46
Feature importance
Check your documentation: many implementations (e.g. scikit-learn) implement Gini impurity based importance
scikit-learn 0.23.2: permutation importance finally available out of the box, but use the correct approach:
https://scikit-learn.org/stable/modules/permutation_importance.html
And also: take into account correlation effects!

Train or test set? Both have their use!


Test set: in line with evaluation concerns: e.g. the feature
importance based on training data might make us mistakenly believe that features are important for the predictions, when in
reality the model was just overfitting and the features were not important at all
Train set: learn and understand how the model has actually relied on each feature

47
Partial dependence plots

Each point on the partial dependence plot is the average vote percentage (or average
continuous prediction) across all observations

Understand how feature impacts outcome of model

48
Partial dependence plots
Option 1) Keep feature under observation as-is, impute with median and mode for others

Also possible to define a range between observed minimum and maximum


Have the trained model predict on the new dataset and plot the results over the values of the feature under
observation

49
Partial dependence plots
Option 2) For each value of the feature under study, generate N synthetic instances

Based on observed values or a range between observed minimum and maximum


By taking the values for the other features from a randomly sampled instance
Average the prediction
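A minimal sketch of a partial dependence plot with recent scikit-learn versions (the averaging over the data set corresponds to option 2 above); features and data are illustrative:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = make_regression(n_samples=1000, n_features=5, random_state=42)
rf = RandomForestRegressor(n_estimators=200, random_state=42).fit(X, y)

# Sweep features 0 and 2 over a grid and average the model's predictions over the data set
PartialDependenceDisplay.from_estimator(rf, X, features=[0, 2])
plt.show()
```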

50
Partial dependence plots

51
Partial dependence plots
Note that absence of evidence does not mean evidence of absence!

E.g. interaction effects might not show in partial dependence plot given the ceteris paribus approach (keeping
everything fixed except for the variable under observation)

y ~ x0 * x1 + x2 + noise

52
Partial dependence plots
On the other hand, this is also a benefit…

“ Why not look at the data itself to assess the impact on an outcome (i.e. by
constructing bins on a feature under observation and looking at the percentage of yes

vs. no cases per bin)?

You might infer correlations from this univariate investigation which might not be true given the presence of
interaction effects!
E.g. “sales drop for customers between 30 and 40 years old” (data) vs. “sales stay constant” (partial
dependence plot) indicates presence of interaction effects: age alone is not a sufficient explanation!
Need to inspect both!

53
Partial dependence plots
Here too, like permutation importance, it is possible to keep more than one feature as-is whilst replacing the
others with their median and mode
Harder to visualize, however (e.g. contour plots, 3d plots… infeasible for higher dimensions)

http://forestfloor.dk/ R package

54
Individual conditional expectation (ICE) plots
Similar idea to partial dependence plots
Create new instances based on the values of the feature under observation
Again, also possible to define a grid-based range over the feature under observation between its observed
minimum and maximum

55
Individual conditional expectation (ICE) plots
Every original instance now leads to multiple rows in
the modified data set
Again, we let the model predict over all these
instances
For each distinct value for the feature under
observation, we now have multiple predictions
Instead of averaging, we also plot the different lines
Finally, the lines are commonly centered
An average line can also be plotted (yellow line in
plot: similar as PDP option 2)
ICE plots well suited to show behavior of feature across data set
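A minimal ICE sketch; in recent scikit-learn versions the same display can draw one centered line per instance plus the averaged PDP line:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = make_regression(n_samples=500, n_features=5, random_state=42)
rf = RandomForestRegressor(n_estimators=200, random_state=42).fit(X, y)

# kind="both" draws one line per instance plus the average; centered=True anchors them at 0
PartialDependenceDisplay.from_estimator(rf, X, features=[0], kind="both", centered=True)
plt.show()
```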

56
Tree inspection
Feed an instance through every tree and tally which variables were used most often

Either only for the trees agreeing with the majority-vote outcome, or for all of them
Can also keep track of the splitting points per variable
Not model-agnostic!

https://medium.com/airbnb-engineering/unboxing-the-random-forest-classifier-the-threshold-distributions-22ea2bb58ea6
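A minimal sketch of such a tally for a scikit-learn random forest: feed one instance through every tree and count which features the visited split nodes test (dataset and model are illustrative):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

instance = X[:1]
counts = np.zeros(X.shape[1], dtype=int)

for tree in rf.estimators_:
    node_ids = tree.decision_path(instance).indices   # nodes visited by this instance
    for node_id in node_ids:
        feature = tree.tree_.feature[node_id]
        if feature >= 0:                               # leaves carry a negative feature index
            counts[feature] += 1

for i in counts.argsort()[::-1][:5]:
    print(f"feature {i} used in {counts[i]} splits on this instance's paths")
```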

57
LIME
Local interpretable model-agnostic explanations

“Local surrogate model”


Works on the instance level

https://github.com/marcotcr/lime

58
LIME

59
LIME

60
LIME
A simple model is trained over this perturbed data set, e.g. a Lasso regression
Regression, because the output is now the predicted probability of the black-box model: a continuous value
Lasso to keep the explanations sparse

This provides us with a simple, local decision boundary which can be easily inspected:

61
LIME
Decision boundary of “explanatory model” approximates decision boundary of black box model around the
instance under study
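A minimal sketch with the lime package on tabular data; the dataset, model and number of features shown are illustrative:

```python
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(data.data, data.target)

explainer = LimeTabularExplainer(data.data, feature_names=data.feature_names,
                                 class_names=data.target_names, mode="classification")

# Perturb the instance, weight perturbations by proximity, fit a sparse (Lasso-style) linear model
exp = explainer.explain_instance(data.data[0], rf.predict_proba, num_features=5)
print(exp.as_list())  # top local feature contributions around this instance
```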

62
LIME
Also works on non tabular data: changes how the perturbation is performed
Text: new texts are created by randomly removing words from the original text
Images: perturbing individual pixels does not make a lot of sense. Instead, groups of pixels in the image are
perturbed at once by “blanking” them (removing them from the image)
These groups are called “superpixels”, based on interconnected pixels with similar coloring
Can be found using e.g. k-means

Ribeiro et al., 2016

63
LIME
Defining the neighborhood around the instance (to determine the weights) is difficult
Distance measure or bandwidth of exponential smoothing kernel can impact results
One can also simply select k nearest neighbors around instance under study (but need to decide upon appropriate k)

Choosing the simple model is somewhat arbitrary


E.g. tuning of regularization parameter of LASSO regression can impact results
Main reason why not used that often

Main advantage is that LIME is easy to understand and works on tabular data, text and images

Was very common for a while, now less used

64
Global surrogate
Simply train an interpretable model to use whilst explaining, which hopefully is pretty
close to the original model

However:

How close is close enough?


It could happen that the interpretable model is very close for one subset of the dataset, but widely divergent
for another subset: interpretation for the simple model would not be equally good for all data points

Also remember the basics regarding model inspection (look at your confusion matrix, look
at the “most confused” instances)
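A minimal global surrogate sketch: train a shallow decision tree on the black-box model's predictions and measure how faithfully it mimics them (data and models are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=42)

black_box = GradientBoostingClassifier(random_state=42).fit(X, y)
bb_predictions = black_box.predict(X)

# Train the interpretable model on the black box's predictions, not on the true labels
surrogate = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X, bb_predictions)

# Fidelity: how close is close enough?
print("fidelity:", accuracy_score(bb_predictions, surrogate.predict(X)))
```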

65
Shapley values
Combines the best of the above: importance, partial dependence, and instance-
level explanations
Originated in game theory
Represent the fair payout for each player in a cooperative game
Can be used to measure the contribution of a feature to a model

66
Shapley values
We want to play a game. What is contribution of each player to the team?
We are interested in the marginal contribution of each player

67
Shapley values
Suppose the (hypothetical) value of the team with only player A is 6
v({A}) = 6, where v is called the characteristic function

Next, we add player B


v({A}) = 6

v({A, B}) = 10

B has a marginal contribution of 4

Next, add player C


v({A, B, C}) = 12

C has a marginal contribution of 2

68
Shapley values
Adding C before B?
v({A}) = 6

v({A, C}) = 10

v({A, C, B}) = 12

C has a marginal contribution of 4, B of 2

To be fair to C in terms of contribution payout, we need to average their contribution in all


formations of the team

Contribution does not depend on the order of players that were added before
v({A, B, C}) − v({A, B}) = v({B, A, C}) − v({B, A})

Contribution is also independent from how remaining players are added afterwards

69
Shapley values
The Shapley value of a player j is now:

\phi_j = \sum_{S \subseteq \{x_1, \ldots, x_p\} \setminus \{x_j\}} \frac{|S|! \, (p - |S| - 1)!}{p!} \left[ v(S \cup \{x_j\}) - v(S) \right]

To know the contribution of player j, we need to know their contribution to every possible sub-team
Sum over all unordered subsets we can make with the other players

|S|!: all orderings of the players already in the sub-team S

(p − |S| − 1)!: all orderings of the remaining players (p is the number of players)

p!: normalizes over the total number of possible orderings of the team

For each sub-team, take the value with player j minus the value without

70
Shapley values
How to bring this to machine learning?

The players are the features with their values; the game that is played is the prediction of an instance
x = \{x_1, \ldots, x_p\} using a model \hat{f}

v(X_S) = \int \hat{f}(x_1, \ldots, x_p) \, dP_{x \notin S} - E_X[\hat{f}(X)]

Prediction of the model using only included feature values minus average prediction

71
Shapley values
Problem: lots of possible subsets of features. Also: prediction with only included features
means retraining, or imputation with median/mode?

Monte Carlo sampling with permutation based approximation

\hat{\phi}_j = \frac{1}{M} \sum_{m=1}^{M} \left( \hat{f}(x^m_{+j}) - \hat{f}(x^m_{-j}) \right)

To calculate the Shapley value for a feature j and an instance x, we sample M times and:

Draw a random instance z from the data


Construct two new instances: one with the value of x for feature j (x^m_{+j}), and one with the value of z (x^m_{-j})

The other features are randomly chosen (permuted) across x and z


We then let the model predict for both

72
Shapley values
R F M debt age income
F and x is the feature/instance under
x 5 2 1000 100 44 2500 observation, we permute over {R, Debt}
z 10 4 500 50 36 3000 versus {M, age, income} (randomly
chosen)

M
1
^ ^ m ^ m
R F M debt age income ϕ = ∑ (f (x ) − f (x ))
j +j −j
M
m m=1
x
+j
5 2 500 100 36 3000
m
x
−j
5 4 500 100 36 3000

73
Shapley values
Start from the default (average) prediction and assess the contribution of each feature value towards the outcome
Shapley values represent the contribution of a feature to the given output for a given instance, not how much
that output would change when removing the feature
Most commonly used instance explainability technique in industry

0.1 (base rate) + 0.4 (age=65) - 0.3 (sex=F) + 0.1 (bp=180) + 0.1 (bmi=40) = 0.4
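A minimal sketch with the shap package for a tree ensemble (regression, to keep the output single-valued); the dataset is illustrative, and the first print checks the additivity shown above (base value + contributions ≈ prediction):

```python
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

explainer = shap.TreeExplainer(model)      # fast Shapley values for tree ensembles
shap_values = explainer.shap_values(X)     # one row of feature contributions per instance

# Base value + contributions for one instance ≈ the model's prediction for that instance
print(explainer.expected_value + shap_values[0].sum(), model.predict(X.iloc[[0]]))

# Global view: feature importance and direction of impact across all instances
shap.summary_plot(shap_values, X)
```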

74
Shapley values

Explaining a single instance

75
Shapley values

Features, feature values, impact on model output, across all instances

76
Shapley values

Singling out one or two features and plotting their Shapley values across instances:
comparable to a partial dependence plot

Bring in second feature to look at interaction effects

77
Closing
More reading and packages:

Fantastic book on the topic: https://christophm.github.io/interpretable-ml-book/


Good overview: https://github.com/jphall663/awesome-machine-learning-interpretability
scikit-learn : https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance.html (recent
versions also have support for partial dependence and ICE plots)
The pdp R package: https://cran.r-project.org/web/packages/pdp/pdp.pdf
The iml R Package: https://cran.r-project.org/web/packages/iml/index.html

Descriptive mAchine Learning EXplanations ( DALEX ) R Package: https://github.com/pbiecek/DALEX


eli5 for Python: https://eli5.readthedocs.io/en/latest/index.html

Skater for Python: https://github.com/datascienceinc/Skater

Or with https://github.com/parrt/random-forest-importances ( rfpimp package: recommended)


Or with https://github.com/ralphhaygood/sklearn-gbmi ( sklearn-gbmi )

pdpbox for Python: https://github.com/SauceCat/PDPbox


The vip R package: https://koalaverse.github.io/vip/index.html

Shapley Values: https://github.com/slundberg/shap


https://medium.com/@Zelros/a-brief-history-of-machine-learning-models-explainability-f1c3301be9dc
http://blog.macuyiko.com/post/2019/discovering-interaction-effects-in-ensemble-models.html and https://blog.macuyiko.com/post/2021/revisiting-
discovery-of-interaction-effects.html
http://explained.ai/rf-importance/index.html

78
Or…
Graft, Reassemble, Answer delta, Neighbour sensitivity, Training delta (GRANT) -
https://github.com/wagtaillabs/GRANT

79
Or…

80
