
Interpretability and Algorithmic Fairness

Christophe Pérignon (Instructor)


Professor of Finance, perignon@hec.fr

Sébastien Saurin (Teaching assistant)

Ph.D. student, seb.saurin@hotmail.fr

Master DSB
October 24 – November 10, 2023
1
Course Description

The goal of this course is to present the concepts of interpretability and algorithmic fairness, as well as the techniques to implement them:

(1) identify the variables that play the most important role in a given AI/machine‐learning model, or explain individual decisions;

(2) check whether a machine‐learning model treats different groups of individuals in a fair way.

These concepts and skills are important to master for data scientists,
model validators, AI entrepreneurs, executives, regulators, policymakers,
and consumer advocates.

Students will have to work on a project combining some of the techniques


presented in the course and make a final presentation.
1
Main Topics

• Definition of interpretability/explainability
• Natively interpretable models: OLS, Logistic, GAM, PLTR
• Global interpretability measures: Global surrogate, Partial
Dependence Plot (PDP), Accumulated Local Effect (ALE)
• Local interpretability measures: ICE, LIME, SHapley Additive
exPlanations (SHAP)
• Explaining Performance (Permutation Importance, XPER)
• Why an algorithm needs to be fair and definitions of fairness metrics
• Inference tests and fairness diagnosis
• Fairness interpretability
• Mitigation techniques
2
Learning Objectives

By the end of this course, you should:


• Know why it is important for an algorithm to be interpretable
• Know how to interpret it globally
• Know how to explain a decision affecting one instance
• Know why an algorithm needs to be fair
• Know how to detect an algorithm which is not fair
• Understand why a given algorithm is unfair
• Be able to make it fairer while maintaining a high level of performance

3
Course Structure

Sessions: Tuesday 24, Wednesday 25, Thursday 26, Friday 27, and Friday 10.
Mornings: lectures, with guest speakers GS2 and GS4; student presentations on Friday 10.
Afternoons: lectures, with guest speakers GS1 and GS3; student presentations.

Guest speakers (GS):


GS1: Vassilis Digalakis, Assistant Professor, HEC Paris
GS2: Florence d’Alché‐Buc, Professor and Chair Holder “Data Science and Artificial Intelligence for
Digitalized Industry and Services”, Telecom Paris
GS3: Jean‐Marie John‐Mathews, Co‐founder, Giskard
GS4: Edouard Tabary, Head of Data Science, BNPP PF Scoring Center
4
Evaluation

• Attendance is compulsory. Students must send an email to the


instructor if they cannot attend a lecture.
• Quiz (15%)
• Class participation (15%)
• Group project (70%). NB: Grades may vary among students within a
given group.

5
Recommended Readings

Barocas S., Hardt M., Narayanan A. (2023) Fairness and Machine Learning:


Limitations and Opportunities https://fairmlbook.org/
Molnar C. (2023) Interpretable Machine Learning
https://christophm.github.io/interpretable‐ml‐book/

Hué S., Hurlin C., Pérignon C., Saurin S. (2023) Measuring the Driving Forces of
Predictive Performance: Application to Credit Scoring
https://ssrn.com/abstract=4280563

Hurlin, C., Pérignon C., Saurin S. (2022) The Fairness of Credit Scoring Models
https://ssrn.com/abstract=3785882

Krishna S., Han T., Gu A., Pombra J., Jabbari S., Wu S. and Lakkaraju H. (2022)
The Disagreement Problem in Explainable Machine Learning: A Practitioner’s
Perspective https://arxiv.org/abs/2202.01602

Artificial Intelligence: The Robots Are Now Hiring, Wall Street Journal
https://www.youtube.com/watch?v=8QEK7B9GUhM
6
7
Interpretability and Algorithmic Fairness

1. Introduction
“Life‐changing algorithms”

Condition access to:

• Credit

Credit scoring models used to decide on new


loan applications
Classification algorithm: High‐risk borrowers
(reject) vs. Low‐risk borrowers (accept)
7
“Life‐changing algorithms”

Condition access to:

• Credit

Variables → Forecast (score) → Decision vs. Outcome
8
“Life‐changing algorithms”

Condition access to:

• Credit

• Work

• Education

• Love

• Freedom

12
The use of AI by government agencies

Source: Government by Algorithm: Artificial Intelligence in Federal Administrative Agencies


(https://www‐cdn.law.stanford.edu/wp‐content/uploads/2020/02/ACUS‐AI‐Report.pdf)

13
Automated claim management for insurance

14
Why is AI so popular in these business
and government applications?

• Allows processing massive quantities of data (including new data)


 including digital footprint, new forms of data (open data, video, payment data, etc)

• Speed up and lower the cost of application processing (scalable)


 Including small firms or unbanked households traditionally overlooked by standard screening

• Use powerful algorithms, hence improving the classification of applications between good and bad types
 Reduction in Type‐1 and Type‐2 errors. Impact on P&L

 This quest for performance leads companies to develop models that are increasingly complex,
and less and less transparent and interpretable. “Algorithm Darwinism”

15
Interpretability

16
Important distinctions

We distinguish between:
• Models that are intrinsically interpretable (white box or glass box)
• Models whose structures do not permit easy interpretation (black box)

We distinguish between:
• Understanding a model (global interpretability)
For instance, in this Artificial Neural Network, the variables with the strongest impact on the estimated
house price are the lot size and the presence of a swimming pool.

• Understanding a particular decision for a given instance (local


interpretability)
For instance, why was Mrs Smith’s consumer loan application rejected by an algorithmic lender?

19
The ML Journey

Problem / Question → Data (collection, cleaning) → Modeling (list of models, variable selection, hyperparameters, training set) → Validation (test set, stability, interpretability, fairness, performance) → Production (value creation)
20
Hot debate about interpretability

Source: https://www.youtube.com/watch?v=GtCFprO5p7k
21
Hot debate about interpretability (2)

22
Hot debate about interpretability (3)

23
When we need to interpret (and when we don’t)

ML models produce predictions, but they do not explain their predictions to users

• In some cases, this is not important

e.g. a search engine predicts that a document is what a user wants; the cost to users of a mistake is low

• In many other situations, understanding how predictions are made is


desirable because mistakes are costly and confidence in the model is key
e.g. loan applications, estimated house price given by real‐estate agent, match between organ
donors/recipients, job applications, releasing a prisoner, applying to a university, tax fraud detection,
investigation about child abuse, allocation of ventilators during covid, illness detection.

“The algorithm rejected you. Sorry I do not have any additional information.”
is not going to be an acceptable answer…

24
When we need to interpret (2)

• Requested by regulation and Law


e.g. GDPR’s Chapter 11: Right to explanation; French Code for Public Health (Article L4001‐3): algorithm
developers […] must make sure the functioning of the algorithm can be explained to users (Law #2021‐
1017, August 2, 2021)

• Does the model produce biased decisions? If challenged in court, can the company using the model prove that it does not?
e.g. Is the model racist or misogynistic?

• Improving the model


e.g. Image recognition software that distinguishes between polar bears and dogs. If you can show that
the model makes decisions by looking at the background (ice vs. grass) rather than looking at the shape
of the animals, there is a problem.

25
Research on interpretability

Figure: total number of accepted papers (scale from 400 to 2,000).
26
When we need to interpret (3)

AI/ML is highly criticized for failing to deliver on its promise:

• It is very costly and the ROI is unclear


• there are many POCs but few algorithms are actually put in production
• It is very opaque
• It often creates tensions between business experts and data scientists

 Interpretability can make AI/ML more P&L centric

 Interpretability increases transparency and can reduce tensions

27
Quality AI

October 6, 2022: Memorandum of Cooperation (MoC) between VDE and Confiance.ia, a consortium of
French industrial companies and research centers.
* VCIO = Values Criteria Indicators Observables
28
Whom to explain?

Example from the Banking industry

• Developers, i.e. those developing or implementing a ML application.


• 1st line model checkers, i.e., those directly responsible for making sure model
development is of sufficient quality.
• Management responsible for the application.
• 2nd line model checkers, i.e., control function, they independently check the
quality of model development and deployment.
• Regulators that aim to protect consumers and financial stability.
• Clients
29
Whom to explain? (2)

30
31
, 2. Interpreting White box

In this section, we introduce five white-box models:

1 Ordinary Least Squares (OLS)

2 Logistic Regression (LR)

3 Decision Tree (DT)

4 Generalized Additive Models (GAM)

5 Penalised Logistic Tree Regression (PLTR)

32
, 2.1 OLS

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$

The marginal effect of a feature $X_j$ on $Y$ is equal to:

$\frac{\partial Y}{\partial X_j} = \beta_j$

$E(Y) = \beta_0 + \beta_1 E(X_1) + \beta_2 E(X_2) + \dots + \beta_k E(X_k)$

$Y - E(Y) = \beta_1 (X_1 - E(X_1)) + \beta_2 (X_2 - E(X_2)) + \dots + \beta_k (X_k - E(X_k)) + \varepsilon$

33
, 2.1 OLS

Example
”The model predicts that the price of the house of interest (with the feature values in the second column of
Table 9.1) is $232,349. The value of an “average house” with the feature values in the third column of Table
9.1 is $180,817. This house is therefore worth $51,532 more than the average house.”

Source: Hull (2020)

34
, 2.2 Logistic Regression

$P(Y = 1 \mid X = x) = \Lambda(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)$

with $\Lambda(.)$ the cumulative distribution function (cdf) of the logistic distribution, such that:

$P(Y = 1 \mid X = x) = \frac{1}{1 + \exp\big(-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)\big)}$

$P(Y = 0 \mid X = x) = \frac{\exp\big(-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)\big)}{1 + \exp\big(-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)\big)}$

The marginal effect of a feature $X_j$ on $P(Y = 1 \mid X = x)$ is equal to:

$\frac{\partial P(Y = 1 \mid X = x)}{\partial X_j} = \frac{\exp\big(-(\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k)\big)}{\big[1 + \exp\big(-(\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k)\big)\big]^2} \, \beta_j$

35
, 2.3 Decision Tree

Figure: Decision Tree for classification

36
, 2.4 Generalized Additive Models (GAM)

Definition (Generalized Additive Models (GAM))


Generalized Additive Models (GAMs) provide an extension to standard linear models by
allowing the incorporation of non-linear functions for each variable while preserving the
principle of additivity.

Hastie, T. and Tibshirani, R. (1986), Generalized additive models. Statistical Science, Vol.1,
No. 3, 297-318.

37
, 2.4 Generalized Additive Models (GAM)

Definition (Generalized Additive Models (GAM))


In general, the conditional mean µ(X ) of a target variable Y is related to an additive
function of the features via a link function g :

g (µ(X )) = α + f1 (X1 ) + ... + fp (Xp )

where f1 (X1 ), ..., fp (Xp ) correspond to (smooth) non-linear functions of the features.

Examples of classical link functions are the following:

g (µ(X )) = µ for linear and additive models for Gaussian response data.

g (µ(X )) = logit (µ) or g (µ(X )) = probit (µ) for modeling binomial probabilities.

Examples of (smooth) functions: Splines, Natural Cubic Splines, Smoothing Splines

38
, 2.4 Generalized Additive Models (GAM)

In the regression setting, a generalized additive model has the form

E (Y |X1 , ..., Xp ) = α + f1 (X1 ) + ... + fp (Xp )

In the binary classification setting, the additive logistic regression has the form

$\log\!\left(\frac{\mu(X)}{1 - \mu(X)}\right) = \alpha + f_1(X_1) + \dots + f_p(X_p)$

39
, 2.4 Generalized Additive Models (GAM)

Regression

Figure: Example of GAM for a regression model

Source: An introduction to Statistical Learning: With Application in R (2021)

40
, 2.4 Generalized Additive Models (GAM)

Advantages
1 Automatically model non-linear relationships that standard linear regression will
miss.

2 The non-linear fits can potentially make more accurate predictions of the target.

3 Easily examine the marginal effect of each feature on the target.

Limits
1 Interactions can be missed as the model is restricted to be additive. However, as
with linear regression, we can manually add interaction terms to the GAM; doing
so makes the GAM subject to overfitting issues.

41
, 2.4 Generalized Additive Models (GAM)

Implementation
For the implementation, we can cite:

1 Python: package pyGAM, PiML, and statsmodels.

2 R: package GAM.
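Below is a minimal sketch of a GAM fit with the pyGAM package; the synthetic data, term indices, and smoothing choices are illustrative, not taken from the course material.

```python
import numpy as np
from pygam import LinearGAM, s

# Toy regression data with three numerical features (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=500)

# One smooth term s(j) per feature: g(mu) = alpha + f1(X1) + f2(X2) + f3(X3)
gam = LinearGAM(s(0) + s(1) + s(2)).fit(X, y)
gam.summary()  # per-term effective degrees of freedom and significance
```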

42
, 2.5 Penalised Logistic Tree Regression (PLTR)

Definition (Penalised Logistic Tree Regression (PLTR))


The Penalised Logistic Tree Regression (PLTR) is a high-performance and interpretable
credit scoring method which uses information from decision trees to improve the
performance of logistic regression.

Remark: This method is not restricted to credit scoring applications and can be used for
other classification and regression problems (after some small adjustments).

Dumitrescu, E., Hué, S., Hurlin, C. and Tokpavi, S. (2022), Machine learning for credit
scoring: Improving logistic regression with non-linear decision-tree effects, European Journal
of Operational Research, Vol. 297, Issue 3, 1178-1192.

43
, 2.5 Penalised Logistic Tree Regression

First Step

Figure: Splitting Process

Source: Machine learning for credit scoring: Improving logistic regression with non-linear decision-tree effects (2022)

44
, 2.5 Penalised Logistic Tree Regression

Second Step
In the second step, the endogenous univariate and bivariate threshold effects previously
obtained are plugged in the logistic regression

$P\big(y_i = 1 \mid V_{i,1}^{(j)}, V_{i,2}^{(j,k)}; \Theta\big) = \frac{1}{1 + \exp\big(-\eta(V_{i,1}^{(j)}, V_{i,2}^{(j,k)}; \Theta)\big)}$

with $\eta(V_{i,1}^{(j)}, V_{i,2}^{(j,k)}; \Theta) = \beta_0 + \sum_{j=1}^{p} \alpha_j x_j + \sum_{j=1}^{p} \beta_j V_{i,1}^{(j)} + \sum_{j=1}^{p-1} \sum_{k=j+1}^{p} \gamma_{j,k} V_{i,2}^{(j,k)}$.

The corresponding likelihood is

$L\big(V_{i,1}^{(j)}, V_{i,2}^{(j,k)}; \Theta\big) = \frac{1}{N} \sum_{i=1}^{N} \Big[ y_i \log F\big(\eta(V_{i,1}^{(j)}, V_{i,2}^{(j,k)}; \Theta)\big) + (1 - y_i) \log\big(1 - F(\eta(V_{i,1}^{(j)}, V_{i,2}^{(j,k)}; \Theta))\big) \Big].$

45
, 2.5 Penalised Logistic Tree Regression

Second Step
Finally, the adaptive lasso estimators are obtained as

$\hat\Theta_{alasso}(\lambda) = \underset{\Theta}{\operatorname{argmin}} \; \Big\{ -L\big(V_{i,1}^{(j)}, V_{i,2}^{(j,k)}; \Theta\big) + \lambda \sum_{v=1}^{V} w_v |\theta_v| \Big\}.$
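As a rough sketch of the two-step idea (not the authors' implementation), the snippet below extracts univariate and bivariate threshold dummies from shallow decision trees and then fits a plain L1-penalised logistic regression; the cross-validated lasso stands in for the adaptive lasso above, and all data and hyperparameters are illustrative.

```python
import numpy as np
from itertools import combinations
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] * X[:, 1] + X[:, 2] > 0).astype(int)

# Step 1: threshold effects extracted from shallow trees
leaves = []
for j in range(X.shape[1]):                      # univariate effects (depth-1 tree)
    t = DecisionTreeClassifier(max_depth=1).fit(X[:, [j]], y)
    leaves.append(t.apply(X[:, [j]]).reshape(-1, 1))
for j, k in combinations(range(X.shape[1]), 2):  # bivariate effects (depth-2 tree)
    t = DecisionTreeClassifier(max_depth=2).fit(X[:, [j, k]], y)
    leaves.append(t.apply(X[:, [j, k]]).reshape(-1, 1))
V = OneHotEncoder(drop="first").fit_transform(np.hstack(leaves)).toarray()

# Step 2: penalised (L1) logistic regression on original features + tree dummies
Z = np.hstack([X, V])
pltr = LogisticRegressionCV(penalty="l1", solver="liblinear", Cs=10, cv=5).fit(Z, y)
print("non-zero coefficients:", int((pltr.coef_ != 0).sum()))
```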

46
, 2.5 Penalised Logistic Tree Regression

Figure: PLTR interactions

Source: Machine learning for credit scoring: Improving logistic regression with non-linear decision-tree effects (2022)

47
, 2.5 Penalised Logistic Tree Regression

Figure: PLTR performances

Source: Machine learning for credit scoring: Improving logistic regression with non-linear decision-tree effects (2022)

48
49
, 3. Interpreting Black-Box - Global

In this section, we introduce three model-agnostic interpretation methods:

1 Global Surrogate

2 Partial Dependence Plot (PDP)

3 Accumulated Local Effects (ALE)

50
51
Consider a given opaque machine learning model that produces predictions ŷ .

52
Decision Tree used as a surrogate model: Bike rental example

Source: Molnar (2023)

53
, 3.1 Global Surrogate

Advantages
1 Simple and intuitive

2 Can accommodate any black box model

Limits
1 Estimation risk: discrepancy between the original black-box predictions ŷ and those of the
surrogate model (e.g., surrogate R-squared = 0.7).
2 Dangerous: Can give the illusion of interpretability.
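A minimal sketch of the global-surrogate recipe follows, with synthetic data and illustrative model choices: an interpretable tree is trained on the black-box predictions (not on y), and its fidelity is measured by the R-squared with respect to those predictions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=1000)

black_box = GradientBoostingRegressor().fit(X, y)   # opaque model
y_hat = black_box.predict(X)                        # its predictions

# Train an interpretable model on the black-box predictions, not on y
surrogate = DecisionTreeRegressor(max_depth=3).fit(X, y_hat)

# Fidelity of the surrogate: how well it reproduces the black-box output
print("surrogate R-squared w.r.t. black-box predictions:",
      round(r2_score(y_hat, surrogate.predict(X)), 3))
```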

54
55
, 3.2 PDP

Definition (Partial Dependence Plot )


The Partial Dependence Plot (PDP) shows the marginal effect one feature has on the
predicted outcome of a machine learning model (Friedman 2001).

Friedman, J.H. (2001), Greedy function approximation: A gradient boosting machine.


Annals of statistics, 1189-1232.

56
, 3.2 PDP

Notations
Denote by $X_s$ the features for which the partial dependence function should be plotted and $\Omega_s$ the universe of its realizations.

Denote by $x_s$ a realization of the random variable $X_s$.

Denote by $X_c$ the other features used in the machine learning model $\hat{f}$.

57
, 3.2 PDP

Definition (Partial Dependence function)


The Partial Dependence function $pd_s(.)$ is the expectation of the model output $\hat{f}(.)$ over the marginal distribution of all variables other than $X_s$:

$pd_s(x_s) = \mathbb{E}_{X_c}\big[\hat{f}(x_s, X_c)\big] \quad \forall x_s \in \Omega_s$

Remark: the PDP is different from the conditional expectation $\mathbb{E}_{X_c \mid X_s}\big(\hat{f}(x_s, X_c)\big)$, where the expectation is taken over the conditional distribution of $X_c$ given $X_s = x_s$.

58
, 3.2 PDP

Example (PDP in a linear model)


Consider the linear model

$Y = \hat{f}(X) + \varepsilon = \hat\beta_0 + \hat\beta_1 X_1 + \dots + \hat\beta_p X_p + \varepsilon$

The PD function associated to feature $X_1$ is defined as:

$pd_1(x_1) = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 \mathbb{E}(X_2) + \dots + \hat\beta_p \mathbb{E}(X_p)$

59
, 3.2 PDP

Example (PDP in a logit model)


Consider the logit model

$\Pr(Y = 1 \mid X) = \hat{f}(X) = \Lambda\big(\hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 X_2\big)$

where $\Lambda(z) = \exp(z)/(1 + \exp(z))$ is the cdf of the logistic distribution. The PD function associated to feature $X_1$ is defined as:

$pd_1(x_1) = \mathbb{E}_{X_2}\Big[\Lambda\big(\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 X_2\big)\Big]$
60
, 3.2 PDP

Definition (PD estimate)


In practice, the PD function is simply estimated by averaging over the dataset as

$\widehat{pd}_s(x_s) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}\big(x_s, x_c^{(i)}\big) \quad \forall x_s \in \Omega_s$

i.e., predictions are computed on the dataset in which the column of $X_s$ is replaced by the value $x_s$ for every instance:

$\big(x_s, x_c^{(1)}\big), \big(x_s, x_c^{(2)}\big), \dots, \big(x_s, x_c^{(n)}\big)$

Remark 1: xs can take any values including some which are not in the dataset, e.g., for
age xs can take any value between min (age ) and max (age ).
Remark 2: The PDP shows whether the relationship between the target and a feature is
linear, monotonic or more complex.

61
, 3.2 PDP

Pseudo algorithm for PDP


1 Select feature $X_s$ among all features.

2 Define a grid of values $x_s \in \{c_1, \dots, c_k\} \subset \Omega_s$.

3 For each $x_s$:
 1 Replace all realisations of $X_s$ by $x_s$
 2 Compute $\hat{f}(x_s, x_c)$ for each instance.
 3 Average predictions across instances.

4 Draw the curve $\widehat{pd}_s(x_s)$.
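The pseudo-algorithm above can be coded directly; the sketch below is model-agnostic and uses synthetic data and an illustrative feature/grid choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 3))
y = 2 * X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.05, size=500)
model = RandomForestRegressor(random_state=0).fit(X, y)

def partial_dependence(model, X, s, grid):
    pd_values = []
    for xs in grid:                                      # step 3: loop over grid values
        X_mod = X.copy()
        X_mod[:, s] = xs                                 # 3.1 replace all realisations of X_s by x_s
        pd_values.append(model.predict(X_mod).mean())    # 3.2-3.3 predict and average
    return np.array(pd_values)

grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 20)     # step 2: grid of values
pd_curve = partial_dependence(model, X, s=0, grid=grid)  # step 4: curve to plot
```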

62
, 3.2 PDP

Continuous feature

Example (PDP in a regression model)


Zhou and Hastie (2019) consider two predictive models for housing price: random forest
and gradient boosting machine. PDPs suggest that the housing price is insensitive to air
quality until it reaches certain pollution level around 0.65.

63
, 3.2 PDP

Categorical feature

Source: Molnar (2023)


64
, 3.2 PDP

Classification model
When the target variable Y is categorical, the PD function displays the average marginal effect of the feature s on the conditional probability $\Pr\big(Y = 1 \mid x_s, x_c^{(i)}\big)$ as

$pd_s(x_s) = \frac{1}{n} \sum_{i=1}^{n} \Pr\big(Y = 1 \mid x_s, x_c^{(i)}\big) \quad \forall x_s \in \Omega_s$

65
, 3.2 PDP

Classification model

Source: Molnar (2023)


66
, 3.2 PDP

Advantages
1 Interpretation is clear: The PDP shows the marginal effect of a given feature Xs
on the average prediction.

2 Easy to implement: The PDP does not require re-estimating the model.

67
, 3.2 PDP

Limits
1 Maximum number of features: the PDP analysis is limited to 2 features.
2 Heterogeneous effects across instances might be hidden because PDP only show
the average marginal effects.
▶ Solution: Individual Conditional Expectation (ICE) curves.

3 Assumption of independence is the biggest issue with PDP.


▶ When the features are correlated, we create new data points in areas of the feature
distribution where the actual probability is very low (for example someone who is 2
meters tall but weighs less than 50 kg).
▶ Solution: Accumulated Local Effects (ALE) curves.

68
Implementation
For the implementation, we can cite:

1 Python: package scikit-learn: function plot_partial_dependence().

2 R: package iml: function FeatureEffect().

69
70
, 3.3 ALE

Definition (Accumulated Local Effects (ALE) )


Accumulated Local Effects (ALE) plots describe how features influence the prediction of
a ML model on average, while taking into account the dependence between the features.

Apley, D.W. (2016), Visualizing the effects of predictor variables in black box supervised
learning models. arXiv preprint arXiv:1612.08468.

71
, 3.3 ALE

Figure: Marginal Distribution

73
, 3.3 ALE

Independence assumption
To calculate the feature effect of x1 at a given value, say 0.75, the PDP replaces x1
of all instances with 0.75.

It means that we use the marginal distribution of X2 .

This results in unlikely combinations of x1 and x2 (e.g. x2 = 0.2 at x1 = 0.75),


which the PDP uses for the calculation of the average effect.

The figure displays two correlated features and illustrates the fact that the PDP averages
predictions over unlikely instances.

74
, 3.3 ALE

Conditional expectation
We could average over the conditional distribution of the feature, meaning at a
grid value of x1 , we average the predictions of instances with a similar x1 value.

The solution for calculating feature effects using the conditional distribution is called
Marginal Plots, or M-Plots

M-plots are obtained like the PDP but using the conditional distribution:

$\mathbb{E}_{X_c \mid X_s}\big(\hat{f}(x_s, X_c)\big) \qquad (1)$

75
, 3.3 ALE

Figure: Conditional Distribution

76
, 3.3 ALE

Limits of the M-plot

Example
Consider a model which predicts the value of a house depending on the number of rooms
and the size of the living area.
If we average the predictions of all houses of about 80 m2 , we estimate the combined
effect of living area and number of rooms, because of their correlation.
♢ Suppose that the living area has no effect on the predicted value of a house, only the
number of rooms has.
♢ The M-Plot would still show that the size of the living area increases the predicted
value, since the number of rooms increases with the living area.

77
, 3.3 ALE

Accumulated Local Effect


ALE solves the combined-effect problem by calculating differences in predictions instead
of averages, yet using conditional distributions.
For the calculation of ALE for feature x1 , which is correlated with x2 :

1 We divide the feature space of x1 into several intervals.

2 For each data point in a given interval, we calculate the difference in predictions
when replacing the feature value by, respectively, the upper and lower limit of the
interval.

3 In each interval, we compute the average difference in predictions.

4 Average differences are then accumulated.
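A rough implementation of these four steps (uncentered, first-order ALE with quantile-based intervals) is sketched below; the data and model are synthetic and only meant to illustrate the procedure.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def ale_uncentered(model, X, s, n_bins=10):
    """Rough first-order, uncentered ALE for feature s."""
    edges = np.quantile(X[:, s], np.linspace(0, 1, n_bins + 1))  # 1. intervals
    ale = [0.0]
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (X[:, s] >= lo) & (X[:, s] <= hi)
        if not mask.any():
            ale.append(ale[-1])
            continue
        X_lo, X_hi = X[mask].copy(), X[mask].copy()
        X_lo[:, s], X_hi[:, s] = lo, hi
        # 2. difference in predictions at the upper vs. lower limit of the interval
        diffs = model.predict(X_hi) - model.predict(X_lo)
        # 3. average difference within the interval; 4. accumulate across intervals
        ale.append(ale[-1] + diffs.mean())
    return edges, np.array(ale)

# Toy usage on two correlated features (only x2 truly matters)
rng = np.random.default_rng(0)
x1 = rng.uniform(size=1000)
x2 = x1 + rng.normal(scale=0.1, size=1000)
X = np.column_stack([x1, x2])
y = x2 + rng.normal(scale=0.05, size=1000)
model = GradientBoostingRegressor().fit(X, y)
edges, ale = ale_uncentered(model, X, s=0, n_bins=10)
```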

78
, 3.3 ALE

Figure: Intuition of the ALE

Source: Molnar (2023)

79
, 3.3 ALE

Example
For the calculation of ALE for the living area, which is correlated with the number of
rooms:

1 We divide the feature space of the living area into several intervals.

2 For each house between 79 and 81 m2 , we calculate the difference in estimated price
when replacing the living area value by 81 and then by 79.

3 For this interval, we compute the average difference in estimated price.

4 We do the same in all other intervals: (77-79), (75-77), ...

5 Average differences are then accumulated.

80
Definition (Accumulated Local Effect)
The uncentered ALE averages the changes in the predictions and accumulates them over an interval:

$ALE_s(x_s) = \int_{z_{0,s}}^{x_s} \mathbb{E}_{X_c \mid X_s}\!\left[\frac{\partial \hat{f}(X_s, X_c)}{\partial X_s} \,\middle|\, X_s = z_s\right] dz_s \quad \forall x_s \in \Omega_s$

where $z_{0,s}$ is the minimum value of $X_s$ for which the ALE curve is computed.

81
, 3.3 ALE

Definition (centered ALE)


The centered ALE is defined as

ALEscent (xs ) = ALEs (xs ) − EXs (ALEs (Xs ))

with k the number of values taken by Xs .

82
, 3.3 ALE

Figure: Example with 5 intervals

Source: Molnar (2023)

83
, 3.3 ALE
Figure: Example of centered ALE for numerical features and a regression model

Source: Molnar (2023)


84
, 3.3 ALE

Advantages
1 No independence assumption. ALE plots still work when features are correlated.

2 Faster to compute than PDP. For PDP, we need to compute n × k predictions and
compute k averages. For ALE, we need to compute 2 × n predictions and compute
k averages.

85
, 3.3 ALE

Limits
1 No solution for setting the number of intervals
▶ If the number of intervals is too low, ALE plots are inaccurate because they only partially
account for the dependence across features.
▶ If the number of intervals is too high, we end up with too few instances per interval.
Empirical expectations do not converge towards the theoretical ones and ALE plots
can become a bit shaky.

86
, 3.3 ALE

Implementation
For the implementation, we can cite among many others.

1 Python: PyALE library

2 R: package IML, ALEPlot

87
88
89
, 4.1 ICE

Individual Conditional Expectation (ICE) plots display one curve per instance that shows
how the instance’s prediction changes when a feature changes.

Definition (Individual Conditional Expectation)


For each instance $\big(x_s^{(i)}, x_c^{(i)}\big)$, the ICE associated to the feature $X_s$ corresponds to:

$ICE_{s,i}(x_s) = \hat{f}\big(x_s, x_c^{(i)}\big) \quad \forall x_s \in \Omega_s$

Goldstein, A., et al. (2015), Peeking inside the black box: Visualizing statistical learning
with plots of individual conditional expectation. Journal of Computational and Graphical
Statistics 24.1 (2015): 44-65.

90
, 4.1 ICE

Figure: Example of ICE for a regression model and 3 numerical features

Source: Molnar (2023)

91
, 4.1 ICE

Centered ICE
It can be hard to tell whether ICE curves differ across individuals because they start
at different levels.

A simple solution is to center the curves and only display the difference with respect
to the reference point.

Definition (Centered ICE)


The centered ICE is defined as

$ICE_{s,i}^{cent}(x_s) = \hat{f}\big(x_s, x_c^{(i)}\big) - \hat{f}\big(x_a, x_c^{(i)}\big) \quad \forall x_s \in \Omega_s$

where $x_a$ is the anchor point.

92
, 4.1 ICE

Centered ICE
In general, the anchor point is defined as:

$x_a = \min_{x_s \in \Omega_s} x_s$

Thus, we have:

$ICE_{s,i}^{cent}(x_s) = \begin{cases} 0 & \text{if } x_s = x_a \\ \hat{f}\big(x_s, x_c^{(i)}\big) - \hat{f}\big(x_a, x_c^{(i)}\big) & \text{if } x_s > x_a \end{cases}$

All the instances have the same (null) ICE for the value $x_a$, i.e.

$ICE_{s,i}^{cent}(x_a) = 0, \quad \forall i = 1, \dots, n$

93
, 4.1 ICE

Figure: Example of centered ICE

Source: Molnar (2023)

94
, 4.1 ICE

Advantages
1 Intuitive: ICE curves are even more intuitive to understand than PDP.

2 Heterogeneous effects: unlike PDP, ICE curves can uncover heterogeneous


relationships.

Limits
1 Maximum number of features: ICE curves can only display one feature at a time.

2 Assumption of independence: ICE curves suffer from the same problem as PDP: If
the feature of interest is correlated with the other features, then some points in the
curve might be unlikely data points according to the joint feature distribution.

3 Readability: If many ICE curves are drawn, the plot can become overcrowded.

95
, 4.1 ICE

Implementation
For the implementation, we can cite among many others.

1 Python: package scikit-learn: function plot_partial_dependence().

2 R: packages iml, ICEbox, and pdp.
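In recent scikit-learn versions, ICE curves are exposed through PartialDependenceDisplay (the successor of plot_partial_dependence); the sketch below uses synthetic data and an illustrative feature index.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 3))
y = np.sin(4 * X[:, 0]) * X[:, 1] + rng.normal(scale=0.05, size=300)
model = RandomForestRegressor(random_state=0).fit(X, y)

# kind="individual" draws one ICE curve per instance; kind="both" overlays the PDP
PartialDependenceDisplay.from_estimator(model, X, features=[0], kind="both")
plt.show()
```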

96
97
, 4.2 LIME

Definition
LIME explains the prediction of any model by learning an interpretable model locally around the prediction.

Ribeiro, M.T., et al. (2016), ”Why Should I Trust You?” Explaining the Predictions of
Any Classifier. Proceedings of the 22nd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining.

98
, 4.2 LIME

Intuition

Black-box model’s complex


decision function f (blue vs.
pink)

One instance being explained


(solid red cross)

Local surrogate model (dashed


line)

99
, 4.2 LIME

Main ingredients of LIME

Local surrogate model


▶ With creation of synthetic instances

Proximity measure between instances

Simplification of the interpretation


▶ Transformation of the original features, e.g., continuous features to binary features.
▶ Complexity measure, e.g., limit the number of features used in the surrogate model.

101
, 4.2 LIME

Transformation of the original features

Distinction between features and interpretable data representations

▶ For image classification, only consider groups of pixels (a super-pixel).

▶ For text classification, only consider groups of words.

▶ For tabular data, continuous variables are discretized. For categorical variables,
categories can be combined.

103
, 4.2 LIME

Transformation of the original features

Notations

▶ Denote by $x \in \mathbb{R}^d$ the vector of the original representation (original features) of an instance being explained.

▶ Denote by $x' \in \{0, 1\}^{d'}$ the binary vector of the interpretable representation (transformed features).

104
, 4.2 LIME

Fidelity-Interpretability Trade-off

We denote by $f : \mathbb{R}^d \to \mathbb{R}$ the model being explained.

▶ In classification, f(x) is the probability that x belongs to a certain class.

We define an explanation as a model $g : \{0, 1\}^{d'} \to \mathbb{R}$, with $g \in G$, where G is a class of potentially interpretable models.
▶ e.g., linear models, decision trees.

105
, 4.2 LIME

Fidelity-Interpretability Trade-off

Let Ω(g) be a measure of complexity (as opposed to interpretability) of the explanation g ∈ G.

▶ e.g., depth of the tree for decision trees, number of non-zero weights for linear models.

Let πx(z) be a proximity measure between an instance z and x, so as to define locality around x.
▶ e.g., exponential kernel.

Let L(f, g, πx) be a measure of how unfaithful g is in approximating f in the locality defined by πx.

106
, 4.2 LIME

Fidelity-Interpretability Trade-off

The explanation produced by LIME is obtained by the following:

$\xi(x) = \underset{g \in G}{\operatorname{argmin}} \; L(f, g, \pi_x) + \Omega(g) \qquad (2)$

Remark: This formulation can be used with different explanation families G , fidelity
functions L, and complexity measures Ω.

107
, 4.2 LIME

Sampling for Local Exploration

108
, 4.2 LIME

Sparse Linear Explanations using LIME

Let G be the class of linear models, such that $g(z') = w_g \cdot z' = \sum_{j=1}^{d'} w_{g,j} z'_j$

Let $L(f, g, \pi_x)$ be the locally weighted square loss defined as:

$L(f, g, \pi_x) = \sum_{z, z' \in Z} \pi_x(z) \big(f(z) - g(z')\big)^2$

where $\pi_x(z) = \exp\!\big(-D(x, z)^2/\sigma^2\big)$ is an exponential kernel defined on some distance function D.

109
, 4.2 LIME

Sparse Linear Explanations using LIME

Set a limit K on the number of interpretable representations $z'$ taken into account in g as follows:

$\Omega(g) = \infty \times \mathbb{1}\big[\|w_g\|_0 > K\big] = \begin{cases} \infty & \text{if } \|w_g\|_0 > K \\ 0 & \text{otherwise} \end{cases}$

with $\|w_g\|_0$ the number of nonzero elements in $w_g$.

110
, 4.2 LIME

Figure: LIME for tabular data: Titanic data, probability of survival


of an 8-year-old boy who travelled in the first class

111
, 4.2 LIME

Sparse Linear Explanations using LIME

Figure: Example of error explanation

112
, 4.2 LIME

Advantages
1 Human-friendly explanation: The resulting explanations are short (= selective) and
possibly contrastive.

2 Features type: Works for tabular data, text and images.

3 Fidelity measure: Good idea of how reliable the interpretable model is in explaining
the black box predictions in the neighborhood of the data instance of interest.

Limits
1 Neighborhood definition: Sensitive results to the kernel values.

2 Sampling: Creation of unrealistic instances.

3 Complexity definition: It has to be defined in advance and the choice of K is left to the user.

113
, 4.2 LIME

Implementation
For the implementation, we can cite among many others.

1 Python: lime library

2 R: package lime, localModel, iml
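A minimal usage sketch with the Python lime library follows; the data, model, and parameter values are illustrative, not those of the course examples.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + X[:, 1] * X[:, 2] > 0).astype(int)
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X,
    feature_names=["x1", "x2", "x3", "x4"],
    class_names=["class 0", "class 1"],
    mode="classification",
)
# Explain one instance with at most K = 3 interpretable features
exp = explainer.explain_instance(X[0], model.predict_proba, num_features=3)
print(exp.as_list())  # (feature condition, local weight) pairs
```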

114
115
, 4.3 Shapley values

Definition (Shapley Values)


Shapley values indicate how to fairly distribute a payout among several players.

The Shapley value is a concept in cooperative game theory.

It was named in honor of Lloyd Shapley, who introduced it in 1953 and won the
Nobel Prize in Economics for it in 2012.

Shapley, Lloyd S. (1953). A value for n-person games. Contributions to the Theory of
Games 2.28, pp 307-317.

116
, 4.3 Shapley values

Intuition of Shapley Values


Three friends – Pierre, Eve, and Aminata – go out for dinner. They order and share fries,
wine, and pies. It is hard to figure out how much each friend should pay since they did
not eat an equal share. We have the following information:

If Pierre was eating alone, he would pay 25


If Eve was eating alone, she would pay 16
If Aminata was eating alone, she would pay 19
If Pierre and Eve were eating together, they would pay 50
If Pierre and Aminata were eating together, they would pay 56
If Eve and Aminata were eating together, they would pay 42
If Pierre, Eve, and Aminata were all eating together, they would pay 73

117
, 4.3 Shapley values

Intuition of Shapley Values

1 We take all permutations of the 3 participants in sequence.

2 We compute the incremental payout every time a new guest arrives.

Example
Consider the sequence (Pierre, Eve, Aminata). As described above, Pierre comes and
pays 25. Now, Pierre and Eve pay 50 so there is an additional payout for Eve of 25.
Finally, all three eat together and pay 73 so the additional payout for Aminata is 23.

118
Intuition of Shapley Values
We repeat the same exercise for each possible order for the 3 friends and get the
following marginal payout values:

(Pierre, Eve, Aminata) – (25, 25, 23) – 1/6


(Eve, Pierre, Aminata) – (16, 34, 23) – 1/6
(Eve, Aminata, Pierre) – (16, 26, 31) – 1/6
(Aminata, Pierre, Eve) – (19, 37, 17) – 1/6
(Aminata, Eve, Pierre) – (19, 23, 31) – 1/6
(Pierre, Aminata, Eve) – (25, 31, 17) – 1/6

The Shapley value of Pierre corresponds to his average marginal payouts, i.e., (25 + 34
+ 31 + 37 + 31 + 25)/6 = 30.5.

120
, 4.3 Shapley values

Intuition of Shapley Values


What is the Shapley value for, respectively, Pierre, Eve, and Aminata? It is just the
average of the marginal payout for each friend:

For Pierre, the Shapley value is (25 + 34 + 31 + 37 + 31 + 25)/6 = 30.5.

For Eve, the Shapley value is (25 + 16 + 16 + 17 + 23 + 17)/6 = 19.

For Aminata, the Shapley value is (23 + 23 + 26 + 19 + 19 + 31)/6 = 23.5.

The total is 30.5 + 19 + 23.5 = 73.

This is the final amount that each of them should pay if they go out together.
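The dinner example can be checked with a few lines of code using the permutation definition of the Shapley value; the coalition costs below are those given in the slides.

```python
from itertools import permutations

# Cost paid by each coalition of friends (P = Pierre, E = Eve, A = Aminata)
v = {frozenset(): 0, frozenset("P"): 25, frozenset("E"): 16, frozenset("A"): 19,
     frozenset("PE"): 50, frozenset("PA"): 56, frozenset("EA"): 42, frozenset("PEA"): 73}

players = "PEA"
shapley = {p: 0.0 for p in players}
orders = list(permutations(players))
for order in orders:
    coalition = set()
    for p in order:
        # marginal payout of p when joining the current coalition
        shapley[p] += v[frozenset(coalition | {p})] - v[frozenset(coalition)]
        coalition.add(p)
shapley = {p: total / len(orders) for p, total in shapley.items()}
print(shapley)  # {'P': 30.5, 'E': 19.0, 'A': 23.5}
```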

121
, 4.3 Shapley values

Shapley Values for ML Interpretability: SHAP (SHapley Additive exPlanations)

The game is the prediction task.

The payoff corresponds to the predicted value for a single instance.

The players are the features that ”collaborate to play the game” (i.e., to predict a
value).

The goal is to explain the difference between the actual prediction and the average
prediction.

Lundberg, S.M. and Lee, S.-I. (2017), A Unified Approach to Interpreting Model Predictions,
31st Conference on Neural Information Processing Systems.

122
, 4.3 Shapley values

Shapley Values for ML Interpretability

Example
Consider a ML model used to predict house prices. For a certain house, with a garage, a
private pool, and an area of 50 yards, the estimated price is €510,000 whereas the
average prediction for all houses is €500,000.

Question: What is the contribution of each feature to the difference between the
estimated price of this house and the average estimated price?

123
, 4.3 Shapley values

Shapley Values for Machine Learning Interpretability


Remark 1: Shapley values correspond to the contribution of each feature towards
pushing the prediction away from the expected value.
Remark 2: The sum of all Shapley values is equal to the difference between the
individual prediction ŷi and the average value of the model.

124
, 4.3 Shapley values

Figure: Example of Shapley values for a classification model

125
, 4.3 Shapley values

Figure: Example of Shapley values for a regression model

126
, 4.3 Shapley values

Notations

S is a subset of the features used in the model.

x is the vector of feature values of the instance to be explained.

p the number of features.

127
, 4.3 Shapley values

Definition (Shapley value )


The Shapley value of the feature $x_j$ is the weighted average contribution of $x_j$ across all possible subsets S, where S does not include $x_j$:

$\phi_j\big(\hat{f}\big) = \sum_{S \subseteq \{x_1,\dots,x_p\} \setminus \{x_j\}} \frac{1}{p \times C_{p-1}^{|S|}} \Big[ \hat{f}\big(S \cup \{x_j\}\big) - \hat{f}(S) \Big]$

128
, 4.3 Shapley values

How to compute the weights

From p − 1 features, we can form $2^{p-1}$ subsets (coalitions) of size $|S| = 0, \dots, p-1$:

$2^{p-1} = \sum_{|S|=0}^{p-1} C_{p-1}^{|S|}$

with $C_{p-1}^{|S|}$ the number of coalitions of size $|S|$ among p − 1 features.

The weight associated to each coalition is the inverse of

$p \times C_{p-1}^{|S|} = \frac{p!}{|S|! \, (p-1-|S|)!}$

129
Example
Consider the case with three features X1 , X2 , and X3 , with X1 the feature of interest.

S          |S|   C_{p-1}^{|S|}   p × C_{p-1}^{|S|}   1/(p × C_{p-1}^{|S|})
{∅}        0     1               3                   1/3
{x2}       1     2               6                   1/6
{x3}       1     2               6                   1/6
{x2, x3}   2     1               3                   1/3

130
, 4.3 Shapley values

How to compute $\hat{f}(X^S)$

Consider the following model $\hat{f}(X) = \hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 X_2 + \hat\beta_3 X_3$.

Consider $X^S = X_1$:

$\hat{f}(X_1) = \hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2\,? + \hat\beta_3\,?$

1 Re-estimate the model: $\tilde{f}(X) = \tilde\beta_0 + \tilde\beta_1 X_1$

2 Do not re-estimate the model: $\mathbb{E}_{X_2,X_3}\big(\hat{f}(X)\big) = \hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 \mathbb{E}(X_2) + \hat\beta_3 \mathbb{E}(X_3)$

131
Additional notations

We denote by:

{X} the set of all the features.

{X} \ $X_j$ the set containing all the features except $X_j$.

$X_S$ the vector of features in subset S.

$X_{\bar{S}}$ the vector of features not included in subset S, such that {X} \ $X_j$ = $\{X_S, X_{\bar{S}}\}$.
132
, 4.3 Shapley values

Definition (Shapley value )


The Shapley value of the feature $x_j$ is the weighted average contribution of $x_j$ across all possible subsets S, where S does not include $x_j$:

$\phi_j\big(\hat{f}\big) = \sum_{S \subseteq \{x_1,\dots,x_p\} \setminus \{x_j\}} \frac{1}{p \times C_{p-1}^{|S|}} \Big[ \hat{f}\big(S \cup \{x_j\}\big) - \hat{f}(S) \Big]$

$\phi_j\big(\hat{f}\big) = \sum_{S \subseteq \{x_1,\dots,x_p\} \setminus \{x_j\}} \frac{1}{p \times C_{p-1}^{|S|}} \Big[ \mathbb{E}_{X_{\bar{S}}}\big(\hat{f}(x_j, x_S, X_{\bar{S}})\big) - \mathbb{E}_{X_j, X_{\bar{S}}}\big(\hat{f}(X_j, x_S, X_{\bar{S}})\big) \Big]$
S ⊆{x1 ,...,xp }\{xj }

133
Example
Consider the linear case with three features X1, X2, and X3, with X1 the feature of interest.

S          1/(p × C_{p-1}^{|S|})   E_{X_S̄}( f̂(x1, x_S, X_S̄) ) − E_{X1, X_S̄}( f̂(X1, x_S, X_S̄) )
{∅}        1/3                      β̂1 x1 + β̂2 E(X2) + β̂3 E(X3) − ( β̂1 E(X1) + β̂2 E(X2) + β̂3 E(X3) )
{x2}       1/6                      β̂1 x1 + β̂2 x2 + β̂3 E(X3) − ( β̂1 E(X1) + β̂2 x2 + β̂3 E(X3) )
{x3}       1/6                      β̂1 x1 + β̂2 E(X2) + β̂3 x3 − ( β̂1 E(X1) + β̂2 E(X2) + β̂3 x3 )
{x2, x3}   1/3                      β̂1 x1 + β̂2 x2 + β̂3 x3 − ( β̂1 E(X1) + β̂2 x2 + β̂3 x3 )

134
Example
Consider the linear case with three features X1, X2, and X3, with X1 the feature of interest.

S          1/(p × C_{p-1}^{|S|})   E_{X_S̄}( f̂(x1, x_S, X_S̄) ) − E_{X1, X_S̄}( f̂(X1, x_S, X_S̄) )
{∅}        1/3                      β̂1 (x1 − E(X1))
{x2}       1/6                      β̂1 (x1 − E(X1))
{x3}       1/6                      β̂1 (x1 − E(X1))
{x2, x3}   1/3                      β̂1 (x1 − E(X1))

Hence,
φ1( f̂ ) = (1/3) β̂1 (x1 − E(X1)) + (1/6) β̂1 (x1 − E(X1)) + (1/6) β̂1 (x1 − E(X1)) + (1/3) β̂1 (x1 − E(X1))
φ1( f̂ ) = β̂1 (x1 − E(X1))

135
Example
More generally,

$\phi_j\big(\hat{f}\big) = \hat\beta_j (x_j - \mathbb{E}(X_j)) = \hat\beta_j x_j - \hat\beta_j \mathbb{E}(X_j)$

Thus,

$\sum_{j=1}^{3} \phi_j\big(\hat{f}\big) = \sum_{j=1}^{3} \hat\beta_j x_j - \sum_{j=1}^{3} \hat\beta_j \mathbb{E}(X_j) = \hat{y} - \mathbb{E}(\hat{y}) = \hat{f}(x) - \mathbb{E}\big(\hat{f}(x)\big)$

136
For any function $\hat{f}(x)$ with three features, we have:

$\phi_1\big(\hat{f}\big) = \tfrac{1}{3}\big[\mathbb{E}\hat{f}(x_1, X_2, X_3) - \mathbb{E}\hat{f}(X_1, X_2, X_3)\big] + \tfrac{1}{6}\big[\mathbb{E}\hat{f}(x_1, x_2, X_3) - \mathbb{E}\hat{f}(X_1, x_2, X_3)\big] + \tfrac{1}{6}\big[\mathbb{E}\hat{f}(x_1, X_2, x_3) - \mathbb{E}\hat{f}(X_1, X_2, x_3)\big] + \tfrac{1}{3}\big[\mathbb{E}\hat{f}(x_1, x_2, x_3) - \mathbb{E}\hat{f}(X_1, x_2, x_3)\big]$

$\phi_1\big(\hat{f}\big) = \tfrac{1}{3}\big[\hat{f}(x_1, x_2, x_3) - \mathbb{E}\hat{f}(X_1, X_2, X_3)\big] + W_1(x)$

with $W_1(x) = \tfrac{1}{6}\big[\mathbb{E}\hat{f}(x_1, x_2, X_3) - \mathbb{E}\hat{f}(X_1, x_2, X_3)\big] + \tfrac{1}{6}\big[\mathbb{E}\hat{f}(x_1, X_2, x_3) - \mathbb{E}\hat{f}(X_1, X_2, x_3)\big] + \tfrac{1}{3}\big[\mathbb{E}\hat{f}(x_1, X_2, X_3) - \mathbb{E}\hat{f}(X_1, x_2, x_3)\big]$

137
More generally:

$\phi_j\big(\hat{f}\big) = \tfrac{1}{3}\big[\hat{f}(x_1, x_2, x_3) - \mathbb{E}\hat{f}(X_1, X_2, X_3)\big] + W_j(x)$

Thus,

$\sum_{j=1}^{3} \phi_j\big(\hat{f}\big) = \sum_{j=1}^{3} \tfrac{1}{3}\big[\hat{f}(x_1, x_2, x_3) - \mathbb{E}\hat{f}(X_1, X_2, X_3)\big] + \sum_{j=1}^{3} W_j(x) = \hat{f}(x_1, x_2, x_3) - \mathbb{E}\hat{f}(X_1, X_2, X_3)$

because $\sum_{j=1}^{3} W_j(x) = 0$.

138
$W_1(x) = \tfrac{1}{6}\big[\mathbb{E}\hat{f}(x_1, x_2, X_3) - \mathbb{E}\hat{f}(X_1, x_2, X_3)\big] + \tfrac{1}{6}\big[\mathbb{E}\hat{f}(x_1, X_2, x_3) - \mathbb{E}\hat{f}(X_1, X_2, x_3)\big] + \tfrac{1}{3}\big[\mathbb{E}\hat{f}(x_1, X_2, X_3) - \mathbb{E}\hat{f}(X_1, x_2, x_3)\big]$

$W_2(x) = \tfrac{1}{6}\big[\mathbb{E}\hat{f}(x_1, x_2, X_3) - \mathbb{E}\hat{f}(x_1, X_2, X_3)\big] + \tfrac{1}{6}\big[\mathbb{E}\hat{f}(X_1, x_2, x_3) - \mathbb{E}\hat{f}(X_1, X_2, x_3)\big] + \tfrac{1}{3}\big[\mathbb{E}\hat{f}(X_1, x_2, X_3) - \mathbb{E}\hat{f}(x_1, X_2, x_3)\big]$

$W_3(x) = \tfrac{1}{6}\big[\mathbb{E}\hat{f}(x_1, X_2, x_3) - \mathbb{E}\hat{f}(x_1, X_2, X_3)\big] + \tfrac{1}{6}\big[\mathbb{E}\hat{f}(X_1, x_2, x_3) - \mathbb{E}\hat{f}(X_1, x_2, X_3)\big] + \tfrac{1}{3}\big[\mathbb{E}\hat{f}(X_1, X_2, x_3) - \mathbb{E}\hat{f}(x_1, x_2, X_3)\big]$

139
, 4.3 Shapley values

Properties
Efficiency: The feature contributions must add up to the difference between the prediction for x and the average prediction:

$\sum_{j=1}^{p} \phi_j\big(\hat{f}\big) = \hat{f}(x) - \mathbb{E}\big(\hat{f}(x)\big)$

Dummy: A feature j that does not change the predicted value – regardless of which subset of feature values it is added to – should have a Shapley value of 0:

$\hat{f}\big(S \cup \{x_j\}\big) = \hat{f}(S) \iff \phi_j\big(\hat{f}\big) = 0$

140
, 4.3 SHAP

Figure: SHAP force plot

Source: Molnar (2023)

141
, 4.3 SHAP

Definition (SHAP feature importance)


The SHAP feature importance is defined as:

$I_j = \sum_{i=1}^{n} \big|\phi_j^{(i)}\big|$

Note: The idea behind SHAP feature importance is simple: Features with large absolute
Shapley values are important.

142
, 4.3 SHAP

Figure: SHAP feature importance

Source: Molnar (2023)

143
, 4.3 SHAP

Definition (SHAP summary plot)


The SHAP summary plot combines feature importance with feature effects.
Each point on the summary plot is a Shapley value for a feature and an instance.
Position on the y-axis is determined by the feature and the x-axis by the Shapley value.
The color represents the value of the feature from low to high.

144
, 4.3 SHAP

Figure: SHAP summary plot

Source: Molnar (2023)

145
, 4.3 Shapley values

Advantage
1 Solid theory. The Shapley value is the only explanatory method with a solid theory
with axioms.

Limit
1 Computing time: In many real-world applications, only an approximate solution is
feasible. An exact computation of the Shapley value is computationally expensive
because there are $2^k$ possible subsets of the features.

146
, 4.3 SHAP

Implementation
For the implementation, we can cite:

1 Python: SHAP package, scikit learn (for tree).

2 SHAP is integrated into the tree boosting frameworks xgboost and LightGBM.

3 R: package shapper and fastshap packages.
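A minimal sketch with the shap package is shown below for a tree-based regression model; the data and model are synthetic and purely illustrative.

```python
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["x1", "x2", "x3"])
y = X["x1"] ** 2 + X["x2"] * X["x3"] + rng.normal(scale=0.1, size=500)
model = RandomForestRegressor(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)     # fast, exact Shapley values for trees
shap_values = explainer.shap_values(X)    # one value per feature and per instance

shap.summary_plot(shap_values, X)         # global view: importance + effects
# Local explanation (force plot) of the first instance
shap.force_plot(explainer.expected_value, shap_values[0, :], X.iloc[0, :],
                matplotlib=True)
```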

147
148
Figure: AUC = 0.78

What is the contribution of each feature to the AUC?

149
XPER allows us to interpret the predictive or economic performance of any
econometric or ML model (model-agnostic).

This method is based on:


▶ A Shapley Value decomposition (Shapley, 1953).
▶ A Performance Metric (PM).
▶ Predictions ŷ of a regression or classification model f (.).

150
, An intuitive primer on XPER

AUC ϕ0 ϕ1 ϕ2 ϕ3

Test sample 0.78 = 0.50 + 0.14 + 0.10 + 0.04

with ϕ0 a benchmark value, and ϕj the XPER contribution of feature xj to the AUC of
the model.

151
, An intuitive primer on XPER

Such performance decomposition can prove handy in several contexts

1 XPER can help to rationalize a potential heterogeneity in the predictive performance


of a model
AUC ϕ0 ϕ1 ϕ2 ϕ3

Subset A 0.65 = 0.50 + 0.01 + 0.10 + 0.04


Subset B 0.85 = 0.50 + 0.21 + 0.10 + 0.04

152
, An intuitive primer on XPER

Such performance decomposition can prove handy in several contexts

2 XPER can help to understand the origin of overfitting.


AUC ϕ0 ϕ1 ϕ2 ϕ3

Training 0.90 = 0.5 + 0.20 + 0.15 + 0.05


Test 0.78 = 0.5 + 0.08 + 0.15 + 0.05
Training - Test 0.12 = 0 + 0.12 + 0 + 0

153
, An intuitive primer on XPER

Such performance decomposition can prove handy in several contexts

3 XPER can be applied to any statistical performance metrics, but also to any
economic performance metrics.
n
P&L = ∑ (1 − ŷi )(1 − yi ) × profit + (1 − ŷi )yi × loss
i =1

where profit is the money made on any reimbursed loan (yi = 0) and loss is the
money lost on any defaulted loan (yi = 1). The P&L can be broken down as follows:

P&L ϕ0 ϕ1 ϕ2 ϕ3

$10,000 = $2,000 + $1,000 + $5,000 + $2,000

154
Framework and Performance Metrics

155
We consider a classification or a regression problem for which:

A dependent (target) variable denoted y takes values in Y . In case of classification


Y = {0, 1}, and in case of regression Y ⊂ R.

A q-vector x ∈ X refers to input (explanatory) features, with X ⊂ Rq .

We denote by f : x → ŷ a model, where ŷ ∈ Y is either a classification or regression


output, such as ŷ = f (x).

156
The econometric or machine learning model may be parametric or not, linear or not, an individual or an ensemble classifier, etc.

The model is estimated (parametric model) or trained (machine learning algorithm) once and for all on an estimation or training sample $\{x_j, y_j\}_{j=1}^{T}$.

The statistical performance of the model is evaluated on a test sample $S_n = \{x_i, y_i, \hat{f}(x_i)\}_{i=1}^{n}$ for n individuals.

157
Definition
A sample performance metric PMn ∈ Θ ⊆ R associated to the model fˆ(.) and a test
sample Sn is a scalar defined as:

PMn = G̃n (y1 , ..., yn ; fˆ(x1 ), ..., fˆ(xn )) = Gn (y; X),

where y = (y1 , ..., yn )T and X = (x1 , .., xn )T .

Examples:

Regression model: MSE, MAE, R-squared.


Classification model: AUC, Accuracy, Sensitivity, Specificity, Brier Score.
Economic criteria: Profit or cost function.

158
Assumption 1
The sample performance metric satisfies an additive property such that:

$G_n(\mathbf{y}; \mathbf{X}) = \frac{1}{n} \sum_{i=1}^{n} G(y_i; x_i; \hat\delta_n),$

where $G(y_i; x_i; \delta_n)$ denotes an individual contribution to the performance metric and $\hat\delta_n$ is a nuisance parameter which depends on the test sample $S_n$.

Example (Mean Squared Error (MSE))

$G_n(\mathbf{y}; \mathbf{X}) = \frac{1}{n} \sum_{i=1}^{n} G(y_i; x_i) = \frac{1}{n} \sum_{i=1}^{n} \big(y_i - \hat{f}(x_i)\big)^2,$

Assumption 2
The sample performance metric Gn (y; X; δ̂n ) converges to the population performance
metric Ey ,x (G (y ; x; δ0 )), where Ey ,x (.) refers to the expected value with respect to the
joint distribution of y and x, and δ0 = plim δ̂n .

159
Theoretical Decomposition

160
, Intuition
, Definition of XPER

Definition (XPER value)


The contribution of feature $x_j$ to the performance metric is:

$\phi_j = \sum_{S \subseteq P(\{x\} \setminus \{x_j\})} w_S \Big[ \mathbb{E}_{x_{\bar{S}}}\big(\mathbb{E}_{y, x_j, x_S}(G(y; x; \delta_0))\big) - \mathbb{E}_{x_j, x_{\bar{S}}}\big(\mathbb{E}_{y, x_S}(G(y; x; \delta_0))\big) \Big],$

with S a coalition, i.e., a subset of features excluding the feature of interest $x_j$, $|S|$ the number of features in the coalition, and $P(\{x\} \setminus \{x_j\})$ the partition of the set $\{x\} \setminus \{x_j\}$.

The XPER value ϕj associated to feature xj measures its weighted average marginal
contribution to the performance metric over all feature coalitions.

165
Axiom 1. (Efficiency)
The sum of the XPER values ϕj , ∀j = 1, ..., q satisfies:
$\underbrace{\mathbb{E}_{y,x}\big(G(y; x; \delta_0)\big)}_{\text{performance metric}} = \underbrace{\phi_0}_{\text{benchmark}} + \sum_{j=1}^{q} \underbrace{\phi_j}_{\text{XPER value}}$

$\phi_0 = \mathbb{E}_x\big(\mathbb{E}_y(G(y; x; \delta_0))\big)$

with ϕ0 the performance metric associated to a population where the target variable is
independent from all features considered in the model.

166
Definition (Individual XPER)
The individual XPER value ϕi,j associated to individual i is defined as:
$\phi_{i,j}(y_i; x_i) = \sum_{S \subseteq P(\{x\} \setminus \{x_j\})} w_S \Big[ \mathbb{E}_{x_{\bar{S}}}\big(G(y_i; x_i; \delta_0)\big) - \mathbb{E}_{x_j, x_{\bar{S}}}\big(G(y_i; x_i; \delta_0)\big) \Big].$

For a given realisation $(y_i, x_i)$, the corresponding individual contribution to the performance metric can be broken down into:

$G(y_i; x_i; \delta_0) = \phi_{i,0} + \sum_{j=1}^{q} \phi_{i,j},$

where $\phi_{i,j}$ is the realisation of $\phi_{i,j}(y_i; x_i)$ and $\phi_{i,0}$ is the realisation of $\phi_{i,0}(y_i) = \mathbb{E}_x\big(G(y_i; x; \delta_0)\big)$.

167
Empirical Application

168
, Database

Database of auto loans provided by an international bank:

Target variable yi :
▶ 1: Default
▶ 0: No default

7,440 consumer loans

10 features:
▶ 2 categorical features
▶ 8 continuous features

169
, Model

Table: XGBoost Performances

Sample Size (%) AUC Brier Score Accuracy BA Sensitivity Specificity


Training 70 0.8969 0.0958 86.98 72.43 48.18 96.69
Test 30 0.7521 0.1433 79.53 58.69 23.99 93.39

170
, XPER decomposition

Figure: XPER decomposition. Contribution (%) of each feature: Funding amount, Job tenure, Car price, Age, Loan duration, Owner, Married, Credit event, Monthly payment, Down payment.

171
, XPER decomposition

Figure: XPER decomposition and Permutation Importance (PI). Contribution (%) of the same ten features: (a) XPER values; (b) PI-based feature contributions.

172
, Permutation Importance

Definition (Permutation Importance)


The permutation importance corresponds to the average decrease of the model performance when the values of a single feature are randomly shuffled (the j-th column of X is replaced by a random permutation of its values, keeping the other columns fixed):

$\Delta PM_j = PM - \frac{1}{S} \sum_{s=1}^{S} PM(X_{j,s})$

with $X_{j,s}$ the s-th reshuffled vector of the values of $X_j$.
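A minimal sketch with scikit-learn's built-in permutation importance follows; the data, model, AUC scoring, and number of repeats are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + X[:, 1] * X[:, 2] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Average decrease in AUC when each feature is shuffled, over 10 repeats
result = permutation_importance(model, X_test, y_test,
                                scoring="roc_auc", n_repeats=10, random_state=0)
for j in result.importances_mean.argsort()[::-1]:
    print(f"feature {j}: {result.importances_mean[j]:.4f} "
          f"+/- {result.importances_std[j]:.4f}")
```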

173
, XPER decomposition

Figure: XPER vs. SHAP

Features shown: Funding amount, Job tenure, Vehicle price, Customer’s age, Estimated funding duration, Owner, Married, Default past 6 months, Monthly payment in pct of income, Downpayment. Two bars per feature: contribution (%) to the AUC (XPER) and SHAP-based contribution.

174
, Using XPER to boost model performance

Figure: Two-step procedure using XPER

175
, Using XPER to boost model performance

Table: Model performances

                     (1) Initial   (2) Clusters on XPER values   (3) Clusters on features
AUC                  0.752         0.912                         0.744
Brier score          0.143         0.080                         0.151
Accuracy             79.53         89.11                         79.53
Balanced Accuracy    58.69         79.74                         59.11
Sensitivity          23.99         64.13                         25.11
Specificity          93.39         95.35                         93.11

176
, Conclusion

We introduce a methodology designed to measure the feature contributions to


the performance of any regression or classification model.

Our methodology is theoretically grounded on Shapley values and is both


model-agnostic and performance metric-agnostic.

In a loan default forecasting application, XPER appears to be able to significantly


boost out-of-sample performance.

177
, XPER python package

Figure: Github link

178
7
Interpretability and Algorithmic Fairness

6. Disagreement in Explainable Machine


Learning
The Disagreement Problem in Explainable Machine Learning:
A Practitioner’s Perspective

Satyapriya Krishna1, Tessa Han1, Alex Gu2, Javin Pombra1,


Shahin Jabbari3, Zhiwei Steven Wu4, and Himabindu Lakkaraju1

1 Harvard University
2 Massachusetts Institute of Technology
3 Drexel University
4 Carnegie Mellon University

February 9, 2022

Interpretable Machine Learning in Healthcare in ICML 2022

179
7
Interpretability and Algorithmic Fairness

7. Fairness
7
Interpretability and Algorithmic Fairness

7.1 Definition and Measures


Unlike humans, algorithms are neutral

• Humans exhibit behavioral biases/taste‐based discrimination and are


affected by cultural norms, stereotypes, homophily, etc.

Source: D'Acunto, Ghosh, Jain and Rossi (2021)

• Higher rates charged to Black and Hispanic borrowers (Bartlett et al.,


2021), for immigrants (Dobbie et al., forth.), for women (Alesina, 2013)
185
Algorithms can exacerbate differences
in rates and access to credit

Fuster, Goldsmith‐Pinkham, Ramadorai, and Walther (2021)

Using detailed administrative data on US mortgages, they find that ML:

• increases rate disparity across groups of borrowers

• benefits more White and Asian borrowers than Black and Hispanic
borrowers

• Channels: (1) ML better captures the structural relationship between observable characteristics and default; (2) ML is able to triangulate the (hidden) identity of borrowers.

186
Neutral?

Articles: FR, EN

Algorithmic Fairness – HEC Paris


Neutral?

188
Fraud detection

189
Many risks:
Model, Reputation, Regulatory, Legal

190
What is an “unfair” algorithm?

• An algorithm that places a group of individuals who share a protected


attribute (PA) at a systematic disadvantage.
e.g. gender, age, residence, ethnic origin, skin color, religion

• Disadvantage in terms of rejection rate, price, etc.

• More likely to happen with highly flexible, non‐linear, opaque models

• An algorithm can be unfair even when the PA is not used. Can be


completely unintentional

• But can arise with both interpretable and non‐interpretable models

191
Statistical triangulation

Figure: features such as income < 40k€, job contract = fixed term, and sector can jointly proxy for Gender.

192
Interpretability

Being interpretable does not imply being fair

Figure: interpretability relates the Variables to the Decision; fairness relates the Decision to Gender.

193
Why is algorithmic fairness important?

• Protect minorities and prevent discrimination

• Consumer protection

• Part of model risk

• Potentially problematic from an ethical and reputation point of


view

• Being unfair can be illegal: under U.S. fair‐lending law, lenders can treat minority borrowers differently only on the basis of creditworthiness

194
Reasons for an algorithm not to be fair

• Algorithm trained using past decisions made by a human (credit


officer, judge, HR director) who may be racist, misogynistic, etc

• Algorithms learning from actual outcomes (past defaults) but using


variables that proxy for membership in a protected group

195
Growing concern in the academia

196
Now a key component of sustainable AI

Sustainable AI must:

• be robust

• be transparent, or at least interpretable

• respect users’ data privacy

• be fair

197
Application to a credit database

Dataset: German Credit dataset

1,000 borrowers with 20 attributes, including gender


690 men and 310 women
For each borrower, we know if there is a default: 300 defaults in total

198
199
Forecasting default with simple
and ML models

200
Many fairness definitions (# citations in 2019)

201
An example: Statistical parity

% acceptance of loan applications for men


=
% acceptance of loan applications for women

men: 40% women: 40%

202
An example: Statistical parity

% acceptance of loan applications for men


=
% acceptance of loan applications for women

men: 40% women: 25%

203
An example: Statistical parity

% acceptance of loan applications for men


=
% acceptance of loan applications for women

men: 40% women: 25% < 32% (= 4/5 × 40%)

« 4/5 rule »
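As a small sketch, the statistical-parity check and the 4/5 rule can be computed directly from the acceptance decisions and the protected attribute; the toy arrays below are illustrative.

```python
import numpy as np

y_hat = np.array([1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0])  # 1 = loan accepted
gender = np.array(["f"] * 8 + ["m"] * 7)

rate_f = y_hat[gender == "f"].mean()
rate_m = y_hat[gender == "m"].mean()
ratio = min(rate_f, rate_m) / max(rate_f, rate_m)
print(f"acceptance women: {rate_f:.0%}, men: {rate_m:.0%}, ratio: {ratio:.2f}")
print("4/5 rule satisfied:", ratio >= 0.8)
```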

204
Composition effect

Figure: distribution of women and men across risk classes (creditworthiness).
205
An example: Statistical parity

% acceptance of loan applications for men


=
% acceptance of loan applications for women

men: 40% women: 25%

% acceptance of loan applications for men [income > 100k€, job > 5Y, home owner]
=
% acceptance of loan applications for women [income > 100k€, job > 5Y, home owner]

men: 70% women: 66%

206
An example: Statistical parity

% acceptance of loan applications for men


=
% acceptance of loan applications for women

men: 40% women: 25%

% acceptance of loan applications for men [income > 100k€, job > 5Y, home owner]
=
% acceptance of loan applications for women [income > 100k€, job > 5Y, home owner]

men: 70% women: 62%

206
Set up

207
Set up

208
Statistical parity

209
Conditional statistical parity

Building groups

For credit scoring applications, groups gather applicants with similar risk profiles:
Determined through unsupervised clustering methods (K-Means) or using an
exogenous classification (Basel classification)

210
Number of sub‐groups

• The larger the number of sub‐groups:

‐ the more homogenous the sub‐groups are (cleaner test)

‐ the more likely at least one sub‐group is found to be unfair

‐ the smaller the number of individuals in each sub‐group

• Aggregation process:

‐ Economic criterion: Fairness in all sub‐groups, majority of


sub‐groups, majority of individuals, etc
‐ Statistical criterion

211
(Conditional) Statistical parity

212
Independence assumptions

213
Testing fairness using independence tests

214
Testing independence using Chi‐2 statistics

215
Testing Statistical Parity

216
Fairness traffic light

217
Numerical example

n+1f = 214 E(n+1f) = 310*771/1000

n+0f = 96 E(n+0f) = 310*229/1000

n+1m = 557 E(n+1m) = 690*771/1000

n+0m = 133 E(n+0m) = 690*229/1000

Chi2 =
(214 - 310*771/1000)^2/(310*771/1000) +
(96 - 310*229/1000)^2/(310*229/1000) +
(557 - 690*771/1000)^2/(690*771/1000) +
(133 - 690*229/1000)^2/(690*229/1000)
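The same statistic can be reproduced with scipy's chi-square test of independence on the 2x2 contingency table of decisions by gender (no continuity correction, to match the formula above).

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[214, 96],    # women: accepted, rejected
                  [557, 133]])  # men:   accepted, rejected
chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.4f}, dof = {dof}")
```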

218
Conditional classification

219
Testing conditional statistical parity

220
Generalization: Other metrics

221
Generalization: Other metrics (2)

222
Fairness tests: with‐models

223
Fairness tests: without‐models

224
Configuration 2

225
7
Interpretability and Algorithmic Fairness

7.2 Interpretability and Mitigation


Fairness interpretability

226
Introduction to Partial Dependence Plots (PDP)

Source: Molnar (2023)


227
Fairness Partial Dependence Plots (FPDP):
Theory

228
FPDP: Notation

229
FPDP: Definition

230
Categorical feature

231
Continuous feature

232
FPDP: Interpretation

233
FPDP: Plots
Variable 1

Variable 2

Variable 3

234
Interpretability: without, white‐box model

235
236
Mitigation

How to make an unfair algorithm less unfair:

• Remove each candidate variable and re‐estimate the model

• Set the value of each candidate variable to a level for which there is
no fairness problem

237
With or without re‐estimation

238
7
Interpretability and Algorithmic Fairness

7.3 Gender Effects and Fairness in Hiring


239
7
Interpretability and Algorithmic Fairness

7.4 Fairness and Quality AI


https://www.vde.com/resource/blob/2176686
/a24b13db01773747e6b7bba4ce20ea60/vd
e-spec-vcio-based-description-of-systems-
for-ai-trustworthiness-characterisation-
data.pdf

The VCIO* approach, therefore, fulfils


three tasks:
1) It clarifies what is meant by a particular
value (value definition).
2) It explains in a comprehensible
manner how to check or observe whether
or to what extent a technical system
fulfils or violates a value (measurement).
3) It acknowledges the existence of value
conflicts and explains how to deal with
these conflicts depending on the
application context (balancing).
* VCIO = Values Criteria Indicators
Observables

October 6, 2022: Memorandum of Cooperation (MoC) with Confiance.ia, a consortium of French industrial companies and research centers.

240
