
Interpretability and Algorithmic Fairness

Christophe Pérignon (Instructor)


Professor of Finance, perignon@hec.fr

Sébastien Saurin (Teaching assistant)

Ph.D. student, seb.saurin@hotmail.fr

Master DSB
October 24 – November 10, 2023
1
Course Description

The goal of this course is to present the concepts of interpretability and algorithmic fairness, as well as the techniques to implement them:

(1) identify the variables that play the most important role in a given AI/machine‐learning model, or explain individual decisions;

(2) check whether a machine‐learning model treats different groups of individuals in a fair way.

These concepts and skills are important to master for data scientists,
model validators, AI entrepreneurs, executives, regulators, policymakers,
and consumer advocates.

Students will have to work on a project combining some of the techniques


presented in the course and make a final presentation.
1
Main Topics

• Definition of interpretability/explainability
• Natively interpretable models: OLS, Logistic, GAM, PLTR
• Global interpretability measures: Global surrogate, Partial
Dependence Plot (PDP), Accumulated Local Effect (ALE)
• Local interpretability measures: ICE, LIME, SHapley Additive
exPlanations (SHAP)
• Explaining Performance (Permutation Importance, XPER)
• Why an algorithm needs to be fair and definitions of fairness metrics
• Inference tests and fairness diagnosis
• Fairness interpretability
• Mitigation techniques
2
Learning Objectives

By the end of this course, you should:


• Know why it is important for an algorithm to be interpretable
• Know how to interpret it globally
• Know how to explain a decision affecting one instance
• Know why an algorithm needs to be fair
• Know how to detect an algorithm which is not fair
• Understand why a given algorithm is unfair
• Be able to make it fairer while maintaining a high level of performance

3
Course Structure

Sessions: Tuesday 24, Wednesday 25, Thursday 26, Friday 27, and Friday 10.
Mornings: lectures, with guest speakers GS2 and GS4; student presentations on Friday 10.
Afternoons: lectures, with guest speakers GS1 and GS3; student presentations.

Guest speakers (GS):


GS1: Vassilis Digalakis, Assistant Professor, HEC Paris
GS2: Florence d’Alché‐Buc, Professor and Chair Holder “Data Science and Artificial Intelligence for
Digitalized Industry and Services”, Telecom Paris
GS3: Jean‐Marie John‐Mathews, Co‐founder, Giskard
GS4: Edouard Tabary, Head of Data Science, BNPP PF Scoring Center
4
Evaluation

• Attendance is compulsory. Students must send an email to the


instructor if they cannot attend a lecture.
• Quiz (15%)
• Class participation (15%)
• Group project (70%). NB: Grades may vary among students within a
given group.

5
Recommended Readings

Barocas S., Hardt M., Narayanan A. (2023) Fairness and Machine Learning:


Limitations and Opportunities https://fairmlbook.org/
Molnar C. (2023) Interpretable Machine Learning
https://christophm.github.io/interpretable‐ml‐book/

Hué S., Hurlin C., Pérignon C., Saurin S. (2023) Measuring the Driving Forces of
Predictive Performance: Application to Credit Scoring
https://ssrn.com/abstract=4280563

Hurlin, C., Pérignon C., Saurin S. (2022) The Fairness of Credit Scoring Models
https://ssrn.com/abstract=3785882

Krishna S., Han T., Gu A., Pombra J., Jabbari S., Wu S. and Lakkaraju H. (2022)
The Disagreement Problem in Explainable Machine Learning: A Practitioner’s
Perspective https://arxiv.org/abs/2202.01602

Artificial Intelligence: The Robots Are Now Hiring, Wall Street Journal
https://www.youtube.com/watch?v=8QEK7B9GUhM
6
7
Interpretability and Algorithmic Fairness

1. Introduction
“Life‐changing algorithms”

Condition access to:

• Credit

Credit scoring models used to decide on new


loan applications
Classification algorithm: High‐risk borrowers
(reject) vs. Low‐risk borrowers (accept)
7
“Life‐changing algorithms”

Condition access to:

• Credit

Variables → Forecast (score) → Decision vs. Outcome
8
“Life‐changing algorithms”

Condition access to:

• Credit

• Work

• Education

• Love

• Freedom

12
The use of AI by government agencies

Source: Government by Algorithm: Artificial Intelligence in Federal Administrative Agencies


(https://www‐cdn.law.stanford.edu/wp‐content/uploads/2020/02/ACUS‐AI‐Report.pdf)

13
Automated claim management for insurance

14
Why is AI so popular in these business
and government applications?

• Allows processing massive quantities of data (including new data)


 including digital footprint, new forms of data (open data, video, payment data, etc)

• Speed up and lower the cost of application processing (scalable)


 Including small firms or unbanked households traditionally overlooked by standard screening

• Use powerful algorithms, hence improving the classification of applications between good and bad types
 Reduction in Type‐1 and Type‐2 errors. Impact on P&L

 This quest for performance leads companies to develop models that are increasingly complex,
and less and less transparent and interpretable. “Algorithm Darwinism”

15
Interpretability

16
Important distinctions

We distinguish between:
• Models that are intrinsically interpretable (white box or glass box)
• Models whose structures do not permit easy interpretation (black box)

We distinguish between:
• Understanding a model (global interpretability)
For instance, in this Artificial Neural Network, the variables with the strongest impact on the estimated
house price are the lot size and the presence of a swimming pool.

• Understanding a particular decision for a given instance (local


interpretability)
For instance, why was Mrs Smith’s consumer loan application rejected by an algorithmic lender?

19
The ML Journey

Problem / Question → Data (collection, cleaning) → Modeling (list of models, variable selection, hyperparameters, training set) → Validation (test set, stability, interpretability, fairness, performance) → Production (value creation)
20
Hot debate about interpretability

Source: https://www.youtube.com/watch?v=GtCFprO5p7k
21
Hot debate about interpretability (2)

22
Hot debate about interpretability (3)

23
When we need to interpret (and when we don’t)

ML models produce predictions, but they do not explain their predictions to users

• In some cases, this is not important

e.g. a search engine predicts that a document is what a user wants; the cost to users of a mistake is low

• In many other situations, understanding how predictions are made is


desirable because mistakes are costly and confidence in the model is key
e.g. loan applications, estimated house price given by real‐estate agent, match between organ
donors/recipients, job applications, releasing a prisoner, applying to a university, tax fraud detection,
investigation about child abuse, allocation of ventilators during covid, illness detection.

“The algorithm rejected you. Sorry I do not have any additional information.”
is not going to be an acceptable answer…

24
When we need to interpret (2)

• Requested by regulation and Law


e.g. GDPR’s Chapter 11: Right to explanation; French Code for Public Health (Article L4001‐3): algorithm
developers […] must make sure the functioning of the algorithm can be explained to users (Law #2021‐
1017, August 2, 2021)

• Does the model produce biased decisions? If challenged in court, can the company using the model prove that it does not?
e.g. Is the model racist or misogynistic?

• Improving the model


e.g. Image recognition software that distinguishes between polar bears and dogs. If you can show that
the model makes decisions by looking at the background (ice vs. grass) rather than looking at the shape
of the animals, there is a problem.

25
Research on interpretability

Figure: total number of accepted papers (scale from 400 to 2,000).
26
When we need to interpret (3)

AI/ML is highly criticized for failing to deliver on its promise:

• It is very costly and the ROI is unclear


• there are many POCs but few algorithms are actually put in production
• It is very opaque
• It often creates tensions between business experts and data scientists

 Interpretability can make AI/ML more P&L centric

 Interpretability increases transparency and can reduce tensions

27
Quality AI

October 6, 2022: Memorandum of Cooperation (MoC) between VDE and Confiance.ia, a consortium of
French industrial companies and research centers.
* VCIO = Values Criteria Indicators Observables
28
Whom to explain?

Example from the Banking industry

• Developers, i.e. those developing or implementing a ML application.


• 1st line model checkers, i.e., those directly responsible for making sure model
development is of sufficient quality.
• Management responsible for the application.
• 2nd line model checkers, i.e., control function, they independently check the
quality of model development and deployment.
• Regulators that aim to protect consumers and financial stability.
• Clients
29
Whom to explain? (2)

30
31
, 2. Interpreting White box

In this section, we introduce five white-box models:

1 Ordinary Least Squares (OLS)

2 Logistic Regression (LR)

3 Decision Tree (DT)

4 Generalized Additive Models (GAM)

5 Penalised Logistic Tree Regression (PLTR)

32
, 2.1 OLS

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$

The marginal effect of a feature $X_j$ on $Y$ is equal to:

$\frac{\partial Y}{\partial X_j} = \beta_j$

$E(Y) = \beta_0 + \beta_1 E(X_1) + \beta_2 E(X_2) + \dots + \beta_k E(X_k)$

$Y - E(Y) = \beta_1 (X_1 - E(X_1)) + \beta_2 (X_2 - E(X_2)) + \dots + \beta_k (X_k - E(X_k)) + \varepsilon$

33
, 2.1 OLS

Example
”The model predicts that the price of the house of interest (with the feature values in the second column of
Table 9.1) is $232,349. The value of an “average house” with the feature values in the third column of Table
9.1 is $180,817. This house is therefore worth $51,532 more than the average house.”

Source: Hull (2020)

34
, 2.2 Logistic Regression

$P(Y = 1 \mid X = x) = \Lambda(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)$

with $\Lambda(.)$ the cumulative distribution function (cdf) of the logistic distribution, such that:

$P(Y = 1 \mid X = x) = \frac{1}{1 + \exp\big(-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)\big)}$

$P(Y = 0 \mid X = x) = \frac{\exp\big(-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)\big)}{1 + \exp\big(-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)\big)}$

The marginal effect of a feature $X_j$ on $P(Y = 1 \mid X = x)$ is equal to:

$\frac{\partial P(Y = 1 \mid X = x)}{\partial X_j} = \frac{\exp\big(-(\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k)\big)}{\big[1 + \exp\big(-(\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k)\big)\big]^2} \, \beta_j$

35
, 2.3 Decision Tree

Figure: Decision Tree for classification

36
, 2.4 Generalized Additive Models (GAM)

Definition (Generalized Additive Models (GAM))


Generalized Additive Models (GAMs) provide an extension to standard linear models by
allowing the incorporation of non-linear functions for each variable while preserving the
principle of additivity.

Hastie, T. and Tibshirani, R. (1986), Generalized additive models. Statistical Science, Vol.1,
No. 3, 297-318.

37
, 2.4 Generalized Additive Models (GAM)

Definition (Generalized Additive Models (GAM))


In general, the conditional mean µ(X ) of a target variable Y is related to an additive
function of the features via a link function g :

g (µ(X )) = α + f1 (X1 ) + ... + fp (Xp )

where f1 (X1 ), ..., fp (Xp ) correspond to (smooth) non-linear functions of the features.

Examples of classical link functions are the following:

g (µ(X )) = µ for linear and additive models for Gaussian response data.

g (µ(X )) = logit (µ) or g (µ(X )) = probit (µ) for modeling binomial probabilities.

Examples of (smooth) functions: Splines, Natural Cubic Splines, Smoothing Splines

38
, 2.4 Generalized Additive Models (GAM)

In the regression setting, a generalized additive model has the form

E (Y |X1 , ..., Xp ) = α + f1 (X1 ) + ... + fp (Xp )

In the binary classification setting, the additive logistic regression has the form

$\log\!\left(\frac{\mu(X)}{1 - \mu(X)}\right) = \alpha + f_1(X_1) + \dots + f_p(X_p)$

39
, 2.4 Generalized Additive Models (GAM)

Regression

Figure: Example of GAM for a regression model

Source: An introduction to Statistical Learning: With Application in R (2021)

40
, 2.4 Generalized Additive Models (GAM)

Advantages
1 Automatically model non-linear relationships that standard linear regression will
miss.

2 The non-linear fits can potentially make more accurate predictions of the target.

3 Easily examine the marginal effect of each feature on the target.

Limits
1 Interactions can be missed as the model is restricted to be additive. However, as
with linear regression, we can manually add interaction terms to the GAM; doing
so makes the GAM subject to overfitting issues.

41
, 2.4 Generalized Additive Models (GAM)

Implementation
For the implementation, we can cite:

1 Python: package pyGAM, PiML, and statsmodels.

2 R: package GAM.
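Below is a minimal sketch of a GAM fit with the pyGAM package; the synthetic data, term indices, and smoothing choices are illustrative, not taken from the course material.

```python
import numpy as np
from pygam import LinearGAM, s

# Toy regression data with three numerical features (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=500)

# One smooth term s(j) per feature: g(mu) = alpha + f1(X1) + f2(X2) + f3(X3)
gam = LinearGAM(s(0) + s(1) + s(2)).fit(X, y)
gam.summary()  # per-term effective degrees of freedom and significance
```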

42
, 2.5 Penalised Logistic Tree Regression (PLTR)

Definition (Penalised Logistic Tree Regression (PLTR))


The Penalised Logistic Tree Regression (PLTR) is a high-performance and interpretable
credit scoring method which uses information from decision trees to improve the
performance of logistic regression.

Remark: This method is not restricted to credit scoring applications and can be used for
other classification and regression problems (after some small adjustments).

Dumitrescu, E., Hué, S., Hurlin, C. and Tokpavi, S. (2022), Machine learning for credit
scoring: Improving logistic regression with non-linear decision-tree effects, European Journal
of Operational Research, Vol. 297, Issue 3, 1178-1192.

43
, 2.5 Penalised Logistic Tree Regression

First Step

Figure: Splitting Process

Source: Machine learning for credit scoring: Improving logistic regression with non-linear decision-tree effects (2022)

44
, 2.5 Penalised Logistic Tree Regression

Second Step
In the second step, the endogenous univariate and bivariate threshold effects previously
obtained are plugged in the logistic regression

$P\big(y_i = 1 \mid V_{i,1}^{(j)}, V_{i,2}^{(j,k)}; \Theta\big) = \frac{1}{1 + \exp\big(-\eta(V_{i,1}^{(j)}, V_{i,2}^{(j,k)}; \Theta)\big)}$

with $\eta(V_{i,1}^{(j)}, V_{i,2}^{(j,k)}; \Theta) = \beta_0 + \sum_{j=1}^{p} \alpha_j x_j + \sum_{j=1}^{p} \beta_j V_{i,1}^{(j)} + \sum_{j=1}^{p-1} \sum_{k=j+1}^{p} \gamma_{j,k} V_{i,2}^{(j,k)}$.

The corresponding likelihood is

$L\big(V_{i,1}^{(j)}, V_{i,2}^{(j,k)}; \Theta\big) = \frac{1}{N} \sum_{i=1}^{N} \Big[ y_i \log F\big(\eta(V_{i,1}^{(j)}, V_{i,2}^{(j,k)}; \Theta)\big) + (1 - y_i) \log\big(1 - F(\eta(V_{i,1}^{(j)}, V_{i,2}^{(j,k)}; \Theta))\big) \Big].$

45
, 2.5 Penalised Logistic Tree Regression

Second Step
Finally, the adaptive lasso estimators are obtained as

$\hat\Theta_{alasso}(\lambda) = \underset{\Theta}{\operatorname{argmin}} \; \Big\{ -L\big(V_{i,1}^{(j)}, V_{i,2}^{(j,k)}; \Theta\big) + \lambda \sum_{v=1}^{V} w_v |\theta_v| \Big\}.$
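As a rough sketch of the two-step idea (not the authors' implementation), the snippet below extracts univariate and bivariate threshold dummies from shallow decision trees and then fits a plain L1-penalised logistic regression; the cross-validated lasso stands in for the adaptive lasso above, and all data and hyperparameters are illustrative.

```python
import numpy as np
from itertools import combinations
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] * X[:, 1] + X[:, 2] > 0).astype(int)

# Step 1: threshold effects extracted from shallow trees
leaves = []
for j in range(X.shape[1]):                      # univariate effects (depth-1 tree)
    t = DecisionTreeClassifier(max_depth=1).fit(X[:, [j]], y)
    leaves.append(t.apply(X[:, [j]]).reshape(-1, 1))
for j, k in combinations(range(X.shape[1]), 2):  # bivariate effects (depth-2 tree)
    t = DecisionTreeClassifier(max_depth=2).fit(X[:, [j, k]], y)
    leaves.append(t.apply(X[:, [j, k]]).reshape(-1, 1))
V = OneHotEncoder(drop="first").fit_transform(np.hstack(leaves)).toarray()

# Step 2: penalised (L1) logistic regression on original features + tree dummies
Z = np.hstack([X, V])
pltr = LogisticRegressionCV(penalty="l1", solver="liblinear", Cs=10, cv=5).fit(Z, y)
print("non-zero coefficients:", int((pltr.coef_ != 0).sum()))
```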

46
, 2.5 Penalised Logistic Tree Regression

Figure: PLTR interactions

Source: Machine learning for credit scoring: Improving logistic regression with non-linear decision-tree effects (2022)

47
, 2.5 Penalised Logistic Tree Regression

Figure: PLTR performances

Source: Machine learning for credit scoring: Improving logistic regression with non-linear decision-tree effects (2022)

48
49
, 3. Interpreting Black-Box - Global

In this section, we introduce three model-agnostic interpretation methods:

1 Global Surrogate

2 Partial Dependence Plot (PDP)

3 Accumulated Local Effects (ALE)

50
51
Consider a given opaque machine learning model that produces predictions ŷ .

52
Decision Tree used as a surrogate model: Bike rental example

Source: Molnar (2023)

53
, 3.1 Global Surrogate

Advantages
1 Simple and intuitive

2 Can accommodate any black box model

Limits
1 Estimation risk: discrepancy between the original black-box predictions ŷ and those of the
surrogate model (e.g., surrogate R-squared = 0.7).
2 Dangerous: Can give the illusion of interpretability.
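A minimal sketch of the global-surrogate recipe follows, with synthetic data and illustrative model choices: an interpretable tree is trained on the black-box predictions (not on y), and its fidelity is measured by the R-squared with respect to those predictions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=1000)

black_box = GradientBoostingRegressor().fit(X, y)   # opaque model
y_hat = black_box.predict(X)                        # its predictions

# Train an interpretable model on the black-box predictions, not on y
surrogate = DecisionTreeRegressor(max_depth=3).fit(X, y_hat)

# Fidelity of the surrogate: how well it reproduces the black-box output
print("surrogate R-squared w.r.t. black-box predictions:",
      round(r2_score(y_hat, surrogate.predict(X)), 3))
```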

54
55
, 3.2 PDP

Definition (Partial Dependence Plot )


The Partial Dependence Plot (PDP) shows the marginal effect one feature has on the
predicted outcome of a machine learning model (Friedman 2001).

Friedman, J.H. (2001), Greedy function approximation: A gradient boosting machine.


Annals of statistics, 1189-1232.

56
, 3.2 PDP

Notations
Denote by $X_s$ the features for which the partial dependence function should be plotted and $\Omega_s$ the universe of its realizations.

Denote by $x_s$ a realization of the random variable $X_s$.

Denote by $X_c$ the other features used in the machine learning model $\hat{f}$.

57
, 3.2 PDP

Definition (Partial Dependence function)


The Partial Dependence function $pd_s(.)$ is the expectation of the model output $\hat{f}(.)$ over the marginal distribution of all variables other than $X_s$:

$pd_s(x_s) = \mathbb{E}_{X_c}\big[\hat{f}(x_s, X_c)\big] \quad \forall x_s \in \Omega_s$

Remark: the PDP is different from the conditional expectation $\mathbb{E}_{X_c \mid X_s}\big(\hat{f}(x_s, X_c)\big)$, where the expectation is taken over the conditional distribution of $X_c$ given $X_s = x_s$.

58
, 3.2 PDP

Example (PDP in a linear model)


Consider the linear model

$Y = \hat{f}(X) + \varepsilon = \hat\beta_0 + \hat\beta_1 X_1 + \dots + \hat\beta_p X_p + \varepsilon$

The PD function associated to feature $X_1$ is defined as:

$pd_1(x_1) = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 \mathbb{E}(X_2) + \dots + \hat\beta_p \mathbb{E}(X_p)$

59
, 3.2 PDP

Example (PDP in a logit model)


Consider the logit model

$\Pr(Y = 1 \mid X) = \hat{f}(X) = \Lambda\big(\hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 X_2\big)$

where $\Lambda(z) = \exp(z)/(1 + \exp(z))$ is the cdf of the logistic distribution. The PD function associated to feature $X_1$ is defined as:

$pd_1(x_1) = \mathbb{E}_{X_2}\Big[\Lambda\big(\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 X_2\big)\Big]$
60
, 3.2 PDP

Definition (PD estimate)


In practice, the PD function is simply estimated by averaging over the dataset as

$\widehat{pd}_s(x_s) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}\big(x_s, x_c^{(i)}\big) \quad \forall x_s \in \Omega_s$

i.e., predictions are computed on the dataset in which the column of $X_s$ is replaced by the value $x_s$ for every instance:

$\big(x_s, x_c^{(1)}\big), \big(x_s, x_c^{(2)}\big), \dots, \big(x_s, x_c^{(n)}\big)$

Remark 1: xs can take any values including some which are not in the dataset, e.g., for
age xs can take any value between min (age ) and max (age ).
Remark 2: The PDP shows whether the relationship between the target and a feature is
linear, monotonic or more complex.

61
, 3.2 PDP

Pseudo algorithm for PDP


1 Select feature $X_s$ among all features.

2 Define a grid of values $x_s \in \{c_1, \dots, c_k\} \subset \Omega_s$.

3 For each $x_s$:
 1 Replace all realisations of $X_s$ by $x_s$
 2 Compute $\hat{f}(x_s, x_c)$ for each instance.
 3 Average predictions across instances.

4 Draw the curve $\widehat{pd}_s(x_s)$.
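The pseudo-algorithm above can be coded directly; the sketch below is model-agnostic and uses synthetic data and an illustrative feature/grid choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 3))
y = 2 * X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.05, size=500)
model = RandomForestRegressor(random_state=0).fit(X, y)

def partial_dependence(model, X, s, grid):
    pd_values = []
    for xs in grid:                                      # step 3: loop over grid values
        X_mod = X.copy()
        X_mod[:, s] = xs                                 # 3.1 replace all realisations of X_s by x_s
        pd_values.append(model.predict(X_mod).mean())    # 3.2-3.3 predict and average
    return np.array(pd_values)

grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 20)     # step 2: grid of values
pd_curve = partial_dependence(model, X, s=0, grid=grid)  # step 4: curve to plot
```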

62
, 3.2 PDP

Continuous feature

Example (PDP in a regression model)


Zhou and Hastie (2019) consider two predictive models for housing price: random forest
and gradient boosting machine. PDPs suggest that the housing price is insensitive to air
quality until it reaches certain pollution level around 0.65.

63
, 3.2 PDP

Categorical feature

Source: Molnar (2023)


64
, 3.2 PDP

Classification model
When the target variable Y is categorical, the PD function displays the average marginal effect of the feature s on the conditional probability $\Pr\big(Y = 1 \mid x_s, x_c^{(i)}\big)$ as

$pd_s(x_s) = \frac{1}{n} \sum_{i=1}^{n} \Pr\big(Y = 1 \mid x_s, x_c^{(i)}\big) \quad \forall x_s \in \Omega_s$

65
, 3.2 PDP

Classification model

Source: Molnar (2023)


66
, 3.2 PDP

Advantages
1 Interpretation is clear: The PDP shows the marginal effect of a given feature Xs
on the average prediction.

2 Easy to implement: The PDP does not require re-estimating the model.

67
, 3.2 PDP

Limits
1 Maximum number of features: the PDP analysis is limited to 2 features.
2 Heterogeneous effects across instances might be hidden because PDP only show
the average marginal effects.
▶ Solution: Individual Conditional Expectation (ICE) curves.

3 Assumption of independence is the biggest issue with PDP.


▶ When the features are correlated, we create new data points in areas of the feature
distribution where the actual probability is very low (for example someone who is 2
meters tall but weighs less than 50 kg).
▶ Solution: Accumulated Local Effects (ALE) curves.

68
Implementation
For the implementation, we can cite:

1 Python: package scikit-learn: function plot_partial_dependence().

2 R: package iml: function FeatureEffect().

69
70
, 3.3 ALE

Definition (Accumulated Local Effects (ALE) )


Accumulated Local Effects (ALE) plots describe how features influence the prediction of
a ML model on average, while taking into account the dependence between the features.

Apley, D.W. (2016), Visualizing the effects of predictor variables in black box supervised
learning models. arXiv preprint arXiv:1612.08468.

71
, 3.3 ALE

Figure: Marginal Distribution

73
, 3.3 ALE

Independence assumption
To calculate the feature effect of x1 at a given value, say 0.75, the PDP replaces x1
of all instances with 0.75.

It means that we use the marginal distribution of X2 .

This results in unlikely combinations of x1 and x2 (e.g. x2 = 0.2 at x1 = 0.75),


which the PDP uses for the calculation of the average effect.

The figure displays two correlated features and illustrates the fact that the PDP averages
predictions over unlikely instances.

74
, 3.3 ALE

Conditional expectation
We could average over the conditional distribution of the feature, meaning at a
grid value of x1 , we average the predictions of instances with a similar x1 value.

The solution for calculating feature effects using the conditional distribution is called
Marginal Plots, or M-Plots

M-plots are obtained like the PDP but using the conditional distribution:

$\mathbb{E}_{X_c \mid X_s}\big(\hat{f}(x_s, X_c)\big) \qquad (1)$

75
, 3.3 ALE

Figure: Conditional Distribution

76
, 3.3 ALE

Limits of the M-plot

Example
Consider a model which predicts the value of a house depending on the number of rooms
and the size of the living area.
If we average the predictions of all houses of about 80 m2 , we estimate the combined
effect of living area and number of rooms, because of their correlation.
♢ Suppose that the living area has no effect on the predicted value of a house, only the
number of rooms has.
♢ The M-Plot would still show that the size of the living area increases the predicted
value, since the number of rooms increases with the living area.

77
, 3.3 ALE

Accumulated Local Effect


ALE solves the combined-effect problem by calculating differences in predictions instead
of averages, yet using conditional distributions.
For the calculation of ALE for feature x1 , which is correlated with x2 :

1 We divide the feature space of x1 into several intervals.

2 For each data point in a given interval, we calculate the difference in predictions
when replacing the feature value by, respectively, the upper and lower limit of the
interval.

3 In each interval, we compute the average difference in predictions.

4 Average differences are then accumulated.
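A rough implementation of these four steps (uncentered, first-order ALE with quantile-based intervals) is sketched below; the data and model are synthetic and only meant to illustrate the procedure.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def ale_uncentered(model, X, s, n_bins=10):
    """Rough first-order, uncentered ALE for feature s."""
    edges = np.quantile(X[:, s], np.linspace(0, 1, n_bins + 1))  # 1. intervals
    ale = [0.0]
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (X[:, s] >= lo) & (X[:, s] <= hi)
        if not mask.any():
            ale.append(ale[-1])
            continue
        X_lo, X_hi = X[mask].copy(), X[mask].copy()
        X_lo[:, s], X_hi[:, s] = lo, hi
        # 2. difference in predictions at the upper vs. lower limit of the interval
        diffs = model.predict(X_hi) - model.predict(X_lo)
        # 3. average difference within the interval; 4. accumulate across intervals
        ale.append(ale[-1] + diffs.mean())
    return edges, np.array(ale)

# Toy usage on two correlated features (only x2 truly matters)
rng = np.random.default_rng(0)
x1 = rng.uniform(size=1000)
x2 = x1 + rng.normal(scale=0.1, size=1000)
X = np.column_stack([x1, x2])
y = x2 + rng.normal(scale=0.05, size=1000)
model = GradientBoostingRegressor().fit(X, y)
edges, ale = ale_uncentered(model, X, s=0, n_bins=10)
```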

78
, 3.3 ALE

Figure: Intuition of the ALE

Source: Molnar (2023)

79
, 3.3 ALE

Example
For the calculation of ALE for the living area, which is correlated with the number of
rooms:

1 We divide the feature space of the living area into several intervals.

2 For each house between 79 and 81 m2 , we calculate the difference in estimated price
when replacing the living area value by 81 and then by 79.

3 For this interval, we compute the average difference in estimated price.

4 We do the same in all other intervals: (77-79), (75-77), ...

5 Average differences are then accumulated.

80
Definition (Accumulated Local Effect)
The uncentered ALE averages the changes in the predictions and accumulates them over an interval:

$ALE_s(x_s) = \int_{z_{0,s}}^{x_s} \mathbb{E}_{X_c \mid X_s}\!\left[\frac{\partial \hat{f}(X_s, X_c)}{\partial X_s} \,\middle|\, X_s = z_s\right] dz_s \quad \forall x_s \in \Omega_s$

where $z_{0,s}$ is the minimum value of $X_s$ for which the ALE curve is computed.

81
, 3.3 ALE

Definition (centered ALE)


The centered ALE is defined as

ALEscent (xs ) = ALEs (xs ) − EXs (ALEs (Xs ))

with k the number of values taken by Xs .

82
, 3.3 ALE

Figure: Example with 5 intervals

Source: Molnar (2023)

83
, 3.3 ALE
Figure: Example of centered ALE for numerical features and a regression model

Source: Molnar (2023)


84
, 3.3 ALE

Advantages
1 No independence assumption. ALE plots still work when features are correlated.

2 Faster to compute than PDP. For PDP, we need to compute n × k predictions and
compute k averages. For ALE, we need to compute 2 × n predictions and compute
k averages.

85
, 3.3 ALE

Limits
1 No solution for setting the number of intervals
▶ If the number of intervals is too low, ALE plots are inaccurate because they only partially
account for the dependence across features.
▶ If the number of intervals is too high, we end up with too few instances per interval.
Empirical expectations do not converge towards the theoretical ones and ALE plots
can become a bit shaky.

86
, 3.3 ALE

Implementation
For the implementation, we can cite among many others.

1 Python: PyALE library

2 R: package IML, ALEPlot

87
88
89
, 4.1 ICE

Individual Conditional Expectation (ICE) plots display one curve per instance that shows
how the instance’s prediction changes when a feature changes.

Definition (Individual Conditional Expectation)


For each instance $\big(x_s^{(i)}, x_c^{(i)}\big)$, the ICE associated to the feature $X_s$ corresponds to:

$ICE_{s,i}(x_s) = \hat{f}\big(x_s, x_c^{(i)}\big) \quad \forall x_s \in \Omega_s$

Goldstein, A., et al. (2015), Peeking inside the black box: Visualizing statistical learning
with plots of individual conditional expectation. Journal of Computational and Graphical
Statistics 24.1 (2015): 44-65.

90
, 4.1 ICE

Figure: Example of ICE for a regression model and 3 numerical features

Source: Molnar (2023)

91
, 4.1 ICE

Centered ICE
It can be hard to tell whether ICE curves differ across individuals because they start
at different levels.

A simple solution is to center the curves and only display the difference with respect
to the reference point.

Definition (Centered ICE)


The centered ICE is defined as

$ICE_{s,i}^{cent}(x_s) = \hat{f}\big(x_s, x_c^{(i)}\big) - \hat{f}\big(x_a, x_c^{(i)}\big) \quad \forall x_s \in \Omega_s$

where $x_a$ is the anchor point.

92
, 4.1 ICE

Centered ICE
In general, the anchor point is defined as:

$x_a = \min_{x_s \in \Omega_s} x_s$

Thus, we have:

$ICE_{s,i}^{cent}(x_s) = \begin{cases} 0 & \text{if } x_s = x_a \\ \hat{f}\big(x_s, x_c^{(i)}\big) - \hat{f}\big(x_a, x_c^{(i)}\big) & \text{if } x_s > x_a \end{cases}$

All the instances have the same (null) ICE for the value $x_a$, i.e.

$ICE_{s,i}^{cent}(x_a) = 0, \quad \forall i = 1, \dots, n$

93
, 4.1 ICE

Figure: Example of centered ICE

Source: Molnar (2023)

94
, 4.1 ICE

Advantages
1 Intuitive: ICE curves are even more intuitive to understand than PDP.

2 Heterogeneous effects: unlike PDP, ICE curves can uncover heterogeneous


relationships.

Limits
1 Maximum number of features: ICE curves can only display one feature at a time.

2 Assumption of independence: ICE curves suffer from the same problem as PDP: If
the feature of interest is correlated with the other features, then some points in the
curve might be unlikely data points according to the joint feature distribution.

3 Readability: If many ICE curves are drawn, the plot can become overcrowded.

95
, 4.1 ICE

Implementation
For the implementation, we can cite among many others.

1 Python: package scikit-learn: function plot_partial_dependence().

2 R: packages iml, ICEbox, and pdp.
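In recent scikit-learn versions, ICE curves are exposed through PartialDependenceDisplay (the successor of plot_partial_dependence); the sketch below uses synthetic data and an illustrative feature index.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 3))
y = np.sin(4 * X[:, 0]) * X[:, 1] + rng.normal(scale=0.05, size=300)
model = RandomForestRegressor(random_state=0).fit(X, y)

# kind="individual" draws one ICE curve per instance; kind="both" overlays the PDP
PartialDependenceDisplay.from_estimator(model, X, features=[0], kind="both")
plt.show()
```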

96
97
, 4.2 LIME

Definition
LIME explains the prediction of any model by learning an interpretable model locally around the prediction.

Ribeiro, M.T., et al. (2016), ”Why Should I Trust You?” Explaining the Predictions of
Any Classifier. Proceedings of the 22nd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining.

98
, 4.2 LIME

Intuition

Black-box model’s complex


decision function f (blue vs.
pink)

One instance being explained


(solid red cross)

Local surrogate model (dashed


line)

99
, 4.2 LIME

Main ingredients of LIME

Local surrogate model


▶ With creation of synthetic instances

Proximity measure between instances

Simplification of the interpretation


▶ Transformation of the original features, e.g., continuous features to binary features.
▶ Complexity measure, e.g., limit the number of features used in the surrogate model.

101
, 4.2 LIME

Transformation of the original features

Distinction between features and interpretable data representations

▶ For image classification, only consider groups of pixels (a super-pixel).

▶ For text classification, only consider groups of words.

▶ For tabular data, continuous variables are discretized. For categorical variables,
categories can be combined.

103
, 4.2 LIME

Transformation of the original features

Notations

▶ Denote by $x \in \mathbb{R}^d$ the vector of the original representation (original features) of an instance being explained.

▶ Denote by $x' \in \{0, 1\}^{d'}$ the binary vector of the interpretable representation (transformed features).

104
, 4.2 LIME

Fidelity-Interpretability Trade-off

We denote by $f : \mathbb{R}^d \to \mathbb{R}$ the model being explained.

▶ In classification, f(x) is the probability that x belongs to a certain class.

We define an explanation as a model $g : \{0, 1\}^{d'} \to \mathbb{R}$, with $g \in G$, where G is a class of potentially interpretable models.
▶ e.g., linear models, decision trees.

105
, 4.2 LIME

Fidelity-Interpretability Trade-off

Let Ω(g) be a measure of complexity (as opposed to interpretability) of the explanation g ∈ G.

▶ e.g., depth of the tree for decision trees, number of non-zero weights for linear models.

Let πx(z) be a proximity measure between an instance z and x, so as to define locality around x.
▶ e.g., exponential kernel.

Let L(f, g, πx) be a measure of how unfaithful g is in approximating f in the locality defined by πx.

106
, 4.2 LIME

Fidelity-Interpretability Trade-off

The explanation produced by LIME is obtained by the following:

$\xi(x) = \underset{g \in G}{\operatorname{argmin}} \; L(f, g, \pi_x) + \Omega(g) \qquad (2)$

Remark: This formulation can be used with different explanation families G , fidelity
functions L, and complexity measures Ω.

107
, 4.2 LIME

Sampling for Local Exploration

108
, 4.2 LIME

Sparse Linear Explanations using LIME

Let G be the class of linear models, such that $g(z') = w_g \cdot z' = \sum_{j=1}^{d'} w_{g,j} z'_j$

Let $L(f, g, \pi_x)$ be the locally weighted square loss defined as:

$L(f, g, \pi_x) = \sum_{z, z' \in Z} \pi_x(z) \big(f(z) - g(z')\big)^2$

where $\pi_x(z) = \exp\!\big(-D(x, z)^2/\sigma^2\big)$ is an exponential kernel defined on some distance function D.

109
, 4.2 LIME

Sparse Linear Explanations using LIME

Set a limit K on the number of interpretable representations $z'$ taken into account in g as follows:

$\Omega(g) = \infty \times \mathbb{1}\big[\|w_g\|_0 > K\big] = \begin{cases} \infty & \text{if } \|w_g\|_0 > K \\ 0 & \text{otherwise} \end{cases}$

with $\|w_g\|_0$ the number of nonzero elements in $w_g$.

110
, 4.2 LIME

Figure: LIME for tabular data: Titanic data, probability of survival


of an 8-year-old boy who travelled in the first class

111
, 4.2 LIME

Sparse Linear Explanations using LIME

Figure: Example of error explanation

112
, 4.2 LIME

Advantages
1 Human-friendly explanation: The resulting explanations are short (= selective) and
possibly contrastive.

2 Features type: Works for tabular data, text and images.

3 Fidelity measure: Good idea of how reliable the interpretable model is in explaining
the black box predictions in the neighborhood of the data instance of interest.

Limits
1 Neighborhood definition: Sensitive results to the kernel values.

2 Sampling: Creation of unrealistic instances.

3 Complexity definition: It has to be defined in advance and the choice of K is left to the user.

113
, 4.2 LIME

Implementation
For the implementation, we can cite among many others.

1 Python: lime library

2 R: package lime, localModel, iml
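A minimal usage sketch with the Python lime library follows; the data, model, and parameter values are illustrative, not those of the course examples.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + X[:, 1] * X[:, 2] > 0).astype(int)
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X,
    feature_names=["x1", "x2", "x3", "x4"],
    class_names=["class 0", "class 1"],
    mode="classification",
)
# Explain one instance with at most K = 3 interpretable features
exp = explainer.explain_instance(X[0], model.predict_proba, num_features=3)
print(exp.as_list())  # (feature condition, local weight) pairs
```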

114
115
, 4.3 Shapley values

Definition (Shapley Values)


Shapley values indicate how to fairly distribute a payout among several players.

The Shapley value is a concept in cooperative game theory.

It was named in honor of Lloyd Shapley, who introduced it in 1953 and won the
Nobel Prize in Economics for it in 2012.

Shapley, Lloyd S. (1953). A value for n-person games. Contributions to the Theory of
Games 2.28, pp 307-317.

116
, 4.3 Shapley values

Intuition of Shapley Values


Three friends – Pierre, Eve, and Aminata – go out for dinner. They order and share fries,
wine, and pies. It is hard to figure out how much each friend should pay since they did
not eat an equal share. We have the following information:

If Pierre was eating alone, he would pay 25


If Eve was eating alone, she would pay 16
If Aminata was eating alone, she would pay 19
If Pierre and Eve were eating together, they would pay 50
If Pierre and Aminata were eating together, they would pay 56
If Eve and Aminata were eating together, they would pay 42
If Pierre, Eve, and Aminata were all eating together, they would pay 73

117
, 4.3 Shapley values

Intuition of Shapley Values

1 We take all permutations of the 3 participants in sequence.

2 We compute the incremental payout every time a new guest arrives.

Example
Consider the sequence (Pierre, Eve, Aminata). As described above, Pierre comes and
pays 25. Now, Pierre and Eve pay 50 so there is an additional payout for Eve of 25.
Finally, all three eat together and pay 73 so the additional payout for Aminata is 23.

118
Intuition of Shapley Values
We repeat the same exercise for each possible order for the 3 friends and get the
following marginal payout values:

(Pierre, Eve, Aminata) – (25, 25, 23) – 1/6


(Eve, Pierre, Aminata) – (16, 34, 23) – 1/6
(Eve, Aminata, Pierre) – (16, 26, 31) – 1/6
(Aminata, Pierre, Eve) – (19, 37, 17) – 1/6
(Aminata, Eve, Pierre) – (19, 23, 31) – 1/6
(Pierre, Aminata, Eve) – (25, 31, 17) – 1/6

The Shapley value of Pierre corresponds to his average marginal payouts, i.e., (25 + 34
+ 31 + 37 + 31 + 25)/6 = 30.5.

120
, 4.3 Shapley values

Intuition of Shapley Values


What is the Shapley value for, respectively, Pierre, Eve, and Aminata? It is just the
average of the marginal payout for each friend:

For Pierre, the Shapley value is (25 + 34 + 31 + 37 + 31 + 25)/6 = 30.5.

For Eve, the Shapley value is (25 + 16 + 16 + 17 + 23 + 17)/6 = 19.

For Aminata, the Shapley value is (23 + 23 + 26 + 19 + 19 + 31)/6 = 23.5.

The total is 30.5 + 19 + 23.5 = 73.

This is the final amount that each of them should pay if they go out together.
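The dinner example can be checked with a few lines of code using the permutation definition of the Shapley value; the coalition costs below are those given in the slides.

```python
from itertools import permutations

# Cost paid by each coalition of friends (P = Pierre, E = Eve, A = Aminata)
v = {frozenset(): 0, frozenset("P"): 25, frozenset("E"): 16, frozenset("A"): 19,
     frozenset("PE"): 50, frozenset("PA"): 56, frozenset("EA"): 42, frozenset("PEA"): 73}

players = "PEA"
shapley = {p: 0.0 for p in players}
orders = list(permutations(players))
for order in orders:
    coalition = set()
    for p in order:
        # marginal payout of p when joining the current coalition
        shapley[p] += v[frozenset(coalition | {p})] - v[frozenset(coalition)]
        coalition.add(p)
shapley = {p: total / len(orders) for p, total in shapley.items()}
print(shapley)  # {'P': 30.5, 'E': 19.0, 'A': 23.5}
```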

121
, 4.3 Shapley values

Shapley Values for ML Interpretability: SHAP (SHapley Additive exPlanations)

The game is the prediction task.

The payoff corresponds to the predicted value for a single instance.

The players are the features that ”collaborate to play the game” (i.e., to predict a
value).

The goal is to explain the difference between the actual prediction and the average
prediction.

Lundberg, S.M. and Lee, S.-I. (2017), A Unified Approach to Interpreting Model Predictions,
31st Conference on Neural Information Processing Systems.

122
, 4.3 Shapley values

Shapley Values for ML Interpretability

Example
Consider a ML model used to predict house prices. For a certain house, with a garage, a
private pool, and an area of 50 yards, the estimated price is €510,000 whereas the
average prediction for all houses is €500,000.

Question: What is the contribution of each feature to the difference between the
estimated price of this house and the average estimated price?

123
, 4.3 Shapley values

Shapley Values for Machine Learning Interpretability


Remark 1: Shapley values correspond to the contribution of each feature towards
pushing the prediction away from the expected value.
Remark 2: The sum of all Shapley values is equal to the difference between the
individual prediction ŷi and the average value of the model.

124
, 4.3 Shapley values

Figure: Example of Shapley values for a classification model

125
, 4.3 Shapley values

Figure: Example of Shapley values for a regression model

126
, 4.3 Shapley values

Notations

S is a subset of the features used in the model.

x is the vector of feature values of the instance to be explained.

p the number of features.

127
, 4.3 Shapley values

Definition (Shapley value )


The Shapley value of the feature $x_j$ is the weighted average contribution of $x_j$ across all possible subsets S, where S does not include $x_j$:

$\phi_j\big(\hat{f}\big) = \sum_{S \subseteq \{x_1,\dots,x_p\} \setminus \{x_j\}} \frac{1}{p \times C_{p-1}^{|S|}} \Big[ \hat{f}\big(S \cup \{x_j\}\big) - \hat{f}(S) \Big]$

128
, 4.3 Shapley values

How to compute the weights

From p − 1 features, we can form $2^{p-1}$ subsets (coalitions) of size $|S| = 0, \dots, p-1$:

$2^{p-1} = \sum_{|S|=0}^{p-1} C_{p-1}^{|S|}$

with $C_{p-1}^{|S|}$ the number of coalitions of size $|S|$ among p − 1 features.

The weight associated to each coalition is the inverse of

$p \times C_{p-1}^{|S|} = \frac{p!}{|S|! \, (p-1-|S|)!}$

129
Example
Consider the case with three features X1 , X2 , and X3 , with X1 the feature of interest.

S          |S|   C_{p-1}^{|S|}   p × C_{p-1}^{|S|}   1/(p × C_{p-1}^{|S|})
{∅}        0     1               3                   1/3
{x2}       1     2               6                   1/6
{x3}       1     2               6                   1/6
{x2, x3}   2     1               3                   1/3

130
, 4.3 Shapley values

How to compute $\hat{f}(X^S)$

Consider the following model $\hat{f}(X) = \hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 X_2 + \hat\beta_3 X_3$.

Consider $X^S = X_1$:

$\hat{f}(X_1) = \hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2\,? + \hat\beta_3\,?$

1 Re-estimate the model: $\tilde{f}(X) = \tilde\beta_0 + \tilde\beta_1 X_1$

2 Do not re-estimate the model: $\mathbb{E}_{X_2,X_3}\big(\hat{f}(X)\big) = \hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 \mathbb{E}(X_2) + \hat\beta_3 \mathbb{E}(X_3)$

131
Additional notations

We denote by:

{X} the set of all the features.

{X} \ $X_j$ the set containing all the features except $X_j$.

$X_S$ the vector of features in subset S.

$X_{\bar{S}}$ the vector of features not included in subset S, such that {X} \ $X_j$ = $\{X_S, X_{\bar{S}}\}$.
132
, 4.3 Shapley values

Definition (Shapley value )


The Shapley value of the feature $x_j$ is the weighted average contribution of $x_j$ across all possible subsets S, where S does not include $x_j$:

$\phi_j\big(\hat{f}\big) = \sum_{S \subseteq \{x_1,\dots,x_p\} \setminus \{x_j\}} \frac{1}{p \times C_{p-1}^{|S|}} \Big[ \hat{f}\big(S \cup \{x_j\}\big) - \hat{f}(S) \Big]$

$\phi_j\big(\hat{f}\big) = \sum_{S \subseteq \{x_1,\dots,x_p\} \setminus \{x_j\}} \frac{1}{p \times C_{p-1}^{|S|}} \Big[ \mathbb{E}_{X_{\bar{S}}}\big(\hat{f}(x_j, x_S, X_{\bar{S}})\big) - \mathbb{E}_{X_j, X_{\bar{S}}}\big(\hat{f}(X_j, x_S, X_{\bar{S}})\big) \Big]$
S ⊆{x1 ,...,xp }\{xj }

133
Example
Consider the linear case with three features X1, X2, and X3, with X1 the feature of interest.

S          1/(p × C_{p-1}^{|S|})   E_{X_S̄}( f̂(x1, x_S, X_S̄) ) − E_{X1, X_S̄}( f̂(X1, x_S, X_S̄) )
{∅}        1/3                      β̂1 x1 + β̂2 E(X2) + β̂3 E(X3) − ( β̂1 E(X1) + β̂2 E(X2) + β̂3 E(X3) )
{x2}       1/6                      β̂1 x1 + β̂2 x2 + β̂3 E(X3) − ( β̂1 E(X1) + β̂2 x2 + β̂3 E(X3) )
{x3}       1/6                      β̂1 x1 + β̂2 E(X2) + β̂3 x3 − ( β̂1 E(X1) + β̂2 E(X2) + β̂3 x3 )
{x2, x3}   1/3                      β̂1 x1 + β̂2 x2 + β̂3 x3 − ( β̂1 E(X1) + β̂2 x2 + β̂3 x3 )

134
Example
Consider the linear case with three features X1, X2, and X3, with X1 the feature of interest.

S          1/(p × C_{p-1}^{|S|})   E_{X_S̄}( f̂(x1, x_S, X_S̄) ) − E_{X1, X_S̄}( f̂(X1, x_S, X_S̄) )
{∅}        1/3                      β̂1 (x1 − E(X1))
{x2}       1/6                      β̂1 (x1 − E(X1))
{x3}       1/6                      β̂1 (x1 − E(X1))
{x2, x3}   1/3                      β̂1 (x1 − E(X1))

Hence,
φ1( f̂ ) = (1/3) β̂1 (x1 − E(X1)) + (1/6) β̂1 (x1 − E(X1)) + (1/6) β̂1 (x1 − E(X1)) + (1/3) β̂1 (x1 − E(X1))
φ1( f̂ ) = β̂1 (x1 − E(X1))

135
Example
More generally,

$\phi_j\big(\hat{f}\big) = \hat\beta_j (x_j - \mathbb{E}(X_j)) = \hat\beta_j x_j - \hat\beta_j \mathbb{E}(X_j)$

Thus,

$\sum_{j=1}^{3} \phi_j\big(\hat{f}\big) = \sum_{j=1}^{3} \hat\beta_j x_j - \sum_{j=1}^{3} \hat\beta_j \mathbb{E}(X_j) = \hat{y} - \mathbb{E}(\hat{y}) = \hat{f}(x) - \mathbb{E}\big(\hat{f}(x)\big)$

136
For any function $\hat{f}(x)$ with three features, we have:

$\phi_1\big(\hat{f}\big) = \tfrac{1}{3}\big[\mathbb{E}\hat{f}(x_1, X_2, X_3) - \mathbb{E}\hat{f}(X_1, X_2, X_3)\big] + \tfrac{1}{6}\big[\mathbb{E}\hat{f}(x_1, x_2, X_3) - \mathbb{E}\hat{f}(X_1, x_2, X_3)\big] + \tfrac{1}{6}\big[\mathbb{E}\hat{f}(x_1, X_2, x_3) - \mathbb{E}\hat{f}(X_1, X_2, x_3)\big] + \tfrac{1}{3}\big[\mathbb{E}\hat{f}(x_1, x_2, x_3) - \mathbb{E}\hat{f}(X_1, x_2, x_3)\big]$

$\phi_1\big(\hat{f}\big) = \tfrac{1}{3}\big[\hat{f}(x_1, x_2, x_3) - \mathbb{E}\hat{f}(X_1, X_2, X_3)\big] + W_1(x)$

with $W_1(x) = \tfrac{1}{6}\big[\mathbb{E}\hat{f}(x_1, x_2, X_3) - \mathbb{E}\hat{f}(X_1, x_2, X_3)\big] + \tfrac{1}{6}\big[\mathbb{E}\hat{f}(x_1, X_2, x_3) - \mathbb{E}\hat{f}(X_1, X_2, x_3)\big] + \tfrac{1}{3}\big[\mathbb{E}\hat{f}(x_1, X_2, X_3) - \mathbb{E}\hat{f}(X_1, x_2, x_3)\big]$

137
More generally:

$\phi_j\big(\hat{f}\big) = \tfrac{1}{3}\big[\hat{f}(x_1, x_2, x_3) - \mathbb{E}\hat{f}(X_1, X_2, X_3)\big] + W_j(x)$

Thus,

$\sum_{j=1}^{3} \phi_j\big(\hat{f}\big) = \sum_{j=1}^{3} \tfrac{1}{3}\big[\hat{f}(x_1, x_2, x_3) - \mathbb{E}\hat{f}(X_1, X_2, X_3)\big] + \sum_{j=1}^{3} W_j(x) = \hat{f}(x_1, x_2, x_3) - \mathbb{E}\hat{f}(X_1, X_2, X_3)$

because $\sum_{j=1}^{3} W_j(x) = 0$.

138
$W_1(x) = \tfrac{1}{6}\big[\mathbb{E}\hat{f}(x_1, x_2, X_3) - \mathbb{E}\hat{f}(X_1, x_2, X_3)\big] + \tfrac{1}{6}\big[\mathbb{E}\hat{f}(x_1, X_2, x_3) - \mathbb{E}\hat{f}(X_1, X_2, x_3)\big] + \tfrac{1}{3}\big[\mathbb{E}\hat{f}(x_1, X_2, X_3) - \mathbb{E}\hat{f}(X_1, x_2, x_3)\big]$

$W_2(x) = \tfrac{1}{6}\big[\mathbb{E}\hat{f}(x_1, x_2, X_3) - \mathbb{E}\hat{f}(x_1, X_2, X_3)\big] + \tfrac{1}{6}\big[\mathbb{E}\hat{f}(X_1, x_2, x_3) - \mathbb{E}\hat{f}(X_1, X_2, x_3)\big] + \tfrac{1}{3}\big[\mathbb{E}\hat{f}(X_1, x_2, X_3) - \mathbb{E}\hat{f}(x_1, X_2, x_3)\big]$

$W_3(x) = \tfrac{1}{6}\big[\mathbb{E}\hat{f}(x_1, X_2, x_3) - \mathbb{E}\hat{f}(x_1, X_2, X_3)\big] + \tfrac{1}{6}\big[\mathbb{E}\hat{f}(X_1, x_2, x_3) - \mathbb{E}\hat{f}(X_1, x_2, X_3)\big] + \tfrac{1}{3}\big[\mathbb{E}\hat{f}(X_1, X_2, x_3) - \mathbb{E}\hat{f}(x_1, x_2, X_3)\big]$

139
, 4.3 Shapley values

Properties
Efficiency: The feature contributions must add up to the difference between the prediction for x and the average prediction:

$\sum_{j=1}^{p} \phi_j\big(\hat{f}\big) = \hat{f}(x) - \mathbb{E}\big(\hat{f}(x)\big)$

Dummy: A feature j that does not change the predicted value – regardless of which subset of feature values it is added to – should have a Shapley value of 0:

$\hat{f}\big(S \cup \{x_j\}\big) = \hat{f}(S) \iff \phi_j\big(\hat{f}\big) = 0$

140
, 4.3 SHAP

Figure: SHAP force plot

Source: Molnar (2023)

141
, 4.3 SHAP

Definition (SHAP feature importance)


The SHAP feature importance is defined as:

$I_j = \sum_{i=1}^{n} \big|\phi_j^{(i)}\big|$

Note: The idea behind SHAP feature importance is simple: Features with large absolute
Shapley values are important.

142
, 4.3 SHAP

Figure: SHAP feature importance

Source: Molnar (2023)

143
, 4.3 SHAP

Definition (SHAP summary plot)


The SHAP summary plot combines feature importance with feature effects.
Each point on the summary plot is a Shapley value for a feature and an instance.
Position on the y-axis is determined by the feature and the x-axis by the Shapley value.
The color represents the value of the feature from low to high.

144
, 4.3 SHAP

Figure: SHAP summary plot

Source: Molnar (2023)

145
, 4.3 Shapley values

Advantage
1 Solid theory. The Shapley value is the only explanatory method with a solid theory
with axioms.

Limit
1 Computing time: In many real-world applications, only an approximate solution is
feasible. An exact computation of the Shapley value is computationally expensive
because there are $2^k$ possible subsets of the features.

146
, 4.3 SHAP

Implementation
For the implementation, we can cite:

1 Python: SHAP package, scikit learn (for tree).

2 SHAP is integrated into the tree boosting frameworks xgboost and LightGBM.

3 R: package shapper and fastshap packages.
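A minimal sketch with the shap package is shown below for a tree-based regression model; the data and model are synthetic and purely illustrative.

```python
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["x1", "x2", "x3"])
y = X["x1"] ** 2 + X["x2"] * X["x3"] + rng.normal(scale=0.1, size=500)
model = RandomForestRegressor(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)     # fast, exact Shapley values for trees
shap_values = explainer.shap_values(X)    # one value per feature and per instance

shap.summary_plot(shap_values, X)         # global view: importance + effects
# Local explanation (force plot) of the first instance
shap.force_plot(explainer.expected_value, shap_values[0, :], X.iloc[0, :],
                matplotlib=True)
```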

147
148
Figure: AUC = 0.78

What is the contribution of each feature to the AUC?

149
XPER allows us to interpret the predictive or economic performance of any
econometric or ML model (model-agnostic).

This method is based on:


▶ A Shapley Value decomposition (Shapley, 1953).
▶ A Performance Metric (PM).
▶ Predictions ŷ of a regression or classification model f (.).

150
, An intuitive primer on XPER

AUC ϕ0 ϕ1 ϕ2 ϕ3

Test sample 0.78 = 0.50 + 0.14 + 0.10 + 0.04

with ϕ0 a benchmark value, and ϕj the XPER contribution of feature xj to the AUC of
the model.

151
, An intuitive primer on XPER

Such performance decomposition can prove handy in several contexts

1 XPER can help to rationalize a potential heterogeneity in the predictive performance


of a model
AUC ϕ0 ϕ1 ϕ2 ϕ3

Subset A 0.65 = 0.50 + 0.01 + 0.10 + 0.04


Subset B 0.85 = 0.50 + 0.21 + 0.10 + 0.04

152
, An intuitive primer on XPER

Such performance decomposition can prove handy in several contexts

2 XPER can help to understand the origin of overfitting.


AUC ϕ0 ϕ1 ϕ2 ϕ3

Training 0.90 = 0.5 + 0.20 + 0.15 + 0.05


Test 0.78 = 0.5 + 0.08 + 0.15 + 0.05
Training - Test 0.12 = 0 + 0.12 + 0 + 0

153
, An intuitive primer on XPER

Such performance decomposition can prove handy in several contexts

3 XPER can be applied to any statistical performance metrics, but also to any
economic performance metrics.
n
P&L = ∑ (1 − ŷi )(1 − yi ) × profit + (1 − ŷi )yi × loss
i =1

where profit is the money made on any reimbursed loan (yi = 0) and loss is the
money lost on any defaulted loan (yi = 1). The P&L can be broken down as follows:

P&L ϕ0 ϕ1 ϕ2 ϕ3

$10,000 = $2,000 + $1,000 + $5,000 + $2,000

154
Framework and Performance Metrics

155
We consider a classification or a regression problem for which:

A dependent (target) variable denoted y takes values in Y . In case of classification


Y = {0, 1}, and in case of regression Y ⊂ R.

A q-vector x ∈ X refers to input (explanatory) features, with X ⊂ Rq .

We denote by f : x → ŷ a model, where ŷ ∈ Y is either a classification or regression


output, such as ŷ = f (x).

156
The econometric or machine learning model may be parametric or not, linear or not, an individual or an ensemble classifier, etc.

The model is estimated (parametric model) or trained (machine learning algorithm) once and for all on an estimation or training sample $\{x_j, y_j\}_{j=1}^{T}$.

The statistical performance of the model is evaluated on a test sample $S_n = \{x_i, y_i, \hat{f}(x_i)\}_{i=1}^{n}$ for n individuals.

157
Definition
A sample performance metric PMn ∈ Θ ⊆ R associated to the model fˆ(.) and a test
sample Sn is a scalar defined as:

PMn = G̃n (y1 , ..., yn ; fˆ(x1 ), ..., fˆ(xn )) = Gn (y; X),

where y = (y1 , ..., yn )T and X = (x1 , .., xn )T .

Examples:

Regression model: MSE, MAE, R-squared.


Classification model: AUC, Accuracy, Sensitivity, Specificity, Brier Score.
Economic criteria: Profit or cost function.

158
Assumption 1
The sample performance metric satisfies an additive property such that:

$G_n(\mathbf{y}; \mathbf{X}) = \frac{1}{n} \sum_{i=1}^{n} G(y_i; x_i; \hat\delta_n),$

where $G(y_i; x_i; \delta_n)$ denotes an individual contribution to the performance metric and $\hat\delta_n$ is a nuisance parameter which depends on the test sample $S_n$.

Example (Mean Squared Error (MSE))

$G_n(\mathbf{y}; \mathbf{X}) = \frac{1}{n} \sum_{i=1}^{n} G(y_i; x_i) = \frac{1}{n} \sum_{i=1}^{n} \big(y_i - \hat{f}(x_i)\big)^2,$

Assumption 2
The sample performance metric Gn (y; X; δ̂n ) converges to the population performance
metric Ey ,x (G (y ; x; δ0 )), where Ey ,x (.) refers to the expected value with respect to the
joint distribution of y and x, and δ0 = plim δ̂n .

159
Theoretical Decomposition

160
, Intuition
, Definition of XPER

Definition (XPER value)


The contribution of feature $x_j$ to the performance metric is:

$\phi_j = \sum_{S \subseteq P(\{x\} \setminus \{x_j\})} w_S \Big[ \mathbb{E}_{x_{\bar{S}}}\big(\mathbb{E}_{y, x_j, x_S}(G(y; x; \delta_0))\big) - \mathbb{E}_{x_j, x_{\bar{S}}}\big(\mathbb{E}_{y, x_S}(G(y; x; \delta_0))\big) \Big],$

with S a coalition, i.e., a subset of features excluding the feature of interest $x_j$, $|S|$ the number of features in the coalition, and $P(\{x\} \setminus \{x_j\})$ the partition of the set $\{x\} \setminus \{x_j\}$.

The XPER value ϕj associated to feature xj measures its weighted average marginal
contribution to the performance metric over all feature coalitions.

165
Axiom 1. (Efficiency)
The sum of the XPER values ϕj , ∀j = 1, ..., q satisfies:
$\underbrace{\mathbb{E}_{y,x}\big(G(y; x; \delta_0)\big)}_{\text{performance metric}} = \underbrace{\phi_0}_{\text{benchmark}} + \sum_{j=1}^{q} \underbrace{\phi_j}_{\text{XPER value}}$

$\phi_0 = \mathbb{E}_x\big(\mathbb{E}_y(G(y; x; \delta_0))\big)$

with ϕ0 the performance metric associated to a population where the target variable is
independent from all features considered in the model.

166
Definition (Individual XPER)
The individual XPER value ϕi,j associated to individual i is defined as:
$\phi_{i,j}(y_i; x_i) = \sum_{S \subseteq P(\{x\} \setminus \{x_j\})} w_S \Big[ \mathbb{E}_{x_{\bar{S}}}\big(G(y_i; x_i; \delta_0)\big) - \mathbb{E}_{x_j, x_{\bar{S}}}\big(G(y_i; x_i; \delta_0)\big) \Big].$

For a given realisation $(y_i, x_i)$, the corresponding individual contribution to the performance metric can be broken down into:

$G(y_i; x_i; \delta_0) = \phi_{i,0} + \sum_{j=1}^{q} \phi_{i,j},$

where $\phi_{i,j}$ is the realisation of $\phi_{i,j}(y_i; x_i)$ and $\phi_{i,0}$ is the realisation of $\phi_{i,0}(y_i) = \mathbb{E}_x\big(G(y_i; x; \delta_0)\big)$.

167
Empirical Application

168
, Database

Database of auto loans provided by an international bank:

Target variable yi :
▶ 1: Default
▶ 0: No default

7,440 consumer loans

10 features:
▶ 2 categorical features
▶ 8 continuous features

169
, Model

Table: XGBoost Performances

Sample Size (%) AUC Brier Score Accuracy BA Sensitivity Specificity


Training 70 0.8969 0.0958 86.98 72.43 48.18 96.69
Test 30 0.7521 0.1433 79.53 58.69 23.99 93.39

170
, XPER decomposition

Figure: XPER decomposition. Contribution (%) of each feature: Funding amount, Job tenure, Car price, Age, Loan duration, Owner, Married, Credit event, Monthly payment, Down payment.

171
, XPER decomposition

Figure: XPER decomposition and Permutation Importance (PI). Contribution (%) of the same ten features: (a) XPER values; (b) PI-based feature contributions.

172
, Permutation Importance

Definition (Permutation Importance)


The permutation importance corresponds to the average decrease of the model performance when the values of a single feature are randomly shuffled (the j-th column of X is replaced by a random permutation of its values, keeping the other columns fixed):

$\Delta PM_j = PM - \frac{1}{S} \sum_{s=1}^{S} PM(X_{j,s})$

with $X_{j,s}$ the s-th reshuffled vector of the values of $X_j$.
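A minimal sketch with scikit-learn's built-in permutation importance follows; the data, model, AUC scoring, and number of repeats are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + X[:, 1] * X[:, 2] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Average decrease in AUC when each feature is shuffled, over 10 repeats
result = permutation_importance(model, X_test, y_test,
                                scoring="roc_auc", n_repeats=10, random_state=0)
for j in result.importances_mean.argsort()[::-1]:
    print(f"feature {j}: {result.importances_mean[j]:.4f} "
          f"+/- {result.importances_std[j]:.4f}")
```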

173
, XPER decomposition

Figure: XPER vs. SHAP

Features shown: Funding amount, Job tenure, Vehicle price, Customer’s age, Estimated funding duration, Owner, Married, Default past 6 months, Monthly payment in pct of income, Downpayment. Two bars per feature: contribution (%) to the AUC (XPER) and SHAP-based contribution.

174
, Using XPER to boost model performance

Figure: Two-step procedure using XPER

175
, Using XPER to boost model performance

Table: Model performances

                     (1) Initial   (2) Clusters on XPER values   (3) Clusters on features
AUC                  0.752         0.912                         0.744
Brier score          0.143         0.080                         0.151
Accuracy             79.53         89.11                         79.53
Balanced Accuracy    58.69         79.74                         59.11
Sensitivity          23.99         64.13                         25.11
Specificity          93.39         95.35                         93.11

176
, Conclusion

We introduce a methodology designed to measure the feature contributions to


the performance of any regression or classification model.

Our methodology is theoretically grounded on Shapley values and is both


model-agnostic and performance metric-agnostic.

In a loan default forecasting application, XPER appears to be able to significantly


boost out-of-sample performance.

177
, XPER python package

Figure: Github link

178
7
Interpretability and Algorithmic Fairness

6. Disagreement in Explainable Machine


Learning
The Disagreement Problem in Explainable Machine Learning:
A Practitioner’s Perspective

Satyapriya Krishna1, Tessa Han1, Alex Gu2, Javin Pombra1,


Shahin Jabbari3, Zhiwei Steven Wu4, and Himabindu Lakkaraju1

1 Harvard University
2 Massachusetts Institute of Technology
3 Drexel University
4 Carnegie Mellon University

February 9, 2022

Interpretable Machine Learning in Healthcare in ICML 2022

179
7
Interpretability and Algorithmic Fairness

7. Fairness
7
Interpretability and Algorithmic Fairness

7.1 Definition and Measures


Unlike humans, algorithms are neutral

• Humans exhibit behavioral biases/taste‐based discrimination and are


affected by cultural norms, stereotypes, homophily, etc.

Source: D'Acunto, Ghosh, Jain and Rossi (2021)

• Higher rates charged to Black and Hispanic borrowers (Bartlett et al.,


2021), for immigrants (Dobbie et al., forth.), for women (Alesina, 2013)
185
Algorithms can exacerbate differences
in rates and access to credit

Fuster, Goldsmith‐Pinkham, Ramadorai, and Walther (2021)

Using detailed administrative data on US mortgages, they find that ML:

• increases rate disparity across groups of borrowers

• benefits more White and Asian borrowers than Black and Hispanic
borrowers

• Channels: (1) ML better captures the structural relationship between observable characteristics and default; (2) ML is able to triangulate the (hidden) identity of borrowers.

186
Neutral?

Articles: FR, EN

Algorithmic Fairness – HEC Paris


Neutral?

188
Fraud detection

189
Many risks:
Model, Reputation, Regulatory, Legal

190
What is an “unfair” algorithm?

• An algorithm that places a group of individuals who share a protected


attribute (PA) at a systematic disadvantage.
e.g. gender, age, residence, ethnic origin, skin color, religion

• Disadvantage in terms of rejection rate, price, etc.

• More likely to happen with highly flexible, non‐linear, opaque models

• An algorithm can be unfair even when the PA is not used. Can be


completely unintentional

• But can arise with both interpretable and non‐interpretable models

191
Statistical triangulation

Figure: features such as income < 40k€, job contract = fixed term, and sector can jointly proxy for Gender.

192
Interpretability

Being interpretable does not imply being fair

Figure: interpretability relates the Variables to the Decision; fairness relates the Decision to Gender.

193
Why is algorithmic fairness important?

• Protect minorities and prevent discrimination

• Consumer protection

• Part of model risk

• Potentially problematic from an ethical and reputation point of


view

• Being unfair can be illegal: under U.S. fair‐lending law, lenders can treat minority borrowers differently only on the basis of creditworthiness

194
Reasons for an algorithm not to be fair

• Algorithm trained using past decisions made by a human (credit


officer, judge, HR director) who may be racist, misogynistic, etc

• Algorithms learning from actual outcomes (past defaults) but using


variables that proxy for membership in a protected group

195
Growing concern in the academia

196
Now a key component of sustainable AI

Sustainable AI must:

• be robust

• be transparent, or at least interpretable

• respect users’ data privacy

• be fair

197
Application to a credit database

Dataset: German Credit dataset

1,000 borrowers with 20 attributes, including gender


690 men and 310 women
For each borrower, we know if there is a default: 300 defaults in total

198
199
Forecasting default with simple
and ML models

200
Many fairness definitions (# citations in 2019)

201
An example: Statistical parity

% acceptance of loan applications for men


=
% acceptance of loan applications for women

men: 40% women: 40%

202
An example: Statistical parity

% acceptance of loan applications for men


=
% acceptance of loan applications for women

men: 40% women: 25%

203
An example: Statistical parity

% acceptance of loan applications for men


=
% acceptance of loan applications for women

men: 40% women: 25% < 32% (= 4/5 × 40%)

« 4/5 rule »
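As a small sketch, the statistical-parity check and the 4/5 rule can be computed directly from the acceptance decisions and the protected attribute; the toy arrays below are illustrative.

```python
import numpy as np

y_hat = np.array([1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0])  # 1 = loan accepted
gender = np.array(["f"] * 8 + ["m"] * 7)

rate_f = y_hat[gender == "f"].mean()
rate_m = y_hat[gender == "m"].mean()
ratio = min(rate_f, rate_m) / max(rate_f, rate_m)
print(f"acceptance women: {rate_f:.0%}, men: {rate_m:.0%}, ratio: {ratio:.2f}")
print("4/5 rule satisfied:", ratio >= 0.8)
```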

204
Composition effect

Figure: distribution of women and men across risk classes (creditworthiness).
205
An example: Statistical parity

% acceptance of loan applications for men


=
% acceptance of loan applications for women

men: 40% women: 25%

% acceptance of loan applications for men [income > 100k€, job > 5Y, home owner]
=
% acceptance of loan applications for women [income > 100k€, job > 5Y, home owner]

men: 70% women: 66%

206
An example: Statistical parity

% acceptance of loan applications for men


=
% acceptance of loan applications for women

men: 40% women: 25%

% acceptance of loan applications for men [income > 100k€, job > 5Y, home owner]
=
% acceptance of loan applications for women [income > 100k€, job > 5Y, home owner]

men: 70% women: 62%

206
Set up

207
Set up

208
Statistical parity

209
Conditional statistical parity

Building groups

For credit scoring applications, groups gather applicants with similar risk profiles:
Determined through unsupervised clustering methods (K-Means) or using an
exogenous classification (Basel classification)

210
Number of sub‐groups

• The larger the number of sub‐groups:

‐ the more homogenous the sub‐groups are (cleaner test)

‐ the more likely at least one sub‐group is found to be unfair

‐ the smaller the number of individuals in each sub‐group

• Aggregation process:

‐ Economic criterion: Fairness in all sub‐groups, majority of


sub‐groups, majority of individuals, etc
‐ Statistical criterion

211
(Conditional) Statistical parity

212
Independence assumptions

213
Testing fairness using independence tests

214
Testing independence using Chi‐2 statistics

215
Testing Statistical Parity

216
Fairness traffic light

217
Numerical example

n+1f = 214 E(n+1f) = 310*771/1000

n+0f = 96 E(n+0f) = 310*229/1000

n+1m = 557 E(n+1m) = 690*771/1000

n+0m = 133 E(n+0m) = 690*229/1000

Chi2 =
(214 - 310*771/1000)^2/(310*771/1000) +
(96 - 310*229/1000)^2/(310*229/1000) +
(557 - 690*771/1000)^2/(690*771/1000) +
(133 - 690*229/1000)^2/(690*229/1000)
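The same statistic can be reproduced with scipy's chi-square test of independence on the 2x2 contingency table of decisions by gender (no continuity correction, to match the formula above).

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[214, 96],    # women: accepted, rejected
                  [557, 133]])  # men:   accepted, rejected
chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.4f}, dof = {dof}")
```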

218
Conditional classification

219
Testing conditional statistical parity

220
Generalization: Other metrics

221
Generalization: Other metrics (2)

222
Fairness tests: with‐models

223
Fairness tests: without‐models

224
Configuration 2

225
7
Interpretability and Algorithmic Fairness

7.2 Interpretability and Mitigation


Fairness interpretability

226
Introduction to Partial Dependence Plots (PDP)

Source: Molnar (2023)


227
Fairness Partial Dependence Plots (FPDP):
Theory

228
FPDP: Notation

229
FPDP: Definition

230
Categorical feature

231
Continuous feature

232
FPDP: Interpretation

233
FPDP: Plots
Variable 1

Variable 2

Variable 3

234
Interpretability: without, white‐box model

235
236
Mitigation

How to make an unfair algorithm less unfair:

• Remove each candidate variable and re‐estimate the model

• Set the value of each candidate variable to a level for which there is
no fairness problem

237
With or without re‐estimation

238
7
Interpretability and Algorithmic Fairness

7.3 Gender Effects and Fairness in Hiring


239
7
Interpretability and Algorithmic Fairness

7.4 Fairness and Quality AI


https://www.vde.com/resource/blob/2176686
/a24b13db01773747e6b7bba4ce20ea60/vd
e-spec-vcio-based-description-of-systems-
for-ai-trustworthiness-characterisation-
data.pdf

The VCIO* approach, therefore, fulfils


three tasks:
1) It clarifies what is meant by a particular
value (value definition).
2) It explains in a comprehensible
manner how to check or observe whether
or to what extent a technical system
fulfils or violates a value (measurement).
3) It acknowledges the existence of value
conflicts and explains how to deal with
these conflicts depending on the
application context (balancing).
* VCIO = Values Criteria Indicators
Observables

October 6, 2022: Memorandum of Cooperation (MoC) with Confiance.ia, a consortium of French industrial companies and research centers.

240
