Master DSB
October 24 – November 10, 2023
Course Description
(1) identify the variables that play the most important role in a given AI/machine-learning model or understand individual decisions.
These concepts and skills are important to master for data scientists, model validators, AI entrepreneurs, executives, regulators, policymakers, and consumer advocates.
Main Topics
• Definition of interpretability/explainability
• Natively interpretable models: OLS, Logistic, GAM, PLTR
• Global interpretability measures: Global surrogate, Partial Dependence Plot (PDP), Accumulated Local Effect (ALE)
• Local interpretability measures: ICE, LIME, SHapley Additive exPlanations (SHAP)
• Explaining performance: Permutation Importance, XPER
• Why an algorithm needs to be fair and definitions of fairness metrics
• Inference tests and fairness diagnosis
• Fairness interpretability
• Mitigation techniques
Learning Objectives
Course Structure
Morning: Lecture | Lecture | Lecture | Lecture | Student Presentations (GS2, GS4)
Afternoon: Lecture | Lecture | Student Presentations (GS1, GS3)
Recommended Readings
Hué, S., Hurlin, C., Pérignon, C., and Saurin, S. (2023), Measuring the Driving Forces of Predictive Performance: Application to Credit Scoring. https://ssrn.com/abstract=4280563
Hurlin, C., Pérignon, C., and Saurin, S. (2022), The Fairness of Credit Scoring Models. https://ssrn.com/abstract=3785882
Krishna, S., Han, T., Gu, A., Pombra, J., Jabbari, S., Wu, S., and Lakkaraju, H. (2022), The Disagreement Problem in Explainable Machine Learning: A Practitioner's Perspective. https://arxiv.org/abs/2202.01602
Artificial Intelligence: The Robots Are Now Hiring, Wall Street Journal. https://www.youtube.com/watch?v=8QEK7B9GUhM
Interpretability and Algorithmic Fairness
1. Introduction
"Life-changing algorithms"
• Credit
• Work
• Education
• Love
• Freedom
The use of AI by government agencies
Automated claim management for insurance
Why is AI so popular in these business and government applications?
This quest for performance leads companies to develop models that are increasingly complex, and less and less transparent and interpretable: "Algorithm Darwinism".
Interpretability
Important distinctions
We distinguish between:
• Models that are intrinsically interpretable (white box or glass box)
• Models whose structures do not permit easy interpretation (black box)
We also distinguish between:
• Understanding a model (global interpretability)
• Understanding an individual decision (local interpretability)
For instance, in this Artificial Neural Network, the variables with the strongest impact on the estimated house price are the lot size and the presence of a swimming pool.
The ML Journey
Problem, Question → Data (collection, cleaning) → Modeling (list of models, variable selection, hyperparameters, training set) → Validation (test set, stability, interpretability, fairness, performance) → Production (value creation)
Hot debate about interpretability
Source: https://www.youtube.com/watch?v=GtCFprO5p7k
When we need to interpret (and when we don't)
ML models produce predictions, but they do not explain their predictions to users.
"The algorithm rejected you. Sorry, I do not have any additional information."
is not going to be an acceptable answer…
Research on interpretability
[Figure: total number of accepted papers, rising from about 400 to about 2,000]
Quality AI
October 6, 2022: Memorandum of Cooperation (MoC) between VDE and Confiance.ai, a consortium of French industrial companies and research centers.
* VCIO = Values Criteria Indicators Observables
Whom to explain?
2. Interpreting White Box
2.1 OLS
Y = β_0 + β_1 X_1 + β_2 X_2 + ... + β_k X_k + ε
∂Y/∂X_j = β_j
2.1 OLS
Example
"The model predicts that the price of the house of interest (with the feature values in the second column of Table 9.1) is $232,349. The value of an 'average house' with the feature values in the third column of Table 9.1 is $180,817. This house is therefore worth $51,532 more than the average house."
2.2 Logistic Regression
2.3 Decision Tree
2.4 Generalized Additive Models (GAM)
Hastie, T. and Tibshirani, R. (1986), Generalized Additive Models. Statistical Science, Vol. 1, No. 3, 297-318.
2.4 Generalized Additive Models (GAM)
g(µ(X)) = α + f_1(X_1) + ... + f_p(X_p)
where f_1(X_1), ..., f_p(X_p) correspond to (smooth) non-linear functions of the features.
g(µ(X)) = µ(X) for linear and additive models for Gaussian response data.
g(µ(X)) = logit(µ) or g(µ(X)) = probit(µ) for modeling binomial probabilities.
2.4 Generalized Additive Models (GAM)
In the binary classification setting, the additive logistic regression has the form
log[ µ(X) / (1 − µ(X)) ] = α + f_1(X_1) + ... + f_p(X_p)
2.4 Generalized Additive Models (GAM)
Regression
2.4 Generalized Additive Models (GAM)
Advantages
1 Automatically model non-linear relationships that standard linear regression would miss.
2 The non-linear fits can potentially make more accurate predictions of the target.
Limits
1 Interactions can be missed because the model is restricted to be additive. However, as with linear regression, we can manually add interaction terms to the GAM. By doing so, the GAM becomes subject to overfitting issues.
2.4 Generalized Additive Models (GAM)
Implementation
For the implementation, we can cite, among others, the R package gam.
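A minimal sketch of fitting a GAM in Python with the pygam package (the package choice and the toy data are assumptions, not part of the slide):

```python
# Fit an additive logistic model: logit(mu) = alpha + f1(X1) + f2(X2) + f3(X3),
# with one smooth term s(j) per feature.
import numpy as np
from pygam import LogisticGAM, s

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 3))                        # toy features (assumption)
y = (X[:, 0] + np.sin(3 * X[:, 1]) > 1).astype(int)   # toy binary target (assumption)

gam = LogisticGAM(s(0) + s(1) + s(2)).fit(X, y)

# Each fitted smooth f_j can be evaluated on a grid, which is what makes the GAM
# natively interpretable: the effect of each feature is a one-dimensional curve.
for j in range(3):
    grid = gam.generate_X_grid(term=j)
    f_j = gam.partial_dependence(term=j, X=grid)
    print(f"f_{j}: min={f_j.min():.2f}, max={f_j.max():.2f}")
```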
2.5 Penalised Logistic Tree Regression (PLTR)
Remark: This method is not restricted to credit scoring applications and can be used for other classification and regression problems (after some small adjustments).
Dumitrescu, E., Hué, S., Hurlin, C. and Tokpavi, S. (2022), Machine learning for credit scoring: Improving logistic regression with non-linear decision-tree effects. European Journal of Operational Research, Vol. 297, Issue 3, 1178-1192.
2.5 Penalised Logistic Tree Regression
First Step
Source: Machine learning for credit scoring: Improving logistic regression with non-linear decision-tree effects (2022)
2.5 Penalised Logistic Tree Regression
Second Step
In the second step, the endogenous univariate and bivariate threshold effects previously obtained are plugged into the logistic regression
P(y_i = 1 | V_{i,1}^{(j)}, V_{i,2}^{(j,k)}; Θ) = 1 / [ 1 + exp(−η(V_{i,1}^{(j)}, V_{i,2}^{(j,k)}; Θ)) ]
with η(V_{i,1}^{(j)}, V_{i,2}^{(j,k)}; Θ) = β_0 + Σ_{j=1}^p α_j x_j + Σ_{j=1}^p β_j V_{i,1}^{(j)} + Σ_{j=1}^{p−1} Σ_{k=j+1}^p γ_{j,k} V_{i,2}^{(j,k)}.
The associated log-likelihood is
L(V_{i,1}^{(j)}, V_{i,2}^{(j,k)}; Θ) = (1/N) Σ_{i=1}^N [ y_i log(F(η(V_{i,1}^{(j)}, V_{i,2}^{(j,k)}; Θ))) + (1 − y_i) log(1 − F(η(V_{i,1}^{(j)}, V_{i,2}^{(j,k)}; Θ))) ].
2.5 Penalised Logistic Tree Regression
Second Step
Finally, the adaptive lasso estimators are obtained as
Θ̂_alasso(λ) = argmin_Θ { −L(V_{i,1}^{(j)}, V_{i,2}^{(j,k)}; Θ) + λ Σ_{v=1}^V w_v |θ_v| }.
2.5 Penalised Logistic Tree Regression
Source: Machine learning for credit scoring: Improving logistic regression with non-linear decision-tree effects (2022)
3. Interpreting Black-Box - Global
1 Global Surrogate
Consider a given opaque machine learning model that produces predictions ŷ. A global surrogate is an interpretable model trained to approximate these predictions.
Decision Tree used as a surrogate model: Bike rental example
3.1 Global Surrogate
Advantages
1 Simple and intuitive
Limits
1 Estimation risk: Discrepancy between the original prediction ŷ and the one estimated by the surrogate model (e.g., R-squared = 0.7).
2 Dangerous: Can give the illusion of interpretability.
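A minimal sketch of a global surrogate: a shallow decision tree is trained on the black box's own predictions, and the R-squared between the two measures the surrogate's fidelity (the black-box model and toy data are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                       # toy features (assumption)
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(size=1000)   # toy target (assumption)

black_box = RandomForestRegressor(random_state=0).fit(X, y)
y_hat = black_box.predict(X)                         # predictions to be mimicked

surrogate = DecisionTreeRegressor(max_depth=3).fit(X, y_hat)

# Low fidelity signals estimation risk and the "illusion of interpretability"
print("Surrogate fidelity (R-squared):", round(r2_score(y_hat, surrogate.predict(X)), 2))
```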
3.2 PDP
3.2 PDP
Notations
Denote by X_s the features for which the partial dependence function should be plotted and Ω_s the universe of its realizations.
Denote by X_c the other features used in the machine learning model f̂.
3.2 PDP
Remark: the PDP is different from the conditional expectation E_{X_c|X_s}( f̂(x_s, X_c) ), where the expectation is taken over the conditional distribution of X_c given X_s = x_s.
3.2 PDP
Y = f̂(X) + ε = β̂_0 + β̂_1 X_1 + ... + β̂_p X_p + ε
3.2 PDP
Pr(Y = 1 | X) = f̂(X) = Λ( β̂_0 + β̂_1 X_1 + β̂_2 X_2 )
3.2 PDP
p̂d_s(x_s) = (1/n) Σ_{i=1}^n f̂(x_s, x_c^{(i)})   ∀ x_s ∈ Ω_s
The value x_s is held fixed while the prediction is averaged over the observed values x_c^{(1)}, ..., x_c^{(n)} of the other features.
Remark 1: x_s can take any value, including some which are not in the dataset; e.g., for age, x_s can take any value between min(age) and max(age).
Remark 2: The PDP shows whether the relationship between the target and a feature is linear, monotonic or more complex.
3.2 PDP
3 For each x_s:
1 Replace all realisations of X_s by x_s
2 Compute f̂(x_s, x_c) for each instance
3 Average predictions across instances
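A from-scratch sketch of this algorithm (the model, data, and grid size are assumptions; scikit-learn's sklearn.inspection.partial_dependence implements the same logic):

```python
import numpy as np

def partial_dependence(model, X, s, grid_size=20):
    """PDP for feature s: average prediction when X_s is forced to each grid value."""
    grid = np.linspace(X[:, s].min(), X[:, s].max(), grid_size)
    pd_values = []
    for x_s in grid:                        # for each grid value x_s
        X_mod = X.copy()
        X_mod[:, s] = x_s                   # replace all realisations of X_s by x_s
        preds = model.predict(X_mod)        # compute f_hat(x_s, x_c) for each instance
        pd_values.append(preds.mean())      # average predictions across instances
    return grid, np.array(pd_values)
```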
3.2 PDP
Continuous feature
3.2 PDP
Categorical feature
Classification model
When the target variable Y is categorical, the PD function displays the average marginal effect of the feature s on the conditional probability Pr(Y = 1 | x_s, x_c) as
p̂d_s(x_s) = (1/n) Σ_{i=1}^n Pr(Y = 1 | x_s, x_c^{(i)})   ∀ x_s ∈ Ω_s
3.2 PDP
Advantages
1 Interpretation is clear: The PDP shows the marginal effect of a given feature X_s on the average prediction.
2 Easy to implement: The PDP does not require re-estimating the model.
3.2 PDP
Limits
1 Maximum number of features: the PDP analysis is limited to 2 features.
2 Heterogeneous effects across instances might be hidden because the PDP only shows the average marginal effects.
▶ Solution: Individual Conditional Expectation (ICE) curves.
Implementation
For the implementation, we can cite, among many others.
3.3 ALE
Apley, D.W. (2016), Visualizing the effects of predictor variables in black box supervised learning models. arXiv preprint arXiv:1612.08468.
3.3 ALE
Independence assumption
To calculate the feature effect of x_1 at a given value, say 0.75, the PDP replaces x_1 of all instances with 0.75.
The figure displays two correlated features and illustrates the fact that the PDP averages predictions over unlikely instances.
3.3 ALE
Conditional expectation
We could instead average over the conditional distribution of the feature: at a grid value of x_1, we average the predictions of instances with a similar x_1 value.
The solution for calculating feature effects using the conditional distribution is called Marginal Plots, or M-Plots.
M-Plots are obtained like the PDP but using the conditional distribution: M̂_s(x_s) = E_{X_c|X_s}( f̂(x_s, X_c) | X_s = x_s ).
3.3 ALE
Example
Consider a model which predicts the value of a house depending on the number of rooms and the size of the living area.
If we average the predictions of all houses of about 80 m², we estimate the combined effect of living area and number of rooms, because of their correlation.
♢ Suppose that the living area has no effect on the predicted value of a house; only the number of rooms has an effect.
♢ The M-Plot would still show that the size of the living area increases the predicted value, since the number of rooms increases with the living area.
3.3 ALE
2 For each data point in a given interval, we calculate the difference in predictions when replacing the feature value by, respectively, the upper and lower limit of the interval.
3.3 ALE
Example
For the calculation of the ALE for the living area, which is correlated with the number of rooms:
1 We divide the feature space of the living area into several intervals.
2 For each house between 79 and 81 m², we calculate the difference in estimated price when replacing the living area value by 81 and then by 79.
Definition (Accumulated Local Effect)
The uncentered ALE averages the changes in the predictions and accumulates them over an interval:
ALE_s(x_s) = ∫_{z_{0,s}}^{x_s} E_{X_c|X_s}( ∂f̂(X_s, X_c)/∂X_s | X_s = z_s ) dz_s   ∀ x_s ∈ Ω_s
where z_{0,s} is the minimum value of X_s for which the ALE curve is computed.
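A from-scratch sketch of the empirical (uncentered) ALE for one feature, following the interval logic above (the model, data, and quantile-based intervals are assumptions):

```python
import numpy as np

def ale_uncentered(model, X, s, n_intervals=10):
    """Uncentered ALE: accumulate average local prediction differences per interval."""
    # Quantile-based edges so each interval contains similar numbers of instances
    edges = np.quantile(X[:, s], np.linspace(0, 1, n_intervals + 1))
    local_effects = []
    for k in range(n_intervals):
        in_k = (X[:, s] >= edges[k]) & (X[:, s] <= edges[k + 1])
        if not in_k.any():
            local_effects.append(0.0)
            continue
        X_hi, X_lo = X[in_k].copy(), X[in_k].copy()
        X_hi[:, s], X_lo[:, s] = edges[k + 1], edges[k]
        # Prediction difference at the interval limits, computed only for instances
        # actually inside the interval: no unlikely data points are created
        local_effects.append((model.predict(X_hi) - model.predict(X_lo)).mean())
    return edges[1:], np.cumsum(local_effects)   # accumulate the local effects
```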
3.3 ALE
Figure: Example of centered ALE for numerical features and a regression model
Advantages
1 No independence assumption: ALE plots still work when features are correlated.
2 Faster to compute than the PDP: For the PDP, we need to compute n × k predictions and compute k averages. For the ALE, we need to compute 2 × n predictions and compute k averages.
3.3 ALE
Limits
1 No solution for setting the number of intervals
▶ If the number of intervals is too low, the ALE plot is inaccurate because it only partially accounts for dependence across features.
▶ If the number of intervals is too high, we end up with too few instances per interval. Empirical expectations do not converge towards the theoretical ones and ALE plots can become a bit shaky.
3.3 ALE
Implementation
For the implementation, we can cite, among many others.
4. Interpreting Black-Box - Local
4.1 ICE
Individual Conditional Expectation (ICE) plots display one curve per instance that shows how the instance's prediction changes when a feature changes.
Goldstein, A., et al. (2015), Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. Journal of Computational and Graphical Statistics, 24(1), 44-65.
4.1 ICE
Centered ICE
It can be hard to tell whether ICE curves differ across individuals because they start at different levels.
A simple solution is to center the curves and only display the difference with respect to the reference point.
4.1 ICE
Centered ICE
In general, the anchor point is defined as:
x_a = min_{x_s ∈ Ω_s} x_s
Thus, we have:
ICE^cent_{s,i}(x_s) = 0 if x_s = x_a
ICE^cent_{s,i}(x_s) = f̂(x_s, x_c^{(i)}) − f̂(x_a, x_c^{(i)}) if x_s > x_a
All the instances have the same (null) ICE for the value x_a, i.e.
ICE^cent_{s,i}(x_a) = 0, ∀ i = 1, ..., n
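A from-scratch sketch of centered ICE curves with the anchor point x_a = min(x_s), as defined above (the model, data, and grid are assumptions):

```python
import numpy as np

def centered_ice(model, X, s, grid_size=20):
    """One centered ICE curve per instance, anchored at x_a = min(x_s)."""
    grid = np.linspace(X[:, s].min(), X[:, s].max(), grid_size)
    curves = np.empty((X.shape[0], grid_size))
    for g, x_s in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, s] = x_s
        curves[:, g] = model.predict(X_mod)   # one prediction per instance (ICE)
    # Centering: subtract each instance's prediction at the anchor point grid[0],
    # so every curve is 0 at x_a and curves become comparable across instances
    return grid, curves - curves[:, [0]]
```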
4.1 ICE
Advantages
1 Intuitive: ICE curves are even more intuitive to understand than the PDP.
Limits
1 Maximum number of features: ICE curves can only display one feature at a time.
2 Assumption of independence: ICE curves suffer from the same problem as the PDP: if the feature of interest is correlated with the other features, then some points in the curves might be unlikely data points according to the joint feature distribution.
3 Readability: If many ICE curves are drawn, the plot can become overcrowded.
4.1 ICE
Implementation
For the implementation, we can cite, among many others.
4.2 LIME
Definition
LIME explains the prediction of any model by learning an interpretable model locally around the prediction.
Ribeiro, M.T., et al. (2016), "Why Should I Trust You?": Explaining the Predictions of Any Classifier. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
4.2 LIME
Intuition
4.2 LIME
▶ For tabular data, continuous variables are discretized. For categorical variables, categories can be combined.
4.2 LIME
Notations
▶ Denote by x′ ∈ {0, 1}^{d′} the binary vector of the interpretable representation (transformed features).
4.2 LIME
Fidelity-Interpretability Trade-off
Define an explanation as a model g : {0, 1}^{d′} → R, with g ∈ G, where G is a class of potentially interpretable models.
▶ e.g., linear models, decision trees.
4.2 LIME
Fidelity-Interpretability Trade-off
Denote by Ω(g) a measure of the complexity of the explanation g.
▶ e.g., depth of the tree for decision trees, number of non-zero weights for linear models.
4.2 LIME
Fidelity-Interpretability Trade-off
The explanation is obtained as ξ(x) = argmin_{g ∈ G} L(f, g, π_x) + Ω(g), where L measures how unfaithful g is to f in the locality defined by π_x.
Remark: This formulation can be used with different explanation families G, fidelity functions L, and complexity measures Ω.
4.2 LIME
Let G be the class of linear models, such that g(z′) = w_g · z′ = Σ_{j=1}^{d′} w_{g,j} z′_j.
The locally weighted square loss is L(f, g, π_x) = Σ_z π_x(z) ( f(z) − g(z′) )², where π_x(z) = exp(−D(x, z)²/σ²) is an exponential kernel defined on some distance function D.
4.2 LIME
Ω(g) = ∞ if ‖w_g‖_0 > K
Ω(g) = 0 otherwise
4.2 LIME
Advantages
1 Human-friendly explanation: The resulting explanations are short (i.e., selective) and possibly contrastive.
3 Fidelity measure: Gives a good idea of how reliable the interpretable model is in explaining the black-box predictions in the neighborhood of the data instance of interest.
Limits
1 Neighborhood definition: Results are sensitive to the kernel values.
4.2 LIME
Implementation
For the implementation, we can cite, among many others.
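A minimal sketch with the lime Python package, one common implementation (the fitted classifier `model`, training array `X_train`, and feature names `names` are assumptions):

```python
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train,
    feature_names=names,
    class_names=["no default", "default"],
    discretize_continuous=True,   # continuous features are discretized, as noted above
)

# num_features plays the role of K: it caps the number of non-zero weights
# of the local linear model g
exp = explainer.explain_instance(X_train[0], model.predict_proba, num_features=4)
print(exp.as_list())              # local weights w_g approximating f around x
```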
4.3 Shapley values
It was named in honor of Lloyd Shapley, who introduced it in 1953 and won the Nobel Prize in Economics in 2012.
Shapley, L.S. (1953), A value for n-person games. Contributions to the Theory of Games, 2(28), 307-317.
4.3 Shapley values
Example
Consider the sequence (Pierre, Eve, Aminata). As described above, Pierre comes first and pays 25. Then Pierre and Eve together pay 50, so the additional payout for Eve is 25. Finally, all three eat together and pay 73, so the additional payout for Aminata is 23.
Intuition of Shapley Values
We repeat the same exercise for each possible order of the 3 friends and get the following marginal payout values:
The Shapley value of Pierre corresponds to his average marginal payout, i.e., (25 + 34 + 31 + 37 + 31 + 25)/6 = 30.5.
4.3 Shapley values
This is the final amount that each of them should pay if they go out together.
4.3 Shapley values
The players are the features that "collaborate to play the game" (i.e., to predict a value).
The goal is to explain the difference between the actual prediction and the average prediction.
Lundberg, S.M. and Lee, S.-I. (2017), A Unified Approach to Interpreting Model Predictions. 31st Conference on Neural Information Processing Systems.
4.3 Shapley values
Example
Consider a ML model used to predict house prices. For a certain house, with a garage, a private pool and an area of 50 yards, the estimated price is €510,000, whereas the average prediction for all houses is €500,000.
Question: What is the contribution of each feature to the difference between the estimated price of this house and the average estimated price?
4.3 Shapley values
Notations
4.3 Shapley values
2^{p−1} = Σ_{|S|=0}^{p−1} C(p−1, |S|)
p × C(p−1, |S|) = p! / ( |S|! (p − 1 − |S|)! )
Example
Consider the case with three features X_1, X_2, and X_3, with X_1 the feature of interest.
S          | |S| | C(p−1,|S|) | p × C(p−1,|S|) | 1/(p × C(p−1,|S|))
{∅}        |  0  |     1      |       3        |        1/3
{x_2}      |  1  |     2      |       6        |        1/6
{x_3}      |  1  |     2      |       6        |        1/6
{x_2, x_3} |  2  |     1      |       3        |        1/3
4.3 Shapley values
Consider X_S = X_1:
f̂(X_1) = β̂_0 + β̂_1 X_1 + β̂_2 ? + β̂_3 ?
Additional notations
We denote by:
4.3 Shapley values
ϕ_j(f̂) = Σ_{S ⊆ {x_1,...,x_p}\{x_j}} [ 1/(p × C(p−1,|S|)) ] ( f̂(S ∪ x_j) − f̂(S) )
ϕ_j(f̂) = Σ_{S ⊆ {x_1,...,x_p}\{x_j}} [ 1/(p × C(p−1,|S|)) ] ( E_{X_S̄}[ f̂(x_j, x_S, X_S̄) ] − E_{X_j,X_S̄}[ f̂(X_j, x_S, X_S̄) ] )
where S̄ denotes the features outside S ∪ {x_j}.
133
Example
Consider the linear case with three features X1 , X2 , and X3 , with X1 the feature of
interest.
1
S E XS fˆ x1 , xS , XS − EX1 ,XS fˆ X1 , xS , XS
p ×C|pS−| 1
134
Example
Consider the linear case with three features X1 , X2 , and X3 , with X1 the feature of
interest.
1
S E XS fˆ x1 , xS , XS − EX1 ,XS fˆ X1 , xS , XS
p ×C|pS−| 1
Hence,
1 1 1 1
ϕ1 fb = β̂ 1 (x1 − E(X1 )) + β̂ 1 (x1 − E(X1 )) + β̂ 1 (x1 − E(X1 )) + β̂ 1 (x1 − E(X1 )
3 6 6 3
ϕ1 fb = β̂ 1 (x1 − E(X1 ))
135
Example
More generally,
ϕ_j(f̂) = β̂_j(x_j − E(X_j)) = β̂_j x_j − β̂_j E(X_j)
Thus,
Σ_{j=1}^3 ϕ_j(f̂) = Σ_{j=1}^3 β̂_j x_j − Σ_{j=1}^3 β̂_j E(X_j)
Σ_{j=1}^3 ϕ_j(f̂) = ŷ − E(ŷ)
Σ_{j=1}^3 ϕ_j(f̂) = f̂(x) − E(f̂(X))
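A numerical sanity check of the closed-form result ϕ_j(f̂) = β̂_j(x_j − E(X_j)) for a linear model with independent features, computing the exact Shapley values over all coalitions (the toy model and data are assumptions):

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))            # independent features (assumption)
beta = np.array([1.5, -2.0, 0.5])
f = lambda X: X @ beta                      # toy linear model, intercept omitted

x = np.array([1.0, 2.0, -1.0])              # instance to explain
p = len(x)

def value(S):
    """E[ f(X) | X_S = x_S ]: fix the features in coalition S at x, average out the rest."""
    X_mix = X.copy()
    X_mix[:, list(S)] = x[list(S)]
    return f(X_mix).mean()

phi = np.zeros(p)
for j in range(p):
    others = [k for k in range(p) if k != j]
    for r in range(p):
        for S in itertools.combinations(others, r):
            w = 1 / (p * math.comb(p - 1, len(S)))   # the weight 1/(p x C(p-1,|S|))
            phi[j] += w * (value(S + (j,)) - value(S))

print("Shapley values:        ", phi.round(3))
print("beta_j (x_j - E[X_j]): ", (beta * (x - X.mean(axis=0))).round(3))  # should match
```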
For any function f̂(x) with three features, we have:
ϕ_1(f̂) = (1/3)( E[f̂(x_1, X_2, X_3)] − E[f̂(X_1, X_2, X_3)] )
+ (1/6)( E[f̂(x_1, x_2, X_3)] − E[f̂(X_1, x_2, X_3)] )
+ (1/6)( E[f̂(x_1, X_2, x_3)] − E[f̂(X_1, X_2, x_3)] )
+ (1/3)( E[f̂(x_1, x_2, x_3)] − E[f̂(X_1, x_2, x_3)] )
Equivalently,
ϕ_1(f̂) = (1/3)( f̂(x_1, x_2, x_3) − E[f̂(X_1, X_2, X_3)] ) + W_1(x)
with W_1(x) = (1/6)( E[f̂(x_1, x_2, X_3)] − E[f̂(X_1, x_2, X_3)] ) + (1/6)( E[f̂(x_1, X_2, x_3)] − E[f̂(X_1, X_2, x_3)] ) + (1/3)( E[f̂(x_1, X_2, X_3)] − E[f̂(X_1, x_2, x_3)] )
More generally:
ϕ_j(f̂) = (1/3)( f̂(x_1, x_2, x_3) − E[f̂(X_1, X_2, X_3)] ) + W_j(x)
Thus,
Σ_{j=1}^3 ϕ_j(f̂) = Σ_{j=1}^3 (1/3)( f̂(x_1, x_2, x_3) − E[f̂(X_1, X_2, X_3)] ) + Σ_{j=1}^3 W_j(x)
Σ_{j=1}^3 ϕ_j(f̂) = f̂(x_1, x_2, x_3) − E[f̂(X_1, X_2, X_3)] + Σ_{j=1}^3 W_j(x)
Σ_{j=1}^3 ϕ_j(f̂) = f̂(x_1, x_2, x_3) − E[f̂(X_1, X_2, X_3)]
because Σ_{j=1}^3 W_j(x) = 0.
W_1(x) = (1/6)( E[f̂(x_1, x_2, X_3)] − E[f̂(X_1, x_2, X_3)] ) + (1/6)( E[f̂(x_1, X_2, x_3)] − E[f̂(X_1, X_2, x_3)] ) + (1/3)( E[f̂(x_1, X_2, X_3)] − E[f̂(X_1, x_2, x_3)] )
W_2(x) = (1/6)( E[f̂(x_1, x_2, X_3)] − E[f̂(x_1, X_2, X_3)] ) + (1/6)( E[f̂(X_1, x_2, x_3)] − E[f̂(X_1, X_2, x_3)] ) + (1/3)( E[f̂(X_1, x_2, X_3)] − E[f̂(x_1, X_2, x_3)] )
W_3(x) = (1/6)( E[f̂(x_1, X_2, x_3)] − E[f̂(x_1, X_2, X_3)] ) + (1/6)( E[f̂(X_1, x_2, x_3)] − E[f̂(X_1, x_2, X_3)] ) + (1/3)( E[f̂(X_1, X_2, x_3)] − E[f̂(x_1, x_2, X_3)] )
4.3 Shapley values
Properties
Efficiency: The feature contributions must add up to the difference between the prediction for x and the average prediction:
Σ_{j=1}^p ϕ_j(f̂) = f̂(x) − E[f̂(X)]
Dummy: A feature j that does not change the predicted value, regardless of which subset of feature values it is added to, should have a Shapley value of 0:
f̂(S ∪ x_j) = f̂(S) for all S ⟺ ϕ_j(f̂) = 0
4.3 SHAP
Note: The idea behind SHAP feature importance is simple: features with large absolute Shapley values are important.
4.3 Shapley values
Advantages
1 Solid theory: The Shapley value is the only explanation method with a solid axiomatic theory.
Limits
1 Computing time: In many real-world applications, only an approximate solution is feasible. An exact computation of the Shapley value is computationally expensive because there are 2^k possible subsets of the features.
4.3 SHAP
Implementation
For the implementation, we can cite:
1 The shap Python package.
2 SHAP is integrated into the tree-boosting frameworks xgboost and LightGBM.
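A minimal sketch with the shap package and a tree-based model, where TreeSHAP makes the otherwise exponential computation tractable (the model choice and the data `X`, `y` are assumptions):

```python
import shap
import xgboost

# Any tree-boosting model integrated with SHAP works here
model = xgboost.XGBClassifier(n_estimators=100).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # one phi_j per feature and per instance

# Global importance: features ranked by mean absolute Shapley value
shap.summary_plot(shap_values, X)
```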
Explaining Performance: XPER
Figure: AUC = 0.78
XPER allows us to interpret the predictive or economic performance of any econometric or ML model (it is model-agnostic).
An intuitive primer on XPER
AUC = ϕ_0 + ϕ_1 + ϕ_2 + ϕ_3
with ϕ_0 a benchmark value, and ϕ_j the XPER contribution of feature x_j to the AUC of the model.
An intuitive primer on XPER
3 XPER can be applied to any statistical performance metric, but also to any economic performance metric, such as
P&L = Σ_{i=1}^n [ (1 − ŷ_i)(1 − y_i) × profit + (1 − ŷ_i) y_i × loss ]
where profit is the money made on any reimbursed loan (y_i = 0) and loss is the money lost on any defaulted loan (y_i = 1). The P&L can be broken down as follows:
P&L = ϕ_0 + ϕ_1 + ϕ_2 + ϕ_3
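A minimal sketch of this economic performance metric (the per-loan profit and loss figures are assumptions; `y` holds true outcomes and `y_hat` the model's decisions, with 1 = default):

```python
import numpy as np

def pnl(y, y_hat, profit=100.0, loss=-1000.0):
    """P&L = profit on granted loans that are repaid plus loss on granted loans that default."""
    granted = (y_hat == 0)   # loans the model accepts
    return np.sum(granted * (1 - y) * profit) + np.sum(granted * y * loss)
```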
Framework and Performance Metrics
We consider a classification or a regression problem for which:
The econometric or machine learning model may be parametric or not, linear or not, an individual or an ensemble classifier, etc.
Definition
A sample performance metric PM_n ∈ Θ ⊆ R associated to the model f̂(·) and a test sample S_n is a scalar defined as:
Examples:
Assumption 1
The sample performance metric satisfies an additive property such that:
G_n(y; X) = (1/n) Σ_{i=1}^n G(y_i; x_i; δ̂_n),
where G(y_i; x_i; δ_n) denotes an individual contribution to the performance metric and δ̂_n is a nuisance parameter which depends on the test sample S_n.
Assumption 2
The sample performance metric G_n(y; X; δ̂_n) converges to the population performance metric E_{y,x}(G(y; x; δ_0)), where E_{y,x}(·) refers to the expected value with respect to the joint distribution of y and x, and δ_0 = plim δ̂_n.
Theoretical Decomposition
Intuition
Definition of XPER
The XPER value ϕ_j associated to feature x_j measures its weighted average marginal contribution to the performance metric over all feature coalitions.
Axiom 1 (Efficiency)
The sum of the XPER values ϕ_j, ∀ j = 1, ..., q satisfies:
E_{y,x}(G(y; x; δ_0)) = ϕ_0 + Σ_{j=1}^q ϕ_j
(performance metric = benchmark + sum of XPER values)
ϕ_0 = E_x( E_y( G(y; x; δ_0) ) )
with ϕ_0 the performance metric associated to a population where the target variable is independent of all the features considered in the model.
Definition (Individual XPER)
The individual XPER value ϕ_{i,j} associated to individual i is defined as:
ϕ_{i,j}(y_i; x_i) = Σ_{S ⊆ P({x}\{x_j})} w_S [ E_{x_S̄}( G(y_i; x_i; δ_0) ) − E_{x_j, x_S̄}( G(y_i; x_i; δ_0) ) ]
where ϕ_{i,j} is the realisation of ϕ_{i,j}(y_i; x_i) and ϕ_{i,0} is the realisation of ϕ_{i,0}(y_i) = E_x( G(y_i; x; δ_0) ).
Empirical Application
Database
Target variable y_i:
▶ 1: Default
▶ 0: No default
10 features:
▶ 2 categorical features
▶ 8 continuous features
Model
XPER decomposition
[Figure: XPER contribution (%) of each feature to the AUC: Funding amount, Job tenure, Car price, Age, Loan duration, Owner, Married, Credit event, Monthly payment, Down payment]
XPER decomposition
[Figure: two panels of feature contributions (%), including Age, Owner, Married, and Down payment]
Permutation Importance
ΔPM_j = PM − (1/S) Σ_{s=1}^S PM(X_{j,s})
with X_{j,s} the s-th reshuffled vector of the values of X_j.
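A from-scratch sketch of this formula for an AUC performance metric (the fitted `model`, data `X`, `y`, and the choice of AUC are assumptions; sklearn.inspection.permutation_importance offers a ready-made version):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def permutation_importance_auc(model, X, y, j, S=10, seed=0):
    """Delta PM_j: AUC drop when feature j is reshuffled, averaged over S draws."""
    rng = np.random.default_rng(seed)
    pm = roc_auc_score(y, model.predict_proba(X)[:, 1])   # baseline performance PM
    pm_perm = []
    for _ in range(S):
        X_perm = X.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])      # s-th reshuffled vector X_{j,s}
        pm_perm.append(roc_auc_score(y, model.predict_proba(X_perm)[:, 1]))
    return pm - np.mean(pm_perm)
```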
XPER decomposition
[Figure: feature contributions (%) to the AUC (XPER) versus SHAP: Funding amount, Job tenure, Vehicle price, Customer's age, Owner, Married, Down payment]
Using XPER to boost model performance
Conclusion
XPER python package
Interpretability and Algorithmic Fairness
The Disagreement Problem in Explainable Machine Learning: A Practitioner's Perspective (Krishna et al., February 9, 2022; Harvard University, Massachusetts Institute of Technology, Drexel University, Carnegie Mellon University)
Interpretability and Algorithmic Fairness
7. Fairness
• benefits more White and Asian borrowers than Black and Hispanic borrowers
Neutral?
Articles: FR, EN
Fraud detection
Many risks: Model, Reputation, Regulatory, Legal
What is an "unfair" algorithm?
Statistical triangulation
[Figure: relationship between business sector and gender]
Interpretability
[Diagram: Variables → Decision (interpretability); Decision ↔ Gender (fairness)]
Why is algorithmic fairness important?
• Consumer protection
Reasons for an algorithm not to be fair
Growing concern in academia
Now a key component of sustainable AI
Sustainable AI must:
• be robust
• be fair
Application to a credit database
Forecasting default with simple and ML models
Many fairness definitions (# citations in 2019)
An example: Statistical parity
The "4/5 rule"
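A minimal sketch of the statistical-parity check behind the 4/5 rule (the binary arrays `accepted` and `female`, indicating loan acceptance and the protected group, are assumptions):

```python
import numpy as np

def disparate_impact(accepted, protected):
    """Ratio of acceptance rates: protected group over reference group."""
    p_protected = accepted[protected == 1].mean()
    p_reference = accepted[protected == 0].mean()
    return p_protected / p_reference

di = disparate_impact(accepted, female)
# Under the 4/5 rule, a ratio below 0.8 signals potential adverse impact
print(f"Disparate impact ratio: {di:.2f} ({'OK' if di >= 0.8 else 'potential adverse impact'})")
```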
Composition effect
[Figure: distribution of creditworthiness across risk classes for women vs. men]
An example: Statistical parity
% acceptance of loan applications for men [income > 100k€, job > 5Y, home owner]
=
% acceptance of loan applications for women [income > 100k€, job > 5Y, home owner]
Set up
Statistical parity
Conditional statistical parity
Building groups
For credit scoring applications, groups gather applicants with similar risk profiles:
determined through unsupervised clustering methods (K-Means) or using an exogenous classification (Basel classification).
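A minimal sketch of the clustering route: building risk-profile groups with K-Means before testing parity within each group (the risk-feature matrix `X_risk` and the number of clusters are assumptions):

```python
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_risk)
group = kmeans.labels_   # one risk group per applicant; conditional statistical
                         # parity is then tested separately within each group
```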
Number of sub‐groups
• Aggregation process:
(Conditional) Statistical parity
Independence assumptions
Testing fairness using independence tests
Testing independence using Chi-2 statistics
Testing Statistical Parity
Fairness traffic light
Numerical example
Chi2 = (214 − 310×771/1000)² / (310×771/1000)
+ (96 − 310×229/1000)² / (310×229/1000)
+ (557 − 690×771/1000)² / (690×771/1000)
+ (133 − 690×229/1000)² / (690×229/1000)
≈ 16.56
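The same statistic computed with scipy from the 2×2 contingency table implied by the margins above (the row/column labeling as groups vs. decisions is an assumption):

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[214,  96],
                  [557, 133]])

# correction=False reproduces the uncorrected chi-square formula used above
chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
print(round(chi2, 2), round(p_value, 5))   # chi2 matches the hand computation
```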
Conditional classification
Testing conditional statistical parity
Generalization: Other metrics
Generalization: Other metrics (2)
Fairness tests: with‐models
Fairness tests: without‐models
Configuration 2
Interpretability and Algorithmic Fairness
Introduction to Partial Dependence Plots (PDP)
FPDP: Notation
FPDP: Definition
Categorical feature
Continuous feature
FPDP: Interpretation
FPDP: Plots
[Figure: FPDP curves for Variable 1, Variable 2, and Variable 3]
Interpretability: without, white-box model
Mitigation
• Set the value of each candidate variable to a level for which there is
no fairness problem
With or without re‐estimation
Interpretability and Algorithmic Fairness