What's Wrong with LIME | by Denis Vorotyntsev | Towards Data Science
https://towardsdatascience.com/whats-wrong-with-lime-86b335f34612
Homer Simpson is known as a simple person who does silly things from time to time. Here he decided
to cook bread, bacon, and eggs in the microwave. We find it funny to watch, but at the same time, we
tend to do similar things in data science. Don't be like Homer; use proper tools. This shot from The Simpsons
("Homer the Smithers", Season 7, Episode 17) is believed to qualify as fair use
Although it is one of the most popular approaches for explaining the predictions of any
classifier, LIME has been criticized several times in the research community: LIME
suffers from label and data shift, its explanations depend on the choice of
hyperparameters (yes, LIME has hyperparameters), and even similar points may receive
different explanations. The combination of these problems can produce wrong,
unexpected interpretations.
The idea behind LIME is to take the best of both worlds: train an
accurate black-box model, but base the model's explanations on a simple,
easy-to-understand model, such as linear or logistic regression.
That is, to explain a single point P, we train a local linear (or logistic)
regression model. The input data is perturbed training data (X), weighted by its distance to
point P (l1, l2, or another metric). The target is the black-box model's prediction for X.
Explanations are obtained by examining the coefficients of the surrogate model (more
on how LIME works can be found in the original paper or this excellent book).
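The procedure above can be sketched in a few lines of NumPy. This is a simplified illustration of the idea only, with plain Gaussian perturbations and a hypothetical `black_box` predict function; it is not the actual lime library implementation:

```python
import numpy as np

def lime_sketch(black_box, x_p, n_features, n_samples=500,
                kernel_width=0.75, seed=0):
    """LIME-style sketch: fit a weighted linear surrogate around point x_p.

    `black_box` is a stand-in for your model's predict function; this is
    a simplified illustration, not the lime library's implementation.
    """
    rng = np.random.default_rng(seed)
    # 1. Perturb: sample points around x_p (plain Gaussian noise here).
    X = x_p + rng.normal(size=(n_samples, n_features))
    # 2. Label: the black-box model's predictions on the perturbed points.
    y = black_box(X)
    # 3. Weight: exponential kernel on the l2 distance to x_p.
    d = np.linalg.norm(X - x_p, axis=1)
    w = np.exp(-(d ** 2) / kernel_width ** 2)
    # 4. Fit a weighted linear model; its coefficients are the explanation.
    Xb = np.column_stack([X, np.ones(n_samples)])  # add intercept column
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(Xb * sw[:, None], y * sw, rcond=None)
    return coef[:-1]  # drop the intercept term
```

For a black-box model that is actually linear, this surrogate recovers the true coefficients; the interesting (and problematic) cases are the non-linear ones discussed below.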
Such an approach works with any data type: tabular data, texts, and images. The LIME
library has a nice API and is easy to use. Nevertheless, LIME has several hidden problems.
Problems
The LIME approach has several significant problems, which are described in this
section.
The first problem is the robustness of explanations, which can be measured with the Local Lipschitz estimate:

L(x_i) = max over x_j in B_ε(x_i) of ||f(x_i) − f(x_j)|| / ||x_i − x_j|| (Source)

where x_i is the input example, x_j is a neighboring point of x_i, f(x) is the vector of
feature importances, and B_ε(x_i) is the ball of radius ε centered at x_i. This measure shows the
maximum change in feature importance caused by small changes in the input features: the lower
the measure, the more robust the explanations.
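This measure can be estimated numerically by sampling neighbors inside the ε-ball and keeping the worst ratio. A minimal Monte-Carlo sketch, where `explain` is a hypothetical stand-in for calling LIME or SHAP on a point (not the paper's exact optimization procedure):

```python
import numpy as np

def local_lipschitz(explain, x, eps=0.1, n_samples=200, seed=0):
    """Monte-Carlo estimate of the Local Lipschitz measure L(x).

    `explain` maps an input point to its vector of feature importances;
    it is a hypothetical stand-in for an explainer such as LIME or SHAP.
    """
    rng = np.random.default_rng(seed)
    fx = np.asarray(explain(x))
    best = 0.0
    for _ in range(n_samples):
        # Sample a neighbor inside the eps-ball around x.
        x_j = x + rng.uniform(-eps, eps, size=x.shape)
        change = np.linalg.norm(fx - np.asarray(explain(x_j)))
        dist = np.linalg.norm(x - x_j)
        best = max(best, change / dist)
    return best
```

For a perfectly linear explainer the estimate equals the explainer's slope; unstable explainers produce much larger values.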
In the paper, the authors estimated the robustness of explanations of random forest
models using SHAP and LIME. For several datasets, they sampled a hundred points
from each dataset and calculated the Local Lipschitz estimate. Results are presented in
the figure below:
Local Lipschitz estimates computed on 100 test points on various UCI classification datasets. Source of figure
To get a sense of whether those values are low, average, or high, let's look at a single
explanation example. Here the authors present a single point from the Boston dataset
and a modified point that maximizes the Local Lipschitz estimate:
Top: example x_i from the BOSTON dataset and its explanations (attributions). Bottom: explanations for the
maximizer of the Lipschitz estimate L(x_i). Source of figure
We can see that a small change in the input features (CRIM: 0.4 → 0.3, AGE: 92.3 → 92.2,
DIS: 3.0 → 3.1, TAX: 307 → 307.1, B: 395.2 → 395.3, LSTAT: 21.5 → 21.4) made the explanations change noticeably.
1. Sample selection bias. Data for the train and test parts were selected using different
rules. For example, an image classification model might be trained on images taken with a
professional camera but tested on photos taken with smartphones.
More information on data shift may be found in the Understanding Dataset Shift blog
post.
The research conducted by Amir Hossein Akhavan Rahnama and Henrik Boström — “A
study of data and label shift in the LIME framework” — addresses this problem. They
did several experiments and concluded that instances generated by LIME’s perturbation
procedure are significantly different from training instances drawn from the underlying
distribution. Based on the obtained results, the authors argued that random
perturbation of features of the explained example could not be considered a reliable
method of data generation in the LIME framework.
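One reason for this shift is easy to demonstrate: sampling each feature independently from its marginal distribution destroys the correlation structure of the training data. A small self-contained illustration of that effect (a simplified sketch of tabular perturbation, not lime's exact sampler):

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data with two strongly correlated features.
cov = np.array([[1.0, 0.95],
                [0.95, 1.0]])
X_train = rng.multivariate_normal([0.0, 0.0], cov, size=2000)

# LIME-style tabular perturbation (simplified sketch): sample each
# feature independently from its marginal distribution.
X_pert = np.column_stack([
    rng.normal(X_train[:, j].mean(), X_train[:, j].std(), size=2000)
    for j in range(2)
])

# The perturbed data loses the correlation present in the training data.
corr_train = np.corrcoef(X_train.T)[0, 1]  # close to 0.95
corr_pert = np.corrcoef(X_pert.T)[0, 1]    # close to 0.0
print(corr_train, corr_pert)
```

The perturbed points therefore live in regions the black-box model may never have seen during training, which is exactly the distribution mismatch the paper measures.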
This leads to a significant problem: partly because of the data shift and partly because of the
limited prediction capacity of the surrogate model, the surrogate does not approximate
the predictions of the black-box model well enough. Such low-fidelity explanations are
pretty much useless (the goodness of the approximation is called fidelity).
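Fidelity can be quantified, for instance, as a weighted R² between the surrogate's and the black box's predictions on the locally sampled points. A minimal sketch, where all arguments are hypothetical stand-ins (prediction functions, local samples, and proximity weights):

```python
import numpy as np

def fidelity_r2(black_box, surrogate, X_local, weights):
    """Weighted R^2 of surrogate predictions vs. black-box predictions.

    All arguments are hypothetical stand-ins: two predict functions,
    locally sampled points, and their proximity weights.
    """
    y = black_box(X_local)
    y_hat = surrogate(X_local)
    # Weighted mean squared residual of the surrogate...
    resid = np.average((y - y_hat) ** 2, weights=weights)
    # ...relative to the weighted variance of the black-box predictions.
    var = np.average((y - np.average(y, weights=weights)) ** 2,
                     weights=weights)
    return 1.0 - resid / var
```

A fidelity near 1 means the surrogate locally mimics the black box well; values near 0 (or below) mean the explanation describes a model that behaves nothing like the one being explained.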
Fidelity and MMD divergence in the Newsgroup and ImageNet dataset. Source
2. How should we weight those points? What distance measure should we use?
In an experiment, I trained a LightGBM model and generated LIME explanations for a
sample of points from the Heart Disease UCI dataset using different sets of
hyperparameters. For each point, I calculated the pairwise Spearman rank correlation
between the feature weights obtained with the different hyperparameter sets. In the best-
case scenario, when explanations don't depend on hyperparameters, all pairwise
correlations should equal one. Unfortunately, that is not the case for LIME.
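The correlation computation itself can be sketched as follows; the feature-weight numbers here are purely illustrative, not values from the actual experiment:

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation for vectors without ties."""
    ranks = lambda v: np.argsort(np.argsort(v))
    return np.corrcoef(ranks(a), ranks(b))[0, 1]

# Hypothetical feature weights for one explained point; each row comes
# from a different LIME hyperparameter setting (illustrative numbers).
weights = np.array([
    [0.40, -0.20, 0.10, 0.05, -0.01],
    [0.35, -0.25, 0.12, 0.02, -0.03],
    [0.10, -0.30, 0.40, 0.01, -0.05],
])

# All pairwise correlations between hyperparameter settings.
pairwise = [spearman(weights[i], weights[j])
            for i in range(len(weights))
            for j in range(i + 1, len(weights))]
print(pairwise)  # all 1.0 only if explanations ignore hyperparameters
```

Here the first two settings rank the features identically (correlation 1.0), while the third flips the order of two features and the correlation drops below one, which is the kind of disagreement the experiment measures.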
The distribution of pairwise correlations of the top five most important features
(obtained using LightGBM “gain” importance):
We see that the correlation is high, but far from perfect. That means that, for a fixed point,
the most important features and their contributions may differ depending on the choice of
hyperparameters. In some cases, the same feature may "push" the prediction in a positive
or a negative direction, depending on which hyperparameters were used for the
explanation.
Feature weights using different LIME hyperparameters. A feature may contribute positively or
negatively to the final prediction depending on the choice of hyperparameters.
DenisVorotyntsev/lime_experiments
This repository contains code for experiments on how the choice of
hyperparameters affects explanations of LIME. The…
github.com
Summary
The LIME framework is wildly used nowadays. As I showed in this blog post, it has
several significant problems, which make LIME a poor choice for the model
interpretation:
Is it possible to overcome these problems? One might compute the actual feature
importance as an average of LIME importances obtained with different sets of
hyperparameters and slightly perturbed input feature vectors. But to me, that seems
more like a crutch than a reliable solution. I would like to hear your comments and
thoughts on this.
Instead, I would recommend using either fully interpretable models, such as linear
regression or additive models (for example, interpret shows results comparable to tree-
based models while being fully interpretable), or using SHAP for ML model
interpretation.
Additional Links
If you liked this, you might be interested in reading my other post on problems with
permutation importance: