
What's Wrong with LIME

Denis Vorotyntsev · Sep 7, 2020 · 7 min read

Homer Simpson is known as a simple man who does silly things from time to time. Here he has decided to cook bread, bacon, and eggs in the microwave. We find it funny to watch, yet we tend to do similar things in data science. Don't be like Homer; use the proper tools. This shot from The Simpsons ("Homer the Smithers", Season 7, Episode 17) is believed to qualify as fair use.

Local Interpretable Model-agnostic Explanations (LIME) is a popular Python package for explaining individual predictions of text classifiers, classifiers that act on tables (NumPy arrays of numerical or categorical data), or image classifiers. LIME was first introduced in 2016 in the paper ""Why Should I Trust You?": Explaining the Predictions of Any Classifier", and since then the project repository has gathered almost 8k stars (for comparison, scikit-learn has 42k stars).

While being one of the most popular approaches for explaining the predictions of any classifier, LIME has been criticized several times in the research community: LIME suffers from label and data shift, its explanations depend on the choice of hyperparameters (yes, LIME has hyperparameters), and even similar points may get different explanations. The combination of these problems can produce wrong, unexpected interpretations.

How LIME works


In machine learning projects, we often work with complex models, such as random forests or deep neural networks. It is not feasible for a human to understand how these black boxes make their predictions. At the same time, if we replace complex models with more straightforward, explainable ones, such as linear regression or a shallow decision tree, we lose in terms of score.

The idea behind LIME is to take the best of both worlds: train an accurate black-box model, but base its explanations on a simple, easy-to-understand model, such as linear or logistic regression.

To explain a single point P, we train a local linear (or logistic) regression model. Its input is perturbed training data (X), weighted by distance to point P (L1, L2, or another metric). Its target is the black-box model's prediction for X. Explanations are obtained by inspecting the coefficients of this surrogate model (more on how LIME works can be found in the original paper or this excellent book).
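To make the mechanism concrete, here is a minimal sketch of the idea for tabular data. It is not the library's actual implementation: the perturbation scheme, the kernel, and the function names are simplifying assumptions, and black_box_predict is assumed to return a single score per row (for example, the probability of the positive class).

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_like_explanation(black_box_predict, X_train, p, n_samples=5000, kernel_width=0.75):
    """A simplified LIME-style local surrogate for one point p (illustration only)."""
    # 1. Perturb: sample points around the training distribution, feature by feature.
    perturbed = X_train.mean(0) + np.random.randn(n_samples, X_train.shape[1]) * X_train.std(0)

    # 2. Weight each perturbed sample by its proximity to p (exponential kernel over L2 distance).
    distances = np.linalg.norm(perturbed - p, axis=1)
    weights = np.exp(-(distances ** 2) / kernel_width ** 2)

    # 3. The target is the black-box model's prediction for the perturbed samples.
    y_surrogate = black_box_predict(perturbed)

    # 4. Fit a weighted linear surrogate and read its coefficients as the explanation.
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(perturbed, y_surrogate, sample_weight=weights)
    return surrogate.coef_
```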

Such an approach can work with any data type: tabular data, texts, and images. The LIME library has a nice API and is easy to use. Nevertheless, LIME has several hidden problems.
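For tabular data, a typical call looks roughly like this. It is a minimal, self-contained example; the dataset and model are arbitrary choices for illustration, not anything from the original experiments.

```python
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    data.data,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    mode="classification",
)

# Explain a single prediction; the result is a list of (feature, weight) pairs
exp = explainer.explain_instance(data.data[0], model.predict_proba, num_features=5)
print(exp.as_list())
```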

Problems

The LIME approach has several significant problems, which are described in this
section.

LIME explanations aren’t robust


While making predictions, we expect similar points to have similar predictions. For example, if a credit applicant's income changes from 60k/year to 61k/year, we don't expect a sudden drop or spike in their credit score (at least not often). The same logic should apply to explanations: small changes in the input feature vector shouldn't dramatically affect the explanation. Going back to the credit score example: if the model's explanation shows that income=60k is an important feature that pushes the credit score in the positive direction, the same should hold for income=61k. Unfortunately, this is not always true for either LIME or SHAP.

David Alvarez-Melis and Tommi S. Jaakkola introduced a measure of the robustness of explanation tools in their paper "On the Robustness of Interpretability Methods". The measure, the Local Lipschitz estimate, is defined as:

L(x_i) = \max_{x_j \in B_\epsilon(x_i)} \frac{\lVert f(x_i) - f(x_j) \rVert_2}{\lVert x_i - x_j \rVert_2}

where x_i is the input example, x_j is a neighboring point of x_i, f(x) is the feature-importance (explanation) function, and B_ε(x_i) is the ball of radius ε centered at x_i. The measure captures the maximum change in feature importance caused by a small change in the input features: the lower the value, the more robust the explanations.
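In code, a crude Monte Carlo version of this estimate might look as follows. Here explain stands for any function mapping an input vector to its importance vector (a LIME or SHAP call, for instance), and the sampling scheme is my own simplification rather than the paper's exact procedure.

```python
import numpy as np

def local_lipschitz_estimate(explain, x_i, epsilon=0.1, n_neighbors=100, seed=0):
    """Approximate max ||f(x_i) - f(x_j)|| / ||x_i - x_j|| over x_j in an eps-ball around x_i."""
    rng = np.random.default_rng(seed)
    f_i = explain(x_i)
    best = 0.0
    for _ in range(n_neighbors):
        # Draw a random neighbor inside the ball of radius epsilon around x_i.
        direction = rng.normal(size=x_i.shape)
        direction /= np.linalg.norm(direction)
        x_j = x_i + direction * rng.uniform(0.0, epsilon)
        # Ratio of explanation change to input change; keep the largest one seen.
        ratio = np.linalg.norm(f_i - explain(x_j)) / np.linalg.norm(x_i - x_j)
        best = max(best, ratio)
    return best
```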

In the paper, the authors estimated the robustness of explanations of random forest models using SHAP and LIME. For several datasets, they sampled a hundred points from each and calculated the Local Lipschitz estimate. The results are presented in the figure below:


Local Lipschitz estimates computed on 100 test points on various UCI classification datasets. Source of
figure

To get a sense of whether those values are low, average, or high, let's look at a single explanation example. Here the authors present a point from the Boston dataset together with a modified point that maximizes the Local Lipschitz estimate:

Top: example x_i from the BOSTON dataset and its explanations (attributions). Bottom: explanations for the
maximizer of the Lipschitz estimate L(x_i). Source of figure

We can see that a small change in the input features (CRIM: 0.4 → 0.3, AGE: 92.3 → 92.2, DIS: 3.0 → 3.1, TAX: 307 → 307.1, B: 395.2 → 395.3, LSTAT: 21.5 → 21.4) changed the explanations dramatically: the signs and absolute values of the feature importances changed for almost half of the features! The authors observed similar behavior when using LIME to explain image models as well.

Explanations of similar examples might be totally different.

LIME suffers from label and data shift

Data shift is when the training and test distributions are different. It is a crucial yet sometimes hidden problem in ML pipelines, which may lead to under- or overestimated scores during validation or to model degradation in production. Data shift is caused by:

1. Sample selection bias. Data for the train and test parts were selected using different rules. For example, an image classification model is trained on images taken with a professional camera but tested on photos taken with smartphones.

2. Non-stationary environments. The world is not stationary: the underlying processes change, so algorithms must be updated. This is more evident in the continually evolving worlds of algorithmic trading and fraud detection, and less noticeable in more "conservative" domains such as sentiment analysis.

More information on data shift may be found in the Understanding Dataset Shift blog
post.

Example of data shift


The research conducted by Amir Hossein Akhavan Rahnama and Henrik Boström — “A
study of data and label shift in the LIME framework” — addresses this problem. They
did several experiments and concluded that instances generated by LIME’s perturbation
procedure are significantly different from training instances drawn from the underlying
distribution. Based on the obtained results, the authors argued that random
perturbation of features of the explained example could not be considered a reliable
method of data generation in the LIME framework.

This leads to a significant problem. Partly because of the data shift and partly because of the limited predictive capacity of the surrogate model, the surrogate does not approximate the black-box model's predictions well enough (the goodness of this approximation is called fidelity). Such low-fidelity explanations are pretty much useless.

The fidelity of LIME explanations is low.

Fidelity and MMD divergence in the Newsgroup and ImageNet dataset. Source
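If you want a quick sanity check of fidelity on your own data: as far as I can tell, the explanation object returned by the lime package stores the R² of the local surrogate and the surrogate's prediction for the explained point, so you can compare them with the black-box output. The attribute names below are an assumption about the current lime API; verify them against your installed version.

```python
# Continuing the earlier tabular example (explainer, model, data defined above)
exp = explainer.explain_instance(data.data[0], model.predict_proba, num_features=5)

print("Local surrogate R^2 (fidelity proxy):", exp.score)   # assumed attribute
print("Surrogate's local prediction:", exp.local_pred)      # assumed attribute
print("Black-box prediction:", model.predict_proba(data.data[:1])[0, 1])
```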

LIME explanations depend on the selection of hyperparameters


LIME has a set of essential hyperparameters (the sketch after this list shows where these knobs live in the library's API):

1. How many points should we sample for training the surrogate model?

2. How should we weight those points? Which distance measure should we use?

3. How should the hyperparameters of the surrogate model itself be selected (i.e., the type and strength of regularization)?
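All three knobs are exposed in the library: the kernel width (and hence the weighting) is set on the explainer itself, while the number of samples, the distance metric, and the surrogate model are arguments of explain_instance. A hedged sketch, reusing the explainer and model from the earlier example:

```python
from sklearn.linear_model import Ridge

# The same point explained with a non-default configuration of LIME's hyperparameters
exp = explainer.explain_instance(
    data.data[0],
    model.predict_proba,
    num_features=5,
    num_samples=1000,                   # 1. how many perturbed points to generate
    distance_metric="cosine",           # 2. how to measure proximity when weighting them
    model_regressor=Ridge(alpha=10.0),  # 3. which surrogate model (and regularization) to fit
)
print(exp.as_list())
```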

Ideally, no matter which set of hyperparameters we choose, the explanations would be more or less the same. I ran an experiment to check that.

In the experiment, I trained a LightGBM model and generated LIME explanations for a sample of points from the Heart Disease UCI dataset using different sets of hyperparameters. For each point, I calculated the pairwise Spearman rank correlation between the feature weights obtained with the different hyperparameter sets. In the best-case scenario, when explanations don't depend on hyperparameters, the pairwise correlations should all be equal to one. Unfortunately, that's not the case for LIME.
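Below is a minimal sketch of this kind of check, not the actual experiment code (that is linked further down). It explains the same point under two hyperparameter settings and compares the weight vectors with Spearman correlation; reading the weights through exp.as_map() is my assumption about the lime API.

```python
import numpy as np
from scipy.stats import spearmanr

def weight_vector(exp, n_features, label=1):
    """Turn a LIME explanation into a dense weight vector indexed by feature id."""
    w = np.zeros(n_features)
    for feature_id, weight in exp.as_map()[label]:
        w[feature_id] = weight
    return w

x = data.data[0]
n = data.data.shape[1]

exp_a = explainer.explain_instance(x, model.predict_proba, num_features=n, num_samples=1000)
exp_b = explainer.explain_instance(x, model.predict_proba, num_features=n, num_samples=10000,
                                   distance_metric="cosine")

# Ideally this correlation would be 1.0 regardless of the hyperparameters
rho, _ = spearmanr(weight_vector(exp_a, n), weight_vector(exp_b, n))
print("Spearman correlation between the two explanations:", rho)
```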

The distribution of pairwise correlations is shown below:

And here is the distribution of pairwise correlations for the top five most important features (obtained using LightGBM "gain" importance):

We see that the correlation is high, but far from perfect. That means that, for a fixed point, we may get different important features and different contributions depending on the choice of hyperparameters. In some cases, the same feature may "push" the prediction in a positive or a negative direction, depending on which hyperparameters were used for the explanation.

Feature weights obtained using different LIME hyperparameters. The same feature may contribute positively or negatively to the final prediction depending on the choice of hyperparameters.

Explanations depend on the choice of LIME hyperparameters.


The code for reproducing these experiments is available on my GitHub:

DenisVorotyntsev/lime_experiments
This repository contains code for experiments on how the choice of
hyperparameters affects explanations of LIME. The…
github.com

Summary
The LIME framework is widely used nowadays. As I showed in this blog post, it has several significant problems, which make LIME a poor choice for model interpretation:

1. Explanations of similar examples might be totally different;

2. The fidelity of LIME explanations is low;

3. Explanations depend on the choice of LIME hyperparameters.

Is it possible to overcome these problems? One might compute the final feature importance as an average of LIME importances obtained with different sets of hyperparameters and slightly perturbed input feature vectors. But to me, that looks more like a crutch than a reliable solution. I would like to hear your comments and thoughts on this.

Instead, I would recommend either using fully interpretable models, such as linear regression or additive models (for example, the interpret library shows results comparable to tree-based models while being fully interpretable), or using SHAP for ML model interpretation.
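For completeness, here is roughly what the SHAP route looks like for a tree-based model. This is a hedged sketch using the shap package's TreeExplainer; the dataset and model are arbitrary placeholders.

```python
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(data.data, data.target)

# TreeExplainer gives fast, consistent attributions for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(data.data[:100])

# Global view of per-feature contributions across the first 100 examples
shap.summary_plot(shap_values, data.data[:100], feature_names=data.feature_names)
```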

Additional Links
If you liked this, you might be interested in reading my other post on problems with
permutation importance:

Stop Permuting Features
Permutation importance may give you wrong, misleading results. But why?
towardsdatascience.com

