A Survey of the State of Explainable AI for Natural Language Processing
Marina Danilevsky, Kun Qian, Ranit Aharonov, Yannis Katsis, Ban Kawas, Prithviraj Sen
IBM Research – Almaden
mdanile@us.ibm.com, {qian.kun,Ranit.Aharonov2}@ibm.com
yannis.katsis@ibm.com, {bkawas,senp}@us.ibm.com
Abstract

This survey presents an overview of the current state of Explainable AI (XAI), considered within the domain of Natural Language Processing (NLP). We discuss the main categorization of explanations, as well as the various ways explanations can be arrived at and visualized. We detail the operations and explainability techniques currently available for generating explanations for NLP model predictions, to serve as a resource for model developers in the community. Finally, we point out the current gaps and encourage directions for future work in this important research area.

1 Introduction

Traditionally, Natural Language Processing (NLP) systems have been mostly based on techniques that are inherently explainable. Examples of such approaches, often referred to as white box techniques, include rules, decision trees, hidden Markov models, logistic regressions, and others. Recent years, though, have brought the advent and popularity of black box techniques, such as deep learning models and the use of language embeddings as features. While these methods in many cases substantially advance model quality, they come at the expense of models becoming less interpretable. This obfuscation of the process by which a model arrives at its results can be problematic, as it may erode trust in the many AI systems humans interact with daily (e.g., chatbots, recommendation systems, information retrieval algorithms, and many others). In the broader AI community, this growing understanding of the importance of explainability has created an emerging field called Explainable AI (XAI). However, just as tasks in different fields are more amenable to particular approaches, explainability approaches also vary by field; we therefore present this survey focusing on the NLP domain.

As will become clear in this survey, explainability is in itself a term that requires an explanation. While explainability may generally serve many purposes (see, e.g., Lertvittayakumjorn and Toni, 2019), our focus is on explainability from the perspective of an end user whose goal is to understand how a model arrives at its result, also referred to as the outcome explanation problem (Guidotti et al., 2018). In this regard, explanations can help users of NLP-based AI systems build trust in these systems' predictions. Additionally, understanding the model's operation may also allow users to provide useful feedback, which in turn can help developers improve model quality (Adadi and Berrada, 2018).

Explanations of model predictions have previously been categorized in a fairly simple way that differentiates between (1) whether the explanation is for each prediction individually or for the model's prediction process as a whole, and (2) whether generating the explanation requires post-processing or not (see Section 3). However, although rarely studied, there are many additional characterizations of explanations, the most important being the techniques used to either generate or visualize explanations. In this survey, we analyze the NLP literature with respect to both of these dimensions and identify the most commonly used explainability and visualization techniques, in addition to the operations used to generate explanations (Sections 4.1-4.3). We briefly describe each technique and point to representative papers adopting it. Finally, we discuss the common evaluation techniques used to measure the quality of explanations (Section 5), and conclude with a discussion of gaps and challenges in developing successful explainability approaches in the NLP domain (Section 6).
Related Surveys: Earlier surveys on XAI include Adadi and Berrada (2018) and Guidotti et al. (2018). While Adadi and Berrada provide a comprehensive review of basic terminology and fundamental concepts relevant to XAI in general, our goal is to survey more recent works in NLP in an effort to understand how these achieve XAI and how well they achieve it. Guidotti et al. adopt a four-dimensional classification scheme to rate various approaches. Crucially, they differentiate between the "explanator" and the black-box model it explains. This makes most sense when a surrogate model is used to explain a black-box model. As we shall subsequently see, such a distinction applies less well to the majority of NLP works published in the past few years, where the same neural network (NN) can be used not only to make predictions but also to derive explanations. In a series of tutorials, Lecue et al. (2020) discuss fairness and trust in machine learning (ML), which are clearly related to XAI but not the focus of this survey. Finally, we adapt some nomenclature from Arya et al. (2019), which presents a software toolkit that can help users lend explainability to their models and ML pipelines.

Our goal for this survey is to: (1) provide the reader with a better understanding of the state of XAI in NLP, (2) point developers interested in building explainable NLP models to currently available techniques, and (3) bring to the attention of the research community the gaps that exist, mainly a lack of formal definitions and evaluation for explainability. We have also built an interactive website providing interested readers with all relevant aspects for every paper covered in this survey.¹

¹ https://xainlp2020.github.io/xainlp/ (we plan to maintain this website as a contribution to the community).

2 Methodology

We identified relevant papers (see Appendix A) and classified them based on the aspects defined in Sections 3 and 4. To ensure a consistent classification, each paper was individually analyzed by at least two reviewers, consulting additional reviewers in the case of disagreement. For simplicity of presentation, we label each paper with its main applicable category for each aspect, though some papers may span multiple categories (usually with varying degrees of emphasis). All relevant aspects for every paper covered in this survey can be found at the aforementioned website, enabling readers of this survey to discover interesting explainability techniques and ideas, even if they have not been fully developed in the respective publications.

3 Categorization of Explanations

Explanations are often categorized along two main aspects (Guidotti et al., 2018; Adadi and Berrada, 2018). The first distinguishes whether the explanation is for an individual prediction (local) or for the model's prediction process as a whole (global). The second differentiates between explanations emerging directly from the prediction process (self-explaining) and those requiring post-processing (post-hoc). We next describe both of these aspects in detail, and provide a summary of the four categories they induce in Table 1.

Local Post-Hoc: Explain a single prediction by performing additional operations (after the model has emitted a prediction).
Local Self-Explaining: Explain a single prediction using the model itself (calculated from information made available from the model as part of making the prediction).
Global Post-Hoc: Perform additional operations to explain the entire model's predictive reasoning.
Global Self-Explaining: Use the predictive model itself to explain the entire model's predictive reasoning (a.k.a. a directly interpretable model).

Table 1: Overview of the high-level categories of explanations (Section 3).

3.1 Local vs Global

A local explanation provides information or justification for the model's prediction on a specific input; 46 of the 50 papers fall into this category.

A global explanation provides similar justification by revealing how the model's predictive process works, independently of any particular input. This category holds the remaining 4 papers covered by this survey. This low number is not surprising, given that the focus of this survey is on explanations that justify predictions, as opposed to explanations that help understand a model's behavior in general (which lie outside the scope of this survey).

3.2 Self-Explaining vs Post-Hoc

Regardless of whether the explanation is local or global, explanations differ on whether they arise as part of the prediction process, or whether their generation requires post-processing after the model makes a prediction. A self-explaining approach, which may also be referred to as directly interpretable (Arya et al., 2019), generates the explanation at the same time as the prediction, using information emitted by the model as a result of the process of making that prediction. Decision trees and rule-based models are examples of global self-explaining models, while feature saliency approaches such as attention are examples of local self-explaining models.

In contrast, a post-hoc approach requires that an additional operation is performed after the predictions are made. LIME (Ribeiro et al., 2016) is an example of producing a local explanation using a surrogate model applied following the predictor's operation. A paper might also be considered to span both categories; for example, Sydorova et al. (2019) presents both self-explaining and post-hoc explanation techniques.
4 Aspects of Explanations

While the previous categorization serves as a convenient high-level classification of explanations, it does not cover other important characteristics. We now introduce two additional aspects of explanations: (1) techniques for deriving the explanation and (2) presentation to the end user. We discuss the most commonly used explainability techniques, along with basic operations that enable explainability, as well as the visualization techniques commonly used to present the output of the associated explainability techniques. We identify the most common combinations of explainability techniques, operations, and visualization techniques for each of the four high-level categories of explanations presented above, and summarize them, together with representative papers, in Table 2.

Although explainability techniques and visualizations are often intermixed, there are fundamental differences between them that motivated us to treat them separately. Concretely, explanation derivation (typically done by AI scientists and engineers) focuses on mathematically motivated justifications of models' output, leveraging various explainability techniques to produce "raw explanations" (such as attention scores). On the other hand, explanation presentation (ideally done by UX engineers) focuses on how these "raw explanations" are best presented to the end users using suitable visualization techniques (such as saliency heatmaps).

4.1 Explainability Techniques

In the papers surveyed, we identified five major explainability techniques that differ in the mechanisms they adopt to generate the raw mathematical justifications that lead to the final explanation presented to the end users.

Feature importance. The main idea is to derive the explanation by investigating the importance scores of the different features used to output the final prediction. Such approaches can be built on different types of features, such as manual features obtained from feature engineering (e.g., Voskarides et al., 2015), lexical features including words/tokens and n-grams (e.g., Godin et al., 2018; Mullenbach et al., 2018), or latent features learned by NNs (e.g., Xie et al., 2017). The attention mechanism (Bahdanau et al., 2015) and first-derivative saliency (Li et al., 2015) are two widely used operations for enabling feature importance-based explanations. Text-based features are inherently more interpretable by humans than general features, which may explain the widespread use of attention-based approaches in the NLP domain.
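To make the mechanics concrete, the following is a minimal Python sketch, not taken from any of the surveyed systems; the token representations, query vector, and example sentence are fabricated stand-ins for real encoder outputs. It shows how attention weights over tokens can be read off directly as per-token importance scores.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    # Toy setup: one representation vector per token (stand-in for any encoder output).
    tokens = ["the", "movie", "was", "surprisingly", "good"]
    rng = np.random.default_rng(0)
    H = rng.normal(size=(len(tokens), 8))   # token representations, shape (T, d)
    q = rng.normal(size=8)                  # a learned query/context vector

    # Dot-product attention: one scalar score per token, normalized to sum to 1.
    alpha = softmax(H @ q)

    # The attention weights themselves are the "raw explanation":
    # an importance score attached to each input token.
    for tok, a in sorted(zip(tokens, alpha), key=lambda p: -p[1]):
        print(f"{tok:>12s}  {a:.3f}")

In a real model the weights would come from a trained attention layer rather than random vectors; the point is only that the scores are emitted as part of the prediction itself, which is what makes this a (local) self-explaining technique.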
Surrogate model. Model predictions are explained by learning a second, usually more explainable model, as a proxy. One well-known example is LIME (Ribeiro et al., 2016), which learns surrogate models using an operation called input perturbation. Surrogate model-based approaches are model-agnostic and can be used to achieve either local (e.g., Alvarez-Melis and Jaakkola, 2017) or global (e.g., Liu et al., 2018) explanations. However, the learned surrogate models and the original models may have completely different mechanisms to make predictions, leading to concerns about the fidelity of surrogate model-based approaches.

Example-driven. Such approaches explain the prediction of an input instance by identifying and presenting other instances, usually from available labeled data, that are semantically similar to the input instance. They are similar in spirit to nearest neighbor-based approaches (Dudani, 1976), and have been applied to different NLP tasks such as text classification (Croce et al., 2019) and question answering (Abujabal et al., 2017).
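As a sketch of the example-driven idea (independent of the cited systems; the bag-of-words encoder and the tiny dataset below are placeholders for whatever representation and training data a real system would use), the explanation is simply the most similar labeled instances retrieved for a new input:

    import numpy as np

    train_texts = ["great acting and a moving story",
                   "boring plot and wooden dialogue",
                   "one of the best films this year"]
    train_labels = ["positive", "negative", "positive"]
    query = "a moving story with great performances"

    # Placeholder encoder: L2-normalized bag-of-words over the combined vocabulary.
    vocab = sorted({w for t in train_texts + [query] for w in t.split()})
    def encode(text):
        v = np.zeros(len(vocab))
        for w in text.split():
            v[vocab.index(w)] += 1.0
        return v / (np.linalg.norm(v) + 1e-9)

    X = np.stack([encode(t) for t in train_texts])
    sims = X @ encode(query)                 # cosine similarity to each training instance

    # The explanation: show the nearest labeled examples alongside the prediction.
    for i in np.argsort(-sims)[:2]:
        print(f"{sims[i]:.2f}  [{train_labels[i]}]  {train_texts[i]}")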
Provenance-based. Explanations are provided by illustrating some or all of the prediction derivation process, which is an intuitive and effective explainability technique when the final prediction is the result of a series of reasoning steps. We observe several question answering papers adopt such approaches (Abujabal et al., 2017; Zhou et al., 2018; Amini et al., 2019).
Declarative induction. Human-readable representations, such as rules (Pröllochs et al., 2019), trees (Voskarides et al., 2015), and programs (Ling et al., 2017), are induced as explanations.

As shown in Table 2, feature importance-based and surrogate model-based approaches have been in frequent use (accounting for 29 and 8, respectively, of the 50 papers reviewed). This should not come as a surprise, as features serve as building blocks for machine learning models (explaining the proliferation of feature importance-based approaches) and most recent NLP papers employ NN-based models, which are generally black-box models (explaining the popularity of surrogate model-based approaches). Finally, note that a complex NLP approach consisting of different components may employ more than one of these explainability techniques. A representative example is the QA system QUINT (Abujabal et al., 2017), which displays the query template that best matches the user input query (example-driven) as well as the instantiated knowledge-base entities (provenance).
Local Post-Hoc (11 papers)
- feature importance; operations: first-derivative saliency, example driven; visualization: saliency; 5 papers (Wallace et al., 2018; Ross et al., 2017)
- surrogate model; operations: first-derivative saliency, layer-wise relevance propagation, input perturbation; visualization: saliency; 4 papers (Alvarez-Melis and Jaakkola, 2017; Poerner et al., 2018; Ribeiro et al., 2016)
- example driven; operations: layer-wise relevance propagation, explainability-aware architecture; visualization: raw examples; 2 papers (Croce et al., 2018; Jiang et al., 2019)

Local Self-Explaining (35 papers)
- feature importance; operations: attention, first-derivative saliency, LSTM gating signals, explainability-aware architecture; visualization: saliency; 22 papers (Mullenbach et al., 2018; Ghaeini et al., 2018; Xie et al., 2017; Aubakirova and Bansal, 2016)
- induction; operations: explainability-aware architecture, rule induction; visualization: raw declarative representation; 6 papers (Ling et al., 2017; Dong et al., 2019; Pezeshkpour et al., 2019a)
- provenance; operations: template-based; visualization: natural language, other; 3 papers (Abujabal et al., 2017)
- surrogate model; operations: attention, input perturbation, explainability-aware architecture; visualization: natural language; 3 papers (Rajani et al., 2019a; Sydorova et al., 2019)
- example driven; operations: layer-wise relevance propagation; visualization: raw examples; 1 paper (Croce et al., 2019)

Global Post-Hoc (3 papers)
- feature importance; operations: class activation mapping, attention, gradient reversal; visualization: saliency; 2 papers (Pryzant et al., 2018a,b)
- surrogate model; operations: taxonomy induction; visualization: raw declarative representation; 1 paper (Liu et al., 2018)

Global Self-Explaining (1 paper)
- induction; operations: reinforcement learning; visualization: raw declarative representation; 1 paper (Pröllochs et al., 2019)

Table 2: Overview of common combinations of explanation aspects. For each high-level category from Section 3 (with its total paper count), each line lists an explainability technique, the operations used to enable it, the visualization technique, the number of papers in this survey that fall within the subgroup, and representative references (see Sections 4.1, 4.2, and 4.3 for details).

4.2 Operations to Enable Explainability

We now present the most common set of operations encountered in our literature review that are used to enable explainability, in conjunction with relevant work employing each one.

First-derivative saliency. Gradient-based explanations estimate the contribution of input i towards output o by computing the partial derivative of o with respect to i. This is closely related to older concepts such as sensitivity (Saltelli et al., 2008). First-derivative saliency is particularly convenient for NN-based models because these derivatives can be computed for any layer using a single call to auto-differentiation, which most deep learning engines provide out-of-the-box. Recent work has also proposed improvements to first-derivative saliency (Sundararajan et al., 2017). As suggested by its name and definition, first-derivative saliency can be used to enable feature importance explainability, especially on word/token-level features (Aubakirova and Bansal, 2016; Karlekar et al., 2018).
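As an illustration only (the toy vocabulary, embedding size, mean-pooled linear classifier, and gradient-times-input scoring below are our own choices, not a recipe from the surveyed papers), first-derivative saliency can be obtained with a single backward pass through an auto-differentiation engine:

    import torch

    torch.manual_seed(0)
    vocab = {"the": 0, "film": 1, "was": 2, "dreadful": 3}
    emb = torch.nn.Embedding(len(vocab), 16)
    clf = torch.nn.Linear(16, 2)               # toy classifier over mean-pooled embeddings

    tokens = ["the", "film", "was", "dreadful"]
    ids = torch.tensor([vocab[t] for t in tokens])

    x = emb(ids)                               # (T, d) input representations
    x.retain_grad()                            # keep gradients for this non-leaf tensor
    logits = clf(x.mean(dim=0))
    predicted = logits.argmax()

    # Partial derivative of the predicted logit with respect to each input embedding.
    logits[predicted].backward()
    saliency = (x.grad * x).sum(dim=1).abs()   # gradient x input: one score per token

    for tok, s in zip(tokens, saliency.tolist()):
        print(f"{tok:>10s}  {s:.4f}")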
Layer-wise relevance propagation. This is another way to attribute relevance to features computed in any intermediate layer of an NN. Definitions are available for most common NN layers, including fully connected layers, convolution layers and recurrent layers. Layer-wise relevance propagation has been used to, for example, enable feature importance explainability (Poerner et al., 2018) and example-driven explainability (Croce et al., 2018).
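The following numpy sketch illustrates the propagation idea for fully connected layers using the common epsilon rule; the two-layer ReLU network, its weights, and the input are fabricated, and real implementations define additional rules for other layer types:

    import numpy as np

    rng = np.random.default_rng(0)

    # Fabricated two-layer ReLU network: input (4) -> hidden (3) -> output (2).
    W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(3, 2))
    x = rng.normal(size=4)

    # Forward pass, keeping every layer's activations.
    a1 = np.maximum(x @ W1, 0.0)
    out = a1 @ W2

    def lrp_dense(a_in, W, R_out, eps=1e-6):
        # Epsilon rule: redistribute the relevance R_out of a dense layer's outputs
        # onto its inputs, in proportion to each input's contribution.
        z = a_in @ W
        z = z + eps * np.sign(z)       # stabilizer to avoid division by zero
        s = R_out / z
        return a_in * (W @ s)

    # Start from the relevance of the winning output and walk back layer by layer.
    R_out = np.zeros_like(out)
    R_out[out.argmax()] = out[out.argmax()]
    R_hidden = lrp_dense(a1, W2, R_out)
    R_input = lrp_dense(x, W1, R_hidden)       # one relevance score per input feature

    print("input relevance:", np.round(R_input, 3))
    print("total relevance (approximately conserved):", R_input.sum(), R_out.sum())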
Input perturbations. Pioneered by LIME (Ribeiro et al., 2016), input perturbations can explain the output for input x by generating random perturbations of x and training an explainable model (usually a linear model). They are mainly used to enable surrogate models (e.g., Ribeiro et al., 2016; Alvarez-Melis and Jaakkola, 2017).
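A bare-bones sketch of the idea follows, in the spirit of LIME but not its actual implementation (which additionally weights samples by proximity and uses sparse regression); the black-box scorer is a made-up stand-in. Tokens are randomly dropped, the black box is queried on each perturbed text, and a least-squares linear fit on the keep/drop mask yields one importance weight per token.

    import numpy as np

    rng = np.random.default_rng(0)
    tokens = ["service", "was", "slow", "but", "food", "was", "excellent"]

    def black_box_score(words):
        # Stand-in for any opaque classifier: a fabricated positive-sentiment score.
        return 0.9 * ("excellent" in words) - 0.6 * ("slow" in words) + 0.1

    # 1) Sample binary masks (which tokens are kept) and query the black box on each.
    n_samples = 500
    masks = rng.integers(0, 2, size=(n_samples, len(tokens)))
    scores = np.array([
        black_box_score([t for t, keep in zip(tokens, m) if keep]) for m in masks
    ])

    # 2) Fit the linear surrogate: score ~ bias + sum_i w_i * mask_i (least squares).
    X = np.hstack([np.ones((n_samples, 1)), masks])
    w = np.linalg.lstsq(X, scores, rcond=None)[0][1:]

    # 3) The surrogate's weights are the local explanation for this input.
    for tok, wi in sorted(zip(tokens, w), key=lambda p: -abs(p[1])):
        print(f"{tok:>10s}  {wi:+.3f}")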
Attention (Bahdanau et al., 2015; Vaswani et al., 2017). Less an operation and more of a strategy to enable the NN to explain predictions, attention layers can be added to most NN architectures and, because they appeal to human intuition, can help indicate where the NN model is "focusing". While previous work has widely used attention layers (Luo et al., 2018; Xie et al., 2017; Mullenbach et al., 2018) to enable feature importance explainability, the jury is still out as to how much explainability attention provides (Jain and Wallace, 2019; Serrano and Smith, 2019; Wiegreffe and Pinter, 2019).

LSTM gating signals. Given the sequential nature of language, recurrent layers, in particular LSTMs (Hochreiter and Schmidhuber, 1997), are commonplace. While it is common to mine the outputs of LSTM cells to explain outputs, there may also be information present in the outputs of the gates produced within the cells. It is possible to utilize (and even combine) other operations presented here to interpret gating signals to aid feature importance explainability (Ghaeini et al., 2018).

Explainability-aware architecture design. One way to exploit the flexibility of deep learning is to devise an NN architecture that mimics the process humans employ to arrive at a solution. This makes the learned model (partially) interpretable, since the architecture contains human-recognizable components. Implementing such a model architecture can be used to enable the induction of human-readable programs for solving math problems (Amini et al., 2019; Ling et al., 2017) or sentence simplification problems (Dong et al., 2019). This design may also be applied to surrogate models that generate explanations for predictions (Rajani et al., 2019a; Liu et al., 2019).

Previous works have also attempted to compare these operations in terms of efficacy with respect to specific NLP tasks (Poerner et al., 2018). Operations outside of this list exist and are popular for particular categories of explanations. Table 2 mentions some of these. For instance, Pröllochs et al. (2019) use reinforcement learning to learn simple negation rules, Liu et al. (2018) learns a taxonomy post-hoc to better interpret network embeddings, and Pryzant et al. (2018b) uses gradient reversal (Ganin et al., 2016) to deconfound lexicons.

4.3 Visualization Techniques

An explanation may be presented in different ways to the end user, and making the appropriate choice is crucial for the overall success of an XAI approach. For example, the widely used attention mechanism, which learns the importance scores of a set of features, can be visualized as raw attention scores or as a saliency heatmap (see Figure 1a). Although the former is acceptable, the latter is more user-friendly and has become the standard way to visualize attention-based approaches. We now present the major visualization techniques identified in our literature review.

Saliency. This has been primarily used to visualize the importance scores of different types of elements in XAI learning systems, such as showing input-output word alignment (Bahdanau et al., 2015) (Figure 1a), highlighting words in input text (Mullenbach et al., 2018) (Figure 1b) or displaying extracted relations (Xie et al., 2017). We observe a strong correspondence between feature importance-based explainability and saliency-based visualizations; namely, all papers using feature importance to generate explanations also chose saliency-based visualization techniques. Saliency-based visualizations are popular because they present visually perceptive explanations and can be easily understood by different types of end users. They are there-
Figure 1: Examples of visualization techniques: (a) saliency heatmap (Bahdanau et al., 2015); (b) saliency highlighting (Mullenbach et al., 2018); (c) raw declarative rules (Pezeshkpour et al., 2019b); (d) raw declarative program (Amini et al., 2019); (e) raw examples (Croce et al., 2019).
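As a small illustration of the heatmap style of Figure 1a (the alignment matrix below is random, standing in for real attention or alignment scores between a source and a target sentence), such a visualization can be rendered directly from a token-by-token score matrix:

    import numpy as np
    import matplotlib.pyplot as plt

    src = ["the", "cat", "sat"]
    tgt = ["le", "chat", "s'est", "assis"]

    rng = np.random.default_rng(0)
    scores = rng.random((len(tgt), len(src)))
    scores = scores / scores.sum(axis=1, keepdims=True)   # row-normalized, like attention

    fig, ax = plt.subplots(figsize=(3, 3))
    im = ax.imshow(scores, cmap="Greys")
    ax.set_xticks(range(len(src)))
    ax.set_xticklabels(src)
    ax.set_yticks(range(len(tgt)))
    ax.set_yticklabels(tgt)
    fig.colorbar(im, ax=ax)
    plt.tight_layout()
    plt.savefig("saliency_heatmap.png")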
Appendix A: Methodology

This survey aims to demonstrate the recent advances in XAI research in NLP, rather than to provide an exhaustive list of XAI papers in the NLP community. To this end, we identified relevant papers published in major NLP conferences (ACL, NAACL, EMNLP, and COLING) between 2013 and 2019. We filtered for titles containing (lemmatized) terms related to XAI, such as "explainability", "interpretability", "transparent", etc. While this may ignore some related papers, we argue that representative papers are more likely to include such terms in their titles. In particular, we assume