
Adaptively Sparse Transformers

Gonçalo M. Correia    Vlad Niculae    André F.T. Martins
Instituto de Telecomunicações, Lisbon, Portugal
Unbabel, Lisbon, Portugal
goncalo.correia@lx.it.pt    vlad@vene.ro    andre.martins@unbabel.com

Abstract

Attention mechanisms have become ubiquitous in NLP. Recent architectures, notably the Transformer, learn powerful context-aware word representations through layered, multi-headed attention. The multiple heads learn diverse types of word relationships. However, with standard softmax attention, all attention heads are dense, assigning a non-zero weight to all context words. In this work, we introduce the adaptively sparse Transformer, wherein attention heads have flexible, context-dependent sparsity patterns. This sparsity is accomplished by replacing softmax with α-entmax: a differentiable generalization of softmax that allows low-scoring words to receive precisely zero weight. Moreover, we derive a method to automatically learn the α parameter – which controls the shape and sparsity of α-entmax – allowing attention heads to choose between focused or spread-out behavior. Our adaptively sparse Transformer improves interpretability and head diversity when compared to softmax Transformers on machine translation datasets. Findings of the quantitative and qualitative analysis of our approach include that heads in different layers learn different sparsity preferences and tend to be more diverse in their attention distributions than softmax Transformers. Furthermore, at no cost in accuracy, sparsity in attention heads helps to uncover different head specializations.

Figure 1: Attention distributions of different self-attention heads (heads 1–4) for the time step of the token "over", shown to compare our model to other related work (panels: Sparse Transformer, Adaptive Span Transformer, Adaptively Sparse Transformer (Ours)). While the sparse Transformer (Child et al., 2019) and the adaptive span Transformer (Sukhbaatar et al., 2019) only attend to words within a contiguous span of the past tokens, our model is not only able to obtain different and not necessarily contiguous sparsity patterns for each attention head, but is also able to tune its support over which tokens to attend adaptively.

1 Introduction

The Transformer architecture (Vaswani et al., 2017) for deep neural networks has quickly risen to prominence in NLP through its efficiency and performance, leading to improvements in the state of the art of Neural Machine Translation (NMT; Junczys-Dowmunt et al., 2018; Ott et al., 2018), as well as inspiring other powerful general-purpose models like BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019). At the heart of the Transformer lie multi-head attention mechanisms: each word is represented by multiple different weighted averages of its relevant context. As suggested by recent works on interpreting attention head roles, separate attention heads may learn to look for various relationships between tokens (Tang et al., 2018; Raganato and Tiedemann, 2018; Mareček and Rosa, 2018; Tenney et al., 2019; Voita et al., 2019).

The attention distribution of each head is typically predicted using the softmax normalizing transform. As a result, all context words have non-zero attention weight. Recent work on single attention architectures suggests that using sparse normalizing transforms in attention mechanisms such as sparsemax – which can yield exactly zero probabilities for irrelevant words – may improve performance and interpretability (Malaviya et al., 2018; Deng et al., 2018; Peters et al., 2019). Qualitative analysis of attention heads (Vaswani et al., 2017, Figure 5) suggests that, depending on what phenomena they capture, heads tend to favor flatter or more peaked distributions.

Recent works have proposed sparse Transformers (Child et al., 2019) and adaptive span Transformers (Sukhbaatar et al., 2019). However, the "sparsity" of those models only limits the attention to a contiguous span of past tokens, while in this work we propose a highly adaptive Transformer model that is capable of attending to a sparse set of words that are not necessarily contiguous. Figure 1 shows the relationship of these methods with ours.

Our contributions are the following:

• We introduce sparse attention into the Transformer architecture, showing that it eases interpretability and leads to slight accuracy gains.

• We propose an adaptive version of sparse attention, where the shape of each attention head is learnable and can vary continuously and dynamically between the dense limit case of softmax and the sparse, piecewise-linear sparsemax case.¹

• We make an extensive analysis of the added interpretability of these models, identifying both crisper examples of attention head behavior observed in previous work, as well as novel behaviors unraveled thanks to the sparsity and adaptivity of our proposed model.

¹ Code and pip package available at https://github.com/deep-spin/entmax.

2 Background

2.1 The Transformer

In NMT, the Transformer (Vaswani et al., 2017) is a sequence-to-sequence (seq2seq) model which maps an input sequence to an output sequence through hierarchical multi-head attention mechanisms, yielding a dynamic, context-dependent strategy for propagating information within and across sentences. It contrasts with previous seq2seq models, which usually rely either on costly gated recurrent operations (often LSTMs: Bahdanau et al., 2015; Luong et al., 2015) or static convolutions (Gehring et al., 2017).

Given n query contexts and m sequence items under consideration, attention mechanisms compute, for each query, a weighted representation of the items. The particular attention mechanism used in Vaswani et al. (2017) is called scaled dot-product attention, and it is computed in the following way:

\[ \mathrm{Att}(Q, K, V) = \pi\!\left(\frac{QK^\top}{\sqrt{d}}\right) V \]  (1)

where Q ∈ R^{n×d} contains representations of the queries, K, V ∈ R^{m×d} are the keys and values of the items attended over, and d is the dimensionality of these representations. The π mapping normalizes row-wise using softmax, π(Z)_{ij} = softmax(z_i)_j, where

\[ \mathrm{softmax}(z)_j = \frac{\exp(z_j)}{\sum_{j'} \exp(z_{j'})} \]  (2)

In words, the keys are used to compute a relevance score between each item and query. Then, normalized attention weights are computed using softmax, and these are used to weight the values of each item at each query context.

However, for complex tasks, different parts of a sequence may be relevant in different ways, motivating multi-head attention in Transformers. This is simply the application of Equation 1 in parallel H times, each with a different, learned linear transformation that allows specialization:

\[ \mathrm{Head}_i(Q, K, V) = \mathrm{Att}(Q W_i^Q, K W_i^K, V W_i^V) \]  (3)

In the Transformer, there are three separate multi-head attention mechanisms for distinct purposes:

• Encoder self-attention: builds rich, layered representations of each input word, by attending on the entire input sentence.

• Context attention: selects a representative weighted average of the encodings of the input words, at each time step of the decoder.

• Decoder self-attention: attends over the partial output sentence fragment produced so far.

Together, these mechanisms enable the contextualized flow of information between the input sentence and the sequential decoder.
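To make Equations 1–3 concrete, the following is a minimal PyTorch sketch of multi-head scaled dot-product attention with a pluggable row-wise normalizer π. This is not the authors' released implementation; the class and argument names are ours, and softmax is used as the default normalizer.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Minimal sketch of Equations 1-3 with a pluggable normalizer pi."""

    def __init__(self, d_model, n_heads, normalizer=None):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        # learned projections W^Q, W^K, W^V plus an output projection
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)
        # pi: maps each row of scores to a distribution over the m items
        self.normalizer = normalizer or (lambda z: torch.softmax(z, dim=-1))

    def forward(self, queries, keys, values):
        def split(x):  # (batch, t, d_model) -> (batch, heads, t, d_head)
            b, t, _ = x.shape
            return x.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.w_q(queries)), split(self.w_k(keys)), split(self.w_v(values))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)  # QK^T / sqrt(d)
        p = self.normalizer(scores)   # Eq. 1: pi applied row-wise
        ctx = p @ v                   # weighted average of the values
        b, _, n, _ = ctx.shape
        return self.w_o(ctx.transpose(1, 2).reshape(b, n, -1))

mha = MultiHeadAttention(d_model=512, n_heads=8)
x = torch.randn(2, 10, 512)
out = mha(x, x, x)   # encoder self-attention; shape (2, 10, 512)
```

Because π enters in a single place, swapping softmax for a sparse mapping such as α-entmax (Section 3) amounts to changing one argument.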
2.2 Sparse Attention

The softmax mapping (Equation 2) is elementwise proportional to exp, therefore it can never assign a weight of exactly zero. Thus, unnecessary items are still taken into consideration to some extent. Since its output sums to one, this invariably means less weight is assigned to the relevant items, potentially harming performance and interpretability (Jain and Wallace, 2019). This has motivated a line of research on learning networks with sparse mappings (Martins and Astudillo, 2016; Niculae and Blondel, 2017; Louizos et al., 2018; Shao et al., 2019).

We focus on a recently-introduced flexible family of transformations, α-entmax (Blondel et al., 2019; Peters et al., 2019), defined as:

\[ \alpha\text{-entmax}(z) := \operatorname*{argmax}_{p \in \triangle^d} \; p^\top z + H_\alpha^T(p), \]  (4)

where △^d := {p ∈ R^d : Σ_i p_i = 1} is the probability simplex, and, for α ≥ 1, H_α^T is the Tsallis continuous family of entropies (Tsallis, 1988):

\[ H_\alpha^T(p) := \begin{cases} \frac{1}{\alpha(\alpha-1)} \sum_j \left(p_j - p_j^\alpha\right), & \alpha \neq 1, \\ -\sum_j p_j \log p_j, & \alpha = 1. \end{cases} \]  (5)

This family contains the well-known Shannon and Gini entropies, corresponding to the cases α = 1 and α = 2, respectively.

Equation 4 involves a convex optimization subproblem. Using the definition of H_α^T, the optimality conditions may be used to derive the following form for the solution (Appendix B.2):

\[ \alpha\text{-entmax}(z) = \left[(\alpha - 1)z - \tau \mathbf{1}\right]_+^{1/(\alpha-1)}, \]  (6)

where [·]_+ is the positive part (ReLU) function, 1 denotes the vector of all ones, and τ – which acts like a threshold – is the Lagrange multiplier corresponding to the Σ_i p_i = 1 constraint.

Properties of α-entmax. The appeal of α-entmax for attention rests on the following properties. For α = 1 (i.e., when H_α^T becomes the Shannon entropy), it exactly recovers the softmax mapping (we provide a short derivation in Appendix B.3). For all α > 1 it permits sparse solutions, in stark contrast to softmax. In particular, for α = 2, it recovers the sparsemax mapping (Martins and Astudillo, 2016), which is piecewise linear. In-between, as α increases, the mapping continuously gets sparser as its curvature changes.

To compute the value of α-entmax, one must find the threshold τ such that the r.h.s. in Equation 6 sums to one. Blondel et al. (2019) propose a general bisection algorithm. Peters et al. (2019) introduce a faster, exact algorithm for α = 1.5, and enable using α-entmax with fixed α within a neural network by showing that the α-entmax Jacobian w.r.t. z for p⋆ = α-entmax(z) is

\[ \frac{\partial\, \alpha\text{-entmax}(z)}{\partial z} = \mathrm{diag}(s) - \frac{1}{\sum_j s_j}\, s s^\top, \qquad s_i = \begin{cases} (p_i^\star)^{2-\alpha}, & p_i^\star > 0, \\ 0, & p_i^\star = 0. \end{cases} \]  (7)
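As an illustration of Equation 6 and of the bisection strategy of Blondel et al. (2019), the NumPy sketch below searches for the threshold τ directly. It is only meant to make the mechanics concrete; the released entmax package should be preferred for batched, GPU-friendly use.

```python
import numpy as np

def entmax_bisect(z, alpha=1.5, n_iter=50):
    """alpha-entmax via bisection on the threshold tau (Eq. 6).

    Solves for tau such that sum_i [(alpha-1) z_i - tau]_+^{1/(alpha-1)} = 1.
    Assumes alpha > 1; alpha -> 1 recovers softmax, alpha = 2 gives sparsemax.
    """
    z = np.asarray(z, dtype=float)
    scaled = (alpha - 1.0) * z
    # tau lies in [max(scaled) - 1, max(scaled)]: the sum is >= 1 at the
    # lower end and 0 at the upper end, and it decreases monotonically.
    lo, hi = scaled.max() - 1.0, scaled.max()
    for _ in range(n_iter):
        tau = (lo + hi) / 2.0
        p = np.maximum(scaled - tau, 0.0) ** (1.0 / (alpha - 1.0))
        if p.sum() < 1.0:
            hi = tau   # total mass too small: threshold must decrease
        else:
            lo = tau   # total mass >= 1: threshold can increase
    p = np.maximum(scaled - lo, 0.0) ** (1.0 / (alpha - 1.0))
    return p / p.sum()  # tiny renormalization for leftover bisection error

print(entmax_bisect([1.0, 0.5, -1.0], alpha=1.5))  # lowest score gets exactly 0
print(entmax_bisect([1.0, 0.5, -1.0], alpha=2.0))  # sparsemax: [0.75, 0.25, 0.]
```

The example makes the sparsity visible: the lowest-scoring item receives exactly zero weight for both α = 1.5 and α = 2, something softmax can never do.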
Our work furthers the study of α-entmax by providing a derivation of the Jacobian w.r.t. the hyper-parameter α (Section 3), thereby allowing the shape and sparsity of the mapping to be learned automatically. This is particularly appealing in the context of multi-head attention mechanisms, where we shall show in Section 5.1 that different heads tend to learn different sparsity behaviors.

3 Adaptively Sparse Transformers with α-entmax

We now propose a novel Transformer architecture wherein we simply replace softmax with α-entmax in the attention heads. Concretely, we replace the row normalization π in Equation 1 by

\[ \pi(Z)_{ij} = \alpha\text{-entmax}(z_i)_j \]  (8)

This change leads to sparse attention weights, as long as α > 1; in particular, α = 1.5 is a sensible starting point (Peters et al., 2019).

Different α per head. Unlike LSTM-based seq2seq models, where α can be more easily tuned by grid search, in a Transformer, there are many attention heads in multiple layers. Crucial to the power of such models, the different heads capture different linguistic phenomena, some of them isolating important words, others spreading out attention across phrases (Vaswani et al., 2017, Figure 5). This motivates using different, adaptive α values for each attention head, such that some heads may learn to be sparser, and others may become closer to softmax. We propose doing so by treating the α values as neural network parameters, optimized via stochastic gradients along with the other weights.

Derivatives w.r.t. α. In order to optimize α automatically via gradient methods, we must compute the Jacobian of the entmax output w.r.t. α. Since entmax is defined through an optimization problem, this is non-trivial and cannot be simply handled through automatic differentiation; it falls within the domain of argmin differentiation, an active research topic in optimization (Gould et al., 2016; Amos and Kolter, 2017).

One of our key contributions is the derivation of a closed-form expression for this Jacobian. The next proposition provides such an expression, enabling entmax layers with adaptive α. To the best of our knowledge, ours is the first neural network module that can automatically, continuously vary in shape away from softmax and toward sparse mappings like sparsemax.
Proposition 1. Let p⋆ := α-entmax(z) be the solution of Equation 4. Denote the distribution p̃_i := (p_i⋆)^{2−α} / Σ_j (p_j⋆)^{2−α} and let h_i := −p_i⋆ log p_i⋆. The i-th component of the Jacobian g := ∂ α-entmax(z)/∂α is

\[ g_i = \begin{cases} \dfrac{p_i^\star - \tilde{p}_i}{(\alpha-1)^2} + \dfrac{h_i - \tilde{p}_i \sum_j h_j}{\alpha - 1}, & \alpha > 1, \\[1.5ex] \dfrac{h_i \log p_i^\star - p_i^\star \sum_j h_j \log p_j^\star}{2}, & \alpha = 1. \end{cases} \]  (9)

The proof uses implicit function differentiation and is given in Appendix C.
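As a sanity check on the α > 1 branch of Equation 9, the sketch below compares the closed form against a central finite-difference approximation, reusing the entmax_bisect helper from the sketch in Section 2.2 (both helpers are ours, not the released package).

```python
import numpy as np

def entmax_alpha_grad(z, alpha):
    """alpha > 1 branch of Eq. 9: components of d alpha-entmax(z) / d alpha."""
    p = entmax_bisect(z, alpha)            # p*, via the bisection sketch above
    active = p > 0
    safe_p = np.where(active, p, 1.0)      # avoid log(0) on zeroed entries
    h = np.where(active, -safe_p * np.log(safe_p), 0.0)      # h_i = -p*_i log p*_i
    unn = np.where(active, safe_p ** (2.0 - alpha), 0.0)
    p_tilde = unn / unn.sum()                                 # p~_i in Proposition 1
    return (p - p_tilde) / (alpha - 1.0) ** 2 + (h - p_tilde * h.sum()) / (alpha - 1.0)

rng = np.random.default_rng(0)
z, alpha, eps = rng.normal(size=6), 1.5, 1e-4
numeric = (entmax_bisect(z, alpha + eps) - entmax_bisect(z, alpha - eps)) / (2 * eps)
print(entmax_alpha_grad(z, alpha))
print(numeric)   # the two rows should agree to several decimal places
```

The agreement only holds when the support of p⋆ does not change between α − ε and α + ε, which is the generic case for random scores.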
Proposition 1 provides the remaining missing piece needed for training adaptively sparse Transformers. In the following section, we evaluate this strategy on neural machine translation, and analyze the behavior of the learned attention heads.

4 Experiments

We apply our adaptively sparse Transformers on four machine translation tasks. For comparison, a natural baseline is the standard Transformer architecture using the softmax transform in its multi-head attention mechanisms. We consider two other model variants in our experiments that make use of different normalizing transformations:

• 1.5-entmax: a Transformer with sparse entmax attention with fixed α = 1.5 for all heads. This is a novel model, since 1.5-entmax had only been proposed for RNN-based NMT models (Peters et al., 2019), but never in Transformers, where attention modules are not just one single component of the seq2seq model but rather an integral part of all of the model components.

• α-entmax: an adaptive Transformer with sparse entmax attention with a different, learned α_{i,j}^t for each head.

The adaptive model has an additional scalar parameter per attention head per layer for each of the three attention mechanisms (encoder self-attention, context attention, and decoder self-attention), i.e.,

\[ \left\{ a_{i,j}^t \in \mathbb{R} : i \in \{1, \dots, L\},\; j \in \{1, \dots, H\},\; t \in \{\mathrm{enc}, \mathrm{ctx}, \mathrm{dec}\} \right\}, \]  (10)

and we set α_{i,j}^t = 1 + sigmoid(a_{i,j}^t) ∈ ]1, 2[. All or some of the α values can be tied if desired, but we keep them independent for analysis purposes.
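A minimal sketch of this parametrization: one unconstrained scalar a_{i,j}^t per head is stored and mapped to α = 1 + sigmoid(a) ∈ ]1, 2[, so that it can be trained by stochastic gradients together with the other weights. The module layout and the zero initialization below are our assumptions; the paper initializes the α values at random (see Figure 2).

```python
import torch
import torch.nn as nn

class HeadwiseAlpha(nn.Module):
    """One learnable scalar a per (layer, head, attention type), as in Eq. 10,
    mapped to alpha = 1 + sigmoid(a), so SGD can tune each head's sparsity."""

    def __init__(self, n_layers=6, n_heads=8, mechanisms=("enc", "ctx", "dec")):
        super().__init__()
        # a = 0 puts every head at alpha = 1.5, halfway between softmax-like
        # (alpha -> 1) and sparsemax (alpha = 2); the paper initializes at random.
        self.a = nn.ParameterDict({
            t: nn.Parameter(torch.zeros(n_layers, n_heads)) for t in mechanisms
        })

    def alpha(self, mechanism, layer, head):
        # alpha stays strictly inside ]1, 2[: always sparse-capable, but never
        # exactly softmax or exactly sparsemax.
        return 1.0 + torch.sigmoid(self.a[mechanism][layer, head])

alphas = HeadwiseAlpha()
print(alphas.alpha("enc", layer=0, head=3))   # alpha = 1.5 before training
```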
Datasets. Our models were trained on 4 machine translation datasets of different training sizes:

• IWSLT 2017 German → English (DE→EN, Cettolo et al., 2017): 200K sentence pairs.

• KFTT Japanese → English (JA→EN, Neubig, 2011): 300K sentence pairs.

• WMT 2016 Romanian → English (RO→EN, Bojar et al., 2016): 600K sentence pairs.

• WMT 2014 English → German (EN→DE, Bojar et al., 2014): 4.5M sentence pairs.

All of these datasets were preprocessed with byte-pair encoding (BPE; Sennrich et al., 2016), using joint segmentations of 32k merge operations.

Training. We follow the dimensions of the Transformer-Base model of Vaswani et al. (2017): the number of layers is L = 6 and the number of heads is H = 8 in the encoder self-attention, the context attention, and the decoder self-attention. We use a mini-batch size of 8192 tokens and warm up the learning rate linearly until 20k steps, after which it decays according to an inverse square root schedule. All models were trained until convergence of validation accuracy, and evaluation was done every 10k steps for RO→EN and EN→DE and every 5k steps for DE→EN and JA→EN. The end-to-end computational overhead of our methods, when compared to standard softmax, is relatively small; in training tokens per second, the models using α-entmax and 1.5-entmax are, respectively, 75% and 90% the speed of the softmax model.
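For concreteness, the learning rate schedule described above (linear warm-up for 20k steps, then inverse square root decay) can be written as a small function; the peak value is an assumption here, since the text does not state it.

```python
def transformer_lr(step, warmup=20000, peak=7e-4):
    """Linear warm-up followed by inverse square root decay (peak value assumed)."""
    if step < warmup:
        return peak * step / warmup            # linear warm-up
    return peak * (warmup / step) ** 0.5       # inverse sqrt decay

# Continuous at the boundary: transformer_lr(20000) == peak.
print(transformer_lr(10000), transformer_lr(20000), transformer_lr(80000))
```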
Results. We report test set tokenized BLEU (Papineni et al., 2002) results in Table 1. We can see that replacing softmax by entmax does not hurt performance in any of the datasets; indeed, sparse attention Transformers tend to have slightly higher BLEU, but their sparsity leads to a better potential for analysis. In the next section, we make use of this potential by exploring the learned internal mechanics of the self-attention heads.

5 Analysis

We conduct an analysis for the higher-resource dataset WMT 2014 English → German of the attention in the sparse adaptive Transformer model (α-entmax) at multiple levels: we analyze high-level statistics as well as individual head behavior. Moreover, we make a qualitative analysis of the interpretability capabilities of our models.

activation    DE→EN    JA→EN    RO→EN    EN→DE
softmax       29.79    21.57    32.70    26.02
1.5-entmax    29.83    22.13    33.10    25.89
α-entmax      29.90    21.74    32.89    26.93

Table 1: Machine translation tokenized BLEU test results on IWSLT 2017 DE→EN, KFTT JA→EN, WMT 2016 RO→EN and WMT 2014 EN→DE, respectively.

5.1 High-Level Statistics

What kind of α values are learned? Figure 2 shows the learning trajectories of the α parameters of a selected subset of heads. We generally observe a tendency for the randomly-initialized α parameters to decrease initially, suggesting that softmax-like behavior may be preferable while the model is still very uncertain. After around one thousand steps, some heads change direction and become sparser, perhaps as they become more confident and specialized. This shows that the initialization of α does not predetermine its sparsity level or the role the head will have throughout. In particular, head 8 in the encoder self-attention layer 2 first drops to around α = 1.3 before becoming one of the sparsest heads, with α ≈ 2.

Figure 2: Trajectories of α values for a subset of the heads during training (decoder layer 1, head 8; encoder layer 1, heads 3 and 4; encoder layer 2, head 8; encoder layer 6, head 2). Initialized at random, most heads become denser in the beginning, before converging. This suggests that dense attention may be more beneficial while the network is still uncertain, being replaced by sparse attention afterwards.

The overall distribution of α values at convergence can be seen in Figure 3. We can observe that the encoder self-attention blocks learn to concentrate the α values in two modes: a very sparse one around α → 2, and a dense one between softmax and 1.5-entmax. However, the decoder self-attention and context attention only learn to distribute these parameters in a single mode. We show next that this is reflected in the average density of attention weight vectors as well.

Attention weight density when translating. For any α > 1, it would still be possible for the weight matrices in Equation 3 to learn re-scalings so as to make attention sparser or denser. To visualize the impact of adaptive α values, we compare the empirical attention weight density (the average number of tokens receiving non-zero attention) within each module, against sparse Transformers with fixed α = 1.5.

Figure 4 shows that, with fixed α = 1.5, heads tend to be sparse and similarly-distributed in all three attention modules. With learned α, there are two notable changes: (i) a prominent mode corresponding to fully dense probabilities, showing that our models learn to combine sparse and dense attention, and (ii) a distinction between the encoder self-attention – whose background distribution tends toward extreme sparsity – and the other two modules, which exhibit more uniform background distributions. This suggests that perhaps entirely sparse Transformers are suboptimal.

The fact that the decoder seems to prefer denser attention distributions might be attributed to it being auto-regressive, only having access to past tokens and not the full sentence. We speculate that it might lose too much information if it assigned weights of zero to too many tokens in the self-attention, since there are fewer tokens to attend to in the first place.

Breaking this down into separate layers, Figure 5 shows the average (sorted) density of each head for each layer. We observe that α-entmax is able to learn different sparsity patterns at each layer, leading to more variance in individual head behavior, to clearly-identified dense and sparse heads, and overall to different tendencies compared to the fixed case of α = 1.5.
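The density statistic used in this comparison is easy to compute from a head's attention matrix. Below is a small NumPy sketch (our own helper, with an optional normalization that yields the 0-to-1 densities plotted in Figure 4, where a softmax head always sits at density 1).

```python
import numpy as np

def attention_density(attn, normalize=False, tol=0.0):
    """Average number (or fraction, if normalize=True) of tokens receiving
    non-zero attention weight per query position.

    attn: array of shape (n_queries, n_keys), rows summing to 1.
    With exact entmax weights tol=0 suffices; a small tol guards against
    floating-point noise.
    """
    counts = (attn > tol).sum(axis=-1)
    if normalize:
        counts = counts / attn.shape[-1]
    return float(counts.mean())

dense = np.full((4, 5), 0.2)                             # softmax-like head
sparse = np.array([[0.7, 0.3, 0.0, 0.0, 0.0]] * 4)       # entmax-like head
print(attention_density(dense), attention_density(sparse))              # 5.0 2.0
print(attention_density(dense, True), attention_density(sparse, True))  # 1.0 0.4
```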

Figure 3: Distribution of learned α values per attention block (encoder self-attention, context attention, decoder self-attention). While the encoder self-attention has a bimodal distribution of values of α, the decoder self-attention and context attention have a single mode.

Figure 4: Distribution of attention densities (average number of tokens receiving non-zero attention weight) for all attention heads and all validation sentences, for 1.5-entmax and α-entmax. When compared to 1.5-entmax, α-entmax distributes the sparsity in a more uniform manner, with a clear mode at fully dense attentions, corresponding to the heads with low α. In the softmax case, this distribution would lead to a single bar with density 1.

Head diversity. To measure the overall disagreement between attention heads, as a measure of head diversity, we use the following generalization of the Jensen-Shannon divergence:

\[ \mathrm{JS} = H^S\!\left( \frac{1}{H} \sum_{j=1}^{H} p_j \right) - \frac{1}{H} \sum_{j=1}^{H} H^S(p_j) \]  (11)

where p_j is the vector of attention weights assigned by head j to each word in the sequence, and H^S is the Shannon entropy, base-adjusted based on the dimension of p such that JS ≤ 1. We average this measure over the entire validation set. The higher this metric is, the more the heads are taking different roles in the model.
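Equation 11 can be evaluated per query position directly from the stacked head distributions. The NumPy sketch below uses entropies with log base equal to the sequence length, which is one way to realize the base adjustment described above; it is our illustration, not the authors' evaluation script.

```python
import numpy as np

def js_head_divergence(P, eps=1e-12):
    """Generalized Jensen-Shannon divergence between heads (Eq. 11).

    P: array of shape (n_heads, n_tokens); row j is head j's attention
    distribution p_j over the sequence for one query position.
    Entropies use log base n_tokens so that the result is at most 1.
    """
    n_heads, n_tokens = P.shape

    def entropy(p):
        p = np.clip(p, eps, 1.0)
        return -(p * np.log(p)).sum(-1) / np.log(n_tokens)  # base-adjusted

    return float(entropy(P.mean(axis=0)) - entropy(P).mean())

# Identical heads give ~0; two disjoint one-hot heads give the maximum for H = 2.
same = np.tile([0.5, 0.5, 0.0, 0.0], (2, 1))
disjoint = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
print(js_head_divergence(same), js_head_divergence(disjoint))
```

Averaging this quantity over all query positions and validation sentences gives the per-layer values plotted in Figure 6.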
Figure 6 shows that both sparse Transformer variants show more diversity than the traditional softmax one. Interestingly, diversity seems to peak in the middle layers of the encoder self-attention and context attention, while this is not the case for the decoder self-attention.

The statistics shown in this section can be found for the other language pairs in Appendix A.

5.2 Identifying Head Specializations

Previous work pointed out some specific roles played by different heads in the softmax Transformer model (Voita et al., 2018; Tang et al., 2018; Voita et al., 2019). Identifying the specialization of a head can be done by observing the type of tokens or sequences to which the head often assigns most of its attention weight; this is facilitated by sparsity.

Positional heads. One particular type of head, as noted by Voita et al. (2019), is the positional head. These heads tend to focus their attention on either the previous or next token in the sequence, thus obtaining representations of the neighborhood of the current time step. In Figure 7, we show attention plots for such heads, found for each of the studied models. The sparsity of our models allows these heads to be more confident in their representations, by assigning the whole probability distribution to a single token in the sequence. Concretely, we may measure a positional head's confidence as the average attention weight assigned to the previous token. The softmax model has three heads for position −1, with median confidence 93.5%. The 1.5-entmax model also has three heads for this position, with median confidence 94.4%. The adaptive model has four heads, with median confidence 95.9%, the lowest-confidence head being dense with α = 1.18, and the highest-confidence head being sparse (α = 1.91).
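The positional-head confidence defined above is simply the average weight a head puts on the token at a fixed offset from the query. A small sketch, assuming self-attention matrices indexed as [query, key]:

```python
import numpy as np

def positional_confidence(attn, offset=-1):
    """Average attention weight a head assigns to the token at `offset`
    relative to each query position (offset = -1: previous token).

    attn: (seq_len, seq_len) self-attention matrix, rows = query positions.
    Query positions whose offset falls outside the sentence are skipped.
    """
    n = attn.shape[0]
    weights = [attn[q, q + offset] for q in range(n) if 0 <= q + offset < n]
    return float(np.mean(weights))

# A head that always attends to the previous token has confidence 1.0.
prev = np.eye(5, k=-1); prev[0, 0] = 1.0   # row 0 has no previous token
print(positional_confidence(prev, offset=-1))
```

The numbers reported above correspond to this quantity averaged over the validation data, with the median then taken across the heads found for each position.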

Figure 5: Head density per layer for fixed and learned α, in the encoder self-attention, context attention, and decoder self-attention. Each line corresponds to an attention head; lower values mean that that attention head is sparser. Learned α has higher variance.

Figure 6: Jensen-Shannon divergence between heads at each layer, for softmax, 1.5-entmax, and α-entmax. This measures the disagreement between heads: the higher the value, the more the heads are disagreeing with each other in terms of where to attend. Models using sparse entmax have more diverse attention than the softmax baseline.

For position +1, the models each dedicate one head, with confidence around 95%, slightly higher for entmax. The adaptive model sets α = 1.96 for this head.

BPE-merging head. Due to the sparsity of our models, we are able to identify other head specializations, easily identifying which heads should be further analysed. In Figure 8 we show one such head where the α value is particularly high (in the encoder, layer 1, head 4, depicted in Figure 2). We found that this head most often looks at the current time step with high confidence, making it a positional head with offset 0. However, this head often spreads weight sparsely over 2-3 neighboring tokens, when the tokens are part of the same BPE cluster² or hyphenated words. As this head is in the first layer, it provides a useful service to the higher layers by combining information evenly within some BPE clusters.

² BPE-segmented words are denoted by ∼ in the figures.

For each BPE cluster or cluster of hyphenated words, we computed a score between 0 and 1 that corresponds to the maximum attention mass assigned by any token to the rest of the tokens inside the cluster, in order to quantify the BPE-merging capabilities of these heads.³ No attention head in the softmax model is able to obtain a score over 80%, while for 1.5-entmax and α-entmax there are two heads in each (83.3% and 85.6% for 1.5-entmax, and 88.5% and 89.8% for α-entmax).

³ If the cluster has size 1, the score is the weight the token assigns to itself.
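The BPE-merging score just described can be computed per cluster from a head's self-attention matrix; a small NumPy sketch with a toy cluster (the attention values are made up for illustration, and the aggregation of per-cluster scores into one score per head is left out):

```python
import numpy as np

def bpe_merge_score(attn, cluster):
    """Score in [0, 1] for one BPE (or hyphenated-word) cluster: the maximum
    attention mass that any token of the cluster assigns to the *other*
    tokens of the same cluster (to itself, if the cluster has size 1).

    attn: (seq_len, seq_len) self-attention matrix, rows = query positions.
    cluster: list of token indices belonging to the cluster.
    """
    if len(cluster) == 1:
        q = cluster[0]
        return float(attn[q, q])
    scores = []
    for q in cluster:
        others = [k for k in cluster if k != q]
        scores.append(attn[q, others].sum())
    return float(max(scores))

# Toy example: tokens 1 and 2 form one two-piece BPE cluster.
attn = np.array([[1.0, 0.0, 0.0],
                 [0.0, 0.1, 0.9],   # token 1 sends 0.9 of its mass to token 2
                 [0.0, 0.6, 0.4]])
print(bpe_merge_score(attn, [1, 2]))   # 0.9
```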
Interrogation head. On the other hand, in Figure 9 we show a head for which our adaptively sparse model chose an α close to 1, making it closer to softmax (also shown in encoder, layer 1, head 3, depicted in Figure 2). We observe that this head assigns a high probability to question marks at the end of the sentence in time steps where the current token is interrogative, thus making it an interrogation-detecting head. We also observe this type of head in the other models, which we also depict in Figure 9. The average attention weight placed on the question mark when the current token is an interrogative word is 98.5% for softmax, 97.0% for 1.5-entmax, and 99.5% for α-entmax.

Furthermore, we can examine sentences where some tendentially sparse heads become less so, thus identifying sources of ambiguity where the head is less confident in its prediction. An example is shown in Figure 10, where sparsity in the same head differs for sentences of similar length.

Figure 7: Self-attention from the most confidently previous-position head in each model. The learned parameter in the α-entmax model is α = 1.91. Besides being quantitatively more confident, visual inspection confirms that the adaptive head behaves more consistently.

Figure 8: BPE-merging head (α = 1.91) discovered in the α-entmax model. Found in the first encoder layer, this head learns to discover some subword units and combine their information, leaving most words intact. It places 99.09% of its probability mass within the same BPE cluster as the current token: more than any head in any other model.

Figure 9: Interrogation-detecting heads in the three models. The top sentence is interrogative while the bottom one is declarative but includes the interrogative word "what". In the top example, these interrogation heads assign a high probability to the question mark in the time step of the interrogative word (with ≥ 97.0% probability), while in the bottom example, since there is no question mark, the same head does not assign a high probability to the last token in the sentence during the interrogative word time step. Surprisingly, this head prefers a low α = 1.05, as can be seen from the dense weights. This allows the head to identify the noun phrase "Armani Polo" better.

6 Related Work

Sparse attention. Prior work has developed sparse attention mechanisms, including applications to NMT (Martins and Astudillo, 2016; Malaviya et al., 2018; Niculae and Blondel, 2017; Shao et al., 2019; Maruf et al., 2019). Peters et al. (2019) introduced the entmax function this work builds upon. In their work, there is a single attention mechanism which is controlled by a fixed α. In contrast, this is the first work to allow such attention mappings to dynamically adapt their curvature and sparsity, by automatically adjusting the continuous α parameter. We also provide the first results using sparse attention in a Transformer model.

Fixed sparsity patterns. Recent research improves the scalability of Transformer-like networks through static, fixed sparsity patterns (Child et al., 2019; Wu et al., 2019). Our adaptively-sparse Transformer can dynamically select a sparsity pattern that finds relevant words regardless of their position (e.g., Figure 9). Moreover, the two strategies could be combined. In a concurrent line of research, Sukhbaatar et al. (2019) propose an adaptive attention span for Transformer language models. While their work has each head learn a different contiguous span of context tokens to attend to, our work finds different sparsity patterns in the same span. Interestingly, some of their findings mirror ours – we found that attention heads in the last layers tend to be denser on average when compared to the ones in the first layers, while their work has found that lower layers tend to have a shorter attention span compared to higher layers.

Figure 10: Example of two sentences of similar length where the same head (α = 1.33) exhibits different sparsity. The longer phrase in the example on the right, "a sexually transmitted disease", is handled with higher confidence, leading to more sparsity.

Transformer interpretability. The original Transformer paper (Vaswani et al., 2017) shows attention visualizations, from which some speculation can be made of the roles the several attention heads have. Mareček and Rosa (2018) study the syntactic abilities of the Transformer self-attention, while Raganato and Tiedemann (2018) extract dependency relations from the attention weights. Tenney et al. (2019) find that the self-attentions in BERT (Devlin et al., 2019) follow a sequence of processes that resembles a classical NLP pipeline. Regarding redundancy of heads, Voita et al. (2019) develop a method that is able to prune heads of the multi-head attention module and make an empirical study of the role that each head has in self-attention (positional, syntactic and rare words). Li et al. (2018) also aim to reduce head redundancy by adding a regularization term to the loss that maximizes head disagreement, and obtain improved results. While not considering Transformer attentions, Jain and Wallace (2019) show that traditional attention mechanisms do not necessarily improve interpretability, since softmax attention is vulnerable to an adversarial attack leading to wildly different model predictions for the same attention weights. Sparse attention may mitigate these issues; however, our work focuses mostly on a more mechanical aspect of interpretation by analyzing head behavior, rather than on explanations for predictions.

7 Conclusion and Future Work

We contribute a novel strategy for adaptively sparse attention, and, in particular, for adaptively sparse Transformers. We present the first empirical analysis of Transformers with sparse attention mappings (i.e., entmax), showing potential in both translation accuracy as well as in model interpretability.

In particular, we analyzed how the attention heads in the proposed adaptively sparse Transformer can specialize more and with higher confidence. Our adaptivity strategy relies only on gradient-based optimization, side-stepping costly per-head hyper-parameter searches. Further speed-ups are possible by leveraging more parallelism in the bisection algorithm for computing α-entmax.

Finally, some of the automatically-learned behaviors of our adaptively sparse Transformers – for instance, the near-deterministic positional heads or the subword joining head – may provide new ideas for designing static variations of the Transformer.

Acknowledgments

This work was supported by the European Research Council (ERC StG DeepSPIN 758969), and by the Fundação para a Ciência e Tecnologia through contracts UID/EEA/50008/2019 and CMUPERI/TIC/0046/2014 (GoLocal). We are grateful to Ben Peters for the α-entmax code, and to Erick Fonseca, Marcos Treviso, Pedro Martins, and Tsvetomila Mihaylova for insightful group discussion. We thank Mathieu Blondel for the idea to learn α. We would also like to thank the anonymous reviewers for their helpful feedback.

References

Brandon Amos and J. Zico Kolter. 2017. OptNet: Differentiable optimization as a layer in neural networks. In Proc. ICML.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. ICLR.

Mathieu Blondel, André F.T. Martins, and Vlad Niculae. 2019. Learning classifiers with Fenchel-Young losses: Generalized entropies, margins, and algorithms. In Proc. AISTATS.

Ondrej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, et al. 2014. Findings of the 2014 workshop on statistical machine translation. In Proc. Workshop on Statistical Machine Translation.

Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, et al. 2016. Findings of the 2016 conference on machine translation. In Proc. WMT.

M. Cettolo, M. Federico, L. Bentivogli, J. Niehues, S. Stüker, K. Sudoh, K. Yoshino, and C. Federmann. 2017. Overview of the IWSLT 2017 evaluation campaign. In Proc. IWSLT.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse Transformers. preprint arXiv:1904.10509.

Frank H. Clarke. 1990. Optimization and Nonsmooth Analysis. SIAM.

Yuntian Deng, Yoon Kim, Justin Chiu, Demi Guo, and Alexander Rush. 2018. Latent alignment and variational attention. In Proc. NeurIPS.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL-HLT.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proc. ICML.

Stephen Gould, Basura Fernando, Anoop Cherian, Peter Anderson, Rodrigo Santa Cruz, and Edison Guo. 2016. On differentiating parameterized argmin and argmax problems with application to bi-level optimization. preprint arXiv:1607.05447.

Michael Held, Philip Wolfe, and Harlan P. Crowder. 1974. Validation of subgradient optimization. Mathematical Programming, 6(1):62–88.

Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. In Proc. NAACL-HLT.

Marcin Junczys-Dowmunt, Kenneth Heafield, Hieu Hoang, Roman Grundkiewicz, and Anthony Aue. 2018. Marian: Cost-effective high-quality neural machine translation in C++. In Proc. WNMT.

Jian Li, Zhaopeng Tu, Baosong Yang, Michael R. Lyu, and Tong Zhang. 2018. Multi-head attention with disagreement regularization. In Proc. EMNLP.

Christos Louizos, Max Welling, and Diederik P. Kingma. 2018. Learning sparse neural networks through L0 regularization. In Proc. ICLR.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proc. EMNLP.

Chaitanya Malaviya, Pedro Ferreira, and André F.T. Martins. 2018. Sparse and constrained attention for neural machine translation. In Proc. ACL.

David Mareček and Rudolf Rosa. 2018. Extracting syntactic trees from Transformer encoder self-attentions. In Proc. BlackboxNLP.

André F.T. Martins and Ramón Fernandez Astudillo. 2016. From softmax to sparsemax: A sparse model of attention and multi-label classification. In Proc. ICML.

Sameen Maruf, André F.T. Martins, and Gholamreza Haffari. 2019. Selective attention for context-aware neural machine translation. preprint arXiv:1903.08788.

Graham Neubig. 2011. The Kyoto free translation task. http://www.phontron.com/kftt.

Vlad Niculae and Mathieu Blondel. 2017. A regularized framework for sparse and structured neural attention. In Proc. NeurIPS.

Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling neural machine translation. In Proc. WMT.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proc. ACL.

Ben Peters, Vlad Niculae, and André F.T. Martins. 2019. Sparse sequence-to-sequence models. In Proc. ACL.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. preprint.

Alessandro Raganato and Jörg Tiedemann. 2018. An analysis of encoder representations in Transformer-based machine translation. In Proc. BlackboxNLP.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proc. ACL.

Wenqi Shao, Tianjian Meng, Jingyu Li, Ruimao Zhang, Yudian Li, Xiaogang Wang, and Ping Luo. 2019. SSN: Learning sparse switchable normalization via SparsestMax. In Proc. CVPR.

Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. 2019. Adaptive attention span in Transformers. In Proc. ACL.

Gongbo Tang, Mathias Müller, Annette Rios, and Rico Sennrich. 2018. Why self-attention? A targeted evaluation of neural machine translation architectures. In Proc. EMNLP.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proc. ACL.

Constantino Tsallis. 1988. Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52:479–487.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proc. NeurIPS.

Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. 2018. Context-aware neural machine translation learns anaphora resolution. In Proc. ACL.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proc. ACL.

Felix Wu, Angela Fan, Alexei Baevski, Yann N. Dauphin, and Michael Auli. 2019. Pay less attention with lightweight and dynamic convolutions. In Proc. ICLR.

