Abstract
Attention mechanisms have become ubiquitous …
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natural Language Processing, pages 2174–2184,
Hong Kong, China, November 3–7, 2019.
©2019 Association for Computational Linguistics
…Transformers (Child et al., 2019) and adaptive span Transformers (Sukhbaatar et al., 2019). However, the "sparsity" of those models only limits the attention to a contiguous span of past tokens, while in this work we propose a highly adaptive Transformer model that is capable of attending to a sparse set of words that are not necessarily contiguous. Figure 1 shows the relationship of these methods with ours.

Our contributions are the following:

• We introduce sparse attention into the Transformer architecture, showing that it eases interpretability and leads to slight accuracy gains.

• We propose an adaptive version of sparse attention, where the shape of each attention head is learnable and can vary continuously and dynamically between the dense limit case of softmax and the sparse, piecewise-linear sparsemax case.¹

• We make an extensive analysis of the added interpretability of these models, identifying both crisper examples of attention head behavior observed in previous work, as well as novel behaviors unraveled thanks to the sparsity and adaptivity of our proposed model.

2 Background

2.1 The Transformer

In NMT, the Transformer (Vaswani et al., 2017) is a sequence-to-sequence (seq2seq) model which maps an input sequence to an output sequence through hierarchical multi-head attention mechanisms, yielding a dynamic, context-dependent strategy for propagating information within and across sentences. It contrasts with previous seq2seq models, which usually rely either on costly gated recurrent operations (often LSTMs: Bahdanau et al., 2015; Luong et al., 2015) or static convolutions (Gehring et al., 2017).

Given n query contexts and m sequence items under consideration, attention mechanisms compute, for each query, a weighted representation of the items. The particular attention mechanism used in Vaswani et al. (2017) is called scaled dot-product attention, and it is computed in the following way:

    Att(Q, K, V) = π(QK⊤/√d) V,    (1)

where Q ∈ R^{n×d} contains representations of the queries, K, V ∈ R^{m×d} are the keys and values of the items attended over, and d is the dimensionality of these representations. The π mapping normalizes row-wise using softmax, π(Z)_ij = softmax(z_i)_j, where

    softmax(z)_j = exp(z_j) / Σ_{j′} exp(z_{j′}).    (2)

In words, the keys are used to compute a relevance score between each item and query. Then, normalized attention weights are computed using softmax, and these are used to weight the values of each item at each query context.

However, for complex tasks, different parts of a sequence may be relevant in different ways, motivating multi-head attention in Transformers. This is simply the application of Equation 1 in parallel H times, each with a different, learned linear transformation that allows specialization:

    Head_i(Q, K, V) = Att(QW_i^Q, KW_i^K, VW_i^V).    (3)

In the Transformer, there are three separate multi-head attention mechanisms for distinct purposes:

• Encoder self-attention: builds rich, layered representations of each input word, by attending on the entire input sentence.

• Context attention: selects a representative weighted average of the encodings of the input words, at each time step of the decoder.

• Decoder self-attention: attends over the partial output sentence fragment produced so far.

Together, these mechanisms enable the contextualized flow of information between the input sentence and the sequential decoder.

2.2 Sparse Attention

The softmax mapping (Equation 2) is elementwise proportional to exp, therefore it can never assign a weight of exactly zero. Thus, unnecessary items are still taken into consideration to some extent. Since its output sums to one, this invariably means less weight is assigned to the relevant items, potentially harming performance and interpretability (Jain and Wallace, 2019). This has motivated a line of research on learning networks with sparse mappings (Martins and Astudillo, 2016; Niculae and Blondel, 2017; Louizos et al., 2018; Shao et al.,

¹ Code and pip package available at https://github.com/deep-spin/entmax.
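A minimal NumPy sketch of Equations 1–3 (our own toy rendition, not the paper's implementation; for clarity it omits batching, masking, and the final output projection):

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax (Equation 2), shifted by the row max for stability."""
    e = np.exp(Z - Z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention (Equation 1): pi(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    return softmax_rows(Q @ K.T / np.sqrt(d)) @ V

def multi_head(Q, K, V, WQ, WK, WV):
    """Equation 3: H parallel attention calls, one learned projection
    triple (WQ[i], WK[i], WV[i]) per head."""
    return [attention(Q @ wq, K @ wk, V @ wv)
            for wq, wk, wv in zip(WQ, WK, WV)]
```

Each row of the normalized score matrix is a distribution over the m attended items, which is the object the sparse mappings of Section 2.2 replace.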
2019). We focus on a recently-introduced flexible family of transformations, α-entmax (Blondel et al., 2019; Peters et al., 2019), defined as:

    α-entmax(z) := argmax_{p ∈ △^d}  p⊤z + H^T_α(p),    (4)

where △^d := {p ∈ R^d : Σ_i p_i = 1} is the probability simplex, and, for α ≥ 1, H^T_α is the Tsallis continuous family of entropies (Tsallis, 1988):

    H^T_α(p) := (1/(α(α−1))) Σ_j (p_j − p_j^α),   α ≠ 1,
    H^T_α(p) := −Σ_j p_j log p_j,                 α = 1.    (5)

This family contains the well-known Shannon and Gini entropies, corresponding to the cases α = 1 and α = 2, respectively.

Equation 4 involves a convex optimization subproblem. Using the definition of H^T_α, the optimality conditions may be used to derive the following form for the solution (Appendix B.2):

    α-entmax(z) = [(α − 1)z − τ1]_+^{1/(α−1)},    (6)

where [·]_+ is the positive part (ReLU) function, 1 denotes the vector of all ones, and τ – which acts like a threshold – is the Lagrange multiplier corresponding to the Σ_i p_i = 1 constraint.

Properties of α-entmax. The appeal of α-entmax for attention rests on the following properties. For α = 1 (i.e., when H^T_α becomes the Shannon entropy), it exactly recovers the softmax mapping (we provide a short derivation in Appendix B.3). For all α > 1 it permits sparse solutions, in stark contrast to softmax. In particular, for α = 2, it recovers the sparsemax mapping (Martins and Astudillo, 2016), which is piecewise linear. In-between, as α increases, the mapping continuously gets sparser as its curvature changes.

To compute the value of α-entmax, one must find the threshold τ such that the r.h.s. in Equation 6 sums to one. Blondel et al. (2019) propose a general bisection algorithm. Peters et al. (2019) introduce a faster, exact algorithm for α = 1.5, and enable using α-entmax with fixed α within a neural network by showing that the α-entmax Jacobian w.r.t. z for p⋆ = α-entmax(z) is

    ∂ α-entmax(z)/∂z = diag(s) − (1/Σ_j s_j) ss⊤,    (7)

where s_i = (p⋆_i)^{2−α} if p⋆_i > 0, and s_i = 0 if p⋆_i = 0.

Our work furthers the study of α-entmax by providing a derivation of the Jacobian w.r.t. the hyper-parameter α (Section 3), thereby allowing the shape and sparsity of the mapping to be learned automatically. This is particularly appealing in the context of multi-head attention mechanisms, where we shall show in Section 5.1 that different heads tend to learn different sparsity behaviors.

3 Adaptively Sparse Transformers with α-entmax

We now propose a novel Transformer architecture wherein we simply replace softmax with α-entmax in the attention heads. Concretely, we replace the row normalization π in Equation 1 by

    π(Z)_ij = α-entmax(z_i)_j.    (8)

This change leads to sparse attention weights, as long as α > 1; in particular, α = 1.5 is a sensible starting point (Peters et al., 2019).

Different α per head. Unlike LSTM-based seq2seq models, where α can be more easily tuned by grid search, in a Transformer, there are many attention heads in multiple layers. Crucial to the power of such models, the different heads capture different linguistic phenomena, some of them isolating important words, others spreading out attention across phrases (Vaswani et al., 2017, Figure 5). This motivates using different, adaptive α values for each attention head, such that some heads may learn to be sparser, and others may become closer to softmax. We propose doing so by treating the α values as neural network parameters, optimized via stochastic gradients along with the other weights.

Derivatives w.r.t. α. In order to optimize α automatically via gradient methods, we must compute the Jacobian of the entmax output w.r.t. α. Since entmax is defined through an optimization problem, this is non-trivial and cannot be simply handled through automatic differentiation; it falls within the domain of argmin differentiation, an active research topic in optimization (Gould et al., 2016; Amos and Kolter, 2017).

One of our key contributions is the derivation of a closed-form expression for this Jacobian. The next proposition provides such an expression, enabling entmax layers with adaptive α. To the best of our knowledge, ours is the first neural network module that can automatically, continuously vary in shape away from softmax and toward sparse mappings like sparsemax.

Proposition 1. Let p⋆ := α-entmax(z) be the solution of Equation 4. Denote the distribution p̃_i := (p⋆_i)^{2−α} / Σ_j (p⋆_j)^{2−α} and let h_i := −p⋆_i log p⋆_i. The ith component of the Jacobian g := ∂ α-entmax(z)/∂α is

    g_i = (p⋆_i − p̃_i)/(α − 1)² + (h_i − p̃_i Σ_j h_j)/(α − 1),    α > 1,
    g_i = (h_i log p⋆_i − p⋆_i Σ_j h_j log p⋆_j)/2,                α = 1.    (9)

The proof uses implicit function differentiation and is given in Appendix C.

Proposition 1 provides the remaining missing piece needed for training adaptively sparse Transformers. In the following section, we evaluate this strategy on neural machine translation, and analyze the behavior of the learned attention heads.

4 Experiments

We apply our adaptively sparse Transformers on four machine translation tasks. For comparison, a natural baseline is the standard Transformer architecture using the softmax transform in its multi-head attention mechanisms. We consider two other model variants in our experiments that make use of different normalizing transformations:

• 1.5-entmax: a Transformer with sparse entmax attention with fixed α = 1.5 for all heads. This is a novel model, since 1.5-entmax had only been proposed for RNN-based NMT models (Peters et al., 2019), but never in Transformers, where attention modules are not just one single component of the seq2seq model but rather an integral part of all of the model components.

• α-entmax: an adaptive Transformer with sparse entmax attention with a different, learned α_{i,j}^t for each head.

The adaptive model has an additional scalar parameter per attention head per layer for each of the three attention mechanisms (encoder self-attention, context attention, and decoder self-attention), i.e.,

    a_{i,j}^t ∈ R : i ∈ {1, ..., L}, j ∈ {1, ..., H}, t ∈ {enc, ctx, dec},    (10)

and we set α_{i,j}^t = 1 + sigmoid(a_{i,j}^t) ∈ ]1, 2[. All or some of the α values can be tied if desired, but we keep them independent for analysis purposes.

Datasets. Our models were trained on 4 machine translation datasets of different training sizes:

• IWSLT 2017 German → English (DE→EN, Cettolo et al., 2017): 200K sentence pairs.

• KFTT Japanese → English (JA→EN, Neubig, 2011): 300K sentence pairs.

• WMT 2016 Romanian → English (RO→EN, Bojar et al., 2016): 600K sentence pairs.

• WMT 2014 English → German (EN→DE, Bojar et al., 2014): 4.5M sentence pairs.

All of these datasets were preprocessed with byte-pair encoding (BPE; Sennrich et al., 2016), using joint segmentations of 32k merge operations.

Training. We follow the dimensions of the Transformer-Base model of Vaswani et al. (2017): the number of layers is L = 6 and the number of heads is H = 8 in the encoder self-attention, the context attention, and the decoder self-attention. We use a mini-batch size of 8192 tokens and warm up the learning rate linearly until 20k steps, after which it decays according to an inverse square root schedule. All models were trained until convergence of validation accuracy, and evaluation was done at each 10k steps for RO→EN and EN→DE and at each 5k steps for DE→EN and JA→EN. The end-to-end computational overhead of our methods, when compared to standard softmax, is relatively small; in training tokens per second, the models using α-entmax and 1.5-entmax are, respectively, 75% and 90% the speed of the softmax model.

Results. We report test set tokenized BLEU (Papineni et al., 2002) results in Table 1. We can see that replacing softmax by entmax does not hurt performance in any of the datasets; indeed, sparse attention Transformers tend to have slightly higher BLEU, but their sparsity leads to a better potential for analysis. In the next section, we make use of this potential by exploring the learned internal mechanics of the self-attention heads.

5 Analysis

We conduct an analysis for the higher-resource dataset WMT 2014 English → German of the attention in the sparse adaptive Transformer model (α-entmax) at multiple levels: we analyze high-level statistics as well as individual head behavior. Moreover, we make a qualitative analysis of the interpretability capabilities of our models.
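As a hedged illustration (our own NumPy sketch, not the released entmax package; names are ours), Equation 6 can be solved by bisection on the threshold τ, and the α > 1 case of Proposition 1 can then be checked against a central finite difference:

```python
import numpy as np

def entmax_bisect(z, alpha, n_iter=80):
    """alpha-entmax (Equation 6) via bisection on the threshold tau."""
    z = np.asarray(z, dtype=float)
    if np.isclose(alpha, 1.0):
        # alpha = 1 recovers softmax exactly.
        e = np.exp(z - z.max())
        return e / e.sum()
    zs = (alpha - 1.0) * z
    # The largest entry of [(alpha-1)z - tau]_+ must lie in (0, 1] for the
    # probabilities to sum to one, which brackets tau as follows.
    lo, hi = zs.max() - 1.0, zs.max()
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        p = np.clip(zs - tau, 0.0, None) ** (1.0 / (alpha - 1.0))
        lo, hi = (tau, hi) if p.sum() >= 1.0 else (lo, tau)
    p = np.clip(zs - 0.5 * (lo + hi), 0.0, None) ** (1.0 / (alpha - 1.0))
    return p / p.sum()  # absorb the residual bisection error

def entmax_grad_alpha(p, alpha):
    """d alpha-entmax(z) / d alpha, case alpha > 1 of Proposition 1.
    Off-support coordinates contribute zero (h_i = s_i = 0 there)."""
    supp = p > 0.0
    h = np.where(supp, -p * np.log(np.where(supp, p, 1.0)), 0.0)
    s = np.where(supp, p ** (2.0 - alpha), 0.0)
    p_tilde = s / s.sum()
    return ((p - p_tilde) / (alpha - 1.0) ** 2
            + (h - p_tilde * h.sum()) / (alpha - 1.0))
```

For z = (1.0, 0.5, −2.0) and α = 1.5, the output places exactly zero weight on the last coordinate, unlike softmax, and the closed-form gradient agrees with the finite difference. The per-head parametrization of Equation 10, α = 1 + sigmoid(a), keeps each learned α inside ]1, 2[ during optimization.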
Table 1: Machine translation tokenized BLEU test results on IWSLT 2017 DE→EN, KFTT JA→EN, WMT 2016 RO→EN and WMT 2014 EN→DE, respectively.
Figure 3: Distribution of learned α values per attention block. While the encoder self-attention has a bimodal distribution of values of α, the decoder self-attention and context attention have a single mode.

Figure 4: Distribution of attention densities (average number of tokens receiving non-zero attention weight) for all attention heads and all validation sentences. When compared to 1.5-entmax, α-entmax distributes the sparsity in a more uniform manner, with a clear mode at fully dense attentions, corresponding to the heads with low α. In the softmax case, this distribution would lead to a single bar with density 1.

… diversity, we use the following generalization of the Jensen-Shannon divergence:

    JS = H^S( (1/H) Σ_{j=1}^H p_j ) − (1/H) Σ_{j=1}^H H^S(p_j),    (11)

where p_j is the vector of attention weights assigned by head j to each word in the sequence, and H^S is the Shannon entropy, base-adjusted based on the dimension of p such that JS ≤ 1. We average this measure over the entire validation set. The higher this metric is, the more the heads are taking different roles in the model.

Figure 6 shows that both sparse Transformer variants show more diversity than the traditional softmax one. Interestingly, diversity seems to peak in the middle layers of the encoder self-attention and context attention, while this is not the case for the decoder self-attention.

The statistics shown in this section can be found for the other language pairs in Appendix A.

5.2 Identifying Head Specializations

Previous work pointed out some specific roles played by different heads in the softmax Transformer model (Voita et al., 2018; Tang et al., 2018; Voita et al., 2019). Identifying the specialization of a head can be done by observing the type of tokens or sequences to which the head often assigns most of its attention weight; this is facilitated by sparsity.

Positional heads. One particular type of head, as noted by Voita et al. (2019), is the positional head. These heads tend to focus their attention on either the previous or next token in the sequence, thus obtaining representations of the neighborhood of the current time step. In Figure 7, we show attention plots for such heads, found for each of the studied models. The sparsity of our models allows these heads to be more confident in their representations, by assigning the whole probability distribution to a single token in the sequence. Concretely, we may measure a positional head's confidence as the average attention weight assigned to the previous token. The softmax model has three heads for position −1, with median confidence 93.5%. The 1.5-entmax model also has three heads for this position, with median confidence 94.4%. The adaptive model has four heads, with median confidence 95.9%; the lowest-confidence head is dense with α = 1.18, while the highest-confidence head is sparse (α = 1.91).
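For concreteness, Equation 11 can be sketched as below (our own minimal rendition; we read the "base adjustment" as dividing by log of the sequence length, which is equivalent to taking entropies in base n and guarantees JS ≤ 1):

```python
import numpy as np

def shannon_entropy(p):
    """H^S(p) in nats; zero entries contribute nothing."""
    p = p[p > 0.0]
    return float(-(p * np.log(p)).sum())

def head_diversity_js(P):
    """Equation 11 for one query position: P is an (H, n) array whose
    rows are the H heads' attention distributions over n tokens."""
    H, n = P.shape
    mixture = shannon_entropy(P.mean(axis=0))   # entropy of the mean head
    average = sum(shannon_entropy(p) for p in P) / H  # mean head entropy
    return (mixture - average) / np.log(n)      # base adjustment: JS <= 1
```

Identical heads give JS = 0, while H = n heads attending to pairwise disjoint tokens give JS = 1, the two extremes of head diversity.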
Figure 5: Head density per layer for fixed and learned α. Each line corresponds to an attention head; lower values mean that that attention head is sparser. Learned α has higher variance.

Figure 6: Jensen-Shannon divergence between heads at each layer. Measures the disagreement between heads: the higher the value, the more the heads are disagreeing with each other in terms of where to attend. Models using sparse entmax have more diverse attention than the softmax baseline.
Figure 7: Self-attention from the most confidently previous-position head in each model. The learned parameter in the α-entmax model is α = 1.91. Not only is it quantitatively more confident: visual inspection confirms that the adaptive head behaves more consistently.
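The confidence measure used for these heads (average attention weight on the previous token) is easy to read off a head's attention matrix; a small sketch under our own conventions, where rows index queries and columns index keys:

```python
import numpy as np

def previous_token_confidence(attn):
    """Mean weight that a head assigns to position -1 over a sentence.
    The first token is skipped, since it has no previous token."""
    n = attn.shape[0]
    prev = [attn[i, i - 1] for i in range(1, n)]
    return float(np.mean(prev)) if prev else 0.0
```

A perfectly deterministic previous-position head scores 1.0; a uniform head over n tokens scores 1/n.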
Figure 8: BPE-merging head (α = 1.91) discovered in the α-entmax model. Found in the first encoder layer, this head learns to discover some subword units and combine their information, leaving most words intact. It places 99.09% of its probability mass within the same BPE cluster as the current token: more than any head in any other model.

Figure 9 (partial caption): … this head prefers a low α = 1.05, as can be seen from the dense weights. This allows the head to identify the noun phrase "Armani Polo" better.

… is less confident in its prediction. An example is shown in Figure 10, where sparsity in the same head differs for sentences of similar length.

6 Related Work

Sparse attention. Prior work has developed sparse attention mechanisms, including applications to NMT (Martins and Astudillo, 2016; Malaviya et al., 2018; Niculae and Blondel, 2017; Shao et al., 2019; Maruf et al., 2019). Peters et al. (2019) introduced the entmax function this work builds upon. In their work, there is a single attention mechanism which is controlled by a fixed α. In contrast, this is the first work to allow such attention mappings to dynamically adapt their curvature and sparsity, by automatically adjusting the continuous α parameter. We also provide the first results using sparse attention in a Transformer model.

Fixed sparsity patterns. Recent research improves the scalability of Transformer-like networks through static, fixed sparsity patterns (Child et al., 2019; Wu et al., 2019). Our adaptively-sparse Transformer can dynamically select a sparsity pattern that finds relevant words regardless of their position (e.g., Figure 9). Moreover, the two strategies could be combined. In a concurrent line of research, Sukhbaatar et al. (2019) propose an adaptive attention span for Transformer language models. While their work has each head learn a different contiguous span of context tokens to attend to, our work finds different sparsity patterns in the same span. Interestingly, some of their findings mirror ours – we found that attention heads in the last layers tend to be denser on average when compared to the ones in the first layers, while their work has found that lower layers tend to have a shorter attention span compared to higher layers.
Figure 10: Example of two sentences of similar length where the same head (α = 1.33) exhibits different sparsity. The longer phrase in the example on the right, "a sexually transmitted disease", is handled with higher confidence, leading to more sparsity.

Transformer interpretability. The original Transformer paper (Vaswani et al., 2017) shows attention visualizations, from which some speculation can be made of the roles the several attention heads have. Mareček and Rosa (2018) study the syntactic abilities of the Transformer self-attention, while Raganato and Tiedemann (2018) extract dependency relations from the attention weights. Tenney et al. (2019) find that the self-attentions in BERT (Devlin et al., 2019) follow a sequence of processes that resembles a classical NLP pipeline. Regarding redundancy of heads, Voita et al. (2019) develop a method that is able to prune heads of the multi-head attention module and make an empirical study of the role that each head has in self-attention (positional, syntactic and rare words). Li et al. (2018) also aim to reduce head redundancy by adding a regularization term to the loss that maximizes head disagreement, and obtain improved results. While not considering Transformer attentions, Jain and Wallace (2019) show that traditional attention mechanisms do not necessarily improve interpretability, since softmax attention is vulnerable to an adversarial attack leading to wildly different model predictions for the same attention weights. Sparse attention may mitigate these issues; however, our work focuses mostly on a more mechanical aspect of interpretation by analyzing head behavior, rather than on explanations for predictions.

7 Conclusion and Future Work

We contribute a novel strategy for adaptively sparse attention, and, in particular, for adaptively sparse Transformers. We present the first empirical analysis of Transformers with sparse attention mappings (i.e., entmax), showing potential in both translation accuracy as well as in model interpretability.

In particular, we analyzed how the attention heads in the proposed adaptively sparse Transformer can specialize more and with higher confidence. Our adaptivity strategy relies only on gradient-based optimization, side-stepping costly per-head hyper-parameter searches. Further speed-ups are possible by leveraging more parallelism in the bisection algorithm for computing α-entmax.

Finally, some of the automatically-learned behaviors of our adaptively sparse Transformers – for instance, the near-deterministic positional heads or the subword-joining head – may provide new ideas for designing static variations of the Transformer.

Acknowledgments

This work was supported by the European Research Council (ERC StG DeepSPIN 758969), and by the Fundação para a Ciência e Tecnologia through contracts UID/EEA/50008/2019 and CMUPERI/TIC/0046/2014 (GoLocal). We are grateful to Ben Peters for the α-entmax code and Erick Fonseca, Marcos Treviso, Pedro Martins, and Tsvetomila Mihaylova for insightful group discussion. We thank Mathieu Blondel for the idea to learn α. We would also like to thank the anonymous reviewers for their helpful feedback.

References

Brandon Amos and J. Zico Kolter. 2017. OptNet: Differentiable optimization as a layer in neural networks. In Proc. ICML.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. ICLR.

Mathieu Blondel, André FT Martins, and Vlad Niculae. 2019. Learning classifiers with Fenchel-Young losses: Generalized entropies, margins, and algorithms. In Proc. AISTATS.

Ondrej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, et al. 2014. Findings of the 2014 workshop on statistical machine translation. In Proc. Workshop on Statistical Machine Translation.

Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, et al. 2016. Findings of the 2016 conference on machine translation. In Proc. WMT.

M Cettolo, M Federico, L Bentivogli, J Niehues, S Stüker, K Sudoh, K Yoshino, and C Federmann. 2017. Overview of the IWSLT 2017 evaluation campaign. In Proc. IWSLT.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse Transformers. preprint arXiv:1904.10509.

Frank H Clarke. 1990. Optimization and Nonsmooth Analysis. SIAM.

Yuntian Deng, Yoon Kim, Justin Chiu, Demi Guo, and Alexander Rush. 2018. Latent alignment and variational attention. In Proc. NeurIPS.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL-HLT.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017. Convolutional sequence to sequence learning. In Proc. ICML.

Stephen Gould, Basura Fernando, Anoop Cherian, Peter Anderson, Rodrigo Santa Cruz, and Edison Guo. 2016. On differentiating parameterized argmin and argmax problems with application to bi-level optimization. preprint arXiv:1607.05447.

Michael Held, Philip Wolfe, and Harlan P Crowder. 1974. Validation of subgradient optimization. Mathematical Programming, 6(1):62–88.

Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. In Proc. NAACL-HLT.

Marcin Junczys-Dowmunt, Kenneth Heafield, Hieu Hoang, Roman Grundkiewicz, and Anthony Aue. 2018. Marian: Cost-effective high-quality neural machine translation in C++. In Proc. WNMT.

Jian Li, Zhaopeng Tu, Baosong Yang, Michael R Lyu, and Tong Zhang. 2018. Multi-head attention with disagreement regularization. In Proc. EMNLP.

Christos Louizos, Max Welling, and Diederik P Kingma. 2018. Learning sparse neural networks through L0 regularization. In Proc. ICLR.

Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. In Proc. EMNLP.

Chaitanya Malaviya, Pedro Ferreira, and André FT Martins. 2018. Sparse and constrained attention for neural machine translation. In Proc. ACL.

David Mareček and Rudolf Rosa. 2018. Extracting syntactic trees from Transformer encoder self-attentions. In Proc. BlackboxNLP.

André FT Martins and Ramón Fernandez Astudillo. 2016. From softmax to sparsemax: A sparse model of attention and multi-label classification. In Proc. ICML.

Sameen Maruf, André FT Martins, and Gholamreza Haffari. 2019. Selective attention for context-aware neural machine translation. preprint arXiv:1903.08788.

Graham Neubig. 2011. The Kyoto free translation task. http://www.phontron.com/kftt.

Vlad Niculae and Mathieu Blondel. 2017. A regularized framework for sparse and structured neural attention. In Proc. NeurIPS.

Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling neural machine translation. In Proc. WMT.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proc. ACL.

Ben Peters, Vlad Niculae, and André FT Martins. 2019. Sparse sequence-to-sequence models. In Proc. ACL.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. preprint.

Alessandro Raganato and Jörg Tiedemann. 2018. An analysis of encoder representations in Transformer-based machine translation. In Proc. BlackboxNLP.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proc. ACL.

Wenqi Shao, Tianjian Meng, Jingyu Li, Ruimao Zhang, Yudian Li, Xiaogang Wang, and Ping Luo. 2019. SSN: Learning sparse switchable normalization via SparsestMax. In Proc. CVPR.

Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. 2019. Adaptive attention span in Transformers. In Proc. ACL.

Gongbo Tang, Mathias Müller, Annette Rios, and Rico Sennrich. 2018. Why self-attention? A targeted evaluation of neural machine translation architectures. In Proc. EMNLP.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proc. ACL.

Constantino Tsallis. 1988. Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52:479–487.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proc. NeurIPS.

Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. 2018. Context-aware neural machine translation learns anaphora resolution. In Proc. ACL.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proc. ACL.

Felix Wu, Angela Fan, Alexei Baevski, Yann N Dauphin, and Michael Auli. 2019. Pay less attention with lightweight and dynamic convolutions. In Proc. ICLR.