Rethinking Self-Attention: Towards Interpretability in Neural Parsing
Findings of the Association for Computational Linguistics: EMNLP 2020, pages 731–742, November 16–20, 2020. ©2020 Association for Computational Linguistics

Abstract

[...] currently widely used; however, interpretability is difficult due to the numerous attention distributions. [...] information, while facilitating interpretation of predictions. We introduce the Label Attention Layer: a new form of self-attention where attention heads represent labels. We test our novel layer by running constituency and dependency parsing experiments and show [...]
[Figure annotations: Example Input — the Label Attention Layer takes word vectors as input (red-contour matrix); in the example sentence, start and end symbols are omitted. Each attention head computes the matrix of key vectors for the input with its own learned key matrix W^K, followed by softmax and dropout.]
Figure 2: The architecture of the top of our proposed Label Attention Layer. In this figure, the example input
sentence is “Select the person driving”.
contains its own attention-weighted view of the sentence. We hypothesize that a word representation can be enhanced by including each label's attention-weighted view of the sentence, on top of the information obtained from self-attention.

The Label Attention Layer (LAL) is a novel, modified form of self-attention, where only one query vector is needed per attention head. Each classification label is represented by one or more attention heads, and this allows the model to learn label-specific views of the input sentence. Figure 1 shows a high-level comparison between our Label Attention Layer and self-attention.

We explain the architecture and intuition behind our proposed Label Attention Layer through the example application of parsing.

Figure 2 shows one of the main differences between our Label Attention mechanism and self-attention: the absence of the Query matrix W^Q. Instead, we have a learned matrix Q of query vectors representing each head. More formally, for attention head i and an input matrix X of word vectors, we compute the corresponding attention weights vector a_i as follows:

a_i = softmax(q_i ∗ K_i / √d)    (1)

where d is the dimension of query and key vectors, and K_i is the matrix of key vectors. Given a learned head-specific key matrix W_i^K, we compute K_i as:

K_i = W_i^K X    (2)

Each attention head in our Label Attention Layer has an attention vector, instead of an attention matrix as in self-attention. Consequently, we do not obtain a matrix of vectors, but a single vector that contains head-specific context information. This context vector corresponds to the green vector in Figure 3. We compute the context vector c_i of head i as follows:

c_i = a_i ∗ V_i    (3)

where a_i is the vector of attention weights in Equation 1, and V_i is the matrix of value vectors. Given a learned head-specific value matrix W_i^V, we compute V_i as:

V_i = W_i^V X    (4)

The context vector gets added to each individual input vector, making for one residual connection per head, rather than one for all heads, as shown in the yellow box in Figure 3. We project the resulting matrix of word vectors to a lower dimension before normalizing. We then distribute the vectors computed by each label attention head, as shown in Figure 4.
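As an illustrative sketch of the computation just described, one attention head of the Label Attention Layer can be written as follows in PyTorch. This is a simplified sketch rather than the exact implementation: the tensor layout (rows of X as word vectors), the hyperparameter names, the projection size, and the choice of adding the context vector to the value vectors are assumptions made for readability.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelAttentionHead(nn.Module):
    """One label attention head: a learned query vector instead of a query matrix."""

    def __init__(self, d_model: int, d_head: int, d_proj: int, dropout: float = 0.1):
        super().__init__()
        self.q = nn.Parameter(torch.randn(d_head))          # learned query vector q_i (no W^Q)
        self.w_k = nn.Linear(d_model, d_head, bias=False)   # head-specific key matrix W_i^K (Eq. 2)
        self.w_v = nn.Linear(d_model, d_head, bias=False)   # head-specific value matrix W_i^V (Eq. 4)
        self.proj = nn.Linear(d_head, d_proj, bias=False)   # projection to a lower dimension
        self.norm = nn.LayerNorm(d_proj)                    # layer normalization after projection
        self.dropout = nn.Dropout(dropout)
        self.d_head = d_head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_words, d_model) matrix of word vectors for one sentence
        k = self.w_k(x)                                     # K_i: (n_words, d_head)
        v = self.w_v(x)                                     # V_i: (n_words, d_head)
        scores = k @ self.q / math.sqrt(self.d_head)        # q_i * K_i / sqrt(d): (n_words,)
        a = self.dropout(F.softmax(scores, dim=-1))         # attention weights vector a_i (Eq. 1)
        c = a @ v                                           # context vector c_i = a_i * V_i (Eq. 3)
        out = v + c                                         # context vector repeated and added per word
        return self.norm(self.proj(out))                    # head-specific word representations
```

Since each head owns a single query vector rather than a full query matrix, the per-head parameters are dominated by W_i^K and W_i^V, which is what makes allocating one or more heads per label affordable.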
[Figure 3 annotations: the vector of attention weights (e.g., a_NP) goes from the label to the words; the matrix of value vectors for the input is computed with each label's own learned value matrix W^V; the green vector is an attention-weighted sum of value vectors and represents the input sentence as viewed by the label; the sentence vector is repeated and added to each input vector; the yellow vectors are word representations conscious of the label's view of the sentence and the word they represent, followed by matrix projection and layer normalization.]

Figure 3: The Value vector computations in our proposed Label Attention Layer.

We chose to assign as many attention heads to the Label Attention Layer as there are classification labels. As parsing labels (syntactic categories) are related, we did not apply an orthogonality loss to force the heads to learn separate information. We therefore expect an overlap when we match labels to heads. The values from each head are identifiable within the final word representation, as shown in the color-coded vectors in Figure 4.

The activation functions of the position-wise feed-forward layer make it difficult to follow the path of the contributions. Therefore we can remove the position-wise feed-forward layer and compute the contributions from each label. We provide an example in Figure 6, where the contributions are computed using normalization and averaging. In this case, we are computing the contributions of each head to the span vector. The span representation for "the person" is computed following the method of Gaddy et al. (2018) and Kitaev and Klein (2018). However, forward and backward representations are not formed by splitting the entire word vector at the middle, but rather by splitting each head-specific word vector at the middle.

In the example in Figure 6, we show averaging as one way of computing contributions; other functions, such as softmax, can be used. Another way of interpreting predictions is to look at the head-to-word attention distributions, which are the output vectors in the computation in Figure 2.

3.1 Encoder

Our parser is an encoder-decoder model. The encoder has self-attention layers (Vaswani et al., 2017), preceding the Label Attention Layer. We follow the attention partition of Kitaev and Klein (2018), who show that separating content embeddings from position ones improves performance.

Trees are represented using a simplified Head-driven Phrase Structure Grammar (HPSG) (Pollard and Sag, 1994). In Zhou and Zhao (2019), two kinds of span representations are proposed: the division span and the joint span. We choose the joint span representation as it is the best-performing one in their experiments. Figure 5 shows how the example sentence in Figure 2 is represented.

The token representations for our model are a concatenation of content and position embeddings. The content embeddings are a sum of word and part-of-speech embeddings.
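As a hedged illustration of these token representations, the content and position parts could be assembled as below; the vocabulary sizes, dimensions, and the use of learned position embeddings are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    """Content embeddings (word + POS) concatenated with position embeddings."""

    def __init__(self, n_words: int, n_tags: int, max_len: int, d_content: int, d_position: int):
        super().__init__()
        self.word = nn.Embedding(n_words, d_content)
        self.tag = nn.Embedding(n_tags, d_content)
        self.position = nn.Embedding(max_len, d_position)

    def forward(self, word_ids: torch.Tensor, tag_ids: torch.Tensor) -> torch.Tensor:
        # word_ids, tag_ids: (n_words,) index tensors for one sentence
        content = self.word(word_ids) + self.tag(tag_ids)            # sum of word and POS embeddings
        positions = self.position(torch.arange(word_ids.size(0)))    # one position embedding per token
        return torch.cat([content, positions], dim=-1)               # [content ; position]
```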
3.2 Constituency Parsing

For constituency parsing, span representations follow the definition of Gaddy et al. (2018) and Kitaev and Klein (2018). For a span starting at the i-th word and ending at the j-th word, the corresponding span vector s_ij is computed as:

s_ij = [ →h_j − →h_{i−1} ; ←h_{j+1} − ←h_i ]    (5)

where ←h_i and →h_i are respectively the backward and forward representation of the i-th word, obtained by splitting its representation in half. An example of a span representation is shown in the middle of Figure 6.

The score vector for the span is obtained by applying a one-layer feed-forward layer:

S(i, j) = W_2 ReLU(LN(W_1 s_ij + b_1)) + b_2    (6)
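The span representation of Equation 5 and the scoring function of Equation 6 can be sketched as follows; index handling (1-based word positions with sentinel start/end rows) and the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

def span_vector(h: torch.Tensor, i: int, j: int) -> torch.Tensor:
    """Eq. 5: s_ij = [fwd_j - fwd_{i-1} ; bwd_{j+1} - bwd_i], splitting each word vector in half."""
    fwd, bwd = h.chunk(2, dim=-1)   # forward / backward halves of every word representation
    return torch.cat([fwd[j] - fwd[i - 1], bwd[j + 1] - bwd[i]], dim=-1)

class SpanScorer(nn.Module):
    """Eq. 6: S(i, j) = W2 ReLU(LN(W1 s_ij + b1)) + b2."""

    def __init__(self, d_span: int, d_hidden: int, n_labels: int):
        super().__init__()
        self.w1 = nn.Linear(d_span, d_hidden)
        self.norm = nn.LayerNorm(d_hidden)
        self.w2 = nn.Linear(d_hidden, n_labels)

    def forward(self, s_ij: torch.Tensor) -> torch.Tensor:
        return self.w2(torch.relu(self.norm(self.w1(s_ij))))
```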
[Figure 4 annotations: Label Attention Heads #1–#4 produce head-specific outputs H1–H4; a word representation is a concatenation of all of its head-specific representations.]

Figure 4: Redistribution of the head-specific word representations to form word vectors by concatenation. We use different colors for each label attention head. The colors show where the head outputs go in the word representations. We do not use colors for the vectors resulting from the position-wise feed-forward layer, as the head-specific information moved.
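The redistribution in Figure 4 amounts to running one such head per classification label and concatenating the head-specific outputs into each word's final vector; the sketch below reuses the illustrative LabelAttentionHead defined earlier, and its names and dimensions are likewise assumptions.

```python
import torch
import torch.nn as nn

class LabelAttentionLayer(nn.Module):
    """Runs one label attention head per label and concatenates their outputs."""

    def __init__(self, n_labels: int, d_model: int, d_head: int, d_proj: int):
        super().__init__()
        self.heads = nn.ModuleList(
            [LabelAttentionHead(d_model, d_head, d_proj) for _ in range(n_labels)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_words, d_model) -> (n_words, n_labels * d_proj)
        head_outputs = [head(x) for head in self.heads]   # each: (n_words, d_proj)
        return torch.cat(head_outputs, dim=-1)            # each word keeps identifiable head slices
```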
[Figure 6 annotations: Computing Head Contributions. We compute the contributions from each head to the span vector: we sum values from the same head together and then normalize and average. The chart shows the fraction of contribution from heads #1–#4 to the span vector; here, heads #1 and #2 have the highest contributions to predicting "the person" as a noun phrase.]

Figure 6: If we remove the position-wise feed-forward layer, we can compute the contributions from each label attention head to the span representation, and thus interpret head contributions. This illustrative example follows the label color scheme in Figure 4.
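One plausible reading of this normalize-and-average recipe, under the assumption that each slice of the span vector traces back to exactly one label attention head, is sketched below; the use of absolute values for the per-head sums is an assumption, and averaging over many spans would follow.

```python
import torch

def head_contributions(span_vec: torch.Tensor, n_heads: int) -> torch.Tensor:
    """Fraction of the span vector contributed by each label attention head."""
    per_head = span_vec.view(n_heads, -1)   # one slice per head, since slices stay identifiable
    sums = per_head.abs().sum(dim=-1)       # sum the values coming from the same head
    return sums / sums.sum()                # normalize to fractions of contribution
```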
PFL  RD   Prec.  Recall  F1     UAS    LAS
Yes  Yes  96.47  96.20   96.34  97.33  96.29
No   Yes  96.51  96.15   96.33  97.25  96.11
Yes  No   96.53  96.24   96.38  97.42  96.26
No   No   96.29  96.05   96.17  97.23  96.11

Table 1: Results on the PTB test set of the ablation study on the Position-wise Feed-forward Layer (PFL) and Residual Dropout (RD) of the Label Attention Layer.

[Figure annotations: comparison of the two ablated head architectures, "Query Matrix and Concatenation" versus "Query Vector and Matrix Projection", and how each head's output is aggregated with the output from the other heads.]

QV   Conc.  Prec.  Recall  F1     UAS    LAS
Yes  Yes    96.53  96.24   96.38  97.42  96.26
No   Yes    96.43  96.03   96.23  97.25  96.12
Yes  No     96.30  96.10   96.20  97.23  96.15
No   No     96.30  96.06   96.18  97.26  96.17

Table 2: Results on the PTB test set of the ablation study on the Query Vectors (QV) and Concatenation (Conc.).
Model | English (LR / LP / F1) | Chinese (LR / LP / F1)
Shen et al. (2018) | 92.0 / 91.7 / 91.8 | 86.6 / 86.4 / 86.5
Fried and Klein (2018) | - / - / 92.2 | - / - / 87.0
Teng and Zhang (2018) | 92.2 / 92.5 / 92.4 | 86.6 / 88.0 / 87.3
Vaswani et al. (2017) | - / - / 92.7 | - / - / -
Dyer et al. (2016) | - / - / 93.3 | - / - / 84.6
Kuncoro et al. (2017) | - / - / 93.6 | - / - / -
Charniak et al. (2016) | - / - / 93.8 | - / - / -
Liu and Zhang (2017b) | 91.3 / 92.1 / 91.7 | 85.9 / 85.2 / 85.5
Liu and Zhang (2017a) | - / - / 94.2 | - / - / 86.1
Suzuki et al. (2018) | - / - / 94.32 | - / - / -
Takase et al. (2018) | - / - / 94.47 | - / - / -
Fried et al. (2017) | - / - / 94.66 | - / - / -
Kitaev and Klein (2018) | 94.85 / 95.40 / 95.13 | - / - / -
Kitaev et al. (2018) | 95.51 / 96.03 / 95.77 | 91.55 / 91.96 / 91.75
Zhou and Zhao (2019) (BERT) | 95.70 / 95.98 / 95.84 | 92.03 / 92.33 / 92.18
Zhou and Zhao (2019) (XLNet) | 96.21 / 96.46 / 96.33 | - / - / -
Our work | 96.24 / 96.53 / 96.38 | 91.85 / 93.45 / 92.64

Table 3: Constituency Parsing on PTB & CTB test sets.

Model | English (UAS / LAS) | Chinese (UAS / LAS)
Kuncoro et al. (2016) | 94.26 / 92.06 | 88.87 / 87.30
Li et al. (2018) | 94.11 / 92.08 | 88.78 / 86.23
Ma and Hovy (2017) | 94.88 / 92.98 | 89.05 / 87.74
Dozat and Manning (2016) | 95.74 / 94.08 | 89.30 / 88.23
Choe and Charniak (2016) | 95.9 / 94.1 | - / -
Ma et al. (2018) | 95.87 / 94.19 | 90.59 / 89.29
Ji et al. (2019) | 95.97 / 94.31 | - / -
Fernández-González and Gómez-Rodríguez (2019) | 96.04 / 94.43 | - / -
Kuncoro et al. (2017) | 95.8 / 94.6 | - / -
Clark et al. (2018) | 96.61 / 95.02 | - / -
Wang et al. (2018) | 96.35 / 95.25 | - / -
Zhou and Zhao (2019) (BERT) | 97.00 / 95.43 | 91.21 / 89.15
Zhou and Zhao (2019) (XLNet) | 97.20 / 95.72 | - / -
Our work | 97.42 / 96.26 | 94.56 / 89.28

Table 4: Dependency Parsing on PTB & CTB test sets.
[Figure: average contribution (%) of label attention heads #0, #31, #35, #47, #52, #71 and #111 to the vectors of spans predicted as NP (Noun Phrase), VP (Verb Phrase) and S (Sentence).]

Attention mechanisms (Bahdanau et al., 2014; Luong et al., 2015) have been extended to other tasks, such as text classification (Yang et al., 2016), natural language inference (Chen et al., 2016) and language modeling (Salton et al., 2017).

Self-attention and transformer architectures (Vaswani et al., 2017) are now the state of the art in language understanding (Devlin et al., 2018; Yang et al., 2019), extractive summarization (Liu, 2019), semantic role labeling (Strubell et al., 2018) and machine translation for low-resource languages (Rikters, 2018; Rikters et al., 2018).
As each attention head needs only a query vector rather than a matrix, we can significantly reduce the number of parameters per attention head. Finally, our Label Attention heads learn the relations between the syntactic categories, as we show by computing contributions from each attention head to span vectors. We show how the heads also help to analyze prediction errors, and suggest methods to correct them.

Acknowledgements

We thank the anonymous reviewers for their helpful and detailed comments.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Eugene Charniak et al. 2016. Parsing as language modeling. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2331–2336.

Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, and Hui Jiang. 2016. Enhancing and combining sequential and tree LSTM for natural language inference. arXiv preprint arXiv:1609.06038.

Do Kook Choe and Eugene Charniak. 2016. Parsing as language modeling. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2331–2336, Austin, Texas. Association for Computational Linguistics.

Kevin Clark, Minh-Thang Luong, Christopher D. Manning, and Quoc Le. 2018. Semi-supervised sequence modeling with cross-view training. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1914–1925.

John Cocke. 1969. Programming languages and their compilers: Preliminary notes.

Leyang Cui and Yue Zhang. 2019. Hierarchically-refined label attention network for sequence labeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4106–4119.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Timothy Dozat and Christopher D. Manning. 2016. Deep biaffine attention for neural dependency parsing. arXiv preprint arXiv:1611.01734.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 199–209.

Daniel Fernández-González and Carlos Gómez-Rodríguez. 2019. Left-to-right dependency parsing with pointer networks. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 710–716.

Daniel Fried and Dan Klein. 2018. Policy gradient as a proxy for dynamic oracles in constituency parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 469–476, Melbourne, Australia. Association for Computational Linguistics.

Daniel Fried, Mitchell Stern, and Dan Klein. 2017. Improving neural parsing by disentangling model combination and reranking effects. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 161–166.

David Gaddy, Mitchell Stern, and Dan Klein. 2018. What's going on in neural constituency parsers? An analysis. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 999–1010.

Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3543–3556.

Tao Ji, Yuanbin Wu, and Man Lan. 2019. Graph-based dependency parsing with graph neural networks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2475–2485.

Tadao Kasami. 1966. An efficient recognition and syntax-analysis algorithm for context-free languages. Coordinated Science Laboratory Report no. R-257.

Nikita Kitaev, Steven Cao, and Dan Klein. 2018. Multilingual constituency parsing with self-attention and pre-training. arXiv preprint arXiv:1812.11760.

Nikita Kitaev and Dan Klein. 2018. Constituency parsing with a self-attentive encoder. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2676–2686.
Adhiguna Kuncoro, Miguel Ballesteros, Lingpeng Kong, Chris Dyer, Graham Neubig, and Noah A. Smith. 2017. What do recurrent neural network grammars learn about syntax? In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1249–1258.

Adhiguna Kuncoro, Miguel Ballesteros, Lingpeng Kong, Chris Dyer, and Noah A. Smith. 2016. Distilling an ensemble of greedy dependency parsers into one MST parser. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1744–1753, Austin, Texas. Association for Computational Linguistics.

Zuchao Li, Jiaxun Cai, Shexia He, and Hai Zhao. 2018. Seq2seq dependency parsing. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3203–3214, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.

Jiangming Liu and Yue Zhang. 2017a. In-order transition-based constituent parsing. Transactions of the Association for Computational Linguistics, 5:413–424.

Jiangming Liu and Yue Zhang. 2017b. Shift-reduce constituent parsing with neural lookahead features. Transactions of the Association for Computational Linguistics, 5:45–58.

Yang Liu. 2019. Fine-tune BERT for extractive summarization. CoRR, abs/1903.10318.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421.

Xuezhe Ma and Eduard Hovy. 2017. Neural probabilistic model for non-projective MST parsing. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 59–69, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Xuezhe Ma, Zecong Hu, Jingzhou Liu, Nanyun Peng, Graham Neubig, and Eduard Hovy. 2018. Stack-pointer networks for dependency parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1403–1414, Melbourne, Australia. Association for Computational Linguistics.

Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank.

Carl Pollard and Ivan A. Sag. 1994. Head-driven phrase structure grammar. University of Chicago Press.

Matīss Rikters. 2018. Impact of corpora quality on neural machine translation. arXiv preprint arXiv:1810.08392.

Matīss Rikters, Mārcis Pinnis, and Rihards Krišlauks. 2018. Training and adapting multilingual NMT for less-resourced and morphologically rich languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018).

Giancarlo Salton, Robert Ross, and John Kelleher. 2017. Attentive language models. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 441–450.

Satoshi Sekine and Michael Collins. 1997. Evalb bracket scoring program. URL: http://www.cs.nyu.edu/cs/projects/proteus/evalb.

Sofia Serrano and Noah A. Smith. 2019. Is attention interpretable? arXiv preprint arXiv:1906.03731.

Yikang Shen, Zhouhan Lin, Athul Paul Jacob, Alessandro Sordoni, Aaron Courville, and Yoshua Bengio. 2018. Straight to the tree: Constituency parsing with neural syntactic distance. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1171–1180, Melbourne, Australia. Association for Computational Linguistics.

Mitchell Stern, Jacob Andreas, and Dan Klein. 2017. A minimal span-based neural constituency parser. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 818–827.

Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. arXiv preprint arXiv:1804.08199.

Jun Suzuki, Sho Takase, Hidetaka Kamigaito, Makoto Morishita, and Masaaki Nagata. 2018. An empirical study of building a strong baseline for constituency parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 612–618.

Sho Takase, Jun Suzuki, and Masaaki Nagata. 2018. Direct output connection for a high-rank language model. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4599–4609.

Zhiyang Teng and Yue Zhang. 2018. Two local models for neural constituent parsing. In Proceedings of the 27th International Conference on Computational Linguistics, pages 119–132, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 173–180. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Wenhui Wang, Baobao Chang, and Mairgup Mansur. 2018. Improved dependency parsing using implicit word connections learned from unlabeled data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2857–2863, Brussels, Belgium. Association for Computational Linguistics.

Yue Zhang and Stephen Clark. 2008. A tale of two parsers: Investigating and combining graph-based and transition-based dependency parsing. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 562–571, Honolulu, Hawaii. Association for Computational Linguistics.

Junru Zhou and Hai Zhao. 2019. Head-driven phrase structure grammar parsing on Penn Treebank. arXiv preprint arXiv:1907.02684.

A Additional Experiment Results

We report experiment results for hyperparameter tuning based on the number of self-attention layers in Table 5.
Self-Attention Layers Precision Recall F1 UAS LAS
2 96.23 96.03 96.13 97.16 96.09
3 96.47 96.20 96.34 97.33 96.29
4 96.52 96.15 96.34 97.39 96.23
6 96.48 96.09 96.29 97.30 96.16
8 96.43 96.09 96.26 97.33 96.15
12 96.27 96.06 96.16 97.24 96.14
16 96.38 96.02 96.20 97.32 96.11
Table 5: Performance on the Penn Treebank test set of our LAL parser according to the number of self-attention
layers. All parsers here include the Position-wise Feed-forward Layer and Residual Dropout.