ABSTRACT
The Transformer model has benefited various real-world applications, where the self-attention mechanism with dot-products shows superior alignment ability in building long-range dependency. However, the pair-wisely attended self-attention limits further performance improvement on challenging tasks. To the best of our knowledge, this is the first work to define the Triplet Attention (A3) for the Transformer, which introduces triplet connections as the complementary dependency. Specifically, we define the triplet attention based on the scalar triplet product, which may be used interchangeably with the canonical one within the multi-head attention. It allows the self-attention mechanism to attend to diverse triplets and capture complex dependency. Then, we utilize the permuted formulation and kernel tricks to establish a linear approximation to A3. The proposed architecture can be smoothly integrated into pre-training by modifying the head configurations. Extensive experiments show that our methods achieve significant performance improvement on various tasks and two benchmarks.

Figure 1: The large-scale self-attention model has reached its performance limits. The figure exhibits a practical example: once the number of parameters grows beyond 770M, the performance's growth rate drops rapidly. SuperGLUE is more challenging than GLUE, but all the methods' metrics seem to converge to an "unseen" wall. (Plot: SQuAD, GLUE, and SuperGLUE performance against the parameters of the T5 model, from 60M to 11B, log scale.)
CCS CONCEPTS
• Computing methodologies → Neural networks; Kernel methods; Natural language processing; Logical and relational learning.

KEYWORDS
Neural Network; Large-scale Model; Transformer; Self-attention

ACM Reference Format:
Haoyi Zhou, Jianxin Li, Jieqi Peng, Shuai Zhang, and Shanghang Zhang. 2021. Triplet Attention: Rethinking the similarity in Transformers. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '21), August 14–18, 2021, Virtual Event, Singapore. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3447548.3467241

∗ Also with Beijing Advanced Innovation Center for Big Data and Brain Computing (BDBC), Beihang University.
† Jianxin Li is the corresponding author.

1 INTRODUCTION
Transformers have achieved great success in a variety of domains, such as natural language processing [6], computer vision [17], and time-series analysis [31]. Such success is mainly brought by the core principle of utilizing the scaled dot-product similarity measurement in the self-attention mechanism, which, however, becomes an unbreakable constraint on further performance improvement in large-scale models. Taking the giant models GPT-3 [3] and T5 [7] as examples, their performance improvement is not proportional to the enlargement of their model size (parameters) after the BERT model [6] beat human performance on GLUE [27] and SQuAD [21]. In Fig.(1), we compare the performance improvement of T5 [7] from a smaller model (60M) to a larger one (11B) on three different tasks. From the figure, we can see that the curves tend to flatten once the parameter number exceeds 770M.

If we attribute this bottleneck to over-fitting from the generalization-performance perspective, existing attempts [8, 13, 15] at enhancing the diversity of different heads' behaviors or using dropout-like attention remain limited to specific tasks. Meanwhile, GPT-3 and T5 both build bigger models on larger corpora, which is considered the best way to alleviate the over-fitting problem. However, the performance bottleneck still exists.

Based on the aforementioned methods, we propose to slightly break the similarity assumption of dot-product attention, which builds direct and simple dependency. We utilize dissimilarity connections as a way to enhance the modeling of complex dependency.
Researchers [12, 23] have found that humans can easily recognize similarities through triplet comparisons, even with dissimilar pairs. Nevertheless, pair-wisely defined measurements can hardly succeed in these tasks. This drawback is obvious when the semantics change dynamically or complementary connections are needed. Consider a concrete example: if we say "Kate likes the apple" and "We are in the garden", we can instantly guess that the apple refers to the red edible fruit. Otherwise, if we change the second sentence into "We are in the Bestbuy", we may think Kate wants the Apple smartphone. The word "apple" has vague meanings depending on the context, and the triplet similarity is more reliable than the pair-wise one in this case. These notional words "Kate", "apple", "garden", and "Bestbuy" are not pair-wisely semantically similar. We are motivated to allow these dissimilar triplets to attend to each other and build more complex dependency beyond the scope of pair-wise self-attention. The dot-product attention fails to capture consistently dissimilar triplets because it is designed on twin comparisons, narrowing the concept of attention to the pair-wise case. Little work notices the role of dissimilarity in contributing complementary dependency. We are looking for this missing piece of the puzzle when Transformers rely solely on dot-product self-attention.

This paper builds a new type of attention that allows for capturing triplet dissimilarity on inputs. In Section 4, we find that most embeddings are attentive to the positional tokens and that dissimilar triplets are potential weak connections. Thus, we use the scalar triple product to build the triplet similarity measurement. We still stick to the term "similarity" to stay aligned with the vanilla Transformer architecture [25], which recombines the inputs with attentive items under the similarity measurement. Nevertheless, we interpret this triplet attention as a stable correlation, reflecting the inputs' intrinsic dissimilarity under random projections. We define the new type of similarity based on these specific correlations. In this way, canonical self-attention and triplet attention can be used interchangeably. Last but not least, the proposed triplet attention can also be regarded as an implicit regularizer for the canonical self-attention by allowing unattended pairs to attend.

To the best of our knowledge, we are the first to define the Triplet Attention (A3) in Transformer models. Furthermore, we propose an efficient formulation of A3, which can be used interchangeably with the canonical self-attention. The A3-enhanced Transformer allows for fine-tuning on large pre-trained models and achieves superior performance on various tasks. Our main contributions can be summarized as:
• We point out that triplet connections could contribute to the similarity measurement in the building of Transformers and reveal them as implicitly regularizing the Transformers.
• We define the Triplet Attention A3 with the scalar triplet product, which is a tangible formulation for encouraging diverse triplets to be attentive to each other.
• We propose the permutation approximation to A3 as a way to achieve linear complexity w.r.t. input length, where we combine it with the efficient tricks of the canonical self-attention. We show that our methods are compatible with pre-trained models under proper head configurations.
• We conduct experiments on various tasks to investigate the effectiveness of the proposed methods. Results demonstrate the success of A3 in achieving superior performance.

2 RELATED WORK
Transformer and Its Applications. The Transformer has brought significant success and attracted wide attention in various fields, such as natural language processing (NLP) [6] and computer vision [30]. It is mainly based on the self-attention mechanism and demonstrates strong representation learning capabilities. Particularly, transformer-based models achieve better performance than other types of networks on various NLP and vision benchmarks. Among these models, the most popular ones include BERT (Bidirectional Encoder Representations from Transformers) [6], GPT (Generative Pre-trained Transformer)-1 [18], and GPT-3 [3]. Though the larger transformer models have brought a more profound impact, the performance gain of large models is far from significant compared with their increased layers and model size. For instance, compared with the mBART model [16], which has 680 million parameters and 24 layers, the latest GPT-3 model [3] has 175 billion parameters and 96 layers, about 250 times larger than mBART. However, the performance gain is only 6.6 BLEU on German-to-English translation [3], and its performance on English-to-Romanian translation is even 14 BLEU lower than mBART.

Regularizer on Transformers. Even though the transformer has brought success to a variety of applications, most existing works focus on pursuing better performance with larger model designs (e.g., models with more layers), while much fewer works have investigated the limitation and capacity of self-attention in each layer. The work [13] introduces three types of disagreement regularization to encourage diversity among multiple attention heads. The work [5] also analyzes the attention mechanisms of pre-trained models and applies the analysis to BERT; the apparent redundancy of heads is found by clustering BERT's attention heads. As a further investigation, our proposed method first reveals that such a limitation is caused by the "simple" similarity assumption brought by dot-product attention, and then introduces the triplet attention as a fundamental complement to this problem. From these studies, we can see the importance and necessity of encouraging the diversity of the attention model. Though there are several works on designing diversity-promoting regularizers to improve generalization performance [14, 29], none of them deals with the self-attention of transformers.

3 PRELIMINARY
To better structure the proposed method, we briefly introduce the self-attention mechanism and the related basic notations.

3.1 The Self-attention
The self-attention mechanism lies at the core of the Transformer [25] model, where it is integrated in the encoder-decoder architecture for seq-to-seq processing to allow for building long-range dependency between inputs and outputs, e.g., in NLP [6], image [17], and time-series [31] tasks. The objective of building strong alignment ability motivates the self-attention's principle of learning to acquire direct access between important items. For a sequence input X ∈ R^{L×D} with length L, the single-headed self-attention is defined as:

    Attention(Q, K, V) = softmax(QK⊤ / √d) V ,    (1)
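To make Eq.(1) concrete, the following is a minimal sketch of the single-headed scaled dot-product self-attention in PyTorch (our own illustration; the function and variable names are not taken from the paper's released code):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Eq.(1): softmax(Q K^T / sqrt(d)) V for a single head.

    Q, K, V: (L, d) tensors obtained by linear projections of the input X in R^{L x D}.
    """
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d ** 0.5   # (L, L) pair-wise dot-product similarities
    weights = F.softmax(scores, dim=-1)           # each query attends over all keys
    return weights @ V                            # weighted recombination of the values
```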
[Figure 4(a): The overview of the A3-enhanced Transformers' encoder (Inputs → Multi-head Attention, stacked over N layers). Figure 4(b): the head configuration, with s heads replaced by A3 heads built from s×d mini-heads.]

If we drop the second key, the STP will degrade into the dot-product [...] the correct orders to be aligned with the weighted combination of the value V when we perform the permutation on the second key. Note that the permutation is designable and it allows flexible information sharing, e.g., reversing, shifting, and exchanging. Clark et al. [5] show that different heads favor their specific attentive patterns, and we make the permutation fixed in each head for a stable representation. A more detailed discussion on the choice of permutations is given in Section 5. Based on that, we propose the approximation of A3 in the following equation:

    atAttention(Q, K^(1), K^(2), V) = 𝔇^{-1} 𝔄 V ,   𝔄 = exp(T / √d_t) P .    (10)

We follow the similar definition in Eq.(8); the prefix 'at' stands [...]
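Since Eq.(8) and the exact construction of the score matrix T are not reproduced in this extract, the following is only a loose sketch, under our own assumptions, of how a permuted scalar-triple-product (STP) score between a query and two keys could be computed for d_t = 3; the names `stp_scores` and `perm` are ours, not the paper's:

```python
import torch

def stp_scores(Q, K1, K2, perm):
    """Illustrative permuted STP scores T[i, j] = q_i . (k1_j x k2_{perm(j)}) with d_t = 3.

    Q, K1, K2: (L, 3) projections of the input; perm: a fixed position permutation
    (e.g. reversing or shifting), applied to the second key as described above.
    """
    K2p = K2[perm]                              # permute the positions of the second key
    cross = torch.cross(K1, K2p, dim=-1)        # (L, 3) cross products k1_j x k2_{perm(j)}
    return Q @ cross.transpose(-2, -1)          # (L, L) scalar triple products
```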
Table 1: The performance on the GLUE benchmark (the numbers under the task names are the training-set sizes).

Model                        CoLA   MRPC   RTE    STS-B  QNLI   QQP    SST-2  WNLI   MNLI   Average
                             8.5k   3.5k   2.5k   5.7k   108k   363k   67k    0.64k  392k   -
ELMo                         44.1   76.6   53.4   70.4   71.1   86.2   91.5   56.3   68.6   68.7
DistilBERT                   51.3   87.5   59.9   86.9   89.2   88.5   91.3   56.3   82.2   77.0
DistilBERT-A3 (ours)         55.1   87.9   67.8   87.5   88.9   90.3   91.5   56.3   83.8   78.8
DistilBERT-A3-FAVOR (ours)   55.1   87.8   65.6   87.1   88.3   89.8   91.1   56.3   83.6   78.3
BERT_base                    56.3   88.6   69.3   89.0   91.9   89.6   92.7   53.5   86.6   79.7
BERT-A3 (ours)               62.9   90.8   74.4   90.3   91.1   91.0   93.2   56.3   87.6   81.9
BERT-A3-FAVOR (ours)         61.8   89.8   72.2   90.1   90.8   90.8   92.7   56.3   87.4   81.3
The s controls the number of head replacements. Since the output of A3 is aligned with the canonical self-attention by keeping the same recombination of the value V in Eq.(8,10), we develop the final layer output of matAttention(·) into the concatenation of all heads' feature maps. Note that the key K arises from the linear projection W ∈ R^{D×d}, and the keys K^(1), K^(2) are acquired by W ∈ R^{D×d_t}; thus we have three individual keys. Another important issue we have not yet addressed is the choice of the inner dimension d_t. The vector cross product in the STP limits d_t to be 3 for the dual operation to exist. To alleviate the information loss, we let s = α·d_t and build d mini-heads across the α-interval on the d dimensions under each head replacement. In practice, we start from the α = 1 case, and the set of mini-heads has comparable representation ability and equivalent time complexity to the canonical one. In Fig.(4b), the m heads (red) are divided into s heads (blue) and (m − s) heads (green), and the blue heads are decomposed into mini-heads (light blue). The dashed lines in the mini-heads split them into p groups, where we apply different permutations individually, as evenly as possible. A more detailed discussion on the choice of permutations is given in Section 5.3.3. As a summary, the proposed architecture is briefly presented in Fig.(4).
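As a loose illustration of the head replacement just described (our own sketch under simplifying assumptions: the input is already projected per head, d is divisible by d_t = 3, and one fixed permutation is used per mini-head group; the name a3_head and the argument layout are hypothetical, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def a3_head(Q, K1, K2, V, perms, d_t=3):
    """Sketch of one replaced head: slice the d-dimensional projections into d_t-sized
    mini-heads, score each mini-head with a permuted scalar triple product, and
    concatenate the recombined values so the output stays aligned with a canonical head.
    """
    L, d = Q.shape
    outputs = []
    for g, start in enumerate(range(0, d, d_t)):
        cols = slice(start, start + d_t)
        q, k1, k2, v = Q[:, cols], K1[:, cols], K2[:, cols], V[:, cols]
        perm = perms[g % len(perms)]                    # spread the permutation groups evenly
        scores = q @ torch.cross(k1, k2[perm], dim=-1).transpose(-2, -1)
        outputs.append(F.softmax(scores / d_t ** 0.5, dim=-1) @ v)
    return torch.cat(outputs, dim=-1)                   # (L, d) output of the A3 head
```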
5 EXPERIMENTS
This section empirically demonstrates the effectiveness of the A3 mechanism on the pre-trained DistilBERT model [22], which has 6 layers and 12 heads in each layer, and the pre-trained BERT_base model [6], which has 12 layers and 12 heads in each layer. We implement DistilBERT-A3 and BERT-A3 based on Transformers [28] and perform the model fine-tuning and evaluation on ten NLP tasks¹.

¹ The code can be downloaded at https://github.com/zhouhaoyi/TripletAttention

5.1 Setup
We perform experiments on two benchmarks with pre-trained models. More details and settings can be found in Appendix A.

5.2 Main Results
We summarize the results of the GLUE experiment in Table 1, where '(ours)' indicates the A3 variants we build. They are compared with the canonical architectures DistilBERT and BERT_base, together with the baseline ELMo [9]. Table 1 shows that our models DistilBERT-A3 and BERT-A3 outperform DistilBERT and BERT_base respectively on eight tasks, which demonstrates that A3 can enhance the complementary dependency of the self-attention mechanism and gain more competitive scores. More specifically, our DistilBERT-A3/BERT-A3 models achieve a 7.4%/11.7% score rise on the CoLA dataset (8.5k) and 13.2%/6.8% on the RTE dataset (2.5k). This is because A3 has better expression capability on smaller datasets, and the triplet connections help the model capture weak dependency with limited data. For the QNLI dataset, the A3 mechanism slightly reduces the performance. We also conduct the SQuAD experiment as an independent supplement. Note that although A3's results have not achieved state-of-the-art performance on GLUE because our practical resources are limited, the transparent architecture of A3 is expected to be promising on large-scale models.

The results of the SQuAD experiment are summarized in Table 2. Our models DistilBERT-A3 and BERT-A3 achieve better performance in both EM and F1 scores. We attribute the contradictory results (SQuAD vs. QNLI) to the different problem settings: QNLI forces question answering to the sentence level and removes pairs with low vocabulary overlap, while SQuAD maintains the paragraph level. The triplet correlations fit better in the latter case.

Table 2: The performance on the SQuAD dataset.

Model                        EM     F1
BiDAF-ELMo                   -      85.6
R.M. Reader                  81.2   87.9
DistilBERT                   77.7   85.8
DistilBERT-A3 (ours)         78.5   87.1
DistilBERT-A3-FAVOR (ours)   78.9   87.8
BERT_base                    81.2   88.5
BERT-A3 (ours)               81.8   89.3
BERT-A3-FAVOR (ours)         81.6   88.9

5.3 Ablation Study
5.3.1 Effect of A3's Layer Deployment. We explore the effect of applying the A3 mechanism in different layers. We design two experiments across layers: continuously employing A3, and intervening usage of A3. BERT-A3 l[i−j] applies A3 from the i-th layer to the j-th layer.
Table 3: The performance of consecutive BERT-A3's layer deployment on the selected GLUE tasks.

Model              CoLA   MRPC   RTE    STS-B
BERT-A3 l[1−3]     60.8   90.1   71.8   89.9
BERT-A3 l[4−6]     59.4   87.8   70.4   89.7
BERT-A3 l[7−9]     58.6   88.7   67.9   90.0
BERT-A3 l[10−12]   59.6   88.9   67.1   89.7
BERT-A3 l[1−6]     58.9   88.8   68.6   89.7
BERT-A3 l[7−12]    57.9   89.4   70.1   89.7
BERT-A3 l[1−12]    56.3   87.6   69.0   89.4

Table 5: The performance of different BERT-A3's head configurations on the selected GLUE tasks.

Model          CoLA   MRPC   RTE    STS-B
BERT-A3 h3     60.8   90.1   71.8   89.9
BERT-A3 h6     60.8   89.6   70.1   89.5
BERT-A3 h9     57.6   88.7   66.8   88.3
BERT-A3 h12    53.6   85.7   57.3   88.1

Table 6: The performance of different BERT-A3's permutation grouping on the selected GLUE tasks.
Figure 5: The performance decreases when the layer number of Transformers degrades from 12 to 2. (Both panels plot RTE accuracy against the remaining layers, from 12 down to 2, for BERT_base and BERT-A3.)

[Figure residue, the runtime comparison discussed in Section 5.4.2: forward/backward time curves for the L1/L2/L3 variants at batch sizes 2 and 4.]

[Panel residue, "(a) The NER visualization on the BERT_base model": PERs, ORGs, LOCs, and CROSS connections plotted over the 12 layers of self-attention.]
discovering more dependency. In the experiment of using two Permuted STP methods, we mainly combine "random permutation" with the other Permuted STP methods. In Table 6, we can find that the combination of "random permutation" and "next two words" achieves more effective results. The results in Table 6 also show that combinations excluding the "random permutation" method can achieve almost the best scores, which demonstrates that increasing the variety of Permuted STP methods and applying different combinations of Permuted STP methods to different A3 heads can obtain more significant dissimilar information and enhance the diversity of the attention mechanism.

5.4.2 Applying the FAVOR Trick (A3 vs A3-FAVOR). We empirically validate the accelerated A3 by applying FAVOR [4]. We compare the speed of the model's forward and backward passes among BERT-A3-L3, BERT-A3, and BERT-A3-FAVOR, represented as L3, L2, and L1. We use the six-layer BERT-A3 model, each layer has six A3 heads, and we gradually increase the batch size and sequence length of the input data. The results are shown in Figure 6. We can find that, compared with BERT-A3-L3, both BERT-A3 and BERT-A3-FAVOR reduce the computation cost, and the BERT-A3-FAVOR model allows larger-batch training with lower computation time, which contributes to reducing the total training time.
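For readers unfamiliar with FAVOR [4], the following is a generic sketch of the positive-random-feature idea it is built on, written by us and not taken from the A3-FAVOR implementation; the kernel actually used for the triplet scores may differ:

```python
import torch

def favor_attention(Q, K, V, n_features=64):
    """FAVOR-style linear attention sketch: softmax(QK^T/sqrt(d))V is approximated with
    positive random features [4], so the cost grows linearly with the length L."""
    d = Q.size(-1)
    W = torch.randn(d, n_features)                      # random projection directions
    def phi(X):                                         # positive random feature map
        Xs = X / d ** 0.25                              # absorb the 1/sqrt(d) scaling
        return torch.exp(Xs @ W - (Xs ** 2).sum(-1, keepdim=True) / 2) / n_features ** 0.5
    Qf, Kf = phi(Q), phi(K)                             # (L, n_features) feature maps
    KV = Kf.transpose(-2, -1) @ V                       # (n_features, d), aggregated once
    Z = Qf @ Kf.sum(dim=-2, keepdim=True).transpose(-2, -1)  # (L, 1) normalizer (applies D^{-1})
    return (Qf @ KV) / Z
```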
[Figure 8 residue: attention dependency (%) of the self connection and the writer-doctor, writer-policeman, and doctor-policeman connections across the 12 layers of self-attention.]

Figure 8: The visualization of the pairwise connections in self-attention using the A3 mechanism. We select the major connections, whose scores are greater than (mean + std).

We suggest this as a concrete visualization of A3's effectiveness: it can discover high-order information and provide diverse features. Recalling the visualization in Fig.(2), we further visualize the "writer-doctor-policeman" pairs when using the A3 mechanism in Fig.(8). We can find that the A3 mechanism significantly increases the proportion of the various connections compared with Fig.(2), which indicates that A3 meets the motivation of introducing triplet dependency to enhance the alignment ability of Transformers.

6 CONCLUSION
In this work, we address the performance bottleneck of large-scale Transformers, and the solution is motivated by slightly breaking the similarity assumption in the dot-product self-attention computation. The proposed Triplet Attention (A3) mechanism allows attention on dissimilar pairs, contributes to building high-level dependency, and can be deployed with canonical self-attention through proper configurations. Besides, we also provide A3's efficient variants for long inputs on large-scale models. Experimental results on two benchmarks demonstrate that A3 significantly outperforms the baselines, which shows the benefits of introducing triplet attention into Transformers.

ACKNOWLEDGMENTS
This work was supported by the National Natural Science Foundation of China (No. U20B2053, No. 61872022 and No. 61421003) and the State Key Laboratory of Software Development Environment (No. SKLSDE-2020ZX-12). Special thanks for the computing infrastructure provided by BDBC.

REFERENCES
[1] George B Arfken and Hans J Weber. 1999. Mathematical methods for physicists.
[2] Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer Normalization. CoRR abs/1607.06450 (2016).
[3] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In NIPS.
[4] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Jared Davis, Tamás Sarlós, David Belanger, Lucy Colwell, and Adrian Weller. 2020. Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers. CoRR abs/2006.03555 (2020).
[5] Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What Does BERT Look At? An Analysis of BERT's Attention. CoRR abs/1906.04341 (2019).
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In ACL. 4171–4186.
[7] William Fedus, Barret Zoph, and Noam Shazeer. 2021. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. CoRR abs/2101.03961 (2021).
[8] Po-Yao Huang, Xiaojun Chang, and Alexander G. Hauptmann. 2019. Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations. In EMNLP. 1461–1467.
[9] Vidur Joshi, Matthew E. Peters, and Mark Hopkins. 2018. Extending a Parser to Distant Domains Using a Few Dozen Partially Annotated Examples. In ACL. 1190–1199.
[10] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. 2020. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. In ICML, Vol. 119. 5156–5165.
[11] Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The Efficient Transformer. In ICLR.
[12] Matthäus Kleindessner and Ulrike von Luxburg. 2017. Kernel functions based on triplet comparisons. In NIPS. 6807–6817.
[13] Jian Li, Zhaopeng Tu, Baosong Yang, Michael R. Lyu, and Tong Zhang. 2018. Multi-Head Attention with Disagreement Regularization. In EMNLP. 2897–2903.
[14] Jianxin Li, Haoyi Zhou, Pengtao Xie, and Yingchun Zhang. 2017. Improving the Generalization Performance of Multi-class SVM via Angular Regularization. In IJCAI. 2131–2137.
[15] Zehui Lin, Pengfei Liu, Luyao Huang, Junkun Chen, Xipeng Qiu, and Xuanjing Huang. 2019. DropAttention: A Regularization Method for Fully-Connected Self-Attention Networks. CoRR abs/1907.11065 (2019).
[16] Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. ACL 8 (2020), 726–742.
[17] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. 2018. Image Transformer. In ICML, Vol. 80. 4052–4061.
[18] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. (2018).
[19] Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. 2020. Compressive Transformers for Long-Range Sequence Modelling. In ICLR.
[20] Ali Rahimi and Benjamin Recht. 2007. Random Features for Large-Scale Kernel Machines. In NIPS. 1177–1184.
[21] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In EMNLP. 2383–2392.
[22] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR abs/1910.01108 (2019).
[23] Neil Stewart, Gordon D. A. Brown, and Nick Chater. 2005. Absolute identification by relative judgment. Psychological Review 112, 4 (2005), 881.
[24] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020. Efficient Transformers: A Survey. CoRR abs/2009.06732 (2020).
[25] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In NIPS. 5998–6008.
[26] Jesse Vig. 2019. A Multiscale Visualization of Attention in the Transformer Model. In ACL. 37–42.
[27] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In EMNLP. 353–355.
[28] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In EMNLP. 38–45.
[29] Pengtao Xie, Aarti Singh, and Eric P. Xing. 2017. Uncorrelation and Evenness: a New Diversity-Promoting Regularizer. In ICML, Vol. 70. 3811–3820.
[30] Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Francis E. H. Tay, Jiashi Feng, and Shuicheng Yan. 2021. Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet. CoRR abs/2101.11986 (2021).
[31] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. 2020. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. In AAAI, Vol. 35. 11106–11115.
[Appendix figure residue: the five Permuted STP schemes applied to a token sequence (token 0 ... token m) — Next One Word, Next Two Words, Random Permutation, Reverse Sequence, and Cross of Previous and Next Word.]
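As an illustration of the five scheme names above, the following sketch builds plausible position-index maps for them; the exact definitions used by the paper may differ, and the function name and scheme keys are our own:

```python
import torch

def permutation_index(scheme, length):
    """Illustrative index maps for the Permuted STP schemes named in the figure above."""
    idx = torch.arange(length)
    if scheme == "next_one_word":
        return torch.roll(idx, shifts=-1)        # pair each position with the next token
    if scheme == "next_two_words":
        return torch.roll(idx, shifts=-2)
    if scheme == "random_permutation":
        return torch.randperm(length)            # drawn once and kept fixed per head
    if scheme == "reverse_sequence":
        return torch.flip(idx, dims=[0])
    if scheme == "cross_prev_next":              # swap each adjacent pair of positions
        swapped = idx.clone()
        n = length - length % 2                  # a trailing odd position stays in place
        swapped[0:n:2], swapped[1:n:2] = idx[1:n:2], idx[0:n:2]
        return swapped
    raise ValueError(f"unknown scheme: {scheme}")
```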
[Figure 10(b) residue: attention dependency of MISCs, PERs, ORGs, LOCs, CROSS, and SEP connections across the 12 layers of self-attention.]
(b) The NER visualization on the BERT-A3 l{1,2,3,10,11,12} model.

Figure 10: The pair-wise attention connections of BERT_base and BERT-A3 on the CoNLL2003 dataset; our model BERT-A3 reduces the connections to the special tokens 'CLS' or 'SEP' in the lower layers.

...to represent the pair-wise attention connections of tokens with the same named entities, which means that both tokens are miscellaneous, persons, locations, or organizations. We use 'SEP' to [...], where the formulation also applies to PERs, ORGs, LOCs, and CROSS.

Discussion. We visualize the results containing the 'SEP' connection in Fig.(10). From Fig.(10a), we can find that the number of 'SEP' connections is larger than that of the other connections, which means that many tokens are connected to the special tokens 'CLS' or 'SEP' in self-attention, and these connections carry redundant and invalid information. This observation is also consistent with previous research. From Fig.(10b), we can find that A3 reduces the number of 'SEP' connections at layers {1,2,3,10,11}, which shows that A3 can reduce redundant information in attention and enhance the model's expression ability.

D PLATFORM
All models were trained/tested on a cluster, where each node is installed with 2 × Intel(R) Xeon Gold CPUs @ 2.40GHz and 4 × Nvidia V100 GPUs (32 GB). Due to the limitation of available nodes, we are unable to evaluate A3 on large-scale language models like GPT-3 and T5.
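A small sketch of how the per-layer 'SEP' dependency share discussed above could be measured, reusing the major-connection rule from Figure 8's caption (score greater than mean + std); the exact counting procedure is our assumption, not the paper's:

```python
import torch

def sep_dependency(attn, sep_mask):
    """attn: one layer's attention weights, shape (heads, L, L); sep_mask: bool (L,)
    marking the 'SEP'/'CLS' positions. Returns the share of major connections
    (score > mean + std) that land on those special tokens."""
    threshold = attn.mean() + attn.std()
    major = attn > threshold                        # boolean map of major connections
    to_sep = major & sep_mask.view(1, 1, -1)        # major connections attending to 'SEP'/'CLS'
    return to_sep.sum().float() / major.sum().clamp(min=1).float()
```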