

Triplet Attention: Rethinking the similarity in Transformers


Haoyi Zhou∗ (Beihang University, Beijing, China, zhouhy@act.buaa.edu.cn)
Jianxin Li∗† (Beihang University, Beijing, China, lijx@act.buaa.edu.cn)
Jieqi Peng∗ (Beihang University, Beijing, China, pengjq@act.buaa.edu.cn)
Shuai Zhang∗ (Beihang University, Beijing, China, zhangs@act.buaa.edu.cn)
Shanghang Zhang (UC Berkeley, Berkeley, CA, shz@eecs.berkeley.edu)

ABSTRACT
The Transformer model has benefited various real-world applications, where the self-attention mechanism with dot-products shows superior alignment ability in building long dependency. However, the pair-wisely attended self-attention limits further performance improvement on challenging tasks. To the best of our knowledge, this is the first work to define the Triplet Attention (A3) for the Transformer, which introduces triplet connections as a complementary dependency. Specifically, we define the triplet attention based on the scalar triple product, which may be used interchangeably with the canonical one within the multi-head attention. It allows the self-attention mechanism to attend to diverse triplets and capture complex dependency. Then, we utilize the permuted formulation and kernel tricks to establish a linear approximation to A3. The proposed architecture can be smoothly integrated into pre-training by modifying head configurations. Extensive experiments show that our methods achieve significant performance improvement on various tasks and two benchmarks.

Figure 1: The large-scale self-attention model has reached its performance limits. The figure exhibits a practical example: performance on SQuAD, GLUE, and SuperGLUE against the number of parameters of the T5 model (60M to 11B, log scale). Once the number of parameters exceeds 770M, the performance's growth rate drops rapidly. SuperGLUE is more challenging than GLUE, but all the methods' metrics seem to converge toward an "unseen" wall.
CCS CONCEPTS
• Computing methodologies → Neural networks; Kernel methods; Natural language processing; Logical and relational learning.

KEYWORDS
Neural Network; Large-scale Model; Transformer; Self-attention

ACM Reference Format:
Haoyi Zhou, Jianxin Li, Jieqi Peng, Shuai Zhang, and Shanghang Zhang. 2021. Triplet Attention: Rethinking the similarity in Transformers. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '21), August 14–18, 2021, Virtual Event, Singapore. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3447548.3467241

© 2021 Association for Computing Machinery. ACM ISBN 978-1-4503-8332-5/21/08. https://doi.org/10.1145/3447548.3467241

∗ Also with Beijing Advanced Innovation Center for Big Data and Brain Computing (BDBC), Beihang University.
† Jianxin Li is the corresponding author.

1 INTRODUCTION
Transformers have achieved great success in a variety of domains, such as natural language processing [6], computer vision [17], and time-series analysis [31]. This success mainly stems from the core principle of utilizing the scaled dot-product similarity measurement in the self-attention mechanism, which, however, has become an unbreakable constraint on further performance improvement in large-scale models. Taking the giant GPT-3 [3] and T5 [7] models as examples, their performance improvement is not proportional to the enlargement of model size (parameters) after the BERT model [6] surpassed human performance on GLUE [27] and SQuAD [21]. In Fig.(1), we compare the performance improvement of T5 [7] from a smaller model (60M) to a larger one (11B) on three different tasks. From the figure, we can see that the curves tend to flatten once the parameter number exceeds 770M.

If we attribute this bottleneck to over-fitting from the generalization-performance perspective, existing attempts [8, 13, 15] at diversifying the heads' behaviors or using dropout-like attention remain limited to specific tasks. Meanwhile, GPT-3 and T5 both build bigger models on larger corpora, which is considered the best way to alleviate the over-fitting problem. However, the performance bottleneck still exists.

Going beyond the aforementioned methods, we propose to slightly break the similarity assumption of dot-product attention, which builds direct and simple dependency. We utilize dissimilarity connections as a way to enhance complex dependency modeling.


Researchers [12, 23] have found that human beings can easily recognize similarities through triplet comparisons, even among dissimilar pairs. Nevertheless, pair-wisely defined measurements can hardly succeed in such tasks. This drawback is obvious when the semantics change dynamically or complementary connections are needed. Consider a concrete example: if we say "Kate likes the apple" and "We are in the garden", we can instantly guess that the apple refers to the red edible fruit. Otherwise, if we change the second sentence into "We are in the Bestbuy", we may think Kate wants the Apple smartphone. The word "apple" has vague meanings depending on different contexts, and the triplet similarity is more reliable than the pair-wise one in this case. The notional words "Kate", "apple", "garden", and "Bestbuy" are not pair-wisely semantically similar. We are motivated to allow these dissimilar triplets to attend to each other to build more complex dependency beyond the scope of pair-wise self-attention. The dot-product attention fails to capture consistently dissimilar triplets because it is designed on twin comparisons, narrowing the concept of attention to the pair-wise case. Seldom does prior work notice the role of dissimilarity in contributing complementary dependency. We are looking for a missing piece of the puzzle if Transformers solely rely on dot-product self-attention.

This paper builds a new type of attention that allows for capturing triplet dissimilarity in the inputs. We find in Section 4 that most embeddings are attentive to the positional tokens and that dissimilar triplets are potential weak connections. Thus, we use the scalar triple product to build the triplet similarity measurement. We still stick to the term "similarity" to stay aligned with the vanilla Transformer architecture [25], which recombines the inputs with attentive items under the similarity measurement. Nevertheless, we interpret this triplet attention as a stable correlation, reflecting the inputs' intrinsic dissimilarity under random projections. We define the new type of similarity based on these specific correlations. In this way, canonical self-attention and triplet attention can be used interchangeably. Last but not least, the proposed triplet attention can also be regarded as an implicit regularizer for the canonical self-attention by allowing unattended pairs to attend.

To the best of our knowledge, we are the first to define the Triplet Attention (A3) in Transformer models. Furthermore, we propose an efficient formulation of A3, which can be used interchangeably with the canonical self-attention. The A3-enhanced Transformer allows for fine-tuning on large pre-trained models and achieves superior performance on various tasks. Our main contributions can be summarized as:
• We point out that triplet connections could contribute to the similarity measurement in the building of Transformers and reveal that they implicitly regularize the Transformers.
• We define the Triplet Attention A3 with the scalar triple product, which is a tangible formulation for encouraging diverse triplets to be attentive to each other.
• We propose the permutation approximation to A3 as a way to achieve linear complexity w.r.t. input length, where we combine it with the efficient tricks of the canonical self-attention. We show that our methods are compatible with pre-trained models under proper head configurations.
• We conduct experiments on various tasks to investigate the effectiveness of the proposed methods. Results demonstrate the success of A3 in achieving superior performance.

2 RELATED WORK
Transformer and Its Applications. The Transformer has brought significant success and attracted wide attention in various fields, such as natural language processing (NLP) [6] and computer vision [30]. It is mainly based on the self-attention mechanism and demonstrates strong representation learning capabilities. Particularly, transformer-based models achieve better performance than other types of networks on various NLP and vision benchmarks. Among these models, the most popular ones include BERT (Bidirectional Encoder Representations from Transformers) [6], GPT (Generative Pre-trained Transformer)-1 [18], and GPT-3 [3]. Though the larger transformer models have brought a more profound impact, the performance gain of large models is far from significant compared with their increased layers and model size. For instance, compared with the mBART model [16], which has 680 million parameters and 24 layers, the latest GPT-3 model [3] has 175 billion parameters and 96 layers, which is 250 times larger than mBART. However, the performance gain is only 6.6 BLEU on German-to-English translation [3], and its performance on English-to-Romanian translation is even 14 BLEU lower than mBART.

Regularizers on Transformers. Even though the transformer has brought success to a variety of applications, most existing works focus on pursuing better performance with larger model designs (e.g., models with more layers), while much fewer works have investigated the limitation and capacity of the self-attention in each layer. The work [13] introduces three types of disagreement regularization to encourage diversity among multiple attention heads. The work [5] also analyzes the attention mechanisms of pre-trained models, applying the analysis to BERT; the apparent redundancy of heads is found by clustering BERT's attention heads. As a further investigation, our proposed method first reveals that such limitation is caused by the "simple" similarity assumption brought by dot-product attention, and we then introduce the triplet attention as a fundamental complement to address this problem. From these studies, we can see the importance and necessity of encouraging the diversity of the attention model. Though there are several works on designing diversity-promoting regularizers to improve the generalization performance [14, 29], none of them deal with the self-attention of transformers.

3 PRELIMINARY
To better structure the proposed method, we briefly introduce the self-attention mechanism and the related basic notations.

3.1 The Self-attention
The self-attention mechanism is the core component of the Transformer [25] model, where it is integrated into the encoder-decoder architecture for seq-to-seq processing to allow for building long-range dependency between inputs and outputs, e.g., in NLP [6], image [17], and time-series [31] applications. The objective of building strong alignment ability motivates the self-attention's principle of learning to acquire direct access between important items. For a sequence input X ∈ R^{𝐿×𝐷} with length 𝐿, the single-headed self-attention is defined as:

    Attention(Q, K, V) = softmax(QK⊤/√𝑑) V ,    (1)


where the multiplication between the row-wise softmax(·) and V denotes the attention-weighted combination of the inputs by the attention scores between queries Q and keys K. The scaling factor √𝑑 in the denominator alleviates the growth of the dot-products' magnitude to stabilize the measurement. Note that different linear transformations {W𝑄, W𝐾, W𝑉} ∈ R^{𝐷×𝑑} project the input into queries, keys and values respectively. Choromanski et al. [4] reformulated the self-attention into the following form:

    Attention(Q, K, V) = D^{-1} A V ,    (2)

where A = exp(QK⊤/√𝑑) denotes the attention matrix in R^{𝐿×𝐿} and the exponential operator exp(·) is applied element-wise. The normalizer D = diag(A 1𝐿) is the diagonal matrix obtained by multiplying A with the all-ones vector 1𝐿, such that we decompose the softmax(·) operator into a division on the attention matrix. The quadratic computation and storage requirement of A w.r.t. the input length 𝐿 becomes the bottleneck in building larger models.

Figure 2: The visualization of the pair-wise connections ("self connection", "writer-doctor", "writer-policeman", "doctor-policeman") in canonical self-attention, shown as the attention dependency (%) over the 12 layers of self-attention. We select the major connections, whose scores are greater than (mean + std).
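To make the two formulations concrete, the following is a minimal PyTorch sketch (our own illustrative code, not the authors' released implementation) that computes Eq.(1) directly and then re-derives it through the exp/normalizer decomposition of Eq.(2); the tensor sizes and random inputs are illustrative assumptions.

```python
import torch

def attention_softmax(Q, K, V):
    # Eq.(1): softmax(QK^T / sqrt(d)) V
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d ** 0.5          # (L, L)
    return torch.softmax(scores, dim=-1) @ V             # (L, d_v)

def attention_decomposed(Q, K, V):
    # Eq.(2): D^{-1} A V with A = exp(QK^T / sqrt(d)), D = diag(A 1_L)
    d = Q.shape[-1]
    A = torch.exp(Q @ K.transpose(-2, -1) / d ** 0.5)    # element-wise exp
    D_inv = torch.diag(1.0 / (A @ torch.ones(A.shape[-1])))
    return D_inv @ A @ V

L, D, d = 8, 16, 16                                       # illustrative sizes
X = torch.randn(L, D)
Wq, Wk, Wv = (torch.randn(D, d) * 0.1 for _ in range(3))  # placeholder projections
Q, K, V = X @ Wq, X @ Wk, X @ Wv
assert torch.allclose(attention_softmax(Q, K, V),
                      attention_decomposed(Q, K, V), atol=1e-5)
```

The assertion simply checks that the softmax form and the D^{-1}AV decomposition produce the same output, which is the property the later triplet formulation reuses.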

3.2 The Multi-head Self-attention
To balance diversified attention patterns and computation efficiency, multiple attention heads are adopted to learn robust linear transformations [26]. With 𝑚 single heads, we have:

    MultiHeadAttention(Q, K, V) = concat(head_1, . . . , head_𝑚), where head_𝑖 = Attention(Q^(𝑖), K^(𝑖), V^(𝑖)) .    (3)

After concatenating all heads' feature maps, we apply a fully connected layer and the LayerNorm operator [2] to add residual connections to the attention layer output. So far, no activation function is applied. Vaswani et al. [25] suggest using two position-wise Feed-Forward Networks with a ReLU activation as the additional layer. The self-attention mechanism learns the inputs' mutual connections through the network, and the inner layers build non-linear mappings for indirect dependency. Note that our methods acquire complementary dependency for individual heads, reducing the requirement on the inner layers' non-linear capture capability.

4 METHODS
In this section, we empirically show the existence of diversity connections in canonical self-attention. Then, we present the proposed triplet self-attention mechanism for better representation.

4.1 Rethinking the Canonical Self-attention
To better motivate our approach, we consider a case study, which will be further explored in Section 5.4.3. We focus on the lexical patterns of self-attention heads. Based on BERT [6], we consider logical reasoning questions on two consecutive sentences, "The old writer is the father of the doctor" and "The doctor is the mother of the policeman on duty at the door", and seek to find the triplet connections among "writer-doctor-policeman". The articles are omitted to avoid interference. In Fig.(2), we can see that the pair-wise similar connections are more widely distributed in the shallow layers than in the deeper layers. The dependency among the three persons in high-level layers is vital to the network's final representation, e.g., for reasoning about their relationship in question answering. In the canonical self-attention, the insufficiency of triplet dependency limits the alignment performance on challenging tasks.

4.2 The Triplet Attention
The design of canonical self-attention [25] is simple. The layer output is a re-representation of the inputs obtained by taking convex combinations of the inputs' linearly transformed vectors, where the coefficients of the combinations are normalized, i.e., the softmax(·) forms a probability distribution over the similarity measure between independently transformed vectors. Let sim(𝑖, 𝑗) = A_𝑖𝑗; a higher similarity indicates that the 𝑖-th query strongly attends to the 𝑗-th key. We briefly illustrate the dot-product in Fig.(3a): the purple square denotes a selected pair, and the projection length of the vector q on the normalized vector k is the geometric representation of sim(·, ·). As we discussed, the principle of building direct access between queries and keys limits the canonical self-attention's performance on capturing more complex dependency.

However, what if we use a triplet measurement on the inputs rather than the pair-wise one? Fig.(2) reveals that there exists complicated dependency beyond the alignment ability of pair-wisely defined self-attention. The sequence input X is projected into three vectors, i.e., one query Q = XW𝑄 together with two different keys K^(1) = XW𝐾1 and K^(2) = XW𝐾2, where the linear transformation matrices lie in R^{𝐷×𝑑𝑡}. Subscripting a matrix with 𝑖 returns its 𝑖-th row; the key vectors K^(1)_𝑗 and K^(2)_𝑘 form an individual plane, defining the normal vector as:

    K_{𝑗×𝑘} = K^(1)_𝑗 × K^(2)_𝑘 ,    (4)

where the operator × denotes the cross product between two vectors. The magnitude of K_{𝑗×𝑘}, namely ∥K_{𝑗×𝑘}∥ = ∥K^(1)_𝑗∥ ∥K^(2)_𝑘∥ sin 𝜃, can be interpreted as the positive area of the parallelogram having the two vectors as sides, which we decorate with darker grey in Fig.(3b). If the two keys are similar, i.e., the vector angle 𝜃 approaches zero, the magnitude ∥K_{𝑗×𝑘}∥ decreases. In the extreme case where the two vectors are parallel, the parallelogram disappears with ∥K_{𝑗×𝑘}∥ = 0. Based on that, we introduce the Scalar Triple Product (STP) [1] as a way to capture complex dependency beyond the scope of canonical self-attention. For the query and two keys, the (𝑖, 𝑗, 𝑘)-th STP score can be expressed as:

    𝑇_𝑖𝑗𝑘 = Q_𝑖 K⊤_{𝑗×𝑘} = Q_𝑖 (K^(1)_𝑗 × K^(2)_𝑘)⊤ .    (5)
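As a concrete illustration of Eqs.(4)-(5), the sketch below (illustrative code, not the released implementation) computes the full STP score tensor T with 𝑑𝑡 = 3, using torch.cross for the key-key normal vectors and einsum for the query projection; the projection matrices are random placeholders.

```python
import torch

L, D, d_t = 6, 16, 3          # d_t must be 3 so the cross product is defined
X = torch.randn(L, D)
W_q, W_k1, W_k2 = (torch.randn(D, d_t) * 0.1 for _ in range(3))
Q, K1, K2 = X @ W_q, X @ W_k1, X @ W_k2                       # each (L, 3)

# Eq.(4): normal vectors K_{j x k} = K1_j x K2_k for every key pair (j, k)
normals = torch.cross(K1[:, None, :].expand(L, L, 3),
                      K2[None, :, :].expand(L, L, 3), dim=-1)  # (L, L, 3)

# Eq.(5): T_{ijk} = Q_i . (K1_j x K2_k)   -> (L, L, L) score tensor
T = torch.einsum('id,jkd->ijk', Q, normals)

# Sanity check against an explicit computation for one triplet
i, j, k = 0, 1, 2
assert torch.allclose(T[i, j, k], torch.dot(Q[i], torch.cross(K1[j], K2[k], dim=0)))
```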


Figure 3: (a) The canonical self-attention: The attention matrix is formed through the pair-wise computations between queries and keys on the left side. We draw the selected dot-product score (the purple square) on the right side, and the grey rectangle represents the product magnitude after projecting the vector q on the vector k. (b) The proposed self-attention: We build the three-dimensional attention matrix with queries and two different keys on the left side. Unlike the dot-product operator, we draw the selected scalar triple product (the purple cube) on the right side. The cross product between the vectors k(1) and k(2) defines the darker parallelogram, and the signed volume of the parallelepiped is the geometric interpretation of the proposed score measurement.

As illustrated in Fig.(3b), 𝑇_𝑖𝑗𝑘 represents the signed volume of the parallelepiped constructed on the three projected vectors. Considering the volume as the product of the height and the area of the parallelogram, the farther the query vector is "away" from the parallelogram and the bigger the parallelogram, the larger the volume. Unlike the dot-product of the canonical self-attention in Fig.(3a), which is designed to let similar pairs attend to each other and to avoid dissimilar pairs (e.g., a query vector perpendicular to a key vector), the STP attends to triplets of vectors that are all not so similar to each other; in other words, they form a distribution with diversity [14, 29]. On the right side of Fig.(3b), the three vectors form a cube to fill a larger volume. Recall that we perform three random linear transformations to generate the query and keys; the level of disparity among the three projected vectors indicates a higher possibility of intrinsic invariance to the random projections. Basically, the dot-product captures the strongly similar dependency, and the triplet product enhances the complementary dependency for the potential high-level context. Therefore, we define the triplet similarity measure sim: R^{𝑑𝑡} × R^{𝑑𝑡} × R^{𝑑𝑡} → R, which is a simple ad-hoc "softmax style" function of the scalar triple product between the query Q_𝑖 of token 𝑖 and the two keys K_𝑗, K_𝑘 of tokens 𝑗, 𝑘, namely:

    sim(𝑖, 𝑗, 𝑘) = exp(𝑇_𝑖𝑗𝑘/√𝑑𝑡) ,    (6)

where exp(·) denotes the exponential operator applied element-wise, and we utilize the same scaling factor as in Eq.(1). If we encourage token connections of higher-score triplets, the overall Transformer architecture is able to overcome the limitation of superficial strong attention and leverage the triplet similarity to build high-level dependency. Thus, we reformulate the attention matrix into the triplet form as:

    A = exp(TW/√𝑑𝑡) ,    (7)

where the similarity tensor T ∈ R^{𝐿×𝐿×𝐿} is calculated by performing the triplet similarity calculation sim(𝑖, 𝑗, 𝑘) on every query and key combination. The linear transformation matrix W ∈ R^{𝐿×1} projects the significant triplet connections into a flat feature map in R^{𝐿×𝐿}, like the dot-product one, which keeps it consistent with the canonical self-attention paradigm. Then we substitute the triplet attention matrix into Eq.(2) and acquire the definition of Triplet Attention (A3) as follows:

    tAttention(Q, K^(1), K^(2), V) = D^{-1} A V ,    (8)

where the letter 't' in the function name stands for measuring the triplet attention, and the normalizer D = diag(A 1𝐿) follows the same definition as in Eq.(2).
product (the purple cubic) on the right side. The cross prod- is O (𝐿 3𝑑𝑡 ) since the STP operator traverses through all possible
uct between vector k1 and k2 define the darker parallelo- triplet pairs and each score has to be calculated and stored explicitly.
gram, and the signed volume of parallelepiped is the geo- This cubic calculation drawback causes the computational bottle-
metric interpretation of the proposed score measurement. neck to be magnified in such a way that the A3 could not only be
inaccessible to long inputs or even to regular inputs. There exists
various studies on efficient Transformer [10, 11, 19, 24, 31], they
provided tangible improvements on reducing the quadratic time
As illustrated in Fig.(3b), T𝑖 𝑗𝑘 represents the signed volume of par- complexity and alleviate the quadratic memory consumption of the
allelepiped constructed on the three projected vectors. Consider canonical self-attention. However, the running time and memory
the volume as the product of the height and the area of the parallel- usage of A3 are proportional to the cube of input length, which
ogram, the more the query vector is “away” from the parallelogram makes it unfeasible to leverage the above acceleration techniques.
and the bigger parallelogram produces a larger volume. Unlike the Compared with the dot-product self-attention, the overhead of A3
dot-product of the canonical self-attention in Fig.(3a), it is designed is 𝐿-times larger due to the fact that the STP’s second key brings
to find the similar pairs attentive to each other and avoids dissimilar extra 𝐿 circulations. Motivated by the self-attention’s sparsity as-
pairs, e.g, the query vector is perpendicular to the key vector. The sumption [24, 31], we restrict the vector cross product between
STP attend to the triplet vectors that they all were not so similar each vector pairs to a row-wise pairs, where one key’s order in the
to each other, in other words, they form a distribution with diver- row-wise pairs is permuted to enable inter-position information
sity [14, 29]. On the right side of Fig.(3b), the three vectors form a sharing. Thus, the new cross product will only generate 𝐿 normal
cube to fill a larger volume. Recall that we perform three random vectors from permuted vector pairs and the overall computation
linear transformations on generating query and keys, the level of complexity could reduce to quadratic (w.r.t the input length). We
disparity among the three projected vectors indicates the higher call this as Permuted STP and derive the corresponding similarity
possibility of existing intrinsic invariance to random projections. matrix as:
Basically, the dot-product captures the strong similar dependency
h i
T P = P −1 Q(K (1) ⊗ P [K (2) ]) ⊤ , (9)
and the triplet product enhances the complementary dependency
for the potential high-level context. Therefore, we define the triplet where the ⊗ denotes performing the vector cross product between
similarity measures sim: R𝑑𝑡 × R𝑑𝑡 × R𝑑𝑡 → R, which are simple ad- the key matrix in the row-wise manner, P [·] represents the row
hoc “softmax style” functions of the scalar triple product between permutation and P −1 [·] is the corresponding reverse operator. It
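A rough sketch of the Permuted STP in Eq.(9) follows, assuming 𝑑𝑡 = 3 and a fixed permutation; exactly how the inverse permutation P^{-1} is applied (here: along the key axis of the score matrix) is our assumption, and the random projections are placeholders.

```python
import torch

def permuted_stp(Q, K1, K2, perm):
    """Eq.(9) sketch: T_P = P^{-1}[ Q (K1 (x) P[K2])^T ] with a row-wise cross product."""
    # Row-wise cross product with the permuted second key: one normal vector per position
    normals = torch.cross(K1, K2[perm], dim=-1)           # (L, 3)
    scores = Q @ normals.transpose(0, 1)                   # (L, L), quadratic instead of cubic
    inv_perm = torch.argsort(perm)                         # P^{-1}
    return scores[:, inv_perm]                             # assumption: undo P along the key axis

L, D, d_t = 8, 16, 3
X = torch.randn(L, D)
Q  = X @ (torch.randn(D, d_t) * 0.1)
K1 = X @ (torch.randn(D, d_t) * 0.1)
K2 = X @ (torch.randn(D, d_t) * 0.1)
perm = torch.roll(torch.arange(L), shifts=-1)              # e.g. a "next one word" shift
T_P = permuted_stp(Q, K1, K2, perm)                        # (L, L)
```

The (L, L) score matrix T_P then enters the exponential and the D^{-1} normalizer exactly as in Eq.(10) below.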


It can be thought of in this way: the diversity of attention connections is brought by adding the additional key K^(2) in the vector cross product. If we drop the second key, the STP degrades into the dot-product operation, namely canonical self-attention. Thus, we have to restore the correct orders to stay aligned with the weighted combination of the value V when we perform the permutation on the second key. Note that the permutation is designable and allows flexible information sharing, e.g., reversing, shifting and exchanging. Clark et al. [5] show that different heads favor their specific attentive patterns, and we keep the permutation fixed in each head for a stable representation. A more detailed discussion on the choice of permutations is given in Section 5. Based on that, we propose the approximation of A3 in the following equation:

    atAttention(Q, K^(1), K^(2), V) = 𝔇^{-1} 𝔄 V ,  where 𝔄 = exp(T_P/√𝑑𝑡) .    (10)

We follow the same definition as in Eq.(8): the letters 'at' stand for the permutation-based approximation to the triplet attention, and the normalizer is 𝔇 = diag(𝔄 1𝐿). Recalling the canonical self-attention formulation in Eq.(2), the above approximation follows the same weighted combination of the value V, and the two can be used interchangeably in Transformer variants.

Hence, the efficient Transformer tricks can be introduced after the approximation, and we give an exemplar of how to apply FAVOR [4] to it. For the attention matrix 𝔄, the signed volume of T_P remains the same after cyclic swapping [1], and we have:

    𝔄 = exp( P^{-1}[ P[K^(2)] (Q ⊗ K^(1))⊤ ] / √𝑑𝑡 ) .    (11)

Note that the exp(·) operator is applied element-wise, so we can exchange the permutation operation with the exponential operation. 𝔄 belongs to R^{𝐿×𝐿}, and the (𝑖, 𝑗)-th element of the attention matrix 𝔄 can be decomposed into:

    𝔄_𝑖𝑗 = P^{-1}[ exp(−∥K′_𝑖 − Q′_𝑗∥₂²/(2√𝑑𝑡)) · exp(∥K′_𝑖∥₂²/(2√𝑑𝑡)) · exp(∥Q′_𝑗∥₂²/(2√𝑑𝑡)) ] ,    (12)

where Q′ = Q ⊗ K^(1) and K′ = P[K^(2)], and the first and last exponential terms can be computed in O(𝐿𝑑𝑡). The quadratic term B = exp(−∥K′_𝑖 − Q′_𝑗∥₂²/(2√𝑑𝑡)) is the Gaussian kernel, and it has a closed-form random feature approximation [20] as:

    B ≈ K̂ Q̂⊤ , where K̂_𝑖 = 𝜙(K′_𝑖), Q̂_𝑖 = 𝜙(Q′_𝑖),
    𝜙(x) := √(2/𝑐) [cos(𝜔₁⊤x + 𝑏₁), . . . , cos(𝜔𝑐⊤x + 𝑏𝑐)] .    (13)

Here 𝑐 is the number of random features, and we draw each parameter independently as 𝜔₁, . . . , 𝜔𝑐 ∼ N(0, √𝑑𝑡 I_𝑑𝑡) and 𝑏₁, . . . , 𝑏𝑐 ∼ U(0, 2𝜋). We can also utilize the orthogonal random features (ORF) variants [4] for better sampling of 𝜔 and 𝑏, on which we omit further discussion for simplicity. In this way, the overall complexity of the approximated A3 is O(𝐿𝑐𝑑𝑡), which is much lower than the original O(𝐿³𝑑𝑡) considering that 𝐿 ≫ 𝑐 in practice.
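Below is a small, self-contained random-Fourier-feature sketch of the Gaussian-kernel term B in Eqs.(12)-(13). It follows the standard recipe of Rahimi and Recht [20] for exp(−∥x − y∥²/(2𝜎²)) with the bandwidth 𝜎² = √𝑑𝑡 taken as our reading of the scaling above; the sampling scale of 𝜔 below reflects that reading rather than the released code, and all tensors are illustrative.

```python
import math
import torch

def random_feature_map(x, omega, b, c):
    # phi(x) = sqrt(2/c) [cos(omega_1^T x + b_1), ..., cos(omega_c^T x + b_c)]   (Eq. 13)
    return math.sqrt(2.0 / c) * torch.cos(x @ omega.T + b)

L, d_t, c = 32, 3, 256
sigma2 = math.sqrt(d_t)                       # assumed kernel bandwidth sigma^2 = sqrt(d_t)
Qp = torch.randn(L, d_t)                      # stands in for Q' = Q (x) K^(1)
Kp = torch.randn(L, d_t)                      # stands in for K' = P[K^(2)]

# Standard RFF for the Gaussian kernel: omega ~ N(0, I / sigma^2), b ~ U(0, 2*pi)
omega = torch.randn(c, d_t) / math.sqrt(sigma2)
b = 2 * math.pi * torch.rand(c)

B_exact = torch.exp(-torch.cdist(Kp, Qp) ** 2 / (2 * sigma2))            # (L, L)
B_approx = random_feature_map(Kp, omega, b, c) @ random_feature_map(Qp, omega, b, c).T
print((B_exact - B_approx).abs().max())       # small for moderately large c
```

Because B factors into the two feature maps, the full attention product can be evaluated right-to-left without ever forming the L × L matrix, which is where the O(𝐿𝑐𝑑𝑡) cost comes from.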
Figure 4: (a) The overview of the A3-enhanced Transformers' encoder: The input embeddings contain sequence inputs and positional tokens, followed by the multi-head attention, Add & LayerNorm, and Feed-Forward sub-layers. The dashed lines stand for the stacking of N layers in the encoder. The green and blue strips represent the entangled canonical heads and A3 heads, respectively. (b) The head configuration of applying A3: We replace 𝑠 of the 𝑚 standard heads (red) with the proposed A3 heads, as indicated by the arrows; they can be used interchangeably. Moreover, each of the 𝑠 heads (blue) is split into 𝑑 mini-heads (light blue) for better representations, and we apply 𝑝 groups of permutations on them individually. Finally, we concatenate all the heads' feature maps as the final output.

4.4 Overall Architecture
As we discussed in Section 3, the multi-head self-attention mechanism shows dominant performance, while Vig [26] and Clark et al. [5] demonstrate that most heads follow similar patterns with apparent redundancy. We found it beneficial to replace part of the canonical self-attention heads with the triplet attention A3, which works as an implicit regularizer for complementary dependency. Our implementation maintains the classic paradigm of the vanilla Transformer [6, 25], as depicted in Fig.(4a), and it can be expressed as:

    matAttention(Q, K, K^(1), K^(2), V) = concat(head_1, . . . , head_𝑠, . . . , head_𝑚) ,
    where head_𝑖 = atAttention(Q_𝑖, K^(1)_𝑖, K^(2)_𝑖, V_𝑖) if 𝑖 ≤ 𝑠, and head_𝑖 = Attention(Q_𝑖, K_𝑖, V_𝑖) otherwise.    (14)


Table 1: The performance comparison on the GLUE benchmark.

Model                        CoLA  MRPC  RTE   STS-B QNLI  QQP   SST-2 WNLI  MNLI  Average
(train size)                 8.5k  3.5k  2.5k  5.7k  108k  363k  67k   0.64k 392k  -
ELMo                         44.1  76.6  53.4  70.4  71.1  86.2  91.5  56.3  68.6  68.7
DistilBERT                   51.3  87.5  59.9  86.9  89.2  88.5  91.3  56.3  82.2  77.0
DistilBERT-A3 (ours)         55.1  87.9  67.8  87.5  88.9  90.3  91.5  56.3  83.8  78.8
DistilBERT-A3-FAVOR (ours)   55.1  87.8  65.6  87.1  88.3  89.8  91.1  56.3  83.6  78.3
BERT𝑏𝑎𝑠𝑒                     56.3  88.6  69.3  89.0  91.9  89.6  92.7  53.5  86.6  79.7
BERT-A3 (ours)               62.9  90.8  74.4  90.3  91.1  91.0  93.2  56.3  87.6  81.9
BERT-A3-FAVOR (ours)         61.8  89.8  72.2  90.1  90.8  90.8  92.7  56.3  87.4  81.3

Here 𝑠 controls the number of replaced heads. Since the output of A3 is aligned with the canonical self-attention by keeping the same recombination of the value V in Eqs.(8) and (10), we form the final layer output of matAttention(·) as the concatenation of all heads' feature maps. Note that the key K arises from a linear projection W ∈ R^{𝐷×𝑑}, while the keys K^(1), K^(2) are acquired by projections W ∈ R^{𝐷×𝑑𝑡}; thus we have three individual keys. Another important issue we have not yet addressed is the choice of the inner dimension 𝑑𝑡. The vector cross product in the STP limits 𝑑𝑡 to be 3 for it to be the dual operation. To alleviate the information loss, we let 𝑠 = 𝛼 · 𝑑𝑡 and build 𝑑 mini-heads across the 𝛼-interval on the 𝑑 dimensions under each head replacement. In practice, we start from the 𝛼 = 1 case, and the set of mini-heads has comparable representation ability and equivalent time complexity to the canonical one. In Fig.(4b), the 𝑚 heads (red) are divided into 𝑠 heads (blue) and (𝑚 − 𝑠) heads (green), and the blue heads are decomposed into mini-heads (light blue). The dashed lines in the mini-heads split them into 𝑝 groups, where we apply different permutations individually, as evenly as possible. A more detailed discussion on the choice of permutations is given in Section 5.3.3. In summary, the proposed architecture is briefly presented in Fig.(4).

Table 2: The performance on the SQuAD dataset.

Model                        EM     F1
BiDAF-ELMo                   -      85.6
R.M. Reader                  81.2   87.9
DistilBERT                   77.7   85.8
DistilBERT-A3 (ours)         78.5   87.1
DistilBERT-A3-FAVOR (ours)   78.9   87.8
BERT𝑏𝑎𝑠𝑒                     81.2   88.5
BERT-A3 (ours)               81.8   89.3
BERT-A3-FAVOR (ours)         81.6   88.9

5 EXPERIMENTS
This section empirically demonstrates the effectiveness of the A3 mechanism on the pre-trained DistilBERT model [22], which has 6 layers with 12 heads in each layer, and the pre-trained BERT𝑏𝑎𝑠𝑒 model [6], which has 12 layers with 12 heads in each layer. We implement DistilBERT-A3 and BERT-A3 based on Transformers [28] and perform model fine-tuning and evaluation on ten NLP tasks¹.

¹ The code can be downloaded at https://github.com/zhouhaoyi/TripletAttention

5.1 Setup
We perform experiments on two benchmarks with pre-trained models. More details and settings can be found in Appendix A.

5.2 Main Results
We summarize the results of the GLUE experiment in Table 1, where '(ours)' indicates the A3 variants we build. They are compared with the canonical architectures DistilBERT and BERT𝑏𝑎𝑠𝑒, together with the baseline ELMo [9]. Table 1 shows that our models DistilBERT-A3 and BERT-A3 outperform DistilBERT and BERT𝑏𝑎𝑠𝑒, respectively, on eight tasks, which demonstrates that A3 can enhance the complementary dependency of the self-attention mechanism and gain more competitive scores. More specifically, our DistilBERT-A3/BERT-A3 models achieve a 7.4%/11.7% score rise on the CoLA dataset (8.5k) and 13.2%/6.8% on the RTE dataset (2.5k). This is because A3 has better expression capability on smaller datasets, and the triplet connections help the model capture weak dependency with limited data. On the QNLI dataset, the A3 mechanism slightly reduces the performance. We also conduct the SQuAD experiment as an independent supplement. Note that although A3's results have not achieved the state-of-the-art performance on GLUE, because our practical resources are limited, the transparent architecture of A3 is expected to be promising on large-scale models.

The results of the SQuAD experiment are summarized in Table 2. Our models DistilBERT-A3 and BERT-A3 achieve better performance in both EM and F1 scores. We attribute the seemingly contradictory results (SQuAD vs. QNLI) to the different problem settings: QNLI forces Question-Answering to the sentence level and removes pairs with low vocabulary overlap, while SQuAD maintains the paragraph level. The triplet correlations fit better in the latter case.

5.3 Ablation Study
5.3.1 Effect of A3's Layer Deployment. We explore the effect of applying the A3 mechanism in different layers. We design two experiments across layers: continuously employing A3, and interleaving the usage of A3. BERT-A3𝑙[𝑖−𝑗] applies A3 from the 𝑖-th layer to the 𝑗-th layer.


BERT-A3𝑙{𝑖,𝑗,𝑘} applies A3 at the (𝑖, 𝑗, 𝑘)-th layers. Each A3-enhanced layer has three A3 heads, and all five permutations are included.

Table 3: The performance of consecutive BERT-A3 layer deployment on the selected GLUE tasks.

Model               CoLA  MRPC  RTE   STS-B
BERT-A3𝑙[1−3]       60.8  90.1  71.8  89.9
BERT-A3𝑙[4−6]       59.4  87.8  70.4  89.7
BERT-A3𝑙[7−9]       58.6  88.7  67.9  90.0
BERT-A3𝑙[10−12]     59.6  88.9  67.1  89.7
BERT-A3𝑙[1−6]       58.9  88.8  68.6  89.7
BERT-A3𝑙[7−12]      57.9  89.4  70.1  89.7
BERT-A3𝑙[1−12]      56.3  87.6  69.0  89.4

Table 4: The performance of unconsecutive BERT-A3 layer deployment on the selected GLUE tasks.

Model               CoLA  MRPC  RTE   STS-B
BERT-A3𝑙{1,3}       60.8  89.8  74.4  90.0
BERT-A3𝑙{1,3,5}     60.4  89.2  73.3  90.2
BERT-A3𝑙{2,4}       58.9  89.8  73.6  89.8
BERT-A3𝑙{2,4,6}     60.7  89.4  71.1  90.0
BERT-A3𝑙{1,12}      58.5  89.3  73.6  89.5
BERT-A3𝑙{1,3,10}    60.3  89.6  71.8  90.0
BERT-A3𝑙{1,3,10,12} 59.9  89.8  72.6  90.2
BERT-A3𝑙{2,11}      59.1  89.6  72.2  89.6
BERT-A3𝑙{2,4,11}    61.1  90.0  74.0  89.7
BERT-A3𝑙{2,4,9,11}  60.4  89.3  66.8  89.7

Table 5: The performance of different BERT-A3 head configurations on the selected GLUE tasks.

Model               CoLA  MRPC  RTE   STS-B
BERT-A3ℎ3           60.8  90.1  71.8  89.9
BERT-A3ℎ6           60.8  89.6  70.1  89.5
BERT-A3ℎ9           57.6  88.7  66.8  88.3
BERT-A3ℎ12          53.6  85.7  57.3  88.1

Table 6: The performance of different BERT-A3 permutation groupings on the selected GLUE tasks.

Model               CoLA  MRPC  RTE   STS-B
BERT-A3𝑃{1}         61.6  88.9  70.4  89.6
BERT-A3𝑃{2}         60.2  89.9  71.1  89.7
BERT-A3𝑃{3}         60.8  89.3  70.0  89.4
BERT-A3𝑃{4}         58.6  90.0  69.7  89.4
BERT-A3𝑃{5}         58.6  88.9  70.8  89.8
BERT-A3𝑃{1,2}       61.6  88.5  70.8  89.6
BERT-A3𝑃{1,3}       59.8  90.8  71.8  90.0
BERT-A3𝑃{1,4}       60.6  88.9  70.4  89.3
BERT-A3𝑃{1,5}       61.8  88.8  71.5  89.6
BERT-A3𝑃{2,3}       60.4  89.8  71.1  90.0
BERT-A3𝑃{1,2,3}     60.8  89.6  71.8  90.2
BERT-A3𝑃{2,3,4}     59.9  89.5  72.6  90.0
BERT-A3𝑃{1,2,3,4}   61.8  89.9  71.4  90.2
BERT-A3𝑃{2,3,4,5}   62.3  90.8  70.8  90.3

Table 3 and Table 4 summarize the layer deployment results on the selected GLUE tasks, including CoLA, MRPC, RTE, and STS-B. Firstly, we can conclude that using A3 heads in lower consecutive layers leads to better scores. This is because the lower layers have rich raw correlations, and the A3 layers can effectively enhance the feature diversity. Adding A3 layers on the higher layers receives a reduced performance improvement; however, it is still worth applying, as it yields better results than BERT𝑏𝑎𝑠𝑒. The above finding supports our motivation: if the self-attention can attend to complementary dependency, the Transformer will achieve better alignment ability. Comparing the results in Table 4 with Table 3, we find that employing the A3 mechanism in the first and third layers achieves more competitive scores than applying A3 heads in consecutive lower layers. This may be because the insertion of canonical self-attention in the second layer allows the triplet information introduced by the previous A3 layer to be fully exchanged and shared.

5.3.2 Effect of A3's Head Configuration. In the layer deployment experiments, each layer contains exactly three A3 heads. To better explore the effect of different A3 head configurations, we train various BERT-A3 models with different head configurations in layers 1-3. BERT-A3ℎ𝑛 represents a model whose A3 layers have 𝑛 A3 heads. We design a random operator that generates the index list of the heads applying the A3 mechanism in each A3 layer. We use different random operators to run five trials for each model.

The average results on the selected GLUE tasks are summarized in Table 5. We find that increasingly adding A3 heads does not constantly improve the metrics, and the performance of the model shows rapid degradation after applying more than six A3 heads. That probably originates from the interference between the triplet correlations and the pairwise correlations. The A3 mechanism should not dominate the attention block, but it can be used as a supplement for better diversity.

5.3.3 Effect of A3's Permutation Grouping. In this section, we set up experiments to evaluate the effects of permutation grouping. We provide five alternative permutation strategies used in the cross product, namely "random permutation", "next one word", "next two words", "reverse sequence" and "cross of previous and next word", presented as {P1, P2, P3, P4, P5} respectively. The experiments are three-fold: the first uses only one Permuted STP method on all heads; the second uses two Permuted STP methods on each head; and the third combines three or four Permuted STP methods. The A3 mechanism in all experiments is set in the first and third layers, and the number of A3 heads in each A3 layer is set to 3.

The results of the three experiments are reported in Table 6. We find that a single Permuted STP method cannot achieve the best scores on all datasets, and that "random permutation" performs well. It may be that "random permutation" increases the probability of a cross product between two distant dissimilar vectors, thereby discovering more dependency.

In the experiment of using two Permuted STP methods, we mainly combine "random permutation" with the other Permuted STP methods. In Table 6, we can find that the combination of "random permutation" and "next two words" achieves more effective results. The results in Table 6 also show that combinations of methods other than "random permutation" can achieve almost the best scores, which demonstrates that increasing the variety of Permuted STP methods and implementing different combinations of Permuted STP methods on different A3 heads can capture more significant dissimilar information and enhance the diversity of the attention mechanism.

Figure 5: The performance (CoLA Matthews Correlation and RTE Accuracy) decreases when the layer number of the Transformers degrades from 12 to 2, for BERT𝑏𝑎𝑠𝑒 and BERT-A3.

Figure 6: Comparison of BERT-A3-L3, BERT-A3, and BERT-A3-FAVOR in terms of forward and backward pass speed (log2(T) in seconds vs. log2(L)), for batch sizes 1, 2, and 4.

Figure 7: The pair-wise attention connections (MISCs, PERs, ORGs, LOCs, CROSS) on the CoNLL-2003 dataset across the 12 layers of self-attention; our model BERT-A3 improves the "CROSS" connections (orange) and other connections between tokens. (a) The NER visualization on the BERT𝑏𝑎𝑠𝑒 model. (b) The NER visualization on the BERT-A3𝑙{1,2,3,10,11,12} model.

5.4 Case Study
5.4.1 Layer Stacking Degradation (A3 vs Canonical). We conduct layer stacking degradation experiments on BERT𝑏𝑎𝑠𝑒 and BERT-A3 (layers 1-2). Starting from the highest layer, we reduce one layer at a time and fine-tune the model on the CoLA and RTE datasets. The results are shown in Figure 5. We find that BERT-A3 consistently outperforms BERT𝑏𝑎𝑠𝑒 with fewer layers. The grey dashed line represents the "ruler" of the baseline performance. BERT-A3 with only eight layers outperforms BERT𝑏𝑎𝑠𝑒 with the complete 12 layers on both datasets, which means that we can reduce the number of parameters by 25.6% and the computation time by 28.9% by using the A3 mechanism while still achieving similar or better results.

5.4.2 Applying the FAVOR Trick (A3 vs A3-FAVOR). We empirically validate the acceleration of A3 by applying FAVOR [4]. We compare the speed of the model forward and backward passes among BERT-A3-L3, BERT-A3 and BERT-A3-FAVOR, which are denoted as L3, L2 and L1, respectively. We use the six-layer BERT-A3 model, each layer has six A3 heads, and we gradually increase the batch size and the sequence length of the input data. The results are shown in Figure 6. We find that, compared with BERT-A3-L3, BERT-A3 and BERT-A3-FAVOR reduce the computation cost, and the BERT-A3-FAVOR model allows extensive batch training and lower computation time, which contributes to a reduction of the total training time.

5.4.3 NER Visualization (A3 vs Canonical). We conduct an experiment on the Named-Entity Recognition (NER) task using the standard CoNLL-2003 dataset, which concentrates on four types of named entities: persons (PER), locations (LOC), organizations (ORG), and names of miscellaneous entities that do not belong to the three groups (MISC). We remove the 'SEP' connections to avoid their side impact; please refer to Appendix C.1 for complete results.

In Fig.(7a), 'CROSS' represents the cross dependency between different named-entities. We can find that 'CROSS' connections are rare in the canonical attention of BERT𝑏𝑎𝑠𝑒, and the main connections are between tokens of the persons entity. In Fig.(7b), A3 significantly increases the number of 'CROSS' connections and other connections at the layers {1,2,3,10,11,12}, where we apply the A3 mechanism.


Figure 8: The visualization of the pairwise connections ("self connection", "writer-doctor", "writer-policeman", "doctor-policeman") in self-attention using the A3 mechanism, shown as the attention dependency (%) over the 12 layers of self-attention. We select the major connections, whose scores are greater than (mean + std).

We suggest this as a concrete visualization of A3's effectiveness: it can discover high-order information and provide diverse features.

Recalling the visualization in Fig.(2), we further visualize the "writer-doctor-policeman" pairs when using the A3 mechanism in Fig.(8). We find that the A3 mechanism significantly increases the proportion of various connections compared with Fig.(2), which indicates that A3 meets the motivation of introducing triplet dependency to enhance the alignment ability of Transformers.

6 CONCLUSION
In this work, we address the performance bottleneck of large-scale Transformers, and the solution is motivated by slightly breaking the similarity assumption in the dot-product self-attention computation. The proposed Triplet Attention (A3) mechanism allows attention over dissimilar pairs, contributes to building high-level dependency, and can be deployed with canonical self-attention through proper configurations. Besides, we also provide A3's efficient variants for long inputs on large-scale models. Experimental results on two benchmarks demonstrate that A3 significantly outperforms the baselines and shows the benefits of introducing triplet attention into Transformers.

ACKNOWLEDGMENTS
This work was supported by the National Natural Science Foundation of China (No. U20B2053, No. 61872022 and No. 61421003) and the State Key Laboratory of Software Development Environment (No. SKLSDE-2020ZX-12). Special thanks for the computing infrastructure provided by BDBC.

REFERENCES
[1] George B Arfken and Hans J Weber. 1999. Mathematical methods for physicists.
[2] Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer Normalization. CoRR abs/1607.06450 (2016).
[3] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In NIPS.
[4] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Jared Davis, Tamás Sarlós, David Belanger, Lucy Colwell, and Adrian Weller. 2020. Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers. CoRR abs/2006.03555 (2020).
[5] Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What Does BERT Look At? An Analysis of BERT's Attention. CoRR abs/1906.04341 (2019).
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In ACL. 4171–4186.
[7] William Fedus, Barret Zoph, and Noam Shazeer. 2021. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. CoRR abs/2101.03961 (2021).
[8] Po-Yao Huang, Xiaojun Chang, and Alexander G. Hauptmann. 2019. Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations. In EMNLP. 1461–1467.
[9] Vidur Joshi, Matthew E. Peters, and Mark Hopkins. 2018. Extending a Parser to Distant Domains Using a Few Dozen Partially Annotated Examples. In ACL. 1190–1199.
[10] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. 2020. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. In ICML, Vol. 119. 5156–5165.
[11] Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The Efficient Transformer. In ICLR.
[12] Matthäus Kleindessner and Ulrike von Luxburg. 2017. Kernel functions based on triplet comparisons. In NIPS. 6807–6817.
[13] Jian Li, Zhaopeng Tu, Baosong Yang, Michael R. Lyu, and Tong Zhang. 2018. Multi-Head Attention with Disagreement Regularization. In EMNLP. 2897–2903.
[14] Jianxin Li, Haoyi Zhou, Pengtao Xie, and Yingchun Zhang. 2017. Improving the Generalization Performance of Multi-class SVM via Angular Regularization. In IJCAI. 2131–2137.
[15] Zehui Lin, Pengfei Liu, Luyao Huang, Junkun Chen, Xipeng Qiu, and Xuanjing Huang. 2019. DropAttention: A Regularization Method for Fully-Connected Self-Attention Networks. CoRR abs/1907.11065 (2019).
[16] Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. ACL 8 (2020), 726–742.
[17] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. 2018. Image Transformer. In ICML 2018, Vol. 80. 4052–4061.
[18] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. (2018).
[19] Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. 2020. Compressive Transformers for Long-Range Sequence Modelling. In ICLR.
[20] Ali Rahimi and Benjamin Recht. 2007. Random Features for Large-Scale Kernel Machines. In NIPS. 1177–1184.
[21] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In EMNLP. 2383–2392.
[22] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR abs/1910.01108 (2019).
[23] Neil Stewart, Gordon DA Brown, and Nick Chater. 2005. Absolute identification by relative judgment. Psychological review 112, 4 (2005), 881.
[24] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020. Efficient Transformers: A Survey. CoRR abs/2009.06732 (2020).
[25] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In NIPS. 5998–6008.
[26] Jesse Vig. 2019. A Multiscale Visualization of Attention in the Transformer Model. In ACL 2019. ACL, 37–42.
[27] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In EMNLP. 353–355.
[28] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In EMNLP. 38–45.
[29] Pengtao Xie, Aarti Singh, and Eric P. Xing. 2017. Uncorrelation and Evenness: a New Diversity-Promoting Regularizer. In ICML, Vol. 70. 3811–3820.
[30] Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. 2021. Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet. CoRR abs/2101.11986 (2021).
[31] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. 2020. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. In AAAI, Vol. 35. 11106–11115.


A EXPERIMENT DETAILS
We perform experiments on two benchmarks with pre-trained models. The following sections give the implementation details.

A.1 GLUE
We conduct experiments to evaluate the language understanding and generalization capabilities of our proposed A3 on the General Language Understanding Evaluation (GLUE) benchmark [27], which is a collection of diverse natural language understanding tasks across different applications.

Settings: We set batch size = 32 and perform fine-tuning for four epochs on the training data of the 9 GLUE tasks. Other settings follow the recommendation of the original paper. We perform a grid search on the layer replacement, head replacement, and cross product permutation for each task, because the proposed A3 mechanism can be used interchangeably with the canonical self-attention. There are two sets of layer deployment, where the first combination is chosen from {Layer1−3, Layer4−6, Layer7−9, Layer10−12} and the alternative is {Layer1,3,5, Layer2,4,6, Layer1,3,10, Layer2,4,11}. The head grouping is also an important part of the A3 mechanism: we choose the number of replaced heads from {3, 6, 9, 12}, and we provide a random operator for the model, which generates the index list of the heads that apply the A3 mechanism in each layer. Another important selection is the cross product permutation and combination. The alternative strategies we provide for selecting elements in the cross product include "random permutation", "next one word", "next two words", "reverse sequence" and "cross of previous and next word", presented as {P1, P2, P3, P4, P5}, respectively. More details can be found in Appendix B.1. For A3-FAVOR, we choose the number of random features 𝑐 from {32, 64, 128, 256}. We fine-tune the pre-trained model on the corresponding single-task training data and do not use any ensembling strategy or multi-tasking scheme. All evaluations are performed on the Dev set.

Metric: We use three different evaluation metrics on the 9 tasks. The Matthews Correlation Coefficient is applied on CoLA:

    MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)) ,    (15)

where TP stands for the number of True Positives, TN for True Negatives, FP for False Positives and FN for False Negatives. The Pearson Correlation Coefficient is used for STS-B:

    PCC = (N Σ 𝑦𝑖𝑦̂𝑖 − Σ 𝑦𝑖 Σ 𝑦̂𝑖) / ( √(N Σ 𝑦𝑖² − (Σ 𝑦𝑖)²) √(N Σ 𝑦̂𝑖² − (Σ 𝑦̂𝑖)²) ) .    (16)

And the Accuracy is used for the others:

    ACC = (TP + TN) / (TP + TN + FP + FN) .    (17)

A.2 SQuAD v1.1
We also conduct experiments on the question-answering task. The Stanford Question Answering Dataset (SQuAD v1.1) [21] is a collection of 100k crowd-sourced question/answer pairs. The task is to predict the answer text span in the passage, given a question and a passage from Wikipedia containing the answer. This is the original task of QNLI in GLUE, which also focuses on the similar problem discussed in Section 5.2.

Settings: We maintain a similar fine-tuning strategy as in the GLUE experiment (Section A.1).

Metric: We use two different metrics. The Exact Match (EM) is the number of exactly correct answers. For each question and answer pair, if the model's 𝑖-th result exactly matches the true answer(s) at the character level, we have EM𝑖 = 1, otherwise EM𝑖 = 0. Naturally, we have:

    EM = #(EM) / #(Total Test Pairs) .    (18)

The F1 score has been widely used in classification problems, and we choose it to evaluate the overall precision and recall of the words chosen as part of the answer. Precision (P) is the proportion of shared words in the prediction to the total words, and Recall (R) is the proportion of shared words in the ground truth to the total words:

    F1 = 2 · P · R / (P + R) .    (19)

B ABLATION STUDY DETAILS
B.1 Permutation Grouping
We have used five different permutation strategies in Section 5.3.3. The specific operations for the various types of alignment changes are summarized as follows (a code sketch of these strategies is given after Fig.(9)):
• P1: "random permutation": shuffle the sequence randomly.
• P2: "next one word": shift the sequence forward by one word, and fill the last word with the first word.
• P3: "next two words": shift the sequence forward by two words, and fill the last two words with the first two words.
• P4: "reverse sequence": reverse the sequence.
• P5: "cross of previous and next word": swap the values of the odd and even indices in turn, starting from indices 0 and 1.
In Fig.(9), we show examples of the permutation strategies. Note that the permutation strategies are not limited to the above examples; researchers could develop more variants.

C CASE STUDY DETAILS
We perform three different case studies. The Layer Number Degradation (Section 5.4.1) and Applying the FAVOR Trick (Section 5.4.2) follow similar experiment settings as in Appendix A. Thus we present only the NER implementation in this section.

C.1 NER Visualization
We conduct an experiment on the Named-Entity Recognition (NER) task using the standard CoNLL-2003 dataset, and we perform two visualizations of the attention connections on the NER results.

Settings: We use a batch size of 32 and fine-tune, for 5 epochs, a pre-trained BERT𝑏𝑎𝑠𝑒 model and a BERT-A3 model whose layers {1,2,3,10,11,12} apply the A3 mechanism. The other settings follow the recommendation of past work. We preprocess the texts before feeding them to the model, including adding the 'CLS' and 'SEP' special tokens, whose named-entity labels are set to -100, and splitting the words into subwords. We select the significant connections in the attention feature map whose scores are greater than (mean + std). We study the named-entities of the pair-wise tokens corresponding to those major connections and use 'MISCs', 'PERs', 'ORGs', and 'LOCs' to represent the pair-wise attention connections of tokens with the same named-entities, meaning that both tokens are miscellaneous, persons, locations or organizations.


Figure 9: The illustration of the five different permutation strategies applied to a token sequence (token 0 ... token m): next one word, next two words, random permutation, reverse sequence, and cross of previous and next word.
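The following sketch (our illustrative code, not the released implementation) expresses the five permutation strategies of Appendix B.1 as index permutations over a length-L sequence; P5 assumes an even L for the pairwise swap.

```python
import torch

def permutation(strategy, L):
    """Return an index permutation implementing strategies P1-P5 from Appendix B.1."""
    idx = torch.arange(L)
    if strategy == 'P1':    # random permutation: shuffle the sequence randomly
        return torch.randperm(L)
    if strategy == 'P2':    # next one word: shift forward by one, wrap the first word to the end
        return torch.roll(idx, shifts=-1)
    if strategy == 'P3':    # next two words: shift forward by two, wrap the first two words
        return torch.roll(idx, shifts=-2)
    if strategy == 'P4':    # reverse sequence
        return torch.flip(idx, dims=[0])
    if strategy == 'P5':    # cross of previous and next word: swap each (even, odd) index pair
        swapped = idx.clone()
        swapped[0::2], swapped[1::2] = idx[1::2], idx[0::2]
        return swapped
    raise ValueError(strategy)

L = 8
K2 = torch.randn(L, 3)
for s in ['P1', 'P2', 'P3', 'P4', 'P5']:
    perm = permutation(s, L)
    permuted_key = K2[perm]          # P[K^(2)] as used in the Permuted STP of Eq.(9)
```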

Figure 10: The pair-wise attention connections (MISCs, PERs, ORGs, LOCs, CROSS, SEP) of BERT𝑏𝑎𝑠𝑒 and BERT-A3 on the CoNLL-2003 dataset across the 12 layers of self-attention; our model BERT-A3 reduces the connections to the special tokens 'CLS' or 'SEP' in the lower layers. (a) The NER visualization on the BERT𝑏𝑎𝑠𝑒 model. (b) The NER visualization on the BERT-A3𝑙{1,2,3,10,11,12} model.

We use 'SEP' to represent connections of two tokens where one of them is a special token 'CLS' or 'SEP', and we use 'CROSS' to represent connections of tokens with different named-entities. Since the A3 mechanism splits the original attention head into many small attention heads, in order to maintain a balance of values we calculate the proportions of the various types of connections in the A3 attention and the canonical self-attention separately, and then compute the final proportion of the various connections according to the number of A3 attention heads before the split and the number of canonical self-attention heads.

Metric: We count the number of major connections between different named-entities. We have the attention dependency for MISCs as:

    AD(%) = #(MISCs Connections) / #(Total Connections) ,    (20)

where the formulation also applies to PERs, ORGs, LOCs and CROSS.

Discussion. We visualize the results containing the 'SEP' connections in Fig.(10). From Fig.(10a), we can find that the number of 'SEP' connections is larger than that of the other connections, which means that many tokens connect to the special tokens 'CLS' or 'SEP' in self-attention, and these connections carry redundant and invalid information. This observation is also consistent with previous research. From Fig.(10b), we can find that A3 reduces the number of 'SEP' connections at the layers {1,2,3,10,11}, which shows that A3 can reduce redundant information in attention and enhance the model's expression ability.

D PLATFORM
All models were trained/tested on a cluster, where each node is equipped with 2 × Intel(R) Xeon Gold CPUs @ 2.40GHz and 4 × Nvidia V100 GPUs (32 GB). Due to the limitation of available nodes, we are unable to evaluate A3 on large-scale language models like GPT-3 and T5.
