
Transformer Model (1/2):

Attention without RNN

Shusen Wang
Transformer Model

• Original paper: Vaswani et al. Attention Is All You Need. In NIPS, 2017.
Transformer Model

• Transformer is a Seq2Seq model.
• Transformer is not an RNN.
• Purely based on attention and dense layers.
• Higher accuracy than RNNs on large datasets.
Revisiting Attention for RNN
Attention for Seq2Seq Model

[Diagram: Seq2Seq RNN with attention. The encoder RNN A reads inputs x_1, x_2, ⋯, x_m and produces states h_1, h_2, ⋯, h_m; the decoder RNN A′ reads inputs x′_1, x′_2, ⋯, x′_j and produces states s_0, s_1, ⋯, s_j; a context vector c_j is computed at each decoder step.]
Attention for Seq2Seq Model
Weights: α_ij = align(h_i, s_j).
• Compute k_:i = W_K h_i and q_:j = W_Q s_j. (W_K and W_Q are parameter matrices.)
• Stack the keys into a matrix: K = [k_:1, k_:2, k_:3, ⋯, k_:m].
• Compute weights: α_:j = Softmax(Kᵀ q_:j) = [α_1j, α_2j, ⋯, α_mj]ᵀ ∈ ℝᵐ.
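The Softmax step can be sanity-checked in a few lines of NumPy; this is a minimal sketch, assuming illustrative dimensions and random stand-ins for the keys and the query (none of these values come from the slides):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax of a vector z.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

m, d_k = 6, 4                     # m encoder steps, key/query dimension (illustrative)
rng = np.random.default_rng(0)
K = rng.normal(size=(d_k, m))     # K = [k_:1, ..., k_:m], keys stacked as columns
q_j = rng.normal(size=d_k)        # query q_:j for one decoder step

alpha_j = softmax(K.T @ q_j)      # alpha_:j = Softmax(K^T q_:j) in R^m
print(alpha_j.shape, alpha_j.sum())   # (6,) and ~1.0: m non-negative weights summing to 1
```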
Attention for Seq2Seq Model
• Query: q_:j = W_Q s_j. (To match others.)
• Key: k_:i = W_K h_i. (To be matched.)
• Value: v_:i = W_V h_i. (To be weighted-averaged.)
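As a small sketch of the three projections (all dimensions and the random states are illustrative assumptions), the same encoder state h_i is mapped by W_K and W_V, while the decoder state s_j is mapped by W_Q:

```python
import numpy as np

d_h, d_s, d_k, d_v = 8, 8, 4, 5    # illustrative dimensions of h_i, s_j, keys/queries, values
rng = np.random.default_rng(1)
W_Q = rng.normal(size=(d_k, d_s))  # parameter matrices (learned during training in practice)
W_K = rng.normal(size=(d_k, d_h))
W_V = rng.normal(size=(d_v, d_h))

h_i = rng.normal(size=d_h)         # one encoder hidden state
s_j = rng.normal(size=d_s)         # one decoder hidden state

q_j = W_Q @ s_j                    # query: used to match the keys
k_i = W_K @ h_i                    # key: what the query is matched against
v_i = W_V @ h_i                    # value: what gets weighted-averaged into c_j
print(q_j.shape, k_i.shape, v_i.shape)   # (4,) (4,) (5,)
```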
Attention for Seq2Seq Model
Query: q_:j = W_Q s_j, Key: k_:i = W_K h_i, Value: v_:i = W_V h_i.
Weights: α_:j = Softmax(Kᵀ q_:j) ∈ ℝᵐ.
Context vector: c_j = α_1j v_:1 + ⋯ + α_mj v_:m = V α_:j.
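Putting the three formulas together, a hedged NumPy sketch of one decoder step might look as follows (the helper name attention_step and every dimension are assumptions for illustration, not part of the lecture):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax of a vector z.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_step(s_j, H, W_Q, W_K, W_V):
    """Context vector c_j for one decoder state s_j, given encoder states H = [h_1, ..., h_m]."""
    q_j = W_Q @ s_j                # query
    K = W_K @ H                    # keys as columns: K = [k_:1, ..., k_:m]
    V = W_V @ H                    # values as columns: V = [v_:1, ..., v_:m]
    alpha_j = softmax(K.T @ q_j)   # alpha_:j = Softmax(K^T q_:j)
    return V @ alpha_j             # c_j = alpha_1j v_:1 + ... + alpha_mj v_:m

m, d_h, d_k, d_v = 6, 8, 4, 5
rng = np.random.default_rng(2)
H = rng.normal(size=(d_h, m))      # encoder hidden states as columns
s_j = rng.normal(size=d_h)         # current decoder state (same size as h_i here, for simplicity)
W_Q = rng.normal(size=(d_k, d_h))
W_K = rng.normal(size=(d_k, d_h))
W_V = rng.normal(size=(d_v, d_h))
c_j = attention_step(s_j, H, W_Q, W_K, W_V)
print(c_j.shape)                   # (5,)
```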

Question: How to remove RNN while keeping attention?


Attention without RNN
Attention Layer

• We study the Seq2Seq model (encoder + decoder).
• Encoder's inputs are vectors x_1, x_2, ⋯, x_m.
• Decoder's inputs are vectors x′_1, x′_2, ⋯, x′_t.

Attention Layer
• Keys and values are based on the encoder's inputs x_1, x_2, ⋯, x_m.
• Key: k_:i = W_K x_i.
• Value: v_:i = W_V x_i.
• Queries are based on the decoder's inputs x′_1, x′_2, ⋯, x′_t.
• Query: q_:j = W_Q x′_j.
Attention Layer
• Compute weights: α_:1 = Softmax(Kᵀ q_:1) ∈ ℝᵐ.
• Compute context vector: c_:1 = α_11 v_:1 + ⋯ + α_m1 v_:m = V α_:1.
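The last equality (the weighted sum of the values equals the matrix-vector product V α_:1) can be verified numerically; a tiny sketch with made-up values:

```python
import numpy as np

m, d_v = 6, 5
rng = np.random.default_rng(3)
V = rng.normal(size=(d_v, m))               # values as columns: V = [v_:1, ..., v_:m]
alpha_1 = rng.random(m)
alpha_1 /= alpha_1.sum()                    # any non-negative weights that sum to 1

explicit = sum(alpha_1[i] * V[:, i] for i in range(m))   # alpha_11 v_:1 + ... + alpha_m1 v_:m
matrix_form = V @ alpha_1                                # V alpha_:1
print(np.allclose(explicit, matrix_form))                # True
```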
Attention Layer
• Repeat for every query: α_:j = Softmax(Kᵀ q_:j) ∈ ℝᵐ and c_:j = α_1j v_:1 + ⋯ + α_mj v_:m = V α_:j, for j = 1, ⋯, t.
Attention Layer
• Output of the attention layer: C = [c_:1, c_:2, c_:3, ⋯, c_:t].
• Here, c_:j = V ⋅ Softmax(Kᵀ q_:j).
• Thus, c_:j is a function of x′_j and x_1, ⋯, x_m.
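A compact NumPy sketch of the whole layer (the function name attn, the column-stacked input convention, and all dimensions are illustrative assumptions; Softmax is applied to each column of KᵀQ separately, matching the per-query formula above):

```python
import numpy as np

def attn(X, Xp, W_Q, W_K, W_V):
    """Attention layer C = Attn(X, X'): keys/values from encoder inputs X,
    queries from decoder inputs X'; one context vector per query."""
    Q = W_Q @ Xp                          # queries, shape (d_k, t)
    K = W_K @ X                           # keys,    shape (d_k, m)
    V = W_V @ X                           # values,  shape (d_v, m)
    S = K.T @ Q                           # (m, t); column j is K^T q_:j
    S = S - S.max(axis=0, keepdims=True)
    A = np.exp(S)
    A = A / A.sum(axis=0, keepdims=True)  # column-wise Softmax: A[:, j] = alpha_:j
    return V @ A                          # C = [c_:1, ..., c_:t], with c_:j = V alpha_:j

m, t, d_in, d_k, d_v = 6, 3, 8, 4, 5
rng = np.random.default_rng(4)
X = rng.normal(size=(d_in, m))            # encoder inputs as columns
Xp = rng.normal(size=(d_in, t))           # decoder inputs as columns
W_Q = rng.normal(size=(d_k, d_in))
W_K = rng.normal(size=(d_k, d_in))
W_V = rng.normal(size=(d_v, d_in))
C = attn(X, Xp, W_Q, W_K, W_V)
print(C.shape)                            # (5, 3): one context vector per decoder input
```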
Attention Layer for Machine Translation
• Translate English to German.
• Use c_:2 to generate the 3rd German word: feed c_:2 into a softmax classifier, which outputs a probability distribution p_2 over the German vocabulary.
• The generated word is then used as the next decoder input x′_3.
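A hedged sketch of this generation step (the classifier weights W_out, the toy vocabulary size, and the random stand-in for c_:2 are placeholders; the slides only specify that c_:2 is fed to a softmax classifier that outputs p_2):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

d_c, vocab_size = 5, 10                      # context dimension and a toy German vocabulary size
rng = np.random.default_rng(5)
c_2 = rng.normal(size=d_c)                   # stand-in for the context vector c_:2
W_out = rng.normal(size=(vocab_size, d_c))   # classifier weights (learned in practice)

p_2 = softmax(W_out @ c_2)                   # probability distribution over the vocabulary
next_word_id = int(np.argmax(p_2))           # e.g. greedily pick the most likely 3rd word
print(round(p_2.sum(), 6), next_word_id)     # 1.0 and the index of the predicted word
```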
Attention Layer
• Attention layer: C = Attn(X, X′).
• Encoder's inputs: X = [x_1, x_2, ⋯, x_m].
• Decoder's inputs: X′ = [x′_1, x′_2, ⋯, x′_t].
• Parameters: W_Q, W_K, W_V.

Self-Attention without RNN
Self-Attention Layer
• Self-attention layer: C = Attn(X, X).
• Inputs: X = [x_1, x_2, ⋯, x_m].
• Parameters: W_Q, W_K, W_V.
Self-Attention Layer
Query: q_:i = W_Q x_i, Key: k_:i = W_K x_i, Value: v_:i = W_V x_i.
Weights: α_:j = Softmax(Kᵀ q_:j) ∈ ℝᵐ, for j = 1, ⋯, m.
Context vector: c_:j = α_1j v_:1 + ⋯ + α_mj v_:m = V α_:j, for j = 1, ⋯, m.
Self-Attention Layer
• Here, c_:j = V ⋅ Softmax(Kᵀ q_:j).
• Thus, c_:j is a function of all the m vectors x_1, ⋯, x_m.
• Output of the self-attention layer: C = [c_:1, c_:2, c_:3, ⋯, c_:m].
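Self-attention is the same computation with X supplying the queries as well as the keys and values; a minimal self-contained sketch (dimensions and random inputs are illustrative assumptions):

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Self-attention layer C = Attn(X, X): queries, keys, and values all come from X."""
    Q, K, V = W_Q @ X, W_K @ X, W_V @ X
    S = K.T @ Q                            # (m, m) score matrix
    S = S - S.max(axis=0, keepdims=True)
    A = np.exp(S)
    A = A / A.sum(axis=0, keepdims=True)   # A[:, j] = alpha_:j = Softmax(K^T q_:j)
    return V @ A                           # C = [c_:1, ..., c_:m]

m, d_in, d_k, d_v = 6, 8, 4, 5
rng = np.random.default_rng(6)
X = rng.normal(size=(d_in, m))
W_Q = rng.normal(size=(d_k, d_in))
W_K = rng.normal(size=(d_k, d_in))
W_V = rng.normal(size=(d_v, d_in))
C = self_attention(X, W_Q, W_K, W_V)
print(C.shape)                             # (5, 6): one context vector per input x_i
```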
Summary

• Attention was originally developed for Seq2Seq RNN models [1].


• Self-attention: attention for any RNN model (not necessarily a Seq2Seq model) [2].
• Attention can be used without RNNs [3].
• We learned how to build the attention layer and the self-attention layer.

References:

1. Bahdanau, Cho, & Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR, 2015.
2. Cheng, Dong, & Lapata. Long Short-Term Memory-Networks for Machine Reading. In EMNLP, 2016.
3. Vaswani et al. Attention Is All You Need. In NIPS, 2017.
Attention Layer
• Attention layer: C = Attn(X, X′).
• Query: q_:j = W_Q x′_j,
• Key: k_:i = W_K x_i,
• Value: v_:i = W_V x_i.
• Output: c_:j = V ⋅ Softmax(Kᵀ q_:j).
Self-Attention Layer
• Attention layer: C = Attn(X, X′).
• Self-attention layer: C = Attn(X, X).
Thank you!
