
Transformer Model (1/2):

Attention without RNN

Shusen Wang
Transformer Model

• Original paper: Vaswani et al. Attention Is All You Need. In NIPS, 2017.
Transformer Model

• Transformer is a Seq2Seq model.
• Transformer is not an RNN.
• Purely based on attention and dense layers.
• Higher accuracy than RNNs on large datasets.
Revisiting Attention for RNN
Attention for Seq2Seq Model

[Diagram: Seq2Seq RNN with attention. The encoder RNN A reads inputs x_1, x_2, ⋯, x_m and produces states h_1, h_2, ⋯, h_m; the decoder RNN A′ reads inputs x′_1, x′_2, ⋯, x′_j and produces states s_0, s_1, ⋯, s_j; a context vector c_j is computed at each decoder step.]
Attention for Seq2Seq Model
Weights: α_ij = align(h_i, s_j).
• Compute k_:i = W_K h_i and q_:j = W_Q s_j. (W_K and W_Q are parameter matrices.)
• Stack the keys into a matrix: K = [k_:1, k_:2, k_:3, ⋯, k_:m].
• Compute weights: α_:j = Softmax(Kᵀ q_:j) = [α_1j, α_2j, ⋯, α_mj]ᵀ ∈ ℝᵐ.
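The Softmax step can be sanity-checked in a few lines of NumPy; this is a minimal sketch, assuming illustrative dimensions and random stand-ins for the keys and the query (none of these values come from the slides):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax of a vector z.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

m, d_k = 6, 4                     # m encoder steps, key/query dimension (illustrative)
rng = np.random.default_rng(0)
K = rng.normal(size=(d_k, m))     # K = [k_:1, ..., k_:m], keys stacked as columns
q_j = rng.normal(size=d_k)        # query q_:j for one decoder step

alpha_j = softmax(K.T @ q_j)      # alpha_:j = Softmax(K^T q_:j) in R^m
print(alpha_j.shape, alpha_j.sum())   # (6,) and ~1.0: m non-negative weights summing to 1
```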
Attention for Seq2Seq Model
• Query: q_:j = W_Q s_j. (To match others.)
• Key: k_:i = W_K h_i. (To be matched.)
• Value: v_:i = W_V h_i. (To be weighted-averaged.)
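As a small sketch of the three projections (all dimensions and the random states are illustrative assumptions), the same encoder state h_i is mapped by W_K and W_V, while the decoder state s_j is mapped by W_Q:

```python
import numpy as np

d_h, d_s, d_k, d_v = 8, 8, 4, 5    # illustrative dimensions of h_i, s_j, keys/queries, values
rng = np.random.default_rng(1)
W_Q = rng.normal(size=(d_k, d_s))  # parameter matrices (learned during training in practice)
W_K = rng.normal(size=(d_k, d_h))
W_V = rng.normal(size=(d_v, d_h))

h_i = rng.normal(size=d_h)         # one encoder hidden state
s_j = rng.normal(size=d_s)         # one decoder hidden state

q_j = W_Q @ s_j                    # query: used to match the keys
k_i = W_K @ h_i                    # key: what the query is matched against
v_i = W_V @ h_i                    # value: what gets weighted-averaged into c_j
print(q_j.shape, k_i.shape, v_i.shape)   # (4,) (4,) (5,)
```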
Attention for Seq2Seq Model
Query: q_:j = W_Q s_j, Key: k_:i = W_K h_i, Value: v_:i = W_V h_i.
Weights: α_:j = Softmax(Kᵀ q_:j) ∈ ℝᵐ.
Context vector: c_j = α_1j v_:1 + ⋯ + α_mj v_:m = V α_:j.
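Putting the three formulas together, a hedged NumPy sketch of one decoder step might look as follows (the helper name attention_step and every dimension are assumptions for illustration, not part of the lecture):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax of a vector z.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_step(s_j, H, W_Q, W_K, W_V):
    """Context vector c_j for one decoder state s_j, given encoder states H = [h_1, ..., h_m]."""
    q_j = W_Q @ s_j                # query
    K = W_K @ H                    # keys as columns: K = [k_:1, ..., k_:m]
    V = W_V @ H                    # values as columns: V = [v_:1, ..., v_:m]
    alpha_j = softmax(K.T @ q_j)   # alpha_:j = Softmax(K^T q_:j)
    return V @ alpha_j             # c_j = alpha_1j v_:1 + ... + alpha_mj v_:m

m, d_h, d_k, d_v = 6, 8, 4, 5
rng = np.random.default_rng(2)
H = rng.normal(size=(d_h, m))      # encoder hidden states as columns
s_j = rng.normal(size=d_h)         # current decoder state (same size as h_i here, for simplicity)
W_Q = rng.normal(size=(d_k, d_h))
W_K = rng.normal(size=(d_k, d_h))
W_V = rng.normal(size=(d_v, d_h))
c_j = attention_step(s_j, H, W_Q, W_K, W_V)
print(c_j.shape)                   # (5,)
```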

Question: How to remove RNN while keeping attention?


Attention without RNN
Attention Layer

• We study the Seq2Seq model (encoder + decoder).
• Encoder's inputs are vectors x_1, x_2, ⋯, x_m.
• Decoder's inputs are vectors x′_1, x′_2, ⋯, x′_t.

Attention Layer
• Keys and values are based on the encoder's inputs x_1, x_2, ⋯, x_m.
• Key: k_:i = W_K x_i.
• Value: v_:i = W_V x_i.
• Queries are based on the decoder's inputs x′_1, x′_2, ⋯, x′_t.
• Query: q_:j = W_Q x′_j.
Attention Layer
• Compute weights: α_:1 = Softmax(Kᵀ q_:1) ∈ ℝᵐ.
• Compute context vector: c_:1 = α_11 v_:1 + ⋯ + α_m1 v_:m = V α_:1.
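The last equality (the weighted sum of the values equals the matrix-vector product V α_:1) can be verified numerically; a tiny sketch with made-up values:

```python
import numpy as np

m, d_v = 6, 5
rng = np.random.default_rng(3)
V = rng.normal(size=(d_v, m))               # values as columns: V = [v_:1, ..., v_:m]
alpha_1 = rng.random(m)
alpha_1 /= alpha_1.sum()                    # any non-negative weights that sum to 1

explicit = sum(alpha_1[i] * V[:, i] for i in range(m))   # alpha_11 v_:1 + ... + alpha_m1 v_:m
matrix_form = V @ alpha_1                                # V alpha_:1
print(np.allclose(explicit, matrix_form))                # True
```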
Attention Layer
• Repeat for every query: α_:j = Softmax(Kᵀ q_:j) ∈ ℝᵐ and c_:j = α_1j v_:1 + ⋯ + α_mj v_:m = V α_:j, for j = 1, ⋯, t.
Attention Layer
• Output of the attention layer: C = [c_:1, c_:2, c_:3, ⋯, c_:t].
• Here, c_:j = V ⋅ Softmax(Kᵀ q_:j).
• Thus, c_:j is a function of x′_j and x_1, ⋯, x_m.
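A compact NumPy sketch of the whole layer (the function name attn, the column-stacked input convention, and all dimensions are illustrative assumptions; Softmax is applied to each column of KᵀQ separately, matching the per-query formula above):

```python
import numpy as np

def attn(X, Xp, W_Q, W_K, W_V):
    """Attention layer C = Attn(X, X'): keys/values from encoder inputs X,
    queries from decoder inputs X'; one context vector per query."""
    Q = W_Q @ Xp                          # queries, shape (d_k, t)
    K = W_K @ X                           # keys,    shape (d_k, m)
    V = W_V @ X                           # values,  shape (d_v, m)
    S = K.T @ Q                           # (m, t); column j is K^T q_:j
    S = S - S.max(axis=0, keepdims=True)
    A = np.exp(S)
    A = A / A.sum(axis=0, keepdims=True)  # column-wise Softmax: A[:, j] = alpha_:j
    return V @ A                          # C = [c_:1, ..., c_:t], with c_:j = V alpha_:j

m, t, d_in, d_k, d_v = 6, 3, 8, 4, 5
rng = np.random.default_rng(4)
X = rng.normal(size=(d_in, m))            # encoder inputs as columns
Xp = rng.normal(size=(d_in, t))           # decoder inputs as columns
W_Q = rng.normal(size=(d_k, d_in))
W_K = rng.normal(size=(d_k, d_in))
W_V = rng.normal(size=(d_v, d_in))
C = attn(X, Xp, W_Q, W_K, W_V)
print(C.shape)                            # (5, 3): one context vector per decoder input
```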
Attention Layer for Machine Translation
• Translate English to German.
• Use c_:2 to generate the 3rd German word: feed c_:2 into a softmax classifier, which outputs a probability distribution p_2 over the German vocabulary.
• The generated word is then used as the next decoder input x′_3.
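A hedged sketch of this generation step (the classifier weights W_out, the toy vocabulary size, and the random stand-in for c_:2 are placeholders; the slides only specify that c_:2 is fed to a softmax classifier that outputs p_2):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

d_c, vocab_size = 5, 10                      # context dimension and a toy German vocabulary size
rng = np.random.default_rng(5)
c_2 = rng.normal(size=d_c)                   # stand-in for the context vector c_:2
W_out = rng.normal(size=(vocab_size, d_c))   # classifier weights (learned in practice)

p_2 = softmax(W_out @ c_2)                   # probability distribution over the vocabulary
next_word_id = int(np.argmax(p_2))           # e.g. greedily pick the most likely 3rd word
print(round(p_2.sum(), 6), next_word_id)     # 1.0 and the index of the predicted word
```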
Attention Layer
• Attention layer: C = Attn(X, X′).
• Encoder's inputs: X = [x_1, x_2, ⋯, x_m].
• Decoder's inputs: X′ = [x′_1, x′_2, ⋯, x′_t].
• Parameters: W_Q, W_K, W_V.

Self-Attention without RNN
Self-Attention Layer
• Self-attention layer: C = Attn(X, X).
• Inputs: X = [x_1, x_2, ⋯, x_m].
• Parameters: W_Q, W_K, W_V.
Self-Attention Layer
Query: q_:i = W_Q x_i, Key: k_:i = W_K x_i, Value: v_:i = W_V x_i.
Weights: α_:j = Softmax(Kᵀ q_:j) ∈ ℝᵐ, for j = 1, ⋯, m.
Context vector: c_:j = α_1j v_:1 + ⋯ + α_mj v_:m = V α_:j, for j = 1, ⋯, m.
Self-Attention Layer
• Here, c_:j = V ⋅ Softmax(Kᵀ q_:j).
• Thus, c_:j is a function of all the m vectors x_1, ⋯, x_m.
• Output of the self-attention layer: C = [c_:1, c_:2, c_:3, ⋯, c_:m].
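Self-attention is the same computation with X supplying the queries as well as the keys and values; a minimal self-contained sketch (dimensions and random inputs are illustrative assumptions):

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Self-attention layer C = Attn(X, X): queries, keys, and values all come from X."""
    Q, K, V = W_Q @ X, W_K @ X, W_V @ X
    S = K.T @ Q                            # (m, m) score matrix
    S = S - S.max(axis=0, keepdims=True)
    A = np.exp(S)
    A = A / A.sum(axis=0, keepdims=True)   # A[:, j] = alpha_:j = Softmax(K^T q_:j)
    return V @ A                           # C = [c_:1, ..., c_:m]

m, d_in, d_k, d_v = 6, 8, 4, 5
rng = np.random.default_rng(6)
X = rng.normal(size=(d_in, m))
W_Q = rng.normal(size=(d_k, d_in))
W_K = rng.normal(size=(d_k, d_in))
W_V = rng.normal(size=(d_v, d_in))
C = self_attention(X, W_Q, W_K, W_V)
print(C.shape)                             # (5, 6): one context vector per input x_i
```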
Summary

• Attention was originally developed for Seq2Seq RNN models [1].


• Self-attention: attention for any RNN model (not necessarily a Seq2Seq model) [2].
• Attention can be used without RNNs [3].
• We learned how to build the attention layer and the self-attention layer.

References:

1. Bahdanau, Cho, & Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR, 2015.
2. Cheng, Dong, & Lapata. Long Short-Term Memory-Networks for Machine Reading. In EMNLP, 2016.
3. Vaswani et al. Attention Is All You Need. In NIPS, 2017.
Attention Layer
• Attention layer: C = Attn(X, X′).
• Query: q_:j = W_Q x′_j,
• Key: k_:i = W_K x_i,
• Value: v_:i = W_V x_i.
• Output: c_:j = V ⋅ Softmax(Kᵀ q_:j).
Self-Attention Layer
• Attention layer: C = Attn(X, X′).
• Self-attention layer: C = Attn(X, X).
Thank you!
