Shusen Wang
Transformer Model
Attention for Seq2Seq Model
Weights: α_{ij} = align(h_i, s_j).
[Figure: seq2seq model with attention. The encoder RNN A reads inputs x_1, x_2, ⋯, x_m and produces states h_1, h_2, ⋯, h_m; the decoder RNN A′ reads inputs x′_1, x′_2, ⋯ and produces states s_0, s_1, ⋯, s_j; each decoder step uses a context vector c_0, c_1, c_2, ⋯.]
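The slides use align without restating its form; a minimal sketch, assuming the additive scoring function of Bahdanau et al. (reference 1 at the end), with illustrative parameter names W and v that are not from the slides:

```python
import numpy as np

def align(h_i, s_j, W, v):
    # Additive (Bahdanau-style) score: v^T tanh(W [h_i; s_j]).
    # W and v are learned parameters; their names here are illustrative.
    return v @ np.tanh(W @ np.concatenate([h_i, s_j]))

def attention_weights(H, s_j, W, v):
    # alpha_ij = align(h_i, s_j) for i = 1..m, normalized with softmax.
    # H is a sequence of encoder states h_1..h_m, e.g. an (m, d_h) array.
    scores = np.array([align(h_i, s_j, W, v) for h_i in H])
    scores -= scores.max()                  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()          # alpha_{:j}, sums to 1 over the m encoder states
```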
Attention for Seq2Seq Model
Query: q_{:j} = W_Q s_j,  Key: k_{:i} = W_K h_i,  Value: v_{:i} = W_V h_i.
Weights: α_{:j} = Softmax(Kᵀ q_{:j}) ∈ ℝ^m, where K = [k_{:1}, ⋯, k_{:m}].
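A minimal NumPy sketch of one decoder step under these definitions; the column-stacking of K and the matrix shapes are assumptions, only the formulas q_{:j} = W_Q s_j, k_{:i} = W_K h_i and α_{:j} = Softmax(Kᵀ q_{:j}) are from the slides:

```python
import numpy as np

def seq2seq_attention_weights(H, s_j, W_Q, W_K):
    # H: encoder states h_1..h_m stacked as columns, shape (d_h, m).
    # s_j: current decoder state, shape (d_s,).
    q_j = W_Q @ s_j                 # query  q_{:j} = W_Q s_j
    K = W_K @ H                     # keys   k_{:i} = W_K h_i, one column per i
    scores = K.T @ q_j              # K^T q_{:j}, shape (m,)
    scores -= scores.max()          # numerical stability
    alpha_j = np.exp(scores)
    return alpha_j / alpha_j.sum()  # alpha_{:j} in R^m, one weight per encoder state
```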
Attention Layer
• Keys and values are based on the encoder’s inputs x_1, x_2, ⋯, x_m.
• Key: k_{:i} = W_K x_i.
• Value: v_{:i} = W_V x_i.
• Queries are based on the decoder’s inputs: q_{:j} = W_Q x′_j.
Attention Layer
• Compute weights: α_{:1} = Softmax(Kᵀ q_{:1}) ∈ ℝ^m.
• Compute context vector: c_{:1} = α_{11} v_{:1} + ⋯ + α_{m1} v_{:m} = V α_{:1}.
• Compute weights: α_{:2} = Softmax(Kᵀ q_{:2}) ∈ ℝ^m.
• Compute context vector: c_{:2} = α_{12} v_{:1} + ⋯ + α_{m2} v_{:m} = V α_{:2}.
• Compute context vector: c_{:j} = α_{1j} v_{:1} + ⋯ + α_{mj} v_{:m} = V α_{:j}.
• Output of attention layer: C = [c_{:1}, c_{:2}, c_{:3}, ⋯, c_{:t}].
• Here, c_{:j} = V · Softmax(Kᵀ q_{:j}).
• Thus, c_{:j} is a function of x′_j and x_1, ⋯, x_m.
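Putting the steps above together, a sketch of the whole layer in NumPy; vectors are stored as columns to match the slides' notation, and the dimensions are assumptions:

```python
import numpy as np

def softmax_columns(Z):
    # Softmax applied to each column (over i = 1..m).
    Z = Z - Z.max(axis=0, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=0, keepdims=True)

def attn(X, X_prime, W_Q, W_K, W_V):
    # Attention layer C = Attn(X, X').
    # X:       encoder inputs x_1..x_m as columns, shape (d, m).
    # X_prime: decoder inputs x'_1..x'_t as columns, shape (d, t).
    Q = W_Q @ X_prime               # queries q_{:j} = W_Q x'_j
    K = W_K @ X                     # keys    k_{:i} = W_K x_i
    V = W_V @ X                     # values  v_{:i} = W_V x_i
    A = softmax_columns(K.T @ Q)    # column j is alpha_{:j} = Softmax(K^T q_{:j})
    return V @ A                    # column j is c_{:j} = V alpha_{:j}; C = [c_{:1}, ..., c_{:t}]
```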
Attention Layer for Machine Translation
• Translate English to German.
• Use c_{:2} to generate the 3rd German word: a Softmax classifier maps c_{:2} to p_2.
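The slides only name a Softmax classifier that turns c_{:2} into p_2; a minimal sketch, assuming a single hypothetical output matrix W_out over the German vocabulary:

```python
import numpy as np

def next_word_distribution(c_2, W_out):
    # Hypothetical one-layer classifier: p_2 = Softmax(W_out c_{:2}).
    # W_out has one row per German word (an assumption, not stated on the slides).
    z = W_out @ c_2
    z -= z.max()
    p_2 = np.exp(z) / np.exp(z).sum()
    return p_2        # the 3rd German word can be taken as argmax(p_2) or sampled
```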
Attention Layer
• Attention layer: C = Attn(X, X′).
• Encoder’s inputs: X = [x_1, x_2, ⋯, x_m].
• Decoder’s inputs: X′ = [x′_1, x′_2, ⋯, x′_t].
• Parameters: W_Q, W_K, W_V.
Self-Attention without RNN
Self-Attention Layer
• Self-attention layer: C = Attn(X, X).
• Inputs: X = [x_1, x_2, ⋯, x_m].
• Parameters: W_Q, W_K, W_V.
Self-Attention Layer
Query: q_{:i} = W_Q x_i,  Key: k_{:i} = W_K x_i,  Value: v_{:i} = W_V x_i.
Weights: α_{:j} = Softmax(Kᵀ q_{:j}) ∈ ℝ^m.
Context vector: c_{:1} = α_{11} v_{:1} + ⋯ + α_{m1} v_{:m} = V α_{:1}.
Context vector: c_{:2} = α_{12} v_{:1} + ⋯ + α_{m2} v_{:m} = V α_{:2}.
Context vector: c_{:j} = α_{1j} v_{:1} + ⋯ + α_{mj} v_{:m} = V α_{:j}.
• Here, c_{:j} = V · Softmax(Kᵀ q_{:j}).
• Thus, c_{:j} is a function of all the m vectors x_1, ⋯, x_m.
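A sketch of the self-attention layer under the same column-vector conventions; only the formulas come from the slides, the shapes are assumptions:

```python
import numpy as np

def self_attn(X, W_Q, W_K, W_V):
    # Self-attention layer C = Attn(X, X): queries, keys and values all come from X.
    # X: inputs x_1..x_m stacked as columns, shape (d, m).
    Q = W_Q @ X                                   # q_{:i} = W_Q x_i
    K = W_K @ X                                   # k_{:i} = W_K x_i
    V = W_V @ X                                   # v_{:i} = W_V x_i
    S = K.T @ Q                                   # entry (i, j) is k_{:i}^T q_{:j}
    S -= S.max(axis=0, keepdims=True)             # numerical stability
    A = np.exp(S) / np.exp(S).sum(axis=0, keepdims=True)  # column j is alpha_{:j}
    return V @ A                                  # column j is c_{:j} = V alpha_{:j}
```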
Summary
References:
1. Bahdanau, Cho, & Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR, 2015.
2. Cheng, Dong, & Lapata. Long Short-Term Memory-Networks for Machine Reading. In EMNLP, 2016.
3. Vaswani et al. Attention Is All You Need. In NIPS, 2017.
Attention Layer
• Attention layer: C = Attn(X, X′).
• Query: q_{:j} = W_Q x′_j,
• Key: k_{:i} = W_K x_i,
• Value: v_{:i} = W_V x_i.
• Output: c_{:j} = V · Softmax(Kᵀ q_{:j}).
Self-Attention Layer
• Attention layer: C = Attn(X, X′).
• Self-attention layer: C = Attn(X, X).
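To make the contrast concrete, a small usage example with made-up dimensions, assuming the attn and self_attn sketches from the earlier sections are in scope:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_k, d_v, m, t = 8, 4, 4, 6, 3                    # made-up dimensions
W_Q = rng.normal(size=(d_k, d))
W_K = rng.normal(size=(d_k, d))
W_V = rng.normal(size=(d_v, d))
X = rng.normal(size=(d, m))                          # encoder inputs x_1..x_m
X_prime = rng.normal(size=(d, t))                    # decoder inputs x'_1..x'_t

C_attn = attn(X, X_prime, W_Q, W_K, W_V)             # C = Attn(X, X'), shape (d_v, t)
C_self = self_attn(X, W_Q, W_K, W_V)                 # C = Attn(X, X),  shape (d_v, m)
```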
Thank you!