Transformer (v5)
Transformer
Hung-yi Lee
Transformer
Seq2seq model with “Self-attention”
Sequence
[Figure: an RNN reads a¹, a², a³, a⁴ from the previous layer and emits b¹, b², b³, b⁴ to the next layer, one step at a time.]
Hard to parallelize!

Using CNN to replace RNN
[Figure: 1-D convolution filters slide over a¹ … a⁴ to produce b¹ … b⁴; filters in higher layers can consider a longer span of the sequence.]
Unlike an RNN, a CNN can be parallelized.
Self-Attention
Each bⁱ is obtained based on the whole input sequence, and b¹, b², b³, b⁴ can be computed in parallel.
[Figure: a Self-Attention Layer maps a¹ … a⁴ to b¹ … b⁴.]
You can try to replace anything that has been done by an RNN with self-attention.
Self-attention ("Attention Is All You Need", https://arxiv.org/abs/1706.03762)
Each input xⁱ is first embedded as aⁱ = W xⁱ. From each aⁱ, three vectors are computed:
- query (to match others): qⁱ = W^q aⁱ
- key (to be matched): kⁱ = W^k aⁱ
- value (information to be extracted): vⁱ = W^v aⁱ
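The three projections above can be sketched in a few lines of NumPy. The dimension d = 4 and the random weight matrices are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                 # embedding / key dimension (illustrative)
x = rng.standard_normal((4, d))       # x^1 ... x^4 as rows
W  = rng.standard_normal((d, d))      # input embedding
Wq = rng.standard_normal((d, d))      # query projection W^q
Wk = rng.standard_normal((d, d))      # key projection W^k
Wv = rng.standard_normal((d, d))      # value projection W^v

a = x @ W.T      # a^i = W x^i
q = a @ Wq.T     # q^i = W^q a^i  (query: to match others)
k = a @ Wk.T     # k^i = W^k a^i  (key: to be matched)
v = a @ Wv.T     # v^i = W^v a^i  (value: information to be extracted)
print(q.shape, k.shape, v.shape)
```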
Self-attention
Each query q is matched against every key k.
Scaled Dot-Product Attention: α₁,ᵢ = q¹ · kⁱ / √d (a dot product), where d is the dimension of q and k.
Self-attention
The scores then pass through a soft-max:
α̂₁,ᵢ = exp(α₁,ᵢ) / Σⱼ exp(α₁,ⱼ)
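The score-and-normalize step for one query can be sketched as follows; the vectors are random stand-ins and d = 8 is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
q1 = rng.standard_normal(d)         # query of position 1
k  = rng.standard_normal((4, d))    # k^1 ... k^4 as rows

alpha = k @ q1 / np.sqrt(d)                      # alpha_{1,i} = q^1 . k^i / sqrt(d)
alpha_hat = np.exp(alpha) / np.exp(alpha).sum()  # soft-max over the scores
print(alpha_hat)
```

The √d scaling keeps the dot products from growing with the dimension, so the soft-max does not saturate.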
Self-attention
b¹ = Σᵢ α̂₁,ᵢ vⁱ
Because every vⁱ contributes, b¹ is computed considering the whole sequence.
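The weighted sum that produces b¹ is one more line on top of the scores; again the vectors are random stand-ins for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
q1 = rng.standard_normal(d)        # query of position 1
k = rng.standard_normal((4, d))    # keys k^1 ... k^4 as rows
v = rng.standard_normal((4, d))    # values v^1 ... v^4 as rows

alpha_hat = np.exp(k @ q1 / np.sqrt(d))
alpha_hat /= alpha_hat.sum()       # soft-maxed attention weights

b1 = alpha_hat @ v                 # b^1 = sum_i alpha_hat_{1,i} v^i
print(b1.shape)
```

Every value vⁱ enters the sum, which is exactly why b¹ depends on the whole input sequence.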
Self-attention
Matching each query q against every key k in the same way gives the other outputs, e.g.
b² = Σᵢ α̂₂,ᵢ vⁱ
Self-attention
b¹, b², b³, b⁴ can be computed in parallel.
[Figure: a Self-Attention Layer maps x¹ … x⁴ (embedded as a¹ … a⁴) to b¹ … b⁴.]
Self-attention (matrix form)
Stack a¹ … a⁴ as the columns of a matrix I. Then all the queries, keys, and values come from three matrix products:
Q = [q¹ q² q³ q⁴] = W^q I
K = [k¹ k² k³ k⁴] = W^k I
V = [v¹ v² v³ v⁴] = W^v I
Self-attention
(ignoring the 1/√d factor for simplicity)
All the scores for query q¹ come from one matrix-vector product: [α₁,₁ α₁,₂ α₁,₃ α₁,₄]ᵀ = Kᵀ q¹, since each row of Kᵀ is some (kⁱ)ᵀ.
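That the matrix-vector product gives the same scores as attending to each key in a loop can be checked directly; K and q¹ are random stand-ins here.

```python
import numpy as np

rng = np.random.default_rng(5)
d, N = 8, 4
K = rng.standard_normal((d, N))   # columns are k^1 ... k^4
q1 = rng.standard_normal(d)

scores = K.T @ q1                 # all alpha_{1,i} = (k^i)^T q^1 in one product
loop = np.array([K[:, i] @ q1 for i in range(N)])
print(np.allclose(scores, loop))  # True
```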
Self-attention
Doing this for every query at once:
A = Kᵀ Q, where entry (i, j) of A is αⱼ,ᵢ = (kⁱ)ᵀ qʲ
Â = softmax(A), taken over each column
Self-attention
b² = Σᵢ α̂₂,ᵢ vⁱ, and in general every output is a weighted sum of the columns of V:
O = [b¹ b² b³ b⁴] = V Â
Self-attention (summary)
Q = W^q I   K = W^k I   V = W^v I
A = Kᵀ Q   Â = softmax(A)
O = V Â
In the end it is just a pile of matrix multiplications, which a GPU can accelerate.
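The whole layer reduces to those few matrix products. A NumPy sketch, with random weights, illustrative sizes d and N, and the 1/√d scaling (which the matrix-form slides drop for simplicity) reinstated:

```python
import numpy as np

def self_attention(I, Wq, Wk, Wv):
    """Matrix form: columns of I are the inputs a^1 ... a^N."""
    Q, K, V = Wq @ I, Wk @ I, Wv @ I           # Q = W^q I, K = W^k I, V = W^v I
    A = K.T @ Q / np.sqrt(K.shape[0])          # A = K^T Q, scaled by 1/sqrt(d)
    A = A - A.max(axis=0)                      # stabilize the soft-max
    A_hat = np.exp(A) / np.exp(A).sum(axis=0)  # soft-max over each column
    return V @ A_hat                           # O = V A_hat

rng = np.random.default_rng(3)
d, N = 8, 4                                    # illustrative sizes
I = rng.standard_normal((d, N))
O = self_attention(I, *(rng.standard_normal((d, d)) for _ in range(3)))
print(O.shape)                                 # one output column b^i per input a^i
```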
Multi-head Self-attention (2 heads as example)
Each qⁱ is split into one query per head:
qⁱ,¹ = W^{q,1} qⁱ   qⁱ,² = W^{q,2} qⁱ
and likewise kⁱ,¹, kⁱ,², vⁱ,¹, vⁱ,² (with qⁱ = W^q aⁱ as before, and the same split at every other position j).
Head 1 matches qⁱ,¹ only against the head-1 keys kʲ,¹ and mixes the head-1 values vʲ,¹, producing bⁱ,¹; head 2 likewise produces bⁱ,².
Multi-head Self-attention (2 heads as example)
The per-head outputs are concatenated and projected back down:
bⁱ = W^O [bⁱ,¹ ; bⁱ,²]
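A two-head sketch of this split-attend-concatenate scheme. Rows are positions here (the transpose of the slides' column layout), and all weights are random illustrative stand-ins.

```python
import numpy as np

def attend(q, k, v):
    """One head: softmax(q k^T / sqrt(d_head)) applied to the values."""
    a = q @ k.T / np.sqrt(q.shape[-1])
    a = np.exp(a - a.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)
    return a @ v

rng = np.random.default_rng(4)
N, d, heads = 4, 8, 2
q, k, v = (rng.standard_normal((N, d)) for _ in range(3))  # q^i, k^i, v^i as rows

outs = []
for h in range(heads):
    # per-head projections q^{i,h} = W^{q,h} q^i etc., each to d/heads dims
    Wq, Wk, Wv = (rng.standard_normal((d // heads, d)) for _ in range(3))
    outs.append(attend(q @ Wq.T, k @ Wk.T, v @ Wv.T))      # b^{i,h}

Wo = rng.standard_normal((d, d))
b = np.concatenate(outs, axis=-1) @ Wo.T   # b^i = W^O [b^{i,1}; b^{i,2}]
print(b.shape)
```

Projecting each head down to d/heads dimensions keeps the total cost comparable to a single full-width head.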
Positional Encoding
• Self-attention has no position information: a¹ … a⁴ could be shuffled without changing the scores.
• Original paper: each position i has a unique positional vector eⁱ (hand-crafted, not learned from data), added to the embedding: aⁱ + eⁱ.
• In other words: appending a one-hot position vector pⁱ (1 in the i-th dim, 0 elsewhere) to each xⁱ gives the same thing, since
  W [xⁱ ; pⁱ] = [W^I W^P] [xⁱ ; pⁱ] = W^I xⁱ + W^P pⁱ = aⁱ + eⁱ
source of image: http://jalammar.github.io/illustrated-transformer/
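The hand-crafted eⁱ in the original paper are sinusoids at geometrically spaced frequencies, which is what the visualization linked above shows (values in [-1, 1]). A sketch of that formula; the sequence length and model dimension below are illustrative choices.

```python
import numpy as np

def positional_encoding(n_pos, d_model):
    """e^i from "Attention Is All You Need": sin on even dims, cos on odd dims."""
    pos = np.arange(n_pos)[:, None]            # positions i
    dim = np.arange(0, d_model, 2)[None, :]    # even dimension indices
    angles = pos / 10000 ** (dim / d_model)    # geometrically spaced frequencies
    e = np.zeros((n_pos, d_model))
    e[:, 0::2] = np.sin(angles)
    e[:, 1::2] = np.cos(angles)
    return e                                   # added to the embeddings: a^i + e^i

e = positional_encoding(50, 16)
print(e.shape, e.min(), e.max())               # all values lie in [-1, 1]
```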
Review of seq2seq with attention: https://www.youtube.com/watch?v=ZjfjPzXw6og&feature=youtu.be
[Figure: an encoder turns x¹ … x⁴ into hidden states h¹ … h⁴; a decoder emits o¹, o², o³, each conditioned on an attention-weighted context vector c¹, c², c³.]
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
[Figure: the Transformer as an encoder-decoder. The encoder reads "機 器 學 習" ("machine learning" in Chinese); the decoder, fed <BOS> and then "machine", outputs "machine learning".]
Transformer
[Figure: inside each block, the residual sum b = a + (sub-layer output) is followed by Layer Norm to give b′; the decoder's second attention sub-layer attends on the encoder's input sequence.]
• Layer Norm (https://arxiv.org/abs/1607.06450): normalizes each single sample across its features to μ = 0, σ = 1, independent of the batch size.
• Batch Norm (https://www.youtube.com/watch?v=BZh1ltr5Rkg): normalizes each feature across the batch.
https://arxiv.org/abs/1706.03762
Attention Visualization
The encoder self-attention distribution for the word "it" from the 5th to the 6th layer of a Transformer trained on English-to-French translation (one of eight attention heads).
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
[Figure: two multi-head attention patterns, with "it" attending to different words in each.]
Example Application
• If you can use seq2seq, you can use the Transformer.
[Figure: a summarizer that reads a whole document set and generates a summary.]
https://arxiv.org/abs/1801.10198
Universal Transformer
https://ai.googleblog.com/2018/08/moving-beyond-translation-with.html
Self-Attention GAN
https://arxiv.org/abs/1805.08318