
BERT

Transformer
李宏毅
Hung-yi Lee
Transformer
Seq2seq model with "Self-attention"

[Figure: an RNN layer maps the sequence a^1, a^2, a^3, a^4 from the previous layer to b^1, b^2, b^3, b^4 in the next layer]
Hard to parallelize!

Using CNN to replace RNN
RNN is hard to parallelize; CNN can be parallelized, and filters in higher layers can consider a longer part of the sequence.
[Figure: stacked 1-D convolution layers over a^1 … a^4 produce b^1 … b^4; each higher layer covers a wider span of the input]
Self-Attention
b^1, b^2, b^3, b^4 can be computed in parallel, and each b^i is obtained based on the whole input sequence.

[Figure: a Self-Attention Layer maps a^1, a^2, a^3, a^4 to b^1, b^2, b^3, b^4]

You can try to replace anything that has been done by RNN with self-attention.
Self-attention ("Attention Is All You Need", https://arxiv.org/abs/1706.03762)
Each input x^i is first embedded as a^i = W x^i, and then produces three vectors:
  q^i = W^q a^i   (query: to match others)
  k^i = W^k a^i   (key: to be matched)
  v^i = W^v a^i   (value: the information to be extracted)
[Figure: each of a^1 … a^4 has its own q^i, k^i, v^i]
Self-attention
Take each query q and compute attention against every key k.
Scaled Dot-Product Attention: α_{1,i} = q^1 · k^i / √d, where d is the dimension of q and k.
[Figure: the dot product of q^1 with k^1, k^2, k^3, k^4 gives α_{1,1}, α_{1,2}, α_{1,3}, α_{1,4}]
Self-attention
Apply softmax: α̂_{1,i} = exp(α_{1,i}) / Σ_j exp(α_{1,j})
[Figure: α_{1,1} … α_{1,4} pass through a softmax to give α̂_{1,1} … α̂_{1,4}]
Self-attention
b^1 = Σ_i α̂_{1,i} v^i, so b^1 considers the whole sequence.
[Figure: b^1 is the sum of v^1 … v^4 weighted by α̂_{1,1} … α̂_{1,4}]
Self-attention
Likewise, take query q^2 and attend over every key: b^2 = Σ_i α̂_{2,i} v^i
[Figure: b^2 is the sum of v^1 … v^4 weighted by α̂_{2,1} … α̂_{2,4}]
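A minimal NumPy sketch of the per-query steps above (q/k/v projections, scaled dot products, softmax, weighted sum); the sizes, random weights, and softmax helper are illustrative placeholders, not part of the lecture:

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d, N = 4, 4                       # toy feature dimension and sequence length
a = rng.standard_normal((N, d))   # a^1 ... a^4, one row per position

W_q = rng.standard_normal((d, d))
W_k = rng.standard_normal((d, d))
W_v = rng.standard_normal((d, d))

q = a @ W_q.T                     # q^i = W^q a^i
k = a @ W_k.T                     # k^i = W^k a^i
v = a @ W_v.T                     # v^i = W^v a^i

# b^1: query q^1 attends to every key k^i
alpha_1 = q[0] @ k.T / np.sqrt(d)      # alpha_{1,i} = q^1 . k^i / sqrt(d)
alpha_1_hat = softmax(alpha_1)         # normalized attention weights
b_1 = alpha_1_hat @ v                  # b^1 = sum_i alpha-hat_{1,i} v^i

# b^2 is the same computation with q^2, and so on for every position
b_2 = softmax(q[1] @ k.T / np.sqrt(d)) @ v
```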
Self-attention
b^1, b^2, b^3, b^4 can be computed in parallel.
[Figure: the Self-Attention Layer maps x^1 … x^4 (embedded as a^1 … a^4) to b^1 … b^4]
Self-attention
Stack a^1, a^2, a^3, a^4 as the columns of a matrix I. Then
  q^i = W^q a^i   →   Q = [q^1 q^2 q^3 q^4] = W^q I
  k^i = W^k a^i   →   K = [k^1 k^2 k^3 k^4] = W^k I
  v^i = W^v a^i   →   V = [v^1 v^2 v^3 v^4] = W^v I
Self-attention
Stacking k^1, k^2, k^3, k^4 as rows, all of q^1's attention scores come from one matrix-vector product:
  [α_{1,1}; α_{1,2}; α_{1,3}; α_{1,4}] = [k^1; k^2; k^3; k^4] q^1
(the scaling by √d is ignored here for simplicity)
Self-attention
Doing this for every query at once, with Q = [q^1 q^2 q^3 q^4] and K = [k^1 k^2 k^3 k^4]:
  A = K^T Q,   where the entry in row i, column j is α_{j,i} = k^i · q^j
The softmax is applied to each column of A to give Â.
Self-attention
Finally, the outputs are another matrix product:
  [b^1 b^2 b^3 b^4] = O = V Â,   i.e. b^j = Σ_i α̂_{j,i} v^i with V = [v^1 v^2 v^3 v^4]
Self-attention (summary)
  Q = W^q I,  K = W^k I,  V = W^v I
  A = K^T Q,  Â = softmax(A)  (column-wise)
  O = V Â
In the end it is just a pile of matrix multiplications, which a GPU can accelerate.
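A sketch of the same pipeline in matrix form, following the slides' convention of stacking the inputs as columns of I (toy sizes and random weights; the √d scaling, which the slides drop for simplicity, is kept here):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 4, 4
I = rng.standard_normal((d, N))   # columns are a^1 ... a^4

W_q = rng.standard_normal((d, d))
W_k = rng.standard_normal((d, d))
W_v = rng.standard_normal((d, d))

Q = W_q @ I                       # Q = W^q I
K = W_k @ I                       # K = W^k I
V = W_v @ I                       # V = W^v I

A = K.T @ Q / np.sqrt(d)          # A[i, j] = k^i . q^j / sqrt(d)
A_hat = np.exp(A - A.max(axis=0, keepdims=True))
A_hat = A_hat / A_hat.sum(axis=0, keepdims=True)   # softmax over each column

O = V @ A_hat                     # column j of O is b^j
```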
Multi-head Self-attention (2 heads as example)
Each q^i, k^i, v^i is split into one copy per head:
  q^{i,1} = W^{q,1} q^i,   q^{i,2} = W^{q,2} q^i   (and likewise k^{i,1}, k^{i,2} and v^{i,1}, v^{i,2})
Head 1 does self-attention using only the head-1 quantities: q^{i,1} matches the k^{j,1} and weights the v^{j,1}, giving b^{i,1}.
Head 2 does the same with q^{i,2}, k^{j,2}, v^{j,2}, giving b^{i,2}.
The head outputs are then concatenated and projected back to the original dimension:
  b^i = W^O [b^{i,1}; b^{i,2}]
(with q^i = W^q a^i as before)
[Figure: positions a^i and a^j each produce per-head queries, keys, and values; each head computes its own output b^{i,1}, b^{i,2}, which W^O combines into b^i]
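A sketch of the two-head case under the same toy NumPy setup as earlier. For brevity the head projections here act directly on the columns of I rather than on q^i, k^i, v^i first, which is a common equivalent formulation; all sizes and weights are illustrative:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention in the slides' column convention: O = V * softmax(K^T Q)."""
    d = Q.shape[0]
    A = K.T @ Q / np.sqrt(d)
    A = np.exp(A - A.max(axis=0, keepdims=True))
    A_hat = A / A.sum(axis=0, keepdims=True)      # column-wise softmax
    return V @ A_hat

rng = np.random.default_rng(0)
d, N, heads = 4, 4, 2
d_head = d // heads
I = rng.standard_normal((d, N))                   # columns are a^1 ... a^4

outputs = []
for h in range(heads):
    W_q = rng.standard_normal((d_head, d))        # W^{q,h}
    W_k = rng.standard_normal((d_head, d))        # W^{k,h}
    W_v = rng.standard_normal((d_head, d))        # W^{v,h}
    outputs.append(attention(W_q @ I, W_k @ I, W_v @ I))   # b^{i,h} for all positions i

W_o = rng.standard_normal((d, heads * d_head))    # W^O
B = W_o @ np.concatenate(outputs, axis=0)         # column i is b^i = W^O [b^{i,1}; b^{i,2}]
```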
Positional Encoding
• Self-attention has no position information, so a positional vector e^i is added to each a^i.
• Original paper: each position has a unique positional vector e^i (hand-crafted, not learned from data).
• In other words: each x^i appends a one-hot position vector p^i (1 in the i-th dimension, 0 elsewhere). Then
  W [x^i ; p^i] = [W^I  W^P] [x^i ; p^i] = W^I x^i + W^P p^i = a^i + e^i,
  so adding e^i to a^i is equivalent to concatenating p^i to x^i.
[Figure: visualization of the positional weight matrix W^P, with values between -1 and 1]
source of image: http://jalammar.github.io/illustrated-transformer/
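The slides say the e^i are fixed rather than learned; the original paper's particular choice is sinusoids of different frequencies. Below is a small sketch of that table added to toy embeddings (sizes and data are placeholders):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional vectors e^i from 'Attention Is All You Need'."""
    pos = np.arange(max_len)[:, None]            # positions i
    dim = np.arange(0, d_model, 2)[None, :]      # even feature indices
    angle = pos / (10000 ** (dim / d_model))
    e = np.zeros((max_len, d_model))
    e[:, 0::2] = np.sin(angle)                   # sine on even dims
    e[:, 1::2] = np.cos(angle)                   # cosine on odd dims
    return e

# a^i plus its positional vector e^i (toy embeddings, illustrative only)
d_model, max_len = 8, 10
a = np.random.default_rng(0).standard_normal((max_len, d_model))
x = a + positional_encoding(max_len, d_model)
```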
Review: https://www.youtube.com/watch?v=ZjfjPzXw6og&feature=youtu.be
Seq2seq with Attention
[Figure: a seq2seq-with-attention model in which both the encoder (x^1 … x^4 → h^1 … h^4) and the decoder (contexts c^1, c^2, c^3 → outputs o^1, o^2, o^3) use Self-Attention Layers in place of RNNs]
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
Transformer
Using Chinese-to-English translation as an example
[Figure: the encoder reads the Chinese input 機器學習 ("machine learning"); the decoder is fed <BOS> and begins generating "machine ..."]
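As a rough illustration of this encoder-decoder wiring (not the lecture's own code), here is a sketch using PyTorch's nn.Transformer; the sizes, the random tensors standing in for token embeddings, and the batch layout are placeholders:

```python
import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

# toy batch: source length 10 (e.g. the Chinese characters), target length 20, batch size 32
src = torch.rand(10, 32, 512)   # encoder input embeddings (+ positional encoding)
tgt = torch.rand(20, 32, 512)   # decoder input embeddings, starting from <BOS>

# causal mask so the decoder only attends to already generated tokens
tgt_mask = model.generate_square_subsequent_mask(20)

out = model(src, tgt, tgt_mask=tgt_mask)   # shape (20, 32, 512)
```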
Transformer
[Figure: one Transformer block; the self-attention output b is added back to its input a (residual connection) and passed through Layer Norm to give b′]
Layer Norm: https://arxiv.org/abs/1607.06450
Batch Norm: https://www.youtube.com/watch?v=BZh1ltr5Rkg
Batch Norm normalizes each feature to μ=0, σ=1 across a batch; Layer Norm normalizes all the features of a single example, so it does not depend on the batch size.
In the decoder, the encoder-decoder attention attends over the input sequence, while the Masked self-attention attends only over the already generated sequence.
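A toy sketch of the contrast drawn above, plus the residual-then-LayerNorm step from the block diagram; shapes and the eps value are illustrative, and the learned scale/bias of real implementations is omitted:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row (one example) to mean 0, variance 1 across its features."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def batch_norm(x, eps=1e-5):
    """Normalize each feature (column) to mean 0, variance 1 across the batch."""
    mu = x.mean(axis=0, keepdims=True)
    sigma = x.std(axis=0, keepdims=True)
    return (x - mu) / (sigma + eps)

# residual connection followed by layer norm, as in the block diagram: b' = LayerNorm(a + b)
a = np.random.default_rng(0).standard_normal((32, 512))   # block input
b = np.random.default_rng(1).standard_normal((32, 512))   # self-attention output
b_prime = layer_norm(a + b)
```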
Attention Visualization

https://arxiv.org/abs/1706.03762
Attention Visualization

The encoder self-attention distribution for the word “it” from the 5th to the 6th layer of a
Transformer trained on English to French translation (one of eight attention heads).
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
Multi-head Attention
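Visualizations like these are just heatmaps of an attention weight matrix Â. A sketch of how one might plot such a matrix (the tokens and weights below are random placeholders, not the figures from the paper):

```python
import matplotlib.pyplot as plt
import numpy as np

tokens = ["The", "animal", "didn't", "cross", "the", "street"]            # illustrative tokens
A_hat = np.random.default_rng(0).dirichlet(np.ones(len(tokens)),
                                           size=len(tokens))              # placeholder weights, rows sum to 1

fig, ax = plt.subplots()
ax.imshow(A_hat, cmap="viridis")            # rows: query words, columns: attended words
ax.set_xticks(range(len(tokens)))
ax.set_xticklabels(tokens, rotation=45)
ax.set_yticks(range(len(tokens)))
ax.set_yticklabels(tokens)
ax.set_xlabel("attended word (key)")
ax.set_ylabel("query word")
plt.tight_layout()
plt.show()
```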
Example Application
• If you can use seq2seq, you can use Transformer.
[Figure: a summarizer that takes a whole document set as input]
https://arxiv.org/abs/1801.10198
Universal Transformer

https://ai.googleblog.com/2018/08/moving-beyond-translation-with.html
Self-Attention GAN

https://arxiv.org/abs/1805.08318
