Attention-based Models:
Transformers
Dr. Dileep A. D.
Associate Professor,
Multimedia Analytics Networks And Systems (MANAS) Lab,
School of Computing and Electrical Engineering (SCEE),
Indian Institute of Technology Mandi, Kamand, H.P.
Email: addileep@iitmandi.ac.in
• Context vector: $c_t = \sum_{j=1}^{T_s} a_{jt}\, s_{E,j}$
• Context vector goes as input to the decoder at every time t: $z_t = [\,c_t^\top \;\; y_t^\top\,]^\top$
• State of the decoder at time t: $s_{D,t} = \tanh(U_D z_t + W_D s_{D,t-1} + b_D)$
• Output of the model at time t: $P(y_{t+1}) = f(V_D s_{D,t} + c_D)$
– New terminology:
• Query (query vector): the new input (test input): x
• Key (key vector): each of the training examples: xn, n = 1, ..., N
• Value: the desired output corresponding to each xn: yn
– Basically, the prediction task is performed using 3 entities: query, key and value
ct = attentionMechanism(sD,t-1, sE,j)
sD,t-1: state of the decoder at time t-1
sE,j: state of the encoder at time j
• Attention score: $e_{jt} = f_{ATT}(s_{D,t-1}, s_{E,j})$
• Attention weight: $a_{jt} = \mathrm{softmax}(e_{jt})$
• Context vector: $c_t = \sum_{j=1}^{T_s} a_{jt}\, s_{E,j}$
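The attention step above fits in a few lines of code. A minimal NumPy sketch, assuming a simple dot-product form for f_ATT (the slide leaves f_ATT generic; an additive/MLP score is another common choice) and illustrative variable names:

import numpy as np

def attention_step(s_dec_prev, encoder_states):
    """One attention step: s_dec_prev is s_{D,t-1} (shape (d,)),
    encoder_states stacks s_{E,1..Ts} as rows (shape (Ts, d))."""
    # Attention scores e_{jt} = f_ATT(s_{D,t-1}, s_{E,j}); here a dot product (assumption)
    scores = encoder_states @ s_dec_prev            # shape (Ts,)
    # Attention weights a_{jt}: softmax over the encoder positions j
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    # Context vector c_t = sum_j a_{jt} s_{E,j}
    context = weights @ encoder_states              # shape (d,)
    return context, weights

# toy usage
Ts, d = 5, 8
rng = np.random.default_rng(0)
c_t, a_t = attention_step(rng.normal(size=d), rng.normal(size=(Ts, d)))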
Transformers:
Scaled Dot-Product Attention (SDPA)
• Consider the sequence X = (x1, x2, … , xt , …, xT)
• Consider the following d-dimensional vectors:
– Query vector: q
– Key vectors: kt t=1,2,…,T
– Value vectors: vt t=1,2,…,T
• SDPA gives a weight based on the similarity measure (i.e., dot product) computed between the query vector and each of the key vectors
– The similarity measure is scaled to keep its magnitude under control
• Attention score: scaled dot product between q and kt:
  $e_t = \dfrac{\langle q, k_t\rangle}{\sqrt{d}} = \dfrac{q^\top k_t}{\sqrt{d}}$
• Attention weight: $a_t = \mathrm{softmax}(e_t) = \dfrac{\exp(e_t)}{\sum_{t'=1}^{T}\exp(e_{t'})}$
• Context vector associated with query vector q: $c = \sum_{t=1}^{T} a_t\, v_t$
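A minimal NumPy sketch of SDPA for a single query, directly following the three formulas above (scaled score, softmax weight, weighted sum of values); function and variable names are illustrative:

import numpy as np

def sdpa(q, K, V):
    """q: (d,); K, V: (T, d) with key/value vectors as rows.
    Returns the context vector c and the attention weights a."""
    d = q.shape[0]
    scores = (K @ q) / np.sqrt(d)                  # e_t = q^T k_t / sqrt(d)
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()              # a_t = softmax(e_t)
    context = weights @ V                          # c = sum_t a_t v_t
    return context, weights

# toy usage
T, d = 6, 4
rng = np.random.default_rng(1)
c, a = sdpa(rng.normal(size=d), rng.normal(size=(T, d)), rng.normal(size=(T, d)))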
Transformers for
Sequence-to-Sequence Mapping
• Given the input sequence X, map it onto the output sequence Y
• Input sequence: X = (x1, x2, …, xj, …, xTs)
– Each element in the sequence is a d-dimensional vector xj
– Ts is the length of the input sequence
• Output sequence: Y = (y1, y2, …, yt, …, yTd)
– Each element in the sequence is a d-dimensional vector yt
– Td is the length of the output sequence
• The Transformer also has an encoder-decoder framework:
– The input sequence X is passed through the encoder to get some representation Z
– This Z goes as input to the decoder, which maps it onto the output sequence Y
Transformers for
Sequence-to-Sequence Mapping
• Transformers do not use RNNs in the encoder and decoder
• Instead, a sequence of transformations is performed on X to get Z
• This representation Z goes as input to the decoder, which also performs a sequence of transformations
• Transformers try to capture and use
– Relations among elements in the input sequence (Self-Attention)
– Relations among elements in the output sequence (Self-Attention)
– Relations between elements in the input sequence and elements in the output sequence (Cross-Attention)
Self-Attention
• Self-attention captures the relations among the
elements of a sequence
– Key matrix: $K = (k_1, k_2, \ldots, k_j, \ldots, k_{T_s})$
– Value matrix: $V = (v_1, v_2, \ldots, v_j, \ldots, v_{T_s})$
• A context vector $c_j$ is obtained by applying scaled dot-product attention (SDPA) on $q_j$ w.r.t. the entries in K and V:
  $c_j = \mathrm{SDPA}(q_j, K, V), \quad j = 1, 2, \ldots, T_s$
[Figure: The query, key and value vectors are obtained from the input sequence $X = (x_1, \ldots, x_j, \ldots, x_{T_s})$ using the transformation matrices $W^{(Q)}$, $W^{(K)}$ and $W^{(V)}$.]
Transformers for
Sequence-to-Sequence Mapping
• Given the input sequence X, map it onto the output sequence Y
• Input sequence: X = (x1, x2, …, xj, …, xTs)
– Each element in the sequence is a d-dimensional vector xj
– Ts is the length of the input sequence
• Output sequence: Y = (y1, y2, …, yt, …, yTd)
– Each element in the sequence is a d-dimensional vector yt
– Td is the length of the output sequence
• The Transformer also has an encoder-decoder framework:
– The input sequence X is passed through the encoder to get some representation X^L
– This X^L goes as input to the decoder, which maps it onto the output sequence Y
Encoder: Self-Attention
• Self-attention captures the relations among the
elements of a sequence
– It gives the importance of each element in a sequence with respect to every element in that sequence
• A set of transformation matrices of size d×l, W^(Q), W^(K) and W^(V), corresponding to Query, Key and Value, is considered
– They act as parametric matrices
• Now each of the query, key and value vectors is obtained as:
– Query vector: $q_j = W^{(Q)} x_j$
– Key vector: $k_j = W^{(K)} x_j$,   j = 1, 2, ..., Ts
– Value vector: $v_j = W^{(V)} x_j$
• Dimension of the Query, Key and Value vectors: l (Note: l < d)
• These transformation matrices are learnt as part of the training process
$c_j = \mathrm{SDPA}(q_j, K, V)$
• Attention score: $e_{jm} = \dfrac{q_j^\top k_m}{\sqrt{d}}, \quad m = 1, 2, \ldots, T_s$
• Attention weight: $a_{jm} = \mathrm{softmax}(e_{jm})$
• Context vector: $c_j = \sum_{m=1}^{T_s} a_{jm}\, v_m$
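A minimal NumPy sketch of (single-head) self-attention over a whole input sequence, combining the projections q_j = W^(Q) x_j, k_j = W^(K) x_j, v_j = W^(V) x_j with SDPA at every position; the random matrices stand in for the learned parameters:

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (Ts, d) input vectors as rows; Wq, Wk, Wv: (l, d) projection matrices.
    Returns the context vectors C: (Ts, l)."""
    Q = X @ Wq.T                                   # q_j = W^(Q) x_j
    K = X @ Wk.T                                   # k_j = W^(K) x_j
    V = X @ Wv.T                                   # v_j = W^(V) x_j
    d = X.shape[1]
    scores = (Q @ K.T) / np.sqrt(d)                # e_{jm} = q_j^T k_m / sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)
    A = np.exp(scores)
    A = A / A.sum(axis=1, keepdims=True)           # a_{jm}; each row sums to 1
    return A @ V                                   # c_j = sum_m a_{jm} v_m

# toy usage
Ts, d, l = 7, 16, 8
rng = np.random.default_rng(2)
X = rng.normal(size=(Ts, d))
C = self_attention(X, *(rng.normal(size=(l, d)) for _ in range(3)))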
[Figure: Multi-head attention. Each head i (i = 1, 2, ..., h) has its own transformation matrices $W_i^{(Q)}$, $W_i^{(K)}$ and $W_i^{(V)}$, which give a query $q_{ij}$, keys $k_{i1}, \ldots, k_{iT_s}$ and values $v_{i1}, \ldots, v_{iT_s}$; each head applies attention to produce its own context vector $c_{ij}$.]
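A minimal NumPy sketch of multi-head self-attention as in the figure: each head i has its own W_i^(Q), W_i^(K), W_i^(V). The per-head context vectors are concatenated and combined with an output projection W^(O); that combination step is an assumption here (the slide does not show it explicitly), following common practice:

import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def multi_head_self_attention(X, heads, Wo):
    """X: (Ts, d); heads: list of (Wq, Wk, Wv) tuples, each (l, d);
    Wo: (d, h*l) output projection (assumed combination step). Returns (Ts, d)."""
    d = X.shape[1]
    head_outputs = []
    for Wq, Wk, Wv in heads:
        Q, K, V = X @ Wq.T, X @ Wk.T, X @ Wv.T
        A = softmax_rows((Q @ K.T) / np.sqrt(d))   # per-head attention weights
        head_outputs.append(A @ V)                 # per-head context vectors c_ij
    C = np.concatenate(head_outputs, axis=1)       # (Ts, h*l)
    return C @ Wo.T                                # combine the heads

# toy usage
Ts, d, l, h = 5, 16, 4, 4
rng = np.random.default_rng(3)
X = rng.normal(size=(Ts, d))
heads = [tuple(rng.normal(size=(l, d)) for _ in range(3)) for _ in range(h)]
Z = multi_head_self_attention(X, heads, rng.normal(size=(d, h * l)))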
[Figure: The self-attention MHA blocks, applied at every position of the input sequence, produce the sequence $z_1, z_2, \ldots, z_j, \ldots, z_{T_s}$.]
[Figure: Position-wise feed-forward neural network (PWFFNN). Each output $z_j$ of the self-attention MHA is passed through the same feed-forward network (input layer, hidden layer with weights $W_{FF}^{(h)}$, output layer with weights $W_{FF}^{(o)}$), applied identically at every position j = 1, 2, ..., Ts, to produce the outputs $x_1, x_2, \ldots, x_{T_s}$ of the encoder layer.]
• The contents of Z are nonlinearly related to the contents of X
• Thus, the output of the encoder layer (the new representation of X) is nonlinearly related to the contents of X
[Figure: One encoder layer: X → Self-Attention MHA → Z → PWFFNN → output representation.]
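A minimal NumPy sketch of one encoder layer as described on these slides: single-head self-attention (for brevity) followed by a position-wise feed-forward network with weights W_FF^(h) and W_FF^(o) applied identically at every position. The tanh activation is an assumption, and residual connections / layer normalisation are omitted here to stay close to this slide:

import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def encoder_layer(X, Wq, Wk, Wv, W_ff_h, W_ff_o):
    """X: (Ts, d). Single-head self-attention followed by a
    position-wise feed-forward network; returns (Ts, d)."""
    d = X.shape[1]
    Q, K, V = X @ Wq.T, X @ Wk.T, X @ Wv.T
    Z = softmax_rows((Q @ K.T) / np.sqrt(d)) @ V   # self-attention outputs z_j
    H = np.tanh(Z @ W_ff_h.T)                      # hidden layer, weights W_FF^(h) (tanh assumed)
    return H @ W_ff_o.T                            # output layer, weights W_FF^(o)

# toy usage
Ts, d, l, d_h = 6, 16, 16, 32
rng = np.random.default_rng(4)
out = encoder_layer(rng.normal(size=(Ts, d)),
                    rng.normal(size=(l, d)), rng.normal(size=(l, d)), rng.normal(size=(l, d)),
                    rng.normal(size=(d_h, l)), rng.normal(size=(d, d_h)))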
Encoder in Transformer
• Stack of L encoder layers
• All these encoder layers are identical
– Each layer shares the weights
• X^L is a sequence which is a better representation of the input sequence X
– It is generated by capturing the similarities among the elements of X
[Figure: Encoder stack: X → Encoder Layer 1 → X^1 → Encoder Layer 2 → X^2 → ... → Encoder Layer L → X^L.]
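A sketch of the encoder stack, X → X^1 → ... → X^L, reusing a single set of layer parameters in every layer to reflect the weight-sharing statement above; all names and the tanh activation are illustrative assumptions:

import numpy as np

def shared_weight_encoder(X, Wq, Wk, Wv, W_ff_h, W_ff_o, L=4):
    """Stack of L identical encoder layers, X -> X^1 -> ... -> X^L.
    A single set of matrices is reused in every layer, reflecting the
    slide's statement that the layers share weights."""
    d = X.shape[1]
    for _ in range(L):
        Q, K, V = X @ Wq.T, X @ Wk.T, X @ Wv.T
        S = (Q @ K.T) / np.sqrt(d)
        A = np.exp(S - S.max(axis=1, keepdims=True))
        A = A / A.sum(axis=1, keepdims=True)
        Z = A @ V                                   # self-attention
        X = np.tanh(Z @ W_ff_h.T) @ W_ff_o.T        # position-wise FFNN
    return X

# toy usage
Ts, d = 6, 16
rng = np.random.default_rng(5)
X_L = shared_weight_encoder(rng.normal(size=(Ts, d)),
                            *(rng.normal(size=(d, d)) for _ in range(5)))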
Decoder in Transformer
• Decoder:
– Given the input sequence, the decoder should generate a sequence that is close to the target output sequence
– Example:
• Given a sentence in the source language, generate a sentence that is close to the target-language sentence
• Training process in the decoder:
– For every example of an input sequence, the corresponding target output sequence is given
– Input sequence: X = (x1, x2, …, xj, …, xTs)
• Each element in the sequence is a d-dimensional vector xj
• Ts is the length of the input sequence
– Target output sequence: Y = (y1, y2, …, yt, …, yTd)
• Each element in the sequence is a d-dimensional vector yt
• Td is the length of the output sequence
[Figure: Masked multi-head self-attention in the decoder. Each head i (i = 1, 2, ..., h) uses its own $W_i^{(Q)}$, $W_i^{(K)}$ and $W_i^{(V)}$ on the output sequence $Y = (y_1, \ldots, y_t, \ldots, y_{T_d})$ to obtain a query $q_{it}$, keys $k_{i1}, \ldots, k_{iT_d}$ and values $v_{i1}, \ldots, v_{iT_d}$, and applies scaled dot-product attention with masking to produce its context vector $c_{it}$.]
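A minimal NumPy sketch of scaled dot-product attention with masking as used in each decoder head (single head for brevity): position t is prevented from attending to positions after t by setting those scores to a large negative value before the softmax; names are illustrative:

import numpy as np

def masked_self_attention(Y, Wq, Wk, Wv):
    """Y: (Td, d) decoder inputs as rows; single head for brevity.
    Position t attends only to positions 1..t (causal mask)."""
    d = Y.shape[1]
    Q, K, V = Y @ Wq.T, Y @ Wk.T, Y @ Wv.T
    S = (Q @ K.T) / np.sqrt(d)                     # scores for all (t, m) pairs
    Td = Y.shape[0]
    mask = np.triu(np.ones((Td, Td), dtype=bool), k=1)
    S = np.where(mask, -1e9, S)                    # block attention to future positions
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)
    return A @ V

# toy usage
Td, d, l = 5, 8, 8
rng = np.random.default_rng(6)
R = masked_self_attention(rng.normal(size=(Td, d)),
                          *(rng.normal(size=(l, d)) for _ in range(3)))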
[Figure: Decoder attention blocks: the output sequence Y passes through a self-attention MHA with masking to give R; cross-attention then uses R as queries and the encoder output $X^L$ as keys and values (K, V).]
• Value matrix: $X^L = (x_{L,1}, x_{L,2}, \ldots, x_{L,j}, \ldots, x_{L,T_s})$
  $c_t = \mathrm{SDPA}(r_t, X^L, X^L)$
• Attention score: $e_{tm} = \dfrac{r_t^\top x_{L,m}}{\sqrt{d}}, \quad m = 1, 2, \ldots, T_s$
• Attention weight: $a_{tm} = \mathrm{softmax}(e_{tm})$
• Context vector: $c_t = \sum_{m=1}^{T_s} a_{tm}\, x_{L,m}$
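A minimal NumPy sketch of the cross-attention step above: the decoder representations r_t act as queries, and the encoder output X^L supplies both keys and values; names are illustrative:

import numpy as np

def cross_attention(R, X_L):
    """R: (Td, d) decoder queries r_t as rows; X_L: (Ts, d) encoder output.
    Returns the context vectors c_t, one per decoder position."""
    d = R.shape[1]
    S = (R @ X_L.T) / np.sqrt(d)                   # e_{tm} = r_t^T x_{L,m} / sqrt(d)
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)           # a_{tm}
    return A @ X_L                                 # c_t = sum_m a_{tm} x_{L,m}

# toy usage
Td, Ts, d = 4, 6, 8
rng = np.random.default_rng(7)
C = cross_attention(rng.normal(size=(Td, d)), rng.normal(size=(Ts, d)))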
[Figure: Multi-head cross-attention in the decoder (heads i = 1, 2, ..., h). The per-head context vectors $c_t$ are combined through an output transformation, $z_t = W^{(O)} c_t$.]
[Figure: One decoder layer: the output sequence Y passes through a self-attention MHA with masking to give R; a cross-attention MHA (SDPA with keys K and values V from $X^L$ and queries from R) gives Z; Z is then passed through a PWFFNN.]
• The operations on the entities at different time instances are largely independent
• If we have GPUs, one can use them to perform the computation related to each time instance in parallel
[Figure: Decoder stack: Y → Decoder Layer 1 → Y^1 → Decoder Layer 2 → Y^2 → ...; each decoder layer also takes the encoder output $X^L$ as input.]
[Figure: Residual connections in the Transformer layer.]
[2] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua
Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain
Gelly, Jakob Uszkoreit, Neil Houlsby, “An Image is Worth 16x16 Words: Transformers for
Image Recognition at Scale,” 9th International Conference on Learning Representations
(ICLR 2021), 2021.
[3] L. Zhou, Y. Zhou, J. J. Corso, R. Socher and C. Xiong, "End-to-End Dense Video Captioning with Masked Transformer," Computer Vision and Pattern Recognition (CVPR), 2018.
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," NAACL, 2019.
Text Books
1. Aston Zhang, Zachary C. Lipton, Mu Li and Alexander J. Smola, Dive into Deep Learning, 2021.
2. Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep Learning, MIT Press, available online: http://www.deeplearningbook.org, 2016.
3. Charu C. Aggarwal, Neural Networks and Deep Learning, Springer, 2018.
4. B. Yegnanarayana, Artificial Neural Networks, Prentice-Hall of India, 1999.
5. Satish Kumar, Neural Networks - A Class Room Approach, Second Edition, Tata McGraw-Hill, 2013.
6. S. Haykin, Neural Networks and Learning Machines, Prentice Hall of India, 2010.
7. C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.