
Attention-based Models:
Transformers
Dr. Dileep A. D.

Associate Professor,
Multimedia Analytics Networks And Systems (MANAS) Lab,
School of Computing and Electrical Engineering (SCEE),
Indian Institute of Technology Mandi, Kamand, H.P.
Email: addileep@iitmandi.ac.in

Sequence-to-Sequence Mapping Tasks


• Neural Machine Translation: Translation of a sentence in the
source language to a sentence in the target language
– Input: A sequence of words
– Output: A sequence of words

• Speech Recognition (Speech-to-Text Conversion): Conversion of
the speech signal of a sentence utterance to the text of the
sentence
– Input: A sequence of feature vectors extracted from the
speech signal of a sentence
– Output: A sequence of words
• Video Captioning: Generation of a sentence as the caption
for a video represented as a sequence of frames
– Input: A sequence of feature vectors extracted from the
frames of a video
– Output: A sequence of words

Sequence-to-Sequence Mapping Tasks


• Each of the above tasks involves mapping an input
sequence to an output sequence
• So far we have seen the encoder-decoder paradigm using RNN
models for sequence-to-sequence mapping
• Training RNN-based sequence-to-sequence mapping
systems
– is computationally intensive, and
– offers little scope for parallelization of operations during
training
• Goal: Come up with a totally different approach for solving
sequence-to-sequence mapping tasks that
– avoids the recurrent structure in the encoder-decoder paradigm and
– avoids the huge training time required for training RNNs

Attention-based Deep Learning Models for
Sequence-to-Sequence Mapping
• Attention-based models implement sequence-to-sequence
mapping using only the attention-based techniques
– They don’t use any RNNs for that matter
• Attention-based models [1] try to capture and use
– Relations among elements in the input sequence (Self-
Attention)
– Relations among elements in the output sequence (Self-
Attention)
– Relations between elements in the input sequence and
elements in the output sequence (Cross-Attention)
• In literature, these attention-based models are called
transformers
– Perform several transformations to capture a better representation
while avoiding recurrences

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I.
Polosukhin, "Attention is all you need," 31st Conference on Neural Information Processing
Systems (NIPS 2017), Long Beach, CA, USA, pp. 1-11, 2017.


Attention-based Deep Learning Models for
Sequence-to-Sequence Mapping
• Given the sequence X = (x1, x2, … , xj , …, xTs)
– get a representation for X that captures the
relationships among the elements in the sequence X
– Basically, apply some kind of transformations to get a
representation that preserves the sequence while avoiding
recurrences
• Major advantages:
– Training times are smaller compared to RNNs as there are
no recurrences and hence no need to perform backpropagation
through time (BPTT)
– There will not be any vanishing and exploding gradient
problems
– Most importantly, this gives a lot of scope for parallelization
when you use GPUs for training

Seq2Seq Mapping: So Far …


• Attention mechanism used in typical encoder-decoder
framework for sequence-to-sequence mapping
• Uses an RNN-based encoder and decoder
• We look at the similarity of the state of the decoder (s_{D,t-1})
with respect to the state of the encoder (s_{E,j}) at different
instances of time
– We get attention scores as a measure of similarity
• Attention score: α_jt = f_ATT(s_{D,t-1}, s_{E,j})
• Attention weight: a_jt = softmax(α_jt)
• Context vector: c_t = Σ_{j=1}^{Ts} a_jt s_{E,j}
• The context vector goes as input to the decoder at every time t:
z_t = [c_t^T  y_t^T]^T
• State of the decoder at time t: s_{D,t} = tanh(U_D z_t + W_D s_{D,t-1} + b_D)
• Output of the model at time t: P(y_{t+1}) = f(V_D s_{D,t} + c_D)
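As a concrete illustration, here is a minimal NumPy sketch of one decoder time
step in this RNN-with-attention framework. It assumes a simple dot-product
f_ATT and uses random placeholder parameters for U_D, W_D, b_D, V_D and c_D; it
is a sketch of the computation above, not the lecture's code:

```python
import numpy as np

rng = np.random.default_rng(0)
d, Ts, vocab = 8, 5, 10                        # state size, input length, output vocabulary size

s_E = rng.normal(size=(Ts, d))                 # encoder states s_E,1 ... s_E,Ts
s_D_prev = rng.normal(size=d)                  # previous decoder state s_D,t-1
y_t = rng.normal(size=d)                       # representation of the current output y_t

# Attention score, weight and context vector (dot-product f_ATT assumed)
alpha = s_E @ s_D_prev                         # alpha_jt = f_ATT(s_D,t-1, s_E,j)
a = np.exp(alpha - alpha.max()); a /= a.sum()  # a_jt = softmax(alpha_jt)
c_t = a @ s_E                                  # c_t = sum_j a_jt s_E,j

# Decoder state update and output distribution (placeholder parameters)
U_D, W_D, b_D = rng.normal(size=(d, 2 * d)), rng.normal(size=(d, d)), np.zeros(d)
V_D, c_D = rng.normal(size=(vocab, d)), np.zeros(vocab)

z_t = np.concatenate([c_t, y_t])               # z_t = [c_t^T  y_t^T]^T
s_D = np.tanh(U_D @ z_t + W_D @ s_D_prev + b_D)
logits = V_D @ s_D + c_D
P_next = np.exp(logits - logits.max()); P_next /= P_next.sum()   # P(y_{t+1})
```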

Attention-based Deep Learning Models for
Sequence-to-Sequence Mapping
• Attention-based models [1] are another way of solving the
problem of sequence-to-sequence mapping
• Attention-based models implement sequence-to-sequence
mapping using only the attention-based techniques
– They don’t use any RNNs for that matter
• In literature, these attention-based deep learning models
are called transformers
– Perform a sequence of transformations to capture a better
representation that preserves the sequence while avoiding
recurrences
• Major advantages:
– Training times are smaller compared to RNNs as there are
no recurrences
– Most importantly, this gives a lot of scope for parallelization
when you use GPUs for training

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I.
Polosukhin, "Attention is all you need," 31st Conference on Neural Information Processing
Systems (NIPS 2017), Long Beach, CA, USA, pp. 1-11, 2017.

Terminologies used in Transformer Models


• Terminologies come from basic linear regression models
• Linear regression models:
– Given: Training data - D = {x_n, y_n}_{n=1}^N, x_n ∈ R^d
• x_n : nth training example
• y_n : desired output corresponding to x_n
– Prediction:
• Let x be the new data i.e., test data
• Predict the approximate estimate ŷ
– We consider the model that uses basis functions
• Define a basis function, g(x, x_n), that looks at how similar the new
sample x is to each of the training samples (x_n s)
– The function g(x, x_n) can be used as the weight associated with the nth
training example (x_n)
– The predicted output of the model is expressed as the weighted sum
of the y_n s:
ŷ = Σ_{n=1}^N g(x, x_n) y_n
» Here, g(x, x_n) indicates a measure of similarity or dissimilarity


Terminologies used in Transformer Models


• Terminologies come from basic linear regression
models
• Linear regression models:
– Given: Training data - D = {x_n, y_n}_{n=1}^N, x_n ∈ R^d
• xn : nth training example
• yn : desired output corresponding to xn
– Prediction:
• Let x be the new data i.e., test data
• Predict the approximate estimate ŷ of y=f(x)
– We consider the model that uses basis functions
– The predicted output of the model is expressed as the
weighted sum of the y_n s:
w_n = g(x, x_n),   ŷ = Σ_{n=1}^N w_n y_n
• Here, g(x, x_n) is a basis function indicating a measure of
similarity or dissimilarity

Terminologies used in Transformer Models


• Terminologies come from basic linear regression
models
• Linear regression models:
– Given: Training data - D = {x_n, y_n}_{n=1}^N, x_n ∈ R^d
– Prediction: ŷ = Σ_{n=1}^N g(x, x_n) y_n = Σ_{n=1}^N w_n y_n
– New terminology:
• Query (query vector): new input (test input) : x
• Key (key vector): each of the training examples : x_n, n = 1, ..., N
• Value (value vector): each desired output corresponding to x_n : y_n
– Basically, the prediction task is performed using 3 entities:
query, key and value
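A small NumPy sketch of this query/key/value view of basis-function prediction.
The Gaussian form of g(x, x_n) and the toy data are illustrative assumptions,
not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 20, 3

X_train = rng.normal(size=(N, d))     # keys: training examples x_n
y_train = rng.normal(size=N)          # values: desired outputs y_n
x_query = rng.normal(size=d)          # query: new (test) input x

def g(x, x_n, gamma=1.0):
    """Basis function: similarity of the query x to a training example x_n."""
    return np.exp(-gamma * np.sum((x - x_n) ** 2))

w = np.array([g(x_query, x_n) for x_n in X_train])   # weights w_n = g(x, x_n)
y_hat = np.sum(w * y_train)                          # y_hat = sum_n w_n y_n
print(y_hat)
```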


Terminologies used in Transformer Models


• Query, key and value in the attention mechanism in
encoder-decoder framework:
– Input sequence: X = (x1, x2, … , xj , …, xTs)
– Output sequence: Y = (y1, y2, …, yt , …, yTd)
– Attention mechanism in encoder-decoder framework:

c_t = attentionMechanism(s_{D,t-1}, s_{E,j})
s_{D,t-1}: state of the decoder at time t-1
s_{E,j} : state of the encoder at time j
• Attention score: α_jt = f_ATT(s_{D,t-1}, s_{E,j})
• Attention weight: a_jt = softmax(α_jt)
• Context vector: c_t = Σ_{j=1}^{Ts} a_jt s_{E,j}

• Query (query vector): s_{D,t-1}
• Key (key vector): s_{E,j}, j = 1, ..., Ts
• Value (value vector): s_{E,j}
– Here, s_{E,j} acts as both the key vector and the value vector

Terminologies used in Transformer Models


• Query, key and value in the attention mechanism in
encoder-decoder framework:
– Input sequence: X = (x1, x2, … , xj , …, xTs)
– Output sequence: Y = (y1, y2, …, yt , …, yTd)
– Attention mechanism in encoder-decoder framework:

c_t = attentionMechanism(s_{D,t-1}, s_{E,j})
s_{D,t-1}: state of the decoder at time t-1
s_{E,j} : state of the encoder at time j

• Query (query vector): s_{D,t-1}
• Key (key vector): s_{E,j}, j = 1, ..., Ts
• Value (value vector): s_{E,j}
– Here, s_{E,j} acts as both the key vector and the value vector

• However, query, key and value refer to completely
different entities in attention-based models
(transformers)


Transformers:
Scaled Dot-Product Attention (SDPA)
• Consider the sequence X = (x1, x2, … , xt , …, xT)
• Consider the following d-dimensional vectors:
– Query vector: q
– Key vectors: kt t=1,2,…,T
– Value vectors: vt t=1,2,…,T
• SDPA gives the weight based on the similarity measure
(i.e., dot product) computed between the query vector and
each of the key vectors
– The similarity measure is scaled to keep its magnitude
under control
• Attention score: scaled dot-product between q and k_t:
α_t = ⟨q, k_t⟩ / √d = q^T k_t / √d
• Attention weight: a_t = softmax(α_t) = exp(α_t) / Σ_{t'=1}^{T} exp(α_{t'})

Transformers:
Scaled Dot-Product Attention (SDPA)
• Consider the sequence X = (x1, x2, … , xt , …, xT)
• Consider the following d-dimensional vectors:
– Query vector: q
– Key vectors: kt
– Value vectors: vt
• Attention score: scaled dot-product between q and k_t:
α_t = q^T k_t / √d
• Attention weight: a_t = softmax(α_t) = exp(α_t) / Σ_{t'=1}^{T} exp(α_{t'})
• Context vector associated with the Query vector q: c = Σ_{t=1}^{T} a_t v_t
• The context vector captures the relation of the Query vector (q)
with the Key vectors (k_t s) and is obtained as a weighted
combination of the corresponding Value vectors (v_t s)
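A minimal NumPy sketch of scaled dot-product attention for a single query
vector, following the formulas above (the shapes and the random test data are
only illustrative):

```python
import numpy as np

def sdpa(q, K, V):
    """Scaled dot-product attention for one query.
    q: (d,) query vector; K: (T, d) key vectors; V: (T, d_v) value vectors.
    Returns the context vector c = sum_t a_t v_t."""
    d = q.shape[0]
    alpha = K @ q / np.sqrt(d)                 # attention scores alpha_t
    a = np.exp(alpha - alpha.max())
    a /= a.sum()                               # attention weights a_t (softmax)
    return a @ V                               # context vector c

rng = np.random.default_rng(2)
T, d = 6, 8
q, K, V = rng.normal(size=d), rng.normal(size=(T, d)), rng.normal(size=(T, d))
c = sdpa(q, K, V)                              # (d,) context vector for query q
```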

Transformers for
Sequence-to-Sequence Mapping
• Given the input sequence X, map it onto output
sequence Y
• Input sequence: X = (x1, x2, … , xj , …, xTs)
– Each element in the sequence is a d-dimensional vector x_j
– Ts is the length of the input sequence
• Output sequence: Y = (y1, y2, …, yt , …, yTd)
– Each element in the sequence is a d-dimensional vector y_t
– Td is the length of the output sequence
• Transformer also has an encoder-decoder framework:
– The input sequence X is passed through the encoder to get
some representation Z
– This Z goes as input to the decoder, which maps it onto
the output sequence Y


Transformers for
Sequence-to-Sequence Mapping
• Transformers do not use RNNs in the encoder and
decoder
• Instead, a sequence of transformations is performed
on X to get Z
• This representation Z goes as input to the decoder,
which also performs a sequence of transformations
• Transformers try to capture and use
– Relations among elements in the input sequence (Self-
Attention)
– Relations among elements in the output sequence (Self-
Attention)
– Relations between elements in the input sequence and
elements in the output sequence (Cross-Attention)


Self-Attention
• Self-attention captures the relations among the
elements of a sequence


Self-Attention on Input Sequence, X


• Query vectors (qj), Key vectors (kj) and Value vectors (vj)
are generated using the elements in the input sequence X
• Here, qj = kj = vj = xj
• However, we need different representations for qj , kj and vj
• These different representations are obtained using some kind
of transformation
• Let us consider 3 different weight matrices W^(Q), W^(K) and
W^(V) corresponding to Query, Key and Value
– They act as transformation matrices
• Now each of the query, key and value vectors is obtained
as:
– Query vector: q_j = W^(Q) x_j
– Key vector: k_j = W^(K) x_j
– Value vector: v_j = W^(V) x_j
• These transformation matrices are learnt as a part of the
training process


Self-Attention on Input Sequence, X
and Single-Head Attention
• Let us define,
– Query matrix: Q̃ = (q_1, q_2, ..., q_j, ..., q_Ts)
– Key matrix: K̃ = (k_1, k_2, ..., k_j, ..., k_Ts)
– Value matrix: Ṽ = (v_1, v_2, ..., v_j, ..., v_Ts)
• A context vector (c_j) is obtained by applying scaled dot-product
attention (SDPA) on q_j w.r.t. the entities in K̃ and Ṽ:
c_j = SDPA(q_j, K̃, Ṽ),  j = 1, 2, ..., Ts
[Figure: SDPA block taking q_j, the keys k_1 ... k_Ts and the values
v_1 ... v_Ts, all obtained from X = (x_1, ..., x_j, ..., x_Ts) through
W^(Q), W^(K) and W^(V)]
• Then we get the C̃ matrix from the c_j obtained for each of the query
vectors: C̃ = (c_1, c_2, ..., c_j, ..., c_Ts)
• C̃ is the encoded representation of X
• Single-Head Attention: Generation of the context vector matrix
from the sequence X using one set of transformation
matrices

Self-Attention on Input Sequence, X
and Multi-Head Attention
• There is no restriction that only one set of transformation
matrices (W^(Q), W^(K) and W^(V)) should be used
• We can use multiple sets of transformation matrices for
query, key and value
– It is like CNN filters
• Multiple transformation matrices:
W_i^(Q), W_i^(K), W_i^(V),  i = 1, 2, …, h
– Here, h is the number of heads
• Associated with each head there is an attention process to
get C̃_i
• Multi-Head Attention: Generation of the context vector matrix
from the sequence X using multiple sets of transformation
matrices

Transformers for
Sequence-to-Sequence Mapping
• Given the input sequence X, map it onto output
sequence Y
• Input sequence: X = (x1, x2, … , xj , …, xTs)
– Each element in the sequence is a d-dimensional vector x_j
– Ts is the length of the input sequence
• Output sequence: Y = (y1, y2, …, yt , …, yTd)
– Each element in the sequence is a d-dimensional vector y_t
– Td is the length of the output sequence
• Transformer also has an encoder-decoder framework:
– The input sequence X is passed through the encoder to get
some representation X̃_L
– This X̃_L goes as input to the decoder, which maps it onto
the output sequence Y


Encoder: Self-Attention
• Self-attention captures the relations among the
elements of a sequence
– It gives the importance of each element in a sequence
with respect to every other element in that sequence
• A set of transformation matrices of size l×d (each mapping a
d-dimensional x_j to an l-dimensional vector), W^(Q), W^(K) and
W^(V), corresponding to Query, Key and Value is considered
– They act as parametric matrices
• Now each of the query, key and value vectors is obtained
as:
– Query vector: q_j = W^(Q) x_j
– Key vector: k_j = W^(K) x_j        for j = 1, 2, ..., Ts
– Value vector: v_j = W^(V) x_j
• Dimension of Query, Key and Value vectors: l (Note: l < d)
• These transformation matrices are learnt as a part of the
training process

Encoder: Self-Attention based
Single-Head Attention
• Let us define,
– Query matrix: Q̃ = (q_1, q_2, ..., q_j, ..., q_Ts)
– Key matrix: K̃ = (k_1, k_2, ..., k_j, ..., k_Ts)
– Value matrix: Ṽ = (v_1, v_2, ..., v_j, ..., v_Ts)
• A context vector (c_j) is obtained by applying scaled dot-product
attention (SDPA) on q_j w.r.t. the entities in K̃ and Ṽ:
c_j = SDPA(q_j, K̃, Ṽ),  j = 1, 2, ..., Ts
[Figure: SDPA block taking q_j, the keys k_1 ... k_Ts and the values
v_1 ... v_Ts, all obtained from X = (x_1, ..., x_j, ..., x_Ts) through
W^(Q), W^(K) and W^(V)]
• Attention score: α_jm = q_j^T k_m / √d,  m = 1, 2, ..., Ts
• Attention weight: a_jm = softmax(α_jm)
• Context vector: c_j = Σ_{m=1}^{Ts} a_jm v_m

Encoder: Self-Attention based
Single-Head Attention
• Let us define,
– Query matrix: Q̃ = (q_1, q_2, ..., q_j, ..., q_Ts)
– Key matrix: K̃ = (k_1, k_2, ..., k_j, ..., k_Ts)
– Value matrix: Ṽ = (v_1, v_2, ..., v_j, ..., v_Ts)
• A context vector (c_j) is obtained by applying scaled dot-product
attention (SDPA) on q_j w.r.t. the entities in K̃ and Ṽ:
c_j = SDPA(q_j, K̃, Ṽ),  j = 1, 2, ..., Ts
[Figure: SDPA block taking q_j, the keys k_1 ... k_Ts and the values
v_1 ... v_Ts, all obtained from X = (x_1, ..., x_j, ..., x_Ts) through
W^(Q), W^(K) and W^(V)]
• Then we get the C̃ matrix from the c_j obtained for each of the query
vectors: C̃ = (c_1, c_2, ..., c_j, ..., c_Ts)
• C̃ is the encoded representation of X
• Single-Head Attention: Generation of the context vector matrix
from the sequence X using one set of transformation
matrices

Encoder: Self-Attention based
Multi-Head Attention
• There is no restriction that only one set of transformation
matrices (W^(Q), W^(K) and W^(V)) should be used
• We can use multiple sets of transformation matrices for
query, key and value
– It is like CNN filters
• Multiple transformation matrices:
W_i^(Q), W_i^(K), W_i^(V),  i = 1, 2, …, h
– Here, h is the number of heads
• Associated with each head there is an attention process to
get C̃_i
• Multi-Head Attention: Generation of the context vector matrix
from the sequence X using multiple sets of transformation
matrices


Encoder: Self-Attention based
Multi-Head Attention
• Multi-Head Attention (MHA): Multiple sets of transformation
matrices are used to generate multiple context vectors
• Transformation matrices associated with the ith head: W_i^(Q), W_i^(K), W_i^(V)
• Query, Key and Value vectors generated using the ith head:
– Query vector: q_ij = W_i^(Q) x_j
– Key vector: k_ij = W_i^(K) x_j
– Value vector: v_ij = W_i^(V) x_j
• Context vector generated using the ith head:
c_ij = SDPA(q_ij, K̃_i, Ṽ_i),  j = 1, 2, ..., Ts,  i = 1, 2, ..., h
– Here, h is the number of heads
• Dimension of Query, Key and Value vectors: l = d / h
• Context vector in MHA: c_j = concat(c_1j, c_2j, ..., c_ij, ..., c_hj),  j = 1, 2, ..., Ts
• Dimension of the context vector in MHA: d
• The context vector is transformed using the d×d matrix W^(O) to
generate the output vector: z_j = W^(O) c_j
• Output of MHA is a sequence: Z = (z_1, z_2, ..., z_j, ..., z_Ts)
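A NumPy sketch of this multi-head self-attention: one SDPA per head,
concatenation of the per-head context vectors, and the final projection with
W^(O). All matrices below are random placeholders for learnt parameters, and h
divides d so that l = d/h:

```python
import numpy as np

def softmax_rows(S):
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O):
    """X: (Ts, d). W_Q, W_K, W_V: (h, l, d) per-head projections. W_O: (d, d).
    Returns Z: (Ts, d), the MHA output sequence z_1 ... z_Ts."""
    Ts, d = X.shape
    h = W_Q.shape[0]
    heads = []
    for i in range(h):                               # one SDPA per head
        Q, K, V = X @ W_Q[i].T, X @ W_K[i].T, X @ W_V[i].T
        A = softmax_rows(Q @ K.T / np.sqrt(d))       # attention weights of head i
        heads.append(A @ V)                          # head-i context vectors, shape (Ts, l)
    C = np.concatenate(heads, axis=1)                # c_j = concat(c_1j, ..., c_hj), shape (Ts, d)
    return C @ W_O.T                                 # z_j = W_O c_j

rng = np.random.default_rng(4)
Ts, d, h = 5, 16, 4
l = d // h
X = rng.normal(size=(Ts, d))
W_Q, W_K, W_V = (rng.normal(size=(h, l, d)) for _ in range(3))
W_O = rng.normal(size=(d, d))
Z = multi_head_self_attention(X, W_Q, W_K, W_V, W_O)   # (Ts, d)
```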

Encoder: Self-Attention based
Multi-Head Attention
• Self-attention MHA is a sub-layer in the encoder layer of
the Transformer model
[Figure: Multi-head attention block – h scaled dot-product attention heads,
each with its own W_i^(Q), W_i^(K), W_i^(V) acting on X = (x_1, ..., x_j, ..., x_Ts);
the head outputs c_1j, c_2j, ..., c_hj are concatenated into c_j and projected
by W^(O) to give z_j]


Encoder: Self-Attention based
Multi-Head Attention
• Self-attention MHA is a sub-layer in the encoder layer of
the Transformer model
• Output of MHA is a sequence: Z = (z_1, z_2, ..., z_j, ..., z_Ts)
[Figure: Self-attention MHA applied at every position of
X = (x_1, ..., x_j, ..., x_Ts), producing z_1, z_2, ..., z_Ts]


Encoder: Self-Attention MHA and Position-wise
Feedforward Neural Networks (PWFFNN)
• Captures the nonlinear relationships in the data
• One feedforward neural network (FFNN) is used for every position j
(every time step) in the sequence
– Input and output layers are linear and the hidden layer is nonlinear
• There are Ts FFNNs in the PWFFNN
• The weight matrices (W_FF^(h), W_FF^(o)) are shared across the FFNNs in the
PWFFNN at every position j
[Figure: A FFNN (hidden layer W_FF^(h), output layer W_FF^(o)) applied to each
MHA output z_1, z_2, ..., z_Ts, giving x̃_1, x̃_2, ..., x̃_Ts; the self-attention
MHA below takes X = (x_1, ..., x_j, ..., x_Ts) as input]
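A NumPy sketch of the position-wise FFNN: one small network with a nonlinear
hidden layer and a linear output layer, applied with the same shared weights to
every position z_j. The hidden width d_ff and the ReLU nonlinearity are common
choices assumed here; the slide only states that the hidden layer is nonlinear:

```python
import numpy as np

def position_wise_ffnn(Z, W_h, b_h, W_o, b_o):
    """Z: (Ts, d) MHA outputs z_1 ... z_Ts.
    The same weights are shared across all positions j.
    Returns (Ts, d): one output vector per position."""
    H = np.maximum(0.0, Z @ W_h.T + b_h)      # nonlinear hidden layer (ReLU assumed)
    return H @ W_o.T + b_o                    # linear output layer

rng = np.random.default_rng(5)
Ts, d, d_ff = 5, 16, 32                       # d_ff: hidden-layer width (illustrative)
Z = rng.normal(size=(Ts, d))
W_h, b_h = rng.normal(size=(d_ff, d)), np.zeros(d_ff)
W_o, b_o = rng.normal(size=(d, d_ff)), np.zeros(d)
X_tilde = position_wise_ffnn(Z, W_h, b_h, W_o, b_o)   # (Ts, d)
```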


Encoder: Self-Attention MHA and Position-wise
Feedforward Neural Networks (PWFFNN)
• Self-attention MHA and PWFFNN form an encoder layer
• Output of an encoder layer: X̃ = (x̃_1, x̃_2, ..., x̃_j, ..., x̃_Ts)
[Figure: The encoder layer as a stack – PWFFNNs on top of the self-attention
MHAs, mapping X = (x_1, ..., x_j, ..., x_Ts) to X̃ = (x̃_1, x̃_2, ..., x̃_Ts)]


Encoder Layer in Transformer


• Self-attention MHA and PWFFNN form an encoder layer
• Output of an encoder layer: X̃ = (x̃_1, x̃_2, ..., x̃_j, ..., x̃_Ts)
• The contents of Z are nonlinearly related to the contents of X
• Thus, the contents of X̃ are nonlinearly related to the
contents of X
[Figure: An encoder layer – X passes through the self-attention MHA to give
Z, which passes through the PWFFNN to give X̃]

Encoder in Transformer

• Stack of L number of encoder layers
• All these encoder layers are going to be identical layers
– Each layer shares the weights
• X̃_L is a sequence which is a better representation of the input
sequence X
– It is generated by capturing the similarities among the elements
of X
• X̃_L goes as input to the decoder
[Figure: Encoder – a stack of L encoder layers; X → Encoder Layer 1 → X̃_1 →
Encoder Layer 2 → X̃_2 → ... → Encoder Layer L → X̃_L]
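A compact NumPy sketch of an encoder layer (self-attention MHA followed by
PWFFNN) and a stack of L such layers, as described above. It is a bare-bones
illustration with random placeholder parameters; residual connections and the
other components of the full Transformer layer are omitted:

```python
import numpy as np

rng = np.random.default_rng(6)

def softmax_rows(S):
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def make_layer_params(d, h, d_ff):
    l = d // h
    return {"W_Q": rng.normal(size=(h, l, d)), "W_K": rng.normal(size=(h, l, d)),
            "W_V": rng.normal(size=(h, l, d)), "W_O": rng.normal(size=(d, d)),
            "W_h": rng.normal(size=(d_ff, d)), "W_o": rng.normal(size=(d, d_ff))}

def encoder_layer(X, p):
    """One encoder layer: self-attention MHA sub-layer, then PWFFNN sub-layer."""
    d, h = X.shape[1], p["W_Q"].shape[0]
    heads = []
    for i in range(h):
        Q, K, V = X @ p["W_Q"][i].T, X @ p["W_K"][i].T, X @ p["W_V"][i].T
        heads.append(softmax_rows(Q @ K.T / np.sqrt(d)) @ V)
    Z = np.concatenate(heads, axis=1) @ p["W_O"].T           # MHA output
    return np.maximum(0.0, Z @ p["W_h"].T) @ p["W_o"].T      # PWFFNN output X_tilde

def encoder(X, layer_params):
    """Stack of L encoder layers: X -> X_tilde_1 -> ... -> X_tilde_L."""
    for p in layer_params:
        X = encoder_layer(X, p)
    return X

Ts, d, h, d_ff, L = 5, 16, 4, 32, 3
X = rng.normal(size=(Ts, d))
p = make_layer_params(d, h, d_ff)
X_tilde_L = encoder(X, [p] * L)    # same parameters reused across layers, as on the slide;
                                   # the original Transformer gives each layer its own parameters
```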


Decoder in Transformer
• Decoder:
– Given the input sequence, the decoder should generate a
sequence that is close to the target output sequence
– Example:
• Given a sentence in the source language, generate a
sentence that is close to the target-language sentence
• Training process in decoder:
– For every example of an input sequence, the corresponding
target output sequence is given
– Input sequence: X = (x1, x2, … , xj , …, xTs)
• Each element in the sequence is a d-dimensional vector x_j
• Ts is the length of the input sequence
– Target output sequence: Y = (y1, y2, …, yt , …, yTd)
• Each element in the sequence is a d-dimensional vector y_t
• Td is the length of the output sequence

Decoder in Training Process


• Big picture:
– Given: Input sequence and desired output sequence
– Perform the operations in the decoder layer to generate an
output sequence
– Compare generated sequence with desired output
sequence
– Compute the error and then backpropagate the error to
update the parameter sets
• Training process in sequence-to-sequence learning:


Decoder in Training Process


• Big picture:
– Given: Input sequence and desired output sequence
– Perform the operations in the decoder layer to generate an
output sequence
– Compare generated sequence with desired output
sequence
– Compute the error and then backpropagate the error to
update the parameter sets
• Structure of decoder layer:
– Two sub-layers
1. Self-attention based multi-head attention (MHA) with
masking
2. Cross-attention based MHA


Decoder in Training Process:
Self-Attention
• Self-attention in the decoder captures the relations among
the elements of an output sequence
– It gives the importance of each element in a sequence
with respect to every other element in that sequence
• A set of transformation matrices of size l×d (each mapping a
d-dimensional y_t to an l-dimensional vector), W^(Qd), W^(Kd) and
W^(Vd), corresponding to Query, Key and Value is considered
– They act as parametric matrices
• Now each of the query, key and value vectors is obtained
as:
– Query vector: q_t = W^(Qd) y_t
– Key vector: k_t = W^(Kd) y_t        for t = 1, 2, ..., Td
– Value vector: v_t = W^(Vd) y_t
• Dimension of Query, Key and Value vectors: l
• These transformation matrices are learnt as a part of the
training process

Decoder in Training Process:
Self-Attention based Single-Head Attention with Masking
• Let us define,
– Query matrix: Q̃ = (q_1, q_2, ..., q_t, ..., q_Td)
– Key matrix: K̃ = (k_1, k_2, ..., k_t, ..., k_Td)
– Value matrix: Ṽ = (v_1, v_2, ..., v_t, ..., v_Td)
• A context vector (c_t) is obtained by applying scaled dot-product
attention (SDPA) with masking on q_t w.r.t. the entities in K̃ and Ṽ:
c_t = SDPA_Mask(q_t, K̃, Ṽ),  t = 1, 2, ..., Td
[Figure: SDPA-with-masking block taking q_t, the keys k_1 ... k_Td and the
values v_1 ... v_Td, all obtained from Y = (y_1, ..., y_t, ..., y_Td) through
W^(Qd), W^(Kd) and W^(Vd)]
• Attention score: α_tm = q_t^T k_m / √d,  m = 1, 2, ..., Td
• Attention weight: a_tm = softmax(α_tm)
• Masking: a_tm = 0 for m ≥ t + 1
• Context vector: c_t = Σ_{m=1}^{Td} a_tm v_m


Decoder in Training Process:
Self-Attention based Single-Head Attention with Masking
• Let us define,
– Query matrix: Q̃ = (q_1, q_2, ..., q_t, ..., q_Td)
– Key matrix: K̃ = (k_1, k_2, ..., k_t, ..., k_Td)
– Value matrix: Ṽ = (v_1, v_2, ..., v_t, ..., v_Td)
• A context vector (c_t) is obtained by applying scaled dot-product
attention (SDPA) with masking on q_t w.r.t. the entities in K̃ and Ṽ:
c_t = SDPA_Mask(q_t, K̃, Ṽ),  t = 1, 2, ..., Td
[Figure: SDPA-with-masking block taking q_t, the keys k_1 ... k_Td and the
values v_1 ... v_Td, all obtained from Y = (y_1, ..., y_t, ..., y_Td) through
W^(Qd), W^(Kd) and W^(Vd)]
• Then we get the C̃ matrix from the c_t obtained for each of the query
vectors: C̃ = (c_1, c_2, ..., c_t, ..., c_Td)
• C̃ is the encoded representation of Y
• Single-Head Attention with masking: Generation of the context vector
matrix from the sequence Y using one set of transformation
matrices

Decoder in Training Process:
Self-Attention based Multi-Head Attention with Masking
• Multi-Head Attention (MHA): Multiple sets of transformation
matrices are used to generate multiple context vectors
• Transformation matrices associated with the ith head: W_i^(Q), W_i^(K), W_i^(V)
• Query, Key and Value vectors generated using the ith head:
– Query vector: q_it = W_i^(Q) y_t
– Key vector: k_it = W_i^(K) y_t
– Value vector: v_it = W_i^(V) y_t
• Context vector generated using the ith head:
c_it = SDPA_Mask(q_it, K̃_i, Ṽ_i),  t = 1, 2, ..., Td,  i = 1, 2, ..., h
– Here, h is the number of heads
• Dimension of Query, Key and Value vectors: l = d / h
• Context vector in MHA: c_t = concat(c_1t, c_2t, ..., c_it, ..., c_ht),  t = 1, 2, ..., Td
• Dimension of the context vector in MHA: d
• The context vector is transformed using the d×d matrix W^(Od) to
generate the output vector: r_t = W^(Od) c_t
• Output of MHA is a sequence: R = (r_1, r_2, ..., r_t, ..., r_Td)

Decoder in Training Process:
Self-Attention based Multi-Head Attention with Masking
• Multi-Head Attention (MHA): Multiple sets of transformation
matrices are used to generate multiple context vectors
• Self-attention MHA is the first sub-layer in the decoder layer of
the Transformer model
[Figure: Multi-head attention with masking – h SDPA-with-masking heads,
each with its own W_i^(Q), W_i^(K), W_i^(V) acting on Y = (y_1, ..., y_t, ..., y_Td);
the head outputs c_1t, c_2t, ..., c_ht are concatenated into c_t and projected
by W^(O) to give r_t]


Decoder in Training Process:
Self-Attention based MHA with Masking
• Self-attention MHA is the first sub-layer in the decoder
layer of the Transformer model
• Output of self-attention MHA with masking is a
sequence: R = (r_1, r_2, ..., r_t, ..., r_Td)
[Figure: Self-attention MHA with masking applied at every position of
Y = (y_1, ..., y_t, ..., y_Td), producing r_1, r_2, ..., r_Td]



Decoder in Training Process:
Cross-Attention based MHA
• Output of the self-attention MHA with masking (R) is given as
input to the second sub-layer of the decoder, the cross-attention
based MHA
• R acts as the Query
• Another input to the cross-attention based MHA is X̃_L, which
acts as both Key and Value
• Query matrix: R = (r_1, r_2, ..., r_t, ..., r_Td)
• Key matrix: X̃_L = (x̃_L1, x̃_L2, ..., x̃_Lj, ..., x̃_L,Ts)
• Value matrix: X̃_L = (x̃_L1, x̃_L2, ..., x̃_Lj, ..., x̃_L,Ts)
[Figure: Decoder sub-layers – self-attention MHA with masking on Y gives R;
cross-attention MHA (SDPA) takes R as query and X̃_L as key and value, and
outputs Z]

Decoder in Training Process:
Cross-Attention based MHA
• Output of the self-attention MHA with masking (R) is given as
input to the second sub-layer of the decoder, the cross-attention
based MHA
• R acts as the Query
• Another input to the cross-attention based MHA is X̃_L, which
acts as both Key and Value
• Query matrix: R = (r_1, r_2, ..., r_t, ..., r_Td)
• Key matrix: X̃_L = (x̃_L1, x̃_L2, ..., x̃_Lj, ..., x̃_L,Ts)
• Value matrix: X̃_L = (x̃_L1, x̃_L2, ..., x̃_Lj, ..., x̃_L,Ts)
c_t = SDPA(r_t, X̃_L, X̃_L),  t = 1, 2, ..., Td
• Attention score: α_tm = r_t^T x̃_L,m / √d,  m = 1, 2, ..., Ts
• Attention weight: a_tm = softmax(α_tm)
• Context vector: c_t = Σ_{m=1}^{Ts} a_tm x̃_L,m

Decoder in Training Process:
Cross-Attention based MHA
• Output of the self-attention MHA with masking (R) is given as
input to the second sub-layer of the decoder, the cross-attention
based MHA
• R acts as the Query
• Another input to the cross-attention based MHA is X̃_L, which
acts as both Key and Value
• Query matrix: R = (r_1, r_2, ..., r_t, ..., r_Td)
• Key matrix: X̃_L = (x̃_L1, x̃_L2, ..., x̃_Lj, ..., x̃_L,Ts)
• Value matrix: X̃_L = (x̃_L1, x̃_L2, ..., x̃_Lj, ..., x̃_L,Ts)
c_t = SDPA(r_t, X̃_L, X̃_L),  t = 1, 2, ..., Td  (for each head i = 1, 2, ..., h)
• The context vector is transformed to generate the output vector:
z_t = W^(Od1) c_t
• Output of cross-attention MHA is a sequence: Z = (z_1, z_2, ..., z_t, ..., z_Td)
[Figure: Decoder sub-layers – self-attention MHA with masking on Y gives R;
cross-attention MHA (SDPA) takes R as query and X̃_L as key and value, and
outputs Z]

Decoder in Training Process:
Decoder Layer
• Output of the cross-attention MHA (Z) is given as input to a
position-wise FFNN (PWFFNN) to capture the nonlinear
relationship in the data
• Output of a decoder layer: Ỹ = (ỹ_1, ỹ_2, ..., ỹ_t, ..., ỹ_Td)
[Figure: A decoder layer – Y passes through the self-attention MHA with
masking to give R; the cross-attention MHA combines R with the encoder
output X̃_L to give Z; the PWFFNN then gives Ỹ]

Decoder in Training Process:
Decoder in Transformer
• Stack of L number of decoder layers
• All these decoder layers are going to be identical layers
• Ỹ_L is a sequence which is a better representation of the output
sequence Y
• We are not considering the total loss
– Look at the loss at time t and backpropagate
• The operations on the entities at different time instances are
kind of independent
• If we have GPUs, one can use them to perform the computation
related to each time instance in parallel
[Figure: Decoder – a stack of L decoder layers; Y → Decoder Layer 1 → Ỹ_1 →
Decoder Layer 2 → Ỹ_2 → ... → Decoder Layer L → Ỹ_L → Softmax Layer →
P(y_1), P(y_2), ..., P(y_Td); every decoder layer also receives the encoder
output X̃_L]
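A small NumPy sketch of the output side during training: a shared softmax layer
is applied to every position of Ỹ_L at once, and the loss at each time t is
computed independently, which is what makes the per-time-step computations easy
to parallelise on a GPU. The vocabulary size, output projection and target
indices are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(9)
Td, d, vocab = 4, 16, 10

Y_tilde_L = rng.normal(size=(Td, d))            # output of the decoder stack
W_out = rng.normal(size=(vocab, d))             # shared output projection (placeholder)
targets = rng.integers(0, vocab, size=Td)       # desired output tokens y_1 ... y_Td

logits = Y_tilde_L @ W_out.T                    # all Td positions processed at once
logits -= logits.max(axis=1, keepdims=True)
P = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # P(y_t) for every t

loss_per_t = -np.log(P[np.arange(Td), targets])  # loss at each time t, computed independently
```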


Decoder in Testing Process:


• Given: Input sequence
– The desired output sequence is not known
– Perform the operations in the encoder and decoder layers to
generate an output sequence
• Given: Test input sequence X = (x1, x2, … , xj , …, xTs)
• Generate X̃_L, the encoder output of the input sequence
– It goes as input to the decoder
• Testing is going to be a sequential operation
[Figure: A decoder layer at test time – self-attention MHA with masking,
cross-attention MHA with the encoder output X̃_L, PWFFNN, and a softmax
layer producing P(y_t)]
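A sketch of this sequential (autoregressive) decoding at test time. The
decoder_step function below is a stand-in for the full decoder stack plus
softmax (it just returns random probabilities here), and the start/end token
ids and maximum length are illustrative; the point is the loop structure, where
each newly generated element is appended to the decoder input before the next
step:

```python
import numpy as np

rng = np.random.default_rng(10)
vocab, START, END, max_len = 10, 0, 1, 20

def decoder_step(generated_tokens, X_L):
    """Stand-in for: embed the tokens generated so far, run the decoder stack
    with the encoder output X_L, and return P(next token)."""
    logits = rng.normal(size=vocab)             # placeholder for the real decoder output
    p = np.exp(logits - logits.max())
    return p / p.sum()

X_L = rng.normal(size=(5, 16))                  # encoder output for the test input sequence

generated = [START]
while len(generated) < max_len:
    p_next = decoder_step(generated, X_L)       # uses everything generated so far
    next_token = int(np.argmax(p_next))         # greedy choice of the next output element
    generated.append(next_token)
    if next_token == END:
        break
print(generated)
```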


Transformer Model [1]

N x: N number of times the encoder


and decoder operations

i.e., N is the number of encoder and


decoder layers

Residual
connection

Residual
connection

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I.


Polosukhin, “Attention is all you need,” 31st Conference on Neural Information Processing
Systems (NIPS 2017), Long Beach, CA, USA pp. 1-11, 2017. 49


Vision Transformer (ViT) for
Image Classification [2]

[2] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua
Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain
Gelly, Jakob Uszkoreit, and Neil Houlsby, "An Image is Worth 16x16 Words: Transformers for
Image Recognition at Scale," 9th International Conference on Learning Representations
(ICLR 2021), 2021.


Dense Video Captioning using CNNs
and Transformer [3]

[3] L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong, "End-to-End Dense Video Captioning
with Masked Transformer," Computer Vision and Pattern Recognition (CVPR), 2018.


Bidirectional Encoder Representation
from Transformer (BERT) [4]

[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, "BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding," NAACL, 2019.


Text Books
1. Aston Zhang, Zachary C. Lipton, Mu Li, Alexander J. Smola,
Dive into Deep Learning, 2021
2. Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep learning,
MIT Press, Available online: http://www.deeplearningbook.org,
2016
3. Charu C. Aggarwal, Neural Networks and Deep Learning, Springer,
2018
4. B. Yegnanarayana, Artificial Neural Networks, Prentice-Hall of India,
1999.
5. Satish Kumar, Neural Networks - A Class Room Approach, Second
Edition, Tata McGraw-Hill, 2013.
6. S. Haykin, Neural Networks and Learning Machines, Prentice Hall of
India, 2010.
7. C. M. Bishop, Pattern Recognition and Machine Learning, Springer,
2006.

