
Attention-based Models:
Transformers
Dr. Dileep A. D.

Associate Professor,
Multimedia Analytics Networks And Systems (MANAS) Lab,
School of Computing and Electrical Engineering (SCEE),
Indian Institute of Technology Mandi, Kamand, H.P.
Email: addileep@iitmandi.ac.in

Sequence-to-Sequence Mapping Tasks


• Neural Machine Translation: Translation of a sentence in the
source language to a sentence in the target language
– Input: A sequence of words
– Output: A sequence of words

• Speech Recognition (Speech-to-Text Conversion): Conversion of
the speech signal of a sentence utterance to the text of the
sentence
– Input: A sequence of feature vectors extracted from the
speech signal of a sentence
– Output: A sequence of words
• Video Captioning: Generation of a sentence as the caption
for a video represented as a sequence of frames
– Input: A sequence of feature vectors extracted from the
frames of a video
– Output: A sequence of words

Sequence-to-Sequence Mapping Tasks


• Each of the above tasks involves mapping an input
sequence to an output sequence
• So far we have seen the encoder-decoder paradigm using RNN
models for sequence-to-sequence mapping
• Training RNN-based sequence-to-sequence mapping
systems
– is computationally intensive, and
– offers little scope for parallelization of operations during
training
• Goal: Come up with a totally different approach for solving
sequence-to-sequence mapping tasks that
– avoids the recurrent structure in the encoder-decoder paradigm and
– avoids the huge training time required for training RNNs

Attention-based Deep Learning Models for
Sequence-to-Sequence Mapping
• Attention-based models implement sequence-to-sequence
mapping using only the attention-based techniques
– They don’t use any RNNs for that matter
• Attention-based models [1] try to capture and use
– Relations among elements in the input sequence (Self-
Attention)
– Relations among elements in the output sequence (Self-
Attention)
– Relations between elements in the input sequence and
elements in the output sequence (Cross-Attention)
• In literature, these attention-based models are called
transformers
– Perform several transformations to capture a better representation
while avoiding recurrences

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I.
Polosukhin, "Attention is all you need," 31st Conference on Neural Information Processing
Systems (NIPS 2017), Long Beach, CA, USA, pp. 1-11, 2017.


Attention-based Deep Learning Models for
Sequence-to-Sequence Mapping
• Given the sequence X = (x1, x2, … , xj , …, xTs)
– get a representation for X that captures the
relationships among the elements in the sequence X
– Basically, apply some kind of transformations to get a
representation that preserves the sequence while avoiding
recurrences
• Major advantages:
– Training times are smaller compared to RNNs as there are
no recurrences and hence no need to perform backpropagation
through time (BPTT)
– There will not be any vanishing and exploding gradient
problems
– Most importantly, this gives a lot of scope for parallelization
when you use GPUs for training

Seq2Seq Mapping: So Far …


• Attention mechanism used in typical encoder-decoder
framework for sequence-to-sequence mapping
• Uses an RNN-based encoder and decoder
• We look at the similarity of the state of the decoder (s_{D,t-1})
with respect to the state of the encoder (s_{E,j}) at different
instances of time
– We get attention scores as a measure of similarity
• Attention score: α_jt = f_ATT(s_{D,t-1}, s_{E,j})
• Attention weight: a_jt = softmax(α_jt)
• Context vector: c_t = Σ_{j=1}^{Ts} a_jt s_{E,j}
• The context vector goes as input to the decoder at every time t:
z_t = [c_t^T  y_t^T]^T
• State of the decoder at time t: s_{D,t} = tanh(U_D z_t + W_D s_{D,t-1} + b_D)
• Output of the model at time t: P(y_{t+1}) = f(V_D s_{D,t} + c_D)
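As a concrete illustration, here is a minimal NumPy sketch of one decoder time
step in this RNN-with-attention framework. It assumes a simple dot-product
f_ATT and uses random placeholder parameters for U_D, W_D, b_D, V_D and c_D; it
is a sketch of the computation above, not the lecture's code:

```python
import numpy as np

rng = np.random.default_rng(0)
d, Ts, vocab = 8, 5, 10                        # state size, input length, output vocabulary size

s_E = rng.normal(size=(Ts, d))                 # encoder states s_E,1 ... s_E,Ts
s_D_prev = rng.normal(size=d)                  # previous decoder state s_D,t-1
y_t = rng.normal(size=d)                       # representation of the current output y_t

# Attention score, weight and context vector (dot-product f_ATT assumed)
alpha = s_E @ s_D_prev                         # alpha_jt = f_ATT(s_D,t-1, s_E,j)
a = np.exp(alpha - alpha.max()); a /= a.sum()  # a_jt = softmax(alpha_jt)
c_t = a @ s_E                                  # c_t = sum_j a_jt s_E,j

# Decoder state update and output distribution (placeholder parameters)
U_D, W_D, b_D = rng.normal(size=(d, 2 * d)), rng.normal(size=(d, d)), np.zeros(d)
V_D, c_D = rng.normal(size=(vocab, d)), np.zeros(vocab)

z_t = np.concatenate([c_t, y_t])               # z_t = [c_t^T  y_t^T]^T
s_D = np.tanh(U_D @ z_t + W_D @ s_D_prev + b_D)
logits = V_D @ s_D + c_D
P_next = np.exp(logits - logits.max()); P_next /= P_next.sum()   # P(y_{t+1})
```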

Attention-based Deep Learning Models for
Sequence-to-Sequence Mapping
• Attention-based models [1] are another way of solving the
problem of sequence-to-sequence mapping
• Attention-based models implement sequence-to-sequence
mapping using only the attention-based techniques
– They don’t use any RNNs for that matter
• In literature, these attention-based deep learning models
are called transformers
– Perform a sequence of transformations to capture a better
representation that preserves the sequence while avoiding
recurrences
• Major advantages:
– Training times are smaller compared to RNNs as there are
no recurrences
– Most importantly, this gives a lot of scope for parallelization
when you use GPUs for training

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I.
Polosukhin, "Attention is all you need," 31st Conference on Neural Information Processing
Systems (NIPS 2017), Long Beach, CA, USA, pp. 1-11, 2017.

Terminologies used in Transformer Models


• Terminologies come from basic linear regression models
• Linear regression models:
– Given: Training data - D = {x_n, y_n}_{n=1}^N, x_n ∈ R^d
• x_n : nth training example
• y_n : desired output corresponding to x_n
– Prediction:
• Let x be the new data i.e., test data
• Predict the approximate estimate ŷ
– We consider the model that uses basis functions
• Define a basis function, g(x, x_n), that looks at how similar the new
sample x is to each of the training samples (x_n s)
– The function g(x, x_n) can be used as the weight associated with the nth
training example (x_n)
– The predicted output of the model is expressed as the weighted sum
of the y_n s:
ŷ = Σ_{n=1}^N g(x, x_n) y_n
» Here, g(x, x_n) indicates a measure of similarity or dissimilarity


Terminologies used in Transformer Models


• Terminologies come from basic linear regression
models
• Linear regression models:
– Given: Training data - D = {x_n, y_n}_{n=1}^N, x_n ∈ R^d
• xn : nth training example
• yn : desired output corresponding to xn
– Prediction:
• Let x be the new data i.e., test data
• Predict the approximate estimate ŷ of y=f(x)
– We consider the model that uses basis functions
– The predicted output of the model is expressed as the
weighted sum of the y_n s:
w_n = g(x, x_n),   ŷ = Σ_{n=1}^N w_n y_n
• Here, g(x, x_n) is a basis function indicating a measure of
similarity or dissimilarity

Terminologies used in Transformer Models


• Terminologies come from basic linear regression
models
• Linear regression models:
– Given: Training data - D = {x_n, y_n}_{n=1}^N, x_n ∈ R^d
– Prediction: ŷ = Σ_{n=1}^N g(x, x_n) y_n = Σ_{n=1}^N w_n y_n
– New terminology:
• Query (query vector): new input (test input) : x
• Key (key vector): each of the training examples : x_n, n = 1, ..., N
• Value (value vector): each desired output corresponding to x_n : y_n
– Basically, the prediction task is performed using 3 entities:
query, key and value
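A small NumPy sketch of this query/key/value view of basis-function prediction.
The Gaussian form of g(x, x_n) and the toy data are illustrative assumptions,
not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 20, 3

X_train = rng.normal(size=(N, d))     # keys: training examples x_n
y_train = rng.normal(size=N)          # values: desired outputs y_n
x_query = rng.normal(size=d)          # query: new (test) input x

def g(x, x_n, gamma=1.0):
    """Basis function: similarity of the query x to a training example x_n."""
    return np.exp(-gamma * np.sum((x - x_n) ** 2))

w = np.array([g(x_query, x_n) for x_n in X_train])   # weights w_n = g(x, x_n)
y_hat = np.sum(w * y_train)                          # y_hat = sum_n w_n y_n
print(y_hat)
```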


Terminologies used in Transformer Models


• Query, key and value in the attention mechanism in
encoder-decoder framework:
– Input sequence: X = (x1, x2, … , xj , …, xTs)
– Output sequence: Y = (y1, y2, …, yt , …, yTd)
– Attention mechanism in encoder-decoder framework:

c_t = attentionMechanism(s_{D,t-1}, s_{E,j})
s_{D,t-1}: state of the decoder at time t-1
s_{E,j} : state of the encoder at time j
• Attention score: α_jt = f_ATT(s_{D,t-1}, s_{E,j})
• Attention weight: a_jt = softmax(α_jt)
• Context vector: c_t = Σ_{j=1}^{Ts} a_jt s_{E,j}

• Query (query vector): s_{D,t-1}
• Key (key vector): s_{E,j}, j = 1, ..., Ts
• Value (value vector): s_{E,j}
– Here, s_{E,j} acts as both the key vector and the value vector

Terminologies used in Transformer Models


• Query, key and value in the attention mechanism in
encoder-decoder framework:
– Input sequence: X = (x1, x2, … , xj , …, xTs)
– Output sequence: Y = (y1, y2, …, yt , …, yTd)
– Attention mechanism in encoder-decoder framework:

c_t = attentionMechanism(s_{D,t-1}, s_{E,j})
s_{D,t-1}: state of the decoder at time t-1
s_{E,j} : state of the encoder at time j

• Query (query vector): s_{D,t-1}
• Key (key vector): s_{E,j}, j = 1, ..., Ts
• Value (value vector): s_{E,j}
– Here, s_{E,j} acts as both the key vector and the value vector

• However, query, key and value refer to completely
different entities in attention-based models
(transformers)


Transformers:
Scaled Dot-Product Attention (SDPA)
• Consider the sequence X = (x1, x2, … , xt , …, xT)
• Consider the following d-dimensional vectors:
– Query vector: q
– Key vectors: kt t=1,2,…,T
– Value vectors: vt t=1,2,…,T
• SDPA gives the weight based on the similarity measure
(i.e., dot product) computed between the query vector and
each of the key vectors
– The similarity measure is scaled to keep its magnitude
under control
• Attention score: scaled dot-product between q and k_t:
α_t = ⟨q, k_t⟩ / √d = q^T k_t / √d
• Attention weight: a_t = softmax(α_t) = exp(α_t) / Σ_{t'=1}^{T} exp(α_{t'})

Transformers:
Scaled Dot-Product Attention (SDPA)
• Consider the sequence X = (x1, x2, … , xt , …, xT)
• Consider the following d-dimensional vectors:
– Query vector: q
– Key vectors: kt
– Value vectors: vt
• Attention score: scaled dot-product between q and k_t:
α_t = q^T k_t / √d
• Attention weight: a_t = softmax(α_t) = exp(α_t) / Σ_{t'=1}^{T} exp(α_{t'})
• Context vector associated with the Query vector q: c = Σ_{t=1}^{T} a_t v_t
• The context vector captures the relation of the Query vector (q)
with the Key vectors (k_t s) and is obtained as a weighted
combination of the corresponding Value vectors (v_t s)
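A minimal NumPy sketch of scaled dot-product attention for a single query
vector, following the formulas above (the shapes and the random test data are
only illustrative):

```python
import numpy as np

def sdpa(q, K, V):
    """Scaled dot-product attention for one query.
    q: (d,) query vector; K: (T, d) key vectors; V: (T, d_v) value vectors.
    Returns the context vector c = sum_t a_t v_t."""
    d = q.shape[0]
    alpha = K @ q / np.sqrt(d)                 # attention scores alpha_t
    a = np.exp(alpha - alpha.max())
    a /= a.sum()                               # attention weights a_t (softmax)
    return a @ V                               # context vector c

rng = np.random.default_rng(2)
T, d = 6, 8
q, K, V = rng.normal(size=d), rng.normal(size=(T, d)), rng.normal(size=(T, d))
c = sdpa(q, K, V)                              # (d,) context vector for query q
```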

Transformers for
Sequence-to-Sequence Mapping
• Given the input sequence X, map it onto output
sequence Y
• Input sequence: X = (x1, x2, … , xj , …, xTs)
– Each element in the sequence is a d-dimensional vector x_j
– Ts is the length of the input sequence
• Output sequence: Y = (y1, y2, …, yt , …, yTd)
– Each element in the sequence is a d-dimensional vector y_t
– Td is the length of the output sequence
• Transformer also has an encoder-decoder framework:
– The input sequence X is passed through the encoder to get
some representation Z
– This Z goes as input to the decoder, which maps it onto
the output sequence Y


Transformers for
Sequence-to-Sequence Mapping
• Transformers do not use RNNs in the encoder and
decoder
• Instead, a sequence of transformations is performed
on X to get Z
• This representation Z goes as input to the decoder,
which also performs a sequence of transformations
• Transformers try to capture and use
– Relations among elements in the input sequence (Self-
Attention)
– Relations among elements in the output sequence (Self-
Attention)
– Relations between elements in the input sequence and
elements in the output sequence (Cross-Attention)


Self-Attention
• Self-attention captures the relations among the
elements of a sequence


Self-Attention on Input Sequence, X


• Query vectors (qj), Key vectors (kj) and Value vectors (vj)
are generated using the elements in the input sequence X
• Here, qj = kj = vj = xj
• However, we need different representations for qj , kj and vj
• These different representations are obtained using some kind
of transformation
• Let us consider 3 different weight matrices W^(Q), W^(K) and
W^(V) corresponding to Query, Key and Value
– They act as transformation matrices
• Now each of the query, key and value vectors is obtained
as:
– Query vector: q_j = W^(Q) x_j
– Key vector: k_j = W^(K) x_j
– Value vector: v_j = W^(V) x_j
• These transformation matrices are learnt as a part of the
training process


Self-Attention on Input Sequence, X
and Single-Head Attention
• Let us define,
– Query matrix: Q̃ = (q_1, q_2, ..., q_j, ..., q_Ts)
– Key matrix: K̃ = (k_1, k_2, ..., k_j, ..., k_Ts)
– Value matrix: Ṽ = (v_1, v_2, ..., v_j, ..., v_Ts)
• A context vector (c_j) is obtained by applying scaled dot-product
attention (SDPA) on q_j w.r.t. the entities in K̃ and Ṽ:
c_j = SDPA(q_j, K̃, Ṽ),  j = 1, 2, ..., Ts
[Figure: SDPA block taking q_j, the keys k_1 ... k_Ts and the values
v_1 ... v_Ts, all obtained from X = (x_1, ..., x_j, ..., x_Ts) through
W^(Q), W^(K) and W^(V)]
• Then we get the C̃ matrix from the c_j obtained for each of the query
vectors: C̃ = (c_1, c_2, ..., c_j, ..., c_Ts)
• C̃ is the encoded representation of X
• Single-Head Attention: Generation of the context vector matrix
from the sequence X using one set of transformation
matrices

Self-Attention on Input Sequence, X
and Multi-Head Attention
• There is no restriction that only one set of transformation
matrices (W^(Q), W^(K) and W^(V)) should be used
• We can use multiple sets of transformation matrices for
query, key and value
– It is like CNN filters
• Multiple transformation matrices:
W_i^(Q), W_i^(K), W_i^(V),  i = 1, 2, …, h
– Here, h is the number of heads
• Associated with each head there is an attention process to
get C̃_i
• Multi-Head Attention: Generation of the context vector matrix
from the sequence X using multiple sets of transformation
matrices

Transformers for
Sequence-to-Sequence Mapping
• Given the input sequence X, map it onto output
sequence Y
• Input sequence: X = (x1, x2, … , xj , …, xTs)
– Each element in the sequence is a d-dimensional vector x_j
– Ts is the length of the input sequence
• Output sequence: Y = (y1, y2, …, yt , …, yTd)
– Each element in the sequence is a d-dimensional vector y_t
– Td is the length of the output sequence
• Transformer also has an encoder-decoder framework:
– The input sequence X is passed through the encoder to get
some representation X̃_L
– This X̃_L goes as input to the decoder, which maps it onto
the output sequence Y


Encoder: Self-Attention
• Self-attention captures the relations among the
elements of a sequence
– It gives the importance of each element in a sequence
with respect to every other element in that sequence
• A set of transformation matrices of size l×d (each mapping a
d-dimensional x_j to an l-dimensional vector), W^(Q), W^(K) and
W^(V), corresponding to Query, Key and Value is considered
– They act as parametric matrices
• Now each of the query, key and value vectors is obtained
as:
– Query vector: q_j = W^(Q) x_j
– Key vector: k_j = W^(K) x_j        for j = 1, 2, ..., Ts
– Value vector: v_j = W^(V) x_j
• Dimension of Query, Key and Value vectors: l (Note: l < d)
• These transformation matrices are learnt as a part of the
training process

Encoder: Self-Attention based
Single-Head Attention
• Let us define,
– Query matrix: Q̃ = (q_1, q_2, ..., q_j, ..., q_Ts)
– Key matrix: K̃ = (k_1, k_2, ..., k_j, ..., k_Ts)
– Value matrix: Ṽ = (v_1, v_2, ..., v_j, ..., v_Ts)
• A context vector (c_j) is obtained by applying scaled dot-product
attention (SDPA) on q_j w.r.t. the entities in K̃ and Ṽ:
c_j = SDPA(q_j, K̃, Ṽ),  j = 1, 2, ..., Ts
[Figure: SDPA block taking q_j, the keys k_1 ... k_Ts and the values
v_1 ... v_Ts, all obtained from X = (x_1, ..., x_j, ..., x_Ts) through
W^(Q), W^(K) and W^(V)]
• Attention score: α_jm = q_j^T k_m / √d,  m = 1, 2, ..., Ts
• Attention weight: a_jm = softmax(α_jm)
• Context vector: c_j = Σ_{m=1}^{Ts} a_jm v_m

Encoder: Self-Attention based
Single-Head Attention
• Let us define,
– Query matrix: Q̃ = (q_1, q_2, ..., q_j, ..., q_Ts)
– Key matrix: K̃ = (k_1, k_2, ..., k_j, ..., k_Ts)
– Value matrix: Ṽ = (v_1, v_2, ..., v_j, ..., v_Ts)
• A context vector (c_j) is obtained by applying scaled dot-product
attention (SDPA) on q_j w.r.t. the entities in K̃ and Ṽ:
c_j = SDPA(q_j, K̃, Ṽ),  j = 1, 2, ..., Ts
[Figure: SDPA block taking q_j, the keys k_1 ... k_Ts and the values
v_1 ... v_Ts, all obtained from X = (x_1, ..., x_j, ..., x_Ts) through
W^(Q), W^(K) and W^(V)]
• Then we get the C̃ matrix from the c_j obtained for each of the query
vectors: C̃ = (c_1, c_2, ..., c_j, ..., c_Ts)
• C̃ is the encoded representation of X
• Single-Head Attention: Generation of the context vector matrix
from the sequence X using one set of transformation
matrices

Encoder: Self-Attention based
Multi-Head Attention
• There is no restriction that only one set of transformation
matrices (W^(Q), W^(K) and W^(V)) should be used
• We can use multiple sets of transformation matrices for
query, key and value
– It is like CNN filters
• Multiple transformation matrices:
W_i^(Q), W_i^(K), W_i^(V),  i = 1, 2, …, h
– Here, h is the number of heads
• Associated with each head there is an attention process to
get C̃_i
• Multi-Head Attention: Generation of the context vector matrix
from the sequence X using multiple sets of transformation
matrices


Encoder: Self-Attention based
Multi-Head Attention
• Multi-Head Attention (MHA): Multiple sets of transformation
matrices are used to generate multiple context vectors
• Transformation matrices associated with the ith head: W_i^(Q), W_i^(K), W_i^(V)
• Query, Key and Value vectors generated using the ith head:
– Query vector: q_ij = W_i^(Q) x_j
– Key vector: k_ij = W_i^(K) x_j
– Value vector: v_ij = W_i^(V) x_j
• Context vector generated using the ith head:
c_ij = SDPA(q_ij, K̃_i, Ṽ_i),  j = 1, 2, ..., Ts,  i = 1, 2, ..., h
– Here, h is the number of heads
• Dimension of Query, Key and Value vectors: l = d / h
• Context vector in MHA: c_j = concat(c_1j, c_2j, ..., c_ij, ..., c_hj),  j = 1, 2, ..., Ts
• Dimension of the context vector in MHA: d
• The context vector is transformed using the d×d matrix W^(O) to
generate the output vector: z_j = W^(O) c_j
• Output of MHA is a sequence: Z = (z_1, z_2, ..., z_j, ..., z_Ts)
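A NumPy sketch of this multi-head self-attention: one SDPA per head,
concatenation of the per-head context vectors, and the final projection with
W^(O). All matrices below are random placeholders for learnt parameters, and h
divides d so that l = d/h:

```python
import numpy as np

def softmax_rows(S):
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O):
    """X: (Ts, d). W_Q, W_K, W_V: (h, l, d) per-head projections. W_O: (d, d).
    Returns Z: (Ts, d), the MHA output sequence z_1 ... z_Ts."""
    Ts, d = X.shape
    h = W_Q.shape[0]
    heads = []
    for i in range(h):                               # one SDPA per head
        Q, K, V = X @ W_Q[i].T, X @ W_K[i].T, X @ W_V[i].T
        A = softmax_rows(Q @ K.T / np.sqrt(d))       # attention weights of head i
        heads.append(A @ V)                          # head-i context vectors, shape (Ts, l)
    C = np.concatenate(heads, axis=1)                # c_j = concat(c_1j, ..., c_hj), shape (Ts, d)
    return C @ W_O.T                                 # z_j = W_O c_j

rng = np.random.default_rng(4)
Ts, d, h = 5, 16, 4
l = d // h
X = rng.normal(size=(Ts, d))
W_Q, W_K, W_V = (rng.normal(size=(h, l, d)) for _ in range(3))
W_O = rng.normal(size=(d, d))
Z = multi_head_self_attention(X, W_Q, W_K, W_V, W_O)   # (Ts, d)
```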

Encoder: Self-Attention based
Multi-Head Attention
• Self-attention MHA is a sub-layer in the encoder layer of
the Transformer model
[Figure: Multi-head attention block – h scaled dot-product attention heads,
each with its own W_i^(Q), W_i^(K), W_i^(V) acting on X = (x_1, ..., x_j, ..., x_Ts);
the head outputs c_1j, c_2j, ..., c_hj are concatenated into c_j and projected
by W^(O) to give z_j]


Encoder: Self-Attention based
Multi-Head Attention
• Self-attention MHA is a sub-layer in the encoder layer of
the Transformer model
• Output of MHA is a sequence: Z = (z_1, z_2, ..., z_j, ..., z_Ts)
[Figure: Self-attention MHA applied at every position of
X = (x_1, ..., x_j, ..., x_Ts), producing z_1, z_2, ..., z_Ts]


Encoder: Self-Attention MHA and Position-wise
Feedforward Neural Networks (PWFFNN)
• Captures the nonlinear relationships in the data
• One feedforward neural network (FFNN) is used for every position j
(every time step) in the sequence
– Input and output layers are linear and the hidden layer is nonlinear
• There are Ts FFNNs in the PWFFNN
• The weight matrices (W_FF^(h), W_FF^(o)) are shared across the FFNNs in the
PWFFNN at every position j
[Figure: A FFNN (hidden layer W_FF^(h), output layer W_FF^(o)) applied to each
MHA output z_1, z_2, ..., z_Ts, giving x̃_1, x̃_2, ..., x̃_Ts; the self-attention
MHA below takes X = (x_1, ..., x_j, ..., x_Ts) as input]
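A NumPy sketch of the position-wise FFNN: one small network with a nonlinear
hidden layer and a linear output layer, applied with the same shared weights to
every position z_j. The hidden width d_ff and the ReLU nonlinearity are common
choices assumed here; the slide only states that the hidden layer is nonlinear:

```python
import numpy as np

def position_wise_ffnn(Z, W_h, b_h, W_o, b_o):
    """Z: (Ts, d) MHA outputs z_1 ... z_Ts.
    The same weights are shared across all positions j.
    Returns (Ts, d): one output vector per position."""
    H = np.maximum(0.0, Z @ W_h.T + b_h)      # nonlinear hidden layer (ReLU assumed)
    return H @ W_o.T + b_o                    # linear output layer

rng = np.random.default_rng(5)
Ts, d, d_ff = 5, 16, 32                       # d_ff: hidden-layer width (illustrative)
Z = rng.normal(size=(Ts, d))
W_h, b_h = rng.normal(size=(d_ff, d)), np.zeros(d_ff)
W_o, b_o = rng.normal(size=(d, d_ff)), np.zeros(d)
X_tilde = position_wise_ffnn(Z, W_h, b_h, W_o, b_o)   # (Ts, d)
```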


Encoder: Self-Attention MHA and Position-wise
Feedforward Neural Networks (PWFFNN)
• Self-attention MHA and PWFFNN form an encoder layer
• Output of an encoder layer: X̃ = (x̃_1, x̃_2, ..., x̃_j, ..., x̃_Ts)
[Figure: The encoder layer as a stack – PWFFNNs on top of the self-attention
MHAs, mapping X = (x_1, ..., x_j, ..., x_Ts) to X̃ = (x̃_1, x̃_2, ..., x̃_Ts)]


Encoder Layer in Transformer


• Self-attention MHA and PWFFNN form an encoder layer
• Output of an encoder layer: X̃ = (x̃_1, x̃_2, ..., x̃_j, ..., x̃_Ts)
• The contents of Z are nonlinearly related to the contents of X
• Thus, the contents of X̃ are nonlinearly related to the
contents of X
[Figure: An encoder layer – X passes through the self-attention MHA to give
Z, which passes through the PWFFNN to give X̃]

Encoder in Transformer

• Stack of L number of encoder layers
• All these encoder layers are going to be identical layers
– Each layer shares the weights
• X̃_L is a sequence which is a better representation of the input
sequence X
– It is generated by capturing the similarities among the elements
of X
• X̃_L goes as input to the decoder
[Figure: Encoder – a stack of L encoder layers; X → Encoder Layer 1 → X̃_1 →
Encoder Layer 2 → X̃_2 → ... → Encoder Layer L → X̃_L]
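A compact NumPy sketch of an encoder layer (self-attention MHA followed by
PWFFNN) and a stack of L such layers, as described above. It is a bare-bones
illustration with random placeholder parameters; residual connections and the
other components of the full Transformer layer are omitted:

```python
import numpy as np

rng = np.random.default_rng(6)

def softmax_rows(S):
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def make_layer_params(d, h, d_ff):
    l = d // h
    return {"W_Q": rng.normal(size=(h, l, d)), "W_K": rng.normal(size=(h, l, d)),
            "W_V": rng.normal(size=(h, l, d)), "W_O": rng.normal(size=(d, d)),
            "W_h": rng.normal(size=(d_ff, d)), "W_o": rng.normal(size=(d, d_ff))}

def encoder_layer(X, p):
    """One encoder layer: self-attention MHA sub-layer, then PWFFNN sub-layer."""
    d, h = X.shape[1], p["W_Q"].shape[0]
    heads = []
    for i in range(h):
        Q, K, V = X @ p["W_Q"][i].T, X @ p["W_K"][i].T, X @ p["W_V"][i].T
        heads.append(softmax_rows(Q @ K.T / np.sqrt(d)) @ V)
    Z = np.concatenate(heads, axis=1) @ p["W_O"].T           # MHA output
    return np.maximum(0.0, Z @ p["W_h"].T) @ p["W_o"].T      # PWFFNN output X_tilde

def encoder(X, layer_params):
    """Stack of L encoder layers: X -> X_tilde_1 -> ... -> X_tilde_L."""
    for p in layer_params:
        X = encoder_layer(X, p)
    return X

Ts, d, h, d_ff, L = 5, 16, 4, 32, 3
X = rng.normal(size=(Ts, d))
p = make_layer_params(d, h, d_ff)
X_tilde_L = encoder(X, [p] * L)    # same parameters reused across layers, as on the slide;
                                   # the original Transformer gives each layer its own parameters
```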


Decoder in Transformer
• Decoder:
– Given the input sequence, the decoder should generate a
sequence that is close to the target output sequence
– Example:
• Given a sentence in the source language, generate a
sentence that is close to the target-language sentence
• Training process in decoder:
– For every example of an input sequence, the corresponding
target output sequence is given
– Input sequence: X = (x1, x2, … , xj , …, xTs)
• Each element in the sequence is a d-dimensional vector x_j
• Ts is the length of the input sequence
– Target output sequence: Y = (y1, y2, …, yt , …, yTd)
• Each element in the sequence is a d-dimensional vector y_t
• Td is the length of the output sequence

Decoder in Training Process


• Big picture:
– Given: Input sequence and desired output sequence
– Perform the operations in the decoder layer to generate an
output sequence
– Compare generated sequence with desired output
sequence
– Compute the error and then backpropagate the error to
update the parameter sets
• Training process in sequence-to-sequence learning:


Decoder in Training Process


• Big picture:
– Given: Input sequence and desired output sequence
– Perform the operations in the decoder layer to generate an
output sequence
– Compare generated sequence with desired output
sequence
– Compute the error and then backpropagate the error to
update the parameter sets
• Structure of decoder layer:
– Two sub-layers
1. Self-attention based multi-head attention (MHA) with
masking
2. Cross-attention based MHA


Decoder in Training Process:
Self-Attention
• Self-attention in the decoder captures the relations among
the elements of an output sequence
– It gives the importance of each element in a sequence
with respect to every other element in that sequence
• A set of transformation matrices of size l×d (each mapping a
d-dimensional y_t to an l-dimensional vector), W^(Qd), W^(Kd) and
W^(Vd), corresponding to Query, Key and Value is considered
– They act as parametric matrices
• Now each of the query, key and value vectors is obtained
as:
– Query vector: q_t = W^(Qd) y_t
– Key vector: k_t = W^(Kd) y_t        for t = 1, 2, ..., Td
– Value vector: v_t = W^(Vd) y_t
• Dimension of Query, Key and Value vectors: l
• These transformation matrices are learnt as a part of the
training process

Decoder in Training Process:
Self-Attention based Single-Head Attention with Masking
• Let us define,
– Query matrix: Q̃ = (q_1, q_2, ..., q_t, ..., q_Td)
– Key matrix: K̃ = (k_1, k_2, ..., k_t, ..., k_Td)
– Value matrix: Ṽ = (v_1, v_2, ..., v_t, ..., v_Td)
• A context vector (c_t) is obtained by applying scaled dot-product
attention (SDPA) with masking on q_t w.r.t. the entities in K̃ and Ṽ:
c_t = SDPA_Mask(q_t, K̃, Ṽ),  t = 1, 2, ..., Td
[Figure: SDPA-with-masking block taking q_t, the keys k_1 ... k_Td and the
values v_1 ... v_Td, all obtained from Y = (y_1, ..., y_t, ..., y_Td) through
W^(Qd), W^(Kd) and W^(Vd)]
• Attention score: α_tm = q_t^T k_m / √d,  m = 1, 2, ..., Td
• Attention weight: a_tm = softmax(α_tm)
• Masking: a_tm = 0 for m ≥ t + 1
• Context vector: c_t = Σ_{m=1}^{Td} a_tm v_m


Decoder in Training Process:
Self-Attention based Single-Head Attention with Masking
• Let us define,
– Query matrix: Q̃ = (q_1, q_2, ..., q_t, ..., q_Td)
– Key matrix: K̃ = (k_1, k_2, ..., k_t, ..., k_Td)
– Value matrix: Ṽ = (v_1, v_2, ..., v_t, ..., v_Td)
• A context vector (c_t) is obtained by applying scaled dot-product
attention (SDPA) with masking on q_t w.r.t. the entities in K̃ and Ṽ:
c_t = SDPA_Mask(q_t, K̃, Ṽ),  t = 1, 2, ..., Td
[Figure: SDPA-with-masking block taking q_t, the keys k_1 ... k_Td and the
values v_1 ... v_Td, all obtained from Y = (y_1, ..., y_t, ..., y_Td) through
W^(Qd), W^(Kd) and W^(Vd)]
• Then we get the C̃ matrix from the c_t obtained for each of the query
vectors: C̃ = (c_1, c_2, ..., c_t, ..., c_Td)
• C̃ is the encoded representation of Y
• Single-Head Attention with masking: Generation of the context vector
matrix from the sequence Y using one set of transformation
matrices

Decoder in Training Process:
Self-Attention based Multi-Head Attention with Masking
• Multi-Head Attention (MHA): Multiple sets of transformation
matrices are used to generate multiple context vectors
• Transformation matrices associated with the ith head: W_i^(Q), W_i^(K), W_i^(V)
• Query, Key and Value vectors generated using the ith head:
– Query vector: q_it = W_i^(Q) y_t
– Key vector: k_it = W_i^(K) y_t
– Value vector: v_it = W_i^(V) y_t
• Context vector generated using the ith head:
c_it = SDPA_Mask(q_it, K̃_i, Ṽ_i),  t = 1, 2, ..., Td,  i = 1, 2, ..., h
– Here, h is the number of heads
• Dimension of Query, Key and Value vectors: l = d / h
• Context vector in MHA: c_t = concat(c_1t, c_2t, ..., c_it, ..., c_ht),  t = 1, 2, ..., Td
• Dimension of the context vector in MHA: d
• The context vector is transformed using the d×d matrix W^(Od) to
generate the output vector: r_t = W^(Od) c_t
• Output of MHA is a sequence: R = (r_1, r_2, ..., r_t, ..., r_Td)

Decoder in Training Process:
Self-Attention based Multi-Head Attention with Masking
• Multi-Head Attention (MHA): Multiple sets of transformation
matrices are used to generate multiple context vectors
• Self-attention MHA is the first sub-layer in the decoder layer of
the Transformer model
[Figure: Multi-head attention with masking – h SDPA-with-masking heads,
each with its own W_i^(Q), W_i^(K), W_i^(V) acting on Y = (y_1, ..., y_t, ..., y_Td);
the head outputs c_1t, c_2t, ..., c_ht are concatenated into c_t and projected
by W^(O) to give r_t]


Decoder in Training Process:
Self-Attention based MHA with Masking
• Self-attention MHA is the first sub-layer in the decoder
layer of the Transformer model
• Output of self-attention MHA with masking is a
sequence: R = (r_1, r_2, ..., r_t, ..., r_Td)
[Figure: Self-attention MHA with masking applied at every position of
Y = (y_1, ..., y_t, ..., y_Td), producing r_1, r_2, ..., r_Td]



Decoder in Training Process:
Cross-Attention based MHA
• Output of the self-attention MHA with masking (R) is given as
input to the second sub-layer of the decoder, the cross-attention
based MHA
• R acts as the Query
• Another input to the cross-attention based MHA is X̃_L, which
acts as both Key and Value
• Query matrix: R = (r_1, r_2, ..., r_t, ..., r_Td)
• Key matrix: X̃_L = (x̃_L1, x̃_L2, ..., x̃_Lj, ..., x̃_L,Ts)
• Value matrix: X̃_L = (x̃_L1, x̃_L2, ..., x̃_Lj, ..., x̃_L,Ts)
[Figure: Decoder sub-layers – self-attention MHA with masking on Y gives R;
cross-attention MHA (SDPA) takes R as query and X̃_L as key and value, and
outputs Z]

Decoder in Training Process:
Cross-Attention based MHA
• Output of the self-attention MHA with masking (R) is given as
input to the second sub-layer of the decoder, the cross-attention
based MHA
• R acts as the Query
• Another input to the cross-attention based MHA is X̃_L, which
acts as both Key and Value
• Query matrix: R = (r_1, r_2, ..., r_t, ..., r_Td)
• Key matrix: X̃_L = (x̃_L1, x̃_L2, ..., x̃_Lj, ..., x̃_L,Ts)
• Value matrix: X̃_L = (x̃_L1, x̃_L2, ..., x̃_Lj, ..., x̃_L,Ts)
c_t = SDPA(r_t, X̃_L, X̃_L),  t = 1, 2, ..., Td
• Attention score: α_tm = r_t^T x̃_L,m / √d,  m = 1, 2, ..., Ts
• Attention weight: a_tm = softmax(α_tm)
• Context vector: c_t = Σ_{m=1}^{Ts} a_tm x̃_L,m

Decoder in Training Process:
Cross-Attention based MHA
• Output of the self-attention MHA with masking (R) is given as
input to the second sub-layer of the decoder, the cross-attention
based MHA
• R acts as the Query
• Another input to the cross-attention based MHA is X̃_L, which
acts as both Key and Value
• Query matrix: R = (r_1, r_2, ..., r_t, ..., r_Td)
• Key matrix: X̃_L = (x̃_L1, x̃_L2, ..., x̃_Lj, ..., x̃_L,Ts)
• Value matrix: X̃_L = (x̃_L1, x̃_L2, ..., x̃_Lj, ..., x̃_L,Ts)
c_t = SDPA(r_t, X̃_L, X̃_L),  t = 1, 2, ..., Td  (for each head i = 1, 2, ..., h)
• The context vector is transformed to generate the output vector:
z_t = W^(Od1) c_t
• Output of cross-attention MHA is a sequence: Z = (z_1, z_2, ..., z_t, ..., z_Td)
[Figure: Decoder sub-layers – self-attention MHA with masking on Y gives R;
cross-attention MHA (SDPA) takes R as query and X̃_L as key and value, and
outputs Z]

Decoder in Training Process:
Decoder Layer
• Output of the cross-attention MHA (Z) is given as input to a
position-wise FFNN (PWFFNN) to capture the nonlinear
relationship in the data
• Output of a decoder layer: Ỹ = (ỹ_1, ỹ_2, ..., ỹ_t, ..., ỹ_Td)
[Figure: A decoder layer – Y passes through the self-attention MHA with
masking to give R; the cross-attention MHA combines R with the encoder
output X̃_L to give Z; the PWFFNN then gives Ỹ]

Decoder in Training Process:
Decoder in Transformer
• Stack of L number of decoder layers
• All these decoder layers are going to be identical layers
• Ỹ_L is a sequence which is a better representation of the output
sequence Y
• We are not considering the total loss
– Look at the loss at time t and backpropagate
• The operations on the entities at different time instances are
kind of independent
• If we have GPUs, one can use them to perform the computation
related to each time instance in parallel
[Figure: Decoder – a stack of L decoder layers; Y → Decoder Layer 1 → Ỹ_1 →
Decoder Layer 2 → Ỹ_2 → ... → Decoder Layer L → Ỹ_L → Softmax Layer →
P(y_1), P(y_2), ..., P(y_Td); every decoder layer also receives the encoder
output X̃_L]
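A small NumPy sketch of the output side during training: a shared softmax layer
is applied to every position of Ỹ_L at once, and the loss at each time t is
computed independently, which is what makes the per-time-step computations easy
to parallelise on a GPU. The vocabulary size, output projection and target
indices are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(9)
Td, d, vocab = 4, 16, 10

Y_tilde_L = rng.normal(size=(Td, d))            # output of the decoder stack
W_out = rng.normal(size=(vocab, d))             # shared output projection (placeholder)
targets = rng.integers(0, vocab, size=Td)       # desired output tokens y_1 ... y_Td

logits = Y_tilde_L @ W_out.T                    # all Td positions processed at once
logits -= logits.max(axis=1, keepdims=True)
P = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # P(y_t) for every t

loss_per_t = -np.log(P[np.arange(Td), targets])  # loss at each time t, computed independently
```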


Decoder in Testing Process:


• Given: Input sequence
– The desired output sequence is not known
– Perform the operations in the encoder and decoder layers to
generate an output sequence
• Given: Test input sequence X = (x1, x2, … , xj , …, xTs)
• Generate X̃_L, the encoder output of the input sequence
– It goes as input to the decoder
• Testing is going to be a sequential operation
[Figure: A decoder layer at test time – self-attention MHA with masking,
cross-attention MHA with the encoder output X̃_L, PWFFNN, and a softmax
layer producing P(y_t)]
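A sketch of this sequential (autoregressive) decoding at test time. The
decoder_step function below is a stand-in for the full decoder stack plus
softmax (it just returns random probabilities here), and the start/end token
ids and maximum length are illustrative; the point is the loop structure, where
each newly generated element is appended to the decoder input before the next
step:

```python
import numpy as np

rng = np.random.default_rng(10)
vocab, START, END, max_len = 10, 0, 1, 20

def decoder_step(generated_tokens, X_L):
    """Stand-in for: embed the tokens generated so far, run the decoder stack
    with the encoder output X_L, and return P(next token)."""
    logits = rng.normal(size=vocab)             # placeholder for the real decoder output
    p = np.exp(logits - logits.max())
    return p / p.sum()

X_L = rng.normal(size=(5, 16))                  # encoder output for the test input sequence

generated = [START]
while len(generated) < max_len:
    p_next = decoder_step(generated, X_L)       # uses everything generated so far
    next_token = int(np.argmax(p_next))         # greedy choice of the next output element
    generated.append(next_token)
    if next_token == END:
        break
print(generated)
```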


Transformer Model [1]

N x: N number of times the encoder


and decoder operations

i.e., N is the number of encoder and


decoder layers

Residual
connection

Residual
connection

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I.


Polosukhin, “Attention is all you need,” 31st Conference on Neural Information Processing
Systems (NIPS 2017), Long Beach, CA, USA pp. 1-11, 2017. 49


Vision Transformer (ViT) for
Image Classification [2]

[2] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua
Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain
Gelly, Jakob Uszkoreit, and Neil Houlsby, "An Image is Worth 16x16 Words: Transformers for
Image Recognition at Scale," 9th International Conference on Learning Representations
(ICLR 2021), 2021.


Dense Video Captioning using CNNs
and Transformer [3]

[3] L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong, "End-to-End Dense Video Captioning
with Masked Transformer," Computer Vision and Pattern Recognition (CVPR), 2018.


Bidirectional Encoder Representation
from Transformer (BERT) [4]

[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, "BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding," NAACL, 2019.


Text Books
1. Aston Zhang, Zachary C. Lipton, Mu Li, Alexander J. Smola,
Dive into Deep Learning, 2021
2. Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep learning,
MIT Press, Available online: http://www.deeplearningbook.org,
2016
3. Charu C. Aggarwal, Neural Networks and Deep Learning, Springer,
2018
4. B. Yegnanarayana, Artificial Neural Networks, Prentice-Hall of India,
1999.
5. Satish Kumar, Neural Networks - A Class Room Approach, Second
Edition, Tata McGraw-Hill, 2013.
6. S. Haykin, Neural Networks and Learning Machines, Prentice Hall of
India, 2010.
7. C. M. Bishop, Pattern Recognition and Machine Learning, Springer,
2006.

