Navid Rekab-Saz
navid.rekabsaz@jku.at
Institute of
Computational
Perception
Agenda
§ $a$ → scalar
§ $\boldsymbol{b}$ → vector
- the $i^{\text{th}}$ element of $\boldsymbol{b}$ is the scalar $b_i$
§ $\boldsymbol{C}$ → matrix
- the $i^{\text{th}}$ vector of $\boldsymbol{C}$ is $\boldsymbol{c}_i$
- the $j^{\text{th}}$ element of the $i^{\text{th}}$ vector of $\boldsymbol{C}$ is the scalar $c_{i,j}$
Linear Algebra – Dot product
§ $\boldsymbol{a} \cdot \boldsymbol{b}^\top = c$
- dimensions: $(1 \times d) \cdot (d \times 1) = 1$
- example: $\begin{pmatrix} 1 & 2 & 3 \end{pmatrix} \cdot \begin{pmatrix} 2 \\ 0 \\ 1 \end{pmatrix} = 5$
§ $\boldsymbol{a} \cdot \boldsymbol{B} = \boldsymbol{c}$
- dimensions: $(1 \times d) \cdot (d \times e) = 1 \times e$
- example: $\begin{pmatrix} 1 & 2 & 3 \end{pmatrix} \cdot \begin{pmatrix} 2 & 3 \\ 0 & 1 \\ 1 & -1 \end{pmatrix} = \begin{pmatrix} 5 & 2 \end{pmatrix}$
§ $\boldsymbol{A} \cdot \boldsymbol{B} = \boldsymbol{C}$
- dimensions: $(l \times m) \cdot (m \times n) = l \times n$
- example: $\begin{pmatrix} 1 & 2 & 3 \\ 1 & 0 & 1 \\ 0 & 0 & 5 \\ 4 & 1 & 0 \end{pmatrix} \cdot \begin{pmatrix} 2 & 3 \\ 0 & 1 \\ 1 & -1 \end{pmatrix} = \begin{pmatrix} 5 & 2 \\ 3 & 2 \\ 5 & -5 \\ 8 & 13 \end{pmatrix}$
§ Linear transformation: the dot product of a vector with a matrix
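A quick NumPy check of the worked examples above (the array names are mine, not from the slides):

import numpy as np

a = np.array([1, 2, 3])                                      # 1 x d
b = np.array([2, 0, 1])                                      # d x 1 (as a flat vector)
B = np.array([[2, 3], [0, 1], [1, -1]])                      # d x e
A = np.array([[1, 2, 3], [1, 0, 1], [0, 0, 5], [4, 1, 0]])   # l x m

print(a @ b)   # 5                                  (1 x d) . (d x 1) = scalar
print(a @ B)   # [5 2]                              (1 x d) . (d x e) = 1 x e
print(A @ B)   # [[5 2] [3 2] [5 -5] [8 13]]        (l x m) . (m x n) = l x n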
Probability
§ Probability distribution
- For a discrete random variable 𝒛 with 𝐾 states
• $0 \le p(z_i) \le 1$
• $\sum_{i=1}^{K} p(z_i) = 1$
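A toy check of these two conditions (the distribution values are illustrative, not from the slides):

p = [0.1, 0.2, 0.3, 0.4]                      # a discrete distribution over K = 4 states
assert all(0.0 <= p_i <= 1.0 for p_i in p)    # every probability lies in [0, 1]
assert abs(sum(p) - 1.0) < 1e-9               # the probabilities sum to 1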
Distributional Representation
§ Distributed Representations
- Each dimension (unit) is a feature of the entity
- Units in a layer are not mutually exclusive
- Two units can be "active" at the same time
(Figure: a $d$-dimensional vector $\boldsymbol{x} = [x_1, x_2, x_3, \dots, x_d]$)
Word embeddings projected to a two-dimensional space
Everything we talk about today concerns …
Compositional Representations
Compositional Representations
(Figure: input vectors $\boldsymbol{v}_1, \boldsymbol{v}_2, \boldsymbol{v}_3, \boldsymbol{v}_4$ composed into a single representation)
Source: https://megagon.ai/blog/emu-enhancing-multilingual-sentence-embeddings-with-semantic-similarity/
Compositional Representations
(Figure: input vectors $\boldsymbol{v}_1, \boldsymbol{v}_2, \boldsymbol{v}_3$ composed into contextualized vectors $\widehat{\boldsymbol{v}}_1, \widehat{\boldsymbol{v}}_2, \widehat{\boldsymbol{v}}_3$)
Recurrent Neural Networks – RECAP
RNN – Unrolling
RNN – Compositional embedding
(Figure: an RNN composing the token embeddings; the hidden state at each step is the contextualized embedding of that token, and the final state serves as the sentence embedding)
Attention Networks
§ $\boldsymbol{O} = \mathrm{ATT}(\boldsymbol{Q}, \boldsymbol{V})$
- $\boldsymbol{Q}$: query vectors, shape $|\boldsymbol{Q}| \times d_q$
- $\boldsymbol{V}$: value vectors, shape $|\boldsymbol{V}| \times d_v$
- $\boldsymbol{O}$: output vectors, shape $|\boldsymbol{Q}| \times d_v$
(Figure: the ATT block mapping queries $\boldsymbol{q}_1, \boldsymbol{q}_2$ and values $\boldsymbol{V}$ to outputs $\boldsymbol{o}_1, \boldsymbol{o}_2$)
Attention Networks – definition
Formal definition:
§ Given a set of value vectors $\boldsymbol{V}$ and a set of query vectors $\boldsymbol{Q}$, attention is a technique to compute a weighted sum of the values, dependent on each query
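A minimal sketch of this definition, assuming plain dot-product scores (one possible choice of scoring function, not the only one) and NumPy:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, V):
    """Q: |Q| x d queries, V: |V| x d values -> O: |Q| x d outputs."""
    scores = Q @ V.T                  # non-normalized attention scores, |Q| x |V|
    alpha = softmax(scores, axis=-1)  # one weight distribution per query
    return alpha @ V                  # weighted sum of the values

Q = np.random.randn(2, 8)   # two queries
V = np.random.randn(4, 8)   # four values
O = attention(Q, V)         # two outputs, shape 2 x 8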
Attentions!
(Figure: a scoring function $f$ compares each query, e.g. $\boldsymbol{q}_2$, with the values $\boldsymbol{v}_1 \dots \boldsymbol{v}_4$ to produce the outputs $\boldsymbol{o}_1, \boldsymbol{o}_2$)
Attention variants
softmax – RECAP
$\boldsymbol{z} = \begin{pmatrix} 1 \\ 2 \\ 5 \\ 6 \end{pmatrix} \qquad \mathrm{softmax}(\boldsymbol{z}) = \begin{pmatrix} 0.004 \\ 0.013 \\ 0.264 \\ 0.717 \end{pmatrix}$
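The same example in NumPy (the printed values match the slide up to rounding):

import numpy as np

z = np.array([1.0, 2.0, 5.0, 6.0])
p = np.exp(z) / np.exp(z).sum()
print(p.round(3))   # [0.005 0.013 0.264 0.718]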
Attention variants
Multiplicative attention
§ First, non-normalized attention scores:
$\tilde{\alpha}_{i,j} = \boldsymbol{q}_i \boldsymbol{W} \boldsymbol{v}_j^\top$
- $\boldsymbol{W}$ is a matrix of model parameters
- provides a linear function for measuring relations between query and value vectors
§ Then, softmax over values:
$\alpha_{i,j} = \mathrm{softmax}(\tilde{\boldsymbol{\alpha}}_i)_j$
§ Output (weighted sum): $\boldsymbol{o}_i = \sum_{j=1}^{|\boldsymbol{V}|} \alpha_{i,j} \boldsymbol{v}_j$
(Figure: attention weights $\alpha_{1,j}$ and $\alpha_{2,j}$ over the values $\boldsymbol{v}_1 \dots \boldsymbol{v}_4$ for queries $\boldsymbol{q}_1, \boldsymbol{q}_2$, producing outputs $\boldsymbol{o}_1, \boldsymbol{o}_2$)
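A sketch of the multiplicative score function, assuming NumPy; $\boldsymbol{W}$ is a randomly initialized stand-in here, whereas in a real model it is learned. The row-wise softmax and weighted sum then proceed exactly as in the generic attention sketch above.

import numpy as np

def multiplicative_scores(Q, V, W):
    """Q: |Q| x d_q, W: d_q x d_v, V: |V| x d_v -> scores: |Q| x |V|."""
    return Q @ W @ V.T   # score_ij = q_i W v_j

d_q, d_v = 4, 6
Q = np.random.randn(2, d_q)
V = np.random.randn(5, d_v)
W = np.random.randn(d_q, d_v)
scores = multiplicative_scores(Q, V, W)   # 2 x 5, then softmax row-wise and sum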
Attention variants
Additive attention
§ First, non-normalized attention scores:
$\tilde{\alpha}_{i,j} = \boldsymbol{u}^\top \tanh(\boldsymbol{q}_i \boldsymbol{W}_1 + \boldsymbol{v}_j \boldsymbol{W}_2)$
- $\boldsymbol{W}_1$, $\boldsymbol{W}_2$, and $\boldsymbol{u}$ are model parameters
- provides a non-linear function for measuring relations between the query and value vectors
§ Then, softmax over values:
$\alpha_{i,j} = \mathrm{softmax}(\tilde{\boldsymbol{\alpha}}_i)_j$
§ Output (weighted sum): $\boldsymbol{o}_i = \sum_{j=1}^{|\boldsymbol{V}|} \alpha_{i,j} \boldsymbol{v}_j$
(Figure: the same setup as above, with the additive scoring function $f$ relating queries $\boldsymbol{q}_1, \boldsymbol{q}_2$ to the values $\boldsymbol{v}_1 \dots \boldsymbol{v}_4$)
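A sketch of the additive score function, assuming NumPy; $\boldsymbol{W}_1$, $\boldsymbol{W}_2$, and $\boldsymbol{u}$ are randomly initialized stand-ins for learned parameters, and d_a is a hypothetical hidden size of the scoring function.

import numpy as np

def additive_scores(Q, V, W1, W2, u):
    """Q: |Q| x d_q, V: |V| x d_v, W1: d_q x d_a, W2: d_v x d_a, u: d_a."""
    h = np.tanh((Q @ W1)[:, None, :] + (V @ W2)[None, :, :])   # |Q| x |V| x d_a
    return h @ u                                                # score_ij = u . tanh(q_i W1 + v_j W2)

d_q, d_v, d_a = 4, 6, 8
Q = np.random.randn(2, d_q)
V = np.random.randn(5, d_v)
W1 = np.random.randn(d_q, d_a)
W2 = np.random.randn(d_v, d_a)
u = np.random.randn(d_a)
scores = additive_scores(Q, V, W1, W2, u)   # 2 x 5, then softmax row-wise and sum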
Attention in practice
(Figure: in practice, a query $\boldsymbol{q}$ and the values $\boldsymbol{v}_1 \dots \boldsymbol{v}_4$ enter the ATT block, which returns the output $\boldsymbol{o}$)
Self-attention
(Figure: the value vectors $\boldsymbol{v}_1, \boldsymbol{v}_2, \boldsymbol{v}_3$ are also used as the queries of the ATT block, yielding the contextualized vectors $\widehat{\boldsymbol{v}}_1, \widehat{\boldsymbol{v}}_2, \widehat{\boldsymbol{v}}_3$)
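Self-attention as a special case of the (hypothetical) attention() sketch from the definition slide: the value vectors also act as the queries, so every vector is re-encoded as a weighted sum over the whole set.

import numpy as np

V = np.random.randn(3, 8)     # e.g. three token embeddings
V_hat = attention(Q=V, V=V)   # contextualized vectors, still 3 x 8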
Attention – summary
Agenda
Attention is all you need. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). In NeurIPS.
Transformer encoder
(Figure: the Transformer encoder block with its 1st and 2nd sub-layers)
§ Then, softmax over values: $\alpha_{i,j} = \mathrm{softmax}(\tilde{\boldsymbol{\alpha}}_i)_j$
§ Output (weighted sum): $\boldsymbol{o}_i = \sum_{j=1}^{|\boldsymbol{V}|} \alpha_{i,j} \boldsymbol{v}_j$
(Figure: attention weights over the values $\boldsymbol{v}_1 \dots \boldsymbol{v}_4$ for queries $\boldsymbol{q}_1, \boldsymbol{q}_2$, producing outputs $\boldsymbol{o}_1, \boldsymbol{o}_2$)
Scaled dot-product attention
(Figure: scaled dot-product attention over the values $\boldsymbol{v}_1 \dots \boldsymbol{v}_4$)
§ Output (weighted sum): $\boldsymbol{o}_i = \sum_{j=1}^{|\boldsymbol{V}|} \alpha_{i,j} \boldsymbol{v}_j$
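A sketch of scaled dot-product attention; the scaling by $\sqrt{d}$ follows Vaswani et al. (cited above). Note that the original paper projects the inputs into separate keys and values, while here, following the slides, $\boldsymbol{V}$ plays both roles; softmax() is the sketch defined earlier.

import numpy as np

def scaled_dot_product_attention(Q, V):
    """Q: |Q| x d, V: |V| x d -> O: |Q| x d."""
    d = Q.shape[-1]
    scores = Q @ V.T / np.sqrt(d)      # scaling keeps the scores moderate before the softmax
    alpha = softmax(scores, axis=-1)
    return alpha @ V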
Problem with (single-head) attention
Multi-head attention
Multi-head attention:
1. Down-project query and value vectors to ℎ subspaces (heads)
Multi-head attention
(Figure: multi-head attention with $h = 2$ heads — the query $\boldsymbol{q}$ is down-projected to $\boldsymbol{q}_1$ and $\boldsymbol{q}_2$, each head runs scaled dot-product attention over the values $\boldsymbol{v}_1 \dots \boldsymbol{v}_3$, and the head outputs $\boldsymbol{o}_1, \boldsymbol{o}_2$ are concatenated and combined with $\boldsymbol{W}^O$ to form the output $\boldsymbol{o}$)
Multi-head attention – formulation
§ The output projection matrix has size $d \times d$; it linearly combines the dimensions of the concatenated head vectors
(Figure: the multi-head attention formulation, with parameters shown in red)
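A sketch of multi-head attention with $h$ heads, following the slides: queries and values are down-projected into $h$ subspaces, each head runs scaled dot-product attention, and the concatenated head outputs are linearly combined by a $d \times d$ output matrix. All parameter matrices are hypothetical, randomly initialized stand-ins; scaled_dot_product_attention() is the earlier sketch.

import numpy as np

def multi_head_attention(Q, V, Wq_heads, Wv_heads, W_O):
    heads = []
    for Wq, Wv in zip(Wq_heads, Wv_heads):   # one (Wq, Wv) projection pair per head
        heads.append(scaled_dot_product_attention(Q @ Wq, V @ Wv))
    O = np.concatenate(heads, axis=-1)       # |Q| x d (h heads of size d/h)
    return O @ W_O                           # linearly combine the concatenated dimensions

d, h = 8, 2
Q = np.random.randn(2, d)
V = np.random.randn(5, d)
Wq_heads = [np.random.randn(d, d // h) for _ in range(h)]
Wv_heads = [np.random.randn(d, d // h) for _ in range(h)]
W_O = np.random.randn(d, d)
out = multi_head_attention(Q, V, Wq_heads, Wv_heads, W_O)   # shape 2 x d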
Multi-head attention – in original paper
Self-attention – recap
(Figure: applying (multi-head) attention with the value vectors $\boldsymbol{v}_1, \boldsymbol{v}_2, \boldsymbol{v}_3$ also used as queries is the same as (multi-head) self-attention on $\boldsymbol{v}_1, \boldsymbol{v}_2, \boldsymbol{v}_3$; both yield the contextualized vectors $\widehat{\boldsymbol{v}}_1, \widehat{\boldsymbol{v}}_2, \widehat{\boldsymbol{v}}_3$)
Residuals
§ Residual connection: the input of each sub-layer is added to its output, i.e. $\boldsymbol{x} + \mathrm{SubLayer}(\boldsymbol{x})$
Layer normalization
§ Learn in detail:
- Batch Normalization: Deep Learning book section 8.7.1
http://www.deeplearningbook.org/contents/optimization.html
- Talk by Goodfellow https://www.youtube.com/watch?v=Xogn6veSyxA&feature=youtu.be
- Paper: https://arxiv.org/pdf/1607.06450.pdf
(Figure: layer norm applied within the encoder block)
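A minimal layer-norm sketch: each embedding is normalized across its dimensions and then rescaled with a learned gain $\boldsymbol{g}$ and bias $\boldsymbol{b}$ (here set to ones and zeros for illustration).

import numpy as np

def layer_norm(x, g, b, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return g * (x - mean) / (std + eps) + b

x = np.random.randn(3, 8)
y = layer_norm(x, g=np.ones(8), b=np.zeros(8))   # each row now has roughly zero mean, unit variance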
Feed forward on embeddings
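A sketch of the position-wise feed-forward sub-layer, applied to each embedding independently; the ReLU-then-linear form follows Vaswani et al. (cited above), and the parameter names are mine.

import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """x: n x d -> n x d; the same two-layer network is applied to every position."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2   # ReLU, then a second linear layer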
Transformer encoder – summary
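A compact sketch of one encoder block, combining the hypothetical helpers from the sketches above (multi_head_attention, feed_forward, layer_norm); it mirrors the structure of the slides rather than any particular implementation.

def encoder_layer(X, attn_params, ffn_params, ln_params):
    """X: n x d token embeddings -> n x d contextualized embeddings."""
    g1, b1, g2, b2 = ln_params
    A = multi_head_attention(X, X, *attn_params)   # 1st sub-layer: multi-head self-attention
    X = layer_norm(X + A, g1, b1)                  # residual connection + layer norm
    F = feed_forward(X, *ffn_params)               # 2nd sub-layer: position-wise feed-forward
    return layer_norm(X + F, g2, b2)               # residual connection + layer norm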
Finito!
Institute of
Computational
Perception
Position embeddings
Position embeddings – examples
An example of position embeddings with four dimensions:
(Figure: the position embedding for location 0 and the position embedding for location 20)
(Figure: heatmap of position embeddings across positions; values range from -1 (dark) to +1 (light) over the 512 dimensions)
Source: http://jalammar.github.io/illustrated-transformer/
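A sketch of sinusoidal position embeddings as in Vaswani et al.; the heatmap above (values between -1 and +1 across 512 dimensions) is consistent with this sine/cosine construction. The function name and arguments are mine.

import numpy as np

def sinusoidal_position_embeddings(num_positions, d):
    pos = np.arange(num_positions)[:, None]   # positions 0 .. num_positions-1
    i = np.arange(d // 2)[None, :]            # dimension-pair index
    angle = pos / np.power(10000, 2 * i / d)
    pe = np.zeros((num_positions, d))
    pe[:, 0::2] = np.sin(angle)               # even dimensions
    pe[:, 1::2] = np.cos(angle)               # odd dimensions
    return pe                                 # values lie in [-1, 1]

PE = sinusoidal_position_embeddings(21, 512)
print(PE[0, :4])    # position embedding for location 0
print(PE[20, :4])   # position embedding for location 20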