
Very Deep Learning

Lecture 12

Dr. Muhammad Zeshan Afzal, Prof. Didier Stricker


MindGarage, University of Kaiserslautern
afzal.tukl@gmail.com



Administrative

◼ https://vlu.cs.uni-kl.de/



Advertisement ☺

◼ Research Areas (Augmented Vision Group DFKI)


▪ Computer Vision
• Stereo reconstruction and optical flow
• Pose estimation
• Object detection
• Activity recognition
• Multimodal information retrieval
▪ Document Analysis
• Document Classification
• Information retrieval, etc.


Advertisement ☺

◼ Positions
▪ A couple of PhD positions
▪ Master's thesis
▪ Master's project
▪ Hiwi (student research assistant)


Recap



Natural Language Processing

◼ Linguistics: the scientific study of language. It involves the analysis of language
form, language meaning, and language in context, as well as of the social, cultural,
historical, and political factors that influence language. (Wikipedia)
◼ Computational linguistics: an interdisciplinary field concerned with the
computational modelling of natural language, as well as the study of appropriate
computational approaches to linguistic questions. (Wikipedia)
◼ Natural language processing (NLP): the branch of computer science, and more
specifically of artificial intelligence (AI), concerned with giving computers the ability
to understand text and spoken words in much the same way human beings can.
(Wikipedia)


Language Models

◼ Word language model example:

p(Cat was sitting on the table <EOS>) = p(Cat)
                                        · p(was | Cat)
                                        · p(sitting | Cat was)
                                        · p(on | Cat was sitting)
                                        · p(the | Cat was sitting on)
                                        · p(table | Cat was sitting on the)
                                        · p(<EOS> | Cat was sitting on the table)

◼ Word language models are autoregressive models that predict the next word
given all previous words in the sentence (see the sketch below).
◼ A good model assigns high probability to the likely next words.
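The factorization above can be evaluated directly once a model supplies the conditional probabilities. A minimal Python sketch, assuming made-up toy probabilities (the names cond_prob and sentence_log_prob are illustrative, not from the lecture):

```python
import math

def sentence_log_prob(tokens, cond_prob):
    """Sum log p(w_t | w_1 ... w_{t-1}) over the sentence; log space avoids underflow."""
    log_p = 0.0
    for t, word in enumerate(tokens):
        history = tuple(tokens[:t])                    # all previous words
        log_p += math.log(cond_prob[(history, word)])
    return log_p

# Hypothetical conditional probabilities for the example sentence (not real estimates).
cond_prob = {
    ((), "Cat"): 0.01,
    (("Cat",), "was"): 0.30,
    (("Cat", "was"), "sitting"): 0.20,
    (("Cat", "was", "sitting"), "on"): 0.40,
    (("Cat", "was", "sitting", "on"), "the"): 0.50,
    (("Cat", "was", "sitting", "on", "the"), "table"): 0.10,
    (("Cat", "was", "sitting", "on", "the", "table"), "<EOS>"): 0.60,
}

sentence = ["Cat", "was", "sitting", "on", "the", "table", "<EOS>"]
print(math.exp(sentence_log_prob(sentence, cond_prob)))  # p(Cat was sitting on the table <EOS>)
```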


Applications of Language Models


Evaluating Character Language Models



Evaluating Word Language Models


n-gram Models


Sampling from n-gram Models


Neural Language Models


Local Word Representations



Distributed Representations


Word Representations


Local vs distributed Representations
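The figures for these slides are not reproduced here. As a rough illustration of the distinction, a local (one-hot) representation reserves one dimension per word and makes all words equally dissimilar, while a distributed representation spreads meaning over a few dense dimensions. The vectors below are invented for illustration:

```python
import numpy as np

vocab = ["cat", "dog", "table"]

# Local (one-hot) representation: one dimension per vocabulary word.
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Distributed representation: dense, low-dimensional vectors (toy values);
# similar words can end up close to each other.
dense = {
    "cat":   np.array([0.8, 0.1, 0.3]),
    "dog":   np.array([0.7, 0.2, 0.4]),
    "table": np.array([-0.5, 0.9, 0.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(one_hot["cat"], one_hot["dog"]))  # 0.0 -- one-hot vectors carry no similarity
print(cosine(dense["cat"], dense["dog"]))      # high -- dense vectors can encode similarity
```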


Learned Word Embeddings



Neural Probabilistic Language Model

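The model slides themselves are figures and are not reproduced. Below is a minimal PyTorch sketch in the spirit of Bengio et al.'s neural probabilistic language model: embed the previous context words, concatenate the embeddings, pass them through a tanh hidden layer, and predict a distribution over the vocabulary. All sizes and names are illustrative assumptions, not the lecture's exact configuration.

```python
import torch
import torch.nn as nn

class NPLM(nn.Module):
    """Sketch of a feed-forward neural probabilistic language model."""
    def __init__(self, vocab_size, emb_dim=64, context_size=3, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)          # word embedding lookup table
        self.hidden = nn.Linear(context_size * emb_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):                  # (batch, context_size)
        e = self.embed(context_ids)                  # (batch, context_size, emb_dim)
        x = e.flatten(start_dim=1)                   # concatenate the context embeddings
        h = torch.tanh(self.hidden(x))
        return self.out(h)                           # logits over the next word

model = NPLM(vocab_size=10_000)
context = torch.randint(0, 10_000, (2, 3))           # two dummy 3-word contexts
probs = model(context).softmax(dim=-1)               # (2, 10000) next-word distributions
```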


Word Embeddings



Word2Vec (CBOW)

Example sentence: "Have a great day"

Mikolov, Chen, Corrado and Dean: Efficient Estimation of Word Representations in Vector Space. ICLR Workshops, 2013.


Word2Vec (Skipgram)

Mikolov, Chen, Corrado and Dean: Efficient Estimation of Word Representations in Vector Space. ICLR Workshops, 2013.



Word2Vec (CBOW vs Skipgram)

◼ Both have their own advantages and disadvantages. According to the authors,
skip-gram works well with small amounts of data and represents rare words well.
◼ On the other hand, CBOW is faster and has better representations for more
frequent words (the two differ in how training pairs are built; see the sketch below).
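The difference can be summarized by how (input, target) pairs are formed from a window around each position: CBOW predicts the centre word from its context, while skip-gram predicts each context word from the centre word. A small sketch (the window size and sentence are illustrative):

```python
# Build word2vec-style training pairs for one sentence (illustrative only).
sentence = "have a great day".split()
window = 2

cbow_pairs, skipgram_pairs = [], []
for i, center in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    cbow_pairs.append((context, center))                  # CBOW: context -> centre word
    skipgram_pairs.extend((center, c) for c in context)   # skip-gram: centre word -> each context word

print(cbow_pairs[2])       # (['have', 'a', 'day'], 'great')
print(skipgram_pairs[:3])  # [('have', 'a'), ('have', 'great'), ('a', 'have')]
```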


Lookup Table

Mikolov, Chen, Corrado and Dean: Efficient Estimation of Word Representations in Vector Space. ICLR Workshops, 2013.

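The lookup-table figure is not reproduced. Conceptually, multiplying a one-hot vector by the embedding matrix just selects one row, so in practice the embedding is a simple index lookup. A small numpy sketch with made-up sizes:

```python
import numpy as np

vocab = {"have": 0, "a": 1, "great": 2, "day": 3}
emb_dim = 5
E = np.random.randn(len(vocab), emb_dim)    # embedding matrix: one row per word

word = "great"
one_hot = np.eye(len(vocab))[vocab[word]]

# The one-hot matrix product and direct row indexing give the same vector.
assert np.allclose(one_hot @ E, E[vocab[word]])
print(E[vocab[word]])
```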


Word Vector Arithmetic

Expression                        Nearest Token
Paris - France + Italy            Rome
Bigger - big + cold               colder
Sushi - Japan + Germany           bratwurst
Cu - copper + Gold                Au
Windows - Microsoft + Google      Android
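Analogies like those in the table are answered by vector arithmetic followed by a nearest-neighbour search under cosine similarity. A sketch assuming some dictionary of pretrained word vectors called emb (a hypothetical stand-in; any word2vec-style embeddings would do):

```python
import numpy as np

def analogy(emb, a, b, c):
    """Return the word whose vector is closest to emb[a] - emb[b] + emb[c]."""
    query = emb[a] - emb[b] + emb[c]
    best, best_sim = None, -np.inf
    for word, vec in emb.items():
        if word in (a, b, c):                     # exclude the query words themselves
            continue
        sim = vec @ query / (np.linalg.norm(vec) * np.linalg.norm(query))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# e.g. analogy(emb, "Paris", "France", "Italy") would ideally return "Rome".
```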


Neural Machine Translation



Sequence to Sequence Learning

◼ Embed
◼ Encode
◼ (Attend)
◼ Decode
◼ Predict
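A compact sketch of these steps for translation, using GRU encoder and decoder layers (the module and size choices are assumptions, not the lecture's exact model; attention is omitted here and added further below):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Embed -> encode -> decode -> predict."""
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.predict = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, h = self.encoder(self.src_embed(src_ids))            # encode source into final state h
        dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), h)   # decode conditioned on h
        return self.predict(dec_out)                            # logits for each target position

model = Seq2Seq(src_vocab=8000, tgt_vocab=8000)
src = torch.randint(0, 8000, (1, 5))      # dummy source sentence
tgt = torch.randint(0, 8000, (1, 6))      # dummy target prefix (teacher forcing)
logits = model(src, tgt)                  # (1, 6, 8000)
```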


Decoding



Beam Search
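The beam-search slides are figures. As a hedged sketch of the idea: keep the k highest-scoring partial hypotheses at every step and expand each with the model's next-token log-probabilities. The log_probs function below is a hypothetical stand-in for the decoder:

```python
def beam_search(log_probs, start="<START>", end="<EOS>", beam_size=3, max_len=20):
    """log_probs(prefix) -> dict {token: log p(token | prefix)} (hypothetical model hook)."""
    beams = [([start], 0.0)]                     # (token sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end:                   # completed hypothesis: keep, do not expand
                finished.append((seq, score))
                continue
            for tok, lp in log_probs(tuple(seq)).items():
                candidates.append((seq + [tok], score + lp))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_size]
    finished.extend(b for b in beams if b[0][-1] == end)
    # Length-normalize so longer hypotheses are not unfairly penalized.
    return max(finished or beams, key=lambda x: x[1] / len(x[0]))
```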


Sequence to Sequence Model with Attention


Sequence-to-sequence: the bottleneck problem

[Figure: an encoder RNN reads the source sentence "il a m' entarté"; its final hidden state, the encoding of the source sentence, is handed to a decoder RNN, which generates the target sentence "he hit me with a pie <END>".]

• Problems with this architecture? The single encoding of the source sentence needs to
capture all information about the source sentence. Information bottleneck!


Attention
• Attention provides a solution to the bottleneck problem.

• Core idea: on each step of the decoder, use a direct connection to the encoder to
focus on a particular part of the source sequence.

• First we show this via diagrams (no equations), then with equations.
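As a tiny preview of the equations to come, the "focus" is computed by scoring each encoder hidden state against the current decoder state and normalizing with a softmax (random toy vectors below):

```python
import numpy as np

H = np.random.randn(4, 8)     # encoder hidden states h_1..h_4 (source length 4)
s = np.random.randn(8)        # current decoder hidden state s_t

scores = H @ s                                 # dot-product attention scores
attn = np.exp(scores) / np.exp(scores).sum()   # attention distribution (softmax)
print(attn, attn.sum())                        # sums to 1; peaks where the decoder "focuses"
```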


Sequence-to-sequence with attention

[Figure: for the current decoder hidden state, an attention score is computed as the dot product with each encoder RNN hidden state over the source sentence "il a m' entarté"; the same computation is repeated for every encoder position.]


Sequence-to-sequence with attention

• Take the softmax of the attention scores to turn them into a probability distribution
(the attention distribution).
• On this decoder timestep, we are mostly focusing on the first encoder hidden state ("he").


Sequence-to-sequence with attention

• Use the attention distribution to take a weighted sum of the encoder hidden states:
this is the attention output.
• The attention output mostly contains information from the hidden states that received
high attention.


Sequence-to-sequence with attention

• Concatenate the attention output with the decoder hidden state, then use the result to
compute ŷ (here the first target word, "he") as before.


Sequence-to-sequence with attention

• The same procedure is repeated on the next decoder timestep to predict the next word
("hit").
• Sometimes we take the attention output from the previous step and also feed it into the
decoder, along with the usual decoder input (sketched below).
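This "also feed the attention output into the decoder" variant amounts to concatenating the previous step's attention output to the embedded decoder input before the RNN step. A tiny sketch with illustrative names and sizes:

```python
import numpy as np

emb_dim, hidden = 8, 16
prev_attn_output = np.zeros(hidden)                # attention output from the previous step
x = np.random.randn(emb_dim)                       # embedding of the previous target word

rnn_input = np.concatenate([x, prev_attn_output])  # (emb_dim + hidden,) fed to the decoder RNN
# ...the decoder RNN cell would consume rnn_input together with its previous hidden state.
```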


Sequence-to-sequence with attention

• The process repeats on each remaining decoder timestep, producing "me", "with", "a",
and "pie".


Attention: in equations
• We have encoder hidden states h_1, …, h_N ∈ R^h.
• On timestep t, we have the decoder hidden state s_t ∈ R^h.
• We get the attention scores for this step: e^t = [s_t^T h_1, …, s_t^T h_N] ∈ R^N.
• We take the softmax to get the attention distribution α^t = softmax(e^t) ∈ R^N for this
step (this is a probability distribution and sums to 1).
• We use α^t to take a weighted sum of the encoder hidden states to get the attention
output a_t = Σ_{i=1}^{N} α^t_i h_i ∈ R^h.
• Finally we concatenate the attention output with the decoder hidden state, [a_t; s_t] ∈
R^{2h}, and proceed as in the non-attention seq2seq model.
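A minimal numpy sketch of exactly these steps for one decoder timestep (toy shapes; the projection that turns [a_t; s_t] into ŷ is a made-up linear layer):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

N, h, vocab = 5, 16, 1000
H = np.random.randn(N, h)          # encoder hidden states h_1..h_N
s_t = np.random.randn(h)           # decoder hidden state at timestep t

e_t = H @ s_t                      # attention scores: dot products s_t^T h_i
alpha_t = softmax(e_t)             # attention distribution (sums to 1)
a_t = alpha_t @ H                  # attention output: weighted sum of encoder states

W_out = np.random.randn(vocab, 2 * h)          # made-up output projection
logits = W_out @ np.concatenate([a_t, s_t])    # use [a_t; s_t] to predict the next word
y_hat = softmax(logits)                        # distribution over the target vocabulary
print(int(y_hat.argmax()))                     # index of the predicted word
```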


Attention & Alignment



Thanks a lot for your Attention
