
Very Deep Learning

Lecture 12

Dr. Muhammad Zeshan Afzal, Prof. Didier Stricker


MindGarage, University of Kaiserslautern
afzal.tukl@gmail.com



Administrative

◼ https://vlu.cs.uni-kl.de/



Advertisement ☺

◼ Research Areas (Augmented Vision Group DFKI)


▪ Computer Vision
• Stereo reconstruction and optical flow
• Pose estimation
• Object detection
• Activity recognition
• Multimodal information retrieval
▪ Document Analysis
• Document Classification
• Information retrieval, etc.


Advertisement ☺

◼ Positions
▪ A couple of PhD positions
▪ Master's thesis
▪ Master's project
▪ Hiwi (student research assistant)


Recap



Natural Language Processing

◼ Linguistics: the scientific study of language. It involves the analysis of language
form, language meaning, and language in context, as well as of the social, cultural,
historical, and political factors that influence language. (Wikipedia)
◼ Computational linguistics: an interdisciplinary field concerned with the
computational modelling of natural language, as well as the study of appropriate
computational approaches to linguistic questions. (Wikipedia)
◼ Natural language processing (NLP): the branch of computer science, and more
specifically of artificial intelligence (AI), concerned with giving computers the ability
to understand text and spoken words in much the same way human beings can.
(Wikipedia)


Language Models

◼ Word language model example:

p(Cat was sitting on the table <EOS>) = p(Cat)
                                        · p(was | Cat)
                                        · p(sitting | Cat was)
                                        · p(on | Cat was sitting)
                                        · p(the | Cat was sitting on)
                                        · p(table | Cat was sitting on the)
                                        · p(<EOS> | Cat was sitting on the table)

◼ Word language models are autoregressive models that predict the next word
given all previous words in the sentence (see the sketch below).
◼ A good model assigns high probability to the likely next words.
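The factorization above can be evaluated directly once a model supplies the conditional probabilities. A minimal Python sketch, assuming made-up toy probabilities (the names cond_prob and sentence_log_prob are illustrative, not from the lecture):

```python
import math

def sentence_log_prob(tokens, cond_prob):
    """Sum log p(w_t | w_1 ... w_{t-1}) over the sentence; log space avoids underflow."""
    log_p = 0.0
    for t, word in enumerate(tokens):
        history = tuple(tokens[:t])                    # all previous words
        log_p += math.log(cond_prob[(history, word)])
    return log_p

# Hypothetical conditional probabilities for the example sentence (not real estimates).
cond_prob = {
    ((), "Cat"): 0.01,
    (("Cat",), "was"): 0.30,
    (("Cat", "was"), "sitting"): 0.20,
    (("Cat", "was", "sitting"), "on"): 0.40,
    (("Cat", "was", "sitting", "on"), "the"): 0.50,
    (("Cat", "was", "sitting", "on", "the"), "table"): 0.10,
    (("Cat", "was", "sitting", "on", "the", "table"), "<EOS>"): 0.60,
}

sentence = ["Cat", "was", "sitting", "on", "the", "table", "<EOS>"]
print(math.exp(sentence_log_prob(sentence, cond_prob)))  # p(Cat was sitting on the table <EOS>)
```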


Applications of Language Models


Evaluating Character Language Models



Evaluating Word Language Models


n-gram Models


Sampling from n-gram Models


Neural Language Models


Local Word Representations



Distributed Representations


Word Representations


Local vs distributed Representations
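The figures for these slides are not reproduced here. As a rough illustration of the distinction, a local (one-hot) representation reserves one dimension per word and makes all words equally dissimilar, while a distributed representation spreads meaning over a few dense dimensions. The vectors below are invented for illustration:

```python
import numpy as np

vocab = ["cat", "dog", "table"]

# Local (one-hot) representation: one dimension per vocabulary word.
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Distributed representation: dense, low-dimensional vectors (toy values);
# similar words can end up close to each other.
dense = {
    "cat":   np.array([0.8, 0.1, 0.3]),
    "dog":   np.array([0.7, 0.2, 0.4]),
    "table": np.array([-0.5, 0.9, 0.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(one_hot["cat"], one_hot["dog"]))  # 0.0 -- one-hot vectors carry no similarity
print(cosine(dense["cat"], dense["dog"]))      # high -- dense vectors can encode similarity
```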


Learned Word Embeddings



Neural Probabilistic Language Model

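The model slides themselves are figures and are not reproduced. Below is a minimal PyTorch sketch in the spirit of Bengio et al.'s neural probabilistic language model: embed the previous context words, concatenate the embeddings, pass them through a tanh hidden layer, and predict a distribution over the vocabulary. All sizes and names are illustrative assumptions, not the lecture's exact configuration.

```python
import torch
import torch.nn as nn

class NPLM(nn.Module):
    """Sketch of a feed-forward neural probabilistic language model."""
    def __init__(self, vocab_size, emb_dim=64, context_size=3, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)          # word embedding lookup table
        self.hidden = nn.Linear(context_size * emb_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):                  # (batch, context_size)
        e = self.embed(context_ids)                  # (batch, context_size, emb_dim)
        x = e.flatten(start_dim=1)                   # concatenate the context embeddings
        h = torch.tanh(self.hidden(x))
        return self.out(h)                           # logits over the next word

model = NPLM(vocab_size=10_000)
context = torch.randint(0, 10_000, (2, 3))           # two dummy 3-word contexts
probs = model(context).softmax(dim=-1)               # (2, 10000) next-word distributions
```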


Word Embeddings



Word2Vec (CBOW)

Example sentence: "Have a great day"

Mikolov, Chen, Corrado and Dean: Efficient Estimation of Word Representations in Vector Space. ICLR Workshops, 2013.


Word2Vec (Skipgram)

Mikolov, Chen, Corrado and Dean: Efficient Estimation of Word Representations in Vector Space. ICLR Workshops, 2013.



Word2Vec (CBOW vs Skipgram)

◼ Both have their own advantages and disadvantages. According to the authors,
skip-gram works well with small amounts of data and represents rare words well.
◼ On the other hand, CBOW is faster and has better representations for more
frequent words (the two differ in how training pairs are built; see the sketch below).
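The difference can be summarized by how (input, target) pairs are formed from a window around each position: CBOW predicts the centre word from its context, while skip-gram predicts each context word from the centre word. A small sketch (the window size and sentence are illustrative):

```python
# Build word2vec-style training pairs for one sentence (illustrative only).
sentence = "have a great day".split()
window = 2

cbow_pairs, skipgram_pairs = [], []
for i, center in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    cbow_pairs.append((context, center))                  # CBOW: context -> centre word
    skipgram_pairs.extend((center, c) for c in context)   # skip-gram: centre word -> each context word

print(cbow_pairs[2])       # (['have', 'a', 'day'], 'great')
print(skipgram_pairs[:3])  # [('have', 'a'), ('have', 'great'), ('a', 'have')]
```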


Lookup Table

Mikolov, Chen, Corrado and Dean: Efficient Estimation of Word Representations in Vector Space. ICLR Workshops, 2013.

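The lookup-table figure is not reproduced. Conceptually, multiplying a one-hot vector by the embedding matrix just selects one row, so in practice the embedding is a simple index lookup. A small numpy sketch with made-up sizes:

```python
import numpy as np

vocab = {"have": 0, "a": 1, "great": 2, "day": 3}
emb_dim = 5
E = np.random.randn(len(vocab), emb_dim)    # embedding matrix: one row per word

word = "great"
one_hot = np.eye(len(vocab))[vocab[word]]

# The one-hot matrix product and direct row indexing give the same vector.
assert np.allclose(one_hot @ E, E[vocab[word]])
print(E[vocab[word]])
```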


Word Vector Arithmetic

Expression                        Nearest Token
Paris - France + Italy            Rome
Bigger - big + cold               colder
Sushi - Japan + Germany           bratwurst
Cu - copper + Gold                Au
Windows - Microsoft + Google      Android
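Analogies like those in the table are answered by vector arithmetic followed by a nearest-neighbour search under cosine similarity. A sketch assuming some dictionary of pretrained word vectors called emb (a hypothetical stand-in; any word2vec-style embeddings would do):

```python
import numpy as np

def analogy(emb, a, b, c):
    """Return the word whose vector is closest to emb[a] - emb[b] + emb[c]."""
    query = emb[a] - emb[b] + emb[c]
    best, best_sim = None, -np.inf
    for word, vec in emb.items():
        if word in (a, b, c):                     # exclude the query words themselves
            continue
        sim = vec @ query / (np.linalg.norm(vec) * np.linalg.norm(query))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# e.g. analogy(emb, "Paris", "France", "Italy") would ideally return "Rome".
```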


Neural Machine Translation



Sequence to Sequence Learning

◼ Embed
◼ Encode
◼ (Attend)
◼ Decode
◼ Predict
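A compact sketch of these steps for translation, using GRU encoder and decoder layers (the module and size choices are assumptions, not the lecture's exact model; attention is omitted here and added further below):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Embed -> encode -> decode -> predict."""
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.predict = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, h = self.encoder(self.src_embed(src_ids))            # encode source into final state h
        dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), h)   # decode conditioned on h
        return self.predict(dec_out)                            # logits for each target position

model = Seq2Seq(src_vocab=8000, tgt_vocab=8000)
src = torch.randint(0, 8000, (1, 5))      # dummy source sentence
tgt = torch.randint(0, 8000, (1, 6))      # dummy target prefix (teacher forcing)
logits = model(src, tgt)                  # (1, 6, 8000)
```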


Decoding



Beam Search
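The beam-search slides are figures. As a hedged sketch of the idea: keep the k highest-scoring partial hypotheses at every step and expand each with the model's next-token log-probabilities. The log_probs function below is a hypothetical stand-in for the decoder:

```python
def beam_search(log_probs, start="<START>", end="<EOS>", beam_size=3, max_len=20):
    """log_probs(prefix) -> dict {token: log p(token | prefix)} (hypothetical model hook)."""
    beams = [([start], 0.0)]                     # (token sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end:                   # completed hypothesis: keep, do not expand
                finished.append((seq, score))
                continue
            for tok, lp in log_probs(tuple(seq)).items():
                candidates.append((seq + [tok], score + lp))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_size]
    finished.extend(b for b in beams if b[0][-1] == end)
    # Length-normalize so longer hypotheses are not unfairly penalized.
    return max(finished or beams, key=lambda x: x[1] / len(x[0]))
```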


Sequence to Sequence Model with Attention


Sequence-to-sequence: the bottleneck problem

[Figure: an encoder RNN reads the source sentence "il a m' entarté"; its final hidden state, the encoding of the source sentence, is handed to a decoder RNN, which generates the target sentence "he hit me with a pie <END>".]

• Problems with this architecture? The single encoding of the source sentence needs to
capture all information about the source sentence. Information bottleneck!


Attention
• Attention provides a solution to the bottleneck problem.

• Core idea: on each step of the decoder, use a direct connection to the encoder to
focus on a particular part of the source sequence.

• First we show this via diagrams (no equations), then with equations.
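As a tiny preview of the equations to come, the "focus" is computed by scoring each encoder hidden state against the current decoder state and normalizing with a softmax (random toy vectors below):

```python
import numpy as np

H = np.random.randn(4, 8)     # encoder hidden states h_1..h_4 (source length 4)
s = np.random.randn(8)        # current decoder hidden state s_t

scores = H @ s                                 # dot-product attention scores
attn = np.exp(scores) / np.exp(scores).sum()   # attention distribution (softmax)
print(attn, attn.sum())                        # sums to 1; peaks where the decoder "focuses"
```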


Sequence-to-sequence with attention

[Figure: for the current decoder hidden state, an attention score is computed as the dot product with each encoder RNN hidden state over the source sentence "il a m' entarté"; the same computation is repeated for every encoder position.]


Sequence-to-sequence with attention

• Take the softmax of the attention scores to turn them into a probability distribution
(the attention distribution).
• On this decoder timestep, we are mostly focusing on the first encoder hidden state ("he").


Sequence-to-sequence with attention

• Use the attention distribution to take a weighted sum of the encoder hidden states:
this is the attention output.
• The attention output mostly contains information from the hidden states that received
high attention.


Sequence-to-sequence with attention

• Concatenate the attention output with the decoder hidden state, then use the result to
compute ŷ (here the first target word, "he") as before.


Sequence-to-sequence with attention

• The same procedure is repeated on the next decoder timestep to predict the next word
("hit").
• Sometimes we take the attention output from the previous step and also feed it into the
decoder, along with the usual decoder input (sketched below).
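This "also feed the attention output into the decoder" variant amounts to concatenating the previous step's attention output to the embedded decoder input before the RNN step. A tiny sketch with illustrative names and sizes:

```python
import numpy as np

emb_dim, hidden = 8, 16
prev_attn_output = np.zeros(hidden)                # attention output from the previous step
x = np.random.randn(emb_dim)                       # embedding of the previous target word

rnn_input = np.concatenate([x, prev_attn_output])  # (emb_dim + hidden,) fed to the decoder RNN
# ...the decoder RNN cell would consume rnn_input together with its previous hidden state.
```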


Sequence-to-sequence with attention

• The process repeats on each remaining decoder timestep, producing "me", "with", "a",
and "pie".


Attention: in equations
• We have encoder hidden states h_1, …, h_N ∈ R^h.
• On timestep t, we have the decoder hidden state s_t ∈ R^h.
• We get the attention scores for this step: e^t = [s_t^T h_1, …, s_t^T h_N] ∈ R^N.
• We take the softmax to get the attention distribution α^t = softmax(e^t) ∈ R^N for this
step (this is a probability distribution and sums to 1).
• We use α^t to take a weighted sum of the encoder hidden states to get the attention
output a_t = Σ_{i=1}^{N} α^t_i h_i ∈ R^h.
• Finally we concatenate the attention output with the decoder hidden state, [a_t; s_t] ∈
R^{2h}, and proceed as in the non-attention seq2seq model.
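A minimal numpy sketch of exactly these steps for one decoder timestep (toy shapes; the projection that turns [a_t; s_t] into ŷ is a made-up linear layer):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

N, h, vocab = 5, 16, 1000
H = np.random.randn(N, h)          # encoder hidden states h_1..h_N
s_t = np.random.randn(h)           # decoder hidden state at timestep t

e_t = H @ s_t                      # attention scores: dot products s_t^T h_i
alpha_t = softmax(e_t)             # attention distribution (sums to 1)
a_t = alpha_t @ H                  # attention output: weighted sum of encoder states

W_out = np.random.randn(vocab, 2 * h)          # made-up output projection
logits = W_out @ np.concatenate([a_t, s_t])    # use [a_t; s_t] to predict the next word
y_hat = softmax(logits)                        # distribution over the target vocabulary
print(int(y_hat.argmax()))                     # index of the predicted word
```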


Attention & Alignment



Thanks a lot for your Attention
