
Advanced Machine Learning

RNN Layer Architectures


Amit Sethi
Electrical Engineering, IIT Bombay
Learning outcomes for the lecture
• List issues with vanilla RNN
• Show how LSTM and GRU overcome those issues
• List design choices for LSTM
• Derive backprop for LSTM/GRU

Contents

• Need for memory to process sequential data

• Recurrent neural networks

• LSTM basics

• Some applications of LSTM in NLP

• Some advanced LSTM structures


What is sequential data?
• One-dimensional discrete index
• Example: time instances, character position
• Each data point can be a scalar, vector, or a
symbol from an alphabet

… xn-4 xn-3 xn-2 xn-1 xn …

• Number of data points in a series can be variable
Examples of sequential data
• Speech

• Text (NLP)

• Music

• Protein and DNA sequences

• Stock prices and other time series


Traditional ML is one-to-one
• POS tagging in NLP (input: words, output: POS tag)
• Stock trade: {Buy, NoAction, Sell}

[Diagram: each output yn is predicted independently as f(xn), with no connection across time steps]

• What about taking past data into account?


Need for past data or context
• Different POS for the same word:
  – "It is a quick read" (read as a noun)
  – "I like to read" (read as a verb)
• Translation needs re-ordering:
  – "I am going"
  – मैं जा रहा हूँ (re-ordered, ideal)
  – मैं हूँ जा रहा (word by word, less than ideal)
Using traditional ML for sequences
• Work with a fixed window

[Diagram: each output yn is predicted from a fixed window of recent inputs, e.g. yn = y(xn-2, xn-1, xn)]

• What about influence of distant past?


What if we want to do many-to-one?
• Sentiment analysis in NLP

[Diagram: a single output yn is produced from the whole input sequence … xn-4 xn-3 xn-2 xn-1 xn …]

• What about taking past data into account?


Using traditional ML for sequences
• Convert sequence into a feature vector

[Diagram: per-element features f(xn) are pooled into a histogram g(h), from which a single output yn is predicted]

• What about using the order of the data?
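A minimal sketch of this "pool per-element features into a histogram" idea in Python; the alphabet, normalization, and toy sequence are illustrative assumptions, not from the slides:

  import numpy as np

  ALPHABET = {"a": 0, "b": 1, "c": 2}          # assumed symbol alphabet

  def histogram_features(sequence):
      """Bag-of-symbols histogram: count each symbol, ignoring order."""
      counts = np.zeros(len(ALPHABET))
      for symbol in sequence:
          counts[ALPHABET[symbol]] += 1
      return counts / max(len(sequence), 1)    # normalize away sequence length

  print(histogram_features(list("abacabb")))   # the order information is lost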


Introducing memory (recurrence or state)
in neural networks
• In addition to an output, a memory state is computed and passed on to the next time instance

[Diagram: unrolled recurrence — at each step the output yn and the new state hn are both computed from the current input xn and the previous state hn-1, and hn feeds the next step]


Another view of recurrence
• A memory state is computed in addition to the output and is passed to the next time instance through a delay element

[Diagram: block view with inputs x1,n … xa,n, state units h1,n … hb,n fed back via delay units, and outputs y1,n … yc,n]
Types of analysis possible on
sequential data using “recurrence”

• One to one

• One to many

• Many to one

• Many to many
Examples: Many to many
• POS tagging in NLP
• Stock trade: {Buy, NoAction, Sell}

[Diagram: an output yn is produced at every time step from the state h(x,h), which is carried across steps]


Examples: many to one
• Sentiment analysis in NLP

[Diagram: the state is carried across all steps, but only the final output yn is used]


Examples: One to many
• Generate caption based on an image
• Generate text given topic

[Diagram: a single input xn initializes the state, and outputs yn, yn+1, yn+2, … are generated at successive steps]
Examples: Many to many
• Language translation

[Diagram: the outputs are emitted only after part of the input sequence … xn-3 xn-2 xn-1 xn has been read, which allows re-ordering across languages]


Contents

• Need for memory to process sequential data

• Recurrent neural networks

• LSTM basics

• Some applications of LSTM in NLP

• Some advanced LSTM structures


Revising feedforward neural networks

[Diagram: feedforward network — input x = (x1 … xd), hidden layer h = ReLU(Wxh x), output y = softmax(Why h)]
Recurrent neural networks
• Vanilla RNNs use the hidden-layer activation as the state

[Diagram: vanilla RNN cell — hn = tanh(Whh hn-1 + Wxh xn), and the output is yn = tanh(Why hn)]
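A minimal vanilla-RNN forward pass in NumPy using the slide's notation (Wxh, Whh, Why); the sizes and random weights are illustrative assumptions:

  import numpy as np

  rng = np.random.default_rng(0)
  d, m, k = 4, 8, 3                          # input, hidden, output sizes (assumed)
  Wxh = rng.normal(scale=0.1, size=(m, d))
  Whh = rng.normal(scale=0.1, size=(m, m))
  Why = rng.normal(scale=0.1, size=(k, m))

  def rnn_forward(xs, h0=None):
      """Run the recurrence h_n = tanh(Whh h_{n-1} + Wxh x_n), y_n = Why h_n."""
      h = np.zeros(m) if h0 is None else h0
      hs, ys = [], []
      for x in xs:
          h = np.tanh(Whh @ h + Wxh @ x)     # new state from old state and input
          y = Why @ h                        # read the output off the state
          hs.append(h); ys.append(y)
      return hs, ys

  xs = [rng.normal(size=d) for _ in range(5)]
  hs, ys = rnn_forward(xs)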
Backpropagation through time (BPTT)
• Just as forward propagation uses the previous state, backpropagation uses the derivative arriving from the future outputs

[Diagram: the same RNN cell shown twice — the forward pass passes hn-1 forward, the backward pass passes the gradient from the future back through Whh]
Use of a window length
• We need to limit how far back in time the gradient travels

[Diagram: the RNN unrolled from xn-m to xn; gradients are propagated back only within this window (truncated BPTT)]
Mathematical expression for BPTT
• Forward:
  $\mathbf{y}_n = \mathbf{g}(\mathbf{W}_2 \mathbf{h}_n) = \mathbf{g}\big(\mathbf{W}_2\, \mathbf{f}(\mathbf{W}_{11} \mathbf{h}_{n-1} + \mathbf{W}_{12} \mathbf{x}_n)\big)$

• Backward example:
  $\dfrac{\partial \mathbf{y}_n}{\partial \mathbf{W}_{11}} = \mathbf{g}'\, \mathbf{W}_2\, \dfrac{\partial \mathbf{h}_n}{\partial \mathbf{W}_{11}} = \mathbf{g}'\, \mathbf{W}_2\, \mathbf{f}'\Big(\mathbf{h}_{n-1} + \mathbf{W}_{11}\dfrac{\partial \mathbf{h}_{n-1}}{\partial \mathbf{W}_{11}}\Big)$
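The recursion above can be unrolled into a sketch of truncated BPTT for the vanilla RNN from the earlier example (it reuses Whh, m, and the states hs from that sketch; the loss is assumed to act only on the last state, for brevity):

  def bptt_whh_grad(xs, hs, dL_dh_last):
      """Accumulate dL/dWhh over the window, stepping backwards in time."""
      dWhh = np.zeros_like(Whh)
      dh = dL_dh_last                            # gradient arriving at the last state
      for t in reversed(range(len(xs))):
          da = dh * (1.0 - hs[t] ** 2)           # through tanh: h = tanh(a)
          h_prev = hs[t - 1] if t > 0 else np.zeros(m)
          dWhh += np.outer(da, h_prev)           # contribution of time step t
          dh = Whh.T @ da                        # one more multiplication by Whh^T
      return dWhh

Note that each backward step multiplies the running gradient by Whh (transposed) — exactly the repeated factor discussed on the next slide.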
Vanishing and exploding gradient
• Gradient gets repeatedly multiplied by Whh
• This can lead to vanishing or exploding gradients, depending on the norm of Whh

[Diagram: the RNN unrolled over the window — each backward step multiplies the gradient by Whh (and a tanh derivative)]
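A common mitigation for the exploding side (not the vanishing side) is to clip the gradient norm before each update; a minimal sketch, with an arbitrarily chosen threshold:

  import numpy as np

  def clip_by_global_norm(grads, max_norm=5.0):
      """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
      total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
      scale = min(1.0, max_norm / (total + 1e-8))
      return [g * scale for g in grads]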
Exploding and vanishing gradients (1 of 2)

Source: ”Supervised Sequence Labelling with Recurrent Neural Networks.” by Graves, Alex.
Exploding and vanishing gradients (2 of 2)

Source: ”Supervised Sequence Labelling with Recurrent Neural Networks.” by Graves, Alex.
Contents

• Need for memory to process sequential data

• Recurrent neural networks

• LSTM basics

• Some applications of LSTM in NLP

• Some advanced LSTM structures


Introducing… a forget gate to control the gradient

• The state is not multiplied by a constant weight (unlike Whh in a vanilla RNN)
• A gate function f (usually a sigmoid) represents on or off
• If f = 0, the state is forgotten and replaced by the input g
• But what if f = 1? How do we control the influence of the input?

[Diagram: a memory cell with state st, cell input g(act), forget gate f, and the constant error carrousel (CEC); the gate is driven by the current input and the delayed hidden-layer output]

Adapted from: "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves
Adding input and output gates

• On similar lines as the forget gate, an input gate decides whether the input will override the state or not
• Similarly, an output gate decides whether the output will be passed out or not
• The inputs to all these gates are the NN inputs and the NN hidden-layer outputs

[Diagram: the LSTM cell with input, forget, and output gates around the CEC state st; the cell input is g(act) and the cell output is h(st)]

Adapted from: "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves
Another view

• CEC is the constant error carrousel
  – No vanishing gradients
  – But it is not always on
• Introducing gates:
  – Allow or disallow the input
  – Allow or disallow the output
  – Remember or forget the state

[Diagram: the input gate multiplies the cell input g, the forget gate multiplies the previous state s, and the output gate multiplies the cell output; each gate is computed from the current input and the previous hidden output]
A few words about the LSTM
• CEC: With the forget gate, the forward influence of the state can be modulated so that it is remembered for a long time, until the state or the input changes enough to make the LSTM forget it. This path that passes the past state unaltered to the future state (and the gradient backward) is called the constant error carrousel (CEC). It gives the LSTM its ability to remember long term (hence, long short-term memory)
• Blocks: Since there are too many weights to be learnt for a single state bit, several state bits can be combined into a single block such that the state bits in a block share gates
• Peepholes: The state itself can be an input to the gates through peephole connections
• GRU: In a variant of the LSTM called the gated recurrent unit (GRU), the input gate can simply be one minus the forget gate. That is, if the state is being forgotten, replace it with the input; if it is being remembered, block the input
LSTM block

[Diagram: an LSTM block inside the network — input layer x1t … xIt, a hidden layer of LSTM blocks (Block 1 … Block H/C), and output layer y1t … yKt. Each block contains the CEC state sct, input gate, forget gate, output gate, cell input g(act), and cell output h(sct); the legend distinguishes current and delayed connections]

Adapted from: "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves
Adding peep-holes

[Diagram: the same LSTM block, now with peephole connections from the cell state sct to the input, forget, and output gates]

Adapted from: "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves
Forward Pass

Input gate:
$a_l^t = \sum_{i=1}^{I} w_{il}\, x_i^t + \sum_{h=1}^{H} w_{hl}\, b_h^{t-1} + \sum_{c=1}^{C} w_{cl}\, s_c^{t-1}$
$b_l^t = f(a_l^t)$

Adapted from: "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves
Forward Pass

Forget gate:
$a_\phi^t = \sum_{i=1}^{I} w_{i\phi}\, x_i^t + \sum_{h=1}^{H} w_{h\phi}\, b_h^{t-1} + \sum_{c=1}^{C} w_{c\phi}\, s_c^{t-1}$
$b_\phi^t = f(a_\phi^t)$

Cell input and state:
$a_c^t = \sum_{i=1}^{I} w_{ic}\, x_i^t + \sum_{h=1}^{H} w_{hc}\, b_h^{t-1}$
$s_c^t = b_\phi^t\, s_c^{t-1} + b_l^t\, g(a_c^t)$

Adapted from: "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves
Forward Pass

Output gate:
$a_\omega^t = \sum_{i=1}^{I} w_{i\omega}\, x_i^t + \sum_{h=1}^{H} w_{h\omega}\, b_h^{t-1} + \sum_{c=1}^{C} w_{c\omega}\, s_c^t$
$b_\omega^t = f(a_\omega^t)$

Cell output:
$b_c^t = b_\omega^t\, h(s_c^t)$

Adapted from: "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves
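Putting the forward-pass equations together, here is a single-step LSTM cell in NumPy with peepholes; the weight dictionary, the shapes, and the choice g = h = tanh are assumptions for illustration:

  import numpy as np

  def sigmoid(a):
      return 1.0 / (1.0 + np.exp(-a))

  def lstm_step(x, b_prev, s_prev, W):
      """One LSTM block step following the slide equations (vectors over C cells)."""
      a_l   = W["W_il"] @ x + W["W_hl"] @ b_prev + W["w_cl"] * s_prev        # input gate
      b_l   = sigmoid(a_l)
      a_phi = W["W_iphi"] @ x + W["W_hphi"] @ b_prev + W["w_cphi"] * s_prev  # forget gate
      b_phi = sigmoid(a_phi)
      a_c   = W["W_ic"] @ x + W["W_hc"] @ b_prev                             # cell input
      s     = b_phi * s_prev + b_l * np.tanh(a_c)                            # s_t = b_phi s_{t-1} + b_l g(a_c)
      a_om  = W["W_iom"] @ x + W["W_hom"] @ b_prev + W["w_com"] * s          # output gate peeks at the new state
      b_om  = sigmoid(a_om)
      b_c   = b_om * np.tanh(s)                                              # cell output b_c = b_omega h(s_t)
      return b_c, s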
Revisiting backpropagation through b-diagrams
• An efficient way to perform gradient descent in NNs
• Efficiency comes from local computations
• This can be visualized using b-diagrams
  – Propagate x (actually w) forward
  – Propagate 1 backward

Source: “Neural Networks - A Systematic Introduction,” by Raul Rojas, Springer-Verlag, Berlin, New-York, 1996.
Chain rule using b-diagram

Source: “Neural Networks - A Systematic Introduction,” by Raul Rojas, Springer-Verlag, Berlin, New-York, 1996.
Addition of functions using b-diagram

Source: “Neural Networks - A Systematic Introduction,” by Raul Rojas, Springer-Verlag, Berlin, New-York, 1996.
Weighted edge on a b-diagram

Source: “Neural Networks - A Systematic Introduction,” by Raul Rojas, Springer-Verlag, Berlin, New-York, 1996.
Product in a b-diagram

[Diagram: in the forward pass the product node outputs f·g; in the backward pass the incoming 1 is routed to the f-branch weighted by g and to the g-branch weighted by f, yielding f′g and g′f at the inputs]
Backpropagation

Definitions:
$\epsilon_c^t \equiv \dfrac{\partial L}{\partial b_c^t}, \qquad \epsilon_s^t \equiv \dfrac{\partial L}{\partial s_c^t}$

Cell output:
$\epsilon_c^t = \sum_{k=1}^{K} w_{ck}\, \delta_k^t + \sum_{g=1}^{G} w_{cg}\, \delta_g^{t+1}$

Adapted from: "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves
Backpropagation

Output gate:
$\delta_\omega^t = f'(a_\omega^t) \sum_{c=1}^{C} h(s_c^t)\, \epsilon_c^t$

State:
$\epsilon_s^t = b_\omega^t\, h'(s_c^t)\, \epsilon_c^t + b_\phi^{t+1}\, \epsilon_s^{t+1} + w_{cl}\, \delta_l^{t+1} + w_{c\phi}\, \delta_\phi^{t+1} + w_{c\omega}\, \delta_\omega^t$

Cells:
$\delta_c^t = b_l^t\, g'(a_c^t)\, \epsilon_s^t$

Adapted from: "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves
Backpropagation

Forget gate:
$\delta_\phi^t = f'(a_\phi^t) \sum_{c=1}^{C} s_c^{t-1}\, \epsilon_s^t$

Input gate:
$\delta_l^t = f'(a_l^t) \sum_{c=1}^{C} g(a_c^t)\, \epsilon_s^t$

Adapted from: "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves
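One practical way to validate a hand-derived backward pass like the one above is a finite-difference check against the forward computation; a generic sketch (the loss closure and the parameter array are whatever is being checked, e.g. the lstm_step weights from the earlier example):

  import numpy as np

  def numerical_grad(loss_fn, param, eps=1e-5):
      """Central-difference estimate of d loss / d param, entry by entry."""
      grad = np.zeros_like(param)
      it = np.nditer(param, flags=["multi_index"])
      for _ in it:
          idx = it.multi_index
          old = param[idx]
          param[idx] = old + eps; plus = loss_fn()
          param[idx] = old - eps; minus = loss_fn()
          param[idx] = old                      # restore the original value
          grad[idx] = (plus - minus) / (2 * eps)
      return grad

Comparing the result entry-wise against the analytic deltas quickly exposes sign or indexing errors in the derivation.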
Contents

• Need for memory to process sequential data

• Recurrent neural networks

• LSTM basics

• Some applications of LSTM in NLP

• Some advanced LSTM structures


Gated Recurrent Unit (GRU)
• Reduces the need for an input gate by reusing the forget gate

[Diagram: the input gate is replaced by (1 − f), where f is the forget gate, so the state update becomes s ← f · s + (1 − f) · g]
GRUs combine input and forget gates

[Figure: an RNN unit, an LSTM unit, an LSTM unit with peepholes, and a Gated Recurrent Unit, which combines the input and forget gates as f and 1 − f]

Source: Cho, et al. (2014), and "Understanding LSTM Networks", by C. Olah, http://colah.github.io/posts/2015-08-Understanding-LSTMs/
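A minimal sketch of this gate tying in Python, in the spirit of these slides: a single gate f acts as the forget gate and 1 − f replaces the input gate (the reset gate of the full GRU of Cho et al. is omitted, and the shapes in W are assumptions):

  import numpy as np

  def sigmoid(a):
      return 1.0 / (1.0 + np.exp(-a))

  def gru_like_step(x, s_prev, W):
      f = sigmoid(W["W_if"] @ x + W["W_hf"] @ s_prev)    # forget/update gate
      g = np.tanh(W["W_ig"] @ x + W["W_hg"] @ s_prev)    # candidate state
      return f * s_prev + (1.0 - f) * g                  # keep the old state or take the new input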
Sentence generation
• Very common for image captioning
• Input is given only in the beginning
• This is a one-to-many task

[Diagram: a CNN encodes the image; its features seed the LSTM, which then generates "A boy swimming in water <END>" one word per step]
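A sketch of the one-to-many decoding loop used for captioning; everything here (cnn_features, lstm_step_fn, embed, project_to_vocab, the token ids) is a hypothetical placeholder, not an API from the slides:

  def greedy_decode(cnn_features, lstm_step_fn, embed, project_to_vocab,
                    start_id, end_id, max_len=20):
      """Seed the state with the image feature, then feed each output back in."""
      state = cnn_features                               # image feature initializes the LSTM state
      token, caption = start_id, []
      for _ in range(max_len):
          out, state = lstm_step_fn(embed(token), state)
          token = int(project_to_vocab(out).argmax())    # greedy word choice
          if token == end_id:                            # stop at <END>
              break
          caption.append(token)
      return caption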
Video Caption Generation

Source: "Translating Videos to Natural Language Using Deep Recurrent Neural Networks", Venugopalan et al., arXiv 2014
Pre-processing for NLP
• The most basic pre-processing is to convert
words into an embedding using Word2Vec or
GloVe

• Otherwise, a one-hot input vector can be too long and sparse, and requires lots of input weights
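A minimal illustration of why the embedding lookup helps; the vocabulary size, embedding dimension, and the random matrix are assumptions (in practice the matrix comes from Word2Vec/GloVe or is learned end-to-end):

  import numpy as np

  V, d = 50_000, 300                                   # vocabulary and embedding sizes (assumed)
  embedding_matrix = np.random.default_rng(0).normal(size=(V, d)).astype(np.float32)
  word_to_id = {"very": 11, "pleased": 42}             # hypothetical vocabulary entries

  def embed(word):
      return embedding_matrix[word_to_id[word]]        # dense d-dim vector instead of a V-dim one-hot

  x = embed("pleased")                                 # shape (300,), not (50000,)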
Sentiment analysis
• Very common for customer review or news article analysis
• Output before the end can be discarded (not
used for backpropagation)
• This is a many-to-one task
[Diagram: word embeddings of "Very pleased with their service <END>" feed an LSTM at each step; only the final output ("Positive") is used]
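A many-to-one classifier along these lines, sketched in PyTorch (layer sizes are assumptions); only the final hidden state reaches the loss, matching "outputs before the end are discarded":

  import torch
  import torch.nn as nn

  class SentimentLSTM(nn.Module):
      def __init__(self, vocab_size=50_000, embed_dim=300, hidden_dim=128, classes=2):
          super().__init__()
          self.embed = nn.Embedding(vocab_size, embed_dim)
          self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
          self.head = nn.Linear(hidden_dim, classes)

      def forward(self, token_ids):                 # token_ids: (batch, seq_len)
          e = self.embed(token_ids)                 # (batch, seq_len, embed_dim)
          _, (h_last, _) = self.lstm(e)             # h_last: (num_layers, batch, hidden_dim)
          return self.head(h_last[-1])              # one logit vector per sequence

  logits = SentimentLSTM()(torch.randint(0, 50_000, (4, 12)))   # shape (4, 2)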


Machine translation using encoder-decoder

[Figure: an encoder RNN summarizes the source sentence into a vector, which a decoder RNN expands into the target sentence]

Source: "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation", Cho et al., arXiv 2014
Multi-layer LSTM
• More than one hidden layer can be used

[Diagram: a stack of two LSTM layers — the first hidden layer LSTM1 feeds the second hidden layer LSTM2 at every time step, and LSTM2 produces the outputs yn]


Bi-directional LSTM
• Many problems require a reverse flow of
information as well
• For example, POS tagging may require context
from future words

[Diagram: a forward LSTM layer reads the sequence left-to-right and a backward layer reads it right-to-left; their outputs are combined to produce each yn]
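In most libraries, the stacked and bi-directional variants are configuration flags rather than new code; a PyTorch sketch with assumed sizes:

  import torch
  import torch.nn as nn

  lstm = nn.LSTM(input_size=300, hidden_size=128,
                 num_layers=2,                 # stacked LSTM1 -> LSTM2, as on the previous slide
                 bidirectional=True,           # adds the reverse-direction pass
                 batch_first=True)

  x = torch.randn(4, 10, 300)                  # (batch, time, features)
  out, (h, c) = lstm(x)
  print(out.shape)                             # (4, 10, 256): forward and backward outputs concatenated
  print(h.shape)                               # (4, 4, 128): num_layers * num_directions = 4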


Some problems with LSTMs and their troubleshooting
• Inappropriate model
  – Identify the problem: one-to-many, many-to-one, etc.
  – Compute the loss only for the outputs that matter
  – Use separate LSTMs for separate languages
• High training loss: model not expressive enough
  – Too few hidden nodes
  – Only one hidden layer
  – Not bi-directional
• Overfitting: model has too much freedom
  – Too many hidden nodes
  – Too many blocks
  – Too many layers
Multi-dimensional RNNs

Source: ”Supervised Sequence Labelling with Recurrent Neural Networks.” by Graves, Alex.
In summary, LSTMs are powerful
• Using recurrent connections is an old idea
• It suffered from lack of gradient control over long term
• CEC was an important innovation for remembering states
long term
• This has many applications in time series modeling
• Newer innovations are:
– Forget gates
– Peepholes
– Combining input and forget gate in GRU
• LSTM can be generalized in direction, dimension, and number of hidden layers to produce more complex models
