
Advanced Machine Learning

RNN Layer Architectures


Amit Sethi
Electrical Engineering, IIT Bombay
Learning outcomes for the lecture
• List issues with vanilla RNN
• Show how LSTM and GRU overcome those issues
• List design choices for LSTM
• Derive backprop for LSTM/GRU

Contents

• Need for memory to process sequential data

• Recurrent neural networks

• LSTM basics

• Some applications of LSTM in NLP

• Some advanced LSTM structures


What is sequential data?
• One-dimensional discrete index
• Example: time instances, character position
• Each data point can be a scalar, vector, or a
symbol from an alphabet

… xn-4 xn-3 xn-2 xn-1 xn …

• Number of data points in a series can be variable
Examples of sequential data
• Speech

• Text (NLP)

• Music

• Protein and DNA sequences

• Stock prices and other time series


Traditional ML is one-to-one
• POS tagging in NLP (input: words, output: POS tag)
• Stock trade: {Buy, NoAction, Sell}

[Diagram: each output yn is predicted independently as f(xn), with no connection across time steps]

• What about taking past data into account?


Need for past data or context
• Different POS for the same word:
  – "It is a quick read" (read as a noun)
  – "I like to read" (read as a verb)
• Translation needs re-ordering:
  – "I am going"
  – मैं जा रहा हूँ (re-ordered, ideal)
  – मैं हूँ जा रहा (word by word, less than ideal)
Using traditional ML for sequences
• Work with a fixed window

[Diagram: each output yn is predicted from a fixed window of recent inputs, e.g. yn = y(xn-2, xn-1, xn)]

• What about influence of distant past?


What if we want to do many-to-one?
• Sentiment analysis in NLP

[Diagram: a single output yn is produced from the whole input sequence … xn-4 xn-3 xn-2 xn-1 xn …]

• What about taking past data into account?


Using traditional ML for sequences
• Convert sequence into a feature vector

[Diagram: per-element features f(xn) are pooled into a histogram g(h), from which a single output yn is predicted]

• What about using the order of the data?
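A minimal sketch of this "pool per-element features into a histogram" idea in Python; the alphabet, normalization, and toy sequence are illustrative assumptions, not from the slides:

  import numpy as np

  ALPHABET = {"a": 0, "b": 1, "c": 2}          # assumed symbol alphabet

  def histogram_features(sequence):
      """Bag-of-symbols histogram: count each symbol, ignoring order."""
      counts = np.zeros(len(ALPHABET))
      for symbol in sequence:
          counts[ALPHABET[symbol]] += 1
      return counts / max(len(sequence), 1)    # normalize away sequence length

  print(histogram_features(list("abacabb")))   # the order information is lost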


Introducing memory (recurrence or state)
in neural networks
• In addition to an output, a memory state is computed and passed on to the next time instance

[Diagram: unrolled recurrence — at each step the output yn and the new state hn are both computed from the current input xn and the previous state hn-1, and hn feeds the next step]


Another view of recurrence
• A memory state is computed in addition to the output and is passed to the next time instance through a delay element

[Diagram: block view with inputs x1,n … xa,n, state units h1,n … hb,n fed back via delay units, and outputs y1,n … yc,n]
Types of analysis possible on
sequential data using “recurrence”

• One to one

• One to many

• Many to one

• Many to many
Examples: Many to many
• POS tagging in NLP
• Stock trade: {Buy, NoAction, Sell}

[Diagram: an output yn is produced at every time step from the state h(x,h), which is carried across steps]


Examples: many to one
• Sentiment analysis in NLP

[Diagram: the state is carried across all steps, but only the final output yn is used]


Examples: One to many
• Generate caption based on an image
• Generate text given topic

[Diagram: a single input xn initializes the state, and outputs yn, yn+1, yn+2, … are generated at successive steps]
Examples: Many to many
• Language translation

[Diagram: the outputs are emitted only after part of the input sequence … xn-3 xn-2 xn-1 xn has been read, which allows re-ordering across languages]


Contents

• Need for memory to process sequential data

• Recurrent neural networks

• LSTM basics

• Some applications of LSTM in NLP

• Some advanced LSTM structures


Revising feedforward neural networks

[Diagram: feedforward network — input x = (x1 … xd), hidden layer h = ReLU(Wxh x), output y = softmax(Why h)]
Recurrent neural networks
• Vanilla RNNs use the hidden-layer activation as the state

[Diagram: vanilla RNN cell — hn = tanh(Whh hn-1 + Wxh xn), and the output is yn = tanh(Why hn)]
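A minimal vanilla-RNN forward pass in NumPy using the slide's notation (Wxh, Whh, Why); the sizes and random weights are illustrative assumptions:

  import numpy as np

  rng = np.random.default_rng(0)
  d, m, k = 4, 8, 3                          # input, hidden, output sizes (assumed)
  Wxh = rng.normal(scale=0.1, size=(m, d))
  Whh = rng.normal(scale=0.1, size=(m, m))
  Why = rng.normal(scale=0.1, size=(k, m))

  def rnn_forward(xs, h0=None):
      """Run the recurrence h_n = tanh(Whh h_{n-1} + Wxh x_n), y_n = Why h_n."""
      h = np.zeros(m) if h0 is None else h0
      hs, ys = [], []
      for x in xs:
          h = np.tanh(Whh @ h + Wxh @ x)     # new state from old state and input
          y = Why @ h                        # read the output off the state
          hs.append(h); ys.append(y)
      return hs, ys

  xs = [rng.normal(size=d) for _ in range(5)]
  hs, ys = rnn_forward(xs)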
Backpropagation through time (BPTT)
• Just as forward propagation uses the previous state, backpropagation uses the derivative arriving from the future outputs

[Diagram: the same RNN cell shown twice — the forward pass passes hn-1 forward, the backward pass passes the gradient from the future back through Whh]
Use of a window length
• We need to limit how far back in time the gradient travels

[Diagram: the RNN unrolled from xn-m to xn; gradients are propagated back only within this window (truncated BPTT)]
Mathematical expression for BPTT
• Forward:
  $\mathbf{y}_n = \mathbf{g}(\mathbf{W}_2 \mathbf{h}_n) = \mathbf{g}\big(\mathbf{W}_2\, \mathbf{f}(\mathbf{W}_{11} \mathbf{h}_{n-1} + \mathbf{W}_{12} \mathbf{x}_n)\big)$

• Backward example:
  $\dfrac{\partial \mathbf{y}_n}{\partial \mathbf{W}_{11}} = \mathbf{g}'\, \mathbf{W}_2\, \dfrac{\partial \mathbf{h}_n}{\partial \mathbf{W}_{11}} = \mathbf{g}'\, \mathbf{W}_2\, \mathbf{f}'\Big(\mathbf{h}_{n-1} + \mathbf{W}_{11}\dfrac{\partial \mathbf{h}_{n-1}}{\partial \mathbf{W}_{11}}\Big)$
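The recursion above can be unrolled into a sketch of truncated BPTT for the vanilla RNN from the earlier example (it reuses Whh, m, and the states hs from that sketch; the loss is assumed to act only on the last state, for brevity):

  def bptt_whh_grad(xs, hs, dL_dh_last):
      """Accumulate dL/dWhh over the window, stepping backwards in time."""
      dWhh = np.zeros_like(Whh)
      dh = dL_dh_last                            # gradient arriving at the last state
      for t in reversed(range(len(xs))):
          da = dh * (1.0 - hs[t] ** 2)           # through tanh: h = tanh(a)
          h_prev = hs[t - 1] if t > 0 else np.zeros(m)
          dWhh += np.outer(da, h_prev)           # contribution of time step t
          dh = Whh.T @ da                        # one more multiplication by Whh^T
      return dWhh

Note that each backward step multiplies the running gradient by Whh (transposed) — exactly the repeated factor discussed on the next slide.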
Vanishing and exploding gradient
• Gradient gets repeatedly multiplied by Whh
• This can lead to vanishing or exploding gradients, depending on the norm of Whh

[Diagram: the RNN unrolled over the window — each backward step multiplies the gradient by Whh (and a tanh derivative)]
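A common mitigation for the exploding side (not the vanishing side) is to clip the gradient norm before each update; a minimal sketch, with an arbitrarily chosen threshold:

  import numpy as np

  def clip_by_global_norm(grads, max_norm=5.0):
      """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
      total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
      scale = min(1.0, max_norm / (total + 1e-8))
      return [g * scale for g in grads]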
Exploding and vanishing gradients (1 of 2)

Source: ”Supervised Sequence Labelling with Recurrent Neural Networks.” by Graves, Alex.
Exploding and vanishing gradients (2 of 2)

Source: ”Supervised Sequence Labelling with Recurrent Neural Networks.” by Graves, Alex.
Contents

• Need for memory to process sequential data

• Recurrent neural networks

• LSTM basics

• Some applications of LSTM in NLP

• Some advanced LSTM structures


Introducing… a forget gate to control the gradient

• The state is not multiplied by a constant weight (unlike Whh in a vanilla RNN)
• A gate function f (usually a sigmoid) represents on or off
• If f = 0, the state is forgotten and replaced by the input g
• But what if f = 1? How do we control the influence of the input?

[Diagram: a memory cell with state st, cell input g(act), forget gate f, and the constant error carrousel (CEC); the gate is driven by the current input and the delayed hidden-layer output]

Adapted from: "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves
Adding input and output gates

• On similar lines as the forget gate, an input gate decides whether the input will override the state or not
• Similarly, an output gate decides whether the output will be passed out or not
• The inputs to all these gates are the NN inputs and the NN hidden-layer outputs

[Diagram: the LSTM cell with input, forget, and output gates around the CEC state st; the cell input is g(act) and the cell output is h(st)]

Adapted from: "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves
Another view

• CEC is the constant error carrousel
  – No vanishing gradients
  – But it is not always on
• Introducing gates:
  – Allow or disallow the input
  – Allow or disallow the output
  – Remember or forget the state

[Diagram: the input gate multiplies the cell input g, the forget gate multiplies the previous state s, and the output gate multiplies the cell output; each gate is computed from the current input and the previous hidden output]
A few words about the LSTM
• CEC: With the forget gate, the forward influence of the state can be modulated so that it is remembered for a long time, until the state or the input changes enough to make the LSTM forget it. This path that passes the past state unaltered to the future state (and the gradient backward) is called the constant error carrousel (CEC). It gives the LSTM its ability to remember long term (hence, long short-term memory)
• Blocks: Since there are too many weights to be learnt for a single state bit, several state bits can be combined into a single block such that the state bits in a block share gates
• Peepholes: The state itself can be an input to the gates through peephole connections
• GRU: In a variant of the LSTM called the gated recurrent unit (GRU), the input gate can simply be one minus the forget gate. That is, if the state is being forgotten, replace it with the input; if it is being remembered, block the input
LSTM block

[Diagram: an LSTM block inside the network — input layer x1t … xIt, a hidden layer of LSTM blocks (Block 1 … Block H/C), and output layer y1t … yKt. Each block contains the CEC state sct, input gate, forget gate, output gate, cell input g(act), and cell output h(sct); the legend distinguishes current and delayed connections]

Adapted from: "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves
Adding peep-holes

[Diagram: the same LSTM block, now with peephole connections from the cell state sct to the input, forget, and output gates]

Adapted from: "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves
Forward Pass

Input gate:
$a_l^t = \sum_{i=1}^{I} w_{il}\, x_i^t + \sum_{h=1}^{H} w_{hl}\, b_h^{t-1} + \sum_{c=1}^{C} w_{cl}\, s_c^{t-1}$
$b_l^t = f(a_l^t)$

Adapted from: "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves
Forward Pass

Forget gate:
$a_\phi^t = \sum_{i=1}^{I} w_{i\phi}\, x_i^t + \sum_{h=1}^{H} w_{h\phi}\, b_h^{t-1} + \sum_{c=1}^{C} w_{c\phi}\, s_c^{t-1}$
$b_\phi^t = f(a_\phi^t)$

Cell input and state:
$a_c^t = \sum_{i=1}^{I} w_{ic}\, x_i^t + \sum_{h=1}^{H} w_{hc}\, b_h^{t-1}$
$s_c^t = b_\phi^t\, s_c^{t-1} + b_l^t\, g(a_c^t)$

Adapted from: "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves
Forward Pass

Output gate:
$a_\omega^t = \sum_{i=1}^{I} w_{i\omega}\, x_i^t + \sum_{h=1}^{H} w_{h\omega}\, b_h^{t-1} + \sum_{c=1}^{C} w_{c\omega}\, s_c^t$
$b_\omega^t = f(a_\omega^t)$

Cell output:
$b_c^t = b_\omega^t\, h(s_c^t)$

Adapted from: "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves
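Putting the forward-pass equations together, here is a single-step LSTM cell in NumPy with peepholes; the weight dictionary, the shapes, and the choice g = h = tanh are assumptions for illustration:

  import numpy as np

  def sigmoid(a):
      return 1.0 / (1.0 + np.exp(-a))

  def lstm_step(x, b_prev, s_prev, W):
      """One LSTM block step following the slide equations (vectors over C cells)."""
      a_l   = W["W_il"] @ x + W["W_hl"] @ b_prev + W["w_cl"] * s_prev        # input gate
      b_l   = sigmoid(a_l)
      a_phi = W["W_iphi"] @ x + W["W_hphi"] @ b_prev + W["w_cphi"] * s_prev  # forget gate
      b_phi = sigmoid(a_phi)
      a_c   = W["W_ic"] @ x + W["W_hc"] @ b_prev                             # cell input
      s     = b_phi * s_prev + b_l * np.tanh(a_c)                            # s_t = b_phi s_{t-1} + b_l g(a_c)
      a_om  = W["W_iom"] @ x + W["W_hom"] @ b_prev + W["w_com"] * s          # output gate peeks at the new state
      b_om  = sigmoid(a_om)
      b_c   = b_om * np.tanh(s)                                              # cell output b_c = b_omega h(s_t)
      return b_c, s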
Revisiting backpropagation through b-diagrams
• An efficient way to perform gradient descent in NNs
• Efficiency comes from local computations
• This can be visualized using b-diagrams
  – Propagate x (actually w) forward
  – Propagate 1 backward

Source: “Neural Networks - A Systematic Introduction,” by Raul Rojas, Springer-Verlag, Berlin, New-York, 1996.
Chain rule using b-diagram

Source: “Neural Networks - A Systematic Introduction,” by Raul Rojas, Springer-Verlag, Berlin, New-York, 1996.
Addition of functions using b-diagram

Source: “Neural Networks - A Systematic Introduction,” by Raul Rojas, Springer-Verlag, Berlin, New-York, 1996.
Weighted edge on a b-diagram

Source: “Neural Networks - A Systematic Introduction,” by Raul Rojas, Springer-Verlag, Berlin, New-York, 1996.
Product in a b-diagram

[Diagram: in the forward pass the product node outputs f·g; in the backward pass the incoming 1 is routed to the f-branch weighted by g and to the g-branch weighted by f, yielding f′g and g′f at the inputs]
Backpropagation

Definitions:
$\epsilon_c^t \equiv \dfrac{\partial L}{\partial b_c^t}, \qquad \epsilon_s^t \equiv \dfrac{\partial L}{\partial s_c^t}$

Cell output:
$\epsilon_c^t = \sum_{k=1}^{K} w_{ck}\, \delta_k^t + \sum_{g=1}^{G} w_{cg}\, \delta_g^{t+1}$

Adapted from: "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves
Backpropagation

Output gate:
$\delta_\omega^t = f'(a_\omega^t) \sum_{c=1}^{C} h(s_c^t)\, \epsilon_c^t$

State:
$\epsilon_s^t = b_\omega^t\, h'(s_c^t)\, \epsilon_c^t + b_\phi^{t+1}\, \epsilon_s^{t+1} + w_{cl}\, \delta_l^{t+1} + w_{c\phi}\, \delta_\phi^{t+1} + w_{c\omega}\, \delta_\omega^t$

Cells:
$\delta_c^t = b_l^t\, g'(a_c^t)\, \epsilon_s^t$

Adapted from: "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves
Backpropagation

Forget gate:
$\delta_\phi^t = f'(a_\phi^t) \sum_{c=1}^{C} s_c^{t-1}\, \epsilon_s^t$

Input gate:
$\delta_l^t = f'(a_l^t) \sum_{c=1}^{C} g(a_c^t)\, \epsilon_s^t$

Adapted from: "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves
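One practical way to validate a hand-derived backward pass like the one above is a finite-difference check against the forward computation; a generic sketch (the loss closure and the parameter array are whatever is being checked, e.g. the lstm_step weights from the earlier example):

  import numpy as np

  def numerical_grad(loss_fn, param, eps=1e-5):
      """Central-difference estimate of d loss / d param, entry by entry."""
      grad = np.zeros_like(param)
      it = np.nditer(param, flags=["multi_index"])
      for _ in it:
          idx = it.multi_index
          old = param[idx]
          param[idx] = old + eps; plus = loss_fn()
          param[idx] = old - eps; minus = loss_fn()
          param[idx] = old                      # restore the original value
          grad[idx] = (plus - minus) / (2 * eps)
      return grad

Comparing the result entry-wise against the analytic deltas quickly exposes sign or indexing errors in the derivation.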
Contents

• Need for memory to process sequential data

• Recurrent neural networks

• LSTM basics

• Some applications of LSTM in NLP

• Some advanced LSTM structures


Gated Recurrent Unit (GRU)
• Reduces the need for an input gate by reusing the forget gate

[Diagram: the input gate is replaced by (1 − f), where f is the forget gate, so the state update becomes s ← f · s + (1 − f) · g]
GRUs combine input and forget gates

[Figure: an RNN unit, an LSTM unit, an LSTM unit with peepholes, and a Gated Recurrent Unit, which combines the input and forget gates as f and 1 − f]

Source: Cho, et al. (2014), and "Understanding LSTM Networks", by C. Olah, http://colah.github.io/posts/2015-08-Understanding-LSTMs/
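A minimal sketch of this gate tying in Python, in the spirit of these slides: a single gate f acts as the forget gate and 1 − f replaces the input gate (the reset gate of the full GRU of Cho et al. is omitted, and the shapes in W are assumptions):

  import numpy as np

  def sigmoid(a):
      return 1.0 / (1.0 + np.exp(-a))

  def gru_like_step(x, s_prev, W):
      f = sigmoid(W["W_if"] @ x + W["W_hf"] @ s_prev)    # forget/update gate
      g = np.tanh(W["W_ig"] @ x + W["W_hg"] @ s_prev)    # candidate state
      return f * s_prev + (1.0 - f) * g                  # keep the old state or take the new input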
Sentence generation
• Very common for image captioning
• Input is given only in the beginning
• This is a one-to-many task

[Diagram: a CNN encodes the image; its features seed the LSTM, which then generates "A boy swimming in water <END>" one word per step]
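A sketch of the one-to-many decoding loop used for captioning; everything here (cnn_features, lstm_step_fn, embed, project_to_vocab, the token ids) is a hypothetical placeholder, not an API from the slides:

  def greedy_decode(cnn_features, lstm_step_fn, embed, project_to_vocab,
                    start_id, end_id, max_len=20):
      """Seed the state with the image feature, then feed each output back in."""
      state = cnn_features                               # image feature initializes the LSTM state
      token, caption = start_id, []
      for _ in range(max_len):
          out, state = lstm_step_fn(embed(token), state)
          token = int(project_to_vocab(out).argmax())    # greedy word choice
          if token == end_id:                            # stop at <END>
              break
          caption.append(token)
      return caption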
Video Caption Generation

Source: "Translating Videos to Natural Language Using Deep Recurrent Neural Networks", Venugopalan et al., arXiv 2014
Pre-processing for NLP
• The most basic pre-processing is to convert
words into an embedding using Word2Vec or
GloVe

• Otherwise, a one-hot input vector can be too long and sparse, and requires lots of input weights
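A minimal illustration of why the embedding lookup helps; the vocabulary size, embedding dimension, and the random matrix are assumptions (in practice the matrix comes from Word2Vec/GloVe or is learned end-to-end):

  import numpy as np

  V, d = 50_000, 300                                   # vocabulary and embedding sizes (assumed)
  embedding_matrix = np.random.default_rng(0).normal(size=(V, d)).astype(np.float32)
  word_to_id = {"very": 11, "pleased": 42}             # hypothetical vocabulary entries

  def embed(word):
      return embedding_matrix[word_to_id[word]]        # dense d-dim vector instead of a V-dim one-hot

  x = embed("pleased")                                 # shape (300,), not (50000,)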
Sentiment analysis
• Very common for customer review or news article analysis
• Output before the end can be discarded (not
used for backpropagation)
• This is a many-to-one task
[Diagram: word embeddings of "Very pleased with their service <END>" feed an LSTM at each step; only the final output ("Positive") is used]
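A many-to-one classifier along these lines, sketched in PyTorch (layer sizes are assumptions); only the final hidden state reaches the loss, matching "outputs before the end are discarded":

  import torch
  import torch.nn as nn

  class SentimentLSTM(nn.Module):
      def __init__(self, vocab_size=50_000, embed_dim=300, hidden_dim=128, classes=2):
          super().__init__()
          self.embed = nn.Embedding(vocab_size, embed_dim)
          self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
          self.head = nn.Linear(hidden_dim, classes)

      def forward(self, token_ids):                 # token_ids: (batch, seq_len)
          e = self.embed(token_ids)                 # (batch, seq_len, embed_dim)
          _, (h_last, _) = self.lstm(e)             # h_last: (num_layers, batch, hidden_dim)
          return self.head(h_last[-1])              # one logit vector per sequence

  logits = SentimentLSTM()(torch.randint(0, 50_000, (4, 12)))   # shape (4, 2)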


Machine translation using encoder-decoder

[Figure: an encoder RNN summarizes the source sentence into a vector, which a decoder RNN expands into the target sentence]

Source: "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation", Cho et al., arXiv 2014
Multi-layer LSTM
• More than one hidden layer can be used

[Diagram: a stack of two LSTM layers — the first hidden layer LSTM1 feeds the second hidden layer LSTM2 at every time step, and LSTM2 produces the outputs yn]


Bi-directional LSTM
• Many problems require a reverse flow of
information as well
• For example, POS tagging may require context
from future words

[Diagram: a forward LSTM layer reads the sequence left-to-right and a backward layer reads it right-to-left; their outputs are combined to produce each yn]
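In most libraries, the stacked and bi-directional variants are configuration flags rather than new code; a PyTorch sketch with assumed sizes:

  import torch
  import torch.nn as nn

  lstm = nn.LSTM(input_size=300, hidden_size=128,
                 num_layers=2,                 # stacked LSTM1 -> LSTM2, as on the previous slide
                 bidirectional=True,           # adds the reverse-direction pass
                 batch_first=True)

  x = torch.randn(4, 10, 300)                  # (batch, time, features)
  out, (h, c) = lstm(x)
  print(out.shape)                             # (4, 10, 256): forward and backward outputs concatenated
  print(h.shape)                               # (4, 4, 128): num_layers * num_directions = 4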


Some problems with LSTMs and their troubleshooting
• Inappropriate model
  – Identify the problem: one-to-many, many-to-one, etc.
  – Compute the loss only for the outputs that matter
  – Use separate LSTMs for separate languages
• High training loss: model not expressive enough
  – Too few hidden nodes
  – Only one hidden layer
  – Not bi-directional
• Overfitting: model has too much freedom
  – Too many hidden nodes
  – Too many blocks
  – Too many layers
Multi-dimensional RNNs

Source: ”Supervised Sequence Labelling with Recurrent Neural Networks.” by Graves, Alex.
In summary, LSTMs are powerful
• Using recurrent connections is an old idea
• It suffered from lack of gradient control over long term
• CEC was an important innovation for remembering states
long term
• This has many applications in time series modeling
• Newer innovations are:
– Forget gates
– Peepholes
– Combining input and forget gate in GRU
• LSTM can be generalized in direction, dimension, and number of hidden layers to produce more complex models
