
# Recurrent Neural Networks

Saket Anand

Adapted from Chetan Arora, IIT-Delhi

Multilayer Neural Networks
Types of decision regions by network structure (the slide's columns for the Exclusive-OR problem, classes with meshed regions, and most general region shapes were figures):
• Single-layer: half plane bounded by a hyperplane
• Two-layer: convex open or closed regions
• Three-layer: arbitrary (complexity limited by the number of nodes)
Convolutional Neural Network: Key Idea
Exploit
1. Structure
2. Local Connectivity
3. Shared Parameters

To Give
1. Translation Invariance
2. Minor Distortion and Scale Invariance
3. Occlusion Invariance
Handling sequences through NNs
• How can a neural network capture sequential information?

• Traditional neural networks assume that all inputs are independent of each other.

• Imagine human beings understanding a sentence: we understand each word based on our understanding of previous words.

• For many problems such as speech recognition, language modeling, and machine translation, assuming that there is no dependency between inputs is a bad idea.
Handling sequences through NNs

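[Figure: network unrolled over time steps t-1, t, t+1. Inputs $x_{t-1}, x_t, x_{t+1}$ feed hidden states $S_{t-1}, S_t, S_{t+1}$ through weights $U_{t-1}, U_t, U_{t+1}$; hidden states are linked by $W_{t-1}, W_t, W_{t+1}$ and produce outputs $o_{t-1}, o_t, o_{t+1}$ through $V_{t-1}, V_t, V_{t+1}$.]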
Handling sequences through NNs
• Too many parameters.

• Use the same trick as in Convolutional Neural Networks.

• Tie the weights.
[Figure: the same unrolled network over time steps t-1, t, t+1, now with a single shared $U$, $W$ and $V$ at every step.]
Recurrent Neural Network

[Figure: the unrolled network with shared weights $U$, $W$, $V$ (left) and the equivalent compact form, in which $W$ is a recurrent connection on $S_t$ (right).]
Recurrent Neural Network
• RNNs are called recurrent because they perform the same task for every element of a sequence.

• RNNs can be seen as a neural network with a “memory” of what has been calculated so far.
Recurrent Neural Network - Unrolling
• Unrolling (or unfolding) means writing out the network for the complete sequence. For example, if an RNN is used for language modeling, a sequence of N words is unrolled into an N-layer neural network, one layer for each word.
[Figure: compact RNN ($x_t$, $S_t$, $o_t$ with recurrent weight $W$) and its unrolled form over inputs $x_1, x_2, \ldots, x_N$, with hidden states up to $S_N$ and shared weights $U$, $W$, $V$.]
Notation
• $x_t$ is the input at time step $t$.

• $s_t$ is the hidden state at time step $t$. It’s the “memory” of the network.

• $s_t$ is calculated based on the previous hidden state and the input at the current step: $s_t = f(U x_t + W s_{t-1})$.

• The function $f$ is a non-linearity such as tanh or ReLU.

Notation (contd.)
• $s_{-1}$, which is required to calculate the first hidden state, is typically initialized to all zeros.

• $o_t$ is the output at step $t$. It can be calculated as $o_t = \mathrm{softmax}(V s_t)$.
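The notation above can be sketched as a forward pass in plain Python. This is a minimal illustration, not the lecture's implementation: the helper names (`matvec`, `softmax`, `rnn_forward`), the toy dimensions, and the weight values are all assumptions, and tanh is chosen for $f$.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_forward(xs, U, W, V):
    """Apply s_t = tanh(U x_t + W s_{t-1}) and o_t = softmax(V s_t) over a sequence."""
    s = [0.0] * len(W)  # s_{-1}: initial hidden state, all zeros
    states, outputs = [], []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(U, x), matvec(W, s))]
        s = [math.tanh(p) for p in pre]
        states.append(s)
        outputs.append(softmax(matvec(V, s)))
    return states, outputs

# Toy sizes (assumed): 2-dim inputs, 3-dim hidden state, 2 output classes.
U = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
W = [[0.1, 0.0, -0.2], [0.3, 0.4, 0.0], [0.0, -0.1, 0.5]]
V = [[1.0, -1.0, 0.5], [-0.5, 0.7, 0.2]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
states, outputs = rnn_forward(xs, U, W, V)
```

The same `U`, `W` and `V` are reused at every step of the loop, and each `outputs[t]` is a probability distribution over the output classes.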

RNN – Model
• The main feature of an RNN is its hidden state, which captures information about what happened in the sequence in all the previous time steps.

• An RNN shares the same parameters across all steps, so it performs the same task at each step. This reduces the number of parameters to learn.

• Depending on the task, we may want an output at each time step or just one final output at the last time step.
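To make the saving from parameter sharing concrete, here is a small counting sketch. The function name and the layer sizes are hypothetical, and bias terms are left out because the slides do not use them.

```python
def rnn_param_count(input_dim, hidden_dim, output_dim, steps, tied=True):
    """Count the entries of U, W and V, either shared across steps or separate per step."""
    per_step = (hidden_dim * input_dim      # U: input -> hidden
                + hidden_dim * hidden_dim   # W: hidden -> hidden
                + output_dim * hidden_dim)  # V: hidden -> output
    return per_step if tied else per_step * steps

# Assumed sizes: 100-dim input and output, 50-dim hidden state, 20 time steps.
tied = rnn_param_count(100, 50, 100, steps=20, tied=True)     # 12500
untied = rnn_param_count(100, 50, 100, steps=20, tied=False)  # 250000
```

With tied weights the count does not depend on the sequence length at all, which is also what lets one RNN process sequences of different lengths.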
RNN Modeling Based on Input/Output

[Figure: RNN input/output configurations, labeled with example tasks: house number, image captioning, sentiment analysis, video description, language and speech.]
Sequential Processing of Non-sequential Data

[Figure, left: RNN learns to read house numbers. Right: RNN learns to paint house numbers.]
Training: Backpropagation Through Time
• The basic equations of the RNN are
$$s_t = f(U x_t + W s_{t-1})$$
$$y'_t = \mathrm{softmax}(V s_t)$$

• The error in prediction is computed as the cross-entropy loss, given by
$$E(y, y') = \sum_t E_t = -\sum_t y_t \log(y'_t)$$
where $E_t = -y_t \log(y'_t)$ is the error at time step $t$.

• $y_t$ is the ground truth word at time step $t$, and $y'_t$ is the predicted word distribution.

• The total error is the sum of the errors at each time step.
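As a small worked example of the loss: when each ground truth $y_t$ is one-hot, $-y_t \log(y'_t)$ reduces to minus the log of the probability assigned to the correct class. The sketch below assumes that encoding (targets given as class indices); the function name and the numbers are made up for illustration.

```python
import math

def sequence_cross_entropy(targets, predictions):
    """Return (total error, per-step errors) for one sequence.

    targets: the correct class index at each time step (one-hot y_t, stored as an index).
    predictions: the predicted probability distribution y'_t at each time step.
    """
    per_step = [-math.log(p[y]) for y, p in zip(targets, predictions)]
    return sum(per_step), per_step

predictions = [[0.5, 0.5], [0.25, 0.75]]
targets = [0, 1]
total, per_step = sequence_cross_entropy(targets, predictions)
```

Here `total` equals `-(log 0.5 + log 0.75)`, the sum of the two per-step errors.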

Backpropagation Through Time (BPTT)
• Gradients of the error are calculated w.r.t. the parameters $U$, $V$ and $W$, which are then learned using stochastic gradient descent (or mini-batch or any other variant).

• Gradients are summed up over the time steps for one training example:
$$\frac{\partial E}{\partial W} = \sum_t \frac{\partial E_t}{\partial W}$$

• The chain rule of differentiation is used to calculate the gradients. Applied in the backward direction it is the backpropagation algorithm; because here it runs back over the unrolled time steps, it is called backpropagation through time (BPTT).
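The summation over time steps can be illustrated with a deliberately tiny model. The sketch below is a scalar RNN with tanh activation and, as a simplifying assumption, a squared error $E_t = \frac{1}{2}(s_t - y_t)^2$ on the hidden state instead of softmax with cross-entropy; all names and values are hypothetical.

```python
import math

def forward(xs, u, w):
    """Scalar RNN: s_t = tanh(u*x_t + w*s_{t-1}), with s_{-1} = 0."""
    states, s = [], 0.0
    for x in xs:
        s = math.tanh(u * x + w * s)
        states.append(s)
    return states

def loss(xs, ys, u, w):
    """Total error: sum over time steps of E_t = 0.5 * (s_t - y_t)^2."""
    return sum(0.5 * (s - y) ** 2 for s, y in zip(forward(xs, u, w), ys))

def bptt(xs, ys, u, w):
    """Return (dE/du, dE/dw), accumulated backwards over all time steps."""
    states = forward(xs, u, w)
    du = dw = 0.0
    carry = 0.0  # gradient flowing back into s_t from later time steps
    for t in reversed(range(len(xs))):
        ds = (states[t] - ys[t]) + carry    # local error plus future errors
        dpre = ds * (1.0 - states[t] ** 2)  # back through tanh
        s_prev = states[t - 1] if t > 0 else 0.0
        du += dpre * xs[t]                  # contribution of step t to dE/du
        dw += dpre * s_prev                 # contribution of step t to dE/dw
        carry = dpre * w                    # pass the gradient on to s_{t-1}
    return du, dw

xs, ys = [1.0, -0.5, 0.25, 0.8], [0.5, 0.0, -0.2, 0.3]
du, dw = bptt(xs, ys, u=0.7, w=-0.4)
```

The `carry` term is what makes this "through time": the error at a late step reaches early steps only by being multiplied by `w` and a tanh derivative once per step on the way back.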
RNN Example: Character-Level Language Model
RNN Example: Word-Level Language Model
RNN Example: Sentiment Classification
RNN Example: Machine Translation
Long Term Dependencies
I grew up in France… I speak fluent French

[Figure: RNN unrolled over time steps t-2 to t+2, with inputs $x_{t-2}, \ldots, x_{t+2}$, outputs $o_{t-2}, \ldots, o_{t+2}$ and shared weights $U$, $W$, $V$.]
Difficulties involved in BPTT
• RNNs trained with BPTT have difficulty learning long-term dependencies because of the vanishing gradient problem.

• When there are many hidden layers, the error gradient weakens as it moves from the back of the network to the front, because the derivative of the sigmoid shrinks toward zero as its input saturates at either extreme.

• The updates toward the front of the network therefore carry less information.
Difficulties involved in BPTT (Cont.)
• The problem also exists in deep CNNs. RNNs amplify it: unrolled over time, the number of layers traversed by backpropagation grows dramatically.

• This makes it very hard for the model to learn correlations between temporally distant events.
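A rough numerical illustration of why the gradient vanishes: in a scalar sigmoid RNN, each step back in time multiplies the gradient by $w \cdot \sigma'(z_t)$, and $\sigma'$ never exceeds 0.25. The sketch below assumes that scalar setting; the function names and the choice $w = 1$ are illustrative only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid; it peaks at 0.25 when z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_factor(w, pre_activations):
    """Product of ds_t/ds_{t-1} = w * sigmoid'(z_t) over the given time steps."""
    factor = 1.0
    for z in pre_activations:
        factor *= w * sigmoid_grad(z)
    return factor

# Best case for the sigmoid: every pre-activation sits at 0, where the derivative is largest.
short = backprop_factor(1.0, [0.0] * 5)    # 0.25 ** 5
long_ = backprop_factor(1.0, [0.0] * 50)   # 0.25 ** 50
```

Even in this best case the factor after 50 steps is below 1e-29, so an error at step $t$ has practically no influence on the update attributed to step $t - 50$. Saturated units (large $|z|$) make it shrink faster still.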
Some Partial Solutions
• Vanishing gradients can be partially addressed by properly initializing the weight matrices (U, V, W) or by regularization.

• Using ReLU instead of tanh or sigmoid activation functions is also a preferred solution (typically done in deep CNNs).
Solution Sketch
• What could be the simplest solution to handle vanishing/exploding gradients?

• Or ‘learn’ when to let the information pass.

LSTM
[Figure: LSTM cell with input $x_t$ and output $h_t$; its three gates are labeled Forget, Insert and Recurrent.]
Long Short-Term Memory
LSTM Architecture
• Each line carries an entire vector.

• A line splitting denotes a vector being copied, with the copies going to different locations.
[Figure: LSTM cell diagram.]
LSTM Networks
• LSTM networks are capable of learning long-term dependencies.

• LSTMs are explicitly designed to avoid the long-term dependency problem.

• The structure of the repeating module is quite different from that of a standard RNN.
[Figure: LSTM cell diagram.]
Unrolling LSTMs

[Figure: three LSTM cells chained over time steps t-1, t, t+1, with inputs $x_{t-1}, x_t, x_{t+1}$ and outputs $h_{t-1}, h_t, h_{t+1}$.]
LSTM Networks
[Figure: LSTM cell with labels: cell state $c_{t-1} \to c_t$, gates $f_t$, $i_t$, $o_t$, candidate $\tilde{c}_t$, hidden state $h_{t-1} \to h_t$, input $x_t$.]
ResNet Analogy
How does an LSTM cell work? (Cont.)
• The key to LSTMs is the cell state, the horizontal line running through the top of the cell.
• This line runs straight through the entire chain, with only minor interactions with cells.
• The LSTM has the ability to add or remove information to the cell state, regulated by gates.
• Gates consist of a sigmoid neural network layer and a pointwise multiplication operation.
[Figure: LSTM cell diagram.]
Gate
• The sigmoid layer outputs numbers between zero and one, representing how much of each component should be let through.

• A value of zero means “let nothing through”, while a value of one means “let everything through!”.
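A gate as described here is just a sigmoid followed by a pointwise multiplication. The sketch below uses hypothetical names and hand-picked scores to show the two extremes and the midpoint.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gate(scores, values):
    """Scale each value by sigmoid(score): near 0 blocks it, near 1 lets it through."""
    return [sigmoid(s) * v for s, v in zip(scores, values)]

# Scores of -20, 0 and +20 give gate values of roughly 0, exactly 0.5 and roughly 1.
gated = gate([-20.0, 0.0, 20.0], [3.0, 3.0, 3.0])
```

In an LSTM the scores are not hand-picked but computed from $h_{t-1}$ and $x_t$ by a learned layer, so the network learns when to open or close each gate.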
LSTM Operations: Forget
• The first step is to decide what information to throw away from the cell state.
• A sigmoid layer named the “forget gate layer” makes this decision.
• It looks at the previous output, $h_{t-1}$, and the current input, $x_t$, and outputs a number between 0 and 1 for each component of the cell state, telling how much to keep.
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
[Figure: LSTM cell diagram.]
LSTM Operations: Input/Insert
• The next step is to decide what new information should be stored in the cell state.
• First, a sigmoid layer called the “input gate layer” decides which values should be updated, $i_t$.
• Next, a tanh layer creates a vector of new candidate values, $\tilde{c}_t$, that could be added to the cell state.
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$$
[Figure: LSTM cell diagram.]
LSTM Operations: Update
• The old cell state, $c_{t-1}$, is updated into the new cell state, $c_t$.
• The old state is multiplied by the forget gate output, $f_t$.
• The input gate output, $i_t$, is multiplied by the candidate values, $\tilde{c}_t$, and the result is added to the product from the previous step.
• The result of these computations is the new cell state, $c_t$.
$$c_t = f_t * c_{t-1} + i_t * \tilde{c}_t$$
[Figure: LSTM cell diagram.]
LSTM Operations: Output/Recurrent
• The final step is to decide what to output.
• The output is based on the current cell state, $c_t$.
• First, a sigmoid layer decides which parts of the cell state are going to be output.
• Then the cell state is passed through a tanh layer.
• The result is then multiplied by the output of the sigmoid gate.
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t * \tanh(c_t)$$
[Figure: LSTM cell diagram.]
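The four operations (forget, input, update, output) can be put together into a single step function. The following is a minimal plain-Python sketch under stated assumptions: the parameter dictionary keys (`Wf`, `bf`, ...), the helper names and the toy sizes are hypothetical, and vectors are plain lists so that `h_prev + x` is the concatenation $[h_{t-1}, x_t]$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W . v + b for a matrix W (list of rows), bias b and vector v."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step; p maps the (hypothetical) names Wf, bf, Wi, bi, Wc, bc, Wo, bo to parameters."""
    hx = h_prev + x                                                  # [h_{t-1}, x_t]
    f = [sigmoid(z) for z in affine(p["Wf"], p["bf"], hx)]           # forget gate
    i = [sigmoid(z) for z in affine(p["Wi"], p["bi"], hx)]           # input gate
    c_tilde = [math.tanh(z) for z in affine(p["Wc"], p["bc"], hx)]   # candidate values
    c = [ft * cp + it * ct for ft, cp, it, ct in zip(f, c_prev, i, c_tilde)]  # update
    o = [sigmoid(z) for z in affine(p["Wo"], p["bo"], hx)]           # output gate
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

# Toy parameters (assumed): hidden size 2, input size 1, all weights 0.1, zero biases.
n_hidden, n_input = 2, 1
params = {}
for name in ("f", "i", "c", "o"):
    params["W" + name] = [[0.1] * (n_hidden + n_input) for _ in range(n_hidden)]
    params["b" + name] = [0.0] * n_hidden

h, c = [0.0] * n_hidden, [0.0] * n_hidden
for x_t in ([1.0], [0.5], [-1.0]):
    h, c = lstm_step(x_t, h, c, params)
```

Note that the cell state `c` is only ever scaled by `f` and added to; it never passes through a weight matrix on its way from one step to the next.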
LSTM Example
Variants of LSTMs
• No Input Gate (NIG)
• No Forget Gate (NFG)
• No Output Gate (NOG)
• No Input Activation Function (NIAF)
• No Output Activation Function (NOAF)
• Peepholes
• Coupled Input and Forget Gate (CIFG)
• Full Gate Recurrence (FGR).

LSTM: A Search Space Odyssey

Peephole
[Figure: LSTM cell with peephole connections; labels $c_{t-1}$, $c_t$, $\tilde{c}_t$, $f_t$, $i_t$, $o_t$, $h_{t-1}$, $h_t$, $x_t$.]
Coupled Input and Forget Gates
[Figure: LSTM cell in which the input gate $i_t$ is obtained from the forget gate $f_t$ through a −1 (complement) connection instead of its own sigmoid layer.]
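Written out, coupling the two gates means the cell only writes new content where it forgets old content. Assuming the usual formulation of this variant, and reusing the symbols of the update equation from the standard LSTM, the cell update becomes:

```latex
\begin{aligned}
i_t &= 1 - f_t \\
c_t &= f_t * c_{t-1} + (1 - f_t) * \tilde{c}_t
\end{aligned}
```

Compared with $c_t = f_t * c_{t-1} + i_t * \tilde{c}_t$, the separate input gate layer and its parameters disappear.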
Gated Recurrent Unit (GRU)
• Combine the forget and input gates into a single “update gate.”

• Make some other changes

Gated Recurrent Unit (GRU)

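[Figure: GRU cell with input $x_t$, previous state $h_{t-1}$, gates $r_t$ and $z_t$, candidate state $\tilde{h}_t$, a −1 (complement) connection on $z_t$, and output $h_t$.]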
Gated Recurrent Unit (GRU)
• $z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$

• $h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$
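A GRU step can be sketched in the same plain-Python style. Only the $z_t$ and $h_t$ equations appear in the slide text; the reset gate $r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$ and the candidate $\tilde{h}_t = \tanh(W_h \cdot [r_t * h_{t-1}, x_t])$ used below follow the standard GRU formulation and are an assumption here, as are all names and toy values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(x, h_prev, Wz, Wr, Wh):
    """One GRU step with update gate z, reset gate r and candidate state h_tilde."""
    hx = h_prev + x                                       # [h_{t-1}, x_t]
    z = [sigmoid(a) for a in matvec(Wz, hx)]              # update gate
    r = [sigmoid(a) for a in matvec(Wr, hx)]              # reset gate (standard GRU, assumed)
    rhx = [ri * hi for ri, hi in zip(r, h_prev)] + x      # [r_t * h_{t-1}, x_t]
    h_tilde = [math.tanh(a) for a in matvec(Wh, rhx)]     # candidate state
    # Interpolate between the old state and the candidate, as in the h_t equation.
    return [(1.0 - zi) * hp + zi * ht for zi, hp, ht in zip(z, h_prev, h_tilde)]

# Toy parameters (assumed): hidden size 2, input size 1, all weights 0.1.
Wz = [[0.1] * 3 for _ in range(2)]
Wr = [[0.1] * 3 for _ in range(2)]
Wh = [[0.1] * 3 for _ in range(2)]

h = [0.0, 0.0]
for x_t in ([1.0], [0.5], [-1.0]):
    h = gru_step(x_t, h, Wz, Wr, Wh)
```

There is no separate cell state: the single vector `h` plays both roles, and $z_t$ alone decides how much of it is overwritten.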

Gated Recurrent Unit (GRU)
• Simplifies the design

• Easier to train than an LSTM. Needs less data.

Alternate Representation
Effect of various LSTM structures
• The most commonly used LSTM architecture (vanilla LSTM) performs reasonably well on various datasets, and none of the eight possible modifications significantly improves performance.
• Certain modifications, such as coupling the input and forget gates or removing peephole connections, simplify the LSTM without significantly hurting performance.
• The forget gate and the output activation function are the critical components of the LSTM block. While the first is crucial for LSTM performance, the second is necessary whenever the cell state is unbounded.
RNN Extensions
• Some variants of RNNs are:

• Bidirectional RNNs

• Deep Bidirectional RNNs

Bidirectional RNNs
• Model the case where the current state depends on both previous and future states in the sequence.
• They are simple.
• Two RNNs stacked on top of each other, one reading the sequence forward and the other backward.
• The output is then computed based on the hidden states of both RNNs.
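The idea can be sketched with two scalar RNNs, one run left-to-right and one right-to-left, whose states are paired up per time step. The function names, the scalar setting and the weight values are assumptions made for illustration.

```python
import math

def run_rnn(xs, u, w):
    """Scalar RNN states s_t = tanh(u*x_t + w*s_{t-1}), starting from a zero state."""
    s, states = 0.0, []
    for x in xs:
        s = math.tanh(u * x + w * s)
        states.append(s)
    return states

def bidirectional_states(xs, fwd, bwd):
    """Pair each step's forward state with its backward state."""
    forward = run_rnn(xs, *fwd)
    backward = run_rnn(xs[::-1], *bwd)[::-1]  # read right-to-left, then re-align to time order
    return list(zip(forward, backward))

xs = [1.0, 0.0, -1.0]
pairs = bidirectional_states(xs, fwd=(0.5, 0.3), bwd=(0.5, 0.3))
```

At step `t`, `pairs[t][0]` summarizes `xs[:t+1]` and `pairs[t][1]` summarizes `xs[t:]`, so an output layer that reads both sees the whole sequence.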
Deep Bidirectional RNNs
• Similar to bidirectional RNNs.
• The only difference is the layered architecture.
• Multiple layers of bidirectional RNNs.
• In practice gives a higher learning capacity.
• Requires a lot of training data.
Feature-based approaches to Activity Recognition
• Dense trajectories and motion boundary descriptors for action recognition:
Wang et al., 2013
• Action Recognition with Improved Trajectories: Wang and Schmid, 2013

Dense Trajectories
• Dense trajectories and motion boundary descriptors for action recognition:
Wang et al., 2013

Pipeline: detect feature points → track features with optical flow → extract HOG/HOF/MBH features in the (stabilized) coordinate system of each tracklet
Spatio-Temporal ConvNets
• 3D Convolutional Neural Networks for Human Action Recognition: Ji et al.,
2010
Spatio-Temporal ConvNets
• Sequential Deep Learning for Human Action Recognition: Baccouche et al.,
2011
Spatio-Temporal ConvNets
• Large-scale Video Classification with Convolutional Neural Networks,
Karpathy et al., 2014

1 million videos
487 sports classes

The motion information didn’t add all that much...

Spatio-Temporal ConvNets
• Learning Spatiotemporal Features with 3D Convolutional Networks: Tran et
al. 2015

3D VGGNet, basically.
Spatio-Temporal ConvNets
• Two-Stream Convolutional Networks for Action Recognition in Videos:
Simonyan and Zisserman 2014

Two-stream version works much better than either stream alone

Long-time Spatio-Temporal ConvNets
Action Classification in Soccer Videos with Long Short-Term Memory
Recurrent Neural Networks: ICANN'10
Long-time Spatio-Temporal ConvNets
• Long-term Recurrent Convolutional Networks for Visual Recognition and
Description: Donahue et al., 2015
Long-time Spatio-Temporal ConvNets
• Beyond Short Snippets: Deep Networks for Video Classification: Ng et al.,
2015
Beyond Short Snippets: Deep Networks for
Video Classification: Ng et al., 2015
• Deep Video LSTM takes as input the output of the final CNN layer at each consecutive video frame.

• CNN outputs are processed forward through time and upwards through five layers of stacked LSTMs.

• A softmax layer predicts the class at each time step.
Beyond Short Snippets: Deep Networks for
Video Classification: Ng et al., 2015

Combining predictions:
• returning the prediction at the last time step,
• max-pooling the predictions over time,
• summing the predictions over time and returning the max,
• linearly weighting the predictions over time.
Less than 1% difference in output between any of the 4 choices.
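The four ways of combining per-frame predictions can be sketched as follows. This is an illustration rather than the paper's code: the function names are hypothetical, and the specific linear weighting (weights growing from $1/T$ to 1 over $T$ steps) is an assumption.

```python
def argmax(v):
    return max(range(len(v)), key=lambda k: v[k])

def combine(preds, method):
    """preds: per-class scores for each time step. Returns the predicted class index."""
    n_steps, n_classes = len(preds), len(preds[0])
    if method == "last":
        scores = preds[-1]
    elif method == "max":
        scores = [max(p[k] for p in preds) for k in range(n_classes)]
    elif method == "sum":
        scores = [sum(p[k] for p in preds) for k in range(n_classes)]
    elif method == "linear":
        # Later time steps count more; this particular weighting is an assumption.
        scores = [sum((t + 1) / n_steps * p[k] for t, p in enumerate(preds))
                  for k in range(n_classes)]
    else:
        raise ValueError(f"unknown method: {method}")
    return argmax(scores)

preds = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.3, 0.6, 0.1]]
results = {m: combine(preds, m) for m in ("last", "max", "sum", "linear")}
```

On this toy input all four strategies pick class 1. They can disagree on other inputs, for example when an early frame is confidently wrong and the last frame is right.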
Bi-Directional RNN
A Multi-Stream Bi-Directional Recurrent Neural Network for Fine-Grained
Action Detection, CVPR 2016
Image Captioning
Image Sentence Datasets

• Microsoft COCO: Tsung-Yi Lin et al. 2014. www.mscoco.org
• Currently: ~120K images, ~5 sentences each
RNNs for Image Captioning
Soft Attention for Captioning
Soft Attention
Show Attend and Tell: Xu et al., 2015
• RNN attends spatially to different parts of images while generating
each word of the sentence
Soft Attention
Soft Attention for Everything!
Attending to Arbitrary Regions

Attention mechanism from Show, Attend, and Tell only lets us softly attend
to fixed grid positions … can we do better?
Spatial Transformer Networks
Jaderberg et al, “Spatial Transformer Networks”, NIPS 2015
Attention: Recap
• Soft attention:
  • Easy to implement: produce a distribution over input locations, reweight features and feed as input
  • Attend to arbitrary input locations using spatial transformer networks

• Hard attention:
  • Attend to a single input location
  • Need reinforcement learning!
Other Image Captioning Works
• Explain Images with Multimodal Recurrent Neural Networks, Mao et
al.
• Deep Visual-Semantic Alignments for Generating Image Descriptions,
Karpathy and Fei-Fei
• Show and Tell: A Neural Image Caption Generator, Vinyals et al.
• Long-term Recurrent Convolutional Networks for Visual Recognition
and Description, Donahue et al.
• Learning a Recurrent Visual Representation for Image Caption
Generation, Chen and Zitnick
Learning Representation
Unsupervised Learning with LSTMs, Arxiv 2015.
Pose Estimation
Recurrent Network Models for Human Dynamics, ICCV 2015
Reidentification
Recurrent Convolutional Network for Video-based Person Re-Identification,
CVPR 2016
OCR
Recursive Recurrent Nets with
Attention Modeling for OCR in the
Wild, CVPR 2016

Pass input images through recursive convolutional layers to extract encoded image features

Then decode the features to output characters by recurrent neural networks