
Deep Learning

6. Recurrent NN
Fully connected network vs ConvNet
- fully connected network: neurons, weights, FC layers
- ConvNet: conv layers, kernels, feature maps, non-linearity

[Image from https://www.papernot.fr/marauder_map.pdf] [Image from http://benanne.github.io/images]
Neural Networks Applications in Computer Vision: Segmentation
- Fully Convolutional Network
- Deconvolution Network

[Photos from: http://deeplearning.net/tutorial/fcn_2D_segm.html; https://medium.com/@wilburdes/semantic-segmentation-using-fully-convolutional-neural-networks-86e45336f99b]
Motivation

● Until now we have seen:
  ■ fully connected networks: unstructured data
  ■ CNNs: localized data

● BUT sometimes we have sequential data:
  ■ the input is a sequence
  ■ the output is a sequence
Text generation

[A. Karpathy and J. Johnson]
Machine Translat ion

[Wu, Yonghui et al. “Google's Neural Machine Translation


System: Bridging the Gap between Human and Machine
Captioning

[I. Duta, A. Nicolicioiu, V. Bogolin, M. Leordeanu. "Mining for meaning: from vision to language through multiple networks consensus" (2018)]
[O. Vinyals et al. "Show and Tell: A Neural Image Caption Generator" (2015)]
Video classification

A first approach:
● extract a representation for each frame using a CNN
● concatenate all the frame features into one huge video representation vector
● process the video representation using fully connected layers
Video classification: why not?

W1 ∈ R^((d1·f) × d2)  (the first fully connected layer maps the concatenation of f frame features of d1 dimensions each to d2 units)

● no sequentiality
● huge number of parameters (see the example below)
● fixed input length
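To get a feel for the parameter count, a tiny back-of-the-envelope check in Python (the sizes d1, f, d2 here are assumed for illustration only, not taken from the slides):

d1, f, d2 = 512, 100, 1024        # assumed: 512 features per frame, 100 frames, 1024 hidden units
params_W1 = (d1 * f) * d2         # weights of the first FC layer alone
print(params_W1)                  # 52428800 parameters, and the number of frames f is fixed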
Video classification: solution

Recurrent Neural Network

Feed-forward vs RNN

RNN topologies: ONE-TO-ONE, ONE-TO-MANY, MANY-TO-ONE, MANY-TO-MANY

x - input
h - hidden state / context
y - output
Forward step

The same weights W (the same operation) are used at every time step:
- allows input sequences of variable length
- given a context and an input, all timesteps should be processed in the same way
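A minimal sketch of this forward step in NumPy, assuming the standard vanilla RNN update h_t = tanh(W_xh·x_t + W_hh·h_{t-1} + b); all variable names are illustrative:

import numpy as np

def rnn_forward(xs, h0, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence; the same weights are reused at every step."""
    h = h0
    hs = []
    for x in xs:                                  # xs: list of input vectors, any length
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)    # shared W_xh, W_hh, b_h at each time step
        hs.append(h)
    return hs, h                                  # all hidden states and the final context

d_in, d_h = 8, 16
W_xh = np.random.randn(d_h, d_in) * 0.01
W_hh = np.random.randn(d_h, d_h) * 0.01
b_h = np.zeros(d_h)
xs = [np.random.randn(d_in) for _ in range(10)]   # a sequence of 10 steps; 7 or 50 would also work
hs, h_T = rnn_forward(xs, np.zeros(d_h), W_xh, W_hh, b_h)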
Forward step

When there is no real input at a time step (e.g., one-to-many generation), a dummy input is fed.

What about the initial hidden state h0?
1. zeros vector
2. random vector
3. a representation of data used as conditional information for the current model
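A short illustration of the three choices in PyTorch; `cond_proj` and `image_features` are hypothetical names for the conditioning path (e.g., CNN image features in captioning):

import torch
import torch.nn as nn

d_h = 128
h0_zeros  = torch.zeros(1, d_h)                  # option 1: zeros vector
h0_random = torch.randn(1, d_h) * 0.01           # option 2: (small) random vector
cond_proj = nn.Linear(2048, d_h)                 # option 3: project conditioning data
image_features = torch.randn(1, 2048)            #           (e.g. CNN image features)
h0_cond = torch.tanh(cond_proj(image_features))  #           into the initial state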
Vanilla RNN vs Multi-modal RNN

One-modal input vs multi-modal input
Video classification: RNN

So:
● one CNN to extract features from each frame
● one RNN to process the frame features
● output taken from the final step (many-to-one)
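A minimal sketch of this many-to-one pipeline in PyTorch; the tiny feature extractor, hidden size and number of classes are illustrative assumptions, not the lecture's exact setup:

import torch
import torch.nn as nn

class VideoClassifier(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_classes=10):
        super().__init__()
        # per-frame CNN feature extractor (a tiny stand-in conv net)
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)  # processes the frame features
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, video):                  # video: (batch, frames, 3, H, W)
        b, f = video.shape[:2]
        feats = self.cnn(video.flatten(0, 1))  # (batch*frames, feat_dim)
        feats = feats.view(b, f, -1)
        _, h_last = self.rnn(feats)            # many-to-one: keep only the final hidden state
        return self.fc(h_last[-1])             # class scores

model = VideoClassifier()
scores = model(torch.randn(2, 16, 3, 64, 64))  # 2 clips of 16 frames each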
Backpropagation through time (BPTT)
Backpropagation through time

For a large number of timesteps (t): HARD TO TRAIN. WHY?
The gradient of the loss with respect to early hidden states is a product of many per-step Jacobians, so it tends to either vanish or explode.

[image from: https://hackernoon.com/understanding-architecture-of-lstm-cell-from-scratch-with-code-8da40f0b71f4]
[image from Toronto University CSC321 course 2017, R. Grosse]
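The chain rule behind this, written out in LaTeX for the vanilla RNN update h_t = tanh(W_hh h_{t-1} + W_xh x_t + b) (a standard derivation, not copied from the slides):

\frac{\partial \mathcal{L}}{\partial h_k}
  = \frac{\partial \mathcal{L}}{\partial h_T}
    \prod_{t=k+1}^{T} \frac{\partial h_t}{\partial h_{t-1}},
\qquad
\frac{\partial h_t}{\partial h_{t-1}} = \mathrm{diag}\!\left(1 - h_t^{2}\right) W_{hh}

With T - k large this is a long product of similar factors: when the largest singular value of W_hh is below 1 the gradient vanishes, when it is above 1 it explodes.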


Vanishing / exploding gradients

Solution 1: Gradient clipping

● avoids exploding gradients: rescale the gradient when its norm exceeds a threshold

[Goodfellow et al., Deep Learning]
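A minimal sketch of clipping by global norm in PyTorch; the model, loss and threshold are arbitrary choices for illustration:

import torch
import torch.nn as nn

model = nn.GRU(32, 64)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(20, 8, 32)            # (seq_len, batch, input_size)
out, _ = model(x)
loss = out.pow(2).mean()              # dummy loss, just to get gradients

opt.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # rescale if ||g|| > 1.0
opt.step()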
Vanishing / exploding gradients

Solution 2: Identity initialization

● recurrent weights are initialized to the identity matrix
● activation functions are ReLU
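A sketch of this initialization on a simple ReLU RNN cell in PyTorch (the "IRNN" idea of Le et al., 2015); the layer sizes are illustrative:

import torch
import torch.nn as nn

cell = nn.RNNCell(32, 64, nonlinearity='relu')   # ReLU activation instead of tanh
with torch.no_grad():
    cell.weight_hh.copy_(torch.eye(64))          # recurrent weights = identity
    cell.bias_hh.zero_()
    cell.bias_ih.zero_()

h = torch.zeros(8, 64)                           # batch of 8
for x_t in torch.randn(20, 8, 32):               # 20 time steps
    h = cell(x_t, h)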


Vanishing / exploding gradients

Solution 3: Architectures

● LSTM
● GRU
● avoid vanishing gradients

[Sepp Hochreiter and Jürgen Schmidhuber. "Long Short-Term Memory." Neural Comput. 9, 8 (November 1997), 1735-1780]
[Cho, Kyunghyun et al. "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation." EMNLP (2014)]
Gating mechanism

Vanilla RNN vs simple gate

● Gates control the flow of information into the cell, allowing the model to forget irrelevant information
● gate ≈ 0: dismiss all the information; gate ≈ 1: keep all the information
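A tiny illustration of a single gate in NumPy (a sigmoid produces values in (0, 1) that scale how much information passes; all values are made up):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h_prev = np.array([0.5, -1.0, 2.0])
candidate = np.array([1.0, 1.0, 1.0])
gate = sigmoid(np.array([-10.0, 0.0, 10.0]))       # ~0 -> dismiss, ~1 -> keep
h_new = gate * candidate + (1.0 - gate) * h_prev   # convex mix controlled by the gate
print(gate, h_new)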
GRU

GRU uses a gating mechanism with 2 gates:
a. update gate: determines how much of the past information needs to be passed along to the future (copy / ignore previous info)
b. reset gate: decides how to combine the new input with the previous state (keep / reset previous info for the current target state)
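A sketch of a single GRU step in NumPy (one common formulation; the exact convention for which term the update gate multiplies differs between papers, and all weight names and sizes here are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    z = sigmoid(Wz @ x + Uz @ h_prev)               # update gate: how much past to pass along
    r = sigmoid(Wr @ x + Ur @ h_prev)               # reset gate: how to combine input with previous state
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))   # candidate ("target") state
    return (1 - z) * h_prev + z * h_tilde           # interpolate between old state and candidate

d_in, d_h = 4, 3
rng = np.random.default_rng(0)
W = [rng.normal(scale=0.1, size=(d_h, d_in)) for _ in range(3)]
U = [rng.normal(scale=0.1, size=(d_h, d_h)) for _ in range(3)]
h = gru_step(rng.normal(size=d_in), np.zeros(d_h), W[0], U[0], W[1], U[1], W[2], U[2])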


Truncated backpropagation through time (TBPTT)

[slide from Stanford CS231 course 2018, Li Fei-Fei]

When the sequence is very long, BPTT accumulates gradients over many timesteps:
● learning is computationally expensive (slow)
● learning suffers from vanishing / exploding gradients

TBPTT:
● the update step is applied after each chunk of k forward passes
● each update accumulates gradients over a limited, fixed number of timesteps (k)
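A minimal sketch of TBPTT in PyTorch, using the usual idiom: run k steps, update, then detach the hidden state so gradients do not flow further back. The value of k, the sizes and the data are illustrative:

import torch
import torch.nn as nn

rnn = nn.RNN(16, 32)
head = nn.Linear(32, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

long_seq = torch.randn(1000, 1, 16)       # (seq_len, batch, input_size)
targets = torch.randn(1000, 1, 1)
k = 50                                    # truncation length
h = torch.zeros(1, 1, 32)

for start in range(0, long_seq.size(0), k):
    x_chunk = long_seq[start:start + k]
    y_chunk = targets[start:start + k]
    out, h = rnn(x_chunk, h)
    loss = ((head(out) - y_chunk) ** 2).mean()
    opt.zero_grad()
    loss.backward()                       # gradients only over the last k timesteps
    opt.step()
    h = h.detach()                        # cut the graph: no backprop beyond this chunk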

Classification vs Generation

[Image from https://www.buzzfeed.com/stephenlaconte/coffee-is-bad-people-who-drink-it-are-bad]


Generate sequences - Teacher forcing

● Language models usually use the output from step t-1 as the input for step t
● Early mistakes propagate into later steps:
  ○ slow convergence
  ○ instability of the model
Generate sequences - Teacher forcing

● Teacher forcing: during the training phase the model receives the ground truth y* instead of the model output y
  ○ later steps receive correct input even at the beginning of training
● during the test phase, use the model's output, because we don't have access to the ground truth
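A sketch of a decoder loop with teacher forcing in PyTorch; the vocabulary/embedding sizes and the `decoder_cell` structure are illustrative assumptions:

import torch
import torch.nn as nn

vocab, emb_dim, hidden = 100, 32, 64
embed = nn.Embedding(vocab, emb_dim)
decoder_cell = nn.GRUCell(emb_dim, hidden)
out_proj = nn.Linear(hidden, vocab)
criterion = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab, (8, 12))        # ground-truth sequences: (batch, time)
h = torch.zeros(8, hidden)
loss = 0.0

for t in range(tokens.size(1) - 1):
    x_t = embed(tokens[:, t])                    # training: feed the ground-truth token y*_t ...
    h = decoder_cell(x_t, h)
    logits = out_proj(h)
    loss = loss + criterion(logits, tokens[:, t + 1])   # ... and predict y*_{t+1}
# at test time, x_t would instead be embed(logits.argmax(dim=-1)) from the previous step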
Process longer input

● Problem:
  ○ can't fit the entire document in the model
  ○ need batch_size > 1
  ○ need the previous context for each sub-sequence

[Fragment from: Rebilius Cruso (Francis William Newman) http://www.gutenberg.org/files/50732/50732-h/50732-h.htm]

● Solution:
  ○ split the document into batch_size continuous chunks
  ○ in each training iteration, one batch receives sequences from different chunks (see the sketch below)
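A minimal sketch of this chunking in PyTorch, in the spirit of the classic language-model "batchify" scheme: reshape the token stream into batch_size parallel continuous streams, then slice consecutive windows. Function names and sizes are illustrative:

import torch

def batchify(token_stream, batch_size):
    """Split one long stream into batch_size continuous chunks, stacked as columns."""
    n = token_stream.size(0) // batch_size
    data = token_stream[:n * batch_size]          # drop the remainder
    return data.view(batch_size, n).t()           # (n, batch_size): column b is chunk b

def get_window(data, start, seq_len):
    """Consecutive windows: iteration i continues exactly where iteration i-1 stopped."""
    seq_len = min(seq_len, data.size(0) - 1 - start)
    x = data[start:start + seq_len]
    y = data[start + 1:start + 1 + seq_len]       # next-token targets
    return x, y

stream = torch.arange(1000)                       # stand-in for a tokenized document
data = batchify(stream, batch_size=4)             # 4 continuous chunks
x0, y0 = get_window(data, 0, seq_len=35)
x1, y1 = get_window(data, 35, seq_len=35)         # hidden state from x0 can be carried over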
Deep Neural Network

[Images from: https://steemit.com/neuralnet/@longwhitecoat/what-is-a-neural-network-deep-machine-learning; https://www.jeremyjordan.me/convnet-architectures/]
Multilayer RNN

● better captures the structure of the input sequence
● harder to optimize (longer paths)
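In PyTorch, stacking recurrent layers is just the `num_layers` argument; layer l consumes the hidden sequence produced by layer l-1 (sizes here are illustrative):

import torch
import torch.nn as nn

rnn = nn.GRU(input_size=32, hidden_size=64, num_layers=3)   # 3 stacked recurrent layers
x = torch.randn(20, 8, 32)                                  # (seq_len, batch, input_size)
out, h_n = rnn(x)
print(out.shape)   # (20, 8, 64): top-layer hidden states at every time step
print(h_n.shape)   # (3, 8, 64): final hidden state of each of the 3 layers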


Bidirectional RNN

● the prediction at timestep t depends on the whole input
● combines context from 2 RNNs: a forward RNN and a backward RNN

[Schuster, Mike and Kuldip K. Paliwal. "Bidirectional recurrent neural networks." IEEE Trans. Signal Processing 45 (1997): 2673-2681]
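In PyTorch this is the `bidirectional` flag; the forward and backward contexts are concatenated at each timestep (sizes illustrative):

import torch
import torch.nn as nn

birnn = nn.GRU(input_size=32, hidden_size=64, bidirectional=True)
x = torch.randn(20, 8, 32)                 # (seq_len, batch, input_size)
out, h_n = birnn(x)
print(out.shape)   # (20, 8, 128): forward and backward contexts concatenated (2 * 64)
print(h_n.shape)   # (2, 8, 64): final state of the forward and of the backward RNN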
Normalization: Recap and more

- during training, activations change a lot - each layer adapts to a new input distribution at every step (internal covariate shift)
  ⇒ training is slow
- force the input distribution of each layer to be the same at every training step (normalization = constant mean, constant std)
  ⇒ training is fast
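The normalization step itself, written out in LaTeX (the standard BatchNorm/LayerNorm form; γ and β are learned scale and shift):

\hat{x} = \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}}, \qquad y = \gamma \hat{x} + \beta

where μ and σ² are the mean and variance computed over the mini-batch (BatchNorm) or over the features of each example (LayerNorm).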
Batch Normalization

Batch statistics (mini-batch mean and variance) are computed ONLY AT TRAINING TIME; at test time, running averages are used instead.

[Lei Ba et al., Layer Normalization, 2016]
Normalization for RNN

● Classic BatchNorm is not recommended
  ○ each timestep has different statistics (mean and variance) - statistics computed and used over all timesteps are not accurate

● Modified BatchNorm is sometimes used (Cooijmans et al., ICLR 2017)
  ○ keep different statistics for each timestep
  ○ the per-timestep statistics converge when the number of timesteps is large
Normalization for RNN: LayerNorm

● LayerNorm (Lei Ba et al. 2016)
  ○ normalize within each example, across channels

[Cooijmans et al., "Recurrent Batch Normalization." ICLR 2017]
[Lei Ba et al., Layer Normalization, 2016]
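A sketch of LayerNorm inside a manual recurrent step in PyTorch, normalizing the pre-activation of each example across its hidden channels (sizes illustrative):

import torch
import torch.nn as nn

d_in, d_h = 32, 64
W_xh = nn.Linear(d_in, d_h, bias=False)
W_hh = nn.Linear(d_h, d_h, bias=False)
ln = nn.LayerNorm(d_h)                       # per-example statistics over the d_h channels

x = torch.randn(20, 8, d_in)                 # (seq_len, batch, input_size)
h = torch.zeros(8, d_h)
for x_t in x:
    h = torch.tanh(ln(W_xh(x_t) + W_hh(h)))  # normalize the pre-activation, then tanh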
TCN or RNN?

● 1D convolutions can be applied to 1D (sequential) data
● Possible applications:
  - audio processing
  - video processing
  - natural language processing

BOTH

[Slides from Course 5]
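A minimal sketch of applying 1D convolutions to a sequence in PyTorch, in the spirit of a TCN (causal, dilated convolutions); the class and all sizes are illustrative assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Conv1d):
    """Pad on the left so the output at time t only sees inputs up to t."""
    def forward(self, x):
        pad = (self.kernel_size[0] - 1) * self.dilation[0]
        return super().forward(F.pad(x, (pad, 0)))

tcn = nn.Sequential(
    CausalConv1d(16, 32, kernel_size=3, dilation=1), nn.ReLU(),
    CausalConv1d(32, 32, kernel_size=3, dilation=2), nn.ReLU(),  # dilation grows the receptive field
)
x = torch.randn(8, 16, 100)       # (batch, channels, time)
y = tcn(x)                        # (8, 32, 100): same length, causal receptive field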


Recap

● RNNs are good for sequential data
● RNNs allow flexibility in the architecture
● BPTT suffers from vanishing / exploding gradients
● gradient clipping and LSTM / GRU architectures are common ways to avoid them
Thank you!
(Next: Applications of RNN in NLP)
Other Resources
http://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/ - Lectures 15-16
https://www.cs.ox.ac.uk/people/nando.defreitas/machinelearning/ - Lectures 12-13
http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf - Lecture 10
http://cs224d.stanford.edu/syllabus.html - Lectures 6-8

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

GRU: https://arxiv.org/abs/1406.1078
LSTM: https://www.bioinf.jku.at/publications/older/2604.pdf
