
Deep Learning

6. Recurrent NN
Fully connected network vs ConvNet
- fully connected network: neurons, weights, FC layers
- ConvNet: conv layers, kernels, feature maps, non-linearity

[Image from https://www.papernot.fr/marauder_map.pdf] [Image from http://benanne.github.io/images]
Neural Networks Applications in Computer Vision: Segmentation
- Fully Convolutional Network
- Deconvolution Network

[Photos from: http://deeplearning.net/tutorial/fcn_2D_segm.html; https://medium.com/@wilburdes/semantic-segmentation-using-fully-convolutional-neural-networks-86e45336f99b]
Motivation

● Until now we have seen:
  ■ fully connected networks: unstructured data
  ■ CNNs: localized data

● BUT sometimes we have sequential data:
  ■ the input is a sequence
  ■ the output is a sequence
Text generation

[A. Karpathy and J. Johnson]
Machine Translat ion

[Wu, Yonghui et al. “Google's Neural Machine Translation


System: Bridging the Gap between Human and Machine
Captioning

[I. Duta, A. Nicolicioiu, V. Bogolin, M. Leordeanu. "Mining for meaning: from vision to language through multiple networks consensus" (2018)]
[O. Vinyals et al. "Show and Tell: A Neural Image Caption Generator" (2015)]
Video classification

A first approach:
● extract a representation for each frame using a CNN
● concatenate all the frame features into one huge video representation vector
● process the video representation using fully connected layers
Video classification: why not?

W1 ∈ R^((d1·f) × d2)  (the first fully connected layer maps the concatenation of f frame features of d1 dimensions each to d2 units)

● no sequentiality
● huge number of parameters (see the example below)
● fixed input length
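To get a feel for the parameter count, a tiny back-of-the-envelope check in Python (the sizes d1, f, d2 here are assumed for illustration only, not taken from the slides):

d1, f, d2 = 512, 100, 1024        # assumed: 512 features per frame, 100 frames, 1024 hidden units
params_W1 = (d1 * f) * d2         # weights of the first FC layer alone
print(params_W1)                  # 52428800 parameters, and the number of frames f is fixed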
Video classification: solution

Recurrent Neural Network

Feed-forward vs RNN

RNN topologies: ONE-TO-ONE, ONE-TO-MANY, MANY-TO-ONE, MANY-TO-MANY

x - input
h - hidden state / context
y - output
Forward step

The same weights W (the same operation) are used at every time step:
- allows input sequences of variable length
- given a context and an input, all timesteps should be processed in the same way
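A minimal sketch of this forward step in NumPy, assuming the standard vanilla RNN update h_t = tanh(W_xh·x_t + W_hh·h_{t-1} + b); all variable names are illustrative:

import numpy as np

def rnn_forward(xs, h0, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence; the same weights are reused at every step."""
    h = h0
    hs = []
    for x in xs:                                  # xs: list of input vectors, any length
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)    # shared W_xh, W_hh, b_h at each time step
        hs.append(h)
    return hs, h                                  # all hidden states and the final context

d_in, d_h = 8, 16
W_xh = np.random.randn(d_h, d_in) * 0.01
W_hh = np.random.randn(d_h, d_h) * 0.01
b_h = np.zeros(d_h)
xs = [np.random.randn(d_in) for _ in range(10)]   # a sequence of 10 steps; 7 or 50 would also work
hs, h_T = rnn_forward(xs, np.zeros(d_h), W_xh, W_hh, b_h)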
Forward step

When there is no real input at a time step (e.g., one-to-many generation), a dummy input is fed.

What about the initial hidden state h0?
1. zeros vector
2. random vector
3. a representation of data used as conditional information for the current model
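A short illustration of the three choices in PyTorch; `cond_proj` and `image_features` are hypothetical names for the conditioning path (e.g., CNN image features in captioning):

import torch
import torch.nn as nn

d_h = 128
h0_zeros  = torch.zeros(1, d_h)                  # option 1: zeros vector
h0_random = torch.randn(1, d_h) * 0.01           # option 2: (small) random vector
cond_proj = nn.Linear(2048, d_h)                 # option 3: project conditioning data
image_features = torch.randn(1, 2048)            #           (e.g. CNN image features)
h0_cond = torch.tanh(cond_proj(image_features))  #           into the initial state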
Vanilla RNN vs Multi-modal RNN

One-modal input vs multi-modal input
Video classification: RNN

So:
● one CNN to extract features from each frame
● one RNN to process the frame features
● output taken from the final step (many-to-one)
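A minimal sketch of this many-to-one pipeline in PyTorch; the tiny feature extractor, hidden size and number of classes are illustrative assumptions, not the lecture's exact setup:

import torch
import torch.nn as nn

class VideoClassifier(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_classes=10):
        super().__init__()
        # per-frame CNN feature extractor (a tiny stand-in conv net)
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)  # processes the frame features
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, video):                  # video: (batch, frames, 3, H, W)
        b, f = video.shape[:2]
        feats = self.cnn(video.flatten(0, 1))  # (batch*frames, feat_dim)
        feats = feats.view(b, f, -1)
        _, h_last = self.rnn(feats)            # many-to-one: keep only the final hidden state
        return self.fc(h_last[-1])             # class scores

model = VideoClassifier()
scores = model(torch.randn(2, 16, 3, 64, 64))  # 2 clips of 16 frames each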
Backpropagation through time (BPTT)
Backpropagation through time

For a large number of timesteps (t): HARD TO TRAIN. WHY?
The gradient of the loss with respect to early hidden states is a product of many per-step Jacobians, so it tends to either vanish or explode.

[image from: https://hackernoon.com/understanding-architecture-of-lstm-cell-from-scratch-with-code-8da40f0b71f4]
[image from Toronto University CSC321 course 2017, R. Grosse]
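The chain rule behind this, written out in LaTeX for the vanilla RNN update h_t = tanh(W_hh h_{t-1} + W_xh x_t + b) (a standard derivation, not copied from the slides):

\frac{\partial \mathcal{L}}{\partial h_k}
  = \frac{\partial \mathcal{L}}{\partial h_T}
    \prod_{t=k+1}^{T} \frac{\partial h_t}{\partial h_{t-1}},
\qquad
\frac{\partial h_t}{\partial h_{t-1}} = \mathrm{diag}\!\left(1 - h_t^{2}\right) W_{hh}

With T - k large this is a long product of similar factors: when the largest singular value of W_hh is below 1 the gradient vanishes, when it is above 1 it explodes.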


Vanishing / exploding gradients

Solution 1: Gradient clipping

● avoids exploding gradients: rescale the gradient when its norm exceeds a threshold

[Goodfellow et al., Deep Learning]
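A minimal sketch of clipping by global norm in PyTorch; the model, loss and threshold are arbitrary choices for illustration:

import torch
import torch.nn as nn

model = nn.GRU(32, 64)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(20, 8, 32)            # (seq_len, batch, input_size)
out, _ = model(x)
loss = out.pow(2).mean()              # dummy loss, just to get gradients

opt.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # rescale if ||g|| > 1.0
opt.step()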
Vanishing / exploding gradients

Solution 2: Identity initialization

● recurrent weights are initialized to the identity matrix
● activation functions are ReLU
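A sketch of this initialization on a simple ReLU RNN cell in PyTorch (the "IRNN" idea of Le et al., 2015); the layer sizes are illustrative:

import torch
import torch.nn as nn

cell = nn.RNNCell(32, 64, nonlinearity='relu')   # ReLU activation instead of tanh
with torch.no_grad():
    cell.weight_hh.copy_(torch.eye(64))          # recurrent weights = identity
    cell.bias_hh.zero_()
    cell.bias_ih.zero_()

h = torch.zeros(8, 64)                           # batch of 8
for x_t in torch.randn(20, 8, 32):               # 20 time steps
    h = cell(x_t, h)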


Vanishing / exploding gradients

Solution 3: Architectures

● LSTM
● GRU
● avoid vanishing gradients

[Sepp Hochreiter and Jürgen Schmidhuber. "Long Short-Term Memory." Neural Comput. 9, 8 (November 1997), 1735-1780]
[Cho, Kyunghyun et al. "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation." EMNLP (2014)]
Gating mechanism

Vanilla RNN vs simple gate

● Gates control the flow of information into the cell, allowing the model to forget irrelevant information
● gate ≈ 0: dismiss all the information; gate ≈ 1: keep all the information
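A tiny illustration of a single gate in NumPy (a sigmoid produces values in (0, 1) that scale how much information passes; all values are made up):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h_prev = np.array([0.5, -1.0, 2.0])
candidate = np.array([1.0, 1.0, 1.0])
gate = sigmoid(np.array([-10.0, 0.0, 10.0]))       # ~0 -> dismiss, ~1 -> keep
h_new = gate * candidate + (1.0 - gate) * h_prev   # convex mix controlled by the gate
print(gate, h_new)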
GRU

GRU uses a gating mechanism with 2 gates:
a. update gate: determines how much of the past information needs to be passed along to the future (copy / ignore previous info)
b. reset gate: decides how to combine the new input with the previous state (keep / reset previous info for the current target state)
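A sketch of a single GRU step in NumPy (one common formulation; the exact convention for which term the update gate multiplies differs between papers, and all weight names and sizes here are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    z = sigmoid(Wz @ x + Uz @ h_prev)               # update gate: how much past to pass along
    r = sigmoid(Wr @ x + Ur @ h_prev)               # reset gate: how to combine input with previous state
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))   # candidate ("target") state
    return (1 - z) * h_prev + z * h_tilde           # interpolate between old state and candidate

d_in, d_h = 4, 3
rng = np.random.default_rng(0)
W = [rng.normal(scale=0.1, size=(d_h, d_in)) for _ in range(3)]
U = [rng.normal(scale=0.1, size=(d_h, d_h)) for _ in range(3)]
h = gru_step(rng.normal(size=d_in), np.zeros(d_h), W[0], U[0], W[1], U[1], W[2], U[2])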


Truncated backpropagation through time (TBPTT)

[slide from Stanford CS231 course 2018, Li Fei-Fei]

When the sequence is very long, BPTT accumulates gradients over many timesteps:
● learning is computationally expensive (slow)
● learning suffers from vanishing / exploding gradients

TBPTT:
● the update step is applied after each chunk of k forward passes
● each update accumulates gradients over a limited, fixed number of timesteps (k)
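A minimal sketch of TBPTT in PyTorch, using the usual idiom: run k steps, update, then detach the hidden state so gradients do not flow further back. The value of k, the sizes and the data are illustrative:

import torch
import torch.nn as nn

rnn = nn.RNN(16, 32)
head = nn.Linear(32, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

long_seq = torch.randn(1000, 1, 16)       # (seq_len, batch, input_size)
targets = torch.randn(1000, 1, 1)
k = 50                                    # truncation length
h = torch.zeros(1, 1, 32)

for start in range(0, long_seq.size(0), k):
    x_chunk = long_seq[start:start + k]
    y_chunk = targets[start:start + k]
    out, h = rnn(x_chunk, h)
    loss = ((head(out) - y_chunk) ** 2).mean()
    opt.zero_grad()
    loss.backward()                       # gradients only over the last k timesteps
    opt.step()
    h = h.detach()                        # cut the graph: no backprop beyond this chunk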

Classification vs Generation

[Image from https://www.buzzfeed.com/stephenlaconte/coffee-is-bad-people-who-drink-it-are-bad]


Generate sequences - Teacher forcing

● Language models usually use the output from step t-1 as the input for step t
● Early mistakes propagate into later steps:
  ○ slow convergence
  ○ instability of the model
Generate sequences - Teacher forcing

● Teacher forcing: during the training phase the model receives the ground truth y* instead of the model output y
  ○ later steps receive correct input even at the beginning of training
● during the test phase, use the model's output, because we don't have access to the ground truth
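A sketch of a decoder loop with teacher forcing in PyTorch; the vocabulary/embedding sizes and the `decoder_cell` structure are illustrative assumptions:

import torch
import torch.nn as nn

vocab, emb_dim, hidden = 100, 32, 64
embed = nn.Embedding(vocab, emb_dim)
decoder_cell = nn.GRUCell(emb_dim, hidden)
out_proj = nn.Linear(hidden, vocab)
criterion = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab, (8, 12))        # ground-truth sequences: (batch, time)
h = torch.zeros(8, hidden)
loss = 0.0

for t in range(tokens.size(1) - 1):
    x_t = embed(tokens[:, t])                    # training: feed the ground-truth token y*_t ...
    h = decoder_cell(x_t, h)
    logits = out_proj(h)
    loss = loss + criterion(logits, tokens[:, t + 1])   # ... and predict y*_{t+1}
# at test time, x_t would instead be embed(logits.argmax(dim=-1)) from the previous step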
Process longer input

● Problem:
  ○ can't fit the entire document in the model
  ○ need batch_size > 1
  ○ need the previous context for each sub-sequence

[Fragment from: Rebilius Cruso (Francis William Newman) http://www.gutenberg.org/files/50732/50732-h/50732-h.htm]

● Solution:
  ○ split the document into batch_size continuous chunks
  ○ in each training iteration, one batch receives sequences from different chunks (see the sketch below)
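A minimal sketch of this chunking in PyTorch, in the spirit of the classic language-model "batchify" scheme: reshape the token stream into batch_size parallel continuous streams, then slice consecutive windows. Function names and sizes are illustrative:

import torch

def batchify(token_stream, batch_size):
    """Split one long stream into batch_size continuous chunks, stacked as columns."""
    n = token_stream.size(0) // batch_size
    data = token_stream[:n * batch_size]          # drop the remainder
    return data.view(batch_size, n).t()           # (n, batch_size): column b is chunk b

def get_window(data, start, seq_len):
    """Consecutive windows: iteration i continues exactly where iteration i-1 stopped."""
    seq_len = min(seq_len, data.size(0) - 1 - start)
    x = data[start:start + seq_len]
    y = data[start + 1:start + 1 + seq_len]       # next-token targets
    return x, y

stream = torch.arange(1000)                       # stand-in for a tokenized document
data = batchify(stream, batch_size=4)             # 4 continuous chunks
x0, y0 = get_window(data, 0, seq_len=35)
x1, y1 = get_window(data, 35, seq_len=35)         # hidden state from x0 can be carried over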
Deep Neural Network

[Images from: https://steemit.com/neuralnet/@longwhitecoat/what-is-a-neural-network-deep-machine-learning; https://www.jeremyjordan.me/convnet-architectures/]
Multilayer RNN

● better captures the structure of the input sequence
● harder to optimize (longer paths)
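In PyTorch, stacking recurrent layers is just the `num_layers` argument; layer l consumes the hidden sequence produced by layer l-1 (sizes here are illustrative):

import torch
import torch.nn as nn

rnn = nn.GRU(input_size=32, hidden_size=64, num_layers=3)   # 3 stacked recurrent layers
x = torch.randn(20, 8, 32)                                  # (seq_len, batch, input_size)
out, h_n = rnn(x)
print(out.shape)   # (20, 8, 64): top-layer hidden states at every time step
print(h_n.shape)   # (3, 8, 64): final hidden state of each of the 3 layers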


Bidirectional RNN

● the prediction at timestep t depends on the whole input
● combines context from 2 RNNs: a forward RNN and a backward RNN

[Schuster, Mike and Kuldip K. Paliwal. "Bidirectional recurrent neural networks." IEEE Trans. Signal Processing 45 (1997): 2673-2681]
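In PyTorch this is the `bidirectional` flag; the forward and backward contexts are concatenated at each timestep (sizes illustrative):

import torch
import torch.nn as nn

birnn = nn.GRU(input_size=32, hidden_size=64, bidirectional=True)
x = torch.randn(20, 8, 32)                 # (seq_len, batch, input_size)
out, h_n = birnn(x)
print(out.shape)   # (20, 8, 128): forward and backward contexts concatenated (2 * 64)
print(h_n.shape)   # (2, 8, 64): final state of the forward and of the backward RNN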
Normalization: Recap and more

- during training, activations change a lot - each layer adapts to a new input distribution at every step (internal covariate shift)
  ⇒ training is slow
- force the input distribution of each layer to be the same at every training step (normalization = constant mean, constant std)
  ⇒ training is fast
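The normalization step itself, written out in LaTeX (the standard BatchNorm/LayerNorm form; γ and β are learned scale and shift):

\hat{x} = \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}}, \qquad y = \gamma \hat{x} + \beta

where μ and σ² are the mean and variance computed over the mini-batch (BatchNorm) or over the features of each example (LayerNorm).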
Batch Normalization

Batch statistics (mini-batch mean and variance) are computed ONLY AT TRAINING TIME; at test time, running averages are used instead.

[Lei Ba et al., Layer Normalization, 2016]
Normalization for RNN

● Classic BatchNorm is not recommended
  ○ each timestep has different statistics (mean and variance) - statistics computed and used over all timesteps are not accurate

● Modified BatchNorm is sometimes used (Cooijmans et al., ICLR 2017)
  ○ keep different statistics for each timestep
  ○ the per-timestep statistics converge when the number of timesteps is large
Normalization for RNN: LayerNorm

● LayerNorm (Lei Ba et al. 2016)
  ○ normalize within each example, across channels

[Cooijmans et al., "Recurrent Batch Normalization." ICLR 2017]
[Lei Ba et al., Layer Normalization, 2016]
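A sketch of LayerNorm inside a manual recurrent step in PyTorch, normalizing the pre-activation of each example across its hidden channels (sizes illustrative):

import torch
import torch.nn as nn

d_in, d_h = 32, 64
W_xh = nn.Linear(d_in, d_h, bias=False)
W_hh = nn.Linear(d_h, d_h, bias=False)
ln = nn.LayerNorm(d_h)                       # per-example statistics over the d_h channels

x = torch.randn(20, 8, d_in)                 # (seq_len, batch, input_size)
h = torch.zeros(8, d_h)
for x_t in x:
    h = torch.tanh(ln(W_xh(x_t) + W_hh(h)))  # normalize the pre-activation, then tanh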
TCN or RNN?

● 1D convolutions can be applied to 1D (sequential) data
● Possible applications:
  - audio processing
  - video processing
  - natural language processing

BOTH

[Slides from Course 5]
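A minimal sketch of applying 1D convolutions to a sequence in PyTorch, in the spirit of a TCN (causal, dilated convolutions); the class and all sizes are illustrative assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Conv1d):
    """Pad on the left so the output at time t only sees inputs up to t."""
    def forward(self, x):
        pad = (self.kernel_size[0] - 1) * self.dilation[0]
        return super().forward(F.pad(x, (pad, 0)))

tcn = nn.Sequential(
    CausalConv1d(16, 32, kernel_size=3, dilation=1), nn.ReLU(),
    CausalConv1d(32, 32, kernel_size=3, dilation=2), nn.ReLU(),  # dilation grows the receptive field
)
x = torch.randn(8, 16, 100)       # (batch, channels, time)
y = tcn(x)                        # (8, 32, 100): same length, causal receptive field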


Recap

● RNNs are good for sequential data
● RNNs allow flexibility in the architecture
● BPTT suffers from vanishing / exploding gradients
● gradient clipping and LSTM / GRU architectures are common ways to avoid them
Thank you!
(Next: Applications of RNN in NLP)
Other Resources
http://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/ - Lectures 15-16
https://www.cs.ox.ac.uk/people/nando.defreitas/machinelearning/ - Lectures 12-13
http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf - Lecture 10
http://cs224d.stanford.edu/syllabus.html - Lectures 6-8

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

GRU: https://arxiv.org/abs/1406.1078
LSTM: https://www.bioinf.jku.at/publications/older/2604.pdf
