Lecture 10: Recurrent Neural Networks
Fei-Fei Li, Justin Johnson & Serena Yeung, May 3, 2018
Administrative
A2 was due yesterday.

Administrative: Midterm
Last Time: CNN Architectures
AlexNet
Last Time: CNN Architectures
(Figure: VGG layer diagrams, side by side: stacks of 3x3 conv layers with 64, 128, 256, and 512 filters separated by pooling, followed by FC 4096, FC 4096, FC 1000, and softmax.)
Last Time: CNN Architectures
(Figure: ResNet. A residual block computes F(x) + x: two 3x3 conv layers with relu on the F(x) path plus an identity shortcut. The full network stacks these blocks, with a 7x7 conv, 64 / 2 stem, occasional stride-2 downsampling (e.g. 3x3 conv, 128 / 2), pooling, FC 1000, and softmax.)
DenseNet and FractalNet
(Figure: DenseNet dense blocks, in which each conv layer's output is concatenated with the features of all previous layers, interleaved with pooling, then FC and softmax; FractalNet architecture diagram.)
Figures copyright Larsson et al., 2017. Reproduced with permission.
Last Time: CNN Architectures
AlexNet and VGG have tons of parameters in their fully connected layers.
ImageNet pretraining -> Instagram pretraining
Bigger models are saturated on ImageNet, but with more data, bigger models do better.
Mahajan et al, “Exploring the Limits of Weakly Supervised Pretraining”, arXiv 2018
Today: Recurrent Neural Networks
“Vanilla” Neural Network
(Figure: one to one, a fixed-size input maps to a fixed-size output.)

Recurrent Neural Networks: Process Sequences
(Figure: beyond one to one, RNNs handle one-to-many, many-to-one, and many-to-many input/output configurations.)
Sequential Processing of Non-Sequence Data
Ba, Mnih, and Kavukcuoglu, “Multiple Object Recognition with Visual Attention”, ICLR 2015
Gregor et al, “DRAW: A Recurrent Neural Network For Image Generation”, ICML 2015
Figure copyright Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra, 2015. Reproduced with permission.

Sequential Processing of Non-Sequence Data
Generate images one piece at a time!
Gregor et al, “DRAW: A Recurrent Neural Network For Image Generation”, ICML 2015
Figure copyright Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra, 2015. Reproduced with permission.
Recurrent Neural Network
(Figure: an input x feeds into an RNN whose internal state is updated at every step; we usually want to predict an output vector y at some time steps.)
Recurrent Neural Network
We can process a sequence of vectors x by applying a recurrence formula at every time step:

h_t = f_W(h_{t-1}, x_t)

where h_t is the new state, h_{t-1} is the old state, x_t is the input vector at some time step, and f_W is some function with parameters W. The same function and the same set of parameters W are used at every time step.
(Simple) Recurrent Neural Network
The state consists of a single “hidden” vector h:

h_t = tanh(W_hh h_{t-1} + W_xh x_t)
y_t = W_hy h_t

Sometimes called a “Vanilla RNN” or an “Elman RNN” after Prof. Jeffrey Elman.
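To make the update concrete, here is a minimal numpy sketch of one vanilla RNN step; the Wxh/Whh/Why weight names follow the convention used later in these slides, while the sizes and initialization are illustrative assumptions.

```python
import numpy as np

def rnn_step(x, h_prev, Wxh, Whh, Why, bh, by):
    """One step of a vanilla (Elman) RNN: x is the input, h_prev the previous hidden state."""
    h = np.tanh(Wxh @ x + Whh @ h_prev + bh)  # new hidden state
    y = Why @ h + by                          # output scores (e.g. over a vocabulary)
    return h, y

# Example sizes (assumed for illustration): 4-dim input, 8-dim hidden, 4-dim output
D, H, V = 4, 8, 4
rng = np.random.default_rng(0)
Wxh = rng.normal(scale=0.01, size=(H, D))
Whh = rng.normal(scale=0.01, size=(H, H))
Why = rng.normal(scale=0.01, size=(V, H))
bh, by = np.zeros(H), np.zeros(V)

h = np.zeros(H)
x = np.eye(D)[0]                 # a one-hot input vector
h, y = rnn_step(x, h, Wxh, Whh, Why, bh, by)
```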
RNN: Computational Graph
(Figure: the graph unrolled over time: the initial state h0 and input x1 pass through f_W to give h1; h1 and x2 give h2; and so on up to hT. The same weight matrix W is reused at every time step.)

RNN: Computational Graph: Many to Many
(Figure: each hidden state ht additionally produces an output yt with a per-step loss Lt; the total loss L is the sum of the per-step losses.)

RNN: Computational Graph: Many to One
(Figure: only the final hidden state hT produces an output.)

RNN: Computational Graph: One to Many
(Figure: a single input x initializes the recurrence, which then emits an output yt at every subsequent step.)
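As a sketch of the many-to-many case, the loop below unrolls a step function over a sequence and sums a per-step loss; the softmax cross-entropy loss and the step-function signature are assumptions for illustration, not the course code.

```python
import numpy as np

def softmax_cross_entropy(scores, target_idx):
    # numerically stable softmax cross-entropy for one time step
    scores = scores - scores.max()
    probs = np.exp(scores) / np.exp(scores).sum()
    return -np.log(probs[target_idx])

def unrolled_loss(xs, targets, h0, step_fn, params):
    """Unroll the RNN over a sequence; the total loss L is the sum of per-step losses Lt."""
    h, L = h0, 0.0
    for x_t, t_t in zip(xs, targets):
        h, y_t = step_fn(x_t, h, *params)       # the same weights W are reused at every step
        L += softmax_cross_entropy(y_t, t_t)
    return L
```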
Sequence to Sequence: Many-to-one + one-to-many
Many to one: encode the input sequence in a single vector.
One to many: produce the output sequence from that single input vector.
(Figure: an encoder RNN with weights W1 consumes x1, x2, x3, … to build a summary state; a decoder RNN with weights W2 then unrolls it into outputs y1, y2, …)
Sutskever et al, “Sequence to Sequence Learning with Neural Networks”, NIPS 2014
Example: Character-level Language Model
Vocabulary: [h,e,l,o]
Example training sequence: “hello”
(Figure: the training sequence is fed in one character at a time; at each step the hidden layer is updated from the previous hidden state and the current input character, and the output layer scores which character should come next.)
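A small sketch of how the training data for this example might be prepared; the one-hot encoding and index mapping are assumptions consistent with the min-char-rnn setup referenced later.

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']
char_to_ix = {ch: i for i, ch in enumerate(vocab)}

seq = "hello"
inputs  = [char_to_ix[ch] for ch in seq[:-1]]   # h, e, l, l
targets = [char_to_ix[ch] for ch in seq[1:]]    # e, l, l, o  (the next character at each step)

# one-hot encode each input character
xs = [np.eye(len(vocab))[ix] for ix in inputs]
```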
Example: Character-level Language Model Sampling
Vocabulary: [h,e,l,o]
At test time, sample characters one at a time and feed them back into the model.
(Figure: starting from the input “h”, a softmax over the output scores gives a distribution over the vocabulary at each step; a character is sampled, here “e”, “l”, “l”, “o”, and fed back in as the next input.)
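A hedged sketch of this test-time sampling loop; the step-function signature and the random seeding are illustrative assumptions.

```python
import numpy as np

def sample_chars(h, seed_ix, n, step_fn, params, vocab):
    """Sample n characters, feeding each sampled character back in as the next input."""
    rng = np.random.default_rng(0)
    x = np.eye(len(vocab))[seed_ix]
    out = []
    for _ in range(n):
        h, scores = step_fn(x, h, *params)
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                      # softmax over the vocabulary
        ix = rng.choice(len(vocab), p=probs)      # sample (rather than taking the argmax)
        out.append(vocab[ix])
        x = np.eye(len(vocab))[ix]                # feed the sample back in
    return "".join(out)
```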
Backpropagation through time
Forward through the entire sequence to compute the loss, then backward through the entire sequence to compute the gradient.

Truncated Backpropagation through time
Run forward and backward through chunks of the sequence instead of the whole sequence: carry hidden states forward in time forever, but only backpropagate for some smaller number of steps.
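A minimal sketch of truncated backpropagation through time, assuming a helper that runs forward/backward over one chunk; the chunk length, helper signature, and plain SGD update are illustrative assumptions.

```python
import numpy as np

def train_truncated_bptt(data_ixs, H, chunk_len, loss_and_grads, params, lr=1e-1):
    """Carry the hidden state forward across the whole sequence, but backprop only within each chunk."""
    h = np.zeros(H)
    for start in range(0, len(data_ixs) - 1, chunk_len):
        inputs  = data_ixs[start : start + chunk_len]
        targets = data_ixs[start + 1 : start + chunk_len + 1]
        # loss_and_grads runs forward/backward over just this chunk and
        # returns the loss, the parameter gradients, and the final hidden state
        loss, grads, h = loss_and_grads(inputs, targets, h, params)
        for p, g in zip(params, grads):
            p -= lr * g          # plain SGD update (illustrative)
    return params
```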
min-char-rnn.py gist: 112 lines of Python
(https://gist.github.com/karpathy/d4dee566867f8291f086)
(Figure: train the character-level RNN on a large text corpus, then sample from it. At first the samples are random strings of characters; as training continues, the samples look more and more like the training text.)
The Stacks Project: open source algebraic geometry textbook
(Figure: algebraic-geometry LaTeX sampled from an RNN trained on the Stacks Project source.)

Generated C code
(Figure: samples of generated C code from a character-level RNN trained on source code.)
Searching for interpretable cells
(Figure: visualizing the activation of a single hidden cell over the input text; some cells turn out to be interpretable, e.g. an “if statement” cell and a “quote/comment” cell.)
Karpathy, Johnson, and Fei-Fei, “Visualizing and Understanding Recurrent Networks”, ICLR Workshop 2016
Figures copyright Karpathy, Johnson, and Fei-Fei, 2015; reproduced with permission
Image Captioning
(Figure: a Convolutional Neural Network encodes the image and a Recurrent Neural Network generates the caption one word at a time.)
Image Captioning: test-time walkthrough
Take a test image and run it through the CNN to get an image feature vector v.
Feed a special <START> token as the first input x0.
Condition the hidden state on the image:
before: h = tanh(Wxh * x + Whh * h)
now: h = tanh(Wxh * x + Whh * h + Wih * v)
At each step, sample a word from the output distribution (y0 → “straw”, y1 → “hat”, …) and feed the sampled word back in as the next input.
Sampling the special <END> token tells the model to finish the caption.
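A minimal sketch of the modified recurrence above; the Wxh/Whh/Wih names follow the slide, while the output layer and everything else are illustrative assumptions.

```python
import numpy as np

def caption_step(x, h_prev, v, Wxh, Whh, Wih, Why, bh, by):
    """One captioning step: the vanilla RNN update plus an image feature term."""
    h = np.tanh(Wxh @ x + Whh @ h_prev + Wih @ v + bh)  # now: also condition on image features v
    scores = Why @ h + by                                # scores over the word vocabulary
    return h, scores

# At test time: start from the <START> token input, sample a word from the softmax
# over `scores`, feed that word back in as the next input, and stop on <END>.
```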
Image Captioning: Example Results
Captions generated using neuraltalk2. All images are CC0 Public domain (cat suitcase, cat tree, dog, bear, surfers, tennis, giraffe, motorcycle).
“A cat sitting on a suitcase on the floor”
“A cat is sitting on a tree branch”
“A dog is running in the grass with a frisbee”
“A white teddy bear sitting in the grass”
“Two people walking on the beach with surfboards”
“A tennis player in action on the court”
“Two giraffes standing in a grassy field”
“A man riding a dirt bike on a dirt track”
Image Captioning: Example Results (continued)
Captions generated using neuraltalk2. All images are CC0 Public domain.
“A bird is perched on a tree branch”
“A woman is holding a cat in her hand”
“A man in a baseball uniform throwing a ball”
“A woman standing on a beach holding a surfboard”
“A person holding a computer mouse on a desk”
Image Captioning with Attention
The RNN focuses its attention at a different spatial location when generating each word.
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission.
Image Captioning with Attention
The CNN now produces a grid of features rather than a single vector: an H x W x 3 image becomes L spatial locations, each with a D-dimensional feature vector (features: L x D).
From the initial hidden state h0, the model computes a distribution a1 over the L locations; the weighted combination of features gives a single context vector z1 of dimension D.
z1 is fed into the RNN together with the first word y1; each RNN step h1, h2, … then produces both a distribution over the vocabulary (d1, d2, …) for the next word and a new attention distribution (a2, a3, …) over the image locations for the next step.
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
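A hedged sketch of one soft-attention step over the L x D feature grid; the dot-product scoring function and the Wa projection are assumptions, since the slides do not spell out the exact scoring used.

```python
import numpy as np

def soft_attention(h, features, Wa):
    """features: (L, D) grid of CNN feature vectors; h: current RNN hidden state."""
    scores = features @ (Wa @ h)                 # one attention score per location, shape (L,)
    scores = scores - scores.max()
    a = np.exp(scores) / np.exp(scores).sum()    # distribution over the L locations
    z = a @ features                             # weighted combination of features, shape (D,)
    return a, z
```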
Image Captioning with Attention
Soft attention: take a weighted combination of features from all image locations (differentiable, trainable with standard backprop).
Hard attention: select a single location to attend to at each step (not differentiable, so it needs something fancier than vanilla backprop).
(Figure: example captions with the attended image regions highlighted for each generated word.)
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission.
Visual Question Answering
Visual Question Answering: RNNs with Attention
Zhu et al, “Visual 7W: Grounded Question Answering in Images”, CVPR 2016
Figures from Zhu et al, copyright IEEE 2016. Reproduced for educational purposes.
Multilayer RNNs
Stack RNNs in depth as well as unrolling them in time: the hidden states of one layer become the inputs to the layer above at each time step.
LSTM: (Figure: a multilayer LSTM unrolled along the time and depth axes.)
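A small sketch of stacking RNN layers in depth, assuming a per-layer step function like the vanilla sketch earlier; names and signatures are illustrative.

```python
def multilayer_rnn_forward(xs, hs0, step_fns, params_per_layer):
    """xs: input vectors over time; hs0: one initial hidden state per layer.
    Each step_fns[l](x, h, *params) returns the new hidden state for layer l."""
    hs = list(hs0)
    top_states = []
    for x_t in xs:
        inp = x_t
        for l, (step, params) in enumerate(zip(step_fns, params_per_layer)):
            hs[l] = step(inp, hs[l], *params)
            inp = hs[l]              # the hidden state of layer l is the input to layer l+1
        top_states.append(hs[-1])    # states of the top layer, e.g. for predictions
    return top_states, hs
```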
Vanilla RNN Gradient Flow
(Figure: one vanilla RNN cell: ht-1 and xt are stacked, multiplied by W, and passed through tanh to produce ht.)
Bengio et al, “Learning long-term dependencies with gradient descent is difficult”, IEEE Transactions on Neural Networks, 1994
Pascanu et al, “On the difficulty of training recurrent neural networks”, ICML 2013
Vanilla RNN Gradient Flow
Backpropagation from ht to ht-1 multiplies by W (actually Whh^T).
Vanilla RNN Gradient Flow
Computing the gradient of h0 involves many repeated factors of W (and repeated tanh).
Vanilla RNN Gradient Flow
If the largest singular value of Whh is greater than 1, the repeated multiplications cause exploding gradients; if it is less than 1, they cause vanishing gradients.
Exploding gradients are controlled with gradient clipping (scale the gradient down if its norm is too big); vanishing gradients motivate changing the RNN architecture (e.g. LSTM).
Bengio et al, “Learning long-term dependencies with gradient descent is difficult”, IEEE Transactions on Neural Networks, 1994
Pascanu et al, “On the difficulty of training recurrent neural networks”, ICML 2013
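Since exploding gradients are handled with gradient clipping (see the summary at the end), here is a minimal sketch of one common form, clipping by global norm; the threshold value is an arbitrary assumption.

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale all gradients if their combined norm exceeds max_norm."""
    total_norm = np.sqrt(sum((g ** 2).sum() for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)
        grads = [g * scale for g in grads]
    return grads
```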
Long Short Term Memory (LSTM)
Long Short Term Memory (LSTM)
[Hochreiter and Schmidhuber, 1997]
The vector from below (x) and the vector from before (h) are stacked and multiplied by W (a 4h x 2h matrix); the resulting 4h-dimensional vector is split into four h-dimensional pieces, each passed through a nonlinearity:
i: Input gate (sigmoid), whether to write to the cell
f: Forget gate (sigmoid), whether to erase the cell
o: Output gate (sigmoid), how much to reveal the cell
g: Gate gate (?) (tanh), how much to write to the cell
Long Short Term Memory (LSTM)
[Hochreiter and Schmidhuber, 1997]
(Figure: the LSTM cell: ht-1 and xt are stacked and multiplied by W to produce the gates i, f, o, g; the cell state and hidden state are then updated as
ct = f ☉ ct-1 + i ☉ g
ht = o ☉ tanh(ct) )
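A hedged numpy sketch of one LSTM step implementing the four gates and the updates above; the single W applied to the stacked [h, x] vector follows the 4h x 2h layout on the slide, and the helper names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4h, 2h) when x and h both have size h, as on the slide."""
    hx = np.concatenate([h_prev, x])     # stack the previous hidden state and the input
    a = W @ hx + b                       # shape (4h,)
    H = h_prev.shape[0]
    i = sigmoid(a[0:H])                  # input gate: whether to write to the cell
    f = sigmoid(a[H:2*H])                # forget gate: whether to erase the cell
    o = sigmoid(a[2*H:3*H])              # output gate: how much to reveal the cell
    g = np.tanh(a[3*H:4*H])              # "gate gate": how much to write to the cell
    c = f * c_prev + i * g               # ct = f ☉ ct-1 + i ☉ g
    h = o * np.tanh(c)                   # ht = o ☉ tanh(ct)
    return h, c
```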
Long Short Term Memory (LSTM): Gradient Flow
[Hochreiter and Schmidhuber, 1997]
Backpropagation from ct to ct-1 requires only elementwise multiplication by f, with no matrix multiply by W.
Long Short Term Memory (LSTM): Gradient Flow
[Hochreiter and Schmidhuber, 1997]
Uninterrupted gradient flow along the cell states c0, c1, c2, c3, …: similar to ResNet, where identity shortcuts give the gradient a clean path through the stacked 3x3 conv layers.
In between: Highway Networks
Srivastava et al, “Highway Networks”, ICML DL Workshop 2015
Other RNN Variants
GRU [Cho et al, “Learning phrase representations using RNN encoder-decoder for statistical machine translation”, 2014]
[Jozefowicz et al, “An Empirical Exploration of Recurrent Network Architectures”, 2015]
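For reference, a hedged sketch of the GRU update from Cho et al. 2014, written in the standard reset-gate/update-gate formulation; the weight names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wr, Ur, Wz, Uz, Wh, Uh):
    """One GRU step: reset gate r, update gate z, candidate state h_tilde."""
    r = sigmoid(Wr @ x + Ur @ h_prev)              # reset gate
    z = sigmoid(Wz @ x + Uz @ h_prev)              # update gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))  # candidate state, old state gated by r
    h = z * h_prev + (1.0 - z) * h_tilde           # interpolate between old state and candidate
    return h
```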
Summary
- RNNs allow a lot of flexibility in architecture design
- Vanilla RNNs are simple but don’t work very well
- Common to use LSTM or GRU: their additive interactions improve gradient flow
- Backward flow of gradients in an RNN can explode or vanish. Exploding is controlled with gradient clipping; vanishing is controlled with additive interactions (LSTM)
- Better/simpler architectures are a hot topic of current research
- Better understanding (both theoretical and empirical) is needed.
Next time: Midterm!