
Lecture 6: Smaller Network: RNN

This is our fully connected network. If the input is x1, ..., xn, and n is very large and growing,
this network would become too large. We now input one xi at a time,
and re-use the same edge weights.
Recurrent Neural Network
How does RNN reduce complexity?

Given a function f: (h', y) = f(h, x), where h and h' are vectors with the same dimension.

[Diagram: h0 → f(x1) → h1 → f(x2) → h2 → f(x3) → h3 → ..., producing y1, y2, y3, ... at each step.]
No matter how long the input/output sequence is, we only need
one function f. If the f's were different, it would become a
feedforward NN. This may be treated as another compression
relative to the fully connected network.
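A minimal Python sketch of this weight sharing (the function and variable names are illustrative, not from the lecture): one function f is re-used at every time step, so the number of parameters does not grow with the sequence length.

```python
def run_rnn(f, h0, xs):
    """Apply the SAME function f at every time step: (h', y) = f(h, x)."""
    h, ys = h0, []
    for x in xs:                 # x1, x2, x3, ... (any length)
        h, y = f(h, x)           # the same weights inside f are re-used each step
        ys.append(y)
    return h, ys
```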
Deep RNN: (h', y) = f1(h, x), (g', z) = f2(g, y)

[Diagram: the first layer uses f1 and states h0, h1, h2, ... to map x1, x2, x3, ... to y1, y2, y3, ...; the second layer uses f2 and states g0, g1, g2, ... to map y1, y2, y3, ... to z1, z2, z3, ....]
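Under the same illustrative conventions, stacking two such recurrences gives a deep RNN: the outputs of the first layer become the inputs of the second.

```python
def run_deep_rnn(f1, f2, h0, g0, xs):
    """Stack two recurrences: f1 maps the x's to y's, f2 maps the y's to z's."""
    h, g, zs = h0, g0, []
    for x in xs:
        h, y = f1(h, x)          # first layer:  (h', y) = f1(h, x)
        g, z = f2(g, y)          # second layer: (g', z) = f2(g, y)
        zs.append(z)
    return zs
```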
Bidirectional RNN: (y, h') = f1(x, h) running forward, (z, g') = f2(x, g) running backward, and p = f3(y, z) combining the two.

[Diagram: f1 reads x1, x2, x3 left to right with states h0, h1, h2, h3, producing y1, y2, y3; f2 reads x1, x2, x3 right to left with states g0, g1, g2, g3, producing z1, z2, z3; f3 merges yt and zt into pt at each position.]
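A sketch of the bidirectional case under the same illustrative conventions: f1 reads the sequence forward, f2 reads it backward, and f3 merges the two views at each position.

```python
def run_bidirectional(f1, f2, f3, h0, g0, xs):
    """f1 reads the sequence forward, f2 reads it backward, f3 merges both views."""
    h, ys = h0, []
    for x in xs:                         # forward pass over x1 .. xT
        h, y = f1(h, x)
        ys.append(y)
    g, zs = g0, []
    for x in reversed(xs):               # backward pass over xT .. x1
        g, z = f2(g, x)
        zs.append(z)
    zs.reverse()                         # align z_t with y_t
    return [f3(y, z) for y, z in zip(ys, zs)]   # p_t = f3(y_t, z_t)
```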
Pyramid RNN
• Reduces the number of time steps: each higher layer combines consecutive time steps of the layer below (see the sketch below).
• Built from bidirectional RNNs.
• Significantly speeds up training.

W. Chan, N. Jaitly, Q. Le and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," ICASSP, 2016.
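A hedged sketch of the time-step reduction used in pyramid RNNs such as Listen, Attend and Spell: each higher layer concatenates pairs of consecutive outputs from the layer below, halving the sequence length (how an odd final step and the exact combination rule are handled varies between implementations).

```python
import numpy as np

def pyramid_reduce(hs):
    """Halve the number of time steps by concatenating consecutive pairs of vectors."""
    if len(hs) % 2:                  # an odd trailing step is dropped here; padding is another option
        hs = hs[:-1]
    return [np.concatenate([hs[t], hs[t + 1]]) for t in range(0, len(hs), 2)]
```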
Naïve RNN

h' = σ(Wh h + Wi x)
y = softmax(Wo h')    (note: y is computed from h', not h)

We have ignored the bias.
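A concrete cell for the run_rnn loop sketched earlier, implementing these equations (weight names and shapes are illustrative; the bias is ignored as in the notes):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def make_naive_rnn_cell(Wh, Wi, Wo):
    """Build the cell f: h' = sigmoid(Wh h + Wi x), y = softmax(Wo h')."""
    def f(h, x):
        h_new = sigmoid(Wh @ h + Wi @ x)   # bias ignored
        y = softmax(Wo @ h_new)            # note: y is computed from h', not h
        return h_new, y
    return f

# illustrative shapes: 4-dim input, 3-dim hidden state, 5-way softmax output
rng = np.random.default_rng(0)
f = make_naive_rnn_cell(rng.normal(size=(3, 3)), rng.normal(size=(3, 4)), rng.normal(size=(5, 3)))
```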


Problems with the naive RNN

• When dealing with a time series, it tends to forget old information. When there is a distant relationship of unknown length, we would like the network to have a "memory" for it.
• Vanishing gradient problem.
LSTM

The sigmoid layers output numbers between 0 and 1 that determine how much of each component should be let through; the pink × gates perform point-wise multiplication.

• Forget gate: determines how much of the old information goes through.
• Input gate: decides what information to add to the cell state; it decides which components are to be updated, and C't provides the change contents.
• Output gate: controls how much of the cell state goes into the output.

The core idea is the cell state Ct: it is changed slowly, with only minor linear interactions, so it is very easy for information to flow along it unchanged.

Why sigmoid or tanh? Sigmoid outputs in (0, 1), acting as a gating switch. The vanishing gradient problem is already handled in the LSTM by this slowly changing cell state. Would replacing tanh with ReLU be OK?

• Updating the cell state.
• Deciding what part of the cell state to output.
RNN vs LSTM
Peephole LSTM

Allows "peeping into the memory": the gates also take the cell state ct-1 as an input.
Naïve RNN vs LSTM

• Naïve RNN: takes ht-1 and xt, and produces ht and yt.
• LSTM: takes ct-1, ht-1 and xt, and produces ct, ht and yt.
• c changes slowly: ct is ct-1 plus something.
• h changes faster: ht and ht-1 can be very different.

These four matrix computations should be done concurrently:

z  = tanh( W [xt ; ht-1] )      (the update information)
zi = σ( Wi [xt ; ht-1] )        (controls the input gate)
zf = σ( Wf [xt ; ht-1] )        (controls the forget gate)
zo = σ( Wo [xt ; ht-1] )        (controls the output gate)

Information flow of LSTM: all four are computed from the concatenation of xt and ht-1.

Peephole variant: zo, zf and zi are obtained in the same way, but ct-1 is appended to the input, i.e. [xt ; ht-1 ; ct-1], with the peephole weights taken to be diagonal.
Information flow of LSTM
Information flow of LSTM (⊙ denotes element-wise multiplication):

ct = zf ⊙ ct-1 + zi ⊙ z
ht = zo ⊙ tanh(ct)
yt = σ(W' ht)
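A compact NumPy sketch of one LSTM step following these equations (the four products are written separately for clarity; in practice the four matrices are stacked so they run concurrently, as noted above; weight names and the output matrix W' are illustrative, biases ignored):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_cell(c_prev, h_prev, x, W, Wi, Wf, Wo, Wy):
    """One LSTM step: gates from [xt ; ht-1], then update ct, ht and yt."""
    v = np.concatenate([x, h_prev])      # [xt ; ht-1]
    z  = np.tanh(W @ v)                  # update information
    zi = sigmoid(Wi @ v)                 # input gate
    zf = sigmoid(Wf @ v)                 # forget gate
    zo = sigmoid(Wo @ v)                 # output gate
    c  = zf * c_prev + zi * z            # ct = zf ⊙ ct-1 + zi ⊙ z
    h  = zo * np.tanh(c)                 # ht = zo ⊙ tanh(ct)
    y  = sigmoid(Wy @ h)                 # yt = σ(W' ht)
    return c, h, y
```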


LSTM information flow over time: the pair (ct, ht), together with xt+1, is fed into the next step, where the gates zf, zi, z, zo are recomputed and yt+1 is produced. The cell state ct runs along the chain with only element-wise updates.


LSTM

GRU – gated recurrent unit (more compression)

It has a reset gate and an update gate. It combines the forget and input gates into a single update gate, and it also merges the cell state and hidden state. This is simpler than the LSTM. There are many other variants too.

(×, *: element-wise multiply)

GRUs also take xt and ht-1 as inputs, perform some calculations, and then pass along ht. What makes them different from LSTMs is that GRUs don't need a cell state to pass values along. The calculations within each iteration ensure that the ht values being passed along either retain a high amount of old information or are jump-started with a high amount of new information.
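A sketch of a single GRU step under the standard formulation (weight names are illustrative and biases are dropped); the update gate is written so that ht = z ⊙ ht-1 + (1-z) ⊙ h', matching the highway-style formula used later in these notes, though some papers swap the roles of z and 1-z.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_cell(h_prev, x, Wr, Wz, Wh):
    """One GRU step: reset gate r, update gate z, candidate h', then blend old and new."""
    v = np.concatenate([x, h_prev])
    r = sigmoid(Wr @ v)                                      # reset gate
    z = sigmoid(Wz @ v)                                      # update gate (merged forget + input)
    h_cand = np.tanh(Wh @ np.concatenate([x, r * h_prev]))   # candidate h'
    return z * h_prev + (1.0 - z) * h_cand                   # ht = z ⊙ ht-1 + (1-z) ⊙ h'
```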
Feed-forward vs Recurrent Network
1. A feedforward network does not have an input at each step.
2. A feedforward network has different parameters for each layer.

Feedforward: x → f1 → a1 → f2 → a2 → f3 → a3 → f4 → y, where t is the layer index:

at = ft(at-1) = σ(Wt at-1 + bt)

Recurrent: h0 → f → h1 → f → h2 → f → h3 → f → g → y4, with inputs x1, x2, x3, x4, where t is the time step:

at = f(at-1, xt) = σ(Wh at-1 + Wi xt + bi)

We will turn the recurrent network 90 degrees.


Turning the recurrent unit into a feedforward layer (starting from a GRU):

• No input xt at each step.
• No output yt at each step.
• at-1 is the output of the (t-1)-th layer; at is the output of the t-th layer.
• No reset gate.

h' = σ(W at-1)
z = σ(W' at-1)

Highway Network: at = z ⊙ at-1 + (1-z) ⊙ h'

Highway Network vs Residual Network

• Highway Network: at = z ⊙ at-1 + (1-z) ⊙ h', where the gate controller z decides how much of at-1 is copied through.
• Residual Network: at = at-1 + h'; at-1 is always copied and h' is added to it (both update rules are sketched below).

Training Very Deep Networks, https://arxiv.org/pdf/1507.06228v2.pdf
Deep Residual Learning for Image Recognition, http://arxiv.org/abs/1512.03385
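A small sketch contrasting the two update rules (the tanh nonlinearity and single weight matrices are illustrative; real highway and residual layers typically wrap full affine-plus-activation blocks):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def highway_layer(a_prev, W, Wg):
    """Highway: the gate z decides how much of a_prev to copy and how much of h' to take."""
    h = np.tanh(W @ a_prev)          # candidate h'
    z = sigmoid(Wg @ a_prev)         # gate controller z
    return z * a_prev + (1.0 - z) * h

def residual_layer(a_prev, W):
    """Residual: a_prev is always copied through and h' is simply added to it."""
    h = np.tanh(W @ a_prev)
    return a_prev + h
```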
[Figure: networks of different depths, each running from an input layer to an output layer.]

A highway network automatically determines how many layers are needed!

Highway Network Experiments
Grid LSTM: memory for both time and depth.

[Figure: a standard LSTM block maps (c, h) and x to (c', h') and y along the time axis; a Grid LSTM block carries one memory along the time axis, (c, h) → (c', h'), and another along the depth axis, (a, b) → (a', b').]

[Figure: inside the Grid LSTM block, the gates zf, zi, z, zo are computed from h and b, and they update both memory pairs, producing c', h', a', b'.]

You can generalize this to 3D, and more.
Applications of LSTM / RNN

• Neural machine translation (LSTM).
• Sequence-to-sequence chat model.
• Chat with context. [Example dialogue snippets: M: Hi, M: Hello, U: Hi.]
  Serban, Iulian V., Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau, 2015, "Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models".
• Baidu's speech recognition using RNN.
Attention

Image caption generation using attention (from CY Lee's lecture)

z0 is an initial parameter; it is also learned.

[Figure: a CNN produces a vector for each image region (a grid of filter outputs). z0 is matched against each region's vector, giving a match score, e.g. 0.7 for one region.]
Image Caption Generation: Word 1

[Figure: the match scores become attention weights over the regions (e.g. 0.7, 0.1, 0.1, 0.1, 0.0, 0.0). The weighted sum of the region vectors is fed to the RNN, which attends to one region, outputs Word 1, and produces the next state z1.]
Image Caption Generation: Word 2

[Figure: z1 gives new attention weights (e.g. 0.0, 0.8, 0.2, 0.0, 0.0, 0.0). The new weighted sum is fed to the RNN, which outputs Word 2 and produces the next state z2.]
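A minimal sketch of the attention step described above, assuming a simple bilinear match function (Show, Attend and Tell uses a small MLP; W_match and the shapes here are illustrative): the current state zt scores each region vector, the scores are normalized into weights such as 0.7, 0.1, ..., and the weighted sum of the region vectors is what the caption RNN consumes next.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def attend(z, region_vectors, W_match):
    """Soft attention: score each region against the state z, then take a weighted sum."""
    scores = np.array([z @ (W_match @ r) for r in region_vectors])   # match(z, region)
    alpha = softmax(scores)                                          # attention weights
    context = sum(a * r for a, r in zip(alpha, region_vectors))      # weighted sum of region vectors
    return context, alpha
```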
Image Caption Generation

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio, "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML, 2015.
* Possible project?

Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron Courville, "Describing Videos by Exploiting Temporal Structure", ICCV, 2015.
