
RNN, LSTM and GRU

CSE 4237
Soft Computing
Smaller Network: RNN

This is our fully connected network. If the input x1 … xn is very long and growing, this network becomes too large. Instead, we will input one xi at a time and re-use the same edge weights.
Recurrent Neural Network
How does RNN reduce complexity?

• Given a function f: h’, y = f(h, x), where h and h’ are vectors with the same dimension

y1 y2 y3

h0 f h1 f h2 f h3 ……

x1 x2 x3
No matter how long the input/output sequence is, we only need one function f. If the f’s were all different, the model would become a feedforward NN. This may be treated as another form of compression relative to a fully connected network.
Deep RNN: h’, y = f1(h, x); g’, z = f2(g, y)


z1 z2 z3

g0 f2 g1 f2 g2 f2 g3 ……

y1 y2 y3

h0 f1 h1 f1 h2 f1 h3 ……

x1 x2 x3
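The two-layer recurrence above can be sketched in NumPy. This is a minimal illustration, not the lecture's implementation: the weight names, dimensions, and the choice of tanh are assumptions, and here each layer's output y (or z) is simply its new hidden state.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # illustrative dimension for x, h, g, y, z

# Assumed weights for f1 and f2 (bias omitted, as in the slides)
W1h = rng.standard_normal((d, d)) * 0.1
W1x = rng.standard_normal((d, d)) * 0.1
W2g = rng.standard_normal((d, d)) * 0.1
W2y = rng.standard_normal((d, d)) * 0.1

def f1(h, x):            # h', y = f1(h, x)
    h_new = np.tanh(W1h @ h + W1x @ x)
    return h_new, h_new  # y is taken to be the new hidden state

def f2(g, y):            # g', z = f2(g, y)
    g_new = np.tanh(W2g @ g + W2y @ y)
    return g_new, g_new

h, g = np.zeros(d), np.zeros(d)
for x in rng.standard_normal((3, d)):  # x1, x2, x3
    h, y = f1(h, x)   # first layer consumes the input
    g, z = f2(g, y)   # second layer consumes the first layer's output
```

Note that the same (W1h, W1x) and (W2g, W2y) pairs are reused at every time step; only the depth adds new parameters.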
Bidirectional RNN: y, h = f1(x, h); z, g = f2(x, g); p = f3(y, z)
x1 x2 x3

g0 f2 g1 f2 g2 f2 g3

z1 z2 z3

p = f3(y, z):  f3 p1  f3 p2  f3 p3

y1 y2 y3

h0 f1 h1 f1 h2 f1 h3

x1 x2 x3
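A minimal sketch of the bidirectional scheme above: f1 reads the sequence forward to produce y1…y3, f2 reads it backward to produce z3…z1, and f3 combines the two at each step. Weight names and shapes are illustrative, and f3 is taken to be simple concatenation.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
Wf = rng.standard_normal((d, 2 * d)) * 0.1   # forward cell f1 (assumed shape)
Wb = rng.standard_normal((d, 2 * d)) * 0.1   # backward cell f2

def step(W, state, x):
    # One recurrent step on the concatenated [state; x]
    return np.tanh(W @ np.concatenate([state, x]))

xs = rng.standard_normal((3, d))             # x1, x2, x3

# Forward pass: y1..y3
h, ys = np.zeros(d), []
for x in xs:
    h = step(Wf, h, x)
    ys.append(h)

# Backward pass: z3..z1 (read the sequence in reverse)
g, zs = np.zeros(d), [None] * len(xs)
for t in reversed(range(len(xs))):
    g = step(Wb, g, xs[t])
    zs[t] = g

# f3 combines both directions at each step: p_t = f3(y_t, z_t)
ps = [np.concatenate([y, z]) for y, z in zip(ys, zs)]
```

Each p_t therefore sees the whole input: the past through y_t and the future through z_t.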
Pyramid RNN
• Reducing the number of time steps significantly speeds up training.
• Built from bidirectional RNN layers.

W. Chan, N. Jaitly, Q. Le and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” ICASSP, 2016.
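The time-step reduction in a pyramid RNN can be sketched as follows: each higher layer consumes pairs of consecutive lower-layer outputs, halving the sequence length (this pairing scheme follows the Listen, Attend and Spell design; the function name and shapes are illustrative).

```python
import numpy as np

def pyramid_reduce(h):
    """Concatenate consecutive pairs of hidden states: (T, d) -> (T // 2, 2 * d)."""
    T, d = h.shape
    return h[: T - T % 2].reshape(T // 2, 2 * d)  # drop a trailing odd step if any

h = np.arange(8 * 3, dtype=float).reshape(8, 3)  # 8 time steps, dim 3
reduced = pyramid_reduce(h)                      # 4 time steps, dim 6
```

Stacking three such layers shortens an input of T steps to roughly T/8, which is what makes attention over long speech inputs tractable.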
Naïve RNN

h’ = σ( Wh h + Wi x )
y = softmax( Wo h’ )    (note: y is computed from h’)

We have ignored the bias.

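The naïve RNN step can be sketched in NumPy. Dimensions are illustrative, the bias is ignored as in the slide, and tanh is used as the σ nonlinearity for the hidden state (an assumption; any squashing function fits the formula).

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_h, d_out = 3, 4, 5
Wi = rng.standard_normal((d_h, d_in)) * 0.1   # input weights
Wh = rng.standard_normal((d_h, d_h)) * 0.1    # recurrent weights
Wo = rng.standard_normal((d_out, d_h)) * 0.1  # output weights

def softmax(v):
    e = np.exp(v - v.max())  # subtract max for numerical stability
    return e / e.sum()

def rnn_step(h, x):
    h_new = np.tanh(Wh @ h + Wi @ x)   # h' = sigma(Wh h + Wi x)
    y = softmax(Wo @ h_new)            # y = softmax(Wo h'), computed from h'
    return h_new, y

h = np.zeros(d_h)
for x in rng.standard_normal((3, d_in)):
    h, y = rnn_step(h, x)
```

The same (Wh, Wi, Wo) are applied at every step, which is exactly the weight re-use described earlier.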

Problems with the naïve RNN

• When dealing with a time series, it tends to forget old information. When there is a distant dependency of unknown length, we wish to have a “memory” of it.
• Vanishing gradient problem.
LSTM

The sigmoid layers output numbers between 0 and 1 that determine how much of each component should be let through; the pink × gates are point-wise multiplications. The forget gate determines how much information goes through, the input gate decides what information to add to the cell state, and the output gate controls what goes into the output. The cell state Ct-1 and hidden state ht-1 enter the cell and pass through these gates.

The core idea is the cell state Ct: it is changed slowly, with only minor linear interactions, so it is very easy for information to flow along it unchanged.

Why sigmoid or tanh? The sigmoid outputs in (0, 1) and so acts as a gating switch. The vanishing gradient problem is already handled in the LSTM, so is it OK for ReLU to replace tanh?
LSTM

• Sigmoid, specifically, is used as the gating function for the three gates (input, output, and forget) in the LSTM, since it outputs a value between 0 and 1: it can either block the flow of information through a gate completely or let it pass in full.

• On the other hand, to overcome the vanishing gradient problem, we need a function whose second derivative can be sustained over a long range before going to zero. Tanh is a good function with this property.

• A good neuron unit should be bounded, easily differentiable, monotonic (good for convex optimization), and easy to handle. If you consider these qualities, then you can use ReLU in place of the tanh function, since they are very good alternatives to each other.

• But before making a choice for activation functions, you must know what the
advantages and disadvantages of your choice over others are.
The input gate decides what components are to be updated; C’t provides the change contents.

Updating the cell state.

Deciding what parts of the cell state to output.
RNN vs LSTM
Peephole LSTM

Allows “peeping into the memory”


Naïve RNN vs LSTM

Naïve RNN: (ht-1, xt) → (ht, yt).  LSTM: (ct-1, ht-1, xt) → (ct, ht, yt).

c changes slowly: ct is ct-1 added by something.
h changes faster: ht and ht-1 can be very different.


These four matrix computations should be done concurrently:

z  = tanh( W [xt ; ht-1] )    (updating information)
zi = σ( Wi [xt ; ht-1] )      (controls the input gate)
zf = σ( Wf [xt ; ht-1] )      (controls the forget gate)
zo = σ( Wo [xt ; ht-1] )      (controls the output gate)

With diagonal “peephole” connections, zo, zf, and zi also see ct-1 and are obtained in the same way.

Information flow of LSTM
Information flow of LSTM (⊙: element-wise multiply):

ct = zf ⊙ ct-1 + zi ⊙ z
ht = zo ⊙ tanh(ct)
yt = σ( W’ ht )
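The four-gate computation and the update rule above can be sketched in NumPy (⊙ is `*`; shapes are illustrative, biases omitted as in the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
d_in, d_h = 3, 4
# One weight matrix per gate, each acting on the concatenated [x_t; h_{t-1}]
W, Wi, Wf, Wo = (rng.standard_normal((d_h, d_in + d_h)) * 0.1 for _ in range(4))

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(c, h, x):
    v = np.concatenate([x, h])       # [x_t; h_{t-1}]
    z  = np.tanh(W @ v)              # candidate content
    zi = sigmoid(Wi @ v)             # input gate
    zf = sigmoid(Wf @ v)             # forget gate
    zo = sigmoid(Wo @ v)             # output gate
    c_new = zf * c + zi * z          # c_t = zf (x) c_{t-1} + zi (x) z
    h_new = zo * np.tanh(c_new)      # h_t = zo (x) tanh(c_t)
    return c_new, h_new

c, h = np.zeros(d_h), np.zeros(d_h)
for x in rng.standard_normal((3, d_in)):
    c, h = lstm_step(c, h, x)
```

In a real implementation the four products are fused into a single matrix multiply, which is the "done concurrently" point made above.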


LSTM information flow

The same cell is repeated at every time step: (ct-1, ht-1, xt) produce (ct, ht, yt), which feed the next step together with xt+1 to produce (ct+1, ht+1, yt+1), and so on.


LSTM

GRU – gated recurrent unit
(more compression)

It has a reset gate and an update gate: the update gate combines the LSTM’s forget and input gates into one. The GRU also merges the cell state and hidden state. This is simpler than the LSTM; there are many other variants too.

×, *: element-wise multiply


A GRU also takes xt and ht-1 as inputs, performs some calculations, and then passes along ht. What makes it different from the LSTM is that the GRU needs no separate cell state to pass values along. The calculations within each iteration ensure that the ht values being passed along either retain a high amount of old information or are jump-started with a high amount of new information.
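A sketch of the standard GRU step (the exact equations are not given in the slides, so this follows the usual formulation: update gate z merges the forget and input roles, reset gate r controls how much of ht-1 feeds the candidate; shapes illustrative, biases omitted):

```python
import numpy as np

rng = np.random.default_rng(4)
d_in, d_h = 3, 4
Wz, Wr, Wh = (rng.standard_normal((d_h, d_in + d_h)) * 0.1 for _ in range(3))

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(h, x):
    v = np.concatenate([x, h])
    z = sigmoid(Wz @ v)                                # update gate (merged forget + input)
    r = sigmoid(Wr @ v)                                # reset gate
    h_cand = np.tanh(Wh @ np.concatenate([x, r * h]))  # candidate state from reset history
    return (1 - z) * h + z * h_cand                    # blend old and new; no separate cell state

h = np.zeros(d_h)
for x in rng.standard_normal((3, d_in)):
    h = gru_step(h, x)
```

The final line shows the "retain old or jump-start with new" behaviour directly: ht is a per-component convex combination of ht-1 and the candidate.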
Feed-forward vs Recurrent Network
1. A feedforward network does not have an input at each step.
2. A feedforward network has different parameters for each layer.

x f1 a1 f2 a2 f3 a3 f4 y

at = ft(at-1) = σ(Wt at-1 + bt), where t is the layer index

h0 f h1 f h2 f h3 f g y4

x1 x2 x3 x4
at = f(at-1, xt) = σ(Wh at-1 + Wi xt + bi), where t is the time step

We will turn the recurrent network 90 degrees.
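The contrast above can be shown directly in code (illustrative dimensions): the feedforward net applies a different Wt at each layer, while the recurrent net re-uses one (Wh, Wi) pair at every time step and takes a fresh input each step.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 4

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Feedforward: a_t = sigma(W_t a_{t-1} + b_t), a DIFFERENT W_t per layer t
Ws = [rng.standard_normal((d, d)) * 0.1 for _ in range(4)]
a = rng.standard_normal(d)  # the single input x
for Wt in Ws:
    a = sigmoid(Wt @ a)     # no new input enters at later layers

# Recurrent: a_t = sigma(Wh a_{t-1} + Wi x_t + b_i), the SAME weights every step
Wh = rng.standard_normal((d, d)) * 0.1
Wi = rng.standard_normal((d, d)) * 0.1
h = np.zeros(d)
for x in rng.standard_normal((4, d)):  # x1..x4, one new input per step
    h = sigmoid(Wh @ h + Wi @ x)
```

"Turning the network 90 degrees" means reading the recurrent loop vertically, so time steps line up like layers.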


Applications of LSTM / RNN
Neural machine translation

LSTM
Sequence to sequence chat model
Chat with context
M: Hi
M: Hello
U: Hi
M: Hi

M: Hello  U: Hi

Serban, Iulian V., Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau, 2015, “Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models.”
Baidu’s speech recognition using RNN
Attention
Image caption generation using attention
(From CY Lee lecture)
z0 is an initial parameter; it is also learned.
A CNN produces a vector for each region of the image; z0 is matched against each region vector to give a match score (e.g., 0.7).
Image Caption Generation
Word 1

The match scores are normalized into attention weights over the regions (e.g., 0.7, 0.1, 0.1, 0.1, 0.0, 0.0); the weighted sum of the region vectors is fed to the RNN, which attends to a region and produces z1 and Word 1.
Image Caption Generation
Word 2

From z1, new attention weights (e.g., 0.0, 0.8, 0.2, 0.0, 0.0, 0.0) give another weighted sum of the region vectors, which produces z2 and Word 2.
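The attention step used at each word can be sketched as follows. Names are illustrative, and the match function is taken to be a dot product (one common choice; the lecture leaves it unspecified): scores between the query z and each region vector are softmax-normalized into weights like 0.7, 0.1, …, and the context is their weighted sum.

```python
import numpy as np

rng = np.random.default_rng(6)
n_regions, d = 6, 4
regions = rng.standard_normal((n_regions, d))  # one CNN vector per image region
z = rng.standard_normal(d)                      # current query (z0, z1, ...)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

scores = regions @ z       # match(z, region) as a dot product (assumed form)
alphas = softmax(scores)   # attention weights over regions, summing to 1
context = alphas @ regions # weighted sum fed to the RNN to emit the next word
```

At the next step the RNN's new state becomes the query, so the weights shift (0.7, 0.1, … then 0.0, 0.8, …) as the caption is generated word by word.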
Image Caption Generation

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention,” ICML, 2015.
* Possible project?

Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron Courville, “Describing Videos by Exploiting Temporal Structure,” ICCV, 2015.
