
Recurrent Neural Network (RNN)
Time-indexed data points
The time-indexed data points may be:
[1] Equally spaced samples from a continuous real-world process.
Examples include:
● The still images that comprise the frames of videos
● The discrete amplitudes sampled at fixed intervals that comprise audio recordings
● Daily values of a currency exchange rate
● Rainfall measurements on successive days (at a certain location)
[2] Ordinal time steps, with no exact correspondence to durations.
● Natural language (word sequences)
● Nucleotide base pairs in a strand of DNA
Traditional Language Models
RECURRENT NEURAL NETWORK (RNN)
Recurrent: perform the same task for every element of a sequence
Output depends on:
● previous computations, as well as
● new inputs
RNNs have a “memory” of the past!

Apply the same set of weights (U, V, W) recursively.

RNN (FOLDED - UNFOLDED)

Hidden state at time step t:  St = tanh(W·St-1 + U·Xt)
Output at time step t:        Y't = softmax(V·St)
(tanh is the activation function; Xt is the input at time step t)

Unfolded RNN (multiple hidden layers)
Diagram: the network unrolled along two axes: depth (stacked layers) and time (successive steps).
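As a minimal sketch of this unfolded computation in NumPy (the sizes, random weights, and toy sequence below are illustrative assumptions, not taken from the slides):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n_in, n_hid = 8, 16                              # assumed sizes
U = rng.normal(scale=0.1, size=(n_hid, n_in))    # input  -> hidden
W = rng.normal(scale=0.1, size=(n_hid, n_hid))   # hidden -> hidden (recurrent)
V = rng.normal(scale=0.1, size=(n_in, n_hid))    # hidden -> output

s = np.zeros(n_hid)                              # initial hidden state S0
for x_t in rng.normal(size=(5, n_in)):           # unroll over a toy input sequence
    s = np.tanh(W @ s + U @ x_t)                 # St = tanh(W·St-1 + U·Xt)
    y_t = softmax(V @ s)                         # Y't = softmax(V·St)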
Examples of Sequences
Applications of RNN / LSTM:
● image captioning
● sequence classification
● named entity recognition
● translation
CHARACTER-LEVEL LANGUAGE MODEL
One-hot vector inputs for a word sequence
Indices instead of one-hot vectors?
CHARACTER-LEVEL LANGUAGE MODEL (Generative Model)
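A small sketch of the two input encodings contrasted above (the tiny vocabulary is a made-up example):

import numpy as np

vocab = sorted(set("hello"))                 # ['e', 'h', 'l', 'o']
char_to_ix = {c: i for i, c in enumerate(vocab)}

def one_hot(ch):
    v = np.zeros(len(vocab))
    v[char_to_ix[ch]] = 1.0
    return v

print(one_hot('h'))      # [0. 1. 0. 0.]  one-hot vector input
print(char_to_ix['h'])   # 1              index input
# Multiplying U by a one-hot vector just selects one column of U,
# so an integer index can be used as a direct (cheaper) lookup instead.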
Simple and Real RNN
(Number of Parameters)
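A quick parameter count for a simple RNN, assuming an illustrative vocabulary size of 80 (one-hot input and output) and a hidden size of 100; these numbers are assumptions, not the ones used in the slides:

vocab_size, hidden_size = 80, 100
n_U = hidden_size * vocab_size      # input  -> hidden:  8,000
n_W = hidden_size * hidden_size     # hidden -> hidden: 10,000
n_V = vocab_size * hidden_size      # hidden -> output:  8,000
print(n_U + n_W + n_V)              # 26,000 weights (biases not counted)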
Generated Text (trained on Wikipedia)
Generated Text (trained on C source code)
Generated Text
BACKPROPAGATION THROUGH TIME (BPTT)
Calculate the gradients of the error with respect to U, V, and W.

Sum up the gradients at each time step.
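In symbols (a standard way to write this, using the deck's notation, with one error term Et per time step and total error E = Σt Et): dE/dW = Σt dEt/dW, and likewise for U and V.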


BACKPROPAGATION THROUGH TIME (BPTT)
Calculate the gradients by the chain rule.
For example, for V:  dEt/dV = (dEt/dY't) · (dY't/dV)

Remember:
Y't = softmax(V·St)   (softmax fn)
BACKPROPAGATION THROUGH TIME (BPTT)

S1 and S2 depend on W and U too, so the chain rule has to be applied back through the earlier hidden states. For example, at time step 3:
dE3/dW = Σk=0..3 (dE3/dY'3) · (dY'3/dS3) · (dS3/dSk) · (dSk/dW)
BACKPROPAGATION THROUGH TIME (BPTT)
Sum up the gradients at each time step.

Propagation through time (RNN) = propagation through layers (FNN)
Gradients of Some Common Activation Functions
VANISHING GRADIENT PROBLEM
Error gradients pass through the nonlinearity at every step.
Saturation at both ends ==> zero gradient
The gradient vanishes completely after a few time steps.

The tanh derivative ranges from 0 to 1.
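A tiny numerical illustration (scalar toy case; the weight and the random pre-activations are made up): the gradient flowing back k steps picks up a factor of W · tanh'(·) at every step, and since the tanh derivative lies in (0, 1] the product shrinks toward zero.

import numpy as np

rng = np.random.default_rng(0)
w = 0.9                                   # a (scalar) recurrent weight
pre_acts = rng.normal(size=20)            # pre-tanh activations over 20 time steps

grad = 1.0
for a in pre_acts:                        # backpropagate through 20 time steps
    grad *= w * (1.0 - np.tanh(a) ** 2)   # tanh'(a) = 1 - tanh(a)^2, in (0, 1]
print(grad)                               # essentially zero: the gradient has vanished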


From Recurrent Neural Network (RNN) to Long Short-Term Memory (LSTM)
RNN Cell
Diagram: the input Xt (weighted by U) and the feedback ht-1 (weighted by W) are summed to give Ct, which passes through the activation function to give ht.
Use ht as feedback and as output: Y't = softmax(V·ht).
From RNN to LSTM
Diagram: the same RNN cell drawn with the tanh made explicit. Ct = W·ht-1 + U·Xt is passed through tanh to give ht, which is fed back to the next time step and drives the output Y't = softmax(V·ht).
From RNN to LSTM
Use feedback from two inputs:
● Ct-1: the previous cell STATE (before the tanh), carried as the memory
● ht-1: the previous cell OUTPUT (after the tanh)
Diagram: both Ct-1 (memory) and ht-1 (output) are passed from time step t-1 to time step t.
From RNN to LSTM
Attenuate the input and output of the activation function with gates:
● ft: "forget" gate (controls the feedback)
● it: "input" gate (controls the input)
● Ot: "output" gate (controls the output of the tanh)
Diagram: each gate multiplies its signal by an attenuation factor.
From RNN to LSTM
ft, it, and Ot are attenuation factors. All of them are based on:
● the input to the cell (Xt), with parameters [Uf, Ui, UO]
● the output of the previous cell (ht-1), with parameters [Wf, Wi, WO]
(Different parameters (weights) are used for each gate.)
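A sketch of how the three attenuation factors are computed from Xt and ht-1, each with its own weights (sizes, random values, and omitted biases are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_in, n_hid = 8, 16
rng = np.random.default_rng(0)
Uf, Ui, Uo = (rng.normal(scale=0.1, size=(n_hid, n_in)) for _ in range(3))
Wf, Wi, Wo = (rng.normal(scale=0.1, size=(n_hid, n_hid)) for _ in range(3))

x_t = rng.normal(size=n_in)              # input to the cell at time step t
h_prev = np.zeros(n_hid)                 # output of the previous cell, ht-1

f_t = sigmoid(Uf @ x_t + Wf @ h_prev)    # forget gate ft
i_t = sigmoid(Ui @ x_t + Wi @ h_prev)    # input gate it
o_t = sigmoid(Uo @ x_t + Wo @ h_prev)    # output gate Ot
# each gate is a vector of values in (0, 1): per-component attenuation factors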
From RNN to LSTM

The control gates depend on:
● the input Xt
● the previous output ht-1
Their values range from 0 to 1 (sigmoid).

Remember, in the RNN:
St = tanh(W·St-1 + U·Xt)
Y't = softmax(V·St)
LSTM Cell
Diagram of the full LSTM cell.
LSTM Cell
Cell State
The cell state carries the essential information over time.
LSTM Cell
Activation Functions
σ ∈ (0, 1): control gate, something like a switch
tanh ∈ (−1, 1): recurrent nonlinearity
LSTM Cell
Forget Gate
Decide what to forget and what to remember for the new memory.

Sigmoid output 1 ==> remember everything
Sigmoid output 0 ==> forget everything
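In symbols (the standard LSTM forget gate, written with the deck's parameter names): ft = σ(Uf·Xt + Wf·ht-1), and the part of the old memory that is kept is ft ⊙ Ct-1 (component-wise product).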
LSTM Cell
Input Gate
Decide what new information should be added to the new memory.

Modulate the input with it.
Generate the candidate memories.
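In symbols (standard LSTM; the candidate is written C̃t here, and its weights Uc and Wc are named by analogy with the gates, not taken from the slides): it = σ(Ui·Xt + Wi·ht-1), C̃t = tanh(Uc·Xt + Wc·ht-1), and the new information admitted into the memory is it ⊙ C̃t.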
LSTM Cell
Update State
Compute and update the current cell state Ct. It depends on:
● the previous cell state
● what we decide to forget
● what inputs we allow
● the candidate memories
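In symbols (the standard LSTM state update): Ct = ft ⊙ Ct-1 + it ⊙ C̃t, i.e. keep what the forget gate allows of the old state and add what the input gate allows of the candidate memories.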
LSTM Cell
Cell Output
Modulate the output with the output gate.
Does the cell state contain something relevant? ==> Sigmoid output 1
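Putting the four preceding slides together, a minimal NumPy sketch of one LSTM cell step (standard LSTM equations; the sizes, weight names, random values, and omitted biases are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_in, n_hid = 8, 16
rng = np.random.default_rng(0)
# one (U, W) pair per gate, plus one for the candidate memories
Uf, Ui, Uo, Uc = (rng.normal(scale=0.1, size=(n_hid, n_in)) for _ in range(4))
Wf, Wi, Wo, Wc = (rng.normal(scale=0.1, size=(n_hid, n_hid)) for _ in range(4))

def lstm_step(x_t, h_prev, C_prev):
    f_t = sigmoid(Uf @ x_t + Wf @ h_prev)       # forget gate
    i_t = sigmoid(Ui @ x_t + Wi @ h_prev)       # input gate
    o_t = sigmoid(Uo @ x_t + Wo @ h_prev)       # output gate
    C_tilde = np.tanh(Uc @ x_t + Wc @ h_prev)   # candidate memories
    C_t = f_t * C_prev + i_t * C_tilde          # update the cell state
    h_t = o_t * np.tanh(C_t)                    # modulate the output
    return h_t, C_t

h, C = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):          # unroll over a toy sequence
    h, C = lstm_step(x_t, h, C)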
Unrolled LSTM
Diagram: the LSTM cell unrolled over time steps t-1, t, and t+1.
