
CS7015 (Deep Learning) : Lecture 13

Sequence Learning Problems, Recurrent Neural Networks, Backpropagation Through Time (BPTT), Vanishing and Exploding Gradients, Truncated BPTT

Mitesh M. Khapra

Department of Computer Science and Engineering


Indian Institute of Technology Madras

Module : Sequence Learning Problems

In feedforward and convolutional neural networks the size of the input was always fixed
For example, we fed fixed size (32 × 32) images to convolutional neural networks for image classification
Similarly, in word2vec, we fed a fixed window (k) of words to the network
Further, each input to the network was independent of the previous or future inputs
For example, the computations, outputs and decisions for two successive images are completely independent of each other

In many applications the input is not of a fixed size
Further, successive inputs may not be independent of each other
For example, consider the task of auto completion
Given the first character ‘d’ you want to predict the next character ‘e’ and so on
[Figure: a network reading the characters d, e, e, p one at a time and predicting e, e, p, ⟨stop⟩]

Notice a few things
First, successive inputs are no longer independent (while predicting ‘e’ you would want to know what the previous input was in addition to the current input)
Second, the length of the inputs and the number of predictions you need to make is not fixed (for example, “learn”, “deep”, “machine” have different numbers of characters)
Third, each network (orange-blue-green structure) is performing the same task (input : character, output : character)

These are known as sequence learning problems
We need to look at a sequence of (dependent) inputs and produce an output (or outputs)
Each input corresponds to one time step
Let us look at some more examples of such problems

Consider the task of predicting the part of speech tag (noun, adverb, adjective, verb) of each word in a sentence
[Figure: the sentence “man is a social animal” tagged as noun, verb, article, adjective, noun]
Once we see an adjective (social) we are almost sure that the next word should be a noun (man)
Thus the current output (noun) depends on the current input as well as the previous input
Further, the size of the input is not fixed (sentences could have an arbitrary number of words)
Notice that here we are interested in producing an output at each time step
Each network is performing the same task (input : word, output : tag)

Sometimes we may not be interested in producing an output at every stage
Instead we would look at the full sequence and then produce an output
For example, consider the task of predicting the polarity of a movie review
[Figure: the network reads “The movie was boring and long” word by word; the intermediate outputs are don’t-care and the final output is +/−]
The prediction clearly does not depend only on the last word but also on some words which appear before
Here again we could think that the network is performing the same task at each step (input : word, output : +/−) but it’s just that we don’t care about the intermediate outputs

Sequences could be composed of anything (not just words)
For example, a video could be treated as a sequence of images
[Figure: a sequence of frames from a Surya Namaskar video]
We may want to look at the entire sequence and detect the activity being performed

Module : Recurrent Neural Networks

How do we model such tasks involving sequences ?

Wishlist
Account for dependence between inputs
Account for variable number of inputs
Make sure that the function executed at each time step is the same
We will focus on each of these to arrive at a model for dealing with sequences


What is the function being executed at each time step ?

si = σ(U xi + b)
yi = O(V si + c)
i = timestep

Since we want the same function to be executed at each timestep we should share the same network (i.e., the same parameters at each timestep)

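To make this concrete, here is a minimal numpy sketch of one such timestep. The particular choices below are illustrative assumptions, not taken from the slides: σ is taken to be the logistic sigmoid, O a softmax over the output classes, the toy sizes are arbitrary, and vectors are written as row vectors so that the parameter shapes introduced later (U ∈ R^(n×d), V ∈ R^(d×k)) line up.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def timestep(x_i, U, V, b, c):
    """The same function, with the same parameters, applied at every timestep."""
    s_i = sigmoid(x_i @ U + b)   # si = σ(U xi + b), state of size d
    y_i = softmax(s_i @ V + c)   # yi = O(V si + c), distribution over k classes
    return s_i, y_i

n, d, k = 8, 16, 4                               # toy sizes (assumed)
rng = np.random.default_rng(0)
U, V = rng.normal(size=(n, d)), rng.normal(size=(d, k))
b, c = np.zeros(d), np.zeros(k)
xs = [rng.normal(size=n) for _ in range(5)]      # a sequence of 5 inputs
results = [timestep(x, U, V, b, c) for x in xs]  # same network reused at each step
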
This parameter sharing also ensures that the network becomes agnostic to the length (size) of the input
Since we are simply going to compute the same function (with the same parameters) at each timestep, the number of timesteps doesn’t matter
We just create multiple copies of the network and execute them at each timestep
[Figure: the network unrolled over timesteps 1, . . . , n, with the same U and V at every step]

How do we account for dependence between inputs ?
Let us first see an infeasible way of doing this
At each timestep we will feed all the previous inputs to the network
[Figure: four separate networks taking (x1), (x1, x2), (x1, x2, x3) and (x1, x2, x3, x4) as inputs to produce y1, y2, y3, y4]
Is this okay ?
No, it violates the other two items on our wishlist
How ? Let us see

First, the function being computed at each time-step now is different

y1 = f1(x1)
y2 = f2(x1, x2)
y3 = f3(x1, x2, x3)

The network is now sensitive to the length of the sequence
For example, a sequence of length 10 will require f1, . . . , f10 whereas a sequence of length 100 will require f1, . . . , f100

The solution is to add a recurrent connection in the network:

si = σ(U xi + W si−1 + b)
yi = O(V si + c)
or
yi = f(xi, si−1, W, U, V, b, c)

si is the state of the network at timestep i
The parameters are W, U, V, c, b which are shared across timesteps
The same network (and parameters) can be used to compute y1, y2, . . . , y10 or y100

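Continuing the earlier sketch, the recurrence can be unrolled over a whole sequence. Same illustrative assumptions as before (sigmoid for σ, softmax for O, row-vector shapes, arbitrary toy sizes):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def rnn_forward(xs, U, W, V, b, c, s0):
    """Apply the same recurrent cell, with the same parameters, at every timestep."""
    s_prev, states, outputs = s0, [], []
    for x_i in xs:
        s_i = sigmoid(x_i @ U + s_prev @ W + b)  # si = σ(U xi + W si−1 + b)
        y_i = softmax(s_i @ V + c)               # yi = O(V si + c)
        states.append(s_i)
        outputs.append(y_i)
        s_prev = s_i
    return states, outputs

n, d, k = 8, 16, 4
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(n, d)), rng.normal(size=(d, d)), rng.normal(size=(d, k))
b, c, s0 = np.zeros(d), np.zeros(k), np.zeros(d)
# the same parameters handle a sequence of length 10 or of length 100
_, ys_10 = rnn_forward([rng.normal(size=n) for _ in range(10)], U, W, V, b, c, s0)
_, ys_100 = rnn_forward([rng.normal(size=n) for _ in range(100)], U, W, V, b, c, s0)
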
This can be represented more compactly
[Figure: a single cell with input xi, state si carrying a recurrent self-loop labelled W, and output yi]

Let us revisit the sequence learning problems that we saw earlier
We now have recurrent connections between time steps which account for dependence between inputs
[Figure: the auto-completion, part-of-speech tagging, activity-recognition and movie-review examples, redrawn with recurrent connections]

Module : Backpropagation through time

Before proceeding let us look at the dimensions of the parameters carefully

xi ∈ R^n (n-dimensional input)
si ∈ R^d (d-dimensional state)
yi ∈ R^k (say k classes)
U ∈ R^(n×d)
V ∈ R^(d×k)
W ∈ R^(d×d)

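One way to make these shapes line up in code is to treat xi, si−1 and si as row vectors, so that xiU, si−1W and siV have exactly the stated sizes. A small numpy shape check (the values of n, d, k are arbitrary):

import numpy as np

n, d, k = 100, 50, 4            # input dim, state dim, number of output classes
U = np.zeros((n, d))            # U ∈ R^(n×d)
W = np.zeros((d, d))            # W ∈ R^(d×d)
V = np.zeros((d, k))            # V ∈ R^(d×k)
b, c = np.zeros(d), np.zeros(k)

x_i = np.zeros(n)               # xi ∈ R^n
s_prev = np.zeros(d)            # si−1 ∈ R^d

s_i = x_i @ U + s_prev @ W + b  # pre-activation of the state, shape (d,)
y_i = s_i @ V + c               # pre-activation of the output, shape (k,)
assert s_i.shape == (d,) and y_i.shape == (k,)
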
How do we train this network ?
(Ans: using backpropagation)
Let us understand this with a concrete example

Suppose we consider our task of auto-completion (predicting the next character)
For simplicity we assume that there are only 4 characters in our vocabulary (d, e, p, <stop>)
At each timestep we want to predict one of these 4 characters
What is a suitable output function for this task ? (softmax)
What is a suitable loss function for this task ? (cross entropy)

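For one timestep, the output layer therefore produces a softmax distribution over the 4 characters, and the loss is the cross entropy against the true next character. A small sketch (the score values are made up purely for illustration):

import numpy as np

vocab = ["d", "e", "p", "<stop>"]

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def cross_entropy(y_pred, true_idx):
    # -log of the probability the model assigns to the true character
    return -np.log(y_pred[true_idx])

scores = np.array([0.4, 1.9, 0.1, -0.6])        # V si + c for one timestep (made-up values)
y_pred = softmax(scores)                        # distribution over {d, e, p, <stop>}
loss = cross_entropy(y_pred, vocab.index("e"))  # true next character is 'e'
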
Suppose we initialize U, V, W randomly and the network predicts the probabilities as shown
And the true probabilities are as shown

(inputs: d, e, e, p)
            t = 1           t = 2           t = 3           t = 4
        Pred   True     Pred   True     Pred   True     Pred   True
d       0.2    0        0.2    0        0.2    0        0.2    0
e       0.7    1        0.7    1        0.1    0        0.1    0
p       0.1    0        0.1    0        0.7    1        0.7    0
stop    0.1    0        0.1    0        0.1    0        0.1    1

We need to answer two questions
What is the total loss made by the model ?
How do we backpropagate this loss and update the parameters (θ = {U, V, W, b, c}) of the network ?

The total loss is simply the sum of the loss over all time-steps

L(θ) = ∑_{t=1}^{T} Lt(θ)
Lt(θ) = −log(ytc)
ytc = predicted probability of true character at time-step t
T = number of timesteps

For backpropagation we need to compute the gradients w.r.t. W, U, V, b, c
Let us see how to do that

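Plugging in the predicted probabilities of the true characters from the table above (0.7, 0.7, 0.7 and 0.1 for the targets e, e, p, <stop>), the total loss works out as follows:

import numpy as np

# predicted probability of the true character at each timestep (from the table)
y_tc = np.array([0.7, 0.7, 0.7, 0.1])

L_t = -np.log(y_tc)              # Lt(θ) for t = 1, ..., 4
L_total = L_t.sum()              # L(θ) = sum of the per-timestep losses
print(np.round(L_t, 3), round(L_total, 3))   # [0.357 0.357 0.357 2.303] 3.373
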
Let us consider ∂L(θ)/∂V (V is a matrix so ideally we should write ∇V L(θ))

∂L(θ)/∂V = ∑_{t=1}^{T} ∂Lt(θ)/∂V

Each term in the summation is simply the derivative of the loss w.r.t. the weights in the output layer
We have already seen how to do this when we studied backpropagation

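As a reminder of that earlier result (a standard derivation sketched under the softmax/cross-entropy choices assumed above, not reproduced from the slides): the gradient of Lt(θ) with respect to the pre-activation V st + c is the predicted distribution minus the one-hot true distribution, so each term of the sum is an outer product with the state st.

import numpy as np

d, k = 5, 4
rng = np.random.default_rng(0)
s_t = rng.normal(size=d)                  # state at timestep t (illustrative values)
y_hat = np.array([0.2, 0.7, 0.1, 0.1])    # predicted distribution at timestep t
e_true = np.array([0.0, 1.0, 0.0, 0.0])   # one-hot encoding of the true character

dLt_dV = np.outer(s_t, y_hat - e_true)    # shape (d, k), same as V ∈ R^(d×k)
dLt_dc = y_hat - e_true                   # gradient w.r.t. the output bias c
# summing dLt_dV over t = 1, ..., T gives ∂L(θ)/∂V
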
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 13
∂L (θ)
Let us consider the derivative ∂W
L1 (θ) L2 (θ) L3 (θ) L4 (θ)

∂L (θ) ∑ ∂Lt (θ)


y1 y2 y3 y4 T
PredictedTrue Predicted
True Predicted
True Predicted
True
d 0.2 0 0.2 0 0.2 0 0.2 0 =
e 0.7
p 0.1
1 0.7 1 0.1 0 0.1 0 ∂W ∂W
0 0.1 0 0.7 1 0.7 1 t=1
stop 0.1 0 0.1 0 0.1 0 0.1 0
By the chain rule of derivatives we
V V V V know that ∂L t (θ)
∂W is obtained by sum-
W W W ming gradients along all the paths
from Lt (θ) to W
U U U U What are the paths connecting Lt (θ)
to W ?
Let us see this by considering L4 (θ)
d e e e

. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . 27/1
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 13
L4(θ) depends on s4
s4 in turn depends on s3 and W
s3 in turn depends on s2 and W
s2 in turn depends on s1 and W
s1 in turn depends on s0 and W, where s0 is a constant starting state

s0 → s1 → s2 → s3 → s4 → L4(θ)

What we have here is an ordered network
In an ordered network each state variable is computed one at a time in a specified order (first s1, then s2 and so on)
Now we have

  ∂L4(θ)/∂W = (∂L4(θ)/∂s4)(∂s4/∂W)

We have already seen how to compute ∂L4(θ)/∂s4 when we studied backprop
But how do we compute ∂s4/∂W?
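To make the ordered computation concrete, below is a minimal NumPy sketch of the forward pass of the unrolled network above (the names rnn_forward, U, W, V, b are illustrative, not from the lecture, and the later slides drop the U xt input term for simplicity). Each st is computed from st−1 in a fixed order, which is exactly why ∂s4/∂W cannot treat s3 as a constant.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rnn_forward(X, U, W, V, b, s0):
    """Ordered forward pass: s_t = sigmoid(U x_t + W s_{t-1} + b), o_t = V s_t."""
    states, outputs = [s0], []
    for x_t in X:                      # states are computed one at a time, in order
        s_t = sigmoid(U @ x_t + W @ states[-1] + b)
        states.append(s_t)
        outputs.append(V @ s_t)        # o_t would be fed to the loss L_t
    return states, outputs             # states[k] is s_k; states[0] is the constant s0

# toy sizes: 4 time steps, x_t in R^3, s_t in R^5, o_t in R^2
rng = np.random.default_rng(0)
X = [rng.standard_normal(3) for _ in range(4)]
U, W, V = rng.standard_normal((5, 3)), rng.standard_normal((5, 5)), rng.standard_normal((2, 5))
b, s0 = np.zeros(5), np.zeros(5)
states, outputs = rnn_forward(X, U, W, V, b, s0)
```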
Recall that

  s4 = σ(W s3 + b)

In such an ordered network, we can't compute ∂s4/∂W by simply treating s3 as a constant (because s3 also depends on W)
In such networks the total derivative ∂s4/∂W has two parts

  Explicit: ∂⁺s4/∂W, treating all other inputs (here s3) as constant
  Implicit: summing over all indirect paths from s4 to W

Let us see how to do this
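As a one-dimensional illustration (not from the slides): take scalar states s1 = σ(w s0 + b) and s2 = σ(w s1 + b), with s0 a constant and aj = w sj−1 + b. Then the total derivative is

  ds2/dw = σ′(a2) s1 + σ′(a2) w (ds1/dw)
         = ∂⁺s2/∂w + (∂s2/∂s1)(ds1/dw)

where the first term is the explicit part (s1 held fixed) and the second is the implicit part flowing through s1, with ds1/dw = σ′(a1) s0. The next slide carries out the same expansion for the vector case.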
∂s4/∂W = ∂⁺s4/∂W + (∂s4/∂s3)(∂s3/∂W)
         (explicit)     (implicit)

       = ∂⁺s4/∂W + (∂s4/∂s3)[∂⁺s3/∂W + (∂s3/∂s2)(∂s2/∂W)]
                             (explicit)     (implicit)

       = ∂⁺s4/∂W + (∂s4/∂s3)(∂⁺s3/∂W) + (∂s4/∂s3)(∂s3/∂s2)[∂⁺s2/∂W + (∂s2/∂s1)(∂s1/∂W)]

       = ∂⁺s4/∂W + (∂s4/∂s3)(∂⁺s3/∂W) + (∂s4/∂s3)(∂s3/∂s2)(∂⁺s2/∂W) + (∂s4/∂s3)(∂s3/∂s2)(∂s2/∂s1)(∂⁺s1/∂W)

For simplicity we will short-circuit some of the paths:

∂s4/∂W = (∂s4/∂s4)(∂⁺s4/∂W) + (∂s4/∂s3)(∂⁺s3/∂W) + (∂s4/∂s2)(∂⁺s2/∂W) + (∂s4/∂s1)(∂⁺s1/∂W)
       = ∑_{k=1}^{4} (∂s4/∂sk)(∂⁺sk/∂W)
Finally we have

  ∂L4(θ)/∂W = (∂L4(θ)/∂s4)(∂s4/∂W)

  ∂s4/∂W = ∑_{k=1}^{4} (∂s4/∂sk)(∂⁺sk/∂W)

  ∴ ∂Lt(θ)/∂W = ∑_{k=1}^{t} (∂Lt(θ)/∂st)(∂st/∂sk)(∂⁺sk/∂W)

This algorithm is called backpropagation through time (BPTT) as we backpropagate over all previous time steps
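The sketch below shows how this sum is typically accumulated in code, one k at a time, working backwards from t. It is a minimal NumPy illustration, not the lecture's reference implementation: it assumes the simplified recurrence sk = σ(W sk−1 + b) used on these slides and a squared-error loss Lt = ½‖st − y‖², and the names (forward, bptt_grad_W) are made up. A finite-difference check confirms the analytic gradient.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(W, b, s0, T):
    """Simplified recurrence from the slides: s_k = sigmoid(W s_{k-1} + b)."""
    states = [s0]
    for _ in range(T):
        states.append(sigmoid(W @ states[-1] + b))
    return states

def loss(s_t, y):
    return 0.5 * np.sum((s_t - y) ** 2)            # assumed loss L_t(theta)

def bptt_grad_W(W, b, s0, y, T):
    """dL_T/dW = sum_k (dL_T/ds_T)(ds_T/ds_k)(d+ s_k/dW), accumulated backwards over k."""
    states = forward(W, b, s0, T)
    dL_dsk = states[T] - y                         # dL_T/ds_T
    grad_W = np.zeros_like(W)
    for k in range(T, 0, -1):
        g = dL_dsk * states[k] * (1 - states[k])   # dL_T/da_k, using sigma'(a_k) = s_k(1 - s_k)
        grad_W += np.outer(g, states[k - 1])       # (dL_T/ds_k)(d+ s_k/dW)
        dL_dsk = W.T @ g                           # dL_T/ds_{k-1} = (dL_T/ds_k) diag(sigma'(a_k)) W
    return grad_W

# finite-difference check
rng = np.random.default_rng(0)
d, T, eps = 5, 4, 1e-6
W = 0.5 * rng.standard_normal((d, d))
b, s0, y = np.zeros(d), rng.standard_normal(d), rng.standard_normal(d)
analytic = bptt_grad_W(W, b, s0, y, T)
numeric = np.zeros_like(W)
for q in range(d):
    for r in range(d):
        Wp, Wm = W.copy(), W.copy()
        Wp[q, r] += eps
        Wm[q, r] -= eps
        numeric[q, r] = (loss(forward(Wp, b, s0, T)[T], y) - loss(forward(Wm, b, s0, T)[T], y)) / (2 * eps)
print(np.max(np.abs(analytic - numeric)))          # should be tiny, around 1e-9
```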
Module : The problem of Exploding and Vanishing Gradients
We will now focus on ∂st/∂sk and highlight an important problem in training RNNs using BPTT

  ∂st/∂sk = (∂st/∂st−1)(∂st−1/∂st−2) · · · (∂sk+1/∂sk)
          = ∏_{j=k}^{t−1} ∂sj+1/∂sj

Let us look at one such term in the product (i.e., ∂sj+1/∂sj)
We are interested in ∂sj/∂sj−1, where

  aj = W sj−1 + b
  sj = σ(aj)

i.e., aj = [aj1, aj2, . . . , ajd] and sj = [σ(aj1), σ(aj2), . . . , σ(ajd)]

  ∂sj/∂sj−1 = (∂sj/∂aj)(∂aj/∂sj−1)

The first factor is the Jacobian of the element-wise non-linearity; since sjp = σ(ajp) depends only on ajp, it is diagonal:

  ∂sj/∂aj = [ σ′(aj1)    0      · · ·    0      ]
            [    0    σ′(aj2)   · · ·    0      ]
            [    ·       ·        ·      ·      ]
            [    0       0      · · ·  σ′(ajd)  ]

          = diag(σ′(aj))

The second factor is ∂aj/∂sj−1 = W (from aj = W sj−1 + b), so

  ∂sj/∂sj−1 = diag(σ′(aj)) W

We are interested in the magnitude of ∂sj/∂sj−1: if it is small (large), then ∂st/∂sk, and hence ∂Lt(θ)/∂W, will vanish (explode)
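As a quick numerical sanity check of this Jacobian formula, here is a small NumPy sketch (dimensions and values are made up for illustration) that compares diag(σ′(aj))W against a finite-difference Jacobian of sj = σ(W sj−1 + b):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(1)
d = 4
W, b = rng.standard_normal((d, d)), rng.standard_normal(d)
s_prev = rng.standard_normal(d)                 # plays the role of s_{j-1}

a = W @ s_prev + b
analytic = np.diag(sigmoid(a) * (1 - sigmoid(a))) @ W   # diag(sigma'(a_j)) W

# finite-difference Jacobian of s_j = sigmoid(W s_{j-1} + b) w.r.t. s_{j-1}
eps = 1e-6
numeric = np.zeros((d, d))
for i in range(d):
    e = np.zeros(d)
    e[i] = eps
    numeric[:, i] = (sigmoid(W @ (s_prev + e) + b) - sigmoid(W @ (s_prev - e) + b)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))       # should be tiny, around 1e-10
```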

∥∂sj/∂sj−1∥ = ∥diag(σ′(aj)) W∥
            ≤ ∥diag(σ′(aj))∥ ∥W∥

∵ σ(aj) is a bounded function (sigmoid, tanh), σ′(aj) is bounded:

  σ′(aj) ≤ 1/4 = γ   [if σ is the logistic function]
         ≤ 1   = γ   [if σ is tanh]

so, with λ a bound on ∥W∥,

  ∥∂sj/∂sj−1∥ ≤ γ ∥W∥ ≤ γλ

  ∥∂st/∂sk∥ = ∥ ∏_{j=k+1}^{t} ∂sj/∂sj−1 ∥ ≤ ∏_{j=k+1}^{t} γλ ≤ (γλ)^{t−k}

If γλ < 1 the gradient will vanish
If γλ > 1 the gradient could explode
This is known as the problem of vanishing/exploding gradients
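The sketch below (an illustration with made-up dimensions, not from the lecture) multiplies the Jacobians diag(σ′(aj))W along a trajectory of a sigmoid RNN and prints ∥∂st/∂sk∥ next to the bound (γλ)^{t−k} with γ = 1/4. Note that this is only an upper bound: a large ∥W∥ permits explosion but does not force it, since saturation can still make σ′(aj) small.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(2)
d, T, gamma = 10, 30, 0.25                      # gamma = 1/4 for the logistic sigmoid

for scale in (0.5, 5.0):                        # small vs. large ||W||
    W = scale * rng.standard_normal((d, d)) / np.sqrt(d)
    lam = np.linalg.norm(W, 2)                  # ||W|| (spectral norm)
    b, s = np.zeros(d), rng.standard_normal(d)
    J = np.eye(d)                               # running product of ds_{j+1}/ds_j
    print(f"||W|| = {lam:.2f}, gamma*lambda = {gamma * lam:.2f}")
    for step in range(1, T + 1):
        s = sigmoid(W @ s + b)
        J = np.diag(s * (1 - s)) @ W @ J        # multiply in diag(sigma'(a_j)) W
        if step % 10 == 0:
            print(f"  t-k={step:2d}  ||ds_t/ds_k||={np.linalg.norm(J, 2):.3e}"
                  f"  bound={(gamma * lam) ** step:.3e}")
```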
One simple way of avoiding this is to use truncated backpropagation through time, where we restrict the product to at most τ (< t − k) terms

[Figure: the unrolled RNN with inputs x1 . . . xn fed in through u, hidden states chained through w, outputs y1 . . . yn through v, and the loss Lt; gradients are propagated back only a limited number of steps]
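A minimal sketch of the truncated variant, under the same illustrative assumptions as the earlier BPTT sketch (simplified recurrence, squared-error loss, made-up names): the only change is that the backward loop over k stops after τ steps instead of running all the way back to k = 1.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def truncated_bptt_grad_W(W, b, s0, y, T, tau):
    """Like full BPTT, but the backward sum over k is cut off after tau steps."""
    states = [s0]
    for _ in range(T):                              # forward pass: s_k = sigmoid(W s_{k-1} + b)
        states.append(sigmoid(W @ states[-1] + b))
    dL_dsk = states[T] - y                          # dL_T/ds_T for L_T = 0.5 ||s_T - y||^2
    grad_W = np.zeros_like(W)
    for k in range(T, max(T - tau, 0), -1):         # only the last tau time steps
        g = dL_dsk * states[k] * (1 - states[k])
        grad_W += np.outer(g, states[k - 1])
        dL_dsk = W.T @ g
    return grad_W

rng = np.random.default_rng(4)
d = 5
W = 0.5 * rng.standard_normal((d, d))
b, s0, y = np.zeros(d), rng.standard_normal(d), rng.standard_normal(d)
print(truncated_bptt_grad_W(W, b, s0, y, T=20, tau=5))
```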
Module : Some Gory Details
  ∂Lt(θ)/∂W = ∑_{k=1}^{t} (∂Lt(θ)/∂st)(∂st/∂sk)(∂⁺sk/∂W)

where ∂Lt(θ)/∂W ∈ R^{d×d}, ∂Lt(θ)/∂st ∈ R^{1×d}, ∂st/∂sk ∈ R^{d×d} and ∂⁺sk/∂W ∈ R^{d×d×d}

We know how to compute ∂Lt(θ)/∂st (the derivative of the scalar Lt(θ) w.r.t. the last hidden state, a vector) using backpropagation
We just saw a formula for ∂st/∂sk, which is the derivative of a vector w.r.t. a vector
∂⁺sk/∂W is a tensor ∈ R^{d×d×d}: the derivative of a vector ∈ R^d w.r.t. a matrix ∈ R^{d×d}
How do we compute ∂⁺sk/∂W? Let us see
We just look at one element of this ∂⁺sk/∂W tensor
∂⁺skp/∂Wqr is the (p, q, r)-th element of the 3d tensor, where

  ak = W sk−1 + b
  sk = σ(ak)
ak = W sk−1 (dropping the bias b, which does not depend on W)

  [ak1]   [W11 W12 · · · W1d] [sk−1,1]
  [ ·  ]  [ ·    ·         · ] [  ·   ]
  [akp] = [Wp1 Wp2 · · · Wpd] [sk−1,p]
  [ ·  ]  [ ·    ·         · ] [  ·   ]
  [akd]   [Wd1 Wd2 · · · Wdd] [sk−1,d]

  akp = ∑_{i=1}^{d} Wpi sk−1,i
  skp = σ(akp)

  ∂skp/∂Wqr = (∂skp/∂akp)(∂akp/∂Wqr) = σ′(akp) (∂akp/∂Wqr)

  ∂akp/∂Wqr = ∂(∑_{i=1}^{d} Wpi sk−1,i)/∂Wqr
            = sk−1,r   if p = q (only the i = r term involves Wqr)
            = 0        otherwise

  ∴ ∂skp/∂Wqr = σ′(akp) sk−1,r   if p = q
              = 0                otherwise
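To check this element-wise formula numerically, the following NumPy sketch (made-up dimensions; sk−1 is treated as a constant here because this is the explicit partial ∂⁺sk/∂W) builds the d×d×d tensor from σ′(akp) sk−1,r and compares it with finite differences over the entries of W:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(3)
d = 4
W, b = rng.standard_normal((d, d)), rng.standard_normal(d)
s_prev = rng.standard_normal(d)                 # s_{k-1}, held constant (explicit partial)

a = W @ s_prev + b
s = sigmoid(a)

# analytic (p, q, r)-th element: sigma'(a_kp) * s_{k-1,r} if p == q, else 0
analytic = np.zeros((d, d, d))
for p in range(d):
    analytic[p, p, :] = s[p] * (1 - s[p]) * s_prev

# finite-difference check of the explicit partial d+ s_k / dW
eps = 1e-6
numeric = np.zeros((d, d, d))
for q in range(d):
    for r in range(d):
        Wp, Wm = W.copy(), W.copy()
        Wp[q, r] += eps
        Wm[q, r] -= eps
        numeric[:, q, r] = (sigmoid(Wp @ s_prev + b) - sigmoid(Wm @ s_prev + b)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))       # should be tiny, around 1e-10
```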
