
CS7015 (Deep Learning): Lecture 4

Feedforward Neural Networks, Backpropagation

Mitesh M. Khapra

Department of Computer Science and Engineering


Indian Institute of Technology Madras

References/Acknowledgments
See the excellent videos by Hugo Larochelle on Backpropagation

Module 4.1: Feedforward Neural Networks (a.k.a. multilayered
network of neurons)

• The input to the network is an n-dimensional vector
• The network contains L − 1 hidden layers (2, in this case) having n neurons each
• Finally, there is one output layer containing k neurons (say, corresponding to k classes)
• Each neuron in the hidden layer and output layer can be split into two parts: pre-activation and activation (ai and hi are vectors)
• The input layer can be called the 0-th layer and the output layer can be called the L-th layer
• Wi ∈ Rn×n and bi ∈ Rn are the weight and bias between layers i − 1 and i (0 < i < L)
• WL ∈ Rn×k and bL ∈ Rk are the weight and bias between the last hidden layer and the output layer (L = 3 in this case)

[Figure: a feedforward network with inputs x1, x2, ..., xn, hidden layers with pre-activations/activations (a1, h1) and (a2, h2), the output layer (a3, hL = ŷ = f(x)), and parameters (W1, b1), (W2, b2), (W3, b3)]
• The pre-activation at layer i is given by

  ai(x) = bi + Wi hi−1(x)

• The activation at layer i is given by

  hi(x) = g(ai(x))

  where g is called the activation function (for example, logistic, tanh, linear, etc.)
• The activation at the output layer is given by

  f(x) = hL(x) = O(aL(x))

  where O is the output activation function (for example, softmax, linear, etc.)
• To simplify notation we will refer to ai(x) as ai and hi(x) as hi, so that

  ai = bi + Wi hi−1,   hi = g(ai),   f(x) = hL = O(aL)
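To make these equations concrete, here is a minimal NumPy sketch of the forward pass (an illustration, not from the lecture): the logistic function for g and softmax for O are assumed choices, and the weight matrices are stored so that Wi @ h works directly (i.e. the output weight matrix is stored as k × n).

    import numpy as np

    def logistic(a):                        # one possible choice for g
        return 1.0 / (1.0 + np.exp(-a))

    def softmax(a):                         # one possible choice for O
        e = np.exp(a - np.max(a))           # shift by max for numerical stability
        return e / e.sum()

    def forward(x, weights, biases):
        # weights = [W1, ..., WL], biases = [b1, ..., bL]; h0 = x is the 0-th layer
        h = x
        for W, b in zip(weights[:-1], biases[:-1]):
            a = b + W @ h                   # pre-activation: a_i = b_i + W_i h_{i-1}
            h = logistic(a)                 # activation:     h_i = g(a_i)
        aL = biases[-1] + weights[-1] @ h   # output pre-activation a_L
        return softmax(aL)                  # f(x) = h_L = O(a_L)

    # Example: n = 4 inputs, two hidden layers of 4 neurons, k = 3 classes (L = 3)
    rng = np.random.default_rng(0)
    Ws = [rng.standard_normal((4, 4)), rng.standard_normal((4, 4)), rng.standard_normal((3, 4))]
    bs = [np.zeros(4), np.zeros(4), np.zeros(3)]
    yhat = forward(rng.standard_normal(4), Ws, bs)
    print(yhat, yhat.sum())                 # a probability distribution over 3 classes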
• Data: {xi, yi}, i = 1, ..., N
• Model:

  ŷi = f(xi) = O(W3 g(W2 g(W1 xi + b1) + b2) + b3)

• Parameters: θ = W1, ..., WL, b1, b2, ..., bL (L = 3)
• Algorithm: Gradient Descent with Back-propagation (we will see soon)
• Objective/Loss/Error function: Say,

  min (1/N) Σ_{i=1}^{N} Σ_{j=1}^{k} (ŷij − yij)²

  In general, min L(θ), where L(θ) is some function of the parameters
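As a small illustration (not from the slides), the squared error objective above can be computed as follows, assuming NumPy arrays Y_hat and Y of shape N × k holding the predictions and the true outputs:

    import numpy as np

    def squared_error_loss(Y_hat, Y):
        # L(theta) = (1/N) * sum_i sum_j (yhat_ij - y_ij)^2
        N = Y.shape[0]
        return np.sum((Y_hat - Y) ** 2) / N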
Module 4.2: Learning Parameters of Feedforward Neural Networks
(Intuition)

The story so far...
• We have introduced feedforward neural networks
• We are now interested in finding an algorithm for learning the parameters of this model

• Recall our gradient descent algorithm

  Algorithm: gradient_descent()
  t ← 0;
  max_iterations ← 1000;
  Initialize w0, b0;
  while t++ < max_iterations do
      wt+1 ← wt − η∇wt;
      bt+1 ← bt − η∇bt;
  end

• We can write it more concisely as

  Algorithm: gradient_descent()
  t ← 0;
  max_iterations ← 1000;
  Initialize θ0 = [w0, b0];
  while t++ < max_iterations do
      θt+1 ← θt − η∇θt;
  end

  where ∇θt = [∂L(θ)/∂wt, ∂L(θ)/∂bt]ᵀ
• Now, in this feedforward neural network, instead of θ = [w, b] we have θ = [W1, W2, ..., WL, b1, b2, ..., bL]
• We can still use the same algorithm for learning: initialize θ0 = [W1⁰, ..., WL⁰, b1⁰, ..., bL⁰] and use

  ∇θt = [∂L(θ)/∂W1,t, ..., ∂L(θ)/∂WL,t, ∂L(θ)/∂b1,t, ..., ∂L(θ)/∂bL,t]ᵀ
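The same update, as a Python sketch (an illustration, not the lecture's code): theta is simply a list holding all weight matrices and bias vectors, grad is a placeholder for the backpropagation routine derived later, and eta is an assumed learning rate.

    def gradient_descent(theta, grad, eta=0.1, max_iterations=1000):
        # theta = [W1, ..., WL, b1, ..., bL]
        # grad(theta) returns the matching list [dW1, ..., dWL, db1, ..., dbL]
        for t in range(max_iterations):
            gradients = grad(theta)
            theta = [p - eta * g for p, g in zip(theta, gradients)]  # theta_{t+1} = theta_t - eta * grad_theta_t
        return theta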


• Except that now our ∇θ looks much more nasty:

  ∇θ = [ ∂L(θ)/∂W111 ... ∂L(θ)/∂W11n   ∂L(θ)/∂W211 ... ∂L(θ)/∂W21n   ...   ∂L(θ)/∂WL,11 ... ∂L(θ)/∂WL,1k   ∂L(θ)/∂b11 ... ∂L(θ)/∂bL1 ]
       [ ∂L(θ)/∂W121 ... ∂L(θ)/∂W12n   ∂L(θ)/∂W221 ... ∂L(θ)/∂W22n   ...   ∂L(θ)/∂WL,21 ... ∂L(θ)/∂WL,2k   ∂L(θ)/∂b12 ... ∂L(θ)/∂bL2 ]
       [      ...             ...            ...             ...      ...        ...             ...             ...            ...    ]
       [ ∂L(θ)/∂W1n1 ... ∂L(θ)/∂W1nn   ∂L(θ)/∂W2n1 ... ∂L(θ)/∂W2nn   ...   ∂L(θ)/∂WL,n1 ... ∂L(θ)/∂WL,nk   ∂L(θ)/∂b1n ... ∂L(θ)/∂bLk ]

• ∇θ is thus composed of
  ∇W1, ∇W2, ..., ∇WL−1 ∈ Rn×n, ∇WL ∈ Rn×k,
  ∇b1, ∇b2, ..., ∇bL−1 ∈ Rn and ∇bL ∈ Rk
We need to answer two questions
• How to choose the loss function L(θ)?
• How to compute ∇θ which is composed of
  ∇W1, ∇W2, ..., ∇WL−1 ∈ Rn×n, ∇WL ∈ Rn×k,
  ∇b1, ∇b2, ..., ∇bL−1 ∈ Rn and ∇bL ∈ Rk?
Module 4.3: Output Functions and Loss Functions

We need to answer two questions
• How to choose the loss function L(θ)?
• How to compute ∇θ which is composed of:
  ∇W1, ∇W2, ..., ∇WL−1 ∈ Rn×n, ∇WL ∈ Rn×k,
  ∇b1, ∇b2, ..., ∇bL−1 ∈ Rn and ∇bL ∈ Rk?
• The choice of loss function depends on the problem at hand
• We will illustrate this with the help of two examples
• Consider our movie example again but this time we are interested in predicting ratings

  [Figure: a neural network with L − 1 hidden layers taking the movie features xi (isActor Damon, isDirector Nolan, ...) as input and predicting yi = {7.5, 8.2, 7.7}, the imdb, Critics and RT ratings]

• Here yi ∈ R3
• The loss function should capture how much ŷi deviates from yi
• If yi ∈ Rn then the squared error loss can capture this deviation

  L(θ) = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{3} (ŷij − yij)²
• A related question: What should the output function ‘O’ be if yi ∈ R?
• More specifically, can it be the logistic function?
• No, because it restricts ŷi to a value between 0 & 1 but we want ŷi ∈ R
• So, in such cases it makes sense to have ‘O’ as a linear function

  f(x) = hL = O(aL) = WO aL + bO

• ŷi = f(xi) is no longer bounded between 0 and 1
• Now let us consider another problem for which a different loss function would be appropriate
• Suppose we want to classify an image into 1 of k classes

  [Figure: a neural network with L − 1 hidden layers taking an image as input and producing y = [1 0 0 0] over the classes Apple, Mango, Orange, Banana]

• Here again we could use the squared error loss to capture the deviation
• But can you think of a better function?
• Notice that y is a probability distribution
• Therefore we should also ensure that ŷ is a probability distribution
• What choice of the output activation ‘O’ will ensure this?

  aL = WL hL−1 + bL
  ŷj = O(aL)j = exp(aL,j) / Σ_{i=1}^{k} exp(aL,i)

  O(aL)j is the j-th element of ŷ and aL,j is the j-th element of the vector aL.
• This function is called the softmax function
• Now that we have ensured that both y & ŷ are probability distributions, can you think of a function which captures the difference between them?
• Cross-entropy:

  L(θ) = − Σ_{c=1}^{k} yc log ŷc

• Notice that

  yc = 1 if c = ℓ (the true class label)
     = 0 otherwise
  ∴ L(θ) = − log ŷℓ
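A minimal NumPy sketch of the softmax output and the cross-entropy loss just described (an illustration; the max-subtraction inside softmax is a standard numerical-stability trick, not something the slides discuss):

    import numpy as np

    def softmax(aL):
        e = np.exp(aL - np.max(aL))          # subtracting the max leaves the result unchanged
        return e / e.sum()                   # yhat_j = exp(aL_j) / sum_i exp(aL_i)

    def cross_entropy(y_hat, label):
        # L(theta) = -sum_c y_c log(yhat_c) = -log(yhat_l) when y is one-hot with true class l
        return -np.log(y_hat[label])

    aL = np.array([2.0, 1.0, 0.1, -1.0])     # pre-activations for k = 4 classes
    y_hat = softmax(aL)
    print(y_hat.sum(), cross_entropy(y_hat, label=0))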
• So, for a classification problem (where you have to choose 1 of k classes), we use the following objective function:

  minimize_θ L(θ) = − log ŷℓ
  or maximize_θ −L(θ) = log ŷℓ

• But wait! Is ŷℓ a function of θ = [W1, W2, ..., WL, b1, b2, ..., bL]?
• Yes, it is indeed a function of θ:

  ŷℓ = [O(W3 g(W2 g(W1 x + b1) + b2) + b3)]ℓ

• What does ŷℓ encode?
• It is the probability that x belongs to the ℓ-th class (bring it as close to 1).
• log ŷℓ is called the log-likelihood of the data.
Outputs:

                      Real Values      Probabilities
  Output Activation   Linear           Softmax
  Loss Function       Squared Error    Cross Entropy

• Of course, there could be other loss functions depending on the problem at hand but the two loss functions that we just saw are encountered very often
• For the rest of this lecture we will focus on the case where the output activation is a softmax function and the loss function is cross entropy
Module 4.4: Backpropagation (Intuition)

We need to answer two questions
• How to choose the loss function L(θ)?
• How to compute ∇θ which is composed of:
  ∇W1, ∇W2, ..., ∇WL−1 ∈ Rn×n, ∇WL ∈ Rn×k,
  ∇b1, ∇b2, ..., ∇bL−1 ∈ Rn and ∇bL ∈ Rk?
• Let us focus on this one weight (W112).

  [Figure: the network with the first-layer weight W112 highlighted, shown next to the gradient_descent() algorithm from before]

• To learn this weight using SGD we need a formula for ∂L(θ)/∂W112.
• We will see how to calculate this.
• First let us take the simple case when we have a deep but thin network.

  [Figure: a thin network x1 → (a11, h11) → (a21, h21) → aL1 → ŷ = f(x) → L(θ), with weights W111, W211, WL11]

• In this case it is easy to find the derivative by chain rule.

  ∂L(θ)/∂W111 = (∂L(θ)/∂ŷ)(∂ŷ/∂aL1)(∂aL1/∂h21)(∂h21/∂a21)(∂a21/∂h11)(∂h11/∂a11)(∂a11/∂W111)

  ∂L(θ)/∂W111 = (∂L(θ)/∂h11)(∂h11/∂W111)   (just compressing the chain rule)
  ∂L(θ)/∂W211 = (∂L(θ)/∂h21)(∂h21/∂W211)
  ∂L(θ)/∂WL11 = (∂L(θ)/∂aL1)(∂aL1/∂WL11)
Let us see an intuitive explanation of backpropagation before we get into the
mathematical details

• We get a certain loss at the output and we try to figure out who is responsible for this loss
• So, we talk to the output layer and say “Hey! You are not producing the desired output, better take responsibility".
• The output layer says “Well, I take responsibility for my part but please understand that I am only as good as the hidden layer and weights below me". After all ...

  f(x) = ŷ = O(WL hL−1 + bL)
• So, we talk to WL, bL and hL and ask them “What is wrong with you?"
• WL and bL take full responsibility but hL says “Well, please understand that I am only as good as the pre-activation layer"
• The pre-activation layer in turn says that I am only as good as the hidden layer and weights below me.
• We continue in this manner and realize that the responsibility lies with all the weights and biases (i.e. all the parameters of the model)
• But instead of talking to them directly, it is easier to talk to them through the hidden layers and output layers (and this is exactly what the chain rule allows us to do)

  ∂L(θ)/∂W111 = [∂L(θ)/∂ŷ · ∂ŷ/∂a3] · [∂a3/∂h2 · ∂h2/∂a2] · [∂a2/∂h1 · ∂h1/∂a1] · [∂a1/∂W111]

  (The left-hand side talks to the weight directly; on the right we first talk to the output layer, then to the previous hidden layer, then to the hidden layer before that, and finally to the weights.)
Quantities of interest (roadmap for the remaining part):
• Gradient w.r.t. output units
• Gradient w.r.t. hidden units
• Gradient w.r.t. weights and biases

  ∂L(θ)/∂W111 = [∂L(θ)/∂ŷ · ∂ŷ/∂a3] · [∂a3/∂h2 · ∂h2/∂a2] · [∂a2/∂h1 · ∂h1/∂a1] · [∂a1/∂W111]

• Our focus is on Cross entropy loss and Softmax output.
Module 4.5: Backpropagation: Computing Gradients w.r.t. the
Output Units

Quantities of interest (roadmap for the remaining part):
• Gradient w.r.t. output units
• Gradient w.r.t. hidden units
• Gradient w.r.t. weights
• Our focus is on Cross entropy loss and Softmax output.
Let us first consider the partial derivative w.r.t. the i-th output.

  L(θ) = − log ŷℓ   (ℓ = true class label)

  ∂/∂ŷi (L(θ)) = ∂/∂ŷi (− log ŷℓ)
               = −1/ŷℓ   if i = ℓ
               = 0       otherwise

More compactly,

  ∂/∂ŷi (L(θ)) = − 1(i=ℓ) / ŷℓ
We can now talk about the gradient w.r.t. the vector ŷ:

  ∇ŷ L(θ) = [∂L(θ)/∂ŷ1, ..., ∂L(θ)/∂ŷk]ᵀ
          = −(1/ŷℓ) [1(ℓ=1), 1(ℓ=2), ..., 1(ℓ=k)]ᵀ
          = −(1/ŷℓ) e(ℓ)

where e(ℓ) is a k-dimensional vector whose ℓ-th element is 1 and all other elements are 0.
What we are actually interested in is

  ∂L(θ)/∂aLi = ∂(− log ŷℓ)/∂aLi = ∂(− log ŷℓ)/∂ŷℓ · ∂ŷℓ/∂aLi

Does ŷℓ depend on aLi? Indeed, it does:

  ŷℓ = exp(aLℓ) / Σ_i exp(aLi)

Having established this, we will now derive the full expression on the next slide.

Using the quotient rule ∂/∂x (g(x)/h(x)) = (∂g(x)/∂x)(1/h(x)) − (g(x)/h(x)²)(∂h(x)/∂x):

  ∂/∂aLi (− log ŷℓ)
    = −(1/ŷℓ) ∂ŷℓ/∂aLi
    = −(1/ŷℓ) ∂/∂aLi softmax(aL)ℓ
    = −(1/ŷℓ) ∂/∂aLi [ exp(aL)ℓ / Σ_{i′} exp(aL)i′ ]
    = −(1/ŷℓ) [ (∂/∂aLi exp(aL)ℓ) / Σ_{i′} exp(aL)i′  −  exp(aL)ℓ (∂/∂aLi Σ_{i′} exp(aL)i′) / (Σ_{i′} exp(aL)i′)² ]
    = −(1/ŷℓ) [ 1(ℓ=i) exp(aL)ℓ / Σ_{i′} exp(aL)i′  −  (exp(aL)ℓ / Σ_{i′} exp(aL)i′)(exp(aL)i / Σ_{i′} exp(aL)i′) ]
    = −(1/ŷℓ) [ 1(ℓ=i) softmax(aL)ℓ − softmax(aL)ℓ softmax(aL)i ]
    = −(1/ŷℓ) [ 1(ℓ=i) ŷℓ − ŷℓ ŷi ]
    = −(1(ℓ=i) − ŷi)
So far we have derived the partial derivative w.r.t. the i-th element of aL:

  ∂L(θ)/∂aL,i = −(1(ℓ=i) − ŷi)

We can now write the gradient w.r.t. the vector aL:

  ∇aL L(θ) = [∂L(θ)/∂aL1, ..., ∂L(θ)/∂aLk]ᵀ
           = [−(1(ℓ=1) − ŷ1), −(1(ℓ=2) − ŷ2), ..., −(1(ℓ=k) − ŷk)]ᵀ
           = −(e(ℓ) − ŷ)
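In code, this result says that for the softmax output with cross-entropy loss, the gradient at the output pre-activations is simply ŷ − e(ℓ). A small sketch (an illustration, assuming NumPy):

    import numpy as np

    def grad_aL(y_hat, label):
        # grad_aL L(theta) = -(e(l) - yhat) = yhat - e(l)
        e_l = np.zeros_like(y_hat)
        e_l[label] = 1.0
        return y_hat - e_l

    print(grad_aL(np.array([0.7, 0.2, 0.1]), label=0))   # [-0.3  0.2  0.1]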
Module 4.6: Backpropagation: Computing Gradients w.r.t. Hidden
Units

Quantities of interest (roadmap for the remaining part):
• Gradient w.r.t. output units
• Gradient w.r.t. hidden units
• Gradient w.r.t. weights and biases
• Our focus is on Cross entropy loss and Softmax output.
Chain rule along multiple paths: If a function p(z) can be written as a function of intermediate results qi(z), then we have:

  ∂p(z)/∂z = Σ_m ∂p(z)/∂qm(z) · ∂qm(z)/∂z

In our case:
• p(z) is the loss function L(θ)
• z = hij
• qm(z) = aLm
Since ai+1 = Wi+1 hi + bi+1, every element ai+1,m depends on hij, so we apply the chain rule along these multiple paths:

  ∂L(θ)/∂hij = Σ_{m=1}^{k} ∂L(θ)/∂ai+1,m · ∂ai+1,m/∂hij
             = Σ_{m=1}^{k} ∂L(θ)/∂ai+1,m · Wi+1,m,j

Now consider these two vectors,

  ∇ai+1 L(θ) = [∂L(θ)/∂ai+1,1, ..., ∂L(θ)/∂ai+1,k]ᵀ ;   Wi+1,·,j = [Wi+1,1,j, ..., Wi+1,k,j]ᵀ

Wi+1,·,j is the j-th column of Wi+1; see that,

  (Wi+1,·,j)ᵀ ∇ai+1 L(θ) = Σ_{m=1}^{k} ∂L(θ)/∂ai+1,m · Wi+1,m,j
We have, ∂L(θ)/∂h_{ij} = (W_{i+1,·,j})^T ∇_{a_{i+1}} L(θ)

We can now write the gradient w.r.t. h_i:

∇_{h_i} L(θ) = [∂L(θ)/∂h_{i1}, ∂L(θ)/∂h_{i2}, ..., ∂L(θ)/∂h_{in}]^T
             = [(W_{i+1,·,1})^T ∇_{a_{i+1}} L(θ), (W_{i+1,·,2})^T ∇_{a_{i+1}} L(θ), ..., (W_{i+1,·,n})^T ∇_{a_{i+1}} L(θ)]^T
             = (W_{i+1})^T (∇_{a_{i+1}} L(θ))

• We are almost done except that we do not know how to calculate ∇_{a_{i+1}} L(θ) for i < L − 1
• We will see how to compute that

42/100
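The vectorized form ∇_{h_i} L(θ) = (W_{i+1})^T ∇_{a_{i+1}} L(θ) is a single matrix-vector product. A minimal sketch, assuming ∇_{a_{i+1}} L(θ) has already been computed (as it will be, recursively, by the procedure developed below); the names and sizes are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n = 4                                   # neurons per hidden layer (illustrative)
W_next = rng.normal(size=(n, n))        # stands in for W_{i+1}
grad_a_next = rng.normal(size=n)        # stands in for ∇_{a_{i+1}} L(θ), assumed given

grad_h = W_next.T @ grad_a_next         # ∇_{h_i} L(θ), one entry per neuron of layer i

# entry j equals the column-wise dot product from the previous slide
j = 1
print(np.isclose(grad_h[j], W_next[:, j] @ grad_a_next))   # True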
Now consider ∇_{a_i} L(θ):

∇_{a_i} L(θ) = [∂L(θ)/∂a_{i1}, ..., ∂L(θ)/∂a_{in}]^T

∂L(θ)/∂a_{ij} = (∂L(θ)/∂h_{ij}) (∂h_{ij}/∂a_{ij})
              = (∂L(θ)/∂h_{ij}) g'(a_{ij})        [∵ h_{ij} = g(a_{ij})]

∇_{a_i} L(θ) = [(∂L(θ)/∂h_{i1}) g'(a_{i1}), ..., (∂L(θ)/∂h_{in}) g'(a_{in})]^T
             = ∇_{h_i} L(θ) ⊙ [..., g'(a_{ik}), ...]

43/100
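Since the activation is applied element-wise, this last step is just an element-wise (Hadamard) product. A minimal sketch using the logistic g, whose derivative g(z)(1 − g(z)) is derived in Module 4.9 below; the names and values are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n = 4
a_i = rng.normal(size=n)               # pre-activations of layer i
grad_h = rng.normal(size=n)            # stands in for ∇_{h_i} L(θ), assumed given

h_i = sigmoid(a_i)
g_prime = h_i * (1.0 - h_i)            # g'(a_i) for the logistic function
grad_a = grad_h * g_prime              # ∇_{a_i} L(θ) = ∇_{h_i} L(θ) ⊙ g'(a_i)
print(grad_a)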
Module 4.7: Backpropagation: Computing Gradients w.r.t. Parameters

44/100
Quantities of interest (roadmap for the remaining part):
• Gradient w.r.t. output units
• Gradient w.r.t. hidden units
• Gradient w.r.t. weights and biases

∂L(θ)/∂W_{111}                     (talk to the weight directly)
  = ∂L(θ)/∂ŷ · ∂ŷ/∂a_3             (talk to the output layer)
    · ∂a_3/∂h_2 · ∂h_2/∂a_2        (talk to the previous hidden layer)
    · ∂a_2/∂h_1 · ∂h_1/∂a_1        (talk to the previous hidden layer)
    · ∂a_1/∂W_{111}                (and now talk to the weights)

• Our focus is on Cross entropy loss and Softmax output.

45/100
Recall that,

a_k = b_k + W_k h_{k−1}

so ∂a_{ki}/∂W_{kij} = h_{k−1,j}. Then,

∂L(θ)/∂W_{kij} = (∂L(θ)/∂a_{ki}) (∂a_{ki}/∂W_{kij})
               = (∂L(θ)/∂a_{ki}) h_{k−1,j}

∇_{W_k} L(θ) is the n × n matrix whose (i, j)-th entry is ∂L(θ)/∂W_{kij}.

46/100
Let's take a simple example of a W_k ∈ R^{3×3} and see what each entry looks like. Using

∂L(θ)/∂W_{kij} = (∂L(θ)/∂a_{ki}) (∂a_{ki}/∂W_{kij}) = (∂L(θ)/∂a_{ki}) h_{k−1,j}

the (i, j)-th entry of ∇_{W_k} L(θ) is the i-th entry of ∇_{a_k} L(θ) times the j-th entry of h_{k−1}. Therefore,

∇_{W_k} L(θ) = [ (∂L(θ)/∂a_{ki}) h_{k−1,j} ]_{3×3} = ∇_{a_k} L(θ) · h_{k−1}^T    (an outer product)

48/100
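In code, the whole matrix ∇_{W_k} L(θ) = ∇_{a_k} L(θ) · h_{k−1}^T is a single outer product. A minimal sketch for the 3×3 example above; the names and values are illustrative.

import numpy as np

rng = np.random.default_rng(0)
grad_a_k = rng.normal(size=3)          # stands in for ∇_{a_k} L(θ)
h_prev = rng.normal(size=3)            # stands in for h_{k-1}

grad_W_k = np.outer(grad_a_k, h_prev)  # entry (i, j) is (∂L(θ)/∂a_{ki}) * h_{k-1,j}
print(np.isclose(grad_W_k[1, 2], grad_a_k[1] * h_prev[2]))   # True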
Finally, coming to the biases:

a_{ki} = b_{ki} + Σ_j W_{kij} h_{k−1,j}

∂L(θ)/∂b_{ki} = (∂L(θ)/∂a_{ki}) (∂a_{ki}/∂b_{ki})
              = ∂L(θ)/∂a_{ki}          [∵ ∂a_{ki}/∂b_{ki} = 1]

We can now write the gradient w.r.t. the vector b_k:

∇_{b_k} L(θ) = [∂L(θ)/∂a_{k1}, ∂L(θ)/∂a_{k2}, ..., ∂L(θ)/∂a_{kn}]^T = ∇_{a_k} L(θ)

49/100
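The weight and bias gradients can be checked against finite differences. Below is a minimal sketch for a single softmax output layer with cross-entropy loss; the setup, the function name loss, and the values are assumptions for illustration, and the output-layer gradient ŷ − e(y) used here is the softmax + cross-entropy result from earlier in the lecture.

import numpy as np

rng = np.random.default_rng(0)
n, k = 4, 3                           # input size and number of classes (illustrative)
W, b = rng.normal(size=(k, n)), rng.normal(size=k)
h_prev = rng.normal(size=n)           # stands in for h_{k-1}
y = 1                                 # true class index

def loss(W_, b_):
    a = W_ @ h_prev + b_                               # pre-activation
    y_hat = np.exp(a - a.max()); y_hat /= y_hat.sum()  # softmax output
    return -np.log(y_hat[y])                           # cross-entropy loss

# analytical gradients from the slides: ∇a = ŷ − e(y), ∇W = ∇a · h^T, ∇b = ∇a
a = W @ h_prev + b
y_hat = np.exp(a - a.max()); y_hat /= y_hat.sum()
grad_a = y_hat.copy(); grad_a[y] -= 1.0
grad_W, grad_b = np.outer(grad_a, h_prev), grad_a

# finite-difference check of one weight entry and one bias entry
eps = 1e-6
W_pert = W.copy(); W_pert[0, 0] += eps
b_pert = b.copy(); b_pert[0] += eps
print(np.isclose((loss(W_pert, b) - loss(W, b)) / eps, grad_W[0, 0], atol=1e-4))  # True
print(np.isclose((loss(W, b_pert) - loss(W, b)) / eps, grad_b[0], atol=1e-4))     # True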
Module 4.8: Backpropagation: Pseudo code

50/100
Finally, we have all the pieces of the puzzle:

∇_{a_L} L(θ)                   (gradient w.r.t. output layer)
∇_{h_k} L(θ), ∇_{a_k} L(θ)     (gradient w.r.t. hidden layers, 1 ≤ k < L)
∇_{W_k} L(θ), ∇_{b_k} L(θ)     (gradient w.r.t. weights and biases, 1 ≤ k ≤ L)

We can now write the full learning algorithm.

51/100
Algorithm: gradient_descent()
    t ← 0;
    max_iterations ← 1000;
    Initialize θ^0 = [W_1^0, ..., W_L^0, b_1^0, ..., b_L^0];
    while t++ < max_iterations do
        h_1, h_2, ..., h_{L−1}, a_1, a_2, ..., a_L, ŷ = forward_propagation(θ^t);
        ∇θ^t = backward_propagation(h_1, h_2, ..., h_{L−1}, a_1, a_2, ..., a_L, y, ŷ);
        θ^{t+1} ← θ^t − η ∇θ^t;
    end

52/100
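Below is a minimal, self-contained sketch of the update step θ^{t+1} ← θ^t − η ∇θ^t. To keep it runnable on its own, the "network" is a single linear unit with a squared-error loss on one example; the forward_propagation / backward_propagation calls of the pseudo code (sketched after the next two slides) would replace the two marked lines. All names and values are illustrative, not the lecture's code.

import numpy as np

x, y = np.array([0.5, -1.0, 0.25, 2.0]), 1.5
W, b = np.zeros(4), 0.0
eta, max_iterations = 0.1, 1000

for t in range(max_iterations):
    y_hat = W @ x + b                              # stands in for forward_propagation(θ^t)
    grad_W, grad_b = (y_hat - y) * x, y_hat - y    # stands in for backward_propagation(...)
    W, b = W - eta * grad_W, b - eta * grad_b      # θ^{t+1} ← θ^t − η ∇θ^t

print(abs(W @ x + b - y) < 1e-6)                   # True: the loop has fit the example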
Algorithm: forward_propagation(θ)
    // Set h_0 = x (the input);
    for k = 1 to L − 1 do
        a_k = b_k + W_k h_{k−1};
        h_k = g(a_k);
    end
    a_L = b_L + W_L h_{L−1};
    ŷ = O(a_L);

53/100
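A minimal NumPy rendering of this forward pass, assuming a logistic g for the hidden layers and a softmax O at the output (the function names, layer sizes, and values are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def forward_propagation(Ws, bs, x):
    # Ws, bs: lists of weight matrices / bias vectors for layers 1..L; x is the input (h_0 = x).
    hs, As = [], []
    h = x
    for k in range(len(Ws) - 1):       # hidden layers 1 .. L-1
        a = bs[k] + Ws[k] @ h          # a_k = b_k + W_k h_{k-1}
        h = sigmoid(a)                 # h_k = g(a_k); g assumed logistic here
        As.append(a); hs.append(h)
    a_L = bs[-1] + Ws[-1] @ h          # a_L = b_L + W_L h_{L-1}
    As.append(a_L)
    return hs, As, softmax(a_L)        # ŷ = O(a_L); O assumed softmax here

# tiny usage example (layer sizes and values are illustrative)
rng = np.random.default_rng(0)
sizes = [4, 3, 3, 2]                   # n-dim input, two hidden layers, k = 2 classes
Ws = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(3)]
bs = [np.zeros(sizes[i + 1]) for i in range(3)]
hs, As, y_hat = forward_propagation(Ws, bs, rng.normal(size=4))
print(y_hat, y_hat.sum())              # a valid probability distribution; sum ≈ 1.0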
Just do a forward propagation and compute all h_i's, a_i's, and ŷ.

Algorithm: back_propagation(h_1, h_2, ..., h_{L−1}, a_1, a_2, ..., a_L, y, ŷ)
    // Compute output gradient;
    ∇_{a_L} L(θ) = −(e(y) − ŷ);
    for k = L to 1 do
        // Compute gradients w.r.t. parameters;
        ∇_{W_k} L(θ) = ∇_{a_k} L(θ) h_{k−1}^T;
        ∇_{b_k} L(θ) = ∇_{a_k} L(θ);
        // Compute gradients w.r.t. layer below;
        ∇_{h_{k−1}} L(θ) = W_k^T (∇_{a_k} L(θ));
        // Compute gradients w.r.t. layer below (pre-activation);
        ∇_{a_{k−1}} L(θ) = ∇_{h_{k−1}} L(θ) ⊙ [..., g'(a_{k−1,j}), ...];
    end

54/100
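A minimal NumPy sketch of this backward pass, again assuming a logistic g and a softmax + cross-entropy output; the helper names, sizes, and values are illustrative. With the logistic g, g'(a_{k−1}) = h_{k−1} ⊙ (1 − h_{k−1}), so the stored a_k's are not needed in this particular sketch.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def back_propagation(Ws, hs, x, y, y_hat):
    # Gradients of L(θ) = -log ŷ_y w.r.t. every W_k and b_k, following the pseudo code above.
    L = len(Ws)
    grad_Ws, grad_bs = [None] * L, [None] * L
    grad_a = y_hat.copy(); grad_a[y] -= 1.0       # ∇_{a_L} L(θ) = -(e(y) - ŷ)
    for k in range(L - 1, -1, -1):                # k = L, ..., 1 (0-indexed here)
        h_below = hs[k - 1] if k > 0 else x       # h_{k-1}, with h_0 = x
        grad_Ws[k] = np.outer(grad_a, h_below)    # ∇_{W_k} L(θ) = ∇_{a_k} L(θ) h_{k-1}^T
        grad_bs[k] = grad_a.copy()                # ∇_{b_k} L(θ) = ∇_{a_k} L(θ)
        if k > 0:
            grad_h = Ws[k].T @ grad_a             # ∇_{h_{k-1}} L(θ)
            grad_a = grad_h * (hs[k - 1] * (1.0 - hs[k - 1]))   # ∇_{a_{k-1}} L(θ)
    return grad_Ws, grad_bs

# tiny usage example: a forward pass (as in the earlier sketch), then the backward pass
rng = np.random.default_rng(0)
sizes = [4, 3, 3, 2]
Ws = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(3)]
bs = [np.zeros(sizes[i + 1]) for i in range(3)]
x, y = rng.normal(size=4), 1

h, hs = x, []
for k in range(2):
    h = sigmoid(bs[k] + Ws[k] @ h); hs.append(h)
y_hat = softmax(bs[2] + Ws[2] @ hs[-1])

grad_Ws, grad_bs = back_propagation(Ws, hs, x, y, y_hat)
print([g.shape for g in grad_Ws], grad_bs[-1])    # shapes match the W_k's; ∇_{b_L} = ŷ - e(y)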
Module 4.9: Derivative of the activation function

55/100
Now, the only thing we need to figure out is how to compute g'.

Logistic function:

g(z) = σ(z) = 1 / (1 + e^{−z})

g'(z) = (−1) · 1/(1 + e^{−z})^2 · d/dz (1 + e^{−z})
      = (−1) · 1/(1 + e^{−z})^2 · (−e^{−z})
      = (1 / (1 + e^{−z})) · ((1 + e^{−z} − 1) / (1 + e^{−z}))
      = g(z) (1 − g(z))

tanh:

g(z) = tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z})

g'(z) = [ (e^z + e^{−z}) d/dz (e^z − e^{−z}) − (e^z − e^{−z}) d/dz (e^z + e^{−z}) ] / (e^z + e^{−z})^2
      = [ (e^z + e^{−z})^2 − (e^z − e^{−z})^2 ] / (e^z + e^{−z})^2
      = 1 − (e^z − e^{−z})^2 / (e^z + e^{−z})^2
      = 1 − (g(z))^2

56/100
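Both derivatives can be written purely in terms of the function values, which is what the backward pass exploits. A minimal sketch that checks the two closed forms against central finite differences (the grid of z values and tolerances are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3.0, 3.0, 7)

# derivatives in terms of the function values themselves, as derived above
sigmoid_prime = sigmoid(z) * (1.0 - sigmoid(z))   # g'(z) = g(z)(1 - g(z))
tanh_prime = 1.0 - np.tanh(z) ** 2                # g'(z) = 1 - (g(z))^2

# numerical check via central differences
eps = 1e-6
print(np.allclose(sigmoid_prime, (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps), atol=1e-6))  # True
print(np.allclose(tanh_prime, (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps), atol=1e-6))     # True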
