
CS7015 (Deep Learning): Lecture 4

Feedforward Neural Networks, Backpropagation

Mitesh M. Khapra

Department of Computer Science and Engineering


Indian Institute of Technology Madras

References/Acknowledgments
See the excellent videos by Hugo Larochelle on Backpropagation

Module 4.1: Feedforward Neural Networks (a.k.a. multilayered
network of neurons)

• The input to the network is an n-dimensional vector
• The network contains L − 1 hidden layers (2, in this case) having n neurons each
• Finally, there is one output layer containing k neurons (say, corresponding to k classes)
• Each neuron in the hidden layer and output layer can be split into two parts: pre-activation and activation (ai and hi are vectors)
• The input layer can be called the 0-th layer and the output layer can be called the L-th layer
• Wi ∈ Rn×n and bi ∈ Rn are the weight and bias between layers i − 1 and i (0 < i < L)
• WL ∈ Rn×k and bL ∈ Rk are the weight and bias between the last hidden layer and the output layer (L = 3 in this case)

[Figure: a feedforward network with inputs x1, x2, ..., xn, hidden layers with pre-activations/activations (a1, h1) and (a2, h2), the output layer (a3, hL = ŷ = f(x)), and parameters (W1, b1), (W2, b2), (W3, b3)]
• The pre-activation at layer i is given by

  ai(x) = bi + Wi hi−1(x)

• The activation at layer i is given by

  hi(x) = g(ai(x))

  where g is called the activation function (for example, logistic, tanh, linear, etc.)
• The activation at the output layer is given by

  f(x) = hL(x) = O(aL(x))

  where O is the output activation function (for example, softmax, linear, etc.)
• To simplify notation we will refer to ai(x) as ai and hi(x) as hi, so that

  ai = bi + Wi hi−1,   hi = g(ai),   f(x) = hL = O(aL)
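To make these equations concrete, here is a minimal NumPy sketch of the forward pass (an illustration, not from the lecture): the logistic function for g and softmax for O are assumed choices, and the weight matrices are stored so that Wi @ h works directly (i.e. the output weight matrix is stored as k × n).

    import numpy as np

    def logistic(a):                        # one possible choice for g
        return 1.0 / (1.0 + np.exp(-a))

    def softmax(a):                         # one possible choice for O
        e = np.exp(a - np.max(a))           # shift by max for numerical stability
        return e / e.sum()

    def forward(x, weights, biases):
        # weights = [W1, ..., WL], biases = [b1, ..., bL]; h0 = x is the 0-th layer
        h = x
        for W, b in zip(weights[:-1], biases[:-1]):
            a = b + W @ h                   # pre-activation: a_i = b_i + W_i h_{i-1}
            h = logistic(a)                 # activation:     h_i = g(a_i)
        aL = biases[-1] + weights[-1] @ h   # output pre-activation a_L
        return softmax(aL)                  # f(x) = h_L = O(a_L)

    # Example: n = 4 inputs, two hidden layers of 4 neurons, k = 3 classes (L = 3)
    rng = np.random.default_rng(0)
    Ws = [rng.standard_normal((4, 4)), rng.standard_normal((4, 4)), rng.standard_normal((3, 4))]
    bs = [np.zeros(4), np.zeros(4), np.zeros(3)]
    yhat = forward(rng.standard_normal(4), Ws, bs)
    print(yhat, yhat.sum())                 # a probability distribution over 3 classes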
• Data: {xi, yi}, i = 1, ..., N
• Model:

  ŷi = f(xi) = O(W3 g(W2 g(W1 xi + b1) + b2) + b3)

• Parameters: θ = W1, ..., WL, b1, b2, ..., bL (L = 3)
• Algorithm: Gradient Descent with Back-propagation (we will see soon)
• Objective/Loss/Error function: Say,

  min (1/N) Σ_{i=1}^{N} Σ_{j=1}^{k} (ŷij − yij)²

  In general, min L(θ), where L(θ) is some function of the parameters
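As a small illustration (not from the slides), the squared error objective above can be computed as follows, assuming NumPy arrays Y_hat and Y of shape N × k holding the predictions and the true outputs:

    import numpy as np

    def squared_error_loss(Y_hat, Y):
        # L(theta) = (1/N) * sum_i sum_j (yhat_ij - y_ij)^2
        N = Y.shape[0]
        return np.sum((Y_hat - Y) ** 2) / N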
Module 4.2: Learning Parameters of Feedforward Neural Networks
(Intuition)

The story so far...
• We have introduced feedforward neural networks
• We are now interested in finding an algorithm for learning the parameters of this model

• Recall our gradient descent algorithm

  Algorithm: gradient_descent()
  t ← 0;
  max_iterations ← 1000;
  Initialize w0, b0;
  while t++ < max_iterations do
      wt+1 ← wt − η∇wt;
      bt+1 ← bt − η∇bt;
  end

• We can write it more concisely as

  Algorithm: gradient_descent()
  t ← 0;
  max_iterations ← 1000;
  Initialize θ0 = [w0, b0];
  while t++ < max_iterations do
      θt+1 ← θt − η∇θt;
  end

  where ∇θt = [∂L(θ)/∂wt, ∂L(θ)/∂bt]ᵀ
• Now, in this feedforward neural network, instead of θ = [w, b] we have θ = [W1, W2, ..., WL, b1, b2, ..., bL]
• We can still use the same algorithm for learning: initialize θ0 = [W1⁰, ..., WL⁰, b1⁰, ..., bL⁰] and use

  ∇θt = [∂L(θ)/∂W1,t, ..., ∂L(θ)/∂WL,t, ∂L(θ)/∂b1,t, ..., ∂L(θ)/∂bL,t]ᵀ
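The same update, as a Python sketch (an illustration, not the lecture's code): theta is simply a list holding all weight matrices and bias vectors, grad is a placeholder for the backpropagation routine derived later, and eta is an assumed learning rate.

    def gradient_descent(theta, grad, eta=0.1, max_iterations=1000):
        # theta = [W1, ..., WL, b1, ..., bL]
        # grad(theta) returns the matching list [dW1, ..., dWL, db1, ..., dbL]
        for t in range(max_iterations):
            gradients = grad(theta)
            theta = [p - eta * g for p, g in zip(theta, gradients)]  # theta_{t+1} = theta_t - eta * grad_theta_t
        return theta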


• Except that now our ∇θ looks much more nasty:

  ∇θ = [ ∂L(θ)/∂W111 ... ∂L(θ)/∂W11n   ∂L(θ)/∂W211 ... ∂L(θ)/∂W21n   ...   ∂L(θ)/∂WL,11 ... ∂L(θ)/∂WL,1k   ∂L(θ)/∂b11 ... ∂L(θ)/∂bL1 ]
       [ ∂L(θ)/∂W121 ... ∂L(θ)/∂W12n   ∂L(θ)/∂W221 ... ∂L(θ)/∂W22n   ...   ∂L(θ)/∂WL,21 ... ∂L(θ)/∂WL,2k   ∂L(θ)/∂b12 ... ∂L(θ)/∂bL2 ]
       [      ...             ...            ...             ...      ...        ...             ...             ...            ...    ]
       [ ∂L(θ)/∂W1n1 ... ∂L(θ)/∂W1nn   ∂L(θ)/∂W2n1 ... ∂L(θ)/∂W2nn   ...   ∂L(θ)/∂WL,n1 ... ∂L(θ)/∂WL,nk   ∂L(θ)/∂b1n ... ∂L(θ)/∂bLk ]

• ∇θ is thus composed of
  ∇W1, ∇W2, ..., ∇WL−1 ∈ Rn×n, ∇WL ∈ Rn×k,
  ∇b1, ∇b2, ..., ∇bL−1 ∈ Rn and ∇bL ∈ Rk
We need to answer two questions
• How to choose the loss function L(θ)?
• How to compute ∇θ which is composed of
  ∇W1, ∇W2, ..., ∇WL−1 ∈ Rn×n, ∇WL ∈ Rn×k,
  ∇b1, ∇b2, ..., ∇bL−1 ∈ Rn and ∇bL ∈ Rk?
Module 4.3: Output Functions and Loss Functions

We need to answer two questions
• How to choose the loss function L(θ)?
• How to compute ∇θ which is composed of:
  ∇W1, ∇W2, ..., ∇WL−1 ∈ Rn×n, ∇WL ∈ Rn×k,
  ∇b1, ∇b2, ..., ∇bL−1 ∈ Rn and ∇bL ∈ Rk?
• The choice of loss function depends on the problem at hand
• We will illustrate this with the help of two examples
• Consider our movie example again but this time we are interested in predicting ratings

  [Figure: a neural network with L − 1 hidden layers taking the movie features xi (isActor Damon, isDirector Nolan, ...) as input and predicting yi = {7.5, 8.2, 7.7}, the imdb, Critics and RT ratings]

• Here yi ∈ R3
• The loss function should capture how much ŷi deviates from yi
• If yi ∈ Rn then the squared error loss can capture this deviation

  L(θ) = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{3} (ŷij − yij)²
• A related question: What should the output function ‘O’ be if yi ∈ R?
• More specifically, can it be the logistic function?
• No, because it restricts ŷi to a value between 0 & 1 but we want ŷi ∈ R
• So, in such cases it makes sense to have ‘O’ as a linear function

  f(x) = hL = O(aL) = WO aL + bO

• ŷi = f(xi) is no longer bounded between 0 and 1
• Now let us consider another problem for which a different loss function would be appropriate
• Suppose we want to classify an image into 1 of k classes

  [Figure: a neural network with L − 1 hidden layers taking an image as input and producing y = [1 0 0 0] over the classes Apple, Mango, Orange, Banana]

• Here again we could use the squared error loss to capture the deviation
• But can you think of a better function?
• Notice that y is a probability distribution
• Therefore we should also ensure that ŷ is a probability distribution
• What choice of the output activation ‘O’ will ensure this?

  aL = WL hL−1 + bL
  ŷj = O(aL)j = exp(aL,j) / Σ_{i=1}^{k} exp(aL,i)

  O(aL)j is the j-th element of ŷ and aL,j is the j-th element of the vector aL.
• This function is called the softmax function
• Now that we have ensured that both y & ŷ are probability distributions, can you think of a function which captures the difference between them?
• Cross-entropy:

  L(θ) = − Σ_{c=1}^{k} yc log ŷc

• Notice that

  yc = 1 if c = ℓ (the true class label)
     = 0 otherwise
  ∴ L(θ) = − log ŷℓ
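A minimal NumPy sketch of the softmax output and the cross-entropy loss just described (an illustration; the max-subtraction inside softmax is a standard numerical-stability trick, not something the slides discuss):

    import numpy as np

    def softmax(aL):
        e = np.exp(aL - np.max(aL))          # subtracting the max leaves the result unchanged
        return e / e.sum()                   # yhat_j = exp(aL_j) / sum_i exp(aL_i)

    def cross_entropy(y_hat, label):
        # L(theta) = -sum_c y_c log(yhat_c) = -log(yhat_l) when y is one-hot with true class l
        return -np.log(y_hat[label])

    aL = np.array([2.0, 1.0, 0.1, -1.0])     # pre-activations for k = 4 classes
    y_hat = softmax(aL)
    print(y_hat.sum(), cross_entropy(y_hat, label=0))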
• So, for a classification problem (where you have to choose 1 of k classes), we use the following objective function:

  minimize_θ L(θ) = − log ŷℓ
  or maximize_θ −L(θ) = log ŷℓ

• But wait! Is ŷℓ a function of θ = [W1, W2, ..., WL, b1, b2, ..., bL]?
• Yes, it is indeed a function of θ:

  ŷℓ = [O(W3 g(W2 g(W1 x + b1) + b2) + b3)]ℓ

• What does ŷℓ encode?
• It is the probability that x belongs to the ℓ-th class (bring it as close to 1).
• log ŷℓ is called the log-likelihood of the data.
Outputs:

                      Real Values      Probabilities
  Output Activation   Linear           Softmax
  Loss Function       Squared Error    Cross Entropy

• Of course, there could be other loss functions depending on the problem at hand but the two loss functions that we just saw are encountered very often
• For the rest of this lecture we will focus on the case where the output activation is a softmax function and the loss function is cross entropy
Module 4.4: Backpropagation (Intuition)

We need to answer two questions
• How to choose the loss function L(θ)?
• How to compute ∇θ which is composed of:
  ∇W1, ∇W2, ..., ∇WL−1 ∈ Rn×n, ∇WL ∈ Rn×k,
  ∇b1, ∇b2, ..., ∇bL−1 ∈ Rn and ∇bL ∈ Rk?
• Let us focus on this one weight (W112).

  [Figure: the network with the first-layer weight W112 highlighted, shown next to the gradient_descent() algorithm from before]

• To learn this weight using SGD we need a formula for ∂L(θ)/∂W112.
• We will see how to calculate this.
• First let us take the simple case when we have a deep but thin network.

  [Figure: a thin network x1 → (a11, h11) → (a21, h21) → aL1 → ŷ = f(x) → L(θ), with weights W111, W211, WL11]

• In this case it is easy to find the derivative by chain rule.

  ∂L(θ)/∂W111 = (∂L(θ)/∂ŷ)(∂ŷ/∂aL1)(∂aL1/∂h21)(∂h21/∂a21)(∂a21/∂h11)(∂h11/∂a11)(∂a11/∂W111)

  ∂L(θ)/∂W111 = (∂L(θ)/∂h11)(∂h11/∂W111)   (just compressing the chain rule)
  ∂L(θ)/∂W211 = (∂L(θ)/∂h21)(∂h21/∂W211)
  ∂L(θ)/∂WL11 = (∂L(θ)/∂aL1)(∂aL1/∂WL11)
Let us see an intuitive explanation of backpropagation before we get into the
mathematical details

• We get a certain loss at the output and we try to figure out who is responsible for this loss
• So, we talk to the output layer and say “Hey! You are not producing the desired output, better take responsibility".
• The output layer says “Well, I take responsibility for my part but please understand that I am only as good as the hidden layer and weights below me". After all ...

  f(x) = ŷ = O(WL hL−1 + bL)
• So, we talk to WL, bL and hL and ask them “What is wrong with you?"
• WL and bL take full responsibility but hL says “Well, please understand that I am only as good as the pre-activation layer"
• The pre-activation layer in turn says that I am only as good as the hidden layer and weights below me.
• We continue in this manner and realize that the responsibility lies with all the weights and biases (i.e. all the parameters of the model)
• But instead of talking to them directly, it is easier to talk to them through the hidden layers and output layers (and this is exactly what the chain rule allows us to do)

  ∂L(θ)/∂W111 = [∂L(θ)/∂ŷ · ∂ŷ/∂a3] · [∂a3/∂h2 · ∂h2/∂a2] · [∂a2/∂h1 · ∂h1/∂a1] · [∂a1/∂W111]

  (The left-hand side talks to the weight directly; on the right we first talk to the output layer, then to the previous hidden layer, then to the hidden layer before that, and finally to the weights.)
Quantities of interest (roadmap for the remaining part):
• Gradient w.r.t. output units
• Gradient w.r.t. hidden units
• Gradient w.r.t. weights and biases

  ∂L(θ)/∂W111 = [∂L(θ)/∂ŷ · ∂ŷ/∂a3] · [∂a3/∂h2 · ∂h2/∂a2] · [∂a2/∂h1 · ∂h1/∂a1] · [∂a1/∂W111]

• Our focus is on Cross entropy loss and Softmax output.
Module 4.5: Backpropagation: Computing Gradients w.r.t. the
Output Units

Quantities of interest (roadmap for the remaining part):
• Gradient w.r.t. output units
• Gradient w.r.t. hidden units
• Gradient w.r.t. weights
• Our focus is on Cross entropy loss and Softmax output.
Let us first consider the partial derivative w.r.t. the i-th output.

  L(θ) = − log ŷℓ   (ℓ = true class label)

  ∂/∂ŷi (L(θ)) = ∂/∂ŷi (− log ŷℓ)
               = −1/ŷℓ   if i = ℓ
               = 0       otherwise

More compactly,

  ∂/∂ŷi (L(θ)) = − 1(i=ℓ) / ŷℓ
We can now talk about the gradient w.r.t. the vector ŷ:

  ∇ŷ L(θ) = [∂L(θ)/∂ŷ1, ..., ∂L(θ)/∂ŷk]ᵀ
          = −(1/ŷℓ) [1(ℓ=1), 1(ℓ=2), ..., 1(ℓ=k)]ᵀ
          = −(1/ŷℓ) e(ℓ)

where e(ℓ) is a k-dimensional vector whose ℓ-th element is 1 and all other elements are 0.
What we are actually interested in is

  ∂L(θ)/∂aLi = ∂(− log ŷℓ)/∂aLi = ∂(− log ŷℓ)/∂ŷℓ · ∂ŷℓ/∂aLi

Does ŷℓ depend on aLi? Indeed, it does:

  ŷℓ = exp(aLℓ) / Σ_i exp(aLi)

Having established this, we will now derive the full expression on the next slide.

Using the quotient rule ∂/∂x (g(x)/h(x)) = (∂g(x)/∂x)(1/h(x)) − (g(x)/h(x)²)(∂h(x)/∂x):

  ∂/∂aLi (− log ŷℓ)
    = −(1/ŷℓ) ∂ŷℓ/∂aLi
    = −(1/ŷℓ) ∂/∂aLi softmax(aL)ℓ
    = −(1/ŷℓ) ∂/∂aLi [ exp(aL)ℓ / Σ_{i′} exp(aL)i′ ]
    = −(1/ŷℓ) [ (∂/∂aLi exp(aL)ℓ) / Σ_{i′} exp(aL)i′  −  exp(aL)ℓ (∂/∂aLi Σ_{i′} exp(aL)i′) / (Σ_{i′} exp(aL)i′)² ]
    = −(1/ŷℓ) [ 1(ℓ=i) exp(aL)ℓ / Σ_{i′} exp(aL)i′  −  (exp(aL)ℓ / Σ_{i′} exp(aL)i′)(exp(aL)i / Σ_{i′} exp(aL)i′) ]
    = −(1/ŷℓ) [ 1(ℓ=i) softmax(aL)ℓ − softmax(aL)ℓ softmax(aL)i ]
    = −(1/ŷℓ) [ 1(ℓ=i) ŷℓ − ŷℓ ŷi ]
    = −(1(ℓ=i) − ŷi)
So far we have derived the partial derivative w.r.t. the i-th element of aL:

  ∂L(θ)/∂aL,i = −(1(ℓ=i) − ŷi)

We can now write the gradient w.r.t. the vector aL:

  ∇aL L(θ) = [∂L(θ)/∂aL1, ..., ∂L(θ)/∂aLk]ᵀ
           = [−(1(ℓ=1) − ŷ1), −(1(ℓ=2) − ŷ2), ..., −(1(ℓ=k) − ŷk)]ᵀ
           = −(e(ℓ) − ŷ)
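In code, this result says that for the softmax output with cross-entropy loss, the gradient at the output pre-activations is simply ŷ − e(ℓ). A small sketch (an illustration, assuming NumPy):

    import numpy as np

    def grad_aL(y_hat, label):
        # grad_aL L(theta) = -(e(l) - yhat) = yhat - e(l)
        e_l = np.zeros_like(y_hat)
        e_l[label] = 1.0
        return y_hat - e_l

    print(grad_aL(np.array([0.7, 0.2, 0.1]), label=0))   # [-0.3  0.2  0.1]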
Module 4.6: Backpropagation: Computing Gradients w.r.t. Hidden
Units

Quantities of interest (roadmap for the remaining part):
• Gradient w.r.t. output units
• Gradient w.r.t. hidden units
• Gradient w.r.t. weights and biases
• Our focus is on Cross entropy loss and Softmax output.
Chain rule along multiple paths: If a function p(z) can be written as a function of intermediate results qi(z), then we have:

  ∂p(z)/∂z = Σ_m ∂p(z)/∂qm(z) · ∂qm(z)/∂z

In our case:
• p(z) is the loss function L(θ)
• z = hij
• qm(z) = aLm
Since ai+1 = Wi+1 hi + bi+1, every element ai+1,m depends on hij, so we apply the chain rule along these multiple paths:

  ∂L(θ)/∂hij = Σ_{m=1}^{k} ∂L(θ)/∂ai+1,m · ∂ai+1,m/∂hij
             = Σ_{m=1}^{k} ∂L(θ)/∂ai+1,m · Wi+1,m,j

Now consider these two vectors,

  ∇ai+1 L(θ) = [∂L(θ)/∂ai+1,1, ..., ∂L(θ)/∂ai+1,k]ᵀ ;   Wi+1,·,j = [Wi+1,1,j, ..., Wi+1,k,j]ᵀ

Wi+1,·,j is the j-th column of Wi+1; see that,

  (Wi+1,·,j)ᵀ ∇ai+1 L(θ) = Σ_{m=1}^{k} ∂L(θ)/∂ai+1,m · Wi+1,m,j
We have, ∂L(θ)/∂h_{ij} = (W_{i+1,·,j})^T ∇_{a_{i+1}} L(θ)

We can now write the gradient w.r.t. h_i:

∇_{h_i} L(θ) = [∂L(θ)/∂h_{i1}, ∂L(θ)/∂h_{i2}, ..., ∂L(θ)/∂h_{in}]^T
             = [(W_{i+1,·,1})^T ∇_{a_{i+1}} L(θ), (W_{i+1,·,2})^T ∇_{a_{i+1}} L(θ), ..., (W_{i+1,·,n})^T ∇_{a_{i+1}} L(θ)]^T
             = (W_{i+1})^T (∇_{a_{i+1}} L(θ))

• We are almost done except that we do not know how to calculate ∇_{a_{i+1}} L(θ) for i < L − 1
• We will see how to compute that

42/100
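The vectorized form ∇_{h_i} L(θ) = (W_{i+1})^T ∇_{a_{i+1}} L(θ) is a single matrix-vector product. A minimal sketch, assuming ∇_{a_{i+1}} L(θ) has already been computed (as it will be, recursively, by the procedure developed below); the names and sizes are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n = 4                                   # neurons per hidden layer (illustrative)
W_next = rng.normal(size=(n, n))        # stands in for W_{i+1}
grad_a_next = rng.normal(size=n)        # stands in for ∇_{a_{i+1}} L(θ), assumed given

grad_h = W_next.T @ grad_a_next         # ∇_{h_i} L(θ), one entry per neuron of layer i

# entry j equals the column-wise dot product from the previous slide
j = 1
print(np.isclose(grad_h[j], W_next[:, j] @ grad_a_next))   # True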
Now consider ∇_{a_i} L(θ):

∇_{a_i} L(θ) = [∂L(θ)/∂a_{i1}, ..., ∂L(θ)/∂a_{in}]^T

∂L(θ)/∂a_{ij} = (∂L(θ)/∂h_{ij}) (∂h_{ij}/∂a_{ij})
              = (∂L(θ)/∂h_{ij}) g'(a_{ij})        [∵ h_{ij} = g(a_{ij})]

∇_{a_i} L(θ) = [(∂L(θ)/∂h_{i1}) g'(a_{i1}), ..., (∂L(θ)/∂h_{in}) g'(a_{in})]^T
             = ∇_{h_i} L(θ) ⊙ [..., g'(a_{ik}), ...]

43/100
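Since the activation is applied element-wise, this last step is just an element-wise (Hadamard) product. A minimal sketch using the logistic g, whose derivative g(z)(1 − g(z)) is derived in Module 4.9 below; the names and values are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n = 4
a_i = rng.normal(size=n)               # pre-activations of layer i
grad_h = rng.normal(size=n)            # stands in for ∇_{h_i} L(θ), assumed given

h_i = sigmoid(a_i)
g_prime = h_i * (1.0 - h_i)            # g'(a_i) for the logistic function
grad_a = grad_h * g_prime              # ∇_{a_i} L(θ) = ∇_{h_i} L(θ) ⊙ g'(a_i)
print(grad_a)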
Module 4.7: Backpropagation: Computing Gradients w.r.t. Parameters

44/100
Quantities of interest (roadmap for the remaining part):
• Gradient w.r.t. output units
• Gradient w.r.t. hidden units
• Gradient w.r.t. weights and biases

∂L(θ)/∂W_{111}                     (talk to the weight directly)
  = ∂L(θ)/∂ŷ · ∂ŷ/∂a_3             (talk to the output layer)
    · ∂a_3/∂h_2 · ∂h_2/∂a_2        (talk to the previous hidden layer)
    · ∂a_2/∂h_1 · ∂h_1/∂a_1        (talk to the previous hidden layer)
    · ∂a_1/∂W_{111}                (and now talk to the weights)

• Our focus is on Cross entropy loss and Softmax output.

45/100
Recall that,

a_k = b_k + W_k h_{k−1}

so ∂a_{ki}/∂W_{kij} = h_{k−1,j}. Then,

∂L(θ)/∂W_{kij} = (∂L(θ)/∂a_{ki}) (∂a_{ki}/∂W_{kij})
               = (∂L(θ)/∂a_{ki}) h_{k−1,j}

∇_{W_k} L(θ) is the n × n matrix whose (i, j)-th entry is ∂L(θ)/∂W_{kij}.

46/100
Let's take a simple example of a W_k ∈ R^{3×3} and see what each entry looks like. Using

∂L(θ)/∂W_{kij} = (∂L(θ)/∂a_{ki}) (∂a_{ki}/∂W_{kij}) = (∂L(θ)/∂a_{ki}) h_{k−1,j}

the (i, j)-th entry of ∇_{W_k} L(θ) is the i-th entry of ∇_{a_k} L(θ) times the j-th entry of h_{k−1}. Therefore,

∇_{W_k} L(θ) = [ (∂L(θ)/∂a_{ki}) h_{k−1,j} ]_{3×3} = ∇_{a_k} L(θ) · h_{k−1}^T    (an outer product)

48/100
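In code, the whole matrix ∇_{W_k} L(θ) = ∇_{a_k} L(θ) · h_{k−1}^T is a single outer product. A minimal sketch for the 3×3 example above; the names and values are illustrative.

import numpy as np

rng = np.random.default_rng(0)
grad_a_k = rng.normal(size=3)          # stands in for ∇_{a_k} L(θ)
h_prev = rng.normal(size=3)            # stands in for h_{k-1}

grad_W_k = np.outer(grad_a_k, h_prev)  # entry (i, j) is (∂L(θ)/∂a_{ki}) * h_{k-1,j}
print(np.isclose(grad_W_k[1, 2], grad_a_k[1] * h_prev[2]))   # True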
Finally, coming to the biases:

a_{ki} = b_{ki} + Σ_j W_{kij} h_{k−1,j}

∂L(θ)/∂b_{ki} = (∂L(θ)/∂a_{ki}) (∂a_{ki}/∂b_{ki})
              = ∂L(θ)/∂a_{ki}          [∵ ∂a_{ki}/∂b_{ki} = 1]

We can now write the gradient w.r.t. the vector b_k:

∇_{b_k} L(θ) = [∂L(θ)/∂a_{k1}, ∂L(θ)/∂a_{k2}, ..., ∂L(θ)/∂a_{kn}]^T = ∇_{a_k} L(θ)

49/100
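The weight and bias gradients can be checked against finite differences. Below is a minimal sketch for a single softmax output layer with cross-entropy loss; the setup, the function name loss, and the values are assumptions for illustration, and the output-layer gradient ŷ − e(y) used here is the softmax + cross-entropy result from earlier in the lecture.

import numpy as np

rng = np.random.default_rng(0)
n, k = 4, 3                           # input size and number of classes (illustrative)
W, b = rng.normal(size=(k, n)), rng.normal(size=k)
h_prev = rng.normal(size=n)           # stands in for h_{k-1}
y = 1                                 # true class index

def loss(W_, b_):
    a = W_ @ h_prev + b_                               # pre-activation
    y_hat = np.exp(a - a.max()); y_hat /= y_hat.sum()  # softmax output
    return -np.log(y_hat[y])                           # cross-entropy loss

# analytical gradients from the slides: ∇a = ŷ − e(y), ∇W = ∇a · h^T, ∇b = ∇a
a = W @ h_prev + b
y_hat = np.exp(a - a.max()); y_hat /= y_hat.sum()
grad_a = y_hat.copy(); grad_a[y] -= 1.0
grad_W, grad_b = np.outer(grad_a, h_prev), grad_a

# finite-difference check of one weight entry and one bias entry
eps = 1e-6
W_pert = W.copy(); W_pert[0, 0] += eps
b_pert = b.copy(); b_pert[0] += eps
print(np.isclose((loss(W_pert, b) - loss(W, b)) / eps, grad_W[0, 0], atol=1e-4))  # True
print(np.isclose((loss(W, b_pert) - loss(W, b)) / eps, grad_b[0], atol=1e-4))     # True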
Module 4.8: Backpropagation: Pseudo code

50/100
Finally, we have all the pieces of the puzzle:

∇_{a_L} L(θ)                   (gradient w.r.t. output layer)
∇_{h_k} L(θ), ∇_{a_k} L(θ)     (gradient w.r.t. hidden layers, 1 ≤ k < L)
∇_{W_k} L(θ), ∇_{b_k} L(θ)     (gradient w.r.t. weights and biases, 1 ≤ k ≤ L)

We can now write the full learning algorithm.

51/100
Algorithm: gradient_descent()
    t ← 0;
    max_iterations ← 1000;
    Initialize θ^0 = [W_1^0, ..., W_L^0, b_1^0, ..., b_L^0];
    while t++ < max_iterations do
        h_1, h_2, ..., h_{L−1}, a_1, a_2, ..., a_L, ŷ = forward_propagation(θ^t);
        ∇θ^t = backward_propagation(h_1, h_2, ..., h_{L−1}, a_1, a_2, ..., a_L, y, ŷ);
        θ^{t+1} ← θ^t − η ∇θ^t;
    end

52/100
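Below is a minimal, self-contained sketch of the update step θ^{t+1} ← θ^t − η ∇θ^t. To keep it runnable on its own, the "network" is a single linear unit with a squared-error loss on one example; the forward_propagation / backward_propagation calls of the pseudo code (sketched after the next two slides) would replace the two marked lines. All names and values are illustrative, not the lecture's code.

import numpy as np

x, y = np.array([0.5, -1.0, 0.25, 2.0]), 1.5
W, b = np.zeros(4), 0.0
eta, max_iterations = 0.1, 1000

for t in range(max_iterations):
    y_hat = W @ x + b                              # stands in for forward_propagation(θ^t)
    grad_W, grad_b = (y_hat - y) * x, y_hat - y    # stands in for backward_propagation(...)
    W, b = W - eta * grad_W, b - eta * grad_b      # θ^{t+1} ← θ^t − η ∇θ^t

print(abs(W @ x + b - y) < 1e-6)                   # True: the loop has fit the example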
Algorithm: forward_propagation(θ)
    // Set h_0 = x (the input);
    for k = 1 to L − 1 do
        a_k = b_k + W_k h_{k−1};
        h_k = g(a_k);
    end
    a_L = b_L + W_L h_{L−1};
    ŷ = O(a_L);

53/100
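A minimal NumPy rendering of this forward pass, assuming a logistic g for the hidden layers and a softmax O at the output (the function names, layer sizes, and values are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def forward_propagation(Ws, bs, x):
    # Ws, bs: lists of weight matrices / bias vectors for layers 1..L; x is the input (h_0 = x).
    hs, As = [], []
    h = x
    for k in range(len(Ws) - 1):       # hidden layers 1 .. L-1
        a = bs[k] + Ws[k] @ h          # a_k = b_k + W_k h_{k-1}
        h = sigmoid(a)                 # h_k = g(a_k); g assumed logistic here
        As.append(a); hs.append(h)
    a_L = bs[-1] + Ws[-1] @ h          # a_L = b_L + W_L h_{L-1}
    As.append(a_L)
    return hs, As, softmax(a_L)        # ŷ = O(a_L); O assumed softmax here

# tiny usage example (layer sizes and values are illustrative)
rng = np.random.default_rng(0)
sizes = [4, 3, 3, 2]                   # n-dim input, two hidden layers, k = 2 classes
Ws = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(3)]
bs = [np.zeros(sizes[i + 1]) for i in range(3)]
hs, As, y_hat = forward_propagation(Ws, bs, rng.normal(size=4))
print(y_hat, y_hat.sum())              # a valid probability distribution; sum ≈ 1.0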
Just do a forward propagation and compute all h_i's, a_i's, and ŷ.

Algorithm: back_propagation(h_1, h_2, ..., h_{L−1}, a_1, a_2, ..., a_L, y, ŷ)
    // Compute output gradient;
    ∇_{a_L} L(θ) = −(e(y) − ŷ);
    for k = L to 1 do
        // Compute gradients w.r.t. parameters;
        ∇_{W_k} L(θ) = ∇_{a_k} L(θ) h_{k−1}^T;
        ∇_{b_k} L(θ) = ∇_{a_k} L(θ);
        // Compute gradients w.r.t. layer below;
        ∇_{h_{k−1}} L(θ) = W_k^T (∇_{a_k} L(θ));
        // Compute gradients w.r.t. layer below (pre-activation);
        ∇_{a_{k−1}} L(θ) = ∇_{h_{k−1}} L(θ) ⊙ [..., g'(a_{k−1,j}), ...];
    end

54/100
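A minimal NumPy sketch of this backward pass, again assuming a logistic g and a softmax + cross-entropy output; the helper names, sizes, and values are illustrative. With the logistic g, g'(a_{k−1}) = h_{k−1} ⊙ (1 − h_{k−1}), so the stored a_k's are not needed in this particular sketch.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def back_propagation(Ws, hs, x, y, y_hat):
    # Gradients of L(θ) = -log ŷ_y w.r.t. every W_k and b_k, following the pseudo code above.
    L = len(Ws)
    grad_Ws, grad_bs = [None] * L, [None] * L
    grad_a = y_hat.copy(); grad_a[y] -= 1.0       # ∇_{a_L} L(θ) = -(e(y) - ŷ)
    for k in range(L - 1, -1, -1):                # k = L, ..., 1 (0-indexed here)
        h_below = hs[k - 1] if k > 0 else x       # h_{k-1}, with h_0 = x
        grad_Ws[k] = np.outer(grad_a, h_below)    # ∇_{W_k} L(θ) = ∇_{a_k} L(θ) h_{k-1}^T
        grad_bs[k] = grad_a.copy()                # ∇_{b_k} L(θ) = ∇_{a_k} L(θ)
        if k > 0:
            grad_h = Ws[k].T @ grad_a             # ∇_{h_{k-1}} L(θ)
            grad_a = grad_h * (hs[k - 1] * (1.0 - hs[k - 1]))   # ∇_{a_{k-1}} L(θ)
    return grad_Ws, grad_bs

# tiny usage example: a forward pass (as in the earlier sketch), then the backward pass
rng = np.random.default_rng(0)
sizes = [4, 3, 3, 2]
Ws = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(3)]
bs = [np.zeros(sizes[i + 1]) for i in range(3)]
x, y = rng.normal(size=4), 1

h, hs = x, []
for k in range(2):
    h = sigmoid(bs[k] + Ws[k] @ h); hs.append(h)
y_hat = softmax(bs[2] + Ws[2] @ hs[-1])

grad_Ws, grad_bs = back_propagation(Ws, hs, x, y, y_hat)
print([g.shape for g in grad_Ws], grad_bs[-1])    # shapes match the W_k's; ∇_{b_L} = ŷ - e(y)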
Module 4.9: Derivative of the activation function

55/100
Now, the only thing we need to figure out is how to compute g'.

Logistic function:

g(z) = σ(z) = 1 / (1 + e^{−z})

g'(z) = (−1) · 1/(1 + e^{−z})^2 · d/dz (1 + e^{−z})
      = (−1) · 1/(1 + e^{−z})^2 · (−e^{−z})
      = (1 / (1 + e^{−z})) · ((1 + e^{−z} − 1) / (1 + e^{−z}))
      = g(z) (1 − g(z))

tanh:

g(z) = tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z})

g'(z) = [ (e^z + e^{−z}) d/dz (e^z − e^{−z}) − (e^z − e^{−z}) d/dz (e^z + e^{−z}) ] / (e^z + e^{−z})^2
      = [ (e^z + e^{−z})^2 − (e^z − e^{−z})^2 ] / (e^z + e^{−z})^2
      = 1 − (e^z − e^{−z})^2 / (e^z + e^{−z})^2
      = 1 − (g(z))^2

56/100
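Both derivatives can be written purely in terms of the function values, which is what the backward pass exploits. A minimal sketch that checks the two closed forms against central finite differences (the grid of z values and tolerances are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3.0, 3.0, 7)

# derivatives in terms of the function values themselves, as derived above
sigmoid_prime = sigmoid(z) * (1.0 - sigmoid(z))   # g'(z) = g(z)(1 - g(z))
tanh_prime = 1.0 - np.tanh(z) ** 2                # g'(z) = 1 - (g(z))^2

# numerical check via central differences
eps = 1e-6
print(np.allclose(sigmoid_prime, (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps), atol=1e-6))  # True
print(np.allclose(tanh_prime, (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps), atol=1e-6))     # True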
