Lecture 4
Mitesh M. Khapra
References/Acknowledgments
See the excellent videos by Hugo Larochelle on Backpropagation
Module 4.1: Feedforward Neural Networks (a.k.a. multilayered network of neurons)
[Figure: a feedforward network with inputs x1, x2, …, xn; pre-activations a1, a2, a3; activations h1, h2; weights W1, W2, W3 with biases b1, b2, b3; and output hL = ŷ = f(x)]

• The input to the network is an n-dimensional vector
• The network contains L − 1 hidden layers (2, in this case) having n neurons each
• Finally, there is one output layer containing k neurons (say, corresponding to k classes)
• Each neuron in the hidden layers and the output layer can be split into two parts: pre-activation and activation (ai and hi are vectors)
• The input layer can be called the 0-th layer and the output layer can be called the L-th layer
• Wi ∈ R^(n×n) and bi ∈ R^n are the weight and bias between layers i − 1 and i (0 < i < L)
• WL ∈ R^(n×k) and bL ∈ R^k are the weight and bias between the last hidden layer and the output layer (L = 3 in this case)
• The pre-activation at layer i is given by

  ai = bi + Wi hi−1

• The activation at layer i is given by

  hi = g(ai)

  where g is called the activation function (for example, logistic, tanh, linear, etc.)
• The activation at the output layer is given by

  f(x) = hL = O(aL)

  where O is the output activation function (for example, softmax, linear, etc.)
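The layer equations above can be sketched in code. This is a minimal illustration under assumed choices (tanh for g, identity for O), not the lecture's own implementation; note that because it computes W @ h, the output-layer weight is stored as a k × n matrix here.

```python
import numpy as np

def forward(x, weights, biases, g=np.tanh, O=lambda a: a):
    """Forward pass: a_i = b_i + W_i h_{i-1}, h_i = g(a_i), output h_L = O(a_L)."""
    h = x
    for i, (W, b) in enumerate(zip(weights, biases), start=1):
        a = b + W @ h                                # pre-activation at layer i
        h = g(a) if i < len(weights) else O(a)       # output layer uses O instead of g
    return h

# Toy network: n = 4 inputs, two hidden layers of n neurons, k = 2 outputs (L = 3)
rng = np.random.default_rng(0)
n, k = 4, 2
weights = [rng.standard_normal((n, n)), rng.standard_normal((n, n)),
           rng.standard_normal((k, n))]
biases = [rng.standard_normal(n), rng.standard_normal(n), rng.standard_normal(k)]
x = rng.standard_normal(n)
y_hat = forward(x, weights, biases)
print(y_hat.shape)  # (2,)
```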
• Data: {xi, yi}, i = 1, …, N
• Model:

  ŷi = f(xi) = O(W3 g(W2 g(W1 xi + b1) + b2) + b3)

• Objective: minimize L(θ), where L(θ) is some function of the parameters
Module 4.2: Learning Parameters of Feedforward Neural Networks (Intuition)
The story so far...
• We have introduced feedforward neural networks
• We are now interested in finding an algorithm for learning the parameters of this model
• Recall our gradient descent algorithm

Algorithm: gradient_descent()
  t ← 0;
  max_iterations ← 1000;
  Initialize w0, b0;
  while t++ < max_iterations do
    wt+1 ← wt − η∇wt;
    bt+1 ← bt − η∇bt;
  end

• We can write it more concisely as

Algorithm: gradient_descent()
  t ← 0;
  max_iterations ← 1000;
  Initialize θ0 = [w0, b0];
  while t++ < max_iterations do
    θt+1 ← θt − η∇θt;
  end

• where ∇θt = [∂L(θ)/∂wt, ∂L(θ)/∂bt]^T
• Now, in this feedforward neural network, instead of θ = [w, b] we have θ = [W1, W2, …, WL, b1, b2, …, bL]
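The concise update θt+1 ← θt − η∇θt can be sketched on a toy one-parameter problem where the gradient is known in closed form; the quadratic loss L(θ) = (θ − 3)² and the learning rate η = 0.1 are assumptions chosen purely for illustration.

```python
# Minimal sketch of theta_{t+1} = theta_t - eta * grad(theta_t)
# on L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
def grad(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0            # "Initialize theta_0"
eta = 0.1              # learning rate (assumed value)
max_iterations = 1000
for t in range(max_iterations):
    theta = theta - eta * grad(theta)

print(theta)  # converges to the minimizer, 3.0
```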
• Except that now our ∇θ looks much more nasty:

  ∇θ = [ ∂L(θ)/∂W111 … ∂L(θ)/∂W11n
         ∂L(θ)/∂W121 … ∂L(θ)/∂W12n
         ⋮                  ⋮
         ∂L(θ)/∂W1n1 … ∂L(θ)/∂W1nn
         … ]

• ∇θ is thus composed of
  ∇W1, ∇W2, …, ∇WL−1 ∈ R^(n×n), ∇WL ∈ R^(n×k),
  ∇b1, ∇b2, …, ∇bL−1 ∈ R^n and ∇bL ∈ R^k
We need to answer two questions
• How to choose the loss function L(θ)?
• How to compute ∇θ, which is composed of
  ∇W1, ∇W2, …, ∇WL−1 ∈ R^(n×n), ∇WL ∈ R^(n×k),
  ∇b1, ∇b2, …, ∇bL−1 ∈ R^n and ∇bL ∈ R^k?
Module 4.3: Output Functions and Loss Functions
We need to answer two questions
• How to choose the loss function L(θ)?
• How to compute ∇θ, which is composed of
  ∇W1, ∇W2, …, ∇WL−1 ∈ R^(n×n), ∇WL ∈ R^(n×k),
  ∇b1, ∇b2, …, ∇bL−1 ∈ R^n and ∇bL ∈ R^k?
[Figure: a neural network with L − 1 hidden layers; inputs xi include features such as isActor Damon, isDirector Nolan, …; target yi = {7.5, 8.2, 7.7} (imdb Rating, Critics Rating, RT Rating)]

• The choice of loss function depends on the problem at hand
• We will illustrate this with the help of two examples
• Consider our movie example again, but this time we are interested in predicting ratings
• Here yi ∈ R^3
• The loss function should capture how much ŷi deviates from yi
• If yi ∈ R^n then the squared error loss can capture this deviation:

  L(θ) = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{3} (ŷij − yij)²
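The squared error loss above can be written out directly. The target vector below is the slide's movie-ratings example; the prediction ŷ is a hypothetical value made up for illustration.

```python
import numpy as np

def squared_error(y_hat, y):
    """L(theta) = (1/N) * sum_i sum_j (y_hat_ij - y_ij)^2."""
    return np.sum((y_hat - y) ** 2) / y.shape[0]

# Target from the slide (imdb, critics, RT ratings) and a hypothetical prediction
y = np.array([[7.5, 8.2, 7.7]])
y_hat = np.array([[7.0, 8.0, 8.0]])
print(squared_error(y_hat, y))  # 0.25 + 0.04 + 0.09 ≈ 0.38
```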
• A related question: what should the output function ‘O’ be if yi ∈ R?
• More specifically, can it be the logistic function?
• No, because it restricts ŷi to a value between 0 & 1, but we want ŷi ∈ R
• So, in such cases it makes sense to have ‘O’ be a linear function:

  f(x) = hL = O(aL) = WO aL + bO
[Figure: a neural network with L − 1 hidden layers; target y = [1 0 0 0] over the classes Apple, Mango, Orange, Banana]

• Now let us consider another problem for which a different loss function would be appropriate
• Suppose we want to classify an image into 1 of k classes
• Here again we could use the squared error loss to capture the deviation
• But can you think of a better function?
• Notice that y is a probability distribution
• Therefore we should also ensure that ŷ is a probability distribution
• What choice of the output activation ‘O’ will ensure this?

  aL = WL hL−1 + bL

  ŷj = O(aL)j = e^(aL,j) / Σ_{i=1}^{k} e^(aL,i)

• This function is called the softmax function
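A sketch of the softmax output activation above; subtracting max(a) before exponentiating is a standard numerical-stability trick that does not change the result, and the input vector below is an arbitrary example.

```python
import numpy as np

def softmax(a):
    """y_hat_j = exp(a_j) / sum_i exp(a_i), computed stably."""
    e = np.exp(a - np.max(a))   # shift by max(a) to avoid overflow
    return e / np.sum(e)

a_L = np.array([2.0, 1.0, 0.1, -1.0])   # arbitrary pre-activations
y_hat = softmax(a_L)
print(np.isclose(y_hat.sum(), 1.0))  # True: a valid probability distribution
```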
• Now that we have ensured that both y & ŷ are probability distributions, can you think of a function which captures the difference between them?
• Cross-entropy:

  L(θ) = − Σ_{c=1}^{k} yc log ŷc

• Notice that yc = 1 if c = ℓ (the true class) and 0 otherwise, so the sum reduces to L(θ) = − log ŷℓ
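The cross-entropy loss above can be sketched directly; with a one-hot y, only the true class contributes. The y below is the slide's Apple example, while the predicted distribution ŷ is a hypothetical value for illustration.

```python
import numpy as np

def cross_entropy(y, y_hat):
    """L(theta) = -sum_c y_c * log(y_hat_c); equals -log(y_hat_l) for one-hot y."""
    return -np.sum(y * np.log(y_hat))

y = np.array([1.0, 0.0, 0.0, 0.0])        # true class: Apple
y_hat = np.array([0.4, 0.3, 0.2, 0.1])    # hypothetical network output
loss = cross_entropy(y, y_hat)
print(np.isclose(loss, -np.log(0.4)))  # True: only the true class contributes
```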
• So, for a classification problem (where you have to choose 1 of k classes), we use the following objective function:

  minimize_θ L(θ) = − log ŷℓ
  or maximize_θ −L(θ) = log ŷℓ

• But wait! Is ŷℓ a function of θ = [W1, W2, …, WL, b1, b2, …, bL]?
• Yes, it is indeed a function of θ
• log ŷℓ is called the log-likelihood of the data.
Outputs             Real values      Probabilities
Output activation   Linear           Softmax
Loss function       Squared error    Cross-entropy

• Of course, there could be other loss functions depending on the problem at hand, but the two loss functions that we just saw are encountered very often
• For the rest of this lecture we will focus on the case where the output activation is a softmax function and the loss function is cross-entropy
Module 4.4: Backpropagation (Intuition)
We need to answer two questions
• How to choose the loss function L(θ)?
• How to compute ∇θ, which is composed of
  ∇W1, ∇W2, …, ∇WL−1 ∈ R^(n×n), ∇WL ∈ R^(n×k),
  ∇b1, ∇b2, …, ∇bL−1 ∈ R^n and ∇bL ∈ R^k?
[Figure: the network with the weight W112 in the first layer highlighted]

• Let us focus on this one weight (W112)
• To learn this weight using SGD we need a formula for ∂L(θ)/∂W112
• We will see how to calculate this

Algorithm: gradient_descent()
  t ← 0;
  max_iterations ← 1000;
  Initialize θ0;
  while t++ < max_iterations do
    θt+1 ← θt − η∇θt;
  end
[Figure: a deep but thin network — a chain x1 → a11 → h11 → a21 → h21 → aL1 → ŷ = f(x) → L(θ), with weights W111, W211, WL11]

• First let us take the simple case when we have a deep but thin network
• In this case it is easy to find the derivative by the chain rule:

  ∂L(θ)/∂W111 = (∂L(θ)/∂ŷ)(∂ŷ/∂aL1)(∂aL1/∂h21)(∂h21/∂a21)(∂a21/∂h11)(∂h11/∂a11)(∂a11/∂W111)

  ∂L(θ)/∂W111 = (∂L(θ)/∂h11)(∂h11/∂W111)   (just compressing the chain rule)
  ∂L(θ)/∂W211 = (∂L(θ)/∂h21)(∂h21/∂W211)
  ∂L(θ)/∂WL11 = (∂L(θ)/∂aL1)(∂aL1/∂WL11)
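The chain rule above can be checked numerically on a one-neuron-per-layer network. The concrete choices below (tanh activations, a linear output, squared error loss, and arbitrary weight values) are assumptions for illustration; the analytic product of local derivatives is compared against a central finite difference.

```python
import numpy as np

# Thin network: a1 = W1*x, h1 = g(a1), a2 = W2*h1, h2 = g(a2),
# y_hat = WL*h2 (linear output), loss = 0.5*(y_hat - y)^2.
g = np.tanh
dg = lambda a: 1.0 - np.tanh(a) ** 2   # g'(a) for tanh

x, y = 0.5, 1.0
W1, W2, WL = 0.7, -1.3, 0.9            # arbitrary weights

def loss(w1):
    h1 = g(w1 * x)
    h2 = g(W2 * h1)
    return 0.5 * (WL * h2 - y) ** 2

# Chain rule: dL/dW1 = dL/dyhat * dyhat/dh2 * dh2/da2 * da2/dh1 * dh1/da1 * da1/dW1
a1 = W1 * x; h1 = g(a1); a2 = W2 * h1; h2 = g(a2); y_hat = WL * h2
analytic = (y_hat - y) * WL * dg(a2) * W2 * dg(a1) * x

eps = 1e-6
numeric = (loss(W1 + eps) - loss(W1 - eps)) / (2 * eps)  # central difference
print(np.isclose(analytic, numeric))  # True
```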
Let us see an intuitive explanation of backpropagation before we get into the
mathematical details
• We get a certain loss at the output and we try to figure out who is responsible for this loss
• So, we talk to the output layer and say “Hey! You are not producing the desired output, better take responsibility.”
• The output layer says “Well, I take responsibility for my part, but please understand that I am only as good as the hidden layer and weights below me.” After all,

  f(x) = ŷ = O(WL hL−1 + bL)
• So, we talk to WL, bL and hL and ask them “What is wrong with you?”
• WL and bL take full responsibility, but hL says “Well, please understand that I am only as good as the pre-activation layer”
• The pre-activation layer in turn says that it is only as good as the hidden layer and weights below it
• We continue in this manner and realize that the responsibility lies with all the weights and biases (i.e., all the parameters of the model)
• But instead of talking to them directly, it is easier to talk to them through the hidden layers and output layers (and this is exactly what the chain rule allows us to do):

  ∂L(θ)/∂W111 = [∂L(θ)/∂ŷ · ∂ŷ/∂a3] · [∂a3/∂h2 · ∂h2/∂a2] · [∂a2/∂h1 · ∂h1/∂a1] · [∂a1/∂W111]

  (talk to the weight directly) = (talk to the output layer) · (talk to the previous hidden layer) · (talk to the previous hidden layer) · (and now talk to the weights)
Quantities of interest (roadmap for the remaining part):
• Gradient w.r.t. output units
• Gradient w.r.t. hidden units
• Gradient w.r.t. weights and biases
Module 4.5: Backpropagation: Computing Gradients w.r.t. the Output Units
Quantities of interest (roadmap for the remaining part):
• Gradient w.r.t. output units
• Gradient w.r.t. hidden units
• Gradient w.r.t. weights
Let us first consider the partial derivative of the loss L(θ) = − log ŷℓ w.r.t. the i-th output
∂/∂ŷi (L(θ)) = ∂/∂ŷi (− log ŷℓ) = −1(ℓ=i)/ŷℓ

∇ŷ L(θ) = [∂L(θ)/∂ŷ1, …, ∂L(θ)/∂ŷk]^T = −(1/ŷℓ) [1(ℓ=1), 1(ℓ=2), …, 1(ℓ=k)]^T = −(1/ŷℓ) eℓ

where eℓ is a k-dimensional vector whose ℓ-th element is 1 and all other elements are 0
What we are actually interested in is

  ∂L(θ)/∂aLi = ∂/∂aLi (− log ŷℓ)

where

  ŷℓ = exp(aLℓ) / Σ_i exp(aLi)

Having established this, we will now derive the full expression on the next slide
(Using the quotient rule: ∂/∂x [g(x)/h(x)] = (∂g(x)/∂x) · (1/h(x)) − (g(x)/h(x)²) · (∂h(x)/∂x))

∂/∂aLi (− log ŷℓ)
  = −(1/ŷℓ) · ∂ŷℓ/∂aLi
  = −(1/ŷℓ) · ∂/∂aLi softmax(aL)ℓ
  = −(1/ŷℓ) · ∂/∂aLi [exp(aL)ℓ / Σ_i′ exp(aL)i′]
  = −(1/ŷℓ) · [ (∂/∂aLi exp(aL)ℓ) / Σ_i′ exp(aL)i′ − exp(aL)ℓ · (∂/∂aLi Σ_i′ exp(aL)i′) / (Σ_i′ exp(aL)i′)² ]
  = −(1/ŷℓ) · [ 1(ℓ=i) exp(aL)ℓ / Σ_i′ exp(aL)i′ − (exp(aL)ℓ / Σ_i′ exp(aL)i′) · (exp(aL)i / Σ_i′ exp(aL)i′) ]
  = −(1/ŷℓ) · [ 1(ℓ=i) softmax(aL)ℓ − softmax(aL)ℓ · softmax(aL)i ]
  = −(1/ŷℓ) · [ 1(ℓ=i) ŷℓ − ŷℓ ŷi ]
  = −(1(ℓ=i) − ŷi)
So far we have derived the partial derivative w.r.t. the i-th element of a_L:

∂L(θ)/∂a_{L,i} = −(1(ℓ=i) − ŷ_i)

We can now write the gradient w.r.t. the vector a_L:

∇_{a_L} L(θ) = [∂L(θ)/∂a_{L,1}, …, ∂L(θ)/∂a_{L,k}]ᵀ = [−(1(ℓ=1) − ŷ_1), −(1(ℓ=2) − ŷ_2), …, −(1(ℓ=k) − ŷ_k)]ᵀ = −(e(ℓ) − ŷ)
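The result ∇_{a_L} L(θ) = −(e(ℓ) − ŷ) can be checked numerically. The following sketch (not part of the lecture) compares the derived formula against central finite differences of − log ŷ_ℓ:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def loss(a_L, ell):
    # cross-entropy loss L(theta) = -log y_hat_ell
    return -np.log(softmax(a_L)[ell])

a_L = np.array([0.5, -1.2, 2.0, 0.3])
ell = 2                          # index of the true class
y_hat = softmax(a_L)
e_ell = np.eye(len(a_L))[ell]    # one-hot vector e(ell)

grad_analytic = -(e_ell - y_hat)  # the formula derived above

# central finite differences, one component at a time
eps = 1e-6
grad_numeric = np.zeros_like(a_L)
for i in range(len(a_L)):
    d = np.zeros_like(a_L); d[i] = eps
    grad_numeric[i] = (loss(a_L + d, ell) - loss(a_L - d, ell)) / (2 * eps)
```

The two gradients agree to within finite-difference precision, confirming the derivation component by component.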
Module 4.6: Backpropagation: Computing Gradients w.r.t. Hidden Units
Quantities of interest (roadmap for the remaining part):
• Gradient w.r.t. output units
• Gradient w.r.t. hidden units
• Gradient w.r.t. weights and biases
Chain rule along multiple paths: if a function p(z) can be written as a function of intermediate results q_i(z), then we have:

∂p(z)/∂z = Σ_m (∂p(z)/∂q_m(z)) · (∂q_m(z)/∂z)

In our case:
• p(z) is the loss function L(θ)
• z = h_{ij}
• q_m(z) = a_{L,m}
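The multi-path chain rule is easy to verify on a toy function (an illustration, not from the lecture). Here p(z) = q₁(z) · q₂(z) with the hypothetical intermediates q₁(z) = z² and q₂(z) = sin z, so the total derivative is the sum of the contributions along both paths:

```python
import math

z = 0.7
q1, q2 = z**2, math.sin(z)

# dp/dz = (dp/dq1)*(dq1/dz) + (dp/dq2)*(dq2/dz), since p = q1*q2
dp_dq1, dp_dq2 = q2, q1
dq1_dz, dq2_dz = 2 * z, math.cos(z)
chain_rule = dp_dq1 * dq1_dz + dp_dq2 * dq2_dz

# compare against a direct finite-difference derivative of p(z)
p = lambda z: z**2 * math.sin(z)
eps = 1e-6
numeric = (p(z + eps) - p(z - eps)) / (2 * eps)
```

Summing over every path through which z influences p is exactly what backpropagation does when a hidden unit feeds multiple neurons in the next layer.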
∂L(θ)/∂h_{ij} = Σ_{m=1}^{k} (∂L(θ)/∂a_{i+1,m}) · (∂a_{i+1,m}/∂h_{ij})
             = Σ_{m=1}^{k} (∂L(θ)/∂a_{i+1,m}) · W_{i+1,m,j}        [∵ a_{i+1} = W_{i+1} h_i + b_{i+1}]

Now consider these two vectors,

∇_{a_{i+1}} L(θ) = [∂L(θ)/∂a_{i+1,1}, …, ∂L(θ)/∂a_{i+1,k}]ᵀ ;    W_{i+1,·,j} = [W_{i+1,1,j}, …, W_{i+1,k,j}]ᵀ

W_{i+1,·,j} is the j-th column of W_{i+1}; see that,

(W_{i+1,·,j})ᵀ ∇_{a_{i+1}} L(θ) = Σ_{m=1}^{k} W_{i+1,m,j} · ∂L(θ)/∂a_{i+1,m}
We have, ∂L(θ)/∂h_{ij} = (W_{i+1,·,j})ᵀ ∇_{a_{i+1}} L(θ)

Collecting these components over j = 1, …, n gives the gradient w.r.t. the vector h_i:

∇_{h_i} L(θ) = (W_{i+1})ᵀ (∇_{a_{i+1}} L(θ))
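A small NumPy sketch (not from the lecture) showing that the per-component column dot products and the single matrix-transpose product ∇_{h_i} L(θ) = W_{i+1}ᵀ ∇_{a_{i+1}} L(θ) agree; W and grad_a are random stand-ins for W_{i+1} and ∇_{a_{i+1}} L(θ):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 4, 3                      # a_{i+1} in R^k, h_i in R^n
W = rng.standard_normal((k, n))  # stand-in for W_{i+1}
grad_a = rng.standard_normal(k)  # stand-in for nabla_{a_{i+1}} L(theta)

# component j: dL/dh_{ij} = (j-th column of W)^T  grad_a
grad_h_components = np.array([W[:, j] @ grad_a for j in range(n)])

# vectorised form: nabla_{h_i} L = W^T  nabla_{a_{i+1}} L
grad_h_vector = W.T @ grad_a
```

The vectorised form is what the back-propagation pseudocode later in this module uses.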
∇_{a_i} L(θ) = [∂L(θ)/∂a_{i1}, …, ∂L(θ)/∂a_{in}]ᵀ

∂L(θ)/∂a_{ij} = (∂L(θ)/∂h_{ij}) · (∂h_{ij}/∂a_{ij}) = (∂L(θ)/∂h_{ij}) · g′(a_{ij})      [∵ h_{ij} = g(a_{ij})]

∇_{a_i} L(θ) = [(∂L(θ)/∂h_{i1}) g′(a_{i1}), …, (∂L(θ)/∂h_{in}) g′(a_{in})]ᵀ = ∇_{h_i} L(θ) ⊙ […, g′(a_{ik}), …]

where ⊙ denotes the elementwise (Hadamard) product.
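The elementwise step can be checked numerically as well. In this sketch (an illustration, assuming g is the logistic function) grad_h stands in for ∇_{h_i} L(θ), and one component of the result is verified against a finite-difference derivative of g:

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
a_i = rng.standard_normal(5)      # pre-activations of layer i
grad_h = rng.standard_normal(5)   # stand-in for nabla_{h_i} L(theta)

# h_{ij} = g(a_{ij}) with g = sigma, so g'(a) = sigma(a)(1 - sigma(a));
# nabla_{a_i} L = nabla_{h_i} L  (elementwise)  g'(a_i)
g_prime = sigma(a_i) * (1 - sigma(a_i))
grad_a = grad_h * g_prime

# finite-difference check of dh_{ij}/da_{ij} for one component j
j, eps = 2, 1e-6
dh_da = (sigma(a_i[j] + eps) - sigma(a_i[j] - eps)) / (2 * eps)
```

Since the activation acts componentwise, no Jacobian matrix is needed here, only an elementwise product.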
Module 4.7: Backpropagation: Computing Gradients w.r.t. Parameters
Quantities of interest (roadmap for the remaining part):
• Gradient w.r.t. output units
• Gradient w.r.t. hidden units
• Gradient w.r.t. weights and biases
Recall that,

a_k = b_k + W_k h_{k−1}

so

∂a_{ki}/∂W_{kij} = h_{k−1,j}

∂L(θ)/∂W_{kij} = (∂L(θ)/∂a_{ki}) · (∂a_{ki}/∂W_{kij}) = (∂L(θ)/∂a_{ki}) · h_{k−1,j}

∇_{W_k} L(θ) is the matrix whose (i, j)-th entry is ∂L(θ)/∂W_{kij}:

∇_{W_k} L(θ) = [ ∂L(θ)/∂W_{k11}  ∂L(θ)/∂W_{k12}  …  ∂L(θ)/∂W_{k1n} ;  …  ;  …  ∂L(θ)/∂W_{knn} ]
Let's take a simple example of a W_k ∈ R^{3×3} and see what each entry looks like:

∇_{W_k} L(θ) = [ ∂L(θ)/∂W_{k11}  ∂L(θ)/∂W_{k12}  ∂L(θ)/∂W_{k13} ;  ∂L(θ)/∂W_{k21}  ∂L(θ)/∂W_{k22}  ∂L(θ)/∂W_{k23} ;  ∂L(θ)/∂W_{k31}  ∂L(θ)/∂W_{k32}  ∂L(θ)/∂W_{k33} ]

and each entry is

∂L(θ)/∂W_{kij} = (∂L(θ)/∂a_{ki}) · (∂a_{ki}/∂W_{kij}) = (∂L(θ)/∂a_{ki}) · h_{k−1,j}

so the whole matrix is the outer product ∇_{W_k} L(θ) = ∇_{a_k} L(θ) · h_{k−1}ᵀ.
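The 3×3 example can be reproduced in NumPy (a sketch, not lecture code): filling the matrix entry by entry with (∂L/∂a_{ki}) · h_{k−1,j} gives the same result as a single outer product, with grad_a and h_prev as random stand-ins for ∇_{a_k} L(θ) and h_{k−1}:

```python
import numpy as np

rng = np.random.default_rng(2)
grad_a = rng.standard_normal(3)   # stand-in for nabla_{a_k} L(theta)
h_prev = rng.standard_normal(3)   # stand-in for h_{k-1}

# entrywise: dL/dW_{kij} = (dL/da_{ki}) * h_{k-1,j}
grad_W_entries = np.array([[grad_a[i] * h_prev[j] for j in range(3)]
                           for i in range(3)])

# the whole 3x3 matrix at once: outer product nabla_{a_k} L . h_{k-1}^T
grad_W = np.outer(grad_a, h_prev)
```

The outer-product form is the one used in the back-propagation pseudocode below.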
Finally, coming to the biases:

a_{ki} = b_{ki} + Σ_j W_{kij} h_{k−1,j}

∂L(θ)/∂b_{ki} = (∂L(θ)/∂a_{ki}) · (∂a_{ki}/∂b_{ki}) = ∂L(θ)/∂a_{ki}

We can now write the gradient w.r.t. the vector b_k:

∇_{b_k} L(θ) = [∂L(θ)/∂a_{k1}, ∂L(θ)/∂a_{k2}, …, ∂L(θ)/∂a_{kn}]ᵀ = ∇_{a_k} L(θ)
Module 4.8: Backpropagation: Pseudo code
Finally, we have all the pieces of the puzzle
Algorithm: gradient_descent()
t ← 0;
max_iterations ← 1000;
Initialize θ⁰ = [W₁⁰, …, W_L⁰, b₁⁰, …, b_L⁰];
while t++ < max_iterations do
    h₁, h₂, …, h_{L−1}, a₁, a₂, …, a_L, ŷ = forward_propagation(θᵗ);
    ∇θᵗ = backward_propagation(h₁, h₂, …, h_{L−1}, a₁, a₂, …, a_L, y, ŷ);
    θᵗ⁺¹ ← θᵗ − η∇θᵗ;
end
Algorithm: forward_propagation(θ)
for k = 1 to L − 1 do
    a_k = b_k + W_k h_{k−1};      // with h₀ = x, the input
    h_k = g(a_k);
end
a_L = b_L + W_L h_{L−1};
ŷ = O(a_L);
Just do a forward propagation and compute all h_i’s, a_i’s, and ŷ.

Algorithm: back_propagation(h₁, h₂, …, h_{L−1}, a₁, a₂, …, a_L, y, ŷ)
// Compute output gradient;
∇_{a_L} L(θ) = −(e(y) − ŷ);
for k = L to 1 do
    // Compute gradients w.r.t. parameters;
    ∇_{W_k} L(θ) = ∇_{a_k} L(θ) h_{k−1}ᵀ;
    ∇_{b_k} L(θ) = ∇_{a_k} L(θ);
    // Compute gradients w.r.t. layer below;
    ∇_{h_{k−1}} L(θ) = W_kᵀ (∇_{a_k} L(θ));
    // Compute gradients w.r.t. layer below (pre-activation);
    ∇_{a_{k−1}} L(θ) = ∇_{h_{k−1}} L(θ) ⊙ […, g′(a_{k−1,j}), …];
end
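The two algorithms above can be sketched end to end in NumPy. This is an illustrative implementation under stated assumptions (logistic hidden activations, softmax output, a tiny 3-3-2 network with hand-picked shapes), not the lecture's own code; the final lines verify one weight gradient against finite differences:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward_propagation(x, W, b):
    # W, b are lists indexed 1..L (index 0 unused); h[0] is the input x
    L = len(W) - 1
    a, h = [None] * (L + 1), [None] * (L + 1)
    h[0] = x
    for k in range(1, L):
        a[k] = b[k] + W[k] @ h[k - 1]
        h[k] = sigma(a[k])
    a[L] = b[L] + W[L] @ h[L - 1]
    y_hat = softmax(a[L])                 # output function O = softmax
    return h, a, y_hat

def back_propagation(h, a, y, y_hat, W):
    L = len(W) - 1
    e_y = np.eye(len(y_hat))[y]
    grad_a = -(e_y - y_hat)               # output gradient
    dW, db = [None] * (L + 1), [None] * (L + 1)
    for k in range(L, 0, -1):
        dW[k] = np.outer(grad_a, h[k - 1])  # gradients w.r.t. parameters
        db[k] = grad_a
        if k > 1:
            grad_h = W[k].T @ grad_a         # gradient w.r.t. layer below
            grad_a = grad_h * sigma(a[k - 1]) * (1 - sigma(a[k - 1]))
    return dW, db

# tiny network: n = 3 inputs, one hidden layer of 3, k = 2 classes (L = 2)
rng = np.random.default_rng(3)
W = [None, rng.standard_normal((3, 3)), rng.standard_normal((2, 3))]
b = [None, rng.standard_normal(3), rng.standard_normal(2)]
x, y = rng.standard_normal(3), 1

h, a, y_hat = forward_propagation(x, W, b)
dW, db = back_propagation(h, a, y, y_hat, W)

# finite-difference check on one weight W_{1,0,2}
eps, (i, j) = 1e-6, (0, 2)
Wp = [None] + [w.copy() for w in W[1:]]
Wp[1][i, j] += eps
lp = -np.log(forward_propagation(x, Wp, b)[2][y])
Wp[1][i, j] -= 2 * eps
lm = -np.log(forward_propagation(x, Wp, b)[2][y])
numeric = (lp - lm) / (2 * eps)
```

Within the gradient-descent loop one would then update each W_k and b_k by subtracting η times dW[k] and db[k], exactly as in gradient_descent() above.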
Module 4.9: Derivative of the activation function
Now, the only thing we need to figure out is how to compute g′.

Logistic function:

g(z) = σ(z) = 1/(1 + e⁻ᶻ)

g′(z) = (−1) · (1/(1 + e⁻ᶻ)²) · d/dz (1 + e⁻ᶻ)
      = (−1) · (1/(1 + e⁻ᶻ)²) · (−e⁻ᶻ)
      = (1/(1 + e⁻ᶻ)) · ((1 + e⁻ᶻ − 1)/(1 + e⁻ᶻ))
      = g(z)(1 − g(z))

tanh function:

g(z) = tanh(z) = (eᶻ − e⁻ᶻ)/(eᶻ + e⁻ᶻ)

g′(z) = [ (eᶻ + e⁻ᶻ) · d/dz (eᶻ − e⁻ᶻ) − (eᶻ − e⁻ᶻ) · d/dz (eᶻ + e⁻ᶻ) ] / (eᶻ + e⁻ᶻ)²
      = [ (eᶻ + e⁻ᶻ)² − (eᶻ − e⁻ᶻ)² ] / (eᶻ + e⁻ᶻ)²
      = 1 − (eᶻ − e⁻ᶻ)²/(eᶻ + e⁻ᶻ)²
      = 1 − (g(z))²
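Both derivative identities are easy to confirm numerically. A short sketch (not part of the lecture) comparing each closed form against central finite differences over a grid of z values:

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3, 3, 31)
eps = 1e-6

# logistic: g'(z) = g(z)(1 - g(z))
num_sigma = (sigma(z + eps) - sigma(z - eps)) / (2 * eps)
ana_sigma = sigma(z) * (1 - sigma(z))

# tanh: g'(z) = 1 - g(z)^2
num_tanh = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)
ana_tanh = 1 - np.tanh(z) ** 2
```

These closed forms are why the two activations are convenient in backpropagation: g′ is computed from the already-stored activation values, with no extra exponentials.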