
Advanced Machine Learning

Backpropagation
Amit Sethi
Electrical Engineering, IIT Bombay
Learning objectives
• Write the derivative of a nested function using the chain rule
• Articulate how storage of partial derivatives leads to an efficient gradient descent for neural networks
• Write gradient descent as matrix operations

Overall function of a neural network
• $f(\mathbf{x}_i) = g_l(\mathbf{W}_l \, g_{l-1}(\mathbf{W}_{l-1} \cdots g_1(\mathbf{W}_1 \mathbf{x}_i) \cdots))$
• The weights of each layer form a matrix
• The outputs of the previous layer form a vector
• The activation (nonlinear) function is applied point-wise to the product of the weight matrix and the input vector
• Design questions (hyper-parameters):
– Number of layers
– Number of neurons in each layer (rows of the weight matrices)
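As an illustration of the nested function above, here is a minimal NumPy sketch of a two-layer forward pass; the layer sizes, random weights, and the choice of ReLU and identity activations are assumptions made only for this example.

```python
import numpy as np

def relu(z):
    # Point-wise (element-wise) nonlinearity
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
W1 = rng.standard_normal((5, 4))   # 4 inputs -> 5 hidden units
W2 = rng.standard_normal((3, 5))   # 5 hidden units -> 3 outputs

def f(x):
    # f(x) = g2(W2 * g1(W1 * x)) with g1 = ReLU and g2 = identity
    a1 = relu(W1 @ x)
    return W2 @ a1

x_i = rng.standard_normal(4)
print(f(x_i))                      # a 3-dimensional output vector
```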
Training the neural network
• Given $\mathbf{x}_i$ and $y_i$
• Think of what hyper-parameters and neural network design might work
• Form a neural network:
  $f(\mathbf{x}_i) = g_l(\mathbf{W}_l \, g_{l-1}(\mathbf{W}_{l-1} \cdots g_1(\mathbf{W}_1 \mathbf{x}_i) \cdots))$
• Compute $f_{\mathbf{w}}(\mathbf{x}_i)$ as an estimate of $y_i$ for all samples
• Compute the loss:
  $\frac{1}{N}\sum_{i=1}^{N} L(f_{\mathbf{w}}(\mathbf{x}_i), y_i) = \frac{1}{N}\sum_{i=1}^{N} l_i(\mathbf{w})$
• Tweak $\mathbf{w}$ to reduce the loss (optimization algorithm)
• Repeat the last three steps
Gradient ascent
• If you didn’t know the shape of a mountain
• But at every step you knew the slope
• Can you reach the top of the mountain?
Gradient descent minimizes the loss function
• At every point, compute
• Loss (scalar): $l_i(\mathbf{w})$
• Gradient of the loss with respect to the weights (vector): $\nabla_{\mathbf{w}} l_i(\mathbf{w})$
• Take a step towards the negative gradient:
  $\mathbf{w} \leftarrow \mathbf{w} - \eta \, \frac{1}{N}\sum_{i=1}^{N} \nabla_{\mathbf{w}} l_i(\mathbf{w})$
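Continuing the same illustrative setup (linear model, squared-error loss, made-up data), the update rule above can be sketched as the loop below; the learning rate and the number of steps are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))             # N samples x_i
y = X @ np.array([2.0, -1.0, 0.5])            # targets y_i

w = np.zeros(3)
eta = 0.1                                     # learning rate

for step in range(200):
    pred = X @ w                              # f_w(x_i) for all i
    grad = 2 * X.T @ (pred - y) / len(y)      # (1/N) * sum_i grad_w l_i(w)
    w = w - eta * grad                        # step towards the negative gradient

print(w)                                      # approaches [2, -1, 0.5]
```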
Derivative of a function of a scalar

E.g. $f(x) = ax^2 + bx + c$, $f'(x) = 2ax + b$, $f''(x) = 2a$

• The derivative $f'(x) = \frac{d f(x)}{dx}$ is the rate of change of $f(x)$ with $x$
• It is zero when the function is flat (horizontal), such as at a minimum or maximum of $f(x)$
• It is positive when $f(x)$ is sloping up, and negative when $f(x)$ is sloping down
• To move towards a maximum, take a small step in the direction of the derivative
Gradient of a function of a vector
• Derivative with respect to each dimension, holding the other dimensions constant

[Figure: surface plot of f(x1, x2)]

• $\nabla f(\mathbf{x}) = \nabla f(x_1, x_2) = \begin{bmatrix} \partial f / \partial x_1 \\ \partial f / \partial x_2 \end{bmatrix}$
• At a minimum or a maximum the gradient is the zero vector: the function is flat in every direction
Gradient of a function of a vector
• The gradient gives a direction for moving towards a minimum
• Take a small step towards the negative of the gradient

[Figure: surface plot of f(x1, x2)]
Example of gradient
• Let $f(\mathbf{x}) = f(x_1, x_2) = 5x_1^2 + 3x_2^2$
• Then $\nabla f(\mathbf{x}) = \nabla f(x_1, x_2) = \begin{bmatrix} \partial f / \partial x_1 \\ \partial f / \partial x_2 \end{bmatrix} = \begin{bmatrix} 10 x_1 \\ 6 x_2 \end{bmatrix}$
• At the location $(2, 1)$, a step in the $\begin{bmatrix} 20 \\ 6 \end{bmatrix}$ direction (the unit vector $\begin{bmatrix} 0.958 \\ 0.287 \end{bmatrix}$) will lead to the maximal increase in the function
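A quick numerical check of this example: the code below compares the analytic gradient at (2, 1) with a central finite-difference estimate, and normalizes it to the unit step direction (0.958, 0.287).

```python
import numpy as np

f = lambda x1, x2: 5 * x1**2 + 3 * x2**2

x1, x2 = 2.0, 1.0
grad = np.array([10 * x1, 6 * x2])             # analytic gradient -> [20, 6]

eps = 1e-6                                     # finite-difference check
num = np.array([(f(x1 + eps, x2) - f(x1 - eps, x2)) / (2 * eps),
                (f(x1, x2 + eps) - f(x1, x2 - eps)) / (2 * eps)])

print(grad, num)                               # both approximately [20, 6]
print(grad / np.linalg.norm(grad))             # approximately [0.958, 0.287]
```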
This story is unfolding in multiple
dimensions
Backpropagation
• Backpropagation is an efficient method to do gradient descent
• It saves the gradient w.r.t. the upper layer's output to compute the gradient w.r.t. the weights immediately below
• It is linked to the chain rule of derivatives
• All intermediary functions must be differentiable, including the activation functions

[Figure: feed-forward network with inputs x1 … xd, hidden units h11 … h1n, and outputs y1 … yn]
Chain rule of differentiation
• Very handy for complicated functions
– Especially functions of functions
– E.g., NN outputs are functions of previous layers
– For example, let $f(x) = g(h(x))$
– Let $y = h(x)$, $z = g(y) = g(h(x))$
– Then $f'(x) = \frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx} = g'(y)\, h'(x)$
– For example: $\frac{d}{dx}\sin(x^2) = 2x\cos(x^2)$
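A small numerical sanity check of the chain-rule example above, comparing the analytic derivative of sin(x²) with a finite-difference estimate at an arbitrarily chosen point.

```python
import numpy as np

x = 1.3
analytic = 2 * x * np.cos(x**2)                       # chain rule: g'(y) * h'(x)

eps = 1e-6
numeric = (np.sin((x + eps)**2) - np.sin((x - eps)**2)) / (2 * eps)

print(analytic, numeric)                              # the two values agree
```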
Backpropagation makes use of
chain rule of derivatives
• Chain rule: $\frac{\partial f(g(x))}{\partial x} = \frac{\partial f(g(x))}{\partial g(x)} \, \frac{\partial g(x)}{\partial x}$

[Figure: computation graph: Z1 = W1·x + b1, A1 = ReLU(Z1), Z2 = W2·A1 + b2, A2 = SoftMax(Z2), cross-entropy (CE) loss against the target; "?" marks the gradients to be computed]
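A minimal sketch of backpropagation through this graph (ReLU hidden layer, softmax output, cross-entropy loss); the sizes, random weights, input, and target label below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, k = 4, 5, 3                      # input, hidden, output sizes (made up)
x = rng.standard_normal(d)
target = 2                             # class index of the true label

W1, b1 = rng.standard_normal((h, d)), np.zeros(h)
W2, b2 = rng.standard_normal((k, h)), np.zeros(k)

# Forward pass: save the intermediate activations for the backward pass
z1 = W1 @ x + b1
a1 = np.maximum(0.0, z1)                     # ReLU
z2 = W2 @ a1 + b2
a2 = np.exp(z2 - z2.max()); a2 /= a2.sum()   # softmax (numerically stabilized)
loss = -np.log(a2[target])                   # cross-entropy

# Backward pass: chain rule, reusing the saved activations
dz2 = a2.copy(); dz2[target] -= 1.0    # dL/dz2 for softmax + cross-entropy
dW2 = np.outer(dz2, a1); db2 = dz2
da1 = W2.T @ dz2                       # gradient w.r.t. the layer below's output
dz1 = da1 * (z1 > 0)                   # ReLU derivative
dW1 = np.outer(dz1, x); db1 = dz1

print(loss, dW1.shape, dW2.shape)
```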
Vector valued functions and Jacobians
• We often deal with functions that give multiple outputs
• Let $\mathbf{f}(\mathbf{x}) = \begin{bmatrix} f_1(\mathbf{x}) \\ f_2(\mathbf{x}) \end{bmatrix} = \begin{bmatrix} f_1(x_1, x_2, x_3) \\ f_2(x_1, x_2, x_3) \end{bmatrix}$
• Thinking in terms of a vector of functions can make the representation less cumbersome and computations more efficient
• Then the Jacobian is
• $J(\mathbf{f}) = \begin{bmatrix} \frac{\partial \mathbf{f}}{\partial x_1} & \frac{\partial \mathbf{f}}{\partial x_2} & \frac{\partial \mathbf{f}}{\partial x_3} \end{bmatrix} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \frac{\partial f_1}{\partial x_3} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \frac{\partial f_2}{\partial x_3} \end{bmatrix}$
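As a concrete (made-up) instance of a 2×3 Jacobian, the sketch below estimates it by central finite differences for f(x) = [x1·x2·x3, x1 + sin(x2) + x3²]; this is only an illustrative check, not how frameworks compute it in practice.

```python
import numpy as np

def f(x):
    # A made-up vector-valued function with 3 inputs and 2 outputs
    return np.array([x[0] * x[1] * x[2],
                     x[0] + np.sin(x[1]) + x[2]**2])

def jacobian(f, x, eps=1e-6):
    # Column j holds the partial derivatives of f with respect to x_j
    cols = []
    for j in range(len(x)):
        e = np.zeros_like(x); e[j] = eps
        cols.append((f(x + e) - f(x - e)) / (2 * eps))
    return np.stack(cols, axis=1)

x = np.array([1.0, 2.0, 3.0])
print(jacobian(f, x))   # approximately [[6, 3, 2], [1, cos(2), 6]]
```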
Jacobian of each layer
• Compute the derivatives of a higher layer’s
output with respect to those of the lower
layer

• What if we scale all the weights by a factor R?

• What happens a few layers down?


Role of step size and learning rate
• Tale of two loss functions
– Same value, and
– Same gradient (first derivative), but
– Different Hessian (second derivative)
– Different step sizes needed
• Success not guaranteed
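To make the "tale of two loss functions" concrete: the two made-up quadratics below have the same value and the same gradient at x = 1 but different second derivatives, so one fixed learning rate makes the first converge and the second diverge.

```python
# Two quadratics with the same value and gradient at x = 1 but different curvature
g1 = lambda x: 2 * x                   # gradient of f1(x) = x**2,                  f1'' = 2
g2 = lambda x: 20 * (x - 0.9)          # gradient of f2(x) = 10*(x - 0.9)**2 + 0.9, f2'' = 20

eta = 0.3                              # one fixed learning rate for both
x1 = x2 = 1.0                          # same start: f1(1) = f2(1) = 1, both gradients equal 2
for _ in range(20):
    x1 -= eta * g1(x1)                 # contracts towards the minimum at x = 0
    x2 -= eta * g2(x2)                 # overshoots the minimum at x = 0.9 and diverges

print(x1, x2)                          # x1 is near 0; x2 has blown up
```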
The perfect step size is impossible to
guess
• Goldilocks finds the perfect balance only in a
fairy tale

• The step size is decided by the learning rate $\eta$ and the gradient
Double derivative

E.g. $f(x) = ax^2 + bx + c$, $f'(x) = 2ax + b$, $f''(x) = 2a$

• The double derivative $f''(x) = \frac{d^2 f(x)}{dx^2}$ is the derivative of the derivative of $f(x)$
• The double derivative is positive for convex functions (which have a single minimum), and negative for concave functions (which have a single maximum)
Double derivative

$f(x) = ax^2 + bx + c$, $f'(x) = 2ax + b$, $f''(x) = 2a$

• The double derivative tells how far the minimum might be from a given point
• From $x = 0$ the minimum is closer for the red dashed curve than for the blue solid curve, because the former has a larger second derivative (its slope reverses faster)
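A tiny numeric illustration of this point, using two made-up parabolas that share the slope f'(0) = 2 but have different second derivatives; the perfect step from x = 0 is shorter for the curve with the larger curvature.

```python
# Two parabolas with the same slope f'(0) = 2 but different curvature:
#   "blue":  f(x) =   x**2 + 2*x,  f'' = 2,  minimum at x = -2/2 = -1.0
#   "red" :  f(x) = 4*x**2 + 2*x,  f'' = 8,  minimum at x = -2/8 = -0.25
for name, fpp in [("blue", 2.0), ("red", 8.0)]:
    step = -2.0 / fpp                  # perfect step -f'(0) / f''(0)
    print(name, "minimum at", step)    # the larger f'' gives the closer minimum
```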
Perfect step size for a paraboloid
• Let $f(x) = ax^2 + bx + c$
• Assuming $a > 0$
• The minimum is at: $x^* = -\frac{b}{2a}$
• For any $x$ the perfect step would be:
  $-\frac{b}{2a} - x = -\frac{2ax + b}{2a} = -\frac{f'(x)}{f''(x)}$
• So, the perfect learning rate is: $\eta^* = \frac{1}{f''(x)}$
• In multiple dimensions, $\mathbf{x} \leftarrow \mathbf{x} - H(f(\mathbf{x}))^{-1} \nabla f(\mathbf{x})$
• Practically, we do not want to compute the inverse of a Hessian matrix, so we approximate the Hessian inverse
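A sketch of the 1-D claim above: with step size 1/f''(x), a single update from any point lands exactly at the minimum of a quadratic (the coefficients and starting point below are arbitrary).

```python
a, b, c = 3.0, -4.0, 1.0               # f(x) = a*x**2 + b*x + c, with a > 0
fp  = lambda x: 2 * a * x + b          # f'(x)
fpp = 2 * a                            # f''(x), constant for a quadratic

x = 5.0                                # arbitrary starting point
x = x - fp(x) / fpp                    # one step with the "perfect" rate 1/f''(x)
print(x, -b / (2 * a))                 # both equal the minimum at 2/3
```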
Hessian of a function of a vector
• Double derivatives with respect to each pair of dimensions form the Hessian matrix

[Figure: surface plot of f(x1, x2)]

• If all eigenvalues of the Hessian matrix are positive, then the function is convex
Example of Hessian
• Let $f(\mathbf{x}) = f(x_1, x_2) = 5x_1^2 + 3x_2^2 + 4 x_1 x_2$
• Then
  $\nabla f(\mathbf{x}) = \nabla f(x_1, x_2) = \begin{bmatrix} \partial f / \partial x_1 \\ \partial f / \partial x_2 \end{bmatrix} = \begin{bmatrix} 10 x_1 + 4 x_2 \\ 6 x_2 + 4 x_1 \end{bmatrix}$
• And, $H(f(\mathbf{x})) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} \end{bmatrix} = \begin{bmatrix} 10 & 4 \\ 4 & 6 \end{bmatrix}$
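A quick check of this example with NumPy: both eigenvalues of the Hessian are positive, so the quadratic is convex.

```python
import numpy as np

H = np.array([[10.0, 4.0],
              [4.0,  6.0]])            # Hessian of 5*x1**2 + 3*x2**2 + 4*x1*x2

eigvals = np.linalg.eigvalsh(H)        # eigvalsh: eigenvalues of a symmetric matrix
print(eigvals)                         # approx [3.53, 12.47], both positive -> convex
```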
Saddle points, Hessian and long local
furrows

• Some variables may have reached a local minimum while others have not
• Some weights may have an almost zero gradient
• At least some eigenvalues of the Hessian may not be negative
Complicated loss functions
A realistic picture

[Figure: loss landscape with a possible global minimum, saddle points, local minima, and local maxima]

Image source: https://www.cs.umd.edu/~tomg/projects/landscapes/
