
Backpropagation

Neural Network
(Figure: a feed-forward network with an input layer, three hidden layers, and an output layer; the number of neurons per layer is the network's width, the number of layers its depth.)
Computational Graphs
• A neural network is a computational graph (see the sketch below)

– It has compute nodes

– It has edges that connect the nodes

– It is directional

– It is organized in ‘layers’
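As an illustration (my own sketch, not from the slides), a tiny computational graph can be written out in Python; the class names SumNode and MultNode are illustrative choices:

# Minimal sketch of a computational graph: compute nodes, directed edges,
# organized in layers. Names are illustrative, not from the lecture.
class SumNode:
    def forward(self, a, b):
        return a + b

class MultNode:
    def forward(self, a, b):
        return a * b

# f(x, y, z) = (x + y) * z, arranged as two "layers" of compute nodes
x, y, z = -2.0, 5.0, -4.0
s = SumNode().forward(x, y)      # first layer: sum node
f = MultNode().forward(s, z)     # second layer: mult node
print(f)                         # -12.0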
Backprop
The Importance of Gradients
• Our optimization schemes are based on computing gradients

• One can compute gradients analytically, but what if our function is too complex?

• Break down the gradient computation → Backpropagation

Done by many people before, but somehow (mostly) credited to Rumelhart 1986
Backprop: Forward Pass

(Figure: forward pass through a small compute graph with a sum node followed by a mult node.)
Backprop: Backward Pass

(Figure sequence: the gradient is passed backward through the mult and sum nodes, one node at a time, using the chain rule.)
Backprop: Backward Pass

(Figure: a single node during the backward pass.)

Chain Rule: downstream gradient = upstream gradient × local gradient,
e.g., ∂L/∂x = (∂L/∂z) · (∂z/∂x), where ∂L/∂z is the upstream gradient and ∂z/∂x the local gradient of the node.
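As a concrete illustration (a sketch of my own, not taken from the slides), the forward and backward pass for f(x, y, z) = (x + y) · z looks like this; variable names are illustrative:

# Backprop sketch on f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0

# Forward pass
s = x + y                 # sum node
f = s * z                 # mult node: f = -12.0

# Backward pass: downstream gradient = upstream gradient * local gradient
grad_f = 1.0              # gradient at the output
grad_s = grad_f * z       # mult node: local gradient w.r.t. s is z
grad_z = grad_f * s       # mult node: local gradient w.r.t. z is s
grad_x = grad_s * 1.0     # sum node: local gradient w.r.t. x is 1
grad_y = grad_s * 1.0     # sum node: local gradient w.r.t. y is 1

print(grad_x, grad_y, grad_z)   # -4.0 -4.0 3.0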
Compute Graphs -> Neural Networks

(Figure: a neural network drawn as a compute graph, from the input layer to the output layer. The input and the weights (the unknowns!) feed a function whose output is compared, via an L2 loss / cost, to a target, e.g., a class label or regression target.)
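To make this concrete, here is a minimal sketch (my own, with illustrative names and shapes) of such a compute graph for a single linear layer with an L2 loss:

import numpy as np

# Compute graph: input x and weights W (the unknowns) -> function -> L2 loss.
np.random.seed(0)
x = np.random.randn(4)        # input
t = np.array([1.0, 0.0])      # target (e.g., regression target / class label)
W = np.random.randn(2, 4)     # weights (unknowns)
b = np.zeros(2)               # bias

y = W @ x + b                 # the 'function' node
loss = np.sum((y - t) ** 2)   # L2 loss / cost node
print(loss)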
Compute Graphs -> Neural Networks

(Figure: input layer … output layer, now with an activation function and a bias at each node.)

⟶ use the chain rule to compute the partial derivatives (see the sketch below)
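Continuing the sketch above (again with illustrative names; the sigmoid activation and L2 loss are my own assumptions, not stated on the slide), the chain rule yields the partial derivatives of the loss with respect to the weights and the bias:

import numpy as np

# Backward pass through loss <- activation <- linear layer, via the chain rule.
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

np.random.seed(0)
x = np.random.randn(4)
t = np.array([1.0, 0.0])
W = np.random.randn(2, 4)
b = np.zeros(2)

a = W @ x + b                  # linear part
y = sigmoid(a)                 # activation
loss = np.sum((y - t) ** 2)    # L2 loss

dL_dy = 2.0 * (y - t)          # upstream gradient from the loss
dy_da = y * (1.0 - y)          # local gradient of the sigmoid
dL_da = dL_dy * dy_da          # chain rule
dL_dW = np.outer(dL_da, x)     # partial derivatives w.r.t. the weights
dL_db = dL_da                  # partial derivatives w.r.t. the bias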
Gradient Descent for Neural Networks

(Figure: input layer, hidden layers 1–3, output layer.)

Gradient step: w ← w − α · ∇w L  (for every weight w, with learning rate α and loss L); a sketch of the full loop follows below.
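A minimal training-loop sketch (illustrative names; it reuses the one-layer network and the assumed sigmoid/L2 setup from above):

import numpy as np

# Plain gradient descent on the one-layer network above.
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def compute_gradients(W, b, x, t):
    a = W @ x + b
    y = sigmoid(a)
    dL_da = 2.0 * (y - t) * y * (1.0 - y)
    return np.outer(dL_da, x), dL_da

np.random.seed(0)
x, t = np.random.randn(4), np.array([1.0, 0.0])
W, b = np.random.randn(2, 4), np.zeros(2)

learning_rate = 1e-1
for step in range(1000):
    dW, db = compute_gradients(W, b, x, t)
    W -= learning_rate * dW      # gradient step
    b -= learning_rate * db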
NNs can Become Quite Complex…
• These graphs can be huge!

[Szegedy et al., CVPR ’15] Going Deeper with Convolutions
The Flow of the Gradients
• Many, many, many of these nodes (neurons) form a neural network

• Each one has its own work to do: a forward and a backward pass (see the sketch below)
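A sketch (my own, with illustrative class names) of what "its own work" looks like for a single node:

# Each node only needs to know its local forward computation and its local gradient;
# the chain rule stitches the backward passes together.
class MultNode:
    def forward(self, a, b):
        self.a, self.b = a, b          # cache inputs for the backward pass
        return a * b

    def backward(self, upstream):
        # downstream gradient = upstream gradient * local gradient
        return upstream * self.b, upstream * self.a

node = MultNode()
out = node.forward(3.0, -4.0)          # forward pass: -12.0
da, db = node.backward(1.0)            # backward pass: (-4.0, 3.0)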
Gradient Descent for Neural Networks

Backpropagation: just go through the network layer by layer (see the sketch below)

Note that some activations also have weights
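A layer-by-layer sketch (illustrative; the trivial scaling layer stands in for real layers):

# Backprop layer by layer: forward through the layers in order, then backward
# through the same layers in reverse, passing the upstream gradient along.
class Scale:
    def __init__(self, c):
        self.c = c
    def forward(self, x):
        return self.c * x
    def backward(self, upstream):
        return upstream * self.c       # local gradient of c*x w.r.t. x is c

layers = [Scale(2.0), Scale(-3.0)]

out = 1.5
for layer in layers:                   # forward pass, layer by layer
    out = layer.forward(out)

grad = 1.0
for layer in reversed(layers):         # backward pass, layer by layer
    grad = layer.backward(grad)

print(out, grad)                       # -9.0 -6.0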
Derivatives of Cross Entropy Loss

Gradients of the weights of the last layer, for the binary cross entropy loss on the output scores:
L = −[ t log ŷ + (1 − t) log(1 − ŷ) ]  (target t, predicted probability ŷ)

Gradients of the weights of the first layer: obtained by pushing these gradients further back through the network with the chain rule (see the sketch below).
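A sketch of the last-layer gradients (my own notation; the sigmoid on the output score is an assumption), using the well-known simplification ∂L/∂score = ŷ − t for sigmoid plus binary cross entropy:

import numpy as np

# Gradient of binary cross entropy w.r.t. the last layer's weights.
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

np.random.seed(0)
h = np.random.randn(3)      # activations feeding the last layer
t = 1.0                     # binary target
w = np.random.randn(3)      # last-layer weights
b = 0.0

score = w @ h + b
y_hat = sigmoid(score)
loss = -(t * np.log(y_hat) + (1 - t) * np.log(1 - y_hat))

dL_dscore = y_hat - t       # sigmoid + BCE simplifies to (y_hat - t)
dL_dw = dL_dscore * h       # chain rule: gradient w.r.t. last-layer weights
dL_db = dL_dscore
# Gradients of earlier layers follow by pushing dL_dscore further back.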
Back to Compute Graphs & NNs

Gradient Descent for Neural Networks
Gradient Descent
• How to pick a good learning rate?

• How to compute the gradient for a single training pair?

• How to compute the gradient for a large training set?

• How to speed things up? More in the next lectures…
AdaGrad

• AdaGrad’s concept is to modify the learning rate for every parameter in a model, depending on that parameter’s previous gradients.
• For each parameter it accumulates the sum of the squared gradients over time and divides the learning rate by the square root of this sum (a sketch of the update follows after this list).
• This reduces the learning rate for parameters with a large partial derivative of the loss, while the learning rate of parameters with modest gradients decreases only slightly.
• The net effect is greater progress in the more gently sloped directions of parameter space.
• What would happen to the sum of squared gradients if the training takes too long?

• Over time, this term grows larger. When the current gradient is divided by this large number, the update step for the weights becomes very small.

• It is as if we were using a very low learning rate, which becomes even lower the longer the training takes.

• AdaGrad performs well for some but not all deep learning models.
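A sketch of the AdaGrad update on a toy quadratic loss (all names and hyperparameters are illustrative; the 1e-7 term avoids division by zero):

import numpy as np

# AdaGrad sketch: per-parameter sum of squared gradients, never decayed.
def compute_gradient(w):
    return w                 # gradient of f(w) = 0.5 * ||w||^2 is just w

w = np.array([1.0, -2.0, 3.0])
learning_rate = 1.0
grad_squared = np.zeros_like(w)

for t in range(100):
    dw = compute_gradient(w)
    grad_squared += dw * dw                            # keeps growing over time
    w -= learning_rate * dw / (np.sqrt(grad_squared) + 1e-7)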
RMSProp: "Leaky AdaGrad"
• Instead of allowing this sum to increase continuously over the training period, we allow it to decay by introducing a decay-rate term:

grad_squared = 0
for t in range(num_steps):
    dw = compute_gradient(w)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dw * dw
    w -= learning_rate * dw / (grad_squared.sqrt() + 1e-7)
ADAM
• So far we have used the momentum term to determine a velocity from the gradients and to update the weight parameters in the direction of that velocity,
• and the sum of squared gradients to scale the current gradient, so that we update the weights in each dimension of parameter space at a comparable rate.
• Adam combines both ideas (a sketch follows below).
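A sketch of the Adam update combining both terms on the same toy loss (illustrative names; the bias correction follows the standard formulation and is my own addition, since the slide text does not show it):

import numpy as np

# Adam sketch: momentum (first moment) + leaky sum of squared gradients
# (second moment), with bias correction.
def compute_gradient(w):
    return w                             # gradient of 0.5 * ||w||^2

w = np.array([1.0, -2.0, 3.0])
learning_rate, beta1, beta2 = 0.1, 0.9, 0.999
moment1 = np.zeros_like(w)               # velocity (momentum term)
moment2 = np.zeros_like(w)               # decaying sum of squared gradients

for t in range(1, 101):
    dw = compute_gradient(w)
    moment1 = beta1 * moment1 + (1 - beta1) * dw
    moment2 = beta2 * moment2 + (1 - beta2) * dw * dw
    m1_hat = moment1 / (1 - beta1 ** t)  # bias correction
    m2_hat = moment2 / (1 - beta2 ** t)
    w -= learning_rate * m1_hat / (np.sqrt(m2_hat) + 1e-7)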
