Neural Network
[Figure: a fully connected network with an input layer, three hidden layers, and an output layer. Width = number of neurons per layer; depth = number of layers.]
Computational Graphs
• A neural network is a computational graph
– It is directional
– It is organized in 'layers'
Backprop
The Importance of Gradients
• Our optimization schemes are based on computing gradients
• Backpropagation was done by many people before, but is (mostly) credited to Rumelhart 1986
Backprop: Forward Pass
[Figure: example compute graph with a sum node feeding a mult node; the forward pass evaluates each node from the inputs to the output.]
Backprop: Backward Pass
[Figure: the same sum/mult compute graph, annotated step by step with gradients flowing backward through each node.]
Chain Rule: ∂L/∂x = (∂L/∂z) · (∂z/∂x)
Downstream Gradient = Upstream Gradient × Local Gradient
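To make this concrete, here is a toy example of my own (not from the slides): a forward and backward pass through the graph f = (x + y) * z, showing downstream = upstream × local at each node.

# forward pass through f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
s = x + y            # sum node:  s = 3.0
f = s * z            # mult node: f = -12.0

# backward pass: start from the upstream gradient df/df = 1
df = 1.0
ds = df * z          # local gradient of mult w.r.t. s is z -> ds = -4.0
dz = df * s          # local gradient of mult w.r.t. z is s -> dz = 3.0
dx = ds * 1.0        # the sum node's local gradients are 1:
dy = ds * 1.0        # it routes the upstream gradient to both inputs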
Compute Graphs -> Neural Networks
[Figure: a neural network written as a compute graph: input layer, weights, biases, activation functions, output layer, and a loss/cost node at the end.]
Gradient Descent for Neural Networks
[Figure: gradients flowing back from the output layer.]
Gradient step: w ← w − α · ∂L/∂w
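A minimal sketch of this update loop, with a hypothetical compute_gradient helper standing in for backpropagation (all names and values here are illustrative):

import numpy as np

def compute_gradient(w):
    # hypothetical stand-in: gradient of the toy loss L(w) = ||w||^2 / 2
    return w

w = np.array([1.0, -2.0])
learning_rate = 0.1
for step in range(100):
    dw = compute_gradient(w)      # in a real network: dL/dw via backprop
    w -= learning_rate * dw       # step against the gradient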
NNs can Become Quite Complex…
• These graphs can be huge!
[Figure: neuron counts.]
Gradient Descent for Neural Networks
• Gradients of the loss with respect to the weights are computed via backpropagation
Derivatives of Cross Entropy Loss
• Gradients of the weights of the last layer follow directly from the output scores
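A hedged sketch of this computation (the function name and toy inputs are illustrative, not from the slides): for softmax probabilities p with cross-entropy loss, the gradient with respect to the output scores is simply p − y.

import numpy as np

def softmax_ce_backward(scores, y):
    # scores: (N, C) output scores; y: (N,) integer class labels
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))  # stable softmax
    probs /= probs.sum(axis=1, keepdims=True)
    n = len(y)
    loss = -np.log(probs[np.arange(n), y]).mean()
    dscores = probs.copy()
    dscores[np.arange(n), y] -= 1          # dL/dscores = p - y (y one-hot)
    return loss, dscores / n

scores = np.array([[2.0, 1.0, 0.1]])       # toy scores: 1 sample, 3 classes
loss, dscores = softmax_ce_backward(scores, np.array([0]))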
Derivatives of Cross Entropy Loss
• Gradients of the weights of the first layer require chaining local gradients back through every layer in between
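A sketch of the chain rule applied back to the first layer, for an assumed two-layer ReLU network (shapes and the placeholder upstream gradient are illustrative; in practice dscores would come from the softmax sketch above):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5))            # toy batch: 4 samples, 5 features
W1, b1 = rng.standard_normal((5, 8)), np.zeros(8)
W2, b2 = rng.standard_normal((8, 3)), np.zeros(3)

# forward pass
h = np.maximum(0, x @ W1 + b1)             # hidden layer with ReLU
scores = h @ W2 + b2

# backward pass
dscores = np.ones_like(scores) / scores.size   # placeholder upstream gradient
dW2 = h.T @ dscores                        # last-layer weight gradient
dh = dscores @ W2.T                        # gradient flowing into hidden layer
dh[h <= 0] = 0                             # backprop through ReLU (local gradient)
dW1 = x.T @ dh                             # first-layer weight gradient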
Back to Compute Graphs & NNs
Gradient Descent
• How do we pick a good learning rate?
AdaGrad
• AdaGrad adapts the learning rate of every parameter in the model based on that parameter's past gradients.
• It accumulates, per parameter, the sum of squared gradients over time and divides the learning rate by the square root of this sum (see the sketch below).
• This sharply reduces the effective learning rate for parameters with large partial derivatives of the loss, while parameters with modest gradients see only a small decrease.
• The net effect is greater progress in the more gently sloped directions of parameter space.
• What happens to the sum of squared gradients if training takes too long?
• Over time, this term grows larger. When the current gradient is divided by this large number, the update step for the weights becomes very small.
• It is as if we were using a very low learning rate, one that keeps shrinking the longer training runs.
• AdaGrad performs well for some but not all deep learning models.
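A minimal AdaGrad sketch under these assumptions (compute_gradient is a hypothetical stand-in for backprop; values are illustrative):

import numpy as np

def compute_gradient(w):
    # hypothetical stand-in: gradient of the toy loss L(w) = ||w||^2 / 2
    return w

w = np.array([1.0, -2.0])
learning_rate = 0.1
grad_squared = np.zeros_like(w)
for t in range(100):
    dw = compute_gradient(w)
    grad_squared += dw * dw                # per-parameter sum of squared gradients
    w -= learning_rate * dw / (np.sqrt(grad_squared) + 1e-7)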
RMSProp: "Leaky AdaGrad"
• Instead of letting this sum grow without bound over the course of training, we let it decay by introducing a decay rate:
import numpy as np

# RMSProp: leaky running average of squared gradients
grad_squared = 0
for t in range(num_steps):
    dw = compute_gradient(w)   # gradient of the loss w.r.t. w
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dw * dw
    w -= learning_rate * dw / (np.sqrt(grad_squared) + 1e-7)
ADAM
• With momentum, we used a velocity term to determine the direction of the gradient and update the weight parameters in the direction of that velocity.
• With AdaGrad/RMSProp, we used the sum of squared gradients to scale the current gradient, so that the weights in each dimension of parameter space are updated at a comparable rate.
• Adam combines both ideas (see the sketch below).
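A minimal Adam sketch combining the two ideas (the helper and the hyperparameter values are assumptions, not from the slides):

import numpy as np

def compute_gradient(w):
    # hypothetical stand-in: gradient of the toy loss L(w) = ||w||^2 / 2
    return w

w = np.array([1.0, -2.0])
learning_rate, beta1, beta2 = 0.01, 0.9, 0.999   # commonly used defaults
m = np.zeros_like(w)                             # first moment (velocity)
v = np.zeros_like(w)                             # second moment (squared grads)
for t in range(1, 101):
    dw = compute_gradient(w)
    m = beta1 * m + (1 - beta1) * dw             # momentum-style average
    v = beta2 * v + (1 - beta2) * dw * dw        # RMSProp-style average
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    w -= learning_rate * m_hat / (np.sqrt(v_hat) + 1e-7)

The bias-correction terms compensate for m and v being initialized at zero, which would otherwise make the first updates too small.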