4. Update weights: 𝑾 ← 𝑾 − η 𝜕𝐽(𝑾)/𝜕𝑾 (in code: weights_new = weights.assign(weights - lr * grads))
5. Return weights
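As a minimal sketch of this update rule, here is plain-Python gradient descent on a toy quadratic loss. The loss J(w) = (w − 3)², the starting point, and the learning rate are illustrative assumptions, not values from the slides:

```python
# Gradient descent on the toy loss J(w) = (w - 3)^2, whose minimum is at w = 3.
# The update matches step 4: w <- w - lr * dJ/dw.

def grad(w):
    return 2.0 * (w - 3.0)  # dJ/dw for J(w) = (w - 3)^2

w = 0.0    # initial weight (illustrative)
lr = 0.1   # learning rate (step size)
for _ in range(100):
    w = w - lr * grad(w)  # step 4: update the weight

print(round(w, 4))  # prints 3.0 -- converged to the minimiser
```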
• The amount by which the weights are updated during training is referred to as the step size or the learning rate.
• The learning rate is a configurable hyperparameter used in the training of neural networks. It takes a small positive
value, often in the range between 0.0 and 1.0.
• The learning rate controls how quickly the model adapts to the problem.
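A small experiment makes the effect of the learning rate concrete. The loss J(w) = w² and the two rates below are illustrative assumptions: a small rate shrinks the weight steadily toward the minimum, while a rate that is too large makes each update overshoot, so the iterates diverge.

```python
# Minimise J(w) = w^2 (gradient dJ/dw = 2w) with two different learning rates.

def final_weight(lr, steps=20, w=1.0):
    for _ in range(steps):
        w = w - lr * 2.0 * w  # gradient descent update
    return abs(w)

print(final_weight(0.1))  # small step size: |w| shrinks toward the minimum at 0
print(final_weight(1.1))  # too large: each update overshoots and |w| grows
```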
• The key question is how to compute that gradient, which is not easy at all: given a loss and all of the weights in our
network, how do we know which direction to move each weight so that the loss improves? The process that answers
this is called back-propagation. We will derive it using elementary calculus.
How does a small change in one weight (e.g., w2) affect the final loss J(W)?
• This is a simple network with one input layer, one hidden layer (one hidden neuron), and one output layer:
the simplest neural network you can create.
• We want to compute the gradient of the loss J(W) with respect to w2 (the weight between the hidden state
and the output). This tells us how much a small change in w2 changes the final loss.
• In other words, this derivative tells us: if we make a small change to the weight in one direction, will the
loss increase or decrease, and by how much?
𝜕𝐽(𝑾)/𝜕𝒘𝟐 — the gradient of the loss with respect to w2
To compute this derivative, we can apply the chain rule backwards from the loss function through the output.
That is the gradient we care about: the gradient of our loss with respect to w2.
• To evaluate this, we can use the chain rule from elementary calculus.
• We decompose the derivative into two components: the gradient of the loss with respect to the output ŷ,
multiplied by the gradient of the output with respect to w2.
• That is, 𝜕𝐽/𝜕𝒘𝟐 = (𝜕𝐽/𝜕ŷ) × (𝜕ŷ/𝜕𝒘𝟐), a standard application of the chain rule to the original derivative
on the left-hand side.
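This decomposition can be checked numerically on the one-hidden-neuron network from the slides. For simplicity, the sketch below assumes linear activations and a squared-error loss (assumptions, not stated in the slides): z1 = w1·x, ŷ = w2·z1, J = (ŷ − y)².

```python
# Chain rule: dJ/dw2 = (dJ/dy_hat) * (dy_hat/dw2), verified by finite differences.

def forward(w1, w2, x):
    z1 = w1 * x         # hidden activation (linear, by assumption)
    return z1, w2 * z1  # hidden state and network output y_hat

x, y = 2.0, 1.0        # one training example (illustrative values)
w1, w2 = 0.5, 0.3      # weights (illustrative values)
z1, y_hat = forward(w1, w2, x)

dJ_dyhat = 2.0 * (y_hat - y)   # dJ/dy_hat for J = (y_hat - y)^2
dyhat_dw2 = z1                 # dy_hat/dw2, since y_hat = w2 * z1
dJ_dw2 = dJ_dyhat * dyhat_dw2  # chain rule

# Numerical check: nudge w2 by a tiny amount and measure the change in J.
eps = 1e-6
_, y_hat_eps = forward(w1, w2 + eps, x)
numeric = ((y_hat_eps - y) ** 2 - (y_hat - y) ** 2) / eps
print(abs(dJ_dw2 - numeric) < 1e-4)  # True: analytic and numeric gradients agree
```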
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Computing Gradients: Backpropagation
[Network diagram: input x1 → weight w1 → hidden unit z1 → weight w2 → output ŷ → loss J(W)]
Notice (shown in red) that the last component of the chain rule is itself a derivative we cannot evaluate directly, so we
recursively apply the chain rule once more and expand it into further components.
We can then propagate these gradients through the hidden units of the network, all the way back to the weight we are
interested in. In this example we first computed the derivative with respect to w2, then back-propagated and reused that
information for w1. That is why the process is called back-propagation: it proceeds from the output all the way back to
the input.
We repeat this process many times over the course of training, back-propagating the gradients again and again
through the network, from the output all the way to the inputs. For every single weight this answers the question:
how much does a small change in this weight affect the loss, and does it increase or decrease it? We then use that
information to reduce the loss, which is our ultimate goal. That is the back-propagation algorithm, and it is the
core of training neural networks.
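Putting the pieces together, here is a minimal training loop for the same one-hidden-neuron network, again assuming linear activations and a squared-error loss. Each iteration runs a forward pass, back-propagates from the loss to w2 and then one chain-rule step further to w1, and applies the gradient-descent update (all concrete values are illustrative assumptions):

```python
x, y = 2.0, 1.0      # one training example
w1, w2 = 0.5, 0.3    # initial weights
lr = 0.05            # learning rate

for _ in range(200):
    # forward pass: z1 = w1 * x, y_hat = w2 * z1, J = (y_hat - y)^2
    z1 = w1 * x
    y_hat = w2 * z1
    # backward pass: apply the chain rule from the loss back to each weight
    dJ_dyhat = 2.0 * (y_hat - y)
    dJ_dw2 = dJ_dyhat * z1      # dJ/dw2 = dJ/dy_hat * dy_hat/dw2
    dJ_dw1 = dJ_dyhat * w2 * x  # one more chain-rule step back to w1
    # gradient descent updates
    w2 -= lr * dJ_dw2
    w1 -= lr * dJ_dw1

loss = (w2 * w1 * x - y) ** 2
print(loss)  # effectively 0: the network has fit the training example
```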