
Deep learning

Loss Function, Gradient Descent Algorithm, Backpropagation
Amity Centre for Artificial Intelligence, Amity University, Noida, India
Loss Function
• Compares the target and predicted output values to measure how well the neural
network models the training data.
• The aim is to minimize this loss between the predicted and target outputs.
• There are two major types of loss function:
• Regression loss: MSE (mean squared error), MAE (mean absolute error)
• Classification loss: binary cross-entropy, categorical cross-entropy
Note (loss function vs. cost function): the loss function is the loss for a single
training example/input; the cost function is the average loss over the entire
training dataset.



Mean Squared Error (MSE)

• MSE finds the average of the squared differences between the target and
predicted outputs.
• The difference is squared, so it does not matter whether the predicted value is
above or below the target value; however, values with a large error are heavily
penalized.
• MSE is also a convex function with a clearly defined global minimum.
• This makes it easier to use gradient descent optimization to set the weight values.
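
As a minimal Keras sketch, mirroring the MAE snippet on the next slide (the
y_true/y_pred values are illustrative): MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)².

import tensorflow as tf

# MSE = (1/n) * sum((y_true - y_pred)^2)
mse = tf.keras.losses.MeanSquaredError()
y_true = [1.0, 2.0, 3.0]
y_pred = [1.1, 1.9, 3.5]
print(mse(y_true, y_pred).numpy())  # (0.01 + 0.01 + 0.25) / 3 ≈ 0.09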



Mean Absolute Error (MAE)
• MAE finds the average of the absolute differences between the target and the
predicted outputs.
• MSE is highly sensitive to outliers, which can dramatically affect the loss
because the distance is squared. MAE is therefore used when the training data
has a large number of outliers, to mitigate this.
import tensorflow as tf

mae = tf.keras.losses.MeanAbsoluteError()
y_true, y_pred = [1.0, 2.0, 3.0], [1.5, 2.0, 2.0]  # illustrative targets and predictions
print(mae(y_true, y_pred).numpy())  # (0.5 + 0.0 + 1.0) / 3 = 0.5



Binary cross-entropy/Log Loss
• Binary cross-entropy compares each of the predicted probabilities to the actual
class output, which can be either 0 or 1.
• It then calculates a score that penalizes the probabilities based on their distance
from the expected value, i.e., how close or far they are from the actual value.
[Figure: predicted probabilities far from the actual class incur a high penalty;
those close to it incur a low penalty.]
• Advantage: the cost function is differentiable.
• Disadvantage: multiple local minima; not intuitive.
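
A small Keras sketch (the probabilities below are illustrative), using
BCE = −(1/n) Σ [y·log(p) + (1−y)·log(1−p)]:

import tensorflow as tf

# BCE = -(1/n) * sum(y*log(p) + (1-y)*log(1-p))
bce = tf.keras.losses.BinaryCrossentropy()
y_true = [0.0, 1.0, 1.0]                     # actual classes
y_pred = [0.1, 0.8, 0.9]                     # predicted probabilities of class 1
print(bce(y_true, y_pred).numpy())           # low loss: predictions near targets
print(bce(y_true, [0.9, 0.2, 0.1]).numpy())  # high loss: predictions far from targets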



Categorical cross-entropy
• Also called softmax loss: it is a softmax activation plus a cross-entropy loss.
• It is used for multi-class classification.
• In the specific (and usual) case of multi-class classification, the labels are
one-hot encoded.
• Sparse categorical cross-entropy loss function:
• Used when the number of classes is very large (e.g., 1000).
• Avoids one-hot encoding, which requires large memory.
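
A sketch contrasting the two Keras losses (the labels and probabilities are
illustrative): categorical cross-entropy takes one-hot labels, while the sparse
variant takes integer class indices and so avoids building large one-hot vectors.

import tensorflow as tf

y_pred = [[0.1, 0.8, 0.1]]                      # softmax output over 3 classes

# Categorical cross-entropy: labels must be one-hot encoded
cce = tf.keras.losses.CategoricalCrossentropy()
print(cce([[0.0, 1.0, 0.0]], y_pred).numpy())   # -log(0.8) ≈ 0.223

# Sparse variant: integer class index, no one-hot vector needed
scce = tf.keras.losses.SparseCategoricalCrossentropy()
print(scce([1], y_pred).numpy())                # same value, ≈ 0.223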

W* = argmin_W (1/n) Σ_{i=1}^{n} L(f(x_i; W), s_i)

W* = argmin_W J(W)

• We want to find the network weights that achieve the lowest loss; those weights
can then be used for prediction.
• Here W is the set of all weights in the network; we need to find the optimal set
of weights that minimizes the loss over our entire training set (the training set
is the data over which the average loss J(W) is computed).
• argmin (the argument of the minimum) returns the value of W at which the loss
is minimum.



W* = argmin_W J(W)

Remember: our loss function is just a simple function of the weights.

If we plot the loss landscape for a network with two weights:
• The two weights are on the x and y axes, and the loss is marked on the z axis.
• For any value of W, we can see what the loss would be at that point.
• We need to find the point on this landscape, i.e., the values of W, that has the
minimum loss.



W* = argmin_W J(W)

• Randomly pick a place on this landscape to start the search for the minimum.
• From this random place, we find how the landscape is changing, i.e., how its
slope changes, using the gradient of the loss with respect to each of the weights.
• The gradient is a vector that gives us the direction in which the loss function
has the steepest ascent.



W* = argmin_W J(W)

• The gradient ∂J(W)/∂W tells us which way to move to find the steepest direction
on the landscape.
• From the selected point the landscape is higher in the direction of the gradient,
so we need to take a step in the direction that is lower than the selected point.
• We can take the gradient of the loss with respect to each of these weights to
understand the direction of maximum ascent.



W* = argmin_W J(W)

• Take a small step in the opposite direction of the gradient.
• On reaching the lower point, the process is repeated over and over again until
we converge to a local minimum.



Gradient Descent
• Repeat until Convergence



Gradient Descent
Algorithm for gradient descent:
1. Initialize the weights randomly ~ N(0, σ²)     weights = tf.random_normal(shape)
2. Loop until convergence:
3.   Compute gradient ∂J(W)/∂W                    grads = tf.gradients(loss, weights)
4.   Update weights: W ← W − η ∂J(W)/∂W           weights_new = weights.assign(weights - lr * grads)
5. Return weights
To summarize the algorithm known as gradient descent (taking the gradient and
descending down the landscape): we initialize the weights randomly, compute the
gradient of J(W) with respect to all of our weights, and then update the weights in
the opposite direction of that gradient, taking a small step scaled by η. This scalar
η is referred to as the learning rate: it indicates how large a step to take at each
iteration, i.e., how strongly or aggressively to step against the gradient.
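
A runnable sketch of this loop in TensorFlow 2, using tf.GradientTape in place of
the TF1-style tf.gradients shown above; the convex toy loss here is a stand-in for
a real network's J(W), and the step count substitutes for a convergence test.

import tensorflow as tf

# 1. Initialize the weights randomly ~ N(0, sigma^2)
weights = tf.Variable(tf.random.normal([2]))
lr = 0.1  # learning rate (eta)

for _ in range(100):                                 # 2. loop (fixed steps here)
    with tf.GradientTape() as tape:
        loss = tf.reduce_sum((weights - 3.0) ** 2)   # toy convex J(W), minimum at [3, 3]
    grads = tape.gradient(loss, weights)             # 3. compute dJ(W)/dW
    weights.assign_sub(lr * grads)                   # 4. W <- W - eta * dJ(W)/dW

print(weights.numpy())                               # 5. return weights; ~[3. 3.]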



Gradient Descent
Algorithm for gradient descent:
1. Initialize the weights randomly ~ N(0, σ²)     weights = tf.random_normal(shape)
2. Loop until convergence:
3.   Compute gradient ∂J(W)/∂W                    grads = tf.gradients(loss, weights)
4.   Update weights: W ← W − η ∂J(W)/∂W           weights_new = weights.assign(weights - lr * grads)
5. Return weights
• The amount that the weights are updated during training is referred to as the
step size or the learning rate.
• The learning rate is a configurable hyperparameter used in the training of neural
networks; it has a small positive value, often in the range between 0.0 and 1.0.
• The learning rate controls how quickly the model adapts to the problem.
• The key step is computing the gradient, which is not easy at all: given a loss and
all of the weights in our network, how do we know which way each weight should
move? That is done by a process called backpropagation, which we will discuss
using elementary calculus.



Gradient Descent

Algorithm for gradient descent:
1. Initialize the weights randomly ~ N(0, σ²)
2. Loop until convergence:
3.   Compute gradient ∂J(W)/∂W        ← can be very computationally intensive to
                                        compute over the entire dataset!
4.   Update weights: W ← W − η ∂J(W)/∂W
5. Return weights



Stochastic Gradient Descent

Algorithm for stochastic gradient descent:
1. Initialize the weights randomly ~ N(0, σ²)
2. Loop until convergence:
3.   Pick a single data point i
4.   Compute gradient ∂J_i(W)/∂W
5.   Update weights: W ← W − η ∂J_i(W)/∂W     ← easy to compute but very noisy
                                                (stochastic)!
6. Return weights
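
A sketch of the single-example update on hypothetical toy data (y = 3x, an
assumption for illustration), to contrast with the full-batch loop earlier:

import tensorflow as tf

x = tf.random.normal([1000, 1])            # hypothetical toy inputs
y = 3.0 * x                                # targets from a known linear rule
w = tf.Variable(tf.random.normal([1, 1]))
lr = 0.05

for _ in range(500):
    i = tf.random.uniform([], 0, 1000, dtype=tf.int32)         # pick a single data point i
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean((y[i:i+1] - x[i:i+1] @ w) ** 2)  # J_i(w) on one example
    w.assign_sub(lr * tape.gradient(loss, w))                  # cheap but noisy update
print(w.numpy())  # near 3.0, with a noisier trajectory than batch GD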



Stochastic Gradient Descent with momentum
• SGD is noisy and requires more iterations to reach the minimum. Adding a
momentum term to regular SGD gives faster convergence of the loss function.
• SGD oscillates between either direction of the gradient and updates the weights
accordingly. Adding a fraction of the previous update to the current update makes
the process a bit faster.
• Updated weight: W_{t+1} = W_t − V_t, where the velocity
  V_t = β V_{t−1} + η ∂J(W)/∂W
denotes the accumulated change in the gradient used to reach the minimum.
• The learning rate should be decreased when using a momentum term.
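
A sketch of this update rule; in practice tf.keras.optimizers.SGD with its momentum
argument does the same thing, and the toy loss and constants here are illustrative.

import tensorflow as tf

w = tf.Variable([0.0])        # weight being optimized
v = tf.Variable([0.0])        # velocity V_t
beta, lr = 0.9, 0.05          # momentum coefficient and learning rate (eta)

for _ in range(200):
    with tf.GradientTape() as tape:
        loss = (w - 2.0) ** 2          # toy loss standing in for J(W)
    grad = tape.gradient(loss, w)
    v.assign(beta * v + lr * grad)     # V_t = beta * V_{t-1} + eta * dJ/dW
    w.assign_sub(v)                    # W_{t+1} = W_t - V_t
print(w.numpy())  # ≈ 2.0, the minimum of the toy loss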



Mini-batch Gradient Descent

Algorithm for mini-batch gradient descent:
1. Initialize the weights randomly ~ N(0, σ²)
2. Loop until convergence:
3.   Pick a batch of B data points
4.   Compute gradient ∂J(W)/∂W = (1/B) Σ_{k=1}^{B} ∂J_k(W)/∂W
5.   Update weights: W ← W − η ∂J(W)/∂W      ← fast to compute, and a much better
                                               estimate of the true gradient!
6. Return weights
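
A sketch of one epoch of mini-batch updates on hypothetical toy data; tf.data
handles the shuffling and batching (the data, names, and constants are illustrative).

import tensorflow as tf

x = tf.random.normal([1000, 1])                        # hypothetical toy inputs
y = 3.0 * x + tf.random.normal([1000, 1], stddev=0.1)  # y = 3x + noise
w = tf.Variable(tf.random.normal([1, 1]))
lr, B = 0.1, 32                                        # learning rate and batch size

# Each batch of B points yields a gradient that averages B per-example gradients:
# a cheap but much less noisy estimate of the true (full-dataset) gradient.
for xb, yb in tf.data.Dataset.from_tensor_slices((x, y)).shuffle(1000).batch(B):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean((yb - xb @ w) ** 2)      # MSE over the mini-batch
    w.assign_sub(lr * tape.gradient(loss, w))
print(w.numpy())  # close to 3.0 after one pass over the data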



Mini-batches while training
• Mini-batch gradient descent is a variation of the gradient descent algorithm that
splits the training dataset into small batches that are used to calculate model
error and update model coefficients.
• Mini-batch gradient descent seeks to find a balance between the robustness of
stochastic gradient descent and the efficiency of batch gradient descent.
• More accurate estimation of the gradient
• Smoother convergence
• Allows for larger learning rates



Mini-batches while training
Summary points:
• More accurate estimation of the gradient: this lets us converge toward the target
much more quickly, and in practice the gradients are more accurate.
• Smoother convergence and larger learning rates: if our gradient estimate is quite
noisy, we cannot raise the learning rate much without risking a step in the wrong
direction. With a larger batch and more data behind each estimate, we can trust
the gradient more and increase the learning rate to step more aggressively in
that direction.
• Mini-batches lead to fast training: we can massively parallelize the computation
by splitting batches across multiple GPUs, or even multiple machines, to achieve
significant speed-ups in the training process.

So we can finally summarize:
• Mini-batch gradient descent is a variation of the gradient descent algorithm that
splits the training dataset into small batches used to calculate model error and
update model coefficients.
• It seeks a balance between the robustness of stochastic gradient descent and the
efficiency of batch gradient descent.
Summary
• Batch Gradient Descent (BGD):
Uses the entire dataset at every step, making it slow for large datasets. However,
it is computationally efficient, since it produces a stable error gradient and a
stable convergence.
• Stochastic Gradient Descent (SGD):
At the other extreme, uses a single example (a batch of 1) at each learning step.
Much faster, but it may return noisy gradients, which can cause the error rate to
jump around.
• Mini-Batch Gradient Descent:
Computes the gradients on small random sets of instances called mini-batches.
Reduces the noise of SGD while remaining more efficient than BGD.



Backpropagation Algorithm
• The algorithm is used to effectively train a neural network via a method called
the chain rule.
• After each forward pass through the network, backpropagation performs a
backward pass while adjusting the model's parameters (weights and biases).



• Backpropagation aims to minimize the cost function by adjusting the network's
weights and biases.
• The level of adjustment is determined by the gradients of the cost function with
respect to those parameters.



• The gradient of a function C(x1, x2, …, xm) at a point x is the vector of the
partial derivatives of C at x:
  ∇C(x) = [∂C/∂x1, ∂C/∂x2, …, ∂C/∂xm]
• The derivative of C measures the sensitivity of the function value (output) to a
change in its argument x (input); in other words, the derivative tells us the
direction in which C is changing.
• The gradient shows how much the parameter x needs to change (in the positive
or negative direction) to minimize C.
• Computing these gradients is done using a technique called the chain rule.
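
A small sketch of this definition; the function C below is an arbitrary illustration,
with TensorFlow's automatic differentiation computing the vector of partials.

import tensorflow as tf

# C(x1, x2) = x1^2 + 3*x2; its gradient is [dC/dx1, dC/dx2] = [2*x1, 3]
x = tf.Variable([1.0, 2.0])
with tf.GradientTape() as tape:
    C = x[0] ** 2 + 3.0 * x[1]
print(tape.gradient(C, x).numpy())  # [2. 3.]: how each x_i must change to reduce C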
Computing Gradients: Backpropagation
[Network: x1 →(w1)→ z1 →(w2)→ s̃ → J(W)]

How does a small change in one weight (e.g., w2) affect the final loss J(W)?

• This is a simple network with one input layer, one hidden layer (one hidden
neuron), and one output layer: the simplest neural network you can create.
• We compute the gradient of the loss with respect to w2 (the weight between the
hidden state and the output), because a small change in w2 can change the loss
value considerably.
• This derivative tells us how much a small change in this weight will affect our
loss: if we nudge the weight in one direction, will the loss increase or decrease,
and by how much?



Computing Gradients: Backpropagation
[Network: x1 →(w1)→ z1 →(w2)→ s̃ → J(W)]

Gradient of the loss with respect to w2: ∂J(W)/∂w2

Let's use the chain rule!

To compute this derivative, we apply the chain rule backwards from the loss
function through the output. That is the gradient we care about: the gradient of
our loss with respect to w2.



Computing Gradients: Backpropagation
[Network: x1 →(w1)→ z1 →(w2)→ s̃ → J(W)]

∂J(W)/∂w2 = ∂J(W)/∂s̃ · ∂s̃/∂w2

• Using the chain rule from elementary calculus, we can decompose this derivative
into two components.
• We split it into the gradient of the loss with respect to the output, multiplied by
the gradient of the output s̃ with respect to w2.
• This is just a standard use of the chain rule applied to the original derivative on
the left-hand side.
Computing Gradients: Backpropagation
[Network: x1 →(w1)→ z1 →(w2)→ s̃ → J(W)]

∂J(W)/∂w1 = ∂J(W)/∂s̃ · ∂s̃/∂w1

• Now we repeat this process for a different weight, say w1: replace w2 with w1,
and the same chain-rule equation still holds.
• But we notice that the gradient of the output s̃ with respect to w1 is not directly
computable, so we apply the chain rule again to evaluate it.



Computing Gradients: Backpropagation
[Network: x1 →(w1)→ z1 →(w2)→ s̃ → J(W)]

∂J(W)/∂w1 = ∂J(W)/∂s̃ · ∂s̃/∂z1 · ∂z1/∂w1

• The last component of the chain rule is again a derivative we cannot directly
evaluate, so we recursively apply the chain rule once more and split it with
respect to z1.
• In this way backpropagation computes all the gradients from the output all the
way back to the input, allowing the error to propagate from the output layer to
the input layer and allowing these gradients to be computed in practice.
• Many popular deep learning frameworks perform automatic differentiation, which
does all of this backpropagation for you.
• In this example we first computed the derivative with respect to w2, then
back-propagated and reused that information for w1. That is why we call it
backpropagation: the process runs from the output all the way back to the input.
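
A numerical check of these chain-rule factors on the two-weight network from the
slides, assuming identity activations and a squared-error loss (both are
assumptions, since the slides do not specify them):

import tensorflow as tf

x1, target = 2.0, 1.0
w1 = tf.Variable(0.5)
w2 = tf.Variable(-0.3)

with tf.GradientTape() as tape:
    z1 = w1 * x1                 # hidden unit (identity activation assumed)
    s  = w2 * z1                 # network output s~
    J  = (s - target) ** 2       # squared-error loss J(W)

dw1, dw2 = tape.gradient(J, [w1, w2])   # autodiff applies the chain rule for us

# The same factors, written out as on the slides:
dJ_ds = 2.0 * (s - target)              # dJ/ds~
print(dw2.numpy(), (dJ_ds * z1).numpy())        # dJ/dw2 = dJ/ds~ * ds~/dw2
print(dw1.numpy(), (dJ_ds * w2 * x1).numpy())   # dJ/dw1 = dJ/ds~ * ds~/dz1 * dz1/dw1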



Computing Gradients: Backpropagation
[Network: x1 →(w1)→ z1 →(w2)→ s̃ → J(W)]

∂J(W)/∂w1 = ∂J(W)/∂s̃ · ∂s̃/∂z1 · ∂z1/∂w1

Repeat this for every weight in the network, using gradients from later layers.

We repeat this process many times over the course of training, back-propagating
the gradients through the network all the way from the output to the inputs. For
every single weight this answers the question: how much does a small change in
this weight affect the loss function, does it increase or decrease it, and how can
we use that to improve the loss, which is ultimately our goal? That is the
backpropagation algorithm, the core of training neural networks.
