
Gradient Descent

Steps in Gradient Descent Algorithm


Step 1: First, randomly initialize the parameters b, w1, w2, w3, …, wm.

Step 2: Use the current parameter values to predict y for each data point in the training data set.

Step 3: Calculate the loss J(b, w).

Step 4: Calculate the gradient of J(b, w) with respect to each parameter.

Step 5: b(new) = b(old) – α * (gradient)b

w1(new) = w1(old) – α * (gradient)w1

w2(new) = w2(old) – α * (gradient)w2

Similarly, for the mth term: wm(new) = wm(old) – α * (gradient)wm

Step 6: Update all the parameters b, w1, w2, w3, …, wm simultaneously.

Step 7: Repeat Steps 2–6 until convergence.

Source: https://www.analyticsvidhya.com/blog/2021/05/gradient-descent-algorithm-understanding-the-logic-behind/
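The loop below is a minimal NumPy sketch of Steps 1–7 for a linear model with a mean-squared-error loss; the function name, the arguments X and y, the learning rate alpha, and the fixed step count are illustrative assumptions rather than part of the original algorithm description.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_steps=1000):
    """Illustrative gradient descent for a linear model with MSE loss.

    X: (n, m) feature matrix, y: (n,) targets, alpha: learning rate.
    """
    n, m = X.shape
    rng = np.random.default_rng(0)
    b = rng.normal()                      # Step 1: random initialization
    w = rng.normal(size=m)

    for _ in range(n_steps):              # Step 7: repeat the loop
        y_pred = X @ w + b                # Step 2: predict for every data point
        error = y_pred - y
        loss = np.mean(error ** 2)        # Step 3: loss J(b, w)

        grad_b = 2 * np.mean(error)       # Step 4: gradients of J(b, w)
        grad_w = 2 * (X.T @ error) / n

        b = b - alpha * grad_b            # Steps 5-6: simultaneous update
        w = w - alpha * grad_w
    return b, w, loss
```

In practice the fixed n_steps loop would be replaced by the convergence checks described next.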
How to Check the Function's Convergence

Case 1: When the graph of loss vs. parameters is purely convex

We know that a convex function converges when the slope (gradient) is equal to zero. In that case, observe the parameter values either in subsequent iterations or over the past few iterations. If the values are not changing significantly, we can stop the training.
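A minimal sketch of this stopping rule, assuming the parameters are held in NumPy arrays and that the tolerance value 1e-6 is an arbitrary illustrative choice:

```python
import numpy as np

def has_converged(params_old, params_new, tol=1e-6):
    """Stop when no parameter changes significantly between iterations."""
    return np.all(np.abs(np.asarray(params_new) - np.asarray(params_old)) < tol)
```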
How to Check the Function's Convergence

Case 2: When the loss function is not purely convex, or is some trickier kind of function

Here we can see that the graph goes almost flat after some time, so beyond a certain number of steps the parameter values no longer change significantly. The trick here is to define a maximum number of steps. Defining a maximum number of steps is necessary because, for such functions, the gradient may never become exactly zero, and without a cap the training could continue indefinitely without meaningful improvement.

Source: https://www.analyticsvidhya.com/blog/2021/05/gradient-descent-algorithm-understanding-the-logic-behind/
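For such loss functions, one common pattern is to combine the convergence check above with a hard cap on the number of steps; the sketch below reuses the hypothetical has_converged helper, and max_steps is an arbitrary illustrative value.

```python
def train(update_step, params, max_steps=10_000, tol=1e-6):
    """Run update_step repeatedly; stop on convergence or at max_steps."""
    for step in range(max_steps):              # hard cap on the number of steps
        new_params = update_step(params)       # one gradient descent update
        if has_converged(params, new_params, tol):
            break                              # parameters stopped changing
        params = new_params
    return params
```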
Variants of Gradient Descent Algorithm

Batch Gradient Descent


● All the observations in the dataset are used to calculate the cost function.
● Take the entire training set, perform forward propagation, and calculate the cost function.
● Update the parameters using the rate of change of this cost function with respect to the parameters.
● An epoch is when the entire training set is passed through the model, forward propagation and backward propagation are performed, and the parameters are updated.
○ With other variables held constant, the number of epochs is an indication of the relative amount of learning.
● In Batch Gradient Descent, since we use the entire training set, the parameters are updated only once per epoch.
● The term “batch” denotes the total number of samples/observations from the dataset that is used to calculate the gradient.
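A sketch of one epoch of Batch Gradient Descent under the same illustrative linear-model/MSE assumptions as earlier; note that the whole training set contributes to a single parameter update.

```python
def batch_gd_epoch(X, y, w, b, alpha):
    """One epoch of Batch Gradient Descent: one update from the full training set."""
    n = X.shape[0]
    error = X @ w + b - y                 # forward pass on the entire training set
    grad_w = 2 * (X.T @ error) / n        # gradient of the MSE cost w.r.t. w
    grad_b = 2 * error.mean()             # gradient of the MSE cost w.r.t. b
    return w - alpha * grad_w, b - alpha * grad_b   # single update per epoch
```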
Variants of Gradient Descent Algorithm

Stochastic Gradient Descent:

● A single observation is taken at random from the dataset to calculate the cost function.
● Commonly abbreviated as SGD.
● We pass a single observation at a time, calculate the cost, and update the parameters.
● Each time the parameters are updated is known as an iteration.
● In the case of SGD, there will be ‘m’ iterations per epoch, where ‘m’ is the number of observations in the dataset.
● The path taken by the algorithm to reach the minima is usually noisier than in Batch Gradient Descent.
● A weight update may reduce the error on the single observation being presented, yet increase the error on the full training set. Given a large number of such individual updates, however, the total error decreases.
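A corresponding sketch of one epoch of SGD under the same illustrative assumptions: the data are visited in random order and the parameters are updated once per observation, i.e. ‘m’ iterations per epoch.

```python
import numpy as np

def sgd_epoch(X, y, w, b, alpha, rng=None):
    """One epoch of SGD: m parameter updates, one per (randomly ordered) observation."""
    rng = rng or np.random.default_rng()
    for i in rng.permutation(X.shape[0]):   # visit all m observations in random order
        error = X[i] @ w + b - y[i]         # error on a single observation
        w = w - alpha * 2 * error * X[i]    # each update here is one iteration
        b = b - alpha * 2 * error
    return w, b
```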
Variants of Gradient Descent Algorithm

Stochastic Gradient Descent

Let’s say we have 5 observations and each observation has three features (the feature values taken are completely random).

Now, if we use SGD, we will take the first observation, pass it through the neural network, calculate the error, and then update the parameters.
Variants of Gradient Descent Algorithm

Stochastic Gradient Descent:

Then we will take the second observation and perform similar steps with it. This process is repeated until all the observations have been passed through the network and the parameters have been updated.
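As a concrete illustration of this walkthrough, the snippet below builds 5 random observations with 3 features each (mirroring the ‘completely random’ feature values mentioned above) and runs one epoch of the sgd_epoch sketch from earlier, i.e. 5 parameter updates.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.random((5, 3))       # 5 observations, 3 random features each
y = rng.random(5)            # random targets, purely for illustration
w, b = np.zeros(3), 0.0      # initial parameters

w, b = sgd_epoch(X, y, w, b, alpha=0.1, rng=rng)   # 5 updates = 5 iterations
```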
Variants of Gradient Descent Algorithm
Mini-batch Gradient Descent:
● It takes a subset of the entire dataset to calculate the cost function.
● If there are ‘m’ observations, then the number of observations in each subset or mini-batch will be more than 1 and less than ‘m’.
● The number of observations in the subset is called the batch size (b).
● The batch size is something we can tune. It is usually chosen as a power of 2, such as 32, 64, 128, 256, 512, etc., because some hardware such as GPUs achieves better run time with these common batch sizes.
● Mini-batch gradient descent strikes a compromise between speedy convergence and the noise associated with each gradient update, which makes it a more flexible and robust algorithm.
● With large training datasets, we usually don’t need more than 2–10 passes over all training examples (epochs).
● Note: with batch size b = m (the number of training examples), we recover Batch Gradient Descent.
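A mini-batch sketch under the same illustrative assumptions; the data are shuffled each epoch and processed in slices of batch_size, and setting batch_size = m reduces it to Batch Gradient Descent, as noted above.

```python
import numpy as np

def minibatch_gd_epoch(X, y, w, b, alpha, batch_size=32, rng=None):
    """One epoch of mini-batch gradient descent with 1 <= batch_size <= m."""
    rng = rng or np.random.default_rng()
    idx = rng.permutation(X.shape[0])             # shuffle once per epoch
    for start in range(0, len(idx), batch_size):
        batch = idx[start:start + batch_size]     # indices of the next mini-batch
        error = X[batch] @ w + b - y[batch]
        w = w - alpha * 2 * (X[batch].T @ error) / len(batch)
        b = b - alpha * 2 * error.mean()
    return w, b
```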
Summary

1. Batch Gradient Descent: parameters are updated after computing the gradient of the error with respect to the entire training set.
2. Stochastic Gradient Descent: parameters are updated after computing the gradient of the error with respect to a single training example.
3. Mini-Batch Gradient Descent: parameters are updated after computing the gradient of the error with respect to a subset of the training set.
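Seen through the minibatch_gd_epoch sketch above, the three variants are one algorithm parameterized by the batch size; the calls below (reusing the illustrative X, y, w, b, and learning rate from the earlier snippets) make that explicit.

```python
m = X.shape[0]                                                    # number of training examples
w, b = minibatch_gd_epoch(X, y, w, b, alpha=0.1, batch_size=m)    # 1. Batch Gradient Descent
w, b = minibatch_gd_epoch(X, y, w, b, alpha=0.1, batch_size=1)    # 2. Stochastic Gradient Descent
w, b = minibatch_gd_epoch(X, y, w, b, alpha=0.1, batch_size=32)   # 3. Mini-Batch Gradient Descent
```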
Cost vs. Iteration

(Plots of cost vs. iteration for Batch Gradient Descent, Stochastic Gradient Descent, and Mini-batch Gradient Descent.)

● Since we update the parameters using the entire dataset in the case of Batch GD, the cost in this case reduces smoothly.
● The reduction of cost in the case of SGD is not that smooth. Since we are updating the parameters based on a single observation, there are a lot of iterations, and it is also possible that the model starts learning noise.
● In the case of Mini-batch Gradient Descent, the cost curve is smoother compared to SGD, since we are not updating the parameters after every single observation but after every subset of the data.
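One way to reproduce curves like these is to record the cost over training and plot it; the sketch below assumes the illustrative data and minibatch_gd_epoch function from the earlier snippets and is only meant to show the bookkeeping.

```python
import matplotlib.pyplot as plt
import numpy as np

def cost(X, y, w, b):
    return np.mean((X @ w + b - y) ** 2)    # MSE on the full training set

history = []
for epoch in range(50):
    w, b = minibatch_gd_epoch(X, y, w, b, alpha=0.1, batch_size=32)
    history.append(cost(X, y, w, b))        # record the cost once per epoch

plt.plot(history)
plt.xlabel("epoch")
plt.ylabel("cost")
plt.show()
```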
Computational Cost of 3 variants of Gradient Descent

● Batch gradient descent processes all the training examples in each iteration, and many epochs are needed to find the optimal parameters. Hence it is computationally very expensive and slow.
● The computational cost per update in the case of SGD is much lower than in Batch Gradient Descent, since we process only a single observation at a time. However, the total computation time increases for large datasets because there are many more iterations.
● Mini-batch gradient descent works faster than both batch gradient descent and stochastic gradient descent. Here b examples, where b < m, are processed per iteration. So even if the number of training examples is large, they are processed in batches of b examples at a time. Thus it works for large training sets, and with fewer iterations than SGD.
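As a rough worked example of the iteration counts behind these costs (m = 10,000 observations and batch size b = 64 are arbitrary illustrative numbers): Batch GD makes 1 update per epoch, SGD makes m, and mini-batch makes ceil(m / b).

```python
import math

m, b = 10_000, 64
print("Batch GD updates per epoch:  ", 1)
print("SGD updates per epoch:       ", m)                 # 10000
print("Mini-batch updates per epoch:", math.ceil(m / b))  # 157
```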
Comparison

Batch
● The entire dataset is used for each parameter update.
● Computationally very expensive.
● Makes smooth updates to the model parameters.

Stochastic
● A single observation is used for each parameter update.
● Quite a bit faster than batch gradient descent.
● Makes very noisy updates to the parameters.

Mini-batch
● A subset of observations is used for each parameter update.
● Works faster than both batch gradient descent and stochastic gradient descent.
● Depending upon the batch size, the updates can be made less noisy – the greater the batch size, the less noisy the update.
