
Gradient Descent and Cost Function

Optimization:
• In our day-to-day lives, we optimize variables through our personal decisions without even consciously recognizing the process.
• We are constantly using optimization techniques all day long.
• For example, while going to work we choose a shorter route to minimize traffic woes, or we schedule a cab in advance to reach the airport on time.
• Optimization is the ultimate goal, whether you are dealing with actual events in real life or creating a technology-based product.
• Optimization may be defined as the process by which an optimum is achieved: designing the best possible output for your problem with the resources available.
• One of the most popular optimization techniques is Gradient Descent.
What is gradient descent?
• It is an optimization algorithm mainly used to find the minimum of a function by adjusting its parameters.
• Here, "parameters" means the coefficients in linear regression and the weights in neural networks.
Contd..
• The main goal of gradient descent is to minimize the cost function.
• When plotted, the cost function is typically an inclined and/or irregular surface; the role of gradient descent is to provide the direction and the velocity (learning rate) of the movement needed to attain the minimum of the function, i.e. the point where the cost is lowest, as in the sketch below.
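As a minimal sketch of this idea (the starting point, learning rate, and iteration count are illustrative assumptions, not values from the slides), gradient descent on the simple cost function y = x² repeatedly steps against the slope, and the learning rate controls how large each step is:

    # Minimal sketch: gradient descent on the cost function y = x**2.
    x = 5.0                  # illustrative starting point
    learning_rate = 0.1      # the "velocity" of each step
    for step in range(100):
        gradient = 2 * x     # derivative of x**2 with respect to x
        x = x - learning_rate * gradient   # move against the slope
    print(x)                 # ends up close to 0.0, where the cost is lowest

With a learning rate that is too large the steps overshoot the minimum, and with one that is too small convergence becomes very slow, which is why the learning rate is described as the velocity of the movement.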
What is a cost function?
• A Cost Function/Loss Function tells us “how good” our model is at
making predictions for a given set of parameters.
• Generally, the cost function takes a simple form such as Y = X². In a Cartesian coordinate system, this is the equation of a parabola that opens upward.
• Now, in order to minimize this function, we first need to find the value of X that produces the lowest value of Y (for Y = X², this is X = 0, the vertex of the parabola).
• Next, a cost function is required that measures the error over an entire dataset, so that the parameters can be adjusted to minimize it.
• The most commonly used cost function is the mean squared error (MSE): MSE = (1/n) Σ (y_i - ŷ_i)², where ŷ_i is the model's prediction for example i.
• For a linear model ŷ = m*x + b, gradient descent updates the parameters iteratively, as illustrated in the sketch below:
•  m_curr = m_curr - learning_rate * d(cost)/dm
•  b_curr = b_curr - learning_rate * d(cost)/db
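To make these update rules concrete, here is a minimal sketch in Python (the helper name gradient_step, the sample data, and the hyperparameter values are illustrative assumptions) of gradient descent for the line ŷ = m*x + b under the mean squared error:

    import numpy as np

    def gradient_step(m_curr, b_curr, x, y, learning_rate):
        """Apply one gradient descent update to m and b under the MSE cost."""
        n = len(x)
        y_pred = m_curr * x + b_curr
        # Partial derivatives of MSE = (1/n) * sum((y - y_pred)**2)
        dm = -(2 / n) * np.sum(x * (y - y_pred))   # d(cost)/dm
        db = -(2 / n) * np.sum(y - y_pred)         # d(cost)/db
        m_curr = m_curr - learning_rate * dm
        b_curr = b_curr - learning_rate * db
        return m_curr, b_curr

    # Illustrative data generated from y = 2x + 3; parameters start at zero.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([5.0, 7.0, 9.0, 11.0, 13.0])
    m, b = 0.0, 0.0
    for _ in range(10000):
        m, b = gradient_step(m, b, x, y, learning_rate=0.01)
    print(m, b)   # approximately 2.0 and 3.0

Repeating the step many times moves m and b toward the values that minimize the mean squared error over the dataset.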
Types of gradient descent
• BATCH GRADIENT DESCENT
• STOCHASTIC GRADIENT DESCENT
• MINI-BATCH GRADIENT DESCENT
BATCH GRADIENT DESCENT

• Batch gradient descent calculates the error for each example in the training dataset, but the model is updated only after all training examples have been evaluated (see the sketch after this list).
• This whole cycle over the training data is called a training epoch.
• The advantages of batch gradient descent are that it is computationally efficient and that it produces a stable error gradient and stable convergence.
• The disadvantages are that the stable error gradient can sometimes result in a state of convergence that isn't the best the model can achieve, and that the entire training dataset must be in memory and available to the algorithm.
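Under these assumptions (the same linear model with MSE cost and illustrative hyperparameters), a rough sketch of batch gradient descent looks like this; note that the parameters are updated only once per epoch, after the gradient has been computed over the full training set:

    import numpy as np

    def batch_gradient_descent(x, y, learning_rate=0.01, epochs=1000):
        """One parameter update per epoch, using the whole dataset (NumPy arrays x, y)."""
        m_curr, b_curr = 0.0, 0.0
        n = len(x)
        for epoch in range(epochs):
            y_pred = m_curr * x + b_curr                  # evaluate every training example
            dm = -(2 / n) * np.sum(x * (y - y_pred))      # gradient accumulated over all examples
            db = -(2 / n) * np.sum(y - y_pred)
            m_curr -= learning_rate * dm                  # single, stable update per epoch
            b_curr -= learning_rate * db
        return m_curr, b_curr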
STOCHASTIC GRADIENT DESCENT
• By contrast, stochastic gradient descent (SGD) computes the error and updates the parameters for each training example in the dataset, one by one (see the sketch after this list).
• One advantage is that the frequent updates give us a detailed picture of the rate of improvement.
• The frequent updates, however, are more computationally expensive
than the batch gradient descent approach. Additionally, the frequency
of those updates can result in noisy gradients, which may cause the
error rate to jump around instead of slowly decreasing.
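A corresponding sketch of stochastic gradient descent for the same linear model (hyperparameters again illustrative) updates the parameters after every single example, which is what makes the updates frequent but noisy:

    import numpy as np

    def stochastic_gradient_descent(x, y, learning_rate=0.01, epochs=100):
        """One parameter update per individual training example (NumPy arrays x, y)."""
        m_curr, b_curr = 0.0, 0.0
        for epoch in range(epochs):
            for i in np.random.permutation(len(x)):   # visit the examples in random order
                y_pred = m_curr * x[i] + b_curr
                dm = -2 * x[i] * (y[i] - y_pred)      # gradient from one example only
                db = -2 * (y[i] - y_pred)
                m_curr -= learning_rate * dm          # update immediately after each example
                b_curr -= learning_rate * db
        return m_curr, b_curr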
MINI-BATCH GRADIENT DESCENT
• Mini-batch gradient descent is the go-to method since it’s a
combination of the concepts of SGD and batch gradient descent.
• It simply splits the training dataset into small batches and performs an update for each of those batches (see the sketch after this list).
• This creates a balance between the robustness of stochastic gradient
descent and the efficiency of batch gradient descent.
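A sketch of mini-batch gradient descent for the same linear model (the batch size and other hyperparameters are illustrative) shuffles the data, splits it into small batches, and performs one update per batch:

    import numpy as np

    def mini_batch_gradient_descent(x, y, learning_rate=0.01, epochs=100, batch_size=32):
        """One parameter update per mini-batch (NumPy arrays x, y)."""
        m_curr, b_curr = 0.0, 0.0
        n = len(x)
        for epoch in range(epochs):
            order = np.random.permutation(n)                      # shuffle once per epoch
            for start in range(0, n, batch_size):
                idx = order[start:start + batch_size]
                xb, yb = x[idx], y[idx]                           # one small batch
                y_pred = m_curr * xb + b_curr
                dm = -(2 / len(xb)) * np.sum(xb * (yb - y_pred))  # gradient over the batch
                db = -(2 / len(xb)) * np.sum(yb - y_pred)
                m_curr -= learning_rate * dm                      # update after every batch
                b_curr -= learning_rate * db
        return m_curr, b_curr

The batch size trades off between the two extremes: batch_size = 1 recovers stochastic gradient descent, while batch_size = n recovers batch gradient descent.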
