
GRADIENT DESCENT

OPTIMIZATION
Gradient Descent?
• An optimization algorithm for finding a local minimum of a differentiable function.

• Used when training a machine learning model.

• Based on a convex function; it tweaks the function's parameters iteratively to minimize the given function to its local minimum.

• Used to find the values of a function's parameters (coefficients) that minimize a cost function as far as possible.
Gradient?
• Slope of a function.

• "A gradient measures how much the output of a function changes if you change the inputs a little bit."
— Lex Fridman (MIT)

• In ML terms, it measures the change in all weights with regard to the change in error.

• The higher the gradient, the steeper the slope and the faster a model can learn.

• But if the slope is zero, the model stops learning.

• In mathematical terms, the gradient is the vector of partial derivatives of a function with respect to its inputs.
Gradient Descent - How does it work?

• Analogy: a blindfolded person trying to climb down a mountain to its lowest point by feeling the slope and stepping in the steepest downhill direction [1]
Gradient Descent Algorithm
• Repeat until convergence {
  $\theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$
} (simultaneously update $j = 0$ and $j = 1$)

• Each iteration tells us the next position to move to, which is in the direction of the steepest descent.
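A minimal Python/NumPy sketch of this loop, assuming the usual two-parameter linear-regression cost $J(\theta_0, \theta_1) = \frac{1}{2m}\sum_i (\theta_0 + \theta_1 x_i - y_i)^2$; the function name and toy data are illustrative, not part of the slides:

import numpy as np

def gradient_descent(x, y, alpha=0.1, n_iters=1000):
    # Vanilla gradient descent on J(theta0, theta1) = 1/(2m) * sum((theta0 + theta1*x - y)**2)
    theta0, theta1 = 0.0, 0.0
    m = len(x)
    for _ in range(n_iters):
        error = theta0 + theta1 * x - y              # prediction error for every example
        grad0 = error.sum() / m                      # dJ/dtheta0
        grad1 = (error * x).sum() / m                # dJ/dtheta1
        # simultaneous update of j = 0 and j = 1
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0                                    # generated so that theta0 = 1, theta1 = 2
print(gradient_descent(x, y))                        # approaches (1.0, 2.0)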
Learning rate
• Should be set to an appropriate value, which is neither too high nor too low.
• Too high: the updates may overshoot the minimum and diverge; too low: convergence becomes very slow.
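A tiny illustration of that trade-off on the one-dimensional cost J(θ) = θ² (gradient 2θ); the step counts and α values below are arbitrary choices for the demo:

def run(alpha, steps=25, theta=5.0):
    # repeatedly apply theta := theta - alpha * dJ/dtheta for J(theta) = theta**2
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta

print(run(0.01))   # too low: after 25 steps theta is still far from the minimum at 0
print(run(0.4))    # appropriate: theta is essentially 0
print(run(1.1))    # too high: the iterates overshoot and grow in magnitude (divergence)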
Visualization of the cost function (figure)
Types of gradient descent
• Batch gradient descent
• Also called vanilla gradient descent
• Calculates the error for each example within the training dataset,
but only after all training examples have been evaluated does the
model get updated.
• This whole process is like a cycle and it's called a training epoch.

• Advantages
• Computationally efficient
• Produces a stable convergence

• Disadvantages
• Requires that the entire training dataset be held in memory and available to the algorithm
• Updates, and therefore training, can be slow for large datasets, since each update needs a full pass over the data
Stochastic Gradient Descent (SGD)
• SGD does the update for each training example within the
dataset, meaning it updates the parameters for each
training example one by one.

• Advantages:
• The frequent updates give a detailed picture of the rate of improvement.

• Disadvantages:
• more computationally expensive than the batch gradient descent
approach
• Frequent updates may result in noisy gradients, which may cause
the error rate to jump around instead of slowly decreasing.
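A sketch of the per-example update, reusing the illustrative linear-regression setup from the earlier gradient-descent sketch; shuffling each epoch is a common choice, not something the slides prescribe:

import numpy as np

def sgd(x, y, alpha=0.05, n_epochs=100, seed=0):
    # Stochastic gradient descent: one parameter update per training example
    rng = np.random.default_rng(seed)
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_epochs):
        for i in rng.permutation(len(x)):            # visit the examples in random order
            error = theta0 + theta1 * x[i] - y[i]
            # update immediately from this single, noisy gradient estimate
            theta0 -= alpha * error
            theta1 -= alpha * error * x[i]
    return theta0, theta1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
print(sgd(x, y))                                     # approaches (1.0, 2.0) on this noiseless toy data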
Mini-Batch Gradient Descent
• A combination of the concepts of SGD and batch gradient descent.

• Splits the training dataset into small batches and performs an update for each of those batches.

• The go-to algorithm when training a neural network.

• The most common type of gradient descent within deep learning.
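A sketch of the mini-batch variant under the same illustrative setup; the batch size of 2 is arbitrary for the toy data (sizes like 32-256 are more typical in practice):

import numpy as np

def minibatch_gd(x, y, alpha=0.05, batch_size=2, n_epochs=200, seed=0):
    # Mini-batch gradient descent: one update per small batch of examples
    rng = np.random.default_rng(seed)
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_epochs):
        order = rng.permutation(len(x))              # shuffle, then slice into batches
        for start in range(0, len(x), batch_size):
            idx = order[start:start + batch_size]
            error = theta0 + theta1 * x[idx] - y[idx]
            theta0 -= alpha * error.mean()           # gradient averaged over the batch
            theta1 -= alpha * (error * x[idx]).mean()
    return theta0, theta1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
print(minibatch_gd(x, y))                            # approaches (1.0, 2.0)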
Gradient Descent with Momentum

• Pathological curvature (figure): long, narrow ravine-like regions of the loss surface where plain gradient descent progresses slowly and oscillates.
Gradient Descent with Momentum
• Momentum is a method that helps accelerate SGD in the
relevant direction and dampens oscillations

(Figures: SGD without momentum vs. SGD with momentum)

• The basic idea of gradient descent with momentum is to calculate an exponentially weighted average of your gradients and then use that average instead of the raw gradient to update your weights.
Gradient Descent with Momentum
• It does this by adding a fraction $\gamma$ of the update vector of the past time step to the current update vector:
$v_t = \gamma v_{t-1} + \alpha \dfrac{\partial}{\partial \theta} J(\theta)$
$\theta = \theta - v_t$

• In exponentially weighted average notation (for weights W and biases b):
$V_{dW} = \beta\, V_{dW} + (1 - \beta)\, dW$
$V_{db} = \beta\, V_{db} + (1 - \beta)\, db$

• The momentum term $\gamma$ is defined in the range 0.0 to 1.0. It is usually set to 0.9.

• The momentum term increases for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change directions.

• As a result, we gain faster convergence and reduced oscillation.
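A sketch of one momentum step following the $v_t$ and $\theta$ equations above; the toy objective $\lVert\theta\rVert^2$ and the hyperparameter values are illustrative:

import numpy as np

def momentum_step(theta, grad, velocity, alpha=0.01, gamma=0.9):
    # v_t = gamma * v_{t-1} + alpha * grad ;  theta = theta - v_t
    velocity = gamma * velocity + alpha * grad
    return theta - velocity, velocity

theta = np.array([1.0, -2.0])
velocity = np.zeros_like(theta)                      # one velocity entry per parameter
for _ in range(200):
    grad = 2 * theta                                 # gradient of the toy objective J(theta) = ||theta||^2
    theta, velocity = momentum_step(theta, grad, velocity)
print(theta)                                         # approaches the minimum at [0, 0]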
Adagrad
• Adagrad uses a different learning rate for every
parameter 𝜃𝑖 at every time step t.

• Adagrad adapts the learning rate to the parameters, performing smaller updates (i.e. low learning rates) for parameters associated with frequently occurring features, and larger updates (i.e. high learning rates) for parameters associated with infrequent features.

• It is well-suited for dealing with sparse data.


Adagrad's per-parameter update
• Let $g_{t,i}$ denote the gradient of the objective w.r.t. the parameter $\theta_i$ at time step $t$: $g_{t,i} = \nabla_\theta J(\theta_{t,i})$

• The SGD update for every parameter $\theta_i$ at each time step $t$ then becomes:
$\theta_{t+1,i} = \theta_{t,i} - \eta \cdot g_{t,i}$

• Adagrad's update rule modifies the general learning rate $\eta$ at each time step $t$ for every parameter $\theta_i$ based on the past gradients that have been computed for $\theta_i$:
$\theta_{t+1,i} = \theta_{t,i} - \dfrac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \cdot g_{t,i}$


Adagrad's per-parameter update

• $G_t \in \mathbb{R}^{d \times d}$ is a diagonal matrix where each diagonal element $(i, i)$ is the sum of the squares of the gradients w.r.t. $\theta_i$ up to time step $t$.

• $\epsilon$ is a smoothing term that avoids division by zero (usually on the order of $10^{-8}$).

• Now we can vectorize the update by performing an element-wise matrix-vector product $\odot$ between $G_t$ and $g_t$:
$\theta_{t+1} = \theta_t - \dfrac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t$
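A per-parameter sketch of this update; in practice only the diagonal of $G_t$ is stored (one accumulator per parameter), and the names below are illustrative:

import numpy as np

def adagrad_step(theta, grad, sq_grad_sum, eta=0.01, eps=1e-8):
    # Accumulate squared gradients (the diagonal of G_t), then scale each parameter's step individually
    sq_grad_sum = sq_grad_sum + grad ** 2
    theta = theta - eta / np.sqrt(sq_grad_sum + eps) * grad
    return theta, sq_grad_sum

theta = np.array([1.0, -2.0])
sq_grad_sum = np.zeros_like(theta)                   # start the accumulator at zero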
Pros and Cons
• Pros:
• It eliminates the need to manually tune the learning rate. Most
implementations use a default value of 0.01

• Cons:
• It accumulates the squared gradients in the denominator: since every added term is positive, the accumulated sum keeps growing during training.

• This in turn causes the learning rate to shrink and eventually become infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge.
RMSprop
• An adaptive learning rate method proposed by Geoff Hinton.
• Instead of Adagrad's growing sum, the running average of squared gradients is recursively defined as a decaying average of all past squared gradients.
• The running average $E[g^2]_t$ at time step $t$ then depends only on the previous average and the current gradient:
$E[g^2]_t = \gamma\, E[g^2]_{t-1} + (1 - \gamma)\, g_t^2$
$\theta_{t+1} = \theta_t - \dfrac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t$
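A sketch of the corresponding step; Hinton's suggested values ($\gamma = 0.9$, $\eta = 0.001$) are used as defaults, but the function itself is an illustrative sketch, not a reference implementation:

import numpy as np

def rmsprop_step(theta, grad, avg_sq_grad, eta=0.001, gamma=0.9, eps=1e-8):
    # Decaying average of squared gradients replaces Adagrad's ever-growing sum
    avg_sq_grad = gamma * avg_sq_grad + (1 - gamma) * grad ** 2
    theta = theta - eta / np.sqrt(avg_sq_grad + eps) * grad
    return theta, avg_sq_grad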

Adam (Adaptive Moment Estimation)
• AMSGrad builds on Adam (Adaptive Moment Estimation), so Adam is covered first
• Stores two components
• exponentially decaying average of past squared gradients 𝑣𝑡 like
Adadelta and RMSprop,
• exponentially decaying average of past gradients 𝑚𝑡 , similar to
momentum.
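For reference (following the cited ruder.io summary of the Adam paper), the two components are maintained with the recurrences:

$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$
$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$

where $g_t$ is the gradient at time step $t$.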
Adam (Adaptive Moment Estimation)
• As 𝑚𝑡 and 𝑣𝑡 are initialized as vectors of 0's, the authors
of Adam observe that they are biased towards zero,
especially during the initial time steps, and especially
when the decay rates are small (i.e. 𝛽1 and 𝛽2 are close
to 1).
• Bias-corrected first and second moment estimates are computed as:
$\hat{m}_t = \dfrac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \dfrac{v_t}{1 - \beta_2^t}$

• Adam update rule:
$\theta_{t+1} = \theta_t - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$

• The authors propose default values of 0.9 for $\beta_1$, 0.999 for $\beta_2$, and $10^{-8}$ for $\epsilon$.
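A sketch of one Adam step with the update rule above; t is the 1-based step counter needed for the bias correction, and the names are illustrative:

import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad               # decaying average of past gradients
    v = beta2 * v + (1 - beta2) * grad ** 2          # decaying average of past squared gradients
    m_hat = m / (1 - beta1 ** t)                     # bias-corrected first moment (t starts at 1)
    v_hat = v / (1 - beta2 ** t)                     # bias-corrected second moment
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v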
AMSGrad
• Adaptive learning rate methods fail to converge to an optimal solution in some cases, e.g. object recognition or machine translation.
• Reddi et al. pinpoint the exponential moving average of past squared gradients as a reason for the poor generalization behaviour of adaptive learning rate methods.
• AMSGrad uses the maximum of past squared
gradients 𝑣𝑡 rather than the exponential average to
update the parameters.
AMSGrad
• AMSGrad update rule (Adam's moment estimates without bias correction, using the running maximum of $v_t$):
$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$
$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$
$\hat{v}_t = \max(\hat{v}_{t-1}, v_t)$
$\theta_{t+1} = \theta_t - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, m_t$
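A sketch of the AMSGrad step as written above; as with the other snippets, the function name and defaults are illustrative assumptions:

import numpy as np

def amsgrad_step(theta, grad, m, v, v_hat_max, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    v_hat_max = np.maximum(v_hat_max, v)             # never let the denominator shrink
    theta = theta - eta * m / (np.sqrt(v_hat_max) + eps)
    return theta, m, v, v_hat_max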
References
• https://builtin.com/data-science/gradient-descent

• https://www.youtube.com/watch?v=rIVLE3condE

• https://ruder.io/optimizing-gradient-descent/index.html#momentum

• https://distill.pub/2017/momentum/

• https://medium.com/optimization-algorithms-for-deep-neural-networks/gradient-descent-with-momentum-dce805cd8de8

• https://blog.paperspace.com/intro-to-optimization-momentum-rmsprop-adam/
