
Contents

1 Module 4: Optimization

SENSE VIT-Chennai, Winter Semester 2020-21


Module 4: Optimization

Topics to be covered in Module-4

Introduction to Optimization
Gradient Descent
Variants of Gradient Descent
Momentum Optimizer
Nesterov Accelerated Gradient
Adagrad
Adadelta
RMSProp
Adam
AMSGrad


Introduction to Optimization
Optimization is the process of maximizing or minimizing a real function by systematically choosing input values from an allowed set and computing the value of the function.
It refers to the use of specific methods to determine the best solution from all feasible solutions, for example, finding the best functional representation or the best hyperplane to classify data.
An optimization problem has three components: the objective function (to be minimized or maximized), the decision variables, and the constraints.
Based on the type of objective function, constraints and decision variables, several types of optimization problems exist. An optimization problem can be linear or non-linear, convex or non-convex, solved iteratively or non-iteratively, etc.
Optimization is considered one of the three pillars of data science; linear algebra and statistics are the other two.

Consider the following optimization problem, which attempts to find the maximal margin hyperplane with margin M:

\[
\underset{\alpha_0, \alpha_1, \ldots, \alpha_p}{\text{maximize}} \quad M \tag{1}
\]
\[
\text{subject to} \quad \sum_{j=1}^{p} \alpha_j^2 = 1, \tag{2}
\]
\[
y_i \Big( \alpha_0 + \sum_{j=1}^{p} \alpha_j x_{ij} \Big) \ge M \quad \text{for all } i = 1, 2, \ldots, n. \tag{3}
\]

Equation (1) is the objective function, equations (2) and (3) are the constraints, and α_0, α_1, ..., α_p are the decision variables.
In general, an objective function is denoted as f(·), and the minimizer of f(·) is the same as the maximizer of −f(·).

Gradient Descent
Gradient descent is the most common optimization algorithm in machine learning and deep learning.
It is a first-order, iterative optimization algorithm that takes into account only the first derivative when updating the parameters.
Each iteration has two steps: (i) finding the (locally) steepest direction according to the first derivative of the objective function; and (ii) finding the best point along that direction. The parameters are updated in the direction opposite to the gradient of the objective function.
The learning rate α determines the convergence (i.e. the number of iterations required to reach a local minimum). It should be neither too small nor too large: a very small α leads to very slow convergence, while a very large value leads to oscillations around the minimum or may even cause divergence.

Gradient (Steepest) Descent


Let f(X) denote the objective function and X_0 the starting point. In iteration k, the next point is given by

\[
X_k = X_{k-1} - \alpha G_{k-1}
\]

where α is the learning rate (step length) and G_{k-1} = ∇f(X_{k-1}) = f′(X_{k-1}) is the gradient of f (the search direction).
Consider, for example, f(X) = x_1 + 2x_1^2 + 2x_1 x_2 + 3x_2^2 with α = 0.1 and

\[
X_0 = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix}.
\]

In this case,

\[
f'(X) = \begin{bmatrix} 1 + 4x_1 + 2x_2 \\ 2x_1 + 6x_2 \end{bmatrix}.
\]


Gradient (Steepest) Descent (contd.)

In the first iteration, the search direction G_0 and the updated point X_1 are computed as follows:

\[
G_0 = f'(X_0) = \begin{bmatrix} 4 \\ 4 \end{bmatrix}
\quad\text{and}\quad
X_1 = X_0 - \alpha G_0 = \begin{bmatrix} 0.1 \\ 0.1 \end{bmatrix}.
\]

Similarly, in the next iteration,

\[
G_1 = f'(X_1) = \begin{bmatrix} 1.6 \\ 0.8 \end{bmatrix}
\quad\text{and}\quad
X_2 = X_1 - \alpha G_1 = \begin{bmatrix} -0.06 \\ 0.02 \end{bmatrix}.
\]

The iterations continue till convergence. The parameter α plays a significant role in both convergence and stability. Figure 1 shows a sample plot of the sequence of estimated points.
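The two iterations above can be reproduced with a few lines of code. Below is a minimal NumPy sketch (not from the slides); the objective, gradient, learning rate and starting point are the ones used in the example.

```python
import numpy as np

def f(x):
    # Objective from the example: f(X) = x1 + 2*x1^2 + 2*x1*x2 + 3*x2^2
    return x[0] + 2 * x[0]**2 + 2 * x[0] * x[1] + 3 * x[1]**2

def grad_f(x):
    # Analytic gradient f'(X) = [1 + 4*x1 + 2*x2, 2*x1 + 6*x2]
    return np.array([1 + 4 * x[0] + 2 * x[1], 2 * x[0] + 6 * x[1]])

alpha = 0.1                   # learning rate (step length)
x = np.array([0.5, 0.5])      # starting point X0

for k in range(1, 3):
    g = grad_f(x)             # search direction G_{k-1}
    x = x - alpha * g         # X_k = X_{k-1} - alpha * G_{k-1}
    print(f"iteration {k}: G = {g}, X = {x}, f(X) = {f(x):.4f}")
# Prints X1 = [0.1, 0.1] and X2 = [-0.06, 0.02], matching the worked example
# (up to floating-point rounding).
```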


Figure 1: Steepest descent - convergence plot. Source: Mishra S.K., Ram B. (2019) Steepest Descent Method. In: Introduction to Unconstrained Optimization with R. Springer, Singapore.

Variants of Gradient Descent


There are three variants of gradient descent, based on the amount of data (number of samples) used to compute the gradient at each iteration; a sketch contrasting them follows the list.
1 Batch Gradient Descent: The parameter update step sums the gradient over all data samples. It has a straight trajectory towards the minimum, and its convergence is guaranteed (on convex problems with a suitable learning rate).
2 Mini-Batch Gradient Descent: The parameter update sums the gradient over a smaller number of samples determined by the batch size. It is faster than batch gradient descent, but convergence is not guaranteed.
3 Stochastic Gradient Descent: The parameter update is done sample-wise. It has lower generalization error than mini-batch gradient descent, but the run time is longer.
Therefore, there is a tradeoff between gradient accuracy and time complexity across these variants.
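The following sketch (an illustration, not from the slides) shows that the three variants differ only in how many samples feed each gradient estimate. The function name, the signature of the user-supplied gradient grad(w, Xb, yb), and the default hyper-parameter values are all assumptions made for this example.

```python
import numpy as np

def gradient_descent(grad, w, X, y, alpha=0.01, batch_size=None, epochs=10, seed=0):
    """Generic loop covering all three variants.

    batch_size=None    -> batch gradient descent (all n samples per update)
    batch_size=m < n   -> mini-batch gradient descent
    batch_size=1       -> stochastic gradient descent
    grad(w, Xb, yb) is an assumed user-supplied gradient of the loss over a batch.
    """
    n = len(y)
    m = n if batch_size is None else batch_size
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        idx = rng.permutation(n)                  # shuffle once per epoch
        for start in range(0, n, m):
            batch = idx[start:start + m]
            w = w - alpha * grad(w, X[batch], y[batch])
    return w
```

Batch gradient descent computes the most accurate gradient per step but touches all n samples each time, while stochastic gradient descent updates after every sample; mini-batch sits in between, which is the tradeoff noted above.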

Question 4.1

Apply the gradient descent approach to minimize the function

\[
f(X) = 4x_1^2 + 3x_1 x_2 + 2.5x_2^2 - 5.5x_1 - 4x_2.
\]

Assume the step size is 0.135 and the starting point is

\[
X_0 = \begin{bmatrix} x_1(0) \\ x_2(0) \end{bmatrix} = \begin{bmatrix} 2 \\ 2 \end{bmatrix}.
\]

Let the stopping criterion be that the absolute difference between the function values in successive iterations is less than 0.005. Your answer should show the search direction and the value of the function in each iteration.


Momentum Optimizer

In the gradient descent approach, the biggest challenge lies in choosing a proper learning rate α. In addition, there are challenges such as non-convex error functions (quite common in neural networks) getting stuck at suboptimal local minima.
To circumvent these challenges, several optimization algorithms have been proposed and used by the deep learning community. Notable among them are momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSProp, Adam, and AMSGrad.
As indicated earlier, the gradient descent approach (say, stochastic gradient descent) with an improper α value might lead to oscillations around the minimum. The momentum optimizer attempts to dampen these oscillations by accelerating stochastic gradient descent in the relevant direction.



The momentum optimizer accomplishes this by adding a fraction γ of the update vector from the previous iteration to the current update vector:

\[
\begin{aligned}
w_k &= w_{k-1} - \big[\gamma v_{k-2} + \alpha f'(w_{k-1})\big] \\
    &= w_{k-1} - \gamma v_{k-2} - \alpha f'(w_{k-1})
\end{aligned}
\]

where the term v_{k−2} = w_{k−2} − w_{k−1} = γ v_{k−3} + α f′(w_{k−2}) refers to the update vector of the previous iteration.
Two forces act on the parameter being updated in an iteration: the gradient force α f′(w_{k−1}) and the momentum force γ v_{k−2}.
The momentum term γ v_{k−2} decreases when the gradient direction changes and increases when it does not. Therefore, this approach dampens oscillations and leads to faster convergence.
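A minimal sketch of a single momentum update in the slide's notation is given below; the function name, the default values γ = 0.9 and α = 0.01, and the toy objective in the usage example are assumptions for illustration.

```python
def momentum_step(w, v_prev, grad_w, alpha=0.01, gamma=0.9):
    """One momentum update: v_{k-1} = gamma * v_{k-2} + alpha * f'(w_{k-1}),
    then w_k = w_{k-1} - v_{k-1}. Inputs may be floats or NumPy arrays."""
    v = gamma * v_prev + alpha * grad_w   # momentum force + gradient force
    return w - v, v                       # new parameters, new update vector

# Usage on the toy objective f(w) = w^2, so f'(w) = 2w:
w, v = 1.0, 0.0
for _ in range(5):
    w, v = momentum_step(w, v, 2 * w)
```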
Nesterov Accelerated Gradient
Nesterov Accelerated Gradient (NAG) attempts to use momentum more effectively than the momentum optimizer.
Given that w_{k−1} − γ v_{k−2} is a rough approximation of w_k, the search direction (i.e. the gradient) is computed with respect to this anticipated update w_{k−1} − γ v_{k−2} instead of the previous update w_{k−1}. The current update is expressed as follows:

\[
\begin{aligned}
w_k &= w_{k-1} - \big[\gamma v_{k-2} + \alpha f'(w_{k-1} - \gamma v_{k-2})\big] \\
    &= w_{k-1} - \gamma v_{k-2} - \alpha f'(w_{k-1} - \gamma v_{k-2})
\end{aligned}
\]

This anticipatory update further improves the performance of gradient descent.
Both the momentum optimizer and NAG require two hyper-parameters (γ and α) to be set manually; these parameters determine the learning rate. Both optimizers use the same learning rate for all dimensions, which is not ideal.
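The only change from the momentum sketch above is where the gradient is evaluated. The following sketch (an illustration, not from the slides) takes a gradient callable so it can be evaluated at the look-ahead point; names and default values are assumptions.

```python
def nag_step(w, v_prev, grad_fn, alpha=0.01, gamma=0.9):
    """One NAG update: the gradient is evaluated at the look-ahead point
    w_{k-1} - gamma * v_{k-2} rather than at w_{k-1}."""
    lookahead = w - gamma * v_prev
    v = gamma * v_prev + alpha * grad_fn(lookahead)   # anticipatory gradient
    return w - v, v

# Usage on the toy objective f(w) = w^2, i.e. f'(w) = 2w:
w, v = 1.0, 0.0
for _ in range(5):
    w, v = nag_step(w, v, lambda x: 2 * x)
```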
Adagrad
The Adaptive Gradient (Adagrad) optimizer adaptively scales the learning rate for each dimension. For a parameter, the scale factor is inversely proportional to the square root of the sum of the historical squared values of its gradient. The update rule is:

\[
w_k(i) = w_{k-1}(i) - \frac{\alpha}{\sqrt{R_{k-1}(i, i) + \epsilon}} \, G_{k-1}(i)
\]

where R_{k−1} is a diagonal matrix whose element (i, i) is the sum of the squares of the gradients with respect to w(i) up to time step k − 1, and ε is a smoothing term (usually 10⁻⁸).
The learning rate decreases faster for parameters with large gradients.
Adagrad does not require manual tuning of the learning rate.
It converges rapidly when applied to convex functions. In the case of non-convex functions, the learning rate eventually becomes too small and, at some point, the model may stop learning.
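A per-step sketch of the rule above (an illustration, not from the slides); the function name and the default values of alpha and eps are assumptions, and the diagonal of R_{k-1} is stored as a vector r.

```python
import numpy as np

def adagrad_step(w, r, grad_w, alpha=0.01, eps=1e-8):
    """One Adagrad update. r accumulates the element-wise sum of squared
    gradients (the diagonal of R_{k-1})."""
    r = r + grad_w**2                                # accumulate squared gradients
    return w - alpha / np.sqrt(r + eps) * grad_w, r  # per-dimension scaled step

# Usage: two parameters with very different gradient magnitudes get
# different effective learning rates.
w, r = np.zeros(2), np.zeros(2)
w, r = adagrad_step(w, r, np.array([10.0, 0.1]))
```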
Adadelta
Adadelta, an extension of Adagrad, attempts to resolve Adagrad's issue of radically diminishing learning rates. It limits the window of accumulated past gradients to some fixed size.
Instead of storing all previous squared gradients, the accumulation is recursively defined as a decaying average of past squared gradients. The update becomes:

\[
w_k = w_{k-1} - \frac{\alpha}{\sqrt{E[G^2]_{k-1} + \epsilon}} \, G_{k-1} \tag{4}
\]

where E[G^2]_{k−1} = β E[G^2]_{k−2} + (1 − β) G^2_{k−1}.
The term √(E[G^2]_{k−1} + ε) is the Root Mean Square (RMS) of the gradient. Adadelta further replaces the α term in the numerator with the RMS of the previous parameter updates. Therefore, there is no need to set the value of α.
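A sketch of the full Adadelta rule described in the last bullet, where the RMS of previous parameter updates takes the place of α (an illustration, not from the slides; the names and the defaults beta = 0.9 and eps = 1e-8 are assumptions).

```python
import numpy as np

def adadelta_step(w, eg2, edw2, grad_w, beta=0.9, eps=1e-8):
    """One Adadelta update. eg2 is the decaying average of squared gradients
    E[G^2]; edw2 is the decaying average of squared parameter updates, whose
    RMS replaces the learning rate alpha."""
    eg2 = beta * eg2 + (1 - beta) * grad_w**2                   # E[G^2]_{k-1}
    delta = -np.sqrt(edw2 + eps) / np.sqrt(eg2 + eps) * grad_w  # update step
    edw2 = beta * edw2 + (1 - beta) * delta**2                  # track update magnitude
    return w + delta, eg2, edw2
```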

RMSProp

Adadelta and RMSProp were developed independently around the same time.
RMSProp is the same as the first update of Adadelta (given as Equation (4) above):

\[
w_k = w_{k-1} - \frac{\alpha}{\text{RMS}[G]_{k-1}} \, G_{k-1}.
\]

Like Adadelta, it uses an exponentially decaying average of the squared gradients and discards history from the extreme past.
It converges rapidly once it finds a locally convex bowl, behaving like Adagrad initialized within that bowl.
RMSProp is very effective for mini-batch gradient descent learning.
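A per-step sketch of RMSProp as given by Equation (4) (an illustration, not from the slides; the defaults alpha = 0.001 and beta = 0.9 are commonly used values, not taken from the slides).

```python
import numpy as np

def rmsprop_step(w, eg2, grad_w, alpha=0.001, beta=0.9, eps=1e-8):
    """One RMSProp update: divide the gradient by the RMS of an exponentially
    decaying average of squared gradients."""
    eg2 = beta * eg2 + (1 - beta) * grad_w**2      # decaying average E[G^2]
    return w - alpha / np.sqrt(eg2 + eps) * grad_w, eg2
```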


Adam
Adaptive Moment Estimation (Adam) combines RMSProp and momentum.
It incorporates the momentum term (i.e. an exponentially weighted first moment of the gradient) into RMSProp as follows:

\[
w_k = w_{k-1} - \frac{\alpha}{\sqrt{\hat{v}_{k-1}} + \epsilon} \, \hat{m}_{k-1}
\]

where m̂_{k−1} and v̂_{k−1} are bias-corrected versions of m_{k−1} (first moment) and v_{k−1} (second moment) respectively. The first and second moments are:

\[
\begin{aligned}
m_{k-1} &= \beta_1 m_{k-2} + (1 - \beta_1) G_{k-1} \\
v_{k-1} &= \beta_2 v_{k-2} + (1 - \beta_2) G_{k-1}^2.
\end{aligned}
\]
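A per-step sketch of Adam with the bias corrections made explicit (an illustration, not from the slides; the defaults alpha = 0.001, beta1 = 0.9 and beta2 = 0.999 are commonly used values, not taken from the slides).

```python
import numpy as np

def adam_step(w, m, v, grad_w, k, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; k is the 1-based iteration count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad_w        # first moment (momentum term)
    v = beta2 * v + (1 - beta2) * grad_w**2     # second moment (RMSProp term)
    m_hat = m / (1 - beta1**k)                  # bias-corrected first moment
    v_hat = v / (1 - beta2**k)                  # bias-corrected second moment
    return w - alpha / (np.sqrt(v_hat) + eps) * m_hat, m, v
```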


AMSGrad

In situations where only some mini-batches provide large and informative gradients, Adam may converge to a suboptimal solution. This is because exponential averaging diminishes the influence of such rarely occurring mini-batches, which leads to poor convergence.
AMSGrad updates the parameters by considering the maximum of past squared gradients rather than their exponential average. The update rule is:

\[
w_k = w_{k-1} - \frac{\alpha}{\sqrt{\max(\tilde{v}_{k-2}, v_{k-1})} + \epsilon} \, m_{k-1}.
\]

Note that bias correction is not considered.
AMSGrad results in a non-increasing step size, which resolves the problem suffered by Adam.
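A per-step sketch of AMSGrad (an illustration, not from the slides; names and default values are assumptions). The running maximum v_max corresponds to max(ṽ_{k-2}, v_{k-1}) in the update rule above.

```python
import numpy as np

def amsgrad_step(w, m, v, v_max, grad_w, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update: step with the running maximum of the second-moment
    estimates, so the effective step size never increases. No bias correction."""
    m = beta1 * m + (1 - beta1) * grad_w
    v = beta2 * v + (1 - beta2) * grad_w**2
    v_max = np.maximum(v_max, v)                  # max of past squared-gradient averages
    return w - alpha / (np.sqrt(v_max) + eps) * m, m, v, v_max
```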

Module-4 Summary
Introduction to Optimization: the three components of an optimization problem
Gradient Descent: a first-order, iterative optimization algorithm
Variants of Gradient Descent: batch gradient descent, mini-batch gradient descent and stochastic gradient descent
Momentum Optimizer: accelerates stochastic gradient descent in the relevant direction; NAG uses the momentum term for an anticipatory update
Adagrad: adaptively scales the learning rate for each dimension
Adadelta: accumulated squared gradients recursively defined as a decaying average of past squared gradients
RMSProp: the same as the first update of Adadelta
Adam: a combination of RMSProp and momentum
AMSGrad: considers the maximum of past squared gradients
