
Optimization

• Optimization is concerned with optimizing an objective function —
finding the value of an argument that minimizes or maximizes the
function
 Most optimization algorithms are formulated in terms of minimizing a
function 𝑓(𝑥)
 Maximization is accomplished via minimizing the negative of the objective
function (e.g., minimize −𝑓(𝑥))
 In minimization problems, the objective function is often referred to as a cost
function, loss function, or error function
• Optimization is very important for machine learning
 The performance of optimization algorithms affects the model’s training
efficiency
• Most optimization problems in machine learning are nonconvex
 Meaning that the loss function is not a convex function
 Nonetheless, the design and analysis of algorithms for solving convex
problems has been very instructive for advancing the field of machine learning
Optimization
• Optimization and machine learning have related, but somewhat different
goals
 Goal in optimization: minimize an objective function
oFor a set of training examples, reduce the training error
 Goal in ML: find a suitable model to predict on data examples
oFor a set of testing examples, reduce the generalization error
• For a given empirical function g (dashed purple curve), optimization
algorithms attempt to find the point of minimum empirical risk
• The empirical function g only approximates
the expected function f (blue curve), since it
is obtained from a limited amount of
training data examples
• ML algorithms attempt to find the point of
minimum expected risk, based on
minimizing the error on a set of testing
examples
o Which may be at a different location than the
minimum of the training examples
o And which may not be minimal in a formal
sense
Picture from: http://d2l.ai/chapter_optimization/optimization-intro.html
Optimization
Stationary Points
• Stationary points (or critical points) of a differentiable function 𝑓(𝑥) of
one variable are the points where the derivative of the function is zero,
i.e., 𝑓′(𝑥) = 0
• The stationary points can be:
 Minimum, a point where the derivative changes from negative to positive
 Maximum, a point where the derivative changes from positive to negative
 Saddle point, a point where the derivative has the same sign (positive or
negative) on both sides of the point
• The minimum and maximum points are collectively known as extremum
points
• The nature of stationary points can be
determined based on the second
derivative of 𝑓(𝑥) at the point
 If 𝑓′′(𝑥) > 0, the point is a minimum
 If 𝑓′′(𝑥) < 0, the point is a maximum
 If 𝑓′′(𝑥) = 0, the test is inconclusive: the point can
be a saddle point, but it may not be
• The same concept also applies to
gradients of multivariate functions
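The second-derivative test can be sketched numerically with finite differences; the step size h and the tolerances below are illustrative choices, not part of the slides.

```python
# Classify a stationary point of f(x) with the second-derivative test,
# using central finite differences to approximate f'(x) and f''(x).

def second_derivative_test(f, x, h=1e-4):
    d1 = (f(x + h) - f(x - h)) / (2 * h)          # estimate of f'(x)
    d2 = (f(x + h) - 2 * f(x) + f(x - h)) / h**2  # estimate of f''(x)
    assert abs(d1) < 1e-6, "x is not a stationary point"
    if d2 > 1e-6:
        return "minimum"
    if d2 < -1e-6:
        return "maximum"
    return "inconclusive"

print(second_derivative_test(lambda x: x**2, 0.0))   # minimum
print(second_derivative_test(lambda x: -x**2, 0.0))  # maximum
print(second_derivative_test(lambda x: x**3, 0.0))   # inconclusive
```

Note that the inconclusive case matches the slide: for 𝑓(𝑥) = 𝑥³ the second derivative vanishes at 0, and the point is a saddle point.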
Optimization
Local Minima
• The challenges in optimizing a model’s parameters in ML include
local minima, saddle points, and vanishing gradients
• For an objective function 𝑓(𝑥), if the value at a point x is the minimum of
the objective function over the entire domain of x, then it is the global
minimum
• If the value of 𝑓(𝑥) at x is smaller than the values of the objective function
at all other points in the vicinity of x, then it is a local minimum
 The objective functions in ML usually
have many local minima
o When the solution of the optimization
algorithm is near the local minimum,
the gradient of the loss function
approaches or becomes zero
(vanishing gradients)
o Therefore, the obtained solution in the
final iteration can be a local minimum,
rather than the global minimum

Picture from: http://d2l.ai/chapter_optimization/optimization-intro.html
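This behavior can be sketched with plain gradient descent on a toy nonconvex function; the function, learning rate, and starting points below are illustrative choices, not from the slides.

```python
# Gradient descent on f(x) = (x^2 - 1)^2 + 0.3x, a toy nonconvex
# function whose global minimum is near x = -1 and whose local
# minimum is near x = +1.  Which minimum the iteration reaches
# depends only on the starting point.

def grad(x):
    # f'(x) = 4x(x^2 - 1) + 0.3
    return 4 * x * (x**2 - 1) + 0.3

def descend(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

print(descend(+0.5))  # stalls at the local minimum, near x = 0.96
print(descend(-0.5))  # reaches the global minimum, near x = -1.04
```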


Optimization
Saddle Points
• The gradient of a function 𝑓(𝑥) at a saddle point is 0, but the point is not a
minimum or maximum point
 The optimization algorithms may stall at saddle points, without reaching a
minimum
• Note also that the point of a function at which the sign of the curvature
changes is called an inflection point
 An inflection point (𝑓′′(𝑥) = 0) can also be a saddle point, but it does not have
to be
• For the 2D function (right figure), the saddle point is at (0,0)
 The point looks like a saddle, and gives the minimum with respect to x, and the
maximum with respect to y
Picture from: http://d2l.ai/chapter_optimization/optimization-intro.html
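This can be verified through the Hessian (a sketch assuming NumPy is available): for 𝑓(x, y) = x² − y² the gradient vanishes at (0,0), but the Hessian there has eigenvalues of both signs, so the point is a saddle rather than an extremum.

```python
import numpy as np

# Hessian of f(x, y) = x^2 - y^2 at the stationary point (0, 0).
H = np.array([[2.0,  0.0],    # f_xx, f_xy
              [0.0, -2.0]])   # f_yx, f_yy
eig = np.linalg.eigvalsh(H)
print(eig)  # one negative and one positive eigenvalue -> saddle point
```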
Optimization
Convex Optimization
• A function of a single variable is concave if every line segment joining two
points on its graph does not lie above the graph at any point
• Symmetrically, a function of a single variable is convex if every line
segment joining two points on its graph does not lie below the graph at
any point

Picture from: https://mjo.osborne.economics.utoronto.ca/index.php/tutorial/index/1/cv1/t


Optimization
Convex Functions
• In mathematical terms, the function f is a convex function if
for all points 𝑥1, 𝑥2 and for all 𝜆 ∈ [0,1]
𝜆𝑓(𝑥1) + (1 − 𝜆)𝑓(𝑥2) ≥ 𝑓(𝜆𝑥1 + (1 − 𝜆)𝑥2)
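The inequality can be spot-checked numerically; this is a sanity check on random samples, not a proof, and the sample ranges are illustrative choices.

```python
import random

# Spot-check the convexity inequality for f(x) = x^2 (a convex function):
# lam*f(x1) + (1-lam)*f(x2) >= f(lam*x1 + (1-lam)*x2)
f = lambda x: x * x
random.seed(0)
violations = 0
for _ in range(1000):
    x1, x2 = random.uniform(-10, 10), random.uniform(-10, 10)
    lam = random.random()
    lhs = lam * f(x1) + (1 - lam) * f(x2)
    rhs = f(lam * x1 + (1 - lam) * x2)
    violations += lhs < rhs - 1e-9
print(violations)  # 0: the inequality held for every sampled triple
```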
Optimization
Convex Functions
• One important property of convex functions is that they do
not have local minima that are not global
 Every local minimum of a convex function is a global minimum
 I.e., every point at which the gradient of a convex function = 0 is a
global minimum
 The figure below illustrates two convex functions, and one nonconvex
function
(panels, left to right: convex, non-convex, convex)

Picture from: http://d2l.ai/chapter_optimization/convexity.html


Optimization
Convex Functions
• Another important property of convex functions is stated by
Jensen’s inequality
• Namely, if we let 𝛼1 = 𝜆 and 𝛼2 = 1 − 𝜆, the definition of a
convex function becomes
𝛼1𝑓(𝑥1) + 𝛼2𝑓(𝑥2) ≥ 𝑓(𝛼1𝑥1 + 𝛼2𝑥2)
• The Danish mathematician Johan Jensen showed that this can
be generalized to all 𝛼𝑖 that are non-negative real numbers
with ∑𝑖 𝛼𝑖 = 1, as follows:
𝛼1𝑓(𝑥1) + 𝛼2𝑓(𝑥2) + ⋯ + 𝛼𝑛𝑓(𝑥𝑛) ≥ 𝑓(𝛼1𝑥1 + 𝛼2𝑥2 + ⋯ + 𝛼𝑛𝑥𝑛)
• This inequality is also identical to
𝔼𝑥[𝑓(𝑥)] ≥ 𝑓(𝔼𝑥[𝑥])
 I.e., the expectation of a convex function is at least the convex
function of the expectation
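Jensen's inequality can be checked empirically (assuming NumPy is available); for the convex function 𝑓(𝑥) = 𝑥² the gap between the two sides is exactly the variance of 𝑥.

```python
import numpy as np

# Empirical check of E[f(x)] >= f(E[x]) for f(x) = x^2 on Gaussian samples.
rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
lhs = np.mean(x**2)   # E[f(x)]
rhs = np.mean(x)**2   # f(E[x])
print(lhs >= rhs)     # True; lhs - rhs equals Var[x] for this f
```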
Optimization
Convex Sets
• A set 𝒳 in a vector space is a convex set if for any 𝑎, 𝑏 ∈ 𝒳 the
line segment connecting a and b is also in 𝒳
• I.e., for all 𝜆 ∈ [0,1], we have
𝜆 ∙ 𝑎 + (1 − 𝜆) ∙ 𝑏 ∈ 𝒳 for all 𝑎, 𝑏 ∈ 𝒳

• In the figure, each point represents a 2D vector


 The left set is nonconvex, and the other two sets are convex
• Properties of convex sets include:
 If 𝒳 and 𝒴 are convex sets, then 𝒳 ∩ 𝒴 is also convex
 If 𝒳 and 𝒴 are convex sets, then 𝒳 ∪ 𝒴 is not necessarily convex

Picture from: http://d2l.ai/chapter_optimization/convexity.html


Optimization
Derivatives and Convexity
• A twice-differentiable function of a single variable 𝑓: ℝ → ℝ is
convex if and only if its second derivative is non-negative
everywhere
 Or, we can write, 𝑑²𝑓/𝑑𝑥² ≥ 0
 For example, 𝑓(𝑥) = 𝑥² is convex, since 𝑓′(𝑥) = 2𝑥 and 𝑓′′(𝑥) = 2,
meaning that 𝑓′′(𝑥) > 0
• A twice-differentiable function of many variables 𝑓: ℝ𝑛 → ℝ
is convex if and only if its Hessian matrix is positive semi-
definite everywhere
 Or, we can write, 𝐇𝑓 ≽ 0
 This is equivalent to stating that all eigenvalues of the Hessian matrix
are non-negative (i.e., ≥ 0)
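The eigenvalue criterion can be sketched for a quadratic (assuming NumPy is available); the function below is a hypothetical helper, not a library routine, and the small tolerance absorbs floating-point noise.

```python
import numpy as np

# Convexity of the quadratic f(x) = 0.5 * x^T A x via the Hessian:
# the Hessian is the symmetric part of A, and f is convex iff all
# of its eigenvalues are non-negative (positive semi-definite).

def is_convex_quadratic(A):
    H = (A + A.T) / 2
    return bool(np.all(np.linalg.eigvalsh(H) >= -1e-12))

print(is_convex_quadratic(np.array([[2.0, 0.0], [0.0, 1.0]])))   # True
print(is_convex_quadratic(np.array([[2.0, 0.0], [0.0, -1.0]])))  # False
```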
Optimization
Constrained Optimization
• The optimization problem that involves a set of constraints which need to
be satisfied to optimize the objective function is called constrained
optimization
• E.g., for a given objective function 𝑓(𝐱) and a set of constraint functions
𝑐𝑖(𝐱):
minimize𝐱 𝑓(𝐱)
subject to 𝑐𝑖(𝐱) ≤ 0 for all 𝑖 ∈ {1, 2, … , 𝑁}
• The points that satisfy the constraints form the feasible region
• Various optimization algorithms have been developed for handling
optimization problems based on whether the constraints are equalities,
inequalities, or a combination of equalities and inequalities
Optimization
Lagrange Multipliers
• One approach to solving optimization problems is to substitute the initial
problem with optimizing another related function
• The Lagrange function for optimization of the constrained problem on the
previous page is defined as
𝐿(𝐱, 𝛼) = 𝑓(𝐱) + ∑𝑖 𝛼𝑖 𝑐𝑖(𝐱), where 𝛼𝑖 ≥ 0
• The variables 𝛼𝑖 are called Lagrange multipliers and ensure that the
constraints are properly enforced
 They are chosen just large enough to ensure that 𝑐𝑖(𝐱) ≤ 0 for all 𝑖 ∈ {1, 2, … , 𝑁}
• This is a saddle-point optimization problem where one wants to
minimize 𝐿(𝐱, 𝛼) with respect to 𝐱 and simultaneously maximize 𝐿(𝐱, 𝛼)
with respect to 𝛼𝑖
 The saddle point of 𝐿(𝐱, 𝛼) gives the optimal solution to the original
constrained optimization problem
Definition
The Lagrange method is used for maximizing or minimizing a general
function f(x,y,z)
subject to a constraint (or side condition) of the form g(x,y,z) = k.
Assumptions made: the extreme values exist, and ∇g ≠ 0
Then there is a number λ such that
∇f(x0,y0,z0) = λ∇g(x0,y0,z0)
and λ is called the Lagrange multiplier.

• Find all values of x, y, z, and λ such that
∇f(x,y,z) = λ∇g(x,y,z)
and g(x,y,z) = k
• Then evaluate f at all these points; the largest of the resulting values
is the maximum value of f, and the smallest is the minimum value of f.
• Writing the vector equation ∇f = λ∇g in terms of its
components gives
fx = λgx    fy = λgy    fz = λgz    g(x,y,z) = k
• This is a system of four equations in the four unknowns x, y, z, and λ;
however, it is not necessary to find explicit values for λ.

• A similar analysis is used for functions of two variables.


Examples

• Example 1:
A rectangular box without a lid is to be made from 12 m² of
cardboard. Find the maximum volume of such a box.
• Solution:
let x, y, and z be the length, width, and height, respectively, of
the box in meters,
and V = xyz
Constraint: g(x, y, z) = 2xz + 2yz + xy = 12
Using Lagrange multipliers, solve
Vx = λgx    Vy = λgy    Vz = λgz    2xz + 2yz + xy = 12
which become
Continued..

yz = λ(2z + y) (1)
xz = λ(2z + x) (2)
xy = λ(2x + 2y) (3)
• 2xz + 2yz + xy = 12 (4)

• Solving these equations:
• Multiply (1) by x, (2) by y, and (3) by z, making the
left-hand sides identical.
• Therefore,
• xyz = λ(2xz + xy) (6)
• xyz = λ(2yz + xy) (7)
• xyz = λ(2xz + 2yz) (8)
Continued

• It is observed that λ ≠ 0; therefore, from (6) and (7),
2xz + xy = 2yz + xy
which gives xz = yz. But z ≠ 0, so x = y. From (7) and (8) we
have 2yz + xy = 2xz + 2yz
which gives xy = 2xz, and so (since x ≠ 0) y = 2z. If we now put
x = y = 2z in (4), we get
4z² + 4z² + 4z² = 12
Since x, y, and z are all positive, we therefore have z = 1, and so
x = 2 and y = 2.
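As a numerical cross-check (assuming SciPy is available), the same constrained maximum can be found by minimizing −V with an equality constraint; with equality constraints, `scipy.optimize.minimize` selects the SLSQP solver.

```python
from scipy.optimize import minimize

# Maximize V = xyz subject to 2xz + 2yz + xy = 12,
# phrased as minimizing -V from an interior starting guess.
res = minimize(
    lambda v: -(v[0] * v[1] * v[2]),
    x0=[1.0, 1.0, 1.0],
    constraints={"type": "eq",
                 "fun": lambda v: 2*v[0]*v[2] + 2*v[1]*v[2] + v[0]*v[1] - 12},
)
print(res.x)     # approximately [2, 2, 1]
print(-res.fun)  # approximately 4, the maximum volume in m^3
```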
More Examples

• Example 2:
Find the extreme values of the function f(x,y) = x² + 2y² on the circle
x² + y² = 1.
• Solution:
Solve the equations ∇f = λ∇g and g(x,y) = 1 using Lagrange
multipliers
Constraint: g(x, y) = x² + y² = 1
Using Lagrange multipliers,
fx = λgx    fy = λgy    g(x,y) = 1
which become
Continued…

• 2x = 2xλ (9)
• 4y = 2yλ (10)
• x² + y² = 1 (11)
• From (9) we have x = 0 or λ = 1. If x = 0, then (11) gives y = ±1. If λ = 1,
then y = 0 from (10), so (11) gives x = ±1. Therefore f has possible
extreme values at the points (0,1), (0,−1), (1,0), (−1,0). Evaluating f at
these four points, we find that
• f(0,1) = 2
• f(0,−1) = 2
• f(1,0) = 1
• f(−1,0) = 1
• Therefore the maximum value of f on the circle
• x² + y² = 1 is f(0,±1) = 2 and the minimum value is f(±1,0) = 1.
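The result can be checked by parametrizing the circle (assuming NumPy is available): with x = cos t and y = sin t, f = cos²t + 2 sin²t = 1 + sin²t, which ranges between 1 and 2.

```python
import numpy as np

# Sample f along the circle x = cos(t), y = sin(t) and compare the
# sampled extremes to the Lagrange-multiplier answer.
t = np.linspace(0, 2 * np.pi, 100_001)
vals = np.cos(t)**2 + 2 * np.sin(t)**2
print(vals.max(), vals.min())  # 2.0 and 1.0 (up to float rounding)
```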
More Examples.

• Example 3
Find the extreme values of f(x,y) = x² + 2y² on the disk x² + y² ≤ 1.
• Solution:
Compare the values of f at the critical points with the values at the points
on the boundary. Since fx = 2x and fy = 4y, the only critical point is (0,0).
We compare the value of f at that point with the extreme values on the
boundary from Example 2:
• f(0,0) = 0
• f(±1,0) = 1
• f(0,±1) = 2
• Therefore the maximum value of f on the disk x² + y² ≤ 1 is
f(0,±1) = 2 and the minimum value is f(0,0) = 0.
Continued…
• Example 4
• Find the points on the sphere x² + y² + z² = 4 that are closest to and farthest
from the point (3,1,−1).
• Solution:
The distance from a point (x,y,z) to the point (3,1,−1) is
d = √((x−3)² + (y−1)² + (z+1)²)
But the algebra is simpler if we instead maximize and minimize the square
of the distance:
d² = f(x,y,z) = (x−3)² + (y−1)² + (z+1)²
Constraint: g(x,y,z) = x² + y² + z² = 4
Using Lagrange multipliers, solve ∇f = λ∇g and g = 4. This gives
Continued

• 2(x − 3) = 2xλ (12)
• 2(y − 1) = 2yλ (13)
• 2(z + 1) = 2zλ (14)
• x² + y² + z² = 4 (15)

• The simplest way to solve these equations is to solve for x, y, and z in
terms of λ from (12), (13), and (14), and then substitute these values
into (15). From (12) we have
• x − 3 = xλ, or
• x(1 − λ) = 3, or
• x = 3/(1 − λ)
Continued
• Similarly, (13) and (14) give
• y = 1/(1 − λ) and z = −1/(1 − λ)
• Therefore, from (15), we have
• 3²/(1 − λ)² + 1²/(1 − λ)² + (−1)²/(1 − λ)² = 4
• which gives (1 − λ)² = 11/4, so 1 − λ = ±√11/2 and
• λ = 1 ∓ √11/2
• These values of λ then give the corresponding points (x,y,z):
• (6/√11, 2/√11, −2/√11) and (−6/√11, −2/√11, 2/√11)

• f has a smaller value at the first of these points, so the closest
point is (6/√11, 2/√11, −2/√11) and the farthest is
(−6/√11, −2/√11, 2/√11).
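The Lagrange answer agrees with the geometric one (a check assuming NumPy is available): the closest point on a sphere of radius 2 centered at the origin to a point p outside it lies along p, at 2·p/‖p‖.

```python
import numpy as np

# Geometric cross-check of Example 4.
p = np.array([3.0, 1.0, -1.0])
closest = 2 * p / np.linalg.norm(p)   # ( 6/sqrt(11),  2/sqrt(11), -2/sqrt(11))
farthest = -closest                   # (-6/sqrt(11), -2/sqrt(11),  2/sqrt(11))
print(closest, farthest)
```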
Two constraints
• Say there is a second constraint, h(x,y,z) = c.
• Then there are numbers λ and μ (called Lagrange multipliers)
such that
∇f(x0,y0,z0) = λ∇g(x0,y0,z0) + μ∇h(x0,y0,z0)
• The extreme values are obtained by solving for the five
unknowns x, y, z, λ, and μ. This is done by writing the above
equation in terms of its components and using the constraint
equations:
• fx = λgx + μhx
• fy = λgy + μhy
• fz = λgz + μhz
• g(x,y,z) = k
• h(x,y,z) = c
Reference

• Calculus, Stewart, 6th Edition
• Section 15.8 “Lagrange Multipliers”
Optimization
Projections
• An alternative strategy for satisfying constraints is
projections
• E.g., gradient clipping in NNs can require that the norm of
the gradient is bounded by a constant value c
• Approach:
 At each iteration during training:
 If the norm of the gradient satisfies ‖𝑔‖ ≥ c, then the update is
𝑔𝑛𝑒𝑤 ← 𝑐 ∙ 𝑔𝑜𝑙𝑑/‖𝑔𝑜𝑙𝑑‖
 If the norm of the gradient satisfies ‖𝑔‖ < c, then the update is 𝑔𝑛𝑒𝑤 ← 𝑔𝑜𝑙𝑑
• Note that since 𝑔𝑜𝑙𝑑/‖𝑔𝑜𝑙𝑑‖ is a unit vector (i.e., it has a norm = 1),
the vector 𝑐 ∙ 𝑔𝑜𝑙𝑑/‖𝑔𝑜𝑙𝑑‖ has a norm = 𝑐
• Such clipping is the projection of the gradient g onto the ball
of radius c
 Projection onto the unit ball corresponds to 𝑐 = 1
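The clipping rule above can be sketched in a few lines (assuming NumPy is available); deep-learning frameworks ship their own versions of this utility.

```python
import numpy as np

# Gradient clipping by norm: if ||g|| >= c, rescale g to have norm
# exactly c; otherwise leave the gradient unchanged.

def clip_by_norm(g, c):
    norm = np.linalg.norm(g)
    if norm >= c:
        return c * g / norm
    return g

g = np.array([3.0, 4.0])      # norm 5
print(clip_by_norm(g, 1.0))   # [0.6 0.8], norm 1
print(clip_by_norm(g, 10.0))  # [3. 4.], unchanged
```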
Optimization
Projections
• More generally, a projection of a vector 𝐱 onto a set 𝒳 is defined as

Proj𝒳(𝐱) = argmin𝐱′∈𝒳 ‖𝐱 − 𝐱′‖₂

• This means that the vector 𝐱 is projected onto the closest vector 𝐱′ that
belongs to the set 𝒳
• For example, in the figure, the blue circle
represents a convex set 𝒳
 The points inside the circle project to themselves
o E.g., if 𝐱 is the yellow vector inside the circle, its closest point 𝐱′ in the
set 𝒳 is itself: the distance between 𝐱 and 𝐱′ is
‖𝐱 − 𝐱′‖₂ = 0
 The points outside the circle project to the closest
point inside the circle
o E.g., if 𝐱 is the yellow vector outside the circle, its closest point 𝐱′ in the
set 𝒳 is the red vector
o Among all vectors in the set 𝒳, the red vector 𝐱′ has
the smallest distance to 𝐱, i.e., the smallest ‖𝐱 − 𝐱′‖₂

Picture from: http://d2l.ai/chapter_optimization/convexity.html


Optimization
First-order vs Second-order Optimization

• First-order optimization algorithms use the gradient of a function for


finding the extrema points
 Methods: gradient descent, proximal algorithms, optimal gradient schemes
 The disadvantage is that they can be slow and inefficient
• Second-order optimization algorithms use the Hessian matrix of a
function for finding the extrema points
 This is since the Hessian matrix holds the second-order partial derivatives
 Methods: Newton’s method, conjugate gradient method, Quasi-Newton
method, Gauss-Newton method, BFGS (Broyden-Fletcher-Goldfarb-Shanno)
method, Levenberg-Marquardt method, Hessian-free method
 The second-order derivatives can be thought of as measuring the curvature of
the loss function
 Recall also that the second-order derivative can be used to determine whether a
stationary point is a maximum (𝑓′′(𝑥) < 0) or a minimum (𝑓′′(𝑥) > 0)
 This information is richer than the information provided by the gradient
 Disadvantage: computing the Hessian matrix is computationally expensive,
and even prohibitive for high-dimensional data
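As a one-dimensional sketch of a second-order scheme, each step of Newton's method divides the gradient 𝑓′(𝑥) by the curvature 𝑓′′(𝑥); the target function below is an illustrative choice, not from the slides.

```python
# Newton's method in 1D: x <- x - f'(x) / f''(x).

def newton(df, d2f, x, steps=20):
    for _ in range(steps):
        x -= df(x) / d2f(x)
    return x

# Minimize f(x) = x^4 - 3x^2: f'(x) = 4x^3 - 6x, f''(x) = 12x^2 - 6.
# Starting near the minimum, the iteration converges to sqrt(3/2).
x_min = newton(lambda x: 4 * x**3 - 6 * x, lambda x: 12 * x**2 - 6, x=2.0)
print(x_min)  # approximately 1.2247, i.e., sqrt(1.5)
```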
Optimization
Lower Bound and Infimum
• Lower bound of a subset 𝒮 from a partially ordered set 𝒳 is an
element 𝑎 of 𝒳, such that 𝑎 ≤ 𝑠 for all 𝑠 ∈ 𝒮
 E.g., for the subset 𝒮 = {2, 3, 6, 8} of the integers ℤ, lower
bounds are the numbers 2, 1, 0, −3, and every other integer ≤ 2
• Infimum of a subset 𝒮 from a partially ordered set 𝒳 is the
greatest lower bound in 𝒳, denoted inf𝑠∈𝒮 𝑠
 It is the maximal quantity ℎ such that ℎ ≤ 𝑠 for all 𝑠 ∈ 𝒮
 E.g., the infimum of the set 𝒮 = {2, 3, 6, 8} is ℎ = 2, since it is the greatest
lower bound
• Example: consider the subset of positive real numbers
(excluding zero) ℝ>0 = {𝑥 ∈ ℝ: 𝑥 > 0}
 The subset ℝ>0 does not have a minimum, because for every small positive
number, there is another even smaller positive number
 On the other hand, all real negative numbers and 0 are lower bounds on
the subset ℝ>0
 0 is the greatest of all the lower bounds, and therefore, the
infimum of ℝ>0 is 0
Optimization
Upper Bound and Supremum
• Upper bound of a subset 𝒮 from a partially ordered set 𝒳 is
an element 𝑏 of 𝒳, such that 𝑏 ≥ 𝑠 for all 𝑠 ∈ 𝒮
 E.g., for the subset 𝒮 = {2, 3, 6, 8} of the integers ℤ, upper
bounds are the numbers 8, 9, 40, and every other integer ≥ 8
• Supremum of a subset 𝒮 from a partially ordered set 𝒳 is the
least upper bound in 𝒳, denoted sup𝑠∈𝒮 𝑠
 It is the minimal quantity 𝑔 such that 𝑔 ≥ 𝑠 for all 𝑠 ∈ 𝒮
 E.g., the supremum of the subset 𝒮 = {2, 3, 6, 8} is 𝑔 = 8, since it is the
least upper bound
• Example: for the subset of negative real numbers (excluding
zero) ℝ<0 = {𝑥 ∈ ℝ: 𝑥 < 0}
 All real positive numbers and 0 are upper bounds
 0 is the least upper bound, and therefore, the supremum of ℝ<0 is 0
Optimization
Lipschitz Function
• A function 𝑓(𝑥) is a Lipschitz continuous function if a constant 𝜌 > 0
exists, such that for all points 𝑥1, 𝑥2
|𝑓(𝑥1) − 𝑓(𝑥2)| ≤ 𝜌 |𝑥1 − 𝑥2|
• Such a function is also called a 𝜌-Lipschitz function
• Intuitively, a Lipschitz function cannot change too fast
 I.e., if the points 𝑥1 and 𝑥2 are close (i.e., the distance |𝑥1 − 𝑥2| is small), then
𝑓(𝑥1) and 𝑓(𝑥2) are also close (i.e., the distance |𝑓(𝑥1) − 𝑓(𝑥2)|
is also small)
o The smallest real number that bounds the change |𝑓(𝑥1) − 𝑓(𝑥2)| for all
points 𝑥1, 𝑥2 is the Lipschitz constant 𝜌 of the function 𝑓(𝑥)
 For a 𝜌-Lipschitz function 𝑓(𝑥), the first derivative 𝑓′(𝑥) is bounded
everywhere by 𝜌, i.e., |𝑓′(𝑥)| ≤ 𝜌
• E.g., the function 𝑓(𝑥) = log(1 + exp(𝑥)) is 1-Lipschitz over ℝ
 Since 𝑓′(𝑥) = exp(𝑥)/(1 + exp(𝑥)) = 1/(exp(−𝑥) + 1) ≤ 1
 I.e., 𝜌 = 1
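The Lipschitz constant of this function can also be estimated empirically by sampling difference quotients; the sample range and count below are illustrative choices.

```python
import math
import random

# Estimate the Lipschitz constant of softplus, f(x) = log(1 + exp(x)),
# by taking the largest |f(x1) - f(x2)| / |x1 - x2| over random pairs.
f = lambda x: math.log1p(math.exp(x))
random.seed(0)
ratios = []
for _ in range(10_000):
    x1, x2 = random.uniform(-20, 20), random.uniform(-20, 20)
    if x1 != x2:
        ratios.append(abs(f(x1) - f(x2)) / abs(x1 - x2))
print(max(ratios))  # approaches rho = 1 from below (up to float rounding)
```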
Optimization
Lipschitz Continuous Gradient
• A differentiable function 𝑓(𝑥) has a Lipschitz continuous
gradient if a constant 𝜌 > 0 exists, such that for all points 𝑥1, 𝑥2
‖𝛻𝑓(𝑥1) − 𝛻𝑓(𝑥2)‖ ≤ 𝜌 ‖𝑥1 − 𝑥2‖
• For a function 𝑓(𝑥) with a 𝜌-Lipschitz gradient, the second
derivative 𝑓′′(𝑥) is bounded everywhere by 𝜌
• E.g., consider the function 𝑓(𝑥) = 𝑥²
 𝑓(𝑥) = 𝑥² is not a Lipschitz continuous function, since 𝑓′(𝑥) = 2𝑥, so
when 𝑥 → ∞ then 𝑓′(𝑥) → ∞, i.e., the derivative is not bounded
everywhere
 Since 𝑓′′(𝑥) = 2, the gradient 𝑓′(𝑥) is 2-Lipschitz everywhere,
since the second derivative is bounded everywhere by 2
