
Optimization

• Optimization is concerned with optimizing an objective function —
finding the value of an argument that minimizes or maximizes the
function
 Most optimization algorithms are formulated in terms of minimizing a
function 𝑓(𝑥)
 Maximization is accomplished via minimizing the negative of the objective
function (e.g., minimize −𝑓(𝑥))
 In minimization problems, the objective function is often referred to as a cost
function, loss function, or error function
• Optimization is very important for machine learning
 The performance of optimization algorithms affects the model’s training
efficiency
• Most optimization problems in machine learning are nonconvex
 Meaning that the loss function is not a convex function
 Nonetheless, the design and analysis of algorithms for solving convex
problems has been very instructive for advancing the field of machine learning
Optimization
• Optimization and machine learning have related, but somewhat different
goals
 Goal in optimization: minimize an objective function
oFor a set of training examples, reduce the training error
 Goal in ML: find a suitable model to predict on data examples
oFor a set of testing examples, reduce the generalization error
• For a given empirical function g (dashed purple curve), optimization
algorithms attempt to find the point of minimum empirical risk
• The empirical function g only approximates
the expected function f (blue curve), since it
is obtained from a limited amount of
training data examples
• ML algorithms attempt to find the point of
minimum expected risk, based on
minimizing the error on a set of testing
examples
o Which may be at a different location than the
minimum of the training examples
o And which may not be minimal in a formal
sense
Picture from: http://d2l.ai/chapter_optimization/optimization-intro.html
Optimization
Stationary Points
• Stationary points (or critical points) of a differentiable function 𝑓(𝑥) of
one variable are the points where the derivative of the function is zero,
i.e., 𝑓′(𝑥) = 0
• The stationary points can be:
 Minimum, a point where the derivative changes from negative to positive
 Maximum, a point where the derivative changes from positive to negative
 Saddle point, a point where the derivative has the same sign (positive or
negative) on both sides of the point
• The minimum and maximum points are collectively known as extremum
points
• The nature of stationary points can be
determined based on the second
derivative of 𝑓(𝑥) at the point
 If 𝑓′′(𝑥) > 0, the point is a minimum
 If 𝑓′′(𝑥) < 0, the point is a maximum
 If 𝑓′′(𝑥) = 0, the test is inconclusive: the point can
be a saddle point, but it may not be
• The same concept also applies to
gradients of multivariate functions
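The second-derivative test can be sketched numerically with finite differences; the step size h and the tolerances below are illustrative choices, not part of the slides.

```python
# Classify a stationary point of f(x) with the second-derivative test,
# using central finite differences to approximate f'(x) and f''(x).

def second_derivative_test(f, x, h=1e-4):
    d1 = (f(x + h) - f(x - h)) / (2 * h)          # estimate of f'(x)
    d2 = (f(x + h) - 2 * f(x) + f(x - h)) / h**2  # estimate of f''(x)
    assert abs(d1) < 1e-6, "x is not a stationary point"
    if d2 > 1e-6:
        return "minimum"
    if d2 < -1e-6:
        return "maximum"
    return "inconclusive"

print(second_derivative_test(lambda x: x**2, 0.0))   # minimum
print(second_derivative_test(lambda x: -x**2, 0.0))  # maximum
print(second_derivative_test(lambda x: x**3, 0.0))   # inconclusive
```

Note that the inconclusive case matches the slide: for 𝑓(𝑥) = 𝑥³ the second derivative vanishes at 0, and the point is a saddle point.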
Optimization
Local Minima
• The challenges in optimizing a model’s parameters in ML include
local minima, saddle points, and vanishing gradients
• For an objective function 𝑓(𝑥), if the value at a point x is the minimum of
the objective function over the entire domain of x, then it is the global
minimum
• If the value of 𝑓(𝑥) at x is smaller than the values of the objective function
at all other points in the vicinity of x, then it is a local minimum
 The objective functions in ML usually
have many local minima
o When the solution of the optimization
algorithm is near the local minimum,
the gradient of the loss function
approaches or becomes zero
(vanishing gradients)
o Therefore, the obtained solution in the
final iteration can be a local minimum,
rather than the global minimum

Picture from: http://d2l.ai/chapter_optimization/optimization-intro.html
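This behavior can be sketched with plain gradient descent on a toy nonconvex function; the function, learning rate, and starting points below are illustrative choices, not from the slides.

```python
# Gradient descent on f(x) = (x^2 - 1)^2 + 0.3x, a toy nonconvex
# function whose global minimum is near x = -1 and whose local
# minimum is near x = +1.  Which minimum the iteration reaches
# depends only on the starting point.

def grad(x):
    # f'(x) = 4x(x^2 - 1) + 0.3
    return 4 * x * (x**2 - 1) + 0.3

def descend(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

print(descend(+0.5))  # stalls at the local minimum, near x = 0.96
print(descend(-0.5))  # reaches the global minimum, near x = -1.04
```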


Optimization
Saddle Points
• The gradient of a function 𝑓(𝑥) at a saddle point is 0, but the point is not a
minimum or maximum point
 The optimization algorithms may stall at saddle points, without reaching a
minimum
• Note also that the point of a function at which the sign of the curvature
changes is called an inflection point
 An inflection point (𝑓′′(𝑥) = 0) can also be a saddle point, but it does not have
to be
• For the 2D function (right figure), the saddle point is at (0,0)
 The point looks like a saddle, and gives the minimum with respect to x, and the
maximum with respect to y
Picture from: http://d2l.ai/chapter_optimization/optimization-intro.html
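This can be verified through the Hessian (a sketch assuming NumPy is available): for 𝑓(x, y) = x² − y² the gradient vanishes at (0,0), but the Hessian there has eigenvalues of both signs, so the point is a saddle rather than an extremum.

```python
import numpy as np

# Hessian of f(x, y) = x^2 - y^2 at the stationary point (0, 0).
H = np.array([[2.0,  0.0],    # f_xx, f_xy
              [0.0, -2.0]])   # f_yx, f_yy
eig = np.linalg.eigvalsh(H)
print(eig)  # one negative and one positive eigenvalue -> saddle point
```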
Optimization
Convex Optimization
• A function of a single variable is concave if every line segment joining two
points on its graph does not lie above the graph at any point
• Symmetrically, a function of a single variable is convex if every line
segment joining two points on its graph does not lie below the graph at
any point

Picture from: https://mjo.osborne.economics.utoronto.ca/index.php/tutorial/index/1/cv1/t


Optimization
Convex Functions
• In mathematical terms, the function f is a convex function if
for all points 𝑥1, 𝑥2 and for all 𝜆 ∈ [0,1]
𝜆𝑓(𝑥1) + (1 − 𝜆)𝑓(𝑥2) ≥ 𝑓(𝜆𝑥1 + (1 − 𝜆)𝑥2)
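The inequality can be spot-checked numerically; this is a sanity check on random samples, not a proof, and the sample ranges are illustrative choices.

```python
import random

# Spot-check the convexity inequality for f(x) = x^2 (a convex function):
# lam*f(x1) + (1-lam)*f(x2) >= f(lam*x1 + (1-lam)*x2)
f = lambda x: x * x
random.seed(0)
violations = 0
for _ in range(1000):
    x1, x2 = random.uniform(-10, 10), random.uniform(-10, 10)
    lam = random.random()
    lhs = lam * f(x1) + (1 - lam) * f(x2)
    rhs = f(lam * x1 + (1 - lam) * x2)
    violations += lhs < rhs - 1e-9
print(violations)  # 0: the inequality held for every sampled triple
```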
Optimization
Convex Functions
• One important property of convex functions is that they do
not have local minima that are not global
 Every local minimum of a convex function is a global minimum
 I.e., every point at which the gradient of a convex function = 0 is a
global minimum
 The figure below illustrates two convex functions, and one nonconvex
function
(panels, left to right: convex, non-convex, convex)

Picture from: http://d2l.ai/chapter_optimization/convexity.html


Optimization
Convex Functions
• Another important property of convex functions is stated by
Jensen’s inequality
• Namely, if we let 𝛼1 = 𝜆 and 𝛼2 = 1 − 𝜆, the definition of a
convex function becomes
𝛼1𝑓(𝑥1) + 𝛼2𝑓(𝑥2) ≥ 𝑓(𝛼1𝑥1 + 𝛼2𝑥2)
• The Danish mathematician Johan Jensen showed that this can
be generalized to all 𝛼𝑖 that are non-negative real numbers
with ∑𝑖 𝛼𝑖 = 1, as follows:
𝛼1𝑓(𝑥1) + 𝛼2𝑓(𝑥2) + ⋯ + 𝛼𝑛𝑓(𝑥𝑛) ≥ 𝑓(𝛼1𝑥1 + 𝛼2𝑥2 + ⋯ + 𝛼𝑛𝑥𝑛)
• This inequality is also identical to
𝔼𝑥[𝑓(𝑥)] ≥ 𝑓(𝔼𝑥[𝑥])
 I.e., the expectation of a convex function is at least the convex
function of the expectation
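Jensen's inequality can be checked empirically (assuming NumPy is available); for the convex function 𝑓(𝑥) = 𝑥² the gap between the two sides is exactly the variance of 𝑥.

```python
import numpy as np

# Empirical check of E[f(x)] >= f(E[x]) for f(x) = x^2 on Gaussian samples.
rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
lhs = np.mean(x**2)   # E[f(x)]
rhs = np.mean(x)**2   # f(E[x])
print(lhs >= rhs)     # True; lhs - rhs equals Var[x] for this f
```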
Optimization
Convex Sets
• A set 𝒳 in a vector space is a convex set if for any 𝑎, 𝑏 ∈ 𝒳 the
line segment connecting a and b is also in 𝒳
• I.e., for all 𝜆 ∈ [0,1], we have
𝜆 ∙ 𝑎 + (1 − 𝜆) ∙ 𝑏 ∈ 𝒳 for all 𝑎, 𝑏 ∈ 𝒳

• In the figure, each point represents a 2D vector


 The left set is nonconvex, and the other two sets are convex
• Properties of convex sets include:
 If 𝒳 and 𝒴 are convex sets, then 𝒳 ∩ 𝒴 is also convex
 If 𝒳 and 𝒴 are convex sets, then 𝒳 ∪ 𝒴 is not necessarily convex

Picture from: http://d2l.ai/chapter_optimization/convexity.html


Optimization
Derivatives and Convexity
• A twice-differentiable function of a single variable 𝑓: ℝ → ℝ is
convex if and only if its second derivative is non-negative
everywhere
 Or, we can write, 𝑑²𝑓/𝑑𝑥² ≥ 0
 For example, 𝑓(𝑥) = 𝑥² is convex, since 𝑓′(𝑥) = 2𝑥 and 𝑓′′(𝑥) = 2,
meaning that 𝑓′′(𝑥) > 0
• A twice-differentiable function of many variables 𝑓: ℝ𝑛 → ℝ
is convex if and only if its Hessian matrix is positive semi-
definite everywhere
 Or, we can write, 𝐇𝑓 ≽ 0
 This is equivalent to stating that all eigenvalues of the Hessian matrix
are non-negative (i.e., ≥ 0)
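The eigenvalue criterion can be sketched for a quadratic (assuming NumPy is available); the function below is a hypothetical helper, not a library routine, and the small tolerance absorbs floating-point noise.

```python
import numpy as np

# Convexity of the quadratic f(x) = 0.5 * x^T A x via the Hessian:
# the Hessian is the symmetric part of A, and f is convex iff all
# of its eigenvalues are non-negative (positive semi-definite).

def is_convex_quadratic(A):
    H = (A + A.T) / 2
    return bool(np.all(np.linalg.eigvalsh(H) >= -1e-12))

print(is_convex_quadratic(np.array([[2.0, 0.0], [0.0, 1.0]])))   # True
print(is_convex_quadratic(np.array([[2.0, 0.0], [0.0, -1.0]])))  # False
```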
Optimization
Constrained Optimization
• The optimization problem that involves a set of constraints which need to
be satisfied to optimize the objective function is called constrained
optimization
• E.g., for a given objective function 𝑓(𝐱) and a set of constraint functions
𝑐𝑖(𝐱):
minimize𝐱 𝑓(𝐱)
subject to 𝑐𝑖(𝐱) ≤ 0 for all 𝑖 ∈ {1, 2, … , 𝑁}
• The points that satisfy the constraints form the feasible region
• Various optimization algorithms have been developed for handling
optimization problems based on whether the constraints are equalities,
inequalities, or a combination of equalities and inequalities
Optimization
Lagrange Multipliers
• One approach to solving optimization problems is to substitute the initial
problem with optimizing another related function
• The Lagrange function for optimization of the constrained problem on the
previous page is defined as
𝐿(𝐱, 𝛼) = 𝑓(𝐱) + ∑𝑖 𝛼𝑖 𝑐𝑖(𝐱), where 𝛼𝑖 ≥ 0
• The variables 𝛼𝑖 are called Lagrange multipliers and ensure that the
constraints are properly enforced
 They are chosen just large enough to ensure that 𝑐𝑖(𝐱) ≤ 0 for all 𝑖 ∈ {1, 2, … , 𝑁}
• This is a saddle-point optimization problem where one wants to
minimize 𝐿(𝐱, 𝛼) with respect to 𝐱 and simultaneously maximize 𝐿(𝐱, 𝛼)
with respect to 𝛼𝑖
 The saddle point of 𝐿(𝐱, 𝛼) gives the optimal solution to the original
constrained optimization problem
Definition
The Lagrange method is used for maximizing or minimizing a general
function f(x,y,z)
subject to a constraint (or side condition) of the form g(x,y,z) = k.
Assumptions made: the extreme values exist, and ∇g ≠ 0
Then there is a number λ such that
∇f(x0,y0,z0) = λ∇g(x0,y0,z0)
and λ is called the Lagrange multiplier.

• Find all values of x, y, z, and λ such that
∇f(x,y,z) = λ∇g(x,y,z)
and g(x,y,z) = k
• Then evaluate f at all these points; the largest of the resulting values
is the maximum value of f, and the smallest is the minimum value of f.
• Writing the vector equation ∇f = λ∇g in terms of its
components gives
fx = λgx    fy = λgy    fz = λgz    g(x,y,z) = k
• This is a system of four equations in the four unknowns x, y, z, and λ;
however, it is not necessary to find explicit values for λ.

• A similar analysis is used for functions of two variables.


Examples

• Example 1:
A rectangular box without a lid is to be made from 12 m² of
cardboard. Find the maximum volume of such a box.
• Solution:
let x, y, and z be the length, width, and height, respectively, of
the box in meters,
and V = xyz
Constraint: g(x, y, z) = 2xz + 2yz + xy = 12
Using Lagrange multipliers, solve
Vx = λgx    Vy = λgy    Vz = λgz    2xz + 2yz + xy = 12
which become
Continued..

yz = λ(2z + y) (1)
xz = λ(2z + x) (2)
xy = λ(2x + 2y) (3)
• 2xz + 2yz + xy = 12 (4)

• Solving these equations:
• Multiply (1) by x, (2) by y, and (3) by z, making the
left-hand sides identical.
• Therefore,
• xyz = λ(2xz + xy) (6)
• xyz = λ(2yz + xy) (7)
• xyz = λ(2xz + 2yz) (8)
Continued

• It is observed that λ ≠ 0; therefore, from (6) and (7),
2xz + xy = 2yz + xy
which gives xz = yz. But z ≠ 0, so x = y. From (7) and (8) we
have 2yz + xy = 2xz + 2yz
which gives xy = 2xz, and so (since x ≠ 0) y = 2z. If we now put
x = y = 2z in (4), we get
4z² + 4z² + 4z² = 12
Since x, y, and z are all positive, we therefore have z = 1, and so
x = 2 and y = 2.
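As a numerical cross-check (assuming SciPy is available), the same constrained maximum can be found by minimizing −V with an equality constraint; with equality constraints, `scipy.optimize.minimize` selects the SLSQP solver.

```python
from scipy.optimize import minimize

# Maximize V = xyz subject to 2xz + 2yz + xy = 12,
# phrased as minimizing -V from an interior starting guess.
res = minimize(
    lambda v: -(v[0] * v[1] * v[2]),
    x0=[1.0, 1.0, 1.0],
    constraints={"type": "eq",
                 "fun": lambda v: 2*v[0]*v[2] + 2*v[1]*v[2] + v[0]*v[1] - 12},
)
print(res.x)     # approximately [2, 2, 1]
print(-res.fun)  # approximately 4, the maximum volume in m^3
```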
More Examples

• Example 2:
Find the extreme values of the function f(x,y) = x² + 2y² on the circle
x² + y² = 1.
• Solution:
Solve the equations ∇f = λ∇g and g(x,y) = 1 using Lagrange
multipliers
Constraint: g(x, y) = x² + y² = 1
Using Lagrange multipliers,
fx = λgx    fy = λgy    g(x,y) = 1
which become
Continued…

• 2x = 2xλ (9)
• 4y = 2yλ (10)
• x² + y² = 1 (11)
• From (9) we have x = 0 or λ = 1. If x = 0, then (11) gives y = ±1. If λ = 1,
then y = 0 from (10), so (11) gives x = ±1. Therefore f has possible
extreme values at the points (0,1), (0,−1), (1,0), (−1,0). Evaluating f at
these four points, we find that
• f(0,1) = 2
• f(0,−1) = 2
• f(1,0) = 1
• f(−1,0) = 1
• Therefore the maximum value of f on the circle
• x² + y² = 1 is f(0,±1) = 2 and the minimum value is f(±1,0) = 1.
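The result can be checked by parametrizing the circle (assuming NumPy is available): with x = cos t and y = sin t, f = cos²t + 2 sin²t = 1 + sin²t, which ranges between 1 and 2.

```python
import numpy as np

# Sample f along the circle x = cos(t), y = sin(t) and compare the
# sampled extremes to the Lagrange-multiplier answer.
t = np.linspace(0, 2 * np.pi, 100_001)
vals = np.cos(t)**2 + 2 * np.sin(t)**2
print(vals.max(), vals.min())  # 2.0 and 1.0 (up to float rounding)
```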
More Examples.

• Example 3
Find the extreme values of f(x,y) = x² + 2y² on the disk x² + y² ≤ 1.
• Solution:
Compare the values of f at the critical points with the values at the points
on the boundary. Since fx = 2x and fy = 4y, the only critical point is (0,0).
We compare the value of f at that point with the extreme values on the
boundary from Example 2:
• f(0,0) = 0
• f(±1,0) = 1
• f(0,±1) = 2
• Therefore the maximum value of f on the disk x² + y² ≤ 1 is
f(0,±1) = 2 and the minimum value is f(0,0) = 0.
Continued…
• Example 4
• Find the points on the sphere x² + y² + z² = 4 that are closest to and farthest
from the point (3,1,−1).
• Solution:
The distance from a point (x,y,z) to the point (3,1,−1) is
d = √((x−3)² + (y−1)² + (z+1)²)
But the algebra is simpler if we instead maximize and minimize the square
of the distance:
d² = f(x,y,z) = (x−3)² + (y−1)² + (z+1)²
Constraint: g(x,y,z) = x² + y² + z² = 4
Using Lagrange multipliers, solve ∇f = λ∇g and g = 4. This gives
Continued

• 2(x − 3) = 2xλ (12)
• 2(y − 1) = 2yλ (13)
• 2(z + 1) = 2zλ (14)
• x² + y² + z² = 4 (15)

• The simplest way to solve these equations is to solve for x, y, and z in
terms of λ from (12), (13), and (14), and then substitute these values
into (15). From (12) we have
• x − 3 = xλ, or
• x(1 − λ) = 3, or
• x = 3/(1 − λ)
Continued
• Similarly, (13) and (14) give
• y = 1/(1 − λ) and z = −1/(1 − λ)
• Therefore, from (15), we have
• 3²/(1 − λ)² + 1²/(1 − λ)² + (−1)²/(1 − λ)² = 4
• which gives (1 − λ)² = 11/4, so 1 − λ = ±√11/2 and
• λ = 1 ∓ √11/2
• These values of λ then give the corresponding points (x,y,z):
• (6/√11, 2/√11, −2/√11) and (−6/√11, −2/√11, 2/√11)

• f has a smaller value at the first of these points, so the closest
point is (6/√11, 2/√11, −2/√11) and the farthest is
(−6/√11, −2/√11, 2/√11).
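The Lagrange answer agrees with the geometric one (a check assuming NumPy is available): the closest point on a sphere of radius 2 centered at the origin to a point p outside it lies along p, at 2·p/‖p‖.

```python
import numpy as np

# Geometric cross-check of Example 4.
p = np.array([3.0, 1.0, -1.0])
closest = 2 * p / np.linalg.norm(p)   # ( 6/sqrt(11),  2/sqrt(11), -2/sqrt(11))
farthest = -closest                   # (-6/sqrt(11), -2/sqrt(11),  2/sqrt(11))
print(closest, farthest)
```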
Two constraints
• Say there is a second constraint, h(x,y,z) = c.
• Then there are numbers λ and μ (called Lagrange multipliers)
such that
∇f(x0,y0,z0) = λ∇g(x0,y0,z0) + μ∇h(x0,y0,z0)
• The extreme values are obtained by solving for the five
unknowns x, y, z, λ, and μ. This is done by writing the above
equation in terms of its components and using the constraint
equations:
• fx = λgx + μhx
• fy = λgy + μhy
• fz = λgz + μhz
• g(x,y,z) = k
• h(x,y,z) = c
Reference

• Calculus, Stewart, 6th Edition
• Section 15.8 “Lagrange Multipliers”
Optimization
Projections
• An alternative strategy for satisfying constraints is
projections
• E.g., gradient clipping in NNs can require that the norm of
the gradient is bounded by a constant value c
• Approach:
 At each iteration during training:
 If the norm of the gradient satisfies ‖𝑔‖ ≥ c, then the update is
𝑔𝑛𝑒𝑤 ← 𝑐 ∙ 𝑔𝑜𝑙𝑑/‖𝑔𝑜𝑙𝑑‖
 If the norm of the gradient satisfies ‖𝑔‖ < c, then the update is 𝑔𝑛𝑒𝑤 ← 𝑔𝑜𝑙𝑑
• Note that since 𝑔𝑜𝑙𝑑/‖𝑔𝑜𝑙𝑑‖ is a unit vector (i.e., it has a norm = 1),
the vector 𝑐 ∙ 𝑔𝑜𝑙𝑑/‖𝑔𝑜𝑙𝑑‖ has a norm = 𝑐
• Such clipping is the projection of the gradient g onto the ball
of radius c
 Projection onto the unit ball corresponds to 𝑐 = 1
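The clipping rule above can be sketched in a few lines (assuming NumPy is available); deep-learning frameworks ship their own versions of this utility.

```python
import numpy as np

# Gradient clipping by norm: if ||g|| >= c, rescale g to have norm
# exactly c; otherwise leave the gradient unchanged.

def clip_by_norm(g, c):
    norm = np.linalg.norm(g)
    if norm >= c:
        return c * g / norm
    return g

g = np.array([3.0, 4.0])      # norm 5
print(clip_by_norm(g, 1.0))   # [0.6 0.8], norm 1
print(clip_by_norm(g, 10.0))  # [3. 4.], unchanged
```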
Optimization
Projections
• More generally, a projection of a vector 𝐱 onto a set 𝒳 is defined as

Proj𝒳(𝐱) = argmin𝐱′∈𝒳 ‖𝐱 − 𝐱′‖₂

• This means that the vector 𝐱 is projected onto the closest vector 𝐱′ that
belongs to the set 𝒳
• For example, in the figure, the blue circle
represents a convex set 𝒳
 The points inside the circle project to themselves
o E.g., if 𝐱 is the yellow vector inside the circle, its closest point 𝐱′ in the
set 𝒳 is itself: the distance between 𝐱 and 𝐱′ is
‖𝐱 − 𝐱′‖₂ = 0
 The points outside the circle project to the closest
point inside the circle
o E.g., if 𝐱 is the yellow vector outside the circle, its closest point 𝐱′ in the
set 𝒳 is the red vector
o Among all vectors in the set 𝒳, the red vector 𝐱′ has
the smallest distance to 𝐱, i.e., the smallest ‖𝐱 − 𝐱′‖₂

Picture from: http://d2l.ai/chapter_optimization/convexity.html


Optimization
First-order vs Second-order Optimization

• First-order optimization algorithms use the gradient of a function for


finding the extrema points
 Methods: gradient descent, proximal algorithms, optimal gradient schemes
 The disadvantage is that they can be slow and inefficient
• Second-order optimization algorithms use the Hessian matrix of a
function for finding the extrema points
 This is since the Hessian matrix holds the second-order partial derivatives
 Methods: Newton’s method, conjugate gradient method, Quasi-Newton
method, Gauss-Newton method, BFGS (Broyden-Fletcher-Goldfarb-Shanno)
method, Levenberg-Marquardt method, Hessian-free method
 The second-order derivatives can be thought of as measuring the curvature of
the loss function
 Recall also that the second-order derivative can be used to determine whether a
stationary point is a maximum (𝑓′′(𝑥) < 0) or a minimum (𝑓′′(𝑥) > 0)
 This information is richer than the information provided by the gradient
 Disadvantage: computing the Hessian matrix is computationally expensive,
and even prohibitive for high-dimensional data
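As a one-dimensional sketch of a second-order scheme, each step of Newton's method divides the gradient 𝑓′(𝑥) by the curvature 𝑓′′(𝑥); the target function below is an illustrative choice, not from the slides.

```python
# Newton's method in 1D: x <- x - f'(x) / f''(x).

def newton(df, d2f, x, steps=20):
    for _ in range(steps):
        x -= df(x) / d2f(x)
    return x

# Minimize f(x) = x^4 - 3x^2: f'(x) = 4x^3 - 6x, f''(x) = 12x^2 - 6.
# Starting near the minimum, the iteration converges to sqrt(3/2).
x_min = newton(lambda x: 4 * x**3 - 6 * x, lambda x: 12 * x**2 - 6, x=2.0)
print(x_min)  # approximately 1.2247, i.e., sqrt(1.5)
```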
Optimization
Lower Bound and Infimum
• Lower bound of a subset 𝒮 from a partially ordered set 𝒳 is an
element 𝑎 of 𝒳, such that 𝑎 ≤ 𝑠 for all 𝑠 ∈ 𝒮
 E.g., for the subset 𝒮 = {2, 3, 6, 8} of the integers ℤ, lower
bounds are the numbers 2, 1, 0, −3, and every other integer ≤ 2
• Infimum of a subset 𝒮 from a partially ordered set 𝒳 is the
greatest lower bound in 𝒳, denoted inf𝑠∈𝒮 𝑠
 It is the maximal quantity ℎ such that ℎ ≤ 𝑠 for all 𝑠 ∈ 𝒮
 E.g., the infimum of the set 𝒮 = {2, 3, 6, 8} is ℎ = 2, since it is the greatest
lower bound
• Example: consider the subset of positive real numbers
(excluding zero) ℝ>0 = {𝑥 ∈ ℝ: 𝑥 > 0}
 The subset ℝ>0 does not have a minimum, because for every small positive
number, there is another even smaller positive number
 On the other hand, all real negative numbers and 0 are lower bounds on
the subset ℝ>0
 0 is the greatest of all the lower bounds, and therefore, the
infimum of ℝ>0 is 0
Optimization
Upper Bound and Supremum
• Upper bound of a subset 𝒮 from a partially ordered set 𝒳 is
an element 𝑏 of 𝒳, such that 𝑏 ≥ 𝑠 for all 𝑠 ∈ 𝒮
 E.g., for the subset 𝒮 = {2, 3, 6, 8} of the integers ℤ, upper
bounds are the numbers 8, 9, 40, and every other integer ≥ 8
• Supremum of a subset 𝒮 from a partially ordered set 𝒳 is the
least upper bound in 𝒳, denoted sup𝑠∈𝒮 𝑠
 It is the minimal quantity 𝑔 such that 𝑔 ≥ 𝑠 for all 𝑠 ∈ 𝒮
 E.g., the supremum of the subset 𝒮 = {2, 3, 6, 8} is 𝑔 = 8, since it is the
least upper bound
• Example: for the subset of negative real numbers (excluding
zero) ℝ<0 = {𝑥 ∈ ℝ: 𝑥 < 0}
 All real positive numbers and 0 are upper bounds
 0 is the least upper bound, and therefore, the supremum of ℝ<0 is 0
Optimization
Lipschitz Function
• A function 𝑓(𝑥) is a Lipschitz continuous function if a constant 𝜌 > 0
exists, such that for all points 𝑥1, 𝑥2
|𝑓(𝑥1) − 𝑓(𝑥2)| ≤ 𝜌 |𝑥1 − 𝑥2|
• Such a function is also called a 𝜌-Lipschitz function
• Intuitively, a Lipschitz function cannot change too fast
 I.e., if the points 𝑥1 and 𝑥2 are close (i.e., the distance |𝑥1 − 𝑥2| is small), then
𝑓(𝑥1) and 𝑓(𝑥2) are also close (i.e., the distance |𝑓(𝑥1) − 𝑓(𝑥2)|
is also small)
o The smallest real number that bounds the change |𝑓(𝑥1) − 𝑓(𝑥2)| for all
points 𝑥1, 𝑥2 is the Lipschitz constant 𝜌 of the function 𝑓(𝑥)
 For a 𝜌-Lipschitz function 𝑓(𝑥), the first derivative 𝑓′(𝑥) is bounded
everywhere by 𝜌, i.e., |𝑓′(𝑥)| ≤ 𝜌
• E.g., the function 𝑓(𝑥) = log(1 + exp(𝑥)) is 1-Lipschitz over ℝ
 Since 𝑓′(𝑥) = exp(𝑥)/(1 + exp(𝑥)) = 1/(exp(−𝑥) + 1) ≤ 1
 I.e., 𝜌 = 1
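The Lipschitz constant of this function can also be estimated empirically by sampling difference quotients; the sample range and count below are illustrative choices.

```python
import math
import random

# Estimate the Lipschitz constant of softplus, f(x) = log(1 + exp(x)),
# by taking the largest |f(x1) - f(x2)| / |x1 - x2| over random pairs.
f = lambda x: math.log1p(math.exp(x))
random.seed(0)
ratios = []
for _ in range(10_000):
    x1, x2 = random.uniform(-20, 20), random.uniform(-20, 20)
    if x1 != x2:
        ratios.append(abs(f(x1) - f(x2)) / abs(x1 - x2))
print(max(ratios))  # approaches rho = 1 from below (up to float rounding)
```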
Optimization
Lipschitz Continuous Gradient
• A differentiable function 𝑓(𝑥) has a Lipschitz continuous
gradient if a constant 𝜌 > 0 exists, such that for all points 𝑥1, 𝑥2
‖𝛻𝑓(𝑥1) − 𝛻𝑓(𝑥2)‖ ≤ 𝜌 ‖𝑥1 − 𝑥2‖
• For a function 𝑓(𝑥) with a 𝜌-Lipschitz gradient, the second
derivative 𝑓′′(𝑥) is bounded everywhere by 𝜌
• E.g., consider the function 𝑓(𝑥) = 𝑥²
 𝑓(𝑥) = 𝑥² is not a Lipschitz continuous function, since 𝑓′(𝑥) = 2𝑥, so
when 𝑥 → ∞ then 𝑓′(𝑥) → ∞, i.e., the derivative is not bounded
everywhere
 Since 𝑓′′(𝑥) = 2, the gradient 𝑓′(𝑥) is 2-Lipschitz everywhere,
since the second derivative is bounded everywhere by 2
