Continuous Optimization

7 Continuous Optimization
Introduction
• Since machine learning algorithms are implemented on a computer,
the mathematical formulations are expressed as numerical
optimization methods
• Training a machine learning model: finding a good set of
parameters, where “good” is determined by the objective
function or the probabilistic model
⇒ Use optimization algorithms
2/25/2023 Chapter 7 - Continuous Optimization 2

Optimization Using Gradient Descent
• We solve for the minimum of a function f : ℝ^d → ℝ, i.e., min_x f(x)
• Gradient descent exploits the fact that f(x_0) decreases fastest if one
moves from x_0 in the direction of the negative gradient −(∇f(x_0))ᵀ of f
at x_0
• For a “good” step-size γ > 0, if
x_1 = x_0 − γ(∇f(x_0))ᵀ,
then
f(x_1) ≤ f(x_0)
[Figure: gradient descent step illustrated on f(x) = x² + 1]

Gradient Descent Algorithm
Algorithm.
1. Choose initial guess x0
2. Compute x_i iteratively until we meet the stopping criteria using
x_{i+1} = x_i − γ_i(∇f(x_i))ᵀ
3. Return parameter x_i
{f(x_i) ≈ min_x f(x)}
For a suitable learning rate γ_i, the sequence f(x_0) > f(x_1) > . . . converges
to a local minimum
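
The three steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: it assumes a constant step-size and a small-gradient stopping rule, neither of which the slides prescribe, and the names `tol` and `max_iter` are made up for this example. It is run on f(x) = x² + 1, whose gradient is 2x.

```python
def gradient_descent(grad, x0, gamma, tol=1e-8, max_iter=1000):
    """Fixed-step gradient descent for a function of one variable."""
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if abs(g) < tol:       # assumed stopping criterion: small gradient
            break
        x = x - gamma * g      # x_{i+1} = x_i - gamma * grad f(x_i)
    return x


# f(x) = x^2 + 1 has gradient 2x and a single minimum at x = 0.
x_min = gradient_descent(lambda x: 2.0 * x, x0=3.0, gamma=0.1)
print(x_min)
```

Other stopping criteria (a fixed iteration count, or a small change in f) fit the same loop.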

Gradient Descent Algorithm – Ex 1
Find local minimum of f(x) = (x³ − 3x)/(x² + 3)
Gradient (quotient rule)
∇f(x) = (x⁴ + 12x² − 9)/(x² + 3)²
x_0 = 0, step-size γ = 0.05
number of iterations = 30
x_i = x_{i−1} − γ(∇f(x_{i−1}))ᵀ
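
A short Python sketch of this example, using x_0 = 0, γ = 0.05 and 30 iterations as stated. The gradient in the code is the derivative of the stated f obtained with the quotient rule, (x⁴ + 12x² − 9)/(x² + 3)²; the variable names are illustrative. Under these settings the iterates move right from 0 toward the stationary point x = √(−6 + √45) ≈ 0.84 without fully reaching it in 30 steps.

```python
def f(x):
    return (x**3 - 3 * x) / (x**2 + 3)


def grad_f(x):
    # Quotient rule applied to f: (x^4 + 12x^2 - 9) / (x^2 + 3)^2
    return (x**4 + 12 * x**2 - 9) / (x**2 + 3) ** 2


gamma = 0.05
x = 0.0
history = [x]
for _ in range(30):
    x = x - gamma * grad_f(x)   # x_i = x_{i-1} - gamma * grad f(x_{i-1})
    history.append(x)

print(f"x after 30 iterations: {x:.4f}, f(x) = {f(x):.4f}")
```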

Learning rate (Step-size) γ
• Choosing a good step-size (learning rate) is important in GD
• step-size is too small ⇒ GD can be slow
• step-size is too large ⇒ GD can overshoot, fail to converge, or even
diverge
[Figure: γ = 0.01: slowly converges; γ = 3: fails to converge]

• Find local minimum of z = f(x, y) = x² + 2y² + 4
⇒ ∇f(x, y) = [2x 4y]
• Choose X_0 = [x_0 y_0]ᵀ = [−1 2]ᵀ
• Learning rate γ = 0.1
• GD: X_{i+1} = X_i − γ∇f(X_i)ᵀ
X_1 = [−1 2]ᵀ − 0.1 [−2 8]ᵀ = [−0.8 1.2]ᵀ
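
The first step above, and the iterations that follow it, can be reproduced with a few lines of plain Python. This is only a sketch; the helper names `grad_f` and `gd_step` and the choice of 100 iterations are assumptions made for illustration.

```python
def f(x, y):
    return x**2 + 2 * y**2 + 4


def grad_f(x, y):
    # Gradient of f(x, y) = x^2 + 2y^2 + 4
    return (2.0 * x, 4.0 * y)


def gd_step(point, gamma):
    gx, gy = grad_f(*point)
    return (point[0] - gamma * gx, point[1] - gamma * gy)


gamma = 0.1
X0 = (-1.0, 2.0)
X1 = gd_step(X0, gamma)
print(X1)  # approximately (-0.8, 1.2), as computed by hand above

X = X0
for _ in range(100):
    X = gd_step(X, gamma)
print(X, f(*X))  # approaches the minimizer (0, 0), where f = 4
```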

[Table: iterates x, y and z = f(x, y)]

Too large γ
GD with γ = 0.4 ⇒ zigzag shape

How to choose a suitable learning rate γ
• Adaptive gradient methods rescale the learning rate γ at each
iteration, depending on local properties of the function f
• If f increases after a gradient step, the learning rate γ was too large ⇒
undo the step and decrease the learning rate γ
• If f decreases, the step could have been larger ⇒ try to increase the
learning rate γ
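
The undo/decrease and accept/increase heuristic above might be sketched like this. The factors `shrink = 0.5` and `grow = 1.1` and the iteration count are arbitrary illustrative choices, not values from the slides; the test function is f(x) = x² + 1 with a deliberately oversized initial step-size.

```python
def adaptive_gd(f, grad, x0, gamma=1.0, shrink=0.5, grow=1.1, n_iter=100):
    """Gradient descent with a simple accept/reject step-size rule."""
    x, fx = x0, f(x0)
    for _ in range(n_iter):
        x_new = x - gamma * grad(x)
        f_new = f(x_new)
        if f_new > fx:
            # f increased: the step was too large, so discard it
            # ("undo the step") and decrease the learning rate.
            gamma *= shrink
        else:
            # f did not increase: keep the step and try a larger
            # learning rate on the next iteration.
            x, fx = x_new, f_new
            gamma *= grow
    return x, gamma


x_min, final_gamma = adaptive_gd(
    lambda x: x * x + 1.0, lambda x: 2.0 * x, x0=3.0, gamma=3.0
)
print(x_min, final_gamma)
```

Starting from γ = 3, the first two steps are rejected and γ is halved twice before any progress is made.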

Gradient Descent With Momentum (1986)
• A method that introduces an additional term to remember what
happened in the previous iteration
• The momentum-based method remembers the update ∆xi at each
iteration i and determines the next update as a linear combination of
the current and previous gradients
x_{i+1} = x_i − γ_i(∇f(x_i))ᵀ + αΔx_i
Δx_i = x_i − x_{i−1} = αΔx_{i−1} − γ_{i−1}(∇f(x_{i−1}))ᵀ,
where α ∈ [0, 1].
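
A minimal Python sketch of the momentum update, assuming a constant step-size γ and an initial update Δx_0 = 0 (the slides do not state how the first update is initialised). It is compared against plain gradient descent (α = 0) on the simple test function f(x) = x² + 1, with γ = 0.01 and α = 0.7; the iteration count of 200 is an arbitrary choice.

```python
def gd_momentum(grad, x0, gamma, alpha, n_iter):
    """Gradient descent with momentum for a function of one variable."""
    x = x0
    delta = 0.0  # assumed initial update: no previous step to remember
    for _ in range(n_iter):
        # New update = alpha * previous update - gamma * current gradient
        delta = alpha * delta - gamma * grad(x)
        x = x + delta
    return x


def grad(x):
    return 2.0 * x  # gradient of f(x) = x^2 + 1


x_plain = gd_momentum(grad, x0=-1.0, gamma=0.01, alpha=0.0, n_iter=200)
x_momentum = gd_momentum(grad, x0=-1.0, gamma=0.01, alpha=0.7, n_iter=200)
print(x_plain, x_momentum)
```

On this quadratic the momentum run ends much closer to the minimizer x = 0 than the plain run for the same number of iterations.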

GD with Momentum
Find local minimum value of f(x) = 12x³ − 48x² + 36x
[Figure: without momentum (x_0 = −1, γ = 0.01) vs. with momentum (α = 0.7, x_0 = −1, γ = 0.01)]

Different values of α
[Figure: with momentum (α = 0.7, x_0 = 4, γ = 0.01) vs. with momentum (α = 1, x_0 = 4, γ = 0.01)]

Stochastic Gradient Descent (SGD)
• Computing the gradient can be very time consuming
⇒ find a “cheap” approximation of the gradient
• Since machine learning does not necessarily need a precise
estimate of the minimum of the objective function,
approximate gradients have been widely used
• SGD is very effective in large-scale machine learning problems such
as training deep neural networks on millions of images

Mini-batch Gradient Descent (SGD)
In ML, given N data points, consider the sum of the losses L_n incurred
by each example n:
L(θ) = Σ_{n=1}^{N} L_n(θ), where θ are the parameters
• Standard GD (a “batch” optimization method) uses the full gradient
θ_{i+1} = θ_i − γ_i(∇L(θ_i))ᵀ = θ_i − γ_i Σ_{n=1}^{N} (∇L_n(θ_i))ᵀ,
which can require very expensive evaluations

• In contrast to batch gradient descent, which uses all L_n, we randomly
choose a subset of the L_n for mini-batch gradient descent
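
As a rough illustration of mini-batch gradient descent, the sketch below fits a single parameter θ to synthetic, noise-free data y_n = 2x_n with per-example loss L_n(θ) = (θx_n − y_n)². The data set, the batch size, the step-size, and the choice to average (rather than sum) the batch gradient are all assumptions made for this example.

```python
import random


def minibatch_gd(xs, ys, theta0, gamma, batch_size, n_steps, seed=0):
    """Mini-batch gradient descent for the loss sum_n (theta*x_n - y_n)^2."""
    rng = random.Random(seed)
    theta = theta0
    n = len(xs)
    for _ in range(n_steps):
        # Randomly choose a subset of the N per-example losses.
        batch = rng.sample(range(n), batch_size)
        # Gradient of the selected losses with respect to theta.
        grad = sum(2.0 * (theta * xs[i] - ys[i]) * xs[i] for i in batch)
        # Averaging over the batch is a scaling choice made here.
        theta -= gamma * grad / batch_size
    return theta


xs = [i / 100 for i in range(1, 101)]  # N = 100 synthetic inputs
ys = [2.0 * x for x in xs]             # targets generated with theta = 2
theta = minibatch_gd(xs, ys, theta0=0.0, gamma=0.5, batch_size=10, n_steps=500)
print(theta)
```

Each step touches only 10 of the 100 examples, yet θ still approaches the value 2 that generated the data.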

Convex sets
Definition. A set C is a convex set if for any x, y ∈ C and for any scalar
θ with 0 ≤ θ ≤ 1, we have
θx + (1 − θ)y ∈ C
[Figure: example of a convex set; example of a non-convex set]
Note. Convex sets are sets such that the line segment connecting any
two elements of the set lies inside the set.

Some convex sets
• In ℝ, every interval (a, b) is convex
• In ℝ², C_1 = {(x, y) | x² + y² < 1} is convex but C_2 = {(x, y) | 0 < x² + y²
< 1} is not.
• In ℝ^n, C = {(x_1, x_2, …, x_n) | c_1x_1 + c_2x_2 + … + c_nx_n ≤ b} is convex, for
all real numbers c_1, c_2, …, c_n, b
Theorem. The intersection of two convex sets is also convex.
[Figure: two convex sets; their intersection is a convex set]
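
To see why the punctured disk 0 < x² + y² < 1 fails the definition while the open disk x² + y² < 1 does not, it is enough to exhibit one pair of points whose midpoint leaves the set. The following Python check is a small sketch; the points (0.5, 0) and (−0.5, 0) and the function names are chosen here for illustration.

```python
def in_open_disk(p):
    x, y = p
    return x * x + y * y < 1


def in_punctured_disk(p):
    x, y = p
    return 0 < x * x + y * y < 1


def convex_combination(p, q, theta):
    # theta * p + (1 - theta) * q, computed coordinate-wise
    return (theta * p[0] + (1 - theta) * q[0],
            theta * p[1] + (1 - theta) * q[1])


p, q = (0.5, 0.0), (-0.5, 0.0)       # both points lie in both sets
mid = convex_combination(p, q, 0.5)  # the origin (0, 0)
print(in_open_disk(mid), in_punctured_disk(mid))  # True False
```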

Convex functions
• Definition. Let C be a convex subset of ℝ^D.
A function f : C → ℝ is called a convex function if
f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y), ∀x ∈ C, ∀y ∈ C, ∀θ ∈ [0, 1]
Note. A concave function is the negative of a convex function

Concave functions
Definition. Let C be a convex subset of ℝ^D.
A function f : C → ℝ is called a concave function if
f(θx + (1 − θ)y) ≥ θf(x) + (1 − θ)f(y), ∀x ∈ C, ∀y ∈ C, ∀θ ∈ [0, 1]
Note. A concave function is the negative of a convex function

Theorem
Suppose that C is a convex set,
• If f : C → ℝ is a convex function, then every local minimum of f is a
global minimum of f over C.
• If f : C → ℝ is a concave function, then every local maximum of f is a
global maximum of f over C.

Convexity test
• If a function f : ℝ^n → ℝ is twice differentiable, then
• f(x) is convex if and only if for any two points x, y it holds that
f(y) ≥ f(x) + ∇_x f(x)ᵀ(y − x)
• f(x) is convex if and only if the Hessian ∇²_x f(x) is positive semidefinite

Example

Convex functions - Ex
• The negative entropy f(x) = x log₂ x is convex for x > 0
• In fact,
Gradient ∇_x f(x) = log₂ x + x(log₂ x)′
= log₂ x + log₂ e
Hessian ∇²_x f(x) = (1/x) log₂ e > 0, for all x > 0

Some common convex functions
• ax + b on ℝ for any a, b ∈ ℝ
• e^{ax} on ℝ for any a ∈ ℝ
• |x|^p on ℝ for p ≥ 1
• x log x, x > 0
• cᵀx + b, x ∈ ℝ^n, for any c ∈ ℝ^n, b ∈ ℝ
• Every norm on ℝ^n
• Spectral norm of a matrix: ‖A‖_2 = σ_max(A) = [λ_max(AᵀA)]^{1/2}

Sum of convex functions is convex
Theorem. If f and g are convex functions, then so is f + g.
• In fact, suppose f and g are convex functions
• Then, f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y)
and g(θx + (1 − θ)y) ≤ θg(x) + (1 − θ)g(y), for any 0 ≤ θ ≤ 1
⇒ Summing up both sides
f(θx + (1 − θ)y) + g(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y) + θg(x) + (1 − θ)g(y)
= θ(f(x) + g(x)) + (1 − θ)(f(y) + g(y))
⇒ f + g is convex

Constrained Optimization
Ex. Find the minimum value of the function f(x, y) = x² + 2y² subject to
the constraint x² + y² = 1.
[Figure: level curves of f(x, y) and the constraint curve g(x, y) = 0]

Constrained Optimization. Lagrange multipliers
To minimize f(x, y) subject to the constraint g(x, y) = 0 is to find
the smallest value of c such that the level curve f(x, y) = c
intersects g(x, y) = 0.
At such a point (x_0, y_0) the two curves are tangent and
their gradients are parallel:
∇f(x_0, y_0) = −λ∇g(x_0, y_0) for some scalar λ.
λ: Lagrange multiplier
L(x, y, λ) := f(x, y) + λg(x, y) is called the Lagrangian
[Figure: level/contour curves of f(x, y) and the constraint curve g(x, y) = 0]

Ex0. Minimize f(x, y) = x² + 2y² s.t. x² + y² = 1.

Ex1. Minimize f(x, y) = x² + y² s.t. x – y = 1.
Set 0 = g(x, y) = x – y – 1
and L(x, y, λ) = f(x, y) + λg(x, y) = x² + y² + λ(x – y – 1)
We find all values of (x_0, y_0, λ) such that
L_x(x_0, y_0, λ) = 0, L_y(x_0, y_0, λ) = 0, and L_λ(x_0, y_0, λ) = 0 // partial derivatives of L
⇒ 2x_0 + λ = 0, 2y_0 − λ = 0, and x_0 – y_0 – 1 = 0
⇒ x_0 = 1/2, y_0 = −1/2, λ = −1
The minimum value of f s.t. x – y – 1 = 0 is f(x_0, y_0) = 1/2.
(Note that we can also substitute y = x – 1 into f(x, y) = x² + y².)

Ex2. Minimize f(x, y) = 2x + y
s.t. x² + y² = 1.
Set 0 = g(x, y) = x² + y² – 1
and L(x, y, λ) = f(x, y) + λg(x, y) = 2x + y + λ(x² + y² – 1)
We find all values of (x_0, y_0, λ) such that
L_x(x_0, y_0, λ) = 0, L_y(x_0, y_0, λ) = 0, and L_λ(x_0, y_0, λ) = 0
⇒ 2 + 2λx_0 = 0, 1 + 2λy_0 = 0, and x_0² + y_0² – 1 = 0
⇒ (x_0, y_0, λ) = (2/√5, 1/√5, −√5/2) or (x_0, y_0, λ) = (−2/√5, −1/√5, √5/2)
The minimum value of f s.t. g(x, y) = 0 is f(−2/√5, −1/√5) = −√5.
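
The result of Ex2 can be cross-checked numerically without Lagrange multipliers by parametrizing the constraint circle as (cos t, sin t) and scanning t. The sketch below is a brute-force sanity check, not a recommended solution method; the grid size `n` is an arbitrary choice.

```python
import math


def f(x, y):
    return 2 * x + y


n = 200_000  # number of sample points on the unit circle
best_value, best_k = min(
    (f(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n)), k)
    for k in range(n)
)
t = 2 * math.pi * best_k / n
x_star, y_star = math.cos(t), math.sin(t)

# Compare with the closed-form answer from the Lagrangian conditions.
print(best_value, -math.sqrt(5))
print((x_star, y_star), (-2 / math.sqrt(5), -1 / math.sqrt(5)))
```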

For real-valued functions f : ℝ^D → ℝ, we consider the constrained
optimization problem
min_x f(x)
subject to g_i(x) ≤ 0 for all i = 1, . . . , m
For Lagrange multipliers λ = [λ_1 λ_2 … λ_m]ᵀ with λ_i ≥ 0,
set L(x, λ) := f(x) + λᵀg(x) // Lagrangian

Dual Lagrangian
In general, duality in optimization is the idea of converting an optimization
problem in one set of variables x (called the primal variables) into another one in
a different set of variables λ (called the dual variables).
Primal problem: min_x f(x)
s.t. g_i(x) ≤ 0, for all i = 1, 2, …, m
Dual Lagrangian: D(λ) = min_x L(x, λ)
Lagrangian dual problem: max_λ D(λ)
s.t. λ ≥ 0
Note. Lagrange multipliers are named after the French-Italian
mathematician Joseph-Louis Lagrange (1736–1813).

Weak duality vs Strong duality
minimax ≥ maximin
• Weak duality: min_x max_{λ≥0} L(x, λ) ≥ max_{λ≥0} min_x L(x, λ)
holds even when f(·) and g_i(·) are nonconvex
• Strong duality: min_x max_{λ≥0} L(x, λ) = max_{λ≥0} min_x L(x, λ)
holds when f(·) and g_i(·) are convex (under a constraint qualification such as Slater's condition)
[Figure: primal objective f(x) and dual objective D(λ) under weak duality and under strong duality]
D(λ) = min_x L(x, λ) is concave even though f(·) and g_i(·) may be nonconvex. The outer problem, maximizing
D(λ) over λ, is the maximization of a concave function and can be computed efficiently.

Convex Optimization
• Convex optimization problem
• f(·) is a convex function,
• the constraints involving g(·) and h(·) are convex sets
⇒ strong duality: The optimal value of the dual problem is the same
as the optimal value of the primal problem

Convex Optimization
Problem
min_x f(x)
subject to g_i(x) ≤ 0 for all i = 1, . . . , m
h_j(x) = 0 for all j = 1, . . . , n,
where all functions f(x) and g_i(x) are convex functions,
and all h_j(x) = 0 are convex sets

Convex optimization
Ex3. Minimize f(x, y) = x² – 4y s.t. g(x, y) = y² – 2x ≤ 0
L(x, y, λ) = f(x, y) + λg(x, y) = x² – 4y + λ(y² – 2x)
L_x = 0, and L_y = 0
⇒ 2x − 2λ = 0, and −4 + 2λy = 0
⇒ x = λ, and y = 2/λ
min_(x,y) L(x, y, λ) = −λ² – 4/λ =: D(λ)
⇒ max_{λ≥0} D(λ) = D(∛2) = −3∛4
⇒ Result = −3∛4
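
The value −3∛4 ≈ −4.762 can be checked numerically by maximizing D(λ) = −λ² − 4/λ over a grid of λ > 0 and then evaluating the primal objective at x = λ, y = 2/λ. This is only a sanity-check sketch; the grid range (0, 5] and its spacing are arbitrary choices.

```python
def dual(lam):
    # D(lambda) = -lambda^2 - 4/lambda, as derived above
    return -lam**2 - 4.0 / lam


# Coarse grid search over lambda in (0, 5]
lams = [k / 10000 for k in range(1, 50001)]
lam_star = max(lams, key=dual)

# Primal point recovered from the stationarity conditions
x_star, y_star = lam_star, 2.0 / lam_star
primal_value = x_star**2 - 4.0 * y_star
constraint = y_star**2 - 2.0 * x_star  # close to 0: the constraint is active

print(lam_star, 2 ** (1 / 3))
print(dual(lam_star), primal_value, -3 * 4 ** (1 / 3))
print(constraint)
```

The dual and primal values agree up to the grid resolution, which is what strong duality predicts for this convex problem.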

Example
min 2 x 2y 3
x ,y
• Consider the problem
s .t . x 2 y2 4
1/ Find the Lagrangian L(x, y, λ)
2/ Find the dual Lagrangian D(λ)

Linear Programming
• Consider the special case when all the preceding functions are linear,
i.e.,
min_x cᵀx
subject to Ax ≤ b,
where A ∈ ℝ^{m×d} and b ∈ ℝ^m
• This is known as a linear program, which has d variables and m linear
constraints

Linear program - Ex
• Consider the linear program

min_{x ∈ ℝ²}  −[5  3] [x_1  x_2]ᵀ  =  −5x_1 − 3x_2

subject to
[  2   2 ]             [ 33 ]
[  2  −4 ]  [ x_1 ]    [  8 ]
[ −2   1 ]  [ x_2 ] ≤  [  5 ]
[  0  −1 ]             [ −1 ]
[  0   1 ]             [  8 ]


Linear program – Exercise
• Consider the linear program
• Write the program in standard form (matrix notation).

Linear program - Lagrangian
• The Lagrangian is given by
L(x, λ) = cᵀx + λᵀ(Ax − b)
= (c + Aᵀλ)ᵀx − λᵀb
• ∂L(x, λ)/∂x = 0 ⇒ c + Aᵀλ = 0
• Therefore, whenever c + Aᵀλ = 0, the dual Lagrangian is
D(λ) = min_x L(x, λ) = −λᵀb,
and we would like to find max_{λ≥0} D(λ)

Linear program - Dual program
• The dual optimization problem
max_{λ ∈ ℝ^m} −bᵀλ
subject to c + Aᵀλ = 0,
λ ≥ 0
This is also a linear program, but with m variables
• We have two choices
• Solve the primal program for d variables
• Solve the dual program for m variables

Linear program - Lagrangian
• Lagrangian 33
• D(λ) = minx L(x, λ) = −λTb 8
 
= [-1 -2 -3 -4 -5] 5
 
 1
 8 
 D(λ) = -331 -82 -53 +4 -85
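
The relation between the primal and dual linear programs can be checked with exact rational arithmetic. The sketch below assumes the example's data are c = [−5, −3], b = [33, 8, 5, −1, 8] and A with rows (2, 2), (2, −4), (−2, 1), (0, −1), (0, 1); the candidate points `x_opt`, `lam_opt` and `lam_weak` were worked out by hand for this illustration and are not taken from the slides.

```python
from fractions import Fraction as F

c = [F(-5), F(-3)]
A = [[2, 2], [2, -4], [-2, 1], [0, -1], [0, 1]]
b = [33, 8, 5, -1, 8]


def primal_feasible(x):
    # A x <= b, checked row by row
    return all(sum(a * xi for a, xi in zip(row, x)) <= bi
               for row, bi in zip(A, b))


def dual_feasible(lam):
    # c + A^T lam = 0 and lam >= 0
    stationary = all(
        c[j] + sum(A[i][j] * lam[i] for i in range(len(A))) == 0
        for j in range(len(c))
    )
    return stationary and all(l >= 0 for l in lam)


def primal_value(x):
    return sum(cj * xj for cj, xj in zip(c, x))


def dual_value(lam):
    return -sum(l * bi for l, bi in zip(lam, b))  # D(lam) = -lam^T b


x_opt = [F(37, 3), F(25, 6)]                         # vertex where rows 1 and 2 are tight
lam_opt = [F(13, 6), F(1, 3), F(0), F(0), F(0)]      # dual-feasible
lam_weak = [F(5, 2), F(0), F(0), F(2), F(0)]         # another dual-feasible point

print(primal_value(x_opt), dual_value(lam_opt), dual_value(lam_weak))
```

Any dual-feasible λ gives a lower bound on the primal objective (weak duality); for `lam_opt` the bound is tight, so both `x_opt` and `lam_opt` are optimal.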

Example
min 2
2x 1 x2
x 1,x 2
1 2 1
x1
s .t . 3 1 4
x2
2 3 3
• Find the dual Lagrangian D(λ)

Quadratic Programming
• Consider the problem
min_x (1/2)xᵀQx + cᵀx
subject to Ax ≤ b,
where A ∈ ℝ^{m×d}, b ∈ ℝ^m, and c ∈ ℝ^d,
Q ∈ ℝ^{d×d} is symmetric positive definite (and therefore the objective function is
convex)
• This is known as a quadratic program with d variables and m linear
constraints

Quadratic Programming – Ex
The optimal value must lie in the shaded region, and is indicated by the star

Quadratic Programming – Exercise
Consider the quadratic program
Write the program in standard form (matrix notation).

Quadratic Programming - Lagrangian
• The Lagrangian is given by
L(x, λ) = (1/2)xᵀQx + cᵀx + λᵀ(Ax − b)
= (1/2)xᵀQx + (c + Aᵀλ)ᵀx − λᵀb,
Taking the derivative of L(x, λ) with respect to x and setting it to zero
gives
Qx + (c + Aᵀλ) = 0
Assuming that Q is invertible, we get
x = −Q⁻¹(c + Aᵀλ)

Quadratic Programming – Dual Lagrangian
• The dual Lagrangian
D(λ) = −(1/2)(c + Aᵀλ)ᵀQ⁻¹(c + Aᵀλ) − λᵀb
Therefore, the dual optimization problem is given by
max_{λ ∈ ℝ^m} −(1/2)(c + Aᵀλ)ᵀQ⁻¹(c + Aᵀλ) − λᵀb
subject to λ ≥ 0
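
The step from the minimizer x = −Q⁻¹(c + Aᵀλ) to the dual Lagrangian can be checked by direct substitution into L(x, λ). The derivation below is a sketch that assumes Q is symmetric (so that Q⁻¹ is symmetric as well); the shorthand v := c + Aᵀλ is introduced here only for brevity.

```latex
\begin{aligned}
v &:= c + A^\top \lambda, \qquad x^\star = -Q^{-1} v \\
D(\lambda) &= L(x^\star, \lambda)
  = \tfrac{1}{2}\,(Q^{-1} v)^\top Q\,(Q^{-1} v) - v^\top Q^{-1} v - \lambda^\top b \\
 &= \tfrac{1}{2}\, v^\top Q^{-1} v - v^\top Q^{-1} v - \lambda^\top b \\
 &= -\tfrac{1}{2}\,(c + A^\top \lambda)^\top Q^{-1} (c + A^\top \lambda) - \lambda^\top b
\end{aligned}
```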
We will see an application of quadratic programming in ML in Chapter 12,
Support Vector Machines

Example
T
1 x1 2 2
• Consider the linear program min x1 x 2
x 1,x 2 2
2 x2 2 4
2 1 1
x1
s .t . 3 2 2
x2
1 1 3
• Find the dual Lagrangian D(λ)

THANKS
