7 Continuous Optimization

Introduction

• Since machine learning algorithms are implemented on a computer, the mathematical formulations are expressed as numerical optimization methods
• Training a machine learning model: finding a good set of parameters determined by the objective function or the probabilistic model
⇒ Use optimization algorithms

2/25/2023 Chapter 7 - Continuous Optimization 2


Optimization Using Gradient Descent
• We solve for the minimum of $f(x)$, where $f: \mathbb{R}^d \to \mathbb{R}$
• Gradient descent exploits the fact that $f(x_0)$ decreases fastest if one moves from $x_0$ in the direction of the negative gradient $-(\nabla f(x_0))^T$ of $f$ at $x_0$
• For a “good” step-size $\gamma > 0$, if
$x_1 = x_0 - \gamma(\nabla f(x_0))^T$,
then
$f(x_1) \le f(x_0)$
[Figure: graph of $f(x) = x^2 + 1$]


Gradient Descent Algorithm
Algorithm.
1. Choose initial guess x0
2. Compute $x_i$ iteratively, until the stopping criterion is met, using
$x_{i+1} = x_i - \gamma_i(\nabla f(x_i))^T$
3. Return the parameter $x_i$
{$f(x_i) \approx \min_x f(x)$}
For a suitable learning rate $\gamma_i$, the sequence $f(x_0) \ge f(x_1) \ge \dots$ converges to a local minimum
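The three steps of the algorithm can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the step-size is held constant, the stopping criterion is a fixed number of iterations, and the test function $f(x) = x^2 + 1$ with starting point 3.0 is a hypothetical choice, not part of the algorithm.

```python
def gradient_descent(grad, x0, step_size, num_iters):
    """Iterate x_{i+1} = x_i - step_size * grad(x_i); return all iterates."""
    xs = [x0]
    for _ in range(num_iters):  # assumed stopping rule: fixed iteration count
        xs.append(xs[-1] - step_size * grad(xs[-1]))
    return xs


# Hypothetical test problem: f(x) = x^2 + 1, whose gradient is 2x.
def f(x):
    return x**2 + 1


iterates = gradient_descent(lambda x: 2 * x, x0=3.0, step_size=0.1, num_iters=50)
values = [f(x) for x in iterates]
print(iterates[-1], values[-1])
```

With this step-size every iterate is 0.8 times the previous one, so the function values never increase along the sequence.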


Gradient Descent Algorithm – Ex 1
Find a local minimum of $f(x) = (x^3 - 3x)/(x^2 + 3)$
Gradient
$\nabla f(x) = \dfrac{x^4 + 12x^2 - 9}{(x^2 + 3)^2}$
$x_0 = 0$, step-size $\gamma = 0.05$
number of iterations = 30
$x_i = x_{i-1} - \gamma(\nabla f(x_{i-1}))^T$
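The run described on this slide can be reproduced with the sketch below. The gradient in the code is obtained by applying the quotient rule to $f(x) = (x^3 - 3x)/(x^2 + 3)$; the names `history` and `x_star` are illustrative only.

```python
def f(x):
    return (x**3 - 3 * x) / (x**2 + 3)


def grad_f(x):
    # Quotient rule applied to f: (x^4 + 12x^2 - 9) / (x^2 + 3)^2
    return (x**4 + 12 * x**2 - 9) / (x**2 + 3) ** 2


gamma = 0.05  # step-size from the slide
x = 0.0       # x_0 from the slide
history = [x]
for _ in range(30):  # 30 iterations, as on the slide
    x = x - gamma * grad_f(x)
    history.append(x)

# Stationary point to the right of x_0: positive root of x^4 + 12x^2 - 9 = 0.
x_star = (-6 + 3 * 5**0.5) ** 0.5
print(history[-1], x_star)
```

Starting from $x_0 = 0$ the gradient is negative, so the iterates move to the right toward the stationary point near 0.84.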


Learning rate (Step-size) $\gamma$
• Choosing a good step-size (learning rate) is important in GD
• If the step-size is too small ⇒ GD can be slow
• If the step-size is too large ⇒ GD can overshoot, fail to converge, or even diverge

[Figure: two panels. $\gamma = 0.01$: converges slowly; $\gamma = 3$: fails to converge]


Gradient Descent Algorithm – Ex 2
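• Find a local minimum of $z = f(x, y) = x^2 + 2y^2 + 4$
⇒ $\nabla f(x, y) = [2x \;\; 4y]$
• Choose $X_0 = \begin{bmatrix} x_0 \\ y_0 \end{bmatrix} = \begin{bmatrix} -1 \\ 2 \end{bmatrix}$
• Learning rate $\gamma = 0.1$
• GD: $X_{i+1} = X_i - \gamma \nabla f(X_i)^T$
$X_1 = \begin{bmatrix} -1 \\ 2 \end{bmatrix} - 0.1 \begin{bmatrix} -2 \\ 8 \end{bmatrix} = \begin{bmatrix} -0.8 \\ 1.2 \end{bmatrix}$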
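The first step above, and the iterations that follow it, can be checked with a short script. This is a minimal sketch: the number of iterations (20) and the name `rows` are assumptions made for illustration.

```python
def f(x, y):
    return x**2 + 2 * y**2 + 4


def grad_f(x, y):
    return (2 * x, 4 * y)


gamma = 0.1
x, y = -1.0, 2.0  # X_0 from the slide
rows = [(0, x, y, f(x, y))]
for i in range(1, 21):  # assumed iteration count
    gx, gy = grad_f(x, y)
    x, y = x - gamma * gx, y - gamma * gy
    rows.append((i, x, y, f(x, y)))

# Print the first few rows as an (i, x, y, z) table.
for i, xi, yi, zi in rows[:5]:
    print(f"{i:2d}  x={xi:.4f}  y={yi:.4f}  z={zi:.4f}")
```

Row 1 of the table matches the hand computation of $X_1$, and $z$ approaches the minimum value 4 as the iterates approach the origin.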


Gradient Descent Algorithm – Ex 2
[Table: iterates $x$, $y$ and the value $z = f(x, y)$ at each step]


Too large $\gamma$

GD with $\gamma = 0.4$ ⇒ zigzag shape
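The zigzag can be seen numerically. The sketch below assumes the function is the one from Ex 2, $f(x, y) = x^2 + 2y^2 + 4$, with the same starting point $(-1, 2)$; the slide does not state this explicitly.

```python
gamma = 0.4
x, y = -1.0, 2.0  # assumed starting point (taken from Ex 2)
ys = [y]
for _ in range(10):
    # Gradient of the assumed f(x, y) = x^2 + 2y^2 + 4 is (2x, 4y).
    x, y = x - gamma * 2 * x, y - gamma * 4 * y
    ys.append(y)

# Each step multiplies y by (1 - 0.4 * 4) = -0.6, so y changes sign every
# iteration while shrinking in magnitude: the path zigzags across the x-axis.
print(ys)
```

The iterates still converge here because the factor has magnitude below 1; a larger step-size would push it past 1 in magnitude and the iterates would diverge.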


How to choose a suitable learning rate $\gamma$
• Adaptive gradient methods rescale the learning rate $\gamma$ at each iteration, depending on local properties of the function $f$
• If $f$ increases after a gradient step, the learning rate $\gamma$ was too large ⇒ undo the step and decrease the learning rate $\gamma$
• If $f$ decreases, the step could have been larger ⇒ try to increase the learning rate $\gamma$
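The two rules above can be turned into a simple heuristic. The sketch below is one possible reading, not a specific published method: the factors `shrink=0.5` and `grow=1.1`, the function name `adaptive_gd`, and the test function are all assumptions.

```python
def adaptive_gd(f, grad, x0, gamma0, num_iters, shrink=0.5, grow=1.1):
    """Gradient descent that shrinks gamma after a bad step and grows it after a good one."""
    x, gamma = x0, gamma0
    for _ in range(num_iters):
        candidate = x - gamma * grad(x)
        if f(candidate) > f(x):
            gamma *= shrink  # f increased: undo the step (keep x), decrease gamma
        else:
            x = candidate    # f did not increase: accept the step
            gamma *= grow    # and try a slightly larger gamma next time
    return x, gamma


# Hypothetical test: f(x) = x^2 + 1 with a deliberately oversized initial gamma.
def f(x):
    return x**2 + 1


x_final, gamma_final = adaptive_gd(f, lambda x: 2 * x, x0=3.0, gamma0=3.0, num_iters=40)
print(x_final, gamma_final)
```

Because a step is only accepted when $f$ does not increase, the function value along the accepted iterates is non-increasing by construction.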


Gradient Descent With Momentum (1986)
• A method that introduces an additional term to remember what
happened in the previous iteration
• The momentum-based method remembers the update $\Delta x_i$ at each iteration $i$ and determines the next update as a linear combination of the current and previous gradients
$x_{i+1} = x_i - \gamma_i(\nabla f(x_i))^T + \alpha\Delta x_i$
$\Delta x_i = x_i - x_{i-1} = \alpha\Delta x_{i-1} - \gamma_{i-1}(\nabla f(x_{i-1}))^T$,
where $\alpha \in [0, 1]$.
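The two update equations can be sketched as follows. The assumptions here are a constant step-size, an initial update of zero, and a hypothetical quadratic test function; the values of the step-size and momentum parameter are chosen only for illustration.

```python
def momentum_gd(grad, x0, gamma, alpha, num_iters):
    """Gradient descent with momentum; returns all iterates."""
    x, delta = x0, 0.0  # delta holds the previous update (assumed zero at the start)
    xs = [x]
    for _ in range(num_iters):
        delta = alpha * delta - gamma * grad(x)  # new update = alpha * old update - gamma * gradient
        x = x + delta                            # x_{i+1} = x_i - gamma * gradient + alpha * old update
        xs.append(x)
    return xs


# Hypothetical test: f(x) = x^2 + 1 (gradient 2x), same gamma with and without momentum.
grad = lambda x: 2 * x
without = momentum_gd(grad, x0=3.0, gamma=0.01, alpha=0.0, num_iters=200)
with_momentum = momentum_gd(grad, x0=3.0, gamma=0.01, alpha=0.7, num_iters=200)
print(abs(without[-1]), abs(with_momentum[-1]))
```

Setting the momentum parameter to 0 recovers plain gradient descent, which makes the two runs directly comparable; on this test function the momentum run ends much closer to the minimizer.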


GD with Momentum
Find a local minimum value of $f(x) = 12x^3 - 48x^2 + 36x$

[Figure: two panels. Without momentum ($x_0 = -1$, $\gamma = 0.01$); with momentum ($\alpha = 0.7$, $x_0 = -1$, $\gamma = 0.01$)]


Different values of $\alpha$

[Figure: two panels, both with momentum. One with $\alpha = 0.7$, $x_0 = 4$, $\gamma = 0.01$; the other with $\alpha = 1$, $x_0 = 4$, $\gamma = 0.01$]


Stochastic Gradient Descent (SGD)
• Computing the gradient can be very time consuming
⇒ find a “cheap” approximation of the gradient
• Since machine learning does not necessarily need a precise estimate of the minimum of the objective function, approximate gradients are widely used
• SGD is very effective in large-scale machine learning problems such
as training deep neural networks on millions of images


Mini-batch Gradient Descent (SGD)
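In ML, given $N$ data points, consider the sum of the losses $L_n$ incurred by each example $n$:
$L(\theta) = \sum_{n=1}^{N} L_n(\theta)$, where $\theta$ are the parameters
• Standard GD (“batch” optimization method) is performed using the gradient of the full sum,
$\theta_{i+1} = \theta_i - \gamma_i(\nabla L(\theta_i))^T = \theta_i - \gamma_i \sum_{n=1}^{N} (\nabla L_n(\theta_i))^T$,
which requires very expensive evaluations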


• In contrast to batch gradient descent, which uses all Ln, we randomly
choose a subset of Ln for mini-batch gradient descent
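A minimal sketch of mini-batch gradient descent is given below. Everything specific in it is an assumption made for illustration: the noise-free data $y_n = 2x_n$, the per-example loss $L_n(\theta) = (\theta x_n - y_n)^2$, the batch size of 32, and the use of the batch mean rather than the batch sum (the difference is a constant factor that can be absorbed into the step-size).

```python
import random

random.seed(0)
N = 1000
xs = [random.uniform(-1.0, 1.0) for _ in range(N)]
ys = [2.0 * x for x in xs]  # hypothetical noise-free data, true parameter is 2


def grad_Ln(theta, n):
    # Gradient of the assumed per-example loss L_n(theta) = (theta * x_n - y_n)^2
    return 2.0 * (theta * xs[n] - ys[n]) * xs[n]


theta, gamma, batch_size = 0.0, 0.1, 32
for _ in range(500):
    batch = random.sample(range(N), batch_size)  # random subset of the N losses
    g = sum(grad_Ln(theta, n) for n in batch) / batch_size
    theta -= gamma * g

print(theta)
```

Each update touches only 32 of the 1000 loss terms, yet the parameter still approaches the value that minimizes the full sum.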


Convex sets

Definition. A set $C$ is a convex set if for any $x, y \in C$ and for any scalar $\theta$ with $0 \le \theta \le 1$, we have
$\theta x + (1 - \theta)y \in C$

[Figure: example of a convex set and example of a non-convex set]

Note. Convex sets are sets such that the line segment connecting any two elements of the set lies inside the set.


Some convex sets
• In $\mathbb{R}$, every interval $(a, b)$ is convex
• In $\mathbb{R}^2$, $C_1 = \{(x, y) \mid x^2 + y^2 < 1\}$ is convex but $C_2 = \{(x, y) \mid 0 < x^2 + y^2 < 1\}$ is not.
• In $\mathbb{R}^n$, $C = \{(x_1, x_2, \dots, x_n) \mid c_1x_1 + c_2x_2 + \dots + c_nx_n \le b\}$ is convex, for all real numbers $c_1, c_2, \dots, c_n, b$
Theorem. The intersection of two convex sets is also convex.
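To see why the punctured disk $C_2$ fails the definition while the open disk $C_1$ passes, it is enough to exhibit one convex combination that leaves $C_2$. The points and helper names below are illustrative choices.

```python
def in_C1(p):
    x, y = p
    return x**2 + y**2 < 1


def in_C2(p):
    x, y = p
    return 0 < x**2 + y**2 < 1


def combo(p, q, theta):
    # Convex combination theta * p + (1 - theta) * q
    return (theta * p[0] + (1 - theta) * q[0], theta * p[1] + (1 - theta) * q[1])


p, q = (0.5, 0.0), (-0.5, 0.0)  # both points lie in C1 and in C2
mid = combo(p, q, 0.5)          # the midpoint is the origin

print(in_C1(mid), in_C2(mid))
```

The midpoint of two points of $C_2$ is the origin, which is excluded from $C_2$, so $C_2$ is not convex; the same midpoint does belong to $C_1$.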

[Figure: two convex sets and their intersection, which is also a convex set]


Convex functions
• Definition. Let $C$ be a convex subset of $\mathbb{R}^D$.
A function $f: C \to \mathbb{R}$ is called a convex function if
$f(\theta x + (1 - \theta)y) \le \theta f(x) + (1 - \theta)f(y)$, $\forall x \in C$, $y \in C$, $\theta \in [0, 1]$
Note. A concave function is the negative of a convex function


Concave functions
Definition. Let $C$ be a convex subset of $\mathbb{R}^D$.
A function $f: C \to \mathbb{R}$ is called a concave function if
$f(\theta x + (1 - \theta)y) \ge \theta f(x) + (1 - \theta)f(y)$, $\forall x \in C$, $y \in C$, $\theta \in [0, 1]$

Note. A concave function is the negative of a convex function


Theorem
Suppose that $C$ is a convex set.
• If $f: C \to \mathbb{R}$ is a convex function, then any local minimum is a global minimum of $f$ over $C$.
• If $f: C \to \mathbb{R}$ is a concave function, then any local maximum is a global maximum of $f$ over $C$.


Convexity test
• If a function $f: \mathbb{R}^n \to \mathbb{R}$ is twice differentiable, then
• $f(x)$ is convex if and only if for any two points $x, y$ it holds that
$f(y) \ge f(x) + \nabla_x f(x)^T(y - x)$
• $f(x)$ is convex if and only if the Hessian $\nabla_x^2 f(x)$ is positive semidefinite


Example


Convex functions - Ex
• The negative entropy $f(x) = x\log_2 x$ is convex for $x > 0$
• In fact,
Gradient $\nabla_x f(x) = \log_2 x + x(\log_2 x)' = \log_2 x + \log_2 e$
Hessian $\nabla_x^2 f(x) = (1/x)\log_2 e > 0$, for all $x > 0$
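The derivation can be sanity-checked numerically. The sketch below compares the closed-form second derivative with a finite-difference estimate and tests the defining inequality of convexity on a grid; the grid, the step `h`, and the tolerance are arbitrary illustrative choices.

```python
import math


def f(x):
    return x * math.log2(x)


def second_derivative(x):
    # Closed form from the derivation: (1/x) * log2(e)
    return math.log2(math.e) / x


# Central finite-difference estimate of the second derivative at x = 2.
h = 1e-4
fd = (f(2 + h) - 2 * f(2) + f(2 - h)) / h**2

# Check f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y) on a grid of positive points.
points = [0.1 * k for k in range(1, 51)]
thetas = [0.0, 0.25, 0.5, 0.75, 1.0]
convex_on_grid = all(
    f(t * x + (1 - t) * y) <= t * f(x) + (1 - t) * f(y) + 1e-12
    for x in points
    for y in points
    for t in thetas
)
print(fd, second_derivative(2.0), convex_on_grid)
```

A grid check like this cannot prove convexity, but it would expose a sign error in the derivation.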


Some common convex functions
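• $ax + b$ on $\mathbb{R}$ for any $a, b \in \mathbb{R}$
• $ax$ on $\mathbb{R}$ for any $a \in \mathbb{R}$
• $|x|^p$ on $\mathbb{R}$ for $p \ge 1$
• $x\log x$, $x > 0$
• $c^T x + b$, $x \in \mathbb{R}^n$, for any $c \in \mathbb{R}^n$, $b \in \mathbb{R}$
• Every norm in $\mathbb{R}^n$
• Spectral norm of a matrix: $\|A\|_2 = \sigma_{\max}(A) = [\lambda_{\max}(A^T A)]^{1/2}$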


Sum of convex functions is convex
Theorem. If $f$ and $g$ are convex functions, then so is $f + g$.
• In fact, suppose $f$ and $g$ are convex functions
• Then, $f(\theta x + (1 - \theta)y) \le \theta f(x) + (1 - \theta)f(y)$
and $g(\theta x + (1 - \theta)y) \le \theta g(x) + (1 - \theta)g(y)$, for any $0 \le \theta \le 1$
⇒ Summing up both sides,
$f(\theta x + (1 - \theta)y) + g(\theta x + (1 - \theta)y) \le \theta f(x) + (1 - \theta)f(y) + \theta g(x) + (1 - \theta)g(y) = \theta(f(x) + g(x)) + (1 - \theta)(f(y) + g(y))$
⇒ $f + g$ is convex


Constrained Optimization
Ex. Find the minimum value of the function $f(x, y) = x^2 + 2y^2$ subject to the constraint $x^2 + y^2 = 1$.
[Figure: level curves of $f(x, y)$ and the constraint curve $g(x, y) = 0$]


Constrained Optimization. Lagrange multipliers
To minimize $f(x, y)$ subject to the constraint $g(x, y) = 0$ is to find the smallest value of $c$ such that the level curve $f(x, y) = c$ intersects $g(x, y) = 0$.

At such a point $(x_0, y_0)$ the two curves are tangent and their gradients are parallel:
$\nabla f(x_0, y_0) = -\lambda\nabla g(x_0, y_0)$ for some scalar $\lambda$.

$\lambda$: Lagrange multiplier
$L(x, y, \lambda) := f(x, y) + \lambda g(x, y)$ is called the Lagrangian

[Figure: level/contour curve of $f(x, y)$ tangent to the constraint curve $g(x, y) = 0$]


Constrained Optimization. Lagrange multipliers
Ex0. Minimize $f(x, y) = x^2 + 2y^2$ s.t. $x^2 + y^2 = 1$.
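One way to check an answer to Ex0 without solving the Lagrange system is to scan the constraint directly. The sketch below parametrizes the unit circle as $(\cos t, \sin t)$, which is an assumption about how to explore the constraint set and not the Lagrange-multiplier method itself; the grid size `K` is arbitrary.

```python
import math


def f(x, y):
    return x**2 + 2 * y**2


# Sample the constraint x^2 + y^2 = 1 at K equally spaced angles.
K = 3600
candidates = [
    (math.cos(2 * math.pi * k / K), math.sin(2 * math.pi * k / K)) for k in range(K)
]
best = min(candidates, key=lambda p: f(*p))
print(best, f(*best))
```

On the circle, $f = 1 + \sin^2 t$, so the scan finds the minimum value 1 at a point with $y = 0$, which agrees with the stationary points $(\pm 1, 0)$ of the Lagrangian.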


Constrained Optimization. Lagrange multipliers
Ex1. Minimize $f(x, y) = x^2 + y^2$ s.t. $x - y = 1$.

Set $0 = g(x, y) = x - y - 1$
and $L(x, y, \lambda) = f(x, y) + \lambda g(x, y) = x^2 + y^2 + \lambda(x - y - 1)$
We find all values of $(x_0, y_0, \lambda)$ such that
$L_x(x_0, y_0, \lambda) = 0$, $L_y(x_0, y_0, \lambda) = 0$, and $L_\lambda(x_0, y_0, \lambda) = 0$ // partial derivatives of $L$
⇒ $2x_0 + \lambda = 0$, $2y_0 - \lambda = 0$, and $x_0 - y_0 - 1 = 0$
⇒ $x_0 = 1/2$, $y_0 = -1/2$, $\lambda = -1$
The minimum value of $f$ s.t. $x - y - 1 = 0$ is $f(x_0, y_0) = 1/2$.
(Note that we can also substitute $y = x - 1$ into $f(x, y) = x^2 + y^2$ and minimize over $x$.)


Constrained Optimization. Lagrange multipliers
Ex2. Minimize $f(x, y) = 2x + y$ s.t. $x^2 + y^2 = 1$.

Set $0 = g(x, y) = x^2 + y^2 - 1$
and $L(x, y, \lambda) = f(x, y) + \lambda g(x, y) = 2x + y + \lambda(x^2 + y^2 - 1)$
We find all values of $(x_0, y_0, \lambda)$ such that
$L_x(x_0, y_0, \lambda) = 0$, $L_y(x_0, y_0, \lambda) = 0$, and $L_\lambda(x_0, y_0, \lambda) = 0$
⇒ $2 + 2\lambda x_0 = 0$, $1 + 2\lambda y_0 = 0$, and $x_0^2 + y_0^2 - 1 = 0$
⇒ $(x_0, y_0, \lambda) = (2/\sqrt{5}, 1/\sqrt{5}, -\sqrt{5}/2)$ or $(x_0, y_0, \lambda) = (-2/\sqrt{5}, -1/\sqrt{5}, \sqrt{5}/2)$
The minimum value of $f$ s.t. $g(x, y) = 0$ is $f(-2/\sqrt{5}, -1/\sqrt{5}) = -\sqrt{5}$.
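The two stationary points of Ex2 can be verified by substituting them back into the three conditions. In the sketch below the candidates are written with their square roots spelled out, $(\pm 2/\sqrt{5}, \pm 1/\sqrt{5})$ with multipliers $\mp\sqrt{5}/2$; the function name `residuals` is illustrative.

```python
import math

s5 = math.sqrt(5.0)
candidates = [(2 / s5, 1 / s5, -s5 / 2), (-2 / s5, -1 / s5, s5 / 2)]


def residuals(x, y, lam):
    # Partial derivatives of L(x, y, lam) = 2x + y + lam * (x^2 + y^2 - 1)
    return (2 + 2 * lam * x, 1 + 2 * lam * y, x**2 + y**2 - 1)


for x, y, lam in candidates:
    print(residuals(x, y, lam), 2 * x + y)

values = [2 * x + y for x, y, _ in candidates]
```

All residuals vanish up to rounding, and comparing the two objective values identifies the second candidate as the minimizer.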


Constrained Optimization. Lagrange multipliers
For real-valued functions $f: \mathbb{R}^D \to \mathbb{R}$, we consider the constrained optimization problem
$\min_x f(x)$
subject to $g_i(x) \le 0$ for all $i = 1, \dots, m$

For $\lambda = [\lambda_1 \; \lambda_2 \; \dots \; \lambda_m]^T$ with Lagrange multipliers $\lambda_i \ge 0$,

set $L(x, \lambda) := f(x) + \lambda^T g(x)$ // Lagrangian


Dual Lagrangian
In general, duality in optimization is the idea of converting an optimization
problem in one set of variables x (called the primal variables) into another one in
a different set of variables λ (called the dual variables).

Primal problem: $\min_x f(x)$
s.t. $g_i(x) \le 0$, for all $i = 1, 2, \dots, m$
Dual Lagrangian: $D(\lambda) = \min_x L(x, \lambda)$
Lagrangian dual problem: $\max_\lambda D(\lambda)$
s.t. $\lambda \ge 0$

Note. Lagrange multipliers are named after the French-Italian mathematician Joseph-Louis Lagrange (1736–1813).


Weak duality vs Strong duality: minimax ≥ maximin

[Figure: $f(x)$ and $D(\lambda)$ under weak duality and under strong duality]

Weak duality: $\min_x \max_\lambda L(x, \lambda) \ge \max_\lambda \min_x L(x, \lambda)$; $f(\cdot)$ and $g_i(\cdot)$ may be nonconvex
Strong duality: $\min_x \max_\lambda L(x, \lambda) = \max_\lambda \min_x L(x, \lambda)$; $f(\cdot)$ and $g_i(\cdot)$ are convex
$D(\lambda) = \min_x L(x, \lambda)$ is concave even though $f(\cdot)$ and $g_i(\cdot)$ may be nonconvex. The outer problem, maximizing $D(\lambda)$ over $\lambda$, is the maximization of a concave function and can be computed efficiently.


Convex Optimization
• Convex optimization problem
• $f(\cdot)$ is a convex function,
• the constraints involving $g(\cdot)$ and $h(\cdot)$ are convex sets
⇒ strong duality: The optimal solution of the dual problem is the same as the optimal solution of the primal problem


Convex Optimization
Problem
$\min_x f(x)$
subject to $g_i(x) \le 0$ for all $i = 1, \dots, m$
$h_j(x) = 0$ for all $j = 1, \dots, n$,

where all functions $f(x)$ and $g_i(x)$ are convex functions, and all $h_j(x) = 0$ are convex sets


Convex optimization
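Ex3. Minimize $f(x, y) = x^2 - 4y$ s.t. $g(x, y) = y^2 - 2x \le 0$

$L(x, y, \lambda) = f(x, y) + \lambda g(x, y) = x^2 - 4y + \lambda(y^2 - 2x)$

$L_x = 0$, and $L_y = 0$
⇒ $2x - 2\lambda = 0$, and $-4 + 2\lambda y = 0$
⇒ $x = \lambda$, and $y = 2/\lambda$
$\min_{(x,y)} L(x, y, \lambda) = -\lambda^2 - 4/\lambda =: D(\lambda)$
⇒ $\max_{\lambda \ge 0} D(\lambda) = D(\sqrt[3]{2}) = -3\sqrt[3]{4}$
⇒ Result $= -3\sqrt[3]{4}$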

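The last step of Ex3, maximizing the dual function over the multiplier, can be checked numerically. The sketch below assumes the dual function is $D(\lambda) = -\lambda^2 - 4/\lambda$ for $\lambda > 0$ and uses a ternary search, which is valid here because $D$ is concave on the positive reals; the search interval is an arbitrary choice.

```python
def D(lam):
    # Dual function of Ex3, defined for lam > 0
    return -lam**2 - 4.0 / lam


# Ternary search for the maximizer of the concave function D on (0, 10].
lo, hi = 1e-6, 10.0
for _ in range(200):
    m1 = lo + (hi - lo) / 3
    m2 = hi - (hi - lo) / 3
    if D(m1) < D(m2):
        lo = m1
    else:
        hi = m2
lam_star = (lo + hi) / 2

print(lam_star, 2 ** (1 / 3))       # maximizer vs. cube root of 2
print(D(lam_star), -3 * 4 ** (1 / 3))  # dual optimum vs. -3 * cube root of 4
```

The numerical maximizer agrees with the cube root of 2, and the dual optimum agrees with the closed-form result.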

Example
min 2 x 2y 3
x ,y
• Consider the problem
s .t . x 2 y2 4
1/ Find the Lagrangian $L(x, y, \lambda)$

2/ Find the dual Lagrangian $D(\lambda)$


Linear Programming
• Consider the special case when all the preceding functions are linear,
i.e.,
$\min_x c^T x$
subject to $Ax \le b$,
where $A \in \mathbb{R}^{m \times d}$ and $b \in \mathbb{R}^m$
• This is known as a linear program, which has d variables and m linear
constraints


Linear program - Ex
• Consider the linear program
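$\min_{x \in \mathbb{R}^2} \; -\begin{bmatrix} 5 \\ 3 \end{bmatrix}^T \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$

subject to $\begin{bmatrix} 2 & 2 \\ 2 & -4 \\ -2 & 1 \\ 0 & -1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \le \begin{bmatrix} 33 \\ 8 \\ 5 \\ -1 \\ 8 \end{bmatrix}$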

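For a two-variable linear program with a bounded feasible region, the optimum is attained at a vertex, so a small program can be solved by enumerating vertices. The sketch below does this for the example above. The entries of `A`, `b`, and `c` in the code are an assumed reading of the slide, including the minus signs, and vertex enumeration is used only because the problem is tiny; it is not how linear programs are solved in practice.

```python
from itertools import combinations

# Assumed data for the example: minimize c^T x subject to A x <= b.
c = (-5.0, -3.0)
A = [(2.0, 2.0), (2.0, -4.0), (-2.0, 1.0), (0.0, -1.0), (0.0, 1.0)]
b = [33.0, 8.0, 5.0, -1.0, 8.0]


def feasible(p, tol=1e-9):
    return all(a1 * p[0] + a2 * p[1] <= bi + tol for (a1, a2), bi in zip(A, b))


vertices = []
for i, j in combinations(range(len(A)), 2):
    (a11, a12), (a21, a22) = A[i], A[j]
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        continue  # the two boundary lines are parallel
    # Cramer's rule for the intersection of boundary lines i and j.
    p = ((b[i] * a22 - a12 * b[j]) / det, (a11 * b[j] - a21 * b[i]) / det)
    if feasible(p):
        vertices.append(p)

best = min(vertices, key=lambda p: c[0] * p[0] + c[1] * p[1])
best_value = c[0] * best[0] + c[1] * best[1]
print(best, best_value)
```

Under these assumed data the minimizer is the intersection of the first two constraint boundaries.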


Linear program – Exercise
• Consider the linear program

• Write the program in standard form (matrix notation).


Linear program - Lagrangian
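• The Lagrangian is given by
$L(x, \lambda) = c^T x + \lambda^T(Ax - b) = (c + A^T\lambda)^T x - \lambda^T b$
• $\dfrac{\partial}{\partial x} L(x, \lambda) = 0 \Rightarrow c + A^T\lambda = 0$
• Therefore, the dual Lagrangian is
$D(\lambda) = \min_x L(x, \lambda) = -\lambda^T b$,
and we would like to find $\max_{\lambda \ge 0} D(\lambda)$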


Linear program - Dual program
• The dual optimization problem
$\max_{\lambda \in \mathbb{R}^m} \; -b^T\lambda$
subject to $c + A^T\lambda = 0$,
$\lambda \ge 0$
This is also a linear program, but with $m$ variables
• We have two choices
• Solve the primal program for d variables
• Solve the dual program for m variables


Linear program - Lagrangian
• Lagrangian 33
• D(λ) = minx L(x, λ) = −λTb 8
 
= [-1 -2 -3 -4 -5] 5
 
 1
 8 
 D(λ) = -331 -82 -53 +4 -85

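The dual function of the example can be evaluated at a candidate multiplier and compared with the primal objective. In the sketch below, `A`, `b`, and `c` are an assumed reading of the example (minus signs included), and the candidates `lam` and `x` are hypothetical points chosen because they satisfy the feasibility conditions; they are not given on the slides.

```python
# Assumed data for the example: minimize c^T x subject to A x <= b.
A = [(2.0, 2.0), (2.0, -4.0), (-2.0, 1.0), (0.0, -1.0), (0.0, 1.0)]
b = [33.0, 8.0, 5.0, -1.0, 8.0]
c = (-5.0, -3.0)


def D(lam):
    # Dual function: D(lam) = -lam^T b
    return -sum(l * bi for l, bi in zip(lam, b))


def dual_feasible(lam, tol=1e-9):
    # Dual constraints: c + A^T lam = 0 and lam >= 0
    r = [c[k] + sum(l * row[k] for l, row in zip(lam, A)) for k in range(2)]
    return all(abs(v) < tol for v in r) and all(l >= 0 for l in lam)


lam = (13 / 6, 1 / 3, 0.0, 0.0, 0.0)  # hypothetical dual candidate
x = (37 / 3, 25 / 6)                  # hypothetical primal candidate
primal_value = c[0] * x[0] + c[1] * x[1]

print(dual_feasible(lam), D(lam), primal_value)
```

When a dual-feasible multiplier and a primal-feasible point give the same objective value, as they do here, both are optimal for their respective problems.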

Example
• Consider the linear program
min 2
2x 1 x2
x 1,x 2

1 2 1
x1
s .t . 3 1 4
x2
2 3 3

• Find the dual Lagrangian $D(\lambda)$


Quadratic Programming
• Consider the problem
$\min_x \; \tfrac{1}{2} x^T Q x + c^T x$
subject to $Ax \le b$,
where $A \in \mathbb{R}^{m \times d}$, $b \in \mathbb{R}^m$, and $c \in \mathbb{R}^d$,
$Q \in \mathbb{R}^{d \times d}$ is positive definite (and therefore the objective function is convex)
• This is known as a quadratic program with d variables and m linear
constraints


Quadratic Programming – Ex

The optimal value must lie in the shaded region, and is indicated by the star


Quadratic Programming – Exercise
Consider the quadratic program

Write the program in standard form (matrix notation).


Quadratic Programming - Lagrangian
• The Lagrangian is given by
$L(x, \lambda) = \tfrac{1}{2} x^T Q x + c^T x + \lambda^T(Ax - b) = \tfrac{1}{2} x^T Q x + (c + A^T\lambda)^T x - \lambda^T b$
Taking the derivative of $L(x, \lambda)$ with respect to $x$ and setting it to zero gives
$Qx + (c + A^T\lambda) = 0$
Assuming that $Q$ is invertible, we get
$x = -Q^{-1}(c + A^T\lambda)$


Quadratic Programming – Dual Lagrangian
• The dual Lagrangian
$D(\lambda) = -\tfrac{1}{2}(c + A^T\lambda)^T Q^{-1}(c + A^T\lambda) - \lambda^T b$
Therefore, the dual optimization problem is given by
$\max_\lambda \; -\tfrac{1}{2}(c + A^T\lambda)^T Q^{-1}(c + A^T\lambda) - \lambda^T b$
subject to $\lambda \ge 0$
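A one-dimensional instance makes the dual formula easy to check by hand. The sketch below uses hypothetical scalar data ($Q = 2$, $c = -4$, $A = 1$, $b = 1$, that is, minimize $x^2 - 4x$ subject to $x \le 1$) and maximizes the dual over a grid of multipliers; the grid search is only for illustration.

```python
# Hypothetical scalar quadratic program: minimize 0.5*q*x^2 + c*x  s.t.  a*x <= b
q, c, a, b = 2.0, -4.0, 1.0, 1.0


def D(lam):
    # Scalar version of D(lam) = -0.5 * (c + A^T lam)^T Q^{-1} (c + A^T lam) - lam^T b
    v = c + a * lam
    return -0.5 * v * v / q - lam * b


def x_of(lam):
    # Scalar version of x = -Q^{-1} (c + A^T lam)
    return -(c + a * lam) / q


# Maximize D over a grid of multipliers lam >= 0.
lam_star = max((k / 1000.0 for k in range(0, 10001)), key=D)
x_star = x_of(lam_star)
primal_value = 0.5 * q * x_star**2 + c * x_star

print(lam_star, x_star, D(lam_star), primal_value)
```

The dual maximizer recovers the constrained minimizer on the boundary of the constraint, and the primal and dual objective values coincide, as strong duality predicts for a convex quadratic program.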

We will see an application of quadratic programming in ML in Chapter 12, Support Vector Machines


Example
T
1 x1 2 2
• Consider the linear program min x1 x 2
x 1,x 2 2
2 x2 2 4
2 1 1
x1
s .t . 3 2 2
x2
1 1 3

• Find the dual Lagrangian $D(\lambda)$


THANKS
