
Convex Optimization Boyd & Vandenberghe

1. Introduction
mathematical optimization
least-squares and linear programming
convex optimization
example
course goals and topics
nonlinear optimization
brief history of convex optimization


Mathematical optimization

(mathematical) optimization problem

    minimize   f0(x)
    subject to fi(x) ≤ bi,  i = 1, . . . , m

x = (x1, . . . , xn): optimization variables
f0 : Rⁿ → R: objective function
fi : Rⁿ → R, i = 1, . . . , m: constraint functions

optimal solution x⋆ has smallest value of f0 among all vectors that
satisfy the constraints


Examples
portfolio optimization
variables: amounts invested in different assets

constraints: budget, max./min. investment per asset, minimum return

objective: overall risk or return variance


device sizing in electronic circuits
variables: device widths and lengths

constraints: manufacturing limits, timing requirements, maximum area

objective: power consumption


data fitting
variables: model parameters

constraints: prior information, parameter limits


objective: measure of misfit or prediction error

Solving optimization problems


general optimization problem
very difficult to solve
methods involve some compromise, e.g., very long computation time, or
not always finding the solution

exceptions: certain problem classes can be solved efficiently and reliably


least-squares problems
linear programming problems
convex optimization problems


Least-squares

    minimize ‖Ax − b‖₂²

solving least-squares problems
analytical solution: x⋆ = (AᵀA)⁻¹Aᵀb
reliable and efficient algorithms and software
computation time proportional to n²k (A ∈ R^{k×n}); less if structured
a mature technology

using least-squares
least-squares problems are easy to recognize
a few standard techniques increase flexibility (e.g., including weights,
adding regularization terms)
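The analytical solution above can be checked numerically; a minimal sketch with made-up data (the matrix A and vector b are arbitrary), comparing the normal-equations formula against numpy's least-squares routine:

```python
import numpy as np

# Hypothetical data: overdetermined system with k = 8 equations, n = 3 unknowns.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))
b = rng.standard_normal(8)

# Analytical solution x = (A^T A)^{-1} A^T b, via the normal equations.
# (In practice an SVD-based routine such as lstsq is preferred for stability.)
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Library routine: minimizes ||Ax - b||_2^2 directly.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

# Both agree, and the residual Ax - b is orthogonal to the columns of A.
assert np.allclose(x_normal, x_lstsq)
assert np.allclose(A.T @ (A @ x_lstsq - b), 0)
```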

Linear programming

    minimize   cᵀx
    subject to aᵢᵀx ≤ bi,  i = 1, . . . , m

solving linear programs
no analytical formula for solution
reliable and efficient algorithms and software
computation time proportional to n²m if m ≥ n; less with structure
a mature technology

using linear programming
not as easy to recognize as least-squares problems
a few standard tricks used to convert problems into linear programs
(e.g., problems involving ℓ₁- or ℓ∞-norms, piecewise-linear functions)
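A toy LP in this form can be handed to an off-the-shelf solver; a sketch using scipy's `linprog` on an invented two-variable instance (the data c, A, b are made up for illustration):

```python
import numpy as np
from scipy.optimize import linprog

# Toy LP: minimize c^T x subject to a_i^T x <= b_i and x >= 0.
c = np.array([1.0, 2.0])
A_ub = np.array([[-1.0, -1.0],   # encodes x1 + x2 >= 1
                 [-3.0, -1.0]])  # encodes 3*x1 + x2 >= 2
b_ub = np.array([-1.0, -2.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
assert res.success
# optimal point is the vertex x = (1, 0) with objective value 1
```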

Convex optimization problem

    minimize   f0(x)
    subject to fi(x) ≤ bi,  i = 1, . . . , m

objective and constraint functions are convex:

    fi(αx + βy) ≤ αfi(x) + βfi(y)    if α + β = 1, α ≥ 0, β ≥ 0

includes least-squares problems and linear programs as special cases


solving convex optimization problems
no analytical solution
reliable and efficient algorithms
computation time (roughly) proportional to max{n³, n²m, F}, where F
is cost of evaluating the fi's and their first and second derivatives
almost a technology

using convex optimization
often difficult to recognize
many tricks for transforming problems into convex form
surprisingly many problems can be solved via convex optimization


Example

m lamps illuminating n (small, flat) patches
(lamp j is at distance rkj from patch k, with angle of incidence θkj)

intensity Ik at patch k depends linearly on lamp powers pj:

    Ik = Σ_{j=1}^m akj pj,    akj = rkj⁻² max{cos θkj, 0}

problem: achieve desired illumination Ides with bounded lamp powers

    minimize   max_{k=1,...,n} |log Ik − log Ides|
    subject to 0 ≤ pj ≤ pmax,  j = 1, . . . , m

how to solve?

1. use uniform power: pj = p, vary p
2. use least-squares:
       minimize Σ_{k=1}^n (Ik − Ides)²
   round pj if pj > pmax or pj < 0
3. use weighted least-squares:
       minimize Σ_{k=1}^n (Ik − Ides)² + Σ_{j=1}^m wj (pj − pmax/2)²
   iteratively adjust weights wj until 0 ≤ pj ≤ pmax
4. use linear programming:
       minimize   max_{k=1,...,n} |Ik − Ides|
       subject to 0 ≤ pj ≤ pmax,  j = 1, . . . , m
   which can be solved via linear programming

of course these are approximate (suboptimal) solutions
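Heuristic 3 above can be sketched in a few lines; a toy implementation on synthetic data (the illumination matrix A, Ides, and pmax are all made up, and the weight-update rule of multiplying violated weights by 10 is one arbitrary choice of "iteratively adjust"):

```python
import numpy as np

# Weighted least-squares heuristic on an invented instance.
rng = np.random.default_rng(1)
n, m = 20, 5                       # patches, lamps
A = rng.uniform(0.1, 1.0, (n, m))  # made-up coefficients a_kj
I_des, p_max = 2.0, 1.0

w = np.full(m, 1e-6)               # start with (almost) no regularization
for _ in range(100):
    # normal equations of sum_k (I_k - I_des)^2 + sum_j w_j (p_j - p_max/2)^2
    p = np.linalg.solve(A.T @ A + np.diag(w),
                        A.T @ np.full(n, I_des) + w * (p_max / 2))
    viol = (p < 0) | (p > p_max)
    if not viol.any():
        break
    w[viol] *= 10.0                # penalize lamps that violate their bounds
```

Once the loop exits, p satisfies 0 ≤ pj ≤ pmax; it is an approximate solution, not the optimum.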

5. use convex optimization: problem is equivalent to

       minimize   f0(p) = max_{k=1,...,n} h(Ik/Ides)
       subject to 0 ≤ pj ≤ pmax,  j = 1, . . . , m

   with h(u) = max{u, 1/u}
   (h is convex on u > 0, with minimum value 1 at u = 1)

f0 is convex because maximum of convex functions is convex

exact solution obtained with effort ≈ modest factor × least-squares effort

additional constraints: does adding 1 or 2 below complicate the problem?

1. no more than half of total power is in any 10 lamps
2. no more than half of the lamps are on (pj > 0)

answer: with (1), still easy to solve; with (2), extremely difficult
moral: (untrained) intuition doesn't always work; without the proper
background very easy problems can appear quite similar to very difficult
problems


Course goals and topics


goals
1. recognize/formulate problems (such as the illumination problem) as
convex optimization problems
2. develop code for problems of moderate size (1000 lamps, 5000 patches)
3. characterize optimal solution (optimal power distribution), give limits of
performance, etc.
topics
1. convex sets, functions, optimization problems
2. examples and applications
3. algorithms


Nonlinear optimization
traditional techniques for general nonconvex problems involve compromises
local optimization methods (nonlinear programming)
find a point that minimizes f0 among feasible points near it
fast, can handle large problems
require initial guess
provide no information about distance to (global) optimum
global optimization methods
find the (global) solution
worst-case complexity grows exponentially with problem size
these algorithms are often based on solving convex subproblems

Brief history of convex optimization

theory (convex analysis): ca. 1900–1970

algorithms
1947: simplex algorithm for linear programming (Dantzig)
1960s: early interior-point methods (Fiacco & McCormick, Dikin, . . . )
1970s: ellipsoid method and other subgradient methods
1980s: polynomial-time interior-point methods for linear programming
(Karmarkar 1984)
late 1980s–now: polynomial-time interior-point methods for nonlinear
convex optimization (Nesterov & Nemirovski 1994)

applications
before 1990: mostly in operations research; few in engineering
since 1990: many new applications in engineering (control, signal
processing, communications, circuit design, . . . ); new problem classes
(semidefinite and second-order cone programming, robust optimization)


2. Convex sets
affine and convex sets
some important examples
operations that preserve convexity
generalized inequalities
separating and supporting hyperplanes
dual cones and generalized inequalities


Affine set

line through x1, x2: all points

    x = θx1 + (1 − θ)x2    (θ ∈ R)

(e.g., θ = 1.2, θ = 1, θ = 0.6, θ = 0, θ = −0.2 give points along the line)

affine set: contains the line through any two distinct points in the set

example: solution set of linear equations {x | Ax = b}
(conversely, every affine set can be expressed as solution set of system of
linear equations)

Convex set

line segment between x1 and x2: all points

    x = θx1 + (1 − θ)x2

with 0 ≤ θ ≤ 1

convex set: contains line segment between any two points in the set

    x1, x2 ∈ C,  0 ≤ θ ≤ 1  ⟹  θx1 + (1 − θ)x2 ∈ C

examples (one convex, two nonconvex sets)


Convex combination and convex hull

convex combination of x1, . . . , xk: any point x of the form

    x = θ1x1 + θ2x2 + · · · + θkxk

with θ1 + · · · + θk = 1, θi ≥ 0

convex hull conv S: set of all convex combinations of points in S


Convex cone

conic (nonnegative) combination of x1 and x2: any point of the form

    x = θ1x1 + θ2x2

with θ1 ≥ 0, θ2 ≥ 0

convex cone: set that contains all conic combinations of points in the set


Hyperplanes and halfspaces

hyperplane: set of the form {x | aᵀx = b} (a ≠ 0)

halfspace: set of the form {x | aᵀx ≤ b} (a ≠ 0)

a is the normal vector

hyperplanes are affine and convex; halfspaces are convex



Euclidean balls and ellipsoids

(Euclidean) ball with center xc and radius r:

    B(xc, r) = {x | ‖x − xc‖₂ ≤ r} = {xc + ru | ‖u‖₂ ≤ 1}

ellipsoid: set of the form

    {x | (x − xc)ᵀP⁻¹(x − xc) ≤ 1}

with P ∈ Sⁿ₊₊ (i.e., P symmetric positive definite)

other representation: {xc + Au | ‖u‖₂ ≤ 1} with A square and nonsingular



Norm balls and norm cones

norm: a function ‖ · ‖ that satisfies
‖x‖ ≥ 0; ‖x‖ = 0 if and only if x = 0
‖tx‖ = |t| ‖x‖ for t ∈ R
‖x + y‖ ≤ ‖x‖ + ‖y‖

notation: ‖ · ‖ is general (unspecified) norm; ‖ · ‖symb is particular norm

norm ball with center xc and radius r: {x | ‖x − xc‖ ≤ r}

norm cone: {(x, t) | ‖x‖ ≤ t}
Euclidean norm cone is called second-order cone

norm balls and cones are convex



Polyhedra

solution set of finitely many linear inequalities and equalities

    Ax ⪯ b,    Cx = d

(A ∈ R^{m×n}, C ∈ R^{p×n}, ⪯ is componentwise inequality)

polyhedron is intersection of finite number of halfspaces and hyperplanes



Positive semidefinite cone

notation:
Sⁿ is set of symmetric n × n matrices
Sⁿ₊ = {X ∈ Sⁿ | X ⪰ 0}: positive semidefinite n × n matrices

    X ∈ Sⁿ₊  ⟺  zᵀXz ≥ 0 for all z

Sⁿ₊ is a convex cone

Sⁿ₊₊ = {X ∈ Sⁿ | X ≻ 0}: positive definite n × n matrices

example: [ x  y ; y  z ] ∈ S²₊
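Membership in Sⁿ₊ is easy to test numerically via the eigenvalue characterization; a small sketch (the matrices used are arbitrary examples):

```python
import numpy as np

def in_psd_cone(X, tol=1e-9):
    """X is in S^n_+ iff X is symmetric with all eigenvalues >= 0."""
    if not np.allclose(X, X.T):
        return False
    return np.linalg.eigvalsh(X).min() >= -tol

# The 2x2 example [[x, y], [y, z]]: in S^2_+ iff x >= 0, z >= 0, xz >= y^2.
assert in_psd_cone(np.array([[1.0, 0.5], [0.5, 1.0]]))      # xz = 1 > y^2
assert not in_psd_cone(np.array([[1.0, 2.0], [2.0, 1.0]]))  # xz = 1 < y^2

# z^T X z >= 0 for every z when X is PSD (sampled check)
X = np.array([[2.0, 1.0], [1.0, 2.0]])
Z = np.random.default_rng(0).standard_normal((100, 2))
assert all(z @ X @ z >= 0 for z in Z)
```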

Operations that preserve convexity

practical methods for establishing convexity of a set C

1. apply definition

    x1, x2 ∈ C,  0 ≤ θ ≤ 1  ⟹  θx1 + (1 − θ)x2 ∈ C

2. show that C is obtained from simple convex sets (hyperplanes,
halfspaces, norm balls, . . . ) by operations that preserve convexity

intersection
affine functions
perspective function
linear-fractional functions


Intersection

the intersection of (any number of) convex sets is convex

example:

    S = {x ∈ R^m | |p(t)| ≤ 1 for |t| ≤ π/3}

where p(t) = x1 cos t + x2 cos 2t + · · · + xm cos mt

Affine function

suppose f : Rⁿ → R^m is affine (f(x) = Ax + b with A ∈ R^{m×n}, b ∈ R^m)

the image of a convex set under f is convex

    S ⊆ Rⁿ convex  ⟹  f(S) = {f(x) | x ∈ S} convex

the inverse image f⁻¹(C) of a convex set under f is convex

    C ⊆ R^m convex  ⟹  f⁻¹(C) = {x ∈ Rⁿ | f(x) ∈ C} convex

examples
scaling, translation, projection
solution set of linear matrix inequality {x | x1A1 + · · · + xmAm ⪯ B}
(with Ai, B ∈ Sᵖ)
hyperbolic cone {x | xᵀPx ≤ (cᵀx)², cᵀx ≥ 0} (with P ∈ Sⁿ₊)



Perspective and linear-fractional function

perspective function P : R^{n+1} → Rⁿ:

    P(x, t) = x/t,    dom P = {(x, t) | t > 0}

images and inverse images of convex sets under perspective are convex

linear-fractional function f : Rⁿ → R^m:

    f(x) = (Ax + b)/(cᵀx + d),    dom f = {x | cᵀx + d > 0}

images and inverse images of convex sets under linear-fractional functions
are convex


example of a linear-fractional function

    f(x) = (1/(x1 + x2 + 1)) x

(figure: a convex set C in the plane and its convex image f(C))

Generalized inequalities

a convex cone K ⊆ Rⁿ is a proper cone if
K is closed (contains its boundary)
K is solid (has nonempty interior)
K is pointed (contains no line)

examples
nonnegative orthant K = Rⁿ₊ = {x ∈ Rⁿ | xi ≥ 0, i = 1, . . . , n}
positive semidefinite cone K = Sⁿ₊
nonnegative polynomials on [0, 1]:

    K = {x ∈ Rⁿ | x1 + x2t + x3t² + · · · + xntⁿ⁻¹ ≥ 0 for t ∈ [0, 1]}

generalized inequality defined by a proper cone K:

    x ⪯_K y  ⟺  y − x ∈ K,        x ≺_K y  ⟺  y − x ∈ int K

examples
componentwise inequality (K = Rⁿ₊)

    x ⪯_{Rⁿ₊} y  ⟺  xi ≤ yi,  i = 1, . . . , n

matrix inequality (K = Sⁿ₊)

    X ⪯_{Sⁿ₊} Y  ⟺  Y − X positive semidefinite

these two types are so common that we drop the subscript in ⪯_K

properties: many properties of ⪯_K are similar to ≤ on R, e.g.,

    x ⪯_K y,  u ⪯_K v  ⟹  x + u ⪯_K y + v

Minimum and minimal elements

⪯_K is not in general a linear ordering: we can have x ⋠_K y and y ⋠_K x

x ∈ S is the minimum element of S with respect to ⪯_K if

    y ∈ S  ⟹  x ⪯_K y

x ∈ S is a minimal element of S with respect to ⪯_K if

    y ∈ S,  y ⪯_K x  ⟹  y = x

example (K = R²₊):
x1 is the minimum element of S1
x2 is a minimal element of S2

Separating hyperplane theorem

if C and D are nonempty disjoint convex sets, there exist a ≠ 0, b s.t.

    aᵀx ≤ b for x ∈ C,    aᵀx ≥ b for x ∈ D

the hyperplane {x | aᵀx = b} separates C and D

strict separation requires additional assumptions (e.g., C is closed, D is a
singleton)

Supporting hyperplane theorem

supporting hyperplane to set C at boundary point x0:

    {x | aᵀx = aᵀx0}

where a ≠ 0 and aᵀx ≤ aᵀx0 for all x ∈ C

supporting hyperplane theorem: if C is convex, then there exists a
supporting hyperplane at every boundary point of C

Dual cones and generalized inequalities

dual cone of a cone K:

    K* = {y | yᵀx ≥ 0 for all x ∈ K}

examples
K = Rⁿ₊: K* = Rⁿ₊
K = Sⁿ₊: K* = Sⁿ₊
K = {(x, t) | ‖x‖₂ ≤ t}: K* = {(x, t) | ‖x‖₂ ≤ t}
K = {(x, t) | ‖x‖₁ ≤ t}: K* = {(x, t) | ‖x‖∞ ≤ t}

first three examples are self-dual cones

dual cones of proper cones are proper, hence define generalized inequalities:

    y ⪰_{K*} 0  ⟺  yᵀx ≥ 0 for all x ⪰_K 0

Minimum and minimal elements via dual inequalities

minimum element w.r.t. ⪯_K
x is minimum element of S iff for all λ ≻_{K*} 0, x is the unique
minimizer of λᵀz over S

minimal element w.r.t. ⪯_K
if x minimizes λᵀz over S for some λ ≻_{K*} 0, then x is minimal

if x is a minimal element of a convex set S, then there exists a nonzero
λ ⪰_{K*} 0 such that x minimizes λᵀz over S

optimal production frontier

different production methods use different amounts of resources x ∈ Rⁿ
production set P: resource vectors x for all possible production methods
efficient (Pareto optimal) methods correspond to resource vectors x
that are minimal w.r.t. Rⁿ₊

example (n = 2, resources labor and fuel):
x1, x2, x3 are efficient; x4, x5 are not


3. Convex functions
basic properties and examples
operations that preserve convexity
the conjugate function
quasiconvex functions
log-concave and log-convex functions
convexity with respect to generalized inequalities


Definition

f : Rⁿ → R is convex if dom f is a convex set and

    f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y)

for all x, y ∈ dom f, 0 ≤ θ ≤ 1

f is concave if −f is convex

f is strictly convex if dom f is convex and

    f(θx + (1 − θ)y) < θf(x) + (1 − θ)f(y)

for x, y ∈ dom f, x ≠ y, 0 < θ < 1

Examples on R

convex:
affine: ax + b on R, for any a, b ∈ R
exponential: e^{ax}, for any a ∈ R
powers: x^α on R₊₊, for α ≥ 1 or α ≤ 0
powers of absolute value: |x|^p on R, for p ≥ 1
negative entropy: x log x on R₊₊

concave:
affine: ax + b on R, for any a, b ∈ R
powers: x^α on R₊₊, for 0 ≤ α ≤ 1
logarithm: log x on R₊₊

Examples on Rⁿ and R^{m×n}

affine functions are convex and concave; all norms are convex

examples on Rⁿ
affine function f(x) = aᵀx + b
norms: ‖x‖_p = (Σ_{i=1}^n |xi|^p)^{1/p} for p ≥ 1; ‖x‖∞ = max_k |xk|

examples on R^{m×n} (m × n matrices)
affine function

    f(X) = tr(AᵀX) + b = Σ_{i=1}^m Σ_{j=1}^n Aij Xij + b

spectral (maximum singular value) norm

    f(X) = ‖X‖₂ = σmax(X) = (λmax(XᵀX))^{1/2}

Restriction of a convex function to a line

f : Rⁿ → R is convex if and only if the function g : R → R,

    g(t) = f(x + tv),    dom g = {t | x + tv ∈ dom f}

is convex (in t) for any x ∈ dom f, v ∈ Rⁿ

can check convexity of f by checking convexity of functions of one variable

example. f : Sⁿ → R with f(X) = log det X, dom f = Sⁿ₊₊

    g(t) = log det(X + tV) = log det X + log det(I + tX^{−1/2}VX^{−1/2})
         = log det X + Σ_{i=1}^n log(1 + tλi)

where λi are the eigenvalues of X^{−1/2}VX^{−1/2}

g is concave in t (for any choice of X ≻ 0, V); hence f is concave
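The concavity of log det along a line can be spot-checked numerically; a sketch with an arbitrary base point X = I and a random symmetric direction V (any X ≻ 0 would do):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
X = np.eye(n)                    # a convenient point in S^n_++
V = rng.standard_normal((n, n))
V = (V + V.T) / 2                # arbitrary symmetric direction

def g(t):
    # log det(X + tV), computed stably via slogdet
    sign, logdet = np.linalg.slogdet(X + t * V)
    assert sign > 0              # stay inside dom f = S^n_++
    return logdet

# midpoint-concavity check g((s+t)/2) >= (g(s)+g(t))/2 on a safe t-range
ts = np.linspace(-0.1, 0.1, 9)
for s in ts:
    for t in ts:
        assert g((s + t) / 2) >= (g(s) + g(t)) / 2 - 1e-12
```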

Extended-value extension

extended-value extension f̃ of f is

    f̃(x) = f(x),  x ∈ dom f,        f̃(x) = ∞,  x ∉ dom f

often simplifies notation; for example, the condition

    0 ≤ θ ≤ 1  ⟹  f̃(θx + (1 − θ)y) ≤ θf̃(x) + (1 − θ)f̃(y)

(as an inequality in R ∪ {∞}), means the same as the two conditions

dom f is convex
for x, y ∈ dom f,

    0 ≤ θ ≤ 1  ⟹  f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y)

First-order condition

f is differentiable if dom f is open and the gradient

    ∇f(x) = (∂f(x)/∂x1, ∂f(x)/∂x2, . . . , ∂f(x)/∂xn)

exists at each x ∈ dom f

1st-order condition: differentiable f with convex domain is convex iff

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x)    for all x, y ∈ dom f

first-order approximation of f is global underestimator
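The global-underestimator property can be verified numerically for a convex quadratic (P, q are made-up data; P = AᵀA guarantees P ⪰ 0):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))
P = A.T @ A                      # positive semidefinite by construction
q = rng.standard_normal(3)

f = lambda x: 0.5 * x @ P @ x + q @ x
grad = lambda x: P @ x + q

# f(y) >= f(x) + grad f(x)^T (y - x): tangent plane never overshoots
for _ in range(100):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    assert f(y) >= f(x) + grad(x) @ (y - x) - 1e-9
```

The gap f(y) − f(x) − ∇f(x)ᵀ(y − x) equals ½(y − x)ᵀP(y − x) here, which is nonnegative exactly because P ⪰ 0.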

Second-order conditions

f is twice differentiable if dom f is open and the Hessian ∇²f(x) ∈ Sⁿ,

    ∇²f(x)ij = ∂²f(x)/∂xi∂xj,    i, j = 1, . . . , n,

exists at each x ∈ dom f

2nd-order conditions: for twice differentiable f with convex domain

f is convex if and only if

    ∇²f(x) ⪰ 0    for all x ∈ dom f

if ∇²f(x) ≻ 0 for all x ∈ dom f, then f is strictly convex

Examples

quadratic function: f(x) = (1/2)xᵀPx + qᵀx + r (with P ∈ Sⁿ)

    ∇f(x) = Px + q,    ∇²f(x) = P

convex if P ⪰ 0

least-squares objective: f(x) = ‖Ax − b‖₂²

    ∇f(x) = 2Aᵀ(Ax − b),    ∇²f(x) = 2AᵀA

convex (for any A)

quadratic-over-linear: f(x, y) = x²/y

    ∇²f(x, y) = (2/y³) [ y ; −x ] [ y ; −x ]ᵀ ⪰ 0

convex for y > 0

log-sum-exp: f(x) = log Σ_{k=1}^n exp xk is convex

    ∇²f(x) = (1/(1ᵀz)) diag(z) − (1/(1ᵀz)²) zzᵀ    (zk = exp xk)

to show ∇²f(x) ⪰ 0, we must verify that vᵀ∇²f(x)v ≥ 0 for all v:

    vᵀ∇²f(x)v = ((Σk zk vk²)(Σk zk) − (Σk vk zk)²) / (Σk zk)² ≥ 0

since (Σk vk zk)² ≤ (Σk zk vk²)(Σk zk) (from Cauchy-Schwarz inequality)

geometric mean: f(x) = (∏_{k=1}^n xk)^{1/n} on Rⁿ₊₊ is concave
(similar proof as for log-sum-exp)
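Log-sum-exp is also a staple in numerical code; a standard stable implementation (shift by the maximum before exponentiating), with a sampled midpoint-convexity check:

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log(sum(exp(x_k))): shift by max(x) first."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1000.0])   # naive np.log(np.sum(np.exp(x))) overflows
assert np.isclose(logsumexp(x), 1000.0 + np.log(2.0))

# midpoint convexity: f((x+y)/2) <= (f(x)+f(y))/2
rng = np.random.default_rng(4)
for _ in range(100):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    assert logsumexp((x + y) / 2) <= (logsumexp(x) + logsumexp(y)) / 2 + 1e-12
```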

Epigraph and sublevel set

α-sublevel set of f : Rⁿ → R:

    Cα = {x ∈ dom f | f(x) ≤ α}

sublevel sets of convex functions are convex (converse is false)

epigraph of f : Rⁿ → R:

    epi f = {(x, t) ∈ R^{n+1} | x ∈ dom f, f(x) ≤ t}

f is convex if and only if epi f is a convex set

Jensen's inequality

basic inequality: if f is convex, then for 0 ≤ θ ≤ 1,

    f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y)

extension: if f is convex, then

    f(E z) ≤ E f(z)

for any random variable z

basic inequality is special case with discrete distribution

    prob(z = x) = θ,    prob(z = y) = 1 − θ

Operations that preserve convexity

practical methods for establishing convexity of a function

1. verify definition (often simplified by restricting to a line)
2. for twice differentiable functions, show ∇²f(x) ⪰ 0
3. show that f is obtained from simple convex functions by operations
that preserve convexity

nonnegative weighted sum
composition with affine function
pointwise maximum and supremum
composition
minimization
perspective

Positive weighted sum & composition with affine function

nonnegative multiple: αf is convex if f is convex, α ≥ 0

sum: f1 + f2 convex if f1, f2 convex (extends to infinite sums, integrals)

composition with affine function: f(Ax + b) is convex if f is convex

examples
log barrier for linear inequalities

    f(x) = −Σ_{i=1}^m log(bi − aᵢᵀx),    dom f = {x | aᵢᵀx < bi, i = 1, . . . , m}

(any) norm of affine function: f(x) = ‖Ax + b‖

Pointwise maximum

if f1, . . . , fm are convex, then f(x) = max{f1(x), . . . , fm(x)} is convex

examples
piecewise-linear function: f(x) = max_{i=1,...,m}(aᵢᵀx + bi) is convex

sum of r largest components of x ∈ Rⁿ:

    f(x) = x[1] + x[2] + · · · + x[r]

is convex (x[i] is ith largest component of x)

proof:

    f(x) = max{xi1 + xi2 + · · · + xir | 1 ≤ i1 < i2 < · · · < ir ≤ n}

Pointwise supremum

if f(x, y) is convex in x for each y ∈ A, then

    g(x) = sup_{y∈A} f(x, y)

is convex

examples
support function of a set C: SC(x) = sup_{y∈C} yᵀx is convex

distance to farthest point in a set C:

    f(x) = sup_{y∈C} ‖x − y‖

maximum eigenvalue of symmetric matrix: for X ∈ Sⁿ,

    λmax(X) = sup_{‖y‖₂=1} yᵀXy

Composition with scalar functions

composition of g : Rⁿ → R and h : R → R:

    f(x) = h(g(x))

f is convex if
g convex, h convex, h̃ nondecreasing
g concave, h convex, h̃ nonincreasing

proof (for n = 1, differentiable g, h)

    f′′(x) = h′′(g(x))g′(x)² + h′(g(x))g′′(x)

note: monotonicity must hold for extended-value extension h̃

examples
exp g(x) is convex if g is convex
1/g(x) is convex if g is concave and positive

Vector composition

composition of g : Rⁿ → Rᵏ and h : Rᵏ → R:

    f(x) = h(g(x)) = h(g1(x), g2(x), . . . , gk(x))

f is convex if
gi convex, h convex, h̃ nondecreasing in each argument
gi concave, h convex, h̃ nonincreasing in each argument

proof (for n = 1, differentiable g, h)

    f′′(x) = g′(x)ᵀ∇²h(g(x))g′(x) + ∇h(g(x))ᵀg′′(x)

examples
Σ_{i=1}^m log gi(x) is concave if gi are concave and positive
log Σ_{i=1}^m exp gi(x) is convex if gi are convex

Minimization

if f(x, y) is convex in (x, y) and C is a convex set, then

    g(x) = inf_{y∈C} f(x, y)

is convex

examples
f(x, y) = xᵀAx + 2xᵀBy + yᵀCy with

    [ A  B ; Bᵀ  C ] ⪰ 0,    C ≻ 0

minimizing over y gives g(x) = inf_y f(x, y) = xᵀ(A − BC⁻¹Bᵀ)x

g is convex, hence Schur complement A − BC⁻¹Bᵀ ⪰ 0

distance to a set: dist(x, S) = inf_{y∈S} ‖x − y‖ is convex if S is convex

Perspective

the perspective of a function f : Rⁿ → R is the function g : Rⁿ × R → R,

    g(x, t) = tf(x/t),    dom g = {(x, t) | x/t ∈ dom f, t > 0}

g is convex if f is convex

examples
f(x) = xᵀx is convex; hence g(x, t) = xᵀx/t is convex for t > 0

negative logarithm f(x) = − log x is convex; hence relative entropy
g(x, t) = t log t − t log x is convex on R²₊₊

if f is convex, then

    g(x) = (cᵀx + d)f((Ax + b)/(cᵀx + d))

is convex on {x | cᵀx + d > 0, (Ax + b)/(cᵀx + d) ∈ dom f}

The conjugate function

the conjugate of a function f is

    f*(y) = sup_{x∈dom f} (yᵀx − f(x))

f* is convex (even if f is not)

will be useful in chapter 5

examples

negative logarithm f(x) = − log x

    f*(y) = sup_{x>0} (xy + log x)
          = −1 − log(−y)  if y < 0,    ∞ otherwise

strictly convex quadratic f(x) = (1/2)xᵀQx with Q ∈ Sⁿ₊₊

    f*(y) = sup_x (yᵀx − (1/2)xᵀQx) = (1/2)yᵀQ⁻¹y
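The quadratic case is easy to sanity-check numerically: the sup is attained at x = Q⁻¹y, and no sampled x should do better. A sketch with made-up Q and y (Q = AAᵀ + I guarantees Q ≻ 0):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((3, 3))
Q = A @ A.T + np.eye(3)          # positive definite by construction
y = rng.standard_normal(3)

f = lambda x: 0.5 * x @ Q @ x
conj_closed = 0.5 * y @ np.linalg.solve(Q, y)   # (1/2) y^T Q^{-1} y

# the maximizer x* = Q^{-1} y achieves the closed-form value ...
x_star = np.linalg.solve(Q, y)
assert np.isclose(y @ x_star - f(x_star), conj_closed)

# ... and no sampled x does better (f* is a supremum)
for _ in range(100):
    x = rng.standard_normal(3)
    assert y @ x - f(x) <= conj_closed + 1e-12
```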

Quasiconvex functions

f : Rⁿ → R is quasiconvex if dom f is convex and the sublevel sets

    Sα = {x ∈ dom f | f(x) ≤ α}

are convex for all α

f is quasiconcave if −f is quasiconvex
f is quasilinear if it is quasiconvex and quasiconcave

Examples

√|x| is quasiconvex on R
ceil(x) = inf{z ∈ Z | z ≥ x} is quasilinear
log x is quasilinear on R₊₊
f(x1, x2) = x1x2 is quasiconcave on R²₊₊

linear-fractional function

    f(x) = (aᵀx + b)/(cᵀx + d),    dom f = {x | cᵀx + d > 0}

is quasilinear

distance ratio

    f(x) = ‖x − a‖₂/‖x − b‖₂,    dom f = {x | ‖x − a‖₂ ≤ ‖x − b‖₂}

is quasiconvex

internal rate of return

cash flow x = (x0, . . . , xn); xi is payment in period i (to us if xi > 0)

we assume x0 < 0 and x0 + x1 + · · · + xn > 0

present value of cash flow x, for interest rate r:

    PV(x, r) = Σ_{i=0}^n (1 + r)⁻ⁱ xi

internal rate of return is smallest interest rate for which PV(x, r) = 0:

    IRR(x) = inf{r ≥ 0 | PV(x, r) = 0}

IRR is quasiconcave: superlevel set is intersection of open halfspaces

    IRR(x) ≥ R  ⟺  Σ_{i=0}^n (1 + r)⁻ⁱ xi > 0 for 0 ≤ r < R

Properties

modified Jensen inequality: for quasiconvex f

    0 ≤ θ ≤ 1  ⟹  f(θx + (1 − θ)y) ≤ max{f(x), f(y)}

first-order condition: differentiable f with cvx domain is quasiconvex iff

    f(y) ≤ f(x)  ⟹  ∇f(x)ᵀ(y − x) ≤ 0

sums of quasiconvex functions are not necessarily quasiconvex

Log-concave and log-convex functions

a positive function f is log-concave if log f is concave:

    f(θx + (1 − θ)y) ≥ f(x)^θ f(y)^{1−θ}    for 0 ≤ θ ≤ 1

f is log-convex if log f is convex

powers: x^a on R₊₊ is log-convex for a ≤ 0, log-concave for a ≥ 0

many common probability densities are log-concave, e.g., normal:

    f(x) = (1/√((2π)ⁿ det Σ)) e^{−(1/2)(x−x̄)ᵀΣ⁻¹(x−x̄)}

cumulative Gaussian distribution function Φ is log-concave

    Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−u²/2} du

Properties of log-concave functions

twice differentiable f with convex domain is log-concave if and only if

    f(x)∇²f(x) ⪯ ∇f(x)∇f(x)ᵀ

for all x ∈ dom f

product of log-concave functions is log-concave

sum of log-concave functions is not always log-concave

integration: if f : Rⁿ × R^m → R is log-concave, then

    g(x) = ∫ f(x, y) dy

is log-concave (not easy to show)

consequences of integration property

convolution f ∗ g of log-concave functions f, g is log-concave

    (f ∗ g)(x) = ∫ f(x − y)g(y) dy

if C ⊆ Rⁿ convex and y is a random variable with log-concave pdf then

    f(x) = prob(x + y ∈ C)

is log-concave

proof: write f(x) as integral of product of log-concave functions

    f(x) = ∫ g(x + y)p(y) dy,    g(u) = 1 if u ∈ C, 0 if u ∉ C,

p is pdf of y

example: yield function

    Y(x) = prob(x + w ∈ S)

x ∈ Rⁿ: nominal parameter values for product
w ∈ Rⁿ: random variations of parameters in manufactured product
S: set of acceptable values

if S is convex and w has a log-concave pdf, then

Y is log-concave
yield regions {x | Y(x) ≥ α} are convex

Convexity with respect to generalized inequalities

f : Rⁿ → R^m is K-convex if dom f is convex and

    f(θx + (1 − θ)y) ⪯_K θf(x) + (1 − θ)f(y)

for x, y ∈ dom f, 0 ≤ θ ≤ 1

example f : S^m → S^m, f(X) = X² is S^m₊-convex

proof: for fixed z ∈ R^m, zᵀX²z = ‖Xz‖₂² is convex in X, i.e.,

    zᵀ(θX + (1 − θ)Y)²z ≤ θzᵀX²z + (1 − θ)zᵀY²z

for X, Y ∈ S^m, 0 ≤ θ ≤ 1

therefore (θX + (1 − θ)Y)² ⪯ θX² + (1 − θ)Y²


4. Convex optimization problems


optimization problem in standard form
convex optimization problems
quasiconvex optimization
linear optimization
quadratic optimization
geometric programming
generalized inequality constraints
semidefinite programming
vector optimization


Optimization problem in standard form

    minimize   f0(x)
    subject to fi(x) ≤ 0,  i = 1, . . . , m
               hi(x) = 0,  i = 1, . . . , p

x ∈ Rⁿ is the optimization variable
f0 : Rⁿ → R is the objective or cost function
fi : Rⁿ → R, i = 1, . . . , m, are the inequality constraint functions
hi : Rⁿ → R are the equality constraint functions

optimal value:

    p⋆ = inf{f0(x) | fi(x) ≤ 0, i = 1, . . . , m, hi(x) = 0, i = 1, . . . , p}

p⋆ = ∞ if problem is infeasible (no x satisfies the constraints)
p⋆ = −∞ if problem is unbounded below

Optimal and locally optimal points

x is feasible if x ∈ dom f0 and it satisfies the constraints

a feasible x is optimal if f0(x) = p⋆; Xopt is the set of optimal points

x is locally optimal if there is an R > 0 such that x is optimal for

    minimize (over z) f0(z)
    subject to        fi(z) ≤ 0, i = 1, . . . , m,   hi(z) = 0, i = 1, . . . , p
                      ‖z − x‖₂ ≤ R

examples (with n = 1, m = p = 0)
f0(x) = 1/x, dom f0 = R₊₊: p⋆ = 0, no optimal point
f0(x) = − log x, dom f0 = R₊₊: p⋆ = −∞
f0(x) = x log x, dom f0 = R₊₊: p⋆ = −1/e, x = 1/e is optimal
f0(x) = x³ − 3x, p⋆ = −∞, local optimum at x = 1

Implicit constraints

the standard form optimization problem has an implicit constraint

    x ∈ D = ∩_{i=0}^m dom fi  ∩  ∩_{i=1}^p dom hi

we call D the domain of the problem

the constraints fi(x) ≤ 0, hi(x) = 0 are the explicit constraints

a problem is unconstrained if it has no explicit constraints (m = p = 0)

example:

    minimize f0(x) = −Σ_{i=1}^k log(bi − aᵢᵀx)

is an unconstrained problem with implicit constraints aᵢᵀx < bi

Feasibility problem

    find       x
    subject to fi(x) ≤ 0,  i = 1, . . . , m
               hi(x) = 0,  i = 1, . . . , p

can be considered a special case of the general problem with f0(x) = 0:

    minimize   0
    subject to fi(x) ≤ 0,  i = 1, . . . , m
               hi(x) = 0,  i = 1, . . . , p

p⋆ = 0 if constraints are feasible; any feasible x is optimal
p⋆ = ∞ if constraints are infeasible

Convex optimization problem

standard form convex optimization problem

    minimize   f0(x)
    subject to fi(x) ≤ 0,  i = 1, . . . , m
               aᵢᵀx = bi,  i = 1, . . . , p

f0, f1, . . . , fm are convex; equality constraints are affine

problem is quasiconvex if f0 is quasiconvex (and f1, . . . , fm convex)

often written as

    minimize   f0(x)
    subject to fi(x) ≤ 0,  i = 1, . . . , m
               Ax = b

important property: feasible set of a convex optimization problem is convex

example

    minimize   f0(x) = x1² + x2²
    subject to f1(x) = x1/(1 + x2²) ≤ 0
               h1(x) = (x1 + x2)² = 0

f0 is convex; feasible set {(x1, x2) | x1 = −x2 ≤ 0} is convex

not a convex problem (according to our definition): f1 is not convex, h1
is not affine

equivalent (but not identical) to the convex problem

    minimize   x1² + x2²
    subject to x1 ≤ 0
               x1 + x2 = 0

Local and global optima

any locally optimal point of a convex problem is (globally) optimal

proof: suppose x is locally optimal, but there exists a feasible y with
f0(y) < f0(x)

x locally optimal means there is an R > 0 such that

    z feasible,  ‖z − x‖₂ ≤ R  ⟹  f0(z) ≥ f0(x)

consider z = θy + (1 − θ)x with θ = R/(2‖y − x‖₂)

‖y − x‖₂ > R, so 0 < θ < 1/2
z is a convex combination of two feasible points, hence also feasible
‖z − x‖₂ = R/2 and

    f0(z) ≤ θf0(y) + (1 − θ)f0(x) < f0(x)

which contradicts our assumption that x is locally optimal

Optimality criterion for differentiable f0

x is optimal if and only if it is feasible and

    ∇f0(x)ᵀ(y − x) ≥ 0    for all feasible y

if nonzero, ∇f0(x) defines a supporting hyperplane to feasible set X at x

unconstrained problem: x is optimal if and only if

    x ∈ dom f0,    ∇f0(x) = 0

equality constrained problem

    minimize f0(x) subject to Ax = b

x is optimal if and only if there exists a ν such that

    x ∈ dom f0,    Ax = b,    ∇f0(x) + Aᵀν = 0

minimization over nonnegative orthant

    minimize f0(x) subject to x ⪰ 0

x is optimal if and only if

    x ∈ dom f0,   x ⪰ 0,   ∇f0(x)i ≥ 0 if xi = 0,   ∇f0(x)i = 0 if xi > 0

Equivalent convex problems

two problems are (informally) equivalent if the solution of one is readily
obtained from the solution of the other, and vice-versa

some common transformations that preserve convexity:

eliminating equality constraints

    minimize   f0(x)
    subject to fi(x) ≤ 0,  i = 1, . . . , m
               Ax = b

is equivalent to

    minimize (over z) f0(Fz + x0)
    subject to        fi(Fz + x0) ≤ 0,  i = 1, . . . , m

where F and x0 are such that

    Ax = b  ⟺  x = Fz + x0 for some z

introducing equality constraints

    minimize   f0(A0x + b0)
    subject to fi(Aix + bi) ≤ 0,  i = 1, . . . , m

is equivalent to

    minimize (over x, yi) f0(y0)
    subject to            fi(yi) ≤ 0,  i = 1, . . . , m
                          yi = Aix + bi,  i = 0, 1, . . . , m

introducing slack variables for linear inequalities

    minimize   f0(x)
    subject to aᵢᵀx ≤ bi,  i = 1, . . . , m

is equivalent to

    minimize (over x, s) f0(x)
    subject to           aᵢᵀx + si = bi,  i = 1, . . . , m
                         si ≥ 0,  i = 1, . . . , m

epigraph form: standard form convex problem is equivalent to

    minimize (over x, t) t
    subject to           f0(x) − t ≤ 0
                         fi(x) ≤ 0,  i = 1, . . . , m
                         Ax = b

minimizing over some variables

    minimize   f0(x1, x2)
    subject to fi(x1) ≤ 0,  i = 1, . . . , m

is equivalent to

    minimize   f̃0(x1)
    subject to fi(x1) ≤ 0,  i = 1, . . . , m

where f̃0(x1) = inf_{x2} f0(x1, x2)

Quasiconvex optimization

    minimize   f0(x)
    subject to fi(x) ≤ 0,  i = 1, . . . , m
               Ax = b

with f0 : Rⁿ → R quasiconvex, f1, . . . , fm convex

can have locally optimal points that are not (globally) optimal

convex representation of sublevel sets of f0

if f0 is quasiconvex, there exists a family of functions φt such that:

φt(x) is convex in x for fixed t
t-sublevel set of f0 is 0-sublevel set of φt, i.e.,

    f0(x) ≤ t  ⟺  φt(x) ≤ 0

example

    f0(x) = p(x)/q(x)

with p convex, q concave, and p(x) ≥ 0, q(x) > 0 on dom f0

can take φt(x) = p(x) − tq(x):
for t ≥ 0, φt convex in x
p(x)/q(x) ≤ t if and only if φt(x) ≤ 0

quasiconvex optimization via convex feasibility problems

φt(x) ≤ 0,   fi(x) ≤ 0,  i = 1, . . . , m,   Ax = b        (1)

for fixed t, a convex feasibility problem in x
if feasible, we can conclude that t ≥ p*; if infeasible, t ≤ p*

Bisection method for quasiconvex optimization

given l ≤ p*, u ≥ p*, tolerance ε > 0.
repeat
    1. t := (l + u)/2.
    2. Solve the convex feasibility problem (1).
    3. if (1) is feasible, u := t; else l := t.
until u − l ≤ ε.

requires exactly ⌈log2((u − l)/ε)⌉ iterations (where u, l are initial values)

4–16
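The bisection loop above can be sketched in a few lines. This is a toy illustration, not the solver used in practice: the feasibility oracle here is a hypothetical grid search for a 1-D linear-fractional-style example f0(x) = (x² + 1)/x, whereas a real implementation would call a convex feasibility solver for (1).

```python
def bisection_quasiconvex(feasible, l, u, eps=1e-6):
    """Bisection for quasiconvex minimization: feasible(t) must return True
    iff the convex feasibility problem (1) at level t has a solution."""
    while u - l > eps:
        t = (l + u) / 2
        if feasible(t):
            u = t          # t >= p*, shrink from above
        else:
            l = t          # t <= p*, shrink from below
    return u

# toy example: minimize f0(x) = (x**2 + 1)/x over 0.5 <= x <= 4
# phi_t(x) = x**2 + 1 - t*x; feasibility checked on a fine grid (assumption)
xs = [0.5 + k * 3.5 / 10000 for k in range(10001)]
feasible = lambda t: any(x * x + 1 - t * x <= 0 for x in xs)
p_star = bisection_quasiconvex(feasible, 0.0, 10.0)
# the true optimum is f0(1) = 2
```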

Linear program (LP)

minimize   c^T x + d
subject to Gx ⪯ h
           Ax = b

convex problem with affine objective and constraint functions
feasible set is a polyhedron

[figure: feasible polyhedron P with cost vector c]

4–17

Examples

diet problem: choose quantities x1, . . . , xn of n foods

one unit of food j costs cj , contains amount aij of nutrient i
healthy diet requires nutrient i in quantity at least bi

to find cheapest healthy diet,

minimize   c^T x
subject to Ax ⪰ b,   x ⪰ 0

piecewise-linear minimization

minimize max_{i=1,...,m} (a_i^T x + b_i)

equivalent to an LP

minimize   t
subject to a_i^T x + b_i ≤ t,   i = 1, . . . , m

4–18

Chebyshev center of a polyhedron

Chebyshev center of

P = {x | a_i^T x ≤ b_i, i = 1, . . . , m}

is center xcheb of largest inscribed ball

B = {xc + u | ‖u‖2 ≤ r}

a_i^T x ≤ b_i for all x ∈ B if and only if

sup{a_i^T (xc + u) | ‖u‖2 ≤ r} = a_i^T xc + r‖a_i‖2 ≤ b_i

hence, xc, r can be determined by solving the LP

maximize   r
subject to a_i^T xc + r‖a_i‖2 ≤ b_i,   i = 1, . . . , m

4–19
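A minimal sketch of this LP, assuming SciPy's `linprog` is available as the LP solver (any LP solver would do); the decision variables are (xc, r) and we minimize −r.

```python
import numpy as np
from scipy.optimize import linprog

def chebyshev_center(A, b):
    """Chebyshev center of P = {x | Ax <= b}: solve the LP
    maximize r  s.t.  a_i^T xc + r*||a_i||_2 <= b_i,  r >= 0."""
    m, n = A.shape
    norms = np.linalg.norm(A, axis=1)
    c = np.zeros(n + 1)
    c[-1] = -1.0                              # minimize -r
    A_ub = np.hstack([A, norms[:, None]])     # rows: [a_i^T, ||a_i||_2]
    res = linprog(c, A_ub=A_ub, b_ub=b,
                  bounds=[(None, None)] * n + [(0, None)])
    return res.x[:n], res.x[n]

# unit box {|x1| <= 1, |x2| <= 1}: center (0, 0), radius 1
A = np.array([[1.0, 0], [-1, 0], [0, 1], [0, -1]])
b = np.array([1.0, 1, 1, 1])
xc, r = chebyshev_center(A, b)
```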

Linear-fractional program

minimize   f0(x)
subject to Gx ⪯ h
           Ax = b

linear-fractional program

f0(x) = (c^T x + d)/(e^T x + f),   dom f0(x) = {x | e^T x + f > 0}

a quasiconvex optimization problem; can be solved by bisection
also equivalent to the LP (variables y, z)

minimize   c^T y + dz
subject to Gy ⪯ hz
           Ay = bz
           e^T y + f z = 1
           z ≥ 0

4–20

generalized linear-fractional program

f0(x) = max_{i=1,...,r} (c_i^T x + d_i)/(e_i^T x + f_i),   dom f0(x) = {x | e_i^T x + f_i > 0, i = 1, . . . , r}

a quasiconvex optimization problem; can be solved by bisection

example: Von Neumann model of a growing economy

maximize (over x, x+) min_{i=1,...,n} x_i^+/x_i
subject to            x+ ⪰ 0,  Bx+ ⪯ Ax

x, x+ ∈ R^n: activity levels of n sectors, in current and next period
(Ax)_i, (Bx+)_i: produced, resp. consumed, amounts of good i
x_i^+/x_i: growth rate of sector i

allocate activity to maximize growth rate of slowest growing sector

4–21

Quadratic program (QP)

minimize   (1/2)x^T P x + q^T x + r
subject to Gx ⪯ h
           Ax = b

P ∈ S^n_+, so objective is convex quadratic
minimize a convex quadratic function over a polyhedron

[figure: contour lines of f0 over polyhedron P, with the minimizer x*]

4–22

Examples

least-squares

minimize ‖Ax − b‖2²

analytical solution x* = A†b (A† is pseudo-inverse)
can add linear constraints, e.g., l ⪯ x ⪯ u

linear program with random cost

minimize   c̄^T x + γ x^T Σ x = E c^T x + γ var(c^T x)
subject to Gx ⪯ h,  Ax = b

c is random vector with mean c̄ and covariance Σ
hence, c^T x is random variable with mean c̄^T x and variance x^T Σx
γ > 0 is risk aversion parameter; controls the trade-off between expected cost and variance (risk)

4–23
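The identities E c^T x = c̄^T x and var(c^T x) = x^T Σx can be checked numerically. This is a small Monte Carlo sanity check with made-up data, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
cbar = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])
x = np.array([0.5, 1.0, -1.0])

# sample costs c ~ N(cbar, Sigma) and compare the empirical mean and
# variance of c^T x with the closed forms cbar^T x and x^T Sigma x
c = rng.multivariate_normal(cbar, Sigma, size=200_000)
vals = c @ x
mean_formula = cbar @ x          # -2.0
var_formula = x @ Sigma @ x      # 1.9
```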

Quadratically constrained quadratic program (QCQP)

minimize   (1/2)x^T P0 x + q0^T x + r0
subject to (1/2)x^T Pi x + qi^T x + ri ≤ 0,   i = 1, . . . , m
           Ax = b

Pi ∈ S^n_+; objective and constraints are convex quadratic
if P1, . . . , Pm ∈ S^n_++, feasible region is intersection of m ellipsoids and an affine set

4–24

Second-order cone programming

minimize   f^T x
subject to ‖Ai x + bi‖2 ≤ c_i^T x + d_i,   i = 1, . . . , m
           F x = g

(Ai ∈ R^{ni×n}, F ∈ R^{p×n})

inequalities are called second-order cone (SOC) constraints:

(Ai x + bi, c_i^T x + d_i) ∈ second-order cone in R^{ni+1}

for ni = 0, reduces to an LP; if ci = 0, reduces to a QCQP
more general than QCQP and LP

4–25

Robust linear programming

the parameters in optimization problems are often uncertain, e.g., in an LP

minimize   c^T x
subject to a_i^T x ≤ b_i,   i = 1, . . . , m,

there can be uncertainty in c, ai, bi

two common approaches to handling uncertainty (in ai, for simplicity)

deterministic model: constraints must hold for all ai ∈ Ei

minimize   c^T x
subject to a_i^T x ≤ b_i for all ai ∈ Ei,   i = 1, . . . , m,

stochastic model: ai is random variable; constraints must hold with probability η

minimize   c^T x
subject to prob(a_i^T x ≤ b_i) ≥ η,   i = 1, . . . , m

4–26

deterministic approach via SOCP

choose an ellipsoid as Ei:

Ei = {āi + Pi u | ‖u‖2 ≤ 1}   (āi ∈ R^n, Pi ∈ R^{n×n})

center is āi, semi-axes determined by singular values/vectors of Pi

robust LP

minimize   c^T x
subject to a_i^T x ≤ b_i  for all ai ∈ Ei,   i = 1, . . . , m

is equivalent to the SOCP

minimize   c^T x
subject to ā_i^T x + ‖P_i^T x‖2 ≤ b_i,   i = 1, . . . , m

(follows from sup_{‖u‖2≤1} (āi + Pi u)^T x = ā_i^T x + ‖P_i^T x‖2)

4–27

stochastic approach via SOCP

assume ai is Gaussian with mean āi, covariance Σi (ai ∼ N(āi, Σi))
a_i^T x is Gaussian r.v. with mean ā_i^T x, variance x^T Σi x; hence

prob(a_i^T x ≤ b_i) = Φ( (b_i − ā_i^T x) / ‖Σi^{1/2} x‖2 )

where Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−t²/2} dt is CDF of N(0, 1)

robust LP

minimize   c^T x
subject to prob(a_i^T x ≤ b_i) ≥ η,   i = 1, . . . , m,

with η ≥ 1/2, is equivalent to the SOCP

minimize   c^T x
subject to ā_i^T x + Φ^{−1}(η)‖Σi^{1/2} x‖2 ≤ b_i,   i = 1, . . . , m

4–28
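The probability formula can be verified by sampling. A small check with made-up data (not from the slides), using the standard normal CDF via `math.erf`:

```python
import math
import numpy as np

def phi(z):
    """standard normal CDF Phi(z)"""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

rng = np.random.default_rng(1)
abar = np.array([1.0, 2.0])
Sigma = np.array([[0.5, 0.1], [0.1, 0.3]])
x = np.array([1.0, -1.0])
b = 0.0

# closed form: prob(a^T x <= b) = Phi((b - abar^T x) / ||Sigma^{1/2} x||_2)
sigma = math.sqrt(x @ Sigma @ x)
p_formula = phi((b - abar @ x) / sigma)

# Monte Carlo estimate with a ~ N(abar, Sigma)
a = rng.multivariate_normal(abar, Sigma, size=200_000)
p_mc = np.mean(a @ x <= b)
```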

Geometric programming

monomial function

f(x) = c x1^{a1} x2^{a2} · · · xn^{an},   dom f = R^n_++

with c > 0; exponent ai can be any real number

posynomial function: sum of monomials

f(x) = Σ_{k=1}^{K} ck x1^{a1k} x2^{a2k} · · · xn^{ank},   dom f = R^n_++

geometric program (GP)

minimize   f0(x)
subject to fi(x) ≤ 1,   i = 1, . . . , m
           hi(x) = 1,   i = 1, . . . , p

with fi posynomial, hi monomial

4–29

Geometric program in convex form

change variables to yi = log xi, and take logarithm of cost, constraints

monomial f(x) = c x1^{a1} · · · xn^{an} transforms to

log f(e^{y1}, . . . , e^{yn}) = a^T y + b   (b = log c)

posynomial f(x) = Σ_{k=1}^{K} ck x1^{a1k} x2^{a2k} · · · xn^{ank} transforms to

log f(e^{y1}, . . . , e^{yn}) = log( Σ_{k=1}^{K} e^{ak^T y + bk} )   (bk = log ck)

geometric program transforms to convex problem

minimize   log Σ_{k=1}^{K} exp(a0k^T y + b0k)
subject to log Σ_{k=1}^{K} exp(aik^T y + bik) ≤ 0,   i = 1, . . . , m
           Gy + d = 0

4–30
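The key identity, that a monomial becomes affine under y = log x, is easy to check numerically (a tiny illustration with a made-up monomial, not from the slides):

```python
import numpy as np

# monomial f(x) = 2 * x1^0.5 * x2^{-1} on R^2_{++}
c, a = 2.0, np.array([0.5, -1.0])
f = lambda x: c * np.prod(x ** a)

# after the change of variables x = exp(y), log f is affine: a^T y + log c
y = np.array([0.3, -0.7])
lhs = np.log(f(np.exp(y)))
rhs = a @ y + np.log(c)
```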

Design of cantilever beam

[figure: cantilever beam made of segments 4, 3, 2, 1, with segment 1 at the free right end]

N segments with unit lengths, rectangular cross-sections of size wi × hi
given vertical force F applied at the right end

design problem

minimize total weight
subject to upper & lower bounds on wi, hi
           upper & lower bounds on aspect ratios hi/wi
           upper bound on stress in each segment
           upper bound on vertical deflection at the end of the beam

variables: wi, hi for i = 1, . . . , N

4–31

objective and constraint functions

total weight w1h1 + · · · + wN hN is posynomial
aspect ratio hi/wi and inverse aspect ratio wi/hi are monomials
maximum stress in segment i is given by 6iF/(wi hi²), a monomial
the vertical deflection yi and slope vi of central axis at the right end of segment i are defined recursively as

vi = 12(i − 1/2) F/(E wi hi³) + v_{i+1}
yi =  6(i − 1/3) F/(E wi hi³) + v_{i+1} + y_{i+1}

for i = N, N − 1, . . . , 1, with v_{N+1} = y_{N+1} = 0 (E is Young's modulus)

vi and yi are posynomial functions of w, h

4–32
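The recursion above translates directly into code. A minimal sketch (the function name and unit default values are my own; segments are numbered from the free end as on the slide):

```python
def beam_deflection(w, h, F=1.0, E=1.0):
    """End deflection y_1 and slope v_1 of the cantilever via the recursion
    v_i = 12(i-1/2) F/(E w_i h_i^3) + v_{i+1}
    y_i =  6(i-1/3) F/(E w_i h_i^3) + v_{i+1} + y_{i+1},
    run for i = N, ..., 1 with v_{N+1} = y_{N+1} = 0."""
    N = len(w)
    v = y = 0.0
    for i in range(N, 0, -1):
        d = F / (E * w[i - 1] * h[i - 1] ** 3)
        y = 6 * (i - 1 / 3) * d + v + y   # uses v_{i+1}, so update y first
        v = 12 * (i - 1 / 2) * d + v
    return y, v

# single unit segment, w = h = F = E = 1: classical result y = F L^3/(3EI) = 4
y1, v1 = beam_deflection([1.0], [1.0])
```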

formulation as a GP

minimize   w1h1 + · · · + wN hN
subject to wmax^{−1} wi ≤ 1,  wmin wi^{−1} ≤ 1,   i = 1, . . . , N
           hmax^{−1} hi ≤ 1,  hmin hi^{−1} ≤ 1,   i = 1, . . . , N
           Smax^{−1} wi^{−1} hi ≤ 1,  Smin wi hi^{−1} ≤ 1,   i = 1, . . . , N
           6iF σmax^{−1} wi^{−1} hi^{−2} ≤ 1,   i = 1, . . . , N
           ymax^{−1} y1 ≤ 1

note

we write wmin ≤ wi ≤ wmax and hmin ≤ hi ≤ hmax as

wmin/wi ≤ 1,   wi/wmax ≤ 1,   hmin/hi ≤ 1,   hi/hmax ≤ 1

we write Smin ≤ hi/wi ≤ Smax as

Smin wi/hi ≤ 1,   hi/(wi Smax) ≤ 1

4–33

Minimizing spectral radius of nonnegative matrix

Perron-Frobenius eigenvalue λpf(A)

exists for (elementwise) positive A ∈ R^{n×n}
a real, positive eigenvalue of A, equal to spectral radius max_i |λi(A)|
determines asymptotic growth (decay) rate of A^k: A^k ∼ λpf^k as k → ∞
alternative characterization: λpf(A) = inf{λ | Av ⪯ λv for some v ≻ 0}

minimizing spectral radius of matrix of posynomials

minimize λpf(A(x)), where the elements A(x)ij are posynomials of x
equivalent geometric program:

minimize   λ
subject to Σ_{j=1}^{n} A(x)ij vj /(λvi) ≤ 1,   i = 1, . . . , n

variables λ, v, x

4–34
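The alternative characterization can be checked numerically for a fixed positive matrix: at the Perron eigenvector v, max_i (Av)_i / v_i equals the spectral radius. A small sanity check with a made-up matrix:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])    # elementwise positive

# Perron-Frobenius eigenvalue equals the spectral radius
rho = max(abs(np.linalg.eigvals(A)))

# take the (positive) Perron eigenvector and evaluate
# lam = max_i (A v)_i / v_i, which attains the inf in the characterization
w, V = np.linalg.eig(A)
v = np.abs(V[:, np.argmax(abs(w))].real)
lam = float(max((A @ v) / v))
```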

Generalized inequality constraints

convex problem with generalized inequality constraints

minimize   f0(x)
subject to fi(x) ⪯_{Ki} 0,   i = 1, . . . , m
           Ax = b

f0 : R^n → R convex; fi : R^n → R^{ki} Ki-convex w.r.t. proper cone Ki

same properties as standard convex problem (convex feasible set, local optimum is global, etc.)

conic form problem: special case with affine objective and constraints

minimize   c^T x
subject to F x + g ⪯_K 0
           Ax = b

extends linear programming (K = R^m_+) to nonpolyhedral cones

4–35

Semidefinite program (SDP)

minimize   c^T x
subject to x1F1 + x2F2 + · · · + xnFn + G ⪯ 0
           Ax = b

with Fi, G ∈ S^k

inequality constraint is called linear matrix inequality (LMI)
includes problems with multiple LMI constraints: for example,

x1F̂1 + · · · + xnF̂n + Ĝ ⪯ 0,   x1F̃1 + · · · + xnF̃n + G̃ ⪯ 0

is equivalent to single LMI

x1 [F̂1 0; 0 F̃1] + x2 [F̂2 0; 0 F̃2] + · · · + xn [F̂n 0; 0 F̃n] + [Ĝ 0; 0 G̃] ⪯ 0

4–36

LP and SOCP as SDP

LP and equivalent SDP

LP:   minimize   c^T x
      subject to Ax ⪯ b

SDP:  minimize   c^T x
      subject to diag(Ax − b) ⪯ 0

(note different interpretation of generalized inequality ⪯)

SOCP and equivalent SDP

SOCP: minimize   f^T x
      subject to ‖Ai x + bi‖2 ≤ c_i^T x + d_i,   i = 1, . . . , m

SDP:  minimize   f^T x
      subject to [ (c_i^T x + d_i)I   Ai x + bi ; (Ai x + bi)^T   c_i^T x + d_i ] ⪰ 0,   i = 1, . . . , m

4–37

Eigenvalue minimization

minimize λmax(A(x))

where A(x) = A0 + x1A1 + · · · + xnAn (with given Ai ∈ S^k)

equivalent SDP

minimize   t
subject to A(x) ⪯ tI

variables x ∈ R^n, t ∈ R
follows from

λmax(A) ≤ t  ⟺  A ⪯ tI

4–38

Matrix norm minimization

minimize ‖A(x)‖2 = ( λmax(A(x)^T A(x)) )^{1/2}

where A(x) = A0 + x1A1 + · · · + xnAn (with given Ai ∈ R^{p×q})

equivalent SDP

minimize   t
subject to [ tI   A(x) ; A(x)^T   tI ] ⪰ 0

variables x ∈ R^n, t ∈ R
constraint follows from

‖A‖2 ≤ t  ⟺  A^T A ⪯ t²I, t ≥ 0
          ⟺  [ tI  A ; A^T  tI ] ⪰ 0

4–39

Vector optimization

general vector optimization problem

minimize (w.r.t. K) f0(x)
subject to          fi(x) ≤ 0,   i = 1, . . . , m
                    hi(x) = 0,   i = 1, . . . , p

vector objective f0 : R^n → R^q, minimized w.r.t. proper cone K ⊆ R^q

convex vector optimization problem

minimize (w.r.t. K) f0(x)
subject to          fi(x) ≤ 0,   i = 1, . . . , m
                    Ax = b

with f0 K-convex, f1, . . . , fm convex

4–40

Optimal and Pareto optimal points

set of achievable objective values

O = {f0(x) | x feasible}

feasible x is optimal if f0(x) is the minimum value of O
feasible x is Pareto optimal if f0(x) is a minimal value of O

[figure: left, a set O with minimum value f0(x*) (x* is optimal); right, a set O with minimal value f0(xpo) (xpo is Pareto optimal)]

4–41

Multicriterion optimization

vector optimization problem with K = R^q_+

f0(x) = (F1(x), . . . , Fq(x))

q different objectives Fi; roughly speaking we want all Fi's to be small

feasible x* is optimal if

y feasible  ⟹  f0(x*) ⪯ f0(y)

if there exists an optimal point, the objectives are noncompeting

feasible xpo is Pareto optimal if

y feasible, f0(y) ⪯ f0(xpo)  ⟹  f0(xpo) = f0(y)

if there are multiple Pareto optimal values, there is a trade-off between the objectives

4–42

Regularized least-squares

minimize (w.r.t. R²_+) (‖Ax − b‖2², ‖x‖2²)

[figure: trade-off of F2(x) = ‖x‖2² versus F1(x) = ‖Ax − b‖2²]

example for A ∈ R^{100×10}; heavy line is formed by Pareto optimal points

4–43

Risk return trade-off in portfolio optimization

minimize (w.r.t. R²_+) (−p̄^T x, x^T Σx)
subject to            1^T x = 1, x ⪰ 0

x ∈ R^n is investment portfolio; xi is fraction invested in asset i
p ∈ R^n is vector of relative asset price changes; modeled as a random variable with mean p̄, covariance Σ
p̄^T x = E r is expected return; x^T Σx = var r is return variance

example

[figure: mean return versus standard deviation of return (trade-off curve), and allocations x(1), . . . , x(4) versus standard deviation of return]

4–44

Scalarization

to find Pareto optimal points: choose λ ≻_{K*} 0 and solve scalar problem

minimize   λ^T f0(x)
subject to fi(x) ≤ 0,   i = 1, . . . , m
           hi(x) = 0,   i = 1, . . . , p

if x is optimal for scalar problem, then it is Pareto-optimal for vector optimization problem

[figure: set O with Pareto optimal values f0(x1), f0(x2), f0(x3) and supporting hyperplanes with normals λ1, λ2]

for convex vector optimization problems, can find (almost) all Pareto optimal points by varying λ ≻_{K*} 0

4–45

Scalarization for multicriterion problems

to find Pareto optimal points, minimize positive weighted sum

λ^T f0(x) = λ1F1(x) + · · · + λq Fq(x)

examples

regularized least-squares problem of page 4–43

minimize ‖Ax − b‖2² + γ‖x‖2²

for fixed γ, a LS problem
take λ = (1, γ) with γ > 0

[figure: trade-off curve of ‖x‖2² versus ‖Ax − b‖2², with the point for γ = 1 marked]

4–46
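Sweeping γ in the scalarized problem traces the trade-off curve; for the least-squares pair the minimizer has the closed form x = (A^T A + γI)^{−1} A^T b. A minimal sketch with random data (dimensions chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 10))
b = rng.standard_normal(100)

def reg_ls(gamma):
    """minimize ||Ax - b||_2^2 + gamma*||x||_2^2, closed form
    x = (A^T A + gamma*I)^{-1} A^T b"""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + gamma * np.eye(n), A.T @ b)

# along the trade-off curve, larger gamma gives smaller ||x||
# and larger ||Ax - b||
x1, x2 = reg_ls(0.1), reg_ls(10.0)
```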

risk-return trade-off of page 4–44

minimize   −p̄^T x + γ x^T Σx
subject to 1^T x = 1, x ⪰ 0

for fixed γ > 0, a quadratic program

4–47

Convex Optimization Boyd & Vandenberghe

5. Duality

Lagrange dual problem
weak and strong duality
geometric interpretation
optimality conditions
perturbation and sensitivity analysis
examples
generalized inequalities

5–1

Lagrangian

standard form problem (not necessarily convex)

minimize   f0(x)
subject to fi(x) ≤ 0,   i = 1, . . . , m
           hi(x) = 0,   i = 1, . . . , p

variable x ∈ R^n, domain D, optimal value p*

Lagrangian: L : R^n × R^m × R^p → R, with dom L = D × R^m × R^p,

L(x, λ, ν) = f0(x) + Σ_{i=1}^{m} λi fi(x) + Σ_{i=1}^{p} νi hi(x)

weighted sum of objective and constraint functions
λi is Lagrange multiplier associated with fi(x) ≤ 0
νi is Lagrange multiplier associated with hi(x) = 0

Duality

5–2

Lagrange dual function

Lagrange dual function: g : R^m × R^p → R,

g(λ, ν) = inf_{x∈D} L(x, λ, ν)
        = inf_{x∈D} ( f0(x) + Σ_{i=1}^{m} λi fi(x) + Σ_{i=1}^{p} νi hi(x) )

g is concave, can be −∞ for some λ, ν

lower bound property: if λ ⪰ 0, then g(λ, ν) ≤ p*

proof: if x̃ is feasible and λ ⪰ 0, then

f0(x̃) ≥ L(x̃, λ, ν) ≥ inf_{x∈D} L(x, λ, ν) = g(λ, ν)

minimizing over all feasible x̃ gives p* ≥ g(λ, ν)

5–3

Least-norm solution of linear equations

minimize   x^T x
subject to Ax = b

dual function

Lagrangian is L(x, ν) = x^T x + ν^T (Ax − b)
to minimize L over x, set gradient equal to zero:

∇x L(x, ν) = 2x + A^T ν = 0   ⟹   x = −(1/2)A^T ν

plug in in L to obtain g:

g(ν) = L((−1/2)A^T ν, ν) = −(1/4)ν^T AA^T ν − b^T ν

a concave function of ν

lower bound property: p* ≥ −(1/4)ν^T AA^T ν − b^T ν for all ν

5–4
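This example is small enough to verify numerically: the primal optimum is x* = A^T (AA^T)^{−1} b, and the dual function is maximized at ν* = −2(AA^T)^{−1} b, where the lower bound is tight. A sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 6))
b = rng.standard_normal(3)

# primal: minimize x^T x s.t. Ax = b; solution x* = A^T (A A^T)^{-1} b
x_star = A.T @ np.linalg.solve(A @ A.T, b)
p_star = x_star @ x_star

# dual function g(nu) = -(1/4) nu^T A A^T nu - b^T nu is a lower bound;
# it is maximized at nu* = -2 (A A^T)^{-1} b, where the bound is tight
nu_star = -2 * np.linalg.solve(A @ A.T, b)
g_star = -0.25 * nu_star @ (A @ A.T) @ nu_star - b @ nu_star
```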

Standard form LP

minimize   c^T x
subject to Ax = b,   x ⪰ 0

dual function

Lagrangian is

L(x, λ, ν) = c^T x + ν^T (Ax − b) − λ^T x
           = −b^T ν + (c + A^T ν − λ)^T x

L is affine in x, hence

g(λ, ν) = inf_x L(x, λ, ν) = { −b^T ν   if A^T ν − λ + c = 0;  −∞ otherwise }

g is linear on affine domain {(λ, ν) | A^T ν − λ + c = 0}, hence concave

lower bound property: p* ≥ −b^T ν if A^T ν + c ⪰ 0

5–5

Equality constrained norm minimization

minimize   ‖x‖
subject to Ax = b

dual function

g(ν) = inf_x ( ‖x‖ − ν^T Ax + b^T ν ) = { b^T ν   if ‖A^T ν‖* ≤ 1;  −∞ otherwise }

where ‖v‖* = sup_{‖u‖≤1} u^T v is dual norm of ‖ · ‖

proof: follows from inf_x (‖x‖ − y^T x) = 0 if ‖y‖* ≤ 1, −∞ otherwise

if ‖y‖* ≤ 1, then ‖x‖ − y^T x ≥ 0 for all x, with equality if x = 0
if ‖y‖* > 1, choose x = tu where ‖u‖ ≤ 1, u^T y = ‖y‖* > 1:

‖x‖ − y^T x = t(‖u‖ − ‖y‖*) → −∞   as t → ∞

lower bound property: p* ≥ b^T ν if ‖A^T ν‖* ≤ 1

5–6

Two-way partitioning

minimize   x^T W x
subject to xi² = 1,   i = 1, . . . , n

a nonconvex problem; feasible set contains 2^n discrete points
interpretation: partition {1, . . . , n} in two sets; Wij is cost of assigning i, j to the same set; −Wij is cost of assigning to different sets

dual function

g(ν) = inf_x ( x^T W x + Σ_i νi(xi² − 1) ) = inf_x x^T (W + diag(ν))x − 1^T ν
     = { −1^T ν   if W + diag(ν) ⪰ 0;  −∞ otherwise }

lower bound property: p* ≥ −1^T ν if W + diag(ν) ⪰ 0
example: ν = −λmin(W)1 gives bound p* ≥ nλmin(W)

5–7
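For small n the bound p* ≥ nλmin(W) can be checked against brute-force enumeration of all 2^n sign vectors. A quick sketch with a random symmetric W:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n = 8
W = rng.standard_normal((n, n))
W = (W + W.T) / 2                       # symmetric cost matrix

# exact optimum by enumerating all 2^n feasible points x in {-1, 1}^n
p_star = min(np.array(s) @ W @ np.array(s)
             for s in product([-1.0, 1.0], repeat=n))

# dual lower bound from nu = -lambda_min(W) * 1
bound = n * np.linalg.eigvalsh(W).min()
```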

Lagrange dual and conjugate function

minimize   f0(x)
subject to Ax ⪯ b,   Cx = d

dual function

g(λ, ν) = inf_{x∈dom f0} ( f0(x) + (A^T λ + C^T ν)^T x − b^T λ − d^T ν )
        = −f0*(−A^T λ − C^T ν) − b^T λ − d^T ν

recall definition of conjugate f*(y) = sup_{x∈dom f} (y^T x − f(x))
simplifies derivation of dual if conjugate of f0 is known

example: entropy maximization

f0(x) = Σ_{i=1}^{n} xi log xi,   f0*(y) = Σ_{i=1}^{n} e^{yi−1}

5–8

The dual problem

Lagrange dual problem

maximize   g(λ, ν)
subject to λ ⪰ 0

finds best lower bound on p*, obtained from Lagrange dual function
a convex optimization problem; optimal value denoted d*
λ, ν are dual feasible if λ ⪰ 0, (λ, ν) ∈ dom g
often simplified by making implicit constraint (λ, ν) ∈ dom g explicit

example: standard form LP and its dual (page 5–5)

minimize   c^T x          maximize   −b^T ν
subject to Ax = b         subject to A^T ν + c ⪰ 0
           x ⪰ 0

5–9

Weak and strong duality

weak duality: d* ≤ p*

always holds (for convex and nonconvex problems)
can be used to find nontrivial lower bounds for difficult problems
for example, solving the SDP

maximize   −1^T ν
subject to W + diag(ν) ⪰ 0

gives a lower bound for the two-way partitioning problem on page 5–7

strong duality: d* = p*

does not hold in general
(usually) holds for convex problems
conditions that guarantee strong duality in convex problems are called constraint qualifications

5–10

Slater's constraint qualification

strong duality holds for a convex problem

minimize   f0(x)
subject to fi(x) ≤ 0,   i = 1, . . . , m
           Ax = b

if it is strictly feasible, i.e.,

∃x ∈ int D :   fi(x) < 0,  i = 1, . . . , m,   Ax = b

also guarantees that the dual optimum is attained (if p* > −∞)
can be sharpened: e.g., can replace int D with relint D (interior relative to affine hull); linear inequalities do not need to hold with strict inequality, . . .
there exist many other types of constraint qualifications

5–11

Inequality form LP

primal problem

minimize   c^T x
subject to Ax ⪯ b

dual function

g(λ) = inf_x ( (c + A^T λ)^T x − b^T λ ) = { −b^T λ   if A^T λ + c = 0;  −∞ otherwise }

dual problem

maximize   −b^T λ
subject to A^T λ + c = 0,   λ ⪰ 0

from Slater's condition: p* = d* if Ax̃ ≺ b for some x̃
in fact, p* = d* except when primal and dual are infeasible

5–12

Quadratic program

primal problem (assume P ∈ S^n_++)

minimize   x^T P x
subject to Ax ⪯ b

dual function

g(λ) = inf_x ( x^T P x + λ^T (Ax − b) ) = −(1/4)λ^T AP^{−1}A^T λ − b^T λ

dual problem

maximize   −(1/4)λ^T AP^{−1}A^T λ − b^T λ
subject to λ ⪰ 0

from Slater's condition: p* = d* if Ax̃ ≺ b for some x̃
in fact, p* = d* always

5–13
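Weak duality for the QP is easy to exhibit numerically: any λ ⪰ 0 gives g(λ) ≤ x^T P x for every feasible x. A sketch with random data (the feasible point x0 is built into b on purpose):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 6
L = rng.standard_normal((n, n))
P = L @ L.T + np.eye(n)              # P in S^n_{++}
A = rng.standard_normal((m, n))
x0 = rng.standard_normal(n)
b = A @ x0 + rng.random(m)           # makes x0 strictly feasible

def g(lam):
    """dual function g(lam) = -(1/4) lam^T A P^{-1} A^T lam - b^T lam"""
    y = np.linalg.solve(P, A.T @ lam)
    return -0.25 * (A.T @ lam) @ y - b @ lam

lam = rng.random(m)                  # any lam >= 0 gives a lower bound
```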

A nonconvex problem with strong duality

minimize   x^T Ax + 2b^T x
subject to x^T x ≤ 1

A ⋡ 0, hence nonconvex

dual function: g(λ) = inf_x ( x^T (A + λI)x + 2b^T x − λ )

unbounded below if A + λI ⋡ 0 or if A + λI ⪰ 0 and b ∉ R(A + λI)
minimized by x = −(A + λI)†b otherwise: g(λ) = −b^T (A + λI)†b − λ

dual problem and equivalent SDP:

maximize   −b^T (A + λI)†b − λ          maximize   −t − λ
subject to A + λI ⪰ 0                    subject to [ A + λI  b ; b^T  t ] ⪰ 0
           b ∈ R(A + λI)

strong duality although primal problem is not convex (not easy to show)

5–14

Geometric interpretation

for simplicity, consider problem with one constraint f1(x) ≤ 0

interpretation of dual function:

g(λ) = inf_{(u,t)∈G} (t + λu),   where   G = {(f1(x), f0(x)) | x ∈ D}

λu + t = g(λ) is (non-vertical) supporting hyperplane to G
hyperplane intersects t-axis at t = g(λ)

[figure: set G in the (u, t)-plane, with p*, d*, g(λ), and the supporting line λu + t = g(λ)]

5–15

epigraph variation: same interpretation if G is replaced with

A = {(u, t) | f1(x) ≤ u, f0(x) ≤ t for some x ∈ D}

[figure: set A with supporting line λu + t = g(λ) and optimal value p*]

strong duality

holds if there is a non-vertical supporting hyperplane to A at (0, p*)
for convex problem, A is convex, hence has supp. hyperplane at (0, p*)
Slater's condition: if there exist (ũ, t̃) ∈ A with ũ < 0, then supporting hyperplanes at (0, p*) must be non-vertical

5–16

Complementary slackness

assume strong duality holds, x* is primal optimal, (λ*, ν*) is dual optimal

f0(x*) = g(λ*, ν*) = inf_x ( f0(x) + Σ_{i=1}^{m} λi* fi(x) + Σ_{i=1}^{p} νi* hi(x) )
       ≤ f0(x*) + Σ_{i=1}^{m} λi* fi(x*) + Σ_{i=1}^{p} νi* hi(x*)
       ≤ f0(x*)

hence, the two inequalities hold with equality

x* minimizes L(x, λ*, ν*)
λi* fi(x*) = 0 for i = 1, . . . , m (known as complementary slackness):

λi* > 0 ⟹ fi(x*) = 0,   fi(x*) < 0 ⟹ λi* = 0

5–17

Karush-Kuhn-Tucker (KKT) conditions

the following four conditions are called KKT conditions (for a problem with differentiable fi, hi):

1. primal constraints: fi(x) ≤ 0, i = 1, . . . , m, hi(x) = 0, i = 1, . . . , p
2. dual constraints: λ ⪰ 0
3. complementary slackness: λi fi(x) = 0, i = 1, . . . , m
4. gradient of Lagrangian with respect to x vanishes:

∇f0(x) + Σ_{i=1}^{m} λi ∇fi(x) + Σ_{i=1}^{p} νi ∇hi(x) = 0

from page 5–17: if strong duality holds and x, λ, ν are optimal, then they must satisfy the KKT conditions

5–18

KKT conditions for convex problem

if x̃, λ̃, ν̃ satisfy KKT for a convex problem, then they are optimal:

from complementary slackness: f0(x̃) = L(x̃, λ̃, ν̃)
from 4th condition (and convexity): g(λ̃, ν̃) = L(x̃, λ̃, ν̃)
hence, f0(x̃) = g(λ̃, ν̃)

if Slater's condition is satisfied:

x is optimal if and only if there exist λ, ν that satisfy KKT conditions

recall that Slater implies strong duality, and dual optimum is attained
generalizes optimality condition ∇f0(x) = 0 for unconstrained problem

5–19

example: water-filling (assume αi > 0)

minimize   −Σ_{i=1}^{n} log(xi + αi)
subject to x ⪰ 0,  1^T x = 1

x is optimal iff x ⪰ 0, 1^T x = 1, and there exist λ ∈ R^n, ν ∈ R such that

λ ⪰ 0,   λi xi = 0,   −1/(xi + αi) − λi + ν = 0

if ν < 1/αi: λi = 0 and xi = 1/ν − αi
if ν ≥ 1/αi: λi = ν − 1/αi and xi = 0
determine ν from 1^T x = Σ_{i=1}^{n} max{0, 1/ν − αi} = 1

interpretation

n patches; level of patch i is at height αi
flood area with unit amount of water
resulting level is 1/ν*

[figure: water-filling; patch i has height αi, water depth xi, common water level 1/ν*]

5–20
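The condition Σ max{0, 1/ν − αi} = 1 determines the water level 1/ν by a 1-D search. A minimal sketch using bisection on the level (the function name is my own):

```python
import numpy as np

def water_filling(alpha, total=1.0, tol=1e-12):
    """Solve minimize -sum(log(x_i + alpha_i)) s.t. x >= 0, 1^T x = total
    by bisection on the water level 1/nu:
    sum_i max(level - alpha_i, 0) = total."""
    alpha = np.asarray(alpha, dtype=float)
    lo, hi = alpha.min(), alpha.max() + total   # bracket for the level
    while hi - lo > tol:
        level = (lo + hi) / 2
        if np.maximum(level - alpha, 0).sum() > total:
            hi = level
        else:
            lo = level
    return np.maximum(lo - alpha, 0)

# level works out to 0.8, so x = (0.7, 0.3, 0)
x = water_filling([0.1, 0.5, 1.0])
```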

Perturbation and sensitivity analysis

(unperturbed) optimization problem and its dual

minimize   f0(x)                         maximize   g(λ, ν)
subject to fi(x) ≤ 0,  i = 1, . . . , m  subject to λ ⪰ 0
           hi(x) = 0,  i = 1, . . . , p

perturbed problem and its dual

min. f0(x)                               max. g(λ, ν) − u^T λ − v^T ν
s.t. fi(x) ≤ ui,  i = 1, . . . , m       s.t. λ ⪰ 0
     hi(x) = vi,  i = 1, . . . , p

x is primal variable; u, v are parameters
p*(u, v) is optimal value as a function of u, v
we are interested in information about p*(u, v) that we can obtain from the solution of the unperturbed problem and its dual

5–21

global sensitivity result

assume strong duality holds for unperturbed problem, and that λ*, ν* are dual optimal for unperturbed problem

apply weak duality to perturbed problem:

p*(u, v) ≥ g(λ*, ν*) − u^T λ* − v^T ν*
         = p*(0, 0) − u^T λ* − v^T ν*

sensitivity interpretation

if λi* large: p* increases greatly if we tighten constraint i (ui < 0)
if λi* small: p* does not decrease much if we loosen constraint i (ui > 0)
if νi* large and positive: p* increases greatly if we take vi < 0; if νi* large and negative: p* increases greatly if we take vi > 0
if νi* small and positive: p* does not decrease much if we take vi > 0; if νi* small and negative: p* does not decrease much if we take vi < 0

5–22

local sensitivity: if (in addition) p*(u, v) is differentiable at (0, 0), then

λi* = −∂p*(0, 0)/∂ui,   νi* = −∂p*(0, 0)/∂vi

proof (for λi*): from global sensitivity result,

∂p*(0, 0)/∂ui = lim_{t→0+} (p*(tei, 0) − p*(0, 0))/t ≥ −λi*
∂p*(0, 0)/∂ui = lim_{t→0−} (p*(tei, 0) − p*(0, 0))/t ≤ −λi*

hence, equality

[figure: p*(u) for a problem with one (inequality) constraint; the line p*(0) − λ*u supports the graph of p*(u) at u = 0]

5–23

Duality and problem reformulations

equivalent formulations of a problem can lead to very different duals
reformulating the primal problem can be useful when the dual is difficult to derive, or uninteresting

common reformulations

introduce new variables and equality constraints
make explicit constraints implicit or vice-versa
transform objective or constraint functions
e.g., replace f0(x) by φ(f0(x)) with φ convex, increasing

5–24

Introducing new variables and equality constraints

minimize f0(Ax + b)

dual function is constant: g = inf_x L(x) = inf_x f0(Ax + b) = p*
we have strong duality, but dual is quite useless

reformulated problem and its dual

minimize   f0(y)               maximize   b^T ν − f0*(ν)
subject to Ax + b − y = 0      subject to A^T ν = 0

dual function follows from

g(ν) = inf_{x,y} ( f0(y) − ν^T y + ν^T Ax + b^T ν )
     = { −f0*(ν) + b^T ν   if A^T ν = 0;  −∞ otherwise }

5–25

norm approximation problem: minimize ‖Ax − b‖

minimize   ‖y‖
subject to y = Ax − b

can look up conjugate of ‖ · ‖, or derive dual directly

g(ν) = inf_{x,y} ( ‖y‖ + ν^T y − ν^T Ax + b^T ν )
     = { b^T ν + inf_y (‖y‖ + ν^T y)   if A^T ν = 0;  −∞ otherwise }
     = { b^T ν   if A^T ν = 0, ‖ν‖* ≤ 1;  −∞ otherwise }

(see page 5–4)

dual of norm approximation problem

maximize   b^T ν
subject to A^T ν = 0,   ‖ν‖* ≤ 1

5–26

Implicit constraints

LP with box constraints: primal and dual problem

minimize   c^T x          maximize   −b^T ν − 1^T λ1 − 1^T λ2
subject to Ax = b         subject to c + A^T ν + λ1 − λ2 = 0
           −1 ⪯ x ⪯ 1                λ1 ⪰ 0,  λ2 ⪰ 0

reformulation with box constraints made implicit

minimize   f0(x) = { c^T x   if −1 ⪯ x ⪯ 1;  ∞ otherwise }
subject to Ax = b

dual function

g(ν) = inf_{−1⪯x⪯1} ( c^T x + ν^T (Ax − b) )
     = −b^T ν − ‖A^T ν + c‖1

dual problem: maximize −b^T ν − ‖A^T ν + c‖1

5–27
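The inner infimum over the box is attained at x = −sign(A^T ν + c), which gives the −‖A^T ν + c‖1 term. A quick numerical check with random data:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
c = rng.standard_normal(n)
nu = rng.standard_normal(m)

# inf over -1 <= x <= 1 of (c + A^T nu)^T x is attained coordinatewise
# at x = -sign(c + A^T nu), giving -||A^T nu + c||_1
w = c + A.T @ nu
g_formula = -b @ nu - np.abs(w).sum()
x_min = -np.sign(w)
g_direct = c @ x_min + nu @ (A @ x_min - b)
```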

Problems with generalized inequalities

minimize   f0(x)
subject to fi(x) ⪯_{Ki} 0,  i = 1, . . . , m
           hi(x) = 0,  i = 1, . . . , p

⪯_{Ki} is generalized inequality on R^{ki}

definitions are parallel to scalar case:

Lagrange multiplier for fi(x) ⪯_{Ki} 0 is vector λi ∈ R^{ki}
Lagrangian L : R^n × R^{k1} × · · · × R^{km} × R^p → R, is defined as

L(x, λ1, . . . , λm, ν) = f0(x) + Σ_{i=1}^{m} λi^T fi(x) + Σ_{i=1}^{p} νi hi(x)

dual function g : R^{k1} × · · · × R^{km} × R^p → R, is defined as

g(λ1, . . . , λm, ν) = inf_{x∈D} L(x, λ1, . . . , λm, ν)

5–28

lower bound property: if λi ⪰_{Ki*} 0, then g(λ1, . . . , λm, ν) ≤ p*

proof: if x̃ is feasible and λi ⪰_{Ki*} 0, then

f0(x̃) ≥ f0(x̃) + Σ_{i=1}^{m} λi^T fi(x̃) + Σ_{i=1}^{p} νi hi(x̃)
      ≥ inf_{x∈D} L(x, λ1, . . . , λm, ν)
      = g(λ1, . . . , λm, ν)

minimizing over all feasible x̃ gives p* ≥ g(λ1, . . . , λm, ν)

dual problem

maximize   g(λ1, . . . , λm, ν)
subject to λi ⪰_{Ki*} 0,  i = 1, . . . , m

weak duality: p* ≥ d* always
strong duality: p* = d* for convex problem with constraint qualification (for example, Slater's: primal problem is strictly feasible)

5–29

Semidefinite program

primal SDP (Fi, G ∈ S^k)

minimize   c^T x
subject to x1F1 + · · · + xnFn ⪯ G

Lagrange multiplier is matrix Z ∈ S^k
Lagrangian L(x, Z) = c^T x + tr (Z(x1F1 + · · · + xnFn − G))
dual function

g(Z) = inf_x L(x, Z) = { −tr(GZ)   if tr(FiZ) + ci = 0, i = 1, . . . , n;  −∞ otherwise }

dual SDP

maximize   −tr(GZ)
subject to Z ⪰ 0,  tr(FiZ) + ci = 0,  i = 1, . . . , n

p* = d* if primal SDP is strictly feasible (∃x with x1F1 + · · · + xnFn ≺ G)

5–30

Convex Optimization Boyd & Vandenberghe

6. Approximation and fitting

norm approximation
least-norm problems
regularized approximation
robust approximation

6–1

Norm approximation

minimize ‖Ax − b‖

(A ∈ R^{m×n} with m ≥ n, ‖ · ‖ is a norm on R^m)

interpretations of solution x* = argmin_x ‖Ax − b‖:

geometric: Ax* is point in R(A) closest to b
estimation: linear measurement model

y = Ax + v

y are measurements, x is unknown, v is measurement error
given y = b, best guess of x is x*

optimal design: x are design variables (input), Ax is result (output)
x* is design that best approximates desired result b

Approximation and fitting

6–2

examples

least-squares approximation (‖ · ‖2): solution satisfies normal equations

A^T Ax = A^T b

(x* = (A^T A)^{−1}A^T b if rank A = n)

Chebyshev approximation (‖ · ‖∞): can be solved as an LP

minimize   t
subject to −t1 ⪯ Ax − b ⪯ t1

sum of absolute residuals approximation (‖ · ‖1): can be solved as an LP

minimize   1^T y
subject to −y ⪯ Ax − b ⪯ y

6–3
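The normal equations are a one-liner in numpy; the defining property is that the residual Ax − b is orthogonal to the columns of A. A minimal sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
b = rng.standard_normal(100)

# least-squares approximation: x solves the normal equations A^T A x = A^T b
x = np.linalg.solve(A.T @ A, A.T @ b)

# the residual Ax - b is orthogonal to range(A): A^T (Ax - b) = 0
resid_proj = A.T @ (A @ x - b)
```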

Penalty function approximation

minimize   φ(r1) + · · · + φ(rm)
subject to r = Ax − b

(A ∈ R^{m×n}, φ : R → R is a convex penalty function)

examples

quadratic: φ(u) = u²
deadzone-linear with width a:

φ(u) = max{0, |u| − a}

log-barrier with limit a:

φ(u) = { −a² log(1 − (u/a)²)   if |u| < a;  ∞ otherwise }

[figure: quadratic, deadzone-linear, and log-barrier penalty functions φ(u)]

6–4

example (m = 100, n = 30): histogram of residuals for penalties

φ(u) = u²,   φ(u) = |u|,   φ(u) = max{0, |u| − a},   φ(u) = −log(1 − u²)

[figure: histograms of residuals for p = 2, p = 1, deadzone, and log-barrier penalties]

shape of penalty function has large effect on distribution of residuals

6–5

Huber penalty function (with parameter M)

φhub(u) = { u²             if |u| ≤ M
           { M(2|u| − M)   if |u| > M

linear growth for large u makes approximation less sensitive to outliers

[figure: left, Huber penalty for M = 1; right, affine function f(t) = α + βt fitted to 42 points ti, yi (circles) using quadratic (dashed) and Huber (solid) penalty]

6–6
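The piecewise definition is directly vectorizable; note the function is continuous at |u| = M, where both branches give M². A minimal sketch:

```python
import numpy as np

def huber(u, M=1.0):
    """Huber penalty: quadratic for |u| <= M, linear growth M(2|u| - M)
    beyond; both branches agree (value M^2) at |u| = M."""
    a = np.abs(u)
    return np.where(a <= M, a ** 2, M * (2 * a - M))
```

Inside the quadratic region it matches the squared penalty exactly; outside, it grows only linearly, which is what makes the fit robust to outliers.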

Least-norm problems

minimize   ‖x‖
subject to Ax = b

(A ∈ R^{m×n} with m ≤ n, ‖ · ‖ is a norm on R^n)

interpretations of solution x* = argmin_{Ax=b} ‖x‖:

geometric: x* is point in affine set {x | Ax = b} with minimum distance to 0
estimation: b = Ax are (perfect) measurements of x; x* is smallest (most plausible) estimate consistent with measurements
design: x are design variables (inputs); b are required results (outputs)
x* is smallest (most efficient) design that satisfies requirements

6–7

examples

least-squares solution of linear equations (‖ · ‖2):
can be solved via optimality conditions

2x + A^T ν = 0,   Ax = b

minimum sum of absolute values (‖ · ‖1): can be solved as an LP

minimize   1^T y
subject to −y ⪯ x ⪯ y,   Ax = b

tends to produce sparse solution x*

extension: least-penalty problem

minimize   φ(x1) + · · · + φ(xn)
subject to Ax = b

φ : R → R is convex penalty function

6–8

Regularized approximation

minimize (w.r.t. R²_+) (‖Ax − b‖, ‖x‖)

A ∈ R^{m×n}, norms on R^m and R^n can be different

interpretation: find good approximation Ax ≈ b with small x

estimation: linear measurement model y = Ax + v, with prior knowledge that ‖x‖ is small
optimal design: small x is cheaper or more efficient, or the linear model y = Ax is only valid for small x
robust approximation: good approximation Ax ≈ b with small x is less sensitive to errors in A than good approximation with large x

6–9

Scalarized problem

minimize ‖Ax − b‖ + γ‖x‖

solution for γ > 0 traces out optimal trade-off curve
other common method: minimize ‖Ax − b‖² + δ‖x‖² with δ > 0

Tikhonov regularization

minimize ‖Ax − b‖2² + δ‖x‖2²

can be solved as a least-squares problem

minimize ‖ [A; √δ I] x − [b; 0] ‖2²

solution x = (A^T A + δI)^{−1}A^T b

6–10
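The equivalence between the closed form and the stacked least-squares formulation is easy to confirm numerically. A sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 8))
b = rng.standard_normal(30)
delta = 0.5

# closed form x = (A^T A + delta*I)^{-1} A^T b
x1 = np.linalg.solve(A.T @ A + delta * np.eye(8), A.T @ b)

# same problem as a stacked least-squares: || [A; sqrt(delta) I] x - [b; 0] ||
A_stk = np.vstack([A, np.sqrt(delta) * np.eye(8)])
b_stk = np.concatenate([b, np.zeros(8)])
x2, *_ = np.linalg.lstsq(A_stk, b_stk, rcond=None)
```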

Optimal input design

linear dynamical system with impulse response h:

y(t) = Σ_{τ=0}^{t} h(τ)u(t − τ),   t = 0, 1, . . . , N

input design problem: multicriterion problem with 3 objectives

1. tracking error with desired output ydes: Jtrack = Σ_{t=0}^{N} (y(t) − ydes(t))²
2. input magnitude: Jmag = Σ_{t=0}^{N} u(t)²
3. input variation: Jder = Σ_{t=0}^{N−1} (u(t + 1) − u(t))²

track desired output using a small and slowly varying input signal

regularized least-squares formulation

minimize Jtrack + δJder + ηJmag

for fixed δ, η, a least-squares problem in u(0), . . . , u(N)

6–11

example: 3 solutions on optimal trade-off surface

(top) δ = 0, small η; (middle) δ = 0, larger η; (bottom) large δ

[figure: input u(t) and output y(t) for t = 0, . . . , 200 in the three cases]

6–12

Signal reconstruction

minimize (w.r.t. R²_+) (‖x̂ − xcor‖2, φ(x̂))

x ∈ R^n is unknown signal
xcor = x + v is (known) corrupted version of x, with additive noise v
variable x̂ (reconstructed signal) is estimate of x
φ : R^n → R is regularization function or smoothing objective

examples: quadratic smoothing, total variation smoothing:

φquad(x̂) = Σ_{i=1}^{n−1} (x̂_{i+1} − x̂_i)²,   φtv(x̂) = Σ_{i=1}^{n−1} |x̂_{i+1} − x̂_i|

6–13
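The scalarized quadratic-smoothing problem has a closed form: minimizing ‖x̂ − xcor‖2² + δ φquad(x̂) gives the linear system (I + δD^T D)x̂ = xcor, where D is the first-difference matrix. A minimal sketch with a made-up noisy signal (the δ value is arbitrary):

```python
import numpy as np

def quad_smooth(xcor, delta):
    """minimize ||xhat - xcor||_2^2 + delta * sum_i (xhat_{i+1} - xhat_i)^2;
    normal equations: (I + delta * D^T D) xhat = xcor, D = first difference."""
    n = len(xcor)
    D = np.diff(np.eye(n), axis=0)        # (n-1) x n difference matrix
    return np.linalg.solve(np.eye(n) + delta * D.T @ D, xcor)

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 4, 200))
xcor = x + 0.3 * rng.standard_normal(200)
xhat = quad_smooth(xcor, 50.0)
# by optimality, the smoothed signal has smaller total variation-squared
# than the corrupted one
```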

quadratic smoothing example

[figure: left, original signal x and noisy signal xcor; right, three solutions on trade-off curve of ‖x̂ − xcor‖2 versus φquad(x̂)]

6–14

total variation reconstruction example

[figure: left, original signal x and noisy signal xcor; right, three solutions on trade-off curve of ‖x̂ − xcor‖2 versus φquad(x̂)]

quadratic smoothing smooths out noise and sharp transitions in signal

6–15

[figure: left, original signal x and noisy signal xcor; right, three solutions on trade-off curve of ‖x̂ − xcor‖2 versus φtv(x̂)]

total variation smoothing preserves sharp transitions in signal

6–16

Robust approximation
minimize ||Ax − b|| with uncertain A

two approaches:

stochastic: assume A is random, minimize E ||Ax − b||
worst-case: set A of possible values of A, minimize sup_{A∈A} ||Ax − b||

tractable only in special cases (certain norms || · ||, distributions, sets A)
example: A(u) = A0 + uA1

xnom minimizes ||A0 x − b||2^2
xstoch minimizes E ||A(u)x − b||2^2 with u uniform on [−1, 1]
xwc minimizes sup_{−1≤u≤1} ||A(u)x − b||2^2

[figure: r(u) = ||A(u)x − b||2 versus u for xnom, xstoch, xwc]

stochastic robust LS with A = Ā + U, U random, E U = 0, E U^T U = P

minimize E ||(Ā + U)x − b||2^2

explicit expression for objective:

E ||Ax − b||2^2 = E ||Āx − b + Ux||2^2
              = ||Āx − b||2^2 + E x^T U^T U x
              = ||Āx − b||2^2 + x^T P x

hence, robust LS problem is equivalent to LS problem

minimize ||Āx − b||2^2 + ||P^{1/2} x||2^2

for P = δI, get Tikhonov regularized problem

minimize ||Āx − b||2^2 + δ||x||2^2
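The Tikhonov problem above can be solved either from its normal equations or by stacking it as an ordinary least-squares problem; both routes give the same solution. A small numpy check (random data chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 10))   # plays the role of A-bar
b = rng.standard_normal(30)
delta = 0.1

# normal equations for: minimize ||A x - b||_2^2 + delta * ||x||_2^2
x_ne = np.linalg.solve(A.T @ A + delta * np.eye(10), A.T @ b)

# the same problem as an ordinary (stacked) least-squares problem
A_aug = np.vstack([A, np.sqrt(delta) * np.eye(10)])
b_aug = np.concatenate([b, np.zeros(10)])
x_ls, *_ = np.linalg.lstsq(A_aug, b_aug, rcond=None)
```

The stacked form is usually preferred numerically, since it avoids squaring the condition number of A.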

worst-case robust LS with A = {Ā + u1 A1 + · · · + up Ap | ||u||2 ≤ 1}

minimize sup_{A∈A} ||Ax − b||2^2 = sup_{||u||2≤1} ||P(x)u + q(x)||2^2

where P(x) = [ A1 x   A2 x   · · ·   Ap x ],  q(x) = Āx − b

from page 5–14, strong duality holds between the following problems

maximize   ||Pu + q||2^2            minimize   t + λ
subject to ||u||2^2 ≤ 1             subject to [ λI   P   q
                                                 P^T  I   0   ⪰ 0
                                                 q^T  0   t ]

hence, robust LS problem is equivalent to SDP

minimize   t + λ
subject to [ λI      P(x)   q(x)
             P(x)^T  I      0     ⪰ 0
             q(x)^T  0      t   ]

(with variables x, λ, t)

example: histogram of residuals

r(u) = ||(A0 + u1 A1 + u2 A2)x − b||2

with u uniformly distributed on unit disk, for three values of x:

xls minimizes ||A0 x − b||2
xtik minimizes ||A0 x − b||2^2 + ||x||2^2 (Tikhonov solution)
xrls minimizes sup_{A∈A} ||Ax − b||2^2 + ||x||2^2

[figure: histograms (frequency versus r(u)) for xls, xtik, xrls]

Convex Optimization Boyd & Vandenberghe

7. Statistical estimation

maximum likelihood estimation


optimal detector design
experiment design


Parametric distribution estimation


distribution estimation problem: estimate probability density p(y) of a random variable from observed values

parametric distribution estimation: choose from a family of densities px(y), indexed by a parameter x

maximum likelihood estimation

maximize (over x)  log px(y)

y is observed value
l(x) = log px(y) is called log-likelihood function
can add constraints x ∈ C explicitly, or define px(y) = 0 for x ∉ C
a convex optimization problem if log px(y) is concave in x for fixed y
Statistical estimation


Linear measurements with IID noise


linear measurement model

yi = ai^T x + vi,  i = 1, . . . , m

x ∈ Rn is vector of unknown parameters
vi is IID measurement noise, with density p(z)
yi is measurement: y ∈ Rm has density px(y) = ∏_{i=1}^m p(yi − ai^T x)

maximum likelihood estimate: any solution x of

maximize  l(x) = Σ_{i=1}^m log p(yi − ai^T x)

(y is observed value)

examples

Gaussian noise N(0, σ^2): p(z) = (2πσ^2)^{−1/2} e^{−z^2/(2σ^2)}

l(x) = −(m/2) log(2πσ^2) − (1/(2σ^2)) Σ_{i=1}^m (ai^T x − yi)^2

ML estimate is LS solution

Laplacian noise: p(z) = (1/(2a)) e^{−|z|/a}

l(x) = −m log(2a) − (1/a) Σ_{i=1}^m |ai^T x − yi|

ML estimate is ℓ1-norm solution

uniform noise on [−a, a]:

l(x) = { −m log(2a)   |ai^T x − yi| ≤ a, i = 1, . . . , m
       { −∞           otherwise

ML estimate is any x with |ai^T x − yi| ≤ a

Logistic regression
random variable y ∈ {0, 1} with distribution

p = prob(y = 1) = exp(a^T u + b) / (1 + exp(a^T u + b))

a, b are parameters; u ∈ Rn are (observable) explanatory variables

estimation problem: estimate a, b from m observations (ui, yi)

log-likelihood function (for y1 = · · · = yk = 1, yk+1 = · · · = ym = 0):

l(a, b) = log ( ∏_{i=1}^k exp(a^T ui + b)/(1 + exp(a^T ui + b)) · ∏_{i=k+1}^m 1/(1 + exp(a^T ui + b)) )

        = Σ_{i=1}^k (a^T ui + b) − Σ_{i=1}^m log(1 + exp(a^T ui + b))

concave in a, b
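Because l(a, b) is concave, a plain gradient ascent finds the ML estimate. A minimal numpy sketch for the n = 1 case (the synthetic data, step size, and iteration count are illustrative choices, not from the slides):

```python
import numpy as np

def log_likelihood(a, b, u, y):
    """Logistic log-likelihood: sum_i y_i*(a*u_i+b) - log(1+exp(a*u_i+b))."""
    z = a * u + b
    return np.sum(y * z) - np.sum(np.log1p(np.exp(z)))

rng = np.random.default_rng(2)
u = rng.uniform(0, 10, 50)                   # explanatory variable, n = 1
p_true = 1 / (1 + np.exp(-(u - 5)))          # hypothetical "true" model
y = (rng.uniform(size=50) < p_true).astype(float)

a, b = 0.0, 0.0
for _ in range(5000):                        # gradient ascent on concave l(a, b)
    p = 1 / (1 + np.exp(-(a * u + b)))       # current model probabilities
    a += 1e-3 * np.sum((y - p) * u)          # dl/da = sum (y_i - p_i) u_i
    b += 1e-3 * np.sum(y - p)                # dl/db = sum (y_i - p_i)
```

The gradient has the simple form Σ (yi − pi) ui because the log-sum-exp term's gradient is exactly the model probability pi.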

example (n = 1, m = 50 measurements)

[figure: circles show 50 points (ui, yi); solid curve is ML estimate of p = exp(au + b)/(1 + exp(au + b))]

(Binary) hypothesis testing


detection (hypothesis testing) problem

given observation of a random variable X ∈ {1, . . . , n}, choose between:

hypothesis 1: X was generated by distribution p = (p1, . . . , pn)
hypothesis 2: X was generated by distribution q = (q1, . . . , qn)

randomized detector

a nonnegative matrix T ∈ R^{2×n}, with 1^T T = 1^T
if we observe X = k, we choose hypothesis 1 with probability t1k, hypothesis 2 with probability t2k
if all elements of T are 0 or 1, it is called a deterministic detector

detection probability matrix:

D = [ Tp  Tq ] = [ 1 − Pfp   Pfn
                   Pfp       1 − Pfn ]

Pfp is probability of selecting hypothesis 2 if X is generated by distribution 1 (false positive)
Pfn is probability of selecting hypothesis 1 if X is generated by distribution 2 (false negative)

multicriterion formulation of detector design

minimize (w.r.t. R2+)  (Pfp, Pfn) = ((Tp)2, (Tq)1)
subject to             t1k + t2k = 1,  k = 1, . . . , n
                       tik ≥ 0,  i = 1, 2,  k = 1, . . . , n

variable T ∈ R^{2×n}

scalarization (with weight λ > 0)

minimize   (Tp)2 + λ(Tq)1
subject to t1k + t2k = 1,  tik ≥ 0,  i = 1, 2,  k = 1, . . . , n

an LP with a simple analytical solution

(t1k, t2k) = { (1, 0)   pk ≥ λqk
             { (0, 1)   pk < λqk

a deterministic detector, given by a likelihood ratio test
if pk = λqk for some k, any value 0 ≤ t1k ≤ 1, t2k = 1 − t1k is optimal (i.e., Pareto-optimal detectors include non-deterministic detectors)

minimax detector

minimize   max{Pfp, Pfn} = max{(Tp)2, (Tq)1}
subject to t1k + t2k = 1,  tik ≥ 0,  i = 1, 2,  k = 1, . . . , n

an LP; solution is usually not deterministic
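The likelihood ratio test above is easy to implement directly: pick hypothesis 1 wherever pk ≥ λqk, then read off the error probabilities from T. A sketch using the distributions from the example slide below:

```python
import numpy as np

def lrt_detector(p, q, lam):
    """Deterministic detector from the scalarized LP: choose hypothesis 1
    when p_k >= lam * q_k. Returns T (2 x n) and the error probabilities."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    t1 = (p >= lam * q).astype(float)
    T = np.vstack([t1, 1 - t1])
    P_fp = (T @ p)[1]      # prob. of choosing hypothesis 2 under distribution p
    P_fn = (T @ q)[0]      # prob. of choosing hypothesis 1 under distribution q
    return T, P_fp, P_fn

p = np.array([0.70, 0.20, 0.05, 0.05])
q = np.array([0.10, 0.10, 0.70, 0.10])
T, P_fp, P_fn = lrt_detector(p, q, lam=1.0)   # picks hypothesis 1 on k = 1, 2
```

Varying lam traces out the deterministic points on the (Pfp, Pfn) trade-off curve.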

example

P = [ 0.70  0.10
      0.20  0.10
      0.05  0.70
      0.05  0.10 ]

(columns of P are the two distributions p and q)

[figure: trade-off curve of Pfn versus Pfp; solutions 1, 2, 3 (and endpoints) are deterministic; 4 is minimax detector]

Experiment design
m linear measurements yi = ai^T x + wi, i = 1, . . . , m of unknown x ∈ Rn

measurement errors wi are IID N(0, 1)

ML (least-squares) estimate is

x̂ = ( Σ_{i=1}^m ai ai^T )^{−1} Σ_{i=1}^m yi ai

error e = x̂ − x has zero mean and covariance

E = E ee^T = ( Σ_{i=1}^m ai ai^T )^{−1}

confidence ellipsoids are given by {x | (x − x̂)^T E^{−1} (x − x̂) ≤ β}

experiment design: choose ai ∈ {v1, . . . , vp} (a set of possible test vectors) to make E 'small'

vector optimization formulation

minimize (w.r.t. Sn+)  E = ( Σ_{k=1}^p mk vk vk^T )^{−1}
subject to             mk ≥ 0,  m1 + · · · + mp = m
                       mk ∈ Z

variables are mk (# vectors ai equal to vk)
difficult in general, due to integer constraint

relaxed experiment design

assume m ≫ p, use λk = mk/m as (continuous) real variable

minimize (w.r.t. Sn+)  E = (1/m) ( Σ_{k=1}^p λk vk vk^T )^{−1}
subject to             λ ⪰ 0,  1^T λ = 1

common scalarizations: minimize log det E, tr E, λmax(E), . . .
can add other convex constraints, e.g., bound experiment cost c^T λ ≤ B

D-optimal design
minimize   log det ( Σ_{k=1}^p λk vk vk^T )^{−1}
subject to λ ⪰ 0,  1^T λ = 1

interpretation: minimizes volume of confidence ellipsoids

dual problem

maximize   log det W + n log n
subject to vk^T W vk ≤ 1,  k = 1, . . . , p

interpretation: {x | x^T W x ≤ 1} is minimum volume ellipsoid centered at origin, that includes all test vectors vk

complementary slackness: for λ, W primal and dual optimal

λk (1 − vk^T W vk) = 0,  k = 1, . . . , p

optimal experiment uses vectors vk on boundary of ellipsoid defined by W

example (p = 20)

[figure: λ1 = 0.5, λ2 = 0.5; design uses two vectors, on boundary of ellipse defined by optimal W]

derivation of dual of page 7–13

first reformulate primal problem with new variable X:

minimize   log det X^{−1}
subject to X = Σ_{k=1}^p λk vk vk^T,  λ ⪰ 0,  1^T λ = 1

L(X, λ, Z, z, ν) = log det X^{−1} + tr( Z ( X − Σ_{k=1}^p λk vk vk^T ) ) − z^T λ + ν(1^T λ − 1)

minimize over X by setting gradient to zero: −X^{−1} + Z = 0
minimum over λk is −∞ unless −vk^T Z vk − zk + ν = 0

dual problem

maximize   n + log det Z − ν
subject to vk^T Z vk ≤ ν,  k = 1, . . . , p

change variable W = Z/ν, and optimize over ν to get dual of page 7–13

Convex Optimization Boyd & Vandenberghe

8. Geometric problems

extremal volume ellipsoids


centering
classification
placement and facility location


Minimum volume ellipsoid around a set


Löwner-John ellipsoid of a set C: minimum volume ellipsoid E s.t. C ⊆ E

parametrize E as E = {v | ||Av + b||2 ≤ 1}; w.l.o.g. assume A ∈ Sn++

vol E is proportional to det A^{−1}; to compute minimum volume ellipsoid,

minimize (over A, b)  log det A^{−1}
subject to            sup_{v∈C} ||Av + b||2 ≤ 1

convex, but evaluating the constraint can be hard (for general C)

finite set C = {x1, . . . , xm}:

minimize (over A, b)  log det A^{−1}
subject to            ||Axi + b||2 ≤ 1,  i = 1, . . . , m

also gives Löwner-John ellipsoid for polyhedron conv{x1, . . . , xm}

Maximum volume inscribed ellipsoid


maximum volume ellipsoid E inside a convex set C Rn
parametrize E as E = {Bu + d | kuk2 1}; w.l.o.g. assume B Sn++

vol E is proportional to det B; can compute E by solving


maximize log det B
subject to supkuk21 IC (Bu + d) 0
(where IC (x) = 0 for x C and IC (x) = for x 6 C)

convex, but evaluating the constraint can be hard (for general C)


polyhedron {x | aTi x bi, i = 1, . . . , m}:
maximize log det B
subject to kBaik2 + aTi d bi,

i = 1, . . . , m

(constraint follows from supkuk21 aTi (Bu + d) = kBaik2 + aTi d)


Geometric problems

83

Efficiency of ellipsoidal approximations


C Rn convex, bounded, with nonempty interior
Lowner-John ellipsoid, shrunk by a factor n, lies inside C
maximum volume inscribed ellipsoid, expanded by a factor n, covers C
example (for two polyhedra in R2)

factor n can be improved to


Geometric problems

n if C is symmetric
84

Centering
some possible definitions of center of a convex set C:

center of largest inscribed ball (Chebyshev center); for polyhedron, can be computed via linear programming (page 4–19)
center of maximum volume inscribed ellipsoid (page 8–3)

[figure: xcheb and xmve for a polyhedron]

MVE center is invariant under affine coordinate transformations

Analytic center of a set of inequalities


the analytic center of set of convex inequalities and linear equations
fi(x) 0,

i = 1, . . . , m,

Fx = g

is defined as the optimal point of


Pm

minimize i=1 log(fi(x))


subject to F x = g

more easily computed than MVE or Chebyshev center (see later)


not just a property of the feasible set: two sets of inequalities can
describe the same set, but have different analytic centers

Geometric problems

86

analytic center of linear inequalities ai^T x ≤ bi, i = 1, . . . , m

xac is minimizer of

φ(x) = −Σ_{i=1}^m log(bi − ai^T x)

[figure: xac for a polyhedron]

inner and outer ellipsoids from analytic center:

Einner ⊆ {x | ai^T x ≤ bi, i = 1, . . . , m} ⊆ Eouter

where

Einner = {x | (x − xac)^T ∇²φ(xac) (x − xac) ≤ 1}
Eouter = {x | (x − xac)^T ∇²φ(xac) (x − xac) ≤ m(m − 1)}
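Since φ is smooth with explicit gradient Σ ai/(bi − ai^T x) and Hessian Σ ai ai^T /(bi − ai^T x)^2, the analytic center can be computed by Newton's method (treated later in these slides). A minimal sketch, assuming a strictly feasible starting point and omitting the line search a robust implementation would add:

```python
import numpy as np

def analytic_center(A, b, x0, iters=20):
    """Newton's method for the analytic center of {x | a_i^T x <= b_i}.
    x0 must be strictly feasible; no line search (illustrative sketch only)."""
    x = np.asarray(x0, float)
    for _ in range(iters):
        d = 1.0 / (b - A @ x)               # inverse slacks 1/(b_i - a_i^T x)
        grad = A.T @ d                      # gradient of phi
        hess = A.T @ (d[:, None] ** 2 * A)  # Hessian of phi
        x = x - np.linalg.solve(hess, grad) # full Newton step
    return x

# box -1 <= x_i <= 1 in R^2: by symmetry the analytic center is the origin
A = np.vstack([np.eye(2), -np.eye(2)])
b = np.ones(4)
xac = analytic_center(A, b, np.array([0.3, -0.2]))
```

For this well-centered box example the pure Newton iteration stays feasible and converges quadratically to the origin.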

Linear discrimination
separate two sets of points {x1, . . . , xN}, {y1, . . . , yM} by a hyperplane:

a^T xi + b > 0,  i = 1, . . . , N,    a^T yi + b < 0,  i = 1, . . . , M

homogeneous in a, b, hence equivalent to

a^T xi + b ≥ 1,  i = 1, . . . , N,    a^T yi + b ≤ −1,  i = 1, . . . , M

a set of linear inequalities in a, b

Robust linear discrimination


(Euclidean) distance between hyperplanes
H1 = {z | aT z + b = 1}

H2 = {z | aT z + b = 1}

is dist(H1, H2) = 2/kak2


to separate two sets of points by maximum margin,
minimize (1/2)kak2
subject to aT xi + b 1, i = 1, . . . , N
aT yi + b 1, i = 1, . . . , M

(1)

(after squaring objective) a QP in a, b


Geometric problems

89

Lagrange dual of maximum margin separation problem (1)

maximize   1^T λ + 1^T μ
subject to 2 || Σ_{i=1}^N λi xi − Σ_{i=1}^M μi yi ||2 ≤ 1        (2)
           1^T λ = 1^T μ,  λ ⪰ 0,  μ ⪰ 0

from duality, optimal value is inverse of maximum margin of separation

interpretation

change variables to θi = λi/1^T λ, γi = μi/1^T μ, t = 1/(1^T λ + 1^T μ)
invert objective to minimize 1/(1^T λ + 1^T μ) = t

minimize   t
subject to || Σ_{i=1}^N θi xi − Σ_{i=1}^M γi yi ||2 ≤ 2t
           θ ⪰ 0,  1^T θ = 1,  γ ⪰ 0,  1^T γ = 1

optimal value is distance between convex hulls

Approximate linear separation of non-separable sets


minimize   1^T u + 1^T v
subject to a^T xi + b ≥ 1 − ui,  i = 1, . . . , N
           a^T yi + b ≤ −1 + vi,  i = 1, . . . , M
           u ⪰ 0,  v ⪰ 0

an LP in a, b, u, v
at optimum, ui = max{0, 1 − a^T xi − b}, vi = max{0, 1 + a^T yi + b}
can be interpreted as a heuristic for minimizing #misclassified points

Support vector classifier


minimize   ||a||2 + γ(1^T u + 1^T v)
subject to a^T xi + b ≥ 1 − ui,  i = 1, . . . , N
           a^T yi + b ≤ −1 + vi,  i = 1, . . . , M
           u ⪰ 0,  v ⪰ 0

produces point on trade-off curve between inverse of margin 2/||a||2 and classification error, measured by total slack 1^T u + 1^T v

[figure: same example as previous page, with γ = 0.1]
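A closely related smooth variant, the standard soft-margin objective (1/2)||a||^2 + γ Σ max{0, 1 − label·(a^T z + b)} (note the squared norm, unlike the slide), can be attacked with a simple subgradient method. A rough sketch on synthetic two-cluster data (data, step-size rule, and iteration count are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((20, 2)) + np.array([3.0, 3.0])   # class +1 points x_i
Y = rng.standard_normal((20, 2)) - np.array([3.0, 3.0])   # class -1 points y_i
pts = np.vstack([X, Y])
lab = np.concatenate([np.ones(20), -np.ones(20)])

# subgradient method on (1/2)||a||^2 + gamma * sum max{0, 1 - lab*(a^T z + b)}
a, b, gamma = np.zeros(2), 0.0, 1.0
for k in range(2000):
    margin = lab * (pts @ a + b)
    viol = margin < 1                         # points with positive slack
    ga = a - gamma * (lab[viol, None] * pts[viol]).sum(axis=0)
    gb = -gamma * lab[viol].sum()
    t = 1.0 / (k + 1)                         # diminishing step size
    a -= t * ga
    b -= t * gb

acc = np.mean(lab * (pts @ a + b) > 0)        # fraction correctly classified
```

The clusters are well separated here, so the learned hyperplane should point roughly along (1, 1) and classify almost all training points correctly.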

Nonlinear discrimination
separate two sets of points by a nonlinear function:

f(xi) > 0,  i = 1, . . . , N,    f(yi) < 0,  i = 1, . . . , M

choose a linearly parametrized family of functions

f(z) = θ^T F(z)

F = (F1, . . . , Fk) : Rn → Rk are basis functions

solve a set of linear inequalities in θ:

θ^T F(xi) ≥ 1,  i = 1, . . . , N,    θ^T F(yi) ≤ −1,  i = 1, . . . , M

quadratic discrimination: f(z) = z^T P z + q^T z + r

xi^T P xi + q^T xi + r ≥ 1,    yi^T P yi + q^T yi + r ≤ −1

can add additional constraints (e.g., P ⪯ −I to separate by an ellipsoid)

polynomial discrimination: F(z) are all monomials up to a given degree

[figure: separation by ellipsoid (left); separation by 4th degree polynomial (right)]

Placement and facility location


N points with coordinates xi ∈ R2 (or R3)
some positions xi are given; the other xi's are variables
for each pair of points, a cost function fij(xi, xj)

placement problem

minimize  Σ_{i≠j} fij(xi, xj)

variables are positions of free points

interpretations

points represent plants or warehouses; fij is transportation cost between facilities i and j
points represent cells on an IC; fij represents wirelength

example: minimize Σ_{(i,j)∈A} h(||xi − xj||2), with 6 free points, 27 links

[figure: optimal placement for h(z) = z, h(z) = z^2, h(z) = z^4; histograms of connection lengths ||xi − xj||2]

Convex Optimization Boyd & Vandenberghe

9. Numerical linear algebra background

matrix structure and algorithm complexity


solving linear equations with factored matrices
LU, Cholesky, LDL^T factorization
block elimination and the matrix inversion lemma
solving underdetermined equations


Matrix structure and algorithm complexity


cost (execution time) of solving Ax = b with A ∈ Rn×n

for general methods, grows as n^3
less if A is structured (banded, sparse, Toeplitz, . . . )

flop counts

flop (floating-point operation): one addition, subtraction, multiplication, or division of two floating-point numbers
to estimate complexity of an algorithm: express number of flops as a (polynomial) function of the problem dimensions, and simplify by keeping only the leading terms
not an accurate predictor of computation time on modern computers
useful as a rough estimate of complexity
Numerical linear algebra background


vector-vector operations (x, y ∈ Rn)

inner product x^T y: 2n − 1 flops (or 2n if n is large)
sum x + y, scalar multiplication αx: n flops

matrix-vector product y = Ax with A ∈ Rm×n

m(2n − 1) flops (or 2mn if n large)
2N if A is sparse with N nonzero elements
2p(n + m) if A is given as A = UV^T, U ∈ Rm×p, V ∈ Rn×p

matrix-matrix product C = AB with A ∈ Rm×n, B ∈ Rn×p

mp(2n − 1) flops (or 2mnp if n large)
less if A and/or B are sparse
(1/2)m(m + 1)(2n − 1) ≈ m^2 n if m = p and C symmetric

Linear equations that are easy to solve


diagonal matrices (aij = 0 if i ≠ j): n flops

x = A^{−1}b = (b1/a11, . . . , bn/ann)

lower triangular (aij = 0 if j > i): n^2 flops via forward substitution

x1 := b1/a11
x2 := (b2 − a21 x1)/a22
x3 := (b3 − a31 x1 − a32 x2)/a33
. . .
xn := (bn − an1 x1 − an2 x2 − · · · − a_{n,n−1} x_{n−1})/ann

upper triangular (aij = 0 if j < i): n^2 flops via backward substitution
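The forward-substitution recursion above translates almost line-for-line into code:

```python
import numpy as np

def forward_subst(L, b):
    """Solve L x = b for lower-triangular nonsingular L by forward
    substitution (about n^2 flops)."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n):
        # x_i := (b_i - sum_{j<i} L_ij x_j) / L_ii
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

rng = np.random.default_rng(4)
L = np.tril(rng.standard_normal((5, 5))) + 5 * np.eye(5)   # lower triangular
b = rng.standard_normal(5)
x = forward_subst(L, b)
```

Backward substitution for an upper-triangular system is the same loop run from i = n−1 down to 0.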

orthogonal matrices: A^{−1} = A^T

2n^2 flops to compute x = A^T b for general A
less with structure, e.g., if A = I − 2uu^T with ||u||2 = 1, we can compute x = A^T b = b − 2(u^T b)u in 4n flops

permutation matrices:

aij = { 1   j = πi
      { 0   otherwise

where π = (π1, π2, . . . , πn) is a permutation of (1, 2, . . . , n)

interpretation: Ax = (x_{π1}, . . . , x_{πn})
satisfies A^{−1} = A^T, hence cost of solving Ax = b is 0 flops

example:

A = [ 0 1 0          A^{−1} = A^T = [ 0 0 1
      0 0 1                           1 0 0
      1 0 0 ],                        0 1 0 ]

The factor-solve method for solving Ax = b


factor A as a product of simple matrices (usually 2 or 3):

A = A1 A2 · · · Ak

(Ai diagonal, upper or lower triangular, etc.)

compute x = A^{−1}b = Ak^{−1} · · · A2^{−1} A1^{−1} b by solving k easy equations

A1 x1 = b,    A2 x2 = x1,    . . . ,    Ak x = x_{k−1}

cost of factorization step usually dominates cost of solve step

equations with multiple righthand sides

Ax1 = b1,    Ax2 = b2,    . . . ,    Axm = bm

cost: one factorization plus m solves

LU factorization
every nonsingular matrix A can be factored as

A = P L U

with P a permutation matrix, L lower triangular, U upper triangular

cost: (2/3)n^3 flops

Solving linear equations by LU factorization.

given a set of linear equations Ax = b, with A nonsingular.
1. LU factorization. Factor A as A = P LU ((2/3)n^3 flops).
2. Permutation. Solve P z1 = b (0 flops).
3. Forward substitution. Solve L z2 = z1 (n^2 flops).
4. Backward substitution. Solve U x = z2 (n^2 flops).

cost: (2/3)n^3 + 2n^2 ≈ (2/3)n^3 for large n

sparse LU factorization

A = P1 L U P2

adding permutation matrix P2 offers possibility of sparser L, U (hence, cheaper factor and solve steps)
P1 and P2 chosen (heuristically) to yield sparse L, U
choice of P1 and P2 depends on sparsity pattern and values of A
cost is usually much less than (2/3)n^3; exact value depends in a complicated way on n, number of zeros in A, sparsity pattern

Cholesky factorization
every positive definite A can be factored as

A = L L^T

with L lower triangular

cost: (1/3)n^3 flops

Solving linear equations by Cholesky factorization.

given a set of linear equations Ax = b, with A ∈ Sn++.
1. Cholesky factorization. Factor A as A = L L^T ((1/3)n^3 flops).
2. Forward substitution. Solve L z1 = b (n^2 flops).
3. Backward substitution. Solve L^T x = z1 (n^2 flops).

cost: (1/3)n^3 + 2n^2 ≈ (1/3)n^3 for large n
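The factor-solve steps map directly onto numpy (note that `np.linalg.solve` does not exploit triangularity; `scipy.linalg.solve_triangular` would, which is how the n^2 solve cost is realized in practice):

```python
import numpy as np

rng = np.random.default_rng(5)
M = rng.standard_normal((6, 6))
A = M @ M.T + 6 * np.eye(6)        # positive definite by construction
b = rng.standard_normal(6)

L = np.linalg.cholesky(A)          # A = L L^T, L lower triangular
z1 = np.linalg.solve(L, b)         # forward substitution: L z1 = b
x = np.linalg.solve(L.T, z1)       # backward substitution: L^T x = z1
```

For multiple righthand sides, `L` is computed once and only the two triangular solves are repeated.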

sparse Cholesky factorization

A = P L L^T P^T

adding permutation matrix P offers possibility of sparser L
P chosen (heuristically) to yield sparse L
choice of P only depends on sparsity pattern of A (unlike sparse LU)
cost is usually much less than (1/3)n^3; exact value depends in a complicated way on n, number of zeros in A, sparsity pattern

LDL^T factorization

every nonsingular symmetric matrix A can be factored as

A = P L D L^T P^T

with P a permutation matrix, L lower triangular, D block diagonal with 1×1 or 2×2 diagonal blocks

cost: (1/3)n^3

cost of solving symmetric sets of linear equations by LDL^T factorization: (1/3)n^3 + 2n^2 ≈ (1/3)n^3 for large n

for sparse A, can choose P to yield sparse L; cost ≪ (1/3)n^3

Equations with structured sub-blocks




[ A11  A12 ] [ x1 ]   [ b1 ]
[ A21  A22 ] [ x2 ] = [ b2 ]        (1)

variables x1 ∈ Rn1, x2 ∈ Rn2; blocks Aij ∈ Rni×nj

if A11 is nonsingular, can eliminate x1: x1 = A11^{−1}(b1 − A12 x2); to compute x2, solve

(A22 − A21 A11^{−1} A12) x2 = b2 − A21 A11^{−1} b1

Solving linear equations by block elimination.

given a nonsingular set of linear equations (1), with A11 nonsingular.
1. Form A11^{−1} A12 and A11^{−1} b1.
2. Form S = A22 − A21 A11^{−1} A12 and b̃ = b2 − A21 A11^{−1} b1.
3. Determine x2 by solving S x2 = b̃.
4. Determine x1 by solving A11 x1 = b1 − A12 x2.
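The four steps above can be sketched directly in numpy (random nonsingular blocks chosen only for illustration):

```python
import numpy as np

def block_elim_solve(A11, A12, A21, A22, b1, b2):
    """Solve [[A11, A12], [A21, A22]] [x1; x2] = [b1; b2] by block elimination."""
    Z = np.linalg.solve(A11, np.column_stack([A12, b1]))   # step 1, one factorization
    A11iA12, A11ib1 = Z[:, :-1], Z[:, -1]
    S = A22 - A21 @ A11iA12                                # step 2: Schur complement
    bt = b2 - A21 @ A11ib1
    x2 = np.linalg.solve(S, bt)                            # step 3
    x1 = np.linalg.solve(A11, b1 - A12 @ x2)               # step 4
    return x1, x2

rng = np.random.default_rng(6)
n1, n2 = 4, 3
A11 = rng.standard_normal((n1, n1)) + n1 * np.eye(n1)
A12 = rng.standard_normal((n1, n2))
A21 = rng.standard_normal((n2, n1))
A22 = rng.standard_normal((n2, n2)) + n2 * np.eye(n2)
b1, b2 = rng.standard_normal(n1), rng.standard_normal(n2)
x1, x2 = block_elim_solve(A11, A12, A21, A22, b1, b2)
```

The payoff comes when A11 is structured (diagonal, banded, sparse), so steps 1 and 4 are cheap, as the flop counts on the next slide show.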

dominant terms in flop count

step 1: f + n2 s (f is cost of factoring A11; s is cost of solve step)
step 2: 2 n2^2 n1 (cost dominated by product of A21 and A11^{−1}A12)
step 3: (2/3) n2^3

total: f + n2 s + 2 n2^2 n1 + (2/3) n2^3

examples

general A11 (f = (2/3)n1^3, s = 2n1^2): no gain over standard method

#flops = (2/3)n1^3 + 2n1^2 n2 + 2n2^2 n1 + (2/3)n2^3 = (2/3)(n1 + n2)^3

block elimination is useful for structured A11 (f ≪ n1^3)
for example, diagonal A11 (f = 0, s = n1): #flops ≈ 2n2^2 n1 + (2/3)n2^3

Structured matrix plus low rank term


(A + BC) x = b

A ∈ Rn×n, B ∈ Rn×p, C ∈ Rp×n
assume A has structure (Ax = b easy to solve)

first write as

[ A  B  ] [ x ]   [ b ]
[ C  −I ] [ y ] = [ 0 ]

now apply block elimination: solve

(I + C A^{−1} B) y = C A^{−1} b,

then solve Ax = b − By

this proves the matrix inversion lemma: if A and A + BC nonsingular,

(A + BC)^{−1} = A^{−1} − A^{−1} B (I + C A^{−1} B)^{−1} C A^{−1}

example: A diagonal, B, C dense

method 1: form D = A + BC, then solve Dx = b; cost: (2/3)n^3 + 2pn^2

method 2 (via matrix inversion lemma): solve

(I + C A^{−1} B) y = C A^{−1} b,        (2)

then compute x = A^{−1} b − A^{−1} B y

total cost is dominated by (2): 2p^2 n + (2/3)p^3 (i.e., linear in n)
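Method 2 for the diagonal-plus-low-rank case can be sketched and checked against a direct dense solve (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 200, 5
d = rng.uniform(1, 2, n)                     # A = diag(d): solves cost n flops
B = rng.standard_normal((n, p))
C = rng.standard_normal((p, n))
b = rng.standard_normal(n)

# method 2: solve (I + C A^{-1} B) y = C A^{-1} b, then x = A^{-1} b - A^{-1} B y
Ainv_b = b / d
Ainv_B = B / d[:, None]
y = np.linalg.solve(np.eye(p) + C @ Ainv_B, C @ Ainv_b)   # small p x p system
x = Ainv_b - Ainv_B @ y

# method 1 for comparison: form the full n x n matrix and solve directly
x_direct = np.linalg.solve(np.diag(d) + B @ C, b)
```

Only a p×p system is factored, so the cost of method 2 grows linearly in n as claimed.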

Underdetermined linear equations


if A ∈ Rp×n with p < n, rank A = p,

{x | Ax = b} = {F z + x̂ | z ∈ Rn−p}

x̂ is (any) particular solution
columns of F ∈ Rn×(n−p) span nullspace of A
there exist several numerical methods for computing F (QR factorization, rectangular LU factorization, . . . )

Convex Optimization Boyd & Vandenberghe

10. Unconstrained minimization


terminology and assumptions
gradient descent method
steepest descent method
Newton's method
self-concordant functions
implementation


Unconstrained minimization
minimize f(x)

f convex, twice continuously differentiable (hence dom f open)
we assume optimal value p⋆ = inf_x f(x) is attained (and finite)

unconstrained minimization methods

produce sequence of points x^(k) ∈ dom f, k = 0, 1, . . . , with f(x^(k)) → p⋆
can be interpreted as iterative methods for solving optimality condition ∇f(x⋆) = 0

Initial point and sublevel set


algorithms in this chapter require a starting point x^(0) such that

x^(0) ∈ dom f
sublevel set S = {x | f(x) ≤ f(x^(0))} is closed

2nd condition is hard to verify, except when all sublevel sets are closed:

equivalent to condition that epi f is closed
true if dom f = Rn
true if f(x) → ∞ as x → bd dom f

examples of differentiable functions with closed sublevel sets:

f(x) = log( Σ_{i=1}^m exp(ai^T x + bi) ),    f(x) = −Σ_{i=1}^m log(bi − ai^T x)

Strong convexity and implications


f is strongly convex on S if there exists an m > 0 such that

∇²f(x) ⪰ mI  for all x ∈ S

implications

for x, y ∈ S,

f(y) ≥ f(x) + ∇f(x)^T (y − x) + (m/2)||x − y||2^2

hence, S is bounded

p⋆ > −∞, and for x ∈ S,

f(x) − p⋆ ≤ (1/(2m)) ||∇f(x)||2^2

useful as stopping criterion (if you know m)

Descent methods
x^(k+1) = x^(k) + t^(k) Δx^(k)    with f(x^(k+1)) < f(x^(k))

other notations: x+ = x + tΔx, x := x + tΔx
Δx is the step, or search direction; t is the step size, or step length
from convexity, f(x+) < f(x) implies ∇f(x)^T Δx < 0 (i.e., Δx is a descent direction)

General descent method.

given a starting point x ∈ dom f.
repeat
1. Determine a descent direction Δx.
2. Line search. Choose a step size t > 0.
3. Update. x := x + tΔx.
until stopping criterion is satisfied.

Line search types


exact line search: t = argmin_{t>0} f(x + tΔx)

backtracking line search (with parameters α ∈ (0, 1/2), β ∈ (0, 1))

starting at t = 1, repeat t := βt until

f(x + tΔx) < f(x) + αt ∇f(x)^T Δx

graphical interpretation: backtrack until t ≤ t0

[figure: f(x + tΔx) versus t, with the lines f(x) + t∇f(x)^T Δx and f(x) + αt∇f(x)^T Δx]

Gradient descent method


general descent method with Δx = −∇f(x)

given a starting point x ∈ dom f.
repeat
1. Δx := −∇f(x).
2. Line search. Choose step size t via exact or backtracking line search.
3. Update. x := x + tΔx.
until stopping criterion is satisfied.

stopping criterion usually of the form ||∇f(x)||2 ≤ ε

convergence result: for strongly convex f,

f(x^(k)) − p⋆ ≤ c^k (f(x^(0)) − p⋆)

c ∈ (0, 1) depends on m, x^(0), line search type

very simple, but often very slow; rarely used in practice
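The gradient method with backtracking line search is a few lines of code; a minimal sketch, run on the quadratic example treated next (γ = 10):

```python
import numpy as np

def grad_descent(f, grad, x0, alpha=0.1, beta=0.7, tol=1e-8, max_iter=500):
    """Gradient method with backtracking line search (illustrative sketch)."""
    x = np.asarray(x0, float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:      # stopping criterion ||grad f||_2 <= tol
            break
        t = 1.0
        # backtrack until f(x - t g) < f(x) - alpha * t * ||g||^2
        while f(x - t * g) >= f(x) - alpha * t * (g @ g):
            t *= beta
        x = x - t * g
    return x

# quadratic example from the slides: f(x) = (1/2)(x1^2 + gamma*x2^2), gamma = 10
gamma = 10.0
f = lambda x: 0.5 * (x[0] ** 2 + gamma * x[1] ** 2)
grad = lambda x: np.array([x[0], gamma * x[1]])
x_star = grad_descent(f, grad, [gamma, 1.0])
```

The iterates zig-zag along the narrow valley, which is exactly the slow linear convergence the next slides quantify.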

quadratic problem in R2

f(x) = (1/2)(x1^2 + γ x2^2)    (γ > 0)

with exact line search, starting at x^(0) = (γ, 1):

x1^(k) = γ ( (γ − 1)/(γ + 1) )^k,    x2^(k) = ( −(γ − 1)/(γ + 1) )^k

very slow if γ ≫ 1 or γ ≪ 1

[figure: iterates x^(0), x^(1), . . . for γ = 10]

nonquadratic example

f(x1, x2) = e^{x1 + 3x2 − 0.1} + e^{x1 − 3x2 − 0.1} + e^{−x1 − 0.1}

[figure: iterates with backtracking line search (left) and exact line search (right)]

a problem in R100

f(x) = c^T x − Σ_{i=1}^{500} log(bi − ai^T x)

[figure: f(x^(k)) − p⋆ versus k for exact and backtracking line search]

linear convergence, i.e., a straight line on a semilog plot

Steepest descent method


normalized steepest descent direction (at x, for norm || · ||):

Δxnsd = argmin{∇f(x)^T v | ||v|| = 1}

interpretation: for small v, f(x + v) ≈ f(x) + ∇f(x)^T v; direction Δxnsd is unit-norm step with most negative directional derivative

(unnormalized) steepest descent direction

Δxsd = ||∇f(x)||∗ Δxnsd

satisfies ∇f(x)^T Δxsd = −||∇f(x)||∗^2

steepest descent method

general descent method with Δx = Δxsd
convergence properties similar to gradient descent

examples

Euclidean norm: Δxsd = −∇f(x)
quadratic norm ||x||P = (x^T P x)^{1/2} (P ∈ Sn++): Δxsd = −P^{−1}∇f(x)
ℓ1-norm: Δxsd = −(∂f(x)/∂xi) ei, where |∂f(x)/∂xi| = ||∇f(x)||∞

[figure: unit balls and normalized steepest descent directions for a quadratic norm and the ℓ1-norm]

choice of norm for steepest descent

[figure: steepest descent with backtracking line search for two quadratic norms; ellipses show {x | ||x − x^(k)||P = 1}]

equivalent interpretation of steepest descent with quadratic norm || · ||P: gradient descent after change of variables x̄ = P^{1/2}x

shows choice of P has strong effect on speed of convergence

Newton step
Δxnt = −∇²f(x)^{−1} ∇f(x)

interpretations

x + Δxnt minimizes second order approximation

f̂(x + v) = f(x) + ∇f(x)^T v + (1/2) v^T ∇²f(x) v

x + Δxnt solves linearized optimality condition

∇f(x + v) ≈ ∇f̂(x + v) = ∇f(x) + ∇²f(x) v = 0

[figure: f and f̂ through (x, f(x)) and (x + Δxnt, f(x + Δxnt)); f′ and f̂′ likewise]

Δxnt is steepest descent direction at x in local Hessian norm

||u||_{∇²f(x)} = ( u^T ∇²f(x) u )^{1/2}

[figure: dashed lines are contour lines of f; ellipse is {x + v | v^T ∇²f(x) v = 1}; arrow shows −∇f(x); points x + Δxnsd and x + Δxnt]

Newton decrement
λ(x) = ( ∇f(x)^T ∇²f(x)^{−1} ∇f(x) )^{1/2}

a measure of the proximity of x to x⋆

properties

gives an estimate of f(x) − p⋆, using quadratic approximation f̂:

f(x) − inf_y f̂(y) = (1/2) λ(x)^2

equal to the norm of the Newton step in the quadratic Hessian norm:

λ(x) = ( Δxnt^T ∇²f(x) Δxnt )^{1/2}

directional derivative in the Newton direction: ∇f(x)^T Δxnt = −λ(x)^2

affine invariant (unlike ||∇f(x)||2)

Newton's method

given a starting point x ∈ dom f, tolerance ε > 0.
repeat
1. Compute the Newton step and decrement.
   Δxnt := −∇²f(x)^{−1}∇f(x);  λ^2 := ∇f(x)^T ∇²f(x)^{−1}∇f(x).
2. Stopping criterion. quit if λ^2/2 ≤ ε.
3. Line search. Choose step size t by backtracking line search.
4. Update. x := x + tΔxnt.

affine invariant, i.e., independent of linear changes of coordinates: Newton iterates for f̃(y) = f(Ty) with starting point y^(0) = T^{−1}x^(0) are y^(k) = T^{−1}x^(k)

Classical convergence analysis


assumptions

f strongly convex on S with constant m
∇²f is Lipschitz continuous on S, with constant L > 0:

||∇²f(x) − ∇²f(y)||2 ≤ L ||x − y||2

(L measures how well f can be approximated by a quadratic function)

outline: there exist constants η ∈ (0, m^2/L), γ > 0 such that

if ||∇f(x)||2 ≥ η, then f(x^(k+1)) − f(x^(k)) ≤ −γ
if ||∇f(x)||2 < η, then

(L/(2m^2)) ||∇f(x^(k+1))||2 ≤ ( (L/(2m^2)) ||∇f(x^(k))||2 )^2

damped Newton phase (||∇f(x)||2 ≥ η)

most iterations require backtracking steps
function value decreases by at least γ
if p⋆ > −∞, this phase ends after at most (f(x^(0)) − p⋆)/γ iterations

quadratically convergent phase (||∇f(x)||2 < η)

all iterations use step size t = 1
||∇f(x)||2 converges to zero quadratically: if ||∇f(x^(k))||2 < η, then

(L/(2m^2)) ||∇f(x^(l))||2 ≤ ( (L/(2m^2)) ||∇f(x^(k))||2 )^{2^{l−k}} ≤ (1/2)^{2^{l−k}},    l ≥ k

conclusion: number of iterations until f(x) − p⋆ ≤ ε is bounded above by

(f(x^(0)) − p⋆)/γ + log2 log2(ε0/ε)

γ, ε0 are constants that depend on m, L, x^(0)
second term is small (of the order of 6) and almost constant for practical purposes
in practice, constants m, L (hence γ, ε0) are usually unknown
provides qualitative insight in convergence properties (i.e., explains two algorithm phases)

Examples

example in R2 (page 10–9)

[figure: iterates x^(0), x^(1); f(x^(k)) − p⋆ versus k]

backtracking parameters α = 0.1, β = 0.7
converges in only 5 steps
quadratic local convergence

example in R100 (page 10–10)

[figure: f(x^(k)) − p⋆ and step size t^(k) versus k, for backtracking and exact line search]

backtracking parameters α = 0.01, β = 0.5
backtracking line search almost as fast as exact l.s. (and much simpler)
clearly shows two phases in algorithm

example in R10000 (with sparse ai)

f(x) = −Σ_{i=1}^{10000} log(1 − xi^2) − Σ_{i=1}^{100000} log(bi − ai^T x)

[figure: f(x^(k)) − p⋆ versus k]

backtracking parameters α = 0.01, β = 0.5
performance similar as for small examples

Self-concordance
shortcomings of classical convergence analysis

depends on unknown constants (m, L, . . . )
bound is not affinely invariant, although Newton's method is

convergence analysis via self-concordance (Nesterov and Nemirovski)

does not depend on any unknown constants
gives affine-invariant bound
applies to special class of convex functions (self-concordant functions)
developed to analyze polynomial-time interior-point methods for convex optimization

Self-concordant functions
definition

convex f : R → R is self-concordant if |f′′′(x)| ≤ 2f′′(x)^{3/2} for all x ∈ dom f
f : Rn → R is self-concordant if g(t) = f(x + tv) is self-concordant for all x ∈ dom f, v ∈ Rn

examples on R

linear and quadratic functions
negative logarithm f(x) = −log x
negative entropy plus negative logarithm: f(x) = x log x − log x

affine invariance: if f : R → R is s.c., then f̃(y) = f(ay + b) is s.c.:

f̃′′′(y) = a^3 f′′′(ay + b),    f̃′′(y) = a^2 f′′(ay + b)

Self-concordant calculus
properties

preserved under positive scaling α ≥ 1, and sum
preserved under composition with affine function
if g is convex with dom g = R++ and |g′′′(x)| ≤ 3g′′(x)/x then

f(x) = −log(−g(x)) − log x

is self-concordant

examples: properties can be used to show that the following are s.c.

f(x) = −Σ_{i=1}^m log(bi − ai^T x) on {x | ai^T x < bi, i = 1, . . . , m}
f(X) = −log det X on Sn++
f(x) = −log(y^2 − x^T x) on {(x, y) | ||x||2 < y}

Convergence analysis for self-concordant functions


summary: there exist constants η ∈ (0, 1/4], γ > 0 such that

if λ(x) > η, then f(x^(k+1)) − f(x^(k)) ≤ −γ
if λ(x) ≤ η, then 2λ(x^(k+1)) ≤ ( 2λ(x^(k)) )^2

(η and γ only depend on backtracking parameters α, β)

complexity bound: number of Newton iterations bounded by

(f(x^(0)) − p⋆)/γ + log2 log2(1/ε)

for α = 0.1, β = 0.8, ε = 10^{−10}, bound evaluates to 375(f(x^(0)) − p⋆) + 6

numerical example: 150 randomly generated instances of

minimize f(x) = −Σ_{i=1}^m log(bi − ai^T x)

with three problem-size families: m = 100, n = 50; m = 1000, n = 500; m = 1000, n = 50

[figure: number of iterations versus f(x^(0)) − p⋆]

number of iterations much smaller than 375(f(x^(0)) − p⋆) + 6
bound of the form c(f(x^(0)) − p⋆) + 6 with smaller c (empirically) valid

Implementation

main effort in each iteration: evaluate derivatives and solve Newton system

    HΔx = −g

where H = ∇²f(x), g = ∇f(x)

via Cholesky factorization

    H = LLᵀ,    Δxnt = −L⁻ᵀL⁻¹g,    λ(x) = ‖L⁻¹g‖₂

cost (1/3)n³ flops for unstructured system
cost ≪ (1/3)n³ if H sparse, banded

10–29
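The Cholesky recipe above is two triangular solves; a minimal numpy sketch (helper name is ours):

```python
import numpy as np

def newton_step_chol(H, g):
    """Solve H dx = -g via Cholesky H = L L^T; also return the
    Newton decrement lambda(x) = ||L^{-1} g||_2."""
    L = np.linalg.cholesky(H)
    z = np.linalg.solve(L, g)          # forward solve: L z = g
    dx = -np.linalg.solve(L.T, z)      # backward solve: L^T dx = -z
    return dx, np.linalg.norm(z)
```

Note that λ(x) = ‖L⁻¹g‖₂ = (gᵀH⁻¹g)^{1/2} comes for free from the intermediate vector z.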

example of dense Newton system with structure

    f(x) = Σ_{i=1}^n ψi(xi) + ψ0(Ax + b),    H = D + AᵀH0A

assume A ∈ R^{p×n}, dense, with p ≪ n
D diagonal with diagonal elements ψi′′(xi); H0 = ∇²ψ0(Ax + b)

method 1: form H, solve via dense Cholesky factorization (cost (1/3)n³)

method 2 (page 9–15): factor H0 = L0L0ᵀ; write Newton system as

    DΔx + AᵀL0w = −g,    L0ᵀAΔx − w = 0

eliminate Δx from first equation; compute w and Δx from

    (I + L0ᵀAD⁻¹AᵀL0)w = −L0ᵀAD⁻¹g,    DΔx = −g − AᵀL0w

cost: ≈ 2p²n flops (dominated by computation of L0ᵀAD⁻¹AᵀL0)

10–30
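Method 2 can be sketched as follows (names are ours); for p ≪ n it only ever factors p × p matrices:

```python
import numpy as np

def newton_step_structured(d, A, H0, g):
    """Block-elimination solve of (diag(d) + A^T H0 A) dx = -g for d > 0,
    H0 positive definite, A of shape (p, n) with p << n (method 2 above)."""
    p = A.shape[0]
    L0 = np.linalg.cholesky(H0)
    B = (L0.T @ A) / d                    # L0^T A D^{-1}, a p x n matrix
    S = np.eye(p) + B @ (A.T @ L0)        # I + L0^T A D^{-1} A^T L0
    w = np.linalg.solve(S, -B @ g)
    dx = (-g - A.T @ (L0 @ w)) / d        # D dx = -g - A^T L0 w
    return dx
```

The result agrees with the dense solve of (D + AᵀH0A)Δx = −g, at a cost dominated by forming the p × p matrix S.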

Convex Optimization – Boyd & Vandenberghe

11. Equality constrained minimization

equality constrained minimization
eliminating equality constraints
Newton's method with equality constraints
infeasible start Newton method
implementation

11–1

Equality constrained minimization

    minimize f(x)
    subject to Ax = b

f convex, twice continuously differentiable
A ∈ R^{p×n} with rank A = p
we assume p* is finite and attained

optimality conditions: x* is optimal iff there exists a ν* such that

    ∇f(x*) + Aᵀν* = 0,    Ax* = b

11–2

equality constrained quadratic minimization (with P ∈ S^n_+)

    minimize (1/2)xᵀPx + qᵀx + r
    subject to Ax = b

optimality condition:

    [ P  Aᵀ ] [ x* ]   [ −q ]
    [ A  0  ] [ ν* ] = [  b ]

coefficient matrix is called KKT matrix

KKT matrix is nonsingular if and only if

    Ax = 0, x ≠ 0  ⟹  xᵀPx > 0

equivalent condition for nonsingularity: P + AᵀA ≻ 0

11–3
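Since the KKT system is one linear solve, an equality constrained QP can be solved directly (a minimal sketch; names are ours):

```python
import numpy as np

def solve_eq_qp(P, q, A, b):
    """Solve minimize (1/2) x^T P x + q^T x  s.t.  A x = b by forming
    the KKT system [[P, A^T], [A, 0]] [x; nu] = [-q; b]."""
    n, p = P.shape[0], A.shape[0]
    K = np.block([[P, A.T], [A, np.zeros((p, p))]])
    sol = np.linalg.solve(K, np.concatenate([-q, b]))
    return sol[:n], sol[n:]          # x*, nu*
```

The returned pair satisfies the optimality conditions Px* + q + Aᵀν* = 0 and Ax* = b.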

Eliminating equality constraints

represent solution of {x | Ax = b} as

    {x | Ax = b} = {Fz + x̂ | z ∈ R^{n−p}}

x̂ is (any) particular solution
range of F ∈ R^{n×(n−p)} is nullspace of A (rank F = n − p and AF = 0)

reduced or eliminated problem

    minimize f(Fz + x̂)

an unconstrained problem with variable z ∈ R^{n−p}

from solution z*, obtain x* and ν* as

    x* = Fz* + x̂,    ν* = −(AAᵀ)⁻¹A∇f(x*)

11–4

example: optimal allocation with resource constraint

    minimize f1(x1) + f2(x2) + · · · + fn(xn)
    subject to x1 + x2 + · · · + xn = b

eliminate xn = b − x1 − · · · − x_{n−1}, i.e., choose

    x̂ = b en,    F = [ I ; −1ᵀ ] ∈ R^{n×(n−1)}

reduced problem:

    minimize f1(x1) + · · · + f_{n−1}(x_{n−1}) + fn(b − x1 − · · · − x_{n−1})

(variables x1, . . . , x_{n−1})

11–5

Newton step

Newton step Δxnt of f at feasible x is given by solution v of

    [ ∇²f(x)  Aᵀ ] [ v ]   [ −∇f(x) ]
    [ A       0  ] [ w ] = [    0   ]

interpretations

Δxnt solves second order approximation (with variable v)

    minimize f̂(x + v) = f(x) + ∇f(x)ᵀv + (1/2)vᵀ∇²f(x)v
    subject to A(x + v) = b

Δxnt equations follow from linearizing optimality conditions

    ∇f(x + v) + Aᵀw ≈ ∇f(x) + ∇²f(x)v + Aᵀw = 0,    A(x + v) = b

11–6

Newton decrement

    λ(x) = (Δxntᵀ ∇²f(x) Δxnt)^{1/2} = (−∇f(x)ᵀΔxnt)^{1/2}

properties

gives an estimate of f(x) − p* using quadratic approximation f̂:

    f(x) − inf_{Ay=b} f̂(y) = λ(x)²/2

directional derivative in Newton direction:

    (d/dt) f(x + tΔxnt) |_{t=0} = −λ(x)²

in general, λ(x) ≠ (∇f(x)ᵀ∇²f(x)⁻¹∇f(x))^{1/2}

11–7

Newton's method with equality constraints

given starting point x ∈ dom f with Ax = b, tolerance ε > 0.

repeat
1. Compute the Newton step and decrement Δxnt, λ(x).
2. Stopping criterion. quit if λ²/2 ≤ ε.
3. Line search. Choose step size t by backtracking line search.
4. Update. x := x + tΔxnt.

a feasible descent method: x^(k) feasible and f(x^(k+1)) < f(x^(k))
affine invariant

11–8
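The algorithm above, specialized to equality constrained analytic centering (minimize −Σ log xi subject to Ax = b), can be sketched as follows; names and the specific objective are our choice for illustration:

```python
import numpy as np

def analytic_center(A, b, x0, alpha=0.1, beta=0.7, eps=1e-10):
    """Feasible-start Newton method for minimize -sum(log x) s.t. A x = b;
    x0 must satisfy x0 > 0 and A x0 = b."""
    p, n = A.shape
    x = x0.copy()
    f = lambda z: -np.sum(np.log(z))
    for _ in range(100):
        g, H = -1.0 / x, np.diag(1.0 / x**2)
        K = np.block([[H, A.T], [A, np.zeros((p, p))]])
        dx = np.linalg.solve(K, np.concatenate([-g, np.zeros(p)]))[:n]
        lam2 = -g @ dx                         # Newton decrement squared
        if lam2 / 2 <= eps:
            break
        t = 1.0
        while np.min(x + t * dx) <= 0:         # keep x > 0 (stay in dom f)
            t *= beta
        while f(x + t * dx) > f(x) - alpha * t * lam2:
            t *= beta
        x = x + t * dx
    return x
```

Because AΔxnt = 0 at every step, all iterates stay feasible; at convergence ∇f(x) = −1/x lies (up to tolerance) in the range of Aᵀ.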

Newton's method and elimination

Newton's method for reduced problem

    minimize f̃(z) = f(Fz + x̂)

variables z ∈ R^{n−p}
x̂ satisfies Ax̂ = b; rank F = n − p and AF = 0
Newton's method for f̃, started at z^(0), generates iterates z^(k)

Newton's method with equality constraints

when started at x^(0) = Fz^(0) + x̂, iterates are

    x^(k+1) = Fz^(k+1) + x̂

hence, don't need separate convergence analysis

11–9

Newton step at infeasible points

2nd interpretation of page 11–6 extends to infeasible x (i.e., Ax ≠ b)

linearizing optimality conditions at infeasible x (with x ∈ dom f) gives

    [ ∇²f(x)  Aᵀ ] [ Δxnt ]     [ ∇f(x)  ]
    [ A       0  ] [  w   ] = − [ Ax − b ]        (1)

primal-dual interpretation

write optimality condition as r(y) = 0, where

    y = (x, ν),    r(y) = (∇f(x) + Aᵀν, Ax − b)

linearizing r(y) = 0 gives r(y + Δy) ≈ r(y) + Dr(y)Δy = 0:

    [ ∇²f(x)  Aᵀ ] [ Δxnt ]     [ ∇f(x) + Aᵀν ]
    [ A       0  ] [ Δνnt ] = − [    Ax − b    ]

same as (1) with w = ν + Δνnt

11–10

Infeasible start Newton method

given starting point x ∈ dom f, ν, tolerance ε > 0, α ∈ (0, 1/2), β ∈ (0, 1).

repeat
1. Compute primal and dual Newton steps Δxnt, Δνnt.
2. Backtracking line search on ‖r‖₂.
   t := 1.
   while ‖r(x + tΔxnt, ν + tΔνnt)‖₂ > (1 − αt)‖r(x, ν)‖₂, t := βt.
3. Update. x := x + tΔxnt, ν := ν + tΔνnt.
until Ax = b and ‖r(x, ν)‖₂ ≤ ε.

not a descent method: f(x^(k+1)) > f(x^(k)) is possible

directional derivative of ‖r(y)‖₂ in direction Δy = (Δxnt, Δνnt) is

    (d/dt) ‖r(y + tΔy)‖₂ |_{t=0} = −‖r(y)‖₂

11–11
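For the same analytic centering problem, the infeasible start variant can be sketched as below (names ours); the starting point only needs x⁰ > 0, and once a full step t = 1 is taken the equality residual Ax − b drops to zero:

```python
import numpy as np

def infeasible_newton(A, b, x0, nu0, alpha=0.1, beta=0.7, eps=1e-8):
    """Infeasible-start Newton method for minimize -sum(log x) s.t. A x = b;
    x0 only needs x0 > 0 (A x0 = b is not required)."""
    p, n = A.shape
    x, nu = x0.copy(), nu0.copy()
    r = lambda x, nu: np.concatenate([-1.0 / x + A.T @ nu, A @ x - b])
    for _ in range(100):
        K = np.block([[np.diag(1.0 / x**2), A.T], [A, np.zeros((p, p))]])
        step = np.linalg.solve(K, -r(x, nu))
        dx, dnu = step[:n], step[n:]
        t = 1.0
        while np.min(x + t * dx) <= 0:                      # keep x > 0
            t *= beta
        while np.linalg.norm(r(x + t * dx, nu + t * dnu)) > \
                (1 - alpha * t) * np.linalg.norm(r(x, nu)):
            t *= beta
        x, nu = x + t * dx, nu + t * dnu
        if np.linalg.norm(A @ x - b) < 1e-10 and np.linalg.norm(r(x, nu)) <= eps:
            break
    return x, nu
```

At termination both residual blocks of r(x, ν) = (−1/x + Aᵀν, Ax − b) are small.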

Solving KKT systems

    [ H  Aᵀ ] [ v ]     [ g ]
    [ A  0  ] [ w ] = − [ h ]

solution methods

LDLᵀ factorization
elimination (if H nonsingular)

    AH⁻¹Aᵀw = h − AH⁻¹g,    Hv = −(g + Aᵀw)

elimination with singular H: write as

    [ H + AᵀQA  Aᵀ ] [ v ]     [ g + AᵀQh ]
    [ A         0  ] [ w ] = − [     h    ]

with Q ⪰ 0 for which H + AᵀQA ≻ 0, and apply elimination

11–12
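The elimination route (nonsingular H) is a pair of solves with H plus one p × p solve; a minimal sketch, names ours:

```python
import numpy as np

def kkt_solve_elim(H, A, g, h):
    """Solve [[H, A^T], [A, 0]] [v; w] = -[g; h] by elimination,
    assuming H is nonsingular:
      A H^{-1} A^T w = h - A H^{-1} g,   H v = -(g + A^T w)."""
    HiAt = np.linalg.solve(H, A.T)            # H^{-1} A^T
    Hig = np.linalg.solve(H, g)               # H^{-1} g
    w = np.linalg.solve(A @ HiAt, h - A @ Hig)
    v = -np.linalg.solve(H, g + A.T @ w)
    return v, w
```

The answer matches a direct factorization of the full KKT matrix.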

Equality constrained analytic centering

primal problem: minimize −Σ_{i=1}^n log xi subject to Ax = b

dual problem: maximize −bᵀν + Σ_{i=1}^n log(Aᵀν)i + n

three methods for an example with A ∈ R^{100×500}, different starting points

1. Newton method with equality constraints (requires x^(0) ≻ 0, Ax^(0) = b)

[figure: f(x^(k)) − p* versus k, from about 10⁵ down past 10⁻⁵ within 20 iterations]

11–13

2. Newton method applied to dual problem (requires Aᵀν^(0) ≻ 0)

[figure: p* − g(ν^(k)) versus k]

3. infeasible start Newton method (requires x^(0) ≻ 0)

[figure: ‖r(x^(k), ν^(k))‖₂ versus k, converging within about 25 iterations]

11–14

complexity per iteration of three methods is identical

1. use block elimination to solve KKT system

    [ diag(x)⁻²  Aᵀ ] [ Δx ]   [ diag(x)⁻¹1 ]
    [ A          0  ] [ w  ] = [      0     ]

reduces to solving A diag(x)²Aᵀw = b

2. solve Newton system

    A diag(Aᵀν)⁻²AᵀΔν = −b + A diag(Aᵀν)⁻¹1

3. use block elimination to solve KKT system

    [ diag(x)⁻²  Aᵀ ] [ Δx ]   [ diag(x)⁻¹1 − Aᵀν ]
    [ A          0  ] [ w  ] = [       b − Ax     ]

reduces to solving A diag(x)²Aᵀw = 2Ax − b

conclusion: in each case, solve ADAᵀw = h with D positive diagonal

11–15

Network flow optimization

    minimize Σ_{i=1}^n φi(xi)
    subject to Ax = b

directed graph with n arcs, p + 1 nodes
xi: flow through arc i; φi: cost flow function for arc i (with φi′′(x) > 0)

node-incidence matrix Ã ∈ R^{(p+1)×n} defined as

    Ãij = 1 if arc j leaves node i;  −1 if arc j enters node i;  0 otherwise

reduced node-incidence matrix A ∈ R^{p×n} is Ã with last row removed
b ∈ R^p is (reduced) source vector
rank A = p if graph is connected

11–16

KKT system

    [ H  Aᵀ ] [ v ]     [ g ]
    [ A  0  ] [ w ] = − [ h ]

H = diag(φ1′′(x1), . . . , φn′′(xn)), positive diagonal

solve via elimination:

    AH⁻¹Aᵀw = h − AH⁻¹g,    Hv = −(g + Aᵀw)

sparsity pattern of coefficient matrix is given by graph connectivity

    (AH⁻¹Aᵀ)ij ≠ 0  ⟺  (AAᵀ)ij ≠ 0  ⟺  nodes i and j are connected by an arc

11–17
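A tiny sketch makes the sparsity claim concrete; the 4-node cycle graph and all names below are invented for illustration:

```python
import numpy as np

# 4 nodes, 4 arcs forming a cycle: (tail, head) pairs
arcs = [(0, 1), (1, 2), (2, 3), (3, 0)]
num_nodes, n = 4, len(arcs)

Atil = np.zeros((num_nodes, n))            # node-incidence matrix
for j, (tail, head) in enumerate(arcs):
    Atil[tail, j] = 1.0                    # arc j leaves node `tail`
    Atil[head, j] = -1.0                   # arc j enters node `head`

A = Atil[:-1, :]                           # reduced incidence matrix
assert np.linalg.matrix_rank(A) == num_nodes - 1   # graph is connected

H = np.diag(np.random.rand(n) + 0.5)       # H = diag(phi_i''(x_i)), positive
S = A @ np.linalg.inv(H) @ A.T             # coefficient matrix after elimination

# (S)_ij is nonzero exactly when nodes i and j share an arc:
assert S[0, 1] != 0 and S[1, 2] != 0       # adjacent node pairs
assert abs(S[0, 2]) < 1e-15                # nodes 0 and 2 share no arc
```

So AH⁻¹Aᵀ has the sparsity of the (reduced) graph Laplacian, regardless of the flow cost functions.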

Analytic center of linear matrix inequality

    minimize −log det X
    subject to tr(AiX) = bi,  i = 1, . . . , p

variable X ∈ S^n

optimality conditions

    X* ≻ 0,    −(X*)⁻¹ + Σ_{j=1}^p νj*Aj = 0,    tr(AiX*) = bi, i = 1, . . . , p

Newton equation at feasible X:

    X⁻¹ ΔX X⁻¹ + Σ_{j=1}^p wjAj = X⁻¹,    tr(AiΔX) = 0, i = 1, . . . , p

follows from linear approximation (X + ΔX)⁻¹ ≈ X⁻¹ − X⁻¹ΔX X⁻¹
n(n + 1)/2 + p variables ΔX, w

11–18

solution by block elimination

eliminate ΔX from first equation: ΔX = X − Σ_{j=1}^p wj XAjX
substitute ΔX in second equation

    Σ_{j=1}^p tr(AiXAjX) wj = bi,  i = 1, . . . , p        (2)

a dense positive definite set of linear equations with variable w ∈ R^p

flop count (dominant terms) using Cholesky factorization X = LLᵀ:

form p products LᵀAjL: (3/2)pn³
form p(p + 1)/2 inner products tr((LᵀAiL)(LᵀAjL)): (1/2)p²n²
solve (2) via Cholesky factorization: (1/3)p³

11–19

Convex Optimization – Boyd & Vandenberghe

12. Interior-point methods

inequality constrained minimization
logarithmic barrier function and central path
barrier method
feasibility and phase I methods
complexity analysis via self-concordance
generalized inequalities

12–1

Inequality constrained minimization

    minimize f0(x)
    subject to fi(x) ≤ 0,  i = 1, . . . , m        (1)
               Ax = b

fi convex, twice continuously differentiable
A ∈ R^{p×n} with rank A = p
we assume p* is finite and attained
we assume problem is strictly feasible: there exists x̃ with

    x̃ ∈ dom f0,    fi(x̃) < 0, i = 1, . . . , m,    Ax̃ = b

hence, strong duality holds and dual optimum is attained

12–2

Examples

LP, QP, QCQP, GP

entropy maximization with linear inequality constraints

    minimize Σ_{i=1}^n xi log xi
    subject to Fx ⪯ g
               Ax = b

with dom f0 = R^n_++

differentiability may require reformulating the problem, e.g.,
piecewise-linear minimization or ℓ∞-norm approximation via LP
SDPs and SOCPs are better handled as problems with generalized
inequalities (see later)

12–3

Logarithmic barrier

reformulation of (1) via indicator function:

    minimize f0(x) + Σ_{i=1}^m I−(fi(x))
    subject to Ax = b

where I−(u) = 0 if u ≤ 0, I−(u) = ∞ otherwise (indicator function of R−)

approximation via logarithmic barrier

    minimize f0(x) − (1/t) Σ_{i=1}^m log(−fi(x))
    subject to Ax = b

an equality constrained problem
for t > 0, −(1/t) log(−u) is a smooth approximation of I−
approximation improves as t → ∞

[figure: −(1/t) log(−u) versus u for several values of t]

12–4

logarithmic barrier function

    φ(x) = −Σ_{i=1}^m log(−fi(x)),    dom φ = {x | f1(x) < 0, . . . , fm(x) < 0}

convex (follows from composition rules)
twice continuously differentiable, with derivatives

    ∇φ(x) = Σ_{i=1}^m (1/(−fi(x))) ∇fi(x)

    ∇²φ(x) = Σ_{i=1}^m (1/fi(x)²) ∇fi(x)∇fi(x)ᵀ + Σ_{i=1}^m (1/(−fi(x))) ∇²fi(x)

12–5
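For linear constraints aiᵀx ≤ bi the formulas above collapse to matrix products; a minimal sketch (names ours), with the gradient checked against finite differences:

```python
import numpy as np

def log_barrier(A, b, x):
    """Value, gradient, and Hessian of phi(x) = -sum(log(b - A x)),
    the log barrier for the inequalities a_i^T x <= b_i."""
    d = b - A @ x                       # slacks -f_i(x)
    phi = -np.sum(np.log(d))
    grad = A.T @ (1.0 / d)              # sum_i a_i / d_i
    hess = (A.T * (1.0 / d**2)) @ A     # sum_i a_i a_i^T / d_i^2
    return phi, grad, hess
```

For linear fi the second sum in ∇²φ vanishes (∇²fi = 0), leaving only the rank-one terms.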

Central path

for t > 0, define x*(t) as the solution of

    minimize tf0(x) + φ(x)
    subject to Ax = b

(for now, assume x*(t) exists and is unique for each t > 0)

central path is {x*(t) | t > 0}

example: central path for an LP

    minimize cᵀx
    subject to aiᵀx ≤ bi,  i = 1, . . . , 6

hyperplane cᵀx = cᵀx*(t) is tangent to level curve of φ through x*(t)

[figure: central path inside a polyhedron, marking x* and x*(10)]

12–6

Dual points on central path

x = x*(t) if there exists a w such that

    t∇f0(x) + Σ_{i=1}^m (1/(−fi(x))) ∇fi(x) + Aᵀw = 0,    Ax = b

therefore, x*(t) minimizes the Lagrangian

    L(x, λ*(t), ν*(t)) = f0(x) + Σ_{i=1}^m λi*(t)fi(x) + ν*(t)ᵀ(Ax − b)

where we define λi*(t) = 1/(−tfi(x*(t))) and ν*(t) = w/t

this confirms the intuitive idea that f0(x*(t)) → p* if t → ∞:

    p* ≥ g(λ*(t), ν*(t)) = L(x*(t), λ*(t), ν*(t)) = f0(x*(t)) − m/t

12–7

Interpretation via KKT conditions

x = x*(t), λ = λ*(t), ν = ν*(t) satisfy

1. primal constraints: fi(x) ≤ 0, i = 1, . . . , m, Ax = b
2. dual constraints: λ ⪰ 0
3. approximate complementary slackness: −λifi(x) = 1/t, i = 1, . . . , m
4. gradient of Lagrangian with respect to x vanishes:

    ∇f0(x) + Σ_{i=1}^m λi∇fi(x) + Aᵀν = 0

difference with KKT is that condition 3 replaces λifi(x) = 0

12–8

Force field interpretation

centering problem (for problem with no equality constraints)

    minimize tf0(x) − Σ_{i=1}^m log(−fi(x))

force field interpretation

tf0(x) is potential of force field F0(x) = −t∇f0(x)
−log(−fi(x)) is potential of force field Fi(x) = (1/fi(x))∇fi(x)

the forces balance at x*(t):

    F0(x*(t)) + Σ_{i=1}^m Fi(x*(t)) = 0

12–9

example

    minimize cᵀx
    subject to aiᵀx ≤ bi,  i = 1, . . . , m

objective force field is constant: F0(x) = −tc
constraint force field decays as inverse distance to constraint hyperplane:

    Fi(x) = −ai/(bi − aiᵀx),    ‖Fi(x)‖₂ = 1/dist(x, Hi)

where Hi = {x | aiᵀx = bi}

[figure: force balance in a polyhedron for t = 1 (objective force −c) and t = 3 (objective force −3c)]

12–10

Barrier method

given strictly feasible x, t := t^(0) > 0, μ > 1, tolerance ε > 0.

repeat
1. Centering step. Compute x*(t) by minimizing tf0 + φ, subject to Ax = b.
2. Update. x := x*(t).
3. Stopping criterion. quit if m/t < ε.
4. Increase t. t := μt.

terminates with f0(x) − p* ≤ ε (stopping criterion follows from
f0(x*(t)) − p* ≤ m/t)
centering usually done using Newton's method, starting at current x
choice of μ involves a trade-off: large μ means fewer outer iterations,
more inner (Newton) iterations; typical values: μ = 10 – 20
several heuristics for choice of t^(0)

12–11
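The algorithm above, specialized to an inequality-form LP (minimize cᵀx subject to Ax ⪯ b, no equality constraints), can be sketched as follows. All names are ours; the inner Newton loop is capped for robustness, which the slides' pseudocode leaves implicit.

```python
import numpy as np

def centering_step(c, A, b, x, t, alpha=0.1, beta=0.7):
    """Newton's method for minimizing t*c^T x - sum(log(b - A x))."""
    def f(z):
        return t * (c @ z) - np.sum(np.log(b - A @ z))
    for _ in range(200):
        d = b - A @ x
        g = t * c + A.T @ (1.0 / d)
        H = (A.T * (1.0 / d**2)) @ A
        dx = -np.linalg.solve(H, g)
        lam2 = -g @ dx                        # Newton decrement squared
        if lam2 / 2 <= 1e-8:
            break
        s = 1.0
        while np.min(b - A @ (x + s * dx)) <= 0:   # stay strictly feasible
            s *= beta
        for _ in range(50):                   # backtracking (capped)
            if f(x + s * dx) <= f(x) - alpha * s * lam2:
                break
            s *= beta
        x = x + s * dx
    return x

def barrier_lp(c, A, b, x0, t0=1.0, mu=20.0, eps=1e-6):
    """Barrier method for minimize c^T x s.t. A x <= b, started at a
    strictly feasible x0; returns x with duality gap m/t below eps."""
    m, t, x = len(b), t0, x0.copy()
    while True:
        x = centering_step(c, A, b, x, t)     # 1. centering step
        if m / t < eps:                       # 3. stopping criterion
            return x
        t *= mu                               # 4. increase t
```

With μ = 20, the number of outer iterations is ⌈log(m/(εt⁰))/log μ⌉, less than ten even for ε = 10⁻⁶.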

Convergence analysis

number of outer (centering) iterations: exactly

    ⌈ log(m/(εt^(0))) / log μ ⌉

plus the initial centering step (to compute x*(t^(0)))

centering problem

    minimize tf0(x) + φ(x)

see convergence analysis of Newton's method

tf0 + φ must have closed sublevel sets for t ≥ t^(0)
classical analysis requires strong convexity, Lipschitz condition
analysis via self-concordance requires self-concordance of tf0 + φ

12–12

Examples

inequality form LP (m = 100 inequalities, n = 50 variables)

[figures: duality gap versus Newton iterations for μ = 2, 50, 150 (left);
total Newton iterations versus μ (right)]

starts with x on central path (t^(0) = 1, duality gap 100)
terminates when t = 10⁸ (gap 10⁻⁶)
centering uses Newton's method with backtracking
total number of Newton iterations not very sensitive for μ ≥ 10

12–13

geometric program (m = 100 inequalities and n = 50 variables)

    minimize    log( Σ_{k=1}^5 exp(a0kᵀx + b0k) )
    subject to  log( Σ_{k=1}^5 exp(aikᵀx + bik) ) ≤ 0,  i = 1, . . . , m

[figure: duality gap versus Newton iterations for μ = 2, 50, 150]

12–14

family of standard LPs (A ∈ R^{m×2m})

    minimize cᵀx
    subject to Ax = b,  x ⪰ 0

m = 10, . . . , 1000; for each m, solve 100 randomly generated instances

[figure: Newton iterations (between 15 and 35) versus m, on a log scale from 10¹ to 10³]

number of iterations grows very slowly as m ranges over a 100 : 1 ratio

12–15

Feasibility and phase I methods

feasibility problem: find x such that

    fi(x) ≤ 0,  i = 1, . . . , m,    Ax = b        (2)

phase I: computes strictly feasible starting point for barrier method

basic phase I method

    minimize (over x, s)  s
    subject to  fi(x) ≤ s,  i = 1, . . . , m        (3)
                Ax = b

if x, s feasible, with s < 0, then x is strictly feasible for (2)
if optimal value p̄* of (3) is positive, then problem (2) is infeasible
if p̄* = 0 and attained, then problem (2) is feasible (but not strictly);
if p̄* = 0 and not attained, then problem (2) is infeasible

12–16
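For linear inequalities aiᵀx ≤ bi (and no equality constraints), the basic phase I problem (3) is itself an LP; a sketch using scipy's LP solver (helper names are ours; note the phase I LP can be unbounded below for unbounded feasible sets, so this is illustrative rather than general):

```python
import numpy as np
from scipy.optimize import linprog

def phase1(A, b):
    """Basic phase I for a_i^T x <= b_i: minimize s over (x, s)
    subject to a_i^T x - s <= b_i.  s < 0 certifies strict feasibility;
    s > 0 certifies infeasibility."""
    m, n = A.shape
    c = np.concatenate([np.zeros(n), [1.0]])     # objective: s
    A_ub = np.hstack([A, -np.ones((m, 1))])      # a_i^T x - s <= b_i
    res = linprog(c, A_ub=A_ub, b_ub=b,
                  bounds=[(None, None)] * (n + 1))
    return res.x[:n], res.x[n]
```

For the box 0 ⪯ x ⪯ 1 the optimal s is negative (deepest interior point); for the inconsistent pair x ≤ 0, x ≥ 1 it is positive.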

sum of infeasibilities phase I method

    minimize 1ᵀs
    subject to s ⪰ 0,  fi(x) ≤ si,  i = 1, . . . , m
               Ax = b

for infeasible problems, produces a solution that satisfies many more
inequalities than basic phase I method

example (infeasible set of 100 linear inequalities in 50 variables)

[figure: histograms of the constraint values bi − aiᵀx at the two phase I solutions]

left: basic phase I solution; satisfies 39 inequalities
right: sum of infeasibilities phase I solution; satisfies 79 inequalities

12–17

example: family of linear inequalities Ax ⪯ b + γΔb

data chosen to be strictly feasible for γ > 0, infeasible for γ ≤ 0
use basic phase I, terminate when s < 0 or dual objective is positive

[figures: Newton iterations versus γ, for feasible instances (γ > 0) and infeasible instances (γ < 0)]

number of iterations roughly proportional to log(1/|γ|)

12–18

Complexity analysis via self-concordance

same assumptions as on page 12–2, plus:

sublevel sets (of f0, on the feasible set) are bounded
tf0 + φ is self-concordant with closed sublevel sets

second condition

holds for LP, QP, QCQP
may require reformulating the problem, e.g.,

    minimize Σ_{i=1}^n xi log xi          minimize Σ_{i=1}^n xi log xi
    subject to Fx ⪯ g              →      subject to Fx ⪯ g,  x ⪰ 0

needed for complexity analysis; barrier method works even when
self-concordance assumption does not apply

12–19

Newton iterations per centering step: from self-concordance theory

    #Newton iterations ≤ (μtf0(x) + φ(x) − μtf0(x⁺) − φ(x⁺))/γ + c

bound on effort of computing x⁺ = x*(μt) starting at x = x*(t)
γ, c are constants (depend only on Newton algorithm parameters)

from duality (with λ = λ*(t), ν = ν*(t)):

    μtf0(x) + φ(x) − μtf0(x⁺) − φ(x⁺)
        = μtf0(x) − μtf0(x⁺) + Σ_{i=1}^m log(−μtλifi(x⁺)) − m log μ
        ≤ μtf0(x) − μtf0(x⁺) − μt Σ_{i=1}^m λifi(x⁺) − m − m log μ
        ≤ μtf0(x) − μtg(λ, ν) − m − m log μ
        = m(μ − 1 − log μ)

12–20

total number of Newton iterations (excluding first centering step)

    #Newton iterations ≤ N = ⌈ log(m/(t^(0)ε)) / log μ ⌉ ( m(μ − 1 − log μ)/γ + c )

[figure: N versus μ (from 1 to 1.2) for typical values of γ, c, with m = 100, m/(t^(0)ε) = 10⁵; N is of order 10⁴]

confirms trade-off in choice of μ
in practice, #iterations is in the tens; not very sensitive for μ ≥ 10

12–21

polynomial-time complexity of barrier method

for μ = 1 + 1/√m:

    N = O( √m log(m/(t^(0)ε)) )

number of Newton iterations for fixed gap reduction is O(√m)
multiply with cost of one Newton iteration (a polynomial function of
problem dimensions), to get bound on number of flops
this choice of μ optimizes worst-case complexity; in practice we choose
μ fixed (μ = 10, . . . , 20)

12–22

Generalized inequalities

    minimize f0(x)
    subject to fi(x) ⪯_{Ki} 0,  i = 1, . . . , m
               Ax = b

f0 convex, fi : R^n → R^{ki}, i = 1, . . . , m, convex with respect to proper
cones Ki ⊂ R^{ki}
fi twice continuously differentiable
A ∈ R^{p×n} with rank A = p
we assume p* is finite and attained
we assume problem is strictly feasible; hence strong duality holds and
dual optimum is attained

examples of greatest interest: SOCP, SDP

12–23

Generalized logarithm for proper cone

ψ : R^q → R is generalized logarithm for proper cone K ⊂ R^q if:

dom ψ = int K and ∇²ψ(y) ≺ 0 for y ≻_K 0
ψ(sy) = ψ(y) + θ log s for y ≻_K 0, s > 0 (θ is the degree of ψ)

examples

nonnegative orthant K = R^n_+:  ψ(y) = Σ_{i=1}^n log yi,  with degree θ = n
positive semidefinite cone K = S^n_+:  ψ(Y) = log det Y  (θ = n)
second-order cone K = {y ∈ R^{n+1} | (y1² + · · · + yn²)^{1/2} ≤ y_{n+1}}:

    ψ(y) = log(y_{n+1}² − y1² − · · · − yn²)    (θ = 2)

12–24

properties (without proof): for y ≻_K 0,

    ∇ψ(y) ⪰_{K*} 0,    yᵀ∇ψ(y) = θ

nonnegative orthant R^n_+:  ψ(y) = Σ_{i=1}^n log yi

    ∇ψ(y) = (1/y1, . . . , 1/yn),    yᵀ∇ψ(y) = n

positive semidefinite cone S^n_+:  ψ(Y) = log det Y

    ∇ψ(Y) = Y⁻¹,    tr(Y∇ψ(Y)) = n

second-order cone K = {y ∈ R^{n+1} | (y1² + · · · + yn²)^{1/2} ≤ y_{n+1}}:

    ∇ψ(y) = (2/(y_{n+1}² − y1² − · · · − yn²)) (−y1, . . . , −yn, y_{n+1}),    yᵀ∇ψ(y) = 2

12–25
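The identity yᵀ∇ψ(y) = θ is easy to check numerically for all three cones (the random test points below are our own):

```python
import numpy as np

rng = np.random.default_rng(0)

# nonnegative orthant: psi(y) = sum log y_i, grad = (1/y_1, ..., 1/y_n), theta = n
y = rng.random(5) + 0.1
assert abs(y @ (1.0 / y) - 5) < 1e-12

# positive semidefinite cone: psi(Y) = log det Y, grad = Y^{-1}, theta = n
M = rng.standard_normal((4, 4))
Y = M @ M.T + np.eye(4)                             # Y strictly positive definite
assert abs(np.trace(Y @ np.linalg.inv(Y)) - 4) < 1e-9

# second-order cone: psi(y) = log(y_{n+1}^2 - y_1^2 - ... - y_n^2), theta = 2
u = rng.standard_normal(3)
y = np.concatenate([u, [np.linalg.norm(u) + 1.0]])  # strictly inside the cone
s = y[-1] ** 2 - u @ u
grad = (2.0 / s) * np.concatenate([-u, [y[-1]]])
assert abs(y @ grad - 2.0) < 1e-9
```

Each check is just Euler's identity for the log-homogeneity property ψ(sy) = ψ(y) + θ log s, differentiated at s = 1.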

Logarithmic barrier and central path

logarithmic barrier for f1(x) ⪯_{K1} 0, . . . , fm(x) ⪯_{Km} 0:

    φ(x) = −Σ_{i=1}^m ψi(−fi(x)),    dom φ = {x | fi(x) ≺_{Ki} 0, i = 1, . . . , m}

ψi is generalized logarithm for Ki, with degree θi
φ is convex, twice continuously differentiable

central path: {x*(t) | t > 0} where x*(t) solves

    minimize tf0(x) + φ(x)
    subject to Ax = b

12–26

Dual points on central path

x = x*(t) if there exists w ∈ R^p,

    t∇f0(x) + Σ_{i=1}^m Dfi(x)ᵀ∇ψi(−fi(x)) + Aᵀw = 0

(Dfi(x) ∈ R^{ki×n} is derivative matrix of fi)

therefore, x*(t) minimizes Lagrangian L(x, λ*(t), ν*(t)), where

    λi*(t) = (1/t)∇ψi(−fi(x*(t))),    ν*(t) = w/t

from properties of ψi: λi*(t) ≻_{Ki*} 0, with duality gap

    f0(x*(t)) − g(λ*(t), ν*(t)) = (1/t) Σ_{i=1}^m θi

12–27

example: semidefinite programming (with Fi ∈ S^p)

    minimize cᵀx
    subject to F(x) = Σ_{i=1}^n xiFi + G ⪯ 0

logarithmic barrier: φ(x) = log det(−F(x)⁻¹)
central path: x*(t) minimizes tcᵀx − log det(−F(x)); hence

    tci − tr(Fi F(x*(t))⁻¹) = 0,  i = 1, . . . , n

dual point on central path: Z*(t) = −(1/t)F(x*(t))⁻¹ is feasible for

    maximize tr(GZ)
    subject to tr(FiZ) + ci = 0,  i = 1, . . . , n
               Z ⪰ 0

duality gap on central path: cᵀx*(t) − tr(GZ*(t)) = p/t

12–28

Barrier method

given strictly feasible x, t := t^(0) > 0, μ > 1, tolerance ε > 0.

repeat
1. Centering step. Compute x*(t) by minimizing tf0 + φ, subject to Ax = b.
2. Update. x := x*(t).
3. Stopping criterion. quit if (Σi θi)/t < ε.
4. Increase t. t := μt.

only difference is duality gap m/t on central path is replaced by (Σi θi)/t

number of outer iterations:

    ⌈ log((Σi θi)/(εt^(0))) / log μ ⌉

complexity analysis via self-concordance applies to SDP, SOCP

12–29

Examples

second-order cone program (50 variables, 50 SOC constraints in R⁶)

[figure: duality gap versus Newton iterations for μ = 2, 50, 200]

semidefinite program (100 variables, LMI constraint in S¹⁰⁰)

[figure: duality gap versus Newton iterations for μ = 2, 50, 150]

12–30

family of SDPs (A ∈ S^n, x ∈ R^n)

    minimize 1ᵀx
    subject to A + diag(x) ⪰ 0

n = 10, . . . , 1000; for each n, solve 100 randomly generated instances

[figure: Newton iterations (between 15 and 35) versus n, on a log scale from 10¹ to 10³]

12–31

Primal-dual interior-point methods


more efficient than barrier method when high accuracy is needed
update primal and dual variables at each iteration; no distinction
between inner and outer iterations
often exhibit superlinear asymptotic convergence
search directions can be interpreted as Newton directions for modified
KKT conditions
can start at infeasible points
cost per iteration same as barrier method

12–32

Convex Optimization – Boyd & Vandenberghe

13. Conclusions

main ideas of the course
importance of modeling in optimization

13–1

Modeling
mathematical optimization
problems in engineering design, data analysis and statistics, economics,
management, . . . , can often be expressed as mathematical
optimization problems
techniques exist to take into account multiple objectives or uncertainty
in the data
tractability
roughly speaking, tractability in optimization requires convexity
algorithms for nonconvex optimization find local (suboptimal) solutions,
or are very expensive
surprisingly many applications can be formulated as convex problems
13–2

Theoretical consequences of convexity


local optima are global
extensive duality theory

systematic way of deriving lower bounds on optimal value


necessary and sufficient optimality conditions
certificates of infeasibility
sensitivity analysis

solution methods with polynomial worst-case complexity theory


(with self-concordance)

13–3

Practical consequences of convexity

(most) convex problems can be solved globally and efficiently


interior-point methods require 20 – 80 steps in practice
basic algorithms (e.g., Newton, barrier method, . . . ) are easy to
implement and work well for small and medium size problems (larger
problems if structure is exploited)
more and more high-quality implementations of advanced algorithms
and modeling tools are becoming available
high level modeling tools like cvx ease modeling and problem
specification

13–4

How to use convex optimization


to use convex optimization in some applied context
use rapid prototyping, approximate modeling
start with simple models, small problem instances, inefficient solution
methods
if you don't like the results, no need to expend further effort on more
accurate models or efficient algorithms
work out, simplify, and interpret optimality conditions and dual
even if the problem is quite nonconvex, you can use convex optimization
in subproblems, e.g., to find search direction
by repeatedly forming and solving a convex approximation at the
current point

13–5

Further topics
some topics we didn't cover:
methods for very large scale problems
subgradient calculus, convex analysis
localization, subgradient, and related methods
distributed convex optimization
applications that build on or use convex optimization

13–6

What's next?

EE364B – convex optimization II
MATH301 – advanced topics in convex optimization
MS&E314 – linear and conic optimization
EE464 – semidefinite optimization and algebraic techniques

13–7
