1. Introduction
• mathematical optimization
• convex optimization
• example
• nonlinear optimization
1–1
Mathematical optimization
minimize f0(x)
subject to fi(x) ≤ bi, i = 1, . . . , m
• f0 : Rn → R: objective function
• fi : Rn → R, i = 1, . . . , m: constraint functions
Introduction 1–2
Examples
portfolio optimization
• variables: amounts invested in different assets
• constraints: budget, max./min. investment per asset, minimum return
• objective: overall risk or return variance
data fitting
• variables: model parameters
• constraints: prior information, parameter limits
• objective: measure of misfit or prediction error
Introduction 1–3
Solving optimization problems
• least-squares problems
• linear programming problems
• convex optimization problems
Introduction 1–4
Least-squares
using least-squares
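The least-squares problem minimize ‖Ax − b‖2² has the well-known analytical solution x = (AᵀA)⁻¹Aᵀb (the normal equations). A minimal pure-Python sketch on a tiny made-up 3×2 line-fitting problem (the data and helper functions are illustrative, not from the slides):

```python
# Least-squares via the normal equations A^T A x = A^T b.
# Tiny 3x2 example: fit b ~ x0 + x1 * t through (1,1), (2,2), (3,3).

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def solve2(M, c):
    # solve a 2x2 linear system by Cramer's rule
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(c[0] * M[1][1] - M[0][1] * c[1]) / det,
            (M[0][0] * c[1] - c[0] * M[1][0]) / det]

A = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
b = [1.0, 2.0, 3.0]

At = transpose(A)
AtA = matmul(At, A)                     # A^T A
Atb = [sum(At[i][j] * b[j] for j in range(len(b))) for i in range(2)]
x = solve2(AtA, Atb)                    # exact fit here: x = [0, 1]
```

Real least-squares codes factor A (QR or Cholesky) rather than forming AᵀA explicitly; the normal equations are used here only to mirror the slide's formula.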
Introduction 1–5
Linear programming
minimize cT x
subject to aTi x ≤ bi, i = 1, . . . , m
Introduction 1–6
Convex optimization problem
minimize f0(x)
subject to fi(x) ≤ bi, i = 1, . . . , m
objective and constraint functions are convex: fi(αx + βy) ≤ αfi(x) + βfi(y) if α + β = 1, α ≥ 0, β ≥ 0
Introduction 1–7
solving convex optimization problems
• no analytical solution
• reliable and efficient algorithms
• computation time (roughly) proportional to max{n3, n2m, F }, where F
is cost of evaluating fi’s and their first and second derivatives
• almost a technology
Introduction 1–8
Example
[figure: illumination problem geometry; rkj and θkj are the distance and angle from lamp j to patch k, which determine the patch illumination Ik]
Introduction 1–9
how to solve?
1. use uniform power: pj = p, vary p
2. use least-squares:
minimize Σ_{k=1}^n (Ik − Ides)²
Introduction 1–10
5. use convex optimization: problem is equivalent to
[figure: graph of the penalty function h(u) for 0 ≤ u ≤ 4]
Introduction 1–11
additional constraints: does adding 1 or 2 below complicate the problem?
• answer: with (1), still easy to solve; with (2), extremely difficult
• moral: (untrained) intuition doesn’t always work; without the proper
background very easy problems can appear quite similar to very difficult
problems
Introduction 1–12
Course goals and topics
goals
topics
Introduction 1–13
Nonlinear optimization
Introduction 1–14
Brief history of convex optimization
theory (convex analysis): ca1900–1970
algorithms
• 1947: simplex algorithm for linear programming (Dantzig)
• 1960s: early interior-point methods (Fiacco & McCormick, Dikin, . . . )
• 1970s: ellipsoid method and other subgradient methods
• 1980s: polynomial-time interior-point methods for linear programming
(Karmarkar 1984)
• late 1980s–now: polynomial-time interior-point methods for nonlinear
convex optimization (Nesterov & Nemirovski 1994)
applications
• before 1990: mostly in operations research; few in engineering
• since 1990: many new applications in engineering (control, signal
processing, communications, circuit design, . . . ); new problem classes
(semidefinite and second-order cone programming, robust optimization)
Introduction 1–15
Convex Optimization — Boyd & Vandenberghe
2. Convex sets
• generalized inequalities
2–1
Affine set
x = θx1 + (1 − θ)x2 (θ ∈ R)
[figure: the line through x1 and x2, with points marked at θ = 1.2, 1, 0.6, 0, −0.2]
affine set: contains the line through any two distinct points in the set
x = θx1 + (1 − θ)x2
with 0 ≤ θ ≤ 1
convex set: contains line segment between any two points in the set
convex combination of x1, . . . , xk: any point x of the form
x = θ1x1 + θ2x2 + · · · + θkxk
with θ1 + · · · + θk = 1, θi ≥ 0
conic (nonnegative) combination of x1 and x2: any point of the form
x = θ1x1 + θ2x2
with θ1 ≥ 0, θ2 ≥ 0
[figure: the conic combinations of x1 and x2 sweep out a cone with apex 0]
convex cone: set that contains all conic combinations of points in the set
[figure: hyperplane {x | aᵀx = b} with normal vector a at the point x0; halfspaces aᵀx ≥ b and aᵀx ≤ b]
(Euclidean) ball with center xc and radius r: {x | ‖x − xc‖2 ≤ r}
polyhedron: solution set of finitely many linear inequalities and equalities, Ax ⪯ b, Cx = d
[figure: polyhedron P bounded by hyperplanes with normals a1, . . . , a5]
example: [x y; y z] ∈ S2+
[figure: boundary of the positive semidefinite cone in S2, plotted over (x, y, z)]
1. apply definition
example:
S = {x ∈ Rm | |p(t)| ≤ 1 for |t| ≤ π/3}
[figure, for m = 2: the trigonometric polynomial p(t) on [0, π], and the corresponding set S in the (x1, x2)-plane]
examples
• scaling, translation, projection
• solution set of linear matrix inequality {x | x1A1 + · · · + xmAm ⪯ B} (with Ai, B ∈ Sp)
• hyperbolic cone {x | xT P x ≤ (cT x)2, cT x ≥ 0} (with P ∈ Sn+)
images and inverse images of convex sets under perspective are convex
f(x) = (Ax + b)/(cᵀx + d), dom f = {x | cᵀx + d > 0}
example: f(x) = x/(x1 + x2 + 1)
[figure: a convex set C and its image f(C) under this linear-fractional function]
examples
• nonnegative orthant K = Rn+ = {x ∈ Rn | xi ≥ 0, i = 1, . . . , n}
• positive semidefinite cone K = Sn+
• nonnegative polynomials on [0, 1]:
x ⪯K y ⇐⇒ y − x ∈ K, x ≺K y ⇐⇒ y − x ∈ int K
examples
• componentwise inequality (K = Rn+): x ⪯ y ⇐⇒ xi ≤ yi, i = 1, . . . , n
many properties of ⪯K are similar to ≤ on R, e.g.,
x ⪯K y, u ⪯K v =⇒ x + u ⪯K y + v
x ∈ S is the minimum element of S with respect to ⪯K if
y ∈ S =⇒ x ⪯K y
x ∈ S is a minimal element of S with respect to ⪯K if
y ∈ S, y ⪯K x =⇒ y = x
example (K = R2+): [figure: x1 is the minimum element of S1; x2 is a minimal element of S2]
separating hyperplane: aᵀx ≤ b for x ∈ C, aᵀx ≥ b for x ∈ D
[figure: disjoint convex sets C and D separated by the hyperplane {x | aᵀx = b}]
supporting hyperplane to C at boundary point x0: {x | aᵀx = aᵀx0}, where aᵀx ≤ aᵀx0 for all x ∈ C
[figure: hyperplane with normal a supporting C at x0]
K ∗ = {y | y T x ≥ 0 for all x ∈ K}
examples
• K = Rn+: K ∗ = Rn+
• K = Sn+: K ∗ = Sn+
• K = {(x, t) | ‖x‖2 ≤ t}: K∗ = {(x, t) | ‖x‖2 ≤ t}
• K = {(x, t) | ‖x‖1 ≤ t}: K∗ = {(x, t) | ‖x‖∞ ≤ t}
y ⪰K∗ 0 ⇐⇒ yᵀx ≥ 0 for all x ⪰K 0
[figure: a set S with minimal points found by minimizing λᵀx over S for different λ ≻ 0]
example (n = 2): in the production set P (fuel versus labor), x1, x2, x3 are efficient; x4, x5 are not
3. Convex functions
• quasiconvex functions
3–1
Definition
f : Rn → R is convex if dom f is a convex set and
f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y)
for all x, y ∈ dom f, 0 ≤ θ ≤ 1
[figure: the chord from (x, f(x)) to (y, f(y)) lies above the graph of f]
• f is concave if −f is convex
• f is strictly convex if dom f is convex and the inequality above is strict for x ≠ y, 0 < θ < 1
convex:
• affine: ax + b on R, for any a, b ∈ R
• exponential: eax, for any a ∈ R
• powers: xα on R++, for α ≥ 1 or α ≤ 0
• powers of absolute value: |x|p on R, for p ≥ 1
• negative entropy: x log x on R++
concave:
• affine: ax + b on R, for any a, b ∈ R
• powers: xα on R++, for 0 ≤ α ≤ 1
• logarithm: log x on R++
extended-value extension f̃ of f is f̃(x) = f(x) for x ∈ dom f, f̃(x) = ∞ for x ∉ dom f
1st-order condition: differentiable f with convex domain is convex if and only if
f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) for all x, y ∈ dom f
[figure: the first-order approximation f(x) + ∇f(x)ᵀ(y − x) is a global underestimator of f(y)]
f is twice differentiable if dom f is open and the Hessian ∇²f(x) ∈ Sn,
∇²f(x)ij = ∂²f(x)/(∂xi∂xj), i, j = 1, . . . , n,
exists at each x ∈ dom f; for such f with convex domain, f is convex if and only if ∇²f(x) ⪰ 0 for all x ∈ dom f
examples
• quadratic function f(x) = (1/2)xᵀPx + qᵀx + r (with P ∈ Sn): convex if P ⪰ 0
• least-squares objective: f(x) = ‖Ax − b‖2², convex for any A
quadratic-over-linear: f(x, y) = x²/y,
∇²f(x, y) = (2/y³) (y, −x)(y, −x)ᵀ ⪰ 0
convex for y > 0
[figure: graph of f(x, y) over y > 0]
log-sum-exp: f(x) = log Σk exp xk is convex:
∇²f(x) = (1/(1ᵀz)) diag(z) − (1/(1ᵀz)²) zzᵀ   (zk = exp xk)
to show ∇²f(x) ⪰ 0, we must verify that vᵀ∇²f(x)v ≥ 0 for all v:
vᵀ∇²f(x)v = ((Σk zk vk²)(Σk zk) − (Σk vk zk)²) / (Σk zk)² ≥ 0
since (Σk vk zk)² ≤ (Σk zk vk²)(Σk zk) (from Cauchy-Schwarz inequality)
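The Cauchy-Schwarz step is easy to spot-check numerically; the sketch below (random test points, not from the slides) evaluates vᵀ∇²f(x)v for log-sum-exp using the closed form above:

```python
import math
import random

# Numerical spot-check that the log-sum-exp Hessian is PSD:
# v^T H v = ((sum z_k v_k^2)(sum z_k) - (sum v_k z_k)^2) / (sum z_k)^2 >= 0,
# with z_k = exp(x_k).
random.seed(0)
n = 5
ok = True
for _ in range(100):
    x = [random.uniform(-2, 2) for _ in range(n)]
    v = [random.uniform(-1, 1) for _ in range(n)]
    z = [math.exp(xi) for xi in x]
    S = sum(z)
    quad = (sum(zk * vk * vk for zk, vk in zip(z, v)) * S
            - sum(zk * vk for zk, vk in zip(z, v)) ** 2) / S ** 2
    ok = ok and quad >= -1e-12   # small slack for floating-point rounding
```

A check like this is no proof, but it is a cheap sanity test when deriving Hessians by hand.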
geometric mean: f(x) = (∏_{k=1}^n xk)^{1/n} on Rⁿ₊₊ is concave
α-sublevel set of f : Rn → R:
Cα = {x ∈ dom f | f (x) ≤ α}
epigraph of f : Rn → R:
epi f
f (E z) ≤ E f (z)
prob(z = x) = θ, prob(z = y) = 1 − θ
examples
f(x) = − Σ_{i=1}^m log(bi − aiᵀx), dom f = {x | aiᵀx < bi, i = 1, . . . , m}
examples
is convex
examples
• support function of a set C: SC(x) = sup_{y∈C} yᵀx is convex
• distance to farthest point in a set C:
f(x) = sup_{y∈C} ‖x − y‖
λmax(X) = sup_{‖y‖2=1} yᵀXy
composition of g : Rn → R and h : R → R:
f (x) = h(g(x))
examples
• exp g(x) is convex if g is convex
• 1/g(x) is convex if g is concave and positive
composition of g : Rn → Rk and h : Rk → R:
examples
• Σ_{i=1}^m log gi(x) is concave if gi are concave and positive
• log Σ_{i=1}^m exp gi(x) is convex if gi are convex
is convex
examples
• f(x, y) = xᵀAx + 2xᵀBy + yᵀCy with
[ A B ; Bᵀ C ] ⪰ 0, C ≻ 0
g is convex if f is convex
examples
• f (x) = xT x is convex; hence g(x, t) = xT x/t is convex for t > 0
• negative logarithm f (x) = − log x is convex; hence relative entropy
g(x, t) = t log t − t log x is convex on R2++
• if f is convex, then
g(x) = (cᵀx + d) f((Ax + b)/(cᵀx + d))
is convex on {x | cᵀx + d > 0, (Ax + b)/(cᵀx + d) ∈ dom f}
[figure: f(x) and the line xy; the conjugate f∗(y) = sup_x (yᵀx − f(x)) is the maximum gap between them, and the supporting line with slope y has intercept (0, −f∗(y))]
Sα = {x ∈ dom f | f (x) ≤ α}
[figure: a quasiconvex function on R, with sublevel-set breakpoints a, b, c]
• f is quasiconcave if −f is quasiconvex
• f is quasilinear if it is quasiconvex and quasiconcave
is quasiconvex
[figure: ∇f(x) defines a supporting hyperplane to the sublevel set at x]
f(x) = (1/√((2π)ⁿ det Σ)) exp(−(1/2)(x − x̄)ᵀΣ⁻¹(x − x̄))
f (x) = prob(x + y ∈ C)
is log-concave
proof: write f(x) as integral of product of log-concave functions
f(x) = ∫ g(x + y) p(y) dy,   g(u) = 1 if u ∈ C, 0 if u ∉ C,
p is pdf of y
Y (x) = prob(x + w ∈ S)
• Y is log-concave
for x, y ∈ dom f , 0 ≤ θ ≤ 1
for X, Y ∈ Sm, 0 ≤ θ ≤ 1
4–1
Optimization problem in standard form
minimize f0(x)
subject to fi(x) ≤ 0, i = 1, . . . , m
hi(x) = 0, i = 1, . . . , p
optimal value:
examples (with n = 1, m = p = 0)
• f0(x) = 1/x, dom f0 = R++: p⋆ = 0, no optimal point
• f0(x) = − log x, dom f0 = R++: p⋆ = −∞
• f0(x) = x log x, dom f0 = R++: p⋆ = −1/e, x = 1/e is optimal
• f0(x) = x3 − 3x, p⋆ = −∞, local optimum at x = 1
x ∈ D = ⋂_{i=0}^m dom fi ∩ ⋂_{i=1}^p dom hi,
example:
minimize f0(x) = − Σ_{i=1}^k log(bi − aiᵀx)
find x
subject to fi(x) ≤ 0, i = 1, . . . , m
hi(x) = 0, i = 1, . . . , p
minimize 0
subject to fi(x) ≤ 0, i = 1, . . . , m
hi(x) = 0, i = 1, . . . , p
minimize f0(x)
subject to fi(x) ≤ 0, i = 1, . . . , m
aTi x = bi, i = 1, . . . , p
often written as
minimize f0(x)
subject to fi(x) ≤ 0, i = 1, . . . , m
Ax = b
[figure: at an optimal x on the boundary of the feasible set X, −∇f0(x) defines a supporting hyperplane to X]
minimize f0(x)
subject to fi(x) ≤ 0, i = 1, . . . , m
Ax = b
is equivalent to
Ax = b ⇐⇒ x = F z + x0 for some z
is equivalent to
minimize f0(x)
subject to aTi x ≤ bi, i = 1, . . . , m
is equivalent to
minimize (over x, t) t
subject to f0(x) − t ≤ 0
fi(x) ≤ 0, i = 1, . . . , m
Ax = b
is equivalent to
minimize f˜0(x1)
subject to fi(x1) ≤ 0, i = 1, . . . , m
minimize f0(x)
subject to fi(x) ≤ 0, i = 1, . . . , m
Ax = b
with f0 : Rn → R quasiconvex, f1, . . . , fm convex
can have locally optimal points that are not (globally) optimal
(x, f0(x))
f0(x) ≤ t ⇐⇒ φt(x) ≤ 0
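The equivalence above is what makes bisection on t work: each step is a convex feasibility problem. A sketch on a made-up one-dimensional linear-fractional-style objective, with a crude grid search standing in for a real convex feasibility solver:

```python
# Bisection for a quasiconvex problem (illustrative example, not from the
# slides): minimize f0(x) = (x^2 + 1)/(x + 1) over 0 <= x <= 2.
# f0(x) <= t  <=>  phi_t(x) = x^2 + 1 - t*(x + 1) <= 0, convex in x for fixed t.

def feasible(t, lo=0.0, hi=2.0, steps=1000):
    # grid stand-in for the convex feasibility problem "exists x: phi_t(x) <= 0"
    return any(
        ((lo + (hi - lo) * i / steps) ** 2 + 1
         - t * ((lo + (hi - lo) * i / steps) + 1)) <= 0
        for i in range(steps + 1))

l, u = 0.0, 10.0                 # initial bracket [l, u] containing p*
while u - l > 1e-6:
    t = (l + u) / 2
    if feasible(t):
        u = t                    # optimal value is <= t
    else:
        l = t                    # optimal value is > t
p_star = u                       # approx. 2*sqrt(2) - 2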
example
f0(x) = p(x)/q(x)
minimize cT x + d
subject to Gx ⪯ h
Ax = b
[figure: LP geometry: x⋆ is the vertex of the feasible polyhedron P farthest in the direction −c]
minimize cT x
subject to Ax ⪯ b, x ⪰ 0
piecewise-linear minimization
equivalent to an LP
minimize t
subject to aTi x + bi ≤ t, i = 1, . . . , m
P = {x | aTi x ≤ bi, i = 1, . . . , m}
xcheb
is center of largest inscribed ball
B = {xc + u | kuk2 ≤ r}
maximize r
subject to aTi xc + rkaik2 ≤ bi, i = 1, . . . , m
minimize f0(x)
subject to Gx ⪯ h
Ax = b
linear-fractional program
f0(x) = (cᵀx + d)/(eᵀx + f), dom f0(x) = {x | eᵀx + f > 0}
transforms to the LP
minimize cᵀy + dz
subject to Gy ⪯ hz
Ay = bz
eᵀy + fz = 1
z ≥ 0
generalized linear-fractional program
f0(x) = max_{i=1,...,r} (ciᵀx + di)/(eiᵀx + fi), dom f0(x) = {x | eiᵀx + fi > 0, i = 1, . . . , r}
minimize (1/2)xT P x + q T x + r
subject to Gx ⪯ h
Ax = b
[figure: at the QP optimum x⋆, −∇f0(x⋆) is normal to the feasible set]
least-squares
minimize ‖Ax − b‖2²
minimize f T x
subject to ‖Aix + bi‖2 ≤ ciᵀx + di, i = 1, . . . , m
Fx = g
minimize cT x
subject to aTi x ≤ bi, i = 1, . . . , m,
minimize cT x
subject to aTi x ≤ bi for all ai ∈ Ei, i = 1, . . . , m,
minimize cT x
subject to prob(aTi x ≤ bi) ≥ η, i = 1, . . . , m
• robust LP (ellipsoidal uncertainty Ei = {āi + Piu | ‖u‖2 ≤ 1}):
minimize cᵀx
subject to aiᵀx ≤ bi ∀ai ∈ Ei, i = 1, . . . , m
is equivalent to the SOCP
minimize cᵀx
subject to āiᵀx + ‖Piᵀx‖2 ≤ bi, i = 1, . . . , m
• robust LP (ai Gaussian with mean āi, covariance Σi):
minimize cᵀx
subject to prob(aiᵀx ≤ bi) ≥ η, i = 1, . . . , m
is equivalent to the SOCP
minimize cᵀx
subject to āiᵀx + Φ⁻¹(η)‖Σi^{1/2}x‖2 ≤ bi, i = 1, . . . , m
minimize f0(x)
subject to fi(x) ≤ 1, i = 1, . . . , m
hi(x) = 1, i = 1, . . . , p
• posynomial f(x) = Σ_{k=1}^K ck x1^{a1k} x2^{a2k} · · · xn^{ank} transforms to
log f(e^{y1}, . . . , e^{yn}) = log Σ_{k=1}^K e^{akᵀy + bk}   (bk = log ck)
design problem
• aspect ratio hi/wi and inverse aspect ratio wi/hi are monomials
• the vertical deflection yi and slope vi of central axis at the right end of
segment i are defined recursively as
vi = 12(i − 1/2) F/(E wi hi³) + v_{i+1}
yi = 6(i − 1/3) F/(E wi hi³) + v_{i+1} + y_{i+1}
note
• we write wmin ≤ wi ≤ wmax and hmin ≤ hi ≤ hmax
Sminwi/hi ≤ 1, hi/(wiSmax) ≤ 1
minimize λ
subject to Σ_{j=1}^n A(x)ij vj/(λvi) ≤ 1, i = 1, . . . , n
variables λ, v, x
minimize f0(x)
subject to fi(x) ⪯Ki 0, i = 1, . . . , m
Ax = b
conic form problem: special case with affine objective and constraints
minimize cT x
subject to F x + g ⪯K 0
Ax = b
minimize cT x
subject to x1F1 + x2F2 + · · · + xnFn + G ⪯ 0
Ax = b
with Fi, G ∈ Sk
SOCP: minimize fᵀx
subject to ‖Aix + bi‖2 ≤ ciᵀx + di, i = 1, . . . , m
SDP: minimize fᵀx
subject to [ (ciᵀx + di)I Aix + bi ; (Aix + bi)ᵀ ciᵀx + di ] ⪰ 0, i = 1, . . . , m
minimize λmax(A(x)), where A(x) = A0 + x1A1 + · · · + xnAn (with given Ai ∈ Sk)
equivalent SDP
minimize t
subject to A(x) ⪯ tI
• variables x ∈ Rn, t ∈ R
• follows from
λmax(A) ≤ t ⇐⇒ A ⪯ tI
matrix norm minimization: minimize ‖A(x)‖2 = (λmax(A(x)ᵀA(x)))^{1/2}
equivalent SDP
minimize t
subject to [ tI A(x) ; A(x)ᵀ tI ] ⪰ 0
• variables x ∈ Rn, t ∈ R
• constraint follows from
‖A‖2 ≤ t ⇐⇒ AᵀA ⪯ t²I, t ≥ 0
⇐⇒ [ tI A ; Aᵀ tI ] ⪰ 0
O = {f0(x) | x feasible}: set of achievable objective values
[figure: x⋆ is optimal (f0(x⋆) is the minimum element of O); xpo is Pareto optimal (f0(xpo) is a minimal element of O)]
[figure: example trade-off curve, F2(x) = ‖x‖2² versus F1(x), tracing the lower-left boundary of O]
example: risk-return trade-off in portfolio optimization
[figure: trade-off curve of mean return versus standard deviation of return, and the corresponding optimal allocations x (x(1) is the risk-free asset weight)]
Convex optimization problems 4–44
Scalarization
minimize λT f0(x)
subject to fi(x) ≤ 0, i = 1, . . . , m
hi(x) = 0, i = 1, . . . , p
[figure: achievable set O; minimizing λᵀf0 over the feasible set supports O with hyperplanes normal to λ1, λ2, picking out Pareto optimal values such as f0(x1) and f0(x2), but not values like f0(x3) on a nonconvex part of the boundary]
if x is optimal for scalar problem, then it is Pareto-optimal for vector optimization problem
for convex vector optimization problems, can find (almost) all Pareto optimal points by varying λ ≻K∗ 0
examples
regularized least-squares, scalarized: minimize ‖Ax − b‖2² + γ‖x‖2²
for fixed γ > 0, a LS problem
[figure: trade-off curve of ‖x‖2² versus ‖Ax − b‖2², swept out by varying γ; the point for γ = 1 is marked]
5. Duality
5–1
Lagrangian
minimize f0(x)
subject to fi(x) ≤ 0, i = 1, . . . , m
hi(x) = 0, i = 1, . . . , p
Duality 5–2
Lagrange dual function
Duality 5–3
Least-norm solution of linear equations
minimize xT x
subject to Ax = b
dual function
• Lagrangian is L(x, ν) = xT x + ν T (Ax − b)
• to minimize L over x, set gradient equal to zero:
∇xL(x, ν) = 2x + AT ν = 0 =⇒ x = −(1/2)AT ν
• plug x into L to obtain g:
g(ν) = L((−1/2)Aᵀν, ν) = −(1/4)νᵀAAᵀν − bᵀν
a concave function of ν
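On a tiny instance this can be checked end to end. The sketch below uses made-up data A = [1 1], b = 2, maximizes the concave quadratic g in closed form, and recovers the primal optimum from x = −(1/2)Aᵀν:

```python
# Strong duality on a tiny least-norm instance:
# minimize x^T x subject to x1 + x2 = 2   (A = [1 1], b = 2).
# Dual function: g(nu) = -(1/4) nu^2 (A A^T) - b nu.
AAt = 2.0                              # A A^T for A = [1, 1]
b = 2.0

def g(nu):
    return -0.25 * nu * nu * AAt - b * nu

# maximize the concave quadratic g via its stationary point:
# -(1/2) AAt nu - b = 0  =>  nu* = -2
nu_star = -b / (0.5 * AAt)
x = [-0.5 * nu_star, -0.5 * nu_star]   # x = -(1/2) A^T nu  =>  x = (1, 1)
p_star = x[0] ** 2 + x[1] ** 2         # primal optimal value, = 2
d_star = g(nu_star)                    # dual optimal value, = 2
```

Here d⋆ = p⋆, as strong duality predicts for this convex problem.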
Duality 5–4
Standard form LP
minimize cT x
subject to Ax = b, x ⪰ 0
dual function
• Lagrangian is
L(x, λ, ν) = cᵀx + νᵀ(Ax − b) − λᵀx
= −bᵀν + (c + Aᵀν − λ)ᵀx
• L is affine in x, hence
g(λ, ν) = inf_x L(x, λ, ν) = { −bᵀν if Aᵀν − λ + c = 0, −∞ otherwise }
Duality 5–5
Equality constrained norm minimization
minimize ‖x‖
subject to Ax = b
dual function
g(ν) = inf_x (‖x‖ − νᵀAx + bᵀν) = { bᵀν if ‖Aᵀν‖∗ ≤ 1, −∞ otherwise }
where ‖v‖∗ = sup_{‖u‖≤1} uᵀv is dual norm of ‖ · ‖
Duality 5–6
Two-way partitioning
minimize xT W x
subject to x2i = 1, i = 1, . . . , n
dual function
g(ν) = inf_x ( xᵀWx + Σi νi(xi² − 1) ) = inf_x xᵀ(W + diag(ν))x − 1ᵀν
= { −1ᵀν if W + diag(ν) ⪰ 0, −∞ otherwise }
Duality 5–7
Lagrange dual and conjugate function
minimize f0(x)
subject to Ax ⪯ b, Cx = d
dual function
g(λ, ν) = inf_{x∈dom f0} ( f0(x) + (Aᵀλ + Cᵀν)ᵀx − bᵀλ − dᵀν )
= −f0∗(−Aᵀλ − Cᵀν) − bᵀλ − dᵀν
Duality 5–8
The dual problem
maximize g(λ, ν)
subject to λ ⪰ 0
• finds best lower bound on p⋆, obtained from Lagrange dual function
• a convex optimization problem; optimal value denoted d⋆
• λ, ν are dual feasible if λ ⪰ 0, (λ, ν) ∈ dom g
• often simplified by making implicit constraint (λ, ν) ∈ dom g explicit
Duality 5–9
Weak and strong duality
weak duality: d⋆ ≤ p⋆
• always holds (for convex and nonconvex problems)
• can be used to find nontrivial lower bounds for difficult problems
for example, solving the SDP
maximize −1T ν
subject to W + diag(ν) ⪰ 0
gives a lower bound for the two-way partitioning problem on page 5–7
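For a toy instance the lower bound can be compared against brute force. The sketch below uses a made-up 2×2 matrix W and the dual-feasible choice ν = −λmin(W)·1, which gives the bound n·λmin(W):

```python
from itertools import product

# Weak duality on a tiny two-way partitioning instance:
# W = [[0, 1], [1, 0]], so x^T W x = 2 x1 x2 over x in {-1, +1}^2.
W = [[0.0, 1.0], [1.0, 0.0]]
n = 2

# brute-force primal optimum over all 2^n sign vectors
p_star = min(sum(W[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
             for x in product((-1.0, 1.0), repeat=n))     # = -2

# nu = -lambda_min(W) * 1 makes W + diag(nu) PSD, hence is dual feasible;
# for this W, lambda_min(W) = -1, so the bound is g(nu) = n * lambda_min(W).
lam_min = -1.0
bound = n * lam_min     # lower bound on p*, here tight: -2
```

Here the bound happens to be tight; in general it is only a lower bound, which is exactly what weak duality guarantees.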
strong duality: d⋆ = p⋆
• does not hold in general
• (usually) holds for convex problems
• conditions that guarantee strong duality in convex problems are called
constraint qualifications
Duality 5–10
Slater’s constraint qualification
minimize f0(x)
subject to fi(x) ≤ 0, i = 1, . . . , m
Ax = b
• also guarantees that the dual optimum is attained (if p⋆ > −∞)
• can be sharpened: e.g., can replace int D with relint D (interior
relative to affine hull); linear inequalities do not need to hold with strict
inequality, . . .
• there exist many other types of constraint qualifications
Duality 5–11
Inequality form LP
primal problem
minimize cT x
subject to Ax ⪯ b
dual function
g(λ) = inf_x ( (c + Aᵀλ)ᵀx − bᵀλ ) = { −bᵀλ if Aᵀλ + c = 0, −∞ otherwise }
dual problem
maximize −bT λ
subject to Aᵀλ + c = 0, λ ⪰ 0
Duality 5–12
Quadratic program
primal problem (assume P ∈ Sn++)
minimize xT P x
subject to Ax ⪯ b
dual function
g(λ) = inf_x ( xᵀPx + λᵀ(Ax − b) ) = −(1/4)λᵀAP⁻¹Aᵀλ − bᵀλ
dual problem
maximize −(1/4)λT AP −1AT λ − bT λ
subject to λ ⪰ 0
Duality 5–13
A nonconvex problem with strong duality
minimize xT Ax + 2bT x
subject to xT x ≤ 1
A ⋡ 0 (A is not positive semidefinite), hence nonconvex
strong duality although primal problem is not convex (not easy to show)
Duality 5–14
Geometric interpretation
for simplicity, consider problem with one constraint f1(x) ≤ 0
interpretation of dual function: for G = {(f1(x), f0(x)) | x ∈ D},
g(λ) = inf_{(u,t)∈G} (t + λu)
• λu + t = g(λ) is a (non-vertical) supporting hyperplane to G, intersecting the t-axis at t = g(λ)
[figure, left: the set G, p⋆, and the supporting line λu + t = g(λ); right: d⋆ as the best such intercept]
Duality 5–15
epigraph variation: same interpretation if G is replaced with
A = {(u, t) | f1(x) ≤ u, f0(x) ≤ t for some x ∈ D}
[figure: the set A, with supporting line λu + t = g(λ) passing through (0, p⋆)]
strong duality
• holds if there is a non-vertical supporting hyperplane to A at (0, p⋆)
• for convex problem, A is convex, hence has supp. hyperplane at (0, p⋆)
• Slater’s condition: if there exist (ũ, t̃) ∈ A with ũ < 0, then supporting
hyperplanes at (0, p⋆) must be non-vertical
Duality 5–16
Complementary slackness
Duality 5–17
Karush-Kuhn-Tucker (KKT) conditions
the following four conditions are called KKT conditions (for a problem with
differentiable fi, hi):
1. primal constraints: fi(x) ≤ 0, i = 1, . . . , m, hi(x) = 0, i = 1, . . . , p
2. dual constraints: λ ⪰ 0
3. complementary slackness: λifi(x) = 0, i = 1, . . . , m
4. gradient of Lagrangian with respect to x vanishes:
∇f0(x) + Σ_{i=1}^m λi∇fi(x) + Σ_{i=1}^p νi∇hi(x) = 0
from page 5–17: if strong duality holds and x, λ, ν are optimal, then they
must satisfy the KKT conditions
Duality 5–18
KKT conditions for convex problem
if x̃, λ̃, ν̃ satisfy KKT for a convex problem, then they are optimal:
• recall that Slater implies strong duality, and dual optimum is attained
• generalizes optimality condition ∇f0(x) = 0 for unconstrained problem
Duality 5–19
example: water-filling (assume αi > 0)
Pn
minimize − i=1 log(xi + αi)
subject to x 0, 1T x = 1
interpretation
• n patches; level of patch i is at height αi
• flood area with unit amount of water
• resulting level is 1/ν⋆
[figure: water at level 1/ν⋆ above patches of height αi; xi is the depth of water above patch i]
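The interpretation translates directly into an algorithm: bisect on the water level 1/ν until the total amount of water equals one, then read off xi = max{0, 1/ν − αi}. A sketch with made-up patch heights:

```python
# Water-filling by bisection on the level 1/nu (illustrative data):
# KKT conditions give x_i = max{0, level - alpha_i} with sum(x) = 1.
alpha = [0.5, 1.0, 2.0]

def total(level):
    return sum(max(0.0, level - a) for a in alpha)

lo, hi = 0.0, max(alpha) + 1.0    # total(lo) = 0 < 1 < total(hi)
while hi - lo > 1e-10:
    mid = (lo + hi) / 2
    if total(mid) < 1.0:
        lo = mid
    else:
        hi = mid
level = (lo + hi) / 2              # the water level 1/nu*
x = [max(0.0, level - a) for a in alpha]   # here: [0.75, 0.25, 0.0]
```

For these heights the level comes out at 1.25, so the deepest patch gets 0.75, the middle one 0.25, and the tallest patch stays dry.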
Duality 5–20
Perturbation and sensitivity analysis
(unperturbed) optimization problem and its dual
Duality 5–21
global sensitivity result
assume strong duality holds for unperturbed problem, and that λ⋆, ν ⋆ are
dual optimal for unperturbed problem
apply weak duality to perturbed problem:
p⋆(u, v) ≥ g(λ⋆, ν ⋆) − uT λ⋆ − v T ν ⋆
= p⋆(0, 0) − uT λ⋆ − v T ν ⋆
sensitivity interpretation
Duality 5–22
local sensitivity: if (in addition) p⋆(u, v) is differentiable at (0, 0), then
λi⋆ = −∂p⋆(0, 0)/∂ui,   νi⋆ = −∂p⋆(0, 0)/∂vi
[figure: p⋆(u) lies above its affine underestimator p⋆(0) − λ⋆u]
Duality 5–23
Duality and problem reformulations
common reformulations
Duality 5–24
Introducing new variables and equality constraints
minimize f0(Ax + b)
Duality 5–25
norm approximation problem: minimize ‖Ax − b‖
minimize kyk
subject to y = Ax − b
maximize bT ν
subject to Aᵀν = 0, ‖ν‖∗ ≤ 1
Duality 5–26
Implicit constraints
LP with box constraints: primal and dual problem
dual function
g(ν) = inf_{−1⪯x⪯1} (cᵀx + νᵀ(Ax − b)) = −bᵀν − ‖Aᵀν + c‖1
Duality 5–27
Problems with generalized inequalities
minimize f0(x)
subject to fi(x) ⪯Ki 0, i = 1, . . . , m
hi(x) = 0, i = 1, . . . , p
Duality 5–28
lower bound property: if λi ⪰Ki∗ 0, then g(λ1, . . . , λm, ν) ≤ p⋆
proof: if x̃ is feasible and λi ⪰Ki∗ 0, then
f0(x̃) ≥ f0(x̃) + Σ_{i=1}^m λiᵀfi(x̃) + Σ_{i=1}^p νihi(x̃)
≥ inf_{x∈D} L(x, λ1, . . . , λm, ν)
= g(λ1, . . . , λm, ν)
minimizing over all feasible x̃ gives p⋆ ≥ g(λ1, . . . , λm, ν)
Duality 5–29
Semidefinite program
primal SDP (Fi, G ∈ Sk)
minimize cᵀx
subject to x1F1 + · · · + xnFn ⪯ G
dual SDP
maximize −tr(GZ)
subject to Z ⪰ 0, tr(FiZ) + ci = 0, i = 1, . . . , n
Duality 5–30
Convex Optimization — Boyd & Vandenberghe
• norm approximation
• least-norm problems
• regularized approximation
• robust approximation
6–1
Norm approximation
minimize ‖Ax − b‖
y = Ax + v
AT Ax = AT b
minimize t
subject to −t1 ⪯ Ax − b ⪯ t1
minimize 1T y
subject to −y ⪯ Ax − b ⪯ y
examples
• quadratic: φ(u) = u²
• deadzone-linear with width a: φ(u) = max{0, |u| − a}
• log barrier with limit a: φ(u) = −a² log(1 − (u/a)²) for |u| < a, ∞ otherwise
[figure: graphs of the quadratic, deadzone-linear, and log-barrier penalties φ(u) on [−2, 2]]
[figure: histograms of the residual amplitudes for p = 1, p = 2, deadzone-linear, and log-barrier penalty approximation]
Huber penalty function (with parameter M): φhub(u) = u² for |u| ≤ M, M(2|u| − M) for |u| > M
[figure: φhub(u), and an affine fit f(t) to data with outliers using the Huber penalty]
minimize ‖x‖
subject to Ax = b
2x + AT ν = 0, Ax = b
minimize 1T y
subject to −y ⪯ x ⪯ y, Ax = b
Tikhonov regularization
[figures: optimal input design example: input u(t) and output y(t), 0 ≤ t ≤ 200, for three weight choices on the trade-off surface]
• x ∈ Rn is unknown signal
• xcor = x + v is (known) corrupted version of x, with additive noise v
• variable x̂ (reconstructed signal) is estimate of x
• φ : Rn → R is regularization function or smoothing objective
φquad(x̂) = Σ_{i=1}^{n−1} (x̂_{i+1} − x̂_i)²,   φtv(x̂) = Σ_{i=1}^{n−1} |x̂_{i+1} − x̂_i|
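A small illustration of why the two penalties behave differently on sharp transitions (made-up step signals): the quadratic penalty charges a jump of height h as h², total variation only as h, so quadratic smoothing strongly discourages sharp jumps:

```python
# Quadratic smoothing penalty vs. total variation on a step signal.
def phi_quad(s):
    return sum((s[i + 1] - s[i]) ** 2 for i in range(len(s) - 1))

def phi_tv(s):
    return sum(abs(s[i + 1] - s[i]) for i in range(len(s) - 1))

step1 = [0.0] * 10 + [1.0] * 10      # unit jump
step5 = [0.0] * 10 + [5.0] * 10      # jump of height 5

q1, t1 = phi_quad(step1), phi_tv(step1)   # 1 and 1
q5, t5 = phi_quad(step5), phi_tv(step5)   # 25 and 5: quadratic cost h^2 vs h
```

This is the mechanism behind the reconstruction examples: minimizing with φtv can keep the jump, while φquad prefers to spread it out.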
[figure: original signal x and corrupted signal xcor (0 ≤ i ≤ 4000), with three reconstructions x̂ on the trade-off curve of ‖x̂ − xcor‖2 versus φquad(x̂)]
[figure: original signal x and corrupted signal xcor (0 ≤ i ≤ 2000), with three reconstructions x̂ on the trade-off curve of ‖x̂ − xcor‖2 versus φquad(x̂)]
[figure: the same signals, with three reconstructions x̂ on the trade-off curve of ‖x̂ − xcor‖2 versus φtv(x̂)]
example: A(u) = A0 + uA1, with u uniform on [−1, 1]
• xnom minimizes ‖A0x − b‖2²
• xstoch minimizes E‖A(u)x − b‖2²
• xwc minimizes sup_{−1≤u≤1} ‖A(u)x − b‖2²
[figure: residual r(u) = ‖A(u)x − b‖2 as a function of u for xnom, xstoch, xwc]
• from page 5–14, strong duality holds between the following problems
minimize t + λ
subject to [ I P(x) q(x) ; P(x)ᵀ λI 0 ; q(x)ᵀ 0 t ] ⪰ 0
[figure: distribution of the residual r(u) for the least-squares, Tikhonov-regularized, and robust least-squares solutions xls, xtik, xrls]
7. Statistical estimation
• experiment design
7–1
Parametric distribution estimation
• y is observed value
• l(x) = log px(y) is called log-likelihood function
• can add constraints x ∈ C explicitly, or define px(y) = 0 for x 6∈ C
• a convex optimization problem if log px(y) is concave in x for fixed y
yi = aTi x + vi, i = 1, . . . , m
(y is observed value)
ML estimate is LS solution
• Laplacian noise: p(z) = (1/(2a))e−|z|/a,
l(x) = −m log(2a) − (1/a) Σ_{i=1}^m |aiᵀx − yi|
ML estimate is ℓ1-norm solution
p = prob(y = 1) = exp(aᵀu + b)/(1 + exp(aᵀu + b))
log-likelihood (for outcomes y1 = · · · = yk = 1, yk+1 = · · · = ym = 0):
l(a, b) = Σ_{i=1}^k (aᵀui + b) − Σ_{i=1}^m log(1 + exp(aᵀui + b))
concave in a, b
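Concavity in (a, b) can be spot-checked with a midpoint test on made-up data (the u values and the split into y = 1 / y = 0 examples below are arbitrary):

```python
import math

# Midpoint-concavity check of the logistic log-likelihood
# l(a, b) = sum_{i<=k} (a u_i + b) - sum_{i<=m} log(1 + exp(a u_i + b)).
u = [0.5, 1.0, 2.0, 3.0, 4.0]
k = 3                                  # first k examples have outcome y = 1

def loglik(a, b):
    return (sum(a * u[i] + b for i in range(k))
            - sum(math.log(1 + math.exp(a * ui + b)) for ui in u))

p1, p2 = (0.3, -0.5), (1.2, 0.4)       # two arbitrary parameter points
mid = ((p1[0] + p2[0]) / 2, (p1[1] + p2[1]) / 2)
# for a concave function: value at midpoint >= average of endpoint values
concave_ok = loglik(*mid) >= (loglik(*p1) + loglik(*p2)) / 2 - 1e-12
```

The first sum is affine in (a, b) and log(1 + exp(·)) is convex, so the test must pass for any data; the check is only a numerical confirmation.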
[figure: fitted probability prob(y = 1) as a function of u]
randomized detector
variable T ∈ R2×n
minimax detector
[figure: trade-off of false negative probability Pfn versus false positive probability Pfp for detectors 1–4]
x̂ = (Σ_{i=1}^m ai aiᵀ)⁻¹ Σ_{i=1}^m yi ai
λk (1 − vkT W vk ) = 0, k = 1, . . . , p
λ2 = 0.5
L(X, λ, Z, z, ν) = log det X⁻¹ + tr(Z(X − Σ_{k=1}^p λk vk vkᵀ)) − zᵀλ + ν(1ᵀλ − 1)
dual problem
change variable W = Z/ν, and optimize over ν to get dual of page 7–13
8. Geometric problems
• centering
• classification
8–1
Minimum volume ellipsoid around a set
factor n can be improved to √n if C is symmetric
[figure: Chebyshev center xcheb and maximum volume inscribed ellipsoid center xmve of the same polyhedron]
fi(x) ≤ 0, i = 1, . . . , m, Fx = g
xac is minimizer of
φ(x) = − Σ_{i=1}^m log(bi − aiᵀx)
[figure: analytic center xac of the inequalities aiᵀx ≤ bi]
where
aT xi + b > 0, i = 1, . . . , N, aT yi + b < 0, i = 1, . . . , M
aT xi + b ≥ 1, i = 1, . . . , N, aT yi + b ≤ −1, i = 1, . . . , M
H1 = {z | aT z + b = 1}
H2 = {z | aT z + b = −1}
minimize (1/2)‖a‖2
subject to aT xi + b ≥ 1, i = 1, . . . , N (1)
aT yi + b ≤ −1, i = 1, . . . , M
maximize 1ᵀλ + 1ᵀμ
subject to 2‖Σ_{i=1}^N λixi − Σ_{i=1}^M μiyi‖2 ≤ 1   (2)
1ᵀλ = 1ᵀμ, λ ⪰ 0, μ ⪰ 0
interpretation
minimize t
subject to ‖Σ_{i=1}^N θixi − Σ_{i=1}^M γiyi‖2 ≤ t
θ ⪰ 0, 1ᵀθ = 1, γ ⪰ 0, 1ᵀγ = 1
minimize 1T u + 1T v
subject to aT xi + b ≥ 1 − ui, i = 1, . . . , N
aT yi + b ≤ −1 + vi, i = 1, . . . , M
u ⪰ 0, v ⪰ 0
• an LP in a, b, u, v
• at optimum, ui = max{0, 1 − aT xi − b}, vi = max{0, 1 + aT yi + b}
• can be interpreted as a heuristic for minimizing #misclassified points
f (z) = θT F (z)
xTi P xi + q T xi + r ≥ 1, yiT P yi + q T yi + r ≤ −1
placement problem
minimize Σ_{i≠j} fij(xi, xj)
[figures: optimal placements and histograms of edge lengths for quadratic, fourth-power, and absolute-value cost functions]
9–1
Matrix structure and algorithm complexity
flop counts
x1 := b1/a11
x2 := (b2 − a21x1)/a22
x3 := (b3 − a31x1 − a32x2)/a33
..
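The recursion above is forward substitution for a lower-triangular system; a direct sketch:

```python
# Forward substitution for a lower-triangular system Ax = b:
# x_i = (b_i - sum_{j<i} a_ij x_j) / a_ii, costing ~n^2 flops.
def forward_subst(A, b):
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(i))) / A[i][i]
    return x

A = [[2.0, 0.0, 0.0],
     [1.0, 3.0, 0.0],
     [4.0, 1.0, 5.0]]
b = [2.0, 4.0, 14.0]
x = forward_subst(A, b)
```

Back substitution for an upper-triangular system is the same loop run from i = n−1 down to 0.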
permutation matrices: aij = 1 if j = πi, 0 otherwise
A = A 1 A2 · · · Ak
A = P LU
A = P1LU P2
A = LLT
A = P LLT P T
A = P LDLT P T
cost: (1/3)n3
(A22 − A21A11⁻¹A12)x2 = b2 − A21A11⁻¹b1
• step 3: (2/3)n2³
examples
• general A11 (f = (2/3)n1³, s = 2n1²): no gain over standard method
(A + BC)x = b
first write as
[ A B ; C −I ] [ x ; y ] = [ b ; 0 ]
now apply block elimination: solve
(I + CA⁻¹B)y = CA⁻¹b,
then solve Ax = b − By
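The two-step solve can be traced on a tiny rank-one update (all numbers made up), with A diagonal so that both subsolves are trivial:

```python
# Solve (A + B C) x = b via the elimination trick:
# first solve (I + C A^{-1} B) y = C A^{-1} b, then A x = b - B y.
# Tiny example: A = diag(2, 2), B = [1, 1]^T, C = [1, 1] (rank-one update).
a_diag = [2.0, 2.0]
B = [1.0, 1.0]          # n x 1
C = [1.0, 1.0]          # 1 x n
b = [4.0, 4.0]

Ainv_b = [bi / ai for bi, ai in zip(b, a_diag)]          # A^{-1} b
CAinvB = sum(c * Bi / ai for c, Bi, ai in zip(C, B, a_diag))  # scalar (p = 1)
y = sum(c * v for c, v in zip(C, Ainv_b)) / (1.0 + CAinvB)
x = [(bi - Bi * y) / ai for bi, Bi, ai in zip(b, B, a_diag)]  # x = [1, 1]

# residual of the dense system (A + B C) x = b, to confirm the trick
residual = max(abs(a_diag[i] * x[i]
                   + B[i] * sum(C[j] * x[j] for j in range(2)) - b[i])
               for i in range(2))
```

When A is cheap to solve (diagonal, banded, or already factored) and p is small, this costs far less than factoring the dense A + BC.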
this proves the matrix inversion lemma: if A and A + BC nonsingular,
{x | Ax = b} = {F z + x̂ | z ∈ Rn−p}
• Newton’s method
• self-concordant functions
• implementation
10–1
Unconstrained minimization
minimize f (x)
f (x(k)) → p⋆
∇f (x⋆) = 0
2nd condition is hard to verify, except when all sublevel sets are closed:
• equivalent to condition that epi f is closed
• true if dom f = Rn
• true if f (x) → ∞ as x → bd dom f
implications
• for x, y ∈ S,
f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (m/2)‖x − y‖2²
hence, S is bounded
• p⋆ > −∞, and for x ∈ S,
f(x) − p⋆ ≤ (1/(2m))‖∇f(x)‖2²
useful as stopping criterion (if you know m)
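A sketch of gradient descent using this bound as the stopping criterion, on a made-up strongly convex quadratic (for this f, the strong convexity constant is m = 2):

```python
# Gradient descent on f(x) = x1^2 + 10 x2^2 (strongly convex, m = 2),
# stopping when the certified gap bound ||grad f(x)||^2 / (2m) <= eps.
m = 2.0
eps = 1e-8

def grad(x):
    return [2.0 * x[0], 20.0 * x[1]]

x = [10.0, 1.0]
t = 0.05                        # fixed step size, < 2/L with L = 20
iters = 0
while True:
    g = grad(x)
    gap_bound = (g[0] ** 2 + g[1] ** 2) / (2 * m)   # f(x) - p* <= gap_bound
    if gap_bound <= eps:
        break
    x = [x[0] - t * g[0], x[1] - t * g[1]]
    iters += 1

f_val = x[0] ** 2 + 10.0 * x[1] ** 2    # guaranteed within eps of p* = 0
```

The point of the bound is that it certifies suboptimality without knowing p⋆; only the constant m is needed.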
[figure: backtracking line search: f(x + t∆x) as a function of the step size t]
• very slow if γ ≫ 1 or γ ≪ 1
• example for γ = 10:
[figures: gradient method iterates x(0), x(1), x(2), . . . on the elongated level curves; convergence is slow for γ = 10]
f(x) = cᵀx − Σ_{i=1}^{500} log(bi − aiᵀx)
[figure: f(x(k)) − p⋆ versus k for exact and backtracking line search; roughly linear convergence]
unit balls and normalized steepest descent directions for a quadratic norm
and the ℓ1-norm:
[figure: unit balls of a quadratic norm and of the ℓ1-norm, with −∇f(x) and the corresponding normalized steepest descent directions ∆xnsd]
[figures: steepest descent iterates x(0), x(1), x(2) for the two choices of quadratic norm]
• steepest descent with backtracking line search for two quadratic norms
• ellipses show {x | ‖x − x(k)‖P = 1}
• equivalent interpretation of steepest descent with quadratic norm k · kP :
gradient descent after change of variables x̄ = P 1/2x
interpretations
• x + ∆xnt minimizes second order approximation
f̂(x + v) = f(x) + ∇f(x)ᵀv + (1/2)vᵀ∇²f(x)v
• x + ∆xnt solves linearized optimality condition
∇f(x + v) ≈ ∇f(x) + ∇²f(x)v = 0
[figure, left: f and its quadratic model f̂ at (x, f(x)), minimized at x + ∆xnt; right: f′ and its linearization at (x, f′(x)), crossing zero at (x + ∆xnt, f′(x + ∆xnt))]
• ∆xnt is steepest descent direction at x in local Hessian norm
‖u‖∇²f(x) = (uᵀ∇²f(x)u)^{1/2}
[figure: x + ∆xnsd and x + ∆xnt on the ellipse {x + v | vᵀ∇²f(x)v = 1}]
Newton decrement: λ(x) gives an estimate of f(x) − p⋆, using quadratic approximation f̂:
f(x) − inf_y f̂(y) = (1/2)λ(x)²
• equal to the norm of the Newton step in the quadratic Hessian norm
λ(x) = (∆xntᵀ∇²f(x)∆xnt)^{1/2}
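A one-dimensional sketch of Newton's method with the decrement-based stopping rule λ(x)²/2 ≤ ǫ, on the illustrative example f(x) = eˣ + e⁻ˣ (minimized at x = 0):

```python
import math

# Newton's method in 1-D for f(x) = exp(x) + exp(-x).
# In one dimension lambda(x)^2 = f'(x)^2 / f''(x), and we stop when
# lambda(x)^2 / 2 <= eps (the estimated suboptimality).
def fp(x):   # f'(x)
    return math.exp(x) - math.exp(-x)

def fpp(x):  # f''(x), always > 0 here
    return math.exp(x) + math.exp(-x)

x = 1.5
eps = 1e-10
for _ in range(50):
    lam2 = fp(x) ** 2 / fpp(x)     # Newton decrement squared
    if lam2 / 2 <= eps:
        break
    x -= fp(x) / fpp(x)            # (undamped) Newton step
```

Starting close enough to the minimizer, the iterates exhibit the quadratic convergence described above: the number of correct digits roughly doubles per step.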
Newton iterates for f˜(y) = f (T y) with starting point y (0) = T −1x(0) are
y (k) = T −1x(k)
assumptions
• f strongly convex on S with constant m
• ∇2f is Lipschitz continuous on S, with constant L > 0:
(L/(2m²))‖∇f(x(l))‖2 ≤ ((L/(2m²))‖∇f(x(k))‖2)^{2^{l−k}} ≤ (1/2)^{2^{l−k}}, l ≥ k
conclusion: number of iterations until f(x) − p⋆ ≤ ǫ is bounded above by
(f(x(0)) − p⋆)/γ + log2 log2(ǫ0/ǫ)
[figures: Newton's method examples: iterates x(0), x(1) on level curves, and f(x(k)) − p⋆ versus k showing fast quadratic convergence]
definition
• convex f : R → R is self-concordant if |f ′′′(x)| ≤ 2f ′′(x)3/2 for all
x ∈ dom f
• f : Rn → R is self-concordant if g(t) = f (x + tv) is self-concordant for
all x ∈ dom f , v ∈ Rn
examples on R
• linear and quadratic functions
• negative logarithm f (x) = − log x
• negative entropy plus negative logarithm: f (x) = x log x − log x
properties
• preserved under positive scaling α ≥ 1, and sum
• preserved under composition with affine function
• if g is convex with dom g = R++ and |g ′′′(x)| ≤ 3g ′′(x)/x then
is self-concordant
examples: properties can be used to show that the following are s.c.
• f(x) = − Σ_{i=1}^m log(bi − aiᵀx) on {x | aiᵀx < bi, i = 1, . . . , m}
• f (X) = − log det X on Sn++
• f(x) = − log(y² − xᵀx) on {(x, y) | ‖x‖2 < y}
• if λ(x) ≤ η, then
2λ(x(k+1)) ≤ (2λ(x(k)))²
conclusion: number of iterations until f(x) − p⋆ ≤ ǫ is bounded above by
(f(x(0)) − p⋆)/γ + log2 log2(1/ǫ)
[figure: number of Newton iterations versus f(x(0)) − p⋆ for randomly generated problem families (◦: m = 100, n = 50; m = 1000, n = 500; ♦: m = 1000, n = 50); the iteration count grows slowly]
main effort in each iteration: evaluate derivatives and solve Newton system
H∆x = −g
• implementation
11–1
Equality constrained minimization
minimize f (x)
subject to Ax = b
∇f (x⋆) + AT ν ⋆ = 0, Ax⋆ = b
minimize (1/2)xT P x + q T x + r
subject to Ax = b
optimality condition:
[ P Aᵀ ; A 0 ] [ x⋆ ; ν⋆ ] = [ −q ; b ]
coefficient matrix is nonsingular provided
Ax = 0, x ≠ 0 =⇒ xᵀPx > 0
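For a tiny equality constrained QP the KKT system can be formed and solved directly (made-up data; the small Gaussian-elimination helper stands in for a real factorization routine):

```python
# KKT system for: minimize x1^2 + x2^2 subject to x1 + x2 = 2,
# i.e. P = 2I, q = 0, A = [1 1], b = 2:
#   [ P  A^T ] [ x  ]   [ -q ]
#   [ A   0  ] [ nu ] = [  b ]
def solve(M, rhs):
    # Gaussian elimination with partial pivoting on the augmented matrix
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(M, rhs)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

KKT = [[2.0, 0.0, 1.0],
       [0.0, 2.0, 1.0],
       [1.0, 1.0, 0.0]]
sol = solve(KKT, [0.0, 0.0, 2.0])    # (x1, x2, nu) = (1, 1, -2)
```

The KKT matrix is indefinite, so production codes use an LDLᵀ factorization or block elimination rather than plain elimination; the helper here is only for illustration.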
represent solution of {x | Ax = b} as
{x | Ax = b} = {F z + x̂ | z ∈ Rn−p}
minimize f (F z + x̂)
reduced problem:
interpretations
λ(x) = (∆xntᵀ∇²f(x)∆xnt)^{1/2} = (−∇f(x)ᵀ∆xnt)^{1/2}
properties
• gives an estimate of f(x) − p⋆ using the quadratic approximation f̂:
f(x) − inf_{Ay=b} f̂(y) = (1/2)λ(x)²
• in general, λ(x) ≠ (∇f(x)ᵀ∇²f(x)⁻¹∇f(x))^{1/2}
• variables z ∈ Rn−p
• x̂ satisfies Ax̂ = b; rank F = n − p and AF = 0
• Newton’s method for f˜, started at z (0), generates iterates z (k)
x(k+1) = F z (k) + x̂
primal-dual interpretation
• write optimality condition as r(y) = 0, where
given starting point x ∈ dom f , ν , tolerance ǫ > 0, α ∈ (0, 1/2), β ∈ (0, 1).
repeat
1. Compute primal and dual Newton steps ∆xnt, ∆νnt.
2. Backtracking line search on krk2.
t := 1.
while kr(x + t∆xnt, ν + t∆νnt)k2 > (1 − αt)kr(x, ν)k2, t := βt.
3. Update. x := x + t∆xnt, ν := ν + t∆νnt.
until Ax = b and kr(x, ν)k2 ≤ ǫ.
[ H Aᵀ ; A 0 ] [ v ; w ] = −[ g ; h ]
solution methods
• LDLT factorization
• elimination (if H nonsingular)
[figures: convergence on an equality constrained example: f(x(k)) − p⋆ versus k, and p⋆ − g(ν(k)) versus k]
3. infeasible start Newton method (requires x(0) ≻ 0)
[figure: residual norm ‖r(x(k), ν(k))‖2 versus k for the infeasible start Newton method]
Equality constrained minimization 11–14
complexity per iteration of three methods is identical
[ H Aᵀ ; A 0 ] [ v ; w ] = −[ g ; h ]
Σ_{j=1}^p tr(AiXAjX)wj = bi, i = 1, . . . , p   (2)
12–1
Inequality constrained minimization
minimize f0(x)
subject to fi(x) ≤ 0, i = 1, . . . , m (1)
Ax = b
• approximation improves as t → ∞
[figure: −(1/t) log(−u) as an approximation of the indicator of u ≤ 0, plotted for −3 ≤ u ≤ 1]
Interior-point methods 12–4
logarithmic barrier function
φ(x) = − Σ_{i=1}^m log(−fi(x)), dom φ = {x | f1(x) < 0, . . . , fm(x) < 0}
∇φ(x) = Σ_{i=1}^m (1/(−fi(x))) ∇fi(x)
∇²φ(x) = Σ_{i=1}^m (1/fi(x)²) ∇fi(x)∇fi(x)ᵀ + Σ_{i=1}^m (1/(−fi(x))) ∇²fi(x)
(for now, assume x⋆(t) exists and is unique for each t > 0)
• central path is {x⋆(t) | t > 0}
minimize cT x
subject to aTi x ≤ bi, i = 1, . . . , 6
[figure: central path for an LP with n = 2, m = 6, showing x⋆(10) and x⋆]
hyperplane cᵀx = cᵀx⋆(t) is tangent to level curve of φ through x⋆(t)
p⋆ ≥ g(λ⋆(t), ν ⋆(t))
= L(x⋆(t), λ⋆(t), ν ⋆(t))
= f0(x⋆(t)) − m/t
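A one-dimensional barrier-method sketch (made-up LP: minimize x on [−1, 2], so m = 2 and p⋆ = −1), with damped Newton centering and the m/t gap bound as stopping criterion:

```python
# Barrier method for: minimize x subject to -1 <= x <= 2 (p* = -1, m = 2).
# Inner loop: Newton on t*x + phi(x), phi(x) = -log(x + 1) - log(2 - x);
# outer loop: t := mu * t, stop when the gap bound m/t is small.
def dphi(x):   # phi'(x)
    return -1.0 / (x + 1) + 1.0 / (2 - x)

def d2phi(x):  # phi''(x) > 0 on the interior
    return 1.0 / (x + 1) ** 2 + 1.0 / (2 - x) ** 2

x, t, mu, m = 0.5, 1.0, 10.0, 2
while m / t > 1e-6:
    for _ in range(100):                  # centering: damped Newton steps
        g = t + dphi(x)                   # gradient of t*x + phi(x)
        step = -g / d2phi(x)
        while not (-1 < x + step < 2):    # damp to stay strictly feasible
            step *= 0.5
        x += step
        if abs(g) < 1e-12:
            break
    t *= mu

# x tracks the central path and ends within roughly m/t of x* = -1
```

Each outer iteration reduces the certified gap by the factor μ, which is the trade-off discussed later: larger μ means fewer outer iterations but more Newton steps per centering.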
∇f0(x) + Σ_{i=1}^m λi∇fi(x) + Aᵀν = 0
F0(x⋆(t)) + Σ_{i=1}^m Fi(x⋆(t)) = 0
[figure: force field interpretation for an LP: the objective force −tc (shown for t = 1 and t = 3) is balanced by the constraint forces]
centering problem
[figures: barrier method on an LP: duality gap versus total Newton iterations for μ = 2, 50, 150, and total Newton iterations versus μ]
[figure: duality gap versus Newton iterations for a GP example]
minimize cT x
subject to Ax = b, x ⪰ 0
[figure: family of standard form LPs: Newton iterations versus m (m = 10 to 1000); the iteration count grows only slowly with m]
fi(x) ≤ 0, i = 1, . . . , m, Ax = b (2)
minimize (over x, s) s
subject to fi(x) ≤ s, i = 1, . . . , m (3)
Ax = b
minimize 1T s
subject to s ⪰ 0, fi(x) ≤ si, i = 1, . . . , m
Ax = b
[figures: phase I example: histograms of bi − aiᵀxmax and bi − aiᵀxsum, and Newton iterations versus the parameter γ for the feasible (γ > 0) and infeasible (γ < 0) cases]
number of iterations roughly proportional to log(1/|γ|)
second condition
[figure: total number of Newton iterations N versus μ for typical values of γ, c, with m = 100 and m/(t(0)ǫ) = 10⁵; N is large for μ near 1 and nearly flat over a wide range of larger μ]
• for μ = 1 + 1/√m:
N = O(√m log(m/(t(0)ǫ)))
√
• number of Newton iterations for fixed gap reduction is O( m)
minimize f0(x)
subject to fi(x) ⪯Ki 0, i = 1, . . . , m
Ax = b
examples
• nonnegative orthant K = Rn+: ψ(y) = Σ_{i=1}^n log yi, with degree θ = n
• positive semidefinite cone K = Sn+: ψ(Y) = log det Y (θ = n)
• second-order cone: ψ(y) = log(yn+1² − y1² − · · · − yn²) (θ = 2)
properties: for y ≻K 0,
∇ψ(y) ⪰K∗ 0, yᵀ∇ψ(y) = θ
• nonnegative orthant Rn+: ψ(y) = Σ_{i=1}^n log yi
φ(x) = − Σ_{i=1}^m ψi(−fi(x)), dom φ = {x | fi(x) ≺Ki 0, i = 1, . . . , m}
λi⋆(t) = (1/t) ∇ψi(−fi(x⋆(t))),   ν⋆(t) = w/t
f0(x⋆(t)) − g(λ⋆(t), ν⋆(t)) = (1/t) Σ_{i=1}^m θi
minimize cᵀx
subject to F(x) = Σ_{i=1}^n xiFi + G ⪯ 0
dual:
maximize tr(GZ)
subject to tr(FiZ) + ci = 0, i = 1, . . . , n
Z ⪰ 0
P
• only difference is duality gap m/t on central path is replaced by Σi θi/t
[figures: barrier method on an SOCP: duality gap versus Newton iterations for μ = 2, 50, 200, and Newton iterations versus μ]
[figure: barrier method on an SDP: duality gap versus Newton iterations]
minimize 1T x
subject to A + diag(x) ⪰ 0
[figure: family of SDPs: Newton iterations versus n (n = 10 to 1000); growth is very modest]
13. Conclusions
13–1
Modeling
mathematical optimization
tractability
Conclusions 13–2
Theoretical consequences of convexity
Conclusions 13–3
Practical consequences of convexity
• high level modeling tools like cvx ease modeling and problem
specification
Conclusions 13–4
How to use convex optimization
• even if the problem is quite nonconvex, you can use convex optimization
– in subproblems, e.g., to find search direction
– by repeatedly forming and solving a convex approximation at the
current point
Conclusions 13–5
Further topics
Conclusions 13–6
What’s next?
Conclusions 13–7