Last time: ADMM
Consider a problem of the form:

  min_{x,z} f(x) + g(z) subject to Ax + Bz = c
ADMM in scaled form

With scaled dual variable u (the dual variable divided by the augmented Lagrangian parameter ρ), the updates are:

  x^(k) = argmin_x f(x) + (ρ/2)‖Ax + Bz^(k−1) − c + u^(k−1)‖₂²
  z^(k) = argmin_z g(z) + (ρ/2)‖Ax^(k) + Bz − c + u^(k−1)‖₂²
  u^(k) = u^(k−1) + Ax^(k) + Bz^(k) − c
Outline
Today:
• Conditional gradient method
• Convergence analysis
• Properties and variants
Consider the constrained problem

  min_x f(x) subject to x ∈ C

where f is convex and smooth, and C is a convex set.
Conditional gradient (Frank-Wolfe) method
Choose an initial x^(0) ∈ C, and for k = 1, 2, 3, . . . repeat:

  s^(k−1) ∈ argmin_{s∈C} ∇f(x^(k−1))^T s
  x^(k) = (1 − γ_k) x^(k−1) + γ_k s^(k−1)

The standard choice of step size is γ_k = 2/(k + 1), k = 1, 2, 3, . . .; i.e., we are moving less and less in the direction of the linearization minimizer as the algorithm proceeds.
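To make the iteration concrete, here is a minimal sketch of the generic method in Python/NumPy, assuming a user-supplied linear minimization oracle lmo(g) that returns a minimizer of g^T s over C (the function names and interface are illustrative, not from the slides):

    import numpy as np

    def frank_wolfe(grad, lmo, x0, max_iter=1000, tol=1e-6):
        """Generic Frank-Wolfe: grad(x) returns the gradient of f at x;
        lmo(g) returns a minimizer of g^T s over the constraint set C."""
        x = x0.copy()
        for k in range(1, max_iter + 1):
            g = grad(x)
            s = lmo(g)                       # linearization minimizer over C
            gap = g @ (x - s)                # duality gap: bounds f(x) - fstar
            if gap <= tol:
                break
            gamma = 2.0 / (k + 1)            # standard step size
            x = (1 - gamma) * x + gamma * s  # convex combination stays in C
        return x

Note that each iterate is a convex combination of the previous iterate and a point of C, so feasibility is maintained without any projection.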
[Figure from Jaggi (2011): a Frank-Wolfe step moves x^(k) toward the minimizer s of the linearization of the objective f over the domain D, toward the solution x⋆; the duality gap g(x) is also illustrated.]

The method is classical: it goes back to Frank & Wolfe (1956) and was studied further by Dunn & Harshbarger (1978). In recent years it has seen renewed interest.
Norm constraints
What happens when C = {x : ‖x‖ ≤ t} for a norm ‖·‖? Then

  s^(k−1) ∈ argmin_{‖s‖≤t} ∇f(x^(k−1))^T s
          = −t · argmax_{‖s‖≤1} ∇f(x^(k−1))^T s

i.e., s^(k−1) is −t times a subgradient of the dual norm ‖∇f(x^(k−1))‖_*. So if we know how to compute subgradients of the dual norm, we can easily perform Frank-Wolfe steps.
Example: ℓ1 regularization

For the ℓ1-regularized problem

  min_x f(x) subject to ‖x‖₁ ≤ t

the dual norm is ℓ∞, so the Frank-Wolfe update reduces to finding the largest gradient component: with i_{k−1} ∈ argmax_i |∇_i f(x^(k−1))|,

  x^(k) = (1 − γ_k) x^(k−1) − γ_k t · sign(∇_{i_{k−1}} f(x^(k−1))) · e_{i_{k−1}}

Note: this is a lot simpler than projection onto the ℓ1 ball, though both require O(n) operations.
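A minimal NumPy sketch of this oracle (the function name is illustrative):

    import numpy as np

    def lmo_l1(g, t):
        """Linear minimization oracle for the l1 ball {x : ||x||_1 <= t}:
        returns -t * sign(g_i) * e_i for the largest-magnitude entry of g."""
        i = np.argmax(np.abs(g))        # O(n) scan for the max component
        s = np.zeros_like(g)
        s[i] = -t * np.sign(g[i])
        return s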
Example: ℓp regularization

For the ℓp-regularized problem

  min_x f(x) subject to ‖x‖_p ≤ t

with 1 ≤ p ≤ ∞, the dual norm is ℓ_q with 1/p + 1/q = 1, and the Frank-Wolfe update uses

  s_i^(k−1) = −α · sign(∇_i f(x^(k−1))) · |∇_i f(x^(k−1))|^{q/p},  i = 1, . . . , n

where α > 0 is chosen so that ‖s^(k−1)‖_p = t (this follows from Hölder's inequality).

Note: this is a lot simpler than projection onto the ℓp ball for general p. Aside from special cases (p = 1, 2, ∞), these projections cannot be computed directly and must themselves be treated as optimization problems.
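A sketch of the corresponding oracle for 1 < p < ∞, following the formula above (names illustrative; p = 1, ∞ would need separate handling):

    import numpy as np

    def lmo_lp(g, t, p):
        """Linear minimization oracle for the lp ball {x : ||x||_p <= t}."""
        q = p / (p - 1.0)                        # dual exponent: 1/p + 1/q = 1
        s = -np.sign(g) * np.abs(g) ** (q / p)   # direction from Holder's inequality
        norm_p = np.sum(np.abs(s) ** p) ** (1.0 / p)
        if norm_p == 0:                          # zero gradient: any feasible s works
            return np.zeros_like(g)
        return t * s / norm_p                    # rescale so that ||s||_p = t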
Example: trace norm regularization
For the trace-norm-regularized problem

  min_X f(X) subject to ‖X‖_tr ≤ t

the dual norm is the operator norm, and the Frank-Wolfe update uses

  S^(k−1) = −t · uv^T

where u, v are the leading left and right singular vectors of ∇f(X^(k−1)).

Note: this is a lot simpler and more efficient than projection onto the trace norm ball, which requires a full singular value decomposition; here we only need the top singular pair.
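A sketch using scipy's sparse SVD to extract just the top singular pair (function name illustrative):

    import numpy as np
    from scipy.sparse.linalg import svds

    def lmo_trace(G, t):
        """Linear minimization oracle for the trace norm ball {X : ||X||_tr <= t}:
        returns -t * u v^T for the top singular pair of the gradient G."""
        u, _, vt = svds(G, k=1)          # only the leading singular triplet
        return -t * np.outer(u[:, 0], vt[0, :])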
Constrained and Lagrange forms

Recall that the constrained problem

  min_x f(x) subject to ‖x‖ ≤ t

and the Lagrange (penalized) problem

  min_x f(x) + λ‖x‖

are equivalent as t and λ trace out their respective ranges. Frank-Wolfe is naturally suited to the constrained form, while proximal gradient methods are naturally suited to the Lagrange form. Comparing the corresponding updates:
• ℓ1 norm: the Frank-Wolfe update scans for the maximum-magnitude component of the gradient; the proximal operator soft-thresholds the gradient step; both use O(n) flops
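For contrast with the lmo_l1 sketch above, the proximal side of this comparison (a minimal sketch; the helper name soft_threshold is illustrative):

    import numpy as np

    def soft_threshold(x, lam):
        """Proximal operator of lam * ||.||_1: componentwise shrinkage toward zero."""
        return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)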
Comparing projected and conditional gradient for constrained lasso
problem, with n = 100, p = 500:
[Plot: suboptimality f − fstar on a log scale (1e−01 to 1e+03) versus iteration, for projected gradient vs. conditional gradient.]
Duality gap

Note that Frank-Wolfe comes with a natural suboptimality certificate. By convexity of f, for any s,

  f(s) ≥ f(x) + ∇f(x)^T (s − x)

Minimizing both sides over s ∈ C gives f⋆ ≥ f(x) − max_{s∈C} ∇f(x)^T (x − s). Hence the quantity

  g(x) = max_{s∈C} ∇f(x)^T (x − s)

satisfies g(x) ≥ f(x) − f⋆, and g(x^(k−1)) comes for free at iteration k, since the max is attained by the Frank-Wolfe atom s^(k−1).
Convergence analysis
Following Jaggi (2011), define the curvature constant of f over C:
  M = max_{x,s,y∈C : y=(1−γ)x+γs, γ∈[0,1]} (2/γ²) · [ f(y) − f(x) − ∇f(x)^T (y − x) ]

Theorem: the conditional gradient method with fixed step sizes γ_k = 2/(k + 1), k = 1, 2, 3, . . ., satisfies

  f(x^(k)) − f⋆ ≤ 2M / (k + 2)

If ∇f is Lipschitz with constant L, then M is controlled by L: since f(y) − f(x) − ∇f(x)^T (y − x) ≤ (L/2)‖y − x‖₂² and y − x = γ(s − x),

  M ≤ max_{x,s,y∈C : y=(1−γ)x+γs} (2/γ²) · (L/2)‖y − x‖₂² = max_{x,s∈C} L‖x − s‖₂² = L · diam²(C)
Basic inequality
The key inequality used to prove the Frank-Wolfe convergence rate
is:
  f(x^(k)) ≤ f(x^(k−1)) − γ_k g(x^(k−1)) + (γ_k²/2) M
Here g(x) = max_{s∈C} ∇f(x)^T (x − s) is the duality gap discussed earlier. The rate follows from this inequality by induction.
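Where this comes from: the update is x^(k) = x^(k−1) + γ_k (s^(k−1) − x^(k−1)), so taking y = x^(k), x = x^(k−1), s = s^(k−1), γ = γ_k in the definition of the curvature constant M gives

  f(x^(k)) ≤ f(x^(k−1)) + γ_k ∇f(x^(k−1))^T (s^(k−1) − x^(k−1)) + (γ_k²/2) M

and the middle term equals −γ_k g(x^(k−1)), because s^(k−1) attains the max in the definition of g.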
Affine invariance
Important property of Frank-Wolfe: its updates are affine invariant.
Given nonsingular A : R^n → R^n, define x = Ax′ and h(x′) = f(Ax′). Then Frank-Wolfe on h proceeds as

  s′ = argmin_{z ∈ A⁻¹C} ∇h(x′)^T z
  (x′)⁺ = (1 − γ) x′ + γ s′

Multiplying by A shows that this is exactly the Frank-Wolfe update for f over C, so the iterates correspond under x = Ax′. The convergence analysis is affine invariant as well: the curvature constant of h over A⁻¹C equals that of f over C.
Inexact updates
Jaggi (2011) also analyzes inexact Frank-Wolfe updates. That is,
suppose we choose s^(k−1) so that

  ∇f(x^(k−1))^T s^(k−1) ≤ min_{s∈C} ∇f(x^(k−1))^T s + (Mγ_k/2) · δ

where δ ≥ 0 is our inaccuracy parameter. Then we basically attain the same rate:

  f(x^(k)) − f⋆ ≤ (2M/(k + 2)) · (1 + δ)

Note: the optimization error at step k is (Mγ_k/2) · δ. Since γ_k → 0, we require the errors to vanish as the algorithm proceeds.
Some variants
Two common modifications:

• Line search: instead of the fixed step sizes γ_k = 2/(k + 1), choose γ_k by minimizing f((1 − γ) x^(k−1) + γ s^(k−1)) over γ ∈ [0, 1], exactly or by backtracking

• Fully corrective: re-optimize over the convex hull of all atoms seen so far, x^(k) = argmin_y f(y) subject to y ∈ conv{x^(0), s^(0), . . . , s^(k−1)}. Can make much better progress, but is also quite a bit harder
Another variant: away steps
Suppose C = conv(A) for a set of atoms A.
Keep an explicit description of x ∈ C as a convex combination of elements in A:

  x = Σ_{a∈A} λ_a(x) · a

At each iteration, compute the usual Frank-Wolfe atom s^(k−1) along with the away atom

  v^(k−1) ∈ argmax_{a∈A : λ_a(x^(k−1)) > 0} ∇f(x^(k−1))^T a

and, when the away direction x^(k−1) − v^(k−1) is the better descent direction, shrink the weight on v^(k−1) instead of moving toward s^(k−1). This lets the method drop poorly chosen atoms, avoiding the zig-zagging that slows plain Frank-Wolfe near the boundary of C.
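Away-step Frank-Wolfe in miniature, for the probability simplex (atoms = standard basis vectors) and a simple quadratic objective with exact line search; the setup, objective, and function name are illustrative assumptions, not from the slides:

    import numpy as np

    def away_step_fw_simplex(b, n_iter=200):
        """Minimize f(x) = 0.5 * ||x - b||^2 over the probability simplex,
        using Frank-Wolfe with away steps; lam tracks the convex-combination
        weights of the iterate over the basis-vector atoms."""
        n = len(b)
        lam = np.zeros(n); lam[0] = 1.0          # start at the vertex e_0
        x = lam.copy()
        for _ in range(n_iter):
            g = x - b                            # gradient of f
            s = np.argmin(g)                     # Frank-Wolfe atom: min_i g_i
            active = np.where(lam > 0)[0]
            v = active[np.argmax(g[active])]     # away atom: worst active vertex
            d_fw = -x.copy(); d_fw[s] += 1.0     # direction e_s - x
            d_aw = x.copy(); d_aw[v] -= 1.0      # direction x - e_v
            if g @ d_fw <= g @ d_aw:             # pick the steeper descent direction
                d, gmax, fw_step = d_fw, 1.0, True
            else:                                # cap keeps lam[v] >= 0
                d, gmax, fw_step = d_aw, lam[v] / (1.0 - lam[v] + 1e-16), False
            denom = d @ d                        # exact line search for the quadratic
            gamma = np.clip(-(g @ d) / denom, 0.0, gmax) if denom > 0 else 0.0
            x = x + gamma * d
            if fw_step:                          # shift weight toward atom s
                lam *= (1 - gamma); lam[s] += gamma
            else:                                # shift weight away from atom v
                lam *= (1 + gamma); lam[v] -= gamma
        return x

For example, away_step_fw_simplex(np.array([0.3, 0.7, 0.0])) converges to b, with the away steps driving the weight on unused vertices exactly to zero.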
Linear convergence
Consider the constrained problem

  min_x f(x) subject to x ∈ conv(A) ⊆ R^n

where f is strongly convex with parameter µ and ∇f is Lipschitz with constant L. Then Frank-Wolfe with away steps converges linearly:

  f(x^(k)) − f⋆ ≤ (1 − r)^{k/2} · (f(x^(0)) − f⋆), for r = (µ/L) · Φ(A)² / (4 diam(A)²)

where Φ(A) measures the conditioning of the atom set (see Peña and Rodríguez 2016).
Path following
Given the norm-constrained problem

  min_x f(x) subject to ‖x‖ ≤ t

the Frank-Wolfe algorithm can be used for path following, i.e., it can produce an (approximate) solution path x̂(t), t ≥ 0. Beginning at t₀ = 0 and x̂(0) = 0, we fix parameters ε, m > 0, then repeat for k = 1, 2, 3, . . .:

• Calculate

  t_k = t_{k−1} + (1 − 1/m) · ε / ‖∇f(x̂(t_{k−1}))‖_*

and set x̂(t) = x̂(t_{k−1}) for all t ∈ (t_{k−1}, t_k)

• Compute x̂(t_k) by running Frank-Wolfe at t = t_k, terminating when the duality gap is ≤ ε/m

With this choice of t_k, the duality gap remains ≤ ε for the same x̂ over the whole interval between t_{k−1} and t_k, so the path is ε-suboptimal throughout.
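A rough sketch of this loop in Python, reusing the frank_wolfe function from earlier and assuming a gradient oracle, a dual-norm function, and a factory lmo_at(t) for the ball of radius t (all names are illustrative assumptions):

    def fw_path(grad, lmo_at, dual_norm, x0, eps, m, t_max):
        """Trace an approximate solution path x_hat(t) over t in [0, t_max].
        lmo_at(t) returns a linear minimization oracle for {x : ||x|| <= t};
        frank_wolfe (defined above) stops once the duality gap is <= tol.
        Assumes x0 = 0 and a nonzero gradient at each visited point."""
        t, x = 0.0, x0.copy()                                 # x_hat(0) = 0
        path = [(t, x.copy())]
        while t < t_max:
            t += (1 - 1.0 / m) * eps / dual_norm(grad(x))     # largest safe increase in t
            x = frank_wolfe(grad, lmo_at(t), x, tol=eps / m)  # re-solve to gap <= eps/m
            path.append((t, x.copy()))
        return path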
References
• K. Clarkson (2010), “Coresets, sparse greedy approximation,
and the Frank-Wolfe algorithm”
• J. Giesen, M. Jaggi, and S. Laue (2012), “Approximating parametrized convex optimization problems”
• M. Jaggi (2011), “Sparse convex optimization methods for
machine learning”
• M. Jaggi (2011), “Revisiting Frank-Wolfe: projection-free
sparse convex optimization”
• M. Frank and P. Wolfe (1956), “An algorithm for quadratic
programming”
• J. Peña and D. Rodríguez (2016), “Polytope conditioning and linear convergence of the Frank-Wolfe algorithm”
• R. J. Tibshirani (2015), “A general framework for fast
stagewise algorithms”