Overview: In this lecture, we discuss lower bounds on the complexity of first-order optimization algorithms. We then introduce projected gradient descent for constrained optimization problems, discuss its convergence rates, and illustrate how to compute the projection onto a few convex sets through examples.
10.1 Recap
In the previous lecture, we saw the convergence rates of the gradient descent and accelerated gradient descent algorithms, which can be summarized as below. Note that κ = L/µ.
In order to show the optimality of these algorithms in terms of complexity, lower bounds will be discussed
in this lecture.
10.2 Lower Bounds

We look at lower complexity bounds for convex optimization problems solved by first-order methods, for objective functions belonging to certain classes.
Theorem 10.1 For any t, 1 ≤ t ≤ (1/2)(n − 1), and any x_0 ∈ R^n, there exists an L-smooth convex function f such that, for any first-order method M that generates a sequence of points x_t with x_t ∈ x_0 + span{∇f(x_0), . . . , ∇f(x_{t−1})}, we have

    f(x_t) − f(x*) ≥ 3L‖x_0 − x*‖² / (32(t + 1)²).
Proof: The sequence of iterates for f (x) starting from x0 is just a shift of the sequence generated for f (x+x0 )
from the origin. Without loss of generality, we assume that x0 = 0.
Consider the function f where

    f(x) = (L/4) · ( (1/2) xᵀAx − e₁ᵀx ),
Lecture 10: Lower bounds & Projected Gradient Descent – September 22
where A ∈ R^{n×n} is the tridiagonal matrix with A_{ii} = 2, A_{i,i+1} = A_{i+1,i} = −1, and all other entries 0, and e₁ = (1, 0, . . . , 0)ᵀ ∈ R^n.
1. We show that the function f is convex and L-smooth. We have ∇²f(x) = (L/4)A, so it suffices to show that 0 ⪯ A ⪯ 4I. For any x ∈ R^n, we have

    xᵀAx = 2 ∑_{i=1}^{n} x_i² − 2 ∑_{i=1}^{n−1} x_i x_{i+1} = x₁² + x_n² + ∑_{i=1}^{n−1} (x_i − x_{i+1})².

This is nonnegative, and since (x_i − x_{i+1})² ≤ 2x_i² + 2x_{i+1}², it is also at most 4 ∑_i x_i². Hence 0 ⪯ A ⪯ 4I.
2. We compute the optimal solution of f. Setting ∇f(x) = (L/4)(Ax − e₁) = 0 gives Ax = e₁. One can verify that x_i = 1 − i/(n+1) satisfies these equations:

    2(1 − 1/(n+1)) − (1 − 2/(n+1)) = 1,
    −(1 − (i−1)/(n+1)) + 2(1 − i/(n+1)) − (1 − (i+1)/(n+1)) = 0, ∀i ≥ 2.
Thus, the optimal solution x* is given by

    x*_i = 1 − i/(n+1), ∀ 1 ≤ i ≤ n,

and the optimal value of the function is given by

    f(x*) = (L/4) (−(1/2) e₁ᵀ x*) = (L/8) (−1 + 1/(n+1)),

where we used x*ᵀAx* = e₁ᵀx*, which holds since Ax* = e₁.
We have used t ≤ (1/2)(n − 1), which gives n ≥ 2t + 1, in the above to obtain the required bounds.
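The construction above can be checked numerically. The following sketch (our own, not part of the notes) builds the worst-case quadratic for a given t in dimension n = 2t + 1, runs plain gradient descent from x_0 = 0 (a valid first-order method, since its iterates stay in the required span), and verifies the lower bound of Theorem 10.1:

```python
import numpy as np

def worst_case_gap(t, L=1.0):
    """Run t steps of GD on the Theorem 10.1 construction with n = 2t + 1.

    Returns (gap, bound) with gap = f(x_t) - f(x*)."""
    n = 2 * t + 1
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)  # tridiagonal A
    e1 = np.zeros(n); e1[0] = 1.0
    f = lambda x: (L / 4) * (0.5 * x @ A @ x - e1 @ x)
    grad = lambda x: (L / 4) * (A @ x - e1)
    x_star = 1.0 - np.arange(1, n + 1) / (n + 1)          # x*_i = 1 - i/(n+1)
    f_star = (L / 8) * (-1.0 + 1.0 / (n + 1))
    x = np.zeros(n)                                        # x_0 = 0
    for _ in range(t):                                     # plain GD, step 1/L
        x = x - (1.0 / L) * grad(x)
    bound = 3 * L * np.linalg.norm(x_star) ** 2 / (32 * (t + 1) ** 2)
    return f(x) - f_star, bound

for t in (1, 5, 10):
    gap, bound = worst_case_gap(t)
    assert gap >= bound > 0    # Theorem 10.1's lower bound holds
```

Note that the function depends on t: for a fixed dimension, the bound is only guaranteed up to t = (n − 1)/2.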
Theorem 10.2 For any x_0 ∈ ℓ₂, and any constants µ > 0 and L > µ, there exists a µ-strongly convex and L-smooth function f such that, for any first-order method M that generates a sequence of points x_t with x_t ∈ x_0 + span{∇f(x_0), . . . , ∇f(x_{t−1})}, we have

    f(x_t) − f(x*) ≥ (µ/2) ((√κ − 1)/(√κ + 1))^{2t} ‖x_0 − x*‖²,

where κ = L/µ.
Proof: Consider the function f given by

    f(x) = ((L − µ)/4) ( (1/2) xᵀAx − e₁ᵀx ) + (µ/2)‖x‖²,

where A and e₁ are as defined in the proof of Theorem 10.1 (now acting on ℓ₂).
1. Since 0 ⪯ A ⪯ 4I, we have ∇²f(x) = ((L − µ)/4)A + µI, and hence µI ⪯ ∇²f(x) ⪯ LI, so f is µ-strongly convex and L-smooth.
2. We now compute the optimal solution of f. From the first-order optimality conditions, we have

    ∇f(x) = ( ((L − µ)/4) A + µI ) x − ((L − µ)/4) e₁ = 0,

which gives (A + (4/(κ − 1)) I) x = e₁. This gives us a set of equations

    2 ((κ + 1)/(κ − 1)) x₁ − x₂ = 1,
    x_{i+1} + x_{i−1} − 2 ((κ + 1)/(κ − 1)) x_i = 0, ∀i ≥ 2.
Let q = (√κ − 1)/(√κ + 1). We verify that x*_i = q^i gives the optimal solution. Substituting into the optimality conditions, and using the identity q + 1/q = 2(κ + 1)/(κ − 1), gives

    q² − 2 ((κ + 1)/(κ − 1)) q + 1 = 0,

so the recurrence holds for all i ≥ 2, and the first equation follows as 2((κ + 1)/(κ − 1)) q − q² = 1. This solution is also unique due to the strong convexity of f.
3. We now find bounds to prove the theorem. From the sum of an infinite geometric series, we have

    ‖x*‖² = ∑_{i=1}^{∞} q^{2i} = q²/(1 − q²).

Similar to the proof of Theorem 10.1, it can be seen that x_t ∈ span{e₁, . . . , e_t}. Therefore,

    ‖x_t − x*‖² ≥ ∑_{i=t+1}^{∞} q^{2i} = q^{2(t+1)}/(1 − q²).
From the strong convexity of f, we have f(x_t) − f(x*) ≥ ⟨∇f(x*), x_t − x*⟩ + (µ/2)‖x_t − x*‖². Since x* is optimal, we have ⟨∇f(x*), x − x*⟩ ≥ 0 for any x. Hence,

    f(x_t) − f(x*) ≥ (µ/2) ‖x_t − x*‖².
We obtain the final result as follows (recall that x_0 = 0, so ‖x_0 − x*‖² = ‖x*‖²):

    (f(x_t) − f(x*)) / ‖x_0 − x*‖² ≥ µ‖x_t − x*‖² / (2‖x_0 − x*‖²) ≥ (µ/2) · (q^{2(t+1)}/(1 − q²)) / (q²/(1 − q²)) = (µ/2) q^{2t}.
Thus, we see that the accelerated gradient descent algorithms discussed in the previous lecture are optimal
in the complexity sense.
10.3 Projected Gradient Descent

So far, we have been concerned with finding the optimal solution of an unconstrained optimization problem. In real life, however, we are likely to come across constrained optimization problems. In this section, we discuss how to solve the constrained optimization problem

    min_{x∈X} f(x).
If we apply the gradient descent update to a point x_t ∈ X, it is possible that the iterate x_{t+1} = x_t − ∇f(x_t)/L may not belong to the constraint set X. In projected gradient descent, we simply choose as x_{t+1} the point in X nearest to x_t − ∇f(x_t)/L, i.e., the projection of x_t − ∇f(x_t)/L onto the set X.
Projected Gradient Descent (PGD): Given a starting point x_0 ∈ X and step size γ > 0, PGD repeats the update

    x_{t+1} = Π_X (x_t − γ∇f(x_t))

until a certain stopping criterion is satisfied, where Π_X(y) = argmin_{x∈X} (1/2)‖x − y‖₂² denotes the projection of y onto X. The projection operator satisfies the following properties:

1. If y ∈ X, then Π_X(y) = y.

2. The projection onto a convex set X is non-expansive:

    ‖Π_X(x) − Π_X(y)‖₂ ≤ ‖x − y‖₂,
    ‖Π_X(x) − Π_X(y)‖₂² ≤ ⟨Π_X(x) − Π_X(y), x − y⟩ ≤ ‖x − y‖₂².
Proof:
1. We first note that (1/2)‖x − y‖₂² is strictly convex in x, since its Hessian is I ≻ 0. Hence the solution to the projection problem is unique. If y ∈ X, then we have (1/2)‖y − y‖₂² = 0. Since (1/2)‖x − y‖₂² ≥ 0, zero is the optimal value of the function and y is its unique minimizer, thus giving Π_X(y) = y.

2. The proof uses the first-order optimality condition for constrained minimization: if x* minimizes a convex differentiable f over X, then

    ⟨∇f(x*), z − x*⟩ ≥ 0, ∀z ∈ X.

Applying this condition to the projection problems defining Π_X(x) and Π_X(y) and adding the resulting inequalities yields the claimed bounds.
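As a concrete illustration (a minimal sketch of our own, not from the notes), the PGD update is easy to implement once the projection is available. Here we minimize a smooth quadratic over the Euclidean unit ball, whose projection is simple radial rescaling:

```python
import numpy as np

def project_ball(y, radius=1.0):
    """Projection onto the Euclidean ball of the given radius."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else y * (radius / norm)

def pgd(grad, project, x0, L, steps):
    """Projected gradient descent with step size 1/L."""
    x = x0
    for _ in range(steps):
        x = project(x - grad(x) / L)   # gradient step, then projection onto X
    return x

# f(x) = 0.5 * ||x - b||^2 with b outside the unit ball; here L = 1,
# and the constrained minimizer is the projection of b onto the ball.
b = np.array([3.0, 4.0])
x = pgd(lambda x: x - b, project_ball, np.zeros(2), L=1.0, steps=50)
# x approaches b / ||b|| = (0.6, 0.8)
```

Once the iterate reaches a point x with x = Π_X(x − ∇f(x)/L), it stops moving, which is exactly the fixed-point characterization of the constrained optimum.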
10.3.1 Interpretation
1. Each iterate of PGD can be viewed as the minimizer of a quadratic approximation of the objective function over X, similar to the case of gradient descent (GD):

    x_{t+1} = argmin_{x∈X} { f(x_t) + ⟨∇f(x_t), x − x_t⟩ + (L/2)‖x − x_t‖² }.

We also observe that the objective function is non-increasing with each iteration, f(x_{t+1}) ≤ f(x_t).
2. The optimal solution x* is a fixed point of the map x ↦ Π_X(x − ∇f(x)/L), i.e., x* = Π_X(x* − ∇f(x*)/L).
Proof:
It can be shown that the key inequalities used to show convergence of the gradient descent method carry over to projected gradient descent by replacing the gradient with the gradient mapping

    g_X(x) := L ( x − Π_X(x − ∇f(x)/L) ),

so that the PGD update can be written as x_{t+1} = x_t − g_X(x_t)/L.
Recall that when showing the convergence rates for the gradient descent algorithm, we used the following properties:

(a) For an L-smooth function f, the iterates given by gradient descent with step size γ = 1/L satisfy f(x_{t+1}) ≤ f(x_t) − (1/(2L))‖∇f(x_t)‖².

Similar results hold for projected gradient descent, with the gradient replaced by the gradient mapping. Details can be found in Section 2.2 of [NES '04].
Remark. With these facts, we can immediately adapt the previous convergence analysis for the GD method to analyze PGD and obtain similar results, namely a sublinear rate O(1/t) for the general smooth convex case and a linear rate O((1 − κ⁻¹)^t) for the smooth strongly convex case. Moreover, if we combine projected gradient descent with Nesterov's acceleration, we also obtain the optimal convergence results for constrained convex optimization, similar to what we have in the unconstrained case.
Theorem 10.6 For a convex function f that is L-smooth, the iterates given by projected gradient descent with step size γ = 1/L satisfy

    f(x_t) − f(x*) ≤ 2L‖x_0 − x*‖² / t.

If f is further µ-strongly convex, we have

    ‖x_t − x*‖² ≤ (1 − µ/L)^t ‖x_0 − x*‖².
Proof: The proofs are similar to the gradient descent case. We present the proof for the µ-strongly convex case.

    ‖x_{t+1} − x*‖² = ‖x_t − (1/L) g_X(x_t) − x*‖²
                    = ‖x_t − x*‖² + (1/L²)‖g_X(x_t)‖² − (2/L)⟨g_X(x_t), x_t − x*⟩
                    ≤ ‖x_t − x*‖² + (1/L²)‖g_X(x_t)‖² − (2/L)( (µ/2)‖x_t − x*‖² + (1/(2L))‖g_X(x_t)‖² )
                    = (1 − µ/L)‖x_t − x*‖²
                    ≤ (1 − µ/L)^{t+1} ‖x_0 − x*‖²,

where we have used Proposition 10.5 to bound the inner product, and the last inequality follows by applying the per-step contraction inductively.
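The linear rate above can be observed numerically. The sketch below (our own; the problem instance is made up for illustration) runs PGD on a separable strongly convex quadratic over the box [0, 1]², where the projection is coordinate-wise clipping and the constrained minimizer can be read off coordinate by coordinate:

```python
import numpy as np

mu, L = 1.0, 10.0
D = np.array([mu, L])                      # diagonal Hessian: mu <= eigenvalues <= L
b = np.array([2.0, -3.0])
grad = lambda x: D * x - b                 # f(x) = 0.5 x^T D x - b^T x
project = lambda x: np.clip(x, 0.0, 1.0)   # projection onto the box [0, 1]^2

# For a separable quadratic over a box, the constrained minimizer is the
# coordinate-wise clip of the unconstrained minimizer b / D.
x_star = project(b / D)                    # here (1, 0)

x = np.array([0.0, 1.0])                   # x_0
r0 = np.linalg.norm(x - x_star) ** 2
for t in range(1, 31):
    x = project(x - grad(x) / L)           # PGD step with gamma = 1/L
    # Theorem 10.6: ||x_t - x*||^2 <= (1 - mu/L)^t ||x_0 - x*||^2
    assert np.linalg.norm(x - x_star) ** 2 <= (1 - mu / L) ** t * r0 + 1e-12
```

On this instance the iterate in fact hits x* exactly after a few steps, once both box constraints become active.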
10.3.3 Examples
We present a few examples to illustrate how to compute the projection onto different sets.
3. Simplex: X = {x ∈ R^n : x_i ≥ 0, ∑_{i=1}^{n} x_i ≤ 1}. Computing the projection requires solving the quadratic problem

    Π_X(y) = argmin_{x_i ≥ 0; ∑_i x_i ≤ 1} (1/2)‖x − y‖₂².
If y ∈ X, then Π_X(y) = y. Let us consider the case y ∉ X. We form the Lagrangian for λ ≥ 0 and µ ≥ 0:

    L(x, λ, µ) = (1/2)‖x − y‖₂² + λᵀ(−x) + µ(∑_i x_i − 1).
We use the KKT conditions in order to solve the optimization problem. The optimal solution satisfies

    x_i − y_i − λ_i + µ = 0, ∀i        (optimality)
    λᵀx = 0                            (complementary slackness)
    µ(∑_i x_i − 1) = 0                 (complementary slackness)
    λ ≥ 0, µ ≥ 0                       (feasibility)
    x ≥ 0, ∑_i x_i ≤ 1                 (feasibility)
Let µ* ≥ 0 be such that ∑_i (y_i − µ*)_+ = 1, where (x)_+ = max{x, 0}. Let x*_i = (y_i − µ*)_+ and λ*_i = x*_i − (y_i − µ*) for each i. We can easily verify that the solution (x*, λ*, µ*) satisfies the KKT conditions listed above. Therefore, the projection of y onto the simplex X is given by

    Π_X(y) = y                if y ≥ 0 and ∑_i y_i ≤ 1,
    Π_X(y) = (y − µ*)_+       otherwise,

where µ* is the solution to ∑_i (y_i − µ*)_+ = 1. (If ∑_i (y_i)_+ ≤ 1, the constraint ∑_i x_i ≤ 1 is inactive, µ* = 0 satisfies the KKT conditions, and the projection is simply (y)_+.) Note that µ* can be found using tools such as the bisection method.
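A sketch of this projection (our own implementation, using the bisection the notes mention; the handling of the inactive-constraint case, where µ* = 0, is our addition) could look as follows:

```python
import numpy as np

def project_simplex(y, iters=100):
    """Project y onto {x : x >= 0, sum(x) <= 1} via bisection on mu."""
    y = np.asarray(y, dtype=float)
    if np.maximum(y, 0).sum() <= 1:
        return np.maximum(y, 0)                 # sum constraint inactive: mu* = 0
    g = lambda mu: np.maximum(y - mu, 0).sum()  # continuous, decreasing in mu
    lo = (y.sum() - 1) / len(y)                 # g(lo) >= 1
    hi = y.max()                                # g(hi) = 0 < 1
    for _ in range(iters):                      # bisection for g(mu*) = 1
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) >= 1 else (lo, mid)
    return np.maximum(y - 0.5 * (lo + hi), 0)

print(project_simplex([0.5, 0.9]))   # mu* = 0.2, giving approximately [0.3, 0.7]
```

The bracket works because g(µ) ≥ ∑_i (y_i − µ) = ∑_i y_i − nµ, so g ≥ 1 at the lower endpoint, while g vanishes at µ = max_i y_i.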
4. Nuclear ball: X = {x ∈ R^{m×n} : ‖x‖_nuc = ∑_i σ_i(x) = ‖σ(x)‖₁ ≤ 1}.

From the singular value decomposition of y, there exist appropriate matrices U, V, Σ such that y = UΣVᵀ = ∑_i σ_i u_i v_iᵀ.
Let x = U x′ Vᵀ. Since we are looking for the projection of the diagonal matrix Σ, the closest x′ will also be diagonal, and thus x′ carries the singular values of x. Let the singular values of y be denoted by a vector σ ∈ R^{rank(y)} and those of x′ by σ′ ∈ R^{rank(y)}. Then

    Π_X(y) = argmin_{‖x‖_nuc ≤ 1} (1/2)‖x − y‖_F² = argmin_{σ′ ≥ 0; ∑_i σ′_i ≤ 1} (1/2)‖σ′ − σ‖₂².
This is equivalent to projecting onto a simplex. Therefore, the projection of y onto the nuclear ball is given by

    Π_X(y) = y                           if ‖σ(y)‖₁ ≤ 1,
    Π_X(y) = ∑_i (σ_i − µ*)_+ u_i v_iᵀ   otherwise,

where µ* is the solution to ∑_i (σ_i − µ*)_+ = 1. Note that in order to compute the projection, we need to find the full singular value decomposition of an m × n matrix, whose complexity is O(mn²) (for m ≥ n). Hence, using projected gradient descent may be computationally prohibitive for high-dimensional problems.
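The reduction to a simplex projection can be sketched as follows (our own illustration; the singular-value projection reuses the bisection idea from the simplex example):

```python
import numpy as np

def project_sigma(s, iters=100):
    """Project nonnegative singular values s onto {s' >= 0, sum(s') <= 1}."""
    if s.sum() <= 1:
        return s                                # y already in the nuclear ball
    g = lambda mu: np.maximum(s - mu, 0).sum()  # decreasing in mu
    lo, hi = (s.sum() - 1) / len(s), s.max()    # bracket for g(mu*) = 1
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) >= 1 else (lo, mid)
    return np.maximum(s - 0.5 * (lo + hi), 0)

def project_nuclear_ball(Y):
    # The SVD dominates the cost: O(m n^2) for an m x n matrix with m >= n.
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(project_sigma(s)) @ Vt

Y = np.diag([2.0, 1.0])          # singular values (2, 1), nuclear norm 3 > 1
X = project_nuclear_ball(Y)      # singular values project to (1, 0)
```

As the notes observe, the full SVD makes this projection expensive in high dimensions, which motivates SVD-free alternatives in later lectures.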
References
[NES ’04] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Springer, 2004.