
IE 598: Big Data Optimization Fall 2016

Lecture 10: Lower bounds & Projected Gradient Descent – September 22


Lecturer: Niao He                                    Scribe: Meghana Bande

Overview: In this lecture, we discuss the lower bounds on the complexity of first order optimization
algorithms. We introduce the Projected Gradient Descent for constrained optimization problems and discuss
their convergence rates. We illustrate how to find the projections for a few convex sets using examples.

10.1 Recap

In the previous lecture, we saw the convergence rates of the gradient descent and accelerated
gradient descent algorithms, summarized below. Note that κ = L/µ and D = ‖x₀ − x*‖.

                                      Gradient Descent          Accelerated Gradient Descent
  f is convex, L-smooth               O(LD²/t)                  O(LD²/t²)
  f is µ-strongly convex, L-smooth    O(((κ−1)/(κ+1))^{2t})     O(((√κ−1)/(√κ+1))^{2t})

In order to show the optimality of these algorithms in terms of complexity, lower bounds will be discussed
in this lecture.

10.2 Lower Bounds

We look at lower complexity bounds for convex optimization problems solved by first-order methods,
for objective functions belonging to certain classes.

10.2.1 Lower Bound for Smooth Convex Problems

Theorem 10.1 For any t, 1 ≤ t ≤ (n − 1)/2, and any x₀ ∈ Rⁿ, there exists an L-smooth convex function
f such that, for any first-order method M that generates a sequence of points x_t with x_t ∈ x₀ +
span{∇f(x₀), ..., ∇f(x_{t−1})}, we have

    f(x_t) − f(x*) ≥ 3L‖x₀ − x*‖² / (32(t+1)²).

Proof: The sequence of iterates for f (x) starting from x0 is just a shift of the sequence generated for f (x+x0 )
from the origin. Without loss of generality, we assume that x0 = 0.
Consider the function f where

    f(x) = (L/4) · ((1/2) xᵀAx − e₁ᵀx),


where A is the n × n tridiagonal matrix with 2 on the diagonal and −1 on the first off-diagonals,
and e₁ is the first standard basis vector:

        ⎡  2 −1  0  ⋯  0 ⎤
        ⎢ −1  2 −1  ⋯  0 ⎥
    A = ⎢  0 −1  2  ⋯  ⋮ ⎥         and   e₁ = (1, 0, ..., 0)ᵀ ∈ Rⁿ.
        ⎢  ⋮        ⋱ −1 ⎥
        ⎣  0  0  ⋯ −1  2 ⎦ (n×n)

1. We show that the function f is convex and L-smooth. We have ∇²f(x) = (L/4)A, so we need to show
   that 0 ⪯ A ⪯ 4I. For any x ∈ Rⁿ, we have

       xᵀAx = 2 ∑_{i=1}^{n} x_i² − 2 ∑_{i=1}^{n−1} x_i x_{i+1}
            = x₁² + x_n² + ∑_{i=1}^{n−1} (x_i − x_{i+1})².

From this, we have 0 ⪯ A. To show A ⪯ 4I, consider the following:

    xᵀAx = x₁² + x_n² + ∑_{i=1}^{n−1} (x_i − x_{i+1})²
         ≤ x₁² + x_n² + 2 ∑_{i=1}^{n−1} (x_i² + x_{i+1}²)
         ≤ 4 ∑_{i=1}^{n} x_i²,

where we have used (a − b)² ≤ 2(a² + b²), which follows since (a − b)² + (a + b)² = 2(a² + b²).


2. We now find the optimal value and optimal solution of the function f. To this end, we set
   ∇f(x) = 0, which gives Ax − e₁ = 0, i.e., the system of equations

       2x₁ − x₂ = 1,     −x_{i−1} + 2x_i − x_{i+1} = 0, ∀i ≥ 2.

   We verify that x*_i = 1 − i/(n+1) satisfies these equations:

       2(1 − 1/(n+1)) − (1 − 2/(n+1)) = 1,
       −(1 − (i−1)/(n+1)) + 2(1 − i/(n+1)) − (1 − (i+1)/(n+1)) = 0, ∀i ≥ 2.

   Thus, the optimal solution x* is given by

       x*_i = 1 − i/(n+1), ∀ 1 ≤ i ≤ n,

   and the optimal value of the function is

       f(x*) = (L/4)(−(1/2) e₁ᵀx*) = (L/8)(−1 + 1/(n+1)).

3. We now obtain bounds in order to prove the theorem.


We have x₀ = 0 and ∇f(x) = (L/4)(Ax − e₁), so the iterate x₁ ∈ span{∇f(x₀)} = span{e₁}. It is easy
to see from the structure of A that span{Ae₁, ..., Ae_i} ⊆ span{e₁, ..., e_{i+1}}. Thus, by
induction, x_t ∈ span{e₁, ..., e_t}.

Using this, we bound f(x_t):

    f(x_t) ≥ min_{x ∈ span{e₁,...,e_t}} f(x) = (L/8)(−1 + 1/(t+1)).

Now we obtain a bound on ‖x₀ − x*‖². Note that x₀ = 0, so

    ‖x*‖² = ∑_{i=1}^{n} (1 − i/(n+1))²
          = ∑_{i=1}^{n} (i/(n+1))²          (by symmetry, substituting i ↦ n+1−i)
          = (∑_{i=1}^{n} i²) / (n+1)²
          = n(n+1)(2n+1) / (6(n+1)²)
          ≤ (n+1)/3,

where we have used ∑_{i=1}^{n} i² = n(n+1)(2n+1)/6.
Using the above bounds, we obtain the following:

    (f(x_t) − f(x*)) / (L‖x₀ − x*‖²) ≥ [(1/8)(−1 + 1/(t+1)) + (1/8)(1 − 1/(n+1))] / ((n+1)/3)
                                     = (3/8)(1/(t+1) − 1/(n+1)) / (n+1)
                                     ≥ (3/8)(1/(t+1) − 1/(2t+2)) / (2t+2)
                                     = 3 / (32(t+1)²),

where we have used t ≤ (n−1)/2, i.e., n + 1 ≥ 2t + 2 (more precisely, the construction is applied
with n = 2t + 1, which makes the last step exact).
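The construction above is easy to check numerically. Below is a small sketch (Python/NumPy, not part
of the original notes; the values of n and L are arbitrary choices) that builds A and verifies the
spectral bounds, the closed-form minimizer, and the optimal value:

```python
import numpy as np

n, L = 101, 2.0

# Tridiagonal matrix A: 2 on the diagonal, -1 on the first off-diagonals
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
e1 = np.zeros(n); e1[0] = 1.0

f = lambda x: (L / 4) * (0.5 * x @ A @ x - e1 @ x)

# 0 <= A <= 4I: all eigenvalues of A lie strictly inside (0, 4)
eigs = np.linalg.eigvalsh(A)
assert eigs.min() > 0 and eigs.max() < 4

# Closed-form optimum x*_i = 1 - i/(n+1) solves A x = e1
i = np.arange(1, n + 1)
x_star = 1 - i / (n + 1)
assert np.allclose(A @ x_star, e1)

# Optimal value f(x*) = (L/8)(-1 + 1/(n+1))
assert np.isclose(f(x_star), (L / 8) * (-1 + 1 / (n + 1)))

# ||x*||^2 <= (n+1)/3
assert x_star @ x_star <= (n + 1) / 3
```

For this tridiagonal Toeplitz matrix the eigenvalues are in fact known in closed form,
2 − 2cos(kπ/(n+1)), which is why they lie strictly between 0 and 4.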

We define the ℓ₂ space as the space of all sequences x = {x_i}_{i=1}^{∞} with finite norm, i.e.,
∑_{i=1}^{∞} x_i² < ∞. For L-smooth functions that are also µ-strongly convex, we present the
following lower bound.

10.2.2 Lower Bound for Smooth and Strongly Convex Problems

Theorem 10.2 For any x0 ∈ l2 , and any constants µ > 0 and L > µ, there exists a µ-strongly convex and
L-smooth function f such that for any first order method M that generates a sequence of points xt such that
xt ∈ x0 + span{∇f (x0 ), . . . , ∇f (xt−1 )}, we have

    f(x_t) − f(x*) ≥ (µ/2) · ((√κ − 1)/(√κ + 1))^{2t} · ‖x₀ − x*‖²,

where κ = L/µ.

Proof:
Consider the function f as follows:

    f(x) = ((L − µ)/4) · ((1/2) xᵀAx − e₁ᵀx) + (µ/2)‖x‖²,

where A and e₁ are as defined in the proof of Theorem 10.1, extended to infinitely many coordinates.

1. We show that f is L-smooth and µ-strongly convex. We have

       ∇²f(x) = ((L − µ)/4) A + µI.

   In the proof of Theorem 10.1, we saw that 0 ⪯ A ⪯ 4I. Hence,

       µI ⪯ ∇²f(x) ⪯ LI.

2. We now compute the optimal solution of f. From the first-order optimality conditions,

       ∇f(x) = ((L − µ)/4) Ax + µx − ((L − µ)/4) e₁ = 0,

   which gives (A + (4/(κ−1)) I) x = e₁. This yields the system of equations

       2((κ+1)/(κ−1)) x₁ − x₂ = 1,
       x_{i+1} + x_{i−1} − 2((κ+1)/(κ−1)) x_i = 0, ∀i ≥ 2.

   Let q = (√κ − 1)/(√κ + 1). We verify that x*_i = qⁱ gives the optimal solution: substituting
   into the recurrence requires

       q² − 2((κ+1)/(κ−1)) q + 1 = 0,

   which holds because q + 1/q = ((√κ − 1)² + (√κ + 1)²)/((√κ + 1)(√κ − 1)) = 2(κ+1)/(κ−1).
   This solution is also unique due to strong convexity of f.
3. We now find bounds to prove the theorem. From the sum of an infinite geometric series, we have

       ‖x*‖² = ∑_{i=1}^{∞} q^{2i} = q²/(1 − q²).

   Similar to the proof of Theorem 10.1, it can be seen that x_t ∈ span{e₁, ..., e_t}. Therefore,

       ‖x_t − x*‖² ≥ ∑_{i=t+1}^{∞} q^{2i} = q^{2(t+1)}/(1 − q²).

From strong convexity of f , we have f (xt ) − f (x∗ ) ≥ h∇f (x∗ ), xt − x∗ i + µ2 kxt − x∗ k2 . Since x∗ is
optimal, we have h∇f (x∗ ), x − x∗ i ≥ 0 for any x. Hence,
µ
f (xt ) − f (x∗ ) ≥ kxt − x∗ k2 .
2
We obtain the final result as follows (recall x₀ = 0, so ‖x₀ − x*‖² = ‖x*‖²):

    (f(x_t) − f(x*)) / ‖x₀ − x*‖² ≥ (µ/2) ‖x_t − x*‖² / ‖x₀ − x*‖²
                                  ≥ (µ/2) (q^{2(t+1)}/(1 − q²)) / (q²/(1 − q²))
                                  = (µ/2) q^{2t}.

Thus, we see that the accelerated gradient descent algorithms discussed in the previous lecture are optimal
in the complexity sense.
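As a quick sanity check, the algebraic identities used in the proof of Theorem 10.2 can be verified
numerically; the sketch below (Python, with illustrative values of L and µ, not from the notes)
confirms that q satisfies the optimality recurrence and that the geometric series sums as claimed:

```python
import math

L, mu = 10.0, 1.0
kappa = L / mu
q = (math.sqrt(kappa) - 1) / (math.sqrt(kappa) + 1)

# q solves q^2 - 2((kappa+1)/(kappa-1)) q + 1 = 0,
# equivalently q + 1/q = 2(kappa+1)/(kappa-1)
assert math.isclose(q + 1 / q, 2 * (kappa + 1) / (kappa - 1))

# Partial sums of sum_{i>=1} q^{2i} approach q^2 / (1 - q^2)
approx = sum(q ** (2 * i) for i in range(1, 200))
assert math.isclose(approx, q**2 / (1 - q**2), rel_tol=1e-9)
```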

10.3 Projected Gradient Descent

So far, we have been concerned with finding the optimal solution of an unconstrained optimization
problem. In practice, however, we are likely to come across constrained optimization problems. In
this section, we discuss how to solve the constrained optimization problem

min f (x)
x∈X

where f is a convex function and X is a convex set.

If we apply the gradient descent update at a point x_t ∈ X, the resulting iterate
x_{t+1} = x_t − (1/L)∇f(x_t) may not belong to the constraint set X. In projected gradient descent,
we simply choose as x_{t+1} the point in X nearest to x_t − (1/L)∇f(x_t), i.e., the projection of
x_t − (1/L)∇f(x_t) onto the set X.

Definition 10.3 The projection of a point y onto a set X is defined as

    Π_X(y) = argmin_{x ∈ X} (1/2)‖x − y‖₂².

Projected Gradient Descent (PGD): Given a starting point x₀ ∈ X and step-size γ > 0, PGD iterates
as follows until a stopping criterion is satisfied:

    x_{t+1} = Π_X(x_t − γ∇f(x_t)), ∀t ≥ 0.


In this lecture, for an L-smooth convex function, we fix the step-size to be γ = 1/L.
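The PGD update can be sketched in a few lines. In the sketch below (Python/NumPy, not from the
notes), the quadratic objective and the ℓ₂-ball constraint are illustrative choices:

```python
import numpy as np

def project_l2_ball(y, radius=1.0):
    """Projection onto {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else radius * y / norm

def pgd(grad, project, x0, L, iters=500):
    """Projected gradient descent with step size 1/L."""
    x = x0
    for _ in range(iters):
        x = project(x - grad(x) / L)
    return x

# Example: minimize f(x) = 1/2 ||x - c||^2 over the unit l2 ball,
# with c outside the ball; the solution is c / ||c||.
c = np.array([3.0, 4.0])
grad = lambda x: x - c          # L-smooth with L = 1
x_hat = pgd(grad, project_l2_ball, np.zeros(2), L=1.0)
assert np.allclose(x_hat, c / np.linalg.norm(c))
```

Note the order of operations: the gradient step is taken first and the projection is applied to the
result, matching the update x_{t+1} = Π_X(x_t − γ∇f(x_t)).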

Proposition 10.4 The following hold:

1. If y ∈ X, then Π_X(y) = y.

2. The projection onto a convex set X is non-expansive:

       ‖Π_X(x) − Π_X(y)‖₂ ≤ ‖x − y‖₂,
       ‖Π_X(x) − Π_X(y)‖₂² ≤ ⟨Π_X(x) − Π_X(y), x − y⟩ ≤ ‖x − y‖₂².

Proof:

1. We first note that (1/2)‖x − y‖₂² is strictly convex in x, since its Hessian is the identity I.
   Hence the solution to the projection problem is unique. If y ∈ X, then (1/2)‖y − y‖₂² = 0, and
   since (1/2)‖x − y‖₂² ≥ 0, zero is the optimal value of the function and y is its unique
   minimizer, giving Π_X(y) = y.

2. For a constrained convex problem, the optimality condition at a minimizer x* is

       ⟨∇f(x*), z − x*⟩ ≥ 0, ∀z ∈ X.

   Applying this to the projection problems defining Π_X(x) and Π_X(y), we have

       ⟨Π_X(y) − y, Π_X(x) − Π_X(y)⟩ ≥ 0,
       ⟨Π_X(x) − x, Π_X(y) − Π_X(x)⟩ ≥ 0,

   since Π_X(y), Π_X(x) ∈ X.
Lecture 10: Lower bounds & Projected Gradient Descent– September 22 10-6

Adding the two inequalities above gives

    ⟨(x − y) − (Π_X(x) − Π_X(y)), Π_X(x) − Π_X(y)⟩ ≥ 0,

i.e.,

    ⟨x − y, Π_X(x) − Π_X(y)⟩ ≥ ⟨Π_X(x) − Π_X(y), Π_X(x) − Π_X(y)⟩ = ‖Π_X(x) − Π_X(y)‖₂².

From Cauchy–Schwarz, we further have

    ⟨x − y, Π_X(x) − Π_X(y)⟩ ≤ ‖Π_X(x) − Π_X(y)‖₂ ‖x − y‖₂,

giving us ‖Π_X(x) − Π_X(y)‖₂ ≤ ‖x − y‖₂.

10.3.1 Interpretation

We present a few useful observations.

1. Each iterate of PGD can be viewed as the minimizer over X of a quadratic approximation of the
   objective function, similar to the case of Gradient Descent (GD):

       x_{t+1} = argmin_{x ∈ X} { f(x_t) + ⟨∇f(x_t), x − x_t⟩ + (L/2)‖x − x_t‖² }.

   We also observe that the objective function is non-increasing with each iteration,
   f(x_{t+1}) ≤ f(x_t).
2. The optimal solution x* is a fixed point of the map x ↦ Π_X(x − (1/L)∇f(x)), i.e.,
   x* = Π_X(x* − (1/L)∇f(x*)).
   Proof:

       x* is optimal ⇔ ⟨∇f(x*), z − x*⟩ ≥ 0, ∀z ∈ X
                     ⇔ ⟨−(1/L)∇f(x*), z − x*⟩ ≤ 0, ∀z ∈ X
                     ⇔ ⟨(x* − (1/L)∇f(x*)) − x*, z − x*⟩ ≤ 0, ∀z ∈ X
                     ⇔ ⟨x* − (x* − (1/L)∇f(x*)), z − x*⟩ ≥ 0, ∀z ∈ X
                     ⇔ x* is the projection of x* − (1/L)∇f(x*),

   where the last line follows because the projection of y is the minimizer of (1/2)‖x − y‖₂² over
   the set X, and the line above is exactly its optimality condition.
3. We can rewrite the iterates of PGD as

       x_{t+1} = x_t − (1/L) g_X(x_t),

   by defining g_X(x) = L(x − x†), where x† = Π_X(x − (1/L)∇f(x)). The function g_X(x) is often
   called the Gradient Mapping. Note that if the problem is unconstrained, x† = x − (1/L)∇f(x) and
   g_X(x) = ∇f(x), thus reducing to the usual gradient descent case.

It can be shown that the key inequalities used to show convergence in the gradient descent method follow in
a similar way for the projected gradient descent as well by replacing the gradient with the gradient mapping.

10.3.2 Convergence Analysis

Recall that when showing the convergence rates for the gradient descent algorithm, we used the following
properties:
(a) For an L-smooth function f, the iterates of gradient descent with step size γ = 1/L satisfy

        f(x_{t+1}) − f(x_t) ≤ −‖∇f(x_t)‖² / (2L).

(b) If f is also µ-strongly convex, we have

        ⟨∇f(x) − ∇f(y), x − y⟩ ≥ (µL/(µ+L))‖x − y‖² + (1/(µ+L))‖∇f(x) − ∇f(y)‖².

Similar results hold for the projected gradient descent which are presented below. Details can be found in
Section 2.2 from [NES’04].

Proposition 10.5 For a convex and L-smooth function f, we have

    f(x†) − f(x) ≤ −‖g_X(x)‖² / (2L).

If f is also µ-strongly convex, we have

    ⟨g_X(x), x − x*⟩ ≥ (µ/2)‖x − x*‖² + (1/(2L))‖g_X(x)‖².

Remark. With these facts, we can immediately adapt the previous convergence analysis for the GD
method to analyze PGD and obtain similar results, namely a sublinear rate O(1/t) in the general
smooth convex case and a linear rate O((1 − κ⁻¹)ᵗ) in the smooth strongly convex case. Moreover, if
we combine projected gradient descent with Nesterov's acceleration, we also obtain the optimal
convergence rates for constrained convex optimization, similar to what we have in the unconstrained
case.

Theorem 10.6 For a convex function f that is L-smooth, the iterates of projected gradient descent
with step size γ = 1/L satisfy

    f(x_t) − f(x*) ≤ (2L/t) ‖x₀ − x*‖².

If f is further µ-strongly convex, we have

    ‖x_t − x*‖² ≤ (1 − µ/L)ᵗ ‖x₀ − x*‖².

Proof: The proofs are similar to the gradient descent case. We present the proof for the µ-strongly
convex case:

    ‖x_{t+1} − x*‖² = ‖x_t − (1/L)g_X(x_t) − x*‖²
                    = ‖x_t − x*‖² + (1/L²)‖g_X(x_t)‖² − (2/L)⟨g_X(x_t), x_t − x*⟩
                    ≤ ‖x_t − x*‖² + (1/L²)‖g_X(x_t)‖² − (2/L)((µ/2)‖x_t − x*‖² + (1/(2L))‖g_X(x_t)‖²)
                    = (1 − µ/L)‖x_t − x*‖²,

where we have used Proposition 10.5 to bound the inner product. Applying this recursively gives
‖x_t − x*‖² ≤ (1 − µ/L)ᵗ ‖x₀ − x*‖².
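The linear rate of Theorem 10.6 can be observed numerically. The sketch below (Python/NumPy; the
strongly convex quadratic over a box is an illustrative choice, not from the notes) runs PGD with
step size 1/L, checks the fixed-point property of x*, and verifies the per-iterate contraction:

```python
import numpy as np

# f(x) = 1/2 x^T D x - b^T x with D diagonal: mu-strongly convex, L-smooth
d = np.array([1.0, 10.0])          # mu = 1, L = 10
b = np.array([5.0, 5.0])
mu, L = d.min(), d.max()
grad = lambda x: d * x - b

lo, hi = np.zeros(2), np.ones(2)   # box [0, 1]^2
project = lambda y: np.clip(y, lo, hi)

# The problem is separable, so the constrained optimum is the clipped
# unconstrained minimizer b/d = (5, 0.5), i.e., x* = (1, 0.5).
x_star = project(b / d)
# Fixed-point property: x* = P(x* - grad(x*)/L)
assert np.allclose(x_star, project(x_star - grad(x_star) / L))

x = np.zeros(2)
dist0 = np.linalg.norm(x - x_star) ** 2
for t in range(1, 51):
    x = project(x - grad(x) / L)
    # Contraction of Theorem 10.6
    assert np.linalg.norm(x - x_star) ** 2 <= (1 - mu / L) ** t * dist0 + 1e-12
```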

10.3.3 Examples

We present a few examples to illustrate how to find projection onto different sets.

1. ℓ₂ ball: X = {x : ‖x‖₂ ≤ 1}. The projection is given by

       Π_X(y) = { y           if ‖y‖₂ ≤ 1,
                  y/‖y‖₂      if ‖y‖₂ > 1.

2. Box: X = {x ∈ Rⁿ : ℓ ≤ x ≤ u}. The i-th coordinate of the projection, for 1 ≤ i ≤ n, is given by

       [Π_X(y)]_i = { ℓ_i    if y_i < ℓ_i,
                      y_i    if ℓ_i ≤ y_i ≤ u_i,
                      u_i    if y_i > u_i.
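The coordinatewise box formula is exactly elementwise clipping; a minimal sketch using NumPy's clip
(the numbers are illustrative):

```python
import numpy as np

lo = np.array([0.0, 0.0, -1.0])
hi = np.array([1.0, 2.0, 1.0])
y = np.array([-0.5, 1.5, 3.0])

# Coordinatewise projection onto the box [lo, hi]
proj = np.clip(y, lo, hi)
assert np.array_equal(proj, np.array([0.0, 1.5, 1.0]))
```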

3. Simplex: X = {x ∈ Rⁿ : x_i ≥ 0, ∑_{i=1}^{n} x_i ≤ 1}. Computing the projection requires solving
   the quadratic problem

       Π_X(y) = argmin_{x ≥ 0, ∑ x_i ≤ 1} (1/2)‖x − y‖₂².

   If y ∈ X, then Π_X(y) = y. Let us consider the case y ∉ X. We form the Lagrangian for λ ≥ 0 and
   µ ≥ 0:

       L(x, λ, µ) = (1/2)‖x − y‖₂² + λᵀ(−x) + µ(∑ x_i − 1).

   We use the KKT conditions to solve the optimization problem. The optimal solution satisfies

       x_i − y_i − λ_i + µ = 0, ∀i        (stationarity)
       λᵀx = 0                            (complementary slackness)
       µ(∑ x_i − 1) = 0                   (complementary slackness)
       λ ≥ 0, µ ≥ 0                       (dual feasibility)
       x ≥ 0, ∑ x_i ≤ 1                   (primal feasibility)

   Let µ* > 0 be such that ∑_i (y_i − µ*)₊ = 1, where (x)₊ = max{x, 0}. Let x*_i = (y_i − µ*)₊ and
   λ*_i = x*_i − (y_i − µ*) for each i. We can easily verify that (x*; λ*, µ*) satisfies the KKT
   conditions listed above. Therefore, the projection of y onto the simplex X is given by

       Π_X(y) = { y              if y ≥ 0 and ∑ y_i ≤ 1,
                  (y − µ*)₊      otherwise,

   where µ* is the solution to the equation ∑_i (y_i − µ)₊ = 1. Note that µ* can be found using
   tools such as the bisection method, since ∑_i (y_i − µ)₊ is non-increasing in µ.
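The KKT characterization above can be turned into a short bisection routine for µ*. The following
sketch (illustrative, not from the notes) handles both the y ∈ X and y ∉ X cases:

```python
import numpy as np

def project_simplex(y, iters=100):
    """Project y onto {x >= 0, sum x_i <= 1} via bisection on mu."""
    if np.all(y >= 0) and y.sum() <= 1:
        return y.copy()
    # sum_i (y_i - mu)_+ is non-increasing in mu; find mu* where it equals 1
    lo, hi = 0.0, y.max()        # at mu = max(y) the sum is 0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if np.maximum(y - mid, 0).sum() > 1:
            lo = mid
        else:
            hi = mid
    return np.maximum(y - (lo + hi) / 2, 0)

y = np.array([0.8, 0.6, -0.2])
x = project_simplex(y)
# here mu* = 0.2, so x = (0.6, 0.4, 0)
assert np.all(x >= 0) and np.isclose(x.sum(), 1.0)
assert np.allclose(x, np.array([0.6, 0.4, 0.0]))
```

Note that coordinates that stay positive are all shifted by the same µ*, so their pairwise gaps are
preserved (here x₀ − x₁ = y₀ − y₁ = 0.2).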
4. Nuclear ball: X = {x ∈ R^{m×n} : ‖x‖_nuc = ∑_i σ_i(x) = ‖σ(x)‖₁ ≤ 1}.
   From the singular value decomposition of y, there exist matrices U, V, Σ such that
   y = UΣVᵀ = ∑_i σ_i u_i v_iᵀ.

   Write x = U x′ Vᵀ. Since we are then projecting the diagonal matrix Σ, the closest x′ is also
   diagonal, so x′ carries the singular values of x. Denote the singular values of y by
   σ ∈ R^{rank(y)} and those of x by σ′ ∈ R^{rank(y)}. Then

       Π_X(y) = argmin_{‖x‖_nuc ≤ 1} (1/2)‖x − y‖_F²,   with σ′ = argmin_{σ′ ≥ 0, ∑ σ′_i ≤ 1} (1/2)‖σ′ − σ‖₂².

   This is equivalent to projecting σ onto the simplex. Therefore, the projection of y onto the
   nuclear ball is given by

       Π_X(y) = { y                              if ‖σ(y)‖₁ ≤ 1,
                  ∑_i (σ_i − µ*)₊ u_i v_iᵀ       otherwise,

   where µ* is the solution to the equation ∑_i (σ_i − µ)₊ = 1. Note that in order to compute the
   projection, we need the full singular value decomposition of an m × n matrix, whose complexity is
   O(mn²) (for m ≥ n). Hence using projected gradient descent may be computationally prohibitive for
   high-dimensional problems.
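Putting the pieces together, the nuclear-ball projection reduces to a simplex projection of the
singular values. The sketch below (illustrative, not from the notes; the sort-based simplex step is
a standard alternative to bisection) implements it with NumPy's SVD:

```python
import numpy as np

def simplex_proj(v):
    """Project v onto {s >= 0, sum s_i <= 1} (sort-based thresholding)."""
    if np.all(v >= 0) and v.sum() <= 1:
        return np.maximum(v, 0)
    u = np.sort(v)[::-1]                 # sorted in decreasing order
    css = np.cumsum(u) - 1
    k = np.nonzero(u - css / np.arange(1, len(u) + 1) > 0)[0][-1]
    mu = css[k] / (k + 1)                # threshold mu* from the KKT conditions
    return np.maximum(v - mu, 0)

def project_nuclear_ball(y):
    """Project y onto {x : sum of singular values <= 1} via SVD."""
    u, s, vt = np.linalg.svd(y, full_matrices=False)
    if s.sum() <= 1:
        return y
    return (u * simplex_proj(s)) @ vt    # scale columns of u by thresholded s

y = np.diag([0.8, 0.6]).astype(float)
x = project_nuclear_ball(y)
# singular values of the projection sum to 1 (here they become 0.6 and 0.4)
assert np.isclose(np.linalg.svd(x, compute_uv=False).sum(), 1.0)
```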

References
[NES ’04] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Springer, 2004.
