
Projected Gradient Algorithm

Andersen Ang
ECS, Uni. Southampton, UK
andersen.ang@soton.ac.uk
Homepage: angms.science

Version: July 13, 2023
First draft: August 2, 2017

Content
▶ Unconstrained vs constrained problem
▶ Problem setup
▶ Understanding the geometry of projection
▶ PGD is a special case of proximal gradient
▶ Theorem 1. An inequality of PGD with constant stepsize
  $$f\Big(\frac{1}{K+1}\sum_{k=0}^{K} x_k\Big) - f^* \;\le\; \frac{\|x_0 - x^*\|_2^2}{2\alpha(K+1)} + \frac{\alpha}{2(K+1)}\sum_{k=0}^{K}\|\nabla f(x_k)\|_2^2$$
▶ Theorem 2. PGD converges ergodically at rate $O\big(\tfrac{1}{\sqrt{k}}\big)$ on Lipschitz functions
  $$f(\bar{x}_K) - f^* \;\le\; \frac{L\|x_0 - x^*\|}{\sqrt{K+1}}.$$
Unconstrained vs constrained minimization

Unconstrained minimization: $\min_{x\in\mathbb{R}^n} f(x)$.
▶ All x ∈ Rn are feasible.
▶ Any x ∈ Rn can be a solution.

Constrained minimization: $\min_{x\in Q} f(x)$.
▶ Not all x ∈ Rn are feasible.
▶ Not all x ∈ Rn can be a solution.
▶ The solution has to be inside the set Q.
▶ An example:
  $$\min_{x\in\mathbb{R}^n} \|Ax - b\|_2^2 \quad \text{s.t.} \quad \|x\|_2 \le 1$$
  can be expressed as
  $$\min_{\|x\|_2 \le 1} \|Ax - b\|_2^2.$$
  Here Q := {v ∈ Rn : ∥v∥2 ≤ 1} is known as the unit ℓ2 ball.

2 / 21
Approaches for solving constrained minimization problems

▶ Duality / Lagrangian approach


▶ Not our focus here.
▶ Although the method of Lagrange multipliers is usually taught in standard calculus classes, the standard
explanation (the gradient of the objective has to be anti-parallel to the gradient of the constraint) is not
intuitive, and it is not the deep reason why the method works.
▶ It requires a solid understanding of convex conjugates, constraint qualifications and duality to appreciate the
Lagrangian approach, which is out of scope here.

▶ First-order method / gradient-based method


▶ Simple.
▶ Our focus.

▶ Second-order method, Zero-order method, Higher-order method


▶ Not our focus here.

3 / 21
Solving the unconstrained problem $\min_{x\in\mathbb{R}^n} f(x)$ by gradient descent

▶ Gradient descent (GD) is a simple and intuitive way to solve the unconstrained optimization problem $\min_{x\in\mathbb{R}^n} f(x)$.
▶ Starting from an initial point x0 ∈ Rn, GD iterates the following until a stopping condition is met:
  $$x_{k+1} = x_k - \alpha_k \nabla f(x_k),$$
  where
  ▶ k ∈ N: the current iteration counter
  ▶ k + 1 ∈ N: the next iteration counter
  ▶ xk: the current variable
  ▶ xk+1: the next variable
  ▶ ∇f: the gradient of f with respect to x
  ▶ ∇f(xk): the gradient ∇f evaluated at the current variable xk
  ▶ αk ∈ (0, +∞): gradient stepsize
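To make the update rule concrete, here is a minimal GD sketch with a constant stepsize and a simple stopping condition; the function names, toy objective, stepsize and tolerance below are my own illustrative choices, not from the slides:

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, max_iter=1000, tol=1e-8):
    """Iterate x_{k+1} = x_k - alpha * grad_f(x_k) until the step becomes tiny."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = x - alpha * grad_f(x)          # gradient step
        if np.linalg.norm(x_new - x) < tol:    # simple stopping condition
            return x_new
        x = x_new
    return x

# Toy example: minimize f(x) = ||x - c||_2^2 over R^n, whose gradient is 2(x - c).
c = np.array([1.0, -2.0, 3.0])
x_hat = gradient_descent(lambda x: 2 * (x - c), x0=np.zeros(3))
print(x_hat)   # close to c
```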

▶ Question: how about constrained problems? Is it possible to adapt GD to constrained problems?
  Answer: yes, and the key is the Euclidean projection operator proj : Rn ⇒ Rn.

▶ Remark
  ▶ We assume f is differentiable (i.e., ∇f exists).
  ▶ If f is not differentiable, we can replace the gradient by a subgradient, and we get the so-called subgradient method.

4 / 21
Problem setup of constrained problem

$$\min_{x\in Q} f(x).$$

▶ We focus on the Euclidean space Rn

▶ f : Rn → R is the objective / cost function
  ▶ f is assumed to be continuously differentiable (f ∈ C1), i.e., ∇f(x) exists for all x
  ▶ we assume f is globally L-Lipschitz: |f(x) − f(y)| ≤ L∥x − y∥
  ▶ we do not assume ∇f is globally L-Lipschitz (i.e., ∥∇f(x) − ∇f(y)∥ ≤ L∥x − y∥ is not required here)

▶ ∅ ≠ Q ⊂ Rn is convex and compact
  ▶ The constraint is represented by a set Q
  ▶ Q ⊂ Rn means Q is a subset of Rn, the domain of f
  ▶ Q ≠ ∅ means Q is not an empty set (it is not useful for discussion if Q is empty)
  ▶ Q is a convex set: $\{x \in Q,\ y \in Q \implies \lambda x + (1-\lambda)y \in Q,\ \forall \lambda \in (0,1)\}$
  ▶ Q is compact (compact = bounded + closed)

▶ For the details of convexity and Lipschitz continuity, see here.

5 / 21
Solving constrained problem by projected gradient descent
▶ Projected gradient descent (PGD) = GD + projection

▶ Starting from an initial point x0 ∈ Q, PGD iterates the following until a stopping condition is met:
  $$x_{k+1} = P_Q\big(x_k - \alpha_k \nabla f(x_k)\big),$$
  where
  ▶ k ∈ N: the current iteration counter
  ▶ k + 1 ∈ N: the next iteration counter
  ▶ xk: the current variable
  ▶ xk+1: the next variable
  ▶ ∇f: the gradient of f with respect to x
  ▶ ∇f(xk): the gradient ∇f evaluated at the current variable xk
  ▶ αk ∈ (0, +∞): gradient stepsize
  ▶ PQ: shorthand for projQ

▶ projQ( · ) is called the Euclidean projection operator, and it is itself an optimization problem:
  $$P_Q(x_0) = \operatorname{proj}_Q(x_0) = \operatorname*{argmin}_{x\in Q} \|x - x_0\|_2. \qquad (*)$$
  i.e., given a point x0, PQ finds a point x ∈ Q which is “closest” to x0.
▶ The measure of “closeness” here is the Euclidean distance ∥x − x0∥2.

▶ (∗) is equivalent to
  $$\operatorname*{argmin}_{x\in Q} \frac{1}{2}\|x - x_0\|_2^2,$$
  where we square the cost so that the objective becomes differentiable.
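As a companion to the GD sketch earlier, here is a minimal PGD sketch in the same style; the projection is passed in as a function, and the toy data, stepsize, and unit-ball projection below are my own illustrative choices rather than anything prescribed by the slides:

```python
import numpy as np

def projected_gradient_descent(grad_f, proj_Q, x0, alpha=0.1, max_iter=1000, tol=1e-8):
    """Iterate x_{k+1} = P_Q(x_k - alpha * grad_f(x_k))."""
    x = proj_Q(np.asarray(x0, dtype=float))    # start from a feasible point
    for _ in range(max_iter):
        x_new = proj_Q(x - alpha * grad_f(x))  # gradient step, then project back onto Q
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Toy instance of min ||Ax - b||_2^2  s.t.  ||x||_2 <= 1  (the example from page 2).
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
grad = lambda x: 2 * A.T @ (A @ x - b)                 # gradient of ||Ax - b||_2^2
proj = lambda x: x / max(1.0, np.linalg.norm(x))       # projection onto the unit l2 ball
x_hat = projected_gradient_descent(grad, proj, x0=np.zeros(5), alpha=1e-2)
```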
6 / 21
Comparing PGD to GD

GD
1. Pick an initial point x0 ∈ Rn
2. Loop until stopping condition is met:
   2.1 Descent direction: compute −∇f(xk)
   2.2 Stepsize: pick an αk
   2.3 Update: xk+1 = xk − αk∇f(xk)

PGD
1. Pick an initial point x0 ∈ Q
2. Loop until stopping condition is met:
   2.1 Descent direction: compute −∇f(xk)
   2.2 Stepsize: pick an αk
   2.3 Update: yk+1 = xk − αk∇f(xk)
   2.4 Projection: $x_{k+1} = \operatorname*{argmin}_{x\in Q} \frac{1}{2}\|x - y_{k+1}\|_2^2$

▶ PGD = GD + projection.
  ▶ if the point xk − αk∇f(xk) after the gradient update leaves the set Q, project it back.
  ▶ if the point xk − αk∇f(xk) after the gradient update is within the set Q, keep the point and do nothing.

▶ Projection PQ( · ) : Rn ⇒ Rn
  ▶ It is a mapping from Rn to Rn, i.e., a point-to-point mapping.
  ▶ In general, for a nonconvex set Q, such a mapping is possibly non-unique (this is the meaning of ⇒).
  ▶ PQ( · ) is itself an optimization problem
    $$P_Q(x_0) := \operatorname*{argmin}_{x\in Q} \frac{1}{2}\|x - x_0\|_2^2. \qquad (\star)$$
    If Q is a convex compact set, this optimization problem has a unique solution, and we have PQ( · ) : Rn → Rn.

▶ PGD is economical if (⋆) is easy to solve, i.e., it has a closed-form expression and is cheap to compute.

▶ PGD is possibly not economical if Q is nonconvex, (⋆) has no closed-form expression, or (⋆) is expensive to compute.
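To make “closed-form projection” concrete, here are two standard examples (my own illustrations, matching the unit-ball projection used in the earlier sketch; the formulas are textbook results, not specific to these slides):

```python
import numpy as np

def proj_l2_ball(x, r=1.0):
    """P_Q for Q = {x : ||x||_2 <= r}: rescale x onto the ball if it lies outside."""
    nrm = np.linalg.norm(x)
    return x if nrm <= r else (r / nrm) * x

def proj_box(x, lo, hi):
    """P_Q for Q = {x : lo <= x <= hi} componentwise: clip each coordinate."""
    return np.clip(x, lo, hi)

# Both projections cost O(n), so the PGD iteration stays cheap for these sets.
# For a general (e.g., nonconvex) Q, the inner problem may itself need an iterative solver.
```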

7 / 21
Understanding the geometry of projection ... (1/4)
Consider a convex set Q ⊂ Rn and a point x0 ∈ Rn .

Case 1. x0 ∈ Q.
▶ As x0 ∈ Q, the closest point to x0 in Q is x0 itself.
▶ The distance between a point and itself is zero.
▶ Mathematically: ∥x − x0∥2 = 0 gives x = x0.
▶ This is the trivial case and therefore not interesting.

Case 2. x0 ∉ Q.
▶ Now x0 is outside Q.
▶ We need to find a point x such that
  ▶ x ∈ Q
  ▶ ∥x − x0∥2 is smallest
▶ This is the interesting case.

[Figure: the two cases, with the point x0 inside Q (left) and outside Q (right).]

8 / 21
Understanding the geometry of projection ... (2/4)
▶ The circles are ℓ2-norm balls centered at x0 with different radii.

▶ Points on the same circle are equidistant to x0 (with a different ℓ2 distance on different circles).

▶ Note that some points on the blue circle are inside Q; those are feasible points.

[Figure: concentric ℓ2-norm balls centered at x0, next to the set Q; the blue circle reaches into Q.]

9 / 21
Understanding the geometry of projection ... (3/4)
▶ The point inside Q which is closest to x0 is the point where the ℓ2 norm ball “touches” Q.

▶ In this example, the blue point y is the solution to
  $$P_Q(x_0) = \operatorname*{argmin}_{x\in Q} \frac{1}{2}\|x - x_0\|_2^2.$$

[Figure: the ℓ2-norm ball around x0 touching Q at the blue point y.]

▶ In fact, such a point is always located on the boundary of Q when x0 ∉ Q.
  That is, mathematically, if x0 ∉ Q, then
  $$\operatorname*{argmin}_{x\in Q} \frac{1}{2}\|x - x_0\|_2^2 \in \operatorname{bdry} Q.$$

10 / 21
Understanding the geometry of projection ... (4/4)
Note that the projection is orthogonal: the blue point y is always on a straight line that is tangent to the norm ball
and Q.

[Figure: the projection y of x0 onto Q, with the tangent line at y and the normal direction x0 − y.]

The normal to the tangent is exactly x0 − y = x0 − projQ (x0 ).

11 / 21
Property of projection: Bourbaki-Cheney-Goldstein inequality

See the details here
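For reference, the inequality being pointed to is the standard variational characterization of the Euclidean projection onto a convex set; this is a sketch of the statement from standard convex analysis (the slide itself only links to the details), and the non-expansiveness used later in the proof of Theorem 1 follows from it:

```latex
% For a convex set Q and any x_0 \in \mathbb{R}^n, the projection y = P_Q(x_0) is characterized by
\langle x_0 - P_Q(x_0),\; x - P_Q(x_0) \rangle \le 0 \quad \text{for all } x \in Q,
% and this implies that the projection operator is non-expansive:
\|P_Q(u) - P_Q(v)\|_2 \le \|u - v\|_2 \quad \text{for all } u, v \in \mathbb{R}^n.
```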

12 / 21
PGD is a special case of proximal gradient

▶ The indicator function ιQ(x) of a set Q is defined as follows:
  $$\iota_Q(x) = \begin{cases} 0 & x \in Q \\ +\infty & x \notin Q \end{cases}$$

▶ With the indicator function, the constrained problem has two equivalent expressions:
  $$\min_{x\in Q} f(x) \;\equiv\; \min_{x} f(x) + \iota_Q(x).$$

▶ Proximal gradient is a method to solve the problem of minimizing the sum of a differentiable and a non-differentiable function:
  $$\min_{x} f(x) + g(x),$$
  where g is non-differentiable.

▶ PGD is in fact the special case of proximal gradient where g(x) is the indicator function ιQ(x). See here for more
about proximal gradient.
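To spell out why this special case collapses to PGD, here is a one-line derivation sketch (my own filling-in, using the standard definition of the proximal operator):

```latex
\operatorname{prox}_{\alpha \iota_Q}(x_0)
  = \operatorname*{argmin}_{x \in \mathbb{R}^n} \Big( \iota_Q(x) + \tfrac{1}{2\alpha}\|x - x_0\|_2^2 \Big)
  = \operatorname*{argmin}_{x \in Q} \tfrac{1}{2}\|x - x_0\|_2^2
  = P_Q(x_0),
```

so the proximal gradient update $x_{k+1} = \operatorname{prox}_{\alpha \iota_Q}\big(x_k - \alpha\nabla f(x_k)\big)$ is exactly the PGD update.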

13 / 21
On PGD ergodic convergence rate
▶ Theorem 1. If f is convex, PGD with constant stepsize α satisfies
  $$f\Big(\frac{1}{K+1}\sum_{k=0}^{K} x_k\Big) - f^* \;\le\; \frac{\|x_0 - x^*\|_2^2}{2\alpha(K+1)} + \frac{\alpha}{2(K+1)}\sum_{k=0}^{K}\|\nabla f(x_k)\|_2^2,$$
  where
  ▶ x∗ is the (global) minimizer
  ▶ f∗ := f(x∗) is the optimal cost value
  ▶ α is the constant stepsize
  ▶ K is the total number of iterations performed

▶ Interpretation:
  ▶ the term $\frac{1}{K+1}\sum_{k=0}^{K} x_k$ is the “average” of the sequence {xk} after K iterations
  ▶ denote $\frac{1}{K+1}\sum_{k=0}^{K} x_k$ as x̄
  ▶ denote f(x̄) as f̄
  Then the theorem reads:
  $$\bar{f} - f^* \;\le\; \frac{\|x_0 - x^*\|_2^2}{2\alpha(K+1)} + \text{something positive}.$$
  Hence the convergence rate is like $O\big(\tfrac{1}{K}\big)$.

▶ The term $\frac{\alpha}{2(K+1)}\sum_{k=0}^{K}\|\nabla f(x_k)\|_2^2$ converges to zero
  ▶ as long as $\sum_{k=0}^{K}\|\nabla f(x_k)\|_2^2$ is not diverging to infinity, or
  ▶ the growth of $\sum_{k=0}^{K}\|\nabla f(x_k)\|_2^2$ is slower than K
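As a sanity check of the inequality (not part of the original notes), here is a small numerical sketch on a toy problem with a closed-form projection; the problem, stepsize, and iteration count are arbitrary illustrative choices:

```python
import numpy as np

# Toy problem: f(x) = 0.5 * ||x - c||^2 with Q the unit l2 ball and c outside Q.
rng = np.random.default_rng(0)
n = 5
c = rng.standard_normal(n)
c = 2.0 * c / np.linalg.norm(c)                      # ||c|| = 2, so the minimizer sits on the boundary of Q
f      = lambda x: 0.5 * np.sum((x - c) ** 2)
grad_f = lambda x: x - c
proj_Q = lambda x: x / max(1.0, np.linalg.norm(x))   # closed-form projection onto the unit ball

x_star = proj_Q(c)                                   # constrained minimizer c / ||c||
f_star = f(x_star)

alpha, K = 0.1, 200
x = np.zeros(n)
iterates, grad_norms = [x.copy()], [np.linalg.norm(grad_f(x))]
for _ in range(K):                                   # produce x_1, ..., x_K
    x = proj_Q(x - alpha * grad_f(x))
    iterates.append(x.copy())
    grad_norms.append(np.linalg.norm(grad_f(x)))

x_bar = np.mean(iterates, axis=0)                    # ergodic average over k = 0, ..., K
lhs = f(x_bar) - f_star
rhs = (np.linalg.norm(iterates[0] - x_star) ** 2 / (2 * alpha * (K + 1))
       + alpha / (2 * (K + 1)) * np.sum(np.array(grad_norms) ** 2))
print(lhs <= rhs)                                    # the bound of Theorem 1 should hold (True)
```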

14 / 21
What is ergodic convergence?

▶ Ergodic convergence = “The centroid of a point cloud moving towards the limit point”

▶ Sequence convergence: each of x1, x2, ..., xk gets closer and closer to x∗

▶ Ergodic convergence: the average of x1, x2, ..., xk converges to x∗
  ▶ which does not imply that each of x1, x2, ..., xk gets closer and closer to x∗
  ▶ some of them can be moving away from x∗, as long as the centroid is getting closer
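A tiny example of the distinction (my own illustration, not from the slides): an oscillating sequence in R that never converges, although its running average does:

```latex
x_k = (-1)^k, \qquad x^* = 0:
\quad \text{the sequence } \{x_k\} \text{ does not converge, but }
\bar{x}_K = \frac{1}{K}\sum_{k=1}^{K} x_k =
\begin{cases} 0, & K \text{ even},\\[2pt] -\tfrac{1}{K}, & K \text{ odd}, \end{cases}
\;\xrightarrow[K\to\infty]{}\; 0 = x^*.
```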

15 / 21
Proof of Theorem 1 ... (1/3)

Step 1: convexity. Since f is convex,
  $$f(z) \ge f(x) + \langle \nabla f(x), z - x\rangle
  \;\iff\; f(x) - f(z) \le \langle \nabla f(x), x - z\rangle.$$
Putting x = xk, z = x∗ and using f(x∗) = f∗,
  $$f(x_k) - f^* \le \langle \nabla f(x_k),\, x_k - x^*\rangle.$$

Step 2: the PGD update. With constant stepsize, yk+1 = xk − α∇f(xk), i.e., ∇f(xk) = (xk − yk+1)/α, so
  $$f(x_k) - f^* \le \Big\langle \frac{x_k - y_{k+1}}{\alpha},\; x_k - x^*\Big\rangle = \frac{1}{\alpha}\langle x_k - y_{k+1},\, x_k - x^*\rangle.$$

Step 3: a not-so-trivial trick. For scalars,
  $$(a - b)(a - c) = \frac{2a^2 - 2ac - 2ab + 2bc}{2}
  = \frac{(a^2 - 2ac + c^2) + (a^2 - 2ab + b^2) - (b^2 - 2bc + c^2)}{2}
  = \frac{(a - c)^2 + (a - b)^2 - (b - c)^2}{2}.$$
The same identity for vectors, with a = xk, b = yk+1, c = x∗, gives
  $$\big\langle x_k - y_{k+1},\; x_k - x^*\big\rangle = \frac{\|x_k - x^*\|_2^2 + \|x_k - y_{k+1}\|_2^2 - \|y_{k+1} - x^*\|_2^2}{2}.$$

Combining Steps 2 and 3,
  $$f(x_k) - f^* \le \frac{\|x_k - x^*\|_2^2 + \|x_k - y_{k+1}\|_2^2 - \|y_{k+1} - x^*\|_2^2}{2\alpha}.$$
Since xk − yk+1 = α∇f(xk),
  $$f(x_k) - f^* \le \frac{\|x_k - x^*\|_2^2 - \|y_{k+1} - x^*\|_2^2}{2\alpha} + \frac{\alpha}{2}\|\nabla f(x_k)\|_2^2.$$
16 / 21
Proof of Theorem 1 ... (2/3)

Now we have
  $$f(x_k) - f^* \le \frac{\|x_k - x^*\|_2^2 - \|y_{k+1} - x^*\|_2^2}{2\alpha} + \frac{\alpha}{2}\|\nabla f(x_k)\|_2^2.$$

Next we make use of the fact that the projection is non-expansive.
Focus on the term $\|x_k - x^*\|_2^2 - \|y_{k+1} - x^*\|_2^2$, where
  ▶ xk: the current variable
  ▶ yk+1: the gradient-updated point
  ▶ xk+1 = projQ(yk+1): the projected point
We wish to replace $\|y_{k+1} - x^*\|_2^2$ by $\|x_{k+1} - x^*\|_2^2$.

How: by the fact that the projection operator is non-expansive,
  $$\|y_{k+1} - x^*\|_2^2 \;\ge\; \|x_{k+1} - x^*\|_2^2, \qquad x_{k+1} = \operatorname{proj}_Q(y_{k+1}).$$
- This is known as “the projection operator is non-expansive”:
  the post-projection distance is at most the pre-projection distance.
- It follows from the Bourbaki-Cheney-Goldstein inequality; details here.

[Figure: xk and x∗ inside Q, the gradient-updated point yk+1 (possibly outside Q), and its projection ΠQ(yk+1) = xk+1 back onto Q.]

Hence $-\|y_{k+1} - x^*\|_2^2 \le -\|x_{k+1} - x^*\|_2^2$ and
  $$f(x_k) - f^* \;\le\; \frac{\|x_k - x^*\|_2^2 - \|y_{k+1} - x^*\|_2^2}{2\alpha} + \frac{\alpha}{2}\|\nabla f(x_k)\|_2^2
  \;\le\; \frac{\|x_k - x^*\|_2^2 - \|x_{k+1} - x^*\|_2^2}{2\alpha} + \frac{\alpha}{2}\|\nabla f(x_k)\|_2^2.$$

This forms a telescoping series!
17 / 21
Proof of Theorem 1 ... (3/3)

Write the inequality for each iteration:
  $$k = 0:\quad f(x_0) - f^* \le \frac{\|x_0 - x^*\|_2^2 - \|x_1 - x^*\|_2^2}{2\alpha} + \frac{\alpha}{2}\|\nabla f(x_0)\|_2^2$$
  $$k = 1:\quad f(x_1) - f^* \le \frac{\|x_1 - x^*\|_2^2 - \|x_2 - x^*\|_2^2}{2\alpha} + \frac{\alpha}{2}\|\nabla f(x_1)\|_2^2$$
  $$\vdots$$
  $$k = K:\quad f(x_K) - f^* \le \frac{\|x_K - x^*\|_2^2 - \|x_{K+1} - x^*\|_2^2}{2\alpha} + \frac{\alpha}{2}\|\nabla f(x_K)\|_2^2$$

Sum all of them (the distance terms telescope):
  $$\sum_{k=0}^{K}\big(f(x_k) - f^*\big) \;\le\; \frac{\|x_0 - x^*\|_2^2 - \|x_{K+1} - x^*\|_2^2}{2\alpha} + \frac{\alpha}{2}\sum_{k=0}^{K}\|\nabla f(x_k)\|_2^2.$$

As $0 \le \frac{1}{2\alpha}\|x_{K+1} - x^*\|_2^2$,
  $$\sum_{k=0}^{K}\big(f(x_k) - f^*\big) \;\le\; \frac{\|x_0 - x^*\|_2^2}{2\alpha} + \frac{\alpha}{2}\sum_{k=0}^{K}\|\nabla f(x_k)\|_2^2.$$

Expand the summation on the left and divide the whole inequality by K + 1:
  $$\frac{1}{K+1}\sum_{k=0}^{K} f(x_k) - f^* \;\le\; \frac{\|x_0 - x^*\|_2^2}{2\alpha(K+1)} + \frac{\alpha}{2(K+1)}\sum_{k=0}^{K}\|\nabla f(x_k)\|_2^2.$$

On the left-hand side, as f is convex, Jensen's inequality gives
  $$f\Big(\frac{1}{K+1}\sum_{k=0}^{K} x_k\Big) \;\le\; \frac{1}{K+1}\sum_{k=0}^{K} f(x_k).$$

Therefore
  $$f\Big(\frac{1}{K+1}\sum_{k=0}^{K} x_k\Big) - f^* \;\le\; \frac{\|x_0 - x^*\|_2^2}{2\alpha(K+1)} + \frac{\alpha}{2(K+1)}\sum_{k=0}^{K}\|\nabla f(x_k)\|_2^2.$$
18 / 21
 
PGD converges ergodically at rate $O\big(\tfrac{1}{\sqrt{k}}\big)$ on Lipschitz functions

▶ Theorem 2. If f is Lipschitz, for the point $\bar{x}_K = \frac{1}{K+1}\sum_{k=0}^{K} x_k$ and constant stepsize $\alpha = \frac{\|x_0 - x^*\|}{L\sqrt{K+1}}$, we have
  $$f(\bar{x}_K) - f^* \;\le\; \frac{L\|x_0 - x^*\|}{\sqrt{K+1}}.$$

Proof
▶ f is Lipschitz means ∇f is bounded: ∥∇f∥ ≤ L, where L is the Lipschitz constant.
▶ Put x̄K, α, and ∥∇f∥ ≤ L into Theorem 1.
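Spelling out the “put into Theorem 1” step (my own short derivation sketch, which also shows where the particular stepsize comes from):

```latex
% With \|\nabla f(x_k)\| \le L, the gradient-sum term in Theorem 1 is at most \alpha L^2/2,
% and the left-hand side of Theorem 1 is exactly f(\bar{x}_K) - f^*, so
f(\bar{x}_K) - f^* \;\le\; \frac{\|x_0 - x^*\|_2^2}{2\alpha(K+1)} + \frac{\alpha L^2}{2}.
% Minimizing the right-hand side over \alpha > 0 (set its derivative in \alpha to zero) gives
\alpha = \frac{\|x_0 - x^*\|_2}{L\sqrt{K+1}},
% and substituting this \alpha back in makes both terms equal, so their sum is
f(\bar{x}_K) - f^* \;\le\; \frac{L\|x_0 - x^*\|_2}{\sqrt{K+1}}.
```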

Remarks
▶ On the stepsize α, note that it involves K (the total number of steps), not k (the current iteration number).
▶ α requires knowing x∗, so this theorem is practically useless: knowing x∗ already solves the problem.
▶ Although we do not know x∗ in general, the theorem tells us that the ergodic convergence speed of PGD is $O\big(\tfrac{1}{\sqrt{k}}\big)$.

19 / 21
Discussion
In the convergence analysis of GD:
1. f is convex and β-smooth (gradient is β-Lipschitz)
2. Convergence rate $O\big(\tfrac{1}{k}\big)$.
3. The convergence rate is not ergodic

In the convergence analysis of PGD:
1. f is convex and L-Lipschitz (gradient is bounded above)
2. Convergence rate $O\big(\tfrac{1}{\sqrt{k}}\big)$.
3. The convergence rate is ergodic: it works on x̄K

If f is convex and β-smooth, the convergence of PGD will be the same as that of GD.
▶ The theoretical convergence rate of PGD on convex and β-smooth f is also $O\big(\tfrac{1}{k}\big)$.
▶ However, the practical cost depends on the complexity of the projection:
  some Q are difficult to project onto.

As PGD is a special case of the proximal gradient method, it is better to study the proximal gradient method; see, for
example, here, here and here.
20 / 21
Last page - summary

▶ PGD = GD + projection

▶ PGD with constant stepsize α:
  $$f\Big(\frac{1}{K+1}\sum_{k=0}^{K} x_k\Big) - f^* \;\le\; \frac{\|x_0 - x^*\|_2^2}{2\alpha(K+1)} + \frac{\alpha}{2(K+1)}\sum_{k=0}^{K}\|\nabla f(x_k)\|_2^2$$

▶ If f is Lipschitz (bounded gradient), then for the point $\bar{x}_K = \frac{1}{K+1}\sum_{k=0}^{K} x_k$ and constant stepsize $\alpha = \frac{\|x_0 - x^*\|}{L\sqrt{K+1}}$,
  $$f(\bar{x}_K) - f^* \;\le\; \frac{L\|x_0 - x^*\|}{\sqrt{K+1}}.$$
End of document

21 / 21
