
Projected Gradient Algorithm

Andersen Ang
ECS, Uni. Southampton, UK
andersen.ang@soton.ac.uk
Homepage: angms.science

Version: July 13, 2023
First draft: August 2, 2017

Content
▶ Unconstrained vs constrained problem
▶ Problem setup
▶ Understanding the geometry of projection
▶ PGD is a special case of proximal gradient
▶ Theorem 1. An inequality of PGD with constant stepsize
  $$f\Big(\frac{1}{K+1}\sum_{k=0}^{K} x_k\Big) - f^* \;\le\; \frac{\|x_0 - x^*\|_2^2}{2\alpha(K+1)} + \frac{\alpha}{2(K+1)}\sum_{k=0}^{K}\|\nabla f(x_k)\|_2^2$$
▶ Theorem 2. PGD converges ergodically at rate $O\big(\tfrac{1}{\sqrt{k}}\big)$ on Lipschitz functions
  $$f(\bar{x}_K) - f^* \;\le\; \frac{L\|x_0 - x^*\|}{\sqrt{K+1}}.$$
Unconstrained vs constrained minimization

Unconstrained minimization: $\min_{x\in\mathbb{R}^n} f(x)$.
▶ All x ∈ Rn are feasible.
▶ Any x ∈ Rn can be a solution.

Constrained minimization: $\min_{x\in Q} f(x)$.
▶ Not all x ∈ Rn are feasible.
▶ Not all x ∈ Rn can be a solution.
▶ The solution has to be inside the set Q.
▶ An example:
  $$\min_{x\in\mathbb{R}^n} \|Ax - b\|_2^2 \quad \text{s.t.} \quad \|x\|_2 \le 1$$
  can be expressed as
  $$\min_{\|x\|_2 \le 1} \|Ax - b\|_2^2.$$
  Here Q := {v ∈ Rn : ∥v∥2 ≤ 1} is known as the unit ℓ2 ball.

2 / 21
Approaches for solving constrained minimization problems

▶ Duality / Lagrangian approach


▶ Not our focus here.
▶ Although the method of Lagrange multipliers is usually taught in standard calculus classes, the standard
explanation (the gradient of the objective has to be anti-parallel to the gradient of the constraint) is not
intuitive, and it is not the deep reason why the method works.
▶ It requires a solid understanding of convex conjugates, constraint qualifications and duality to appreciate the
Lagrangian approach, which is out of scope here.

▶ First-order method / gradient-based method


▶ Simple.
▶ Our focus.

▶ Second-order method, Zero-order method, Higher-order method


▶ Not our focus here.

3 / 21
Solving the unconstrained problem $\min_{x\in\mathbb{R}^n} f(x)$ by gradient descent

▶ Gradient descent (GD) is a simple and intuitive way to solve the unconstrained optimization problem $\min_{x\in\mathbb{R}^n} f(x)$.
▶ Starting from an initial point x0 ∈ Rn, GD iterates the following until a stopping condition is met:
  $$x_{k+1} = x_k - \alpha_k \nabla f(x_k),$$
  where
  ▶ k ∈ N: the current iteration counter
  ▶ k + 1 ∈ N: the next iteration counter
  ▶ xk: the current variable
  ▶ xk+1: the next variable
  ▶ ∇f: the gradient of f with respect to x
  ▶ ∇f(xk): the gradient ∇f evaluated at the current variable xk
  ▶ αk ∈ (0, +∞): gradient stepsize
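To make the update rule concrete, here is a minimal GD sketch with a constant stepsize and a simple stopping condition; the function names, toy objective, stepsize and tolerance below are my own illustrative choices, not from the slides:

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, max_iter=1000, tol=1e-8):
    """Iterate x_{k+1} = x_k - alpha * grad_f(x_k) until the step becomes tiny."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = x - alpha * grad_f(x)          # gradient step
        if np.linalg.norm(x_new - x) < tol:    # simple stopping condition
            return x_new
        x = x_new
    return x

# Toy example: minimize f(x) = ||x - c||_2^2 over R^n, whose gradient is 2(x - c).
c = np.array([1.0, -2.0, 3.0])
x_hat = gradient_descent(lambda x: 2 * (x - c), x0=np.zeros(3))
print(x_hat)   # close to c
```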

▶ Question: how about constrained problems? Is it possible to adapt GD to constrained problems?
  Answer: yes, and the key is the Euclidean projection operator proj : Rn ⇒ Rn.

▶ Remark
  ▶ We assume f is differentiable (i.e., ∇f exists).
  ▶ If f is not differentiable, we can replace the gradient by a subgradient, and we get the so-called subgradient method.

4 / 21
Problem setup of constrained problem

$$\min_{x\in Q} f(x).$$

▶ We focus on the Euclidean space Rn

▶ f : Rn → R is the objective / cost function
  ▶ f is assumed to be continuously differentiable (f ∈ C1), i.e., ∇f(x) exists for all x
  ▶ we assume f is globally L-Lipschitz: |f(x) − f(y)| ≤ L∥x − y∥
  ▶ we do not assume ∇f is globally L-Lipschitz (i.e., ∥∇f(x) − ∇f(y)∥ ≤ L∥x − y∥ is not required here)

▶ ∅ ≠ Q ⊂ Rn is convex and compact
  ▶ The constraint is represented by a set Q
  ▶ Q ⊂ Rn means Q is a subset of Rn, the domain of f
  ▶ Q ≠ ∅ means Q is not an empty set (it is not useful for discussion if Q is empty)
  ▶ Q is a convex set: $\{x \in Q,\ y \in Q \implies \lambda x + (1-\lambda)y \in Q,\ \forall \lambda \in (0,1)\}$
  ▶ Q is compact (compact = bounded + closed)

▶ For the details of convexity and Lipschitz continuity, see here.

5 / 21
Solving constrained problem by projected gradient descent
▶ Projected gradient descent (PGD) = GD + projection

▶ Starting from an initial point x0 ∈ Q, PGD iterates the following until a stopping condition is met:
  $$x_{k+1} = P_Q\big(x_k - \alpha_k \nabla f(x_k)\big),$$
  where
  ▶ k ∈ N: the current iteration counter
  ▶ k + 1 ∈ N: the next iteration counter
  ▶ xk: the current variable
  ▶ xk+1: the next variable
  ▶ ∇f: the gradient of f with respect to x
  ▶ ∇f(xk): the gradient ∇f evaluated at the current variable xk
  ▶ αk ∈ (0, +∞): gradient stepsize
  ▶ PQ: shorthand for projQ

▶ projQ( · ) is called the Euclidean projection operator, and it is itself an optimization problem:
  $$P_Q(x_0) = \operatorname{proj}_Q(x_0) = \operatorname*{argmin}_{x\in Q} \|x - x_0\|_2. \qquad (*)$$
  i.e., given a point x0, PQ finds a point x ∈ Q which is “closest” to x0.
▶ The measure of “closeness” here is the Euclidean distance ∥x − x0∥2.

▶ (∗) is equivalent to
  $$\operatorname*{argmin}_{x\in Q} \frac{1}{2}\|x - x_0\|_2^2,$$
  where we square the cost so that the objective becomes differentiable.
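As a companion to the GD sketch earlier, here is a minimal PGD sketch in the same style; the projection is passed in as a function, and the toy data, stepsize, and unit-ball projection below are my own illustrative choices rather than anything prescribed by the slides:

```python
import numpy as np

def projected_gradient_descent(grad_f, proj_Q, x0, alpha=0.1, max_iter=1000, tol=1e-8):
    """Iterate x_{k+1} = P_Q(x_k - alpha * grad_f(x_k))."""
    x = proj_Q(np.asarray(x0, dtype=float))    # start from a feasible point
    for _ in range(max_iter):
        x_new = proj_Q(x - alpha * grad_f(x))  # gradient step, then project back onto Q
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Toy instance of min ||Ax - b||_2^2  s.t.  ||x||_2 <= 1  (the example from page 2).
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
grad = lambda x: 2 * A.T @ (A @ x - b)                 # gradient of ||Ax - b||_2^2
proj = lambda x: x / max(1.0, np.linalg.norm(x))       # projection onto the unit l2 ball
x_hat = projected_gradient_descent(grad, proj, x0=np.zeros(5), alpha=1e-2)
```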
6 / 21
Comparing PGD to GD

GD
1. Pick an initial point x0 ∈ Rn
2. Loop until stopping condition is met:
   2.1 Descent direction: compute −∇f(xk)
   2.2 Stepsize: pick an αk
   2.3 Update: xk+1 = xk − αk∇f(xk)

PGD
1. Pick an initial point x0 ∈ Q
2. Loop until stopping condition is met:
   2.1 Descent direction: compute −∇f(xk)
   2.2 Stepsize: pick an αk
   2.3 Update: yk+1 = xk − αk∇f(xk)
   2.4 Projection: $x_{k+1} = \operatorname*{argmin}_{x\in Q} \frac{1}{2}\|x - y_{k+1}\|_2^2$

▶ PGD = GD + projection.
  ▶ if the point xk − αk∇f(xk) after the gradient update leaves the set Q, project it back.
  ▶ if the point xk − αk∇f(xk) after the gradient update is within the set Q, keep the point and do nothing.

▶ Projection PQ( · ) : Rn ⇒ Rn
  ▶ It is a mapping from Rn to Rn, i.e., a point-to-point mapping.
  ▶ In general, for a nonconvex set Q, such a mapping is possibly non-unique (this is the meaning of ⇒).
  ▶ PQ( · ) is itself an optimization problem
    $$P_Q(x_0) := \operatorname*{argmin}_{x\in Q} \frac{1}{2}\|x - x_0\|_2^2. \qquad (\star)$$
    If Q is a convex compact set, this optimization problem has a unique solution, and we have PQ( · ) : Rn → Rn.

▶ PGD is economical if (⋆) is easy to solve, i.e., it has a closed-form expression and is cheap to compute.

▶ PGD is possibly not economical if Q is nonconvex, (⋆) has no closed-form expression, or (⋆) is expensive to compute.
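To make “closed-form projection” concrete, here are two standard examples (my own illustrations, matching the unit-ball projection used in the earlier sketch; the formulas are textbook results, not specific to these slides):

```python
import numpy as np

def proj_l2_ball(x, r=1.0):
    """P_Q for Q = {x : ||x||_2 <= r}: rescale x onto the ball if it lies outside."""
    nrm = np.linalg.norm(x)
    return x if nrm <= r else (r / nrm) * x

def proj_box(x, lo, hi):
    """P_Q for Q = {x : lo <= x <= hi} componentwise: clip each coordinate."""
    return np.clip(x, lo, hi)

# Both projections cost O(n), so the PGD iteration stays cheap for these sets.
# For a general (e.g., nonconvex) Q, the inner problem may itself need an iterative solver.
```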

7 / 21
Understanding the geometry of projection ... (1/4)
Consider a convex set Q ⊂ Rn and a point x0 ∈ Rn .

Case 1. x0 ∈ Q.
▶ As x0 ∈ Q, the closest point to x0 in Q is x0 itself.
▶ The distance between a point and itself is zero.
▶ Mathematically: ∥x − x0∥2 = 0 gives x = x0.
▶ This is the trivial case and therefore not interesting.

Case 2. x0 ∉ Q.
▶ Now x0 is outside Q.
▶ We need to find a point x such that
  ▶ x ∈ Q
  ▶ ∥x − x0∥2 is smallest
▶ This is the interesting case.

[Figure: the two cases, with the point x0 inside Q (left) and outside Q (right).]

8 / 21
Understanding the geometry of projection ... (2/4)
▶ The circles are ℓ2-norm balls centered at x0 with different radii.

▶ Points on the same circle are equidistant to x0 (with a different ℓ2 distance on different circles).

▶ Note that some points on the blue circle are inside Q; those are feasible points.

[Figure: concentric ℓ2-norm balls centered at x0, next to the set Q; the blue circle reaches into Q.]

9 / 21
Understanding the geometry of projection ... (3/4)
▶ The point inside Q which is closest to x0 is the point where the ℓ2 norm ball “touches” Q.

▶ In this example, the blue point y is the solution to
  $$P_Q(x_0) = \operatorname*{argmin}_{x\in Q} \frac{1}{2}\|x - x_0\|_2^2.$$

[Figure: the ℓ2-norm ball around x0 touching Q at the blue point y.]

▶ In fact, such a point is always located on the boundary of Q when x0 ∉ Q.
  That is, mathematically, if x0 ∉ Q, then
  $$\operatorname*{argmin}_{x\in Q} \frac{1}{2}\|x - x_0\|_2^2 \in \operatorname{bdry} Q.$$

10 / 21
Understanding the geometry of projection ... (4/4)
Note that the projection is orthogonal: the blue point y is always on a straight line that is tangent to the norm ball
and Q.

[Figure: the projection y of x0 onto Q, with the tangent line at y and the normal direction x0 − y.]

The normal to the tangent is exactly x0 − y = x0 − projQ (x0 ).

11 / 21
Property of projection: Bourbaki-Cheney-Goldstein inequality

See the details here
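For reference, the inequality being pointed to is the standard variational characterization of the Euclidean projection onto a convex set; this is a sketch of the statement from standard convex analysis (the slide itself only links to the details), and the non-expansiveness used later in the proof of Theorem 1 follows from it:

```latex
% For a convex set Q and any x_0 \in \mathbb{R}^n, the projection y = P_Q(x_0) is characterized by
\langle x_0 - P_Q(x_0),\; x - P_Q(x_0) \rangle \le 0 \quad \text{for all } x \in Q,
% and this implies that the projection operator is non-expansive:
\|P_Q(u) - P_Q(v)\|_2 \le \|u - v\|_2 \quad \text{for all } u, v \in \mathbb{R}^n.
```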

12 / 21
PGD is a special case of proximal gradient

▶ The indicator function ιQ(x) of a set Q is defined as follows:
  $$\iota_Q(x) = \begin{cases} 0 & x \in Q \\ +\infty & x \notin Q \end{cases}$$

▶ With the indicator function, the constrained problem has two equivalent expressions:
  $$\min_{x\in Q} f(x) \;\equiv\; \min_{x} f(x) + \iota_Q(x).$$

▶ Proximal gradient is a method to solve the problem of minimizing the sum of a differentiable and a non-differentiable function:
  $$\min_{x} f(x) + g(x),$$
  where g is non-differentiable.

▶ PGD is in fact the special case of proximal gradient where g(x) is the indicator function ιQ(x). See here for more
about proximal gradient.
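To spell out why this special case collapses to PGD, here is a one-line derivation sketch (my own filling-in, using the standard definition of the proximal operator):

```latex
\operatorname{prox}_{\alpha \iota_Q}(x_0)
  = \operatorname*{argmin}_{x \in \mathbb{R}^n} \Big( \iota_Q(x) + \tfrac{1}{2\alpha}\|x - x_0\|_2^2 \Big)
  = \operatorname*{argmin}_{x \in Q} \tfrac{1}{2}\|x - x_0\|_2^2
  = P_Q(x_0),
```

so the proximal gradient update $x_{k+1} = \operatorname{prox}_{\alpha \iota_Q}\big(x_k - \alpha\nabla f(x_k)\big)$ is exactly the PGD update.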

13 / 21
On PGD ergodic convergence rate
▶ Theorem 1. If f is convex, PGD with constant stepsize α satisfies
  $$f\Big(\frac{1}{K+1}\sum_{k=0}^{K} x_k\Big) - f^* \;\le\; \frac{\|x_0 - x^*\|_2^2}{2\alpha(K+1)} + \frac{\alpha}{2(K+1)}\sum_{k=0}^{K}\|\nabla f(x_k)\|_2^2,$$
  where
  ▶ x∗ is the (global) minimizer
  ▶ f∗ := f(x∗) is the optimal cost value
  ▶ α is the constant stepsize
  ▶ K is the total number of iterations performed

▶ Interpretation:
  ▶ the term $\frac{1}{K+1}\sum_{k=0}^{K} x_k$ is the “average” of the sequence {xk} after K iterations
  ▶ denote $\frac{1}{K+1}\sum_{k=0}^{K} x_k$ as x̄
  ▶ denote f(x̄) as f̄
  Then the theorem reads:
  $$\bar{f} - f^* \;\le\; \frac{\|x_0 - x^*\|_2^2}{2\alpha(K+1)} + \text{something positive}.$$
  Hence the convergence rate is like $O\big(\tfrac{1}{K}\big)$.

▶ The term $\frac{\alpha}{2(K+1)}\sum_{k=0}^{K}\|\nabla f(x_k)\|_2^2$ converges to zero
  ▶ as long as $\sum_{k=0}^{K}\|\nabla f(x_k)\|_2^2$ is not diverging to infinity, or
  ▶ the growth of $\sum_{k=0}^{K}\|\nabla f(x_k)\|_2^2$ is slower than K
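As a sanity check of the inequality (not part of the original notes), here is a small numerical sketch on a toy problem with a closed-form projection; the problem, stepsize, and iteration count are arbitrary illustrative choices:

```python
import numpy as np

# Toy problem: f(x) = 0.5 * ||x - c||^2 with Q the unit l2 ball and c outside Q.
rng = np.random.default_rng(0)
n = 5
c = rng.standard_normal(n)
c = 2.0 * c / np.linalg.norm(c)                      # ||c|| = 2, so the minimizer sits on the boundary of Q
f      = lambda x: 0.5 * np.sum((x - c) ** 2)
grad_f = lambda x: x - c
proj_Q = lambda x: x / max(1.0, np.linalg.norm(x))   # closed-form projection onto the unit ball

x_star = proj_Q(c)                                   # constrained minimizer c / ||c||
f_star = f(x_star)

alpha, K = 0.1, 200
x = np.zeros(n)
iterates, grad_norms = [x.copy()], [np.linalg.norm(grad_f(x))]
for _ in range(K):                                   # produce x_1, ..., x_K
    x = proj_Q(x - alpha * grad_f(x))
    iterates.append(x.copy())
    grad_norms.append(np.linalg.norm(grad_f(x)))

x_bar = np.mean(iterates, axis=0)                    # ergodic average over k = 0, ..., K
lhs = f(x_bar) - f_star
rhs = (np.linalg.norm(iterates[0] - x_star) ** 2 / (2 * alpha * (K + 1))
       + alpha / (2 * (K + 1)) * np.sum(np.array(grad_norms) ** 2))
print(lhs <= rhs)                                    # the bound of Theorem 1 should hold (True)
```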

14 / 21
What is ergodic convergence?

▶ Ergodic convergence = “The centroid of a point cloud moving towards the limit point”

▶ Sequence convergence: each of x1, x2, ..., xk gets closer and closer to x∗

▶ Ergodic convergence: the average of x1, x2, ..., xk converges to x∗
  ▶ which does not imply that each of x1, x2, ..., xk gets closer and closer to x∗
  ▶ some of them can be moving away from x∗, as long as the centroid is getting closer
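A tiny example of the distinction (my own illustration, not from the slides): an oscillating sequence in R that never converges, although its running average does:

```latex
x_k = (-1)^k, \qquad x^* = 0:
\quad \text{the sequence } \{x_k\} \text{ does not converge, but }
\bar{x}_K = \frac{1}{K}\sum_{k=1}^{K} x_k =
\begin{cases} 0, & K \text{ even},\\[2pt] -\tfrac{1}{K}, & K \text{ odd}, \end{cases}
\;\xrightarrow[K\to\infty]{}\; 0 = x^*.
```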

15 / 21
Proof of Theorem 1 ... (1/3)

Step 1: convexity. Since f is convex,
  $$f(z) \ge f(x) + \langle \nabla f(x), z - x\rangle
  \;\iff\; f(x) - f(z) \le \langle \nabla f(x), x - z\rangle.$$
Putting x = xk, z = x∗ and using f(x∗) = f∗,
  $$f(x_k) - f^* \le \langle \nabla f(x_k),\, x_k - x^*\rangle.$$

Step 2: the PGD update. With constant stepsize, yk+1 = xk − α∇f(xk), i.e., ∇f(xk) = (xk − yk+1)/α, so
  $$f(x_k) - f^* \le \Big\langle \frac{x_k - y_{k+1}}{\alpha},\; x_k - x^*\Big\rangle = \frac{1}{\alpha}\langle x_k - y_{k+1},\, x_k - x^*\rangle.$$

Step 3: a not-so-trivial trick. For scalars,
  $$(a - b)(a - c) = \frac{2a^2 - 2ac - 2ab + 2bc}{2}
  = \frac{(a^2 - 2ac + c^2) + (a^2 - 2ab + b^2) - (b^2 - 2bc + c^2)}{2}
  = \frac{(a - c)^2 + (a - b)^2 - (b - c)^2}{2}.$$
The same identity for vectors, with a = xk, b = yk+1, c = x∗, gives
  $$\big\langle x_k - y_{k+1},\; x_k - x^*\big\rangle = \frac{\|x_k - x^*\|_2^2 + \|x_k - y_{k+1}\|_2^2 - \|y_{k+1} - x^*\|_2^2}{2}.$$

Combining Steps 2 and 3,
  $$f(x_k) - f^* \le \frac{\|x_k - x^*\|_2^2 + \|x_k - y_{k+1}\|_2^2 - \|y_{k+1} - x^*\|_2^2}{2\alpha}.$$
Since xk − yk+1 = α∇f(xk),
  $$f(x_k) - f^* \le \frac{\|x_k - x^*\|_2^2 - \|y_{k+1} - x^*\|_2^2}{2\alpha} + \frac{\alpha}{2}\|\nabla f(x_k)\|_2^2.$$
16 / 21
Proof of Theorem 1 ... (2/3)

Now we have
  $$f(x_k) - f^* \le \frac{\|x_k - x^*\|_2^2 - \|y_{k+1} - x^*\|_2^2}{2\alpha} + \frac{\alpha}{2}\|\nabla f(x_k)\|_2^2.$$

Next we make use of the fact that the projection is non-expansive.
Focus on the term $\|x_k - x^*\|_2^2 - \|y_{k+1} - x^*\|_2^2$, where
  ▶ xk: the current variable
  ▶ yk+1: the gradient-updated point
  ▶ xk+1 = projQ(yk+1): the projected point
We wish to replace $\|y_{k+1} - x^*\|_2^2$ by $\|x_{k+1} - x^*\|_2^2$.

How: by the fact that the projection operator is non-expansive,
  $$\|y_{k+1} - x^*\|_2^2 \;\ge\; \|x_{k+1} - x^*\|_2^2, \qquad x_{k+1} = \operatorname{proj}_Q(y_{k+1}).$$
- This is known as “the projection operator is non-expansive”:
  the post-projection distance is at most the pre-projection distance.
- It follows from the Bourbaki-Cheney-Goldstein inequality; details here.

[Figure: xk and x∗ inside Q, the gradient-updated point yk+1 (possibly outside Q), and its projection ΠQ(yk+1) = xk+1 back onto Q.]

Hence $-\|y_{k+1} - x^*\|_2^2 \le -\|x_{k+1} - x^*\|_2^2$ and
  $$f(x_k) - f^* \;\le\; \frac{\|x_k - x^*\|_2^2 - \|y_{k+1} - x^*\|_2^2}{2\alpha} + \frac{\alpha}{2}\|\nabla f(x_k)\|_2^2
  \;\le\; \frac{\|x_k - x^*\|_2^2 - \|x_{k+1} - x^*\|_2^2}{2\alpha} + \frac{\alpha}{2}\|\nabla f(x_k)\|_2^2.$$

This forms a telescoping series!
17 / 21
Proof of Theorem 1 ... (3/3)

Write the inequality for each iteration:
  $$k = 0:\quad f(x_0) - f^* \le \frac{\|x_0 - x^*\|_2^2 - \|x_1 - x^*\|_2^2}{2\alpha} + \frac{\alpha}{2}\|\nabla f(x_0)\|_2^2$$
  $$k = 1:\quad f(x_1) - f^* \le \frac{\|x_1 - x^*\|_2^2 - \|x_2 - x^*\|_2^2}{2\alpha} + \frac{\alpha}{2}\|\nabla f(x_1)\|_2^2$$
  $$\vdots$$
  $$k = K:\quad f(x_K) - f^* \le \frac{\|x_K - x^*\|_2^2 - \|x_{K+1} - x^*\|_2^2}{2\alpha} + \frac{\alpha}{2}\|\nabla f(x_K)\|_2^2$$

Sum all of them (the distance terms telescope):
  $$\sum_{k=0}^{K}\big(f(x_k) - f^*\big) \;\le\; \frac{\|x_0 - x^*\|_2^2 - \|x_{K+1} - x^*\|_2^2}{2\alpha} + \frac{\alpha}{2}\sum_{k=0}^{K}\|\nabla f(x_k)\|_2^2.$$

As $0 \le \frac{1}{2\alpha}\|x_{K+1} - x^*\|_2^2$,
  $$\sum_{k=0}^{K}\big(f(x_k) - f^*\big) \;\le\; \frac{\|x_0 - x^*\|_2^2}{2\alpha} + \frac{\alpha}{2}\sum_{k=0}^{K}\|\nabla f(x_k)\|_2^2.$$

Expand the summation on the left and divide the whole inequality by K + 1:
  $$\frac{1}{K+1}\sum_{k=0}^{K} f(x_k) - f^* \;\le\; \frac{\|x_0 - x^*\|_2^2}{2\alpha(K+1)} + \frac{\alpha}{2(K+1)}\sum_{k=0}^{K}\|\nabla f(x_k)\|_2^2.$$

On the left-hand side, as f is convex, Jensen's inequality gives
  $$f\Big(\frac{1}{K+1}\sum_{k=0}^{K} x_k\Big) \;\le\; \frac{1}{K+1}\sum_{k=0}^{K} f(x_k).$$

Therefore
  $$f\Big(\frac{1}{K+1}\sum_{k=0}^{K} x_k\Big) - f^* \;\le\; \frac{\|x_0 - x^*\|_2^2}{2\alpha(K+1)} + \frac{\alpha}{2(K+1)}\sum_{k=0}^{K}\|\nabla f(x_k)\|_2^2.$$
18 / 21
 
PGD converges ergodically at rate $O\big(\tfrac{1}{\sqrt{k}}\big)$ on Lipschitz functions

▶ Theorem 2. If f is Lipschitz, for the point $\bar{x}_K = \frac{1}{K+1}\sum_{k=0}^{K} x_k$ and constant stepsize $\alpha = \frac{\|x_0 - x^*\|}{L\sqrt{K+1}}$, we have
  $$f(\bar{x}_K) - f^* \;\le\; \frac{L\|x_0 - x^*\|}{\sqrt{K+1}}.$$

Proof
▶ f is Lipschitz means ∇f is bounded: ∥∇f∥ ≤ L, where L is the Lipschitz constant.
▶ Put x̄K, α, and ∥∇f∥ ≤ L into Theorem 1.
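Spelling out the “put into Theorem 1” step (my own short derivation sketch, which also shows where the particular stepsize comes from):

```latex
% With \|\nabla f(x_k)\| \le L, the gradient-sum term in Theorem 1 is at most \alpha L^2/2,
% and the left-hand side of Theorem 1 is exactly f(\bar{x}_K) - f^*, so
f(\bar{x}_K) - f^* \;\le\; \frac{\|x_0 - x^*\|_2^2}{2\alpha(K+1)} + \frac{\alpha L^2}{2}.
% Minimizing the right-hand side over \alpha > 0 (set its derivative in \alpha to zero) gives
\alpha = \frac{\|x_0 - x^*\|_2}{L\sqrt{K+1}},
% and substituting this \alpha back in makes both terms equal, so their sum is
f(\bar{x}_K) - f^* \;\le\; \frac{L\|x_0 - x^*\|_2}{\sqrt{K+1}}.
```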

Remarks
▶ On the stepsize α, note that it involves K (the total number of steps), not k (the current iteration number).
▶ α requires knowing x∗, so this theorem is practically useless: knowing x∗ already solves the problem.
▶ Although we do not know x∗ in general, the theorem tells us that the ergodic convergence speed of PGD is $O\big(\tfrac{1}{\sqrt{k}}\big)$.

19 / 21
Discussion
In the convergence analysis of GD:
1. f is convex and β-smooth (gradient is β-Lipschitz)
2. Convergence rate $O\big(\tfrac{1}{k}\big)$.
3. The convergence rate is not ergodic

In the convergence analysis of PGD:
1. f is convex and L-Lipschitz (gradient is bounded above)
2. Convergence rate $O\big(\tfrac{1}{\sqrt{k}}\big)$.
3. The convergence rate is ergodic: it works on x̄K

If f is convex and β-smooth, the convergence of PGD will be the same as that of GD.
▶ The theoretical convergence rate of PGD on convex and β-smooth f is also $O\big(\tfrac{1}{k}\big)$.
▶ However, the practical cost depends on the complexity of the projection:
  some Q are difficult to project onto.

As PGD is a special case of the proximal gradient method, it is better to study the proximal gradient method; see, for
example, here, here and here.
20 / 21
Last page - summary

▶ PGD = GD + projection

▶ PGD with constant stepsize α:
  $$f\Big(\frac{1}{K+1}\sum_{k=0}^{K} x_k\Big) - f^* \;\le\; \frac{\|x_0 - x^*\|_2^2}{2\alpha(K+1)} + \frac{\alpha}{2(K+1)}\sum_{k=0}^{K}\|\nabla f(x_k)\|_2^2$$

▶ If f is Lipschitz (bounded gradient), then for the point $\bar{x}_K = \frac{1}{K+1}\sum_{k=0}^{K} x_k$ and constant stepsize $\alpha = \frac{\|x_0 - x^*\|}{L\sqrt{K+1}}$,
  $$f(\bar{x}_K) - f^* \;\le\; \frac{L\|x_0 - x^*\|}{\sqrt{K+1}}.$$
End of document

21 / 21
