Projected Gradient

Andersen Ang
ECS, Uni. Southampton, UK
andersen.ang@soton.ac.uk
Homepage: angms.science

Version: July 13, 2023
First draft: August 2, 2017

Content
▶ Unconstrained vs constrained problem
▶ Problem setup
▶ Understanding the geometry of projection
▶ PGD is a special case of proximal gradient
▶ Theorem 1. An inequality of PGD with constant stepsize
  $$f\Big(\frac{1}{K+1}\sum_{k=0}^{K} x_k\Big) - f^* \;\le\; \frac{\|x_0 - x^*\|_2^2}{2\alpha(K+1)} + \frac{\alpha}{2(K+1)}\sum_{k=0}^{K}\|\nabla f(x_k)\|_2^2$$
▶ Theorem 2. PGD converges ergodically at rate O(1/√K) on Lipschitz functions
  $$f(\bar{x}_K) - f^* \;\le\; \frac{L\|x_0 - x^*\|}{\sqrt{K+1}}.$$
Unconstrained vs constrained problem

Unconstrained minimization: min_{x∈Rn} f(x).

Constrained minimization: min_{x∈Q} f(x).

For example, least squares over the unit ℓ2-norm ball can be expressed as
$$\min_{\|x\|_2 \le 1} \|Ax - b\|_2^2.$$
Approaches for solving constrained minimization problems
Solving unconstrained problem min_{x∈Rn} f(x) by gradient descent

▶ Gradient descent (GD) is a simple, intuitive and easy way to solve the unconstrained optimization problem min_{x∈Rn} f(x).
▶ Starting from an initial point x0 ∈ Rn, GD iterates the following until a stopping condition is met (a code sketch is given below):
  $$x_{k+1} = x_k - \alpha_k \nabla f(x_k).$$
▶ Question: how about the constrained problem? Is it possible to tune GD to fit the constrained problem?
  Answer: yes, and the key is the Euclidean projection operator proj : Rn ⇒ Rn.
▶ Remark
  ▶ We assume f is differentiable (i.e., ∇f exists).
  ▶ If f is not differentiable, we can replace the gradient by a subgradient, and we get the so-called subgradient method.
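To make the loop concrete, here is a minimal Python sketch of GD (not from the original slides; the quadratic objective, stepsize and tolerance are hypothetical choices for illustration):

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, max_iter=1000, tol=1e-8):
    """Minimal GD: repeat x_{k+1} = x_k - alpha * grad_f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:  # stopping condition: gradient nearly zero
            break
        x = x - alpha * g            # gradient step
    return x

# Illustrative problem: f(x) = ||x - c||_2^2 with gradient 2(x - c); minimizer is c.
c = np.array([1.0, -2.0])
print(gradient_descent(lambda x: 2 * (x - c), x0=np.zeros(2)))  # ~ [ 1. -2.]
```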
Problem setup of constrained problem
min_{x∈Q} f(x).
Solving constrained problem by projected gradient descent

▶ Projected gradient descent: PGD = GD + projection.
▶ Starting from an initial point x0 ∈ Q, PGD iterates the following until a stopping condition is met (a code sketch is given below):
  $$x_{k+1} = P_Q\big(x_k - \alpha_k \nabla f(x_k)\big),$$
  where
  ▶ k ∈ N: the current iteration counter
  ▶ k + 1 ∈ N: the next iteration counter
  ▶ xk: the current variable
  ▶ xk+1: the next variable
  ▶ ∇f: the gradient of f with respect to x
  ▶ ∇f(xk): the gradient ∇f at the current variable xk
  ▶ αk ∈ (0, +∞): the gradient stepsize
  ▶ PQ: shorthand for projQ
▶ projQ(·) is called the Euclidean projection operator, and is itself an optimization problem:
  $$P_Q(x_0) = \mathrm{proj}_Q(x_0) = \underset{x \in Q}{\mathrm{argmin}}\ \|x - x_0\|_2. \qquad (*)$$
  I.e., given a point x0, PQ finds a point x ∈ Q which is “closest” to x0.
▶ The measure of “closeness” here is the Euclidean distance ∥x − x0∥2.
▶ (∗) is equivalent to
  $$\underset{x \in Q}{\mathrm{argmin}}\ \frac{1}{2}\|x - x_0\|_2^2,$$
  where we square the cost so that the function becomes differentiable.
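A minimal Python sketch of the PGD iteration (not from the slides), using the unit ℓ2-ball example from earlier, whose projection has the closed form PQ(x) = x/max(1, ∥x∥2); the problem data and stepsize are hypothetical choices:

```python
import numpy as np

def proj_unit_ball(v):
    """Closed-form projection onto Q = {x : ||x||_2 <= 1}: rescale if outside."""
    n = np.linalg.norm(v)
    return v if n <= 1.0 else v / n

def pgd(grad_f, proj, x0, alpha, max_iter=2000):
    """PGD: repeat x_{k+1} = P_Q(x_k - alpha * grad_f(x_k))."""
    x = proj(np.asarray(x0, dtype=float))
    for _ in range(max_iter):
        x = proj(x - alpha * grad_f(x))
    return x

# Constrained least squares from earlier: min_{||x||_2 <= 1} ||Ax - b||_2^2.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
grad = lambda x: 2 * A.T @ (A @ x - b)          # gradient of ||Ax - b||_2^2
alpha = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)   # a safe stepsize for this f
x_hat = pgd(grad, proj_unit_ball, np.zeros(5), alpha)
print(np.linalg.norm(x_hat))  # <= 1: every iterate stays feasible
```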
Comparing PGD to GD

GD
1. Pick an initial point x0 ∈ Rn
2. Loop until stopping condition is met:
  2.1 Descent direction: compute −∇f(xk)
  2.2 Stepsize: pick an αk
  2.3 Update: xk+1 = xk − αk ∇f(xk)

PGD
1. Pick an initial point x0 ∈ Q
2. Loop until stopping condition is met:
  2.1 Descent direction: compute −∇f(xk)
  2.2 Stepsize: pick an αk
  2.3 Update: yk+1 = xk − αk ∇f(xk)
  2.4 Projection: xk+1 = argmin_{x∈Q} ½∥x − yk+1∥₂²

▶ PGD = GD + projection.
▶ If the point xk − αk ∇f(xk) after the gradient update is leaving the set Q, project it back.
▶ If the point xk − αk ∇f(xk) after the gradient update is within the set Q, keep the point and do nothing.
▶ Projection PQ(·) : Rn ⇒ Rn
  ▶ It is a mapping from Rn to Rn, i.e., a point-to-point mapping.
  ▶ In general, for a nonconvex set Q, such a mapping is possibly non-unique (this is the ⇒).
  ▶ PQ(·) is an optimization problem:
    $$P_Q(x_0) := \underset{x \in Q}{\mathrm{argmin}}\ \frac{1}{2}\|x - x_0\|_2^2. \qquad (\star)$$
    If Q is a convex compact set, this optimization problem has a unique solution, and we have PQ(·) : Rn → Rn.
▶ PGD is economical if (⋆) has a closed-form expression, is easy to solve, or is cheap to compute (examples below).
▶ PGD is possibly not economical if (⋆) has no closed-form expression, Q is nonconvex, or (⋆) is expensive to compute.
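As a sketch of the “economical” case (these particular sets are illustrative choices, not from the slides), two sets Q where (⋆) has a cheap closed form:

```python
import numpy as np

def proj_box(x, lo, hi):
    """Projection onto the box {x : lo <= x <= hi}: clip each coordinate."""
    return np.clip(x, lo, hi)

def proj_ball(x, r=1.0):
    """Projection onto {x : ||x||_2 <= r}: rescale toward the center if outside."""
    n = np.linalg.norm(x)
    return x if n <= r else (r / n) * x

x = np.array([2.0, -3.0, 0.5])
print(proj_box(x, -1.0, 1.0))  # [ 1.  -1.   0.5]
print(proj_ball(x))            # x shrunk onto the unit sphere
```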
Understanding the geometry of projection ... (1/4)
Consider a convex set Q ⊂ Rn and a point x0 ∈ Rn.

Case 1. x0 ∈ Q.   Case 2. x0 ∉ Q.

(Figure: the set Q drawn twice, with x0 inside Q in Case 1 and outside Q in Case 2.)
Understanding the geometry of projection ... (2/4)
▶ The circles are ℓ2-norm balls centered at x0 with different radii.
▶ Points on the same circle are equidistant to x0 (with a different ℓ2 distance on each circle).
▶ Note that some points on the blue circle are inside Q; those are feasible points.

(Figure: concentric circles centered at x0, which lies outside Q; the blue circle intersects Q.)
Understanding the geometry of projection ... (3/4)
▶ The point inside Q which is closest to x0 is the point where the ℓ2-norm ball “touches” Q.

(Figure: the smallest circle centered at x0 that touches Q; the touching point is y.)
Understanding the geometry of projection ... (4/4)
Note that the projection is orthogonal: the blue point y is always on a straight line that is tangent to the norm ball and Q.

(Figure: y on the boundary of Q, with the segment from x0 to y orthogonal to the common tangent line of the norm ball and Q at y.)
Property of projection: Bourbaki-Cheney-Goldstein inequality

For a closed convex set Q ⊆ Rn and any x0 ∈ Rn, the projection y = PQ(x0) satisfies
$$\langle x_0 - y,\; x - y \rangle \le 0 \quad \text{for all } x \in Q.$$
A consequence is that the projection operator is non-expansive: ∥PQ(u) − PQ(v)∥2 ≤ ∥u − v∥2 for all u, v ∈ Rn; in particular, ∥PQ(u) − x∥2 ≤ ∥u − x∥2 for every x ∈ Q. This is the fact used in the proof of Theorem 1 below.
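A quick numeric sanity check of both properties (an illustration only; Q is taken to be the unit ℓ2 ball, whose projection is x/max(1, ∥x∥2)):

```python
import numpy as np

rng = np.random.default_rng(1)
proj = lambda v: v / max(1.0, np.linalg.norm(v))  # P_Q for the unit l2 ball

for _ in range(5):
    x0, u, v = 3 * rng.standard_normal((3, 4))
    y = proj(x0)
    x = proj(rng.standard_normal(4))       # some point x in Q
    assert (x0 - y) @ (x - y) <= 1e-12     # BCG: <x0 - y, x - y> <= 0
    # non-expansiveness: distances do not grow under projection
    assert np.linalg.norm(proj(u) - proj(v)) <= np.linalg.norm(u - v) + 1e-12
print("BCG inequality and non-expansiveness hold on random samples")
```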
PGD is a special case of proximal gradient

▶ With the indicator function ιQ (where ιQ(x) = 0 if x ∈ Q and +∞ otherwise), the constrained problem has two equivalent expressions:
  $$\min_{x \in Q} f(x) \iff \min_{x \in \mathbb{R}^n} f(x) + \iota_Q(x).$$
▶ Proximal gradient is a method to solve the minimization of the sum of a differentiable and a non-differentiable function:
  $$\min_{x} f(x) + g(x),$$
  where g is non-differentiable.
▶ PGD is in fact the special case of proximal gradient where g(x) is the indicator function ιQ(x): the proximal operator of ιQ is exactly the projection PQ (see the sketch below). See here for more about proximal gradient.
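A tiny sketch of this equivalence (illustrative, with Q the unit ℓ2 ball and a hypothetical quadratic f): since prox_{αιQ}(v) = argmin_x ιQ(x) + ½∥x − v∥₂² = PQ(v), a proximal gradient step with g = ιQ is literally a PGD step.

```python
import numpy as np

proj = lambda v: v / max(1.0, np.linalg.norm(v))  # P_Q, Q = unit l2 ball
prox_indicator = proj                             # prox of the indicator of Q

grad_f = lambda x: x - np.array([3.0, 0.0])       # f(x) = 0.5||x - (3,0)||_2^2
x, alpha = np.zeros(2), 0.5
for _ in range(50):
    # proximal gradient step with g = indicator of Q == PGD step
    x = prox_indicator(x - alpha * grad_f(x))
print(x)  # -> [1. 0.], the point of Q closest to the unconstrained minimizer (3, 0)
```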
On PGD ergodic convergence rate
▶ Theorem 1. If f is convex, PGD with constant stepsize α satisfies
  $$f\Big(\frac{1}{K+1}\sum_{k=0}^{K} x_k\Big) - f^* \;\le\; \frac{\|x_0 - x^*\|_2^2}{2\alpha(K+1)} + \frac{\alpha}{2(K+1)}\sum_{k=0}^{K}\|\nabla f(x_k)\|_2^2,$$
  where x* is the (global) minimizer, f* := f(x*) is the optimal cost value, α is the constant stepsize, and K is the total number of iterations performed.
▶ Interpretation:
  ▶ The term (1/(K+1)) Σ_{k=0}^{K} xk is the “average” of the sequence xk after K iterations.
  ▶ Denote (1/(K+1)) Σ_{k=0}^{K} xk as x̄, and denote f(x̄) as f̄. Then the theorem reads:
    $$\bar{f} - f^* \le \frac{\|x_0 - x^*\|_2^2}{2\alpha(K+1)} + \text{something positive}.$$
  Hence the convergence rate is like O(1/K).
  ▶ The term (α/(2(K+1))) Σ_{k=0}^{K} ∥∇f(xk)∥₂² converges to zero
    ▶ as long as Σ_{k=0}^{K} ∥∇f(xk)∥₂² is not diverging to infinity, or
    ▶ the growth of Σ_{k=0}^{K} ∥∇f(xk)∥₂² is slower than K.
What is ergodic convergence?
▶ Ergodic convergence = “the centroid of a point cloud moving towards the limit point”: the running average x̄k of the iterates converges to x*, even if the sequence xk itself does not.
▶ Sequence convergence: each of x1, x2, ..., xk gets closer and closer to x*.
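A toy illustration (not from the slides): a sequence that hops between two points never converges, but its running average, the centroid, does.

```python
import numpy as np

v = np.array([1.0, -1.0])                # x* = 0; iterates hop between +v and -v
xs = np.array([(-1) ** k * v for k in range(1000)])
xbar = np.cumsum(xs, axis=0) / np.arange(1, 1001)[:, None]  # running averages
print(np.linalg.norm(xs[-1]))    # ~1.414: the sequence itself does not converge
print(np.linalg.norm(xbar[-1]))  # 0.0: the averages converge to x* (ergodic)
```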
Proof of theorem 1 ... (1/3)

Since f is convex,
$$f(z) \ge f(x) + \langle \nabla f(x),\, z - x \rangle \iff f(x) - f(z) \le \langle \nabla f(x),\, x - z \rangle.$$

Put x = xk, z = x*, and f(x*) = f*:
$$f(x_k) - f^* \le \langle \nabla f(x_k),\, x_k - x^* \rangle.$$

By the PGD update yk+1 = xk − αk ∇f(xk), we have ∇f(xk) = (xk − yk+1)/αk. With constant stepsize αk = α,
$$f(x_k) - f^* \le \Big\langle \frac{x_k - y_{k+1}}{\alpha},\, x_k - x^* \Big\rangle.$$

A not-so-trivial trick (justified below):
$$\langle x_k - y_{k+1},\, x_k - x^* \rangle = \frac{\|x_k - x^*\|_2^2 + \|x_k - y_{k+1}\|_2^2 - \|y_{k+1} - x^*\|_2^2}{2}.$$

Combining the two displays above,
$$f(x_k) - f^* \le \frac{\|x_k - x^*\|_2^2 + \|x_k - y_{k+1}\|_2^2 - \|y_{k+1} - x^*\|_2^2}{2\alpha}.$$

By PGD, xk − yk+1 = α∇f(xk). Then
$$f(x_k) - f^* \le \frac{\|x_k - x^*\|_2^2 + \|\alpha \nabla f(x_k)\|_2^2 - \|y_{k+1} - x^*\|_2^2}{2\alpha}.$$
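Why the trick holds (a one-line check, not in the original slides): it is the polarization identity for the Euclidean inner product,
$$\langle a, b \rangle = \frac{\|a\|_2^2 + \|b\|_2^2 - \|a - b\|_2^2}{2},$$
which follows from expanding ∥a − b∥₂² = ∥a∥₂² − 2⟨a, b⟩ + ∥b∥₂². Apply it with a = xk − yk+1 and b = xk − x*, for which a − b = x* − yk+1 and hence ∥a − b∥₂² = ∥yk+1 − x*∥₂².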
Proof of theorem 1 ... (2/3)

From part (1/3), using ∥α∇f(xk)∥₂² = α²∥∇f(xk)∥₂², we now have
$$f(x_k) - f^* \le \frac{\|x_k - x^*\|_2^2 - \|y_{k+1} - x^*\|_2^2}{2\alpha} + \frac{\alpha}{2}\|\nabla f(x_k)\|_2^2.$$

Next we make use of the fact that the projection is non-expansive:
$$\|y_{k+1} - x^*\|_2^2 \;\ge\; \|x_{k+1} - x^*\|_2^2, \qquad x_{k+1} = \mathrm{proj}_Q(y_{k+1}).$$
- This is known as “the projection operator is non-expansive”:
- the post-projection distance is at most the pre-projection distance;
- it follows from the Bourbaki-Cheney-Goldstein inequality. Details here.

(Figure: pictorially, projecting yk+1 back onto Q cannot increase its distance to x* ∈ Q.)

Therefore
$$f(x_k) - f^* \le \frac{\|x_k - x^*\|_2^2 - \|y_{k+1} - x^*\|_2^2}{2\alpha} + \frac{\alpha}{2}\|\nabla f(x_k)\|_2^2 \le \frac{\|x_k - x^*\|_2^2 - \|x_{k+1} - x^*\|_2^2}{2\alpha} + \frac{\alpha}{2}\|\nabla f(x_k)\|_2^2.$$
Proof of theorem 1 ... (3/3)

Write the inequality from part (2/3) for each k:

k = 0: $$f(x_0) - f^* \le \frac{\|x_0 - x^*\|_2^2 - \|x_1 - x^*\|_2^2}{2\alpha} + \frac{\alpha}{2}\|\nabla f(x_0)\|_2^2$$
k = 1: $$f(x_1) - f^* \le \frac{\|x_1 - x^*\|_2^2 - \|x_2 - x^*\|_2^2}{2\alpha} + \frac{\alpha}{2}\|\nabla f(x_1)\|_2^2$$
⋮
k = K: $$f(x_K) - f^* \le \frac{\|x_K - x^*\|_2^2 - \|x_{K+1} - x^*\|_2^2}{2\alpha} + \frac{\alpha}{2}\|\nabla f(x_K)\|_2^2$$

Summing all of them, the distance terms telescope:
$$\sum_{k=0}^{K}\big(f(x_k) - f^*\big) \le \frac{\|x_0 - x^*\|_2^2 - \|x_{K+1} - x^*\|_2^2}{2\alpha} + \frac{\alpha}{2}\sum_{k=0}^{K}\|\nabla f(x_k)\|_2^2.$$

As 0 ≤ (1/(2α))∥xK+1 − x*∥₂²,
$$\sum_{k=0}^{K}\big(f(x_k) - f^*\big) \le \frac{\|x_0 - x^*\|_2^2}{2\alpha} + \frac{\alpha}{2}\sum_{k=0}^{K}\|\nabla f(x_k)\|_2^2.$$

Expand the summation on the left and divide the whole inequality by K + 1:
$$\frac{1}{K+1}\sum_{k=0}^{K} f(x_k) - f^* \le \frac{\|x_0 - x^*\|_2^2}{2\alpha(K+1)} + \frac{\alpha}{2(K+1)}\sum_{k=0}^{K}\|\nabla f(x_k)\|_2^2.$$

On the left-hand side, as f is convex, by Jensen’s inequality
$$f\Big(\frac{1}{K+1}\sum_{k=0}^{K} x_k\Big) \le \frac{1}{K+1}\sum_{k=0}^{K} f(x_k).$$

Therefore
$$f\Big(\frac{1}{K+1}\sum_{k=0}^{K} x_k\Big) - f^* \le \frac{\|x_0 - x^*\|_2^2}{2\alpha(K+1)} + \frac{\alpha}{2(K+1)}\sum_{k=0}^{K}\|\nabla f(x_k)\|_2^2. \qquad \square$$
PGD converges ergodically at rate O(1/√K) on Lipschitz functions

Theorem 2. If f is Lipschitz, for the point x̄K = (1/(K+1)) Σ_{k=0}^{K} xk and constant stepsize
$$\alpha = \frac{\|x_0 - x^*\|}{L\sqrt{K+1}},$$
we have
$$f(\bar{x}_K) - f^* \le \frac{L\|x_0 - x^*\|}{\sqrt{K+1}}.$$

Proof
▶ f is Lipschitz means ∇f is bounded: ∥∇f∥ ≤ L, where L is the Lipschitz constant.
▶ Put x̄K, α, and ∥∇f∥ ≤ L into Theorem 1.

Remarks
▶ In the stepsize α, note that it uses K (the total number of steps), not k (the current iteration number).
▶ α requires knowing x*, so this theorem is practically useless: knowing x* already solves the problem.
▶ Although we do not know x* in general, the theorem tells us that the ergodic convergence speed of PGD is O(1/√K) (see the experiment below).
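A small experiment (an illustration only, not from the slides): PGD on a random instance of the earlier constrained least-squares problem, with a smoothness-based stepsize rather than the theorem's α (which would require ∥x0 − x*∥), and f* estimated from a much longer run. The ergodic gap f(x̄K) − f* shrinks as K grows:

```python
import numpy as np

# min_{||x||_2 <= 1} ||Ax - b||_2^2 on random data (assumed test problem)
rng = np.random.default_rng(0)
A, b = rng.standard_normal((30, 5)), rng.standard_normal(30)
f = lambda x: np.sum((A @ x - b) ** 2)
grad = lambda x: 2 * A.T @ (A @ x - b)
proj = lambda v: v / max(1.0, np.linalg.norm(v))

def f_ergodic(K, alpha):
    """Run K+1 PGD steps and return f at the averaged iterate x̄_K."""
    x, xbar = np.zeros(5), np.zeros(5)
    for _ in range(K + 1):
        x = proj(x - alpha * grad(x))
        xbar += x / (K + 1)
    return f(xbar)

alpha = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)  # smoothness-based stepsize
f_star = f_ergodic(100_000, alpha)             # surrogate for f* from a long run
for K in (10, 100, 1000):
    print(K, f_ergodic(K, alpha) - f_star)     # gap decreases with K
```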
Discussion
In the convergence analysis of GD:
1. f is convex and β-smooth (the gradient is β-Lipschitz).
2. The convergence rate is O(1/k).
3. The convergence rate is not ergodic.

If f is convex and β-smooth, the convergence of PGD is the same as that of GD:
▶ The theoretical convergence rate of PGD on convex and β-smooth f is also O(1/k).
▶ However, in practice it depends on the complexity of the projection: some Q are difficult to project onto.

As PGD is a special case of the proximal gradient method, it is better to study the proximal gradient method. For example here, here and here.
Last page - summary
▶ PGD = GD + projection
▶ If f is Lipschitz (bounded gradient), for the point x̄K = (1/(K+1)) Σ_{k=0}^{K} xk and constant stepsize
$$\alpha = \frac{\|x_0 - x^*\|}{L\sqrt{K+1}},$$
then
$$f(\bar{x}_K) - f^* \le \frac{L\|x_0 - x^*\|}{\sqrt{K+1}}.$$
End of document