
L. Vandenberghe

EE236C (Spring 2016)

1. Gradient method

• gradient method, first-order methods
• quadratic bounds on convex functions
• analysis of gradient method

1-1

Approximate course outline

First-order methods
• gradient, conjugate gradient, quasi-Newton methods
• subgradient, proximal gradient methods
• accelerated (proximal) gradient methods

Decomposition and splitting methods
• first-order methods and dual reformulations
• alternating minimization methods
• monotone operators and operator-splitting methods

Interior-point methods
• conic optimization
• primal-dual interior-point methods
primal-dual interior-point methods

Gradient method

to minimize a convex differentiable function f: choose an initial point x^(0) and repeat

    x^(k) = x^(k−1) − t_k ∇f(x^(k−1)),    k = 1, 2, . . .

Step size rules
• fixed: t_k constant
• backtracking line search
• exact line search: minimize f(x − t∇f(x)) over t

Advantages of gradient method
• every iteration is inexpensive
• does not require second derivatives
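The update rule above takes only a few lines of code. The following sketch applies it to a small quadratic; the test function f(x) = (1/2)(x₁² + 10x₂²), the step size, and the iteration count are assumptions chosen for the demo, not from the slides.

```python
# Gradient method with a fixed step size, applied to the assumed quadratic
# f(x) = (1/2)(x1^2 + 10*x2^2), whose gradient is (x1, 10*x2) and whose
# gradient Lipschitz constant is L = 10.

def grad_method(grad, x0, t, iters):
    """Repeat x <- x - t * grad(x) for a fixed number of iterations."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        x = [xi - t * gi for xi, gi in zip(x, g)]
    return x

grad = lambda x: [x[0], 10.0 * x[1]]      # gradient of the assumed quadratic

x = grad_method(grad, [4.0, 1.0], t=0.1, iters=200)   # t = 1/L
print(x)   # both coordinates approach the minimizer (0, 0)
```

With t = 1/L the poorly scaled coordinate is eliminated immediately while the other contracts only by a factor 0.9 per iteration, a first hint of the scaling sensitivity discussed on the next page.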


Quadratic example

    f(x) = (1/2)(x₁² + γx₂²)    (with γ > 1)

with exact line search and starting point x^(0) = (γ, 1):

    ‖x^(k) − x⋆‖₂ / ‖x^(0) − x⋆‖₂ = ((γ − 1)/(γ + 1))^k

[figure: iterates zig-zag across the level curves in the (x₁, x₂)-plane]

gradient method is often slow; convergence very dependent on scaling
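The stated rate can be checked numerically. The sketch below runs exact line search on the quadratic example; for a quadratic the exact step has the closed form t = gᵀg / gᵀAg, and γ = 10 is an arbitrary choice for the demo.

```python
# Numerical check of the quadratic example: for f(x) = (1/2)(x1^2 + gamma*x2^2)
# with exact line search from x0 = (gamma, 1), the distance to x* = (0, 0)
# shrinks by exactly (gamma - 1)/(gamma + 1) per iteration.
import math

gamma = 10.0
x = [gamma, 1.0]
x0_norm = math.hypot(x[0], x[1])            # ||x^(0) - x*||_2 with x* = (0, 0)
c = (gamma - 1.0) / (gamma + 1.0)

for k in range(1, 11):
    g = [x[0], gamma * x[1]]                # gradient of f
    # exact line search step: t = g'g / g'Ag with A = diag(1, gamma)
    t = (g[0]**2 + g[1]**2) / (g[0]**2 + gamma * g[1]**2)
    x = [x[0] - t * g[0], x[1] - t * g[1]]
    ratio = math.hypot(x[0], x[1]) / x0_norm
    assert abs(ratio - c**k) < 1e-9         # matches ((gamma-1)/(gamma+1))^k

print("contraction factor per iteration:", c)
```

For γ = 10 the factor is 9/11 ≈ 0.82, so even an exact line search makes slow progress when the problem is badly scaled.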


Nondifferentiable example

    f(x) = √(x₁² + γx₂²)                for |x₂| ≤ x₁
    f(x) = (x₁ + γ|x₂|)/√(1 + γ)        for |x₂| > x₁

with exact line search and starting point x^(0) = (γ, 1): converges to the non-optimal point (0, 0)

[figure: iterates zig-zag along the region boundary toward the origin in the (x₁, x₂)-plane]

gradient method does not handle nondifferentiable problems
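This failure can be reproduced numerically. The sketch below is an approximation of the slide's experiment: it assumes γ = 4 and replaces the exact line search with a ternary search, which is valid here because f is convex and hence unimodal along any ray.

```python
# Gradient method with (numerically) exact line search on the nondifferentiable
# example: iterates approach the non-optimal point (0, 0) even though f is
# unbounded below. gamma = 4 is an assumed value for the demo.
import math

gamma = 4.0

def f(x1, x2):
    if abs(x2) <= x1:
        return math.sqrt(x1 * x1 + gamma * x2 * x2)
    return (x1 + gamma * abs(x2)) / math.sqrt(1.0 + gamma)

def grad(x1, x2):
    # gradient in the region |x2| <= x1, where all iterates remain
    r = math.sqrt(x1 * x1 + gamma * x2 * x2)
    return x1 / r, gamma * x2 / r

def exact_step(x1, x2, g1, g2):
    # ternary search for the minimizer of t -> f(x - t*g) on [0, 2*||x||]
    lo, hi = 0.0, 2.0 * math.hypot(x1, x2)
    for _ in range(200):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(x1 - m1 * g1, x2 - m1 * g2) < f(x1 - m2 * g1, x2 - m2 * g2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2.0

x1, x2 = gamma, 1.0
for _ in range(30):
    g1, g2 = grad(x1, x2)
    t = exact_step(x1, x2, g1, g2)
    x1, x2 = x1 - t * g1, x2 - t * g2

print(x1, x2)        # close to the non-optimal point (0, 0)
print(f(-1.0, 0.0))  # yet f(-1, 0) < 0, so the iterates are far from optimal
```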



First-order methods

address one or both disadvantages of the gradient method

Methods with improved convergence
• quasi-Newton methods
• conjugate gradient method
• accelerated gradient method

Methods for nondifferentiable or constrained problems
• subgradient method
• proximal gradient method
• smoothing methods
• cutting-plane methods

Outline

• gradient method, first-order methods
• quadratic bounds on convex functions
• analysis of gradient method

Convex function

a function f is convex if dom f is a convex set and Jensen's inequality holds:

    f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y)    for all x, y ∈ dom f, θ ∈ [0, 1]

First-order condition

for (continuously) differentiable f, Jensen's inequality can be replaced with

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x)    for all x, y ∈ dom f

Second-order condition

for twice differentiable f, Jensen's inequality can be replaced with

    ∇²f(x) ⪰ 0    for all x ∈ dom f


Strictly convex function

f is strictly convex if dom f is a convex set and

    f(θx + (1 − θ)y) < θf(x) + (1 − θ)f(y)    for all x, y ∈ dom f, x ≠ y, and θ ∈ (0, 1)

strict convexity implies that if a minimizer of f exists, it is unique

First-order condition

for differentiable f, strict Jensen's inequality can be replaced with

    f(y) > f(x) + ∇f(x)ᵀ(y − x)    for all x, y ∈ dom f, x ≠ y

Second-order condition

note that ∇²f(x) ≻ 0 is not necessary for strict convexity (cf. f(x) = x⁴)


Monotonicity of gradient

a differentiable function f is convex if and only if dom f is convex and

    (∇f(x) − ∇f(y))ᵀ(x − y) ≥ 0    for all x, y ∈ dom f

i.e., the gradient ∇f : Rⁿ → Rⁿ is a monotone mapping

a differentiable function f is strictly convex if and only if dom f is convex and

    (∇f(x) − ∇f(y))ᵀ(x − y) > 0    for all x, y ∈ dom f, x ≠ y

i.e., the gradient ∇f : Rⁿ → Rⁿ is a strictly monotone mapping


Proof

• if f is differentiable and convex, then

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x),    f(x) ≥ f(y) + ∇f(y)ᵀ(x − y)

combining the inequalities gives (∇f(x) − ∇f(y))ᵀ(x − y) ≥ 0

• if ∇f is monotone, then g′(t) ≥ g′(0) for t ≥ 0 and t ∈ dom g, where

    g(t) = f(x + t(y − x)),    g′(t) = ∇f(x + t(y − x))ᵀ(y − x)

hence

    f(y) = g(1) = g(0) + ∫₀¹ g′(t) dt ≥ g(0) + g′(0)
         = f(x) + ∇f(x)ᵀ(y − x)

this is the first-order condition for convexity

Lipschitz continuous gradient

the gradient of f is Lipschitz continuous with parameter L > 0 if

    ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂    for all x, y ∈ dom f

• note that the definition does not assume convexity of f

• we will see that for convex f with dom f = Rⁿ, this is equivalent to

    (L/2)xᵀx − f(x) is convex

(i.e., if f is twice differentiable, ∇²f(x) ⪯ LI for all x)


Quadratic upper bound

suppose ∇f is Lipschitz continuous with parameter L and dom f is convex

• then g(x) = (L/2)xᵀx − f(x), with dom g = dom f, is convex

• convexity of g is equivalent to a quadratic upper bound on f:

    f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (L/2)‖y − x‖₂²    for all x, y ∈ dom f

[figure: the graph of f lies below the quadratic upper bound, which touches it at (x, f(x))]

Proof

• Lipschitz continuity of ∇f and the Cauchy-Schwarz inequality imply

    (∇f(x) − ∇f(y))ᵀ(x − y) ≤ L‖x − y‖₂²    for all x, y ∈ dom f

• this is monotonicity of the gradient ∇g(x) = Lx − ∇f(x)

• hence, g is a convex function if its domain dom g = dom f is convex

• the quadratic upper bound is the first-order condition for convexity of g:

    g(y) ≥ g(x) + ∇g(x)ᵀ(y − x)    for all x, y ∈ dom g


Consequence of quadratic upper bound

if dom f = Rⁿ and f has a minimizer x⋆, then

    (1/(2L))‖∇f(x)‖₂² ≤ f(x) − f(x⋆) ≤ (L/2)‖x − x⋆‖₂²    for all x

• right-hand inequality follows from the quadratic upper bound at x = x⋆

• left-hand inequality follows by minimizing the quadratic upper bound:

    f(x⋆) ≥ inf_{y ∈ dom f} ( f(x) + ∇f(x)ᵀ(y − x) + (L/2)‖y − x‖₂² )
          = f(x) − (1/(2L))‖∇f(x)‖₂²

the minimizer of the upper bound is y = x − (1/L)∇f(x) because dom f = Rⁿ
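The two-sided bound is easy to verify numerically. The sketch below checks it at random points for an assumed quadratic with L = 10, x⋆ = (0, 0), and f⋆ = 0.

```python
# Check (1/(2L))||grad f(x)||^2 <= f(x) - f* <= (L/2)||x - x*||^2 at random
# points, for the assumed quadratic f(x) = (1/2)(x1^2 + 10*x2^2).
import random

L = 10.0
f = lambda x: 0.5 * (x[0]**2 + L * x[1]**2)      # f* = 0 at x* = (0, 0)
grad = lambda x: [x[0], L * x[1]]

random.seed(0)
for _ in range(1000):
    x = [random.uniform(-5, 5), random.uniform(-5, 5)]
    gap = f(x)                                   # f(x) - f*
    gn2 = sum(gi**2 for gi in grad(x))           # ||grad f(x)||_2^2
    xn2 = sum(xi**2 for xi in x)                 # ||x - x*||_2^2
    assert gn2 / (2 * L) <= gap + 1e-9
    assert gap <= (L / 2) * xn2 + 1e-9
print("both inequalities hold at 1000 random points")
```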


Co-coercivity of gradient

if f is convex with dom f = Rⁿ and (L/2)xᵀx − f(x) is convex, then

    (∇f(x) − ∇f(y))ᵀ(x − y) ≥ (1/L)‖∇f(x) − ∇f(y)‖₂²    for all x, y

this property is known as co-coercivity of ∇f (with parameter 1/L)

• co-coercivity implies Lipschitz continuity of ∇f (by Cauchy-Schwarz)

• hence, for differentiable convex f with dom f = Rⁿ:

    Lipschitz continuity of ∇f  ⇒  convexity of (L/2)xᵀx − f(x)
                               ⇒  co-coercivity of ∇f
                               ⇒  Lipschitz continuity of ∇f

therefore the three properties are equivalent


Proof of co-coercivity: define two convex functions f_x, f_y with domain Rⁿ

    f_x(z) = f(z) − ∇f(x)ᵀz,    f_y(z) = f(z) − ∇f(y)ᵀz

the functions (L/2)zᵀz − f_x(z) and (L/2)zᵀz − f_y(z) are convex

• z = x minimizes f_x(z); from the left-hand inequality on page 1-14,

    f(y) − f(x) − ∇f(x)ᵀ(y − x) = f_x(y) − f_x(x)
                                ≥ (1/(2L))‖∇f_x(y)‖₂²
                                = (1/(2L))‖∇f(y) − ∇f(x)‖₂²

• similarly, z = y minimizes f_y(z); therefore

    f(x) − f(y) − ∇f(y)ᵀ(x − y) ≥ (1/(2L))‖∇f(y) − ∇f(x)‖₂²

combining the two inequalities shows co-coercivity



Strongly convex function

f is strongly convex with parameter m > 0 if

    g(x) = f(x) − (m/2)xᵀx is convex

Jensen's inequality: Jensen's inequality for g is

    f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y) − (m/2)θ(1 − θ)‖x − y‖₂²

Monotonicity: monotonicity of ∇g gives

    (∇f(x) − ∇f(y))ᵀ(x − y) ≥ m‖x − y‖₂²    for all x, y ∈ dom f

this is called strong monotonicity (coercivity) of ∇f

Second-order condition: ∇²f(x) ⪰ mI for all x ∈ dom f

Quadratic lower bound

from the first-order condition for convexity of g:

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (m/2)‖y − x‖₂²    for all x, y ∈ dom f

[figure: the graph of f lies above the quadratic lower bound, which touches it at (x, f(x))]

• implies that the sublevel sets of f are bounded

• if f is closed (has closed sublevel sets), it has a unique minimizer x⋆ and

    (m/2)‖x − x⋆‖₂² ≤ f(x) − f(x⋆) ≤ (1/(2m))‖∇f(x)‖₂²    for all x ∈ dom f
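This sandwich is the strong-convexity counterpart of the bound on page 1-14, and it can be checked the same way. The quadratic test function below (strongly convex with m = 1) is an assumption for the demo.

```python
# Check (m/2)||x - x*||^2 <= f(x) - f* <= (1/(2m))||grad f(x)||^2 at random
# points, for the assumed quadratic f(x) = (1/2)(x1^2 + 10*x2^2) with m = 1.
import random

m = 1.0
f = lambda x: 0.5 * (x[0]**2 + 10.0 * x[1]**2)   # f* = 0 at x* = (0, 0)
grad = lambda x: [x[0], 10.0 * x[1]]

random.seed(1)
for _ in range(1000):
    x = [random.uniform(-5, 5), random.uniform(-5, 5)]
    gap = f(x)                                   # f(x) - f*
    gn2 = sum(gi**2 for gi in grad(x))           # ||grad f(x)||_2^2
    xn2 = x[0]**2 + x[1]**2                      # ||x - x*||_2^2
    assert (m / 2) * xn2 <= gap + 1e-9
    assert gap <= gn2 / (2 * m) + 1e-9
print("both strong-convexity bounds hold at 1000 random points")
```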

Extension of co-coercivity

if f is strongly convex and ∇f is Lipschitz continuous, then the function

    g(x) = f(x) − (m/2)‖x‖₂²

is convex and ∇g is Lipschitz continuous with parameter L − m

co-coercivity of ∇g gives

    (∇f(x) − ∇f(y))ᵀ(x − y) ≥ (mL/(m + L))‖x − y‖₂² + (1/(m + L))‖∇f(x) − ∇f(y)‖₂²

for all x, y ∈ dom f


Outline

• gradient method, first-order methods
• quadratic bounds on convex functions
• analysis of gradient method

Analysis of gradient method

    x^(k) = x^(k−1) − t_k ∇f(x^(k−1)),    k = 1, 2, . . .

with fixed step size or backtracking line search

Assumptions

1. f is convex and differentiable with dom f = Rⁿ
2. ∇f(x) is Lipschitz continuous with parameter L > 0
3. the optimal value f⋆ = inf_x f(x) is finite and attained at x⋆


Analysis for constant step size

from the quadratic upper bound (page 1-12) with y = x − t∇f(x):

    f(x − t∇f(x)) ≤ f(x) − t(1 − Lt/2)‖∇f(x)‖₂²

therefore, if x⁺ = x − t∇f(x) and 0 < t ≤ 1/L,

    f(x⁺) ≤ f(x) − (t/2)‖∇f(x)‖₂²                                     (1)
          ≤ f⋆ + ∇f(x)ᵀ(x − x⋆) − (t/2)‖∇f(x)‖₂²
          = f⋆ + (1/(2t)) ( ‖x − x⋆‖₂² − ‖x − x⋆ − t∇f(x)‖₂² )
          = f⋆ + (1/(2t)) ( ‖x − x⋆‖₂² − ‖x⁺ − x⋆‖₂² )

the second line follows from convexity of f

• define x = x^(i−1), x⁺ = x^(i), t_i = t, and add the bounds for i = 1, . . . , k:

    ∑_{i=1}^k (f(x^(i)) − f⋆) ≤ (1/(2t)) ∑_{i=1}^k ( ‖x^(i−1) − x⋆‖₂² − ‖x^(i) − x⋆‖₂² )
                              = (1/(2t)) ( ‖x^(0) − x⋆‖₂² − ‖x^(k) − x⋆‖₂² )
                              ≤ (1/(2t)) ‖x^(0) − x⋆‖₂²

• since f(x^(i)) is non-increasing (see (1)),

    f(x^(k)) − f⋆ ≤ (1/k) ∑_{i=1}^k (f(x^(i)) − f⋆) ≤ (1/(2kt)) ‖x^(0) − x⋆‖₂²

Conclusion: the number of iterations to reach f(x^(k)) − f⋆ ≤ ε is O(1/ε)
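The 1/k bound can be observed directly on an example. In the sketch below the quadratic test problem, the starting point, and the step size t = 1/L are assumptions for the demo; the asserted inequality is the bound derived above.

```python
# Verify f(x^(k)) - f* <= ||x^(0) - x*||^2 / (2*k*t) along the iterates,
# for the assumed quadratic f(x) = (1/2)(x1^2 + 10*x2^2) with t = 1/L.
L = 10.0
t = 1.0 / L
f = lambda x: 0.5 * (x[0]**2 + L * x[1]**2)   # f* = 0 at x* = (0, 0)
grad = lambda x: [x[0], L * x[1]]

x0 = [4.0, 1.0]
r2 = x0[0]**2 + x0[1]**2                      # ||x^(0) - x*||^2
x = list(x0)
for k in range(1, 101):
    g = grad(x)
    x = [xi - t * gi for xi, gi in zip(x, g)]
    assert f(x) <= r2 / (2 * k * t)           # the 1/k bound
print("bound holds for k = 1, ..., 100; final f =", f(x))
```

On this problem the actual convergence is much faster than 1/k (the function is strongly convex), which is consistent: the bound is a worst-case guarantee over the whole problem class.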


Backtracking line search

initialize t_k at t̂ > 0 (for example, t̂ = 1); take t_k := βt_k until

    f(x − t_k∇f(x)) < f(x) − αt_k‖∇f(x)‖₂²

[figure: f(x − t∇f(x)) as a function of t, with the lines f(x) − t‖∇f(x)‖₂² and f(x) − αt‖∇f(x)‖₂²]

0 < β < 1; we will take α = 1/2 (mostly to simplify proofs)
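The rule above is a short loop in code. The sketch below uses α = 1/2 as in the slides; β = 1/2, t̂ = 1, and the quadratic test function are assumptions for the demo.

```python
# Backtracking line search: start at t = t_hat and multiply by beta until
# f(x - t*grad f(x)) < f(x) - alpha * t * ||grad f(x)||^2.
L = 10.0
f = lambda x: 0.5 * (x[0]**2 + L * x[1]**2)      # assumed test function
grad = lambda x: [x[0], L * x[1]]

def backtrack(f, x, g, t_hat=1.0, alpha=0.5, beta=0.5):
    """Return a step size satisfying the sufficient decrease condition."""
    gn2 = sum(gi**2 for gi in g)
    if gn2 == 0.0:                               # already stationary
        return t_hat
    t = t_hat
    while f([xi - t * gi for xi, gi in zip(x, g)]) >= f(x) - alpha * t * gn2:
        t *= beta
    return t

x = [4.0, 1.0]
for _ in range(200):
    g = grad(x)
    t = backtrack(f, x, g)
    x = [xi - t * gi for xi, gi in zip(x, g)]
print(f(x), t)   # f(x) is driven toward 0; t never falls below beta/L = 0.05
```

Note that the accepted step adapts to the local curvature: along the well-conditioned direction much larger steps pass the test than the worst-case bound β/L suggests.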



Analysis for backtracking line search

line search with α = 1/2: if f has a Lipschitz continuous gradient, then for t ≤ 1/L

    f(x − t∇f(x)) ≤ f(x) − t(1 − tL/2)‖∇f(x)‖₂² ≤ f(x) − (t/2)‖∇f(x)‖₂²

so the sufficient decrease condition holds as soon as t ≤ 1/L, and the selected step size satisfies

    t_k ≥ t_min = min{t̂, β/L}



Convergence analysis

as on page 1-21:

    f(x^(i)) ≤ f(x^(i−1)) − (t_i/2)‖∇f(x^(i−1))‖₂²
             ≤ f⋆ + ∇f(x^(i−1))ᵀ(x^(i−1) − x⋆) − (t_i/2)‖∇f(x^(i−1))‖₂²
             ≤ f⋆ + (1/(2t_i)) ( ‖x^(i−1) − x⋆‖₂² − ‖x^(i) − x⋆‖₂² )
             ≤ f⋆ + (1/(2t_min)) ( ‖x^(i−1) − x⋆‖₂² − ‖x^(i) − x⋆‖₂² )

the first line follows from the line search condition

add the upper bounds to get

    f(x^(k)) − f⋆ ≤ (1/k) ∑_{i=1}^k (f(x^(i)) − f⋆) ≤ (1/(2k t_min)) ‖x^(0) − x⋆‖₂²

Conclusion: same 1/k bound as with constant step size

Gradient method for strongly convex functions

better results exist if we add strong convexity to the assumptions on page 1-20

Analysis for constant step size

if x⁺ = x − t∇f(x) and 0 < t ≤ 2/(m + L):

    ‖x⁺ − x⋆‖₂² = ‖x − t∇f(x) − x⋆‖₂²
                = ‖x − x⋆‖₂² − 2t∇f(x)ᵀ(x − x⋆) + t²‖∇f(x)‖₂²
                ≤ (1 − t(2mL/(m + L)))‖x − x⋆‖₂² + t(t − 2/(m + L))‖∇f(x)‖₂²
                ≤ (1 − t(2mL/(m + L)))‖x − x⋆‖₂²

(the third step follows from the result on page 1-19)


Distance to optimum

    ‖x^(k) − x⋆‖₂² ≤ cᵏ ‖x^(0) − x⋆‖₂²,    c = 1 − t(2mL/(m + L))

• implies (linear) convergence

• for t = 2/(m + L), get c = ((γ − 1)/(γ + 1))² with γ = L/m

Bound on function value (from page 1-14)

    f(x^(k)) − f⋆ ≤ (L/2)‖x^(k) − x⋆‖₂² ≤ (cᵏL/2)‖x^(0) − x⋆‖₂²

Conclusion: the number of iterations to reach f(x^(k)) − f⋆ ≤ ε is O(log(1/ε))
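The linear rate is easy to confirm on a small strongly convex quadratic. In the sketch below, m = 1, L = 10, and the starting point are assumptions for the demo; on this separable example the contraction bound is in fact tight.

```python
# Verify ||x^(k) - x*||^2 <= c^k ||x^(0) - x*||^2 with t = 2/(m + L) and
# c = ((kappa - 1)/(kappa + 1))^2, kappa = L/m, for the assumed quadratic
# f(x) = (1/2)(x1^2 + 10*x2^2) (so m = 1, L = 10, x* = (0, 0)).
m, L = 1.0, 10.0
t = 2.0 / (m + L)
kappa = L / m
c = ((kappa - 1.0) / (kappa + 1.0)) ** 2

grad = lambda x: [m * x[0], L * x[1]]
x = [4.0, 1.0]
r2 = x[0]**2 + x[1]**2                        # ||x^(0) - x*||^2
for k in range(1, 51):
    g = grad(x)
    x = [xi - t * gi for xi, gi in zip(x, g)]
    assert x[0]**2 + x[1]**2 <= c**k * r2 + 1e-12
print("linear rate verified; c =", c)
```

With κ = 10 the factor is c = (9/11)² ≈ 0.67, so the squared distance shrinks by a constant factor per iteration, matching the O(log(1/ε)) iteration count.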


Limits on convergence rate of first-order methods

First-order method: any iterative algorithm that selects x^(k) in the set

    x^(0) + span{∇f(x^(0)), ∇f(x^(1)), . . . , ∇f(x^(k−1))}

Problem class: any function that satisfies the assumptions on page 1-20

Theorem (Nesterov): for every integer k ≤ (n − 1)/2 and every x^(0), there exist functions in the problem class such that for any first-order method

    f(x^(k)) − f⋆ ≥ (3/32) · L‖x^(0) − x⋆‖₂² / (k + 1)²

• suggests the 1/k rate for the gradient method is not optimal

• recent fast gradient methods have 1/k² convergence (see later)


References

• Yu. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course (2004), section 2.1 (the result on page 1-28 is Theorem 2.1.7 in the book)

• B. T. Polyak, Introduction to Optimization (1987), section 1.4

• the example on page 1-5 is from N. Z. Shor, Nondifferentiable Optimization and Polynomial Problems (1998), page 37

