
REVIEW FOR OPTIMIZATION I

A PREPRINT

October 17, 2019

1 CONVEXITY

A function f : R^d → R is convex if for any x, y that lie in the domain of f (which is assumed to be a convex set) and
any λ ∈ [0, 1] we have

f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y).
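As a quick numerical sanity check of this definition, the sketch below tests the inequality at random points for the example function f(x) = ‖x‖² (the function, sample size, and tolerance are assumptions made here for illustration).

import numpy as np

# Check the convexity inequality for f(x) = ||x||^2 at random points.
def f(x):
    return float(np.dot(x, x))

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    lam = rng.uniform()
    lhs = f(lam * x + (1 - lam) * y)
    rhs = lam * f(x) + (1 - lam) * f(y)
    assert lhs <= rhs + 1e-12  # convexity: the chord lies above the graph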

Here are some important subclasses of convex functions:


1. Strictly convex functions: for all x ≠ y and λ ∈ (0, 1),

f(λx + (1 − λ)y) < λf(x) + (1 − λ)f(y).

2. Strongly convex functions: there exists an α > 0 such that

f(x) − α‖x‖² is convex.

strong convexity ⇒ strict convexity ⇒ convexity.

If f(x) is convex and differentiable, we have the first-order convexity condition: for all x, y in the domain,

f(y) ≥ f(x) + ⟨∇f(x), y − x⟩.

Proof. For λ ∈ (0, 1], by convexity of f,

f(x + λ(y − x)) = f((1 − λ)x + λy) ≤ (1 − λ)f(x) + λf(y).

Subtracting f(x) from both sides and dividing by λ, we obtain

f(y) ≥ f(x) + (f(x + λ(y − x)) − f(x)) / λ,

and taking the limit as λ → 0⁺ yields the result, since the difference quotient converges to ⟨∇f(x), y − x⟩.

Remark 1. What do we do for non-differentiable functions? A vector v ∈ R^d is a sub-gradient of f at x if and only if

f(y) ≥ f(x) + ⟨v, y − x⟩ for all y.

The sub-differential of a function f at x is defined to be

∂f(x) := {v ∈ R^d : v is a sub-gradient of f at x}.

The sub-differential is a set rather than a single vector; for example, for f(x) = |x| on R, ∂f(0) = [−1, 1].

1.1 Assumptions in convex optimization.

1. Lipschitz continuity:
|f(x) − f(y)| ≤ L‖x − y‖.

2. Lipschitz continuity of gradients (L): the function f(x) is differentiable and its gradient is also Lipschitz continuous,
‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖.
3. Strong convexity (µ): the function f(x) is convex and differentiable, and the definition of strong convexity is
f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µ/2)‖x − y‖².

4. Co-coercivity of gradient: the function f(x) is differentiable and

⟨∇f(x) − ∇f(y), x − y⟩ ≥ (1/L)‖∇f(x) − ∇f(y)‖²;

this property is known as co-coercivity of ∇f (with parameter 1/L).

Lipschitz continuity of ∇f ⇐⇒ co-coercivity of ∇f .
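To make these constants concrete, here is a small sketch for the quadratic f(x) = ½ xᵀAx with a symmetric positive-definite A (an example assumed here, not taken from the notes): ∇f(x) = Ax, the smoothness constant L is the largest eigenvalue of A, and the strong-convexity constant µ is the smallest.

import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
A = B @ B.T + np.eye(4)            # symmetric positive definite
eig = np.linalg.eigvalsh(A)
L, mu = eig.max(), eig.min()       # smoothness and strong-convexity constants

grad = lambda x: A @ x             # gradient of f(x) = 0.5 * x^T A x
x, y = rng.normal(size=4), rng.normal(size=4)
g = grad(x) - grad(y)

# Property 2 (Lipschitz gradients) and property 4 (co-coercivity with 1/L):
assert np.linalg.norm(g) <= L * np.linalg.norm(x - y) + 1e-9
assert g @ (x - y) >= (1.0 / L) * (g @ g) - 1e-9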

2 GRADIENT DESCENT
Consider solving the following problem:

x∗ = arg min_{x ∈ R^d} f(x)    (1)

for a given function f(x).
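Gradient descent attacks problem (1) with the iteration xt+1 = xt − α∇f(xt). A minimal sketch, run on an assumed quadratic objective for illustration:

import numpy as np

def gradient_descent(grad, x0, alpha, num_steps):
    """Run x_{t+1} = x_t - alpha * grad(x_t) for num_steps iterations."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - alpha * grad(x)
    return x

# Example: f(x) = 0.5 * x^T A x - b^T x, whose minimizer solves A x = b.
A = np.array([[3.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, -2.0])
grad = lambda x: A @ x - b
L = np.linalg.eigvalsh(A).max()                      # Lipschitz constant of grad

x_star = np.linalg.solve(A, b)
x_hat = gradient_descent(grad, np.zeros(2), 1.0 / L, 200)
print(np.linalg.norm(x_hat - x_star))                # essentially 0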


Lemma 2.1. For convex functions, local minima are global minima.

Proof. Suppose x∗ is a local minimum but f(y) < f(x∗) for some y; then by convexity f(x∗ + λ(y − x∗)) ≤ (1 − λ)f(x∗) + λf(y) < f(x∗) for every λ ∈ (0, 1], contradicting local minimality.
Lemma 2.2. If x∗ is a local minimum of a continuously differentiable function f, it satisfies the first-order optimality condition,
i.e.,
∇f(x∗) = 0.
If f(x) is convex, then ∇f(x∗) = 0 is a sufficient condition for global optimality.

Proof. Necessity is the standard first-order optimality condition; for sufficiency, the first-order convexity condition gives f(y) ≥ f(x∗) + ⟨∇f(x∗), y − x∗⟩ = f(x∗) for all y.
Lemma 2.3. (Descent Lemma). For an L-smooth function f(x), i.e.,
‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖,
we have
f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖x − y‖².

Proof. .
Corollary 2.1. A consequence of the descent lemma is that
(1/(2L))‖∇f(x)‖² ≤ f(x) − f(x∗) ≤ (L/2)‖x − x∗‖².

Proof. .
Corollary 2.2. Lipschitz continuity of ∇f ⇐⇒ co-coercivity of ∇f .


Proof. (=⇒): Define two convex functions

g(z) = f(z) − ⟨∇f(x), z⟩,    h(z) = f(z) − ⟨∇f(y), z⟩.

Notice that these two functions have L-Lipschitz continuous gradients. Furthermore, by convexity,

g(z) − g(x) = f(z) − f(x) − ⟨∇f(x), z − x⟩ ≥ 0.

Thus z = x minimizes g(z), and by the corollary above,

f(y) − f(x) − ⟨∇f(x), y − x⟩ = g(y) − g(x) ≥ (1/(2L))‖∇f(x) − ∇f(y)‖².

Similarly, z = y minimizes h(z) and

f(x) − f(y) − ⟨∇f(y), x − y⟩ = h(x) − h(y) ≥ (1/(2L))‖∇f(x) − ∇f(y)‖².

Combining the two inequalities shows co-coercivity.
(⇐=): co-coercivity in turn implies Lipschitz continuity of ∇f (by Cauchy–Schwarz).
Corollary 2.3. Substitute y = x − α∇f(x) in the descent lemma to show that if
α ≤ 1/L,
then we get
f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (1/(2α))‖x − y‖².

Proof. By the Descent Lemma:
f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖x − y‖²
     = f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖α∇f(x)‖²
     ≤ f(x) + ⟨∇f(x), y − x⟩ + (α/2)‖∇f(x)‖²     (using Lα² ≤ α, since α ≤ 1/L)
     = f(x) + ⟨∇f(x), y − x⟩ + (1/(2α))‖x − y‖².
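Plugging y − x = −α∇f(x) into this bound gives the sufficient-decrease inequality f(xt+1) ≤ f(xt) − (α/2)‖∇f(xt)‖². A quick numerical check on an assumed quadratic with α = 1/L:

import numpy as np

# Check f(x_{t+1}) <= f(x_t) - (alpha/2) * ||grad f(x_t)||^2 along a GD run.
A = np.diag([4.0, 1.0, 0.5])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
alpha = 1.0 / np.linalg.eigvalsh(A).max()            # alpha = 1/L

x = np.array([1.0, -2.0, 3.0])
for _ in range(50):
    g = grad(x)
    x_next = x - alpha * g
    assert f(x_next) <= f(x) - 0.5 * alpha * (g @ g) + 1e-12
    x = x_next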

Lemma 2.4. (Convergence rate for gradient descent with constant step-size). For a convex, L-smooth function f and a constant step-size α ≤ 1/L, the iterates of gradient descent satisfy
f(xt) − f(x∗) ≤ (1/(2tα))‖x0 − x∗‖².
Proof. .
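A numerical check of this O(1/t) bound on an assumed convex quadratic, using α = 1/L:

import numpy as np

# Verify f(x_t) - f(x*) <= ||x_0 - x*||^2 / (2 * t * alpha) along a GD run.
A = np.diag([5.0, 2.0, 0.3])
b = np.array([1.0, 0.0, -1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
alpha = 1.0 / np.linalg.eigvalsh(A).max()            # alpha = 1/L

x_star = np.linalg.solve(A, b)
x0 = np.array([4.0, -3.0, 2.0])
x = x0.copy()
for t in range(1, 201):
    x = x - alpha * grad(x)
    bound = np.dot(x0 - x_star, x0 - x_star) / (2 * t * alpha)
    assert f(x) - f(x_star) <= bound + 1e-12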
Lemma 2.5. For an m-strongly convex and L-smooth function f, we have
⟨∇f(x) − ∇f(y), x − y⟩ ≥ (mL/(m + L))‖x − y‖² + (1/(m + L))‖∇f(x) − ∇f(y)‖²
for all x, y.

Proof. Let h(x) = f(x) − (m/2)‖x‖²; h is convex with (L − m)-Lipschitz gradient, so by co-coercivity,
⟨∇h(x) − ∇h(y), x − y⟩ ≥ (1/(L − m))‖∇h(x) − ∇h(y)‖².
Substituting ∇h(x) = ∇f(x) − mx and rearranging yields the claim.

Theorem 2.1. The convergence rate of gradient descent with step-size
α ≤ 2/(m + L)
for an m-strongly convex function f with L-Lipschitz gradients is
‖xt − x∗‖² ≤ cᵗ ‖x0 − x∗‖²,
where
c = 1 − 2αmL/(m + L).
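A sketch checking the per-step contraction ‖xt+1 − x∗‖² ≤ c ‖xt − x∗‖² on an assumed strongly convex quadratic, where m and L are the extreme eigenvalues of A:

import numpy as np

# Check the contraction factor c = 1 - 2*alpha*m*L/(m + L) for GD on a quadratic.
A = np.diag([6.0, 2.0, 1.0])
b = np.array([1.0, -1.0, 0.5])
grad = lambda x: A @ x - b
eig = np.linalg.eigvalsh(A)
L, m = eig.max(), eig.min()
alpha = 2.0 / (m + L)
c = 1.0 - 2.0 * alpha * m * L / (m + L)

x_star = np.linalg.solve(A, b)
x = np.array([3.0, 3.0, -3.0])
for _ in range(50):
    x_next = x - alpha * grad(x)
    assert np.sum((x_next - x_star) ** 2) <= c * np.sum((x - x_star) ** 2) + 1e-12
    x = x_next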


3 STOCHASTIC GRADIENT DESCENT


In machine learning, we solve the finite-sum problem. Given a finite dataset D = {(ξi, yi)}i=1,...,n, we minimize

f(x) := (1/n) Σ_{i=1}^{n} l(x; ξi, yi).

In practice, the number of data points n can be very large in modern machine learning. It is difficult to do gradient descent in this case because computing the full gradient requires summing n per-sample gradients at every iteration.
Stochastic gradient descent for the finite-sum case performs the following iteration:

xt+1 = xt − η ∇l(xt; ξωt, yωt).


The datum (ξωt , yωt ) over which we compute the gradient before updating the weights is picked randomly from the
dataset D.
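A minimal SGD sketch for an assumed least-squares finite sum, with l(x; ξi, yi) = ½(ξiᵀx − yi)² (the data, step-size, and iteration count are illustrative choices, not from the notes):

import numpy as np

# SGD on f(x) = (1/n) * sum_i 0.5 * (xi_i^T x - y_i)^2.
rng = np.random.default_rng(2)
n, d = 1000, 5
Xi = rng.normal(size=(n, d))                     # features xi_i
x_true = rng.normal(size=d)
y = Xi @ x_true + 0.01 * rng.normal(size=n)      # labels y_i

def stochastic_grad(x, i):
    """Gradient of l(x; xi_i, y_i) = 0.5 * (xi_i @ x - y_i)^2 with respect to x."""
    return (Xi[i] @ x - y[i]) * Xi[i]

eta = 0.01
x = np.zeros(d)
for t in range(20000):
    i = rng.integers(n)                          # omega_t: uniformly sampled index
    x = x - eta * stochastic_grad(x, i)

print(np.linalg.norm(x - x_true))                # small, up to noise and SGD variance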
Remark 2. (Mini-batch version of SGD). With a mini-batch of ϑ indices ω_t^1, . . . , ω_t^ϑ sampled at step t,

xt+1 = xt − (η/ϑ) Σ_{k=1}^{ϑ} ∇l(xt; ξ_{ω_t^k}, y_{ω_t^k}).
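A corresponding mini-batch sketch (same assumed least-squares loss as above; the batch size ϑ = 32 and other constants are illustrative):

import numpy as np

# Mini-batch SGD: each step averages theta sampled per-example gradients.
rng = np.random.default_rng(3)
n, d, theta = 1000, 5, 32
Xi = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
y = Xi @ x_true

eta = 0.05
x = np.zeros(d)
for t in range(2000):
    idx = rng.integers(n, size=theta)            # omega_t^1, ..., omega_t^theta
    residuals = Xi[idx] @ x - y[idx]
    x = x - (eta / theta) * (Xi[idx].T @ residuals)

print(np.linalg.norm(x - x_true))                # close to 0 (noiseless labels)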

Lemma 3.1. (Descent lemma for stochastic updates). Let gt = ∇l(xt; ξωt, yωt) denote the stochastic gradient. The next update of SGD satisfies
Eωt[f(xt+1)] − f(xt) ≤ −η ⟨∇f(xt), Eωt[gt]⟩ + (Lη²/2) Eωt[‖gt‖²].

Proof. Use the descent lemma, substitute the iterates of SGD and take an expectation on both sides over the index of
the datum ωt .
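Since ωt is uniform over the n data points, both expectations can be computed exactly by averaging over all indices. The following sketch verifies the inequality at a single iterate for an assumed least-squares loss:

import numpy as np

# Check E[f(x_{t+1})] - f(x_t) <= -eta*<grad f(x_t), E[g_t]> + (L*eta^2/2)*E[||g_t||^2],
# where g_t is the stochastic gradient and the expectation is a uniform average.
rng = np.random.default_rng(4)
n, d = 200, 4
Xi = rng.normal(size=(n, d))
y = rng.normal(size=n)

f = lambda x: 0.5 * np.mean((Xi @ x - y) ** 2)
grad_f = lambda x: Xi.T @ (Xi @ x - y) / n
L = np.linalg.eigvalsh(Xi.T @ Xi / n).max()      # smoothness constant of f
eta = 0.1 / L

x = rng.normal(size=d)
g = (Xi @ x - y)[:, None] * Xi                   # row i is the stochastic gradient g_i
lhs = np.mean([f(x - eta * gi) for gi in g]) - f(x)
rhs = (-eta * grad_f(x) @ g.mean(axis=0)
       + 0.5 * L * eta ** 2 * np.mean(np.sum(g ** 2, axis=1)))
assert lhs <= rhs + 1e-12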
