1 CONVEXITY
A function f : R^d → R is convex if for any x, y that lie in the domain of f (which is assumed to be a convex set) and any λ ∈ [0, 1] we have
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y).
The sub-differential of f at x is the set
∂f(x) := {v ∈ R^d : v is a sub-gradient of f at x},
i.e., the set of all v satisfying f(y) ≥ f(x) + ⟨v, y − x⟩ for all y in the domain. The sub-differential is a set, not a single vector; for example, for f(x) = |x| on R we have ∂f(0) = [−1, 1].
1. Lipschitz continuity (L): the function f satisfies
|f(x) − f(y)| ≤ L‖x − y‖₂.
2. Lipschitz continuity of gradients (L): the function f(x) is differentiable and its gradient is also Lipschitz continuous,
‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖₂.
Such a function is called L-smooth.
3. Strong convexity (µ): the function f(x) is convex and differentiable, and the definition of strong convexity is
f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µ/2)‖x − y‖².
A numerical sketch of these properties for a quadratic example is given below.
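As a quick illustration, here is a minimal numerical sketch in Python (assuming numpy is available; the quadratic f(x) = 0.5·xᵀAx and all constants are hypothetical examples, not from the text). For such a quadratic, the smoothness constant L and the strong-convexity constant µ are the largest and smallest eigenvalues of A.

```python
import numpy as np

# Hypothetical example: f(x) = 0.5 * x^T A x with A symmetric positive definite.
# Its gradient is A x; L = lambda_max(A) and mu = lambda_min(A).
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B @ B.T + np.eye(5)              # symmetric positive definite
eigs = np.linalg.eigvalsh(A)
L, mu = eigs.max(), eigs.min()

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x, y = rng.standard_normal(5), rng.standard_normal(5)

# Property 2 (Lipschitz gradients): ||grad f(x) - grad f(y)|| <= L ||x - y||_2.
assert np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-9

# Property 3 (strong convexity): f(y) >= f(x) + <grad f(x), y - x> + (mu/2)||x - y||^2.
assert f(y) >= f(x) + grad(x) @ (y - x) + 0.5 * mu * np.linalg.norm(x - y) ** 2 - 1e-9

# Property 1 (Lipschitz continuity of f itself) holds for a quadratic only on
# bounded sets, so it is not checked here.
```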
2 GRADIENT DESCENT
Consider solving the following problem:
x∗ = arg min_{x ∈ R^d} f(x).  (1)
Gradient descent solves (1) with the iterations x_{t+1} = x_t − α∇f(x_t), where α > 0 is a constant step size.
Lemma 2.2. If x∗ is a local minimum of a continuously differentiable function f, it satisfies the first-order optimality condition, i.e.,
∇f(x∗) = 0.
If f(x) is convex, then ∇f(x∗) = 0 is a sufficient condition for global optimality.
Proof. If ∇f(x∗) ≠ 0, then for small enough η > 0 the point x∗ − η∇f(x∗) has a strictly smaller function value, contradicting local minimality. For convex f, the first-order characterization of convexity gives f(y) ≥ f(x∗) + ⟨∇f(x∗), y − x∗⟩ = f(x∗) for all y.
Lemma 2.3 (Descent Lemma). For an L-smooth function f(x),
‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖₂,
we have
f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖x − y‖².
Proof. Write f(y) − f(x) = ∫₀¹ ⟨∇f(x + t(y − x)), y − x⟩ dt, subtract ⟨∇f(x), y − x⟩ from both sides, and bound the remainder using the Lipschitz continuity of ∇f together with the Cauchy–Schwarz inequality.
Corollary 2.1. A consequence of the descent lemma is that
(1/(2L))‖∇f(x)‖² ≤ f(x) − f(x∗) ≤ (L/2)‖x − x∗‖².
Proof. For the first inequality, minimize the right-hand side of the descent lemma over y (the minimizer is y = x − (1/L)∇f(x)) and use f(y) ≥ f(x∗). For the second, apply the descent lemma with x∗ in place of x and x in place of y, and use ∇f(x∗) = 0.
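This sandwich inequality is easy to check numerically. Below is a minimal sketch (assuming numpy; the quadratic f(x) = 0.5·xᵀAx is a hypothetical example whose minimizer is x∗ = 0 with f(x∗) = 0):

```python
import numpy as np

# Check (1/(2L))||grad f(x)||^2 <= f(x) - f(x*) <= (L/2)||x - x*||^2
# for f(x) = 0.5 * x^T A x, whose minimizer is x* = 0 with f(x*) = 0.
rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
A = B @ B.T + np.eye(4)
L = np.linalg.eigvalsh(A).max()

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = rng.standard_normal(4)
gap = f(x)                                        # f(x) - f(x*)
lower = np.linalg.norm(grad(x)) ** 2 / (2 * L)    # (1/(2L)) ||grad f(x)||^2
upper = 0.5 * L * np.linalg.norm(x) ** 2          # (L/2) ||x - x*||^2
assert lower <= gap <= upper
```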
Corollary 2.2. For convex f, Lipschitz continuity of ∇f with constant L is equivalent to co-coercivity of ∇f:
⟨∇f(x) − ∇f(y), x − y⟩ ≥ (1/L)‖∇f(x) − ∇f(y)‖².
Lemma 2.4 (Convergence rate for gradient descent with constant step-size). For a convex, L-smooth function f, gradient descent x_{t+1} = x_t − α∇f(x_t) with constant step-size α ≤ 1/L satisfies
f(x_t) − f(x∗) ≤ (1/(2tα))‖x_0 − x∗‖².
Proof. For α ≤ 1/L, the descent lemma gives f(x_{t+1}) ≤ f(x_t) − (α/2)‖∇f(x_t)‖². Combine this with the convexity bound f(x_t) − f(x∗) ≤ ⟨∇f(x_t), x_t − x∗⟩ to obtain f(x_{t+1}) − f(x∗) ≤ (1/(2α))(‖x_t − x∗‖² − ‖x_{t+1} − x∗‖²); then telescope over t and use that f(x_t) is non-increasing.
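The rate is easy to observe empirically. Here is a minimal gradient-descent sketch (assuming numpy; the quadratic test problem and the choice α = 1/L are illustrative assumptions) that checks the bound at every iteration:

```python
import numpy as np

# Gradient descent x_{t+1} = x_t - alpha * grad f(x_t) on f(x) = 0.5 * x^T A x
# (minimizer x* = 0), checking f(x_t) - f(x*) <= ||x_0 - x*||^2 / (2 t alpha).
rng = np.random.default_rng(2)
B = rng.standard_normal((10, 10))
A = B @ B.T + np.eye(10)
L = np.linalg.eigvalsh(A).max()
alpha = 1.0 / L                                   # constant step size, alpha <= 1/L

f = lambda x: 0.5 * x @ A @ x
x0 = rng.standard_normal(10)

x = x0.copy()
for t in range(1, 201):
    x = x - alpha * (A @ x)                       # A @ x is grad f(x)
    bound = (x0 @ x0) / (2 * t * alpha)           # ||x_0 - x*||^2 / (2 t alpha)
    assert f(x) <= bound + 1e-12
```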
Lemma 2.5. For an m-strongly convex and L-smooth function f, we have
⟨∇f(x) − ∇f(y), x − y⟩ ≥ (mL/(m + L))‖x − y‖² + (1/(m + L))‖∇f(x) − ∇f(y)‖²
for all x, y.
Proof. Let h(x) = f(x) − (m/2)‖x‖². Then h is convex and (L − m)-smooth, so by co-coercivity (Corollary 2.2),
⟨∇h(x) − ∇h(y), x − y⟩ ≥ (1/(L − m))‖∇h(x) − ∇h(y)‖².
Substituting ∇h(x) = ∇f(x) − mx and rearranging yields the claimed inequality.
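As a sanity check, the inequality of Lemma 2.5 can be verified numerically on a strongly convex quadratic (a sketch assuming numpy; the quadratic and the random test points are hypothetical):

```python
import numpy as np

# Verify <grad f(x) - grad f(y), x - y> >=
#   (mL/(m+L)) ||x - y||^2 + (1/(m+L)) ||grad f(x) - grad f(y)||^2
# for f(x) = 0.5 * x^T A x, where m and L are the extreme eigenvalues of A.
rng = np.random.default_rng(3)
B = rng.standard_normal((6, 6))
A = B @ B.T + np.eye(6)
eigs = np.linalg.eigvalsh(A)
m, L = eigs.min(), eigs.max()

grad = lambda x: A @ x
x, y = rng.standard_normal(6), rng.standard_normal(6)

lhs = (grad(x) - grad(y)) @ (x - y)
rhs = (m * L / (m + L)) * np.linalg.norm(x - y) ** 2 \
    + (1.0 / (m + L)) * np.linalg.norm(grad(x) - grad(y)) ** 2
assert lhs >= rhs - 1e-9
```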
3 STOCHASTIC GRADIENT DESCENT
In practice, the number of data n can be very large in modern machine learning. Consider the finite-sum objective
f(x) = (1/n) Σ_{i=1}^n f_i(x).
It is difficult to do gradient descent in this case because computing the full gradient ∇f(x) requires a pass over all n terms at every iteration.
Stochastic gradient descent for the finite-sum case performs the following iterations:
x_{t+1} = x_t − α∇f_{ω_t}(x_t),
where ω_t is an index sampled uniformly at random from {1, . . . , n}.
Lemma 3.1 (Descent lemma for stochastic updates). The next update for SGD satisfies
E_{ω_t}[f(x_{t+1})] − f(x_t) ≤ −α⟨∇f(x_t), E_{ω_t}[∇f_{ω_t}(x_t)]⟩ + (Lα²/2) E_{ω_t}[‖∇f_{ω_t}(x_t)‖²].
Proof. Use the descent lemma, substitute the SGD iterate x_{t+1} = x_t − α∇f_{ω_t}(x_t), and take the expectation of both sides over the index of the datum ω_t.
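To make the setting concrete, here is a minimal SGD sketch on a finite-sum least-squares problem (assuming numpy; the data, step size, and iteration count are illustrative assumptions, not from the text). In expectation over ω_t, the update direction equals the full gradient ∇f(x_t), which is what the proof exploits.

```python
import numpy as np

# SGD for f(x) = (1/n) sum_i f_i(x) with f_i(x) = 0.5 * (a_i^T x - b_i)^2.
rng = np.random.default_rng(4)
n, d = 1000, 5
A_data = rng.standard_normal((n, d))              # rows are the a_i
b = A_data @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def grad_i(x, i):
    # Gradient of the single-datum loss f_i.
    return (A_data[i] @ x - b[i]) * A_data[i]

x = np.zeros(d)
alpha = 0.01                                      # constant step size
for t in range(10_000):
    omega_t = rng.integers(n)                     # uniform index of the datum
    x = x - alpha * grad_i(x, omega_t)            # x_{t+1} = x_t - alpha grad f_{omega_t}(x_t)

print("final average loss:", 0.5 * np.mean((A_data @ x - b) ** 2))
```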