
Design and Analysis of Algorithms October 29, 2021

Massachusetts Institute of Technology 6.046/18.410J Fall 2021


Constantinos Daskalakis, Piotr Indyk, and Bruce Tidor Recitation 8

Recitation 8: Continuous Optimization

1 Gradient Descent
In continuous optimization, the problem we aim to solve is the following: given a "nice" (i.e., differentiable) function f : R^n → R, find x∗ such that f(x∗) is the minimum of f, if it exists.
In lecture, we presented the gradient descent algorithm, which corresponds to a greedy
local search.

GRADIENTDESCENT(f, x0) returns an approximation of the minimum of f.


GRADIENTDESCENT(f, x0)
1: x^(0) ← x0
2: for i in 1 . . . T do
3:     x^(i) ← x^(i−1) − η · ∇f(x^(i−1))
4: end for
5: return x^(T)
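
In Python, the same loop might look like the following minimal sketch (the names gradient_descent and grad_f are ours; like the pseudocode, it assumes a routine computing ∇f exactly is available):

import numpy as np

def gradient_descent(grad_f, x0, eta, T):
    """T iterations of gradient descent from x0 with step size eta."""
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        x = x - eta * grad_f(x)  # step "downhill", against the gradient
    return x

# Example: f(x) = (x - 3)^2 has gradient 2(x - 3); the iterates approach 3.
print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0, eta=0.25, T=50))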

Intuitively, the algorithm starts at a point x0 and moves a little bit “downhill” every it-
eration (how little is governed by the step size η), so that eventually it should reach the
bottom of the hill. By "downhill", we mean always in the direction of −∇f. As a reminder, the gradient ∇f(x) of the function f at a point x ∈ R^n is a vector in R^n whose ith entry is given by

∇_i f(x) = (∇f(x))_i = ∂f(x)/∂x_i,
which is simply the multi-dimensional analogue of the derivative and points in the direc-
tion of steepest ascent. For all x∗ such that f (x∗ ) is the minimum of f , ∇f (x∗ ) is equal to
0, i.e., a vector of all zeros. Thus, one criterion we are looking for is that ∇f(x∗) = 0. Note that this criterion alone is not a sufficient condition for x∗ to be the global minimizer of f, as
the gradient also vanishes at any local extremum (such as a maximum) or saddle point. We
will, however, often work with so-called convex functions which have the property that
∇f (x∗ ) = 0 only if x∗ globally minimizes f , as we will see ahead.
For convex functions, we can restate our problem to be a search for an x∗ such that
∇f (x∗ ) = 0, if one exists. The gradient descent algorithm does precisely that. Once
we understand its correctness, we’d like to analyze when the algorithm works well (i.e.,
converges quickly) and when it doesn’t. In the next few sections, we’ll do this using three
example functions, each with an input in one dimension (most problems involve more than one dimension, but for the purposes of illustration, we stick to n = 1), starting point x0 = 0, and step size η = 1 (we note that there are better ways of selecting the step size). Through the course of these examples we'll put stronger and stronger conditions on our function, until we can guarantee that the algorithm works well. This will tell us what properties a function needs in order to perform well under gradient descent.

1.1 Example 1: Convexity


[Figure: plot of f(x) = x^4 − 3x^3 + 2x for x ∈ [−2, 4].]

As mentioned above, gradient descent can go wrong by converging to the wrong point.
For instance, on the function f above, the algorithm starting at x0 = 0 (with a suitably small step size) will converge to a local minimum near x = −0.5, not the desired global minimum near x = 2. This happens
because the only thing that gradient descent knows how to do is go downhill, so it goes
down the hill to the left, instead of to the point we want on the right. As a result, we get
stuck at a local minimum instead of the desired global minimum. We alluded to this issue
at the beginning; the gradient of a function can vanish not only at any local extremum
(maximum or minimum), but also at saddle points that are neither maxima nor minima.
To illustrate the latter, consider the example of the two-variable function g(x_1, x_2) = x_1^2 − x_2^2. We have ∇g(x_1, x_2) = (2x_1, −2x_2). Therefore, ∇g(0, 0) = (0, 0). However, 0 is neither a (local) minimum nor maximum of g. Note that for any ε ≠ 0, g(ε, 0) = ε^2 > 0 = g(0, 0) and g(0, ε) = −ε^2 < 0 = g(0, 0). To avoid this sort of problem, we'll only run gradient descent
on functions that don’t have such points where the gradient vanishes (i.e., critical points)
except for the global minimum. Unfortunately, there isn’t a good way to tell whether this
is the case for a given function without time-consuming computation. Therefore, we’ll
assume a stronger but simpler property, namely that the function is convex. Here, we’ll
introduce two definitions of convexity. The standard definition, which is often the easiest

to prove, is that a function g is convex if and only if

∀x, y, 0 ≤ λ ≤ 1, g(λx + (1 − λ)y) ≤ λg(x) + (1 − λ)g(y)

Intuitively, this says that g is convex if and only if the value of the function at a point
between any two other points does not exceed the weighted average of the function val-
ues at those two other points. That is, the function must always lie below a line drawn
between any two points on the function. For instance, the function g shown below is
convex.

[Figure: a convex function g(x) on [−1, 5] with the points (x, g(x)) and (y, g(y)) marked; the chord point (λx + (1 − λ)y, λg(x) + (1 − λ)g(y)) lies above the function point (λx + (1 − λ)y, g(λx + (1 − λ)y)).]
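
As a quick numerical illustration of the chord definition, the sketch below searches a grid for a violation of the inequality, using x^2 as a stand-in convex function and the non-convex f(x) = x^4 − 3x^3 + 2x from this example (the grid and tolerance are our choices):

import numpy as np

def chord_violation(fn, xs, lambdas, tol=1e-9):
    """Search for x, y, lam violating fn(lam*x+(1-lam)*y) <= lam*fn(x)+(1-lam)*fn(y)."""
    for x in xs:
        for y in xs:
            for lam in lambdas:
                z = lam * x + (1 - lam) * y
                if fn(z) > lam * fn(x) + (1 - lam) * fn(y) + tol:
                    return (x, y, lam)  # a witness that fn is not convex
    return None

xs, lambdas = np.linspace(-2, 4, 61), np.linspace(0, 1, 11)
print(chord_violation(lambda t: t**2, xs, lambdas))                 # None: convex
print(chord_violation(lambda t: t**4 - 3*t**3 + 2*t, xs, lambdas))  # finds a witness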

The function f is not convex (do you see why?), and hence gradient descent fails on it. To
get another (equivalent) definition of convexity, let’s look into the Taylor series expansion
of f . Recall that we have

f(x + δ) = f(x) + ∇f(x)^T δ + (1/2) δ^T ∇^2 f(x) δ + ...

From this, we can focus on the linear approximation of f (x):

f(x + δ) ≈ f(x) + ∇f(x)^T δ

A function is convex if and only if this linear approximation is always a lower bound on
the function, i.e.
∀x, δ:  f(x + δ) ≥ f(x) + ∇f(x)^T δ
This means that the function lies above the line tangent to it at any point, in any direction.
If a function is convex, gradient descent will only converge to the global minimum, if it
converges at all.
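
To see the trap concretely, here is a short sketch of gradient descent on the non-convex f(x) = x^4 − 3x^3 + 2x from the start of this example. (We use η = 0.01 rather than η = 1, since with η = 1 the iterates on this particular f blow up; the small step isolates the trapping behavior.)

def grad_f(x):
    # f(x) = x^4 - 3x^3 + 2x, so f'(x) = 4x^3 - 9x^2 + 2
    return 4 * x**3 - 9 * x**2 + 2

x = 0.0
for _ in range(1000):
    x -= 0.01 * grad_f(x)
print(x)  # about -0.44: the local minimum, not the global minimum near x = 2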

1.2 Example 2: Smoothness (Doesn’t bend up too quickly)

[Figure: plot of f(x) = 2(x − 1)^2 for x ∈ [−4, 6].]

Let’s take a look at what happens when we run gradient descent on the new f (x) above,
again with the parameters x0 = 0, η = 1. We can compute the gradient to be ∇f(x) = df(x)/dx = 4(x − 1).
i    x_i     ∇f(x_i)
0      0        −4
1      4        12
2     −8       −36
3     28       108
4    −80      −324
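
The rows above can be reproduced with a short sketch:

def grad_f(x):
    return 4 * (x - 1)   # gradient of f(x) = 2(x - 1)^2

x, eta = 0.0, 1.0
for i in range(5):
    print(i, x, grad_f(x))   # reproduces the rows of the table above
    x -= eta * grad_f(x)
# the error x - 1 is multiplied by (1 - eta * f''(x)) = -3 every step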
As you can see, the algorithm is not converging at all, and is in fact getting farther and far-
ther away from the minimum, which is x∗ = 1. The reason the algorithm fails in this case
is that it keeps overshooting the minimum. This happens because the steps we’re taking
are too large. For a convex function, the linear approximation only gives us the tangent
line from below (as we just saw), so in order to relate this behavior back to properties of
the function, we’ll need to go beyond that. Let’s think about a quadratic approximation
instead. Based on our Taylor series from before, the quadratic approximation is

f(x + δ) ≈ f(x) + ∇f(x)^T δ + (1/2) δ^T ∇^2 f(x) δ

This time, we're ignoring all terms larger than the quadratic one. If this quadratic approximation were exactly correct, and we were working in one dimension, then the perfect choice of η would be 1/∇^2 f(x). For instance, in the example above, d^2 f(x)/dx^2 = 4, and the value of η that would bring us exactly to the minimum is 1/4. Of course, not all
functions that we consider will be exactly quadratic, and we never want to run into this
divergent behavior. So we’ll find a quadratic form that is an upper bound on how quickly
the function can slope upwards. The upper bound will rely on the following estimation:

δ^T ∇^2 f(x) δ ≤ β||δ||^2

This says that at any starting point, and along any direction, the function is always up-
per bounded by a quadratic function, with leading coefficient no more than β/2. This
condition is called β-smoothness, and together with convexity (and a step size of η = 1/β) it ensures that our algorithm converges to a global minimum, given that one exists.
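
For the quadratic above, this prescription can be checked directly: f''(x) = 4 everywhere, so η = 1/4 lands exactly on the minimum (a sketch):

def grad_f(x):
    return 4 * (x - 1)   # f(x) = 2(x - 1)^2, so f''(x) = 4 everywhere

x = 0.0
x -= 0.25 * grad_f(x)    # one step with eta = 1/f''(x) = 1/4
print(x)                 # 1.0: the exact minimizer, reached in a single step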

1.3 Example 3: Strong Convexity


With smoothness, we’ve guaranteed that the algorithm will converge, but we’d like it to
do so quickly. In particular, we’d like it take a polynomial amount of time in the size of
the input. Since our minimum can be as big as a number written in the input, and it takes
as many bits to write a number as the logarithm of the size of number, this means that
our algorithm needs to run in an amount of time logarithmic in the size of the minimum.
In the next example, this will not be the case. Consider the function f (x) defined by

f(x) = { e^(x−5) − (x − 5)      if x < 5
       { (1/2)(x − 5)^2 + 1     if x ≥ 5

[Figure: plot of the piecewise f(x) above for x ∈ [−6, 10].]

Let’s take a look at what happens when we run gradient descent on this f (x), again with
the parameters x0 = 0, η = 1. We first compute the gradient to be

∇f(x) = df(x)/dx = { e^(x−5) − 1   if x < 5
                   { x − 5         if x ≥ 5

i    x_i        ∇f(x_i)
0    0          −0.99326
1    0.99326    −0.98181
2    1.97507    −0.95144
3    2.92651    −0.87425
4    3.80076    −0.69858
5    4.49934    −0.39387
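
Again, this is easy to reproduce (a sketch):

import math

def grad_f(x):
    # gradient of the piecewise f above
    return math.exp(x - 5) - 1 if x < 5 else x - 5

x, eta = 0.0, 1.0
for i in range(6):
    print(i, round(x, 5), round(grad_f(x), 5))   # matches the table above
    x -= eta * grad_f(x)
# the gradient shrinks as x approaches 5, so the steps shrink with it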

While the algorithm takes the estimate closer and closer to the minimum x = 5, its
progress also gets slower and slower due to the diminishing gradient, so it never reaches
the true minimum but only gets arbitrarily close. This is not acceptable, and is another
type of function we shouldn’t be running this algorithm on. While in this specific case,
we can increase the step size to converge faster, in general we might not be able to do that
because increasing the step size might run afoul of the previous β-smoothness bound. In
particular, a function might look like a steep quadratic in one dimension and like a line in another: the steep direction forces the step size to stay small, and progress along the nearly flat direction then takes forever. To get around this problem, we'll
fall back again on the situation we know how to solve really well: the quadratic case. For
quadratics, steps of size exactly 1/∇^2 f(x) allow us to make significant progress and get
to the minimum in only a logarithmic number of iterations. If we can lower bound the
“quadratic-ness” of a function, we’ll be able to determine a good step size such that we
only need a logarithmic number of them. The bound looks like the following:

δ^T ∇^2 f(x) δ ≥ α||δ||^2

This is called being α-strongly convex when α > 0. If α = 0, this is simply the convexity
constraint, which is enough to prove convergence, but not logarithmic convergence. With
α > 0, we can say that after a logarithmic number of steps of size 1/α, we’ll reach the
minimum. However, we’re not taking steps of size 1/α, we’re taking steps of size 1/β.
The ratio of these two step sizes is β/α, and β/α times more steps will be enough to
compensate for the fact that our steps are α/β times smaller. Since β/α comes up so often,
it's been given a name, the condition number, and a symbol, κ. Now we can give the actual effectiveness guarantee of the algorithm: if we set T, the number of steps we're going to take, to O(κ log((f(x0) − f(x∗))/ε)), then the output x_T will have a value within ε of the minimum:

f(x_T) − f(x∗) ≤ ε.
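
As a sanity check of this guarantee, the toy experiment below runs gradient descent with η = 1/β on f(x) = (α/2)x^2, which is α-strongly convex and β-smooth whenever α ≤ β, and counts the steps needed to reach a fixed ε. The step count grows roughly linearly with κ = β/α, as the bound predicts. (The function, starting point, and ε here are our choices.)

def steps_to_eps(alpha, beta, x0=100.0, eps=1e-6):
    """Count gradient descent steps with eta = 1/beta on f(x) = (alpha/2) x^2."""
    x, steps = x0, 0
    while 0.5 * alpha * x * x > eps:      # until f(x) - f(x*) <= eps (f(x*) = 0)
        x -= (1.0 / beta) * (alpha * x)   # gradient of f is alpha * x
        steps += 1
    return steps

for kappa in (1, 10, 100):
    print(kappa, steps_to_eps(alpha=1.0, beta=float(kappa)))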

1.4 Guarantees, visualized


[Figure: plots of the tangent line y = x, the quadratic (x − 2)^2 + x, the function f(x) = 1.5(x − 2)^2 + x, and the quadratic 2(x − 2)^2 + x, for x ∈ [0, 4].]

The second curve from the top is f(x) = 1.5(x − 2)^2 + x. The bottom line represents the convexity guarantee: f(x) is lower bounded by its tangent line y = x. The top curve, 2(x − 2)^2 + x, represents the β-smoothness guarantee: f(x) is upper bounded by a quadratic. The lower curve, (x − 2)^2 + x, represents the α-strong convexity guarantee: f(x) is lower bounded by a quadratic (here with α = 2; since f''(x) = 3, any α ≤ 3 works). Together, these guarantees show that the algorithm will work quickly on this function.

2 Exercises
2.1 Gradient Descent on Related Functions
(a) Consider the function g(x) = f(x) + tx^2 where f(x) is β-smooth and α-strongly convex. Find β′ and α′ such that g(x) is β′-smooth and α′-strongly convex.
Solution.
The second derivative of tx^2 is 2t, so g″(x) = f″(x) + 2t. Taking β′ = β + 2t and α′ = α + 2t is correct because α||δ||^2 ≤ δ^T ∇^2 f(x) δ ≤ β||δ||^2 for all δ, x, and thus

(α + 2t)||δ||^2 ≤ δ^T ∇^2 (f(x) + tx^2) δ = δ^T ∇^2 f(x) δ + 2t||δ||^2 ≤ (β + 2t)||δ||^2.
(b) Let g and h be convex functions. Prove or disprove that f = min(g, h) is also a
convex function.
Solution. Not convex.

As a counter-example, consider g(x) = x^2 and h(x) = (x − 2)^2. Take y1 = 0, y2 = 2, and λ = 1/2, so that z = λy1 + (1 − λ)y2 = 1. Then f(z) = g(z) = h(z) = 1, but f(y1) = g(y1) = 0 and f(y2) = h(y2) = 0, so λf(y1) + (1 − λ)f(y2) = 0 < f(z).
(c) Let g and h be convex functions. Prove or disprove that f = max(g, h) is also
a convex function.
Solution. Convex.
Take any x, y and λ. Let z = λx + (1 − λ)y. Since f = max(g, h), we must have
f (z) = g(z) or f (z) = h(z). WLOG, assume that f (λx + (1 − λ)y) = f (z) =
g(z) = g(λx + (1 − λ)y). Then,

f(λx + (1 − λ)y) = g(λx + (1 − λ)y)
                 ≤ λg(x) + (1 − λ)g(y)
                 ≤ λ max(g(x), h(x)) + (1 − λ) max(g(y), h(y))
                 = λf(x) + (1 − λ)f(y)

3 Appendix: A longer exercise


Logistic Regression and Gradient Descent. In this problem, we will consider using gra-
dient descent to minimize the following function:

f(x) = −b log σ(a^T x) − (1 − b) log(1 − σ(a^T x)),    a, x ∈ R^n, b ∈ {0, 1}    (1)

where σ is the sigmoid function defined as


σ(t) = 1 / (1 + e^(−t))
Note that σ(t) ∈ (0, 1) for all t ∈ R. Note: You don’t have to know this, but if you’re curious, f
is the loss function for logistic regression. Minimizing f corresponds to finding the coefficients x
of logistic regression on a single data point (a, b). For this problem, the following property of
the sigmoid function may prove to be useful.

∂σ(t)/∂t = σ(t)(1 − σ(t))    (2)
You can easily verify that this property holds using the definition of σ(t). Let R be a known upper bound on the norm of an optimum of f; that is, there exists an x∗ such that f(x∗) = min_{x∈R^n} f(x) and ||x∗|| ≤ R. In order to devise and analyze an algorithm for
minimizing our function, we first look at the most basic objects related to our function:
the gradient and the Hessian.

1. Compute ∇f (x), the gradient of f at x.


Solution: We take partial derivatives with respect to each xi to find the gradient.
This is most easily done by using Equation 2 and never expanding the sigmoid
function at all.

(∇f(x))_i = −(a_i b σ(a^T x)(1 − σ(a^T x))) / σ(a^T x) + ((1 − b) a_i σ(a^T x)(1 − σ(a^T x))) / (1 − σ(a^T x))
          = (σ(a^T x) − b) a_i

So the gradient is

∇f(x) = (σ(a^T x) − b) a
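
A finite-difference check of this formula is a useful habit (a sketch; the random instance is arbitrary):

import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

def f(x, a, b):
    s = sigma(a @ x)
    return -b * np.log(s) - (1 - b) * np.log(1 - s)

rng = np.random.default_rng(0)
a, x, b = rng.standard_normal(4), rng.standard_normal(4), 1
analytic = (sigma(a @ x) - b) * a          # the formula derived above
h, numeric = 1e-6, np.zeros(4)
for i in range(4):
    e = np.zeros(4); e[i] = h
    numeric[i] = (f(x + e, a, b) - f(x - e, a, b)) / (2 * h)   # central differences
print(np.allclose(numeric, analytic, atol=1e-5))  # True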

2. Compute the Hessian ∇^2 f(x) of f(x) to show that

∇^2 f(x) = a a^T σ(a^T x)(1 − σ(a^T x))

Solution: The easiest way to evaluate the Hessian here is to use the definition of the
Hessian:
[∇^2 f(x)]_ij = ∂^2 f(x) / (∂x_i ∂x_j)

We already know the partial derivatives ∂f(x)/∂x_i from the gradient we calculated in part 1, so all that remains is to take the partial derivatives of each entry of ∇f(x).
Again, we can do this without expanding σ.

(∇^2 f(x))_ij = ∂(∇f(x))_i / ∂x_j = a_i a_j σ(a^T x)(1 − σ(a^T x))
so

∇^2 f(x) = a a^T σ(a^T x)(1 − σ(a^T x))
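
The Hessian formula can be checked the same way, by differencing the gradient (a sketch):

import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad_f(x, a, b):
    return (sigma(a @ x) - b) * a

rng = np.random.default_rng(1)
a, x, b = rng.standard_normal(3), rng.standard_normal(3), 0
s = sigma(a @ x)
analytic = np.outer(a, a) * s * (1 - s)    # the formula derived above
h, numeric = 1e-6, np.zeros((3, 3))
for j in range(3):
    e = np.zeros(3); e[j] = h
    numeric[:, j] = (grad_f(x + e, a, b) - grad_f(x - e, a, b)) / (2 * h)
print(np.allclose(numeric, analytic, atol=1e-5))  # True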

As we saw in class, if our function has certain “niceness” properties, we can pro-
vide good guarantees on the performance of the gradient descent algorithm when
minimizing it. In particular, the relevant properties are the smoothness and the strong
convexity of f .

3. Prove that the function is β-smooth for β = 0.25||a||^2.


Hint: Recall that a function f(x) is β-smooth iff for any z, x ∈ R^n,

z^T ∇^2 f(x) z ≤ β||z||^2

You may find the following inequality useful:

for any z, x ∈ R^n,    z^T x ≤ ||z|| ||x||

This is called the Cauchy-Schwarz inequality.


Solution: For any z, x ∈ R^n,

z^T (∇^2 f(x)) z = (z^T a)^2 σ(a^T x)(1 − σ(a^T x)) ≤ 0.25||a||^2 ||z||^2

where the last inequality holds because σ(a^T x)(1 − σ(a^T x)) ≤ 0.25 and (z^T a)^2 ≤ ||z||^2 ||a||^2 by the Cauchy-Schwarz inequality. Therefore f is 0.25||a||^2-smooth.

4. Prove that the function is convex but not α-strongly convex for any α > 0. Hint: Recall that a function f(x) is convex iff

z^T ∇^2 f(x) z ≥ 0    ∀x, z ∈ R^n

(in other words, ∇^2 f(x) is positive semi-definite for all x ∈ R^n) and α-strongly convex iff

z^T ∇^2 f(x) z ≥ α||z||^2    ∀x, z ∈ R^n

Solution: For any z and x,

z^T (∇^2 f(x)) z = (z^T a)^2 σ(a^T x)(1 − σ(a^T x)) ≥ 0

On the other hand, for any z such that z^T a = 0 (a nonzero such z exists whenever n ≥ 2),

z^T (∇^2 f(x)) z = (z^T a)^2 σ(a^T x)(1 − σ(a^T x)) = 0

therefore the function is convex but not strongly convex.

Because f (x) is not strongly convex, the convergence analysis shown in class can’t be
directly applied here. To alleviate this problem we adjust f to make it strongly convex by
adding a so-called regularization term. Specifically, we will add a quadratic function to f ,
which makes our new function be

f̂(x) = −b log σ(a^T x) − (1 − b) log(1 − σ(a^T x)) + (λ/2)||x||^2    (3)

5. Prove that f̂(x) is λ-strongly convex. Furthermore, prove that f̂(x) is still smooth; in particular, prove that it is β-smooth for β = 0.25||a||^2 + λ.
Solution: Note that f̂(x) = f(x) + (λ/2)||x||^2. This means that

∇^2 f̂(x) = ∇^2 f(x) + ∇^2 ((λ/2) x^T x) = ∇^2 f(x) + λI

so

z^T ∇^2 f̂(x) z = z^T ∇^2 f(x) z + λ||z||^2 ≥ λ||z||^2

and

z^T ∇^2 f̂(x) z = z^T ∇^2 f(x) z + λ||z||^2 ≤ (0.25||a||^2 + λ)||z||^2

Therefore f̂ is (0.25||a||^2 + λ)-smooth and λ-strongly convex.
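
Since ∇^2 f̂(x) is symmetric, both bounds can also be read off its eigenvalues, which makes for a quick numerical check (a sketch; the random instance is arbitrary):

import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(2)
a, x, lam = rng.standard_normal(4), rng.standard_normal(4), 0.1
s = sigma(a @ x)
H = np.outer(a, a) * s * (1 - s) + lam * np.eye(4)   # Hessian of f-hat
eigs = np.linalg.eigvalsh(H)
print(eigs.min() >= lam - 1e-12)                     # strong convexity: True
print(eigs.max() <= 0.25 * (a @ a) + lam + 1e-12)    # smoothness: True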

6. We are now able to bound the convergence rate of the gradient descent method on f̂. To this end, prove that if we set the starting point x0 to 0 and use the gradient descent method with step size η = O(1/(||a||^2 + λ)), then the objective function value will converge to within ε of the true minimum within T steps, i.e. f̂(x_T) − f̂(x̂∗) ≤ ε, where

T = O((||a||^2/λ + 1) log(||x̂∗||^2 (||a||^2 + λ) / ε))

and x̂∗ is the minimizer of f̂(x).
Hint: You will have to bound f̂(x0) − f̂(x̂∗) in terms of ||x0 − x̂∗||.
Hint: You will have to bound fˆ(x0 ) − fˆ(x̂∗ ) in terms of kx0 − x̂∗ k.
Solution:
First, we can use the smoothness guarantee to bound f̂(x0) − f̂(x̂∗). In particular, the smoothness implies that

f̂(x0) − f̂(x̂∗) ≤ (x0 − x̂∗)^T ∇f̂(x̂∗) + (β/2)||x0 − x̂∗||^2

Note that ∇f̂(x̂∗) = 0 because x̂∗ is the minimum of f̂. We can also plug in for x0 and β to find that

f̂(x0) − f̂(x̂∗) ≤ ((0.25||a||^2 + λ)/2) ||x̂∗||^2

Alternatively, we could have used convexity and Cauchy-Schwarz in the following way:

f̂(x0) − f̂(x̂∗) ≤ ∇f̂(x0)^T (x0 − x̂∗)
              ≤ ||∇f̂(x0)|| ||x0 − x̂∗||
              = ||a(σ(0) − b) + λ · 0|| ||x̂∗||
              ≤ ||a|| ||x̂∗||

Now we can use the guarantee on convergence from the lecture notes. Plugging in the appropriate expressions for β and α for the condition number and using the bound we proved above on f̂(x0) − f̂(x̂∗), we find that the number of iterations is

T = O(((0.25||a||^2 + λ)/λ) log((0.25||a||^2 + λ)||x̂∗||^2 / (2ε)))
  = O((||a||^2/λ + 1) log(||x̂∗||^2 (||a||^2 + λ) / ε))
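
Putting everything together, the sketch below runs gradient descent on f̂ with the step size η = 1/β from the analysis; the gradient norm is driven to (numerical) zero. (The random instance and iteration count are our choices.)

import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(3)
a, b, lam = rng.standard_normal(5), 1, 0.1

def grad_fhat(x):
    return (sigma(a @ x) - b) * a + lam * x

beta = 0.25 * (a @ a) + lam       # the smoothness constant proved above
x = np.zeros(5)                   # starting point x0 = 0
for _ in range(2000):
    x -= (1.0 / beta) * grad_fhat(x)
print(np.linalg.norm(grad_fhat(x)))  # essentially 0: x is (numerically) the minimizer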
