
Nonlinear Optimization

Spring 2008, Princeton University, Prof. Eckstein

Class Notes: Gradient-Related Methods

Consider solving the unconstrained problem

    min_{x ∈ R^n} f(x),

where f : R^n → R is continuously differentiable. We consider iterative line search methods
of the form

    x^{k+1} = x^k + α_k d^k,                                                  (1)

where d^k ∈ R^n is a search direction and α_k ≥ 0 is a scalar stepsize. We will follow the
general analysis presented in Nonlinear Programming by D.P. Bertsekas (Athena Scientific,
1995/1999).
A very common rule for determining α_k is the Armijo rule, parameterized by three scalars
s > 0 and σ, β ∈ (0, 1). Assuming that d^k is a direction of descent, that is, ⟨∇f(x^k), d^k⟩ < 0,
the rule works as follows: set α_k = β^m s, where m is the smallest nonnegative integer for
which

    f(x^k) − f(x^k + β^m s d^k) ≥ −σβ^m s ⟨∇f(x^k), d^k⟩.                     (2)
In words, this condition requires that the objective improvement obtained by the step β^m s
(the left-hand side of (2)) be within a factor σ of what is predicted by a linear extrapolation
from x^k (the right-hand side of (2)). In practice, this step can be computed by a simple
“backtracking” procedure: start with α = s; if

    f(x^k) − f(x^k + αd^k) ≥ −σα⟨∇f(x^k), d^k⟩                                (3)

is not satisfied, set α ← βα, and repeat until (3) holds. That must eventually be the case
whenever ⟨∇f(x^k), d^k⟩ < 0, because dividing (3) by α yields

    [f(x^k) − f(x^k + αd^k)] / α ≥ −σ⟨∇f(x^k), d^k⟩,

whose left-hand side converges to −⟨∇f(x^k), d^k⟩ > 0 as α → 0, while σ < 1 makes the
right-hand side strictly smaller than this limit. For completeness, we will simply say that
the Armijo stepsize is 0 if ⟨∇f(x^k), d^k⟩ ≥ 0.
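As a concrete illustration, the following Python sketch implements this backtracking
procedure. The function name, the NumPy dependency, and the max_backtracks safeguard
are conventions of the sketch, not part of the rule itself:

    import numpy as np

    def armijo_step(f, grad_fk, xk, dk, s=1.0, sigma=1e-4, beta=0.5,
                    max_backtracks=50):
        """Backtracking search for the Armijo stepsize of (2)-(3).

        f       : the objective, a callable taking an ndarray
        grad_fk : ndarray, gradient of f at xk (precomputed)
        xk, dk  : ndarray, current iterate and search direction
        s, sigma, beta : Armijo parameters, s > 0 and sigma, beta in (0, 1)
        """
        slope = float(np.dot(grad_fk, dk))  # <grad f(x^k), d^k>
        if slope >= 0.0:                    # not a descent direction, so by
            return 0.0                      # convention the stepsize is 0
        fk = f(xk)
        alpha = s
        for _ in range(max_backtracks):
            # Condition (3): actual improvement vs. sigma times the
            # improvement predicted by linear extrapolation from x^k
            if fk - f(xk + alpha * dk) >= -sigma * alpha * slope:
                return alpha
            alpha *= beta                   # shrink and try again
        return alpha                        # practical safeguard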
Definition. A sequence of iterates {x^k} ⊂ R^n and a sequence of search directions {d^k} ⊂ R^n
are said to be gradient-related with respect to f if, for all subsequences {x^k}_K converging to
a point x̄ with ∇f(x̄) ≠ 0, one has

    lim sup_{k→∞, k∈K} ⟨∇f(x^k), d^k⟩ < 0

and {d^k}_K is bounded. □

A slightly counterintuitive aspect of this definition is that it will be used in proofs by
contradiction that will seek to assert that, if certain kinds of stepsize rules are used, all
convergent subsequences of {x^k} must have limits x̄ with ∇f(x̄) = 0, and hence no
subsequences meeting the stated conditions exist. The payoff for these modest mental
contortions is a much more general theory than is presented in the Ruszczyński text.
Example. Suppose d^k = −∇f(x^k) for all k. Then {x^k} and {d^k} must be gradient-related.
To see this, suppose x^k →_K x̄ with ∇f(x̄) ≠ 0. Then

    ⟨∇f(x^k), d^k⟩ = ⟨∇f(x^k), −∇f(x^k)⟩
                   = −‖∇f(x^k)‖^2
                   →_K −‖∇f(x̄)‖^2
                   < 0,

which verifies the “lim sup” condition. Furthermore, we also have ‖d^k‖ = ‖∇f(x^k)‖ →_K
‖∇f(x̄)‖, so {d^k}_K is also bounded. □
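Reusing the armijo_step sketch above, a steepest-descent loop with Armijo stepsizes might
look as follows; the quadratic test objective at the end is purely illustrative:

    import numpy as np

    def steepest_descent(f, grad, x0, tol=1e-8, max_iters=1000):
        """Iteration (1) with d^k = -grad f(x^k) and Armijo stepsizes."""
        xk = np.asarray(x0, dtype=float)
        for _ in range(max_iters):
            gk = grad(xk)
            if np.linalg.norm(gk) <= tol:  # stationary within tolerance
                break
            dk = -gk                       # gradient-related per the example
            xk = xk + armijo_step(f, gk, xk, dk) * dk
        return xk

    # Minimize a strictly convex quadratic; the minimizer is the origin.
    A = np.diag([1.0, 10.0])
    f = lambda x: 0.5 * x @ A @ x
    grad = lambda x: A @ x
    print(steepest_descent(f, grad, np.array([3.0, -2.0])))  # ~ [0, 0]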
However, there are many other gradient-related choices of {d^k} besides d^k = −∇f(x^k):
Example. Let {D_k} be a bounded sequence of positive definite symmetric matrices whose
minimum eigenvalues are all at least c > 0. Then the choice d^k = −D_k ∇f(x^k) must be
gradient-related. To see this, again suppose we have x^k →_K x̄ with ∇f(x̄) ≠ 0. Then we
have

    ⟨∇f(x^k), d^k⟩ = −∇f(x^k)ᵀ D_k ∇f(x^k)
                   ≤ −c‖∇f(x^k)‖^2
                   →_K −c‖∇f(x̄)‖^2
                   < 0,

so the “lim sup” condition holds. Since {x^k}_K is convergent and f is continuously
differentiable, {∇f(x^k)}_K is convergent. Since {D_k} is bounded, it then follows that
d^k = −D_k ∇f(x^k) is bounded over k ∈ K. □
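As a rough illustration of how such a sequence {D_k} might arise in practice, the sketch
below clips a diagonal scaling (for instance, reciprocals of approximate Hessian diagonal
entries) to an interval [c, C]. The clipping bounds and the hess_diag argument are
illustrative assumptions, not from the notes:

    import numpy as np

    def scaled_direction(grad_fk, hess_diag, c=1e-3, C=1e3):
        """Form d^k = -D_k grad f(x^k) with D_k diagonal.

        Clipping each diagonal entry of D_k to [c, C] keeps {D_k} bounded,
        symmetric, and uniformly positive definite with minimum eigenvalue
        at least c, exactly as the example requires.
        """
        D = np.clip(1.0 / np.maximum(np.abs(hess_diag), 1e-12), c, C)
        return -D * grad_fk  # elementwise product = -D_k grad f(x^k)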
The following theorem is proved in the Bertsekas book, although the proof here is slightly
simpler.
Theorem. Let {x^k}, {d^k} ⊂ R^n be gradient-related with respect to the continuously
differentiable function f : R^n → R. Assume f(x^{k+1}) ≤ f(x^k) for all k ≥ 0 and that the
objective improvement f(x^k) − f(x^{k+1}) obtained at each step k is at least as great as would
be obtained by the Armijo rule with directions d^k and fixed choice of parameters s > 0 and
σ, β ∈ (0, 1). Then all accumulation points x̄ of {x^k} have ∇f(x̄) = 0.
Proof. Suppose we have x^k →_K x̄ with ∇f(x̄) ≠ 0. If we can obtain a contradiction, the
theorem is proved. By continuity, we have f(x^k) →_K f(x̄), but since the entire sequence
{f(x^k)} ⊂ R is nonincreasing, we must have that the whole sequence {f(x^k)} converges to
f(x̄), and consequently f(x^k) − f(x^{k+1}) → 0. Let α_k be the stepsize that the Armijo rule
would pick at step k. Then

    f(x^k) − f(x^{k+1}) ≥ f(x^k) − f(x^k + α_k d^k)   [improvement at least that of Armijo]  (4)
                       ≥ −σα_k ⟨∇f(x^k), d^k⟩         [by the Armijo criterion].             (5)

Since f(x^k) − f(x^{k+1}) → 0, it follows that lim sup_{k→∞} −σα_k ⟨∇f(x^k), d^k⟩ ≤ 0, that is,
lim inf_{k→∞} σα_k ⟨∇f(x^k), d^k⟩ ≥ 0. But by the gradient-relatedness of {x^k} and {d^k}, we
also have that lim sup_{k→∞, k∈K} ⟨∇f(x^k), d^k⟩ < 0. For both these limits to hold, it must be
that α_k →_K 0.
We must have ⟨∇f(x^k), d^k⟩ < 0 for all sufficiently large k ∈ K, for otherwise we cannot have
lim sup_{k→∞, k∈K} ⟨∇f(x^k), d^k⟩ < 0. Since α_k →_K 0, there must thus exist k̄ ≥ 0 such that
0 < α_k < s for all k ∈ K with k ≥ k̄, that is, d^k is a descent direction for which the Armijo
rule rejects at least one trial stepsize. That means that the stepsize α_k/β must fail the
Armijo test, that is,

    f(x^k) − f(x^k + (α_k/β) d^k) < −σ(α_k/β) ⟨∇f(x^k), d^k⟩   ∀k ∈ K : k ≥ k̄.   (6)
By the mean value theorem applied to the function φ_k(t) = f(x^k + td^k), whence φ′_k(t) =
⟨∇f(x^k + td^k), d^k⟩, we have for some γ_k ∈ [0, α_k/β] that

    f(x^k + (α_k/β) d^k) = f(x^k) + (α_k/β) ⟨∇f(x^k + γ_k d^k), d^k⟩,

or equivalently

    −(α_k/β) ⟨∇f(x^k + γ_k d^k), d^k⟩ = f(x^k) − f(x^k + (α_k/β) d^k),

which we may substitute into (6) to obtain

    −(α_k/β) ⟨∇f(x^k + γ_k d^k), d^k⟩ < −σ(α_k/β) ⟨∇f(x^k), d^k⟩   ∀k ∈ K : k ≥ k̄.

Now, since α_k > 0 for all k ≥ k̄, we may multiply through by −β/α_k < 0 to obtain

    ⟨∇f(x^k + γ_k d^k), d^k⟩ > σ⟨∇f(x^k), d^k⟩   ∀k ∈ K : k ≥ k̄.                  (7)

Now, the sequence {dk }K is bounded by the gradient-relatedness assumption, and so has
some limit point d.¯ Let K0 ⊆ K be such that dk →K0 d. ¯ Noting that αk →K 0 implies
γk →K 0, and using the continuity of ∇f , we may take the limit as k → ∞, k ∈ K0 ⊆ K
in (7) to obtain
¯ ≥ σh∇f (x̄), di,
h∇f (x̄), di ¯
¯ ≥ 0. Since σ < 1, we conclude that h∇f (x̄), di
and hence (1 − σ)h∇f (x̄), di ¯ ≥ 0. Since
0 k k
K ⊆ K, this result implies the that the sequence {h∇f (x ), d i} has a nonnegative limit
point, contradicting the gradient-related hypothesis that lim supk→∞,k∈K h∇f (xk ), dk i < 0.

Among the possible applications of this theorem are the following situations:
• Armijo line search along gradient-related directions, in which case we have equality
in (4).

• Exact line minimization along gradient-related directions. This approach obtains the
  maximum possible objective improvement along each direction d^k, which must be at
  least as great as that of any Armijo step in that direction.

• Bounded line minimization along gradient-related directions, that is, taking the best
  possible objective improvement among all steps α_k ∈ [0, s]. This improvement is at
  least that of any Armijo rule with initial steplength s.

• Any of the above, followed in each iteration by an arbitrary “spacer step” that does not
  increase the objective. This auxiliary step could consist of a more sophisticated search
  along the direction d^k using interpolation techniques (as described in the Ruszczyński
  text), but could also involve motion in a completely different direction — note that the
  theorem does not actually require that x^{k+1} be derived from x^k by line search along
  d^k, only that the objective improvement f(x^k) − f(x^{k+1}) be at least that obtained by
  a gradient-related algorithm.
Instead of the Armijo rule, the Ruszczyński text focuses on the historically older Goldstein
rule, also known as the two-slope rule. A stepsize α is acceptable to this rule if

    −σα⟨∇f(x^k), d^k⟩ ≤ f(x^k) − f(x^k + αd^k) ≤ −σ′α⟨∇f(x^k), d^k⟩,          (8)

where 0 < σ < σ′ < 1. The first inequality in this condition is identical to the Armijo rule,
and says that the objective improvement should be within a factor σ of that predicted by
linear extrapolation from x^k. The second inequality, however, says that the average rate at
which the objective improves along the step cannot be too close to the linear extrapolation.
It serves the same purpose as the “smallest m” requirement in Armijo: to make sure the
steps do not get too small.
If f is bounded below — an assumption necessary for the problem min_{x∈R^n} f(x) to be
well defined — there must be a range of steps α meeting (8): if f is bounded below, the
continuous function ψ(α) = f(x^k) − f(x^k + αd^k) is bounded above, while the linear functions
g′(α) = −σ′α⟨∇f(x^k), d^k⟩ and g(α) = −σα⟨∇f(x^k), d^k⟩ are both increasing and unbounded,
with g(α) < g′(α) for all α > 0. For all sufficiently large α, we must therefore have ψ(α) <
g(α) < g′(α), whereas for all sufficiently small α, the linear approximation to f at x^k becomes
increasingly accurate, and we must have g(α) < g′(α) < ψ(α). By the continuity of all three
functions, there must be some range of α for which we have g(α) < ψ(α) < g′(α).
The Goldstein rule’s potential rejection of steps whose average objective improvement is “too
high” is somewhat counterintuitive. Unlike the Armijo rule, stating the Goldstein rule does
not immediately yield an algorithm for computing an acceptable step. One simple approach
is the following bisection search (a Python sketch appears after the list):
1. Start with some initial steplength guess α = s > 0.

2. So long as f(x^k) − f(x^k + αd^k) > −σ′α⟨∇f(x^k), d^k⟩, repeatedly increase α, for
   example by α ← 2α.

3. If (8) is satisfied, accept the step α and exit.

4. Otherwise, set s̲ = 0 and s̄ = α. At this point, s̲ represents a step that is too small
   to satisfy (8), and s̄ represents a step that is too large. Some acceptable step must lie
   between them.

5. Set α = (s̲ + s̄)/2.

6. If f(x^k) − f(x^k + αd^k) > −σ′α⟨∇f(x^k), d^k⟩, then α is too small. Set s̲ ← α and
   return to step 5.

7. If f(x^k) − f(x^k + αd^k) < −σα⟨∇f(x^k), d^k⟩, then α is too large. Set s̄ ← α and
   return to step 5.

8. Accept the step α and exit.
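A minimal Python sketch of this procedure, under the same conventions as the Armijo
sketch earlier. The iteration caps are practical safeguards added here, and termination of
the doubling phase relies on f being bounded below, as discussed above:

    import numpy as np

    def goldstein_step(f, grad_fk, xk, dk, s=1.0, sigma=1e-4, sigma_p=0.9,
                       max_iters=100):
        """Expansion-then-bisection search for a stepsize satisfying (8).

        Requires 0 < sigma < sigma_p < 1 and a descent direction dk.
        """
        slope = float(np.dot(grad_fk, dk))   # <grad f(x^k), d^k> < 0
        fk = f(xk)
        improvement = lambda a: fk - f(xk + a * dk)

        alpha = s
        for _ in range(max_iters):           # step 2: grow alpha while the
            if improvement(alpha) <= -sigma_p * alpha * slope:
                break                        # second inequality of (8) holds
            alpha *= 2.0                     # terminates if f is bounded below
        if improvement(alpha) >= -sigma * alpha * slope:
            return alpha                     # step 3: (8) already satisfied

        s_lo, s_hi = 0.0, alpha              # step 4: bracket an acceptable step
        for _ in range(max_iters):
            alpha = 0.5 * (s_lo + s_hi)      # step 5: bisect
            if improvement(alpha) > -sigma_p * alpha * slope:
                s_lo = alpha                 # step 6: alpha too small
            elif improvement(alpha) < -sigma * alpha * slope:
                s_hi = alpha                 # step 7: alpha too large
            else:
                return alpha                 # step 8: both inequalities hold
        return alpha                         # safeguard fallback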


Theoretically, the Goldstein rule has convergence properties similar to those of the Armijo rule:
Theorem. Suppose {x^k}, {d^k} ⊂ R^n are gradient-related with respect to the continuously
differentiable function f : R^n → R. Assume f(x^{k+1}) ≤ f(x^k) for all k ≥ 0 and
that the objective improvement f(x^k) − f(x^{k+1}) obtained at each step k is at least as great
as would be obtained by some set of stepsizes meeting the Goldstein condition (8) for some
fixed parameters 0 < σ < σ′ < 1. Then all accumulation points x̄ of {x^k} have ∇f(x̄) = 0.
Sketch of proof. The proof is identical to the previous theorem’s until the establishment
of (6). Instead of (6), one simply uses

    f(x^k) − f(x^k + α_k d^k) ≤ −σ′α_k ⟨∇f(x^k), d^k⟩,

which follows directly from the step α_k being acceptable to (8). Since this inequality has
almost the same form as (6), one can derive, in nearly the same manner as the previous
proof, that both ⟨∇f(x̄), d̄⟩ < 0 and ⟨∇f(x̄), d̄⟩ ≥ 0, a contradiction. □
