
European Journal of Operational Research 181 (2007) 1086–1096
www.elsevier.com/locate/ejor

Newton’s method and its use in optimization


B.T. Polyak *

Institute for Control Science, Profsojuznaya 65, Moscow 117997, Russia

Received 30 September 2004; accepted 20 June 2005


Available online 25 April 2006

Abstract

Newton's method is a basic tool in numerical analysis and numerous applications, including operations research and data mining. We survey the history of the method, its main ideas, convergence results, modifications, and its global behavior. We focus on applications of the method to various classes of optimization problems, such as unconstrained minimization, equality constrained problems, convex programming and interior-point methods. Some extensions (non-smooth problems, continuous analogs, Smale's results, etc.) are discussed briefly, while some others (e.g., versions of the method to achieve global convergence) are addressed in more detail.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Nonlinear programming; Newton’s method; Convergence; Global behavior; Interior-point methods

1. Introduction

Newton's method is one of the fundamental tools in numerical analysis, operations research, optimization and control. It has numerous applications in management science, industrial and financial research, and data mining. Its role in optimization cannot be overestimated: the method is the basis for the most effective procedures in linear and nonlinear programming. For instance, the polynomial-time interior-point algorithms in convex optimization are based on Newton's method.

The present paper surveys the key ideas of the method in the historical perspective as well as modern investigations and applications. Our goal is to address the most challenging problems in this field of research and to provide suitable references for the reader.

The paper is organized as follows. Section 2 presents the basic idea of Newton's method and the history of its development. Main mathematical results on the convergence are addressed in Section 3. Newton's method in its basic form possesses just local convergence; its global behavior and modifications to achieve global convergence are discussed in Sections 4 and 5. The case of underdetermined systems is worth special consideration (Section 6). In its original form, Newton's method is designed for solving equations. However, it can be easily tailored for unconstrained (Section 7) and constrained (Section 8) optimization. Some extensions of the method and directions for future research are described in Section 9.

* Tel.: +7 495 334 8829; fax: +7 495 420 2016. E-mail address: boris@ipu.rssi.ru


2. Idea and history of the method

The basic idea of Newton's method is very simple: it is linearization. Suppose F : R^1 → R^1 is a differentiable function, and we are solving the equation

F(x) = 0.     (1)

Starting from an initial point x_0 we can construct the linear approximation of F(x) in the neighborhood of x_0, F(x_0 + h) ≈ F(x_0) + F'(x_0)h, and solve the arising linear equation F(x_0) + F'(x_0)h = 0. Thus, we arrive at the recurrent method

x_{k+1} = x_k - F'(x_k)^{-1} F(x_k), \quad k = 0, 1, \ldots     (2)

This is the method proposed by Newton in 1669. To be more precise, Newton dealt with polynomials only; in the expression of F(x + h) he discarded higher-order terms in h. The method was illustrated on the example F(x) = x^3 - 2x - 5 = 0. The starting approximation for the root was x = 2. Then F(2 + h) = h^3 + 6h^2 + 10h - 1; neglecting higher-order terms, Newton obtained the linear equation 10h - 1 = 0. Thus the next approximation is x = 2 + 0.1 = 2.1, and the process can be repeated from this point. Fig. 1 demonstrates that the convergence to the root of F(x) is very fast. It was J. Raphson who proposed in 1690 the general form of method (2) (i.e., F(x) was not assumed to be a polynomial only and the notion of a derivative was exploited); this is why the method is often called the Newton–Raphson method.

The progress in the development of the method is linked with such famous mathematicians as Fourier, Cauchy and others. For instance, Fourier proved in 1818 that the method converges quadratically in the neighborhood of a root, while Cauchy (1829, 1847) provided the multidimensional extension of (2) and used the method to prove the existence of a root of an equation. Important early contributions to the investigation of the method are due to Fine [17] and Bennet [3]; their papers were published in one volume of the Proceedings of the National Academy of Sciences USA in 1916. Fine proved the convergence in the n-dimensional case with no assumption on the existence of a solution. Bennet extended the result to the infinite-dimensional case; this is a surprising attempt because the foundations of functional analysis had not yet been created at that moment. In 1948, L.V. Kantorovich published a seminal paper [25] where an extension of Newton's method to functional spaces was provided (the Newton–Kantorovich method). The results were also included in the survey paper [26]; for further developments of the method by Kantorovich, see [27–31].

The basic results on Newton's method and numerous references can be found in the books by Ostrowski [47], Ortega and Rheinboldt [48], and Kantorovich and Akilov [32,33]. A more recent bibliography is available in the books [50,60,13], the survey paper [67] and on the special website devoted to Newton's method [41].
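For illustration, here is a minimal Python sketch of iteration (2) applied to Newton's own example above (the code and function names are ours, not part of the original paper); its first step reproduces the value 2.1 obtained by Newton.

```python
# A minimal sketch of iteration (2) on F(x) = x^3 - 2x - 5, starting from x0 = 2.
def newton(f, fprime, x0, iters=5):
    x = x0
    for _ in range(iters):
        x = x - f(x) / fprime(x)   # x_{k+1} = x_k - F'(x_k)^{-1} F(x_k)
    return x

f = lambda x: x**3 - 2*x - 5
fp = lambda x: 3*x**2 - 2
print(newton(f, fp, 2.0))   # first step gives 2.1; rapidly approaches ~2.0945515
```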

Fig. 1. Newton's method.



3. Convergence results

Kantorovich [25] analyzes the same equation as (1):

F(x) = 0,     (3)

but now F : X → Y, where X, Y are Banach spaces. The proposed method reads as (2):

x_{k+1} = x_k - F'(x_k)^{-1} F(x_k), \quad k = 0, 1, \ldots,     (4)

where F'(x_k) is the (Fréchet) derivative of the nonlinear operator F(x) at the point x_k and F'(x_k)^{-1} is its inverse. The main convergence result from [25] looks as follows.

Theorem 1. Suppose F is defined and twice continuously differentiable on the ball B = {x : ||x - x_0|| ≤ r}, the linear operator F'(x_0) is invertible, ||F'(x_0)^{-1} F(x_0)|| ≤ η, ||F'(x_0)^{-1} F''(x)|| ≤ K for x ∈ B, and

h = K\eta < \frac{1}{2}, \qquad r \ge \frac{1 - \sqrt{1 - 2h}}{h}\,\eta.     (5)

Then Eq. (3) has a solution x* ∈ B, the process (4) is well defined and converges to x* with a quadratic rate:

\|x_k - x^*\| \le \frac{\eta}{2^k h}\,(2h)^{2^k}.     (6)

The proof of the theorem is simple enough; the main novelty of Kantorovich's contribution was not the technicalities, but the general problem formulation and the use of the appropriate techniques of functional analysis. Until Kantorovich's works [25–27] it was not understood that numerical analysis should be considered in the framework of functional analysis (note the title of [26]: "Functional analysis and applied mathematics"). Another peculiarity of the above theorem is the lack of an assumption on the existence of a solution: the theorem is not only a convergence result for a specific method, but simultaneously an existence theorem for nonlinear equations.

These properties of the approach ensured a wide range of applications. Numerous nonlinear problems – nonlinear integral equations, ordinary and partial differential equations, variational problems – could be cast in the framework of (3), and method (4) could be applied. Various examples of such applications are presented in the monographs [32,33]. Moreover, the Newton–Kantorovich method as a tool for existence results was immediately used in the classical works of Kolmogorov, Arnold and Moser (see, e.g., [1]) on KAM theory in mechanics. Actually, many classical results in functional analysis are proved by the use of Newton-like methods; the typical example is Ljusternik's theorem on tangent spaces, see the original paper [38] or modern research [22].

Later Kantorovich [28,29] obtained another proof of Theorem 1 and its versions, based on the method of majorants. The idea is to compare iterations (4) with scalar iterations which in a sense majorize them and which possess convergence properties. This approach provides more flexibility; the original proof is a particular case with a quadratic majorant.

There exist numerous versions of Theorem 1, which differ in assumptions and results. We mention just one of them, due to Mysovskikh [43].

Theorem 2. Suppose F is defined and twice continuously differentiable on the ball B = {x : ||x - x_0|| ≤ r}, the linear operator F'(x) is invertible on B, ||F'(x)^{-1}|| ≤ β and ||F''(x)|| ≤ K for x ∈ B, ||F(x_0)|| ≤ η and

h = K\beta^2\eta < 2, \qquad r \ge \beta\eta \sum_{n=0}^{\infty} \left(\frac{h}{2}\right)^{2^n - 1}.     (7)

Then Eq. (3) has a solution x* ∈ B and the process (4) converges to x* with a quadratic rate:

\|x_k - x^*\| \le \frac{\beta\eta\,(h/2)^{2^k - 1}}{1 - (h/2)^{2^k}}.     (8)

The difference from Theorem 1 is the assumption on the invertibility of F'(x) on all of B (while in Theorem 1 it was assumed to be invertible at the initial point x_0 only) and the milder assumption on h (h < 2 instead of h < 1/2). Other versions can be found in the books [32,33,47,48,35,10].

Newton's method (4) requires calculating the derivative F'(x) and solving a linear equation at each iteration; this can be a rather heavy computational burden. The modified Newton method

x_{k+1} = x_k - F'(x_0)^{-1} F(x_k), \quad k = 0, 1, \ldots     (9)

avoids this difficulty: the matrix F'(x_0) is calculated and inverted at the initial point only. The method converges under the same conditions as (4); however, the price we pay for the simplification is high – method (9) converges linearly, not quadratically. There are some tradeoffs – for instance, one can update the derivative after several iterations.

4. Global behavior

The critical condition for convergence is (5) or (7). These conditions mean that at the initial approximation x_0 the value ||F(x_0)|| should be small enough, that is, x_0 should be close to the solution. Thus, Newton's method is locally convergent. Very simple one-dimensional examples demonstrate the lack of global convergence even for smooth monotone F(x). There are many ways to modify the method to achieve global convergence (we will discuss them later), but the problem of interest here is the global behavior of the iterations. Obviously, there are many simple situations: say, there is a neighborhood S of a solution such that x_0 ∈ S implies convergence to the solution (such a set is called a basin of attraction), while trajectories starting outside S do not converge (e.g., tend to infinity). However, in the case of non-uniqueness of the solution, the structure of the basins of attraction may be very complicated and exhibit a fractal nature. It was Cayley who formulated this problem as early as 1879.

Let us consider Cayley's example: solve the equation z^3 = 1 by Newton's method. Thus we take F(z) = z^3 - 1 and apply method (2):

z_{k+1} = z_k - \frac{z_k^3 - 1}{3 z_k^2} = \frac{2 z_k}{3} + \frac{1}{3 z_k^2}.

It is worth mentioning that we have formulated Newton's method in real spaces, but it is equally applicable in complex spaces; in the above example we take z_k ∈ C. The equation has three roots,

z_1 = 1, \qquad z_{2,3} = -\frac{1}{2} \pm i\,\frac{\sqrt{3}}{2},

and it is natural to expect that the entire plane C is partitioned into three basins of attraction

S_m = \{z_0 : z_k \to z_m\}, \quad m = 1, 2, 3,

located around the corresponding roots. However, the true picture is much more involved. First, there is a single point z = 0 where the method is not defined. It has three preimages, i.e., points z_0 such that z_1 = 0; they are -q and q(1/2 \pm i\sqrt{3}/2), where q = 1/\sqrt[3]{2}. Again, each of them has three preimages, and so on. Thus there are 3^k points z_0 which are mapped after k iterations into the point z_k = 0, and they generate a countable set of initial points for which the method is not applicable at some iteration:

S_0 = \{z_0 : z_k = 0 \text{ for some } k\}.

It can be proved that for all points z_0 ∉ S_0 the method converges to one of the solutions (note that if |z_k| > 1 then |z_{k+1}| < |z_k|), and we have

\mathbb{C} = \bigcup_{m=0}^{3} S_m.

Fig. 2. Basin of attraction for x* = 1.



Fig. 3. Points with no convergence.
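A short Python sketch (ours) of the computation behind these pictures: it labels each starting point z_0 by the root to which the Newton iteration converges; plotting the labels over a grid (e.g., with matplotlib's imshow) reproduces the fractal basin boundaries of Figs. 2 and 3.

```python
import numpy as np

# Classify starting points of Cayley's example z^3 = 1 by the limit root.
roots = np.array([1.0, -0.5 + 0.5j*np.sqrt(3), -0.5 - 0.5j*np.sqrt(3)])

def basin_label(z0, iters=50):
    z = complex(z0)
    for _ in range(iters):
        if z == 0:                      # the exceptional set S_0: method undefined
            return 0
        z = 2*z/3 + 1/(3*z**2)          # Newton map for F(z) = z^3 - 1
    return 1 + int(np.argmin(np.abs(roots - z)))   # labels 1, 2, 3

xs = np.linspace(-2, 2, 200)
labels = np.array([[basin_label(x + 1j*y) for x in xs] for y in xs])
```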

The sets S_m have a fractal structure: S_0 is the boundary of each S_m, m = 1, 2, 3, and in any neighborhood of any point z ∈ S_0 there exist points from S_m, m = 1, 2, 3. The set S_1 is displayed in Fig. 2, and the set S_0 in Fig. 3. The sets of initial points with no convergence (like the set S_0) for iterations of general rational maps were studied by Julia [24] and are now called Julia sets. A lot of examples linked with Newton's method can be found in numerous books on fractals [2,39], papers [12,49] and web materials [23,14] (we mention just a few of the sources available). Some of these examples exhibit much more complicated behavior than Cayley's one. For instance, one can meet periodic or chaotic trajectories of the iterations.

5. Overcoming the local nature of the method

In Cayley's example Newton's method converged for almost all initial points (the exceptions were the countable set of points S_0); the complex structure of the basins of attraction was caused by the existence of several roots. However, if (1) has a single root, the method usually has local convergence only. For instance, take F(x) = arctan x, which is smooth, monotone and has the single root x* = 0. It is easy to check that (2) converges if and only if |x_0| < \bar{x} (where \bar{x} > 0 is a root of 2x = (1 + x^2) arctan x); for |x_0| = \bar{x} we have periodic behavior of the iterations, x_0 = -x_1 = x_2 = -x_3 = \cdots, while for |x_0| > \bar{x} the iterations diverge: |x_k| → ∞.

There are several ways to modify the basic Newton method to achieve global convergence. The first one is to introduce a regulated step-size to avoid too large steps; this is the so-called damped Newton method:

x_{k+1} = x_k - \alpha_k F'(x_k)^{-1} F(x_k), \quad k = 0, 1, \ldots,     (10)

where the step-size 0 < α_k ≤ 1 is chosen to achieve a monotone decrease of ||F(x)||, that is, the condition ||F(x_{k+1})|| < ||F(x_k)|| holds. We will discuss particular algorithms for choosing α_k later for minimization problems. The main goal in constructing such algorithms is to preserve the balance between convergence and rate of convergence, i.e., one should take α_k < 1 when x_k is outside a basin of attraction of the "pure" Newton method and switch to α_k = 1 inside this basin.

The second approach is the Levenberg–Marquardt method [36,40]:

x_{k+1} = x_k - (\alpha_k I + F'(x_k))^{-1} F(x_k), \quad k = 0, 1, \ldots     (11)

For α_k = 0 the method converts into the pure Newton method, while for α_k ≫ 1 it is close to a strongly damped gradient method, which usually converges globally. There are various strategies for adjusting the parameters α_k; they are described, for instance, in [48]. Method (11) works even when the pure Newton method does not, namely in a situation with a degenerate operator F'(x_k).
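The following Python sketch illustrates the damped method (10) on the arctan example above; the simple step-halving rule for choosing α_k is our own illustrative choice, not a rule prescribed in the text.

```python
import numpy as np

def damped_newton(F, Fp, x0, iters=30):
    x = x0
    for _ in range(iters):
        if F(x) == 0.0:                  # already at the root
            break
        step = F(x) / Fp(x)
        alpha = 1.0
        while alpha > 1e-12 and abs(F(x - alpha * step)) >= abs(F(x)):
            alpha /= 2                   # backtrack until ||F|| decreases
        x = x - alpha * step
    return x

F, Fp = np.arctan, lambda x: 1.0 / (1.0 + x**2)
print(damped_newton(F, Fp, 2.0))   # converges to x* = 0; pure Newton diverges from 2.0
```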

As we shall see, for minimization problems method (11) is very promising, because it does not assume the matrix F'(x_k) to be positive definite.

One more strategy is to modify Newton's method so as to prevent large steps; this can be done not by choosing step-sizes as in the damped Newton method (10), but by introducing trust regions, within which the linear approximation of F(x) is valid. Such an approach (originated in [19] and widely developed in [11]) will be discussed later in connection with optimization problems.

6. Underdetermined systems

In the above analysis of Newton's method it was assumed that the linear operator F'(x_k) is invertible. However, there are situations when this is definitely wrong. For instance, suppose that F : R^n → R^m, where m < n. That is, we solve an underdetermined system of m equations with n > m variables; of course, the rectangular matrix F'(x_k) has no inverse. Nevertheless, an extension of Newton's method to this case can be provided; this is due to Graves [21], who exploited the method to prove existence of solutions of nonlinear mappings. We present the result from [51], where the emphasis is on the method itself, with more accurate estimates.

Theorem 3. Suppose F : X → Y is defined and differentiable on the ball B = {x : ||x - x_0|| ≤ r}, its derivative satisfies a Lipschitz condition on B,

\|F'(x) - F'(z)\| \le L \|x - z\|, \quad x, z \in B,

F'(x) maps X onto Y and the following estimate holds:

\|F'(x)^{*} y\| \ge \mu \|y\|, \quad \forall x \in B, \; y \in Y^{*},     (12)

with μ > 0 (the star denotes conjugation). Introduce the function

H_n(t) = \sum_{k=n}^{\infty} t^{2^k}

and suppose that

h = \frac{L \|F(x_0)\|}{2\mu^2} < 1, \qquad q = \frac{2\mu H_0(h)}{L} \le r.     (13)

Then the method

x_{k+1} = x_k - y_k, \quad F'(x_k) y_k = F(x_k), \quad \|y_k\| \le \frac{1}{\mu}\|F(x_k)\|, \quad k = 0, 1, \ldots     (14)

is well defined and converges to a solution x* of the equation F(x) = 0 with ||x* - x_0|| ≤ q and with the rate

\|x_k - x^*\| \le \frac{2\mu H_k(h)}{L}.     (15)

Thus, at each iteration of the method one should solve the linear equation F'(x_k) y = F(x_k), where the linear operator F'(x_k) in general has no inverse; however, it maps X onto Y, and this equation has a solution (maybe not a unique one). Among the solutions there exists a solution y_k with the property ||y_k|| ≤ (1/μ)||F(x_k)||; this one is exploited in the method. In the finite-dimensional case (X = R^n, Y = R^m, n > m) such a solution is provided by the formula y_k = F'(x_k)^+ F(x_k), where A^+ denotes the pseudoinverse of the matrix A.
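In the finite-dimensional setting, iteration (14) is easy to sketch in NumPy (the example system below is ours): the pseudoinverse gives exactly the least-norm solution y_k of the linearized equation.

```python
import numpy as np

# Sketch of method (14) for an underdetermined system F: R^3 -> R^2,
# F(x) = (x1^2 + x2^2 + x3^2 - 1, x1 - x2): two equations, three unknowns.
def underdetermined_newton(F, J, x0, iters=20):
    x = x0.astype(float)
    for _ in range(iters):
        y = np.linalg.pinv(J(x)) @ F(x)   # least-norm solution of J(x) y = F(x)
        x = x - y
    return x

F = lambda v: np.array([v[0]**2 + v[1]**2 + v[2]**2 - 1.0, v[0] - v[1]])
J = lambda v: np.array([[2*v[0], 2*v[1], 2*v[2]], [1.0, -1.0, 0.0]])

x0 = np.array([0.9, 0.2, 0.7])
xs = underdetermined_newton(F, J, x0)
print(xs, F(xs))   # a point on the solution manifold, residual ~ 0
```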

We consider an application of Theorem 3 to convex analysis. The following result on the convexity of the nonlinear image of a small ball in Hilbert spaces holds [52].

Theorem 4. Suppose X, Y are Hilbert spaces, F : X → Y is defined and differentiable on the ball B = {x : ||x - a|| ≤ r}, its derivative satisfies a Lipschitz condition on B,

\|F'(x) - F'(z)\| \le L \|x - z\|, \quad x, z \in B,

F'(a) maps X onto Y and the following estimate holds:

\|F'(a)^{*} y\| \ge \mu \|y\|, \quad \forall y \in Y,     (16)

with μ > 0 and r < μ/(2L). Then the image of the ball B under the map F is convex, i.e., the set S = {F(x) : x ∈ B} is a convex set in Y.

This theorem has numerous applications in optimization [52], linear algebra [53] and optimal control [54]. For instance, the pseudospectrum of an n × n matrix (the set of all eigenvalues of perturbed matrices, the perturbations being bounded in the Frobenius norm) happens to be the union of n convex sets in the complex plane, provided that the nominal matrix has all eigenvalues distinct and the perturbations are small enough. Another result is the convexity of the reachable set of a nonlinear system for L2-bounded controls [54].

7. Unconstrained optimization

Consider the simplest unconstrained minimization problem in a Hilbert space H:

\min f(x), \quad x \in H.     (17)

Assuming that f is twice differentiable, we can obtain Newton's method for minimization by two different approaches.

First, the necessary (and, for convex f, sufficient) condition for minimization is Fermat's condition

\nabla f(x) = 0,

that is, we should solve Eq. (3) with F(x) = ∇f(x). Applying Newton's method to this equation we arrive at Newton's method for minimization:

x_{k+1} = x_k - (\nabla^2 f(x_k))^{-1} \nabla f(x_k), \quad k = 0, 1, \ldots,     (18)

where ∇²f denotes the second Fréchet derivative (the Hessian matrix in the finite-dimensional case).

Second, we can approximate f(x) in the neighborhood of a point x_k by three terms of its Taylor series:

f(x_k + h) \approx f_k(h) = f(x_k) + (\nabla f(x_k), h) + \tfrac{1}{2}(\nabla^2 f(x_k) h, h).

Then, minimizing the quadratic function f_k(h), we get the same method (18). Both interpretations will be exploited for constructing Newton methods for more general optimization problems.

The theorems on the convergence of Newton's method for equations can be immediately adapted to the unconstrained minimization case (just replace F(x) by ∇f(x) and F'(x) by ∇²f(x)). The main feature of the method, fast local convergence, remains unchanged. However, there are some specific properties. The most important one: Newton's method in its pure form does not distinguish minima, maxima and saddle points. When started from a neighborhood of a nonsingular critical point (i.e., a point x* with ∇f(x*) = 0 and ∇²f(x*) invertible), the method converges to it, making no difference between minimum, maximum or saddle points.
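A tiny Python illustration of this remark (the test function is ours): for f(x) = x^3/3 - x, iteration (18) converges to whichever nonsingular critical point is nearby, be it the minimum x = 1 or the maximum x = -1.

```python
def newton_min(df, d2f, x0, iters=20):
    x = x0
    for _ in range(iters):
        x = x - df(x) / d2f(x)     # x_{k+1} = x_k - (f''(x_k))^{-1} f'(x_k)
    return x

df = lambda x: x**2 - 1            # f'(x) for f(x) = x**3/3 - x
d2f = lambda x: 2*x                # f''(x)
print(newton_min(df, d2f, 0.5))    # -> 1.0, the local minimum
print(newton_min(df, d2f, -0.5))   # -> -1.0, the local maximum
```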
There are numerous ways to convert the method into a globally convergent one; they are similar to the modifications discussed in Section 5 (damped Newton, Levenberg–Marquardt, trust region). We consider an important version due to Nesterov and Nemirovski [44]; in this version the complexity (the number of iterations to achieve the desired accuracy) can be estimated. Nesterov and Nemirovski introduce the class of self-concordant functions. These are three times differentiable convex functions, defined on a convex set D ⊂ R^n, satisfying the property

|\nabla^3 f(x)(h, h, h)| \le 2\,(\nabla^2 f(x) h, h)^{3/2}, \quad \forall x \in D, \; h \in \mathbb{R}^n.

The above formula involves the third and the second derivatives of f and their action on a vector h ∈ R^n. In a simpler form it can be presented as a relation between the third and the second derivatives of the scalar function u(t) = f(x + th):

|u'''(0)| \le 2\,(u''(0))^{3/2}

for all x ∈ D, h ∈ R^n. For instance, the function f(x) = -ln x, x > 0, x ∈ R^1, is self-concordant, while the function f(x) = 1/x, x > 0, is not. For x_k ∈ D define Newton's decrement

d_k = \sqrt{\nabla f(x_k)^T (\nabla^2 f(x_k))^{-1} \nabla f(x_k)}.

Now the Nesterov–Nemirovski version of the damped Newton method reads

x_{k+1} = x_k - \alpha_k (\nabla^2 f(x_k))^{-1} \nabla f(x_k), \quad k = 0, 1, \ldots,     (19)

\alpha_k = 1 \quad \text{if } d_k \le \tfrac{1}{4},     (20)

\alpha_k = \frac{1}{1 + d_k} \quad \text{if } d_k > \tfrac{1}{4}.     (21)

Theorem 5. If f(x) is self-concordant and f(x) ≥ f*, ∀x ∈ D, then for the method (19)–(21) with x_0 ∈ D and ε > 0 one has f(x_k) - f* ≤ ε for

k = c_1 + c_2 \log\log\frac{1}{\varepsilon} + c_3\,(f(x_0) - f^*).     (22)

Here c_1, c_2, c_3 are some absolute constants.

The idea of the proof is simple enough. The minimization procedure consists of two stages. At Stage 1, method (19), (21) is applied, and the function decreases monotonically: f(x_{k+1}) ≤ f(x_k) - c, where c is a positive constant. Obviously, this stage terminates after a finite number of iterations (because f(x) is bounded from below). At Stage 2, the pure Newton method (19), (20) works, and it converges at a quadratic rate: 2d_{k+1} ≤ (2d_k)^2. Note that the Newton decrement provides a convenient tool to express the rate of convergence: there are no constants like L, K, η as in Theorems 1–3.

For the equality (22), simple values of the constants can be determined. Note that log log(1/ε) is not large even for very small ε; for all reasonable ε the following estimate holds:

k = 5 + 11\,(f(x_0) - f^*).

However, numerous simulation results for various types of optimization problems of different dimensions [8] validate the following empirical formula for the number of steps of method (19)–(21):

k = 5 + 0.6\,(f(x_0) - f^*).
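For illustration, here is a minimal Python sketch of the rules (19)–(21); the self-concordant test function f(x) = c^T x - Σ_i ln x_i (whose minimizer is x_i = 1/c_i) and all names below are our own choice.

```python
import numpy as np

def damped_newton_sc(grad, hess, x0, iters=30):
    x = x0.astype(float)
    for _ in range(iters):
        g, H = grad(x), hess(x)
        step = np.linalg.solve(H, g)
        d = np.sqrt(g @ step)                            # Newton decrement d_k
        alpha = 1.0 if d <= 0.25 else 1.0 / (1.0 + d)    # rules (20)-(21)
        x = x - alpha * step                             # iteration (19)
    return x

c = np.array([1.0, 2.0])
grad = lambda x: c - 1.0 / x          # gradient of f(x) = c^T x - sum ln(x_i)
hess = lambda x: np.diag(1.0 / x**2)  # Hessian of f
print(damped_newton_sc(grad, hess, np.array([3.0, 3.0])))  # -> [1.0, 0.5]
```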

A serious restriction of the above analysis is the convexity assumption. Recently [45] another version of Newton's method has been proposed, for which global convergence and complexity results were obtained for general smooth functions. Suppose that f ∈ C^{2,1}, that is, f is twice differentiable on R^n and its second derivative satisfies a Lipschitz condition with constant L on the set {x : f(x) ≤ f(x_0)}. The following version of Newton's method is considered in [45]:

x_{k+1} = \arg\min_x f_k(x), \quad h = x - x_k,

f_k(x) = f(x_k) + (\nabla f(x_k), h) + \frac{1}{2}(\nabla^2 f(x_k) h, h) + \frac{L}{6}\|h\|^3.     (23)

Thus, at each iteration we solve an unconstrained minimization problem with the same quadratic term as in the pure Newton method, but regularized via the cubic term. This problem looks hard and non-convex; however, it can be reduced to one-dimensional convex optimization (see details in [45]). Surprisingly, the proposed method has many advantages compared with pure Newton. Firstly, it converges globally, for an arbitrary initial point. Secondly, it does not converge to maximum points or saddle points, in contrast to Newton's method. Thirdly, its complexity for various classes of functions can be estimated. There are implementable versions of method (23) in which the value L is not known a priori.

8. Constrained optimization

We start with some particular cases of constrained optimization problems.

The first one is optimization subject to simple constraints:

\min f(x), \quad x \in Q,     (24)

where the set Q in a Hilbert space H is "simple" in the sense that (24) with a quadratic function f can be solved explicitly. For instance, Q may be a ball, a linear subspace, etc. The extension of Newton's method to this case is based on the second interpretation of the method for unconstrained minimization:

x_{k+1} = \arg\min_{x \in Q} f_k(x), \quad f_k(x) = (\nabla f(x_k), x - x_k) + \tfrac{1}{2}(\nabla^2 f(x_k)(x - x_k), x - x_k).     (25)

The method converges under the same assumptions as in the unconstrained case: if f is convex and twice differentiable with a Lipschitz second derivative on Q, Q is closed and convex, f attains its minimum on Q at a point x*, and ∇²f(x*) > 0, then the sequence (25) converges locally to x* at a quadratic rate. This result was obtained in [37]; see more details and examples in [55,6].

Another simple situation is equality constrained optimization:

\min f(x), \quad g(x) = 0,     (26)

where f : X → R^1, g : X → Y, and X, Y are Hilbert spaces. If a solution x* of the problem exists and is a regular point (g'(x*) maps X onto Y), then there exists a Lagrange multiplier y* such that the pair x*, y* is a stationary point of the Lagrangian

L(x, y) = f(x) + (y, g(x)),

that is, a solution of the nonlinear equation

L'_x(x, y) = 0, \quad L'_y(x, y) = 0.     (27)

Hence, we can apply Newton's method for solving this equation. Under natural assumptions it converges locally to x*, y*; see rigorous results in [56,6,5]. Various implementations of the method can also be found in these references.

There are other versions of Newton's method for solving (26) which do not involve the dual variable y. Historically, the first application of a Newton-like method – to finding the largest eigenvalue of a matrix A = A^T by reducing it to the constrained optimization problem

\max (Ax, x), \quad \|x\|^2 = 1

– is due to Rayleigh (1899). Later, Kantorovich [27] proposed a pure Newton method for the above problem. Another simple situation, where the calculations can be simplified, is the case of linear constraints g(x) = Cx - d. Then the method is equivalent to Newton's method for unconstrained minimization of the restriction of f to the affine subspace Cx = d.
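As a concrete local sketch (our own toy problem, not from the paper), Newton's method for the stationarity system (27) amounts to solving a linear KKT system at every step.

```python
import numpy as np

# Newton's method on (27) for: min x1 + x2  s.t.  g(x) = x1^2 + x2^2 - 2 = 0,
# whose solution is x* = (-1, -1) with multiplier y* = 1/2.
def newton_kkt(x0, y0, iters=20):
    x, y = x0.astype(float), float(y0)
    for _ in range(iters):
        Lx = np.array([1.0 + 2*y*x[0], 1.0 + 2*y*x[1]])   # grad_x L(x, y)
        g = x[0]**2 + x[1]**2 - 2.0                        # constraint value
        Lxx = np.array([[2*y, 0.0], [0.0, 2*y]])           # Hessian of L in x
        gx = np.array([2*x[0], 2*x[1]])                    # constraint gradient
        KKT = np.block([[Lxx, gx.reshape(2, 1)],
                        [gx.reshape(1, 2), np.zeros((1, 1))]])
        rhs = -np.concatenate([Lx, [g]])
        dz = np.linalg.solve(KKT, rhs)                     # Newton step in (x, y)
        x, y = x + dz[:2], y + dz[2]
    return x, y

print(newton_kkt(np.array([-1.5, -0.5]), 1.0))   # -> (array([-1., -1.]), 0.5)
```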

Now we proceed to applications of Newton's method to convex constrained optimization problems, the area where the method plays the key role in constructing the most effective optimization algorithms. The basic scheme of interior-point methods looks as follows [44,66]. For the convex optimization problem

\min f(x), \quad x \in Q,     (28)

with a convex self-concordant f and a convex Q ⊂ R^n, we construct a self-concordant barrier F(x), defined on the interior of Q and growing to infinity when a point approaches the boundary of Q:

F : \operatorname{int} Q \to \mathbb{R}^1, \qquad F(x) \to \infty \ \text{for } x \to \partial Q.

Such barriers exist for numerous examples of constraints; for instance, if Q = {x : x ≥ 0}, then the logarithmic barrier has the desired properties:

F(x) = -\sum_{i=1}^{n} \ln x_i.

Using the barrier, we take the function

f_k(x) = t_k f(x) + F(x),

depending on the parameter t_k > 0. Under natural assumptions it can be proved that f_k(x) has a minimum point x_k^* on int Q and f(x_k^*) → f* (the minimal value in (28)) as t_k → ∞ (this is the so-called central path). However, there is no need to obtain precise values of x_k^*; it suffices to make one step of the damped Newton method (19)–(21) and then to vary t_k. There exists a way to adjust the parameters α_k, t_k such that the method achieves polynomial-time complexity; see details and rigorous results in [44,46,4,8]. The theoretical result on polynomial-time complexity is very important; however, the practical simulation results on the implementation of interior-point methods are not less important. They demonstrate very high effectiveness of the method for a broad spectrum of convex optimization problems, from linear programming to semi-definite programming. For instance, the method is a successful competitor to the classical simplex method for linear programs.

Similar ideas are exploited for non-convex constrained optimization problems, where so-called sequential quadratic programming (SQP) is the basis for efficient algorithms; see, e.g., [11,20,7,66,61,34].
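To fix ideas, here is a deliberately simplified one-dimensional Python sketch of the barrier scheme above. The problem (min x over Q = [0, 1]), the multiplicative update of t_k and all names are our own illustrative choices; a practical interior-point method adjusts t_k far more carefully.

```python
import numpy as np

# Toy path-following: minimize x over [0, 1] with the barrier F(x) = -ln(x) - ln(1-x),
# taking one damped Newton step (19)-(21) on f_k(x) = t_k*x + F(x) per value of t_k.
def barrier_newton(x=0.5, t=1.0, factor=1.5, outer=60):
    for _ in range(outer):
        g = t - 1.0/x + 1.0/(1.0 - x)          # f_k'(x)
        H = 1.0/x**2 + 1.0/(1.0 - x)**2        # f_k''(x) > 0
        d = abs(g) / np.sqrt(H)                # Newton decrement
        alpha = 1.0 if d <= 0.25 else 1.0/(1.0 + d)
        x = x - alpha * g / H                  # damped Newton step, stays in (0, 1)
        t *= factor                            # move along the central path
    return x

print(barrier_newton())   # approaches the solution x* = 0 as t grows
```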

9. Some extensions

Newton's method has numerous extensions; we consider below just a few of them.

• Relaxed smoothness assumptions. Newton's method can be applied in many situations where the equations (or the functions to be minimized) are not smooth enough. The simplest example is solving Eq. (1) with a piecewise-linear non-smooth convex function F. Then, if a solution exists, method (2) (naturally extended) finds it after a finite number of iterations. For the general case there exist numerous non-smooth versions of Newton's method, beginning with the concept of semi-smoothness by Mifflin [42]. Many researchers have contributed to the area, see [9,34,57,58,62] and references therein; the book [34] contains many basic ideas and historical remarks. A related approach is based on smoothing the equations, consult [18,9].

• Multiple roots. We analyzed Newton's method in a neighborhood of a simple root x*. In the case of a multiple root, Newton's method either remains convergent but loses its fast rate of convergence, or diverges. There is a modification of the method which preserves quadratic convergence; it is due to Schröder (1870). We provide it for the one-dimensional case (1):

x_{k+1} = x_k - p\,F'(x_k)^{-1} F(x_k), \quad k = 0, 1, \ldots,     (29)

where p is the multiplicity of the root (a small numerical sketch is given after this list). Unfortunately, this multiplicity must be known in advance; moreover, the situation in the multidimensional case can be much more complicated.

• Higher-order methods. The local rate of convergence of Newton's method is fast enough; however, some researchers construct methods with a still higher rate of convergence. This problem looks a bit artificial (actually, very few iterations of Newton's method are required to obtain high precision once its convergence domain is reached), so there is no real need to accelerate the method.

• Continuous version. Instead of the discrete iterations of Newton's method (4) one can apply its continuous analog

\dot{x}(t) = -F'(x(t))^{-1} F(x(t)).     (30)

The simplest difference approximation of (30) leads to the damped Newton method (10) with constant step-size α_k = α. As we know, the damped Newton method can exhibit global convergence, so one can expect the same property for the continuous version. Indeed, the global convergence of (30) has been validated by Smale [63], and its rate of convergence is analyzed in [59]. Of course, the numerical value of continuous methods is arguable, because their implementation demands discretization.

• Data at one point. All results on the convergence of Newton's method include assumptions which are valid in some neighborhood of the solution or in some prescribed ball. In contrast, Smale [64] provides a convergence theorem based on data available at the single initial point. However, these data include bounds on all derivatives.

• Solving complementarity and equilibrium problems. There are many works where Newton's method is applied to problems which cannot be cast into equation solving; the typical examples are complementarity problems, variational inequalities and equilibrium-type problems. For references see, e.g., [16,34].

• Implementation issues. We are unable to discuss the implementation issues of the various versions of Newton's method, which are indeed important for their practical application. Many details can be found in the recent book [13], while codes of the corresponding algorithms can be downloaded from [15]. For convex optimization problems such issues are discussed in [8].

• Complexity. There exist very deep results on the complexity of basic problems of numerical analysis (e.g., finding all roots of a polynomial), closely related to Newton's method (some modification of the method is usually proved to achieve the best possible result). The interested reader can consult [65].
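The following Python sketch (ours) illustrates the effect of (29) on a root of multiplicity two, as promised in the "Multiple roots" item above.

```python
# Schroder's modification (29) on F(x) = (x - 1)**2, a root of multiplicity p = 2:
# plain Newton (p = 1) converges only linearly, while p = 2 restores fast convergence.
def schroder(F, Fp, x0, p, iters=8):
    x = x0
    for _ in range(iters):
        if Fp(x) == 0.0:            # landed exactly on the root
            break
        x = x - p * F(x) / Fp(x)    # x_{k+1} = x_k - p F'(x_k)^{-1} F(x_k)
    return x

F = lambda x: (x - 1.0)**2
Fp = lambda x: 2.0 * (x - 1.0)
print(schroder(F, Fp, 3.0, p=1))    # still noticeably away from 1 after 8 steps
print(schroder(F, Fp, 3.0, p=2))    # hits the root x = 1 in one step
```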
References

[1] V.I. Arnold, Small denominators and problem of stability in classical and celestial mechanics, Uspekhi Mathematicheskih Nauk 18 (6) (1963) 91–192.
[2] M. Barnsley, Fractals Everywhere, Academic Press, London, 1993.
[3] A.A. Bennet, Newton's method in general analysis, Proceedings of National Academy of Sciences, USA 2 (10) (1916) 592–598.
[4] A. Ben-Tal, A. Nemirovski, Lectures on Modern Convex Optimization: Analysis, Algorithms and Engineering Applications, SIAM, Philadelphia, 2001.
[5] D.P. Bertsekas, Constrained Optimization and Lagrange Multiplier Method, Academic Press, New York, 1982.
[6] D.P. Bertsekas, Nonlinear Programming, Athena Scientific, Belmont, 1999.
[7] L.T. Biegler, I.E. Grossmann, Part I: Retrospective on optimization, Part II: Future perspective on optimization, Preprints, Chemical Engineering Department, Carnegie-Mellon University, 2004.
[8] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, Cambridge, 2004.
[9] X. Chen, Z. Nashed, L. Qi, Smoothing methods and semismooth methods for nondifferentiable operator equations, SIAM Journal on Numerical Analysis 38 (2000) 1200–1216.
[10] L. Collatz, Functional Analysis and Numerical Mathematics, Academic Press, New York, 1966.
[11] A.B. Conn, N.I.M. Gould, Ph.L. Toint, Trust Region Methods, SIAM, Philadelphia, 2000.
[12] J.H. Curry, L. Garnett, D. Sullivan, On the iteration of a rational function: Computer experiments with Newton's method, Communications in Mathematical Physics 91 (1983) 267–277.
[13] P. Deuflhard, Newton Methods for Nonlinear Problems: Affine Invariant and Adaptive Algorithms, Springer, Berlin, 2004.
[14] R.M. Dickau, Newton's method. Available from: <http://mathforum.org/advanced/robertd/newtons.html>.
[15] Electronic library ZIB. Available from: <http://www.zib.de/SciSoft/NewtonLib>.
[16] F. Facchinei, J.S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems, Springer, New York, 2003.
[17] H. Fine, On Newton's method of approximation, Proceedings of National Academy of Sciences, USA 2 (9) (1916) 546–552.
[18] M. Fukushima, L. Qi, Reformulation: Nonsmooth, Piecewise Smooth, Semismooth and Smoothing Methods, Kluwer, Dordrecht, 1999.
[19] S. Goldfeld, R. Quandt, H. Trotter, Maximization by quadratic hill climbing, Econometrica 34 (1966) 541–551.
[20] P.E. Gill, W. Murray, M.A. Saunders, SNOPT: An SQP algorithm for large-scale constrained optimization, SIAM Journal on Optimization 12 (2002) 979–1006.
[21] L.M. Graves, Some mapping theorems, Duke Mathematical Journal 17 (1950) 111–114.
[22] A.D. Ioffe, On the local surjection property, Nonlinear Analysis: Theory, Methods and Applications 11 (5) (1987) 565–592.
[23] D.E. Joyce, Newton basins. Available from: <http://aleph0.clarku.edu/djoyce/newton/newton.html>.
[24] G. Julia, Sur l'iteration des fonctions rationelles, Journal de Mathematiques Pure et Applique 8 (1918) 47–245.
[25] L.V. Kantorovich, On Newton's method for functional equations, Doklady AN SSSR 59 (7) (1948) 1237–1240.
[26] L.V. Kantorovich, Functional analysis and applied mathematics, Uspekhi Mathematicheskih Nauk 3 (1) (1948) 89–185. Translated as N.B.S. Report 1509, Washington, DC, 1952.
[27] L.V. Kantorovich, On Newton's method, Trudy Mathematicheskogo Instituta Imeni V.A. Steklova 28 (1949) 104–144.
[28] L.V. Kantorovich, Principle of majorants and Newton's method, Doklady AN SSSR 76 (1) (1951) 17–20.
[29] L.V. Kantorovich, Some further applications of principle of majorants, Doklady AN SSSR 80 (6) (1951) 849–852.
[30] L.V. Kantorovich, On approximate solution of functional equations, Uspekhi Mathematicheskih Nauk 11 (6) (1956) 99–116.
[31] L.V. Kantorovich, Some further applications of Newton's method, Vestnik LGU, Seriya Matematika Mekhanika (7) (1957) 68–103.
[32] L.V. Kantorovich, G.P. Akilov, Functional Analysis in Normed Spaces, Fizmatgiz, Moscow, 1959.
[33] L.V. Kantorovich, G.P. Akilov, Functional Analysis, Nauka, Moscow, 1977.
[34] D. Klatte, B. Kummer, Nonsmooth Equations in Optimization: Regularity, Calculus, Methods and Applications, Kluwer, Dordrecht, 2002.
[35] M.A. Krasnoselski, G.M. Vainikko, P.P. Zabreiko, Ya.B. Rutitski, V.Ya. Stetsenko, Approximate Solution of Operator Equations, Nauka, Moscow, 1969. English translation: Walter-Noordhoff, Groningen, 1972.
[36] K. Levenberg, A method for the solution of certain nonlinear problems in least squares, Quarterly of Applied Mathematics 2 (1944) 164–168.
[37] E.S. Levitin, B.T. Polyak, Constrained minimization methods, USSR Computational Mathematics and Mathematical Physics 6 (5) (1966) 1–50.
[38] L.A. Ljusternik, On conditional extrema of functionals, Mathematicheskij Sbornik 41 (3) (1934) 390–401.
[39] B. Mandelbrot, The Fractal Geometry of Nature, W.H. Freeman, New York, 1983.
[40] D. Marquardt, An algorithm for least squares estimation of nonlinear parameters, SIAM Journal of Applied Mathematics 11 (1963) 431–441.
[41] J.H. Mathews, Bibliography for Newton's method. Available from: <http://math.fullerton.edu/mathews/newtonsmethod/Newton'sMethodBib/Links/Newton'sMethodBib_lnk_3.html>.
[42] R. Mifflin, Semismooth and semiconvex functions in constrained optimization, SIAM Journal on Control and Optimization 15 (1977) 959–972.
[43] I.P. Mysovskikh, On convergence of L.V. Kantorovich's method for functional equations and its applications, Doklady AN SSSR 70 (4) (1950) 565–568.
[44] Yu. Nesterov, A. Nemirovski, Interior-Point Polynomial Algorithms in Convex Programming, SIAM, Philadelphia, 1994.
[45] Yu. Nesterov, B. Polyak, Cubic regularization of a Newton scheme and its global performance, CORE Discussion Papers 2003, No. 41; Mathematical Programming, submitted for publication.
[46] Yu. Nesterov, Introductory Lectures on Convex Programming, Kluwer, Boston, 2004.
[47] A.M. Ostrowski, Solution of Equations and Systems of Equations, Academic Press, Basel, 1960.
[48] J.M. Ortega, W.C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables, Academic Press, New York-London, 1970.
[49] H.O. Peitgen, D. Saupe, F. Haeseler, Cayley's problem and Julia sets, Mathematical Intelligencer 6 (2) (1984) 11–20.
[50] H.O. Peitgen (Ed.), Newton's Method and Dynamical Systems, Springer, Berlin, 1989.
[51] B.T. Polyak, Gradient methods for solving equations and inequalities, USSR Computational Mathematics and Mathematical Physics 4 (6) (1964) 17–32.
[52] B.T. Polyak, Convexity of nonlinear image of a small ball with applications to optimization, Set-Valued Analysis 9 (1–2) (2001) 159–168.
[53] B.T. Polyak, The convexity principle and its applications, Bulletin of Brazilian Mathematical Society 34 (1) (2003) 59–75.
[54] B.T. Polyak, Convexity of the reachable set of nonlinear systems under L2 bounded controls, Dynamics of Continuous, Discrete and Impulsive Systems 11 (2–3) (2004) 255–268.
[55] B.T. Polyak, Introduction to Optimization, Optimization Software, New York, 1987.
[56] B.T. Polyak, Iterative methods using Lagrange multipliers for solving extremal problems with equality-type constraints, USSR Computational Mathematics and Mathematical Physics 10 (5) (1970) 42–52.
[57] B.T. Polyak, Minimization of nonsmooth functionals, USSR Computational Mathematics and Mathematical Physics 9 (3) (1969) 14–29.
[58] L. Qi, J. Sun, A nonsmooth version of Newton's method, Mathematical Programming 58 (1993) 353–367.
[59] A.G. Ramm, Acceleration of convergence: A continuous analog of the Newton's method, Applicable Analysis 81 (4) (2002) 1001–1004.
[60] W.C. Rheinboldt, Methods for Solving Systems of Nonlinear Equations, SIAM, Philadelphia, 1998.
[61] S.M. Robinson, Perturbed Kuhn–Tucker points and rates of convergence for a class of nonlinear programming algorithms, Mathematical Programming 7 (1974) 1–16.
[62] S.M. Robinson, Newton's method for a class of nonsmooth functions, Set-Valued Analysis 2 (1994) 291–305.
[63] S. Smale, A convergent process of price adjustment and global Newton's methods, Journal of Mathematical Economics 3 (1976) 107–120.
[64] S. Smale, Newton's method estimates from data at one point, in: R. Ewing, K. Gross, C. Martin (Eds.), The Merging of Disciplines: New Directions in Pure, Applied and Computational Mathematics, Springer, Berlin, 1986.
[65] S. Smale, Complexity theory and numerical analysis, Acta Numerica 6 (1997) 523–551.
[66] S.J. Wright, Primal–Dual Interior-Point Methods, SIAM, Philadelphia, 1997.
[67] T.J. Ypma, Historical development of the Newton–Raphson method, SIAM Review 37 (4) (1995) 531–551.
