
Numerical Methods for

Large-Scale Nonlinear Systems

Handouts by Ronald H.W. Hoppe

following the monograph


P. Deuflhard
Newton Methods for Nonlinear Problems
Springer, Berlin-Heidelberg-New York, 2004

Num. Meth. Large-Scale Nonlinear Systems

1. Classical Newton Convergence Theorems


1.1 Classical Newton-Kantorovich Theorem
Theorem 1.1 Classical Newton-Kantorovich Theorem
Let X and Y be Banach spaces, D ⊂ X a convex subset, and suppose that
F : D ⊂ X → Y is continuously Fréchet differentiable on D with an invertible
Fréchet derivative F'(x^0) for some initial guess x^0 ∈ D. Assume further that
the following conditions hold true:

‖F'(x^0)^{-1} F(x^0)‖ ≤ α ,   (1.1)
‖F'(y) − F'(x)‖ ≤ γ ‖y − x‖ ,  x, y ∈ D ,   (1.2)
h_0 := α γ ‖F'(x^0)^{-1}‖ < 1/2 ,   (1.3)
B(x^0, ρ_0) ⊂ D ,  ρ_0 := (1 − √(1 − 2h_0)) / (γ ‖F'(x^0)^{-1}‖) .   (1.4)

Then, for the sequence {x^k}_{k∈ℕ₀} of Newton iterates

F'(x^k) Δx^k = − F(x^k) ,  x^{k+1} = x^k + Δx^k

there holds:
(i) F'(x) is invertible for all Newton iterates x = x^k, k ∈ ℕ₀,
(ii) the sequence {x^k}_{k∈ℕ₀} of Newton iterates is well defined with x^k ∈ B(x^0, ρ_0),
k ∈ ℕ₀, and x^k → x* ∈ B(x^0, ρ_0) (k → ∞), where F(x*) = 0,
(iii) the convergence x^k → x* (k → ∞) is quadratic,
(iv) the solution x* of F(x) = 0 is unique in

B(x^0, ρ_0) ∪ (D ∩ B(x^0, ρ̄_0)) ,  ρ̄_0 := (1 + √(1 − 2h_0)) / (γ ‖F'(x^0)^{-1}‖) .
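In matrix-vector form the iteration is short to code. A minimal dense Python sketch (the stopping rule and the test problem F(x) = (x_1² + x_2² − 1, x_1 − x_2), with root x* = (1/√2, 1/√2), are our own illustrative choices, not part of the theorem):

```python
import numpy as np

def newton(F, dF, x0, tol=1e-12, kmax=25):
    """Ordinary Newton iteration: solve F'(x^k) dx = -F(x^k), set x^{k+1} = x^k + dx."""
    x = np.array(x0, dtype=float)
    for _ in range(kmax):
        dx = np.linalg.solve(dF(x), -F(x))
        x = x + dx
        if np.linalg.norm(dx) <= tol * (1.0 + np.linalg.norm(x)):
            break
    return x

# Test problem: intersection of the unit circle with the diagonal x_1 = x_2.
F  = lambda x: np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])
dF = lambda x: np.array([[2.0 * x[0], 2.0 * x[1]], [1.0, -1.0]])

x_star = newton(F, dF, [1.0, 0.5])
```

Under the assumptions of the theorem, the increment norms ‖Δx^k‖ contract quadratically, which is what one observes here after the first couple of steps.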

Proof. We have
‖F'(x^k) − F'(x^0)‖ ≤ γ ‖x^k − x^0‖ ≤ γ t_k
for some upper bound t_k of ‖x^k − x^0‖, k ∈ ℕ.
If we can prove x^k ∈ B(x^0, ρ_0) and t̄_k := γ ‖F'(x^0)^{-1}‖ t_k < 1, k ∈ ℕ, then by the
Banach perturbation lemma F'(x^k) is invertible with

‖F'(x^k)^{-1}‖ ≤ ‖F'(x^0)^{-1}‖ / (1 − ‖F'(x^0)^{-1}‖ ‖F'(x^k) − F'(x^0)‖)
 ≤ ‖F'(x^0)^{-1}‖ / (1 − γ ‖F'(x^0)^{-1}‖ ‖x^k − x^0‖)
 ≤ ‖F'(x^0)^{-1}‖ / (1 − t̄_k) =: β_k .   (1.5)

We prove x^k ∈ B(x^0, ρ_0) and t̄_k < 1, k ∈ ℕ, by induction on k:

For k = 1 we have
‖x^1 − x^0‖ = ‖F'(x^0)^{-1} F(x^0)‖ ≤ α = h_0 / (γ ‖F'(x^0)^{-1}‖) < ρ_0 ,
since h_0 < 1 − √(1 − 2h_0) for 0 < h_0 < 1/2, and

t̄_1 := γ ‖F'(x^0)^{-1}‖ t_1 = γ ‖F'(x^0)^{-1}‖ ‖x^1 − x^0‖
 = γ ‖F'(x^0)^{-1}‖ ‖F'(x^0)^{-1} F(x^0)‖ ≤ γ ‖F'(x^0)^{-1}‖ α = h_0 < 1/2 .

Assuming the assertion to be true for some k ∈ ℕ, for k + 1, using (1.2) we obtain

‖x^{k+1} − x^k‖ = ‖F'(x^k)^{-1} F(x^k)‖
 = ‖F'(x^k)^{-1} (F(x^k) − F(x^{k−1}) − F'(x^{k−1}) Δx^{k−1})‖
 = ‖F'(x^k)^{-1} ∫_0^1 (F'(x^{k−1} + s Δx^{k−1}) − F'(x^{k−1})) Δx^{k−1} ds‖   (1.6)
 ≤ ‖F'(x^k)^{-1}‖ ∫_0^1 ‖F'(x^{k−1} + s Δx^{k−1}) − F'(x^{k−1})‖ ‖Δx^{k−1}‖ ds
 ≤ ‖F'(x^k)^{-1}‖ ∫_0^1 γ s ‖Δx^{k−1}‖² ds ≤ (1/2) γ β_k ‖x^k − x^{k−1}‖² .

Setting
h_k := ‖x^{k+1} − x^k‖ ,
we thus get the recursion

h_k ≤ (1/2) γ β_k h_{k−1}² ,  k ∈ ℕ .   (1.7)

In view of the relationship
‖x^{k+1} − x^0‖ ≤ ‖x^{k+1} − x^k‖ + ‖x^k − x^0‖ ,
with upper bounds t_{k+1}, h_k, and t_k for the three norms, we consider the recursion

t_{k+1} = t_k + h_k .   (1.8)

Observing (1.5) and (1.7), we find

t_{k+1} − t_k ≤ (1/2) · (γ ‖F'(x^0)^{-1}‖ / (1 − t̄_k)) · (t_k − t_{k−1})² .

Hence, multiplying both sides by γ ‖F'(x^0)^{-1}‖, we end up with the following
three-term recursion for t̄_k:

t̄_{k+1} − t̄_k = (1/2) (t̄_k − t̄_{k−1})² / (1 − t̄_k) ,  t̄_0 = 0 ,  t̄_1 = h_0 .   (1.9)

The famous Ortega trick allows us to reduce (1.9) to a two-term recursion which
can be interpreted as a Newton method in ℝ:
multiplying both sides in (1.9) by 1 − t̄_k results in
(t̄_{k+1} − t̄_k)(1 − t̄_k) = (1/2) (t̄_k − t̄_{k−1})² ,
from which we deduce
t̄_{k+1} − t̄_{k+1} t̄_k + (1/2) t̄_k² = t̄_k − t̄_k t̄_{k−1} + (1/2) t̄_{k−1}² ,
i.e., ψ(t̄_{k+1}, t̄_k) = ψ(t̄_k, t̄_{k−1}) for ψ(u, v) := u − u v + (1/2) v².
It follows that
ψ(t̄_{k+1}, t̄_k) = ψ(t̄_1, t̄_0) = h_0 ,
from which we deduce
t̄_{k+1} − t̄_k = (h_0 − t̄_k + (1/2) t̄_k²) / (1 − t̄_k) = − φ(t̄_k) / φ'(t̄_k) ,
where φ : ℝ → ℝ is given by
φ(t) := h_0 − t + (1/2) t² .
Obviously, φ has the zeroes
t_1* := 1 − √(1 − 2h_0) ,  t_2* := 1 + √(1 − 2h_0) .

Since φ is convex, the Newton method started at t̄_0 = 0 converges monotonically to t_1*.
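The scalar majorant iteration is easy to run: Newton's method for φ(t) = h_0 − t + t²/2 started at t̄_0 = 0 increases monotonically to t_1* = 1 − √(1 − 2h_0). A short sketch (the value h_0 = 0.4 is an arbitrary admissible choice of ours):

```python
import math

h0 = 0.4                                   # any 0 < h0 < 1/2 is admissible
phi  = lambda t: h0 - t + 0.5 * t * t
dphi = lambda t: -1.0 + t

t = 0.0                                    # start at t_0 = 0
history = [t]
for _ in range(30):
    t = t - phi(t) / dphi(t)               # Newton step; equals (h0 - t**2/2) / (1 - t)
    history.append(t)

t1_star = 1.0 - math.sqrt(1.0 - 2.0 * h0)  # smaller zero of phi
```

By convexity each Newton step from the left of t_1* stays to the left of it, which gives the monotone convergence used in the proof.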


It follows from the definition of t_k that x^k ∈ B(x^0, ρ_0). Moreover, as a consequence of (1.6) we readily find that {x^k}_{k∈ℕ₀} is a Cauchy sequence in B(x^0, ρ_0).
Hence, there exists x* ∈ B(x^0, ρ_0) such that x^k → x* (k → ∞) and
Δx^k = − F'(x^k)^{-1} F(x^k) → − F'(x*)^{-1} F(x*) = 0 ,
whence F(x*) = 0.
The quadratic convergence can be deduced from (1.6) as well. Finally, the
uniqueness of x* in B(x^0, ρ_0) ∪ (D ∩ B(x^0, ρ̄_0)) follows readily from the properties
of the function φ.

1.2 Classical Newton-Mysovskikh Theorem

Theorem 1.2 Classical Newton-Mysovskikh Theorem
Let X and Y be Banach spaces, D ⊂ X a convex subset, and suppose that
F : D ⊂ X → Y is continuously Fréchet differentiable on D with invertible
Fréchet derivatives F'(x), x ∈ D, and let x^0 ∈ D be some initial guess. Assume
further that the following conditions hold true:

‖F'(x^0)^{-1} F(x^0)‖ ≤ α ,   (1.10)
‖F'(x)^{-1}‖ ≤ β ,  x ∈ D ,   (1.11)
‖F'(y) − F'(x)‖ ≤ γ ‖y − x‖ ,  x, y ∈ D ,   (1.12)
h_0 := (1/2) β γ ‖F'(x^0)^{-1} F(x^0)‖ ≤ (1/2) α β γ < 1 ,   (1.13)
B(x^0, ρ) ⊂ D ,  ρ := α Σ_{j=0}^∞ h_0^{2^j − 1} .   (1.14)

Then, for the sequence {x^k}_{k∈ℕ₀} of Newton iterates

F'(x^k) Δx^k = − F(x^k) ,  x^{k+1} = x^k + Δx^k

there holds:
(i) x^k ∈ B(x^0, ρ), k ∈ ℕ₀, and there exists x* ∈ B(x^0, ρ) such that F(x*) = 0
and x^k → x* (k → ∞),
(ii) ‖x^{k+1} − x^k‖ ≤ (1/2) β γ ‖x^k − x^{k−1}‖² ,  k ∈ ℕ,
(iii) ‖x^k − x*‖ ≤ ε_k ‖x^k − x^{k−1}‖² , where

ε_k := (1/2) β γ (1 + Σ_{j=1}^∞ (h_0^{2^k})^{2^j − 1}) ≤ (1/2) β γ / (1 − h_0^{2^k}) .


Proof. Observing
F'(x^{k−1}) Δx^{k−1} + F(x^{k−1}) = 0 ,
we obtain

‖Δx^k‖ = ‖F'(x^k)^{-1} F(x^k)‖
 = ‖F'(x^k)^{-1} (F(x^k) − F(x^{k−1}) − F'(x^{k−1}) Δx^{k−1})‖
 ≤ ‖F'(x^k)^{-1}‖ ‖∫_0^1 (F'(x^{k−1} + s Δx^{k−1}) − F'(x^{k−1})) Δx^{k−1} ds‖
 ≤ (1/2) β γ ‖Δx^{k−1}‖² ,

which gives assertion (ii).
We now prove that {x^k}_{k∈ℕ₀} is a Cauchy sequence in B(x^0, ρ). By induction on
k we show

‖x^{k+1} − x^k‖ ≤ (2/(βγ)) h_0^{2^k} ,  k ∈ ℕ₀ .   (1.15)

For k = 0, we have in view of (1.10) and the definition of h_0
‖Δx^0‖ = (2/(βγ)) h_0 .
Assuming (1.15) to be true for some k ∈ ℕ, we get

‖x^{k+2} − x^{k+1}‖ = ‖Δx^{k+1}‖ ≤ (1/2) β γ ‖Δx^k‖²
 ≤ (1/2) β γ ((2/(βγ)) h_0^{2^k})² = (2/(βγ)) h_0^{2^{k+1}} .

It follows readily from (1.15) that x^{k+1} ∈ B(x^0, ρ):

‖x^{k+1} − x^0‖ ≤ ‖x^{k+1} − x^k‖ + ... + ‖x^1 − x^0‖
 ≤ (2/(βγ)) (h_0^{2^k} + ... + h_0) ≤ (2 h_0/(βγ)) Σ_{j=0}^∞ h_0^{2^j − 1}
 ≤ α Σ_{j=0}^∞ h_0^{2^j − 1} = ρ .

Similarly, it can be shown that
‖x^{m+k} − x^m‖ → 0  (m → ∞ , k ∈ ℕ) .


Since {x^k}_{k∈ℕ₀} is a Cauchy sequence in B(x^0, ρ), there exists x* ∈ B(x^0, ρ) such
that x^k → x* (k → ∞). Hence,
Δx^k = − F'(x^k)^{-1} F(x^k) → − F'(x*)^{-1} F(x*) = 0 ,
and thus F(x*) = 0, which proves (i).
Assertion (iii) is shown as follows: setting
h_k := (1/2) β γ ‖Δx^k‖ ,
we obtain

‖x^k − x*‖ = lim_{m→∞} ‖x^k − x^m‖
 ≤ lim_{m→∞} (‖x^m − x^{m−1}‖ + ... + ‖x^{k+1} − x^k‖)
 = lim_{m→∞} (2/(βγ)) (h_{m−1} + ... + h_k)
 = (2 h_k/(βγ)) lim_{m→∞} (1 + h_{k+1}/h_k + ... + h_{m−1}/h_k) .

On the other hand, taking (ii) into account,
h_k = (1/2) β γ ‖Δx^k‖ ≤ ((1/2) β γ ‖Δx^{k−1}‖)² = h_{k−1}² ,
whence
h_{k+ℓ} ≤ h_k^{2^ℓ} ,  k ∈ ℕ₀ , ℓ ∈ ℕ .
We conclude

‖x^k − x*‖ ≤ (2 h_k/(βγ)) (1 + Σ_{j=1}^∞ h_k^{2^j − 1})
 ≤ (1/2) β γ ‖x^k − x^{k−1}‖² (1 + Σ_{j=1}^∞ h_k^{2^j − 1}) ,

and since h_k ≤ h_0^{2^k}, the right-hand side is bounded by ε_k ‖x^k − x^{k−1}‖²,
which proves (iii).


2. Affine Invariant/Conjugate Newton Convergence Theorems

2.1 Affine Covariant Newton Convergence Theorems
Theorem 2.1 Affine Covariant Newton-Kantorovich Theorem
Let F : D ⊂ ℝ^n → ℝ^n be continuously differentiable on D with an invertible
Jacobian F'(x^0) for some initial guess x^0 ∈ D. Assume further that the following
conditions hold true:

‖F'(x^0)^{-1} F(x^0)‖ ≤ α ,   (1.16)
‖F'(x^0)^{-1} (F'(y) − F'(x))‖ ≤ ω_0 ‖y − x‖ ,  x, y ∈ D ,   (1.17)
h_0 := α ω_0 < 1/2 ,   (1.18)
B(x^0, ρ_0) ⊂ D ,  ρ_0 := (1 − √(1 − 2h_0)) / ω_0 .   (1.19)

Then, for the sequence {x^k}_{k∈ℕ₀} of Newton iterates

F'(x^k) Δx^k = − F(x^k) ,  x^{k+1} = x^k + Δx^k

there holds:
(i) F'(x) is invertible for all Newton iterates x = x^k, k ∈ ℕ₀,
(ii) the sequence {x^k}_{k∈ℕ₀} of Newton iterates is well defined with x^k ∈ B(x^0, ρ_0),
k ∈ ℕ₀, and x^k → x* ∈ B(x^0, ρ_0) (k → ∞), where F(x*) = 0,
(iii) the convergence x^k → x* (k → ∞) is quadratic,
(iv) the solution x* of F(x) = 0 is unique in

B(x^0, ρ_0) ∪ (D ∩ B(x^0, ρ̄_0)) ,  ρ̄_0 := (1 + √(1 − 2h_0)) / ω_0 .

Proof. First homework assignment.


Theorem 2.2 Affine Covariant Newton-Mysovskikh Theorem

Let F : D ⊂ ℝ^n → ℝ^n, D ⊂ ℝ^n convex, be continuously differentiable on D
with invertible Jacobians F'(x), x ∈ D, and let x^0 ∈ D be some initial guess.
Assume further that the following conditions hold true:

‖F'(x^0)^{-1} F(x^0)‖ ≤ α ,   (1.20)
‖F'(z)^{-1} (F'(y) − F'(x)) (y − x)‖ ≤ ω ‖y − x‖² ,  x, y, z ∈ D ,   (1.21)
h_0 := ω ‖Δx^0‖ < 2 ,   (1.22)
B(x^0, ρ) ⊂ D ,  ρ := ‖Δx^0‖ / (1 − h_0/2) .   (1.23)

Then, for the sequence {x^k}_{k∈ℕ₀} of Newton iterates

F'(x^k) Δx^k = − F(x^k) ,  x^{k+1} = x^k + Δx^k

there holds x^k ∈ B(x^0, ρ), k ∈ ℕ₀, and there exists x* ∈ B(x^0, ρ) such that
F(x*) = 0 and x^k → x* (k → ∞) with

‖x^{k+1} − x^k‖ ≤ (1/2) ω ‖x^k − x^{k−1}‖² ,
‖x^k − x*‖ ≤ ‖x^k − x^{k−1}‖ / (1 − (1/2) ω ‖x^k − x^{k−1}‖) .

Proof. Slight modification of the proof of the Classical Newton-Mysovskikh Theorem.

2.2 Affine Contravariant Newton Convergence Theorem

Theorem 2.3 Affine Contravariant Newton-Mysovskikh Theorem
Let F : D ⊂ ℝ^n → ℝ^n, D ⊂ ℝ^n convex, be continuously differentiable on D
with invertible Jacobians F'(x), x ∈ D, and let x^0 ∈ D be some initial guess.
Assume further that the following conditions hold true:

‖(F'(y) − F'(x)) (y − x)‖ ≤ ω ‖F'(x) (y − x)‖² ,  x, y ∈ D ,   (1.24)
L̄ ⊂ D bounded ,  L := {x ∈ D | ‖F(x)‖ < 2/ω} ,   (1.25)
h_0 := ω ‖F(x^0)‖ < 2 .   (1.26)


Then, the sequence {x^k}_{k∈ℕ₀} of Newton iterates stays in L, and there exists an
x* ∈ L̄ such that x^k → x* for some subsequence ℕ' ⊂ ℕ₀ and F(x*) = 0.
Moreover, for the residuals F(x^k) there holds

‖F(x^{k+1})‖ ≤ (1/2) ω ‖F(x^k)‖² .
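The quadratic residual contraction is easy to observe numerically. A scalar sketch (the choice F(x) = arctan x and the starting point are our own; the ratios ‖F(x^{k+1})‖/‖F(x^k)‖² then stay bounded, in accordance with the theorem):

```python
import math

F  = lambda x: math.atan(x)
dF = lambda x: 1.0 / (1.0 + x * x)

x = 1.0                                  # starting point inside the convergence region
res = [abs(F(x))]
for _ in range(6):
    x = x - F(x) / dF(x)                 # ordinary Newton step
    res.append(abs(F(x)))

# quadratic contraction of the residuals: ||F(x^{k+1})|| <= (omega/2) ||F(x^k)||^2
ratios = [res[k + 1] / res[k]**2 for k in range(len(res) - 1) if res[k] > 1e-12]
```

For this F the ratios even tend to zero, since arctan is odd and the convergence at x* = 0 is in fact cubic.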

Proof. We first prove x^k ∈ L by induction on k:

(i) k = 0: in view of (1.26), ‖F(x^0)‖ < 2/ω, whence x^0 ∈ L.

(ii) Assume that the assertion holds true for some k ∈ ℕ.

(iii) For any λ ∈ [0, 1] such that x^k + t Δx^k ∈ L, t ∈ [0, λ], we have

‖F(x^k + λ Δx^k)‖ = ‖F(x^k) + ∫_0^λ F'(x^k + t Δx^k) Δx^k dt‖ .

Since F(x^k) = − F'(x^k) Δx^k,
F(x^k) = − λ F'(x^k) Δx^k + (1 − λ) F(x^k) ,
and hence,

‖F(x^k + λ Δx^k)‖ = ‖∫_0^λ (F'(x^k + t Δx^k) − F'(x^k)) Δx^k dt + (1 − λ) F(x^k)‖
 ≤ ∫_0^λ ‖(F'(x^k + t Δx^k) − F'(x^k)) Δx^k‖ dt + (1 − λ) ‖F(x^k)‖
 ≤ ∫_0^λ ω t ‖F'(x^k) Δx^k‖² dt + (1 − λ) ‖F(x^k)‖   [note F'(x^k) Δx^k = − F(x^k)]
 = (1 − λ + (1/2) λ² ω ‖F(x^k)‖) ‖F(x^k)‖ .   (1.27)

We assume
x^{k+1} = x^k + Δx^k ∉ L .
Then there exists
λ̄ := min{λ ∈ (0, 1] | x^k + λ Δx^k ∉ L} .
It follows from (1.27) and ω ‖F(x^k)‖ < 2 that

‖F(x^k + λ̄ Δx^k)‖ ≤ (1 − λ̄ + (1/2) λ̄² ω ‖F(x^k)‖) ‖F(x^k)‖
 < (1 − λ̄ + λ̄²) ‖F(x^k)‖ ≤ ‖F(x^k)‖ < 2/ω ,

and hence x^k + λ̄ Δx^k ∈ L, which is a contradiction.


For λ = 1, (1.27) gives the asserted residual estimate
‖F(x^{k+1})‖ ≤ (1/2) ω ‖F(x^k)‖² .
For the proof of the rest of the assertion, we define the residual oriented
Kantorovich quantities
h_k := ω ‖F(x^k)‖ .
Then, the residual estimate implies
h_{k+1} ≤ (1/2) h_k² .
Since h_0 < 2, for k = 0 we obtain
h_1 ≤ (1/2) h_0 h_0 < h_0 ,
and an induction argument shows

h_{k+1} < h_k < 2 ,  k ∈ ℕ₀ .

Moreover,
‖F(x^{k+1})‖ < ‖F(x^k)‖ < 2/ω  and  lim_{k→∞} ‖F(x^k)‖ = 0 ,
which implies
x^k ∈ L ⊂ D ,  k ∈ ℕ .
Since L is bounded, there exist x* ∈ L̄ and a subsequence ℕ' ⊂ ℕ₀ such that
x^k → x* (k ∈ ℕ') and F(x*) = 0.

Affine conjugacy
Assume that D ⊂ ℝ^n is a convex set and that f : D → ℝ is a strictly convex
functional. Consider the minimization problem
min_{x∈D} f(x) .
Then, a necessary and sufficient optimality condition is given by the
nonlinear equation
F(x) = grad f(x) = f'(x)ᵀ = 0 ,  x ∈ D .
We note that the Jacobian F'(x) = f''(x) is symmetric and uniformly positive
definite on D. In particular, F'(x)^{1/2} is well defined and symmetric
positive definite as well.
Consequently, the energy product
(u, v)_E := uᵀ F'(x) v ,  u, v ∈ ℝ^n , x ∈ D ,
defines locally an inner product with associated norm
‖u‖_E² = uᵀ F'(x) u = ‖F'(x)^{1/2} u‖²
which is referred to as a local energy norm.
For regular B ∈ ℝ^{n×n}, we consider the transformed minimization problem
min_y g(y) ,  g(y) := f(By) ,  x = By .

Num. Meth. Large-Scale Nonlinear Systems

13

We obtain the optimality condition
G(y) = grad g(y) = (f'(By) B)ᵀ = Bᵀ f'(x)ᵀ = Bᵀ F(By) = 0
with the transformed Jacobian
G'(y) = Bᵀ F'(x) B .
Hence, the Jacobian transformation is conjugate, which motivates the notion
of affine conjugacy.
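The transformation law G'(y) = Bᵀ F'(By) B is easy to confirm numerically. A sketch with a strictly convex test functional of our own choosing, f(x) = ½ xᵀMx + ¼ (xᵀx)² with M symmetric positive definite:

```python
import numpy as np

rng = np.random.default_rng(0)

# Strictly convex test functional (our own choice): f(x) = 1/2 x^T M x + 1/4 (x^T x)^2.
A = rng.standard_normal((3, 3))
M = A @ A.T + 3.0 * np.eye(3)          # symmetric positive definite

def f(x):
    return 0.5 * x @ M @ x + 0.25 * (x @ x) ** 2

def grad_f(x):                          # F(x) = grad f(x)^T
    return M @ x + (x @ x) * x

def hess_f(x):                          # F'(x) = f''(x), symmetric positive definite
    return M + (x @ x) * np.eye(3) + 2.0 * np.outer(x, x)

B = rng.standard_normal((3, 3)) + 3.0 * np.eye(3)   # regular transformation

y = rng.standard_normal(3)
x = B @ y

G  = B.T @ grad_f(x)                    # grad g(y) = B^T F(By)
dG = B.T @ hess_f(x) @ B                # G'(y) = B^T F'(By) B
```

Because F'(x) is symmetric positive definite, so is the transformed Jacobian Bᵀ F'(x) B for any regular B — the invariance that the affine conjugate theory exploits.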


An appropriate affine conjugate Lipschitz condition is as follows:

‖F'(z)^{-1/2} (F'(y) − F'(x)) (y − x)‖ ≤ ω ‖F'(x)^{1/2} (y − x)‖² .

2.3 Affine Conjugate Newton Convergence Theorem

Theorem 2.4 Affine Conjugate Newton-Mysovskikh Theorem
Assume that D ⊂ ℝ^n is a convex domain and f : D → ℝ a strictly convex, twice
continuously differentiable functional. Let F(x) = f'(x)ᵀ and F'(x) = f''(x).
Consider the minimization problem

min_{x∈D} f(x)   (1.28)

and the associated optimality condition

F(x) = grad f(x) = 0 ,  x ∈ D .   (1.29)

Note that (1.28) has a unique solution x* ∈ D.

Let x^0 ∈ D be an initial guess and assume that the following conditions are
satisfied:

‖F'(z)^{-1/2} (F'(y) − F'(x)) (y − x)‖ ≤ ω ‖F'(x)^{1/2} (y − x)‖²   (1.30)

for collinear x, y, z ∈ D, and

h_0 := ω ‖F'(x^0)^{1/2} Δx^0‖ < 2 ,   (1.31)
L_0 := {x ∈ D | f(x) ≤ f(x^0)} is compact .   (1.32)

Then, for the Newton iterates x^k, k ∈ ℕ₀, there holds:

(i) x^k ∈ L_0, k ∈ ℕ₀, and x^k → x* (k → ∞) with

‖F'(x^{k+1})^{1/2} Δx^{k+1}‖ ≤ (1/2) ω ‖F'(x^k)^{1/2} Δx^k‖² .   (1.33)

(ii) For ε_k := ‖F'(x^k)^{1/2} Δx^k‖² and the Kantorovich quantities h_k := ω ε_k^{1/2} we
have

((1/2) − (1/6) h_k) ε_k ≤ f(x^k) − f(x^{k+1}) ≤ ((1/2) + (1/6) h_k) ε_k ,   (1.34)
(1/6) ε_k < f(x^k) − f(x^{k+1}) < (5/6) ε_k .   (1.35)

(iii) We have the a priori estimate

f(x^0) − f(x*) ≤ (5/6) ε_0 / (1 − (h_0/2)²) .   (1.36)


Proof. Assertion (i) and (1.33) can be verified as in the proof of the affine
contravariant version of the Newton-Mysovskikh theorem.
For the proof of (1.34) in (ii), observing
F'(x^k) Δx^k = − F(x^k) ,
we obtain

f(x^{k+1}) − f(x^k) + (1/2) ‖F'(x^k)^{1/2} Δx^k‖²
 = ∫_0^1 ⟨F(x^k + s Δx^k), Δx^k⟩ ds − ⟨F(x^k), Δx^k⟩ − (1/2) ⟨F'(x^k) Δx^k, Δx^k⟩
 = ∫_0^1 ⟨F(x^k + s Δx^k) − F(x^k), Δx^k⟩ ds − (1/2) ⟨F'(x^k) Δx^k, Δx^k⟩
 = ∫_0^1 ∫_0^1 s ⟨(F'(x^k + s t Δx^k) − F'(x^k)) Δx^k, Δx^k⟩ dt ds ,

since ∫_0^1 ∫_0^1 s ⟨F'(x^k) Δx^k, Δx^k⟩ dt ds = (1/2) ⟨F'(x^k) Δx^k, Δx^k⟩.
Writing

⟨(F'(x^k + s t Δx^k) − F'(x^k)) Δx^k, Δx^k⟩ = ⟨w^k, F'(x^k)^{1/2} Δx^k⟩ ,
w^k := F'(x^k)^{-1/2} (F'(x^k + s t Δx^k) − F'(x^k)) Δx^k ,

the affine conjugate Lipschitz condition (1.30), applied with the collinear
arguments x^k and x^k + s t Δx^k, yields
‖w^k‖ ≤ ω s t ‖F'(x^k)^{1/2} Δx^k‖² ,
and hence, by the Cauchy-Schwarz inequality,

|f(x^{k+1}) − f(x^k) + (1/2) ε_k|
 ≤ ∫_0^1 ∫_0^1 s · ω s t ‖F'(x^k)^{1/2} Δx^k‖² · ‖F'(x^k)^{1/2} Δx^k‖ dt ds
 = ω ε_k^{3/2} ∫_0^1 s² ds ∫_0^1 t dt = (1/6) ω ε_k^{3/2} = (1/6) h_k ε_k ,

which proves (1.34).


Using the right-hand side of (1.34) and hk < 2 yields
f (xk ) f (xk+1 ) (

1
1
5
+ hk ) k <
k .
2
6
6

1
hk k ,
6

Num. Meth. Large-Scale Nonlinear Systems

16

Likewise, using the left-hand side of (1.34) and hk < 2


f (xk ) f (xk+1 ) (

1
1
1
hk ) k >
k .
2
6
6

Together, this proves (1.35).


In order to prove (iii), we use (1.34) and obtain
2

0 (f (x ) f (x ))

(f (xk ) f (xk+1 ) <

k=0

5 2
k =
6

5 2
5
1
=
hk =
4 ( hk )2 .
6
6
2
Using
1
1
1
hk+1 ( hk )2
hk < 1 ,
2
2
2
we further get
1
1
h0 )2 + ( h1 )2 + ...
2
2
1
1
1
( h0 )2 + ( h0 )4 + ( h1 )4 + ...
2
2
2

1 2
X
h
1
1 2
( h0 )k = 4 h00 ,

h0
4
2
1 2
k=0
(

which proves (1.36).


3. Inexact Newton Methods

We recall that Newton's method computes iterates successively as the solution
of linear algebraic systems

F'(x^k) Δx^k = − F(x^k) ,  x^{k+1} = x^k + Δx^k ,  k ∈ ℕ₀ .   (1.37)

The classical convergence theorems of Newton-Kantorovich and Newton-Mysovskikh and their affine covariant, affine contravariant, and affine conjugate versions assume the exact solution of (1.37).
In practice, however, in particular if the dimension n is large, (1.37) will be solved
by an iterative method. In this case, we end up with an outer/inner iteration, where the outer iterations are the Newton steps and the inner iterations
result from the application of an iterative scheme to (1.37). It is important to
tune the outer and inner iterations and to keep track of the iteration errors.
With regard to affine covariance, affine contravariance, and affine conjugacy, the
iterative scheme for the inner iterations has to be chosen in such a way that it
easily provides information about the
• error norm in case of affine covariance,
• residual norm in case of affine contravariance, and
• energy norm in case of affine conjugacy.
Except for convex optimization, we cannot expect F'(x), x ∈ D, to be symmetric positive definite. Hence, for affine covariance and affine contravariance
we have to pick iterative solvers that are designed for nonsymmetric matrices.
Appropriate candidates are
• CGNE (Conjugate Gradient for the Normal Equations) in case of affine covariance,
• GMRES (Generalized Minimum RESidual) in case of affine contravariance, and
• PCG (Preconditioned Conjugate Gradient) in case of affine conjugacy.


3.1 Affine Covariant Inexact Newton Methods

3.1.1 CGNE (Conjugate Gradient for the Normal Equations)
We assume A ∈ ℝ^{n×n} to be a regular, nonsymmetric matrix and b ∈ ℝ^n to
be given and look for y ∈ ℝ^n as the unique solution of the linear algebraic
system

A y = b .   (1.38)

As the name already suggests, CGNE is the conjugate gradient method applied
to the normal equations: it solves the system

A Aᵀ z = b   (1.39)

for z and then computes y according to

y = Aᵀ z .   (1.40)

The implementation of CGNE is as follows:

CGNE Initialization:
Given an initial guess y_0 ∈ ℝ^n, compute the residual r_0 = b − A y_0 and set
p_0 = 0 ,  β_0 = 0 ,  γ_0 = ‖r_0‖² .

CGNE Iteration Loop: For 1 ≤ i ≤ i_max compute
p_i = Aᵀ r_{i−1} + β_{i−1} p_{i−1} ,  α_i = γ_{i−1} / ‖p_i‖² ,
y_i = y_{i−1} + α_i p_i ,  δ_i² := ‖y_i − y_{i−1}‖² = α_i γ_{i−1} ,
r_i = r_{i−1} − α_i A p_i ,  γ_i = ‖r_i‖² ,  β_i = γ_i / γ_{i−1} .

CGNE has the error minimizing property

‖y − y_i‖ = min { ‖y − v‖ | v ∈ y_0 + K_i(Aᵀ r_0, Aᵀ A) } ,   (1.41)

where K_i(Aᵀ r_0, Aᵀ A) stands for the Krylov subspace

K_i(Aᵀ r_0, Aᵀ A) := span{Aᵀ r_0, (Aᵀ A) Aᵀ r_0, ..., (Aᵀ A)^{i−1} Aᵀ r_0} .   (1.42)
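A direct transcription of the loop above into Python (a sketch; the residual-based stopping rule and the test matrix are our own additions):

```python
import numpy as np

def cgne(A, b, imax=500, tol=1e-14):
    """CGNE for Ay = b: CG applied to A A^T z = b, carried out in the y-variables.
    Returns the final iterate and the step norms delta_i = ||y_i - y_{i-1}||."""
    n = b.size
    y = np.zeros(n)                     # y_0 = 0
    r = b - A @ y                       # r_0 = b - A y_0
    gamma = r @ r                       # gamma_0 = ||r_0||^2
    p = np.zeros(n)                     # with beta_0 = 0 the first direction is A^T r_0
    beta = 0.0
    deltas = []
    for _ in range(imax):
        if gamma <= tol * (b @ b):
            break
        p = A.T @ r + beta * p          # p_i = A^T r_{i-1} + beta_{i-1} p_{i-1}
        alpha = gamma / (p @ p)         # alpha_i = gamma_{i-1} / ||p_i||^2
        y = y + alpha * p               # y_i = y_{i-1} + alpha_i p_i
        deltas.append(np.sqrt(alpha * gamma))   # delta_i^2 = alpha_i gamma_{i-1}
        r = r - alpha * (A @ p)         # r_i = r_{i-1} - alpha_i A p_i
        gamma_new = r @ r
        beta = gamma_new / gamma        # beta_i = gamma_i / gamma_{i-1}
        gamma = gamma_new
    return y, np.array(deltas)

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6)) + 4.0 * np.eye(6)   # regular, nonsymmetric
b = rng.standard_normal(6)
y, deltas = cgne(A, b)
```

The recorded step norms δ_i are exactly the quantities entering the error representation of Lemma 3.1 below: the increments y_i − y_{i−1} are mutually orthogonal, so partial sums of the δ_i² yield computable lower bounds for the squared iteration error.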


Lemma 3.1 Representation of the iteration error

Let ε_i := ‖y − y_i‖² be the square of the CGNE iteration error with respect
to the i-th iterate. Then, there holds

ε_i = Σ_{j=i}^{n−1} δ_{j+1}² .   (1.43)

Proof. CGNE has the Galerkin orthogonality

(y_i − y_0, y_{i+m} − y_i) = 0 ,  m ∈ ℕ .   (1.44)

Setting m = 1, this implies the orthogonal decomposition

‖y_{i+1} − y_0‖² = ‖y_{i+1} − y_i‖² + ‖y_i − y_0‖² ,   (1.45)

which readily gives

‖y_i − y_0‖² = Σ_{j=0}^{i−1} ‖y_{j+1} − y_j‖² = Σ_{j=0}^{i−1} δ_{j+1}² .   (1.46)

On the other hand, observing y_n = y, for m = n − i the Galerkin orthogonality
yields

‖y − y_0‖² = ‖y − y_i‖² + ‖y_i − y_0‖² ,   (1.47)

i.e., Σ_{j=0}^{n−1} δ_{j+1}² = ε_i + Σ_{j=0}^{i−1} δ_{j+1}², which is (1.43).

Computable lower bound for the iteration error

It follows readily from Lemma 3.1 that the computable quantity

[ε_i] := Σ_{j=i}^{i+m−1} δ_{j+1}² ≤ ε_i ,  m ∈ ℕ ,   (1.48)

provides a lower bound for the iteration error.

In practice, we will test the relative error norm according to

ε̄_i := √([ε_i]) / ‖y_i‖ ≈ ‖y − y_i‖ / ‖y_i‖ ≤ δ ,   (1.49)

where δ is a user specified accuracy.


3.1.2 Convergence of affine covariant inexact Newton methods

We denote by δx^k ∈ ℝ^n the result of an inner iteration, e.g., CGNE, for the
solution of (1.37). Then, it is easy to see that the iteration error δx^k − Δx^k
satisfies the error equation

F'(x^k) (δx^k − Δx^k) = F(x^k) + F'(x^k) δx^k =: r^k .   (1.50)

We will measure the impact of the inexact solution of (1.37) by the relative
error

δ_k := ‖δx^k − Δx^k‖ / ‖δx^k‖ .   (1.51)

Theorem 3.1 Affine covariant convergence theorem for the inexact
Newton method. Part I: Linear convergence
Suppose that F : D ⊂ ℝ^n → ℝ^n is continuously differentiable on D with
invertible Jacobians F'(x), x ∈ D. Assume further that the following
affine covariant Lipschitz condition is satisfied:

‖F'(z)^{-1} (F'(y) − F'(x)) v‖ ≤ ω ‖y − x‖ ‖v‖ ,  x, y, z ∈ D , v ∈ ℝ^n .   (1.52)

Assume that x^0 ∈ D is an initial guess for the outer Newton iteration and
that δx^k_0 = 0 is chosen as the start iterate for each inner iteration. Consider
the Kantorovich quantities

h_k := ω ‖Δx^k‖ ,  h̄_k := ω ‖δx^k‖ = h_k / √(1 + δ_k²)   (1.53)

associated with the outer and inner iteration.

Assume that

h̄_0 < 2Θ ,  Θ < 1 ,   (1.54)

and control the inner iterations according to

Θ(h_k, δ_k) := ((1/2) h_k + δ_k (1 + h_k)) / √(1 + δ_k²) ≤ Θ < 1 .   (1.55)

Note that a necessary condition for Θ(h_k, δ_k) ≤ Θ is h̄_k < 2Θ; for k = 0 this
is satisfied due to assumption (1.54).


Then, there holds:

(i) The Newton-CGNE iterates x^k, k ∈ ℕ₀, stay in

B(x^0, ρ) ,  ρ := ‖δx^0‖ / (1 − Θ) ,   (1.56)

and converge linearly to some x* ∈ B(x^0, ρ) with F(x*) = 0.

(ii) The exact Newton increments decrease monotonically according to

‖Δx^{k+1}‖ / ‖Δx^k‖ ≤ Θ ,   (1.57)

whereas for the inexact Newton increments we have

‖δx^{k+1}‖ / ‖δx^k‖ ≤ Θ √(1 + δ_k²) / √(1 + δ_{k+1}²) .   (1.58)

Proof. By elementary calculations we find

‖Δx^{k+1}‖ = ‖F'(x^{k+1})^{-1} F(x^{k+1})‖   (1.59)
 = ‖F'(x^{k+1})^{-1} (F(x^{k+1}) − F(x^k) − F'(x^k) δx^k) + F'(x^{k+1})^{-1} (F(x^k) + F'(x^k) δx^k)‖ ,

where F(x^k) + F'(x^k) δx^k = r^k = F'(x^k) (δx^k − Δx^k). Hence, with x^{k+1} = x^k + δx^k,

‖Δx^{k+1}‖ ≤ ∫_0^1 ‖F'(x^{k+1})^{-1} (F'(x^k + t δx^k) − F'(x^k)) δx^k‖ dt
 + ‖F'(x^{k+1})^{-1} F'(x^k) (δx^k − Δx^k)‖ =: I + II .


Using the affine covariant Lipschitz condition (1.52), the first term on the
right-hand side in (1.59) can be estimated according to

I ≤ ω ‖δx^k‖² ∫_0^1 t dt = (1/2) ω ‖δx^k‖² .   (1.60)

For the second term we obtain by the same argument

II = ‖F'(x^{k+1})^{-1} ((F'(x^k) − F'(x^{k+1})) (δx^k − Δx^k) + F'(x^{k+1}) (δx^k − Δx^k))‖   (1.61)
 ≤ ‖F'(x^{k+1})^{-1} (F'(x^{k+1}) − F'(x^k)) (δx^k − Δx^k)‖ + ‖δx^k − Δx^k‖
 ≤ ω ‖δx^k‖ ‖δx^k − Δx^k‖ + ‖δx^k − Δx^k‖ .

Combining (1.60) and (1.61) yields

‖Δx^{k+1}‖ / ‖δx^k‖ ≤ (1/2) ω ‖δx^k‖ + (1 + ω ‖δx^k‖) ‖δx^k − Δx^k‖ / ‖δx^k‖
 = (1/2) h̄_k + δ_k (1 + h̄_k) .

Observing (1.53), h̄_k ≤ h_k, and ‖δx^k‖ = ‖Δx^k‖ / √(1 + δ_k²), we finally get

‖Δx^{k+1}‖ / ‖Δx^k‖ ≤ ((1/2) h_k + δ_k (1 + h_k)) / √(1 + δ_k²) = Θ(h_k, δ_k) ≤ Θ < 1 ,   (1.62)

which implies the linear convergence (1.57).
For the contraction of the inexact Newton increments we get

‖δx^{k+1}‖ / ‖δx^k‖ = (√(1 + δ_k²) / √(1 + δ_{k+1}²)) · ‖Δx^{k+1}‖ / ‖Δx^k‖
 ≤ Θ √(1 + δ_k²) / √(1 + δ_{k+1}²) .   (1.63)

It can be easily shown that {x^k}_{k∈ℕ₀} is a Cauchy sequence in B(x^0, ρ). Consequently, there exists x* ∈ B(x^0, ρ) such that x^k → x* (k → ∞). Since
F'(x^k) δx^k = − F(x^k) + r^k  with  δx^k → 0 and r^k → 0 (k → ∞) ,
we conclude F(x*) = 0.


Theorem 3.2 Affine covariant convergence theorem for the inexact
Newton method. Part II: Quadratic convergence
Under the same assumptions on F : D ⊂ ℝ^n → ℝ^n as in Theorem 3.1, suppose
that the initial guess x^0 ∈ D satisfies

h_0 < 2 / (1 + ϑ)   (1.64)

for some appropriate ϑ > 0, and control the inner iterations such that

δ_k ≤ (ϑ/2) h_k / (1 + h_k) .   (1.65)

Then, there holds:

(i) The Newton-CGNE iterates x^k, k ∈ ℕ₀, stay in

B(x^0, ρ) ,  ρ := ‖δx^0‖ / (1 − (1 + ϑ) h_0 / 2) ,   (1.66)

and converge quadratically to some x* ∈ B(x^0, ρ) with F(x*) = 0.

(ii) The exact Newton increments and the inexact Newton increments
decrease quadratically according to

‖Δx^{k+1}‖ ≤ ((1 + ϑ)/2) ω ‖Δx^k‖² ,   (1.67)
‖δx^{k+1}‖ ≤ ((1 + ϑ)/2) ω ‖δx^k‖² .   (1.68)

Proof. We proceed as in the proof of Theorem 3.1 to obtain

‖Δx^{k+1}‖ / ‖Δx^k‖ ≤ Θ(h_k, δ_k) = ((1/2) h_k + δ_k (1 + h_k)) / √(1 + δ_k²)

and

‖δx^{k+1}‖ / ‖δx^k‖ = (√(1 + δ_k²) / √(1 + δ_{k+1}²)) · ‖Δx^{k+1}‖ / ‖Δx^k‖ .

In view of (1.65) we get the further estimates

‖Δx^{k+1}‖ / ‖Δx^k‖ ≤ ((1/2) h_k + (ϑ/2) h_k) / √(1 + δ_k²) ≤ ((1 + ϑ)/2) h_k   (1.67')

Num. Meth. Large-Scale Nonlinear Systems

24

and

‖δx^{k+1}‖ / ‖δx^k‖ ≤ ((1 + ϑ)/2) h_k / √(1 + δ_{k+1}²) ≤ ((1 + ϑ)/2) h_k ,

from which (1.67) and (1.68) follow by the definition of the Kantorovich quantities.
In order to deduce quadratic convergence we have to make sure that the initial
increments (k = 0) are small enough, i.e.,

((1 + ϑ)/2) h_0 < 1 ,   (1.69)

which holds true due to (1.64).
Furthermore, (1.68) and (1.69) allow us to show that the iterates x^k, k ∈ ℕ, stay
in B(x^0, ρ). Indeed, (1.68) implies

‖δx^j‖ ≤ ((1 + ϑ)/2) h_{j−1} ‖δx^{j−1}‖ ≤ ((1 + ϑ)/2) h_0 ‖δx^{j−1}‖ ,  j ∈ ℕ ,

and hence,

‖x^k − x^0‖ ≤ Σ_{j=0}^{k−1} ‖δx^j‖ ≤ ‖δx^0‖ Σ_{j=0}^∞ (((1 + ϑ)/2) h_0)^j
 = ‖δx^0‖ / (1 − (1 + ϑ) h_0/2) = ρ .

3.1.3 Algorithmic aspects of affine covariant inexact Newton methods

(i) Convergence monitor
Let us assume that the quantity Θ < 1 in both the linear convergence mode
and the quadratic convergence mode has been specified, and let us further
assume that we use CGNE with δx^k_0 = 0 in the inner iteration.
Then, (1.58) suggests the monotonicity test

Θ̄_k := √((1 + [δ_{k+1}²]) / (1 + [δ_k²])) · ‖δx^{k+1}‖ / ‖δx^k‖ ≤ Θ ,   (1.70)

where [δ_k²] and [δ_{k+1}²] are computationally available estimates of δ_k² and δ_{k+1}².

(ii) Termination criterion

We recall that the termination criterion for the exact Newton iteration with
respect to a user specified accuracy XTOL is given by

‖Δx^k‖ / (1 − Θ_{k−1}²) ≤ XTOL .


According to (1.53) we have

‖Δx^k‖ = √(1 + δ_k²) ‖δx^k‖ .

Consequently, replacing Θ_{k−1} and δ_k by the computable quantities Θ̄_{k−1} and
[δ_k²], we arrive at the termination criterion

√(1 + [δ_k²]) ‖δx^k‖ / (1 − Θ̄_{k−1}²) ≤ XTOL .   (1.71)

(iii) Balancing outer and inner iterations

According to (1.55) of Theorem 3.1, in the linear convergence mode the
adaptive termination criterion for the inner iteration is

Θ(h_k, δ_k) = ((1/2) h_k + δ_k (1 + h_k)) / √(1 + δ_k²) ≤ Θ < 1 .

On the other hand, in view of (1.65) of Theorem 3.2, in the quadratic convergence mode the termination criterion is

δ_k ≤ (ϑ/2) h_k / (1 + h_k) .

Since the theoretical Kantorovich quantities (cf. (1.53))

h_k = ω ‖Δx^k‖ ,  h̄_k = ω ‖δx^k‖ = h_k / √(1 + δ_k²)

are not directly accessible, we have to replace them by computationally available estimates [h_k].
We recall that for h_k we have the a priori estimate
[h_k] := 2 Θ̄_{k−1}² ≤ h_k .
Consequently, replacing δ_k by [δ_k²] and Θ_{k−1} by Θ̄_{k−1} (cf. (1.70)), we
get the a priori estimates

[h̄_k] := [h_k] / √(1 + [δ_k²]) ,  [h_k] := 2 Θ̄_{k−1}² ,  k ∈ ℕ .   (1.72)

For k = 0, we choose δ_0 = 1/4.
In practice, for k ≥ 1 we begin with the quadratic convergence mode and switch
to the linear convergence mode as soon as the approximate contraction factor
Θ̄_k drops below some prespecified threshold value Θ ≤ 1/2.
(iii)1 Quadratic convergence mode
The computationally realizable termination criterion for the inner iteration in the quadratic convergence mode is

δ_k ≤ (ϑ/2) [h̄_k] / (1 + [h̄_k]) .   (1.73)

Inserting (1.72) into (1.73), we obtain a simple nonlinear equation in δ_k.


Remark 3.1 Validity of the approximate termination criterion
Observing that the right-hand side in (1.73) is a monotonically increasing function of [h̄_k], and taking [h̄_k] ≤ h̄_k ≤ h_k into account, it follows that the
approximate termination criterion (1.73) implies the exact termination criterion
(1.65).
Remark 3.2 Computational work in the quadratic convergence mode
Since δ_k → 0 (k → ∞) is enforced, it follows that:
the more the iterates x^k approach the solution x*, the more computational work is required for the inner iterations to guarantee quadratic
convergence of the outer iteration.
(iii)2 Linear convergence mode
We switch to the linear convergence mode once the criterion

Θ̄_k < Θ   (1.74)

is met.
The computationally realizable termination criterion for the inner iteration in the linear convergence mode is

[Θ(h_k, δ_k)] := Θ([h_k], δ_k) = ((1/2) [h_k] + δ_k (1 + [h_k])) / √(1 + δ_k²) ≤ Θ .   (1.75)

Remark 3.3 Validity of the approximate termination criterion

Since the right-hand side in (1.75) is a monotonically increasing function of [h_k]
and [h_k] ≤ h_k, the estimate provided by (1.75) may be too small and thus result
in an overestimation of the admissible δ_k. However, since the exact quantities and their a
priori estimates both tend to zero as k approaches infinity, asymptotically we
may rely on (1.75).

In practice, we require the monotonicity test (1.70) in CGNE and run the
inner iterations until δ_k satisfies (1.75) or divergence occurs, i.e.,
Θ̄_k > 1/2 .

Remark 3.4 Computational work in the linear convergence mode

As opposed to the quadratic convergence mode, we observe:
the more the iterates x^k approach the solution x*, the less computational work is required for the inner iterations to guarantee linear
convergence of the outer iteration.
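A deliberately simplified Newton-CGNE driver illustrating the outer/inner structure (the residual-based inner tolerance below is a crude stand-in for the adaptive [h̄_k]-control above, and the monotone test problem is our own):

```python
import numpy as np

def cgne_solve(A, b, rel_tol, imax=200):
    """CGNE for Ay = b, stopped on the relative residual (simplified inner control)."""
    y = np.zeros_like(b)
    r = b.copy()
    gamma = r @ r
    p = np.zeros_like(b)
    beta = 0.0
    b2 = b @ b
    for _ in range(imax):
        if gamma <= rel_tol**2 * b2:
            break
        p = A.T @ r + beta * p
        alpha = gamma / (p @ p)
        y = y + alpha * p
        r = r - alpha * (A @ p)
        gamma_new = r @ r
        beta = gamma_new / gamma
        gamma = gamma_new
    return y

def newton_cgne(F, dF, x0, ftol=1e-10, kmax=50):
    """Inexact Newton method; the inner tolerance is tightened with ||F(x^k)||."""
    x = np.array(x0, dtype=float)
    for _ in range(kmax):
        Fx = F(x)
        if np.linalg.norm(Fx) <= ftol:
            break
        eta = min(0.1, np.linalg.norm(Fx))      # crude forcing term
        x = x + cgne_solve(dF(x), -Fx, eta)
    return x

# Monotone test problem with unique root x* = 0 (our own choice).
M  = np.array([[3.0, 1.0, 0.0], [-1.0, 4.0, 1.0], [0.0, -1.0, 3.0]])
F  = lambda x: M @ x + x**3
dF = lambda x: M + np.diag(3.0 * x**2)

x_star = newton_cgne(F, dF, [0.5, -0.4, 0.3])
```

Tightening the inner tolerance as ‖F(x^k)‖ decreases mirrors the behaviour described in Remark 3.2: the closer the outer iteration gets to x*, the more accurately the linear systems are solved.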


3.2 Affine Contravariant Inexact Newton Methods

3.2.1 GMRES (Generalized Minimum RESidual)
The Generalized Minimum RESidual method (GMRES) is an iterative
solver for nonsymmetric linear algebraic systems which generates an orthonormal basis of the Krylov subspace

K_i(r_0, A) := span{r_0, A r_0, ..., A^{i−1} r_0}   (1.76)

by a modified Gram-Schmidt orthogonalization called the Arnoldi method.

The inner product coefficients are stored in an upper Hessenberg matrix
so that an approximate solution can be obtained by the solution of a least squares problem in terms of that Hessenberg matrix:

GMRES Initialization:
Given an initial guess y_0 ∈ ℝ^n, compute the residual r_0 = b − A y_0 and set

β := ‖r_0‖ ,  v_1 := r_0 / β ,  V_1 := (v_1) .   (1.77)

GMRES Iteration Loop: For 1 ≤ i ≤ i_max:

I. Orthogonalization:

ṽ_{i+1} = A v_i − V_i h_i ,   (1.78)
where h_i = V_iᵀ A v_i .   (1.79)

II. Normalization:

v_{i+1} = ṽ_{i+1} / ‖ṽ_{i+1}‖ .   (1.80)

III. Update:

V_{i+1} = (V_i , v_{i+1}) ,   (1.81)

H̄_1 = ( h_1 ; ‖ṽ_2‖ ) ,  i = 1 ,   (1.82)
H̄_i = ( H̄_{i−1}  h_i ; 0  ‖ṽ_{i+1}‖ ) ,  i > 1 ,   (1.83)

i.e., in step i the column (h_iᵀ, ‖ṽ_{i+1}‖)ᵀ is appended to the upper Hessenberg
matrix H̄_{i−1}, padded from below by a zero row.


IV. Least squares problem: Compute z_i as the solution of

‖β e_1 − H̄_i z_i‖ = min_{z ∈ ℝ^i} ‖β e_1 − H̄_i z‖ .   (1.84)

V. Approximate solution:

y_i = V_i z_i + y_0 .   (1.85)

GMRES has the residual norm minimizing property

‖b − A y_i‖ = min { ‖b − A v‖ | v ∈ y_0 + K_i(r_0, A) } .   (1.86)

Moreover, the inner residuals decrease monotonically:

‖r_{i+1}‖ ≤ ‖r_i‖ ,  i ∈ ℕ₀ .   (1.87)

Termination criterion for the GMRES iteration

The residuals satisfy the orthogonality relation

(r_i, r_i − r_0) = 0 ,  i ∈ ℕ ,   (1.88)

from which we readily deduce

‖r_0‖² = ‖r_i − r_0‖² + ‖r_i‖² ,  i ∈ ℕ .   (1.89)

We define the relative residual norm error

ε_i := ‖r_i‖ / ‖r_0‖ .   (1.90)

Clearly, ε_i < 1, i ∈ ℕ, and

ε_{i+1} < ε_i  if ε_i ≠ 0 .   (1.91)

Consequently, given a user specified accuracy δ, an appropriate adaptive
termination criterion is

ε_i ≤ δ .   (1.92)

We note that, in terms of ε_i, (1.89) can be written as

‖r_i − r_0‖² = (1 − ε_i²) ‖r_0‖² .   (1.93)
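The scheme (1.77)-(1.85) transcribed into Python (a sketch; `numpy.linalg.lstsq` takes the role of the small least squares problem (1.84), which production codes solve incrementally by Givens rotations):

```python
import numpy as np

def gmres(A, b, imax=50, tol=1e-12):
    """Unrestarted GMRES with modified Gram-Schmidt Arnoldi, y_0 = 0."""
    beta = np.linalg.norm(b)
    if beta == 0.0:
        return np.zeros_like(b)
    V = [b / beta]                            # v_1 = r_0 / ||r_0||
    H = np.zeros((imax + 1, imax))            # upper Hessenberg matrix
    for i in range(imax):
        w = A @ V[i]
        for j in range(i + 1):                # h_i = V_i^T A v_i, then orthogonalize
            H[j, i] = V[j] @ w
            w = w - H[j, i] * V[j]
        H[i + 1, i] = np.linalg.norm(w)
        # least squares problem || beta e_1 - H_i z || = min
        e1 = np.zeros(i + 2)
        e1[0] = beta
        z, *_ = np.linalg.lstsq(H[:i + 2, :i + 1], e1, rcond=None)
        res = np.linalg.norm(e1 - H[:i + 2, :i + 1] @ z)
        if res <= tol * beta or H[i + 1, i] == 0.0:
            break
        V.append(w / H[i + 1, i])             # v_{i+1}
    return np.column_stack(V[:z.size]) @ z    # y_i = V_i z_i (+ y_0 = 0)

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 8)) + 5.0 * np.eye(8)
b = rng.standard_normal(8)
y = gmres(A, b)
```

The least squares residual in each step equals ‖b − A y_i‖, so the relative residual ε_i of (1.90) is available at no extra cost — this is what makes GMRES the natural inner solver in the affine contravariant setting.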


3.2.2 Convergence of affine contravariant inexact Newton methods

We denote by δx^k ∈ ℝ^n the result of the inner GMRES iteration. As initial
values for GMRES we choose

δx^k_0 = 0 ,  r_0^k = − F(x^k) .   (1.94)

Consequently, during the inner GMRES iteration the relative error ε_i, i ∈ ℕ₀,
in the residuals satisfies

ε_i = ‖r_i^k‖ / ‖F(x^k)‖ ≤ 1 ,  ε_{i+1} < ε_i if ε_i ≠ 0 .   (1.95)

In the sequel, we drop the subindices i for the inner iterations and refer to ε_k
as the final value of the inner iterations at each outer iteration step k.
Theorem 3.3 Affine contravariant convergence theorem for the inexact Newton GMRES method. Part I: Linear convergence
Suppose that F : D ⊂ ℝ^n → ℝ^n is continuously differentiable on D and let
x^0 ∈ D be some initial guess. Let further the following affine contravariant
Lipschitz condition be satisfied:

‖(F'(y) − F'(x)) (y − x)‖ ≤ ω ‖F'(x) (y − x)‖² ,  x, y ∈ D ,  ω ≥ 0 .   (1.96)

Assume further that the level set

L_0 := {x ∈ ℝ^n | ‖F(x)‖ ≤ ‖F(x^0)‖}   (1.97)

is a compact subset of D.
In terms of the Kantorovich quantities

h_k := ω ‖F(x^k)‖ ,  k ∈ ℕ₀ ,   (1.98)

the outer residual norms can be bounded according to

‖F(x^{k+1})‖ ≤ (ε_k + (1/2) (1 − ε_k²) h_k) ‖F(x^k)‖ .   (1.99)

Assume that

h_0 < 2   (1.100)

and control the inner iterations according to

ε_k ≤ Θ − (1/2) h_k ,   (1.101)

for some h_0/2 < Θ < 1.


Then, the Newton GMRES iterates x^k, k ∈ ℕ₀, stay in L_0 and converge
linearly to some x* ∈ L_0 with F(x*) = 0 at an estimated rate

‖F(x^{k+1})‖ ≤ Θ ‖F(x^k)‖ .   (1.102)

Proof. We recall that the Newton GMRES iterates satisfy

F'(x^k) δx^k = − F(x^k) + r^k ,   (1.103)
x^{k+1} = x^k + δx^k ,   (1.104)

where r^k denotes the final inner residual with ‖r^k‖ = ε_k ‖F(x^k)‖.
It follows from the generalized mean value theorem that

F(x^{k+1}) = F(x^k) + ∫_0^1 F'(x^k + t δx^k) δx^k dt .   (1.105)

Consequently, replacing F(x^k) in (1.105) by (1.103), we obtain

‖F(x^{k+1})‖ = ‖∫_0^1 (F'(x^k + t δx^k) − F'(x^k)) δx^k dt + r^k‖
 ≤ ∫_0^1 ‖(F'(x^k + t δx^k) − F'(x^k)) δx^k‖ dt + ‖r^k‖
 ≤ (ω/2) ‖F'(x^k) δx^k‖² + ‖r^k‖ = (ω/2) ‖F(x^k) − r^k‖² + ‖r^k‖ .

We recall (1.93), which in the present notation reads

‖F(x^k) − r^k‖² = (1 − ε_k²) ‖F(x^k)‖² ,

from which (1.99) can be immediately deduced.
Now, in view of (1.101), (1.99) yields

‖F(x^{k+1})‖ ≤ (ε_k + (1/2) h_k) ‖F(x^k)‖ ≤ Θ ‖F(x^k)‖ ,

where we have used 1 − ε_k² ≤ 1.
Taking advantage of the previous inequality, by induction on k it follows that

x^k ∈ L_0 ⊂ D ,  k ∈ ℕ₀ .

Hence, there exist a subsequence ℕ' ⊂ ℕ₀ and an x* ∈ L_0 such that x^k → x* (k ∈ ℕ') and F(x*) = 0. Moreover, since

‖F(x^{k+ℓ}) − F(x^k)‖ ≤ ‖F(x^{k+ℓ})‖ + ‖F(x^k)‖ ≤ (1 + Θ^ℓ) ‖F(x^k)‖
 ≤ (1 + Θ^ℓ) Θ^k ‖F(x^0)‖ → 0  (k → ∞) ,

the whole sequence must converge to x*.

Theorem 3.4 Affine contravariant convergence theorem for the inexact Newton GMRES method. Part II: Quadratic convergence
Under the same assumptions on F : D ⊂ ℝ^n → ℝ^n as in Theorem 3.3, suppose
that the initial guess x^0 ∈ D satisfies

h_0 < 2 / (1 + ϑ)   (1.106)

for some appropriate ϑ > 0 and control the inner iterations such that

ε_k / (1 − ε_k²) ≤ (ϑ/2) h_k .   (1.107)

Then, the Newton GMRES iterates x^k, k ∈ ℕ₀, stay in L_0 and converge
quadratically to some x* ∈ L_0 with F(x*) = 0 at an estimated rate

‖F(x^{k+1})‖ ≤ (ω/2) (1 + ϑ) (1 − ε_k²) ‖F(x^k)‖² .   (1.108)

Proof. Inserting (1.107) into (1.99) and observing h_k = ω ‖F(x^k)‖ gives the
assertion.

3.2.3 Algorithmic aspects of affine contravariant inexact Newton
methods
(i) Convergence monitor
Throughout the inexact Newton GMRES iteration we use the residual
monotonicity test

  Θ_k := ‖F(x^{k+1})‖ / ‖F(x^k)‖ < 1 .   (1.109)

The iteration is considered as divergent, if

  Θ_k ≥ 1 .   (1.110)


(ii) Termination criterion
As in the exact Newton iteration, specifying a residual accuracy FTOL, the
termination criterion for the inexact Newton GMRES iteration is

  ‖F(x^k)‖ ≤ FTOL .   (1.111)

(iii) Balancing outer and inner iterations
With regard to (1.101) of Theorem 3.3, in the linear convergence mode the
adaptive termination criterion for the inner GMRES iteration is

  η_k ≤ Θ − (1/2) h_k ,

whereas, in view of (1.107) of Theorem 3.4, in the quadratic convergence
mode the termination criterion is

  η_k/(1 − η_k²) ≤ (ρ/2) h_k .

Again, we replace the theoretical Kantorovich quantities h_k by some
computationally easily available a priori estimates. We distinguish between the
quadratic and the linear convergence mode:
(iii)₁ Quadratic convergence mode
We recall the termination criterion (1.107) for the quadratic convergence mode

  η_k/(1 − η_k²) ≤ (ρ/2) h_k .

It suggests the a posteriori estimate

  [h_k]₂ := 2 η_k / ( (1 + ρ)(1 − η_k²) ) ≤ h_k .

In view of h_{k+1} = Θ_k h_k, this implies the a priori estimate

  [h_{k+1}] := Θ_k [h_k]₂ ≤ Θ_k h_k = h_{k+1} .   (1.112)

Using (1.112) in (1.107) results in the computationally feasible termination
criterion

  η_k/(1 − η_k²) ≤ (ρ/2) [h_k] ,   ρ ≈ 1.0 .   (1.113)


(iii)₂ Linear convergence mode
We switch from the quadratic to the linear convergence mode, if the local
contraction factor satisfies

  Θ_k < Θ̄   (1.114)

for some prespecified threshold Θ̄.
The proof of the previous theorems reveals

  ‖F(x^{k+1}) − r^k‖ ≤ (ω/2) ‖F(x^k) − r^k‖² = (1/2)(1 − η_k²) h_k ‖F(x^k)‖ .   (1.115)

The above inequality (1.115) implies the a posteriori estimate

  [h_k]₁ := 2 ‖F(x^{k+1}) − r^k‖ / ( (1 − η_k²) ‖F(x^k)‖ ) ≤ h_k   (1.116)

and the a priori estimate

  [h_{k+1}] := Θ_k [h_k]₁ ≤ h_{k+1} .   (1.117)

Based on (1.117) we define

  η̄_{k+1} := (1/2) [h_{k+1}] .   (1.118)

If we find

  η̄_{k+1} < η_{k+1}   (1.119)

with η_{k+1} from (1.113), we continue the iteration in the quadratic
convergence mode.
Otherwise, we realize the linear convergence mode with some

  η_{k+1} ≥ η̄_{k+1} .   (1.120)
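The tolerance rules of both modes are purely scalar and cheap to evaluate; the sketch below shows one possible Python realization (the function names and the closed-form solution of (1.113) for η_k are our own, not from the text):

```python
import math

def eta_linear(h_est, theta):
    """Linear mode, cf. (1.101): eta_k <= Theta - h_k/2 (clipped at 0)."""
    return max(0.0, theta - 0.5 * h_est)

def eta_quadratic(h_est, rho=1.0):
    """Quadratic mode, cf. (1.113): eta/(1 - eta^2) = (rho/2)*[h_k].
    The equation t*eta^2 + eta - t = 0, t = rho*h/2, is solved for the
    root eta in (0, 1) by the quadratic formula."""
    t = 0.5 * rho * h_est
    if t == 0.0:
        return 0.0
    return (math.sqrt(1.0 + 4.0 * t * t) - 1.0) / (2.0 * t)
```

Note that eta_quadratic shrinks roughly proportionally to [h_k], i.e., the inner GMRES iteration has to be solved more and more accurately as the outer residuals decrease.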


3.3 Affine Conjugate Inexact Newton Methods
3.3.1 PCG (Preconditioned Conjugate Gradient)
The Preconditioned Conjugate Gradient Method (PCG) is an iterative
solver for linear algebraic systems with a symmetric positive definite coefficient
matrix A ∈ ℝⁿˣⁿ. We recall that any symmetric positive definite matrix C ∈
ℝⁿˣⁿ defines an energy inner product (·,·)_C according to

  (u, v)_C := (u, Cv) ,   u, v ∈ ℝⁿ .

The associated energy norm is denoted by ‖·‖_C.
The PCG Method with a symmetric positive definite preconditioner B ∈
ℝⁿˣⁿ corresponds to the CG Method applied to the transformed linear algebraic
system

  B^{1/2} A B^{1/2} (B^{-1/2} y) = B^{1/2} b .

The PCG Method is implemented as follows:
PCG Initialization:
Given an initial guess y₀ ∈ ℝⁿ, compute the residual r₀ = b − Ay₀ and the
preconditioned residual r̄₀ = Br₀ and set

  p₀ := r̄₀ ,   σ₀ := (r̄₀, r₀) = ‖r₀‖²_B .

PCG Iteration Loop: For 0 ≤ i ≤ imax compute:

  γ_i = ‖p_i‖²_A / σ_i ,
  y_{i+1} = y_i + (1/γ_i) p_i ,   ε_i² = σ_i/γ_i   (= ‖y_{i+1} − y_i‖²_A) ,
  r_{i+1} = r_i − (1/γ_i) A p_i ,   r̄_{i+1} = B r_{i+1} ,
  σ_{i+1} = (r̄_{i+1}, r_{i+1}) = ‖r_{i+1}‖²_B ,
  p_{i+1} = r̄_{i+1} + (σ_{i+1}/σ_i) p_i .
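The initialization and loop above can be sketched compactly in Python; dense matrices are nested lists and B is an explicit SPD approximation of A⁻¹ (the helper names and the small driver are our own):

```python
def matvec(M, v):
    # dense matrix-vector product
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pcg(A, b, B, y0, imax=100, tol=1e-14):
    """PCG for A y = b, A SPD, with SPD preconditioner B ~ A^{-1}.
    Mirrors the loop in the text: gamma_i = ||p_i||_A^2 / sigma_i and
    eps_i^2 = sigma_i / gamma_i = ||y_{i+1} - y_i||_A^2."""
    y = list(y0)
    r = [bi - ai for bi, ai in zip(b, matvec(A, y))]    # r_0 = b - A y_0
    rbar = matvec(B, r)                                 # rbar_0 = B r_0
    p = list(rbar)
    sigma = dot(rbar, r)                                # sigma_0 = ||r_0||_B^2
    for _ in range(imax):
        if sigma < tol:
            break
        Ap = matvec(A, p)
        gamma = dot(p, Ap) / sigma                      # gamma_i
        y = [yi + pi / gamma for yi, pi in zip(y, p)]   # y_{i+1}
        r = [ri - api / gamma for ri, api in zip(r, Ap)]
        rbar = matvec(B, r)
        sigma_new = dot(rbar, r)                        # sigma_{i+1}
        p = [rb + (sigma_new / sigma) * pi for rb, pi in zip(rbar, p)]
        sigma = sigma_new
    return y
```

With B = I this reduces to plain CG; any SPD B reshapes the Krylov space without changing the minimization property discussed next.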


PCG minimizes the energy error norm

  ‖y − y_i‖_A = min_{z ∈ K_i(r̄₀, A)} ‖y − z‖_A ,   (1.121)

where K_i(r̄₀, A) denotes the Krylov subspace

  K_i(r̄₀, A) := span{r̄₀, ..., A^{i−1} r̄₀} .   (1.122)

PCG satisfies the Galerkin orthogonality

  (y_i − y₀, y_{i+m} − y_i)_A = 0 ,   m ∈ ℕ .   (1.123)

Denoting by y* ∈ ℝⁿ the unique solution of Ay = b and by ε_i := ‖y* − y_i‖²_A
the square of the iteration error in the energy norm, we have the following error
representation:
Lemma 3.2 Representation of the iteration error
The PCG iteration error satisfies

  ε_i = Σ_{j=i}^{n−1} ε_j² ,   (1.124)

with the increments ε_j² = ‖y_{j+1} − y_j‖²_A = σ_j/γ_j from the PCG loop.

Proof. For m = 1 the Galerkin orthogonality implies the orthogonal
decompositions

  ‖y_{i+1} − y₀‖²_A = ‖y_{i+1} − y_i‖²_A + ‖y_i − y₀‖²_A ,   ‖y_{i+1} − y_i‖²_A = ε_i² ,   (1.125)

and hence, recursively,

  ‖y_i − y₀‖²_A = Σ_{j=0}^{i−1} ‖y_{j+1} − y_j‖²_A = Σ_{j=0}^{i−1} ε_j² .   (1.126)

On the other hand, observing y_n = y*, for m = n − i the Galerkin orthogonality
yields

  ‖y* − y₀‖²_A = ‖y* − y_i‖²_A + ‖y_i − y₀‖²_A ,   (1.127)

i.e., Σ_{j=0}^{n−1} ε_j² = ε_i + Σ_{j=0}^{i−1} ε_j², from which (1.124) follows.
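The identity (1.124) can be checked numerically with a few lines of plain CG (B = I); the 2×2 driver below and all names are illustrative. Since CG terminates after n steps in exact arithmetic, the sum of all recorded increments must equal the initial squared energy error:

```python
def cg_with_errors(A, b, y0, steps):
    """Plain CG (B = I) on a 2x2 SPD system, recording the energy increments
    eps_j^2 = sigma_j / gamma_j = ||y_{j+1} - y_j||_A^2, cf. Lemma 3.2."""
    def mv(v):
        return [A[0][0] * v[0] + A[0][1] * v[1],
                A[1][0] * v[0] + A[1][1] * v[1]]
    def dot(u, v):
        return u[0] * v[0] + u[1] * v[1]
    y = list(y0)
    r = [b[0] - mv(y)[0], b[1] - mv(y)[1]]
    p = list(r)
    sigma = dot(r, r)
    eps2 = []
    for _ in range(steps):
        Ap = mv(p)
        gamma = dot(p, Ap) / sigma
        eps2.append(sigma / gamma)
        y = [y[0] + p[0] / gamma, y[1] + p[1] / gamma]
        r = [r[0] - Ap[0] / gamma, r[1] - Ap[1] / gamma]
        sigma_new = dot(r, r)
        p = [r[0] + (sigma_new / sigma) * p[0],
             r[1] + (sigma_new / sigma) * p[1]]
        sigma = sigma_new
    return y, eps2
```

For y₀ = 0 one has ε₀ = ‖y*‖²_A = (y*, b), so the check needs no explicit y*.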


Computable lower bound for the iteration error
A computable lower bound for the iteration error in the energy norm is obviously
given by

  [ε_i] := Σ_{j=i}^{i+m} ε_j² ≤ ε_i .   (1.128)

In the inexact Newton PCG method we will control the inner PCG iterations
by the relative energy error norms

  ε̄_i := √([ε_i]) / ‖y_i‖_A ≈ ‖y* − y_i‖_A / ‖y_i‖_A   (1.129)

and use the termination criterion

  ε̄_i ≤ δ ,   (1.130)

where δ is a user specified accuracy.


3.3.2 Convergence of affine conjugate inexact Newton methods
We denote by δx^k ∈ ℝⁿ the result of the inner PCG iteration, the exact Newton
increment being Δx^k. As initial value for PCG we choose

  δx^k_0 = 0 .   (1.131)

Again, we will drop the subindices i for the inner PCG iterations and refer to
δx^k as the final value of the inner iterations at each outer iteration step k. We
recall the Galerkin orthogonality (cf. (1.123))

  (δx^k, F'(x^k)(Δx^k − δx^k)) = −(δx^k, r^k) = 0 .   (1.132)

Theorem 3.5 Affine conjugate convergence theorem for the inexact
Newton PCG method. Part I: Linear convergence
Suppose that f : D ⊂ ℝⁿ → ℝ is a twice continuously differentiable strictly
convex functional on D with first derivative F := f' and Hessian F' = f'',
which is symmetric and uniformly positive definite. Assume that x⁰ ∈ D is some
initial guess such that the level set

  L₀ := {x ∈ D | f(x) ≤ f(x⁰)}

is compact.
Let further the following affine conjugate Lipschitz condition be satisfied

  ‖F'(z)^{-1/2}(F'(y) − F'(x)) v‖ ≤ ω ‖F'(x)^{1/2}(y − x)‖ ‖F'(x)^{1/2} v‖ ,
  x, y, z ∈ D ,   ω ≥ 0 .   (1.133)

For the inner Newton PCG iterations consider the exact error terms

  ε_k := ‖F'(x^k)^{1/2} Δx^k‖²

and the Kantorovich quantities

  h_k := ω ‖F'(x^k)^{1/2} Δx^k‖

as well as their inexact analogues

  ε̄_k := ‖F'(x^k)^{1/2} δx^k‖² = ε_k / (1 + δ_k²)

and

  h̄_k := ω ‖F'(x^k)^{1/2} δx^k‖ = h_k / √(1 + δ_k²) ,

where δ_k characterizes the inner PCG iteration error

  δ_k := ‖F'(x^k)^{1/2}(Δx^k − δx^k)‖ / ‖F'(x^k)^{1/2} δx^k‖ .

Assume that for some Θ < 1

  h̄₀ ≤ 2Θ < 2   (1.134)

and that

  δ_{k+1} ≥ δ_k ,   k ∈ ℕ₀ ,   (1.135)

holds true throughout the outer Newton iterations.
Control the inner iterations according to

  ϑ(h̄_k, δ_k) := ( h̄_k + δ_k (h̄_k + √(4 + h̄_k²)) ) / ( 2 √(1 + δ_k²) ) ≤ Θ .   (1.136)

Then, the inexact Newton PCG iterates x^k, k ∈ ℕ₀, stay in L₀ and converge
linearly to some x* ∈ L₀ with f(x*) = min_{x∈D} f(x).
The following estimates hold true

  ‖F'(x^{k+1})^{1/2} Δx^{k+1}‖ ≤ Θ ‖F'(x^k)^{1/2} Δx^k‖ ,   k ∈ ℕ₀ ,   (1.137)
  ‖F'(x^{k+1})^{1/2} δx^{k+1}‖ ≤ Θ ‖F'(x^k)^{1/2} δx^k‖ ,   k ∈ ℕ₀ .   (1.138)

Moreover, the objective functional is reduced according to

  (1/2 − h̄_k/6) ε̄_k ≤ f(x^k) − f(x^{k+1}) ≤ (1/2 + h̄_k/6) ε̄_k .   (1.139)

Proof. Observing

  r^k = F(x^k) + F'(x^k) δx^k ,   k ∈ ℕ₀ ,

and the Galerkin orthogonality (δx^k, r^k) = 0, so that (δx^k, F(x^k)) = −ε̄_k,
we obtain for λ ∈ [0, 1]

  f(x^k + λδx^k) − f(x^k) = ∫₀^λ (δx^k, F(x^k + sδx^k)) ds   (1.140)
    = ∫₀^λ ∫₀^s (δx^k, F'(x^k + tδx^k) δx^k) dt ds + λ (δx^k, F(x^k))
    = ∫₀^λ ∫₀^s (δx^k, (F'(x^k + tδx^k) − F'(x^k)) δx^k) dt ds
      + (λ²/2) ε̄_k − λ ε̄_k .

By the affine conjugate Lipschitz condition the inner integrand is bounded in
modulus by t h̄_k ε̄_k, whence

  f(x^k + λδx^k) ≤ f(x^k) − λ ( 1 − λ/2 − (λ²/6) h̄_k ) ε̄_k .   (1.141)

Denoting by L_k the level set

  L_k := {x ∈ D | f(x) ≤ f(x^k)} ,

by induction on k we prove

  h̄_k < 2 and hence x^{k+1} ∈ L_k .   (1.142)

For k = 0, we have h̄₀ < 2 by assumption (1.134). Since h̄₀ ≤ h₀, (1.141) readily
shows f(x¹) < f(x⁰), whence x¹ ∈ L₀.
Now, assuming (1.142) to hold true for some k ∈ ℕ, again taking advantage of
h̄_k < 2, (1.141) yields f(x^{k+1}) < f(x^k) and thus x^{k+1} ∈ L_k.
Moreover, choosing λ = 1 in (1.141), we obtain the left-hand side of the
functional descent property (1.139). We note that we get the right-hand side of
(1.139), if in (1.140) we estimate by the other direction of the Cauchy-Schwarz
inequality.
Finally, in order to prove the contraction properties (1.137), (1.138) and linear
convergence, we estimate the local energy norms as follows:

  ‖F'(x^{k+1})^{1/2} Δx^{k+1}‖ = ‖F'(x^{k+1})^{-1/2} F'(x^{k+1}) Δx^{k+1}‖
    = ‖F'(x^{k+1})^{-1/2} F(x^{k+1})‖
    = ‖F'(x^{k+1})^{-1/2} (F(x^{k+1}) − F(x^k)) + F'(x^{k+1})^{-1/2} F(x^k)‖ .

Observing

  F(x^k) = r^k − F'(x^k) δx^k

and using the affine conjugate Lipschitz condition we obtain

  ‖F'(x^{k+1})^{1/2} Δx^{k+1}‖
    = ‖F'(x^{k+1})^{-1/2} ( ∫₀¹ (F'(x^k + tδx^k) − F'(x^k)) δx^k dt + r^k )‖   (1.143)
    ≤ (1/2) h̄_k ‖F'(x^k)^{1/2} δx^k‖ + ‖F'(x^{k+1})^{-1/2} r^k‖ .

Setting z := Δx^k − δx^k and observing r^k = −F'(x^k) z, for the second term on
the right-hand side of the previous inequality we get the implicit estimate

  ‖F'(x^{k+1})^{-1/2} r^k‖² ≤ ‖F'(x^k)^{1/2} z‖² + h̄_k ‖F'(x^k)^{1/2} z‖ ‖F'(x^{k+1})^{-1/2} r^k‖ ,

which gives the explicit bound

  ‖F'(x^{k+1})^{-1/2} r^k‖ ≤ (1/2)( h̄_k + √(4 + h̄_k²) ) ‖F'(x^k)^{1/2} z‖ .   (1.144)

Using (1.144) in (1.143) and observing

  ‖F'(x^k)^{1/2} z‖ = δ_k ‖F'(x^k)^{1/2} δx^k‖ ,
  ‖F'(x^k)^{1/2} Δx^k‖ = √(1 + δ_k²) ‖F'(x^k)^{1/2} δx^k‖ ,

we get the contraction factor estimate

  Θ_k := ‖F'(x^{k+1})^{1/2} Δx^{k+1}‖ / ‖F'(x^k)^{1/2} Δx^k‖ ≤ ϑ(h̄_k, δ_k) ,   (1.145)

which, taking (1.136) into account, proves (1.137) and linear convergence.


For the proof of (1.138) we observe

  ‖F'(x^ℓ)^{1/2} Δx^ℓ‖² = (1 + δ_ℓ²) ‖F'(x^ℓ)^{1/2} δx^ℓ‖² ,   ℓ = k, k+1 ,

as well as δ_{k+1} ≥ δ_k and obtain

  ‖F'(x^{k+1})^{1/2} δx^{k+1}‖ / ‖F'(x^k)^{1/2} δx^k‖
    ≤ √( (1 + δ_k²)/(1 + δ_{k+1}²) ) Θ_k ≤ Θ_k ≤ Θ .   (1.146)

By standard arguments we further show that the sequence {x^k}_{k∈ℕ₀} of inexact
Newton PCG iterates is a Cauchy sequence in L₀ and that there exists an x* ∈ L₀
such that x^k → x* (k → ∞) with F(x*) = 0.

Theorem 3.6 Affine conjugate convergence theorem for the inexact
Newton PCG method. Part II: Quadratic convergence
Under the same assumptions on F : D ⊂ ℝⁿ → ℝⁿ as in Theorem 3.5, suppose
that the initial guess x⁰ ∈ D satisfies

  h̄₀ < 2/(1 + ρ)   (1.147)

for some appropriate ρ > 0 and control the inner iterations such that

  δ_k ≤ ρ h̄_k / ( h̄_k + √(4 + h̄_k²) ) .   (1.148)

Then, there holds:
(i) The Newton PCG iterates x^k, k ∈ ℕ₀, stay in L₀ and converge
quadratically to some x* ∈ L₀ with F(x*) = 0.
(ii) The exact Newton increments and the inexact Newton increments
decrease quadratically according to

  ‖F'(x^{k+1})^{1/2} Δx^{k+1}‖ ≤ (1/2)(1 + ρ) ω ‖F'(x^k)^{1/2} Δx^k‖² ,   (1.149)
  ‖F'(x^{k+1})^{1/2} δx^{k+1}‖ ≤ (1/2)(1 + ρ) ω ‖F'(x^k)^{1/2} δx^k‖² .   (1.150)

Proof. Using (1.148) in (1.145) yields

  ‖F'(x^{k+1})^{1/2} Δx^{k+1}‖ / ‖F'(x^k)^{1/2} Δx^k‖
    ≤ ( h̄_k + δ_k (h̄_k + √(4 + h̄_k²)) ) / ( 2 √(1 + δ_k²) ) ≤ (1/2)(1 + ρ) h̄_k ,

which proves (1.149) in view of h̄_k = ω ‖F'(x^k)^{1/2} δx^k‖ ≤ ω ‖F'(x^k)^{1/2} Δx^k‖.
The proof of (1.150) follows along the same line by using (1.148) in (1.146).

3.3.3 Algorithmic aspects of the affine conjugate inexact Newton
PCG method
(i) Convergence monitor
Let us assume that the quantity Θ < 1 in both the linear convergence mode
and the quadratic convergence mode has been specified and let us further
assume that we use the start iterate δx^k_0 = 0 in the inner PCG iteration.
Denoting by ε̄_k an easily computable estimate of the squared energy norm of
the inexact increment, we accept a new iterate x^{k+1}, if the descent condition

  f(x^k) − f(x^{k+1}) ≥ (1/10) ε_k = (1/10)(1 + δ_k²) ε̄_k   (1.151)

or the monotonicity test

  Θ̄_k := ( (1 + δ_{k+1}²) ε̄_{k+1} )^{1/2} / ( (1 + δ_k²) ε̄_k )^{1/2} < 1   (1.152)

is satisfied. We consider the outer iteration as divergent, if neither (1.151) nor
(1.152) holds true.
(ii) Termination criterion
With respect to a user specified accuracy ETOL, the inexact Newton PCG
iteration will be terminated, if either

  ε_k = (1 + δ_k²) ε̄_k ≤ ETOL²   (1.153)

or

  f(x^k) − f(x^{k+1}) ≤ (1/2) ETOL² .   (1.154)

(iii) Balancing outer and inner iterations
For k = 0, we choose δ₀ = 1/4.
As in case of the inexact Newton CGNE iteration, for k ≥ 1 we begin with the
quadratic convergence mode and switch to the linear convergence mode as soon
as the approximate contraction factor Θ̄_k is below some prespecified threshold
value 1/2.
(iii)₁ Quadratic convergence mode
A computationally realizable termination criterion for the inner PCG
iteration in the quadratic convergence mode is given by

  δ_k ≤ [h̄_k] / ( [h̄_k] + √(4 + [h̄_k]²) ) ,   (1.155)

where [h̄_k] is an appropriate a priori estimate of the inexact Kantorovich
quantity h̄_k. In view of (1.139), we have the a posteriori estimates

  [h̄_k]₂ := (6/ε̄_k) | f(x^{k+1}) − f(x^k) + (1/2) ε̄_k | ≤ h̄_k   (1.156)

and

  [h_k]₂ := √(1 + δ_k²) [h̄_k]₂ .   (1.157)

We note that (1.157) yields the a priori estimate

  [h_{k+1}] := Θ̄_k [h_k]₂ .   (1.158)

Using (1.158) in (1.157), for the inexact Kantorovich quantity we obtain the
following a priori estimate

  [h̄_k] := [h_k] / √(1 + δ_k²) .   (1.159)

Inserting (1.159) into (1.155), we obtain a simple nonlinear equation in δ_k.
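This scalar equation is cheap to solve, for instance by fixed-point iteration; the sketch below assumes the a priori estimate [h_k] from (1.158) is at hand (function name and iteration count are our own):

```python
import math

def delta_quadratic(h_prior, iters=50):
    """Solve the equation obtained by inserting (1.159) into (1.155):
    delta = hbar / (hbar + sqrt(4 + hbar^2)),
    hbar  = h_prior / sqrt(1 + delta^2),
    for the inner tolerance delta by fixed-point iteration."""
    delta = 0.25                       # starting value, cf. delta_0 = 1/4
    for _ in range(iters):
        hbar = h_prior / math.sqrt(1.0 + delta * delta)
        delta = hbar / (hbar + math.sqrt(4.0 + hbar * hbar))
    return delta
```

The map is a strong contraction on (0, 1/2), so a handful of sweeps already gives δ_k to machine precision.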


Remark 3.5 Computational work in the quadratic convergence mode
Since δ_k → 0 (k → ∞) is enforced, it follows that:
The more the iterates x^k approach the solution x*, the more computational
work is required for the inner iterations to guarantee quadratic
convergence of the outer iteration.
(iii)₂ Linear convergence mode
We switch to the linear convergence mode, if

  Θ̄_k < 1/2   (1.160)

is satisfied.
The computationally realizable termination criterion for the inner
iteration in the linear convergence mode is

  [ϑ(h̄_k, δ_k)] := ϑ([h̄_k], δ_k) ≤ Θ .   (1.161)

Since asymptotically there holds

  δ_k → Θ/√(1 − Θ²)   (k → ∞) ,

we observe:
Remark 3.6 Computational work in the linear convergence mode
The more the iterates x^k approach the solution x*, the less computational
work is required for the inner iterations to guarantee linear
convergence of the outer iteration.


4. Quasi-Newton Methods
4.1 Introduction
Given F : D ⊂ ℝⁿ → ℝⁿ as well as x^k, x^{k+1} ∈ D, x^k ≠ x^{k+1}, the idea is to
approximate F locally around x^{k+1} by an affine function

  S_{k+1}(x) := F(x^{k+1}) + J_{k+1}(x − x^{k+1}) ,   J_{k+1} ∈ ℝⁿˣⁿ ,   (1.162)

such that

  S_{k+1}(x^k) = F(x^k) .   (1.163)

The requirement (1.163) gives rise to the so-called secant condition

  J Δx^k = y^k ,   Δx^k := x^{k+1} − x^k ,   y^k := F(x^{k+1}) − F(x^k) .   (1.164)

The matrix J is not uniquely determined by (1.164), since

  dim S_{k+1} = (n − 1) n ,   (1.165)

where

  S_{k+1} := {J ∈ ℝⁿˣⁿ | J Δx^k = y^k} .   (1.166)

There are different criteria to select an appropriate J ∈ S_{k+1}.


4.1.1 The "Good Broyden" rank 1 update
Let us consider the change in the affine model as given by

  S_{k+1}(x) − S_k(x) = (J_{k+1} − J_k)(x − x^k) .   (1.167)

An appropriate idea is to choose J_{k+1} ∈ S_{k+1} such that there is a least change
in the affine model in the sense

  ‖J_{k+1} − J_k‖_F = min_{J ∈ S_{k+1}} ‖J − J_k‖_F ,   (1.168)

where ‖·‖_F stands for the Frobenius norm (observe J = (J_{ik})_{i,k=1}^n)

  ‖J‖_F := ( Σ_{i,k=1}^n J_{ik}² )^{1/2} .   (1.169)

The solution of (1.168) can be heuristically motivated as follows: decompose
x − x^k into a multiple of Δx^k and a component t^k ⊥ Δx^k,

  x − x^k = α Δx^k + t^k ,   α ∈ ℝ .

Then, (1.167) reads

  S_{k+1}(x) − S_k(x) = α (J_{k+1} − J_k) Δx^k + (J_{k+1} − J_k) t^k
                      = α (y^k − J_k Δx^k) + (J_{k+1} − J_k) t^k .   (1.170)

Now, choose J_{k+1} ∈ S_{k+1} such that

  (J_{k+1} − J_k) t^k = 0 ,   t^k ⊥ Δx^k .

It follows that

  rank(J_{k+1} − J_k) = 1 ,   J_{k+1} − J_k = v^k (Δx^k)^T .   (1.171)

Inserting (1.171) into (1.170) yields

  v^k (Δx^k)^T Δx^k = y^k − J_k Δx^k ,

which results in

  v^k = ( y^k − J_k Δx^k ) / ( (Δx^k)^T Δx^k ) .

Altogether, this gives us Broyden's rank 1 update ("Good Broyden")

  J_{k+1} = J_k + [ F(x^{k+1}) − F(x^k) − J_k Δx^k ] (Δx^k)^T / ( (Δx^k)^T Δx^k ) .   (1.172)
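To illustrate (1.172): a minimal 2×2 quasi-Newton iteration in Python. The test problem, the Cramer's-rule solve, and all names are our own; note that for the quasi-Newton step J_k Δx^k = −F(x^k), the update numerator y^k − J_k Δx^k collapses to F(x^{k+1}):

```python
def good_broyden(F, x0, J0, tol=1e-12, kmax=50):
    """Quasi-Newton iteration with the 'Good Broyden' update (1.172), n = 2.
    The linear systems J dx = -F(x) are solved by Cramer's rule."""
    x = list(x0)
    J = [row[:] for row in J0]
    for _ in range(kmax):
        Fx = F(x)
        if max(abs(v) for v in Fx) < tol:
            break
        det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
        dx = [(-Fx[0] * J[1][1] + Fx[1] * J[0][1]) / det,
              (-Fx[1] * J[0][0] + Fx[0] * J[1][0]) / det]
        x = [x[0] + dx[0], x[1] + dx[1]]
        Fx_new = F(x)                  # equals y^k - J_k dx^k (secant residual)
        nrm2 = dx[0] ** 2 + dx[1] ** 2
        for i in range(2):
            for j in range(2):
                J[i][j] += Fx_new[i] * dx[j] / nrm2
    return x
```

Only one F-evaluation per step enters the update, which is the practical appeal of secant methods over Newton's method with exact Jacobians.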
For the solution of nonlinear systems, we are more interested in updates of the
inverse of J_k. Such an update can be provided by the Sherman-Morrison-
Woodbury formula

  (A + u v^T)^{-1} = A^{-1} − ( A^{-1} u v^T A^{-1} ) / ( 1 + v^T A^{-1} u ) .   (1.173)

Setting

  A := J_k ,   u := F(x^{k+1}) − F(x^k) − J_k Δx^k ,   v := Δx^k / ( (Δx^k)^T Δx^k ) ,

we obtain

  J_{k+1}^{-1} = J_k^{-1}
    + [ Δx^k − J_k^{-1}( F(x^{k+1}) − F(x^k) ) ] (Δx^k)^T J_k^{-1}
      / [ (Δx^k)^T J_k^{-1} ( F(x^{k+1}) − F(x^k) ) ] .   (1.174)
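A quick numerical sanity check of (1.173) for a rank 1 correction (2×2, pure Python; the matrices and names are illustrative only):

```python
def inv2(M):
    # inverse of a 2x2 matrix
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[ M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det,  M[0][0] / det]]

def sherman_morrison(Ainv, u, v):
    """(A + u v^T)^{-1} from A^{-1} via the rank 1 formula (1.173)."""
    Au = [Ainv[0][0] * u[0] + Ainv[0][1] * u[1],
          Ainv[1][0] * u[0] + Ainv[1][1] * u[1]]       # A^{-1} u
    vA = [v[0] * Ainv[0][0] + v[1] * Ainv[1][0],
          v[0] * Ainv[0][1] + v[1] * Ainv[1][1]]       # v^T A^{-1}
    denom = 1.0 + v[0] * Au[0] + v[1] * Au[1]          # 1 + v^T A^{-1} u
    return [[Ainv[i][j] - Au[i] * vA[j] / denom for j in range(2)]
            for i in range(2)]
```

The point of (1.174) is exactly this O(n²) cost: the updated inverse is obtained without any new factorization.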


4.1.2 The "Bad Broyden" rank 1 update
Instead of (1.168), an alternative is to choose J_{k+1} ∈ S_{k+1} such that there is a
least change in the solution of the affine model, i.e.,

  ‖J_{k+1}^{-1} − J_k^{-1}‖_F = min_{J ∈ S_{k+1}} ‖J^{-1} − J_k^{-1}‖_F .   (1.175)

Similar considerations as before lead us to Broyden's alternative rank 1
update ("Bad Broyden")

  J_{k+1}^{-1} = J_k^{-1}
    + [ Δx^k − J_k^{-1}( F(x^{k+1}) − F(x^k) ) ] ( F(x^{k+1}) − F(x^k) )^T
      / ‖F(x^{k+1}) − F(x^k)‖² .   (1.176)
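Since (1.176) updates the inverse directly, it can be carried along instead of J_k; a minimal 2×2 sketch (the test problem and names are our own, H playing the role of J_k⁻¹):

```python
def bad_broyden(F, x0, H0, tol=1e-12, kmax=60):
    """Quasi-Newton iteration carrying H ~ J^{-1} with the 'Bad Broyden'
    update (1.176), n = 2."""
    x = list(x0)
    H = [row[:] for row in H0]
    Fx = F(x)
    for _ in range(kmax):
        if max(abs(v) for v in Fx) < tol:
            break
        dx = [-(H[0][0] * Fx[0] + H[0][1] * Fx[1]),
              -(H[1][0] * Fx[0] + H[1][1] * Fx[1])]     # dx = -H F(x)
        x = [x[0] + dx[0], x[1] + dx[1]]
        Fx_new = F(x)
        dF = [Fx_new[0] - Fx[0], Fx_new[1] - Fx[1]]     # Delta F
        HdF = [H[0][0] * dF[0] + H[0][1] * dF[1],
               H[1][0] * dF[0] + H[1][1] * dF[1]]       # H Delta F
        nrm2 = dF[0] ** 2 + dF[1] ** 2
        for i in range(2):
            for j in range(2):
                H[i][j] += (dx[i] - HdF[i]) * dF[j] / nrm2
        Fx = Fx_new
    return x
```

Each step costs one F-evaluation and a few matrix-vector products; no linear solve is required at all.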

4.2 Affine covariant Quasi-Newton method
4.2.1 Affine covariant Quasi-Newton convergence theory
Affine covariant Quasi-Newton methods require the secant condition (1.164) to
be stated by means of affine covariant terms in the domain of definition of the
nonlinear mapping F.
Observing that we compute the Quasi-Newton increment Δx^k as the solution of

  J_k Δx^k = −F(x^k) ,   (1.177)

we can rewrite (1.164) according to

  (J_k − J) Δx^k = −F(x^{k+1}) .

Multiplication by J_k^{-1} yields the affine covariant secant condition

  Δx̄^{k+1} := (I − J_k^{-1} J) Δx^k = −J_k^{-1} F(x^{k+1}) ,   E_k(J) := I − J_k^{-1} J .   (1.178)

We note that any rank 1 update of the form

  J_{k+1} = J_k ( I − (Δx̄^{k+1} v^T)/(v^T Δx^k) ) ,   v ∈ ℝⁿ \ {0} ,   (1.179)

satisfies the affine covariant secant condition (1.178).
In particular, for v = Δx^k we recover the "Good Broyden" update.


Theorem 4.1 Properties of the affine covariant Quasi-Newton method
For Broyden's affine covariant rank 1 update ("Good Broyden")

  J_{k+1} = J_k ( I − (Δx̄^{k+1} (Δx^k)^T)/‖Δx^k‖² )   (1.180)

assume that the local contraction condition

  Θ_k = ‖Δx̄^{k+1}‖/‖Δx^k‖ < 1/2   (1.181)

is satisfied. Then, there holds:
(i) The update matrix J_{k+1} is a least change update in the sense that

  ‖E_k(J_{k+1})‖ ≤ ‖E_k(J)‖ ,   J ∈ S_{k+1} ,   (1.182)
  ‖E_k(J_{k+1})‖ ≤ Θ_k .   (1.183)

(ii) If J_k is regular, then J_{k+1} is regular as well with the inverse given by

  J_{k+1}^{-1} = ( I + (Δx̄^{k+1} (Δx^k)^T)/( (1 − σ_{k+1}) ‖Δx^k‖² ) ) J_k^{-1} ,   (1.184)

where

  σ_{k+1} := (Δx^k)^T Δx̄^{k+1} / ‖Δx^k‖² ,   |σ_{k+1}| < 1/2 .

(iii) The Quasi-Newton increment Δx^{k+1} is given by

  Δx^{k+1} = −J_{k+1}^{-1} F(x^{k+1}) = Δx̄^{k+1}/(1 − σ_{k+1}) .   (1.185)

(iv) The Quasi-Newton increments decrease according to

  ‖Δx^{k+1}‖/‖Δx^k‖ ≤ Θ_k/(1 − σ_{k+1}) < 1 .   (1.186)

Proof. In view of (1.178) we have E_k(J_{k+1}) Δx^k = Δx̄^{k+1} = E_k(J) Δx^k for all
J ∈ S_{k+1}, and hence

  ‖E_k(J_{k+1})‖ = ‖Δx̄^{k+1} (Δx^k)^T/‖Δx^k‖²‖ = ‖E_k(J) Δx^k (Δx^k)^T/‖Δx^k‖²‖ ≤ ‖E_k(J)‖ ,

which proves (1.182). Moreover, (1.183) follows readily from

  ‖E_k(J_{k+1})‖ = ‖Δx̄^{k+1} (Δx^k)^T/‖Δx^k‖²‖ = ‖Δx̄^{k+1}‖/‖Δx^k‖ = Θ_k .

The same argument shows

  |σ_{k+1}| ≤ Θ_k < 1/2 ,

and hence, (1.186) follows from

  ‖Δx^{k+1}‖/‖Δx^k‖ = Θ_k/(1 − σ_{k+1}) ≤ Θ_k/(1 − Θ_k) < 1 .

Finally, the proofs of (ii) and (iii) are direct consequences of the Sherman-
Morrison-Woodbury formula (1.173).

Theorem 4.2 Convergence of the affine covariant Quasi-Newton method
Suppose that F : D ⊂ ℝⁿ → ℝⁿ, D ⊂ ℝⁿ convex, is continuously
differentiable on D. Let x* ∈ D be the unique solution of F(x) = 0 in D
with invertible Jacobian F'(x*). Assume that the following affine covariant
Lipschitz condition is satisfied

  ‖F'(x*)^{-1}( F'(x) − F'(x*) ) v‖ ≤ ω ‖x − x*‖ ‖v‖ ,   (1.187)

where x, x + v ∈ D, v ∈ ℝⁿ.
For some 0 < Θ < 1 assume further that:
(a) The initial approximate Jacobian J₀ satisfies

  δ₀ := ‖F'(x*)^{-1}( J₀ − F'(x⁰) )‖ < Θ/(1 + Θ) .   (1.188)

(b) The initial guess x⁰ ∈ D satisfies

  ω t₀ := ω ‖x⁰ − x*‖ ≤ (1/2)( Θ/(1 + Θ) − δ₀ ) .   (1.189)

Then, there holds:
(i) The Quasi-Newton iterates x^k, k ∈ ℕ₀, converge to x* according to

  ‖x^{k+1} − x*‖ < Θ ‖x^k − x*‖ ,   (1.190)
  ‖Δx^{k+1}‖ ≤ Θ ‖Δx^k‖ .   (1.191)

We have superlinear convergence in the sense that

  lim_{k→∞} ‖Δx^{k+1}‖/‖Δx^k‖ = 0 .   (1.192)

(ii) For E_k := F'(x*)^{-1} J_k − I = F'(x*)^{-1}( J_k − F'(x*) ) the following affine
covariant bounded deterioration property holds true

  ‖E_k‖ ≤ δ < 2Θ/(1 + Θ) .   (1.193)

Moreover, asymptotically we have

  lim_{k→∞} ‖E_k Δx^k‖/‖Δx^k‖ = 0 .   (1.194)

Proof. Denoting by e^k := x^k − x* the iteration error, we set t_k := ‖e^k‖.
We first derive an estimate for the contraction of the Quasi-Newton
increments. Observing F'(x*)^{-1} F(x^k) = −F'(x*)^{-1} J_k Δx^k = −(I + E_k) Δx^k
and using the affine covariant Lipschitz condition, we get

  ‖F'(x*)^{-1} F(x^{k+1})‖ = ‖F'(x*)^{-1}( F(x^{k+1}) − F(x^k) ) + F'(x*)^{-1} F(x^k)‖   (1.195)
    = ‖F'(x*)^{-1}( F(x^{k+1}) − F(x^k) − F'(x*) Δx^k ) − E_k Δx^k‖
    ≤ ∫₀¹ ‖F'(x*)^{-1}( F'(x^k + tΔx^k) − F'(x*) ) Δx^k‖ dt + ‖E_k Δx^k‖
    ≤ ω ∫₀¹ ( (1 − t) ‖x^k − x*‖ + t ‖x^{k+1} − x*‖ ) dt ‖Δx^k‖ + ‖E_k Δx^k‖
    = ( (ω/2)(t_k + t_{k+1}) + ‖E_k Δx^k‖/‖Δx^k‖ ) ‖Δx^k‖ .

On the other hand, F'(x*)^{-1} F(x^{k+1}) = −(I + E_{k+1}) Δx^{k+1}, so that,
assuming ‖E_{k+1} Δx^{k+1}‖/‖Δx^{k+1}‖ < 1, we obtain

  ‖F'(x*)^{-1} F(x^{k+1})‖ ≥ ( 1 − ‖E_{k+1} Δx^{k+1}‖/‖Δx^{k+1}‖ ) ‖Δx^{k+1}‖ .   (1.196)

Setting

  ε_ℓ := ‖E_ℓ Δx^ℓ‖/‖Δx^ℓ‖ ,   ℓ = k, k+1 ,   t̄_k := (1/2)(t_k + t_{k+1}) ,

the combination of (1.195) and (1.196) yields

  ‖Δx^{k+1}‖/‖Δx^k‖ ≤ ( ε_k + ω t̄_k )/( 1 − ε_{k+1} ) .   (1.197)

Next, we establish an estimate for the iteration errors t_k. We have

  e^{k+1} = e^k + Δx^k = e^k − J_k^{-1} F(x^k) = e^k − J_k^{-1}( F(x^k) − F(x*) )
    = e^k − J_k^{-1} F'(x*) · F'(x*)^{-1}( F(x^k) − F(x*) )
    = (I + E_k)^{-1} [ (I + E_k) e^k − F'(x*)^{-1}( F(x^k) − F(x*) ) ]
    = (I + E_k)^{-1} [ E_k e^k − F'(x*)^{-1} ∫₀¹ ( F'(x* + t e^k) − F'(x*) ) e^k dt ] .

Applying the affine covariant Lipschitz condition again, we arrive at

  t_{k+1} ≤ ( ‖E_k‖ + (1/2) ω t_k )/( 1 − ‖E_k‖ ) t_k .   (1.198)

Comparing (1.197) and (1.198), we find

  t_{k+1} < Θ_k t_k ,   Θ_k := ( ‖E_k‖ + ω t̄_k )/( 1 − ‖E_k‖ ) .   (1.199)

As far as the approximation properties of the rank 1 updates are concerned, we
start from

  E_{k+1} = F'(x*)^{-1} J_{k+1} − I
    = F'(x*)^{-1} J_k ( I − (Δx̄^{k+1} (Δx^k)^T)/‖Δx^k‖² ) − I   (1.200)
    = E_k + F'(x*)^{-1} F(x^{k+1}) (Δx^k)^T/‖Δx^k‖² ,

where we have used J_k Δx̄^{k+1} = −F(x^{k+1}). We further have

  F'(x*)^{-1} F(x^{k+1}) = F'(x*)^{-1}( F(x^{k+1}) − F(x^k) ) + F'(x*)^{-1} F(x^k)
    = F'(x*)^{-1} ∫₀¹ F'(x^k + tΔx^k) Δx^k dt − (I + E_k) Δx^k
    = F'(x*)^{-1} ∫₀¹ ( F'(x^k + tΔx^k) − F'(x*) ) Δx^k dt − E_k Δx^k
    =: D_{k+1} Δx^k − E_k Δx^k .

Consequently, (1.200) yields

  E_{k+1} = E_k ( I − Δx^k (Δx^k)^T/‖Δx^k‖² ) + D_{k+1} Δx^k (Δx^k)^T/‖Δx^k‖²
          = E_k Q̄_k + D_{k+1} Q_k ,   (1.201)

where Q_k := Δx^k (Δx^k)^T/‖Δx^k‖² and Q̄_k := I − Q_k are orthogonal projections.
Transposing (1.201), we obtain

  E_{k+1}^T = Q̄_k E_k^T + Q_k D_{k+1}^T .   (1.202)

If we apply the affine covariant Lipschitz condition to D_{k+1} Δx^k, from (1.201)
we get the estimate

  ‖E_{k+1} Δx^k‖/‖Δx^k‖ = ‖D_{k+1} Δx^k‖/‖Δx^k‖ ≤ ω t̄_k ,   (1.203)

whereas (1.202) results in

  ‖E_{k+1}‖ = max_{v≠0} ‖E_{k+1}^T v‖/‖v‖   (1.204)
    ≤ max_{v≠0} ‖E_k^T v‖/‖v‖ + max_{v≠0} |(D_{k+1} Δx^k, v)|/( ‖Δx^k‖ ‖v‖ ) ≤ ‖E_k‖ + ω t̄_k .

In view of (1.199), assuming uniform boundedness

  ‖E_k‖ ≤ δ < 1 ,   k ∈ ℕ₀ ,

it is natural to require

  Θ_k ≤ ( δ + ω t̄₀ )/( 1 − δ ) =: Θ .   (1.205)

This gives

  ‖E_{k+1}‖ ≤ ‖E₀‖ + ω Σ_{ℓ=0}^k t̄_ℓ < ‖E₀‖ + ω t̄₀/(1 − Θ) ,   (1.206)

so that we may set

  δ := ‖E₀‖ + ω t̄₀/(1 − Θ) .   (1.207)

If we insert the expression for δ into (1.205) and solve for ω t̄₀, we obtain

  2 ω t̄₀ = (1 − Θ)( Θ − (1 + Θ) ‖E₀‖ ) .

Taking into account that

  ω t̄₀ = (1/2) ω (t₀ + t₁) ≤ (1/2)(1 + Θ) ω t₀ ,

this leads to the requirement

  ω t₀ ≤ (1 − Θ)( Θ/(1 + Θ) − ‖E₀‖ ) .   (1.208)

Hence, ‖E₀‖ has to satisfy

  ‖E₀‖ ≤ Θ/(1 + Θ) < 1/2   for Θ < 1 .   (1.209)

On the other hand,

  ‖E₀‖ = ‖F'(x*)^{-1} J₀ − I‖ = ‖F'(x*)^{-1}( J₀ − F'(x⁰) ) + F'(x*)^{-1}( F'(x⁰) − F'(x*) )‖
       ≤ ‖F'(x*)^{-1}( J₀ − F'(x⁰) )‖ + ‖F'(x*)^{-1}( F'(x⁰) − F'(x*) )‖ ≤ δ₀ + ω t₀ .

Replacing ‖E₀‖ in (1.208) by δ₀ + ω t₀ gives

  (2 − Θ) ω t₀ ≤ (1 − Θ)( Θ/(1 + Θ) − δ₀ ) ,   ω t₀ ≤ (1/2)( Θ/(1 + Θ) − δ₀ ) ,

so that (1.208) can be replaced by the assumptions (1.188) and (1.189).
Now, for a Θ < 1 satisfying (1.188) and (1.189), the linear convergence
(1.190) follows from (1.199), whereas (1.191) follows by inserting into (1.197).
Finally, the bounded deterioration property (1.193) results from the insertion
of (1.189) into (1.206).
What remains to be proved is the superlinear convergence (1.192). In view of
(1.202) and the orthogonality of the projections Q_k, Q̄_k, we have

  ‖E_{k+1}^T v‖² = ‖E_k^T v‖² − (E_k Δx^k, v)²/‖Δx^k‖² + (D_{k+1} Δx^k, v)²/‖Δx^k‖² .

Summing over all 0 ≤ k ≤ ℓ, it follows that

  ‖E_{ℓ+1}^T v‖²/‖v‖² = ‖E₀^T v‖²/‖v‖²
    − Σ_{k=0}^ℓ (E_k Δx^k, v)²/( ‖v‖² ‖Δx^k‖² ) + Σ_{k=0}^ℓ (D_{k+1} Δx^k, v)²/( ‖v‖² ‖Δx^k‖² ) .

Now, letting ℓ → ∞ and using (1.203), i.e., ‖D_{k+1} Δx^k‖/‖Δx^k‖ ≤ ω t̄_k, we
obtain

  Σ_{k=0}^∞ (E_k Δx^k, v)²/( ‖v‖² ‖Δx^k‖² ) ≤ ‖E₀‖² + Σ_{k=0}^∞ ω² t̄_k²
    ≤ ‖E₀‖² + ω² t̄₀² Σ_{k=0}^∞ Θ^{2k} .

Since ω t̄₀ ≤ (1/2)(1 + Θ) ω t₀, it follows that

  Σ_{k=0}^∞ (E_k Δx^k, v)²/( ‖v‖² ‖Δx^k‖² ) ≤ ‖E₀‖² + (1 + Θ)/( 4 (1 − Θ) ) ω² t₀² .

The right-hand side in the previous inequality is bounded, whence

  lim_{k→∞} (E_k Δx^k, v)²/( ‖v‖² ‖Δx^k‖² ) = 0 ,   v ∈ ℝⁿ \ {0} .

Consequently, setting

  w^k := Δx^k/‖Δx^k‖ ,

we have

  lim_{k→∞} E_k w^k = 0 ,

which proves (1.194).
Finally, using the previous result in (1.197) proves superlinear convergence.

4.2.2 Algorithmic aspects of the affine covariant Quasi-Newton method
(i) Recursive "Good Broyden" algorithm
The recursion (1.184) cannot be used directly for the computation of the Quasi-
Newton increments. To come up with a computationally feasible recursion, we
rewrite (1.184) according to

  J_{k+1}^{-1} = ( I + (Δx^{k+1} (Δx^k)^T)/‖Δx^k‖² ) J_k^{-1} .   (1.210)

This leads to the following product representation

  J_k^{-1} = ∏_{ℓ=0}^{k−1} ( I + (Δx^{ℓ+1} (Δx^ℓ)^T)/‖Δx^ℓ‖² ) J₀^{-1}   (1.211)

(with the factor for ℓ = k−1 leftmost), which can be efficiently used in actual
computations.
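Given the stored increment pairs (Δx^ℓ, Δx^{ℓ+1}) and a solver for J₀, (1.211) can be applied matrix-free; a sketch (names are our own, the increments are assumed to have been computed in the course of the iteration):

```python
def apply_Jk_inverse(J0_solve, pairs, w):
    """Apply J_k^{-1} to w via the product representation (1.211).
    `pairs` holds (dx_l, dx_{l+1}) for l = 0, ..., k-1; `J0_solve` applies
    J_0^{-1}. The rank 1 factors act on top of J_0^{-1} in increasing l,
    i.e., the l = k-1 factor is applied last (leftmost in the product)."""
    z = J0_solve(list(w))
    for dx_l, dx_lp1 in pairs:
        coef = sum(a * b for a, b in zip(dx_l, z)) / sum(a * a for a in dx_l)
        z = [zi + coef * di for zi, di in zip(z, dx_lp1)]
    return z
```

Only 2k vectors are stored, so the cost per application is O(kn) on top of one J₀-solve, which is what makes the recursive algorithm attractive for large n.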


(ii) Condition number monitor
In order to monitor the condition number of the approximations of the
Jacobians, we provide the following elementary result for rank 1 matrices.
Lemma 4.1 Condition number of rank 1 matrices
For a matrix A of the form

  A = I − (u v^T)/(v^T v) ,   ϑ := ‖u‖/‖v‖ < 1 ,

the condition number can be bounded according to

  cond(A) ≤ (1 + ϑ)/(1 − ϑ) .

Proof. The assertion readily follows from the two estimates

  ‖A‖ ≤ 1 + ‖(u v^T)/(v^T v)‖ ≤ 1 + ϑ

and

  ‖A^{-1}‖ ≤ ( 1 − ‖(u v^T)/(v^T v)‖ )^{-1} ≤ (1 − ϑ)^{-1} .
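Lemma 4.1 is easy to check numerically; a small 2×2 sketch computing the spectral condition number via the eigenvalues of AᵀA (helper name and test matrices are illustrative):

```python
import math

def cond2(A):
    """2-norm condition number of a 2x2 matrix via the eigenvalues of A^T A."""
    # entries of the symmetric matrix A^T A
    a = A[0][0] ** 2 + A[1][0] ** 2
    b = A[0][0] * A[0][1] + A[1][0] * A[1][1]
    d = A[0][1] ** 2 + A[1][1] ** 2
    tr, det = a + d, a * d - b * b
    disc = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
    lmax, lmin = (tr + disc) / 2.0, (tr - disc) / 2.0
    return math.sqrt(lmax / lmin)
```

For u = (0.3, 0.1), v = (1, 0) one has A = I − u vᵀ/(vᵀv) = [[0.7, 0], [−0.1, 1]] and ϑ = ‖u‖/‖v‖ = √0.1, and the computed condition number indeed stays below (1+ϑ)/(1−ϑ).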

Applying Lemma 4.1 to the recursions (1.210) and (1.211) and observing

  Θ_k = ‖Δx^{k+1}‖/‖Δx^k‖ < 1/2 ,

we get the condition number estimate

  cond(J_{k+1}) ≤ (1 + Θ_k)/(1 − Θ_k) cond(J_k) < 3 cond(J_k) .   (1.212)

(iii) Convergence monitor
In accordance with the result of the condition number estimate in (ii) we
monitor the convergence by

  Θ_k < 1/2 ,   (1.213)

i.e., if Θ_k ≥ 1/2, no convergence is detected.

(iv) Termination criterion
Denoting by XTOL a user specified accuracy, the Quasi-Newton iteration will
be terminated, if

  ‖Δx^{k+1}‖ ≤ XTOL .   (1.214)

4.3 Affine contravariant Quasi-Newton method
Affine contravariant Quasi-Newton methods require to reformulate the secant
condition (1.164) in an affine contravariant manner. For that purpose, setting
ΔF^{k+1} := F(x^{k+1}) − F(x^k), we rewrite (1.164) according to

  x^{k+1} − x^k = Δx^k = J^{-1} ΔF^{k+1} .

Multiplication by J_k results in

  J_k Δx^k = J_k J^{-1} ΔF^{k+1} ,

whence, observing J_k Δx^k = −F(x^k),

  J_k J^{-1} ΔF^{k+1} = ΔF^{k+1} − F(x^{k+1}) .

We thus get

  ( I − J_k J^{-1} ) ΔF^{k+1} = F(x^{k+1}) ,   E_k(J) := I − J_k J^{-1} .   (1.215)

We note that any Jacobian rank 1 update of the form

  J_{k+1}^{-1} = J_k^{-1} ( I − (F(x^{k+1}) v^T)/(v^T ΔF^{k+1}) ) ,   v ∈ ℝⁿ \ {0} ,

satisfies the affine contravariant secant condition (1.215).
In particular, for v = ΔF^{k+1} we recover the "Bad Broyden" update (1.176).
4.3.1 Affine contravariant Quasi-Newton convergence theory
We begin with some useful properties of the "Bad Broyden" update.
Theorem 4.3 Properties of the affine contravariant Quasi-Newton
method
For Broyden's affine contravariant rank 1 update ("Bad Broyden")

  J_{k+1}^{-1} = J_k^{-1} ( I − (F(x^{k+1}) (ΔF^{k+1})^T)/‖ΔF^{k+1}‖² )   (1.216)

assume local residual contraction

  Θ_k = ‖F(x^{k+1})‖/‖F(x^k)‖ < 1 .   (1.217)

Then, there holds:
(i) The update matrix J_{k+1} is a least change update in the sense that

  ‖E_k(J_{k+1})‖ ≤ ‖E_k(J)‖ ,   J ∈ S_{k+1} ,   (1.218)
  ‖E_k(J_{k+1})‖ ≤ Θ_k/(1 − Θ_k) .   (1.219)

(ii) If J_k is regular, then J_{k+1} is regular as well. J_{k+1} can be represented
according to

  J_{k+1} = ( I − (F(x^{k+1}) (ΔF^{k+1})^T)/( (ΔF^{k+1})^T F(x^k) ) ) J_k .   (1.220)

(iii) The Quasi-Newton increment Δx^{k+1} is given by

  Δx^{k+1} = −J_{k+1}^{-1} F(x^{k+1})
           = ( 1 − ( (ΔF^{k+1})^T F(x^{k+1}) )/‖ΔF^{k+1}‖² ) Δx̄^{k+1} ,
  Δx̄^{k+1} := −J_k^{-1} F(x^{k+1}) .   (1.221)

Proof. For the proof of (1.218) and (1.219) we have

  E_k(J_{k+1}) = I − J_k J_{k+1}^{-1} = ( F(x^{k+1}) (ΔF^{k+1})^T )/‖ΔF^{k+1}‖² ,

which gives

  ‖E_k(J_{k+1})‖ = ‖E_k(J_{k+1}) ΔF^{k+1}‖/‖ΔF^{k+1}‖ = ‖F(x^{k+1})‖/‖ΔF^{k+1}‖
               = ‖E_k(J) ΔF^{k+1}‖/‖ΔF^{k+1}‖ ≤ ‖E_k(J)‖ ,   J ∈ S_{k+1} .

Using

  ‖ΔF^{k+1}‖ ≥ ‖F(x^k)‖ − ‖F(x^{k+1})‖ = ‖F(x^k)‖ ( 1 − ‖F(x^{k+1})‖/‖F(x^k)‖ ) ,

we find

  ‖E_k(J_{k+1})‖ = ‖F(x^{k+1})‖/‖ΔF^{k+1}‖ ≤ Θ_k/(1 − Θ_k) .


In the subsequent convergence proof for the affine contravariant Quasi-
Newton method we need the following technical result:
Lemma 4.2 An elementary technical estimate
Assume 0 < Θ < 1, 0 ≤ δ₀ < Θ and

  t ≤ ( Θ − δ₀ )/( 1 + δ₀ + (4/3)(1 − Θ)^{-1} ) .

Setting

  δ = δ₀ + t/( (1 − t)(1 − Θ) ) ,

there holds

  δ + (1 + δ) t ≤ Θ .

Proof. For t we have

  t ≤ ( Θ − δ₀ )/( 1 + δ₀ + (4/3)(1 − Θ)^{-1} ) ≤ 3Θ(1 − Θ)/(7 − 3Θ) =: g(Θ) .

The function g(Θ) attains its maximum in Θ* = (7 − 2√7)/3 ≈ 0.57 with
g(Θ*) < 1/7. Hence t ≤ 1/7, so that (1 + t)/(1 − t) ≤ 4/3 and therefore

  δ + (1 + δ) t = δ₀ + (1 + δ₀) t + t (1 + t)/( (1 − t)(1 − Θ) )
    ≤ δ₀ + ( 1 + δ₀ + (4/3)(1 − Θ)^{-1} ) t ≤ δ₀ + (Θ − δ₀) = Θ .


Theorem 4.4 Convergence of the affine contravariant Quasi-Newton
method
Suppose that F : D ⊂ ℝⁿ → ℝⁿ, D ⊂ ℝⁿ convex, is continuously
differentiable on D and let x* ∈ D be the unique solution of F(x) = 0 in D with
invertible Jacobian F'(x*). Assume that the following affine contravariant
Lipschitz condition is satisfied

  ‖( F'(x) − F'(x*) )(y − x)‖ ≤ ω ‖F'(x*)(x − x*)‖ ‖F'(x*)(y − x)‖ ,   (1.222)

where x, y ∈ D, and denote by

  E_k := I − F'(x*) J_k^{-1}   (1.223)

the affine contravariant deterioration matrix.
For some 0 < Θ < 1 assume further that:
(a) The initial approximate Jacobian J₀ satisfies

  δ₀ := ‖E₀‖ < Θ .   (1.224)

(b) The initial guess x⁰ ∈ D satisfies

  ω t₀ := ω ‖F'(x*)(x⁰ − x*)‖ ≤ ( Θ − δ₀ )/( 1 + δ₀ + (4/3)(1 − Θ)^{-1} ) .   (1.225)

Then, there holds:
(i) The Quasi-Newton iterates x^k, k ∈ ℕ₀, converge to x* according to

  ‖F'(x*)(x^{k+1} − x*)‖ ≤ Θ ‖F'(x*)(x^k − x*)‖ ,   (1.226)
  ‖F(x^{k+1})‖ ≤ Θ ‖F(x^k)‖ .   (1.227)

We have superlinear convergence in the sense that

  lim_{k→∞} ‖F(x^{k+1})‖/‖F(x^k)‖ = 0 .   (1.228)

(ii) The following affine contravariant bounded deterioration property
holds true

  ‖E_k‖ ≤ δ₀ + ω t₀/( (1 − ω t₀)(1 − Θ) ) .   (1.229)

Moreover, asymptotically we have

  lim_{k→∞} ‖E_k ΔF^{k+1}‖/‖ΔF^{k+1}‖ = 0 .   (1.230)

Proof. We first derive an estimate for the iterative residuals F(x^{k+1}).
Observing J_k Δx^k = −F(x^k), we have

  F(x^{k+1}) = F(x^{k+1}) − F(x^k) − J_k Δx^k
    = ∫₀¹ F'(x^k + tΔx^k) Δx^k dt − J_k Δx^k
    = ∫₀¹ ( F'(x^k + tΔx^k) − F'(x*) ) Δx^k dt + ( F'(x*) − J_k ) Δx^k ,

where

  ( F'(x*) − J_k ) Δx^k = ( F'(x*) J_k^{-1} − I ) J_k Δx^k = E_k F(x^k) .

Using the affine contravariant Lipschitz condition (1.222), it follows that

  ‖F(x^{k+1})‖
    ≤ ∫₀¹ ‖( F'(x^k + tΔx^k) − F'(x*) ) Δx^k‖ dt + ‖E_k F(x^k)‖
    ≤ ω ∫₀¹ ‖F'(x*)( x^k + tΔx^k − x* )‖ dt ‖F'(x*) Δx^k‖ + ‖E_k‖ ‖F(x^k)‖
    ≤ ω ∫₀¹ ( (1 − t) t_k + t t_{k+1} ) dt ‖F'(x*) Δx^k‖ + ‖E_k‖ ‖F(x^k)‖
    = (ω/2)( t_k + t_{k+1} ) ‖F'(x*) Δx^k‖ + ‖E_k‖ ‖F(x^k)‖ ,

where

  t_k := ‖F'(x*)(x^k − x*)‖ ,   k ∈ ℕ₀ .

Observing F'(x*) Δx^k = F'(x*) J_k^{-1} J_k Δx^k = (E_k − I) F(x^k) and setting

  t̄_k := (1/2)( t_k + t_{k+1} ) ,   k ∈ ℕ₀ ,

we obtain

  ‖F(x^{k+1})‖ ≤ ω t̄_k ‖(E_k − I) F(x^k)‖ + ‖E_k‖ ‖F(x^k)‖   (1.231)
    ≤ ( ω t̄_k (1 + ‖E_k‖) + ‖E_k‖ ) ‖F(x^k)‖ .

Likewise, for the iteration error F'(x*)(x^{k+1} − x*) we get

  F'(x*)(x^{k+1} − x*) = F'(x*)(x^k − x*) + F'(x*) Δx^k
    = F'(x*)(x^k − x*) + ( F(x*) − F(x^k) ) + E_k F(x^k)

with

  F(x^k) − F(x*) = ∫₀¹ F'( x* + t(x^k − x*) )(x^k − x*) dt ,

and hence, by the Lipschitz condition (1.222),

  t_{k+1} = ‖F'(x*)(x^{k+1} − x*)‖
    ≤ ∫₀¹ ‖( F'(x* + t(x^k − x*)) − F'(x*) )(x^k − x*)‖ dt + ‖E_k‖ ‖F(x^k)‖
    ≤ (ω/2) t_k² + ‖E_k‖ ( ‖F(x^k) − F(x*) − F'(x*)(x^k − x*)‖ + ‖F'(x*)(x^k − x*)‖ )
    ≤ (ω/2) t_k² + ‖E_k‖ ( (ω/2) t_k² + t_k ) .

Multiplying by ω yields

  ω t_{k+1} ≤ (1/2)(ω t_k)² + ‖E_k‖ ( (1/2)(ω t_k)² + ω t_k )
    = ( ‖E_k‖ + (1 + ‖E_k‖)/2 · ω t_k ) ω t_k .   (1.232)
2

We further study the approximation properties of the Jacobian updates. Recall ΔFk+1 = F(xk+1) − F(xk) and let

    Qk := ΔFk+1 (ΔFk+1)ᵀ / ‖ΔFk+1‖² ,   Q̄k := I − Qk ,

so that Qk is the orthogonal projection onto span{ΔFk+1}. For the deterioration matrix Ek+1 we obtain, in view of the update formula (1.216),

    Ek+1 = I − F′(x*) Jk+1⁻¹ = I − F′(x*) Jk⁻¹ ( I − F(xk+1)(ΔFk+1)ᵀ/‖ΔFk+1‖² ) .

Splitting

    I − F(xk+1)(ΔFk+1)ᵀ/‖ΔFk+1‖² = Q̄k + ( ΔFk+1 − F(xk+1) )(ΔFk+1)ᵀ/‖ΔFk+1‖² = Q̄k − F(xk)(ΔFk+1)ᵀ/‖ΔFk+1‖² ,

we find

    Ek+1 = Q̄k + Qk − F′(x*)Jk⁻¹ Q̄k + F′(x*)Jk⁻¹ F(xk)(ΔFk+1)ᵀ/‖ΔFk+1‖² =
         = Ek Q̄k + ( ΔFk+1 + F′(x*)Jk⁻¹ F(xk) )(ΔFk+1)ᵀ/‖ΔFk+1‖² .

Finally, taking advantage of

    Jk⁻¹ F(xk) = −Δxk = −Jk+1⁻¹ ΔFk+1 ,

and hence

    ΔFk+1 + F′(x*)Jk⁻¹ F(xk) = ( I − F′(x*) Jk+1⁻¹ ) ΔFk+1 = Ek+1 ΔFk+1 ,

we obtain the orthogonal decomposition

    Ek+1 = Ek Q̄k + Ek+1 Qk .                                              (1.233)

The representation (1.233) of Ek+1 readily yields the estimate

    ‖Ek+1‖ ≤ ‖Ek Q̄k‖ + ‖Ek+1 Qk‖ ≤ ‖Ek‖ + ‖Ek+1 ΔFk+1‖ / ‖ΔFk+1‖ .       (1.234)

Recalling the definition (1.223) of Ek+1, we have

    Ek+1 ΔFk+1 = ΔFk+1 − F′(x*) Jk+1⁻¹ ΔFk+1 .

On the other hand, the update formula (1.216) gives

    Jk+1⁻¹ ΔFk+1 = Jk⁻¹ ΔFk+1 − Jk⁻¹ F(xk+1) = −Jk⁻¹ F(xk) = Δxk ,

and hence

    Ek+1 ΔFk+1 = ΔFk+1 − F′(x*) Δxk =                                     (1.235)
               = ∫₀¹ ( F′(xk + tΔxk) − F′(x*) ) Δxk dt .

By the affine contravariant Lipschitz condition and using (1.235),

    ‖Ek+1 ΔFk+1‖ ≤ ω ∫₀¹ ‖F′(x*)( xk + tΔxk − x* )‖ ‖F′(x*)Δxk‖ dt ≤
                 ≤ ω ∫₀¹ [ t‖F′(x*)(xk+1 − x*)‖ + (1 − t)‖F′(x*)(xk − x*)‖ ] dt ‖F′(x*)Δxk‖ =
                 = t̄k ‖F′(x*)Δxk‖ = t̄k ‖Ek+1 ΔFk+1 − ΔFk+1‖ ≤ t̄k ( ‖Ek+1 ΔFk+1‖ + ‖ΔFk+1‖ ) ,

whence

    ‖Ek+1 ΔFk+1‖ ≤ t̄k / (1 − t̄k) ‖ΔFk+1‖ .

Using the previous estimate in (1.234) results in

    ‖Ek+1‖ ≤ ‖Ek‖ + t̄k / (1 − t̄k) .                                      (1.236)

The bounded deterioration property and the contraction of the residuals are now proved by an induction argument. Assume that

    ‖Ek‖ ≤ ‖E0‖ + t0/(1 − t0) ∑_{ℓ=0}^{k−1} Θ^ℓ ≤ ‖E0‖ + t0 / ( (1 − t0)(1 − Θ) ) =: δ

and

    tk ≤ Θ^k t0 .

Then, (1.232) in combination with Lemma 4.2 yields

    tk+1 ≤ ( δ + ½ (1 + δ) t0 ) tk ≤ Θ tk ≤ Θ^{k+1} t0 .

Moreover, (1.236) gives us

    ‖Ek+1‖ ≤ ‖Ek‖ + t̄k/(1 − t̄k) ≤ ‖E0‖ + t0/(1 − t0) ∑_{ℓ=0}^{k−1} Θ^ℓ + Θ^k t0/(1 − t0) =
           = ‖E0‖ + t0/(1 − t0) ∑_{ℓ=0}^{k} Θ^ℓ ≤ δ .

In summary, induction on k shows the bounded deterioration property

    ‖Ek‖ ≤ δ

as well as

    tk+1 ≤ Θ tk ,

which in view of (1.232) and Lemma 4.2 implies contraction of the residuals

    ‖F(xk+1)‖ ≤ Θ ‖F(xk)‖ .
We finally prove superlinear convergence. For that purpose, we use the orthogonal decomposition (1.233) in the transposed form

    (Ek+1)ᵀ = Q̄k (Ek)ᵀ + Qk (Ek+1)ᵀ .

Observing that for any v ∈ ℝⁿ

    Qk (Ek+1)ᵀ v = ΔFk+1 ( ΔFk+1, (Ek+1)ᵀ v ) / ‖ΔFk+1‖² = ΔFk+1 ( Ek+1 ΔFk+1, v ) / ‖ΔFk+1‖² ,

we get, by orthogonality of the ranges of Qk and Q̄k,

    ‖(Ek+1)ᵀ v‖² = ‖Q̄k (Ek)ᵀ v‖² + ‖Qk (Ek+1)ᵀ v‖² =
                 = ‖(Ek)ᵀ v‖² − ‖Qk (Ek)ᵀ v‖² + ‖Qk (Ek+1)ᵀ v‖² =
                 = ‖(Ek)ᵀ v‖² − (Ek ΔFk+1, v)²/‖ΔFk+1‖² + (Ek+1 ΔFk+1, v)²/‖ΔFk+1‖² .

Summing over all 0 ≤ k ≤ ℓ yields

    ∑_{k=0}^{ℓ} (Ek ΔFk+1, v)² / ( ‖ΔFk+1‖² ‖v‖² ) =
        = ‖(E0)ᵀ v‖²/‖v‖² − ‖(Eℓ+1)ᵀ v‖²/‖v‖² + ∑_{k=0}^{ℓ} (Ek+1 ΔFk+1, v)² / ( ‖ΔFk+1‖² ‖v‖² ) .

Estimating from above by neglecting the negative term and observing the bound ‖Ek+1 ΔFk+1‖ ≤ t̄k/(1 − t̄k) ‖ΔFk+1‖ from the derivation of (1.236), for ℓ → ∞ we obtain

    ∑_{k=0}^{∞} (Ek ΔFk+1, v)² / ( ‖ΔFk+1‖² ‖v‖² ) ≤ ‖E0‖² + ∑_{k=0}^{∞} ( t̄k/(1 − t̄k) )² ≤
        ≤ ‖E0‖² + t0²/(1 − t0)² ∑_{k=0}^{∞} Θ^{2k} = ‖E0‖² + t0² / ( (1 − t0)²(1 − Θ²) ) .

Due to the boundedness of the right-hand side, we must have

    lim_{k→∞} (Ek ΔFk+1, v)² / ( ‖ΔFk+1‖² ‖v‖² ) = 0   for all v ∈ ℝⁿ ,

and thus

    lim_{k→∞} ‖Ek ΔFk+1‖ / ‖ΔFk+1‖ = 0 ,

which is (1.230).

The superlinear convergence of the residuals results from the following reasoning: Observing

    Ek F(xk+1) = F(xk+1) − F′(x*) Jk⁻¹ F(xk+1) ,

we have

    ‖F(xk+1)‖ ≤ ‖F′(x*) Jk⁻¹ F(xk+1)‖ + ‖Ek F(xk+1)‖ ≤ ‖F′(x*) Jk⁻¹ F(xk+1)‖ + ‖Ek‖ ‖F(xk+1)‖ ,

and hence,

    ‖F(xk+1)‖ ≤ ‖F′(x*) Jk⁻¹ F(xk+1)‖ / ( 1 − ‖Ek‖ ) .                    (1.237)

On the other hand, in view of the update formula (1.216),

    F′(x*) Jk⁻¹ F(xk+1) = ( Ek+1 − Ek ) ΔFk+1 ,

which gives

    ‖F′(x*) Jk⁻¹ F(xk+1)‖ ≤ ‖Ek+1 ΔFk+1‖ + ‖Ek ΔFk+1‖ ≤ ( t̄k/(1 − t̄k) + εk ) ‖ΔFk+1‖ ,

where εk := ‖Ek ΔFk+1‖/‖ΔFk+1‖ → 0 (k → ∞) by (1.230). Taking into account that

    ‖ΔFk+1‖ ≤ ‖F(xk+1)‖ + ‖F(xk)‖ ≤ (1 + Θ) ‖F(xk)‖ ,

we arrive at

    ‖F′(x*) Jk⁻¹ F(xk+1)‖ ≤ ( εk + t̄k/(1 − t̄k) ) (1 + Θ) ‖F(xk)‖ .

Using the previous estimate in (1.237) yields

    ‖F(xk+1)‖ ≤ (1 + Θ) ( εk + t̄k/(1 − t̄k) ) / ( 1 − ‖Ek‖ ) ‖F(xk)‖ .    (1.238)

Since εk → 0 and t̄k → 0 as k → ∞, while ‖Ek‖ ≤ δ < 1, (1.238) proves superlinear convergence of the residuals.

4.3.2 Algorithmic aspects of the affine contravariant Quasi-Newton method

(i) Condition number monitor

As in the affine covariant Quasi-Newton method, we track the condition number of the Jacobian rank-1 updates. For Θk < ½, an application of Lemma 4.1 yields

    cond(Jk+1) ≤ cond( I − F(xk+1)(ΔFk+1)ᵀ/‖ΔFk+1‖² ) cond(Jk) ≤ cond(Jk) / ( 1 − 2Θk ) .

(ii) Convergence monitor

We choose Θmax < ½, e.g. Θmax = ¼, and check

    Θk ≤ Θmax .                                                           (1.239)

If (1.239) is violated, we stop the algorithm (no convergence).

(iii) Termination criterion

Given a user specified tolerance FTOL, the Quasi-Newton iteration will be stopped if

    ‖F(xk)‖ ≤ FTOL .                                                      (1.240)
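The residual-based rank-1 update and the monitors above can be assembled into a compact iteration. The following sketch is a hypothetical implementation (function name, defaults, and the omission of the condition number monitor are our own choices, not from the text): it keeps Bk = Jk⁻¹ explicitly and applies the update Bk+1 = Bk ( I − F(xk+1)(ΔFk+1)ᵀ/‖ΔFk+1‖² ) directly to the inverse, together with the convergence monitor (1.239) and the termination criterion (1.240).

```python
import numpy as np

def quasi_newton_contravariant(F, x0, J0, ftol=1e-10, theta_max=0.25, kmax=50):
    """Sketch of the residual-based quasi-Newton iteration: B_k = J_k^{-1}
    is updated by B_{k+1} = B_k (I - F(x_{k+1}) dF^T / ||dF||^2),
    dF = F(x_{k+1}) - F(x_k), with monitors (1.239) and (1.240)."""
    x = np.asarray(x0, dtype=float)
    B = np.linalg.inv(np.asarray(J0, dtype=float))      # B_0 = J_0^{-1}
    fx = F(x)
    for _ in range(kmax):
        if np.linalg.norm(fx) <= ftol:                  # termination (1.240)
            return x, True
        dx = -B @ fx                                    # quasi-Newton correction
        x_new = x + dx
        f_new = F(x_new)
        theta = np.linalg.norm(f_new) / np.linalg.norm(fx)
        if theta > theta_max:                           # convergence monitor (1.239)
            return x, False
        dF = f_new - fx                                 # residual difference
        B = B - np.outer(B @ f_new, dF) / np.dot(dF, dF)  # rank-1 update of B_k
        x, fx = x_new, f_new
    return x, np.linalg.norm(fx) <= ftol
```

Note that maintaining the inverse explicitly is only reasonable for moderate dimensions; for large sparse systems one would store the rank-1 factors instead.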


5. Global Newton methods


The convergence results of the previous chapters stated local convergence of
Newton or Newton-like methods under a restriction on the initial guess in terms
of the initial Kantorovich quantity. If this condition is not satisfied, there is no
guaranteed convergence, and globalization strategies, such as steepest descent or trust region methods, have to be employed.
In the sequel, we derive a globalization strategy that is based on an appropriate damping of the Newton increments within the three affine invariance classes and design related residual and error-oriented monotonicity
tests as well as a convex functional test along with adaptive trust region
strategies for a suitable selection of the damping parameter.
5.1 The Newton path
A widely used globalization concept relies on a decrease of the residual level function

    T(x) := ½ ‖F(x)‖² = ½ F(x)ᵀ F(x)                                      (1.241)

in the sense that we require the monotonicity criterion

    T(xk+1) < T(xk) ,   if T(xk) ≠ 0 .                                    (1.242)

We associate with the residual level function T the level set

    G(z) := { x ∈ D ⊆ ℝⁿ | T(x) ≤ T(z) } .                                (1.243)

In terms of the level set G, the monotonicity criterion (1.242) can be stated as

    xk+1 ∈ int G(xk) ,   if int G(xk) ≠ ∅ .                               (1.244)

In the steepest descent method, the negative gradient of the level function is used as the direction of the iterative correction

    Δxk = −grad T(xk) = −F′(xk)ᵀ F(xk) ,                                  (1.245)
    xk+1 = xk + sk Δxk ,

where sk > 0 is an appropriate steplength parameter.


The convergence of the steepest descent method is assured by the following
result.


Lemma 5.1 Downhill property

Assume that F : D ⊆ ℝⁿ → ℝⁿ is continuously differentiable on D and Δx = −F′(x)ᵀF(x) ≠ 0. Then there exists s̄ > 0 such that

    T(x + sΔx) < T(x)   for all 0 < s < s̄ .                               (1.246)

Proof. Define the function

    φ(s) := T(x + sΔx) .

Obviously, φ ∈ C¹(D1) for some interval D1 ⊆ ℝ containing 0, and

    φ′(0) = ( grad T(x) )ᵀ Δx = ( F′(x)ᵀ F(x) )ᵀ Δx = −‖Δx‖² < 0 ,

which proves (1.246).

The steplength strategy consists of two parts: a reduction strategy and a prediction strategy.

The reduction strategy applies when the monotonicity test fails, i.e.,

    T(xk + s⁰k Δxk) > T(xk)

for some initial trial steplength s⁰k. In this case, the monotonicity test will be repeated with the reduced steplengths

    s^{i+1}_k := ϑ s^i_k ,   i ∈ ℕ₀ ,   ϑ < 1   (e.g., ϑ = ½) .            (1.247)

The downhill property (1.246) assures that a finite number of reductions will result in a feasible steplength sk > 0.

The prediction strategy selects s⁰_{k+1} on the basis of an ad-hoc rule with respect to the steplength history:

    s⁰_{k+1} := { min( smax , sk/ϑ ) ,   if sk−1 ≤ sk ,                   (1.248)
                { sk ,                   otherwise .
Remark 5.1 The speed of convergence of the steepest descent method may be slow, and problems may occur due to ill-conditioning of the Jacobian F′(x).


The steepest descent method as described above is not affine covariant. Indeed, given a nonsingular matrix A ∈ ℝ^{n×n}, we may introduce another level function

    T_A(x) := ½ ‖A F(x)‖² .                                               (1.249)

The following result underpins the necessity to develop an affine covariant descent concept, since it shows that almost always there exists a matrix A such that Δx is uphill with respect to T_A.

Lemma 5.2 Deficiency of the residual level function

Let Δx = −grad T(x) be the descent direction with respect to the level function T. Then, unless

    F′(x) Δx = μ F(x)   for some μ < 0 ,                                  (1.250)

there exists a class of regular matrices A such that for some s̄ > 0

    T_A(x + sΔx) > T_A(x) ,   0 < s < s̄ .                                 (1.251)

Proof. We have

    Δxᵀ grad T_A(x) = −( F′(x)ᵀF(x) )ᵀ F′(x)ᵀ AᵀA F(x) = −F(x)ᵀ F′(x) F′(x)ᵀ AᵀA F(x) .

We choose A ∈ ℝ^{n×n} such that

    AᵀA = F′(x) F′(x)ᵀ + γ y yᵀ ,   γ > 0 ,

where y ∈ ℝⁿ satisfies

    F(x)ᵀ ( F′(x) F′(x)ᵀ + ε I ) y = 0 ,   ε > 0 ,   F(x)ᵀ y ≠ 0 .

The existence of such a y ∈ ℝⁿ is guaranteed regarding that (1.250) is excluded. We then get

    Δxᵀ grad T_A(x) = −F(x)ᵀ F′(x) F′(x)ᵀ ( F′(x) F′(x)ᵀ + γ y yᵀ ) F(x) =
                    = −‖F′(x) F′(x)ᵀ F(x)‖² − γ ( F(x)ᵀ F′(x) F′(x)ᵀ y )( yᵀ F(x) ) =
                    = −‖F′(x) F′(x)ᵀ F(x)‖² + γ ε ( F(x)ᵀ y )² .

Hence, if we choose

    γ > ‖F′(x) F′(x)ᵀ F(x)‖² / ( ε ( F(x)ᵀ y )² ) ,

we get

    Δxᵀ grad T_A(x) > 0 ,

which proves (1.251).

In order to come up with an affine covariant globalization concept, we introduce the level set associated with the level function T_A given by

    G_A(z) := { x ∈ D | T_A(x) ≤ T_A(z) } .                               (1.252)

We recall that monotonicity with respect to T_A reads as follows:

    xk+1 ∈ int G_A(xk) ,   if int G_A(xk) ≠ ∅ .

Denoting by GL(n) the set of all regular n×n matrices, we introduce the affine covariant level set

    Ḡ(x) := ⋂_{A ∈ GL(n)} G_A(x) .                                        (1.253)

Theorem 5.1 Newton path

Assume that F : D ⊆ ℝⁿ → ℝⁿ is continuously differentiable on D with nonsingular Jacobi matrix F′(x), x ∈ D. Further suppose that for some A ∈ GL(n) the path-connected component of G_A(x0), x0 ∈ D, is a compact subset of D. Then, the path-connected component of Ḡ(x0) is a topological path x̄ : [0, 2] → ℝⁿ, called the Newton path. It has the properties

    F( x̄(λ) ) = (1 − λ) F(x0) ,                                           (1.254)
    T_A( x̄(λ) ) = (1 − λ)² T_A(x0) ,   A ∈ GL(n) ,                        (1.255)

and satisfies the two-point boundary value problem

    dx̄/dλ = −F′(x̄)⁻¹ F(x0) ,   x̄(0) = x0 ,   x̄(1) = x* .                (1.256)

Moreover, we recover the ordinary Newton increment Δx0 by means of

    dx̄/dλ |_{λ=0} = −F′(x0)⁻¹ F(x0) = Δx0 .                               (1.257)

Proof. We introduce the level sets

    H_A(x0) := { y ∈ ℝⁿ | ‖Ay‖² ≤ ‖A F(x0)‖² }

and define their intersection

    H̄(x0) := ⋂_{A ∈ GL(n)} H_A(x0) .                                      (1.258)

The idea of the proof is to show that H̄(x0) coincides with the image of Ḡ(x0) under F.

For that purpose, we refer to σi, 1 ≤ i ≤ n, as the singular values of A and to qi, 1 ≤ i ≤ n, as the associated eigenvectors of AᵀA such that

    AᵀA = ∑_{i=1}^{n} σi² qi qiᵀ .

We further denote by Ā the following subset of GL(n):

    Ā := { A ∈ GL(n) | AᵀA = ∑_{i=1}^{n} σi² qi qiᵀ , q1 = F(x0)/‖F(x0)‖ } .

Obviously, every y ∈ ℝⁿ admits the representation

    y = ∑_{j=1}^{n} bj qj ,   bj ∈ ℝ , 1 ≤ j ≤ n ,

and hence,

    ‖Ay‖² = yᵀ AᵀA y = ∑_{i=1}^{n} σi² bi² ,   ‖A F(x0)‖² = σ1² ‖F(x0)‖² .

In particular, for A ∈ Ā we find

    H_A(x0) = { y ∈ ℝⁿ | ∑_{i=1}^{n} σi² bi² ≤ σ1² ‖F(x0)‖² } .

Figure 1: Intersection of ellipsoids H_A(x0), A ∈ Ā.

In other words, H_A(x0) defines the n-dimensional ellipsoid

    b1²/‖F(x0)‖² + (σ2/σ1)² b2²/‖F(x0)‖² + ... + (σn/σ1)² bn²/‖F(x0)‖² ≤ 1 .

For A ∈ Ā, all ellipsoids have a common b1-axis of length ‖F(x0)‖, whereas the lengths of the other axes differ (cf. Figure 1). It follows readily that

    H̃(x0) := ⋂_{A ∈ Ā} H_A(x0) = { y ∈ ℝⁿ | y = b1 q1 , |b1| ≤ ‖F(x0)‖ } =
            = { y ∈ ℝⁿ | y = (1 − λ) F(x0) , λ ∈ [0, 2] } =
            = { y ∈ ℝⁿ | Ay = (1 − λ) A F(x0) , λ ∈ [0, 2] , A ∈ GL(n) } .

Since Ā ⊂ GL(n), we have

    H̄(x0) ⊆ H̃(x0) .

On the other hand, for y ∈ H̃(x0) and A ∈ GL(n),

    ‖Ay‖² = (1 − λ)² ‖A F(x0)‖² ≤ ‖A F(x0)‖² ,

which shows

    H̃(x0) ⊆ H̄(x0) .                                                      (1.259)

The final stage of the proof is done by an appropriate lifting of the path H̃(x0) to Ḡ(x0) using the homotopy

    Φ(x, λ) := F(x) − (1 − λ) F(x0) .

In view of

    Φ_x = F′(x) ,   Φ_λ = F(x0) ,

and observing that Φ_x is nonsingular for x ∈ D and G_A(x0) ⊂ D, local continuation from x̄(0) = x0 by the implicit function theorem, applied to Φ ≡ 0, delivers the existence of the path

    x̄ ⊂ G_A(x0) ⊂ D

with the properties (1.256), (1.257). The assertions (1.254) and (1.255) are now a direct consequence of (1.259).

Remark 5.2 The implication of the previous theorem is that even far from the solution, the Newton direction Δx0/‖Δx0‖, which is tangent to the Newton path originating from x0, plays a decisive role and should be used in an affine invariant globalization strategy. However, its length may be too large and thus has to be controlled appropriately.
Remark 5.3 The previous theorem assumes that the Jacobian is regular in D. However, sometimes the situation is encountered where the Jacobian is singular at a critical point, possibly even close to the initial guess x0. In this case, the implicit function theorem tells us that the Newton path ends at that critical point.
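The defining ODE (1.256) of the Newton path can be followed numerically; the following sketch uses a plain explicit Euler discretization (the step count and the test problem in the usage below are illustrative assumptions, not prescriptions from the text):

```python
import numpy as np

def newton_path(F, J, x0, lam_end=1.0, steps=2000):
    """Trace the Newton path by explicit Euler integration of
    dx/dlam = -F'(x)^{-1} F(x0),  x(0) = x0   (cf. (1.256)).
    By (1.254), x(1) approximates the solution x* of F(x) = 0."""
    x = np.asarray(x0, dtype=float)
    f0 = F(x)                                   # F(x0) stays fixed along the path
    h = lam_end / steps
    for _ in range(steps):
        x = x + h * np.linalg.solve(J(x), -f0)  # Euler step along the path tangent
    return x
```

Along the way, F(x(λ)) ≈ (1 − λ) F(x0) can serve as a cheap consistency check of the integration.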
5.2 Trust region concepts
As we have seen, far away from the solution the ordinary Newton method can
be still used, provided an appropriate damping of the Newton increment is
provided. Of course, we would like to know how to determine the damping
factor, or in other words, what is the region around the current iterate where
we can rely on the linearization with respect to the tangent to the Newton path.
The specification of such regions is known as trust region concepts.
5.2.1 Trust region based on the Levenberg-Marquardt method
Given a current iterate xk ∈ ℝⁿ and a prespecified parameter δ > 0, the idea of the Levenberg-Marquardt method is to determine an increment Δxk ∈ ℝⁿ as the solution of the constrained minimization problem

    inf_{Δxk ∈ K} ‖F(xk) + F′(xk) Δxk‖ ,

where K stands for the constraint

    K := { Δxk ∈ ℝⁿ | ‖Δxk‖ ≤ δ } .

Coupling the inequality constraint by a Lagrangian multiplier λ ∈ ℝ₊ leads to the saddle point problem

    inf_{Δxk ∈ ℝⁿ} sup_{λ ∈ ℝ₊} L(Δxk, λ)

in terms of the associated Lagrangian functional

    L(Δxk, λ) := ‖F(xk) + F′(xk) Δxk‖² + λ ( ‖Δxk‖² − δ² ) .

The KKT conditions read as follows:

    ( F′(xk)ᵀ F′(xk) + λ I ) Δxk = −F′(xk)ᵀ F(xk) ,                       (1.260)
    ‖Δxk‖² − δ² ≤ 0 ,                                                     (1.261)
    λ ≥ 0 ,   λ ( ‖Δxk‖² − δ² ) = 0 .

Denoting the solution of the saddle point problem by (Δxk(λ), λ), we observe

    λ → 0⁺ :   Δxk(λ) → −F′(xk)⁻¹ F(xk) ,
    λ ≫ 1 :    Δxk(λ) ≈ −λ⁻¹ F′(xk)ᵀ F(xk) = −λ⁻¹ grad T(xk) .

This means: Close to the solution, the method coincides with the ordinary Newton method, whereas far from the solution, it corresponds to a steepest descent with the steplength parameter λ⁻¹.

The Levenberg-Marquardt method looks robust, since the coefficient matrix F′(xk)ᵀF′(xk) + λI in (1.260) is regular even if the Jacobian F′(xk) is singular. However, the method may terminate at points with singular F′(xk), since there the right-hand side in (1.260) also degenerates. Moreover, the Levenberg-Marquardt method lacks affine invariance.
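Each trust region step thus reduces to the single linear solve (1.260). A minimal sketch (the function name and the matrices in the usage note are illustrative assumptions):

```python
import numpy as np

def levenberg_marquardt_step(J, f, lam):
    """Solve the KKT system (1.260):  (J^T J + lam*I) dx = -J^T f."""
    n = J.shape[1]
    return np.linalg.solve(J.T @ J + lam * np.eye(n), -J.T @ f)
```

For lam = 0 and a square nonsingular J this reproduces the ordinary Newton correction −J⁻¹f, while for very large lam it approaches the scaled steepest descent direction −J^T f / lam, matching the two limits discussed above.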
5.2.2 The Armijo damping strategy
An empirical damping strategy is the Armijo strategy:
Let Λk ⊂ { 1, ½, ¼, ..., λmin } be the set of steplengths with the property

    T(xk + λ Δxk) ≤ ( 1 − ½ λ ) T(xk) ,   λ ∈ Λk .                        (1.262)

Figure 2: Geometric interpretation of the affine covariant trust region method

Then, the damping parameter λk ∈ Λk is chosen as the optimal one:

    T(xk + λk Δxk) = min_{λ ∈ Λk} T(xk + λ Δxk) .

Obviously, the choice of the level function T(x) in the Armijo rule does not reflect affine covariance. We will develop an affine covariant damping strategy below.
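The Armijo strategy can be sketched as follows. As a common simplification, this hypothetical implementation accepts the largest steplength in {1, ½, ¼, ...} that passes test (1.262), rather than minimizing T over all feasible steplengths; names and defaults are our own:

```python
import numpy as np

def armijo_newton(F, J, x0, lam_min=2**-10, tol=1e-10, kmax=50):
    """Damped Newton iteration with the Armijo rule (1.262):
    accept the largest lam in {1, 1/2, 1/4, ...} satisfying
    T(x + lam*dx) <= (1 - lam/2) T(x),  T(x) = 0.5*||F(x)||^2."""
    x = np.asarray(x0, dtype=float)
    for _ in range(kmax):
        fx = F(x)
        if np.linalg.norm(fx) <= tol:
            break
        dx = np.linalg.solve(J(x), -fx)      # ordinary Newton correction
        T0 = 0.5 * np.dot(fx, fx)            # residual level function T(x_k)
        lam = 1.0
        while lam >= lam_min:
            f_trial = F(x + lam * dx)
            if 0.5 * np.dot(f_trial, f_trial) <= (1.0 - 0.5 * lam) * T0:
                break                         # Armijo test (1.262) passed
            lam *= 0.5
        x = x + lam * dx
    return x
```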
5.2.3 Affine covariant trust region method

The Levenberg-Marquardt method can be easily reformulated to yield an affine covariant version. Since affine covariance means affine invariance with respect to transformations in the domain of definition, we have to modify the objective functional:

    inf_{Δxk ∈ K} ‖F′(xk)⁻¹ ( F(xk) + F′(xk) Δxk )‖ ,                     (1.263)

whereas the set of constraints K is given as before.

The affine covariant trust region method (1.263) admits an easy geometric interpretation as shown in Figure 2. The set K of constraints is represented as a sphere with radius δ around xk. If δ exceeds the length of the Newton correction Δxk, the constraint is not active, and we are in the regime of the ordinary Newton method. However, if δ is smaller than the length of the Newton correction Δxk, we have to apply an appropriate damping.

5.2.4 Affine contravariant trust region method

We can also easily reformulate the Levenberg-Marquardt method to come up with an affine contravariant version. Since affine contravariance means affine invariance with respect to transformations in the range space, the objective functional remains unchanged, but we have to modify the set of constraints:

    inf_{Δxk ∈ K̃} ‖F(xk) + F′(xk) Δxk‖ ,

where the set of constraints K̃ is given as follows:

    K̃ := { Δxk ∈ ℝⁿ | ‖F′(xk) Δxk‖ ≤ δ } .                                (1.264)

There is basically the same geometric interpretation as before with the only difference that now the picture has to be drawn in the range space.
5.3 Globalization of affine contravariant Newton methods

5.3.1 Convergence of the damped Newton iteration

We consider the damped Newton iteration

    F′(xk) Δxk = −F(xk) ,                                                 (1.265)
    xk+1 = xk + λk Δxk ,   λk ∈ [0, 1] ,

in an affine contravariant setting where the damping factor λk is chosen to achieve residual contraction.

Theorem 5.2 Optimal choice of the damping factor

Assume that F : D ⊆ ℝⁿ → ℝⁿ, D convex, is continuously differentiable on D with regular Jacobian F′(x), x ∈ D. We further suppose that the following affine contravariant Lipschitz condition holds true:

    ‖( F′(y) − F′(x) )(y − x)‖ ≤ ω ‖F′(x)(y − x)‖² ,   x, y ∈ D .         (1.266)

Setting hk := ω ‖F(xk)‖, for λ ∈ [0, min(1, 2/hk)] we have

    ‖F(xk + λ Δxk)‖ ≤ tk(λ) ‖F(xk)‖ ,                                     (1.267)

where

    tk(λ) := 1 − λ + ½ hk λ² .

The optimal choice of the damping factor is

    λk := min( 1, 1/hk ) .                                                (1.268)

Proof. By straightforward calculation, observing F′(xk)Δxk = −F(xk), we find

    ‖F(xk + λΔxk)‖ = ‖ ∫₀^λ ( F′(xk + tΔxk) − F′(xk) ) Δxk dt + (1 − λ) F(xk) ‖ ≤
        ≤ ‖ ∫₀^λ ( F′(xk + tΔxk) − F′(xk) ) Δxk dt ‖ + (1 − λ) ‖F(xk)‖ .

The first term on the right-hand side measures the deviation from the Newton path. Using the affine contravariant Lipschitz condition, it can be estimated as follows:

    ‖ ∫₀^λ ( F′(xk + tΔxk) − F′(xk) ) Δxk dt ‖ ≤ ω ∫₀^λ t dt ‖F′(xk)Δxk‖² = ½ λ² ω ‖F(xk)‖² = ½ hk λ² ‖F(xk)‖ .

Inserting this estimate into the previous one and minimizing tk(λ) proves the theorem.
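Assuming the Lipschitz constant ω of (1.266) is known (in practice it must be estimated, as in Section 5.3.2 below), the damping rule (1.268) translates into a few lines of code; a minimal sketch with hypothetical names and defaults:

```python
import numpy as np

def contravariant_damped_newton(F, J, x0, omega, tol=1e-10, kmax=100):
    """Damped Newton iteration (1.265) with the optimal damping factor
    lam_k = min(1, 1/h_k),  h_k = omega * ||F(x_k)||,  from (1.268)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(kmax):
        fx = F(x)
        norm_f = np.linalg.norm(fx)
        if norm_f <= tol:
            break
        hk = omega * norm_f                          # Kantorovich quantity h_k
        lam = min(1.0, 1.0 / hk) if hk > 0 else 1.0  # optimal damping (1.268)
        dx = np.linalg.solve(J(x), -fx)              # Newton correction
        x = x + lam * dx
    return x
```

Note that hk < 1 gives lam = 1, i.e., the undamped Newton method is recovered asymptotically.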

Theorem 5.3 Global convergence of affine contravariant Newton methods

Under the same assumptions as in Theorem 5.2, let D0 be the path-connected component of the level set G(x0) and suppose that D0 is a compact subset of D. Then, for all damping factors

    λk ∈ [ ε, 2 min(1, 1/hk) − ε ]                                        (1.269)

with ε > 0 sufficiently small, the damped Newton iterates xk, k ∈ ℕ₀, converge to some x* ∈ D0 with F(x*) = 0.

Proof. The parabola tk(λ) from Theorem 5.2 can be bounded by a polygonal as follows:

    tk(λ) ≤ { 1 − ½ λ ,           0 ≤ λ ≤ 1/hk ,
            { 1 − 1/hk + ½ λ ,    1/hk ≤ λ ≤ 2/hk .

For 0 < ε ≤ 1/hk and λk ∈ [ε, 2 min(1, 1/hk) − ε] we thus have

    tk(λk) ≤ 1 − ½ ε ,                                                    (1.270)

which shows strict reduction of the residual level function T(x).

The existence of a global ε > 0 follows from the compactness assumption on D0, which implies

    max_{x ∈ D0} ω ‖F(x)‖ < ∞ .

Consequently, if G(xk) ⊆ D0, then (1.270) yields

    G(xk+1) ⊆ G(xk) .

The rest of the proof is along the same lines as the proof of the affine contravariant Newton-Mysovskikh theorem.

5.3.2 Adaptive affine contravariant trust region strategy

In Theorem 5.2 we derived the theoretical damping factor (1.268). Since the Kantorovich quantity hk = ω‖F(xk)‖ cannot be accessed directly, we again have to provide appropriate estimates

    [hk] := [ω] ‖F(xk)‖ ≤ hk ,                                            (1.271)

where [ω] ≤ ω is a lower bound for the domain dependent Lipschitz constant that can be obtained by pointwise sampling. Then, an estimate of the optimal damping factor is given by means of

    [λk] := min( 1, 1/[hk] ) .

It follows readily from (1.271) that

    [λk] ≥ λk ,                                                           (1.272)

i.e., we may have a considerable overestimation. As a remedy, repeated reductions must be performed by appropriate prediction and correction strategies. The following bit counting lemma gives information about the contraction in the residuals in terms of the accuracy of the estimate for the Kantorovich quantity.

Lemma 5.3 Bit counting lemma

Assume that for some 0 < ε ≤ 1 there holds

    0 ≤ hk − [hk] ≤ ε max( 1, [hk] ) .                                    (1.273)

Then, the residual monotonicity test (1.267) yields

    ‖F(xk+1)‖ ≤ ( 1 − ½ (1 − ε) [λk] ) ‖F(xk)‖ .                          (1.274)

Proof. The assumption (1.273) can be rewritten as

    [hk] ≤ hk ≤ (1 + ε) max( 1, [hk] ) ,

which results in the following estimate of the residual contraction:

    ‖F(xk+1)‖ / ‖F(xk)‖ ≤ [ 1 − λ + ½ hk λ² ]|_{λ=[λk]} ≤
        ≤ [ 1 − λ + ½ (1 + ε) λ ]|_{λ=[λk]} = 1 − ½ (1 − ε) [λk] ,

where we have used max(1, [hk]) λ ≤ 1 for λ = [λk] = min(1, 1/[hk]).

Remark 5.4 Restricted residual monotonicity test

For ε ≤ ½, (1.274) results in the restricted residual monotonicity test

    ‖F(xk+1)‖ ≤ ( 1 − ¼ λk ) ‖F(xk)‖ .                                    (1.275)

This should be compared with the heuristically derived Armijo strategy (1.262).

In order to come up with a computationally feasible, affine contravariant adaptive trust region strategy, we have to provide appropriate estimates of the Kantorovich quantities. For the k-th iteration step, such a strategy consists of a correction step, providing a reliable damping factor, and a prediction step, predicting an initial guess λ0_{k+1} for the subsequent (k+1)-st iteration step.


As far as the correction step is concerned, we recall that the damped Newton method with damping factor λ ∈ [0, 1] represents a deviation from the Newton path which can be measured by means of

    ‖F(xk+1) − (1 − λ) F(xk)‖ ≤ ½ hk λ² ‖F(xk)‖ .

This leads us to the following lower bound for the affine contravariant Kantorovich quantity:

    [hk] := 2 ‖F(xk+1) − (1 − λ) F(xk)‖ / ( λ² ‖F(xk)‖ ) ≤ hk .

Using the prediction λ0_k from the previous step, for i ≥ 0 we compute the trial iterate

    xk+1 = xk + λ^i_k Δxk

and perform the residual monotonicity test

    ‖F(xk+1)‖ ≤ ( 1 − ¼ λ^i_k ) ‖F(xk)‖ .

If the test is successful, we accept the current value λ^i_k as the damping factor. Otherwise, we set

    λ^{i+1}_k := min( ½ λ^i_k , 1/[hk]^i ) .

As long as λ^{i+1}_k ≥ λmin, this gives us a new trial iterate. However, if λ^{i+1}_k < λmin, the process is stopped (convergence failure).

For the prediction of a damping factor λ0_{k+1}, we recall

    hk+1 = ( ‖F(xk+1)‖ / ‖F(xk)‖ ) hk .

Denoting by i the index for which λ^i_k passed the residual monotonicity test, we use the lower bound

    [h0_{k+1}] := ( ‖F(xk+1)‖ / ‖F(xk)‖ ) [hk]^i

and set

    λ0_{k+1} := min( 1, 1/[h0_{k+1}] ) .

Obviously, we need an initial guess λ0_0, which is chosen as λ0_0 = 1 for mildly nonlinear problems and λ0_0 ≪ 1 for highly nonlinear problems.
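The correction and prediction steps combine into the following sketch of the adaptive strategy; loop bounds, tolerances, and the function name are illustrative assumptions, not values from the text:

```python
import numpy as np

def adaptive_contravariant_newton(F, J, x0, lam0=1.0, lam_min=1e-8,
                                  tol=1e-10, kmax=50, imax=30):
    """Adaptive affine contravariant damping: correction loop with the estimate
    [h_k] = 2||F(x+lam dx) - (1-lam)F(x)|| / (lam^2 ||F(x)||), the restricted
    test ||F(x_{k+1})|| <= (1 - lam/4)||F(x_k)||, and damping prediction."""
    x = np.asarray(x0, dtype=float)
    lam = lam0
    for _ in range(kmax):
        fx = F(x)
        nf = np.linalg.norm(fx)
        if nf <= tol:
            return x, True
        dx = np.linalg.solve(J(x), -fx)
        for _ in range(imax):                        # correction loop
            f_new = F(x + lam * dx)
            hk = 2.0 * np.linalg.norm(f_new - (1.0 - lam) * fx) / (lam**2 * nf)
            if np.linalg.norm(f_new) <= (1.0 - 0.25 * lam) * nf:
                break                                # test passed: accept lam
            lam = min(0.5 * lam, 1.0 / hk if hk > 0 else 0.5 * lam)
            if lam < lam_min:
                return x, False                      # convergence failure
        x = x + lam * dx
        # prediction: [h_{k+1}] = (||F(x_{k+1})|| / ||F(x_k)||) [h_k]
        h_pred = (np.linalg.norm(f_new) / nf) * hk
        lam = min(1.0, 1.0 / h_pred) if h_pred > 0 else 1.0
    return x, np.linalg.norm(F(x)) <= tol
```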


5.4 Error oriented descent

We consider the damped Newton iteration

    F′(xk) Δxk = −F(xk) ,                                                 (1.276)
    xk+1 = xk + λk Δxk ,   λk ∈ [0, 1] ,

in an affine covariant framework. Consequently, the damping factor λk has to be chosen in such a way that the deviation from the Newton path is controlled in the light of an affine covariant Lipschitz condition. This will lead to the natural monotonicity test

    ‖Δx̄k+1‖ < ‖Δxk‖ ,                                                     (1.277)

where Δx̄k+1 stands for the simplified Newton correction

    F′(xk) Δx̄k+1 = −F(xk+1) .                                             (1.278)
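The natural monotonicity test (1.277)/(1.278) leads to the following damping loop; a minimal sketch with hypothetical defaults, in which simple halving is used in place of the adaptive strategy of Section 5.4.3, and the simplified correction re-solves with the same Jacobian (a production code would reuse its LU factorization):

```python
import numpy as np

def covariant_damped_newton(F, J, x0, lam_min=1e-8, tol=1e-12, kmax=50):
    """Damped Newton iteration with the natural monotonicity test (1.277):
    accept lam as soon as the simplified correction dx_bar, solving
    F'(x_k) dx_bar = -F(x_k + lam*dx)  (cf. (1.278)),
    satisfies ||dx_bar|| <= ||dx||."""
    x = np.asarray(x0, dtype=float)
    for _ in range(kmax):
        Jx = J(x)
        dx = np.linalg.solve(Jx, -F(x))              # ordinary Newton correction
        if np.linalg.norm(dx) <= tol:
            break
        lam = 1.0
        while lam >= lam_min:
            x_trial = x + lam * dx
            dx_bar = np.linalg.solve(Jx, -F(x_trial))  # simplified correction
            if np.linalg.norm(dx_bar) <= np.linalg.norm(dx):  # test (1.277)
                break
            lam *= 0.5                               # reduce damping and retry
        x = x_trial
    return x
```

On mildly nonlinear problems the test accepts lam = 1 immediately, so the undamped Newton method with its fast local convergence is recovered.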

5.4.1 General level functions

We start from the following, easily verifiable local descent result for the Newton direction with respect to general level functions

    T_A(x) := ½ ‖A F(x)‖² ,   A ∈ GL(n) .                                 (1.279)

Lemma 5.4 General downhill property

Assume that F : D ⊆ ℝⁿ → ℝⁿ, D ⊆ ℝⁿ convex, is continuously differentiable. Then, for all A ∈ GL(n) there holds

    (Δxk)ᵀ grad T_A(xk) = −2 T_A(xk) < 0 .                                (1.280)

The previous result tells us that with regard to first order information all level functions are equally well suited. In order to be more selective, we have to use second order information.
Theorem 5.4 Affine covariant downhill property

Assume that F : D ⊆ ℝⁿ → ℝⁿ, D ⊆ ℝⁿ convex, is continuously differentiable on D with regular Jacobian F′(x), x ∈ D, and suppose further that the following affine covariant Lipschitz condition holds true:

    ‖F′(x)⁻¹ ( F′(y) − F′(x) )(y − x)‖ ≤ ω ‖y − x‖² ,   x, y ∈ D .        (1.281)

Let xk ∈ D be an iterate such that

    G_A(xk) ⊆ D ,   A ∈ GL(n) ,                                           (1.282)

and denote by hk and h̄k the Kantorovich quantities

    hk := ω ‖Δxk‖ ,   h̄k := hk cond( A F′(xk) ) .                         (1.283)

Then, for all λ ∈ [0, min(1, 2/h̄k)] there holds

    ‖A F(xk + λΔxk)‖ ≤ t^A_k(λ) ‖A F(xk)‖ ,                               (1.284)

where

    t^A_k(λ) := 1 − λ + ½ λ² h̄k .                                         (1.285)

The optimal damping factor is given by

    λk(A) := min( 1, 1/h̄k ) .                                             (1.286)

Proof. For λ ∈ [0, 1] we have, observing F′(xk)Δxk = −F(xk),

    ‖A F(xk + λΔxk)‖ = ‖ A ∫₀^λ ( F′(xk + tΔxk) − F′(xk) ) Δxk dt + (1 − λ) A F(xk) ‖ ≤
        ≤ ‖ A ∫₀^λ ( F′(xk + tΔxk) − F′(xk) ) Δxk dt ‖ + (1 − λ) ‖A F(xk)‖ .

Invoking the affine covariant Lipschitz condition, for the first term on the right-hand side we obtain

    ‖ A F′(xk) ∫₀^λ F′(xk)⁻¹ ( F′(xk + tΔxk) − F′(xk) ) Δxk dt ‖ ≤
        ≤ ‖A F′(xk)‖ ω ∫₀^λ t ‖Δxk‖ dt ‖Δxk‖ = ½ λ² hk ‖A F′(xk)‖ ‖Δxk‖ .

In view of Δxk = −( A F′(xk) )⁻¹ A F(xk), we further get

    ½ λ² hk ‖A F′(xk)‖ ‖( A F′(xk) )⁻¹ A F(xk)‖ ≤
        ≤ ½ λ² hk ‖A F′(xk)‖ ‖( A F′(xk) )⁻¹‖ ‖A F(xk)‖ = ½ λ² h̄k ‖A F(xk)‖ .

Combining the previous estimates gives the assertions.

In view of Theorem 5.4 we readily get the following global convergence result.

Theorem 5.5 Affine covariant global convergence theorem

In addition to the assumptions of Theorem 5.4, let x0 ∈ D be an initial guess such that the path-connected component D0 of G_A(x0) is a compact subset of D. Then, for all damping parameters

    λk ∈ [ ε, 2 λk(A) − ε ] ,                                             (1.287)

with ε > 0 being sufficiently small, the damped Newton method converges to some x* ∈ D0 with F(x*) = 0.

Proof. As before, we remark that the parabola t^A_k(λ) can be bounded from above by a polygonal bound according to

    t^A_k(λ) ≤ 1 − ½ λ ,   0 ≤ λ ≤ 1/h̄k .                                 (1.288)

Moreover, there is a global ε, since with regard to the compactness assumption on D0 we have

    max_{x ∈ D0} ω ‖F′(x)⁻¹ F(x)‖ cond( A F′(x) ) < ∞ .

The proof proceeds by induction on k: Assuming G_A(xk) ⊆ D0, (1.288) yields

    G_A(xk+1) ⊆ G_A(xk) ⊆ D0 .

Consequently, the sequence of Newton iterates lives in a compact set, which allows to conclude.

Remark 5.5 The flaws of residual monotonicity

Setting A = I in the previous theorem, we are obviously back in the residual based regime where we have proved global convergence according to Theorem 5.3. However, if the Jacobian F′(xk) is ill conditioned, we obtain

    h̄k = hk cond( F′(xk) ) ≫ 1 ,   hence λk(I) ≪ 1 ,                      (1.289)

Figure 3: Reduction factors and optimal damping factors


which algorithmically will result in a termination of the iteration.
5.4.2 Natural level function

In view of (1.283) and (1.286), the most natural choice of the matrix A ∈ GL(n) in the level function T_A is

    A := Ak = F′(xk)⁻¹ .                                                  (1.290)

The associated level function T_{F′(xk)⁻¹} is called the natural level function, which gives rise to the natural monotonicity test

    ‖Δx̄k+1‖ ≤ ‖Δxk‖                                                       (1.291)

in terms of the simplified Newton correction

    Δx̄k+1 = −F′(xk)⁻¹ F(xk+1) .                                           (1.292)

Several remarks are due with respect to the properties of the natural level function.

Remark 5.6 Extremal properties

As shown in Figure 3, for A ∈ GL(n) the reduction factors t^A_k(λ) and the optimal damping factors λk(A) satisfy

    t^{Ak}_k(λ) = 1 − λ + ½ λ² hk ≤ t^A_k(λ) ,                            (1.293)

    λk(Ak) = min( 1, 1/hk ) ≥ λk(A) .                                     (1.294)

Figure 4: Asymptotic distance spheres associated with natural level sets

Remark 5.7 Steepest descent property

The damped Newton method in xk is a method of steepest descent for the natural level function T_{Ak}:

    Δxk = −grad T_{Ak}(xk) .                                              (1.295)

Remark 5.8 Asymptotic optimality

In view of

    hk < 1   ⟹   λk(Ak) = 1 ,                                             (1.296)

the damped Newton method asymptotically achieves quadratic convergence.


Remark 5.9 Asymptotic distance function

If F : D ⊆ ℝⁿ → ℝⁿ is twice continuously differentiable, we can show

    T_{F′(x*)⁻¹}(x) = ½ ‖x − x*‖² + O( ‖x − x*‖³ ) .

Hence, for xk → x* the natural monotonicity criterion approaches a distance criterion of the form

    ‖xk+1 − x*‖ ≤ ‖xk − x*‖ .

As shown in Figure 4, close to the solution x* the natural level surface is close to a sphere, whereas it degenerates to an osculating sphere with increasing distance to x*. Note that for other level functions, the level surface is an ellipsoid close to x*, with the ratio of the largest to the smallest half-axis being related to the condition number of the Jacobian, and an osculating ellipsoid off x*.

Remark 5.10 Local descent

If we insert A = Ak into (1.285),(1.286) of Theorem 5.4, we get the local descent property

    ‖Δx̄k+1‖ ≤ ( 1 − λ + ½ λ² hk ) ‖Δxk‖ .                                 (1.297)

Remark 5.11 Global convergence

We note that the results of Theorem 5.5 are not applicable to the situation at hand, since A = Ak changes from one step to the other. Taking the asymptotic distance function property into account, in the subsequent global convergence result we make the fixed choice A = F′(x*)⁻¹.
Theorem 5.6 Global convergence of the affine covariant damped Newton method with natural level functions; Part I

Assume that F : D ⊆ ℝⁿ → ℝⁿ, D ⊆ ℝⁿ convex, is continuously differentiable on D with regular Jacobian F′(x), x ∈ D, and suppose that the following affine covariant Lipschitz condition is fulfilled:

    ‖F′(x*)⁻¹ ( F′(y) − F′(x) )(y − x)‖ ≤ ω ‖y − x‖² ,   x, y ∈ D .       (1.298)

Suppose further that x* ∈ D is the unique solution in D and let x0 ∈ D be an initial guess such that the path-connected component of G_{F′(x*)⁻¹}(x0) is a compact subset of D.

Let the damping factors be chosen according to

    λk ∈ [ ε, 2 λ̄k − ε ] ,   0 < ε < 1/h̄k ,                              (1.299)

where

    λ̄k := min( 1, 1/h̄k ) ,   h̄k := ω ‖Δxk‖ ‖F′(xk)⁻¹ F′(x*)‖ .          (1.300)

Then, the damped Newton iteration converges globally to x*.

Proof. In the proof of Theorem 5.4 we have shown

    ‖A F(xk + λΔxk)‖ ≤ ‖ A ∫₀^λ ( F′(xk + tΔxk) − F′(xk) ) Δxk dt ‖ + (1 − λ) ‖A F(xk)‖ .

Choosing A = F′(x*)⁻¹ and observing

    Δxk = −F′(xk)⁻¹ F(xk) = −F′(xk)⁻¹ F′(x*) F′(x*)⁻¹ F(xk) ,

the first term on the right-hand side of the previous inequality is now estimated as follows:

    ‖ F′(x*)⁻¹ ∫₀^λ ( F′(xk + tΔxk) − F′(xk) ) Δxk dt ‖ ≤
        ≤ ½ λ² ω ‖Δxk‖ ‖F′(xk)⁻¹ F′(x*)‖ ‖F′(x*)⁻¹ F(xk)‖ = ½ λ² h̄k ‖F′(x*)⁻¹ F(xk)‖ .

The rest of the proof proceeds in exactly the same manner as in the proof of Theorem 5.5.

In much the same way as we derived Theorem 5.5 from Theorem 5.4, the previous results imply the following convergence statement in a more realistic scenario:

Corollary 5.7 Global convergence of the affine invariant damped Newton method; Part II

Assume that all assumptions of Theorem 5.6 are met, except that the affine covariant Lipschitz condition is replaced by one with a local Lipschitz constant:

    ‖F′(z)⁻¹ ( F′(y) − F′(x) )(y − x)‖ ≤ ω(z) ‖y − x‖² ,   x, y, z ∈ D0 . (1.301)

Then, the damped Newton method converges for

    λ ∈ [ ε, 2 λk(z) − ε ] ,   0 < ε < 1/hk(z) ,                          (1.302)

with the optimal damping factor given by

    λk(z) := min( 1, 1/hk(z) ) ,                                          (1.303)

where

    hk(z) := ω(z) ‖Δxk‖ ‖F′(xk)⁻¹ F′(z)‖ .                                (1.304)

Figure 5: Newton path G(xk), trust region around xk and Newton step with locally optimal damping factor

We have a local level function reduction according to

    T_{F′(z)⁻¹}(xk + λΔxk) ≤ ( 1 − λ + ½ λ² hk(z) )² T_{F′(z)⁻¹}(xk) .    (1.305)

Remark 5.12 Recovery of the exact Newton method

If we choose z := xk, we recover the theoretically optimal damping strategy for the exact Newton method.

Remark 5.13 Geometrical interpretation

As shown in Figure 5, the damped Newton method proceeds along the tangent of the Newton path G(xk) with the actual steplength

    ‖xk+1 − xk‖ = λk ‖Δxk‖ ≤ ρk := 1/ω(xk) ,

where the radius ρk describes the local trust region around the current iterate xk.
5.4.3 Adaptive trust region strategies
We provide lower estimates

[ω_k] ≤ ω_k ,   [h_k] ≤ h_k   (1.306)

for the Lipschitz constant and the Kantorovich quantity (e.g., by pointwise
sampling of the domain), and thus get an upper estimate

[λ_k] := min(1, 1/[h_k]) ≥ λ̄_k   (1.307)

of the damping factor. Since (1.307) usually leads to an overestimation, we
have to perform appropriate prediction and correction strategies which
depend on the required accuracy.
Lemma 5.5 Bit counting lemma
Assume that for some 0 < σ ≤ 1 there holds

0 ≤ h_k − [h_k] ≤ σ max(1, [h_k]) .   (1.308)

Then, the natural monotonicity test gives

‖Δx̄^{k+1}‖ ≤ (1 − ½ (1 − σ) λ) ‖Δx^k‖ .   (1.309)

Proof. The proof is left as an exercise.

Remark 5.14 Restricted natural monotonicity test


For 12 , the bit counting lemma suggests the following restricted natural
monotonicity test

kxk+1 k 1
kxk k .
(1.310)
4
Correction strategy
We have to monitor the deviation from the Newton path. In an affine covariant setting we have the upper bound

‖Δx̄^{k+1}(λ) − (1 − λ) Δx^k‖ ≤ ½ λ² ω_k ‖Δx^k‖² ,

where Δx̄^{k+1}(λ) denotes the simplified Newton correction at the trial iterate
x^k + λΔx^k. This gives us the following estimate for the Kantorovich quantity:

[h_k](λ) = [ω_k] ‖Δx^k‖ := 2 ‖Δx̄^{k+1}(λ) − (1 − λ) Δx^k‖ / (λ² ‖Δx^k‖) ≤ h_k .   (1.311)

Assuming that we have a trial value λ_k⁰ at our disposal, for j ≥ 0 we compute

λ_k^{j+1} := min( λ_k^j / 2 , 1/[h_k](λ_k^j) )   (1.312)

as long as the restricted natural monotonicity test fails.


Prediction strategy
The prediction strategy aims to provide a reasonable initial estimate λ_k⁰. Such
an estimate can only be obtained if we use a Lipschitz constant ω_k satisfying

‖F′(x^k)⁻¹ (F′(x) − F′(x^k)) v‖ ≤ ω_k ‖x − x^k‖ ‖v‖ ,

where v is supposed to be in some sense close to x − x^k.


We are thus led to the local estimate

‖Δx̄^k − Δx^k‖ = ‖(F′(x^{k−1})⁻¹ − F′(x^k)⁻¹) F(x^k)‖ =
= ‖F′(x^k)⁻¹ (F′(x^k) − F′(x^{k−1})) Δx̄^k‖ ≤ ω_k λ_{k−1} ‖Δx^{k−1}‖ ‖Δx̄^k‖ ,

where Δx̄^k := −F′(x^{k−1})⁻¹ F(x^k) denotes the simplified Newton correction.
This suggests the estimate

[ω_k] := ‖Δx̄^k − Δx^k‖ / (λ_{k−1} ‖Δx^{k−1}‖ ‖Δx̄^k‖) ≤ ω_k .   (1.313)

The estimate (1.313) exploits the newest information and leads us to the prediction strategy

λ_k⁰ := min(1, μ_k) ,   μ_k := ( ‖Δx^{k−1}‖ ‖Δx̄^k‖ / (‖Δx̄^k − Δx^k‖ ‖Δx^k‖) ) λ_{k−1} .   (1.314)

Finally, as far as an initial guess λ₀⁰ is concerned, we choose λ₀⁰ = 1 for mildly
nonlinear problems and λ₀⁰ ≪ 1 for highly nonlinear problems.
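A minimal sketch of the complete adaptive strategy, combining the prediction (1.314) with the a posteriori estimate (1.311) and the correction (1.312) under the restricted monotonicity test (1.310), again on the scalar toy problem F(x) = arctan(x) (not from the text); the numerical guards (lower bounds 1e-12, 1e-30) are implementation choices.

```python
import math

def F(x):
    return math.atan(x)

def dF(x):
    return 1.0 / (1.0 + x * x)

def adaptive_damped_newton(x, lam=1.0, tol=1e-12, max_iter=50):
    x_old = dx_old = lam_old = None
    for _ in range(max_iter):
        dx = -F(x) / dF(x)                      # ordinary Newton correction
        if abs(dx) < tol:
            break
        if dx_old is not None:
            dx_bar = -F(x) / dF(x_old)          # simplified correction, old derivative
            denom = max(abs(dx_bar - dx) * abs(dx), 1e-30)
            mu = abs(dx_old) * abs(dx_bar) / denom * lam_old
            lam = min(1.0, mu)                  # prediction (1.314)
        while lam > 1e-12:
            dx_next = -F(x + lam * dx) / dF(x)  # simplified correction at trial point
            if abs(dx_next) <= (1.0 - lam / 4.0) * abs(dx):
                break                           # restricted monotonicity test (1.310)
            h = max(2.0 * abs(dx_next - (1.0 - lam) * dx)
                    / (lam * lam * abs(dx)), 1e-30)   # a posteriori estimate (1.311)
            lam = min(lam / 2.0, 1.0 / h)       # correction (1.312)
        x_old, dx_old, lam_old = x, dx, lam
        x = x + lam * dx
    return x

root = adaptive_damped_newton(10.0)
```

After the first, heavily damped step, the prediction (1.314) immediately proposes λ = 1 again, so the quadratic convergence of the local phase is not lost.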


6. Continuation Methods for Parameter Dependent Systems

In applications, nonlinear systems frequently occur that depend on one or several parameters representing the influence of specific quantities on the behavior
of the system. A classical example of such a parameter dependent nonlinear system is the Bratu problem

−Δu = λ exp(u)   in Ω ⊂ ℝ³ ,
u = 0   on Γ = ∂Ω ,   (1.315)

describing certain exothermal chemical reactions, where λ > 0 stands for the
so-called Arrhenius parameter.

Figure 6: Bifurcation diagram for the Bratu problem

An important feature of such problems is that they typically exhibit multiple
solutions and bifurcation phenomena. Figure 6 shows the ‖·‖_{1,∞}-norm of
the solution as a function of the parameter λ. We see that there is a principal solution branch which has a left-winding fold point for some critical
parameter value λ* (actually λ* ≈ 6.8 for the Bratu problem). The principal solution branch up to that critical value represents the physically stable
branch, whereas the upper part of the principal solution branch corresponds to
the physically unstable branch. Moreover, on the unstable branch, primary
bifurcation from the principal branch as well as secondary bifurcation may
occur.
6.1 Newton continuation methods
6.1.1 Introduction
If we discretize the Bratu problem (1.315) by finite differences or finite elements,
it takes the form of a parameter dependent system of nonlinear equations in Euclidean space ℝ^n:
Given a continuously differentiable function F : D × I → ℝ^n, D ⊆ ℝ^n, I ⊆ ℝ,
find (x, λ) ∈ D × I such that

F(x, λ) = 0 .   (1.316)

Assume that (x*, λ*) is the unique solution of (1.316) in D × I and that there
exists a neighborhood U(x*) × I(λ*) ⊆ D × I such that F_x(x, λ) ∈ ℝ^{n×n} is nonsingular for all (x, λ) ∈ U(x*) × I(λ*). Then, the implicit function theorem
asserts the existence of a continuously differentiable function

x : I(λ*) → U(x*) ,

called the homotopy path, such that x(λ*) = x* and

F(x(λ), λ) = 0 ,   λ ∈ I(λ*) .   (1.317)

Differentiation with respect to λ results in a linearly implicit ODE, called the
Davidenko differential equation:

F_x x′ + F_λ = 0 ,   λ ∈ I(λ*) .   (1.318)

By introducing the augmented variable

y := (x, λ) ∈ ℝ^{n+1} ,   (1.319)

equation (1.316) can be rewritten as

F(y) = 0 ,   (1.320)

whereas the Davidenko equation takes the form

F′(y) t(y) = 0 ,   (1.321)

with the tangent t(y) of the path being uniquely defined in a neighborhood of y* = (x*, λ*) up
to some appropriate normalization (e.g., ‖t‖ = 1), provided

rank F′(y*) = n   ⟺   dim ker F′(y*) = 1 .   (1.322)


However, if for some k ≥ 1 there holds

rank F′(y*) = n − k   ⟺   dim ker F′(y*) = k + 1 ,   (1.323)

the point y* = (x*, λ*) is a critical point.
In particular, in case k = 1, the point y* is said to be a simple bifurcation
point.
A special role is played by turning points or fold points which occur for
k = 0, if

rank F′(y*) = n   and   rank F_x(x*, λ*) = n − 1 .   (1.324)

As far as the numerical solution of (1.316) is concerned, from a theoretical
point of view one might be tempted to solve the Davidenko equation (1.318) by
an appropriate numerical integrator for implicit ODEs. This, however, is not
suitable, since
• it usually requires second order information in terms of F_xx, F_xλ, whose
computation can be computationally costly,
and, more importantly,
• it does not directly enforce the solution of (1.316), so that due to error
propagation the computed solution path may drift away from the true
solution path.
Therefore, in the sequel we will focus on discrete continuation methods.
6.1.2 Classification of continuation methods
For the solution of the parameter dependent problem (1.316) we subdivide the
interval I := [0, L] into subintervals by means of the partition

0 =: λ₀ < λ₁ < ... < λ_{N−1} < λ_N := L ,

and consider the local problems

F(x, λ_ν) = 0 ,   0 ≤ ν ≤ N .   (1.325)

The solution of (1.325) requires a good initial guess, which will be provided by
some appropriately chosen prediction method. A related important issue is
an adaptive selection of the steplengths

Δλ_ν := λ_{ν+1} − λ_ν ,   0 ≤ ν ≤ N − 1 .

Both issues will be addressed within an affine covariant theory.


Denoting the solution of (1.325) by x(λ), λ_ν ≤ λ ≤ λ_{ν+1}, it is, of course, natural
to start from x(λ_ν). However, there are different ways to construct a prediction
path x̄(λ), λ_ν ≤ λ ≤ λ_{ν+1}, emanating from x(λ_ν). In general, a continuation
method defined via a prediction path x̄(λ) is said to be of order p, if there
exists a positive constant η_p such that

‖x̄(λ) − x(λ)‖ ≤ η_p Δλ^p ,   (1.326)

where Δλ := λ − λ_ν.
In the sequel, as the most important examples of prediction paths we will
consider
• the classical continuation method,
• the tangent continuation method,
• the standard and the partial standard embedding,
• the polynomial continuation method.

Figure 7: Classical (green) and tangent (blue) continuation


(i) Classical continuation
The simplest prediction path is the constant path

x̄(λ) := x(λ_ν) ,   λ_ν ≤ λ ≤ λ_{ν+1} .   (1.327)

Obviously, we have

‖x̄(λ) − x(λ)‖ = ‖x(λ) − x(λ_ν)‖ ≤ Δλ max_{s∈[λ_ν,λ_{ν+1}]} ‖x′(s)‖ .

Hence, the method is of first order with the order coefficient

η₁ := max_{s∈[λ_ν,λ_{ν+1}]} ‖x′(s)‖ .

(ii) Tangent continuation
An alternative way to obtain a prediction path is to apply the explicit Euler
method to the Davidenko equation (1.318):

x̄(λ) := x(λ_ν) + (λ − λ_ν) x′(λ_ν) ,   λ_ν ≤ λ ≤ λ_{ν+1} .   (1.328)

Therefore, the tangent continuation is also referred to as the Euler continuation or the method of incremental load. Figure 7 shows both the classical
and the tangent continuation method.
As far as the order is concerned, we have

‖x̄(λ) − x(λ)‖ = ‖x(λ) − x(λ_ν) − Δλ x′(λ_ν)‖ ≤ ½ Δλ² max_{s∈[λ_ν,λ_{ν+1}]} ‖x″(s)‖ .

Hence, the method is of second order with the order coefficient

η₂ := ½ max_{s∈[λ_ν,λ_{ν+1}]} ‖x″(s)‖ .
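The two predictors can be compared on a scalar model problem (illustrative, not from the text): for F(x, λ) = x − λ cos x, the tangent direction x′ is obtained from the Davidenko equation F_x x′ = −F_λ, and the tangent predictor is visibly more accurate than the classical one for the same steplength.

```python
import math

# Scalar parameter dependent model F(x, lam) = x - lam*cos(x); the homotopy
# path x(lam) solves x = lam*cos(x).
def F(x, lam):     return x - lam * math.cos(x)
def F_x(x, lam):   return 1.0 + lam * math.sin(x)
def F_lam(x, lam): return -math.cos(x)

def corrector(x, lam, tol=1e-14):
    # ordinary Newton method in x for fixed lam
    for _ in range(50):
        dx = -F(x, lam) / F_x(x, lam)
        x += dx
        if abs(dx) < tol:
            break
    return x

lam0, dlam = 0.5, 0.2
x0 = corrector(0.0, lam0)                 # point x(lam0) on the homotopy path
x_true = corrector(x0, lam0 + dlam)       # reference value x(lam0 + dlam)

x_classical = x0                                      # constant path (1.327)
xdot = -F_lam(x0, lam0) / F_x(x0, lam0)               # Davidenko: F_x x' = -F_lam
x_tangent = x0 + dlam * xdot                          # explicit Euler step (1.328)

err_classical = abs(x_classical - x_true)   # O(dlam)
err_tangent = abs(x_tangent - x_true)       # O(dlam^2), markedly smaller
```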

(iii) Standard embedding and partial standard embedding
Another way to obtain a prediction path is to consider the embedding

F̄(x, λ) := F(x) − (1 − λ) F(x⁰) ,   λ_ν ≤ λ ≤ λ_{ν+1} .   (1.329)

However, this method does not exploit any structure of F. Therefore, a better
way is to select only a component of the mapping, which leads to the so-called
partial standard embedding

F̄(x, λ) := P F(x) + (I − P) (F(x) − (1 − λ) F(x⁰)) ,   λ_ν ≤ λ ≤ λ_{ν+1} ,   (1.330)

where P is an appropriate orthogonal projection.


Lemma 6.1 Classical and tangent continuation based on partial standard embedding
Assume that F : D × I → ℝ^n is continuously differentiable in the first argument
with regular Jacobian F′ and that the following affine covariant Lipschitz
condition holds true:

‖F′(x̄(λ))⁻¹ (F′(x) − F′(x̄(λ)))‖ ≤ ω ‖x − x̄(λ)‖ ,   x, x̄(λ) ∈ D ,   λ ∈ I .   (1.331)

Then, the order coefficient for the classical continuation method is

η₁ = max_{λ∈I} ‖F′(x(λ))⁻¹ (I − P) F(x⁰)‖ ,   (1.332)

whereas that of the tangent continuation method is close to

η₂ := ½ ω η₁² .   (1.333)

Proof. The partial derivatives of the map F̄ can be easily computed:

∂F̄/∂x (x, λ) = F′(x) ,   ∂F̄/∂λ (x, λ) = (I − P) F(x⁰) ,

and hence,

x′(λ) = −F′(x(λ))⁻¹ (I − P) F(x⁰) .

This readily gives (1.332). For the tangent continuation, we must invoke the
Lipschitz condition:

‖x′(λ) − x′(λ_ν)‖ = ‖(F′(x(λ))⁻¹ − F′(x(λ_ν))⁻¹) (I − P) F(x⁰)‖
≤ ‖F′(x(λ_ν))⁻¹ (F′(x(λ)) − F′(x(λ_ν)))‖ ‖x′(λ)‖
≤ ω ‖x(λ) − x(λ_ν)‖ η₁ ≤ ω η₁² Δλ ,

which yields η₂ ≈ ½ ω η₁².
(iv) Polynomial continuation
We distinguish between extrapolation by Lagrange and by Hermite interpolation.
(iv)₁ Lagrange extrapolation
We assume that for some q > 0 the data

x(λ_ℓ) ,   ν − q ≤ ℓ ≤ ν ,

is available. Then, in terms of the fundamental Lagrange polynomials L_{ℓq}(λ),
the prediction path is given by the interpolating polynomial

x̄_q(λ) := Σ_{ℓ=ν−q}^{ν} x(λ_ℓ) L_{ℓq}(λ) .   (1.334)

Standard error estimates give

‖x(λ) − x̄_q(λ)‖ ≤ C ω_{q+1}(λ) ,   (1.335)

where

ω_{q+1}(λ) := Π_{ℓ=ν−q}^{ν} (λ − λ_ℓ) .

(iv)₂ Hermite extrapolation
Here, we assume that we are given the data

x(λ_ℓ) , x′(λ_ℓ) ,   ν − q ≤ ℓ ≤ ν .

We define the prediction path x̄_q(λ) as the associated Hermite polynomial and
obtain

‖x(λ) − x̄_q(λ)‖ ≤ C̄ ω̄_{q+1}(λ) ,   (1.336)

where

ω̄_{q+1}(λ) := Π_{ℓ=ν−q}^{ν} (λ − λ_ℓ)² .
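A minimal sketch of Lagrange extrapolation, using a synthetic path x(λ) = √(1+λ) as a stand-in homotopy path (illustrative, not from the text): the interpolating polynomial through the stored path values is evaluated beyond the last node and compared with the classical (constant) predictor.

```python
import numpy as np

# Stand-in homotopy path for illustration:
def x_path(lam):
    return np.sqrt(1.0 + lam)

nodes = np.array([0.0, 0.1, 0.2])     # lam_{nu-2}, lam_{nu-1}, lam_nu, i.e. q = 2
lam_next = 0.3

# Lagrange extrapolation (1.334): evaluate the interpolating polynomial
# through the stored path values outside the node interval
coeffs = np.polyfit(nodes, x_path(nodes), 2)
x_pred = np.polyval(coeffs, lam_next)

err_lagrange = abs(x_pred - x_path(lam_next))
err_classical = abs(x_path(nodes[-1]) - x_path(lam_next))   # constant predictor (1.327)
```

The error behaves like ω_{q+1}(λ) in (1.335): here |ω₃(0.3)| = 0.3·0.2·0.1 = 0.006, so the extrapolated value is far more accurate than the constant predictor.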

6.1.3 Affine covariant correction method
Once we have computed a prediction path x̄(λ), λ_ν ≤ λ ≤ λ_{ν+1}, we choose the
predicted value x⁰ := x̄(λ_{ν+1}) as an initial guess for a correction method
to compute an approximation of x* := x(λ_{ν+1}). We will study the ordinary
Newton method with a new Jacobian at each iterate. Applying the affine covariant version of the Newton-Kantorovich theorem, we get the following result.

Theorem 6.1 Convergence of the corrector
Assume that F : D × I → ℝ^n is continuously differentiable with nonsingular
Jacobian F_x(x, λ), (x, λ) ∈ D × I. Further, suppose that there exists a unique
homotopy path x(λ) and that the affine covariant Lipschitz condition

‖F_x(x̄(λ), λ)⁻¹ (F_x(y, λ) − F_x(x, λ))‖ ≤ ω₀ ‖y − x‖ ,   x, y ∈ D ,   λ ∈ I   (1.337)


is satisfied, where x() is a prediction method of order p (cf. (1.326)). Then,


for all step sizes
2 1 1/p
,
max :=
(1.338)

0 p
the ordinary Newton method with initial guess x(+1 ) converges to the solution
point x(+1 ).
Proof. For the ease of exposition, we write instead of . The affine
covariant Newton-Kantorovich theorem requires
kx0 ()k
0

1
.
2

(1.339)

Applying the Lipschitz condition (1.337), by straightforward computation we


find

0
1
1
kx ()k = kFx (
x(), ) F (
x(), )k = kFx (
x, )
F (
x, ) F (x, ) k =
Z1
kFx (
x, )1

1
Fx (x + t(
x x), )(
x x) dtk k
x x)k 1 +
0 k
x xk .
2

Observing (1.326), we deduce

1
p
kx ()k p 1 +
0 p
=: () .
2
0

(1.340)

Consequently, this leads to the requirement

1
1

0 p p 1 +
0 p p
,
2
2
which is equivalent to

0 p p

21 .

6.1.4 Adaptive stepsize control
For the practical application of the theoretical convergence results we have to
replace the theoretical quantities ω₀ and η_p by computationally available lower
bounds [ω₀] and [η_p], thus resulting in the stepsize estimate

[Δλ_max] := ( (√2 − 1) / ([ω₀] [η_p]) )^{1/p} ≥ Δλ_max .   (1.341)

Since there might be a substantial overestimation, we again need a prediction
strategy and a correction strategy.


As far as the correction strategy is concerned, let us assume that for λ := λ_{ν+1} we
already know the first contraction factor

Θ₀(λ) := ‖Δx¹(λ)‖ / ‖Δx⁰(λ)‖ .

The convergence analysis of the affine covariant Newton method yields

Θ₀(λ) ≤ ½ ω₀ ‖Δx⁰(λ)‖ .   (1.342)

Hence, inserting (1.340) gives us

Θ₀(λ) ≤ ½ ω₀ η_p Δλ^p (1 + ½ ω₀ η_p Δλ^p) ,

which leads to

ω₀ η_p Δλ^p ≥ g(Θ₀(λ)) ,   where   g(Θ) := √(1 + 4Θ) − 1 .

From this, we get the a posteriori estimate

[ω₀ η_p] := g(Θ₀(λ)) / Δλ^p ≤ ω₀ η_p ,

and the associated stepsize estimate

[Δλ_max] := ( g(¼) / [ω₀ η_p] )^{1/p} .

Denoting by Δλ the stepsize associated with the computed value of Θ₀ and by
Δλ⁰ the one corresponding to Θ = ¼, we arrive at the stepsize correction

Δλ⁰ := ( g(¼) / g(Θ₀) )^{1/p} Δλ .   (1.343)

Remark: If the termination criterion detects some Θ_k with Θ_k > ½, the
last continuation step has to be repeated with

Δλ⁰ := ( g(¼) / g(Θ_k) )^{1/p} Δλ ,

which gives rise to a reduction, since

Δλ⁰ < ( (√2 − 1) / (√3 − 1) )^{1/p} Δλ ≈ 0.57^{1/p} Δλ .   (1.344)


Whereas a posteriori estimates lead to correction strategies, a priori estimates
allow us to derive prediction methods.
We note that (1.342) gives us the lower bound

[ω₀] := 2Θ₀(λ_ν) / ‖Δx⁰(λ_ν)‖ ≤ ω₀ ,

whereas the definition (1.326) of the order of a prediction method implies

[η_p] := ‖x̄(λ_ν) − x(λ_ν)‖ / |Δλ_{ν−1}|^p ≤ η_p .

Using the preceding quantities in (1.341), we arrive at the stepsize prediction

Δλ_ν⁰ := ( ‖Δx⁰(λ_ν)‖ / ‖x̄(λ_ν) − x(λ_ν)‖ · g(¼) / (2Θ₀) )^{1/p} Δλ_{ν−1} .   (1.345)

Remark: The prediction strategy (1.345) is robust with respect to the accuracy of x̄: even if only a single Newton step is performed, i.e.,

x(λ_ν) = x̄(λ_ν) + Δx⁰(λ_ν) ,

the prediction takes the form

Δλ_ν⁰ := ( g(¼) / (2Θ₀) )^{1/p} Δλ_{ν−1} ,

which in light of (1.343) still is a reasonable estimate.
Only in the nearly linear case Θ₀ ≤ Θ_min ≪ 1 should the estimate (1.345)
be replaced by

Δλ_ν⁰ := ( g(¼) / (2Θ_min) )^{1/p} Δλ_{ν−1} .   (1.346)
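The stepsize logic of this subsection can be collected into a few small helper functions; a sketch under the stated formulas, with all argument names (`dlam`, `theta0`, `pred_err_norm`, etc.) chosen here for illustration:

```python
import math

def g(theta):
    # g(Theta) = sqrt(1 + 4*Theta) - 1, the function from the a posteriori analysis
    return math.sqrt(1.0 + 4.0 * theta) - 1.0

def correct_steplength(dlam, theta0, p):
    # stepsize correction (1.343): shrink according to the observed contraction
    return (g(0.25) / g(theta0)) ** (1.0 / p) * dlam

def predict_steplength(dlam_prev, dx0_norm, pred_err_norm, theta0, p):
    # stepsize prediction (1.345); pred_err_norm stands for ||xbar(lam) - x(lam)||
    factor = dx0_norm / pred_err_norm * g(0.25) / (2.0 * theta0)
    return factor ** (1.0 / p) * dlam_prev

# note g(1/4) = sqrt(2) - 1, the threshold appearing in (1.338) and (1.344)
```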


6.2 Augmented systems for critical points

We assume that y* is a perfect or unperturbed singularity of order k ≥ 1 such
that

F(y*) = 0 ,   rank F′(y*) = n − k .   (1.347)

For notational convenience, we set A := F′(y*) and refer to

N(A) := ker A ,   R(A)⊥   (1.348)

with

dim N(A) = k + 1 ,   dim R(A)⊥ = k

as the nullspace and the corange of A. We further introduce the projectors

P := I − A⁺A ,   P̄ := I − AA⁺   (1.349)

and recall that P projects onto N(A), whereas P̄ projects onto R(A)⊥.
We consider the natural splitting

y = y* + v + w ,   v := P(y − y*) ,   w := (I − P)(y − y*) .   (1.350)

Then, in view of (1.347) the implicit function theorem asserts the existence of
a function w = w*(v) such that

(I − P̄) F(y* + v + w) = 0   ⟺   w = w*(v) .   (1.351)

Replacing w by w* gives rise to the reduced system

f(v) := P̄ F(y* + v + w*(v)) = 0 ,   (1.352)

which is known as the Lyapunov-Schmidt reduction.


For computational purposes, we choose orthonormal bases of the nullspace and
the corange according to

N(A) = span(t₁, ..., t_{k+1}) ,   R(A)⊥ = span(z₁, ..., z_k) .

In terms of the matrices

t := [t₁, ..., t_{k+1}] ,   z := [z₁, ..., z_k] ,

we obviously have

A t = 0 ,   tᵀ t = I_{k+1} ,   P = t tᵀ ,
Aᵀ z = 0 ,   zᵀ z = I_k ,   P̄ = z zᵀ .   (1.353)

If we now introduce local coordinates by means of

v = t ξ = Σ_{i=1}^{k+1} ξᵢ tᵢ ,   f(v) = z φ = Σ_{j=1}^{k} φⱼ zⱼ ,

in terms of the function φ : ℝ^{k+1} → ℝ^k, the reduced system (1.352) can be
written as

φ(ξ) := zᵀ f(t ξ) = zᵀ F(y* + t ξ + w*(t ξ)) = 0 .   (1.354)
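As a worked toy example (not from the text), consider F(y) = x² − λ² with y = (x, λ) and n = 1 at the singular point y* = 0; here the Lyapunov-Schmidt reduction can be carried out by hand:

```latex
% Toy example: F(y) = x^2 - \lambda^2, y = (x,\lambda), n = 1, y^* = 0.
\begin{align*}
  A &= F'(y^*) = (0,\; 0), &
  \operatorname{rank} A &= 0 = n - k \;\Longrightarrow\; k = 1, \\
  N(A) &= \mathbb{R}^2, &
  R(A)^{\perp} &= \mathbb{R}.
\end{align*}
% One may take t = I_2 and z = 1; since R(A) = \{0\}, the range equation
% (1.351) is void and w^*(v) \equiv 0. The reduced equation (1.354) becomes
\begin{align*}
  \varphi(\xi) = z^{T} F(t\xi) = \xi_1^2 - \xi_2^2 = 0,
\end{align*}
% i.e. exactly the germ (1.359) of a simple bifurcation: the two solution
% branches \xi_2 = \pm\xi_1 cross transversally at the origin.
```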

For the actual computation of singularities, we need higher order derivatives,
which are provided by the following result.

Lemma 6.2 Higher order derivatives w.r.t. the Lyapunov-Schmidt reduction
Assume for simplicity that y* = 0. Then, with the notations introduced above,
for aᵢ ∈ ℝ^{k+1}, 1 ≤ i ≤ 3, there holds

dφ/dξ (0) a₁ = 0 ,   (1.355)

d²φ/dξ² (0) [a₁, a₂] = zᵀ F″ [t a₁, t a₂] ,   (1.356)

d³φ/dξ³ (0) [a₁, a₂, a₃] = zᵀ F‴ [t a₁, t a₂, t a₃]
− zᵀ F″ [t a₁, A⁺ F″ [t a₂, t a₃]]
− zᵀ F″ [t a₂, A⁺ F″ [t a₃, t a₁]]
− zᵀ F″ [t a₃, A⁺ F″ [t a₁, t a₂]] .   (1.357)

In view of the preceding result, it is sufficient to consider the contact equivalence
class

Γ(g) := { φ | φ(ξ) = Θ(ξ) g(h(ξ)) } ,   (1.358)

where g ∈ P_{k+1+q}(ξ), with q denoting the codimension of the singularity, and Θ, h are
C∞-diffeomorphisms with h(0) = 0. The functions g are called polynomial
germs.
For instance, in case of a simple bifurcation, i.e., k = 1, q = 1, we have

g(ξ) = ξ₁² − ξ₂² ,   (1.359)


whereas for an asymmetric cusp, i.e., k = 1, q = 2, we obtain

g(ξ) = ξ₁² − ξ₂³ .   (1.360)

In order to allow for imperfect or unfolded singularities, we have to consider the
perturbed germs

G(ξ, τ) := g(ξ) + p(ξ, τ) ,   (1.361)

where p(ξ, τ) ∈ P_{q−1}(ξ) is a polynomial perturbation with τ ∈ ℝ^q denoting the
unfolding parameters.
For instance, for a simple bifurcation

G(ξ, τ) = ξ₁² − ξ₂² + τ ,   (1.362)

and for an asymmetric cusp

G(ξ, τ) = ξ₁² − ξ₂³ + τ₁ + τ₂ ξ₂ .   (1.363)

In particular, if we have

G(h(0), τ) = G(0, τ) = p(0, τ) ≠ 0 ,

the reduced system (1.354) has to be replaced by

zᵀ F(y) = p(0, τ) .

These k equations together with the n − k equations (I − P̄) F(y) = 0 then give rise to
the n equations

F(y) = z p(0, τ)   (1.364)

in the (k + 1)n + q + 1 unknowns (y, z, τ).
Normalizing the k basis vectors zⱼ, 1 ≤ j ≤ k, in case of a simple bifurcation
we arrive at the augmented system

F′(y)ᵀ z = 0 ,   (1.365)
F(y) + τ z = 0 ,   (1.366)
½ (zᵀ z − 1) = 0 ,   (1.367)

which represents 2n + 2 equations in the 2n + 2 unknowns (y, z, τ).
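For a concrete feel, the augmented system (1.365)-(1.367) can be written out for the toy map F(y) = x² − λ² (illustrative, not from the text), which has a simple bifurcation at y* = (0, 0); the ordinary Newton method on the extended system then converges to the bifurcation point from a nearby guess:

```python
import numpy as np

# Toy map F(y) = x^2 - lam^2, y = (x, lam), n = 1: rank F'(y*) = 0 = n - 1,
# so y* = (0, 0) is a simple bifurcation point.
def extended_residual(x, lam, z, tau):
    return np.array([2.0 * x * z,                  # F'(y)^T z = 0, first component
                     -2.0 * lam * z,               # F'(y)^T z = 0, second component
                     x * x - lam * lam + tau * z,  # F(y) + tau*z = 0
                     0.5 * (z * z - 1.0)])         # normalization (1.367)

def extended_jacobian(x, lam, z, tau):
    # block structure [[C, A^T, 0], [A, tau*I_n, z], [0, z^T, 0]] spelled out
    return np.array([[2.0 * z, 0.0, 2.0 * x, 0.0],
                     [0.0, -2.0 * z, -2.0 * lam, 0.0],
                     [2.0 * x, -2.0 * lam, tau, z],
                     [0.0, 0.0, z, 0.0]])

v = np.array([0.1, -0.05, 0.9, 0.02])              # initial guess (x, lam, z, tau)
for _ in range(30):
    dv = np.linalg.solve(extended_jacobian(*v), -extended_residual(*v))
    v += dv
    if np.linalg.norm(dv) < 1e-13:
        break
# the iteration homes in on the bifurcation point: x = lam = tau = 0, z = +/-1
```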


Including second order information, we can show the existence of two local
branch directions.

Theorem 6.2 Existence of two local branch directions
Let y* be a simple bifurcation and assume that F ∈ C^k, k ≥ 3, and that

zᵀ F″(y*) [t, t]   (1.368)

is nondegenerate. Then, in a neighborhood of y*, the solution set of F + τz = 0
consists of two one-dimensional C^{k−2}-branches Γ₁, Γ₂ such that

Γᵢ(0) = y* ,   1 ≤ i ≤ 2 ,
N(A) = span{ Γ′₁(0), Γ′₂(0) } ,
zᵀ F″(y*) [ Γ′₁(0), Γ′₂(0) ] = 0 .

6.3 Newton method for simple bifurcations

For the augmented system (1.365)-(1.367), the extended Jacobian has the following
block structure:

J(y, z, τ) = [ C    Aᵀ     0
               A    τ Iₙ   z
               0    zᵀ     0 ] ,

where

C := (F′(y)ᵀ z)′ = Σ_{i=1}^{n} zᵢ fᵢ″(y) ,   A := F′(y) .

Theorem 6.3 Properties of the extended Jacobian
At a simple bifurcation point y* with sufficiently small perturbation parameter
τ, the extended Jacobian J(y*, z*, τ) is nonsingular.
As a consequence of Theorem 6.3, the ordinary Newton method

[ C    Aᵀ     0 ] [ Δy ]       [ F′(y)ᵀ z    ]
[ A    τ Iₙ   z ] [ Δz ]  = −  [ F(y) + τ z  ]   (1.369)
[ 0    zᵀ     0 ] [ Δτ ]       [ ½ (zᵀz − 1) ]

is well defined in a neighborhood of y*.
Instead of (1.369), replacing J(y, z, τ) by J(y, z, 0) and A by an approximation
Ā ≈ F′(y), we consider the Newton-like method

[ C    Āᵀ    0 ] [ Δy ]       [ F′(y)ᵀ z    ]
[ Ā    0     z ] [ Δz ]  = −  [ F(y) + τ z  ]   (1.370)
[ 0    zᵀ    0 ] [ Δτ ]       [ ½ (zᵀz − 1) ]


which is easier to solve.
In particular, a structure preserving algorithm for the solution of (1.370) makes
use of the following QR decomposition of A = F′(y):

A = Q [ R   S
        0   εᵀ ] Πᵀ ,

where Q is an orthogonal n×n matrix, R is an upper triangular (n−1)×(n−1)
matrix, S is an (n−1)×2 matrix, ε ∈ ℝ², and Π is an (n+1)×(n+1)
permutation matrix.
For y close to y*, the matrix R is nonsingular and the vector ε is small. Hence,
we may choose

Ā = Q [ R   S
        0   0 ] Πᵀ .   (1.371)
Using (1.371) in (1.370) suggests the partitioning

C̄ := Πᵀ C Π = [ C₁₁   C₁₂
                C₁₂ᵀ  C₂₂ ] ,   C₂₂ ∈ ℝ^{2×2} ,

Δz̄ := Qᵀ Δz = ( Δw, Δγ ) ,   Δw ∈ ℝ^{n−1} ,   Δγ ∈ ℝ ,

Πᵀ Δy = ( Δu, Δv ) ,   Δu ∈ ℝ^{n−1} ,   Δv ∈ ℝ² ,

z̄ := Qᵀ z = ( w, γ ) ,   w ∈ ℝ^{n−1} ,   γ ∈ ℝ ,

Πᵀ F′(y)ᵀ z = ( f₁, f₂ ) ,   f₁ ∈ ℝ^{n−1} ,   f₂ ∈ ℝ² ,

Qᵀ (F + τz) = ( g₁, g₂ ) ,   g₁ ∈ ℝ^{n−1} ,   g₂ ∈ ℝ ,

h := ½ (zᵀ z − 1) ,

which leads to the linear system

[ C₁₁   C₁₂   Rᵀ   0   0 ] [ Δu ]       [ f₁ ]
[ C₁₂ᵀ  C₂₂   Sᵀ   0   0 ] [ Δv ]       [ f₂ ]
[ R     S     0    0   w ] [ Δw ]  = −  [ g₁ ]
[ 0     0     0    0   γ ] [ Δγ ]       [ g₂ ]
[ 0     0     wᵀ   γ   0 ] [ Δτ ]       [ h  ]
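The rank-revealing decomposition behind (1.371) can be sketched with a Householder QR factorization with column pivoting (a hand-rolled sketch, not a library routine; the toy n×(n+1) Jacobian below is synthetic and numerically of rank n−1, mimicking the situation near a critical point):

```python
import numpy as np

def qr_column_pivoting(A):
    """Householder QR with column pivoting: A[:, piv] = Q @ R (sketch)."""
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    piv = np.arange(n)
    for k in range(min(m, n)):
        # pivoting: move the remaining column of largest norm to position k
        j = k + int(np.argmax(np.linalg.norm(R[k:, k:], axis=0)))
        R[:, [k, j]] = R[:, [j, k]]
        piv[[k, j]] = piv[[j, k]]
        # Householder reflection annihilating R[k+1:, k]
        x = R[k:, k]
        alpha = -np.copysign(np.linalg.norm(x), x[0] if x[0] != 0.0 else 1.0)
        v = x.copy()
        v[0] -= alpha
        nv = np.linalg.norm(v)
        if nv > 0.0:
            v /= nv
            R[k:, :] -= 2.0 * np.outer(v, v @ R[k:, :])
            Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)
    return Q, R, piv

# toy n x (n+1) Jacobian that is numerically of rank n-1 near a critical point
n = 4
rng = np.random.default_rng(0)
B = rng.standard_normal((n - 1, n + 1))
A = np.vstack([B, B[0] + 1e-10 * rng.standard_normal(n + 1)])

Q, R, piv = qr_column_pivoting(A)
eps_row = np.linalg.norm(R[-1, :])    # the small row (0 eps^T) in (1.371)
R[-1, :] = 0.0                        # drop it to obtain the rank-deficient Abar
Abar = Q @ R                          # equals A[:, piv] up to O(eps_row)
```

Production codes would call a library routine (e.g. LAPACK's pivoted QR) instead; the sketch only exhibits the structure exploited in (1.371).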
