2007 The Solution of Nonlinear Inverse Problems and The Levenberg-Marquardt Method PUJOL

GEOPHYSICS, VOL. 72, NO. 4 共JULY-AUGUST 2007兲; P. W1–W16, 6 FIGS., 1 TABLE.
10.1190/1.2732552
Downloaded 10/07/13 to 134.53.24.2. Redistribution subject to SEG license or copyright; see Terms of Use at http://library.seg.org/
The solution of nonlinear inverse problems

and the Levenberg-Marquardt method
Jose Pujol1
ABSTRACT method and the steepest-descent method. In this tutorial, the two
papers are combined into a unified presentation, which will help
the reader gain a better understanding of what happens when
Although the Levenberg-Marquardt damped least-squares solving nonlinear problems. Because the damped least-squares
method is an extremely powerful tool for the iterative solution of and steepest-descent methods are intimately related, the latter is
nonlinear problems, its theoretical basis has not been described also discussed, in particular in its relation to the gradient. When
adequately in the literature. This is unfortunate, because Leven- the inversion parameters have the same dimensions 共and units兲,
berg and Marquardt approached the solution of nonlinear prob- the direction of steepest descent is equal to the direction of minus
lems in different ways and presented results that go far beyond the gradient. In other cases, it is necessary to introduce a metric
the simple equation that characterizes the method. The idea of 共i.e., a definition of distance兲 in the parameter space to establish a
damping the solution was introduced by Levenberg, who also relation between the two directions. Although neither Levenberg
showed that it is possible to do that while at the same time reduc- nor Marquardt discussed these matters, their results imply the in-
ing the value of a function that must be minimized iteratively. troduction of a metric. Some of the concepts presented here are il-
This result is not obvious, although it is taken for granted. More- lustrated with the inversion of synthetic gravity data correspond-
over, Levenberg derived a solution more general than the one ing to a buried sphere of unknown radius and depth. Finally, the
currently used. Marquardt started with the current equation and work done by early researchers that rediscovered the damped
showed that it interpolates between the ordinary least-squares- least-squares method is put into a historical context.
INTRODUCTION rameters about which the problem is linearized. Once equation 1 is

solved, ␦ is added to the initial vector of parameters and the resulting
Anyone working on inverse problems will immediately recognize vector is used as a new initial vector. In this way an iterative process
the equation is established, but as early workers noted, convergence is not assured
when ␦ is computed using ordinary least squares 共i.e., when equation
共ATA + ␭I兲␦ = ATc, 共1兲 1 with ␭ = 0 is used兲. A typical reason for the lack of convergence is
that the initial values of the parameters are far from the values that
where ␭ ⬎ 0 and I is the identity matrix. The symbols used may be
actually solve the nonlinear problem, which means that the assump-
different, but the meaning of equation 1 will be clear, namely it rep-
tions behind the linearization of the problem are not valid. As a con-
resents the Levenberg-Marquardt damped least-squares solution of
sequence, the absolute values of the components of ␦ may become
the equation
larger or oscillate as the iterations proceed, instead of going through
A␦ = c. 共2兲 the steady decrease that characterizes a convergent process. This
problem is solved by Levenberg 共1944兲 and independently by Mar-
Equation 1 arises in a number of situations. The one to be dis- quardt 共1963兲, who uses different approaches that led to similar solu-
cussed here is the solution of linearized nonlinear inverse problems tions. The purpose of this tutorial is to go through their arguments in
with ␦ a vector of adjustments 共or corrections兲 to the values of the pa- detail, which will help to gain a better understanding of what really
Manuscript received by the Editor June 27, 2006; revised manuscript received January 10, 2007; published online May 30, 2007.
1
University of Memphis, Department of Earth Sciences, Memphis, Tennessee. E-mail: jpujol@memphis.edu.
© 2007 Society of Exploration Geophysicists. All rights reserved.
W1
W2 Pujol
happens when solving a linearized nonlinear problem. To give the The concepts introduced here are illustrated with a simple exam-
readers a flavor of the matters to be discussed, the basic features of ple involving the inversion of gravity data corresponding to a buried
the two approaches are summarized below. sphere. In this case, the unknown parameters are the radius of the
Levenberg 共1944兲 solves the problem of the lack of convergence sphere and the depth to its center. By limiting the number of inver-
by introducing 共and naming兲 the damped least-squares method. The sion parameters to two, it is easy to visualize both the function s and
basic idea was to damp 共i.e., limit兲 the values of the parameters at the path followed by the parameters as a function of the iterations for
each iteration. Specifically, instead of using the function S 共see be- different initial values of the parameters and ␭, and for solutions ob-
low兲 whose minimization leads to the ordinary least-squares solu- tained using the damped and ordinary least-squares methods and the
tion, Levenberg minimizes the function S̄ = wS + Q, where w ⬎ 0 steepest-descent method.
and Q is a linear combination of the components of ␦ squared. The This tutorial concludes with a historical note. Although Leven-
result of this minimization is a generalization of equation 1, with ␭I berg’s paper was published almost twenty years before Marquardt’s,
replaced by a diagonal matrix D with nonnegative elements. Leven- it went almost unnoticed in spite of its practical importance. Interest-
berg’s contribution to the solution of the problem did not stop here, ingly, an internet search uncovered a paper by Feder 共1963兲 on com-
however. Equally important are his proofs that the minimization of S̄ puterized lens design that shows that ideas similar to that of Leven-
leads to a decrease in the values of S and the function whose linear- berg had been rediscovered more than once. Feder’s paper, in turn,
ization leads to S 共i.e., the function s below兲. The reduction in the led to a paper by Wynne 共1959兲, which anticipates some of the ideas
value of s does not occur for all values of w 共which is equal to 1/␭ in Marquardt’s approach. Yet the fact remains that it was Mar-
when D = I兲, and Levenberg suggests a way to find the value of w quardt’s paper that popularized the damped least-squares method, a
leading to a reduction in the value of s. Another important result is fact he attributed to his having distributed hundreds of copies of his
that the Q corresponding to the damped least-squares solution is al- FORTRAN code!
ways 共i.e., for all w兲 smaller than when damping is not applied.
These results are not obvious, but are rarely considered when the
damped least-squares method is introduced in the literature. THE GAUSS-NEWTON METHOD
Marquardt 共1963兲, on the other hand, starts with equation 1 and in-
Let f be a function of the independent variables vk,k = 1,2,. . . and
vestigates the angle between the computed ␦ and the direction of
the parameters x j, j = 1,2, . . . ,n. For convenience, the variables and
steepest descent of s 共equal to −ⵜs, where ⵜ stands for gradient兲.
parameters will be considered the components of vectors v and x, re-
When ␭ goes to infinity, the contribution of ATA in equation 1 be-
spectively. To identify a particular set of values of the variables we
comes negligible and the result is the equation used in the method of
will use symbols such as vi. Let us assume that f is a mathematical
steepest descent. This method generally produces a significant re-
model for observations of interest to us, and that oi is the observation
duction in the value of s in the early iterations, but becomes extreme-
corresponding to the set of variables vi, so that
ly slow after that, to the point that convergence to the solution may
not be achieved even after a large number 共hundreds or more兲 of iter-
oi ⬇ f共vi,x兲 ⬅ f i共x兲; i = 1, . . .,m. 共3兲
ations 共e.g., Gill et al., 1981, and the example below兲. On the other
hand, the ordinary least-squares method 共known as the Gauss-New- Let us define the residual ␳i共x兲 as
ton method兲 has the ability to converge to the solution quickly when
the starting point is close to the solution 共and even when far from it, ␳i共x兲 = oi − f i共x兲; i = 1, . . .,m. 共4兲
as the example below shows兲. Marquardt proves that equation 1 in-
terpolates between the two methods, and shows that the angle be- We are interested in finding the set of parameters x j that minimize the
tween ␦ and −ⵜs is a monotonically decreasing function of ␭, with sum of residuals squared, namely
the angle going to zero as ␭ goes to infinity. Based on this fact, Mar-
quardt proposes a simple algorithm in which at each iteration the val- m
ue of ␭ is modified to assure that the corresponding value of s be-
comes smaller than in the previous iteration. Marquardt also recog-
s共x兲 = 兺 ␳2i 共x兲.
i=1
共5兲
nizes that the term ␭I in equation 1 assures that the matrix in paren-
theses is better conditioned than ATA and that the angle between ␦ A function that measures the misfit between observations and model
and −ⵜs is always less than 90°. If this condition is not met, the itera- values, such as s共x兲, is known as a merit function. Other terms found
tive process may not be convergent. in the parameter estimation and optimization literature are objective,
Although neither Levenberg nor Marquardt discusses the steep- loss, and risk function. If f i共x兲 is a nonlinear function of the parame-
est-descent method itself, this tutorial would be incomplete without ters, the minimization of equation 5 generally requires the use of nu-
consideration of the relation between the direction of steepest de- merical methods. A typical approach is to express f i共x兲 in terms of its
scent and the gradient, which is not unique when inversion parame- linearized Taylor expansion about an initial solution xoj 共j = 1, . . . ,n兲
ters have different dimensions or units. In such cases, it is not obvi- at which s does not have a stationary point. This gives
冏冏
ous how to measure the distance between two points in parameter
n
space, and as a result, equating the direction of steepest descent to ⳵fi
the direction of minus the gradient becomes meaningless 共Feder, f i共x兲 ⬇ f i共xo兲 + 兺 ⳵x j
共x j − xoj 兲; i = 1, . . .,m,
1963兲. It is only when a definition of distance 共i.e., a metric兲 is intro- j=1 x = xo
duced that the two directions become uniquely related. These ques- 共6兲
tions will be discussed first to put Levenberg’s and Marquardt’s ap-
proaches into a broader perspective, as they involve, either directly where xo has components xoj . Using this expression with equation 3,
or indirectly, the introduction of a metric. we can introduce a new set of residuals
Levenberg-Marquardt nonlinear inversion W3
n A condition for a minimum of S to exist is that HS be positive definite

ri = oi − f i共xo兲 − 兺 aij␦ j ,
j=1
共7兲共e.g., Apostol, 1969兲, which is the case when 共ATA兲−1 exists 共see Ap-
pendix A兲.
These results allow us to establish, in principle, the following iter-
where
ative process. Solve equation 15 for ␦, use equation 13 to compute x,
which becomes a new xo, and use it to compute new values of A and c
␦ j = x j − xoj , 共8兲
共using equation 9 and 11兲. Then solve for ␦ again. To make the pro-
and cess clear, we will describe the two steps that lead to the estimate
冏冏
x共p+1兲 for the 共p + 1兲th iteration. First, solve
⳵fi
aij = . 共9兲共ATA兲共p兲␦共p兲 = 共ATc兲共p兲 , 共17兲
⳵x j x = xo
where the superscript 共p兲 indicates iteration number and A and c
Note that f i共xo兲 and the derivatives have specific numerical values, have components
while the ␦ j are unknown. Equation 7 will be written in matrix form
as
r = c − A␦, 共10兲
a共p兲
ij = 冏冏
⳵fi
⳵x j x = x共p兲
; c共p兲共p兲
i = oi − f i共x 兲. 共18兲
Then compute the updated estimate using

where r and ␦ have components ri and ␦i, A denotes the matrix of de-
rivatives, and c is the vector with components x共p+1兲 = x共p兲 + ␦共p兲 . 共19兲
ci = oi − f i共xo兲. 共11兲 We will let x共o兲 = xo and will apply the superscript 共p兲 to any function
of x computed at x共p兲.
Now we will look for the vector ␦ that minimizes
To stop the iterative process, one can introduce conditions such as
m
␦ j共p兲 ⱕ ␦ jmin ; 共20兲
S共x兲 = 兺
i=1
r2i = rTr = 共cT − ␦TAT兲共c − A␦兲
j = 1,2, . . . ,n,
s共p+1兲 ⱕ smin , 共21兲

= cTc − 2cTA␦ + ␦TATA␦. 共12兲
The superscript T indicates transposition. Before proceeding, how- p ⬎ pmax , 共22兲
ever, it is necessary to make a comment on notation. Strictly speak-
where the values on the right sides of these equations are preestab-
ing, the right-hand side of equation 12 is a function of ␦, but because
lished values.
the x that minimizes equation 5 will be determined iteratively, we
The iterative minimization method with the ␦ in equation 19 com-
will be interested in deriving results involving x, which from equa-
puted using equation 17 is known as the Gauss-Newton method, and
tion 8 is equal to
can be derived using a different approach 共e.g., Gill et al., 1981; see
x = xo + ␦. 共13兲 also the note at the end of next section兲. As noted in the Introduction,
however, a problem with this method is that 兩␦共o兲兩 may be too large if
Therefore, S can be considered a function of x. The minimization of xo is too far from its optimal value, which in turn may lead to a non-
S requires computing its derivative with respect to ␦ and setting it convergent iterative process. The following example will illustrate
equal to zero: some of the features of the method.
⳵S
⳵␦
= 冉
⳵S ⳵S
⳵␦1 ⳵␦2
... 冊 T
= −2AcT + 2ATA␦ = 0 共14兲 Example 1a
Consider a buried homogeneous sphere of radius a with center at
共Seber, 1977兲, which leads to the well-known ordinary least-squares 共y o,z兲, where y o is measured along the y-axis 共horizontal兲 and z is
equation depth. The vertical component of the gravitational attraction caused
by the sphere at a point y i at zero depth is given by:
ATA␦ = ATc. 共15兲
4␲ ⌫Da3z
In this section, we will assume that 共ATA兲−1 exists, which means that g共y i,z,a兲 = 共23兲
equation 15 can be solved for ␦. When this assumption is not valid, a
3 关共y i − y o兲2 + z2兴3/2
different method 共such as damped least-squares兲 should be used. 共e.g., Dobrin, 1976兲, where ⌫ is the gravitational constant and D is
Now it remains to show that the ␦ obtained from equation 15 min- the density contrast 共equal to the difference of the densities of the
imizes S. To see that, we must examine the Hessian of S, HS, which is sphere and the surrounding medium, assumed homogeneous兲. For
the matrix of second derivatives of S with respect to the components distance and density in km and g/cm3 and gravity in mGal 共used
of ␦. Because the quadratic terms in equation 12 are of the form here兲, the numerical value of ⌫ is 6.672.
共ATA兲mn␦m␦n and ATA is symmetric, The inverse problem that we will solve is the following. Given m
gravity values Gi corresponding to points y i along the y axis, find out
⳵ 2S the values of a and z of the sphere whose gravity best fits the Gi. It
共HS兲kl = = 2共ATA兲kl . 共16兲
⳵ ␦ k⳵ ␦ l will be assumed that D and y o are known. Clearly, this problem is
W4 Pujol
nonlinear in both a and z, which play the role of the parameters x1 and the 共z,a兲 plane that connects an initial point 共zo,ao兲 and the point
x2. In practice, the Gi should be observed values, but for the purposes 共zM ,aM 兲 that minimizes ␴ 共and thus, ␴2兲. In our example, 共zM ,aM 兲
of this tutorial they will be synthetic data generated using equation = 共7,5兲, and at this point ␴ = 0. It may happen, however, that be-
23 with the following values: z = 7, a = 5, y o = 0, all in km; D cause of the complexity of ␴, no path can be found, in which case the
= 0.25 g/cm3, m = 20, y 1 = −10 km, and y i+1 − y i = 1 km. To stop inverse problem has not been solved. To investigate this question the
the iterative process, the condition that the adjustments ␦1 and ␦2 be- initial points labeled A, B, C, and D in Figure 1 were used. Some of
come smaller than or equal to 1 ⫻ 10−5 km was assumed. these initial values are too far from the true values 共see Figure 2兲, but
In this example, the estimated variance of the residuals, given by they were chosen for demonstration purposes, not as reasonable ini-
␴ 共z,a兲 =
2 1
m − 2i=1
m
兺关Gi − g共yi,z,a兲兴2 ; 再 2 ⱕ z ⱕ 12
2 ⱕ a ⱕ 10
,
tial estimates for this problem. In addition, the corresponding results
can be useful for cases where the function to be minimized is not
equal to zero at its minimum, and there is no easy way to assess
whether the initial estimates are reasonably close to the optimal val-
共24兲
ues.
plays the role of the merit function s to be minimized. The 2 in the de- The results of the inversion are summarized in Table 1 and the
nominator is introduced to make ␴2 an unbiased estimate 共Jenkins paths followed by the intermediate pairs 共z p,a p兲共p = iteration num-
and Watts, 1968兲. Clearly, a cannot be larger than z when the y i are ber兲 are shown in Figure 1. For the initial point D there was no con-
assumed to be at the same elevation, but for the analysis that follows vergence, but for the other three points, the minimum was reached in
we will be concerned with the mathematical, not the physical, as- five iterations 共points B and C兲 or 10 iterations 共point A兲. These re-
pects of the problem. There are two reasons for the use ␴2. One is its sults are interesting for several reasons. First, convergence can be
statistical significance and the other is that ␴2 is a normalized form of achieved even when the assumptions behind the linearization of the
s, which allows a comparison of results obtained for different num- problem are completely violated. Second, convergence is not always
bers of observations or for different models. The following results, achieved. Third, whether an initial point leads to convergence or not
however, are shown in terms of the standard deviation ␴, which has is not directly related to its distance to the point that minimizes ␴. Fi-
the same units as g 共i.e., mGal兲. nally, inspection of the inversion paths does not give any clue as to
Representative contour lines of ␴共z,a兲 are shown in Figure 1. the path corresponding to any other initial point within the range of
Note that the shapes of the contours are highly variable. For values of Figure 1. These facts are typical of nonlinear problems, and the other
␴ less than about eight they are close to highly elongated ellipses
methods discussed below have been designed to address some of
共closed or open兲, although the other contours are mostly straight or
them.
slightly curved with changing slopes. This fact must be kept in mind
because solving the inverse problem is equivalent to finding a path in
T: (7, 5)
A 500 100 50 B A: (2, 10)
10
3 B: (10, 10)
10 C: (2, 2)
9
D: (10, 2)
8 10
5
Radius (km)
2
7 3 10
3
g(x,z,a) (mGal)
6 B
5
1
2 A
5 1
10
4 T
10
3
0
10
2 C D
2 3 4 5 6 7 8 9 10 11 12
Depth (km) D
−1 C
Figure 1. Contour lines 共cyan curves兲 of the function ␴ 共see equation 10
−10 −5 0 5 10
24兲 and paths followed by the points 共zi,ai兲共indicated by circles兲,
x (km)
where i is iteration number for the Gauss-Newton inversion method
共equation 17兲. The numbers next to the contours indicate the value of
␴ 共in mGal兲. The contours between the numbered ones are equis- Figure 2. Gravity values computed using equation 23 for several
paced. The points labeled A, B, C, and D are initial points 共xo,y o兲 for 共z,a兲 pairs, listed on the upper-right corner of the figure. Each pair is
the inversion. Figure 2 shows the corresponding gravity values. For identified by a different symbol and by a letter. The gravity values
D the method did not converge. The value of ␴ for this point is 11.5. identified by a T are the true values, while the others correspond to
See Table 1 for additional inversion results. The large + is centered the initial values used for the inversion of the true values. The gravity
at 共7,5兲, which is at the minimum of ␴ 共=0兲. scale is logarithmic.
THE GRADIENT AND THE METHOD OF −ⵜs is the basis of the steepest-descent method of minimization,
STEEPEST DESCENT which is one of the oldest methods used.
A heuristic introduction of the steepest-descent method is as fol-
To motivate the following discussion, let us consider the gradient lows. Using the notation introduced above, we are interested in an it-
of s, to be indicated with ⵜs, which is the column vector with compo- erative approach such that
nents
s共p+1兲 ⬍ s共p兲 .
共29兲
m
⳵s ⳵ f i共x兲
共ⵜs兲 j = = −2 兺关oi − f i共x兲兴 , 共25兲 To achieve this goal, we will use the fact that ⵜs points in the direc-
⳵x j i=1 ⳵x j tion of steepest ascent. This is a well-known result from calculus
共e.g., Apostol, 1969兲 and will be proved below in a more general
where equation 4 and 5 were used. Now let x = xo. Then
context. Therefore, the initial estimate for the 共p + 1兲th iteration will
m be computed using
⳵fi
共ⵜs兲 j = −2 兺 ci = − 2共ATc兲 j , 共26兲
⳵x j 1
x共p+1兲 = x共p兲 − ⵜs共x共p兲兲,
i=1
共30兲
where equation 9 and 11 were used. Writing in matrix form, we have
␮共p兲
where ␮共p兲 is a scalar that assures that equation 29 is satisfied. The
ⵜs = −2ATc. 共27兲 question is how to choose the value of ␮共p兲. A general discussion of
Now consider equation 1 with ␦ and c the vectors that appear in this question is presented below, but for the time being, we note that
equation 15 and let ␭ go to infinity. In this case, the first term on the the gradient of a function is a local property, which means that, in
left side of the equation becomes negligible and the solution be- general, the direction of the steepest descent will change as a func-
comes tion of position. Therefore, if ␮共p兲 is not selected carefully, it may
happen that the value s is not reduced, as desired. For this reason, a
1 T 1 number of strategies for the selection of ␮共p兲 have been designed 共i.e.,
␦g ⬇ A c=− ⵜs → 0; ␭→ ⬁. 共28兲 Beveridge and Schechter, 1970; Dorny, 1975兲, but for the example
␭ 2␭
considered next we will use a very simple approach, based on the use
Therefore, in this limiting case the damped least-squares solution ␦g of equation 30 with a large constant value of ␮共p兲, say ␮, which will
is in the direction of minus the gradient. This fact is emphasized by assure a small step in the steepest direction. In this way, we will be
use of the subscript g. In addition, 兩␦g兩 goes to zero. The direction able to see the steepest-descent path clearly, which will be used for
Table 1. Gravity inversion results obtained using different methods and values of the initial parameter.
Point zo ao goM z1 a1 g1M N zN aN ␭o or ␮ ␭N
Gauss-Newton method 共equation 17, Figure 1兲

A 2 10 1747 2.0 6.8 531 10 7.00 5.00
B 10 10 70 9.1 6.9 28 5 7.00 5.00
C 2 2 14 5.9 4.6 19 5 7.00 5.00
Steepest-descent method 共equation 30, ␮ = ␮, Figure 3兲
共p兲
A 2 10 1747 2.3 9.9 1317 28904 6.52 5.00 2 ⫻ 107

A 2 10 1747 4.7 8.9 229 7141 7.06 5.04 2 ⫻ 106
B 10 10 70 10.3 9.4 54 13326 7.02 5.01 2 ⫻ 104
C 2 2 14 2.0 2.0 15 8481 6.98 4.99 2 ⫻ 104
D 10 2 1 10.0 2.0 1 10265 7.02 5.01 2 ⫻ 104
Levenberg-Marquardt method 共equation 108, ␭共p+1兲 = ␭共p兲 /2, Figure 5兲
A 2 10 1747 2.8 9.5 762 23 7.00 5.00 1 ⫻ 106 0.24
A 2 10 1747 2.2 7.1 550 17 7.00 5.00 1 ⫻ 104 0.15
B 10 10 70 10.0 10.0 70 24 7.00 5.00 1 ⫻ 106 0.12
B 10 10 70 10.3 7.6 29 11 7.00 5.00 1 ⫻ 102 0.10
C 2 2 14 2.0 2.0 14 24 7.00 5.00 1 ⫻ 106 0.12
C 2 2 14 2.7 2.9 22 11 7.00 5.00 1 ⫻ 102 0.10
D 10 2 1 10.0 2.0 1 24 7.00 5.00 1 ⫻ 106 0.12
D 10 2 1 9.9 3.2 2 11 7.00 5.00 1 ⫻ 102 0.10
Point: 共see Figure 1兲; zo,ao: initial values of z,a for the inversion; goM ,g1M : maximum values of g 共mGal, equation 23兲 for zo, ao and z1, a1, respec-
tively; zi, ai: values of z,a after the ith iteration; N: total number of iterations; ␭o, ␭N: initial and final values of ␭. The gravity values to be inverted
were generated using equation 23 with z = zM = 7 and a = aM = 5. The corresponding maximum value is 18.
W6 Pujol
comparison with paths corresponding to the Gauss-Newton method of the gradient and the steepest-descent method is very fruitful be-
and the Levenberg-Marquardt solutions obtained for different choic- cause it sheds light on certain questions that arise when solving in-
es of ␭. This example will also show the problems that affect the verse problems that involve parameters with different dimensions.
method of steepest descent, which are removed when the Leven- This type of problem is not uncommon. In seismology, for example,
berg-Marquardt method is used. the parameters may be time, position, velocity, and density, among
others. If the dimensions of two or more of the parameters are differ-
Example 1b ent, a question that arises is how to define distance in the parameter
space. When all the parameters have the same dimensions and are
The merit function is the ␴2 introduced in equation 24. For the ini-
measured in the same units, the gradient of a function s共x兲 gives the
tial points B, C, and D the same value of ␮ was used, while for point
direction along which s has the largest rate of change. In other words,
A two other values were used 共see Figure 3兲. Let us consider the most
for a given ⌬x, 兩s共x + ⌬x兲 − s共x兲兩/兩⌬x兩 is largest when ⌬x is in the
salient aspects of this example. For three of the initial points 共B, C,
direction of ⵜs computed at x. In this case, it is meaningful to speak
D兲 the corresponding endpoints are very close to, although not ex-
of the direction of steepest ascent and to equate it to the gradient di-
actly at, 共zM ,aM 兲共see Table 1兲, but the number of iterations is ex-
rection. For any other case, however, a distance in parameter space
tremely large 共⬎7000兲. Recall that with the Gauss-Newton method,
must be defined. Once this is done, the direction of steepest ascent is
convergence for points B and C is achieved in five iterations. Note
well defined, as we now show. The following results originate with
that for the three points the paths have sharp bends, after which the
Crockett and Chernoff 共1955兲. Although in this paper we are inter-
paths follow the major axes of the roughly elliptical contours. These
ested in the steepest-descent direction, here we consider its opposite
bends occur when the paths become approximately tangent to the
direction 共corresponding to the steepest ascent兲 to avoid introducing
contours, which is a general property of the method 共see the discus-
an inconvenient minus sign.
sion following equation 48 and Figure 4兲. For point A, the results are
Ageneral definition of distance d between two points 共represented
different. First, the value of ␮ used for the other three points did not
by vectors ␣ and ␤兲 is
lead to convergence. Second, when using the larger value of ␮ in Ta-
ble 1, the path reaches a point close to where it should bend, but this
bending does not occur even after a very large number of iterations
共about 29,000兲. When a somewhat smaller value of ␮ is used, the
d= 冋兺i,j
共␤i − ␣i兲bij共␤ j − ␣ j兲册 1/2
path reaches a point close to the minimum with a much smaller num- ⬅ 关共␤ − ␣兲T B共␤ − ␣兲兴1/2 , 共31兲
ber of iterations 共about 7000兲, but the path between 共xo,y o兲 and
共x1,y 1兲 is clearly different from the steepest-descent path. where B is a positive definite symmetric matrix 共see Appendix A兲.
The previous example illustrates the well-known slow conver- With this condition on B, d is always a nonnegative real number,
gence of the steepest-descent method, which makes it computation- with d = 0 only if ␣ = ␤. The definition of distance is known as the
ally inefficient, particularly when compared to the Levenberg-Mar- metric of the space of points under consideration. If B = I, d is the
quardt method. On the other hand, a study of some of the properties
: µ = 2 ×10 4 : µ = 2 ×10 6 : µ = 2 ×10 7

A 500 100 50 B
10
8 10
B
Radius (km)
5
7 3 C
3 A
6
5
4
10
2 C D
2 3 4 5 6 7 8 9 10 11 12 Figure 4. Elliptical contour lines corresponding to a 2D quadratic
Depth (km) merit function 共given by equation 55 with x1 a point of minimum兲.
The contour spacing is not uniform. The points corresponding to the
Figure 3. Similar to Figure 1 showing the paths corresponding to the centers of the small circles are identified by the letters next to them.
steepest-descent method 共equation 30 with ␮共p兲 = ␮ = constant兲 for The solid and dashed lines are tangent to the contours at points B and
the values of ␮ given at the top of the figure. For point A the dot-dash C. The segments AB and BC are in the directions of the gradient at A
path was far from the minimum value 共see Table 1兲. The large ⫻ on and B. The positions of points B and C were determined using equa-
each path corresponds to 共z1,a1兲. The two contours corresponding to tions 47, 54, and 56. The two segments are perpendicular to each oth-
␴ ⬍ 3 are different from those in Figure 1, and were drawn to show er. The two pairs of closely spaced contours were drawn to show that
their relations to the bend in the paths for points B and D 共see Figure if the segments AB and BC extend beyond points B and C, the value
4 and corresponding text for further details兲. of the quadratic function becomes larger.
usual Euclidean distance. Given a point with coordinates xo, the ⵜs共xo + ␦兲 ⬇ ⵜs共xo兲 + H␦, 共43兲
points x at a distance d from it are on the ellipsoid
where H is the Hessian of s. Therefore,
共x − xo兲T B共x − xo兲 = d2 . 共32兲
ⵜs共xo + ␦兲 = ⵜs共xo兲; d→0 共44兲
Given a function s, the direction of steepest ascent in the d neighbor-
hood of xo is defined as the direction from xo to the point on the ellip- and, from equation 41,
soid of equation 32 for which the value of s is greatest. Let ␦ repre-

sent that direction. To find it, we will maximize s共xo + ␦兲 under the dB−1ⵜs共xo兲
␦= ; d → 0. 共45兲
constraint 兵关ⵜs共xo兲兴TB−1ⵜs共xo兲其1/2
␦TB␦ = d2 , 共33兲 Here, we are assuming that ⵜs共xo兲 ⫽ 0. If ⵜs共xo兲 = 0, xo corresponds
to a critical point 共i.e., a point where s has an extremal value or a sad-
where dle point兲. In conclusion, the direction of steepest ascent at any point
x is given by the vector
␦ = x − xo . 共34兲
The notation used here has been chosen to emphasize the connection ␦ˆ 共x兲 = B−1ⵜs共x兲. 共46兲
of this section to the preceding one. In particular, s and ␦ could be
those introduced in equations 5 and 13. To solve this problem, we When B = I, ␦ˆ is in the direction of the gradient, as expected.
will use the method of Lagrange multipliers, which requires intro- Finally, we will address the choice of B. In principle, the choice is
ducing the function somewhat arbitrary, because different metrics should lead to the
same minimum. In fact, equation 46 shows that ␦ˆ 共x兲 = 0 implies
u = s共xo + ␦兲 + ␯共d2 − ␦TB␦兲, 共35兲 ⵜs = 0 and vice versa 共Feder, 1963兲. However, Crockett and Cher-
noff 共1955兲 showed that the most computationally efficient steepest-
where ␯ is the Lagrange multiplier, computing its derivatives with ascent method requires that B = H. The following proof has two
respect to xi and ␯, and setting them equal to zero. This gives parts, originating from Davies and Whitting 共1972兲 and Greenstadt
共1967兲, respectively. For the first part, consider the iterative minimi-
⳵
s共xo + ␦兲 = 2␯ 兺 bij␦ j , 共36兲 zation 共or maximization兲 of a function s共x兲. At a given iteration, we
⳵xi i have a point xo and move to a new point x1 in a direction defined by a
vector u. Let
which is equivalent to
x1 = xo + ␣u 共47兲
ⵜs共xo + ␦兲 = 2␯B␦, 共37兲
and
and
s共x1兲 = s共xo + ␣u兲 ⬅ F共␣兲. 共48兲
d2 = ␦T B␦. 共38兲
From a computational point of view, we are interested in an iterative
Now we will go over the following steps. Solve equation 37 for ␦ process with the least number of steps. This requires finding the ␣
that reduces 共or increases in the case of maximization兲 the value of F
1 −1
␦= B ⵜs, 共39兲 as much as possible in the direction u at every step, which in turn
2␯ means that x1 becomes a point of tangency to one of the contours of s.
This situation is illustrated in Figure 4, which shows the contours
introduce this result in equation 38, use the symmetry of B, and ap-
corresponding to a 2D quadratic merit function with a point of mini-
ply the square root to both sides of the resulting equation. Thus we
mum. The points indicated by A and B correspond to xo and x1, re-
get
spectively, for a given iteration. Going from A to B, the value of s
1 keeps decreasing, while moving past B leads to an increase in the
d= 关共ⵜs兲T B−1ⵜs兴1/2 . 共40兲 value of s. For the next iteration B and C become the new xo and x1.
2␯
The points B and C correspond to tangency points and were deter-
Next, solve this equation for 1/2␯ and introduce the result in equa- mined using equations 54 and 56 below. The directions at A and B
tion 39. This gives are given by −ⵜs computed at those points. Because the gradient is
perpendicular to the contours, the segment BC is perpendicular to
dB−1ⵜs AB. If the contours were circular, the minimum would be reached in
␦= , 共41兲
关共ⵜs兲TB−1ⵜs兴1/2 one step.
To find the value of ␣ that will lead to a point of tangency, we will
where ⵜs is computed at xo + ␦. Now we will let d go to zero, which expand F to second order about xo
冏冏冏冏
means that ␦ goes to the zero vector, and linearize ⵜs共xo + ␦兲 in the
vicinity of xo. Writing in component form we have dF ␣ 2 d 2F
F ⬇ Fo + ␣ + , 共49兲
d␣ 2 d␣2
⳵s ⳵s ⳵ 2s o o
⳵xi
共xo + ␦兲 ⬇
⳵xi
共xo兲 + 兺j ⳵ x i⳵ x j
共xo兲␦ j . 共42兲 where the subscript o indicates evaluation at xo. Then, expanding
dF/d␣ to first order about xo and setting it equal to zero at the point of
In vector form, this equation becomes tangency we obtain
W8 Pujol
dF
d␣
⬇
dF
d␣
冏冏冏冏 o
+␣
d 2F
d␣2 o
= 0. 共50兲 ␥=
⌬FB
=
共ⵜsTB−1ⵜs兲2
⌬FN 共ⵜsTB−1HB−1ⵜs兲共ⵜsTH−1ⵜs兲
, 共60兲
If F were a quadratic function, these relations would be exact. Now, but before proceeding we will introduce the following vector
differentiating equation 48 with respect to ␣ gives
p = B−1/2ⵜs, 共61兲
dF ⳵s
d␣
= 兺i ⳵xi
ui = uTⵜs 共51兲 so that
ⵜs = B1/2p. 共62兲
and
Because B is positive definite and symmetric, so is B−1/2 共see Appen-
d 2F ⳵ 2s dix A兲. Using these two equations, ␥ becomes
d␣2
= 兺 ⳵ x i⳵ x j
uiu j = uTHu, 共52兲
i,j 共pT p兲2
␥= , 共63兲
where H is the Hessian matrix. Introducing these two expressions in 共p Mp兲共pT M−1p兲
T
equation 50 and solving for ␣ gives
冏冏冒冏冏冏冏
where
dF d 2F uTⵜs M = B−1/2 HB−1/2 . 共64兲
␣⬇− =− T . 共53兲
d␣ o d␣2 o u Hu x = xo
Matrix M is positive definite 共see Appendix A兲. An upper bound
If u = −ⵜs, this expression becomes to ␥ can be established by using the following generalization of
冏冏
Schwarz’s inequality
ⵜsTⵜs
␣⬇ . 共54兲共aT b兲2 ⱕ 共aT Ca兲共bT C−1b兲共65兲
ⵜsTHⵜs x = xo
共see Appendix A兲, where C is a positive definite matrix. Application
This expression is exact when s is a quadratic function. For example, of this expression to ␥ gives
s may be of the form
␥ ⱕ 1. 共66兲
s = 共x − x1兲T P共x − x1兲, 共55兲
Now we will apply the Kantorovich inequality 共Luenberger, 1973兲
where x1 is a constant vector and P is a symmetric matrix. In this to the right-side of equation 63, which immediately gives
case,
4 ␭ n␭ 1 4␭1 /␭n 4␬
ⵜs = 2P共x − x1兲; H = 2P. 共56兲 ␥ⱖ 2 = 2 = , 共67兲
共 ␭ n + ␭ 1兲共1 + ␭1 /␭n兲共1 + ␬兲2
If x1 minimizes s, ⵜs共x1兲 = 0 and P is positive definite 共e.g., Apos- where ␭1 and ␭n are the largest and smallest eigenvalues of M and
tol, 1969兲.
For the second part of the proof we will consider the difference ⌬F ␬ = ␭1 /␭n . 共68兲
between F and Fo, which is determined from equations 49 and
51–53: In summary,
⌬F = F − Fo = − 冏
1 共uT ⵜs兲2
2 uT Hu
冏 x = xo
. 共57兲
4␬
共1 + ␬兲2
ⱕ ␥ ⱕ 1. 共69兲
This result shows that the efficiency of the method depends on ␬,

This result applies to both the minimization and maximization of s, which is the condition number of M 共e.g., Gill et al., 1981兲. The bet-
with the sign of ⌬F depending on whether H is positive or negative ter conditioned this matrix is, the higher the efficiency. In particular,
definite 共see Appendix A兲, corresponding to whether s has a mini- ␥ = 1 when ␬ = 1, which in turn requires M = cI. Without losing
mum or a maximum 共e.g., Apostol, 1969兲. Now we will set u = ␦ˆ generality, we can take c = 1, in which case B = H 共see equation
共see equation 46兲 for two cases: 共1兲 B is any positive definite sym- 64兲. Therefore, the Newton step is the most efficient 共assuming that
metric matrix, and 共2兲 B = H. The latter case corresponds to the so- ⵜs is arbitrary兲. Crockett and Chernoff’s 共1955兲 proof of this result
called Newton 共or Newton-Raphson兲 method, and to distinguish be- is based on a different approach 共and resulting expressions兲. This
tween the two possibilities we will use subscripts B and N. Thus, choice of B, however, is not always the most advisable for two rea-
1 共ⵜsTB−1ⵜs兲2
sons. First, when minimizing s, H may not always be positive defi-
⌬FB = − 共58兲 nite; in fact, some of its eigenvalues may be negative and ⌬FN may
2 ⵜsTB−1HB−1ⵜs become positive. Second, even though H may be positive definite
and through the iterative process, the required computation of second de-
rivatives increases the computational costs. On the other hand, this
1 choice has special relevance in statistics because, as Crockett and
⌬FN = − ⵜsTH−1ⵜs. 共59兲 Chernoff 共1955兲 note, when solving maximum likelihood estimation
2
problems the function s is the logarithm of the likelihood function, in
To investigate the relative efficiency of the methods represented by which case H−1 represents an estimate of the covariance matrix of the
the two choices we will consider the ratio maximum likelihood estimate 共Seber and Wild, 1989兲.
The covariance matrix is also related to our discussion of the met-

S̄共xw兲 ⬍ S̄共x兲; x ⫽ xw . 共75兲
ric via the Mahalanobis distance, named after the Indian statistician
that introduced the concept 共in 1936兲. If x is a random vector from a Because Q is nonnegative and
population with mean ␮ and covariance matrix V, then the distance
between x and ␮ is given by Q共xo兲 = 0 共76兲
d M = 关共x − ␮兲T V−1共x − ␮兲兴1/2 . 共70兲共see equation 8兲, we can write
A good qualitative justification for this definition can be found in wS共xw兲 ⬍ wS共xw兲 + Q共xw兲 = S̄共xw兲 ⬍ S̄共xo兲
Krzanowski 共1988兲, who also notes the relation between this dis-
tance and the maximum likelihood function. = wS共xo兲 + Q共xo兲 = wS共xo兲, 共77兲
Finally, it is worth noting that the choice B = HS 共see equation 16兲
so that
leads to the Gauss-Newton method. In fact, using equations 46, 16,
27, and 15 gives S共xw兲 ⬍ S共xo兲, 共78兲
␦ˆ = −共ATA兲−1ATc = −␦ 共71兲 which means that the minimization of S̄ will lead to a decrease in S.
Now, letting x⬁ denote the ordinary least-squares solution 共the rea-
共provided that the inverse exists兲. Now using equation 34 with −␦ˆ in- son for this notation is explained below兲, we have
stead of ␦ 共the − sign being used to specialize to the steepest-descent
case兲 and then using equation 71 we have wS共xw兲 + Q共xw兲 = S̄共xw兲 ⬍ S̄共x⬁兲 = wS共x⬁兲 + Q共x⬁兲
x = xo − ␦ˆ = xo + ␦. 共72兲 ⬍ wS共xw兲 + Q共x⬁兲, 共79兲
Comparison of this expression with equation 13 shows that we have so that

recovered the Gauss-Newton method. This result is consistent with
the fact that one way to derive this method is to assume that H ⬇ HS
Q共xw兲 ⬍ Q共x⬁兲. 共80兲
in the Newton method 共e.g., Gill et al., 1981兲 The second inequality in equation 79 arises because x⬁ minimizes S.
Inequality 80 shows that the minimization of S̄ also leads to a de-
crease in the weighted sum of adjustments squared.
THE LEVENBERG-MARQUARDT DAMPED
Next, we will derive the equation for the solution that minimizes
LEAST-SQUARES METHOD S̄, but before proceeding we note that Levenberg derived his results
As noted in the discussion of the Gauss-Newton method, if the ini- using scalar notation, not the more familiar matrix notation used
tial solution is far from the optimal solution, the iterative process here. The starting point is equation 73, which will be rewritten using
may not converge. Although this problem is addressed by several au- equation 12
thors 共see Historical Note below兲, Levenberg’s 共1944兲 and Mar-
quard’s 共1963兲 papers are much more thorough than the others and S̄ = w共cTc − 2cTA␦ + ␦TATA␦兲 + ␦T D␦
for this reason they will be examined in detail here. However, their
approaches are so different that it is convenient to discuss them sepa-
rately.
冋
= w cTc − 2cTA␦ + ␦T ATA + 冉 1
w
冊册
D ␦ . 共81兲
Aside from a factor of w, equation 81 is formally similar to equation

The Levenberg approach 12, with ATA in the latter replaced by the symmetric matrix in paren-
theses in the former. Therefore, by analogy with equation 15, the
The notation used here is that introduced in the discussion of the
minimization of S̄ leads to the damped least-squares solution
Gauss-Newton method. To solve the problem of parameter adjust-
ments too large, Levenberg introduced the idea of damping the abso-
lute values of the ␦i by minimizing 冉 A TA +
1
w
冊
D ␦ = ATc, 共82兲
S̄共x兲 = wS共x兲 + Q共x兲, 共73兲 so that the only difference from the ordinary least-squares solution is
the addition of a diagonal matrix to ATA. Because the inverse of the
where
matrix in parentheses always exists for w ⬍ ⬁ 共see Appendix A兲,
Q共x兲 = d1␦12 + ¯ + dn␦n2 = ␦T D␦, 共74兲 equation 82 has a solution even when 共ATA兲−1 does not exist and the
Gauss-Newton method is not applicable. Also note that for w = ⬁
w and the di are positive weighting factors independent of x, and D is the second term on the left side of equation 82 vanishes and we get
a diagonal matrix with elements 共D兲ii = di. A comparison of equa- the ordinary Gauss-Newton solution 共provided it exists兲. This is why
tions 74 and 31 shows that Levenberg’s method implicitly introduc- we introduced the x⬁ used in equations 79 and 80. On the other hand,
es a non-Euclidean norm in the parameter space. Moreover, the re- if w goes to zero, 1/w goes to infinity and the first term on the left be-
sults of the analysis below are valid when D is a symmetric positive comes negligible, which means that we can write
definite matrix.
Let us establish two important results concerning S̄, S, and Q. Let 1
D␦g ⬇ ATc; w → 0. 共83兲
xw be the value of x that minimizes S̄ for a given value of w, i.e., w
W10 Pujol
In addition, because the diagonal elements of D are nonzero, its in- dxw /dw is a vector tangent to xw 共e.g., Spiegel, 1959兲. Furthermore,
verse always exists and we can write because the product on the right side is the matrix form of the scalar
product, we can write
冏冏
1
␦g ⬇ wD−1ATc = − wD−1ⵜs → 0; w → 0. 共84兲 ds共xw兲 dxw
2 = 兩ⵜs兩 cos ␾ , 共91兲
dw dw
共see equation 27兲. This result is also valid when D is symmetric and
positive definite 共so that its inverse exists, see Appendix A兲, in which where ␾ is the angle between the two vectors. The minimum value of
case it agrees with equation 46. The difference in the signs of ␦ and ␦ˆ the derivative is attained for ␾ = ␲. Introducing this value of ␾ as
is due to the fact that they are in the directions of steepest descent and well as equations 27, 86, and 88 into equation 91 we obtain
ascent, respectively.
So far, we have concentrated on S and S̄, but as we will see next, 2兩D−1/2ATc兩2 = 2兩ATc兩兩D−1ATc兩. 共92兲
we can derive several important results concerning s, which is the
This equation is satisfied when D = dI, with d equal to a constant,
quantity that is of most interest to us. In the following we will focus
which results in a factor of d−1 on both sides of the equation. Taking
on the case of w going to zero, which means that we can use equation
d = 1 and letting ␭ = 1/w, we find that equation 82 becomes the
84. Then, letting
well-known equation
␦g = xw − xo 共85兲
1
we find that
共ATA + ␭I兲␦ = ATc; ␭= . 共93兲
w
dxw d␦g The second approach proposed by Levenberg is to choose
= = D−1ATc; w → 0. 共86兲
dw dw
di = 共ATA兲ii , 共94兲
Furthermore,
in which case the matrix in parentheses in equation 82 becomes the
ds共xw兲
dw
= 兺
n
j=1
冏冏
⳵s dx j
⳵x j dw x = xw
= 共ⵜs兲T
dxw
dw
. 共87兲
matrix ATA with its diagonal elements multiplied by 1 + ␭. Leven-
berg did not give a motivation for this choice, but it is directly related
to the scaling introduced by Marquardt.
Because of equations 84 and 85, xw ⬇ xo. Then, introducing equa-
tions 86 and 27 in equation 87 and operating gives The Marquardt approach
冏冏
ds共xw兲
dw w=0
= −2共ATc兲T D−1ATc
Marquardt approaches the problem from a point of view different
from that of Levenberg. His starting point is the following series of
results. Unless otherwise noted, the notation used here is that intro-
= −2共D−1/2ATc兲T 共D−1/2ATc兲 duced earlier except for the fact that S will be assumed to be a func-
tion of ␦, as indicated by the right side of equation 12.
= −2兩D−1/2ATc兩2 ⬍ 0 共88兲 Let ␭ ⱖ 0 be arbitrary 共unrelated to the w above兲 and let ␦o satisfy
共see also equation 44兲. The inequality arises because of the assump- 共ATA + ␭I兲␦o = ATc. 共95兲
tion that xo is not a stationary point of s, which means that the partial
derivatives cannot all be equal to zero. Therefore, because s共xw兲 is Then ␦o minimizes S on the sphere whose radius 兩␦兩 satisfies
decreasing at w = 0, there are values of w 共positive兲 that will reduce
the value of s. In principle, the value of w that minimizes s could be 兩␦兩2 = 兩␦o兩2 . 共96兲
determined by setting ds/dw equal to zero, but because of the com-
This result was proved using the method of Lagrange multipliers,
plexity of this equation in practical cases, Levenberg proposed to
which requires minimizing the function
write s共xw兲 in terms of its linearized Taylor expansion
s共xw兲 ⬇ s共xo兲 + w
ds
dw
冏冏 w=0
共89兲
u共␦, ␭兲 = S + ␭共兩␦兩2 − 兩␦o兩2兲
with respect to ␦ and ␭, where ␭ is a Lagrange multiplier. This re-
共97兲
quires finding the derivatives of u with respect to ␦ and ␭ and setting

and to assume that s共xw兲 will be small. Under these conditions, them equal to zero. Because ␦o does not depend on ␦, the derivative
s共xo兲 s共xo兲 with respect to ␦ can be determined as done in connection with the
w⬇− = −1/2 T 2 , 共90兲 minimization of S̄ in equation 81. In fact, setting w = 1 and D = ␭I
共ds/dw兲兩w = 0 2兩D A c兩 in equation 82 immediately gives
where equation 88 was used. According to Levenberg, this type of
approximation was published by Cauchy in 1847.
共ATA + ␭I兲␦ = ATc. 共98兲
The results derived above do not depend on the values of the This proof is more general than that provided by Marquardt, which
weights d j. To determine them, Levenberg proposed two approach- assumed the existence of 共ATA兲−1. Next, setting the derivative of u
es. One was to choose the di such that the directional derivative of s with respect to ␭ equal to zero gives
along the curve defined by x = xw taken at w = 0 has a minimum val-
ue. The directional derivative is given by equation 87 because 兩␦兩2 = 兩␦o兩2 , 共99兲
which proves the result. For the sake of simplicity, the subscript in ␦o applies to a scaled version of the problem. However, because this
will be dropped. scaling is not essential 共and is not always used; e.g., Gill et al., 1981兲,
The second result requires writing ATA in terms of its eigenvalue the basis of the algorithm is described first. At the pth iteration the
decomposition 共e.g., Noble and Daniel, 1977兲, namely following equation is solved for ␦共p兲
ATA = U⌳UT , 共100兲关共ATA兲共p兲 + ␭共p兲I兴␦共p兲 = 共ATc兲共p兲 , 共108兲

where U is a matrix whose columns are the eigenvectors of ATA and

⌳ is the diagonal matrix of its eigenvalues ␭i 共all real numbers兲. This and the updated value of x is computed, i.e.,
decomposition applies to any symmetric matrix. The elements of the
diagonal matrix ⌳ will be indicated with ␭i. There should be no con- x共p+1兲 = x共p兲 + ␦共p兲 , 共109兲
fusion between the damping parameter ␭ and the ␭i because the
former is never subscripted. Using equation 100 and the property which in turn is used to compute s共p+1兲. Then, if
UUT = UT U = I 共101兲 s共p+1兲 ⬍ s共p兲 , 共110兲

we obtain
the value of ␭共p兲 is reduced. Otherwise, its value is increased. After
A A + ␭I = U共⌳ + ␭I兲U .
T T
共102兲 this step a new iteration is started. Marquardt introduced three tests
that determined the value of ␭共p+1兲, but a simpler approach, described
The matrix in parentheses is diagonal, and, as shown in Appendix A, below, works well. In any case, the important point to note here is
all of its diagonal elements are always positive when ␭ ⬎ 0. There- that the Marquardt algorithm is based on a trial-and-error approach
fore its inverse always exists, which allows writing for the selection of the appropriate value of ␭ at each iteration, which
is simpler than Levenberg’s approach 共equation 90兲.
␦ = 关U共⌳ + ␭I兲UT兴−1ATc = U共⌳ + ␭I兲−1u, 共103兲 Marquardt applies his algorithm after introducing the scaled ma-
where equation 95 was used and trix 关ATA兴* and vector 关ATc兴*, with components given by
u = UTATc. 共104兲共关ATA兴*兲ij = siis jj共ATA兲ij 共111兲

Then and
u2i
兩␦兩2 = ␦T ␦ = uT 关共⌳ + ␭I兲2兴−1u = 兺i 共␭i + ␭兲2 . 共105兲共关ATc兴*兲i = sii共ATc兲i , 共112兲
where
From this equation we see that 兩␦兩 is a decreasing function of ␭. This
result and the previous one are from Morrison 共1960, unpublished,
1
quoted by Marquardt兲. sii = 共113兲
Marquardt’s final result concerns the angle ␤ between ␦ and ATc, 冑共ATA兲ii .
which is proportional to −ⵜs 共see equation 27兲. Using equations 103,
104, and 101, we can write Note that the diagonal elements of 关ATA兴* are all equal to one. In
terms of this scaling, equation 15 becomes
␦TATc uT共⌳ + ␭I兲−1UTATc
cos ␤共␭兲 = =
兩␦兩兩A c兩
T
兩␦兩兩ATc兩关ATA兴*␦* = 关ATc兴* . 共114兲
uT共⌳ + ␭I兲−1u After solving this equation, ␦ is computed using

= . 共106兲
兵u 关共⌳ + ␭I兲兴
T 2 −1
u其兩A c兩
1/2 T
␦i = sii␦*i . 共115兲
The angle ␤ is a function of ␭. When ␭ goes to infinity, we already
saw that ␦ goes to 共−ⵜs兲/␭ 共see equation 28兲, so that ␤ goes to zero. To verify that equation 115 is correct, we will proceed as follows.
When ␭ = 0, two cases must be considered. If the inverse of ATA ex- First, introduce a diagonal matrix S with elements
ists, all the ␭i are positive, cos ␤ ⬎ 0, and ␤ ⬍ 90°. If the inverse does
not exist, then equation 15 cannot be solved. Now we will address 共S兲ii = sii . 共116兲
the question of what happens to ␤共␭兲 for other values of ␭. To answer
it, Marquardt investigates the sign of the derivative of cos ␤ with re- Using this matrix, equations 111, 112, and 114 become
spect to ␭ and finds that
关ATA兴* = SATAS, 共117兲
d
cos ␤共␭兲 ⬎ 0. 共107兲
d␭
关ATc兴* = SATc, 共118兲
共Appendix B兲. The main consequence of this result is that ␤ is a
monotonic decreasing function of ␭, which assures that it is always and
possible to find a value of ␭ that will reduce the value of s. This ob-
servation leads to the algorithm introduced by Marquardt, which he SATAS␦* = SATc. 共119兲
W12 Pujol
Multiplying both sides of equation 119 by S−1 on the left gives solved and on the initial values given to the parameters to be deter-
mined. For points A, B, and C, equation110 was always satisfied and
ATA共S␦*兲 = ATc, 共120兲 the choice of ␭o was not critical 共recall that the Gauss-Newton meth-
so that od converged for these points兲. For point D, the situation is different
共see below兲. For other inverse problems, the best approach to the se-
␦ = S␦* 共121兲 lection of ␭o is to invert synthetic data that resemble the actual data as
much as possible and to experiment with different values of ␭o and

共provided that 共ATA兲−1 exists兲. even the constant c.
Marquardt’s algorithm is based on the solution of equation 108 af- Two values of ␭o for each initial point were used. The first value
ter introducing the scaling described above. Solving the scaled equa- 共equal to 1 ⫻ 106 for all the points兲 was chosen very large to see the
tion gives ␦*共p兲, which is converted to ␦共p兲 using equation 115, which
relation between the convergence paths and the corresponding
in turn is used in equation 109.
steepest-descent paths. Interestingly, the two paths are extremely
The reasons given by Marquardt to scale the problem are twofold.
close to each other for the four initial values, but the number of itera-
First, it was known 共Curry, 1944兲 that the properties of the steepest-
tions is several orders of magnitude smaller 共just 23 or 24兲 and the
descent method are not scale invariant. Second, the particular scal-
endpoints coincide with 共zM ,aM 兲共Table 1兲. This similarity of paths
ing he chose was widely used in linear least-squares problems to im-
was not expected, and it is not clear how the changes in ␭, ATA, and
prove their numerical aspects. These questions are discussed in, e.g.,
ATc at each iteration combine to produce the observed paths. For a
Draper and Smith 共1981兲 and Gill et al. 共1981兲. What is not obvious,
however, is that equation 115 is applicable after the scaled version of comparison, the largest initial values of ATA for the points A, B, C,
equation 108 is solved, but this fact can be proved using the ideas de- and D are close to 5 ⫻ 106, 5 ⫻ 103, 1 ⫻ 103, and 8, respectively, and
veloped by Levenberg. To see that, let us multiply both sides of equa- the corresponding value at the minimum point is 920.
tion 82 by D−1/2 on the left, rewrite it slightly, and operate. Using ␭ The second value of ␭o is equal to 1 ⫻ 104 for point A and 1 ⫻ 102
= 1/w, this gives for the other ones. In all cases, the point 共zM ,aM 兲 is reached exactly,
and the number of iterations becomes smaller 共17 for point A and 11
D−1/2共ATA + ␭D1/2 D1/2兲共D−1/2 D1/2兲␦ for the other points兲 than for the previous value of ␭o. The values of
␭o used here were chosen so that the convergence paths are interme-
= 共D−1/2ATAD−1/2 + ␭I兲共D1/2␦兲 = D−1/2ATc. 共122兲 diate between the previous ones and those obtained using the Gauss-
If D is the diagonal matrix with elements given by equation 94, then Newton method. For smaller values of ␭o, convergence is even faster
D−1/2 = S and equation 122 becomes for all points except D. For this point, values of ␭o less than about 10
lead to a larger number of iterations. Recall that this was the only
共SATAS + ␭I兲␦* = SATc, 共123兲 point for which the Gauss-Newton method did not converge. Al-
though it is not possible to give a conclusive explanation for these
where ␦* = S−1␦ 共see equation 121兲. This result has two consequenc-
differences in convergence speeds, they may be related to the fact
es. First, its comparison with equations 117–121 justifies Mar-
that point D is in a region of the 共z,a兲 plane with a very slow rate of
quardt’s procedure. Second, it shows that this procedure is equiva-
change in the value of ␴, so that to assure a decrease in its value, the
lent to Levenberg’s second choice of matrix D, namely D = S−2. This
adjustments ␦z and ␦a must be smaller than for some of the other
equivalence, noted without proof by Davies and Whitting 共1972兲,
points, thus requiring larger values of ␭o.
shows that Marquardt’s method implicitly introduces a non-Euclide-
The application of the scaled version of the method is based on
an norm that changes at each iteration because it is based on the value
equation 123. Using ␭o = 1, convergence to the true values of z and a
of x for that iteration.
was achieved in 12 iterations for points A, B, and C 共Figure 6兲 and in
27 iterations for point D. The problem for point D is that ␴ initially
Example 1c increases for this value of ␭o, which requires an increase in the values
Here, we will apply the Levenberg-Marquardt method with and of ␭ used in subsequent iterations. Therefore, to reduce the total
without scaling to the gravity data introduced before. There are two number of iterations a larger ␭o is needed. For this particular point,
reasons for using the two options. One is that because both of them the smallest number of iterations is 17 for ␭ = 20 共Figure 6兲. Let us
are used in practice, a comparison of their performances will be use- examine the convergence paths. For points A and C there are not sig-
ful. The second reason is that scaling is equivalent to changing the nificant qualitative differences with the corresponding paths seen in
shape of s at each iteration, so that a direct comparison with the Figure 5 for the smaller values of ␭o, but for the other two points the
Gauss-Newton and steepest-descent methods is possible only for the differences are significant. For point B, the first three points of the
unscaled version. path do not interpolate between the Gauss-Newton and steepest-de-
First, we will consider the results obtained using the unscaled ver- scent paths. Recall that the interpolation property discussed here ap-
sion 共corresponding to equation 108兲. The initial values of ␭ 共indi- plies to the unscaled version of the method, so that it cannot be ex-
cated by ␭o兲 are given in Table 1. The following procedure to handle pected that it will always apply when scaling is introduced. For point
␭ at each iteration is simpler than that proposed by Marquardt but D, the path is completely unexpected, with 共z1,a1兲 equal to
was found to be effective when applied to a variety of inverse prob- 共2.79,2.79兲, much further away from 共xo,y o兲 than for any of the other
lems. As before, ␴2 plays the role of s. Then, if equation 110 is satis- three initial points. In addition, after this first point has been reached,
fied, the value of ␭共p +1兲 is set equal to ␭共p兲 /c, where c is a constant the path is similar to that for point C.
共here, c = 2兲. If not, the values of ␭ and the parameters are set equal The inversion was repeated using equation 82 with the diagonal
to those they had in the iteration p − 1. Then a new iteration is start- elements of D given by equation 94, 1/w = ␭, and the same values of
ed. The selection of ␭o depends on the type of inverse problem being ␭o. The results obtained in this way agree with those shown in Figure
6, thus providing a numerical confirmation of the equivalence of this solution, which can be derived as follows. Equations 124 and 125
choice of D and the Marquardt scaling proved analytically. will be written as a single equation involving partitioned matrices,
In summary, for this particular example, the unscaled and scaled namely
versions of the Levenberg-Marquardt method perform similarly. It
B␦ = u, 共127兲
may be argued that the scaled version makes it easier to choose the
value of ␭o, which can be taken close to one, but as point D showed, where
冉冊冉冊
this value may not lead to a smaller number of iterations. Obviously,
this is not a major concern in our case, but it may be so when solving A c
B= ; u= . 共128兲
inverse problems involving large numbers of parameters. Also note pI 0
that the results for points C and D in Figure 6 show that it is difficult
to make general statements regarding the convergence paths for lin-
earized nonlinear problems, even for a relatively simple 2D case.
o: λo = 1×10 6; o: λo = 1×10 4; o: λo = 1× 10 2
Again, convergence to a solution may become more of an issue as
the number of inversion parameters increases. In particular, the A 500 100 50 B
10
function s may have local minima in addition to an absolute mini-
mum, in which case the inversion results may depend on the initial 9
solution and on the selection of ␭o. These facts must be borne in mind
by those beginning their work in nonlinear inverse problems. Each 8 10
problem will have features that make it different from other prob-
Radius (km)
5
lems and, as noted above, the best way to investigate it is through the 7 3
inversion of realistic synthetic data 共i.e., the model is realistic兲. In 3
addition, because actual data are always affected by measurement or 6
5
observational errors, representative errors should be added to the
5
synthetic data.
4
HISTORICAL NOTE 10
3
In spite of its importance, Levenberg’s 共1944兲 paper went largely
unnoticed until it was referred to in Marquardt’s 共1963兲 paper. When 2 C D
Levenberg published his paper he was working at the Engineering 2 3 4 5 6 7 8 9 10 11 12
Research Section of the U. S. Army Frankford Arsenal 共Philadel- Depth (km)
phia兲, and according to Curry 共1944兲 the engineers there preferred
Levenberg’s method over the steepest-descent method. Interesting- Figure 5. Similar to Figure 1 showing the paths corresponding to the
ly, the Frankford Arsenal supported the work of Rosen and Eldert unscaled Levenberg-Marquardt method 共equation 108兲 using the
initial values of ␭ 共i.e., ␭o兲 given at the top of the figure 共circles兲. For
共1954兲 on lens design using least squares, but they did not use Leven- a comparison, some of the steepest-descent paths in Figure 3 are also
berg’s method. The computerized design of lenses was an area of re- shown here 共black lines兲.
search with military and civilian applications, with early results
summarized by Feder 共1963兲. Regarding Levenberg’s paper, Feder
notes that it had come to his attention in 1956 and that other people
o, o, o: λ o = 1; o: λ o = 20
had rediscovered the damped least-squares method, although some
of the work was supported by the military and could not be made A 500 100 50 B
10
public until several years later because of its classified nature.
One of the rediscoverers of the damped least-squares method was 9
Wynne 共1959兲, who notes that the problems affecting the ordinary
8 10
least-squares method when the initial solutions did not lead to an ap-
5
Radius (km)
proximate linearization could be addressed by limiting the size of the

7 3
adjustment vector. Using the notation introduced here, Wynne added
3
the constraint 6
5
p␦ = 0, 共124兲 5
where p is a weighting factor, to an equation similar to 4

10
A␦ = c. 共125兲 3
Wynne noted that the least-squares solution of the combined equa-
2 C D
tions 124 and 125 minimizes a function similar to
2 3 4 5 6 7 8 9 10 11 12
Depth (km)
S̄ = r r + p ␦ ␦.
T 2 T
共126兲
Figure 6. Similar to Figure 5 showing the paths corresponding to the
Note that this is a special case of equation 73 with w = 1 and D scaled Levenberg-Marquardt method 共based on equation 123兲 using
= p2I. Wynne, however, did not give an explicit expression for the the initial values of ␭ 共␭o兲 given at the top of the figure 共circles兲.
W14 Pujol
Solving equation 127 by least squares gives APPENDIX A
BT B␦ = BTu, 共129兲 SOME RESULTS CONCERNING POSITIVE

DEFINITE AND SEMIDEFINITE MATRICES
which after performing the multiplications indicated becomes equa-
tion 93 with ␭ replaced by p2. This result is quoted without proof in 共1兲 A square symmetric matrix C is said to be positive semidefi-
Wynne and Wormell 共1963兲. Feder 共1963兲, however, provides a nite if

proof that started with equation 126. Wynne’s 共1959兲 paper is also
interesting because it notes that for p going to infinity the solution yTCy ⱖ 0; y ⫽ 0. 共A-1兲
approaches that which is obtained using the method of steepest de-
The matrix C is positive definite if the ⱖ sign above is replaced by
scent, thus anticipating Marquardt’s results. In Wynne’s method, the
⬎. If vi and ␭i are an eigenvector of C and its corresponding eigen-
selection of p was empirical, and was based on the condition that the
value, then
computed value of ␦ was small enough to assure that the lineariza-
tion of the nonlinear problem was approximately satisfied. A simpler vTi Cvi = ␭ivTi vi = ␭i兩vi兩2 . 共A-2兲
approach, suggested by Feder 共1963兲, is to start with a large value of
p and to reduce it gradually so as to assure convergence to a solution. Because 兩vi兩 ⬎ 0, if C is positive semidefinite, ␭i ⱖ 0. If C is positive
An application of Wynne’s method was provided in Nunn and definite, ␭i ⬎ 0 and its inverse exists because C−1 = U⌳−1UT 共see
Wynne 共1959兲. equations 100 and 101兲.
Another rediscoverer was Girard 共1958兲, although his work was If the ⱖ sign in equation A-1 is replaced by ⬍ the matrix C is said
not as extensive as that of Wynne. Girard’s approach was to add the to be negative definite and its eigenvalues are negative.
constraint 共1a兲 Given any matrix A, ATA is either positive definite or
semidefinite, as can be seen from
K␦ = 0, 共130兲
yT共ATA兲y = 共Ay兲TAy = 兩Ay兩2 ⱖ 0; y ⫽ 0. 共A-3兲
where K is a diagonal matrix, to an equation similar to equation 125.
If the inverse of ATA exists, all of its eigenvalues will be larger than
This led to a merit function similar to S̄ and to a matrix equation sim-
zero and ATA will be positive definite. If the inverse does not exist,
ilar to equation 82 with 共1/w兲D replaced by a diagonal matrix. Nei-
some of the eigenvalues will be equal to zero and the matrix will be
ther the derivation of the equation, nor an expression for the ele- positive semidefinite.
ments of the matrix were given, although the latter can be derived by 共1b兲 A diagonal matrix D with elements 共D兲ii = di ⬎ 0 is positive
replacing pI with K in the matrix B in equation 128 and proceeding definite because
as before.
This brief summary shows that Levenberg’s method was known
among people working on optical design, but this knowledge did not
yTDy = 兺i diy2i ⬎ 0; y ⫽ 0, di ⬎ 0. 共A-4兲
spread further. The widespread lack of recognition of Levenberg’s
work may have to do with the unavailability of adequate computa- 共1c兲 If matrices C and P are positive semidefinite and definite, re-
tional capabilities at the time his paper appeared, and the possibility spectively, and ␭ ⬎ 0, then C + ␭P is positive definite because
that his way of finding the optimal value of w at each iteration was
deemed too complicated for its computer implementation. For ex- yT共C + ␭P兲y = yTCy + ␭yTPy ⬎ 0; y ⫽ 0, ␭ ⬎ 0.
ample, Hartley 共1961兲 notes the need to know higher derivatives of 共A-5兲
the merit function and concludes that the method was not well suited
These three results are important in the context of the damped
for computer programming. Interestingly, it was Hartley who
least-squares method because if C = ATA, then the matrices in pa-
brought Levenberg’s paper to the attention of Marquardt as a review-
rentheses in equations 82 and 93 will be positive definite as long as
er of the latter’s paper. Marquardt’s work, on the other hand, became
w ⬍ ⬁ and ␭ ⬎ 0, and their inverses will exist. For the particular case
popular rather quickly, which brings the question of why this hap- of P = I the eigenvalues of C + ␭I are ␭i + ␭, which are always pos-
pened. According to Davis 共1993兲, Marquardt explained the success itive as long as ␭ ⬎ 0 共Feder, 1963兲.
of his method by the fact that he implemented it in a FORTRAN code 共2兲 If B is a symmetric positive 共semi兲definite matrix, there exists
and that he gave away hundreds of copies of it. Other interesting a unique matrix C such that
comments by Marquardt on his paper can be found at http://garfield.
library.upenn.edu/classics1979/A1979HZ24400001.pdf兲. B = C2 , 共A-6兲
where C is symmetric positive 共semi兲definite. To see that, start with

ACKNOWLEDGMENT the eigenvalue decomposition of B, given by
I gratefully acknowledge the constructive comments of one of the B = U⌳UT 共A-7兲

reviewers, Bill Rodi, which led to an improved presentation of the
paper, his careful checking of the equations, and the comments that 共see equation 100兲 and introduce the matrix
motivated the note on the relation between the Newton and Gauss-
Newton methods. C = U⌳1/2 UT , 共A-8兲
where ⌳1/2 is a diagonal matrix with its iith element given by ␭i1/2.
Then, 冋兺册冋兺册冋兺册
i
u2i ⌸1i
i
u2i ⌸3i −
i
u2i ⌸2i
2
共B-4兲
冋兿册
,
C2 = U⌳1/2 UT U⌳1/2 UT = U⌳UT = B, 共A-9兲共␭i + ␭兲2
2
i
where equation 101 has been used. The matrix C is known as the
square root of B, and is indicated by B1/2. This matrix is unique 共for a where
proof see, e.g., Harville, 1997; Zhang, 1999兲.
If B is positive definite, ⌸ni = 兿共 ␭ i⬘ + ␭ 兲 n ; n = 1,2,3; i⬘ ⫽ i. 共B-5兲
i⬘
B−1/2 = C−1 = U⌳−1/2UT , 共A-10兲 To show how this result is derived it will be assumed that the number
of terms in the sums is three; the extension to any other number is
where equation 101 has been used. This matrix is also symmetric and
straightforward. Let
positive definite.
共3兲 In its standard form, Schwarz inequality states that ai = ␭i + ␭ . 共B-6兲
共aTb兲2 ⱕ 共aTa兲共bTb兲共A-11兲 Then, the fractions within braces in equation B-2 can be rewritten as
follows
for any vectors a and b 共e.g., Arfken, 1985兲.
Let C be a symmetric positive definite matrix. In equation A-11, u21 u22 u23 u21an2an3 + u22an1an3 + u23an1an2
+ + =
replace a and b by C1/2 a and C−1/2 b, respectively. Because C1/2 is an1 an2 an3 an1an2an3
symmetric, this immediately gives equation 65 共e.g., Zhang, 1999兲.
共4兲 The matrix M in equation 64 is symmetric 共because so are the
u21⌸n1 + u22⌸n2 + u23⌸n3
兺i u2i ⌸ni
matrices on the right兲 and positive definite. To show that, let y be an
= = ,
arbitrary nonzero vector. Then an1an2an3 兿i ani
yTMy = yTB−1/2HB−1/2y = 共B−1/2y兲TH共B−1/2y兲 ⬎ 0, 共A-12兲
共B-7兲
because H is assumed to be positive definite. The vector in parenthe-
ses on the right-hand side of this equation is arbitrary. where
⌸n1 = an2an3 ; ⌸n2 = an1a3;n ⌸n3 = an1an2 ; n = 1,2,3.

APPENDIX B 共B-8兲
Note that the first subindex in ⌸ni refers to the power to which the
PROOF OF EQUATION 107 factors in the product are raised and the second one indicates which
factor is excluded from the product. The denominator in equation
In component form, equation 106 becomes
B-4 is common to the two terms within the braces in equation B-2.
The denominator in equation B-4 is also positive. To find out the
u2i
兺i ␭i + ␭
sign of the numerator, a new transformation is needed, based on the
fact that
冋兺册
cos ␤共␭兲 = . 共B-1兲
u2i 1/2
⌸1i ⌸3i = 共⌸2i兲2 . 共B-9兲
兩ATc兩
i 共␭i + ␭兲2 This allows writing the numerator of expression B-4 as
The derivative of cos ␤ with respect to ␭ is given by
冋兺共 ui⌸1i 兲
1/2 2
册冋兺共 ui⌸3i 兲 −
1/2 2
册冋兺共 1i 兲共 ui⌸3i 兲 ,
ui⌸1/2 1/2
册 2
d
d␭
cos ␤共␭兲 =
1
C 再冋兺册冋兺 u2i
␭i + ␭
u2i
共␭i + ␭兲3 册
i i i
共B-10兲
冋兺册冎
i i
which is always positive because of the Schwarz inequality 共see
u2i 2
equation A-11兲. To apply it to equation B-10 let a and b be vectors
− , 共B-2兲 with components ui⌸1/2
共␭i + ␭兲2 1i and u i⌸3i , respectively. This result shows
1/2
i
that
where
d
冋兺册
cos␤共␭兲 ⬎ 0. 共B-11兲
u2i 3/2 d␭
C= 兩A c兩.
T
共B-3兲
i 共␭i + ␭兲2
REFERENCES
The factor 1/C is positive. To find the sign of the factor in braces in
equation B-2 we must perform all the operations indicated. The re- Arfken, G., 1985, Mathematical methods for physicists: Academic Press Inc.
sulting expression is Apostol, T., 1969, Calculus, vol. II: Blaisdell Publishing Company.
W16 Pujol
Beveridge, G., and R. Schechter, 1970, Optimization: Theory and practice: linear regression functions by least squares: Technometrics, 3, 269–280.
McGraw-Hill Book Company. Harville, D., 1997, Matrix algebra from a statistician’s perspective: Springer,
Crockett, J., and H. Chernoff, 1955, Gradient methods of maximization: Pa- Pub. Co., Inc.
cific Journal of Mathematics, 5, 33–50. Jenkins, G., and D. Watts, 1968, Spectral analysis: Holden-Day.
Curry, H., 1944, The method of steepest descent for non-linear minimization Krzanowski, W., 1988, Principles of multivariate analysis: Oxford Universi-
problems: Quarterly of Applied Mathematics, 2, 258–261. ty Press.
Davies, M., and I. Whitting, 1972, A modified form of Levenberg’s correc- Levenberg, K., 1944, A method for the solution of certain non-linear prob-
tion, in F. Lootsma, ed., Numerical methods for non-linear optimization: lems in least squares: Quarterly of Applied Mathematics, 2, 164–168.
Academic Press Inc., 191–201. Luenberger, D., 1973, Introduction to linear and nonlinear programming:
Davis, P., 1993, Levenberg-Marquart 关sic兴 methods and nonlinear estima- Addison-Wesley Publishing Company.
tion: SIAM News, 26, no. 2. Marquardt, D., 1963, An algorithm for least-squares estimation of nonlinear
Dobrin, M., 1976, Introduction to geophysical prospecting: McGraw-Hill parameters: SIAM Journal, 11, 431–441.
Book Co. Noble, B., and J. Daniel, 1977, Applied linear algebra: Prentice-Hall.
Dorny, C., 1975, A vector space approach to models and optimization: John Nunn, M., and C. Wynne, 1959, Lens designing by electronic digital comput-
Wiley & Sons. er: II: Proceedings of the Physical Society, 74, 316–329.
Draper, N., and H. Smith, 1981, Applied regression analysis: John Wiley & Rosen, S., and C. Eldert, 1954, Least-squares method for optical correction:
Sons. Journal of the Optical Society of America, 44, 250–252.
Feder, D., 1963, Automatic optical design: Applied Optics, 2, 1209–1226. Seber, G., 1977, Linear regression analysis: John Wiley & Sons, Inc.
Gill, P., W. Murray, and M. Wright, 1981, Practical optimization: Academic Seber, G., and C. Wild, 1989, Nonlinear regression: John Wiley & Sons, Inc.
Press Inc. Spiegel, M., 1959, Vector analysis: McGraw-Hill Book Co.
Girard, A., 1958, Calcul automatique en optique géométrique: Revue Wynne, C., 1959, Lens designing by electronic digital computer: I: Proceed-
D’Optique, 37, 225–241. ings of the Physical Society 共London兲, 73, 777–787.
Greenstadt, J., 1967, On the relative efficiencies of gradient methods: Mathe- Wynne, C., and P. Wormell, 1963, Lens design by computer: Applied Optics,
matics of Computation, 21, 360–367. 2, 1233–1238.
Hartley, H., 1961, The modified Gauss-Newton method for the fitting of non- Zhang, F., 1999, Matrix theory: Springer, Pub. Co., Inc.

2007 The Solution of Nonlinear Inverse Problems and The Levenberg-Marquardt Method PUJOL

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

2007 The Solution of Nonlinear Inverse Problems and The Levenberg-Marquardt Method PUJOL

Uploaded by

Copyright:

Available Formats

GEOPHYSICS, VOL. 72, NO. 4 共JULY-AUGUST 2007兲; P. W1–W16, 6 FIGS., 1 TABLE.

The solution of nonlinear inverse problems

INTRODUCTION rameters about which the problem is linearized. Once equation 1 is

n A condition for a minimum of S to exist is that HS be positive definite

Then compute the updated estimate using

s共p+1兲 ⱕ smin , 共21兲

Point zo ao goM z1 a1 g1M N zN aN ␭o or ␮ ␭N

Gauss-Newton method 共equation 17, Figure 1兲

A 2 10 1747 2.3 9.9 1317 28904 6.52 5.00 2 ⫻ 107

: µ = 2 ×10 4 : µ = 2 ×10 6 : µ = 2 ×10 7

soid of equation 32 for which the value of s is greatest. Let ␦ repre-

equation 50 and solving for ␣ gives

This result shows that the efficiency of the method depends on ␬,

The covariance matrix is also related to our discussion of the met-

x = xo − ␦ˆ = xo + ␦. 共72兲 ⬍ wS共xw兲 + Q共x⬁兲, 共79兲

Comparison of this expression with equation 13 shows that we have so that

Aside from a factor of w, equation 81 is formally similar to equation

quires finding the derivatives of u with respect to ␦ and ␭ and setting

ATA = U⌳UT , 共100兲 关共ATA兲共p兲 + ␭共p兲I兴␦共p兲 = 共ATc兲共p兲 , 共108兲

where U is a matrix whose columns are the eigenvectors of ATA and

UUT = UT U = I 共101兲 s共p+1兲 ⬍ s共p兲 , 共110兲

u = UTATc. 共104兲 共关ATA兴*兲ij = siis jj共ATA兲ij 共111兲

uT共⌳ + ␭I兲−1u After solving this equation, ␦ is computed using

much as possible and to experiment with different values of ␭o and

proximate linearization could be addressed by limiting the size of the

where p is a weighting factor, to an equation similar to 4

Solving equation 127 by least squares gives APPENDIX A

BT B␦ = BTu, 共129兲 SOME RESULTS CONCERNING POSITIVE

Wynne and Wormell 共1963兲. Feder 共1963兲, however, provides a nite if

where C is symmetric positive 共semi兲definite. To see that, start with

I gratefully acknowledge the constructive comments of one of the B = U⌳UT 共A-7兲

⌸n1 = an2an3 ; ⌸n2 = an1a3;n ⌸n3 = an1an2 ; n = 1,2,3.

You might also like

ATA = U⌳UT , 共100兲关共ATA兲共p兲 + ␭共p兲I兴␦共p兲 = 共ATc兲共p兲 , 共108兲

u = UTATc. 共104兲共关ATA兴*兲ij = siis jj共ATA兲ij 共111兲