10.1190/1.2732552
Downloaded 10/07/13 to 134.53.24.2. Redistribution subject to SEG license or copyright; see Terms of Use at http://library.seg.org/
Jose Pujol¹
ABSTRACT

Although the Levenberg-Marquardt damped least-squares method is an extremely powerful tool for the iterative solution of nonlinear problems, its theoretical basis has not been described adequately in the literature. This is unfortunate, because Levenberg and Marquardt approached the solution of nonlinear problems in different ways and presented results that go far beyond the simple equation that characterizes the method. The idea of damping the solution was introduced by Levenberg, who also showed that it is possible to do that while at the same time reducing the value of a function that must be minimized iteratively. This result is not obvious, although it is taken for granted. Moreover, Levenberg derived a solution more general than the one currently used. Marquardt started with the current equation and showed that it interpolates between the ordinary least-squares method and the steepest-descent method. In this tutorial, the two papers are combined into a unified presentation, which will help the reader gain a better understanding of what happens when solving nonlinear problems. Because the damped least-squares and steepest-descent methods are intimately related, the latter is also discussed, in particular in its relation to the gradient. When the inversion parameters have the same dimensions (and units), the direction of steepest descent is equal to the direction of minus the gradient. In other cases, it is necessary to introduce a metric (i.e., a definition of distance) in the parameter space to establish a relation between the two directions. Although neither Levenberg nor Marquardt discussed these matters, their results imply the introduction of a metric. Some of the concepts presented here are illustrated with the inversion of synthetic gravity data corresponding to a buried sphere of unknown radius and depth. Finally, the work done by early researchers that rediscovered the damped least-squares method is put into a historical context.
Manuscript received by the Editor June 27, 2006; revised manuscript received January 10, 2007; published online May 30, 2007.
¹University of Memphis, Department of Earth Sciences, Memphis, Tennessee. E-mail: jpujol@memphis.edu.
© 2007 Society of Exploration Geophysicists. All rights reserved.
happens when solving a linearized nonlinear problem. To give the readers a flavor of the matters to be discussed, the basic features of the two approaches are summarized below.

Levenberg (1944) solves the problem of the lack of convergence by introducing (and naming) the damped least-squares method. The basic idea was to damp (i.e., limit) the values of the parameters at each iteration. Specifically, instead of using the function S (see below) whose minimization leads to the ordinary least-squares solution, Levenberg minimizes the function S̄ = wS + Q, where w > 0 and Q is a linear combination of the components of δ squared. The result of this minimization is a generalization of equation 1, with I replaced by a diagonal matrix D with nonnegative elements. Levenberg's contribution to the solution of the problem did not stop here, however. Equally important are his proofs that the minimization of S̄ leads to a decrease in the values of S and the function whose linearization leads to S (i.e., the function s below). The reduction in the value of s does not occur for all values of w (which is equal to 1/λ when D = I), and Levenberg suggests a way to find the value of w leading to a reduction in the value of s. Another important result is that the Q corresponding to the damped least-squares solution is always (i.e., for all w) smaller than when damping is not applied. These results are not obvious, but are rarely considered when the damped least-squares method is introduced in the literature.

Marquardt (1963), on the other hand, starts with equation 1 and investigates the angle between the computed δ and the direction of steepest descent of s (equal to −∇s, where ∇ stands for gradient). When λ goes to infinity, the contribution of A^T A in equation 1 becomes negligible and the result is the equation used in the method of steepest descent. This method generally produces a significant reduction in the value of s in the early iterations, but becomes extremely slow after that, to the point that convergence to the solution may not be achieved even after a large number (hundreds or more) of iterations (e.g., Gill et al., 1981, and the example below). On the other hand, the ordinary least-squares method (known as the Gauss-Newton method) has the ability to converge to the solution quickly when the starting point is close to the solution (and even when far from it, as the example below shows). Marquardt proves that equation 1 interpolates between the two methods, and shows that the angle between δ and −∇s is a monotonically decreasing function of λ, with the angle going to zero as λ goes to infinity. Based on this fact, Marquardt proposes a simple algorithm in which at each iteration the value of λ is modified to assure that the corresponding value of s becomes smaller than in the previous iteration. Marquardt also recognizes that the term λI in equation 1 assures that the matrix in parentheses is better conditioned than A^T A and that the angle between δ and −∇s is always less than 90°. If this condition is not met, the iterative process may not be convergent.

Although neither Levenberg nor Marquardt discusses the steepest-descent method itself, this tutorial would be incomplete without consideration of the relation between the direction of steepest descent and the gradient, which is not unique when inversion parameters have different dimensions or units. In such cases, it is not obvious how to measure the distance between two points in parameter space, and as a result, equating the direction of steepest descent to the direction of minus the gradient becomes meaningless (Feder, 1963). It is only when a definition of distance (i.e., a metric) is introduced that the two directions become uniquely related. These questions will be discussed first to put Levenberg's and Marquardt's approaches into a broader perspective, as they involve, either directly or indirectly, the introduction of a metric.

The concepts introduced here are illustrated with a simple example involving the inversion of gravity data corresponding to a buried sphere. In this case, the unknown parameters are the radius of the sphere and the depth to its center. By limiting the number of inversion parameters to two, it is easy to visualize both the function s and the path followed by the parameters as a function of the iterations for different initial values of the parameters and λ, and for solutions obtained using the damped and ordinary least-squares methods and the steepest-descent method.

This tutorial concludes with a historical note. Although Levenberg's paper was published almost twenty years before Marquardt's, it went almost unnoticed in spite of its practical importance. Interestingly, an internet search uncovered a paper by Feder (1963) on computerized lens design that shows that ideas similar to that of Levenberg had been rediscovered more than once. Feder's paper, in turn, led to a paper by Wynne (1959), which anticipates some of the ideas in Marquardt's approach. Yet the fact remains that it was Marquardt's paper that popularized the damped least-squares method, a fact he attributed to his having distributed hundreds of copies of his FORTRAN code!

THE GAUSS-NEWTON METHOD

Let f be a function of the independent variables v_k, k = 1,2,... and the parameters x_j, j = 1,2,...,n. For convenience, the variables and parameters will be considered the components of vectors v and x, respectively. To identify a particular set of values of the variables we will use symbols such as v_i. Let us assume that f is a mathematical model for observations of interest to us, and that o_i is the observation corresponding to the set of variables v_i, so that

o_i ≈ f(v_i, x) ≡ f_i(x);  i = 1,...,m.  (3)

Let us define the residual ε_i(x) as

ε_i(x) = o_i − f_i(x);  i = 1,...,m.  (4)

We are interested in finding the set of parameters x_j that minimize the sum of residuals squared, namely

s(x) = Σ_{i=1}^m ε_i²(x).  (5)

A function that measures the misfit between observations and model values, such as s(x), is known as a merit function. Other terms found in the parameter estimation and optimization literature are objective, loss, and risk function. If f_i(x) is a nonlinear function of the parameters, the minimization of equation 5 generally requires the use of numerical methods. A typical approach is to express f_i(x) in terms of its linearized Taylor expansion about an initial solution x_{oj} (j = 1,...,n) at which s does not have a stationary point. This gives

f_i(x) ≈ f_i(x_o) + Σ_{j=1}^n (∂f_i/∂x_j)|_{x=x_o} (x_j − x_{oj});  i = 1,...,m,  (6)

where x_o has components x_{oj}. Using this expression with equation 3, we can introduce a new set of residuals
Levenberg-Marquardt nonlinear inversion
and

a_ij = (∂f_i/∂x_j)|_{x=x_o}.  (9)

Note that f_i(x_o) and the derivatives have specific numerical values, while the δ_j are unknown. Equation 7 will be written in matrix form as

r = c − Aδ,  (10)

where c has components c_i = o_i − f_i(x_o) (equation 11) and S = S(δ) denotes the sum of the squared linearized residuals (equation 12). Setting the derivatives of S with respect to the δ_j equal to zero gives

∂S/∂δ = (∂S/∂δ_1  ∂S/∂δ_2  …)^T = −2A^T c + 2A^T Aδ = 0  (14)

(Seber, 1977), which leads to the well-known ordinary least-squares equation

A^T Aδ = A^T c.  (15)

In this section, we will assume that (A^T A)^{-1} exists, which means that equation 15 can be solved for δ. When this assumption is not valid, a different method (such as damped least-squares) should be used.

Now it remains to show that the δ obtained from equation 15 minimizes S. To see that, we must examine the Hessian of S, H_S, which is the matrix of second derivatives of S with respect to the components of δ. Because the quadratic terms in equation 12 are of the form (A^T A)_{mn}δ_mδ_n and A^T A is symmetric,

(H_S)_{kl} = ∂²S/∂δ_k∂δ_l = 2(A^T A)_{kl}.  (16)

In the Gauss-Newton method this linearized problem is solved iteratively: after each solution for δ, the reference point is updated and A and c are recomputed (using equations 9 and 11). Then solve for δ again. To make the process clear, we will describe the two steps that lead to the estimate x^(p+1) for the (p+1)th iteration. First, solve

(A^T A)^(p) δ^(p) = (A^T c)^(p),  (17)

where the superscript (p) indicates iteration number and A and c have components

a_ij^(p) = (∂f_i/∂x_j)|_{x=x^(p)};  c_i^(p) = o_i − f_i(x^(p)).  (18)

Example 1a

Consider a buried homogeneous sphere of radius a with center at (y_o, z), where y_o is measured along the y-axis (horizontal) and z is depth. The vertical component of the gravitational attraction caused by the sphere at a point y_i at zero depth is given by

g(y_i, z, a) = (4/3) ΓDa³z / [(y_i − y_o)² + z²]^{3/2}  (23)

(e.g., Dobrin, 1976), where Γ is the gravitational constant and D is the density contrast (equal to the difference of the densities of the sphere and the surrounding medium, assumed homogeneous). For distance and density in km and g/cm³ and gravity in mGal (used here), the numerical value of Γ is 6.672.

The inverse problem that we will solve is the following. Given m gravity values G_i corresponding to points y_i along the y axis, find out the values of a and z of the sphere whose gravity best fits the G_i. It will be assumed that D and y_o are known. Clearly, this problem is
nonlinear in both a and z, which play the role of the parameters x_1 and x_2. In practice, the G_i should be observed values, but for the purposes of this tutorial they will be synthetic data generated using equation 23 with the following values: z = 7, a = 5, y_o = 0, all in km; D = 0.25 g/cm³, m = 20, y_1 = −10 km, and y_{i+1} − y_i = 1 km. To stop the iterative process, the condition that the adjustments δ_1 and δ_2 become smaller than or equal to 1 × 10^{-5} km was assumed.

In this example, the estimated variance of the residuals, given by

σ²(z,a) = [1/(m − 2)] Σ_{i=1}^m [G_i − g(y_i, z, a)]²;  2 ≤ z ≤ 12, 2 ≤ a ≤ 10,  (24)

plays the role of the merit function s to be minimized. The 2 in the denominator is introduced to make σ² an unbiased estimate (Jenkins and Watts, 1968). Clearly, a cannot be larger than z when the y_i are assumed to be at the same elevation, but for the analysis that follows we will be concerned with the mathematical, not the physical, aspects of the problem. There are two reasons for the use of σ². One is its statistical significance and the other is that σ² is a normalized form of s, which allows a comparison of results obtained for different numbers of observations or for different models. The following results, however, are shown in terms of the standard deviation σ, which has the same units as g (i.e., mGal).

Representative contour lines of σ(z,a) are shown in Figure 1. Note that the shapes of the contours are highly variable. For values of σ less than about eight they are close to highly elongated ellipses (closed or open), although the other contours are mostly straight or slightly curved with changing slopes. This fact must be kept in mind because solving the inverse problem is equivalent to finding a path in the (z,a) plane that connects an initial point (z_o,a_o) and the point (z_M,a_M) that minimizes σ (and thus, σ²). In our example, (z_M,a_M) = (7,5), and at this point σ = 0. It may happen, however, that because of the complexity of σ, no path can be found, in which case the inverse problem has not been solved. To investigate this question the initial points labeled A, B, C, and D in Figure 1 were used. Some of these initial values are too far from the true values (see Figure 2), but they were chosen for demonstration purposes, not as reasonable initial estimates for this problem. In addition, the corresponding results can be useful for cases where the function to be minimized is not equal to zero at its minimum, and there is no easy way to assess whether the initial estimates are reasonably close to the optimal values.

The results of the inversion are summarized in Table 1 and the paths followed by the intermediate pairs (z_p,a_p) (p = iteration number) are shown in Figure 1. For the initial point D there was no convergence, but for the other three points, the minimum was reached in five iterations (points B and C) or 10 iterations (point A). These results are interesting for several reasons. First, convergence can be achieved even when the assumptions behind the linearization of the problem are completely violated. Second, convergence is not always achieved. Third, whether an initial point leads to convergence or not is not directly related to its distance to the point that minimizes σ. Finally, inspection of the inversion paths does not give any clue as to the path corresponding to any other initial point within the range of Figure 1. These facts are typical of nonlinear problems, and the other methods discussed below have been designed to address some of them.

Figure 1. Contour lines (cyan curves) of the function σ (see equation 24) and paths followed by the points (z_i,a_i) (indicated by circles), where i is iteration number, for the Gauss-Newton inversion method (equation 17). The numbers next to the contours indicate the value of σ (in mGal). The contours between the numbered ones are equispaced. The points labeled A, B, C, and D are initial points for the inversion. Figure 2 shows the corresponding gravity values. For D the method did not converge. The value of σ for this point is 11.5. See Table 1 for additional inversion results. The large + is centered at (7,5), which is at the minimum of σ (= 0).

Figure 2. Gravity values computed using equation 23 for several (z,a) pairs, listed in the upper-right corner of the figure (T: (7,5); A: (2,10); B: (10,10); C: (2,2); D: (10,2)). Each pair is identified by a different symbol and by a letter. The gravity values identified by a T are the true values, while the others correspond to the initial values used for the inversion of the true values. The gravity scale is logarithmic.
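The Gauss-Newton iteration of Example 1a (equations 17 and 18 applied to the sphere model of equation 23) can be sketched in pure Python. This is a minimal sketch of the synthetic setup described above (true z = 7 km, a = 5 km, m = 20); the function names are ours, and the Cramer's-rule solve is adequate only because there are two parameters.

```python
import math

GAMMA, D, Y0 = 6.672, 0.25, 0.0   # Gamma in km-g/cm^3-mGal units, density contrast, sphere x-position

def g(y, z, a):
    """Vertical gravity (mGal) of a buried sphere, equation 23."""
    return (4.0 / 3.0) * GAMMA * D * a ** 3 * z / ((y - Y0) ** 2 + z ** 2) ** 1.5

def jac_row(y, z, a):
    """Partial derivatives of g with respect to z and a (equation 9)."""
    r2 = (y - Y0) ** 2 + z ** 2
    dgdz = (4.0 / 3.0) * GAMMA * D * a ** 3 * (r2 ** -1.5 - 3.0 * z ** 2 * r2 ** -2.5)
    dgda = 3.0 * g(y, z, a) / a   # g is proportional to a^3
    return dgdz, dgda

def gauss_newton(ys, obs, z, a, tol=1e-5, max_iter=50):
    """Iterate equation 17, (A^T A) delta = A^T c, with c_i = o_i - f_i (equation 18)."""
    for _ in range(max_iter):
        azz = aza = aaa = bz = ba = 0.0
        for y, o in zip(ys, obs):          # accumulate the 2x2 normal equations
            jz, ja = jac_row(y, z, a)
            c = o - g(y, z, a)
            azz += jz * jz; aza += jz * ja; aaa += ja * ja
            bz += jz * c; ba += ja * c
        det = azz * aaa - aza * aza
        dz = (bz * aaa - aza * ba) / det   # Cramer's rule (fine for two parameters)
        da = (azz * ba - aza * bz) / det
        z, a = z + dz, a + da
        if abs(dz) <= tol and abs(da) <= tol:
            break
    return z, a

ys = [-10.0 + i for i in range(20)]                 # y_1 = -10 km, 1-km spacing, m = 20
obs = [g(y, 7.0, 5.0) for y in ys]                  # synthetic data: true z = 7 km, a = 5 km
z_est, a_est = gauss_newton(ys, obs, 10.0, 10.0)    # initial point B of Figure 1
```

Starting from point B, (z_o,a_o) = (10,10), the iteration recovers the true (7,5) in a few steps, in line with the fast convergence described above.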
Table 1. Gravity inversion results obtained using different methods and values of the initial parameter.

THE GRADIENT AND THE METHOD OF STEEPEST DESCENT

To motivate the following discussion, let us consider the gradient of s, to be indicated with ∇s, which is the column vector with components

(∇s)_j = ∂s/∂x_j = −2 Σ_{i=1}^m [o_i − f_i(x)] ∂f_i(x)/∂x_j,  (25)

where equations 4 and 5 were used. Now let x = x_o. Then

(∇s)_j = −2 Σ_{i=1}^m c_i ∂f_i/∂x_j = −2(A^T c)_j,  (26)

where equations 9 and 11 were used. Writing in matrix form, we have

∇s = −2A^T c.  (27)

Now consider equation 1 with δ and c the vectors that appear in equation 15 and let λ go to infinity. In this case, the first term on the left side of the equation becomes negligible and the solution becomes

δ_g ≈ (1/λ) A^T c = −[1/(2λ)] ∇s → 0;  λ → ∞.  (28)

Therefore, in this limiting case the damped least-squares solution δ_g is in the direction of minus the gradient. This fact is emphasized by use of the subscript g. In addition, |δ_g| goes to zero. The direction −∇s is the basis of the steepest-descent method of minimization, which is one of the oldest methods used.

A heuristic introduction of the steepest-descent method is as follows. Using the notation introduced above, we are interested in an iterative approach such that

s^(p+1) < s^(p).  (29)

To achieve this goal, we will use the fact that ∇s points in the direction of steepest ascent. This is a well-known result from calculus (e.g., Apostol, 1969) and will be proved below in a more general context. Therefore, the initial estimate for the (p+1)th iteration will be computed using

x^(p+1) = x^(p) − [1/μ^(p)] ∇s(x^(p)),  (30)

where μ^(p) is a scalar that assures that equation 29 is satisfied. The question is how to choose the value of μ^(p). A general discussion of this question is presented below, but for the time being, we note that the gradient of a function is a local property, which means that, in general, the direction of the steepest descent will change as a function of position. Therefore, if μ^(p) is not selected carefully, it may happen that the value of s is not reduced, as desired. For this reason, a number of strategies for the selection of μ^(p) have been designed (e.g., Beveridge and Schechter, 1970; Dorny, 1975), but for the example considered next we will use a very simple approach, based on the use of equation 30 with a large constant value of μ^(p), say μ, which will assure a small step in the steepest direction. In this way, we will be able to see the steepest-descent path clearly, which will be used for
comparison with paths corresponding to the Gauss-Newton method and the Levenberg-Marquardt solutions obtained for different choices of λ. This example will also show the problems that affect the method of steepest descent, which are removed when the Levenberg-Marquardt method is used.

Example 1b

The merit function is the σ² introduced in equation 24. For the initial points B, C, and D the same value of μ was used, while for point A two other values were used (see Figure 3). Let us consider the most salient aspects of this example. For three of the initial points (B, C, D) the corresponding endpoints are very close to, although not exactly at, (z_M,a_M) (see Table 1), but the number of iterations is extremely large (>7000). Recall that with the Gauss-Newton method, convergence for points B and C is achieved in five iterations. Note that for the three points the paths have sharp bends, after which the paths follow the major axes of the roughly elliptical contours. These bends occur when the paths become approximately tangent to the contours, which is a general property of the method (see the discussion following equation 48 and Figure 4). For point A, the results are different. First, the value of μ used for the other three points did not lead to convergence. Second, when using the larger value of μ in Table 1, the path reaches a point close to where it should bend, but this bending does not occur even after a very large number of iterations (about 29,000). When a somewhat smaller value of μ is used, the path reaches a point close to the minimum with a much smaller number of iterations (about 7000), but the path between (x_o,y_o) and (x_1,y_1) is clearly different from the steepest-descent path.

Figure 3. Similar to Figure 1, showing the paths corresponding to the steepest-descent method (equation 30 with μ^(p) = μ = constant) for the values of μ given at the top of the figure. For point A the dot-dash path was far from the minimum value (see Table 1). The large × on each path corresponds to (z_1,a_1). The two contours corresponding to σ < 3 are different from those in Figure 1, and were drawn to show their relations to the bend in the paths for points B and D (see Figure 4 and corresponding text for further details).

Figure 4. Elliptical contour lines corresponding to a 2D quadratic merit function (given by equation 55 with x_1 a point of minimum). The contour spacing is not uniform. The points corresponding to the centers of the small circles are identified by the letters next to them. The solid and dashed lines are tangent to the contours at points B and C. The segments AB and BC are in the directions of the gradient at A and B. The positions of points B and C were determined using equations 47, 54, and 56. The two segments are perpendicular to each other. The two pairs of closely spaced contours were drawn to show that if the segments AB and BC extend beyond points B and C, the value of the quadratic function becomes larger.

The previous example illustrates the well-known slow convergence of the steepest-descent method, which makes it computationally inefficient, particularly when compared to the Levenberg-Marquardt method. On the other hand, a study of some of the properties of the gradient and the steepest-descent method is very fruitful because it sheds light on certain questions that arise when solving inverse problems that involve parameters with different dimensions. This type of problem is not uncommon. In seismology, for example, the parameters may be time, position, velocity, and density, among others. If the dimensions of two or more of the parameters are different, a question that arises is how to define distance in the parameter space. When all the parameters have the same dimensions and are measured in the same units, the gradient of a function s(x) gives the direction along which s has the largest rate of change. In other words, for a given Δx, |s(x + Δx) − s(x)|/|Δx| is largest when Δx is in the direction of ∇s computed at x. In this case, it is meaningful to speak of the direction of steepest ascent and to equate it to the gradient direction. For any other case, however, a distance in parameter space must be defined. Once this is done, the direction of steepest ascent is well defined, as we now show. The following results originate with Crockett and Chernoff (1955). Although in this paper we are interested in the steepest-descent direction, here we consider its opposite direction (corresponding to the steepest ascent) to avoid introducing an inconvenient minus sign.

A general definition of distance d between two points (represented by vectors α and β) is

d = [Σ_{i,j} (β_i − α_i) b_ij (β_j − α_j)]^{1/2} ≡ [(β − α)^T B(β − α)]^{1/2},  (31)

where B is a positive definite symmetric matrix (see Appendix A). With this condition on B, d is always a nonnegative real number, with d = 0 only if α = β. The definition of distance is known as the metric of the space of points under consideration. If B = I, d is the
usual Euclidean distance. Given a point with coordinates x_o, the points x at a distance d from it are on the ellipsoid

(x − x_o)^T B(x − x_o) = d².  (32)

Given a function s, the direction of steepest ascent in the d neighborhood of x_o is defined as the direction from x_o to the point on the ellipsoid at which s is a maximum. To find this direction, let d go to zero, which means that δ goes to the zero vector, and linearize ∇s(x_o + δ) in the vicinity of x_o. Writing in component form we have

(∂s/∂x_i)(x_o + δ) ≈ (∂s/∂x_i)(x_o) + Σ_j [∂²s/∂x_i∂x_j (x_o)] δ_j.  (42)

In vector form, this equation becomes

∇s(x_o + δ) ≈ ∇s(x_o) + Hδ,  (43)

where H is the Hessian of s. Therefore,

∇s(x_o + δ) = ∇s(x_o);  d → 0  (44)

and, from equation 41,

With F(α) denoting s along the line x_o + αu for a direction u (equation 48), a second-order expansion gives

F ≈ F_o + α (dF/dα)|_o + (α²/2)(d²F/dα²)|_o,  (49)

where the subscript o indicates evaluation at x_o. Then, expanding dF/dα to first order about x_o and setting it equal to zero at the point of tangency we obtain
dF/dα ≈ (dF/dα)|_o + α (d²F/dα²)|_o = 0.  (50)

If F were a quadratic function, these relations would be exact. Now, differentiating equation 48 with respect to α gives

dF/dα = Σ_i (∂s/∂x_i) u_i = u^T ∇s  (51)

and

d²F/dα² = Σ_{i,j} (∂²s/∂x_i∂x_j) u_i u_j = u^T Hu,  (52)

where H is the Hessian matrix. Introducing these two expressions in equation 50 and solving for α gives

α ≈ −(dF/dα)|_o / (d²F/dα²)|_o = −(u^T ∇s / u^T Hu)|_{x=x_o}.  (53)

If u = −∇s, this expression becomes

α ≈ (∇s^T ∇s / ∇s^T H∇s)|_{x=x_o}.  (54)

This expression is exact when s is a quadratic function. For example, s may be of the form

s = (x − x_1)^T P(x − x_1),  (55)

where x_1 is a constant vector and P is a symmetric matrix. In this case,

∇s = 2P(x − x_1);  H = 2P.  (56)

If x_1 minimizes s, ∇s(x_1) = 0 and P is positive definite (e.g., Apostol, 1969).

For the second part of the proof we will consider the difference ΔF between F and F_o, which is determined from equations 49 and 51–53:

ΔF = F − F_o = −(1/2) [(u^T ∇s)² / (u^T Hu)]|_{x=x_o}.  (57)

γ = ΔF_B/ΔF_N = (∇s^T B^{-1}∇s)² / [(∇s^T B^{-1}HB^{-1}∇s)(∇s^T H^{-1}∇s)],  (60)

but before proceeding we will introduce the following vector

p = B^{-1/2}∇s,  (61)

so that

∇s = B^{1/2}p.  (62)

Because B is positive definite and symmetric, so is B^{-1/2} (see Appendix A). Using these two equations, γ becomes

γ = (p^T p)² / [(p^T Mp)(p^T M^{-1}p)],  (63)

where

M = B^{-1/2} HB^{-1/2}.  (64)

Matrix M is positive definite (see Appendix A). An upper bound to γ can be established by using the following generalization of Schwarz's inequality

(a^T b)² ≤ (a^T Ca)(b^T C^{-1}b)  (65)

(see Appendix A), where C is a positive definite matrix. Application of this expression to γ gives

γ ≤ 1.  (66)

Now we will apply the Kantorovich inequality (Luenberger, 1973) to the right side of equation 63, which immediately gives

γ ≥ 4λ_1λ_n/(λ_1 + λ_n)² = 4(λ_1/λ_n)/(1 + λ_1/λ_n)² = 4κ/(1 + κ)²,  (67)

where λ_1 and λ_n are the largest and smallest eigenvalues of M and

κ = λ_1/λ_n.  (68)

In summary,

4κ/(1 + κ)² ≤ γ ≤ 1.  (69)
d_M = [(x − μ)^T V^{-1}(x − μ)]^{1/2}.  (70)

A good qualitative justification for this definition can be found in Krzanowski (1988), who also notes the relation between this distance and the maximum likelihood function.

Finally, it is worth noting that the choice B = H_S (see equation 16) leads to the Gauss-Newton method. In fact, using equations 46, 16, 27, and 15 gives

δ̂ = −(A^T A)^{-1}A^T c = −δ  (71)

(provided that the inverse exists). Now using equation 34 with −δ̂ instead of δ (the − sign being used to specialize to the steepest-descent case) and then using equation 71 we have

The Levenberg approach

S̄(x) = wS(x) + Q(x),  (73)

where

Q(x) = d_1δ_1² + … + d_nδ_n² = δ^T Dδ,  (74)

w and the d_i are positive weighting factors independent of x, and D is a diagonal matrix with elements (D)_ii = d_i. A comparison of equations 74 and 31 shows that Levenberg's method implicitly introduces a non-Euclidean norm in the parameter space. Moreover, the results of the analysis below are valid when D is a symmetric positive definite matrix.

Let us establish two important results concerning S̄, S, and Q. Let x_w be the value of x that minimizes S̄ for a given value of w, i.e.,

(see equation 8), we can write

wS(x_w) < wS(x_w) + Q(x_w) = S̄(x_w) < S̄(x_o) = wS(x_o) + Q(x_o) = wS(x_o),  (77)

so that

S(x_w) < S(x_o),  (78)

which means that the minimization of S̄ will lead to a decrease in S. Now, letting x_∞ denote the ordinary least-squares solution (the reason for this notation is explained below), we have

wS(x_w) + Q(x_w) = S̄(x_w) < S̄(x_∞) = wS(x_∞) + Q(x_∞),

so that the only difference from the ordinary least-squares solution is the addition of a diagonal matrix to A^T A. Because the inverse of the matrix in parentheses always exists for w < ∞ (see Appendix A), equation 82 has a solution even when (A^T A)^{-1} does not exist and the Gauss-Newton method is not applicable. Also note that for w = ∞ the second term on the left side of equation 82 vanishes and we get the ordinary Gauss-Newton solution (provided it exists). This is why we introduced the x_∞ used in equations 79 and 80. On the other hand, if w goes to zero, 1/w goes to infinity and the first term on the left becomes negligible, which means that we can write

(1/w) Dδ_g ≈ A^T c;  w → 0.  (83)
In addition, because the diagonal elements of D are nonzero, its in- dxw /dw is a vector tangent to xw 共e.g., Spiegel, 1959兲. Furthermore,
verse always exists and we can write

    δg ≈ wD⁻¹Aᵀc = −(w/2)D⁻¹∇s → 0;  w → 0  (84)

(see equation 27). This result is also valid when D is symmetric and positive definite (so that its inverse exists, see Appendix A), in which case it agrees with equation 46. The difference in the signs of δ and δ̂ is due to the fact that they are in the directions of steepest descent and ascent, respectively.

So far, we have concentrated on S and S̄, but as we will see next, we can derive several important results concerning s, which is the quantity that is of most interest to us. In the following we will focus on the case of w going to zero, which means that we can use equation 84. Then, letting

    δg = x_w − x_o,  (85)

we find that

    dx_w/dw = dδg/dw = D⁻¹Aᵀc;  w → 0.  (86)

Furthermore,

    ds(x_w)/dw = Σ_{j=1}^{n} (∂s/∂x_j)|_{x=x_w} dx_j/dw = (∇s)ᵀ dx_w/dw.  (87)

Because of equations 84 and 85, x_w ≈ x_o. Then, introducing equations 86 and 27 into equation 87 and operating gives

    ds(x_w)/dw |_{w=0} = −2(Aᵀc)ᵀ D⁻¹Aᵀc = −2(D^{−1/2}Aᵀc)ᵀ(D^{−1/2}Aᵀc) = −2|D^{−1/2}Aᵀc|² < 0  (88)

(see also equation 44). The inequality arises because of the assumption that x_o is not a stationary point of s, which means that the partial derivatives cannot all be equal to zero. Therefore, because s(x_w) is decreasing at w = 0, there are (positive) values of w that will reduce the value of s. In principle, the value of w that minimizes s could be determined by setting ds/dw equal to zero, but because of the complexity of this equation in practical cases, Levenberg proposed to write s(x_w) in terms of its linearized Taylor expansion

    s(x_w) ≈ s(x_o) + w (ds/dw)|_{w=0}.  (89)

Because the product on the right side of equation 87 is the matrix form of the scalar product, we can write

    ds(x_w)/dw = |∇s| |dx_w/dw| cos θ,  (91)

where θ is the angle between the two vectors. The minimum value of the derivative is attained for θ = π. Introducing this value of θ as well as equations 27, 86, and 88 into equation 91, we obtain

    2|D^{−1/2}Aᵀc|² = 2|Aᵀc| |D⁻¹Aᵀc|.  (92)

This equation is satisfied when D = dI, with d equal to a constant, which results in a factor of d⁻¹ on both sides of the equation. Taking d = 1 and letting λ = 1/w, we find that equation 82 becomes the well-known equation

    (AᵀA + λI)δ = Aᵀc;  λ = 1/w.  (93)

The second approach proposed by Levenberg is to choose

    d_i = (AᵀA)_{ii},  (94)

in which case the matrix in parentheses in equation 82 becomes the matrix AᵀA with its diagonal elements multiplied by 1 + λ. Levenberg did not give a motivation for this choice, but it is directly related to the scaling introduced by Marquardt.

The Marquardt approach

Marquardt approaches the problem from a point of view different from that of Levenberg. His starting point is the following series of results. Unless otherwise noted, the notation used here is that introduced earlier, except for the fact that S will be assumed to be a function of δ, as indicated by the right side of equation 12.

Let λ ≥ 0 be arbitrary (unrelated to the w above) and let δ_o satisfy

    (AᵀA + λI)δ_o = Aᵀc.  (95)

Then δ_o minimizes S on the sphere whose radius |δ| satisfies

    |δ|² = |δ_o|².  (96)

This result was proved using the method of Lagrange multipliers, which requires minimizing the function

    u(δ, λ) = S + λ(|δ|² − |δ_o|²)  (97)

with respect to δ and λ, where λ is a Lagrange multiplier. This re-
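As a numerical aside (not part of the original papers), the damped least-squares equation — equation 93, which also appears as Marquardt's equation 95 — is straightforward to implement. The following NumPy sketch uses an invented function name and made-up test data:

```python
import numpy as np

def lm_step(A, c, lam):
    """One damped least-squares step (equation 93):
    (A^T A + lam I) delta = A^T c.
    A is the Jacobian at the current solution, c the residual vector,
    and lam the damping parameter.  As lam -> 0 the step approaches the
    Gauss-Newton step; as lam grows the step shrinks and rotates toward
    the steepest-descent direction A^T c."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ c)

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
c = np.array([1.0, -1.0, 2.0])

# For a very large damping, lam * delta approaches A^T c, i.e., the
# step is a short one in the direction of minus the gradient of s.
delta = lm_step(A, c, 1e8)
assert np.allclose(1e8 * delta, A.T @ c, rtol=1e-4)
```

Solving the small n × n system with `np.linalg.solve` mirrors equation 93 directly; for ill-conditioned problems an orthogonal factorization of an augmented system (discussed later in connection with equations 127 and 128) is usually preferred.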
which proves the result. For the sake of simplicity, the subscript in δ_o will be dropped.

The second result requires writing AᵀA in terms of its eigenvalue decomposition (e.g., Noble and Daniel, 1977), namely

    AᵀA = UΛUᵀ,  (101)

where U is the orthogonal matrix of eigenvectors and Λ is the diagonal matrix of eigenvalues λ_i. From this equation we see that |δ| is a decreasing function of λ. This result and the previous one are from Morrison (1960, unpublished, quoted by Marquardt).

Marquardt's final result concerns the angle γ between δ and Aᵀc, which is proportional to −∇s (see equation 27). Using equations 103, 104, and 101, we can write

    cos γ = δᵀAᵀc / (|δ| |Aᵀc|) = uᵀ(Λ + λI)⁻¹UᵀAᵀc / (|δ| |Aᵀc|),

with u = UᵀAᵀc.

Marquardt's algorithm applies to a scaled version of the problem. However, because this scaling is not essential (and is not always used; e.g., Gill et al., 1981), the basis of the algorithm is described first. At the pth iteration the following equation is solved for δ(p):

    (AᵀA + λ_p I)δ(p) = Aᵀc.  (108)

In the scaled version, the matrix and vector entering this equation are [AᵀA]* = S(AᵀA)S and [Aᵀc]* = S(Aᵀc), where S is the diagonal matrix with elements

    s_ii = 1/√((AᵀA)_ii).  (113)

Note that the diagonal elements of [AᵀA]* are all equal to one. In terms of this scaling, equation 15 becomes

    [AᵀA]*δ* = [Aᵀc]*.  (114)
Multiplying both sides of equation 119 by S⁻¹ on the left gives

    AᵀA(Sδ*) = Aᵀc,  (120)

so that

    δ = Sδ*.  (121)

This solution can be derived as follows. Equations 124 and 125 will be written as a single equation involving partitioned matrices, namely

    Bδ = u,  (127)

where

    B = [ A     ]        u = [ c ]
        [ μ_p I ],           [ 0 ].  (128)

The choice of λ_o depends on the problem being solved and on the initial values given to the parameters to be determined. For points A, B, and C, equation 110 was always satisfied and the choice of λ_o was not critical (recall that the Gauss-Newton method converged for these points). For point D, the situation is different (see below). For other inverse problems, the best approach to the selection of λ_o is to invert synthetic data that resemble the actual data as closely as possible.

Figure 6 thus provides a numerical confirmation of the equivalence of this choice of D and the Marquardt scaling proved analytically.

In summary, for this particular example, the unscaled and scaled versions of the Levenberg-Marquardt method perform similarly. It may be argued that the scaled version makes it easier to choose the value of λ_o, which can be taken close to one, but as point D showed, this value may not lead to a smaller number of iterations. Obviously, this is not a major concern in our case, but it may be so when solving inverse problems involving large numbers of parameters. Also note that the results for points C and D in Figure 6 show that it is difficult to make general statements regarding the convergence paths for linearized nonlinear problems, even for a relatively simple 2D case. Again, convergence to a solution may become more of an issue as the number of inversion parameters increases. In particular, the function s may have local minima in addition to an absolute minimum, in which case the inversion results may depend on the initial solution and on the selection of λ_o. These facts must be borne in mind by those beginning their work in nonlinear inverse problems. Each problem will have features that make it different from other problems and, as noted above, the best way to investigate it is through the inversion of realistic synthetic data (i.e., the model is realistic). In addition, because actual data are always affected by measurement or observational errors, representative errors should be added to the synthetic data.

[Figure 5: inversion paths in the depth (km) vs. radius (km) plane for starting points A, B, C, and D, with initial damping values λ_o = 1×10⁶, 1×10⁴, and 1×10².]
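The partitioned system of equations 127 and 128 can be checked numerically. The sketch below (helper name and test data invented here) forms B and u and solves Bδ ≈ u in the least-squares sense; because the normal equations BᵀBδ = Bᵀu reduce to (AᵀA + μ_p²I)δ = Aᵀc, the damping obtained this way is μ_p²:

```python
import numpy as np

def damped_step_via_lstsq(A, c, mu):
    """Damped least-squares step via the partitioned matrices of
    equations 127-128: B = [[A], [mu I]], u = [c, 0].
    The normal equations of B delta ~= u are
    (A^T A + mu^2 I) delta = A^T c,
    but solving the augmented system (np.linalg.lstsq uses an SVD-based
    solver) avoids forming A^T A explicitly."""
    n = A.shape[1]
    B = np.vstack([A, mu * np.eye(n)])
    u = np.concatenate([c, np.zeros(n)])
    delta, *_ = np.linalg.lstsq(B, u, rcond=None)
    return delta

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
c = np.array([1.0, -1.0, 2.0])
mu = 0.5

# Agreement with the damped normal equations, lambda = mu**2.
direct = np.linalg.solve(A.T @ A + mu**2 * np.eye(2), A.T @ c)
assert np.allclose(damped_step_via_lstsq(A, c, mu), direct)
```

Working with the augmented matrix B rather than AᵀA roughly halves the condition number of the problem actually factorized, which is why this formulation is common in modern implementations.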
Figure 5. Similar to Figure 1, showing the paths corresponding to the unscaled Levenberg-Marquardt method (equation 108) using the initial values of λ (i.e., λ_o) given at the top of the figure (circles). For a comparison, some of the steepest-descent paths in Figure 3 are also shown here (black lines).

HISTORICAL NOTE

In spite of its importance, Levenberg's (1944) paper went largely unnoticed until it was referred to in Marquardt's (1963) paper. When Levenberg published his paper he was working at the Engineering Research Section of the U.S. Army Frankford Arsenal (Philadelphia), and according to Curry (1944) the engineers there preferred Levenberg's method over the steepest-descent method. Interestingly, the Frankford Arsenal supported the work of Rosen and Eldert (1954) on lens design using least squares, but they did not use Levenberg's method. The computerized design of lenses was an area of research with military and civilian applications, with early results summarized by Feder (1963). Regarding Levenberg's paper, Feder notes that it had come to his attention in 1956 and that other people had rediscovered the damped least-squares method, although some of the work was supported by the military and could not be made public until several years later because of its classified nature.

One of the rediscoverers of the damped least-squares method was Wynne (1959), who notes that the problems affecting the ordinary least-squares method when the initial solutions did not lead to an ap-

[Figure 6: inversion paths for λ_o = 1 and λ_o = 20, plotted as in Figure 5.]
where Λ^{1/2} is a diagonal matrix with its ith element given by λ_i^{1/2}. Then,

    C² = UΛ^{1/2}UᵀUΛ^{1/2}Uᵀ = UΛUᵀ = B,  (A-9)

where equation 101 has been used. The matrix C is known as the square root of B, and is indicated by B^{1/2}. This matrix is unique (for a proof see, e.g., Harville, 1997; Zhang, 1999).

If B is positive definite,

    B^{−1/2} = C⁻¹ = UΛ^{−1/2}Uᵀ,  (A-10)

where equation 101 has been used. This matrix is also symmetric and positive definite.

(3) In its standard form, the Schwarz inequality states that

    (aᵀb)² ≤ (aᵀa)(bᵀb)  (A-11)

for any vectors a and b (e.g., Arfken, 1985).

Let C be a symmetric positive definite matrix. In equation A-11, replace a and b by C^{1/2}a and C^{−1/2}b, respectively. Because C^{1/2} is symmetric, this immediately gives equation 65 (e.g., Zhang, 1999).

(4) The matrix M in equation 64 is symmetric (because so are the matrices on the right) and positive definite. To show that, let y be an arbitrary nonzero vector. Then

    yᵀMy = yᵀB^{−1/2}HB^{−1/2}y = (B^{−1/2}y)ᵀH(B^{−1/2}y) > 0,  (A-12)

because H is assumed to be positive definite. The vector in parentheses on the right-hand side of this equation is arbitrary.

    cos γ = [Σ_i u_i²/(λ_i + λ)] / (|Aᵀc| [Σ_i u_i²/(λ_i + λ)²]^{1/2}).  (B-1)

The derivative of cos γ with respect to λ is given by

    d(cos γ)/dλ = (1/C) { [Σ_i u_i²/(λ_i + λ)] [Σ_i u_i²/(λ_i + λ)³] − [Σ_i u_i²/(λ_i + λ)²]² },  (B-2)

where

    C = [Σ_i u_i²/(λ_i + λ)²]^{3/2} |Aᵀc|.  (B-3)

The factor 1/C is positive. To find the sign of the factor in braces in equation B-2 we must perform all the operations indicated. The resulting expression is

    { [Σ_i u_i²Π_{1i}] [Σ_i u_i²Π_{3i}] − [Σ_i u_i²Π_{2i}]² } / [Π_i (λ_i + λ)²]²,  (B-4)

where

    Π_{ni} = Π_{i′≠i} (λ_{i′} + λ)ⁿ;  n = 1, 2, 3.  (B-5)

To show how this result is derived it will be assumed that the number of terms in the sums is three; the extension to any other number is straightforward. Let

    a_i = λ_i + λ.  (B-6)

Then, the fractions within braces in equation B-2 can be rewritten as follows:

    u_1²/a_{n1} + u_2²/a_{n2} + u_3²/a_{n3}
      = (u_1²a_{n2}a_{n3} + u_2²a_{n1}a_{n3} + u_3²a_{n1}a_{n2}) / (a_{n1}a_{n2}a_{n3})
      = (u_1²Π_{n1} + u_2²Π_{n2} + u_3²Π_{n3}) / (a_{n1}a_{n2}a_{n3})
      = Σ_i u_i²Π_{ni} / Π_i a_{ni},  (B-7)

where a_{ni} = a_iⁿ. In addition,

    Π_{1i}Π_{3i} = (Π_{2i})².  (B-9)

This allows writing the numerator of expression B-4 as

    [Σ_i (u_iΠ_{1i}^{1/2})²] [Σ_i (u_iΠ_{3i}^{1/2})²] − [Σ_i (u_iΠ_{1i}^{1/2})(u_iΠ_{3i}^{1/2})]²,  (B-10)

which is always positive because of the Schwarz inequality (see equation A-11). To apply it to equation B-10, let a and b be vectors with components u_iΠ_{1i}^{1/2} and u_iΠ_{3i}^{1/2}, respectively. This result shows that

    d(cos γ)/dλ > 0.  (B-11)
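The square-root construction of equations A-9 and A-10 is easy to verify numerically; the sketch below (function name and test matrix invented here) builds B^{1/2} = UΛ^{1/2}Uᵀ from the eigenvalue decomposition:

```python
import numpy as np

def sym_sqrt(B):
    """Symmetric square root of a symmetric positive semidefinite
    matrix via its eigenvalue decomposition B = U Lambda U^T
    (equation A-9): B^(1/2) = U Lambda^(1/2) U^T."""
    lam, U = np.linalg.eigh(B)
    lam = np.maximum(lam, 0.0)  # guard against tiny negative rounding
    return U @ np.diag(np.sqrt(lam)) @ U.T

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
B = X.T @ X  # symmetric and (generically) positive definite

C = sym_sqrt(B)
assert np.allclose(C @ C, B)  # C is a square root of B (A-9)
assert np.allclose(C, C.T)    # and it is symmetric

# For positive definite B, B^(-1/2) is the inverse of C (A-10).
assert np.allclose(np.linalg.inv(C), sym_sqrt(np.linalg.inv(B)))
```

The uniqueness of the positive semidefinite square root (quoted above from Harville, 1997, and Zhang, 1999) is what guarantees that the two sides of the last assertion agree.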
W16 Pujol

REFERENCES

Arfken, G., 1985, Mathematical methods for physicists: Academic Press Inc.
Apostol, T., 1969, Calculus, vol. II: Blaisdell Publishing Company.
Beveridge, G., and R. Schechter, 1970, Optimization: Theory and practice: McGraw-Hill Book Company.
Crockett, J., and H. Chernoff, 1955, Gradient methods of maximization: Pacific Journal of Mathematics, 5, 33–50.
Curry, H., 1944, The method of steepest descent for non-linear minimization problems: Quarterly of Applied Mathematics, 2, 258–261.
Davies, M., and I. Whitting, 1972, A modified form of Levenberg's correction, in F. Lootsma, ed., Numerical methods for non-linear optimization: Academic Press Inc., 191–201.
Davis, P., 1993, Levenberg-Marquart [sic] methods and nonlinear estimation: SIAM News, 26, no. 2.
Dobrin, M., 1976, Introduction to geophysical prospecting: McGraw-Hill Book Co.
Dorny, C., 1975, A vector space approach to models and optimization: John Wiley & Sons.
Draper, N., and H. Smith, 1981, Applied regression analysis: John Wiley & Sons.
Feder, D., 1963, Automatic optical design: Applied Optics, 2, 1209–1226.
Gill, P., W. Murray, and M. Wright, 1981, Practical optimization: Academic Press Inc.
Girard, A., 1958, Calcul automatique en optique géométrique: Revue d'Optique, 37, 225–241.
Greenstadt, J., 1967, On the relative efficiencies of gradient methods: Mathematics of Computation, 21, 360–367.
Hartley, H., 1961, The modified Gauss-Newton method for the fitting of non-linear regression functions by least squares: Technometrics, 3, 269–280.
Harville, D., 1997, Matrix algebra from a statistician's perspective: Springer Pub. Co., Inc.
Jenkins, G., and D. Watts, 1968, Spectral analysis: Holden-Day.
Krzanowski, W., 1988, Principles of multivariate analysis: Oxford University Press.
Levenberg, K., 1944, A method for the solution of certain non-linear problems in least squares: Quarterly of Applied Mathematics, 2, 164–168.
Luenberger, D., 1973, Introduction to linear and nonlinear programming: Addison-Wesley Publishing Company.
Marquardt, D., 1963, An algorithm for least-squares estimation of nonlinear parameters: SIAM Journal, 11, 431–441.
Noble, B., and J. Daniel, 1977, Applied linear algebra: Prentice-Hall.
Nunn, M., and C. Wynne, 1959, Lens designing by electronic digital computer: II: Proceedings of the Physical Society, 74, 316–329.
Rosen, S., and C. Eldert, 1954, Least-squares method for optical correction: Journal of the Optical Society of America, 44, 250–252.
Seber, G., 1977, Linear regression analysis: John Wiley & Sons, Inc.
Seber, G., and C. Wild, 1989, Nonlinear regression: John Wiley & Sons, Inc.
Spiegel, M., 1959, Vector analysis: McGraw-Hill Book Co.
Wynne, C., 1959, Lens designing by electronic digital computer: I: Proceedings of the Physical Society (London), 73, 777–787.
Wynne, C., and P. Wormell, 1963, Lens design by computer: Applied Optics, 2, 1233–1238.
Zhang, F., 1999, Matrix theory: Springer Pub. Co., Inc.