
An Algorithm for Least-Squares Estimation of Nonlinear Parameters

Author(s): Donald W. Marquardt

Source: Journal of the Society for Industrial and Applied Mathematics, Vol. 11, No. 2 (Jun.,
1963), pp. 431-441
Published by: Society for Industrial and Applied Mathematics
Stable URL: .
Accessed: 15/07/2014 11:44

Society for Industrial and Applied Mathematics is collaborating with JSTOR to digitize, preserve and extend
access to Journal of the Society for Industrial and Applied Mathematics.

This content downloaded from on Tue, 15 Jul 2014 11:44:37 AM

All use subject to JSTOR Terms and Conditions
Vol. 11. No. 2, June, 1963
Printed in U.S.A.


Introduction. Most algorithms for the least-squares estimation of nonlinear
parameters have centered about either of two approaches. On the one hand, the
model may be expanded as a Taylor series and corrections to the several
parameters calculated at each iteration on the assumption of local linearity.
On the other hand, various modifications of the method of steepest-descent have
been used. Both methods not infrequently run aground, the Taylor series method
because of divergence of the successive iterates, the steepest-descent (or
gradient) methods because of agonizingly slow convergence after the first few
iterations.
In this paper a maximum neighborhood method is developed which, in effect,
performs an optimum interpolation between the Taylor series method and the
gradient method, the interpolation being based upon the maximum neighborhood in
which the truncated Taylor series gives an adequate representation of the
nonlinear model.
The results are extended to the problem of solving a set of nonlinear
algebraic equations.
Statement of problem. Let the model to be fitted to the data be
(1)  $E(y) = f(x_1, x_2, \ldots, x_m;\ \beta_1, \beta_2, \ldots, \beta_k) = f(x, \beta),$
where $x_1, x_2, \ldots, x_m$ are independent variables, $\beta_1, \beta_2, \ldots, \beta_k$
are the population values of $k$ parameters, and $E(y)$ is the expected value of
the dependent variable $y$. Let the data points be denoted by
(2)  $(Y_i, X_{1i}, X_{2i}, \ldots, X_{mi}), \quad i = 1, 2, \ldots, n.$
The problem is to compute those estimates of the parameters which will minimize
(3)  $\Phi = \sum_{i=1}^{n} [Y_i - \hat{Y}_i]^2 = \|Y - \hat{Y}\|^2,$
where $\hat{Y}_i$ is the value of $y$ predicted by (1) at the $i$th data point.
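As a concrete illustration of the criterion (3), the following minimal sketch evaluates $\Phi$ for a hypothetical two-parameter exponential model. The model, the data and the names `model` and `sum_of_squares` are assumptions made for this example only; they do not come from the paper.

```python
import math

def model(x, b):
    """Hypothetical model f(x; b) = b[0] * exp(-b[1] * x), used only for illustration."""
    return b[0] * math.exp(-b[1] * x)

def sum_of_squares(xs, ys, b):
    """Phi of equation (3): the sum of squared residuals Y_i - Yhat_i."""
    return sum((y - model(x, b)) ** 2 for x, y in zip(xs, ys))

# Synthetic, noise-free data generated from b = (2.0, 0.5).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [model(x, (2.0, 0.5)) for x in xs]

phi_true = sum_of_squares(xs, ys, (2.0, 0.5))    # zero, because the data are exact
phi_trial = sum_of_squares(xs, ys, (1.0, 1.0))   # positive at a poor trial value
```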
It is well known that when $f$ is linear in the $\beta$'s, the contours of constant
* Received by the editors June 13, 1962, and in revised form December 3, 1962.
† Engineering Department, E. I. du Pont de Nemours & Company, Inc.,
Wilmington 98, Delaware.


$\Phi$ are ellipsoids, while if $f$ is nonlinear, the contours are distorted,
according to the severity of the nonlinearity. Even with nonlinear models,
however, the contours are nearly elliptical in the immediate vicinity of the
minimum of $\Phi$. Typically the contour surface of $\Phi$ is greatly attenuated
in some directions and elongated in others so that the minimum lies at the
bottom of a long curving trough.
In this paper attention is confined to the algorithm for obtaining
least-squares estimates. For discussion of the broader statistical aspects of
nonlinear estimation and application to specific examples the reader is
referred to items [1], [2], [4], [5], [6] in the References. Other references
are given in the papers cited.
Methods in current use. The method based upon expanding $f$ in a Taylor series
(sometimes referred to as the Gauss method [1], [7], or the Gauss-Newton
method [4]) is as follows.
Writing the Taylor series through the linear terms
(4)  $\langle Y(X_i, b + \delta_t) \rangle = f(X_i, b) + \sum_{j=1}^{k} \left( \frac{\partial f_i}{\partial b_j} \right) (\delta_t)_j,$
or
(4a)  $\langle Y \rangle = f_0 + P\delta_t.$
In (4), $\beta$ is replaced notationally by $b$, the converged value of $b$ being
the least-squares estimate of $\beta$. The vector $\delta_t$ is a small correction
to $b$, with the subscript $t$ used to designate $\delta$ as calculated by this
Taylor series method. The brackets $\langle\ \rangle$ are used to distinguish
predictions based upon the linearized model from those based upon the actual
nonlinear model. Thus, the value of $\Phi$ predicted by (4) is
(5)  $\langle \Phi \rangle = \sum_{i=1}^{n} [Y_i - \langle Y_i \rangle]^2.$
Now, $\delta_t$ appears linearly in (4), and can therefore be found by the
standard least-squares method of setting $\partial \langle \Phi \rangle / \partial \delta_j = 0$,
for all $j$. Thus $\delta_t$ is found by solving
(6)  $A\delta_t = g,$
where (footnote 1)
(7)  $A^{[k \times k]} = P^T P,$

1 The superscript T denotes matrix transposition.


(8)  $P^{[n \times k]} = \left( \frac{\partial f_i}{\partial b_j} \right), \quad i = 1, 2, \ldots, n;\ j = 1, 2, \ldots, k,$
(9)  $g^{[k \times 1]} = \left( \sum_{i=1}^{n} (Y_i - f_i) \frac{\partial f_i}{\partial b_j} \right) = P^T (Y - f_0), \quad j = 1, 2, \ldots, k.$
In practice it is found helpful to correct $b$ by only a fraction of $\delta_t$;
otherwise the extrapolation may be beyond the region where $f$ can be adequately
represented by (4), and would cause divergence of the iterates. Various methods
have, therefore, been used [1], [4], [6] to determine an appropriate step size
$K\delta_t$, $0 < K \le 1$, once the direction has been specified by $\delta_t$.
Even so, failure to converge is not uncommon.
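The Taylor series iteration defined by (6) through (9) can be sketched for a two-parameter problem as follows. This is a minimal illustration under stated assumptions: the exponential model, the synthetic data, the starting point and the helper names (`jacobian_row`, `normal_equations`, `solve2`) are hypothetical, and the full step $K = 1$ is taken at every iteration, which works here only because the starting guess is close to the solution.

```python
import math

def model(x, b):
    # Hypothetical model f(x; b) = b[0] * exp(-b[1] * x).
    return b[0] * math.exp(-b[1] * x)

def jacobian_row(x, b):
    """Partial derivatives (df/db1, df/db2) at one data point: one row of P."""
    e = math.exp(-b[1] * x)
    return (e, -b[0] * x * e)

def normal_equations(xs, ys, b):
    """Return A = P^T P and g = P^T (Y - f0), as in (7) and (9), for k = 2."""
    A = [[0.0, 0.0], [0.0, 0.0]]
    g = [0.0, 0.0]
    for x, y in zip(xs, ys):
        p = jacobian_row(x, b)
        r = y - model(x, b)
        for j in range(2):
            g[j] += p[j] * r
            for l in range(2):
                A[j][l] += p[j] * p[l]
    return A, g

def solve2(A, g):
    """Solve the 2x2 system A d = g by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(g[0] * A[1][1] - A[0][1] * g[1]) / det,
            (A[0][0] * g[1] - A[1][0] * g[0]) / det]

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [model(x, (2.0, 0.5)) for x in xs]   # synthetic, noise-free data
b = [1.8, 0.6]                            # starting guess near the solution
for _ in range(10):
    A, g = normal_equations(xs, ys, b)
    delta_t = solve2(A, g)                # equation (6)
    b = [b[0] + delta_t[0], b[1] + delta_t[1]]
```

From a poorer starting guess the same loop may overshoot, which is the divergence the paragraph above describes.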
The gradient methods by contrast simply step off from the current trial value
in the direction of the negative gradient of $\Phi$. Thus
(9)  $\delta_g = -\left( \frac{\partial \Phi}{\partial b_1}, \frac{\partial \Phi}{\partial b_2}, \ldots, \frac{\partial \Phi}{\partial b_k} \right)^T.$
Various modified steepest-descent methods have been employed [5] to compensate
partially for the typically poor conditioning of the $\Phi$ surface which leads
to very slow convergence of the gradient methods. With these gradient methods,
as with the Taylor series methods, it is necessary to control the step size
carefully once the direction of the correction vector has been established.
Even so, slow convergence is the rule rather than the exception.
A number of convergence proofs, valid under various assumptions about the
properties of $f$, have been derived for the Taylor series method [4], [7].
Convergence proofs have also been derived for various gradient type methods;
e.g., [3]. However, mathematical proof of the monotonicity of $\Phi$ from
iteration to iteration, assuming a well behaved function $f$, and assuming
infinitely precise arithmetic processes, is at best a necessary condition for
convergence in practice. Such proofs, in themselves, give no adequate
indication of the rate of convergence. Among the class of theoretically
convergent methods one must seek to find methods which actually do converge
promptly using finite arithmetic, for the large majority of problems.
Qualitative analysis of the problem. In view of the inadequacies of the Taylor
series and gradient methods, it is well to review the undergirding principles
involved. First, any proper method must result in a correction vector whose
direction is within 90° of the negative gradient of $\Phi$. Otherwise the values
of $\Phi$ can be expected to be larger rather than smaller at points along the
correction vector. Second, because of the severe elongation


of the $\Phi$ surface in most problems, $\delta_t$ is usually almost 90° away
from $\delta_g$. (Indeed, we have monitored this angle, $\gamma$, for a variety of
problems and have found that $\gamma$ usually falls in the range
$80^\circ < \gamma < 90^\circ$.) From these considerations it would seem
reasonable that any improved method will in some sense interpolate between
$\delta_t$ and $\delta_g$.
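The near-orthogonality of the two correction vectors is easy to reproduce on a small example. The sketch below uses an assumed, nearly singular 2x2 matrix $A$ and an assumed vector $g$ (neither taken from the paper) and measures the angle between the Taylor series correction and the gradient direction.

```python
import math

def solve2(A, g):
    """Solve the 2x2 system A d = g by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(g[0] * A[1][1] - A[0][1] * g[1]) / det,
            (A[0][0] * g[1] - A[1][0] * g[0]) / det]

def angle_deg(u, v):
    """Angle between two vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(dot / norm))

# Illustrative values only: the off-diagonal 0.999 mimics two highly
# correlated parameters, and g lies mostly along the well-determined axis.
A = [[1.0, 0.999], [0.999, 1.0]]
g = [1.0, 0.9]

delta_t = solve2(A, g)          # Taylor series correction
gamma = angle_deg(delta_t, g)   # angle to the gradient direction (g up to scale)
```

For these assumed values the angle comes out at roughly 86 degrees, inside the range reported above.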
Implicit in both of the procedures outlined is the choice of direction of the
correction vector prior to the determination of an acceptable step size. In the
algorithm to be described, the direction and step size are determined
simultaneously.
Theoretical basis of algorithm. The theoretical basis of the algorithm is
contained in several theorems which follow. Theorems 1 and 2 are due to
Morrison [7]. The proof given here for Theorem 1 is different from that given
by Morrison. The present proof affords perhaps a greater insight into the
geometric relationships involved.
THEOREM 1. Let $\lambda \ge 0$ be arbitrary and let $\delta_0$ satisfy the equation
(10)  $(A + \lambda I)\delta_0 = g.$
Then $\delta_0$ minimizes $\langle \Phi \rangle$ on the sphere whose radius
$\|\delta\|$ satisfies
$\|\delta\|^2 = \|\delta_0\|^2.$
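Proof. In order to find $\delta$ which will minimize
(11)  $\langle \Phi(\delta) \rangle = \|Y - f_0 - P\delta\|^2,$
under the constraint
(12)  $\|\delta\|^2 = \|\delta_0\|^2,$
a necessary condition for a stationary point is, by the method of Lagrange,
(13)  $\frac{\partial u}{\partial \delta_1} = \frac{\partial u}{\partial \delta_2} = \cdots = \frac{\partial u}{\partial \delta_k} = 0, \quad \frac{\partial u}{\partial \lambda} = 0,$
where
(14)  $u(\delta, \lambda) = \|Y - f_0 - P\delta\|^2 + \lambda(\|\delta\|^2 - \|\delta_0\|^2)$
and $\lambda$ is a Lagrange multiplier.
Thus, taking the indicated derivatives
(15)  $0 = -[P^T(Y - f_0) - P^T P\delta] + \lambda\delta,$
(16)  $0 = \|\delta\|^2 - \|\delta_0\|^2.$
For a given $\lambda$, (15) is satisfied by the solution $\delta$ of the equation
(17)  $(P^T P + \lambda I)\delta = P^T(Y - f_0)$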


as can easily be verified by premultiplying (17) by $(P^T P)^{-1}$, and writing
in the form
(18)  $\delta = (P^T P)^{-1} P^T (Y - f_0) - (P^T P)^{-1} \lambda\delta,$
and then substituting into (15). (10) and (17) are identical. That this
stationary point is actually a minimum is clear from the fact that
$(A + \lambda I)$ is positive definite.
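Theorem 1 can be checked numerically in two dimensions. Up to the constant $\|Y - f_0\|^2$, the linearized sum of squares (11) equals $\delta^T A \delta - 2 g^T \delta$, so it suffices to compare that quadratic at $\delta_0$ with its values around the circle of the same radius. The matrix, the vector and the value of $\lambda$ below are assumed purely for illustration.

```python
import math

# Illustrative 2x2 example; A, g and lam are assumed values, not from the paper.
A = [[2.0, 1.2], [1.2, 1.0]]
g = [1.0, -0.5]
lam = 0.3

# Solve (A + lam I) delta_0 = g, equation (10), by Cramer's rule.
a11, a12, a22 = A[0][0] + lam, A[0][1], A[1][1] + lam
det = a11 * a22 - a12 * a12
delta_0 = [(g[0] * a22 - a12 * g[1]) / det, (a11 * g[1] - a12 * g[0]) / det]
radius = math.hypot(delta_0[0], delta_0[1])

def q(d):
    """Linearized sum of squares minus its constant term: d^T A d - 2 g^T d."""
    quad = A[0][0] * d[0] ** 2 + 2.0 * A[0][1] * d[0] * d[1] + A[1][1] * d[1] ** 2
    return quad - 2.0 * (g[0] * d[0] + g[1] * d[1])

# Sample the circle ||d|| = ||delta_0|| and keep the smallest value found.
samples = [q([radius * math.cos(2.0 * math.pi * k / 3600),
              radius * math.sin(2.0 * math.pi * k / 3600)]) for k in range(3600)]
best_on_circle = min(samples)
```

No sampled point on the circle gives a smaller value than `q(delta_0)`, as the theorem asserts.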
THEOREM 2. Let $\delta(\lambda)$ be the solution of (10) for a given value of
$\lambda$. Then $\|\delta(\lambda)\|^2$ is a continuous decreasing function of
$\lambda$, such that as $\lambda \to \infty$, $\|\delta(\lambda)\|^2 \to 0$.
Proof. Since the symmetric matrix $A$ is positive definite, it may be
transformed by an orthonormal rotation of axes into a diagonal matrix, $D$,
without altering the distances between points. Let the transformation be
denoted $S^T A S = D$, where $S^T S = I$, and all diagonal elements of $D$ are
positive. Thus, (10) takes the form $(D + \lambda I) S^{-1} \delta_0 = S^T g$, so that
(19)  $\delta_0 = S(D + \lambda I)^{-1} S^T g.$
Then, defining $v = S^T g$,
$\|\delta_0(\lambda)\|^2 = g^T S (D + \lambda I)^{-1} S^T S (D + \lambda I)^{-1} S^T g = v^T [(D + \lambda I)^2]^{-1} v = \sum_{j=1}^{k} \frac{v_j^2}{(D_j + \lambda)^2},$
which is clearly a decreasing function of $\lambda$ $(\lambda > 0)$, such that as
$\lambda \to \infty$, $\|\delta_0(\lambda)\|^2 \to 0$.
The orthonormal transformation to a diagonal matrix has here been explicitly
exhibited in order to facilitate the proof of the following theorem.
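THEOREM 3. Let $\gamma$ be the angle between $\delta_0$ and $\delta_g$. Then $\gamma$
is a continuous monotone decreasing function of $\lambda$ such that as
$\lambda \to \infty$, $\gamma \to 0$. Since $\delta_g$ is independent of $\lambda$,
it follows that $\delta_0$ rotates toward $\delta_g$ as $\lambda \to \infty$.
Proof. We first observe that
(21)  $\delta_g = \left( \sum_{i=1}^{n} (Y_i - f_i) \frac{\partial f_i}{\partial b_j} \right), \quad j = 1, 2, \ldots, k.$
Thus (except for a scale factor which is irrelevant since only the orientation
of the vectors is pertinent)
(22)  $\delta_g = g.$
By definition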


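(23)  $\cos\gamma = \frac{\delta_0^T g}{(\|\delta_0\|)(\|g\|)} = \frac{v^T (D + \lambda I)^{-1} v}{(v^T [(D + \lambda I)^2]^{-1} v)^{1/2} (g^T g)^{1/2}} = \frac{\sum_{j=1}^{k} \frac{v_j^2}{D_j + \lambda}}{\left[ \sum_{j=1}^{k} \frac{v_j^2}{(D_j + \lambda)^2} \right]^{1/2} (g^T g)^{1/2}}.$
Differentiating and simplifying,
(24)  $\frac{d\cos\gamma}{d\lambda} = \frac{\left[ \sum_{j=1}^{k} \frac{v_j^2}{D_j + \lambda} \right] \left[ \sum_{j=1}^{k} \frac{v_j^2}{(D_j + \lambda)^3} \right] - \left[ \sum_{j=1}^{k} \frac{v_j^2}{(D_j + \lambda)^2} \right]^2}{\left[ \sum_{j=1}^{k} \frac{v_j^2}{(D_j + \lambda)^2} \right]^{3/2} (g^T g)^{1/2}}$
(25)  $\phantom{\frac{d\cos\gamma}{d\lambda}} = \frac{\left[ \sum_{j=1}^{k} v_j^2 \Pi_{1j} \right] \left[ \sum_{j=1}^{k} v_j^2 \Pi_{3j} \right] - \left[ \sum_{j=1}^{k} v_j^2 \Pi_{2j} \right]^2}{\left[ \sum_{j=1}^{k} \frac{v_j^2}{(D_j + \lambda)^2} \right]^{3/2} \left[ \prod_{j=1}^{k} (D_j + \lambda) \right]^4 (g^T g)^{1/2}},$
where $\Pi_{1j} = \prod_{j' \ne j} (D_{j'} + \lambda)$, $\Pi_{2j} = \prod_{j' \ne j} (D_{j'} + \lambda)^2$,
$\Pi_{3j} = \prod_{j' \ne j} (D_{j'} + \lambda)^3$, each product running over
$j' = 1, 2, \ldots, k$. The denominator of (25) is positive, since each factor is
positive. Hence the sign of $d\cos\gamma/d\lambda$ is the sign of the numerator.
Noting that $\Pi_{1j}\Pi_{3j} = (\Pi_{2j})^2$, the numerator can be written
(26)  $\left[ \sum_{j=1}^{k} (v_j \Pi_{1j}^{1/2})^2 \right] \left[ \sum_{j=1}^{k} (v_j \Pi_{3j}^{1/2})^2 \right] - \left[ \sum_{j=1}^{k} (v_j \Pi_{1j}^{1/2})(v_j \Pi_{3j}^{1/2}) \right]^2.$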

By Schwarz's Inequality, (26) is positive. Thus $d\cos\gamma/d\lambda$ is always
positive $(\lambda > 0)$. Consequently, $\gamma$ is a monotone decreasing function
of $\lambda$.
For very large values of $\lambda$, the matrix $(A + \lambda I)$ is dominated by the
diagonal $\lambda I$. Thus it is seen from (10) that as $\lambda \to \infty$,
$\delta_0 \to g/\lambda$, whence $\delta_0$ and $g$ become proportional in the limit,
so that the angle between them approaches zero. On the other hand, if
$\lambda = 0$ in (10), then (except for the trivial case where $A$ is diagonal)
the vectors $\delta_0$ and $g$ meet at some finite angle $0 < \gamma < \pi/2$. It
follows that $\gamma$ is a continuous monotone decreasing function of $\lambda$,
such that as $\lambda \to \infty$, $\gamma \to 0$.
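Theorems 2 and 3 together say that increasing $\lambda$ both shortens the correction vector and rotates it toward the gradient. A minimal numerical sketch, using an assumed nearly singular 2x2 matrix and an assumed vector $g$ (illustrative values, not from the paper), tabulates both quantities over a grid of $\lambda$ values.

```python
import math

A = [[1.0, 0.999], [0.999, 1.0]]   # assumed, nearly singular
g = [1.0, 0.9]                     # assumed right-hand side

def delta_0(lam):
    """Solution of (A + lam I) delta = g, equation (10), for the 2x2 example."""
    a11, a12, a22 = A[0][0] + lam, A[0][1], A[1][1] + lam
    det = a11 * a22 - a12 * a12
    return [(g[0] * a22 - a12 * g[1]) / det, (a11 * g[1] - a12 * g[0]) / det]

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def angle_deg(u, v):
    c = sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))
    return math.degrees(math.acos(min(1.0, c)))   # clamp guards against rounding

lams = [0.0, 1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0]
norms = [norm(delta_0(lam)) for lam in lams]            # Theorem 2: decreasing
angles = [angle_deg(delta_0(lam), g) for lam in lams]   # Theorem 3: decreasing
```

For these values the angle falls from about 86 degrees at $\lambda = 0$ to well under one degree at $\lambda = 100$.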
Scale of measurement. The relevant properties of the solution, $\delta_t$, of (6)
are invariant under linear transformations of the $b$-space. However, it is
well known [3] that the properties of the gradient methods are not scale


invariant. It becomes necessary, then, to scale the $b$-space in some convenient
manner. We shall choose to scale the $b$-space in units of the standard
deviations of the derivatives $\partial f_i/\partial b_j$, taken over the sample
points $i = 1, 2, \ldots, n$. Since these derivatives depend, in general, on the
$b_j$ themselves, the current trial values of the $b_j$ are used as necessary in
the evaluation of the derivatives. This choice of scale causes the $A$ matrix
to be transformed into the matrix of simple correlation coefficients among the
$\partial f_i/\partial b_j$. This choice of scale has, in fact, been widely used in
linear least-squares problems as a device for improving the numerical aspects
of computation.
Thus, we define a scaled matrix $A^*$, and a scaled vector $g^*$:
(27)  $A^* = (a^*_{jj'}) = \left( \frac{a_{jj'}}{\sqrt{a_{jj}}\,\sqrt{a_{j'j'}}} \right),$
(28)  $g^* = (g^*_j) = \left( \frac{g_j}{\sqrt{a_{jj}}} \right),$
and solve for the Taylor series correction
(29)  $A^* \delta_t^* = g^*,$
(30)  $\delta_j = \delta_j^* / \sqrt{a_{jj}}.$
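The scaling (27) through (30) can be sketched for $k = 2$ as below. The numerical values of $A$ and $g$ are assumed for illustration, and `solve2` is a hypothetical helper. With no $\lambda$ added, the unscaled correction recovered through (30) agrees with the direct solution of (6), which is the scale invariance of the Taylor series method noted above.

```python
import math

def solve2(A, g):
    """Solve the 2x2 system A d = g by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(g[0] * A[1][1] - A[0][1] * g[1]) / det,
            (A[0][0] * g[1] - A[1][0] * g[0]) / det]

# Assumed values: two parameters measured on very different scales.
A = [[4.0, 30.0], [30.0, 400.0]]
g = [2.0, -10.0]

s = [math.sqrt(A[0][0]), math.sqrt(A[1][1])]                               # sqrt(a_jj)
A_star = [[A[j][l] / (s[j] * s[l]) for l in range(2)] for j in range(2)]  # (27)
g_star = [g[j] / s[j] for j in range(2)]                                  # (28)

delta_star = solve2(A_star, g_star)                # (29)
delta = [delta_star[j] / s[j] for j in range(2)]   # (30)
delta_direct = solve2(A, g)                        # unscaled solution of (6)
```

The scaled matrix has a unit diagonal and off-diagonal entries equal to 0.75 in this example.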
Construction of the algorithm. The broad outline of the appropriate algorithm
is now clear. Specifically, at the $r$th iteration the equation
(31)  $(A^{*(r)} + \lambda^{(r)} I)\delta^{*(r)} = g^{*(r)}$
is constructed. This equation is then solved for $\delta^{*(r)}$. Then (30) is
used to obtain $\delta^{(r)}$. The new trial vector
(32)  $b^{(r+1)} = b^{(r)} + \delta^{(r)}$
will lead to a new sum of squares $\Phi^{(r+1)}$. It is essential to select
$\lambda^{(r)}$ such that
(33)  $\Phi^{(r+1)} < \Phi^{(r)}.$
It is clear from the foregoing theory that a sufficiently large $\lambda^{(r)}$
always exists such that (33) will be satisfied, unless $b^{(r)}$ is already at a
minimum of $\Phi$. Some form of trial and error is required to find a value
$\lambda^{(r)}$ which will lead to satisfaction of (33) and will produce rapid
convergence of the algorithm to the least-squares values.
At each iteration we desire to minimize $\Phi$ in the (approximately) maximum
neighborhood over which the linearized function will give adequate


representation of the nonlinear function. Accordingly, the strategy for
choosing $\lambda^{(r)}$ must seek to use a small value of $\lambda^{(r)}$ whenever
conditions are such that the unmodified Taylor series method would converge
nicely. This is especially pertinent in the later stages of the convergence
procedure, when the guesses are in the immediate vicinity of the minimum, where
the contours of $\Phi$ are asymptotically elliptical, and the linear expansion
of the model needs to be a good approximation over only a very small region.
Large values of $\lambda^{(r)}$ should therefore be used only when necessary to
satisfy (33). While it is true that $\Phi^{(r+1)}$ as a function of $\lambda$ has a
minimum, and choice of this value of $\lambda$ at the $r$th iteration would
maximize $(\Phi^{(r)} - \Phi^{(r+1)})$, such a locally optimum choice would be
poor global strategy, since it typically requires a substantially larger value
of $\lambda$ than is necessary to satisfy (33). Such a strategy would inherit
many of the properties of steepest-descent; e.g., rapid initial progress
followed by progressively slower progress.
We shall therefore define our strategy as follows:
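Let $\nu > 1$.
Let $\lambda^{(r-1)}$ denote the value of $\lambda$ from the previous iteration.
Initially let $\lambda^{(0)} = 10^{-2}$, say.
Compute $\Phi(\lambda^{(r-1)})$ and (see footnote 2) $\Phi(\lambda^{(r-1)}/\nu)$.
i. If $\Phi(\lambda^{(r-1)}/\nu) \le \Phi^{(r)}$, let $\lambda^{(r)} = \lambda^{(r-1)}/\nu$.
ii. If $\Phi(\lambda^{(r-1)}/\nu) > \Phi^{(r)}$, and $\Phi(\lambda^{(r-1)}) \le \Phi^{(r)}$,
let $\lambda^{(r)} = \lambda^{(r-1)}$.
iii. If $\Phi(\lambda^{(r-1)}/\nu) > \Phi^{(r)}$, and $\Phi(\lambda^{(r-1)}) > \Phi^{(r)}$,
increase $\lambda$ by successive multiplication (see footnote 3) by $\nu$ until for
some smallest $w$, $\Phi(\lambda^{(r-1)}\nu^w) \le \Phi^{(r)}$.
Let $\lambda^{(r)} = \lambda^{(r-1)}\nu^w$.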
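Tests i. through iii. can be sketched as a single iteration routine. The following is a minimal illustration under stated assumptions, not the author's FORTRAN program: the two-parameter exponential model, the synthetic data, the starting point and all function names are hypothetical, the linear system is solved by Cramer's rule because $k = 2$, and the refinements described in the footnotes are omitted.

```python
import math

def model(x, b):
    # Hypothetical test model f(x; b) = b[0] * exp(-b[1] * x); not from the paper.
    return b[0] * math.exp(-b[1] * x)

def phi(xs, ys, b):
    """Sum of squares (3)."""
    return sum((y - model(x, b)) ** 2 for x, y in zip(xs, ys))

def scaled_system(xs, ys, b):
    """Return A*, g* of (27)-(28) and the scale factors sqrt(a_jj)."""
    A = [[0.0, 0.0], [0.0, 0.0]]
    g = [0.0, 0.0]
    for x, y in zip(xs, ys):
        e = math.exp(-b[1] * x)
        p = (e, -b[0] * x * e)                 # one row of P
        r = y - model(x, b)
        for j in range(2):
            g[j] += p[j] * r
            for l in range(2):
                A[j][l] += p[j] * p[l]
    s = [math.sqrt(A[0][0]), math.sqrt(A[1][1])]
    A_star = [[A[j][l] / (s[j] * s[l]) for l in range(2)] for j in range(2)]
    g_star = [g[j] / s[j] for j in range(2)]
    return A_star, g_star, s

def trial(xs, ys, b, system, lam):
    """Solve (31) for one lambda; return the trial vector (32) and its Phi."""
    A, g, s = system
    a11, a12, a22 = A[0][0] + lam, A[0][1], A[1][1] + lam
    det = a11 * a22 - a12 * a12
    d = [(g[0] * a22 - a12 * g[1]) / det, (a11 * g[1] - a12 * g[0]) / det]
    b_new = [b[j] + d[j] / s[j] for j in range(2)]   # unscale by (30)
    return b_new, phi(xs, ys, b_new)

def marquardt_step(xs, ys, b, lam_prev, nu=10.0):
    """Apply tests i-iii once; return (b_next, lambda_r)."""
    phi_r = phi(xs, ys, b)
    system = scaled_system(xs, ys, b)
    b_new, phi_new = trial(xs, ys, b, system, lam_prev / nu)
    if phi_new <= phi_r:                             # test i
        return b_new, lam_prev / nu
    lam = lam_prev
    b_new, phi_new = trial(xs, ys, b, system, lam)
    while phi_new > phi_r:                           # test iii: raise lambda
        lam *= nu
        b_new, phi_new = trial(xs, ys, b, system, lam)
    return b_new, lam                                # test ii if the loop never ran

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [model(x, (2.0, 0.5)) for x in xs]   # synthetic, noise-free data
b, lam = [1.0, 1.0], 1e-2
for _ in range(100):
    b, lam = marquardt_step(xs, ys, b, lam)
```

From this starting point the undamped Taylor series step would drive the second parameter negative; the routine instead raises $\lambda$ until the sum of squares decreases and then lowers it again as the iterates approach the minimum.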
2 If $\lambda^{(r-1)}$ is already negligible by comparison with 1.0 to the number
of significant figures carried, then go to test ii. or iii. immediately without
computing $\Phi(\lambda^{(r-1)}/\nu)$, and ignore comparisons involving
$\Phi(\lambda^{(r-1)}/\nu)$.
3 On occasion in problems where the correlations among the parameter-estimates
are extremely high (>0.99) it can happen that $\lambda$ will be increased to
unreasonably large values. It has been found helpful for these instances to
alter test iii. The revised test is:
Let $b^{(r+1)} = b^{(r)} + K^{(r)}\delta^{(r)}$, $K^{(r)} \le 1$.
Noting that the angle $\gamma^{(r)}$ is a decreasing function of $\lambda^{(r)}$,
select a criterion angle $\gamma_0 < \pi/2$ and take
$K^{(r)} = 1$ if $\gamma^{(r)} \ge \gamma_0$.
However, if test iii. is not passed even though $\lambda^{(r)}$ has been increased
until $\gamma^{(r)} < \gamma_0$, then do not increase $\lambda^{(r)}$ further, but take
$K^{(r)}$ sufficiently small so that $\Phi^{(r+1)} < \Phi^{(r)}$. This can always be
done since $\gamma^{(r)} < \gamma_0 < \pi/2$. A suitable choice for the criterion
angle is $\gamma_0 = \pi/4$.
It may be noted that positivity of $\cos\gamma$ when $\lambda = 0$ is guaranteed
only when $A$ is


By this algorithm we always obtain a feasible neighborhood. Further, we almost
always obtain, within a factor determined by $\nu$, the maximum neighborhood in
which the Taylor series gives an adequate representation for our purposes. The
iteration is converged when
$\frac{|\delta_j^{(r)}|}{\tau + |b_j^{(r)}|} < \varepsilon$, for all $j$,
for some suitably small $\varepsilon > 0$, say $10^{-5}$, and some suitable $\tau$,
say $10^{-3}$. The choice of $\nu$ is arbitrary; $\nu = 10$ has been found in
practice to be a good choice.
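The convergence criterion is a relative-change test in which $\tau$ guards against division by a parameter value near zero. A minimal sketch follows; the function name `is_converged` and the sample vectors are assumptions for illustration, while the default values of $\varepsilon$ and $\tau$ are the ones suggested above.

```python
def is_converged(delta, b, eps=1e-5, tau=1e-3):
    """True when |delta_j| / (tau + |b_j|) < eps for every parameter j."""
    return all(abs(d) / (tau + abs(bj)) < eps for d, bj in zip(delta, b))

small_step = is_converged([1e-7, -2e-8], [2.0, 0.5])   # relative changes near 1e-8
large_step = is_converged([1e-2, 0.0], [2.0, 0.5])     # first parameter still moving
near_zero = is_converged([1e-9], [0.0])                # tau keeps the ratio finite
```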
Typically, condition iii. is met only rarely. Thus, it is most often required
that (31) be solved for two values of $\lambda^{(r)}$ at each iteration. One such
solution is required for the standard Taylor series method. The extra linear
equation solution is generally much less computational effort than the
evaluation of the $A^*$ matrix, so that the small proportional increase in
computation per iteration is more than offset by the gain in the power of an
iteration.
It may be remarked that when using floating point arithmetic it is often
desirable to accumulate the sum of squares of (3) (for use in tests i. and ii.)
in double precision, after computing $\hat{Y}_i$ to single precision. Failure to
do this can lead to erratic behavior of $\Phi$ near the minimum, due only to
rounding error. In extreme cases it may also be necessary to compute
$\hat{Y}_i$ to double precision.
A corollary numerical benefit associated with adding $\lambda$ to the diagonal of
the $A^*$ matrix is that the composite matrix is always better conditioned than
$A^*$ itself.
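The conditioning remark can be made concrete for $k = 2$, where $A^*$ is a correlation matrix with off-diagonal entry $\rho$ and eigenvalues $1 \pm |\rho|$. The sketch below uses the ratio of largest to smallest eigenvalue as the measure of conditioning; that measure, the value of $\rho$ and the value of $\lambda$ are assumptions chosen for illustration.

```python
import math

def condition_number(a, b, d):
    """Ratio of largest to smallest eigenvalue of the symmetric matrix [[a, b], [b, d]]."""
    mean = 0.5 * (a + d)
    rad = math.sqrt((0.5 * (a - d)) ** 2 + b * b)
    return (mean + rad) / (mean - rad)

rho = 0.999   # assumed correlation between the two derivative columns
lam = 0.01    # assumed value of lambda

cond_plain = condition_number(1.0, rho, 1.0)                # A* alone
cond_damped = condition_number(1.0 + lam, rho, 1.0 + lam)   # A* + lambda I
```

Here the ratio drops from about 1999 to about 183, since adding $\lambda$ raises the smallest eigenvalue proportionally far more than the largest.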

Application to other problems. Clearly the method described can be applied to
other types of problems. For example, it will enable interpolation between the
gradient method and the Newton-Raphson method for solving systems of nonlinear
algebraic equations, such as the simultaneous nonlinear difference equations
associated with the solution of nonlinear boundary value problems by implicit
methods. In a sense, such problems are a particularization of the least-squares
problem to the case where $n = k$. Let the system of equations to be solved be
written
(34)  $0 = f_i(x), \quad i = 1, 2, \ldots, k,$
where $x$ is a $k$-dimensional vector of unknowns, $x_j$.

positive definite. In the presence of very high correlations, the positive
definiteness of $A$ can break down due to rounding error alone. In such
instances the Taylor series method can diverge, no matter what value of
$K^{(r)}$ is used. The present algorithm, by its use of $(A + \lambda I)$, will
guarantee positive $\cos\gamma$ even though $A$ may not quite be positive
definite. While it is not always possible to foresee high correlations among
the parameter-estimates when taking data for a nonlinear model, better
experimental design can often substantially reduce such extreme correlations
when they do occur.


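For the gradient methods it is customary to define an error criterion
(35)  $\Phi = \sum_{i=1}^{k} [f_i(x)]^2$
and to make corrections to trial vectors $x^{(r)}$ by moving in the direction of
the negative gradient of $\Phi$, defined by
(36)  $-\frac{\partial \Phi}{\partial x_j} = -2 \sum_{i=1}^{k} f_i \frac{\partial f_i}{\partial x_j}, \quad j = 1, 2, \ldots, k.$
So that
(37)  $x_j^{(r+1)} = x_j^{(r)} + \delta_{gj}^{(r)}, \quad j = 1, 2, \ldots, k,$
where
(38)  $\delta_{gj}^{(r)} = -\alpha \frac{\partial \Phi}{\partial x_j}, \quad j = 1, 2, \ldots, k,$
and $\alpha$ is a suitably defined constant.
In matrix notation
(39)  $x^{(r+1)} = x^{(r)} + \delta_g^{(r)}.$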

On the other hand, the Newton-Raphson method, which involves expansion of the
$f_i$ in Taylor series through the linear terms, leads to the equations
(40)  $\sum_{j=1}^{k} \left( \frac{\partial f_i}{\partial x_j} \right)^{(r)} \delta_{tj}^{(r)} = -f_i^{(r)}, \quad i = 1, 2, \ldots, k.$
In matrix notation
(41)  $B\delta_t = f,$
where the matrix $B = -(\partial f_i/\partial x_j)$ is not, in general, symmetric.
Premultiplying (41) by $B^T$ we then form $B^T B = A$ and $B^T f = g$. The matrix
$A$ is symmetric. Matrix $A$ and vector $g$ are then normalized to $A^*$, $g^*$.
Application of the new algorithm gives the formulation
(42)  $(A^* + \lambda^{(r)} I)\delta^{*(r)} = g^*,$
which is identical to the formulation of the nonlinear least-squares problem.
The computational method for obtaining the solution is also identical. In this
case, however, the minimum value of $\Phi$ is known to be zero, within rounding
error.
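The formulation (40) through (42) can be sketched for a pair of nonlinear equations. The system below (a circle of radius 2 intersected with the line $x_1 = x_2$), the starting point and the function names are assumptions chosen for illustration. With $B = -J$, where $J$ is the matrix of partial derivatives, the products become $A = J^T J$ and $g = -J^T f$, and the same tests i. through iii. select $\lambda$.

```python
import math

def residuals(x):
    # Hypothetical system: x1^2 + x2^2 = 4 together with x1 = x2.
    return [x[0] ** 2 + x[1] ** 2 - 4.0, x[0] - x[1]]

def jacobian(x):
    return [[2.0 * x[0], 2.0 * x[1]], [1.0, -1.0]]

def phi(x):
    """Error criterion (35)."""
    return sum(fi * fi for fi in residuals(x))

def trial(x, lam):
    """Solve (42) for one lambda; return the trial vector and its Phi."""
    f, J = residuals(x), jacobian(x)
    # A = B^T B and g = B^T f with B = -J, i.e. A = J^T J and g = -J^T f.
    A = [[sum(J[i][j] * J[i][l] for i in range(2)) for l in range(2)] for j in range(2)]
    g = [-sum(J[i][j] * f[i] for i in range(2)) for j in range(2)]
    s = [math.sqrt(A[0][0]), math.sqrt(A[1][1])]
    a11 = a22 = 1.0 + lam                    # diagonal of A* + lambda I
    a12 = A[0][1] / (s[0] * s[1])
    gs = [g[0] / s[0], g[1] / s[1]]
    det = a11 * a22 - a12 * a12
    d = [(gs[0] * a22 - a12 * gs[1]) / det, (a11 * gs[1] - a12 * gs[0]) / det]
    x_new = [x[0] + d[0] / s[0], x[1] + d[1] / s[1]]
    return x_new, phi(x_new)

x, lam, nu = [1.0, 2.0], 1e-2, 10.0
for _ in range(30):
    phi_r = phi(x)
    x_try, phi_try = trial(x, lam / nu)
    if phi_try <= phi_r:                     # test i
        x, lam = x_try, lam / nu
        continue
    x_try, phi_try = trial(x, lam)
    while phi_try > phi_r:                   # test iii: raise lambda
        lam *= nu
        x_try, phi_try = trial(x, lam)
    x = x_try                                # test ii if the loop never ran
```

The iterates approach the root $x_1 = x_2 = \sqrt{2}$, where the error criterion vanishes to within rounding.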


Conclusion. The algorithm described shares with the gradient methods their
ability to converge from an initial guess which may be outside the region of
convergence of other methods. The algorithm shares with the Taylor series
method the ability to close in on the converged values rapidly after the
vicinity of the converged values has been reached. Thus, the method combines
the best features of its predecessors while avoiding their most serious
limitations.
Acknowledgments. The author is indebted to a referee for detecting an error in
the proof of Theorem 3 in the original manuscript. The author is also grateful
to Professor H. O. Hartley who has drawn his attention, since submission of
this paper, to the related work of Levenberg [8]. From somewhat different
reasoning Levenberg was also led to add a quantity to the diagonal of $A$.
Levenberg's recommendation to minimize $\Phi$ locally as a function of the
quantity added to the diagonal would give steepest-descent-like convergence for
the reasons detailed earlier in this paper.
Computer Program. A FORTRAN program, "Least-Squares Estimation of Nonlinear
Parameters", embodying the algorithm described in this paper is available as
IBM Share Program No. 1428.
REFERENCES
[1] G. W. BOOTH, G. E. P. BOX, M. E. MULLER, AND T. I. PETERSON, Forecasting by
Generalized Regression Methods, Nonlinear Estimation (Princeton - IBM),
International Business Machines Corp., Mimeo. (IBM Share Program
No. 687WL NL1.), 1959.
[2] G. E. P. BOX AND G. A. COUTIE, Proc. Inst. Elec. Engrs. 103, Part B,
Suppl. No. 1 (1956), Paper No. 2138M.
[3] H. B. CURRY, Quart. Appl. Math., 2 (1944), pp. 258-261.
[4] H. O. HARTLEY, Technometrics, 3 (1961), No. 2, pp. 269-280.
[5] D. W. MARQUARDT, Chem. Eng. Progr., 55 (1959), No. 6, pp. 65-70.
[6] D. W. MARQUARDT, R. G. BENNETT, AND E. J. BURRELL, Jour. Molec.
Spectroscopy, 7 (1961), No. 4, pp. 269-279.
[7] D. D. MORRISON, Methods for nonlinear least squares problems and convergence
proofs, Tracking Programs and Orbit Determination, Proc. Jet Propulsion
Laboratory Seminar, (1960), pp. 1-9.
[8] K. LEVENBERG, A method for the solution of certain non-linear problems in
least squares, Quart. Appl. Math., 2 (1944), pp. 164-168.
