
A Stochastic Approximation Method

Herbert Robbins; Sutton Monro

The Annals of Mathematical Statistics, Vol. 22, No. 3. (Sep., 1951), pp. 400-407.

Stable URL:
http://links.jstor.org/sici?sici=0003-4851%28195109%2922%3A3%3C400%3AASAM%3E2.0.CO%3B2-W

The Annals of Mathematical Statistics is currently published by Institute of Mathematical Statistics.


A STOCHASTIC APPROXIMATION METHOD¹

By Herbert Robbins and Sutton Monro

University of North Carolina


1. Summary. Let M(x) denote the expected value at level x of the response to a certain experiment. M(x) is assumed to be a monotone function of x but is unknown to the experimenter, and it is desired to find the solution x = θ of the equation M(x) = α, where α is a given constant. We give a method for making successive experiments at levels x_1, x_2, ... in such a way that x_n will tend to θ in probability.

2. Introduction. Let M(x) be a given function and α a given constant such that the equation

(1) M(x) = α

has a unique root x = θ. There are many methods for determining the value of θ by successive approximation. With any such method we begin by choosing one or more values x_1, ..., x_r more or less arbitrarily, and then successively obtain new values x_n as certain functions of the previously obtained x_1, ..., x_{n-1}, the values M(x_1), ..., M(x_{n-1}), and possibly those of the derivatives M'(x_1), ..., M'(x_{n-1}), etc. If

(2) lim_{n→∞} x_n = θ,

irrespective of the arbitrary initial values x_1, ..., x_r, then the method is effective for the particular function M(x) and value α. The speed of the convergence in (2) and the ease with which the x_n can be computed determine the practical utility of the method.
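For concreteness, one classical successive-approximation scheme of the kind described is Newton's method. The sketch below is illustrative only and is not from the paper; the particular function M(x) is a hypothetical example chosen to be strictly increasing so that (1) has a unique root.

```python
# Illustrative sketch (not from the paper): Newton's method for solving
# M(x) = alpha when M and its derivative M' can be evaluated exactly.
import math

def M(x):
    return 2.0 * x + math.sin(x)      # strictly increasing: M'(x) = 2 + cos(x) > 0

def M_prime(x):
    return 2.0 + math.cos(x)

def newton_root(alpha, x0, tol=1e-12, max_iter=100):
    """Iterate x_{n+1} = x_n - (M(x_n) - alpha) / M'(x_n) until the step is tiny."""
    x = x0
    for _ in range(max_iter):
        step = (M(x) - alpha) / M_prime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

theta = newton_root(alpha=1.0, x0=5.0)
print(theta, M(theta))                # M(theta) should be very close to 1.0
```

Such schemes use exact values of M (and here M'); the stochastic problem below removes exactly that assumption.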
We consider a stochastic generalization of the above problem in which the nature of the function M(x) is unknown to the experimenter. Instead, we suppose that to each value x corresponds a random variable Y = Y(x) with distribution function Pr[Y(x) ≤ y] = H(y | x), such that

(3) M(x) = ∫_{−∞}^{∞} y dH(y | x)

is the expected value of Y for the given x. Neither the exact nature of H(y | x) nor that of M(x) is known to the experimenter, but it is assumed that equation (1) has a unique root θ, and it is desired to estimate θ by making successive observations on Y at levels x_1, x_2, ... determined sequentially in accordance with some definite experimental procedure. If (2) holds in probability irrespective of any arbitrary initial values x_1, ..., x_r, we shall, in conformity with usual statistical terminology, call the procedure consistent for the given H(y | x) and value α.
¹ This work was supported in part by the Office of Naval Research.

In what follows we shall give a particular procedure for estimating θ which is consistent under certain restrictions on the nature of H(y | x). These restrictions are severe, and could no doubt be lightened considerably, but they are often satisfied in practice, as will be seen in Section 4. No claim is made that the procedure to be described has any optimum properties (i.e., that it is "efficient"), but the results indicate at least that the subject of stochastic approximation is likely to be useful and is worthy of further study.

3. Convergence theorems. We suppose henceforth that H(y | x) is, for every x, a distribution function in y, and that there exists a positive constant C such that

(4) Pr[|Y(x)| ≤ C] = ∫_{−C}^{C} dH(y | x) = 1 for all x.

It follows in particular that for every x the expected value M(x) defined by (3) exists and is finite. We suppose, moreover, that there exist finite constants α, θ such that

(5) M(x) ≤ α for x < θ,  M(x) ≥ α for x > θ.

Whether M(θ) = α is, for the moment, immaterial.

Let {a_n} be a fixed sequence of positive constants such that

(6) Σ_{n=1}^{∞} a_n² < ∞.

We define a (nonstationary) Markov chain {x_n} by taking x_1 to be an arbitrary constant and defining

(7) x_{n+1} = x_n + a_n(α − y_n),

where y_n is a random variable such that

(8) Pr[y_n ≤ y | x_n] = H(y | x_n).

Let

(9) b_n = E(x_n − θ)².

We shall find conditions under which

(10) lim_{n→∞} b_n = 0

no matter what the initial value x_1. As is well known, (10) implies the convergence in probability of x_n to θ.
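The recursion (7) is straightforward to implement. The sketch below is illustrative only: the response model Y(x) (a clipped linear mean plus bounded uniform noise) and all numerical values are assumptions chosen so that conditions (4) and (5) hold, not part of the paper.

```python
# Sketch of the procedure (7): x_{n+1} = x_n + a_n * (alpha - y_n),
# where y_n is the observed response at level x_n.  The response model
# below is a hypothetical stand-in with bounded Y, so (4) is satisfied.
import random

alpha = 0.0            # target level in M(x) = alpha
theta = 1.5            # true root (unknown to the procedure)

def Y(x):
    """Noisy response with mean M(x) = x - theta (clipped), |Y| <= C = 6."""
    m = max(-5.0, min(5.0, x - theta))
    return m + random.uniform(-1.0, 1.0)

random.seed(0)
x = 0.0                                  # arbitrary starting level x_1
for n in range(1, 20001):
    a_n = 1.0 / n                        # a sequence of type 1/n (Section 3)
    x = x + a_n * (alpha - Y(x))
print(x)                                 # x_n should be near theta = 1.5
```

Note that the procedure never evaluates M or M' directly; it sees only the noisy responses y_n.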
From (7) we have

(11) b_{n+1} = E(x_{n+1} − θ)² = b_n + a_n² E(y_n − α)² − 2a_n E[(x_n − θ)(M(x_n) − α)],

since E[y_n | x_n] = M(x_n). Setting

(12) d_n = E[(x_n − θ)(M(x_n) − α)],

(13) e_n = E(y_n − α)²,

we can write

(14) b_{n+1} − b_n = a_n² e_n − 2a_n d_n.

Note that from (5),

d_n ≥ 0,

while from (4),

0 ≤ e_n ≤ [C + |α|]² < ∞.

Together with (6) this implies that the positive-term series Σ a_n² e_n converges. Summing (14) we obtain

(15) b_{n+1} = b_1 + Σ_{j=1}^{n} a_j² e_j − 2 Σ_{j=1}^{n} a_j d_j.

Since b_{n+1} ≥ 0 it follows that

(16) Σ_{j=1}^{n} a_j d_j ≤ ½ [b_1 + Σ_{j=1}^{n} a_j² e_j].

Hence the positive-term series

(17) Σ_{n=1}^{∞} a_n d_n

converges. It follows from (15) that

(18) lim_{n→∞} b_n = b_1 + Σ_{n=1}^{∞} a_n² e_n − 2 Σ_{n=1}^{∞} a_n d_n = b

exists, and b ≥ 0.
Now suppose that there exists a sequence {k_n} of nonnegative constants such that

(19) d_n ≥ k_n b_n,  Σ_{n=1}^{∞} a_n k_n = ∞.

From the first part of (19) and the convergence of (17) it follows that

(20) Σ_{n=1}^{∞} a_n k_n b_n < ∞.

From (20) and the second part of (19) it follows that for any ε > 0 there must exist infinitely many values n such that b_n < ε. Since we already know that b = lim_{n→∞} b_n exists, it follows that b = 0. Thus we have proved

LEMMA 1. If a sequence {k_n} of nonnegative constants exists satisfying (19), then b = 0.
Let

(21) A_n = |x_1| + |θ| + (C + |α|)(a_1 + ⋯ + a_{n−1});

then from (4) and (7) it follows that

(22) Pr[|x_n − θ| ≤ A_n] = 1.

Let

(23) k̄_n = inf [(M(x) − α)/(x − θ)] for 0 < |x − θ| ≤ A_n.

From (5) it follows that k̄_n ≥ 0. Moreover, denoting by P_n(x) the probability distribution of x_n, we have

(24) d_n = ∫_{|x−θ|≤A_n} (x − θ)(M(x) − α) dP_n(x) ≥ k̄_n ∫_{|x−θ|≤A_n} |x − θ|² dP_n(x) = k̄_n b_n.

It follows that the particular sequence {k̄_n} defined by (23) satisfies the first part of (19).
In order to establish the second part of (19) we shall make the following assumptions:

(25) k̄_n ≥ K/A_n for some constant K > 0 and sufficiently large n, and

(26) Σ_{n=1}^{∞} a_n = ∞.

It follows from (26) that

(27) lim_{n→∞} (a_1 + ⋯ + a_{n−1}) = ∞,

and hence for sufficiently large n

(28) |x_1| + |θ| ≤ (C + |α|)(a_1 + ⋯ + a_{n−1}),

so that A_n ≤ 2(C + |α|)(a_1 + ⋯ + a_{n−1}). This implies by (25) that for sufficiently large n

(29) a_n k̄_n ≥ K a_n / [2(C + |α|)(a_1 + ⋯ + a_{n−1})],

and the second part of (19) follows from (29) and (26). This proves

LEMMA 2. If (25) and (26) hold, then b = 0.

The hypotheses (6) and (26) concerning {a_n} are satisfied by the sequence a_n = 1/n, since

Σ 1/n² < ∞,  Σ 1/n = ∞.

More generally, any sequence {a_n} such that there exist two positive constants c', c'' for which

(30) c'/n ≤ a_n ≤ c''/n

will satisfy (6) and (26). We shall call any sequence {a_n} which satisfies (6) and (26), whether or not it is of the form (30), a sequence of type 1/n.

If {a_n} is a sequence of type 1/n it is easy to find functions M(x) which satisfy (5) and (25). Suppose, for example, that M(x) satisfies the following strengthened form of (5): for some δ > 0,

(5') M(x) ≤ α − δ for x < θ,  M(x) ≥ α + δ for x > θ.

Then for 0 < |x − θ| ≤ A_n we have

(M(x) − α)/(x − θ) ≥ δ/A_n,

so that

k̄_n ≥ δ/A_n,

which is (25) with K = δ. From Lemma 2 we conclude

THEOREM 1. If {a_n} is of type 1/n, if (4) holds, and if M(x) satisfies (5'), then b = 0.
A more interesting case occurs when M(x) satisfies the following conditions:

(33) M(x) is nondecreasing,

(34) M(θ) = α,

(35) M'(θ) exists and is positive.

We shall prove that (25) holds in this case also. From (34) and (35) it follows that

(36) M(x) − α = (x − θ)[M'(θ) + ε(x − θ)],

where ε(t) is a function such that

(37) lim_{t→0} ε(t) = 0.

Hence there exists a constant δ > 0 such that

(38) ε(t) > −½ M'(θ) for |t| ≤ δ,

so that

(39) (M(x) − α)/(x − θ) ≥ ½ M'(θ) for 0 < |x − θ| ≤ δ.

Hence, for θ + δ ≤ x ≤ θ + A_n, since M(x) is nondecreasing,

(40) (M(x) − α)/(x − θ) ≥ (M(θ + δ) − α)/A_n ≥ δ M'(θ)/(2A_n),

and similarly for θ − A_n ≤ x ≤ θ − δ. Thus, since we may assume without loss of generality that δ/A_n ≤ 1,

(41) k̄_n ≥ δ M'(θ)/(2A_n),

so that (25) holds with K = δ M'(θ)/2. This proves

THEOREM 2. If {a_n} is of type 1/n, if (4) holds, and if M(x) satisfies (33), (34), and (35), then b = 0.
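The conclusion b_n = E(x_n − θ)² → 0 can be checked numerically by Monte Carlo. The sketch below is an illustrative assumption, not from the paper: M(x) = arctan(x − θ) is a hypothetical choice that is nondecreasing with M(θ) = α = 0 and M'(θ) = 1 > 0, and the uniform noise keeps Y bounded so that (4) holds.

```python
# Monte Carlo estimate of b_n = E(x_n - theta)^2 for the scheme (7)
# with a_n = 1/n, under hypotheses of the Theorem 2 type.
import math, random

alpha, theta = 0.0, 2.0

def Y(x):
    # mean M(x) = arctan(x - theta): nondecreasing, M(theta) = 0 = alpha,
    # M'(theta) = 1 > 0; |Y| is bounded, so (4) holds.
    return math.atan(x - theta) + random.uniform(-0.5, 0.5)

def estimate_b(n_steps, replications=1000, x1=0.0):
    """Average (x_n - theta)^2 over independent runs of the recursion (7)."""
    total = 0.0
    for _ in range(replications):
        x = x1
        for n in range(1, n_steps + 1):
            x += (1.0 / n) * (alpha - Y(x))
        total += (x - theta) ** 2
    return total / replications

random.seed(1)
b_100, b_1000 = estimate_b(100), estimate_b(1000)
print(b_100, b_1000)        # b_n shrinks as n grows
```

The decrease of the estimates with n is the empirical counterpart of b = 0.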
It is fairly obvious that condition (4) could be considerably weakened without affecting the validity of Theorems 1 and 2. A reasonable substitute for (4) would be the condition

(4') |M(x)| < C,  ∫_{−∞}^{∞} (y − M(x))² dH(y | x) ≤ σ² < ∞ for all x.

We do not know whether Theorems 1 and 2 hold with (4) replaced by (4'). Likewise, the hypotheses (33), (34), and (35) of Theorem 2 could be weakened somewhat, perhaps being replaced by

(5'') M(x) < α for x < θ,  M(x) > α for x > θ.
4. Estimation of a quantile using response, nonresponse data. Let F(x) be an unknown distribution function such that

(43) F(θ) = α (0 < α < 1),  F'(θ) > 0,

and let {z_n} be a sequence of independent random variables each with the distribution function Pr[z_n ≤ x] = F(x). On the basis of {z_n} we wish to estimate θ. However, as sometimes happens in practice (bioassay, sensitivity data), we are not allowed to know the values of z_n themselves. Instead, we are free to prescribe for each n a value x_n and are then given only the values {y_n}, where

(44) y_n = 1 if z_n ≤ x_n ("response"),  y_n = 0 otherwise ("nonresponse").

How shall we choose the values {x_n}, and how shall we use the sequence {y_n} to estimate θ?

Let us proceed as follows. Choose x_1 as our best guess of the value θ, and let {a_n} be any sequence of constants of type 1/n. Then choose values x_2, x_3, ... sequentially according to the rule

(45) x_{n+1} = x_n + a_n(α − y_n).

Since

(46) Pr[y_n = 1 | x_n] = F(x_n),  Pr[y_n = 0 | x_n] = 1 − F(x_n),

it follows that (4) holds and that

(47) M(x) = F(x).

All the hypotheses of Theorem 2 are satisfied, so that

(48) lim_{n→∞} x_n = θ

in quadratic mean and hence in probability. In other words, {x_n} is a consistent estimator of θ.
The efficiency of {x_n} will depend on x_1 and on the choice of the sequence {a_n}, as well as on the nature of F(x). For any given F(x) there doubtless exist more efficient estimators of θ than any of the type {x_n} defined by (45), but {x_n} has the advantage of being distribution-free.
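The quantile procedure of this section is easy to simulate. In the sketch below the hidden F is a logistic distribution function, which is purely an assumption for the demonstration (the estimator never sees F, only the response/nonresponse values), and the constant 4 in a_n = 4/n is likewise an illustrative choice, still of type 1/n.

```python
# Sketch of the rule (45): x_{n+1} = x_n + a_n*(alpha - y_n), where
# y_n records response (z_n <= x_n) or nonresponse.  Hypothetical
# hidden F: F(x) = 1/(1 + exp(-(x - 1))), whose alpha-quantile is
# theta = 1 + log(alpha/(1 - alpha)).
import math, random

alpha = 0.75                          # quantile level to estimate

def response(x_n):
    u = random.random()
    z_n = 1.0 + math.log(u / (1.0 - u))   # one draw z_n from F
    return 1 if z_n <= x_n else 0         # y_n in {0, 1}

random.seed(2)
x = 0.0                               # x_1: our initial guess of theta
for n in range(1, 50001):
    x += (4.0 / n) * (alpha - response(x))   # a_n = 4/n, type 1/n
print(x)                              # near 1 + log(3), about 2.0986
```

Only the binary responses enter the computation, which is what makes the estimator distribution-free.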
In some applications it is more convenient to make a group of r observations at the same level before proceeding to the next level. The nth group of observations will then be

(49) y_{n,1}, ..., y_{n,r},

using the notation (44). Let ȳ_n = arithmetic mean of the values (49). Then, setting

(50) x_{n+1} = x_n + a_n(α − ȳ_n),

we have M(x) = F(x) as before, and hence (48) continues to hold.
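The grouped scheme just described can be sketched in the same way. All concrete choices below are illustrative assumptions: F is taken to be the standard logistic distribution function with α = ½ (so that θ is the median, 0), and the factor 4 in a_n is an arbitrary type-1/n choice.

```python
# Grouped variant: at each level x_n take r response/nonresponse
# observations, average them to get ybar_n, and update
# x_{n+1} = x_n + a_n*(alpha - ybar_n).
import math, random

alpha, r = 0.5, 10                    # estimate the median with groups of 10

def response(level):
    u = random.random()
    z = math.log(u / (1.0 - u))       # one draw from F(x) = 1/(1+e^(-x))
    return 1 if z <= level else 0

random.seed(3)
x = 5.0                               # deliberately poor initial guess
for n in range(1, 5001):
    ybar = sum(response(x) for _ in range(r)) / r
    x += (4.0 / n) * (alpha - ybar)   # a_n of type 1/n
print(x)                              # near the true median, 0
```

Averaging within a group reduces the noise in each update without changing the regression function M(x) = F(x).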


The possibility of using a convergent sequential process in this problem was first mentioned by T. W. Anderson, P. J. McCarthy, and J. W. Tukey in the Naval Ordnance Report No. 65-46 (1946), p. 99.

5. A more general regression problem. It is clear that the problem of Section 4 is a special case of a more general regression problem. In fact, using the notation of Section 2, consider any random variable Y which is associated with an observable value x in such a way that the conditional distribution function of Y for fixed x is H(y | x); the function M(x) is then the regression of Y on x.
The usual regression analysis assumes that M(x) is of known form with unknown parameters β_i, say

(51) M(x) = β_0 + β_1 x,

and deals with the estimation of one or both of the parameters β_i on the basis of observations y_1, y_2, ..., y_n corresponding to observed values x_1, x_2, ..., x_n. The method of least squares, for example, yields the estimators b_i which minimize the expression

(52) Σ_{j=1}^{n} [y_j − (b_0 + b_1 x_j)]².
Instead of trying to estimate the parameters β_i of M(x) under the assumption that M(x) is a linear function of x, we may try to estimate the value θ such that M(θ) = α, where α is given, without any assumption about the form of M(x). If we assume only that H(y | x) satisfies the hypotheses of Theorem 2, then the sequence of estimators {x_n} of θ defined by (7) will at least be consistent. This indicates that a distribution-free sequential system of making observations, such as that given by (7), is worth investigating from the practical point of view in regression problems.
One of us is investigating the properties of this and other sequential designs
as a graduate student; the senior author is responsible for the convergence
proof in Section 3.
