
Stochastic approximation

Stochastic approximation methods are a family of iterative methods typically used for root-finding
problems or for optimization problems. The recursive update rules of stochastic approximation methods can
be used, among other things, for solving linear systems when the collected data is corrupted by noise, or for
approximating extreme values of functions which cannot be computed directly, but only estimated via noisy
observations.

In a nutshell, stochastic approximation algorithms deal with a function of the form

$f(\theta) = \operatorname{E}_{\xi}[F(\theta, \xi)],$

which is the expected value of a function depending on a random variable $\xi$. The goal is to recover
properties of such a function $f$ without evaluating it directly. Instead, stochastic approximation algorithms
use random samples of $F(\theta, \xi)$ to efficiently approximate properties of $f$ such as zeros or extrema.

Recently, stochastic approximations have found extensive applications in the fields of statistics and machine
learning, especially in settings with big data. These applications range from stochastic optimization methods
and algorithms, to online forms of the EM algorithm, reinforcement learning via temporal differences,
deep learning, and others.[1] Stochastic approximation algorithms have also been used in the social sciences
to describe collective dynamics: fictitious play in learning theory and consensus algorithms can be studied
using their theory.[2]

The earliest, and prototypical, algorithms of this kind are the Robbins–Monro and Kiefer–Wolfowitz
algorithms introduced respectively in 1951 and 1952.

Contents
Robbins–Monro algorithm
Complexity results
Subsequent developments and Polyak–Ruppert averaging
Application in stochastic optimization
Convergence of the algorithm
Example (where the stochastic gradient method is appropriate)[8]
Kiefer–Wolfowitz algorithm
Subsequent developments and important issues
Further developments
See also
References

Robbins–Monro algorithm
The Robbins–Monro algorithm, introduced in 1951 by Herbert Robbins and Sutton Monro,[3] presented a
methodology for solving a root-finding problem, where the function is represented as an expected value.
Assume that we have a function $M(\theta)$ and a constant $\alpha$ such that the equation $M(\theta) = \alpha$ has a unique
root at $\theta^*$. It is assumed that while we cannot directly observe the function $M(\theta)$, we can instead obtain
measurements of the random variable $N(\theta)$, where $\operatorname{E}[N(\theta)] = M(\theta)$. The structure of the algorithm is
then to generate iterates of the form

$\theta_{n+1} = \theta_n - a_n (N(\theta_n) - \alpha).$

Here, $a_1, a_2, \dots$ is a sequence of positive step sizes. Robbins and Monro proved[3] (Theorem 2) that $\theta_n$
converges in $L^2$ (and hence also in probability) to $\theta^*$, and Blum[4] later proved that the convergence actually holds
with probability one, provided that:

$N(\theta)$ is uniformly bounded,
$M(\theta)$ is nondecreasing,
$M'(\theta^*)$ exists and is positive, and
the sequence $a_n$ satisfies the following requirements:

$\qquad \sum_{n=0}^{\infty} a_n = \infty \quad \text{and} \quad \sum_{n=0}^{\infty} a_n^2 < \infty.$

A particular sequence of steps which satisfies these conditions, and which was suggested by Robbins and Monro, has
the form $a_n = a/n$, for $a > 0$. Other sequences are possible, but in order to average out the noise in $N(\theta)$, the
above conditions must be met.
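
The update above is simple to implement. Below is a minimal Python sketch under illustrative assumptions: the hypothetical function noisy_measurement stands in for $N(\theta)$, and the linear $M$ and target level $\alpha = 0$ are chosen only for demonstration.

```python
import random

def robbins_monro(noisy_measurement, alpha, theta0, a=1.0, n_iters=10000):
    """Robbins-Monro iteration theta_{n+1} = theta_n - a_n (N(theta_n) - alpha)
    with the classic step sizes a_n = a / n."""
    theta = theta0
    for n in range(1, n_iters + 1):
        a_n = a / n                      # satisfies sum a_n = inf, sum a_n^2 < inf
        theta -= a_n * (noisy_measurement(theta) - alpha)
    return theta

# Illustration: M(theta) = 2*theta + 1 observed with additive Gaussian noise;
# solving M(theta) = 0 should give theta* = -0.5.
def noisy_measurement(theta):
    return 2.0 * theta + 1.0 + random.gauss(0.0, 1.0)

print(robbins_monro(noisy_measurement, alpha=0.0, theta0=5.0))  # approx -0.5
```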

Complexity results
1. If $f(\theta)$ is twice continuously differentiable, and strongly convex, and the minimizer of $f(\theta)$
belongs to the interior of $\Theta$, then the Robbins–Monro algorithm will achieve the
asymptotically optimal convergence rate, with respect to the objective function, being
$\operatorname{E}[f(\theta_n) - f^*] = O(1/n)$, where $f^*$ is the minimal value of $f(\theta)$ over $\theta \in \Theta$.[5][6]
2. Conversely, in the general convex case, where we lack both the assumption of smoothness
and strong convexity, Nemirovski and Yudin[7] have shown that the asymptotically optimal
convergence rate, with respect to the objective function values, is $O(1/\sqrt{n})$. They have also
proven that this rate cannot be improved.

Subsequent developments and Polyak–Ruppert averaging

While the Robbins–Monro algorithm is theoretically able to achieve $O(1/n)$ under the assumption of twice
continuous differentiability and strong convexity, it can perform quite poorly upon implementation. This is
primarily because the algorithm is very sensitive to the choice of the step-size sequence, and the
supposed asymptotically optimal step-size policy can be quite harmful in the beginning.[6][8]

Chung (1954)[9] and Fabian (1968)[10] showed that we would achieve the optimal convergence rate
$O(1/\sqrt{n})$ with $a_n = \nabla^2 f(\theta^*)^{-1}/n$ (or $a_n = \frac{1}{n M'(\theta^*)}$). Lai and Robbins[11][12] designed adaptive
procedures to estimate $M'(\theta^*)$ such that $\theta_n$ has minimal asymptotic variance. However, the application of
such optimal methods requires much a priori information which is hard to obtain in most situations. To
overcome this shortfall, Polyak (1991)[13] and Ruppert (1988)[14] independently developed a new optimal
algorithm based on the idea of averaging the trajectories. Polyak and Juditsky[15] also presented a method
of accelerating Robbins–Monro for linear and non-linear root-searching problems through the use of longer
steps and averaging of the iterates. The algorithm has the following structure:

$\theta_{n+1} - \theta_n = a_n (\alpha - N(\theta_n)), \qquad \bar{\theta}_n = \frac{1}{n} \sum_{i=0}^{n-1} \theta_i.$

The convergence of $\bar{\theta}_n$ to the unique root $\theta^*$ relies on the condition that the step sequence $\{a_n\}$ decreases
sufficiently slowly. That is,

A1) $a_n \rightarrow 0, \qquad \dfrac{a_n - a_{n+1}}{a_n} = o(a_n).$

Therefore, the sequence $a_n = n^{-\alpha}$ with $0 < \alpha < 1$ satisfies this restriction, but $\alpha = 1$ does not, hence
the longer steps. Under the assumptions outlined in the Robbins–Monro algorithm, the resulting
modification will result in the same asymptotically optimal convergence rate $O(1/\sqrt{n})$ yet with a more
robust step-size policy.[15] Prior to this, the idea of using longer steps and averaging the iterates had already
been proposed by Nemirovski and Yudin[16] for the cases of solving the stochastic optimization problem
with continuous convex objectives and for convex-concave saddle point problems. These algorithms were
observed to attain the nonasymptotic rate $O(1/\sqrt{n})$.
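
As an illustration, here is a minimal Python sketch of Polyak–Ruppert averaging on the same root-finding setup used above; the hypothetical noisy_measurement again stands in for $N(\theta)$, and the exponent $0.7$ is merely one admissible choice in $(0, 1)$, not a prescription from the original papers.

```python
import random

def polyak_ruppert(noisy_measurement, alpha, theta0, n_iters=10000, exponent=0.7):
    """Robbins-Monro with longer steps a_n = n^(-exponent), 0 < exponent < 1,
    returning the average of the iterates rather than the last one."""
    theta = theta0
    running_sum = 0.0
    for n in range(1, n_iters + 1):
        a_n = n ** (-exponent)           # decays more slowly than 1/n, satisfying A1)
        theta += a_n * (alpha - noisy_measurement(theta))
        running_sum += theta
    return running_sum / n_iters         # averaged iterate

# Same illustration as before: M(theta) = 2*theta + 1 plus Gaussian noise.
def noisy_measurement(theta):
    return 2.0 * theta + 1.0 + random.gauss(0.0, 1.0)

print(polyak_ruppert(noisy_measurement, alpha=0.0, theta0=5.0))  # approx -0.5
```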

A more general result is given in Chapter 11 of Kushner and Yin[17] by defining the interpolated time
$t_n = \sum_{i=0}^{n-1} a_i$, the interpolated process $\theta^n(\cdot)$ and the interpolated normalized process $U^n(\cdot)$ as

$\theta^n(t) = \theta_{n+i}, \qquad U^n(t) = (\theta_{n+i} - \theta^*)/\sqrt{a_{n+i}} \qquad \text{for} \quad t \in [t_{n+i} - t_n, t_{n+i+1} - t_n), \; i \geq 0.$

Let the iterate average be $\Theta_n = \frac{a_n}{t} \sum_{i=n}^{n + t/a_n - 1} \theta_i$ and the associated normalized error be
$\hat{U}^n(t) = \frac{\sqrt{a_n}}{t} \sum_{i=n}^{n + t/a_n - 1} (\theta_i - \theta^*)$.

With assumption A1) and the following A2)

A2) There is a Hurwitz matrix $A$ and a symmetric and positive-definite matrix $\Sigma$ such that
$\{U^n(\cdot)\}$ converges weakly to $U(\cdot)$, where $U(\cdot)$ is the stationary solution to

$dU = A U \, dt + \Sigma^{1/2} \, dw,$

where $w(\cdot)$ is a standard Wiener process,

satisfied, and define $\bar{V} = A^{-1} \Sigma (A')^{-1}$. Then for each $t$,

$\hat{U}^n(t) \stackrel{\mathcal{D}}{\longrightarrow} \mathcal{N}(0, V_t), \quad \text{where} \quad V_t = \bar{V}/t + O(1/t^2).$

The success of the averaging idea is due to the time-scale separation of the original sequence $\{\theta_n\}$ and
the averaged sequence $\{\Theta_n\}$, with the time scale of the former being faster.

Application in stochastic optimization

Suppose we want to solve the following stochastic optimization problem

$g(\theta^*) = \min_{\theta \in \Theta} \operatorname{E}[Q(\theta, X)],$

where $g(\theta) = \operatorname{E}[Q(\theta, X)]$ is differentiable and convex; then this problem is equivalent to finding the root $\theta^*$
of $\nabla g(\theta) = 0$. Here $Q(\theta, X)$ can be interpreted as some "observed" cost as a function of the chosen $\theta$ and
random effects $X$. In practice, it might be hard to get an analytical form of $\nabla g(\theta)$, but the Robbins–Monro method
manages to generate a sequence $(\theta_n)_{n \geq 0}$ to approximate $\theta^*$ if one can generate $(X_n)_{n \geq 0}$, in which the
conditional expectation of $X_n$ given $\theta_n$ is exactly $\nabla g(\theta_n)$, i.e. $X_n$ is simulated from a conditional
distribution defined by

$\operatorname{E}[H(\theta, X) \mid \theta = \theta_n] = \nabla g(\theta_n).$

Here $H(\theta, X)$ is an unbiased estimator of $\nabla g(\theta)$. If $X$ depends on $\theta$, there is in general no natural way of
generating a random outcome $H(\theta, X)$ that is an unbiased estimator of the gradient. In some special cases,
when either IPA or likelihood ratio methods are applicable, one is able to obtain an unbiased gradient
estimator $H(\theta, X)$. If $X$ is viewed as some "fundamental" underlying random process that is generated
independently of $\theta$, and under some regularization conditions for derivative-integral interchange operations
so that $\operatorname{E}\big[\frac{\partial}{\partial\theta} Q(\theta, X)\big] = \nabla g(\theta)$, then $H(\theta, X) = \frac{\partial}{\partial\theta} Q(\theta, X)$ gives the fundamental
unbiased gradient estimate. However, for some applications we have to use finite-difference methods, in which $H(\theta, X)$
has a conditional expectation close to $\nabla g(\theta)$ but not exactly equal to it.

We then define a recursion analogous to Newton's method in the deterministic algorithm:

$\theta_{n+1} = \theta_n - \varepsilon_n H(\theta_n, X_{n+1}).$
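
This recursion is the stochastic gradient method. A minimal Python sketch under illustrative assumptions follows: the hypothetical grad_estimator stands in for $H(\theta_n, X_{n+1})$, drawing $X_{n+1}$ internally, and the step choice $\varepsilon_n = 1/n$ is the natural one discussed in the convergence conditions below.

```python
import random

def stochastic_gradient(grad_estimator, theta0, n_iters=10000):
    """Recursion theta_{n+1} = theta_n - eps_n * H(theta_n, X_{n+1})
    with steps eps_n = 1/n."""
    theta = theta0
    for n in range(1, n_iters + 1):
        eps_n = 1.0 / n
        theta -= eps_n * grad_estimator(theta)   # grad_estimator samples X internally
    return theta

# Illustration: Q(theta, X) = (theta - X)^2 with X ~ N(3, 1), so g(theta) is
# minimized at theta* = E[X] = 3 and H(theta, X) = 2 (theta - X) is unbiased.
def grad_estimator(theta):
    x = random.gauss(3.0, 1.0)
    return 2.0 * (theta - x)

print(stochastic_gradient(grad_estimator, theta0=0.0))  # approx 3.0
```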

Convergence of the algorithm

The following result gives sufficient conditions on $\varepsilon_n$, $X_n$ and $g$ for the algorithm to converge:[18]

C1) $\varepsilon_n \geq 0, \; \forall \, n \geq 0.$

C2) $\sum_{n=0}^{\infty} \varepsilon_n = \infty.$

C3) $\sum_{n=0}^{\infty} \varepsilon_n^2 < \infty.$

C4) $|X_n| \leq B$, for a fixed bound $B.$

C5) $g(\theta)$ is strictly convex, i.e.

$\qquad \inf_{\delta \leq |\theta - \theta^*| \leq 1/\delta} \langle \theta - \theta^*, \nabla g(\theta) \rangle > 0 \quad \text{for every} \quad 0 < \delta < 1.$

Then $\theta_n$ converges to $\theta^*$ almost surely.

Here are some intuitive explanations about these conditions. Suppose $H(\theta_n, X_{n+1})$ is a uniformly
bounded random variable. If C2) is not satisfied, i.e. $\sum_{n=0}^{\infty} \varepsilon_n < \infty$, then

$\theta_n - \theta_0 = -\sum_{i=0}^{n-1} \varepsilon_i H(\theta_i, X_{i+1})$

is a bounded sequence, so the iteration cannot converge to $\theta^*$ if the initial guess $\theta_0$ is too far away from $\theta^*$.
As for C3), note that if $\theta_n$ converges to $\theta^*$ then

$\theta_{n+1} - \theta_n = -\varepsilon_n H(\theta_n, X_{n+1}) \rightarrow 0 \quad \text{as} \quad n \rightarrow \infty,$

so we must have $\varepsilon_n \downarrow 0$, and condition C3) ensures it. A natural choice would be $\varepsilon_n = 1/n$.
Condition C5) is a fairly stringent condition on the shape of $g(\theta)$; it gives the search direction of the
algorithm.

Example (where the stochastic gradient method is appropriate)[8]

Suppose $Q(\theta, X) = f(\theta) + \theta^T X$, where $f$ is differentiable and $X \in \mathbb{R}^p$ is a random variable
independent of $\theta$. Then $g(\theta) = \operatorname{E}[Q(\theta, X)] = f(\theta) + \theta^T \operatorname{E}X$ depends on the mean of $X$, and the
stochastic gradient method would be appropriate in this problem. We can choose

$H(\theta, X) = \frac{\partial}{\partial\theta} Q(\theta, X) = \frac{\partial}{\partial\theta} f(\theta) + X.$
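
For concreteness, here is a short Python sketch of this choice of $H$; the quadratic $f(\theta) = \|\theta\|^2$ and the Gaussian distribution of $X$ are assumptions made purely for the demonstration.

```python
import random

MU = [1.0, -2.0, 0.5]                    # illustrative mean of X

def grad_f(theta):                       # illustrative f(theta) = ||theta||^2
    return [2.0 * t for t in theta]

def draw_x():                            # X ~ N(MU, I), independent of theta
    return [random.gauss(m, 1.0) for m in MU]

def H(theta, x):                         # H(theta, X) = grad f(theta) + X
    return [g + xi for g, xi in zip(grad_f(theta), x)]

theta = [0.0] * len(MU)
for n in range(1, 20001):                # recursion with eps_n = 1/n
    h = H(theta, draw_x())
    theta = [t - (1.0 / n) * hk for t, hk in zip(theta, h)]
print(theta)                             # approx -MU/2 = [-0.5, 1.0, -0.25]
```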

Kiefer–Wolfowitz algorithm
The Kiefer–Wolfowitz algorithm was introduced in 1952 by Jacob Wolfowitz and Jack Kiefer,[19] and was
motivated by the publication of the Robbins–Monro algorithm. However, the algorithm was presented as a
method which would stochastically estimate the maximum of a function. Let $M(x)$ be a function which
has a maximum at the point $\theta$. It is assumed that $M(x)$ is unknown; however, certain observations $N(x)$,
where $\operatorname{E}[N(x)] = M(x)$, can be made at any point $x$. The structure of the algorithm follows a gradient-
like method, with the iterates being generated as follows:

$x_{n+1} = x_n + a_n \cdot \left( \frac{N(x_n + c_n) - N(x_n - c_n)}{2 c_n} \right),$

where $N(x_n + c_n)$ and $N(x_n - c_n)$ are independent, and the gradient of $M(x)$ is approximated using
finite differences. The sequence $\{c_n\}$ specifies the sequence of finite difference widths used for the
gradient approximation, while the sequence $\{a_n\}$ specifies a sequence of positive step sizes taken along
that direction. Kiefer and Wolfowitz proved that, if $M(x)$ satisfies certain regularity conditions, then $x_n$
will converge to $\theta$ in probability as $n \to \infty$, and Blum[4] later showed in 1954 that $x_n$ converges to $\theta$ almost
surely, provided that:

$\operatorname{Var}(N(x)) \leq S < \infty$ for all $x$.
The function $M(x)$ has a unique point of maximum (minimum) and is strongly concave
(convex).
The algorithm was first presented with the requirement that the function $M(\cdot)$ maintains
strong global convexity (concavity) over the entire feasible space. Given that this condition is
too restrictive to impose over the entire domain, Kiefer and Wolfowitz proposed that it is
sufficient to impose the condition over a compact set $C_0 \subset \mathbb{R}^d$ which is known to include
the optimal solution.
The function $M(x)$ satisfies the following regularity conditions:

There exists $\beta > 0$ and $B > 0$ such that $|x' - \theta| + |x'' - \theta| < \beta \Longrightarrow |M(x') - M(x'')| < B |x' - x''|.$

There exists $\rho > 0$ and $R > 0$ such that $|x' - x''| < \rho \Longrightarrow |M(x') - M(x'')| < R.$

For every $\delta > 0$, there exists some $\pi(\delta) > 0$ such that $|z - \theta| > \delta \Longrightarrow \inf_{\delta/2 > \varepsilon > 0} \frac{|M(z + \varepsilon) - M(z - \varepsilon)|}{\varepsilon} > \pi(\delta).$

The selected sequences $\{a_n\}$ and $\{c_n\}$ must be infinite sequences of positive numbers
such that

$\qquad c_n \rightarrow 0 \quad \text{as} \quad n \to \infty, \qquad \sum_{n=0}^{\infty} a_n = \infty, \qquad \sum_{n=0}^{\infty} a_n c_n < \infty, \qquad \sum_{n=0}^{\infty} a_n^2 c_n^{-2} < \infty.$

A suitable choice of sequences, as recommended by Kiefer and Wolfowitz, would be $a_n = 1/n$ and
$c_n = n^{-1/3}$.
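
A minimal one-dimensional Python sketch of this scheme, assuming a hypothetical noisy_observation function standing in for $N(x)$; the quadratic $M$ is chosen only for illustration.

```python
import random

def kiefer_wolfowitz(noisy_observation, x0, n_iters=10000):
    """Kiefer-Wolfowitz: ascend along a central finite-difference gradient
    estimate, with the recommended sequences a_n = 1/n and c_n = n^(-1/3)."""
    x = x0
    for n in range(1, n_iters + 1):
        a_n = 1.0 / n
        c_n = n ** (-1.0 / 3.0)
        # two independent noisy observations per iteration
        grad_est = (noisy_observation(x + c_n) - noisy_observation(x - c_n)) / (2.0 * c_n)
        x += a_n * grad_est
    return x

# Illustration: M(x) = -(x - 2)^2, maximized at theta = 2, observed with noise.
def noisy_observation(x):
    return -(x - 2.0) ** 2 + random.gauss(0.0, 0.1)

print(kiefer_wolfowitz(noisy_observation, x0=0.0))  # approx 2.0
```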

Subsequent developments and important issues


1. The Kiefer–Wolfowitz algorithm requires that for each gradient computation, at least $d + 1$
different parameter values must be simulated for every iteration of the algorithm, where $d$ is
the dimension of the search space. This means that when $d$ is large, the Kiefer–Wolfowitz
algorithm will require substantial computational effort per iteration, leading to slow
convergence.
1. To address this problem, Spall proposed the use of simultaneous perturbations to
estimate the gradient. This method would require only two simulations per iteration,
regardless of the dimension $d$ (see the sketch after this list).[20]
2. In the conditions required for convergence, it can be difficult to specify a predetermined compact
set that fulfills strong convexity (or concavity) and contains the unique solution. With
respect to real-world applications, if the domain is quite large, these
assumptions can be fairly restrictive and highly unrealistic.
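
The following Python sketch illustrates the simultaneous-perturbation idea in its basic form: all $d$ coordinates are perturbed at once along a random $\pm 1$ direction, so only two noisy evaluations are needed per step regardless of $d$. The loss function and gain sequences are illustrative assumptions, not the adaptive scheme of [20].

```python
import random

def spsa_minimize(noisy_loss, theta0, n_iters=20000):
    """Simultaneous-perturbation gradient estimate: perturb every coordinate
    at once with a random +/-1 vector, using two noisy loss evaluations per
    iteration regardless of the dimension d."""
    theta = list(theta0)
    d = len(theta)
    for n in range(1, n_iters + 1):
        a_n = 0.5 / n                    # step-size sequence (illustrative)
        c_n = n ** (-1.0 / 3.0)          # perturbation-width sequence (illustrative)
        delta = [random.choice((-1.0, 1.0)) for _ in range(d)]
        plus = noisy_loss([t + c_n * dk for t, dk in zip(theta, delta)])
        minus = noisy_loss([t - c_n * dk for t, dk in zip(theta, delta)])
        # one common difference quotient, rescaled per coordinate by 1/delta_k
        theta = [t - a_n * (plus - minus) / (2.0 * c_n * dk)
                 for t, dk in zip(theta, delta)]
    return theta

# Illustration: minimize E[||theta - (1, 2, 3)||^2 + noise].
def noisy_loss(theta):
    target = [1.0, 2.0, 3.0]
    return sum((t - s) ** 2 for t, s in zip(theta, target)) + random.gauss(0.0, 0.1)

print(spsa_minimize(noisy_loss, theta0=[0.0, 0.0, 0.0]))  # approx [1, 2, 3]
```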

Further developments
An extensive theoretical literature has grown up around these algorithms, concerning conditions for
convergence, rates of convergence, multivariate and other generalizations, proper choice of step size,
possible noise models, and so on.[21][22] These methods are also applied in control theory, in which case
the unknown function which we wish to optimize or find the zero of may vary in time. In this case, the step
size should not converge to zero but should be chosen so as to track the function ([21], 2nd ed., chapter 3).

C. Johan Masreliez and R. Douglas Martin were the first to apply stochastic approximation to robust
estimation.[23]

The main tool for analyzing stochastic approximation algorithms (including the Robbins–Monro and the
Kiefer–Wolfowitz algorithms) is a theorem by Aryeh Dvoretzky published in the proceedings of the third
Berkeley symposium on mathematical statistics and probability, 1956.[24]

See also
Stochastic gradient descent
Stochastic variance reduction

References
1. Toulis, Panos; Airoldi, Edoardo (2015). "Scalable estimation strategies based on stochastic
approximations: classical results and new insights" (https://www.ncbi.nlm.nih.gov/pmc/article
s/PMC4484776). Statistics and Computing. 25 (4): 781–795. doi:10.1007/s11222-015-9560-
y (https://doi.org/10.1007%2Fs11222-015-9560-y). PMC 4484776 (https://www.ncbi.nlm.nih.
gov/pmc/articles/PMC4484776). PMID 26139959 (https://pubmed.ncbi.nlm.nih.gov/2613995
9).
2. Le Ny, Jerome. "Introduction to Stochastic Approximation Algorithms" (http://www.professeur
s.polymtl.ca/jerome.le-ny/teaching/DP_fall09/notes/lec11_SA.pdf) (PDF). Polytechnique
Montreal. Teaching Notes. Retrieved 16 November 2016.
3. Robbins, H.; Monro, S. (1951). "A Stochastic Approximation Method" (https://doi.org/10.121
4%2Faoms%2F1177729586). The Annals of Mathematical Statistics. 22 (3): 400.
doi:10.1214/aoms/1177729586 (https://doi.org/10.1214%2Faoms%2F1177729586).
4. Blum, Julius R. (1954-06-01). "Approximation Methods which Converge with Probability
one" (https://doi.org/10.1214%2Faoms%2F1177728794). The Annals of Mathematical
Statistics. 25 (2): 382–386. doi:10.1214/aoms/1177728794 (https://doi.org/10.1214%2Faom
s%2F1177728794). ISSN 0003-4851 (https://www.worldcat.org/issn/0003-4851).
5. Sacks, J. (1958). "Asymptotic Distribution of Stochastic Approximation Procedures" (https://d
oi.org/10.1214%2Faoms%2F1177706619). The Annals of Mathematical Statistics. 29 (2):
373–405. doi:10.1214/aoms/1177706619 (https://doi.org/10.1214%2Faoms%2F117770661
9). JSTOR 2237335 (https://www.jstor.org/stable/2237335).
6. Nemirovski, A.; Juditsky, A.; Lan, G.; Shapiro, A. (2009). "Robust Stochastic Approximation
Approach to Stochastic Programming". SIAM Journal on Optimization. 19 (4): 1574.
doi:10.1137/070704277 (https://doi.org/10.1137%2F070704277).
7. Problem Complexity and Method Efficiency in Optimization, A. Nemirovski and D. Yudin,
Wiley-Intersci. Ser. Discrete Math. 15, John Wiley, New York (1983).
8. Introduction to Stochastic Search and Optimization: Estimation, Simulation and Control (http
s://books.google.com/books?id=f66OIvvkKnAC&printsec=frontcover#v=onepage&q=%22Ro
bbins-Monro%22&f=false), J.C. Spall, John Wiley Hoboken, NJ, (2003).
9. Chung, K. L. (1954-09-01). "On a Stochastic Approximation Method" (https://doi.org/10.121
4%2Faoms%2F1177728716). The Annals of Mathematical Statistics. 25 (3): 463–483.
doi:10.1214/aoms/1177728716 (https://doi.org/10.1214%2Faoms%2F1177728716).
ISSN 0003-4851 (https://www.worldcat.org/issn/0003-4851).
10. Fabian, Vaclav (1968-08-01). "On Asymptotic Normality in Stochastic Approximation" (http
s://doi.org/10.1214%2Faoms%2F1177698258). The Annals of Mathematical Statistics. 39
(4): 1327–1332. doi:10.1214/aoms/1177698258 (https://doi.org/10.1214%2Faoms%2F1177
698258). ISSN 0003-4851 (https://www.worldcat.org/issn/0003-4851).
11. Lai, T. L.; Robbins, Herbert (1979-11-01). "Adaptive Design and Stochastic Approximation"
(https://doi.org/10.1214%2Faos%2F1176344840). The Annals of Statistics. 7 (6): 1196–
1221. doi:10.1214/aos/1176344840 (https://doi.org/10.1214%2Faos%2F1176344840).
ISSN 0090-5364 (https://www.worldcat.org/issn/0090-5364).
12. Lai, Tze Leung; Robbins, Herbert (1981-09-01). "Consistency and asymptotic efficiency of
slope estimates in stochastic approximation schemes". Zeitschrift für
Wahrscheinlichkeitstheorie und Verwandte Gebiete. 56 (3): 329–360.
doi:10.1007/BF00536178 (https://doi.org/10.1007%2FBF00536178). ISSN 0044-3719 (http
s://www.worldcat.org/issn/0044-3719). S2CID 122109044 (https://api.semanticscholar.org/C
orpusID:122109044).
13. Polyak, B T (1990-01-01). "New stochastic approximation type procedures. (In Russian.)" (ht
tps://www.researchgate.net/publication/236736759). 7 (7).
14. Ruppert, D. "Efficient estimators from a slowly converging robbins-monro process" (https://w
ww.researchgate.net/publication/242608650).
15. Polyak, B. T.; Juditsky, A. B. (1992). "Acceleration of Stochastic Approximation by
Averaging". SIAM Journal on Control and Optimization. 30 (4): 838. doi:10.1137/0330046 (ht
tps://doi.org/10.1137%2F0330046).
16. On Cezari's convergence of the steepest descent method for approximating saddle points of
convex-concave functions, A. Nemirovski and D. Yudin, Dokl. Akad. Nauk SSR 2939, (1978
(Russian)), Soviet Math. Dokl. 19 (1978 (English)).
17. Kushner, Harold; Yin, G. George (2003-07-17). Stochastic Approximation and Recursive
Algorithms and Applications (https://www.springer.com/us/book/9780387008943). Springer.
ISBN 9780387008943. Retrieved 2016-05-16.
18. Bouleau, N.; Lepingle, D. (1994). Numerical Methods for stochastic Processes (https://book
s.google.com/books?id=9MLL2RN40asC). New York: John Wiley. ISBN 9780471546412.
19. Kiefer, J.; Wolfowitz, J. (1952). "Stochastic Estimation of the Maximum of a Regression
Function" (https://doi.org/10.1214%2Faoms%2F1177729392). The Annals of Mathematical
Statistics. 23 (3): 462. doi:10.1214/aoms/1177729392 (https://doi.org/10.1214%2Faoms%2F
1177729392).
20. Spall, J. C. (2000). "Adaptive stochastic approximation by the simultaneous perturbation
method". IEEE Transactions on Automatic Control. 45 (10): 1839–1853.
doi:10.1109/TAC.2000.880982 (https://doi.org/10.1109%2FTAC.2000.880982).
21. Kushner, H. J.; Yin, G. G. (1997). Stochastic Approximation Algorithms and Applications.
doi:10.1007/978-1-4899-2696-8 (https://doi.org/10.1007%2F978-1-4899-2696-8). ISBN 978-
1-4899-2698-2.
22. Stochastic Approximation and Recursive Estimation, Mikhail Borisovich Nevel'son and
Rafail Zalmanovich Has'minskiĭ, translated by Israel Program for Scientific Translations and
B. Silver, Providence, RI: American Mathematical Society, 1973, 1976. ISBN 0-8218-1597-0.
23. Martin, R.; Masreliez, C. (1975). "Robust estimation via stochastic approximation". IEEE
Transactions on Information Theory. 21 (3): 263. doi:10.1109/TIT.1975.1055386 (https://doi.o
rg/10.1109%2FTIT.1975.1055386).
24. Dvoretzky, Aryeh (1956-01-01). "On Stochastic Approximation" (http://projecteuclid.org/eucli
d.bsmsp/1200501645). The Regents of the University of California.
