NONLINEAR GOSSIP∗
Downloaded 02/10/20 to 103.21.127.60. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php
Key words. gossip algorithms, distributed algorithms, stochastic approximation, two time
scales, Benaim theorem
DOI. 10.1137/140992588
400 076, India (borkar.vs@gmail.com). This author's research was supported in part by a J. C. Bose Fellowship
and a grant for “Distributed computation for optimization over large networks and high dimensional
data analysis” from the Department of Science and Technology, Government of India.
provide in each case a meta-theorem in the tradition of Benaim [3] that character-
izes possible asymptotic behavior of the scheme, covering potentially nonconvergent
cases as well. For specific instances thereof, this result may be leveraged to exploit
additional structure of the problem in order to obtain sharper results.
For simplicity, we initially confine ourselves to the synchronous case where the
distributed computation runs on a common clock. Asynchrony with its concomitant
issues (differing clocks, delays, etc.) leads to additional complications. We address
these separately in a later section.
The paper is in three parts. Part I (sections 2–4) formulates the state-dependent
averaging model (section 2) and analyzes its asymptotic behavior (section 3), inclusive
of a stability test (section 4). We dub this the “quasi-linear” case. Part II, the
“fully nonlinear” model (sections 5–6), introduces the fully nonlinear case motivated
by projected optimization schemes (section 5) and analyzes its asymptotic behavior
(section 6). In Part III, we consider briefly the asynchronous case in section 7 and
conclude in section 8 with a discussion of some straightforward variants and some
research issues for the future.
I. Quasi-linear gossip.
2. Preliminaries. Consider a finite connected directed graph G = (V, E), where
V, E denote, respectively, its node and edge sets with |V| = N . Without loss of
generality, we may label V as {1, 2, . . . , N }. We assume that G is irreducible, i.e.,
there is a directed path from each node to any other node. Let
N (i) := {j ∈ V : (i, j) ∈ E}
denote the set of neighbors of node i. We are given a family of irreducible aperiodic
stochastic matrices Px = [[px (j|i)]]i,j∈V compatible with G, indexed by x ∈ Rd×N , d ≥
1, such that the map x 7→ Px is Lipschitz. We further assume that
(1)    min⁺_{j∈N(i)} p_x(j|i) > Δ    ∀ x ∈ R^{d×N}, i ∈ V,
where ∆ > 0 and (min)+ denotes the minimum over all nonzero elements. Node i
performs the d-dimensional iteration
(2)    x^i(n+1) = Σ_{j∈N(i)} p_{x(n)}(j|i) x^j(n) + a(n)[h_i(x(n)) + M^i(n+1)],
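As an illustration, iteration (2) can be simulated directly. The sketch below is a minimal example under stated assumptions, none of which come from the paper: a 4-node ring graph, uniform neighbor-and-self weights standing in for p_x(j|i) (taken state-independent for simplicity), a hypothetical local drift h_i(x) = −x_i, and small i.i.d. noise for M^i(n+1).

```python
import numpy as np

rng = np.random.default_rng(0)

N = 4  # nodes on a ring graph (hypothetical example)
neighbors = {i: [(i - 1) % N, (i + 1) % N] for i in range(N)}

def P_x(x):
    """Stand-in for the state-dependent stochastic matrix P_x: uniform
    weights over each node's neighbors and itself (state-independent here)."""
    P = np.zeros((N, N))
    for i in range(N):
        for j in neighbors[i] + [i]:
            P[i, j] = 1.0 / (len(neighbors[i]) + 1)
    return P

def h(x):
    """Hypothetical local drift pulling each component toward 0."""
    return -x

def a(n):
    """Standard step-size schedule: a(n) -> 0, sum a(n) = infinity."""
    return 1.0 / (n + 1)

x = rng.normal(size=N)
for n in range(5000):
    noise = 0.01 * rng.normal(size=N)        # martingale-difference noise M(n+1)
    x = P_x(x) @ x + a(n) * (h(x) + noise)   # iteration (2) with d = 1

# the iterates reach consensus (here at the equilibrium 0 of the drift)
assert np.max(np.abs(x - x.mean())) < 0.05
```

Each step first averages over neighbors (the gossip term) and then applies a vanishing stochastic-approximation correction, which is exactly the structure of (2).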
its initial data with respect to the stationary distribution π of P. In the presence of noise,
one can show consensus, in fact convergence to a common value, but not necessarily
to the above average. This is because the iteration does not have a single equilibrium
point, but a one-dimensional subspace of equilibria corresponding to the span of its
eigenvector [1, 1, . . . , 1]T . In the deterministic case, we get a limit dependent on the
initial condition as described above, but in the noisy case we may get a random limit
point on this line which loses track of the initial condition. See, e.g., [9], where this
problem is handled by proposing an alternative algorithm. Whether this matters will
depend on the application. For example, in Example 1 below, the specifics of the sta-
tionary distribution πx matter and therefore the above concern is real. In the motion
coordination problem sketched in Example 3 below, it may not matter to the same
extent, since the objective of the averaging term is essentially consensus. We return
to this issue in the discussion on future directions at the end.
3. Condition (1) is convenient to work with from the point of view of reducing notation, but can be easily relaxed to, e.g., a similar condition for a finite product P̃_x̃ := Π_{m=1}^{M} P_{x_m} for x̃ := [x_1, . . . , x_M ] and some M ≥ 1.
We call (2) a quasi-linear gossip. The first scheme of the form (2) with a constant
P appeared in [30]. There have been other subsequent works in this direction; see,
e.g., [14], [21] for two recent instances. Our aim here is to give a broad analysis of
this scheme in a very general framework and to derive a Benaim-type meta-theorem
regarding its asymptotic behavior [3]. For simplicity of exposition, we take d = 1
henceforth, though the results are completely general.
Let x = [x_1, . . . , x_N]^T denote a generic element of R^N. Likewise, x(n) := [x_1(n), . . . , x_N(n)]^T. Define h : R^N → R^N by
Let πx denote the unique stationary distribution for Px , written as a row vector, and
1 := the column vector of all 1’s in RN . Let Px∗ denote the rank one matrix 1πx .
Then
(7)    P_x^n → P_x^*  uniformly as n ↑ ∞.
This is well posed under our hypotheses. The right-hand side is independent of i and
hence the set A := {x ∈ RN : xi = xj ∀ i, j} is invariant under (8). On A, (8)
decouples into N uncoupled copies of the scalar o.d.e.,
(9)    ẏ(t) = Σ_{i=1}^{N} π_{y(t)1}(i) h_i(y(t)1).
Recall that an invariant set B of a well posed o.d.e. is said to be internally chain transitive if for any x, y ∈ B and ε, T > 0, we can find n ≥ 1 and x_0 = x, x_1, . . . , x_n = y in B such that for 0 ≤ i < n, the trajectory of this o.d.e. initiated at x_i meets the ε-neighborhood of x_{i+1} after a time ≥ T [3]. Our main result below is a counterpart of
the corresponding result for stochastic approximation from [3], stated here for general
d ≥ 1. For this purpose, let A := {x = [(x^1)^T : · · · : (x^N)^T]^T ∈ (R^d)^N : x^i = [x^i_1, . . . , x^i_d]^T, 1 ≤ i ≤ N; x^i_k = x^j_k ∀ i, j, k}. This reduces to the earlier definition for
d = 1. The o.d.e. (9) gets replaced by
(10)    ẏ(t) = Σ_{i=1}^{N} π_{ψ(y(t))}(i) h_i(ψ(y(t))),
These are the transition probabilities for the Metropolis–Hastings scheme for Markov
chain Monte Carlo, except here they are normalized weights for neighboring nodes
π_x(i) = e^{−f(x_i)/T} / Σ_{j∈V} e^{−f(x_j)/T}.

Thus π_x puts greater weight on nodes j that have lower values of f(x_j). This is a
distributed optimization scheme for which numerical experiments show promising re-
sults [15]. Here we want to highlight the fact that x-dependent averaging weights give
us an additional handle to control the asymptotic behavior of stochastic approxima-
tion with gossip. One can think of this scheme as “leaderless swarm optimization”:
In classical particle swarm optimization, each agent makes an incremental update
based on her own gradient, that of her neighbors, and that of the “leader,” meaning
the current best performer. The last mentioned aspect requires keeping track of the
current best, introducing nonlocal computation. The above scheme automatically
concentrates weights on the so-called leader(s) by adapting the probability weights.
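The weight formula above is a softmax over (negated, temperature-scaled) node values. A minimal sketch of how such weights behave, with purely illustrative f-values and temperature:

```python
import numpy as np

def pi_x(f_vals, T=1.0):
    """Normalized Metropolis-Hastings-style weights pi_x(i) proportional to
    exp(-f(x_i)/T); shifted by the minimum for numerical stability."""
    f_vals = np.asarray(f_vals, dtype=float)
    w = np.exp(-(f_vals - f_vals.min()) / T)
    return w / w.sum()

# nodes with lower f-values receive more weight; node 1 is the "leader" here
f_vals = [3.0, 1.0, 2.0]
w = pi_x(f_vals, T=0.5)
assert int(np.argmax(w)) == 1
assert abs(w.sum() - 1.0) < 1e-9
```

Lowering T concentrates nearly all of the weight on the current best performer(s), which is the "leaderless" concentration effect described above.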
Example 2. More generally, consider the situation when hi (x) is of the form g(xi ),
i.e., hi = g ◦ Γi ∀i, where Γi is the projection from the N -fold product (Rd )N to its
ith factor space Rd . This is an important special case which can be interpreted as the
stochastic approximation or learning component of the iteration being strictly local.
The additional structure offers some further simplifications. Since asymptotically,
xi (t) ≈ xj (t) for i 6= j, (10) simplifies to N identical copies of a trajectory of the
d-dimensional o.d.e.
This is only a suggestive simple example. Attractors that are not equilibria are
endemic in multiagent learning schemes [16].
Example 3. Consider minimization of a separable function f := Σ_{i=1}^{m} α_i f_i(x) : R^d → R, where m is large. We assume that f_i : R^d → R (and hence f) are differentiable. Without loss of generality, assume that α_i > 0 (since we can absorb the negative sign in f_i). Then write f = (Σ_{i=1}^{m} α_i) Σ_{i=1}^{m} π_i f_i(x), where π_i = α_i / Σ_{i=1}^{m} α_i can be thought of as elements of the stationary distribution of some irreducible stochastic matrix P. Note that (1) is trivially satisfied here. The ith computational node runs the following d-dimensional iteration:

x^i(n+1) = Σ_{j∈N(i)} p(j|i) x^j(n) + a(n)(−∇f_i(x^i(n)) + M^i(n+1)),
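A minimal numerical sketch of this distributed gradient scheme, under illustrative assumptions not taken from the paper: quadratic components f_i(x) = (x − c_i)²/2 on d = 1, and a fixed birth-death averaging matrix whose stationary distribution is computed by detailed balance.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 3
# hypothetical separable objective: f_i(x) = (x - c_i)^2 / 2, grad f_i(x) = x - c_i
c = np.array([0.0, 1.0, 2.0])

# a fixed irreducible stochastic matrix on a line graph (illustrative choice)
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi = np.array([0.25, 0.50, 0.25])  # its stationary distribution (detailed balance)
x_star = pi @ c                    # minimizer of sum_i pi_i f_i, here 1.0

x = rng.normal(size=N)
for n in range(20000):
    a_n = 1.0 / (n + 10)
    grad = x - c                          # component i holds grad f_i(x_i)
    noise = 0.01 * rng.normal(size=N)
    x = P @ x + a_n * (-grad + noise)     # the d-dimensional iteration (d = 1)

# all nodes agree on the pi-weighted minimizer
assert np.max(np.abs(x - x_star)) < 0.05
```

The iterates reach consensus and the common value solves Σ_i π_i ∇f_i(y) = 0, i.e., the π-weighted problem, consistent with the rewriting of f above.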
for some C > 0, α ∈ (0, 1). We proceed through a sequence of lemmas. The first
lemma below is of independent interest.
Lemma 1. Under (A1), (A2), (A3), and (A4), iteration (13) achieves a.s. consensus, i.e.,

lim_{n↑∞} max_{i,j} ‖x^i(n) − x^j(n)‖ = 0  a.s.
where
Thus
‖(n, m)‖ ≤ Cα^m ‖x(n)‖ + Σ_{i=0}^{m−1} a(n+i) Cα^{m−1−i} ‖h(x(n+i))‖ + Σ_{i=0}^{m−1} a(n+i) Cα^{m−1−i} ‖M(n+i+1)‖.

By (5), the first term on the r.h.s. → 0 as m ↑ ∞. By (5) and the fact that a(n) → 0, so does the second term. By (5) and (3), Σ_{i=0}^{n} a(i)M(i+1) is a square-integrable {F_n}-martingale with a.s. convergent quadratic variation process, hence it converges a.s. by Proposition VII-2-3(c) of [22, p. 149], recalled as Theorem A in the appendix. Thus a(n)‖M(n+1)‖ → 0 a.s. This implies that the third term → 0 a.s. Thus (n, m) → 0 a.s. as n, m ↑ ∞. Recall once again that U_k has equal rows ∀ k and hence it gives a constant vector whenever it left multiplies a column vector. Now write x(n) = x(⌈n/2⌉ + ⌊n/2⌋) to conclude.
The intuition for the following two lemmas can be stated as follows. Lemma 2 says that as n increases, x(n) moves slowly and so does P_n := P_{x(n)} due to the Lipschitz condition. Hence for a sufficiently large m, as n increases U(n, m) tracks P^*_{x(n)} as stated by Lemma 3 below. In order to formally prove this we need the stronger hypothesis (A2′).
Lemma 2. Under (A1), (A2′), (A3), and (A4), almost surely, for any η > 0, m ≥ 0, there exists a (possibly random) n_0 ≥ 1 such that
where

˘(n, m) = Σ_{i=0}^{m−1} a(n+i) ‖U_{n+i} − P^*_{x̄(t(n+i))}‖ ‖h(x̄(t(n+i)))‖ + Σ_{i=0}^{m−1} a(n+i) ‖U_{n+i} − P^*_{x̄(t(n+i))}‖ ‖M(n+i+1)‖.
Proof of Theorem 1. The proof follows that of Theorem 2 in [8, pp. 15–16], which
is based on [3]. We shall denote by Φ_t : (R^d)^N → (R^d)^N the flow associated with (10), i.e., Φ_t(x) := x(t) when x(·) satisfies (10) with x(0) = x. This is a flow of homeomorphisms [2]. Fix a sample point where (5) and Lemma 4 hold. Let A denote the set ∩_{t≥0} closure{x̄(s) : s ≥ t}. Since x̄(·) is continuous and bounded, closure{x̄(s) : s ≥ t}, t ≥ 0, is a nested family of nonempty compact and connected sets. A, being the intersection thereof, will also be nonempty compact and connected. Also, by Lemma 1, A ⊂ A.
Then min_{y∈A} ‖x̄(t) − y‖ → 0. Since x̄(·) is obtained from the iterates {x(n)} of (2) by interpolation, we have min_{y∈A} ‖x(n) − y‖ → 0. In fact, for any ε > 0, let A^ε := {x : min_{y∈A} ‖x − y‖ < ε}. Then (A^ε)^c ∩ (∩_{t≥0} closure{x̄(s) : s ≥ t}) = ∅. Hence by the finite intersection property of families of compact sets, (A^ε)^c ∩ closure{x̄(s) : s ≥ t_0} = ∅ for some t_0 > 0. That is, x̄(t_0 + ·) ∈ A^ε. Conversely, if x ∈ A, there exist s_n ↑ ∞ in [0, ∞) such that x̄(s_n) → x. This is immediate from the definition of A. In fact, we have

max_{s∈[t(n),t(n+1)]} ‖x̄(s) − x̄(t(n))‖ = O(a(n)) → 0
as n → ∞. Thus we may take sn = t(m(n)) for suitable {m(n)} without any loss
of generality. Let x̃(·) denote the trajectory of (8) with x̃(0) = x. Then by the first
part of Lemma 4, it follows that kx̄(sn + t) − Φt (x̄(sn ))k → 0. On the other hand,
the continuity of the map Φt leads to Φt (x̄(sn )) → Φt (x) = x̃(t) ∀t > 0. Hence
x̄(sn + t) → x̃(t), implying that x̃(t) ∈ A as well. A similar argument works for t < 0,
using the second part of Lemma 4. Thus A is invariant under (8).
Let x̃_1, x̃_2 ∈ A and fix ε > 0, T > 0. Pick ε/4 > δ > 0 such that if ‖z − y‖ < δ and x̂_z(·), x̂_y(·) are solutions to (8) with initial conditions z, y, respectively, then max_{t∈[0,2T]} ‖x̂_z(t) − x̂_y(t)‖ < ε/4. Also pick n_0 > 1 such that n ≥ n_0 implies that x̄(t(n) + ·) ∈ A^δ and
Pick n_2 > n_1 ≥ n_0 such that ‖x̄(t(n_i)) − x̃_i‖ < δ, i = 1, 2, and t(n_2) − t(n_1) ≥ T. Let kT ≤ t(n_2) − t(n_1) < (k+1)T for some integer k ≥ 1 and let s(0) = t(n_1), s(i) = s(0) + iT for 1 ≤ i < k, and s(k) = t(n_2). Then for 0 ≤ i < k, sup_{t∈[s(i),s(i+1)]} ‖x̄(t) − x^{s(i)}(t)‖ < δ. Pick x̂_i, 0 ≤ i ≤ k, in A such that x̂_1 = x̃_1, x̂_k = x̃_2, and for 0 < i < k, x̂_i are in the δ-neighborhood of x̄(s(i)). The sequence (s(i), x̂_i), 0 ≤ i ≤ k, verifies the definition of internal chain transitivity: If x^*_i(·) denote
falls outside the unit ball. Define x̂(T⁻_{n+1}) = x̄(T⁻_{n+1})/r(n). Let M̂(k+1) = M(k+1)/r(n) for k ∈ [m(n), m(n+1)). Note that (4) implies
is a.s. convergent.
Proof. Since ‖M̂(k+1)‖ ≤ K(1 + ‖x̂(t(k))‖), we have

‖a(k) P_{r(n)x̂(t(k))} M̂(k+1)‖² ≤ a(k)² (sup ‖P‖²) K² (1 + ‖x̂(t(k))‖²),

where the supremum is over all stochastic matrices P. Hence

Σ_k E[ ‖a(k) P_{r(n)x̂(t(k))} M̂(k+1)‖² | F_k ] ≤ Σ_k a(k)² (sup ‖P‖²) K² (1 + ‖x̂(t(k))‖²),
‖x̂(t(m(n)+k))‖_∞ ≤ ‖x̂(t(m(n)))‖_∞ + Σ_{i=0}^{k−1} a(m(n)+i) ‖h_{r(n)}(x̂(t(m(n)+i)))‖_∞ + Σ_{i=0}^{k−1} a(m(n)+i) ‖M̂(m(n)+i+1)‖_∞

≤ ‖x̂(t(m(n)))‖_∞ + Σ_{i=0}^{k−1} a(m(n)+i) ( ‖h(0)‖_∞ + K′ + (L + K′) ‖x̂(t(m(n)+i))‖_∞ )

≤ (L + K′) Σ_{i=0}^{k−1} a(m(n)+i) ‖x̂(t(m(n)+i))‖_∞ + (‖h(0)‖_∞ + K′)(T + 1) + β.

Here K′ corresponds to K in (4) when the norm used is the sup-norm, L is the Lipschitz constant of h and therefore of h_{r(n)} under the sup-norm, and β is a positive constant such that ‖x‖_∞ ≤ β‖x‖ ∀ x ∈ R^n. The second inequality follows from (4) and the Lipschitz property of h_{r(n)}. The third inequality follows from ‖x̂(t(m(n)))‖ ≤ 1. By the discrete Gronwall inequality,

‖x̂(t(m(n)+k))‖_∞ ≤ [(‖h(0)‖_∞ + K′)(T + 1) + β] e^{(L+K′)(T+1)}.
Hence by equivalence of norms, ‖x̂(t(m(n)+k))‖ ≤ K^* for some K^* > 0 independent of n. Hence x̂ remains bounded a.s. on [T_n, T_{n+1}]. We can now mimic the arguments in the previous section, using Lemma 7, to prove the claim.
This leads to the main theorem.
Theorem 2. Under (A1), (A2′), (A3), and (A5), sup_n ‖x(n)‖ < ∞ a.s., i.e., (A4) holds.
Proof. We prove sup_n ‖x̄(T_n)‖ < ∞ a.s. If not, then there exists a subsequence {n_k} such that ‖x̄(T_{n_k})‖ ↑ ∞, i.e., r_{n_k} ↑ ∞. By Lemma 5 there exist c_0 > 0 and T > 0 such that for all initial conditions on the unit sphere, ‖φ_c(x, t)‖ ≤ 1 − ε_0 for t ∈ [T, T + 1], c > c_0 (≥ 0, by assumption). If r_n > c_0, ‖x̂(T_n)‖ = ‖x^n(T_n)‖ = 1, and ‖x^n(T⁻_{n+1})‖ ≤ 1 − ε_0. Then by Lemma 8, ‖x̂(T⁻_{n+1})‖ < 1 − ε′_0 for some 0 < ε′_0 < ε_0. Thus for r_n > c_0 and n sufficiently large,

‖x̄(T⁻_{n+1})‖ / ‖x̄(T_n)‖ = ‖x̂(T⁻_{n+1})‖ / ‖x̂(T_n)‖ < 1 − ε′_0.
The rest of the argument is similar to Theorem 7 in Chapter 3 of [8], with 1/2 replaced by 1 − ε′_0 < 1: We conclude that if ‖x̄(T_n)‖ > c_0, then x̄(T_k), k ≥ n, falls back to the ball of radius c_0 at an exponential rate. Thus if ‖x̄(T_n)‖ > c_0, ‖x̄(T_{n−1})‖ is either even greater than ‖x̄(T_n)‖ or is inside the ball of radius c_0. Then there must be an instance prior to n when x̄(·) jumps from inside this ball to outside the ball of radius 0.9 r_n. Thus, corresponding to the sequence r_{n_k} ↑ ∞, we would have a sequence of jumps of x̄(T_n) from inside the ball of radius c_0 to points increasingly far away from the origin. But, by the discrete Gronwall inequality (see, e.g., [8, p. 146]), there is a bound on the amount by which ‖x̄(·)‖ can increase over an interval of length T + 1 if it is inside the ball of radius c_0 at the beginning of the interval. This is a contradiction. Thus C̃ := sup_n ‖x̄(T_n)‖ < ∞, which implies sup_n ‖x(n)‖ ≤ C̃K^* < ∞ for K^* = K^*(T) as in the Gronwall inequality.
This extends the result of [10] which was motivated by reinforcement learning
applications where the h in question has linear growth and h∞ picks only its linearly
growing terms, killing everything else that had sublinear growth. As a global Lipschitz condition on h implies at most linear growth, the above criterion is often useful for such schemes. One can, however, conceive of other normalizations to handle specific situations that do not fit the above model. For example, for h(x) = −x³ + g(x), x ∈ R, with g at most quadratic, h_c(x) should be defined as h(cx)/c³.
II. The fully nonlinear case.
5. Preliminaries. Consider the n-dimensional projected stochastic iteration
given by
min g(x)
s.t. x ∈ Ω.
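A minimal numerical sketch of a projected stochastic iteration for a problem of this form, under purely illustrative assumptions (a quadratic g and a box constraint set Ω = [0, 1]², neither from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

# hypothetical instance: g(x) = ||x - b||^2 / 2 over the box Omega = [0, 1]^2
b = np.array([1.5, -0.5])

def grad_g(x):
    return x - b

def project(x):
    """Projection onto Omega (the role played by the map f in the text)."""
    return np.clip(x, 0.0, 1.0)

x = np.zeros(2)
for n in range(5000):
    a_n = 1.0 / (n + 1)
    noise = 0.05 * rng.normal(size=2)
    x = project(x + a_n * (-grad_g(x) + noise))  # projected stochastic iteration

# the constrained minimizer is b clipped to the box: (1.0, 0.0)
assert np.max(np.abs(x - np.array([1.0, 0.0]))) < 0.05
```

Each noisy gradient step is followed by the projection, so the iterates stay in Ω and converge to the constrained minimizer rather than the unconstrained one.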
where the last statement follows from uniform continuity of f on compact sets. We have from (B1) that f^k is continuous ∀ k. Then by repeating the above steps, we have for each m ≥ 1,
We now prove that lim_{m↑∞} lim_{m≤n(k)↑∞} P(x_{n(k)−m}) = x^*. Let ε > 0 be given. By (B2), ∃ M such that ∀ m ≥ M, ‖P(x_n) − f^m(x_n)‖ ≤ ε/2 ∀ n. By (23), for a given m, ∃ k_m ≥ m such that ∀ k ≥ k_m, ‖f^m(x_{n(k)−m}) − x^*‖ ≤ ε/2. Pick m = 2M and thus ∀ k ≥ k_m,
where the additional term a(n)ε_n is due to the second order term in the Taylor series expansion. Furthermore, (B7) and (5) ensure that Σ_n a(n)² E[‖M_{n+1}‖⁴ | F_n] < ∞ a.s. By Theorem A of the appendix, it then follows that

Σ_n a(n)² (‖M_{n+1}‖² − E[‖M_{n+1}‖² | F_n]) < ∞

a.s. By (B5) and (5), Σ_n a(n)² E[‖M_{n+1}‖² | F_n] < ∞. It follows that Σ_n a(n)² ‖M_{n+1}‖² converges a.s., in particular, a(n)² ‖M_{n+1}‖² → 0 a.s. Thus a.s.,
where
Here K′ = sup_n ‖P̄_{f(x̃_n)}‖, which is finite because P and f are continuous functions and {x_n} is bounded. The last inequality follows from (5).
Let t(0) = 0, t(n) := Σ_{m=0}^{n−1} a(m), I_n = [t(n), t(n+1)), and x̄(t(n)) = x̃_n, with linear interpolation on I_n. Let x^s(t) denote the unique trajectory of ẋ^s(t) = P̄_{f(x^s(t))}(h(x^s(t))), x^s(s) = x̄(s). Then by standard Gronwall inequality based arguments as in Lemma 1 on pp. 12–15 of [8], we have the following.
Lemma 12. Under (B1)–(B7), for any T ≥ 0, lim_{t↑∞} sup_{s∈[t,t+T]} ‖x̃(s) − x^t(s)‖ = 0.
Finally, in view of Lemma 10 and Lemma 12 above, the proof of Theorem 3 closely
follows that of Theorem 1 above (see also Theorem 2 on pp. 15–16 of [8]) on noting
that on C, f (x) = x =⇒ P̄f (x) = P̄x . As before, one may say more for special cases
with additional structure, e.g., for d = 1, where one can claim convergence, allowing
for boundary equilibria on ∂C.
Example 4. Consider h(x) = −∇g(x) and f_i(x) = x_i − (x_i − c_i)⁺, 1 ≤ i ≤ N, for prescribed c_i ∈ R. This amounts to stochastic gradient descent for minimizing g subject to the constraints x_i ≤ c_i ∀ i.
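The map in Example 4 reduces to a coordinatewise clip: x_i − (x_i − c_i)⁺ = min(x_i, c_i). A minimal check, with illustrative values of c_i:

```python
import numpy as np

c = np.array([1.0, 2.0, 0.5])  # prescribed constraint levels c_i (illustrative)

def f(x):
    """f_i(x) = x_i - (x_i - c_i)^+ : clips each coordinate at c_i from above."""
    return x - np.maximum(x - c, 0.0)

x = np.array([3.0, 1.5, -1.0])
y = f(x)
assert np.all(y <= c)                     # the constraint x_i <= c_i holds after f
assert np.allclose(y, np.minimum(x, c))   # equivalently, coordinatewise min with c_i
```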
Example 5. For h, {c_i} as above, let f_i(x) = x_i − γ_i a_i (x_i − c_i) I{Σ_j a_j (x_j − c_j)² > M} for some γ_i > 0 small, a_j > 0, 1 ≤ j ≤ N, M > 0. This is stochastic steepest descent for minimizing g subject to Σ_i a_i (x_i − c_i)² ≤ M. Evaluation of the quantity Σ_i a_i (x_i − c_i)² requires global information. This can be done by a distributed gossip algorithm as a subroutine.
These are rather simple situations. One technical issue here is the following.
The maps f above are smooth except at the boundaries of certain open sets. There are two ways of working around this problem. One is to replace the respective f's by convenient smooth approximations, e.g., (x_i − c_i)⁺ by g((x_i − c_i)⁺), where g(·) is a smooth approximation to x ↦ x ∨ 0 such that g(x) = 0 for x ≤ 0 and g(x) > 0
for x > 0. The other option is to not change anything but invoke the fact that if
the noise {Mn } is rich enough, the probability of the iterates falling exactly on the
troublesome boundary will be zero. Such arguments are often used in application of
stochastic approximation algorithms. These considerations are also present in the two
more examples that follow.
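One concrete choice of such a smoothing g, offered here only as an illustration (the classical exp(−1/x) construction; the paper does not prescribe a specific g):

```python
import math

def g(x):
    """A C-infinity function with g(x) = 0 for x <= 0 and g(x) > 0 for x > 0,
    built from the classical exp(-1/x) construction; for large x it is close
    to x, so it smooths the kink of x -> max(x, 0) at the origin."""
    return x * math.exp(-1.0 / x) if x > 0 else 0.0

assert g(-1.0) == 0.0 and g(0.0) == 0.0   # vanishes on the nonpositive half-line
assert g(0.5) > 0.0                       # strictly positive for x > 0
assert abs(g(10.0) - 10.0) < 1.0          # tracks the identity away from the kink
```

All derivatives of g vanish at 0, so composing it with (x_i − c_i)⁺ removes the nonsmoothness at the constraint boundary while leaving the constraint set unchanged.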
Example 6. Consider h(x) = −∇g(x) and let f stand for a single full iteration of the Boyle–Dykstra–Han algorithm for computing the projection onto the intersection of a finite family of closed convex sets. (The algorithm goes in a round-robin fashion componentwise; what we mean here is one full round.) Then f^n → P := the desired projection. The Boyle–Dykstra–Han algorithm, however, is not distributed.
A distributed version has been derived in [25].
Example 7. Consider a second order dynamics in R³ that is commonplace in models of flocking:

(26)    sup_x ( ‖∇²V(x)‖ ∨ ‖∇V(x)‖ ‖x‖ / V(x) ) < ∞.
where the last two inequalities use (4), (25), (26), and (27). Since a(n) → 0, the r.h.s. is < (1 − ε/2)V(x_n) for n sufficiently large. This establishes the claim.
III. Extensions and future issues.
7. Asynchronous case. A more realistic scenario than what we have been considering so far is that of asynchronous implementation. Here each node operates on
its own clock. Thus assume a universal clock with ticks n ≥ 0 in the background.
This can be “event driven” and not necessarily in multiples of a fixed unit as in a
conventional clock. For each node i, we have a possibly random subsequence of {n}
along which it performs its polling and updates. Furthermore, there can be commu-
nication delays in receiving one node’s information by another. These considerations
lead to highly nontrivial complications as seen in [6] or [7, Chapter 7]. We shall adapt
the development of those works to analyze what happens in the present case under
asynchronous implementation. Instead of replicating the messy analysis of [6] or [7,
Chapter 7] here, we only sketch the underlying arguments, which are quite easy to
comprehend.
Let B_n := the random subset of V denoting the nodes which were "active," i.e., performed their updates, at time n. Let κ(i, n) := Σ_{m=0}^{n} I{i ∈ B_m}. This then is the "local clock" of i, indicating its own count of updates performed till n. As in [6] or [7, Chapter 7], we consider in lieu of (20) the iteration
Here fi is the ith component of f . Note that we have assumed a delay-free computa-
tion of f (x(n)). We shall relax this later. The quantity τk (j, i) is the possibly random
delay with which i receives j’s data at time k. It is argued in [6] or [7, Chapter 7] that
under mild conditional moment conditions, the delays contribute an asymptotically
negligible error which does not affect our convergence analysis. We omit the details,
referring the interested reader to [6] or [7, Chapter 7].
Replacement of the common step-size a(n) by the node-dependent a(κ(i, n)) is
a much more significant modification, which serves a dual purpose. First, it elimi-
nates the need for the nodes to know the global clock. Second, a common step-size
would have resulted in different components getting weighted differently according to
their relative frequency of occurrence, thus modifying the resultant limiting o.d.e. and
rendering our analysis invalid. Suppose
(29)    lim inf_{n↑∞} κ(i, n)/n ≥ ζ    ∀ i
for some constant ζ > 0. That is, all components are updated comparably often. As
argued in the above references, under some additional conditions on {a(n)} stipulated
in [6], the limiting o.d.e. under {a(κ(i, n))} remains the same modulo a time scaling
that does not affect its asymptotic behavior. Thus our analysis continues to apply;
see, e.g., [6], [7, Chapter 7].
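A minimal simulation of such local clocks, with hypothetical per-node activation rates (the rates and the Bernoulli activation model are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
N, steps = 3, 10000
rates = np.array([0.9, 0.5, 0.3])   # hypothetical per-node activation rates
kappa = np.zeros(N, dtype=int)      # local clocks kappa(i, n)

for n in range(steps):
    active = rng.random(N) < rates  # the random active set B_n at time n
    kappa += active                 # kappa(i, n): updates of node i up to time n
    # node i would use step-size a(kappa[i]) in place of a(n) here

# condition (29): every component is updated comparably often
assert np.all(kappa / steps > 0.2)
```

Even though the nodes update at very different rates, each local clock grows linearly in the global time n, which is exactly what (29) asks for.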
The important fact to note here is that the effect of asynchrony and delays was
killed by our choice of the step-size schedule. This suggests a remedy for managing
the same in the computation of f , which was taken to be free of these problems in
(28). Replace (28) by
(30)    x^i(n+1) = f_i^{(n)}(x̌(n)) + a(κ(i, n)) I{i ∈ B_n}(h_i(x̌(n)) + M^i(n+1)),    1 ≤ i ≤ N, n ≥ 0,

where x̌(n) is shorthand for the argument of h_i(·) in (28) and f^{(n)}(x) := (1 − b(n))x + b(n)f(x) for a step-size schedule b(n) > 0 satisfying

b(n) ↓ 0,    Σ_n b(n) = ∞,    a(n)/b(n) → 0.
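A pair of schedules satisfying these conditions can be checked numerically; the exponents below are illustrative choices, not prescribed by the paper:

```python
import numpy as np

n = np.arange(1, 100001)
b = 1.0 / n ** 0.6   # faster time scale: b(n) -> 0 with sum b(n) = infinity
a = 1.0 / n          # slower time scale, with a(n) = o(b(n))

ratio = a / b        # equals n^{-0.4}, so a(n)/b(n) -> 0 as required
assert ratio[-1] < 0.02
assert b[-1] < b[0] and a[-1] < a[0]   # both schedules decrease to 0
```

Since a(n)/b(n) → 0, the f-averaging in f^{(n)} runs on the faster time scale and is seen as essentially equilibrated by the slower h-iterates.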
be specified in terms of problem parameters, and (ii) narrowing down potential limit
sets for the same. Just as in the case of stochastic approximation, one can say more
for specific instances by exploiting the additional structure, e.g., componentwise a.s.
convergence to a common point when this point is the only possible compact con-
nected internally chain transitive invariant set for (9). Some further possibilities are
as follows:
1. There are situations when one may want to relax (1). For example, if each
node in V polls exactly one neighbor at each time, the ensuing averaging
matrix has a single nonzero entry per row. Also, if there are transmission
constraints (e.g., in a wireless medium), only some of the nodes can poll at
any given time, implying that only some rows will be nonzero. In either case
the resulting transition matrices may not be irreducible aperiodic at each time
but may be so on average. These issues will be addressed in a future work.
2. Some of the standard variations and extensions of stochastic approximation
suggest natural counterparts here. These include avoidance of traps and sam-
ple complexity results [8, Chapter 4], as well as constant step-size schemes
[8, Chapter 9].
3. Better stability tests for the fully nonlinear case will be very useful. Even in
the quasi-linear case, we have extended one of the many sufficient conditions
from the stochastic approximation literature to stochastic approximation with
gossip. A similar exercise with other sufficient conditions remains to be carried out, as does an extension to the present scenario of projected stochastic approximation.
4. We have taken evaluation of the “coordination” component f (·) in (20) to
be noise-free. If this is not the case, we need to replace it by a stochastic
approximation iterate as well. Then (20) becomes
x(n + 1) = x(n) + b(n)(f(x(n)) + M′(n + 1)) + a(n)(h(x(n)) + M(n + 1)).

Here {M′(n)} is the measurement noise for f and the step-size sequence {b(n)} has to be chosen so that Σ_n b(n) = ∞, Σ_n b(n)² < ∞ (to ensure the standard stochastic approximation behavior) and a(n) = o(b(n)) (to ensure that the f-iterates are faster than the h-iterates). Even with this the desired
result is not guaranteed; see, e.g., [9], where the pure linear gossip case is dis-
cussed. Provable convergence to the desired equilibrium, or more generally,
tracking of the desired limiting behavior may be possible in specific cases.
This needs a separate analysis, which will be pursued in a sequel.
5. As mentioned, relating f to a subroutine for projection to a convex set im-
plements such a projection on a faster time-scale. An important possible
future direction is to use f to effect a projection to a smooth manifold, which
will enable us to execute stochastic approximation versions of algorithms on,
e.g., matrix manifolds [1]. This is reminiscent of ideas from “sliding mode
control,” where a trajectory is controlled along a prescribed manifold [31].
Appendix A. We recall here a key martingale convergence theorem from [22].
Theorem A. Let Z(n), n ≥ 0, denote a zero mean square-integrable martingale w.r.t. an increasing family of σ-fields {F_n}, satisfying

Σ_n E[ |Z(n + 1) − Z(n)|² | F_n ] < ∞  a.s.

Then Z(n) converges a.s.
This is Proposition VII-2-3(c) on p. 149 in [22] (see also Theorem 11, p. 150 in [8]).
REFERENCES
[1] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix Manifolds, Princeton University Press, Princeton, NJ, 2008.
[2] V. I. Arnold, Ordinary Differential Equations, Springer-Verlag, Berlin, 2001.
[3] M. Benaim, A dynamical system approach to stochastic approximations, SIAM J. Control
Optim., 34 (1996), pp. 437–472.
[4] P. Bianchi, G. Fort, and W. Hachem, Performance of a distributed stochastic approximation algorithm, IEEE Trans. Inform. Theory, 59 (2013), pp. 7405–7418.
[5] V. S. Borkar, Stochastic approximations with two time scales, Systems Control Lett., 29
(1997), pp. 291–294.
[6] V. S. Borkar, Asynchronous stochastic approximation, SIAM J. Control Optim., 36 (1998),
pp. 840–851.
[7] V. S. Borkar, Erratum: Asynchronous stochastic approximations, SIAM J. Control Optim.,
38 (2000), pp. 662–663.
[8] V. S. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint, Hindustan Book
Agency, New Delhi, and Cambridge University Press, Cambridge, UK, 2008.
[9] V. S. Borkar, R. Makhijani, and R. Sundaresan, Asynchronous gossip for averaging and
spectral ranking, IEEE J. Selected Topics Signal Process., 8 (2014), pp. 703–716.
[10] V. S. Borkar and S. P. Meyn, The O.D.E. method for convergence of stochastic approxima-
tion and reinforcement learning, SIAM J. Control Optim., 38 (2000), pp. 447–469.
[11] V. S. Borkar and V. V. Phansalkar, Managing interprocessor delays in distributed recursive
algorithms, Sadhana, 19 (1994), pp. 995–1003.
[12] S. Chatterjee and E. Seneta, Towards consensus: Some convergence theorems for repeated
averaging, J. Appl. Probab., 14 (1977), pp. 89–97.
[13] G. Chen, Z. Liu, and L. Guo, The smallest possible interaction radius for synchronization of
self-propelled particles, SIAM J. Control Optim., 50 (2012), pp. 1950–1970.
[14] J. Chen and A. H. Sayed, On the limiting behavior of distributed optimization strategies, in
Proceedings of the 50th Allerton Conference on Control, Communications and Computing,
Monticello, IL, 2012, pp. 1535–1542.
[15] R. Dwivedi, Unpublished Course Project, IIT Bombay, 2014.
[16] D. Fudenberg and D. K. Levine, The Theory of Learning in Games, MIT Press, Cambridge,
MA, 1998.
[17] N. Gaffke and R. Mathar, A cyclic projection algorithm via duality, Metrika, 36 (1989), pp.
29–54.
[18] R. Gharavi and V. Anantharam, Structure theorems for partially asynchronous iterations
of a nonnegative matrix with random delays, Sadhana, 24 (1999), pp. 369–423.
[19] M. Huang and J. H. Manton, Stochastic approximation for consensus seeking: Mean square
and almost sure convergence, in Proceedings of the 46th IEEE Conference Decision and
Control, New Orleans, LA, 2007, pp. 206–211.
[20] Y. Kabanov and S. Pergamenshchikov, Two-scale Stochastic Systems, Springer-Verlag,
Berlin, 2003.
[21] S. Lee and A. Nedić, Distributed random projection algorithm for convex optimization over networks, IEEE J. Selected Topics Signal Process., 7 (2013), pp. 221–229.
[22] J. Neveu, Discrete-Parameter Martingales, North-Holland, Amsterdam, 1975.
[23] R. Olfati-Saber, Flocking for multi-agent dynamic systems: Theory and algorithms, IEEE
Trans. Automat. Control, 51 (2006), pp. 401–420.
[24] R. Olfati-Saber, J. A. Fax, and R. M. Murray, Consensus and cooperation in networked
multi-agent systems, Proc. IEEE, 95 (2007), pp. 215–233.
[25] S. Phade and V. S. Borkar, A distributed Boyle-Dykstra-Han scheme, submitted.
[26] L. Perko, Differential Equations and Dynamical Systems, 3rd ed., Springer-Verlag, New York,
2001.
[27] A. H. Sayed, Adaptation, learning, and optimization over networks, Found. Trends Machine
Learning, 7 (2014).
[28] D. Shah, Gossip algorithms, Found. Trends Networking, 3 (2009), pp. 1–125.
[29] S. S. Stanković, M. S. Stanković, and D. M. Stipanović, Decentralized parameter estimation by consensus based stochastic approximation, IEEE Trans. Automat. Control, AC-56 (2011), pp. 531–543.
[31] V. I. Utkin, Sliding Modes in Control and Optimization, Springer-Verlag, Berlin, 1992.
[32] Z. Wan, Flocking for Multi-agent Dynamical Systems, Lambert Academic Publishing, 2012.
[33] C. W. Wu, Synchronization in Complex Networks of Nonlinear Dynamical Systems, World
Scientific, Singapore, 2007.