NONLINEAR GOSSIP∗
Downloaded 02/10/20 to 103.21.127.60. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php
Key words. gossip algorithms, distributed algorithms, stochastic approximation, two time
scales, Benaim theorem
DOI. 10.1137/140992588
400 076, India (borkar.vs@gmail.com). This author's research was supported in part by a J. C. Bose Fellowship
and a grant for “Distributed computation for optimization over large networks and high dimensional
data analysis” from the Department of Science and Technology, Government of India.
provide in each case a meta-theorem in the tradition of Benaim [3] that character-
izes possible asymptotic behavior of the scheme, covering potentially nonconvergent
cases as well. For specific instances thereof, this result may be leveraged to exploit
additional structure of the problem in order to obtain sharper results.
For simplicity, we initially confine ourselves to the synchronous case where the
distributed computation runs on a common clock. Asynchrony with its concomitant
issues (differing clocks, delays, etc.) leads to additional complications. We address
these separately in a later section.
The paper is in three parts. Part I (sections 2–4) formulates the state-dependent
averaging model (section 2) and analyzes its asymptotic behavior (section 3), inclusive
of a stability test (section 4). We dub this the “quasi-linear” case. Part II, the
“fully nonlinear” model (sections 5–6), introduces the fully nonlinear case motivated
by projected optimization schemes (section 5) and analyzes its asymptotic behavior
(section 6). In Part III, we consider briefly the asynchronous case in section 7 and
conclude in section 8 with a discussion of some straightforward variants and some
research issues for the future.
I. Quasi-linear gossip.
2. Preliminaries. Consider a finite connected directed graph G = (V, E), where
V, E denote, respectively, its node and edge sets with |V| = N . Without loss of
generality, we may label V as {1, 2, . . . , N }. We assume that G is irreducible, i.e.,
there is a directed path from each node to any other node. Let
N (i) := {j ∈ V : (i, j) ∈ E}
denote the set of neighbors of node i. We are given a family of irreducible aperiodic
stochastic matrices Px = [[px (j|i)]]i,j∈V compatible with G, indexed by x ∈ Rd×N , d ≥
1, such that the map x 7→ Px is Lipschitz. We further assume that
(1)    min⁺_{j∈N(i)} p_x(j|i) > Δ    ∀ x ∈ R^{d×N}, i ∈ V,
where ∆ > 0 and (min)+ denotes the minimum over all nonzero elements. Node i
performs the d-dimensional iteration
(2)    x^i(n+1) = Σ_{j∈N(i)} p_{x(n)}(j|i) x^j(n) + a(n)[h_i(x(n)) + M^i(n+1)],
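As an illustration, iteration (2) can be simulated directly. The sketch below is a minimal example under stated assumptions, none of which come from the paper: a 4-node ring graph, uniform neighbor-and-self weights standing in for p_x(j|i) (taken state-independent for simplicity), a hypothetical local drift h_i(x) = −x_i, and small i.i.d. noise for M^i(n+1).

```python
import numpy as np

rng = np.random.default_rng(0)

N = 4  # nodes on a ring graph (hypothetical example)
neighbors = {i: [(i - 1) % N, (i + 1) % N] for i in range(N)}

def P_x(x):
    """Stand-in for the state-dependent stochastic matrix P_x: uniform
    weights over each node's neighbors and itself (state-independent here)."""
    P = np.zeros((N, N))
    for i in range(N):
        for j in neighbors[i] + [i]:
            P[i, j] = 1.0 / (len(neighbors[i]) + 1)
    return P

def h(x):
    """Hypothetical local drift pulling each component toward 0."""
    return -x

def a(n):
    """Standard step-size schedule: a(n) -> 0, sum a(n) = infinity."""
    return 1.0 / (n + 1)

x = rng.normal(size=N)
for n in range(5000):
    noise = 0.01 * rng.normal(size=N)        # martingale-difference noise M(n+1)
    x = P_x(x) @ x + a(n) * (h(x) + noise)   # iteration (2) with d = 1

# the iterates reach consensus (here at the equilibrium 0 of the drift)
assert np.max(np.abs(x - x.mean())) < 0.05
```

Each step first averages over neighbors (the gossip term) and then applies a vanishing stochastic-approximation correction, which is exactly the structure of (2).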
its initial data with respect to the stationary distribution π of P. In the presence of noise,
one can show consensus, in fact convergence to a common value, but not necessarily
to the above average. This is because the iteration does not have a single equilibrium
point, but a one-dimensional subspace of equilibria corresponding to the span of its
eigenvector [1, 1, . . . , 1]T . In the deterministic case, we get a limit dependent on the
initial condition as described above, but in the noisy case we may get a random limit
point on this line which loses track of the initial condition. See, e.g., [9], where this
problem is handled by proposing an alternative algorithm. Whether this matters will
depend on the application. For example, in Example 1 below, the specifics of the sta-
tionary distribution πx matter and therefore the above concern is real. In the motion
coordination problem sketched in Example 3 below, it may not matter to the same
extent, since the objective of the averaging term is essentially consensus. We return
to this issue in the discussion on future directions at the end.
3. Condition (1) is convenient to work with from the point of view of reducing notation, but can be easily relaxed to, e.g., a similar condition for a finite product P̃_x̃ := Π_{m=1}^{M} P_{x_m} for x̃ := [x_1, . . . , x_M ] and some M ≥ 1.
We call (2) a quasi-linear gossip. The first scheme of the form (2) with a constant
P appeared in [30]. There have been other subsequent works in this direction; see,
e.g., [14], [21] for two recent instances. Our aim here is to give a broad analysis of
this scheme in a very general framework and to derive a Benaim-type meta-theorem
regarding its asymptotic behavior [3]. For simplicity of exposition, we take d = 1
henceforth, though the results are completely general.
Let x = [x_1, . . . , x_N]^T denote a generic element of R^N. Likewise, x(n) := [x_1(n), . . . , x_N(n)]^T. Define h : R^N → R^N by
Let πx denote the unique stationary distribution for Px , written as a row vector, and
1 := the column vector of all 1’s in RN . Let Px∗ denote the rank one matrix 1πx .
Then
(7)    P_x^n → P_x^*  uniformly as n ↑ ∞.
This is well posed under our hypotheses. The right-hand side is independent of i and
hence the set A := {x ∈ RN : xi = xj ∀ i, j} is invariant under (8). On A, (8)
decouples into N uncoupled copies of the scalar o.d.e.,
(9)    ẏ(t) = Σ_{i=1}^{N} π_{y(t)1}(i) h_i(y(t)1).
Recall that an invariant set B of a well posed o.d.e. is said to be internally chain transitive if for any x, y ∈ B and ε, T > 0, we can find n ≥ 1 and x_0 = x, x_1, . . . , x_n = y in B such that for 0 ≤ i < n, the trajectory of this o.d.e. initiated at x_i meets the ε-neighborhood of x_{i+1} after a time ≥ T [3]. Our main result below is a counterpart of
the corresponding result for stochastic approximation from [3], stated here for general
d ≥ 1. For this purpose, let A := {x = [(x^1)^T : · · · : (x^N)^T]^T ∈ (R^d)^N : x^i = [x^i_1, . . . , x^i_d]^T, 1 ≤ i ≤ N; x^i_k = x^j_k ∀ i, j, k}. This reduces to the earlier definition for
d = 1. The o.d.e. (9) gets replaced by
(10)    ẏ(t) = Σ_{i=1}^{N} π_{ψ(y(t))}(i) h_i(ψ(y(t))),
These are the transition probabilities for the Metropolis–Hastings scheme for Markov
chain Monte Carlo, except here they are normalized weights for neighboring nodes
π_x(i) = e^{−f(x_i)/T} / Σ_{j∈V} e^{−f(x_j)/T}.

Thus π_x puts greater weight on nodes j that have lower values of f(x_j). This is a
distributed optimization scheme for which numerical experiments show promising re-
sults [15]. Here we want to highlight the fact that x-dependent averaging weights give
us an additional handle to control the asymptotic behavior of stochastic approxima-
tion with gossip. One can think of this scheme as “leaderless swarm optimization”:
In classical particle swarm optimization, each agent makes an incremental update
based on her own gradient, that of her neighbors, and that of the “leader,” meaning
the current best performer. The last mentioned aspect requires keeping track of the
current best, introducing nonlocal computation. The above scheme automatically
concentrates weights on the so-called leader(s) by adapting the probability weights.
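The weight formula above is a softmax over (negated, temperature-scaled) node values. A minimal sketch of how such weights behave, with purely illustrative f-values and temperature:

```python
import numpy as np

def pi_x(f_vals, T=1.0):
    """Normalized Metropolis-Hastings-style weights pi_x(i) proportional to
    exp(-f(x_i)/T); shifted by the minimum for numerical stability."""
    f_vals = np.asarray(f_vals, dtype=float)
    w = np.exp(-(f_vals - f_vals.min()) / T)
    return w / w.sum()

# nodes with lower f-values receive more weight; node 1 is the "leader" here
f_vals = [3.0, 1.0, 2.0]
w = pi_x(f_vals, T=0.5)
assert int(np.argmax(w)) == 1
assert abs(w.sum() - 1.0) < 1e-9
```

Lowering T concentrates nearly all of the weight on the current best performer(s), which is the "leaderless" concentration effect described above.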
Example 2. More generally, consider the situation when hi (x) is of the form g(xi ),
i.e., hi = g ◦ Γi ∀i, where Γi is the projection from the N -fold product (Rd )N to its
ith factor space Rd . This is an important special case which can be interpreted as the
stochastic approximation or learning component of the iteration being strictly local.
The additional structure offers some further simplifications. Since asymptotically,
xi (t) ≈ xj (t) for i 6= j, (10) simplifies to N identical copies of a trajectory of the
d-dimensional o.d.e.
This is only a suggestive simple example. Attractors that are not equilibria are
endemic in multiagent learning schemes [16].
Example 3. Consider minimization of a separable function f := Σ_{i=1}^{m} α_i f_i(x) : R^d → R, where m is large. We assume that f_i : R^d → R (and hence f) are differentiable. Without loss of generality, assume that α_i > 0 (since we can absorb the negative sign in f_i). Then write f = (Σ_{i=1}^{m} α_i) Σ_{i=1}^{m} π_i f_i(x), where π_i = α_i / Σ_{i=1}^{m} α_i can be thought of as elements of the stationary distribution of some irreducible stochastic matrix P. Note that (1) is trivially satisfied here. The ith computational node runs the following d-dimensional iteration:

x^i(n+1) = Σ_{j∈N(i)} p(j|i) x^j(n) + a(n)(−∇f_i(x^i(n)) + M^i(n+1)),
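A minimal numerical sketch of this distributed gradient scheme, under illustrative assumptions not taken from the paper: quadratic components f_i(x) = (x − c_i)²/2 on d = 1, and a fixed birth-death averaging matrix whose stationary distribution is computed by detailed balance.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 3
# hypothetical separable objective: f_i(x) = (x - c_i)^2 / 2, grad f_i(x) = x - c_i
c = np.array([0.0, 1.0, 2.0])

# a fixed irreducible stochastic matrix on a line graph (illustrative choice)
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi = np.array([0.25, 0.50, 0.25])  # its stationary distribution (detailed balance)
x_star = pi @ c                    # minimizer of sum_i pi_i f_i, here 1.0

x = rng.normal(size=N)
for n in range(20000):
    a_n = 1.0 / (n + 10)
    grad = x - c                          # component i holds grad f_i(x_i)
    noise = 0.01 * rng.normal(size=N)
    x = P @ x + a_n * (-grad + noise)     # the d-dimensional iteration (d = 1)

# all nodes agree on the pi-weighted minimizer
assert np.max(np.abs(x - x_star)) < 0.05
```

The iterates reach consensus and the common value solves Σ_i π_i ∇f_i(y) = 0, i.e., the π-weighted problem, consistent with the rewriting of f above.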
for some C > 0, α ∈ (0, 1). We proceed through a sequence of lemmas. The first
lemma below is of independent interest.
Lemma 1. Under (A1), (A2), (A3), and (A4), iteration (13) achieves a.s. consensus, i.e.,

lim_{n↑∞} max_{i,j} ‖x^i(n) − x^j(n)‖ = 0  a.s.
where
Thus
‖(n, m)‖ ≤ Cα^m ‖x(n)‖ + Σ_{i=0}^{m−1} a(n+i) Cα^{m−1−i} ‖h(x(n+i))‖ + Σ_{i=0}^{m−1} a(n+i) Cα^{m−1−i} ‖M(n+i+1)‖.

By (5), the first term on the r.h.s. → 0 as m ↑ ∞. By (5) and the fact that a(n) → 0, so does the second term. By (5) and (3), Σ_{i=0}^{n} a(i)M(i+1) is a square-integrable {F_n}-martingale with a.s. convergent quadratic variation process, hence it converges a.s. by Proposition VII-2-3(c) of [22, p. 149], recalled as Theorem A in the appendix. Thus a(n)‖M(n+1)‖ → 0 a.s. This implies that the third term → 0 a.s. Thus (n, m) → 0 a.s. as n, m ↑ ∞. Recall once again that U_k has equal rows ∀ k and hence it gives a constant vector whenever it left multiplies a column vector. Now write x(n) = x(⌈n/2⌉ + ⌊n/2⌋) to conclude.
The intuition for the following two lemmas can be stated as follows. Lemma 2 says that as n increases, x(n) moves slowly and so does P_n := P_{x(n)} due to the Lipschitz condition. Hence for a sufficiently large m, as n increases U(n, m) tracks P^*_{x(n)} as stated by Lemma 3 below. In order to formally prove this we need the stronger hypothesis (A2′).
Lemma 2. Under (A1), (A2′), (A3), and (A4), almost surely, for any η > 0, m ≥ 0, there exists a (possibly random) n_0 ≥ 1 such that
where

˘(n, m) = Σ_{i=0}^{m−1} a(n+i) ‖U_{n+i} − P^*_{x̄(t(n+i))}‖ ‖h(x̄(t(n+i)))‖ + Σ_{i=0}^{m−1} a(n+i) ‖U_{n+i} − P^*_{x̄(t(n+i))}‖ ‖M(n+i+1)‖.
Proof of Theorem 1. The proof follows that of Theorem 2 in [8, pp. 15–16], which
is based on [3]. We shall denote by Φ_t : (R^d)^N → (R^d)^N the flow associated with (10), i.e., Φ_t(x) := x(t) when x(·) satisfies (10) with x(0) = x. This is a flow of homeomorphisms [2]. Fix a sample point where (5) and Lemma 4 hold. Let A denote the set ∩_{t≥0} closure{x̄(s) : s ≥ t}. Since x̄(·) is continuous and bounded, closure{x̄(s) : s ≥ t}, t ≥ 0, is a nested family of nonempty compact and connected sets. A, being the intersection thereof, will also be nonempty compact and connected. Also, by Lemma 1, A ⊂ A.
Then min_{y∈A} ‖x̄(t) − y‖ → 0. Since x̄(·) is obtained from the iterates {x(n)} of (2) by interpolation, we have min_{y∈A} ‖x(n) − y‖ → 0. In fact, for any ε > 0, let A^ε := {x : min_{y∈A} ‖x − y‖ < ε}. Then (A^ε)^c ∩ (∩_{t≥0} closure{x̄(s) : s ≥ t}) = ∅. Hence by the finite intersection property of families of compact sets, (A^ε)^c ∩ closure{x̄(s) : s ≥ t_0} = ∅ for some t_0 > 0. That is, x̄(t_0 + ·) ∈ A^ε. Conversely, if x ∈ A, there exist s_n ↑ ∞ in [0, ∞) such that x̄(s_n) → x. This is immediate from the definition of A. In fact, we have

max_{s∈[t(n),t(n+1)]} ‖x̄(s) − x̄(t(n))‖ = O(a(n)) → 0
as n → ∞. Thus we may take sn = t(m(n)) for suitable {m(n)} without any loss
of generality. Let x̃(·) denote the trajectory of (8) with x̃(0) = x. Then by the first
part of Lemma 4, it follows that kx̄(sn + t) − Φt (x̄(sn ))k → 0. On the other hand,
the continuity of the map Φt leads to Φt (x̄(sn )) → Φt (x) = x̃(t) ∀t > 0. Hence
x̄(sn + t) → x̃(t), implying that x̃(t) ∈ A as well. A similar argument works for t < 0,
using the second part of Lemma 4. Thus A is invariant under (8).
Let x̃_1, x̃_2 ∈ A and fix ε > 0, T > 0. Pick ε/4 > δ > 0 such that if ‖z − y‖ < δ and x̂_z(·), x̂_y(·) are solutions to (8) with initial conditions z, y, respectively, then max_{t∈[0,2T]} ‖x̂_z(t) − x̂_y(t)‖ < ε/4. Also pick n_0 > 1 such that n ≥ n_0 implies that x̄(t(n) + ·) ∈ A^δ and
Pick n_2 > n_1 ≥ n_0 such that ‖x̄(t(n_i)) − x̃_i‖ < δ, i = 1, 2, and t(n_2) − t(n_1) ≥ T. Let kT ≤ t(n_2) − t(n_1) < (k+1)T for some integer k ≥ 1 and let s(0) = t(n_1), s(i) = s(0) + iT for 1 ≤ i < k, and s(k) = t(n_2). Then for 0 ≤ i < k, sup_{t∈[s(i),s(i+1)]} ‖x̄(t) − x^{s(i)}(t)‖ < δ. Pick x̂_i, 0 ≤ i ≤ k, in A such that x̂_1 = x̃_1, x̂_k = x̃_2, and for 0 < i < k, x̂_i are in the δ-neighborhood of x̄(s(i)). The sequence (s(i), x̂_i), 0 ≤ i ≤ k, verifies the definition of internal chain transitivity: If x^*_i(·) denote
falls outside the unit ball. Define x̂(T⁻_{n+1}) = x̄(T⁻_{n+1})/r(n). Let M̂(k+1) = M(k+1)/r(n) for k ∈ [m(n), m(n+1)). Note that (4) implies
is a.s. convergent.
Proof. Since ‖M̂(k+1)‖ ≤ K(1 + ‖x̂(t(k))‖), we have

‖a(k) P_{r(n)x̂(t(k))} M̂(k+1)‖² ≤ a(k)² (sup ‖P‖²) K² (1 + ‖x̂(t(k))‖²),

where the supremum is over all stochastic matrices P. Hence

Σ_k E[ ‖a(k) P_{r(n)x̂(t(k))} M̂(k+1)‖² | F_k ] ≤ Σ_k a(k)² (sup ‖P‖²) K² (1 + ‖x̂(t(k))‖²),
‖x̂(t(m(n)+k))‖_∞ ≤ ‖x̂(t(m(n)))‖_∞ + Σ_{i=0}^{k−1} a(m(n)+i) ‖h_{r(n)}(x̂(t(m(n)+i)))‖_∞ + Σ_{i=0}^{k−1} a(m(n)+i) ‖M̂(m(n)+i+1)‖_∞

≤ ‖x̂(t(m(n)))‖_∞ + Σ_{i=0}^{k−1} a(m(n)+i) ( ‖h(0)‖_∞ + K′ + (L + K′) ‖x̂(t(m(n)+i))‖_∞ )

≤ (L + K′) Σ_{i=0}^{k−1} a(m(n)+i) ‖x̂(t(m(n)+i))‖_∞ + (‖h(0)‖_∞ + K′)(T + 1) + β.

Here K′ corresponds to K in (4) when the norm used is the sup-norm, L is the Lipschitz constant of h and therefore of h_{r(n)} under the sup-norm, and β is a positive constant such that ‖x‖_∞ ≤ β‖x‖ ∀ x ∈ R^n. The second inequality follows from (4) and the Lipschitz property of h_{r(n)}. The third inequality follows from ‖x̂(t(m(n)))‖ ≤ 1. By the discrete Gronwall inequality,

‖x̂(t(m(n)+k))‖_∞ ≤ [(‖h(0)‖_∞ + K′)(T + 1) + β] e^{(L+K′)(T+1)}.
Hence by equivalence of norms, ‖x̂(t(m(n)+k))‖ ≤ K^* for some K^* > 0 independent of n. Hence x̂ remains bounded a.s. on [T_n, T_{n+1}]. We can now mimic the arguments in the previous section, using Lemma 7, to prove the claim.
This leads to the main theorem.
Theorem 2. Under (A1), (A2′), (A3), and (A5), sup_n ‖x(n)‖ < ∞ a.s., i.e., (A4) holds.
Proof. We prove sup_n ‖x̄(T_n)‖ < ∞ a.s. If not, then there exists a subsequence {n_k} such that ‖x̄(T_{n_k})‖ ↑ ∞, i.e., r_{n_k} ↑ ∞. By Lemma 5 there exist c_0 > 0 and T > 0 such that for all initial conditions on the unit sphere, ‖φ_c(x, t)‖ ≤ 1 − ε_0 for t ∈ [T, T + 1], c > c_0 (≥ 0, by assumption). If r_n > c_0, ‖x̂(T_n)‖ = ‖x^n(T_n)‖ = 1, and ‖x^n(T⁻_{n+1})‖ ≤ 1 − ε_0. Then by Lemma 8, ‖x̂(T⁻_{n+1})‖ < 1 − ε′_0 for some 0 < ε′_0 < ε_0. Thus for r_n > c_0 and n sufficiently large,

‖x̄(T⁻_{n+1})‖ / ‖x̄(T_n)‖ = ‖x̂(T⁻_{n+1})‖ / ‖x̂(T_n)‖ < 1 − ε′_0.
The rest of the argument is similar to Theorem 7 in Chapter 3 of [8], with 1/2 replaced by 1 − ε′_0 < 1: We conclude that if ‖x̄(T_n)‖ > c_0, then x̄(T_k), k ≥ n, falls back to the ball of radius c_0 at an exponential rate. Thus if ‖x̄(T_n)‖ > c_0, ‖x̄(T_{n−1})‖ is either even greater than ‖x̄(T_n)‖ or is inside the ball of radius c_0. Then there must be an instance prior to n when x̄(·) jumps from inside this ball to outside the ball of radius 0.9 r_n. Thus, corresponding to the sequence r_{n_k} ↑ ∞, we would have a sequence of jumps of x̄(T_n) from inside the ball of radius c_0 to points increasingly far away from the origin. But, by the discrete Gronwall inequality (see, e.g., [8, p. 146]), there is a bound on the amount by which ‖x̄(·)‖ can increase over an interval of length T + 1 if it is inside the ball of radius c_0 at the beginning of the interval. This is a contradiction. Thus C̃ := sup_n ‖x̄(T_n)‖ < ∞, which implies sup_n ‖x(n)‖ ≤ C̃K^* < ∞ for K^* = K^*(T) as in the Gronwall inequality.
This extends the result of [10] which was motivated by reinforcement learning
applications where the h in question has linear growth and h∞ picks only its linearly
growing terms, killing everything else that had sublinear growth. As a global Lipschitz condition on h implies at most linear growth, the above criterion is often useful for such schemes. One can, however, conceive of other normalizations to handle specific situations that do not fit the above model. For example, for h(x) = −x³ + g(x), x ∈ R, with g at most quadratic, h_c(x) should be defined as h(cx)/c³.
II. The fully nonlinear case.
5. Preliminaries. Consider the n-dimensional projected stochastic iteration
given by
min g(x)
s.t. x ∈ Ω.
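A minimal numerical sketch of a projected stochastic iteration for a problem of this form, under purely illustrative assumptions (a quadratic g and a box constraint set Ω = [0, 1]², neither from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

# hypothetical instance: g(x) = ||x - b||^2 / 2 over the box Omega = [0, 1]^2
b = np.array([1.5, -0.5])

def grad_g(x):
    return x - b

def project(x):
    """Projection onto Omega (the role played by the map f in the text)."""
    return np.clip(x, 0.0, 1.0)

x = np.zeros(2)
for n in range(5000):
    a_n = 1.0 / (n + 1)
    noise = 0.05 * rng.normal(size=2)
    x = project(x + a_n * (-grad_g(x) + noise))  # projected stochastic iteration

# the constrained minimizer is b clipped to the box: (1.0, 0.0)
assert np.max(np.abs(x - np.array([1.0, 0.0]))) < 0.05
```

Each noisy gradient step is followed by the projection, so the iterates stay in Ω and converge to the constrained minimizer rather than the unconstrained one.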
where the last statement follows from uniform continuity of f on compact sets. We have from (B1) that f^k is continuous ∀ k. Then by repeating the above steps, we have for each m ≥ 1,
We now prove that lim_{m↑∞} lim_{m≤n(k)↑∞} P(x_{n(k)−m}) = x^*. Let ε > 0 be given. By (B2), ∃ M such that ∀ m ≥ M, ‖P(x_n) − f^m(x_n)‖ ≤ ε/2 ∀ n. By (23), for a given m, ∃ k_m ≥ m such that ∀ k ≥ k_m, ‖f^m(x_{n(k)−m}) − x^*‖ ≤ ε/2. Pick m = 2M and thus ∀ k ≥ k_m,
where the additional term a(n)ε_n is due to the second order term in the Taylor series expansion. Furthermore, (B7) and (5) ensure that Σ_n a(n)² E[‖M_{n+1}‖⁴ | F_n] < ∞ a.s. By Theorem A of the appendix, it then follows that

Σ_n a(n)² (‖M_{n+1}‖² − E[‖M_{n+1}‖² | F_n]) < ∞

a.s. By (B5) and (5), Σ_n a(n)² E[‖M_{n+1}‖² | F_n] < ∞. It follows that Σ_n a(n)² ‖M_{n+1}‖² converges a.s., in particular, a(n)² ‖M_{n+1}‖² → 0 a.s. Thus a.s.,
where
Here K′ = sup_n ‖P̄_{f(x̃_n)}‖, which is finite because P and f are continuous functions and {x_n} is bounded. The last inequality follows from (5).
Let t(0) = 0, t(n) := Σ_{m=0}^{n−1} a(m), I_n = [t(n), t(n+1)), and x̄(t(n)) = x̃_n, with linear interpolation on I_n. Let x^s(t) denote the unique trajectory of ẋ^s(t) = P̄_{f(x^s(t))}(h(x^s(t))), x^s(s) = x̄(s). Then by standard Gronwall inequality based arguments as in Lemma 1 on pp. 12–15 of [8], we have the following.
Lemma 12. Under (B1)–(B7), for any T ≥ 0, lim_{t↑∞} sup_{s∈[t,t+T]} ‖x̃(s) − x^t(s)‖ = 0.
Finally, in view of Lemma 10 and Lemma 12 above, the proof of Theorem 3 closely
follows that of Theorem 1 above (see also Theorem 2 on pp. 15–16 of [8]) on noting
that on C, f (x) = x =⇒ P̄f (x) = P̄x . As before, one may say more for special cases
with additional structure, e.g., for d = 1, where one can claim convergence, allowing
for boundary equilibria on ∂C.
Example 4. Consider h(x) = −∇g(x) and f_i(x) = x_i − (x_i − c_i)⁺, 1 ≤ i ≤ N, for prescribed c_i ∈ R. This amounts to stochastic gradient descent for minimizing g subject to the constraints x_i ≤ c_i ∀ i.
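The map in Example 4 reduces to a coordinatewise clip: x_i − (x_i − c_i)⁺ = min(x_i, c_i). A minimal check, with illustrative values of c_i:

```python
import numpy as np

c = np.array([1.0, 2.0, 0.5])  # prescribed constraint levels c_i (illustrative)

def f(x):
    """f_i(x) = x_i - (x_i - c_i)^+ : clips each coordinate at c_i from above."""
    return x - np.maximum(x - c, 0.0)

x = np.array([3.0, 1.5, -1.0])
y = f(x)
assert np.all(y <= c)                     # the constraint x_i <= c_i holds after f
assert np.allclose(y, np.minimum(x, c))   # equivalently, coordinatewise min with c_i
```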
Example 5. For h, {c_i} as above, let f_i(x) = x_i − γ_i a_i (x_i − c_i) I{Σ_j a_j (x_j − c_j)² > M} for some γ_i > 0 small, a_j > 0, 1 ≤ j ≤ N, M > 0. This is stochastic steepest descent for minimizing g subject to Σ_i a_i (x_i − c_i)² ≤ M. Evaluation of the quantity Σ_i a_i (x_i − c_i)² requires global information. This can be done by a distributed gossip algorithm as a subroutine.
These are rather simple situations. One technical issue here is the following.
The maps f above are smooth except at the boundaries of certain open sets. There are two ways of working around this problem. One is to replace the respective f's by convenient smooth approximations, e.g., (x_i − c_i)⁺ by g((x_i − c_i)⁺), where g(·) is a smooth approximation to x ↦ x ∨ 0 such that g(x) = 0 for x ≤ 0 and g(x) > 0
for x > 0. The other option is to not change anything but invoke the fact that if
the noise {Mn } is rich enough, the probability of the iterates falling exactly on the
troublesome boundary will be zero. Such arguments are often used in application of
stochastic approximation algorithms. These considerations are also present in the two
more examples that follow.
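One concrete choice of such a smoothing g, offered here only as an illustration (the classical exp(−1/x) construction; the paper does not prescribe a specific g):

```python
import math

def g(x):
    """A C-infinity function with g(x) = 0 for x <= 0 and g(x) > 0 for x > 0,
    built from the classical exp(-1/x) construction; for large x it is close
    to x, so it smooths the kink of x -> max(x, 0) at the origin."""
    return x * math.exp(-1.0 / x) if x > 0 else 0.0

assert g(-1.0) == 0.0 and g(0.0) == 0.0   # vanishes on the nonpositive half-line
assert g(0.5) > 0.0                       # strictly positive for x > 0
assert abs(g(10.0) - 10.0) < 1.0          # tracks the identity away from the kink
```

All derivatives of g vanish at 0, so composing it with (x_i − c_i)⁺ removes the nonsmoothness at the constraint boundary while leaving the constraint set unchanged.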
Example 6. Consider h(x) = −∇g(x) and let f stand for a single full iteration of the Boyle–Dykstra–Han algorithm for computing the projection onto the intersection of a finite family of closed convex sets. (The algorithm goes in a round-robin fashion componentwise; what we mean here is one full round.) Then f^n → P := the desired projection. The Boyle–Dykstra–Han algorithm, however, is not distributed.
A distributed version has been derived in [25].
Example 7. Consider a second order dynamics in R³ that is commonplace in models of flocking:

(26)    sup_x ( ‖∇²V(x)‖ ∨ ‖∇V(x)‖ ‖x‖ / V(x) ) < ∞.
where the last two inequalities use (4), (25), (26), and (27). Since a(n) → 0, the r.h.s. is < (1 − ε/2)V(x_n) for n sufficiently large. This establishes the claim.
III. Extensions and future issues.
7. Asynchronous case. A more realistic scenario than what we have been considering so far is that of asynchronous implementation. Here each node operates on
its own clock. Thus assume a universal clock with ticks n ≥ 0 in the background.
This can be “event driven” and not necessarily in multiples of a fixed unit as in a
conventional clock. For each node i, we have a possibly random subsequence of {n}
along which it performs its polling and updates. Furthermore, there can be commu-
nication delays in receiving one node’s information by another. These considerations
lead to highly nontrivial complications as seen in [6] or [7, Chapter 7]. We shall adapt
the development of those works to analyze what happens in the present case under
asynchronous implementation. Instead of replicating the messy analysis of [6] or [7,
Chapter 7] here, we only sketch the underlying arguments, which are quite easy to
comprehend.
Let B_n := the random subset of V denoting the nodes which were "active," i.e., performed their updates, at time n. Let κ(i, n) := Σ_{m=0}^{n} I{i ∈ B_m}. This then is the "local clock" of i, indicating its own count of updates performed till n. As in [6] or [7, Chapter 7], we consider in lieu of (20) the iteration
Here fi is the ith component of f . Note that we have assumed a delay-free computa-
tion of f (x(n)). We shall relax this later. The quantity τk (j, i) is the possibly random
delay with which i receives j’s data at time k. It is argued in [6] or [7, Chapter 7] that
under mild conditional moment conditions, the delays contribute an asymptotically
negligible error which does not affect our convergence analysis. We omit the details,
referring the interested reader to [6] or [7, Chapter 7].
Replacement of the common step-size a(n) by the node-dependent a(κ(i, n)) is
a much more significant modification, which serves a dual purpose. First, it elimi-
nates the need for the nodes to know the global clock. Second, a common step-size
would have resulted in different components getting weighted differently according to
their relative frequency of occurrence, thus modifying the resultant limiting o.d.e. and
rendering our analysis invalid. Suppose
(29)    lim inf_{n↑∞} κ(i, n)/n ≥ ζ    ∀ i
for some constant ζ > 0. That is, all components are updated comparably often. As
argued in the above references, under some additional conditions on {a(n)} stipulated
in [6], the limiting o.d.e. under {a(κ(i, n))} remains the same modulo a time scaling
that does not affect its asymptotic behavior. Thus our analysis continues to apply;
see, e.g., [6], [7, Chapter 7].
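A minimal simulation of such local clocks, with hypothetical per-node activation rates (the rates and the Bernoulli activation model are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
N, steps = 3, 10000
rates = np.array([0.9, 0.5, 0.3])   # hypothetical per-node activation rates
kappa = np.zeros(N, dtype=int)      # local clocks kappa(i, n)

for n in range(steps):
    active = rng.random(N) < rates  # the random active set B_n at time n
    kappa += active                 # kappa(i, n): updates of node i up to time n
    # node i would use step-size a(kappa[i]) in place of a(n) here

# condition (29): every component is updated comparably often
assert np.all(kappa / steps > 0.2)
```

Even though the nodes update at very different rates, each local clock grows linearly in the global time n, which is exactly what (29) asks for.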
The important fact to note here is that the effect of asynchrony and delays was
killed by our choice of the step-size schedule. This suggests a remedy for managing
the same in the computation of f , which was taken to be free of these problems in
(28). Replace (28) by
(30)    x^i(n+1) = f_i^{(n)}(x̌(n)) + a(κ(i, n)) I{i ∈ B_n}(h_i(x̌(n)) + M^i(n+1)),    1 ≤ i ≤ N, n ≥ 0,

where x̌(n) is shorthand for the argument of h_i(·) in (28) and f^{(n)}(x) := (1 − b(n))x + b(n)f(x) for a step-size schedule b(n) > 0 satisfying

b(n) ↓ 0,    Σ_n b(n) = ∞,    a(n)/b(n) → 0.
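A pair of schedules satisfying these conditions can be checked numerically; the exponents below are illustrative choices, not prescribed by the paper:

```python
import numpy as np

n = np.arange(1, 100001)
b = 1.0 / n ** 0.6   # faster time scale: b(n) -> 0 with sum b(n) = infinity
a = 1.0 / n          # slower time scale, with a(n) = o(b(n))

ratio = a / b        # equals n^{-0.4}, so a(n)/b(n) -> 0 as required
assert ratio[-1] < 0.02
assert b[-1] < b[0] and a[-1] < a[0]   # both schedules decrease to 0
```

Since a(n)/b(n) → 0, the f-averaging in f^{(n)} runs on the faster time scale and is seen as essentially equilibrated by the slower h-iterates.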
be specified in terms of problem parameters, and (ii) narrowing down potential limit
sets for the same. Just as in the case of stochastic approximation, one can say more
for specific instances by exploiting the additional structure, e.g., componentwise a.s.
convergence to a common point when this point is the only possible compact con-
nected internally chain transitive invariant set for (9). Some further possibilities are
as follows:
1. There are situations when one may want to relax (1). For example, if each
node in V polls exactly one neighbor at each time, the ensuing averaging
matrix has a single nonzero entry per row. Also, if there are transmission
constraints (e.g., in a wireless medium), only some of the nodes can poll at
any given time, implying that only some rows will be nonzero. In either case
the resulting transition matrices may not be irreducible aperiodic at each time
but may be so on average. These issues will be addressed in a future work.
2. Some of the standard variations and extensions of stochastic approximation
suggest natural counterparts here. These include avoidance of traps and sam-
ple complexity results [8, Chapter 4], as well as constant step-size schemes
[8, Chapter 9].
3. Better stability tests for the fully nonlinear case will be very useful. Even in
the quasi-linear case, we have extended one of the many sufficient conditions
from the stochastic approximation literature to stochastic approximation with
gossip. A similar exercise with other sufficient conditions remains to be carried out, as does an extension to the present scenario of projected stochastic approximation.
4. We have taken evaluation of the “coordination” component f (·) in (20) to
be noise-free. If this is not the case, we need to replace it by a stochastic
approximation iterate as well. Then (20) becomes
x(n + 1) = x(n) + b(n)(f(x(n)) + M′(n + 1)) + a(n)(h(x(n)) + M(n + 1)).

Here {M′(n)} is the measurement noise for f and the step-size sequence {b(n)} has to be chosen so that Σ_n b(n) = ∞, Σ_n b(n)² < ∞ (to ensure the standard stochastic approximation behavior) and a(n) = o(b(n)) (to ensure that the f-iterates are faster than the h-iterates). Even with this the desired
result is not guaranteed; see, e.g., [9], where the pure linear gossip case is dis-
cussed. Provable convergence to the desired equilibrium, or more generally,
tracking of the desired limiting behavior may be possible in specific cases.
This needs a separate analysis, which will be pursued in a sequel.
5. As mentioned, relating f to a subroutine for projection to a convex set im-
plements such a projection on a faster time-scale. An important possible
future direction is to use f to effect a projection to a smooth manifold, which
will enable us to execute stochastic approximation versions of algorithms on,
e.g., matrix manifolds [1]. This is reminiscent of ideas from “sliding mode
control,” where a trajectory is controlled along a prescribed manifold [31].
Appendix A. We recall here a key martingale convergence theorem from [22].
Theorem A. Let Z(n), n ≥ 0, denote a zero mean square-integrable martingale w.r.t. an increasing family of σ-fields {F_n}, satisfying

Σ_n E[ |Z(n + 1) − Z(n)|² | F_n ] < ∞  a.s.

Then Z(n) converges a.s.
This is Proposition VII-2-3(c) on p. 149 in [22] (see also Theorem 11, p. 150 in [8]).
REFERENCES
[1] P.-A. Absil, R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix Manifolds, Princeton University Press, Princeton, NJ, 2008.
[2] V. I. Arnold, Ordinary Differential Equations, Springer-Verlag, Berlin, 2001.
[3] M. Benaim, A dynamical system approach to stochastic approximations, SIAM J. Control
Optim., 34 (1996), pp. 437–472.
[4] P. Bianchi, G. Fort, and W. Hachem, Performance of a distributed stochastic approximation algorithm, IEEE Trans. Inform. Theory, 59 (2013), pp. 7405–7418.
[5] V. S. Borkar, Stochastic approximations with two time scales, Systems Control Lett., 29
(1997), pp. 291–294.
[6] V. S. Borkar, Asynchronous stochastic approximation, SIAM J. Control Optim., 36 (1998),
pp. 840–851.
[7] V. S. Borkar, Erratum: Asynchronous stochastic approximations, SIAM J. Control Optim.,
38 (2000), pp. 662–663.
[8] V. S. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint, Hindustan Book
Agency, New Delhi, and Cambridge University Press, Cambridge, UK, 2008.
[9] V. S. Borkar, R. Makhijani, and R. Sundaresan, Asynchronous gossip for averaging and
spectral ranking, IEEE J. Selected Topics Signal Process., 8 (2014), pp. 703–716.
[10] V. S. Borkar and S. P. Meyn, The O.D.E. method for convergence of stochastic approxima-
tion and reinforcement learning, SIAM J. Control Optim., 38 (2000), pp. 447–469.
[11] V. S. Borkar and V. V. Phansalkar, Managing interprocessor delays in distributed recursive
algorithms, Sadhana, 19 (1994), pp. 995–1003.
[12] S. Chatterjee and E. Seneta, Towards consensus: Some convergence theorems for repeated
averaging, J. Appl. Probab., 14 (1977), pp. 89–97.
[13] G. Chen, Z. Liu, and L. Guo, The smallest possible interaction radius for synchronization of
self-propelled particles, SIAM J. Control Optim., 50 (2012), pp. 1950–1970.
[14] J. Chen and A. H. Sayed, On the limiting behavior of distributed optimization strategies, in
Proceedings of the 50th Allerton Conference on Control, Communications and Computing,
Monticello, IL, 2012, pp. 1535–1542.
[15] R. Dwivedi, Unpublished Course Project, IIT Bombay, 2014.
[16] D. Fudenberg and D. K. Levine, The Theory of Learning in Games, MIT Press, Cambridge,
MA, 1998.
[17] N. Gaffke and R. Mathar, A cyclic projection algorithm via duality, Metrika, 36 (1989), pp.
29–54.
[18] R. Gharavi and V. Anantharam, Structure theorems for partially asynchronous iterations
of a nonnegative matrix with random delays, Sadhana, 24 (1999), pp. 369–423.
[19] M. Huang and J. H. Manton, Stochastic approximation for consensus seeking: Mean square
and almost sure convergence, in Proceedings of the 46th IEEE Conference Decision and
Control, New Orleans, LA, 2007, pp. 206–211.
[20] Y. Kabanov and S. Pergamenshchikov, Two-scale Stochastic Systems, Springer-Verlag,
Berlin, 2003.
[21] S. Lee and A. Nedić, Distributed random projection algorithm for convex optimization over networks, IEEE J. Selected Topics Signal Process., 7 (2013), pp. 221–229.
[22] J. Neveu, Discrete-Parameter Martingales, North-Holland, Amsterdam, 1975.
[23] R. Olfati-Saber, Flocking for multi-agent dynamic systems: Theory and algorithms, IEEE
Trans. Automat. Control, 51 (2006), pp. 401–420.
[24] R. Olfati-Saber, J. A. Fax, and R. M. Murray, Consensus and cooperation in networked
multi-agent systems, Proc. IEEE, 95 (2007), pp. 215–233.
[25] S. Phade and V. S. Borkar, A distributed Boyle-Dykstra-Han scheme, submitted.
[26] L. Perko, Differential Equations and Dynamical Systems, 3rd ed., Springer-Verlag, New York,
2001.
[27] A. H. Sayed, Adaptation, learning, and optimization over networks, Found. Trends Machine
Learning, 7 (2014).
[28] D. Shah, Gossip algorithms, Found. Trends Networking, 3 (2009), pp. 1–125.
[29] S. S. Stanković, M. S. Stanković, and D. M. Stipanović, Decentralized parameter estimation by consensus based stochastic approximation, IEEE Trans. Automat. Control, AC-56 (2011), pp. 531–543.
[31] V. I. Utkin, Sliding Modes in Control and Optimization, Springer-Verlag, Berlin, 1992.
[32] Z. Wan, Flocking for Multi-agent Dynamical Systems, Lambert Academic Publishing, 2012.
[33] C. W. Wu, Synchronization in Complex Networks of Nonlinear Dynamical Systems, World
Scientific, Singapore, 2007.