
DISTRIBUTED LEARNING WITH CONVEX SUM-OF-NON-CONVEX OBJECTIVE

Mengfei Zhang∗†, Jie Chen∗, Cédric Richard†

∗ Centre of Intelligent Acoustics and Immersive Communications,
School of Marine Science and Technology, Northwestern Polytechnical University, China
† Université Côte d'Azur, CNRS, OCA, France
zhangmengfei@mail.nwpu.edu.cn, dr.jie.chen@ieee.org, cedric.richard@unice.fr

ABSTRACT

Recent research has shown that some classical optimization methods originally designed for convex problems retain similar properties when applied to non-convex scenarios, where the problems are both locally and globally non-convex. This has led to the widespread development of distributed strategies for non-convex problems. The case where the individual local costs are smooth and non-convex while their sum remains (strongly) convex is a specific scenario with notable applications that can benefit from existing solution methods. In this paper, drawing inspiration from the efficiency and stability of diffusion adaptation, we explore the minimization of a strongly convex sum of non-convex local costs. Specifically, we provide an analysis that characterizes the convergence behavior of the network. Simulations are conducted to validate our theory.

Index Terms— Stochastic optimization, non-convex cost, diffusion adaptation.

1. INTRODUCTION

Distributed adaptation over networks addresses global stochastic optimization problems by estimating and tracking the parameters of interconnected nodes from streaming data, without prior knowledge of its statistical properties. It is a collaborative strategy in which each node shares its local estimates with its neighbors in order to optimize a global cost. In comparison to alternative approaches [1–4], diffusion adaptation has gained popularity due to its superior performance and wider stability range [4–7].

Nonetheless, a fundamental assumption of these works is the convexity of the local cost at each node, which limits their application, particularly in complex optimization problems such as dictionary learning. Recent research has addressed non-convex optimization in a distributed manner through the use of stochastic local cost functions [8–10]. The studies conducted in [8, 10] demonstrate that the diffusion learning strategy continues to produce meaningful estimates in non-convex environments, as the iterates of individual agents tend to cluster in a narrow region close to the network centroid (an auxiliary variable during network evolution). In [9], the authors introduce a general stochastic unified decentralized algorithm (SUDA), which ensures convergence both for general non-convex costs and under the Polyak-Łojasiewicz condition. It is important to notice that all three aforementioned works rely on the assumption that both the individual and the global costs are non-convex.

However, it is recognized that situations where the costs are locally non-convex but their sum is (strongly) convex, referred to as a sum-of-non-convex objective, arise in problems such as regularized loss minimization [11], approximate principal component analysis (PCA) [12], distributed adaptive beamforming [13], and privacy-enhancing distributed optimization [14]. This observation has motivated researchers to develop distributed adaptation strategies for sum-of-non-convex objectives. The work presented in [14] demonstrates that coupled consensus and projected gradient descent algorithms can converge to the optimum under the assumption of non-increasing step-sizes and gradient Lipschitzness, which, unfortunately, limits the adaptive filter's ability to estimate and track over time.

In this paper, we investigate how the diffusion adaptation strategy can minimize a sum-of-non-convex objective using stochastic gradients in complex optimization problems. We analyze its convergence behavior in a mean-square sense: for a small fixed step-size, the individual estimates form a small cluster in a neighborhood of the global optimal point. Finally, we perform simulations to validate our theory and compare with the counterpart scenario of (strongly) convex local costs.

Notation. Lowercase letters x, boldface lowercase letters x, and boldface capital letters X denote scalars, column vectors, and matrices, respectively. The expected value of a random variable is denoted by E{·}. Re{·} and Im{·} denote the real and imaginary parts of their argument, respectively. Operator ∥·∥ denotes the ℓ2-norm of a vector or matrix. Operator ∇ refers to the gradient of a function with respect to its variables. The Hermitian transpose of a vector or matrix is denoted by (·)^H. Operator col{·} stacks its arguments on top of each other into a column vector. Operator ⊗ denotes the Kronecker product. Vector 1_N is the all-one vector of length N, and I_M is the identity matrix of size M × M.
(The work of M. Zhang was supported partly by the China Scholarship Council and partly by the Innovation Foundation for Doctor Dissertation of Northwestern Polytechnical University. The work of J. Chen is partially supported by NSFC grant 61671382, Shaanxi Key Industrial Innovation Chain Project 2022ZDLGY01-02, and Xi'an Technology Industrialization Plan XA2020-RGZNTJ-0076. The work of C. Richard was funded in part by the PIA program under its IDEX UCAJEDI project (ANR-15-IDEX-0001) and by ANR under grant ANR-19-CE48-0002.)

2. DIFFUSION ADAPTATION WITH SUM-OF-NON-CONVEX COST

2.1. Model assumptions
We consider a network of N interconnected nodes. Each node k has access to local measurements. The optimization problem is to find a Pareto solution w^o ∈ C^M that minimizes the following real-valued global cost function J^glob(w) : C^M → R:

    w^o = arg min_w { J^glob(w) ≜ Σ_{k=1}^N α_k J_k(w) },   (1)

in a collaborative and distributed manner. The factors α_k are positive and normalized to add up to one. The local cost J_k(w) at each node k is defined as the expectation of an instantaneous risk of the form J_k(w) = E_x{J_k(w; x_k)}, with E_x{·} the expectation w.r.t. the stochastic variable x_k. We will keep each α_k fixed throughout this work, specifically α_k = 1/N; this specific case can be easily generalized. Our motivation for this choice stems from the need to tackle the challenges of non-convexity in certain real-world distributed signal processing problems. In [13, 15], the local costs of individual nodes are non-convex while their sum remains a strongly convex objective.

2.2. Diffusion adaptation for distributed estimation

Before presenting the diffusion strategy, let us introduce the definition of the gradient in the complex domain. Consider the case where J is a real-valued differentiable function of a complex variable w = u + iv = Re{w} + i Im{w}, where u and v are real vectors. The gradient of J is defined as [16, Equation (1.53)]:

    ∇_w J(w) ≜ ∇_u J(w) + i ∇_v J(w).   (2)
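As a quick illustration of definition (2) (this worked example is ours, not from [16]), consider J(w) = ∥w∥² = u^⊤u + v^⊤v. Then ∇_u J = 2u and ∇_v J = 2v, so that

    ∇_w J(w) = 2u + 2i v = 2w,

which coincides with the result obtained from Wirtinger calculus, ∇_w J = 2 ∂J/∂w^∗.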
Inspired by the distributed optimization algorithm proposed in [14], this paper aims to demonstrate how the existing diffusion strategy can be used to optimize the sum-of-non-convex cost (1) using stochastic gradient descent. The adapt-then-combine (ATC) diffusion strategy is formulated as follows:

    ψ_{k,i} = w_{k,i−1} − µ ∇Ĵ_k(w_{k,i−1}),   (3)
    w_{k,i} = Σ_{ℓ ∈ N_k} a_{ℓk} ψ_{ℓ,i}.   (4)

Here, ∇Ĵ_k(w_{k,i−1}) represents a stochastic approximation of the true local gradient ∇J_k(w_{k,i−1}). The coefficients a_{ℓk} are non-negative and correspond to the (ℓ, k)-th entries of a left-stochastic matrix A, which satisfies:

    A^⊤ 1_N = 1_N,   a_{ℓk} = 0 if ℓ ∉ N_k.   (5)

Here, N_k represents the set of neighboring nodes of node k, including k itself, and µ is a uniform positive step-size used by all nodes. Let p denote the right-eigenvector of A, known as the Perron eigenvector. After normalization, its entries p_k are all strictly positive and satisfy Ap = p and 1_N^⊤ p = 1. A special case is when p_k = 1/N for all k; matrix A is then said to be doubly stochastic because it satisfies:

    A^⊤ 1_N = A 1_N = 1_N.   (6)
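For concreteness, a minimal NumPy sketch (ours, not part of the original paper) of one ATC iteration (3)–(4) is given below; the stochastic gradient oracle grad_hat is a placeholder to be supplied by the application, e.g., the LMS costs of Section 4.

```python
import numpy as np

def atc_step(w, A, grad_hat, mu):
    """One ATC diffusion iteration (3)-(4) over the whole network.

    w        : (N, M) complex array; w[k] is node k's current estimate
    A        : (N, N) left-stochastic combination matrix, A[l, k] = a_{lk}
    grad_hat : callable, grad_hat(k, w_k) -> stochastic gradient, shape (M,)
    mu       : positive step-size
    """
    N = w.shape[0]
    # Adapt (3): local stochastic-gradient step at every node
    psi = np.stack([w[k] - mu * grad_hat(k, w[k]) for k in range(N)])
    # Combine (4): w_{k,i} = sum_l a_{lk} psi_{l,i}; the zero pattern of A
    # restricts the sum to the neighborhood N_k
    return A.T @ psi
```

Because a_{ℓk} = 0 whenever ℓ ∉ N_k, the dense matrix product above is equivalent to the neighborhood sum in (4).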
We introduce the following definitions and assumptions to facilitate the derivation. These definitions and assumptions are similar to those used in previous works [17–21].

D.1. The collection of all possible random events generated by the past iterates {w_j} up to time j ≤ i over the network is defined as:

    F_i ≜ filtration{w_0, w_1, · · · , w_i}.   (7)

D.2. The gradient noise vector s_{k,i}(w_{k,i−1}) at node k and time i is defined as the difference:

    s_{k,i}(w_{k,i−1}) ≜ ∇Ĵ_k(w_{k,i−1}) − ∇J_k(w_{k,i−1}).   (8)

A.1. Each J_k(w) for k ∈ {1, . . . , N} may not be convex, while the weighted aggregate of the local costs J^glob(w) is strongly convex. Specifically, there exists a constant c > 0 such that the following inequality holds for all w_1, w_2 ∈ C^M:

    J^glob(w_1) ≥ J^glob(w_2) + Re{∇J^glob(w_2)^H (w_1 − w_2)} + (c/2) ∥w_1 − w_2∥².

A.2. Each J_k(w) is δ-upper smooth and ∆-lower smooth, that is,

    −(∆/2) ∥w_1 − w_2∥² ≤ J_k(w_1) − [J_k(w_2) + Re{∇J_k(w_2)^H (w_1 − w_2)}] ≤ (δ/2) ∥w_1 − w_2∥²,

for all w_1, w_2 ∈ C^M. The constants δ and ∆ satisfy c ≤ δ and ∆ ≥ 0, respectively.

A.3. For all pairs of agents k and ℓ, the gradient disagreement is bounded, namely, for any w ∈ C^M: ∥∇J_k(w) − ∇J_ℓ(w)∥ ≤ ζ.

A.4. Matrix A is doubly stochastic. This property implies that ∥A − (1/N) 1_N 1_N^⊤∥ = λ ∈ (0, 1), where λ is the second largest eigenvalue of A.

A.5. The gradient noise vector s_{k,i}(w_{k,i−1}) satisfies:

    E{s_{k,i}(w_{k,i−1}) | F_{i−1}} = 0,   (9)
    E{∥s_{k,i}(w_{k,i−1})∥² | F_{i−1}} ≤ σ².   (10)
3. CONVERGENCE ANALYSIS

In this section, we examine the convergence of the ATC algorithm for minimizing the sum-of-non-convex cost. Due to space limitations, we provide only a brief overview of the analysis; the detailed proofs are omitted but will be available as supplementary material.

The convergence theorem is proved in two steps. First, we establish a consensus lemma showing that the consensus distance is bounded as E{∥w_{k,i} − w̄_i∥²} ≤ O(µ²). This implies that the w_{k,i} converge to a small cluster in the vicinity of the network average w̄_i for a small step-size µ. Second, we establish a theorem demonstrating the geometric convergence of J^glob(w̄_i) towards a neighborhood of the optimal value. Combining these two steps, we conclude that, after a sufficient number of iterations or with a suitable warm-up strategy, the estimates w_{k,i} converge to a small cluster in a neighborhood of the global optimal point for a small step-size µ.

By collecting the w_{k,i} and ∇Ĵ_k(w_{k,i}) over the entire network, we obtain block vectors of dimension MN × 1:

    w_i ≜ col{w_{1,i}, · · · , w_{N,i}},   (11)
    ĝ_i ≜ col{∇Ĵ_1(w_{1,i}), · · · , ∇Ĵ_N(w_{N,i})}.   (12)

Based on these definitions, the diffusion recursion (3)–(4) can be written compactly as:

    w_i = 𝒜^⊤ (w_{i−1} − µ ĝ_{i−1}),   (13)

with 𝒜 ≜ A ⊗ I_M. Left-multiplying both sides of (13) by (1/N)(1_N^⊤ ⊗ I_M), we obtain:

    (1/N)(1_N^⊤ ⊗ I_M) w_i = (1/N)(1_N^⊤ ⊗ I_M) 𝒜^⊤ (w_{i−1} − µ ĝ_{i−1})
                           = (1/N)(1_N^⊤ ⊗ I_M) w_{i−1} − (µ/N)(1_N^⊤ ⊗ I_M) ĝ_{i−1},   (14)

where we used the mixed-product property (M ⊗ N)(X ⊗ Y) = (MX) ⊗ (NY) together with (6).
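The equivalence between the node-wise recursion (3)–(4) and the stacked form (13) is easy to check numerically; the following sketch (ours) compares the two updates on random complex data.

```python
import numpy as np

N, M, mu = 4, 3, 0.1
rng = np.random.default_rng(0)

A = rng.random((N, N)); A /= A.sum(axis=0)   # left-stochastic: columns sum to one
w = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))
g = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))

# Node-wise ATC with gradients g: psi_k = w_k - mu*g_k, then combine
w_nodes = A.T @ (w - mu * g)

# Stacked form (13): w_i = (A kron I_M)^T (w_{i-1} - mu * g_{i-1})
cal_A = np.kron(A, np.eye(M))
w_stacked = (cal_A.T @ (w - mu * g).reshape(N * M)).reshape(N, M)

assert np.allclose(w_nodes, w_stacked)
```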
We define w̄_i as the average of the estimates over the network, given by w̄_i ≜ (1/N) Σ_{k=1}^N w_{k,i} = (1/N)(1_N^⊤ ⊗ I_M) w_i. This intermediate variable serves as the network-wide average of the estimates at iteration i and plays a central role in the convergence analysis of the w_{k,i}. Combining (14) with this definition, we can derive the recursion for w̄_i as follows:

    w̄_i = w̄_{i−1} − (µ/N) Σ_{k=1}^N ∇Ĵ_k(w_{k,i−1})
        = w̄_{i−1} − (µ/N) Σ_{k=1}^N ∇J_k(w̄_{i−1}) − µ f_{i−1} − µ s_i,   (15)

where we define the following perturbation terms:

    f_{i−1} ≜ (1/N) Σ_{k=1}^N [∇J_k(w_{k,i−1}) − ∇J_k(w̄_{i−1})],   (16)
    s_i ≜ (1/N) Σ_{k=1}^N [∇Ĵ_k(w_{k,i−1}) − ∇J_k(w_{k,i−1})].   (17)

Note that the subscript i of s_i follows from definition D.2: s_i depends on the data up to time instant i. Making N copies of w̄_i and stacking them, we obtain:

    w̄_i ≜ (1/N)(1_N 1_N^⊤ ⊗ I_M) w_i.   (18)
N
Lemma 1 (Consensus Distance). Under assumptions A.2 to A.5, Theorem 1 illustrates the interplay between the step-size µ, the
the consensus residue is bounded by: upper-smooth parameter δ and the bound σ 2 on the gradient noise.
When the gradient computation is noisy, one can still use a fixed
E{∥wi − w̄i ∥2 } ≤ step-size and ensure that the expected global cost values w.r.t. the
 1 + λ2 i 8λ2 (N ζ 2 + N 2 σ 2 ) 2 network average w̄i will converge at a geometric rate to a value close
E{∥w0 − w̄0 ∥2 } + µ . (19)
to the optimal value. The upper bound on the expected optimality
2 (1 − λ2 )2
gap is given by:
The first term on the right-hand side of equation (19) becomes
c2 µ + c1 µ2 + 2c1 δµ3
negligible as i becomes sufficiently large, given that λ ∈ (0, 1). lim sup E{J glob (w̄i ) − Joglob } ≤ , (26)
Additionally, we can employ a warm-up strategy where all nodes’ it- i→∞ c
erates at the initial time instant i = 0 are set to be identical, ensuring for sufficiently small step-size µ under condition (24). As shown in
that E{∥w0 − w̄0 ∥2 } = 0. the definition of c1 , the lower-smooth parameter ∆, related to the
non-convexity of local costs, clearly affects this upper bound. This
Corollary 1. Under assumptions A.2 to A.5 and using the warm- will be illustrated in the experiments section.
up strategy, the iterates wk,i will converge to a small cluster in the
vicinity of the network average w̄i for a small step-size µ:
4. SIMULATIONS
8λ2 (N ζ 2 + N 2 σ 2 ) 2
E{∥wi − w̄i ∥2 } ≤ µ . (20)
Assumption A.2 implies that ∥∇J^glob(w_1) − ∇J^glob(w_2)∥ ≤ δ ∥w_1 − w_2∥. This allows us to establish the following bound:

    J^glob(w̄_i) ≤ J^glob(w̄_{i−1}) + Re{∇J^glob(w̄_{i−1})^H (w̄_i − w̄_{i−1})} + (δ/2) ∥w̄_i − w̄_{i−1}∥²
              ≤ J^glob(w̄_{i−1}) − µ ∥∇J^glob(w̄_{i−1})∥² + Re{−µ ∇J^glob(w̄_{i−1})^H (f_{i−1} + s_i)}
                + (δ/2) µ² ∥∇J^glob(w̄_{i−1}) + f_{i−1} + s_i∥²,   (21)

where the second inequality uses (15).

Lemma 2 (Perturbation Bounds). Under assumptions A.2 to A.5, considering Corollary 1 and using the warm-up strategy, the perturbation terms in (15) satisfy the following inequalities for a small step-size µ:

    E{∥f_{i−1}∥²} ≤ [8(δ² + 3δ∆ + 4∆²) λ² (ζ² + Nσ²) / (1 − λ²)²] µ²,   (22)
    E{∥s_i∥² | F_{i−1}} ≤ σ².   (23)

The results of Lemma 2 and relation (21) establish the following bound for the network average.

Theorem 1 (Network Average Bound). Under assumptions A.1 to A.5, and based on Lemma 2, suppose that the diffusion adaptation algorithm minimizing the global cost (1) is run with the warm-up strategy and a fixed step-size µ satisfying:

    0 ≤ µ < 1/(4δ).   (24)

Then, the expected optimality gap satisfies the following inequality:

    E{J^glob(w̄_i) − J^glob_o} − (c_2 µ + c_1 µ² + 2c_1 δ µ³) / (2c(1 − 2µδ))
        ≤ [1 − cµ(1 − 2µδ)]^i [ E{J^glob(w̄_0) − J^glob_o} − (c_2 µ + c_1 µ² + 2c_1 δ µ³) / (2c(1 − 2µδ)) ],   (25)

with c_1 ≜ 8(δ² + 3δ∆ + 4∆²) λ² (ζ² + Nσ²) / (1 − λ²)² and c_2 ≜ δσ².

Theorem 1 illustrates the interplay between the step-size µ, the upper-smoothness parameter δ, and the bound σ² on the gradient noise. Even when the gradient computation is noisy, one can use a fixed step-size and ensure that the expected global cost evaluated at the network average w̄_i converges at a geometric rate to a value close to the optimum. The upper bound on the expected optimality gap is given by:

    lim sup_{i→∞} E{J^glob(w̄_i) − J^glob_o} ≤ (c_2 µ + c_1 µ² + 2c_1 δ µ³) / c,   (26)

for a sufficiently small step-size µ under condition (24). As the definition of c_1 shows, the lower-smoothness parameter ∆, which reflects the non-convexity of the local costs, directly affects this upper bound. This will be illustrated in the experiments section.
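To make the role of ∆ concrete, the short script below (ours; the constants are illustrative placeholders, not values from the paper) evaluates the steady-state bound (26) for a small and a large lower-smoothness parameter:

```python
def gap_bound(delta, Delta, lam, zeta, sigma2, c, N, mu):
    """Steady-state optimality-gap bound (26), with c1 and c2 as in Theorem 1."""
    c1 = 8 * (delta**2 + 3 * delta * Delta + 4 * Delta**2) * lam**2 \
         * (zeta**2 + N * sigma2) / (1 - lam**2)**2
    c2 = delta * sigma2
    return (c2 * mu + c1 * mu**2 + 2 * c1 * delta * mu**3) / c

# Illustrative values only; note how a larger Delta inflates c1 and hence the gap.
for Delta in (0.2, 14.0):
    print(Delta, gap_bound(delta=2.0, Delta=Delta, lam=0.9, zeta=1.0,
                           sigma2=0.1, c=1.0, N=16, mu=5e-4))
```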
4. SIMULATIONS

We present simulation results to demonstrate the effectiveness of the diffusion strategy in minimizing a strongly convex sum of non-convex local costs. In our simulations, we considered a network of 16 nodes, depicted in Figure 2(a). All nodes were initialized with all-zero parameter vectors, w_{k,0} = 0. We conducted 100 Monte Carlo runs and averaged the results to obtain the reported curves.

In the considered scenario, each agent k observes streaming complex realizations {d_{k,i}, x_{k,i}} from the linear model d_{k,i} = x_{k,i}^H w^o + z_{k,i}, where z_{k,i} denotes complex zero-mean Gaussian noise with variance σ_z², independent of all other signals. In a distributed context, a common method for estimating w^o is least-mean-square (LMS) estimation, which leads to the following local cost functions:

    J_k(w) = E_x{ |d_{k,i} − x_{k,i}^H w|² }.   (27)

The regression inputs x_{k,i} are complex zero-mean random vectors of length M = 50 governed by a complex Gaussian distribution with covariance matrices R_{x,k} = σ_{x,k}² I_M. The values of σ_z² and σ_{x,k}² for all nodes are provided in Figure 2(b).
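A sketch of the data generation for this model (ours; it simply mirrors the description above) reads:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 16, 50
# Unknown vector with CN(0, 1) entries
w_o = (rng.standard_normal(M) + 1j * rng.standard_normal(M)) / np.sqrt(2)

def sample(k, sigma_x2, sigma_z2):
    """One streaming pair (d_{k,i}, x_{k,i}) from d = x^H w_o + z at node k."""
    x = np.sqrt(sigma_x2[k] / 2) * (rng.standard_normal(M) + 1j * rng.standard_normal(M))
    z = np.sqrt(sigma_z2[k] / 2) * (rng.standard_normal() + 1j * rng.standard_normal())
    return np.vdot(x, w_o) + z, x   # np.vdot conjugates its first argument: x^H w_o
```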
[Fig. 1: MSD learning curve properties and comparisons. (a) MSD learning curves with varying θ. (b) MSD learning curves with varying ξ. (c) MSD learning curves.]
[Fig. 2: Network topology, node input and noise variances. (a) Network topology. (b) Input and noise variances for each node.]
We set the unknown vector w^o to a fixed realization sampled from CN(0, 1). We set the combination matrix A as a doubly stochastic matrix following the maximum-degree rule [22], with its entries a_{ℓk} in (4) given by:

    a_{ℓk} = { 1/N,                 if k ≠ ℓ are neighbors,
             { 1 − (|N_k| − 1)/N,   if k = ℓ,
             { 0,                   otherwise.   (28)
0, results obtained in (26), which indicate that parameter ∆ signifi-
otherwise.
cantly affects the upper bound on the expected optimality gap, we
Similar to the work [21], we designed two synthetic experi- conducted the following experiment. We set θ = 1 and ξ = 15 in
ments based on the construction of the sum-of-non-convex costs. Settings 1 and 2, ensuring that both have the same δ-upper smooth-
Specifically, given N diagonal matrices D1 , . . . , DN that satisfy ness characteristics. However, note that Setting 2 was set with a
D1 + . . . + DN = 0, we set each cost function Jk (w) as follows: large ∆ ≈ 14-lower smoothness, while Setting 1 was with a smaller
value of ∆ ≈ 0.2. The step-size µ in the diffusion LMS algorithm
Jk (w) = (dk,i − xH 2 H
k,i wk,i ) + wk,i Dk wk,i . (29)
was set to µ = 5 · 10−4 . The MSD learning curves for the three
Under this construction, each Jk = Ex {Jk } is non-convex if Dk settings are depicted in Figure 1(c). It can be observed that Setting 1
has negative entries on the diagonal smaller than −σk2 . We now achieves superior steady-state performance compared to Setting 2
present two different settings for designing Dk . due to its relatively smaller upper bound on the expected optimality
2 gap. Additionally, the performance of the diffusion LMS algorithm
Setting 1. Given θ ∈ [min{σx,k }N 2 N
k=1 , max{σx,k }k=1 )], for
when run over the sum of strongly convex local costs is enhanced.
each j ∈ {1, . . . , M }, we randomly select half of the indices k and
we set [Dk ]jj = θ. We set [Dℓ ]jj = −θ for the other half of the
2 5. CONCLUSION
indices ℓ. In this way, for nodes k where θ ≥ σx,k , Jk is δ-upper
2 2
smooth and ∆-lower smooth with δ = σx,k + θ and ∆ = θ − σx,k .
2 In this paper, we investigated the minimization of a strongly convex
Otherwise, Jk is strongly convex with c = σx,k − θ.
sum of non-convex local costs with the diffusion LMS. Through a
Setting 2. Let ξ = (N −1)θ, where θ is defined in Setting 1. For comprehensive analysis, we have analyzed the convergence behav-
each j ∈ {1, . . . , M }, we randomly select one matrix Dk and set its ior of the network. Simulations were conducted to validate our theo-
j-th diagonal entry, denoted as [Dk ]jj , to −ξ. Simultaneously, we retical findings and provide a comparison with a scenario involving
set the j-th diagonal entry of all other N − 1 matrices to ξ/(N − 1). (strongly) convex local costs.
To ensure generality and prevent the diagonal elements of each ma-
trix Dk from being uniformly positive or negative, we guarantee that
each index k is selected at least once during this setting procedure.
This approach ensures that each Jk is δ-upper smooth and ∆-lower
2 2
smooth, where δ = σx,k +θ and ∆ = (N −1)θ−σx,k . Observe that
6. REFERENCES

[1] A. Nedic and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Trans. Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.

[2] C. G. Lopes and A. H. Sayed, "Incremental adaptive strategies over distributed networks," IEEE Trans. Signal Process., vol. 55, no. 8, pp. 4064–4077, Aug. 2007.

[3] S. Kar and J. M. F. Moura, "Distributed consensus algorithms in sensor networks with imperfect communication: Link failures and channel noise," IEEE Trans. Signal Process., vol. 57, no. 1, pp. 355–369, 2009.

[4] A. H. Sayed, "Diffusion adaptation over networks," in Academic Press Library in Signal Processing, vol. 3, pp. 323–453. Elsevier, 2014.

[5] C. G. Lopes and A. H. Sayed, "Diffusion least-mean squares over adaptive networks: Formulation and performance analysis," IEEE Trans. Signal Process., vol. 56, no. 7, pp. 3122–3136, July 2008.

[6] J. Chen, C. Richard, and A. H. Sayed, "Multitask diffusion adaptation over networks," IEEE Trans. Signal Process., vol. 62, no. 16, pp. 4129–4144, Aug. 2014.

[7] J. Chen, C. Richard, and A. H. Sayed, "Diffusion LMS over multitask networks," IEEE Trans. Signal Process., vol. 63, no. 11, pp. 2733–2748, Jun. 2015.

[8] S. Vlaski and A. H. Sayed, "Distributed learning in non-convex environments—Part I: Agreement at a linear rate," IEEE Trans. Signal Process., vol. 69, pp. 1242–1256, 2021.

[9] S. A. Alghunaim and K. Yuan, "A unified and refined convergence analysis for non-convex decentralized learning," IEEE Trans. Signal Process., vol. 70, pp. 3264–3279, 2022.

[10] S. Vlaski and A. H. Sayed, "Distributed learning in non-convex environments—Part II: Polynomial escape from saddle-points," IEEE Trans. Signal Process., vol. 69, pp. 1257–1270, 2021.

[11] S. Shalev-Shwartz, "SDCA without duality, regularization, and individual convexity," in Proc. Int. Conf. Mach. Learn., PMLR, 2016, pp. 747–754.

[12] D. Garber and E. Hazan, "Fast and simple PCA via convex optimization," arXiv preprint arXiv:1509.05647, 2015.

[13] M. O'Connor, W. B. Kleijn, and T. Abhayapala, "Distributed sparse MVDR beamforming using the bi-alternating direction method of multipliers," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2016, pp. 106–110.

[14] S. Gade and N. H. Vaidya, "Distributed optimization of convex sum of non-convex functions," arXiv preprint arXiv:1608.05401, 2016.

[15] K. Hu, D. Jin, W. Zhang, and J. Chen, "Distributed optimization of quadratic costs with a group-sparsity regularization via PDMM," in Proc. Asia-Pacific Signal Inf. Process. Assoc. Annu. Summit Conf. (APSIPA ASC), 2018, pp. 1825–1830.

[16] C. Y. Chi, W. C. Li, and C. H. Lin, Convex Optimization for Signal Processing and Communications: From Fundamentals to Applications, CRC Press, 2017.

[17] A. H. Sayed, "Adaptation, learning, and optimization over networks," Foundations and Trends in Machine Learning, vol. 7, no. 4-5, pp. 311–801, 2014.

[18] B. Swenson, S. Kar, H. V. Poor, and J. M. F. Moura, "Annealing for distributed global optimization," in Proc. IEEE 58th Conf. Decision and Control (CDC), 2019, pp. 3018–3025.

[19] S. Vlaski and A. H. Sayed, "Distributed learning in non-convex environments—Part I: Agreement at a linear rate," IEEE Trans. Signal Process., vol. 69, pp. 1242–1256, 2021.

[20] B. Ying, K. Yuan, Y. Chen, H. Hu, P. Pan, and W. Yin, "Exponential graph is provably efficient for decentralized deep training," Proc. Adv. Neural Inf. Process. Syst., vol. 34, pp. 13975–13987, 2021.

[21] Z. Allen-Zhu and Y. Yuan, "Improved SVRG for non-strongly-convex or sum-of-non-convex objectives," in Proc. Int. Conf. Mach. Learn., PMLR, 2016, pp. 1080–1089.

[22] L. Xiao, S. Boyd, and S. Lall, "A scheme for robust distributed sensor fusion based on average consensus," in Proc. IPSN, Los Angeles, CA, Apr. 2005, pp. 63–70.
