
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JIOT.2021.3101991, IEEE Internet of Things Journal

Differential Privacy Meets Federated Learning under Communication Constraints

Nima Mohammadi, Jianan Bai, Qiang Fan, Yifei Song, Yang Yi, and Lingjia Liu

Abstract—The performance of federated learning systems is bottlenecked by communication costs and training variance. The communication overhead problem is usually addressed by three communication-reduction techniques, namely, model compression, partial device participation, and periodic aggregation, at the cost of increased training variance. Different from traditional distributed learning systems, federated learning suffers from data heterogeneity (since the devices sample their data from possibly different distributions), which induces additional variance among devices during training. Various variance-reduced training algorithms have been introduced to combat the effects of data heterogeneity, but they usually cost additional communication resources to deliver the necessary control information. Additionally, data privacy remains a critical issue in FL, and thus there have been attempts at bringing Differential Privacy to this framework as a mediator between utility and privacy requirements. This paper investigates the trade-offs between communication costs and training variance under a resource-constrained federated system, theoretically and experimentally, and studies how the communication-reduction techniques interplay in a differentially private setting. The results provide important insights into designing practical privacy-aware federated learning systems.

Index Terms—Federated Learning, Differential Privacy, Artificial Intelligence, Communication Constraints, Training Variance

The authors are with the Bradley Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg, VA 24060 USA. The work is supported in part by the US National Science Foundation (NSF) under grant CCF-1937487. The corresponding author is L. Liu (ljliu@ieee.org).

I. INTRODUCTION

Artificial intelligence is expected to have a significant impact on Beyond-5G and 6G networks [1]. This is especially true for the Internet of Things (IoT) [2], where massive connectivity is expected from heterogeneous devices. With the ever-increasing importance of data privacy on these devices, federated learning has emerged as a promising machine learning framework that enables the training of a shared model among multiple end devices and a parameter server without exchanging local data [3]–[6]. Assuming a total of N local devices, where each device i ∈ {1, 2, · · · , N} possesses a local dataset D_i with |D_i| samples, the local objective of device i can be formulated as the following risk minimization problem:

    min_x f_i(x) = E_{ξ∼D_i} ℓ(x; ξ),

where x ∈ R^d denotes the model parameter, ξ ∈ R^u is a training sample, and ℓ : R^d × R^u → R is the sample-wise loss function. On the other hand, the goal of the parameter server is to find a single global model that works well on the whole dataset D = D_1 ∪ D_2 ∪ · · · ∪ D_N. Accordingly, the global objective function can be represented as f(x) = Σ_{i=1}^N w_i f_i(x), where w_i = |D_i|/|D| is the device weight, proportional to the sample size [7].

Despite the privacy advantages of federated learning, which, as we discuss later, are themselves prone to some challenges, its real-world implementation raises issues regarding communication overhead and training variance:

A. Communication Overhead

Since the parameter server has no access to the local datasets, it needs to collect the local model updates from the local devices periodically and then send the aggregated model back to all devices. The communication rounds between the parameter server and the local devices result in a substantial communication overhead, especially when the number of devices is large. Moreover, the underlying communication links between the parameter server and the local devices are usually imperfect and of limited capacity; hence, it is imperative to design reliable and efficient federated learning algorithms under communication budget constraints.

From a communication point of view, federated learning systems realize model aggregation (uplink) through the multiple access channel (MAC) [8] and model distribution (downlink) through the broadcast channel (BC) [9]. Since a substantial number of local devices send potentially different local model updates to the parameter server during model aggregation, the limited capacity of the uplink is usually the bottleneck of the system. Therefore, the existing literature suggests the following communication-reduction strategies for enhancing the communication efficiency of federated learning systems, especially on the uplink (MAC) side:

1) Model Compression: Transmitting each full-accuracy single-precision floating-point value requires 32 bits. Since the model size is generally large in machine learning systems, transmitting model parameters with full accuracy can be prohibitively expensive. In this regard, a common strategy is to quantize the local model updates with low-accuracy compressors or to send only the most important local parameters. Overall, the communication overhead can be significantly reduced via compression.

2) Partial Participation: Due to the straggler effect (some devices becoming non-responsive and inactive), the response time for some devices can be prohibitively long. Therefore, awaiting model updates from all devices is not desirable in practice. Furthermore, the capacity of the underlying MAC is usually limited and does not increase linearly with


the number of transmitting devices. A natural solution to these issues is partial participation through scheduling: only M < N devices are scheduled to transmit during each communication slot (round).

3) Periodic Aggregation: The aggregation of local models requires synchronization among all active devices. Aggregating in every iteration of training, as in traditional distributed learning, results in a large communication overhead. Hence, frequent model synchronization may become unrealistic and consume substantial communication resources. A widely adopted strategy is to conduct several local iterations before synchronizing with the parameter server. However, since all devices perform local updates in an unsynchronized way, the update direction can deviate from the global gradient direction, especially in non-i.i.d. settings.

It is important to note that the three communication-reduction strategies are inherently coupled under communication constraints. For example, the payload of the model parameters inevitably impacts how many devices (M) can be active under a fixed MAC capacity. Meanwhile, there is a clear trade-off between the payload of the model parameters and the period of model aggregation under a fixed MAC capacity constraint: a larger payload leads to less frequent model aggregation. Therefore, a joint analysis is necessary to obtain a comprehensive picture of federated learning systems under communication constraints.

B. Training Variance

The training variance in federated learning can come from different sources. A universal one is the variance of the stochastic gradient, which exists in all stochastic gradient descent (SGD)-based training algorithms. This variance arises because, instead of evaluating the true gradient ∇f(x) over the whole training dataset, which requires expensive computing resources, an estimate ∇̃f(x) is computed only over a mini-batch of the training dataset. Although ∇̃f(x) is usually an unbiased estimate of ∇f(x) [10], it has a variance, given by E‖∇̃f(x) − ∇f(x)‖², that depends on the size of the mini-batches. Compared to centralized SGD, which only has the stochastic gradient variance, federated learning systems additionally suffer from the variance induced by imperfect communication and data heterogeneity.

1) Data Heterogeneity: In practice, the local datasets D_i, i ∈ {1, · · · , N}, are drawn from possibly different and unknown distributions P_i, i ∈ {1, · · · , N}. Thus, the local objective functions f_i(x) are non-uniform (different) among the local devices and can deviate from the global objective f(x). The presence of data heterogeneity therefore renders the training and analysis of federated learning systems more challenging than for traditional distributed learning systems.

2) Imperfect Communication: To see how imperfect communication results in additional training variance, we assume there is no stochastic gradient variance, i.e., ∇̃f(x) = ∇f(x), and examine the effects of the three communication-reduction techniques individually. First, when model compression is used, the gradient direction is quantized as Q(∇f(x)), which can deviate from ∇f(x). Although some stochastic compressors can provide an unbiased estimate of ∇f(x), the variance cannot be eliminated. Second, when the set of participating devices, S, does not contain all the local devices, the aggregated gradient Σ_{i∈S} ∇f_i(x) may not align with ∇f(x). Third, under periodic aggregation, the devices can generate different local models, which results in client drift [11].

3) Variance Reduction: Attempts at combating the training variance among local devices have led to the development of many variance-reduction techniques for federated learning. Some of them originate from DANE [12], a classical optimization method that introduces a sequence of local subproblems to reduce client drift. Fed-DANE [13] is adapted from DANE by allowing partial device participation. Network-DANE [14] is developed for decentralized federated learning. SCAFFOLD [11] can also be viewed as an improved version of DANE in federated settings, obtained by introducing control variates.

C. Local Differential Privacy

Federated learning achieves some level of privacy by keeping the local datasets on user devices and only sharing the local updates with the server [15]. This, however, has been proven insufficient for maintaining data privacy, as the parameters can reveal insights into the data used for training. Consequently, FL by itself can only be deployed among honest participating parties, and extending it to secure and privacy-preserving settings requires extra measures.

By design, federated learning is ignorant of how the local updates are generated, making it vulnerable to different forms of privacy and robustness attacks from one or more malicious users [16]. Also, sharing the raw gradients can impose privacy risks for clients: it can be exploited by a curious aggregation server, by an adversary eavesdropping on the transmitted local updates, or by a malicious participating client who may or may not be aware of the architecture of the model. One consequence is the membership inference attack, in which an adversary aims to learn whether a certain record has been used for training the model (i.e., whether it belongs to the local private data of one of the clients) [17]. To this end, the adversary generates an attack model via its domain knowledge, typically through shadow models trained on noisy real or synthetic data, or on data acquired via model inversion attacks. As an example, consider a model prescribing drugs for various patients. This form of attack allows the attacker to deduce whether a patient participating in the study suffers from a disease (e.g., Alzheimer's), sensitive information that could later be used against the patient. This phenomenon can be regarded as an implication of the model overfitting the training data, which suggests using regularization techniques (e.g., dropout and l2-norm penalties) to overcome the problem. However, the high learning capacity of machine learning algorithms renders such attempts at preventing memorization of training data insufficient [18].
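The unbiased-but-noisy behavior of stochastic compressors described in the Training Variance subsection above can be illustrated with a small sketch. This is a generic stochastic quantizer written for illustration, not the specific compressor analyzed in this paper; the function name and parameters are our own.

```python
import numpy as np

def stochastic_quantize(v, s, rng):
    """Unbiased stochastic quantization: each coordinate of v is mapped
    to one of s+1 evenly spaced magnitude levels in [0, ||v||], keeping
    its sign, with randomized rounding so that E[Q(v)] = v."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    scaled = np.abs(v) / norm * s          # position in [0, s]
    lower = np.floor(scaled)
    prob_up = scaled - lower               # probability of rounding up
    level = lower + (rng.random(v.shape) < prob_up)
    return np.sign(v) * level * norm / s

rng = np.random.default_rng(0)
v = rng.standard_normal(1000)
q = stochastic_quantize(v, s=4, rng=rng)
# Averaging many independent quantizations recovers v (unbiasedness),
# but any single Q(v) deviates from v: the variance is not eliminated.
avg = np.mean([stochastic_quantize(v, s=4, rng=rng) for _ in range(2000)],
              axis=0)
```

Coarser quantization (smaller s) shrinks the per-update payload but inflates the per-round variance, which is exactly the compression/variance trade-off discussed above.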


With the lack of a rigorous privacy guarantee for FL, there have been attempts to bring the de facto framework of privacy-preserving analysis, Differential Privacy (DP) [19], and its local counterpart, Local DP [20], to federated learning. Originally motivated by the inadequacy of anonymization techniques for enhancing privacy, DP has emerged as the gold standard of privacy protection. Privatizing machine learning by adding noise has produced different methods, distinguished by the stage at which noise addition takes place. Specifically, for an Empirical Risk Minimization problem, three main approaches have been introduced for differentially private optimization: 1) Objective Perturbation, where a randomized regularization term is added to the loss function; 2) Output Perturbation, where noise is added to the parameters of a non-private model after training; and 3) the currently prevailing Gradient Perturbation, a more practical method that adds noise to the released gradient at each step, which has drawn more attention as it has been shown to be effective for nonconvex problems (as opposed to the other two methods) [21], [22].

To achieve data privacy in FL, instead of submitting raw local updates, the local parameters can first be perturbed using a randomization algorithm and then released to the parameter server (local model); alternatively, in the centralized fashion, the trusted parameter server can add noise to the aggregated updates (curator model). This addition of noise ensures that the local updates remain private and do not leak unnecessary information. Notice that there is a clear trade-off between the utility (accuracy of the model being trained) and the privacy achieved by the noise. The perturbation is meant to impede attempts to infer the true values of a client with strong confidence while still allowing accurate inferences about the population.

In the DP setting, it is assumed that the party responsible for aggregating the results, the parameter server, is trusted. Therefore, privacy can be maintained by adding noise to the aggregated results. However, for the surging edge computing and IoT applications, the parameter server ideally should not need to be trusted. This necessity naturally drives the research toward the local mode of DP, where each client perturbs its own data to ascertain that the data is kept private. This ensures that the clients' privacy is maintained even against the aggregator or an attacker that gains access to a client's data on the server. However, the downside of the local mode is that the accumulated noise incurs more accuracy loss than in the centralized context, mandating more training data and longer training.

The privacy parameters (ε, δ) quantify DP, where smaller values correspond to stronger privacy. Formally, a randomized algorithm A is (ε, δ)-differentially private if, for all S ⊆ Range(A) and for all adjacent datasets D and D′, we have

    Pr[A(D) ∈ S] ≤ e^ε Pr[A(D′) ∈ S] + δ,    (1)

where D and D′ are two datasets that differ in only a single entry. In this context, the randomized algorithm A provides privacy by making the two datasets difficult to distinguish. However, in the absence of trust in the data collector, local DP is deemed more suitable. Formally, a randomized mechanism M is (ε, δ)-LDP if, for any pair of inputs x and x′ in X and any measurable subset O ⊆ Range(M), we have

    Pr[M(x) ∈ O] ≤ e^ε · Pr[M(x′) ∈ O] + δ.    (2)

Again, the privacy guarantee of M is determined by ε, but with a small probability δ it might not hold.

D. Contributions

Although the effects of model compression, partial device participation, periodic aggregation, and data heterogeneity have been studied, few works in the literature provide a comprehensive analysis by jointly considering all of them. This paper, to the best of our knowledge, is the first to present a convergence analysis of stochastic gradient descent in privacy-preserving federated settings that considers all three communication-reduction techniques in the presence of data heterogeneity for strongly convex loss functions. Based on the results, we obtain a clear understanding of how these components affect the convergence of federated learning systems and interplay with each other. Furthermore, we investigate differential privacy in the context of federated learning by introducing privacy-augmented FedPaq and discussing the impact of the parameters describing the gradient perturbation on the performance of the model. Moreover, a privacy amplification method, based on subsampling of the local datasets, is employed to enhance the convergence rate of the privatized model. In addition, our analysis provides practical insights into the design of a privatized FL model and of communication strategies to accelerate the training process of such systems under a limited communication budget. To be specific, since all the parameters involved in the privacy measure and the three communication-reduction techniques are jointly considered, we are able to quantitatively analyze the trade-offs between different options. As a result, we provide important design intuitions for real-world differentially private federated learning systems that are limited by the communication capacity constraints of wireless networks.

E. Related Works

There have been several works on the convergence analysis of federated learning systems with different communication-reduction approaches.

The convergence rate of the FedAvg algorithm, first introduced in [3], has been studied in a non-i.i.d. data setting with partial device participation and periodic aggregation in [23]. However, the authors did not consider quantization. A similar set of results was presented in [24]. In [25], the FedPaq framework was introduced and analyzed under all three communication-reduction approaches. However, the authors assumed i.i.d. (homogeneous) data. To combat the effect of data heterogeneity, [26], [27] introduced an adaptation of FedAvg, named FedProx, in which a proximal term is added to each local objective function. However, these two works did not consider quantization. Finally, and perhaps most recently, [11] introduced the SCAFFOLD framework, which incorporates a variance-reduction mechanism to combat the effect of non-i.i.d. data.
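As a concrete sketch of a randomized mechanism in the sense of definitions (1) and (2), the classical Gaussian mechanism calibrates its noise scale to the L2 sensitivity ∆f; the calibration σ = ∆f·√(2 ln(1.25/δ))/ε yields (ε, δ)-DP for ε ∈ (0, 1). The function names and the clipping step below are illustrative, not this paper's exact procedure.

```python
import math
import numpy as np

def gaussian_sigma(sensitivity, eps, delta):
    """Classical Gaussian-mechanism calibration:
    sigma = sensitivity * sqrt(2 ln(1.25/delta)) / eps,
    giving (eps, delta)-DP for eps in (0, 1)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps

def privatize(update, sensitivity, eps, delta, rng):
    """Clip the update to L2 norm `sensitivity`, then add N(0, sigma^2 I)."""
    norm = np.linalg.norm(update)
    if norm > sensitivity:
        update = update * (sensitivity / norm)   # enforce the sensitivity bound
    sigma = gaussian_sigma(sensitivity, eps, delta)
    return update + rng.normal(0.0, sigma, size=update.shape)

rng = np.random.default_rng(1)
u = rng.standard_normal(100)
u_priv = privatize(u, sensitivity=1.0, eps=0.5, delta=1e-5, rng=rng)
```

Smaller ε or δ inflates σ and degrades accuracy, mirroring the utility/privacy mediation discussed above.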


More specifically, in this framework, some control variates are introduced to reduce the drift among different devices. While the authors of [11] show that SCAFFOLD achieves better performance than FedAvg, the improvement is not significant under moderate heterogeneity (e.g., 10% similarity). On the other hand, since in SCAFFOLD all active devices need to send an additional control-variate update, which has the same dimension as the model parameter, to the parameter server, the communication cost is doubled in each round. In practice, FedPaq thus remains a good candidate federated learning algorithm. However, in the original paper [25], the convergence analysis relies on the i.i.d. data assumption, which is unrealistic. To investigate the trade-offs between communication costs and training variance, it is important to obtain convergence results for FedPaq under data heterogeneity, which is one of the main contributions of this paper.

In the literature, various randomized mechanisms and variations of DP and LDP, referred to as protocols, have been investigated to bring privacy guarantees to FL. In [28], the notion of Condensed LDP (α-CLDP) is introduced, which is later used in [29] to propose LDP-FL. Developing quantization schemes that are compatible with continuous Gaussian and Laplacian noise is a recent research theme: adding continuous noise after quantization turns the value back into a continuous number, and the benefits of compression are lost. For LDP, biased compressors cannot be used, as they break the independence between rounds. This has led to work on discrete noise addition and on compressors that take the privacy perturbation into account. cpSGD overcomes the problem of discrete values by adding noise drawn from a Binomial distribution and showing that, for small ε, it mimics the Gaussian mechanism [30]. In [31], a discrete Gaussian noise is introduced, which has been used in [32].

II. RESOURCE-CONSTRAINED PRIVACY-AUGMENTED FEDERATED LEARNING

We start by introducing a general framework of federated learning systems, shown in Fig. 1. The system consists of N local devices located in a wireless network, all connected to a parameter server that coordinates the devices to train a shared model. In a given communication round k ∈ [K]¹, the parameter server first distributes the aggregated information (x_k, c_k), obtained in the last communication round, to all active devices, where x_k is the global model and c_k is the global control variate. On the other hand, each device i in the set of participating (active) devices, S_k, calculates the local model update Δx_k^(i) := V_i(x_k, c_k; D_i) and the update of the control variate, Δc_k^(i), for some local functions V_i, i ∈ S_k. The complete training process is summarized in Algorithm 1.

¹We use [K] to represent the set {1, · · · , K}.

Algorithm 1 General Federated Learning
Input: global learning rate η_{k,g} for k ∈ [K]
Initialize: model parameters x_0 ∈ R^d
1: for each round k = 0, · · · , K − 1 do
2:   on each device i ∈ S_k:
3:     calculate Δx_k^(i) = V_i(x_k, c_k; D_i)
4:     (optional) calculate Δc_k^(i)
5:     send Δx_k^(i) and Δc_k^(i) to the parameter server
6:   on the parameter server:
7:     collect the local updates from devices in S_k
8:     calculate x_{k+1} = x_k + (η_{k,g}/M) Σ_{i∈S_k} Δx_k^(i)
9:     (optional) calculate c_{k+1}
10:    broadcast x_{k+1} and c_{k+1} to all devices
11: end for

A. Federated Learning Algorithms

Different designs of V_i and S_k lead to different algorithms. Here, we introduce some exemplary algorithms:

1) Distributed SGD: In distributed SGD [33], all devices participate in each communication round, i.e., S_k = [N] for all k ∈ [K]. Additionally, the local model update is calculated by one local SGD step, such that

    Δx_k^(i) = −η_{k,l} ∇̃f_i(x_k),    (3)

where η_{k,l} is the local learning rate used at communication round k. Distributed SGD does not include a control variate. However, since it requires full participation and communication in every training round, it can result in overwhelming communication costs.

2) FedAvg and FedPaq: FedAvg [23] is proposed to save communication costs through partial participation and periodic aggregation. Let E be the number of local iterations; then a device i ∈ S_k generates a sequence of local models {y_{k,0}^(i), y_{k,1}^(i), · · · , y_{k,E}^(i)} with

    y_{k,0}^(i) = x_k  and  y_{k,t+1}^(i) = y_{k,t}^(i) − η_{k,l} ∇̃f_i(y_{k,t}^(i)),    (4)

for t ∈ [E]. Then, the local model update is given by

    Δx_k^(i) = y_{k,E}^(i) − x_k.    (5)

Similarly, FedPaq [25] also generates the sequence of local models in (4), but clients send a quantized version of the local update to the parameter server, i.e.,

    Δx_k^(i) = Q(y_{k,E}^(i) − x_k),    (6)

to further reduce the communication overhead. Notice that FedPaq is a generalization of both distributed SGD (with E = 1, S_k = [N], and Q(x) = x) and FedAvg (with Q(x) = x).

3) SCAFFOLD: Instead of updating the local models only by the local stochastic gradients ∇̃f_i(y), SCAFFOLD [11] uses control variates to reduce the drift that occurs among different devices. To be specific, one local step in SCAFFOLD is given by

    y_{k,t+1}^(i) = y_{k,t}^(i) − η_{k,l} (∇̃f_i(y_{k,t}^(i)) − c_k^(i) + c_k),    (7)

while the local control variate c_k^(i) and the global control variate c_k are updated by

    c_{k+1}^(i) = c_k^(i) − c_k − (1/(E η_{k,l})) Δx_k^(i),    (8)


Fig. 1: Federated learning system.

    c_{k+1} = c_k + (1/N) Σ_{i∈S_k} (c_{k+1}^(i) − c_k^(i)).    (9)

The idea behind SCAFFOLD is to mimic the ideal update under centralized SGD, i.e.,

    ∇̃f_i(y_{k,t}^(i)) − c_k^(i) + c_k ≈ (1/N) Σ_{j∈[N]} ∇̃f_j(y_{k,t}^(j)).    (10)

B. Communication Constraints

To analyze the trade-offs between communication costs and training variance, we focus our analysis on FedPaq, which jointly considers model compression, partial device participation, and periodic aggregation under data heterogeneity. To simplify the analysis while preserving the essence of the communication constraint, we assume the capacity of the underlying M-user MAC for the federated learning system is bounded by C bits/second [34] and the total duration of the training process is T seconds. During the training process, each local device can conduct a total of T training iterations, while a total of B = CT bits can be shared among the active devices participating in the model aggregation process. Accordingly, we can link the communication constraints of the federated learning system of interest as follows:

    B = CT = KMβ = ⌊T/E⌋ Mβ ≈ TMβ/E,    (11)

where K = ⌊T/E⌋ is the total number of training rounds and β is the number of bits required to transmit a model update. In this way, we obtain a unified framework for the performance analysis of federated learning under communication constraints.

C. Privacy Measure

Algorithm 2 outlines the introduced privacy-aware version of FedPaq. After E local iterations, each client generates the local update to be sent to the parameter server. Prior to uploading, the local model updates are perturbed by random noise drawn from a Gaussian distribution to preserve the desired privacy level, a well-known procedure referred to as the Gaussian mechanism for achieving an (ε, δ)-DP privacy guarantee.

Algorithm 2 Privacy-Augmented FedPaq
Input: η_k for k ∈ [K]
Initialize: model parameters x_0 ∈ R^d
1: for each round k ∈ [K] do
2:   on each device i ∈ S_k:
3:     initialize the local model x_{k,0}^(i) = x_k
4:     select a random subset D_s of size Eb from D_i
5:     for each iteration t = 0, · · · , E − 1 do
6:       calculate x_{k,t+1}^(i) = x_{k,t}^(i) − η_k ∇̃f_i(x_{k,t}^(i))
7:     end for
8:     draw z_k^(i) ∼ N(0, σ_{i,k}² 1_d)
9:     send Δx_k^(i) = Q(x_{k,E}^(i) − x_k) + z_k^(i) to the server
10:  on the parameter server:
11:    calculate x_{k+1} = x_k + (1/M) Σ_{i∈S_k} Δx_k^(i)
12:    broadcast the global model x_{k+1} to all devices
13: end for

To locally differentially privatize a function f(X) subject to (ε, δ), we use

    M(X, f, σ) ≜ f(X) + N(0, σ²I),    (12)

with

    ε = (∆f/σ) √(2 ln(1.25/δ)),    (13)

for any δ ∈ (0, 1], where ∆f bounds the L2 sensitivity of f(X), that is,

    ‖f(x) − f(x′)‖₂ ≤ ∆f,  ∀x, x′ ∈ X.    (14)

To reduce the amount of noise that must be added to the local updates to achieve a given privacy guarantee, one may use privacy amplification techniques. Here, we employ subsampling into mini-batches to achieve this reduction. An algorithm f reaches a better privacy guarantee when applied to a random subsample of the dataset instead of the full sample. Intuitively, this is because the data not included in the subsample enjoy full privacy; hence, the privacy is amplified.
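A minimal sketch of the per-round logic of Algorithm 2, namely E local SGD steps, quantization, and the Gaussian perturbation of (12), is given below. The toy quadratic objectives, the identity quantizer, and the fixed noise scale are illustrative placeholders, not the paper's experimental setup.

```python
import numpy as np

def private_fedpaq_round(x, grad_fns, S, E, lr, Q, sigma, rng):
    """One round of privacy-augmented FedPaq: each scheduled device i in S
    runs E local SGD steps from the global model x, sends the quantized,
    Gaussian-perturbed update Q(y_E - x) + z, and the server averages."""
    updates = []
    for i in S:
        y = x.copy()
        for _ in range(E):
            y = y - lr * grad_fns[i](y)              # local SGD step
        z = rng.normal(0.0, sigma, size=x.shape)     # privacy noise (line 8)
        updates.append(Q(y - x) + z)                 # noisy update (line 9)
    return x + np.mean(updates, axis=0)              # aggregation (line 11)

# Toy heterogeneous quadratics f_i(x) = 0.5 * ||x - c_i||^2, whose global
# optimum is the average of the local optima c_i.
centers = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
grad_fns = [lambda y, c=c: y - c for c in centers]

rng = np.random.default_rng(2)
x = np.zeros(2)
for _ in range(200):
    x = private_fedpaq_round(x, grad_fns, S=[0, 1], E=5, lr=0.1,
                             Q=lambda u: u, sigma=0.01, rng=rng)
# x hovers near the global optimum [0.5, 0.5], jittered by the privacy noise.
```

With sigma = 0 and the identity quantizer this reduces to FedAvg, showing how the privacy noise enters on top of the communication-reduction techniques.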


be determined based on the sampling procedure² [35]. For subsampling mini-batches of size b without replacement, we achieve (log(1 + (1 − (1 − b/n_k)^E)(e^ε − 1)), γδ)-DP for Δx_k^{(i)}, where γ := Eb/n_k.

Notice that an (ε, δ)-DP mechanism M is also (ε′, δ′)-DP for any ε′ ≥ ε and any δ′ ≥ δ. Then,

  log(1 + (1 − (1 − b/n_k)^E)(e^ε − 1)) ≤ log(1 + γ(e^ε − 1)) ≤ γ(e^ε − 1) ≤ 2γε,   (15)

hence the introduced method satisfies at least (ε, δ)-DP for x_{k,E}^{(i)} via adding reduced Gaussian noise of variance

  σ_{i,k}² = 2(Δf)² ln(1.25/(δ/γ)) / (ε/(2γ))².   (16)

Following this procedure, having fixed the privacy parameters ε and δ, the algorithm achieves higher utility (in terms of model accuracy) and converges faster. Needless to say, clipping the local gradients to a small value to maintain lower global sensitivity, and subsampling, although they result in noise of smaller magnitude, may also adversely impact convergence in the non-secure setting. Therefore, a careful convergence analysis is deemed necessary.

III. CONVERGENCE OF PRIVATIZED FEDPAQ UNDER HETEROGENEITY

In this section, we present the convergence analysis for the Privatized FedPaq algorithm under the non-i.i.d. data setting (Algorithm 2) with the following assumptions:

Assumption 1: The loss function ℓ is L-smooth, such that ‖∇ℓ(x) − ∇ℓ(y)‖ ≤ L‖x − y‖ for any x, y ∈ R^d. Consequently, f_i and f are also L-smooth.

Assumption 2: The loss function ℓ is µ-strongly convex, such that ‖∇ℓ(x) − ∇ℓ(y)‖ ≥ µ‖x − y‖ for any x, y ∈ R^d. Consequently, f_i and f are also µ-strongly convex.

Assumption 3: The stochastic gradient is unbiased and variance-bounded, such that for any x ∈ R^d and the mini-batch samples ξ, we have E[∇f_i(x; ξ)] = ∇f_i(x) and E‖∇f_i(x; ξ) − ∇f_i(x)‖² ≤ σ²/b for all i = 1, 2, …, N. Here, b is the mini-batch size.

Assumption 4: Q(·) is an unbiased and q-lossy random compressor, such that for any x ∈ R^d, E[Q(x)] = x and E‖Q(x) − x‖² ≤ q‖x‖². Note that q is a real value in [0, 1], which is determined by the resolution of the compressor, β. When q = 0, we have Q(x) = x and there is no quantization. For q = 1, no information is sent during communications.

Remark 1: While Assumptions 1, 2, and 3 are widely used in the literature, the unbiasedness of the compressed model stated in Assumption 4 does not always hold in practice. For example, SignSGD [36] uses a biased quantizer, which makes the analysis more complicated. For the tractability of our analysis, we use an unbiased compressor throughout this paper. Examples of unbiased quantizers include quantized-SGD and TernGrad [37], [38], which preserve the true values in expectation. Furthermore, for quantized-SGD, [37] shows that q = min{n/s², √n/s}, where n is the block size and s is the quantization level. For simplicity, we assume n = d.

Remark 2: While the assumption of strong convexity is restrictive, it facilitates our analysis since the convergence rate can be directly measured by ‖x_k − x*‖². Additionally, µ > 0 allows us to use the learning rate η_k = 4µ⁻¹/(kE + 4E) in Theorem 1. Relaxing this assumption is part of our future work.

Remark 3: Notice that Assumption 3 holds for all choices of batch size when calculating the stochastic gradients. A larger batch size leads to a smaller variance bound σ²/b.

Remark 4: Notice that Assumptions 1, 2, and 3 are also used in [23], which proves the convergence rate of FedAvg under non-i.i.d. data. Our paper differs from [23] in that we consider privacy concerns and compression of model updates, which makes our system more realistic but also makes the underlying analysis more challenging, since the original results cannot be directly applied. Moreover, [23] makes the additional assumption that the expected squared norm of the stochastic gradients is uniformly bounded, i.e., E‖∇f_i(x_{k,t}^{(i)}; ξ_{k,t}^{(i)})‖² ≤ G², where G > 0, for all devices and all time steps.

Assumption 5: We assume that for all x ∈ R^d, there exists a λ > 0 such that

  (1/N) Σ_{i∈[N]} ‖∇f_i(x) − ∇f(x)‖² ≤ λ².   (17)

The value of λ quantifies the degree of data heterogeneity among different local devices. When λ = 0, we have ∇f_i(x) = ∇f(x) for all i = 1, 2, …, N, and there is no heterogeneity. It can be seen from (17) that a larger λ implies a higher degree of heterogeneity.

Remark 5: Different papers usually adopt different assumptions on data heterogeneity. For example, [23] defines Γ = f* − Σ_{i=1}^N w_i f_i* and [24] defines ς² = (1/N) Σ_{i=1}^N ‖∇f_i(x*)‖² to quantify the degree of non-i.i.d.-ness. Although our measure of heterogeneity is different from these two papers, we note that both Γ and ς can still be expressed using λ in the strongly convex setting.

Assumption 6: The stochastic gradient is uniformly bounded, such that for any x ∈ R^d and the mini-batch samples ξ, we have E‖∇f_i(x; ξ)‖² ≤ G² for all i = 1, 2, …, N.

For the simplicity of our analysis, we additionally assume the device weights w_i to be uniform among all devices. However, as suggested in [23], this does not result in a loss of generality of our results. Indeed, by replacing the local objective with f̃_i(x) = w_i N f_i(x) and transforming the values of L, µ, σ, and λ accordingly, the global objective becomes the simple average of the transformed local objectives, i.e., f(x) = (1/N) Σ_{i=1}^N f̃_i(x). Moreover, we assume the same gradient perturbation is performed by all clients.

² Poisson subsampling, sampling without replacement, and sampling with replacement are possible options that have been studied in the literature.
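As a numerical sanity check of the amplification chain in (15) and the noise calibration in (16), the following sketch can be used. This is our own illustration, not the paper's implementation; the sensitivity Δf is left as an abstract parameter, and the variable names are ours.

```python
import math

def amplified_epsilon(eps: float, b: int, n_k: int, E: int) -> float:
    """Subsampled privacy parameter from (15):
    log(1 + (1 - (1 - b/n_k)^E) * (e^eps - 1))."""
    return math.log(1.0 + (1.0 - (1.0 - b / n_k) ** E) * (math.exp(eps) - 1.0))

def gaussian_variance(delta_f: float, eps: float, delta: float, gamma: float) -> float:
    """Reduced Gaussian noise variance from (16):
    sigma^2 = 2 (delta_f)^2 ln(1.25 / (delta/gamma)) / (eps / (2 gamma))^2."""
    return (2.0 * delta_f ** 2 * math.log(1.25 / (delta / gamma))
            / (eps / (2.0 * gamma)) ** 2)

# Example: batches of b = 32 out of n_k = 1000 samples, E = 10 local steps.
eps, delta, b, n_k, E = 0.5, 1e-4, 32, 1000, 10
gamma = E * b / n_k                      # subsampling ratio, gamma := Eb/n_k
eps_sub = amplified_epsilon(eps, b, n_k, E)
# The chain of inequalities in (15) for this example:
assert eps_sub <= math.log(1.0 + gamma * (math.exp(eps) - 1.0)) <= 2.0 * gamma * eps
sigma_sq = gaussian_variance(1.0, eps, delta, gamma)   # with delta_f = 1
```

With these parameters, the subsampled privacy parameter drops from ε = 0.5 to roughly 0.17, which is the amplification that allows the added noise in (16) to be reduced.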


Using the assumptions and lemmas above, the convergence rate of Algorithm 2 for the secure federated learning setting can be characterized in the following theorem:

Theorem 1: Training the secure federated learning system using Algorithm 2, under Assumptions 1, 2, 3, 4, 5, and 6, for T iterations and setting the learning rate in the k-th round as η_k = 4µ⁻¹/(kE + 4E), the aggregated global model x_K, where K = T/E, satisfies

  E‖x_K − x*‖² ≤ (16E²/T²) E‖x_{k_0} − x*‖²
    + (16/µ²)(2qG²/M + qG²/N)(E/T)
    + (16/µ²)(4eσ²/(bM) + 3Lλ²/µ + σ²/(bN))(1/T)
    + (128eλ²/µ²)(E − 1)/(MT)
    + (128G²/µ²)(E − 1)²/T
    + 4096dG²b²(1 + q) ln(1.25Eb/(|D_i|δ)) E³ / (M|D_i|²ε²T).   (18)

Remark 6: For the vanilla version of the algorithm, without the addition of the noise required for privacy, similar to [23], [25], and [26], our result suggests a convergence rate of O(E²/T), which requires E = o(√T). However, this term appears only when data heterogeneity exists, i.e., λ > 0. When there is no data heterogeneity among the local devices, the training can be significantly accelerated.

Remark 7: Convergence slows down when q, E, and λ become larger, whereas a larger M (more active devices) facilitates convergence. However, since O(1/M) does not appear in all terms, the training does not enjoy a linear speedup.

Remark 8: The convergence rate is also impacted by other factors such as ε, δ, and G. If ε and δ are small, model convergence slows down, thus necessitating more global iterations. Meanwhile, if G increases, the right-hand side of (18) increases accordingly, which delays model convergence.

A. Impacts of Communication Constraints

As shown in (11), the various communication-overhead reduction strategies are coupled together under the constraint of the MAC capacity. Therefore, it is important to characterize the trade-offs among model compression, partial participation, and periodic aggregation based on the conducted convergence analysis, so as to obtain new system design intuitions for federated learning systems under communication constraints and differential privacy.

1) Partial Participation v.s. Periodic Aggregation: Assuming the model compression strategy is fixed, we can write M = αE, where α = B/(Tβ). Furthermore, we assume a large-scale federated learning system where N ≫ M, such that the approximation (N − M)/(M(N − 1)) → 1/M holds. When the number of total iterations T is large enough, the dominating term in (18) becomes

  O((aE + bλ² + cλ²/E + d)/T),   (19)

where a, b, c, and d are positive constants determined by q, L, and µ, along with the global sensitivity and privacy parameters. As is evident from (19), in the presence of data heterogeneity (λ > 0), increasing E can improve convergence for larger λ.

2) Model Compression v.s. Others: Consider the model compressor used in quantized-SGD [37], such that for any x ∈ R^d, the i-th element is quantized as

  Q_i(x) = ‖x‖ · sign(x_i) · ϑ_i(x, s),   (20)

where ϑ_i(x, s) is a random variable taking on the value (l+1)/s with probability s|x_i|/‖x‖ − l, and l/s otherwise. Here, s ∈ Z⁺ is the quantization level and l ∈ [0, s) is an integer such that |x_i|/‖x‖ ∈ [l/s, (l+1)/s). To transmit Q(x), three parts of information need to be encoded, namely ‖x‖², {sign(x_i)}_{i=1}^d, and {ϑ_i(x, s)}_{i=1}^d. Assuming we use the simplest one-hot encoder, jointly encoding {sign(x_i)}_{i=1}^d and {ϑ_i(x, s)}_{i=1}^d costs d log₂(2s + 1) bits, while ‖x‖ costs 32 bits. Assuming d log₂(2s + 1) ≫ 32, we have β ≈ d log₂(2s + 1). Taking q = √d/s, the quantization loss q is approximately O(√d · 2^{−β/d}). Clearly, q decreases rapidly with β. Furthermore, as we can see in Theorem 1, there is no term in which q and λ coexist, indicating that the impact of q is invariant to non-i.i.d. data. On the other hand, E/M ∼ O(β), and the impact of E is magnified under data heterogeneity. Thus, we can choose a relatively small β for a good trade-off between model compression and the other two communication-reduction approaches.

3) Impact of Privacy Measure: As mentioned before, the local updates are clipped such that the global sensitivity is bounded. Theorem 1 suggests that convergence can be severely delayed for large G. This is due to the fact that, in (16), the noise variance grows quadratically with the L2 sensitivity. Increasing the subsampling ratio γ, determined by Eb/n_k, amplifies the noise variance required for the desired privacy guarantee. For large E, the last term in (18),

  4096dG²b²(1 + q) ln(1.25Eb/(|D_i|δ)) E³ / (M|D_i|²ε²T),

dominates convergence. In this case, increasing b, which consequently increases γ, would deteriorate the performance, hence our decision to employ small subsampling ratios to achieve privacy amplification. On the other hand, for small E, increasing γ may not be as detrimental, or can even be beneficial, as is the case with the non-privatized algorithm. Clearly, increasing ε and δ improves the convergence rate, but this is not desirable, as we would be opting for a risky and insufficient privacy guarantee.

IV. PERFORMANCE EVALUATION

1) Model and Dataset: The theoretical results are evaluated via a logistic regression model on the widely-known MNIST dataset. The MNIST dataset is equally distributed among N = 100 local devices. We control the degree of data heterogeneity by allowing a local device to have access only to training samples for a fraction of all 10 digits. Consequently, we can generate ten datasets HET_MNIST(n_digits), for n_digits = 1, 2, · · · , 10.


[Figure 2: four panels of training-loss curves comparing the non-private setting with ε = 0.5 for E ∈ {1, 10, 100}; the bottom panels are titled "Loss at Round 20" and "Loss at Iteration 2000".]

Fig. 2: The impact of E on the HET_MNIST(2) dataset: (Top row) The training curves for different E with M = 10, s = 10, γ = 0.2 and C = 1. (a) Over rounds. (b) Over iterations. (Bottom row) The training loss (c) after 20 communication rounds. (d) After 2000 iterations.
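The secure-mode client update evaluated in these experiments, namely per-sample gradient clipping to norm C followed by Gaussian perturbation of the averaged gradient before transmission, can be sketched minimally as follows. This is our own illustration with the model, loss, and quantization step abstracted away; it is not the authors' code.

```python
import numpy as np

def clip_per_sample(grads: np.ndarray, C: float) -> np.ndarray:
    """Rescale each per-sample gradient so its L2 norm is at most C."""
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    return grads * np.minimum(1.0, C / np.maximum(norms, 1e-12))

def private_local_step(x, grads, eta, C, sigma, rng):
    """One privatized local iteration: clip per-sample gradients, average,
    perturb with Gaussian noise of standard deviation sigma, then step."""
    g = clip_per_sample(grads, C).mean(axis=0)
    g_noisy = g + rng.normal(scale=sigma, size=g.shape)
    return x - eta * g_noisy

rng = np.random.default_rng(1)
grads = rng.normal(size=(32, 5)) * 3.0   # a mini-batch of 32 per-sample gradients
clipped = clip_per_sample(grads, C=1.0)
assert np.all(np.linalg.norm(clipped, axis=1) <= 1.0 + 1e-9)
x_new = private_local_step(np.zeros(5), grads, eta=0.1, C=1.0, sigma=0.01, rng=rng)
```

Clipping bounds the global sensitivity of each update, which is what makes the Gaussian noise level of (16) sufficient for the stated privacy guarantee.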

Clearly, choosing a smaller number of digits results in a higher degree of data heterogeneity, while n_digits = 10 indicates no heterogeneity among the local datasets.

2) Experiment Settings: The local models are aggregated every E iterations, such that there are K = T/E communication rounds. Unless otherwise stated, E = 10 local iterations are performed by each client, and the number of communication rounds K is fixed to 100, with the total number of iterations T calculated accordingly. The learning rate at the k-th round is set to η_k = η_0/(1 + kE/100), where η_0 = 0.1. The non-privatized setting refers to the vanilla FedPaq algorithm, with neither subsampling, clipping, nor gradient perturbation. In the secure mode, a subset of the local dataset with cardinality γn_i is randomly selected without replacement by each participating device i ∈ S_k at the start of each round to be used for local training. During training, the per-sample gradients are clipped with respect to C, and proper noise is added prior to the transmission of the local gradients. Upon aggregation, the scheduled devices send the (noisy) model updates compressed by the quantizer in (20) with quantization level s. We evaluate the impact of the communication-reduction techniques on both the non-privatized and secure aggregated models over training rounds. Furthermore, we also investigate how each of the parameters involved in privatizing the Secure FedPaq, ceteris paribus, influences the performance of the introduced federated learning scheme.

3) Impact of Periodic Aggregation: Under the presence of data heterogeneity, the impact of E is illustrated in Fig. 2. We observe that the accuracy of the model increases for larger E per round, as illustrated in Fig. 2 (a). This corresponds to the realistic case of constraining the number of communication rounds to a fixed number. Alternatively, Fig. 2 (b) shows that, fixing the number of global iterations T, a larger E in fact delays the convergence of the model at each iteration. The empirical performance verifies our analytical result in Theorem 1 that, for fixed T, increasing E deteriorates the result. Intuitively, this means fewer opportunities for aggregating the local updates. However, proportionally increasing both E and T clearly helps both versions of FedPaq, although with a diminishing gain. Fig. 2 (c) and (d) show the loss curves when the number of rounds or the number of global iterations is fixed, respectively. Fig. 2 confirms our interpretation of Theorem 1 that a larger subsampling ratio γ is more detrimental for larger values of E.

4) Impact of Partial Participation: As shown in Fig. 3 (b), when M is set to a small value, for example M = 1, the training becomes extremely unstable during the first several


[Figure 3: two panels of training-loss curves over rounds; (a) compares s = 1 and s = 10 against full accuracy; (b) compares the non-private setting with ε = 0.2 for M ∈ {1, 10, 100}.]

Fig. 3: Training curves for different s and M at each round on the HET_MNIST(2) dataset: (a) Different s on the non-privatized model with M = 10 and E = 10. (b) Different M with E = 10, s = 10, γ = 0.2 and C = 1.

training rounds. Note that the impact is even more pronounced for the privatized algorithm, consistent with the last term of (18), where the impact of the perturbation is inhibited for larger M.

5) Impact of Model Compression: In Fig. 3 (a), the loss curve for s = 10 almost overlaps with the one without quantization. Even for s = 1, the performance degradation is small and the training is quite stable. This motivates us to use lossy model compressors to save communication resources for smaller E and larger M, which have significant impacts on the convergence of the global model. The same impact can be observed in the privatized setting, which is omitted to avoid cluttering the figure.

6) Impact of Privacy Budget ε: Increasing the privacy parameters ε and δ results in additive noise of lower magnitude, which brings more accurate estimation while putting the privacy of the clients at risk. We have set δ = 10⁻⁴ across all the experiments. Fig. 4 (a) depicts the impact of ε on the performance of the model. For ε = 0.1, although a high level of privacy guarantee is ensured, the utility significantly declines and convergence may be hindered.

7) Impact of Gradient Clipping: The global sensitivity G appears in the numerator of multiple terms in (18), suggesting its major impact on the convergence rate of the algorithm. Fig. 4 (b) shows that a larger C, inducing more noise, delays the convergence of the privatized algorithm. In fact, one might claim that a smaller C for clipping the gradients may be preferable to choosing a larger ε for achieving faster convergence. However, based on Fig. 4 (d), this is not always true.

8) Impact of Subsampling: As a privacy amplification measure, we use subsampling to reduce the amount of noise that needs to be added to maintain a certain level of privacy guarantee, thus increasing the utility. Fig. 2 (c) and (d) show how a larger subsampling ratio γ has a more pronounced deteriorating impact for greater values of E, verifying our analytical results. Moreover, Fig. 4 (c) depicts the training loss of a few privatized models that are identical except for the value of γ. Fig. 4 (d) shows how γ and C jointly affect the performance of the model. One can observe that a smaller clipping threshold of C = 0.5 achieves either the best or one of the worst privatized-setting results, depending on which value of γ is chosen. Although small C and γ lead to relatively negligible added noise, clipping the gradients acquired on such a small subsample would significantly deteriorate convergence, as would be the case with non-privatized SGD. A more moderate clipping, namely C = 1, would be more resilient in the face of the potentially undesirable consequences of subsampling.

The theoretical and experimental results suggest some important design intuitions for federated learning systems: (1) The choice of model compression accuracy has little impact on the convergence regardless of the heterogeneity of the problem. This encourages the widespread use of low-accuracy quantizers to save the underlying communication overhead even in non-i.i.d. settings. (2) Although a larger E naturally incurs more computational cost for the clients, we find that in the non-i.i.d. setting it can actually enhance the performance for a fixed number of communication rounds K. (3) A smaller M will reduce the convergence rate of the model. This behavior is even more noticeable in the privatized setting. (4) Subsampling becomes more important for larger E. (5) Generally, a lower sampling ratio can achieve better convergence, but it may make the optimization unstable if the global sensitivity is very low.

Finally, although our work on extending the results to the cases with more relaxed convexity assumptions is still in progress, our empirical simulations applying the introduced scheme to privacy-aware federated learning of ANNs, CNNs, and Spiking Neural Networks indicate that, with the learning rate set to a proper value, the same effects can be seen for the hyperparameters involved in the FL design as presented in this paper. Furthermore, another interesting direction for future work is to explore similar designs that consume less energy, ranging from energy-efficient neuromorphic computing [39] to energy optimization when the FL design is incorporated into mobile-edge computing [40].

V. CONCLUSION

In this paper, we provided a comprehensive convergence analysis for a privacy-augmented federated learning system by considering data heterogeneity as well as the communication


[Figure 4: four panels of training-loss curves over rounds comparing the non-private baseline with privatized runs for various combinations of ε, γ, and C.]

Fig. 4: The impact of different privacy-related parameters on the convergence of the privatized FedPaq algorithm with E = 10, M = 10 and s = 10. (a) Training curves for different ε. (b) Training curves for different C. (c) Training curves for different γ. (d) Training curves for jointly varied γ and C.

constraints. To be specific, three communication-reduction strategies, namely model compression, partial device participation, and periodic aggregation, are jointly considered under the capacity limit of the underlying MAC for model aggregation. Due to the insufficiency of FL in maintaining the privacy of clients, a privacy measure based on differential privacy is considered, introducing the privacy-augmented FedPaq algorithm. The impacts of differential privacy, data heterogeneity, and the communication-reduction strategies were evaluated both theoretically and numerically. Our analysis provides important design intuitions for real-world federated learning systems that are limited by the communication capacity constraints in wireless networks.

REFERENCES

[1] R. Shafin, L. Liu, V. Chandrasekhar, H. Chen, J. Reed, and J. C. Zhang, "Artificial intelligence-enabled cellular networks: A critical path to Beyond-5G and 6G," IEEE Wireless Commun., vol. 27, no. 2, pp. 212–217, 2020.
[2] H. Song, J. Bai, Y. Yi, J. Wu, and L. Liu, "Artificial intelligence enabled internet of things: Network architecture and spectrum access," IEEE Comput. Intell. Maga., vol. 15, no. 1, pp. 44–51, 2020.
[3] H. B. McMahan, E. Moore, D. Ramage, S. Hampson et al., "Communication-efficient learning of deep networks from decentralized data," arXiv preprint arXiv:1602.05629, 2016.
[4] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar, "Federated multi-task learning," in Advances in Neural Information Processing Systems, 2017, pp. 4424–4434.
[5] Q. Yang, Y. Liu, T. Chen, and Y. Tong, "Federated machine learning: Concept and applications," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 10, no. 2, pp. 1–19, 2019.
[6] E. Bagdasaryan, A. Veit, Y. Hua, D. Estrin, and V. Shmatikov, "How to backdoor federated learning," arXiv preprint arXiv:1807.00459, 2018.
[7] J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, "Federated learning: Strategies for improving communication efficiency," arXiv preprint arXiv:1610.05492, 2016.
[8] A. D. Wyner, "Shannon-theoretic approach to a gaussian cellular multiple-access channel," IEEE Transactions on Information Theory, vol. 40, no. 6, pp. 1713–1727, 1994.
[9] G. Caire and S. Shamai, "On the achievable throughput of a multiantenna gaussian broadcast channel," IEEE Transactions on Information Theory, vol. 49, no. 7, pp. 1691–1706, 2003.
[10] L. Bottou, "Large-scale machine learning with stochastic gradient descent," in Proceedings of COMPSTAT'2010. Springer, 2010, pp. 177–186.
[11] S. P. Karimireddy, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, and A. T. Suresh, "Scaffold: Stochastic controlled averaging for on-device federated learning," arXiv preprint arXiv:1910.06378, 2019.


[12] O. Shamir, N. Srebro, and T. Zhang, "Communication-efficient distributed optimization using an approximate newton-type method," in International Conference on Machine Learning, 2014, pp. 1000–1008.
[13] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, "Feddane: A federated newton-type method," in 2019 53rd Asilomar Conference on Signals, Systems, and Computers. IEEE, 2019, pp. 1227–1231.
[14] B. Li, S. Cen, Y. Chen, and Y. Chi, "Communication-efficient distributed optimization in networks with gradient tracking and variance reduction," in International Conference on Artificial Intelligence and Statistics, 2020, pp. 1662–1672.
[15] B. Shang, S. Liu, S. Lu, Y. Yi, W. Shi, and L. Liu, "A cross-layer optimization framework for distributed computing in iot networks," in 2020 IEEE/ACM Symposium on Edge Computing (SEC). IEEE, 2020, pp. 440–444.
[16] E. Bagdasaryan, A. Veit, Y. Hua, D. Estrin, and V. Shmatikov, "How to backdoor federated learning," in International Conference on Artificial Intelligence and Statistics. PMLR, 2020, pp. 2938–2948.
[17] R. Shokri, M. Stronati, C. Song, and V. Shmatikov, "Membership inference attacks against machine learning models," in 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017, pp. 3–18.
[18] M. A. Rahman, T. Rahman, R. Laganière, N. Mohammed, and Y. Wang, "Membership inference attack against differentially private deep learning model," Trans. Data Priv., vol. 11, no. 1, pp. 61–79, 2018.
[19] C. Dwork, "Differential privacy: A survey of results," in International Conference on Theory and Applications of Models of Computation. Springer, 2008, pp. 1–19.
[20] J. C. Duchi, M. I. Jordan, and M. J. Wainwright, "Local privacy and statistical minimax rates," in 2013 IEEE 54th Annual Symposium on Foundations of Computer Science. IEEE, 2013, pp. 429–438.
[21] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate, "Differentially private empirical risk minimization," Journal of Machine Learning Research, vol. 12, no. 3, 2011.
[22] D. Yu, H. Zhang, and W. Chen, "Improve the gradient perturbation approach for differentially private optimization," 2018.
[23] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, "On the convergence of fedavg on non-iid data," arXiv preprint arXiv:1907.02189, 2019.
[24] A. Khaled, K. Mishchenko, and P. Richtárik, "First analysis of local gd on heterogeneous data," arXiv preprint arXiv:1909.04715, 2019.
[25] A. Reisizadeh, A. Mokhtari, H. Hassani, A. Jadbabaie, and R. Pedarsani, "Fedpaq: A communication-efficient federated learning method with periodic averaging and quantization," arXiv preprint arXiv:1909.13014, 2019.
[26] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, "Federated optimization in heterogeneous networks," arXiv preprint arXiv:1812.06127, 2018.
[27] A. K. Sahu, T. Li, M. Sanjabi, M. Zaheer, A. Talwalkar, and V. Smith, "On the convergence of federated optimization in heterogeneous networks," arXiv preprint arXiv:1812.06127, 2018.
[28] M. E. Gursoy, A. Tamersoy, S. Truex, W. Wei, and L. Liu, "Secure and utility-aware data collection with condensed local differential privacy," IEEE Transactions on Dependable and Secure Computing, 2019.
[29] L. Sun, J. Qian, X. Chen, and P. S. Yu, "LDP-FL: Practical private aggregation in federated learning with local differential privacy," arXiv preprint arXiv:2007.15789, 2020.
[30] N. Agarwal, A. T. Suresh, F. X. X. Yu, S. Kumar, and B. McMahan, "cpsgd: Communication-efficient and differentially-private distributed sgd," in Advances in Neural Information Processing Systems, 2018, pp. 7564–7575.
[31] C. Canonne, G. Kamath, and T. Steinke, "The discrete gaussian for differential privacy," arXiv preprint arXiv:2004.00010, 2020.
[32] L. Wang, R. Jia, and D. Song, "D2p-fed: Differentially private federated learning with efficient communication," 2020.
[33] J. Chen, X. Pan, R. Monga, S. Bengio, and R. Jozefowicz, "Revisiting distributed synchronous sgd," arXiv preprint arXiv:1604.00981, 2016.
[34] T. M. Cover and J. A. Thomas, Elements of Information Theory (2nd Edition). John Wiley & Sons, 2012.
[35] Y.-X. Wang, B. Balle, and S. P. Kasiviswanathan, "Subsampled rényi differential privacy and analytical moments accountant," in The 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 2019, pp. 1226–1235.
[36] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar, "signsgd: Compressed optimisation for non-convex problems," arXiv preprint arXiv:1802.04434, 2018.
[37] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, "Qsgd: Communication-efficient sgd via gradient quantization and encoding," in Advances in Neural Information Processing Systems, 2017, pp. 1709–1720.
[38] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, "Terngrad: Ternary gradients to reduce communication in distributed deep learning," in Advances in Neural Information Processing Systems, 2017, pp. 1509–1519.
[39] R. Shafin, L. Liu, J. Ashdown, J. Matyjas, M. Medley, B. Wysocki, and Y. Yi, "Realizing green symbol detection via reservoir computing: An energy-efficiency perspective," in 2018 IEEE Intl Conf Communications (ICC), 2018, pp. 1–6.
[40] B. Shang and L. Liu, "Mobile-edge computing in the sky: Energy optimization for air–ground integrated networks," IEEE Internet of Things Journal, vol. 7, no. 8, pp. 7443–7456, 2020.

VI. PROOFS

To facilitate the analysis, we introduce the two quantities

  x̂_{k+1} = x_k + (1/N) Σ_{i=1}^N Q(x_{k,E}^{(i)} − x_k + w_k^{(i)}),
  x_{k,E} = (1/N) Σ_{i=1}^N x_{k,E}^{(i)},   (21)

where w_k^{(i)} ∼ N(0, σ_{i,k}² I).

Additionally, we denote by x* and x_i* the optimal global model and the optimal local model for the i-th device, respectively. As shown in [25], for a fixed global iteration k, the squared difference between x_{k+1} and x* can be decomposed into three terms,

  E‖x_{k+1} − x*‖² = E‖x_{k+1} − x̂_{k+1}‖²   (term A)
    + E‖x̂_{k+1} − x_{k,E}‖²   (term B)
    + E‖x_{k,E} − x*‖²,   (term C)   (22)

which are bounded as shown in the following lemmas:

Lemma 1: Under Assumptions 1, 2, 3, 4, and 5, term A in (22) can be bounded as

  E‖x_{k+1} − x̂_{k+1}‖² ≤ (2qG²(N − M)/(MN)) E²η_k²
    + (4eσ²(N − M)/(bMN)) Eη_k²
    + (8eλ²(N − M)/(MN)) E(E − 1)η_k²
    + (256dG²b²(1 + q)(N − M) ln(1.25Eb/(|D_i|δ)) / (M(N − 1)|D_i|²ε²)) E⁴η_k².   (23)

Lemma 2: Under Assumptions 1, 2, 3, 4, and 5, term B in (22) can be bounded as

  E‖x̂_{k+1} − x_{k,E}‖² ≤ (qG²/N) E²η_k².   (24)

Lemma 3: Under Assumptions 1, 2, 3, 4, and 5, term C in (22) can be bounded as

  E‖x_{k,E} − x*‖² ≤ (1 − µη_k)^E E‖x_k − x*‖²
    + (3Lλ²/µ + σ²/(bN)) Eη_k²
    + 8G²E(E − 1)²η_k².   (25)
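The auxiliary aggregate x̂_{k+1} in (21) can be mirrored in code as follows. This is our own sketch with illustrative names; for clarity it uses an identity compressor Q(x) = x, under which the aggregate with zero noise coincides with the plain average x_{k,E}, so term B in (22) vanishes.

```python
import numpy as np

def server_aggregate(x_k, local_models, quantize, noise_std, rng):
    """Auxiliary aggregate of (21):
    x_hat = x_k + (1/N) * sum_i Q(x_i - x_k + w_i), with w_i ~ N(0, noise_std^2 I)."""
    deltas = [quantize(x_i - x_k + rng.normal(scale=noise_std, size=x_k.shape))
              for x_i in local_models]
    return x_k + np.mean(deltas, axis=0)

rng = np.random.default_rng(2)
x_k = np.zeros(4)
local_models = [rng.normal(size=4) for _ in range(100)]
# With Q the identity and no noise, x_hat equals the plain average x_{k,E}.
x_hat = server_aggregate(x_k, local_models, quantize=lambda v: v,
                         noise_std=0.0, rng=rng)
assert np.allclose(x_hat, np.mean(local_models, axis=0))
```

Swapping in a lossy quantizer and a positive noise level reintroduces exactly the compression error of term B and the perturbation contribution bounded in Lemma 1.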


A. Proof of Theorem 1

By substituting (23), (24), and (25) into (22), we obtain
\[
\begin{aligned}
\mathbb{E}\|x_{k+1}-x^*\|^2 \le{}& (1-\mu\eta_k)^E\,\mathbb{E}\|x_k-x^*\|^2 + D_1 E^2\eta_k^2 + D_2 E\eta_k^2 \\
&+ D_3 E(E-1)\eta_k^2 + D_4 E(E-1)^2\eta_k^2 + D_5 E^4\eta_k^2,
\end{aligned} \tag{26}
\]
where
\[
\begin{aligned}
D_1 &= \frac{2qG^2(N-M)}{MN}+\frac{qG^2}{N}, \qquad
D_2 = \frac{4e\sigma^2(N-M)}{bMN}+\frac{3L\lambda^2}{\mu}+\frac{\sigma^2}{bN}, \\
D_3 &= \frac{8e\lambda^2(N-M)}{MN}, \qquad
D_4 = 8G^2, \qquad
D_5 = \frac{256\,d\,G^2b^2(1+q)(N-M)\ln\frac{1.25Eb}{|D_i|\delta}}{M(N-1)|D_i|^2\epsilon^2}.
\end{aligned} \tag{27}
\]
Let $\eta_k \le \frac{1}{\mu}$. Then the term $(1-\mu\eta_k)^E$ can be bounded as
\[
(1-\mu\eta_k)^E = \Big(1-\frac{\mu E\eta_k}{E}\Big)^E \overset{(a)}{\le} e^{-\mu E\eta_k} \overset{(b)}{\le} 1-\mu E\eta_k+\frac{1}{2}\mu^2E^2\eta_k^2, \tag{28}
\]
where (a) holds because $(1+\frac{x}{n})^n \le e^x$ for $|x| < n$, and (b) holds because $e^x \le 1+x+\frac{x^2}{2}$ for $x \le 0$. Hence,
\[
(1-\mu\eta_k)^E \le 1-\mu E\eta_k+\frac{1}{2}\mu^2E^2\eta_k^2. \tag{29}
\]
By restricting
\[
\eta_k \le \frac{1}{\mu E}, \tag{30}
\]
it implies
\[
\mathbb{E}\|x_{k+1}-x^*\|^2 \le \Big(1-\frac{1}{2}\mu E\eta_k\Big)\mathbb{E}\|x_k-x^*\|^2 + D_1E^2\eta_k^2 + D_2E\eta_k^2 + D_3E(E-1)\eta_k^2 + D_4E(E-1)^2\eta_k^2 + D_5E^4\eta_k^2. \tag{31}
\]
By setting the learning rate to $\eta_k = \frac{4\mu^{-1}}{kE+\gamma_0}$ (a diminishing learning rate) and defining $\gamma := \frac{\gamma_0}{E}$ and $\delta_k := \mathbb{E}\|x_k-x^*\|^2$, we can write the above expression as
\[
\delta_{k+1} \le \Big(1-\frac{2}{k+\gamma}\Big)\delta_k + \frac{\alpha}{(k+\gamma)^2}, \tag{32}
\]
where
\[
\alpha = \frac{16}{\mu^2}\Big(D_1+\frac{1}{E}D_2+\frac{E-1}{E}D_3+\frac{(E-1)^2}{E}D_4+E^2D_5\Big). \tag{33}
\]
Now set $\gamma_0 = \max\big\{4E,\ \frac{4L^2}{\mu^2},\ \frac{4\sqrt{2}L(E-1)}{\mu}\big\}$. Since $\eta_k = \frac{4\mu^{-1}}{kE+\gamma_0}$, it follows that the constraints $\eta_k \le \frac{1}{\mu E}$, $\eta_k \le \frac{\mu}{L^2}$, and $2L^2(E-1)^2\eta_k^2 \le 1$ are satisfied.

We now show by induction that, for all $k \ge k_0$ satisfying (30),
\[
\delta_k \le \frac{(k_0+\gamma)^2}{(k+\gamma)^2}\delta_{k_0} + \frac{\alpha}{k+\gamma}, \tag{34}
\]
which is also used in [25]. Obviously, (34) holds for $k = k_0$ since $\alpha > 0$. Assume (34) holds for some $k > k_0$; then
\[
\begin{aligned}
\delta_{k+1} &\le \Big(1-\frac{2}{k+\gamma}\Big)\delta_k + \frac{\alpha}{(k+\gamma)^2} \\
&\le \Big(1-\frac{2}{k+\gamma}\Big)\Big(\frac{(k_0+\gamma)^2}{(k+\gamma)^2}\delta_{k_0}+\frac{\alpha}{k+\gamma}\Big)+\frac{\alpha}{(k+\gamma)^2} \\
&\le \frac{(k+\gamma-1)(k_0+\gamma)^2}{(k+\gamma)^3}\delta_{k_0} + \frac{k+\gamma-1}{(k+\gamma)^2}\alpha \\
&\le \frac{(k_0+\gamma)^2}{(k+1+\gamma)^2}\delta_{k_0} + \frac{\alpha}{k+1+\gamma}. 
\end{aligned} \tag{35}
\]
Combining (31) and (34), it follows that
\[
\mathbb{E}\|x_k-x^*\|^2 \le \frac{(k_0E+\gamma_0)^2}{(kE+\gamma_0)^2}\mathbb{E}\|x_{k_0}-x^*\|^2 + \tilde{D}_1\frac{E}{kE+\gamma_0} + \tilde{D}_2\frac{1}{kE+\gamma_0} + \tilde{D}_3\frac{E-1}{kE+\gamma_0} + \tilde{D}_4\frac{(E-1)^2}{kE+\gamma_0} + \tilde{D}_5\frac{E^3}{kE+\gamma_0}, \tag{36}
\]
where $\tilde{D}_i = \frac{16}{\mu^2}D_i$ for $i = 1,\dots,5$.

We also observe that the algorithm does not require a warm-up period of more than $k_0$ iterations, since
\[
\eta_k = \frac{4\mu^{-1}}{kE+\gamma_0} \le \frac{1}{\mu E} \iff \frac{1}{E}(kE+\gamma_0) \ge 4 \implies k \ge 4-\frac{\gamma_0}{E} \implies k_0 = 0, \tag{37}
\]
because $\gamma_0 \ge 4E$. This concludes the proof of Theorem 1.
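The induction (34)–(35) can be sanity-checked numerically (this is an illustration, not part of the proof): iterating the recursion (32) with equality, i.e. the worst case, stays below the claimed envelope $\frac{(k_0+\gamma)^2}{(k+\gamma)^2}\delta_{k_0} + \frac{\alpha}{k+\gamma}$. The values of $\gamma$, $\alpha$, and $\delta_{k_0}$ below are arbitrary test constants:

```python
# Worst-case iteration of the recursion (32): delta_{k+1} = (1 - 2/(k+g)) delta_k + a/(k+g)^2.
# We check it never exceeds the envelope claimed in (34).
gamma, alpha, k0 = 4.0, 3.0, 0     # arbitrary; gamma >= 2 keeps the contraction factor nonnegative
delta = 10.0                       # delta_{k0}, any positive start
d0 = delta
for k in range(k0, 500):
    bound = (k0 + gamma) ** 2 * d0 / (k + gamma) ** 2 + alpha / (k + gamma)
    assert delta <= bound + 1e-12, (k, delta, bound)
    delta = (1 - 2 / (k + gamma)) * delta + alpha / (k + gamma) ** 2
print("envelope verified for 500 iterations")
```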


B. Proof of Lemma 1

By denoting $\bar{\Delta}_{k+1} = \frac{1}{N}\sum_{i=1}^N \Delta^{(i)}_{k+1}$ and $\hat{x}_{k+1} = x_k + \bar{\Delta}_{k+1}$, the term $\mathbb{E}\|x_{k+1}-\hat{x}_{k+1}\|^2$ can be calculated as
\[
\begin{aligned}
\mathbb{E}\|x_{k+1}-\hat{x}_{k+1}\|^2
&= \mathbb{E}\,\mathbb{E}_{S_k}\Big\|\frac{1}{M}\sum_{i\in S_k}\big(x_k+\Delta^{(i)}_{k+1}\big)-\big(x_k+\bar{\Delta}_{k+1}\big)\Big\|^2 \\
&= \mathbb{E}\,\mathbb{E}_{S_k}\Big\|\frac{1}{M}\sum_{i\in S_k}\Delta^{(i)}_{k+1}-\bar{\Delta}_{k+1}\Big\|^2 \\
&= \frac{1}{M^2}\sum_{i=1}^N \Pr\{i\in S_k\}\,\mathbb{E}\big\|\Delta^{(i)}_{k+1}-\bar{\Delta}_{k+1}\big\|^2 \\
&\quad + \frac{1}{M^2}\sum_{i=1}^N\sum_{j=1,j\ne i}^N \Pr\{i,j\in S_k\}\,\mathbb{E}\big\langle\Delta^{(i)}_{k+1}-\bar{\Delta}_{k+1},\,\Delta^{(j)}_{k+1}-\bar{\Delta}_{k+1}\big\rangle
\end{aligned} \tag{38}
\]
\[
\begin{aligned}
&\overset{(a)}{=} \frac{1}{MN}\sum_{i=1}^N \mathbb{E}\big\|\Delta^{(i)}_{k+1}-\bar{\Delta}_{k+1}\big\|^2 + \frac{M-1}{MN(N-1)}\sum_{i=1}^N\sum_{j=1,j\ne i}^N \mathbb{E}\big\langle\Delta^{(i)}_{k+1}-\bar{\Delta}_{k+1},\,\Delta^{(j)}_{k+1}-\bar{\Delta}_{k+1}\big\rangle \\
&\overset{(b)}{=} \frac{N-M}{M(N-1)}\cdot\underbrace{\frac{1}{N}\sum_{i=1}^N \mathbb{E}\big\|\Delta^{(i)}_{k+1}-\bar{\Delta}_{k+1}\big\|^2}_{A_1},
\end{aligned} \tag{39}
\]
where we use the following facts: (a) $\Pr\{i\in S_k\} = \frac{M}{N}$ and $\Pr\{i,j\in S_k\} = \frac{M(M-1)}{N(N-1)}$ for uniform sampling without replacement, and (b) $\sum_{i=1}^N \mathbb{E}\|\Delta^{(i)}_{k+1}-\bar{\Delta}_{k+1}\|^2 + \sum_{i=1}^N\sum_{j=1,j\ne i}^N \mathbb{E}\langle\Delta^{(i)}_{k+1}-\bar{\Delta}_{k+1},\,\Delta^{(j)}_{k+1}-\bar{\Delta}_{k+1}\rangle = \mathbb{E}\|\sum_{i=1}^N(\Delta^{(i)}_{k+1}-\bar{\Delta}_{k+1})\|^2 = 0$.

Term $A_1$ in Eq. (39) can be further bounded as
\[
A_1 = \frac{1}{N}\sum_{i=1}^N \mathbb{E}\Big\|\frac{1}{N}\sum_{j=1}^N\big(\Delta^{(i)}_{k+1}-\Delta^{(j)}_{k+1}\big)\Big\|^2 \le \frac{1}{N^2}\sum_{i\ne j}\mathbb{E}\big\|\Delta^{(i)}_{k+1}-\Delta^{(j)}_{k+1}\big\|^2. \tag{40}
\]
Expanding the local updates and applying Assumption 4 to the quantization error, we obtain
\[
A_1 \le \underbrace{\frac{q}{N^2}\sum_{i\ne j}\Big(\mathbb{E}\big\|x^{(i)}_{k,E}-x_k\big\|^2+\mathbb{E}\big\|x^{(j)}_{k,E}-x_k\big\|^2\Big)}_{A_2} + \underbrace{\frac{1}{N^2}\sum_{i\ne j}\mathbb{E}\big\|x^{(i)}_{k,E}-x^{(j)}_{k,E}\big\|^2}_{A_3} + \frac{d(1+q)}{N^2}\sum_{i\ne j}\big(\varsigma^2_{i,k}+\varsigma^2_{j,k}\big), \tag{41}
\]
where $A_2$ measures the deviation of the local models from $x_k$ after $E$ local SGD steps, and $A_3$ measures the difference between the local models on two different devices.

We now bound $A_2$. Based on Algorithm 2, $\mathbb{E}\|x^{(i)}_{k,t}-x_k\|^2$ can be calculated as
\[
\mathbb{E}\big\|x^{(i)}_{k,t}-x_k\big\|^2 = \eta_k^2\,\mathbb{E}\Big\|\sum_{s=0}^{t-1}\nabla f_i\big(x^{(i)}_{k,s};\xi^{(i)}_{k,s}\big)\Big\|^2 \overset{(a)}{\le} t\eta_k^2\sum_{s=0}^{t-1}\mathbb{E}\big\|\nabla f_i\big(x^{(i)}_{k,s};\xi^{(i)}_{k,s}\big)\big\|^2 \overset{(b)}{\le} t^2\eta_k^2G^2, \tag{42}
\]
where (a) follows by Assumption 3 and the inequality $\big\|\sum_{k=1}^K x_k\big\|^2 \le K\sum_{k=1}^K\|x_k\|^2$, and (b) follows since $\mathbb{E}\|\nabla f_i(x^{(i)}_{k,s};\xi^{(i)}_{k,s})\|^2 \le G^2$ due to the assumption of bounded stochastic gradients.

To bound $A_3$, we introduce an auxiliary sequence of models $\{a_{k,0}, a_{k,1}, \dots, a_{k,E}\}$, which is generated by gradient descent on the global loss function, i.e.,
\[
a_{k,t+1} = a_{k,t} - \eta_k\nabla f(a_{k,t}), \quad \text{for } t = 0,1,\dots,E-1, \tag{43}
\]
with $a_{k,0} = x_k$. When $i \ne j$, we have
\[
\mathbb{E}\big\|x^{(i)}_{k,t}-x^{(j)}_{k,t}\big\|^2 = \mathbb{E}\big\|x^{(i)}_{k,t}-a_{k,t}+a_{k,t}-x^{(j)}_{k,t}\big\|^2 \le 2\,\mathbb{E}\big\|x^{(i)}_{k,t}-a_{k,t}\big\|^2 + 2\,\mathbb{E}\big\|x^{(j)}_{k,t}-a_{k,t}\big\|^2. \tag{44}
\]
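The factor $\frac{N-M}{M(N-1)}$ appearing in (39) is the classical finite-population variance correction for sampling $M$ of $N$ items uniformly without replacement. A Monte-Carlo check with arbitrary $N$, $M$, and synthetic local updates (an illustration only, not part of the proof):

```python
import numpy as np

# For uniform sampling of M devices out of N without replacement,
# E|| (1/M) sum_{i in S} D_i  -  Dbar ||^2
#   = (N-M)/(M(N-1)) * (1/N) sum_i ||D_i - Dbar||^2,   as in (39).
rng = np.random.default_rng(1)
N, M, d = 20, 5, 3
Delta = rng.normal(size=(N, d))            # synthetic local updates
Delta_bar = Delta.mean(axis=0)

trials = 40000
acc = 0.0
for _ in range(trials):
    S = rng.choice(N, size=M, replace=False)
    acc += np.sum((Delta[S].mean(axis=0) - Delta_bar) ** 2)
empirical = acc / trials

theory = (N - M) / (M * (N - 1)) * np.mean(np.sum((Delta - Delta_bar) ** 2, axis=1))
print(empirical, theory)   # the two agree up to Monte-Carlo error
```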


The deviation $\mathbb{E}\|x^{(i)}_{k,t}-a_{k,t}\|^2$ can be bounded as
\[
\begin{aligned}
\mathbb{E}\big\|x^{(i)}_{k,t}-a_{k,t}\big\|^2
&= \mathbb{E}\Big\|x_k-\eta_k\sum_{s=0}^{t-1}\nabla f_i\big(x^{(i)}_{k,s};\xi^{(i)}_{k,s}\big)-x_k+\eta_k\sum_{s=0}^{t-1}\nabla f(a_{k,s})\Big\|^2 \\
&= \eta_k^2\,\mathbb{E}\Big\|\sum_{s=0}^{t-1}\Big(\nabla f_i\big(x^{(i)}_{k,s};\xi^{(i)}_{k,s}\big)-\nabla f(a_{k,s})\Big)\Big\|^2 \\
&\overset{(a)}{\le} t\eta_k^2\sigma^2/b + \eta_k^2\,\mathbb{E}\Big\|\sum_{s=1}^{t-1}\Big(\nabla f_i\big(x^{(i)}_{k,s}\big)-\nabla f(a_{k,s})\Big)\Big\|^2 \\
&\le t\eta_k^2\sigma^2/b + (t-1)\eta_k^2\sum_{s=1}^{t-1}\mathbb{E}\big\|\nabla f_i\big(x^{(i)}_{k,s}\big)-\nabla f(a_{k,s})\big\|^2 \\
&= t\eta_k^2\sigma^2/b + (t-1)\eta_k^2\sum_{s=1}^{t-1}\mathbb{E}\big\|\nabla f_i\big(x^{(i)}_{k,s}\big)-\nabla f_i(a_{k,s})+\nabla f_i(a_{k,s})-\nabla f(a_{k,s})\big\|^2 \\
&\le t\eta_k^2\sigma^2/b + 2(t-1)\eta_k^2\sum_{s=1}^{t-1}\mathbb{E}\big\|\nabla f_i(a_{k,s})-\nabla f(a_{k,s})\big\|^2 + 2(t-1)\eta_k^2\sum_{s=1}^{t-1}\mathbb{E}\big\|\nabla f_i\big(x^{(i)}_{k,s}\big)-\nabla f_i(a_{k,s})\big\|^2 \\
&\overset{(b)}{\le} t\eta_k^2\sigma^2/b + 2(t-1)t\eta_k^2\lambda^2 + 2(t-1)\eta_k^2\sum_{s=1}^{t-1}\mathbb{E}\big\|\nabla f_i\big(x^{(i)}_{k,s}\big)-\nabla f_i(a_{k,s})\big\|^2 \\
&\overset{(c)}{\le} t\eta_k^2\sigma^2/b + 2(t-1)t\eta_k^2\lambda^2 + 2L^2(t-1)\eta_k^2\sum_{s=1}^{t-1}\mathbb{E}\big\|x^{(i)}_{k,s}-a_{k,s}\big\|^2 \\
&\overset{(d)}{\le} \frac{\sigma^2}{b}E\eta_k^2 + 2\lambda^2(E-1)E\eta_k^2 + 2L^2(E-1)\eta_k^2\sum_{s=1}^{t-1}\mathbb{E}\big\|x^{(i)}_{k,s}-a_{k,s}\big\|^2, 
\end{aligned} \tag{45}
\]
where (a) follows by the bounded variance of the stochastic gradient in Assumption 3, (b) follows due to Assumption 5, (c) follows by the $L$-smoothness, and (d) follows since $t \le E$ holds.

Now we show by induction that, for all $t \le E$,
\[
\mathbb{E}\big\|x^{(i)}_{k,t}-a_{k,t}\big\|^2 \le \Big(\frac{\sigma^2}{b}E\eta_k^2 + 2\lambda^2E(E-1)\eta_k^2\Big)\big(1+2L^2(E-1)\eta_k^2\big)^{t-1}. \tag{46}
\]
When $t = 1$, we have
\[
\mathbb{E}\big\|x^{(i)}_{k,1}-a_{k,1}\big\|^2 = \mathbb{E}\big\|x_k-\eta_k\nabla f_i(x_k)-x_k+\eta_k\nabla f(x_k)\big\|^2 = \eta_k^2\,\mathbb{E}\big\|\nabla f_i(x_k)-\nabla f(x_k)\big\|^2 \le \eta_k^2\lambda^2, \tag{47}
\]
and hence (46) holds for $t = 1$. Assume that (46) holds up to iteration $t-1$; then, for iteration $t$,
\[
\begin{aligned}
\mathbb{E}\big\|x^{(i)}_{k,t}-a_{k,t}\big\|^2 &\le \frac{\sigma^2}{b}E\eta_k^2 + 2\lambda^2(E-1)E\eta_k^2 + 2L^2(E-1)\eta_k^2\sum_{s=1}^{t-1}\mathbb{E}\big\|x^{(i)}_{k,s}-a_{k,s}\big\|^2 \\
&\le \Big(\frac{\sigma^2}{b}E\eta_k^2+2\lambda^2(E-1)E\eta_k^2\Big)\Big(1+2L^2(E-1)\eta_k^2\sum_{s=1}^{t-1}\big(1+2L^2(E-1)\eta_k^2\big)^{s-1}\Big) \\
&= \Big(\frac{\sigma^2}{b}E\eta_k^2+2\lambda^2(E-1)E\eta_k^2\Big)\Big(1+2L^2(E-1)\eta_k^2\cdot\frac{\big(1+2L^2(E-1)\eta_k^2\big)^{t-1}-1}{2L^2(E-1)\eta_k^2}\Big) \\
&= \Big(\frac{\sigma^2}{b}E\eta_k^2+2\lambda^2E(E-1)\eta_k^2\Big)\big(1+2L^2(E-1)\eta_k^2\big)^{t-1}. 
\end{aligned} \tag{48}
\]
Consequently,
\[
\mathbb{E}\big\|x^{(i)}_{k,t}-a_{k,t}\big\|^2 \le \Big(\frac{\sigma^2}{b}E\eta_k^2+2\lambda^2E(E-1)\eta_k^2\Big)\big(1+2L^2(E-1)\eta_k^2\big)^{E-1} \overset{(a)}{\le} \Big(\frac{\sigma^2}{b}E\eta_k^2+2\lambda^2E(E-1)\eta_k^2\Big)e^{2L^2(E-1)^2\eta_k^2} \overset{(b)}{\le} \frac{e\sigma^2}{b}E\eta_k^2 + 2e\lambda^2E(E-1)\eta_k^2, \tag{49}
\]
where (a) follows since $1+x \le e^x$ and (b) follows by assuming that $2L^2(E-1)^2\eta_k^2 \le 1$.

By substituting (49) into (44), term $A_3$ in Eq. (41) is bounded by
\[
\mathbb{E}\big\|x^{(i)}_{k,t}-x^{(j)}_{k,t}\big\|^2 \le \frac{4e\sigma^2}{b}E\eta_k^2 + 8e\lambda^2E(E-1)\eta_k^2. \tag{50}
\]
Finally, we substitute (42), (50), and the resulting bound (41) on $A_1$ into (39) and obtain
\[
\mathbb{E}\|x_{k+1}-\hat{x}_{k+1}\|^2 \le \frac{2qG^2(N-M)}{MN}E^2\eta_k^2 + \frac{4e\sigma^2(N-M)}{bMN}E\eta_k^2 + \frac{8e\lambda^2(N-M)}{MN}E(E-1)\eta_k^2 + \frac{2d(1+q)(N-M)}{M(N-1)}\cdot\frac{1}{N}\sum_{i=1}^N\varsigma^2_{i,k},
\]
together with the variance of the calibrated Gaussian noise,
\[
\varsigma^2_{i,k} = \frac{128G^2b^2\ln\frac{1.25Eb}{|D_i|\delta}}{|D_i|^2\epsilon^2}E^4\eta_k^2. \tag{51}
\]
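The induction (46)–(48) is a "constant plus geometric accumulation" argument: if $u_t \le c + r\sum_{s<t}u_s$ with $u_1 \le c$, then $u_t \le c(1+r)^{t-1}$. A minimal numerical check with arbitrary constants $c$, $r$, and horizon $E$ (these stand in for the paper's quantities and are not taken from it):

```python
# Worst case: meet the recursion u_t = c + r * sum_{s<t} u_s with equality,
# then verify the geometric envelope u_t <= c * (1+r)^(t-1) from (46).
c, r, E = 0.7, 0.05, 20
u = [c]                          # u_1 = c
for t in range(2, E + 1):
    u.append(c + r * sum(u))
for t, ut in enumerate(u, start=1):
    assert ut <= c * (1 + r) ** (t - 1) + 1e-12
print("geometric envelope verified")
```

With equality in the recursion the envelope is attained exactly, which is why (48) telescopes the geometric series without slack.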


Substituting (51) into the bound above, we have
\[
\begin{aligned}
\mathbb{E}\|x_{k+1}-\hat{x}_{k+1}\|^2 \le{}& \frac{2qG^2(N-M)}{MN}E^2\eta_k^2 + \frac{4e\sigma^2(N-M)}{bMN}E\eta_k^2 + \frac{8e\lambda^2(N-M)}{MN}E(E-1)\eta_k^2 \\
&+ \frac{256\,d\,G^2b^2(1+q)(N-M)\ln\frac{1.25Eb}{|D_i|\delta}}{M(N-1)|D_i|^2\epsilon^2}E^4\eta_k^2,
\end{aligned}
\]
which concludes the proof of Lemma 1.

C. Proof of Lemma 2

For term B, we have
\[
\begin{aligned}
\mathbb{E}\|\hat{x}_{k+1}-\bar{x}_{k,E}\|^2
&= \mathbb{E}\Big\|\frac{1}{N}\sum_{i=1}^N\Big(x_k+Q\big(x^{(i)}_{k,E}-x_k\big)\Big)-\frac{1}{N}\sum_{i=1}^N x^{(i)}_{k,E}\Big\|^2 \\
&= \mathbb{E}\Big\|\frac{1}{N}\sum_{i=1}^N\Big(Q\big(x^{(i)}_{k,E}-x_k\big)-\big(x^{(i)}_{k,E}-x_k\big)\Big)\Big\|^2 \\
&\overset{(a)}{=} \frac{1}{N^2}\sum_{i=1}^N\mathbb{E}\Big\|Q\big(x^{(i)}_{k,E}-x_k\big)-\big(x^{(i)}_{k,E}-x_k\big)\Big\|^2 \\
&\overset{(b)}{\le} \frac{q}{N}\cdot\frac{1}{N}\sum_{i=1}^N\mathbb{E}\big\|x^{(i)}_{k,E}-x_k\big\|^2, 
\end{aligned} \tag{52}
\]
where (a) and (b) hold by Assumption 4. Once again, we use the result in (42) and obtain the bound for term B in Eq. (22) as
\[
\mathbb{E}\|\hat{x}_{k+1}-\bar{x}_{k,E}\|^2 \le \frac{qG^2}{N}E^2\eta_k^2, \tag{53}
\]
which completes the proof of Lemma 2.

D. Proof of Lemma 3

Consider the $k$-th round of training with the sequence $\{\bar{x}_{k,0}, \bar{x}_{k,1}, \dots, \bar{x}_{k,E}\}$ representing the averaged model of all local devices in distributed SGD, with $\bar{x}_{k,0} = x_k$. A similar problem has been investigated in [23], in which the authors proved the following one-step result for distributed SGD:
\[
\mathbb{E}\|\bar{x}_{k,t+1}-x^*\|^2 \le (1-\mu\eta_k)\,\mathbb{E}\|\bar{x}_{k,t}-x^*\|^2 + \underbrace{6L\eta_k^2\Gamma}_{C_1} + \underbrace{\eta_k^2\,\mathbb{E}\|g_{k,t}-\bar{g}_{k,t}\|^2}_{C_2} + \underbrace{\frac{2}{N}\sum_{i=1}^N\mathbb{E}\big\|\bar{x}_{k,t}-x^{(i)}_{k,t}\big\|^2}_{C_3}, \tag{54}
\]
with a different measure of heterogeneity, $\Gamma = f(x^*) - \frac{1}{N}\sum_{i=1}^N f_i(x^*_i)$. Additionally, $g_{k,t} = \frac{1}{N}\sum_{i=1}^N \nabla f_i(x^{(i)}_{k,t};\xi^{(i)}_{k,t})$ and $\bar{g}_{k,t} = \frac{1}{N}\sum_{i=1}^N \nabla f_i(x^{(i)}_{k,t})$. Since $x^*_i$ is the optimal model of the $i$-th local loss function, we have
\[
f_i(x^*) \overset{(a)}{\le} f_i(x^*_i) + \frac{1}{2\mu}\|\nabla f_i(x^*)-\nabla f_i(x^*_i)\|^2 \overset{(b)}{=} f_i(x^*_i) + \frac{1}{2\mu}\|\nabla f_i(x^*)-\nabla f(x^*)\|^2 \overset{(c)}{\le} f_i(x^*_i) + \frac{\lambda^2}{2\mu}, \tag{55}
\]
where (a) holds since $f_i(\cdot)$ is $\mu$-strongly convex, (b) holds since $\nabla f_i(x^*_i) = \nabla f(x^*) = 0$, and (c) holds due to Assumption 5 on heterogeneity. Then, term $C_1$ in Eq. (54) can be bounded as
\[
6L\eta_k^2\Gamma = 6L\eta_k^2\Big(f(x^*)-\frac{1}{N}\sum_{i=1}^N f_i(x^*_i)\Big) = \frac{6L\eta_k^2}{N}\sum_{i=1}^N\big(f_i(x^*)-f_i(x^*_i)\big) \le \frac{3L\eta_k^2\lambda^2}{\mu}.
\]
Term $C_2$ in Eq. (54) can be calculated as
\[
\eta_k^2\,\mathbb{E}\|g_{k,t}-\bar{g}_{k,t}\|^2 = \eta_k^2\,\mathbb{E}\Big\|\frac{1}{N}\sum_{i=1}^N\Big(\nabla f_i\big(x^{(i)}_{k,t};\xi^{(i)}_{k,t}\big)-\nabla f_i\big(x^{(i)}_{k,t}\big)\Big)\Big\|^2 \overset{(a)}{=} \frac{\eta_k^2}{N^2}\sum_{i=1}^N\mathbb{E}\big\|\nabla f_i\big(x^{(i)}_{k,t};\xi^{(i)}_{k,t}\big)-\nabla f_i\big(x^{(i)}_{k,t}\big)\big\|^2 \overset{(b)}{\le} \frac{\eta_k^2\sigma^2}{bN}, \tag{56}
\]
where (a) and (b) hold due to Assumption 3. Finally, we bound term $C_3$ using the same technique as in [23], i.e., the convexity of $\|\cdot\|^2$ together with the argument leading to Eq. (50):
\[
\frac{2}{N}\sum_{i=1}^N\mathbb{E}\big\|\bar{x}_{k,t}-x^{(i)}_{k,t}\big\|^2 \le 8G^2(E-1)^2\eta_k^2. \tag{57}
\]
By substituting (55), (56), and (57) into Eq. (54), we have
\[
\mathbb{E}\|\bar{x}_{k,t+1}-x^*\|^2 \le (1-\mu\eta_k)\,\mathbb{E}\|\bar{x}_{k,t}-x^*\|^2 + \Big(\frac{3L\lambda^2}{\mu}+\frac{\sigma^2}{bN}\Big)\eta_k^2 + 8G^2(E-1)^2\eta_k^2. \tag{58}
\]
By unfolding the recursion in (58) and since $\bar{x}_{k,0} = x_k$,
\[
\begin{aligned}
\mathbb{E}\|\bar{x}_{k,E}-x^*\|^2 &\le (1-\mu\eta_k)^E\,\mathbb{E}\|x_k-x^*\|^2 + \Big(\frac{3L\lambda^2}{\mu}+\frac{\sigma^2}{bN}\Big)\eta_k\,\frac{1-(1-\mu\eta_k)^E}{\mu} + 8G^2(E-1)^2\eta_k\,\frac{1-(1-\mu\eta_k)^E}{\mu} \\
&\le (1-\mu\eta_k)^E\,\mathbb{E}\|x_k-x^*\|^2 + \Big(\frac{3L\lambda^2}{\mu}+\frac{\sigma^2}{bN}\Big)E\eta_k^2 + 8G^2E(E-1)^2\eta_k^2.
\end{aligned}
\]
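The unfolding step above uses the geometric sum $\sum_{t=0}^{E-1}(1-\mu\eta_k)^t = \frac{1-(1-\mu\eta_k)^E}{\mu\eta_k} \le E$ for $0 < \mu\eta_k \le 1$, which is where Bernoulli's inequality enters. A quick numerical check with arbitrary $E$ and step values (outside the proof):

```python
# Verify: sum_{t<E} (1-x)^t equals (1-(1-x)^E)/x and is at most E for 0 < x <= 1,
# where x plays the role of mu * eta_k.
E = 8
for x in [1e-4, 0.01, 0.1, 0.5, 1.0]:
    s = sum((1 - x) ** t for t in range(E))
    assert abs(s - (1 - (1 - x) ** E) / x) < 1e-9
    assert s <= E
print("geometric-sum bound verified")
```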


The last inequality holds by Bernoulli's inequality and the assumption that $\eta_k \le \frac{1}{\mu}$. To be specific, we have $(1-\mu\eta_k)^s \ge 1-s\mu\eta_k$, which implies
\[
\frac{1-(1-\mu\eta_k)^s}{\mu} \le s\eta_k.
\]
This concludes the proof of Lemma 3.

Yang Yi (Senior Member, IEEE) is currently an Associate Professor with the Bradley Department of Electrical and Computer Engineering, Virginia Tech. Her research interests include very large scale integrated (VLSI) circuits and systems, computer-aided design (CAD), neuromorphic architectures for brain-inspired computing systems, and low-power circuit design with advanced nanotechnologies for high-speed wireless systems.
technologies for high-speed wireless systems.

Nima Mohammadi (Student Member, IEEE) received his M.Sc. degree in Computer Science from the University of Tehran in 2017. He is currently a Ph.D. student at the Bradley Department of Electrical and Computer Engineering at Virginia Tech. His research interests include computational neuroscience and emerging applications of machine learning and neural networks in wireless systems.

Jianan Bai received the B.S. degree in Communications Engineering from Yingcai Honors College at the University of Electronic Science and Technology of China (UESTC) in 2018 and the M.S. degree in Electrical Engineering from the Bradley Department of Electrical and Computer Engineering (ECE) at Virginia Tech in 2020. His research interest is applying tools from optimization, stochastic geometry, and machine learning to solve emerging problems in device-to-device (D2D), internet-of-things (IoT), and machine-type-communications (MTC) networks.

Qiang Fan (Member, IEEE) received his Ph.D. degree in Electrical and Computer Engineering from the New Jersey Institute of Technology (NJIT) in 2019 and his M.S. degree in Electrical Engineering from Yunnan University of Nationalities, China, in 2013. He was a postdoctoral researcher in the Department of Electrical and Computer Engineering, Virginia Tech. He has served as a reviewer for over 120 journal submissions, including IEEE Transactions on Cloud Computing, IEEE Journal on Selected Areas in Communications, and IEEE Transactions on Communications. His current research interests include wireless communications and networking, mobile edge computing, machine learning, and drone-assisted networking.

Yifei Song received his B.S. degree in Electrical Engineering and M.S. degree in Electrical and Computer Engineering from the University of Connecticut and the University of Washington in 2017 and 2020, respectively. He joined the Department of Electrical and Computer Engineering at Virginia Tech as a Ph.D. student in Spring 2021. His current research interests are in the broad area of wireless communications, machine learning, and optimization.

Lingjia Liu (Senior Member, IEEE) received his B.S. degree in Electronic Engineering from Shanghai Jiao Tong University and his Ph.D. degree in Electrical and Computer Engineering from Texas A&M University. Prior to joining the ECE Department at Virginia Tech (VT), he was an Associate Professor in the EECS Department at the University of Kansas (KU). He spent over four years at Mitsubishi Electric Research Laboratory (MERL) and the Standards & Mobility Innovation Lab of Samsung Research America (SRA), where he received the Global Samsung Best Paper Award in 2008 and 2010, and led Samsung's efforts on multiuser MIMO, CoMP, and HetNets in the 3GPP LTE/LTE-Advanced standards. Dr. Liu's general research interests mainly lie in emerging technologies for Beyond-5G cellular networks, including machine learning for wireless networks, massive MIMO, massive MTC communications, and mmWave communications.

