Daniel Rothchild∗ 1 Ashwinee Panda∗ 1 Enayat Ullah 2 Nikita Ivkin 3 Ion Stoica 1 Vladimir Braverman 2
Joseph Gonzalez 1 Raman Arora 2
sending it to the central aggregator. The aggregator maintains momentum and error accumulation Count Sketches, and the weight update applied at each round is extracted from the error accumulation sketch. See Figure 1 for an overview of FetchSGD.

FetchSGD requires no local state on the clients, and we prove that it is communication efficient, and that it converges in the non-i.i.d. setting for $L$-smooth non-convex functions at rates $O(T^{-1/2})$ and $O(T^{-1/3})$ respectively, under two alternative assumptions – the first opaque and the second more intuitive. Furthermore, even without maintaining any local state, FetchSGD can carry out momentum – a technique that is essential for attaining high accuracy in the non-federated setting – as if on local gradients before compression (Sutskever et al., 2013). Lastly, due to properties of the Count Sketch, FetchSGD scales seamlessly to small local datasets, an important regime for federated learning, since user interaction with online services tends to follow a power law distribution, meaning that most users will have relatively little data to contribute (Muchnik et al., 2013).

We empirically validate our method with two image recognition tasks and one language modeling task. Using models with between 6 and 125 million parameters, we train on non-i.i.d. datasets that range in size from 50,000 to 800,000 examples.

2. Related Work

Broadly speaking, there are two optimization strategies that have been proposed to address the constraints of federated learning: Federated Averaging (FedAvg) and extensions thereof, and gradient compression methods. We explore these two strategies in detail in Sections 2.1 and 2.2, but as a brief summary, FedAvg does not require local state, but it also does not reduce communication from the standpoint of a client that participates once, and it struggles with non-i.i.d. data and small local datasets because it takes many local gradient steps. Gradient compression methods, on the other hand, can achieve high communication efficiency. However, it has been shown both theoretically and empirically that these methods must maintain error accumulation vectors on the clients in order to achieve high accuracy. This is ineffective in federated learning, since clients typically participate in optimization only once, so the accumulated error has no chance to be re-introduced (Karimireddy et al., 2019b).

2.1. FedAvg

FedAvg reduces the total number of bytes transferred during training by carrying out multiple steps of stochastic gradient descent (SGD) locally before sending the aggregate model update back to the aggregator. This technique, often referred to as local/parallel SGD, has been studied since the early days of distributed model training in the data center (Dean et al., 2012), and is referred to as FedAvg when applied to federated learning (McMahan et al., 2016). FedAvg has been successfully deployed in a number of domains (Hard et al., 2018; Li et al., 2019), and is the most commonly used optimization algorithm in the federated setting (Yang et al., 2018). In FedAvg, every participating client first downloads and trains the global model on their local dataset for a number of epochs using SGD. The clients upload the difference between their initial and final model to the parameter server, which averages the local updates weighted according to the magnitude of the corresponding local dataset.

One major advantage of FedAvg is that it requires no local state, which is necessary for the common case where clients participate only once in training. FedAvg is also communication-efficient in that it can reduce the total number of bytes transferred during training while achieving the same overall performance. However, from an individual client's perspective, there is no communication savings if the client participates in training only once. Achieving high accuracy on a task often requires using a large model, but clients' network connections may be too slow or unreliable to transmit such a large amount of data at once (Yang et al., 2010).

Another disadvantage of FedAvg is that taking many local steps can lead to degraded convergence on non-i.i.d. data. Intuitively, taking many local steps of gradient descent on local data that is not representative of the overall data distribution will lead to local over-fitting, which will hinder convergence (Karimireddy et al., 2019a). When training a model on non-i.i.d. local datasets, the goal is to minimize the average test error across clients. If clients are chosen randomly, SGD naturally has convergence guarantees on non-i.i.d. data, since the average test error is an expectation over which clients participate. However, although FedAvg has convergence guarantees for the i.i.d. setting (Wang and Joshi, 2018), these guarantees do not apply directly to the non-i.i.d. setting as they do with SGD. Zhao et al. (2018) show that FedAvg, using $K$ local steps, converges as $O(K/T)$ on non-i.i.d. data for strongly convex smooth functions, with additional assumptions. In other words, convergence on non-i.i.d. data could slow down as much as proportionally to the number of local steps taken.

Variants of FedAvg have been proposed to improve its performance on non-i.i.d. data. Sahu et al. (2018) propose constraining the local gradient update steps in FedAvg by penalizing the L2 distance between local models and the current global model. Under the assumption that every client's loss is minimized wherever the overall loss function is minimized, they recover the convergence rate of SGD.
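To make the FedAvg round concrete, here is a minimal NumPy sketch (our own illustration, not the deployed algorithm). The `local_grad` function uses a toy least-squares loss as a hypothetical stand-in for each client's local objective; the server averages the uploaded model deltas weighted by local dataset size, as described above.

```python
import numpy as np

def local_grad(w, data):
    """Gradient of a toy least-squares loss mean((Xw - y)^2) on one client."""
    X, y = data
    return 2 * X.T @ (X @ w - y) / len(y)

def fedavg_round(w_global, clients, local_steps=5, lr=0.05):
    """One FedAvg round: each client runs `local_steps` SGD steps starting
    from the global model, uploads its model delta, and the server averages
    the deltas weighted by local dataset size."""
    deltas, sizes = [], []
    for data in clients:
        w = w_global.copy()
        for _ in range(local_steps):
            w -= lr * local_grad(w, data)
        deltas.append(w - w_global)   # clients upload only the difference
        sizes.append(len(data[1]))
    probs = np.array(sizes, dtype=float) / sum(sizes)
    return w_global + sum(p * d for p, d in zip(probs, deltas))
```

Note that the per-client state (`w`) lives only for the duration of the round, which is why FedAvg needs no persistent local state.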
FetchSGD: Communication-Efficient Federated Learning with Sketching
Figure 1. Algorithm Overview. The FetchSGD algorithm (1) computes gradients locally, and then sends sketches (2) of the gradients to the cloud. In the cloud, gradient sketches are aggregated (3), and then (4) momentum and (5) error accumulation are applied to the sketch. The approximate top-k values are then (6) extracted and (7) broadcast as sparse updates to devices participating in the next round.
Karimireddy et al. (2019a) modify the local updates in FedAvg to make them point closer to the consensus gradient direction from all clients. They achieve good convergence at the cost of making the clients stateful.

2.2. Gradient Compression

A limitation of FedAvg is that, in each communication round, clients must download an entire model and upload an entire model update. Because federated clients are typically on slow and unreliable network connections, this requirement makes training large models with FedAvg difficult. Uploading model updates is particularly challenging, since residential Internet connections tend to be asymmetric, with far higher download speeds than upload speeds (Goga and Teixeira, 2012).

An alternative to FedAvg that helps address this problem is regular distributed SGD with gradient compression. It is possible to compress stochastic gradients such that the result is still an unbiased estimate of the true gradient, for example by stochastic quantization (Alistarh et al., 2017) or stochastic sparsification (Wangni et al., 2018). However, there is a fundamental tradeoff between increasing compression and increasing the variance of the stochastic gradient, which slows convergence. The requirement that gradients remain unbiased after compression is too stringent, and these methods have had limited empirical success.

Biased gradient compression methods, such as top-k sparsification (Lin et al., 2017) or signSGD (Bernstein et al., 2018), have been more successful in practice. These methods rely, both in theory and in practice, on the ability to locally accumulate the error introduced by the compression scheme, such that the error can be re-introduced the next time the client participates (Karimireddy et al., 2019b). Unfortunately, carrying out error accumulation requires local client state, which is often infeasible in federated learning.

2.3. Optimization with Sketching

This work advances the growing body of research applying sketching techniques to optimization. Jiang et al. (2018) propose using sketches for gradient compression in data center training. Their method achieves empirical success when gradients are sparse, but it has no convergence guarantees, and it achieves little compression on dense gradients (Jiang et al., 2018, §B.3). The method also does not make use of error accumulation, which more recent work has demonstrated is necessary for biased gradient compression schemes to be successful (Karimireddy et al., 2019b). Ivkin et al. (2019b) also propose using sketches for gradient compression in data center training. However, their method requires a second round of communication between the clients and the parameter server, after the first round of transmitting compressed gradients completes. Using a second round is not practical in federated learning, since stragglers would delay completion of the first round, at which point a number of clients that had participated in the first round would no longer be available (Bonawitz et al., 2016). Furthermore, the method in (Ivkin et al., 2019b) requires local client state for both momentum and error accumulation, which is not possible in federated learning. Spring et al. (2019) also propose using sketches for distributed optimization. Their method compresses auxiliary variables such as momentum and per-parameter learning rates, without compressing the gradients themselves. In contrast, our method compresses the gradients, and it does not require any additional communication at all to carry out momentum.

Konecny et al. (2016) propose using sketched updates to achieve communication efficiency in federated learning. However, the family of sketches they use differs from the techniques we propose in this paper: they apply a combination of subsampling, quantization, and random rotations.
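As background for these sketching approaches, the following is a minimal Count Sketch in NumPy. It is simplified relative to the data structures used in the papers above (no top-k heap, illustrative `rows`/`cols` parameters), but it exhibits the key property FetchSGD relies on: the sketch is a linear map, so sketches built with shared hash functions can be merged on the server by adding their tables entrywise.

```python
import numpy as np

class CountSketch:
    """Minimal Count Sketch. Because `accumulate` is linear in its input,
    S(x) + S(y) = S(x + y) whenever two sketches share hash functions
    (here, the same `seed`). Simplified for illustration only."""

    def __init__(self, rows, cols, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.bucket = rng.integers(0, cols, size=(rows, dim))  # column hash
        self.sign = rng.choice([-1.0, 1.0], size=(rows, dim))  # sign hash
        self.table = np.zeros((rows, cols))

    def accumulate(self, x):
        """Add a dense vector x into the sketch (linear operation)."""
        for r in range(self.table.shape[0]):
            np.add.at(self.table[r], self.bucket[r], self.sign[r] * x)

    def query(self, i):
        """Median-of-rows estimate of coordinate i of the sketched sum."""
        rows = np.arange(self.table.shape[0])
        return np.median(self.sign[:, i] * self.table[rows, self.bucket[:, i]])
```

Since all clients use the same seed, an aggregator can recover heavy coordinates of the summed gradient from the sum of the client tables, without ever materializing the dense gradients; this mergeability is what lets a sketching-based aggregator avoid per-client state.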
to carrying out error accumulation on each client, and uploading sketches of the result to the server. (Computing the model update from the accumulated error is not linear, but only the server does this, whether the error is accumulated on the clients or on the server.) Taking this a step further, we note that momentum also consists of only linear operations, and so momentum can be equivalently carried out on the clients or on the server. Extending the above equations with momentum yields

$$S_t = \frac{1}{W}\sum_{i=1}^{W} \mathcal{S}(g_t^i)$$
$$S_u^{t+1} = \rho S_u^t + S_t$$
$$\Delta = \mathrm{Top}\text{-}k(\mathcal{U}(\eta S_u^{t+1} + S_e^t))$$
$$S_e^{t+1} = \eta S_u^{t+1} + S_e^t - \mathcal{S}(\Delta)$$
$$w_{t+1} = w_t - \Delta.$$

FetchSGD is presented in full in Algorithm 1.

Algorithm 1 FetchSGD
Input: number of model weights to update each round $k$
Input: learning rate $\eta$
Input: number of timesteps $T$
Input: momentum parameter $\rho$, local batch size $\ell$
Input: number of clients selected per round $W$
Input: sketching and unsketching functions $\mathcal{S}$, $\mathcal{U}$
1: Initialize $S_u^0$ and $S_e^0$ to zero sketches
2: Initialize $w_0$ using the same random seed on the clients and aggregator
3: for $t = 1, 2, \cdots T$ do
4:   Randomly select $W$ clients $c_1, \ldots, c_W$
5:   loop {in parallel on clients $\{c_i\}_{i=1}^W$}
6:     Download (possibly sparse) new model weights $w_t - w_0$
7:     Compute stochastic gradient $g_t^i$ on batch $B_i$ of size $\ell$: $g_t^i = \frac{1}{\ell}\sum_{j=1}^{\ell} \nabla_w L(w_t, z_j)$
8:     Sketch $g_t^i$: $S_t^i = \mathcal{S}(g_t^i)$ and send it to the aggregator
9:   end loop
10:  Aggregate sketches $S_t = \frac{1}{W}\sum_{i=1}^{W} S_t^i$
11:  Momentum: $S_u^t = \rho S_u^{t-1} + S_t$
12:  Error feedback: $S_e^t = \eta S_u^t + S_e^t$
13:  Unsketch: $\Delta_t = \mathrm{Top}\text{-}k(\mathcal{U}(S_e^t))$
14:  Error accumulation: $S_e^{t+1} = S_e^t - \mathcal{S}(\Delta_t)$
15:  Update $w_{t+1} = w_t - \Delta_t$
16: end for
Output: $\{w_t\}_{t=1}^T$

4. Theory

This section presents convergence guarantees for FetchSGD. First, Section 4.1 gives the convergence of FetchSGD when making a strong and opaque assumption about the sequence of gradients. Section 4.2 instead makes a more interpretable assumption about the gradients, and arrives at a weaker convergence guarantee.

4.1. Scenario 1: Contraction Holds

To show that compressed SGD converges when using some biased gradient compression operator $C(\cdot)$, existing methods (Karimireddy et al., 2019b; Zheng et al., 2019; Ivkin et al., 2019b) appeal to Stich et al. (2018), who show that compressed SGD converges when $C$ is a $\tau$-contraction:

$$\|C(x) - x\| \leq (1 - \tau)\|x\|$$

Ivkin et al. (2019b) show that it is possible to satisfy this contraction property using Count Sketches to compress gradients. However, their compression method includes a second round of communication: if there are no high-magnitude elements in $e_t$, as computed from $\mathcal{S}(e_t)$, the server can query clients for random entries of $e_t$. On the other hand, FetchSGD never computes the $e_t^i$ or $e_t$, so this second round of communication is not possible, and the analysis of Ivkin et al. (2019b) does not apply. In this section, we assume that the updates have heavy hitters, which ensures that the contraction property holds along the optimization path.

Assumption 1 (Scenario 1). Let $\{w_t\}_{t=1}^T$ be the sequence of models generated by FetchSGD. Fixing this model sequence, let $\{u_t\}_{t=1}^T$ and $\{e_t\}_{t=1}^T$ be the momentum and error accumulation vectors generated using this model sequence, had we not used sketching for gradient compression (i.e. if $\mathcal{S}$ and $\mathcal{U}$ are identity maps). There exists a constant $0 < \tau < 1$ such that for any $t \in [T]$, the quantity $q_t := \eta(\rho u_{t-1} + g_{t-1}) + e_{t-1}$ has at least one coordinate $i$ s.t. $(q_t^i)^2 \geq \tau \|q_t\|^2$.

Theorem 1 (Scenario 1). Let $f$ be an $L$-smooth² non-convex function and let the norm of stochastic gradients of $f$ be upper bounded by $G$. Under Assumption 1, FetchSGD, with step size $\eta = \frac{1-\rho}{2L\sqrt{T}}$, in $T$ iterations, returns $\{w_t\}_{t=1}^T$, such that, with probability at least $1 - \delta$ over the sketching randomness:

1. $\min_{t=1\cdots T} \mathbb{E}\left\|\nabla f(w_t)\right\|^2 \leq \frac{4L(f(w_0) - f^*) + G^2}{\sqrt{T}} + \frac{2(1+\tau)^2 G^2}{(1-\tau)\tau^2 T}$

2. The sketch uploaded from each participating client to the parameter server is $O(\log(dT/\delta)/\tau)$ bytes per round.

The expectation in part 1 of the theorem is over the randomness of sampling minibatches. For large $T$, the first term dominates, so the convergence rate in Theorem 1 matches that of uncompressed SGD.

Intuitively, Assumption 1 states that, at each time step, the descent direction – i.e., the scaled negative gradient, including momentum – and the error accumulation vector must point in sufficiently the same direction. This assumption is rather opaque, since it involves all of the gradient,

²A differentiable function $f$ is $L$-smooth if $\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\|$ $\forall x, y \in \mathrm{dom}(f)$.
of size I, we need a sliding window error accumulation

2. The sketch uploaded from each participating client to the parameter server is $\Theta(\log(dT/\delta)/\tau^2)$ bytes per round.

As in Theorem 1, the expectation in part 1 of the theorem is over the randomness of sampling minibatches.

³Technically, this definition is also parametrized by δ and β. However, in the interest of brevity, we use the simpler term "(I, τ)-sliding heavy" throughout the manuscript. Note that δ in Theorem 2 refers to the same δ as in Definition 1.
[Figure 3 plots: test accuracy (y-axis) vs. overall compression (x-axis) for CIFAR10 (left) and CIFAR100 (right), comparing FetchSGD, Local Top-k (ρg = 0.9), Local Top-k (ρg = 0), FedAvg (ρg = 0.9), FedAvg (ρg = 0), and Uncompressed.]
Figure 3. Test accuracy achieved on CIFAR10 (left) and CIFAR100 (right). “Uncompressed” refers to runs that attain compression by
simply running for fewer epochs. FetchSGD outperforms all methods, especially at higher compression. Many FedAvg and local top-k
runs are excluded from the plot because they failed to converge or achieved very low accuracy.
the Pareto frontier over hyperparameters for each method. Figures 7 and 9 in the Appendix show results for all runs that converged.

5.1. CIFAR (ResNet9)

CIFAR10 and CIFAR100 (Krizhevsky et al., 2009) are image classification datasets with 60,000 32 × 32px color images distributed evenly over 10 and 100 classes respectively (50,000/10,000 train/test split). They are benchmark computer vision datasets, and although they lack a natural non-i.i.d. partitioning, we artificially create one by giving each client images from only a single class. For CIFAR10 (CIFAR100) we use 10,000 (50,000) clients, yielding 5 (1) images per client. Our 7M-parameter model architecture, data preprocessing, and most hyperparameters follow Page (2019), with details in Appendix A.1. We report accuracy on the test datasets.

Figure 3 shows test accuracy vs. compression for CIFAR10 and CIFAR100. FedAvg and local top-k both struggle to achieve significantly better results than uncompressed SGD. Although we ran a large hyperparameter sweep, many runs simply diverge, especially for higher compression (local top-k) or more local iterations (FedAvg). We expect this setting to be challenging for FedAvg, since running multiple gradient steps on only one or a few data points, especially points that are not representative of the overall distribution, is unlikely to be productive. And although local top-k can achieve high upload compression, download compression is reduced to almost 1×, since summing sparse gradients from many workers, each with very different data, leads to a nearly dense model update each round.

Figure 4. Test accuracy on FEMNIST. The dataset is not very non-i.i.d., and has relatively large local datasets, but FetchSGD is still competitive with FedAvg and local top-k for lower compression.

5.2. FEMNIST (ResNet101)

The experiments above show that FetchSGD significantly outperforms competing methods in the regime of very small local datasets and non-i.i.d. data. In this section we introduce a task designed to be more favorable for FedAvg, and show that FetchSGD still performs competitively.

Federated EMNIST is an image classification dataset with 62 classes (upper- and lower-case letters, plus digits) (Caldas et al., 2018), which is formed by partitioning the EMNIST dataset (Cohen et al., 2017) such that each client in FEMNIST contains characters written by a single person. Experimental details, including our 40M-parameter model architecture, can be found in Appendix A.2. We report final accuracies on the validation dataset. The baseline run trains for a single epoch (i.e., each client participates once).
[Figure 5 plots: left, validation perplexity (y-axis) vs. overall compression (x-axis); right, train NLL (y-axis) vs. iterations (x-axis), for runs including Local Top-k (ρg = 0), FedAvg (ρg = 0.9), FedAvg (ρg = 0), and Uncompressed (1.0×).]
Figure 5. Left: Validation perplexity achieved by finetuning GPT2-small on PersonaChat. FetchSGD achieves 3.9× compression
without loss in accuracy over uncompressed SGD, and it consistently achieves lower perplexity than FedAvg and top-k runs with similar
compression. Right: Training loss curves for representative runs. Global momentum hinders local top-k in this case, so local top-k runs
with ρ g = 0.9 are omitted here to increase legibility.
FEMNIST was introduced as a benchmark dataset for FedAvg, and it has relatively large local datasets (∼200 images per client). The clients are split according to the person who wrote the character, yielding a data distribution closer to i.i.d. than our per-class splits of CIFAR10. To maintain a reasonable overall batch size, only three clients participate each round, reducing the need for a linear compression operator. Despite this, FetchSGD performs competitively with both FedAvg and local top-k for some compression values, as shown in Figure 4.

For low compression, FetchSGD actually outperforms the uncompressed baseline, likely because updating only k parameters per round regularizes the model. Interestingly, local top-k using global momentum significantly outperforms other methods on this task, though we are not aware of prior work suggesting this method for federated learning. Despite this surprising observation, local top-k with global momentum suffers from divergence and low accuracy on our other tasks, and it lacks any theoretical guarantees.

5.3. PersonaChat (GPT2)

In this section we consider GPT2-small (Radford et al., 2019), a transformer model with 124M parameters that is used for language modeling. We finetune a pretrained GPT2 on the PersonaChat dataset, a chit-chat dataset consisting of conversations between Amazon Mechanical Turk workers who were assigned faux personalities to act out (Zhang et al., 2018). The dataset has a natural non-i.i.d. partitioning into 17,568 clients based on the personality that was assigned. Our experimental procedure follows Wolf (2019). The baseline model trains for a single epoch, meaning that no local state is possible, and we report the final perplexity (a standard metric for language models; lower is better) on the validation dataset in Figure 5.

Figure 5 also plots loss curves (negative log likelihood) achieved during training for some representative runs. Somewhat surprisingly, all the compression techniques outperform the uncompressed baseline early in training, but most saturate too early, when the error introduced by the compression starts to hinder training.

Sketching outperforms local top-k for all but the highest levels of compression, because local top-k relies on local state for error feedback, which is impossible in this setting. We expect this setting to be challenging for FedAvg, since running multiple gradient steps on a single conversation that is not representative of the overall distribution is unlikely to be productive.

6. Discussion

Federated learning has seen a great deal of research interest recently, particularly in the domain of communication efficiency. A considerable amount of prior work focuses on decreasing the total number of communication rounds required to converge, without reducing the communication required in each round. In this work, we complement this body of work by introducing FetchSGD, an algorithm that reduces the amount of communication required each round, while still conforming to the other constraints of the federated setting. We particularly want to emphasize that FetchSGD easily addresses the setting of non-i.i.d. data, which often complicates other methods. The optimal algorithm for many federated learning settings will no doubt combine efficiency in number of rounds and efficiency within each round, and we leave an investigation into optimal ways of combining these approaches to future work.
EU. 2018 reform of eu data protection rules, 2018. URL https://tinyurl.com/ydaltt5g.

Robin C. Geyer, Tassilo Klein, and Moin Nabi. Differentially private federated learning: A client level perspective, 2017.

Oana Goga and Renata Teixeira. Speed measurements of residential internet access. In International Conference on Passive and Active Network Measurement, pages 168–178. Springer, 2012.

Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

Andrew Hard, Kanishka Rao, Rajiv Mathews, Swaroop Ramaswamy, Francoise Beaufays, Sean Augenstein, Hubert Eichner, Chloé Kiddon, and Daniel Ramage. Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604, 2018.

Stephen Hardy, Wilko Henecka, Hamish Ivey-Law, Richard Nock, Giorgio Patrini, Guillaume Smith, and Brian Thorne. Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption. arXiv preprint arXiv:1711.10677, 2017.

Nikita Ivkin, Zaoxing Liu, Lin F Yang, Srinivas Suresh Kumar, Gerard Lemson, Mark Neyrinck, Alexander S Szalay, Vladimir Braverman, and Tamas Budavari. Scalable streaming tools for analyzing n-body simulations: Finding halos and investigating excursion sets in one pass. Astronomy and computing, 23:166–179, 2018.

Nikita Ivkin, Ran Ben Basat, Zaoxing Liu, Gil Einziger, Roy Friedman, and Vladimir Braverman. I know what you did last summer: Network monitoring using interval queries. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3(3):1–28, 2019a.

Nikita Ivkin, Daniel Rothchild, Enayat Ullah, Vladimir Braverman, Ion Stoica, and Raman Arora. Communication-efficient distributed sgd with sketching. In Advances in Neural Information Processing Systems, pages 13144–13154, 2019b.

Nikita Ivkin, Zhuolong Yu, Vladimir Braverman, and Xin Jin. Qpipe: Quantiles sketch fully in the data plane. In Proceedings of the 15th International Conference on Emerging Networking Experiments And Technologies, pages 285–291, 2019c.

Jiawei Jiang, Fangcheng Fu, Tong Yang, and Bin Cui. Sketchml: Accelerating distributed machine learning with data sketches. In Proceedings of the 2018 International Conference on Management of Data, pages 1269–1284, 2018.

Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G. L. D'Oliveira, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaid Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara Javidi, Gauri Joshi, Mikhail Khodak, Jakub Konecny, Aleksandra Korolova, Farinaz Koushanfar, Sanmi Koyejo, Tancrède Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock, Ayfer Özgür, Rasmus Pagh, Mariana Raykova, Hang Qi, Daniel Ramage, Ramesh Raskar, Dawn Song, Weikang Song, Sebastian U. Stich, Ziteng Sun, Ananda Theertha Suresh, Florian Tramèr, Praneeth Vepakomma, Jianyu Wang, Li Xiong, Zheng Xu, Qiang Yang, Felix X. Yu, Han Yu, and Sen Zhao. Advances and open problems in federated learning, 2019.

Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for on-device federated learning, 2019a.

Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian U Stich, and Martin Jaggi. Error feedback fixes signsgd and other gradient compression schemes. arXiv preprint arXiv:1901.09847, 2019b.

Jakub Konecny, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency, 2016.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.

Kyunghan Lee, Joohyun Lee, Yung Yi, Injong Rhee, and Song Chong. Mobile data offloading: How much can wifi deliver? In Proceedings of the 6th International Conference, pages 1–12, 2010.

David Leroy, Alice Coucke, Thibaut Lavril, Thibault Gisselbrecht, and Joseph Dureau. Federated learning for keyword spotting. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6341–6345. IEEE, 2019.

He Li, Kaoru Ota, and Mianxiong Dong. Learning iot in edge: Deep learning for the internet of things with edge computing. IEEE network, 32(1):96–101, 2018.

Tian Li, Zaoxing Liu, Vyas Sekar, and Virginia Smith. Privacy for free: Communication-efficient learning with differential privacy using sketches. arXiv preprint arXiv:1911.00972, 2019.

Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887, 2017.

Zaoxing Liu, Nikita Ivkin, Lin Yang, Mark Neyrinck, Gerard Lemson, Alexander Szalay, Vladimir Braverman, Tamas Budavari, Randal Burns, and Xin Wang. Streaming algorithms for halo finders. In 2015 IEEE 11th International Conference on e-Science, pages 342–351. IEEE, 2015.

H Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.

Jayadev Misra and David Gries. Finding repeated elements. Science of computer programming, 2(2):143–152, 1982.

Lev Muchnik, Sen Pei, Lucas C Parra, Saulo DS Reis, José S Andrade Jr, Shlomo Havlin, and Hernán A Makse. Origins of power-law degree distribution in the heterogeneity of human activity in social networks. Scientific reports, 3(1):1–8, 2013.

Shanmugavelayutham Muthukrishnan et al. Data streams: Algorithms and applications. Foundations and Trends in Theoretical Computer Science, 1(2):117–236, 2005.

David Page. How to train your resnet, Nov 2019. URL https://myrtle.ai/how-to-train-your-resnet/.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.

Anit Kumar Sahu, Tian Li, Maziar Sanjabi, Manzil Zaheer, Ameet Talwalkar, and Virginia Smith. On the convergence of federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127, 2018.

Shaohuai Shi, Xiaowen Chu, Ka Chun Cheung, and Simon See. Understanding top-k sparsification in distributed deep learning. arXiv preprint arXiv:1911.08772, 2019.

Weisong Shi, Jie Cao, Quan Zhang, Youhuizi Li, and Lanyu Xu. Edge computing: Vision and challenges. IEEE internet of things journal, 3(5):637–646, 2016.

Ryan Spring, Anastasios Kyrillidis, Vijai Mohan, and Anshumali Shrivastava. Compressing gradient optimizers via count-sketches. arXiv preprint arXiv:1902.00179, 2019.

Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified sgd with memory. In Advances in Neural Information Processing Systems, pages 4447–4458, 2018.

Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147, 2013.

Mark Tomlinson, Wesley Solomon, Yages Singh, Tanya Doherty, Mickey Chopra, Petrida Ijumba, Alexander C Tsai, and Debra Jackson. The use of mobile phones as a data collection tool: a report from a household survey in south africa. BMC medical informatics and decision making, 9(1):51, 2009.

Jianyu Wang and Gauri Joshi. Cooperative sgd: A unified framework for the design and analysis of communication-efficient sgd algorithms, 2018.

Shiqiang Wang, Tiffany Tuor, Theodoros Salonidis, Kin K Leung, Christian Makaya, Ting He, and Kevin Chan. Adaptive federated learning in resource constrained edge computing systems. IEEE Journal on Selected Areas in Communications, 37(6):1205–1221, 2019.

Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems, pages 1299–1309, 2018.

Thomas Wolf. How to build a state-of-the-art conversational ai with transfer learning, May 2019. URL https://tinyurl.com/ryehjbt.

Thomas Wolf, L Debut, V Sanh, J Chaumond, C Delangue, A Moi, P Cistac, T Rault, R Louf, M Funtowicz, et al. Huggingface's transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771, 2019.