Sai Praneeth Karimireddy 1 2 Satyen Kale 3 Mehryar Mohri 3 4 Sashank J. Reddi 3 Sebastian U. Stich 1
Ananda Theertha Suresh 3
Abstract

Federated Averaging (FedAvg) has become the algorithm of choice for federated learning due to its simplicity and low communication cost. However, in spite of recent research efforts, its performance is not fully understood. We obtain tight convergence rates for FedAvg and prove that it suffers from 'client-drift' when the data is heterogeneous (non-iid), resulting in unstable and slow convergence.

As a solution, we propose a new algorithm (SCAFFOLD) which uses control variates (variance reduction) to correct for the 'client-drift' in its local updates. We prove that SCAFFOLD requires significantly fewer communication rounds and is not affected by data heterogeneity or client sampling. Further, we show that (for quadratics) SCAFFOLD can take advantage of similarity in the clients' data, yielding even faster convergence. The latter is the first result to quantify the usefulness of local steps in distributed optimization.

1. Introduction

Federated learning has emerged as an important paradigm in modern large-scale machine learning. Unlike in traditional centralized learning, where models are trained using large datasets stored in a central server (Dean et al., 2012; Iandola et al., 2016; Goyal et al., 2017), in federated learning the training data remains distributed over a large number of clients, which may be phones, network sensors, hospitals, or alternative local information sources (Konečnỳ et al., 2016b;a; McMahan et al., 2017; Mohri et al., 2019; Kairouz et al., 2019). A centralized model (referred to as the server model) is then trained without ever transmitting client data over the network, thereby ensuring a basic level of privacy. In this work, we investigate stochastic optimization for this setting, whose key challenges include 1) dealing with unreliable and relatively slow network connections between the server and the clients, 2) only a small subset of clients being available for training at a given time, and 3) large heterogeneity (non-iid-ness) in the data present on the different clients (Konečnỳ et al., 2016a).

The most popular algorithm for this setting is FedAvg (McMahan et al., 2017). FedAvg tackles the communication bottleneck by performing multiple local updates on the available clients before communicating to the server. While it has shown success in certain applications, its performance on heterogeneous data is still an active area of research (Li et al., 2018; Yu et al., 2019; Li et al., 2019b; Haddadpour & Mahdavi, 2019; Khaled et al., 2020). We prove that such heterogeneity indeed has a large effect on FedAvg: it introduces a drift in the updates of each client, resulting in slow and unstable convergence. Further, we show that this client-drift persists even if full-batch gradients are used and all clients participate throughout the training.

As a solution, we propose a new Stochastic Controlled Averaging algorithm (SCAFFOLD) which tries to correct for this client-drift. Intuitively, SCAFFOLD estimates the update direction for the server model (c) and the update direction for each client (c_i).¹ The difference (c − c_i) is then an estimate of the client-drift, which is used to correct the local update. This strategy successfully overcomes heterogeneity and converges in significantly fewer rounds of communication. Alternatively, one can see heterogeneity as introducing 'client-variance' in the updates across the different clients; SCAFFOLD then performs 'client-variance reduction' (Schmidt et al., 2017; Johnson & Zhang, 2013; Defazio et al., 2014). We use this viewpoint to show that SCAFFOLD is relatively unaffected by client sampling.

Finally, while accommodating heterogeneity is important, it is equally important that a method can take advantage of similarities in the client data. We prove that SCAFFOLD indeed has such a property, requiring fewer rounds of communication when the clients are more similar.

---
1 EPFL, Lausanne. 2 Based on work performed at Google Research, New York. 3 Google Research, New York. 4 Courant Institute, New York. Correspondence to: Sai Praneeth Karimireddy <sai.karimireddy@epfl.ch>.
¹We refer to these estimates as control variates and the resulting correction technique as stochastic controlled averaging.
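The control-variate correction described above is short enough to sketch in code. The following is an illustrative sketch on synthetic quadratic clients, not the paper's reference implementation: it assumes full client participation, η_g = 1, and the 'option II' control-variate update mentioned in Section 7.1 (c_i⁺ = c_i − c + (x − y_i)/(Kη_l)); all function names, step-sizes, and problem instances are made up for the demo.

```python
import numpy as np

def grad(A, b, y):
    # Gradient of the quadratic client objective f_i(y) = 0.5*y^T A y - b^T y.
    return A @ y - b

def scaffold_round(x, c, ci, clients, eta_l, eta_g, K):
    """One SCAFFOLD communication round with full client participation."""
    N = len(clients)
    dx = np.zeros_like(x)  # average model update
    dc = np.zeros_like(x)  # average control-variate update
    for i, (A, b) in enumerate(clients):
        y = x.copy()
        for _ in range(K):
            # Local step corrected by the drift estimate (c - c_i).
            y -= eta_l * (grad(A, b, y) - ci[i] + c)
        # "Option II" control-variate update: reuse the local progress.
        ci_new = ci[i] - c + (x - y) / (K * eta_l)
        dx += (y - x) / N
        dc += (ci_new - ci[i]) / N
        ci[i] = ci_new
    return x + eta_g * dx, c + dc

# Two clients with identical Hessians but far-apart optima (large gradient
# heterogeneity): the corrected local steps still drive x to the optimum of
# the *average* objective, A^{-1}(b_1 + b_2)/2 = [0, 2].
A = np.diag([2.0, 1.0])
clients = [(A, np.array([10.0, -4.0])), (A, np.array([-10.0, 8.0]))]
x, c = np.array([5.0, 5.0]), np.zeros(2)
ci = [np.zeros(2), np.zeros(2)]
for _ in range(100):
    x, c = scaffold_round(x, c, ci, clients, eta_l=0.05, eta_g=1.0, K=10)
# x -> [0, 2]; c -> grad f(x*) = 0; each ci -> grad f_i(x*)
```

Note how each c_i converges to ∇f_i(x⋆), so the correction (c − c_i) exactly cancels the pull of each client toward its own optimum.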
SCAFFOLD: Stochastic Controlled Averaging for Federated Learning
Table 2. Number of communication rounds required to reach ε accuracy for µ strongly convex and non-convex functions (log factors are ignored). Set µ = ε for general convex rates. (G, B) bounds gradient dissimilarity (A1), and δ bounds Hessian dissimilarity (A2). Our rates for FedAvg are more general and tighter than others, even matching the lower bound. However, SGD is still faster (B ≥ 1). SCAFFOLD does not require any assumptions, is faster than SGD, and is robust to client sampling. Further, when clients become more similar (small δ), SCAFFOLD converges even faster.

| Method | Strongly convex | Non-convex | Sampling | Assumptions |
| SGD (large batch) | σ²/(µNKε) + 1/µ | σ²/(NKε²) + 1/ε | ✗ | – |
| FedAvg (Li et al., 2019b) | σ²/(µ²NKε) + G²K/(µ²ε) | – | ✗ | (G, 0)-BGD |
| FedAvg (Yu et al., 2019) | – | σ²/(NKε²) + G²NK/ε | ✗ | (G, 0)-BGD |
| FedAvg (Khaled et al., 2020) | (σ² + G²)/(µNKε) + (σ + G)/(µ√ε) + NB²/µ | – | ✗ | (G, B)-BGD |
| FedAvg, ours (Thm. I)¹ | M²/(µSKε) + G/(µ√ε) + B²/µ | M²/(SKε²) + G/ε^{3/2} + B²/ε | ✓ | (G, B)-BGD |
| Lower bound (Thm. II) | Ω(σ²/(µNKε) + G/(µ√ε)) | ? | ✗ | (G, 1)-BGD |
| FedProx (Li et al., 2018)² | B²/µ | B²/ε (weakly convex) | ✓ | σ = 0, (0, B)-BGD |
| DANE (Shamir et al., 2014)²,³ | δ²/µ² | – | ✗ | σ = 0, δ-BHD |
| VRL-SGD (Liang et al., 2019) | – | Nσ²/(Kε²) + N/ε | ✗ | – |
| SCAFFOLD (Thm. III) | σ²/(µSKε) + 1/µ + N/S | σ²/(SKε²) + (1/ε)(N/S)^{2/3} | ✓ | – |
| SCAFFOLD (Thm. IV)³ | σ²/(µNKε) + 1/(µK) + δ/µ | σ²/(NKε²) + 1/(Kε) + δ/ε | ✗ | δ-BHD |

¹ M² := σ² + K(1 − S/N)G². Note that M² = σ² when there is no sampling (S = N).
² Proximal point method, i.e., K ≫ 1.
³ Proved only for quadratic functions.
3.1. Rate of convergence

We now state our novel convergence results for functions with bounded dissimilarity (proofs in Appendix D.2).

Theorem I. For β-smooth functions {f_i} which satisfy (A1), the output of FedAvg has expected error smaller than ε for some values of η_l, η_g, R satisfying

• Strongly convex: η_g ≥ √S, η_l ≤ 1/((1 + B²)6βKη_g), and
  R = Õ( σ²/(µKSε) + (1 − S/N)·G²/(µSε) + √β·G/(µ√ε) + B²β/µ ),

• General convex: η_g ≥ √S, η_l ≤ 1/((1 + B²)6βKη_g), and
  R = O( σ²D²/(KSε²) + (1 − S/N)·G²D²/(Sε²) + √β·G D²/ε^{3/2} + B²βD²/ε ),

• Non-convex: η_g ≥ √S, η_l ≤ 1/((1 + B²)6βKη_g), and
  R = O( σ²F/(KSε²) + (1 − S/N)·G²F/(Sε²) + √β·G F/ε^{3/2} + B²βF/ε ),

where D := ‖x⁰ − x⋆‖ and F := f(x⁰) − f⋆.

It is illuminating to compare our rates with those of the simpler iid case, i.e., with G = 0 and B = 1. Our strongly convex rate becomes σ²/(µSKε) + 1/µ. In comparison, the best previously known rate for this case was by Stich & Karimireddy (2019), who show a rate of σ²/(µSKε) + S/µ. The main source of improvement in our rates is the use of two separate step-sizes (η_l and η_g). By having a larger global step-size η_g, we can use a smaller local step-size η_l, thereby reducing the client-drift while still ensuring progress. However, even our improved rates do not match the lower bound for the identical case of σ²/(µSKε) + 1/(Kµ) (Woodworth et al., 2018). We bridge this gap for quadratic functions in Section 6.

We now compare FedAvg to two other algorithms, FedProx (Li et al., 2018) (aka EASGD (Zhang et al., 2015)) and SGD. Suppose that G = 0 and σ = 0, i.e., we use full-batch gradients and all clients have very similar optima. In such a case, FedAvg has a complexity of B²/µ, which is identical to that of FedProx (Li et al., 2018). Thus, FedProx does not have any theoretical advantage.

Next, suppose that all clients participate (no sampling) with
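The client-drift penalized by the G-terms above can be seen already in one dimension. The snippet below is a hypothetical illustration (not one of the paper's experiments): two deterministic quadratic clients with different curvatures, full participation, and η_g = 1. With K = 1 local step FedAvg is plain gradient descent and finds the true optimum; with K = 10 and the same fixed step-size, the average of the clients' K-step maps has a biased fixed point.

```python
# Minimal client-drift illustration (made-up setup): clients
# f_i(y) = 0.5 * a_i * (y - m_i)^2, full-batch gradients, eta_g = 1.
def fedavg(K, eta_l=0.1, rounds=300):
    clients = [(4.0, 1.0), (1.0, -1.0)]  # (curvature a_i, optimum m_i)
    x = 0.0
    for _ in range(rounds):
        updates = []
        for a, m in clients:
            y = x
            for _ in range(K):
                y -= eta_l * a * (y - m)  # local gradient step toward m_i
            updates.append(y - x)
        x += sum(updates) / len(updates)  # server averaging, eta_g = 1
    return x

x_star = (4.0 * 1.0 + 1.0 * (-1.0)) / (4.0 + 1.0)  # true optimum = 0.6
x_k1, x_k10 = fedavg(K=1), fedavg(K=10)
# x_k1 recovers x_star = 0.6, while x_k10 settles near 0.21: each client's
# K-step map contracts toward its own optimum, biasing the average.
```

The bias vanishes as η_l → 0, matching the role of the small local step-size in Theorem I, but a tiny η_l in turn requires many more rounds.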
update ∇f_i(y). The difference between the ideal update (6) and the FedAvg update (1) is ‖∇f_i(y) − ∇f(y)‖. We need a bound on the gradient-dissimilarity as in (A1) to bound this error. SCAFFOLD instead uses the update ∇f_i(y) − c_i + c, and the difference from the ideal update becomes

Σ_i ‖(∇f_i(y) − c_i + c) − ∇f(y)‖² ≤ Σ_i ‖c_i − ∇f_i(y)‖².

Thus, the error is independent of how similar or dissimilar the functions f_i are, and instead only depends on the quality of our approximation c_i ≈ ∇f_i(y). Since f_i is smooth, we can expect that the gradient ∇f_i(y) does not change too fast and hence is easy to approximate. Appendix E translates this intuition into a formal proof.

6. Usefulness of local steps

In this section we investigate when and why taking local steps might be useful over simply computing a large-batch gradient in distributed optimization. We will show that when the functions across the clients share some similarity, local steps can take advantage of this and converge faster. For this we consider quadratic functions and express their similarity with the δ parameter introduced in (A2).

Theorem IV. For any β-smooth quadratic functions {f_i} with δ-bounded Hessian dissimilarity (A2), the output of SCAFFOLD with S = N (no sampling) has error smaller than ε for η_g = 1 and some values of η_l, R satisfying

• Strongly convex: η_l ≤ 1/(15Kδ + 8β) and R = Õ( βσ²/(µKNε) + (β + δK)/(µK) ),

• Weakly convex: η_l ≤ 1/(15Kδ + 8β) and R = O( βσ²F/(KNε²) + (β + δK)F/(Kε) ),

where we define F := f(x⁰) − f⋆.

When σ = 0 and K is large, the complexity of SCAFFOLD becomes δ/µ. In contrast, DANE, which being a proximal point method also uses large K, requires (δ/µ)² rounds (Shamir et al., 2014), which is significantly slower, or needs an additional backtracking line search to match the rates of SCAFFOLD (Yuan & Li, 2019). Further, Theorem IV is, as far as we are aware, the first result to demonstrate improvement due to similarity for non-convex functions.

Suppose that {f_i} are identical. Recall that δ in (A2) measures the Hessian dissimilarity between functions, so δ = 0 in this case. Then Theorem IV shows that the complexity of SCAFFOLD is σ²/(µKNε) + 1/(µK), which (up to acceleration) matches the i.i.d. lower bound of (Woodworth et al., 2018). In contrast, SGD with a K times larger batch size would require σ²/(µKNε) + 1/µ (note the absence of K in the second term). Thus, for identical functions, SCAFFOLD (and in fact even FedAvg) improves linearly with an increasing number of local steps. In the other extreme, if the functions are arbitrarily different, we may have δ = 2β. In this case, the complexity of SCAFFOLD and large-batch SGD match the lower bound of Arjevani & Shamir (2015) for the heterogeneous case.

The above insights can be generalized to when the functions are only somewhat similar. If the Hessians are δ-close and σ = 0, then the complexity is (β + δK)/(µK). This bound implies that the optimum number of local steps one should use is K = β/δ. Picking a smaller K increases the communication required, whereas increasing it further would only waste computational resources. While this result is intuitive (if the functions are more 'similar', local steps are more useful), Theorem IV shows that it is the similarity of the Hessians which matters. This is surprising since the Hessians of {f_i} may be identical even if their individual optima x⋆_i are arbitrarily far away from each other and the gradient-dissimilarity (A1) is unbounded.

Proof sketch. Consider a simplified SCAFFOLD update with σ = 0 and no sampling (S = N):

y_i = y_i − η(∇f_i(y_i) + ∇f(x) − ∇f_i(x)).

We would ideally want to perform the update y_i = y_i − η∇f(y_i) using the full gradient ∇f(y_i). We reinterpret the correction term of SCAFFOLD (c − c_i) as performing the following first-order correction to the local gradient ∇f_i(y_i) to make it closer to the full gradient ∇f(y_i): using ∇f_i(y_i) − ∇f_i(x) ≈ ∇²f_i(x)(y_i − x) and ∇f(x) ≈ ∇f(y_i) + ∇²f(x)(x − y_i), we get

∇f_i(y_i) − ∇f_i(x) + ∇f(x) ≈ ∇f(y_i) + (∇²f_i(x) − ∇²f(x))(y_i − x) ≈ ∇f(y_i) + δ(y_i − x).

Thus the SCAFFOLD update approximates the ideal update up to an error δ. This intuition is proved formally for quadratic functions in Appendix F. Generalizing these results to other functions is a challenging open problem.

7. Experiments

We run experiments on both simulated and real datasets to confirm our theory. Our main findings are i) SCAFFOLD consistently outperforms SGD and FedAvg across all parameter regimes, and ii) the benefit (or harm) of local steps depends on both the algorithm and the similarity of the clients' data.
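For quadratics, the approximations in the proof sketch above are exact, which is easy to check numerically. The snippet below is an illustrative check with made-up random matrices: it verifies that the corrected gradient ∇f_i(y_i) − ∇f_i(x) + ∇f(x) equals ∇f(y_i) + (∇²f_i − ∇²f)(y_i − x) exactly, so the deviation from the ideal update is controlled by the Hessian dissimilarity alone.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 4
# Random quadratic clients f_i(y) = 0.5*y^T A_i y - b_i^T y (A_i symmetric).
As = [(lambda M: (M + M.T) / 2)(rng.normal(size=(d, d))) for _ in range(N)]
bs = [rng.normal(size=d) for _ in range(N)]
A_bar, b_bar = sum(As) / N, sum(bs) / N  # Hessian and linear term of f

x, yi = rng.normal(size=d), rng.normal(size=d)
i = 1
# Corrected local gradient: grad f_i(y_i) - grad f_i(x) + grad f(x).
corrected = (As[i] @ yi - bs[i]) - (As[i] @ x - bs[i]) + (A_bar @ x - b_bar)
ideal = A_bar @ yi - b_bar  # the full gradient at y_i
# Exact identity for quadratics: corrected = ideal + (A_i - A_bar)(y_i - x),
# so ||corrected - ideal|| <= ||A_i - A_bar|| * ||y_i - x||.
assert np.allclose(corrected, ideal + (As[i] - A_bar) @ (yi - x))
```

For non-quadratic smooth functions the same identity holds only up to Taylor-remainder terms, which is why extending Theorem IV beyond quadratics remains open.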
[Figure 3: training curves (loss vs. communication rounds, 0–60) for SGD (K = 1) and FedAvg/SCAFFOLD with K = 2 and K = 10, one panel per heterogeneity level.]

Figure 3. SGD (dashed black), FedAvg (above), and SCAFFOLD (below) on simulated data. FedAvg gets worse as the number of local steps increases, with K = 10 (red) worse than K = 2 (orange). It also gets slower as gradient-dissimilarity (G) increases (to the right). SCAFFOLD significantly improves with more local steps, with K = 10 (blue) faster than K = 2 (light blue) and SGD. Its performance is identical as we vary heterogeneity (G).

7.1. Setup

Our simulated experiments use N = 2 quadratic functions based on our lower bounds in Theorem II. We use full-batch gradients (σ = 0) and no client sampling. Our real-world experiments run logistic regression (convex) and a 2-layer fully connected network (non-convex) on EMNIST (Cohen et al., 2017). We divide this dataset among N = 100 clients as follows: for s% similar data we allocate to each client s% i.i.d. data and the remaining (100 − s)% by sorting according to label (cf. Hsu et al. (2019)).

We consider four algorithms: SGD, FedAvg, SCAFFOLD, and EASGD (aka FedProx). On each sampled client, SGD uses the full local data to compute a single local update, whereas the other algorithms take 5 steps per epoch (batch size is 0.2 of the local data). We always use global step-size η_g = 1 and tune the local step-size η_l individually for each algorithm. SCAFFOLD uses option II (no extra gradient computations) to keep the comparison fair, and EASGD uses a fixed regularization parameter of 1 as in (Li et al., 2018).

7.2. Simulated results

The results are summarized in Fig. 3. Our simulated data has Hessian difference δ = 1 (A2) and β = 1. We vary the gradient heterogeneity (A1) as G ∈ [1, 10, 100]. For all values of G, FedAvg gets slower as we increase the number of local steps. This is explained by the fact that client-drift increases with the number of local steps, hindering progress. Further, as we increase G, FedAvg continues to slow down, exactly as dictated by Thms. I and II. Note that when heterogeneity is small (G = β = 1), FedAvg can be competitive with SGD.

7.3. EMNIST results

We run extensive experiments on the EMNIST dataset to measure the interplay between the algorithm, the number of epochs (local updates), the number of participating clients, and the client similarity. Table 3 measures the benefit (or harm) of using more local steps, Table 4 studies the resilience to client sampling, and Table 5 reports preliminary results on neural networks. We are mainly concerned with minimizing the number of communication rounds. We observe that SCAFFOLD is consistently the best. Across all ranges of values tried, SCAFFOLD outperforms SGD, FedAvg, and EASGD. EASGD is always slower than the other local update methods, though in some cases it outperforms SGD. FedAvg is always slower than SCAFFOLD and faster than EASGD.

SCAFFOLD > SGD > FedAvg for heterogeneous clients. When similarity is 0%, FedAvg gets slower with increasing local steps. If we take more than 5 epochs, its performance is worse than SGD's. SCAFFOLD also initially worsens as we increase the number of epochs but then flattens. However, its performance is always better than that of SGD, confirming that SCAFFOLD can handle heterogeneous data.

SCAFFOLD and FedAvg get faster with more similarity, but not SGD. As the similarity of the clients increases, the performance of SGD remains relatively constant. On the other hand, SCAFFOLD and FedAvg get significantly faster as similarity increases. Further, local steps become much more useful, showing monotonic improvement with the increase in the number of epochs. This is because with increasing i.i.d.-ness of the data, both the gradient and Hessian dissimilarity decrease.

SCAFFOLD is resilient to client sampling. As we decrease the fraction of clients sampled, SCAFFOLD and FedAvg only show a sub-linear slow-down. Unsurprisingly, they are more resilient to sampling in the case of higher similarity.

SCAFFOLD outperforms FedAvg on non-convex experiments. We see that SCAFFOLD is better than FedAvg in terms of final test accuracy reached, though interestingly FedAvg seems better than SGD even when similarity is 0. However, much more extensive experiments (beyond the current scope) are needed before drawing conclusions.
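The s% similarity split from Section 7.1 can be sketched as follows. This is a hypothetical re-implementation (the function name and details are our own, not the authors' code): a `similarity` fraction of the indices is dealt out i.i.d., and the remainder is sorted by label before being sharded across clients.

```python
import numpy as np

def split_clients(labels, num_clients=100, similarity=0.1, seed=0):
    """Partition example indices among clients: a `similarity` fraction is
    allocated i.i.d., the rest by sorting according to label."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    n_iid = int(similarity * len(labels))
    iid, rest = idx[:n_iid], idx[n_iid:]
    rest = rest[np.argsort(labels[rest], kind="stable")]  # group by label
    clients = [[] for _ in range(num_clients)]
    for part in (iid, rest):  # deal each part out evenly across clients
        for c, shard in enumerate(np.array_split(part, num_clients)):
            clients[c].extend(shard.tolist())
    return clients

labels = np.repeat(np.arange(10), 100)  # 10 balanced classes, 1000 samples
sorted_split = split_clients(labels, num_clients=10, similarity=0.0)
iid_split = split_clients(labels, num_clients=10, similarity=1.0)
# With similarity 0 each client sees very few labels (maximal heterogeneity);
# with similarity 1 each client's shard covers most labels.
```

Intermediate values of `similarity` interpolate between the two regimes, which is what the 0%/10%/100% columns in Tables 3–5 vary.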
Table 3. Communication rounds to reach 0.5 test accuracy for logistic regression on EMNIST as we vary the number of epochs. "1k+" indicates that 0.5 accuracy was not reached even after 1k rounds (in the original barplot, an arrowhead similarly indicates a bar extending beyond the table). 1 epoch for the local update methods corresponds to 5 local steps (0.2 batch size), and 20% of clients are sampled each round. Across all parameters (epochs and similarity), SCAFFOLD is the fastest method. When similarity is 0 (sorted data), FedAvg consistently gets worse as we increase the number of epochs, quickly becoming slower than SGD. SCAFFOLD initially gets worse and later stabilizes, but is always at least as fast as SGD. As similarity increases (i.e. data is more shuffled), both FedAvg and SCAFFOLD significantly outperform SGD, though SCAFFOLD is still better than FedAvg. Further, with higher similarity, both methods benefit from an increasing number of epochs.

| Method | Epochs | 0% similarity (sorted) | 10% similarity | 100% similarity (i.i.d.) |
| SGD | 1 | 317 (1×) | 365 (1×) | 416 (1×) |
| SCAFFOLD | 1 | 77 (4.1×) | 62 (5.9×) | 60 (6.9×) |
| | 5 | 152 (2.1×) | 20 (18.2×) | 10 (41.6×) |
| | 10 | 286 (1.1×) | 16 (22.8×) | 7 (59.4×) |
| | 20 | 266 (1.2×) | 11 (33.2×) | 4 (104×) |
| FedAvg | 1 | 258 (1.2×) | 74 (4.9×) | 83 (5×) |
| | 5 | 428 (0.7×) | 34 (10.7×) | 10 (41.6×) |
| | 10 | 711 (0.4×) | 25 (14.6×) | 6 (69.3×) |
| | 20 | 1k+ (<0.3×) | 18 (20.3×) | 4 (104×) |
| EASGD | 1 | 1k+ (<0.3×) | 979 (0.4×) | 459 (0.9×) |
| | 5 | 1k+ (<0.3×) | 794 (0.5×) | 351 (1.2×) |
| | 10 | 1k+ (<0.3×) | 894 (0.4×) | 308 (1.4×) |
| | 20 | 1k+ (<0.3×) | 916 (0.4×) | 351 (1.2×) |

Each cell reports the number of rounds, with the speedup over SGD at the same similarity in parentheses.

Table 4. Communication rounds to reach 0.45 test accuracy for logistic regression on EMNIST as we vary the number of sampled clients. The number of epochs is kept fixed to 5. SCAFFOLD is consistently faster than FedAvg. As we decrease the number of clients sampled in each round, the increase in the number of rounds is sub-linear. This slow-down is smaller for more similar clients.

| Method | Clients | 0% similarity | 10% similarity |
| SCAFFOLD | 20% | 143 (1.0×) | 9 (1.0×) |
| | 5% | 290 (2.0×) | 13 (1.4×) |
| | 1% | 790 (5.5×) | 28 (3.1×) |
| FedAvg | 20% | 179 (1.0×) | 12 (1.0×) |
| | 5% | 334 (1.9×) | 17 (1.4×) |
| | 1% | 1k+ (5.6+×) | 35 (2.9×) |

Table 5. Best test accuracy after 1k rounds with a 2-layer fully connected neural network (non-convex) on EMNIST, trained with 5 epochs per round (25 steps) for the local methods and 20% of clients sampled each round. SCAFFOLD again outperforms the other methods, and SGD has the lowest accuracy. SGD is unaffected by similarity, whereas the local methods improve with client similarity.

| Method | 0% similarity | 10% similarity |
| SGD | 0.766 | 0.764 |
| FedAvg | 0.787 | 0.828 |
| SCAFFOLD | 0.801 | 0.842 |
8. Conclusion

Our work studied the impact of heterogeneity on the performance of optimization methods for federated learning. Our careful theoretical analysis showed that FedAvg can be severely hampered by gradient dissimilarity, and can be even slower than SGD. We then proposed a new stochastic algorithm (SCAFFOLD) which overcomes gradient dissimilarity using control variates. We demonstrated the effectiveness of SCAFFOLD via strong convergence guarantees and empirical evaluations. Further, we showed that while SCAFFOLD is always at least as fast as SGD, it can be much faster depending on the Hessian dissimilarity in our data. Thus, different algorithms can take advantage of (and are limited by) different notions of dissimilarity. We believe that characterizing and isolating various dissimilarities present in real world data can lead to further new algorithms and significant impact on distributed, federated, and decentralized learning.
References

Agarwal, N., Suresh, A. T., Yu, F. X., Kumar, S., and McMahan, B. cpSGD: Communication-efficient and differentially-private distributed SGD. In Proceedings of NeurIPS, pp. 7575–7586, 2018.

Arjevani, Y. and Shamir, O. Communication complexity of distributed convex learning and optimization. In Advances in neural information processing systems, pp. 1756–1764, 2015.

Bassily, R., Smith, A., and Thakurta, A. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pp. 464–473. IEEE, 2014.

Basu, D., Data, D., Karakus, C., and Diggavi, S. Qsparse-local-SGD: Distributed SGD with quantization, sparsification, and local computations. arXiv preprint arXiv:1906.02367, 2019.

Bonawitz, K., Ivanov, V., Kreuter, B., Marcedone, A., McMahan, H. B., Patel, S., Ramage, D., Segal, A., and Seth, K. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 1175–1191. ACM, 2017.

Brisimi, T. S., Chen, R., Mela, T., Olshevsky, A., Paschalidis, I. C., and Shi, W. Federated learning of predictive models from federated electronic health records. International journal of medical informatics, 112:59–67, 2018.

Cen, S., Zhang, H., Chi, Y., Chen, W., and Liu, T.-Y. Convergence of distributed stochastic variance reduced methods without sampling extra data. arXiv preprint arXiv:1905.12648, 2019.

Chaudhuri, K., Monteleoni, C., and Sarwate, A. D. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(Mar):1069–1109, 2011.

Chen, M., Mathews, R., Ouyang, T., and Beaufays, F. Federated learning of out-of-vocabulary words. arXiv preprint arXiv:1903.10635, 2019a.

Chen, M., Suresh, A. T., Mathews, R., Wong, A., Beaufays, F., Allauzen, C., and Riley, M. Federated learning of N-gram language models. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), 2019b.

Cohen, G., Afshar, S., Tapson, J., and Van Schaik, A. EMNIST: Extending MNIST to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2921–2926. IEEE, 2017.

Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Ranzato, M., Senior, A., Tucker, P., Yang, K., Le, Q. V., and Ng, A. Y. Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231, 2012.

Defazio, A. and Bottou, L. On the ineffectiveness of variance reduced optimization for deep learning. In Advances in Neural Information Processing Systems, pp. 1753–1763, 2019.

Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in neural information processing systems, pp. 1646–1654, 2014.

Fang, C., Li, C. J., Lin, Z., and Zhang, T. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pp. 689–699, 2018.

Glasserman, P. Monte Carlo methods in financial engineering, volume 53. Springer Science & Business Media, 2013.

Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

Haddadpour, F. and Mahdavi, M. On the convergence of local descent methods in federated learning. arXiv preprint arXiv:1910.14425, 2019.

Hanzely, F. and Richtárik, P. One method to rule them all: Variance reduction for data, parameters and many new methods. arXiv preprint arXiv:1905.11266, 2019.

Hard, A., Rao, K., Mathews, R., Beaufays, F., Augenstein, S., Eichner, H., Kiddon, C., and Ramage, D. Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604, 2018.

Hsu, T.-M. H., Qi, H., and Brown, M. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.

Iandola, F. N., Moskewicz, M. W., Ashraf, K., and Keutzer, K. FireCaffe: near-linear acceleration of deep neural network training on compute clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2592–2600, 2016.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pp. 315–323, 2013.

Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.

Karimireddy, S. P., Rebjock, Q., Stich, S. U., and Jaggi, M. Error feedback fixes SignSGD and other gradient compression schemes. arXiv preprint arXiv:1901.09847, 2019.

Khaled, A., Mishchenko, K., and Richtárik, P. Tighter theory for local SGD on identical and heterogeneous data. In Proceedings of AISTATS, 2020.

Kifer, D., Smith, A., and Thakurta, A. Private convex empirical risk minimization and high-dimensional regression. In Conference on Learning Theory, pp. 25–1, 2012.

Konečnỳ, J., McMahan, H. B., Ramage, D., and Richtárik, P. Federated optimization: Distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527, 2016a.

Konečnỳ, J., McMahan, H. B., Yu, F. X., Richtárik, P., Suresh, A. T., and Bacon, D. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016b.

Kulunchakov, A. and Mairal, J. Estimate sequences for stochastic composite optimization: Variance reduction, acceleration, and robustness to noise. arXiv preprint arXiv:1901.08788, 2019.

Lee, J. D., Lin, Q., Ma, T., and Yang, T. Distributed stochastic variance reduced gradient methods and a lower bound for communication complexity. arXiv preprint arXiv:1507.07595, 2015.

Lei, L. and Jordan, M. Less than a single pass: Stochastically controlled stochastic gradient. In AISTATS, pp. 148–156, 2017.

Li, T., Sahu, A. K., Sanjabi, M., Zaheer, M., Talwalkar, A., and Smith, V. On the convergence of federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127, 2018.

Li, T., Sanjabi, M., and Smith, V. Fair resource allocation in federated learning. arXiv preprint arXiv:1905.10497, 2019a.

Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A., and Smith, V. FedDANE: A federated Newton-type method. arXiv preprint arXiv:2001.01920, 2020.

Li, X., Huang, K., Yang, W., Wang, S., and Zhang, Z. On the convergence of FedAvg on non-iid data. arXiv preprint arXiv:1907.02189, 2019b.

Liang, X., Shen, S., Liu, J., Pan, Z., Chen, E., and Cheng, Y. Variance reduced local SGD with lower communication complexity. arXiv preprint arXiv:1912.12844, 2019.

McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of AISTATS, pp. 1273–1282, 2017.

Mishchenko, K., Gorbunov, E., Takáč, M., and Richtárik, P. Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269, 2019.

Mohri, M., Sivek, G., and Suresh, A. T. Agnostic federated learning. arXiv preprint arXiv:1902.00146, 2019.

Nedich, A., Olshevsky, A., and Shi, W. A geometrically convergent method for distributed optimization over time-varying graphs. In 2016 IEEE 55th Conference on Decision and Control (CDC), pp. 1023–1029. IEEE, 2016.

Nesterov, Y. Lectures on convex optimization, volume 137. Springer, 2018.

Nguyen, L. M., Liu, J., Scheinberg, K., and Takáč, M. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2613–2621. JMLR.org, 2017.

Nguyen, L. M., Scheinberg, K., and Takáč, M. Inexact SARAH algorithm for stochastic optimization. arXiv preprint arXiv:1811.10105, 2018.

Patel, K. K. and Dieuleveut, A. Communication trade-offs for synchronized distributed SGD with large step size. arXiv preprint arXiv:1904.11325, 2019.

Ramaswamy, S., Mathews, R., Rao, K., and Beaufays, F. Federated learning for emoji prediction in a mobile keyboard. arXiv preprint arXiv:1906.04329, 2019.

Reddi, S. J., Hefny, A., Sra, S., Poczos, B., and Smola, A. Stochastic variance reduction for nonconvex optimization. In International conference on machine learning, pp. 314–323, 2016a.

Reddi, S. J., Konečnỳ, J., Richtárik, P., Póczós, B., and Smola, A. AIDE: Fast and communication efficient distributed optimization. arXiv preprint arXiv:1608.06879, 2016b.

Reddi, S. J., Sra, S., Póczos, B., and Smola, A. Fast incremental method for smooth nonconvex optimization. In 2016 IEEE 55th Conference on Decision and Control (CDC), pp. 1971–1977. IEEE, 2016c.

Safran, I. and Shamir, O. How good is SGD with random shuffling? arXiv preprint arXiv:1908.00045, 2019.

Samarakoon, S., Bennis, M., Saad, W., and Debbah, M. Federated learning for ultra-reliable low-latency V2V communications. In 2018 IEEE Global Communications Conference (GLOBECOM), pp. 1–7. IEEE, 2018.

Schmidt, M., Le Roux, N., and Bach, F. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.

Shamir, O., Srebro, N., and Zhang, T. Communication-efficient distributed optimization using an approximate Newton-type method. In International conference on machine learning, pp. 1000–1008, 2014.

Shi, W., Ling, Q., Wu, G., and Yin, W. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.

Smith, V., Forte, S., Ma, C., Takáč, M., Jordan, M., and Jaggi, M. CoCoA: A general framework for communication-efficient distributed optimization. Journal of Machine Learning Research, 18(230):1–49, 2018.

Stich, S. U. Local SGD converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018.

Stich, S. U. Unified optimal analysis of the (stochastic) gradient method. arXiv preprint arXiv:1907.04232, 2019.

Stich, S. U. and Karimireddy, S. P. The error-feedback framework: Better rates for SGD with delayed gradients and compressed communication. arXiv preprint arXiv:1909.05350, 2019.

Stich, S. U., Cordonnier, J.-B., and Jaggi, M. Sparsified SGD with memory. In Advances in Neural Information Processing Systems, pp. 4447–4458, 2018.

Suresh, A. T., Yu, F. X., Kumar, S., and McMahan, H. B. Distributed mean estimation with limited communication. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3329–3337. JMLR.org, 2017.

Wang, S., Tuor, T., Salonidis, T., Leung, K. K., Makaya, C., He, T., and Chan, K. Adaptive federated learning in resource constrained edge computing systems. IEEE Journal on Selected Areas in Communications, 37(6):1205–1221, 2019.

Woodworth, B. E., Wang, J., Smith, A., McMahan, H. B., and Srebro, N. Graph oracle models, lower bounds, and gaps for parallel stochastic optimization. In Advances in neural information processing systems, pp. 8496–8506, 2018.

Yang, T., Andrew, G., Eichner, H., Sun, H., Li, W., Kong, N., Ramage, D., and Beaufays, F. Applied federated learning: Improving Google keyboard query suggestions. arXiv preprint arXiv:1812.02903, 2018.

Yu, H., Yang, S., and Zhu, S. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 5693–5700, 2019.

Yuan, X.-T. and Li, P. On convergence of distributed approximate Newton methods: Globalization, sharper bounds and beyond. arXiv preprint arXiv:1908.02246, 2019.

Zhang, L., Mahdavi, M., and Jin, R. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, pp. 980–988, 2013a.

Zhang, S., Choromanska, A. E., and LeCun, Y. Deep learning with elastic averaging SGD. In Advances in neural information processing systems, pp. 685–693, 2015.

Zhang, Y., Duchi, J. C., and Wainwright, M. J. Communication-efficient algorithms for statistical optimization. The Journal of Machine Learning Research, 14(1):3321–3363, 2013b.

Zhao, Y., Li, M., Lai, L., Suda, N., Civin, D., and Chandra, V. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582, 2018.

Zinkevich, M., Weimer, M., Li, L., and Smola, A. J. Parallelized stochastic gradient descent. In Advances in neural information processing systems, pp. 2595–2603, 2010.
SCAFFOLD: Stochastic Controlled Averaging for Federated Learning
Appendix
A. Related work and significance
Federated learning. As stated earlier, federated learning involves learning a centralized model from distributed client data. This centralized model benefits from all client data and can often result in improved performance, e.g., in next-word prediction (Hard et al., 2018; Yang et al., 2018), emoji prediction (Ramaswamy et al., 2019), decoder models (Chen et al., 2019b), vocabulary estimation (Chen et al., 2019a), low-latency vehicle-to-vehicle communication (Samarakoon et al., 2018), and predictive models in health (Brisimi et al., 2018). Nevertheless, federated learning raises several types of issues and has been the topic of multiple research efforts studying generalization and fairness (Mohri et al., 2019; Li et al., 2019a), the design of more efficient communication strategies (Konečnỳ et al., 2016b;a; Suresh et al., 2017; Stich et al., 2018; Karimireddy et al., 2019; Basu et al., 2019), the study of lower bounds (Woodworth et al., 2018), differential privacy guarantees (Agarwal et al., 2018), security (Bonawitz et al., 2017), etc. We refer to Kairouz et al. (2019) for an in-depth survey of this area.
Convergence of FedAvg. For identical clients, FedAvg coincides with parallel SGD, analyzed by Zinkevich et al. (2010), who proved asymptotic convergence. Stich (2018) and, more recently, Stich & Karimireddy (2019); Patel & Dieuleveut (2019); Khaled et al. (2020) gave a sharper analysis of the same method, under the name of local SGD, also for identical functions. However, there still remains a gap between their upper bounds and the lower bound of Woodworth et al. (2018). The analysis of FedAvg for heterogeneous clients is more delicate due to the aforementioned client-drift, first empirically observed by Zhao et al. (2018). Several analyses bound this drift by assuming bounded gradients (Wang et al., 2019; Yu et al., 2019), or view it as additional noise (Khaled et al., 2020), or assume that the client optima are ε-close (Li et al., 2018; Haddadpour & Mahdavi, 2019). In a concurrent work, Liang et al. (2019) propose to use variance reduction to deal with client heterogeneity but still show rates slower than SGD. We summarize the communication complexities of different methods for heterogeneous clients in Table 2.
Variance reduction. The use of control variates is a classical technique to reduce variance in Monte Carlo sampling
methods (cf. (Glasserman, 2013)). In optimization, they were used for finite-sum minimization by SVRG (Johnson &
Zhang, 2013; Zhang et al., 2013a) and then in SAGA (Defazio et al., 2014) to simplify the linearly convergent method
SAG (Schmidt et al., 2017). Numerous variations and extensions of the technique are studied in (Hanzely & Richtárik,
2019). Starting from (Reddi et al., 2016a), control variates have also frequently been used to reduce variance in finite-sum
non-convex settings (Reddi et al., 2016c; Nguyen et al., 2018; Fang et al., 2018; Tran-Dinh et al., 2019). Further, they
are used to obtain linearly converging decentralized algorithms under the guise of ‘gradient-tracking’ in (Shi et al., 2015;
Nedich et al., 2016) and for gradient compression as ‘compressed-differences’ in (Mishchenko et al., 2019). Our method
can be viewed as seeking to remove the ‘client-variance’ in the gradients across the clients, though there still remains
additional stochasticity as in (Kulunchakov & Mairal, 2019), which is important in deep learning (Defazio & Bottou,
2019).
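The control-variate idea these methods build on can be illustrated with a tiny Monte Carlo example (our toy illustration, not from the paper): subtracting a correlated quantity whose mean is known leaves the estimate unbiased while shrinking its variance.

```python
import numpy as np

# Estimate E[exp(U)] = e - 1 for U ~ Uniform(0, 1), using U itself as the
# control variate: E[U] = 0.5 is known and U is strongly correlated with exp(U).
rng = np.random.default_rng(0)
u = rng.uniform(0.0, 1.0, size=100_000)

plain = np.exp(u)                              # naive Monte Carlo samples
coef = np.cov(plain, u)[0, 1] / np.var(u)      # near-optimal coefficient
controlled = plain - coef * (u - 0.5)          # control-variate samples

assert abs(plain.mean() - (np.e - 1)) < 1e-2       # both estimators unbiased
assert abs(controlled.mean() - (np.e - 1)) < 1e-2
assert controlled.var() < 0.05 * plain.var()       # variance sharply reduced
```

SVRG and SAGA apply the same principle in optimization, with gradients at a reference point playing the role of the control variate.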
Distributed optimization. The problem of client-drift we described is a common phenomenon in distributed optimization. In fact, classic techniques such as ADMM mitigate this drift, though they are not applicable in federated learning. For well-structured convex problems, CoCoA uses the dual variables as the control variates, enabling flexible distributed methods (Smith et al., 2018). DANE (Shamir et al., 2014) is a closely related primal-only algorithm, which was later accelerated by Reddi et al. (2016b) and recently extended to federated learning (Li et al., 2020). SCAFFOLD can be viewed as an improved version of DANE where a fixed number of (stochastic) gradient steps are used in place of a proximal point update. In a similar spirit, distributed variance reduction techniques have been proposed for the finite-sum case (Lee et al., 2015; Konečnỳ et al., 2016a; Cen et al., 2019). However, these methods are restricted to finite-sums and are not applicable to the stochastic setting studied here.
B. Technicalities
We examine some additional definitions and introduce some technical lemmas.
(A4) $g_i(x) := \nabla f_i(x; \zeta_i)$ is an unbiased stochastic gradient of $f_i$ with bounded variance:

$$\mathbb{E}_{\zeta_i}\big[\|g_i(x) - \nabla f_i(x)\|^2\big] \le \sigma^2\,, \quad \text{for any } i, x\,.$$
Note that (A4) only bounds the variance within the same client, but not the variance across the clients.
The assumption (A5) also implies the following quadratic upper bound on $f_i$:

$$f_i(y) \le f_i(x) + \langle \nabla f_i(x),\, y - x\rangle + \frac{\beta}{2}\|y - x\|^2\,. \quad (8)$$

If additionally the functions $\{f_i\}$ are convex and $x^\star$ is an optimum of $f$, (A5) implies (via Nesterov (2018), Theorem 2.1.5)

$$\frac{1}{2\beta N}\sum_{i=1}^{N}\|\nabla f_i(x) - \nabla f_i(x^\star)\|^2 \le f(x) - f^\star\,. \quad (9)$$
Lemma 1 (linear convergence rate). For every non-negative sequence $\{d_{r-1}\}_{r\ge 1}$ and any parameters $\mu > 0$, $\eta_{\max} \in (0, 1/\mu]$, $c \ge 0$, $R \ge \frac{1}{2\mu\eta_{\max}}$, there exists a constant step-size $\eta \le \eta_{\max}$ and weights $w_r := (1-\mu\eta)^{1-r}$, with $W_R := \sum_{r=1}^{R+1} w_r$, such that

$$\Psi_R := \frac{1}{W_R}\sum_{r=1}^{R+1}\Big(\frac{w_r}{\eta}(1-\mu\eta)\,d_{r-1} - \frac{w_r}{\eta}\,d_r + c\,\eta\,w_r\Big) = \tilde{\mathcal{O}}\Big(\mu d_0 \exp(-\mu\eta_{\max} R) + \frac{c}{\mu R}\Big)\,.$$
Proof. By substituting the value of $w_r$, we observe that we end up with a telescoping sum and estimate

$$\Psi_R = \frac{1}{\eta W_R}\sum_{r=1}^{R+1}\big(w_{r-1}d_{r-1} - w_r d_r\big) + \frac{c\eta}{W_R}\sum_{r=1}^{R+1} w_r \le \frac{d_0}{\eta W_R} + c\eta\,.$$

When $R \ge \frac{1}{2\mu\eta}$, we have $(1-\mu\eta)^R \le \exp(-\mu\eta R) \le \frac{2}{3}$. For such an $R$, we can lower bound $\eta W_R$ using

$$\eta W_R = \eta(1-\mu\eta)^{-R}\sum_{r=0}^{R}(1-\mu\eta)^{r} = \eta(1-\mu\eta)^{-R}\,\frac{1-(1-\mu\eta)^{R}}{\mu\eta} \ge \frac{1}{3\mu}(1-\mu\eta)^{-R}\,.$$

This proves that for all $R \ge \frac{1}{2\mu\eta}$,

$$\Psi_R \le 3\mu d_0(1-\mu\eta)^{R} + c\eta \le 3\mu d_0\exp(-\mu\eta R) + c\eta\,.$$

The lemma now follows by carefully tuning $\eta$, distinguishing two cases depending on the magnitudes of $R$ and $\eta_{\max}$.
The next lemma is an extension of (Stich & Karimireddy, 2019, Lemma 13) and (Kulunchakov & Mairal, 2019, Lemma 13), and is useful to derive convergence rates for general convex functions ($\mu = 0$) and non-convex functions.

Lemma 2 (sub-linear convergence rate). For every non-negative sequence $\{d_{r-1}\}_{r\ge 1}$ and any parameters $\eta_{\max} \ge 0$, $c_1, c_2 \ge 0$, $R \ge 0$, there exists a constant step-size $\eta \le \eta_{\max}$ and weights $w_r = 1$ such that

$$\Psi_R := \frac{1}{R+1}\sum_{r=1}^{R+1}\Big(\frac{d_{r-1}}{\eta} - \frac{d_r}{\eta} + c_1\eta + c_2\eta^2\Big) \le \frac{d_0}{\eta_{\max}(R+1)} + \frac{2\sqrt{c_1 d_0}}{\sqrt{R+1}} + 2\Big(\frac{d_0}{R+1}\Big)^{2/3}c_2^{1/3}\,.$$
Similar to the strongly convex case (Lemma 1), we distinguish the following cases:

• When $R + 1 \le \frac{d_0}{c_1\eta_{\max}^2}$ and $R + 1 \le \frac{d_0}{c_2\eta_{\max}^3}$, we pick $\eta = \eta_{\max}$ to claim

$$\Psi_R \le \frac{d_0}{\eta_{\max}(R+1)} + c_1\eta_{\max} + c_2\eta_{\max}^2 \le \frac{d_0}{\eta_{\max}(R+1)} + \frac{\sqrt{c_1 d_0}}{\sqrt{R+1}} + \Big(\frac{d_0}{R+1}\Big)^{2/3}c_2^{1/3}\,.$$

• In the other case, we have $\eta_{\max}^2 \ge \frac{d_0}{c_1(R+1)}$ or $\eta_{\max}^3 \ge \frac{d_0}{c_2(R+1)}$. We choose $\eta = \min\Big\{\sqrt{\frac{d_0}{c_1(R+1)}},\, \sqrt[3]{\frac{d_0}{c_2(R+1)}}\Big\}$ to prove

$$\Psi_R \le \frac{d_0}{\eta(R+1)} + c_1\eta + c_2\eta^2 \le \frac{2\sqrt{c_1 d_0}}{\sqrt{R+1}} + 2\sqrt[3]{\frac{c_2 d_0^2}{(R+1)^2}}\,.$$
Next, we state a relaxed triangle inequality true for the squared `2 norm.
Lemma 3 (relaxed triangle inequality). Let $\{v_1, \ldots, v_\tau\}$ be $\tau$ vectors in $\mathbb{R}^d$. Then the following are true:

1. $\|v_i + v_j\|^2 \le (1+a)\|v_i\|^2 + \big(1+\frac{1}{a}\big)\|v_j\|^2$ for any $a > 0$, and
2. $\big\|\sum_{i=1}^{\tau} v_i\big\|^2 \le \tau\sum_{i=1}^{\tau}\|v_i\|^2$.
Proof. The proof of the first statement for any $a > 0$ follows from the identity:

$$\|v_i + v_j\|^2 = (1+a)\|v_i\|^2 + \big(1+\tfrac{1}{a}\big)\|v_j\|^2 - \big\|\sqrt{a}\,v_i + \tfrac{1}{\sqrt{a}}v_j\big\|^2\,.$$

For the second inequality, we use the convexity of $x \to \|x\|^2$ and Jensen's inequality:

$$\Big\|\frac{1}{\tau}\sum_{i=1}^{\tau}v_i\Big\|^2 \le \frac{1}{\tau}\sum_{i=1}^{\tau}\|v_i\|^2\,.$$
Lemma 4 (separating mean and variance). Let $\{\Xi_1, \ldots, \Xi_\tau\}$ be $\tau$ random variables in $\mathbb{R}^d$ which are not necessarily independent. First suppose that their mean is $\mathbb{E}[\Xi_i] = \xi_i$ and variance is bounded as $\mathbb{E}[\|\Xi_i - \xi_i\|^2] \le \sigma^2$. Then, the following holds:

$$\mathbb{E}\Big[\Big\|\sum_{i=1}^{\tau}\Xi_i\Big\|^2\Big] \le \Big\|\sum_{i=1}^{\tau}\xi_i\Big\|^2 + \tau^2\sigma^2\,.$$

If further the variables $\{\Xi_i - \xi_i\}$ form a martingale difference sequence, we can improve the bound to

$$\mathbb{E}\Big[\Big\|\sum_{i=1}^{\tau}\Xi_i\Big\|^2\Big] \le \Big\|\sum_{i=1}^{\tau}\xi_i\Big\|^2 + \tau\sigma^2\,.$$
Proof. For any random variable $X$, $\mathbb{E}[\|X\|^2] = \mathbb{E}[\|X - \mathbb{E}[X]\|^2] + \|\mathbb{E}[X]\|^2$, implying

$$\mathbb{E}\Big[\Big\|\sum_{i=1}^{\tau}\Xi_i\Big\|^2\Big] = \Big\|\sum_{i=1}^{\tau}\xi_i\Big\|^2 + \mathbb{E}\Big[\Big\|\sum_{i=1}^{\tau}(\Xi_i - \xi_i)\Big\|^2\Big]\,.$$

Expanding the second term using the relaxed triangle inequality (Lemma 3) proves the first claim:

$$\mathbb{E}\Big[\Big\|\sum_{i=1}^{\tau}(\Xi_i - \xi_i)\Big\|^2\Big] \le \tau\sum_{i=1}^{\tau}\mathbb{E}\big[\|\Xi_i - \xi_i\|^2\big] \le \tau^2\sigma^2\,.$$

For the second claim, the cross terms in the expansion of this squared norm have zero mean since $\{\Xi_i - \xi_i\}$ form a martingale difference sequence, so only the $\tau$ diagonal terms remain, each bounded by $\sigma^2$.
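As a numerical illustration (our toy setup): with independent zero-mean noise—which is in particular a martingale difference sequence—the expected squared norm of the sum sits at the tighter $\tau\sigma^2$ bound, well below the general $\tau^2\sigma^2$ one.

```python
import numpy as np

rng = np.random.default_rng(2)
tau, d, sigma, trials = 10, 3, 0.5, 100_000
xi = rng.standard_normal((tau, d))                  # fixed means xi_i

# Xi_i = xi_i + independent noise with E||Xi_i - xi_i||^2 = sigma^2.
noise = rng.standard_normal((trials, tau, d)) * (sigma / np.sqrt(d))
est = np.mean(np.sum((xi + noise).sum(axis=1) ** 2, axis=1))
base = np.sum(xi.sum(axis=0) ** 2)                  # ||sum_i xi_i||^2

assert est <= base + tau**2 * sigma**2              # general bound (very loose here)
assert est <= base + tau * sigma**2 + 0.3           # martingale bound, up to sampling error
```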
Lemma 5 (perturbed strong convexity). The following holds for any $\beta$-smooth and $\mu$-strongly convex function $h$, and any $x$, $y$, $z$ in the domain of $h$:

$$\langle\nabla h(x),\, z - y\rangle \ge h(z) - h(y) + \frac{\mu}{4}\|y - z\|^2 - \beta\|z - x\|^2\,.$$

Proof. Given any $x$, $y$, and $z$, we get the following two inequalities using smoothness and strong convexity of $h$:

$$\langle\nabla h(x),\, z - x\rangle \ge h(z) - h(x) - \frac{\beta}{2}\|z - x\|^2\,,$$
$$\langle\nabla h(x),\, x - y\rangle \ge h(x) - h(y) + \frac{\mu}{2}\|y - x\|^2\,.$$

Further, applying the relaxed triangle inequality gives

$$\frac{\mu}{2}\|y - x\|^2 \ge \frac{\mu}{4}\|y - z\|^2 - \frac{\mu}{2}\|x - z\|^2\,.$$

Combining all the inequalities together, we have

$$\langle\nabla h(x),\, z - y\rangle \ge h(z) - h(y) + \frac{\mu}{4}\|y - z\|^2 - \frac{\beta + \mu}{2}\|z - x\|^2\,.$$

The lemma follows since $\beta \ge \mu$.
Lemma 6 (contractive mapping). For any $\beta$-smooth and $\mu$-strongly convex function $h$, any $x$, $y$ in its domain, and any step-size $\eta \le \frac{1}{\beta}$, the following is true:

$$\|x - \eta\nabla h(x) - y + \eta\nabla h(y)\|^2 \le (1 - \mu\eta)\|x - y\|^2\,.$$

Proof.

$$\|x - \eta\nabla h(x) - y + \eta\nabla h(y)\|^2 = \|x - y\|^2 + \eta^2\|\nabla h(x) - \nabla h(y)\|^2 - 2\eta\langle\nabla h(x) - \nabla h(y),\, x - y\rangle \overset{(A5)}{\le} \|x - y\|^2 + (\eta^2\beta - 2\eta)\langle\nabla h(x) - \nabla h(y),\, x - y\rangle\,.$$

Recall our bound on the step-size $\eta \le \frac{1}{\beta}$, which implies that $(\eta^2\beta - 2\eta) \le -\eta$. Finally, apply the $\mu$-strong convexity of $h$ to get

$$-\eta\langle\nabla h(x) - \nabla h(y),\, x - y\rangle \le -\eta\mu\|x - y\|^2\,.$$
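For a concrete $\beta$-smooth and $\mu$-strongly convex quadratic (our example, not from the paper), the contraction can be checked directly:

```python
import numpy as np

rng = np.random.default_rng(3)
# h(x) = 0.5 * x^T A x with eigenvalues in [mu, beta] = [1, 10].
A = np.diag(np.linspace(1.0, 10.0, 5))
mu, beta = 1.0, 10.0
eta = 1.0 / beta                      # step-size satisfying eta <= 1/beta

x = rng.standard_normal(5)
y = rng.standard_normal(5)
lhs = np.sum((x - eta * A @ x - y + eta * A @ y) ** 2)
rhs = (1 - mu * eta) * np.sum((x - y) ** 2)
assert lhs <= rhs + 1e-9              # one gradient step is a (1 - mu*eta)-contraction
```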
D. Convergence of FedAvg

We outline the FedAvg method in Algorithm 2. In round $r$ we sample $S^r \subseteq [N]$ clients with $|S^r| = S$ and then perform the following updates:

• Starting from the shared global parameters $y^r_{i,0} = x^{r-1}$, we update the local parameters for $k \in [K]$:

$$y^r_{i,k} = y^r_{i,k-1} - \eta_l\, g_i(y^r_{i,k-1})\,. \quad (10)$$

• Compute the new global parameters using only updates from the clients $i \in S^r$ and a global step-size $\eta_g$:

$$x^r = x^{r-1} + \frac{\eta_g}{S}\sum_{i\in S^r}\big(y^r_{i,K} - x^{r-1}\big)\,. \quad (11)$$
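A minimal numerical sketch of one FedAvg round (10)–(11) on quadratic clients (full-batch gradients; the names and constants here are illustrative, not the reference implementation of Algorithm 2):

```python
import numpy as np

def fedavg_round(x, clients, eta_l, eta_g, K, S, rng):
    """Sample S clients, run K local steps (10) each, then aggregate (11)."""
    sampled = rng.choice(len(clients), size=S, replace=False)
    deltas = []
    for i in sampled:
        A, b = clients[i]                       # client i minimizes 0.5 y'Ay - b'y
        y = x.copy()
        for _ in range(K):
            y -= eta_l * (A @ y - b)            # local update (10)
        deltas.append(y - x)
    return x + eta_g * np.mean(deltas, axis=0)  # server update (11)

rng = np.random.default_rng(4)
d, N = 3, 10
clients = [(np.eye(d) * (1 + i % 3), rng.standard_normal(d)) for i in range(N)]
x = np.zeros(d)
for _ in range(200):
    x = fedavg_round(x, clients, eta_l=0.05, eta_g=1.0, K=5, S=5, rng=rng)
assert np.isfinite(x).all()
```

With heterogeneous clients such iterates settle near a drift-biased point rather than at the true optimum of $f$; this is the phenomenon the analysis in this section quantifies.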
(A1) ((G, B)-BGD, or bounded gradient dissimilarity): there exist constants $G \ge 0$ and $B \ge 1$ such that

$$\frac{1}{N}\sum_{i=1}^{N}\|\nabla f_i(x)\|^2 \le G^2 + B^2\|\nabla f(x)\|^2\,. \quad (13)$$

If the functions $\{f_i\}$ are convex, we can relax the assumption to

$$\frac{1}{N}\sum_{i=1}^{N}\|\nabla f_i(x)\|^2 \le G^2 + 2\beta B^2\big(f(x) - f^\star\big)\,. \quad (14)$$

We defined two variants of the bounds on the heterogeneity depending on whether the functions are convex or not. Suppose that the function $f$ is indeed convex as in (A3) and $\beta$-smooth as in (A5); then it is straightforward to see that (13) implies (14). Thus, for convex functions, (14) is mildly weaker. Suppose that the functions $\{f_1, \ldots, f_N\}$ are convex and $\beta$-smooth.
$$\frac{1}{N}\sum_{i=1}^{N}\|\nabla f_i(x)\|^2 \le \frac{2}{N}\sum_{i=1}^{N}\|\nabla f_i(x^\star)\|^2 + \frac{2}{N}\sum_{i=1}^{N}\|\nabla f_i(x) - \nabla f_i(x^\star)\|^2 \overset{(9)}{\le} \underbrace{\frac{2}{N}\sum_{i=1}^{N}\|\nabla f_i(x^\star)\|^2}_{=:\,\sigma_f^2} + 4\beta\big(f(x) - f^\star\big)\,.$$

Thus, (G, B)-BGD (14) is equivalent to the heterogeneity assumption of (Mishchenko et al., 2019) with $G^2 = \sigma_f^2$. Instead, if we have the stronger assumption (A1) but the functions are possibly non-convex, then $G = 0$ corresponds to the local dissimilarity defined in (Li et al., 2018).
In each of the following cases, the output $\bar x^R$ of FedAvg satisfies:

• Strongly convex: $f_i$ satisfies (A3) for $\mu > 0$, $\eta_l \le \frac{1}{8(1+B^2)\beta K\eta_g}$, $R \ge \frac{8(1+B^2)\beta}{\mu}$; then

$$\mathbb{E}[f(\bar x^R)] - f(x^\star) \le \tilde{\mathcal{O}}\Big(\frac{M^2}{\mu RKS} + \frac{\beta G^2}{\mu^2 R^2} + \mu D^2\exp\Big(-\frac{\mu}{16(1+B^2)\beta}R\Big)\Big)\,,$$

• General convex: $f_i$ satisfies (A3) for $\mu = 0$, $\eta_l \le \frac{1}{8(1+B^2)\beta K\eta_g}$, $R \ge 1$; then

$$\mathbb{E}[f(\bar x^R)] - f(x^\star) \le \mathcal{O}\Big(\frac{MD}{\sqrt{RKS}} + \frac{D^{4/3}(\beta G^2)^{1/3}}{(R+1)^{2/3}} + \frac{B^2\beta D^2}{R}\Big)\,,$$

• Non-convex: $f_i$ satisfies (A1) and $\eta_l \le \frac{1}{8(1+B^2)\beta K\eta_g}$; then

$$\mathbb{E}\big[\|\nabla f(\bar x^R)\|^2\big] \le \mathcal{O}\Big(\frac{M\sqrt{F}}{\sqrt{RKS}} + \frac{F^{2/3}(\beta G^2)^{1/3}}{(R+1)^{2/3}} + \frac{B^2\beta F}{R}\Big)\,,$$

where $M^2 := \sigma^2\big(1 + \frac{S}{\eta_g^2}\big) + KS\big(1 - \frac{S}{N}\big)G^2$, $D := \|x_0 - x^\star\|$, and $F := f(x_0) - f^\star$.
Lemma 7 (one round progress). Suppose our functions satisfy assumptions (A1) and (A3)–(A5). For any step-sizes satisfying $\eta_l \le \frac{1}{8(1+B^2)\beta K\eta_g}$ and effective step-size $\tilde\eta := K\eta_g\eta_l$, the updates of FedAvg satisfy

$$\mathbb{E}\|x^r - x^\star\|^2 \le \Big(1 - \frac{\mu\tilde\eta}{2}\Big)\mathbb{E}\|x^{r-1} - x^\star\|^2 + \frac{1}{KS}\tilde\eta^2\sigma^2 + \Big(1 - \frac{S}{N}\Big)4\tilde\eta^2 G^2 - \tilde\eta\big(\mathbb{E}[f(x^{r-1})] - f(x^\star)\big) + 3\beta\tilde\eta\,\mathcal{E}_r\,,$$

where $\mathcal{E}_r$ is the drift caused by the local updates on the clients, defined to be

$$\mathcal{E}_r := \frac{1}{KN}\sum_{k=1}^{K}\sum_{i=1}^{N}\mathbb{E}_r\big[\|y^r_{i,k} - x^{r-1}\|^2\big]\,.$$
Proof. We start with the observation that the updates (10) and (11) imply that the server update in round $r$ can be written as below (dropping the superscripts everywhere):

$$\Delta x = -\frac{\tilde\eta}{KS}\sum_{k,\,i\in S} g_i(y_{i,k-1}) \quad\text{and}\quad \mathbb{E}[\Delta x] = -\frac{\tilde\eta}{KN}\sum_{k,\,i}\mathbb{E}\big[\nabla f_i(y_{i,k-1})\big]\,.$$

We adopt the convention that summations are always over $k \in [K]$ or $i \in [N]$ unless otherwise stated. Expanding using the above observation, we proceed as

$$\mathbb{E}_{r-1}\|x + \Delta x - x^\star\|^2 = \|x - x^\star\|^2 - \frac{2\tilde\eta}{KN}\sum_{k,i}\langle\nabla f_i(y_{i,k-1}),\, x - x^\star\rangle + \tilde\eta^2\,\mathbb{E}_{r-1}\Big\|\frac{1}{KS}\sum_{k,\,i\in S} g_i(y_{i,k-1})\Big\|^2$$
$$\overset{\text{Lem. 4}}{\le} \|x - x^\star\|^2 \underbrace{-\frac{2\tilde\eta}{KN}\sum_{k,i}\langle\nabla f_i(y_{i,k-1}),\, x - x^\star\rangle}_{=:\,\mathcal{A}_1} + \underbrace{\tilde\eta^2\,\mathbb{E}_{r-1}\Big\|\frac{1}{KS}\sum_{k,\,i\in S}\nabla f_i(y_{i,k-1})\Big\|^2}_{=:\,\mathcal{A}_2} + \frac{\tilde\eta^2\sigma^2}{KS}\,.$$
We can directly apply Lemma 5 with $h = f_i$, $x = y_{i,k-1}$, $y = x^\star$, and $z = x$ to the first term $\mathcal{A}_1$:

$$\mathcal{A}_1 = \frac{2\tilde\eta}{KN}\sum_{k,i}\langle\nabla f_i(y_{i,k-1}),\, x^\star - x\rangle \le \frac{2\tilde\eta}{KN}\sum_{k,i}\Big(f_i(x^\star) - f_i(x) + \beta\|y_{i,k-1} - x\|^2 - \frac{\mu}{4}\|x - x^\star\|^2\Big) = -2\tilde\eta\Big(f(x) - f(x^\star) + \frac{\mu}{4}\|x - x^\star\|^2\Big) + 2\beta\tilde\eta\,\mathcal{E}\,.$$
For the second term $\mathcal{A}_2$, we repeatedly apply the relaxed triangle inequality (Lemma 3):

$$\mathcal{A}_2 = \tilde\eta^2\,\mathbb{E}_{r-1}\Big\|\frac{1}{KS}\sum_{k,\,i\in S}\big(\nabla f_i(y_{i,k-1}) - \nabla f_i(x) + \nabla f_i(x)\big)\Big\|^2$$
$$\le 2\tilde\eta^2\,\mathbb{E}_{r-1}\Big\|\frac{1}{KS}\sum_{k,\,i\in S}\big(\nabla f_i(y_{i,k-1}) - \nabla f_i(x)\big)\Big\|^2 + 2\tilde\eta^2\,\mathbb{E}_{r-1}\Big\|\frac{1}{S}\sum_{i\in S}\nabla f_i(x)\Big\|^2$$
$$\le \frac{2\tilde\eta^2}{KN}\sum_{i,k}\mathbb{E}_{r-1}\|\nabla f_i(y_{i,k-1}) - \nabla f_i(x)\|^2 + 2\tilde\eta^2\,\mathbb{E}_{r-1}\Big\|\frac{1}{S}\sum_{i\in S}\big(\nabla f_i(x) - \nabla f(x)\big) + \nabla f(x)\Big\|^2$$
$$\le \frac{2\tilde\eta^2\beta^2}{KN}\sum_{i,k}\mathbb{E}_{r-1}\|y_{i,k-1} - x\|^2 + 4\tilde\eta^2\|\nabla f(x)\|^2 + \Big(1 - \frac{S}{N}\Big)\frac{4\tilde\eta^2}{N}\sum_i\|\nabla f_i(x)\|^2$$
$$\le 2\tilde\eta^2\beta^2\mathcal{E} + 8\tilde\eta^2\beta(B^2 + 1)\big(f(x) - f(x^\star)\big) + \Big(1 - \frac{S}{N}\Big)4\tilde\eta^2 G^2\,.$$

The last step used the (G, B)-BGD assumption (14), namely $\frac{1}{N}\sum_{i=1}^{N}\|\nabla f_i(x)\|^2 \le G^2 + 2\beta B^2(f(x) - f^\star)$. Plugging back the bounds on $\mathcal{A}_1$ and $\mathcal{A}_2$,

$$\mathbb{E}_{r-1}\|x + \Delta x - x^\star\|^2 \le \Big(1 - \frac{\mu\tilde\eta}{2}\Big)\|x - x^\star\|^2 - \big(2\tilde\eta - 8\beta\tilde\eta^2(B^2+1)\big)\big(f(x) - f(x^\star)\big) + (1 + \tilde\eta\beta)2\beta\tilde\eta\,\mathcal{E} + \frac{1}{KS}\tilde\eta^2\sigma^2 + \Big(1 - \frac{S}{N}\Big)4\tilde\eta^2 G^2\,.$$
Lemma 8 (bounded drift). Suppose our functions satisfy assumptions (A1) and (A3)–(A5). Then the updates of FedAvg for any step-size satisfying $\eta_l \le \frac{1}{8(1+B^2)\beta K\eta_g}$ have bounded drift:

$$3\beta\tilde\eta\,\mathcal{E}_r \le \frac{2\tilde\eta}{3}\big(\mathbb{E}[f(x^{r-1})] - f(x^\star)\big) + \frac{\tilde\eta^2\sigma^2}{2K\eta_g^2} + 18\beta\tilde\eta^3 G^2\,.$$
Proof. If $K = 1$, the lemma trivially holds since $y_{i,0} = x$ for all $i \in [N]$ and $\mathcal{E}_r = 0$. Assume $K \ge 2$ from here on. Recall that the local update made on client $i$ is $y_{i,k} = y_{i,k-1} - \eta_l g_i(y_{i,k-1})$. Then,

$$\mathbb{E}\|y_{i,k} - x\|^2 = \mathbb{E}\|y_{i,k-1} - x - \eta_l g_i(y_{i,k-1})\|^2 \le \mathbb{E}\|y_{i,k-1} - x - \eta_l\nabla f_i(y_{i,k-1})\|^2 + \eta_l^2\sigma^2$$
$$\le \Big(1 + \frac{1}{K-1}\Big)\mathbb{E}\|y_{i,k-1} - x\|^2 + K\eta_l^2\|\nabla f_i(y_{i,k-1})\|^2 + \eta_l^2\sigma^2 = \Big(1 + \frac{1}{K-1}\Big)\mathbb{E}\|y_{i,k-1} - x\|^2 + \frac{\tilde\eta^2}{\eta_g^2 K}\|\nabla f_i(y_{i,k-1})\|^2 + \frac{\tilde\eta^2\sigma^2}{K^2\eta_g^2}$$
$$\le \Big(1 + \frac{1}{K-1}\Big)\mathbb{E}\|y_{i,k-1} - x\|^2 + \frac{2\tilde\eta^2}{\eta_g^2 K}\|\nabla f_i(y_{i,k-1}) - \nabla f_i(x)\|^2 + \frac{2\tilde\eta^2}{\eta_g^2 K}\|\nabla f_i(x)\|^2 + \frac{\tilde\eta^2\sigma^2}{K^2\eta_g^2}$$
$$\le \Big(1 + \frac{1}{K-1} + \frac{2\tilde\eta^2\beta^2}{\eta_g^2 K}\Big)\mathbb{E}\|y_{i,k-1} - x\|^2 + \frac{2\tilde\eta^2}{\eta_g^2 K}\|\nabla f_i(x)\|^2 + \frac{\tilde\eta^2\sigma^2}{K^2\eta_g^2}$$
$$\le \Big(1 + \frac{3}{2(K-1)}\Big)\mathbb{E}\|y_{i,k-1} - x\|^2 + \frac{2\tilde\eta^2}{\eta_g^2 K}\|\nabla f_i(x)\|^2 + \frac{\tilde\eta^2\sigma^2}{K^2\eta_g^2}\,.$$

In the above we separated the mean and the variance in the first inequality, then used the relaxed triangle inequality with $a = \frac{1}{K-1}$ in the next. The equality uses the definition of $\tilde\eta$, and the remaining steps follow from the Lipschitzness of the gradient together with $\frac{2\tilde\eta^2\beta^2}{\eta_g^2 K} \le \frac{1}{2(K-1)}$, which holds under our step-size bound. Unrolling the recursion above,

$$\mathbb{E}\|y_{i,k} - x\|^2 \le \Big(\frac{2\tilde\eta^2}{\eta_g^2 K}\|\nabla f_i(x)\|^2 + \frac{\tilde\eta^2\sigma^2}{K^2\eta_g^2}\Big)\sum_{\tau=0}^{k-1}\Big(1 + \frac{3}{2(K-1)}\Big)^{\tau} \le \Big(\frac{2\tilde\eta^2}{\eta_g^2 K}\|\nabla f_i(x)\|^2 + \frac{\tilde\eta^2\sigma^2}{K^2\eta_g^2}\Big)3K\,.$$

The lemma now follows by averaging over $i$ and $k$, multiplying by $3\beta\tilde\eta$, and applying assumption (14) along with our assumption that $8(B^2 + 1)\beta\tilde\eta \le 1$.
Moving the $(f(x) - f(x^\star))$ term and dividing throughout by $\tilde\eta/3$, we get the following bound for any $\tilde\eta \le \frac{1}{8(1+B^2)\beta}$:

$$\mathbb{E}[f(x^{r-1})] - f(x^\star) \le \frac{3}{\tilde\eta}\Big(1 - \frac{\mu\tilde\eta}{2}\Big)\|x^{r-1} - x^\star\|^2 - \frac{3}{\tilde\eta}\|x^r - x^\star\|^2 + 3\tilde\eta\Big(\frac{\sigma^2}{KS}\Big(1 + \frac{S}{\eta_g^2}\Big) + 4G^2\Big(1 - \frac{S}{N}\Big)\Big) + 18\beta\tilde\eta^2 G^2\,.$$

If $\mu = 0$ (general convex), we can directly apply Lemma 2. Otherwise, by averaging using weights $w_r = (1 - \frac{\mu\tilde\eta}{2})^{1-r}$ and using the same weights to pick the output $\bar x^R$, we can simplify the above recursive bound (see the proof of Lemma 1) to prove that, for any $\tilde\eta$ satisfying $\frac{1}{\mu R} \le \tilde\eta \le \frac{1}{8(1+B^2)\beta}$,

$$\mathbb{E}[f(\bar x^R)] - f(x^\star) \le \underbrace{3\mu\|x_0 - x^\star\|^2}_{=:\,d}\exp\Big(-\frac{\tilde\eta}{2}\mu R\Big) + \tilde\eta\underbrace{\Big(\frac{2\sigma^2}{KS}\Big(1 + \frac{S}{\eta_g^2}\Big) + 8G^2\Big(1 - \frac{S}{N}\Big)\Big)}_{=:\,c_1} + \tilde\eta^2\underbrace{\big(36\beta G^2\big)}_{=:\,c_2}\,.$$

Now, the choice of $\tilde\eta = \min\Big\{\frac{\log(\max(1,\, \mu^2 R d/c_1))}{\mu R},\, \frac{1}{8(1+B^2)\beta}\Big\}$ yields the desired rate. The proof of the non-convex case is very similar and also relies on Lemma 2.
(A6) We assume that FedAvg is run with $\eta_g = 1$, $K > 1$, and arbitrary, possibly adaptive, positive step-sizes $\{\eta_1, \ldots, \eta_R\}$ with $\eta_r \le \frac{1}{\mu}$ and fixed within a round for all clients. Further, the server update is a convex combination of the client updates with non-adaptive weights.

Theorem (lower bound for FedAvg). For any positive constants $G$ and $\mu$, there exist $\mu$-strongly convex functions satisfying (A1) for which the output of FedAvg run under (A6) has the error

$$f(x^r) - f(x^\star) \ge \Omega\Big(\min\Big(f(x_0) - f(x^\star),\, \frac{G^2}{\mu R^2}\Big)\Big)\,.$$
Proof. Consider the following simple one-dimensional functions for any given $\mu$ and $G$:

$$f_1(x) := \mu x^2 + Gx \quad\text{and}\quad f_2(x) := -Gx\,,$$

with $f(x) = \frac{1}{2}(f_1(x) + f_2(x)) = \frac{\mu}{2}x^2$ and optimum at $x = 0$. Clearly $f$ is $\mu$-strongly convex, and further $f_1$ and $f_2$ satisfy (A1) with $B = 3$. Note that we chose $f_2$ to be a linear function (not strongly convex) to simplify computations. The calculations made here can be extended with slightly more work for $\tilde f_2 = \frac{\mu}{2}x^2 - Gx$ (e.g., see Theorem 1 of Safran & Shamir (2019)).
Let us start FedAvg from $x_0 > 0$. A single local update for $f_1$ and $f_2$ in round $r \ge 1$ is respectively

$$y_1 \leftarrow y_1 - \eta_r(2\mu y_1 + G) \quad\text{and}\quad y_2 \leftarrow y_2 + \eta_r G\,.$$

Then, straightforward computations show that the update at the end of round $r$ is of the following form for some averaging weight $\alpha \in [0, 1]$:

$$x^r = x^{r-1}\big((1-\alpha)(1-2\mu\eta_r)^K + \alpha\big) + \eta_r G\sum_{\tau=0}^{K-1}\big(\alpha - (1-\alpha)(1-2\mu\eta_r)^{\tau}\big)\,.$$

Since $\alpha$ was picked obliviously, we can assume that $\alpha \le 0.5$; if indeed $\alpha > 0.5$, we can swap the definitions of $f_1$ and $f_2$ and the sign of $x_0$. With this, we can simplify as

$$x^r \ge x^{r-1}\,\frac{(1-2\mu\eta_r)^K + 1}{2} + \frac{\eta_r G}{2}\sum_{\tau=0}^{K-1}\big(1 - (1-2\mu\eta_r)^{\tau}\big) \ge x^{r-1}(1-2\mu\eta_r)^K + \frac{\eta_r G}{2}\sum_{\tau=0}^{K-1}\big(1 - (1-2\mu\eta_r)^{\tau}\big)\,.$$
Observe that in the above expression, the second term on the right hand side is increasing in $\eta_r$—this represents the effect of the client drift, which increases the error as the step-size increases. The first term is decreasing in $\eta_r$—this is the usual convergence observed due to taking gradient steps. The rest of the proof shows that even with a careful balancing of the two terms, the effect of $G$ cannot be removed. Lemma 9 performs exactly such a computation to prove that for any $r \ge 1$,

$$x^r \ge c\,\min\Big(x_0,\, \frac{G}{\mu R}\Big)\,.$$

We finish the proof by noting that $f(x^r) = \frac{\mu}{2}(x^r)^2$.
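This construction is easy to simulate: with $f_1(x) = \mu x^2 + Gx$ and $f_2(x) = -Gx$ (matching the single-step gradients used above) and equal averaging weights, the FedAvg iterates stall at a point bounded away from the optimum $x = 0$ (our simulation; the constants are arbitrary):

```python
mu, G, K, eta = 1.0, 1.0, 10, 0.05

def fedavg_round(x):
    y1 = y2 = x
    for _ in range(K):
        y1 -= eta * (2 * mu * y1 + G)   # local steps on f1(x) = mu*x^2 + G*x
        y2 -= eta * (-G)                # local steps on f2(x) = -G*x
    return 0.5 * (y1 + y2)              # server averaging with alpha = 1/2

x = 1.0
for _ in range(1000):
    x = fedavg_round(x)

# f = (f1 + f2)/2 = (mu/2) x^2 has its optimum at 0, yet the iterates
# converge to a non-zero fixed point: the client drift does not vanish.
assert 0.1 < x < 0.5
```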
Lemma 9. Suppose that for all $r \ge 1$, $\eta_r \le \frac{1}{\mu}$ and the following is true:

$$x^r \ge x^{r-1}(1-2\mu\eta_r)^K + \frac{\eta_r G}{2}\sum_{\tau=0}^{K-1}\big(1 - (1-2\mu\eta_r)^{\tau}\big)\,.$$

Then, there exists a constant $c > 0$ such that for any sequence of step-sizes $\{\eta_r\}$:

$$x^r \ge c\,\min\Big(x_0,\, \frac{G}{\mu R}\Big)\,.$$

Proof. Define $\gamma_r := \mu\eta_r R(K-1)$. Such a $\gamma_r$ exists and is positive since $K \ge 2$. Then $\gamma_r$ satisfies

$$(1-2\mu\eta_r)^{\frac{K-1}{2}} = \Big(1 - \frac{2\gamma_r}{R(K-1)}\Big)^{\frac{K-1}{2}} \le \exp(-\gamma_r/R)\,.$$
We then have

$$x^r \ge x^{r-1}(1-2\mu\eta_r)^K + \frac{\eta_r G}{2}\sum_{\tau=0}^{K-1}\big(1 - (1-2\mu\eta_r)^{\tau}\big) \ge x^{r-1}(1-2\mu\eta_r)^K + \frac{\eta_r G}{2}\sum_{\tau=(K-1)/2}^{K-1}\big(1 - (1-2\mu\eta_r)^{\tau}\big) \ge x^{r-1}(1-2\mu\eta_r)^K + \frac{\gamma_r G}{4\mu}\big(1 - \exp(-\gamma_r/R)\big)\,.$$

The second inequality follows because $\eta_r \le \frac{1}{\mu}$ implies that $(1 - (1-2\mu\eta_r)^{\tau})$ is always positive. If $\gamma_r \ge R/8$, then we have a constant $c_1 \in (0, 1/32)$ which satisfies

$$x^r \ge \frac{c_1 G}{\mu}\,. \quad (15)$$
On the other hand, if $\gamma_r < R/8$, we have a tighter inequality

$$(1-2\mu\eta_r)^{\frac{K-1}{2}} = \Big(1 - \frac{2\gamma_r}{R(K-1)}\Big)^{\frac{K-1}{2}} \le 1 - \frac{\gamma_r}{R}\,,$$

implying that

$$x^r \ge x^{r-1}\Big(1 - \frac{2\gamma_r}{R(K-1)}\Big)^K + \frac{\gamma_r^2 G}{4\mu R} \ge x^{r-1}\Big(1 - \frac{4\gamma_r}{R}\Big) + \frac{\gamma_r^2 G}{4\mu R}\,. \quad (16)$$

The last step used Bernoulli's inequality and the fact that $K \le 2(K-1)$ for $K \ge 2$. Observe that in the above expression, the second term on the right hand side is increasing in $\gamma_r$—this represents the effect of the client drift, which increases the error as the step-size increases. The first term is decreasing in $\gamma_r$—this is the usual convergence observed due to taking gradient steps. The rest of the proof shows that even with a careful balancing of the two terms, the effect of $G$ cannot be removed.
Suppose that all rounds after $r_0 \ge 0$ have a small step-size, i.e. $\gamma_r \le R/8$ for all $r > r_0$, and hence satisfy (16). Then we will prove via induction that

$$x^r \ge \min\Big(c_r x^{r_0},\, \frac{G}{256\mu R}\Big)\,, \quad\text{for constants } c_r := \Big(1 - \frac{1}{2R}\Big)^{r-r_0}\,. \quad (17)$$

Indeed,

$$x^r \ge x^{r-1}\Big(1 - \frac{4\gamma_r}{R}\Big) + \frac{\gamma_r^2 G}{4\mu R} \ge \min\Big(x^{r-1}\Big(1 - \frac{1}{2R}\Big),\, \frac{G}{256\mu R}\Big) = \min\Big(c_r x^{r_0},\, \frac{G}{256\mu R}\Big)\,.$$

The first step is because of (16) and the last step uses the induction hypothesis. The second step considers two cases for $\gamma_r$: either $\gamma_r \le \frac{1}{8}$ and $(1 - \frac{4\gamma_r}{R}) \ge (1 - \frac{1}{2R})$, or $\gamma_r^2 \ge \frac{1}{64}$. Finally, note that $c_r \ge \frac{1}{2}$ using Bernoulli's inequality. We have hence proved

$$x^R \ge \min\Big(\frac{1}{2}x^{r_0},\, \frac{G}{256\mu R}\Big)\,.$$

Now suppose $\gamma_{r_0} > R/8$. Then (15) implies that $x^R \ge \frac{cG}{\mu R}$ for some constant $c > 0$. If instead no such $r_0 \ge 1$ exists, then we can set $r_0 = 0$. Finally, observe that the previous argument did not make any assumption on $R$, and in fact the inequality stated above holds for all $r \ge 1$.
E. Convergence of SCAFFOLD

We first restate the convergence theorem more formally, then prove the result for the convex case, and then for the non-convex case. Throughout the proof, we will focus on the harder option II. The proofs for SCAFFOLD with option I are nearly identical and so we skip them.

Theorem VII. Suppose that the functions $\{f_i\}$ satisfy assumptions A4 and A5. Then, in each of the following cases, there exist weights $\{w_r\}$ and local step-sizes $\eta_l$ such that for any $\eta_g \ge 1$ the output (22) of SCAFFOLD satisfies:

• Strongly convex: $f_i$ satisfies (A3) for $\mu > 0$, $\eta_l \le \min\big\{\frac{1}{81\beta K\eta_g},\, \frac{S}{15\mu NK\eta_g}\big\}$, $R \ge \max\big(\frac{162\beta}{\mu}, \frac{30N}{S}\big)$; then

$$\mathbb{E}[f(\bar x^R)] - f(x^\star) \le \tilde{\mathcal{O}}\Big(\frac{\sigma^2}{\mu RKS}\Big(1 + \frac{S}{\eta_g^2}\Big) + \mu\tilde D^2\exp\Big(-\min\Big\{\frac{S}{30N},\, \frac{\mu}{162\beta}\Big\}R\Big)\Big)\,,$$

• General convex: $f_i$ satisfies (A3) for $\mu = 0$, $\eta_l \le \frac{1}{81\beta K\eta_g}$, $R \ge 1$; then

$$\mathbb{E}[f(\bar x^R)] - f(x^\star) \le \mathcal{O}\Big(\frac{\sigma\tilde D}{\sqrt{RKS}}\sqrt{1 + \frac{S}{\eta_g^2}} + \frac{\beta\tilde D^2}{R}\Big)\,,$$

• Non-convex: $\eta_l \le \frac{1}{24K\eta_g\beta}\big(\frac{S}{N}\big)^{2/3}$, $R \ge 1$; then

$$\mathbb{E}\big[\|\nabla f(\bar x^R)\|^2\big] \le \mathcal{O}\Big(\frac{\sigma\sqrt{F}}{\sqrt{RKS}}\sqrt{1 + \frac{S}{\eta_g^2}} + \frac{\beta F}{R}\Big(\frac{N}{S}\Big)^{2/3}\Big)\,.$$

Here $\tilde D^2 := \|x_0 - x^\star\|^2 + \frac{1}{2S\beta^2}\sum_{i=1}^{N}\|c_i^0 - \nabla f_i(x^\star)\|^2$ and $F := f(x_0) - f(x^\star)$.
We will rewrite SCAFFOLD using notation which is convenient for the proofs: $\{y_i\}$ represent the client models, $x$ is the aggregate server model, and $c_i$ and $c$ are the client and server control variates. For an equivalent description which is easier to implement, we refer to Algorithm 1. The server maintains a global control variate $c$ as before and each client maintains its own control variate $c_i$. In round $r$, a subset $S^r$ of clients of size $S$ is sampled uniformly from $\{1, \ldots, N\}$. Suppose that every client performs the following updates:

• Starting from the shared global parameters $y^r_{i,0} = x^{r-1}$, we update the local parameters for $k \in [K]$:

$$y^r_{i,k} = y^r_{i,k-1} - \eta_l\, v^r_{i,k}\,, \quad\text{where}\quad v^r_{i,k} := g_i(y^r_{i,k-1}) - c_i^{r-1} + c^{r-1}\,. \quad (18)$$

• Compute the new global parameters and global control variate using only updates from the clients $i \in S^r$:

$$x^r = x^{r-1} + \frac{\eta_g}{S}\sum_{i\in S^r}\big(y^r_{i,K} - x^{r-1}\big) \quad\text{and}\quad c^r = \frac{1}{N}\sum_{i=1}^{N}c_i^r = \frac{1}{N}\Big(\sum_{i\in S^r}c_i^r + \sum_{j\notin S^r}c_j^{r-1}\Big)\,. \quad (21)$$
The output $\bar x^R$ is chosen as

$$\bar x^R = x^{r-1} \ \text{ with probability } \ \frac{w_r}{\sum_{\tau} w_\tau}\,, \ \text{ for } r \in \{1, \ldots, R+1\}\,. \quad (22)$$

Note that the clients are agnostic to the sampling and their updates are identical to when all clients are participating. Also note that the control variate choice (19) corresponds to (option II) of Algorithm 1. Further, the update of a client $i \notin S^r$ is forgotten and is defined only to make the proofs easier. While actually implementing the method, only clients $i \in S^r$ participate and the rest remain inactive (see Algorithm 1).
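The updates (18)–(21), with the option II control variates (19), can be sketched numerically on quadratic clients with full-batch gradients (a toy sketch under our own naming, not the reference implementation of Algorithm 1):

```python
import numpy as np

def scaffold_round(x, ci, clients, eta_l, eta_g, K, S, rng):
    """One SCAFFOLD round: corrected local steps (18), option II control
    update (19), and aggregation of model and control variates (21)."""
    c = ci.mean(axis=0)                        # server control variate (21)
    sampled = rng.choice(len(clients), size=S, replace=False)
    deltas = []
    for i in sampled:
        A, b = clients[i]                      # client i minimizes 0.5 y'Ay - b'y
        y, grads = x.copy(), []
        for _ in range(K):
            g = A @ y - b
            grads.append(g)
            y -= eta_l * (g - ci[i] + c)       # corrected local update (18)
        ci[i] = np.mean(grads, axis=0)         # option II control update (19)
        deltas.append(y - x)
    return x + eta_g * np.mean(deltas, axis=0), ci

rng = np.random.default_rng(5)
d, N, S = 3, 10, 5
clients = [(np.eye(d) * (1 + i % 3), rng.standard_normal(d)) for i in range(N)]
x, ci = np.zeros(d), np.zeros((N, d))
for _ in range(1000):
    x, ci = scaffold_round(x, ci, clients, eta_l=0.05, eta_g=1.0, K=5, S=S, rng=rng)

# Unlike FedAvg, the corrected updates converge to the minimizer of
# f = (1/N) sum_i f_i despite heterogeneity and partial participation.
A_bar = np.mean([A for A, _ in clients], axis=0)
b_bar = np.mean([b for _, b in clients], axis=0)
assert np.linalg.norm(x - np.linalg.solve(A_bar, b_bar)) < 1e-2
```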
Additional definitions. Before proceeding with the proof of our lemmas, we need some additional definitions of the various errors we track. As before, we define the effective step-size to be $\tilde\eta := K\eta_l\eta_g$. We define the client-drift to be how much the clients move from their starting point:

$$\mathcal{E}_r := \frac{1}{KN}\sum_{k=1}^{K}\sum_{i=1}^{N}\mathbb{E}\big[\|y^r_{i,k} - x^{r-1}\|^2\big]\,. \quad (23)$$

Because we are sampling the clients, not all the client control-variates get updated in every round. This leads to some 'lag' which we call the control-lag:

$$\mathcal{C}_r := \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\big\|\mathbb{E}[c_i^r] - \nabla f_i(x^\star)\big\|^2\,. \quad (24)$$
Variance of server update. We study how the variance of the server update can be bounded.

Lemma 10. For updates (18)–(21), we can bound the variance of the server update in any round $r$ and for any $\tilde\eta := \eta_l\eta_g K \ge 0$ as follows:

$$\mathbb{E}\big[\|x^r - x^{r-1}\|^2\big] \le 8\beta\tilde\eta^2\big(\mathbb{E}[f(x^{r-1})] - f(x^\star)\big) + 8\tilde\eta^2\mathcal{C}_{r-1} + 4\tilde\eta^2\beta^2\mathcal{E}_r + \frac{12\tilde\eta^2\sigma^2}{KS}\,.$$

Proof. The server update in round $r$ can be written as follows (dropping the superscript $r$ everywhere):

$$\mathbb{E}\|\Delta x\|^2 = \mathbb{E}\Big\|{-\frac{\tilde\eta}{KS}}\sum_{k,\,i\in S} v_{i,k}\Big\|^2 = \mathbb{E}\Big\|\frac{\tilde\eta}{KS}\sum_{k,\,i\in S}\big(g_i(y_{i,k-1}) + c - c_i\big)\Big\|^2\,.$$
$$\mathbb{E}\|\Delta x\|^2 \le 4\,\mathbb{E}\Big\|\frac{\tilde\eta}{KS}\sum_{k,\,i\in S}\big(g_i(y_{i,k-1}) - \nabla f_i(x)\big)\Big\|^2 + 4\tilde\eta^2\,\mathbb{E}\|c\|^2 + 4\,\mathbb{E}\Big\|\frac{\tilde\eta}{KS}\sum_{k,\,i\in S}\big(\nabla f_i(x^\star) - c_i\big)\Big\|^2 + 4\,\mathbb{E}\Big\|\frac{\tilde\eta}{KS}\sum_{k,\,i\in S}\big(\nabla f_i(x) - \nabla f_i(x^\star)\big)\Big\|^2$$
$$\overset{(9)}{\le} 4\,\mathbb{E}\Big\|\frac{\tilde\eta}{KS}\sum_{k,\,i\in S}\big(g_i(y_{i,k-1}) - \nabla f_i(x)\big)\Big\|^2 + 4\tilde\eta^2\,\mathbb{E}\|c\|^2 + 4\,\mathbb{E}\Big\|\frac{\tilde\eta}{S}\sum_{i\in S}\big(\nabla f_i(x^\star) - c_i\big)\Big\|^2 + 8\beta\tilde\eta^2\big(\mathbb{E}[f(x)] - f(x^\star)\big) + \frac{12\tilde\eta^2\sigma^2}{KS}\,.$$

The inequality before the last used the smoothness of $\{f_i\}$. The last inequality, which separates the mean and the variance, is an application of Lemma 4: the variance of $\frac{1}{KS}\sum_{k,\,i\in S} g_i(y_{i,k-1})$ is bounded by $\sigma^2/(KS)$. Similarly, $c_j$ as defined in (19) for any $j \in [N]$ has variance smaller than $\sigma^2/K$, and hence the variance of $\frac{1}{S}\sum_{i\in S}c_i$ is smaller than $\sigma^2/(KS)$.
$$\mathbb{E}\|\Delta x\|^2 \le \frac{4\tilde\eta^2}{KN}\sum_{k,i}\mathbb{E}\|\nabla f_i(y_{i,k-1}) - \nabla f_i(x)\|^2 + 4\tilde\eta^2\|\mathbb{E}[c]\|^2 + \frac{4\tilde\eta^2}{N}\sum_i\|\nabla f_i(x^\star) - \mathbb{E}[c_i]\|^2 + 8\beta\tilde\eta^2\big(\mathbb{E}[f(x)] - f(x^\star)\big) + \frac{12\tilde\eta^2\sigma^2}{KS}$$
$$\le \underbrace{\frac{4\tilde\eta^2}{KN}\sum_{k,i}\mathbb{E}\|\nabla f_i(y_{i,k-1}) - \nabla f_i(x)\|^2}_{=:\,T_1} + \frac{8\tilde\eta^2}{N}\sum_i\|\nabla f_i(x^\star) - \mathbb{E}[c_i]\|^2 + 8\beta\tilde\eta^2\big(\mathbb{E}[f(x)] - f(x^\star)\big) + \frac{12\tilde\eta^2\sigma^2}{KS}\,.$$

The second step follows because $c = \frac{1}{N}\sum_i c_i$ and $\nabla f(x^\star) = 0$, so $\|\mathbb{E}[c]\|^2 = \big\|\frac{1}{N}\sum_i(\mathbb{E}[c_i] - \nabla f_i(x^\star))\big\|^2 \le \frac{1}{N}\sum_i\|\mathbb{E}[c_i] - \nabla f_i(x^\star)\|^2$. Since the gradient of $f_i$ is $\beta$-Lipschitz, $T_1 \le \frac{4\tilde\eta^2\beta^2}{KN}\sum_{k,i}\mathbb{E}\|y_{i,k-1} - x\|^2 = 4\tilde\eta^2\beta^2\mathcal{E}$. The definition of the error in the control variates, $\mathcal{C}_{r-1} := \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\|\mathbb{E}[c_i] - \nabla f_i(x^\star)\|^2$, completes the proof.
Change in control lag. We have previously related the variance of the server update to the control-lag. We now examine how the control-lag grows each round.

Lemma 11. For updates (18)–(21) with the control update (19) and assumptions A3–A5, the following holds true for any $\tilde\eta := \eta_l\eta_g K \in [0, 1/\beta]$:

$$\mathcal{C}_r \le \Big(1 - \frac{S}{N}\Big)\mathcal{C}_{r-1} + \frac{S}{N}\Big(4\beta\big(\mathbb{E}[f(x^{r-1})] - f(x^\star)\big) + 2\beta^2\mathcal{E}_r\Big)\,.$$

Proof. Recall that after round $r$, the control update rule (19) implies that $c_i^r$ is set as per

$$c_i^r = \begin{cases} c_i^{r-1} & \text{if } i\notin S^r\text{, i.e. with probability } \big(1 - \tfrac{S}{N}\big)\,,\\[4pt] \tfrac{1}{K}\sum_{k=1}^{K} g_i(y^r_{i,k-1}) & \text{with probability } \tfrac{S}{N}\,.\end{cases}$$

Then,

$$\mathcal{C}_r = \frac{1}{N}\sum_{i=1}^{N}\|\mathbb{E}[c_i^r] - \nabla f_i(x^\star)\|^2 = \frac{1}{N}\sum_{i=1}^{N}\Big\|\Big(1 - \frac{S}{N}\Big)\big(\mathbb{E}[c_i^{r-1}] - \nabla f_i(x^\star)\big) + \frac{S}{N}\Big(\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}[\nabla f_i(y^r_{i,k-1})] - \nabla f_i(x^\star)\Big)\Big\|^2 \le \Big(1 - \frac{S}{N}\Big)\mathcal{C}_{r-1} + \frac{S}{N^2 K}\sum_{i,k}\mathbb{E}\|\nabla f_i(y^r_{i,k-1}) - \nabla f_i(x^\star)\|^2\,.$$

The final step applied Jensen's inequality twice. We can then further simplify using the relaxed triangle inequality as

$$\mathbb{E}_{r-1}[\mathcal{C}_r] \le \Big(1 - \frac{S}{N}\Big)\mathcal{C}_{r-1} + \frac{2S}{N^2}\sum_i\mathbb{E}\|\nabla f_i(x^{r-1}) - \nabla f_i(x^\star)\|^2 + \frac{2S}{N^2 K}\sum_{i,k}\mathbb{E}\|\nabla f_i(y^r_{i,k-1}) - \nabla f_i(x^{r-1})\|^2$$
$$\overset{(7)}{\le} \Big(1 - \frac{S}{N}\Big)\mathcal{C}_{r-1} + \frac{2S}{N^2}\sum_i\mathbb{E}\|\nabla f_i(x^{r-1}) - \nabla f_i(x^\star)\|^2 + \frac{2S}{N^2 K}\beta^2\sum_{i,k}\mathbb{E}\|y^r_{i,k-1} - x^{r-1}\|^2$$
$$\overset{(9)}{\le} \Big(1 - \frac{S}{N}\Big)\mathcal{C}_{r-1} + \frac{S}{N}\Big(4\beta\big(\mathbb{E}[f(x^{r-1})] - f(x^\star)\big) + 2\beta^2\mathcal{E}_r\Big)\,.$$

The last two inequalities follow from the smoothness of $\{f_i\}$ and the definition $\mathcal{E}_r = \frac{1}{NK}\sum_{i,k}\mathbb{E}\|y^r_{i,k-1} - x^{r-1}\|^2$.
Bounding client-drift. We will now bound the final source of error, which is the client-drift.

Lemma 12. Suppose our step-sizes satisfy $\eta_l \le \frac{1}{81\beta K\eta_g}$ and $f_i$ satisfies assumptions A3–A5. Then, for any global step-size $\eta_g \ge 1$, we can bound the drift as

$$3\beta\tilde\eta\,\mathcal{E}_r \le \frac{2\tilde\eta^2}{3}\mathcal{C}_{r-1} + \frac{\tilde\eta}{25\eta_g^2}\big(\mathbb{E}[f(x^{r-1})] - f(x^\star)\big) + \frac{\tilde\eta^2}{K\eta_g^2}\sigma^2\,.$$
Proof. First, observe that if $K = 1$, then $\mathcal{E}_r = 0$ since $y_{i,0} = x$ for all $i \in [N]$, and $\mathcal{C}_{r-1}$ and the right hand side are both positive. Thus the lemma is trivially true if $K = 1$. For $K > 1$, we build a recursive bound on the drift. Starting from the definition of the update (18) and then applying the relaxed triangle inequality, we can expand

$$\mathbb{E}_{r-1}\frac{1}{S}\sum_{i\in S}\|(y_i - \eta_l v_i) - x\|^2 = \mathbb{E}_{r-1}\frac{1}{S}\sum_{i\in S}\|y_i - \eta_l g_i(y_i) + \eta_l c - \eta_l c_i - x\|^2$$
$$\le \mathbb{E}_{r-1}\frac{1}{S}\sum_{i\in S}\|y_i - \eta_l\nabla f_i(y_i) + \eta_l c - \eta_l c_i - x\|^2 + \eta_l^2\sigma^2$$
$$\le (1+a)\underbrace{\mathbb{E}_{r-1}\frac{1}{S}\sum_{i\in S}\|y_i - \eta_l\nabla f_i(y_i) + \eta_l\nabla f_i(x) - x\|^2}_{=:\,T_2} + \Big(1 + \frac{1}{a}\Big)\eta_l^2\underbrace{\mathbb{E}_{r-1}\frac{1}{S}\sum_{i\in S}\|c - c_i + \nabla f_i(x)\|^2}_{=:\,T_3} + \eta_l^2\sigma^2\,.$$

The final step follows from the relaxed triangle inequality (Lemma 3). Applying the contractive mapping Lemma 6 for $\eta_l \le 1/\beta$ shows

$$T_2 = \mathbb{E}_{r-1}\frac{1}{S}\sum_{i\in S}\|y_i - \eta_l\nabla f_i(y_i) + \eta_l\nabla f_i(x) - x\|^2 \le \mathbb{E}_{r-1}\frac{1}{S}\sum_{i\in S}\|y_i - x\|^2\,.$$
Once again using our relaxed triangle inequality to expand the other term $T_3$, we get

$$T_3 = \mathbb{E}_{r-1}\frac{1}{S}\sum_{i\in S}\|c - c_i + \nabla f_i(x)\|^2 = \frac{1}{N}\sum_{i=1}^{N}\|c - c_i + \nabla f_i(x)\|^2 = \frac{1}{N}\sum_{i=1}^{N}\|c - c_i + \nabla f_i(x^\star) + \nabla f_i(x) - \nabla f_i(x^\star)\|^2$$
$$\le 3\|c\|^2 + \frac{3}{N}\sum_{i=1}^{N}\|c_i - \nabla f_i(x^\star)\|^2 + \frac{3}{N}\sum_{i=1}^{N}\|\nabla f_i(x) - \nabla f_i(x^\star)\|^2$$
$$\le \frac{6}{N}\sum_{i=1}^{N}\|c_i - \nabla f_i(x^\star)\|^2 + \frac{3}{N}\sum_{i=1}^{N}\|\nabla f_i(x) - \nabla f_i(x^\star)\|^2$$
$$\le \frac{6}{N}\sum_{i=1}^{N}\|c_i - \nabla f_i(x^\star)\|^2 + 6\beta\big(f(x) - f(x^\star)\big)\,.$$
The last step used the smoothness of $f_i$. Combining the bounds on $T_2$ and $T_3$ in the original inequality with $a = \frac{1}{K-1}$ gives

$$\frac{1}{N}\sum_i\mathbb{E}\|y_{i,k} - x\|^2 \le \Big(1 + \frac{1}{K-1}\Big)\frac{1}{N}\sum_i\mathbb{E}\|y_{i,k-1} - x\|^2 + \eta_l^2\sigma^2 + 6\eta_l^2 K\beta\big(f(x) - f(x^\star)\big) + \frac{6K\eta_l^2}{N}\sum_i\mathbb{E}\|c_i - \nabla f_i(x^\star)\|^2\,.$$

Recall that with the choice of $c_i$ in (19), the variance of $c_i$ is less than $\frac{\sigma^2}{K}$. Separating its mean and variance gives

$$\frac{1}{N}\sum_i\mathbb{E}\|y_{i,k} - x\|^2 \le \Big(1 + \frac{1}{K-1}\Big)\frac{1}{N}\sum_i\mathbb{E}\|y_{i,k-1} - x\|^2 + 7\eta_l^2\sigma^2 + 6\eta_l^2 K\beta\big(f(x) - f(x^\star)\big) + \frac{6K\eta_l^2}{N}\sum_i\|\mathbb{E}[c_i] - \nabla f_i(x^\star)\|^2\,. \quad (25)$$
Unrolling the recursion (25), we get the following for any $k \in \{1, \ldots, K\}$:

$$\frac{1}{N}\sum_i\mathbb{E}\|y_{i,k} - x\|^2 \le \Big(6K\beta\eta_l^2\big(f(x) - f(x^\star)\big) + 6K\eta_l^2\,\mathcal{C}_{r-1} + 7\eta_l^2\sigma^2\Big)\sum_{\tau=0}^{k-1}\Big(1 + \frac{1}{K-1}\Big)^{\tau}\,.$$

The equality follows from the definition $\tilde\eta = K\eta_l\eta_g$, and the final inequality uses the bound $\tilde\eta \le \frac{1}{81\beta}$.
Progress in one round. Now that we have a bound on all errors, we can describe our progress.

Lemma 13. Suppose assumptions A3–A5 are true. Then the following holds for any step-sizes satisfying $\eta_g \ge 1$, $\eta_l \le \min\big\{\frac{1}{81\beta K\eta_g},\, \frac{S}{15\mu NK\eta_g}\big\}$, and effective step-size $\tilde\eta := K\eta_g\eta_l$:

$$\mathbb{E}\Big[\|x^r - x^\star\|^2 + \frac{9N\tilde\eta^2}{S}\mathcal{C}_r\Big] \le \Big(1 - \frac{\mu\tilde\eta}{2}\Big)\Big(\mathbb{E}\|x^{r-1} - x^\star\|^2 + \frac{9N\tilde\eta^2}{S}\mathcal{C}_{r-1}\Big) - \tilde\eta\big(\mathbb{E}[f(x^{r-1})] - f(x^\star)\big) + \frac{12\tilde\eta^2}{KS}\Big(1 + \frac{S}{\eta_g^2}\Big)\sigma^2\,.$$
Proof. Expanding the round-$r$ update and applying Lemma 10 to bound the second moment of the server update,

$$\mathbb{E}_{r-1}\|x + \Delta x - x^\star\|^2 = \mathbb{E}_{r-1}\|x - x^\star\|^2 - \frac{2\tilde\eta}{KS}\sum_{k,\,i\in S}\mathbb{E}_{r-1}\langle\nabla f_i(y_{i,k-1}),\, x - x^\star\rangle + \mathbb{E}_{r-1}\|\Delta x\|^2$$
$$\le \mathbb{E}_{r-1}\|x - x^\star\|^2 + \underbrace{\frac{2\tilde\eta}{KS}\sum_{k,\,i\in S}\mathbb{E}_{r-1}\langle\nabla f_i(y_{i,k-1}),\, x^\star - x\rangle}_{=:\,T_4} + 8\beta\tilde\eta^2\big(\mathbb{E}[f(x^{r-1})] - f(x^\star)\big) + 8\tilde\eta^2\mathcal{C}_{r-1} + 4\tilde\eta^2\beta^2\mathcal{E} + \frac{12\tilde\eta^2\sigma^2}{KS}\,.$$

The term $T_4$ can be bounded by using perturbed strong convexity (Lemma 5) with $h = f_i$, $x = y_{i,k-1}$, $y = x^\star$, and $z = x$ to get

$$\mathbb{E}[T_4] = \frac{2\tilde\eta}{KS}\sum_{k,\,i\in S}\mathbb{E}\langle\nabla f_i(y_{i,k-1}),\, x^\star - x\rangle \le \frac{2\tilde\eta}{KS}\sum_{k,\,i\in S}\mathbb{E}\Big(f_i(x^\star) - f_i(x) + \beta\|y_{i,k-1} - x\|^2 - \frac{\mu}{4}\|x - x^\star\|^2\Big) = -2\tilde\eta\,\mathbb{E}\Big(f(x) - f(x^\star) + \frac{\mu}{4}\|x - x^\star\|^2\Big) + 2\beta\tilde\eta\,\mathcal{E}\,.$$

Plugging $T_4$ back, we can further simplify the expression to get

$$\mathbb{E}\|x + \Delta x - x^\star\|^2 \le \mathbb{E}\|x - x^\star\|^2 - 2\tilde\eta\Big(f(x) - f(x^\star) + \frac{\mu}{4}\|x - x^\star\|^2\Big) + 2\beta\tilde\eta\,\mathcal{E} + \frac{12\tilde\eta^2\sigma^2}{KS} + 8\beta\tilde\eta^2\big(\mathbb{E}[f(x^{r-1})] - f(x^\star)\big) + 8\tilde\eta^2\mathcal{C}_{r-1} + 4\tilde\eta^2\beta^2\mathcal{E}$$
$$= \Big(1 - \frac{\mu\tilde\eta}{2}\Big)\|x - x^\star\|^2 + \big(8\beta\tilde\eta^2 - 2\tilde\eta\big)\big(f(x) - f(x^\star)\big) + \frac{12\tilde\eta^2\sigma^2}{KS} + \big(2\beta\tilde\eta + 4\beta^2\tilde\eta^2\big)\mathcal{E} + 8\tilde\eta^2\mathcal{C}_{r-1}\,.$$
We can use Lemma 11 (scaled by $\frac{9\tilde\eta^2 N}{S}$) to bound the control-lag:

$$\frac{9\tilde\eta^2 N}{S}\mathcal{C}_r \le \Big(1 - \frac{\mu\tilde\eta}{2}\Big)\frac{9\tilde\eta^2 N}{S}\mathcal{C}_{r-1} + 9\Big(\frac{\mu\tilde\eta N}{2S} - 1\Big)\tilde\eta^2\mathcal{C}_{r-1} + 9\tilde\eta^2\Big(4\beta\big(\mathbb{E}[f(x^{r-1})] - f(x^\star)\big) + 2\beta^2\mathcal{E}\Big)\,.$$

Finally, the lemma follows from noting that $\tilde\eta \le \frac{1}{81\beta}$ implies $44\beta^2\tilde\eta^2 \le \frac{24}{25}\beta\tilde\eta$, and $\tilde\eta \le \frac{S}{15\mu N}$ implies $\frac{9\mu\tilde\eta N}{2S} \le \frac{1}{3}$.
The final rate follows simply by unrolling the recursive bound in Lemma 13 using Lemma 1 in the strongly-convex case, and Lemma 2 if $\mu = 0$. Also note that if $c_i^0 = g_i(x^0)$, then $\frac{\tilde{\eta}N}{S}C_0$ can be bounded in terms of the function sub-optimality $F$.
Additional notation. Recall that in round $r$, we update the control variate as (19)
$$c_i^r = \begin{cases} \frac{1}{K}\sum_{k=1}^K g_i(y_{i,k-1}^r) & \text{if } i \in \mathcal{S}^r\,,\\[2pt] c_i^{r-1} & \text{otherwise}\,.\end{cases}$$
We introduce the following notation to keep track of the 'lag' in the update of the control variate: define a sequence of parameters $\{\alpha_{i,k-1}^r\}$ such that for any $i \in [N]$ and $k \in [K]$ we have $\alpha_{i,k-1}^0 := x^0$ and, for $r \ge 1$,
$$\alpha_{i,k-1}^r := \begin{cases} y_{i,k-1}^r & \text{if } i \in \mathcal{S}^r\,,\\[2pt] \alpha_{i,k-1}^{r-1} & \text{otherwise}\,.\end{cases} \qquad (26)$$
By the update rule for control variates (19) and the definition of $\{\alpha_{i,k-1}^r\}$ above, the following property always holds:
$$c_i^r = \frac{1}{K}\sum_{k=1}^K g_i(\alpha_{i,k-1}^r)\,.$$
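As a sanity check on this bookkeeping, the following minimal sketch simulates the update rule (19) together with the lagged parameters (26) for scalar quadratic clients with deterministic gradients ($g_i = \nabla f_i$, i.e. $\sigma = 0$), and verifies the property above after every round. All names (`a`, `b`, `grad`, `alpha`, `S_SIZE`) are illustrative, and the server control variate is taken to be the exact average of the client variates, a simplification.

```python
import random

# Scalar quadratic clients: f_i(y) = 0.5 * a[i] * (y - b[i])**2, so grad_i(y) = a[i] * (y - b[i]).
N, K, R, S_SIZE = 4, 3, 5, 2
random.seed(0)
a = [random.uniform(0.5, 2.0) for _ in range(N)]
b = [random.uniform(-1.0, 1.0) for _ in range(N)]
grad = lambda i, y: a[i] * (y - b[i])

eta_l = 0.05
x = 0.0
c = [grad(i, x) for i in range(N)]       # c_i^0 = g_i(x^0)
alpha = [[x] * K for _ in range(N)]      # alpha^0_{i,k-1} = x^0 for every k

for r in range(R):
    sampled = random.sample(range(N), S_SIZE)
    c_server = sum(c) / N                # server variate as exact average (a simplification)
    deltas = []
    for i in sampled:
        y, ys = x, []
        for _ in range(K):
            ys.append(y)                 # record y^r_{i,k-1}
            y -= eta_l * (grad(i, y) + c_server - c[i])
        c[i] = sum(grad(i, v) for v in ys) / K   # update rule (19), sigma = 0
        alpha[i] = ys                    # definition (26): lagged parameters refreshed
        deltas.append(y - x)
    x += sum(deltas) / len(deltas)       # server step with eta_g = 1

# The claimed property: c_i^r = (1/K) sum_k g_i(alpha^r_{i,k-1}) for every client i.
for i in range(N):
    assert abs(c[i] - sum(grad(i, v) for v in alpha[i]) / K) < 1e-12
```

Unsampled clients keep both their control variate and their lagged parameters, which is exactly why the property is preserved across rounds.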
We can then define $\Xi_r$ to be the error in the control variate for round $r$:
$$\Xi_r := \frac{1}{KN}\sum_{k=1}^K\sum_{i=1}^N \mathbb{E}\|\alpha_{i,k-1}^r - x^r\|^2\,. \qquad (27)$$
Also recall the closely related definition of the client drift caused by the local updates:
$$\mathcal{E}_r := \frac{1}{KN}\sum_{k=1}^K\sum_{i=1}^N \mathbb{E}\|y_{i,k}^r - x^{r-1}\|^2\,.$$
Variance of server update. Let us analyze how the control variates affect the variance of the aggregate server update.
Lemma 14. For the updates (18)–(21) and assumptions A4 and A5, the following holds for any $\tilde{\eta} := \eta_l\eta_g K \in [0, 1/\beta]$:
$$\mathbb{E}\|\mathbb{E}_{r-1}[x^r] - x^{r-1}\|^2 \le 2\tilde{\eta}^2\beta^2\mathcal{E}_r + 2\tilde{\eta}^2\,\mathbb{E}\|\nabla f(x^{r-1})\|^2\,, \quad\text{and}$$
$$\mathbb{E}\|x^r - x^{r-1}\|^2 \le 4\tilde{\eta}^2\beta^2\mathcal{E}_r + 8\tilde{\eta}^2\beta^2\,\Xi_{r-1} + 4\tilde{\eta}^2\,\mathbb{E}\|\nabla f(x^{r-1})\|^2 + \frac{9\tilde{\eta}^2\sigma^2}{KS}\,.$$
Proof. Recall that the server update can be written as
$$\Delta x = -\frac{\tilde{\eta}}{KS}\sum_{k,i\in\mathcal{S}}\big(g_i(y_{i,k-1}) + c - c_i\big)\,, \quad\text{where } c_i = \frac{1}{K}\sum_k g_i(\alpha_{i,k-1})\,.$$
Taking norms on both sides and separating mean and variance, we proceed as
$$
\begin{aligned}
\mathbb{E}\|\Delta x\|^2 &= \mathbb{E}\Big\|{-\frac{\tilde{\eta}}{KS}}\sum_{k,i\in\mathcal{S}}\big(g_i(y_{i,k-1}) + c - c_i\big)\Big\|^2 \\
&\le \mathbb{E}\Big\|{-\frac{\tilde{\eta}}{KS}}\sum_{k,i\in\mathcal{S}}\big(\nabla f_i(y_{i,k-1}) + \mathbb{E}[c] - \mathbb{E}[c_i]\big)\Big\|^2 + \frac{9\tilde{\eta}^2\sigma^2}{KS} \\
&\le \mathbb{E}\Big[\frac{\tilde{\eta}^2}{KS}\sum_{k,i\in\mathcal{S}}\big\|\nabla f_i(y_{i,k-1}) + \mathbb{E}[c] - \mathbb{E}[c_i]\big\|^2\Big] + \frac{9\tilde{\eta}^2\sigma^2}{KS} \\
&= \mathbb{E}\Big[\frac{\tilde{\eta}^2}{KN}\sum_{k,i}\big\|\big(\nabla f_i(y_{i,k-1}) - \nabla f_i(x)\big) + \big(\mathbb{E}[c] - \nabla f(x)\big) + \nabla f(x) - \big(\mathbb{E}[c_i] - \nabla f_i(x)\big)\big\|^2\Big] + \frac{9\tilde{\eta}^2\sigma^2}{KS} \\
&\le \frac{4\tilde{\eta}^2}{KN}\sum_{k,i}\mathbb{E}\|\nabla f_i(y_{i,k-1}) - \nabla f_i(x)\|^2 + \frac{8\tilde{\eta}^2}{KN}\sum_{k,i}\mathbb{E}\|\nabla f_i(\alpha_{i,k-1}) - \nabla f_i(x)\|^2 \\
&\quad + 4\tilde{\eta}^2\,\mathbb{E}\|\nabla f(x)\|^2 + \frac{9\tilde{\eta}^2\sigma^2}{KS} \\
&\le 4\tilde{\eta}^2\beta^2\mathcal{E}_r + 8\beta^2\tilde{\eta}^2\,\Xi_{r-1} + 4\tilde{\eta}^2\,\mathbb{E}\|\nabla f(x)\|^2 + \frac{9\tilde{\eta}^2\sigma^2}{KS}\,.
\end{aligned}
$$
In the first inequality, note that the three random variables $\frac{1}{KS}\sum_{k,i\in\mathcal{S}} g_i(y_{i,k})$, $\frac{1}{S}\sum_{i\in\mathcal{S}} c_i$, and $c$ may not be independent, but each has variance smaller than $\frac{\sigma^2}{KS}$, and so we can apply Lemma 4. The rest of the inequalities follow from repeated applications of the relaxed triangle inequality, the $\beta$-Lipschitzness of $\nabla f_i$, and the definition of $\Xi_{r-1}$ (27). This proves the second statement. The first statement follows from our expression for $\mathbb{E}_{r-1}[\Delta x]$ and similar computations.
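The mean–variance separation invoked here (via Lemma 4) is the identity $\mathbb{E}\|X\|^2 = \|\mathbb{E}[X]\|^2 + \mathbb{E}\|X - \mathbb{E}[X]\|^2$. A quick exact check of this identity on a small finite distribution, using rational arithmetic (the distribution itself is illustrative):

```python
from itertools import product
from fractions import Fraction

# Uniform distribution over 8 outcomes of three +/-1 coordinates;
# X is a 2-dimensional random vector with nonzero mean.
outcomes = list(product([-1, 1], repeat=3))
p = Fraction(1, len(outcomes))
X = [(x1 + 2 * x2 + 1, x3 - 1) for (x1, x2, x3) in outcomes]

mean = tuple(sum(p * Fraction(xi) for xi in coords) for coords in zip(*X))
sq = lambda v: sum(Fraction(c) ** 2 for c in v)          # squared Euclidean norm
second_moment = sum(p * sq(v) for v in X)                # E||X||^2
variance = sum(p * sq(tuple(c - m for c, m in zip(v, mean))) for v in X)  # E||X - EX||^2

# E||X||^2 = ||E X||^2 + E||X - E X||^2, exactly (no floating-point slack needed)
assert second_moment == sq(mean) + variance
```

Because the arithmetic is done with `Fraction`, the identity holds with exact equality rather than up to rounding error.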
Lag in the control variates. We now analyze the 'lag' in the control variates that arises because we sample only a small subset of clients each round. Since we can no longer rely on convexity but only on the Lipschitzness of the gradients, the control-lag increases faster in the non-convex case.

Lemma 15. For the updates (18)–(21) and assumptions A4, A5, the following holds for any $\tilde{\eta} \le \frac{1}{24\beta}\big(\frac{S}{N}\big)^\alpha$ with $\alpha \in [\frac{1}{2}, 1]$, where $\tilde{\eta} := \eta_l\eta_g K$:
$$\Xi_r \le \Big(1 - \frac{17S}{36N}\Big)\Xi_{r-1} + \frac{1}{48\beta^2}\Big(\frac{S}{N}\Big)^{2\alpha-1}\mathbb{E}\|\nabla f(x^{r-1})\|^2 + \frac{97}{48}\Big(\frac{S}{N}\Big)^{2\alpha-1}\mathcal{E}_r + \frac{S}{N\beta^2}\cdot\frac{\sigma^2}{32KS}\,.$$
Proof. The proof proceeds similarly to that of Lemma 11, except that we cannot rely on convexity. Recall that after round $r$, the definition of $\alpha_{i,k-1}^r$ (26) implies that
$$\mathbb{E}_{\mathcal{S}^r}[\alpha_{i,k-1}^r] = \Big(1 - \frac{S}{N}\Big)\alpha_{i,k-1}^{r-1} + \frac{S}{N}\,y_{i,k-1}^r\,.$$
Hence,
$$\Xi_r = \frac{1}{KN}\sum_{i,k}\mathbb{E}\|\alpha_{i,k-1}^r - x^r\|^2 = \Big(1 - \frac{S}{N}\Big)\underbrace{\frac{1}{KN}\sum_{i,k}\mathbb{E}\|\alpha_{i,k-1}^{r-1} - x^r\|^2}_{T_5} + \frac{S}{N}\underbrace{\frac{1}{KN}\sum_{k,i}\mathbb{E}\|y_{i,k-1}^r - x^r\|^2}_{T_6}\,.$$
We can expand the second term $T_6$ with the relaxed triangle inequality to claim
$$T_6 \le 2\big(\mathcal{E}_r + \mathbb{E}\|\Delta x^r\|^2\big)\,.$$
We will expand the first term $T_5$, for a constant $b \ge 0$ to be chosen later, as
$$
\begin{aligned}
T_5 &= \frac{1}{KN}\sum_{i,k}\mathbb{E}\Big(\|\alpha_{i,k-1}^{r-1} - x^{r-1}\|^2 + \|\Delta x^r\|^2 - 2\big\langle \mathbb{E}_{r-1}[\Delta x^r],\, \alpha_{i,k-1}^{r-1} - x^{r-1}\big\rangle\Big) \\
&\le \frac{1}{KN}\sum_{i,k}\mathbb{E}\Big(\|\alpha_{i,k-1}^{r-1} - x^{r-1}\|^2 + \|\Delta x^r\|^2 + \frac{1}{b}\|\mathbb{E}_{r-1}[\Delta x^r]\|^2 + b\,\|\alpha_{i,k-1}^{r-1} - x^{r-1}\|^2\Big)\,,
\end{aligned}
$$
where we used Young's inequality, which holds for any $b \ge 0$. Combining the bounds for $T_5$ and $T_6$,
$$
\begin{aligned}
\Xi_r &\le \Big(1 - \frac{S}{N}\Big)(1 + b)\,\Xi_{r-1} + \frac{2S}{N}\mathcal{E}_r + 2\,\mathbb{E}\|\Delta x^r\|^2 + \frac{1}{b}\,\mathbb{E}\|\mathbb{E}_{r-1}[\Delta x^r]\|^2 \\
&\le \Big(\Big(1 - \frac{S}{N}\Big)(1 + b) + 16\tilde{\eta}^2\beta^2\Big)\Xi_{r-1} + \Big(\frac{2S}{N} + 8\tilde{\eta}^2\beta^2 + \frac{2}{b}\tilde{\eta}^2\beta^2\Big)\mathcal{E}_r + \Big(8 + \frac{2}{b}\Big)\tilde{\eta}^2\,\mathbb{E}\|\nabla f(x)\|^2 + \frac{18\tilde{\eta}^2\sigma^2}{KS}\,.
\end{aligned}
$$
The last inequality applied Lemma 14. Verify that with the choice $b = \frac{S}{2(N-S)}$, we have $\big(1 - \frac{S}{N}\big)(1 + b) = 1 - \frac{S}{2N}$ and $\frac{1}{b} \le \frac{2N}{S}$. Plugging in these values along with the bound on the step-size, $16\beta^2\tilde{\eta}^2 \le \frac{1}{36}\big(\frac{S}{N}\big)^{2\alpha} \le \frac{S}{36N}$, completes the lemma.
Bounding the drift. We next bound the client drift $\mathcal{E}_r$. For this, convexity is not crucial, and we recover a result very similar to Lemma 12 using only the Lipschitzness of the gradient.

Lemma 16. Suppose our step-sizes satisfy $\eta_l \le \frac{1}{24\beta K\eta_g}$ and $f_i$ satisfies assumptions A4–A5. Then, for any global step-size $\eta_g \ge 1$, we can bound the drift as
$$\frac{5}{3}\beta^2\tilde{\eta}\,\mathcal{E}_r \le \frac{5}{3}\beta^3\tilde{\eta}^2\,\Xi_{r-1} + \frac{\tilde{\eta}}{24\eta_g^2}\,\mathbb{E}\|\nabla f(x^{r-1})\|^2 + \frac{\tilde{\eta}^2\beta}{4K\eta_g^2}\sigma^2\,.$$
Proof. First, observe that if $K = 1$, then $\mathcal{E}_r = 0$ since $y_{i,0} = x$ for all $i \in [N]$, and $\Xi_{r-1}$ and the right-hand side are both positive. Thus the lemma is trivially true if $K = 1$, and we will henceforth assume $K \ge 2$. Starting from the update rule (18), for $i \in [N]$ and $k \in [K]$:
$$
\begin{aligned}
\mathbb{E}\|y_{i,k} - x\|^2 &= \mathbb{E}\|y_{i,k-1} - \eta_l(g_i(y_{i,k-1}) + c - c_i) - x\|^2 \\
&\le \mathbb{E}\|y_{i,k-1} - \eta_l(\nabla f_i(y_{i,k-1}) + c - c_i) - x\|^2 + \eta_l^2\sigma^2 \\
&\le \Big(1 + \frac{1}{K-1}\Big)\mathbb{E}\|y_{i,k-1} - x\|^2 + K\eta_l^2\,\mathbb{E}\|\nabla f_i(y_{i,k-1}) + c - c_i\|^2 + \eta_l^2\sigma^2 \\
&= \Big(1 + \frac{1}{K-1}\Big)\mathbb{E}\|y_{i,k-1} - x\|^2 + \eta_l^2\sigma^2 \\
&\quad + K\eta_l^2\,\mathbb{E}\big\|\nabla f_i(y_{i,k-1}) - \nabla f_i(x) + \big(c - \nabla f(x)\big) + \nabla f(x) - \big(c_i - \nabla f_i(x)\big)\big\|^2 \\
&\le \Big(1 + \frac{1}{K-1}\Big)\mathbb{E}\|y_{i,k-1} - x\|^2 + 4K\eta_l^2\,\mathbb{E}\|\nabla f_i(y_{i,k-1}) - \nabla f_i(x)\|^2 + \eta_l^2\sigma^2 \\
&\quad + 4K\eta_l^2\,\mathbb{E}\|c - \nabla f(x)\|^2 + 4K\eta_l^2\,\mathbb{E}\|\nabla f(x)\|^2 + 4K\eta_l^2\,\mathbb{E}\|c_i - \nabla f_i(x)\|^2 \\
&\le \Big(1 + \frac{1}{K-1} + 4K\beta^2\eta_l^2\Big)\mathbb{E}\|y_{i,k-1} - x\|^2 + \eta_l^2\sigma^2 + 4K\eta_l^2\,\mathbb{E}\|\nabla f(x)\|^2 \\
&\quad + 4K\eta_l^2\,\mathbb{E}\|c - \nabla f(x)\|^2 + 4K\eta_l^2\,\mathbb{E}\|c_i - \nabla f_i(x)\|^2\,.
\end{aligned}
$$
The last inequality used the bound on the step-size $\beta\tilde{\eta} \le \frac{1}{24}$. Averaging over $k$ and multiplying both sides by $\frac{5}{3}\beta^2\tilde{\eta}$ yields the lemma statement.
Progress made in each round. Given that we can bound all sources of error, we can finally prove the progress made in each round.

Lemma 17. Suppose the updates (18)–(21) satisfy assumptions A4–A5. For any effective step-size $\tilde{\eta} := K\eta_g\eta_l$ satisfying $\tilde{\eta} \le \frac{1}{24\beta}\big(\frac{S}{N}\big)^{\frac{2}{3}}$,
$$\mathbb{E}[f(x^r)] + 12\beta^3\tilde{\eta}^2\frac{N}{S}\,\Xi_r \le \mathbb{E}[f(x^{r-1})] + 12\beta^3\tilde{\eta}^2\frac{N}{S}\,\Xi_{r-1} + \frac{5\beta\tilde{\eta}^2\sigma^2}{KS}\Big(1 + \frac{S}{\eta_g^2}\Big) - \frac{\tilde{\eta}}{14}\,\mathbb{E}\|\nabla f(x^{r-1})\|^2\,.$$
Proof. Starting from the smoothness of $f$ and taking conditional expectations gives
$$\mathbb{E}_{r-1}[f(x + \Delta x)] \le f(x) + \langle\nabla f(x),\, \mathbb{E}_{r-1}[\Delta x]\rangle + \frac{\beta}{2}\mathbb{E}_{r-1}\|\Delta x\|^2\,.$$
As usual, we drop the superscripts everywhere. Recall that the server update can be written as
$$\Delta x = -\frac{\tilde{\eta}}{KS}\sum_{k,i\in\mathcal{S}}\big(g_i(y_{i,k-1}) + c - c_i\big)\,, \quad\text{and}\quad \mathbb{E}_{\mathcal{S}}[\Delta x] = -\frac{\tilde{\eta}}{KN}\sum_{k,i} g_i(y_{i,k-1})\,.$$
Substituting this in the previous inequality and applying Lemma 14 to bound $\mathbb{E}\|\Delta x\|^2$ gives
$$
\begin{aligned}
\mathbb{E}[f(x + \Delta x)] - f(x) &\le -\frac{\tilde{\eta}}{KN}\sum_{k,i}\big\langle\nabla f(x),\, \mathbb{E}[\nabla f_i(y_{i,k-1})]\big\rangle + \frac{\beta}{2}\mathbb{E}\|\Delta x\|^2 \\
&\le -\frac{\tilde{\eta}}{KN}\sum_{k,i}\big\langle\nabla f(x),\, \mathbb{E}[\nabla f_i(y_{i,k-1})]\big\rangle \\
&\quad + 2\tilde{\eta}^2\beta^3\mathcal{E}_r + 4\tilde{\eta}^2\beta^3\,\Xi_{r-1} + 2\beta\tilde{\eta}^2\,\mathbb{E}\|\nabla f(x)\|^2 + \frac{9\beta\tilde{\eta}^2\sigma^2}{2KS} \\
&\le -\frac{\tilde{\eta}}{2}\,\mathbb{E}\|\nabla f(x)\|^2 + \frac{\tilde{\eta}}{2}\,\mathbb{E}\Big\|\frac{1}{KN}\sum_{i,k}\big(\nabla f_i(y_{i,k-1}) - \nabla f(x)\big)\Big\|^2 \\
&\quad + 2\tilde{\eta}^2\beta^3\mathcal{E}_r + 4\tilde{\eta}^2\beta^3\,\Xi_{r-1} + 2\beta\tilde{\eta}^2\,\mathbb{E}\|\nabla f(x)\|^2 + \frac{9\beta\tilde{\eta}^2\sigma^2}{2KS} \\
&\le -\Big(\frac{\tilde{\eta}}{2} - 2\beta\tilde{\eta}^2\Big)\mathbb{E}\|\nabla f(x)\|^2 + \Big(\frac{\tilde{\eta}}{2} + 2\beta\tilde{\eta}^2\Big)\beta^2\mathcal{E}_r + 4\beta^3\tilde{\eta}^2\,\Xi_{r-1} + \frac{9\beta\tilde{\eta}^2\sigma^2}{2KS}\,.
\end{aligned}
$$
The third inequality follows from the observation that $-ab = \frac{1}{2}\big((b-a)^2 - a^2\big) - \frac{1}{2}b^2 \le \frac{1}{2}\big((b-a)^2 - a^2\big)$ for any $a, b \in \mathbb{R}$, and the last from the $\beta$-Lipschitzness of $\nabla f_i$. Now we use Lemma 15 to bound $\Xi_r$:
$$
\begin{aligned}
12\beta^3\tilde{\eta}^2\frac{N}{S}\,\Xi_r &\le 12\beta^3\tilde{\eta}^2\frac{N}{S}\Big(1 - \frac{17S}{36N}\Big)\Xi_{r-1} \\
&\quad + 12\beta^3\tilde{\eta}^2\frac{N}{S}\Big(\frac{1}{48\beta^2}\Big(\frac{S}{N}\Big)^{2\alpha-1}\mathbb{E}\|\nabla f(x^{r-1})\|^2 + \frac{97}{48}\Big(\frac{S}{N}\Big)^{2\alpha-1}\mathcal{E}_r + \frac{S}{N\beta^2}\frac{\sigma^2}{32KS}\Big) \\
&= 12\beta^3\tilde{\eta}^2\frac{N}{S}\,\Xi_{r-1} - \frac{17}{3}\beta^3\tilde{\eta}^2\,\Xi_{r-1} + \frac{1}{4}\beta\tilde{\eta}^2\Big(\frac{N}{S}\Big)^{2-2\alpha}\mathbb{E}\|\nabla f(x)\|^2 + \frac{97}{4}\beta^3\tilde{\eta}^2\Big(\frac{N}{S}\Big)^{2-2\alpha}\mathcal{E}_r + \frac{3\beta\tilde{\eta}^2\sigma^2}{8KS}\,.
\end{aligned}
$$
Also recall that Lemma 16 states that
$$\frac{5}{3}\beta^2\tilde{\eta}\,\mathcal{E}_r \le \frac{5}{3}\beta^3\tilde{\eta}^2\,\Xi_{r-1} + \frac{\tilde{\eta}}{24\eta_g^2}\,\mathbb{E}\|\nabla f(x^{r-1})\|^2 + \frac{\tilde{\eta}^2\beta}{4K\eta_g^2}\sigma^2\,.$$
By our choice of $\alpha = \frac{2}{3}$, plugging in the bound on the step-size $\beta\tilde{\eta}\big(\frac{N}{S}\big)^{2-2\alpha} \le \frac{1}{24}$ proves the lemma.
The non-convex rate of convergence now follows by unrolling the recursion in Lemma 17 and selecting an appropriate step-size $\tilde{\eta}$ as in Lemma 2. Finally, note that if we initialize $c_i^0 = g_i(x^0)$, then $\Xi_0 = 0$.
• Strongly convex: $f_i$ satisfies (A3) for $\mu > 0$, $\eta_l \le \min\big(\frac{1}{10\beta}, \frac{1}{22\delta K}, \frac{1}{10\mu K}\big)$, and $R \ge \max\big(\frac{20\beta}{\mu}, \frac{44\delta K + 20\mu K}{\mu}, 20K\big)$; then
$$\mathbb{E}\big[\|\nabla f(\bar{x}^R)\|^2\big] \le \tilde{O}\Big(\frac{\beta\sigma^2}{\mu RKN} + \mu D^2\exp\Big(-\frac{\mu}{20\beta + 44\delta K + 20\mu K}\,RK\Big)\Big)\,.$$

• Weakly convex: $f$ satisfies $\nabla^2 f \succeq -\delta I$, $\eta_l \le \min\big(\frac{1}{10\beta}, \frac{1}{22\delta K}\big)$, and $R \ge 1$; then
$$\mathbb{E}\big[\|\nabla f(\bar{x}^R)\|^2\big] \le O\Bigg(\frac{\sigma\sqrt{\beta\big(f(x^0) - f^\star\big)}}{\sqrt{RKN}} + \frac{(\beta + \delta K)\big(f(x^0) - f^\star\big)}{RK}\Bigg)\,.$$
Note that if δ = 0, we match (up to acceleration) the lower bound in (Woodworth et al., 2018). While certainly δ = 0 when the functions are identical, as studied in (Woodworth et al., 2018), our upper bound is significantly stronger, since it is possible that δ = 0 even for highly heterogeneous functions. For example, objective perturbation (Chaudhuri et al., 2011; Kifer et al., 2012) is an optimal mechanism for achieving differential privacy for smooth convex objectives (Bassily et al., 2014). Intuitively, objective perturbation relies on masking each client's gradients by adding a large random linear term to the objective function. In such a case, we would have high gradient dissimilarity but no Hessian dissimilarity.
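As a toy illustration of this point (with hypothetical numbers, not taken from any of the cited works): masking scalar quadratic objectives with per-client linear terms makes the gradients arbitrarily dissimilar while leaving every Hessian, and hence the Hessian dissimilarity δ, unchanged.

```python
# Shared curvature, per-client "random" linear masks (hypothetical numbers).
a = 1.5
r = [-4.0, -1.0, 2.0, 3.5]

# f_i(x) = 0.5*a*x**2 + r[i]*x  =>  grad_i(x) = a*x + r[i],  hess_i = a
grad = lambda i, x: a * x + r[i]
hess = [a for _ in r]

x = 0.7
grads = [grad(i, x) for i in range(len(r))]
grad_spread = max(grads) - min(grads)                       # large gradient dissimilarity
delta = max(abs(h - sum(hess) / len(hess)) for h in hess)   # Hessian dissimilarity

assert grad_spread > 1.0   # gradients disagree substantially at x
assert delta == 0.0        # yet delta = 0: all Hessians are identical
```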
Our non-convex convergence rates are the first of their kind as far as we are aware: no previous work shows how one can take advantage of similarity for non-convex functions. However, we should note that non-convex quadratics do not have a global lower bound on the function value $f^\star$. We will instead assume that $f^\star$ almost surely lower-bounds the value of $f(x^R)$, implicitly assuming that the iterates remain bounded.
Outline. In the rest of this section, we will focus on proving Theorem VIII. We will show how to bound variance in
Lemma 21, bound the amount of drift in Lemma 20, and show progress made in one step in Lemma 22. In all of these we
do not use convexity, but strongly rely on the functions being quadratics. Then we combine these to derive the progress
made by the server in one round—for this we need weak-convexity to argue that averaging the parameters does not hurt
convergence too much. As before, it is straightforward to derive rates of convergence from the one-round progress using Lemmas 1 and 2.
$$f_i(x) - f_i(x_i^\star) = \frac{1}{2}\|x - x_i^\star\|_{A_i}^2 \ \text{ for } i \in [N]\,, \quad\text{and}\quad f(x) = \frac{1}{2}\|x - x^\star\|_A^2 \ \text{ for all } x\,,$$
for some $\{x_i^\star\}$ and $x^\star$, with $A := \frac{1}{N}\sum_{i=1}^N A_i$. We also assume that $A$ is a symmetric matrix, though this requirement is easily relaxed. Note that this implies $f(x^\star) = 0$ and that $\nabla f_i(x) = A_i(x - x_i^\star)$. If the $\{f_i\}$ are additionally convex, we have that $x_i^\star$ is an optimum of $f_i$ and $x^\star$ the optimum of $f$. However, this is not necessarily true in general.
We will also focus on a simplified version of SCAFFOLD where, in each round $r$, client $i$ performs the following update starting from $y_{i,0}^r \leftarrow x^{r-1}$:
$$y_{i,k}^r = y_{i,k-1}^r - \eta\big(g_i(y_{i,k-1}^r) + \nabla f(x^{r-1}) - \nabla f_i(x^{r-1})\big)\,, \quad\text{or equivalently}$$
$$\mathbb{E}_{r-1,k-1}[y_{i,k}^r] = y_{i,k-1}^r - \eta A(y_{i,k-1}^r - x^\star) - \eta(A_i - A)(y_{i,k-1}^r - x^{r-1})\,. \qquad (28)$$
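For scalar quadratics, the equivalence of the two forms in (28) can be checked directly. The sketch below (with $\sigma = 0$ and illustrative variable names) verifies that a corrected local step $y - \eta\big(g_i(y) + \nabla f(x^{r-1}) - \nabla f_i(x^{r-1})\big)$ coincides with the drift form $y - \eta A(y - x^\star) - \eta(A_i - A)(y - x^{r-1})$.

```python
import random

random.seed(1)
N = 5
a = [random.uniform(0.5, 2.0) for _ in range(N)]   # A_i (scalar curvatures)
b = [random.uniform(-1.0, 1.0) for _ in range(N)]  # x_i^star (client optima)
A = sum(a) / N
x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(a)   # minimizer of f = (1/N) sum_i f_i

grad_i = lambda i, y: a[i] * (y - b[i])                  # grad f_i(y) = A_i (y - x_i^star)
grad_f = lambda y: sum(grad_i(i, y) for i in range(N)) / N

eta, x_prev = 0.1, 0.3
for i in range(N):
    y = random.uniform(-1.0, 1.0)
    # one corrected local step of (28), with sigma = 0
    lhs = y - eta * (grad_i(i, y) + grad_f(x_prev) - grad_i(i, x_prev))
    # the equivalent drift form from (28)
    rhs = y - eta * A * (y - x_star) - eta * (a[i] - A) * (y - x_prev)
    assert abs(lhs - rhs) < 1e-12
```

The equivalence is pure algebra: $\nabla f_i(y) - \nabla f_i(x^{r-1}) = A_i(y - x^{r-1})$ and $\nabla f(x^{r-1}) = A(x^{r-1} - x^\star)$, which regroup into the two terms of the drift form.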
Proof. Observe that the variables $\{y_{i,k} - x\}$ are independent of each other (the only source of randomness being the local gradient computations). The rest of the proof is exactly that of Lemma 4. Dropping superscripts everywhere,
$$
\begin{aligned}
\mathbb{E}_{r-1}\|A(x_k^r - x^\star)\|^2 &= \mathbb{E}_{r-1}\Big\|\frac{1}{N}\sum_i A(y_{i,k} - x^\star)\Big\|^2 \\
&= \mathbb{E}_{r-1}\Big\|\frac{1}{N}\sum_i A\big(\mathbb{E}_{r-1}[y_{i,k}] - x^\star\big)\Big\|^2 + \mathbb{E}_{r-1}\Big\|\frac{1}{N}\sum_i A\big(\mathbb{E}_{r-1}[y_{i,k}] - y_{i,k}\big)\Big\|^2 \\
&= \mathbb{E}_{r-1}\Big\|\frac{1}{N}\sum_i A\big(\mathbb{E}_{r-1}[y_{i,k}] - x^\star\big)\Big\|^2 + \frac{1}{N^2}\sum_i \mathbb{E}_{r-1}\big\|A\big(\mathbb{E}_{r-1}[y_{i,k}] - y_{i,k}\big)\big\|^2 \\
&\le \mathbb{E}_{r-1}\Big\|\frac{1}{N}\sum_i A\big(\mathbb{E}_{r-1}[y_{i,k}] - x^\star\big)\Big\|^2 + \frac{1}{N^2}\sum_i \mathbb{E}_{r-1}\|A(y_{i,k} - x^\star)\|^2\,.
\end{aligned}
$$
The third equality holds because the $\{y_{i,k}\}$ are independent of each other conditioned on everything before round $r$.
Proof. Since $f$ is $\delta$-weakly convex, it follows that the function $z \mapsto f(z) + \delta\big(1 + \frac{1}{K}\big)^{K-k}\|z - x\|_2^2$ is convex in $z$ for any $k \in [K]$. The lemma now follows directly from convexity and the definition $x_k^r = \frac{1}{N}\sum_i y_{i,k}^r$.
Bounding drift of one client. We now see how the client drift of SCAFFOLD depends on $\delta$.

Lemma 20. For the update (28), assuming (A2) and that the $\{f_i\}$ are quadratics, the following holds for any $\eta \le \frac{1}{21\delta K}$:
$$\mathbb{E}_{r-1,k-1}\|y_{i,k}^r - x^{r-1}\|^2 \le \Big(1 + \frac{1}{2K}\Big)\|y_{i,k-1}^r - x^{r-1}\|^2 + 7K\eta^2\|\nabla f(y_{i,k-1}^r)\|^2 + \eta^2\sigma^2\,.$$
Note that if $K = 1$, then the first inequality directly proves the lemma. For the second inequality, we assumed $K \ge 2$ and then applied our relaxed triangle inequality. By assumption A2, we have the following for $\eta\delta \le 1$:
$$\big\|(I - \eta(A_i - A))^2\big\| = \|I - \eta(A_i - A)\|^2 \le (1 + \eta\delta)^2 \le 1 + 3\eta\delta\,.$$
Using the bound on the step-size $\eta \le \frac{1}{21\delta K}$ gives
$$\mathbb{E}_{r-1,k-1}\|y_i^+ - x\|^2 \le \Big(1 + \frac{1}{7K}\Big)\Big(1 + \frac{1}{7(K-1)}\Big)\|y_i - x\|^2 + 7K\eta^2\|A(y_i - x^\star)\|^2 + \eta^2\sigma^2\,.$$
Tracking the variance. We now bound the variance of the output.

Lemma 21. Consider the update (28) for quadratic $\{f_i\}$ with $\eta \le \min\big(\frac{1}{2\delta K}, \frac{1}{\beta}\big)$. Then, if assumptions (A2), (A5) and (A4) are satisfied, we have
$$\mathbb{E}_{r-1} f(x^r) \le f\big(\mathbb{E}_{r-1}[x^r]\big) + \frac{3K\beta\eta^2\sigma^2}{N}\,.$$
Further, if the $\{f_i\}$ are strongly convex satisfying (A3), we have
$$\mathbb{E}_{r-1} f(x^r) \le f\big(\mathbb{E}_{r-1}[x^r]\big) + \frac{\beta\eta^2\sigma^2}{N}\sum_{k=1}^K (1 - \mu\eta)^{k-1}\,.$$
where by the bounded variance assumption A4, $\zeta_{i,k}$ is a random variable satisfying $\mathbb{E}_{k-1,r-1}[\zeta_{i,k}] = 0$ and $\mathbb{E}_{k-1,r-1}\|\zeta_{i,k}\|^2 \le \sigma^2$. Subtracting $x^\star$ from both sides and unrolling the recursion gives the analogous expression with the noise terms $(I - \eta A_i)^{k-1}\zeta_{i,k}$ included. Similarly, the expected iterate satisfies the same equation without the $\zeta_{i,k}$:
$$\mathbb{E}_{r-1}[y_{i,K}] - x^\star = (I - \eta A_i)^K(x - x^\star) - \eta\sum_{k=1}^K (I - \eta A_i)^{k-1}(A - A_i)(x - x^\star)\,.$$
Hence,
$$
\begin{aligned}
\mathbb{E}_{r-1}\|x_k^r - x^\star\|_A^2 &= \|\mathbb{E}_{r-1}[x_k^r] - x^\star\|_A^2 + \frac{\eta^2}{N^2}\,\mathbb{E}_{r-1}\Big\|\sum_{i,k}(I - \eta A_i)^{k-1}\zeta_{i,k}\Big\|_A^2 \\
&\le \|\mathbb{E}_{r-1}[x_k^r] - x^\star\|_A^2 + \frac{\beta\eta^2}{N^2}\,\mathbb{E}_{r-1}\sum_{i,k}\big\|(I - \eta A_i)^{k-1}\zeta_{i,k}\big\|_2^2\,.
\end{aligned}
$$
The last inequality used the smoothness of $f$, and the one before it relied on the independence of the $\zeta_{i,k}$. Now, if $f_i$ is weakly convex, we have for $\eta \le \frac{1}{2\delta K}$ that $I - \eta A_i \preceq \big(1 + \frac{1}{2K}\big)I$ and hence
$$\big\|(I - \eta A_i)^{k-1}\zeta_{i,k}\big\|_2^2 \le \sigma^2\Big(1 + \frac{1}{2K}\Big)^{2(k-1)} \le 3\sigma^2\,.$$
This proves the second statement of the lemma. For strongly convex functions, we have for $\eta \le \frac{1}{\beta}$ that
$$\big\|(I - \eta A_i)^{k-1}\zeta_{i,k}\big\|_2^2 \le \sigma^2(1 - \eta\mu)^{2(k-1)} \le \sigma^2(1 - \eta\mu)^{k-1}\,.$$
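The two geometric bounds used in this proof are easy to sanity-check numerically; the sketch below (with illustrative values of $K$ and $\eta\mu$) confirms $\big(1 + \frac{1}{2K}\big)^{2(k-1)} \le 3$ and $(1 - \eta\mu)^{2(k-1)} \le (1 - \eta\mu)^{k-1}$ for $k \in [K]$.

```python
# Weakly convex case: (1 + 1/(2K))^{2(k-1)} <= (1 + 1/(2K))^{2K} <= e < 3.
for K in (1, 2, 5, 10, 100):
    for k in range(1, K + 1):
        assert (1 + 1 / (2 * K)) ** (2 * (k - 1)) <= 3.0

# Strongly convex case: squaring a factor in [0, 1] only shrinks it,
# so (1 - eta*mu)^{2(k-1)} <= (1 - eta*mu)^{k-1}.
for eta_mu in (0.01, 0.1, 0.5, 1.0):
    for k in range(1, 21):
        assert (1 - eta_mu) ** (2 * (k - 1)) <= (1 - eta_mu) ** (k - 1)
```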
Lemma 22. For the update (28), the per-client errors satisfy
$$\xi_{i,k}^r \le \Big(1 - \frac{\mu\eta}{6}\Big)\xi_{i,k-1}^r - \frac{\eta}{6}\,\mathbb{E}_{r-1}\|\nabla f(y_{i,k-1}^r)\|^2 + 7\beta\eta^2\sigma^2\,, \quad\text{and}$$
$$\tilde{\xi}_{i,k}^r \le \Big(1 - \frac{\mu\eta}{6}\Big)\tilde{\xi}_{i,k-1}^r - \frac{\eta}{6}\,\big\|\nabla f\big(\mathbb{E}_{r-1}[y_{i,k-1}^r]\big)\big\|^2\,.$$
Proof. Recall that $\xi_{i,k}^r \ge 0$ is defined to be
$$\xi_{i,k}^r := \mathbb{E}_{r-1}[f(y_{i,k}^r)] - f(x^\star) + \delta\Big(1 + \frac{1}{K}\Big)^{K-k}\mathbb{E}_{r-1}\|y_{i,k}^r - x^{r-1}\|^2\,.$$
Let us start from the local update step (28) (dropping unnecessary subscripts and superscripts):
$$
\begin{aligned}
\mathbb{E}_{r-1,k-1}\|y_i^+ - x^\star\|_A^2 &\le \|y_i - x^\star\|_A^2 - 2\eta\big\langle A(y_i - x^\star),\, A(y_i - x^\star)\big\rangle + 2\eta\big\langle (A - A_i)(y_i - x),\, A(y_i - x^\star)\big\rangle \\
&\quad + \eta^2\big\|A(y_i - x^\star) + (A_i - A)(y_i - x)\big\|_A^2 + \beta\eta^2\sigma^2 \\
&\le \|y_i - x^\star\|_A^2 - \frac{3\eta}{2}\|A(y_i - x^\star)\|_2^2 + 2\eta\|(A - A_i)(y_i - x)\|_2^2 \\
&\quad + 2\eta^2\|A(y_i - x^\star)\|_A^2 + 2\eta^2\|(A_i - A)(y_i - x)\|_A^2 + \beta\eta^2\sigma^2 \\
&\le \|y_i - x^\star\|_A^2 - \Big(\frac{3\eta}{2} - 2\eta^2\beta\Big)\|A(y_i - x^\star)\|_2^2 + \beta\eta^2\sigma^2 + \delta^2(2\eta^2\beta + 2\eta)\|y_i - x\|_2^2 \\
&\le \|y_i - x^\star\|_A^2 - \Big(\frac{3\eta}{2} - 2\eta^2\beta\Big)\|A(y_i - x^\star)\|_2^2 + \beta\eta^2\sigma^2 + \frac{\delta}{10K}\|y_i - x\|_2^2\,.
\end{aligned}
$$
The second-to-last inequality used that $\|\cdot\|_A^2 \le \beta\|\cdot\|_2^2$ by (A5) and that $\|(A - A_i)(\cdot)\|_2 \le \delta\|\cdot\|_2$ by (A2). The final inequality used that $\eta \le \min\big(\frac{1}{10\beta}, \frac{1}{22\delta K}\big)$. Now, multiplying Lemma 20 by $\delta\big(1 + \frac{1}{K}\big)^{K-k} \le \frac{20\delta}{7}$, we have
$$
\begin{aligned}
\delta\Big(1 + \frac{1}{K}\Big)^{K-k}\mathbb{E}_{r-1,k-1}\|y_i^+ - x\|^2 &\le \delta\Big(1 + \frac{1}{K}\Big)^{K-k}\Big(1 + \frac{1}{2K}\Big)\|y_i - x\|^2 + 20\delta K\eta^2\|A(y_i - x^\star)\|^2 + 3\delta\eta^2\sigma^2 \\
&\le \delta\Big(1 + \frac{1}{K}\Big)^{K-k}\Big(1 + \frac{1}{2K} + \frac{1}{10K}\Big)\|y_i - x\|^2 - \frac{\delta}{10K}\|y_i - x\|^2 \\
&\quad + 20\delta K\eta^2\|A(y_i - x^\star)\|^2 + 3\delta\eta^2\sigma^2 \\
&\le \Big(1 - \frac{1}{5K}\Big)\delta\Big(1 + \frac{1}{K}\Big)^{K-k+1}\|y_i - x\|^2 - \frac{\delta}{10K}\|y_i - x\|^2 \\
&\quad + 20\delta K\eta^2\|A(y_i - x^\star)\|^2 + 3\delta\eta^2\sigma^2\,.
\end{aligned}
$$
Adding this to our previous equation gives the following recursive bound:
$$
\begin{aligned}
\mathbb{E}_{r-1,k-1}\|y_i^+ - x^\star\|_A^2 &+ \delta\Big(1 + \frac{1}{K}\Big)^{K-k}\mathbb{E}_{r-1,k-1}\|y_i^+ - x\|^2 \\
&\le \|y_i - x^\star\|_A^2 + \Big(1 - \frac{1}{5K}\Big)\delta\Big(1 + \frac{1}{K}\Big)^{K-k+1}\|y_i - x\|^2 \\
&\quad - \Big(\frac{3\eta}{2} - 2\eta^2\beta - 20\delta K\eta^2\Big)\|A(y_i - x^\star)\|_2^2 + (3\delta + \beta)\eta^2\sigma^2\,.
\end{aligned}
$$
If $\eta \le \frac{1}{\mu K}$, then $\big(1 - \frac{1}{5K}\big) \le \big(1 - \frac{\mu\eta}{6}\big)$, and we have the strongly-convex version of the first statement.
Observe that for quadratics, $\mathbb{E}_{r-1}[\nabla f(x)] = \nabla f(\mathbb{E}_{r-1}[x])$. This implies that the analysis of $\tilde{\xi}_{i,k}^r$ is essentially that of a deterministic process with $\sigma = 0$, proving the second statement. It is also straightforward to repeat exactly the above argument to verify the second statement formally.
Server progress in one round. Now we combine the progress made by each client in one step to calculate the server progress.

Lemma 23. Suppose (A2), (A5) and (A4) hold, and the $\{f_i\}$ are quadratics. Then the following holds for the update (28) with $\eta \le \min\big(\frac{1}{10\beta}, \frac{1}{21\delta K}, \frac{1}{10\mu K}\big)$ and weights $w_k := \big(1 - \frac{\mu\eta}{6}\big)^{1-k}$:
$$\frac{\eta}{6}\sum_{k=1}^K w_k\,\mathbb{E}_{r-1}\|\nabla f(x_k^r)\|^2 \le \big(f(\mathbb{E}_{r-2}[x^{r-1}]) - f^\star\big) - w_K\big(f(\mathbb{E}_{r-1}[x^r]) - f^\star\big) + \sum_{k=1}^K w_k\,\frac{8\beta\eta^2\sigma^2}{N}\,.$$
Proof. Let us do the non-convex (and general convex) case first. Summing Lemma 22 over $k$, we have
$$\frac{\eta}{6}\sum_{k=1}^K \mathbb{E}_{r-1}\|\nabla f(y_{i,k}^r)\|^2 \le \xi_{i,0}^r - \xi_{i,K}^r + 7K\beta\eta^2\sigma^2\,.$$
A similar result holds with $\sigma = 0$ for $\mathbb{E}_{r-1}[y_{i,k}]$. Now, using Lemma 18 we have that
$$\frac{\eta}{6}\sum_{k=1}^K \mathbb{E}_{r-1}\|\nabla f(x_k^r)\|^2 \le \underbrace{\frac{1}{N}\sum_{i=1}^N\Big(\tilde{\xi}_{i,0}^r + \frac{1}{N}\xi_{i,0}^r\Big)}_{=:\theta_+^r} - \underbrace{\frac{1}{N}\sum_{i=1}^N\Big(\tilde{\xi}_{i,K}^r + \frac{1}{N}\xi_{i,K}^r\Big)}_{=:\theta_-^r} + \frac{7K\beta\eta^2\sigma^2}{N}\,,$$
proving the second part of the lemma for weights $w_k = 1$. The proof for strongly convex functions follows a very similar argument. Unrolling Lemma 22 using the weights $w_k := \big(1 - \frac{\mu\eta}{6}\big)^{1-k}$ gives
$$\frac{\eta}{6}\sum_{k=1}^K w_k\,\mathbb{E}_{r-1}\|\nabla f(x_k^r)\|^2 \le \theta_+^r - w_K\theta_-^r + \sum_{k=1}^K w_k\,\frac{7\beta\eta^2\sigma^2}{N}\,.$$
As in the general-convex case, we can use Lemmas 19, 18 and 21 to prove that
$$\frac{\eta}{6}\sum_{k=1}^K w_k\,\mathbb{E}_{r-1}\|\nabla f(x_k^r)\|^2 \le \big(f(\mathbb{E}_{r-2}[x^{r-1}]) - f^\star\big) - w_K\big(f(\mathbb{E}_{r-1}[x^r]) - f^\star\big) + \sum_{k=1}^K w_k\,\frac{8\beta\eta^2\sigma^2}{N}\,.$$
Deriving final rates. The proof of Theorem VIII follows by appropriately unrolling Lemma 23. For weakly-convex functions, we can simply use Lemma 2 with the probabilities set as $p_k^r = \frac{1}{KR}$. For strongly-convex functions, we use $p_k^r \propto \big(1 - \frac{\mu\eta}{6}\big)^{1-rk}$ and follow the computations in Lemma 1.