
Zeno: Distributed Stochastic Gradient Descent with Suspicion-based Fault-tolerance

Anonymous Authors

1 Anonymous Institution, Anonymous City, Anonymous Region, Anonymous Country. Correspondence to: Anonymous Author <anon.email@domain.com>. Preliminary work. Under review by the International Conference on Machine Learning (ICML). Do not distribute.
Abstract

We present Zeno, a technique to make distributed machine learning, particularly Stochastic Gradient Descent (SGD), tolerant to an arbitrary number of faulty workers. This generalizes previous results that assumed a majority of non-faulty nodes; we need only assume one non-faulty worker. Our key idea is to suspect workers that are potentially defective. Since this is likely to lead to false positives, we use a ranking-based preference mechanism. We prove the convergence of SGD for non-convex problems under these scenarios. Experimental results show that Zeno outperforms existing approaches.

1. Introduction

In distributed machine learning, one of the hardest problems today is fault-tolerance. Faulty workers may take arbitrary actions or modify their portion of the data and/or models arbitrarily. In addition to deliberate adversarial attacks, it is also common for workers to have hardware or software failures, such as bit-flipping in the memory or communication media. While fault-tolerance has been studied for distributed machine learning (Blanchard et al., 2017; Chen et al., 2017; Yin et al., 2018; Feng et al., 2014; Su & Vaidya, 2016a;b; Alistarh et al., 2018), much of the work on fault-tolerant machine learning makes strong assumptions. For instance, a common assumption is that no more than 50% of the workers are faulty (Blanchard et al., 2017; Chen et al., 2017; Yin et al., 2018; Su & Vaidya, 2016a; Alistarh et al., 2018).

We present Zeno, a new technique that generalizes the failure model so that we only require at least one non-faulty (good) worker. In particular, faulty gradients may pretend to be good by behaving similarly to the correct gradients in variance and magnitude, making them hard to distinguish. It is also possible that in different iterations, different groups of workers are faulty, which means that we cannot simply identify any worker that is always faulty.

Figure 1. Parameter Server architecture. (1: Pull; 2: Gradient Computation; 3: Push; 4: Aggregation.)

We focus on the problem of Stochastic Gradient Descent (SGD). We use the Parameter Server (PS) architecture (Li et al., 2014a;b) for distributed SGD. As illustrated in Figure 1, processes are composed of the server nodes and worker nodes. In each SGD iteration, the workers pull the latest model from the servers, estimate the gradients using the locally sampled training data, then push the gradient estimators to the servers. The servers aggregate the gradient estimators, and update the model by using the aggregated gradients.

Our approach, in a nutshell, is the following. We treat each candidate gradient estimator as a suspect. We compute a score using a stochastic zero-order oracle. This ranking indicates how trustworthy the given worker is. Then, we take the average over the several candidates with the highest scores. This allows us to tolerate a large number of incorrect gradients. We prove that the convergence is as fast as fault-free SGD. The variance falls as the number of non-faulty workers increases.

To the best of our knowledge, this paper is the first to theoretically and empirically study cases where a majority of workers are faulty for non-convex problems. In summary, our contributions are:

• A new approach for SGD with fault-tolerance, that works with an arbitrarily large number of faulty nodes as long as there is at least one non-faulty node.
• Theoretically, the proposed algorithm converges as fast as distributed synchronous SGD without faulty workers, with the same asymptotic time complexity.

• Experimental results validating that 1) existing majority-based robust algorithms may fail even when the number of faulty workers is lower than the majority, and 2) Zeno gracefully handles such cases.

• The effectiveness of Zeno also extends to the case where the workers use disjoint local data to train the model, i.e., the local training data are not identically distributed across different workers. Theoretical and experimental analysis is also provided for this case.

2. Related work

Many approaches for improving failure tolerance are based on robust statistics. For instance, Chen et al. (2017); Su & Vaidya (2016a;b) use the geometric median as the aggregation rule. Yin et al. (2018) establishes statistical error rates for the marginal trimmed mean as the aggregation rule. Similar to these papers, our proposed algorithm also works under Byzantine settings.

There are also robust gradient aggregation rules that are not based on robust statistics. For example, Blanchard et al. (2017) propose Krum, which selects the candidate with the minimal local sum of Euclidean distances. DRACO (Chen et al., 2018) uses coding theory to ensure robustness.

Alistarh et al. (2018) proposes a fault-tolerant SGD variant different from the robust aggregation rules. The algorithm utilizes historical information, and achieves the optimal sample complexity. However, the algorithm requires an estimated upper bound on the variances of the stochastic gradients, which makes it less practical. Furthermore, no empirical results are provided.

In summary, the existing majority-based methods for synchronous SGD (Blanchard et al., 2017; Chen et al., 2017; Yin et al., 2018; Su & Vaidya, 2016a; Alistarh et al., 2018) assume that the non-faulty workers dominate the entire set of workers. Thus, such algorithms can trim the outliers from the candidates. However, in real-world failures or attacks, there is no guarantee that the number of faulty workers can be bounded from above.

3. Model

We consider the following optimization problem:

    min_{x ∈ R^d} F(x),

where F(x) = E_{z∼D}[f(x; z)], z is sampled from some unknown distribution D, and d is the number of dimensions. We assume that there exists a minimizer of F(x), which is denoted by x*.

We solve this problem in a distributed manner with m workers. In each iteration, each worker samples n independent and identically distributed (i.i.d.) data points from the distribution D, and computes the gradient of the local empirical loss F_i(x) = (1/n) Σ_{j=1}^{n} f(x; z^{i,j}), ∀i ∈ [m], where z^{i,j} is the jth sampled data point on the ith worker. The servers collect and aggregate the gradients sent by the workers, and update the model as follows:

    x^{t+1} = x^t − γ^t Aggr({g_i(x^t) : i ∈ [m]}),

where Aggr(·) is an aggregation rule (e.g., averaging), and

    g_i(x^t) = *  if the ith worker is faulty,   g_i(x^t) = ∇F_i(x^t)  otherwise,     (1)

where "*" represents arbitrary values.

Formally, we define the failure model in synchronous SGD as follows.

Definition 1. (Failure Model) In the tth iteration, let {v_i^t : i ∈ [m]} be i.i.d. random vectors in R^d, where v_i^t = ∇F_i(x^t). The set of correct vectors {v_i^t : i ∈ [m]} is partially replaced by faulty vectors, which results in {ṽ_i^t : i ∈ [m]}, where ṽ_i^t = g_i(x^t) as defined in Equation (1). In other words, a correct/non-faulty gradient is ∇F_i(x^t), while a faulty gradient, marked as "*", is assigned an arbitrary value. We assume that q out of the m vectors are faulty, where q < m. Furthermore, the indices of the faulty workers can change across different iterations.

We observe that in the worst case, the failure model in Definition 1 is equivalent to the Byzantine failures introduced in Blanchard et al. (2017); Chen et al. (2017); Yin et al. (2018). In particular, if the failures are caused by attackers, the failure model includes the case where the attackers can collude.

To help understand the failure model in synchronous SGD, we illustrate a toy example in Figure 2.

The notations used in this paper are summarized in Table 1.

Table 1. Notations
    Notation    Description
    m           Number of workers
    n           Number of samples on each worker
    T           Number of epochs
    [m]         Set of integers {1, . . . , m}
    q           Number of faulty workers
    b           Trim parameter of Zeno
    γ           Learning rate
    ρ           Regularization weight of Zeno
    n_r         Batch size of Zeno
    ‖·‖         All the norms in this paper are l2-norms
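To make the update rule and the failure model of Equation (1) concrete, the following is a minimal single-process simulation of one synchronous round. This is our own sketch (NumPy, a toy quadratic loss, and helper names such as worker_gradient and aggregate are ours, not from the paper), intended only to illustrate the protocol, not the authors' implementation.

import numpy as np

def worker_gradient(x, local_batch):
    # Toy quadratic loss f(x; z) = 0.5 * ||x - z||^2, whose gradient is x - z.
    return np.mean(x[None, :] - local_batch, axis=0)

def synchronous_round(x, local_batches, faulty_ids, aggregate, lr):
    # Steps 1-2: workers pull x and compute gradient estimators on local samples.
    grads = [worker_gradient(x, batch) for batch in local_batches]
    # Failure model (Eq. 1): faulty workers may push arbitrary vectors instead.
    for i in faulty_ids:
        grads[i] = -10.0 * grads[i]            # "*" can be any value
    # Steps 3-4: the server aggregates and updates x^{t+1} = x^t - gamma * Aggr(...).
    return x - lr * aggregate(grads)

With aggregate = lambda grads: np.mean(grads, axis=0), this reduces to plain averaging; the aggregation rules discussed below plug in at the same place.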
Figure 2. A toy example of the failure model in synchronous SGD. There are m = 7 candidate gradient estimators. The black dots represent the correct gradients, where ṽ_i = ∇F_i(x^t), i ∈ [m − 1]. The red dot represents the faulty gradient, whose value (in the worst case) is ṽ_m = ε∇F_m(x^t), where ε < 0 is a large negative constant. The blue dashed circle represents the expectation of the true gradient ∇F(x^t). Thus, the averaged gradient computed by the server, represented by the green dot, is far away from the true gradient, which is harmful to the model training.

Algorithm 1 Zeno

Server:
    Input: ρ (defined in Definition 2), b (defined in Definition 3)
    x^0 ← rand()   {Initialization}
    for t = 1, . . . , T do
        Broadcast x^{t−1} to all the workers
        Wait until all the gradients {ṽ_i^t : i ∈ [m]} arrive
        Draw the samples for evaluating the stochastic descendant score f_r^t(·) as defined in Definition 2
        Compute ṽ̄^t = Zeno_b({ṽ_i^t : i ∈ [m]}) as defined in Definition 3
        Update the parameter x^t ← x^{t−1} − γ^t ṽ̄^t
    end for

Worker i = 1, . . . , m:
    for t = 1, . . . , T do
        Receive x^t from the server
        Draw the samples, compute, and send the gradient v_i^t = ∇F_i^t(x^t) to the server
    end for

4. Methodology

In contrast to the existing majority-based methods, we compute a score for each candidate gradient estimator by using a stochastic zero-order oracle. We rank each candidate gradient estimator based on the estimated descent of the loss function and its magnitude. Then, the algorithm aggregates the candidates with the highest scores. The score roughly indicates how trustworthy each candidate is.

Definition 2. (Stochastic Descendant Score) Denote f_r(x) = (1/n_r) Σ_{i=1}^{n_r} f(x; z_i), where the z_i's are i.i.d. samples drawn from D, and n_r is the batch size of f_r(·). Note that E[f_r(x)] = F(x). For any update (gradient estimator) u, based on the current parameter x, learning rate γ, and a constant weight ρ > 0, we define its stochastic descendant score as follows:

    Score_{γ,ρ}(u, x) = f_r(x) − f_r(x − γu) − ρ‖u‖².

The score defined in Definition 2 is composed of two parts: the estimated descendant of the loss function, and the magnitude of the update. The score increases when the estimated descendant of the loss function, f_r(x) − f_r(x − γṽ_i), increases. The score decreases when the magnitude of the update, ‖ṽ_i‖², increases. Intuitively, a larger descendant suggests faster convergence, and a smaller magnitude suggests a smaller step size. Even if a gradient is faulty, a smaller step size makes it less harmful and easier to be cancelled out by the correct gradients.

Using the score defined above, we establish the following suspicion-based aggregation rule. We ignore the index of iterations, t, for convenience.

Definition 3. (Suspicion-based Aggregation) Assume that among the gradient estimators {ṽ_i : i ∈ [m]}, q elements are faulty, and x is the current value of the parameters. We sort the sequence by the stochastic descendant score defined in Definition 2, which results in {ṽ_(i) : i ∈ [m]}, where

    Score_{γ,ρ}(ṽ_(1), x) ≥ . . . ≥ Score_{γ,ρ}(ṽ_(m), x).

In other words, ṽ_(i) is the vector with the ith highest score in {ṽ_i : i ∈ [m]}. The proposed aggregation rule, Zeno, aggregates the gradient estimators by taking the average of the first m − b elements in {ṽ_(i) : i ∈ [m]} (the gradient estimators with the (m − b) highest scores), where m > b ≥ q:

    Zeno_b({ṽ_i : i ∈ [m]}) = (1/(m − b)) Σ_{i=1}^{m−b} ṽ_(i).

Note that the z_i's (in Definition 2) are independently sampled in different iterations. Furthermore, in each iteration, the z_i's are sampled after the arrival of the candidate gradient estimators ṽ_i^t on the server. Since the faulty workers are not predictive, they cannot obtain the exact information of f_r(·), which means that the faulty gradients are independent of f_r(·), though the faulty workers can know E[f_r(·)].

Using Zeno as the aggregation rule, the detailed distributed synchronous SGD is shown in Algorithm 1.

In Figure 3, we visualize the intuition underlying Zeno. It is illustrated that all the selected candidates (arrows pointing inside the black dashed circle) are bounded by at least one honest candidate. In other words, Zeno uses at least one honest candidate to establish a boundary (the black dashed circle), which filters out the potentially harmful candidates. The candidates inside the boundary are supposed to be harmless, no matter whether they are actually faulty or not.
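The server-side rule of Definitions 2 and 3 can be written down in a few lines. The sketch below is ours (NumPy; f_r is assumed to be a callable mini-batch loss built from n_r freshly drawn samples), not the authors' released implementation:

import numpy as np

def zeno_aggregate(grads, x, f_r, gamma, rho, b):
    # Definition 2: Score(u, x) = f_r(x) - f_r(x - gamma*u) - rho*||u||^2.
    f_x = f_r(x)                                   # evaluated once per iteration
    scores = np.array([f_x - f_r(x - gamma * g) - rho * np.dot(g, g) for g in grads])
    # Definition 3: keep the m - b candidates with the highest scores and average them.
    keep = np.argsort(scores)[::-1][: len(grads) - b]
    return np.mean(np.stack([grads[i] for i in keep]), axis=0)

With one evaluation of f_r at x and one at each of the m shifted points, the server performs m + 1 loss evaluations per iteration, which is consistent with the 21 × 4 count of O(d) operations discussed in Section 6.7 for m = 20 and n_r = 4.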
Figure 3. Zeno on loss surface contours. We use the notations of Definitions 2 and 3. The black dot is the current parameter x. The arrows are the candidate updates {ṽ_i : i ∈ [m]}. Red arrows are the incorrect updates. Green arrows are the correct updates. Taking b = 3, Zeno filters out the 3 arrows pointing outside the black dashed circle. These 3 updates have the least descendant of the loss function among all the updates. There are some incorrect updates (the red arrow) remaining inside the boundary. However, since they are bounded by the correct updates, the remaining incorrect updates are harmless.

5. Theoretical Guarantees

In this section, we prove the convergence of synchronous SGD with Zeno as the aggregation rule under our failure model. We start with the assumptions required by the convergence guarantees. The two basic assumptions are the smoothness of the loss function, and the bounded variance of the (non-faulty) gradient estimators.

5.1. Assumptions

In this section, we highlight the necessary assumption for the stochastic descendant score, followed by the assumptions for the convergence guarantees.

Assumption 1. (Unbiased Evaluation) We assume that the stochastic loss function, f_r(x), evaluated in the stochastic descendant score in Definition 2, is an unbiased estimator of the global loss function F(x). In other words, E[f_r(x)] = F(x).

Note that we do not make any assumption on the Zeno batch size n_r or on the variance of f_r(x).

Assumption 2. (Bounded Taylor's Approximation) We assume that f(x; z) has L-smoothness and µ-lower-bounded Taylor's approximation (also called µ-weak convexity):

    ⟨∇f(x; z), y − x⟩ + (µ/2)‖y − x‖² ≤ f(y; z) − f(x; z) ≤ ⟨∇f(x; z), y − x⟩ + (L/2)‖y − x‖²,

where µ ≤ L, and L > 0.

Note that Assumption 2 covers the case of non-convexity by taking µ < 0, non-strong convexity by taking µ = 0, and strong convexity by taking µ > 0.

Assumption 3. (Bounded Variance) We assume that in any iteration, any correct gradient estimator v_i = ∇F_i(x) has upper-bounded variance: E‖v_i − E[v_i]‖² ≤ V. Furthermore, we assume that E‖v_i‖² ≤ G.

In general, Assumption 3 bounds the variance and the second moment of the correct gradients of any sample loss function f(x; z), ∀z ∼ D.

Remark 1. Note that for the faulty gradients in our failure model, none of the assumptions above holds.

5.2. Convergence Guarantees

For general functions, including convex and non-convex functions, we provide the following convergence guarantee. The proof can be found in the appendix.

Theorem 1. For ∀x ∈ R^d, denote

    ṽ_i = *  if the ith worker is faulty,   ṽ_i = ∇F_i(x)  otherwise,

where i ∈ [m], with E[∇F_i(x)] = ∇F(x), and ṽ̄ = Zeno_b({ṽ_i : i ∈ [m]}). Taking γ ≤ 1/L, ρ = βγ²/2, and β > max(0, −µ), we have

    E[F(x − γṽ̄)] − F(x) ≤ −(γ/2)‖∇F(x)‖² + γ(b − q + 1)(m − q)V/(m − b)² + (L + β)γ²G/2.

Corollary 1. Take γ = 1/(L√T), ρ = βγ²/2, and β > max(0, −µ). Using Zeno, with E[∇F_i(x^t)] = ∇F(x^t) for ∀t ∈ {0, . . . , T}, after T iterations, we have

    (1/T) Σ_{t=0}^{T−1} E‖∇F(x^t)‖² ≤ O(1/√T) + O((b − q + 1)(m − q)/(m − b)²).
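As a rough numerical illustration (our own arithmetic, plugging in the experimental settings of Section 6: m = 20 workers, with (q, b) = (8, 9) and (12, 16)), the non-vanishing term of Corollary 1 evaluates to

    \frac{(b-q+1)(m-q)}{(m-b)^2} = \frac{(9-8+1)(20-8)}{(20-9)^2} = \frac{24}{121} \approx 0.20
    \qquad\text{and}\qquad
    \frac{(16-12+1)(20-12)}{(20-16)^2} = \frac{40}{16} = 2.5,

so the residual variance term grows quickly as the trim parameter b approaches m, which is consistent with the slower (but still convergent) behavior observed for q = 12 in Section 6.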
Now, we consider a more general case, where each worker has a disjoint (non-identically distributed) local dataset for training, which results in non-identically distributed gradient estimators. The server is still aware of the entire dataset. For example, in volunteer computing (Meeds et al., 2015; Miura & Harada, 2015), the server/coordinator can assign disjoint tasks/subsets of the training data to the workers, while the server holds the entire training dataset. In this scenario, we have the following convergence guarantee.

Corollary 2. Assume that

    F(x) = (1/m) Σ_{i∈[m]} E[F_i(x)],   E[F_i(x)] ≠ E[F_j(x)],

for ∀i, j ∈ [m], i ≠ j. For the stochastic descendant score, we still have E[f_r(x)] = F(x). Assumptions 1, 2, and 3 still hold. Take γ = 1/(L√T), ρ = βγ²/2, and β > max(0, −µ). Using Zeno, after T iterations, we have

    (1/T) Σ_{t=0}^{T−1} E‖∇F(x^t)‖² ≤ O(1/√T) + O(b/m) + O(b²(m − q)/(m²(m − b))).

These two corollaries tell us that when using Zeno as the aggregation rule, even if there are failures, the convergence rate can be as fast as fault-free distributed synchronous SGD. The variance decreases when the number of workers m increases, or the estimated number of faulty workers b decreases.

Remark 2. There are two practical concerns for the proposed algorithm. First, by increasing the batch size of f_r(·) (n_r in Definition 2), the stochastic descendant score will potentially be more stable. However, according to Theorem 1 and Corollaries 1 and 2, the convergence rate is independent of the variance of f_r. Thus, theoretically we can use a single sample to evaluate the stochastic descendant score. Second, theoretically we need a larger ρ for non-convex problems. However, a larger ρ makes Zeno less sensitive to the descendant of the loss function, which potentially increases the risk of aggregating harmful candidates. In practice, we can use a small ρ by assuming local convexity of the loss functions.

5.3. Implementation Details: Time Complexity

Unlike the majority-based aggregation rules, the time complexity of Zeno is not trivial to analyze. Note that the convergence rate is independent of the variance of f_r, which means that we can use a single sample (n_r = 1) to evaluate f_r and achieve the same convergence rate. Furthermore, in general, when evaluating the loss function on a single sample, the time complexity is roughly linear in the number of parameters d. Thus, informally, the time complexity of Zeno is O(dm) for one iteration, which is the same as the Mean and Median aggregation rules. Note that the time complexity of Krum is O(dm²).

6. Experiments

In this section, we evaluate the fault tolerance of the proposed algorithm. We summarize our results here:

• Compared to the baselines, Zeno shows better convergence with more faulty workers than non-faulty ones.
• Zeno is robust to the choices of the hyperparameters, including the Zeno batch size n_r, the weight ρ, and the number of trimmed elements b.
• Zeno also works when training with disjoint local data.

6.1. Datasets and Evaluation Metrics

We conduct experiments on the benchmark CIFAR-10 image classification dataset (Krizhevsky & Hinton, 2009), which is composed of 50k images for training and 10k images for testing. We use a convolutional neural network (CNN) with 4 convolutional layers followed by 1 fully connected layer. The detailed network architecture can be found in our submitted source code (which will also be released upon publication). In each experiment, we launch 20 worker processes. We repeat each experiment 10 times and take the average. We use top-1 accuracy on the testing set and the cross-entropy loss on the training set as the evaluation metrics.

6.1.1. Baselines

We use averaging without failures/attacks as the gold standard, which is referred to as Mean without failures. Note that this method is not affected by b or q. The baseline aggregation rules are Mean, Median, and Krum as defined below.

Definition 4. (Median (Yin et al., 2018)) We define the marginal median aggregation rule Median(·) as med = Median({ṽ_i : i ∈ [m]}), where for any j ∈ [d], the jth dimension of med is med_j = median({(ṽ_1)_j, . . . , (ṽ_m)_j}), (ṽ_i)_j is the jth dimension of the vector ṽ_i, and median(·) is the one-dimensional median.

Definition 5. (Krum (Blanchard et al., 2017))

    Krum_b({ṽ_i : i ∈ [m]}) = ṽ_k,   k = argmin_{i∈[m]} Σ_{i→j} ‖ṽ_i − ṽ_j‖²,

where i → j is the index set of the m − b − 2 nearest neighbours of ṽ_i in {ṽ_i : i ∈ [m]}, measured by Euclidean distance.

Note that Krum requires 2b + 2 < m. Thus, b = 8 is the largest value we can take.
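For reference, a compact rendering of these two baselines (our own NumPy sketch of Definitions 4 and 5, not the code used in the experiments):

import numpy as np

def marginal_median(grads):
    # Definition 4: coordinate-wise (marginal) median of the candidate gradients.
    return np.median(np.stack(grads), axis=0)

def krum(grads, b):
    # Definition 5: return the candidate whose sum of squared distances to its
    # m - b - 2 nearest neighbours is smallest.
    V = np.stack(grads)
    m = len(V)
    dist2 = np.sum((V[:, None, :] - V[None, :, :]) ** 2, axis=-1)
    scores = [np.sum(np.sort(np.delete(dist2[i], i))[: m - b - 2]) for i in range(m)]
    return V[int(np.argmin(scores))]

The requirement 2b + 2 < m appears here as the need for at least one neighbour in the score, i.e., m − b − 2 ≥ 1.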
Figure 4. Convergence on i.i.d. training data, without failures. Batch size on the workers is 100. Batch size of Zeno is n_r = 4. ρ = 0.0005. γ = 0.1. Each epoch has 25 iterations. Zeno performs similarly to Mean. (a) Top-1 accuracy on the testing set; (b) cross-entropy loss on the training set.

Figure 5. Convergence on i.i.d. training data, with label-flipping failures. Batch size on the workers is 100. Batch size of Zeno is n_r = 4. ρ = 0.0005. γ = 0.1. Each epoch has 25 iterations. Zeno outperforms all the baselines, especially when q = 12. (a, b) Top-1 accuracy on the testing set and cross-entropy loss on the training set, with q = 8; (c, d) the same metrics with q = 12.

6.2. No Failure

We first test the convergence when there are no failures. In all the experiments, we take the learning rate γ = 0.1, worker batch size 100, Zeno batch size n_r = 4, and ρ = 0.0005. Each worker computes the gradients on i.i.d. samples. For both Krum and Zeno, we take b = 4. The result is shown in Figure 4. We can see that Zeno converges as fast as Mean. Krum converges slightly slower, but the convergence rate is acceptable.

6.3. Label-flipping Failure

In this section, we test the fault tolerance to label-flipping failures. When such failures happen, the workers compute the gradients based on training data with "flipped" labels, i.e., any label ∈ {0, . . . , 9} is replaced by 9 − label. Such failures/attacks can be caused by data poisoning or software failures.

In all the experiments, we take the learning rate γ = 0.1, worker batch size 100, Zeno batch size n_r = 4, and ρ = 0.0005. Each non-faulty worker computes the gradients on i.i.d. samples.

The result is shown in Figure 5. As expected, Zeno can tolerate more than half faulty gradients. When q = 8, Zeno performs similarly to Krum. When q = 12, Zeno performs much better than the baselines. When there are faulty gradients, Zeno converges slower, but still has better convergence rates than the baselines.

6.4. Bit-flipping Failure

In this section, we test the fault tolerance to a more severe kind of failure. In such failures, the bits that control the sign of the floating-point numbers are flipped, due to some hardware failure. A faulty worker pushes the negative gradient instead of the true gradient to the servers. To make the failure even worse, one of the faulty gradients is copied to and overwrites the other faulty gradients, which means that all the faulty gradients have the same values.

In all the experiments, we take the learning rate γ = 0.1, worker batch size 100, Zeno batch size n_r = 4, and ρ = 0.0005. Each non-faulty worker computes the gradients on i.i.d. samples.

The result is shown in Figure 6. As expected, Zeno can tolerate more than half faulty gradients. Surprisingly, Mean performs well when q = 8. We will discuss this phenomenon in Section 6.7. Zeno outperforms all the baselines. When q = 12, Zeno is the only one avoiding catastrophic divergence. Zeno converges slower, but still has better convergence than the baselines.
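For completeness, a small sketch of how the two failure types above can be injected in a simulation (our own code; the helper names are not from the paper):

import numpy as np

def flip_labels(labels, num_classes=10):
    # Label-flipping failure (Section 6.3): each label is replaced by 9 - label.
    return (num_classes - 1) - labels

def bit_flipping_failure(grads, faulty_ids):
    # Bit-flipping failure (Section 6.4): the sign bit is flipped, so a faulty worker
    # pushes the negated gradient; one faulty gradient overwrites all the others.
    grads = list(grads)
    shared = -grads[faulty_ids[0]]
    for i in faulty_ids:
        grads[i] = shared
    return grads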
Figure 6. Convergence on i.i.d. training data, with bit-flipping failures. Batch size on the workers is 100. Batch size of Zeno is n_r = 4. ρ = 0.0005. γ = 0.1. Each epoch has 25 iterations. Zeno outperforms all the baselines, especially when q = 12. (a, b) Top-1 accuracy on the testing set and cross-entropy loss on the training set, with q = 8; (c, d) the same metrics with q = 12.

Figure 7. Convergence on disjoint (non-i.i.d.) training data, with label-flipping failures. Batch size on the workers is 100. Batch size of Zeno is n_r = 4. ρ = 0.0005. γ = 0.05. Each epoch has 25 iterations. Zeno outperforms all the baselines, especially when q = 12. (a, b) Top-1 accuracy on the testing set and cross-entropy loss on the training set, with q = 8; (c, d) the same metrics with q = 12.

6.5. Disjoint Local Training Data

In volunteer computing (Meeds et al., 2015; Miura & Harada, 2015), it is reasonable for the coordinator to assign disjoint tasks/datasets to different workers. As a result, each worker draws training samples from a different dataset. The server is still aware of the entire dataset. We conduct experiments in this scenario, as discussed in Corollary 2. We test Zeno under label-flipping failures. The results under bit-flipping failures can be found in the appendix. The results are shown in Figure 7. Due to the non-i.i.d. setting, it is more difficult to distinguish faulty gradients from non-faulty ones. In such bad cases, Zeno can still make reasonable progress, while the baselines, especially Krum, perform much worse.
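A minimal sketch of the coordinator-side data assignment used in this setting (our own code, assuming the server keeps the full index set while each worker receives a disjoint shard):

import numpy as np

def assign_disjoint_shards(num_samples, num_workers, seed=0):
    # Each worker trains only on its own disjoint shard, while the server,
    # which holds the entire dataset, still draws the n_r samples for f_r
    # from the full index range.
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(num_samples)
    return np.array_split(permutation, num_workers)   # list of disjoint index arrays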
Figure 8. Convergence on i.i.d. training data, with label-flipping failures, q = 8. Batch size on the workers is 100. γ = 0.1. Each epoch has 25 iterations. The Zeno batch size n_r is varied over {1, 2, 4, 8, 16}. (a) Top-1 accuracy on the testing set; (b) cross-entropy loss on the training set.

6.6. Hyperparameter Sensitivity

In Figure 8, we show the performance of Zeno with different batch sizes n_r. A larger n_r improves the convergence, but the gap is not significant; n_r = 1 still works. Zeno is also robust to different choices of the other hyperparameters ρ and b. The experiments can be found in the appendix.

6.7. Discussion

An interesting observation is that, when q = 8, Mean seems to have good performance, while it is not supposed to be fault-tolerant. The reason is that both label-flipping and bit-flipping failures do not change the magnitude of the gradients. When the number of faulty gradients q is less than half, it is possible that the faulty gradients are cancelled out by the non-faulty ones. However, when the magnitude is enlarged, Mean will fail, as pointed out in Xie et al. (2018). In general, we find that Zeno is more robust than the current state of the art. When the faulty workers dominate, Zeno is the only aggregator that converges in all experiments. When the correct workers dominate, Median can be an alternative with cheap computation.

The computational complexity of Zeno depends on the complexity of inference and the Zeno batch size n_r. These additional hyperparameters make a direct comparison to standard methods more challenging. If we take the approximation that the computational complexity of inference is linear in the number of parameters, then we can roughly compare the time complexity to the baselines. Compared to Median, Zeno is computationally more expensive by a factor of n_r = 4. However, compared to Krum, which requires 20 × 19/2 = 190 O(d) operations, Zeno only needs 21 × 4 = 84 O(d) operations. Furthermore, since the batch size on the workers is 100, the computation required on the server is less than that of one worker, which does not "cancel out" the computational improvements due to data parallelism. The additional computation is the cost that we have to pay for better robustness.

Another interesting observation is that, although Krum is the state-of-the-art algorithm, it does not perform as well as expected under our designed failures. The reason is that Krum requires the assumption that cσ < ‖g‖ for convergence, where c is a general constant, σ is the maximal variance of the gradients, and g is the gradient. Note that ‖g‖ → 0 when SGD converges to a critical point. Thus, such an assumption is never guaranteed to be satisfied if the variance is large. Furthermore, the better SGD converges, the less likely the assumption is to be satisfied. We can provide a 1-dimensional toy example to show what happens when the variance is large enough. Suppose there are 4 non-faulty gradients {0.2, 0.4, 1.6, 1.8} with mean 1, and 2 faulty gradients {−1, −1} with the opposite mean −1. According to Definition 5, taking b = 2, Krum_2({−1, −1, 0.2, 0.4, 1.6, 1.8}) = −1, which means that Krum chooses the faulty gradient when the variance is large enough. Note that it is easier to have larger variances when the dimension d gets higher.
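This 1-dimensional example is easy to check directly; the following self-contained snippet (ours, not from the paper) recomputes the Krum scores of Definition 5:

candidates = [-1.0, -1.0, 0.2, 0.4, 1.6, 1.8]          # 2 faulty, 4 correct gradients
m, b = len(candidates), 2
scores = []
for i, vi in enumerate(candidates):
    d2 = sorted((vi - vj) ** 2 for j, vj in enumerate(candidates) if j != i)
    scores.append(sum(d2[: m - b - 2]))                 # m - b - 2 = 2 nearest neighbours
print(candidates[min(range(m), key=scores.__getitem__)])   # prints -1.0

The two faulty values score 1.44 each, while every correct value scores at least 1.48, so Krum selects the faulty gradient.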
7. Conclusion

We propose a novel aggregation rule for synchronous SGD, which requires only the weak assumption that there is at least one honest worker. The algorithm has provable convergence. Our empirical results show good performance in practice. We will apply the proposed method to asynchronous SGD in future work.
References

Alistarh, D., Allen-Zhu, Z., and Li, J. Byzantine stochastic gradient descent. arXiv preprint arXiv:1803.08917, 2018.

Blanchard, P., Guerraoui, R., Stainer, J., et al. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, pp. 118–128, 2017.

Chen, L., Wang, H., Charles, Z., and Papailiopoulos, D. DRACO: Byzantine-resilient distributed training via redundant gradients. In International Conference on Machine Learning, pp. 902–911, 2018.

Chen, Y., Su, L., and Xu, J. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. POMACS, 1:44:1–44:25, 2017.

Feng, J., Xu, H., and Mannor, S. Distributed robust learning. arXiv preprint arXiv:1409.5937, 2014.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. 2009.

Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.-Y. Scaling distributed machine learning with the parameter server. In OSDI, volume 14, pp. 583–598, 2014a.

Li, M., Andersen, D. G., Smola, A. J., and Yu, K. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, pp. 19–27, 2014b.

Meeds, E., Hendriks, R., al Faraby, S., Bruntink, M., and Welling, M. MLitB: machine learning in the browser. PeerJ Computer Science, 1, 2015.

Miura, K. and Harada, T. Implementation of a practical distributed calculation system with browsers and javascript, and application to distributed deep learning. CoRR, abs/1503.05743, 2015.

Su, L. and Vaidya, N. H. Fault-tolerant multi-agent optimization: Optimal iterative distributed algorithms. In PODC, 2016a.

Su, L. and Vaidya, N. H. Defending non-bayesian learning against adversarial attacks. arXiv preprint arXiv:1606.08883, 2016b.

Xie, C., Koyejo, O., and Gupta, I. Phocas: dimensional Byzantine-resilient stochastic gradient descent. arXiv preprint arXiv:1805.09682, 2018.

Yin, D., Chen, Y., Ramchandran, K., and Bartlett, P. Byzantine-robust distributed learning: Towards optimal statistical rates. arXiv preprint arXiv:1803.01498, 2018.
Appendix

A. Proofs

A.1. Preliminaries

We use the following lemma to bound the aggregated vectors.

Lemma 1. (Bounded Score) Without loss of generality, we denote the m − q correct elements in {ṽ_i : i ∈ [m]} as {v_i : i ∈ [m − q]}. Sorting the correct vectors by the stochastic descendant score, we obtain {v_(i) : i ∈ [m − q]}. Then, we have the following inequality:

    Score_{γ,ρ}(ṽ_(i), x) ≥ Score_{γ,ρ}(v_(i), x), ∀i ∈ [m − q],

or, by flipping the signs on both sides, the equivalent statement

    f_r(x − γṽ_(i)) − f_r(x) + ρ‖ṽ_(i)‖² ≤ f_r(x − γv_(i)) − f_r(x) + ρ‖v_(i)‖², ∀i ∈ [m − q].

Proof. We prove the lemma by contradiction.

Assume that Score_{γ,ρ}(ṽ_(i), x) < Score_{γ,ρ}(v_(i), x). Then there are at least i correct vectors having greater scores than ṽ_(i). However, because ṽ_(i) is the ith element in {ṽ_(i) : i ∈ [m]}, there should be at most i − 1 vectors having greater scores than it, which yields a contradiction.

A.2. Convergence guarantees

For general non-strongly convex functions and non-convex functions, we provide the following convergence guarantees.

Theorem 1. For ∀x ∈ R^d, denote

    ṽ_i = *  if the ith worker is Byzantine,   ṽ_i = ∇F_i(x)  otherwise,

where i ∈ [m], and ṽ̄ = Zeno_b({ṽ_i : i ∈ [m]}). Taking γ ≤ 1/L and ρ = βγ²/2, where β = 0 if µ ≥ 0 and β ≥ |µ| otherwise, we have

    E[F(x − γṽ̄)] − F(x) ≤ −(γ/2)‖∇F(x)‖² + γ(b − q + 1)(m − q)V/(m − b)² + (L + β)γ²G/2.

Proof. Without loss of generality, we denote the m − q correct elements in {ṽ_i : i ∈ [m]} as {v_i : i ∈ [m − q]}, where E[v_i] = ∇F(x). Sorting the correct vectors by the stochastic descendant score, we obtain {v_(i) : i ∈ [m − q]}. We also sort the ṽ_i by the stochastic descendant score and obtain {ṽ_(i) : i ∈ [m]}.

According to the definition, ṽ̄ = Zeno_b({ṽ_i : i ∈ [m]}) = (1/(m − b)) Σ_{i=1}^{m−b} ṽ_(i). Furthermore, we denote v̄ = (1/(m − b)) Σ_{i=1}^{m−b} v_(i).

Using Assumption 2, we have

    f_r(x − γṽ_(i)) ≥ f_r(x − γṽ̄) + ⟨∇f_r(x − γṽ̄), γ(ṽ̄ − ṽ_(i))⟩ + (µγ²/2)‖ṽ̄ − ṽ_(i)‖²,

for ∀i ∈ [m − b].
By summing up (the inner-product terms cancel because Σ_{i=1}^{m−b} (ṽ̄ − ṽ_(i)) = 0), we have

    (1/(m − b)) Σ_{i=1}^{m−b} f_r(x − γṽ_(i)) ≥ f_r(x − γṽ̄) + (µγ²/(2(m − b))) Σ_{i=1}^{m−b} ‖ṽ̄ − ṽ_(i)‖².     (2)

Using Lemma 1, we have

    f_r(x − γṽ_(i)) + ρ‖ṽ_(i)‖² ≤ f_r(x − γv_(i)) + ρ‖v_(i)‖²,

for ∀i ∈ [m − b].

Combined with Equation (2), we have

    f_r(x − γṽ̄)
    ≤ (1/(m − b)) Σ_{i=1}^{m−b} f_r(x − γṽ_(i)) − (µγ²/(2(m − b))) Σ_{i=1}^{m−b} ‖ṽ̄ − ṽ_(i)‖²
    ≤ (1/(m − b)) Σ_{i=1}^{m−b} f_r(x − γv_(i)) + (ρ/(m − b)) Σ_{i=1}^{m−b} ‖v_(i)‖² − (ρ/(m − b)) Σ_{i=1}^{m−b} ‖ṽ_(i)‖² − (µγ²/(2(m − b))) Σ_{i=1}^{m−b} ‖ṽ̄ − ṽ_(i)‖².

We take ρ = βγ²/2, where β = 0 if µ ≥ 0, and β ≥ |µ| otherwise.

Thus, if µ ≥ 0, we have ρ = 0, which implies that

    (ρ/(m − b)) Σ_{i=1}^{m−b} ‖v_(i)‖² − (ρ/(m − b)) Σ_{i=1}^{m−b} ‖ṽ_(i)‖² − (µγ²/(2(m − b))) Σ_{i=1}^{m−b} ‖ṽ̄ − ṽ_(i)‖² ≤ (βγ²/(2(m − b))) Σ_{i=1}^{m−b} ‖v_(i)‖².

Also, if µ < 0, since β ≥ −µ, and using Σ_{i=1}^{m−b} ‖ṽ̄ − ṽ_(i)‖² = Σ_{i=1}^{m−b} (‖ṽ_(i)‖² − ‖ṽ̄‖²), we have

    (ρ/(m − b)) Σ_{i=1}^{m−b} ‖v_(i)‖² − (ρ/(m − b)) Σ_{i=1}^{m−b} ‖ṽ_(i)‖² − (µγ²/(2(m − b))) Σ_{i=1}^{m−b} ‖ṽ̄ − ṽ_(i)‖²
    = (βγ²/(2(m − b))) Σ_{i=1}^{m−b} ‖v_(i)‖² − (βγ²/(2(m − b))) Σ_{i=1}^{m−b} ‖ṽ_(i)‖² − (µγ²/(2(m − b))) Σ_{i=1}^{m−b} (‖ṽ_(i)‖² − ‖ṽ̄‖²)
    = (βγ²/(2(m − b))) Σ_{i=1}^{m−b} ‖v_(i)‖² + ((−β − µ)γ²/(2(m − b))) Σ_{i=1}^{m−b} ‖ṽ_(i)‖² + (µγ²/(2(m − b))) Σ_{i=1}^{m−b} ‖ṽ̄‖²
    ≤ (βγ²/(2(m − b))) Σ_{i=1}^{m−b} ‖v_(i)‖².

Thus, we have

    f_r(x − γṽ̄)
    ≤ (1/(m − b)) Σ_{i=1}^{m−b} f_r(x − γv_(i)) + (ρ/(m − b)) Σ_{i=1}^{m−b} ‖v_(i)‖² − (ρ/(m − b)) Σ_{i=1}^{m−b} ‖ṽ_(i)‖² − (µγ²/(2(m − b))) Σ_{i=1}^{m−b} ‖ṽ̄ − ṽ_(i)‖²
    ≤ (1/(m − b)) Σ_{i=1}^{m−b} f_r(x − γv_(i)) + (βγ²/(2(m − b))) Σ_{i=1}^{m−b} ‖v_(i)‖².
Using the L-smoothness, we have

    f_r(x − γv_(i)) ≤ f_r(x − γv̄) + ⟨∇f_r(x − γv̄), γ(v̄ − v_(i))⟩ + (Lγ²/2)‖v̄ − v_(i)‖²,

for ∀i ∈ [m − b]. By summing up (the cross term again vanishes since Σ_{i=1}^{m−b} (v̄ − v_(i)) = 0), we have

    (1/(m − b)) Σ_{i=1}^{m−b} f_r(x − γv_(i))
    ≤ f_r(x − γv̄) + (Lγ²/(2(m − b))) Σ_{i=1}^{m−b} ‖v̄ − v_(i)‖²
    ≤ f_r(x − γv̄) + (Lγ²/(2(m − b))) Σ_{i=1}^{m−b} ‖v_(i)‖².

Thus, we have

    f_r(x − γṽ̄)
    ≤ (1/(m − b)) Σ_{i=1}^{m−b} f_r(x − γv_(i)) + (βγ²/(2(m − b))) Σ_{i=1}^{m−b} ‖v_(i)‖²
    ≤ f_r(x − γv̄) + ((L + β)γ²/(2(m − b))) Σ_{i=1}^{m−b} ‖v_(i)‖².

Again, using the L-smoothness and taking γ ≤ 1/L, we have

    f_r(x − γv̄)
    ≤ f_r(x) + ⟨∇f_r(x), −γv̄⟩ + (Lγ²/2)‖v̄‖²
    ≤ f_r(x) + ⟨∇f_r(x), −γv̄⟩ + (γ/2)‖v̄‖².

Thus, we have

    f_r(x − γṽ̄) − f_r(x)
    ≤ f_r(x − γv̄) − f_r(x) + ((L + β)γ²/(2(m − b))) Σ_{i=1}^{m−b} ‖v_(i)‖²
    ≤ ⟨∇f_r(x), −γv̄⟩ + (γ/2)‖v̄‖² + ((L + β)γ²/(2(m − b))) Σ_{i=1}^{m−b} ‖v_(i)‖².

Conditional on the ṽ_(i)'s, taking the expectation w.r.t. f_r on both sides, we have

    F(x − γṽ̄) − F(x)
    ≤ ⟨∇F(x), −γv̄⟩ + (γ/2)‖v̄‖² + ((L + β)γ²/(2(m − b))) Σ_{i=1}^{m−b} ‖v_(i)‖²
    = −(γ/2)‖∇F(x)‖² + (γ/2)‖∇F(x) − v̄‖² + ((L + β)γ²/(2(m − b))) Σ_{i=1}^{m−b} ‖v_(i)‖².
Now, taking the expectation w.r.t. the ṽ_(i)'s on both sides and using E‖v_(i)‖² ≤ G, we have

    E[F(x − γṽ̄)] − F(x)
    ≤ −(γ/2)‖∇F(x)‖² + (γ/2)E‖∇F(x) − v̄‖² + ((L + β)γ²/(2(m − b))) Σ_{i=1}^{m−b} E‖v_(i)‖²
    ≤ −(γ/2)‖∇F(x)‖² + (γ/2)E‖∇F(x) − v̄‖² + (L + β)γ²G/2.

Now we just need to bound E‖∇F(x) − v̄‖². For convenience, we denote g = ∇F(x). Note that for an arbitrary subset S ⊆ [m − q] with |S| = m − b, we have the following bound:

    E‖(1/(m − b)) Σ_{i∈S} (v_i − g)‖²
    = E‖(1/(m − b)) [Σ_{i∈[m−q]} (v_i − g) − Σ_{i∉S} (v_i − g)]‖²
    ≤ 2E‖(1/(m − b)) Σ_{i∈[m−q]} (v_i − g)‖² + 2E‖(1/(m − b)) Σ_{i∉S} (v_i − g)‖²
    = (2(m − q)²/(m − b)²) E‖(1/(m − q)) Σ_{i∈[m−q]} (v_i − g)‖² + (2(b − q)²/(m − b)²) E‖(1/(b − q)) Σ_{i∉S} (v_i − g)‖²
    ≤ (2(m − q)²/(m − b)²) · V/(m − q) + (2(b − q)²/(m − b)²) · E[Σ_{i∈[m−q]} ‖v_i − g‖²]/(b − q)
    ≤ (2(m − q)²/(m − b)²) · V/(m − q) + (2(b − q)²/(m − b)²) · (m − q)V/(b − q)
    = 2(b − q + 1)(m − q)V/(m − b)².

Putting all the ingredients together, we obtain the desired result:

    E[F(x − γṽ̄)] − F(x) ≤ −(γ/2)‖∇F(x)‖² + γ(b − q + 1)(m − q)V/(m − b)² + (L + β)γ²G/2.

Corollary 1. Take γ = 1/(L√T) and ρ = βγ²/2, where β is the same as in Theorem 1. Using Zeno, after T iterations, we have

    (1/T) Σ_{t=0}^{T−1} E‖∇F(x^t)‖²
    ≤ [2L(F(x^0) − F(x*)) + (L + β)G/L] / √T + 2(b − q + 1)(m − q)V/(m − b)²
    = O(1/√T) + O((b − q + 1)(m − q)/(m − b)²).

Proof. Taking x = x^t and x^{t+1} = x^t − γ Zeno_b({ṽ_i : i ∈ [m]}), and using Theorem 1, we have

    E[F(x^{t+1})] − F(x^t) ≤ −(γ/2)‖∇F(x^t)‖² + γ(b − q + 1)(m − q)V/(m − b)² + (L + β)γ²G/2.
By telescoping and taking the total expectation, we have

    E[F(x^T)] − F(x^0) ≤ −(γ/2) Σ_{t=0}^{T−1} E‖∇F(x^t)‖² + γ(b − q + 1)(m − q)VT/(m − b)² + (L + β)γ²GT/2.

Taking γ = 1/(L√T), we have

    (1/T) Σ_{t=0}^{T−1} E‖∇F(x^t)‖²
    ≤ 2L(F(x^0) − E[F(x^T)])/√T + 2(b − q + 1)(m − q)V/(m − b)² + (L + β)G/(L√T)
    ≤ 2L(F(x^0) − F(x*))/√T + 2(b − q + 1)(m − q)V/(m − b)² + (L + β)G/(L√T)
    = O(1/√T) + O((b − q + 1)(m − q)/(m − b)²).

Corollary 2. Assume that

    F(x) = (1/m) Σ_{i∈[m]} E[F_i(x)],   and   E[F_i(x)] ≠ E[F_j(x)],

for ∀i, j ∈ [m], i ≠ j. For the stochastic descendant score, we still have E[f_r(x)] = F(x). Assumptions 1, 2, and 3 still hold. Take γ = 1/(L√T) and ρ = βγ²/2, where β is the same as in Theorem 1. Using Zeno, after T iterations, we have

    (1/T) Σ_{t=0}^{T−1} E‖∇F(x^t)‖²
    ≤ 2L(F(x^0) − F(x*))/√T + 4V/m + 4bG/m + 2b²(m − q)G/(m²(m − b)) + (L + β)G/(L√T)
    = O(1/√T) + O(b/m) + O(b²(m − q)/(m²(m − b))).

Proof. Similar to the proof of Theorem 1, we define ṽ̄ = Zeno_b({ṽ_i : i ∈ [m]}). Reusing the proof of Theorem 1, we have

    F(x − γṽ̄) − F(x) ≤ −(γ/2)‖∇F(x)‖² + (γ/2)‖∇F(x) − v̄‖² + ((L + β)γ²/(2(m − b))) Σ_{i=1}^{m−b} ‖v_(i)‖².

Now, taking the expectation w.r.t. the ṽ_(i)'s on both sides and using E‖v_(i)‖² ≤ G, we have

    E[F(x − γṽ̄)] − F(x)
    ≤ −(γ/2)‖∇F(x)‖² + (γ/2)E‖∇F(x) − v̄‖² + ((L + β)γ²/(2(m − b))) Σ_{i=1}^{m−b} E‖v_(i)‖²
    ≤ −(γ/2)‖∇F(x)‖² + (γ/2)E‖∇F(x) − v̄‖² + (L + β)γ²G/2.
Now we just need to bound E‖∇F(x) − v̄‖². We define S_1 = {∇F_i(x) : i ∈ [m]} \ {v_(i) : i ∈ [m − b]}. Note that |S_1| = b.

    E‖∇F(x) − v̄‖²
    = E‖(1/m) Σ_{i∈[m]} E[∇F_i(x)] − (1/(m − b)) Σ_{i=1}^{m−b} v_(i)‖²
    ≤ 2E‖(1/m) Σ_{i∈[m]} E[∇F_i(x)] − (1/m) Σ_{i=1}^{m−b} v_(i)‖² + 2E‖(1/m) Σ_{i=1}^{m−b} v_(i) − (1/(m − b)) Σ_{i=1}^{m−b} v_(i)‖²
    ≤ 4E‖(1/m) Σ_{i∈[m]} E[∇F_i(x)] − (1/m) Σ_{i=1}^{m} ∇F_i(x)‖² + 4E‖(1/m) Σ_{v∈S_1} v‖² + 2E‖(1/m) Σ_{i=1}^{m−b} v_(i) − (1/(m − b)) Σ_{i=1}^{m−b} v_(i)‖²
    ≤ 4V/m + (4b²/m²) E‖(1/b) Σ_{v∈S_1} v‖² + 2E‖(1/m) Σ_{i=1}^{m−b} v_(i) − (1/(m − b)) Σ_{i=1}^{m−b} v_(i)‖²
    ≤ 4V/m + (4b²/m²)(mG/b) + 2(1/m − 1/(m − b))² E‖Σ_{i=1}^{m−b} v_(i)‖²
    ≤ 4V/m + 4bG/m + 2(b/(m(m − b)))² (m − b)(m − q)G
    ≤ 4V/m + 4bG/m + 2b²(m − q)G/(m²(m − b)).

Thus, we have

    E[F(x − γṽ̄)] − F(x) ≤ −(γ/2)‖∇F(x)‖² + (γ/2)[4V/m + 4bG/m + 2b²(m − q)G/(m²(m − b))] + (L + β)γ²G/2.

Following the same procedure as in Corollary 1, and taking γ = 1/(L√T), we have

    (1/T) Σ_{t=0}^{T−1} E‖∇F(x^t)‖²
    ≤ 2L(F(x^0) − F(x*))/√T + 4V/m + 4bG/m + 2b²(m − q)G/(m²(m − b)) + (L + β)G/(L√T)
    = O(1/√T) + O(b/m) + O(b²(m − q)/(m²(m − b))).

B. Additional Experiments
Figure 9. Convergence on non-i.i.d. training data, without failures. Batch size on the workers is 100. Batch size of Zeno is n_r = 4. ρ = 0.0005. Learning rate γ = 0.05. Each epoch has 25 iterations. (a) Top-1 accuracy on the testing set; (b) cross-entropy loss on the training set.

Figure 10. Convergence on non-i.i.d. training data, with label-flipping failures. Batch size on the workers is 100. Batch size of Zeno is n_r = 4. ρ = 0.0005. Learning rate γ = 0.05. Each epoch has 25 iterations. (a, b) Top-1 accuracy on the testing set and cross-entropy loss on the training set, with q = 8; (c, d) the same metrics with q = 12.
Figure 11. Convergence on non-i.i.d. training data, with bit-flipping failures. Batch size on the workers is 100. Batch size of Zeno is n_r = 4. ρ = 0.0005. Learning rate γ = 0.05. Each epoch has 25 iterations. (a, b) Top-1 accuracy on the testing set and cross-entropy loss on the training set, with q = 8; (c, d) the same metrics with q = 12.

Figure 12. Convergence on i.i.d. training data, with label-flipping failures, q = 8. Batch size on the workers is 100. Batch size of Zeno is n_r = 4. Learning rate γ = 0.1. Each epoch has 25 iterations. ρ and b are tuned (legend: Zeno with b ∈ {8, 10, 12, 14, 16, 18}; x-axis: ρ ∈ {0.002, 0.001, 0.0005, 0.00025}). (a) Top-1 accuracy on the testing set; (b) cross-entropy loss on the training set.