
Zeno: Distributed Stochastic Gradient Descent with Suspicion-based Fault-tolerance

Appendix
A. Proofs
A.1. Preliminaries
We use the following lemma to bound the aggregated vectors.

Lemma 1 (Bounded Score). Without loss of generality, we denote the m − q correct elements in {ṽi : i ∈ [m]} as {vi : i ∈ [m − q]}. Sorting the correct vectors by the stochastic descendant score, we obtain {v(i) : i ∈ [m − q]}; similarly, sorting all m candidate vectors by the score, we obtain {ṽ(i) : i ∈ [m]}. Then, we have the following inequality:

\[
\mathrm{Score}_{\gamma,\rho}\big(\tilde{v}_{(i)}, x\big) \ge \mathrm{Score}_{\gamma,\rho}\big(v_{(i)}, x\big), \quad \forall i \in [m-q],
\]

or, by flipping the signs on both sides, equivalently,

\[
f_r\big(x - \gamma \tilde{v}_{(i)}\big) - f_r(x) + \rho \big\|\tilde{v}_{(i)}\big\|^2 \le f_r\big(x - \gamma v_{(i)}\big) - f_r(x) + \rho \big\|v_{(i)}\big\|^2, \quad \forall i \in [m-q].
\]
Proof. We prove the lemma by contradiction. Assume that Score_{γ,ρ}(ṽ(i), x) < Score_{γ,ρ}(v(i), x). Then the i correct vectors v(1), . . . , v(i) all have scores greater than that of ṽ(i). However, because ṽ(i) is the ith element of the sorted sequence {ṽ(i) : i ∈ [m]}, at most i − 1 vectors can have greater scores than it, which yields a contradiction.
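For intuition, the following is a minimal numerical sketch of the order-statistics fact behind Lemma 1 (ours, not from the paper; m, q, and the score values are arbitrary illustrative assumptions): the ith-highest score among all m candidates is always at least the ith-highest score among any m − q of them.

```python
import numpy as np

rng = np.random.default_rng(0)
m, q = 10, 3
scores = rng.normal(size=m)  # illustrative scores; higher is better

all_sorted = np.sort(scores)[::-1]              # all m candidates, descending
correct_sorted = np.sort(scores[:m - q])[::-1]  # the m - q "correct" ones, descending

# Lemma 1's order-statistics claim: the i-th best overall dominates
# the i-th best within the correct subset, for every i in [m - q].
assert np.all(all_sorted[:m - q] >= correct_sorted)
```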
A.2. Convergence guarantees

For general non-strongly convex functions and non-convex functions, we provide the following convergence guarantees.
Theorem 1. For ∀x ∈ ℝ^d, denote

\[
\tilde{v}_i =
\begin{cases}
* & \text{if the $i$th worker is Byzantine}, \\
\nabla F_i(x) & \text{otherwise},
\end{cases}
\]

where i ∈ [m], and denote the aggregated vector ṽ̄ = Zeno_b({ṽi : i ∈ [m]}). Taking γ ≤ 1/L and ρ = βγ²/2, where

\[
\begin{cases}
\beta = 0, & \text{if } \mu \ge 0; \\
\beta \ge |\mu|, & \text{otherwise},
\end{cases}
\]

we have

\[
\mathbb{E}\big[F(x - \gamma \bar{\tilde{v}})\big] - F(x) \le -\frac{\gamma}{2}\,\big\|\nabla F(x)\big\|^2 + \frac{\gamma(b-q+1)(m-q)V}{(m-b)^2} + \frac{(L+\beta)\gamma^2 G}{2}.
\]
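As a quick numerical reading of the bound (illustrative values, not from the paper), take m = 20 workers, b = 9, and q = 8:

\[
\frac{(b-q+1)(m-q)}{(m-b)^2} = \frac{(9-8+1)(20-8)}{(20-9)^2} = \frac{24}{121} \approx 0.198,
\]

so the non-vanishing variance term contributes roughly 0.2γV per step.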
Proof. Without loss of generality, we denote the m − q correct elements in {ṽi : i ∈ [m]} as {vi : i ∈ [m − q]}, where E[vi] = ∇F(x). Sorting the correct vectors by the stochastic descendant score, we obtain {v(i) : i ∈ [m − q]}. We also sort the ṽi by the stochastic descendant score and obtain {ṽ(i) : i ∈ [m]}.

According to the definition,

\[
\bar{\tilde{v}} = \mathrm{Zeno}_b\big(\{\tilde{v}_i : i \in [m]\}\big) = \frac{1}{m-b}\sum_{i=1}^{m-b} \tilde{v}_{(i)}.
\]

Furthermore, we denote

\[
\bar{v} = \frac{1}{m-b}\sum_{i=1}^{m-b} v_{(i)}.
\]
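To fix ideas, here is a minimal NumPy sketch of the Zeno_b aggregation used throughout the proof (our own illustrative naming, not the authors' reference implementation; f_r stands in for the stochastic loss the server evaluates on a small batch):

```python
import numpy as np

def zeno_b(candidates, f_r, x, gamma, rho, b):
    """Average the (m - b) candidate gradients with the highest stochastic
    descendant scores: Score(v, x) = f_r(x) - f_r(x - gamma*v) - rho*||v||^2."""
    scores = np.array([f_r(x) - f_r(x - gamma * v) - rho * np.dot(v, v)
                       for v in candidates])
    keep = np.argsort(scores)[::-1][:len(candidates) - b]  # indices of top m - b scores
    return candidates[keep].mean(axis=0)

# Toy usage with a quadratic loss f(x) = 0.5*||x||^2, whose true gradient is x.
rng = np.random.default_rng(1)
d, m, b = 5, 10, 3
x = rng.normal(size=d)
f_r = lambda y: 0.5 * np.dot(y, y)      # noiseless stand-in for the batch loss
grads = np.tile(x, (m, 1)) + 0.1 * rng.normal(size=(m, d))
grads[-2:] = -10.0 * x                  # two Byzantine workers send ascent directions
v_bar = zeno_b(grads, f_r, x, gamma=0.1, rho=0.0005, b=b)
```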
Using Assumption ??, we have

\[
f_r\big(x - \gamma \tilde{v}_{(i)}\big) \ge f_r\big(x - \gamma \bar{\tilde{v}}\big) + \big\langle \nabla f_r\big(x - \gamma \bar{\tilde{v}}\big),\; \gamma\big(\bar{\tilde{v}} - \tilde{v}_{(i)}\big) \big\rangle + \frac{\mu\gamma^2}{2}\,\big\|\bar{\tilde{v}} - \tilde{v}_{(i)}\big\|^2,
\]

for ∀i ∈ [m − b].
By summing up (the inner-product terms cancel because Σ_{i=1}^{m−b} (ṽ̄ − ṽ(i)) = 0), we have

\[
\frac{1}{m-b}\sum_{i=1}^{m-b} f_r\big(x - \gamma \tilde{v}_{(i)}\big) \ge f_r\big(x - \gamma \bar{\tilde{v}}\big) + \frac{\mu\gamma^2}{2(m-b)}\sum_{i=1}^{m-b} \big\|\bar{\tilde{v}} - \tilde{v}_{(i)}\big\|^2. \tag{2}
\]
Using Lemma 1, we have

\[
f_r\big(x - \gamma \tilde{v}_{(i)}\big) + \rho\big\|\tilde{v}_{(i)}\big\|^2 \le f_r\big(x - \gamma v_{(i)}\big) + \rho\big\|v_{(i)}\big\|^2,
\]

for ∀i ∈ [m − b].
Combined with Equation 2, we have

\[
\begin{aligned}
f_r\big(x - \gamma \bar{\tilde{v}}\big)
&\le \frac{1}{m-b}\sum_{i=1}^{m-b} f_r\big(x - \gamma \tilde{v}_{(i)}\big) - \frac{\mu\gamma^2}{2(m-b)}\sum_{i=1}^{m-b} \big\|\bar{\tilde{v}} - \tilde{v}_{(i)}\big\|^2 \\
&\le \frac{1}{m-b}\sum_{i=1}^{m-b} f_r\big(x - \gamma v_{(i)}\big) + \frac{\rho}{m-b}\sum_{i=1}^{m-b} \Big( \big\|v_{(i)}\big\|^2 - \big\|\tilde{v}_{(i)}\big\|^2 \Big) - \frac{\mu\gamma^2}{2(m-b)}\sum_{i=1}^{m-b} \big\|\bar{\tilde{v}} - \tilde{v}_{(i)}\big\|^2.
\end{aligned}
\]
We take ρ = βγ²/2, where

\[
\begin{cases}
\beta = 0, & \text{if } \mu \ge 0; \\
\beta \ge |\mu|, & \text{otherwise}.
\end{cases}
\]
Thus, if µ ≥ 0, we have ρ = 0, which implies that

\[
\frac{\rho}{m-b}\sum_{i=1}^{m-b} \Big( \big\|v_{(i)}\big\|^2 - \big\|\tilde{v}_{(i)}\big\|^2 \Big) - \frac{\mu\gamma^2}{2(m-b)}\sum_{i=1}^{m-b} \big\|\bar{\tilde{v}} - \tilde{v}_{(i)}\big\|^2 \le \frac{\beta\gamma^2}{2(m-b)}\sum_{i=1}^{m-b} \big\|v_{(i)}\big\|^2.
\]
Also, if µ < 0, since β ≥ −µ, we have

\[
\begin{aligned}
&\frac{\rho}{m-b}\sum_{i=1}^{m-b} \Big( \big\|v_{(i)}\big\|^2 - \big\|\tilde{v}_{(i)}\big\|^2 \Big) - \frac{\mu\gamma^2}{2(m-b)}\sum_{i=1}^{m-b} \big\|\bar{\tilde{v}} - \tilde{v}_{(i)}\big\|^2 \\
&= \frac{\beta\gamma^2}{2(m-b)}\sum_{i=1}^{m-b} \Big( \big\|v_{(i)}\big\|^2 - \big\|\tilde{v}_{(i)}\big\|^2 \Big) - \frac{\mu\gamma^2}{2(m-b)}\sum_{i=1}^{m-b} \Big( \big\|\tilde{v}_{(i)}\big\|^2 - \big\|\bar{\tilde{v}}\big\|^2 \Big) \\
&= \frac{\beta\gamma^2}{2(m-b)}\sum_{i=1}^{m-b} \big\|v_{(i)}\big\|^2 + \frac{(-\beta-\mu)\gamma^2}{2(m-b)}\sum_{i=1}^{m-b} \big\|\tilde{v}_{(i)}\big\|^2 + \frac{\mu\gamma^2}{2(m-b)}\sum_{i=1}^{m-b} \big\|\bar{\tilde{v}}\big\|^2 \\
&\le \frac{\beta\gamma^2}{2(m-b)}\sum_{i=1}^{m-b} \big\|v_{(i)}\big\|^2,
\end{aligned}
\]

where the first equality uses Σ_{i=1}^{m−b} ‖ṽ̄ − ṽ(i)‖² = Σ_{i=1}^{m−b} (‖ṽ(i)‖² − ‖ṽ̄‖²), since ṽ̄ is the mean of the ṽ(i), and the last inequality holds because −β − µ ≤ 0 and µ < 0.
Thus, we have

\[
\begin{aligned}
f_r\big(x - \gamma \bar{\tilde{v}}\big)
&\le \frac{1}{m-b}\sum_{i=1}^{m-b} f_r\big(x - \gamma v_{(i)}\big) + \frac{\rho}{m-b}\sum_{i=1}^{m-b} \Big( \big\|v_{(i)}\big\|^2 - \big\|\tilde{v}_{(i)}\big\|^2 \Big) - \frac{\mu\gamma^2}{2(m-b)}\sum_{i=1}^{m-b} \big\|\bar{\tilde{v}} - \tilde{v}_{(i)}\big\|^2 \\
&\le \frac{1}{m-b}\sum_{i=1}^{m-b} f_r\big(x - \gamma v_{(i)}\big) + \frac{\beta\gamma^2}{2(m-b)}\sum_{i=1}^{m-b} \big\|v_{(i)}\big\|^2.
\end{aligned}
\]
Using the L-smoothness, we have

\[
f_r\big(x - \gamma v_{(i)}\big) \le f_r(x - \gamma \bar{v}) + \big\langle \nabla f_r(x - \gamma \bar{v}),\; \gamma(\bar{v} - v_{(i)}) \big\rangle + \frac{L\gamma^2}{2}\,\big\|\bar{v} - v_{(i)}\big\|^2,
\]

for ∀i ∈ [m − b]. By summing up (the inner-product terms again cancel), we have

\[
\begin{aligned}
\frac{1}{m-b}\sum_{i=1}^{m-b} f_r\big(x - \gamma v_{(i)}\big)
&\le f_r(x - \gamma \bar{v}) + \frac{L\gamma^2}{2(m-b)}\sum_{i=1}^{m-b} \big\|\bar{v} - v_{(i)}\big\|^2 \\
&\le f_r(x - \gamma \bar{v}) + \frac{L\gamma^2}{2(m-b)}\sum_{i=1}^{m-b} \big\|v_{(i)}\big\|^2,
\end{aligned}
\]

where the second inequality uses Σ_{i=1}^{m−b} ‖v̄ − v(i)‖² = Σ_{i=1}^{m−b} ‖v(i)‖² − (m − b)‖v̄‖² ≤ Σ_{i=1}^{m−b} ‖v(i)‖².
Thus, we have

\[
\begin{aligned}
f_r\big(x - \gamma \bar{\tilde{v}}\big)
&\le \frac{1}{m-b}\sum_{i=1}^{m-b} f_r\big(x - \gamma v_{(i)}\big) + \frac{\beta\gamma^2}{2(m-b)}\sum_{i=1}^{m-b} \big\|v_{(i)}\big\|^2 \\
&\le f_r(x - \gamma \bar{v}) + \frac{(L+\beta)\gamma^2}{2(m-b)}\sum_{i=1}^{m-b} \big\|v_{(i)}\big\|^2.
\end{aligned}
\]
Again, using the L-smoothness and taking γ ≤ 1/L, we have

\[
\begin{aligned}
f_r(x - \gamma \bar{v})
&\le f_r(x) + \big\langle \nabla f_r(x), -\gamma \bar{v} \big\rangle + \frac{L\gamma^2}{2}\,\|\bar{v}\|^2 \\
&\le f_r(x) + \big\langle \nabla f_r(x), -\gamma \bar{v} \big\rangle + \frac{\gamma}{2}\,\|\bar{v}\|^2.
\end{aligned}
\]
Thus, we have

\[
\begin{aligned}
f_r\big(x - \gamma \bar{\tilde{v}}\big) - f_r(x)
&\le f_r(x - \gamma \bar{v}) - f_r(x) + \frac{(L+\beta)\gamma^2}{2(m-b)}\sum_{i=1}^{m-b} \big\|v_{(i)}\big\|^2 \\
&\le \big\langle \nabla f_r(x), -\gamma \bar{v} \big\rangle + \frac{\gamma}{2}\,\|\bar{v}\|^2 + \frac{(L+\beta)\gamma^2}{2(m-b)}\sum_{i=1}^{m-b} \big\|v_{(i)}\big\|^2.
\end{aligned}
\]
Conditional on the ṽ(i)'s, taking expectation w.r.t. f_r on both sides, we have

\[
\begin{aligned}
F\big(x - \gamma \bar{\tilde{v}}\big) - F(x)
&\le \big\langle \nabla F(x), -\gamma \bar{v} \big\rangle + \frac{\gamma}{2}\,\|\bar{v}\|^2 + \frac{(L+\beta)\gamma^2}{2(m-b)}\sum_{i=1}^{m-b} \big\|v_{(i)}\big\|^2 \\
&= -\frac{\gamma}{2}\,\|\nabla F(x)\|^2 + \frac{\gamma}{2}\,\|\nabla F(x) - \bar{v}\|^2 + \frac{(L+\beta)\gamma^2}{2(m-b)}\sum_{i=1}^{m-b} \big\|v_{(i)}\big\|^2.
\end{aligned}
\]
Now, taking the expectation w.r.t. the ṽ(i)'s on both sides and using E‖v(i)‖² ≤ G, we have

\[
\begin{aligned}
\mathbb{E}\big[F\big(x - \gamma \bar{\tilde{v}}\big)\big] - F(x)
&\le -\frac{\gamma}{2}\,\|\nabla F(x)\|^2 + \frac{\gamma}{2}\,\mathbb{E}\|\nabla F(x) - \bar{v}\|^2 + \frac{(L+\beta)\gamma^2}{2(m-b)}\sum_{i=1}^{m-b} \mathbb{E}\big\|v_{(i)}\big\|^2 \\
&\le -\frac{\gamma}{2}\,\|\nabla F(x)\|^2 + \frac{\gamma}{2}\,\mathbb{E}\|\nabla F(x) - \bar{v}\|^2 + \frac{(L+\beta)\gamma^2 G}{2}.
\end{aligned}
\]
Now we just need to bound E‖∇F(x) − v̄‖². For convenience, we denote g = ∇F(x). Note that for an arbitrary subset S ⊆ [m − q] with |S| = m − b (so that the complement [m − q] \ S has b − q elements), we have the following bound, where the first inequality uses ‖a − b‖² ≤ 2‖a‖² + 2‖b‖²:

\[
\begin{aligned}
\mathbb{E}\left\|\frac{\sum_{i\in S}(v_i - g)}{m-b}\right\|^2
&= \mathbb{E}\left\|\frac{\sum_{i\in[m-q]}(v_i - g) - \sum_{i\notin S}(v_i - g)}{m-b}\right\|^2 \\
&\le 2\,\mathbb{E}\left\|\frac{\sum_{i\in[m-q]}(v_i - g)}{m-b}\right\|^2 + 2\,\mathbb{E}\left\|\frac{\sum_{i\notin S}(v_i - g)}{m-b}\right\|^2 \\
&= \frac{2(m-q)^2}{(m-b)^2}\,\mathbb{E}\left\|\frac{\sum_{i\in[m-q]}(v_i - g)}{m-q}\right\|^2 + \frac{2(b-q)^2}{(m-b)^2}\,\mathbb{E}\left\|\frac{\sum_{i\notin S}(v_i - g)}{b-q}\right\|^2 \\
&\le \frac{2(m-q)^2}{(m-b)^2}\cdot\frac{V}{m-q} + \frac{2(b-q)^2}{(m-b)^2}\cdot\frac{\sum_{i\in[m-q]}\mathbb{E}\|v_i - g\|^2}{b-q} \\
&\le \frac{2(m-q)^2}{(m-b)^2}\cdot\frac{V}{m-q} + \frac{2(b-q)^2}{(m-b)^2}\cdot\frac{(m-q)V}{b-q} \\
&= \frac{2(b-q+1)(m-q)V}{(m-b)^2}.
\end{aligned}
\]
Putting all the ingredients together, we obtain the desired result:

\[
\mathbb{E}\big[F\big(x - \gamma \bar{\tilde{v}}\big)\big] - F(x) \le -\frac{\gamma}{2}\,\|\nabla F(x)\|^2 + \frac{\gamma(b-q+1)(m-q)V}{(m-b)^2} + \frac{(L+\beta)\gamma^2 G}{2}.
\]
Corollary 1. Take γ = 1/(L√T) and ρ = βγ²/2, where β is the same as in Theorem ??. Using Zeno, after T iterations, we have

\[
\begin{aligned}
\frac{\sum_{t=0}^{T-1}\mathbb{E}\|\nabla F(x_t)\|^2}{T}
&\le \left( 2L\big(F(x_0) - F(x^*)\big) + \frac{(L+\beta)G}{L} \right)\frac{1}{\sqrt{T}} + \frac{2(b-q+1)(m-q)V}{(m-b)^2} \\
&= O\!\left(\frac{1}{\sqrt{T}}\right) + O\!\left(\frac{(b-q+1)(m-q)}{(m-b)^2}\right).
\end{aligned}
\]
Proof. Taking x = x_t and x_{t+1} = x_t − γ Zeno_b({ṽi : i ∈ [m]}), using Theorem ??, we have

\[
\mathbb{E}\big[F(x_{t+1})\big] - F(x_t) \le -\frac{\gamma}{2}\,\|\nabla F(x_t)\|^2 + \frac{\gamma(b-q+1)(m-q)V}{(m-b)^2} + \frac{(L+\beta)\gamma^2 G}{2}.
\]
By telescoping and taking total expectation, we have

\[
\mathbb{E}\big[F(x_T)\big] - F(x_0) \le -\frac{\gamma}{2}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla F(x_t)\|^2 + \frac{\gamma(b-q+1)(m-q)VT}{(m-b)^2} + \frac{(L+\beta)\gamma^2 GT}{2}.
\]

Taking γ = 1/(L√T), we have

\[
\begin{aligned}
\frac{\sum_{t=0}^{T-1}\mathbb{E}\|\nabla F(x_t)\|^2}{T}
&\le \frac{2L\big(F(x_0) - \mathbb{E}[F(x_T)]\big)}{\sqrt{T}} + \frac{2(b-q+1)(m-q)V}{(m-b)^2} + \frac{(L+\beta)G}{L\sqrt{T}} \\
&\le \frac{2L\big(F(x_0) - F(x^*)\big)}{\sqrt{T}} + \frac{2(b-q+1)(m-q)V}{(m-b)^2} + \frac{(L+\beta)G}{L\sqrt{T}} \\
&= O\!\left(\frac{1}{\sqrt{T}}\right) + O\!\left(\frac{(b-q+1)(m-q)}{(m-b)^2}\right).
\end{aligned}
\]
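As a sanity check on this schedule, the following toy simulation (entirely ours; the quadratic objective, noise levels, and all constants are illustrative assumptions) runs Zeno-SGD with the prescribed step size γ = 1/(L√T) against q Byzantine workers:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, b, q, T = 5, 10, 3, 2, 400
L = 1.0                                 # smoothness constant of f(x) = 0.5*||x||^2
gamma = 1.0 / (L * np.sqrt(T))          # step size prescribed by Corollary 1
rho = 0.0                               # beta = 0 since the toy objective is convex (mu >= 0)

f_r = lambda y: 0.5 * np.dot(y, y)      # server-side stochastic loss (noiseless here)
x = rng.normal(size=d)

for _ in range(T):
    grads = np.tile(x, (m, 1)) + 0.1 * rng.normal(size=(m, d))  # honest stochastic gradients
    grads[:q] = 10.0 * rng.normal(size=(q, d))                  # q Byzantine workers
    scores = np.array([f_r(x) - f_r(x - gamma * g) - rho * np.dot(g, g) for g in grads])
    keep = np.argsort(scores)[::-1][:m - b]                     # top m - b scores
    x = x - gamma * grads[keep].mean(axis=0)

print("final gradient norm:", np.linalg.norm(x))  # the gradient of 0.5*||x||^2 is x itself
```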
Corollary 2. Assume that

\[
F(x) = \frac{1}{m}\sum_{i\in[m]} \mathbb{E}\big[F_i(x)\big],
\]

and

\[
\mathbb{E}\big[F_i(x)\big] \ne \mathbb{E}\big[F_j(x)\big],
\]

for ∀i, j ∈ [m], i ≠ j. For the stochastic descendant score, we still have E[f_r(x)] = F(x). Assumptions ??, ??, and ?? still hold. Take γ = 1/(L√T) and ρ = βγ²/2, where β is the same as in Theorem ??. Using Zeno, after T iterations, we have

\[
\begin{aligned}
\frac{\sum_{t=0}^{T-1}\mathbb{E}\|\nabla F(x_t)\|^2}{T}
&\le \frac{2L\big(F(x_0) - F(x^*)\big)}{\sqrt{T}} + \frac{4V}{m} + \frac{4bG}{m} + \frac{2b^2(m-q)G}{m^2(m-b)} + \frac{(L+\beta)G}{L\sqrt{T}} \\
&= O\!\left(\frac{1}{\sqrt{T}}\right) + O\!\left(\frac{b}{m}\right) + O\!\left(\frac{b^2(m-q)}{m^2(m-b)}\right).
\end{aligned}
\]
Proof. Similar to the proof of Theorem ??, we define ṽ̄ = Zeno_b({ṽi : i ∈ [m]}). Thus, reusing the proof of Theorem ??, we have

\[
F\big(x - \gamma \bar{\tilde{v}}\big) - F(x) \le -\frac{\gamma}{2}\,\|\nabla F(x)\|^2 + \frac{\gamma}{2}\,\|\nabla F(x) - \bar{v}\|^2 + \frac{(L+\beta)\gamma^2}{2(m-b)}\sum_{i=1}^{m-b}\big\|v_{(i)}\big\|^2.
\]
Now, taking the expectation w.r.t. the ṽ(i)'s on both sides and using E‖v(i)‖² ≤ G, we have

\[
\begin{aligned}
\mathbb{E}\big[F\big(x - \gamma \bar{\tilde{v}}\big)\big] - F(x)
&\le -\frac{\gamma}{2}\,\|\nabla F(x)\|^2 + \frac{\gamma}{2}\,\mathbb{E}\|\nabla F(x) - \bar{v}\|^2 + \frac{(L+\beta)\gamma^2}{2(m-b)}\sum_{i=1}^{m-b}\mathbb{E}\big\|v_{(i)}\big\|^2 \\
&\le -\frac{\gamma}{2}\,\|\nabla F(x)\|^2 + \frac{\gamma}{2}\,\mathbb{E}\|\nabla F(x) - \bar{v}\|^2 + \frac{(L+\beta)\gamma^2 G}{2}.
\end{aligned}
\]
Now we just need to bound E‖∇F(x) − v̄‖². We define S₁ = {∇F_i(x) : i ∈ [m]} \ {v(i) : i ∈ [m − b]}, and note that |S₁| = b and Σ_{i=1}^{m−b} v(i) = Σ_{i∈[m]} ∇F_i(x) − Σ_{v∈S₁} v. Then,

\[
\begin{aligned}
&\mathbb{E}\|\nabla F(x) - \bar{v}\|^2 \\
&= \mathbb{E}\left\|\frac{1}{m}\sum_{i\in[m]}\mathbb{E}\big[\nabla F_i(x)\big] - \frac{1}{m-b}\sum_{i=1}^{m-b} v_{(i)}\right\|^2 \\
&\le 2\,\mathbb{E}\left\|\frac{1}{m}\sum_{i\in[m]}\mathbb{E}\big[\nabla F_i(x)\big] - \frac{1}{m}\sum_{i=1}^{m-b} v_{(i)}\right\|^2 + 2\,\mathbb{E}\left\|\frac{1}{m}\sum_{i=1}^{m-b} v_{(i)} - \frac{1}{m-b}\sum_{i=1}^{m-b} v_{(i)}\right\|^2 \\
&\le 4\,\mathbb{E}\left\|\frac{1}{m}\sum_{i\in[m]}\mathbb{E}\big[\nabla F_i(x)\big] - \frac{1}{m}\sum_{i=1}^{m}\nabla F_i(x)\right\|^2 + 4\,\mathbb{E}\left\|\frac{1}{m}\sum_{v\in S_1} v\right\|^2 + 2\,\mathbb{E}\left\|\frac{1}{m}\sum_{i=1}^{m-b} v_{(i)} - \frac{1}{m-b}\sum_{i=1}^{m-b} v_{(i)}\right\|^2 \\
&\le \frac{4V}{m} + \frac{4b^2}{m^2}\,\mathbb{E}\left\|\frac{1}{b}\sum_{v\in S_1} v\right\|^2 + 2\left(\frac{1}{m} - \frac{1}{m-b}\right)^2\mathbb{E}\left\|\sum_{i=1}^{m-b} v_{(i)}\right\|^2 \\
&\le \frac{4V}{m} + \frac{4b^2}{m^2}\cdot\frac{mG}{b} + 2\left(\frac{b}{m(m-b)}\right)^2(m-b)(m-q)G \\
&\le \frac{4V}{m} + \frac{4bG}{m} + \frac{2b^2(m-q)G}{m^2(m-b)}.
\end{aligned}
\]
Thus, we have

\[
\mathbb{E}\big[F\big(x - \gamma \bar{\tilde{v}}\big)\big] - F(x) \le -\frac{\gamma}{2}\,\|\nabla F(x)\|^2 + \frac{\gamma}{2}\left(\frac{4V}{m} + \frac{4bG}{m} + \frac{2b^2(m-q)G}{m^2(m-b)}\right) + \frac{(L+\beta)\gamma^2 G}{2}.
\]
Following the same procedure as in Corollary ??, taking γ = 1/(L√T), we have

\[
\begin{aligned}
\frac{\sum_{t=0}^{T-1}\mathbb{E}\|\nabla F(x_t)\|^2}{T}
&\le \frac{2L\big(F(x_0) - F(x^*)\big)}{\sqrt{T}} + \frac{4V}{m} + \frac{4bG}{m} + \frac{2b^2(m-q)G}{m^2(m-b)} + \frac{(L+\beta)G}{L\sqrt{T}} \\
&= O\!\left(\frac{1}{\sqrt{T}}\right) + O\!\left(\frac{b}{m}\right) + O\!\left(\frac{b^2(m-q)}{m^2(m-b)}\right).
\end{aligned}
\]
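For a rough sense of scale in the non-i.i.d. bound (illustrative numbers, not from the paper), take m = 100, b = 10, q = 8:

\[
\frac{4V}{m} + \frac{4bG}{m} + \frac{2b^2(m-q)G}{m^2(m-b)} = 0.04\,V + 0.4\,G + \frac{2\cdot 100\cdot 92}{10000\cdot 90}\,G \approx 0.04\,V + 0.42\,G,
\]

so for moderate b the O(b/m) term dominates the O(b²(m−q)/(m²(m−b))) term.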
B. Additional Experiments
[Figure: two panels comparing Mean, Median, Krum 4, and Zeno 4; x-axis: Epoch.]

Figure 9. Convergence on non-i.i.d. training data, without failures. (a) Top-1 accuracy on testing set, with q = 8; (b) cross entropy on training set, with q = 8. Batch size on the workers is 100. Batch size of Zeno is n_r = 4. ρ = 0.0005. Learning rate γ = 0.05. Each epoch has 25 iterations.
[Figure: four panels comparing Mean without failures, Mean, Median, Krum 8, and Zeno 9 (q = 8) or Zeno 16 (q = 12); x-axis: Epoch.]

Figure 10. Convergence on non-i.i.d. training data, with label-flipping failures. (a) Top-1 accuracy on testing set, with q = 8; (b) cross entropy on training set, with q = 8; (c) top-1 accuracy on testing set, with q = 12; (d) cross entropy on training set, with q = 12. Batch size on the workers is 100. Batch size of Zeno is n_r = 4. ρ = 0.0005. Learning rate γ = 0.05. Each epoch has 25 iterations.
[Figure: four panels comparing Mean without failures, Mean, Median, Krum 8, and Zeno 9 (q = 8) or Zeno 16 (q = 12); x-axis: Epoch.]

Figure 11. Convergence on non-i.i.d. training data, with bit-flipping failures. (a) Top-1 accuracy on testing set, with q = 8; (b) cross entropy on training set, with q = 8; (c) top-1 accuracy on testing set, with q = 12; (d) cross entropy on training set, with q = 12. Batch size on the workers is 100. Batch size of Zeno is n_r = 4. ρ = 0.0005. Learning rate γ = 0.05. Each epoch has 25 iterations.
[Figure: two bar-chart panels comparing Zeno with b ∈ {8, 10, 12, 14, 16, 18} across ρ ∈ {0.002, 0.001, 0.0005, 0.00025}.]

Figure 12. Convergence on i.i.d. training data, with label-flipping failures, q = 8. (a) Top-1 accuracy on testing set, with q = 8; (b) cross entropy on training set, with q = 8. Batch size on the workers is 100. Batch size of Zeno is n_r = 4. Learning rate γ = 0.1. Each epoch has 25 iterations. ρ and b are tuned.
