Appendix
A. Proofs
A.1. Preliminaries

We use the following lemma to bound the aggregated vectors.
Lemma 1 (Bounded Score). Without loss of generality, denote the m − q correct elements of {ṽ_i : i ∈ [m]} as {v_i : i ∈ [m − q]}. Sorting the correct vectors by the stochastic descendant score, we obtain {v_(i) : i ∈ [m − q]}. Then, we have the following inequality:

    Score_{γ,ρ}(ṽ_(i), x) ≥ Score_{γ,ρ}(v_(i), x),  ∀i ∈ [m − q],

or, by flipping the signs on both sides, equivalently,

    f_r(x − γṽ_(i)) − f_r(x) + ρ‖ṽ_(i)‖² ≤ f_r(x − γv_(i)) − f_r(x) + ρ‖v_(i)‖²,  ∀i ∈ [m − q].
074
Proof. We prove the lemma by contradiction. Assume that Score_{γ,ρ}(ṽ_(i), x) < Score_{γ,ρ}(v_(i), x). Then at least i correct vectors have scores greater than that of ṽ_(i). However, because ṽ_(i) is the ith element of the sorted sequence {ṽ_(i) : i ∈ [m]}, at most i − 1 vectors can have scores greater than it, which yields a contradiction. ∎
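The counting argument above can be illustrated numerically. The sketch below (Python, with hypothetical score values) checks the property the proof relies on: since the correct scores form a sub-multiset of all scores, the ith largest score overall dominates the ith largest correct score.

```python
# Hypothetical scores for m = 6 workers, of which q = 2 are Byzantine.
correct_scores = [3.0, 1.5, 0.2, -0.7]      # the m - q correct workers
byzantine_scores = [9.9, -5.0]              # the q Byzantine workers

# Sort all m scores and the m - q correct scores in descending order.
all_scores = sorted(correct_scores + byzantine_scores, reverse=True)
corr_sorted = sorted(correct_scores, reverse=True)

# Lemma 1's ordering property: for every i, the ith largest overall score
# is at least the ith largest correct score.
for i in range(len(corr_sorted)):
    assert all_scores[i] >= corr_sorted[i]
```

The same holds for any choice of scores, since removing elements from a multiset can only push each order statistic down.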
079
A.2. Convergence guarantees

For general non-strongly convex functions and non-convex functions, we provide the following convergence guarantees.
Theorem 1. For any x ∈ ℝᵈ, denote

    ṽ_i = ∗ (arbitrary),  if the ith worker is Byzantine;
    ṽ_i = ∇F_i(x),        otherwise,

where i ∈ [m], and ṽ̄ = Zeno_b({ṽ_i : i ∈ [m]}). Taking γ ≤ 1/L and ρ = βγ²/2, where

    β = 0,     if µ ≥ 0;
    β ≥ |µ|,   otherwise,

we have

    E[F(x − γṽ̄)] − F(x) ≤ −(γ/2)‖∇F(x)‖² + γ(b − q + 1)(m − q)V/(m − b)² + (L + β)γ²G/2.
Proof. Without loss of generality, denote the m − q correct elements of {ṽ_i : i ∈ [m]} as {v_i : i ∈ [m − q]}, where E[v_i] = ∇F(x). Sorting the correct vectors by the stochastic descendant score, we obtain {v_(i) : i ∈ [m − q]}. We also sort the ṽ_i by the stochastic descendant score and obtain {ṽ_(i) : i ∈ [m]}.

According to the definition, ṽ̄ = Zeno_b({ṽ_i : i ∈ [m]}) = (1/(m − b)) Σ_{i=1}^{m−b} ṽ_(i). Furthermore, we denote v̄ = (1/(m − b)) Σ_{i=1}^{m−b} v_(i).
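As a concrete sketch, the Zeno_b aggregation just defined can be written out as follows (Python; the function name `zeno_b` and the quadratic toy loss are our own illustrative choices, and the score follows the stochastic descendant score Score_{γ,ρ}(v, x) = f_r(x) − f_r(x − γv) − ρ‖v‖² from the main text):

```python
import numpy as np

def zeno_b(grads, f_r, x, gamma, rho, b):
    """Sketch of Zeno_b: score each candidate gradient with the stochastic
    descendant score, keep the m - b highest-scoring candidates, and
    average them."""
    scores = [f_r(x) - f_r(x - gamma * v) - rho * np.linalg.norm(v) ** 2
              for v in grads]
    order = np.argsort(scores)[::-1]               # indices by descending score
    kept = [grads[i] for i in order[: len(grads) - b]]
    return np.mean(kept, axis=0)

# Toy usage: f_r(x) = ||x||^2 / 2, three honest gradients near the true
# gradient x, and one Byzantine gradient pointing uphill.
f_r = lambda x: 0.5 * np.dot(x, x)
x = np.array([1.0, -2.0])
grads = [x, 0.9 * x, 1.1 * x, -10.0 * x]           # last one is Byzantine
v_bar = zeno_b(grads, f_r, x, gamma=0.1, rho=0.001, b=1)
# The Byzantine gradient gets the lowest score and is dropped.
```

With b = 1 the hostile candidate is filtered out and the aggregate equals the mean of the three honest gradients.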
Using Assumption ??, we have

    f_r(x − γṽ_(i)) ≥ f_r(x − γṽ̄) + ⟨∇f_r(x − γṽ̄), γ(ṽ̄ − ṽ_(i))⟩ + (µγ²/2)‖ṽ̄ − ṽ_(i)‖²,

for all i ∈ [m − b].
Zeno: Distributed Stochastic Gradient Descent with Suspicion-based Fault-tolerance
Now, taking the expectation w.r.t. the ṽ_(i)'s on both sides and using E‖v_(i)‖² ≤ G, we have

    E[F(x − γṽ̄)] − F(x)
    ≤ −(γ/2)‖∇F(x)‖² + (γ/2) E‖∇F(x) − v̄‖² + ((L + β)γ²/(2(m − b))) Σ_{i=1}^{m−b} E‖v_(i)‖²
    ≤ −(γ/2)‖∇F(x)‖² + (γ/2) E‖∇F(x) − v̄‖² + (L + β)γ²G/2.
Now we just need to bound E‖∇F(x) − v̄‖². For convenience, we denote g = ∇F(x). Note that for an arbitrary subset S ⊆ [m − q] with |S| = m − b, we have the following bound:

    E‖(1/(m − b)) Σ_{i∈S} (v_i − g)‖²
    = E‖(1/(m − b)) (Σ_{i∈[m−q]} (v_i − g) − Σ_{i∉S} (v_i − g))‖²
    ≤ 2 E‖(1/(m − b)) Σ_{i∈[m−q]} (v_i − g)‖² + 2 E‖(1/(m − b)) Σ_{i∉S} (v_i − g)‖²
    = (2(m − q)²/(m − b)²) E‖(1/(m − q)) Σ_{i∈[m−q]} (v_i − g)‖² + (2(b − q)²/(m − b)²) E‖(1/(b − q)) Σ_{i∉S} (v_i − g)‖²
    ≤ (2(m − q)²/(m − b)²) · V/(m − q) + (2(b − q)²/(m − b)²) · (Σ_{i∈[m−q]} E‖v_i − g‖²)/(b − q)
    ≤ 2(m − q)V/(m − b)² + 2(b − q)(m − q)V/(m − b)²
    = 2(b − q + 1)(m − q)V/(m − b)².
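The chain above leans on two elementary facts: the splitting bound ‖a − c‖² ≤ 2‖a‖² + 2‖c‖² and Jensen's inequality ‖mean(v_i)‖² ≤ mean(‖v_i‖²). A quick numeric sanity check on toy vectors (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Splitting bound: ||a - c||^2 <= 2||a||^2 + 2||c||^2.
a, c = rng.normal(size=5), rng.normal(size=5)
assert (np.linalg.norm(a - c) ** 2
        <= 2 * np.linalg.norm(a) ** 2 + 2 * np.linalg.norm(c) ** 2)

# Jensen's inequality: ||mean of vectors||^2 <= mean of squared norms.
vs = rng.normal(size=(7, 5))
assert (np.linalg.norm(vs.mean(axis=0)) ** 2
        <= np.mean([np.linalg.norm(v) ** 2 for v in vs]))
```

Both inequalities hold for any vectors, so the assertions pass for every seed.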
Putting all the ingredients together, we obtain the desired result:

    E[F(x − γṽ̄)] − F(x) ≤ −(γ/2)‖∇F(x)‖² + γ(b − q + 1)(m − q)V/(m − b)² + (L + β)γ²G/2. ∎
Corollary 1. Take γ = 1/(L√T) and ρ = βγ²/2, where β is the same as in Theorem ??. Using Zeno, after T iterations, we have

    (1/T) Σ_{t=0}^{T−1} E‖∇F(x_t)‖²
    ≤ (2L(F(x₀) − F(x∗)) + (L + β)G/L) / √T + 2(b − q + 1)(m − q)V/(m − b)²
    = O(1/√T) + O((b − q + 1)(m − q)/(m − b)²).
Proof. Taking x = x_t and x_{t+1} = x_t − γ Zeno_b({ṽ_i : i ∈ [m]}), and using Theorem ??, we have

    E[F(x_{t+1})] − F(x_t) ≤ −(γ/2)‖∇F(x_t)‖² + γ(b − q + 1)(m − q)V/(m − b)² + (L + β)γ²G/2.
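The corollary's two-term structure can be seen numerically: the transient O(1/√T) term vanishes with more iterations, while the variance floor set by the Byzantine tolerance persists. A sketch with hypothetical constants (L, β, G, V, the optimality gap, and m, b, q are all toy values of our own choosing):

```python
import math

def corollary_bound(T, L=1.0, beta=0.5, G=4.0, V=1.0, F_gap=10.0,
                    m=20, b=4, q=2):
    """Right-hand side of the Corollary bound for toy constants."""
    transient = (2 * L * F_gap + (L + beta) * G / L) / math.sqrt(T)
    floor = 2 * (b - q + 1) * (m - q) * V / (m - b) ** 2
    return transient + floor

# The bound shrinks with T but never below the variance floor.
floor = 2 * (4 - 2 + 1) * (20 - 2) * 1.0 / (20 - 4) ** 2
assert corollary_bound(10_000) < corollary_bound(100)
assert abs(corollary_bound(10 ** 12) - floor) < 1e-3
```

This mirrors the O(1/√T) + O((b − q + 1)(m − q)/(m − b)²) decomposition: only the first term is driven to zero by running longer.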
Now we just need to bound E‖∇F(x) − v̄‖². We define S₁ = {∇F_i(x) : i ∈ [m]} \ {v_(i) : i ∈ [m − b]}. Note that |S₁| = b.

    E‖∇F(x) − v̄‖²
    = E‖(1/m) Σ_{i∈[m]} E[∇F_i(x)] − (1/(m − b)) Σ_{i=1}^{m−b} v_(i)‖²
    ≤ 2 E‖(1/m) Σ_{i∈[m]} E[∇F_i(x)] − (1/m) Σ_{i=1}^{m−b} v_(i)‖² + 2 E‖(1/m) Σ_{i=1}^{m−b} v_(i) − (1/(m − b)) Σ_{i=1}^{m−b} v_(i)‖²
    ≤ 4 E‖(1/m) Σ_{i∈[m]} E[∇F_i(x)] − (1/m) Σ_{i∈[m]} ∇F_i(x)‖² + 4 E‖(1/m) Σ_{v∈S₁} v‖² + 2 E‖(1/m − 1/(m − b)) Σ_{i=1}^{m−b} v_(i)‖²
    ≤ 4V/m + (4b²/m²) E‖(1/b) Σ_{v∈S₁} v‖² + 2 (1/m − 1/(m − b))² E‖Σ_{i=1}^{m−b} v_(i)‖²
    ≤ 4V/m + 4bG/m + 2 (b/(m(m − b)))² (m − b)(m − q)G
    ≤ 4V/m + 4bG/m + 2b²(m − q)G/(m²(m − b)).
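The last simplification step above can be checked in exact arithmetic; the sketch below (toy integer values of our own choosing) verifies that 2(b/(m(m − b)))²(m − b)(m − q)G equals 2b²(m − q)G/(m²(m − b)):

```python
from fractions import Fraction

# Toy values; any integers with 0 <= q <= b < m work.
m, b, q, G = 20, 4, 2, Fraction(7)

lhs = 2 * (Fraction(b, m * (m - b))) ** 2 * (m - b) * (m - q) * G
rhs = Fraction(2 * b ** 2 * (m - q), m ** 2 * (m - b)) * G
assert lhs == rhs
```

Using `Fraction` avoids floating-point round-off, so the identity is confirmed exactly rather than up to tolerance.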
Thus, we have

    E[F(x − γṽ̄)] − F(x)
    ≤ −(γ/2)‖∇F(x)‖² + (γ/2) (4V/m + 4bG/m + 2b²(m − q)G/(m²(m − b))) + (L + β)γ²G/2.
Following the same procedure as in Corollary ??, taking γ = 1/(L√T), we have

    (1/T) Σ_{t=0}^{T−1} E‖∇F(x_t)‖²
    ≤ 2L(F(x₀) − F(x∗))/√T + 4V/m + 4bG/m + 2b²(m − q)G/(m²(m − b)) + (L + β)G/(L√T)
    = O(1/√T) + O(b/m) + O(b²(m − q)/(m²(m − b))). ∎
B. Additional Experiments
[Figure 9 plots omitted: (a) top-1 accuracy on the testing set, with q = 8; (b) cross entropy on the training set, with q = 8. Curves: Mean, Median, Krum (b = 4), Zeno (b = 4).]

Figure 9. Convergence on non-i.i.d. training data, without failures. Batch size on the workers is 100. Batch size of Zeno is n_r = 4. ρ = 0.0005. Learning rate γ = 0.05. Each epoch has 25 iterations.
[Figure 10 plots omitted: (a) top-1 accuracy on the testing set, with q = 8; (b) cross entropy on the training set, with q = 8; (c) top-1 accuracy on the testing set, with q = 12; (d) cross entropy on the training set, with q = 12. Curves: Mean without failures, Mean, Median, Krum (b = 8), Zeno (b = 9 for q = 8; b = 16 for q = 12).]

Figure 10. Convergence on non-i.i.d. training data, with label-flipping failures. Batch size on the workers is 100. Batch size of Zeno is n_r = 4. ρ = 0.0005. Learning rate γ = 0.05. Each epoch has 25 iterations.
[Figure 11 plots omitted: (a) top-1 accuracy on the testing set, with q = 8; (b) cross entropy on the training set, with q = 8; (c) top-1 accuracy on the testing set, with q = 12; (d) cross entropy on the training set, with q = 12. Curves: Mean without failures, Mean, Median, Krum (b = 8), Zeno (b = 9 for q = 8; b = 16 for q = 12).]

Figure 11. Convergence on non-i.i.d. training data, with bit-flipping failures. Batch size on the workers is 100. Batch size of Zeno is n_r = 4. ρ = 0.0005. Learning rate γ = 0.05. Each epoch has 25 iterations.
[Figure plots truncated: top-1 accuracy and loss curves for Zeno with b ∈ {8, 10, 12, 14, 16, 18}.]