DGT: A Contribution-Aware Differential Gradient Transmission Mechanism for Distributed Machine Learning
ARTICLE INFO

Article history:
Received 13 January 2020
Received in revised form 31 January 2021
Accepted 9 March 2021
Available online 14 March 2021

Keywords:
Distributed Machine Learning (DML)
Gradient transmission
Parameter server architecture

ABSTRACT

Distributed machine learning is a mainstream system to learn insights for analytics and intelligence services on many fronts (e.g., health, streaming and business) from their massive operational data. In such a system, multiple workers train over subsets of data and collaboratively derive a global prediction/inference model by iteratively synchronizing their local learning results, e.g., the model gradients, which in turn generates heavy and bursty traffic and results in high communication overhead in cluster networks. Such communication overhead has become the main bottleneck that limits the efficiency of training machine learning models in a distributed manner. In this paper, our key observation is that local gradients learned by workers may have different contributions to global model convergence, and executing differential transmission for different gradients can reduce the communication overhead and improve training efficiency. However, existing gradient transmission mechanisms treat all gradients the same, which may lead to long training time.

Motivated by our observations, we propose Differential Gradient Transmission (DGT), a contribution-aware differential gradient transmission mechanism for efficient distributed learning, which transfers gradients with different transmission quality according to their contributions. In addition to designing a general architecture of DGT, we propose a novel algorithm and a novel protocol to facilitate fast model training. Experiments on a cluster with 6 GTX 1080TI GPUs and a 1 Gbps network show that DGT decreases the model training time by 19.4% on GoogleNet, 34.4% on AlexNet and 36.5% on VGG-11 compared to default gradient transmission on MXNET. Its acceleration is better than that of two other related transmission solutions. Besides, DGT works well with different datasets (Fashion-MNIST, Cifar10), different data distributions (IID, non-IID) and different training algorithms (BSP, FedAVG).

© 2021 Elsevier B.V. All rights reserved.
https://doi.org/10.1016/j.future.2021.03.006
several application-layer solutions to reduce the communication overhead through compressing the amount of communicated parameters [13–15], reducing the frequency of synchronization [16] and overlapping communication with computation [17,18]. All these solutions exploit the property that the SGD-based algorithm or its variants can still make the model converge and produce equal precision with a non-trivial amount of lower-precision, delayed, or missed gradients. Some researchers [19] have argued that a customized transport-layer solution is needed for distributed learning. In this paper, we focus on exploring an efficient end-to-end traffic scheduling mechanism for gradient transmission, which acts between the application layer and existing transport-layer protocols.

Our key observation is that providing the same end-to-end gradient transmission for all gradients is unnecessary, and even adverse to the efficiency of distributed learning. First, gradients have different contributions to model convergence, and their difference becomes more obvious as the model converges. Second, a gradient with a higher contribution requires higher transmission reliability. Conversely, it is an opportunity to strategically reduce the transmission reliability of low-contribution gradients to mitigate communication overhead. Third, a gradient with a higher contribution requires a lower transmission delay. Preferentially transmitting gradients with higher contributions lifts the accuracy of the model as soon as possible. Therefore, we argue that differential gradient transmission is needed for distributed learning. However, existing solutions cannot provide such refined differential gradient transmission. For example, existing DML frameworks such as MXNET [20] or Tensorflow [21] all rely on accurate gradient transmission, i.e., the worker uploads all gradients to the parameter server over a reliable transmission service. Although some related works [19,22] execute approximate gradient transmission by heuristically ignoring parts of lost gradients, they still treat all gradients the same. These observations motivate us to fill this research gap.

In this paper, we propose Differential Gradient Transmission (DGT), a contribution-aware differential gradient transmission mechanism for distributed machine learning. The basic idea of DGT is to provide differential transmission service for gradients according to their contribution to model convergence. Specifically, at each iteration, DGT enables high-contribution gradients to be transmitted and updated to the model preferentially, and actively reduces the transmission quality of low-contribution gradients to mitigate communication overhead. DGT has multiple transmission channels with different transmission reliability and priority. For a raw gradient tensor, the sender first deconstructs it and then classifies its gradients into two categories, i.e., an important category and an unimportant category. The important gradients are scheduled to a reliable channel and delivered accurately. In contrast, the unimportant gradients are transmitted in "best-effort" delivery mode via unreliable channels with lower priority. The receiver reconstructs the received gradients into a structured tensor and then uploads it to the upper application. Although the general architecture of DGT is simple, designing and implementing a full-fledged solution that can improve the efficiency of distributed learning is not easy.
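To make the deconstruct/classify/reconstruct flow above concrete, the following is a minimal Python sketch of one iteration of such differential transmission. The channel objects, the function names and the block-based classification by 1-norm are our own illustration of the mechanism described here, not the prototype's actual API.

    import numpy as np

    def split_into_blocks(tensor, block_size):
        # Deconstruct a raw gradient tensor into fixed-size sub-blocks,
        # zero-padding the tail so the last block is full.
        flat = tensor.ravel()
        pad = (-flat.size) % block_size
        flat = np.concatenate([flat, np.zeros(pad, dtype=flat.dtype)])
        return flat.reshape(-1, block_size)

    def sender_transmit(tensor, block_size, p, reliable_ch, unreliable_chs):
        # Classify sub-blocks by contribution (1-norm of their gradients)
        # and schedule them: the top-p fraction goes over the reliable
        # channel, the rest over lower-priority, best-effort channels.
        blocks = split_into_blocks(tensor, block_size)
        contrib = np.abs(blocks).sum(axis=1)
        k = max(1, int(p * len(blocks)))
        important = set(np.argsort(contrib)[::-1][:k].tolist())
        for idx, block in enumerate(blocks):
            if idx in important:
                reliable_ch.send(idx, block)      # accurate delivery
            else:
                unreliable_chs[idx % len(unreliable_chs)].send(idx, block)

    def receiver_reconstruct(received, num_blocks, block_size, shape):
        # Rebuild the structured tensor expected by the application;
        # blocks lost on unreliable channels are zero-padded.
        out = np.zeros((num_blocks, block_size))
        for idx, block in received.items():       # {block_id: payload} arrived
            out[idx] = block
        return out.ravel()[: int(np.prod(shape))].reshape(shape)

In DGT itself, the unreliable channels additionally carry a lower priority tag that switches use for scheduling; the sketch only fixes the host-side control flow.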
We propose a novel algorithm and a novel protocol to facilitate fast model training. First, we design an efficient approximate gradient classification algorithm (Section 4). The algorithm estimates the contribution of a gradient in block granularity and then classifies it according to its approximate contribution. In the algorithm, we propose a convergence-aware classification threshold update method that makes the threshold adapt to the change of gradients as the training progresses. Compared to static settings, the method makes a better trade-off between average communication overhead and the number of iterations for distributed learning. Second, we deploy a contribution-aware differential gradient transmission protocol on DGT, which consists of a priority-based differential transmission mechanism and a differential reception method (Section 5). Such a protocol makes the packets of gradients with a higher contribution enjoy higher transmission quality in the network and be updated to the global model preferentially.

With these optimization methods, DGT can achieve good application performance, guaranteed convergence, and lower communication overhead for distributed learning. Moreover, DGT requires no changes to the existing transport layer; all our techniques are implemented either at hosts or by changing simple configurations of switches.

We implement DGT as a communication middleware and integrate it into the MXNET framework. Compared to the default gradient transmission on MXNET, our experimental results show that DGT decreases model training time by 19.4% on Googlenet, 34.4% on Alexnet, and 36.5% on VGG-11. Its acceleration is better than that of two other heuristic gradient transmission solutions, i.e., Sender-based Dropping (SD) [19] and ATP [22]. Besides, DGT works well with different datasets (Fashion-MNIST, Cifar10), different data distributions (IID, non-IID) and different training algorithms (BSP, FedAVG).

The main contributions of this work are summarized as follows:

• We identify that differential gradient transmission is beneficial for distributed machine learning and propose a general Differential Gradient Transmission (DGT) mechanism.
• We propose a novel algorithm and a novel protocol to facilitate fast model training on DGT.
• We build a prototype of DGT on a popular distributed machine learning system,1 i.e., MXNET, and demonstrate its effectiveness over a real distributed cluster and ML models.

The paper is organized as follows. We first provide some background on distributed machine learning and introduce our key observations in Section 2. After presenting an overview of DGT in Section 3, we describe the novel algorithm and protocol of DGT in Section 4 and Section 5 in detail. In Section 6, we briefly present the implementation and deployment of DGT. After reporting experimental results in Section 7, we review related work in Section 8 and conclude in Section 9.

2. Background and motivation

This section first briefly describes the training paradigm of distributed machine learning and the traffic characteristics of its communication pattern (Section 2.1). Then, it introduces our observations that differential gradient transmission is needed for distributed learning (Section 2.2) and that existing solutions lack the ability to provide such a transmission service (Section 2.3). These observations motivate us to study and design an efficient and easy-to-use differential gradient transmission mechanism in the next section.

2.1. Background on distributed machine learning

A general paradigm of machine learning is to continuously refine an ML model by minimizing its objective loss function value [23,24]. The value measures the accuracy of the ML model, e.g., it represents the error rate for a classification task. SGD-based algorithms are a series of extensively used optimization algorithms.

1 We have already made the source code of our prototype of DGT publicly available at https://github.com/zhouhuaman/dgt.
by 2.58× over 0.141. Fig. 2(b) illustrates a similar conclusion on Googlenet, where the improvement reaches 4.80×. So, we argue that a gradient with a higher contribution requires higher transmission reliability.

2.2.3. Gradient with a higher contribution requires lower transmission delay

Since SGD-based algorithms of distributed learning are random optimization algorithms, they tolerate gradient delay [14,34]; that is, gradients at the τ-th iteration can be updated to the model at the (τ + i)-th iteration, where i is the number of delayed rounds. So far, many works have exploited this feature to accelerate distributed learning. For example, asynchronous SGD [35,36] exploits the feature to relax the strict synchronization between workers, i.e., it allows the gradients of slow workers to be updated to the model belatedly, which significantly improves the system efficiency of distributed learning.
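As a formalization of this delay tolerance (in our own notation, not an equation taken from the paper): if g_τ is the stochastic gradient computed at iteration τ over mini-batch B_τ and it arrives i rounds late, the asynchronous update applied to the global model w can be written as

    % Delay-tolerant SGD update (illustrative notation)
    w_{\tau+i+1} = w_{\tau+i} - \eta \, g_\tau,
    \qquad
    g_\tau = \frac{1}{|B_\tau|} \sum_{x \in B_\tau} \nabla \ell(w_\tau; x)

Convergence results for asynchronous SGD typically require the staleness i to stay bounded, which is exactly the slack DGT exploits when it deprioritizes low-contribution gradients.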
Besides, several works [14,34] have demonstrated that updating a gradient with a higher contribution to the global model with lower delay is beneficial to model convergence. Gradient Sparsification [34] updates the gradients whose absolute value is greater than a predefined threshold to the global model in a timely manner and caches the small gradients until their accumulated value exceeds the threshold. In effect, Gradient Sparsification differentiates the delay of updating gradients to the global model according to their contribution. Therefore, when network resources are limited, preferentially transmitting gradients with a higher contribution can yield a better convergence gain on the global model. So, we argue that a gradient with a higher contribution requires a lower transmission delay.
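The residual-accumulation behavior just described can be sketched in a few lines of Python; this is our paraphrase of the technique in [34] with hypothetical names, not code from either paper.

    import numpy as np

    class SparsifyingSender:
        """Send only large gradients now; accumulate the rest locally."""

        def __init__(self, shape, threshold):
            self.residual = np.zeros(shape)   # cached small gradients
            self.threshold = threshold

        def step(self, grad):
            acc = self.residual + grad                     # fold in history
            send_mask = np.abs(acc) > self.threshold       # large entries go now
            to_send = np.where(send_mask, acc, 0.0)
            self.residual = np.where(send_mask, 0.0, acc)  # small entries wait
            return to_send                                 # sparse update for server

Entries below the threshold are therefore not dropped, only delayed until their accumulated magnitude crosses the threshold.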
Overall, different gradients require different transmission quality guarantees, such as transmission reliability and transmission delay. When network resources are limited, differentiating the transmission qualities of gradients can maximize network utilization and the performance of distributed learning.

2.3. Lack of differential gradient transmission in existing solutions

Existing gradient transmission solutions [19,20,22] do not take into account the different transmission quality required by different gradients and transmit all gradients with the same transmission service. Specifically, we summarize existing gradient transmission solutions into two categories, namely accurate transmission and approximate transmission. Accurate transmission transmits all gradients indiscriminately over a reliable transmission service, i.e., TCP or RoCE, which has been popularly adopted by many distributed machine learning frameworks [20,21]. It is worth noting that accurate transmission inevitably results in long tail latency because of recovering lost gradients. Approximate transmission solutions [19,22] exploit the loss tolerance of the SGD-based algorithm to improve the efficiency of transmission. Such solutions discard parts of gradients at the sending end or passively ignore parts of gradients lost in the network. However, none of them considers the contribution of gradients; they all execute random-based loss.

3. DGT design overview and challenges

In this section, we present a design overview of DGT. Then, we analyze the challenges that motivate the designs in the following sections.

We believe that identifying the importance of data and discriminatively transmitting it with different transmission quality improves the performance of distributed training. This insight motivates us to propose Differential Gradient Transmission (DGT), a contribution-aware differential gradient transmission mechanism designed for distributed machine learning. The basic idea of DGT is to provide differential transmission quality for different gradients according to their contribution. Specifically, at each iteration, DGT enables high-contribution gradients to be transmitted and updated to the model preferentially and actively reduces the transmission quality of low-contribution gradients to mitigate communication overhead. DGT consists of three parts deployed at sender hosts (as a user-level library), receiver hosts (as a user-level library), and switches; they collectively perform differential gradient transmission.

As shown in Fig. 3, a sender library estimates the contribution of all gradients in a gradient tensor. It then classifies all gradients into two categories, i.e., important gradients and unimportant gradients, and schedules them to transmission channels with different transmission quality according to their importance and contribution. A receiver library reconstructs the received gradients into the structured tensor required by the application, using zero-padding for lost gradients. Switches in DGT schedule transmissions according to their priority tags.

On the way to building an efficient DGT solution, the key challenges include:

• How to classify gradients? We need to overcome the extremely high complexity of accurate classification and design an efficient classification threshold update method (Section 4).
• How to differentiate transmissions for different gradients? We need to design an efficient and easy-to-use transmission protocol. The protocol should improve the performance of distributed learning as much as possible while guaranteeing that the application accuracy is not compromised (Section 5).

4. Approximate gradient classification algorithm

This section discusses the design of an approximate gradient classification algorithm and a heuristic update method for the classification threshold in the algorithm.

4.1. Algorithm design

The contribution of each gradient can be estimated by its absolute value [14,37]. Since the number of gradients usually reaches 10 million or even 100 million, accurately estimating and classifying each dimension of a gradient tensor incurs not only a very high computational complexity but also a massive communication complexity introduced by the additional position indexes used for reconstruction. Therefore, the above method is not feasible in practice because of its high complexity.

In this paper, we heuristically propose an Approximate Gradient Classification (AGC) algorithm to reduce the computation and communication complexity of DGT, which estimates the contribution of gradients and classifies them in block granularity. The pseudocode of the AGC algorithm is described as Algorithm 1. Specifically, the algorithm contains three steps, as shown in Fig. 6:

Step 1: AGC divides the gradient tensor into a set of gradient sub-blocks. Note that the block size is set according to the network architecture and training environment for optimal performance. For example, for a convolution layer, its gradients are naturally arranged in convolution kernel granularity. Besides, several works demonstrate that a convolution kernel is activated if and only if the input data contains the feature it extracts [38,39]. We have verified this insight by visualizing the gradient distribution of a convolution layer. As shown in Fig. 4, in a convolution layer, the gradients present a block-like distribution
according to the convolution kernel's size after a mini-batch of training. So, we adopt the convolution kernel size as the block size of a convolution layer.

Fig. 4. The gradient distribution of Alexnet's convolution layer 1. Note that each line is an expansion of a 5*5 convolution kernel. We visualize the gradient distribution at iteration = 10 (a), 1000 (b), 4000 (c) and 8000 (d).

For a fully-connected layer, we heuristically classify its gradients with a user-defined block size. In theory, a large user-defined block size reduces the computation and communication complexity but also decreases the classification accuracy. It is necessary to set an appropriate user-defined block size for optimal performance according to the specific training environment. For example, as shown in Fig. 5, we compare the job completion time (JCT) under different block size settings. If the block size is too small (such as 32), the additional computation and communication overhead of DGT makes the performance worse than the Baseline. In contrast, if the block size is too large (such as 8192), inaccurate classification makes DGT require more communication rounds, which also makes the performance worse than the Baseline. In this experiment, 2048 is a good compromise value for the block size.

Fig. 5. Job completion time (JCT) under different block-size settings of Alexnet's fully-connected layers. Note that the results are normalized with the performance of Baseline.

Step 2: Given a gradient sub-block with size n, AGC estimates the 1-norm of its gradients as its contribution to model convergence. At the same time, referring to the momentum gradient descent algorithm [40], AGC averages the historical contribution into the contribution of the current iteration with a momentum, aiming to properly account for the historical contribution.

Step 3: All sub-blocks in the k-th tensor are ranked according to their contribution. Then, the top-p% of sub-blocks are classified into the important category and the rest into the unimportant category. Note that the classification threshold p determines the amount of important gradients.

Fig. 6. Workflow of the approximate gradient classification algorithm.
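Putting the three steps together, a compact Python sketch of AGC might look as follows. The exponential-moving-average form of the momentum smoothing and all identifiers are our reading of the description above, not the actual pseudocode of Algorithm 1.

    import numpy as np

    class AGCClassifier:
        """Approximate Gradient Classification, per the three-step description."""

        def __init__(self, num_blocks, p=0.6, momentum=0.9):
            self.contrib = np.zeros(num_blocks)  # smoothed per-block contribution
            self.p = p                           # top-p fraction -> important
            self.momentum = momentum

        def classify(self, blocks):
            # Step 2: 1-norm of each sub-block, smoothed with a momentum term
            # (assumed form: c_t = m * c_{t-1} + (1 - m) * ||g_t||_1).
            norms = np.abs(blocks).sum(axis=1)
            self.contrib = (self.momentum * self.contrib
                            + (1 - self.momentum) * norms)
            # Step 3: rank blocks and take the top-p% as "important".
            k = max(1, int(self.p * len(blocks)))
            order = np.argsort(self.contrib)[::-1]
            important = np.zeros(len(blocks), dtype=bool)
            important[order[:k]] = True
            return important

    # Step 1 (blocking) reuses the layer's natural granularity: one block per
    # convolution kernel, or a user-defined block size for fully-connected layers.

The classifier returns a boolean mask per block, which maps directly onto the reliable/unreliable channel scheduling described in Section 3.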
6.2. Deployment
7. Evaluation
Table 1
The distribution of training samples over workers (label: num_samples).

IID partitions:
  Cifar10: workers 1–11 each hold labels 0–9 with 4166 samples; worker 12 holds labels 0–9 with 4174 samples.
  Fashion-MNIST: workers 1–12 each hold labels 0–9 with 5000 samples.

Non-IID partitions (Cifar10 / Fashion-MNIST):
  Worker 1:  1:1250, 2:1000, 5:714, 8:714   /  1:1500, 2:1200, 5:857, 8:857
  Worker 2:  0:1250, 1:1250, 4:1666, 5:714  /  0:1500, 1:1500, 4:2000, 5:857
  Worker 3:  0:1250, 3:714, 8:714, 9:1250   /  0:1500, 3:857, 8:857, 9:1500
  Worker 4:  0:1250, 3:714, 5:714, 7:1000   /  0:1500, 3:857, 5:857, 7:1200
  Worker 5:  2:1000, 5:714, 6:2500, 7:1000  /  2:1200, 5:857, 6:3000, 7:1200
  Worker 6:  4:1666, 6:2500, 8:714, 9:1250  /  4:2000, 6:3000, 8:857, 9:1500
  Worker 7:  3:714, 4:1666, 7:1000, 9:1250  /  3:857, 4:2000, 7:1200, 9:1500
  Worker 8:  2:1000, 3:714, 7:1000, 8:714   /  2:1200, 3:857, 7:1200, 8:857
  Worker 9:  0:1250, 2:1000, 3:714, 9:1250  /  0:1500, 2:1200, 3:857, 9:1500
  Worker 10: 3:714, 5:714, 7:1000, 8:714    /  3:857, 5:857, 7:1200, 8:857
  Worker 11: 1:1250, 3:714, 5:714, 8:714    /  1:1500, 3:857, 5:857, 8:857
  Worker 12: 1:1250, 2:1000, 5:714, 8:714   /  1:1500, 2:1200, 5:857, 8:857
Fig. 12. Job completion time (JCT) of training Alexnet on IID data partitions of Fashion-MNIST and Cifar10 with four gradient transmission mechanisms. Note that the results are normalized with the performance of Baseline.

Fig. 14. Job completion time (JCT) of training Alexnet on BSP and FedAVG (E=5) with four gradient transmission mechanisms. Note that the results are normalized with the performance of Baseline.
Fig. 15. Comparison of training performance on AlexNet when DGT updates the classification threshold with static settings (p = 1.0, 0.8, 0.6, 0.4) and with the convergence-aware update method. The training performance is evaluated in three aspects: job completion time (JCT) (a), number of iterations (b) and average communication overhead per iteration (c). Note that the results are normalized with the training performance when threshold p = 1.0.
As can be seen from Fig. 15(a), compared with static settings, the convergence-aware update method gives distributed learning the least job completion time in our experimental cluster. In fact, as shown in Fig. 15(c) and Fig. 15(b), with a lower static threshold, DGT makes distributed learning incur lower average communication overhead. In contrast, however, a lower static threshold makes the trained model need more iterations to converge. The reason is that a lower static threshold means that more unimportant gradients are discarded in the network, which significantly impairs the model convergence in the worst case. Our experimental results verify the potential inverse relationship between communication overhead and the number of iterations. Compared with static settings, our convergence-aware update method achieves a better trade-off between communication overhead and the number of iterations, making both reach relatively low values. This is because the convergence-aware update method can adjust the classification threshold according to the training progress of distributed learning. At the early stage of training, it avoids a low threshold so that the model convergence is not impaired. At the later stage, it can lower the threshold, thereby reducing the communication overhead. Overall, at present, the convergence-aware classification threshold update method is a relatively good method; exploring a better one is left for future work.
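The early-conservative, later-aggressive behavior described above can be illustrated with a hedged sketch. The paper's actual convergence-aware rule belongs to Section 4 and is not fully reproduced in this excerpt, so the loss-driven schedule below is only one plausible instantiation, and every name in it is ours.

    def update_threshold(p, loss_prev, loss_now,
                         p_min=0.4, p_max=1.0, step=0.05, eps=1e-12):
        # Illustrative convergence-aware schedule for the classification
        # threshold p (the fraction of sub-blocks treated as important).
        # Early in training the loss drops quickly, so p is kept high and
        # few gradients travel unreliably; once the loss plateaus, p is
        # lowered to cut communication overhead.
        rel_improvement = (loss_prev - loss_now) / max(loss_prev, eps)
        if rel_improvement > 0.01:       # still converging fast: stay conservative
            return min(p + step, p_max)
        return max(p - step, p_min)      # near convergence: send less reliably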
7.2.3. Effect of priority-based differential transmission

The purpose of this experiment is to verify the effectiveness of the priority-based differential transmission mechanism of DGT. Specifically, we compare the training performance of distributed learning under the following settings:

• Baseline: MXNET's default gradient transmission solution, where all gradients are transmitted indiscriminately through a reliable channel.
• N{1,1}: gradient transmission with the DGT solution, with one reliable channel and one unreliable channel. This setting does not distinguish the priority between channels.
• Y{x,y}: gradient transmission with the DGT solution, with x reliable channels and y unreliable channels. This setting does distinguish the priority between channels.

We compare the training performance of distributed learning under the different settings, as shown in Fig. 16. As can be seen from Fig. 16(a), compared to N{1,1}, when prioritizing the transmissions on different channels, Y{1,1} significantly reduces the job completion time. This is because Y{1,1} gives the reliable channel more network resources to allow the flow of important gradients to complete as quickly as possible, thus reducing the average communication overhead per iteration, as shown in Fig. 16(b). At the same time, as we increase the number of unreliable channels, the job completion time is further gradually reduced. This benefit comes from the reduction of the number of iterations. As shown in Fig. 16(c), compared to Y{1,1}, Y{1,3} and Y{1,7} require fewer iterations. This is because, as the number of unreliable channels increases, unimportant gradients with a higher contribution have a smaller probability of being lost in the network. In this experiment, the job completion time of Y{1,7} is 10.6% lower than N{1,1} for Googlenet and 17.2% lower for Alexnet.

7.2.4. Effect of differential reception method

To test the effect of the differential reception method (Diff-Reception) at the receiver side, we compare it with two other methods: Baseline (the default gradient transmission of MXNET) and "Heuristic Dropping", which simply discards the delayed gradients. The experimental configuration is: classification threshold p = 0.6, gradient transmission solution Y{1,1}. In the experiment, we compare the training performance of distributed learning when DGT adopts different gradient reception methods at the receiver side, as shown in Fig. 17.

As can be seen from Fig. 17(a), compared with Baseline, Diff-Reception reduces the job completion time by 8.9% on Googlenet and 20.9% on Alexnet. However, "Heuristic Dropping" does not achieve a significant acceleration. This is because, as illustrated in Fig. 17(c), "Heuristic Dropping" results in a significant increase in the number of iterations, which further illustrates that actively dropping delayed gradients significantly damages the model convergence. On the other hand, as shown in Fig. 17(b), the processing overhead of Diff-Reception is very low, making its communication overhead almost the same as that of "Heuristic Dropping". The experimental results show that our Diff-Reception is a comparatively good gradient reception method in the DGT solution.

8. Related work

This section discusses works related to DGT.

It has received extensive attention that high communication overhead seriously affects the performance of distributed machine learning. To reduce communication overhead, most data centers adopt advanced network techniques that provide larger bandwidth for the cluster network, e.g., [27,50] connect computing nodes with RDMA or Cray technologies. However, increasing bandwidth between computing nodes is not a general solution. As the number of parallel computing nodes increases, the bandwidth advantages gradually disappear. Moreover, the solution is not applicable in a cross-domain cluster.

To the best of our knowledge, there is still relatively little work optimizing the efficiency of distributed learning from the perspective of data transmission. [19] is the most relevant work.
Fig. 16. Comparison of training performance on two ML models, i.e., Alexnet and Googlenet, when DGT adopts different channel settings and priority settings. The training performance is evaluated in three aspects: job completion time (JCT) (a), average communication overhead per iteration (b) and number of iterations (c). Note that the results are normalized with the training performance of Baseline.

Fig. 17. Comparison of training performance on two ML models, i.e., Alexnet and Googlenet, when DGT adopts different gradient reception protocols at the receiver side. The training performance is evaluated in three aspects: job completion time (JCT) (a), average communication overhead per iteration (b) and number of iterations (c). Note that the results are normalized with the training performance of Baseline.
Writing - review & editing. Qingqing Cai: Investigation, Writing - review & editing. Hongfang Yu: Supervision, Funding acquisition, Project administration. Shouxi Luo: Writing - review & editing. Long Luo: Writing - review & editing. Gang Sun: Supervision, Validation, Writing - review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This research was partially supported by the National Key Research and Development Program of China (2019YFB1802800), PCL Future Greater-Bay Area Network Facilities for Large-scale Experiments and Applications, China (LZC0019), China Postdoctoral Science Foundation (2019M663552) and the Fundamental Research Funds for the Central Universities, China (2682019CX61).

References

[1] T. Ben-Nun, T. Hoefler, Demystifying parallel and distributed deep learning: An in-depth concurrency analysis, ACM Comput. Surv. 52 (4) (2019) 65.
[2] K.S. Chahal, M.S. Grover, K. Dey, R.R. Shah, A hitchhiker's guide on distributed training of deep neural networks, J. Parallel Distrib. Comput. 137 (2020) 65–76.
[3] R. Mayer, H.-A. Jacobsen, Scalable deep learning on distributed infrastructures: Challenges, techniques and tools, 2019, arXiv preprint arXiv:1903.11314.
[4] L. Bottou, Large-scale machine learning with stochastic gradient descent, in: Proceedings of COMPSTAT'2010, Springer, 2010, pp. 177–186.
[5] I. Sutskever, J. Martens, G. Dahl, G. Hinton, On the importance of initialization and momentum in deep learning, in: International Conference on Machine Learning, 2013, pp. 1139–1147.
[6] J. Duchi, E. Hazan, Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, J. Mach. Learn. Res. 12 (Jul) (2011) 2121–2159.
[7] M. Li, D.G. Andersen, A.J. Smola, K. Yu, Communication efficient distributed machine learning with the parameter server, in: Advances in Neural Information Processing Systems, 2014, pp. 19–27.
[8] L. Luo, J. Nelson, L. Ceze, A. Phanishayee, A. Krishnamurthy, Parameter hub: a rack-scale parameter server for distributed deep neural network training, in: Proceedings of the ACM Symposium on Cloud Computing, ACM, 2018, pp. 41–54.
[9] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, K. He, Accurate, large minibatch sgd: Training imagenet in 1 hour, 2017, arXiv preprint arXiv:1706.02677.
[10] K. Hsieh, A. Harlap, N. Vijaykumar, D. Konomis, G.R. Ganger, P.B. Gibbons, O. Mutlu, Gaia: geo-distributed machine learning approaching LAN speeds, in: 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), 2017, pp. 629–647.
[11] J. Konečnỳ, H.B. McMahan, D. Ramage, P. Richtárik, Federated optimization: Distributed machine learning for on-device intelligence, 2016, arXiv preprint arXiv:1610.02527.
[12] L. Mai, C. Hong, P. Costa, Optimizing network performance in distributed machine learning, in: 7th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 15), 2015, p. 2.
[13] F. Seide, H. Fu, J. Droppo, G. Li, D. Yu, 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns, in: Fifteenth Annual Conference of the International Speech Communication Association, 2014, pp. 1058–1062.
[14] Y. Lin, S. Han, H. Mao, Y. Wang, W.J. Dally, Deep gradient compression: Reducing the communication bandwidth for distributed training, 2017, arXiv preprint arXiv:1712.01887.
[15] D. Alistarh, J. Li, R. Tomioka, M. Vojnovic, Qsgd: Randomized quantization for communication-optimal stochastic gradient descent, 2016, arXiv preprint arXiv:1610.02132.
[16] S. Sun, W. Chen, J. Bian, X. Liu, T.-Y. Liu, Ensemble-compression: A new method for parallel training of deep neural networks, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2017, pp. 187–202.
[17] S.H. Hashemi, S.A. Jyothi, R.H. Campbell, Tictac: Accelerating distributed deep learning with communication scheduling, 2018, arXiv preprint arXiv:1803.03288.
[18] A. Jayarajan, J. Wei, G. Gibson, A. Fedorova, G. Pekhimenko, Priority-based parameter propagation for distributed DNN training, 2019, arXiv preprint arXiv:1905.03960.
[19] J. Xia, G. Zeng, J. Zhang, W. Wang, W. Bai, J. Jiang, K. Chen, Rethinking transport layer design for distributed machine learning, in: Proceedings of the 3rd Asia-Pacific Workshop on Networking 2019, ACM, 2019, pp. 22–28.
[20] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, Z. Zhang, Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems, 2015, arXiv preprint arXiv:1512.01274.
[21] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., Tensorflow: A system for large-scale machine learning, in: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
[22] K. Liu, S.-Y. Tsai, Y. Zhang, ATP: a datacenter approximate transmission protocol, 2019, arXiv preprint arXiv:1901.01632.
[23] S.B. Kotsiantis, I. Zaharakis, P. Pintelas, Supervised machine learning: A review of classification techniques, Emerg. Artif. Intell. Appl. Comput. Eng. 160 (2007) 3–24.
[24] G.A. Seber, A.J. Lee, Linear Regression Analysis, vol. 329, John Wiley & Sons, 2012.
[25] R. Johnson, T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, in: Advances in Neural Information Processing Systems, 2013, pp. 315–323.
[26] N.L. Roux, M. Schmidt, F.R. Bach, A stochastic gradient method with an exponential convergence rate for finite training sets, in: Advances in Neural Information Processing Systems, 2012, pp. 2663–2671.
[27] F.N. Iandola, M.W. Moskewicz, K. Ashraf, K. Keutzer, Firecaffe: near-linear acceleration of deep neural network training on compute clusters, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2592–2600.
[28] X. Jia, S. Song, W. He, Y. Wang, H. Rong, F. Zhou, L. Xie, Z. Guo, Y. Yang, L. Yu, et al., Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes, 2018, arXiv preprint arXiv:1807.11205.
[29] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255.
[30] M.J. Quinn, Parallel programming, TMH CSE 526 (2003).
[31] M. Li, L. Zhou, Z. Yang, A. Li, F. Xia, D.G. Andersen, A. Smola, Parameter server for distributed machine learning, in: Big Learning NIPS Workshop, vol. 6, 2013, p. 2.
[32] N. Strom, Scalable distributed DNN training using commodity GPU cloud computing, in: Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[33] O. Dekel, R. Gilad-Bachrach, O. Shamir, L. Xiao, Optimal distributed online prediction using mini-batches, J. Mach. Learn. Res. 13 (Jan) (2012) 165–202.
[34] A.F. Aji, K. Heafield, Sparse communication for distributed gradient descent, 2017, arXiv preprint arXiv:1704.05021.
[35] B. Recht, C. Re, S. Wright, F. Niu, Hogwild: A lock-free approach to parallelizing stochastic gradient descent, in: Advances in Neural Information Processing Systems, 2011, pp. 693–701.
[36] E.P. Xing, Q. Ho, W. Dai, J.K. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, Y. Yu, Petuum: A new platform for distributed machine learning on big data, IEEE Trans. Big Data 1 (2) (2015) 49–67.
[37] C. Hardy, E. Le Merrer, B. Sericola, Distributed deep learning on edge-devices: feasibility via adaptive compression, in: 16th International Symposium on Network Computing and Applications (NCA), IEEE, 2017, pp. 1–8.
[38] M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: European Conference on Computer Vision, Springer, 2014, pp. 818–833.
[39] N. Ketkar, E. Santana, Deep Learning with Python, vol. 1, Springer, 2017.
[40] N. Qian, On the momentum term in gradient descent learning algorithms, Neural Netw. 12 (1) (1999) 145–151.
[41] J. Xu, Z. Zhang, T. Friedman, Y. Liang, G.V.d. Broeck, A semantic loss function for deep learning with symbolic knowledge, 2017, arXiv preprint arXiv:1711.11157.
[42] M.A. Hughes, D.M. O'keeffe, K. Loughran, J.N. Butler, TCP control packet differential service, US Patent 7,366,168, 2008.
[43] https://github.com/zhouhuaman/dgt.
[44] P. Ballester, R.M. Araujo, On the performance of GoogLeNet and AlexNet applied to sketches, in: Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[45] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014, arXiv preprint arXiv:1409.1556.
[46] H. Xiao, K. Rasul, R. Vollgraf, Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017, arXiv preprint arXiv:1708.07747.
[47] K. Klosowski, Image recognition on CIFAR10 dataset using resnet18 and keras, 2018.
[48] B. McMahan, E. Moore, D. Ramage, S. Hampson, B.A. y Arcas, Communication-efficient learning of deep networks from decentralized data, in: Artificial Intelligence and Statistics, PMLR, 2017, pp. 1273–1282.
[49] F. Sattler, S. Wiedemann, K.-R. Müller, W. Samek, Robust and communication-efficient federated learning from non-iid data, IEEE Trans. Neural Netw. Learn. Syst. (2019).
[50] Y. Ren, X. Wu, L. Zhang, Y. Wang, W. Zhang, Z. Wang, M. Hack, S. Jiang, Irdma: Efficient use of rdma in distributed deep learning systems, in: 19th International Conference on High Performance Computing and Communications, IEEE, 2017, pp. 231–238.
[51] D. Alistarh, D. Grubic, J. Li, R. Tomioka, M. Vojnovic, QSGD: Communication-efficient SGD via gradient quantization and encoding, in: Advances in Neural Information Processing Systems, 2017, pp. 1709–1720.

Hongfang Yu is a professor at the School of Information and Communication Engineering of University of Electronic Science and Technology of China (UESTC). Her current research interests include data center networking, network (function) virtualization, cloud/edge computing, and distributed AI systems. Her research has been supported by NSFC, the National Key Research and Development Program of China, the National Grand Fundamental Research 973 Program and the 863 Program, among others. She has submitted over 30 international and nationwide patent applications, and has authored/coauthored over 200 papers in international journals and conferences.