
DGT: A contribution-aware differential gradient transmission mechanism for distributed machine learning

Huaman Zhou a, Zonghang Li a, Qingqing Cai a, Hongfang Yu a,c,∗∗, Shouxi Luo b, Long Luo a, Gang Sun a,d,∗

a Key Laboratory of Optical Fiber Sensing and Communications (Ministry of Education), University of Electronic Science and Technology of China, Chengdu, People's Republic of China
b Southwest Jiaotong University, Chengdu, People's Republic of China
c Peng Cheng Laboratory, Shenzhen, People's Republic of China
d Agile and Intelligent Computing Key Laboratory of Sichuan Province, Chengdu, China

∗ Corresponding author.
∗∗ Corresponding author at: Key Laboratory of Optical Fiber Sensing and Communications (Ministry of Education), University of Electronic Science and Technology of China, Chengdu, People's Republic of China.
E-mail address: gangsun@uestc.edu.cn (G. Sun).

Article info

Article history:
Received 13 January 2020
Received in revised form 31 January 2021
Accepted 9 March 2021
Available online 14 March 2021

Keywords:
Distributed Machine Learning (DML)
Gradient transmission
Parameter server architecture

Abstract

Distributed machine learning is a mainstream system to learn insights for analytics and intelligence services on many fronts (e.g., health, streaming and business) from their massive operational data. In such a system, multiple workers train over subsets of data and collaboratively derive a global prediction/inference model by iteratively synchronizing their local learning results, e.g., the model gradients, which in turn generates heavy and bursty traffic and results in high communication overhead in cluster networks. Such communication overhead has become the main bottleneck that limits the efficiency of training machine learning models in a distributed manner. In this paper, our key observation is that local gradients learned by workers may have different contributions to global model convergence, and that executing differential transmission for different gradients can reduce the communication overhead and improve training efficiency. However, existing gradient transmission mechanisms treat all gradients the same, which may lead to long training time.
Motivated by our observations, we propose Differential Gradient Transmission (DGT), a contribution-aware differential gradient transmission mechanism for efficient distributed learning, which transfers gradients with different transmission quality according to their contributions. In addition to designing a general architecture of DGT, we propose a novel algorithm and a novel protocol to facilitate fast model training. Experiments on a cluster with 6 GTX 1080TI GPUs and a 1 Gbps network show that DGT decreases the model training time by 19.4% on GoogleNet, 34.4% on AlexNet and 36.5% on VGG-11 compared to the default gradient transmission on MXNET. Its acceleration is better than that of two other related transmission solutions. Besides, DGT works well with different datasets (Fashion-MNIST, Cifar10), different data distributions (IID, non-IID) and different training algorithms (BSP, FedAVG).

1. Introduction

The success of emerging artificial intelligence technologies is largely due to massive data and complex models, which require extremely high storage capacity and computing power to perform model training. It can take weeks or even months to achieve convergence on a single machine. To this end, distributed machine learning [1–3] is an effective and promising training paradigm for complex ML models, e.g., deep neural networks. In distributed learning, several computing machines (called workers) train a shared global model collaboratively by iteratively applying a distributed stochastic gradient descent (SGD) algorithm or its variants [4–6]. In each iteration, computing machines fetch the global model from the parameter server [7,8], train the model on their local dataset, and then upload their learning results, i.e., gradients, to the parameter server for refining the model. A distributed learning cluster can be deployed on a data center network [9], an enterprise network [2] or a cross-domain WAN [10,11]. Synchronizing massive amounts of parameters across distributed workers and parameter servers at each iteration imposes heavy communication traffic on the cluster network, creating potential congestion, especially on the parameter server side [8]. In the worst case, communication overhead slows down distributed learning by several orders of magnitude [12–14].

Cutting down the communication overhead is a crucial breakthrough to improve the efficiency of distributed learning. Researchers from the ML and systems communities have come up with several application-layer solutions to reduce the communication overhead by compressing the amount of communicated parameters [13–15], reducing the frequency of synchronization [16] and overlapping communication with computation [17,18].


These solutions exploit the property that the SGD-based algorithm or its variants can still make the model converge and produce equal precision with a non-trivial amount of lower-precision, delayed, or missed gradients. Some researchers [19] have argued that a customized transport-layer solution is needed for distributed learning. In this paper, we focus on exploring an efficient end-to-end traffic scheduling mechanism for gradient transmission, which acts between the application layer and existing transport-layer protocols.

Our key observation is that providing the same end-to-end transmission for all gradients is unnecessary, and even adverse to the efficiency of distributed learning. First, gradients have different contributions to model convergence, and their difference becomes more obvious as the model converges. Second, a gradient with a higher contribution requires higher transmission reliability. Conversely, there is an opportunity to strategically reduce the transmission reliability of low-contribution gradients to mitigate communication overhead. Third, a gradient with a higher contribution requires a lower transmission delay. Preferentially transmitting gradients with higher contributions lifts the accuracy of the model as soon as possible. Therefore, we argue that differential gradient transmission is needed for distributed learning. However, existing solutions cannot provide refined differential gradient transmission. For example, existing DML frameworks such as MXNET [20] or Tensorflow [21] all rely on accurate gradient transmission, i.e., the worker uploads all gradients to the parameter server over a reliable transmission service. Although some related works [19,22] execute approximate gradient transmission by heuristically ignoring parts of the lost gradients, they still treat all gradients the same. Our observations motivate us to fill this research gap.

In this paper, we propose Differential Gradient Transmission (DGT), a contribution-aware differential gradient transmission mechanism for distributed machine learning. The basic idea of DGT is to provide a differential transmission service for gradients according to their contribution to model convergence. Specifically, at each iteration, DGT enables high-contribution gradients to be transmitted and updated to the model preferentially, and actively reduces the transmission quality of low-contribution gradients to mitigate communication overhead. DGT has multiple transmission channels with different transmission reliability and priority. For a raw gradient tensor, the sender first deconstructs it and then classifies its gradients into two categories, i.e., an important category and an unimportant category. The important gradients are scheduled to a reliable channel and delivered accurately. Instead, the unimportant gradients are transmitted in "best-effort" delivery mode via unreliable channels with lower priority. The receiver reconstructs the received gradients into a structured tensor and then uploads it to the upper application. Although the general architecture of DGT is simple, designing and implementing a full-fledged solution that can improve the efficiency of distributed learning is not easy.

We propose a novel algorithm and a novel protocol to facilitate fast model training. First, we design an efficient approximate gradient classification algorithm (Section 4). The algorithm estimates the contribution of a gradient at block granularity and then classifies it according to its approximate contribution. In the algorithm, we propose a convergence-aware classification threshold update method that makes the threshold adapt to the change of gradients as training progresses. Compared to static settings, the method makes a better trade-off between average communication overhead and the number of iterations for distributed learning. Second, we deploy a contribution-aware differential gradient transmission protocol on DGT, which consists of a priority-based differential transmission mechanism and a differential reception method (Section 5). This protocol ensures that packets of gradients with a higher contribution receive higher transmission quality in the network and are updated to the global model preferentially.

With these optimization methods, DGT can achieve good application performance, guaranteed convergence, and lower communication overhead for distributed learning. Moreover, DGT requires no changes to the existing transport layer; all our techniques are implemented either at hosts or by changing simple configurations of switches.

We implement DGT as a communication middleware and integrate it into the MXNET framework. Compared to the default gradient transmission on MXNET, our experimental results show that DGT decreases model training time by 19.4% on Googlenet, 34.4% on Alexnet, and 36.5% on VGG-11. Its acceleration is better than that of two other heuristic gradient transmission solutions, i.e., Sender-based Dropping (SD) [19] and ATP [22]. Besides, DGT works well with different datasets (Fashion-MNIST, Cifar10), different data distributions (IID, non-IID) and different training algorithms (BSP, FedAVG).

The main contributions of this work are summarized as follows:

• We identify that differential gradient transmission is beneficial for distributed machine learning and propose a general differential gradient transmission (DGT) mechanism.
• We propose a novel algorithm and a novel protocol to facilitate fast model training on DGT.
• We build a prototype of DGT on a popular distributed machine learning system,1 i.e., MXNET, and demonstrate its effectiveness over a real distributed cluster and ML models.

1 We have already made the source code of our prototype of DGT publicly available at https://github.com/zhouhuaman/dgt.

The paper is organized as follows. We first provide some background on distributed machine learning and introduce our key observations in Section 2. After presenting an overview of DGT in Section 3, we describe the novel algorithm and protocol of DGT in Section 4 and Section 5 in detail. In Section 6, we briefly present the implementation and deployment of DGT. After reporting experimental results in Section 7, we review related work in Section 8 and conclude in Section 9.

2. Background and motivation

This section first briefly describes the training paradigm of distributed machine learning and the traffic characteristics of its communication pattern (Section 2.1). Then, it introduces our observations that differential gradient transmission is needed for distributed learning (Section 2.2) and that existing solutions lack the ability to provide such a transmission service (Section 2.3). These observations motivate us to study and design an efficient and easy-to-use differential gradient transmission mechanism in the next section.

2.1. Background on distributed machine learning

A general paradigm of machine learning is to continuously refine an ML model by minimizing its objective loss function value [23,24]. The value measures the accuracy of the ML model, e.g., it represents the error rate for a classification task. SGD-based algorithms are a series of extensively used optimization algorithms to minimize the objective value in research or production scenarios [25,26].


Fig. 1. Common pattern of distributed machine learning.

In such algorithms, an ML model always needs to be refined over multiple iterations. In each iteration, the ML model is updated by its gradients after the forward–backward process. The critical metric of machine learning is its training performance, which is measured by the model training time to the desired accuracy. Improving training performance is not only an algorithm engineering problem but also a system engineering problem [27,28]. To handle huge datasets, e.g., ImageNet [29], synchronized distributed machine learning (DML) is an effective method [30]. In such a training paradigm, computing nodes form a training cluster to train a shared model collaboratively. At the system level, the parameter-server architecture [20,21,31] is extensively used to implement distributed learning. In each iteration, as shown in Fig. 1, the parameter server distributes the current parameters of the global ML model to multiple workers. After fetching the current model, workers execute the forward–backward process locally and then aggregate their learning results, i.e., gradients of the model, to the parameter server for refining the shared ML model using SGD-based algorithms.

It is worth noting that the iterative synchronization of parameters generates periodic and bursty traffic on a cluster network. When network resources are limited, DML suffers from high communication overhead of data transmission.
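For readers less familiar with this loop, the sketch below illustrates one synchronized iteration of the parameter-server paradigm of Fig. 1. It is a minimal illustration under our own assumptions; the class and function names do not reflect the MXNET or PS-LITE interfaces.

```cpp
// Minimal sketch of one synchronous iteration under the parameter-server
// paradigm of Section 2.1 (Fig. 1). Names are illustrative assumptions.
#include <vector>
#include <cstddef>

struct ParameterServer {
    std::vector<float> weights;                 // global model
    void update(const std::vector<std::vector<float>>& grads, float lr) {
        // aggregate worker gradients and apply one SGD step
        for (std::size_t i = 0; i < weights.size(); ++i) {
            float sum = 0.f;
            for (const auto& g : grads) sum += g[i];
            weights[i] -= lr * (sum / grads.size());
        }
    }
};

struct Worker {
    std::vector<float> local_model;
    void pull(const ParameterServer& ps) { local_model = ps.weights; }
    std::vector<float> compute_gradient() {
        // forward-backward pass on the local mini-batch (omitted);
        // returns one gradient value per model parameter
        return std::vector<float>(local_model.size(), 0.f);
    }
};

void train_iteration(ParameterServer& ps, std::vector<Worker>& workers, float lr) {
    std::vector<std::vector<float>> grads;
    for (auto& w : workers) {        // every worker trains on its data subset
        w.pull(ps);                  // fetch the current global model
        grads.push_back(w.compute_gradient());
    }
    ps.update(grads, lr);            // synchronize: aggregate and refine the model
}
```

The per-iteration push of all worker gradients in this loop is exactly the traffic that DGT later schedules differentially.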
2.2. Why does DML need differential gradient transmission?

Fig. 2. Number of iterations needed by distributed learning on two models, i.e., Alexnet and Googlenet, when gradients are lost under Random-based Loss or Contribution-based Loss. P and P* tag the maximum loss probability in each loss mode with which the models can converge within the same number of iterations as loss probability = 0 (within 5% fluctuation).

2.2.1. Gradients have different contributions to model convergence
In fact, different gradients have different influences on model convergence [32]. The larger the absolute value of a gradient, the larger the amplitude by which the model will be adjusted in the corresponding dimension. If the absolute value of a gradient is close to zero, the model will not be adjusted in its corresponding dimension. So, the loss of a gradient with a smaller absolute value has less influence on model convergence. Moreover, many gradients become close to zero at the middle training stage, which is consistent with the conclusion of [33], i.e., not all model parameters converge to their optimal value with the same number of iterations—a property called non-uniform convergence. In this paper, we define the absolute value of a gradient as its contribution to model convergence. So, we argue that gradients have different contributions to model convergence.

2.2.2. Gradient with a higher contribution requires higher transmission reliability
Xia et al. [19] have shown experimentally that SGD-based training can tolerate bounded gradient loss. Our preliminary experiments also verify their conclusion on Googlenet and Alexnet. When gradients are lost randomly (called Random-based Loss) with different probabilities, the performance of distributed learning differs. That is, when the loss probability is lower than the loss-tolerance bound, an ML model can converge to the same accuracy. However, the convergence speed may be affected, which is usually measured by the job training time to the desired accuracy of the application. The reason is that a larger loss probability may make the model need more iterations to converge.

Random-based loss means that all gradients have the same transmission reliability, i.e., all gradients have the same probability of being discarded when network congestion occurs. Based on the different contributions of gradients, we study Contribution-based loss, i.e., a gradient with a lower contribution has lower transmission reliability and is discarded preferentially when network congestion occurs. We compare the impact of gradient loss on model convergence under Contribution-based loss and Random-based loss. Fig. 2 illustrates a significant improvement. From Fig. 2(a), we find that Alexnet can converge within the same number of iterations when the loss probability (P*) is 0.365 under Contribution-based Loss, which is improved by 2.58× compared to 0.141 under Random-based Loss.

Fig. 2(b) illustrates a similar conclusion on Googlenet, where the improvement reaches 4.80×. So, we argue that a gradient with a higher contribution requires higher transmission reliability.

2.2.3. Gradient with a higher contribution requires lower transmission delay
Since the SGD-based algorithms of distributed learning are random optimization algorithms, they tolerate gradient delay [14,34]; that is, gradients at the τ-th iteration can be updated to the model at the (τ + i)-th iteration, where i is the number of delayed rounds. Many works have exploited this feature to accelerate distributed learning. For example, asynchronous SGD [35,36] exploits the feature to relax the strict synchronization between workers, i.e., it allows the gradients of slow workers to be updated to the model lingeringly, which significantly improves the system efficiency of distributed learning.

Besides, several works [14,34] have demonstrated that updating a gradient with a higher contribution to the global model with lower delay is beneficial to model convergence. Gradient Sparsification [34] updates gradients whose absolute value is greater than a predefined threshold to the global model in a timely manner and caches small gradients until their accumulated value is greater than the threshold. In effect, Gradient Sparsification differentiates the delay of updating gradients to the global model according to their contribution. Therefore, when network resources are limited, preferentially transmitting gradients with a higher contribution yields a better convergence gain on the global model. So, we argue that a gradient with a higher contribution requires a lower transmission delay.

Overall, different gradients require different transmission quality guarantees, such as transmission reliability and transmission delay. When network resources are limited, differentiating the transmission qualities of gradients can maximize network utilization and the performance of distributed learning.

2.3. Lack of differential gradient transmission in existing solutions

Existing gradient transmission solutions [19,20,22] do not take into account the different transmission quality required by different gradients and transmit all gradients with the same transmission service. Specifically, we summarize existing gradient transmission solutions into two categories, namely accurate transmission and approximate transmission. Accurate transmission transmits all gradients indiscriminately upon a reliable transmission service, i.e., TCP or RoCE, which has been popularly adopted by many distributed machine learning frameworks [20,21]. It is worth noting that accurate transmission inevitably results in long tail latency because of recovering lost gradients. Approximate transmission solutions [19,22] exploit the loss tolerance of the SGD-based algorithm to improve the efficiency of transmission. Such solutions discard parts of the gradients at the sending end or passively ignore parts of the gradients lost in the network. However, they do not consider the contribution of gradients and execute Random-based loss.

3. DGT design overview and challenges

In this section, we present a design overview of DGT. Then, we analyze the challenges that motivate the designs in the following sections.

We believe that identifying the importance of data and discriminately transmitting them with different transmission quality improves the performance of distributed training. This insight motivates us to propose Differential Gradient Transmission (DGT), a contribution-aware differential gradient transmission mechanism designed for distributed machine learning. The basic idea of DGT is to provide differential transmission quality for different gradients according to their contribution. Specifically, at each iteration, DGT enables high-contribution gradients to be transmitted and updated to the model preferentially and actively reduces the transmission quality of low-contribution gradients to mitigate communication overhead. DGT consists of three parts deployed at sender hosts (as a user-level library), receiver hosts (as a user-level library), and switches; they collectively perform differential gradient transmission.

As shown in Fig. 3, a sender library estimates the contribution of all gradients in a gradient tensor. It then classifies all gradients into two categories, i.e., important gradients and unimportant gradients, and schedules them to transmission channels with different transmission quality according to their importance and contribution. A receiver library reconstructs the received gradients into the structured tensor required by the application, using zero-padding for lost gradients. Switches in DGT schedule transmissions according to their priority tag.

On the way to building an efficient DGT solution, the key challenges include:

• How to classify gradients? We need to overcome the extremely high complexity of accurate classification and design an efficient classification threshold update method (Section 4).
• How to differentiate transmissions for different gradients? We need to design an efficient and easy-to-use transmission protocol. The protocol should improve the performance of distributed learning as much as possible while guaranteeing that the application accuracy is not compromised (Section 5).

4. Approximate gradient classification algorithm

This section discusses the design of an approximate gradient classification algorithm and a heuristic update method for the classification threshold in the algorithm.

4.1. Algorithm design

The contribution of each gradient can be estimated by its absolute value [14,37]. Since the number of gradients usually reaches 10 million or even 100 million, accurately estimating and classifying each dimension of a gradient tensor has not only a very high computational complexity but also a massive communication complexity introduced by the additional position indexes used for reconstruction. Therefore, the above method is not feasible in practice because of its high complexity.

In this paper, we heuristically propose an Approximate Gradient Classification (AGC) algorithm to reduce the computation and communication complexity of DGT, which estimates the contribution of gradients and classifies them at block granularity. The pseudocode of the AGC algorithm is described in Algorithm 1. Specifically, the algorithm contains three steps, as shown in Fig. 6:

Step 1: AGC divides the gradient tensor into a set of gradient sub-blocks. Note that the block size is set according to the network architecture and training environment for optimal performance. For example, for a convolution layer, its gradients are naturally arranged at convolution-kernel granularity. Besides, several works demonstrate that a convolution kernel is activated if and only if the input data contains the feature it extracts [38,39]. We have verified the above insight by visualizing the gradient distribution of a convolution layer. As shown in Fig. 4, in a convolution layer, the gradients present a block-like distribution according to the convolution kernel's size after a mini-batch of training.

Fig. 3. An overview of the DGT architecture design.

Fig. 4. The gradient distribution of Alexnet’s convolution layer 1. Note that each line is an expansion of a 5*5 convolution kernel. We visualize the gradient
distribution at iteration = 10(a), 1000(b), 4000(c) and 8000(d).

Algorithm 1 Approximate Gradient Classification (AGC) algorithm
Input:
  The gradient tensor $G_{k,\tau}$.
  The block size $n$.
  The classification threshold $p$.
1: Divide the gradient tensor $G_{k,\tau}$ into a set of gradient sub-blocks $[G_{k,\tau}^1, G_{k,\tau}^2, \ldots, G_{k,\tau}^m]$ according to the block size $n$.
2: Update the contribution $C_k^j(\tau)$ of the $j$-th sub-block in the $k$-th tensor at the $\tau$-th iteration as
   $C_k^j(\tau) = \alpha\, C_k^j(\tau-1) + (1-\alpha)\,\frac{1}{n}\sum_{i=1}^{n} |g_i|$, with $g_i \in G_{k,\tau}^j$,
   where $C_k^j(0) = 0$ and $\alpha$ ($0 \le \alpha \le 1$) is a constant called the contribution momentum factor.
3: If $C_k^j(\tau)$ is in the top-$p\%$, then $G_{k,\tau}^j$ is tagged as an important gradient sub-block; otherwise, it is tagged as an unimportant gradient sub-block.

Fig. 5. Job completion time (JCT) under different block-size settings for Alexnet's fully-connected layers. Note that the results are normalized with the performance of Baseline.

Fig. 6. Workflow of the approximate gradient classification algorithm.

So, we adopt the convolution kernel size as the block size of a convolution layer. For a fully-connected layer, we heuristically classify its gradients with a user-defined block size. In theory, a large user-defined block size reduces the computation and communication complexity but also decreases the classification accuracy. It is necessary to set an appropriate user-defined block size for optimal performance according to the specific training environment. For example, as shown in Fig. 5, we compare the job completion time (JCT) under different block-size settings. If the block size is too small (such as 32), the additional computation and communication overhead of DGT makes the performance worse than the Baseline. In contrast, if the block size is too large (such as 8192), inaccurate classification makes DGT require more communication rounds, which also makes the performance worse than the Baseline. In this experiment, 2048 is a good compromise value for the block size.

Step 2: Given a gradient sub-block of size n, AGC estimates the 1-norm of its gradients as its contribution to model convergence. At the same time, referring to the momentum gradient descent algorithm [40], AGC averages the historical contribution into the contribution of the current iteration with a momentum, aiming to properly account for the historical contribution.

Step 3: All sub-blocks in the k-th tensor are ranked according to their contribution. Then, the top-p% of sub-blocks are classified into the important category and the rest into the unimportant category. Note that the classification threshold p determines the amount of important gradients.
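To make the three steps concrete, the following sketch shows one possible implementation of AGC for a single flattened gradient tensor. It is a minimal illustration of Algorithm 1 under our own assumptions about data layout and naming, not the code of the DGT prototype; the small helper at the end anticipates the convergence-aware threshold update introduced in Section 4.2 (Eq. (1)).

```cpp
// Minimal sketch of the AGC algorithm (Algorithm 1), assuming the tensor is
// already flattened and the block size n divides its length.
#include <vector>
#include <cmath>
#include <cstddef>
#include <algorithm>

struct BlockTag { std::size_t block_id; bool important; };

std::vector<BlockTag> agc_classify(
    const std::vector<float>& grad,      // gradient tensor G_{k,tau}
    std::vector<float>& contribution,    // C_k^j, persisted across iterations (starts at 0)
    std::size_t n,                       // block size
    float alpha,                         // contribution momentum factor, 0 <= alpha <= 1
    float p)                             // classification threshold, fraction in (0, 1]
{
    const std::size_t m = grad.size() / n;          // number of sub-blocks
    contribution.resize(m, 0.f);

    // Step 1 + Step 2: per-block mean |g| folded into a moving average.
    for (std::size_t j = 0; j < m; ++j) {
        float mean_abs = 0.f;
        for (std::size_t i = 0; i < n; ++i) mean_abs += std::fabs(grad[j * n + i]);
        mean_abs /= static_cast<float>(n);
        contribution[j] = alpha * contribution[j] + (1.f - alpha) * mean_abs;
    }

    // Step 3: the top-p% blocks by contribution are tagged as important.
    std::vector<std::size_t> order(m);
    for (std::size_t j = 0; j < m; ++j) order[j] = j;
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return contribution[a] > contribution[b]; });

    const std::size_t num_important = static_cast<std::size_t>(p * static_cast<float>(m));
    std::vector<BlockTag> tags(m);
    for (std::size_t r = 0; r < m; ++r)
        tags[order[r]] = { order[r], r < num_important };
    return tags;
}

// Convergence-aware threshold update of Section 4.2 (Eq. (1)):
// p_tau = p0 * loss_{tau-1} / loss_0.
float update_threshold(float p0, float loss0, float loss_prev) {
    return p0 * (loss_prev / loss0);
}
```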

4.2. Convergence-aware classification threshold

The classification threshold p is a critical factor in the AGC algorithm. On the one hand, a suitable static threshold is not easy to determine. A higher threshold means that more gradients need to be transmitted accurately; in the extreme case of p = 1, the DGT scheme degenerates into an accurate transmission solution. A lower threshold may make the trained model need more iterations to converge, or even damage the application accuracy in the worst case [19].

On the other hand, a static threshold is not optimal. It is worth noting that the loss tolerance of the application may differ at different stages of the training process. Specifically, at the early stage of training, most parameters of an ML model are rapidly adjusted, and most gradients are important gradients at this time. At the later stage of training, most parameters of the model have converged, and most gradients are close to zero and can be classified as unimportant gradients. A static threshold does not account for the change of gradients as the model converges.

Based on the above considerations, we propose a heuristic convergence-aware update method for the classification threshold. The update method is described as follows. At the τ-th iteration, the threshold $p_\tau$ is updated by:

$p_\tau = p_0 \times \dfrac{loss_{\tau-1}}{loss_0}$    (1)

where $p_0$ is the initial threshold and is always set to 1 in our experiments, $loss_0$ is the initial loss function value, and $loss_{\tau-1}$ is the loss function value at the (τ−1)-th iteration. Because the loss function value basically characterizes the convergence trend of a model [41], the ratio $loss_{\tau-1}/loss_0$ normalizes the progress of training to a probability. As the model gradually converges, the classification threshold $p_\tau$ gradually decreases, making more gradients be marked as unimportant gradients.

5. Differential transmission protocol

This section discusses the design of the differential transmission protocol of DGT, which consists of differential transmission reliability, priority and reception.

5.1. Priority-based differential transmission

In order to design an efficient and easy-to-use differential gradient transmission solution, we avoid designing a complex new transport-layer protocol. Instead, we exploit the existing transport-layer protocols and transmission control mechanisms widely supported by current operating systems. We add a scheduling layer between the application and transport layers to perform differential transmission. Specifically, the scheduling layer establishes multiple end-to-end transmission channels with different transmission reliability. A reliable transmission channel relies on a reliable transmission service, e.g., TCP, and delivers gradients accurately. Unreliable transmission channels rely on an unreliable transmission service, e.g., UDP; the gradients in these channels are not guaranteed to be delivered entirely. Besides, multiple channels are differentiated with different transmission priorities supported by differentiated services [42] in the existing transport layer.

Fig. 7. Workflow of priority-based differential transmission.

According to the number of transmission channels, the sender library divides the gradient tensor into several groups. After being ranked according to their contribution, the top-p% gradients are tagged as important gradients and scheduled to a reliable channel that has the highest transmission priority. The rest are tagged as unimportant gradients and further divided equally into C (predefined by the user) groups. Suppose there are C groups with C priorities P1 to PC, where P1 > P2 > · · · > PC. Based on the priority of a group, the sender library schedules its gradients to the corresponding unreliable transmission channel. Specifically, each channel sets the transmission priority of a packet via the DSCP field in its IP header. Fig. 7 illustrates a setting scenario with C = 1, where the DSCP of packets in the reliable channel is tagged as eight and the DSCP of packets in the unreliable channel is tagged as zero.

Now we present the design of the switch in the DGT solution. Note that we only configure existing functionalities of commodity switches and require no kernel or hardware changes. DGT leverages the priority queues that are common in commodity switches to treat packets discriminately. The switch favors high-priority packets and drops low-priority ones preferentially. WRR (short for Weighted Round Robin), a QoS scheduler mode, is adopted in our solution; it uses a round-robin scheduling algorithm between the queues and avoids the lowest-priority queues not being serviced for a long time when traffic congestion happens. We define a weighted value for each queue to distribute different service time to the queues. The recommended configuration is that the weighted value of the reliable channel's queue is at least two times that of the unreliable channels'. The weighted values of the unreliable channels' queues decrease in order according to their priorities. The benefits of such differential transmission are to (1) ensure that the application accuracy is not compromised even in the worst case, that is, when all unimportant gradients are discarded in the cluster network; (2) let the flow of important gradients be completed as soon as possible, which significantly reduces the communication overhead in each iteration when used with the differential reception method (Section 5.2); and (3) make gradients with smaller contributions be dropped preferentially when network congestion occurs.
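As a concrete illustration of how a scheduling layer can realize such channels with unmodified transport protocols, the sketch below opens one TCP-based reliable channel and one UDP-based unreliable channel and tags their packets with different DSCP values through the standard IP_TOS socket option (DSCP occupies the upper six bits of the TOS byte). This is a generic POSIX-socket example under our own assumptions about addresses and ports, not the exact code of the DGT prototype.

```cpp
// Sketch: two transmission channels with different reliability and priority.
// DSCP 8 (reliable/TCP) vs. DSCP 0 (unreliable/UDP), following the C = 1
// scenario of Fig. 7. Error handling is reduced to a minimum.
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <cstdio>

static int make_channel(int type, int dscp, const char* dst_ip, int port) {
    int fd = socket(AF_INET, type, 0);
    if (fd < 0) { perror("socket"); return -1; }

    // DSCP is carried in the upper 6 bits of the IP TOS byte.
    int tos = dscp << 2;
    if (setsockopt(fd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos)) < 0)
        perror("setsockopt(IP_TOS)");

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, dst_ip, &addr.sin_addr);
    if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
        perror("connect");
        close(fd);
        return -1;
    }
    return fd;
}

int main() {
    // Reliable channel: TCP, highest priority (DSCP 8). The address is a placeholder.
    int reliable_ch = make_channel(SOCK_STREAM, 8, "192.0.2.10", 9000);
    // Unreliable channel: UDP, best-effort priority (DSCP 0).
    int unreliable_ch = make_channel(SOCK_DGRAM, 0, "192.0.2.10", 9001);

    // A sender library would write important sub-blocks to reliable_ch
    // and unimportant sub-blocks to unreliable_ch.
    if (reliable_ch >= 0) close(reliable_ch);
    if (unreliable_ch >= 0) close(unreliable_ch);
    return 0;
}
```

Because the DSCP tag travels in the IP header, the switch-side WRR queues described above can differentiate the two channels without any end-host kernel change.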

5.2. Differential reception method

Because of their higher transmission priority, the flows of important gradients will be completed earlier. A challenge is how to deal with the delayed unimportant gradients. Since the loss probability of packets carrying unimportant gradients is unpredictable, the receiver cannot judge their completion time. This means that the waiting time of the receiver is uncertain. Waiting blindly for the completion of unimportant gradients blocks important gradients from being updated to the model in a timely manner. On the contrary, blindly discarding the delayed unimportant gradients wastes bandwidth resources. In response to this challenge, we design a differential reception method. The method follows one principle: "important gradients are involved in the update of the model in a timely manner, and delayed unimportant gradients are involved in the update lingeringly". After completing the transmission of important gradients, the receiver immediately reconstructs the gradients from the receive buffer and delivers them to the upper application. For delayed unimportant gradients, the receiver takes them into the gradient buffer asynchronously and adds them onto the corresponding dimensions of the gradient tensor in the next iteration. As shown in Fig. 8, different channel receiving engines decapsulate messages from the channel buffer and perform a WRITE operation to put gradients into the gradient buffer. After all important gradients have been received, the receiver library performs a READ operation to read the received gradients and reconstruct them. This method is similar to gradient caching at the sending end in Gradient Sparsification. Actually, compared to gradient caching, our protocol can effectively alleviate the "staleness" problem [14]. Experimental results show that the average number of delayed rounds per iteration is only 1. Overall, the differential reception method lets the received gradients join aggregation and update in time, and lets delayed gradients contribute to model convergence with the least "staleness" effect.

Fig. 8. Functional flow chart of the differential reception method.

6. Implementation and deployment

This section discusses how we implement DGT with a real distributed machine learning system and how to deploy it in an existing distributed cluster.

6.1. Real implementation

We implement DGT as a communication middleware (scheduling layer) that connects the application and transport layers. Specifically, we integrate DGT into a popular distributed deep learning platform, i.e., MXNET. As shown in Fig. 9, DGT acts between the distributed learning engine and the distributed communication library. To build an efficient and easy-to-use communication middleware, the implementation of DGT follows these principles:

• Be compatible with the default gradient transmission mechanism of MXNET.
• Keep all functionalities of DGT transparent to the upper application.
• Provide an efficient and concise communication interface, modifying PS-LITE [20] as little as possible.
• Provide convenient and concise user configuration interfaces.

Fig. 9. Functional block diagram of the modules of DGT.

As shown in Fig. 9, the DGT component consists of five functional modules, that is, a DGT configuration module, a gradient partition module, a gradient classification module, a differential transmission scheduling module, and a differential reception module. The functions of each module are introduced briefly below.

DGT configuration module: This module is responsible for the management and configuration of parameters in DGT. During the initialization of the MXNET platform, the module sequentially configures parameters for the other modules and establishes multiple transmission channels between the worker and the parameter server. Besides, the module updates the classification threshold with the convergence-aware update method at the beginning of each iteration. In order to quickly obtain the loss function value of the previous iteration, we design a file-based message passing mechanism for communication between the upper application and the DGT configuration module.

Gradient partition module: This module is responsible for dividing the original gradient tensor into sub-blocks. In order to ensure that the sub-blocks can be reconstructed into a structured tensor at the receiving end, the module gives each sub-block additional header information, i.e., a position offset and a sequence number.

Gradient classification module: This module is responsible for (1) estimating the contribution of each gradient according to the AGC algorithm and (2) tagging it as important or unimportant. In this module, there is a contribution matrix for each gradient tensor, indexed by a tuple (key, seq), where key is an identifier of a tensor and seq is an identifier of its sub-block.

Differential transmission scheduling module: This module is responsible for scheduling gradients to transmission channels with different reliability and priority based on their importance and contribution. First, the channel scheduler selects a transmission channel for each block and then sends the block to the sending queue of its channel. Second, the sending engine reads gradients from the sending queues, encapsulates them into a message, and sends the message out. It is worth noting that data generation and data transmission are executed asynchronously in our implementation. This design keeps blocking during data transmission from affecting data generation, thus increasing the efficiency of scheduling.

Differential reception module: This module is located at the receiving end and is responsible for reconstructing gradients from the receive buffer into a structured tensor and then uploading it to the upper application. For delayed unimportant gradients, the module performs the differential reception method.

In summary, the DGT component adopts a hierarchical and modular design. The code is written in C/C++, and some of the latest C++11 features are adopted. Concise code and rich comments make it convenient for users to conduct deployment.
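The sketch below illustrates the kind of per-sub-block metadata and receiver-side logic that Sections 5.2 and 6.1 describe: each sub-block carries a tensor key, a sequence number and a position offset; lost unimportant sub-blocks stay zero-padded, and delayed sub-blocks are folded into the tensor of the next iteration. The field names, types and wire layout are illustrative assumptions based on the text, not the format used by the DGT prototype.

```cpp
// Sketch of per-sub-block metadata and receiver-side reconstruction.
#include <vector>
#include <unordered_map>
#include <cstdint>
#include <cstddef>

struct SubBlockHeader {
    uint32_t key;     // identifier of the gradient tensor
    uint32_t seq;     // identifier of the sub-block inside the tensor
    uint32_t offset;  // position offset of the sub-block in the flattened tensor
    uint32_t len;     // number of gradient values in the sub-block
};

struct SubBlock {
    SubBlockHeader hdr;
    std::vector<float> values;
};

// Rebuild a structured (flattened) tensor once all important sub-blocks have
// arrived; positions of lost unimportant sub-blocks stay zero (zero-padding).
std::vector<float> reconstruct_tensor(std::size_t tensor_len,
                                      const std::vector<SubBlock>& received) {
    std::vector<float> tensor(tensor_len, 0.f);
    for (const auto& b : received)
        for (uint32_t i = 0; i < b.hdr.len && b.hdr.offset + i < tensor_len; ++i)
            tensor[b.hdr.offset + i] = b.values[i];
    return tensor;
}

// Delayed unimportant sub-blocks are buffered and folded into the tensor of
// the next iteration, mirroring the differential reception principle.
struct DelayedGradientBuffer {
    std::unordered_map<uint32_t, std::vector<SubBlock>> pending;  // key -> blocks

    void stash(const SubBlock& b) { pending[b.hdr.key].push_back(b); }

    void fold_into(uint32_t key, std::vector<float>& next_iter_tensor) {
        auto it = pending.find(key);
        if (it == pending.end()) return;
        for (const auto& b : it->second)
            for (uint32_t i = 0;
                 i < b.hdr.len && b.hdr.offset + i < next_iter_tensor.size(); ++i)
                next_iter_tensor[b.hdr.offset + i] += b.values[i];  // add, do not overwrite
        pending.erase(it);
    }
};
```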

6.2. Deployment

We have made the source code of our DGT prototype publicly available at [43]. Since DGT is implemented at the application layer without any modification to the system kernel, it is naturally suitable for large-scale deployment. To use the DGT function, the user only needs to set the environment variable ENABLE_DGT=1 in the startup script of MXNET. Please refer to [43] for the detailed configuration of the other parameters.
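As a small illustration of this switch, the snippet below shows how a middleware component could gate its code path on the ENABLE_DGT environment variable; it is a generic C++ sketch, not the configuration logic of the released prototype.

```cpp
// Sketch: enabling DGT behavior based on the ENABLE_DGT environment variable.
#include <cstdlib>
#include <cstring>

static bool dgt_enabled() {
    const char* v = std::getenv("ENABLE_DGT");  // e.g., exported in the MXNET startup script
    return v != nullptr && std::strcmp(v, "1") == 0;
}
```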

7. Evaluation

This section presents our evaluation of DGT on a real distributed cluster with several classic ML models and datasets. First, we evaluate the acceleration of DGT for distributed machine learning by comparing it with three existing gradient transmission mechanisms. Then, we evaluate the effect of our novel techniques in DGT.

7.1. Real implementation results

The experimental platform is built on MXNET, a flexible and efficient library for deep learning. We evaluate DGT on a distributed cluster with 12 workers, 4 switches and a 1 Gbps network. We select classic ML models (Googlenet [44], Alexnet [44], VGG-11 [45]), datasets (Fashion-MNIST [46], Cifar10 [47]) and training algorithms (BSP [20], FedAVG [48]) to demonstrate the performance of DGT. Specifically, we compare DGT with the following three gradient transmission schemes:

Baseline: MXNET's default gradient transmission. All gradients are transmitted upon a reliable transmission service.

Sender-based Dropping (SD) [19]: Based on the pre-tested bounded-loss tolerance of a model, SD randomly discards a certain proportion of gradients at the sending end, and the rest is transmitted upon a reliable transmission service.

ATP [22]: Based on the pre-tested bounded-loss tolerance of a model, ATP sets a maximum loss probability (MLR). All gradients are transmitted approximately upon an unreliable transmission service. If the loss rate is higher than the MLR, ATP resends part of the lost gradients until the loss rate is lower than the MLR. If the loss rate is lower than the MLR, ATP does not resend the lost gradients.

For each transmission scheme, we evaluate the job completion time when the testing model converges to the desired accuracy and carry out multiple settings of parameters to achieve the best result.

Fig. 10. Comparison of job completion time on three ML models, i.e., Googlenet, Alexnet and VGG-11, when different gradient transmission mechanisms are adopted for distributed learning. For each mechanism, we carry out multiple settings of its parameters to achieve the best acceleration. Note that the results are normalized with the performance of Baseline.

Fig. 11. Comparison of training performance on AlexNet when different gradient transmission schemes are adopted for distributed training. For each scheme, we carry out multiple settings of parameters to achieve the best result of job completion time (JCT). Note that the results are normalized with the performance of Baseline.

DGT performance on classic ML models. To evaluate the performance of DGT on classic ML models, we count the job completion time (JCT) of training Googlenet, Alexnet and VGG-11 with Fashion-MNIST. Note that the data partitions of the 12 clients conform to an independent and identical distribution (IID). As can be seen from Fig. 10, compared to Baseline, DGT reduces the job completion time by 19.4% on Googlenet (average JCT with the Baseline solution is 3.1 h), 34.4% on Alexnet (average JCT with the Baseline solution is 9.8 h) and 36.5% on VGG-11 (average JCT with the Baseline solution is 12.6 h). In the DGT solution, the amount of important gradients is less than the total amount of all gradients and gradually decreases as training progresses, thus making the communication overhead of DGT significantly less than Baseline, as illustrated in Fig. 11. From Fig. 10, we can also find that DGT achieves better acceleration than SD and ATP. This is because, as illustrated in Fig. 11, compared to SD and ATP, DGT differentiates the transmission reliability of gradients according to their contribution, making the number of iterations less than SD and ATP.

DGT performance on classic datasets. In the above experiments, we evaluated the performance of DGT on the task of training Alexnet with Fashion-MNIST. To evaluate the performance of DGT on more sophisticated tasks, we additionally execute experiments with a more sophisticated training dataset, i.e., Cifar10 [47], and a more sophisticated data distribution, i.e., non-IID data partitions [49]. The detailed setup of client number and data partition is shown in Table 1. It can be seen from Fig. 12 that, on IID data partitions, the performance of DGT on both Fashion-MNIST and Cifar10 is better than the other existing gradient transmission mechanisms. It is worth noting that DGT's gain on Cifar10 is higher than that on Fashion-MNIST. This is because the training on Cifar10 requires more communication rounds at the later training stage, at which DGT has a significant communication gain. Besides, Fig. 13 illustrates the robustness of DGT on non-IID data partitions.

DGT performance on classic training algorithms. We evaluate the performance of DGT on the BSP algorithm, which transfers gradients after training on one batch of data, and the FedAVG (E=5) algorithm [48], which transfers gradients after training on five batches of data. It can be seen from Fig. 14 that the performance of DGT is better than the other gradient transmission mechanisms on both BSP and FedAVG (E=5). It is worth noting that DGT's gain on FedAVG (E=5) is less than that on BSP because FedAVG (E=5) performs fewer communication rounds.

Table 1
The distribution of training samples over workers (label: num_samples).

IID partitions:
Cifar10 — Workers 1–11: 0–9: 4166; Worker 12: 0–9: 4174.
Fashion-MNIST — Workers 1–12: 0–9: 5000.

non-IID partitions, Cifar10:
Worker 1: 1:1250, 2:1000, 5:714, 8:714
Worker 2: 0:1250, 1:1250, 4:1666, 5:714
Worker 3: 0:1250, 3:714, 8:714, 9:1250
Worker 4: 0:1250, 3:714, 5:714, 7:1000
Worker 5: 2:1000, 5:714, 6:2500, 7:1000
Worker 6: 4:1666, 6:2500, 8:714, 9:1250
Worker 7: 3:714, 4:1666, 7:1000, 9:1250
Worker 8: 2:1000, 3:714, 7:1000, 8:714
Worker 9: 0:1250, 2:1000, 3:714, 9:1250
Worker 10: 3:714, 5:714, 7:1000, 8:714
Worker 11: 1:1250, 3:714, 5:714, 8:714
Worker 12: 1:1250, 2:1000, 5:714, 8:714

non-IID partitions, Fashion-MNIST:
Worker 1: 1:1500, 2:1200, 5:857, 8:857
Worker 2: 0:1500, 1:1500, 4:2000, 5:857
Worker 3: 0:1500, 3:857, 8:857, 9:1500
Worker 4: 0:1500, 3:857, 5:857, 7:1200
Worker 5: 2:1200, 5:857, 6:3000, 7:1200
Worker 6: 4:2000, 6:3000, 8:857, 9:1500
Worker 7: 3:857, 4:2000, 7:1200, 9:1500
Worker 8: 2:1200, 3:857, 7:1200, 8:857
Worker 9: 0:1500, 2:1200, 3:857, 9:1500
Worker 10: 3:857, 5:857, 7:1200, 8:857
Worker 11: 1:1500, 3:857, 5:857, 8:857
Worker 12: 1:1500, 2:1200, 5:857, 8:857

Fig. 12. Job completion time (JCT) of training Alexnet on IID data partitions of Fashion-MNIST and Cifar10 with four gradient transmission mechanisms. Note that the results are normalized with the performance of Baseline.

Fig. 13. Job completion time (JCT) of training Alexnet on non-IID data partitions of Fashion-MNIST and Cifar10 with four gradient transmission mechanisms. Note that the results are normalized with the performance of Baseline.

Fig. 14. Job completion time (JCT) of training Alexnet on BSP and FedAVG (E=5) with four gradient transmission mechanisms. Note that the results are normalized with the performance of Baseline.

7.2. Effect of DGT techniques

To present the technical advantages of DGT, we evaluate the effect of its key techniques and discuss them in this section.

7.2.1. Effect of block size in the AGC algorithm
In theory, a large user-defined block size reduces the computation and communication complexity but also decreases the classification accuracy. It is necessary to set an appropriate user-defined block size for optimal performance according to the specific training environment. For example, as shown in Fig. 5, we compare the job completion time (JCT) under different block-size settings. If the block size is too small (such as 32), the additional computation and communication overhead of DGT makes the performance worse than the Baseline. In contrast, if the block size is too large (such as 8192), inaccurate classification makes DGT require more communication rounds, which also makes the performance worse than the Baseline. In this experiment, 2048 is a good compromise value.

7.2.2. Effect of the convergence-aware classification threshold
In order to test the effect of the convergence-aware classification threshold update method, we compare it with static settings, i.e., the classification threshold is predefined by the user and cannot change throughout the training process. The metric is the training performance when DGT uses the two different methods in the worst case, i.e., the unimportant gradients are completely lost in the network. In the experiment, we actively drop the unimportant gradients at the sending end to simulate the data loss in the network. Specifically, we compare the number of iterations, the average communication overhead per iteration and the total job completion time with different update methods for the classification threshold.

We set the threshold p to 1.0, 0.8, 0.6 and 0.4 in the static settings. In the convergence-aware setting, we make the threshold be dynamically updated with our heuristic algorithm as in formula (1). We record the experimental results under different settings and normalize them with the training performance when the threshold p=1.0, i.e., all gradients are transmitted reliably.

Fig. 15. Comparison of training performance on AlexNet when DGT updates classification threshold with static settings(p = 1.0, 0.8, 0.6, 0.4) and convergence-ware
update method. The training performance is evaluated by three aspects, that is job completion time (JCT)(a), number of iterations(b) and average communication
overhead per iteration (c). Note that the results are normalized with training performance when threshold p = 1.0.

As can be seen from Fig. 15(c), compared with static settings, job completion time is further gradually reduced. This benefit
the convergence-aware update method can make distributed comes from the reduction of the number of iterations. As shown
learning with the least job completion time in our experimental in Fig. 16(c), compared to Y{1,1}, Y{1,3}, Y{1,7} have fewer num-
cluster. In fact, as shown in Fig. 15(a) and Fig. 15(b), with a ber of iterations. This is because as the number of unreliable
lower static threshold, DGT make distributed learning with lower channels increases, unimportant gradients with a higher contri-
average communication overhead. However, in contrast, a lower bution has a smaller probability of being lost in the network. In
static threshold makes the trained model need more iterations this experiment, the job completion time of Y{1,7} is 10.6% lower
to converge. The reason is that a lower static threshold means than N{1,1} for Googlenet and 17.2% for Alexnet.
that more unimportant gradients are discarded in the network,
which significantly impairs the model convergence in the worst 7.2.4. Effect of differential reception method
case. Our experimental results verify the potential inverse rela- In order to test the effect of differential reception method
tionship between communication overhead and the number of (Diff-Reception) at the receiver side, we compare it with the
iterations. Compared with static settings, our convergence-aware other two methods: Baseline (default gradient transmission of
update method achieves better trade-off between communica- MXNET) and "Heuristic Dropping’’, which simply discards the
tion overhead and the number of iterations, making them all delayed gradients. The experimental configurations contain: the
achieve relative low value. This is because the convergence-aware classification threshold p=0.6, the gradient transmission solution
update method can adjust the classification threshold according is Y{1,1}. In the experiment, we compare the training perfor-
to the training progress of distributed learning. At the early mance of distributed learning when DGT adopts different gradient
stage of training, it avoids a low threshold to makes the model reception methods at the receiver side, as shown in Fig. 17.
convergence be impaired. And at the later stage, it can lift the As can be seen from Fig. 17(a), compared with Baseline, the
threshold, thereby reducing the communication overhead. Over- Diff-Reception reduce the job completion time by 8.9% on
all, at present, the convergence-aware classification threshold Googlenet and 20.9% on Alexnet. And however, ‘‘Heuristic Drop-
update method is a relatively excellent method. Exploring a better ping’’ has not achieved a significant acceleration. This is because,
method will be our future work. as illustrated in Fig. 17(c), the ‘‘Heuristic Dropping’’ results in a
significant increase in the number of iterations, which further
7.2.3. Effect of priority-based differential transmission illustrates that actively dropping delayed gradients will signif-
The purpose of this experiment is to verify the effectiveness icantly damage the model convergence. On the other hand, as
of the priority-based differential transmission mechanism of DGT. shown in Fig. 17(b), the processing overhead of Diff-Reception
Specifically, we compare the training performance of distributed is very low, thus making the communication overhead be al-
learning under the following settings: most same as the "Heuristic Dropping’’. Experimental results
have shown that our Diff-Reception is a comparatively excellent
• Baseline: MXNET’s default gradient transmission solution,
gradient reception method in the DGT solution.
where all gradients are transmitted indiscriminately through
a reliable channel.
• N{1,1}: gradient transmission with DGT solution which has 8. Related work
one reliable channel and one unreliable channel. This setting
does not distinguish the priority between channels. This section discusses related works to DGT.
• Y{x,y}: gradient transmission with DGT solution which has It has been extensively paid attention that the high com-
x reliable channels and y unreliable channels. This setting munication overhead seriously affects the performance of dis-
does distinguish the priority between channels. tributed machine learning. To reducing communication overhead,
most data centers adopt advanced network techniques that have
We compare the training performance of distributed learning larger bandwidth to facilitate cluster network, e.g., [27,50] con-
under different settings, as shown in Fig. 16. As can be seen from nect computing nodes with RDMA or Cray technologies. However,
Fig. 16(a), compared to N{1,1}, When prioritizing the transmis- increasing bandwidth between computing nodes is not a general
sions on different channels, Y{1,1} significantly reduces the job solution. As the number of parallel computing nodes increase, the
completion time. This is because Y{1,1} gives reliable channel bandwidth advantages gradually disappear. On the other hand,
more network resources to allow the flow of important gradients the solution is not applicable in a cross-domain cluster.
to be completed as quickly as possible, thus reducing the average To our best knowledge, there is still relatively little work
communication overhead per iteration, as shown in Fig. 16(b). At to optimize the efficiency of distributed learning from the per-
the same time, as we increase the number of unreliable channels, spective of data transmission. [19] is the most relevant work.
44
H. Zhou, Z. Li, Q. Cai et al. Future Generation Computer Systems 121 (2021) 35–47

Fig. 16. Comparison of training performance on two ML models, i.e., Alexnet and Googlenet, when DGT adopts different channel settings and priority settings. The
training performance is evaluated by three aspects, that is job completion time (JCT)(a), average communication overhead per iteration (b) and number of iterations(c).
Note that the results are normalized with training performance of Baseline.

Fig. 17. Comparison of training performance on two ML models, i.e., Alexnet and Googlenet, when DGT adopts different gradient reception protocol at receiver
side. The training performance is evaluated by three aspects, that is job completion time (JCT)(a), average communication overhead per iteration(b) and number of
iterations(c). Note that the results are normalized with training performance of Baseline.

8. Related work

This section discusses work related to DGT.

It has been widely recognized that high communication overhead seriously affects the performance of distributed machine learning. To reduce communication overhead, most data centers adopt advanced network techniques that provide larger bandwidth in the cluster network, e.g., [27,50] connect computing nodes with RDMA or Cray technologies. However, increasing the bandwidth between computing nodes is not a general solution. As the number of parallel computing nodes increases, the bandwidth advantage gradually disappears. Moreover, this solution is not applicable in a cross-domain cluster.

To the best of our knowledge, there is still relatively little work that optimizes the efficiency of distributed learning from the perspective of data transmission. [19] is the most relevant work. It discusses the bounded-loss tolerance of several deep learning models for gradient transmission, and then explores an opportunity to reduce communication overhead by modifying the existing transport-layer protocol to provide a bounded-loss data transmission service. However, in practice, developing and deploying a customized transport-layer protocol requires extensive modification of the system kernel, which poses challenges for large-scale deployment. Compared to [19], DGT is an easier-to-use solution that acts as a communication middleware on top of existing transport-layer protocols, i.e., TCP and UDP. Besides, DGT differentiates gradient transmissions based on their contributions, which reduces communication overhead further.

Sender-based Dropping [19] is a heuristic gradient transmission method that reduces communication overhead by randomly discarding a certain proportion of gradients. Experimental results have shown that Sender-based Dropping yields only a marginal gain. ATP [22] is a general approximate transmission protocol for approximate computing, which is used to transmit datasets with information redundancy. However, ATP is not well suited to transmitting gradients in distributed learning because it does not take the differences among gradients into account. DGT considers the differences among gradients and dynamically updates the classification threshold to adapt to the changing loss tolerance of the model during training, thus obtaining better acceleration than Sender-based Dropping and ATP.
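The contrast can be made concrete with a small sketch (our own simplification in Python, not code from [19] or from the DGT release): both policies below keep the same fraction p of gradient values, but only the contribution-aware policy keeps the values that are most likely to matter for convergence, which is consistent with random dropping yielding only marginal gains.

import numpy as np

def sender_based_dropping(grad, p, rng=None):
    # Randomly keep a fraction p of the gradient values and zero the rest.
    rng = rng or np.random.default_rng()
    mask = rng.random(grad.shape) < p
    return np.where(mask, grad, 0.0)

def contribution_aware_selection(grad, p):
    # Keep the fraction p of values with the largest magnitude (a stand-in
    # for a contribution estimate), zeroing the rest.
    k = max(1, int(p * grad.size))
    threshold = np.partition(np.abs(grad).ravel(), -k)[-k]
    return np.where(np.abs(grad) >= threshold, grad, 0.0)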
Gradient Sparsification [14,34] and Gradient Quantization [13,51] are communication-efficient application-layer algorithms. With these algorithms, the global model can be trained to converge with imprecise or delayed gradients. In fact, our gradient transmission scheme can work together with such communication-efficient application-layer algorithms; for example, quantizing the gradient blocks of DGT would further reduce the amount of transferred data. We will explore this optimization in future work.
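As an illustration of such a combination (an assumption about future work, not an existing DGT feature), the sketch below applies plain 8-bit linear quantization to a gradient block before it is handed to a transmission channel, reducing the payload of a float32 block roughly by a factor of four; the receiver would dequantize the block before updating the global model.

import numpy as np

def quantize_block(block, bits=8):
    # Map float32 values onto 2**bits - 1 evenly spaced levels.
    levels = 2 ** bits - 1
    lo, hi = float(block.min()), float(block.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((block - lo) / scale).astype(np.uint8)
    return q, lo, scale

def dequantize_block(q, lo, scale):
    # Approximate reconstruction of the original block at the receiver.
    return q.astype(np.float32) * scale + lo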
9. Conclusion

In this paper, we first claim that distributed machine learning requires fine-grained differential gradient transmission. Based on our observations, we then propose Differential Gradient Transmission (DGT), a contribution-aware end-to-end differential gradient transmission mechanism for distributed machine learning. DGT estimates the contribution of a gradient and then provides a differential transmission service for it according to that contribution. In DGT, a gradient with a high contribution is transmitted with higher reliability and priority, and it is updated into the global model preferentially. We implement DGT as a communication middleware in the MXNET framework and open-source our code to facilitate large-scale DML for users. We verify the effectiveness of DGT on a real distributed cluster with classical ML models and datasets. The experimental results show that, compared with accurate gradient transmission, DGT significantly reduces the model training time. Besides, we compare DGT with two other heuristic approximate gradient transmission schemes, and DGT achieves better acceleration than both. It is worth noting that the gradient classification algorithm proposed in this paper, i.e., AGC, only estimates differential contributions between gradients within the same layer of the deep neural network, without considering differential contributions between gradients from different layers. In future work, we will study differential contributions between gradients from different layers and propose a more refined gradient classification algorithm to further improve the performance of DGT.

CRediT authorship contribution statement

Huaman Zhou: Conceptualization, Methodology, Software,
Writing - review & editing. Qingqing Cai: Investigation, Writing - review & editing. Hongfang Yu: Supervision, Funding acquisition, Project administration. Shouxi Luo: Writing - review & editing. Long Luo: Writing - review & editing. Gang Sun: Supervision, Validation, Writing - review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This research was partially supported by the National Key Research and Development Program of China (2019YFB1802800), PCL Future Greater-Bay Area Network Facilities for Large-scale Experiments and Applications, China (LZC0019), China Postdoctoral Science Foundation (2019M663552) and Fundamental Research Funds for the Central Universities, China (2682019CX61).

References

[1] T. Ben-Nun, T. Hoefler, Demystifying parallel and distributed deep learning: An in-depth concurrency analysis, ACM Comput. Surv. 52 (4) (2019) 65.
[2] K.S. Chahal, M.S. Grover, K. Dey, R.R. Shah, A hitchhiker's guide on distributed training of deep neural networks, J. Parallel Distrib. Comput. 137 (2020) 65–76.
[3] R. Mayer, H.-A. Jacobsen, Scalable deep learning on distributed infrastructures: Challenges, techniques and tools, 2019, arXiv preprint arXiv:1903.11314.
[4] L. Bottou, Large-scale machine learning with stochastic gradient descent, in: Proceedings of COMPSTAT'2010, Springer, 2010, pp. 177–186.
[5] I. Sutskever, J. Martens, G. Dahl, G. Hinton, On the importance of initialization and momentum in deep learning, in: International Conference on Machine Learning, 2013, pp. 1139–1147.
[6] J. Duchi, E. Hazan, Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, J. Mach. Learn. Res. 12 (Jul) (2011) 2121–2159.
[7] M. Li, D.G. Andersen, A.J. Smola, K. Yu, Communication efficient distributed machine learning with the parameter server, in: Advances in Neural Information Processing Systems, 2014, pp. 19–27.
[8] L. Luo, J. Nelson, L. Ceze, A. Phanishayee, A. Krishnamurthy, Parameter hub: a rack-scale parameter server for distributed deep neural network training, in: Proceedings of the ACM Symposium on Cloud Computing, ACM, 2018, pp. 41–54.
[9] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, K. He, Accurate, large minibatch sgd: Training imagenet in 1 hour, 2017, arXiv preprint arXiv:1706.02677.
[10] K. Hsieh, A. Harlap, N. Vijaykumar, D. Konomis, G.R. Ganger, P.B. Gibbons, O. Mutlu, Gaia: geo-distributed machine learning approaching {LAN} speeds, in: 14th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 17), 2017, pp. 629–647.
[11] J. Konečnỳ, H.B. McMahan, D. Ramage, P. Richtárik, Federated optimization: Distributed machine learning for on-device intelligence, 2016, arXiv preprint arXiv:1610.02527.
[12] L. Mai, C. Hong, P. Costa, Optimizing network performance in distributed machine learning, in: 7th {USENIX} Workshop on Hot Topics in Cloud Computing (HotCloud 15), 2015, p. 2.
[13] F. Seide, H. Fu, J. Droppo, G. Li, D. Yu, 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns, in: Fifteenth Annual Conference of the International Speech Communication Association, 2014, pp. 1058–1062.
[14] Y. Lin, S. Han, H. Mao, Y. Wang, W.J. Dally, Deep gradient compression: Reducing the communication bandwidth for distributed training, 2017, arXiv preprint arXiv:1712.01887.
[15] D. Alistarh, J. Li, R. Tomioka, M. Vojnovic, Qsgd: Randomized quantization for communication-optimal stochastic gradient descent, 2016, arXiv preprint arXiv:1610.02132.
[16] S. Sun, W. Chen, J. Bian, X. Liu, T.-Y. Liu, Ensemble-compression: A new method for parallel training of deep neural networks, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2017, pp. 187–202.
[17] S.H. Hashemi, S.A. Jyothi, R.H. Campbell, Tictac: Accelerating distributed deep learning with communication scheduling, 2018, arXiv preprint arXiv:1803.03288.
[18] A. Jayarajan, J. Wei, G. Gibson, A. Fedorova, G. Pekhimenko, Priority-based parameter propagation for distributed DNN training, 2019, arXiv preprint arXiv:1905.03960.
[19] J. Xia, G. Zeng, J. Zhang, W. Wang, W. Bai, J. Jiang, K. Chen, Rethinking transport layer design for distributed machine learning, in: Proceedings of the 3rd Asia-Pacific Workshop on Networking 2019, ACM, 2019, pp. 22–28.
[20] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, Z. Zhang, Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems, 2015, arXiv preprint arXiv:1512.01274.
[21] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., Tensorflow: A system for large-scale machine learning, in: 12th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 16), 2016, pp. 265–283.
[22] K. Liu, S.-Y. Tsai, Y. Zhang, ATP: a datacenter approximate transmission protocol, 2019, arXiv preprint arXiv:1901.01632.
[23] S.B. Kotsiantis, I. Zaharakis, P. Pintelas, Supervised machine learning: A review of classification techniques, Emerg. Artif. Intell. Appl. Comput. Eng. 160 (2007) 3–24.
[24] G.A. Seber, A.J. Lee, Linear Regression Analysis, vol. 329, John Wiley & Sons, 2012.
[25] R. Johnson, T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, in: Advances in Neural Information Processing Systems, 2013, pp. 315–323.
[26] N.L. Roux, M. Schmidt, F.R. Bach, A stochastic gradient method with an exponential convergence rate for finite training sets, in: Advances in Neural Information Processing Systems, 2012, pp. 2663–2671.
[27] F.N. Iandola, M.W. Moskewicz, K. Ashraf, K. Keutzer, Firecaffe: near-linear acceleration of deep neural network training on compute clusters, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2592–2600.
[28] X. Jia, S. Song, W. He, Y. Wang, H. Rong, F. Zhou, L. Xie, Z. Guo, Y. Yang, L. Yu, et al., Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes, 2018, arXiv preprint arXiv:1807.11205.
[29] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255.
[30] M.J. Quinn, Parallel programming, TMH CSE 526 (2003).
[31] M. Li, L. Zhou, Z. Yang, A. Li, F. Xia, D.G. Andersen, A. Smola, Parameter server for distributed machine learning, in: Big Learning NIPS Workshop, vol. 6, 2013, p. 2.
[32] N. Strom, Scalable distributed DNN training using commodity GPU cloud computing, in: Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[33] O. Dekel, R. Gilad-Bachrach, O. Shamir, L. Xiao, Optimal distributed online prediction using mini-batches, J. Mach. Learn. Res. 13 (Jan) (2012) 165–202.
[34] A.F. Aji, K. Heafield, Sparse communication for distributed gradient descent, 2017, arXiv preprint arXiv:1704.05021.
[35] B. Recht, C. Re, S. Wright, F. Niu, Hogwild: A lock-free approach to parallelizing stochastic gradient descent, in: Advances in Neural Information Processing Systems, 2011, pp. 693–701.
[36] E.P. Xing, Q. Ho, W. Dai, J.K. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, Y. Yu, Petuum: A new platform for distributed machine learning on big data, IEEE Trans. Big Data 1 (2) (2015) 49–67.
[37] C. Hardy, E. Le Merrer, B. Sericola, Distributed deep learning on edge-devices: feasibility via adaptive compression, in: 16th International Symposium on Network Computing and Applications (NCA), IEEE, 2017, pp. 1–8.
[38] M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: European Conference on Computer Vision, Springer, 2014, pp. 818–833.
[39] N. Ketkar, E. Santana, Deep Learning with Python, vol. 1, Springer, 2017.
[40] N. Qian, On the momentum term in gradient descent learning algorithms, Neural Netw. 12 (1) (1999) 145–151.
[41] J. Xu, Z. Zhang, T. Friedman, Y. Liang, G.V.d. Broeck, A semantic loss function for deep learning with symbolic knowledge, 2017, arXiv preprint arXiv:1711.11157.
[42] M.A. Hughes, D.M. O'keeffe, K. Loughran, J.N. Butler, TCP control packet differential service, US Patent 7,366,168, 2008.
[43] https://github.com/zhouhuaman/dgt.
[44] P. Ballester, R.M. Araujo, On the performance of GoogLeNet and AlexNet applied to sketches, in: Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[45] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014, arXiv preprint arXiv:1409.1556.
[46] H. Xiao, K. Rasul, R. Vollgraf, Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017, arXiv preprint arXiv:1708.07747.
[47] K. Klosowski, Image recognition on CIFAR10 dataset using resnet18 and keras, 2018.
[48] B. McMahan, E. Moore, D. Ramage, S. Hampson, B.A. y Arcas, Communication-efficient learning of deep networks from decentralized data, in: Artificial Intelligence and Statistics, PMLR, 2017, pp. 1273–1282.
[49] F. Sattler, S. Wiedemann, K.-R. Müller, W. Samek, Robust and communication-efficient federated learning from non-iid data, IEEE Trans. Neural Netw. Learn. Syst. (2019).
[50] Y. Ren, X. Wu, L. Zhang, Y. Wang, W. Zhang, Z. Wang, M. Hack, S. Jiang, Irdma: Efficient use of rdma in distributed deep learning systems, in: 19th International Conference on High Performance Computing and Communications, IEEE, 2017, pp. 231–238.
[51] D. Alistarh, D. Grubic, J. Li, R. Tomioka, M. Vojnovic, QSGD: Communication-efficient SGD via gradient quantization and encoding, in: Advances in Neural Information Processing Systems, 2017, pp. 1709–1720.

Huaman Zhou is a Ph.D student at School of Information and Communication Engineering of University of Electronic Science and Technology of China (UESTC), and he received the MS degree in UESTC in July 2018. His research interests include distributed machine learning and federated learning.

Zonghang Li is a Ph.D student at School of Information and Communication Engineering of University of Electronic Science and Technology of China (UESTC), and he received the BS degree in UESTC in July 2018. His research interests include distributed machine learning and federated learning.

Qingqing Cai is pursuing her Master degree at School of Information and Communication Engineering of University of Electronic Science and Technology of China (UESTC). Her research interests include distributed machine learning and algorithms.

Hongfang Yu is a professor at School of Information and Communication Engineering of University of Electronic Science and Technology of China (UESTC). Her current research interests include data center networking, network (function) virtualization, cloud/edge computing, distributed AI system. Her research has been supported by NSFC, National Key Research and Development Program of China, National Grand Fundamental Research 973 Program and 863 Program et al. She submitted over 30 international and national-wide patent applications. She has authored/coauthored over 200 papers on international journals and conferences.

Shouxi Luo received his BS degree in Communication Engineering and Ph.D degree in Communication and Information System from University of Electronic Science and Technology of China in 2011 and 2016, respectively. From Oct. 2015 to Sep. 2016, he was an Academic Guest at the Department of Information Technology and Electrical Engineering, ETH Zurich. His research interests include data center networks and software-defined networks.

Long Luo is a Postdoctoral Researcher at the University of Electronic Science and Technology of China (UESTC). She received the BS degree in communication engineering from Xi'an University of Technology in 2012, and her MS and Ph.D. degree in communication engineering from the UESTC in 2015 and 2020, respectively. Her research interests include networking and distributed systems.

Gang Sun is a professor of Computer Science at University of Electronic Science and Technology of China (UESTC). His research interests include network virtualization, cloud computing, high performance computing, parallel and distributed systems, ubiquitous/pervasive computing and intelligence and cyber security. He has co-authored 100 technical publications including paper in refereed journals and conferences, invited papers and presentations and book chapters. He has also edited special issues at top journals, such as Future Generation Computer Systems and Multimedia Tool and Applications. He has served as reviewers of IEEE Transactions on Industrial Informatics, IEEE Communications Letters, IEEE Transactions on Network and Services Management, IEEE Access, Information Fusion, Future Generation Computer Systems and Journal of Network and Computer Applications. He is a member of IEEE.