
Accelerated CNN Training Through Gradient Approximation

Ziheng Wang
Department of EECS, M.I.T.
77 Massachusetts Ave, Cambridge, MA
ziheng@mit.edu

Sree Harsha Nelaturu
Department of ECE, SRMIST
Kattankulathur, Chennai, TN
sreeharsha_murali@srmuniv.edu.in

Abstract

Training deep convolutional neural networks such as VGG and ResNet by gradient descent is an expensive exercise requiring specialized hardware such as GPUs. Recent works have examined the possibility of approximating the gradient computation while maintaining the same convergence properties. While promising, these approximations only work on relatively small datasets such as MNIST. They also fail to achieve real wall-clock speedups due to the lack of efficient GPU implementations of the proposed approximation methods. In this work, we explore three alternative methods to approximate gradients, with an efficient GPU kernel implementation for one of them. We achieve wall-clock speedups upwards of 7 percent with ResNet-20 and VGG-19 on the CIFAR-10 dataset, with minimal loss in validation accuracy.

1. Introduction

Deep convolutional neural networks (CNNs) are now arguably the most popular computer vision algorithms. Models such as VGG [25] and ResNet [12] are widely used. However, these models contain up to hundreds of millions of parameters, resulting in a high memory footprint, long inference times and even longer training times.

The memory footprint and inference time of deep CNNs directly translate to application size and latency in production. Popular techniques based on model sparsification are able to deliver orders-of-magnitude reductions in the number of parameters in the network [11]. Together with emerging efficient sparse convolution kernel implementations, deep CNNs can now realistically be used in production after training [9, 23, 5].

However, the training of these deep CNNs is still a lengthy and expensive process. The hundreds of millions of parameters in the model must all be iteratively updated hundreds of thousands of times in a typical training run based on back-propagation. Recent research has attempted to address the training time issue by demonstrating effective training on large-scale computing clusters consisting of thousands of GPUs or high-end CPUs [32, 3, 14]. However, these computing clusters are still extremely expensive and labor-intensive to set up and maintain, even if the actual training process is reduced to minutes.

An alternative to using large computing clusters is to accelerate the computation of the gradients themselves. One option is to introduce highly optimized software [6] or new hardware [20, 15]. The training can also be performed in lower precision, which can lead to massive speedups with appropriate hardware support [21]. Another, less pursued option, complementary to the previous two, is to approximate the gradient computations themselves [26, 27, 29, 2]. Other recent works have also suggested that the exact gradient might not be necessary for efficient training of deep neural networks. Studies have shown that only the sign of the gradient is needed for effective back-propagation [31, 30]. Surprisingly, even random gradients can be used to train neural networks efficiently [17, 22]. However, these findings are mostly limited to small fully connected networks on smaller datasets. The proposed approximation algorithms also cannot directly translate into real wall-clock speedups in training time due to the lack of efficient GPU implementations.

In this work, we hypothesize that we can extend gradient approximation methods to deep neural networks to speed up gradient computations in the training process. We further hypothesize that we can apply these approximations to only a subset of the layers and still maintain the validation accuracy of the trained network. We validate our hypotheses on three deep CNNs (a 2-layer CNN [16], ResNet-20 [12] and VGG-19 [25]) on CIFAR-10. Our methods are fully compatible with classic deep CNN architectures and do not rely on explicit sparsity information that must be supplied to the network, unlike approaches such as SBNet and submanifold sparse convolutional networks [24, 8].

We summarize our contributions as follows:

• We present three gradient approximation methods for training deep CNNs, along with an efficient GPU implementation for one of them.

Figure 1. Forward and backward propagation through a convolutional layer during training. Asterisks indicate convolution operations and the operation in the red box is the one we approximate.

• We explore the application of these methods to deep CNNs and show that they allow for training convergence with minimal validation accuracy loss.

• We describe the concept of approximation schedules, a way to reason about applying different approximation methods across different layers and training batches.

2. Approximation Methods

In a forward-backward pass of a deep CNN during training, a convolutional layer requires three convolution operations: one for forward propagation and two for backward propagation, as demonstrated in Figure 1. We approximate the convolution operation which calculates the gradients of the filter values, which constitutes roughly a third of the computational time. We aim to apply the approximation a quarter of the time across layers/batches. This leads to a theoretical maximum speedup of around 8 percent.
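As a quick sanity check on that figure (a back-of-the-envelope bound that assumes the filter-gradient convolution is exactly one third of the per-batch convolution work and that the approximation itself costs nothing), the fraction of work saved is at most

    \frac{1}{3} \times \frac{1}{4} \;=\; \frac{1}{12} \;\approx\; 8.3\%,

which is where the quoted 8 percent ceiling comes from; any overhead of the approximation itself, or of switching between gradient paths, eats into this budget.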
2.1. Zero Gradient

The first method passes back zero as the weight gradient of a chosen layer for a chosen batch. If done for every training batch, it effectively freezes the filter weights.

2.2. Random Gradient

The second method passes back random numbers sampled from a normal distribution with mean 0 and standard deviation 1/128 (the inverse of the batch size) as the weight gradient of a chosen layer for a chosen batch. Different values in the weight gradient are chosen independently. Importantly, this is different from the random feedback alignment method discussed in [17] and [22], as we regenerate the random numbers every training batch. We implement this using tf.py_func, where np.random.normal is used to generate the random values. This approach is extremely inefficient, though surprisingly faster than a naive cuRAND implementation in a custom Tensorflow operation for most input cases. We are working on a more efficient implementation.
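To make the mechanism concrete, below is a minimal TensorFlow 1.x sketch of how a substituted filter gradient can be wired in with tf.custom_gradient. It is illustrative only: the function name and defaults are ours, and unlike the tf.py_func implementation described above it draws the noise on-device with tf.random.normal. The zero gradient method of Section 2.1 is the same pattern with the noise replaced by tf.zeros_like(w).

    import tensorflow as tf

    def conv2d_with_random_filter_grad(x, w, strides=(1, 1, 1, 1),
                                       padding="SAME", batch_size=128):
        # Forward pass is an ordinary convolution; only the filter gradient is replaced.
        @tf.custom_gradient
        def _conv(x, w):
            y = tf.nn.conv2d(x, w, list(strides), padding)
            def grad(dy):
                # Exact gradient w.r.t. the input, so the error introduced stays local
                # to this layer's filter update.
                dx = tf.nn.conv2d_backprop_input(tf.shape(x), w, dy, list(strides), padding)
                # Fresh N(0, 1/batch_size) noise, regenerated every batch.
                dw = tf.random.normal(tf.shape(w), stddev=1.0 / batch_size)
                return dx, dw
            return y, grad
        return _conv(x, w)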
2.3. Approximated Gradient

The third method we employ is based on the top-k selection algorithms popular in the literature [29]. In the gradient computation for a filter in a convolutional layer, only the largest-magnitude gradient value is retained for each output channel and each batch element. These retained values are scaled according to the sum of the gradients in their respective output channels so that the gradient estimate is unbiased, similar to the approach employed in [28]. All other gradients are set to zero. This results in a sparsity ratio of 1 − 1/(HW), where H and W are the height and width of the output hidden layer. The filter gradient is then calculated from this sparse version of the output gradient tensor with the saved input activations from the forward pass. The algorithm can be trivially modified to admit the top-k largest-magnitude gradient values with an adjustment of the scaling parameter, a direction of future research. Similar to the random gradient method, we find that we need to scale our approximated gradient by a batch-size-dependent factor for effective training. In the experiments here, we scale it by 1/128 (the inverse of the batch size).
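For concreteness, the sparsification rule can be written in a few lines of NumPy (a sketch under our notation, with dO in NCHW layout; the function name is ours): one position per (batch element, output channel) survives and carries the sum of its slice. Convolving the saved activations with this sparse tensor is what Algorithm 1 in the next subsection computes directly as a patch extraction, without ever materializing the zeros.

    import numpy as np

    def sparsify_output_grad(d_out):
        # d_out: output gradient with shape (N, Co, H, W).
        N, Co, H, W = d_out.shape
        sparse = np.zeros_like(d_out)
        for n in range(N):
            for c in range(Co):
                # Keep only the largest-magnitude entry of this (n, c) slice...
                r, col = np.unravel_index(np.argmax(np.abs(d_out[n, c])), (H, W))
                # ...and let it carry the sum of the whole slice, as described above.
                sparse[n, c, r, col] = d_out[n, c].sum()
        return sparse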

2.4. Efficient GPU Implementation

A major contribution of this work is an implementation of the approximated gradient method in CUDA. This is critical to achieving actual wall-clock training speedups. A naive Tensorflow implementation using tf.image.extract_glimpse does not use the GPU and results in significantly slower training time.

Efficient GPU implementations of dense convolutions frequently use matrix lowering or transforms such as FFT or Winograd [6, 18]. However, the overheads of these transformations might not be worth the benefit in a sparse setting. Recent approaches have instead sought to perform the sparse convolution directly on the CPU or GPU [23, 5]. Here we also opt for the latter approach. We interpret the sparse convolution in the calculation of the filter gradient as a patch extraction procedure, as demonstrated in Figure 2.

Figure 2. The approximation algorithm illustrated for an example with two filters and three input elements. For each filter, we extract a patch from each batch element's input activations and accumulate the patches.

Formally, let us assume we have an input tensor I and an output gradient tensor dO in NCHW format, where N is the batch dimension, C the channel dimension and H, W the height and width of the hidden layer. The filter tensor f has dimensions K × K × Ci × Co, where K is the filter size, Ci is the number of channels in I and Co is the number of channels in dO. We will use the symbol ∗ to denote the convolution operation. In order to compute df, we have to convolve I with dO. If we zero out all elements in dO except one for each output channel, the convolution becomes a collection of Co patches of shape K × K × Ci extracted from I, as specified below in Algorithm 1.

Algorithm 1 Max Gradient Approximation
1: df[:, :, :, :] = 0
2: for c = 1 : Co do
3:   for n = 1 : N do
4:     row, col ← arg max abs(dO[n, c, :, :])
5:     sum ← Σ dO[n, c, :, :]  (sum is a scalar)
6:     df[:, c, :, :] += I[n, :, row : row + K, col : col + K] ∗ sum
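As a reference point for the CUDA kernel, here is a plain NumPy rendering of Algorithm 1. It is a sketch under assumptions the pseudocode leaves implicit (stride 1, activations already padded so every K × K patch is in bounds) and folds in the 1/128 scaling from Section 2.3 as a parameter; the layouts match the kernel's (activations in NHWCi, output gradients in NCoHW, filter gradient in CoKKCi), and the function name is ours.

    import numpy as np

    def max_grad_approximation(acts, d_out, K, scale=1.0 / 128):
        # acts:  saved input activations, shape (N, H + K - 1, W + K - 1, Ci), NHWC layout.
        # d_out: output gradient, shape (N, Co, H, W), NCHW layout.
        # Returns the approximated filter gradient df with shape (Co, K, K, Ci).
        N, Co, H, W = d_out.shape
        Ci = acts.shape[-1]
        df = np.zeros((Co, K, K, Ci), dtype=acts.dtype)
        for c in range(Co):
            for n in range(N):
                # Location of the largest-magnitude gradient in this (batch, channel) slice.
                row, col = np.unravel_index(np.argmax(np.abs(d_out[n, c])), (H, W))
                # The retained position carries the sum of the slice (line 5 of Algorithm 1).
                s = d_out[n, c].sum()
                # Accumulate the corresponding K x K x Ci input patch (line 6).
                df[c] += acts[n, row:row + K, col:col + K, :] * s
        return scale * df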
Our kernel implementation expects input activations in NHWCi format and output gradients in NCoHW format. It produces the filter gradient in CoKKCi format. In the NHWC format, GPU global memory accesses from the patch extractions can be efficiently coalesced across the channel dimension, which is typically a multiple of 8. Each thread block is assigned to process several batch elements for a fixed output channel. Each thread block first computes the indices and values of the nonzero weight values from the output gradients. Then, it extracts the corresponding patches from the input activations and accumulates them into the result.

We benchmark the performance of our code against the NVIDIA cuDNN v7.4.2 library APIs. Approaches such as cuSPARSE have been demonstrated to be less effective in a sparse convolution setting and are not pursued here [5]. All timing metrics are obtained on a workstation with a Titan Xp GPU and 8 Intel Xeon CPUs at 3.60 GHz.

All training experiments are conducted in NCHW format, the preferred data layout of cuDNN. As a result, we incur a data transpose overhead for the input activations from NCHW to NHWC. In addition, we also incur a slight data transpose overhead for the filter gradient from CoKKCi to KKCiCo.

2.5. Approximation Schedules

Here, we introduce the concept of approximation schedules. This concept allows us to specify when particular approximations are applied in the training process and how to combine different approximations. Existing approximation methods such as DropBack [7] and PruneTrain [19] can be applied to a specific layer, but must be applied over all training batches, since their application at a particular training batch changes the structure of the network, thus affecting all subsequent batches. The three approximation methods we have described approximate the gradient computation of a weight filter for a single training batch. They can thus be applied to a specific layer for a specific training batch. We use the term "approximation schedule" to refer to a specification, consistent with the above rules, of which approximation method to apply for each layer and each training batch. An example of an approximation schedule is shown in Figure 3. More aggressive approximation schedules might lead to a higher loss in accuracy, but would also result in higher speedups. Here, we demonstrate that simple heuristics for picking approximation schedules can lead to good results on common networks such as ResNet-20 and VGG-19. While the efficacy of simple heuristics is crucial for the applicability of the proposed approximation methods in practice, determining the optimal approximation schedule for different neural network architectures is an interesting direction of future research.

Figure 3. Example approximation schedule for a 5-layer network over 4 training batches. Full Grad denotes regular gradient computation without approximation.
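Concretely, an approximation schedule is nothing more than a lookup table indexed by (layer, training batch), as in Figure 3. The Python sketch below (names are ours) builds the kind of simple heuristic schedule used in the evaluation: apply one chosen method to every k-th convolutional layer on every batch and fall back to the full gradient everywhere else.

    import numpy as np

    FULL, ZERO, RANDOM, APPROX = "full", "zero", "random", "approx"

    def every_kth_layer_schedule(num_layers, num_batches, method=APPROX, k=4, start=1):
        # A (num_layers x num_batches) table naming the method used for each layer/batch.
        schedule = np.full((num_layers, num_batches), FULL, dtype=object)
        schedule[start::k, :] = method   # e.g. layers 1, 5, 9, ... (0-indexed)
        return schedule

    # A 5-layer network over 4 training batches, in the spirit of Figure 3.
    print(every_kth_layer_schedule(5, 4, method=ZERO, k=2))

A training loop then simply consults schedule[layer, step % num_batches] when deciding which filter-gradient computation to run for that layer.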
3. Evaluation
We test our approach on three common neural network architectures (a 2-layer CNN [16], VGG-19 [25] and ResNet-20 [12]) on the CIFAR-10 dataset. The local response normalization in the 2-layer CNN is replaced by the more modern batch normalization method [13]. For all three networks, we aim to use the approximation methods 25 percent of the time. In this work, we test the three approximation methods separately and do not combine them. On the 2-layer CNN, we apply the selected approximation method to the second convolutional layer every other training batch. On VGG-19 and ResNet-20, we apply the selected approximation method to every fourth convolutional layer on every training batch, starting from the second convolutional layer. For example, the three approximation schedules for the 2-layer CNN are shown in Figure 4. We start from the second layer because recent work has shown that approximating the first convolutional layer is difficult [2]. This results in four approximated layers for VGG-19 and five approximated layers for ResNet-20. For the ResNet-20 model, we also train a baseline ResNet-14 model. Training a smaller model is what is typically done in practice when training time is of concern; ideally, our approximation methods applied to the larger ResNet-20 model should result in higher validation accuracy than the ResNet-14 baseline. For ResNet-20, we also experiment with other approximation schedules to show that our approximation methods are robust to the schedule choice.

Figure 4. The three approximation schedules studied for the 2-layer network using a) the zero gradient method, b) the random gradient method, c) the approximated gradient method.

We train the networks until the validation accuracy stabilizes. This took around 500 epochs for the 2-layer CNN, 250 epochs for the ResNet-20 model, and 200 epochs for the VGG-19 model. We use an exponentially decaying learning rate with the Adam optimizer for all three models. We apply typical data augmentation techniques such as random cropping and flipping to each minibatch. All training was performed within Tensorflow 1.13 [1].

3.1. Performance Comparisons

We compare the performance of our GPU kernel for the approximated gradient method with the full gradient computation for the weight filter as implemented in cuDNN v7.4.2. cuDNN offers state-of-the-art performance in dense gradient computation and is used in almost every deep learning library. For each input case, cuDNN tests several hand-assembled kernels and picks the fastest one. These kernels fully utilize the high floating-point throughput of the GPU to perform the dense gradient computations. In contrast, sparse approximations of the gradient usually have a lower arithmetic-to-memory ratio and do not admit equally efficient kernel implementations on the GPU. It is oftentimes necessary to impose structure or a high sparsity ratio to achieve an actual performance gain [33]. Here we demonstrate that our gradient approximation method does yield an efficient GPU implementation that can lead to actual speedups compared to cuDNN.

Table 1. Performance comparisons. All timing statistics are in microseconds. The Approx. total column is the sum of the CUDA kernel time and the transpose overhead.

We present timing comparisons for a few select input cases encountered in the network architectures used in this work in Table 1. We aggregate the two data transpose overheads of the input activations and the filter gradients. (In almost every case, the data transpose overhead of the input activations dominates.) We make three observations.

Firstly, in most cases, the gradient approximation, including data transposition, is at least three times as fast as the cuDNN baseline. Secondly, we observe that the cuDNN timing scales with the number of input channels times the height and width of the hidden layer, whereas our approximation kernel timing scales with the number of input channels alone. This is expected from the nature of the computations involved: the performance bottleneck of our kernel is the memory-intensive patch extractions, the sizes of which scale with the number of input channels times the filter size. Thirdly, we observe that in many cases, the data transposition overhead is over fifty percent of the kernel time, suggesting that our implementation can be further improved by fusing the data transpose into the kernel as in SBNet [24]. This is left for future work.

3.2. Training Convergence

We present convergence results for the training of our three neural networks using the chosen approximation schedules with two metrics, training loss and validation accuracy. In Figure 5, we see that for the 2-layer CNN, all approximation methods result in training loss curves and validation accuracy curves similar to the ones obtained by full gradient computation. We can even see that the random gradient method surpasses full gradient computation in terms of validation accuracy. The zero gradient method is very similar to full gradient computation, while the approximated gradient method does slightly worse.

Figure 5. a) Training loss of the 2-layer CNN with different approximation methods. b) Validation accuracy of the 2-layer CNN with different approximation methods.

The approximation methods remain robust on larger networks, such as ResNet-20, shown in Figure 6. In this case, we can see from both the loss curves and the validation accuracy that our approximated gradient method does slightly worse than full gradient computation, but better than both the random gradient and zero gradient methods. Curiously, the random gradient method maintains a high training loss throughout training, but is still able to achieve good validation accuracy.

Figure 6. a) Training loss of ResNet-20 with different approximation methods. b) Validation accuracy of ResNet-20 with different approximation methods. The loss curve of the random gradient method stagnates but the validation accuracy is competitive.

For VGG-19, shown in Figure 7, we see that full gradient descent actually lags the approximated methods in reaching the target validation accuracy. In this case, all three approximation methods perform very well. However, full gradient descent does eventually overtake all approximation methods in terms of validation accuracy. This suggests that a fruitful approach to explore, at least for networks similar to VGG, might be to use approximations early in training and switch to full gradient computation once a validation accuracy plateau has been reached.

3.3. Speedup-Accuracy Tradeoffs

Here, we present the wall-clock speedups achieved for each network and approximation method. We compare the speedups against the validation accuracy loss, measured from the best validation accuracy achieved during training. Validation accuracy was calculated every ten epochs. As aforementioned, the random gradient implementation is quite inefficient and a faster implementation is left to future work. The speedup takes into account the overhead of defining a custom operation in Tensorflow, as well as the significant overhead of switching the gradient computation based on the global training step.

For the 2-layer CNN, we are unable to achieve a wall-clock speedup for any approximation method, even the zero gradient one, because of this overhead (Table 2). However, all approximation methods achieve little validation accuracy loss. The random gradient method even outperforms full gradient computation by 0.8%.

Table 2. Training speedup and validation accuracy loss for the approximation methods on the 2-layer CNN. Negative speedup indicates a slowdown.
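For reference, this kind of step-based switching can be expressed directly in a TF 1.x graph, which is also where its cost comes from: both gradient paths live in the graph and a predicate on the global step is evaluated every batch. A minimal sketch (assuming the exact and approximated filter-gradient tensors are already constructed elsewhere; names are ours):

    import tensorflow as tf

    def select_filter_grad(exact_dw, approx_dw, global_step, period=2):
        # Use the approximated filter gradient on every `period`-th batch,
        # and the exact cuDNN gradient otherwise.
        use_approx = tf.equal(tf.mod(global_step, period), 0)
        return tf.cond(use_approx, lambda: approx_dw, lambda: exact_dw)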
For ResNet-20, the approximation schedule we choose does not involve switching gradient computations. We avoid the switching overhead and can achieve speedups for both the zero gradient method and the approximated gradient method. As shown in Table 3, the zero gradient method achieves roughly a third of the speedup obtained by training the baseline ResNet-14 model. The approximated gradient method also achieves a 3.5% wall-clock speedup, and is the only method to suffer less accuracy loss than simply using the smaller ResNet-14. In the following section, we demonstrate that with other approximation schedules, the approximated gradient method can achieve as little as 0.1% accuracy loss.

Table 3. Training speedup and validation accuracy loss for the approximation methods on ResNet-20. Negative speedup indicates a slowdown.

For VGG-19, despite being quicker to converge, the approximation methods all have worse validation accuracy than the baseline method (Table 4). The best approximation method appears to be the random gradient method, though it is extremely slow due to our inefficient implementation in Tensorflow. The other two methods also achieve high validation accuracies, with the approximated gradient method slightly better than the zero gradient method. Both methods are able to achieve speedups in training.

Figure 7. a) Training loss of the VGG-19 model with different approximation methods. b) Validation accuracy of the VGG-19 model with different approximation methods.

Table 4. Training speedup and validation accuracy loss for the approximation methods on VGG-19. Negative speedup indicates a slowdown.

3.4. Robustness to Approximation Schedule

Here, we explore two new approximation schedules for ResNet-20, keeping the total proportion of the time we apply the approximation at 25 percent. We will refer to the approximation schedule presented in the section above as schedule 1. Schedule 2 applies the selected approximation method to every other layer for every other batch. Schedule 3 applies the selected approximation method to every layer for every fourth batch. We also present the baseline result of the ResNet-14 model.

As we can see from Figure 8 and Table 5, under schedules 2 and 3, both the zero gradient and the approximated gradient methods perform well. In fact, for the approximated gradient and zero gradient methods the validation accuracy loss is smaller than under schedule 1. Indeed, under schedule 3, the approximated gradient method's best validation accuracy is within 0.1% of that of the full gradient computation. The random gradient method's validation accuracy is now in line with its poor loss curve for these two approximation schedules. This suggests that the random gradient method does not work well for the ResNet-20 architecture.

Figure 8. a) Training loss of ResNet-20 with different approximation methods for approximation schedule 2. b) Validation accuracy of ResNet-20 with different approximation methods for approximation schedule 2. c) Training loss of ResNet-20 with different approximation methods for approximation schedule 3. d) Validation accuracy of ResNet-20 with different approximation methods for approximation schedule 3.

Table 5. Validation accuracy for different approximation schedules on ResNet-20. Schedule 1 is the same as presented above.

4. Discussion and Conclusion

While research on accelerating deep learning inference abounds, there is relatively limited work focused on accelerating the training process. Recent works such as PruneTrain prune the neural network during training, but suffer quite serious losses in validation accuracy [19]. Approaches such as DropBack [7] and meProp [29, 26] show that approximated gradients are sufficient to successfully train neural networks but do not yet offer real wall-clock speedups. In this work, we show that we can train deep neural networks to good validation accuracy with very minimal gradient information on a subset of the layers, leading to wall-clock speedups for training.

We are surprised by the consistently strong performance of the zero gradient method. For ResNet-20, for two of the three approximation schedules tested, its validation accuracy loss is better than that of a smaller baseline network. Its performance is also satisfactory on VGG-19 as well as the 2-layer CNN. It admits an extremely fast implementation that delivers consistent speedups. This points to a simple way to potentially boost training speed in deep neural networks while maintaining their performance advantage over shallower alternatives.

We also demonstrate that random gradient methods can train deep neural networks to convergence, provided they are only applied to a subset of the layers. For the 2-layer CNN and VGG-19, this method leads to the least validation accuracy loss of all three approximation methods. However, its performance seriously lags the other methods on ResNet-20, suggesting that its effectiveness is network-architecture-specific. Naive feedback alignment, where the random gradient signal is fixed before training starts, has been shown to be difficult to extend to deep convolutional architectures [10, 4]. We show here that if the random gradients are newly generated every batch and applied to a subset of layers, they can be used to train deep neural networks to convergence. Interestingly, generating new random gradients every batch effectively abolishes any kind of possible "alignment" in the network, calling for a new explanation of why the network converges. Evidently, this method holds the potential for an extremely efficient implementation, something we are currently working on.

Finally, we present a gradient approximation method with an efficient GPU implementation. Our approximation method is consistent in terms of validation accuracy across different network architectures and approximation schedules. Although the training wall-clock time speedup is not large, the validation accuracy loss is also small. We wish to re-emphasize here the small validation accuracy difference observed between the baseline ResNet-14 and ResNet-20, which leads us to believe that novel training speed-up methods must incur minimal validation accuracy loss to be more practical than simply training a smaller network.

In conclusion, we show that we can "fool" deep neural networks into training properly while supplying them only very minimal gradient information on select layers. The approximation methods are simple and robust, holding the promise of accelerating the lengthy training process for state-of-the-art deep CNNs.

5. Future Work

Besides those already mentioned, there are several more interesting directions for future work. One direction is predicting the validation accuracy loss that a neural network would suffer under a particular approximation schedule. With such a predictor, we could optimize for the fastest approximation schedule while constraining the final validation accuracy loss before the training run. We can also examine the effects of mingling different approximation methods and integrating existing methods such as PruneTrain and DropBack [19, 7]. Another direction is approximating the gradients of the hidden activations, as is done in meProp [26]. However, if we approximate the hidden activations at a deeper layer of the network, the approximation error will be propagated to the shallower layers. Due to this concern, we start with approximating filter weight gradients, where the effect of the errors is local. Finally, we are working on integrating this approach into a distributed training setting, where the approximation schedule is now 3-dimensional (machine, layer, batch). This would be crucial for the approximation methods to work with larger-scale datasets such as ImageNet, thus potentially allowing for wall-clock speed-ups in large-scale training.

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
[2] Menachem Adelman and Mark Silberstein. Faster neural network training with approximate tensor operations. arXiv preprint arXiv:1805.08079, 2018.
[3] Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.
[4] Sergey Bartunov, Adam Santoro, Blake Richards, Luke Marris, Geoffrey E Hinton, and Timothy Lillicrap. Assessing the scalability of biologically-motivated deep learning algorithms and architectures. In Advances in Neural Information Processing Systems, pages 9368–9378, 2018.
[5] Xuhao Chen. Escort: Efficient sparse convolutional neural networks on gpus. arXiv preprint arXiv:1802.10280, 2018.
[6] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.
[7] Maximilian Golub, Guy Lemieux, and Mieszko Lis. Dropback: Continuous pruning during training. arXiv preprint arXiv:1806.06949, 2018.
[8] Benjamin Graham and Laurens van der Maaten. Submanifold sparse convolutional networks. arXiv preprint arXiv:1706.01307, 2017.
[9] Scott Gray, Alec Radford, and Diederik P Kingma. Gpu kernels for block-sparse weights. arXiv preprint arXiv:1711.09224, 2017.
[10] Donghyeon Han and Hoi-jun Yoo. Efficient convolutional neural network training with direct feedback alignment. arXiv preprint arXiv:1901.01986, 2019.
[11] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[13] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[14] Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, et al. Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes. arXiv preprint arXiv:1807.11205, 2018.
[15] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pages 1–12. IEEE, 2017.
[16] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[17] Timothy P Lillicrap, Daniel Cownden, Douglas B Tweed, and Colin J Akerman. Random synaptic feedback weights support error backpropagation for deep learning. Nature communications, 7:13276, 2016.
[18] Xingyu Liu, Jeff Pool, Song Han, and William J Dally. Efficient sparse-winograd convolutional neural networks. arXiv preprint arXiv:1802.06367, 2018.
[19] Sangkug Lym, Esha Choukse, Siavash Zangeneh, Wei Wen, Mattan Erez, and Sujay Shanghavi. Prunetrain: Gradual structured pruning from scratch for faster neural network training. arXiv preprint arXiv:1901.09290, 2019.
[20] Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S Vetter. Nvidia tensor core programmability, performance & precision. In 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 522–531. IEEE, 2018.
[21] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.
[22] Arild Nøkland. Direct feedback alignment provides learning in deep neural networks. In Advances in neural information processing systems, pages 1037–1045, 2016.
[23] Jongsoo Park, Sheng Li, Wei Wen, Ping Tak Peter Tang, Hai Li, Yiran Chen, and Pradeep Dubey. Faster cnns with direct sparse convolutions and guided pruning. arXiv preprint arXiv:1608.01409, 2016.

[24] Mengye Ren, Andrei Pokrovsky, Bin Yang, and Raquel Urta-
sun. Sbnet: Sparse blocks network for fast inference. In Pro-
ceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 8711–8720, 2018.
[25] Karen Simonyan and Andrew Zisserman. Very deep convo-
lutional networks for large-scale image recognition. arXiv
preprint arXiv:1409.1556, 2014.
[26] Xu Sun, Xuancheng Ren, Shuming Ma, and Houfeng Wang.
meprop: Sparsified back propagation for accelerated deep
learning with reduced overfitting. In Proceedings of the 34th
International Conference on Machine Learning-Volume 70,
pages 3299–3308. JMLR.org, 2017.
[27] Xu Sun, Xuancheng Ren, Shuming Ma, Bingzhen Wei, Wei
Li, Jingjing Xu, Houfeng Wang, and Yi Zhang. Training
simplification and model simplification for deep learning: A
minimal effort back propagation method. IEEE Transactions
on Knowledge and Data Engineering, 2018.
[28] Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gra-
dient sparsification for communication-efficient distributed
optimization. In Advances in Neural Information Processing
Systems, pages 1306–1316, 2018.
[29] Bingzhen Wei, Xu Sun, Xuancheng Ren, and Jingjing Xu.
Minimal effort back propagation for convolutional neural
networks. arXiv preprint arXiv:1709.05804, 2017.
[30] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan
Wang, Yiran Chen, and Hai Li. Terngrad: Ternary gradi-
ents to reduce communication in distributed deep learning.
In Advances in neural information processing systems, pages
1509–1519, 2017.
[31] Will Xiao, Honglin Chen, Qianli Liao, and Tomaso Poggio.
Biologically-plausible learning algorithms can scale to large
datasets. arXiv preprint arXiv:1811.03567, 2018.
[32] Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and
Kurt Keutzer. Imagenet training in minutes. In Proceedings
of the 47th International Conference on Parallel Processing,
page 1. ACM, 2018.
[33] Maohua Zhu, Jason Clemons, Jeff Pool, Minsoo Rhu,
Stephen W Keckler, and Yuan Xie. Structurally sparsified
backward propagation for faster long short-term memory
training. arXiv preprint arXiv:1806.00512, 2018.
