method discussed in [17] and [22], as we regenerate the random numbers every training batch. We implement this using tf.py_func, where np.random.normal is used to generate the random values. This approach is extremely inefficient, though surprisingly faster than a naive cuRAND implementation in a custom TensorFlow operation for most input cases. We are working on a more efficient implementation.
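The snippet below is a minimal sketch of how this can be wired up in TensorFlow 1.x, assuming the random filter gradient is simply drawn from a unit Gaussian of the same shape as the true gradient while the input gradient is still computed exactly; the function names and the unit scale are illustrative assumptions, not our exact implementation.

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x


def _random_filter_grad(filter_shape):
    # Fresh Gaussian values on every call, i.e. every training batch.
    return np.random.normal(size=filter_shape).astype(np.float32)


@tf.custom_gradient
def conv2d_random_wgrad(x, w):
    """Forward pass is a regular convolution; the backward pass keeps the
    exact input gradient but swaps the filter gradient for freshly drawn
    random values produced on the host through tf.py_func."""
    y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')

    def grad(dy):
        dx = tf.nn.conv2d_backprop_input(
            tf.shape(x), w, dy, strides=[1, 1, 1, 1], padding='SAME')
        dw = tf.py_func(_random_filter_grad, [tf.shape(w)], tf.float32)
        dw.set_shape(w.shape)  # py_func loses static shape information
        return dx, dw

    return y, grad
```

Because tf.py_func executes on the host CPU, this construction is slow by design, which is why a more efficient implementation is still needed.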
Figure 3. Example approximation schedule for a 5-layer network over 4 training batches. Full Grad denotes regular gradient computation without approximation.
Table 1. Performance comparisons. All timing statistics are in microseconds. The Approx. Total column is the sum of the CUDA kernel time and the transpose overhead.
Figure 4. The three approximation schedules studied for the 2-layer network using a) zero gradient method, b) random gradient method, c) approximated gradient method.
to the second convolutional layer every other training batch. On VGG-19 and ResNet-20, we apply the selected approximation method to every fourth convolutional layer on every training batch, starting from the second convolutional layer. For example, the three approximation schedules for the 2-layer CNN are shown in Figure 4. We start from the second layer because recent work has shown that approximating the first convolutional layer is difficult [2]. This results in four approximated layers for VGG-19 and five approximated layers for ResNet-20. For the ResNet-20 model, we also train a baseline ResNet-14 model; training a smaller model is what is typically done in practice when training time is of concern. Ideally, training the larger ResNet-20 model with our approximation methods should result in higher validation accuracy than the ResNet-14 model. For ResNet-20, we also experiment with other approximation schedules to show that our approximation methods are robust to the choice of schedule.
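For concreteness, the helper below is an illustrative sketch (not our training code) of the layer/batch selection just described, using zero-based layer indices so that index 1 is the second convolutional layer.

```python
def approximate_this_batch(layer_idx, batch_idx, network):
    """Return True if the filter gradient of conv layer `layer_idx` should be
    approximated on training batch `batch_idx` (zero-based indices)."""
    if network == '2-layer-cnn':
        # Second convolutional layer only, every other training batch.
        return layer_idx == 1 and batch_idx % 2 == 0
    if network in ('vgg-19', 'resnet-20'):
        # Every fourth convolutional layer, starting from the second,
        # on every training batch.
        return layer_idx >= 1 and (layer_idx - 1) % 4 == 0
    return False
```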
We train the networks until the validation accuracy stabilizes. This took around 500 epochs for the 2-layer CNN, 250 epochs for the ResNet-20 model, and 200 epochs for the VGG-19 model. We use an exponentially decaying learning rate with the Adam optimizer for all three models. We apply typical data augmentation techniques such as random cropping and flipping to each minibatch. All training was performed within TensorFlow 1.13 [1].
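A sketch of this training setup in TensorFlow 1.x is shown below; the initial learning rate, decay schedule, and 32x32 CIFAR-style input size are illustrative assumptions rather than the exact values used.

```python
import tensorflow as tf  # TensorFlow 1.x

global_step = tf.train.get_or_create_global_step()
# Exponentially decaying learning rate with the Adam optimizer
# (initial rate, decay steps, and decay rate are placeholder values).
learning_rate = tf.train.exponential_decay(
    learning_rate=1e-3, global_step=global_step,
    decay_steps=10000, decay_rate=0.95, staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate)


def augment(image):
    # Typical per-minibatch augmentation: pad, random crop, and random flip.
    image = tf.image.resize_image_with_crop_or_pad(image, 40, 40)
    image = tf.random_crop(image, [32, 32, 3])
    return tf.image.random_flip_left_right(image)
```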
3.1. Performance Comparisons
We compare the performance of our GPU kernel for the approximated gradient method with the full gradient computation for the weight filter as implemented in cuDNN v7.4.2. cuDNN offers state-of-the-art performance in dense gradient computation and is used in almost every deep learning library. For each input case, cuDNN tests several hand-assembled kernels and picks the fastest one. These kernels fully utilize the high floating point throughput of the GPU to perform the dense gradient computations. In contrast, sparse approximations of the gradient usually have a lower arithmetic-to-memory ratio and do not admit equally efficient kernel implementations on the GPU. It is often necessary to impose structure or a high sparsity ratio to achieve an actual performance gain [33]. Here we demonstrate that our gradient approximation method does yield an efficient GPU implementation that can lead to actual speedups compared to cuDNN.
We present timing comparisons in Table 1 for a few select input cases encountered in the network architectures used in this work. We aggregate the two data transpose overheads of the input activations and the filter gradients. (In almost every case, the data transpose overhead of the input activations dominates.) We make three observations.
Firstly, in most cases the gradient approximation, including data transposition, is at least three times as fast as the cuDNN baseline. Secondly, we observe that cuDNN timing scales with the number of input channels times the height and width of the hidden layer, whereas our approximation kernel timing scales with the number of input channels alone. This is expected from the nature of the computations involved: the performance bottleneck of our kernel is the memory-intensive patch extractions, whose sizes scale with the number of input channels times the filter size. Thirdly, we observe that in many cases the data transposition overhead is over fifty percent of the kernel time, suggesting that our implementation can be further improved by fusing the data transpose into the kernel as in SBNet [24]. This is left for future work.
3.2. Training Convergence
We present convergence results for the training of our three neural networks using the chosen approximation schedules with two metrics: training loss and validation accuracy. In Figure 5, we see that for the 2-layer CNN, all approximation methods result in training loss curves and validation accuracy curves similar to the ones obtained by full gradient computation.
Figure 5. a) Training loss of 2-layer CNN with different approximation methods. b) Validation accuracy of 2-layer CNN with different approximation methods.
Figure 6. a) Training loss of ResNet-20 with different approximation methods. b) Validation accuracy of ResNet-20 with different approximation methods. The loss curve of the random gradient method stagnates but the validation accuracy is competitive.
Table 3. Training speedup and validation accuracy loss for the approximation methods on ResNet-20. Negative speedup indicates a slowdown.
Table 4. Training speedup and validation accuracy loss for the approximation methods on VGG-19. Negative speedup indicates a slowdown.
Figure 8. a) Training loss of ResNet-20 with different approximation methods for approximation schedule 2. b) Validation accuracy of ResNet-20 with different approximation methods for approximation schedule 2. c) Training loss of ResNet-20 with different approximation methods for approximation schedule 3. d) Validation accuracy of ResNet-20 with different approximation methods for approximation schedule 3.
Table 5. Validation accuracy for different approximation schedules on ResNet-20. Schedule 1 is the same as presented above.
For schedules 2 and 3, both the zero gradient and the approximated gradient methods perform well. In fact, for the approximated gradient and the zero gradient methods, the validation accuracy loss is smaller than for schedule 1. Indeed, in schedule 3, the approximated gradient's best validation accuracy is within 0.1% of that of the full gradient computation. The random gradient method's validation accuracy is now in line with its poor loss curve for these two approximation schedules. This suggests that the random gradient method does not work well for the ResNet-20 architecture.
4. Discussion and Conclusion
While research on accelerating deep learning inference abounds, there is relatively limited work focused on accelerating the training process. Recent works such as PruneTrain prune the neural network during training, but suffer quite a serious loss in validation accuracy [19]. Approaches such as DropBack [7] and meProp [29, 26] show that approximated gradients are sufficient for successfully training neural networks but do not yet offer real wall-clock speedups. In this work, we show that we can train deep neural networks to good validation accuracy with very minimal gradient information on a subset of the layers, leading to wall-clock speedups for training.
We are surprised by the consistently strong performance of the zero gradient method. For ResNet-20, for two of the three approximation schedules tested, the validation accuracy loss is smaller than that of the smaller baseline network. Its performance is also satisfactory on VGG-19 as well as the 2-layer CNN. It admits an extremely fast implementation that delivers consistent speedups. This points to a simple way to potentially boost training speed in deep neural networks while maintaining their performance advantage over shallower alternatives.
We also demonstrate that random gradient methods can train deep neural networks to convergence, provided they are applied to only a subset of the layers. For the 2-layer CNN and VGG-19, this method leads to the least validation accuracy loss of all three approximation methods. However, its performance seriously lags the other methods on ResNet-20, suggesting that its performance is network-architecture-specific. Naive feedback alignment, where the random gradient signal is fixed before training starts, has been shown to be difficult to extend to deep convolutional architectures [10, 4]. We show here that if the random gradients are newly generated every batch and applied to a subset of layers, they can be used to train deep neural networks to convergence. Interestingly, generating new random gradients every batch effectively abolishes any kind of possible "alignment" in the network, calling for a new explanation of why the network converges. Evidently, this method holds the potential for an extremely efficient implementation, something we are currently working on.
Finally, we present a gradient approximation method with an efficient GPU implementation. Our approximation method is consistent in terms of validation accuracy across different network architectures and approximation schedules. Although the training wall-clock speedup is not large, the validation accuracy loss is also small. We wish to re-emphasize here the small validation accuracy difference observed between the baseline ResNet-14 and ResNet-20, leading us to believe that novel training speed-up methods must incur minimal validation accuracy loss to be more practical than simply training a smaller network.
In conclusion, we show that we can "fool" deep neural networks into training properly while supplying them with only very minimal gradient information on select layers. The approximation methods are simple and robust, holding the promise of accelerating the lengthy training process for state-of-the-art deep CNNs.
5. Future Work
Besides those already mentioned, there are several more interesting directions for future work. One direction is predicting the validation accuracy loss that a neural network would suffer under a particular approximation schedule. With such a predictor, we could optimize for the fastest approximation schedule while constraining the final validation accuracy loss before the training run; a sketch of such a schedule search is given at the end of this section. We can also examine the effects of mingling different approximation methods and of integrating existing methods such as PruneTrain and DropBack [19, 7]. Another direction is approximating the gradient of the hidden activations, as is done in meProp [26]. However, if we approximate the hidden activations at a deeper layer of the network, the approximation error will be propagated to the shallower layers. Due to this concern, we start with approximating filter weight gradients, where the effect of errors is local. Finally, we are working on integrating this approach into a distributed training setting, where the approximation schedule is now three-dimensional (machine, layer, batch). This approach would be crucial for the approximation methods to work with larger-scale datasets such as ImageNet, thus potentially allowing for wall-clock speed-up in large-scale training.
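The sketch below illustrates the schedule-search direction mentioned above; it is purely hypothetical, and `predict_accuracy_loss` and `estimate_epoch_time` stand in for the yet-to-be-built predictor and cost model.

```python
from itertools import combinations


def fastest_schedule(num_conv_layers, max_acc_loss,
                     predict_accuracy_loss, estimate_epoch_time):
    """Pick the fastest set of approximated layers whose predicted
    validation accuracy loss stays within max_acc_loss.

    `predict_accuracy_loss` and `estimate_epoch_time` are hypothetical
    callables taking a tuple of approximated layer indices."""
    best, best_time = None, float('inf')
    # The first convolutional layer (index 0) is never approximated.
    for k in range(1, num_conv_layers):
        for layers in combinations(range(1, num_conv_layers), k):
            if predict_accuracy_loss(layers) > max_acc_loss:
                continue
            t = estimate_epoch_time(layers)
            if t < best_time:
                best, best_time = layers, t
    return best
```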
References
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 16), pages 265-283, 2016.
[2] Menachem Adelman and Mark Silberstein. Faster neural network training with approximate tensor operations. arXiv preprint arXiv:1805.08079, 2018.
[3] Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.
[4] Sergey Bartunov, Adam Santoro, Blake Richards, Luke Marris, Geoffrey E Hinton, and Timothy Lillicrap. Assessing the scalability of biologically-motivated deep learning algorithms and architectures. In Advances in Neural Information Processing Systems, pages 9368-9378, 2018.
[5] Xuhao Chen. Escort: Efficient sparse convolutional neural networks on gpus. arXiv preprint arXiv:1802.10280, 2018.
[6] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.
[7] Maximilian Golub, Guy Lemieux, and Mieszko Lis. Dropback: Continuous pruning during training. arXiv preprint arXiv:1806.06949, 2018.
[8] Benjamin Graham and Laurens van der Maaten. Submanifold sparse convolutional networks. arXiv preprint arXiv:1706.01307, 2017.
[9] Scott Gray, Alec Radford, and Diederik P Kingma. Gpu kernels for block-sparse weights. arXiv preprint arXiv:1711.09224, 2017.
[10] Donghyeon Han and Hoi-jun Yoo. Efficient convolutional neural network training with direct feedback alignment. arXiv preprint arXiv:1901.01986, 2019.
[11] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770-778, 2016.
[13] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[14] Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, et al. Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes. arXiv preprint arXiv:1807.11205, 2018.
[15] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pages 1-12. IEEE, 2017.
[16] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[17] Timothy P Lillicrap, Daniel Cownden, Douglas B Tweed, and Colin J Akerman. Random synaptic feedback weights support error backpropagation for deep learning. Nature communications, 7:13276, 2016.
[18] Xingyu Liu, Jeff Pool, Song Han, and William J Dally. Efficient sparse-winograd convolutional neural networks. arXiv preprint arXiv:1802.06367, 2018.
[19] Sangkug Lym, Esha Choukse, Siavash Zangeneh, Wei Wen, Mattan Erez, and Sujay Shanghavi. Prunetrain: Gradual structured pruning from scratch for faster neural network training. arXiv preprint arXiv:1901.09290, 2019.
[20] Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S Vetter. Nvidia tensor core programmability, performance & precision. In 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 522-531. IEEE, 2018.
[21] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.
[22] Arild Nøkland. Direct feedback alignment provides learning in deep neural networks. In Advances in neural information processing systems, pages 1037-1045, 2016.
[23] Jongsoo Park, Sheng Li, Wei Wen, Ping Tak Peter Tang, Hai Li, Yiran Chen, and Pradeep Dubey. Faster cnns with direct sparse convolutions and guided pruning. arXiv preprint arXiv:1608.01409, 2016.
[24] Mengye Ren, Andrei Pokrovsky, Bin Yang, and Raquel Urtasun. Sbnet: Sparse blocks network for fast inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8711-8720, 2018.
[25] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[26] Xu Sun, Xuancheng Ren, Shuming Ma, and Houfeng Wang. meprop: Sparsified back propagation for accelerated deep learning with reduced overfitting. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3299-3308. JMLR.org, 2017.
[27] Xu Sun, Xuancheng Ren, Shuming Ma, Bingzhen Wei, Wei Li, Jingjing Xu, Houfeng Wang, and Yi Zhang. Training simplification and model simplification for deep learning: A minimal effort back propagation method. IEEE Transactions on Knowledge and Data Engineering, 2018.
[28] Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems, pages 1306-1316, 2018.
[29] Bingzhen Wei, Xu Sun, Xuancheng Ren, and Jingjing Xu. Minimal effort back propagation for convolutional neural networks. arXiv preprint arXiv:1709.05804, 2017.
[30] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in neural information processing systems, pages 1509-1519, 2017.
[31] Will Xiao, Honglin Chen, Qianli Liao, and Tomaso Poggio. Biologically-plausible learning algorithms can scale to large datasets. arXiv preprint arXiv:1811.03567, 2018.
[32] Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. Imagenet training in minutes. In Proceedings of the 47th International Conference on Parallel Processing, page 1. ACM, 2018.
[33] Maohua Zhu, Jason Clemons, Jeff Pool, Minsoo Rhu, Stephen W Keckler, and Yuan Xie. Structurally sparsified backward propagation for faster long short-term memory training. arXiv preprint arXiv:1806.00512, 2018.