Abstract— Direct convolution methods are now drawing increasing attention as they eliminate the additional storage demand required by indirect convolution algorithms (i.e., the transformed matrix generated by the im2col convolution algorithm). Nevertheless, the direct methods require special input–output tensor formatting, leading to extra time and memory consumption to obtain the desired data layout. In this article, we show that indirect convolution, if implemented properly, is able to achieve high computation performance with the help of highly optimized matrix-multiplication subroutines while avoiding substantial memory overhead. The proposed algorithm is called efficient convolution via blocked columnizing (ECBC). Inspired by the im2col convolution algorithm and the block algorithm of general matrix-to-matrix multiplication, we propose to conduct the convolution computation blockwise. As a result, the tensor-to-matrix transformation process (e.g., the im2col operation) can also be done in a blockwise manner, so that it only requires a block of memory as small as the data block. Extensive experiments on various platforms and networks validate the effectiveness of ECBC, as well as the superiority of our proposed method against a set of widely used industrial-level convolution algorithms.

Index Terms— Convolutional neural networks (CNNs), direct convolution, high-performance computing for mobile devices, im2col convolution, memory-efficient convolution (MEC).

I. INTRODUCTION

Convolutional neural networks (CNNs) have achieved great success in various areas, including image recognition [1]–[3], object detection [4]–[6], and semantic segmentation [7], [8]. Recently, with the emergence of high-end mobile devices, more and more deep learning applications are migrating from desktops to these edge devices. However, the high memory and computation demands of convolution obstruct the application of CNNs on these platforms, where memory and power resources are largely constrained. A high-performance routine that accelerates convolution computation on mobile devices is therefore in high demand.

Early endeavors accelerate convolution in an indirect manner. These approaches either transform the input and kernel tensors into another domain [9], [10] or transform the input tensor into a special format [11], [12] for efficient convolution computation.

Though widely applied, the indirect algorithms introduce a substantial additional memory footprint, as the transformation process requires reshaping and selectively duplicating parts of the input/kernel/output tensors. The problem becomes more critical on embedded devices, where storage resources are largely constrained [13], [14]. Previous works eliminate these memory demands through directly performed convolution [13]–[16]. By carefully designing the loop tiling, ordering, and data layout, direct convolution¹ is able to achieve performance comparable to, or even better than, indirect methods.

While achieving overall good performance, existing direct convolution methods also suffer from two critical problems. First, in order to optimize register blocking and memory access, these methods require the input and output to be stored in an unconventional format. Specifically, the input and output tensors with shape N × C × H × W are split as N × (C/x) × H × W × x for convolution computation. The practical use of these methods requires either a large number of operations implemented for this data layout or additional time and memory overhead for data format transformation [17]. What is worse, the factor x can differ across layers [15], [16], so the reorganization must be carried out for each convolution layer. Second, because of the complicated computation loops of convolution, memory access during computation is not fully contiguous, which is unfriendly to the hierarchical memory architecture of modern CPUs and leads to further performance degradation.

In this article, we show that indirect convolution, if implemented properly, is able to retain high computation performance with the help of highly optimized matrix-multiplication subroutines while avoiding substantial memory overhead during the whole computation. The proposed algorithm is called efficient convolution via blocked columnizing (ECBC). Inspired by the block algorithm of general matrix-to-matrix multiplication (GEMM), instead of performing the whole convolution at once, we conduct the computation blockwise. As a result, the im2col transformation can also be done in a blockwise manner. The block algorithm, together with memory

¹Direct convolution is defined as the convolution algorithm that is implemented directly with nested loops.

Manuscript received 8 December 2020; revised 30 March 2021; accepted 25 June 2021. Date of publication 19 July 2021; date of current version 5 January 2023. This work was supported in part by the National Natural Science Foundation of China under Grant 61972396, in part by the National Key Research and Development Program of China under Grant 2020AAA0103402, and in part by the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant XDA27040300. (Corresponding author: Jian Cheng.)

Tianli Zhao, Xiangyu He, and Jian Cheng are with the Institute of Automation, Chinese Academy of Sciences, Beijing 100080, China (e-mail: jcheng@nlpr.ia.ac.cn).

Qinghao Hu, Weixiang Xu, Jiaxing Wang, and Cong Leng are with the National Laboratory of Pattern Recognition, Beijing 100190, China.

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TNNLS.2021.3095276.

Digital Object Identifier 10.1109/TNNLS.2021.3095276
Fig. 2. (a) Computation pipeline of the im2col-based convolution algorithm. When the kernel window slides through the input tensor I, each corresponding subpatch of I is vectorized and duplicated into one column of a flat buffer I*. In this way, the results of convolution can be computed through matrix multiplication between the kernel matrix K* and the flat buffer matrix I*. (b) Computation pipeline of the block algorithm of GEMM. The computation is first blocked along the p dimension with factor p_c and then along the n dimension with factor n_c for cache blocking. The subsequent computations are further decomposed along the m dimension with factor m_r and, finally, along the n_c dimension with factor n_r for register blocking.
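The im2col step sketched in Fig. 2(a) can be summarized in a few lines of C. The routine below is illustrative only (the name im2col_nchw, the argument names, and the single-image NCHW layout are our assumptions, not code from the paper): it copies each k_h × k_w receptive field of every input channel into one column of the flat buffer I*, after which the convolution reduces to a GEMM between the o_c × (i_c k_h k_w) kernel matrix K* and the (i_c k_h k_w) × (o_h o_w) buffer.

    /* Illustrative im2col for one NCHW image (not the authors' code).
     * input : ic x ih x iw, stored contiguously
     * buffer: (ic*kh*kw) x (oh*ow); column j holds the patch of output pixel j */
    void im2col_nchw(const float *input, float *buffer,
                     int ic, int ih, int iw, int kh, int kw,
                     int oh, int ow, int stride, int pad)
    {
        for (int c = 0; c < ic; ++c)
            for (int r = 0; r < kh; ++r)
                for (int s = 0; s < kw; ++s) {
                    int row = (c * kh + r) * kw + s;          /* row of the flat buffer */
                    for (int y = 0; y < oh; ++y)
                        for (int x = 0; x < ow; ++x) {
                            int iy = y * stride - pad + r;    /* source coordinates     */
                            int ix = x * stride - pad + s;
                            buffer[row * (oh * ow) + y * ow + x] =
                                (iy >= 0 && iy < ih && ix >= 0 && ix < iw)
                                    ? input[(c * ih + iy) * iw + ix]
                                    : 0.0f;                   /* zero padding           */
                        }
                }
    }

The price of this convenience is the buffer itself: it holds i_c k_h k_w o_h o_w elements, which is precisely the memory overhead that the blocked scheme of Fig. 2(b) and, later, ECBC aim to avoid.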
submatrix, say B̂ of B, is duplicated into a contiguous area of memory B* [refer to the arrow labeled packb in Fig. 2(b)]. The blocking factors n_c and p_c are carefully selected such that the memory block B* resides in the L2 cache and is reused during the computation of GEPB.

Most mobile devices are based on the reduced instruction set computer (RISC) architecture, which means that data must be loaded into registers before operations are performed, and single instruction, multiple data (SIMD) is also a widely utilized technique. For this reason, register blocking is further considered. More specifically, each GEPB problem is further blocked along the m dimension with the factor m_r and, finally, along the n_c dimension with the factor n_r. The blocking factors m_r and n_r are selected according to the number and bit width of the (vector) registers. In this way, the whole computation is finally decomposed into multiple inner-product computations between one m_r × p_c micro panel of A and one p_c × n_r micro panel of B, which are implemented in highly optimized assembly. Due to register blocking, during each inner-product computation, each piece of data is loaded into registers from memory just once and reused many times (n_r times for the micro panel of A and m_r times for the micro panel of B). What is more, multiple output elements can be calculated simultaneously with SIMD instructions, which further improves the computation performance.

2) Packing: It is generally acknowledged that, because of the memory hierarchy widely used in modern CPUs, accessing memory contiguously during the computation is much more efficient than noncontiguous access. For the purpose of contiguous memory access during the computation of GEMM, a packing operation is needed. To this end, two areas of shared memory are allocated, i.e., tensor A* with size (m/m_r) × p_c × m_r and tensor B* with size (n_c/n_r) × p_c × n_r. By packing the corresponding elements of A and B into these two tensors with a specific organization, the memory access during the GEPB computation shown at the bottom of Fig. 2(b) is completely contiguous. For instance, one m × p_c submatrix of A, denoted as Â, is packed into A*. In this process, Â is blocked along the m dimension with the factor m_r, generating multiple m_r × p_c micro panels. These m/m_r micro panels are stored in A* one by one in column-major order.² Similarly, one p_c × n_c submatrix of B, denoted as B̂, is organized into multiple p_c × n_r micro panels and stored in B* one by one in row-major order. In this way, during the multiplication of the submatrices of A and B, the memory accesses of A* and B* are actually contiguous [47].

²Row-major order and column-major order are two commonly used formats for storing multidimensional tensors in linear memory. In row-major order, adjacent elements in one row of the matrix are stored in a contiguous area of memory, whereas in column-major order, adjacent elements in one column of the matrix are stored in a contiguous area of memory.

For ease of presentation, we formally define two operators packa and packb so that

    [h, w] = packa([m, pc, mr]) ⇒ A*[m, pc, mr] = Â[h, w]
    [h, w] = packb([nc, pc, nr]) ⇒ B*[nc, pc, nr] = B̂[h, w].

The operators packa and packb can be defined by the following equations:

    packa([m, pc, mr]) = [m · m_r + mr, pc]
    packb([nc, pc, nr]) = [pc, nc · n_r + nr].    (2)

IV. METHODOLOGY

The main drawback of the im2col convolution algorithm is that it needs a large area of memory of size i_c k_h k_w o_h o_w for the storage of the transformed input tensor, which is several times larger than the feature map itself. In this section, we introduce the main idea of ECBC. Our method only needs a small area of shared memory to store the packed inputs, and it can be generalized to arbitrary convolutional architectures.

A. Computation Pipeline

Fig. 3 briefly describes the computation pipeline of ECBC. The proposed algorithm is a principled improvement over
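To make the index mapping in (2) concrete, the following C sketch packs one m × p_c submatrix Â (row-major, leading dimension lda) into A* laid out as (m/m_r) × p_c × m_r micro panels. The function name pack_a and the assumption that m is a multiple of m_r are ours; the actual packing code used in practice may differ.

    /* Pack A_hat (m x pc, row-major, leading dimension lda) into A_star with
     * layout (m/mr) x pc x mr, i.e., A_star[i][p][j] = A_hat[i*mr + j][p],
     * which is the packa mapping of (2). Assumes m % mr == 0. */
    void pack_a(const float *A_hat, int lda, float *A_star,
                int m, int pc, int mr)
    {
        for (int i = 0; i < m / mr; ++i)        /* which micro panel           */
            for (int p = 0; p < pc; ++p)        /* column within the panel     */
                for (int j = 0; j < mr; ++j)    /* row within the micro panel  */
                    *A_star++ = A_hat[(i * mr + j) * lda + p];
    }

Written this way, the m_r entries of one panel column sit next to each other in A*, so the GEPB micro kernel streams through A* strictly sequentially; packb is symmetric, producing (n_c/n_r) × p_c × n_r row-major micro panels of B̂.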
    R_in  = i_c i_h i_w
    R_out = o_c o_h o_w
    R_ker = o_c i_c k_h k_w.    (7)
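As a sanity check of these counts, the short C program below evaluates (7) together with the i_c k_h k_w o_h o_w size of the transformed im2col matrix; the layer configuration (a 3 × 3, stride-1, padding-1 convolution with 64 input and 64 output channels on a 56 × 56 map) is our own example, chosen only to show how the transformed matrix dwarfs the feature maps.

    #include <stdio.h>

    int main(void)
    {
        /* Example layer (ours): ic = oc = 64, 56x56 input, 3x3 kernel, stride 1, pad 1. */
        long ic = 64, ih = 56, iw = 56;
        long oc = 64, kh = 3,  kw = 3;
        long oh = 56, ow = 56;

        long r_in   = ic * ih * iw;               /* R_in  in (7)             */
        long r_out  = oc * oh * ow;               /* R_out in (7)             */
        long r_ker  = oc * ic * kh * kw;          /* R_ker in (7)             */
        long im2col = ic * kh * kw * oh * ow;     /* transformed input matrix */

        printf("R_in=%ld R_out=%ld R_ker=%ld im2col=%ld (%.1fx R_in)\n",
               r_in, r_out, r_ker, im2col, (double)im2col / r_in);
        return 0;
    }

For this example the im2col buffer is k_h k_w = 9 times the size of the input feature map, which is why the full transformation is so costly on memory-constrained devices.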
Fig. 6. Comparison of single-thread computation performance of different algorithms on convolution layers. The labels under each group of bars denote the configuration of the convolutions. For example, the label (64, 56, 3)(128, 7, 1) denotes a 7 × 7 convolution with 64 input channels and 128 output channels; the spatial size of the input tensor is 56 × 56, and the stride and padding of the convolution are 1 and 3, respectively. (a) Speedup with respect to im2col + OpenBLAS of different algorithms on convolution layers on a single A72 core. (b) Speedup with respect to im2col + OpenBLAS of different algorithms on convolution layers on a single A17 core.
Fig. 7. Comparison of multithread computation performance of different algorithms on convolution layers. The labels under each group of bars denote the configuration of the convolutions. For example, the label (64, 56, 3)(128, 7, 1) denotes a 7 × 7 convolution with 64 input channels and 128 output channels; the spatial size of the input tensor is 56 × 56, and the stride and padding of the convolution are 1 and 3, respectively. (a) Speedup with respect to im2col + OpenBLAS of different algorithms on convolution layers on 2× A72 cores. (b) Speedup with respect to im2col + OpenBLAS of different algorithms on convolution layers on 2× A17 cores. (c) Speedup with respect to im2col + OpenBLAS of different algorithms on convolution layers on 3× A17 cores. (d) Speedup with respect to im2col + OpenBLAS of different algorithms on convolution layers on 4× A17 cores.
Fig. 8. Comparison of different convolution algorithms in terms of run time and memory overhead. (a) Computation performance on resnet18 models with
different widths. All the performances are measured on a single ARM Cortex A72 Core. Lower is better. (b) Memory overhead of resnet18 models with
different widths. Lower is better. (c) Tradeoff between run time and memory overhead on resnet18 models with different widths. Lower is better.
TABLE III
DETAILED COMPUTATION PERFORMANCE OF DIFFERENT CONVOLUTION ALGORITHMS ON CONVOLUTION LAYERS OF ALEXNET [1]. THE VALUES IN THE TABLES ARE IN MILLISECONDS. (a) PERFORMANCE ON A SINGLE A17 CORE. (b) PERFORMANCE ON 4× A17 CORES. (c) PERFORMANCE ON A SINGLE A72 CORE. (d) PERFORMANCE ON 2× A72 CORES
cores, ECBC is always more than 2.5× faster. The speedup even reaches 4× on GoogleNet and 5× on resnet50. The same trends also hold on ARM Cortex-A72 cores.

Table II(a) and (b) support that the advantages of ECBC are more obvious on relatively more bandwidth-constrained platforms. Specifically, ECBC is overall 1.2× faster than the im2col convolution on a single ARM Cortex-A72, while the speedup is more than 2× on a single ARM Cortex-A17. Table II(b) and (d) show the same trend, where ECBC is 1.2–1.5× faster than the im2col convolution algorithm on 2× A72 cores, while the improvement is more than 2.5× and up to 5× on 4× A17 cores.

In Table III, we show the detailed run time of different convolution algorithms on each convolution layer of AlexNet [1]. The values in the tables are in milliseconds. We can see that ECBC performs the best in all cases except for conv4 and conv5 on 2× A72 cores. A noteworthy observation is that the performance of ECBC is stable and efficient; that is, ECBC never performs poorly, while other methods may perform well in some cases and suffer poor performance in others. For example, the performance of ZMC [13] is comparable with that of ECBC on conv4 and conv5 on A72 CPUs, and it is even faster than ECBC by 0.21 and 0.82 ms, respectively, on these two layers on 2× A72 cores, while on other layers or platforms, it is inferior to ECBC. On conv1, ECBC is even more than 1.5× faster than ZMC on all the platforms. The same trends also hold for the other algorithms. For example, the performance of MEC [12] is similar to that of ECBC on conv1 on A72 CPUs, and ECBC outperforms MEC by a large margin on other layers or platforms.

D. Memory Overhead

In Section IV-F, we analyzed the memory overhead of different convolution algorithms theoretically. This section demonstrates the memory efficiency of ECBC through experiments. In Fig. 8, we show the tradeoff between run time and memory overhead of different algorithms on resnet18 models with different widths. In particular, Fig. 8(a) shows the run time in milliseconds of different convolution algorithms on resnet18 models with different widths. The latency of all the models is measured on a single A72 core, and we can see that ECBC consistently achieves the best performance in all cases. Fig. 8(b) shows the memory overhead of the different convolution algorithms. In this figure, all the memory overhead, including the memory required by the convolution inputs/outputs, the parameters of the networks, and the extra memory required during computation, is considered. Note that the memory overhead of the convolution implementations of nnpack and tflite is not shown in this figure. We mainly compare the memory overhead against ZMC [13], which needs the least memory among all the algorithms because it does not introduce any additional memory overhead at all. We can see that the memory overhead of ECBC is clearly lower than that of im2col and MEC and almost the same as that of ZMC in all cases. To further show that ECBC achieves a better tradeoff between run time and memory overhead, in Fig. 8(c), we plot the sum of run time (in milliseconds) and memory overhead (in mega parameters) of the different algorithms (lower is better). We can see that, compared to the other convolution algorithms, ECBC consistently performs better in terms of the tradeoff between latency and memory overhead.

Fig. 9 shows the memory overhead of different convolution algorithms on different CNN models. Similar to Fig. 8, in this figure, all the memory overhead, including the memory needed for storage of the inputs/outputs/parameters of the models and the additional memory needed by the various algorithms for the computation of convolutions, is considered. Lower is better. From Fig. 9, we can see that ECBC is able to save at least 30% and up to 75% of the memory overhead compared to the im2col convolution algorithm. Also, the memory needed by ECBC is almost the same as that of ZMC [13]. In fact, compared
to ZMC, a small piece of additional memory is needed by ECBC for the storage of the shared tensor I_s. However, as mentioned at the end of Section IV-F, the size of I_s is fixed and independent of the convolution configurations, and it can be shared across all the convolutions in a model, thereby largely reducing the total memory overhead.

VI. CONCLUSION

In this article, we propose a simple and efficient convolution algorithm for mobile devices. Instead of transforming the whole input tensor before computing, we reduce the memory overhead through blocking and memory sharing. We show that, by carefully designing the loop order and selecting the block size, the continuity and reusability of memory access can be improved, and the memory overhead can also be largely reduced through memory sharing. Evaluated on multiple networks and platforms, our method achieves high computation and memory efficiency compared to previous algorithms. The results support that ECBC gains more benefits on low-end devices because they are more limited by memory resources and thus more sensitive to memory optimization. What is more, ECBC also shows great scalability in that the performance improvement is more obvious in a multithread context. Moreover, the performance of ECBC is stable and efficient: it never performs poorly across a wide range of platforms, convolutions, and CNN models.

The main idea of ECBC is to conduct the convolution computation blockwise so that the tensor transformation operation can also be done in a blockwise manner, and the memory overhead can thus be largely reduced with memory sharing. A possible future direction of this work is to optimize the memory cost of domain-transformation convolution methods (i.e., FFT-based [9] or Winograd-based [10] convolution algorithms) with a similar idea, which would further improve the practical utility of these computation-efficient yet memory-hungry convolution algorithms in mobile applications.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, vol. 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Red Hook, NY, USA: Curran Associates, 2012, pp. 1097–1105.
[2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. 3rd Int. Conf. Learn. Represent. (ICLR), 2015, pp. 1–14.
[3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[4] W. Liu et al., “SSD: Single shot multibox detector,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 21–37.
[5] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” 2018, arXiv:1804.02767. [Online]. Available: http://arxiv.org/abs/1804.02767
[6] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Proc. 28th Int. Conf. Neural Inf. Process. Syst. (NIPS), vol. 1. Cambridge, MA, USA: MIT Press, 2015, pp. 91–99.
[7] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3431–3440.
[8] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2881–2890.
[9] T. Highlander and A. Rodriguez, “Very efficient training of convolutional neural networks using fast Fourier transform and overlap-and-add,” 2016, arXiv:1601.06815. [Online]. Available: http://arxiv.org/abs/1601.06815
[10] A. Lavin and S. Gray, “Fast algorithms for convolutional neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4013–4021.
[11] K. Chellapilla, S. Puri, and P. Simard, “High performance convolutional neural networks for document processing,” in Proc. 10th Int. Workshop Frontiers Handwriting Recognit., 2006, pp. 1–7.
[12] M. Cho and D. Brand, “MEC: Memory-efficient convolution for deep neural network,” in Proc. 34th Int. Conf. Mach. Learn. (ICML), 2017, pp. 815–824.
[13] J. Zhang, F. Franchetti, and T. M. Low, “High performance zero-memory overhead direct convolutions,” in Proc. 35th Int. Conf. Mach. Learn. (ICML), 2018, pp. 5776–5785.
[14] E. Georganas et al., “Anatomy of high-performance deep learning convolutions on SIMD architectures,” in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., Nov. 2018, pp. 830–841.
[15] E. Georganas et al., “Harnessing deep learning via a single building block,” in Proc. IEEE Int. Parallel Distrib. Process. Symp. (IPDPS), May 2020, pp. 222–233.
[16] Y. Liu, Y. Wang, R. Yu, M. Li, V. Sharma, and Y. Wang, “Optimizing CNN model inference on CPUs,” in Proc. USENIX Annu. Tech. Conf., 2019, pp. 1025–1040.
[17] M. Dukhan, “The indirect convolution algorithm,” CoRR, vol. arXiv:1907.02129, pp. 1–10, Jul. 2019.
[18] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2016, pp. 1–14.
[19] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 2074–2082.
[20] S. Guo, Y. Wang, Q. Li, and J. Yan, “DMCP: Differentiable Markov channel pruning for neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 1539–1547.
[21] J. Yu, L. Yang, N. Xu, J. Yang, and T. S. Huang, “Slimmable neural networks,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2019, pp. 1–12.
[22] J. Yu and T. Huang, “Universally slimmable networks and improved training techniques,” in Proc. Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 1803–1811.
[23] Z. Yang et al., “Searching for low-bit weights in quantized neural networks,” in Proc. Int. Conf. Neural Inf. Process. Syst. (NeurIPS), 2020, pp. 1–12.
[24] Z. Liu, Z. Shen, M. Savvides, and K.-T. Cheng, “ReactNet: Towards precise binary neural network with generalized activation functions,” in Proc. 16th Eur. Conf. Comput. Vis. (ECCV), 2020, pp. 143–159.
[25] Z. Liu, B. Wu, W. Luo, X. Yang, W. Liu, and K.-T. Cheng, “Bi-Real Net: Enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 722–737.
[26] P. Wang and J. Cheng, “Fixed-point factorized networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4012–4020.
[27] C. Leng, Z. Dou, H. Li, S. Zhu, and R. Jin, “Extremely low bit neural networks: Squeeze the last bit out with ADMM,” in Proc. 32nd AAAI Conf. Artif. Intell. (AAAI), 2018, pp. 1–8.
[28] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, “Exploiting linear structure within convolutional networks for efficient evaluation,” in Proc. Conf. Neural Inf. Process. Syst. (NeurIPS), 2014, pp. 1269–1277.
[29] M. Jaderberg, A. Vedaldi, and A. Zisserman, “Speeding up convolutional neural networks with low rank expansions,” in Proc. Brit. Mach. Vis. Conf., 2014, pp. 1–13.
[30] V. Lebedev, Y. Ganin, M. Rakhuba, I. V. Oseledets, and V. S. Lempitsky, “Speeding up convolutional neural networks using fine-tuned CP-decomposition,” in Proc. 3rd Int. Conf. Learn. Represent. (ICLR), 2015, pp. 1–11.
[31] S. Winograd, Arithmetic Complexity of Computations, vol. 43, no. 2. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 1980.
[32] Z. Jia, A. Zlateski, F. Durand, and K. Li, “Optimizing N-dimensional, Winograd-based convolution for manycore CPUs,” in Proc. 23rd ACM SIGPLAN Symp. Princ. Pract. Parallel Program., Feb. 2018, pp. 109–123.
[33] A. Xygkis, D. Soudris, L. Papadopoulos, S. Yous, and D. Moloney, “Efficient Winograd-based convolution kernel implementation on edge devices,” in Proc. 55th Design Autom. Conf. (DAC), Jun. 2018, pp. 1–6.
[34] Z. Jia, A. Zlateski, F. Durand, and K. Li, “Towards optimal Winograd convolution on manycores,” in Proc. SysML, 2018, pp. 1–3.
[35] Y. Jia et al., “Caffe: Convolutional architecture for fast feature embedding,” in Proc. ACM Int. Conf. Multimedia, Nov. 2014, pp. 675–678.
[36] J. J. Dongarra, J. D. Croz, and S. Hammarling, “A set of level 3 basic linear algebra subprograms,” ACM Trans. Math. Softw., vol. 16, no. 1, pp. 1–17, 1990.
[37] Z. Xianyi, W. Qian, and Z. Yunquan, “Model-driven level 3 BLAS performance optimization on Loongson 3A processor,” in Proc. IEEE 18th Int. Conf. Parallel Distrib. Syst. (ICPADS), Dec. 2012, pp. 684–691.
[38] Q. Wang, X. Zhang, Y. Zhang, and Q. Yi, “AUGEM: Automatically generate high performance dense linear algebra kernels on x86 CPUs,” in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal. (SC), Nov. 2013, pp. 17–21.
[39] Intel. (2020). MKL: Intel Math Kernel Library. [Online]. Available: https://software.intel.com/content/www/us/en/develop/tools/math-kernel-library.html
[40] G. Guennebaud et al. (2010). Eigen V3. [Online]. Available: http://eigen.tuxfamily.org
[41] A. Gural and B. Murmann, “Memory-optimal direct convolutions for maximizing classification accuracy in embedded applications,” in Proc. 36th Int. Conf. Mach. Learn. (ICML), 2019, pp. 2515–2524.
[42] T. Chen et al., “TVM: An automated end-to-end optimizing compiler for deep learning,” in Proc. 12th USENIX Symp. Operating Syst. Design Implement. (OSDI), 2018, pp. 578–594.
[43] S. Zheng, Y. Liang, S. Wang, R. Chen, and K. Sheng, “FlexTensor: An automatic schedule exploration and optimization framework for tensor computation on heterogeneous system,” in Proc. Architectural Support Program. Lang. Operating Syst. (ASPLOS), 2020, pp. 859–873.
[44] L. Zheng et al., “Ansor: Generating high-performance tensor programs for deep learning,” in Proc. 14th USENIX Symp. Operating Syst. Design Implement. (OSDI), 2020, pp. 863–879.
[45] T. Chen et al., “MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems,” in Proc. Neural Inf. Process. Syst. Workshop Mach. Learn. Syst., 2015, pp. 1–6.
[46] M. Abadi et al., “TensorFlow: A system for large-scale machine learning,” in Proc. 12th USENIX Symp. Operating Syst. Design Implement. (OSDI), 2016, pp. 265–283.
[47] K. Goto and R. A. van de Geijn, “Anatomy of high-performance matrix multiplication,” ACM Trans. Math. Softw., vol. 34, no. 3, pp. 12:1–12:25, May 2008.
[48] T. M. Smith, R. van de Geijn, M. Smelyanskiy, J. R. Hammond, and F. G. V. Zee, “Anatomy of high-performance many-threaded matrix multiplication,” in Proc. IEEE 28th Int. Parallel Distrib. Process. Symp., May 2014, pp. 1049–1059.
[49] F. G. Van Zee and R. A. van de Geijn, “BLIS: A framework for rapidly instantiating BLAS functionality,” ACM Trans. Math. Softw., vol. 41, no. 3, pp. 1–33, Jun. 2015.
[50] M. Dukhan. (2020). NNPACK. [Online]. Available: https://github.com/Maratyszcza/NNPACK
[51] (2020). TFLite. [Online]. Available: https://www.tensorflow.org/lite
[52] C. Szegedy et al., “Going deeper with convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1–9.
[53] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2818–2826.

Tianli Zhao received the B.E. degree in engineering physics from Tsinghua University, Beijing, China, in 2016. He is currently pursuing the Ph.D. degree with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing.
His research focuses on compression and acceleration of deep learning and high-performance computing of deep learning algorithms on mobile platforms.

Qinghao Hu received the B.E. degree in computer science from Northwestern Polytechnical University, Xi’an, China, in 2014, and the Ph.D. degree from the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2019.
He currently works as a Research Assistant at the Institute of Automation, Chinese Academy of Sciences. His current research interests include deep neural network compression and acceleration, hashing, and quantization.

Xiangyu He received the B.E. degree in information security from the Beijing University of Posts and Telecommunications, Beijing, China, in 2017. He is currently pursuing the Ph.D. degree with the Institute of Automation, Chinese Academy of Sciences, Beijing.
His current research interests include deep learning and image retrieval.

Weixiang Xu received the B.E. degree in automation from Northeastern University, Shenyang, China, in 2018. He is currently pursuing the Ph.D. degree with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China.
His current research interests include compression of deep learning and high-performance computing.

Jiaxing Wang received the B.E. degree from the School of Control and Computer Engineering, North China Electric Power University, Beijing, China, in 2015. He is currently pursuing the Ph.D. degree with the Institute of Automation, Chinese Academy of Sciences, Beijing.
His current research interests include efficient deep learning and Bayesian methods.

Cong Leng received the B.E. degree in automation from Central South University, Changsha, China, in 2011, and the Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2016.
His current research interests include machine learning, deep learning, and their applications in computer vision and data mining.

Jian Cheng (Member, IEEE) received the B.S. and M.S. degrees in mathematics from Wuhan University, Wuhan, China, in 1998 and 2001, respectively, and the Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2004.
He is currently a Professor with the Institute of Automation, Chinese Academy of Sciences. His current major research interests include deep learning, computer vision, and chip design.