You are on page 1of 13

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO.

1, JANUARY 2023 433

ECBC: Efficient Convolution


via Blocked Columnizing
Tianli Zhao , Qinghao Hu, Xiangyu He, Weixiang Xu , Jiaxing Wang,
Cong Leng, and Jian Cheng , Member, IEEE

Abstract— Direct convolution methods are now drawing Early endeavors accelerate convolution in an indirect man-
increasing attention as they eliminate the additional stor- ner. These approaches either transform the input and kernel
age demand required by indirect convolution algorithms tensors into another domain [9], [10] or transform the input
(i.e., the transformed matrix generated by the im2col convolution
algorithm). Nevertheless, the direct methods require special tensor into a special format [11], [12] for efficient convolution
input–output tensor formatting, leading to extra time and mem- computation.
ory consumption to get the desired data layout. In this article, Though widely applied, the indirect algorithms introduce
we show that indirect convolution, if implemented properly, substantial additional memory footprint as the transforma-
is able to achieve high computation performance with the help of tion process requires reshaping and selectively duplicating
highly optimized subroutines in matrix multiplication while avoid
incurring substantial memory overhead. The proposed algorithm parts of the input/kernel/output tensors. The problem becomes
is called efficient convolution via blocked columnizing (ECBC). more critical on embedded devices, where storage resources
Inspired by the im2col convolution algorithm and the block are largely constrained [13], [14]. Previous works elimi-
algorithm of general matrix-to-matrix multiplication, we propose nate these memory demands through directly performed con-
to conduct the convolution computation blockwisely. As a result, volution [13]–[16]. By carefully designing the loop tiling,
the tensor-to-matrix transformation process (e.g., the im2col
operation) can also be done in a blockwise manner so that it ordering, and data layout, direct convolution1 is able to
only requires a small block of memory as small as the data achieve comparable performance or even outperform indirect
block. Extensive experiments on various platforms and networks methods.
validate the effectiveness of ECBC, as well as the superiority of While achieving overall good performance, existing direct
our proposed method against a set of widely used industrial-level convolution methods also suffer from two critical problems.
convolution algorithms.
First, in order to optimize the register blocking and memory
Index Terms— Convolutional neural networks (CNNs), direct access, these methods require the input and output to be
convolution, high performance computing for mobile devices, stored in an odd format. Specifically, the input and out-
im2col convolution, memory-efficient convolution (MEC).
put tensor with shape N × C × H × W are split as
I. I NTRODUCTION N × (C/x) × H × W × x for convolution computation. The
practical use of these methods requires either a large number of
C ONVOLUTIONAL neural networks (CNNs) have
achieved great success in various areas, including image
recognition [1]–[3], object detection [4]–[6], and semantic
operations implemented for this data layout or additional time
and memory overhead for data format transformation [17].
segmentation [7], [8]. Recently, as the emerging of high-end What is worse, the x factor can be different across different
mobile devices, more and more deep learning applications layers [15], [16] so that the reorganization must be carried out
are migrating from desktops to these edge devices. How- for each convolution layer. Second, because of the complicated
ever, the high memory and computation demands of convo- computation loops of convolution, the memory access during
lution obstacle the application of CNNs on these platforms, computation is not totally continuous, which is not friendly to
where memory and power resources are largely constrained. the popular hierarchical memory architecture in modern CPUs,
A high-performance routine that accelerates the convolution leading to further performance degradation.
computation on mobile devices is of demanding. In this article, we show that indirect convolution, if imple-
mented properly, is able to retain high computation perfor-
Manuscript received 8 December 2020; revised 30 March 2021; accepted mance with the help of highly optimized subroutines in matrix
25 June 2021. Date of publication 19 July 2021; date of current version
5 January 2023. This work was supported in part by the National Natural Sci- multiplication while avoiding substantial memory overhead
ence Foundation of China under Grant 61972396, in part by the National Key during the whole computation. The proposed algorithm is
Research and Development Program of China under Grant 2020AAA0103402, called efficient convolution via blocked columnizing (ECBC).
and in part by the Strategic Priority Research Program of the Chinese
Academy of Sciences under Grant XDA27040300. (Corresponding author: Inspired by the block algorithm of general matrix-to-matrix
Jian Cheng.) multiplication (GEMM), instead of performing the whole
Tianli Zhao, Xiangyu He, and Jian Cheng are with the Institute of convolution, we conduct the computation blockwisely. As a
Automation, Chinese Academy of Sciences, Beijing 100080, China (e-mail:
jcheng@nlpr.ia.ac.cn). result, the im2col transformation can also be done in a
Qinghao Hu, Weixiang Xu, Jiaxing Wang, and Cong Leng are with the blockwise manner. The block algorithm, together with memory
National Laboratory of Pattern Recognition, Beijing 100190, China.
Color versions of one or more figures in this article are available at
https://doi.org/10.1109/TNNLS.2021.3095276. 1 Direct convolution is defined as the convolution algorithm that is imple-
Digital Object Identifier 10.1109/TNNLS.2021.3095276 mented directly with nested loops.
2162-237X © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY DELHI. Downloaded on February 01,2024 at 19:30:01 UTC from IEEE Xplore. Restrictions apply.
434 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO. 1, JANUARY 2023

3) Extensive experiments on various platforms and net-


works validate the effectiveness of ECBC as well as
the superiority of our proposed method against a set of
widely used industrial-level convolution algorithms. For
example, on 4× ARM Cortex-A17 cores, ECBC is at
least 2.5× and up to 5× faster compared to the im2col
convolution algorithm on a wide range of CNN models,
and the memory overhead is saved by at least 30% and
up to 75%.

II. R ELATED W ORKS


As the growing popularity of CNNs, efficient computing of
convolution has long been a major concern of researchers.
Some works reduce the computation and memory com-
Fig. 1. Performance of the proposed ECBC and the direct convolution algo- plexity of convolutions with model compression, such as
rithm ZMC [13] normalized to im2col convolution algorithm on convolution pruning [18]–[22], quantization [23]–[27], and low-rank
layers of AlexNet [1]. ZMC [13] is able to get higher performance compared to decomposition [28]–[30]. While achieving high compression
the im2col convolution algorithm because it eliminates the memory overhead
required by the im2col procedure, while ECBC further outperforms ZMC due or acceleration ratio, these methods accelerate convolution
to its better memory access pattern. computation at the expense of more training cost and accu-
racy loss. Some other works focus on accelerating the com-
putation of full-precision convolutions so that there is no
sharing, can greatly reduce the memory overhead in indirect demand for further fine-tuning of models and at the same
methods. We demonstrate that, with proper loop ordering, time retaining the accuracy of models. Early efforts accel-
block size selection, and delicately designed kernel tensor erate convolution in an indirect manner. These methods can
layout, the continuity and reusability of memory access can be roughly divided into two categories: domain-transformed-
be improved. The optimized memory access is friendly to the based methods and spatial-domain-based methods. Domain-
memory hierarchy architecture of CPUs, which also improves transformed-based methods, such as fast Fourier transform
the computation efficiency. Compared to previous indirect (FFT)-based convolution [9] and Winograd-based convolu-
methods, ECBC significantly reduces the memory overhead tion [10], [31]–[34] transform the input/kernel tensors into
and enjoys better memory access. Compared to direct methods, another domain, in which the convolution can be calculated
ECBC does not assume any special data layout of input–output with less floating-point operations, thus improving the com-
tensors, so there is no need to transform the data layout of putation efficiency. Spatial-domain-based methods, such as
input–output tensors before convolution. im2col-based convolution [11], [35], convert convolution into
To further illustrate the advantages of ECBC, in Fig. 1, matrix multiplication found in basic linear algebra subpro-
we show the performances of the state-of-the-art direct con- grams [36] so that highly optimized libraries [37]–[40] can be
volution algorithm zero memory overhead direct convolution utilized. To achieve this goal, these methods need to first dupli-
(ZMC) [13] and our proposed ECBC on convolution layers cate subpacks of the input tensor into a large flat matrix, incur-
of AlexNet [1]. The benchmarks are conducted on 4× ARM ring additional memory overhead for the transformed input.
Cortex-A17 cores and normalized by im2col + OpenBLAS. The main problem with these methods is that the addi-
ZMC is able to run faster than the im2col convolution tional transformation operation incurs nontrivial memory or
algorithm by approximately 2× because it eliminates the computation overhead to the system. The problem becomes
additional packing overhead, while the proposed ECBC further more critical on embedded devices, where memory, compu-
outperforms ZMC by at least 10% (conv5) and up to 60% tation, and power resources are largely limited [13], [14].
(conv1). The performance improvement is mainly gained by Thus, many recent works have focused on memory-efficient
better continuity of memory access of ECBC. What is more, convolution (MEC) algorithms, which means to maximize
we achieve this performance at the cost of only a small area the computation performance under some memory constraints.
of shared additional memory overhead. Cho and Brand [12] took the first step toward this direction.
In summary, the main contributions of this article are They propose to reduce the memory overhead during computa-
threefold. tion with multiple small matrix multiplication operations, each
1) We present ECBC and show that indirect convolution with a smaller patch of input data. This method still needs
methods can achieve high computation performance additional memory overhead for the transformed input data.
with highly optimized subroutines in matrix multiplica- Thereby, directly implemented convolution, which is able to
tion while avoid incurring substantial memory overhead. eliminate the extra storage demands during computation, has
2) We propose proper loop ordering, block size selection, approached to the mind of scholars [13]–[16]. These methods
and kernel layout for convolution computation using optimize the computation by carefully reorganizing the nested
our ECBC. The continuity and reusability of memory loops of convolution. Their performances are largely relied
access can be improved without additional memory on architecture-specific data layouts of input–output/kernel
consumption. tensors. This implies that the practical use of these algorithms

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY DELHI. Downloaded on February 01,2024 at 19:30:01 UTC from IEEE Xplore. Restrictions apply.
ZHAO et al.: ECBC 435

TABLE I B. Im2col-Based Convolution


C ONVOLUTION R ELATED N OTATIONS U SED IN T HIS A RTICLE
Im2col-based convolution is perhaps the most well-known
convolution method, which is widely used in deep learning
frameworks such as Caffe [35], Mxnet [45], and Tensor-
flow [46]. Fig. 2(a) shows the main idea of im2col-based
convolution algorithm. It first transforms the input tensor I
into a flat matrix I ∗ . As the kernel K sliding through both
dimensions of I with strides sh and sw , the corresponding
patch of I is vectorized and duplicated into a column of I ∗ .
needs either large number of operations implemented under For example, the subpatch with blue dashed line borders is
these data layouts or additional memory and operation over- vectorized and duplicated to the first column of I ∗ , and the
heads for data layout transformation. Dukhan et al. [17] pro- subpatch with green dashed line borders is vectorized and
posed an indirect convolution algorithm specifically optimized duplicated to the second column of I ∗ . The kernel tensor K
for NHWC data layout. It performs the convolution computa- is directly reinterpreted as a oc × i c kh kw matrix K ∗ . Then,
tion with the help of indirect GEMM and a relatively small the convolution results can be calculated with GEMM between
im2col buffer, which contains only the pointers to the rows of K ∗ and I ∗ , i.e., O = K ∗ × I ∗ . The output O can be
input data. More recently, Gural and Murmann [41] proposed automatically stored in oc × oh × ow order after GEMM, so no
memory optimal direct convolution (MOC) for deployment of reordering is needed.
CNNs on microcontrollers with only 2-kB memory. The main For ease of explanation, we define an operator im2col. Tak-
idea is to reuse the memory of input data during computation ing the index [h, w] of I ∗ as input, the im2col operator
of output. This method is proved to be memory optimal but not generates the corresponding index [i c, i h, i w] of I so that
optimized for computation efficiency. Some other solutions,
such as TVM [42], FlexTensor [43], and Ansor [44], propose [i c, i h, i w] = i m2col([h, w]) ⇒ I ∗ [h, w] = I [i c, i h, i w].
to generate high-performance tensor computation routines with Apparently, the index [i c, i h, i w] can be easily computed
the approach of just-in-time code generation and automatic by the following equations:
code tuning. These approaches need to take a long time to  
h
search good routines in a large search space for a specific ic = kd = mod (h, kh kw )
k k
convolution/platform.  h w
kd
kh = kw = mod (kd, kw )
III. P RELIMINARIES kw
 
Our work is closely related to the im2col convolution w
oh = ow = mod (w, ow )
algorithm and the block algorithm of GEMM. In this section, ow
we first describe the notations in this article and then review i h = − ph + kh × dh + oh × sh
these two algorithms briefly. i w = − pw + kw × dw + ow × sw . (1)
A. Notations
C. Block Algorithm of GEMM
Table I shows the main notations related to a convolution
1) Overview: GEMM is the most important operation in
operation. We use capital characters to denote tensors and
linear algebra. There have been many libraries providing
matrices and lowercase characters to denote integers. We use
high-performance implementations of GEMM based on CPUs,
a set of integers included in a pair of square brackets ([i, j ])
such as OpenBLAS [38], Eigen [40], and MKL [39]. The
to denote index and a capital character followed by an index
efficiency of these implementations mainly depends on the
( A[i, j ]) to denote a specific element of a matrix or tensor.
block algorithm of GEMM [47]–[49]. Fig. 2(b) shows the
If the index [i, j ] is out of range of dimensions of A, then
overall pipeline of this algorithm.
we define A[i, j ] = 0. We define add operation between
For two matrices A ∈ Rm× p and B ∈ R p×n , GEMM
two indices as [i, j ] + [ p, q] = [i + p, j + q]. A sequence
computes the matrix multiplication between A and B: C =
{a, a + 1, . . . , b − 1} is denoted as a : b. We use sequences
A × B. In the first stage, both A and B are blocked along the
to denote a submatrix, for example, A[i : i + p, j : j + q]
p dimension with factor pc , resulting in  p/ pc  subproblems
denotes a p × q submatrix of the matrix A with the top-left
of general panel-to-panel multiplication [GEPP; refer to the
corner A[i, j ]. A single : is used to denote all the elements of
middle of Fig. 2(b)]. During this stage, the corresponding
a dimension. For example, A[:, j : j + q] denotes a submatrix
m × pc submatrix of A, say Â, is duplicated in a continuous
of A, starting at A[0, j ] with the same rows as A and columns
area of memory, namely a tensor A∗ of special format [refer
of q.
c ,n c to the arrow labeled by packa in Fig. 2(b)], which will
For a matrix A, we use Am i, j to denote the subblock at
be described in detail in Section III-C2. After that, each
the i th row and j th column when A is blocked along the two
GEPP problem is further blocked along the n dimension with
dimensions with block size m c and n c , respectively. Namely,
c ,n c the factor n c for cache blocking, creating multiple problems
Ami, j = A[i ∗ m c : i ∗ m c + m c , j ∗ n c : j ∗ n c + n c ] and
nc of general panel-to-block multiplication [GEPB; refer to the
A:,i is used to denote the submatrix A[:, i ∗ n c : i ∗ n c + n c ]. middle of Fig. 2(b)]. In this stage, the corresponding pc × n c

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY DELHI. Downloaded on February 01,2024 at 19:30:01 UTC from IEEE Xplore. Restrictions apply.
436 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO. 1, JANUARY 2023

Fig. 2. (a) Computation pipeline of im2col-based convolution algorithm. When the kernel window slides through the input tensor I , each corresponding
subpatch of I is vectorized and duplicated into one column of a flat buffer I ∗ . In this way, the results of convolution can be computed through matrix
multiplication between the kernel matrix K ∗ and the flat buffer matrix I ∗ . (b) Computation pipeline of the block algorithm of GEMM. The computation is
first blocked along the p dimension with factor pc and, then, the n dimension with factor n c for cache blocking. The subsequent computations are further
decomposed along the m dimension with factor m r and, finally, the n c dimension with factor nr for register blocking.

submatrix, say B̂ of B, is duplicated into a contiguous area of of A, denoted as Â, is packed into A∗ . In this process,
memory B ∗ [refer to the arrow labeled by packb in Fig. 2(b)]. Â is blocked along the m dimension with the factor m r ,
The blocking factors n c and pc are carefully selected such that generating multiple m r × pc micro panels. These m/m r 
the memory block B ∗ resides in the L2 cache and is reused micro panels are stored in A∗ one by one in column-major
during the computation of GEPB. order.2 Similarly, one pc × n c submatrix of B, denoted as B̂,
Most mobile devices are based on reduced instruction set is organized into multiple pc × nr micro panels and stored
computer (RISC) architecture, which means that data must be in B ∗ one by one in row-major order. In this way, during the
loaded into registers before operations are done, and single multiplication of submatrices of A and B, memory access of
instruction multiple data (SIMD) are also a widely utilized A∗ and B ∗ is actually continuous [47].
technique. For this reason, register blocking is further con- For ease of presentation, we formally define two operators
sidered. More specifically, each problem of GEPB is further packa and packb so that
blocked along the m dimension with the factor m r and finally
[h, w] = packa([m, pc, mr ]) ⇒ A∗ [m, pc, mr ] = Â[h, w]
the n c dimension with the factor nr . The blocking factors m r
and nr are selected according to the number and bit width [h, w] = packb([nc, pc, nr ]) ⇒ B ∗ [nc, pc, nr ] = B̂[h, w].
of (vector) registers. In this way, the whole computation is The operators packa and packb can be defined by the
finally decomposed into multiple inner product computations following equations:
between one m r × pc micro panel of A and one pc × nr
micro panel of B, which is highly optimized assembly. Due to packa([m, pc, mr ]) = [m × m r + mr, pc]
register blocking, during each computation of inner product, packb([nc, pc, nr ]) = [ pc, nc × nr + nr ]. (2)
each of the data is loaded into registers from memory just
once and reused for computation for some many times (nr
times for the micro panel of A and m r times for the micro IV. M ETHODOLOGY
panel of B). What is more, multiple output elements can The main drawback of im2col convolution algorithm is that
be calculated simultaneously with SIMD instructions, which it needs a large area of memory with size i c kh kw oh ow for
further improves the computation performance. the storage of the transformed input tensor, which is several
2) Packing: It is generally acknowledged that, because of times larger than the size of feature map. In this section,
the widely used memory hierarchy architecture in modern we introduce the main idea of ECBC. Our method only needs
CPUs, accessing memory in a continuous way during the a small area of shared memory to store the packed inputs, and
computation is much more efficient than noncontinuous it can be generalized to arbitrary convolutional architectures.
memory access. For the purpose of continuous memory
access during computation of GEMM, the packing operation A. Computation Pipeline
is actually needed. To this end, two areas of shared memory Fig. 3 briefly describes the computation pipeline of ECBC.
are needed, i.e., tensor A∗ with size of (m/m r ) × pc × m r The proposed algorithm is a principled improvement over
and tensor B ∗ with size (n c /nr ) × pc × n r . By packing the
corresponding elements of A and B into these two tensors 2 Row-major order and column-major order are two commonly used format

with a specific organization, the memory access during the for storing multidimensional tensors in linear memory. In row-major order,
adjacent elements in one row of the matrix are stored in a continuous area of
GEPB computation shown at the bottom of Fig. 2(b) is memory, whereas in column-major order, adjacent elements in one column of
completely continuous. For instance, one m × pc submatrix the matrix are stored in a continuous area of memory.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY DELHI. Downloaded on February 01,2024 at 19:30:01 UTC from IEEE Xplore. Restrictions apply.
ZHAO et al.: ECBC 437

Algorithm 1 ECBC Algorithm


Require: Input tensor I , packed input kernel K ∗∗ , shared
tensor Is , block factors pc , n c , m r , n r
Ensure: Output tensor O
1: Reinterpret O as a oc × oh ow matrix
2: Initialize O as all zeros
3: for p ∈ [0 : i c k h k w ] in steps of pc do
4: pc = M I N( pc , i c kh kw − p)
5: for n ∈ [0 : oh ow ] in steps of n c do
6: nc = M I N(n c , oh ow − n)
7: Reshape Is as a  nc n r  × pc × n r tensor
8: for i ∈ [0 :  nc nr
] in parallel do
9: for j ∈ [0 : pc] do
10: for k ∈ [0 : nr ] do
11: [i h∗ , i w∗ ] = packb([i, j, k])
12: [c, h, w] = i m2col([ p + i h∗ , n + i w∗ ])
13: Is [i, j, k] = I [c, h, w]
Fig. 3. Overview of ECBC. The red arrows illustrate the computation 14: end for
flow of im2col-based convolution with GEMM implemented with the block
algorithm described in Section III-C. It first packs the input tensor I into a
15: end for
large matrix I ∗ with im2col and then packs each subblock of I ∗ into the shared 16: end for
tensor Is . While the arrow labeled by packing denotes ECBC’s computation 17: for mr ∈ [0 : oc ] in steps of m r in parallel do
flow, for each subblock of I ∗ , we directly pack the corresponding elements
of the input tensor I into the shared tensor Is , thus deprecating the significant
18: for nr ∈ [0 : nc] in steps of nr do
memory overhead needed by I ∗ . 19: O[mr : mr + m r , n + nr : n + nr + nr ]+ =
K ∗∗ [ ppc , mr m r , :, :] × Is [ n r , :, :]
nr

20: end for


im2col-based convolution. Red arrows in Fig. 3 show the 21: end for
computation flow of im2col-based convolution. The input 22: end for
tensor I with dimension i c × i h × i w is first unfolded and 23: end for
duplicated into a flat matrix I ∗ with dimension i c kh kw × oh ow 24: Reinterpret O as a oc × oh × ow tensor
with im2col operation, and then, the output tensor O can
be obtained by GEMM between the kernel matrix K and
the transformed inputs I ∗ . Following the block algorithm of needed extra memory is the fixed size shared tensor Is , which
GEMM, the matrix I ∗ is blocked into multiple pc × n c is much smaller than I ∗ . Results in the final algorithm are
submatrices, and the computation of GEMM is decomposed shown in Algorithm 1. We will describe the algorithm in detail
into multiple subproblems as in the following.
 p ∗ p ,n
O:,n cj = K :,ic × Ii, j c c . (3)
i B. Data Layout of Kernels
p ∗ p ,n
In each iteration, before computing K :,ic × Ii, j c c , the cor- Recall that in the block algorithm of GEMM, the left matrix
∗ p ,n
responding submatrix Ii, j c c is packed into the shared (e.g., the kernel matrix in convolution) is supposed to be
(n c /nr ) × pc × nr tensor Is for continuous memory access. blocked and packed in a special format for continuous memory
An interesting observation is that in the summation illus- access during computing. The reformatting is very important
trated in (3), for a specific index i, j , the corresponding sub- to the computation performance but incurs additional memory
∗ p ,n ∗ p ,n
matrix Ii, j c c is multiplied just once. Specifically, after Ii, j c c and time cost. Note that once trained, elements of the kernel
pc
is multiplied by K :,i , it will never be used in the subsequent matrix are fixed. Thus, here in ECBC, we address the problem
computation. In other words, for different indices i and j , by blocking and packing the kernel in advance. With proper
the corresponding submatrices are decoupled. This indicates kernel layout, inference can be conducted without introducing
that there is no need to store these submatrices separately extra memory and time cost. To achieve this goal and fulfill
as what is done in the im2col-based convolution algorithm. the requirement of ECBC, a simple but novel storage format
Instead of packing the input tensor I into a flat matrix I ∗ and of the kernel tensor is proposed.
∗ p ,n
then packing each submatrix Ii, j c c to the shared tensor Is , Fig. 4 shows the proposed data layout of kernel data. First,
we directly pack the corresponding subpatch of I with respect the kernel tensor K of size oc × i c × kh × kw is reinterpreted
∗ p ,n
to the submatrix Ii, j c c into the shared tensor Is in each as a oc × i c kh kw matrix K ∗ , i.e.,
iteration (the arrow labeled with packi ng in Fig. 3). After
p K [i, j, k, l] = K ∗ [i, j kh kw + kkw + l]. (4)
Is is filled, we multiply Is by the corresponding submatrix K :,ic
of the kernel tensor and accumulate the results to the cor- K ∗ is then packed into a tensor K ∗∗ with size
responding submatrix O:,n cj of the output tensor. In this way, (i c kh kw )/ pc  × oc /m r  × pc × m r . The matrix K ∗ is
the transformed inputs I ∗ are entirely deprecated. The only first blocked along the column dimension with factor pc ,

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY DELHI. Downloaded on February 01,2024 at 19:30:01 UTC from IEEE Xplore. Restrictions apply.
438 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO. 1, JANUARY 2023

into SIMD registers and reused during the whole iteration,


so the blocking factors m r and nr are upper bounded by the
number and length of SIMD registers in CPUs. The factor pc
should be as large as possible to reuse these m r × nr register
values sufficiently. However, to avoid stalls caused by cache
miss, we must ensure that the m r × pc micro panel of the
kernel K ∗∗ along with the nr × pc micro panel of Is should
match the L1 cache size, so that in each iteration in line 17,
the corresponding m r × pc micro panel of K ∗∗ resides in
Fig. 4. Proposed data layout of the kernel data. The original kernel tensor K L1 cache and is reused during the loop in line 18. In this
with shape oc × ic × kh × kw is first flattened into a oc × ic kh kw matrix K ∗ .
The matrix is then blocked along the column dimension with factor n c and respect, n c should be as large as possible to make full use of
finally the row dimension with factor m r , generating multiple micro panels these L1 cache values. What is more, the loop in line 8 for
with size m r × n c . These micro panels are stored in memory one by one. packing procedure and line 17 for multiplication calculation
The fastest dimension is m r , followed by n c . Note that this process does
not incur any extra memory overhead because the kernel tensors are fixed in can be done in parallel, which further improves computation
CNN models after training so that they can be preorganized in advance. efficiency.

generating (i c kh kw )/ pc  submatrices with size oc × pc . These E. Parallelism


submatrices are stored in K ∗∗ one by one, with the same
format as that of A∗ described in Section III-C, i.e., In order to make full use of multicore architectures in
     modern CPUs, parallelism is also considered in our algorithm.
w h
K ∗ [h, w] = K ∗∗ , , w% pc , h%m r . (5) Note that it is not sufficient to parallelize the out-most two
pc mr loops. Since different iterations of these two loops need dif-
ferent subpatches of the input data (e.g., pc × n c submatrices
C. Packing of Input Data of I ∗ ). This means that one distinct shared tensor Is will be
As what has been described in Section III-C, the right required for each thread in order for completing the subsequent
matrix of GEMM (e.g., the input data in convolution) should computations, leading to multiple times of additional memory
also be blocked and packed for continuous memory access. overhead. Recall that in Section IV-D, it is expected that the
Different from the case of kernel data, the blocking and shared tensor Is should be stored into L2 cache, which is
packing operations of the input data cannot be done in shared across many cores in the context of mobile CPUs.
advance because they are dependent on the input image of However, these multiple shared tensors will introduce further
CNN models and thus are different for different computations. additional memory overhead and may even exceed the size of
For this reason, we apply the blocking and packing operations L2 cache, leading to performance degradation. What is worse,
on-the-fly. different iterations of the out-most loop will accumulate the
Fortunately, through proper organization of computation intermediate results into the same area of the output data.
loops, this procedure does not introduce much memory cost. This means that parallelization of this loop will cause write
In Algorithm 1, the key insight lies in that, each iteration of the conflict between different threads, thus incurring time penalty
first two loops (e.g., the loops of lines 3 and 5) is related to a on system performance.
∗ p ,n For the reasons described above, in our algorithm, the par-
pc × n c submatrix I p,nc c of I ∗ , and this submatrix will never
be used in the subsequent iterations. Thus, instead of storing allelization is applied on the inner iterations. To be more
all of these submatrices separately in memory, we only pack detailed, the packing of the input data is parallelized along
the corresponding pc × n c submatrix into a shared tensor, the n c dimension and the subsequent computation is paral-
say Is , at the very beginning of each iteration. In this way, lelized along the oc dimension. This means that each thread
the additional memory overhead is reduced to the size of Is is assigned to a block of channels of the output data for
(e.g., pc × n c ), which is far below that of the other indirect computation. In this way, only one area of shared tensor Is is
convolution algorithms. required, and the L2-cache data are shared across multithreads.
In each iteration of one specific thread, the corresponding
D. Blocking Factors m r × pc micro panel of K ∗∗ is fetched in its private L1 cache
and reused for computation.
The blocking factors pc , n c , m r , and nr are carefully
selected to match the architecture parameters of CPUs, such
as the cache size, the number, and length of SIMD registers. F. Analysis
In particular, the shared tensor Is should match the L2 cache The main goal of our method is to optimize the memory
size. In each iteration, after packing pc × n c elements of I access during computation. In this section, we analyze and
into Is , it will be stored into L2 cache and reused during compare our method with other algorithms in terms of conti-
the loop in lines 17–21. Line 19 performs the inner product nuity of memory access and overall memory ahead.
between one micro panel of K ∗∗ with size m r × pc and one 1) Continuity of Memory Access: It is a common agree-
micro panel of Is with size pc × n r , which is highly optimized ment that accessing memory in a continuous manner is more
and implemented assembly with SIMD instructions. In this efficient than noncontinuous memory access because of the
procedure, the m r × nr intermediate results of O are stored popular utilized hierarchical memory architecture in modern

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY DELHI. Downloaded on February 01,2024 at 19:30:01 UTC from IEEE Xplore. Restrictions apply.
ZHAO et al.: ECBC 439

and the memory needed by the input–output and kernel tensors


is the same for all the algorithms, despite that the data layout
of the kernel tensor is different across different algorithms

Rin = i c i h i w
Rout = oc oh ow
Rker = oc i c kh kw . (7)

We start from the extra memory overhead of the


im2col-convolution algorithm [11], [35]. The algorithm needs
Fig. 5. Storage format and memory access order of (a) directly implemented
a buffer for the storage of the matrix generated by the im2col
convolution algorithm [13] and (b) proposed ECBC. During computation of procedure. The size of the transformed tensor is
ECBC, the memory access is totally continuous, which is friendly to cache,
resulting in its high computation performance. im2col
Rext = i c k h k w × oh ow . (8)

We now analyze the additional memory overhead of


MEC [12]. The main idea of the algorithm can be seen
CPUs. This is very important to the system performance.
as to apply the im2col operation on only one dimension,
In this section, we mainly focus on the continuity of memory
instead of both the oh and ow dimension as in the im2col
access during computation (e.g., lines 17–21 in Algorithm 1).
convolution algorithm. Following the analysis of the original
Fig. 5 shows the comparison of storage format and memory
paper, the needed extra memory overhead is
access order of the start-of-the-art directly implemented convo-
lution algorithm [13] and the proposed ECBC. In the method MEC
Rext = i c (i h + 2 ph )kw ow . (9)
of [13], the input data are first blocked along the channel
dimension with some factor cblk , arising (i c /cblk ) blocks of The additional memory overhead of direct convolution
data, each with shape cblk × i h × i w . All of these blocks algorithms depends on the actual scenario because these
are stored in memory one by one, and the fastest dimension algorithms are implemented under some special data for-
is the blocked channels cblk , followed by i h and i w . While mat. Between which the most commonly used format is
during computation, the fastest accessing dimension is along (C/x) × H × W × x [13]–[16]. In some implementa-
 tions [14]–[16], the blocking factor x can be different for
the input channel i c with another blocking factor cblk , which is
larger than cblk . In this context, during the first cblk iterations of input–output tensors. In this case, additional memory overhead
computation, elements at the first cblk channels [e.g., the ones with the same size as the input tensor is needed for data
with blue colors at the bottom of Fig. 5(a)] are accessed. format transformation. For other implementations such as
In this stage, the memory access is continuous. However, Zhang et al.’s method [13], the x factor is predefined and
at iteration (cblk + 1), the program tries to load data at the keeps the same between the input and output tensors, so no
(cblk + 1)th channel [e.g., the ones with orange colors at additional memory is needed for computation.
the bottom of Fig. 5(a)]. These data are not continuous with Finally, we analyze the additional memory overhead
the first cblk ones, which will cause cache miss, and thereby of ECBC. In our algorithm, the only needed extra memory
leading to performance degradation. is the shared tensor Is with size pc × n c
While in ECBC, the core computation is the multiplication
between one m r × pc micro panel of the kernel tensor and one
ECBC
Rext = pc × n c . (10)
pc × n c submatrix of the input tensor, which is packed into pc and n c are architecture-specific parameters. In all the
the shared tensor Is . Recall that each pc × n c submatrix of implementations of our experiments, we set pc = n c = 256,
the input data is blocked along the n c dimension into (n c /nr ) which is fixed and much smaller than the size of input–output
micro panels and stored in memory one by one, and the tensors, and can be shared between all the convolution layers
fastest dimension is the blocked column. During computation in a model.
illustrated in lines 18–20 of Algorithm 1 and the bottom of
Fig. 3, these micro panels are multiplied by the corresponding
kernel data one by one. The storage order of data is the same V. E XPERIMENTAL R ESULTS
as the order they are accessed, which means that the memory To evaluate the proposed algorithm, in this section,
access of ECBC is totally continuous. we present the comparison between ECBC and other widely
2) Memory Overhead: In this section, we analyze and used industrial-level convolution algorithms on various plat-
compare the memory overhead of different convolution algo- forms and networks.
rithms. For all the convolution algorithms, the total memory Experiments are conducted on two platforms with different
overhead Ralg contains four terms: the memory needed by the ARM architectures:
input Rin , output Rout , parameters Rker , and additional memory 1) 2× ARM Cortex-A72 CPUs, with frequency up to
overhead Rext 1.8 GHz. ARM V8 architecture;
2) 4× ARM Cortex-A17 CPUs, with frequency up to
Ralg = Rin + Rout + Rker + Rext (6) 1.8 GHz. ARM V7 architecture.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY DELHI. Downloaded on February 01,2024 at 19:30:01 UTC from IEEE Xplore. Restrictions apply.
440 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO. 1, JANUARY 2023

Fig. 6. Comparison of single-thread computation performance of different algorithms on convolution layers. The notes under each group of bars denote the
configuration of convolutions. For example, the note (64, 56, 3)(128, 7, 1) denotes a 7 × 7 convolution with 64 input channels and 128 output channels. The
spatial size of the input tensor is 56 × 56. Also, the stride and padding of the convolution are 1 and 3, respectively. (a) Speedup with respect to im2col +
OpenBLAS of different algorithms on convolution layers on a single A72 core. (b) Speedup with respect to im2col + OpenBLAS of different algorithms on
convolution layers on a single A17 core.

A. Experimental Setup and padding) appearing frequently in CNN models. We do not


We implement ECBC on ARM CPUs in C++ with conduct experiments on 1 × 1 convolutions because they are
OpenMP for parallel computing. In all of our experiments, exactly equivalent to GEMM, where there will be no difference
the blocking factors pc and n c are all set to 256. m r and between our method and the traditional blocking algorithm
nr are different on different architectures because the number of GEMM described in Section III-C. We will report the
of registers is different. On ARM Cortex-A72 cores, we set computation performance and memory overhead of different
m r = 4, nr = 16, while on ARM Cortex-A17 cores, we set algorithms on these convolution layers in this section.
m r = 4, nr = 8. To evaluate the computation efficiency, we run a convo-
We compare our proposed ECBC with the popular im2col lution layer with each algorithm 100 times and record the
convolution algorithm. We extract the implementation of the average run time. We report the speed up of each algorithm
im2col convolution algorithm from the popular deep learn- normalized to im2col + OpenBLAS. Both single-thread and
ing framework Caffe [35] and utilize OpenBLAS [37] and multithread performances are presented. The results are sum-
Eigen [40] as high-performance matrix multiplication routine. marized in Figs. 6 and 7. The notes under each group of bars
We also present a comparison of ECBC with the convolution denote the convolution configurations. The gray dashed lines
implementations provided by NNPACk [50] and TFLite [51], are the performance of im2col + OpenBLAS. We can see
which are two of the most popular frameworks for CNN that ECBC achieves the best performance under all the cases
acceleration on mobile devices. Besides, we also compare excluding seven convolutions on 2× ARM Cortex-A72 cores,
ECBC with the MEC [12] and the state-of-the-art direct where ECBC performs slightly inferior compared to ZMC.
convolution algorithm ZMC [13]. Both of these convolution Fig. 6(a) shows the performance comparison between ECBC
algorithms are not open-sourced, so we reimplemented them and other methods on a single ARM Cortex-A72 CPU. Com-
by ourselves. pared to im2col convolution algorithm, ECBC is approxi-
mately 1.2× and sometimes nearly 2× faster. ZMC performs
better than im2col convolution on most layers, while it still
B. Results on Convolution Layers performs inferior compared to ECBC. Specifically, ECBC out-
We conduct 22 convolution layers with various configura- performs ZMC by about 10%, and the improvement achieves
tions (input–output channels, spatial size, kernel size, stride, up to 40% under some configurations.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY DELHI. Downloaded on February 01,2024 at 19:30:01 UTC from IEEE Xplore. Restrictions apply.
ZHAO et al.: ECBC 441

Fig. 7. Comparison of single-thread computation performance of different algorithms on convolution layers. The notes under each group of bars denote the
configuration of convolutions. For example, the note (64, 56, 3)(128, 7, 1) denotes a 7 × 7 convolution with 64 input channels and 128 output channels. The
spatial size of the input tensor is 56 × 56. Also, the stride and padding of the convolution are 1 and 3, respectively. (a) Speedup with respect to im2col +
OpenBLAS of different algorithms on convolution layers on 2× A72 cores. (b) Speedup with respect to im2col + OpenBLAS of different algorithms on
convolution layers on 2× A17 cores. (c) Speedup with respect to im2col + OpenBLAS of different algorithms on convolution layers on 3× A17 cores.
(d) Speedup with respect to im2col + OpenBLAS of different algorithms on convolution layers on 4× A17 cores.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY DELHI. Downloaded on February 01,2024 at 19:30:01 UTC from IEEE Xplore. Restrictions apply.
442 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO. 1, JANUARY 2023

Fig. 8. Comparison of different convolution algorithms in terms of run time and memory overhead. (a) Computation performance on resnet18 models with
different widths. All the performances are measured on a single ARM Cortex A72 Core. Lower is better. (b) Memory overhead of resnet18 models with
different widths. Lower is better. (c) Tradeoff between run time and memory overhead on resnet18 models with different widths. Lower is better.

Fig. 6(b) shows the performance comparison between TABLE II


ECBC and other methods on a single ARM Cortex-A17 CPU. RUN T IME OF D IFFERENT C ONVOLUTION A LGORITHMS ON
D IFFERENT CNN M ODELS . T HE VALUES IN TABLES A RE
In this case, the advantages of all the convolution algorithms IN M ILLISECONDS . (a) R ESULTS ON A S INGLE A17
are more obvious. NNPACK and TFLite perform better than C ORE . (b) R ESULTS ON 4× A17 C ORES .
the im2col convolution, and the improvement factor is approxi- (c) R ESULTS ON A S INGLE A72 C ORE .
(d) R ESULTS ON 2× A72 C ORES
mately 1.5×, while ECBC still performs the best among all the
algorithms. Compared to the im2col convolution algorithms,
ECBC is more than 2× faster on all the convolution layers.
Fig. 7(a)–(d) shows the comparisons of computation effi-
ciency of different algorithms in multithread context. Overall,
the performance improvement of our algorithm is greater than
that in a single-thread case. For example, on a single ARM
Cortex-A17 CPU, ECBC is faster than im2col + OpenBLAS
by a factor of less than 2.5×, while on 4× ARM Cortex-
A17 cores, the performance improvement is more than 2.5×
on most layers. On conv3, the improvement factor even
achieves more than 3×. Similar conclusions also hold on ARM
Cortex-A72.
In general, the performance improvement is more obvious
on ARM Cortex-A17 cores. For example, by comparison of
Fig. 6(a) and (b), we can find that on a single ARM Cortex-
A72 CPU, ECBC performs about 1.25× better than im2col +
OpenBLAS, while the improvement is nearly 2.5× on a single
ARM Cortex-A17. The results shown in Fig. 7(a) and (d) show
similar trends. We can see that on 2× ARM Cortex-A72 cores,
ECBC is about 1.25× faster than im2col + OpenBLAS, while
it is overall more than 2.5× and up to 3× faster on 4×
ARM Cortex-A17 cores. The main reason is that ARM Cortex-
A17 CPU is more limited by memory access and bandwidth,
so it gains more benefits from memory access optimization. see that ECBC performs the best against all the algorithms
We believe that the performance gap between ECBC and in all the cases except on GoogleNet and InceptionV3 on
these baseline implementations can be larger on more low-end 2× ARM Cortex-A72 cores, where ECBC performs slightly
devices, which may be a possible further direction of our work. slower compared to ZMC.
Similar to the results of experiments conducted on stand-
alone convolution layers, overall, the performance improve-
C. Results on CNNs ment of ECBC is more obvious when running with multiple
For comprehensive evaluation, we run and compare the threads or on relatively more bandwidth-constrained platform
performance of different algorithms on real models [1]–[3], (the ARM Cortex-A17 CPU with ARMV7 architecture). For
[52], [53]. For each model, we measure and report the summa- example, by comparison of Table II(a) and (b), ECBC is over-
tion of run time of its convolution layers. The results are shown all nearly 2.5× faster than the im2col convolution algorithm on
in Table II, and the run time is shown in milliseconds. We can a single ARM Cortex-A17 core, while when running with 4×

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY DELHI. Downloaded on February 01,2024 at 19:30:01 UTC from IEEE Xplore. Restrictions apply.
ZHAO et al.: ECBC 443

TABLE III
D ETAILED C OMPUTATION P ERFORMANCE OF D IFFERENT C ONVOLUTION
A LGORITHMS ON C ONVOLUTION L AYERS OF A LEX N ET [1]. T HE
VALUES IN TABLES A RE IN M ILLISECONDS . (a) P ERFORMANCE ON
A S INGLE A17 C ORE . (b) P ERFORMANCE ON 4× A17 C ORES .
(c) P ERFORMANCE ON A S INGLE A72 C ORE .
(d) P ERFORMANCE ON 2× A72 C ORES

Fig. 9. Normalized memory overhead of different convolution algorithms


on real CNN models. In this figure, all the memory overhead includes the
memory needed for the input–output/parameters of the models and additional
memory needed by various algorithms for computation are considered. Lower
is better.

D. Memory Overhead
In Section IV-F, we analyze the memory overhead of
different convolution algorithms theoretically. This section will
prove the memory efficiency of ECBC through experiments.
In Fig. 8, we show the tradeoff among run time and memory
overhead of different algorithms on resnet18 models with
different widths. In particular, in Fig. 8(a), we show the run
time in milliseconds of different convolution algorithms on
resnet18 models with different widths. Latency of all the
cores, ECBC is always more than 2.5× faster. The speedup models is measured on a single A72 core, and we can see that
even achieves 4× on GoogleNet and 5× on resnet50. The ECBC consistently achieves the best performance under all the
same trends also hold on ARM Cortex-A72 cores. cases. Fig. 8(b) shows the memory overhead of different con-
Table II(a) and (b) support that the advantages of ECBC volution algorithms. In this figure, all the memory overhead,
are more obvious on relatively more bandwidth-constrained including the memory required by convolution inputs/outputs,
platforms. Specifically, ECBC is overall 1.2× faster compared parameters of networks, and extra memory overhead required
to the im2col convolution on a single ARM Cortex-A72, while during computation, are considered. Note that the memory
the speedup is more than 2× on a single ARM Cortex-A17. overhead of convolution implementation of nnpack and tflite
Table II(b) and (d) also shows the same results, where ECBC is not shown in this figure. We mainly compare the memory
is 1.2–1.5× faster than the im2col convolution algorithm on overhead of ZMC [13], which needs the least memory among
2× A72 cores, while the improvement is more than 2.5× and all the algorithms because it does not introduce any additional
up to 5× on 4× A17 cores. memory overhead at all. We can see that the memory overhead
In Table III, we show the detailed run time of different con- of ECBC is obviously lower than im2col and MEC and almost
volution algorithms on each convolution layer of AlexNet [1]. the same with ZMC under all the cases. To further prove that
The values in tables are in milliseconds. We can see that ECBC ECBC is able to achieve a better tradeoff between run time
performs the best under all the cases excluding on conv4 and and memory overhead, in Fig. 8(c), we plot the summation
conv5 on 2× A72 cores. An interesting fact that is noteworthy of run time (in milliseconds) and memory overhead (in mega
is that, the performance of ECBC is stable and efficient, that is parameters) of different algorithms (lower is better). We can
to say, the performance of ECBC is not bad under all the cases, see that compared to other convolution algorithms, ECBC
while other methods may perform well under some cases and consistently performs superior in terms of tradeoff between
suffer poor performance under some other cases. For example, latency and memory overhead.
the performance of ZMC [13] is comparable with ECBC on Fig. 9 shows the memory overhead of different convolution
conv4 and conv5 on A72 CPUs and even faster than ECBC algorithms on different CNN models. Similar to Fig. 8, in this
by 0.21 and 0.82 ms relatively on these two layers on 2× figure, all the memory overhead includes the memory needed
A72 cores, while on other layers or platforms, it performs for storage of the inputs/outputs/parameters of models, and
inferior compared to ECBC. On conv1, ECBC is even more additional memory needed by various algorithms for compu-
than 1.5× faster than ZMC on all the platforms. The same tation of convolutions is considered. Lower is better. From
trends also hold for the performance of other algorithms. For Fig. 9, we can see that ECBC is able to save at least 30%
example, the performance of MEC [12] is similar to ECBC and up to 75% memory overhead compared to the im2col
on conv1 on A72 CPUs, and ECBC outperforms MEC by a convolution algorithm. Also, the memory needed by ECBC
large margin on other layers or platforms. is almost the same as that of ZMC [13]. In fact, compared

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY DELHI. Downloaded on February 01,2024 at 19:30:01 UTC from IEEE Xplore. Restrictions apply.
444 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 34, NO. 1, JANUARY 2023

to ZMC, a small piece of additional memory is needed by [8] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing
ECBC for the storage of the shared tensor Is . While, as what network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR),
Jul. 2017, pp. 2881–2890.
has been mentioned at the end of Section IV-F, the size of [9] T. Highlander and A. Rodriguez, “Very efficient training of convo-
Is is fixed and independent of the convolution configurations, lutional neural networks using fast Fourier transform and overlap-
it can be shared across all the convolutions in a model, thereby and-add,” 2016, arXiv:1601.06815. [Online]. Available: http://arxiv.
org/abs/1601.06815
largely reducing the total memory overhead.

VI. CONCLUSION

In this article, we propose a simple and efficient convolution algorithm for mobile devices. Instead of transforming the whole input tensor before computation, we reduce the memory overhead through blocking and memory sharing. We show that, by carefully designing the loop order and selecting the block size, the continuity and reusability of memory accesses can be improved, while the memory overhead is largely reduced through memory sharing. Evaluated on multiple networks and platforms, our method achieves high computational and memory efficiency compared with previous algorithms. The results show that ECBC brings larger benefits on low-end devices, which are more constrained by memory resources and thus more sensitive to memory optimization. Moreover, ECBC scales well: the performance improvement becomes more pronounced in the multithreaded setting. Finally, the performance of ECBC is stable; it remains efficient across a wide range of platforms, convolution configurations, and CNN models.

The main idea of ECBC is to conduct the convolution computation blockwise so that the tensor-to-matrix transformation can also be performed in a blockwise manner, and the memory overhead can thus be largely reduced through memory sharing. A possible future direction of this work is to reduce the memory cost of domain-transformation convolution methods (i.e., FFT-based [9] or Winograd-based [10] convolution algorithms) with a similar idea, which would further improve the practical utility of these computation-efficient yet memory-hungry convolution algorithms in mobile applications.
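To make the blockwise formulation above concrete, the following is a minimal C sketch of im2col-style convolution performed block by block, under simplifying assumptions (a single image in CHW layout, no padding, unit dilation, and a naive triple-loop GEMM standing in for a tuned micro-kernel). The names conv_blocked_im2col, gemm_block, and TILE are illustrative and not taken from the paper's implementation; the sketch only illustrates that a single K x TILE panel of the im2col matrix is materialized and reused across blocks instead of the full transformed matrix.

    /* Sketch only: blockwise im2col + GEMM with a shared, fixed-size panel. */
    #include <stdlib.h>
    #include <string.h>

    #define TILE 64  /* output pixels processed per block (tunable; assumed value) */

    /* Naive GEMM micro-kernel: C[M x N] += A[M x K] * B[K x N], with row
     * strides lda, ldb, ldc. A tuned BLAS-like kernel would be used in practice. */
    static void gemm_block(int M, int N, int K,
                           const float *A, int lda,
                           const float *B, int ldb,
                           float *C, int ldc)
    {
        for (int m = 0; m < M; ++m)
            for (int k = 0; k < K; ++k) {
                float a = A[m * lda + k];
                for (int n = 0; n < N; ++n)
                    C[m * ldc + n] += a * B[k * ldb + n];
            }
    }

    /* input: C_in x H x W, kernel: OC x (C_in*KH*KW), output: OC x OH x OW */
    void conv_blocked_im2col(const float *input, const float *kernel, float *output,
                             int C_in, int H, int W,
                             int OC, int KH, int KW, int stride)
    {
        int OH = (H - KH) / stride + 1;
        int OW = (W - KW) / stride + 1;
        int K  = C_in * KH * KW;                       /* GEMM inner dimension */
        float *panel = (float *)malloc((size_t)K * TILE * sizeof(float));

        memset(output, 0, (size_t)OC * OH * OW * sizeof(float));
        for (int p0 = 0; p0 < OH * OW; p0 += TILE) {
            int nb = (OH * OW - p0 < TILE) ? (OH * OW - p0) : TILE;
            /* Blockwise im2col: build a K x nb panel for output pixels
             * p0 .. p0+nb-1, overwriting (sharing) the same small buffer. */
            for (int c = 0; c < C_in; ++c)
                for (int kh = 0; kh < KH; ++kh)
                    for (int kw = 0; kw < KW; ++kw) {
                        int row = (c * KH + kh) * KW + kw;
                        for (int j = 0; j < nb; ++j) {
                            int p = p0 + j, oh = p / OW, ow = p % OW;
                            panel[row * nb + j] =
                                input[(c * H + oh * stride + kh) * W + ow * stride + kw];
                        }
                    }
            /* Small GEMM: output[:, p0:p0+nb] += kernel (OC x K) * panel (K x nb). */
            gemm_block(OC, nb, K, kernel, K, panel, nb, output + p0, OH * OW);
        }
        free(panel);
    }

In this sketch, the only transformation buffer is the K x TILE panel, so the extra memory is as small as one data block; with TILE chosen so that the panel and the corresponding kernel sub-block fit in cache, the transformation cost is amortized block by block. The actual blocking, packing, and loop ordering used by ECBC are more elaborate, as described in the earlier sections.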
REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, vol. 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Red Hook, NY, USA: Curran Associates, 2012, pp. 1097–1105.
[2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. 3rd Int. Conf. Learn. Represent. (ICLR), 2015, pp. 1–14.
[3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[4] W. Liu et al., “SSD: Single shot multibox detector,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 21–37.
[5] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” 2018, arXiv:1804.02767. [Online]. Available: http://arxiv.org/abs/1804.02767
[6] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Proc. 28th Int. Conf. Neural Inf. Process. Syst. (NIPS), vol. 1. Cambridge, MA, USA: MIT Press, 2015, pp. 91–99.
[7] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3431–3440.
[10] A. Lavin and S. Gray, “Fast algorithms for convolutional neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4013–4021.
[11] K. Chellapilla, S. Puri, and P. Simard, “High performance convolutional neural networks for document processing,” in Proc. 10th Int. Workshop Frontiers Handwriting Recognit., 2006, pp. 1–7.
[12] M. Cho and D. Brand, “MEC: Memory-efficient convolution for deep neural network,” in Proc. 34th Int. Conf. Mach. Learn. (ICML), 2017, pp. 815–824.
[13] J. Zhang, F. Franchetti, and T. M. Low, “High performance zero-memory overhead direct convolutions,” in Proc. 35th Int. Conf. Mach. Learn. (ICML), 2018, pp. 5776–5785.
[14] E. Georganas et al., “Anatomy of high-performance deep learning convolutions on SIMD architectures,” in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., Nov. 2018, pp. 830–841.
[15] E. Georganas et al., “Harnessing deep learning via a single building block,” in Proc. IEEE Int. Parallel Distrib. Process. Symp. (IPDPS), May 2020, pp. 222–233.
[16] Y. Liu, Y. Wang, R. Yu, M. Li, V. Sharma, and Y. Wang, “Optimizing CNN model inference on CPUs,” in Proc. USENIX Annu. Tech. Conf., 2019, pp. 1025–1040.
[17] M. Dukhan, “The indirect convolution algorithm,” 2019, arXiv:1907.02129.
[18] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2016, pp. 1–14.
[19] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 2074–2082.
[20] S. Guo, Y. Wang, Q. Li, and J. Yan, “DMCP: Differentiable Markov channel pruning for neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 1539–1547.
[21] J. Yu, L. Yang, N. Xu, J. Yang, and T. S. Huang, “Slimmable neural networks,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2019, pp. 1–12.
[22] J. Yu and T. Huang, “Universally slimmable networks and improved training techniques,” in Proc. Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 1803–1811.
[23] Z. Yang et al., “Searching for low-bit weights in quantized neural networks,” in Proc. Int. Conf. Neural Inf. Process. Syst. (NeurIPS), 2020, pp. 1–12.
[24] Z. Liu, Z. Shen, M. Savvides, and K.-T. Cheng, “ReActNet: Towards precise binary neural network with generalized activation functions,” in Proc. 16th Eur. Conf. Comput. Vis. (ECCV), 2020, pp. 143–159.
[25] Z. Liu, B. Wu, W. Luo, X. Yang, W. Liu, and K.-T. Cheng, “Bi-Real Net: Enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 722–737.
[26] P. Wang and J. Cheng, “Fixed-point factorized networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4012–4020.
[27] C. Leng, Z. Dou, H. Li, S. Zhu, and R. Jin, “Extremely low bit neural networks: Squeeze the last bit out with ADMM,” in Proc. 32nd AAAI Conf. Artif. Intell. (AAAI), 2018, pp. 1–8.
[28] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, “Exploiting linear structure within convolutional networks for efficient evaluation,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2014, pp. 1269–1277.
[29] M. Jaderberg, A. Vedaldi, and A. Zisserman, “Speeding up convolutional neural networks with low rank expansions,” in Proc. Brit. Mach. Vis. Conf., 2014, pp. 1–13.
[30] V. Lebedev, Y. Ganin, M. Rakhuba, I. V. Oseledets, and V. S. Lempitsky, “Speeding up convolutional neural networks using fine-tuned CP-decomposition,” in Proc. 3rd Int. Conf. Learn. Represent. (ICLR), 2015, pp. 1–11.
[31] S. Winograd, Arithmetic Complexity of Computations, vol. 43, no. 2. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 1980.
[32] Z. Jia, A. Zlateski, F. Durand, and K. Li, “Optimizing N-dimensional, Winograd-based convolution for manycore CPUs,” in Proc. 23rd ACM SIGPLAN Symp. Princ. Pract. Parallel Program., Feb. 2018, pp. 109–123.
[33] A. Xygkis, D. Soudris, L. Papadopoulos, S. Yous, and D. Moloney, “Efficient Winograd-based convolution kernel implementation on edge devices,” in Proc. 55th Design Autom. Conf. (DAC), Jun. 2018, pp. 1–6.
[34] Z. Jia, A. Zlateski, F. Durand, and K. Li, “Towards optimal Winograd convolution on manycores,” in Proc. SysML, 2018, pp. 1–3.
[35] Y. Jia et al., “Caffe: Convolutional architecture for fast feature embedding,” in Proc. ACM Int. Conf. Multimedia, Nov. 2014, pp. 675–678.
[36] J. J. Dongarra, J. D. Croz, and S. Hammarling, “A set of level 3 basic linear algebra subprograms,” ACM Trans. Math. Softw., vol. 16, no. 1, pp. 1–17, 1990.
[37] Z. Xianyi, W. Qian, and Z. Yunquan, “Model-driven level 3 BLAS performance optimization on Loongson 3A processor,” in Proc. IEEE 18th Int. Conf. Parallel Distrib. Syst. (ICPADS), Dec. 2012, pp. 684–691.
[38] Q. Wang, X. Zhang, Y. Zhang, and Q. Yi, “AUGEM: Automatically generate high performance dense linear algebra kernels on x86 CPUs,” in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal. (SC), Nov. 2013, pp. 17–21.
[39] Intel. (2020). MKL: Intel Math Kernel Library. [Online]. Available: https://software.intel.com/content/www/us/en/develop/tools/math-kernel-library.html
[40] G. Guennebaud et al. (2010). Eigen V3. [Online]. Available: http://eigen.tuxfamily.org
[41] A. Gural and B. Murmann, “Memory-optimal direct convolutions for maximizing classification accuracy in embedded applications,” in Proc. 36th Int. Conf. Mach. Learn. (ICML), 2019, pp. 2515–2524.
[42] T. Chen et al., “TVM: An automated end-to-end optimizing compiler for deep learning,” in Proc. 12th USENIX Symp. Operating Syst. Design Implement. (OSDI), 2018, pp. 578–594.
[43] S. Zheng, Y. Liang, S. Wang, R. Chen, and K. Sheng, “FlexTensor: An automatic schedule exploration and optimization framework for tensor computation on heterogeneous system,” in Proc. Architectural Support Program. Lang. Operating Syst. (ASPLOS), 2020, pp. 859–873.
[44] L. Zheng et al., “Ansor: Generating high-performance tensor programs for deep learning,” in Proc. 14th USENIX Symp. Operating Syst. Design Implement. (OSDI), 2020, pp. 863–879.
[45] T. Chen et al., “MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems,” in Proc. Neural Inf. Process. Syst. Workshop Mach. Learn. Syst., 2015, pp. 1–6.
[46] M. Abadi et al., “TensorFlow: A system for large-scale machine learning,” in Proc. 12th USENIX Symp. Operating Syst. Design Implement. (OSDI), 2016, pp. 265–283.
[47] K. Goto and R. A. van de Geijn, “Anatomy of high-performance matrix multiplication,” ACM Trans. Math. Softw., vol. 34, no. 3, pp. 12:1–12:25, May 2008.
[48] T. M. Smith, R. van de Geijn, M. Smelyanskiy, J. R. Hammond, and F. G. V. Zee, “Anatomy of high-performance many-threaded matrix multiplication,” in Proc. IEEE 28th Int. Parallel Distrib. Process. Symp., May 2014, pp. 1049–1059.
[49] F. G. Van Zee and R. A. van de Geijn, “BLIS: A framework for rapidly instantiating BLAS functionality,” ACM Trans. Math. Softw., vol. 41, no. 3, pp. 1–33, Jun. 2015.
[50] M. Dukhan. (2020). NNPACK. [Online]. Available: https://github.com/Maratyszcza/NNPACK
[51] (2020). TFLITE. [Online]. Available: https://www.tensorflow.org/lite
[52] C. Szegedy et al., “Going deeper with convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1–9.
[53] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2818–2826.
Tianli Zhao received the B.E. degree in engineering physics from Tsinghua University, Beijing, China, in 2016. He is currently pursuing the Ph.D. degree with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing. His research focuses on compression and acceleration of deep learning and on high-performance computing for deep learning algorithms on mobile platforms.

Qinghao Hu received the B.E. degree in computer science from Northwestern Polytechnical University, Xi’an, China, in 2014, and the Ph.D. degree from the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2019. He currently works as a Research Assistant at the Institute of Automation, Chinese Academy of Sciences. His current research interests include deep neural network compression and acceleration, hashing, and quantization.

Xiangyu He received the B.E. degree in information security from the Beijing University of Posts and Telecommunications, Beijing, China, in 2017. He is currently pursuing the Ph.D. degree with the Institute of Automation, Chinese Academy of Sciences, Beijing. His current research interests include deep learning and image retrieval.

Weixiang Xu received the B.E. degree in automation from Northeastern University, Shenyang, China, in 2018. He is currently pursuing the Ph.D. degree with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China. His current research interests include compression of deep learning and high-performance computing.

Jiaxing Wang received the B.E. degree from the School of Control and Computer Engineering, North China Electric Power University, Beijing, China, in 2015. He is currently pursuing the Ph.D. degree with the Institute of Automation, Chinese Academy of Sciences, Beijing. His current research interests include efficient deep learning and Bayesian methods.

Cong Leng received the B.E. degree in automation from Central South University, Changsha, China, in 2011, and the Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2016. His current research interests include machine learning, deep learning, and their applications in computer vision and data mining.

Jian Cheng (Member, IEEE) received the B.S. and M.S. degrees in mathematics from Wuhan University, Wuhan, China, in 1998 and 2001, respectively, and the Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2004. He is currently a Professor with the Institute of Automation, Chinese Academy of Sciences. His current major research interests include deep learning, computer vision, and chip design.