Neurocomputing
journal homepage: www.elsevier.com/locate/neucom

Article history: Received 10 April 2020; Revised 14 August 2020; Accepted 13 October 2020; Available online 3 November 2020.
Communicated by Zidong Wang
2010 MSC: 00-01; 99-00
Keywords: Convolutional Neural Network (CNN); Efficient neural network; Pointwise convolution; Linear-phase filter

Abstract: In current efficient convolutional neural networks, 1×1 convolution is widely used. However, the amount of computation and the number of parameters of the 1×1 convolution layers account for a large part of these neural network models. In this paper, we propose linear-phase pointwise convolution kernels (LPPC kernels) to reduce the computational complexity and storage cost of these neural networks. We design four types of LPPC kernels based on the parity of the number of input channels and the symmetry of the weights of the pointwise convolution kernel. Experimental results show that Type-I LPPC kernels can compress some popular networks better, with only a small reduction in accuracy, than the other types of LPPC kernels. The LPPC kernels can be used as new 1×1 convolution kernels to design efficient neural network architectures in the future. Moreover, the LPPC kernels are friendly to low-power hardware accelerator design, enabling lower memory access cost and smaller model size.

© 2020 Published by Elsevier B.V.
⁎ Corresponding author. E-mail address: zhangguohe@xjtu.edu.cn (G. Zhang).
https://doi.org/10.1016/j.neucom.2020.10.067
0925-2312/© 2020 Published by Elsevier B.V.

1. Introduction

Artificial Neural Networks (ANNs) have developed rapidly since AlexNet [1] won the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC2012) [2]. ANNs are widely used in computer vision, natural language processing, automatic control [3,4], and so on. Convolutional Neural Networks (CNNs) are popular in computer vision tasks such as classification and detection. To achieve higher accuracy, CNNs have become deeper and larger, and their computational complexity has grown significantly (e.g., VGG [5], GoogLeNet [6], ResNet [7], DenseNet [8], SENet [9], GPipe [10]), which makes them difficult to deploy on embedded and mobile devices with limited computing power and storage resources. Besides, it is increasingly difficult to download new models from the cloud.

Compact network design [11], also known as efficient network design, aims to generate light-weight, high-efficiency network architectures. Depthwise separable convolution [12–15] and group convolution [14–16] are widely used to design compact network architectures, which aim to reduce both the computational complexity and the size of the deep neural network model. In light-weight CNNs using depthwise separable convolution, the amount of computation and the number of parameters of the 1×1 (pointwise) convolution layers account for a large part of the overall network model. Taking the 1.0MobileNet-224 network [12] as an example, the computational costs and parameters of the 1×1 convolution layers account for 94.86% and 74.59% of the entire network, respectively.

To further lower the computational complexity of the 1×1 convolution, this paper introduces a new 1×1 convolution kernel named the linear-phase pointwise convolution (LPPC) kernel. The idea is inspired by the linear-phase filter in signal processing, which has been widely used in many applications, such as image and video compression. We impose various symmetry constraints on the weights of the pointwise convolution kernels and propose four types of LPPC kernels. These new pointwise convolution kernels can halve the number of parameters of the pointwise convolution layer and reduce the number of multiplications. Experimental results show that these new pointwise convolution kernels can effectively compress the network model while causing only a small reduction in accuracy.

This paper is organized as follows. We review the methods of acquiring small and efficient models in Section 2.
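As a rough illustration of why the pointwise layers dominate the cost figures quoted above, the following Python sketch counts FLOPs and parameters for a single depthwise separable block (the formulas follow the standard multiply-plus-add accounting used later in Eq. (1); the block dimensions are illustrative, and this single-block sketch does not reproduce the exact 94.86%/74.59% whole-network figures):

```python
# FLOP and parameter counts for one depthwise separable block,
# counting each output value as (mults) + (mults - 1) adds.

def depthwise_flops(df, dk, c):
    # each of df*df*c outputs needs dk*dk mults and dk*dk - 1 adds
    return df * df * c * (2 * dk * dk - 1)

def pointwise_flops(df, ci, co):
    # each of df*df*co outputs needs ci mults and ci - 1 adds
    return df * df * co * (2 * ci - 1)

df, dk, ci, co = 14, 3, 512, 512      # a mid-network MobileNet-like block
dw, pw = depthwise_flops(df, dk, ci), pointwise_flops(df, ci, co)
print(f"pointwise share of block FLOPs:  {pw / (dw + pw):.2%}")

dw_params, pw_params = dk * dk * ci, ci * co
print(f"pointwise share of block params: {pw_params / (dw_params + pw_params):.2%}")
```

For this block the 1×1 convolution accounts for well over 90% of both FLOPs and parameters, which is what motivates targeting the pointwise layers specifically.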
F. Liang, Z. Tian, M. Dong et al. Neurocomputing 423 (2021) 572–579
In Section 3, we introduce linear-phase pointwise convolution kernels. In Section 4, we apply the idea to MobileNet, MobileNet v2, and ResNet50, and analyze the results. Section 5 contains conclusions and future work.

2. Related works

Striking an optimal balance between accuracy and performance for deep neural network architectures has been an active research area in the last several years. There are two methods for achieving efficient network models: one is to design an efficient model architecture manually or by network architecture search (NAS) algorithms [17], and the other is to obtain small network models from pretrained models by shrinking, factorizing, or compressing without changing the network architecture.

In [18], the Network-In-Network architecture is proposed, which uses 1×1 convolution to increase the network capacity while keeping the overall computational complexity small. SqueezeNet [19] is a light-weight network structure that uses 1×1 convolution extensively. Its squeeze and expand modules focus on reducing the number of parameters. The method reaches 57.55% accuracy, which is the same as AlexNet, but the model is 50× smaller. Google developed two efficient architectures named MobileNet [12] and MobileNet v2 [13] in 2017 and 2018, respectively. MobileNet uses efficient depthwise separable convolutions to reduce computational complexity. It also achieves state-of-the-art accuracy with low latency. MobileNet v2 introduces the linear bottleneck and inverted residual structure to construct a more efficient architecture. In [14], ShuffleNet is developed. ShuffleNet introduces pointwise group convolution and channel shuffle operations. Pointwise group convolution further reduces the amount of computation, while channel shuffle makes information flow across all groups, which enables the model to have better accuracy and lower latency. The authors of ShuffleNet summarize four guidelines for designing efficient neural network architectures and propose ShuffleNet v2 [15], which improves group convolution by channel split and uses channel shuffle on the split channels as well.

NAS is another popular way to design neural network architectures, which aims at "neural nets designing neural nets". There are many efficient convolutional neural networks designed by NAS (e.g., EfficientNet [20], MnasNet [21]). EfficientNet is based on a baseline model designed by NAS and uses a compound scaling factor to regulate the model architecture. EfficientNet-B0 achieves higher accuracy than ResNet-50 with up to 4.9× parameter reduction and 11× FLOPs reduction. EfficientNet-B7 achieves 84.4% top-1 accuracy on ImageNet, which is higher than GPipe [10]. MnasNet aims at designing CNN architectures for mobile devices that achieve a good trade-off between accuracy and latency. It is 1.8× faster than MobileNet v2 with 0.5% higher accuracy and 2.3× faster than NASNet [22] with 1.2% higher accuracy. MobileNet v3 [23] is designed by NAS, and it is a more efficient CNN architecture than MobileNet v2 and MobileNet.

Compression is another method to acquire small network models from trained large network models. It usually reduces the model size at the price of accuracy. The typical techniques include pruning, quantization, and Huffman coding [24]. Pruning is based on the assumption that many parameters in deep networks are unnecessary and can be removed. Pruning can make the network structure sparse. In recent years, model quantization methods have found many applications and enabled the deployment of CNN algorithms on terminal hardware. Quantization refers to the process of reducing the number of bits that represent a number [25–28]. It has been extensively demonstrated that weights and activations can be represented using 8-bit integers (INT8) without incurring significant loss in accuracy. Some methods also propose using INT4 or lower, such as binary quantization [29,30] and ternary quantization [31,32]. The quantization methods can compress the network model effectively and accelerate the inference process significantly. Another model compression method is knowledge distillation, which was first proposed in [33] and generalized by Hinton [34]. It uses a small student network to learn the behavior of a large teacher network.

3. Depthwise separable convolution layer with LPPC kernels

In this section, we first review depthwise separable convolution and then explain the proposed linear-phase pointwise convolution (LPPC) kernels. The LPPC kernels can compress neural networks by replacing the normal pointwise convolution kernels with LPPC kernels, without modifying the network architecture. Fig. 1 shows the depthwise separable convolution layer in which some pointwise convolution kernels are LPPC kernels.

3.1. Depthwise separable convolution

Depthwise separable convolution was originally introduced in [35]. It consists of a depthwise convolution layer and a 1×1 convolution (also known as pointwise convolution) layer. The depthwise convolution takes the branching strategy to the extreme, i.e., the number of branches equals the number of input/output channels. After the light-weight depthwise convolution, the pointwise convolution applies 1×1 convolution to combine the outputs of the depthwise convolution.

The depthwise separable convolution essentially approximates the standard convolution by a depthwise convolution and a pointwise convolution. If a standard convolutional layer uses a convolution kernel of size $D_K \times D_K \times C_I \times C_O$ and produces an output feature map of size $D_F \times D_F \times C_O$, the ratio of computational costs (FLOPs) of the depthwise separable convolution to the standard convolution can be written as:

$$\frac{D_F^2 C_I (2D_K^2 - 1) + D_F^2 C_O (2C_I - 1)}{D_F^2 C_O (2C_I D_K^2 - 1)} \approx \frac{1}{C_O} + \frac{1}{D_K^2}, \qquad (1)$$

where $D_K$ is the spatial dimension of the kernel, $C_I$ is the number of input channels, $C_O$ is the number of output channels, and $D_F$ is the spatial width and height of the square output feature map.

3.2. Linear-phase pointwise convolution (LPPC) kernel

In digital signal processing theory, the impulse response $h(n)$ of a linear-phase finite impulse response (FIR) filter satisfies the following condition:

$$h(n) = \pm h(N - n), \quad 0 \le n \le N. \qquad (2)$$

Depending on the parity of $N$ and the symmetry of $h(n)$, there are four types of linear-phase FIR filters:

Type-I FIR filter: $h(n)$ is symmetric and $N$ is even;
Type-II FIR filter: $h(n)$ is symmetric and $N$ is odd;
Type-III FIR filter: $h(n)$ is antisymmetric and $N$ is even;
Type-IV FIR filter: $h(n)$ is antisymmetric and $N$ is odd.

Linear-phase filters have fewer distinct coefficients than arbitrary filters and can lead to efficient fast implementations [36]. Moreover, in many applications such as image and video compression, it is well known that linear-phase filters can achieve better visual quality. Inspired by these results, we propose to introduce the linear-phase constraint to the convolution kernels of pointwise convolution layers. That is, we add the constraint that the weights
Fig. 1. Depthwise separable convolution with a pointwise convolution layer that has LPPC kernels.
of these 1×1 kernels are either symmetric or antisymmetric in the depth dimension. We denote these 1×1 convolution kernels with the linear-phase constraint as linear-phase pointwise convolution (LPPC) kernels.

Fig. 2 shows the four types of LPPC kernels. The symbols $a$, $b$, and $c$ denote the weights of the LPPC kernels. We use these symbols as simple examples to demonstrate the symmetry of these four types of LPPC kernels. Similar to the FIR filters, (a) is the Type-I LPPC kernel, (b) is the Type-II LPPC kernel, (c) is the Type-III LPPC kernel, and (d) is the Type-IV LPPC kernel.

Applying these linear-phase kernels in the pointwise convolution process is equivalent to a "channel shrinkage" process. For instance, if the number of input channels is even, such as 4, then to apply a 4-tap Type-I linear-phase kernel, the first channel is added to the fourth channel, and the second channel is added to the third channel, as shown in Fig. 3. After that, the resulting two-channel feature map is convolved with the 1×1×2 kernel. After the "channel shrinkage", the number of parameters of a pointwise convolution kernel is reduced to half of the original value.

The LPPC kernels can further decrease the amount of computation by reducing multiplications. If all normal pointwise convolution kernels are replaced with LPPC kernels, the FLOPs in a pointwise convolution layer can be reduced as:

$$D_F^2 C_O (2C_I - 1) \;\rightarrow\; D_F^2 C_O (C_I - 1) + D_F^2 \frac{C_I}{2} \approx (C_O - 0.5)\, C_I\, D_F^2. \qquad (3)$$

3.3. The implementation of LPPC kernels

In the training process, we insert a 1×1 convolution layer before the pointwise convolution layer to implement the addition or subtraction operations in the channel dimension, as described in Section 3.2. The weights of the inserted 1×1 layers are specially initialized and set with "requires_grad=False", so these layers do not learn anything. Take the Type-I kernels as an example. If the number of input channels of a linear-phase pointwise convolution layer is $M$, the number of 1×1 kernels of the inserted 1×1 convolution layer will be $M/2$. The weights of these kernels are initialized as follows:

    kernel 1:    1 0 0 ... 0 0 1
    kernel 2:    0 1 0 ... 0 1 0
    kernel 3:    0 0 1 ... 1 0 0        (4)
        ...
    kernel M/2:  0 ... 0 1 1 0 ... 0

When deploying the LPPC-constrained network models on devices for inference, we remove these inserted 1×1 convolution layers and directly implement the addition or subtraction operations. Then the new input feature map is convolved by a group of 1×1 kernels whose depth dimension is reduced by half.
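The "channel shrinkage" equivalence behind this implementation can be checked in a few lines of plain Python (a sketch at a single spatial position for a Type-I kernel; the function names are ours, not from the paper's code, and the paper's actual experiments run in PyTorch):

```python
# Type-I LPPC pointwise convolution at one spatial position, computed
# two equivalent ways: (a) a full symmetric 1x1 kernel over M channels,
# and (b) "channel shrinkage" (fold channel i onto channel M-1-i)
# followed by a half-depth kernel, as in the deployment path.

def lppc_type1_direct(x, half_kernel):
    """Apply the full symmetric kernel [w0..w_{M/2-1}, w_{M/2-1}..w0] to x."""
    full = half_kernel + half_kernel[::-1]          # mirror to get symmetry
    return sum(w * v for w, v in zip(full, x))

def lppc_type1_shrunk(x, half_kernel):
    """Add mirrored channels first, then apply the half-depth kernel."""
    m = len(x)
    folded = [x[i] + x[m - 1 - i] for i in range(m // 2)]  # channel shrinkage
    return sum(w * v for w, v in zip(half_kernel, folded))

x = [1.0, 2.0, 3.0, 4.0]   # 4 input channels at one pixel
k = [0.5, -1.0]            # distinct weights a, b -> full kernel [a, b, b, a]
print(lppc_type1_direct(x, k), lppc_type1_shrunk(x, k))  # identical results
```

The shrunk form uses half as many multiplications per output channel, which is exactly the saving counted in Eq. (3); Type-III/IV kernels would use a subtraction instead of the addition in the fold.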
4. Designs and experiments
Table 1. The network architecture of the LPPC-constrained MobileNet.

    Layer        Stride   Output size        Repeat
    Image        –        224 × 224 × 3      –
    Conv_1       2/1      112 × 112 × 32     1
    Block_1      1        112 × 112 × 64     1
    Block_2      2        56 × 56 × 128      1
    Block_3      1        56 × 56 × 128      1
    Block_4      2        28 × 28 × 256      1
    Block_5      1        28 × 28 × 256      1
    Block_6      2        14 × 14 × 512      1
    Block_7      1        14 × 14 × 512      5
    Block_8      2        7 × 7 × 1024       1
    Block_9      1        7 × 7 × 1024       1
    GlobalPool   7/2      1 × 1 × 1024       –
    FC           –        1 × 1 × 1000       –
    Softmax      –        1000/10            –

… regression sequentially. When we use ImageNet2012 [2] as the training data, we use 1.0MobileNet-224 as the baseline. When we use the CIFAR-10 dataset [37], we modify the 1.0MobileNet-224 by changing the stride of the first standard 3×3 convolution layer to one and the GlobalPool stride to two. All neural network models are trained in PyTorch [38] using SGD with a learning rate that decreases with epochs. We use the same regularization and data augmentation techniques as the original MobileNet. The other training parameters are also the same as for the original network.

4.1. Fully LPPC-constrained MobileNet

As shown in Fig. 2, we design four types of linear-phase pointwise convolution kernels. To apply them in MobileNet, we first replace all kernels in the pointwise convolution layers with one type of LPPC kernel. Since all channel counts N in the baseline MobileNet are even, we use N−1 for the Type-II and Type-IV kernels in the new models. Table 2 shows the performance of the new MobileNet models using these four types of LPPC kernels on CIFAR-10. We find that Type-I LPPC kernels achieve slightly higher accuracy than the other three types of LPPC kernels. Table 3 shows the results of the LPPC-constrained MobileNet on the ImageNet2012 dataset. As on CIFAR-10, the Type-I kernels again achieve better performance. In summary, the Type-I LPPC-constrained MobileNet achieves higher classification accuracy than the other three types.

Table 2. Results of the LPPC kernels on the CIFAR-10 dataset.

    Model                  CIFAR-10 Accuracy    Million       Million
                           Top-1, %             Parameters    FLOPs
    Baseline MobileNet     86.29                3.22          48
    MobileNet (Type-I)     85.91                1.65          39
    MobileNet (Type-II)    85.54                1.65          39
    MobileNet (Type-III)   85.50                1.65          39
    MobileNet (Type-IV)    85.55                1.65          39

Table 3. Results of the LPPC kernels on the ImageNet dataset.

    Model                  ImageNet Accuracy    Million       Million
                           Top-1/Top-5, %       Parameters    FLOPs
    Baseline MobileNet     70.62/89.74          4.23          589
    MobileNet (Type-I)     68.41/88.30          2.66          320
    MobileNet (Type-II)    68.15/88.40          2.66          320
    MobileNet (Type-III)   68.09/88.28          2.66          320
    MobileNet (Type-IV)    68.00/88.21          2.66          320

Among these four types of LPPC kernels, the Type-I and Type-II LPPC kernels are symmetric, and the Type-III and Type-IV LPPC kernels are antisymmetric. As depicted in Fig. 3, the convolution operation using the Type-I or Type-II LPPC kernels is equivalent to first applying an addition operation in the channel dimension and then applying a pointwise convolution whose number of input channels is reduced by half. On the other hand, using the Type-III or Type-IV LPPC kernels is equivalent to first applying a subtraction operation in the channel dimension. For the Type-III and Type-IV LPPC kernels, the subtraction operation makes many elements of the output feature map of the depthwise convolution layers less than or equal to zero. After the new pointwise convolution layer, the ReLU function turns many elements of the output feature map into zero. This subtraction operation could destroy the features extracted by the depthwise convolution. The Type-I LPPC-constrained network has one more kernel than the Type-II LPPC-constrained network in each pointwise convolution layer, because we need to make the number of parameters of the Type-I LPPC kernel even and the number of parameters of the Type-II LPPC kernel odd. So the Type-I LPPC-constrained network has slightly more parameters, which provides it with higher accuracy.

4.2. Partially LPPC-constrained MobileNet

In this part, instead of applying the linear-phase constraint to all pointwise convolution kernels in MobileNet, we apply the constraint to only a portion of the pointwise convolution kernels, so that we can easily control the trade-off between complexity and performance. We keep the number of channels of every layer the same as in the original MobileNet and use a hyper-parameter L to control the portion of LPPC kernels among all kernels. Fig. 4 shows the new depthwise separable convolution layer block. The depthwise separable convolution layer includes a 3×3 depthwise convolution, BatchNorm, ReLU nonlinearity, two branches of 1×1 pointwise convolutions (one for unconstrained kernels and one for LPPC kernels), BatchNorm, and ReLU nonlinearity, sequentially.

Tables 4 and 5 show the results of experiments using different L values in the new MobileNet models, in which we use the same L and Type-I kernels for every pointwise convolution layer. Table 4 shows the results of the LPPC-constrained MobileNet architectures with different L values on the CIFAR-10 dataset.
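A minimal sketch of the resulting parameter count, assuming (as the text above suggests but does not state formally) that L is the fraction of a layer's C_O kernels that carry the Type-I constraint, each such kernel storing half as many weights:

```python
# Parameter count of one pointwise convolution layer in which a
# fraction L of the C_O kernels are Type-I LPPC kernels (half the
# distinct weights) and the rest are unconstrained.
# L = 0 recovers the original layer; L = 1 the fully constrained one.

def pointwise_params(ci, co, L):
    lppc_kernels = round(co * L)            # constrained branch
    normal_kernels = co - lppc_kernels      # unconstrained branch
    return normal_kernels * ci + lppc_kernels * (ci // 2)

print(pointwise_params(512, 512, 0.0))   # original layer
print(pointwise_params(512, 512, 1.0))   # fully constrained: params halved
print(pointwise_params(512, 512, 0.5))   # partial constraint, in between
```

The count interpolates linearly between the unconstrained and fully constrained layer, which is why L gives a smooth accuracy/complexity trade-off knob.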
Table 4. Results of different L on the CIFAR-10 dataset.

Table 5. Results of different L on the ImageNet dataset.

Table 6. Results of Type-I LPPC-constrained MobileNet v2 and ResNet50.
… (2012), the DARPA Young Faculty Award (2013), the TUM-IAS Hans Fischer Fellowship from Germany (2017), and seven best paper awards. Dr. Li is a Fellow of IEEE and Distinguished Member of ACM, a distinguished speaker of ACM (2017–2020), and a distinguished lecturer of the IEEE CAS society (2018–2019).

Shuting Cheng is currently pursuing the B.E. degree in microelectronics science and engineering with the Department of Microelectronics, Xi'an Jiaotong University, Xi'an, China. Her current research interests include lightweight neural networks and digital integrated circuit design.

Li Sun studied computer science at the Northwestern Polytechnical University, China, where she received the PhD degree in 2010. In 2016, she became an associate professor for information and communication engineering at the Airforce Engineering University. She is currently working on machine learning in the Microelectronics Academy at Xi'an Jiaotong University. Her research interests include image processing, machine learning, and probabilistic graphical models.

Yiran Chen received the B.S. and M.S. degrees from Tsinghua University and the Ph.D. degree from Purdue University in 2005. After five years in industry, he joined the University of Pittsburgh in 2010 as Assistant Professor and was promoted to Associate Professor with tenure in 2014, holding the Bicentennial Alumni Faculty Fellowship. He is now Professor in the Department of Electrical and Computer Engineering at Duke University, serving as director of the NSF Industry-University Cooperative Research Center (IUCRC) for Alternative Sustainable and Intelligent Computing (ASIC) and co-director of the Duke University Center for Computational Evolutionary Intelligence (CEI), focusing on research into new memory and storage systems, machine learning and neuromorphic computing, and mobile computing systems. Dr. Chen has published one book and more than 350 technical publications and has been granted 94 US patents. He serves or has served as associate editor of several IEEE and ACM transactions/journals and has served on the technical and organization committees of more than 50 international conferences. He received 6 best paper awards and 13 best paper nominations from international conferences. He is the recipient of the NSF CAREER award and the ACM SIGDA outstanding new faculty award. He is a Fellow of IEEE and Distinguished Member of ACM, a distinguished lecturer of IEEE CEDA, and the recipient of the Humboldt Research Fellowship for Experienced Researchers.