
Neurocomputing 423 (2021) 572–579


Efficient neural network using pointwise convolution kernels with linear phase constraint

Feng Liang a, Zhichao Tian a, Ming Dong a, Shuting Cheng a, Li Sun a, Hai Li b, Yiran Chen b, Guohe Zhang a,*

a School of Microelectronics, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
b Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708, USA

Article info

Article history:
Received 10 April 2020
Revised 14 August 2020
Accepted 13 October 2020
Available online 3 November 2020
Communicated by Zidong Wang

2010 MSC: 00-01, 99-00

Keywords: Convolutional Neural Network (CNN); Efficient neural network; Pointwise convolution; Linear-phase filter

Abstract

In current efficient convolutional neural networks, 1 × 1 convolution is widely used. However, the amount of computation and the number of parameters of the 1 × 1 convolution layers account for a large part of these neural network models. In this paper, we propose linear-phase pointwise convolution kernels (LPPC kernels) to reduce the computational complexity and storage cost of these neural networks. We design four types of LPPC kernels based on the parity of the number of input channels and the symmetry of the weights of the pointwise convolution kernel. Experimental results show that Type-I LPPC kernels compress several popular networks with a smaller reduction in accuracy than the other types of LPPC kernels. The LPPC kernels can be used as new 1 × 1 convolution kernels to design efficient neural network architectures in the future. Moreover, the LPPC kernels are friendly to low-power hardware accelerator design, enabling lower memory access cost and smaller model size.

© 2020 Published by Elsevier B.V.

1. Introduction

Artificial Neural Networks (ANNs) have developed rapidly since AlexNet [1] won the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC2012) [2]. ANNs are now widely used in computer vision, natural language processing, automatic control [3,4], and other fields. Convolutional Neural Networks (CNNs) are popular in computer vision tasks such as classification and detection. To achieve higher accuracy, CNNs have become deeper and larger and their computational complexity has grown significantly (e.g., VGG [5], GoogLeNet [6], ResNet [7], DenseNet [8], SENet [9], GPipe [10]), which makes them difficult to deploy on embedded and mobile devices with limited computing power and storage resources. It also becomes increasingly difficult to download new models from the cloud.

Compact network design [11], also known as efficient network design, aims to generate light-weight, high-efficiency network architectures. Depthwise separable convolution [12–15] and group convolution [14–16] are widely used to design compact network architectures that reduce the computational complexity and the size of the deep neural network model. In light-weight CNNs built on depthwise separable convolution, the amount of computation and the number of parameters of the 1 × 1 (pointwise) convolution layers account for a large part of the overall network model. Taking the 1.0MobileNet-224 network [12] as an example, the computational costs and parameters of the 1 × 1 convolution layers account for 94.86% and 74.59% of the entire network, respectively.

To further lower the computational complexity of the 1 × 1 convolution, this paper introduces a new 1 × 1 convolution kernel named the linear-phase pointwise convolution (LPPC) kernel. The idea is inspired by the linear-phase filter in signal processing, which has been widely used in many applications such as image and video compression. We impose various symmetry constraints on the weights of the pointwise convolution kernels and propose four types of LPPC kernels. These new pointwise convolution kernels halve the number of parameters of the pointwise convolution layer and reduce the number of multiplications. Experimental results show that they can effectively compress the network model while causing only a small reduction in accuracy.

This paper is organized as follows. We review methods for acquiring small and efficient models in Section 2.

* Corresponding author. E-mail address: zhangguohe@xjtu.edu.cn (G. Zhang).

https://doi.org/10.1016/j.neucom.2020.10.067

In Section 3, we introduce the linear-phase pointwise convolution kernels. In Section 4, we apply the idea to MobileNet, MobileNet v2, and ResNet50, and analyze the results. Section 5 contains conclusions and future work.

2. Related works

Striking an optimal balance between accuracy and performance for deep neural network architectures has been an active research area in the last several years. There are two methods for achieving efficient network models: one is to design an efficient model architecture manually or by network architecture search (NAS) algorithms [17]; the other is to obtain small network models from pre-trained models by shrinking, factorizing, or compressing them without changing the network architecture.

In [18], the Network-In-Network architecture is proposed, which uses 1 × 1 convolution to increase the network capacity while keeping the overall computational complexity small. SqueezeNet [19] is a light-weight network structure that uses 1 × 1 convolution extensively; its squeeze and expand modules focus on reducing the number of parameters. The method reaches 57.55% accuracy, the same as AlexNet, but the model is 50× smaller. Google developed two efficient architectures, MobileNet [12] and MobileNet v2 [13], in 2017 and 2018, respectively. MobileNet uses efficient depthwise separable convolutions to reduce computational complexity and achieves state-of-the-art accuracy with low latency. MobileNet v2 introduces the linear bottleneck and inverted residual structure to construct a more efficient architecture. In [14], ShuffleNet is developed; it introduces pointwise group convolution and channel shuffle operations. Pointwise group convolution further reduces the amount of computation, while channel shuffle makes information flow across all groups, which gives the model better accuracy and lower latency. The authors of ShuffleNet summarize four guidelines for designing efficient neural network architectures and propose ShuffleNet v2 [15], which improves group convolution by channel split and applies channel shuffle to the split channels as well.

NAS is another popular way to design neural network architectures, aiming at "neural nets to design neural nets". Many efficient convolutional neural networks have been designed by NAS (e.g., EfficientNet [20], MnasNet [21]). EfficientNet is based on a baseline model designed by NAS and uses a compound scaling factor to regulate the model architecture. EfficientNet-B0 achieves higher accuracy than ResNet-50 with up to 4.9× fewer parameters and 11× fewer FLOPs, and EfficientNet-B7 achieves 84.4% top-1 accuracy on ImageNet, which is higher than GPipe [10]. MnasNet targets CNN architectures for mobile devices and achieves a good trade-off between accuracy and latency: it is 1.8× faster than MobileNet v2 with 0.5% higher accuracy and 2.3× faster than NASNet [22] with 1.2% higher accuracy. MobileNet v3 [23] is designed by NAS and is a more efficient CNN architecture than MobileNet v2 and MobileNet.

Compression is another method to acquire small network models from trained large network models. It usually reduces the model size at the price of accuracy. Typical techniques include pruning, quantization, and Huffman coding [24]. Pruning is based on the assumption that many parameters in deep networks are unnecessary and can be removed; it makes the network structure sparse. In recent years, model quantization methods have found many applications and have enabled the deployment of CNN algorithms on terminal hardware. Quantization refers to the process of reducing the number of bits that represent a number [25–28]. It has been extensively demonstrated that weights and activations can be represented using 8-bit integers (INT8) without incurring a significant loss in accuracy, and methods using INT4 or lower have also been proposed, such as binary quantization [29,30] and ternary quantization [31,32]. Quantization methods can compress the network model effectively and accelerate the inference process significantly. Another model compression method is knowledge distillation, which was first proposed in [33] and generalized by Hinton [34]; it uses a small student network to learn the behavior of a large teacher network.

3. Depthwise separable convolution layer with LPPC kernels

In this section, we first review depthwise separable convolution and then explain the proposed linear-phase pointwise convolution (LPPC) kernels. The LPPC kernels can compress neural networks simply by replacing the normal pointwise convolution kernels, without modifying the network architecture. Fig. 1 shows a depthwise separable convolution layer in which some pointwise convolution kernels are LPPC kernels.

3.1. Depthwise separable convolution

Depthwise separable convolution was originally introduced in [35]. It consists of a depthwise convolution layer and a 1 × 1 convolution (also known as pointwise convolution) layer. The depthwise convolution takes the branching strategy to the extreme, i.e., the number of branches equals the number of input/output channels. After the light-weight depthwise convolution, the pointwise convolution applies 1 × 1 convolution to combine the outputs of the depthwise convolution.

The depthwise separable convolution essentially approximates a standard convolution by a depthwise convolution followed by a pointwise convolution. If a standard convolutional layer with a kernel of size D_K × D_K × C_I × C_O produces an output feature map of size D_F × D_F × C_O, the ratio of the computational cost (FLOPs) of the depthwise separable convolution to that of the standard convolution is

$$\frac{D_F^2 C_I (2 D_K^2 - 1) + D_F^2 C_O (2 C_I - 1)}{D_F^2 C_O (2 C_I D_K^2 - 1)} \approx \frac{1}{C_O} + \frac{1}{D_K^2}, \tag{1}$$

where D_K is the spatial dimension of the kernel, C_I is the number of input channels, C_O is the number of output channels, and D_F is the spatial width and height of the square output feature map.
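The approximation in Eq. (1) is easy to verify numerically. The following short Python check is our own illustration (the layer size is chosen arbitrarily, not taken from the paper); it counts the FLOPs of a standard convolution and of its depthwise separable counterpart:

def standard_conv_flops(d_f, d_k, c_i, c_o):
    # Each output element needs d_k*d_k*c_i multiplications and d_k*d_k*c_i - 1 additions.
    return d_f * d_f * c_o * (2 * c_i * d_k * d_k - 1)

def depthwise_separable_flops(d_f, d_k, c_i, c_o):
    depthwise = d_f * d_f * c_i * (2 * d_k * d_k - 1)   # one filter per input channel
    pointwise = d_f * d_f * c_o * (2 * c_i - 1)         # 1 x 1 convolution
    return depthwise + pointwise

d_f, d_k, c_i, c_o = 14, 3, 512, 512                    # example layer size
ratio = depthwise_separable_flops(d_f, d_k, c_i, c_o) / standard_conv_flops(d_f, d_k, c_i, c_o)
print(ratio, 1 / c_o + 1 / d_k ** 2)                    # both close to 0.113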
3.2. Linear-phase pointwise convolution (LPPC) kernel

In digital signal processing theory, the impulse response h(n) of a linear-phase finite impulse response (FIR) filter satisfies the condition

$$h(n) = \pm h(N - n), \quad 0 \le n \le N. \tag{2}$$

Depending on the parity of N and the symmetry of h(n), there are four types of linear-phase FIR filters:

• Type-I FIR filter: h(n) is symmetric and N is even;
• Type-II FIR filter: h(n) is symmetric and N is odd;
• Type-III FIR filter: h(n) is antisymmetric and N is even;
• Type-IV FIR filter: h(n) is antisymmetric and N is odd.

Linear-phase filters have fewer distinct coefficients than arbitrary filters and admit efficient fast implementations [36]. Moreover, in many applications such as image and video compression, it is well known that linear-phase filters can achieve better visual quality. Inspired by these results, we propose to introduce the linear-phase constraint to the convolution kernels of pointwise convolution layers; that is, we constrain the weights of these 1 × 1 kernels to be either symmetric or antisymmetric in the depth dimension. We denote these 1 × 1 convolution kernels with the linear-phase constraint as linear-phase pointwise convolution (LPPC) kernels.
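For readers less familiar with linear-phase filters, the property in Eq. (2) can be checked numerically: a symmetric impulse response has a frequency response whose phase is exactly the linear term -omega*N/2. The NumPy snippet below is our own illustration, not part of the paper's method:

import numpy as np

# Type-I style symmetric impulse response: h(n) = h(N - n) with N = 4 (even).
h = np.array([1.0, 3.0, 5.0, 3.0, 1.0])
N = len(h) - 1

H = np.fft.rfft(h, 512)                               # frequency response on [0, pi]
omega = np.linspace(0.0, np.pi, len(H))
# For a symmetric filter, H(w) = A(w) * exp(-j*w*N/2) with A(w) real, so
# removing the linear term leaves a (numerically) zero residual phase.
residual_phase = np.angle(H * np.exp(1j * omega * N / 2))
print(np.allclose(residual_phase, 0.0, atol=1e-8))    # True: the phase is linear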

Fig. 1. Depthwise separable convolution with a pointwise convolution layer that contains LPPC kernels.

Fig. 2 shows the four types of LPPC kernels. The symbols a, b, and c denote the weights of the LPPC kernels; we use these simple examples to demonstrate the symmetry of the four types. Analogous to the FIR filters, (a) is the Type-I LPPC kernel, (b) is the Type-II LPPC kernel, (c) is the Type-III LPPC kernel, and (d) is the Type-IV LPPC kernel.

Fig. 2. Four types of pointwise convolution kernels with linear phase constraint. (a) Type-I kernel; (b) Type-II kernel; (c) Type-III kernel; (d) Type-IV kernel.

Applying these linear-phase kernels in the pointwise convolution is equivalent to a "channel shrinkage" process. For instance, if the number of input channels is even, say 4, then to apply a 4-tap Type-I linear-phase kernel, the first channel is added to the fourth channel and the second channel is added to the third channel, as shown in Fig. 3. The resulting two-channel feature map is then convolved with a 1 × 1 × 2 kernel. After the "channel shrinkage", the number of parameters of a pointwise convolution kernel is reduced to half of the original value.

The LPPC kernels also decrease the amount of computation by reducing the number of multiplications. If all normal pointwise convolution kernels are replaced with LPPC kernels, the FLOPs of a pointwise convolution layer are reduced by

$$D_F^2 C_O (2 C_I - 1) - \left[ D_F^2 C_O (C_I - 1) + D_F^2 \frac{C_I}{2} \right] = (C_O - 0.5)\, C_I\, D_F^2. \tag{3}$$

3.3. The implementation of LPPC kernels

In the training process, we insert a 1 × 1 convolution layer before the pointwise convolution layer to implement the channel-wise addition or subtraction operations described in Section 3.2. The weights of the inserted 1 × 1 layers are specially initialized and set with "requires_grad=False", so these layers do not learn anything. Take the Type-I kernels as an example: if the number of input channels of a linear-phase pointwise convolution layer is M, the inserted 1 × 1 convolution layer has M/2 kernels, whose weights are initialized as

kernel 1:    1 0 0 ... 0 0 1
kernel 2:    0 1 0 ... 0 1 0
kernel 3:    0 0 1 ... 1 0 0          (4)
...
kernel M/2:  0 ... 0 1 1 0 ... 0

When deploying the LPPC-constrained network models on devices for inference, we remove the inserted 1 × 1 convolution layers and implement the addition or subtraction operations directly; the new input feature map is then convolved by a group of 1 × 1 kernels whose depth dimension is reduced by half.
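Both forms described above can be written down directly in PyTorch. The sketch below is our own illustration (layer sizes and function names are arbitrary, and only the symmetric Type-I case is shown): it builds the frozen folding convolution of Eq. (4) used during training, the folded half-depth pointwise convolution used at inference time, and checks that the two forms produce the same output.

import torch
import torch.nn as nn

# --- Training-time form: frozen folding conv (Eq. (4)) + learnable 1x1 conv ---
def make_folding_conv(in_channels, sign=+1.0):
    # Kernel k selects channel k and adds (sign=+1, symmetric) or subtracts
    # (sign=-1, antisymmetric) channel M-1-k, reducing M channels to M/2.
    half = in_channels // 2
    fold = nn.Conv2d(in_channels, half, kernel_size=1, bias=False)
    w = torch.zeros(half, in_channels, 1, 1)
    for k in range(half):
        w[k, k, 0, 0] = 1.0
        w[k, in_channels - 1 - k, 0, 0] = sign
    fold.weight.data.copy_(w)
    fold.weight.requires_grad_(False)        # fixed, not learned
    return fold

in_ch, out_ch = 64, 128
fold = make_folding_conv(in_ch, sign=+1.0)   # Type-I (symmetric, addition)
pointwise = nn.Conv2d(in_ch // 2, out_ch, kernel_size=1, bias=False)

x = torch.randn(1, in_ch, 14, 14)
y_train_form = pointwise(fold(x))

# --- Inference-time form: drop the folding conv and fold the channels directly ---
half = in_ch // 2
folded = x[:, :half] + torch.flip(x[:, half:], dims=[1])
y_infer_form = pointwise(folded)

print(torch.allclose(y_train_form, y_infer_form, atol=1e-5))  # True
print(sum(p.numel() for p in pointwise.parameters()))         # 32*128: half of 64*128

A Type-III layer would use a subtraction instead of the sum; the odd-channel types (Type-II and Type-IV) leave one channel unpaired.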
4. Designs and experiments

MobileNet [12] uses depthwise separable convolution to decrease the complexity of the model while achieving state-of-the-art accuracy on standard benchmarks. In the MobileNet network, the parameters of the pointwise convolution layers account for 74.59% of the entire network, so replacing the kernels of the pointwise convolution layers with LPPC kernels significantly reduces the number of parameters of the entire network. Therefore, we start from MobileNet to explore the effect of LPPC kernels on the compression and performance of the network model, and then apply the idea to MobileNet v2 [13] and ResNet50 [7] as well.

The architecture of MobileNet is shown in Table 1. It starts with a standard 3 × 3 convolution layer, followed by a stack of 13 depthwise separable convolution layer blocks, a GlobalPool layer, a fully connected layer, and a Softmax regression, sequentially.

Fig. 3. "Channel shrinkage" operation.

Table 1
The network architecture of the LPPC-constrained MobileNet.

Layer       Stride   Output size        Repeat
Image       -        224 × 224 × 3      -
Conv_1      2/1      112 × 112 × 32     1
Block_1     1        112 × 112 × 64     1
Block_2     2        56 × 56 × 128      1
Block_3     1        56 × 56 × 128      1
Block_4     2        28 × 28 × 256      1
Block_5     1        28 × 28 × 256      1
Block_6     2        14 × 14 × 512      1
Block_7     1        14 × 14 × 512      5
Block_8     2        7 × 7 × 1024       1
Block_9     1        7 × 7 × 1024       1
GlobalPool  7/2      1 × 1 × 1024       -
FC          -        1 × 1 × 1000       -
Softmax     -        1000/10            -

When we use ImageNet2012 [2] as the training data, we use 1.0MobileNet-224 as the baseline. When we use the CIFAR-10 dataset [37], we modify 1.0MobileNet-224 by changing the stride of the first standard 3 × 3 convolution layer to one and the GlobalPool stride to two. All neural network models are trained in PyTorch [38] using SGD with a learning rate that decreases with the epochs. We use the same regularization and data augmentation techniques as the original MobileNet, and the other training parameters are also the same as in the original network.

4.1. Fully LPPC-constrained MobileNet

As shown in Fig. 2, we design four types of linear-phase pointwise convolution kernels. To apply them in MobileNet, we first replace all kernels in the pointwise convolution layers with one type of LPPC kernel. Since all channel numbers N in the baseline MobileNet are even, we use N-1 for the Type-II and Type-IV kernels in the new models. Table 2 shows the performance of the new MobileNet models using the four types of LPPC kernels on CIFAR-10; Type-I LPPC kernels achieve slightly higher accuracy than the other three types. Table 3 shows the results of the LPPC-constrained MobileNet on the ImageNet2012 dataset, where Type-I kernels also achieve better performance. In summary, the Type-I LPPC-constrained MobileNet achieves higher classification accuracy than the other three types.

Table 2
Results of the LPPC kernels on the CIFAR-10 dataset.

Model                CIFAR-10 Accuracy Top-1,%   Million Parameters   Million FLOPs
Baseline MobileNet   86.29                       3.22                 48
MobileNet(Type-I)    85.91                       1.65                 39
MobileNet(Type-II)   85.54                       1.65                 39
MobileNet(Type-III)  85.50                       1.65                 39
MobileNet(Type-IV)   85.55                       1.65                 39

Table 3
Results of the LPPC kernels on the ImageNet dataset.

Model                ImageNet Accuracy Top-1/Top-5,%   Million Parameters   Million FLOPs
Baseline MobileNet   70.62/89.74                       4.23                 589
MobileNet(Type-I)    68.41/88.30                       2.66                 320
MobileNet(Type-II)   68.15/88.40                       2.66                 320
MobileNet(Type-III)  68.09/88.28                       2.66                 320
MobileNet(Type-IV)   68.00/88.21                       2.66                 320

Among the four types of LPPC kernels, Type-I and Type-II are symmetric, while Type-III and Type-IV are antisymmetric. As depicted in Fig. 3, convolution with the Type-I or Type-II LPPC kernels is equivalent to first applying an addition in the channel dimension and then applying a pointwise convolution whose number of input channels is reduced by half. Convolution with the Type-III or Type-IV LPPC kernels is instead equivalent to first applying a subtraction in the channel dimension. For the Type-III and Type-IV LPPC kernels, the subtraction makes many elements of the output feature map of the depthwise convolution layers less than or equal to zero, and after the new pointwise convolution layer the ReLU function turns many elements of the output feature map into zero; the subtraction can therefore destroy the features extracted by the depthwise convolution. In addition, the Type-I LPPC-constrained network has one more kernel than the Type-II LPPC-constrained network in each pointwise convolution layer, because the number of parameters of the Type-I LPPC kernel needs to be even while that of the Type-II LPPC kernel needs to be odd. The Type-I LPPC-constrained network therefore has slightly more parameters, which gives it higher accuracy.

4.2. Partially LPPC-constrained MobileNet

In this part, instead of applying the linear-phase constraint to all pointwise convolution kernels in MobileNet, we apply the constraint only to a portion of them, so that we can easily control the trade-off between complexity and performance. We keep the number of channels of every layer the same as in the original MobileNet and use a hyper-parameter L to control the portion of LPPC kernels among all kernels. Fig. 4 shows the new depthwise separable convolution layer block: it consists of a 3 × 3 depthwise convolution, BatchNorm, ReLU nonlinearity, two branches of 1 × 1 pointwise convolutions (one with unconstrained kernels and one with LPPC kernels), BatchNorm, and ReLU nonlinearity, sequentially.

Fig. 4. The new block with LPPC kernels.
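A minimal PyTorch sketch of the block in Fig. 4 is given below. It is an illustration under our own assumptions, not the authors' released code: the class and variable names are invented, the number of LPPC output channels is obtained by rounding L × C_out, and the constrained branch uses the Type-I folding described in Section 3.

import torch
import torch.nn as nn

class PartialLPPCBlock(nn.Module):
    """Depthwise separable block of Fig. 4 (sketch).

    A fraction L of the output channels comes from a Type-I LPPC pointwise
    branch (channel folding + half-depth 1x1 conv); the rest comes from an
    unconstrained 1x1 branch."""
    def __init__(self, in_channels, out_channels, stride=1, L=0.5):
        super().__init__()
        assert in_channels % 2 == 0
        lppc_out = int(round(L * out_channels))
        self.depthwise = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, stride=stride, padding=1,
                      groups=in_channels, bias=False),
            nn.BatchNorm2d(in_channels), nn.ReLU(inplace=True))
        self.plain_pw = nn.Conv2d(in_channels, out_channels - lppc_out, 1, bias=False)
        self.lppc_pw = nn.Conv2d(in_channels // 2, lppc_out, 1, bias=False)
        self.post = nn.Sequential(nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True))

    def forward(self, x):
        x = self.depthwise(x)
        half = x.size(1) // 2
        folded = x[:, :half] + torch.flip(x[:, half:], dims=[1])  # Type-I folding
        out = torch.cat([self.plain_pw(x), self.lppc_pw(folded)], dim=1)
        return self.post(out)

block = PartialLPPCBlock(64, 128, stride=2, L=0.5)
print(block(torch.randn(1, 64, 28, 28)).shape)  # torch.Size([1, 128, 14, 14])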

Tables 4 and 5 show the results of experiments using different L values in the new MobileNet models, where the same L and Type-I kernels are used for every pointwise convolution layer. Table 4 shows the results of the LPPC-constrained MobileNet architectures with different L values on the CIFAR-10 dataset. As the proportion of LPPC kernels increases, the accuracy of the new MobileNet model gradually decreases in general, but the number of parameters and the amount of computation also decrease. The exception is that the accuracy with L=1 is higher than with L=0.75. In the last row, L' denotes a set of L values (0.21, 0.40, 0.15, 0.79, 0.74, 0.15, 0.62, 0.46, 0.33, 0.26, 0.76, 0.14, 0.20), where each L is the proportion of Type-I LPPC kernels in the corresponding pointwise convolution layer. These values were obtained with a hyper-parameter tuning tool named HpBandSter [39], which is based on Bayesian optimization and bandit-based methods. The architecture using this set of L values achieves slightly higher accuracy than the baseline MobileNet model on CIFAR-10, which indicates that there are LPPC-constrained MobileNet architectures that can achieve better accuracy than the original MobileNet.

Table 4
Results of different L on the CIFAR-10 dataset.

Model                 CIFAR-10 Accuracy Top-1,%   Decrease in Top-1 Accuracy,%   Million Parameters   Million FLOPs
Baseline MobileNet    86.29                       -                              3.22                 48
MobileNet(L=0.25)     86.22                       0.07                           2.82                 42
MobileNet(L=0.5)      86.04                       0.25                           2.43                 37
MobileNet(L=0.75)     85.76                       0.53                           2.04                 31
MobileNet(L=1)        85.91                       0.38                           1.65                 26
MobileNet(L'=Search)  86.31                       +0.02                          2.69                 39

On the ImageNet2012 classification dataset, the accuracy of the new MobileNet architectures is shown in Table 5. As the proportion of LPPC kernels in each pointwise convolution layer increases, the amount of computation and the number of parameters of the models decrease prominently with only a small drop in accuracy. Notably, the drop in top-5 accuracy is smaller than the drop in top-1 accuracy, except for the model with L=0.25. When L is 1, although the top-1 and top-5 accuracy decrease by 2.21% and 1.44% respectively, the number of parameters drops by 37% and the FLOPs drop by 45%. MobileNet with L'=rand1 and MobileNet with L'=rand2 are architectures in which the hyper-parameter L of each pointwise layer is a random number between 0 and 1. As on CIFAR-10, we used the existing hyper-parameter tuning algorithm to explore better architectures on ImageNet, but it did not work; we are therefore developing a new method for this purpose.

Table 5
Results of different L on the ImageNet dataset.

Model                ImageNet Accuracy Top-1,%/Top-5,%   Decrease in Top-1 Accuracy,%   Decrease in Top-5 Accuracy,%   Million Parameters   Million FLOPs
Baseline MobileNet   70.62/89.74                         -                              -                              4.23                 589
MobileNet(L=0.25)    70.44/89.35                         0.18                           0.39                           3.83                 522
MobileNet(L=0.5)     69.93/89.26                         0.69                           0.48                           3.44                 455
MobileNet(L=0.75)    69.04/88.75                         1.58                           0.99                           3.05                 387
MobileNet(L=1)       68.41/88.30                         2.21                           1.44                           2.66                 320
MobileNet(L'=rand1)  69.15/88.93                         1.47                           0.81                           2.92                 381
MobileNet(L'=rand2)  69.67/89.20                         0.95                           0.54                           3.59                 474

According to the data in Tables 4 and 5, we plot the relationship between classification accuracy and the number of parameters in Fig. 5. The figure shows that accuracy is positively correlated with the parameter count, but the decrease in accuracy when using LPPC kernels is acceptable in applications that need low storage cost and low latency. Moreover, the outlying data points (MobileNet with L=1 on CIFAR-10, MobileNet with L'=Search on CIFAR-10, and MobileNet with L'=rand1 on ImageNet) show that there are better architectures with fewer parameters but higher accuracy.

Fig. 5. The relation between the number of parameters and accuracy.

In the article proposing MobileNet [12], the authors use a width multiplier and a resolution multiplier to trade off between computational cost and accuracy. MobileNet v2 [13] introduces the linear bottleneck and inverted residual structure to make a more efficient architecture. ShuffleNet [14] decreases the computational cost of the 1 × 1 convolution layer by introducing group convolution and uses a channel shuffle operation to make information flow across all groups. EfficientNet [20] uses a compound scaling factor to regulate the model architecture. GhostNet [40] develops the Ghost module to decrease the computational cost and extract more features. The approaches in MobileNet and EfficientNet use hyper-parameters to regulate network architectures, while those in MobileNet v2, ShuffleNet, and GhostNet design more efficient modules. Both ways require professional experience, whereas using LPPC kernels to make existing architectures more efficient is a simple alternative. The LPPC kernels likewise aim to make network architectures more efficient by decreasing the computational cost of the 1 × 1 convolution layers. Moreover, the convolution operation using LPPC kernels can be seen as a new convolution operation and used to construct efficient architectures, and the methods mentioned above can still be applied to architectures built with LPPC kernels. Furthermore, pruning and quantization can be used to further compress the efficient neural network models that use LPPC kernels.
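Before moving on, the parameter counts in Table 5 can be sanity-checked from the pointwise-parameter share quoted in the introduction (74.59%); constraining a fraction L of the pointwise kernels halves their parameters. The back-of-the-envelope script below is ours, not part of the paper, and the small differences from the table come from rounding:

total_params = 4.23e6                   # baseline 1.0MobileNet-224 parameters
pointwise_params = 0.7459 * total_params

def lppc_params(L):
    # A fraction L of the pointwise kernels is constrained, halving their parameters.
    return total_params - 0.5 * L * pointwise_params

for L in (0.25, 0.5, 0.75, 1.0):
    print(L, round(lppc_params(L) / 1e6, 2))   # ~3.84, 3.44, 3.05, 2.65 million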

4.3. The LPPC kernels in other network architectures

In current neural network architecture design, small convolution kernels are used frequently: VGG [5] only uses 3 × 3 receptive fields, ResNet [7] only uses 3 × 3 and 1 × 1 convolution kernels, and efficient network designs such as MobileNet v2 [13] and ShuffleNet [14] only use 3 × 3 and 1 × 1 convolution layers. In this part, we replace all 1 × 1 kernels of the 1 × 1 convolutions that follow 3 × 3 convolutions with Type-I LPPC kernels in MobileNet v2 and ResNet50. The results are shown in Table 6. The LPPC-MobileNet v2 model achieves 69.20% top-1 accuracy on the ImageNet dataset with 14% fewer parameters, and the LPPC-ResNet50 model has 10% fewer parameters than the baseline ResNet50 with a 1.32% drop in top-1 accuracy. In summary, the Type-I LPPC kernels can reduce the number of parameters with an acceptable drop in model accuracy.

Table 6
Results of Type-I LPPC-constrained MobileNet v2 and ResNet50.

Model               ImageNet Accuracy Top-1,%   Decrease in Top-1 Accuracy,%   Million Parameters   Million FLOPs
MobileNet v2        70.20                       -                              3.504                328
LPPC-MobileNet v2   69.20                       1.00                           3.016                265
ResNet50            77.15                       -                              25.557               4135
LPPC-ResNet50       75.83                       1.32                           23.038               3772
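A simple way to carry out this replacement programmatically is to walk a model and swap its eligible 1 × 1 convolutions for Type-I LPPC layers. The sketch below is our own illustration: for brevity it keys only on kernel size, stride, groups, and an even input-channel count rather than on "behind a 3 × 3 convolution", and the resulting model must of course be retrained.

import torch
import torch.nn as nn
import torchvision.models as models

class TypeILPPCPointwise(nn.Module):
    # Inference-time Type-I LPPC pointwise conv: fold channels, then half-depth 1x1.
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.half = in_channels // 2
        self.conv = nn.Conv2d(self.half, out_channels, 1, bias=False)

    def forward(self, x):
        return self.conv(x[:, :self.half] + torch.flip(x[:, self.half:], dims=[1]))

def replace_pointwise_with_lppc(model):
    # Swap every stride-1, non-grouped 1x1 conv with an even input-channel count.
    for name, module in model.named_children():
        if isinstance(module, nn.Conv2d) and module.kernel_size == (1, 1) \
                and module.stride == (1, 1) and module.groups == 1 \
                and module.in_channels % 2 == 0:
            setattr(model, name,
                    TypeILPPCPointwise(module.in_channels, module.out_channels))
        else:
            replace_pointwise_with_lppc(module)
    return model

net = replace_pointwise_with_lppc(models.resnet50())
print(sum(p.numel() for p in net.parameters()) / 1e6)  # parameter count (millions) after the swap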

5. Conclusion and future work

In this paper, we propose a new pointwise convolution kernel named the LPPC kernel, inspired by the linear-phase FIR filter. It can decrease the computational complexity and reduce the number of parameters of a deep convolutional neural network in which the 1 × 1 convolution layers account for a large part of the computation.

We first test the four types of LPPC kernels in the MobileNet architecture; the results show that Type-I kernels acquire higher accuracy on the standard benchmarks. We then explore the effect of the proportion of LPPC kernels on MobileNet, and the results show that the amount of computation and the number of parameters of the models decrease as the proportion of LPPC kernels increases. Using LPPC kernels to compress existing models may lead to a small decrease in accuracy, but the convolution operation using LPPC kernels can be seen as a new form of convolution and be used in efficient neural network architecture design in the future. The result of the search method on the CIFAR-10 dataset indicates that there is a set of L values for which the LPPC-constrained MobileNet achieves near-optimal accuracy.

In future work, we will develop new algorithms to explore better LPPC-constrained MobileNet architectures and plan to design new efficient CNN architectures using LPPC kernels. Moreover, the LPPC kernel is friendly to hardware accelerator design because it can reduce memory access cost and has potential advantages in reducing hardware energy consumption. We will design a dedicated FPGA hardware accelerator for efficient network architectures based on the LPPC kernels.

CRediT authorship contribution statement

Feng Liang: Project administration, Methodology, Writing - review & editing. Zhichao Tian: Conceptualization, Writing - original draft, Software. Ming Dong: Investigation, Writing - original draft, Software. Shuting Cheng: Formal analysis, Writing - review & editing. Li Sun: Software, Visualization. Hai Li: Writing - review & editing, Validation. Yiran Chen: Methodology, Supervision. Guohe Zhang: Project administration, Resources, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 61474093), the Natural Science Foundation of Shaanxi Province, China (No. 2020JM-006), and the National Science Foundation for the Distinguished Young Scholars of China (Grant No. 61701531).

References

[1] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[2] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., ImageNet large scale visual recognition challenge, Int. J. Computer Vision 115 (3) (2015) 211–252.
[3] K. Shi, J. Wang, S. Zhong, Y. Tang, J. Cheng, Hybrid-driven finite-time H∞ sampling synchronization control for coupling memory complex networks with stochastic cyber attacks, Neurocomputing 387 (2020) 241–254.
[4] K. Shi, J. Wang, Y. Tang, S. Zhong, Reliable asynchronous sampled-data filtering of T-S fuzzy uncertain delayed neural networks with stochastic switched topologies, Fuzzy Sets Syst. 381 (2020) 1–25.
[5] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556, 2014.
[6] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[7] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[8] G. Huang, Z. Liu, L. Van Der Maaten, K.Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[9] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
[10] Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, J. Ngiam, Q.V. Le, Y. Wu, et al., GPipe: efficient training of giant neural networks using pipeline parallelism, Adv. Neural Inform. Process. Syst. (2019) 103–112.
[11] J. Cheng, P.-S. Wang, G. Li, Q.-H. Hu, H.-Q. Lu, Recent advances in efficient computation of deep convolutional neural networks, Front. Inform. Technol. Electron. Eng. 19 (1) (2018) 64–77.
[12] A.G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, MobileNets: efficient convolutional neural networks for mobile vision applications, arXiv preprint arXiv:1704.04861, 2017.
[13] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.-C. Chen, MobileNetV2: inverted residuals and linear bottlenecks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
[14] X. Zhang, X. Zhou, M. Lin, J. Sun, ShuffleNet: an extremely efficient convolutional neural network for mobile devices, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6848–6856.
[15] N. Ma, X. Zhang, H.-T. Zheng, J. Sun, ShuffleNet V2: practical guidelines for efficient CNN architecture design, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 116–131.
[16] T. Zhang, G.-J. Qi, B. Xiao, J. Wang, Interleaved group convolutions, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4373–4382.
[17] B. Zoph, Q.V. Le, Neural architecture search with reinforcement learning, arXiv preprint arXiv:1611.01578, 2017.
[18] M. Lin, Q. Chen, S. Yan, Network in network, arXiv preprint arXiv:1312.4400, 2013.
[19] F.N. Iandola, S. Han, M.W. Moskewicz, K. Ashraf, W.J. Dally, K. Keutzer, SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size, arXiv preprint arXiv:1602.07360, 2016.
[20] M. Tan, Q.V. Le, EfficientNet: rethinking model scaling for convolutional neural networks, arXiv preprint arXiv:1905.11946, 2019.
[21] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, Q.V. Le, MnasNet: platform-aware neural architecture search for mobile, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2820–2828.
[22] B. Zoph, V. Vasudevan, J. Shlens, Q.V. Le, Learning transferable architectures for scalable image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8697–8710.
[23] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al., Searching for MobileNetV3, in: Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1314–1324.
[24] S. Han, H. Mao, W.J. Dally, Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding, arXiv preprint arXiv:1510.00149, 2015.
[25] P. Gysel, J. Pimentel, M. Motamedi, S. Ghiasi, Ristretto: a framework for empirical study of resource-efficient inference in convolutional neural networks, IEEE Trans. Neural Networks Learn. Syst. 29 (11) (2018) 5784–5789.
[26] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, Y. Zou, DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients, arXiv preprint arXiv:1606.06160, 2016.
[27] A. Zhou, A. Yao, Y. Guo, L. Xu, Y. Chen, Incremental network quantization: towards lossless CNNs with low-precision weights, arXiv preprint arXiv:1702.03044, 2017.
[28] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, D. Kalenichenko, Quantization and training of neural networks for efficient integer-arithmetic-only inference, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2704–2713.
[29] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, Y. Bengio, Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1, arXiv preprint arXiv:1602.02830, 2016.
[30] M. Courbariaux, Y. Bengio, J.-P. David, BinaryConnect: training deep neural networks with binary weights during propagations, in: Advances in Neural Information Processing Systems, 2015, pp. 3123–3131.
[31] N. Mellempudi, A. Kundu, D. Mudigere, D. Das, B. Kaul, P. Dubey, Ternary neural networks with fine-grained quantization, arXiv preprint arXiv:1705.01462, 2017.
[32] C. Zhu, S. Han, H. Mao, W.J. Dally, Trained ternary quantization, arXiv preprint arXiv:1612.01064, 2016.
[33] C. Buciluă, R. Caruana, A. Niculescu-Mizil, Model compression, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006, pp. 535–541.
[34] G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, arXiv preprint arXiv:1503.02531, 2015.
[35] L. Sifre, Rigid-motion scattering for image classification, Ph.D. thesis, CMAP, École Polytechnique, 2014.
[36] P. Vaidyanathan, Multirate Systems and Filter Banks, Prentice Hall, Englewood Cliffs, 1993.
[37] A. Krizhevsky, Learning multiple layers of features from tiny images, 2009.
[38] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, A. Lerer, Automatic differentiation in PyTorch, 2017.
[39] S. Falkner, A. Klein, F. Hutter, BOHB: robust and efficient hyperparameter optimization at scale, in: Proceedings of the 35th International Conference on Machine Learning, Vol. 80 of Proceedings of Machine Learning Research, PMLR, 2018, pp. 1437–1446.
[40] K. Han, Y. Wang, Q. Tian, J. Guo, C. Xu, C. Xu, GhostNet: more features from cheap operations, arXiv preprint arXiv:1911.11907, 2019.
Feng Liang has been an associate professor in the School of Microelectronics at Xi'an Jiaotong University since 2011. He received the B.E. degree from Zhengzhou University, P.R. China, in 1998, and the M.E. and Ph.D. degrees from Xi'an Jiaotong University, P.R. China, in 2001 and 2008, respectively. His current research interests include signal processing, machine learning, VLSI design and test, and computer architecture.

Zhichao Tian is currently a Master's degree candidate at the School of Microelectronics, Xi'an Jiaotong University, China. He received a bachelor's degree in microelectronics from Xi'an University of Technology, China, in June 2016. His research focuses on convolutional neural networks for image recognition and digital integrated circuit design.

Ming Dong is currently a Master's degree candidate at the School of Software, Xi'an Jiaotong University, China. She received a bachelor's degree in software engineering from Hunan Normal University, China, in June 2017. Her research focuses on lightweight neural network model compression and neural network architecture search.

Shuting Cheng is currently pursuing the B.E. degree in microelectronics science and engineering with the Department of Microelectronics, Xi'an Jiaotong University, Xi'an, China. Her current research interests include lightweight neural networks and digital integrated circuit design.

Li Sun studied computer science at Northwestern Polytechnical University, China, where she received the Ph.D. degree in 2010. In 2016, she became an associate professor of information and communication engineering at the Air Force Engineering University. She is currently working on machine learning in the School of Microelectronics at Xi'an Jiaotong University. Her research interests include image processing, machine learning, and probabilistic graphical models.

Hai (Helen) Li is Clare Boothe Luce Associate Professor in the Electrical and Computer Engineering Department at Duke University, USA. She works on hardware/software co-design for accelerating machine learning, brain-inspired computing systems, and memory architecture and optimization. She has authored or co-authored more than 300 technical papers in these areas and has authored a book entitled Nonvolatile Memory Design: Magnetic, Resistive, and Phase Changing (CRC Press, 2011). Dr. Li serves as the associate director of the NSF Industry-University Cooperative Research Center (IUCRC) for Alternative Sustainable and Intelligent Computing (ASIC) and co-director of the Duke Center for Evolutionary Intelligence (CEI). She also serves as an Associate Editor of IEEE TVLSI, IEEE TCAD, ACM TODAES, ACM TACO, IEEE TMSCS, IEEE TECS, IEEE CEM, and IET-CPS, and has served on the organization and technical program committees of over 30 international conference series. Dr. Li is a recipient of the NSF CAREER Award (2012), the DARPA Young Faculty Award (2013), the TUM-IAS Hans Fischer Fellowship from Germany (2017), and seven best paper awards. She is a Fellow of IEEE and a Distinguished Member of ACM, a distinguished speaker of ACM (2017-2020), and a distinguished lecturer of the IEEE CAS Society (2018-2019).

Yiran Chen received the B.S. and M.S. degrees from Tsinghua University and the Ph.D. degree from Purdue University in 2005. After five years in industry, he joined the University of Pittsburgh in 2010 as an Assistant Professor and was promoted to Associate Professor with tenure in 2014, holding the Bicentennial Alumni Faculty Fellowship. He is now a Professor in the Department of Electrical and Computer Engineering at Duke University, serving as the director of the NSF Industry-University Cooperative Research Center (IUCRC) for Alternative Sustainable and Intelligent Computing (ASIC) and co-director of the Duke University Center for Computational Evolutionary Intelligence (CEI), focusing on research into new memory and storage systems, machine learning and neuromorphic computing, and mobile computing systems. Dr. Chen has published one book and more than 350 technical publications and has been granted 94 US patents. He serves or has served as an associate editor of several IEEE and ACM transactions/journals and on the technical and organization committees of more than 50 international conferences. He has received 6 best paper awards and 13 best paper nominations from international conferences. He is a recipient of the NSF CAREER Award and the ACM SIGDA Outstanding New Faculty Award. He is a Fellow of IEEE and a Distinguished Member of ACM, a distinguished lecturer of IEEE CEDA, and a recipient of the Humboldt Research Fellowship for Experienced Researchers.

Guohe Zhang was born in Hubei, China, in 1981. He received the B.S. and Ph.D. degrees from Xi'an Jiaotong University, China, in 2003 and 2008, respectively. He is currently a Professor in the School of Microelectronics, Xi'an Jiaotong University, Xi'an, China. His research interests include semiconductor device physics and integrated circuit design, image processing and intelligent systems, algorithm and hardware co-design and implementation for deep learning and signal processing systems, and error-resilient low-cost computing techniques for embedded systems.