
Design Space Exploration of Convolution Algorithms to Accelerate CNNs on FPGA


Kala S1, Debdeep Paul2, Babita R Jose1, Nalesh S3
1School of Engineering, 3Department of Electronics, Cochin University of Science and Technology, Kochi-22, India
{kalas, babitajose, nalesh}@cusat.ac.in
2Department of Electrical Engineering, Indian Institute of Technology Patna, India
debdeep.ee15@iitp.ac.in

Abstract—Deep Neural Networks (DNN) are promising solutions for various artificial intelligence tasks. The Convolutional Neural Network (CNN) is a variant of DNN that is widely used in computer vision tasks such as image and face recognition, autonomous vehicles, games, video surveillance and various medical applications. CNNs are both compute and memory bound, and the convolutional layers are the most computationally intensive operations in a CNN. Owing to the computation demanded by these convolutions, FPGAs are well suited for accelerating CNNs. In this paper we carry out a design space exploration of various algorithms for performing the operations in different convolutional layers of CNNs. An analysis has been done to select an appropriate algorithm for each convolution layer of the AlexNet CNN model based on the kernel size and the input feature map. The first convolution layer of AlexNet, with three input channels of 227×227 features and 96 kernels of size 11×11, has been implemented on a Xilinx Virtex-7 FPGA.

Index Terms—Convolutional Neural Network, Deep learning, FFT, FPGA, Winograd minimal filtering

I. INTRODUCTION

In recent years, deep learning has gained wide popularity for performing various machine learning tasks. Among deep learning techniques, Deep Neural Networks (DNN) have the capability of learning high level features with high performance. A common form of DNN is the Convolutional Neural Network (CNN), which consists of a number of convolutional layers. CNNs find applications in speech processing, natural language processing, image classification, face recognition, cancer detection, weather forecasting and many consumer electronic devices. CNNs are accelerated using Graphics Processing Units (GPU), Application Specific Integrated Circuits (ASIC) and Field Programmable Gate Arrays (FPGA). GPUs are widely used for CNN tasks, but their power consumption is very high and hence unsuitable for embedded system applications [13]. Various FPGA based implementations [6]–[10], [13], [14], [16]–[18] and ASIC implementations [1], [2] of CNNs are available from research groups. ASIC based accelerators give high performance but with limited flexibility, whereas FPGA based implementations show acceptable power consumption and performance.

Convolution layers account for more than 90% of the total computation in a CNN. As the CNN model goes deeper, the number of layers and the computational complexity increase. Therefore, in this paper we restrict our focus to these layers. Various techniques for performing convolution operations are available in the literature. Three major methods for computing convolutions, namely the conventional method, Fast Fourier Transform (FFT) based convolution and the Winograd minimal filtering method, are discussed in this paper. Depending on the input feature map size and the kernel size, the best choice of convolution algorithm varies from layer to layer of a CNN model. In this paper we perform a design space exploration of these algorithms for computing convolutions in CNNs.

The rest of the paper is organized as follows. Section II gives the background and related work. Section III discusses FFT based convolution and Winograd minimal filtering for computing convolution operations. Section IV gives the design space exploration of convolution techniques. Implementation and evaluation results are described in Section V. Section VI concludes the paper.

II. BACKGROUND AND RELATED WORK

CNNs are composed of different layers such as convolution (CONV) layers, pooling layers, normalization layers and fully connected (FC) layers. CONV layers perform feature extraction and pooling layers perform sub-sampling. Feature classification is performed by the fully connected layers, which are memory bound owing to the large number of weights in their computation. In a convolution operation, each element of the input is multiplied with a coefficient of the filter and the results are summed up; this is a MAC (multiply-accumulate) operation. CONV layers account for about 90% of the total computation in a CNN.

Consider M channels of H × W input feature maps and K × K kernels, a stride of S, and an output feature map denoted by Y. Conventional convolution (direct convolution) is given by the formula in Equation (1).

Y_{i,n,x,y} = \sum_{m=1}^{M} \sum_{u=1}^{K} \sum_{v=1}^{K} D_{i,m,\,xS+u,\,yS+v} \times G_{n,m,u,v}    (1)

Fig. 1. Convolution Operation

The convolution operation is illustrated in Fig. 1.
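To make Equation (1) concrete, the following NumPy sketch implements direct (conventional) convolution for one input image; the function name, the zero-based indexing and the example sizes are our own choices and not taken from the paper.

```python
import numpy as np

def direct_conv(D, G, stride=1):
    """Direct (conventional) convolution as in Equation (1).

    D: input feature maps, shape (M, H, W)   -- M channels
    G: kernels, shape (N, M, K, K)           -- N output channels
    Returns Y with shape (N, H_out, W_out).
    """
    M, H, W = D.shape
    N, _, K, _ = G.shape
    H_out = (H - K) // stride + 1
    W_out = (W - K) // stride + 1
    Y = np.zeros((N, H_out, W_out))
    for n in range(N):                 # each output channel
        for x in range(H_out):
            for y in range(W_out):
                # multiply-accumulate over all channels and kernel taps
                patch = D[:, x*stride:x*stride+K, y*stride:y*stride+K]
                Y[n, x, y] = np.sum(patch * G[n])
    return Y

# Small usage example (sizes are arbitrary, chosen only for illustration)
D = np.random.rand(3, 8, 8)
G = np.random.rand(2, 3, 3, 3)
print(direct_conv(D, G, stride=1).shape)   # (2, 6, 6)
```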
ASIC based implementations of CNNs are presented in [1], [2]. The authors in [2] propose an architecture with 64 interconnected nodes. A reconfigurable and scalable CNN architecture with a reconfigurable deep neural architecture (DNA) is presented in [1]. Various FPGA based implementations of CNNs are presented in [7], [9], [10], [13], [15], [17], [19]. An architecture which exploits the parallelism in CNNs has been proposed in [10]. In [15], the authors propose an accelerator for CNNs with data reuse. An OpenCL based FPGA implementation of a CNN model is given in [20]. All of the above research papers implement the convolution layers using the conventional method, which relies on general matrix multiplication. FFT based convolutions are presented in [12], [21], but they are mainly useful for large filters. An FFT overlap-and-add convolution technique is used in [21] instead of plain FFT convolution.

Winograd based CNNs are proposed in [4]–[6]. The performance of a Winograd based CNN depends on the Winograd tile size chosen for the implementation. The tile size is denoted as F(m, r), where m denotes the number of outputs of an r tap filter. The most commonly used tile sizes are F(2, 3) and F(4, 3) [5], [6], [19]. A larger tile size gives better performance, but at the cost of reduced precision. Winograd minimal filtering reduces the number of multiplications in a convolution compared to the conventional method.
III. FAST ALGORITHMS FOR CONVOLUTION

Commonly used fast convolution algorithms are FFT based convolution and Winograd minimal filtering. These algorithms are discussed in this section.

A. FFT Convolution

Convolution can be performed efficiently in the frequency domain. Equation (2) gives the mathematical representation of FFT based convolution:

x(n) * h(n) = \mathrm{IFFT}\{\mathrm{FFT}(x(n)) \times \mathrm{FFT}(h(n))\}    (2)

Fig. 2. FFT based Convolution

Here, the Fourier transforms of the input image and the kernel are taken, transforming both to the frequency domain. Both FFTs must be of the same length, so the shorter sequence is augmented (zero padded) before the element-wise multiplication. An inverse FFT then brings the result back to the time domain. Fig. 2 shows the FFT based convolution scheme: the FFTs of the input feature and the kernel are computed, multiplied together element-wise, and the inverse FFT of the product gives the convolution output.
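As a quick software-level illustration of Equation (2) (not the paper's hardware design), the sketch below performs a 1-D linear convolution via NumPy FFTs; the padding length len(x) + len(h) - 1 and the function name are our choices.

```python
import numpy as np

def fft_conv1d(x, h):
    """Linear convolution via FFT, as in Equation (2).

    Both sequences are zero padded to the full output length
    len(x) + len(h) - 1 so their FFTs have the same size and the
    circular convolution equals the linear one.
    """
    n = len(x) + len(h) - 1
    X = np.fft.rfft(x, n)          # FFT of the (padded) input
    H = np.fft.rfft(h, n)          # FFT of the (padded) kernel
    return np.fft.irfft(X * H, n)  # element-wise product, then inverse FFT

x = np.random.rand(227)
h = np.random.rand(11)
print(np.allclose(fft_conv1d(x, h), np.convolve(x, h)))  # True
```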
B. Winograd Minimal Filtering

Winograd minimal filtering is a fast algorithm, based on the Chinese Remainder Theorem, for performing the convolution operation. In the 2D Winograd algorithm, multiple adjacent image tiles are combined and the computation is optimized to reduce the number of multiplications. The first step transforms both the input and kernel tiles using simple matrix multiplications. Element-wise multiplication is then performed between the transformed input and kernel tiles. Finally, an output transform step generates the convolution output. This algorithm is most efficient for convolutions with small kernel sizes.

The Winograd algorithm for computing m outputs of an r tap filter requires only m + r − 1 multiplications [4]. Consider F(m, r) = F(2, 3), which computes two outputs of a 3 tap filter. The algorithm is given below, where d denotes the input data and g denotes the filter coefficients:

F(2,3) = \begin{bmatrix} d_0 & d_1 & d_2 \\ d_1 & d_2 & d_3 \end{bmatrix} \begin{bmatrix} g_0 \\ g_1 \\ g_2 \end{bmatrix} = \begin{bmatrix} m_0 + m_1 + m_2 \\ m_1 - m_2 - m_3 \end{bmatrix}

where m_0, m_1, m_2 and m_3 are calculated as in Equations (3) and (4),

m_0 = (d_0 - d_2)\,g_0, \qquad m_1 = (d_1 + d_2)\,\frac{g_0 + g_1 + g_2}{2}    (3)

m_2 = (d_2 - d_1)\,\frac{g_0 - g_1 + g_2}{2}, \qquad m_3 = (d_1 - d_3)\,g_2    (4)

Here, the input data is a tile of size m + r − 1, i.e., 4 in the case of F(2, 3). Computing m_0, m_1, m_2 and m_3 requires only 4 multiplications.

In matrix form, the output Y is written as

Y = A^T [(Gg) \odot (B^T d)]    (5)

where \odot denotes element-wise multiplication.

The 2D minimal algorithm for computing an m × m output tile with an r × r filter requires (m + r − 1) × (m + r − 1) multiplications on an (m + r − 1) × (m + r − 1) input tile. This can be denoted as in Equation (6):

\mu(F(m \times m, r \times r)) = \mu(F(m, r)) \times \mu(F(m, r)) = (m + r - 1)^2    (6)
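A minimal NumPy sketch of the 1-D F(2, 3) algorithm above, checked against direct convolution; the function name and the test values are illustrative only.

```python
import numpy as np

def winograd_F23(d, g):
    """F(2,3): two outputs of a 3-tap filter with 4 multiplications (Eqs. 3-5)."""
    m0 = (d[0] - d[2]) * g[0]
    m1 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m2 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m3 = (d[1] - d[3]) * g[2]
    return np.array([m0 + m1 + m2, m1 - m2 - m3])

d = np.random.rand(4)          # input tile of size m + r - 1 = 4
g = np.random.rand(3)          # 3-tap filter
direct = np.array([d[0:3] @ g, d[1:4] @ g])     # two direct dot products (6 mults)
print(np.allclose(winograd_F23(d, g), direct))  # True
```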
TABLE I
DESIGN PARAMETERS FOR CONVOLUTION

Tile      Input   Output   Kernel size
F(3, 2)   4       3        2
F(2, 3)   4       2        3
F(3, 3)   5       3        3
F(4, 3)   6       4        3
F(6, 3)   8       6        3

For the 2D case, the output Y can be written as

Y = A^T [(G g G^T) \odot (B^T d B)] A    (7)

The transform matrices A, B and G can be precomputed once the values of m and r are known. Conventional convolution for the one dimensional case F(2, 3) needs 2 × 3 = 6 multiplications, whereas the Winograd algorithm needs only 2 + 3 − 1 = 4. In the 2D case, F(2 × 2, 3 × 3) involves 2² × 3² = 36 multiplications with conventional convolution, whereas Winograd takes 4² = 16 multiplications, a reduction in multiplication complexity by a factor of 36/16 = 2.25. In general, the ratio of hardware complexity, in terms of multiplications, between conventional and Winograd convolution is given by Equation (8):

\text{Ratio of Multiplications} = \frac{m^2 \times r^2}{(m + r - 1)^2}    (8)
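The 2D transform in Equation (7) can be sketched in NumPy using the standard F(2×2, 3×3) transform matrices from Lavin and Gray [4]; treat this as an illustrative software model of the algorithm, not the paper's hardware datapath.

```python
import numpy as np

# F(2x2, 3x3) transform matrices (Lavin & Gray [4])
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G = np.array([[1,    0,   0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0,    0,   1]], dtype=float)
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_2x2_3x3(d, g):
    """Equation (7): 2x2 output tile from a 4x4 input tile and a 3x3 kernel."""
    U = G @ g @ G.T                 # kernel transform, 4x4
    V = B_T @ d @ B_T.T             # input transform, 4x4
    return A_T @ (U * V) @ A_T.T    # 16 element-wise multiplications

d = np.random.rand(4, 4)
g = np.random.rand(3, 3)
# direct 2x2 'valid' result for the same tile, 36 multiplications
direct = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)] for i in range(2)])
print(np.allclose(winograd_2x2_3x3(d, g), direct))  # True
```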
IV. DESIGN SPACE EXPLORATION

FFT and Winograd minimal filtering are fast algorithms for performing convolutions. However, each algorithm is optimal only for certain input feature and kernel sizes. For large kernels, either direct convolution or FFT based convolution is suited, while the Winograd algorithm gives better performance for small kernels. As the kernel size increases, using a large Winograd tile also increases the transform size of the tile, which degrades precision.

We have conducted experiments with various kernel sizes and input feature maps for conventional, Winograd minimal filtering and FFT based convolution. The kernel sizes and input feature maps considered are shown in Table I. We have chosen radix-2 FFTs for the convolution computations, with zero padding to reach power-of-two lengths. The Winograd tile sizes chosen for the experiments are F(2, 3), F(3, 3), F(4, 3), F(3, 2) and F(6, 3). Fig. 3 shows the execution time of the convolution operation for the different methods, with varying input and filter sizes.

Fig. 3. Execution time for various convolution techniques
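For readers who want to reproduce this kind of comparison in software before committing to hardware, a rough timing sweep can be done with SciPy's direct and FFT convolutions (a Winograd variant would use kernels such as the F(2×2, 3×3) sketch above); the sizes and timing methodology below are our own and do not reproduce the paper's measurement setup.

```python
import timeit
import numpy as np
from scipy.signal import convolve2d, fftconvolve

for feat, k in [(13, 3), (27, 5), (227, 11)]:   # (feature size, kernel size)
    x = np.random.rand(feat, feat)
    h = np.random.rand(k, k)
    t_direct = timeit.timeit(lambda: convolve2d(x, h, mode='valid'), number=20)
    t_fft = timeit.timeit(lambda: fftconvolve(x, h, mode='valid'), number=20)
    print(f"{feat}x{feat}, {k}x{k}: direct {t_direct:.4f}s  fft {t_fft:.4f}s")
```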
AlexNet Model

AlexNet [3] was the first CNN model to win the ImageNet challenge, in 2012. It has five CONV layers and three FC layers, and uses filters of sizes 11, 5 and 3. The configuration of the AlexNet model is given in Table II. The number of parameters and the number of MAC operations in the convolutional layers of AlexNet are shown in Fig. 4 and Fig. 5 respectively.

TABLE II
ALEXNET CONV LAYER CONFIGURATION

              CONV1      CONV2    CONV3    CONV4    CONV5
Feature_in    227×227    27×27    13×13    13×13    13×13
Kernel size   11×11      5×5      3×3      3×3      3×3
Kernels       96         256      384      384      256
Channels      3          48       256      192      192
Feature_out   55×55      27×27    13×13    13×13    13×13
Stride        4          1        1        1        1

Fig. 4. Parameters in CONV layers of AlexNet
Fig. 5. MAC operations in CONV layers of AlexNet

The first CONV layer (CONV1) in AlexNet can be implemented using the conventional convolution technique, since its filter is of size 11×11. The rest of the convolution layers use 3×3 filters, for which Winograd gives the best performance. Thus, a hybrid approach can be used in realizing the convolution layers of AlexNet.
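The per-layer parameter and MAC counts behind Fig. 4 and Fig. 5 can be recomputed directly from Table II; the short script below does so (weights only, biases ignored), so small differences from the figures are possible.

```python
# Layer configuration taken from Table II: (feature_out, kernel, #kernels, channels)
layers = {
    "CONV1": (55, 11, 96, 3),
    "CONV2": (27, 5, 256, 48),
    "CONV3": (13, 3, 384, 256),
    "CONV4": (13, 3, 384, 192),
    "CONV5": (13, 3, 256, 192),
}

for name, (fout, k, n, c) in layers.items():
    params = k * k * c * n            # weights per layer (biases ignored)
    macs = params * fout * fout       # one MAC per weight per output pixel
    print(f"{name}: {params/1e3:.1f}K parameters, {macs/1e6:.1f}M MACs")
```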
Architecture for implementing CONV1 of AlexNet

The first convolution layer of the AlexNet model uses 11 × 11 filters with an input feature of size 227 × 227; 96 filters and three input channels are used. This layer can be implemented using the conventional convolution technique, which is realized with a general matrix-matrix multiplication. A representation of the first CONV layer is shown in Fig. 6.

Fig. 6. CONV1 layer of AlexNet

The proposed architecture for implementing the first convolutional layer of AlexNet is shown in Fig. 7. Input features and filter coefficients are read from DDR and the output features are written back to DDR. A single 227×227 input channel is considered first, which has to be convolved with 96 filters of size 11×11 each. In our architecture we use 11 MAC units per filter to perform a row convolution, and there are 96 such row convolution units operating in parallel. The three input channels of CONV1 are then processed in the same way, and the outputs of all channels are added together by reusing the accumulators in the MAC (multiply and accumulate) units.

Fig. 7. Proposed Architecture
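The dataflow described above can be captured in a small behavioral model; this is only an illustration of the row-convolution and accumulator-reuse scheme (the loop structure and names are ours), not a description of the actual RTL, and the parallelism of the 96 hardware units is expressed here as an ordinary loop.

```python
import numpy as np

def conv1_row_mac_model(inputs, kernels, stride=4):
    """Behavioral model of CONV1: inputs (C, H, W), kernels (N, C, K, K)."""
    C, H, W = inputs.shape
    N, _, K, _ = kernels.shape
    out = (H - K) // stride + 1                  # 55 for the full CONV1 case
    Y = np.zeros((N, out, out))                  # accumulators, reused across channels
    for c in range(C):                           # channels processed one after another
        for n in range(N):                       # row-convolution units (parallel in hardware)
            for oy in range(out):
                for ox in range(out):
                    acc = 0.0
                    for ky in range(K):          # K row convolutions per output
                        row = inputs[c, oy*stride + ky, ox*stride:ox*stride + K]
                        acc += row @ kernels[n, c, ky]   # K MACs, one per MAC unit
                    Y[n, oy, ox] += acc          # accumulate channel contribution
    return Y

# Small demo; the full CONV1 would use inputs (3, 227, 227) and kernels (96, 3, 11, 11) -> (96, 55, 55)
x = np.random.rand(3, 47, 47)
w = np.random.rand(8, 3, 11, 11)
print(conv1_row_mac_model(x, w).shape)   # (8, 10, 10)
```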

V. IMPLEMENTATION AND RESULTS

The convolution architecture for the first layer of the AlexNet model has been implemented on a Xilinx XC7V2000 FPGA, using 32-bit floating point arithmetic. The architecture has been verified using a Matlab simulation environment. The floating point multipliers in the MAC units are five-stage pipelined, and the proposed architecture operates at 200 MHz. The resource utilization of the proposed architecture is given in Table III. Our architecture achieves a performance of 422 GFLOPs for the first convolution layer of the AlexNet CNN model.

TABLE III
RESOURCE UTILIZATION

Resource     Used     Available   Utilization (%)
LUTs         519312   1221600     42.51
DSP48E       2112     2160        97.78
Flip Flops   136513   2443200     5.59
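As a rough sanity check (our own arithmetic, not a figure from the paper), 96 row units of 11 floating point MACs each, issuing one multiply and one add per cycle at 200 MHz, give a peak throughput matching the reported 422 GFLOPs; the DSP48E usage in Table III (2112 = 2 × 1056) is also consistent with two DSP slices per floating point multiplier, though that mapping is our assumption.

```python
mac_units = 96 * 11        # 96 row-convolution units x 11 MACs each = 1056
flops_per_mac = 2          # one multiply + one add per cycle
clock_hz = 200e6           # reported operating frequency
peak_gflops = mac_units * flops_per_mac * clock_hz / 1e9
print(peak_gflops)         # 422.4, close to the reported 422 GFLOPs
```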
VI. CONCLUSION

Convolutional layers account for more than 90% of the total computation in Convolutional Neural Networks, so an efficient hardware architecture is required for this highly complex computational unit. We have explored various algorithms for performing the convolution operations in CNNs, taking the AlexNet CNN model as a case study. The analysis shows that the first layer of AlexNet performs better with the conventional algorithm. We have proposed an efficient architecture for implementing the convolution operation in the first layer of AlexNet, using 32-bit floating point arithmetic. The architecture has been implemented on a Xilinx XC7V2000 FPGA with an operating frequency of 200 MHz and achieves a performance of 422 GFLOPs for the first convolution layer of the AlexNet CNN model. Future work will focus on implementing the remaining layers of AlexNet using suitable fast algorithms.
ACKNOWLEDGEMENT

This work is supported in part by the Kerala State Council for Science, Technology and Environment (KSCSTE), under the Back-to-Lab programme of the Women Scientist Division.

REFERENCES
[1] Fengbin Tu, Shouyi Yin, Peng Ouyang, Shibin Tang, Leibo Liu and
Shaojun Wei., ”Deep Convolutional Neural Network Architecture With
Reconfigurable Computation Patterns” In IEEE Transactions on Very
Large Scale Integration (VLSI) Systems,August 2017, pp. 2220–2233
[2] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang,
Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun and Olivier Temam,
"DaDianNao: A Machine-Learning Supercomputer". In Proceedings of
The 47th Annual IEEE/ACM International Symposium on Microarchi-
tecture, MICRO 2014
[3] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ”ImageNet
Classification with Deep Convolutional Neural Networks”. NIPS 2012.
[4] Andrew Lavin and Scott Gray. ”Fast Algorithms for Convolutional
Neural Networks”. In IEEE CVPR,2016.
[5] Roberto DiCecco, et al., "Caffeinated FPGAs: FPGA Framework For Convolutional Neural Networks". In IEEE FPT 2016.

[6] Liqiang Lu, Yun Liang, Qingcheng Xiao and Shengen Yan, ”Evaluating
Fast Algorithms for Convolutional Neural Networks on FPGAs”. In
IEEE FCCM, 2017.
[7] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao and
Jason Cong, ”Optimizing FPGA-based Accelerator Design for Deep
Convolutional Neural Networks”. In ACM/SIGDA FPGA 2015.
[8] Manoj Alwani, Han Chen, Michael Ferdman and Peter Milder, "Fused-
Layer CNN Accelerators”, In Proceedings of The 47th Annual
IEEE/ACM International Symposium on Microarchitecture, MICRO
2016
[9] Kaiyuan Guo, Lingzhi Sui, Jiantao Qiu, Jincheng Yu, Junbin Wang, Song
Yao, Song Han, Yu Wang and Huazhong Yang, ”Angel-Eye: A Complete
Design Flow for Mapping CNN Onto Embedded FPGA”. IEEE TCAD
37, 1 (Jan. 2018), pp. 35–47.
[10] Mohammad Motamedi, Philipp Gysel, Venkatesh Akella and Soheil
Ghiasi, ”Design Space Exploration of FPGA-Based Deep Convolutional
Neural Networks”. In IEEE ASP-DAC 2016.
[11] K. Simonyan and A. Zisserman. ”Very Deep Convolutional Networks
for Large-Scale Image Recognition”. In ICLR 2015.
[12] Michael Mathieu, Mikael Henaff, and Yann LeCun. ”Fast Training of
Convolutional Networks through FFTs”. In ICLR 2014.
[13] Huimin Li, Xitian Fan, Li Jiao, Wei Cao, Xuegong Zhou and Lingli
Wang, ”A High Performance FPGA-based Accelerator for Large-Scale
Convolutional Neural Networks”. In FPL 2016.
[14] Yufei Ma, Naveen Suda, Yu Cao, Jae-sun Seo and Sarma Vrudhula,
”Scalable and Modularized RTL Compilation of Convolutional Neural
Networks onto FPGA”. In FPL 2016.
[15] Maurice Peemen, Arnaud A. A. Setio, Bart Mesman and Henk Cor-
poraal, ”Memory-Centric Accelerator Design for Convolutional Neural
Networks”. In IEEE ICCD 2013.
[16] Srimat Chakradhar, Murugan Sankaradas, Venkata Jakkula and Srihari
Cadambi, ”A Dynamically Configurable Coprocessor for Convolutional
Neural Networks”, In International Symposium on Computer Architec-
ture, ISCA, 2010.
[17] Jiantao Qiu et.al, ”Going Deeper with Embedded FPGA Platform for
Convolutional Neural Network”. In ACM/SIGDA FPGA 2016.
[18] Chen Zhang, Zhenman Fang, Peipei Zhou, Peichen Pan and Jason Cong,
”Caffeine: Towards Uniformed Representation and Acceleration for
Deep Convolutional Neural Networks”, In Proceedings of International
Conference on Computer Aided Design”, ICCAD, 2016.
[19] Abhinav Podili, Chi Zhang and Viktor Prasanna, "Fast and Effi-
cient Implementation of Convolutional Neural Networks on FPGA”, In
Proceedings of ASAP, 2017
[20] Naveen Suda, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei
Ma, Sarma Vrudhula, Jae-sun Seo, Yu Cao, ”Throughput-Optimized
OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural
Networks”, In FPGA 2016
[21] Tahmid Abtahi, Amey Kulkarni, and Tinoosh Mohsenin ”Accelerating
Convolutional Neural Network with FFT on Tiny Cores”, In ISCAS 2017
