
2017 IEEE International Symposium on Parallel and Distributed Processing with Applications and 2017 IEEE International Conference on Ubiquitous Computing and Communications (ISPA/IUCC)

Laius: An 8-bit Fixed-point CNN Hardware Inference Engine

Zhisheng Li*1, Lei Wang1, Shasha Guo1, Yu Deng1, Qiang Dou1, Haifang Zhou1, Wenyuan Lu2
1 School of Computer Science, National University of Defense Technology, Changsha, China
2 Xi’an Satellite Monitoring and Control Center, Xi’an, China
*lizhsh_123@163.com
DOI 10.1109/ISPA/IUCC.2017.00030

Abstract—Convolutional Neural Network (CNN) is one of the most effective neural network models for many classification tasks, such as voice recognition, computer vision and biological information processing. Unfortunately, the computation of CNNs is both memory-intensive and computation-intensive, which brings a huge challenge to the design of hardware accelerators. A large number of hardware accelerators for CNN inference have been designed by industry and academia. Most of these engines are based on 32-bit floating-point matrix multiplication, where the data precision is over-provisioned for the inference job and the hardware cost is too high.
In this paper, an 8-bit fixed-point LeNet inference engine (Laius) is designed and implemented on FPGA. In order to reduce the consumption of FPGA resources, we propose a methodology to find the optimal bit-length for the weights and biases in LeNet, which results in using 8-bit fixed point for most of the computation and 16-bit fixed point for the rest. A PE (Processing Element) design is proposed, and pipelining and PE tiling techniques are used to improve the performance of the inference engine. By theoretical analysis, we come to the conclusion that the DSP resource of the FPGA is the most critical resource and should be used carefully during the design process. We implement the inference engine on a Xilinx 485t FPGA. Experiment results show that the designed LeNet inference engine achieves 44.9 Gops throughput with 8-bit fixed-point operation after pipelining. Moreover, with only 1% loss of accuracy, the 8-bit fixed-point engine reduces latency by 31.43%, LUT consumption by 87.01%, BRAM consumption by 66.50%, DSP consumption by 65.11% and power by 47.95% compared to a 32-bit fixed-point inference engine with the same structure.
Index Terms—CNN accelerator, FPGA, LeNet, Inference, Implementation

I. INTRODUCTION
Convolutional Neural Network (CNN) is one of the most important models for fitting complex and non-linear data, with wide applications ranging from voice recognition and computer vision to bio-information processing. Compared to many traditional algorithms, CNNs can always achieve superior performance and accuracy. LeNet, a representative CNN, was proposed for recognizing hand-written digits. Due to its attractive feature extraction ability, LeNet has been widely used in many real-world applications.
Despite these advantages, LeNet, like nearly all kinds of CNNs, is computation-intensive and memory-intensive. As the complexity of real-world applications increases, the limitations of CNNs become more serious: an enormous amount of computation has to be finished with limited memory, which brings a huge challenge to the existing CNN hardware accelerator architectures.
GPU, ASIC and FPGA are the three main hardware platforms for CNN acceleration. The advantages of FPGA are its short development cycle and high flexibility, and much effort has been spent on making full use of it. Work [9] presents a roofline model that implements quantitative modeling of bandwidth, resources and throughput. In work [1] [2], the authors use the roofline model as a guide to achieve a better trade-off among bandwidth, throughput and resource utilization. In work [3] [8] [16], the authors propose different computing units to improve computational efficiency and throughput. In work [12] [13], a hierarchical approach is proposed and implemented, making the allocation of resources more fine-grained and improving resource utilization. In addition, several studies show that low-precision training and inference can improve performance in all aspects with a tolerable accuracy loss. In work [4] [6], the authors propose a low-precision training method and a data compression method to reduce bandwidth pressure and power consumption, respectively. Unfortunately, the existing work above is designed only for the convolutional layers of CNNs, not for mapping the entire network.
In this paper, a LeNet hardware inference engine with low-precision (8-bit fixed-point) operation, Laius, is designed and implemented to explore the trade-off between accuracy and resource consumption. The input of Laius is the pixels of a given picture. The data and weights are calculated at each layer in order and written to BRAM. The layers of the engine are arranged in the order of the layers of LeNet. In order to improve performance, PE tiling and weight splitting are leveraged to exploit computational parallelism. By deriving a mathematical model and applying ping-pong optimization, we implement a four-stage pipeline to improve the throughput.
The main contributions of this paper are:
1) It designs and implements an inference engine for LeNet (Laius) with 8-bit fixed-point precision. With only 1% loss of accuracy, this 8-bit fixed-point engine largely reduces hardware resource consumption compared to a 32-bit fixed-point engine of the same structure.

2) With the help of our in-house fixed-point CNN training framework and the one-to-one inference engine simulator, we propose a methodology to find the optimal bit-width and scaling factor for the weights and biases in the network.
3) A theoretical analysis model is proposed to efficiently manage the free hardware resources during hardware implementation.
Laius is implemented on a Xilinx 485t FPGA. Experiment results show that Laius achieves 44.9 Gops throughput with 8-bit fixed-point operation after pipelining on the Xilinx Virtex7 485t. Moreover, with only 1% loss of accuracy, the 8-bit fixed-point engine reduces delay by 31.43%, LUT usage by 87.01%, BRAM usage by 66.50%, DSP usage by 65.11% and power by 47.95% compared to a 32-bit fixed-point engine of the same structure.
The rest of this paper is organized as follows. Section II gives the background. Section III introduces the engine architecture and implementation details. Section IV describes the optimization of the engine in detail. Section V presents the experiments and results. Section VI discusses related work, and Section VII concludes the paper.

II. BACKGROUND
A. LeNet
LeNet is primarily used in hand-written digit recognition and is applied in practice to identify handwritten postcodes.
1) The Composition of LeNet: LeNet consists of 8 layers. The structure is given in Table I and Figure 1. It is composed of 2 convolutional layers, each followed by a pooling layer, and 2 fully-connected layers with 1 activation layer between them and a classification layer. The convolutional (CONV) layer aims to detect local features of the previous layer. The units of a CONV layer are organized in feature maps. Each feature map has many neurons, each connected to a local region of the input feature map through a set of weights. All neurons of the same feature map share the same set of weights, also called the convolutional kernel. Different feature maps use different sets of weights. The pooling layer fuses semantically similar features into a single one. A typical pooling layer, the max-pooling layer, computes the local maximum over one or multiple feature maps. In the fully-connected (FC) layer, each neuron has connections with all neurons of the adjacent layer. The non-linear layer performs a non-linear transformation of the input data and produces an output of the same size as the input. In LeNet's non-linear layer, the ReLU function is used.

Fig. 1. LeNet Structure

TABLE I
LENET PARAMETERS

Layer Type   Input       Weights     Stride   Output
CONV1        1*28*28     20*1*5*5    1        20*24*24
Pooling1     20*24*24    -           2        20*12*12
CONV2        20*12*12    50*20*5*5   1        50*8*8
Pooling2     50*8*8      -           2        50*4*4
FC1          800*1       800*500     -        500*1
ReLU         500*1       -           -        500*1
FC2          500*1       500*10      -        10*1
Soft-Max     10*1        -           -        10*1

2) Convolutional Layer: The main operation of the convolutional layer is the convolution operation. As shown in Figure 2, the input and the weights correspond to a multiply-accumulate operation; the parameters involved in the convolution operation are given in Table I. Each set of weights corresponds to a bias, and the result of each set of operations is summed with the corresponding bias. The corresponding formula is as follows:

Output = Σ_i (Input_i × Weights_i) + Bias    (1)

Fig. 2. Convolutional Layer Diagram

It is worth noting that in CONV2 each group of weights spans a large number of input feature maps, so we take two steps to complete CONV2. In the first step, the partial results are calculated, and the data dimension is 20*50*8*8. In the second step, the partial results within each group are added to obtain the final result, and the data dimension becomes 50*8*8, as shown in Figure 3.

Fig. 3. CONV2 implementation procedure
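To make the operation concrete, the following C sketch implements the multiply-accumulate of Eq. (1) together with the two-step CONV2 procedure of Figure 3. It is an illustration only: the array shapes follow Table I, plain int is used instead of the fixed-point formats discussed later in Section IV-A, and the function and variable names are ours, not those of the Laius HLS source.

```c
/* Two-step CONV2 sketch: step 1 produces one partial 8x8 map per
 * (output channel, input channel) pair; step 2 sums each group of 20
 * partial maps and adds the bias. Shapes follow Table I. */
void conv2_two_step(int in[20][12][12],        /* Pooling1 output           */
                    int w[50][20][5][5],       /* CONV2 weights             */
                    int bias[50],
                    int partial[50][20][8][8], /* step 1: 20*50*8*8 values  */
                    int out[50][8][8])         /* step 2: 50*8*8 values     */
{
    /* Step 1 (CONV2_1): multiply-accumulate of Eq. (1), one input channel at a time. */
    for (int oc = 0; oc < 50; oc++)
        for (int ic = 0; ic < 20; ic++)
            for (int y = 0; y < 8; y++)
                for (int x = 0; x < 8; x++) {
                    int acc = 0;
                    for (int ky = 0; ky < 5; ky++)
                        for (int kx = 0; kx < 5; kx++)
                            acc += in[ic][y + ky][x + kx] * w[oc][ic][ky][kx];
                    partial[oc][ic][y][x] = acc;
                }

    /* Step 2 (CONV2_2): reduce the 20 partial maps of each group and add the bias. */
    for (int oc = 0; oc < 50; oc++)
        for (int y = 0; y < 8; y++)
            for (int x = 0; x < 8; x++) {
                int acc = bias[oc];
                for (int ic = 0; ic < 20; ic++)
                    acc += partial[oc][ic][y][x];
                out[oc][y][x] = acc;
            }
}
```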
3) Fully-connected layer and others: The operation of the fully-connected layer is similar to the convolution operation, as shown in Figure 4. Each output point corresponds to a set of weights. The weights are multiplied with the corresponding input data, and the result is summed with the corresponding bias to get the final output. The corresponding formula is as follows:

Output = Σ_i (Input_i × Weights_i) + Bias    (2)

The fully-connected layer is characterized by low reuse and a large amount of weight data.

Fig. 4. Fully-connected Layer

Pooling operations come in many kinds; the pooling operation in LeNet is max pooling. Within the given data window, the largest value is selected as the output, as shown in Figure 5. The role of the pooling layer is to reduce the data dimension and lessen the computation. The corresponding formula is as follows:

Output = max(Input_1, Input_2, ..., Input_n)    (3)

Fig. 5. Pooling Layer Diagram

In CNNs the activation layer can use many functions; the ReLU function is used in LeNet. The ReLU principle is very simple: when the input is positive, the output is the original value; when the input is negative, the output is 0. In CNNs, the classification layer generally uses the Soft-Max function. The specific operation is shown as follows:

Output_i = e^{Input_i} / Σ_{n=1}^{N} e^{Input_n}    (4)
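For reference, the following C sketch gives direct implementations of Eqs. (2) to (4) and of ReLU. The floating-point types and the sizes used here are for readability only; they are not the fixed-point formats of Laius, and the names are ours.

```c
#include <math.h>

/* Eq. (2): one fully-connected output = dot(input, weights) + bias. */
float fc_output(const float *input, const float *weights, float bias, int n)
{
    float acc = bias;
    for (int i = 0; i < n; i++)
        acc += input[i] * weights[i];
    return acc;
}

/* Eq. (3): 2x2 max pooling of one h-by-w feature map (h and w even). */
void max_pool_2x2(const float *in, float *out, int h, int w)
{
    for (int y = 0; y < h; y += 2)
        for (int x = 0; x < w; x += 2) {
            float m = in[y * w + x];
            if (in[y * w + x + 1]       > m) m = in[y * w + x + 1];
            if (in[(y + 1) * w + x]     > m) m = in[(y + 1) * w + x];
            if (in[(y + 1) * w + x + 1] > m) m = in[(y + 1) * w + x + 1];
            out[(y / 2) * (w / 2) + x / 2] = m;
        }
}

/* ReLU: negative values become zero, positive values pass through. */
float relu(float v) { return v > 0.0f ? v : 0.0f; }

/* Eq. (4): Soft-Max over the 10 classifier outputs. */
void soft_max(const float in[10], float out[10])
{
    float sum = 0.0f;
    for (int i = 0; i < 10; i++) { out[i] = expf(in[i]); sum += out[i]; }
    for (int i = 0; i < 10; i++) out[i] /= sum;
}
```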
III. ARCHITECTURE OF LAIUS
Laius is designed according to the network structure, as Figure 6 shows. The input of the engine is the pixel information of the picture. The weights and data are calculated at each layer in order, and the results are finally obtained. The layers of Laius are arranged in the order of CONV1, Pooling1, CONV2_1, CONV2_2, Pooling2, FC1, ReLU and FC2. The weights are written to BRAM directly. The image information and the weights stored in BRAM are the inputs of the operations, and the output of each layer is stored in BRAM and becomes the input of the next layer.

A. Convolutional Layer Design
Convolutional layers are computation-intensive, and increasing data reuse can improve computational efficiency. In the convolution operation, the input data in the sliding window is calculated with the corresponding weights, and in two adjacent operations the input data partially overlaps. We design the PE to take advantage of the sliding-window operation, which improves the reuse of data. According to the dimensions of the weights, the PE we design is shown in Figure 7. The PE is a traditional tree-like structure that can share part of the data, which conforms to the sliding-window operation. This approach improves the reuse of data and reduces the bandwidth pressure. Different convolutional layers have different parameters, so we can instantiate different numbers of PEs for different convolutional layers. In this design, each PE takes eight data per calculation.
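The following C sketch illustrates such a tree-like PE with eight input/weight pairs: a multiply stage followed by a balanced three-level adder tree. It is written under our own assumptions and is not the actual Laius PE code; in the HLS implementation these loops would be fully unrolled into parallel hardware.

```c
#include <stdint.h>

/* Tree-like PE sketch (cf. Fig. 7): eight products reduced by an adder tree. */
int32_t pe_tree8(const int16_t in[8], const int8_t w[8])
{
    int32_t p[8], s1[4], s2[2];

    for (int i = 0; i < 8; i++)          /* multiply stage */
        p[i] = (int32_t)in[i] * w[i];

    for (int i = 0; i < 4; i++)          /* adder tree, level 1 */
        s1[i] = p[2 * i] + p[2 * i + 1];
    for (int i = 0; i < 2; i++)          /* adder tree, level 2 */
        s2[i] = s1[2 * i] + s1[2 * i + 1];

    return s2[0] + s2[1];                /* adder tree, level 3 */
}
```

Because adjacent sliding-window positions overlap, successive invocations of such a PE can reuse part of the previously fetched inputs, which is the data-reuse effect described above.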
B. Fully-connected Layer Design
The amount of data in the fully-connected layers is huge, and there is no reusability. The PE of the fully-connected layer is also a tree structure, as shown in Figure 7. We improve the performance by choosing the size m of the PE and the number of PEs based on the parameters of the fully-connected layer. After all the inputs and weights have been computed, the result is added to the bias, which becomes the output data.

C. Other Layers Design
The pooling layer in LeNet is max-pooling with a 2*2 sliding window: the largest of the four numbers in the window is the output. During the implementation, the pooling layer reads 16 data at a time and outputs 4 results simultaneously. The activation layer of LeNet uses the ReLU function: the input data is checked for sign, positive numbers are passed through unchanged, and negative values are set to zero.

D. Weights Arrangement
The weights of Laius are written to BRAM. The weights of each layer are written to multiple BRAMs so that data can be read from multiple BRAMs at the same time, increasing the read speed of the data. The weights in the network are four-dimensional. The four-dimensional data is transformed into one-dimensional data in accordance with the order of the dimensions.

The expanded weights are written to BRAM in this order, which facilitates the calculation of addresses. After the PE unit is tiled, we split the corresponding weights and put them into multiple BRAMs.

Fig. 6. Architecture of Laius

Fig. 7. Convolutional implementation detail
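The following C sketch illustrates this arrangement: a four-dimensional weight tensor is flattened in dimension order, and the flattened stream is distributed over several BRAM banks. The round-robin distribution and all names here are illustrative assumptions; the real split follows the tiling described in Section IV.

```c
#include <stdint.h>

/* Address of element W[k][c][r][s] in the 1-D layout (flattened in dimension order). */
static inline int flat_index(int k, int c, int r, int s, int C, int R, int S)
{
    return ((k * C + c) * R + r) * S + s;
}

/* Distribute the flattened weights over NB BRAM banks so that NB weights can be
 * fetched per cycle. bank[b] points to the storage of bank b; bank_len[b] is its fill level. */
void split_weights(const int8_t *w_flat, int total, int NB,
                   int8_t *bank[], int bank_len[])
{
    for (int b = 0; b < NB; b++) bank_len[b] = 0;
    for (int i = 0; i < total; i++) {
        int b = i % NB;                        /* which BRAM bank            */
        bank[b][bank_len[b]++] = w_flat[i];    /* offset inside that bank    */
    }
}
```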
IV. OPTIMIZATION
FPGA resources are limited and valuable, and DSP resources are the bottleneck of performance. Making full use of the hardware resources is an effective method to improve performance. The roofline model has been proposed to handle the trade-off among hardware resources, bandwidth and throughput. In order to tackle the performance bottleneck, we take the following optimizations: we explore low-precision calculation and achieve 8-bit fixed-point computation; unit tiling and weight splitting are matched to achieve computational parallelism; and through the derivation of a mathematical model and ping-pong optimization, we implement the pipeline.

A. Reducing Precision
We refer to the work [20] on training CNN models in fixed-point precision. The process of integer training is shown in Figure 8. After several iterations of floating-point training, the data is taken out, enlarged and rounded, and the fixed-point representation is obtained. Through fixed-point training, we get the scale factor and the 8-bit fixed-point weights. The image information is enlarged 40 times and rounded as the input to the engine. 8-bit fixed-point weights and 16-bit fixed-point biases are obtained from the training model. For the CONV1, CONV2, FC1 and FC2 operations, the result is scaled down by 200 times before it can be used as an output. The biggest problem in operating with 8-bit fixed-point weights is data overflow. To solve this problem, we use the simulator to evaluate the output data of each layer; the number of adder and multiplier bits per layer is determined according to the evaluation result. We used the 10,000 test images from MNIST as the data set for this evaluation. The results of the experiment show that the accuracy of the network is very sensitive to the data of each layer: data overflow in the output of any layer will lead to errors in the final result. To prevent such a catastrophe, we take a conservative attitude towards the data bit-width. After the simulator test, we set the width of each layer as shown in Table II. In Table II, 8-16 bit means that the product of an 8-bit fixed-point value and a 16-bit fixed-point value is obtained as a 16-bit fixed-point value.

Fig. 8. Caffe’s fixed-point training [20]

B. Parallelization
Most of the data in the network operation is independent, and this independence provides the conditions for parallelism. PE tiling allows the data to be processed with a certain degree of parallelism. At the same time, the weights need to be split according to the PEs and placed into multiple BRAMs. This can provide enough computing data without taking up more BRAM resources.
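As a sketch of this idea, the following C fragment assigns output feature maps to PE instances and pairs each PE with its own pre-split weight bank, mirroring the one-PE-to-three-PE example discussed with Figure 9. The static round-robin mapping is our own simplifying assumption, not a description of the actual Laius control logic.

```c
#include <stdint.h>

#define NP 3   /* tiling factor in the Fig. 9 example: one PE becomes three */

/* Each output channel is computed by exactly one PE, which reads only its own bank. */
typedef struct {
    int           pe_id;        /* which PE instance computes this channel */
    const int8_t *weight_bank;  /* the BRAM bank holding that PE's weights */
} channel_map_t;

void map_channels(int out_channels, int8_t *bank[NP], channel_map_t map[])
{
    for (int oc = 0; oc < out_channels; oc++) {
        map[oc].pe_id       = oc % NP;        /* static channel -> PE mapping */
        map[oc].weight_bank = bank[oc % NP];  /* matching pre-split bank      */
    }
}
```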

TABLE II
THE BIT-WIDTH OF EACH LAYER IN LAIUS

Layer     Input   Weights   Bias    Multiplier   Adder   Output   Scaling Factor
CONV1     8bit    8bit      16bit   8bit         16bit   8bit     200
CONV2_1   8bit    8bit      16bit   8bit         16bit   16bit    200
CONV2_2   16bit   8bit      16bit   8-16bit      16bit   16bit    200
FC1       16bit   8bit      16bit   8-16bit      16bit   16bit    200
FC2       16bit   8bit      16bit   8-16bit      32bit   16bit    200
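As an illustration of how the CONV1 row of Table II could be applied, the following C sketch multiplies 8-bit inputs by 8-bit weights, accumulates together with a 16-bit bias, and scales the result down by the factor 200 to produce an 8-bit output. The saturation and truncation behaviour shown here are our own assumptions; the paper fixes only the bit-widths and the scaling factor.

```c
#include <stdint.h>

/* Clamp a wider intermediate value into the 8-bit output range (assumed behaviour). */
static inline int8_t sat8(int32_t v)
{
    if (v >  127) return  127;
    if (v < -128) return -128;
    return (int8_t)v;
}

/* One CONV1 output point: 5x5 window, 8-bit pixels (already scaled by 40),
 * 8-bit weights, 16-bit bias, result divided by the scaling factor 200. */
int8_t conv1_mac(const int8_t in[25], const int8_t w[25], int16_t bias)
{
    int32_t acc = bias;                        /* 16-bit adder modelled in 32 bits */
    for (int i = 0; i < 25; i++)
        acc += (int16_t)in[i] * (int16_t)w[i]; /* 8-bit x 8-bit -> 16-bit product  */
    return sat8(acc / 200);                    /* scale down by 200, 8-bit output  */
}
```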

As shown in Figure 9, when the PE is tiled from one to three, the weights are also split from one BRAM into three BRAMs. The method of splitting the weights is determined by the dimension along which the data is parallelized. The basic principle is the correspondence between the data before and after the split.

Fig. 9. Layout of Weights

C. Ping-Pong Optimization
To improve the throughput, pipelining is an essential technique. The computational complexity of the layers of the network differs, resulting in different operating times. Analysis of the time of each stage shows that the runtime of the convolutional and fully-connected layers is dominant. In order to achieve a balanced pipeline, we have to devote more resources to them to reduce their running time; our goal is to achieve roughly the same running time in each section. In the pipeline, ping-pong buffering is used for the data processing. As shown in Figure 10, the ping-pong operation among the pipeline segments allows the hardware resources to work continuously.

Fig. 10. Ping-Pong Optimization
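The following C sketch shows the ping-pong (double-buffer) idea between two adjacent pipeline stages: while one half of the buffer is written by stage k, the other half is read by stage k+1, and the two halves swap roles every iteration. In hardware the producer and consumer run concurrently; the sequential calls and the buffer size below are only illustrative assumptions.

```c
#include <stdint.h>

#define BUF_WORDS 4096                         /* placeholder buffer size */

typedef struct {
    int16_t buf[2][BUF_WORDS];                 /* the two halves ("ping" and "pong") */
    int     wr_sel;                            /* which half is currently written    */
} pingpong_t;

/* One pipeline iteration: stage k fills one half while stage k+1 drains the other. */
void pipeline_iteration(pingpong_t *pp,
                        void (*producer)(int16_t *dst),
                        void (*consumer)(const int16_t *src))
{
    producer(pp->buf[pp->wr_sel]);             /* stage k writes                  */
    consumer(pp->buf[pp->wr_sel ^ 1]);         /* stage k+1 reads the other half  */
    pp->wr_sel ^= 1;                           /* swap roles for next iteration   */
}
```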
The theoretical model only optimizes the operation of
CONV1, CONV2, FC1 and FC2. We use 36K resources as
a research unit. We assume that BRAM can support 4 data
access. Due to the presence of data reuse, the situation in the
formula is the maximum expenditure. The BRAM for the 8-bit
fixed-point engine can be expressed as:

Fig. 10. Ping-Pong Optimization NBRAM =BRAMInput + BRAMCON V 21


(7)
+ BRAMCON V 22 + BRAMF C1

The BRAM expenditure after optimization can be expressed as:

M_BRAM = C1 × BRAM_Input + C2 × BRAM_CONV2_1 + F1 × BRAM_CONV2_2 + F2 × BRAM_FC1    (8)

The delay after optimization can be expressed as:

Delay = T_CONV1/C1 + T_P1 + T_CONV2/C2 + T_P2 + T_FC1/F1 + T_ReLU + T_FC2/F2    (9)

The time of each pipeline stage can be expressed as:

T1 = T_CONV1/C1 + T_P1
T2 = T_CONV2_1/C2
T3 = T_CONV2_2/C2
T4 = T_P2 + T_FC1/F1 + T_ReLU + T_FC2/F2    (10)

After optimization, the PE and BRAM usage of the engine must stay within the total number of resources. According to the resource constraints and the optimization goals, we get:

M_DSP / N_DSP ≤ DSP_Total / DSP_Used
M_BRAM / N_BRAM ≤ BRAM_Total / BRAM_Used
min(Delay)
min(max(T_i − T_j)), i ≠ j    (11)

The parameters (C1, C2, F1, F2) can be obtained by bringing the number of PEs per layer of the existing engine and its BRAM usage into the model. According to the performance requirements, we can choose the appropriate parameters for configuration optimization.
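As an illustration of how equations (10) and (11) can be used to pick the parameters, the following C sketch enumerates candidate (C1, C2, F1, F2) values, evaluates the stage times with the expressions later listed in Table IV, applies a simplified stand-in for the DSP/BRAM constraints, and keeps the combination with the smallest maximum stage time. The candidate set and the resource check are our own assumptions, so the printed result is not necessarily the configuration reported in Section V.

```c
#include <stdio.h>

int main(void)
{
    const double F_cand[] = {1, 2, 2.5, 4, 5, 10};   /* assumed candidate factors */
    double best_t = 1e18, bf1 = 1, bf2 = 1;
    int bc1 = 1, bc2 = 1;

    for (int c1 = 1; c1 <= 4; c1++)
        for (int c2 = 1; c2 <= 4; c2++)
            for (int i = 0; i < 6; i++)
                for (int j = 0; j < 6; j++) {
                    double f1 = F_cand[i], f2 = F_cand[j];
                    /* Stage times T1..T4 as parameterized in Table IV. */
                    double t1 = 2880.0 / c1 + 720.0;
                    double t2 = 8000.0 / c2;
                    double t3 = 8000.0 / c2;
                    double t4 = 700.0 + 10000.0 / f1 + 1250.0 / f2;
                    double tmax = t1;
                    if (t2 > tmax) tmax = t2;
                    if (t3 > tmax) tmax = t3;
                    if (t4 > tmax) tmax = t4;
                    /* Simplified stand-in for the DSP/BRAM constraints of Eq. (11). */
                    if (c1 * 544.0 > 2800.0 || c2 * 56.0 > 1030.0) continue;
                    if (tmax < best_t) {
                        best_t = tmax; bc1 = c1; bc2 = c2; bf1 = f1; bf2 = f2;
                    }
                }
    printf("best (C1,C2,F1,F2) = (%d,%d,%.1f,%.1f), max stage time %.1f\n",
           bc1, bc2, bf1, bf2, best_t);
    return 0;
}
```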
V. EXPERIMENT AND RESULTS
A. Experiment Setup
The experiment is carried out with Vivado HLS 2015.4 on an i5-6600 processor. Our implementation is built on a Xilinx Virtex7 485t FPGA. We implement a total of five versions of the engine: 32-bit fixed-point (not optimized), 32-bit fixed-point, 16-bit fixed-point, 8-bit fixed-point, and 8-bit fixed-point with pipelining.
Laius implements the mapping of the LeNet network. MNIST contains the training and test data sets. Our research group uses the C language to write the corresponding network simulator. Caffe’s model provides the fixed-point trained weights for the experiment. The data of the simulator is compared with the data of Laius to ensure the correctness of the network structure. In the low-precision experiment, the bit-width of each network layer is determined by the software simulator through analysis of the test data set. After the Laius parameters of the various layers are configured, the parameters are returned to the simulator, and finally the corresponding accuracy can be obtained, as shown in Figure 12.

Fig. 12. Experimental procedure

B. Experiment Result
1) Performance: Section V-A mentioned that we conducted five groups of experiments. On the basis of the 32-bit fixed-point version, we optimized the operation to get the optimized 32-bit fixed-point and 16-bit fixed-point versions. On the basis of the 16-bit fixed-point version, plus the bit-width design described in Section IV-A, we get the 8-bit fixed-point version.
We set the 32-bit fixed-point performance as the baseline and study the relative performance of the 16-bit and 8-bit fixed-point versions. Figure 13 shows the performance comparison between the implementations at each precision. It can be clearly seen from Table III that, under the same premise in all configurations, low bit-width brings a full range of performance improvements. The 8-bit fixed-point engine reduces delay by 31.43%, LUT usage by 87.01%, BRAM usage by 66.50%, DSP usage by 65.11% and power by 47.95% compared to the 32-bit fixed-point engine of the same structure.
From the experimental results, the performance of the engine implemented with low precision (8-bit fixed-point) increases comprehensively. We synthesize the 8-bit fixed-point engine with the DSP-priority option enabled, while the 32-bit fixed-point engine is synthesized without it.
a. The delay is reduced because the synthesis uses DSP priority and the low bit-width reduces the computational load.
b. The synthesis uses DSP priority, resulting in a significant reduction in LUT usage, which leaves more room for further performance improvement.
c. The engine design does not take full advantage of the bandwidth of each BRAM; in fact, we can reduce the use of BRAM. BRAM is not a performance bottleneck.
d. Although the synthesis approaches differ, the DSP usage reduction of the 8-bit fixed-point engine is very significant. This provides computational resources for performance improvement.
e. The low bit-width reduces the computational load and the power consumption.
It is worth noting that the 16-bit fixed-point and the 8-bit fixed-point engines use almost the same amount of DSP resources; this is a point we need to optimize in future work.
2) Pipeline: We analyzed the hardware limitations and the requirements of the pipeline in Section IV-D. The number of PEs and BRAMs per layer of the non-pipelined engine is computable.

TABLE III
32/16/8 BIT FIXED-POINT PERFORMANCE

         Delay/ms   LUT      FF       BRAM     DSP    Throughput   Power/W
Total    -          303600   607200   1030     2800   -            -
32bit    1.4        55466    2493     1025.5   1645   4.1 Gops     0.903
16bit    1.1        15285    2074     571      564    5.2 Gops     0.607
8bit     0.96       7204     1316     343.5    574    6.0 Gops     0.470

Fig. 13. 32/16/8 bit fixed-point comparison

TABLE IV
CALCULATION PARAMETERS

Parameter   N_DSP   N_BRAM   M_DSP    M_BRAM   T1            T2        T3        T4
Value       544     56       C1*544   C2*56    2880/C1+720   8000/C2   8000/C2   700+10000/F1+1250/F2

TABLE V
8-BIT FIXED-POINT AND PIPELINE PERFORMANCE

           Delay/ms   LUT    FF     BRAM    DSP   IO   BUFG   Throughput   Power/W
8bit       0.96       7204   1316   343.5   574   17   1      6.0 Gops     0.470
Pipeline   0.49       9071   1681   619     916   17   1      44.9 Gops    0.658

The values of the relevant calculation parameters are given in Table IV. According to equation (11), we can calculate the corresponding combinations of parameters. According to the theoretical model, it is easy to draw the conclusion that the throughput can reach 124.7 Gops with (C1, C2, F1, F2) = (2, 4, 4, 10).
We choose (C1, C2, F1, F2) = (1, 2, 4, 2.5) to implement the pipeline. The hardware expenditure of the engine after pipelining (Laius) is shown in Table V. From the data in the table, the throughput is greatly improved compared to the non-pipelined version: using about one third of the DSPs and a small fraction of the LUTs of the device, the throughput reaches 44.9 Gops.
3) Comparison of Accuracy: In order to test the accuracy of the 8-bit engine, we create a fully equivalent simulator. We test the original Caffe model, the 32-bit fixed-point engine, the 8-bit fixed-point trained model, and Laius. The test set contains the 10,000 test images of MNIST. As can be seen from the data in Table VI, Laius has an accuracy of 98.16%, which is only 1% lower than that of the 32-bit fixed-point engine.

TABLE VI
COMPARISON OF ACCURACY

                         Accuracy
Caffe original model     99.13%
32-bit engine            99.17%
Fixcaffe trained model   98.19%
Laius                    98.16%

C. Discussion
By comparing the performance of the engines with different bit-widths, we can see that the performance gain of the low bit-width implementation is large.
1) The accuracy of Laius is ensured by low-precision training and simulator-based data evaluation. In addition, the choice of the scaling factor could be matched even better to the use of the simulator.

2) The experimental results show that the low-precision engine brings performance improvements in all aspects. The 8-bit fixed-point engine saves much more hardware resources than the 32-bit fixed-point engine; when using the same resources, more computing units can be instantiated to improve performance. Moreover, the vast majority of the LUT resources are not being used, which leaves more room for further performance improvement.
3) In the derivation of the theoretical model, we assume that only the convolutional layers and the fully-connected layers are optimized. Within a certain range, this assumption affects the choice of parameters and limits the performance improvement.
4) The memory access mode of the engine is very simple; we could explore a better access mode so that CONV2 can be completed in one pass.

VI. RELATED WORK
Recently, FPGA-based CNN accelerators have attracted more and more attention because of their reconfigurability and superior energy efficiency compared to GPUs. The performance of a GPU accelerator varies with the data batch size, while an FPGA accelerator's performance is insensitive to it. There are several available works on FPGA accelerators for CNNs. In CNNs, multiply-accumulate (MAC) is the main form of operation.
Works [3] [8] [16] design three different types of computing units; the work of this paper draws on the unit design of [3] [16] to realize the reuse of data. The goal of FPGA accelerator design is to improve resource utilization. Works [12] [13] propose and implement a hierarchical approach, making the allocation of resources more fine-grained and improving resource utilization. In the work of [4] [6], the authors propose low-precision training and data compression methods to reduce bandwidth pressure and power consumption, respectively. In the work of [1] [2], the authors use the roofline model as a guide to form a better trade-off among bandwidth, throughput and resource utilization. Many works focus on improving energy efficiency [17] [18]. One of them uses OpenCL-based high-level synthesis tools to exploit the FPGA's pipelining capability and minimize the requirement for memory bandwidth [17]. Another work [18] uses a deeply pipelined architecture across multiple FPGAs instead of a single-board FPGA, which expands the design space for optimal energy efficiency. Besides pipelined designs, another work [19] is worth mentioning: it addresses the challenge for an FPGA accelerator to achieve higher throughput than its GPU counterparts while keeping the energy-efficiency advantage over GPUs.

VII. CONCLUSION
In this work, we design and implement a LeNet hardware inference engine (Laius) with 8-bit fixed-point operation. PE tiling is introduced to exploit data-level parallelism, and the engine is designed as a four-stage pipeline to enlarge the throughput. Experimental results demonstrate that the low-precision implementation achieves a good trade-off between accuracy and resources: with only 1% loss of accuracy, the 8-bit fixed-point engine largely reduces the hardware resource requirement compared to a 32-bit fixed-point engine of the same structure.

VIII. ACKNOWLEDGMENT
This project was supported by NSFC 61402501 and 61472432. We are also grateful to the teachers and students who helped with this paper.

REFERENCES
[1] Zhang C, Li P, Sun G, et al. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks[J]. 2015:161-170.
[2] Motamedi M, Gysel P, Akella V, et al. Design space exploration of FPGA-based Deep Convolutional Neural Networks[C]// Design Automation Conference. IEEE, 2016:575-580.
[3] Wang C, Gong L, Yu Q, et al. DLAU: A Scalable Deep Learning Accelerator Unit on FPGA[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2016, PP(99):1-1.
[4] Gupta S, Agrawal A, Gopalakrishnan K, et al. Deep Learning with Limited Numerical Precision[J]. Computer Science, 2015.
[5] Reagen B, Whatmough P, Adolf R, et al. Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators[C]// ACM/IEEE International Symposium on Computer Architecture. IEEE, 2016:267-278.
[6] Han S, Liu X, Mao H, et al. EIE: Efficient Inference Engine on Compressed Deep Neural Network[J]. ACM SIGARCH Computer Architecture News, 2016, 44(3):243-254.
[7] Ji Y, Zhang Y H, Li S C, et al. NEUTRAMS: Neural network transformation and co-design under neuromorphic hardware constraints[C]// IEEE/ACM International Symposium on Microarchitecture. ACM, 2016:1-13.
[8] Farabet C, Poulet C, Han J Y, et al. CNP: An FPGA-based processor for Convolutional Networks[C]// International Conference on Field Programmable Logic and Applications. IEEE, 2009:32-37.
[9] Meloni P, Deriu G, Conti F, et al. Curbing the roofline: a scalable and flexible architecture for CNNs on FPGA[C]// ACM International Conference on Computing Frontiers. ACM, 2016:376-383.
[10] Suda N, Chandra V, Dasika G, et al. Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks[C]// ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2016:16-25.
[11] Chen T, Du Z, Sun N, et al. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning[J]. ACM SIGPLAN Notices, 2014, 49(4):269-284.
[12] Shen Y, Ferdman M, Milder P. Overcoming resource underutilization in spatial CNN accelerators[C]// International Conference on Field Programmable Logic and Applications. IEEE, 2016:1-4.
[13] Shen Y, Ferdman M, Milder P. Maximizing CNN Accelerator Efficiency Through Resource Partitioning[J]. 2016.
[14] Du Z, Fasthuber R, Chen T, et al. ShiDianNao: shifting vision processing closer to the sensor[J]. ACM SIGARCH Computer Architecture News, 2015, 43(3):92-104.
[15] Alwani M, Chen H, Ferdman M, et al. Fused-layer CNN accelerators[C]// IEEE/ACM International Symposium on Microarchitecture. IEEE, 2016:1-12.
[16] Chen Y H, Emer J, Sze V. Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks[J]. IEEE Micro, 2016, PP(99):1-1.
[17] Wang D, An J, Xu K. PipeCNN: An OpenCL-Based FPGA Accelerator for Large-Scale Convolution Neuron Networks[J]. 2016.
[18] Zhang C, Wu D, Sun J, et al. Energy-Efficient CNN Implementation on a Deeply Pipelined FPGA Cluster[C]// International Symposium on Low Power Electronics and Design. ACM, 2016:326-331.
[19] Li Y, Liu Z, Xu K, et al. A GPU-Outperforming FPGA Accelerator Architecture for Binary Convolutional Neural Networks[J]. 2017.
[20] Guo S, Wang L, Chen B, et al. FixCaffe: Training CNN with Low Precision Arithmetic Operations by Fixed Point Caffe[J]. 2017:38-50.
Additionally, this engine is designed to be a four stage pipeline

