
Data and Hardware Efficient Design for Convolutional Neural Network

Yue-Jin Lin and Tian Sheuan Chang, Senior Member, IEEE

Abstract—Hardware design of deep convolutional neural networks (CNNs) faces the challenges of high computational complexity and data bandwidth, as well as huge divergence among different CNN network layers: the throughput of the convolutional layers is bounded by the available hardware resources, while the throughput of the fully connected layers is bounded by the available data bandwidth. Thus, a highly flexible and efficient design is desired to meet these needs. This paper presents an end-to-end CNN accelerator that maximizes hardware utilization with run-time configuration for different kernel sizes. It also minimizes data bandwidth with the output first strategy, which improves the data reuse of the convolutional layers by up to 300× to 600× compared with the non-reused case. The whole CNN implementation of the target network is generated to be optimal in both hardware and data efficiency under the design resource constraints, and can be run-time reconfigured with the layer-optimized parameters to achieve real-time, end-to-end CNN acceleration. An implementation example for AlexNet consumes a 1.783 M gate count for 216 MACs and a 142.64 KB internal buffer in a TSMC 40 nm process, and achieves 99.7 f/s and 61.6 f/s under a 454 MHz clock frequency for the convolutional layers and the whole AlexNet, respectively.

Index Terms—Convolutional neural networks (CNNs), hardware design.

I. INTRODUCTION

IN RECENT years, deep convolutional neural networks (CNNs) have been widely adopted in recognition [2]–[6], detection [7]–[11], tracking [14], [15], scene labeling [17], [18], pose estimation [21], [22], industrial inspection [23], [24], and other computer vision fields because of their unprecedented accuracies, which can even surpass human accuracy [5]. However, the computation of a convolutional neural network demands several giga multiply-accumulate operations (GMACs) and millions of data per layer, as shown in Fig. 1. This high computational complexity and data amount demand significant hardware and bandwidth resources for real-time applications.

Fig. 1. Computational complexity and data amount of the popular networks, where C1, C2, ... denote the convolutional layers and F6, F7, ... denote the fully connected layers.

Furthermore, the high variation of these demands between different layers makes a regular hardware CNN implementation even more challenging. Fig. 2 shows the demands of complexity and data access in different layers: the convolutional layers are dominated by computation, while the fully connected layers are dominated by data access. Fig. 3 shows that the data distribution among layers is quite different even when all layers are convolutional; in particular, the last few layers, which have smaller image sizes, hold more weights. Thus, an efficient CNN implementation needs a flexible, layer-adaptive design to accommodate these variations. In brief, the trade-off among flexibility, performance, and cost is the major design consideration for a CNN implementation.

Fig. 2. Distribution of computation and data of different networks.

Fig. 3. Distribution of different data types of different networks.

Manuscript received June 16, 2017; revised August 11, 2017, September 5, 2017, and September 22, 2017; accepted October 2, 2017. Date of publication November 7, 2017; date of current version April 2, 2018. This work was supported in part by the Ministry of Science and Technology, Taiwan, under Grant 104-2221-E-009-171-MY2 and Grant 105-2119-M-009-009, and in part by the Research of Excellence Program under Grant 106-2633-E-009-001. This paper was recommended by Associate Editor T. Serrano-Gotarredona. (Corresponding author: Tian Sheuan Chang.)
Y.-J. Lin is with Novatek, Taiwan.
T. S. Chang is with the Department of Electronics Engineering, National Chiao Tung University, Taiwan (e-mail: f34150@yahoo.com.tw; tschang@mail.nctu.edu.tw).
Digital Object Identifier 10.1109/TCSI.2017.2759803


To address these complexity and bandwidth issues, several works [13], [26]–[33] have been proposed. However, most of these works focus on the convolutional part only and fail to address layer-specific needs in a hardware- and data-efficient way. Thus, we propose a CNN accelerator design with the following contributions:
• An efficient output first strategy that reuses the data of the convolutional layers by up to 300× to 600× compared to the non-reused case.
• An optimized CNN accelerator with run-time configurations for multiple kernel sizes to maximize hardware utilization.
• An example implementation of this architecture for AlexNet [2] that achieves state-of-the-art throughput and area efficiency.

The rest of the paper is organized as follows. Section II gives a short overview of the CNN algorithm. Section III reviews related work on CNN hardware designs. Section IV presents the analysis of computational complexity and data access. Based on this analysis, Section V presents our proposed hardware design and its generator. The implementation results and comparisons are shown in Section VI. Finally, this paper is concluded in Section VII.

II. OVERVIEW OF THE CNN ALGORITHM

Fig. 4. Typical CNN structure.

Fig. 4 shows an example of a typical CNN structure, which includes the convolutional layer, pooling layer, fully connected layer, and activation function. All of these layers are implemented in this paper.

The convolutional layer uses numerous trained filters, inspired by the receptive fields of the visual cortex, to extract features ranging from low-level local ones (edges, corners, lines) to high-level abstract ones (categories) [35]. The convolutional layer behaves more sparsely than a pure neural network because of its weight sharing property, which relieves the convolutional layer from the high weight bandwidth of a pure neural network. The computation of a convolutional layer is defined by

$$\mathrm{Output}_o(r,c) = \sum_i \mathrm{Weight}_{o,i} \otimes \mathrm{Input}_i(S \cdot r, S \cdot c) + \mathrm{Bias}_o = \sum_i \sum_x \sum_y \mathrm{Weight}_{o,i}(x,y)\,\mathrm{Input}_i(S \cdot r + x, S \cdot c + y) + \mathrm{Bias}_o \qquad (1)$$

where "Input" is the input feature map, "Output" is the output feature map, and "Weight" is the filter kernel. The output is the sum of all the convolutional results and the bias term. Successive convolutions are separated by a stride distance S (the stride of the shifting window). Fig. 5 shows an example with S equal to two.

Fig. 5. Computation of the convolutional layer.
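As a reference for Eq. (1), the following Python sketch evaluates the layer directly (a naive illustration of the formula, not the paper's hardware; all names are ours):

```python
import numpy as np

def conv_layer(inputs, weights, biases, S=1):
    """Direct evaluation of Eq. (1).

    inputs:  (I, H, W)     input feature maps
    weights: (O, I, K, K)  filter kernels
    biases:  (O,)          one bias per output map
    S:       stride of the shifting window
    """
    O, I, K, _ = weights.shape
    _, H, W = inputs.shape
    R, C = (H - K) // S + 1, (W - K) // S + 1   # output rows/columns
    out = np.zeros((O, R, C))
    for o in range(O):
        for r in range(R):
            for c in range(C):
                acc = biases[o]                 # start from the bias term
                for i in range(I):              # sum over all input maps
                    for x in range(K):
                        for y in range(K):
                            acc += weights[o, i, x, y] * inputs[i, S*r + x, S*c + y]
                out[o, r, c] = acc
    return out
```

The three inner sums mirror the indices i, x, y in Eq. (1); every output pixel costs I·K² MACs, which is the starting point of the complexity analysis in Section IV.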
The pooling layer executes a down-sampling operation that not only reduces the feature map size but also improves the translational invariance of the features. The typical pooling types are maximum pooling and average pooling, which take the maximum value and the average value of the corresponding kernel window, respectively. Fig. 6 shows an example of maximum pooling with a 2×2 kernel and a stride of two.

Fig. 6. An example of maximum pooling.
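A minimal sketch of both pooling types (our own illustration):

```python
import numpy as np

def pool(feature_map, K=2, S=2, mode="max"):
    """Down-sample one feature map with a KxK kernel and stride S."""
    H, W = feature_map.shape
    R, C = (H - K) // S + 1, (W - K) // S + 1
    out = np.empty((R, C))
    for r in range(R):
        for c in range(C):
            window = feature_map[S*r:S*r + K, S*c:S*c + K]
            out[r, c] = window.max() if mode == "max" else window.mean()
    return out
```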
The fully connected layer uses a large number of trained weights to decide the category likelihoods for classification. The fully connected layer is a pure neural network with numerous weight data, and thus demands high weight bandwidth. The computation of a fully connected layer is defined by

$$\mathrm{Output}(o) = \sum_i \mathrm{Weight}(o,i)\,\mathrm{Input}(i) + \mathrm{Bias}(o) \qquad (2)$$

where "Input" is the input node, "Output" is the output node, and "Weight" is the weight synapse. The output is the sum of all the products and the bias term, as shown in Fig. 7.

Fig. 7. Computation of the fully connected layer.

The activation function abstractly represents the expected firing rate of biological neurons [36]. Moreover, adding non-linear activation functions after the convolutional and fully connected layers lets the whole network approximate the real model more closely than a purely linear composition. Popular functions include the sigmoid function $1/(1+e^{-x})$ and the hyperbolic tangent $\tanh(x)$. Recently, the rectified linear unit (ReLU) [37], $\max(0, x)$, has shown much faster training convergence [2] and has become more popular since AlexNet adopted it. In this paper, we also use ReLU as our activation function.
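Eq. (2) reduces to a matrix-vector product plus a bias; combined with the ReLU above it reads, in a short sketch of ours:

```python
import numpy as np

def fully_connected(inputs, weights, biases):
    """Direct evaluation of Eq. (2): Output(o) = sum_i Weight(o, i) * Input(i) + Bias(o)."""
    return weights @ inputs + biases        # weights: (O, I), inputs: (I,)

def relu(x):
    """Rectified linear unit, max(0, x), the activation used in this paper."""
    return np.maximum(0, x)

# Example: O = 4 output nodes from I = 8 input nodes.
rng = np.random.default_rng(0)
y = relu(fully_connected(rng.normal(size=8), rng.normal(size=(4, 8)), np.zeros(4)))
```

Note that the weight matrix has I·O distinct entries, each used exactly once per input vector — this is why the fully connected layer is bandwidth-bound rather than compute-bound.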

III. RELATED WORK

A. FPGA Designs

nn-X is a series of designs [16], [19], [27], [38] that translate software-defined networks into corresponding hardware configuration sequences. The first design, CNP [38], can execute all networks but faces a scalability difficulty at higher network performance. The NeuFlow design [16] on FPGA and its ASIC counterpart [19] were designed to be more scalable thanks to their runtime reconfigurable dataflow architecture. The revised version of NeuFlow, nn-X [27], used a large filter kernel size for its implementation, and thus had lower hardware utilization for small filter kernels.

Beyond computation, [28] first proposed a memory-centric design methodology that tried to minimize external communication and buffer size. To handle external communication and buffer size more efficiently, they used an HLS (high-level synthesis) accelerator template to find the optimized access solution. Reference [29] went a step further and optimized not only communication but also computation with the HLS tool using the roofline model. However, the hardware structure was optimized only for a fixed network and the convolutional layer, and was not optimized for other kinds of networks. Reference [30] implemented VGG [4] with improved bandwidth and hardware utilization through a dynamic-precision data quantization method. However, this implementation was dedicated to 3x3 convolutions and cannot execute networks with other kernel sizes.

B. ASIC Designs

The DaDianNao series [12], [13], [39]–[41] minimized communication traffic with a tiling strategy and internal buffers. The largest-scale design among them, DaDianNao [13], stored all weights in on-chip eDRAM to reduce off-chip weight access as much as possible. However, modern CNNs are too large to store all their weights in on-chip memory. Origami [25], [31] implemented a scene labeling network [17] with a fixed 7x7 kernel size; besides, it implemented only the convolutional layer. EIE [26] solved the high weight bandwidth of the fully connected layer by model compression. Reference [32] used a dual-range multiply-accumulate (DRMAC) unit for low-power convolution; however, it was tested only on a small dataset, MNIST, as an example. Eyeriss [33] was a spatial array architecture that can be mapped to a wide range of kernel sizes (1∼32 × 1∼12). Moreover, Eyeriss proposed a row stationary (RS) [42] data flow that subsumes both output stationary [12], [28] and weight stationary [27], [30], [31] data flows to maximize internal and external reuse and minimize data movement energy. Envision [34] proposed a precision-scalable design to achieve ultra-low power.

In summary, various designs have been proposed, but none offers a systematic analysis of, and approach to, the design problem that eases design generation for different networks.

IV. DESIGN ANALYSIS

A. All Unit-Stride Convolutional Kernels

Hardware design of CNNs faces high computational complexity, heavy data access, and network variations. For a CNN, one of the major variations comes from the different kernel sizes and non-unit stride lengths. Too many possible kernel sizes make the convolutional layers hard to analyze and implement. Convolutional kernels with a non-unit stride also lead to lower throughput, by about $1/S^2$, since the true output data are only $1/S^2$ of all output data. Thus, to simplify the design and ease the analysis, we propose to map all non-unit stride convolutional kernels to decomposed unit-stride convolutional kernels.

Fig. 8. An example of mapping a non-unit stride convolutional kernel to multiple unit-stride convolutional kernels. The input and the kernel are divided into four parts; the parts with the same symbols are associated for the convolution operation.

Fig. 8 shows an example that maps one 6x6 convolutional kernel with stride two to four decomposed 3x3 convolutional kernels with stride one. For a $K \times K$ kernel with stride $S$, the new kernel sizes are $(\lfloor K/S \rfloor + i) \times (\lfloor K/S \rfloor + j)$, where $i = j = 0$ if $K \bmod S = 0$, and $i, j = 0 \cdots S-1$ otherwise. For a 6x6 kernel with stride 2, the mapping yields four 3x3 kernels. For a 7x7 kernel with stride 2, as used in GoogLeNet and ResNet, the mapped kernel sizes are 3x3, 3x4, 4x3, and 4x4. This results in a fully usable output. However, the nonsquare kernel sizes above are bad for implementation. Thus, we zero-pad the kernels and obtain a new formula for the mapped kernel size, $\lceil K/S \rceil \times \lceil K/S \rceil$; for a 7x7 kernel with stride 2, we then have four 4x4 kernels. Note that this results in slightly lower hardware utilization (roughly around 70%), but it is still better than the hardware utilization of the original non-unit stride convolution (roughly $1/S^2$, usually ≤ 1/4). Thus, we discuss only the stride-one case for the convolutional layer in the following. The original input with index $(x, y)$ is likewise divided into multiple smaller inputs with the new index $(x', y') = (\lfloor x/S \rfloor, \lfloor y/S \rfloor)$. Each smaller input corresponds to the new filter kernel at the same relative position for convolution, as in Fig. 8.
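The mapping can be checked numerically. The sketch below (our own code, not the paper's hardware) decomposes a stride-S convolution into S² unit-stride sub-convolutions exactly as described, and verifies the 7x7/stride-2 case against a direct computation:

```python
import numpy as np

def conv2d(img, ker, stride=1):
    """Plain 2D convolution in the cross-correlation form of Eq. (1)."""
    H, W = img.shape
    KH, KW = ker.shape
    R, C = (H - KH) // stride + 1, (W - KW) // stride + 1
    out = np.zeros((R, C))
    for r in range(R):
        for c in range(C):
            out[r, c] = np.sum(ker * img[stride*r:stride*r+KH, stride*c:stride*c+KW])
    return out

def conv2d_decomposed(img, ker, S):
    """Map one KxK stride-S convolution onto S*S unit-stride sub-convolutions.

    Sub-input (i, j) keeps every S-th input pixel starting at offset (i, j)
    (new index (x', y') = (x//S, y//S)); the matching sub-kernel keeps every
    S-th weight. The hardware additionally zero-pads each sub-kernel to
    ceil(K/S) x ceil(K/S) so all sub-kernels are square and equal-sized;
    zero weights do not change the sums, so the natural sizes are used here.
    """
    H, _ = img.shape
    K = ker.shape[0]
    R = (H - K) // S + 1                    # output size of the original convolution
    out = np.zeros((R, R))
    for i in range(S):
        for j in range(S):
            part = conv2d(img[i::S, j::S], ker[i::S, j::S], stride=1)
            out += part[:R, :R]             # every partial output covers at least R x R
    return out

# Sanity check on a 7x7 kernel with stride 2 (the GoogLeNet/ResNet case):
rng = np.random.default_rng(1)
img, ker = rng.normal(size=(23, 23)), rng.normal(size=(7, 7))
assert np.allclose(conv2d(img, ker, stride=2), conv2d_decomposed(img, ker, S=2))
```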
With the above conversion, each newly mapped convolution is executed as a normal convolution. To prepare the input data without non-contiguous memory access, one way is to load the original input as usual into a local input line buffer and distribute the input to the different filter kernels. Their outputs are then combined to generate the final output. The drawback of this approach is the extra line buffer cost, but it can be integrated into the current design seamlessly, without extra control burden or non-contiguous memory access.

B. Symbol Definitions for Analysis

To ease the following analysis, we adopt the symbol definitions of a tile-based model, as shown in Fig. 9 and Table I. The tile-based model divides the input/output/batch into non-overlapping rectangular partition units, denoted as tiles, to ease processing. In this analysis, the results for the fully connected layer are derived as a simplified version of the convolutional layer, because the fully connected layer is the special case of the convolutional layer ($R = C = K = K_p = S_p = 1$) with kernel size 1x1.

Fig. 9. Illustration of the symbols in the convolutional model.

TABLE I. SYMBOL DEFINITIONS OF THE CONVOLUTIONAL MODEL

C. Analysis of Computational Complexity

For one output pixel, we need all the input maps ($I$) and the weight kernels ($K \times K$) as inputs, which takes $I K^2$ MACs. Therefore, the total number of MACs for the $O \cdot R \cdot C$ output pixels is $O R C \cdot I K^2$, i.e., the computational complexity of the convolutional layer is $I O R C K^2$. Accordingly, the computational complexity of the fully connected layer is $I \cdot O$.
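As a back-of-the-envelope check of these counts, using the well-known AlexNet layer shapes (our own numbers, consistent with the distributions in Figs. 1 and 2):

```python
def conv_macs(I, O, R, C, K):
    # Computational complexity of a convolutional layer: I * O * R * C * K^2
    return I * O * R * C * K * K

def fc_macs(I, O):
    # Fully connected layer: I * O MACs, and also I * O distinct weights
    return I * O

# AlexNet C1: 3 input maps, 96 output maps of 55x55, 11x11 kernels
print(conv_macs(3, 96, 55, 55, 11))   # 105_415_200 MACs, but only 34_848 weights
# AlexNet F6: 9216 inputs, 4096 outputs
print(fc_macs(9216, 4096))            # 37_748_736 MACs *and* 37_748_736 weights
```

The contrast — over a hundred MMACs on a few tens of kilo-weights for C1, versus one weight per MAC for F6 — is exactly the compute-bound/bandwidth-bound split noted in the introduction.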
D. Analysis of Data Access

To evaluate the degree of data reuse, we use the average data access as an index:

$$\text{Average data access} = \frac{\text{amount of accessed data per batch layer}}{\text{input} + \text{output} + \text{weight}} \qquad (3)$$

This index measures the number of data accesses per unit of data. The denominator is the amount of input, output, and weight per batch when each is accessed only once. These three quantities depend only on the network parameters, so the denominator on the right-hand side of Eq. (3) is a fixed normalization factor. The numerator of Eq. (3) is the finally accessed input, output, and weight under the different data reuse policies and strategies discussed below. Note that, without loss of generality, we assume all these data have the same precision to ease the analysis.

TABLE II. DATA REUSE POLICY

1) Data Reuse Policy: Table II shows the general data reuse policy for input, output, and weight. Input data access can be reduced by $T_o$ because $T_o$ different output maps reuse the input data. For output, the accesses of the locally accumulated partial sums can be reduced by $T_i$ because $T_i$ different input maps share these partial sums. Weight data access can be reduced by $T_r T_c T_b$ since $T_r T_c$ pixels and $T_b$ batches reuse the weight data. For the fully connected layer, we can set $T_r = T_c = 1$ as a special case, which has a much lower data reuse potential for weight access.

2) Data Reuse Strategy: Based on the above data reuse policy, we can maximize one of these reductions first as our data reuse strategy, such that one data type is accessed only once; the others then basically follow the data reuse policy above.

TABLE III. DATA AMOUNT PER BATCH FOR THREE DIFFERENT DATA REUSE STRATEGIES

Table III summarizes the total data amount per batch for the different data reuse strategies. To illustrate the strategies, Fig. 10 shows the average data access of the convolutional layers under each of them, with AlexNet as an example. It shows that the output first strategy needs almost half the average data access of the input first strategy, and a quarter of that of the weight first strategy. The reason the output first strategy accesses less data than the others is the pooling layer (the factor $1/S_p^2$ in Table III) and the localized accumulation of partial sums (which avoids the factor 2 in Table III for writing and reading partial sums), saving the extra write-out and read-in accesses. Thus, we choose this strategy in this paper. Although this strategy has been used in other works, e.g., [33], [34], our analytical equations help find the best design point under the resource constraints through a program search.

Fig. 10. Average data access of the convolutional layer for different data reuse strategies.

TABLE IV. EFFICIENCY RATIO OF DATA REUSE FOR SINGLE-BATCH CONVOLUTIONAL LAYERS OF THE POPULAR NETWORKS

Table IV shows the data reuse efficiency compared to the non-reuse case (three data accesses — input, weight, and partial sum — per MAC, versus the data amount under output first reuse), which is around 300× to 600× depending on the network structure.
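The Table II policy can be turned into a small accounting model for Eq. (3). The formulas below are our own reconstruction from the stated reuse factors — the exact per-strategy entries are in Table III, which is not reproduced here — but they show how the tile sizes enter the index:

```python
def average_data_access(I, O, R, C, K, Ti, To, Tr, Tc, Tb, Sp=1):
    """Back-of-the-envelope model of Eq. (3) under the Table II reuse policy.

    A tile of Ti input maps, To output maps and Tr x Tc output pixels over
    Tb batches gives: input re-read once per group of To output maps,
    partial sums written/read once per group of Ti input maps, and weights
    re-read once per Tr x Tc pixel tile and per Tb batch group.
    """
    inp = I * R * C                     # input amount per batch (borders ignored)
    out = O * R * C / Sp**2             # output after pooling with stride Sp
    wgt = I * O * K * K
    acc_in = inp * (O / To)
    acc_out = out * (2 * I / Ti - 1)    # write+read partial sums, final write once
    acc_wgt = wgt * (R / Tr) * (C / Tc) / Tb
    return (acc_in + acc_out + acc_wgt) / (inp + out + wgt)

# AlexNet C1-like layer: full reuse reaches the ideal index of 1,
# while no tiling at all accesses each datum hundreds of times.
print(average_data_access(3, 96, 55, 55, 11, Ti=3, To=96, Tr=55, Tc=55, Tb=1))  # 1.0
print(average_data_access(3, 96, 55, 55, 11, Ti=1, To=1, Tr=1, Tc=1, Tb=1))     # ~3e2
```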

V. SYSTEM ARCHITECTURE

Fig. 11. The architecture of the convolutional neural network accelerator.

Fig. 11 shows the overall architecture of our CNN accelerator, which includes three data buffers with address generators for input, weight, and output, respectively, a configurable processing engine, and a partial result buffer. This CNN accelerator realizes an end-to-end implementation that covers the convolutional layer, pooling layer, fully connected layer, and ReLU layer. Moreover, the convolutional layer, pooling layer, and fully connected layer share the same processing engine to exploit hardware efficiency.

Fig. 12. Data flow of the CNN accelerator.

Fig. 12 shows that the data flow of the CNN accelerator has two operating modes: the convolutional mode (convolutional layer, pooling layer, and ReLU layer) and the fully connected mode (fully connected layer, ReLU layer, and 1x1 convolutional layer), in which the 1x1 convolutional layer is regarded as a fully connected layer with many batches. For the inference of a CNN, the accelerator operates layer by layer. It is run-time reconfigured according to each layer's structure through configuration data. The output of the intermediate layers is stored to external memory and read back as the next layer's input.

For the proposed accelerator, we adopt the output first strategy to design the data flow of the convolutional layer. In this design, we set $T_i = 1$ to minimize the input buffer size, because the overall data amount per batch is independent of $T_i$ in the output first strategy. That is, we compute just one input map at any given time.

Fig. 13. Data flow of the convolutional layer.

Fig. 13 shows the data flow of the convolutional layer, which includes the initial stage for the first input map convolution, the final stage for the last one, and the intermediate stage for the others. In the initial stage, we set the initial partial sums to the bias terms. Then, in the intermediate stage, the partial sums are stored in the partial result buffer, except in the case of no pooling layer; in that exceptional case, we save the final convolutional outputs into the output data buffer, from which they are stored to external memory afterwards. The same exceptional case also applies to the final stage. In addition, the final data path depends on whether the ReLU layer exists.
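The convolutional-mode flow just described can be summarized as a loop nest. The following Python sketch is our own restatement (all names are ours): with Ti = 1, one input map is streamed at a time while the partial sums of all output maps stay in the partial result buffer; the initial stage seeds them with the biases, and the final stage optionally applies ReLU.

```python
import numpy as np

def conv_mode(inputs, weights, biases, relu=True):
    """Output first data flow with Ti = 1: one input map at a time.

    inputs: (I, H, W), weights: (O, I, K, K), biases: (O,).
    `partial` plays the role of the partial result buffer, which is kept
    local across the I input-map passes; only the last pass writes out.
    """
    O, I, K, _ = weights.shape
    H, W = inputs.shape[1:]
    R = H - K + 1                                    # unit stride (Section IV-A)
    partial = np.broadcast_to(biases[:, None, None], (O, R, R)).copy()  # initial stage
    for i in range(I):                               # intermediate stages: Ti = 1
        for o in range(O):                           # To output maps reuse this input map
            for r in range(R):
                for c in range(R):
                    win = inputs[i, r:r+K, c:c+K]
                    partial[o, r, c] += np.sum(weights[o, i] * win)
    return np.maximum(partial, 0) if relu else partial  # final stage (+ optional ReLU)
```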
The hardware of the convolutional layer can also execute the pooling layer if the operation is average or maximum instead of convolution. The average operation is a special case of the convolution operation whose kernel weights are all $1/\text{kernel\_size}^2$. In addition, we can execute the maximum operation by reconfiguring the original accumulator into a max mode, as shown in Fig. 14. For the MAC mode, out = in × w + acc_in; for the MAX mode, out = sign(acc_in − in) ? in : acc_in.

Fig. 14. Two modes for the pooling layer.

Fig. 15 shows the data flow of the average and maximum pooling. Its input is the convolutional outputs from the partial result buffer. In addition, the weight and filter_in are constants that depend on the pooling type. For example, the weight is 1/4 or 1/9 and the filter_in is 0 for 2x2 or 3x3 average pooling, respectively. For maximum pooling, the weight and filter_in are −1 and the negative maximum, a value that can never be a true maximum, as shown in Fig. 15. All the final results are put into the output data buffer to be stored to external memory after the pooling operations.

Fig. 15. Data flow of the pooling layer.
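A minimal model of the two accumulator modes (our own illustration of the Fig. 14 behavior):

```python
def pe_step(x, w, acc, mode="MAC"):
    """One processing-element step in the two accumulator modes of Fig. 14.

    MAC mode: out = x * w + acc          (convolution / average pooling)
    MAX mode: out = max(x, acc), written in the paper as
              out = sign(acc - x) ? x : acc
    """
    return x * w + acc if mode == "MAC" else (x if acc - x < 0 else acc)

# 2x2 average pooling: weight = 1/4, accumulator seeded with filter_in = 0.
window = [3.0, 5.0, 2.0, 6.0]
acc = 0.0
for x in window:
    acc = pe_step(x, 0.25, acc, mode="MAC")
print(acc)                    # 4.0, the window average

# 2x2 max pooling: weight unused in MAX mode, seed = "negative maximum".
acc = float("-inf")
for x in window:
    acc = pe_step(x, -1.0, acc, mode="MAX")
print(acc)                    # 6.0, the window maximum
```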

To execute the fully connected layer on the same hardware as the convolutional layer, the data inputs of the fully connected layer have to be filled into the MAC output registers of the processing engine for self-accumulation, which needs an extra kernel_size² latency. To reduce this extra fill-in latency, we execute the self-accumulations on multiple input nodes in one computation period. The data flow of the fill-in procedure of the MAC output registers depends on the stage: the initial stage fills in zeros, the intermediate stage fills in the partial sums from the partial result buffer, and the final stage fills in the partial sums similarly to the convolutional layer. During the fill-in procedure, we also move the previous partial results to the partial result buffer for the initial and intermediate stages, and to the output data buffer for the final stage. The final data path again depends on whether the ReLU layer exists.

A. CNN Processing Engine

Fig. 16. Multiple 2D convolvers.

The basic architecture of our CNN processing engine (PE) is a set of multiple 2D convolvers [38], as shown in Fig. 16. In this architecture, we compute one input map and multiple output maps at the same time, since in the proposed output first strategy the overall data amount per batch is independent of $T_i$ and smaller for larger $T_o$ (input data ∝ $O/T_o$). The multiple 2D convolvers support different kernel sizes, including 1x1, 3x3, 4x4, and 5x5; larger kernel sizes are decomposed into smaller ones, and the 1x1 convolutional layer is realized as a fully connected layer with many batches. Fig. 17 shows how to restructure the PEs to fit different kernel sizes. For example, we can restructure 27 MACs into three 3x3 kernels (27 MACs), one 4x4 kernel (16 MACs), or one 5x5 kernel (25 MACs). With decomposition and restructuring, we maximize hardware utilization.

B. Buffer Design and the Corresponding Bandwidth

The buffer design and the required data bandwidth for the input, weight, and output data types have significant effects on hardware utilization. To ease the analysis without loss of generality, we analyze the hardware utilization and average data access of one type of data buffer at a time, assuming infinite size for the other two types of data buffers. This analysis can be regarded as the upper bound of hardware utilization and the lower bound of data access under ideal conditions. Besides, to be more concrete, we assume AlexNet as the example for the following analysis.

Fig. 18 to Fig. 20 show the analysis results for the different types of data buffers, assuming 108 MACs and 16-bit data. The results show that, for the input, output, and weight cases respectively, the upper bound of hardware utilization is higher with a ping-pong buffer, while the average data access is lower with a single buffer. Thus, we choose the ping-pong buffer to hide the latency of the data access and maximize hardware utilization.
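A behavioral sketch of the chosen ping-pong scheme (our own illustration, not the paper's RTL): two banks let the DMA fill one bank while the engine computes from the other, hiding the load latency at the cost of doubled buffer area.

```python
class PingPongBuffer:
    """Two banks: the engine computes from one while DMA fills the other."""

    def __init__(self):
        self.banks = [None, None]
        self.active = 0                    # bank the compute engine reads

    def swap(self):
        self.active ^= 1

    def load(self, data):                  # DMA side: fill the idle bank
        self.banks[self.active ^ 1] = data

    def read(self):                        # compute side: read the active bank
        return self.banks[self.active]

def run(n_tiles, load_time, compute_time):
    """Cycle estimate: single buffer vs. ping-pong for n tiles."""
    single = n_tiles * (load_time + compute_time)
    pingpong = load_time + n_tiles * max(load_time, compute_time)  # loads hidden
    return single, pingpong

print(run(8, load_time=3, compute_time=10))   # (104, 83): ping-pong wins
```

Whenever the compute time of a tile covers its load time, utilization approaches 100%, which is the upper-bound behavior seen in Figs. 18–20.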

Fig. 17. An example of PE connections for 27 MACs: (a) 3x3 convolution mode, (b) 4x4 convolution mode, (c) 5x5 convolution mode, and (d) fully connected mode.

Fig. 18. Input data buffer effect on (a) hardware utilization and (b) average data access.

Fig. 19. Output data buffer effect on (a) hardware utilization and (b) average data access.

Fig. 20. Weight data buffer effect on (a) hardware utilization and (b) average data access.

The optimal input and output buffer sizes are determined by the convolutional layers, because the data per tile is always one for the fully connected layer, so the fully connected layer needs only a smaller input and output data buffer.

The weight buffer size for the convolutional layer is just the number of hardware MACs (denoted as $M^*$). For the fully connected layer, the weight buffer size, $T_i M^*$, cannot improve the hardware utilization beyond the upper bound set by the limited weight bandwidth. In short, this buffer size is relatively insensitive once it exceeds a certain level. Even so, the weight buffer size matters more for the fully connected layers than for the convolutional layers. Thus, we use the fully connected layers to decide the optimized weight buffer size.

For the weight bandwidth problem, we assume large-scale hardware with 432 MACs, small-scale hardware with 54 MACs, a high-bandwidth case with 9 weight loads per cycle, and a low-bandwidth case with 1 weight load per cycle for the analysis in Fig. 21. Fig. 21 shows that we need higher weight bandwidth and/or more batches for larger-scale convolutional layer hardware. For the fully connected layer, this problem becomes worse and exists even in the smaller-scale hardware. In the analysis of Fig. 22, we assume 108 MACs, a high-bandwidth case with 27 weight loads per cycle, and a low-bandwidth case with 1 weight load per cycle. Fig. 22 shows that the fully connected layers need much higher weight bandwidth and/or larger batches.

Fig. 21. Weight bandwidth effect on the convolutional layer.

Fig. 22. Weight bandwidth effect on the fully connected layer.
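The fully connected pressure follows from simple arithmetic: every FC MAC consumes a distinct weight, so a fetched weight is reused only across the batch. The sketch below is our own back-of-the-envelope check, using the 108-MAC, 27-loads-per-cycle setup of the Fig. 22 analysis:

```python
def fc_weight_bandwidth(macs, batch):
    """Weights needed per cycle to keep `macs` MAC units busy in the FC layer.

    A fetched weight can only be reused across the `batch` inputs that
    share it (the Tb reuse of Table II), so bandwidth ~ macs / batch.
    """
    return macs / batch

for batch in (1, 4, 27):
    need = fc_weight_bandwidth(108, batch)
    print(f"batch {batch:2d}: {need:5.1f} weights/cycle "
          f"({'OK' if need <= 27 else 'starved'} with 27 loads/cycle)")
```

At batch 1 the 108 MACs would need 108 weights per cycle and starve even the high-bandwidth case, while batch 4 or more brings the demand within reach — which is why larger batches ease the fully connected layers.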

C. Generator

Based on the above basic architecture and analysis, we use a generator that accepts the input configurations and design constraints and generates the optimized design and its parameters. These parameters include the static Verilog parameters and the dynamic configuration data, providing the trade-off between the performance, cost, and flexibility of the CNN accelerator design. Fig. 23 shows the proposed generator flow, which includes the hardware parameter search — the finally adopted buffer sizes, PE numbers, and weight bandwidth — and the corresponding Verilog and configuration data generation.

Fig. 23. The flow of the generator.

In the generator flow, we find the optimal tile configuration with the following two cost functions, which trade off hardware utilization against the number of data accesses. The first is hardware cost first, our first priority: $H + \lambda_1 D$, where $\lambda_1 = \lambda \cdot \text{total MACs}/\text{total data}$. The second is data access first, the second priority: $D + \lambda_2 H$, where $\lambda_2 = \lambda \cdot \text{total data}/\text{total MACs}$. In these equations, $H$ is the hardware utilization, represented by the cycle count per batch, and $D$ is the data access, represented by the data amount per batch. $\lambda_1$ and $\lambda_2$ are derived from $\lambda$, an optimization parameter that trades off hardware utilization against data access.

Fig. 24 shows an example of the design space exploration of AlexNet C1 under the constraints of 108 MACs and a 120 KB (60000-datum) internal buffer. The figure shows that design point a is the best hardware utilization solution (λ = 0) and design point b is the best data reuse solution (λ = ∞). The final choice depends on the user's constraints and preference.

Fig. 24. An example of the design space exploration of AlexNet C1.
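The cost functions drop naturally into a small exhaustive tile search. The sketch below is our own minimal reconstruction of the generator's search step: the buffer estimate and the cycle/access models inside it are illustrative stand-ins for the paper's analytical models, and only the hardware-cost-first objective $H + \lambda_1 D$ is shown.

```python
from itertools import product

def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

def search_tiles(I, O, R, C, K, macs, buf, lam=0.0):
    """Exhaustive tile search scored by H + lambda1 * D (a sketch, ours)."""
    total_macs = I * O * R * C * K * K
    total_data = I * R * C + O * R * C + I * O * K * K
    lam1 = lam * total_macs / total_data
    best = None
    for Ti, To, Tr, Tc in product(divisors(I), divisors(O), divisors(R), divisors(C)):
        need = Ti * Tr * Tc + To * Tr * Tc + Ti * To * K * K   # tile buffer estimate
        if need > buf:
            continue                                           # over the buffer budget
        H = total_macs / min(macs, Ti * To * K * K)            # cycles/batch (stand-in)
        D = (I * R * C) * (O / To) + (O * R * C) * (2 * I / Ti - 1) \
            + (I * O * K * K) * (R * C) / (Tr * Tc)            # Table II policy accesses
        score = H + lam1 * D
        if best is None or score < best[0]:
            best = (score, dict(Ti=Ti, To=To, Tr=Tr, Tc=Tc))
    return best

# AlexNet C1-like search under 108 MACs and a 60000-datum buffer,
# pure hardware-utilization objective (lambda = 0), as in the Fig. 24 exploration:
print(search_tiles(I=3, O=96, R=55, C=55, K=11, macs=108, buf=60000, lam=0.0))
```

Sweeping `lam` from 0 toward large values moves the chosen point from the best-utilization solution (point a in Fig. 24) toward the best-data-reuse solution (point b).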

VI. IMPLEMENTATION RESULTS

To demonstrate our design, we use the same area cost, data access per frame, and target network (batch-4 AlexNet) as Eyeriss [33] as the design constraints, for a detailed comparison. Our implementation is designed in Verilog and synthesized with the TSMC 40 nm CMOS process; it consumes a 1.783 M gate count for 216 MACs and a 142.64 KB internal buffer, as shown in Table V. The implementation achieves 99.7 f/s and 61.6 f/s for the convolutional layers and all layers of AlexNet, respectively, under a 454 MHz clock rate, as shown in Table VI.

TABLE V. THE CONFIGURATIONS OF THE EXAMPLE IMPLEMENTATION

TABLE VI. THE IMPLEMENTATION RESULTS OF THE BATCH-4 ALEXNET CASE

TABLE VII. COMPARISONS ON THE CONVOLUTIONAL LAYERS OF THE BATCH-4 ALEXNET CASE

Table VII shows the comparisons on the convolutional layers of the batch-4 AlexNet case. Our design has much higher throughput and lower area cost than Eyeriss [33]; thus the area efficiency of our design is higher even after technology scaling. This is because our connections are more regular, which lowers the hardware cost. For the bandwidth comparison, we use the uncompressed bandwidth of [33], since their compression method could be adopted here as well. Under this condition, our design needs a smaller internal buffer to obtain almost the same data access per frame. In addition, our design also has higher area and bandwidth efficiency than [33]. Thus, our data reuse strategy indeed relieves the data bandwidth.

TABLE VIII. COMPARISONS BETWEEN CNN ACCELERATORS

Table VIII shows the comparisons between different CNN accelerators. In this table, we list only the designs implemented in ASIC processes, not FPGA chips, due to space limitations. For a fair comparison, we have scaled all throughputs to the same process node. The area cost of [25] is the smallest because it has fixed connections and implements only convolutional layers; thus it has the highest area efficiency but limited applicability. Apart from this design, the area efficiency of our design is the highest among these works. The reason for this high efficiency is the proposed architecture with its high hardware utilization and the optimal design generation under the design constraints.

VII. CONCLUSION

In this paper, we propose a run-time reconfigurable CNN architecture that handles non-unit stride kernels and processes different kernel types efficiently, to overcome the large computational complexity and data amount of the CNN algorithm. The overall design exploits layer-specific characteristics for the buffer design and data bandwidth. The hardware design adopts a tile-based model and the output first strategy to reuse the data of the convolutional layers with 300× to 600× better efficiency than the non-reused case. Designs for different networks can be generated by a CNN generator so that the design for a target network is optimal under the given hardware resources and data bandwidth. An example design for AlexNet in a TSMC 40 nm process consumes a 1.783 M gate count for 216 MACs and a 142.64 KB internal buffer, and achieves 61.6 f/s under a 454 MHz clock frequency, showing better area and bandwidth efficiency than the other state-of-the-art works.

REFERENCES

[1] W. Qadeer, R. Hameed, O. Shacham, P. Venkatesan, C. Kozyrakis, and M. A. Horowitz, "Convolution engine: Balancing efficiency & flexibility in specialized computing," in Proc. ACM Int. Symp. Comput. Archit., 2013, pp. 24–35.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. NIPS, 2012, pp. 1097–1105.
[3] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE CVPR, Jun. 2015, pp. 1–9.
[4] K. Simonyan and A. Zisserman. (2014). "Very deep convolutional networks for large-scale image recognition." [Online]. Available: https://arxiv.org/abs/1409.1556
[5] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE CVPR, Jun. 2016, pp. 770–778.
[6] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 221–231, Jan. 2013.
[7] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. (2014). "OverFeat: Integrated recognition, localization and detection using convolutional networks." [Online]. Available: https://arxiv.org/abs/1312.6229
[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE CVPR, Jun. 2014, pp. 580–587.
[9] T. He, W. Huang, Y. Qiao, and J. Yao, "Text-attentional convolutional neural network for scene text detection," IEEE Trans. Image Process., vol. 25, no. 6, pp. 2529–2541, Jun. 2016.
[10] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, "A convolutional neural network cascade for face detection," in Proc. IEEE CVPR, Jun. 2015, pp. 5325–5334.
[11] D. Tomè, F. Monti, L. Baroffio, L. Bondi, M. Tagliasacchi, and S. Tubaro, "Deep convolutional neural networks for pedestrian detection," Signal Process., Image Commun., vol. 47, pp. 482–489, Sep. 2016.
[12] Z. Du et al., "ShiDianNao: Shifting vision processing closer to the sensor," in Proc. ISCA, Jun. 2015, pp. 92–104.
[13] Y. Chen et al., "DaDianNao: A machine-learning supercomputer," in Proc. MICRO, 2014, pp. 609–622.
[14] Y. Chen, X. Yang, B. Zhong, S. Pan, D. Chen, and H. Zhang, "CNNTracker: Online discriminative object tracking via deep convolutional neural network," Appl. Soft Comput., vol. 38, pp. 1088–1098, Jan. 2016.
[15] H. Li, Y. Li, and F. Porikli, "Robust online visual tracking with a single convolutional neural network," in Proc. ACCV, 2014, pp. 194–209.
[16] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and Y. LeCun, "NeuFlow: A runtime reconfigurable dataflow processor for vision," in Proc. IEEE CVPRW, Jun. 2011, pp. 109–116.
[17] L. Cavigelli, M. Magno, and L. Benini, "Accelerating real-time embedded scene labeling with convolutional networks," in Proc. DAC, 2015, p. 108.
[18] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, "Learning deep features for scene recognition using places database," in Proc. NIPS, 2014, pp. 487–495.
[19] P.-H. Pham, D. Jelaca, C. Farabet, B. Martini, Y. LeCun, and E. Culurciello, "NeuFlow: Dataflow vision processing system-on-a-chip," in Proc. MWSCAS, 2012, pp. 1044–1047.
[20] F. Conti and L. Benini, "An ultra-low-energy convolution engine for fast brain-inspired vision in multicore clusters," in Proc. IEEE Des. Autom. Test Eur. Conf., Mar. 2015, pp. 683–688.
[21] W. Yang, W. Ouyang, H. Li, and X. Wang, "End-to-end learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation," in Proc. IEEE CVPR, Jun. 2016, pp. 3073–3082.
[22] T. Pfister, J. Charles, and A. Zisserman, "Flowing convnets for human pose estimation in videos," in Proc. IEEE ICCV, Dec. 2015, pp. 1913–1921.
[23] D. Weimer, B. Scholz-Reiter, and M. Shpitalni, "Design of deep convolutional neural network architectures for automated feature extraction in industrial inspection," CIRP Ann.-Manuf. Technol., vol. 65, no. 1, pp. 417–420, 2016.
[24] X. Bian, S. N. Lim, and N. Zhou, "Multiscale fully convolutional network with application to industrial inspection," in Proc. WACV, 2016, pp. 1–8.
[25] L. Cavigelli and L. Benini. (Dec. 2015). "Origami: A 803 GOp/s/W convolutional network accelerator." [Online]. Available: https://arxiv.org/abs/1512.04295
[26] S. Han et al., "EIE: Efficient inference engine on compressed deep neural network," in Proc. ISCA, 2016, pp. 243–254.
[27] V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello, "A 240 G-ops/s mobile coprocessor for deep neural networks," in Proc. IEEE CVPRW, Jun. 2014, pp. 682–687.
[28] M. Peemen, A. A. Setio, B. Mesman, and H. Corporaal, "Memory-centric accelerator design for convolutional neural networks," in Proc. ICCD, 2013, pp. 13–19.
[29] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in Proc. ISFPGA, 2015, pp. 161–170.
[30] J. Qiu et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proc. ISFPGA, 2016, pp. 26–35.
[31] L. Cavigelli, D. Gschwend, C. Mayer, S. Willi, B. Muheim, and L. Benini, "Origami: A convolutional network accelerator," in Proc. GLSVLSI, 2015, pp. 199–204.
[32] J. Sim, J.-S. Park, M. Kim, D. Bae, Y. Choi, and L.-S. Kim, "A 1.42TOPS/W deep convolutional neural network recognition processor for intelligent IoE systems," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Jan./Feb. 2016, pp. 264–265.
[33] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Jan./Feb. 2016, pp. 262–263.
[34] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, "Envision: A 0.26-to-10TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28nm FDSOI," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2017, pp. 246–247.
[35] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in Proc. ECCV, 2014, pp. 818–833.
[36] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in Proc. AISTATS, 2011, p. 275.
[37] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. ICML, 2010, pp. 807–814.
[38] C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun, "CNP: An FPGA-based processor for convolutional networks," in Proc. FPL, 2009, pp. 32–37.
[39] T. Chen et al., "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," in Proc. ASPLOS, 2014, pp. 269–284.
[40] D. Liu et al., "PuDianNao: A polyvalent machine learning accelerator," in Proc. ASPLOS, 2015, pp. 369–381.
[41] T. Chen et al., "A high-throughput neural network accelerator," IEEE Micro, vol. 35, no. 3, pp. 24–32, May 2015.
[42] Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in Proc. ISCA, 2016, pp. 367–379.

Yue-Jin Lin received the M.S. degree in electronics engineering from National Chiao Tung University, Hsinchu, Taiwan, in 2016. He is currently with Novatek, Hsinchu. His current research interests are image processing, deep learning, and digital integrated circuits.

Tian Sheuan Chang (S'93–M'06–SM'07) received the B.S., M.S., and Ph.D. degrees in electronic engineering from National Chiao Tung University (NCTU), Hsinchu, Taiwan, in 1993, 1995, and 1999, respectively. From 2000 to 2004, he was a Deputy Manager with Global Unichip Corporation, Hsinchu. In 2004, he joined the Department of Electronics Engineering, NCTU, where he is currently a Professor. In 2009, he was a Visiting Scholar with Imec, Belgium. His current research interests include system-on-a-chip design, VLSI signal processing, and computer architecture. Dr. Chang received the Excellent Young Electrical Engineer award from the Chinese Institute of Electrical Engineering in 2007, and the Outstanding Young Scholar award from the Taiwan IC Design Society in 2010. He has been actively involved in many international conferences as an organizing committee or technical program committee member.
