Fast and Efficient Convolutional Accelerator for Edge Computing
IEEE Transactions on Computers, Vol. 69, No. 1, January 2020
Abstract—Convolutional neural networks (CNNs) are a vital approach in machine learning. However, their high complexity and energy consumption make them challenging to embed in mobile edge applications requiring real-time processing, such as smartphones. In order to meet the real-time constraint of edge devices, recently proposed custom hardware CNN accelerators have exploited parallel processing elements (PEs) to increase throughput. However, this straightforward parallelization of PEs and the high memory bandwidth it demands require high data movement, leading to large energy consumption. As a result, only a certain number of PEs can be instantiated when designing bandwidth-limited custom accelerators targeting edge devices. While most bandwidth-limited designs claim a peak performance of a few hundred giga operations per second, their average runtime performance is substantially lower than their roofline when applied to state-of-the-art CNNs such as AlexNet, VGGNet and ResNet, as a result of low resource utilization and arithmetic intensity. In this work, we propose a zero-activation-skipping convolutional accelerator (ZASCA) that avoids noncontributory multiplications with zero-valued activations. ZASCA employs a dataflow that minimizes the gap between its average and peak performances while maximizing its arithmetic intensity for both sparse and dense representations of activations, targeting the bandwidth-limited edge computing scenario. More precisely, ZASCA achieves a performance efficiency of up to 94 percent over a set of state-of-the-art CNNs for image classification with dense representation, where the performance efficiency is the ratio between the average runtime performance and the peak performance. Using its zero-skipping feature, ZASCA can further improve the performance efficiency of the state-of-the-art CNNs by up to 1.9x, depending on the sparsity degree of activations. The implementation results in 65-nm TSMC CMOS technology show that, compared to the most energy-efficient accelerator, ZASCA can process convolutions from 5.5x to 17.5x faster, and is between 2.1x and 4.5x more energy efficient while occupying 2.1x less silicon area.
Index Terms—Convolutional neural network, machine learning, hardware accelerator, edge computing
1 INTRODUCTION
Deep neural networks (DNNs), especially convolutional neural networks (CNNs) [1], have received tremendous attention due to their ability to surpass human-level accuracy on a wide range of complex tasks such as recognition, classification and detection [2]. Depending on their size and complexity, these networks achieve different degrees of classification/recognition accuracy. A CNN is a stack of multiple convolutional layers followed by fully-connected layers: convolutional layers extract high-level abstractions and features of raw data, whereas fully-connected networks are used to learn non-linear combinations of the extracted features. In 2012, a CNN called AlexNet [3] was introduced: it is constituted of 5 convolutional layers followed by 3 fully-connected layers and achieves a 42.9 percent misclassification rate (MCR) on the ImageNet dataset. AlexNet contains 2.3M weights and 58.6M weights in its convolutional and fully-connected layers, respectively, performing 1332M operations (i.e., 666M multiply-accumulates (MACs)) in its convolutional layers, and 117.2M operations (i.e., 58.6M MACs) in its fully-connected layers. VGGNet-16 [4] is another well-known CNN, containing 13 convolutional layers with 14.7M weights and 3 fully-connected layers with 124M weights. VGGNet-16 performs 30.6G operations in its convolutional layers and 248M operations in its fully-connected layers, achieving a 27 percent MCR on ImageNet. Recently, ResNets [5] achieved a lower complexity and a better MCR by using residual connections. For instance, ResNet-18 achieves a similar MCR to VGGNet-16 (i.e., 27.88 percent on ImageNet) while performing 3.6G and 1M operations in its 17 convolutional layers with 11M weights and 1 fully-connected layer with 1M weights, respectively. Moreover, ResNet-50, containing 49 convolutional layers with 23.5M weights and 1 fully-connected layer with 2M weights, achieved a better MCR (i.e., 22.85 percent on ImageNet) by going even deeper. ResNet-50 respectively performs 7G and 4M operations within the two types of layers. All these CNNs have won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [6].

Regardless of the fact that in almost all the aforementioned CNNs the majority of weights is found in fully-connected layers, the number of operations is dominated by convolutions. As a result, the processing time of CNNs is also dominated by the convolutional processes. This issue can easily be addressed by exploiting parallel processing elements (PEs) to

A. Ardakani and W.J. Gross are with the Electrical and Computer Engineering Department, McGill University, Montreal, QC H3A 0G4, Canada. E-mail: arash.ardakani@mail.mcgill.ca, warren.gross@mcgill.ca.
C. Condo is with the Mathematical and Algorithmic Sciences Lab, Huawei Technologies Co. Ltd, Paris 75002, France. E-mail: carlo.condo@huawei.com.
TABLE 1
Convolutional Layer Computation Parameters
TABLE 2
The FID for Convolutional Computations

Considering each output pixel assigned to a neuron, Table 2 shows the convolutional process of this example in a way similar to the fully-connected layer computations, where input pixels are read sequentially at each clock cycle (CC) and the neurons share the same input pixels. This example considers Cin = 1, Cout = 1, Hin = 2, Win = 8, Hf = 1, Wf = 3 and S = 1. Similar to the fully-connected dataflow, each neuron loads a different weight at each time step, subsequently accumulating the weighted input pixels. The number of time steps required to perform the convolutional computations is also equal to the number of input pixels, Hin x Win. When the weights are passed to the next neuron belonging to the same row of the output activation map, they need to be shifted by one position. However, weight passing between neurons of different rows requires a shift of Wf positions, as can be observed between clock cycles #6 and #9 for W1, denoted in red in Table 2.

As shown in Table 2, three PEs (i.e., neurons) are sufficient to perform the convolutions. In fact, there are only 3 active neurons at each time step. Each PE thus receives its input at clock cycles 3i + 1, 3i + 2 and 3i + 3. Their outputs are also valid after 3 clock cycles in the given example. So far, we have only considered a case with Hf = 1. In case of Hf = 3, the procedure in Table 2 has to be repeated 2 more times: the first iteration with W1, W2 and W3, the second with W4, W5 and W6, and the final one with W7, W8 and W9. Similarly, for higher values of Cin, the process has to be repeated Cin times. Therefore, a memory is required to store the partial values generated by the 3 neurons for each output pixel. In general, N output pixels can be computed using 3 neurons (i.e., PEs) and 3 separate N/3-element SRAM memories working in parallel. The unit generating the N output pixels of an output activation map is referred to as a convolutional element (CE).

The GFID matrix is structured as

$$
M =
\begin{bmatrix}
W_1     & 0       & \cdots & 0       \\
W_2     & \ddots  &        & \vdots  \\
\vdots  & W_1     &        &         \\
W_{W_f} & W_2     & \ddots &         \\
0       & \vdots  &        & W_1     \\
\vdots  & W_{W_f} &        & W_2     \\
        & 0       & \ddots & \vdots  \\
0       & \cdots  & 0      & W_{W_f}
\end{bmatrix},
\tag{3}
$$

where each of the N columns holds the weights W1, ..., WWf of one output pixel, shifted down by S rows with respect to the previous column. Each row of M contains at most Wf non-zero elements. The shift amount within each row of the output activation map is equal to S, denoted with a horizontal dashed line in the matrix M. The number of columns of the matrix M indicates the N output pixels that belong to the same row of the output activation map, while the number of rows of M denotes the required number of clock cycles. We use the GFID matrix M to represent the different filter sizes used in state-of-the-art CNNs. State-of-the-art networks such as LeNet, ZFNet, AlexNet, GoogleNet, VGGNet, NIN and ResNet are constructed based on a combination of filter sizes of 11x11 with S = 4, 7x7 with S = 2, 5x5 with S = 1, 5x5 with S = 2, 3x3 with S = 1, and 1x1 with S = 1. All of the aforementioned filter sizes and others can be easily represented using the matrix M. For instance, we represent the two filter sizes of 3x3 with S = 1 and 7x7 with S = 2 using the GFID below.

In Section 2.1, we showed that 3 PEs are sufficient to perform the convolutions for a filter size of 3x3 with S = 1. Therefore, a CE containing only 3 neurons can perform the convolutional computations. Considering a convolution of a row of a filter map with its corresponding input pixels, N + 2 clock cycles are required to generate the N output pixels which belong to the same row of the output activation map. For instance, in the given example in Table 2, 8 clock cycles are required to generate the output pixels of the first row of the output activation map (i.e., the first 6 output pixels). This example can also be expressed using the GFID matrix M as follows:

$$
M =
\begin{bmatrix}
W_1 & 0   & 0   & 0   & 0   & 0   \\
W_2 & W_1 & 0   & 0   & 0   & 0   \\
W_3 & W_2 & W_1 & 0   & 0   & 0   \\
0   & W_3 & W_2 & W_1 & 0   & 0   \\
0   & 0   & W_3 & W_2 & W_1 & 0   \\
0   & 0   & 0   & W_3 & W_2 & W_1 \\
0   & 0   & 0   & 0   & W_3 & W_2 \\
0   & 0   & 0   & 0   & 0   & W_3
\end{bmatrix}.
$$
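To make the structure of the GFID concrete, the following Python sketch (our illustration, not from the paper; the function name gfid is hypothetical) builds the matrix of Eq. (3) for arbitrary Wf, S and N, storing the weight Wk as the integer k:

```python
import numpy as np

def gfid(wf, s, n):
    """Build the GFID matrix of Eq. (3): column j holds the weight
    indices 1..Wf (standing for W1..WWf), shifted down by S rows per
    column. Rows map to clock cycles, columns to output pixels."""
    rows = s * (n - 1) + wf            # clock cycles needed for n outputs
    m = np.zeros((rows, n), dtype=int)
    for j in range(n):
        m[j * s : j * s + wf, j] = np.arange(1, wf + 1)
    return m

# The Table 2 example (Wf = 3, S = 1): 6 output pixels in 8 clock cycles,
# i.e., S*N + Wf - S = N + 2 cycles, matching the text above.
print(gfid(3, 1, 6))
```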
TABLE 3
The Maximum Utilization Factor for Different Filter Sizes

[Wf; S]   [1; 1]   [3; 1]   [5; 1]   [5; 2]   [7; 2]   [11; 4]
Max UF    100%     100%     100%     83%      88%      92%
T         1        3        5        3        4        3

$$
\mathrm{UF}_{\max} = \lim_{N \to \infty} \mathrm{UF} = \frac{W_f}{T \cdot S} \times 100.
\tag{7}
$$

Eq. (7) suggests that the highest performance efficiency is obtained when N >> (Wf - S). Table 3 shows the maximum utilization factors for filters with different filter sizes and stride values. Please note that Table 3 only shows the commonly used filter sizes, while other filter sizes can still be represented using the GFID.
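As a quick sanity check (our own snippet, not from the paper), Eq. (7) reproduces every entry of Table 3:

```python
# Verify Table 3 against Eq. (7): UF_max = Wf / (T * S) * 100.
# The (Wf, S, T) triples below are taken directly from Table 3.
for wf, s, t in [(1, 1, 1), (3, 1, 3), (5, 1, 5), (5, 2, 3), (7, 2, 4), (11, 4, 3)]:
    print(f"[{wf}; {s}] -> {100 * wf / (t * s):.0f}%")
# prints 100%, 100%, 100%, 83%, 88%, 92%, matching the table
```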
Fig. 5. Involved hardware resources and paths in case of convolution computations for (a) Wf = 3 and S = 1, (b) Wf = 7 and S = 2.
TABLE 4
The Maximum Utilization Factor for Different Filter Sizes Using the Proposed CE

[Wf; S]   [1; 1]   [3; 1]   [5; 1]   [5; 2]   [7; 2]   [11; 4]
Max UF    100%     100%     83%      83%      53%      92%

TABLE 6
The Complete Convolutional Process (a) without and (b) with Skipping Zero-Valued Activations
5 ZERO-ACTIVATION-SKIPPING CONVOLUTIONAL
ACCELERATOR
So far, we have introduced a reconfigurable CE that supports different filter sizes with high utilization factors. We also showed that the weight generator unit provides an appropriate shift amount for each PE using shift registers. In fact, the values of the input weights (i.e., the values of each row of each input filter) are circulated through the shift registers for Wf clock cycles, as discussed in Section 4. The size of the shift register (i.e., the number of registers involved in the computation) varies depending on the filter size being used. For instance, 3 registers are used for the convolutional computations for Wf = 3 and S = 1. More generally, the number of registers required for the computations, hereafter referred to as Nr, can be obtained by multiplying the required stride value by the number of PEs used in each CE. Table 5 summarizes the number of involved registers in the convolutional computations given Wf and S for commonly-used filter sizes, while our architecture is not limited to them.

TABLE 5
The Number of Registers Used in the Convolutional Process for Different Filter Sizes

[Wf; S]            [1; 1]   [3; 1]   [5; 1]   [5; 2]   [7; 2]   [11; 4]
# registers (Nr)   1        3        6        6        12       12

Table 5 and Fig. 5 show that the position of the weights in the registers is unchanged after Nr clock cycles. This observation suggests that if any multiple of Nr consecutive activations is equal to zero, we can skip their computations using the proposed CE. Table 6a illustrates an example for Wf = 3 with S = 1 when 6 consecutive activations are zero (denoted in gray). According to Table 6a, the output values Y5, Y6, Y7 and Y8 are zeros due to the multiplications with zero-valued activations. Moreover, the output values Y3 and Y4 are valid at clock cycle #4, since their remaining computations are noncontributory due to the multiplications with zero-valued activations. As a result, the dataflow changes to a simpler one, as shown in Table 6b, and the noncontributory computations can be skipped without any impact on the flow of weights through the registers. A similar approach can be used for other filter sizes, depending on their Nr value and the number of consecutive zero-valued activations. Therefore, noncontributory computations can be avoided using the proposed method and CE. However, in order to take advantage of zero-valued activations, an encoder/decoder is required to specify the number of zero-valued activations to the CE. Moreover, using the encoder/decoder allows us to compress the activations in order to reduce the number of accesses to the off-chip memory.

The encoder can be easily realized using counters and comparators to keep track of zero-valued activations, as shown in Fig. 6. The final output activations, which are stored in the internal SRAM inside the CE, are first passed to the ReLU and then to the encoder at each clock cycle. It is worth noting that we use 15 bits to represent the value of the input/output activations. The encoder starts counting consecutive zero-valued activations right after detecting the first zero-valued activation. While counting the number of consecutive zero-valued activations, the encoder passes its incoming input directly to its output port along with a single flag bit. As soon as the encoder detects that the number of consecutive zero-valued activations is a multiple of Nr, denoted as n x Nr, it outputs the value of n using 15 bits along with a bit to distinguish between the zero and non-zero activations. More precisely, the encoded word denotes that the value of the current activation and the n upcoming activations are zero when the flag bit is high. Otherwise, it only denotes the value of a non-zero activation.
TABLE 7
An Example of the Encoding Procedure Given Nr = 3

CC    Encoder Input   Encoder Output (Hex)   Write Address
#1    128             0040                   0
#2    0               0000                   1
#3    0               0000                   2
#4    0               0000                   3
#5    0               8001                   2
#6    0               0000                   3
#7    0               0000                   4
#8    0               8002                   2
#9    0               0000                   3
#10   512             0100                   4
#11   768             0180                   5

Fig. 6. The encoder.

It is worth mentioning that, in case of detecting a multiple of Nr by the encoder, the value of n is overwritten into the memory element storing the second zero-valued activation of the corresponding sequence of zero-valued activations. Its appropriate address is also generated by the controller of the system. For instance, Table 7 illustrates the encoding procedure given the sequence {128, 0, 0, 0, 0, 0, 0, 0, 0, 512, 768} and Nr = 3. Upon the completion of the encoding procedure, the off-chip memory contains the consecutive encoded/compressed values {0040, 0000, 8002, 0000, 0100, 0180} in hexadecimal format. The number of memory accesses to the off-chip memory for the writing procedure using the encoded values is the same as for the dense model.

The decoder performs the inverse computations. It uses the flag bit to either skip the noncontributory computations or perform the computations of non-zero-valued activations. In case of receiving a low flag bit, it passes the non-zero value of the activation to the CE. Otherwise, it first passes zero to the CE to store all the intermediate output values of the neurons into their corresponding SRAM. It then passes the value of n to the controller of the system to update the writing addresses of the internal SRAMs in the CE. This unit is simply implemented using multiplexers.
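The following Python snippet is our behavioral sketch of this encoding scheme, not the paper's RTL; it reproduces the final memory contents of Table 7. It assumes that non-zero 16-bit activations are truncated to the 15-bit value field (hence 128 becomes 0x0040), which is our reading of the hex values in Table 7:

```python
def encode(activations, nr):
    """Zero-run encoding sketch: a run of z zeros is stored as one plain
    zero word, then one count word 0x8000 | n covering n * nr of the
    zeros, then any leftover zeros as plain words."""
    mem, i = [], 0
    while i < len(activations):
        a = activations[i]
        if a != 0:
            mem.append((a >> 1) & 0x7FFF)   # 15-bit value, flag bit low
            i += 1
            continue
        z = 0                                # measure the zero run
        while i + z < len(activations) and activations[i + z] == 0:
            z += 1
        n = (z - 1) // nr                    # multiples of nr after the first zero
        mem.append(0x0000)                   # first zero of the run
        if n > 0:
            mem.append(0x8000 | n)           # flag bit high + count n
        mem.extend([0x0000] * (z - 1 - n * nr))  # leftover zeros
        i += z
    return mem

words = encode([128, 0, 0, 0, 0, 0, 0, 0, 0, 512, 768], nr=3)
print([format(w, "04X") for w in words])
# ['0040', '0000', '8002', '0000', '0100', '0180'] as in Table 7
```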
5.1 Exploiting Parallel CEs
While the proposed reconfigurable CE can efficiently perform convolutional computations, using a single CE results in a long latency and numerous memory accesses. To address this issue, p CEs are instantiated in parallel to generate p out of the Cout output activation maps in parallel. Since the reconfigurable CE itself can function as up to 6 parallel CEs, the upper bound for the maximum number of CEs is 6p in ZASCA. Therefore, the computational latency of ZASCA is effectively reduced by a factor of p when compared to a single reconfigurable CE. Moreover, the memory accesses are reduced as well, since the input pixels are shared among the parallel CEs, while each CE is fed by a different set of weights.

Exploiting parallel CEs requires an input bandwidth of (1 + 6p) x 16 bits (6 x p x 16 for the weights and 16 for the input pixels). However, most embedded and mobile devices cannot provide such a high bandwidth. In fact, their bandwidth is limited by the bandwidth of the off-chip memory. For instance, most low-power DDRs (LPDDRs) such as LPDDR4 provide a bit-width of up to 64 bits [23]. As a result, the bandwidth of state-of-the-art convolutional accelerators for mobile devices is limited to 64 bits [8], [9], [13], [24]. To overcome this problem, ZASCA makes use of pipelining. As discussed in Section 3, each input pixel is read at each clock cycle, while the Wf weights are read only during the first Wf clock cycles when performing the convolutional process of the first row of the first input filter. The parameter Wf is also a small value compared to the processing time of the convolutions for the first row of the first input filter (i.e., Wf << (S * N + Wf - S)). More precisely, the input bandwidth from the (Wf)th clock cycle to the (S * N + Wf - S)th clock cycle is occupied by input pixels only. Therefore, we can fill this available bandwidth by pipelining the CEs with up to floor((S * N + Wf - S)/Wf) stages, while the additional latency overhead is negligible compared to the overall latency of the system.

5.2 Processing Time and Memory Accesses of ZASCA
Earlier, in Section 4, we showed that in convolutional processes a single CE computes N out of the Hout x Wout pixels of one of the Cout output activation maps within Cin x Hf x (S * N + Wf - S) clock cycles. We also showed that the total number of weight passing occurrences for the computation of a single convolutional layer is equal to Hout - 1, which causes additional (Wf - 1) x (Hout - 1) clock cycles for the computations of each row of the input filters. Considering p parallel CEs, the number of required clock cycles is expressed as

$$
\mathrm{CC} = (S \cdot N + W_f - S) \cdot \frac{W_{\mathrm{out}}}{N} \cdot H_f \cdot C_{\mathrm{in}} \cdot \frac{H_{\mathrm{out}} \cdot C_{\mathrm{out}}}{p}
+ (W_f - 1) \cdot (H_{\mathrm{out}} - 1) \cdot H_f \cdot C_{\mathrm{in}} \cdot \frac{C_{\mathrm{out}}}{p}.
\tag{11}
$$

Eq. (11) suggests that the number of required clock cycles for the convolutional computations is independent of N for large values of N (i.e., S * N >> (Wf - S)) when not considering the weight passing overheads. In Section 5.1, we showed that the input pixels are shared among all CEs and each pixel is read at each clock cycle. This means that the number of memory accesses for the input activation maps (MAimaps) is equal to the number of clock cycles required to complete the convolution. On the other hand, the weights are read in the first Wf clock cycles out of a total of (S * N + Wf - S). As a result, the number of required memory accesses by the filters to compute N out of the Hout x Wout pixels of one out of the Cout output activation maps is equal to Cin x Hf x Wf. In general, the number of memory accesses by the filters (MAfilters) can be computed as follows:
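As an illustration (our own sketch, with arbitrary layer dimensions in the example; the paper presumably rounds partial tiles up with ceilings, which we leave out), Eq. (11) and the pipelining bound above translate directly into code:

```python
def zasca_clock_cycles(N, p, S, Wf, Hf, Cin, Hout, Wout, Cout):
    """Clock-cycle count of Eq. (11): the convolution term plus the
    weight-passing overhead term, spread over p parallel CEs."""
    conv = (S * N + Wf - S) * (Wout / N) * Hf * Cin * (Hout * Cout / p)
    passing = (Wf - 1) * (Hout - 1) * Hf * Cin * (Cout / p)
    return conv + passing

def max_pipeline_stages(N, S, Wf):
    """Upper bound on the number of CE pipeline stages from Section 5.1."""
    return (S * N + Wf - S) // Wf

# Example: a hypothetical 3x3, stride-1 layer with p = 32 CEs and N = 56.
print(zasca_clock_cycles(N=56, p=32, S=1, Wf=3, Hf=3, Cin=64,
                         Hout=56, Wout=56, Cout=64))
print(max_pipeline_stages(N=56, S=1, Wf=3))   # -> 19 stages
```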
TABLE 9
The Main Characteristics of ZASCA
obtained from their original papers for comparison purposes throughout this paper.

6.2 Sparsity Degree of Input Activations
As discussed in Section 1, the sparsity among activations is an intrinsic occurrence in CNNs using the ReLU, which dynamically clamps all negatively-valued activations to zero. Fig. 8 shows the breakdown of the sparsity degree of activations in the convolutional layers of AlexNet, VGG-16, ResNet-18 and ResNet-50 when running on ZASCA. More specifically, we only considered the zero-valued activations that can be skipped using ZASCA in our measurement of the sparsity degree (see Section 5). The sparsity degrees were measured over the validation set of the ImageNet dataset. The sparsity degree of each layer is linearly proportional to the amount of speedup. For instance, a 70 percent sparsity degree denotes that 70 percent of MAC operations are avoided on ZASCA and the computation time is reduced by a factor of 2.3.

Fig. 8. The breakdown of the sparsity degree of CNNs when running on ZASCA.

6.3 Performance
We hereafter denote ZASCA performing convolutions on dense activations as ZASCAD and ZASCA performing convolutions on sparse activations as ZASCAS. It is worth mentioning that the accuracy of the models using the dense representation is the same as that of those using the sparse representation for activations. Fig. 9 shows the performance of ZASCAD and ZASCAS when running AlexNet, VGG-16, ResNet-18 and ResNet-50, in terms of giga operations per second, compared to Eyeriss, EEPS and DSIP. It is worth noting that Eyeriss was tested on AlexNet and VGG-16, while EEPS and DSIP were only tested on AlexNet. Fig. 9 shows that ZASCAD outperforms Eyeriss in terms of runtime performance by factors of 1.4x and 3.1x when running AlexNet and VGG-16, respectively. Skipping noncontributory computations further improves the runtime performance, ranging from 2x to 5.8x compared to Eyeriss. ZASCAD yields a slightly better runtime performance compared to EEPS, while EEPS contains 64 more MAC units than ZASCA. Exploiting the sparsity among the activations, ZASCAS performs the convolutions of AlexNet 1.5x faster than EEPS. Finally, ZASCAD and ZASCAS outperform DSIP in terms of runtime performance by factors of 2.7x and 4x, respectively.

As discussed in Section 5, all the PEs in ZASCA share the same input activation pixel and the complete calculations of each output pixel are performed serially, allowing all the PEs to easily skip the noncontributory computations with zero-valued input activation pixels. However, Eyeriss, EEPS and DSIP rely on a 2D array of PEs and perform the computations of each output pixel in a semi-parallel fashion. More precisely, the computations of each output pixel are split into different parts, where a certain number of these parts is performed in parallel. The results of all the parts are then added to make the output pixels. In this way, the PEs computing each part take different activation pixels. In case of skipping the computations with zero-valued input activation pixels of each part, the runtime performance is still limited by the part containing the least number of zero-valued input activation pixels. Note that having all the parts with the same number of zero-valued input activation pixels is unlikely to happen when exploiting the aforementioned approach. Moreover, using batch sizes greater than 1 exacerbates the problem, since skipping the noncontributory computations is possible only when all the batches contain zero-valued input activation pixels at the same time.

Eyeriss and DSIP use batch sizes greater than 1 to obtain a lower number of memory accesses and to achieve a higher energy efficiency. Using high batch sizes allows more filter reuse and maximizes the utilization of all available PEs,
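To clarify what "skippable" means in this measurement, the sketch below (ours; the function name is hypothetical) counts only the zeros ZASCA can actually skip, i.e., complete groups of Nr consecutive zeros beyond the first zero of each run, mirroring the encoder of Section 5:

```python
def zasca_sparsity_degree(activations, nr):
    """Fraction of activations whose MACs ZASCA can skip: from a run of
    z consecutive zeros, ((z - 1) // nr) * nr computations are avoided;
    the remaining zeros of the run are still processed normally."""
    skipped, run = 0, 0
    for a in list(activations) + [1]:   # sentinel flushes the last run
        if a == 0:
            run += 1
        else:
            if run > 0:
                skipped += ((run - 1) // nr) * nr
            run = 0
    return skipped / len(activations)

# A run of 8 zeros with Nr = 3: only 6 of the 8 zeros are skippable.
print(zasca_sparsity_degree([128, 0, 0, 0, 0, 0, 0, 0, 0, 512, 768], 3))
```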
TABLE 10
The Total Latency of ZASCAD, ZASCAS,
Eyeriss, EEPS and DSIP
Fig. 13. The roofline model of ZASCAD, ZASCAS, Eyeriss, EEPS and DSIP tested on AlexNet, VGG-16 and ResNet-50. The circles and solid/
dashed lines denote the average runtime performance and the peak performance bound, respectively.
consumption exceeds the power budgets of embedded devices [16].

Despite the existence of many works on accelerating sparse multiplications [18], [36], [37], [38], [39], [40], only a few works have focused on exploiting sparsity to accelerate convolutional computations in CNNs, such as Cnvlutin [19], Cambricon-X [20] and SCNN [21]. These architectures are mainly designed to accelerate computations on the cloud, where unlimited data bandwidth is provided. Cnvlutin relies on DaDianNao's architecture [30] as its baseline and achieves up to a 1.55x speedup by exploiting sparsity among activations. Cambricon-X is a convolutional accelerator containing 528 operators working at 1 GHz that can perform convolutions of both dense and sparse models. As a result, it can achieve a theoretical peak performance of 528 G-ops/s when performing on dense models. Exploiting weight sparsity, it can skip ineffectual multiplications and speed up the computations by up to a factor of 2.51x over its dense model, yielding 544 G-ops/s at most [20]. Therefore, the highest achievable performance efficiency of this accelerator is 41 and 103 percent when performing on dense and sparse models, respectively. However, ZASCA outperforms Cambricon-X in terms of performance efficiency by yielding a peak performance efficiency of 94 and 168 percent when performing on dense and sparse activations, respectively. Finally, SCNN employs sparsity among both activations and weights to reduce energy, latency and data transfer time. Using this approach, SCNN managed to improve performance and energy by factors of 2.7x and 2.3x compared to its dense baseline, respectively. Despite the improvements that these accelerators provide over their baselines, they fail to meet the power and memory bandwidth constraints of edge devices, where power is limited to a few hundred mW [22]. However, ZASCA uses a substantially different dataflow from the above designs and aims to accelerate convolutions for low-power mobile devices at the edge. Moreover, since no quantitative runtime result was provided, a direct comparison cannot be made with these works.

Recently, a few works have focused on minimizing energy by modulating the precision, frequency and supply voltage of their accelerator for each convolutional layer [9], [11], [13]. In [9], a precision-scalable convolutional accelerator (i.e., Envision), fabricated in 28 nm UTBB FD-SOI technology, was introduced. This architecture dynamically adapts itself depending on the required precision for each layer, instead of using a fixed precision. More precisely, it exploits a reconfigurable MAC which is able to perform one 16-bit, two 8-bit or four 4-bit multiplications/accumulations, depending on the required precision. As a result, using a dynamic fixed-point technique allows the frequency and supply voltage to be changed over time, which results in a lower power/energy consumption. Envision contains 256 reconfigurable MACs. This accelerator performs the convolutional computations of AlexNet in 21.3 ms (62.6 G-ops/s), and those of VGG-16 in 598.8 ms (51.3 G-ops/s), while its performance efficiency is respectively limited to 38 and 32 percent on average. Similar to Eyeriss, the low performance efficiency of Envision results in a large gate count of 1950 kgates. More precisely, Envision requires more parallelism to meet the timing constraint of real-time edge devices, resulting in a large silicon area. Despite the differences in the technology nodes and applied circuit-level techniques, ZASCAS outperforms Envision in terms of gate count (1.9x smaller), latency (1.5x and 2.4x lower), throughput (1.5x and 2.4x faster) and performance efficiency (3.2x and 5x better).

In [12], a configurable accelerator framework (i.e., CAF) was introduced. CAF was fabricated in 28 nm UTBB FD-SOI technology with a silicon area of 34.9 mm2 and contains various types of accelerators, including 8 convolutional accelerator cores (288 MACs in total) for computer vision applications. This architecture uses 16-bit fixed-point representations and performs the convolutional computations of AlexNet in 17.1 ms while consuming 61 mW at 0.575 V, and its performance efficiency is limited to 67 percent. Despite the advanced technology node used in CAF, ZASCAS outperforms this architecture in terms of latency (1.2x lower), throughput (1.2x faster) and performance efficiency (1.8x better).

As discussed in Section 6, ZASCAS outperforms Eyeriss [8] in terms of gate count (1.8x smaller), latency (5.5x and 17.5x lower), throughput (2x and 5.8x faster), performance efficiency (2.2x and 6.2x better) and energy efficiency (2.1x and 4.5x more efficient), while having roughly the same number of memory accesses per batch. It is worth noting that a direct comparison of ZASCA with the works published in [9], [12] does not constitute a fair comparison, since they dynamically modulate precision, frequency and supply voltage and use advanced technology nodes, which allows them to instantiate more PEs while still having a low power/energy consumption. However, the introduced performance efficiency metric can be used for a fair comparison, as it reflects the performance of the accelerators independent of their technology nodes, precisions and optimization techniques, as shown in Fig. 14. It is worth mentioning that the circuit-level techniques used in [9], [11], [13] can also be used in ZASCA to further improve its energy/power consumption.

When considering beyond standard CNNs, bottleneck and depthwise separable convolutions are commonly used nowadays due to computational considerations. The bottleneck convolutions were first introduced and used in very deep residual networks such as ResNet-50 [5]. In the bottleneck architectures, each standard convolutional layer is usually sandwiched between two convolutional layers with a filter size of 1x1. The first convolution is used to reduce the dimensionality of the inputs (i.e., the number of channels) and the last one to restore it. In this way, a very deep network can be achieved while using inexpensive 1x1 filters [26]. In this work, we showed that ZASCA can successfully map the bottleneck architecture used in ResNet-50 while skipping its noncontributory computations with zero-valued input activation pixels. Using depthwise separable convolutions, which are a building block of MobileNets [27], is another way of reducing the complexity of standard convolutions. In this approach, each standard convolution is reformed into a depthwise convolution followed by a pointwise convolution. In the depthwise convolution, each input channel is associated with a separate 2D filter and the result of the convolution between each input channel and its filter constructs a single output channel. On the other hand, the pointwise convolution refers to the standard convolution with a filter size of 1x1.
Since each input activation channel contributes only to one output activation channel in the depthwise convolution, this convolutional process requires parallel access to all the input activation channels and all the filters to perform the computations in parallel. As a result, ZASCA cannot fully exploit its available resources for this type of convolution. More precisely, ZASCA can only perform the depthwise convolutional computations of two output channels in parallel, due to its limited data bandwidth. However, the final impact of the depthwise convolutional processes is negligible, as the computational complexity of depthwise separable convolutions is dominated by the pointwise convolutions [27]. Note that ZASCA can efficiently perform the pointwise convolutional computations (see Section 4.1.3).
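To see why the pointwise part dominates, here is a back-of-the-envelope comparison in Python (our illustration; the layer dimensions in the example are arbitrary):

```python
def depthwise_separable_macs(hout, wout, cin, cout, hf, wf):
    """MAC counts of the two stages of a depthwise separable convolution."""
    depthwise = hout * wout * cin * hf * wf   # one 2D filter per input channel
    pointwise = hout * wout * cin * cout      # 1x1 standard convolution
    return depthwise, pointwise

# e.g., a 3x3 layer on a 14x14 map with 256 input and 256 output channels:
dw, pw = depthwise_separable_macs(14, 14, 256, 256, 3, 3)
print(pw / (dw + pw))   # ~0.97: the pointwise convolution dominates
```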
8 CONCLUSION
In this paper, we proposed ZASCA: an accelerator performing convolutions using either dense or sparse activations. ZASCA exploits a dataflow (i.e., GFID) that allows it to perform the convolutional computations while reading input data under the bandwidth bottleneck of edge devices, which maximizes its performance efficiency. Such a dataflow also enables ZASCA to avoid noncontributory multiplications/accumulations with zero-valued activations to further speed up the computations. Compared to the state-of-the-art accelerator for mobile devices, ZASCA enhances the latency, throughput, energy efficiency and performance efficiency by up to 17.5x, 5.8x, 4.5x and 6.2x, respectively.
5.8, 4.5 and 6.2, respectively. [26] F. N. Iandola, et al., “SqueezeNet: AlexNet-level accuracy with
50x fewer parameters and <1MB model size,” CoRR, vol. abs/
1602.07360, 2016. [Online]. Available: http://arxiv.org/abs/
REFERENCES 1602.07360
[27] A. G. Howard, et al., “MobileNets: Efficient convolutional
[1] Y. Lecun, et al., “Gradient-based learning applied to document neural networks for mobile vision applications,” CoRR, vol. abs/
recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998. 1704.04861, 2017. [Online]. Available: http://arxiv.org/abs/
[2] Y. Lecun, et al., “Deep learning,” Nature, vol. 521, pp. 436–444, 1704.04861
May 2015. [28] A. Vedaldi, et al., “MatConvNet: Convolutional neural networks
[3] A. Krizhevsky, et al., “ImageNet classification with deep convolu- for MATLAB,” in Proc. 23rd ACM Int. Conf. Multimedia, 2015,
tional neural networks,” in Proc. Int. Conf. Neural Inf. Process. Syst., pp. 689–692.
2012, pp. 1097–1105. [29] Y. Jia, et al., “Caffe: Convolutional architecture for fast feature
[4] K. Simonyan, et al., “Very deep convolutional networks for large- embedding,” arXiv:1408.5093, 2014.
scale image recognition,” in Proc. Int. Conf. Learn. Representations, [30] Y. Chen, et al., “DaDianNao: A machine-learning super-
2015. computer,” in Proc. 47th Annu. IEEE/ACM Int. Symp. Microarchit.,
[5] K. He, et al., “Deep residual learning for image recognition,” in Dec. 2014, pp. 609–622.
Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016. [31] T. Luo, et al., “DaDianNao: A neural network supercomputer,”
[6] O. Russakovsky, et al., “ImageNet large scale visual recognition IEEE Trans. Comput., vol. 66, no. 1, pp. 73–88, Jan. 2017.
challenge,” Int. J. Comput. Vis., vol. 115, pp. 211–252, 2015. [32] T. Chen, et al., “DianNao: A small-footprint high-throughput
[7] Y.-H. Chen, et al., “Eyeriss: A spatial architecture for energy- accelerator for ubiquitous machine-learning,” in Proc. 19th Int.
efficient dataflow for convolutional neural networks,” in Proc. Conf. Archit. Support Program. Languages Operating Syst., 2014,
43rd Int. Symp. Comput. Archit., 2016, pp. 367–379. pp. 269–284.
[8] Y. H. Chen, et al., “Eyeriss: An energy-efficient reconfigurable accel- [33] Z. Du, et al., “ShiDianNao: Shifting vision processing closer to the
erator for deep convolutional neural networks,” IEEE J. Solid-State sensor,” in Proc. ACM/IEEE 42nd Annu. Int. Symp. Comput. Archit.,
Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017. Jun. 2015, pp. 92–104.
[9] B. Moons, et al., “14.5 Envision: A 0.26-to-10TOPS/W subword- [34] W. Choi, et al., “On-chip communication network for efficient
parallel dynamic-voltage-accuracy-frequency-scalable convolu- training of deep convolutional networks on heterogeneous many-
tional neural network processor in 28nm FDSOI,” in Proc. IEEE core systems,” IEEE Trans. Comput., vol. 67, no. 5, pp. 672–686,
Int. Solid-State Circuits Conf., Feb. 2017, pp. 246–247. May 2018.
[10] S. Wang, et al., “Chain-NN: An energy-efficient 1D chain architec- [35] N. P. Jouppi, et al., “In-datacenter performance analysis of a tensor
ture for accelerating deep convolutional neural networks,” in Proc. processing unit,” in Proc. 44th Annu. Int. Symp. Comput. Archit., 2017,
Des. Autom. Test Europe Conf. Exhib., Mar. 2017, pp. 1032–1037. pp. 1–12.
[11] D. Shin, et al., “14.2 DNPU: An 8.1TOPS/W reconfigurable CNN- [36] J. Sun, et al., “Sparse matrix-vector multiplication design on
RNN processor for general-purpose deep neural networks,” in FPGAs,” in Proc. 15th Annu. IEEE Symp. Field-Programmable Cus-
Proc. IEEE Int. Solid-State Circuits Conf., Feb. 2017, pp. 240–241. tom Comput. Mach., Apr. 2007, pp. 349–352.
[12] G. Desoli, et al., “14.1 A 2.9TOPS/W deep convolutional neu- [37] M. deLorimier, et al., “Floating-point sparse matrix-vector multi-
ral network SoC in FD-SOI 28nm for intelligent embedded sys- ply for FPGAs,” in Proc. ACM/SIGDA 13th Int. Symp. Field-
tems,” in Proc. IEEE Int. Solid-State Circuits Conf., Feb. 2017, Programmable Gate Arrays, 2005, pp. 75–85.
pp. 238–239. [38] S. Jain-Mendon, et al., “A hardware-software co-design
[13] B. Moons, et al., “An energy-efficient precision-scalable ConvNet approach for implementing sparse matrix vector multiplication
processor in a 40-nm CMOS,” IEEE J. Solid-State Circuits, vol. 52, on FPGAs,” Microprocessors Microsyst., vol. 38, pp. 873–888,
no. 4, pp. 903–914, Apr. 2017. Nov. 2014.
[14] J. Jo, et al., “DSIP: A scalable inference accelerator for convolu-
tional neural networks,” IEEE J. Solid-State Circuits, vol. 53, no. 2,
pp. 605–618, Feb. 2018.
Authorized licensed use limited to: Chungbuk National Univ. Downloaded on April 02,2020 at 06:25:41 UTC from IEEE Xplore. Restrictions apply.
152 IEEE TRANSACTIONS ON COMPUTERS, VOL. 69, NO. 1, JANUARY 2020
[39] D. Gregg, et al., “FPGA based sparse matrix vector multiplication Warren J. Gross (S’92–M’04–SM’10) received
using commodity DRAM memory,” in Proc. Int. Conf. Field Pro- the BASc degree in electrical engineering from
grammable Logic Appl., Aug. 2007, pp. 786–791. the University of Waterloo, Waterloo, ON, Can-
[40] Y. Zhang, et al., “FPGA versus GPU for sparse matrix vector mul- ada, in 1996, and the MASc and PhD degrees
tiply,” in Proc. Int. Conf. Field-Programmable Technol., Dec. 2009, from the University of Toronto, Toronto, ON,
pp. 255–262. Canada, in 1999 and 2003, respectively. He is a
professor and Louis-Ho faculty scholar in techno-
Arash Ardakani received the BSc degree in elec- logical innovation with the Department of Electri-
trical engineering from the Sadjad University of cal and Computer Engineering, McGill University,
Technology, Mashhad, Iran, in 2011, and the MSc Montreal, QC, Canada. He currently serves as
degree from the Sharif University of Technology, chair of the department. His research interests
Tehran, Iran, in 2013. He is currently working include the design and implementation of signal processing systems and
toward the PhD degree in electrical and computer custom computer architectures. He served as the chair for the IEEE Sig-
engineering at McGill University, Montre al, QC, nal Processing Society Technical Committee on Design and Implementa-
Canada. His research interests include the VLSI tion of Signal Processing Systems. He served as the general co-chair for
implementation of signal processing algorithms, in the IEEE GlobalSIP 2017 and the IEEE SiPS 2017 and the technical pro-
particular channel coding schemes and machine gram co-chair for SiPS 2012. He also served as an organizer for the
learning algorithms. Workshop on Polar Coding in Wireless Communications at WCNC 2018
and WCNC 2017, the Symposium on Data Flow Algorithms and Architec-
ture for Signal Processing Systems (GlobalSIP 2014), and the IEEE ICC
2012 Workshop on Emerging Data Storage Technologies. He served as
Carlo Condo received the MSc degree in electri- an associate editor of the IEEE Transactions on Signal Processing and
cal and computer engineering from the Politecnico as a senior area editor. He is a licensed professional engineer in the Prov-
di Torino and the University of Illinois at Chicago, ince of Ontario. He is a senior member of the IEEE.
Chicago, Illinois, in 2010, and the PhD degree in
electronics and telecommunications engineering
from the Politecnico di Torino and IMT Atlantique, " For more information on this or any other computing topic,
in 2014. From 2015 to 2017, he was a post-
doctoral fellow with the ISIP Laboratory, McGill please visit our Digital Library at www.computer.org/csdl.
University, where he was a McGill University dele-
gate at the 3GPP meetings for the fifth-generation
wireless systems standard (5G) in 2017. Since
2018, he has been a researcher with the Communication Algorithms
Design Team, Huawei Paris Research Center. His research is focused
on channel coding, design and implementation of encoder and decoder
architectures, digital signal processing, and machine learning.
Authorized licensed use limited to: Chungbuk National Univ. Downloaded on April 02,2020 at 06:25:41 UTC from IEEE Xplore. Restrictions apply.