Fast and Efficient Convolutional Accelerator for Edge Computing
IEEE Transactions on Computers, Vol. 69, No. 1, January 2020
Abstract—Convolutional neural networks (CNNs) are a vital approach in machine learning. However, their high complexity and energy consumption make them challenging to embed in mobile edge applications requiring real-time processing, such as smartphones. In order to meet the real-time constraint of edge devices, recently proposed custom hardware CNN accelerators have exploited parallel processing elements (PEs) to increase throughput. However, this straightforward parallelization of PEs and the high memory bandwidth it demands require high data movement, leading to large energy consumption. As a result, only a certain number of PEs can be instantiated when designing bandwidth-limited custom accelerators targeting edge devices. While most bandwidth-limited designs claim a peak performance of a few hundred giga operations per second, their average runtime performance is substantially lower than their roofline when applied to state-of-the-art CNNs such as AlexNet, VGGNet and ResNet, as a result of low resource utilization and arithmetic intensity. In this work, we propose a zero-activation-skipping convolutional accelerator (ZASCA) that avoids noncontributory multiplications with zero-valued activations. ZASCA employs a dataflow that minimizes the gap between its average and peak performances while maximizing its arithmetic intensity for both sparse and dense representations of activations, targeting the bandwidth-limited edge computing scenario. More precisely, ZASCA achieves a performance efficiency of up to 94 percent over a set of state-of-the-art CNNs for image classification with dense representation, where the performance efficiency is the ratio between the average runtime performance and the peak performance. Using its zero-skipping feature, ZASCA can further improve the performance efficiency of the state-of-the-art CNNs by up to 1.9x, depending on the sparsity degree of activations. The implementation results in 65-nm TSMC CMOS technology show that, compared to the most energy-efficient accelerator, ZASCA can process convolutions from 5.5x to 17.5x faster, and is between 2.1x and 4.5x more energy efficient while occupying 2.1x less silicon area.
Index Terms—Convolutional neural network, machine learning, hardware accelerator, edge computing
1 INTRODUCTION
Deep neural networks (DNNs), especially convolutional neural networks (CNNs) [1], have received tremendous attention due to their ability to surpass human-level accuracy on a wide range of complex tasks such as recognition, classification and detection [2]. Depending on their size and complexity, these networks achieve different degrees of classification/recognition accuracy. A CNN is a stack of multiple convolutional layers followed by fully-connected layers: convolutional layers extract high-level abstractions and features of raw data, whereas fully-connected networks are used to learn non-linear combinations of the extracted features. In 2012, a CNN called AlexNet [3] was introduced: it is constituted of 5 convolutional layers followed by 3 fully-connected layers and achieves a 42.9 percent misclassification rate (MCR) on the ImageNet dataset. AlexNet contains 2.3M weights and 58.6M weights in its convolutional and fully-connected layers, respectively, performing 1332M operations (i.e., 666M multiply-accumulates (MACs)) in its convolutional layers, and 117.2M operations (i.e., 58.6M MACs) in its fully-connected layers. VGGNet-16 [4] is another well-known CNN, containing 13 convolutional layers with 14.7M weights and 3 fully-connected layers with 124M weights. VGGNet-16 performs 30.6G operations in its convolutional layers and 248M operations in its fully-connected layers, achieving a 27 percent MCR on ImageNet. Recently, ResNets [5] achieved a lower complexity and a better MCR by using residual connections. For instance, ResNet-18 achieves a similar MCR to VGGNet-16 (i.e., 27.88 percent on ImageNet) while performing 3.6G and 1M operations in its 17 convolutional layers with 11M weights and 1 fully-connected layer with 1M weights, respectively. Moreover, ResNet-50, containing 49 convolutional layers with 23.5M weights and 1 fully-connected layer with 2M weights, achieved a better MCR (i.e., 22.85 percent on ImageNet) by going even deeper. ResNet-50 respectively performs 7G and 4M operations within the two types of layers. All these CNNs have won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [6].

Regardless of the fact that in almost all the aforementioned CNNs the majority of weights is found in fully-connected layers, the number of operations is dominated by convolutions. As a result, the processing time of CNNs is also dominated by the convolutional processes. This issue can easily be addressed by exploiting parallel processing elements (PEs) to

A. Ardakani and W.J. Gross are with the Electrical and Computer Engineering Department, McGill University, Montreal, QC H3A 0G4, Canada. E-mail: arash.ardakani@mail.mcgill.ca, warren.gross@mcgill.ca.
C. Condo is with the Mathematical and Algorithmic Sciences Lab, Huawei Technologies Co. Ltd, Paris 75002, France. E-mail: carlo.condo@huawei.com.
TABLE 1
Convolutional Layer Computation Parameters
TABLE 2
The FID for Convolutional Computations

Considering each output pixel assigned to a neuron, Table 2 shows the convolutional process of this example in a way similar to the fully-connected layer computations, where input pixels are read sequentially at each clock cycle (CC) and the neurons share the same input pixels. This example considers Cin = 1, Cout = 1, Hin = 2, Win = 8, Hf = 1, Wf = 3 and S = 1. Similar to the fully-connected dataflow, each neuron loads a different weight at each time step, subsequently accumulating the weighted input pixels. The number of time steps required to perform the convolutional computations is also equal to the number of input pixels, Hin x Win. When the weights are passed to the next neuron belonging to the same row of the output activation map, they need to be shifted by one position. However, weight passing between neurons of different rows requires a shift of Wf positions, as can be observed between clock cycles #6 and #9 for W1, denoted in red in Table 2.

As shown in Table 2, three PEs (i.e., neurons) are sufficient to perform the convolutions. In fact, there are only 3 active neurons at each time step. Each PE thus receives its input at clock cycles 3i + 1, 3i + 2 and 3i + 3. Their outputs are also valid after 3 clock cycles in the given example. So far, we have only considered a case with Hf = 1. In case of Hf = 3, the procedure in Table 2 has to be repeated 2 more times: the first iteration with W1, W2 and W3, the second with W4, W5 and W6, and the final one with W7, W8 and W9. Similarly, for higher values of Cin, the process has to be repeated Cin times. Therefore, a memory is required to store the partial values generated by the 3 neurons for each output pixel. In general, N output pixels can be computed using 3 neurons (i.e., PEs) and 3 separate N/3-element SRAM memories working in parallel. The unit generating the N output pixels of an output activation map is referred to as a convolutional element (CE).

The GFID matrix is structured as

$$
M =
\begin{bmatrix}
W_1     & 0       & \cdots & 0       \\
W_2     & \ddots  &        & \vdots  \\
\vdots  & W_1     &        &         \\
W_{W_f} & W_2     & \ddots &         \\
0       & \vdots  &        & W_1     \\
\vdots  & W_{W_f} &        & W_2     \\
        & 0       & \ddots & \vdots  \\
0       & \cdots  & 0      & W_{W_f}
\end{bmatrix},
\tag{3}
$$

where each of the N columns holds the weights W1, ..., WWf of one output pixel, shifted down by S rows with respect to the previous column. Each row of M contains at most Wf non-zero elements. The shift amount within each row of the output activation map is equal to S, denoted with a horizontal dashed line in the matrix M. The number of columns of the matrix M indicates the N output pixels that belong to the same row of the output activation map, while the number of rows of M denotes the required number of clock cycles. We use the GFID matrix M to represent the different filter sizes used in state-of-the-art CNNs. State-of-the-art networks such as LeNet, ZFNet, AlexNet, GoogleNet, VGGNet, NIN and ResNet are constructed based on a combination of filter sizes of 11x11 with S = 4, 7x7 with S = 2, 5x5 with S = 1, 5x5 with S = 2, 3x3 with S = 1, and 1x1 with S = 1. All of the aforementioned filter sizes and others can be easily represented using the matrix M. For instance, we represent the two filter sizes of 3x3 with S = 1 and 7x7 with S = 2 using the GFID below.

In Section 2.1, we showed that 3 PEs are sufficient to perform the convolutions for a filter size of 3x3 with S = 1. Therefore, a CE containing only 3 neurons can perform the convolutional computations. Considering a convolution of a row of a filter map with its corresponding input pixels, N + 2 clock cycles are required to generate the N output pixels which belong to the same row of the output activation map. For instance, in the given example in Table 2, 8 clock cycles are required to generate the output pixels of the first row of the output activation map (i.e., the first 6 output pixels). This example can also be expressed using the GFID matrix M as follows:

$$
M =
\begin{bmatrix}
W_1 & 0   & 0   & 0   & 0   & 0   \\
W_2 & W_1 & 0   & 0   & 0   & 0   \\
W_3 & W_2 & W_1 & 0   & 0   & 0   \\
0   & W_3 & W_2 & W_1 & 0   & 0   \\
0   & 0   & W_3 & W_2 & W_1 & 0   \\
0   & 0   & 0   & W_3 & W_2 & W_1 \\
0   & 0   & 0   & 0   & W_3 & W_2 \\
0   & 0   & 0   & 0   & 0   & W_3
\end{bmatrix}.
$$
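To make the structure of the GFID concrete, the following Python sketch (our illustration, not from the paper; the function name gfid is hypothetical) builds the matrix of Eq. (3) for arbitrary Wf, S and N, storing the weight Wk as the integer k:

```python
import numpy as np

def gfid(wf, s, n):
    """Build the GFID matrix of Eq. (3): column j holds the weight
    indices 1..Wf (standing for W1..WWf), shifted down by S rows per
    column. Rows map to clock cycles, columns to output pixels."""
    rows = s * (n - 1) + wf            # clock cycles needed for n outputs
    m = np.zeros((rows, n), dtype=int)
    for j in range(n):
        m[j * s : j * s + wf, j] = np.arange(1, wf + 1)
    return m

# The Table 2 example (Wf = 3, S = 1): 6 output pixels in 8 clock cycles,
# i.e., S*N + Wf - S = N + 2 cycles, matching the text above.
print(gfid(3, 1, 6))
```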
TABLE 3
The Maximum Utilization Factor for Different Filter Sizes

[Wf; S]   [1; 1]   [3; 1]   [5; 1]   [5; 2]   [7; 2]   [11; 4]
Max UF    100%     100%     100%     83%      88%      92%
T         1        3        5        3        4        3

$$
\mathrm{UF}_{\max} = \lim_{N \to \infty} \mathrm{UF} = \frac{W_f}{T \cdot S} \times 100.
\tag{7}
$$

Eq. (7) suggests that the highest performance efficiency is obtained when N >> (Wf - S). Table 3 shows the maximum utilization factors for filters with different filter sizes and stride values. Please note that Table 3 only shows the commonly used filter sizes, while other filter sizes can still be represented using the GFID.
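As a quick sanity check (our own snippet, not from the paper), Eq. (7) reproduces every entry of Table 3:

```python
# Verify Table 3 against Eq. (7): UF_max = Wf / (T * S) * 100.
# The (Wf, S, T) triples below are taken directly from Table 3.
for wf, s, t in [(1, 1, 1), (3, 1, 3), (5, 1, 5), (5, 2, 3), (7, 2, 4), (11, 4, 3)]:
    print(f"[{wf}; {s}] -> {100 * wf / (t * s):.0f}%")
# prints 100%, 100%, 100%, 83%, 88%, 92%, matching the table
```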
Fig. 5. Involved hardware resources and paths in case of convolution computations for (a) Wf = 3 and S = 1, (b) Wf = 7 and S = 2.
TABLE 4
The Maximum Utilization Factor for Different Filter Sizes Using the Proposed CE

[Wf; S]   [1; 1]   [3; 1]   [5; 1]   [5; 2]   [7; 2]   [11; 4]
Max UF    100%     100%     83%      83%      53%      92%

TABLE 6
The Complete Convolutional Process (a) without and (b) with Skipping Zero-Valued Activations
5 ZERO-ACTIVATION-SKIPPING CONVOLUTIONAL
ACCELERATOR
So far, we have introduced a reconfigurable CE that supports different filter sizes with high utilization factors. We also showed that the weight generator unit provides an appropriate shift amount for each PE using shift registers. In fact, the values of the input weights (i.e., the values of each row of each input filter) are circulated through the shift registers for Wf clock cycles, as discussed in Section 4. The size of the shift register (i.e., the number of registers involved in the computation) varies depending on the filter size being used. For instance, 3 registers are used for the convolutional computations for Wf = 3 and S = 1. More generally, the number of registers required for the computations, hereafter referred to as Nr, can be obtained by multiplying the required stride value by the number of PEs used in each CE. Table 5 summarizes the number of involved registers in the convolutional computations given Wf and S for commonly-used filter sizes, while our architecture is not limited to them.

TABLE 5
The Number of Registers Used in the Convolutional Process for Different Filter Sizes

[Wf; S]            [1; 1]   [3; 1]   [5; 1]   [5; 2]   [7; 2]   [11; 4]
# registers (Nr)   1        3        6        6        12       12

Table 5 and Fig. 5 show that the position of the weights in the registers is unchanged after Nr clock cycles. This observation suggests that if any multiple of Nr consecutive activations is equal to zero, we can skip their computations using the proposed CE. Table 6a illustrates an example for Wf = 3 with S = 1 when 6 consecutive activations are zero (denoted in gray). According to Table 6a, the output values Y5, Y6, Y7 and Y8 are zeros due to the multiplications with zero-valued activations. Moreover, the output values Y3 and Y4 are valid at clock cycle #4, since their remaining computations are noncontributory due to the multiplications with zero-valued activations. As a result, the dataflow changes to a simpler one, as shown in Table 6b, and the noncontributory computations can be skipped without any impact on the flow of weights through the registers. A similar approach can be used for other filter sizes, depending on their Nr value and the number of consecutive zero-valued activations. Therefore, noncontributory computations can be avoided using the proposed method and CE. However, in order to take advantage of zero-valued activations, an encoder/decoder is required to specify the number of zero-valued activations to the CE. Moreover, using the encoder/decoder allows us to compress the activations in order to reduce the number of accesses to the off-chip memory.

The encoder can be easily realized using counters and comparators to keep track of zero-valued activations, as shown in Fig. 6. The final output activations, which are stored in the internal SRAM inside the CE, are first passed to the ReLU and then to the encoder at each clock cycle. It is worth noting that we use 15 bits to represent the value of the input/output activations. The encoder starts counting consecutive zero-valued activations right after detecting the first zero-valued activation. While counting the number of consecutive zero-valued activations, the encoder passes its incoming input directly to its output port along with a single flag bit. As soon as the encoder detects that the number of consecutive zero-valued activations is a multiple of Nr, denoted as n x Nr, it outputs the value of n using 15 bits along with a bit to distinguish between the zero and non-zero activations. More precisely, the encoded word denotes that the value of the current activation and the n upcoming activations are zero when the flag bit is high. Otherwise, it only denotes the value of a non-zero activation.
TABLE 7
An Example of the Encoding Procedure Given Nr = 3

CC    Encoder Input   Encoder Output (Hex)   Write Address
#1    128             0040                   0
#2    0               0000                   1
#3    0               0000                   2
#4    0               0000                   3
#5    0               8001                   2
#6    0               0000                   3
#7    0               0000                   4
#8    0               8002                   2
#9    0               0000                   3
#10   512             0100                   4
#11   768             0180                   5

Fig. 6. The encoder.

It is worth mentioning that, in case of detecting a multiple of Nr by the encoder, the value of n is overwritten into the memory element storing the second zero-valued activation of the corresponding sequence of zero-valued activations. Its appropriate address is also generated by the controller of the system. For instance, Table 7 illustrates the encoding procedure given the sequence {128, 0, 0, 0, 0, 0, 0, 0, 0, 512, 768} and Nr = 3. Upon the completion of the encoding procedure, the off-chip memory contains the consecutive encoded/compressed values {0040, 0000, 8002, 0000, 0100, 0180} in hexadecimal format. The number of memory accesses to the off-chip memory for the writing procedure using the encoded values is the same as for the dense model.

The decoder performs the inverse computations. It uses the flag bit to either skip the noncontributory computations or perform the computations of non-zero-valued activations. In case of receiving a low flag bit, it passes the non-zero value of the activation to the CE. Otherwise, it first passes zero to the CE to store all the intermediate output values of the neurons into their corresponding SRAM. It then passes the value of n to the controller of the system to update the writing addresses of the internal SRAMs in the CE. This unit is simply implemented using multiplexers.
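The following Python snippet is our behavioral sketch of this encoding scheme, not the paper's RTL; it reproduces the final memory contents of Table 7. It assumes that non-zero 16-bit activations are truncated to the 15-bit value field (hence 128 becomes 0x0040), which is our reading of the hex values in Table 7:

```python
def encode(activations, nr):
    """Zero-run encoding sketch: a run of z zeros is stored as one plain
    zero word, then one count word 0x8000 | n covering n * nr of the
    zeros, then any leftover zeros as plain words."""
    mem, i = [], 0
    while i < len(activations):
        a = activations[i]
        if a != 0:
            mem.append((a >> 1) & 0x7FFF)   # 15-bit value, flag bit low
            i += 1
            continue
        z = 0                                # measure the zero run
        while i + z < len(activations) and activations[i + z] == 0:
            z += 1
        n = (z - 1) // nr                    # multiples of nr after the first zero
        mem.append(0x0000)                   # first zero of the run
        if n > 0:
            mem.append(0x8000 | n)           # flag bit high + count n
        mem.extend([0x0000] * (z - 1 - n * nr))  # leftover zeros
        i += z
    return mem

words = encode([128, 0, 0, 0, 0, 0, 0, 0, 0, 512, 768], nr=3)
print([format(w, "04X") for w in words])
# ['0040', '0000', '8002', '0000', '0100', '0180'] as in Table 7
```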
5.1 Exploiting Parallel CEs
While the proposed reconfigurable CE can efficiently perform convolutional computations, using a single CE results in a long latency and numerous memory accesses. To address this issue, p CEs are instantiated in parallel to generate p out of the Cout output activation maps in parallel. Since the reconfigurable CE itself can function as up to 6 parallel CEs, the upper bound for the maximum number of CEs is 6p in ZASCA. Therefore, the computational latency of ZASCA is effectively reduced by a factor of p when compared to a single reconfigurable CE. Moreover, the memory accesses are reduced as well, since the input pixels are shared among the parallel CEs, while each CE is fed by a different set of weights.

Exploiting parallel CEs requires an input bandwidth of (1 + 6p) x 16 bits (6 x p x 16 for the weights and 16 for the input pixels). However, most embedded and mobile devices cannot provide such a high bandwidth. In fact, their bandwidth is limited by the bandwidth of the off-chip memory. For instance, most low-power DDRs (LPDDRs) such as LPDDR4 provide a bit-width of up to 64 bits [23]. As a result, the bandwidth of state-of-the-art convolutional accelerators for mobile devices is limited to 64 bits [8], [9], [13], [24]. To overcome this problem, ZASCA makes use of pipelining. As discussed in Section 3, each input pixel is read at each clock cycle, while the Wf weights are read only during the first Wf clock cycles when performing the convolutional process of the first row of the first input filter. The parameter Wf is also a small value compared to the processing time of the convolutions for the first row of the first input filter (i.e., Wf << (S * N + Wf - S)). More precisely, the input bandwidth from the (Wf)th clock cycle to the (S * N + Wf - S)th clock cycle is occupied by input pixels only. Therefore, we can fill this available bandwidth by pipelining the CEs with up to floor((S * N + Wf - S)/Wf) stages, while the additional latency overhead is negligible compared to the overall latency of the system.

5.2 Processing Time and Memory Accesses of ZASCA
Earlier, in Section 4, we showed that in convolutional processes a single CE computes N out of the Hout x Wout pixels of one of the Cout output activation maps within Cin x Hf x (S * N + Wf - S) clock cycles. We also showed that the total number of weight passing occurrences for the computation of a single convolutional layer is equal to Hout - 1, which causes additional (Wf - 1) x (Hout - 1) clock cycles for the computations of each row of the input filters. Considering p parallel CEs, the number of required clock cycles is expressed as

$$
\mathrm{CC} = (S \cdot N + W_f - S) \cdot \frac{W_{\mathrm{out}}}{N} \cdot H_f \cdot C_{\mathrm{in}} \cdot \frac{H_{\mathrm{out}} \cdot C_{\mathrm{out}}}{p}
+ (W_f - 1) \cdot (H_{\mathrm{out}} - 1) \cdot H_f \cdot C_{\mathrm{in}} \cdot \frac{C_{\mathrm{out}}}{p}.
\tag{11}
$$

Eq. (11) suggests that the number of required clock cycles for the convolutional computations is independent of N for large values of N (i.e., S * N >> (Wf - S)) when not considering the weight passing overheads. In Section 5.1, we showed that the input pixels are shared among all CEs and each pixel is read at each clock cycle. This means that the number of memory accesses for the input activation maps (MAimaps) is equal to the number of clock cycles required to complete the convolution. On the other hand, the weights are read in the first Wf clock cycles out of a total of (S * N + Wf - S). As a result, the number of required memory accesses by the filters to compute N out of the Hout x Wout pixels of one out of the Cout output activation maps is equal to Cin x Hf x Wf. In general, the number of memory accesses by the filters (MAfilters) can be computed as follows:
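As an illustration (our own sketch, with arbitrary layer dimensions in the example; the paper presumably rounds partial tiles up with ceilings, which we leave out), Eq. (11) and the pipelining bound above translate directly into code:

```python
def zasca_clock_cycles(N, p, S, Wf, Hf, Cin, Hout, Wout, Cout):
    """Clock-cycle count of Eq. (11): the convolution term plus the
    weight-passing overhead term, spread over p parallel CEs."""
    conv = (S * N + Wf - S) * (Wout / N) * Hf * Cin * (Hout * Cout / p)
    passing = (Wf - 1) * (Hout - 1) * Hf * Cin * (Cout / p)
    return conv + passing

def max_pipeline_stages(N, S, Wf):
    """Upper bound on the number of CE pipeline stages from Section 5.1."""
    return (S * N + Wf - S) // Wf

# Example: a hypothetical 3x3, stride-1 layer with p = 32 CEs and N = 56.
print(zasca_clock_cycles(N=56, p=32, S=1, Wf=3, Hf=3, Cin=64,
                         Hout=56, Wout=56, Cout=64))
print(max_pipeline_stages(N=56, S=1, Wf=3))   # -> 19 stages
```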
TABLE 9
The Main Characteristics of ZASCA
obtained from their original papers for comparison purposes throughout this paper.

6.2 Sparsity Degree of Input Activations
As discussed in Section 1, the sparsity among activations is an intrinsic occurrence in CNNs using the ReLU, which dynamically clamps all negatively-valued activations to zero. Fig. 8 shows the breakdown of the sparsity degree of activations in the convolutional layers of AlexNet, VGG-16, ResNet-18 and ResNet-50 when running on ZASCA. More specifically, we only considered the zero-valued activations that can be skipped using ZASCA in our measurement of the sparsity degree (see Section 5). The sparsity degrees were measured over the validation set of the ImageNet dataset. The sparsity degree of each layer is linearly proportional to the amount of speedup. For instance, a 70 percent sparsity degree denotes that 70 percent of MAC operations are avoided on ZASCA and the computation time is reduced by a factor of 2.3.

Fig. 8. The breakdown of the sparsity degree of CNNs when running on ZASCA.

6.3 Performance
We hereafter denote ZASCA performing convolutions on dense activations as ZASCAD and ZASCA performing convolutions on sparse activations as ZASCAS. It is worth mentioning that the accuracy of the models using the dense representation is the same as that of those using the sparse representation for activations. Fig. 9 shows the performance of ZASCAD and ZASCAS when running AlexNet, VGG-16, ResNet-18 and ResNet-50, in terms of giga operations per second, compared to Eyeriss, EEPS and DSIP. It is worth noting that Eyeriss was tested on AlexNet and VGG-16, while EEPS and DSIP were only tested on AlexNet. Fig. 9 shows that ZASCAD outperforms Eyeriss in terms of runtime performance by factors of 1.4x and 3.1x when running AlexNet and VGG-16, respectively. Skipping noncontributory computations further improves the runtime performance, ranging from 2x to 5.8x compared to Eyeriss. ZASCAD yields a slightly better runtime performance compared to EEPS, while EEPS contains 64 more MAC units than ZASCA. Exploiting the sparsity among the activations, ZASCAS performs the convolutions of AlexNet 1.5x faster than EEPS. Finally, ZASCAD and ZASCAS outperform DSIP in terms of runtime performance by factors of 2.7x and 4x, respectively.

As discussed in Section 5, all the PEs in ZASCA share the same input activation pixel and the complete calculations of each output pixel are performed serially, allowing all the PEs to easily skip the noncontributory computations with zero-valued input activation pixels. However, Eyeriss, EEPS and DSIP rely on a 2D array of PEs and perform the computations of each output pixel in a semi-parallel fashion. More precisely, the computations of each output pixel are split into different parts, where a certain number of these parts is performed in parallel. The results of all the parts are then added to make the output pixels. In this way, the PEs computing each part take different activation pixels. In case of skipping the computations with zero-valued input activation pixels of each part, the runtime performance is still limited by the part containing the least number of zero-valued input activation pixels. Note that having all the parts with the same number of zero-valued input activation pixels is unlikely to happen when exploiting the aforementioned approach. Moreover, using batch sizes greater than 1 exacerbates the problem, since skipping the noncontributory computations is possible only when all the batches contain zero-valued input activation pixels at the same time.

Eyeriss and DSIP use batch sizes greater than 1 to obtain a lower number of memory accesses and to achieve a higher energy efficiency. Using high batch sizes allows more filter reuse and maximizes the utilization of all available PEs,
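To clarify what "skippable" means in this measurement, the sketch below (ours; the function name is hypothetical) counts only the zeros ZASCA can actually skip, i.e., complete groups of Nr consecutive zeros beyond the first zero of each run, mirroring the encoder of Section 5:

```python
def zasca_sparsity_degree(activations, nr):
    """Fraction of activations whose MACs ZASCA can skip: from a run of
    z consecutive zeros, ((z - 1) // nr) * nr computations are avoided;
    the remaining zeros of the run are still processed normally."""
    skipped, run = 0, 0
    for a in list(activations) + [1]:   # sentinel flushes the last run
        if a == 0:
            run += 1
        else:
            if run > 0:
                skipped += ((run - 1) // nr) * nr
            run = 0
    return skipped / len(activations)

# A run of 8 zeros with Nr = 3: only 6 of the 8 zeros are skippable.
print(zasca_sparsity_degree([128, 0, 0, 0, 0, 0, 0, 0, 0, 512, 768], 3))
```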
TABLE 10
The Total Latency of ZASCAD, ZASCAS,
Eyeriss, EEPS and DSIP
Fig. 13. The roofline model of ZASCAD, ZASCAS, Eyeriss, EEPS and DSIP tested on AlexNet, VGG-16 and ResNet-50. The circles and solid/
dashed lines denote the average runtime performance and the peak performance bound, respectively.
consumption exceeds the power budgets of embedded devices [16].

Despite the existence of many works on accelerating sparse multiplications [18], [36], [37], [38], [39], [40], only a few works have focused on exploiting sparsity to accelerate convolutional computations in CNNs, such as Cnvlutin [19], Cambricon-X [20] and SCNN [21]. These architectures are mainly designed to accelerate computations on the cloud, where unlimited data bandwidth is provided. Cnvlutin relies on DaDianNao's architecture [30] as its baseline and achieves up to a 1.55x speedup by exploiting sparsity among activations. Cambricon-X is a convolutional accelerator containing 528 operators working at 1 GHz that can perform convolutions of both dense and sparse models. As a result, it can achieve a theoretical peak performance of 528 G-ops/s when performing on dense models. Exploiting weight sparsity, it can skip ineffectual multiplications and speed up the computations by up to a factor of 2.51x over its dense model, yielding 544 G-ops/s at most [20]. Therefore, the highest achievable performance efficiency of this accelerator is 41 and 103 percent when performing on dense and sparse models, respectively. However, ZASCA outperforms Cambricon-X in terms of performance efficiency by yielding a peak performance efficiency of 94 and 168 percent when performing on dense and sparse activations, respectively. Finally, SCNN employs sparsity among both activations and weights to reduce energy, latency and data transfer time. Using this approach, SCNN managed to improve performance and energy by factors of 2.7x and 2.3x compared to its dense baseline, respectively. Despite the improvements that these accelerators provide over their baselines, they fail to meet the power and memory bandwidth constraints of edge devices, where power is limited to a few hundred mW [22]. However, ZASCA uses a substantially different dataflow from the above designs and aims to accelerate convolutions for low-power mobile devices at the edge. Moreover, since no quantitative runtime result was provided, a direct comparison cannot be made with these works.

Recently, a few works have focused on minimizing energy by modulating the precision, frequency and supply voltage of their accelerator for each convolutional layer [9], [11], [13]. In [9], a precision-scalable convolutional accelerator (i.e., Envision), fabricated in 28 nm UTBB FD-SOI technology, was introduced. This architecture dynamically adapts itself depending on the required precision for each layer, instead of using a fixed precision. More precisely, it exploits a reconfigurable MAC which is able to perform one 16-bit, two 8-bit or four 4-bit multiplications/accumulations, depending on the required precision. As a result, using a dynamic fixed-point technique allows the frequency and supply voltage to be changed over time, which results in a lower power/energy consumption. Envision contains 256 reconfigurable MACs. This accelerator performs the convolutional computations of AlexNet in 21.3 ms (62.6 G-ops/s), and those of VGG-16 in 598.8 ms (51.3 G-ops/s), while its performance efficiency is respectively limited to 38 and 32 percent on average. Similar to Eyeriss, the low performance efficiency of Envision results in a large gate count of 1950 kgates. More precisely, Envision requires more parallelism to meet the timing constraint of real-time edge devices, resulting in a large silicon area. Despite the differences in the technology nodes and applied circuit-level techniques, ZASCAS outperforms Envision in terms of gate count (1.9x smaller), latency (1.5x and 2.4x lower), throughput (1.5x and 2.4x faster) and performance efficiency (3.2x and 5x better).

In [12], a configurable accelerator framework (i.e., CAF) was introduced. CAF was fabricated in 28 nm UTBB FD-SOI technology with a silicon area of 34.9 mm2 and contains various types of accelerators, including 8 convolutional accelerator cores (288 MACs in total) for computer vision applications. This architecture uses 16-bit fixed-point representations and performs the convolutional computations of AlexNet in 17.1 ms while consuming 61 mW at 0.575 V, and its performance efficiency is limited to 67 percent. Despite the advanced technology node used in CAF, ZASCAS outperforms this architecture in terms of latency (1.2x lower), throughput (1.2x faster) and performance efficiency (1.8x better).

As discussed in Section 6, ZASCAS outperforms Eyeriss [8] in terms of gate count (1.8x smaller), latency (5.5x and 17.5x lower), throughput (2x and 5.8x faster), performance efficiency (2.2x and 6.2x better) and energy efficiency (2.1x and 4.5x more efficient), while having roughly the same number of memory accesses per batch. It is worth noting that a direct comparison of ZASCA with the works published in [9], [12] does not constitute a fair comparison, since they dynamically modulate precision, frequency and supply voltage and use advanced technology nodes, which allows them to instantiate more PEs while still having a low power/energy consumption. However, the introduced performance efficiency metric can be used for a fair comparison, as it reflects the performance of the accelerators independent of their technology nodes, precisions and optimization techniques, as shown in Fig. 14. It is worth mentioning that the circuit-level techniques used in [9], [11], [13] can also be used in ZASCA to further improve its energy/power consumption.

When considering beyond standard CNNs, bottleneck and depthwise separable convolutions are commonly used nowadays due to computational considerations. The bottleneck convolutions were first introduced and used in very deep residual networks such as ResNet-50 [5]. In the bottleneck architectures, each standard convolutional layer is usually sandwiched between two convolutional layers with a filter size of 1x1. The first convolution is used to reduce the dimensionality of the inputs (i.e., the number of channels) and the last one to restore it. In this way, a very deep network can be achieved while using inexpensive 1x1 filters [26]. In this work, we showed that ZASCA can successfully map the bottleneck architecture used in ResNet-50 while skipping its noncontributory computations with zero-valued input activation pixels. Using depthwise separable convolutions, which are a building block of MobileNets [27], is another way of reducing the complexity of standard convolutions. In this approach, each standard convolution is reformed into a depthwise convolution followed by a pointwise convolution. In the depthwise convolution, each input channel is associated with a separate 2D filter and the result of the convolution between each input channel and its filter constructs a single output channel. On the other hand, the pointwise convolution refers to the standard convolution with a filter size of 1x1.
Since each input activation channel contributes only to one output activation channel in the depthwise convolution, this convolutional process requires parallel access to all the input activation channels and all the filters to perform the computations in parallel. As a result, ZASCA cannot fully exploit its available resources for this type of convolution. More precisely, ZASCA can only perform the depthwise convolutional computations of two output channels in parallel, due to its limited data bandwidth. However, the final impact of the depthwise convolutional processes is negligible, as the computational complexity of depthwise separable convolutions is dominated by the pointwise convolutions [27]. Note that ZASCA can efficiently perform the pointwise convolutional computations (see Section 4.1.3).
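To see why the pointwise part dominates, here is a back-of-the-envelope comparison in Python (our illustration; the layer dimensions in the example are arbitrary):

```python
def depthwise_separable_macs(hout, wout, cin, cout, hf, wf):
    """MAC counts of the two stages of a depthwise separable convolution."""
    depthwise = hout * wout * cin * hf * wf   # one 2D filter per input channel
    pointwise = hout * wout * cin * cout      # 1x1 standard convolution
    return depthwise, pointwise

# e.g., a 3x3 layer on a 14x14 map with 256 input and 256 output channels:
dw, pw = depthwise_separable_macs(14, 14, 256, 256, 3, 3)
print(pw / (dw + pw))   # ~0.97: the pointwise convolution dominates
```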
8 CONCLUSION
In this paper, we proposed ZASCA: an accelerator performing convolutions using either dense or sparse activations. ZASCA exploits a dataflow (i.e., GFID) that allows it to perform the convolutional computations while reading input data under the bandwidth bottleneck of edge devices, which maximizes its performance efficiency. Such a dataflow also enables ZASCA to avoid noncontributory multiplications/accumulations with zero-valued activations to further speed up the computations. Compared to the state-of-the-art accelerator for mobile devices, ZASCA enhances the latency, throughput, energy efficiency and performance efficiency by up to 17.5x, 5.8x, 4.5x and 6.2x, respectively.
5.8, 4.5 and 6.2, respectively. [26] F. N. Iandola, et al., “SqueezeNet: AlexNet-level accuracy with
50x fewer parameters and <1MB model size,” CoRR, vol. abs/
1602.07360, 2016. [Online]. Available: http://arxiv.org/abs/
REFERENCES 1602.07360
[27] A. G. Howard, et al., “MobileNets: Efficient convolutional
[1] Y. Lecun, et al., “Gradient-based learning applied to document neural networks for mobile vision applications,” CoRR, vol. abs/
recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998. 1704.04861, 2017. [Online]. Available: http://arxiv.org/abs/
[2] Y. Lecun, et al., “Deep learning,” Nature, vol. 521, pp. 436–444, 1704.04861
May 2015. [28] A. Vedaldi, et al., “MatConvNet: Convolutional neural networks
[3] A. Krizhevsky, et al., “ImageNet classification with deep convolu- for MATLAB,” in Proc. 23rd ACM Int. Conf. Multimedia, 2015,
tional neural networks,” in Proc. Int. Conf. Neural Inf. Process. Syst., pp. 689–692.
2012, pp. 1097–1105. [29] Y. Jia, et al., “Caffe: Convolutional architecture for fast feature
[4] K. Simonyan, et al., “Very deep convolutional networks for large- embedding,” arXiv:1408.5093, 2014.
scale image recognition,” in Proc. Int. Conf. Learn. Representations, [30] Y. Chen, et al., “DaDianNao: A machine-learning super-
2015. computer,” in Proc. 47th Annu. IEEE/ACM Int. Symp. Microarchit.,
[5] K. He, et al., “Deep residual learning for image recognition,” in Dec. 2014, pp. 609–622.
Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016. [31] T. Luo, et al., “DaDianNao: A neural network supercomputer,”
[6] O. Russakovsky, et al., “ImageNet large scale visual recognition IEEE Trans. Comput., vol. 66, no. 1, pp. 73–88, Jan. 2017.
challenge,” Int. J. Comput. Vis., vol. 115, pp. 211–252, 2015. [32] T. Chen, et al., “DianNao: A small-footprint high-throughput
[7] Y.-H. Chen, et al., “Eyeriss: A spatial architecture for energy- accelerator for ubiquitous machine-learning,” in Proc. 19th Int.
efficient dataflow for convolutional neural networks,” in Proc. Conf. Archit. Support Program. Languages Operating Syst., 2014,
43rd Int. Symp. Comput. Archit., 2016, pp. 367–379. pp. 269–284.
[8] Y. H. Chen, et al., “Eyeriss: An energy-efficient reconfigurable accel- [33] Z. Du, et al., “ShiDianNao: Shifting vision processing closer to the
erator for deep convolutional neural networks,” IEEE J. Solid-State sensor,” in Proc. ACM/IEEE 42nd Annu. Int. Symp. Comput. Archit.,
Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017. Jun. 2015, pp. 92–104.
[9] B. Moons, et al., “14.5 Envision: A 0.26-to-10TOPS/W subword- [34] W. Choi, et al., “On-chip communication network for efficient
parallel dynamic-voltage-accuracy-frequency-scalable convolu- training of deep convolutional networks on heterogeneous many-
tional neural network processor in 28nm FDSOI,” in Proc. IEEE core systems,” IEEE Trans. Comput., vol. 67, no. 5, pp. 672–686,
Int. Solid-State Circuits Conf., Feb. 2017, pp. 246–247. May 2018.
[10] S. Wang, et al., “Chain-NN: An energy-efficient 1D chain architec- [35] N. P. Jouppi, et al., “In-datacenter performance analysis of a tensor
ture for accelerating deep convolutional neural networks,” in Proc. processing unit,” in Proc. 44th Annu. Int. Symp. Comput. Archit., 2017,
Des. Autom. Test Europe Conf. Exhib., Mar. 2017, pp. 1032–1037. pp. 1–12.
[11] D. Shin, et al., “14.2 DNPU: An 8.1TOPS/W reconfigurable CNN- [36] J. Sun, et al., “Sparse matrix-vector multiplication design on
RNN processor for general-purpose deep neural networks,” in FPGAs,” in Proc. 15th Annu. IEEE Symp. Field-Programmable Cus-
Proc. IEEE Int. Solid-State Circuits Conf., Feb. 2017, pp. 240–241. tom Comput. Mach., Apr. 2007, pp. 349–352.
[12] G. Desoli, et al., “14.1 A 2.9TOPS/W deep convolutional neu- [37] M. deLorimier, et al., “Floating-point sparse matrix-vector multi-
ral network SoC in FD-SOI 28nm for intelligent embedded sys- ply for FPGAs,” in Proc. ACM/SIGDA 13th Int. Symp. Field-
tems,” in Proc. IEEE Int. Solid-State Circuits Conf., Feb. 2017, Programmable Gate Arrays, 2005, pp. 75–85.
pp. 238–239. [38] S. Jain-Mendon, et al., “A hardware-software co-design
[13] B. Moons, et al., “An energy-efficient precision-scalable ConvNet approach for implementing sparse matrix vector multiplication
processor in a 40-nm CMOS,” IEEE J. Solid-State Circuits, vol. 52, on FPGAs,” Microprocessors Microsyst., vol. 38, pp. 873–888,
no. 4, pp. 903–914, Apr. 2017. Nov. 2014.
[14] J. Jo, et al., “DSIP: A scalable inference accelerator for convolu-
tional neural networks,” IEEE J. Solid-State Circuits, vol. 53, no. 2,
pp. 605–618, Feb. 2018.
Authorized licensed use limited to: Chungbuk National Univ. Downloaded on April 02,2020 at 06:25:41 UTC from IEEE Xplore. Restrictions apply.
152 IEEE TRANSACTIONS ON COMPUTERS, VOL. 69, NO. 1, JANUARY 2020
[39] D. Gregg, et al., “FPGA based sparse matrix vector multiplication Warren J. Gross (S’92–M’04–SM’10) received
using commodity DRAM memory,” in Proc. Int. Conf. Field Pro- the BASc degree in electrical engineering from
grammable Logic Appl., Aug. 2007, pp. 786–791. the University of Waterloo, Waterloo, ON, Can-
[40] Y. Zhang, et al., “FPGA versus GPU for sparse matrix vector mul- ada, in 1996, and the MASc and PhD degrees
tiply,” in Proc. Int. Conf. Field-Programmable Technol., Dec. 2009, from the University of Toronto, Toronto, ON,
pp. 255–262. Canada, in 1999 and 2003, respectively. He is a
professor and Louis-Ho faculty scholar in techno-
Arash Ardakani received the BSc degree in elec- logical innovation with the Department of Electri-
trical engineering from the Sadjad University of cal and Computer Engineering, McGill University,
Technology, Mashhad, Iran, in 2011, and the MSc Montreal, QC, Canada. He currently serves as
degree from the Sharif University of Technology, chair of the department. His research interests
Tehran, Iran, in 2013. He is currently working include the design and implementation of signal processing systems and
toward the PhD degree in electrical and computer custom computer architectures. He served as the chair for the IEEE Sig-
engineering at McGill University, Montre al, QC, nal Processing Society Technical Committee on Design and Implementa-
Canada. His research interests include the VLSI tion of Signal Processing Systems. He served as the general co-chair for
implementation of signal processing algorithms, in the IEEE GlobalSIP 2017 and the IEEE SiPS 2017 and the technical pro-
particular channel coding schemes and machine gram co-chair for SiPS 2012. He also served as an organizer for the
learning algorithms. Workshop on Polar Coding in Wireless Communications at WCNC 2018
and WCNC 2017, the Symposium on Data Flow Algorithms and Architec-
ture for Signal Processing Systems (GlobalSIP 2014), and the IEEE ICC
2012 Workshop on Emerging Data Storage Technologies. He served as
Carlo Condo received the MSc degree in electri- an associate editor of the IEEE Transactions on Signal Processing and
cal and computer engineering from the Politecnico as a senior area editor. He is a licensed professional engineer in the Prov-
di Torino and the University of Illinois at Chicago, ince of Ontario. He is a senior member of the IEEE.
Chicago, Illinois, in 2010, and the PhD degree in
electronics and telecommunications engineering
from the Politecnico di Torino and IMT Atlantique, " For more information on this or any other computing topic,
in 2014. From 2015 to 2017, he was a post-
doctoral fellow with the ISIP Laboratory, McGill please visit our Digital Library at www.computer.org/csdl.
University, where he was a McGill University dele-
gate at the 3GPP meetings for the fifth-generation
wireless systems standard (5G) in 2017. Since
2018, he has been a researcher with the Communication Algorithms
Design Team, Huawei Paris Research Center. His research is focused
on channel coding, design and implementation of encoder and decoder
architectures, digital signal processing, and machine learning.
Authorized licensed use limited to: Chungbuk National Univ. Downloaded on April 02,2020 at 06:25:41 UTC from IEEE Xplore. Restrictions apply.