
FEECA: Design Space Exploration for Low-Latency and Energy-Efficient Capsule Network Accelerators

Alberto Marchisio, Graduate Student Member, IEEE, Vojtech Mrazek, Member, IEEE, Muhammad Abdullah Hanif, Graduate Student Member, IEEE, and Muhammad Shafique, Senior Member, IEEE

Abstract— In the past few years, Capsule Networks (CapsNets) have taken the spotlight compared to traditional convolutional neural networks (CNNs) for image classification. Unlike CNNs, CapsNets have the ability to learn the spatial relationship between features of the images. However, their complexity grows because of their heterogeneous capsule structure and the dynamic routing, which is an iterative algorithm to dynamically learn the coupling coefficients of two consecutive capsule layers. This necessitates specialized hardware accelerators for CapsNets. Moreover, a high-performance and energy-efficient design of CapsNet accelerators requires exploration of different design decisions (such as the size and configuration of the processing array and the structure of the processing elements). Toward this, we make the following key contributions: 1) FEECA, a novel methodology to explore the design space of the (micro)architectural parameters of a CapsNet hardware accelerator and 2) CapsAcc, the first specialized RTL-level hardware architecture to perform CapsNets inference with high performance and high energy efficiency. Our CapsAcc achieves significant performance improvement, compared to an optimized GPU implementation, due to its efficient implementation of key activation functions, such as squash and softmax, and an efficient data reuse for the dynamic routing. The FEECA methodology employs the Non-dominated Sorting Genetic Algorithm (NSGA-II) to explore the Pareto-optimal points with respect to area, performance, and energy consumption. This requires analytical modeling of the number of clock cycles required to perform each operation of the CapsNet inference and of the memory accesses to enable a fast yet accurate design space exploration. We synthesized the complete accelerator architecture in a 45-nm CMOS technology using Synopsys design tools and evaluated it for the MNIST benchmark (as done by the original CapsNet paper from Google Brain's team) and for a more complex data set, the German Traffic Sign Recognition Benchmark (GTSRB).

Index Terms— Capsule network (CapsNet), design space exploration (DSE), Deep Neural Network (DNN), hardware accelerator, inference, non-dominated sorting genetic algorithm (NSGA-II).

I. INTRODUCTION

CONVOLUTIONAL neural networks (CNNs) have reached state-of-the-art results in terms of accuracy for several machine learning (ML) applications, such as object detection [1], speech recognition [2], and image classification [3]. Recently, Sabour et al. [4] from Google Brain proposed the Dynamic Routing algorithm to efficiently learn the internal connections between capsules of the Capsule Networks (CapsNets) [5]. Such CapsNets are able to encapsulate multidimensional features (e.g., position, orientation, and scaling) across the layers, while traditional CNNs cannot. Thus, CapsNets can beat traditional CNNs in multiple tasks, such as image classification, as shown in [4]. Moreover, CapsNets can also be applied effectively to other ML application domains, such as vehicle detection [6], speech recognition [7], and natural language processing [8]. Despite their outstanding learning capabilities, demonstrated by state-of-the-art image classification accuracy [9], [10], the main challenge in the deployment of CapsNets is their extremely high complexity: they require intense computations due to the matrix multiplications in the capsule processing and the iterative dynamic routing-by-agreement algorithm for learning the cross-coupling between capsules.

Current state-of-the-art Deep Neural Network (DNN) accelerators [11]–[31] proposed energy-aware solutions for inference with traditional CNN models. Although processing element (PE) array-based designs, such as [18], perform parallel matrix multiply-and-accumulate (MAC) operations with good efficiency, the existing CNN accelerators cannot compute several key operations of the CapsNets (i.e., the squashing and the iterative routing-by-agreement) with high performance. Moreover, efficiently processing capsules requires architectural enhancements in the processing array. Therefore, an efficient dataflow requires a direct feedback connection from the outputs of the activation units back to the inputs of the computational units to improve the performance and reduce the memory accesses. Moreover, the search for an efficient CapsNet accelerator significantly increases the complexity of the design space, due to the large set of parameters of the hardware architecture (e.g., size and shape of the computational unit and pipeline stages) and their impact on the performance and the energy efficiency. Such a design space exploration (DSE) leads to a multiobjective optimization problem, where it is challenging to identify the Pareto-frontiers in terms of area, energy, and performance of the CapsNet accelerator.

To address the above-discussed issues, we propose FEECA, a methodology for the DSE of CapsNet accelerators (see Fig. 1). In Section I-A, we discuss the associated scientific challenges, followed by our novel contributions in Section I-B.

Manuscript received October 21, 2020; revised January 6, 2021; accepted January 31, 2021. Date of publication February 25, 2021; date of current version April 1, 2021. This work was supported in part by the Doctoral College Resilient Embedded Systems, which is run jointly by TU Wien's Faculty of Informatics and FH-Technikum Wien, and in part by the Czech Science Foundation under Project 19-10137S. (Corresponding author: Alberto Marchisio.)

Alberto Marchisio and Muhammad Abdullah Hanif are with the Department of Informatics, Institute of Computer Engineering, Technische Universität Wien (TU Wien), 1040 Vienna, Austria (e-mail: alberto.marchisio@tuwien.ac.at).

Vojtech Mrazek is with the Faculty of Information Technology, Brno University of Technology, 61200 Brno, Czech Republic.

Muhammad Shafique is with the Division of Engineering, New York University Abu Dhabi, Abu Dhabi, United Arab Emirates.

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TVLSI.2021.3059518.

Digital Object Identifier 10.1109/TVLSI.2021.3059518
A. Key Scientific Challenges

In this work, we tackle the following fundamental challenges.
1) How does a GPU perform when executing a CapsNet inference? This requires a detailed analysis of the performance of each operation of the CapsNet inference.
2) How to obtain the level of reconfigurability necessary to perform different operations compared to traditional CNN accelerators? This requires the investigation of the dataflow for each computational operation of the CapsNet inference.
3) How to identify Pareto-optimal solutions from different configurations of the computational units of our hardware accelerator for CapsNets? This requires analytical modeling of the performance, area, and energy consumption, which depend on the size of the PE array and on the configuration of the architectural parameters of the accelerator.

Fig. 1. Our flow for the DSE of CapsNet accelerators.

Fig. 2. Example to explain the types of spatial relationships captured by capsules.

B. Our Novel Contributions

The key contributions in this work are the following.
1) We devise a methodology, FEECA, to obtain a low-latency (Fast), Energy-Efficient CapsNet Accelerator (see Section IV).
2) We design CapsAcc, a specialized accelerator that can perform inference on CapsNets, and we design an efficient dataflow for exploiting data reuse in the most critical computations of the CapsNets (see Section V).
3) We perform a DSE using the multiobjective Non-dominated Sorting Genetic Algorithm (NSGA-II) to configure the microarchitectural parameters of the accelerator. With our analysis, we can identify Pareto-optimal sets of configurations of the architectural parameters (see Section VI).
4) To enable the above-discussed designs, we analyze the memory requirements and the performance of the forward pass of CapsNets, through experiments on a high-end GPU, which allows us to identify the bottlenecks (see Section III).
5) We implement and synthesize the complete architecture for a 45-nm technology using the ASIC design flow and perform evaluations for performance, area, and power consumption (see Section VII).

Fig. 1 illustrates an overview of our novel contributions that are embedded in our methodology flow. Before proceeding to the technical sections, we present an overview of CapsNets (see Section II) to a level of detail necessary to understand the key operations of these networks.

II. BACKGROUND: AN OVERVIEW OF CAPSNETS

CapsNets, proposed by Google Brain's team [4], introduced many novelties compared to CNNs, such as the concept of capsules (i.e., multidimensional arrays of neurons), the squashing activation function, and the routing-by-agreement algorithm. Since this article focuses on the analysis of the inference process, the layers and algorithms (such as decoder, margin loss, and reconstruction loss) specific to the training process are beyond the scope of our discussion. For the sake of clarity, we first discuss the specific features of a CapsNet. Afterward, we describe its architectural models, which have been used in this article.

A. Capsule and Squashing Function

The basic unit of a CapsNet is the capsule. Compared to a traditional CNN, which is composed of neurons, i.e., scalar values, a capsule is represented in the form of a vector of neurons. The key advantage of having features represented as a vector, called "prediction vector," is that different spatial properties of the image (e.g., position, scaling, and orientation) can be learned. The length of the vector represents the probability that the entity exists, while the features are encoded in the orientation of the vector. The example in Fig. 2 shows how different features of a face (e.g., eyes, mouth, nose, and ears) can be represented in a capsule.

The squashing is an activation function designed to efficiently fit the prediction vector. It introduces nonlinearity into an array and normalizes the outputs to values between 0 and 1. Given s_j as the input of the squashing function for the capsule j (or, from another perspective, the sum of the weighted prediction vectors) and v_j as its respective output, the squashing function is defined by the following equation:

v_j = (‖s_j‖² / (1 + ‖s_j‖²)) · (s_j / ‖s_j‖).   (1)

The input-output relationship of the squashing function and its first derivative are shown in Fig. 3. Note that, for the sake of clarity, we have plotted the single-dimensional input function since a multidimensional input version cannot be visualized in a chart. The squashing function produces an output bounded between 0 and 1, while its first derivative follows the behavior of the red line, with a peak at the point (0.5767, 0.6495).
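For illustration, a minimal NumPy sketch of (1) is given below; the epsilon term guarding the division is our addition for numerical safety and is not part of the original formulation.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Squashing nonlinearity of (1): keeps the orientation of the capsule
    vector s and maps its length into the interval [0, 1)."""
    norm_sq = np.sum(s * s, axis=-1, keepdims=True)
    norm = np.sqrt(norm_sq + eps)
    return (norm_sq / (1.0 + norm_sq)) * (s / norm)

v = squash(np.array([0.3, -0.4, 1.2]))
print(np.linalg.norm(v))  # < 1, as expected from (1)
```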

Fig. 3. Squashing function and its first derivative, considering a single-dimensional input.

B. Routing-by-Agreement Algorithm

The predictions are propagated across two consecutive capsule layers through the routing-by-agreement algorithm. It is an iterative process that introduces a feedback path in the inference pass. The relations between the input prediction vectors û_j|i and the output vectors v_j are learned dynamically. The flow diagram of the routing-by-agreement is reported in Fig. 4. This algorithm introduces a loop in the forward pass, because the coupling coefficients c_ij are learned during the routing due to the dependence of their values on the current data. The logits b_ij are iteratively updated based on the values of the outputs v_j. Thus, they cannot be considered as constant parameters learned during the training process. Intuitively, this step can cause a computational bottleneck, as demonstrated in Section III.

Fig. 4. Flow of the routing-by-agreement algorithm.
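To make the loop concrete, the sketch below gives one possible NumPy rendering of the flow in Fig. 4. It reuses the squash function sketched in Section II-A; the three iterations follow [4], while the tensor layout is our choice.

```python
import numpy as np

def routing_by_agreement(u_hat, n_iters=3):
    """u_hat: prediction vectors u_hat[j|i], shape (n_in, n_out, d).
    Returns the output capsule vectors v_j; a sketch of the flow in Fig. 4."""
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                               # logits b_ij
    for _ in range(n_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # softmax -> c_ij
        s = np.einsum('ij,ijd->jd', c, u_hat)                 # weighted sums s_j
        v = squash(s)                                         # nonlinearity of (1)
        b = b + np.einsum('ijd,jd->ij', u_hat, v)             # agreement update
    return v
```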
C. CapsNet Architectures

In the following, we analyze two architectures of CapsNets, which are designed for the MNIST [32] and the German Traffic Sign Recognition Benchmark (GTSRB) [33] data sets, respectively.

Fig. 5. Overview of the CapsNet architecture, based on the design of [4] for the MNIST data set.

1) CapsNet Architecture for MNIST: Fig. 5 illustrates the CapsNet architecture [4] designed for the MNIST [32] data set. It consists of three layers.
1) Conv1: A traditional convolutional layer, with 256 channels, a filter size of 9 × 9, stride = 1, and ReLU [34] activations.
2) PrimaryCaps: The first capsule layer, with 32 channels. Each 8-D capsule has 9 × 9 convolutional filters with stride = 2.
3) ClassCaps¹: The last capsule layer, with 16-D capsules for each output class.

¹In the original paper [4], this layer is called DigitCaps. We changed the name to ClassCaps to assert a generic image classification.

Fig. 6. Architecture of the CapsNet for the GTSRB data set.

2) CapsNet Architecture for GTSRB: Fig. 6 illustrates the CapsNet architecture [35] designed for the GTSRB [33] data set. It consists of three layers.
1) Conv1: A traditional convolutional layer, with 256 channels, a filter size of 9 × 9, stride = 1, and ReLU [34] activations.
2) PrimaryCaps: The first capsule layer, with 16 channels. Each 16-D capsule has 5 × 5 convolutional filters with stride = 1.
3) ClassCaps: The last capsule layer, with 32-D capsules for each of the 43 output classes.

III. ANALYSIS OF CAPSNET COMPLEXITY

In the following, we perform a comprehensive analysis to identify how the CapsNet inference is performed on a high-end GPU platform, such as the one used in our experiments, i.e., the Nvidia GeForce GTX1070 GPU. First, in Section III-A, we quantitatively analyze the number of trainable parameters per layer that must be fed from the memory. Then, in Section III-B, we benchmark our PyTorch-based CapsNet implementation [36] for the MNIST data set to measure the performance of the inference process on our GPU. Note that, although complex, the training process of CapsNets [37] is not discussed since we focus on the inference only.

A. Trainable Parameters of the CapsNet

TABLE I: Input Size, Number of Trainable Parameters, and Output Size of Each Layer of the CapsNet.

Table I shows quantitatively the number of parameters needed for each layer. As evident, the majority of the weights belong to the PrimaryCaps layer, due to its 256 channels and 8-D capsules. Even though the ClassCaps layer has a fully connected behavior, it accounts for less than 25% of the total parameters of the CapsNet. Finally, the Conv1 layer parameters and the coupling coefficients account for a very small percentage of the total parameters. The breakdown is reported in Fig. 7. Based on that, we make a valuable observation for designing our hardware accelerator: by considering 8-bit fixed-point weights, we can estimate that an on-chip memory size of 8 MB is large enough to contain every parameter of the CapsNet. Besides, the memory for CapsNet accelerators can be optimized using CapStore [38] and DESCNet [39].

Fig. 7. Distribution of the trainable parameters of the CapsNet across layers.
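These proportions can be cross-checked from the layer shapes given above. The short script below is an illustrative back-of-the-envelope count for the MNIST CapsNet; the 6 × 6 spatial size of the PrimaryCaps output is an assumption that follows from the standard 28 × 28 MNIST input and the two 9 × 9 convolutions.

```python
# Back-of-the-envelope parameter count for the MNIST CapsNet (shapes from [4]).
conv1 = 256 * (1 * 9 * 9) + 256           # 9x9 conv, 1 -> 256 channels (+ biases)
primary = 256 * (256 * 9 * 9) + 256       # 9x9 conv, 256 -> 256 (32 capsules x 8-D)
classcaps = (32 * 6 * 6) * 10 * (8 * 16)  # one 8x16 matrix per (input, output) capsule pair
total = conv1 + primary + classcaps
for name, n in [("Conv1", conv1), ("PrimaryCaps", primary), ("ClassCaps", classcaps)]:
    print(f"{name:12s} {n:>9,d} ({100 * n / total:4.1f}%)")
print(f"Total: {total:,d} parameters -> ~{total / 2**20:.1f} MB at 8 bit/weight")
```

Running it gives roughly 6.8 M parameters, with about 78% in PrimaryCaps and about 22% in ClassCaps, consistent with the breakdown in Fig. 7 and with the 8-MB on-chip memory estimate.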

B. Performance Analysis on a GPU

At this stage, we measure the time required for an inference pass on the GPU. The experimental setup is shown in Fig. 8.

Fig. 8. Experimental setup for GPU analyses.

Fig. 9 shows the time consumed by the computations of each layer. The ClassCaps layer is the computational bottleneck because it is around 10× slower than the previous layers. To obtain more detailed results, a further analysis has been performed regarding the performance of each step of the routing-by-agreement process (see Fig. 10). It is evident that the squashing operation inside the ClassCaps layer is the most compute-intensive operation. This analysis gives us the motivation to spend more effort on optimizing the routing-by-agreement and the squashing in our CapsNet accelerator.

Fig. 9. Layerwise performance of the CapsNet inference pass.

Fig. 10. Performance of each step of the ClassCaps layer during the inference pass.

C. Summary of Key Observations From Our Analyses

From the analyses performed in Sections III-A and III-B, we derive the following key observations.
1) The CapsNet inference performed on the GPU is more compute-intensive than memory-intensive, due to the bottleneck represented by the squashing operation.
2) A massive parallel computation capability in the hardware accelerator is desirable to achieve a similar or better performance than the GPU.
3) Since several parameters need to be stored in the memory, buffers located between the on-chip memory and the PEs are beneficial to maintain a high throughput and mitigate the latency due to on-chip memory accesses.

IV. FEECA: A METHODOLOGY TO DESIGN A FAST, ENERGY-EFFICIENT CAPSNET ACCELERATOR

Fig. 11. Our FEECA methodology for obtaining Pareto-optimal design configurations of the CapsNet accelerators with respect to the given optimization objectives.

The FEECA methodology (see Fig. 11) requires a CapsNet and some optimization objectives (hardware parameters of a CapsNet accelerator) as inputs. The output of the methodology is a set of Pareto-optimal CapsNet accelerators.

The methodology works in general as follows. First, we construct a generic configurable CapsNet accelerator (see Section V), with a set of possible configurations (the search space for the further optimization) and analytical models to calculate the parameters of the accelerator for a given configuration, such as energy consumption, chip area, and delay (the latency is taken during the inference of one CapsNet input element). The analytical models use a set of presynthesized internal primitives, such as PEs and registers. Then, a space-search engine is used to find the optimal configurations of a generic CapsNet accelerator. Since two or more optimization parameters are typically required, a multiobjective search algorithm is needed to find configurations that trade off all the parameters. In this work, we propose to deploy two different search algorithms, i.e., the brute-force and the NSGA-II algorithm (see Section VI), to reduce the search time. The output of the search engine is a set of configurations. These configurations are applied to the generic CapsNet accelerator to get the final set of Pareto-optimal accelerators.

V. CAPSACC: ARCHITECTURAL TEMPLATE OF THE BASELINE CAPSNET INFERENCE ACCELERATOR

A. Designing the CapsAcc Architecture

Fig. 12. Overview of our CapsAcc architecture.

Following the observations discussed in Section III-C, we designed the complete CapsAcc architecture and implemented it in hardware (RTL). The top-level architecture is shown in Fig. 12, where the green blocks highlight our novel contributions over other existing accelerators for CNNs.
Fig. 13. Architecture of different components of our CapsAcc. (a) Processing element array. (b) Single (n_pe = 1) PE. (c) Accumulator. (d) Activation unit. (e) Squashing function unit. (f) Norm function unit. (g) Softmax function unit.

The detailed architectures of the different components of our accelerator are shown in Fig. 13. At the core, our CapsAcc architecture has a PE array that is responsible for all the matrix and vector operations in the CapsNets. The choice of a PE array is based on the fact that such arrays have demonstrated to be extremely efficient in processing convolutional layers [16], which are also the initial layers of the CapsNets. Moreover, our CapsAcc supports a specialized dataflow (see Section V-B), which allows us to exploit the computational parallelism for multidimensional matrix operations. The partial sums are stored and properly added together by the accumulator unit. The activation unit performs different activation functions, according to the requirements of each stage. The buffers (data, routing, and weight buffers) are essential to temporarily store the information to feed the PE array without accessing the data and weight memories frequently. The two multiplexers at the input of the PE array introduce the flexibility to process new data or reuse them, according to the respective dataflow. The control unit coordinates all the operations at each stage of the inference.

1) Processing Element Array: The PE array of our CapsAcc architecture is shown in Fig. 13(a). It is composed of a 2-D array of PEs, with n_1 rows and n_2 columns. For better understandability, Fig. 13(a) presents the 4 × 4 version, while, in our actual CapsAcc design, we use a 16 × 16 PE array. The inputs are propagated toward the outputs of the PE array both horizontally (data) and vertically (weight and partial sum). In the first row, the inputs corresponding to the partial sums are zero-valued because each sum at this stage is equal to 0. Meanwhile, the weight outputs in the last row are not connected because they are not used in the following stages.

Fig. 13(b) shows the data path of a single PE. It has three inputs and three outputs: data, weight, and partial sum, respectively. The core of the PE is composed of a multiplier and an adder. As shown in Fig. 13(b), it has four internal registers: 1) Data Reg. to store and synchronize the data value coming from the left; 2) Sum Reg. to store the partial sum before sending it to the neighbor PE below; 3) Weight1 Reg. to synchronize the vertical transfer; and 4) Weight2 Reg. to store the value for data reuse. The latter is particularly useful for convolutional layer operations, where the same weight of a filter must be reused across different input data. For fully connected computations, the second weight register introduces just one clock cycle of latency, without affecting the throughput. The bit widths of each element have been designed as follows: 1) each PE computes the product of an 8-bit fixed-point data and an 8-bit fixed-point weight and 2) the sum is designed as a 25-bit fixed-point value. At full throttle, each PE produces one output per clock cycle, which also implies one output per clock cycle for every column of the PE array.
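As a purely functional illustration of the PE behavior described above, consider the following sketch; the register-level timing is abstracted away and the fixed-point scaling is omitted, so only the multiply-accumulate and the weight-reuse choice via the Weight2 Reg. are modeled.

```python
def pe_cycle(data_in, weight_in, psum_in, weight2_reg, reuse):
    """One multiply-accumulate step of the PE of Fig. 13(b): the 8-bit data
    is multiplied by either the incoming weight or the locally latched copy
    (Weight2 Reg.), and the product is added to the incoming partial sum."""
    w = weight2_reg if reuse else weight_in
    psum_out = psum_in + data_in * w   # kept within 25 bits in the hardware
    return psum_out, w                 # w is latched back into Weight2 Reg.

# A column of four PEs accumulating a 4-term dot product, one PE per row:
psum = 0
for d, w in zip([1, 2, 3, 4], [5, 6, 7, 8]):
    psum, _ = pe_cycle(d, w, psum, weight2_reg=w, reuse=False)
print(psum)  # 5 + 12 + 21 + 32 = 70
```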

2) Accumulator: The accumulator unit consists of a FIFO buffer that stores the partial sums coming from the PE array and adds them up when needed. We designed the accumulator to support 25-bit fixed-point data. Fig. 13(c) shows the data path of our accumulator. The multiplexer allows feeding the buffer either with the data coming from the PE array or with the data coming from the internal adder of the accumulator. In the overall CapsAcc architecture, there are as many accumulators as the number of columns of the PE array.

3) Activation Unit: The activation units follow the accumulators. As shown in Fig. 13(d), they perform different functions in parallel, while the multiplexer (placed at the bottom of the figure) selects the path to propagate the information toward the output. The figure shows only one unit, while, in the complete CapsAcc architecture, there is one activation unit for each column of the PE array. The 25-bit data values coming from the accumulators are reduced to 8-bit fixed-point values to reduce the computations at this stage.

The rectified linear unit (ReLU) [34] is used for the first two layers of the CapsNet. It is implemented by connecting the input to the output through a multiplexer, which sets the output to zero if the input is negative.

We designed the normalization operator (Norm) with a structure performing a square-and-accumulate operation, where, instead of a traditional multiplier, there is a Power2 operator. Its data path is shown in Fig. 13(f). A register stores the partial sum, and the Sqrt operator produces the output. We designed the Sqrt operator as a lookup table with a 12-bit input and an 8-bit output. The Norm operator produces a valid output every n + 1 clock cycles, where n is the length of the vector (or capsule dimension) for which we want to compute the Norm. Such an operator is used either to compute the classification prediction or as an input for the squashing function, as illustrated in Fig. 13(d).

We designed and implemented the squashing function as a lookup table (LUT), as shown in Fig. 13(e). Following (1), the function takes the vector s_j (elementwise) and its norm ‖s_j‖ as inputs. The Norm input comes from its respective unit. Hence, the Norm operation is not implemented again inside the squash unit. The LUT takes a 6-bit fixed-point data and a 5-bit fixed-point norm as inputs to produce an 8-bit output. The first output of the vector is produced with just one additional clock cycle compared to the Norm. We decided to limit the bit width to reduce the computational requirements at this stage, following the analysis performed in Section III, which shows the highest computational load for this operation. Such a design using an LUT significantly reduces the latency of the squashing operation, as we will demonstrate in Section V-D. A pure logic-based implementation would have required complex mathematical operations that would not be efficient when implemented in hardware.

The softmax function design is shown in Fig. 13(g). Initially, it computes the exponential function (8-bit lookup table) and accumulates the sum in a register, followed by a division. Overall, having an array of n elements, this block is able to compute the softmax function of the whole array in 2n cycles.
processing of different types of layers and operations onto our
2n cycles.
CapsAcc architecture, in a step-by-step fashion. To feed the PE
4) Control Unit: At each stage of the inference process, array, we adopt the mapping policy described in Algorithm 1.
it generates different control signals for all the components For the ease of understanding, we illustrate the process with
of the accelerator architecture, according to the operations the help of a case study performing MNIST classification on
needed. Its functionality is shown in Fig. 14. The core of our CapsAcc. Note that each stage of the CapsuleNet inference
the control unit is a finite state machine (FSM), which gen- requires its own mapping scheme.
erates at the output the control signals for the multiplexers, 1) Dataflow of the Conv1 Layer: The Conv1 layer has filters
the memories, the buffers, and all the other components of of size 9 × 9 and 256 channels. As shown in Fig. 16(a),
the CapsAcc architecture. A set of counters interacts with the we designed a row-by-row mapping (A, B), and after the last
FSM to guarantee the correct timing of all the operations. row, we move to the next channel (C). Fig. 17(a) shows
For example, in a convolution operation, the number of clock how the dataflow is mapped onto our CapsAcc architecture.
cycles needed to process the data for a given set of weights An illustrative example of mapping the weights onto the
is counted, before the next set of weights are loaded onto the weight buffer is shown in Fig. 17. To perform the convolution
PE array. Therefore, the control unit is essential for correctly efficiently, we hold the weight values in the PE array to reuse
scheduling the operations of the accelerator. the filter across different input data.
5) Memory Hierarchy: Besides the registers that are embed- 2) Dataflow of the PrimaryCaps Layer: Compared to the
ded in the PE array and in the activation unit, the memory Conv1 layer, the PrimaryCaps layer has one more dimen-
hierarchy is organized as follows. All the weights for each sion, which is the capsule size (i.e., 8). However, we treat
operation are stored in the on-chip weight memory, while the 8-D capsule as a convolutional layer with eight output
the initial data, which correspond to the pixel intensities of channels. Thus, Fig. 16(b) shows that we map the parameters
the input image, are stored in the on-chip data memory. row-by-row (A, B) and then moving through different input
As the interface between the memories and the accelerator, channels (C), and only at the third stage, we move on to the
the data buffer and weight buffer work as a cushion for the next output channel (D). This mapping procedure allows us to
interaction with the PE array at high bandwidth and high minimize the accumulator size because our CapsAcc computes

B. Dataflow Design

In this section, we provide the details on how to map the processing of the different types of layers and operations onto our CapsAcc architecture, in a step-by-step fashion. To feed the PE array, we adopt the mapping policy described in Algorithm 1 (a loop-nest sketch of this mapping order is given below, after the layer descriptions). For ease of understanding, we illustrate the process with the help of a case study performing MNIST classification on our CapsAcc. Note that each stage of the CapsNet inference requires its own mapping scheme.

Algorithm 1: Mapping algorithm for CapsNet operations onto the PE array.

Fig. 16. Dataflow of the process of mapping different layers onto our CapsAcc architecture. (a) Conv1 layer. (b) PrimaryCaps layer. (c) ClassCaps layer.

Fig. 17. Mapping process shown through an example of convolutional filters mapped onto the weight buffer and the PE array.

1) Dataflow of the Conv1 Layer: The Conv1 layer has filters of size 9 × 9 and 256 channels. As shown in Fig. 16(a), we designed a row-by-row mapping (A, B), and after the last row, we move to the next channel (C). Fig. 17(a) shows how the dataflow is mapped onto our CapsAcc architecture. An illustrative example of mapping the weights onto the weight buffer is shown in Fig. 17. To perform the convolution efficiently, we hold the weight values in the PE array to reuse the filter across different input data.

2) Dataflow of the PrimaryCaps Layer: Compared to the Conv1 layer, the PrimaryCaps layer has one more dimension, which is the capsule size (i.e., 8). However, we treat the 8-D capsule as a convolutional layer with eight output channels. Thus, Fig. 16(b) shows that we map the parameters row-by-row (A, B), then move through the different input channels (C), and only at the third stage do we move on to the next output channel (D). This mapping procedure allows us to minimize the accumulator size because our CapsAcc computes the output features for the same output channel first. Since the type of this layer is convolutional, the weight reuse dataflow is the same as the one in the previous layer, as reported in Fig. 17(a).
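Since Algorithm 1 is only summarized here, the loop nest below sketches our reading of the convolutional mapping order of Fig. 16(a) and (b); the identifier names are ours, and Conv1 is the special case with a single input channel.

```python
def conv_weight_order(n_filter_rows, n_in_ch, n_out_ch):
    """Order in which filter weights are loaded for a convolutional layer:
    row-by-row (A, B), then across input channels (C), and only then to the
    next output channel (D), as in Fig. 16(b)."""
    for out_ch in range(n_out_ch):            # (D) last: next output channel
        for in_ch in range(n_in_ch):          # (C) then: next input channel
            for row in range(n_filter_rows):  # (A, B) first: filter rows
                yield out_ch, in_ch, row

# PrimaryCaps example: 9 filter rows, 256 input channels, and the 8-D
# capsule treated as eight output channels.
order = list(conv_weight_order(9, 256, 8))
print(order[0], order[-1])  # (0, 0, 0) ... (7, 255, 8)
```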
3) Dataflow of the ClassCaps Layer: The mapping of the ClassCaps layer is shown in Fig. 16(c). After mapping row-by-row (A, B), we consider input capsules and input channels as the third dimension (C) and output capsules and output channels as the fourth dimension (D). Hence, in this way, output feature map (OFMAP) reuse is achieved, which minimizes the energy consumption of the accumulators. However, recalling the algorithm in Fig. 4, other types of computations, i.e., sum, squash, update, and softmax, need to be performed in this layer. The input vectors for computing the sum and update operations are mapped column by column onto the PE array. This approach, having each vector mapped onto the same column of the PE array, simplifies the computations of the squash and softmax functions, which are performed by the activation units, and avoids any interdependence across the different columns.

Then, for each step of the routing-by-agreement process, we design the corresponding dataflow. It is a critical phase because a less efficient mapping can potentially have a huge impact on the overall performance.

First, we apply an algorithmic optimization to the routing-by-agreement algorithm. During the first operation, instead of initializing b_ij to 0 and computing the softmax on them, we directly initialize the coupling coefficients c_ij to 0. The starting point is indicated with the green arrow in Fig. 4. With this optimization, we can skip the softmax computation at the first routing iteration. In fact, in this operation, all the inputs are equal to 0; thus, they do not depend on the current data.

C. Dataflow of the Dynamic Routing

Regarding the dataflow of our CapsAcc, we identified three different scenarios during the dynamic routing algorithm.
1) First Sum Generation and Squash: The predictions û_j|i are loaded from the data buffer, the coupling coefficients c_ij come from the routing buffer, the PE array computes the sums s_j, the activation unit selects and computes the squash, and the outputs v_j are stored back in the routing buffer. This dataflow is shown in Fig. 17(b).
2) Update and Softmax: The predictions û_j|i are reused through the horizontal feedback of the architecture, the outputs v_j come from the routing buffer, the PE array computes the updates for b_ij, and the softmax at the activation unit produces the coefficients c_ij, which are stored back in the routing buffer. Fig. 17(c) shows the dataflow described above.
3) Sum Generation and Squash: Fig. 17(d) shows the dataflow for this scenario. Compared to Fig. 17(b), the predictions û_j|i come from the horizontal feedback link, thus exploiting data reuse also in this stage.

Fig. 15. Dataflow of our CapsAcc for different scenarios of the case study. (a) Convolutional layer mapping. (b) Sum generation and squashing operation mapping for the first routing iteration. (c) Update and softmax operation mapping. (d) Sum generation and squashing operation mapping for all but the first routing iteration.

D. Synthesis of the Complete CapsAcc

1) Experimental Setup: We implemented the complete design of our CapsAcc architecture in RTL (VHDL) and evaluated it for the MNIST data set (to stay consistent with the original CapsNet paper). We synthesized the complete architecture in a 45-nm CMOS technology using the ASIC design flow with the Synopsys Design Compiler. We did functional and timing validation through gate-level simulations using Mentor ModelSim and obtained the precise area, power, and performance figures of our design. The complete synthesis flow is shown in Fig. 18, where the blue and green boxes represent the inputs and the output results of our experiments, respectively.

Fig. 18. Synthesis flow and tool chain of our experimental setup.

Important Note: Since our hardware design is fully functionally compliant with the original CapsNet design of the work of [4], we observed the same classification accuracy. Therefore, we do not present any classification results in this article and only focus on the performance, area, and power results, which are more relevant for an optimized hardware architecture.

2) Discussion on Comparative Results: The graph in Fig. 19 shows the performance (execution time) results of the different layers of the CapsNet inference on our CapsAcc, while Fig. 20 shows the performance of every sequence of the routing process. Compared with the Nvidia GTX1070 GPU performance (see Figs. 9 and 10), we obtained a significant speedup for the overall computation time of a CapsNet inference pass (6×). The most notable improvements are witnessed in the ClassCaps layer (12×) and in the squashing operation (172×).
Fig. 19. Layerwise performance of the inference pass of the CapsNet on CapsAcc compared to the GPU.

Fig. 20. Performance of the inference pass on each step of the routing-by-agreement algorithm on CapsAcc compared to the GPU.

3) Detailed Area and Power Breakdown: The details and synthesis parameters of our design are reported in Table II. Table III shows the absolute values of the area and power consumption of all the components of the synthesized CapsAcc. Fig. 21(a) and (b) shows the area and power breakdowns, respectively, of our CapsAcc architecture. These figures show that the area and power contributions are dominated by the buffers, and the PE array occupies less than 1/3 of the total budget.

TABLE II: Parameters of Our Synthesized CapsAcc.

TABLE III: Area and Power for the Different Components of Our CapsAcc.

Fig. 21. (a) Area and (b) power breakdown of our CapsAcc.

VI. DESIGN SPACE EXPLORATION FOR THE HARDWARE ARCHITECTURE

A. Optimization Problem

The architecture designed in Section V serves as a baseline to deploy our FEECA methodology, whose goal is to find Pareto-optimal sets of architectural parameters of the CapsNet accelerator to achieve a good tradeoff between our design objectives, which are area, energy, and performance. Since we focus on multiple objectives, standard optimization methods (e.g., branch & bound) are not suitable for this task because they typically optimize only one objective and are exhaustive in nature.

1) Problem Formulation: The optimization problem is defined as follows.
1) We have as input k parameters p_1 ∈ P_1, p_2 ∈ P_2, ..., p_k ∈ P_k of the accelerator, where P_i is the set of possible values of parameter p_i.
2) We define a set of configurations C ⊆ P_1 × P_2 × ··· × P_k.
3) We are primarily interested in the configurations belonging to the Pareto set, which contains the so-called nondominated solutions.
For example, if we consider two configurations c_1, c_2 ∈ C, then c_1 dominates c_2 if: 1) c_1 is not worse than c_2 in all objectives and 2) c_1 is strictly better than c_2 in at least one objective.

B. Search Algorithms: Brute-Force Versus Heuristic Search

A straightforward approach uses a brute-force search. For small test cases, the evaluation of all the configurations can be feasible. It is important to use specialized algorithms to construct the Pareto front. In this work, we use an efficient construction algorithm based on binary space partitioning [40]. However, the enumeration of all the possible combinations may be time-consuming. To avoid that, we propose to use a multiobjective heuristic algorithm. The search algorithm uses a modified variant of the NSGA-II [41]. It is a powerful and smart algorithm for multiobjective optimization, which significantly reduces the exploration time while still finding solutions on the Pareto front.

Fig. 22. One step (generation) of the multiobjective heuristic NSGA-II algorithm [41].

Fig. 23. Uniform crossover of two configurations c_1 and c_2 of length k = 5 parameters and a randomly generated crossover vector (0, 1, 1, 0, 1).

1) NSGA-II Algorithm: The NSGA-II algorithm (see Fig. 22) generates a set of offspring Q_t from the current population P_t. Each offspring is generated from two randomly picked individuals c_1 and c_2 from P_t. Then, a crossover binary vector of length k is randomly generated. This vector specifies whether c_1 or c_2 is used as the source for the crossover, as shown in Fig. 23. After that, one randomly selected parameter of the configuration (the so-called gene) is mutated with a small probability ρ.

The individuals P_t ∪ Q_t are sorted into multiple fronts F_i, according to the dominance relation. The first front F_1 contains all the nondominated solutions along the Pareto front. Each subsequent front (F_2, F_3, ...) is constructed by removing all the preceding fronts from the population and finding a new Pareto front. The first fronts (F_1 and F_2 in Fig. 22) are copied to the next population P_{t+1}. If any front must be split (F_3 in Fig. 22), a crowding distance is used for the selection of the individuals to P_{t+1}.
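The dominance relation of Section VI-A and the resulting front extraction can be written compactly; the quadratic-time sketch below is for illustration only, whereas our implementation uses the binary-space-partitioning construction of [40] and the crowding-based selection of [41].

```python
def dominates(a, b):
    """a dominates b (minimization): not worse in all objectives and
    strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Nondominated subset of a list of objective tuples, e.g., (area, energy, delay)."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

print(pareto_front([(1, 5), (2, 2), (3, 3), (4, 1)]))  # [(1, 5), (2, 2), (4, 1)]
```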
Algorithm 2: NSGA-II.

The algorithm runs iteratively for g generations (steps). Its pseudocode is reported in Algorithm 2, where the following procedures are used (a sketch reconstructing this loop is given after the list).
1) RandomConfigurations(X, n) randomly picks n configurations from a set X.
2) CrossoverAndMutate(P, n) generates n new offspring from the parents P by uniform crossover and mutation.
3) EstimateParameters(X) evaluates the new candidate solutions from a set X.
4) PickPareto(X) selects the Pareto-optimal solutions from a set X; these solutions are removed from the set.
5) DistanceCrowding(X, n) returns n solutions from a set X (see [41] for further details).

The advantage of having a multiobjective algorithm is that it reconstructs the Pareto front in each generation and tries to cover all possible solutions. The output of the multiobjective algorithm is a set of nondominated circuits.
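Since the pseudocode box of Algorithm 2 is summarized only through these procedures, the following sketch reconstructs the loop from their descriptions; all signatures are therefore our assumptions, and the crowding-distance selection is replaced by a crude truncation (see [41] for the exact criterion). It reuses the dominates function sketched in Section VI-B.

```python
import random

def nsga2(space, estimate, generations=100, pop=50, rho=0.1):
    """Sketch of Algorithm 2. space: list of value sets P_1..P_k;
    estimate: configuration -> tuple of objectives (EstimateParameters)."""
    rand_cfg = lambda: tuple(random.choice(list(P_i)) for P_i in space)
    def offspring(P):                                    # CrossoverAndMutate
        c1, c2 = random.choices(P, k=2)                  # two randomly picked parents
        child = [random.choice(g) for g in zip(c1, c2)]  # uniform crossover (Fig. 23)
        if random.random() < rho:                        # mutate one random gene
            i = random.randrange(len(space))
            child[i] = random.choice(list(space[i]))
        return tuple(child)
    P = [rand_cfg() for _ in range(pop)]                 # RandomConfigurations
    for _ in range(generations):
        R = list({*P, *(offspring(P) for _ in range(pop))})  # P_t U Q_t
        obj = {c: estimate(c) for c in R}
        nxt = []
        while len(nxt) < pop and R:
            F = [c for c in R if not                      # PickPareto: front F_i
                 any(dominates(obj[d], obj[c]) for d in R if d != c)]
            take = F if len(nxt) + len(F) <= pop else F[:pop - len(nxt)]
            nxt += take                                   # truncation stands in for
            R = [c for c in R if c not in F]              # DistanceCrowding [41]
        P = nxt
    return [c for c in P if not any(dominates(obj[d], obj[c]) for d in P if d != c)]

# Example: two parameters with objectives (p1 + p2, p1 * p2).
front = nsga2([range(1, 8), range(1, 8)],
              lambda c: (c[0] + c[1], c[0] * c[1]), generations=20, pop=10)
```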
C. Set of Internal Primitives

In contrast to the first version of our CapsAcc (see Fig. 12), we propose a modified version of the PEs with multiple pairs of weight and data inputs (n_pe) that are multiplied and reduced using a reduction tree.

Such PEs can be generated in a configurable manner by varying the following parameters:
1) the number of input pairs n_pe of bit width b_in;
2) the bit width of the partial sum b_out;
3) the number of pipeline stages n_stg;
4) the number of rows of the PE array #ROWS;
5) the number of columns of the PE array #COLS.

Fig. 24. Example of a generalized PE with n_pe = 8 pairs, b_in = 8, b_out = 25, and n_stg = 1.

The PEs are constructed to have a minimal logical depth D = log2(n_out + 1), where n_out is the maximum number of outputs from the multipliers that must be added. We assume that b_out ≥ 2·b_in because the output is the result of a sum of multiplications. Then, each adder in the tree structure has a bit width lower than or equal to b_out. Compared to the PE architecture in Fig. 13(b), the PE in Fig. 24 is a more generalized version. Hence, in the following experiments of Section VII, we will use the latter version.

The longest computational paths of the tree can be reduced by inserting pipeline registers along the paths, i.e., by increasing the parameter n_stg. This modification may cause a significant area and energy overhead because of the additional registers to be inserted in every wire at the same pipeline stage.

D. Estimation of the Parameters of the Accelerator

An essential aspect of the brute-force search, which is also valid for the heuristic space search, is the estimation of the HW parameters of the accelerator in a fast and accurate way. Note that, in this work, we focus on the parameters of the HW accelerator that have no impact on the overall accuracy of the CapsNet: the area, the delay, and the energy consumption (considering the contributions of the PE array and the memory accesses).

1) Area: The model to estimate the area of the PE array is simple yet accurate. We consider a modular approach where the estimation is built bottom-up. Since the PE array is a Cartesian grid of PEs, the area of the PE array can be estimated as a sum of values from a fully characterized set of primitives (PEs, Regs.) for a given clock period T. Therefore, only the logic synthesis of the primitives is needed.

2) Delay: Modeling the delay, i.e., the computation latency of one inference pass of the CapsNet, is the most critical step because it has to take into account the different values of the internal primitives, as well as the different dataflows for each layer/operation of the inference. Therefore, we build one analytical model for each operation, which computes the number of clock cycles needed to process the inputs of the respective layer. It is parameterized by the internal primitives of the accelerator, i.e., n_stg, n_pe, #COLS, and #ROWS. Then, for each layer, the delay is computed by multiplying the number of clock cycles by the clock period. The overall delay of the CapsNet inference is the sum of the delays of each single operation.
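As an illustration of the shape such a model takes, the sketch below estimates the clock cycles of a generic convolutional layer; the formula is our first-order simplification (it captures only the parallelism over n_pe, #ROWS, and #COLS plus the pipeline fill, not the exact per-layer schedules used by FEECA), and the layer dimensions in the example are hypothetical.

```python
import math

def conv_layer_cycles(out_pixels, out_channels, macs_per_output,
                      n_pe, n_rows, n_cols, n_stg):
    """First-order cycle count: total MACs divided by the MACs the PE array
    can perform per cycle, plus the pipeline fill latency."""
    total_macs = out_pixels * out_channels * macs_per_output
    macs_per_cycle = n_pe * n_rows * n_cols
    return math.ceil(total_macs / macs_per_cycle) + n_stg

def delay_ns(cycles, T_ns):
    return cycles * T_ns   # delay = number of cycles x clock period T

# Hypothetical layer: 20x20 output, 256 channels, 9x9 kernel on a 1x32 array.
c = conv_layer_cycles(20 * 20, 256, 9 * 9, n_pe=4, n_rows=1, n_cols=32, n_stg=1)
print(c, delay_ns(c, T_ns=3))
```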

3) Energy Consumption: The energy needed by the accelerator to complete one inference pass can be broken down into two parts. The first one is the energy consumed by the PE array, i.e., the power consumption of the PE array (calculated in a similar way as for the area, by summing the power consumption of a fully characterized set of primitives), multiplied by the delay. The second contribution is the energy required for the read operations from the data and weight memories, assuming a maximum available SRAM of 8 MB.

VII. EXPERIMENTAL RESULTS

In this section, we show the ability of the proposed methodology to find Pareto-optimal configurations of the PE array of the CapsAcc (see Section V) to efficiently perform the inference of CapsNets. We conducted these experiments on the CapsNet model for the GTSRB data set [33], as described in Section II-C2. The experiments are divided into four parts. In the first experiment (see Section VII-A), the synthesis results of the internal primitives (PEs, registers, and so on) for selected n_stg, T, n_pe, b_in, and b_out parameters are discussed. Then, two sets of Pareto-optimal configurations in terms of energy versus delay and area versus delay objectives are constructed and analyzed in Section VII-B. The speedup and quality of the heuristic NSGA-II search algorithm are discussed in Section VII-C. Finally, a 3-D Pareto front is constructed in Section VII-D.

Fig. 25. Experimental setup (orange) and toolflow for this section.

The experimental setup and toolflow are shown in Fig. 25. Here, the search algorithm explores different configurations to select Pareto-optimal solutions based on the design objectives. The evaluation is done based on the synthesized components (PEs and Regs.) and the models extracted from the baseline CapsAcc performing the inference of the input CapsNet.
A. Generator: Synthesis of Internal Primitives

First, we generate the design primitives for b_in = 8, b_out = 25, n_stg ∈ {1, 2}, and n_pe ∈ [1, 400]. The generated PEs have been synthesized using the Synopsys Design Compiler in a 45-nm technology node with clock periods T ∈ {2, 3, 4} ns. In Fig. 26, the parameters of the designs for n_stg ∈ {1, 2} are shown. Note that, as shown by pointers ①, the constraint on the clock period limits the number of inputs n_pe because the depth of the reduction tree becomes larger and the timing constraints are violated. For example, setting the clock period to 2 ns limits n_pe to 7. Therefore, the maximal n_pe is 7, 130, and 300 for n_stg = 1 and 7, 150, and 400 for n_stg = 2, for the three clock periods, respectively.

Fig. 26. Power consumption and area of PEs with various bit widths of the partial sum (b_out) and n_stg = 1. The dotted lines show the maximal number of inputs n_pe that can be synthesized without violating the constraint for a given bit width.

We also synthesize the designs where the computational path is divided into two clock cycles (registers are placed after the multipliers) in a pipelined fashion. The additional registers cause a 28% power overhead compared to a single-cycle computation. The area overhead is 36%. To compute the energy consumption due to the memory accesses, we design the SRAM memory using the CACTI-P tool [42], considering a total size of 8 MB and a block size of 128 B. The results of area, energy for the read access, and leakage power, varying the memory bandwidth (mem_bw), are reported in Table IV.

TABLE IV: Parameters of SRAM.

B. Complete Accelerator Construction

The parameters of the CapsAcc are optimized using the proposed FEECA methodology, as discussed in Fig. 25. We consider two pairs of objectives, which are energy versus delay (E versus D) and area versus delay (A versus D). Using a brute-force algorithm, our FEECA methodology finds 228 E versus D Pareto-optimal configurations and 127 A versus D Pareto-optimal configurations, as shown in Fig. 27 (optimal points).

Fig. 27. Pareto-optimal configurations found by the brute-force algorithm (optimal), the NSGA-II algorithm (heuristic), and a random search, and the other Pareto-dominated solutions (brute-force), for (left) energy versus delay and (right) area versus delay objectives.

Fig. 28. Energy and delay of the separate layers with configurations that are (blue dots) optimal for the whole CapsNet and (red dots) optimal for one layer only. The highlighted solution (n_stg = 1, n_pe = 7, #COLS = 12, #ROWS = 1, mem_bw = 1024, and T = 2 ns) consumes approximately 80% of the energy in the PrimaryCaps and Conv1 layers, while the contributions of the other layers are significantly lower.
Fig. 29. Distribution of the n_pe and #COLS parameters for configurations that are Pareto-optimal for the E versus D and A versus D objectives. The blue figures show the distribution for the objectives of the whole CapsNet and the red ones for the configurations optimized for a single layer.

Note that the Pareto-optimal solutions obtained by the brute-force search are highly overlapping with the solutions generated by the NSGA-II algorithm, meaning that the latter is an efficient, high-fidelity design space exploration algorithm. Moreover, pointer ② indicates that there is a relatively small area difference between the configurations. Note that there are different solutions with the same area, but different delays. We also compared the Pareto-optimal solutions found by the NSGA-II-based FEECA methodology with a random search [43] over the same number of candidate solutions. Compared to the Pareto-optimal points found by the random search (see the green points in Fig. 27), the Pareto-optimal points found by the NSGA-II-based search exhibit 67× and 146× lower average normalized Euclidean distance (ANED) to the optimal points for the E versus D and A versus D objectives, respectively.
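The exact normalization behind the ANED is not spelled out here; the sketch below shows one plausible variant (min-max normalization of each objective over both sets, then the average distance from each reference point to its nearest counterpart) and should be read as an assumption.

```python
import math

def aned(reference, candidates):
    """Average normalized Euclidean distance from each reference point to the
    nearest candidate point; objectives are min-max normalized over both sets."""
    pts = reference + candidates
    k = len(pts[0])
    lo = [min(p[i] for p in pts) for i in range(k)]
    hi = [max(p[i] for p in pts) for i in range(k)]
    norm = lambda p: [(x - l) / (h - l) if h > l else 0.0
                      for x, l, h in zip(p, lo, hi)]
    C = [norm(c) for c in candidates]
    return sum(min(math.dist(norm(r), c) for c in C)
               for r in reference) / len(reference)

print(aned([(1.0, 2.0), (2.0, 1.0)], [(1.1, 2.0), (2.0, 1.2)]))
```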
For the E versus D objectives, Fig. 28 shows the energy consumption and the delay of the configurations optimized for: 1) the overall E versus D and 2) the E versus D of every single layer. The PrimaryCaps layer has the biggest impact on the overall energy and delay, and thus, the layerwise and the CapsNet-optimal configurations, in that case, fall almost on the same curve. On the other hand, the CapsNet-optimal configurations degrade the performance of the Sum, Update, and mostly Conv1 layers, but these layers contribute to the overall objectives with a lower impact compared to the PrimaryCaps layer. Indeed, as indicated by pointers ③, an optimal solution for the whole CapsNet belongs only to the Pareto-optimal set of the PrimaryCaps layer, while it is not optimal for the other layers.

Another view on the optimal configurations is presented in Fig. 29. This figure shows the distribution of the parameters of the CapsNet accelerator for the different configurations. Note that, if we consider all the objectives, better results are achieved when using #ROWS = 1. Considering the E versus D objectives, it is convenient to maximize mem_bw. The highest contribution to the overall delay and energy consumption is due to the PrimaryCaps layer. It is convenient to choose the value of n_pe in the range between 1 and 7. However, considering Conv1 only, a better choice would have been n_pe ∈ {4, 7} (see pointer ④) and equal to 4 for the ClassCaps layer. The Sum, Update, and ClassCaps layers prefer a PE array size equal to 32 × 1 (see pointer ⑤). As visible, the distribution of the optimal parameters for the A versus D design objectives is different. Since the area strongly depends on mem_bw, all its values lead to some Pareto-optimal solutions.

C. Heuristic Search Algorithm

The brute-force algorithm eventually finds the optimal solutions. However, it is very slow because all the possible solutions are explored. Therefore, we implement the heuristic NSGA-II algorithm to speed up the search process. For the E versus D objectives, the NSGA-II runs for 1000 iterations of the generation process, with a population size |P| = |Q| = 50, to find up to 50 Pareto-optimal configurations.

The NSGA-II algorithm needs only 50 050 evaluations (0.44% of the search space). Therefore, the exploration time is decreased from 2.5 h to 30 s, compared to using the brute-force search. The design for the E versus D objective is not trivial because the optimal Pareto frontier consists of 228 configurations. Therefore, the initial settings |P| = |Q| = 50 allow finding only a small subset of the optimal solutions, regularly distributed due to the distance crowding. However, almost all the found solutions belong to the optimal Pareto set, and the ANED from the found solutions to the nearest optimal ones is 4·10⁻⁵. However, the ANED from the optimal solutions to the nearest found solutions is 0.006. To reduce this distance from the optimal solutions, we increase the size of the population to |P| = |Q| = 150 (150 150 evaluations; 1.31% of the search space). With these settings, we found 150 solutions. Although this modification requires 3× more time for the design, the ANED from the optimal to the found solutions decreases to 0.001, and each found solution belongs to the optimal Pareto set. The heuristic design for the A versus D objective with |P| = |Q| = 150 allows us to find 97 of the 127 configurations, with an ANED from the optimal to the found solutions equal to 2·10⁻⁴. The results are shown in Fig. 27.

D. Multiobjective Optimization

By running the search algorithm on our benchmark, three objectives of the CapsNet accelerator are optimized: the area on the chip, the energy consumption for the inference of one input image, and the delay (i.e., the latency of the inference).
MARCHISIO et al.: FEECA: DESIGN SPACE EXPLORATION FOR LOW-LATENCY AND ENERGY-EFFICIENT CapsNet ACCELERATORS 727

TABLE V
R ESULTS FOR THE PE A RRAY AND E STIMATED D ELAY, A REA , AND E NERGY C ONSUMPTION FOR THE W HOLE C APS A CC A RCHITECTURE . T HE fastest
(L OWEST D ELAY ) C ONFIGURATION OF THE C APS A CC I S H IGHLIGHTED IN G REEN IN THE F IRST ROW, W HILE THE O RIGINAL V ERSION OF THE
C APS A CC , W HICH WAS A NALYZED IN S ECTION V, I S R EPORTED IN THE S ECOND L AST ROW. A LL C IRCUITS H AVE
B EEN S YNTHESIZED W ITH THE C LOCK P ERIOD T = 3 NS

Fig. 31. Distribution of the configuration parameters for the optimal solutions
found for three objectives. The n stg parameter was always equal to 1.

(pointer ⑦), while it is not Pareto-optimal for the other case.


Indeed, if we consider the EDP, 75% of the configurations
are Pareto-dominated. Similarly, considering the ADP and the
EAP, more than 42% and 43% of the configurations are Pareto-
dominated, respectively.
The distribution of the parameters for the three optimization
objectives is visualized in Fig. 31. Note that the resulting
Fig. 30. Pareto-set of configurations with three objectives in a figure (top) and
distribution is significantly different from the distribution of
combining two objectives as products (bottom). The highlighted solution has the solutions found for two objectives, as shown in Fig. 29.
a configuration (n stg = 1, n pe = 4, #COLS = 32, #ROWS = 1, T = 3 ns, Hence, it is a sign of the interdependence of the design
and mem bw = 1024). objectives.

E. Case Study: Synthesis of a Pareto-Optimal Solution

As a case study, we synthesized the complete PE array of the selected solution (highlighted with a gray circle in Fig. 30) using the Synopsys Design Compiler. The microarchitectural structure of the PE array is shown in Fig. 32. Note that, since the solution has only one row, the structure of the PE differs from the generic PE (see Fig. 24) in two aspects, illustrated by the sketch after this list.
1) Since there is only one row, the Weight1 Reg. is not needed, because there is no reason to store the weight values for the subsequent rows.
2) Since there is only one row, the input partial sums are null. Therefore, all the related connections and additions are omitted.


Fig. 32. Microarchitecture of the PE array for the selected solution, which has n_pe = 4, #COLS = 32, and #ROWS = 1.

The first row of Table V shows the results of the selected solution in Fig. 30. The last two rows report the results of state-of-the-art designs of the baseline architecture of CapsAcc (see Section V-D) and DESCNet [39], while the rest of the rows show the results of a few other solutions, which have the same amount of multipliers as the solution in the first row.

From the results in Table V, we can derive that energy and delay significantly depend on the microarchitectural configurations of the accelerator, while the area is strongly affected by the memory bandwidth. A high memory bandwidth implies a low delay, but at the cost of a higher energy consumption. The throughput of the system can simply be derived as the inverse of the latency.

Key Observation: Note that, despite the baseline design being 2-D in terms of rows and columns of the PE array, there exist solutions in the Pareto front with only one row or only one column. This is due to the fact that the second dimension needed for accelerating the computation of matrix multiplication is embedded in the parameter n_pe, which is higher than 1. Therefore, efficient parallelism is guaranteed by the designed PE, as shown in Fig. 13(b). It is remarkable that the selected solution, in the first row of Table V, reduces the delay by a factor of 7× compared to the last row, which corresponds to the solution analyzed in Section V.
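To illustrate these two quantitative remarks, i.e., the throughput as the inverse of the latency and the delay-reduction factor, a quick numerical example follows; the delay values are placeholders chosen for illustration, not the actual Table V entries:

```python
# Placeholder delays in milliseconds (illustrative, not the Table V values).
delay_selected_ms = 0.4  # fastest configuration (first row of Table V)
delay_baseline_ms = 2.8  # baseline CapsAcc analyzed in Section V (last row)

throughput = 1000.0 / delay_selected_ms          # inferences per second
speedup = delay_baseline_ms / delay_selected_ms  # delay-reduction factor (~7x)

print(f"throughput = {throughput:.0f} inferences/s, "
      f"delay reduction = {speedup:.1f}x")
```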
VIII. CONCLUSION

This article presents FEECA, a novel methodology to explore the design space of a specialized hardware accelerator that computes CapsNet inference. Having flexible connections through multiplexers at the input of the PE array enables the efficient processing of different types of workloads, which is crucial for CapsNets. Through an exploration of the design space with the help of the NSGA-II algorithm, we can find Pareto-optimal solutions for the computing hardware. Thus, a multiobjective (area, delay, and energy) search leads to a wider set of solutions than reducing the search dimension to only two objectives. Moreover, we presented the results of the complete synthesized accelerator compared to an optimized GPU implementation. Our accelerator provides the first proof of concept for realizing CapsNet hardware in a systematic way and opens new avenues for its high-performance inference deployments.

REFERENCES

[1] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-view 3D object detection network for autonomous driving,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6526–6534.
[2] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional LSTM networks,” in Proc. IEEE Int. Joint Conf. Neural Netw. (IJCNN), Jul./Aug. 2005, pp. 2047–2052.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. NIPS, 2012, pp. 1097–1105.
[4] S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between capsules,” in Proc. NeurIPS, 2017, pp. 3859–3869.
[5] G. E. Hinton, A. Krizhevsky, and S. D. Wang, “Transforming auto-encoders,” in Proc. Int. Conf. Artif. Neural Netw. (ICANN), 2011, pp. 44–51.
[6] Y. Yu, T. Gu, H. Guan, D. Li, and S. Jin, “Vehicle detection from high-resolution remote sensing imagery using convolutional capsule networks,” IEEE Geosci. Remote Sens. Lett., vol. 16, no. 12, pp. 1894–1898, Dec. 2019.
[7] X. Wu et al., “Speech emotion recognition using capsule networks,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2019, pp. 6695–6699.
[8] W. Zhao, H. Peng, S. Eger, E. Cambria, and M. Yang, “Towards scalable and reliable capsule networks for challenging NLP applications,” in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, Florence, Italy, Jul. 2019, pp. 1549–1559, doi: 10.18653/v1/P19-1150.
[9] J. Rajasegaran et al., “DeepCaps: Going deeper with capsule networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 10717–10725.
[10] A. Marchisio, A. Massa, V. Mrazek, B. Bussolino, M. Martina, and M. Shafique, “NASCaps: A framework for neural architecture search to optimize the accuracy and hardware efficiency of convolutional capsule networks,” in Proc. 39th Int. Conf. Comput.-Aided Design (ICCAD), Nov. 2020, pp. 1–9.
[11] Y. Chen et al., “DaDianNao: A machine-learning supercomputer,” in Proc. 47th Annu. IEEE/ACM Int. Symp. Microarchitecture, Dec. 2014, pp. 609–622.
[12] L. Cavigelli, D. Gschwend, C. Mayer, S. Willi, B. Muheim, and L. Benini, “Origami: A convolutional network accelerator,” in Proc. GLSVLSI, 2015, pp. 199–204.
[13] S. Han et al., “EIE: Efficient inference engine on compressed deep neural network,” in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2016, pp. 243–254.
[14] S. Zhang et al., “Cambricon-X: An accelerator for sparse neural networks,” in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), Oct. 2016, pp. 1–12.
[15] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, “Cnvlutin: Ineffectual-neuron-free deep neural network computing,” in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2016, pp. 1–13.
[16] N. P. Jouppi et al., “In-datacenter performance analysis of a tensor processing unit,” in Proc. ISCA, 2017, pp. 1–12.
[17] A. Parashar et al., “SCNN: An accelerator for compressed-sparse convolutional neural networks,” in Proc. 44th Annu. Int. Symp. Comput. Archit. (ISCA), 2017, pp. 27–40.
[18] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, “FlexFlow: A flexible dataflow accelerator architecture for convolutional neural networks,” in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2017, pp. 553–564.
[19] H. Kwon, A. Samajdar, and T. Krishna, “Maeri: Enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects,” in Proc. 23rd Int. Conf. Architectural Support Program. Lang. Operating Syst. (ASPLOS), 2018, pp. 461–475.
[20] H. Sharma et al., “Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural network,” in Proc. ACM/IEEE 45th Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2018, pp. 764–775.
[21] S. Sen, S. Jain, S. Venkataramani, and A. Raghunathan, “SparCE: Sparsity aware general-purpose core extensions to accelerate deep neural networks,” IEEE Trans. Comput., vol. 68, no. 6, pp. 912–925, Jun. 2019.
[22] M. A. Hanif, A. Marchisio, T. Arif, R. Hafiz, S. Rehman, and M. Shafique, “X-DNNs: Systematic cross-layer approximations for energy-efficient deep neural networks,” J. Low Power Electron., vol. 14, no. 4, pp. 520–534, Dec. 2018.
[23] M. Oh et al., “Convolutional neural network accelerator with reconfigurable dataflow,” in Proc. Int. SoC Design Conf. (ISOCC), Nov. 2018, pp. 42–43.
[24] A. Marchisio et al., “Deep learning for edge computing: Current trends, cross-layer optimizations, and open research challenges,” in Proc. IEEE Comput. Soc. Annu. Symp. VLSI (ISVLSI), Jul. 2019, pp. 553–559.
[25] C. Deng, F. Sun, X. Qian, J. Lin, Z. Wang, and B. Yuan, “TIE: Energy-efficient tensor train-based inference engine for deep neural network,” in Proc. 46th Int. Symp. Comput. Archit., Jun. 2019, pp. 264–278.


[26] M. Capra, B. Bussolino, A. Marchisio, M. Shafique, G. Masera, and M. Martina, “An updated survey of efficient hardware architectures for accelerating deep convolutional neural networks,” Future Internet, vol. 12, no. 7, p. 113, Jul. 2020.
[27] S. Sharify et al., “Laconic deep learning inference acceleration,” in Proc. 46th Int. Symp. Comput. Archit., Jun. 2019, pp. 304–317.
[28] J. Li et al., “SqueezeFlow: A sparse CNN accelerator exploiting concise convolution rules,” IEEE Trans. Comput., vol. 68, no. 11, pp. 1663–1677, Nov. 2019.
[29] A. Marchisio, V. Mrazek, M. A. Hanif, and M. Shafique, “ReD-CaNe: A systematic methodology for resilience analysis and design of capsule networks under approximations,” in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2020, pp. 1205–1210.
[30] M. Capra, B. Bussolino, A. Marchisio, G. Masera, M. Martina, and M. Shafique, “Hardware and software optimizations for accelerating deep neural networks: Survey of current trends, challenges, and the road ahead,” IEEE Access, vol. 8, pp. 225134–225180, 2020.
[31] M. A. Hanif and M. Shafique, “A cross-layer approach towards developing efficient embedded deep learning systems,” Microprocess. Microsyst., 2021, Art. no. 103609, doi: 10.1016/j.micpro.2020.103609.
[32] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[33] S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, and C. Igel, “Detection of traffic signs in real-world images: The German traffic sign detection benchmark,” in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Aug. 2013, pp. 1–8.
[34] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proc. Int. Conf. Mach. Learn. (ICML), 2010, pp. 1–8.
[35] A. D. Kumar, “Novel deep learning model for traffic sign detection using capsule networks,” CoRR, vol. abs/1805.04424, May 2018.
[36] A. Paszke et al., “Automatic differentiation in PyTorch,” in Proc. NIPS Autodiff Workshop, 2017, pp. 1–4.
[37] A. Marchisio et al., “FasTrCaps: An integrated framework for fast yet accurate training of capsule networks,” in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2020, pp. 1–8.
[38] A. Marchisio, M. A. Hanif, M. T. Teimoori, and M. Shafique, “CapStore: Energy-efficient design and management of the on-chip memory for CapsuleNet inference accelerators,” CoRR, vol. abs/1902.01151, Apr. 2019.
[39] A. Marchisio, V. Mrazek, M. A. Hanif, and M. Shafique, “DESCNet: Developing efficient scratchpad memories for capsule network hardware,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., early access, Oct. 13, 2020, doi: 10.1109/TCAD.2020.3030610.
[40] T. Glasmachers, “A fast incremental BSP tree archive for non-dominated points,” in Evolutionary Multi-Criterion Optimization. Berlin, Germany: Springer-Verlag, 2017, doi: 10.1007/978-3-319-54157-0_18.
[41] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: NSGA-II,” IEEE Trans. Evol. Comput., vol. 6, no. 2, pp. 182–197, Apr. 2002.
[42] S. Li, K. Chen, J. H. Ahn, J. B. Brockman, and N. P. Jouppi, “CACTI-P: Architecture-level modeling for SRAM-based structures with advanced leakage reduction techniques,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), Nov. 2011, pp. 694–701.
[43] T. Jansen, Evolutionary Algorithms and Other Randomized Search Heuristics. Berlin, Germany: Springer, 2013.
[44] A. Marchisio, M. A. Hanif, and M. Shafique, “CapsAcc: An efficient hardware accelerator for CapsuleNets with data reuse,” in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2019, pp. 964–967.

Alberto Marchisio (Graduate Student Member, IEEE) received the B.Sc. degree in electronic engineering and the M.Sc. degree in electronic engineering (electronic systems) from the Politecnico di Torino, Turin, Italy, in October 2015 and April 2018, respectively. He is currently working toward the Ph.D. degree at the Computer Architecture and Robust Energy-Efficient Technologies (CARE-Tech.) Lab, Institute of Computer Engineering, Technische Universität Wien (TU Wien), Vienna, Austria, under the supervision of Prof. Dr. Muhammad Shafique.
His main research interests include hardware and software optimizations for machine learning, brain-inspired computing, VLSI architecture design, emerging computing technologies, robust design, and approximate computing for energy efficiency. He received the honorable mention at the Italian National Finals of the Maths Olympic Games in 2012 and the Richard Newton Young Fellow Award in 2019.

Vojtech Mrazek (Member, IEEE) received the Ing. and Ph.D. degrees in information technology from the Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic, in 2014 and 2018, respectively.
He was a Visiting Post-Doctoral Researcher with the Department of Informatics, Institute of Computer Engineering, Technische Universität Wien (TU Wien), Vienna, Austria, from 2018 to 2019. He is currently a Researcher with the Faculty of Information Technology, Brno University of Technology. He has authored or coauthored over 30 conference papers and journal articles focused on approximate computing and evolvable hardware. His research interests are approximate computing, genetic programming, and machine learning.
Dr. Mrazek received several awards for his research in approximate computing, including the Joseph Fourier Award in 2018 for research in computer science and engineering.

Muhammad Abdullah Hanif (Graduate Student Member, IEEE) received the B.Sc. degree in electronic engineering from the Ghulam Ishaq Khan Institute of Engineering Sciences and Technology (GIKI), Topi, Pakistan, in 2011, and the M.Sc. degree in electrical engineering with a specialization in digital systems and signal processing from the School of Electrical Engineering and Computer Science, National University of Sciences and Technology, Islamabad, Pakistan, in 2016. He is currently working toward the Ph.D. degree in computer engineering at Technische Universität Wien (TU Wien), Vienna, Austria, under the supervision of Prof. M. Shafique.
He was also a Research Associate with the Vision Processing Lab, Information Technology University, Lahore, Pakistan, and a Lab Engineer with GIKI, Pakistan. He is currently a University Assistant with the Department of Informatics, Institute of Computer Engineering, TU Wien. His research interests are in brain-inspired computing, machine learning, approximate computing, computer architecture, energy-efficient design, robust computing, system-on-chip design, and emerging technologies.
Mr. Hanif was a recipient of the President’s Gold Medal for his outstanding academic performance during the M.Sc. degree.

Muhammad Shafique (Senior Member, IEEE) received the Ph.D. degree in computer science from the Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany, in 2011.
Afterward, he established and led a highly recognized research group at KIT for several years and conducted impactful collaborative Research and Development activities across the globe. In October 2016, he joined the Faculty of Informatics, Institute of Computer Engineering, Technische Universität Wien (TU Wien), Vienna, Austria, as a Full Professor of computer architecture and robust, energy-efficient technologies. Since September 2020, he has been with the Division of Engineering, New York University Abu Dhabi (NYU-AD), Abu Dhabi, United Arab Emirates. He is also a Global Network Faculty with the NYU Tandon School of Engineering, New York, NY, USA. He holds one U.S. patent and has (co)authored six books, more than ten book chapters, and over 300 articles in premier journals and conferences. His research interests are in design automation and system-level design for brain-inspired computing, AI and machine learning hardware, wearable healthcare devices and systems, autonomous systems, energy-efficient systems, robust computing, hardware security, emerging technologies, field-programmable gate arrays (FPGAs), Multi-Processor Systems on Chips (MPSoCs), and embedded systems. His research has a special focus on cross-layer analysis, modeling, design, and optimization of computing and memory systems. The researched technologies and tools are deployed in application use cases from Internet-of-Things (IoT), smart cyber-physical systems (CPS), and ICT for Development (ICT4D) domains.
Dr. Shafique received the 2015 ACM/SIGDA Outstanding New Faculty Award, the AI 2000 Chip Technology Most Influential Scholar Award in 2020, six gold medals, and several best paper awards and nominations at prestigious conferences. He has served as the PC Chair, the General Chair, the Track Chair, and a PC Member for several prestigious IEEE/ACM conferences. He has given several keynotes, invited talks, and tutorials, as well as organized many special sessions at premier venues.
