
High Speed, Approximate Arithmetic Based Convolutional Neural Network Accelerator


Mohammed E. Elbtity∗,†, Hyun-Wook Son†, Dong-Yeong Lee†, and HyungWon Kim†
∗Nanoelectronics Integrated Systems Center, Nile University, Giza 12588, Egypt
†Mixed-Signal Integrated Systems Center, Chungbuk National University, Cheongju 28644, South Korea.
Emails: Elbtity@ieee.org, hwkim@cbnu.ac.kr

Abstract— Convolutional Neural Networks (CNNs) for Artificial Intelligence (AI) algorithms have been widely used in many applications, especially for image recognition. However, the growth in CNN-based image recognition applications has raised the challenge of executing millions of Multiply and Accumulate (MAC) operations in state-of-the-art CNNs. Therefore, GPUs, FPGAs, and ASICs are feasible solutions for balancing processing speed and power consumption. In this paper, we propose an efficient hardware architecture for CNNs that provides high speed, low power, and small area, targeting an ASIC implementation of a CNN accelerator. To realize a low-cost inference accelerator, we introduce approximate arithmetic operators for the MAC operators, which comprise the key datapath components of CNNs. The proposed accelerator architecture exploits parallel memory access and N-way high-speed approximate MAC units in the convolutional layer as well as the fully connected layers. Since CNNs are tolerant to small errors due to the nature of convolutional filters, the approximate arithmetic operations incur little or no noticeable loss in the accuracy of the CNN, which we demonstrate in our test results. For the approximate MAC unit, we use the Dynamic Range Unbiased Multiplier (DRUM) and the Approximate Adder with OR operations on LSBs (AOL), which can substantially reduce the chip area and power consumption. The configuration of the approximate MAC units within each layer affects the overall accuracy of the CNN. We implemented various configurations of approximate MACs on an FPGA and evaluated the accuracy using an extended MNIST dataset. Our implementation and evaluation with selected approximate MACs demonstrate that the proposed CNN accelerator reduces the area of the CNN by 15% at the cost of a small accuracy loss of only 0.982% compared to the reference CNN.

Keywords: Approximate Arithmetic, Convolutional Neural Network (CNN), Hardware Accelerators, Approximate MACs.

Fig. 1. Our Convolutional Neural Network Architecture

I. INTRODUCTION

A Convolutional Neural Network (CNN) is a class of deep neural network that is widely used to analyze visual imagery. CNNs for object classification are composed of a number of convolutional layers, pooling layers, and fully connected layers. Convolutional layers consist of convolutional filters that analyze the object features, while fully connected layers flatten the features and categorize them into crisp classes. The convolutional layer extracts features from the image or the feature map. The pooling layer decreases the spatial size of the feature map, while the fully connected layer takes the outputs of the convolutional and pooling layers and classifies them into a label. With the advancement in the accuracy of CNNs, CNNs require massive data movement and complex computation; hence, designing high-speed and energy-efficient CNNs has become a challenge. Therefore, researchers are looking for alternatives to CPUs and GPUs to accelerate CNN algorithms in an efficient way. Google TPUs, FPGAs [1], and ASICs are competitive alternatives to conventional computing systems, as they can overcome the compute-bound issue of CPUs and GPUs. However, these systems still suffer from heavy data traffic to external memory, because accessing external memory costs a significant amount of energy per read/write operation as well as a long access time. To reduce this data movement overhead, two alternative approaches are to 1) modify the memory to allow a wider data bus, and 2) use multiple distributed memories. Parallel access allows several data items to be processed in one clock cycle, increasing the operating speed and improving hardware resource utilization.

We take advantage of the useful property that CNNs are intrinsically error tolerant. Due to this property, using approximate multipliers and adders has little impact on the accuracy, while significantly reducing the implementation cost and the power consumption [2].

In this paper, we propose a CNN accelerator architecture with parallel on-chip memory access and approximate MAC operators aimed at MNIST classification. This paper is organized as follows. In Section II, we present our regular implementation of the CNN architecture and its accuracy. The details of the approximate computing units are described in Section III. In Section IV, the results of the approximate computing based CNN are presented in comparison with the regular CNN architecture, followed by the conclusion in Section V.

II. OUR REGULAR CNN IMPLEMENTATION

Fig. 1 shows the overall architecture of the target CNN model for MNIST classification. Before implementing the CNN model in the proposed hardware architecture, we optimized the CNN to a minimal structure consisting of a convolutional layer, a max pooling layer, and two fully connected layers that produce the classification results. The CNN takes as input an image of 28 x 28 pixels, and the convolutional layer has four 3 x 3 convolution filters to extract the features.
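As a concrete illustration of this structure, the following is a minimal layer-by-layer sketch in Python with NumPy. Only the 28 x 28 input and the four 3 x 3 filters come from the text above; the stride, padding, 2 x 2 pool window, hidden FC width H, and ReLU activation are assumptions made for illustration, not details taken from the paper.

```python
# Shape walk for the minimal CNN of Fig. 1 (illustrative assumptions noted).
import numpy as np

H = 32                                 # hypothetical hidden width of FC1
x = np.random.rand(28, 28)             # input image, 28 x 28 pixels

# Convolutional layer: four 3 x 3 filters, stride 1, no padding (assumed)
filters = np.random.rand(4, 3, 3)
conv_out = np.zeros((4, 26, 26))
for f in range(4):
    for i in range(26):
        for j in range(26):
            conv_out[f, i, j] = np.sum(x[i:i+3, j:j+3] * filters[f])

# Max pooling layer: 2 x 2 window, stride 2 (assumed) -> (4, 13, 13)
pooled = conv_out.reshape(4, 13, 2, 13, 2).max(axis=(2, 4))

# Two fully connected layers ending in 10 class scores
flat = pooled.reshape(-1)                              # 676 features
fc1 = np.maximum(np.random.rand(H, 676) @ flat, 0.0)   # ReLU assumed
scores = np.random.rand(10, H) @ fc1

# Classification result: the output node with the largest score
label = int(np.argmax(scores))
```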
Both the convolutional layer and the fully connected layers require a large number of multipliers and adders. Each layer is equipped with multi-way parallel MAC units, multiplying multiple inputs and weights at the same time, followed by accumulation conducted by a tree of adders. Additionally, the multi-way parallel MAC blocks operate iteratively in the FC layers to reduce the hardware cost. The classification result is the output node with the largest score at the output layer, which is determined by a comparator. We implemented the CNN inference engine with regular integer arithmetic MACs and evaluated its performance using an FPGA. We obtained an accuracy of 95.12% with a chip size of 7.56 mm² for this CNN implementation without approximate MACs.
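To make this datapath concrete, the sketch below models an N-way parallel MAC behaviorally: N products are formed per "cycle," reduced pairwise by a binary adder tree, and accumulated across cycles, which mirrors the iterative reuse in the FC layers. N = 4 is an illustrative choice; the paper does not fix N for every layer.

```python
# Behavioral sketch of an N-way parallel MAC with an adder tree.
def adder_tree(values):
    """Reduce a list of partial products with a binary tree of adders."""
    while len(values) > 1:
        if len(values) % 2:              # pad odd levels with zero
            values.append(0)
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
    return values[0]

def nway_mac(inputs, weights, n=4, acc=0):
    """Form n products per 'cycle' and accumulate their tree-reduced sums.

    FC layers reuse the same n-way block over many cycles to cover all
    of their products, mirroring the iterative operation described above.
    """
    for start in range(0, len(inputs), n):
        products = [a * w for a, w in
                    zip(inputs[start:start + n], weights[start:start + n])]
        acc += adder_tree(products)
    return acc

# Example: one 676-input FC neuron computed 4 products per cycle
out = nway_mac(list(range(676)), [1] * 676, n=4)
```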
III. APPROXIMATE MAC UNITS

Most of the previous research on accelerating CNNs using TPUs, FPGAs, and ASICs has attempted to quantize the data to reduce the required data width of features and weights [3]. Such approaches, however, can incur a substantial loss in accuracy, and thus are often limited to marginal data width reduction. Recently, some research has applied approximate computing units to CNN accelerators, which takes advantage of the intrinsic error tolerance capability of CNNs [4]. In this work, we introduce a CNN architecture based on approximate MAC operators, consisting of the DRUM multiplier and the Approximate Adder with OR operations on LSBs (AOL), instead of the regular MAC units. One of the key advantages of the proposed approximate MAC unit is that we optimize the configuration of the approximate multipliers and approximate adders individually for every MAC operator unit in each convolution filter in each layer, which allows it to provide higher accuracy than conventional CNN hardware accelerators.

Fig. 2. Approximate Multiplier (a) and Approximate Adder (b)
Fig. 2(a) shows the architecture of the DRUM approximate multiplier constructed for our CNN implementation. It can be visualized as two parts. The first part is the steering logic that dynamically selects the most significant bits within the operands. The steering logic includes a multiplexer (SEL), a priority encoder, a leading-one detector (LOD), and a barrel shifter. The second part is a regular multiplier of reduced size that multiplies the most significant bits from both operands. We have observed that the DRUM multiplier provides an average power saving of 58% and an average area saving of 20% [5].
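The Python sketch below models this behavior at the bit level for unsigned operands, assuming an illustrative segment width k = 6 (the paper does not state the width used per layer); the steering logic (LOD, priority encoder, barrel shifter) is emulated with shifts rather than gates.

```python
# Behavioral sketch of a k-bit DRUM-style multiplication [5].
def drum_mul(a, b, k=6):
    def truncate(x):
        """Keep the k MSBs starting at the leading one; return (segment, shift)."""
        if x.bit_length() <= k:
            return x, 0                 # small operands are multiplied exactly
        shift = x.bit_length() - k      # bits discarded below the window
        seg = x >> shift                # k-bit segment aligned at the leading one
        seg |= 1                        # force segment LSB to 1: unbiases the error
        return seg, shift

    sa, sha = truncate(a)
    sb, shb = truncate(b)
    # Reduced-size k x k multiply, then barrel-shift back to full scale
    return (sa * sb) << (sha + shb)

# Example: exact 1234 * 5678 = 7006652; with k = 6 the sketch
# returns 7188480, an error of about 2.6%
print(drum_mul(1234, 5678))
```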
The approximate adder employed is based on the Approximate Adder with OR operations on LSBs (AOL) [6], which is illustrated in Fig. 2(b). Its primary operation consists of OR operations for the lower bits and exact addition for the upper bits: the least significant bits within a pre-defined range are logically ORed, while the rest of the bits of the operands are accurately added by a small regular adder.
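A behavioral sketch of this split follows, with an assumed OR width of l = 4; l is a design parameter of the adder, not a value fixed by the paper.

```python
# Behavioral sketch of the AOL approximate adder [6]: the l lower bits
# are bitwise ORed (no carry), the upper bits use a smaller exact adder.
def aol_add(a, b, l=4):
    mask = (1 << l) - 1
    low = (a & mask) | (b & mask)      # OR replaces addition on the LSBs
    high = (a >> l) + (b >> l)         # small regular adder on the MSBs
    return (high << l) | low           # no carry crosses the boundary

print(aol_add(100, 27))   # -> 127 (exact: 127; OR matches when set LSBs do not overlap)
print(aol_add(100, 29))   # -> 125 (exact: 129; overlapping LSBs lose the carry)
```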
IV. IMPLEMENTATION AND EVALUATION RESULTS

We have implemented the optimized CNN for MNIST of Fig. 1 by applying the proposed approximate MAC units on an FPGA (Zynq UltraScale+). We have evaluated several configurations of approximate multipliers and adders to compare the impact of the ratio of approximate to regular data on the accuracy and the overall hardware cost. We have discovered that the approximate multipliers have a more significant impact than the approximate adders, and observed that the FC layers are more sensitive than the convolutional layer to approximate units. By selecting these various configurations for each filter and each layer, we can further improve the accuracy over a CNN accelerator based on conventional bit quantization. In this work, however, for simplicity we chose limited configurations and obtained a smaller area and lower power consumption than a conventional CNN accelerator at a small accuracy loss of only 0.98%.
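A hypothetical sweep harness of the kind this evaluation suggests is sketched below; the layer names, parameter grids, and evaluate_cnn() are all illustrative placeholders (evaluate_cnn would be a bit-accurate CNN simulation built from the drum_mul() and aol_add() sketches above), not artifacts from the paper.

```python
# Hypothetical per-layer configuration sweep: each (layer, k, l) choice
# pairs a DRUM segment width k with an AOL OR width l.
from itertools import product

def sweep(evaluate_cnn, layers=("conv", "fc1", "fc2"),
          ks=(4, 6, 8), ls=(2, 4, 6)):
    results = {}
    for layer, k, l in product(layers, ks, ls):
        config = {layer: {"drum_k": k, "aol_l": l}}
        results[(layer, k, l)] = evaluate_cnn(config)
    # Since the FC layers proved more sensitive than the convolutional
    # layer, their configurations would be chosen more conservatively.
    return results
```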
V. CONCLUSION

With different configurations of approximate MAC units in a CNN, we can get even higher accuracy than the regular CNN architecture. Therefore, it is the architect's job to determine the trade-off between accuracy and area/power savings. Our approximate CNN accelerator achieves an accuracy of 94.141%, an accuracy loss of only 0.982%, while occupying 15% less area than the regular CNN accelerator. In the future, we intend to generalize our proposed architecture to various CNN models.

REFERENCES

[1] K. Abdelouahab et al., "Accelerating CNN inference on FPGAs: A Survey," ArXiv, abs/1806.01683, 2018.
[2] C. Guo et al., "A Reconfigurable Approximate Multiplier for Quantized CNN Applications," 2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC), 2020, pp. 235-240.
[3] Z. Zhu et al., "A Configurable Multi-Precision CNN Computing Framework Based on Single Bit RRAM," 2019 56th ACM/IEEE Design Automation Conference (DAC), 2019, pp. 1-6.
[4] Z. Wang et al., "Approximate Multiply-Accumulate Array for Convolutional Neural Networks on FPGA," 2019 14th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), 2019, pp. 35-42.
[5] S. Hashemi et al., "DRUM: A Dynamic Range Unbiased Multiplier for approximate applications," 2015 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2015, pp. 418-425.
[6] A. Dalloo, A. Najafi, and A. García-Ortiz, "Systematic Design of an Approximate Adder: The Optimized Lower Part Constant-OR Adder," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 26, 2018, pp. 1595-1599.
