Power Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks

Gaurav Pandey, Pritesh Kumar Yadav and Prasanna Kumar Misra

Abstract—In this work, the power consumption of a hardware accelerator is studied for deep convolutional neural network (CNN) applications. CNNs are widely used in modern Artificial Intelligence (AI) applications, especially for computer vision tasks in the self-driving car industry. The large image data sets required to train the network challenge design engineers to obtain improved throughput with minimal energy usage. The requirement of large data sizes increases the off-chip to on-chip data accesses, resulting in increased power consumption. In this accelerator, the off-chip DRAM accesses have been minimized by data reuse and local accumulation of partial results using the Row Stationary (RS) Dataflow. To minimize energy, the Processing Elements (PEs) perform computation on 8-bit data. Clock gating was incorporated to minimize the power consumption of the accelerator by deactivating the clock to the unused PEs. From the study, it has been observed that this architecture has the potential to save power to the extent of 63%. The designed accelerator can be used as a power efficient architecture for modern deep CNNs.

Index Terms—ASIC; CNN; Accelerator; Low Power; Deep Convolutional Neural Network

I. INTRODUCTION

Deep neural networks (DNNs) are used in various artificial intelligence (AI) applications such as Self-Driving Cars [1-2], Speech Recognition [3-4], Natural Language Processing [5], Image Classification [6], Object Detection [7], Cancer Detection [8-9], Robotics [10-11] and many more. CNNs have grown deeper and deeper with the research progress, owing to the accompanying improvement in accuracy [12-13]. The research community is putting effort towards artificial-intelligence-enabled processor chips that have the capability to learn on their own and can reach accuracies far higher than what a normal human being can achieve. The DNN architectures deliver state-of-the-art accuracy on many AI tasks at the cost of high computational complexity. The tradeoff between application accuracy, throughput, hardware cost [14] (in terms of chip area) and power consumption motivated researchers to work on energy efficient processing of DNNs that can provide better throughput and application accuracy at minimal hardware cost [15-19]. The usage of arrays of processing elements increases the hardware cost, which limits wide deployment of DNNs in AI systems.

Traditionally, hardware performance was improved by exploiting various parallelisms. It started with Instruction Level Parallelism (ILP). Then Data Level Parallelism (DLP) was used, which includes vectorization and Single Instruction, Multiple Data (SIMD) approaches. Task Level Parallelism (TLP) was later employed using Multiple Instruction, Multiple Data (MIMD) architectures. The reason for the development of new parallelisms was common across all the generations: the power handling limitations of the chip [14].

Therefore, the requirement is to deliver hardware solutions that maximize the performance of Deep Neural Networks with an equal emphasis on reducing the power consumption. The clock frequency restrictions on modern hardware are not because there is no scope for such increments, but because of the power wall constraint, which prevents further performance improvements from simply increasing the clock frequencies of conventional architectures. This trend of focusing on energy reduction without sacrificing performance gave birth to Application Specific Architectures, which can be either an ASIC (Application Specific Integrated Circuit) or an ASIP (Application Specific Instruction Processor). The current work focuses on the design of an ASIC that provides a reduced power solution for modern-day Deep Convolutional Neural Networks (CNNs) while maximizing performance. The designed accelerator has been optimized to minimize power in multiple ways. First, the compute units are only 8-bit, just like those used in the Google Tensor Processing Unit (TPU). Second, the off-chip memory accesses were reduced by increasing the data reuse and local accumulation of partial results using the Row Stationary (RS) dataflow for performing convolution, which has been shown to deliver the best power results. Finally, clock gating was employed to turn off the clock to the unused Processing Elements (PEs) and save power.

In the rest of this paper, an overview of CNNs is given in Section II. The conventional hardware accelerator for deep CNNs is discussed in Section III. The energy efficient hardware accelerator for deep CNNs is presented in Section IV. The dataflow used in the accelerator is discussed in Section V. In Section VI the results of the accelerator are reported. Finally, the conclusion is given in Section VII.

II. BASICS OF CNN

In a CNN, multiple convolutional layers are stacked for feature extraction or representation learning from input images, which is further helpful in classification. In deep CNNs, the large number of layers helps achieve astonishingly good accuracy [20-22] by transforming the input images into intermediate forms called Feature Maps, or representations of features.

In each layer, the filters can be 4-Dimensional: multiple 3-Dimensional filters are stacked together to act on a 3-Dimensional Input Feature Map, as the shape sketch below illustrates.
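As an illustration of these shapes, the following Python sketch shows how M stacked 3-D filters act on a 3-D input feature map to produce an M-channel output; the sizes are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Shapes in one CONV layer: M stacked 3-D filters of size C x F x F act
# on a C x H x W input feature map and produce an M-channel output.
M, C, F, H = 64, 3, 4, 8           # illustrative sizes only
filters = np.zeros((M, C, F, F))   # 4-D filter tensor: M stacked 3-D filters
ifmap = np.zeros((C, H, H))        # 3-D input feature map
O = H - F + 1                      # output size for stride 1, no padding
ofmap = np.zeros((M, O, O))        # output depth equals the filter count M
```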


Fig. 1: Architecture of modern Deep CNNs [18]

To perform convolution, each filter is slid over the input feature map by the given stride value, and a pixel of the output feature map is generated on each slide. There are also 1-D biases, which are added to the filtered feature maps. Multiple stacked filters are used to create an output feature map with a depth greater than one.

Fig. 2: 2-D convolution procedure

Figure 2 explains how a 2-D convolution [16] is performed on an 8x8 input feature map and a 4x4 filter with a stride of 1. In case the input feature map cannot be traversed in an integral number of steps for the given stride, the input feature map can be zero-padded to obtain an integral number of traversal steps. The steps are then taken forward by sliding the filter over the next rows, with step lengths of the given stride.

For an input feature map of size IxI, convolution with a filter of size FxF and a stride of S produces an output feature map of size OxO, where O is given by the following formula:

O = (I - F)/S + 1

After the convolution, an activation function such as the Rectified Linear Unit (ReLU) is used to introduce non-linearity in the hidden layers. Finally, Fully Connected layers with softmax activation are used to obtain the class values for the various classes. Looking at the computational load of the large input datasets, a mechanism needs to be incorporated that can improve the computational throughput.
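As a concrete check of this procedure and of the formula above, the following Python sketch implements the direct sliding-window convolution (square, unpadded inputs assumed; biases omitted):

```python
import numpy as np

def conv2d(ifmap, filt, stride=1):
    """Direct 2-D convolution: slide the filter over the input feature
    map and generate one output pixel per step (biases omitted)."""
    I, F = ifmap.shape[0], filt.shape[0]
    O = (I - F) // stride + 1            # O = (I - F)/S + 1
    ofmap = np.zeros((O, O))
    for r in range(O):
        for c in range(O):
            window = ifmap[r*stride:r*stride + F, c*stride:c*stride + F]
            ofmap[r, c] = np.sum(window * filt)   # multiply-and-accumulate
    return ofmap

# The 8x8 input and 4x4 filter of Fig. 2 with stride 1 give a 5x5 output.
out = conv2d(np.random.rand(8, 8), np.random.rand(4, 4))
assert out.shape == (5, 5)
```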

III. CONVENTIONAL HARDWARE ACCELERATOR FOR CNN

Conventional accelerators for deep CNNs focused only on improving the computational throughput of the system without much emphasis on reducing the off-chip memory accesses [23-28]. The most important component of the CONV and FC layers is the Multiply-and-Accumulate (MAC) operation, which can be easily optimized using a parallel structure. CPUs and GPUs employ temporal parallelism in the form of SIMD or multi-threading (SIMT). In such architectures, the ALUs are controlled by a single common controller and, although the ALUs can fetch data from the memory hierarchy, they cannot communicate directly with each other. To utilize the full benefits of such hardware, the computation kernel of the CNN is modified to perform the MAC operations in parallel. Generally, such modifications try to convert the calculations into matrix multiplication operations, which often results in reduced storage efficiency or very complex memory access patterns.

Fast Fourier Transform (FFT) [29] is one such transformation. Convolution using FFT is done by first taking FFTs of the input feature map and the filter; these two are then multiplied in the frequency domain, and the result is converted back to the spatial domain using the inverse FFT to obtain the output feature map. While FFT reduces the computation, it creates a huge demand on the storage capacity and bandwidth of the memory system. Also, the benefits of FFT diminish with smaller filter sizes, and it is difficult to use sparse matrices with FFT, which further reduces the improvement in complexity. There are other algorithms which transform the kernel to reduce the computation with little focus on memory access patterns [23, 30-32].
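The following NumPy sketch mirrors that description: both operands are transformed, multiplied element-wise in the frequency domain, and the product is inverted back to the spatial domain. The kernel is flipped first so that the result matches the sliding-window (correlation-style) convolution used in CNN layers; this is an illustration of the transformation, not the implementation in [29].

```python
import numpy as np

def conv2d_fft(ifmap, filt):
    """FFT-based 2-D convolution, sketched for square inputs."""
    I, F = ifmap.shape[0], filt.shape[0]
    filt = filt[::-1, ::-1]            # flip so FFT conv matches CNN conv
    N = I + F - 1                      # full linear-convolution size
    spec = np.fft.fft2(ifmap, (N, N)) * np.fft.fft2(filt, (N, N))
    full = np.real(np.fft.ifft2(spec))  # back to the spatial domain
    return full[F-1:I, F-1:I]           # keep only the 'valid' region

# Agrees with the direct sliding-window method up to rounding error.
```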
IV. ENERGY EFFICIENT HARDWARE ACCELERATOR FOR CNN

The major bottleneck in processing DNNs is memory access. In the worst case, all the data has to be brought from the off-chip DRAM, which hurts both energy efficiency and throughput. Furthermore, DRAM accesses take several orders of magnitude more energy than on-chip memory accesses [32].

Accelerators take advantage of the fact that there is no randomness in Deep Neural Network (DNN) processing; this allows the design of a fixed dataflow which can adapt to various DNN shapes and sizes while delivering maximum energy efficiency. Such dataflows are designed to minimize data accesses from the more costly (energy-wise) lower levels of the memory hierarchy and to maximize the utilization of data already brought into the much more energy efficient upper levels of the memory hierarchy.

The architecture of the accelerator is shown in Fig. 3. The number of Processing Elements (PEs) in each row and column can be changed depending on the application for which the accelerator would be used [15, 17, 33]. When used for training in datacentres, where energy is not a primary concern, the number of PEs can be quite large, which accelerates the training process at the expense of increased power consumption. On the contrary, when the accelerator is to be used for inference, which is quite common on embedded devices where power is of utmost importance, the number of PEs can be quite low. The inference process would still be accelerated, with major power benefits.

Fig. 3: Architecture of the hardware accelerator for simulation study [17]

The architecture of a single Processing Element (PE) is shown in Fig. 4. Each processing element has a scratchpad, which is a set of separate Register Files for each of the inputs and outputs: Input Feature Map, Filter and Output Feature Map. The size of each Register File can be chosen variably to fit each of the feature maps and filter weights, increasing data re-use. Each PE also has a Multiply-Accumulate (MAC) Unit to perform the required computation, supervised by a control unit that is separate for each PE. This allows different PEs to work independently on different inputs at a time, irrespective of each other, which further enhances parallelism and data re-use [15, 17, 25, 28].

Fig. 4: Architecture of Processing Element used in accelerator
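As a behavioural illustration of this organization, the sketch below models one PE in Python with three scratchpad register files and a MAC loop; the field names and sizes are our own assumptions, not those of the synthesized design.

```python
from dataclasses import dataclass, field

@dataclass
class ProcessingElement:
    """Behavioural model of a single PE: separate scratchpads for the
    filter row, the input feature map row and the partial sums."""
    filt_spad: list = field(default_factory=list)   # filter register file
    ifmap_spad: list = field(default_factory=list)  # ifmap register file
    psum_spad: list = field(default_factory=list)   # output register file

    def run(self):
        """MAC unit: 1-D convolution of the locally held rows, with the
        partial sums accumulated in the PE's own scratchpad."""
        F, W = len(self.filt_spad), len(self.ifmap_spad)
        self.psum_spad = [
            sum(self.filt_spad[k] * self.ifmap_spad[j + k] for k in range(F))
            for j in range(W - F + 1)
        ]
        return self.psum_spad
```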

The accelerator performs the convolution using the Row Stationary (RS) Dataflow, which has been shown to yield maximum power savings due to the least data movement requirements [17]. To perform the convolution, the rows of the filter are loaded into all the columns of the corresponding PE row, and the rows of the input feature maps are loaded diagonally into the PE array. Each PE then performs the required multiply-accumulate on the loaded data. To perform the final addition, the data of the related PEs is transferred to the top-most PE, where one additional summation step produces the row-wise result for each output feature map. Since each PE works independently of the other PEs, different PEs can work on different inputs in order to maximize efficiency and data re-use. The simulation was done by taking the optimum size of input feature map and filter for the given accelerator dimensions and ensuring that the computation required only a single pass for the complete convolution. Since there was no routing network, the accelerator had to rely on the assumption of the availability of an efficient mapper and the presence of data in the input scratchpads.

In order to obtain the power savings of the accelerator with clock gating [34], the input feature map size and filter size were kept below the optimum required, so that some of the PEs remain in the inactive state. This reflects the fact that, as the input traverses the deep CNN, the dimensions of the input to the subsequent layers change and so do the convolution sizes; this causes some of the PEs to go into the inactive state by getting their clocks turned off. The functional simulation results are shown in Fig. 5.

Fig. 5: Functional Simulation Results

V. ROW STATIONARY DATAFLOW

The steps for performing a 2-D convolution with the Row Stationary (RS) Dataflow can be easily understood from figures 6, 7 and 8. Figure 6 explains how a 2-D convolution can be broken down into multiple 1-D convolutions whose partial sums are then combined to obtain the final result.

Fig. 6: Breaking down a 2-D convolution into multiple 1-D convolutions [17]

Fig. 7: Performing a 1-D convolution on a single PE [17]

Fig. 8: Mapping the convolution to the accelerator [34]

Figure 7 shows how a 1-D convolution can be performed using a single Processing Element (PE) of the accelerator. Finally, Figure 8 explains how the convolutions are performed using the RS Dataflow in our accelerator. The filter coefficients are loaded row-wise and the Input Feature Map rows are loaded diagonally. The partial sums from all the Processing Elements (PEs) are accumulated row-wise, and then the final rows of the Output Feature Map are obtained. As the accelerator size largely defines the filter and input dimensions that can be processed in a single pass, it can be easily seen why the accelerator dimensions are important for efficient processing [17, 25, 28].
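In code form, the mapping of Figs. 6-8 can be sketched as follows; stride 1 is assumed and the PE array is taken to be large enough for a single pass. This is an algorithmic illustration of the dataflow, not the synthesized hardware.

```python
def conv1d(row, frow):
    """1-D convolution of one ifmap row with one filter row: what a
    single PE computes locally (Fig. 7)."""
    F = len(frow)
    return [sum(frow[k] * row[j + k] for k in range(F))
            for j in range(len(row) - F + 1)]

def conv2d_row_stationary(ifmap, filt):
    """RS mapping sketch: filter row r is held by PE row r, ifmap rows
    run along the diagonals, and the partial sums of one PE column
    add up to one row of the output feature map."""
    F, I = len(filt), len(ifmap)
    O = I - F + 1                       # number of output rows (stride 1)
    ofmap = []
    for c in range(O):                  # PE column c -> output row c
        psums = [conv1d(ifmap[r + c], filt[r])   # diagonal ifmap loading
                 for r in range(F)]
        ofmap.append([sum(col) for col in zip(*psums)])  # column-wise add
    return ofmap
```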

VI. RESULTS AND DISCUSSION

Fig. 9: Flowchart showing the state machine for each PE

The Processing Elements (PEs) were designed as state machines based on the algorithm in Fig. 9. The loading of data into the filter and input feature map scratchpads of the PEs was done based on the Row Stationary (RS) Dataflow. The output of each PE was also stored in a separate scratchpad. For multiple PEs, the data to be accumulated was passed to a spare adder for the final addition stage.

The Verilog code of the accelerator was synthesized using the standard cells of 180 nm SCL CMOS technology in Synopsys Design Compiler. The power consumption and the chip area were evaluated before and after applying clock gating. In the simulation study, the global buffer size was kept constant for accelerators of all sizes. In our design (with reference to Fig. 3), the global buffer size was 300 Bytes for all accelerators used in the simulation study.

The power consumption of the accelerator is reported in Fig. 10. From the results, it has been observed that the power consumption of the design can be significantly reduced by incorporating clock gating [35] as the size of the accelerator increases. In [36], power consumption is reduced by 23% in a synchronous RISC CPU after introducing the clock-gating technique. Also, the combination of the asynchronous technique and the clock-gating technique can further minimize the overall power while maintaining other performance metrics [36-37]. Fig. 11 shows the area occupied by the design. This improvement is observed at the cost of area.

Fig. 10: Impact on power comparison with clock gating
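To make the layer dependence concrete, the following back-of-the-envelope sketch (our own illustration, not the paper's gating logic) estimates how many PEs a given CONV layer keeps active under the RS mapping, the remainder being candidates for clock gating:

```python
def pe_activity(rows, cols, I, F, stride=1):
    """Rough PE occupancy under the RS mapping: filter rows occupy PE
    rows and output rows occupy PE columns. The idle remainder is the
    candidate pool for clock gating. Illustrative only; the synthesized
    gating logic is not modelled here."""
    O = (I - F) // stride + 1
    active = min(F, rows) * min(O, cols)
    return active, rows * cols - active

# A deep layer with a 10x10 ifmap and a 3x3 filter on a 4x12 PE array
# keeps 3 x 8 = 24 PEs busy, leaving 24 PEs to be clock gated.
print(pe_activity(4, 12, I=10, F=3))   # -> (24, 24)
```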
Fig. 11: Impact on Area with clock gating

VII. CONCLUSION

The accelerator's dimensions of the 2D-array arrangement of Processing Elements (PEs) have been studied. The accelerator size was varied according to the CNN architecture employed and the size of the filters and feature maps. The dimension of the internal 2D-array has been chosen depending on whether it will be used for training or for inference, since inference can be done on smaller arrays with improved energy efficiency. The synthesis results suggest that by using the clock gating method at 80 MHz frequency, the power consumption can be reduced by approximately 63% at the cost of a 19% chip area increase. This work can be extended in future to further enhance the energy efficiency and performance by optimizing the number of active processing elements, which depends on the application. CNN data contains many zeros, which allows compression. Data adaptive processing can be applied to save memory bandwidth with minimal processing power.

REFERENCES

[1] Z. Chen and X. Huang, "End-to-end learning for lane keeping of self-driving cars," in Proc. IEEE Intelligent Vehicles Symposium (IV), 2017, pp. 1856-1860.
[2] M. Daily, S. Medasani, R. Behringer, and M. Trivedi, "Self-driving cars," Computer, vol. 50, no. 12, pp. 18-23, 2017.
[3] H. M. Fayek, M. Lech, and L. Cavedon, "Evaluating deep learning architectures for speech emotion recognition," Neural Networks, vol. 92, pp. 60-68, 2017.
[4] W. Lim, D. Jang, and T. Lee, "Speech emotion recognition using convolutional and recurrent neural networks," in Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2016, pp. 1-4.
[5] D. Gupta, A. Ekbal, and P. Bhattacharyya, "A deep neural network based approach for entity extraction in code-mixed Indian social media text," in Proc. 11th Int. Conf. Language Resources and Evaluation (LREC), 2018.
[6] G. Cheng, Z. Li, J. Han, X. Yao, and L. Guo, "Exploring hierarchical convolutional features for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 56, no. 11, pp. 6712-6722, 2018.
[7] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li, "Single-shot refinement neural network for object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 4203-4212.
[8] T. Hirasawa et al., "Application of artificial intelligence using a convolutional neural network for detecting gastric cancer in endoscopic images," Gastric Cancer, vol. 21, no. 4, pp. 653-660, 2018.
[9] S. Azizi et al., "Deep recurrent neural networks for prostate cancer detection: Analysis of temporal enhanced ultrasound," IEEE Trans. Med. Imag., vol. 37, no. 12, pp. 2695-2703, 2018.
[10] B. A. Erol, A. Majumdar, J. Lwowski, P. Benavidez, P. Rad, and M. Jamshidi, "Improved deep neural network object tracking system for applications in home robotics," in Computational Intelligence for Pattern Recognition, Springer, Cham, 2018, pp. 369-395.
[11] N. Sünderhauf et al., "The limits and potentials of deep learning for robotics," Int. J. Robot. Res., vol. 37, no. 4-5, pp. 405-420, 2018.
[12] A. T.-Y. Chen, M. Biglari-Abhari, K. I.-K. Wang, A. Bouzerdoum, and F. H. C. Tivive, "Convolutional neural network acceleration with hardware/software co-design," Applied Intelligence, vol. 48, no. 5, pp. 1288-1301, 2018.
[13] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1-9.
[14] D. A. Patterson and J. L. Hennessy, Computer Architecture: A Quantitative Approach. Elsevier, 2011.
[15] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, 2015, pp. 161-170.
[16] F. Conti and L. Benini, "A ultra-low-energy convolution engine for fast brain-inspired vision in multicore clusters," in Proc. Design, Autom. Test Eur. Conf. Exhibit., 2015, pp. 683-688.
[17] Y.-H. Chen, J. Emer, and V. Sze, "Using dataflow to optimize energy efficiency of deep neural network accelerators," IEEE Micro, vol. 37, no. 3, p. 21, May/Jun. 2017.
[18] V. Sze, Y.-H. Chen, J. Emer, A. Suleiman, and Z. Zhang, "Hardware for machine learning: Challenges and opportunities," in Proc. IEEE CICC, 2017, pp. 1-5.
[19] "Hardware Architectures for Deep Neural Networks," MIT and Nvidia, Oct. 16, 2016.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., vol. 25, 2012, pp. 1097-1105.
[21] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, pp. 1-14, Sep. 2014.
[22] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016.
[23] M. Sankaradas et al., "A massively parallel coprocessor for convolutional neural networks," in Proc. 20th IEEE Int. Conf. Appl.-Specific Syst., Archit. Process., Jul. 2009, pp. 53-60.
[24] V. Sriram, D. Cox, K. H. Tsoi, and W. Luk, "Towards an embedded biologically-inspired machine vision processor," in Proc. Int. Conf. Field-Program. Technol. (FPT), Dec. 2010, pp. 273-278.
[25] M. Peemen, A. A. A. Setio, B. Mesman, and H. Corporaal, "Memory-centric accelerator design for convolutional neural networks," in Proc. IEEE 31st Int. Conf. Comput. Design (ICCD), Oct. 2013, pp. 13-19.
[26] T. Chen et al., "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," in Proc. 19th Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2014, pp. 269-284.
[27] Y. Chen et al., "DaDianNao: A machine-learning supercomputer," in Proc. 47th Annu. IEEE/ACM Int. Symp. Microarchitecture, 2014, pp. 609-622.
[28] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," CoRR, vol. abs/1502.02551, pp. 1-10, Feb. 2015.
[29] A. Lavin and S. Gray, "Fast algorithms for convolutional neural networks," in Proc. CVPR, 2016, pp. 4013-4021.
[30] J. Cong and B. Xiao, "Minimizing computation in convolutional neural networks," in Proc. ICANN, 2014, pp. 281-290.

[31] D. H. Bailey, K. Lee, and H. D. Simon, "Using Strassen's algorithm to accelerate the solution of linear systems," J. Supercomput., vol. 4, no. 4, pp. 357-371, Jan. 1991.
[32] M. Horowitz, "Computing's energy problem (and what we can do about it)," in IEEE ISSCC Dig. Tech. Papers, Feb. 2014, pp. 10-14.
[33] A. Pedram, "Dark memory and accelerator-rich system optimization in the dark silicon era," IEEE Design and Test, 2016.
[34] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, May 2015.
[35] M. Shaker and M. Bayoumi, "Novel clock gating techniques for low power flip-flops and its applications," in Proc. IEEE Int. Midwest Symp. Circuits and Systems, 2013, pp. 420-424.
[36] P. K. Yadav and P. K. Misra, "Power aware study of 32-bit 5-stage pipeline RISC CPU using 180nm CMOS technology," in Proc. 14th IEEE India Council Int. Conf. (INDICON), Dec. 2017, pp. 1-4.
[37] P. Srivastva, P. K. Yadav, and P. K. Misra, "Design of 32 bit asynchronous RISC CPU using micropipeline," in Proc. Conf. Information and Communication Technology (CICT), Oct. 2018, pp. 1-5.
