Energy-Efficient Design of Processing Element For Convolutional Neural Network
Abstract—The convolutional neural network (CNN) is the most prominent algorithm thanks to its wide usage and good performance. Despite the fact that the processing element (PE) plays an important role in CNN processing, there has been no study focusing on PE design optimized for state-of-the-art CNN algorithms. In this brief, we propose an optimal PE implementation, including a data representation scheme, circuit block configurations, and control signals, for energy-efficient CNN. To validate the proposed design, we compared it with several previous methods and fabricated a silicon chip. Software simulation results demonstrate that we can reduce data bit lengths by 54% with negligible accuracy loss. Our PE optimizations save up to 47% of computing power, and an accelerator exploiting our method shows superior results in terms of power, area, and external DRAM access.

Index Terms—Neuromorphic computing, convolutional neural network, processing element, neural network processor.

I. INTRODUCTION

The convolutional neural network (CNN) is the most advanced network that is accurate enough to be practically utilized in user-level applications, such as visual analysis, image/speech recognition, and natural language processing. Its correctness and versatility have attracted the interest of researchers who have been working on developing better CNN models for wider applications. At the same time, with the wave of IoT and smart mobile devices, there is a strong need for energy-efficient CNN hardware in resource-hungry environments where powerful GPUs and general-purpose processors can hardly be used. Hence, designing a dedicated accelerator for CNN is particularly crucial.

A fundamental operation of CNN is multiplication and accumulation (MAC). It is admittedly simple, but the iterations of the MAC process are massive. Especially in image recognition, over ten billion MAC computations are necessary to perform a highly accurate CNN network [1]. This induces huge energy consumption in logical operations. Meanwhile, data transfer in the on/off-chip memory system is the other main source of energy consumption. This problem is critical in a dedicated accelerator in that current CMOS process technology cannot provide sufficient on-chip SRAM to store CNN data, which is megabyte-scale. Therefore, a large amount of data communication between the chip and external DRAM is compulsory, which causes enormous energy dissipation. In sum, the energy crisis caused by both logical computation and memory access must be overcome to efficiently employ CNN in a consumer device with an embedded accelerator.

Several previous studies have suggested dedicated processors for CNN. Du et al. [2] proposed a 2-D systolic-array CNN processor with an inter-processing-element (PE) data propagation scheme, which successfully reduced the memory bandwidth. In [3], tile-based computing was proposed to smartly manage data movements inside the processor. A neuromorphic system effectively utilizing a bit-truncation technique was introduced in [4]. Further to these, Pullini et al. [5] presented Mia Wallace, an MPSoC designed to flexibly execute various CNN models at low power.

While effective CNN architectures have been actively studied, research on the PEs themselves, the core parts of the system, has not been conducted. In this brief, we argue that the key solution to the energy problem requires an efficient data representation scheme with appropriate optimizations of the PE. Since all calculations of the CNN algorithm occur inside the PE, careful consideration of the PE design is essential to modify data representation methods while maintaining acceptable accuracy. Our contributions to CNN PE design are summarized in the following three points.

1) We analyze the inputs/outputs of PE blocks during CNN processing, and propose an optimized scheme to perform state-of-the-art CNN networks. By exploiting the observation that the dynamic ranges of the two PE inputs are distinct, we introduce a heterogeneous representation (HR). To express the features efficiently, Significant bits securing encoding (SSE) is presented.
2) We verify that CNN arithmetic operations based on signed magnitude are better in terms of energy consumption than the conventional two's complement method, and we utilize the signed-magnitude system in our PE design.

Manuscript received December 23, 2016; revised March 16, 2017; accepted April 1, 2017. Date of publication April 6, 2017; date of current version November 1, 2017. This work was supported by the National Research Foundation of Korea through the Korea Government (MSIP) under Grants NRF-2014R1A2A1A05004316 and 2010-0028680. This brief was recommended by Associate Editor C.-T. Cheng. (Corresponding author: Lee-Sup Kim.)
The authors are with the School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon 34141, South Korea (e-mail: yjchoi@vlsi2.kaist.ac.kr; baedm12@vlsi2.kaist.ac.kr; jhsim@mvlsi.kaist.ac.kr; skchoi@vlsi2.kaist.ac.kr; mhkim@mvlsi.kaist.ac.kr; leesup@kaist.ac.kr).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSII.2017.2691771
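Contribution 2) rests on a switching-activity argument: for the small-magnitude, sign-alternating values common in CNN data, consecutive two's-complement codes differ in far more bit positions than signed-magnitude codes, and each extra register bit flip costs dynamic power. The following Python experiment is our own illustrative sketch of that effect, not the authors' measurement; the 16-bit width and the sample value stream are arbitrary assumptions.

```python
def twos_complement(v, bits=16):
    """Encode integer v as a two's-complement bit pattern."""
    return v & ((1 << bits) - 1)

def signed_magnitude(v, bits=16):
    """Encode integer v as a sign bit plus magnitude."""
    sign = (1 << (bits - 1)) if v < 0 else 0
    return sign | (abs(v) & ((1 << (bits - 1)) - 1))

def bit_flips(values, encode, bits=16):
    """Total Hamming distance between consecutive encoded values,
    i.e., how many register bits toggle while processing the stream."""
    flips, prev = 0, encode(values[0], bits)
    for v in values[1:]:
        cur = encode(v, bits)
        flips += bin(prev ^ cur).count("1")
        prev = cur
    return flips

# Small values alternating in sign: two's complement toggles nearly
# every high-order bit on each sign change, signed magnitude does not.
stream = [3, -2, 5, -1, 4, -3]
print(bit_flips(stream, twos_complement))   # 73
print(bit_flips(stream, signed_magnitude))  # 15
```

The gap grows with word length, since every sign change in two's complement rewrites the entire run of sign-extension bits.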
1549-7747 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
CHOI et al.: ENERGY-EFFICIENT DESIGN OF PE FOR CNN 1333
Fig. 1. Overall architecture of our processor and the proposed PE with the HR scheme.

Fig. 2. Brief outline and example cases of SSE. An SSE number consists of three parts: sign bit, most significant position (MSP), and significands.

3) An optimized PE design suitable for performing all layers in the CNN is introduced. Utilizing our HR method, we design a PE to execute state-of-the-art CNN algorithms in an energy-efficient manner.

From the evaluation results, we show that our work can effectively reduce the external memory requirement as well as the operating power of a CNN accelerator.

The remainder of this brief is structured as follows. Section II describes the intuitions behind our data representation and the details of the proposed schemes. In Section III, a concrete PE design and its data path controls are explained. Evaluation settings and results are presented in Section IV. Then, we draw our conclusions in Section V.

II. OPTIMIZATIONS ON THE DATA REPRESENTATION

A CNN accelerator should be designed with consideration not only of energy efficiency but also of network flexibility for diverse applications. Thus, the analysis and implementation must cover the worst possible scenario so as to accommodate every kind of network. A PE generally takes two inputs, features and kernels, from their designated data paths. Compared to the kernels, which are typically bounded within [−1, +1] during network training, the dynamic ranges of the features are extremely broad. Previous approaches have utilized either floating point (FLP) or variants of fixed point (FXP) to express all kinds of data. From the fact that the two input paths show totally dissimilar patterns of values, we draw the intuition to use different data types suited to each path.

A. Heterogeneous Representation (HR)

Due to the high complexity and energy consumption of an FLP MAC block, FXP MAC is more energy-efficient. On the other hand, an FXP-based system lacks the ability to express a wide range of values, which is essential in CNN convolution. Hence, to exploit the advantages of both FXP's efficiency and FLP's flexibility, we assign our novel data encoding to the features while using a dual-mode FXP system for the kernels, constants, and accumulation register. Fig. 1 illustrates the architecture of our processor and the outline of HR.

1) Significant Bits Securing Encoding (SSE): As computing results are accumulated in the register of each PE until an assigned convolution window is finished, the data is compressed only when the window process is over. Because of the error resiliency inherent in CNN algorithms, a reasonable approximation scarcely affects the overall accuracy. Since we exploit FXP MAC for the internal data of the PE, an appropriate encoding scheme for the FXP values is necessary. From the simple but powerful principle that bits near the MSB are of importance, we developed Significant bits securing encoding (SSE) to effectively describe the features of CNN algorithms (Fig. 2). The difference between SSE and FLP lies in the medium they use to compress the original data into significands. FLP controls "the position of the decimal point" using an exponent, whereas SSE utilizes "the relative position of the first one" with respect to the original MSB of the full fixed-point word. By limiting the representable range to one appropriate for CNN algorithms, SSE can reduce the hardware implementation cost compared to the FLP system. The details of SSE generation are described in Algorithm 1.

Algorithm 1 (SSE generation):
2: Detect the first 1, the most significant position (MSP) $f_m$, by checking bits from $f_{l-1}$ down to $f_{\max(g,\,l-2^n)}$. ($n$ is the number of bits indicating the MSP; $g$ is the number of bits allocated for the significand.) If there is no 1 within the searching range, the MSP becomes 0.
3: Save the position of the first 1, the MSP, in $n$ binary bits: $p_{n-1}p_{n-2}\cdots p_0 = m - \max(g,\, l-2^n) + 1$.
4: Save the $g$ bits next to the MSP in the significand: $q_{g-1}q_{g-2}\cdots q_0 = f_{m-1}f_{m-2}\cdots f_{m-g}$.
5: Final encoding format: $\{s,\; p_{n-1}p_{n-2}\cdots p_0,\; q_{g-1}q_{g-2}\cdots q_0\}$, i.e., $(1+n+g)$ bits.

2) Dual-Mode Fixed Point: Besides the feature data, there are kernels and constants in convolution, and we use the dual-mode fixed point to express them. As mentioned before, their dynamic range is strictly narrow, so the powerful FLP style would be excessive. Furthermore, assigning the FXP system to one of the data paths enables an easy implementation of the
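The SSE generation steps of Algorithm 1 can be sketched in Python as follows. The parameter defaults here ($l = 24$, $n = 5$, $g = 5$) are our own illustrative assumptions, chosen so that the $(1+n+g)$-bit word is 11 bits, i.e., a 54% reduction from a 24-bit FXP feature, consistent with the figures reported in this brief; the actual hardware parameters may differ. The sign bit $s$ is assumed to be handled separately.

```python
def sse_encode(value, l=24, n=5, g=5):
    """SSE-encode a non-negative l-bit fixed-point magnitude as
    (MSP field, significand), following steps 2-4 of Algorithm 1."""
    low = max(g, l - 2 ** n)               # lowest bit position searched
    m = 0                                  # most significant position (MSP)
    for pos in range(l - 1, low - 1, -1):  # scan from the MSB downwards
        if (value >> pos) & 1:
            m = pos
            break
    if m == 0:                             # no 1 in range: MSP field is 0
        return 0, value & ((1 << g) - 1)
    p = m - low + 1                        # MSP field, stored in n bits
    q = (value >> (m - g)) & ((1 << g) - 1)  # g bits just below the first 1
    return p, q

def sse_decode(p, q, l=24, n=5, g=5):
    """Approximate inverse: bits below the kept significand are lost."""
    low = max(g, l - 2 ** n)
    if p == 0:
        return q
    m = p + low - 1
    return (1 << m) | (q << (m - g))
```

For example, `sse_encode(0x5A3C01)` keeps the leading 1 at bit 22 plus the five bits below it, and `sse_decode` then reconstructs `0x5A0000`: the low-order bits are truncated, which is exactly the approximation the error resiliency of CNNs tolerates.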
1334 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 64, NO. 11, NOVEMBER 2017
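Because kernels stay within [−1, +1], a plain signed fixed-point format covers them without any exponent machinery. The quantizer below is a hypothetical illustration of that observation; the 8-bit Q1.7-style layout is our assumption, not the brief's dual-mode FXP format.

```python
def kernel_to_fxp(w, frac_bits=7):
    """Quantize a kernel weight in [-1.0, 1.0] to a signed Q1.7-style
    fixed-point integer, saturating at the representable range."""
    scale = 1 << frac_bits
    q = round(w * scale)
    return max(-scale, min(scale - 1, q))  # clamp to [-128, 127] for Q1.7

def kernel_from_fxp(q, frac_bits=7):
    """Recover the real value represented by a fixed-point kernel."""
    return q / (1 << frac_bits)

print(kernel_to_fxp(0.5))     # 64
print(kernel_to_fxp(1.0))     # 127 (saturated)
print(kernel_from_fxp(-128))  # -1.0
```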
TABLE I: Comparisons Among Data Representation Schemes

TABLE II: Chip Characteristics

Fig. 7. (a) Distributions of bit flips on the accumulation register for AlexNet. (b) Power and area comparison of PEs. The FXP baseline utilizes 24-bit FXP, and the rest are based on the HR scheme.

Fig. 8. (a) Chip die photo. (b) Measurement system using an FPGA.

V. CONCLUSION

Despite the significance of the PE in a CNN accelerator, its optimization has not been sufficiently considered in previous studies. In this brief, we intensively studied state-of-the-art CNN networks and proposed an efficient HR system including SSE. To support all operations in CNN algorithms, detailed PE data paths and control signals based on HR were implemented. From the evaluation, our SSE is able to compress the original features by 54% with less than 3% accuracy loss, which leads to a 60% reduction in external DRAM accesses. Our PE design for HR shows outstanding improvements, consuming 47% less power and occupying 16% less area. Overall comparisons at the accelerator level demonstrate that the HR system is superior to other representation methods.

ACKNOWLEDGMENT

Test chip fabrication was supported by MPW of IDEC.