Energy-Efficient Design of Processing Element For Convolutional Neural Network
Abstract—The convolutional neural network (CNN) is the most prominent algorithm thanks to its wide usage and good performance. Despite the fact that the processing element (PE) plays an important role in CNN processing, there has been no study focusing on PE design optimized for state-of-the-art CNN algorithms. In this brief, we propose an optimal PE implementation, including a data representation scheme, circuit block configurations, and control signals, for energy-efficient CNN. To validate the proposed design, we compared it with several previous methods and fabricated a silicon chip. Software simulation results demonstrate that we can reduce data bit lengths by 54% with negligible accuracy loss. Our PE optimizations save up to 47% of computing power, and an accelerator exploiting our method shows superior results in terms of power, area, and external DRAM access.

Index Terms—Neuromorphic computing, convolutional neural network, processing element, neural network processor.

I. INTRODUCTION

The convolutional neural network (CNN) is the most advanced network that is accurate enough to be practically utilized in user-level applications, such as visual analysis, image/speech recognition, and natural language processing. Its correctness and versatility have attracted the interest of researchers who have been working on developing better CNN models for wider applications. At the same time, with the wave of IoT and smart mobile devices, there is a strong need for energy-efficient CNN hardware in resource-hungry environments where powerful GPUs and general-purpose processors can hardly be used. Hence, designing a dedicated accelerator for CNN is particularly crucial.

A fundamental operation of CNN is multiplication and accumulation (MAC). It is admittedly simple, but the iterations of the MAC process are massive. Especially in image recognition, over ten billion MAC computations are necessary to perform a highly accurate CNN network [1]. This induces huge energy consumption in logical operations. Meanwhile, data transfer in the on/off-chip memory system is the other main source of energy consumption. This problem is critical in a dedicated accelerator in that current CMOS process technology cannot provide sufficient on-chip SRAM to store CNN data, which is megabyte-scale. Therefore, a large amount of data communication between the chip and external DRAM is compulsory, which causes enormous energy dissipation. In sum, the energy crisis caused by both logical computation and memory access must be overcome to efficiently employ CNN in a consumer device with an embedded accelerator.

Several previous studies have suggested dedicated processors for CNN. Du et al. [2] proposed a 2-D systolic-array CNN processor with an inter-processing-element (PE) data propagation scheme, which successfully reduced the memory bandwidth. In [3], tile-based computing was proposed to smartly manage data movements inside the processor. A neuromorphic system effectively utilizing a bit-truncation technique was introduced in [4]. Further to these, Pullini et al. [5] presented Mia Wallace, an MPSoC designed to flexibly execute various CNN models at low power.

While effective CNN architectures have been actively studied, research on the PEs themselves, the core parts of the system, has not been conducted. In this brief, we argue that the key solution to the energy problem requires an efficient data representation scheme with appropriate optimizations of the PE. Since all calculations of the CNN algorithm occur inside the PE, careful consideration of the PE design is essential to modify data representation methods while maintaining acceptable accuracy. Our contributions to CNN PE design are summarized in the following three points.

1) We analyze the inputs/outputs of PE blocks during CNN processing, and propose an optimized scheme to perform state-of-the-art CNN networks. By exploiting the observation that the dynamic ranges of the two PE inputs are distinct, we introduce a heterogeneous representation (HR). To express the features efficiently, Significant bits securing encoding (SSE) is presented.
2) We verify that CNN arithmetic operations based on signed magnitude are better in terms of energy consumption than the conventional two's complement method, and we utilize the signed-magnitude system in our PE design.

Manuscript received December 23, 2016; revised March 16, 2017; accepted April 1, 2017. Date of publication April 6, 2017; date of current version November 1, 2017. This work was supported by the National Research Foundation of Korea through the Korea Government (MSIP) under Grants NRF-2014R1A2A1A05004316 and 2010-0028680. This brief was recommended by Associate Editor C.-T. Cheng. (Corresponding author: Lee-Sup Kim.)
The authors are with the School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon 34141, South Korea (e-mail: yjchoi@vlsi2.kaist.ac.kr; baedm12@vlsi2.kaist.ac.kr; jhsim@mvlsi.kaist.ac.kr; skchoi@vlsi2.kaist.ac.kr; mhkim@mvlsi.kaist.ac.kr; leesup@kaist.ac.kr).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSII.2017.2691771
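Contribution 2) rests on a switching-activity argument: for the small-magnitude, sign-alternating values common in CNN data, consecutive two's-complement codes differ in far more bit positions than signed-magnitude codes, and each extra register bit flip costs dynamic power. The following Python experiment is our own illustrative sketch of that effect, not the authors' measurement; the 16-bit width and the sample value stream are arbitrary assumptions.

```python
def twos_complement(v, bits=16):
    """Encode integer v as a two's-complement bit pattern."""
    return v & ((1 << bits) - 1)

def signed_magnitude(v, bits=16):
    """Encode integer v as a sign bit plus magnitude."""
    sign = (1 << (bits - 1)) if v < 0 else 0
    return sign | (abs(v) & ((1 << (bits - 1)) - 1))

def bit_flips(values, encode, bits=16):
    """Total Hamming distance between consecutive encoded values,
    i.e., how many register bits toggle while processing the stream."""
    flips, prev = 0, encode(values[0], bits)
    for v in values[1:]:
        cur = encode(v, bits)
        flips += bin(prev ^ cur).count("1")
        prev = cur
    return flips

# Small values alternating in sign: two's complement toggles nearly
# every high-order bit on each sign change, signed magnitude does not.
stream = [3, -2, 5, -1, 4, -3]
print(bit_flips(stream, twos_complement))   # 73
print(bit_flips(stream, signed_magnitude))  # 15
```

The gap grows with word length, since every sign change in two's complement rewrites the entire run of sign-extension bits.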
1549-7747 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
CHOI et al.: ENERGY-EFFICIENT DESIGN OF PE FOR CNN 1333
Fig. 1. Overall architecture of our processor and the proposed PE with the HR scheme.

Fig. 2. Brief outline and example cases of SSE. An SSE number consists of three parts: sign bit, most significant position (MSP), and significands.

3) An optimized PE design suitable for performing all layers in the CNN is introduced. Utilizing our HR method, we design a PE to execute state-of-the-art CNN algorithms in an energy-efficient manner.

From the evaluation results, we show that our work can effectively reduce the external memory requirement as well as the operating power of a CNN accelerator.

The remainder of this brief is structured as follows. Section II describes the intuitions behind our data representation and the details of the proposed schemes. In Section III, a concrete PE design and its data path controls are explained. Evaluation settings and results are presented in Section IV. Then, we draw our conclusions in Section V.

II. OPTIMIZATIONS ON THE DATA REPRESENTATION

A CNN accelerator should be designed with consideration not only of energy efficiency but also of network flexibility for diverse applications. Thus, the analysis and implementation must cover the worst possible scenario so as to accommodate every kind of network. A PE generally takes two inputs, features and kernels, from their designated data paths. Compared to the kernels, which are typically bounded within [−1, +1] during network training, the dynamic ranges of the features are extremely broad. Previous approaches have utilized either floating point (FLP) or variants of fixed point (FXP) to express all kinds of data. From the fact that the two input paths show totally dissimilar patterns of values, we draw the intuition to use different data types suited to each path.

A. Heterogeneous Representation (HR)

Due to the high complexity and energy consumption of an FLP MAC block, FXP MAC is more energy-efficient. On the other hand, an FXP-based system lacks the ability to express a wide range of values, which is essential in CNN convolution. Hence, to exploit the advantages of both FXP's efficiency and FLP's flexibility, we assign our novel data encoding to the features while using a dual-mode FXP system for the kernels, constants, and accumulation register. Fig. 1 illustrates the architecture of our processor and the outline of HR.

1) Significant Bits Securing Encoding (SSE): As computing results are accumulated in the register of each PE until an assigned convolution window is finished, the data is compressed only when the window process is over. Because of the error resiliency inherent in CNN algorithms, a reasonable approximation scarcely affects the overall accuracy. Since we exploit FXP MAC for the internal data of the PE, an appropriate encoding scheme for the FXP values is necessary. From the simple but powerful principle that bits near the MSB are of importance, we developed Significant bits securing encoding (SSE) to effectively describe the features of CNN algorithms (Fig. 2). The difference between SSE and FLP lies in the medium they use to compress the original data into significands. FLP controls "the position of the decimal point" using an exponent, whereas SSE utilizes "the relative position of the first one" with respect to the original MSB of the full fixed-point word. By limiting the representable range to one appropriate for CNN algorithms, SSE can reduce the hardware implementation cost compared to the FLP system. The details of SSE generation are described in Algorithm 1.

Algorithm 1 (SSE generation):
2: Detect the first 1, the most significant position (MSP) $f_m$, by checking bits from $f_{l-1}$ down to $f_{\max(g,\,l-2^n)}$. ($n$ is the number of bits indicating the MSP; $g$ is the number of bits allocated for the significand.) If there is no 1 within the searching range, the MSP becomes 0.
3: Save the position of the first 1, the MSP, in $n$ binary bits: $p_{n-1}p_{n-2}\cdots p_0 = m - \max(g,\, l-2^n) + 1$.
4: Save the $g$ bits next to the MSP in the significand: $q_{g-1}q_{g-2}\cdots q_0 = f_{m-1}f_{m-2}\cdots f_{m-g}$.
5: Final encoding format: $\{s,\; p_{n-1}p_{n-2}\cdots p_0,\; q_{g-1}q_{g-2}\cdots q_0\}$, i.e., $(1+n+g)$ bits.

2) Dual-Mode Fixed Point: Besides the feature data, there are kernels and constants in convolution, and we use the dual-mode fixed point to express them. As mentioned before, their dynamic range is strictly narrow, so the powerful FLP style would be excessive. Furthermore, assigning the FXP system to one of the data paths enables an easy implementation of the
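The SSE generation steps of Algorithm 1 can be sketched in Python as follows. The parameter defaults here ($l = 24$, $n = 5$, $g = 5$) are our own illustrative assumptions, chosen so that the $(1+n+g)$-bit word is 11 bits, i.e., a 54% reduction from a 24-bit FXP feature, consistent with the figures reported in this brief; the actual hardware parameters may differ. The sign bit $s$ is assumed to be handled separately.

```python
def sse_encode(value, l=24, n=5, g=5):
    """SSE-encode a non-negative l-bit fixed-point magnitude as
    (MSP field, significand), following steps 2-4 of Algorithm 1."""
    low = max(g, l - 2 ** n)               # lowest bit position searched
    m = 0                                  # most significant position (MSP)
    for pos in range(l - 1, low - 1, -1):  # scan from the MSB downwards
        if (value >> pos) & 1:
            m = pos
            break
    if m == 0:                             # no 1 in range: MSP field is 0
        return 0, value & ((1 << g) - 1)
    p = m - low + 1                        # MSP field, stored in n bits
    q = (value >> (m - g)) & ((1 << g) - 1)  # g bits just below the first 1
    return p, q

def sse_decode(p, q, l=24, n=5, g=5):
    """Approximate inverse: bits below the kept significand are lost."""
    low = max(g, l - 2 ** n)
    if p == 0:
        return q
    m = p + low - 1
    return (1 << m) | (q << (m - g))
```

For example, `sse_encode(0x5A3C01)` keeps the leading 1 at bit 22 plus the five bits below it, and `sse_decode` then reconstructs `0x5A0000`: the low-order bits are truncated, which is exactly the approximation the error resiliency of CNNs tolerates.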
1334 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 64, NO. 11, NOVEMBER 2017
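Because kernels stay within [−1, +1], a plain signed fixed-point format covers them without any exponent machinery. The quantizer below is a hypothetical illustration of that observation; the 8-bit Q1.7-style layout is our assumption, not the brief's dual-mode FXP format.

```python
def kernel_to_fxp(w, frac_bits=7):
    """Quantize a kernel weight in [-1.0, 1.0] to a signed Q1.7-style
    fixed-point integer, saturating at the representable range."""
    scale = 1 << frac_bits
    q = round(w * scale)
    return max(-scale, min(scale - 1, q))  # clamp to [-128, 127] for Q1.7

def kernel_from_fxp(q, frac_bits=7):
    """Recover the real value represented by a fixed-point kernel."""
    return q / (1 << frac_bits)

print(kernel_to_fxp(0.5))     # 64
print(kernel_to_fxp(1.0))     # 127 (saturated)
print(kernel_from_fxp(-128))  # -1.0
```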
TABLE I: Comparisons Among Data Representation Schemes

TABLE II: Chip Characteristics

Fig. 7. (a) Distributions of bit flips on the accumulation register for AlexNet. (b) Power and area comparison of PEs. The FXP baseline utilizes 24-bit FXP, and the rest are based on the HR scheme.

Fig. 8. (a) Chip die photo. (b) Measurement system using an FPGA.

V. CONCLUSION

Despite the significance of the PE in a CNN accelerator, its optimization has not been sufficiently considered in previous studies. In this brief, we intensively studied state-of-the-art CNN networks and proposed an efficient HR system including SSE. To support all operations in CNN algorithms, detailed PE data paths and control signals based on HR were implemented. From the evaluation, our SSE is able to compress the original features by 54% with less than 3% accuracy loss, which leads to a 60% reduction in external DRAM accesses. Our PE design for HR shows outstanding improvements, consuming 47% less power and occupying 16% less area. Overall comparisons at the accelerator level demonstrate that the HR system is superior to other representation methods.

ACKNOWLEDGMENT

Test chip fabrication was supported by MPW of IDEC.