
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 57, NO. 6, JUNE 2022

BitBlade: Energy-Efficient Variable Bit-Precision
Hardware Accelerator for Quantized Neural Networks

Sungju Ryu, Member, IEEE, Hyungjun Kim, Member, IEEE, Wooseok Yi,
Eunhwan Kim, Graduate Student Member, IEEE, Yulhwa Kim, Graduate Student Member, IEEE,
Taesu Kim, Student Member, IEEE, and Jae-Joon Kim, Member, IEEE

Abstract— We introduce an area/energy-efficient precision-scalable neural network accelerator architecture. Previous precision-scalable hardware accelerators have limitations such as the under-utilization of multipliers for low bit-width operations and the large area overhead to support various bit precisions. To mitigate these problems, we first propose a bitwise summation scheme, which reduces the area overhead for bit-width scaling. In addition, we present a channel-wise aligning scheme (CAS) to efficiently fetch inputs and weights from on-chip SRAM buffers and a channel-first and pixel-last tiling (CFPL) scheme to maximize the utilization of multipliers on various kernel sizes. A test chip was implemented in 28-nm CMOS technology, and the experimental results show that the throughput and energy efficiency of our chip are up to 7.7× and 1.64× higher than those of the state-of-the-art designs, respectively. Moreover, additional 1.5–3.4× throughput gains can be achieved using the CFPL method compared to the CAS.

Index Terms— Bit-precision scaling, bitwise summation, channel-first and pixel-last tiling (CFPL), channel-wise aligning, deep neural network, hardware accelerator, multiply–accumulate unit.

Manuscript received June 3, 2021; revised November 3, 2021 and December 15, 2021; accepted January 2, 2022. Date of publication January 21, 2022; date of current version May 26, 2022. This article was approved by Associate Editor Vivek De. This work was supported in part by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea Government [Ministry of Science and ICT (MSIT)] under Grant 2020-0-01309 (Development of Artificial Intelligence Deep Learning Processor Technology for Complex Transaction Processing Server, 50%) and in part by the National Research Foundation of Korea (NRF) grant funded by the Korea Government (MSIT) under Grant 2020R1A2C2004329 (NeuroHub: Web-based Open Simulation Platform for Collaborative In-Memory Neural Network Hardware Research, 50%). (Corresponding author: Jae-Joon Kim.)

Sungju Ryu is with the School of Electronic Engineering, Soongsil University, Seoul 06978, Republic of Korea (e-mail: sungju.ryu@ssu.ac.kr).

Hyungjun Kim, Eunhwan Kim, Yulhwa Kim, and Taesu Kim are with the Department of Convergence IT Engineering, Pohang University of Science and Technology, Pohang 37673, Republic of Korea.

Wooseok Yi was with the Department of Creative IT Engineering, Pohang University of Science and Technology (POSTECH), Pohang 37673, Republic of Korea. He is now with the AI&SW Research Center, Samsung Advanced Institute of Technology, Suwon 16676, Republic of Korea.

Jae-Joon Kim is with the Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, Republic of Korea (e-mail: kimjaejoon@snu.ac.kr).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/JSSC.2022.3141050.

Digital Object Identifier 10.1109/JSSC.2022.3141050
I. INTRODUCTION

QUANTIZATION, which reduces the number of bits representing neural parameters in neural network computations, is a widely adopted scheme to minimize off-chip communication. Recent works show that each layer/network of the DNNs has a different bit-width requirement for optimal efficiency [1]–[4] (see Fig. 1). Neural networks have usually been represented in a fixed bit width on hardware accelerators. However, these conventional accelerators cannot maximize the performance of quantized neural networks (QNNs), because such a microarchitecture with fixed bit-width computation does not fully exploit the dynamic bit-width scalability of the QNNs.

Fig. 1. Various bit-width requirements on DNN inferences [1]–[4].

Envision [5] introduced the dynamic-voltage-accuracy-frequency scaling (DVAFS) scheme to increase the performance of the MAC operation by turning on inactive submultipliers and enabling single instruction, multiple data (SIMD) computations in the low-bit modes. However, the DVAFS multiplier still has unused submultipliers in the low-precision operations, and hence, it cannot maximize the throughput of QNN computations. Bit fusion (BF) [6] presented a dynamically composable architecture where a fused PE fully utilizes the submultipliers called BitBricks. The partial sums from the BitBricks are added through an adder tree at the fused-PE level by using multiplexers, dynamic bit-shift logic blocks, and adders. While such a dynamically composable scheme maximizes the MAC throughput, the additional circuits needed to implement the dynamic composability significantly increase the chip area and the power consumption [6].

To overcome the limitations of the previous dynamic precision-scalable hardware accelerators [5]–[7], we introduce an area- and energy-efficient bit-width scalable architecture. The main contributions of the proposed design are given as follows.


1) We introduce the BitBlade (BB) hardware architecture, which is based on a bitwise summation scheme to reduce area and power overheads for the dynamic bit-width scaling [8].
2) We fabricate a test chip using 28-nm CMOS technology. A channel-wise aligning scheme (CAS) is implemented to efficiently fetch the data from SRAM buffers to the PE array [9].
3) To maximize the throughput on various sizes of weight kernels, we present a channel-first and pixel-last tiling (CFPL) scheme. We also modify the CAS to support this tiling method.

II. PRELIMINARY: PREVIOUS PRECISION-SCALABLE HARDWARE ACCELERATORS

Several hardware accelerators have been reported to better support the various bit-width requirements of the QNNs. UNPU [10] and Stripes [11] realized dynamic bit-width scaling but supported the variable precision for either activation or weight only, thereby having limitations in maximizing the performance of the QNNs. The DVAS scheme [7] and the DVAFS approach of the Envision [5] supported the variable precision for both activation and weight by turning on/off the submultipliers to support the target bit-width mode. However, these approaches still experienced under-utilization of the multipliers due to the inactive submultipliers in the low-bit computation mode.

Fig. 2. (a) Baseline design (BF) and (b) proposed BB architecture. B(x, y) describes a BitBrick.

On the other hand, the BF [6] maximized the throughput in a variety of bit-precision modes by supporting variable precision for both input and weight while achieving full utilization of the submultipliers. Fig. 2(a) illustrates a simple example of the top-level structure of the BF design. In this example, a PE array has 16 PEs. A PE consists of 16 submultipliers called BitBricks, and each BitBrick performs 2-bit multiplication. After the 2-bit multiplication results are constructed from the BitBricks in the PEs, the 2-bit multiplication results are shifted depending on the desired bit width. In the case of 2 × 2 multiplication (input = 2b and weight = 2b), no shift operations are performed (Case A). A PE handles 16 2-bit multiplications simultaneously. When 4 × 2 multiplication (input = 4b and weight = 2b) is executed (Case B), two BitBricks construct a fused PE (F-PE) by completing a 4 × 2 multiplication result. One BitBrick in an F-PE shifts the 2-bit multiplication result by 2 bits to the left. The other BitBrick in the F-PE does not perform the shift-left operation. Eight 4 × 2 multiplications are concurrently performed. In the 8 × 8 multiplication (input = 8b and weight = 8b) (Case C), an F-PE is comprised of 16 BitBricks. Only a single 8-bit multiplication is computed in a PE in that case. After the BitBricks perform the 2-bit multiplications, the shifted partial multiplication results are added via the intra-PE adder trees.

Fig. 3. Breakdown of the power and area of a PE in the BF [6].

Although the PE of the BF shows higher utilization of the submultipliers than that of the previous DVAS and DVAFS methods, the shift-add logic gates account for the largest part of the PE [6] and incur significant area overhead (see Fig. 3). Our BB architecture aims to mitigate such area and power overheads due to the shift-add logic that is used to support the variable precision.
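To make the baseline composition concrete, the following Python sketch models a single BF fused PE as described above: every BitBrick multiplies a 2-bit activation segment by a 2-bit weight segment and applies its own variable shift before the results are added. This is an illustrative, unsigned-only model written for this article; the function and variable names are ours, not part of the BF design.

```python
def split_2bit(value, bits):
    """Split an unsigned `bits`-wide value into 2-bit segments, LSB first."""
    return [(value >> (2 * i)) & 0b11 for i in range(bits // 2)]

def bf_fused_pe_product(activation, weight, a_bits, w_bits):
    """Behavioral model of one BF fused PE (unsigned operands only).

    Every BitBrick multiplies a 2-bit activation segment by a 2-bit weight
    segment and shifts its own partial product according to the segment
    positions; the shifted results are then added by the adder tree."""
    total = 0
    for i, a_seg in enumerate(split_2bit(activation, a_bits)):
        for j, w_seg in enumerate(split_2bit(weight, w_bits)):
            total += (a_seg * w_seg) << (2 * (i + j))  # per-BitBrick variable shift
    return total

assert bf_fused_pe_product(0b1011, 0b10, 4, 2) == 0b1011 * 0b10  # 4b x 2b uses two BitBricks
assert bf_fused_pe_product(173, 229, 8, 8) == 173 * 229          # 8b x 8b uses 16 BitBricks
```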
III. BITBLADE ARCHITECTURE

A. Bitwise Summation

To reduce the area and power overhead of the variable-precision support logic, we first propose the bitwise summation method.


In the baseline BF architecture, variable shift-left operations occur inside the BitBricks, and the outputs of multiple BitBricks are summed by the adder tree [see Fig. 2(a)]. In this case, BitBricks with the same index in the different PEs always have the same bit-shift parameters (filled with the same background colors). In the BB architecture, we move the BitBricks with the same parameter to a single PE [see Fig. 2(b)]. While each BitBrick in the baseline design has a dedicated variable shift logic, which incurs a large area/power overhead (see Fig. 3), each PE (16 BitBricks in this example) requires only one shift logic in our BB design due to the proposed bitwise summation, thereby minimizing the logic complexity for the variable bit-precision support.

Fig. 4. Implementation example of bitwise summation. In this example, 4-bit × 4-bit MAC operations are performed. The PE array has four PEs, and each PE has four BitBricks for a simple explanation. The four stages (a)–(d) in the bitwise summation are described separately in the figure for the illustration, but all the stages are performed simultaneously.

Fig. 4 describes an example of a 4-bit × 4-bit dot product to show how the proposed bitwise summation method works. First, 4-bit input numbers and weights are divided into four groups of two consecutive bits. Then, they are sent to the BitBrick(0)'s in the different PEs (Stage 1). The second inputs and weights are divided into 2-bit numbers and sent to the BitBrick(1)'s in the different PEs (Stage 2). The remaining two 4-bit multiplications (Stages 3 and 4) are executed in the same manner as in Stages 1 and 2. By doing so, all the BitBricks within a PE have the 2-bit numbers that came from the same bit position of the different inputs and weights. The 2-bit multiplication results from the BitBricks are added through the intra-PE adder tree, and the added psums are shifted by a variable amount outside the PE. Finally, all the psums are accumulated (inter-PE adder tree) in the output buffer.
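The following sketch mirrors the Fig. 4 walk-through in Python: segments that share one (input, weight) bit-position pair are gathered, summed as a psum, and shifted only once. It is a behavioral model with unsigned operands and hypothetical helper names, not the PE-array RTL, but it shows why one shifter per PE is sufficient.

```python
def split_2bit(value, bits):
    # 2-bit segments of an unsigned value, least-significant segment first
    return [(value >> (2 * i)) & 0b11 for i in range(bits // 2)]

def bitblade_dot_product(activations, weights, bits=4):
    """Bitwise-summation model of a dot product sum_k a[k] * w[k].

    Each PE gathers the 2-bit segments that share one (input, weight)
    bit-position pair across all terms, adds them with its intra-PE adder
    tree, and only the accumulated psum is shifted: one shifter per PE
    instead of one per BitBrick."""
    segments = bits // 2
    result = 0
    for i in range(segments):            # input bit position
        for j in range(segments):        # weight bit position
            # one PE: the same shift amount applies to every term of the dot product
            psum = sum(split_2bit(a, bits)[i] * split_2bit(w, bits)[j]
                       for a, w in zip(activations, weights))
            result += psum << (2 * (i + j))   # single shift outside the PE
    return result

acts, wgts = [3, 7, 12, 9], [5, 2, 11, 14]
assert bitblade_dot_product(acts, wgts) == sum(a * w for a, w in zip(acts, wgts))
```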

B. Supporting Various Bit Widths

Fig. 5. Handling various bit widths in the proposed bitwise summation. (a) Example of 2-bit × 2-bit multiplication. (b) Example of 4-bit × 2-bit multiplication.

Fig. 5 shows how to handle various bit widths in our BB architecture with examples of 2-bit × 2-bit and 4-bit × 2-bit multiplications. In the 2-bit multiplication case [see Fig. 5(a)], no shift operations are performed in the PE array. On the other hand, the PEs first perform dot products in the 4-bit × 2-bit multiplication [see Fig. 5(b)]. Then, the dot product results from PE#8-15 are shifted to the left by 2 bits, while no shift operations are applied to the other psums from PE#0-7. Similarly, BB supports three configurations (2, 4, and 8 bit) for both activations and weights, and hence, nine precision cases can be configured. Recent neural networks can be deeply quantized by using binary neural parameters [1]. To handle these binary neural networks, binary inputs/weights are converted into 2-bit two's complement numbers before they are fed to the PE array. After that, the binary neural network computations are performed on the BB in the same way as in the 2-bit precision case.
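The binary-network path described above can be sketched as follows. The ±1 value set is an assumption (the text only states that binary inputs/weights are converted into 2-bit two's complement numbers before they enter the PE array), so treat the mapping as illustrative.

```python
def binary_to_2bit_twos_complement(value):
    """Map a binary neural parameter to a 2-bit two's complement code.

    Assumes the binary networks use {-1, +1} values, as in XNOR-Net [1]:
    +1 -> 0b01 and -1 -> 0b11 (i.e., -1 in 2-bit two's complement)."""
    if value not in (-1, +1):
        raise ValueError("binary parameters are expected to be -1 or +1")
    return value & 0b11  # two's complement wrap-around into 2 bits

assert binary_to_2bit_twos_complement(+1) == 0b01
assert binary_to_2bit_twos_complement(-1) == 0b11
```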
IV. TEST-CHIP IMPLEMENTATION

A. Top-Level Architecture

Fig. 6 shows the top-level organization of the BB test chip. The test chip consists of 16 PE arrays, a 144-kB SRAM global buffer, and a global controller. Every PE array includes an inter-PE adder tree, dynamic shift logic blocks, and 16 PEs.


Each PE has 16 2-bit multipliers and an intra-PE adder tree. A PE controller in a PE array handles the dynamic bit scaling and the signed/unsigned representations of the input numbers.

Fig. 6. Top-level architecture of the test chip.

B. Implementation of a PE Array

A PE array of our test chip has 16 precision-scalable PEs, and each PE consists of 16 2-b multipliers (see Fig. 7). To minimize the area overhead for the variable bit-precision support, we applied the bitwise summation method introduced in Section III. After the 2-b multipliers generate partial multiplication results, the psums are added through an intra-PE adder tree dedicated to each PE. The added psums are shifted depending on the target bit width. Finally, the shifted numbers are accumulated until the psum computation is complete.

Fig. 7 also shows the cases for mapping the variable-precision multiplications onto the proposed PE array. Depending on the bit-width requirement (2–8 b for both IA/W), the 16 2-b submultipliers are fully utilized to construct from one 8-b multiplication up to 16 2-b multiplications, in the same manner as illustrated in Fig. 4. To apply the bitwise summation, 2-b multiplications at the same bit positions in the different terms of a dot product (filled with the same background color and pattern) are gathered in a PE (gray arrow).

Fig. 7. Precision-scalable PE array with various bit widths.

C. Channel-Wise Aligning Scheme

In neural network accelerators, on-chip reuse is usually exploited to reduce the number of DRAM accesses, thereby maximizing the energy efficiency and reducing the latency of the off-chip communication [12]. Similar to previous works, a global buffer is used in the BB chip to upload neural parameters for the on-chip reuse, and the tiled parts of the parameters stored in the global buffer are fetched to the PE arrays multiple times. When the data is fetched to the PE arrays, aligning logic is needed to match the input/weight pairs for correct multiplication results. We implemented a CAS that is similar to the height–width–channel (HWC) layout commonly used in GPU implementations [13], [14]. Unlike the HWC arrangement, which packs both spatial- and channel-domain data into an SRAM word, the channel-wise aligning approach packs only channel-domain data into an SRAM word.

Fig. 8. (a) Proposed CAS has a simpler input/weight aligning logic than the spatial-domain-wise aligning method. (b) Applying CAS to convolutional and fully connected layers. "IA" and "W" indicate the input activation and the weight, respectively.

Fig. 8(a) shows the difference between the conventional spatial-domain-wise aligning scheme and the proposed CAS. The number of inputs/weights packed in an SRAM word varies depending on the target bit precision. In the spatial-domain-wise aligning, input activations/weights in the consecutive spatial domain are packed in a word. To pick the input activations and weights from the SRAM words for the dot product computation, complex aligning logic is needed. On the other hand, in the proposed CAS, we pack the inputs/weights from consecutive channels into a word, which allows simpler aligning logic than the spatial-domain-wise aligning case. An example of the channel-wise aligning method is illustrated in Fig. 8(b). In the convolutional layer, the base addresses of the inputs/weights are first calculated.

Then, the data in the consecutive input-channel indices are continuously fetched to the PE arrays. The inputs are broadcast to the PE arrays, and the weights in an output channel are dedicated to each PE array. After the inner products are complete, the kernel window is slid to the next stride. Such an operation repeats until the convolutional computation is finished. Fully connected layers are also computed in a similar manner to the convolutional layer case. Inputs are broadcast across all the PE arrays, and a single weight column is sent to each PE array.
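A minimal sketch of the channel-wise packing performed by the CAS is shown below, assuming an HWC-ordered NumPy feature map and a hypothetical word capacity; the actual SRAM word format of the chip is not specified at this level of detail.

```python
import numpy as np

def pack_cas_words(ifmap_hwc, values_per_word):
    """Channel-wise aligning: each SRAM word holds `values_per_word`
    activations taken from consecutive input channels of a single pixel
    (an HWC-like layout), so no spatial re-aligning is needed later."""
    h, w, c = ifmap_hwc.shape
    assert c % values_per_word == 0, "channels assumed to fill whole words"
    words = []
    for y in range(h):
        for x in range(w):
            for base in range(0, c, values_per_word):
                words.append(ifmap_hwc[y, x, base:base + values_per_word].copy())
    return words

# Example: a 4x4 feature map with 256 channels, 16 2-bit activations per word.
ifmap = np.random.randint(0, 4, size=(4, 4, 256), dtype=np.uint8)
words = pack_cas_words(ifmap, values_per_word=16)
print(len(words), "SRAM words; the first word holds channels 0-15 of pixel (0, 0)")
```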
D. Dataflow

We implemented the PE arrays of our test chip using a weight-stationary dataflow, in the same manner as Google's TPU [15]. Weights are pre-stored in the PEs, and they are reused by multiple input activations. When weights are loaded to the PEs, each PE array receives 32 bits/cycle because a PE requires 16 2-bit weights in the worst case. The 16 PE arrays communicate with the global buffer at 32 × 16 = 512 bits/cycle for the weights. In terms of input activations, a PE requires 16 2-bit inputs in the worst case, which is the same as the weight case above. A PE array has 16 PEs, and the inputs are reused by all PE arrays. The 16 PE arrays communicate with the global buffer at 32 × 16 = 512 bits/cycle for the inputs. Each PE array generates a psum (32-bit). When using the weight-stationary dataflow, the psums generated from the PE arrays are stored/loaded in/from the global accumulator buffer before they are completed [15]. Hence, the 16 PE arrays communicate with the global buffer at 32 × 16 × 2 (read/write) = 1024 bits/cycle for the psums. When the psums are finished, they are fed into the activation unit (ReLU) in each PE array before being transferred to the global buffer.
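The per-cycle global-buffer traffic quoted above follows from simple arithmetic, reproduced in the short script below; the numbers come directly from the text, while the variable names are ours.

```python
# Worst-case (2-bit) traffic between the global buffer and the 16 PE arrays.
PE_ARRAYS      = 16
BITS_PER_VALUE = 2        # 2-bit inputs/weights in the worst case
VALUES_PER_PE  = 16       # each PE needs 16 weights (and 16 inputs)
PSUM_BITS      = 32       # each PE array generates a 32-bit psum

bits_per_cycle_per_array = VALUES_PER_PE * BITS_PER_VALUE   # 32 bits/cycle per array
weight_bw = bits_per_cycle_per_array * PE_ARRAYS            # 32 x 16 = 512 bits/cycle
input_bw  = bits_per_cycle_per_array * PE_ARRAYS            # 32 x 16 = 512 bits/cycle
psum_bw   = PSUM_BITS * PE_ARRAYS * 2                       # read + write: 1024 bits/cycle

print(weight_bw, input_bw, psum_bw)   # 512 512 1024
```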
Most of the layers, such as convolutional, residual, and fully connected layers, are expressed by multiple vector–matrix multiplications and additions/accumulations. Convolutional and fully connected layers are computed on the PE arrays as vector–matrix multiplications (see Fig. 8). Residual layers are handled in the PE arrays by adding the identity values to the psums in the accumulators. Pooling layers and activation functions are usually implemented in dedicated logic, but only a simple ReLU activation function is supported in our test chip; pooling computations and other types of activation functions are not implemented. Hence, they are computed on the host CPU (see Fig. 9). However, please note that pooling layers and activation functions can be easily implemented in many other ways. In terms of on-chip quantization, our test chip simply truncates the psums depending on the target bit width. When other types of quantization computations are required, they are performed at the host CPU (see Fig. 9). We will further discuss the quantization in Section VII-B.

Fig. 9. Activation and quantization flow. In the built-in activation and quantization logic, the ReLU activation function and simple truncation are supported. When other types of activation functions and quantization logics are required, the psums are sent to the host CPU to be activated and quantized.

V. CHANNEL-FIRST AND PIXEL-LAST TILING SCHEME

A. Tiling Method

While the proposed CAS efficiently fetches the data from the global buffer to the PE arrays, it cannot maximize the multiplier array utilization because kernels with a smaller size than the size of the PE array lead to the under-utilization of the multipliers.

Fig. 10(a) shows an illustration of 32 PE arrays (32 × 32 PEs) in the 2-bit precision case. Each PE consists of 16 2-bit multipliers, and hence, the PE simultaneously performs a single (one 8-bit) or multiple (up to 16 2-bit) multiplication(s) depending on the target bit width, as explained in Section III. In the CAS, each PE array receives inputs/weights located in the consecutive input channel indices of a single pixel on the spatial domain (see Fig. 8). When the number of multipliers in a PE array becomes large [512 2-bit multipliers in Fig. 10(a)] and the number of inputs/weights in a channel direction is not large enough to fill the PEs [256 in Fig. 10(a)], multipliers that do not receive data are not used, thereby leading to performance degradation.

To mitigate the problem, we introduce a CFPL scheme that simultaneously deals with multiple pixels in the spatial domain [see Fig. 10(b)]. If the number of inputs/weights in a pixel is 256, two pixels are handled in a PE array at the same time because 512 2-bit multipliers are used in a PE array in this example. Furthermore, when #IC = 128 [see Fig. 10(c)], four pixels can be loaded to the PE array in the same manner. However, such a fine-grained approach may increase the logic complexity to support various sizes of input channels [see Fig. 8(a)]. Therefore, we still feed only two pixels to the PE array for the #IC = 128 case at the expense of a potential performance degradation [see Fig. 10(d)]. This coarse-grained approach shows a smaller logic complexity than the fine-grained case and still shows higher utilization of the multipliers than the original CAS [see Fig. 10(a)]. However, if the number of input channels (#IC) is very small, we cannot maintain the high throughput with the coarse-grained approach. For example, when the number of input channels is 64 and two pixels are simultaneously handled, the PE utilization is only 1/4. It is reduced to 1/8 if we have only 32 input channels.

Fig. 10. (a) CAS (channel-only tiling) with 32 PE arrays (32 × 32 PEs) in the 2-bit precision case. (b)–(d) CFPL schemes. Fine-grained tiling with (b) #IC = 256 and (c) #IC = 128 cases. (d) Coarse-grained tiling with the #IC = 128 case. "IC," "OC," and "P" indicate the input channel, the output channel, and the pixel, respectively.

Fig. 11. Mapping the CFPL scheme to SRAM cells and the PE array. Examples with (a) two and (b) four pixel domains handled simultaneously. "X" in Case B indicates traffic congestion between SRAM cells and the PE array.

B. Mapping Features on SRAM Cells

To implement the CFPL scheme, the feature data need to be carefully mapped to the SRAM. Fig. 11 illustrates an example to show how the feature map is stored in the SRAM cells and then fed to the PE array. In the case of #P = 2, where two pixel domains are handled concurrently [see Fig. 11(a)], each of the pixels has a corresponding address in the SRAM buffer depending on the spatial-domain index. In other words, the upper left pixel of the feature map is stored in the P#0 part of the SRAM buffer, and the next pixel is stored in the P#1 part. In a similar manner, the following pixels are alternately allocated to the P#0 and P#1 parts of the SRAM. In clock cycle 0, the first two pixels (red ellipse) are fetched to the PE array, and the PE array computes a dot product. In the next cycle, the next two pixels (green ellipse) are loaded to the PE array. After cycle 4, the kernel window strides to the right, and the load and multiplication operations continue until the dot product computation is finished. In cycle 4, there is only one pixel left for the multiplications inside the kernel window. In this case, only half of the PE array is used in the last clock cycle (cycle 4).

The #P = 4 case, where four pixels are simultaneously handled, proceeds in a similar manner to the case of #P = 2, but it requires more attention [see Fig. 11(b)]. The pixels in the first row of the feature map are stored in P#0, 1, 2, 3, . . ., as illustrated. In typical addressing, the index of the first pixel in the second row is continued from the index of the last pixel in the first row, as shown in Case A of Fig. 11(b). In this case, however, it is possible for multiple pixels to access the different rows of the same SRAM banks. In the given example, two pixels in cycle 0 (red ellipse) simultaneously try to access the different rows of the same SRAM banks ("X" marks). Such traffic congestion degrades PE utilization substantially. Hence, we suggest using the row-wise continuous pixel addressing method, as illustrated in Case B. In this scheme, the second row starts with P#3 instead of P#1 so that the conflict in Case A can be avoided.
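One way to express the row-wise continuous pixel addressing of Case B is sketched below. The exact assignment rule is not given in the text, so the per-row rotation and the feature-map width used here are assumptions chosen only to reproduce the described behavior (#P = 4, second row starting at P#3 instead of P#1).

```python
def pixel_bank_typical(row, col, row_width, num_banks=4):
    # Case A: the pixel index simply continues across row boundaries.
    return (row * row_width + col) % num_banks

def pixel_bank_row_continuous(row, col, num_banks=4):
    # Case B (assumed rule): every row is rotated by one bank position, so
    # vertically adjacent pixels fall into different banks, which avoids the
    # Case A conflict shown in the example.
    return (col - row) % num_banks

ROW_WIDTH = 5  # hypothetical feature-map width
print([pixel_bank_typical(1, c, ROW_WIDTH) for c in range(4)])   # second row: 1, 2, 3, 0
print([pixel_bank_row_continuous(1, c) for c in range(4)])       # second row: 3, 0, 1, 2
```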
C. Implementation of Tiling Scheme


As the kernel window moves over the feature map, input activations and weights from different pixel addresses are matched [Case B of Fig. 11(b)]. In other words, P#0, 1, and 2 are multiplied by weight (0, 0), weight (0, 1), and weight (0, 2), as shown in the middle figure of Case B. In the next stride, for which the kernel window is moved by 1 to the right, as shown in the right-hand side figure of Case B, P#1, 2, and 3 are multiplied by weight (0, 0), weight (0, 1), and weight (0, 2). In a similar manner, P#2, 3, and 0 are multiplied by weight (0, 0), weight (0, 1), and weight (0, 2) in the following stride window. For such address matching, the input feature map must be aligned before it is sent to the PE array. Fig. 12 shows an implementation example of the row-wise continuous pixel addressing scheme. The illustration describes the case of four pixel domains and four input channels for a simple explanation. The pixel-aligning logic includes four barrel shifters. The input activations located at the same input channel address are collected by each barrel shifter. The input activations rotate depending on the barrel shift parameters inside each barrel shifter. Afterward, the rotated input activations are sent to the different positions in the PE array.

Fig. 12. Implementation example of row-wise continuous pixel addressing with four pixels and four input channels that are handled at the same time. "Barrel" indicates one barrel shifter. The example shows a case when the barrel shift parameter is 2.
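The pixel-aligning logic of Fig. 12 can be modeled as a set of rotations, as in the sketch below for four pixel domains, four input channels, and a barrel shift parameter of 2. This is an illustrative software model, not the hardware implementation.

```python
def barrel_rotate(values, shift):
    """Rotate a list of input activations by `shift` positions (one barrel shifter)."""
    shift %= len(values)
    return values[shift:] + values[:shift]

def align_pixels(channel_groups, shift):
    """Apply the same barrel-shift parameter to every input-channel group so
    that each rotated activation reaches its matching position in the PE array."""
    return [barrel_rotate(group, shift) for group in channel_groups]

# Four pixel domains (P#0-3) for each of four input channels, barrel shift parameter 2.
groups = [[f"P{p}_IC{ic}" for p in range(4)] for ic in range(4)]
print(align_pixels(groups, shift=2)[0])   # ['P2_IC0', 'P3_IC0', 'P0_IC0', 'P1_IC0']
```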
VI. RESULTS

A. Core-Level Evaluation

We first compare the area and energy consumption of the BB architecture with those of previous works. The accelerators were synthesized in gate-level cells by using 28-nm CMOS technology.

Fig. 13. Area breakdown of PE arrays in Envision, BF, and BB.

In Fig. 13, the areas of PE arrays (16 × 16 PEs) in BB and the baseline architectures are compared. Each PE has 16 2-bit multipliers in the accelerators. The original concept of the DVAFS method in the Envision is simply achieved by turning on/off the submultipliers depending on the bit precision, but it needs a large number of multiplexers to handle various sign-extension cases for the bit requirements. Furthermore, each PE of the Envision architecture performs batch processing in the low-bit configurations, and hence, the SIMD accumulators to handle the batch condition occupy 20% of the PE array area. The reconfigurable logic of BF accounts for the largest part (59%) of the PE area due to the dynamic shift-add logic. In contrast, our BB reduces the area of PE arrays by 41%–44% due to the proposed bitwise summation method.

Fig. 14. Normalized area of a PE array depending on the number of BitBricks in each PE. The numbers written next to the accelerators on the x-axis are the number of 2-bit multipliers in a PE.

We also varied the number of BitBricks in each PE (see Fig. 14). As the number of BitBricks in a PE increases, the number of input/weight pairs participating in a single dot product becomes larger. As a result, the area of PE arrays in the BB decreased by up to 51% compared to the area of BF in the #BitBricks = 64 case. However, the area of BB with #BitBricks = 8 is still smaller (37%) than the area of BF because the logic for reconfigurability occupies 68% in the #BitBricks = 8 case.


The overhead is significantly reduced in our BB due to the bitwise summation scheme.

Fig. 15. Energy consumption on various DNN workloads. (a) AlexNet. (b) VGG-16. (c) ResNet-152. (d) LSTM-TIMIT.

The comparison of the energy consumption among the accelerators is illustrated in Fig. 15. We first evaluated the energy consumption on various networks [see Fig. 15(a)–(d)]. To evaluate the core-level energy consumption, we ran a time-based analysis using Synopsys PrimeTime PX with a real dataset. However, performing gate-level simulation for all layers of neural networks takes too long; therefore, we calculated the average power consumption from selected matrix-multiplication components of each neural network instead. Using the average power information, we calculated the energy consumption for the entire network. Envision shows comparable energy consumption to the BF in the 8-bit case, but it suffers from a significant increase in energy consumption in the 2-bit case because only 25% of the submultipliers are used for the SIMD multiplication. On the other hand, the BB architecture consumes smaller energy than the baselines.

TABLE I. Top-One Validation Set Accuracy [%] on AlexNet [3]

Table I shows the top-one validation set accuracy [3] on AlexNet. An 8-bit quantization for both activations and weights shows 54.5% accuracy. When using 4- and 2-bit quantization techniques, the accuracy is 54.5% and 51.3%, respectively. However, the core-level energy efficiency is improved by almost 4× and 16×, respectively.

B. Test-Chip Measurements

Fig. 16. (a) Chip micrograph of the test chip. (b) Experimental setup. We used a Xilinx ZC706 FPGA evaluation board as a DRAM interface.

To validate the effectiveness of the proposed schemes, we fabricated a test chip using 28-nm CMOS technology. Fig. 16(a) shows a chip micrograph, and the test chip consists of 16 × 16 PEs and a 144-kB SRAM global buffer. The die area is 0.71 mm² (see Table II). A Xilinx ZC706 FPGA evaluation board was used for the DRAM interface [see Fig. 16(b)]. We verified the function of the test chip using the ImageNet dataset. Weights are pre-trained using the PyTorch framework. The input image and pre-trained weights are first loaded on the DRAM of the FPGA evaluation board, and then, inputs/weights are sent to the test chip. The test chip performs the inference task using the inputs/weights received from the FPGA evaluation board. After the inference task is complete, the results are sent to the terminal through the interface on the FPGA evaluation board [see Fig. 16(b)].

Fig. 17. Measured (a) throughput and (b) energy efficiency of the test chip.

Fig. 17 shows the measurement results of the test chip. The test chip operates up to a 195-MHz clock frequency while consuming 74 mW at a 1.0-V supply voltage. The maximum throughput with 2-bit precision is 1.42 TOPS. At 0.6 V, the test chip shows the maximum energy efficiency of 44.1 TOPS/W, consuming 7.8 mW for the 2-bit precision computation. By applying the proposed bitwise summation and the CAS, the peak performance per compute area (= compute density [TOPS/mm²]) of the test chip is improved by up to 7.7× and 5.1× compared to the Envision [5] and UNPU [10] baselines, respectively (see Table II). The peak on-chip energy efficiency of the test chip is higher than that of the Envision by 10.3× and is comparable to that of the UNPU.

C. System-Level Evaluation

It is well known that DRAM communication consumes much larger energy than the on-chip computations [12]. Therefore, a system-level evaluation, including DRAM accesses, often provides more realistic assessments.

To evaluate the system-level energy consumption, we directly used chip measurements for the core level, and simulation results are used for the memory accesses. We do not have the chip of BF, so we used the power simulation result of the synthesized BF model. For the UNPU design, we directly adopted the measurements from the UNPU paper [10]. The UNPU was fabricated in a 65-nm technology, which is different from our design where a 28-nm technology was used. For a fair comparison, the UNPU results were scaled to a 28-nm technology based on our simulation results for the energy consumption ratio between 28- and 65-nm gate-level standard cells using representative cells, such as NAND2, NOR2, INV, and D flip-flop (UNPU_28 nm in the results). We used a system-level BF simulator and their default simulation parameters as follows [18]. The off-chip DRAM bandwidth is 192 bit/cycle, and the Micron LPDDR3 model [19] is used for the analysis of DRAM communication. Both read and write energies are assumed to be equal to 15 pJ/bit [6], [20].
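Given the stated simulation parameters, the DRAM-side portion of the system-level energy can be reproduced with a helper like the one below. The 15-pJ/bit energy and 192-bit/cycle bandwidth are taken from the text; the traffic amount in the example is hypothetical.

```python
DRAM_ENERGY_PER_BIT_J = 15e-12        # read and write both assumed 15 pJ/bit [6], [20]
DRAM_BITS_PER_CYCLE = 192             # off-chip DRAM bandwidth used in the simulator [18]

def dram_energy_joules(read_bits, write_bits):
    """Off-chip access energy for a given amount of DRAM traffic."""
    return (read_bits + write_bits) * DRAM_ENERGY_PER_BIT_J

def dram_transfer_cycles(total_bits):
    """Cycles needed to move `total_bits` over the 192-bit/cycle interface."""
    return -(-total_bits // DRAM_BITS_PER_CYCLE)   # ceiling division

# Example: fetching 1 MB of weights (hypothetical layer).
bits = 1 * 1024 * 1024 * 8
print(dram_energy_joules(bits, 0), "J,", dram_transfer_cycles(bits), "cycles")
```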
Fig. 18 compares the area and energy consumption of our test chip with those of the UNPU and BF. BF only proposed a precision-scalable PE, and it did not consider the interface between an SRAM global buffer and PE arrays. Therefore, we apply the proposed CAS to both the BF and our design for a fair comparison. Due to the proposed bitwise summation, our test chip (Ours_a1w1o1) shows a smaller chip area than that of the BF (BF_a1w1o1). We then applied the same area constraint as that of BF and increased the SRAM capacities of the input/weight/output buffers in our design.


The increased global buffer size allows our design to have an increased amount of parameter reuse by uploading larger input/weight/output tiles to the on-chip SRAM buffer than the baseline, thereby minimizing the number of DRAM accesses. We evaluate the area and energy consumption for the various SRAM capacities. Among the SRAM configurations, our test chip with a 3× larger weight SRAM buffer (Ours_a1w3o1) shows a similar area to the BF baseline (BF_a1w1o1). The 3× larger weight SRAM buffer improves the overall energy efficiency by 36% for the VGG-16 workload with 2-bit precision due to the increased on-chip reuse. On the other hand, the UNPU has a very different microarchitecture from BF and ours, so it is difficult to modify the SRAM capacity and/or the number of PEs for a comparison. The UNPU has a larger number of PEs and larger SRAM buffers than ours, but we perform the evaluation using the original UNPU configuration. Although the feature-map-reuse scheme that was used in the UNPU maximizes the on-chip energy efficiency (see Table II), it leads to a large number of external memory accesses (EMA) for the psums [10]. As a result, our design shows smaller energy consumption than the UNPU by up to 64%.

TABLE II. Comparison of Test Chip With Previous State-of-the-Art Precision-Scalable Accelerator Chips

Fig. 18. (a) Area and (b) system-level energy consumption on the ImageNet dataset. a{x}w{y}o{z}: Ifmap/Psum SRAM = 32*x + 16*z kB, and weight SRAM = 64*y kB.
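The a{x}w{y}o{z} labels used in Figs. 18 and 19 encode the on-chip buffer sizes. The helper below simply evaluates the formula given in those captions.

```python
def sram_config_kb(x, y, z):
    """SRAM capacities implied by an a{x}w{y}o{z} label:
    Ifmap/Psum SRAM = 32*x + 16*z kB and weight SRAM = 64*y kB."""
    return {"ifmap_psum_kB": 32 * x + 16 * z, "weight_kB": 64 * y}

print(sram_config_kb(1, 1, 1))   # a1w1o1 -> {'ifmap_psum_kB': 48, 'weight_kB': 64}
print(sram_config_kb(1, 3, 1))   # a1w3o1 (3x weight SRAM) -> {'ifmap_psum_kB': 48, 'weight_kB': 192}
```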
We also evaluate the system-level energy efficiency on various neural networks (AlexNet, ResNet-18, VGG-16, and MobileNetV2) with several bit widths (see Fig. 19). Our architecture reduces the overall energy consumption by 27%–39% compared to the BF. The energy efficiency of the UNPU is comparable to our design in the IA/W = 8-b/8-b case, but the number of DRAM accesses for the psums significantly increases in the lower bit configurations due to the feature-map-reuse method of the UNPU.

Fig. 19. System-level energy consumption on various neural networks and bit widths. a{x}w{y}o{z}: Ifmap/Psum SRAM = 32*x + 16*z kB, and weight SRAM = 64*y kB.

D. Evaluation of Channel-First and Pixel-Last Tiling

When kernels in the neural networks have a smaller number of weights than the number of effective multipliers of a PE array, the utilization of the multipliers is reduced in the proposed CAS. The proposed CFPL scheme (see Fig. 10) increases the multiplier array utilization by simultaneously handling multiple pixels in each PE array. Fig. 20 evaluates the proposed CFPL scheme and compares it to the CAS for the 2- and 4-bit width configurations. The fine-grained tiling maximizes the utilization by filling input/weight pairs located at multiple pixel domains into the PE arrays. By doing so, the fine-grained tiling increases the multiplier utilization by 1.1–7.2× and 1.9–14.2× in the 4- and 2-bit cases, respectively. However, the fine-grained tiling leads to a large area overhead, as the number of pixels concurrently handled in a PE array varies depending on the kernel size.


Fig. 20. Utilization of multipliers with CAS and CFPL in (a) 2- and (b) 4-bit precision cases. F-CFPL: fine-grained CFPL. C-CFPL(P): coarse-grained CFPL with P (= the maximum number of pixels handled in a clock cycle in each PE array). The numbers written next to the networks on the x-axis are the number of PEs in our accelerator.

On the other hand, the coarse-grained tiling mitigates the area overhead by fixing the maximum number of pixels concurrently computed in a clock cycle, and it still increases the utilization by 1.1–6.5× and 1.5–11.8× over the CAS in the 4- and 2-bit cases, respectively. In the CAS, the utilization substantially decreases as the number of PEs increases because the number of multipliers in a PE array increases, while the kernel sizes in the neural networks are not changed. In contrast, the proposed channel-first and pixel-last scheme shows high utilization over a wide range of PE sizes by handling multiple pixels at the same time.

We also evaluated the layer-wise multiplier array utilization (see Fig. 21). Both networks have only three channels in the first layers. As a result, they show a lower utilization rate than the rest of the layers. Because it takes a long time to compute such layers with a small number of multipliers, layers with low multiplier utilization increase the total inference time and significantly degrade the average multiplier utilization.

Fig. 21. Layer-wise utilization of multipliers with coarse-grained CFPL (C-CFPL) with 16 pixels. For the evaluation, 4096 PEs are used (VGG-16/AlexNet_4096). (a) VGG-16. (b) AlexNet.

Fig. 22 shows the area overhead to implement the CFPL scheme. While the fine-grained tiling scheme shows very high utilization over a wide range of PE numbers, it leads to a large area overhead (15%–28%) to handle the various input/weight aligning cases. On the other hand, the coarse-grained tiling scheme shows a smaller area overhead (1%–15%) than the fine-grained method, and it still maintains high multiplier array utilization.

Fig. 22. Area overhead with the CFPL scheme for various numbers of PEs. F-CFPL: fine-grained CFPL. C-CFPL(P): coarse-grained CFPL with P (= the maximum number of pixels handled in a clock cycle in each PE array).

Furthermore, we evaluated the energy consumption of the PE arrays with the various tiling schemes (see Fig. 23). The CAS shows the largest energy consumption among the tiling schemes due to its low multiplier array utilization (see Fig. 20).

Fig. 23. Energy consumption of PE arrays with CAS and CFPL in (a) 2- and (b) 4-bit precision cases. F-CFPL: fine-grained CFPL. C-CFPL(P): coarse-grained CFPL with P (= the maximum number of pixels handled in a clock cycle in each PE array). The numbers written next to the networks on the x-axis are the number of PEs in our accelerator.


The coarse-grained tiling scheme with two pixels consumes larger energy than the fine-grained tiling scheme because the fine-grained tiling has much higher utilization than the coarse-grained tiling with two pixels. In contrast, the utilization of the coarse-grained tiling with 16 pixels is comparable to that of the fine-grained tiling, and the area overhead is still smaller than that of the fine-grained scheme. Therefore, the coarse-grained tiling scheme shows smaller energy consumption than the fine-grained tiling in almost all networks except for the 2-bit quantized AlexNet with 4096 PEs (AlexNet_4096), where the average number of input/weight pairs for a dot product is smaller than in the other networks and only a small number of BitBricks are used for the 2-bit computation.

We fabricated our test chip with a coarse-grained tiling scheme with one pixel (C-CFPL(1)). However, we simulated multiple C-CFPL setups to assess potential improvements to our approach, as described in this section.

VII. DISCUSSION

A. Further Optimization for Binary Neural Networks

In our BB design, each BitBrick uses a 2-bit multiplier, which is the same as the configuration used in BF. To compute binary neural networks, all the binary inputs/weights are converted into 2-bit numbers before starting the computation (see Section III-B). Although we can reduce the number of DRAM accesses by using 1-bit parameters when computing binary neural networks, we cannot maximize the on-chip performance and the energy efficiency because such a binary computation consumes the same on-chip power as the 2-bit computation. For each BitBrick, a 1-bit multiplier can be used instead of a 2-bit multiplier to further maximize the computational efficiency. Fig. 24 shows the area comparison of the PE array between BF and our BB. To support the 1-bit multiplication, both architectures increase in PE area because the reconfigurable logic for the variable bit-shift operation needs to support an additional 1-bit case. However, the BB design shows a smaller area overhead than BF because the proposed bitwise summation requires a smaller number of variable shift logic blocks than the BF case. Therefore, BB with 1-bit multiplier-based BitBricks can improve the throughput of binary computation by 4× with a 48% area overhead, and the area is smaller than the BF case by 54%.

Fig. 24. Area comparison of the PE array between BF and BB when 1-/2-bit multipliers are used for BitBricks.
B. Quantization Methods

WRPN [3] increased the number of channels to regain the accuracy in the quantized networks. TWN and BWN used binary precision for weights [2]. In XNOR-Net [1], multiplications are replaced with XNOR operations using binary numbers for both inputs and weights. For the 2-bit QNNs, PACT and SAWB [4] use a trainable activation clipping parameter and statistics of the weight distribution. There is no dedicated quantizer logic in our design. Instead, our chip receives the QNN parameters from external memory. It is assumed that the quantization is done at the software level.
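For reference, the XNOR-style binary computation mentioned above (multiplications on ±1 values replaced with XNOR operations, as in XNOR-Net [1]) can be written as the following generic sketch; it describes that technique in isolation, not the BB datapath, and the 0/1 encoding of ±1 is an assumption.

```python
def xnor_popcount_dot(a_bits, w_bits):
    """Binary dot product of two ±1 vectors encoded as 0/1 bits.

    XNOR gives 1 when the two ±1 values agree; the dot product is then
    (#agreements - #disagreements) = 2 * popcount(XNOR) - length."""
    assert len(a_bits) == len(w_bits)
    agreements = sum(1 for a, w in zip(a_bits, w_bits) if not (a ^ w))
    return 2 * agreements - len(a_bits)

# Encoding assumption: bit 1 represents +1 and bit 0 represents -1.
a = [1, 0, 1, 1, 0, 0, 1, 0]
w = [1, 1, 1, 0, 0, 1, 1, 0]
signed = lambda bits: [1 if b else -1 for b in bits]
assert xnor_popcount_dot(a, w) == sum(x * y for x, y in zip(signed(a), signed(w)))
```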
C. Computing Depthwise Convolution

In the standard convolutional layer, inputs are convolved with weights for multiple output channels. Hence, inputs are broadcast to the PE arrays and reused in the multiple PE arrays. In contrast, the depthwise convolutional layer does not have input reusability. Inputs are convolved with weights for a single output channel, so inputs are only sent to one PE array. Therefore, computing the depthwise convolutional layers shows lower multiplier array utilization than the standard convolutional layer case. For example, if the number of PE arrays is 16, as in our test chip, the maximum utilization rate is only 1/16. Previous works also suffer from such an under-utilization of multipliers on the depthwise convolutional layers, so our design still shows an improvement on the MobileNet workload using depthwise convolutional layers (see Fig. 19). However, our work mainly targets intra-PE-array-level design optimization. If other independent design schemes, such as a channel-stationary dataflow [21], are adopted in our work, we can also recover the multiplier array utilization.

VIII. CONCLUSION

We introduced an area- and energy-efficient hardware accelerator, BB, for QNNs. We first introduced a bitwise summation method to reduce the area/power overheads to support variable bit width. We also presented a channel-wise aligning method to fetch data from the global buffer to PE arrays more efficiently. The experimental results show that the throughput and system-level energy efficiency were increased by up to 7.7× and 1.64× compared to the previous works. To maximize the utilization for the various kernel sizes, a CFPL scheme was also introduced. The additional improvements over the CAS are 1.5–3.4×.

ACKNOWLEDGMENT

The chip fabrication was supported by Samsung Electronics, and the EDA tool was supported by the IC Design Education Center (IDEC).
REFERENCES

[1] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in Proc. Eur. Conf. Comput. Vis. New York, NY, USA: Springer, 2016, pp. 525–542. [Online]. Available: https://eccv2020.eu/
[2] F. Li, B. Zhang, and B. Liu, "Ternary weight networks," 2016, arXiv:1605.04711.
[3] A. Mishra, E. Nurvitadhi, J. J. Cook, and D. Marr, "WRPN: Wide reduced-precision networks," 2017, arXiv:1709.01134.


[4] J. Choi, S. Venkataramani, V. Srinivasan, K. Gopalakrishnan, Z. Wang, and P. Chuang, "Accurate and efficient 2-bit quantized neural networks," in Proc. 2nd SysML Conf., 2019.
[5] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, "14.5 Envision: A 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28 nm FDSOI," in IEEE ISSCC Dig. Tech. Papers, Feb. 2017, pp. 246–247.
[6] H. Sharma et al., "Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural network," in Proc. ACM/IEEE 45th Annu. Int. Symp. Comput. Archit. (ISCA), 2018, pp. 764–775.
[7] B. Moons and M. Verhelst, "An energy-efficient precision-scalable ConvNet processor in 40-nm CMOS," IEEE J. Solid-State Circuits, vol. 52, no. 4, pp. 903–914, Apr. 2017.
[8] S. Ryu, H. Kim, W. Yi, and J.-J. Kim, "BitBlade: Area and energy-efficient precision-scalable neural network accelerator with bitwise summation," in Proc. 56th Annu. Design Autom. Conf., Oct. 2019, pp. 1–6.
[9] S. Ryu et al., "A 44.1 TOPS/W precision-scalable accelerator for quantized neural networks in 28 nm CMOS," in Proc. IEEE Custom Integr. Circuits Conf. (CICC), Oct. 2020, pp. 1–4.
[10] J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H.-J. Yoo, "UNPU: An energy-efficient deep neural network accelerator with fully variable weight bit precision," IEEE J. Solid-State Circuits, vol. 54, no. 1, pp. 173–185, Jan. 2019.
[11] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and A. Moshovos, "Stripes: Bit-serial deep neural network computing," in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), Oct. 2016, pp. 1–12.
[12] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, May 2017.
[13] L. Lai, N. Suda, and V. Chandra, "CMSIS-NN: Efficient neural network kernels for Arm Cortex-M CPUs," 2018, arXiv:1801.06601.
[14] A. Garofalo, M. Rusci, F. Conti, D. Rossi, and L. Benini, "PULP-NN: Accelerating quantized neural networks on parallel ultra-low-power RISC-V processors," Phil. Trans. Roy. Soc. A, Math., Phys. Eng. Sci., vol. 378, no. 2164, Feb. 2020, Art. no. 20190155.
[15] N. P. Jouppi et al., "In-datacenter performance analysis of a tensor processing unit," in Proc. 44th Annu. Int. Symp. Comput. Archit., 2017, pp. 1–12.
[16] C.-H. Lin et al., "7.1 A 3.4-to-13.3 TOPS/W 3.6 TOPS dual-core deep-learning accelerator for versatile AI applications in 7 nm 5G smartphone SoC," in IEEE ISSCC Dig. Tech. Papers, Oct. 2020, pp. 134–136.
[17] Y. Jiao et al., "7.2 A 12 nm programmable convolution-efficient neural-processing-unit chip achieving 825 TOPS," in IEEE ISSCC Dig. Tech. Papers, Oct. 2020, pp. 136–140.
[18] H. Sharma. Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks. Accessed: Aug. 1, 2019. [Online]. Available: https://github.com/hsharma35/bitfusion
[19] Mobile LPDDR3 SDRAM: 178-Ball, Single-Channel Mobile LPDDR3 SDRAM Features. Accessed: Aug. 1, 2019. [Online]. Available: https://www.micron.com/products/dram/lpdram/16Gb
[20] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, "TETRIS: Scalable and efficient neural network acceleration with 3D memory," ACM SIGPLAN Notices, vol. 52, no. 4, pp. 751–764, May 2017.
[21] S. Ryu, Y. Oh, and J.-J. Kim, "MobileWare: A high-performance MobileNet accelerator with channel stationary dataflow," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), Oct. 2021, pp. 1–8.

Sungju Ryu (Member, IEEE) received the B.S. degree in electrical engineering from Pusan National University, Busan, Republic of Korea, in 2015, and the Ph.D. degree from the Department of Creative IT Engineering (CiTE), Pohang University of Science and Technology (POSTECH), Pohang, Republic of Korea, in 2021. He was a Staff Researcher with the AI&SW Research Center, Samsung Advanced Institute of Technology (SAIT), Suwon, Republic of Korea, in 2021, where he focused on computer architecture design. He is currently an Assistant Professor with the School of Electronic Engineering, Soongsil University, Seoul, Republic of Korea. His current research interests include energy-efficient hardware accelerators for deep neural networks, low-power VLSI design, high-performance computing, and in-/near-memory computing.

Hyungjun Kim (Member, IEEE) received the B.S. and Ph.D. degrees from the Pohang University of Science and Technology (POSTECH), Pohang, Republic of Korea, in 2016 and 2021, respectively. He did an internship at the Holst Centre, Eindhoven, The Netherlands, for organic memory diode design from January to September 2015 and also spent the summer of 2018 at the IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA, as a Research Intern for in-memory neural network hardware design. His current research interest includes hardware–software co-design of deep neural network accelerators.

Wooseok Yi received the B.S. degree in physics from the Pohang University of Science and Technology (POSTECH), Pohang, South Korea, in 2013, and the Ph.D. degree from the Department of Creative IT Engineering (CiTE), POSTECH, in 2020. He is currently a Staff Researcher with the AI&SW Research Center, Samsung Advanced Institute of Technology (SAIT), Suwon, Republic of Korea. His current research interests include neuromorphic hardware, spin-transfer torque random access memory, in-/near-memory computing, and brain-inspired computing.

Eunhwan Kim (Graduate Student Member, IEEE) received the B.S. and M.S. degrees in electrical engineering from Kookmin University, Seoul, South Korea, in 2010 and 2012, respectively. He is currently pursuing the Ph.D. degree with the Pohang University of Science and Technology (POSTECH), Pohang, South Korea. From 2012 to 2014, he worked on the design of the display driver interface at DB Hitek, Seoul. From 2014 to 2018, he was a Research Associate with the i-Lab, POSTECH, working on hardware security. His current research interests include low-power circuit design and hardware security.

Yulhwa Kim (Graduate Student Member, IEEE) received the B.S. degree in creative IT engineering (CiTE) from the Pohang University of Science and Technology, Pohang, South Korea, in 2016, where she is currently pursuing the Ph.D. degree. Her current research interests include hardware–software co-design of deep neural network accelerators and in-memory computing.

Taesu Kim (Student Member, IEEE) received the B.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2016. He is currently pursuing the Ph.D. degree with the Department of Creative IT Engineering (CiTE), Pohang University of Science and Technology (POSTECH), Pohang, South Korea. His current research interests include network compression, tiny machine learning, and energy-efficient machine learning hardware accelerators.

Jae-Joon Kim (Member, IEEE) received the B.S. and M.S. degrees in electronics engineering from Seoul National University, Seoul, South Korea, in 1994 and 1998, respectively, and the Ph.D. degree from the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA, in 2004. From 2004 to 2013, he was a Research Staff Member with the IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA. He was a Professor with the Pohang University of Science and Technology, Pohang, South Korea, from 2013 to 2021. He is currently a Professor with Seoul National University. His current research interests include the design of deep learning hardware accelerators, neuromorphic processors, hardware security circuits, and circuits for exploratory devices.

