Fig. 2. (a) Baseline design (BF) and (b) proposed BB architecture. B(x, y) describes a BitBrick.
Fig. 4. Implementation example of bitwise summation. In this example, 4-bit × 4-bit MAC operations are performed. The PE array has four PEs, and each PE has four BitBricks for simplicity. The four stages (a)–(d) of the bitwise summation are shown separately for illustration, but all stages are performed simultaneously.
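To make the decomposition concrete, here is a minimal Python sketch of a precision-decomposed MAC with shift-after-accumulate, the idea behind bitwise summation: BitBrick products that need the same shift amount are accumulated first, and the shift is applied once per group rather than once per BitBrick. The 2-bit BitBrick granularity matches the paper, but the function and variable names are ours and the loop structure is illustrative, not a description of the RTL.

```python
import numpy as np

BITS = 4           # operand precision in this example (Fig. 4 uses 4-bit x 4-bit)
BRICK = 2          # BitBrick precision
SEGS = BITS // BRICK

def split(x):
    """Split an unsigned BITS-bit value into SEGS little-endian 2-bit segments."""
    return [(x >> (BRICK * i)) & (2 ** BRICK - 1) for i in range(SEGS)]

def mac_bitwise_summation(acts, wgts):
    """Dot product of 4-bit vectors built from 2-bit BitBrick products.

    Partial products that need the same shift amount (i + j) are accumulated
    first, and the variable shift is applied once per group, instead of
    shifting every BitBrick output individually as in the baseline.
    """
    group_sum = {}  # shift amount -> accumulated BitBrick products
    for a, w in zip(acts, wgts):
        for i, a_seg in enumerate(split(a)):
            for j, w_seg in enumerate(split(w)):
                shift = BRICK * (i + j)
                group_sum[shift] = group_sum.get(shift, 0) + a_seg * w_seg
    return sum(partial << shift for shift, partial in group_sum.items())

rng = np.random.default_rng(0)
acts = rng.integers(0, 2 ** BITS, size=16)
wgts = rng.integers(0, 2 ** BITS, size=16)
assert mac_bitwise_summation(acts, wgts) == int(np.dot(acts, wgts))
```

In a baseline BF-style organization, every BitBrick output would be shifted individually before being added, which is where the extra variable-shift logic that bitwise summation removes comes from.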
Fig. 8. (a) The proposed CAS has simpler input/weight aligning logic than the spatial-domain-wise aligning method. (b) Applying the CAS to convolutional and fully connected layers. "IA" and "W" denote the input activation and the weight, respectively.
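As a rough illustration of the channel-wise aligning idea (not the chip's actual address-generation logic), the sketch below pulls the operands for one PE from consecutive input-channel indices of a single pixel of a CHW-ordered feature map. The tensor shapes, names, and the fixed kernel tap are assumptions made for the example.

```python
import numpy as np

def cas_fetch(ifmap, weights, x, y, oc, c0, lanes):
    """Channel-wise aligning: give one PE the operands stored at `lanes`
    consecutive input-channel indices of a single pixel (x, y), paired with
    the same channel slice of one output channel's kernel (tap (0, 0) here).

    ifmap:   (C, H, W) input activations
    weights: (OC, C, KH, KW) kernel weights
    """
    ia_vec = ifmap[c0:c0 + lanes, y, x]          # contiguous along the channel axis
    w_vec = weights[oc, c0:c0 + lanes, 0, 0]
    return ia_vec, w_vec

rng = np.random.default_rng(0)
ifmap = rng.integers(0, 4, size=(256, 14, 14))      # 2-bit activations
weights = rng.integers(0, 4, size=(64, 256, 3, 3))  # 2-bit weights
ia, w = cas_fetch(ifmap, weights, x=3, y=5, oc=0, c0=0, lanes=16)
psum = int(np.dot(ia, w))  # one PE's 16-way 2-bit MAC for this cycle
```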
Fig. 9. Activation and quantization flow. The built-in activation and quantization logic supports the ReLU activation function and simple truncation. When other types of activation functions or quantization logic are required, the psums are sent to the host CPU to be activated and quantized.

D. Dataflow

We implemented the PE arrays of our test chip with a weight-stationary dataflow, in the same manner as Google's TPU [15]. Weights are pre-stored in the PEs and reused by multiple input activations. When weights are loaded to the PEs, each PE array receives 32 bits/cycle because a PE requires 16 2-bit weights in the worst case; the 16 PE arrays therefore communicate with the global buffer at 32 × 16 = 512 bits/cycle for the weights. For input activations, a PE likewise requires 16 2-bit inputs in the worst case. A PE array has 16 PEs, and the inputs are reused by all PE arrays, so the 16 PE arrays communicate with the global buffer at 32 × 16 = 512 bits/cycle for the inputs. Each PE array generates a 32-bit psum. With the weight-stationary dataflow, the psums generated from the PE arrays are stored in and loaded from the global accumulator buffer until they are complete [15]; hence, the 16 PE arrays communicate with the global buffer at 32 × 16 × 2 (read/write) = 1024 bits/cycle for the psums. When the psums are complete, they are fed into the activation unit (ReLU) in each PE array before being transferred to the global buffer.

Most of the layers, such as convolutional, residual, and fully connected layers, are expressed as multiple vector–matrix multiplications and additions/accumulations. Convolutional and fully connected layers are computed on the PE arrays as follows: the inputs and weights are first fetched to the PE arrays; the inputs are broadcast to the PE arrays, and the weights of an output channel are dedicated to each PE array. After the inner products are complete, the kernel window slides to the next stride, and this operation repeats until the convolutional computation is finished. Fully connected layers are computed in a similar manner: inputs are broadcast across all the PE arrays, and a single weight column is sent to each PE array.
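The global-buffer bandwidth figures quoted in this subsection follow from simple bookkeeping; the sketch below merely re-derives them (the constants and names are ours, not the RTL's).

```python
# Re-deriving the global-buffer bandwidth numbers quoted above.
PE_ARRAYS = 16            # PE arrays in the test chip
PES_PER_ARRAY = 16        # PEs per array
OPERANDS_PER_PE = 16      # worst case: 16 operands per PE in the 2-bit mode
OPERAND_BITS = 2
PSUM_BITS = 32

bits_per_pe = OPERANDS_PER_PE * OPERAND_BITS   # 32 bits
weight_bw = bits_per_pe * PE_ARRAYS            # 32 x 16 = 512 bits/cycle
input_bw = bits_per_pe * PES_PER_ARRAY         # 32 x 16 = 512 bits/cycle (inputs broadcast to all arrays)
psum_bw = PSUM_BITS * PE_ARRAYS * 2            # read + write = 1024 bits/cycle

print(weight_bw, input_bw, psum_bw)            # 512 512 1024
```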
A. Tiling Method

While the proposed CAS efficiently fetches data from the global buffer to the PE arrays, it cannot by itself maximize multiplier array utilization, because kernels that are smaller than the PE array leave part of the multipliers idle.

Fig. 10(a) illustrates 32 PE arrays (32 × 32 PEs) in the 2-bit precision case. Each PE consists of 16 2-bit multipliers, so a PE simultaneously performs a single (one 8-bit) or multiple (up to 16 2-bit) multiplications depending on the target bit width, as explained in Section III. In the CAS, each PE array receives the inputs/weights located at consecutive input-channel indices of a single pixel in the spatial domain (see Fig. 8). When the number of multipliers in a PE array is large [512 2-bit multipliers in Fig. 10(a)] and the number of inputs/weights in the channel direction is not large enough to fill the PEs [256 in Fig. 10(a)], the multipliers that do not receive data sit idle, degrading performance.

To mitigate this problem, we introduce a CFPL scheme that handles multiple pixels in the spatial domain simultaneously [see Fig. 10(b)]. If the number of inputs/weights in a pixel is 256, two pixels are handled in a PE array at the same time because 512 2-bit multipliers are available in a PE array in this example. Furthermore, when #IC = 128 [see Fig. 10(c)], four pixels can be loaded to the PE array in the same manner. However, such a fine-grained approach may increase the logic complexity needed to support various input-channel sizes [see Fig. 8(a)]. Therefore, we still feed only two pixels to the PE array in the #IC = 128 case at the expense of potential performance degradation [see Fig. 10(d)]. This coarse-grained approach has lower logic complexity than the fine-grained case while still achieving higher multiplier utilization than the original CAS [see Fig. 10(a)]. However, if the number of input channels (#IC) is very small, the coarse-grained approach cannot maintain high throughput. For example, when the number of input channels is 64 and two pixels are handled simultaneously, PE utilization is only 1/4; it drops to 1/8 with only 32 input channels.
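The utilization figures quoted above can be reproduced with a small model of one PE array. The sketch below assumes 512 2-bit multipliers per PE array, as in the Fig. 10 example, and is an illustration rather than the scheduler used on the chip.

```python
def pe_array_utilization(num_input_channels, multipliers_per_array=512, max_pixels=1):
    """Fraction of 2-bit multipliers in one PE array that receive operands.

    With channel-only aligning (CAS), a PE array is fed the input channels of a
    single pixel (max_pixels = 1); CFPL packs up to `max_pixels` pixels at once.
    """
    pixels = max(1, min(max_pixels, multipliers_per_array // num_input_channels))
    return min(1.0, pixels * num_input_channels / multipliers_per_array)

# Reproducing the numbers quoted above (512 2-bit multipliers per PE array):
print(pe_array_utilization(256, max_pixels=1))  # CAS:               0.5
print(pe_array_utilization(256, max_pixels=2))  # CFPL, two pixels:  1.0
print(pe_array_utilization(64,  max_pixels=2))  # coarse-grained:    0.25
print(pe_array_utilization(32,  max_pixels=2))  # coarse-grained:    0.125
```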
Fig. 10. (a) CAS (channel-only tiling) with 32 PE arrays (32 × 32 PEs) in the 2-bit precision case. (b)–(d) CFPL schemes. Fine-grained tiling with (b) #IC = 256 and (c) #IC = 128 cases. (d) Coarse-grained tiling with the #IC = 128 case. "IC," "OC," and "P" indicate the input channel, the output channel, and the pixel, respectively.
Fig. 16. (a) Chip micrograph of the test chip. (b) Experimental setup. We used a Xilinx ZC706 FPGA evaluation board as a DRAM interface.

Fig. 17. Measured (a) throughput and (b) energy efficiency of the test chip.

…the #BitBricks = 8. The overhead is significantly reduced in our BB due to the bitwise summation scheme.

The comparison of the energy consumption among the accelerators is illustrated in Fig. 15. We first evaluated the energy consumption on various networks [see Fig. 15(a)–(d)]. To evaluate the core-level energy consumption, we ran a time-based analysis using Synopsys PrimeTime PX with a real dataset. However, performing gate-level simulation for all layers of a neural network takes too long; therefore, we calculated the average power consumption from selected matrix-multiplication components of each neural network instead. Using this average power information, we calculated the energy consumption for the entire network. Envision shows comparable energy consumption to the BF in the 8-bit case, but it suffers from a significant increase in energy consumption in the 2-bit case because only 25% of its submultipliers are used for the SIMD multiplication. In contrast, the BB architecture consumes less energy than the baselines.

Table I shows the top-one validation set accuracy [3] on AlexNet. An 8-bit quantization for both activations and weights shows 54.5% accuracy. With 4- and 2-bit quantization, the accuracy is 54.5% and 51.3%, respectively, while the core-level energy efficiency is improved by almost 4× and 16×, respectively.
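The roughly 4× and 16× core-level efficiency gains track how many low-precision products each PE can form per cycle. A back-of-the-envelope sketch (our own simplification, ignoring memory and control overheads):

```python
def products_per_pe(bit_width, bitbricks=16, brick_bits=2):
    """Independent bit_width x bit_width products one PE can form per cycle,
    given 16 2-bit x 2-bit BitBricks per PE: each product consumes
    (bit_width / brick_bits)^2 BitBricks."""
    return bitbricks // (bit_width // brick_bits) ** 2

for bits in (8, 4, 2):
    print(bits, products_per_pe(bits))  # 8 -> 1, 4 -> 4, 2 -> 16
```

Relative to the 8-bit case, per-PE throughput grows about 4× at 4 bit and 16× at 2 bit, consistent with the quoted core-level efficiency improvements.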
B. Test-Chip Measurements

To validate the effectiveness of the proposed schemes, we fabricated a test chip in a 28-nm CMOS technology. Fig. 16(a) shows the chip micrograph; the test chip consists of 16 × 16 PEs and a 144-kB SRAM global buffer, and the die area is 0.71 mm² (see Table II). A Xilinx ZC706 FPGA evaluation board was used as the DRAM interface [see Fig. 16(b)]. We verified the function of the test chip using the ImageNet dataset. Weights are pre-trained using the PyTorch framework. The input image and pre-trained weights are first loaded into the DRAM of the FPGA evaluation board, and the inputs/weights are then sent to the test chip. The test chip performs the inference task using the inputs/weights received from the FPGA evaluation board. After the inference task is complete, the results are sent to the terminal through the interface on the FPGA evaluation board [see Fig. 16(b)].

Fig. 17 shows the measurement results of the test chip. The test chip operates at up to 195 MHz while consuming 74 mW at a 1.0-V supply voltage. The maximum throughput with 2-bit precision is 1.42 TOPS. At 0.6 V, the test chip shows a maximum energy efficiency of 44.1 TOPS/W while consuming 7.8 mW for the 2-bit precision computation. By applying the proposed bitwise summation and the CAS, the peak performance per compute area (compute density [TOPS/mm²]) of the test chip is improved by up to 7.7× and 5.1× compared with the Envision [5] and UNPU [10] baselines, respectively (see Table II). The peak on-chip energy efficiency of the test chip is higher than that of Envision by 10.3× and is comparable to that of UNPU.

C. System-Level Evaluation

It is well known that DRAM communication consumes much more energy than on-chip computation [12]. Therefore, a system-level evaluation that includes DRAM accesses often provides a more realistic assessment.

To evaluate system-level energy consumption, we directly used chip measurements for the core level and simulation results for the memory accesses. We do not have a BF chip, so we used the power simulation result of a synthesized BF model. For the UNPU design, we directly adopted the measurements from the UNPU paper [10]. The UNPU was fabricated in a 65-nm technology, whereas our design uses a 28-nm technology. For a fair comparison, the UNPU results were scaled to a 28-nm technology based on our simulated energy-consumption ratio between 28- and 65-nm gate-level standard cells, using representative cells such as NAND2, NOR2, INV, and D flip-flop (UNPU_28 nm in the results). We used a system-level BF simulator and its default simulation parameters as follows [18]. The off-chip DRAM bandwidth is 192 bit/cycle, and the Micron LPDDR3 model [19] is used for the analysis of DRAM communication. Both read and write energies are assumed to be 15 pJ/bit [6], [20].

Fig. 18 compares the area and energy consumption of our test chip with those of UNPU and BF. BF only proposed a precision-scalable PE and did not consider the interface between an SRAM global buffer and the PE arrays; therefore, we apply the proposed CAS to both the BF and our design for a fair comparison. Due to the proposed bitwise summation, our test chip (Ours_a1w1o1) shows a smaller chip area than that of the BF (BF_a1w1o1). For a fair comparison, we applied the same area constraint as that of BF and increased the SRAM capacities of the input/weight/output buffers in our design.
TABLE II
Comparison of Test Chip With Previous State-of-the-Art Precision-Scalable Accelerator Chips

Fig. 18. (a) Area and (b) system-level energy consumption on the ImageNet dataset. a{x}w{y}o{z}: Ifmap/Psum SRAM = 32*x + 16*z kB, and weight SRAM = 64*y kB.

The increased global buffer size allows our design to achieve a larger amount of parameter reuse by uploading larger input/weight/output tiles to the on-chip SRAM buffer than the baseline, thereby minimizing the number of DRAM accesses.
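To make the DRAM-access argument concrete, here is a minimal energy model using the per-bit energy cited in the evaluation setup (15 pJ/bit for both reads and writes). The traffic volumes in the example are placeholders, not measured values.

```python
DRAM_PJ_PER_BIT = 15.0  # read and write energy assumed in the evaluation setup

def dram_energy_mj(read_bits, write_bits):
    """System-level DRAM energy (mJ) for a given amount of off-chip traffic."""
    return (read_bits + write_bits) * DRAM_PJ_PER_BIT * 1e-9

# Placeholder traffic volumes, only to show the trend: larger on-chip tiles
# cut refetch traffic, and DRAM energy scales linearly with that traffic.
print(dram_energy_mj(read_bits=8e9, write_bits=1e9))  # smaller tiles, more refetches
print(dram_energy_mj(read_bits=4e9, write_bits=1e9))  # larger tiles, fewer refetches
```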
We evaluate the area and energy consumption for various SRAM capacities. Among the SRAM configurations, our test chip with a 3× larger weight SRAM buffer (Ours_a1w3o1) shows a similar area to the BF baseline (BF_a1w1o1). The 3× larger weight SRAM buffer improves the overall energy efficiency by 36% for the VGG-16 workload with 2-bit precision due to the increased on-chip reuse. The UNPU, on the other hand, has a very different microarchitecture from BF and ours, so it is difficult to modify its SRAM capacity and/or number of PEs for a comparison. The UNPU has a larger number of PEs and larger SRAM buffers than ours, but we perform the evaluation using the original UNPU configuration. Although the feature-map-reuse scheme used in the UNPU maximizes the on-chip energy efficiency (see Table II), it leads to a large number of external memory accesses (EMA) for the psums [10]. As a result, our design shows up to 64% lower energy consumption than the UNPU.

We also evaluate the system-level energy efficiency on various neural networks (AlexNet, ResNet-18, VGG-16, and MobileNetV2) with several bit widths (see Fig. 19). Our architecture reduces the overall energy consumption by 27%–39% compared with the BF. The energy efficiency of the UNPU is comparable to that of our design in the IA/W = 8-b/8-b case, but the number of DRAM accesses for the psums increases significantly in lower bit configurations due to the feature-map-reuse method of the UNPU.

Fig. 19. System-level energy consumption on various neural networks and bit widths. a{x}w{y}o{z}: Ifmap/Psum SRAM = 32*x + 16*z kB, and weight SRAM = 64*y kB.

D. Evaluation of Channel-First and Pixel-Last Tiling

When the kernels of a neural network have fewer weights than the number of effective multipliers of a PE array, the utilization of the multipliers is reduced under the proposed CAS. The proposed CFPL scheme (see Fig. 10) increases the multiplier array utilization by simultaneously handling multiple pixels in each PE array. Fig. 20 evaluates the proposed CFPL scheme and compares it with the CAS for 2- and 4-bit width configurations. The fine-grained tiling maximizes utilization by filling the PE arrays with input/weight pairs located at multiple pixels. By doing so, the fine-grained tiling increases the multiplier utilization by 1.1–7.2× and 1.9–14.2× in the 4- and 2-bit cases, respectively. However, the fine-grained tiling leads to a large area overhead because the number of pixels concurrently handled in a PE array varies depending on the kernel size. The coarse-grained tiling, on the other hand, mitigates the area overhead by fixing the maximum number of pixels concurrently computed in a clock cycle.
Fig. 20. Utilization of multipliers with CAS and CFPL in the (a) 2- and (b) 4-bit precision cases. F-CFPL: fine-grained CFPL. C-CFPL(P): coarse-grained CFPL, where P is the maximum number of pixels handled in a clock cycle on each PE array. The numbers next to the network names on the x-axis are the number of PEs in our accelerator.

Fig. 22. Area overhead of the CFPL scheme for various numbers of PEs. F-CFPL: fine-grained CFPL. C-CFPL(P): coarse-grained CFPL, where P is the maximum number of pixels handled in a clock cycle on each PE array.
The coarse-grained tiling scheme with two pixels consumes more energy than the fine-grained tiling scheme because the fine-grained tiling has much higher utilization than the coarse-grained tiling with two pixels. In contrast, the utilization of the coarse-grained tiling with 16 pixels is comparable to that of the fine-grained tiling, while its area overhead is still smaller than that of the fine-grained scheme. Therefore, the coarse-grained tiling scheme shows lower energy consumption than the fine-grained tiling on almost all networks, except for the 2-bit quantized AlexNet with 4096 PEs (AlexNet_4096), where the average number of input/weight pairs per dot product is smaller than in the other networks and only a small number of BitBricks are used for the 2-bit computation.

We fabricated our test chip with the coarse-grained tiling scheme with one pixel (C-CFPL(1)). However, we simulated multiple C-CFPL setups to assess potential improvements to our approach, as described in this section.
VII. DISCUSSION

A. Further Optimization for Binary Neural Networks

In our BB design, each BitBrick uses a 2-bit multiplier, which is the same configuration as in BF. To compute binary neural networks, all the binary inputs/weights are converted into 2-bit numbers before starting the computation (see Section III-B). Although we can reduce the number of DRAM accesses by using 1-bit parameters when computing binary neural networks, we cannot maximize the on-chip performance and energy efficiency because such a binary computation consumes the same on-chip power as the 2-bit computation. For each BitBrick, a 1-bit multiplier can be used instead of a 2-bit multiplier to further improve computational efficiency. Fig. 24 shows the area comparison of the PE array between BF and our BB. To support 1-bit multiplication, the PE area increases in both architectures because the reconfigurable logic for the variable bit-shift operation needs to support an additional 1-bit case. However, the BB design shows a smaller area overhead than BF because the proposed bitwise summation requires fewer variable-shift logics than the BF case. Therefore, BB with 1-bit multiplier-based BitBricks can improve the throughput of binary computation by 4× with a 48% area overhead, and its area is smaller than the BF case by 54%.

Fig. 24. Area comparison of the PE array between BF and BB when 1-/2-bit multipliers are used for BitBricks.
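A rough packing model of why 1-bit BitBricks quadruple binary throughput: the area of one 2-bit × 2-bit BitBrick can host four 1-bit × 1-bit products. The sketch below is our own illustration of that counting argument, not the chip's implementation.

```python
def binary_products_per_pe(brick_operand_bits, bitbricks=16, brick_bits=2):
    """Binary (1-bit x 1-bit) products one PE can form per cycle.

    When binary operands are zero-extended to 2 bit, each 2-bit BitBrick still
    yields only one product; with 1-bit BitBricks, the same multiplier area is
    split into (brick_bits / brick_operand_bits)^2 smaller units.
    """
    return bitbricks * (brick_bits // brick_operand_bits) ** 2

print(binary_products_per_pe(2))  # 16: binary data packed into 2-bit BitBricks
print(binary_products_per_pe(1))  # 64: 4x binary throughput with 1-bit BitBricks
```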
B. Quantization Methods

WRPN [3] increases the number of channels to regain accuracy in quantized networks. TWN and BWN use ternary and binary precision for the weights, respectively [2]. In XNOR-Net [1], multiplications are replaced with XNOR operations using binary numbers for both inputs and weights. For 2-bit QNNs, PACT and SAWB [4] use a trainable activation clipping parameter and statistics of the weight distribution, respectively. There is no dedicated quantizer logic in our design; instead, our chip receives the QNN parameters from external memory, and the quantization is assumed to be done at the software level.
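For context, here is a minimal sketch of the kind of clipped, uniform 2-bit activation quantizer such methods produce (PACT-like); the clipping value and rounding details are illustrative assumptions, not the published algorithms.

```python
import numpy as np

def quantize_activation(x, alpha=2.0, bits=2):
    """PACT-style uniform activation quantization: clip to [0, alpha], then
    round onto 2**bits - 1 steps. `alpha` stands in for the trainable clipping
    parameter mentioned above; its value here is arbitrary."""
    levels = 2 ** bits - 1
    x = np.clip(x, 0.0, alpha)
    return np.round(x * levels / alpha) * alpha / levels

x = np.array([-0.3, 0.1, 0.9, 1.4, 3.7])
print(quantize_activation(x))  # values snapped to {0, alpha/3, 2*alpha/3, alpha}
```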
C. Computing Depthwise Convolution

In a standard convolutional layer, the inputs are convolved with the weights of multiple output channels. Hence, the inputs are broadcast to the PE arrays and reused across multiple PE arrays. In contrast, the depthwise convolutional layer has no input reusability: the inputs are convolved with the weights of a single output channel, so the inputs are sent to only one PE array. Therefore, computing depthwise convolutional layers shows lower multiplier array utilization than the standard convolutional layer case. For example, if the number of PE arrays is 16, as in our test chip, the maximum utilization rate is only 1/16. Previous works also suffer from such under-utilization of the multipliers on depthwise convolutional layers, so our design still shows an improvement on the MobileNet workload, which uses depthwise convolutional layers (see Fig. 19). However, our work mainly targets intra-PE-array-level design optimization; if other independent design schemes, such as a channel-stationary dataflow [21], are adopted in our work, the multiplier array utilization can be recovered as well.

VIII. CONCLUSION

We introduced an area- and energy-efficient hardware accelerator, BB, for QNNs. We first introduced a bitwise summation method to reduce the area/power overheads of supporting variable bit widths. We also presented a channel-wise aligning method to fetch data from the global buffer to the PE arrays more efficiently. The experimental results show that the throughput and system-level energy efficiency were increased by up to 7.7× and 1.64×, respectively, compared with previous works. To maximize utilization across various kernel sizes, a CFPL scheme was also introduced, providing additional improvements of 1.5–3.4× over the CAS.

ACKNOWLEDGMENT

The chip fabrication was supported by Samsung Electronics, and the EDA tool was supported by the IC Design Education Center (IDEC).

REFERENCES

[1] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in Proc. Eur. Conf. Comput. Vis. New York, NY, USA: Springer, 2016, pp. 525–542. [Online]. Available: https://eccv2020.eu/
[2] F. Li, B. Zhang, and B. Liu, "Ternary weight networks," 2016, arXiv:1605.04711.
[3] A. Mishra, E. Nurvitadhi, J. J. Cook, and D. Marr, "WRPN: Wide reduced-precision networks," 2017, arXiv:1709.01134.
[4] J. Choi, S. Venkataramani, V. Srinivasan, K. Gopalakrishnan, Z. Wang, and P. Chuang, "Accurate and efficient 2-bit quantized neural networks," in Proc. 2nd SysML Conf., 2019.
[5] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, "14.5 Envision: A 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28 nm FDSOI," in IEEE ISSCC Dig. Tech. Papers, Feb. 2017, pp. 246–247.
[6] H. Sharma et al., "Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural network," in Proc. ACM/IEEE 45th Annu. Int. Symp. Comput. Archit. (ISCA), 2018, pp. 764–775.
[7] B. Moons and M. Verhelst, "An energy-efficient precision-scalable ConvNet processor in 40-nm CMOS," IEEE J. Solid-State Circuits, vol. 52, no. 4, pp. 903–914, Apr. 2017.
[8] S. Ryu, H. Kim, W. Yi, and J.-J. Kim, "BitBlade: Area and energy-efficient precision-scalable neural network accelerator with bitwise summation," in Proc. 56th Annu. Design Autom. Conf., Oct. 2019, pp. 1–6.
[9] S. Ryu et al., "A 44.1 TOPS/W precision-scalable accelerator for quantized neural networks in 28 nm CMOS," in Proc. IEEE Custom Integr. Circuits Conf. (CICC), Oct. 2020, pp. 1–4.
[10] J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H.-J. Yoo, "UNPU: An energy-efficient deep neural network accelerator with fully variable weight bit precision," IEEE J. Solid-State Circuits, vol. 54, no. 1, pp. 173–185, Jan. 2019.
[11] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and A. Moshovos, "Stripes: Bit-serial deep neural network computing," in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), Oct. 2016, pp. 1–12.
[12] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, May 2017.
[13] L. Lai, N. Suda, and V. Chandra, "CMSIS-NN: Efficient neural network kernels for Arm Cortex-M CPUs," 2018, arXiv:1801.06601.
[14] A. Garofalo, M. Rusci, F. Conti, D. Rossi, and L. Benini, "PULP-NN: Accelerating quantized neural networks on parallel ultra-low-power RISC-V processors," Phil. Trans. Roy. Soc. A, Math., Phys. Eng. Sci., vol. 378, no. 2164, Feb. 2020, Art. no. 20190155.
[15] N. P. Jouppi et al., "In-datacenter performance analysis of a tensor processing unit," in Proc. 44th Annu. Int. Symp. Comput. Archit., 2017, pp. 1–12.
[16] C.-H. Lin et al., "7.1 A 3.4-to-13.3 TOPS/W 3.6 TOPS dual-core deep-learning accelerator for versatile AI applications in 7 nm 5G smartphone SoC," in IEEE ISSCC Dig. Tech. Papers, Oct. 2020, pp. 134–136.
[17] Y. Jiao et al., "7.2 A 12 nm programmable convolution-efficient neural-processing-unit chip achieving 825 TOPS," in IEEE ISSCC Dig. Tech. Papers, Oct. 2020, pp. 136–140.
[18] H. Sharma. Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks. Accessed: Aug. 1, 2019. [Online]. Available: https://github.com/hsharma35/bitfusion
[19] Mobile LPDDR3 SDRAM: 178-Ball, Single-Channel Mobile LPDDR3 SDRAM Features. Accessed: Aug. 1, 2019. [Online]. Available: https://www.micron.com/products/dram/lpdram/16Gb
[20] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, "TETRIS: Scalable and efficient neural network acceleration with 3D memory," ACM SIGPLAN Notices, vol. 52, no. 4, pp. 751–764, May 2017.
[21] S. Ryu, Y. Oh, and J.-J. Kim, "MobileWare: A high-performance MobileNet accelerator with channel stationary dataflow," in Proc. IEEE/ACM Int. Conf. Computer-Aided Design (ICCAD), Oct. 2021, pp. 1–8.

Hyungjun Kim (Member, IEEE) received the B.S. and Ph.D. degrees from the Pohang University of Science and Technology (POSTECH), Pohang, Republic of Korea, in 2016 and 2021, respectively. He did an internship at the Holst Centre, Eindhoven, The Netherlands, for organic memory diode design from January to September 2015 and also spent the summer of 2018 at the IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA, as a Research Intern for in-memory neural network hardware design. His current research interest includes hardware–software co-design of deep neural network accelerators.

Wooseok Yi received the B.S. degree in physics from the Pohang University of Science and Technology (POSTECH), Pohang, South Korea, in 2013, and the Ph.D. degree from the Department of Creative IT Engineering (CiTE), POSTECH, in 2020. He is currently a Staff Researcher with the AI&SW Research Center, Samsung Advanced Institute of Technology (SAIT), Suwon, Republic of Korea. His current research interests include neuromorphic hardware, spin-transfer torque random access memory, in-/near-memory computing, and brain-inspired computing.

Eunhwan Kim (Graduate Student Member, IEEE) received the B.S. and M.S. degrees in electrical engineering from Kookmin University, Seoul, South Korea, in 2010 and 2012, respectively. He is currently pursuing the Ph.D. degree with the Pohang University of Science and Technology (POSTECH), Pohang, South Korea. From 2012 to 2014, he worked on the design of the display driver interface at DB Hitek, Seoul. From 2014 to 2018, he was a Research Associate with the i-Lab, POSTECH, working on hardware security. His current research interests include low-power circuit design and hardware security.

Yulhwa Kim (Graduate Student Member, IEEE) received the B.S. degree in creative IT engineering (CiTE) from the Pohang University of Science and Technology, Pohang, South Korea, in 2016, where she is currently pursuing the Ph.D. degree. Her current research interests include hardware–software co-design of deep neural network accelerators and in-memory computing.

Taesu Kim (Student Member, IEEE) received the B.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2016. He is currently pursuing the Ph.D. degree with the Department of Creative IT Engineering (CiTE), Pohang University of Science and Technology (POSTECH), Pohang, South Korea. His current research interests include network compression, tiny machine learning, and energy-efficient machine learning hardware accelerators.