Professional Documents
Culture Documents
Hardware-Robust In-RRAM-Computing
for Object Detection
Yu-Hsiang Chiang, Cheng-En Ni, Yun Sung, Tuo-Hung Hou , Senior Member, IEEE,
Tian-Sheuan Chang , Senior Member, IEEE, and Shyh-Jye Jou , Senior Member, IEEE
Abstract— In-memory computing is becoming a popular To solve the above issues, one common approach is to model
architecture for deep-learning hardware accelerators recently due these nonideal effects during model training such that the
to its highly parallel computing, low power, and low area cost. trained model can more tolerate these nonideal effects. NIA [2]
However, in-RRAM computing (IRC) suffered from large device
variation and numerous nonideal effects in hardware. Although showed the accuracy degradation caused by IR drop. The
previous approaches including these effects in model training accuracy dropped to 32% on a simple dataset such as MNIST
successfully improved variation tolerance, they only considered even with a relatively small crossbar array size(64 × 64). This
part of the nonideal effects and relatively simple classification accuracy degradation could be recovered to 98.3% by retrain-
tasks. This paper proposes a joint hardware and software ing networks with approximated IR drop noise. IR-QNN [3]
optimization strategy to design a hardware-robust IRC macro
for object detection. We lower the cell current by using a low showed that the problems of IR drop, device variation, and
word-line voltage to enable a complete convolution calculation in driver nonlinearity could be solved by training with an average
one operation that minimizes the impact of nonlinear addition. mask matrix that helps to predict the nonideal effects on
We also implement ternary weight mapping and remove batch BNN. Lee et al. [4] also showed a method to recover the
normalization for better tolerance against device variation, sense accuracy degradation due to IR drop by training the distorted
amplifier variation, and IR drop problem. An extra bias is
included to overcome the limitation of the current sensing vector-matrix-multiplication (VMM) when mapping binary
range. The proposed approach has been successfully applied to weights to the crossbar array. Beyond these offline training
a complex object detection task with only 3.85% mAP drop, methods, Chang et al. [5] combined offline variation-aware
whereas a naive design suffers catastrophic failure under these training and online fine-tuning or retraining part of the lay-
nonideal effects. ers to reduce the impacts of device variation and recover
Index Terms— Computing in memory, object detection, accuracy.
RRAM. Most of the related works focus on modeling these nonideal
effects in the training process. However, since these works
I. I NTRODUCTION only perform simple classification tasks and use a smaller
crossbar array size (<256 × 256), the accuracy degradation
I N-MEMORY computing (IMC) [1] has attracted significant
attention in recent years to implement deep neural networks
due to its highly parallel operation by using memory crossbar
could be easily recovered using the proposed training or
retraining methods. By contrast, this paper targets a larger
arrays and low power consumption based on analog com- crossbar array size (1024 × 1024), and a more complex task:
puting. However, the inherent variation of analog computing object detection, which is much more sensitive to nonideal
makes the values of model weight and activation deviated from effects. Therefore, deploying only these offline training or
the original well-trained model and easily leads to degradation retraining methods cannot recover the accuracy loss due to
of model accuracy or even catastrophic failure. This situation the nonideal effects completely.
becomes even worse for in-RRAM-computing (IRC) due to This paper returns to a more fundamental strategy: making
larger device variations of RRAM cells and other nonideal the IRC macro more robust hardware through joint hard-
effects. ware and software optimization. In hardware optimization,
we reduce the RRAM cell current by using a lower word-line
Manuscript received December 5, 2021; revised February 14, 2022 and
March 25, 2022; accepted April 26, 2022. Date of publication May 2, 2022; voltage to enable a complete convolution calculation in one
date of current version June 13, 2022. This work was supported in part by operation rather than relying on external addition from mul-
Taiwan Semiconductor Manufacturing Company (TSMC) and in part by the tiple sub-convolution blocks, thus minimizing the nonlinear
Ministry of Science and Technology, Taiwan, under Grant 109-2634-F-009-
022, Grant 109-2639-E-009-001, Grant 110-2622-8-009-018-SB, and Grant addition effect. In software (algorithm) optimization, we adopt
110-2218-E-A49-015-MBK. This article was recommended by Guest Editor ternary weight mapping without batch normalization (BN) to
T. Kim. (Corresponding author: Tian-Sheuan Chang.) reduce the effects of device variation, sense amplifier (SA)
The authors are with the Institute of Electronics, National Yang Ming Chiao
Tung University, Hsinchu 30010, Taiwan (e-mail: q123783293@gmail.com; variation, and IR drop. The limited range of current sensing is
nison.zxc@gmail.com; sung86084@gmail.com; thhou@mail.nctu.edu.tw; overcome using an extra bias. With these approaches, we do
tschang@nycu.edu.tw; jerryjou@g2.nctu.edu.tw). not require a lengthy training process as in other approaches
Color versions of one or more figures in this article are available at
https://doi.org/10.1109/JETCAS.2022.3171522. while maintaining the mean Average Precision (mAP) loss
Digital Object Identifier 10.1109/JETCAS.2022.3171522 of less than 4% considering hardware nonideality even for
2156-3357 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: TU Delft Library. Downloaded on February 07,2023 at 12:03:23 UTC from IEEE Xplore. Restrictions apply.
548 IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, VOL. 12, NO. 2, JUNE 2022
Authorized licensed use limited to: TU Delft Library. Downloaded on February 07,2023 at 12:03:23 UTC from IEEE Xplore. Restrictions apply.
CHIANG et al.: HARDWARE-ROBUST IN-RRAM-COMPUTING FOR OBJECT DETECTION 549
Authorized licensed use limited to: TU Delft Library. Downloaded on February 07,2023 at 12:03:23 UTC from IEEE Xplore. Restrictions apply.
550 IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, VOL. 12, NO. 2, JUNE 2022
not only the area but also the power of the overall peripheral
circuits.
Fig. 6. Example for the RRAM cell device variation. (a) The ideal current
on the bit-line should be 3 units. (b) The actual current will become 4.4 units This nonlinear effect significantly affects the accuracy of the
due to the variation. MAC results if we only enable partial word-lines and accumu-
late these partial sums through an analog adder. This is because
different values of partial sums possess different amounts of
of the RRAM cells varies within the array and follows a errors. By contrast, if the bit-line current of one complete
log-normal distribution as shown in Fig. 3. Thus, the current convolution is accumulated in a single operation without using
on the bit-line deviates from the ideal value due to the cell the partial sums, the nonlinearity does not directly change the
resistance variation. final result if we only compare the relative current values
Device variation leads to inaccurate accumulated current on of two bit-lines. For example, in Fig. 8, if we split one
the bit-line, i.e. wrong MAC results. Although the variation convolution operation into three times and accumulate three
of binary RRAM cells is relatively small compared to the bit-line currents externally, the total current becomes much
multilevel RRAM cell, it is not negligible. For instance, larger than it should be. However, if we accumulate the total
in Fig. 6, the ideal current accumulated on the bit-line should current in one single operation entirely in the crossbar array,
equal 3 units, but the actual current becomes 4.4 units due to the nonlinearity effect has little impact on the final result.
the variation in individual cells. We simulate the nonlinearity effect by multiplying the
For high resistance state (HRS) cells, its device variation nonlinearity ratio with the current outputs from the crossbar
effect is less significant compared to the LRS cells due to array. The nonlinearity ratio based on the SPICE simulation of
its much higher resistance value and less contribution to the real circuits is fit using two piecewise high-degree polynomials
final accumulated current. In this paper, to simulate the device as followings, where p is the number of LRS cells.
variation effect, we multiply a random mask generated by the ⎧
log-normal distribution of the measured cell resistance before ⎪
⎪1.0286 × 10−8 × p4 − 3.79 × 10−6 × p 3
⎪
⎪
the convolution calculation. ⎪+5.3 × 10−4 × p2 − 3.92 × 10−2 × p + 2.5
⎪
⎪
⎪
Although the variation from a single RRAM cell looks sub- ⎨i f p <= 140;
stantial, the variation on the final bit-line current, i.e. the MAC r ati o =
⎪
⎪
⎪1.8063 × 10−11 × p 4 − 3.204 × 10−8 × p3
value, remains tolerable because the variation distribution by ⎪
⎪
⎪
⎪+2.2495 × 10−5 × p2 −8.057 × 10−3 × p + 1.707
summing a large number (1024) of RRAM cells becomes ⎪
⎩
tighter than an individual cell according to the law of large i f p > 140;
numbers in statistics [15]. (1)
Authorized licensed use limited to: TU Delft Library. Downloaded on February 07,2023 at 12:03:23 UTC from IEEE Xplore. Restrictions apply.
CHIANG et al.: HARDWARE-ROBUST IN-RRAM-COMPUTING FOR OBJECT DETECTION 551
Fig. 8. (a) Accumulating the current partially and comparing them will
get a wrong result. (b) Single accumulation and comparison can avoid the Fig. 10. Relationship between the LRS block location and current drop.
nonlinearity impact. The x-axis is the location of the block, where the smaller number means the
block closer to the bit-line driver. The blue line is the current drop caused
by 32 LRS cells in a single block. The red line is the current drop caused by
160 cells in 5 blocks from the position block 0 to block 4.
Authorized licensed use limited to: TU Delft Library. Downloaded on February 07,2023 at 12:03:23 UTC from IEEE Xplore. Restrictions apply.
552 IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, VOL. 12, NO. 2, JUNE 2022
Fig. 11. Overall model architecture for the IRC based object detection task,
where the GConv Block m, n is the binary group convolution with m input
channels and n output channels.
Fig. 12. Weight mapping for the IRC crossbar array, where 1 means LRS,
bit-line is set to 300µA and too many activated cells exceeds and 0 means HRS. (a) (left) binary weight mapping: (1, 0) at the convolution
bit-line means weight values 1, and −1, respectively, and the reference bit-line
this limit. The convolution result of a bit-line is compared has even distributed 0 and 1. (b) (right) ternary weight mapping: (1, 0), (0, 1),
to the reference bit-line by using a binary SA for the final and (0, 0) pairs at the same wordline means weight values 1, −1, and 0,
activation output. respectively.
Authorized licensed use limited to: TU Delft Library. Downloaded on February 07,2023 at 12:03:23 UTC from IEEE Xplore. Restrictions apply.
CHIANG et al.: HARDWARE-ROBUST IN-RRAM-COMPUTING FOR OBJECT DETECTION 553
TABLE I
T HE R ATIO OF VALUES C HANGED BY S ENSING VARIATION OF SA AND N ONSENSIBLE . W ITH THE E XTRA B IAS M APPING ,
THE I MPACT OF T HESE T WO E FFECTS I S M ORE B ALANCED
Fig. 15. mAP results considering device variation for models with and
without BN.
Fig. 14. Relationship between the wordline voltage, power of IRC macro,
and mAP results, where the number A(std = B) in the x-axis denotes the
wordline voltage A and device variation B.
Authorized licensed use limited to: TU Delft Library. Downloaded on February 07,2023 at 12:03:23 UTC from IEEE Xplore. Restrictions apply.
554 IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, VOL. 12, NO. 2, JUNE 2022
Fig. 17. Demo results of (a) the ideal model and (b) the proposed model with the proposed approach under nonideal effects.
Authorized licensed use limited to: TU Delft Library. Downloaded on February 07,2023 at 12:03:23 UTC from IEEE Xplore. Restrictions apply.
CHIANG et al.: HARDWARE-ROBUST IN-RRAM-COMPUTING FOR OBJECT DETECTION 555
TABLE V
C OMPARISONS W ITH O THER W ORKS U NDER D IFFERENT N ONIDEAL E FFECTS
Authorized licensed use limited to: TU Delft Library. Downloaded on February 07,2023 at 12:03:23 UTC from IEEE Xplore. Restrictions apply.
556 IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, VOL. 12, NO. 2, JUNE 2022
[13] C.-C. Chou et al., “An N40 256 K × 44 embedded RRAM macro with Tuo-Hung Hou (Senior Member, IEEE) received
SL-precharge SA and low-voltage current limiter to improve read and the Ph.D. degree in electrical and computer engi-
write performance,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. neering from Cornell University in 2008. In 2000,
Tech. Papers, Feb. 2018, pp. 478–480. he joined Taiwan Semiconductor Manufacturing
[14] C.-X. Xue et al., “A 1 Mb multibit ReRAM computing-in-memory Company (TSMC). In 2008, he joined the Depart-
macro with 14.6 ns parallel MAC computing time for CNN based AI ment of Electronics Engineering, National Chiao
edge processors,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tung University (NCTU) [as National Yang Ming
Tech. Papers, Feb. 2019, pp. 388–390. Chiao Tung University (NYCU) in 2021], where
[15] C.-S. Lin et al., “A 48 TOPS and 20943 TOPS/W 512 kb computation- he is currently a Distinguished Professor and the
in-SRAM macro for highly reconfigurable ternary CNN acceleration,” Associate Vice President for Research and Devel-
in Proc. IEEE Asian Solid-State Circuits Conf. (A-SSCC), Nov. 2021, opment. His research interests include the emerging
pp. 1–3. nonvolatile memory for embedded and high-density data storage, electronic
[16] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” synaptic device and neuromorphic computing systems, and heterogeneous
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, integration of silicon electronics with low-dimensional nanomaterials.
pp. 7263–7271.
[17] C.-C. Tsai et al., “The 2020 embedded deep learning object detection
model compression competition for traffic in Asian countries,” in Proc.
IEEE Int. Conf. Multimedia Expo Workshops (ICMEW), Jul. 2020,
pp. 1–6. Tian-Sheuan Chang (Senior Member, IEEE)
[18] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” received the B.S., M.S., and Ph.D. degrees in elec-
2017, arXiv:1711.05101. tronic engineering from the National Chiao Tung
University (NCTU), Hsinchu, Taiwan, in 1993,
1995, and 1999, respectively.
From 2000 to 2004, he was the Deputy Manager
with Global Unichip Corporation, Hsinchu. In 2004,
he joined the Department of Electronics Engineer-
ing, NCTU [as National Yang Ming Chiao Tung
University (NYCU) in 2021], where he is currently
Yu-Hsiang Chiang received the M.S. degree in elec- a Professor. In 2009, he was a Visiting Scholar
tronics engineering from the National Yang Ming with IMEC, Belgium. His current research interests include system-on-a-chip
Chiao Tung University, Hsinchu, Taiwan, in 2021. design, VLSI signal processing, and computer architecture.
He is currently working with MediaTek, Hsinchu. Dr. Chang has received the Excellent Young Electrical Engineer from the
His research interest includes in-memory computing Chinese Institute of Electrical Engineering in 2007 and the Outstanding Young
architecture design and VLSI design. Scholar from Taiwan IC Design Society in 2010. He has been actively involved
in many international conferences as an organizing committee or a technical
program committee member.
Authorized licensed use limited to: TU Delft Library. Downloaded on February 07,2023 at 12:03:23 UTC from IEEE Xplore. Restrictions apply.