
A Logic-Compatible eDRAM Compute-In-Memory
With Embedded ADCs for Processing Neural Networks

Chengshuo Yu, Graduate Student Member, IEEE, Taegeun Yoo, Member, IEEE,
Hyunjoon Kim, Student Member, IEEE, Tony Tae-Hyoung Kim, Senior Member, IEEE,
Kevin Chai Tshun Chuan, Senior Member, IEEE, and Bongjin Kim, Member, IEEE

Abstract— A novel 4T2C ternary embedded DRAM (eDRAM) cell is proposed for computing a vector-matrix multiplication in the memory array. The proposed eDRAM-based compute-in-memory (CIM) architecture addresses the well-known Von Neumann bottleneck in the traditional computer architecture and improves both latency and energy in processing neural networks. The proposed ternary eDRAM cell takes a smaller area than prior SRAM-based bitcells using 6-12 transistors. Nevertheless, the compact eDRAM cell stores a ternary state (−1, 0, or +1), while the SRAM bitcells can only store a binary state. We also present a method to mitigate the compute accuracy degradation issue due to device mismatches and variations. Besides, we extend the eDRAM cell retention time to 200µs by adding a custom metal capacitor at the storage node. With the improved retention time, the overall energy consumption of the eDRAM macro, including a regular refresh operation, is lower than most of the prior SRAM-based CIM macros. A 128×128 ternary eDRAM macro computes a vector-matrix multiplication between a vector with 64 binary inputs and a matrix with 64 × 128 ternary weights. Hence, 128 outputs are generated in parallel. Note that both weight and input bit-precisions are programmable for supporting a wide range of edge computing applications with different performance requirements. The bit-precisions are readily tunable by assigning a variable number of eDRAM cells per weight or applying multiple pulses per input. An embedded column ADC based on replica cells sweeps the reference level for 2^N − 1 cycles and converts the analog accumulated bitline voltage to a 1-5bit digital output. A critical bitline accumulate operation is simulated (Monte-Carlo, 3K runs). It shows a standard deviation of 2.84%, which could degrade the classification accuracy of the MNIST dataset by 0.6% and the CIFAR-10 dataset by 1.3% versus a baseline with no variation. The simulated energy is 1.81fJ/operation, and the energy efficiency is 552.5-17.8TOPS/W (for 1-5bit ADC) at 200MHz using 65nm technology.

Index Terms— Embedded DRAM, compute-in-memory, hardware accelerator, current-mode, vector-matrix multiplication, SRAM.

Manuscript received June 6, 2020; revised October 13, 2020; accepted November 1, 2020. Date of publication November 16, 2020; date of current version January 12, 2021. This work was supported by the Singapore government's Research, Innovation and Enterprise 2020 Plan, Advanced Manufacturing and Engineering Domain, under Grant A1687b0033. This article was recommended by Associate Editor M.-F. Chang. (Corresponding author: Bongjin Kim.)

Chengshuo Yu is with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798, and also with the Institute of Microelectronics, A*STAR, Singapore 138634.

Taegeun Yoo, Hyunjoon Kim, Tony Tae-Hyoung Kim, and Bongjin Kim are with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798 (e-mail: bjkim@ntu.edu.sg).

Kevin Chai Tshun Chuan is with the Institute of Microelectronics, A*STAR, Singapore 138634.

Color versions of one or more of the figures in this article are available online at https://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSI.2020.3036209

I. INTRODUCTION

THE Von Neumann architecture has been applied to most electronic devices since it was first introduced in 1945. The central concept of this architecture is the separation of memory from its central processing unit (CPU). In general, a traditional computer consists of three separate parts: an arithmetic logic unit (ALU), a control unit, and memory. A typical compute operation is performed in three stages as follows. First, both instructions and data stored in memory are transferred from memory to the ALU before the compute begins. Second, the fetched instructions and data are stored in temporary registers and used for computation in the ALU. Finally, the computed results are sent back to memory. In recent years, the fundamental limits of the conventional Von Neumann architecture have been brought into the spotlight. The memory access usually dominates the entire energy consumption of modern microprocessors, and the limited communication bandwidth restricts the compute performance in both throughput and latency. Due to such limitations, the Von Neumann architecture is no longer the best choice, especially for processing artificial deep neural networks (DNNs) in resource-constrained mobile edge computing devices.

One of the alternative architectures that could significantly improve the performance in processing DNNs is a compute-in-memory (CIM) architecture. As shown in Fig. 1, the in-memory architecture enables the memory to compute essential functions by embedding them in its macro. As a result, we can minimize both energy consumption and compute latency by eliminating a large portion of energy-hungry data communications between the ALU and memory. Besides, massively parallel computations in the large memory array maximize throughput and fully utilize the memory capacity. In conclusion, DNN processing in mobile devices will become much faster and more energy-efficient by adopting the CIM architecture.

Fig. 2(a) illustrates a brain-inspired neuron which computes a dot-product between 'n' pairs of inputs and weights. The dot-product is followed by a nonlinear activation, which generates an output. A CIM macro with embedded brain-inspired neurons is shown in Fig. 2(b).


Fig. 1. A comparison between (a) conventional architecture and (b) compute-in-memory architecture.

Fig. 2. (a) A functional block diagram of brain-inspired neuron. (b) Embedded neurons in a CIM macro with 'm × n' memory cells and 'm' activations.

The macro consists of a two-dimensional memory cell array and a row of nonlinear activations. A column of 'n' memory cells with an activation at the bottom represents a neuron. Each column neuron computes a dot-product between 'n' inputs coming from the left and the same number of weights stored in the memory cells. Note that the macro consists of 'm' columns, and hence its throughput is proportional to 'm' thanks to the fully-parallel column-based operations.
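To make the column-neuron operation concrete, the following is a minimal behavioral sketch (in Python, not from the paper) of the 'm × n' CIM macro described above: each of the 'm' columns computes a dot-product between the shared 'n' inputs and its stored weights, followed by a nonlinear activation. The array sizes, the sign activation, and all variable names are illustrative assumptions.

```python
import numpy as np

def cim_macro(inputs, weights, activation=np.sign):
    """Behavioral model of an m-column CIM macro.

    inputs : (n,) vector shared by all columns (applied from the left).
    weights: (n, m) matrix, one column of stored weights per column neuron.
    Returns one activated dot-product per column, all 'computed' in parallel.
    """
    dot_products = inputs @ weights          # n-input dot-product per column
    return activation(dot_products)          # row of nonlinear activations

# Example: 64 binary inputs, 128 column neurons with ternary weights.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=64)              # binary inputs {0, 1}
w = rng.integers(-1, 2, size=(64, 128))      # ternary weights {-1, 0, +1}
y = cim_macro(x, w)
print(y.shape)                               # (128,) -> 128 outputs in parallel
```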
Recently, various SRAM-based bitcells for CIM macros [1]–[10] have been developed for energy-efficient DNN processing. A standard 6T SRAM bitcell is used for storing a binary weight, while it typically requires extra transistors for accumulation via shared bitlines. J. Zhang et al. [1] first presented a standard 6T SRAM-based CIM macro for a machine learning classification. A programmable positive short-pulse is applied to a wordline (WL), causing one of the bitlines (BL/BLB) to discharge while the other remains unchanged. Note that such an operation is equivalent to a multiplication between the input and the stored binary weight. The operations are performed all in parallel, and hence the voltage drop at BL/BLB is equivalent to the sum of the individual multiplication results (i.e., accumulation). The standard 6T SRAM bitcell is compact and does not require extra transistors. However, it has fundamental limitations, including a memory cell write disturbance and a narrow dynamic range in the shared bitlines for accumulation. Besides, the ADC overhead is significant for applications that require high-precision outputs. Biswas et al. [3] proposed a 10T SRAM bitcell having four extra transistors to decouple bitlines for read/write operations (i.e., RBL/WBL). Hence, the 10T bitcell is free from the write disturbance issue during accumulation. Moreover, the voltage-based operation enables a rail-to-rail dynamic range in the read bitline. However, the 10T bitcell occupies a larger area than the standard 6T bitcell, and hence results in a reduced integration density. Besides, it also requires extra circuits for accumulation based on charge redistribution, and the ADC overhead issue is not resolved yet. Yin et al. and Kim et al. [4], [5] proposed voltage-driver based SRAM bitcells, where the accumulation is performed by programming the ratio between pull-up and pull-down drivers that are embedded in their bitcells. However, both suffer from low integration density due to the large bitcell size. Recently, Yu et al. [6] presented an 8T SRAM bitcell having two extra transistors for decoupled read/write operations. While it has a moderate dynamic range and a relaxed ADC overhead, the inherently large SRAM bitcell size is an issue to overcome. Emerging resistive memory (ReRAM/RRAM) based compute-in-memory (CIM) macro designs [11], [12] have shown higher memory density than the proposed eDRAM macro, with smaller bitcell areas (0.25μm² and 0.2025μm²) than the proposed cell (1.08μm²). However, the RRAM technology demonstrated lower energy efficiency (19.2 and 53.17TOPS/W at 1bit) than the eDRAM bitcell (552.5TOPS/W at 1bit) due to the relatively large cell current. Besides, RRAM-based in-memory computing suffers from an unstable resistance value, which is caused by the inherent filament formation process.

This paper presents a logic-compatible eDRAM-based CIM macro for energy-efficient processing of DNNs in mobile edge devices. A novel 4T2C ternary eDRAM cell (two 2T eDRAM bitcells and two capacitors) is proposed. Both input and weight bit-precisions are readily reconfigured by assigning a variable number of bitcells per weight or using multiple pulses per input. We also propose a replica-based analog-to-digital converter, which is embedded in each column of the CIM macro along with 128 eDRAM cells.


TABLE I
COMPARISON OF EMBEDDED MEMORY CANDIDATES FOR COMPUTE-IN-MEMORY MACRO

The rest of the paper is organized as follows. Section II introduces the proposed ternary eDRAM cell with its basic operations. Section III presents the reconfigurability of the CIM macro based on the proposed eDRAM cells. The design challenges are discussed in Section IV. Section V introduces the embedded ADC and the offset-calibration approach. Section VI describes the overall architecture and simulation results, followed by a conclusion in Section VII.

II. eDRAM COMPUTE-IN-MEMORY MACRO

A. Proposed 4T2C Ternary eDRAM Cell

In this work, we propose an embedded DRAM (eDRAM) based CIM macro for the first time. Different styles of eDRAM bitcells have been introduced, as shown in Table I. Barth et al. [13] presented a 1T1C bitcell. The highest integration density and the low power consumption of the 1T1C structure make it a promising bitcell candidate for an eDRAM-based CIM. However, the compact 1T1C bitcell shares a bitline for reading/writing a storage capacitor, and hence suffers from the write disturbance issue. The dynamic nature of the eDRAM bitcell necessitates a regular refresh operation with associated energy consumption. Besides, it requires low-leakage access transistors and a deep trench capacitor, which is not available in the generic logic process. Chun et al. [14], [15] have proposed gain-cell based eDRAM macros based on compact 3T [14], 2T [15], and 2T1C [16] bitcell structures using only a standard logic process. The logic-compatible eDRAM bitcells decouple read and write operations, and the manufacturing cost is lower than the 1T1C eDRAM [13]. The memory density is higher than 6T SRAM but lower than 1T1C eDRAM. A logic-compatible 2T eDRAM bitcell circuit comprising a PMOS and an NMOS transistor is shown in Fig. 3. The PMOS transistor is used for accessing the internal storage node through the write bitline (WBL) when its gate node (WWL) voltage is low. The NMOS transistor works as a storage capacitor as well as a read access transistor to read the stored value when the source (RWL) node voltage is low. In this work, we use a pair of 2T eDRAM bitcells as a ternary-weight compute-in-memory unit. Each eDRAM bitcell on the left and the right works as a current discharging unit, and contributes to a finite voltage drop in the read bitlines (RBLL/RBLR) when a negative short pulse is applied to RWL and the stored weight is high. Note that the eDRAM bitcell decouples read and write operations, and hence is free from a write disturbance issue. Besides, the proposed eDRAM cell stores a ternary state (−1, 0, or +1) using four transistors only, while a standard SRAM bitcell shares a read/write bitline and stores a binary state using six transistors.

B. Compute-In-Memory Multiplication

Fig. 4 describes a compute-in-memory dot-product operation using the proposed 4T2C ternary eDRAM cell. Before computing, the two read bitlines in the middle are pre-charged to a high voltage. Fig. 4(a), top shows the circuit of the eDRAM cell. A single cell consists of four transistors and six control lines; two pairs of write and read bitlines, a write wordline, and a read wordline. The stored weights on the left/right bitcells are WL and WR. The table in Fig. 4(a) summarizes a default ternary-weight binary-input multiply operation in a cell. Here, note that the high or the low voltage is denoted by 'H'/'L' while ternary state values are represented by '−1', '0', and '1'. A ternary weight is stored in an eDRAM cell as −1 (WL = H, WR = L), 0 (WL = L, WR = L), or +1 (WL = L, WR = H). A binary input is represented by a transient voltage at an RWL node. The input is '0' when RWL is 'H' and is '1' when a negative short pulse is applied to the RWL node. After a cycle of operation, the ternary-weight binary-input multiplication result is accumulated in a differential read bitline (RBLL/RBLR) as a voltage difference (−V, 0, or +V) based on the weight and input combinations, as shown in Fig. 4(a).
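The encoding just described maps directly to a small truth-table model. Below is a hedged Python sketch (not from the paper) that encodes a ternary weight as the (WL, WR) pair and a binary input as the presence or absence of a negative RWL pulse, and returns the signed contribution (−1, 0, or +1 unit of bitline voltage difference) that a single 4T2C cell adds in one cycle; V_UNIT is a placeholder for the per-cell voltage step τ·Iunit/C discussed in Section II-C.

```python
# Ternary weight encoding: (WL, WR) storage-node levels, 'H' or 'L'.
WEIGHT_ENCODING = {-1: ('H', 'L'), 0: ('L', 'L'), +1: ('L', 'H')}

V_UNIT = 1.0  # placeholder for the per-cell step V = tau * I_unit / C

def cell_multiply(weight, x):
    """One-cycle multiply of a ternary weight (-1/0/+1) and binary input (0/1).

    Returns the change in the differential read-bitline voltage
    (RBLL - RBLR is assumed here as the sign convention):
    a weight of +1 discharges RBLR (difference rises by +V),
    a weight of -1 discharges RBLL (difference falls by -V),
    and a weight of 0 or an input of 0 leaves both bitlines unchanged.
    """
    wl, wr = WEIGHT_ENCODING[weight]
    if x == 0:                 # RWL stays high: no read current flows
        return 0.0
    if wr == 'H':              # right bitcell discharges RBLR
        return +V_UNIT
    if wl == 'H':              # left bitcell discharges RBLL
        return -V_UNIT
    return 0.0                 # weight 0: neither bitcell conducts

# The six (weight, input) combinations of Fig. 4(a):
for w in (-1, 0, +1):
    for x in (0, 1):
        print(f"W={w:+d}, X={x} -> delta V_diff = {cell_multiply(w, x):+.1f}")
```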


Fig. 3. A 2T eDRAM bitcell when operating as a memory: (a) write mode and (b) read mode. (c) A pair of 2T eDRAM bitcells works as a ternary-weight compute-in-memory unit cell.

Fig. 4. (a) A circuit schematic of the proposed 4T2C ternary eDRAM cell and a table summarizing ternary-weight binary-input multiplication results. (b) Cell operations for all six weight (W) and input (X) combinations.

Fig. 5. (a) Dot-product operation in a column (i.e., a neuron) of the eDRAM array. (b) A read bitline (RBLL or RBLR) discharge plot of all possible outputs.

Fig. 4(b) describes how the proposed 4T2C ternary eDRAM cell operates for each weight (W) and input (X) combination. When the RWL node is high (i.e., the input is 0), no current flows through RBLL/RBLR, and hence there is 'no change' in the voltage difference between the two read bitlines, as shown in all three diagrams in Fig. 4(b), left. When a negative pulse is applied to RWL, either RBLL or RBLR is discharged through an NMOS read access transistor on the left or right bitcell when the storage node voltage is 'H'. As a result, the voltage difference between RBLL and RBLR decreases or increases by 'V', as highlighted in Fig. 4(b), right. When both weight voltages are 'L,' there will be no change in RBLL/RBLR.

C. Current-Mode Bitline Accumulation

Fig. 5(a) describes the accumulate operation in a column of the ternary eDRAM cells. As shown in Fig. 5(a), left, a column consists of 'n' unit cells that are stacked in a vertical direction. A pair of PMOS transistors precharges RBLL/RBLR before the current-mode accumulation induces voltage drops at the shared read bitlines. The precharge voltage is set to 0.7V for reducing the leakage current flowing through the read bitlines from the disabled bitcells. To maximize the linearity in accumulate operations, we use only a limited voltage range (i.e., ∼250mV), as depicted in Fig. 5(a). Each 2T1C bitcell discharges a unit current (Iunit) when the stored weight is 'H' and a negative pulse is applied to RWL. When the read bitline capacitance is 'C' and the RWL short pulse width is 'τ,' the resulting voltage difference generated by each enabled bitcell is V = τ · (Iunit/C). Since the number of eDRAM cells in a column is 'n,' the accumulated voltage difference at RBLL/RBLR is Vdiff = τ · (Iunit/C) · Σ_{i=1}^{n} Wi · Xi.


Fig. 5(b) shows a transient simulation result of a single read bitline (RBLL or RBLR) discharge operation when a negative short pulse is applied to RWL. For verifying the linearity of the proposed eDRAM-based bitline accumulation, we swept the number of enabled bitcells from 0 to 128 with a step size of 1 when the number of eDRAM cells per column is 128. Note that the memory core supply voltage is 0.5V, and the read bitlines are pre-charged to 0.7V before the accumulate operation. The RWL short pulse width is 1ns, and the simulated unit current (Iunit) is 97.65nA.
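Using the quantities given above, the current-mode accumulation can be sanity-checked numerically. The sketch below (Python, illustrative only) evaluates Vdiff = τ·(Iunit/C)·ΣWi·Xi with τ = 1 ns and Iunit = 97.65 nA from the simulation; the bitline capacitance C is not stated in the text, so a value of roughly 50 fF is assumed here purely so that 128 enabled cells stay within the ∼250 mV linear range.

```python
TAU = 1e-9          # RWL pulse width: 1 ns (from the text)
I_UNIT = 97.65e-9   # simulated unit discharge current: 97.65 nA (from the text)
C_BL = 50e-15       # assumed read-bitline capacitance (~50 fF, not given in the paper)

V_STEP = TAU * I_UNIT / C_BL   # per-cell voltage step, ~1.95 mV with the assumed C

def bitline_accumulate(weights, inputs):
    """Ideal accumulated differential bitline voltage for one column."""
    dot = sum(w * x for w, x in zip(weights, inputs))   # ternary * binary dot-product
    return V_STEP * dot

# Worst-case check: all 128 cells enabled with weight +1 and input 1.
v_max = bitline_accumulate([+1] * 128, [1] * 128)
print(f"per-cell step = {V_STEP*1e3:.2f} mV, 128-cell swing = {v_max*1e3:.1f} mV")
# ~250 mV, i.e., within the limited operating range quoted in the text.
```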

III. RECONFIGURABILITY

The weight and input bit-precisions of the proposed eDRAM macro can be reconfigured from ternary/binary to higher bit-precisions for processing DNNs having more stringent accuracy requirements. Fig. 6 describes how we reconfigure the input and weight bit-precisions using RWL input pulses and the number of eDRAM cells per weight. Fig. 6(a) shows the programmable input precision based on the number of short pulse cycles per input. A pulse train instead of a single pulse can be applied to an RWL node to increase the number of input levels. Note that the number of pulses corresponds to the number of input levels, and it can be increased as long as the accumulated voltage level does not exceed the limited operating range of the read bitlines. As for the implementation of multi-level weights, we can group multiple eDRAM cells to represent a weight. For instance, two or three ternary eDRAM cells are combined to work as a 5- or 7-level (−2 to 2 or −3 to 3) weight storage, as shown in Fig. 6(b). Note that each eDRAM cell adds two more levels, and the RWL node has to be shorted between the cells.

Fig. 6. Reconfigurable (a) input and (b) weight precisions using multiple RWL pulses and ternary eDRAM cells to represent an input or a weight.

Fig. 7 shows an example of a reconfigured column with ternary eDRAM cells for a dot-product between 'n' pairs of multi-level inputs and weights. A group comprising four eDRAM cells represents a 9-level (−4 to 4) weight, and a shared RWL with three negative pulse cycles represents a 4-level (0 to 3) input. Fig. 8 shows a reconfigured column with 5-level weights and 3-level inputs, as well as the timing diagrams of essential signals and the simulated waveforms. A pre-charged RBL voltage (0.7 V) drops twice, once for each of the two negative short pulses, and the level of each voltage drop is proportional to the number of enabled eDRAM bitcells, which is swept from 0 to 64, as shown in Fig. 8(c). Note that the RBL voltage levels settled after two cycles of the bitline accumulate operation still do not exceed the predefined dynamic range (∼250mV), which ensures the linearity.

Fig. 7. An example of a reconfigured column with 4T2C ternary eDRAM cells: 4-level inputs and 9-level weights.
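As a concrete illustration of this reconfiguration scheme, the sketch below (Python, an illustrative model rather than the paper's implementation) maps a multi-level weight onto a group of ternary cells and a multi-level input onto a count of RWL pulses, then accumulates the per-pulse, per-cell contributions exactly as in the binary/ternary case; the helper names and the greedy weight-splitting rule are assumptions.

```python
def split_weight(w, cells_per_weight):
    """Split a multi-level weight into ternary (-1/0/+1) values, one per cell.

    E.g., w = +3 with 4 cells -> [+1, +1, +1, 0]; the representable range is
    -cells_per_weight ... +cells_per_weight (2*cells + 1 levels).
    """
    assert abs(w) <= cells_per_weight
    sign = 1 if w >= 0 else -1
    return [sign] * abs(w) + [0] * (cells_per_weight - abs(w))

def multi_level_mac(weights, inputs, cells_per_weight):
    """Dot-product with multi-level weights and pulse-count-coded inputs.

    Each input value is the number of negative RWL pulses (0 .. pulses_max),
    and every pulse lets each enabled ternary cell contribute one unit step.
    Returns the result in units of the per-cell voltage step V.
    """
    total_units = 0
    for w, x_pulses in zip(weights, inputs):
        ternary_cells = split_weight(w, cells_per_weight)
        total_units += sum(ternary_cells) * x_pulses
    return total_units

# Example from the text: 9-level weights (4 cells/weight) and 4-level inputs (0-3 pulses).
print(multi_level_mac(weights=[+4, -2, +1], inputs=[3, 1, 2], cells_per_weight=4))
# -> 4*3 + (-2)*1 + 1*2 = 12 units of bitline voltage difference
```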
IV. DESIGN CHALLENGES

One of the critical design challenges of the eDRAM macro is the short retention time of its dynamic memory cell, which has a limited storage capacitance and substantial leakage currents. The situation becomes more severe in the eDRAM macro with 2T gain cells since it relies on a small (<1fF) gate and parasitic capacitance, while a 1T1C structure utilizes a dedicated storage capacitor such as a 20fF deep trench capacitor [13]. Various approaches [17]–[22] have been introduced to make the retention time longer while maintaining a high integration density. Cho et al. [17] applied different refresh cycle times to different memory blocks based on the priority of data to optimize the refresh operation and the associated energy. Kazimirsky et al. [18] presented an algorithm that schedules the refresh operations to happen only during the unoccupied time. Tikekar et al. [19] proposed an on-demand power-up scheme that can minimize the excessive power from eDRAM access. Park et al. [20] and Choi et al. [21] used eDRAM as a temporary buffer.


Fig. 8. Proposed multiple-unit-cell computing module. (a) Schematic. (b) Timing diagram. (c) Simulated RWL and RBLL waveforms during the transition with Input 2 (two down-pulses).

Fig. 9. Leakage current and retention time theory of eDRAM; the impact of retention time on computation.

Fig. 10. A retention time comparison with 10K Monte Carlo iterations. 'P' represents a write access PMOS and 'N' represents a read access NMOS.

Fig. 11. (a) Illustration of the custom MOMCAP using four metal layers (M4-M7). (b) A 4T2C ternary eDRAM cell layout.

The 2T eDRAM bitcell (i.e., a half-circuit of the proposed 4T2C ternary eDRAM cell) was originally developed [15] to increase its retention time by eliminating a pulling-down gate leakage of the NMOS access transistor, as shown in Fig. 9, left. Note that the NMOS transistor is off when the bitcell is not accessed, and hence its gate leakage is negligible. However, the retention time of the eDRAM bitcell for the CIM macro has to be redefined since the storage node leakage not only flips the stored data but also degrades the computation accuracy in the bitline accumulation. A unit discharging current is directly affected by the stored voltage in each eDRAM bitcell, and hence the bitcell has to be refreshed much earlier than an eDRAM bitcell used only for memory. Therefore, more stringent retention time constraints have to be considered when designing an eDRAM CIM macro.

Note that the bitcell with the stored data '0' is dominant when determining the retention time of the asymmetric 2T eDRAM bitcell with unbalanced leakage components. In this work, we set the maximum allowed voltage level at the internal storage node for data '0' to be 221mV, as shown in Fig. 9, right. The voltage level is calculated based on the maximum allowed leakage current (i.e., Iunit/256) when the number of eDRAM cells in a column is 128. In other words, we set the maximum low-level storage node voltage to ensure that the sum of leakage currents through the NMOS read access transistors is less than half of the unit discharging current, and hence does not affect the accumulate operation.
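The leakage budget above can be checked with simple arithmetic: with Iunit = 97.65 nA and a per-cell leakage ceiling of Iunit/256, the 128 nominally-off cells in a column together leak at most half of one unit current. A short, hedged check (Python, illustrative only):

```python
I_UNIT = 97.65e-9                 # unit discharging current from Section II-C (A)
CELLS_PER_COLUMN = 128

max_leak_per_cell = I_UNIT / 256  # per-cell leakage ceiling set in the text
total_leak = CELLS_PER_COLUMN * max_leak_per_cell

print(f"per-cell ceiling : {max_leak_per_cell*1e9:.3f} nA")   # ~0.381 nA
print(f"column total     : {total_leak*1e9:.2f} nA")          # ~48.8 nA
print(f"fraction of Iunit: {total_leak / I_UNIT:.2f}")         # 0.50 -> half a unit current
```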


Fig. 12. Retention time comparison with three different metal capacitor layouts: (a) schematic only, (b) covered with grounded M4-M7, and (c) custom MOMCAP.

Fig. 13. Operating principle of the column ADC [6]. The ADC comprises a sense amplifier (SA) and replica eDRAM cells generating an ADC reference. The reference is swept for 2^N − 1 cycles of an N-bit ADC operation (e.g., N = 5).

Fig. 14. Timing diagram of the proposed column ADC reference with two different dot-product results, 17 and 25.

Fig. 15. Simulation result of the proposed ADC (1-5bit modes).

Typically, a generic CMOS logic process provides multiple sets of PMOS and NMOS transistors with different threshold voltages. In this work, we use a 65nm logic process with three different threshold devices: HVT (high Vth), RVT (regular Vth), and LVT (low Vth). Thus, we can consider nine PMOS/NMOS combinations as candidates for the design of the 2T eDRAM bitcell. Fig. 10 shows the simulated retention time of the nine different combinations based on 10K Monte-Carlo iterations. The core supply voltage of 0.5V and minimum-size PMOS/NMOS transistors were used for the simulations. The simulation results show that the bitcell with an HVT PMOS and an HVT NMOS achieves the longest retention time, as shown in Fig. 10, bottom right. Note that the impact of the PMOS transistor type on retention time is more significant than that of the nominally-off NMOS read access transistor. As shown in the bottom three cases in Fig. 10, the bitcell with an HVT PMOS results in a longer retention time with less variation compared to the other candidates. The maximum allowed voltages for the three cases are also indicated. Here, the 'clearly off' state means that the bitcell is operating in a valid region where the sum of the NMOS leakage currents at the bitline is negligible and minimally affects the accumulate operation. Based on the computed maximum allowed voltage for data '0,' we are able to calculate a 'valid' retention time. However, the calculated retention time is only a few μs, even in the best case where HVTs are used for both PMOS and NMOS. To extend the retention time, we designed a customized metal-oxide-metal (MOM) capacitor that maximizes the storage node capacitance without increasing the eDRAM cell size, as shown in Fig. 11.


Fig. 16. Operating concept of the binary-searching based offset calibration. Note that the white grid in the top image represents a bitcell with a weight '+1' while the black grid represents a bitcell with a weight '−1'.

Fig. 17. A column neuron with 128 eDRAM cells. Each column consists of 64× cells for dot-product, 32× for ADC reference, and 32× for offset calibration.

Fig. 18. (a) Overall architecture with a 128 × 128 4T2C eDRAM cell array. (b) CIM operating sequence including a regular refresh cycle.

Fig. 19. Power and area breakdown.

Fig. 11(a) illustrates the designed custom MOM capacitors that are connected to the internal storage nodes on the left and right bitcells of the proposed ternary eDRAM cell. The outer ring of metals is grounded, and the 'H'-shaped central metals are connected to the storage nodes. The capacitors are drawn using four metal layers (from M4 to M7), and their layouts are drawn to maximize the overall fringe capacitance between the storage nodes and the ground surrounding them. The ternary eDRAM cell layout, including both front- and back-end metal layers, is shown in Fig. 11(b). The first two metal layers (M2 and M3) are used to build the frontend design of the ternary eDRAM cell with four vertical and two horizontal interconnects for read and write access. The size of the proposed 4T2C eDRAM cell is 1.8μm × 0.6μm. Without affecting the cell size, the added MOM capacitor maximizes the bitcell storage capacitance. A 128 × 128 ternary eDRAM cell array layout is drawn, and its parasitic RCs are extracted to verify the extended valid retention time.

Fig. 12 compares the simulated retention time with different eDRAM unit cell designs. Fig. 12(a) shows the shortest retention time (a few μs) with the schematic only (i.e., the same result as Fig. 10, bottom-right). The eDRAM cells with extra metal capacitors are shown in Fig. 12(b-c). The proposed eDRAM cell with custom MOM capacitors demonstrates the longest retention time of 200 μs, which is twice as long as the one with four ground plates. The extended retention time contributes to reducing the refresh frequency and saving refresh and computation energy. As the retention time rises from 15μs to 200μs, the energy efficiency improves by 14.48 times: the total energy consumption of the eDRAM macro with a 15μs retention time is 26.2fJ/OP, while that of the eDRAM macro with a 200μs retention time is 1.81fJ/OP. Note that the MOM capacitor also plays a significant role in minimizing the variation of the stored data '1,' as shown in Fig. 12.
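The energy numbers quoted above are mutually consistent, as the brief check below shows (Python, illustrative; the only inputs are the two energy-per-operation figures stated in the text). It also converts the 1.81 fJ/OP figure into the TOPS/W form used in the abstract and in Table IV.

```python
E_OP_15US  = 26.2e-15   # energy per operation with 15 us retention (J/OP)
E_OP_200US = 1.81e-15   # energy per operation with 200 us retention (J/OP)

improvement = E_OP_15US / E_OP_200US
tops_per_watt = 1.0 / E_OP_200US / 1e12   # OP/J expressed in tera-operations per watt

print(f"improvement with longer retention: {improvement:.2f}x")        # ~14.48x
print(f"energy efficiency at 1-bit ADC   : {tops_per_watt:.1f} TOPS/W") # ~552.5
```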


Fig. 20. Size and density comparison of 16 kb CIM macro layout between 8T SRAM [6] and the proposed 4T2C eDRAM.

V. COLUMN ADC AND OFFSET CALIBRATION

The overhead of an area-consuming analog-to-digital converter (ADC) is one of the key challenges in the design of analog CIM macros. A bulky ADC is typically shared by many column- or row-based neurons in a macro, and hence results in reduced throughput and increased latency. To address this challenge, we adopted a column ADC [6] embedded in each column of the CIM macro, reusing replica cells and a sense amplifier as an ADC reference and a bit quantizer. The proposed ADC converts the bitline analog voltage (i.e., a dot-product result) to a digital thermometer code from cycle to cycle while sweeping a replica-based ADC reference from the highest to the lowest level.

Fig. 13 illustrates an example of the column ADC operation. For each column, 64× eDRAM cells are used for the dot-product computation, and another 32× cells are assigned for generating the ADC reference level. The ADC reference level is generated by swept weights (i.e., from −32 to +32 with a step of 2) and fixed inputs of '1' (i.e., RWL = down pulse). It takes 2^N − 1 cycles for a single N-bit ADC operation (i.e., 31 cycles for a 5bit ADC). The sense amplifier for the memory read operation is reused for converting the analog dot-product result on the bitlines to a series of thermometer code bits from the highest to the lowest code. Note that one of the replica weights needs to be flipped from '+1' to '−1' before each new cycle arrives. After 31 cycles, a complete 31bit thermometer code is generated and transformed into a 5bit binary code.

Timing diagrams of the ADC operations are illustrated in Fig. 14 for two different dot-product results (e.g., 17 and 25). While the two dot-product results are fixed throughout the conversion cycles, the ADC reference level generated by the replica cells is swept for generating a thermometer code from cycle to cycle. When the dot-product result is 25, the sense amplifier output is zero for the first three cycles since the ADC reference levels are higher than the dot-product result. The reference levels are kept lower than the dot-product result starting from cycle #3, and hence the remaining thermometer code is all ones. Similarly, when the dot-product result is 17, the sense amplifier outputs are zeros at the beginning, and then switch to one when the ADC reference level crosses the dot-product result. Finally, the generated 31bit thermometer code is converted into a 5bit binary code (i.e., B[4:0] = 11100 or 11000 when the dot-product result is 25 or 17). Fig. 15 shows the simulated ADC characteristic over the swept dot-product result from −32 to +32 with a step size of 2 for the 1-to-5bit modes.

Fig. 21. Simulated bitline accumulation results with statistical mismatch.

The offset calibration is proposed to compensate for the offset error occurring in the ADC. Fig. 16 illustrates the concept of the binary-searching based offset calibration using 32× replica cells. The blue lines indicate the offset error of the dot-product part (64× cells) while the black lines represent the actual calibration value of the offset calibration part (32× cells). Before calibration, the offset calibration setting is reset to '0,' and then the optimal offset calibration code is searched using a binary-searching method. For instance, the replica cells are set to +2 when the offset is −3, as shown in Fig. 16(a), and the calibration is set to −6 when the offset is +5, as shown in Fig. 16(b). It takes five cycles to finish a calibration per column, and the calibration is performed individually from column to column. Note that the programmed data of the offset calibration block should be fixed to the determined value before we apply the column ADC to generate a digital thermometer code.
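The conversion and calibration flow described above can be summarized in a short behavioral model. The Python sketch below is illustrative (it is not the paper's circuit): it sweeps the replica reference from +30 down to −30 in steps of 2, collects one sense-amplifier decision per cycle into a thermometer code, and counts the ones to form the binary output; a simple binary search over the calibration codes is included in the same spirit as Fig. 16. The function names and the exact comparison convention are assumptions.

```python
def column_adc(dot_product, n_bits=5):
    """Replica-sweep column ADC: returns (thermometer_code, binary_value).

    The reference starts just below +32 and steps down by 2 every cycle for
    2**n_bits - 1 cycles (31 cycles for 5 bits). Each cycle the sense amplifier
    outputs 1 when the dot-product result is at or above the reference.
    """
    cycles = 2 ** n_bits - 1
    references = [30 - 2 * k for k in range(cycles)]      # +30, +28, ..., -30
    thermometer = [1 if dot_product >= ref else 0 for ref in references]
    return thermometer, sum(thermometer)                   # ones-count = binary code

def calibrate_offset(measure_offset, n_codes=32):
    """Binary-search the calibration code that best cancels a column's offset.

    measure_offset(code) is assumed to return the residual offset seen by the
    sense amplifier when calibration 'code' is applied; 32 codes need 5 cycles.
    """
    lo, hi = -n_codes // 2, n_codes // 2 - 1               # codes -16 ... +15
    code = 0
    while lo <= hi:
        code = (lo + hi) // 2
        if measure_offset(code) > 0:
            hi = code - 1                                  # residual still positive
        else:
            lo = code + 1                                  # residual negative (or zero)
    return code

# Examples from Fig. 14: results 25 and 17 give 28 and 24 ones, respectively.
for d in (25, 17):
    _, value = column_adc(d)
    print(d, "->", value, format(value, "05b"))            # 28 -> 11100, 24 -> 11000
```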


Fig. 17 illustrates a complete block diagram of the column-based neuron comprising 64× eDRAM cells for dot-product, 32× cells for ADC reference, 32× cells for offset calibration, a precharge driver, and a sense amplifier. A dot-product between 64× binary inputs (X[0:63]) and 64× ternary weights (WL0/WR0, WL1/WR1, · · · , WL63/WR63) is performed, and the resulting analog bitline voltage is converted to a thermometer code based on the setting of the ADC reference (R[0:31]) along with the calibration code (C[0:31]). Note that all three operations (i.e., dot-product, ADC, and calibration) in the proposed column-based neuron are performed via the shared read bitlines (RBLL/RBLR).

TABLE II
SUMMARY OF VGG-LIKE CNN MODEL

Fig. 22. Simulated classification accuracy versus Vdiff standard deviation using MNIST/MLP and CIFAR-10/VGG-like CNN. The simulated variation of our design is 2.84% (i.e., 14.2mV/500mV).

TABLE III
SUMMARY OF MNIST/CIFAR-10 CLASSIFICATION RESULTS

Fig. 23. The simulated MNIST classification accuracy difference between binary weights and ternary weights.

VI. OVERALL ARCHITECTURE AND SIMULATION RESULTS

Fig. 18(a) shows the overall architecture of the proposed 4T2C ternary eDRAM CIM macro. The macro comprises a 16K (i.e., 128 × 128) eDRAM cell array with 128× sense amplifiers for 128× parallel dot-product computations and an I/O periphery that includes write bitline (WBL) drivers, read (RWL) and write wordline (WWL) decoders, output registers, precharge drivers, and an RWL pulse generator. A total of 128× eDRAM cells in a row share a write wordline (i.e., WWL[i], i = 0 to 127) and a read wordline (i.e., RWL[i], i = 0 to 127). Similarly, 128× cells in a column share four bitlines, including two write bitlines (i.e., WBLL[i] and WBLR[i], i = 0 to 127) and two read bitlines (i.e., RBLL[i] and RBLR[i], i = 0 to 127). For a memory write operation, each row is selected one by one via the WWL decoders, and the WBL drivers are used to drive the write bitlines and program the selected bitcells. Prior to a compute-in-memory operation, the read bitlines are pre-charged by a pair of PMOS drivers. Once pre-charged, the RWL decoders create either a negative short pulse (for input '1') or a DC high voltage (for input '0') for each row. Then, parallel multiply-and-accumulate (MAC) computations are executed in every dot-product cell and column-based neuron. A row of 128× sense amplifiers and a shift-register based readout circuit is located towards the bottom of the eDRAM array. Each sense amplifier converts the dot-product result into a thermometer bit by sensing the voltage difference between RBLL and RBLR. It takes multiple cycles to complete an ADC operation depending on the output precision mode, as described in Section V. Although it is not shown in Fig. 18(a), we also have write-back drivers for the regular eDRAM cell refresh operation. The operating sequence of the proposed eDRAM CIM macro, including write/refresh and compute-in-memory operations, is shown in Fig. 18(b). The operating clock frequency is 200MHz, and the refresh cycle is 200μs. It takes 12.8μs for a complete refresh operation of the 16K ternary eDRAM cells, and hence the throughput of the proposed macro is 187.2M dot-products/sec per column, including the regular refresh operation. To minimize the overall energy consumption while ensuring accurate computation results, we use a 0.5V supply for both the memory core and the RWL decoders and a 0.7V supply for pre-charging the read bitlines.

Fig. 19 presents the power and area breakdowns. The eDRAM array occupies the majority of the design area (54.72%) and also dominates the power consumption (62.6%).

TABLE IV
PERFORMANCE COMPARISON WITH STATE-OF-THE-ART

Fig. 20 shows a comparison between the 8T SRAM [6] and the proposed 4T2C ternary eDRAM based macro and cell layouts. The SRAM-based CIM macro performs a similar bitline accumulate operation using two extra transistors for decoupling read/write on top of the standard 6T SRAM bitcell. Hence, it can store a binary weight and is free from the retention time issue. However, the size of its bitcell is significantly larger than our 4T2C eDRAM cell. As shown in Fig. 20, the proposed ternary 4T2C eDRAM cell occupies only 0.32× the area of the SRAM bitcell. The eDRAM memory density is 0.866Mb/mm², which is 2.97× higher than that of the 8T SRAM. Note that the layouts for both the 8T SRAM and the 4T2C eDRAM follow the same standard logic design rules.

Fig. 21 shows the bitline accumulation transfer characteristic based on 3K runs of Monte-Carlo simulation using a column of 128× eDRAM cells. For verifying both the linearity and the variation of the proposed eDRAM-based bitline accumulation, we swept the dot-product results from −128 to +128 and measured the bitline voltage difference. Based on the simulation results, we achieved standard deviations of 13.7mV, 14.2mV, and 13.6mV, and mean accumulated bitline voltage differences of −121.9mV, 0V, and 121.4mV when the dot-product result is −64, 0, and +64, respectively.

The impact of process variations on the bitline accumulation has been assessed by measuring the image classification accuracy using two different neural network configurations and datasets. The Vdiff std-dev/full-range of this design (i.e., 2.84%) is obtained from the Monte-Carlo simulation with PVT and mismatch information, as shown in Fig. 21. To show the trend of accuracy with the Vdiff std-dev/full-range, we sweep the Vdiff std-dev/full-range from 0 to 10.5%. In the testing phase, we inject noise following a normal distribution with the corresponding variation between the convolution and batch normalization layers of each layer, since the error/variation generated within the hardware manifests itself as a random noise distribution at the output. The red curve of Fig. 22 shows the simulated classification accuracy versus the worst-case standard deviation (std-dev) of the bitline voltage difference (Vdiff) using a multi-layer perceptron (MLP) network with two hidden layers (784-256-64-10) on the MNIST dataset, and the blue curve of Fig. 22 shows the simulated classification accuracy using a VGG-like convolutional neural network (CNN) with six convolution layers and three fully-connected (FC) layers on the CIFAR-10 dataset. The two curves show a similar decreasing trend with increasing variation. Fig. 23 shows the classification accuracy when using binary weights and ternary weights with the swept variation (i.e., 0% to 10.5%). The simulated results clearly illustrate that the ternary weights provide better accuracy under all variation conditions. Table II summarizes the detailed configuration of the simulated VGG-like CNN model. A binarized neural network [25] is used for training and testing the MLP and CNN models. The simulated MNIST classification accuracy initially degrades more slowly than the CIFAR-10 results when the std-dev is low, and then it decreases much more quickly as the std-dev exceeds 4% of the full dynamic range. Based on the simulated worst-case variation (i.e., σ = 2.84%) in Fig. 21, the estimated classification accuracies are 96.78% for MNIST using the MLP and 82.8% for CIFAR-10 using the VGG-like CNN model. The results show 0.6% and 1.3% accuracy degradations, respectively, from the baseline results with no variation. Table III summarizes the simulated MNIST/CIFAR-10 classification results.

Table IV summarizes a performance comparison between the proposed 4T2C eDRAM macro and prior CIM macros based on 6-to-12T SRAM bitcells. Note that the proposed work realizes a compact DRAM-based CIM macro for the first time. Compared to the previous SRAM-based works, eDRAM offers unique advantages, including the inherently decoupled read/write ports and the high memory density. Besides, the proposed eDRAM macro is highly reconfigurable in terms of bit-precisions, and it embeds an ADC in each column-based neuron.

VII. CONCLUSION

In this paper, we presented a novel 4T2C ternary eDRAM cell for energy-efficient processing of DNNs. The proposed eDRAM cell consists of a pair of asymmetric 2T1C eDRAM bitcells with decoupled read and write ports. Hence, one of the critical issues (i.e., write disturbance) in analog compute-in-memory has been eliminated. The compact eDRAM cell stores a ternary weight and occupies 0.32× the area (or achieves 2.97× higher memory density) of the 8T SRAM-based bitcell, which can only store a binary state. The read wordline (RWL) is used as an input port, and it can be reconfigured by programming the number of short pulses in a pulse train. To address the short retention time issue of eDRAM, we added custom metal capacitors on the internal storage nodes without increasing the cell size. We embedded column ADCs [6] in each column of our eDRAM macro to address the ADC overhead issue, one of the critical challenges in analog CIM macro design. The proposed ADC performs a reconfigurable 1-5bit conversion, and it takes 1-31 conversion cycles (i.e., 2^N − 1 cycles for N bits). A column of 128× eDRAM cells is comprised of 64× cells for dot-product, 32× for ADC reference, and 32× for calibration. A Monte-Carlo simulation result demonstrates both high linearity and a reasonable variation in the bitline accumulate operation. The simulated worst-case standard deviation is 14.2 mV, which is 2.84% of the full dynamic range of 500mV. The simulated MNIST classification using a three-layer MLP (784-256-64-10) architecture results in an estimated accuracy of 96.78%, which is 0.6% lower than the baseline with no process variation. The simulated CIFAR-10 classification using a VGG-like CNN model results in an estimated 82.8% accuracy, which is only 1.3% lower than the baseline accuracy. The proposed ternary eDRAM cell presents the smallest area (i.e., 1.08μm²) and the highest density among the published CIM bitcells using 65nm technology.

REFERENCES

[1] J. Zhang, Z. Wang, and N. Verma, "A machine-learning classifier implemented in a standard 6T SRAM array," in Proc. IEEE Symp. VLSI Circuits (VLSI-Circuits), Jun. 2016, pp. 1–2.
[2] H. Valavi, P. J. Ramadge, E. Nestler, and N. Verma, "A 64-tile 2.4-Mb in-memory-computing CNN accelerator employing charge-domain compute," IEEE J. Solid-State Circuits, vol. 54, no. 6, pp. 1789–1799, Jun. 2019.
[3] A. Biswas and A. P. Chandrakasan, "Conv-RAM: An energy-efficient SRAM with embedded convolution computation for low-power CNN-based machine learning applications," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2018, pp. 488–490.
[4] S. Yin, Z. Jiang, J.-S. Seo, and M. Seok, "XNOR-SRAM: In-memory computing SRAM macro for binary/ternary deep neural networks," IEEE J. Solid-State Circuits, vol. 55, no. 6, pp. 1733–1743, Jun. 2020.
[5] H. Kim, Q. Chen, and B. Kim, "A 16K SRAM-based mixed-signal in-memory computing macro featuring voltage-mode accumulator and row-by-row ADC," in Proc. IEEE Asian Solid-State Circuits Conf. (A-SSCC), Nov. 2019, pp. 35–36.
[6] C. Yu, T. Yoo, T. T.-H. Kim, K. C. T. Chuan, and B. Kim, "A 16K current-based 8T SRAM compute-in-memory macro with decoupled read/write and 1-5bit column ADC," in Proc. IEEE Custom Integr. Circuits Conf. (CICC), Mar. 2020, pp. 1–4.
[7] J. Kim et al., "Area-efficient and variation-tolerant in-memory BNN computing using 6T SRAM array," in Proc. Symp. VLSI Circuits (SOVC), Jun. 2019, pp. C118–C119.
[8] M. Kang, S. K. Gonugondla, A. Patil, and N. Shanbhag, "A 481 pJ/decision 3.4 M decision/s multifunctional deep in-memory inference processor using standard 6T SRAM array," Oct. 2016, arXiv:1610.07501. [Online]. Available: https://arxiv.org/abs/1610.07501
[9] M. Kang, S. K. Gonugondla, A. Patil, and N. R. Shanbhag, "A multi-functional in-memory inference processor using a standard 6T SRAM array," IEEE J. Solid-State Circuits, vol. 53, no. 2, pp. 642–655, Feb. 2018.
[10] J.-W. Su et al., "15.2 A 28 nm 64Kb inference-training two-way transpose multibit 6T SRAM compute-in-memory macro for AI edge chips," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2020, pp. 240–242.
[11] W.-H. Chen et al., "A 65 nm 1Mb nonvolatile computing-in-memory ReRAM macro with sub-16ns multiply-and-accumulate for binary DNN AI edge processors," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2018, pp. 494–496.
[12] C.-X. Xue et al., "24.1 A 1Mb multibit ReRAM computing-in-memory macro with 14.6ns parallel MAC computing time for CNN based AI edge processors," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2019, pp. 388–390.
[13] J. Barth et al., "A 500 MHz random cycle, 1.5 ns latency, SOI embedded DRAM macro featuring a three-transistor micro sense amplifier," IEEE J. Solid-State Circuits, vol. 43, no. 1, pp. 86–95, Jan. 2008.
[14] K. C. Chun, P. Jain, J. H. Lee, and C. H. Kim, "A 3T gain cell embedded DRAM utilizing preferential boosting for high density and low power on-die caches," IEEE J. Solid-State Circuits, vol. 46, no. 6, pp. 1495–1505, Jun. 2011.
[15] K. C. Chun, P. Jain, T.-H. Kim, and C. H. Kim, "A 667 MHz logic-compatible embedded DRAM featuring an asymmetric 2T gain cell for high speed on-die caches," IEEE J. Solid-State Circuits, vol. 47, no. 2, pp. 547–559, Feb. 2012.
[16] K. Chun, W. Zhang, P. Jain, and C. H. Kim, "A 700 MHz 2T1C embedded DRAM macro in a generic logic process with no boosted supplies," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2011, pp. 272–273.
[17] K. Cho, Y. Lee, Y. H. Oh, G.-C. Hwang, and J. W. Lee, "EDRAM-based tiered-reliability memory with applications to low-power frame buffers," in Proc. Int. Symp. Low Power Electron. Design (ISLPED), Aug. 2014, pp. 333–338.
[18] A. Kazimirsky and S. Wimer, "Opportunistic refreshing algorithm for eDRAM memories," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 63, no. 11, pp. 1921–1932, Nov. 2016.
[19] M. Tikekar, V. Sze, and A. P. Chandrakasan, "A fully integrated energy-efficient H.265/HEVC decoder with eDRAM for wearable devices," IEEE J. Solid-State Circuits, vol. 53, no. 8, pp. 2368–2377, Aug. 2018.
[20] Y. S. Park, D. Blaauw, D. Sylvester, and Z. Zhang, "Low-power high-throughput LDPC decoder using non-refresh embedded DRAM," IEEE J. Solid-State Circuits, vol. 49, no. 3, pp. 783–794, Mar. 2014.
[21] W. Choi, G. Kang, and J. Park, "A refresh-less eDRAM macro with embedded voltage reference and selective read for an area and power efficient Viterbi decoder," IEEE J. Solid-State Circuits, vol. 50, no. 10, pp. 2451–2462, Oct. 2015.
[22] F. Tu, W. Wu, S. Yin, L. Liu, and S. Wei, "RANA: Towards efficient neural acceleration with refresh-optimized embedded DRAM," in Proc. ACM/IEEE 45th Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2018, pp. 340–352.
[23] T. Yoo, H. Kim, Q. Chen, T. T.-H. Kim, and B. Kim, "A logic compatible 4T dual embedded DRAM array for in-memory computation of deep neural networks," in Proc. IEEE/ACM Int. Symp. Low Power Electron. Design (ISLPED), Jul. 2019, pp. 1–6.
[24] M. Ichihashi, H. Toda, Y. Itoh, and K. Ishibashi, "0.5 V asymmetric three-Tr. cell (ATC) DRAM using 90nm generic CMOS logic process," in Dig. Tech. Papers, Symp. VLSI Circuits, Jun. 2005, pp. 366–369.
[25] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks," in Proc. 30th Conf. Neural Inf. Process. Syst. (NIPS), 2016, pp. 4107–4115.
[26] K. C. Chun, W. Zhang, P. Jain, and C. H. Kim, "A 2T1C embedded DRAM macro with no boosted supplies featuring a 7T SRAM based repair and a cell storage monitor," IEEE J. Solid-State Circuits, vol. 47, no. 10, pp. 2517–2526, Oct. 2012.
[27] Y. Zha, E. Nowak, and J. Li, "Liquid silicon: A nonvolatile fully programmable processing-in-memory processor with monolithically integrated ReRAM," IEEE J. Solid-State Circuits, vol. 55, no. 4, pp. 908–919, Apr. 2020.


Chengshuo Yu (Graduate Student Member, IEEE) received the B.S. degree in electronic engineering from Feng Chia University, Taiwan, in 2019. He is currently pursuing the Ph.D. degree with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore. His research interests include in-memory computing and time-domain hardware accelerators.

Taegeun Yoo (Member, IEEE) received the B.S., M.S., and Ph.D. degrees in electrical and electronics engineering from Chung-Ang University, Seoul, South Korea, in 2009, 2011, and 2015, respectively. From 2015 to 2016, he was with Chung-Ang University as a Research Professor. In 2016, he joined Nanyang Technological University, Singapore, as a Research Fellow. His research interests include analog mixed-signal ICs and low-power memory architecture. He received the encouragement award and the silver award at the Human-Tech Paper Award hosted by Samsung Electronics Company Ltd., in 2011 and 2014, respectively. He also received the Silkroad Award at the IEEE International Solid-State Circuits Conference (ISSCC) in 2014.

Hyunjoon Kim (Student Member, IEEE) received the B.A. degree in physics from Oberlin College, Oberlin, OH, USA, in 2008, and the M.S. degree in electrical engineering from the University of Minnesota, Minneapolis, MN, USA, in 2012. He is currently pursuing the Ph.D. degree in digital compute-in-memory (CIM) circuits and architecture for machine learning applications with Nanyang Technological University, Singapore. From 2013 to 2018, he was with Gainspan Corporation, San Jose, CA, USA, where he worked as an RF Test Engineer for IEEE 802.11 standards. In 2018, he joined Nanyang Technological University. His research interests include memory-centric systems, neural network accelerators, and their design methodologies.

Tony Tae-Hyoung Kim (Senior Member, IEEE) received the B.S. and M.S. degrees in electrical engineering from Korea University, Seoul, South Korea, in 1999 and 2001, respectively, and the Ph.D. degree in electrical and computer engineering from the University of Minnesota, Minneapolis, MN, USA, in 2009. From 2001 to 2005, he was with Samsung Electronics Company Ltd., Hwasung, South Korea, where he performed research on the design of high-speed SRAM memories, clock generators, and IO interface circuits. From 2007 to 2009, he was with the IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, and Broadcom Corporation, Edina, MN, USA, where he performed research on circuit reliability, low-power SRAM, and battery-backed memory design. In 2009, he joined Nanyang Technological University, Singapore, where he is currently an Associate Professor. He has authored or coauthored over 160 journal and conference articles and holds 17 U.S. and Korean patents. His current research interests include low-power and high-performance digital, mixed-mode, and memory circuit design, ultralow-voltage circuits and systems design, variation and aging-tolerant circuits and systems, and circuit techniques for 3-D ICs. Dr. Kim received the Best Demo Award at APCCAS 2016, the Low Power Design Contest Award at ISLPED 2016, the best paper awards at the 2014 and 2011 ISOCC, the AMD/CICC Student Scholarship Award at the IEEE CICC 2008, the Departmental Research Fellowship from the University of Minnesota in 2008, the DAC/ISSCC Student Design Contest Award in 2008, the Samsung Humantech Thesis Award in 2008, 2001, and 1999, and the ETRI Journal Paper of the Year Award in 2005. He was the Chair of the IEEE Solid-State Circuits Society Singapore Chapter. He has served on numerous conferences as a Committee Member. He also serves as an Associate Editor for the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, IEEE ACCESS, and the IEIE Journal of Semiconductor Technology and Science.

Kevin Chai Tshun Chuan (Senior Member, IEEE) received the B.Eng. (Hons.) and Ph.D. degrees in electronic and electrical engineering from the University of Glasgow, U.K., in 2002 and 2007, respectively, developing a tissue cell imaging solution based on electrical impedance tomography on a CMOS chip. He joined the Institute of Microelectronics, A*STAR, Singapore, in 2008, as a Research Scientist and developed a silicon-nanowire based biosensor readout system for the detection of biomarkers in cardiac disease. He received several competitive A*STAR grants as a PI/Co-PI for MEMS sensor-related applications in temperature, motion and sound detection, cell counting, an electronic stethoscope system for the early detection of diastolic dysfunction in hypertensive heart disease, and so on. He currently heads a department of more than 30 IC designers working on various topics from AI-powered hardware accelerators covering both deep learning and neuromorphic methodologies, compute-in-memory using emerging memories, hardware security for edge IoT, power management solutions with IVR, and mmWave ICs, to design acceleration techniques using machine learning.

Bongjin Kim (Member, IEEE) received the B.S. and M.S. degrees from POSTECH, Pohang, South Korea, in 2004 and 2006, respectively, and the Ph.D. degree from the University of Minnesota, Minneapolis, MN, USA, in 2014. He spent two years with Rambus, Sunnyvale, CA, USA, where he was a Senior Staff Member and worked on the research of high-speed serial link circuits and microarchitectures. He was a Postdoctoral Research Fellow with Stanford University, Stanford, CA, USA, for a year. From 2006 to 2010, he was with Samsung Electronics Company Ltd., Yongin, South Korea, where he performed research on clock generators for high-speed serial links. He also worked as a Research Intern with Texas Instruments, Dallas, TX, USA, IBM T. J. Watson Research, Yorktown Heights, NY, USA, and Rambus during his Ph.D., from 2012 to 2014. He joined Nanyang Technological University (NTU), Singapore, in September 2017, as an Assistant Professor. His current research interests include memory-centric computing devices, circuits, and architectures, hardware accelerators, alternative computing, and mixed-signal circuit design techniques and methodologies. His research works have appeared in top integrated circuit design and automation conference proceedings and journals, including ISSCC, VLSI Symposium, CICC, ESSCIRC, ASSCC, ISLPED, DATE, ICCAD, the IEEE JOURNAL OF SOLID-STATE CIRCUITS (JSSC), and the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS (TVLSI). Dr. Kim was a recipient of the Prestigious Doctoral Dissertation Fellowship Award based on his Ph.D. research works, the International Low Power Design Contest Award from ISLPED, and the Intel/IBM/Catalyst Foundation Award from CICC.
