
A 55nm Ultra High Density Two-port Register File Compiler with Improved Write Replica Technique


Zhao-Yong Zhang¹*, Li-Jun Zhang², Vi-Ping Zhang¹, Rui-Feng Huang¹, Shou-Dao Wu¹, Jian-Bin Zheng¹

¹ Department of Memory Design, AiceStar Technology Corporation, Suzhou 215021, China
² School of Urban Rail Transportation, Soochow University, Suzhou, China
* Email: brightzhang@aicestar.com

Abstract - In this paper, a two-port register file (RF) compiler with an ultra high density design is presented. The memory is implemented using a single-port memory core combined with a smart address selector circuit, which reduces the number of peripheral devices and thus significantly decreases the silicon area. Separate read and write replica schemes are implemented; with the improved write replica technique, the memory compiler can accurately track the write timing over a wide range of memory array sizes and PVT variations. A test-chip with 13 embedded RF memories has been fabricated in the UMC 55nm logic standard performance low-K process. For a 55nm 72Kb memory, the ultra high density design saves 44.0% silicon area compared to a conventional two-port RF (with 8T dual-port memory core) and incurs only 4.3% area overhead compared to a single-port RF (with 6T single-port memory core).

Index Terms - Register file compiler, replica technique, self-timed write, two-port memory, address selector.

I. INTRODUCTION

Two-port register file (RF) memory is widely used for the temporary storage of information and typically provides high performance and better area utilization. The conventional two-port RF memory device is based on an array of 8T memory cells (refer to Fig. 1 (a)). To achieve ultra high density, a two-port RF memory using single-port 6T memory cells (refer to Fig. 1 (b)) was reported previously [1]-[4] (abbreviated 6T 2PRF). The synchronous 6T 2PRF memory utilizes one system clock to serially access the memory cell in a cycle [1], [3]-[4].

Fig. 1. Schematic diagram of (a) conventional dual-port memory cell and (b) single-port memory cell.

It is well known that not only the PVT (Process, Voltage, and Temperature) conditions affect the performance of an RF memory generated by a compiler, but the configuration and density parameters also affect the sensing margin and cycle time. In modern synchronous SRAM (Static Random Access Memory), the read replica technique is widely used to generate the sensing clock that controls wordline pulse widths and limits bitline swings [8]-[9]; this technique helps to minimize the variations in data sensing and power. Usually the read replica technique is also applied to the write operation. However, pre-charging after a write typically requires much more time than pre-charging after a read cycle; hence, separate write replica techniques [5]-[7], [10]-[11] have been developed in recent years to reduce the write cycle time.

In this work, the 6T 2PRF memory uses a smart address selector circuit, which reduces the number of peripheral devices and the power consumption. The improved write replica technique utilizes multiple write reference cells (similar to the memory cell) to ensure reliable writing of data to the memory cells. The write timing control circuit generates a self-timed internal write control clock signal (refer to signal WWLE in Fig. 2) with a dummy write driver having configurable drive strength and a programmable write tracking accelerator unit. The write pulse width can be adjusted to obtain the desired write timing margin.

The organization of this paper is as follows. In Section II, a brief overview of the architecture is presented. Section III discusses the design of some 6T 2PRF memory circuits, including the proposed address selector and the improved write replica technique. Section IV shows the experimental results on the performance of the test-chip.

II. ARCHITECTURE

The memory array core uses a 6T high threshold voltage single-port SRAM cell with 0.502µm² area, which is authorized by UMC for the 55nm logic standard performance (SP) low-K

978-1-61284-193-9/11/$26.00 2011IEEE

Authorized licensed use limited to: Indraprastha Institute of Information Technology. Downloaded on January 15,2021 at 13:17:18 UTC from IEEE Xplore. Restrictions apply.
process. Different combinations of words, bits, and aspect ratios CM (Column Multiplexing) can be used to generate the most desirable configuration. Table I shows the configuration information of the 6T 2PRF memory compiler.

TABLE I
6T 2PRF MEMORY COMPILER CONFIGURATION

Parameter             Ranges
Words                 8b to 2Kb, increment = CM x 2
Bits                  1b to 144b, increment = 1
Bytes                 144b to 1b, decrement = 1
Aspect ratios (CM)    2, 4, 8

The proposed 6T 2PRF memory architecture (taking the right half as an example) is shown in Fig. 2. The array consists of normal single-port memory cells and two extra columns of replica memory cells, one of which is used for read replica and the other for write replica. We can individually optimize the read and write timing by using the separate replica columns. The read replica circuits include read replica bitline loading cells (refer to Read Load Ref. Cell in Fig. 2), read replica wordline loading cells (refer to Dummy Cell in Fig. 2), a read tracking direct sink circuit (with adjustable delay option), and a read tracking end voltage detector (RTEVD). The read replica wordline and bitline memory cells are similar to normal memory cells. The read replica bitline cells (refer to Fig. 4 (a)) are fixed to store logic "1" to achieve better read tracking timing by avoiding the serious bitline leakage current variations in ultra sub-micron processes [5]. The output of the RTEVD circuit supplies a reset signal RBLT, which is sent to the finite state machine (FSM) & control circuit and used to end the internal read self-timing operation. The internal read clock (refer to signal RWLE in Fig. 2) and write clock (refer to signal WWLE in Fig. 2) are used to implement the parallel/serial conversion operation.

Fig. 2. The 6T 2PRF memory architecture block diagram with read and write replica columns.

Fig. 3. Timing waveform of a read-write cycle.

The 6T 2PRF memory dynamic functions include read-only, write-only and read-write options. Fig. 3 presents a read-write cycle waveform. In a read-write operation, the read operation is deliberately given higher priority than the write operation, which is implemented in the FSM circuit.

III. CIRCUIT DESIGN

In this section, we introduce the write replica technique and the address selector circuit, which are integrated in the proposed 6T 2PRF memory architecture described in the previous section.

A. Write Replica Circuit

To ensure fast and low-power operation, the internal write timing control path uses a separate replica technique and a self-timing scheme to match the normal write path. The improved write replica circuits include write replica bitline loading cells (refer to Write Load Ref. Cell in Fig. 2), write replica wordline loading (using metal wire in the dummy memory cells, refer to signal WDWL in Fig. 2), write reference cells (refer to Write Ref. Cell in Fig. 2; the circuit is shown in Fig. 4 (b)), a dummy write driver, and a write end voltage detector circuit with programmable write tracking accelerator. The write replica bitline loading cells are the same as the normal memory cells except that their wordline is forced to ground. Hence, the capacitance of the write dummy bitline (refer to WDBL in Fig. 2) is the same as that of the core memory cells, including junction and wire parasitic capacitances. The write reference cell is modified from the normal memory cell by connecting its wordline to VDD and shorting the source-drain of the access NMOS connected to WDBLB, which avoids the threshold voltage drop through an NMOS passing "1" presented in [7]. M. F. Chang et al. [5] also developed a write
replica technique. However, the scheme does not track the wordline loading of the memory array, which may induce timing skew in the write cycle, especially for memory compiler instances. The improved write replica technique utilizes multiple write reference cells (refer to Fig. 2), which can reduce the write timing skew resulting from process variation.

Fig. 4. Circuits of (a) read replica bitline loading cell and (b) write replica reference cell.

The dummy write driver (refer to Fig. 5 (a)) is located at the top of the normal wordline drivers. This floor plan avoids the extra timing delay resulting from the feedback path in [7]. The drive strength of the dummy write driver can be adjusted by configuring the parallel NMOS devices with different channel widths. Fig. 5 (b) shows the write end voltage detector circuit with programmable write tracking accelerator. The inverter connected to WDBLB is the write end voltage detector. The single NMOS is used to return the write replica reference cells to their initial state when the write tracking ends. The PMOSs connected to WDBLB can be used to accelerate the end of write tracking when a corresponding PMOS is selected by an external test pin. The speed and yield of the 2PRF memory can be traded off through this programmability and configurability, which also helps obtain a desired write timing margin.

Fig. 5. Circuits of (a) dummy write driver and (b) write end voltage detector circuit with programmable write tracking accelerator.

B. Address Selector Circuit

There are two groups of input addresses (refer to A[m-1:0] and B[m-1:0] in Fig. 3) for the two ports (read port and write port) in the 6T 2PRF memory. The address selector circuit is used to implement the parallel/serial (P/S) conversion, which lets the two port addresses serially access one group of wordlines and bitlines [1], [3] using one system clock. Fig. 6 shows the proposed address selector circuit with latch and address pre-decoders. In our work, the circuit does not need the converter control clock described in [1]. The P/S conversion function is implemented by the non-overlapping timing signals RWLE and WWLE described in Fig. 3. The proposed address selector circuit is designed in the pre-decoders, which greatly reduces the device count compared with [1], where it is designed in the row decoders, because the number of row decoders is usually higher than the number of pre-decoders for a memory. This attribute of the proposed address selector circuit is desirable for saving silicon area and static power consumption.

Fig. 6. Address selector circuit with latch and address pre-decoders.

IV. EXPERIMENTAL RESULTS

A test-chip has been fabricated using UMC's 55nm logic SP low-K technology, with 13 memories and embedded MBIST and other testing circuits, which are designed for memory function and timing measurement. Fig. 7 shows the layout of the test-chip, which includes compiled 6T 2PRF memories ranging from 8b to 72Kb in a variety of aspect ratios. The features of the test-chip can be found in Table II.

A comparison of the memories' area and read speed (Taa) for the 6T 2PRF with conventional 6T 1PRF (single-port RF) and 8T 2PRF (two-port RF) is shown in Fig. 8 (a). The data of the 6T 1PRF and 8T 2PRF instances come from Faraday
commercial RF compiler datasheets. The results show that this work saves silicon area by 44.0% for a 72Kb memory and by 42.0% for an 18Kb memory compared with conventional 8T 2PRF. Although some additional logic circuits are integrated in this design, the area overhead is only 4.3% for a 72Kb memory and 8.0% for an 18Kb memory compared with conventional 6T 1PRF, owing to our aggressive design, especially in the address selector circuit.

TABLE II
FEATURES OF TEST-CHIP

Foundry           UMC
Process           55nm 1P10M logic SP low-K
                  55nm 1P4M (6T 2PRF Instance)
Memory Cell       0.502µm² SPHVT 6T cell
Supply voltage    1.0V
Die size          4000µm x 4000µm
Package           QFP 208

Fig. 8. Performance comparison between the 6T 1PRF, 8T 2PRF and 6T 2PRF. (a) Memories area and read speed (at WC corner). (b) Memories DC and AC current (at TC corner).

A comparison of the memories' DC (static) and AC (dynamic) current for the 6T 2PRF with conventional 6T 1PRF and 8T 2PRF is shown in Fig. 8 (b). The data of the 6T 2PRF memory are the average silicon measurement values at the TC corner (typical process dies at 1.0V power supply and 25°C temperature). The results show that this design reduces dynamic current by 45% for a 72Kb memory and by 41% for an 18Kb memory compared with conventional 8T 2PRF. The results also show that this work reduces static current by 42% for a 72Kb memory and by 36% for an 18Kb memory compared with conventional 8T 2PRF.

V. SUMMARY

An ultra high density 6T 2PRF embedded memory compiler based on an industrial 55nm SP low-K process has been demonstrated. The compiled RF memories markedly reduce silicon area and power consumption by using a single-port SRAM cell as the memory core with the help of the proposed smart address selector circuit. The desired write timing margin, obtained with the help of the improved write replica technique, further guarantees that the compiled RF memories have a wider margin for correct functionality and accurate characterization. The measurement results of the test-chip have proved the correctness, high density and low power of the design.

ACKNOWLEDGMENT

The authors would like to thank Johnson for help with the test-chip design, W. T. and Jason for testing of the chips, Jack, Sam and James for helpful discussion on the circuit design, and Amy Zhang for the layout implementations.

REFERENCES

[1] K. Endo, T. Matsumura, and J. Yamada, IEEE Journal of Solid-State Circuits, vol. 26, no. 4, pp. 549-554, April 1991.
[2] F. E. Barbert, D. J. Eisenberg, G. A. Ingram, M. S. Strauss, and T. R. Wik, IEEE International Solid-State Circuits Conference, pp. 44-45, Feb. 1985.
[3] S. Kengeri, D. Sabharwal, P. Bhatia, S. Sampigethaya, and S. Kainth, United States Patent, US7251186B1, Jul. 2007.
[4] S. Balasubramanian, L. Y. Holla, and B. D. Shefield, "Dual port memory unit using a single port memory core," United States Patent, US7349285B2, Mar. 2008.
[5] Meng-Fan Chang, S. M. Yang, K. T. Chen, H. J. Liao, and Robin Lee, IEEE International Workshop on Memory Technology, Design and Testing, pp. 57-60, Dec. 2007.
[6] Donovan L. Raatz and Taisheng Feng, United States Patent, US5546355, Aug. 1996.
[7] Alexander Shubat, A. Kablanian, J. Raszka, and R. S. Roy, United States Patent, US6392957, May 2002.
[8] Zhao-Yong Zhang, Chia-Cheng Chen, and Jian-Bin Zheng, IEEE 8th International Conference on ASIC, pp. 625-628, Oct. 2009.
[9] Bharadwaj S. Amrutur and Mark A. Horowitz, IEEE Journal of Solid-State Circuits, vol. 33, no. 8, pp. 1208-1219, Aug. 1998.
[10] Toshikazu Suzuki, Y. Yamagami, I. Hatanaka, A. Shibayama, H. Akamatsu, and H. Yamauchi, IEEE Journal of Solid-State Circuits, vol. 41, no. 1, pp. 152-160, Jan. 2006.
[11] Shyh-Chyi Yang, Hao-I Yang, Ching-Te Chuang, and Wei Hwang, International Symposium on VLSI Design, Automation and Test, pp. 162-165, April 2009.
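As a supplementary cross-check on the area figures quoted in the experimental results, the reported savings against the 8T 2PRF and the overheads against the 6T 1PRF can be combined to estimate the area ratio between the two reference memories that they imply. The Python sketch below is illustrative only and is not part of the original work: the function name `implied_area_ratio` is hypothetical, and the calculation assumes that each percentage is stated relative to the area of the respective reference memory of the same capacity.

```python
def implied_area_ratio(saving_vs_8t: float, overhead_vs_6t: float) -> float:
    """Return A(8T 2PRF) / A(6T 1PRF) implied by the reported percentages.

    Assumed relations (not stated explicitly in the paper):
        A(6T 2PRF) = (1 - saving_vs_8t)   * A(8T 2PRF)
        A(6T 2PRF) = (1 + overhead_vs_6t) * A(6T 1PRF)
    """
    if not 0.0 <= saving_vs_8t < 1.0:
        raise ValueError("saving_vs_8t must be in [0, 1)")
    if overhead_vs_6t < 0.0:
        raise ValueError("overhead_vs_6t must be non-negative")
    return (1.0 + overhead_vs_6t) / (1.0 - saving_vs_8t)


# Percentages as reported in the text, expressed as fractions:
# (area saving vs. 8T 2PRF, area overhead vs. 6T 1PRF)
reported = {
    "72Kb": (0.440, 0.043),
    "18Kb": (0.420, 0.080),
}

for capacity, (saving, overhead) in reported.items():
    ratio = implied_area_ratio(saving, overhead)
    print(f"{capacity}: implied 8T 2PRF / 6T 1PRF area ratio = {ratio:.2f}")
```

Under these assumptions both capacities give a ratio of roughly 1.86, so the four reported percentages are mutually consistent.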
