
Standard Cell based Memory Compiler for Near/Sub-threshold Operation

Jinbo Zhou
Innatera Nanosystems BV
Delft, The Netherlands
Email: jinbo.zhou@innatera.com

Kamlesh Singh
Department of Electrical Engineering, Electronic Systems
Eindhoven University of Technology
Eindhoven, The Netherlands
Email: k.k.singh@tue.nl

Jos Huisken
Department of Electrical Engineering, Electronic Systems
Eindhoven University of Technology
Eindhoven, The Netherlands
Email: j.a.huisken@tue.nl

Abstract—Voltage scaling to the near/sub-threshold region is commonly used to achieve energy-efficient operation in digital circuits. However, the voltage-scaling potential of traditional 6T SRAM memories is limited by reliability problems. For small memories, designers tend to use a latch/flip-flop based register file, which brings relatively high area overhead, delay penalty, and power overhead. In this work, a standard-cell based memory (SCM) compiler is presented to automatically generate a 12T OAI based SCM for multiple CMOS technologies operating in the near/sub-threshold region. The SCM compiler utilizes Python-MyHDL for RTL and constraint generation. The timing and floor-plan constraints are generated based on user-specified technology, memory size, and memory shape inputs. The power, area, energy per access, read delay, and write delay are evaluated for the generated memories of different sizes and shapes. The proposed 12T OAI based SCM achieves an average speed-up of 28%/39% in write delay and an average speed-up of 25%/19% in read delay when compared to D-latch/flip-flop based memories in a 28-nm FDSOI technology. Furthermore, a 41% reduction in energy per access and a 33% area saving are obtained compared to a recently published SCM design in 40-nm CMOS technology [1].

Keywords: RAM, memory compiler, standard cell, synthesis, CMOS.

I. INTRODUCTION

Voltage scaling to the near/sub-threshold region is a prevalent approach to achieve ultra-low power/energy for portable devices. Conventional 6T SRAM bitcells do not operate reliably at low voltages due to static noise margin (SNM) degradation [2]. To deal with this bitcell reliability issue, various special SRAM designs with 8T, 9T, or 10T bitcells have been proposed for near/sub-threshold voltages [2]–[4].

Larger on-chip memories (> 8 kb) are generally designed using SRAM macrocells [1], which require extensive modifications to operate reliably in the sub-threshold voltage range. However, the peripheral circuitry in SRAMs typically causes area overhead, especially for smaller memories. Small on-chip memories, such as registers and register files, are typically based on latches or flip-flops [5], for which the area and delay overhead is relatively high. Therefore, there is a shift towards standard cell based memories (SCMs) for small (< 8 kb) memory blocks [1], [6], [7].

The advantage of SCMs over foundry-provided SRAM macrocells is the ease of portability across technologies. SRAM macrocells must be recreated for each new technology, whereas SCMs are described in technology-independent hardware description languages. For smaller memories, SCMs are relatively more efficient than SRAMs in terms of area, delay, and power [1]. In the literature, several representative SCM designs for sub-threshold operation have already been developed [1], [8], [9]. In [9], a D-latch based SCM using clock gates for writing and glitch-free multiplexers for reading is implemented and compared to 6T SRAMs in 65-nm CMOS technology for sub-threshold operation. In [1], an Or-And-Invert (OAI) and And-Or-Invert (AOI) based SCM is presented. However, the placement and routing of the bitcell array is done manually in order to avoid timing issues in physical design. This approach results in relatively slow read circuitry and high design effort.

In this paper, an OAI based standard cell memory is proposed. To the best of our knowledge, this is the first work proposing a complete automated flow to generate OAI based SCMs of different sizes and shapes. The main contributions of this work are as follows:

• Design of a technology-independent SCM compiler with a fully integrated automatic EDA design tool flow to generate OAI based SCMs of different sizes and shapes.
• Automatic synthesis of the memory array along with a pre-placement strategy for bitcells, resulting in improved read access delay, write delay, and power consumption.
• Comprehensive performance analysis of read delay, write delay, power consumption, and area for the proposed SCMs and for latch and flip-flop based SCMs.

The paper is organized as follows. In Section II, the structure of the bitcell and memory array is illustrated. Section III describes the architecture of the SCM compiler. Comparisons and analysis of the proposed SCM and latch/flip-flop based SCMs are presented in Section IV, followed by a conclusion in Section V.

This work was performed at Eindhoven University of Technology, prior to Jinbo Zhou’s affiliation with Innatera Nanosystems BV.

978-1-7281-6044-3/20/$31.00 ©2020 IEEE


Authorized licensed use limited to: Indraprastha Institute of Information Technology. Downloaded on January 15,2021 at 13:46:58 UTC from IEEE Xplore. Restrictions apply.
Fig. 1: Original structure of bitcells.

II. PROPOSED 12T OAI BASED SCM

The proposed SCM consists of OAI based bitcells. The original structure of the bitcell at the RTL netlist design stage is shown in Fig. 1. Each bitcell consists of a pair of cross-coupled OAI gates for the write operation, followed by either an AOI (even bitcell) or an OAI (odd bitcell) gate for read logic. The daisy-chained read decoder is used with one gate delay per bit to obtain minimal delay for the read operation and to save area. This results in even and odd bitcells.

In the write mode, a write is possible when the write word line (WWL_i) is active low. The pair of cross-coupled OAI gates in the bitcell is written with a value corresponding to the write bit lines WBL_i and WBR_i. In the data retention mode, either WWL_i or both WBL_i and WBR_i must be high. During retention, active-low values on WWL_i and WBL_i/WBR_i at the same time must be avoided: glitches on the WWL_i and WBL_i/WBR_i signals may cause parasitic writes, so this case needs special attention through appropriate timing constraints. In the read mode, the state of the daisy-chain input (RB_0) is initialized to 0, so the consecutive states of RB from RB_0 to RB_i are 0, 1, 0, 1, ..., 0, 1. While reading out bitcell i, the read word line (RWL_i) becomes active high, and the inverted read is achieved via alternating AOI and OAI gates in the daisy chain. As shown in Fig. 1, in the original bitcell each alternate OAI/AOI gate contains 6 transistors, making a bitcell of 18 transistors.

In contrast to [1], in the proposed SCM design the alternating OAI/AOI gates of the read decoder are optimized by logic synthesis. This typically turns the long daisy-chained read decoder into a tree structure in the gate-level netlist. Since the RTL synthesis tool is free to use different gates from the standard cell library, the read decoder circuitry is optimized to fewer gates as well as a faster read access time. The cross-coupled OAI gates comprising the 12T bitcell are not touched by the synthesis and place & route tools, constituting the proposed 12T OAI based SCM. Furthermore, without any customized design in the SCM, the logic gate-based bitcell is flexible and portable to different technologies, sizes, and shapes.

The architecture of an 8-bit (2x4) memory array is shown in Fig. 2. The SCM contains 4 words (2-bit address) with 2 bits of data per word. At any time only a single WWL should be active low. In this example, for a write operation the two groups of WBR and WBL are set inverse to each other, while the rest of the WWLs are kept high. This ensures that one word of 2 bits can be written. In the case of a read, only one RWL is activated, in this case RWL_1 = 1, in order to read bit positions 1 and 5.

Fig. 2: Architecture of SCM array.

III. ARCHITECTURE OF THE SCM COMPILER

A. SCM compiler

In the proposed SCM compiler, the memory generator is written in Python/MyHDL and generates the RTL description of the memory. As shown in Fig. 3, a variety of technologies with their PVT corners and parameters including address bits, data bits, and shape (aspect ratio) are taken as user inputs. In the first step, the generator produces the RTL netlist, timing constraints, I/O pin placement, and a YML file with various kinds of technology information. In the second step, a template-driven script generator produces scripts for the synthesis and physical design procedures and tools.

The proposed SCM compiler also generates testbench scripts for design validation at different stages of the design flow. Any type of memory march test can be executed after synthesis and physical design with SDF back-annotation to verify correctness.

After these preparation steps, all steps from synthesis up to physical design, including functional validation using the march tests, are executed using Unix "make".

Fig. 3: Design flow of SCM compiler.
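The user inputs of the compiler (technology, PVT corner, address bits, data bits, aspect ratio) and the idea of a template-driven script generator can be illustrated with a minimal sketch. Everything below is hypothetical: the class name `MemorySpec`, its field names, the labels passed in, and the header template are assumptions chosen for illustration, not the compiler's actual interface, and the sketch uses only the Python standard library rather than MyHDL.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class MemorySpec:
    """User inputs listed in the text; all field names are hypothetical."""
    technology: str       # illustrative label, not a real PDK name
    corner: str           # illustrative PVT corner label
    address_bits: int
    data_bits: int
    aspect_ratio: float   # shape of the generated memory

    @property
    def words(self) -> int:
        # One word per address, so the word count is 2^address_bits.
        return 2 ** self.address_bits

    @property
    def total_bits(self) -> int:
        return self.words * self.data_bits


# Hypothetical header template; real synthesis and place-and-route
# scripts are tool-specific and are not reproduced here.
HEADER_TEMPLATE = Template(
    "# technology: $technology ($corner)\n"
    "# memory: $words words x $data_bits bits = $total_bits bits\n"
    "# aspect ratio: $aspect_ratio\n"
)


def render_header(spec: MemorySpec) -> str:
    """Fill the template from one memory specification."""
    return HEADER_TEMPLATE.substitute(
        technology=spec.technology,
        corner=spec.corner,
        words=spec.words,
        data_bits=spec.data_bits,
        total_bits=spec.total_bits,
        aspect_ratio=spec.aspect_ratio,
    )


spec = MemorySpec("28nm_fdsoi", "typical", address_bits=5, data_bits=16, aspect_ratio=1.0)
print(render_header(spec))
```

With 5 address bits and 16 data bits the specification describes 32 words of 16 bits, that is, a 512-bit memory.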

B. Synthesis

The proposed SCM compiler flow is verified for a 28-nm FDSOI technology and a 40-nm bulk CMOS technology. For the cross-coupled OAI gates, the combinational timing loop is cut to enable a standard logic synthesis flow. The logic synthesis of the SCM optimizes the daisy-chained read decoder of the memory array to meet the timing requirements. In the SCM proposed in [1], the OAI/AOI based bitcell array is designed manually and only the periphery is synthesized, using 40-nm CMOS technology. In order to compare with [1], a 64-kb (2048x32) SCM is generated and synthesized in 40-nm CMOS technology using LVT cells. An initial post-synthesis comparison of total area and access energy per bit for the write operation is shown in Table I. The analysis shows that the proposed SCM achieves 41% lower energy per bit for write access when compared to the SCM proposed in [1].

TABLE I. Post-synthesis comparison using 40-nm CMOS technology.

| SCM (40-nm CMOS) | Area (mm²) | Access energy per bit (pJ/bit) |
|---|---|---|
| Proposed OAI based SCM | 0.2 | 92 |
| OAI/AOI based SCM in [1] | 0.3 | 157 |

C. Place & Route

In this work, a standard physical design flow with customized floor-planning using relative placement of the bitcells is used. During floor-planning, all pairs of cross-coupled OAI gates are pre-placed relatively using the generated script. This pre-placement is especially beneficial to maintain timing correctness for the broken timing path in the cross-coupled OAI gates. Each pair of OAI gates is placed tightly together without any space to keep the wires short. Several OAI gates form a cluster, and all OAI cell clusters are then distributed in rows and columns. Small spaces are kept between the clusters for the memory read and write decoder logic cells, as shown in Fig. 4. The main challenge is that the density of bitcells in the memory array increases with the size of the SCM. The pre-placement algorithm is flexible enough to adjust the size of the bitcell clusters and the spacing between the clusters. The floor-plans with pre-placement for different aspect ratios are shown in Fig. 4.

Fig. 4: Layouts of SCMs with pre-placement and different aspect ratios.

1) Performances of SCMs with and without pre-placement: Comparisons of power consumption, write delay, and read delay are performed for SCMs with and without pre-placement for different address widths (A) and word widths (D), as shown in Table II. The SCMs with pre-placement achieve up to 18% speed-up in write delay and up to 19% speed-up in read delay compared to non-pre-placed SCMs. This indicates that pre-placement is beneficial in providing better performance. Additionally, for the same memory size, different address and word widths also influence the performance. Wider memories provide overall better energy efficiency, as expected.

TABLE II. Comparison between with (W) and without (WO) pre-placement at typical corner. VDD=0.4 V, T=25 °C.

| SCM (28-nm FDSOI) | Pre-placement | 256 bits, A=5 D=8 | 256 bits, A=3 D=32 | 512 bits, A=5 D=16 | 512 bits, A=4 D=32 |
|---|---|---|---|---|---|
| Write delay (ns) | W | 5.5 | 6.8 | 7.3 | 7.4 |
| Write delay (ns) | WO | 6.7 | 7.7 | 8.2 | 7.6 |
| Read delay (ns) | W | 13.7 | 15.8 | 17.5 | 14.9 |
| Read delay (ns) | WO | 14.4 | 19.6 | 18.8 | 15.0 |
| Consumed power (µW) | W | 1.1 | 2.4 | 2.1 | 3.5 |
| Consumed power (µW) | WO | 1.2 | 2.7 | 2.2 | 3.6 |

2) Performances of SCMs with different shapes (aspect ratios): Some layouts with different aspect ratios are shown in Fig. 4. The read delays, write delays, and power consumption for different aspect ratios are compared in Table III. Negative impacts, especially on read delay and power consumption, are observed when the aspect ratio deviates from one. Therefore, it is advised to keep the aspect ratio close to one.

TABLE III. Comparison between different aspect ratios (W/L) at typical corner. VDD=0.4 V, T=25 °C.

| SCM (28-nm FDSOI) | Aspect ratio (W/L) | Write delay (ns) | Read delay (ns) | Power (µW) |
|---|---|---|---|---|
| 512 bits (32x16) | 1.0 | 7.3 | 17.5 | 2.1 |
| 512 bits (32x16) | 2.0 | 8.7 | 19.0 | 2.1 |
| 512 bits (32x16) | 0.5 | 9.0 | 21.0 | 2.2 |
| 2048 bits (64x32) | 1.0 | 14.2 | 26.2 | 4.0 |
| 2048 bits (64x32) | 2.0 | 14.4 | 32.5 | 4.1 |
| 2048 bits (64x32) | 0.5 | 13.5 | 30.7 | 4.1 |

IV. PERFORMANCE EVALUATION

In this work, the foundry-provided 8-track, low-threshold (LVT) standard cell library in a 28-nm FDSOI technology is used. The LVT standard-cell library can operate reliably and efficiently at 0.4 V across all PVT corners. The foundry-provided standard cell library is characterized for near/sub-threshold operation at 0.4 V±10%, different corners and temperatures. A comprehensive performance evaluation of the proposed SCM compared to a latch based and a flip-flop (FF) based SCM is performed. Cadence RTL Compiler is used for logic synthesis, while Cadence Innovus Digital Implementation System is used for the back-end physical design (placement and routing). Sign-off is performed at the slow-slow (SS) corner, 0.36 V and 0 °C.

For read-address, write-address, and write-data, normal register setup and hold constraints are applied. The WWL decoder and WB DEMUX determine the write-access delay. The write-access timing arc starts at the write-address and write-data

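TABLE IV. Comparison of OAI/LATCH/FF based SCMs at typical corner. VDD=0.4 V, T=25 °C.

| SCM size (bits) | Write delay OAI (ns) | Write delay LATCH (ns) | Write delay FF (ns) | Read delay OAI (ns) | Read delay LATCH (ns) | Read delay FF (ns) | Power OAI (µW) | Power LATCH (µW) | Power FF (µW) | Area OAI (µm²) | Area LATCH (µm²) | Area FF (µm²) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | 3.3 | 5.6 | 5.7 | 6.6 | 19.6 | 13.3 | 0.2 | 0.2 | 0.1 | 97 | 89 | 108 |
| 32 | 3.4 | 5.8 | 6.6 | 8.6 | 21.3 | 12.7 | 0.3 | 0.2 | 0.3 | 162 | 136 | 162 |
| 64 | 3.6 | 6.0 | 7.7 | 10.9 | 22.5 | 14.0 | 0.4 | 0.3 | 0.5 | 280 | 238 | 305 |
| 128 | 4.8 | 7.6 | 7.9 | 12.6 | 27.9 | 16.2 | 0.9 | 0.8 | 1.1 | 517 | 396 | 538 |
| 256 | 5.5 | 7.8 | 8.7 | 13.7 | 29.4 | 18.3 | 1.1 | 1.1 | 2.4 | 843 | 739 | 1040 |
| 512 | 7.3 | 10.6 | 11.4 | 17.5 | 30.6 | 21.2 | 2.1 | 2.0 | 4.2 | 1612 | 1299 | 1937 |
| 1024 | 10.5 | 15.6 | 18.2 | 23.1 | 32.4 | 26.9 | 2.8 | 2.5 | 6.6 | 3477 | 2505 | 5244 |
| 2048 | 13.5 | 16.3 | 20.7 | 30.7 | 37.9 | 39.6 | 4.1 | 4.4 | 12.8 | 6067 | 4900 | 10366 |
| 4096 | 14.8 | 17.9 | 24.6 | 40.8 | 39.6 | 42.0 | 8.0 | 8.3 | 25.6 | 11937 | 9205 | 20572 |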
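Per-size speed-ups can be derived from the delay columns of Table IV with a short script. The sketch below is illustrative only: the relative-reduction definition `(baseline - OAI) / baseline` and all identifier names are assumptions. Since the averaging method is not stated, the per-size values printed here are not expected to reproduce the reported averages exactly.

```python
# Delay values transcribed from Table IV (typical corner, 0.4 V, 25 °C), in ns.
SIZES = [16, 32, 64, 128, 256, 512, 1024, 2048, 4096]
WRITE_NS = {
    "OAI":   [3.3, 3.4, 3.6, 4.8, 5.5, 7.3, 10.5, 13.5, 14.8],
    "LATCH": [5.6, 5.8, 6.0, 7.6, 7.8, 10.6, 15.6, 16.3, 17.9],
    "FF":    [5.7, 6.6, 7.7, 7.9, 8.7, 11.4, 18.2, 20.7, 24.6],
}
READ_NS = {
    "OAI":   [6.6, 8.6, 10.9, 12.6, 13.7, 17.5, 23.1, 30.7, 40.8],
    "LATCH": [19.6, 21.3, 22.5, 27.9, 29.4, 30.6, 32.4, 37.9, 39.6],
    "FF":    [13.3, 12.7, 14.0, 16.2, 18.3, 21.2, 26.9, 39.6, 42.0],
}


def speedup(baseline_ns: float, proposed_ns: float) -> float:
    """Relative delay reduction; this definition is an assumption."""
    return (baseline_ns - proposed_ns) / baseline_ns


def per_size_speedups(table: dict, baseline: str) -> list:
    """Speed-up of the OAI design over `baseline` for every memory size."""
    return [speedup(b, p) for b, p in zip(table[baseline], table["OAI"])]


for name, table in (("write", WRITE_NS), ("read", READ_NS)):
    vs_latch = per_size_speedups(table, "LATCH")
    vs_ff = per_size_speedups(table, "FF")
    for size, s_latch, s_ff in zip(SIZES, vs_latch, vs_ff):
        print(f"{name:5s} {size:5d} bits: vs latch {s_latch:6.1%}, vs FF {s_ff:6.1%}")
```

At 4096 bits the read speed-up over the latch based design comes out negative (40.8 ns against 39.6 ns), one of the cases in which the OAI based SCM is not the fastest option.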

Fig. 5: (a) Write delay, (b) read delay, and (c) power consumption of OAI/LATCH/FF based SCMs.

registers, and its endpoints are the cross-coupled OAI gates in the memory array. Read-access time is determined by the read decoder; consequently, read delay is defined from the read-address through the output of the bitcells to the register output. Power consumption of the SCM is measured by running a memory march test. Post-layout netlist simulation with SDF back-annotation is performed for power analysis.

A comparison of write delay, read delay, power consumption, and area for SCMs from 16 bits to 4096 bits is listed in Table IV. To observe the trends of delays and power consumption, the results at the typical corner, 0.4 V, and 25 °C are plotted in Fig. 5a, 5b, and 5c. The OAI based SCMs are superior to latch and flip-flop based SCMs in most situations. The proposed OAI based SCM achieves an average speed-up of 28% and 39% in write delay compared to latch and flip-flop based SCMs, respectively. The speed-up in read delay is 25% and 19% on average compared to latch and flip-flop based SCMs, respectively. Power consumption for OAI and latch based SCMs is almost equal. The flip-flop standard cell consists of two latches, hence flip-flop based SCMs consume more power.

V. CONCLUSION

For a variety of SCMs with different sizes and aspect ratios, a technology-independent SCM compiler is proposed to support a fully automatic design flow. Experimental results show that the proposed OAI based SCMs are more area and energy efficient due to automatic optimization during synthesis. OAI based SCMs outperform latch/flip-flop based SCMs in write and read delay. The evaluation also shows the superiority in power consumption and area compared to flip-flop based SCMs. This essentially indicates that the proposed SCM design is competitive for near/sub-threshold operation. Furthermore, the SCM design still has potential to become more area and energy efficient with proper transistor sizing in the bitcell, while being supported by a fully automatic design flow.

REFERENCES

[1] X. Fan, J. Stuijt, R. Wang, B. Liu, and T. Gemmeke, “Re-addressing SRAM design and measurement for sub-threshold operation in view of classic 6T vs. standard cell based implementations,” in IEEE International Symposium on Quality Electronic Design, Mar. 2017.
[2] T.-H. Kim, J. Liu, J. Keane, and C. H. Kim, “A high-density subthreshold SRAM with data-independent bitline leakage and virtual ground replica scheme,” in IEEE International Solid-State Circuits Conference, pp. 330–606, Feb. 2007.
[3] C. B. Kushwah and S. K. Vishvakarma, “A sub-threshold 8T SRAM cell design for stability improvement,” in IEEE International Conference on IC Design & Technology, pp. 1–4, May 2014.
[4] M.-F. Chang, S.-W. Chang, P.-W. Chou, and W.-C. Wu, “A 130 mV SRAM with expanded write and read margins for subthreshold applications,” IEEE Journal of Solid-State Circuits, vol. 46, Feb. 2011.
[5] H. Kaeslin, Digital Integrated Circuit Design: From VLSI Architectures to CMOS Fabrication. Cambridge: Cambridge University Press, 2008.
[6] P. Meinerzhagen, O. Andersson, B. Mohammadi, Y. Sherazi, A. Burg, and J. N. Rodrigues, “A 500 fW/bit 14 fJ/bit-access 4kb standard-cell based sub-VT memory in 65nm CMOS,” in European Solid-State Circuits Conference, pp. 321–324, IEEE, Sep. 2012.
[7] B. Liu, M. Ashouei, J. Huisken, and J. P. De Gyvez, “Standard cell sizing for subthreshold operation,” in Design Automation Conference, p. 962, June 2012.
[8] X. Fan, J. Stuijt, and T. Gemmeke, “Towards SRAM leakage power minimization by aggressive standby voltage scaling - Experiments on 40nm test chips,” in IEEE International Symposium on Defect and Fault Tolerance, pp. 1–4, Oct. 2017.
[9] P. Meinerzhagen, S. M. Y. Sherazi, A. Burg, and J. N. Rodrigues, “Benchmarking of standard-cell based memories in the sub-Vt domain in 65-nm CMOS technology,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, pp. 173–182, June 2011.

