
Low-Power High-Performance and Dynamically

Configured Multi-Port Cache Memory Architecture

H. Bajwa and X. Chen
EE Dept, Grove School of Engineering, City College and Graduate Center of the City University of New York
Convent Ave. at 140th St., New York, NY, 10031 U.S.A.

Abstract-As on-chip cache size has increased considerably in recent high-performance microprocessor technologies, power dissipation and leakage current in SRAM have become critical. High-performance IC designs use multi-port cache memory to provide the needed accessibility and bandwidth. Since the word and bit lines cover the foot-print of the entire cache section, duplicating the word and bit lines for multiple ports results in large silicon area and increases bitline discharge and power dissipation. As technology scales down device size and supply voltages, static power dissipation has emerged as a critical factor in total system power dissipation. In this paper, we present an area- and energy-efficient multi-port cache memory architecture, which employs isolation nodes, local sense amplifiers and dynamic memory partitioning techniques, to facilitate simultaneous multi-port accesses without duplicating bitlines. The proposed cache memory architecture also reduces bitline latency.

I. INTRODUCTION

Sub-threshold leakage current flowing from drain to source, even when the transistor is not operating, has been a dominant leakage current in high-performance microprocessors, where L1 and L2 caches occupy the majority of die area [1, 2, 3]. As transistor size scales down, power dissipation becomes a serious problem that limits overall system performance [2, 4]. In high-performance systems employing multi-core technologies, bitline leakage current can contribute as much as 50% of overall cache memory leakage power [5]. The trend of using multi-port caches in modern microprocessor technologies [6] has exacerbated this further, as leakage current dissipation scales proportionally with the area of the circuit [7].

In this paper, we further explore the low-power, area-efficient architecture presented in [1], focusing on power dissipation in dynamically configured memories. The rest of the paper is organized as follows. In Sections II and III we analyze present SRAM technologies and the leakage current in these technologies. In Section IV we review some previously proposed architectural and circuit techniques that reduce leakage current and bitline latency. In Section V we present the dynamically configured memory architecture, followed by dynamic memory partitioning (DMP) algorithms in Section VI. In Section VII we present the circuit simulation model, followed by conclusions in Section VIII.

II. SRAM LEAKAGE POWER

Lowering threshold voltage and gate oxide thickness has resulted in a significant increase in sub-threshold leakage in sub-micron technologies [2]. Since static power strongly depends upon technology and circuit optimization, most research activities toward reducing leakage power focus on materials and circuits. Before we propose a low-power, area-efficient SRAM architecture, we look at the leakage current components in conventional single- and dual-port memories.

Embedded SRAM suffers from cell leakage current and bitline leakage current [9]. The five basic components of cell leakage current are: reverse-biased junction, sub-threshold, gate-induced drain, punch-through and gate tunneling leakage current. Sub-threshold leakage current is the drain-source current of a transistor when the gate-source voltage is less than the threshold voltage, i.e. the transistor is operating in weak inversion. It is the main contributor to leakage current and is defined by the Berkeley BSIM model as follows:

Idsub = I0 * [1 - exp(-Vds / Vt)] * exp[(Vgs - Vth - Voff) / (n * Vt)]   (1)

where Voff is an empirically-determined model parameter; Vt = kT/q is a physical parameter proportional to the temperature; Vth is the threshold voltage; and the current I0 = Is0 * W / L is determined by the geometry. When a transistor is off (Vgs = 0 and Vds = Vcc), the leakage current going through it can be reduced either by increasing the threshold voltage or by reducing its width.

III. CLASSICAL SINGLE- AND MULTI-PORT SRAM

Fig. 1 shows the classic 6-transistor SRAM cell in a single-port configuration. When the word line (WL) is selected, the SRAM cell is connected to the pair of bit lines (BL and BL') via access transistors T5 and T6. Figure 2 shows the SRAM cell with dual ports, each hardwired with dedicated word and bit lines, and illustrates how duplicating bit lines causes bit-line leakage current to multiply. The word and bit lines and access transistors T7 and T8 for the second port would almost double the silicon area [10] of the single-port configuration.

Sub-threshold leakage current, the drain-source current of the transistor when it is operating in weak inversion, is a major contributor to SRAM leakage current [11]. Pre-charging, as well as keeping the bit lines high, causes significant power dissipation and contributes heavily to the total power dissipation. When "0" is stored, transistors T1, T5 and T4 dissipate leakage current. When "1" is stored, T2, T3 and T6 dissipate leakage current. The leakage current in a single-port SRAM cell is described as follows:

Idcell = Idsub(T1) + Idsub(T5) + Idsub(T4)   (2)

[Fig. 1. A Single-Port SRAM Cell]

For a memory core with N rows and M columns this leakage current is described as:

Idcell = N * M * (Idsub(T1) + Idsub(T5) + Idsub(T4))   (3)

[Fig. 2. A Dual-Port SRAM Cell]

Fig. 2 shows the classic hardwired dual-port memory architecture, where each SRAM cell is accessible by two ports with dedicated word and bit lines for each. The addition of the word and bit lines and access transistors T7 and T8 would almost double the silicon area. The dual-port (as well as multi-port) memory architecture has been implemented with instruction and data caches in multi-core processors in recent years. The most important advantage of this architecture is that it can execute multiple cache accesses simultaneously; therefore, it doubles (in the case of dual-port) or multiplies (in the case of multi-port) the bandwidth of a single-port cache.

Leakage current for a dual-port memory cell can be described as:

Idcell = Idsub(T1) + Idsub(T5) + Idsub(T4) + Idsub(T7)   (4)

For a memory core with N rows and M columns the leakage current of Equation (4) can be represented as:

Idcell = N * M * (Idsub(T1) + Idsub(T5) + Idsub(T4) + Idsub(T7))   (5)

Most of the recent research activities in this area are geared toward reduction of sub-threshold leakage current in on-chip cache. Among process- and circuit-level techniques, dynamic Vt, dual-threshold voltage, reduced-gate SRAM (RG SRAM) and gated Vdd have been discussed in [12-15]. Dynamic Vt SRAM [12] reduces leakage current in cache memories by switching cache lines to high Vt if the access has a small probability. The dual-threshold voltage technology [13] uses high-threshold-voltage devices to reduce leakage current; low-threshold-voltage devices are used where high performance is required. Reduced-gate SRAM (RG SRAM) uses two additional pass transistors connected between the cross-coupled inverters to decrease gate leakage current [14]. Gated Vdd [2] reduces leakage power by using a high-threshold transistor between a virtual ground and GND to cut off the power supply to the memory cell in a low-power mode.

Among architectural techniques, bitline segmentation [6, 18] and sub-banking have reduced leakage power significantly. Most architectural techniques are combined with circuit techniques to suppress unnecessary leakage power. Albonesi [15] reduces power dissipation by enabling only a small portion of the L2 cache at a time. Zhu et al. [19] designed a low-power SRAM by enabling banks to switch between active and standby modes. Bitline segmentation efficiently reduces leakage current by shortening the bitline length dynamically. Karandikar et al. [16] divide bit lines into hierarchical segments to reduce bitline capacitance and add parallel bit lines to access SRAM cells. Adding parallel bit lines in this architecture, however, has the drawback of larger memory area. Yang et al. [18] explored this architecture further and proposed hierarchical bit lines with local sense amplifiers. One major source of energy dissipation is charging the whole bitline. Rao [6] divides bitlines into smaller segments such that segments higher than the current access cell are isolated from bitline precharge. Although this approach incurs additional delay, it reduces the

length of bit lines for accessing cells near the physical ports, hence reducing latency and power dissipation.

The cache memory configurations employed in the above-mentioned approaches use fixed bank sizes and duplicated word and bit lines (without providing dual- and multi-port accesses); hence, they incur either moderate performance degradation or large area overhead.

The proposed architecture employs a DMP technique that uses isolation nodes to partition a cache memory block into two virtually independent sections based on the real-time access addresses of multiple ports. Figure 3 shows the placement of an isolation control line (ICL) and an isolation node on each of the bit lines to divide an SRAM block into upper and lower sections, which are to be accessed by the upper and lower ports, respectively. A selected ICL turns off its isolation nodes based on the real-time access addresses at the upper and lower ports. Compared with the hardwired dual-port SRAM shown in Fig. 2, DMP can provide dual-port accesses without the need for a second pair of bit lines, and effectively reduces leakage current, bitline latency and silicon area.

Isolation nodes placed on bit lines introduce additional latency. Rao et al. [6] showed that bit lines segmented by 8 isolation nodes pose no significant performance degradation. With DMP the placement of isolation nodes is of strategic importance. Placing isolation nodes between adjacent word lines provides the highest degree of flexibility for DMP, but would be overkill if the applications do not need or utilize such fine-grain DMP capability. In principle, isolation nodes are placed for every n word lines, where n is determined based on the statistical patterns of access addresses of targeted applications.

In Fig. 3 the bit lines across 512 word lines are divided into 8 groups, each containing 8 sub-groups. Each group contains a local sense amplifier block and a local pre-charge block. An ICL and two isolation nodes are placed between sub-groups, each spanning 8 word lines. When an ICL turns its isolation nodes off, the upper and lower ports can access desired cells from different sub-groups. Each group's local sense amplifiers ensure that the group's bit-line section accesses cells on 64 word lines separated by 8 isolation nodes with minimal additional delay [6]. Thus a large cache memory can be divided into virtually independent sections. Figure 4 illustrates the structural configuration converting a conventional single-port 8-way cache memory into a dual-port cache with DMP, without adding bit lines for the second port.

[Fig. 3. ICL and Isolation Nodes Placement]

[Fig. 4. A Cache Subsystem with DMP]

Simultaneous accesses to the cache via multiple ports may encounter potential conflicts, known as access conflicts. For example, READ and WRITE operations via Ports X and Y to the same cells may take place simultaneously. When this occurs, existing (hardwired) multi-port memory access conflict resolutions may require a READ-WRITE-READ sequence to be executed to ensure data integrity. With DMP, all existing multi-port memory access conflict resolutions still apply and no changes to these protocols are necessary.

A small circuit block is used to compare the addresses and calculate ICLs. DMP operations are carried out in parallel with the address setup (including bank selection) and bitline precharge phases, as depicted in Fig. 5; there is no overall operation delay due to DMP. It is important to note that a new ICL is enabled only when the current ICL cannot provide valid multi-port accesses.

[Fig. 5. Memory Operation Timing with DMP. Toc: operation cycle time; TAS: address setup time; TBA: bank address setup time; TICL: ICL identification time; TDP: valid data; Tp: precharge time]

[Fig. 6: fine-grain DMP, showing ICLs placed between word lines WL(1) through WL(n), with the upper port (Port A) and lower port (Port B)]

Optimal cache partitioning not only provides dual-port access but also reduces delay and minimizes power dissipation. The efficiency of a dynamic partitioning algorithm is determined by three factors: delay, power dissipation and the complexity of the algorithm. Two generic DMP algorithms are developed based on the fine-grain DMP illustrated in Figure 6. Algorithm DMP-1 provides the optimal partitioning that minimizes bitline latency and power dissipation, with which two ICLs are turned off during partitioning: ICL(j) is turned off for the upper port to access WL(j), while ICL(i-1) is turned off for the lower port to access WL(i), where j < i. This approach ensures the shortest active bitlines for both the upper and lower ports, hence minimizing latency and power dissipation. The pseudo-code of DMP-1 is as follows:

addr(A)<1:n>; addr(B)<1:n>;
where addr(A) = i > addr(B) = j;
if i = j + 1 return ICL(j);
else return ICL(j) and ICL(i-1);

For coarse-grain DMP (an isolation node is placed on the bit lines between every n word lines), DMP-1 is adjusted accordingly to select the correct ICLs to turn off.

For applications that do not require a new partition for every memory access, algorithm DMP-2 minimizes the switching of isolation nodes by identifying whether or not the current partition can facilitate the new accesses. In this case, one selected ICL is turned off for a DMP. The generic fine-grain DMP-2 algorithm is pseudo-coded as follows:

addr(A)<1:n>; addr(B)<1:n>;
where addr(A) = i > addr(B) = j;
k = current ICL;
if (j < k < i) return NULL (no new DMP);
else return (j + (i-j)/2);

For coarse-grain cases, DMP-2 is adjusted accordingly.

VII. CIRCUIT SIMULATION MODEL

Simply scaling down geometry and voltage parameters has proven too crude to predict the behavior of nano-scale technologies. We have used the BSIM Predictive Technology Model [20] to calculate the resistance and capacitance of MOS devices. Bit-line delay, wire resistance and capacitance are an increasing source of concern in sub-micron cache design [21]. Figure 7 shows the wire resistance trends, evaluated from [22], as technology scales down from 90nm to 32nm. Fig. 8 shows the transmission gate and MOS device resistances generated with the BSIM Predictive Technology Model [21]. As technology scales down, device resistance decreases and the delay due to isolation nodes will become smaller. Fig. 9 shows the circuit simulation model for the active bit lines with DMP. The isolation node (implemented as a transmission gate) is modeled by RT and two CT. With Wn/Wp = 0.5, using 90nm technology, we calculated the capacitance and resistance of the transmission gate. For simulation purposes the drive inverter of the SRAM cell is replaced with an equivalent R0 and C0. Though the length and width of the wire are difficult to predict in the early stages of design, we used a stick diagram to evaluate the length of the wire to be 60λ per memory cell. Rw and Cw represent the wire resistance and capacitance between two bit cells on the same bitline. CL is the total load capacitance of the local sense amplifier structure.
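The two selection rules can be rendered as short runnable code. This is our sketch of the pseudo-code above (ICLs are represented by their indices; function names are ours), not the authors' implementation:

```python
# Runnable rendering of the fine-grain DMP-1 and DMP-2 selection rules.
# Upper port accesses WL(j), lower port accesses WL(i), with j < i.

def dmp1(i: int, j: int):
    """DMP-1: optimal partition; returns the ICL indices to switch off."""
    assert j < i
    if i == j + 1:
        return (j,)        # a single ICL already separates WL(j) and WL(i)
    return (j, i - 1)      # shortest active bitline for each port

def dmp2(i: int, j: int, k: int):
    """DMP-2: reuse the current partition when possible. k is the index
    of the currently active ICL. Returns None when no new DMP is needed,
    otherwise the single ICL to switch off (midpoint of the accesses)."""
    assert j < i
    if j < k < i:
        return None        # current ICL already separates the two ports
    return j + (i - j) // 2
```

For instance, dmp1(9, 2) switches off ICL(2) and ICL(8), while dmp2(9, 2, 5) returns None because ICL(5) already lies between the two accesses.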

[Fig. 7: wire resistance (RWIRE, Ω) versus technology node, 32nm to 90nm]

[Fig. 8. Technology Scaling Impact on MOS Devices and Transmission Gate Resistance (R-NMOS, RT-GATE and R-PMOS, in Ω, for 32nm to 90nm nodes)]

[Fig. 9. RC Model of Active Bitline with Isolation Node (drive inverter, wire RC model, and isolation node / pass transistor RC model)]

VIII. CONCLUSION

An area- and energy-efficient multi-port memory architecture is proposed. It employs new DMP techniques, which use isolation nodes and control lines to dynamically partition the bit lines of a memory block into virtually isolated sections, so that they can be accessed simultaneously and independently. Compared with the classic hardwired multi-port memory architecture, the new DMP facilitates efficient designs with no significant impact on the timing of memory operations. It reduces the use of silicon area, largely due to the elimination of additional bit lines. On average, bitline pre-charge and leakage currents are reduced to half the value of a typical hardwired multi-port memory. Shorter active bit lines also mean less power dissipation.

Cache memory often contributes a large part of the total system power dissipation. This happens because the bit lines remain pre-charged even when not accessed. The proposed architecture reduces leakage power by using bitline isolation and selective pre-charging. A dynamically configured memory not only reduces leakage current by eliminating the pass transistors of a hardwired multi-port memory, but also cuts the bitline leakage power in half by eliminating the additional bitlines. The leakage current in a dual-port dynamically configured memory cell is the same as (2). For a memory core with N rows and M columns, the leakage current is reduced to less than half the value of hardwired dual-port memories:

Idcell = (N/2) * M * (Idsub(T1) + Idsub(T5) + Idsub(T4))   (6)

[Fig. 10. Bitline Delay with/without Isolation Nodes]

Using 90nm technology data we estimated bitline delay by calculating the resistance and capacitance of the drive circuit, wires, transmission gate and load. The resistance and capacitance per cell are calculated to be 0.36 Ω and 0.6675 fF, respectively. The capacitance of a pass transistor in cutoff mode is calculated to be 76 fF. The transmission gate resistance RT and the two ground capacitances CT are calculated to be 39.2 Ω and 376 fF, respectively. Figure 10 compares the bitline delay in a 64 word-line block with and without isolation nodes, where each isolation node separates an 8-word-line sub-block from another. In the worst case, when the simultaneous access addresses are near either the top or the bottom of a cache block (hence one active bitline is much longer than the other), the delay of the longer active bitline is increased by about 200 ps compared to the hardwired bitline without isolation nodes. When accesses are facilitated with a DMP in the middle of the bitlines, the delay of the active bitlines for each section is about 40% of that of the hardwired version, with less pre-charge time.
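The leakage scaling can be checked numerically against the single-port core expression, its hardwired dual-port counterpart, and the DMP expression of Equation (6). The per-transistor subthreshold currents below are placeholder values of our own, chosen only to illustrate the scaling; they are not measurements from the paper:

```python
# Leakage-scaling sketch. Per-transistor subthreshold leakages (amperes)
# are illustrative placeholders, not values from the paper.
N, M = 512, 64                                        # example core: rows x columns
I_T1, I_T4, I_T5, I_T7 = 40e-12, 40e-12, 25e-12, 25e-12

single_port    = N * M * (I_T1 + I_T5 + I_T4)         # single-port core
hardwired_dual = N * M * (I_T1 + I_T5 + I_T4 + I_T7)  # hardwired dual-port
dmp_dual       = (N / 2) * M * (I_T1 + I_T5 + I_T4)   # Equation (6)

# The DMP dual-port core leaks exactly half as much as a single-port
# core, and therefore less than half as much as the hardwired dual-port
# core (which adds the second port's access-transistor term):
assert dmp_dual == single_port / 2
assert dmp_dual < hardwired_dual / 2
```

Whatever per-transistor values are used, the inequality holds, because the hardwired dual-port expression carries the extra Idsub(T7) term while the DMP expression halves the row count.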

Combined with local sense amplifiers and port multiplexing, the proposed DMP can support an efficient multi-port cache architecture that reduces silicon footprint, wiring contention, power dissipation and bitline latency.

REFERENCES
[1] X. Chen and H. Bajwa, "Energy-efficient dual-port cache architecture with improved performances," IEE Electronics Letters, Vol. 43, No. 1, pp. 12-14, Jan. 2007.
[2] N.S. Kim, K. Flautner, D. Blaauw and T. Mudge, "Circuit and micro-architectural techniques for reducing cache leakage power," IEEE Transactions on VLSI Systems, Vol. 12, No. 2, pp. 167-184, Feb. 2004.
[3] M. Powell, S. Yang, B. Falsafi, K. Roy and T. Vijaykumar, "Gated-Vdd: a circuit technique to reduce leakage in deep-submicron cache memories," in Proc. IEEE/ACM Int. Symposium on Low Power Electronics and Design, pp. 90-95, 2000.
[4] S. Kim, N. Vijaykrishnan, M. Kandemir and M.J. Irwin, "Optimizing leakage energy consumption in cache bitlines," Journal of Design Automation for Embedded Systems, Vol. 9, No. 1, pp. 5-18, Mar. 2004.
[5] S.-H. Yang and B. Falsafi, "Performance and energy trade-offs of bitline isolation in nano-scale CMOS caches," presented at the Workshop on Complexity-Effective Design (WCED), held in conjunction with the 30th International Symposium on Computer Architecture (ISCA-30), Jun. 2003.
[6] R. Rao, J. Wence, D. Franklin, R. Amirtharajah and V. Akella, "Exploiting non-uniform memory access pattern through bit line segmentation," presented at the Workshop on Memory Performance Issues, in conjunction with High Performance Computer Architecture (HPCA), Feb. 2006.
[7] B. Amelifard, F. Fallah and M. Pedram, "Reducing the sub-threshold and gate-tunneling leakage of SRAM cells using dual-Vt and dual-Tox assignment," in Proc. Design, Automation and Test in Europe, Vol. 1, pp. 1-6, 2006.
[8] S. Mutoh, T. Douseki, Y. Matsuya, T. Aoki, S. Shigematsu and J. Yamada, "1-V power supply high-speed digital circuit technology with multithreshold-voltage CMOS," IEEE Journal of Solid-State Circuits, Vol. 30, pp. 847-854, Aug. 1995.
[9] N. Vijaykrishnan, M. Kandemir and M.J. Irwin, "Optimizing leakage energy consumption in cache bit-lines," Design Automation for Embedded Systems, Vol. 9, pp. 5-18, 2004.
[10] R.D. Adams, High Performance Memory Testing: Design Principles, Fault Modeling and Self-Test, Kluwer Academic Publishers, 2003.
[11] M. Mamidipaka, K. Khouri, N. Dutt and M. Abadir, "Analytical models for leakage power estimation of memory array structures," in International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pp. 146-151, 2004.
[12] C.H. Kim and K. Roy, "A leakage tolerant cache memory for low voltage microprocessors," in Proc. of the 2002 International Symposium on Low Power Electronics and Design, pp. 251-254, 2002.
[13] J.T. Kao and A.P. Chandrakasan, "Dual-threshold voltage techniques for low-power digital circuits," IEEE Journal of Solid-State Circuits, Vol. 35, No. 7, pp. 1009-1018, Jul. 2000.
[14] C. Thondapu, P. Elakkumanan and R. Sridhar, "RG-SRAM: a low gate leakage memory design," in Proc. of the IEEE Computer Society Annual Symposium on VLSI, pp. 295-296, 2005.
[15] D.H. Albonesi, "Selective cache ways: on-demand cache resource allocation," in Proc. of the 32nd Annual International Symposium on Microarchitecture, pp. 248-259, Nov. 1999.
[16] A. Karandikar and K.K. Parhi, "Low power SRAM design using hierarchical divided bit-line approach," in Proc. Int. Conf. Computer Design: VLSI in Computers and Processors, pp. 82-88, 1998.
[17] K. Ghose and M.B. Kamble, "Reducing power in superscalar processor caches using sub-banking, multiple line buffers and bit-line segmentation," in Proc. of the International Symposium on Low Power Electronics and Design, pp. 70-75, Aug. 1999.
[18] B.-D. Yang and L.-S. Kim, "A low power SRAM using hierarchical bit line and local sense amplifiers," IEEE Journal of Solid-State Circuits, Vol. 40, No. 6, pp. 1366-1376, Jun. 2005.
[19] Z. Zhu, K. Johguchi, H.J. Mattausch, T. Koide and T. Hironaka, "Low power bank-based multi-port SRAM design due to bank standby mode," in Proc. of the 47th Midwest Symposium on Circuits and Systems, Vol. 1, pp. 569-572, 2004.
[20] W. Zhao and Y. Cao, "New generation of predictive technology model for sub-45nm design exploration," in Proc. of the 7th International Symposium on Quality Electronic Design, pp. 585-590, 2006.
[21] P. Kapur, J.P. McVittie and K.C. Saraswat, "Technology and reliability constrained future copper interconnects. Part I: Resistance modeling," IEEE Transactions on Electron Devices, Vol. 49, No. 4, pp. 590-597, 2002.
[22] C. Grecu, P.P. Pande, A. Ivanov and R. Saleh, "A scalable communication-centric SoC interconnect architecture," in Proc. of the 5th International Symposium on Quality Electronic Design, pp. 343-348, 2004.
