Low Power Digital VLSI Design PDF
The power dissipation issues and the devices' reliability problems, when they
are scaled down to 0.5 µm and below, have driven the electronics industry to
adopt a supply voltage lower than the old standard, 5 V. The new industry
Low-Power VLSI Design: An Overview 3
Processor      Clock (MHz)   Technology (µm)   Supply (V)   Power (W)   Ref.
PowerPC 603    80            0.5               3.3          2.2         [10]
IBM 486SLC2    66            0.8               3.3          1.8         [11]
MIPS R4200     80            0.64              3.3          1.8         [12]
Handheld Cellular
Example: Motorola MicroTAC
RF power: 600 mW
[Figure 1.1: Power reduction design space. Recoverable labels: logic/circuit, device/process.]
the threshold voltage. To overcome this problem, the devices should be scaled
properly. The advantages of scaling for low-power operation are the following:
Table 1.4 shows the effect of scaling on microprocessor performance [14]. The
power dissipation can be reduced by one order of magnitude at a fixed frequency
of operation.
• Utilize low system clocks. Higher frequencies are generated with an on-chip
  phase-locked loop; and
• High level of integration. Integrate off-chip memories (ROM, RAM, etc.)
  and other ICs such as digital and analog peripherals.
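The benefit of moving from a 5 V supply to 3.3 V can be illustrated with the usual first-order expression for CMOS switching power, P = a·C·V²·f. The sketch below is only an illustration: the expression is the standard textbook approximation rather than a formula quoted from this chapter, and the switched capacitance `C_LOAD` and the `activity` factor are hypothetical values chosen for the example.

```python
def dynamic_power(c_load, vdd, freq, activity=1.0):
    """First-order CMOS switching power: P = activity * C * Vdd^2 * f."""
    return activity * c_load * vdd ** 2 * freq


# Hypothetical numbers, for illustration only.
C_LOAD = 1e-9   # total switched capacitance in farads (assumed)
FREQ = 80e6     # clock frequency in hertz (80 MHz)

p_5v = dynamic_power(C_LOAD, 5.0, FREQ)
p_3v3 = dynamic_power(C_LOAD, 3.3, FREQ)
ratio = p_3v3 / p_5v   # depends only on (3.3 / 5.0) ** 2

print(f"P at 5.0 V: {p_5v:.3f} W")
print(f"P at 3.3 V: {p_3v3:.3f} W")
print(f"ratio     : {ratio:.4f}")
```

At a fixed frequency and capacitance the ratio depends only on the square of the supply voltages, which is why supply reduction is such an effective lever.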
1.4 THIS BOOK
This book is an early contribution to the field of low-power digital VLSI circuit
and system design. It targets two types of audiences: senior undergraduate
and postgraduate university students, and VLSI circuit and system
designers working in industry. In this book we have tried to cover the basics
of VLSI systems, from process technologies and device modeling to the
architecture level. The fundamentals of power dissipation in CMOS circuits are
presented to provide readers with sufficient background to be familiar with
the low-power design world. Several practical circuit examples and low-power
techniques, mainly in CMOS technology, are discussed. Low-voltage issues
for digital CMOS and BiCMOS circuits are also emphasized. This book also
provides an extensive study of advanced CMOS subsystem design. Various power
minimization techniques, at the circuit, logic, architecture and algorithm
levels, are presented. Finally, the book includes a rich list of references, treating
advanced topics, at the end of each chapter. This allows readers to study,
in depth, any topics they find interesting.
This book is organized into eight chapters. The first chapter is an introduction
to low-power design. The other chapters are presented in the following sections.
[1] Special Report, "The New Contenders," IEEE Spectrum, pp. 20-25, December 1993.
[2] D. W. Dobberpuhl et al., "A 200-MHz 64-b Dual-Issue CMOS Microprocessor," IEEE J. Solid-State Circuits, vol. 27, no. 11, pp. 1555-1567, November 1992.
[3] W. J. Bowhill et al., "A 300 MHz 64b Quad-Issue CMOS RISC Microprocessor," IEEE International Solid-State Circuits Conf., Tech. Dig., pp. 182-183, February 1995.
[4] "Technology 1995: Solid State," IEEE Spectrum, pp. 35-39, January 1995.
[5] D. Bearden et al., "A 133 MHz 64b Four-Issue CMOS Microprocessor," IEEE International Solid-State Circuits Conf., Tech. Dig., pp. 174-175, February 1995.
[6] MIPS Press release, 1994.
[7] A. Charnas et al., "A 64b Microprocessor with Multimedia Support," IEEE International Solid-State Circuits Conf., Tech. Dig., pp. 178-179, February 1995.
[8] C. Small, "Shrinking Devices Put the Squeeze on System Packaging," EDN, vol. 39, no. 4, pp. 41-46, February 1994.
[9] P. Verhofstadt, "Keynote Address," IEEE Symposium on Low Power Electronics, Tech. Dig., October 1994.
[10] G. Gerosa et al., "A 2.2 W 80 MHz Superscalar RISC Microprocessor," IEEE Journal of Solid-State Circuits, vol. 29, no. 12, pp. 1440-1454, December 1994.
[11] R. Bechade et al., "A 32b 66 MHz Microprocessor," IEEE International Solid-State Circuits Conference, Tech. Dig., pp. 208-209, February 1994.
[12] N. K. Yeung, Y-H. Sutu, T. Y-F. Su, E. T. Pak, C-C. Chao, S. Akki, D. D. Yau, and R. Ladenquai, "The Design of a 55SPECint92 RISC Processor under 2W," IEEE International Solid-State Circuits Conference, Tech. Dig., pp. 206-207, February 1994.
[13] S. Lipoff and A. D. Little, "Evaluation of New Battery Technology in Selected Applications," IEEE Workshop on Low-Power Electronics, Phoenix, AZ, August 1993.
[14] J. M. C. Stork, "Technology Leverage for Ultra-Low Power Information Systems," IEEE Symposium on Low Power Electronics, Tech. Dig., pp. 52-55, October 1994.
2 LOW-VOLTAGE PROCESS TECHNOLOGY
In this section we review two CMOS bulk technologies: N-well and twin-tub
processes. Other processes, such as retrograde-well technology, are not discussed.
The process starts by growing an oxide on the wafer. The oxide is then
patterned to open N-well windows. Phosphorus atoms are implanted into the
silicon, followed by a high-temperature anneal to diffuse the well [Fig. 2.1(a)].
The LOCOS (Local Oxidation of Silicon)¹ technique is used to isolate the
different active areas. After removing the nitride used in the LOCOS process,
a photoresist layer is deposited and is then patterned by a P-well mask (new
mask). This is followed by low-energy ion implantation of boron (B I/I) to
adjust the threshold voltage of the N-channel transistor [Fig. 2.1(b)]. A second
ion implantation can be applied to eliminate punchthrough in the
short-channel device. Similarly, the threshold voltage of the P-channel transistor is
adjusted [Fig. 2.1(c)]. A thin gate oxide is then grown and a layer of polysilicon
is deposited and doped with phosphorus. The polysilicon is patterned to
form the gates of all the transistors and the interconnection layer [Fig. 2.1(d)].
The source and drain regions are then implanted by using a photoresist mask.
Boron is used for the P+ regions of the P-channel transistors and arsenic for
the N+ regions of the N-channel transistors [Fig. 2.1(e)]. The N+ and P+
regions are also used as N- and P-well contacts, respectively. The photoresist
is removed and a thick oxide is deposited by Chemical Vapor Deposition (CVD)
as an isolation layer between the polysilicon layer and the subsequent metal
layer. Contact holes are opened in the oxide layer and metal (usually aluminum)
is deposited on the whole wafer. At this stage, the metal is patterned and
annealed at a relatively low temperature (450 °C) [Fig. 2.1(f)]. One or two other
metal layers are usually added. At the end, the wafer is passivated and windows
are patterned over the metal bonding pads to provide electrical contacts with pins.
¹ For more details on the LOCOS isolation see Section 2.8.1.
[Fig. 2.1: N-well CMOS process flow. Recoverable step labels: strip resist/oxide, grow gate oxide, deposit polysilicon; apply photoresist and pattern, strip resist; apply photoresist, pattern S/D regions for P-channel, implant P+ S/D, strip photoresist, repeat for N+ S/D, strip photoresist; grow oxide, etch contact holes, deposit metal, pattern metal, metal anneal.]
Fig. 2.2 shows the major steps involved in a typical twin-tub process. The
starting material is a lightly doped P-epitaxial material over a P+ substrate to
reduce latch-up. In addition to the conventional N-tub process, another N-type
(arsenic) shallow implant is used to increase the surface doping of the N-tub to
prevent punchthrough (for short-channel devices). It is also used to form the
channel-stoppers for the P-channel transistors [Fig. 2.2(a)]. The photoresist is
stripped and a selective oxidation of the N-tub is performed. The nitride/pad
oxide layers are removed to implant boron, which is driven in to form the P-tub.
This is followed by a second boron ion implantation for the channel-stoppers
for the N-channel device [Fig. 2.2(b)]. The N-tub oxide is then stripped. So
far only one mask (N-tub mask, MASK#1) is required for self-aligned wells
and channel-stopper processes. Both tubs are driven in. LOCOS isolation is
developed to isolate between the devices using MASK#2, which defines the
active areas. After the LOCOS process, boron is implanted through the pad
oxide (used in the LOCOS) to reduce the threshold voltage of the P-channel
transistor using MASK#3. This process results in a buried-channel PMOS
transistor. The pad oxide is then removed. The remaining steps are similar
to those used in the N-well process, where MASK#4 is needed to pattern the
polysilicon [Fig. 2.2(c)]. MASK#5 and MASK#6 are required to form the N+
and P+ sources/drains (S/D), respectively. MASK#7 is used for contact openings,
and MASK#8 for patterning the metal [Fig. 2.2(d)].
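The mask sequence described above can be summarized as a small data structure. This is only a bookkeeping sketch: the purpose strings paraphrase the text, and the numbering of MASK#5, MASK#6 and MASK#7 is inferred from the running count because those digits are not legible in this copy.

```python
# Twin-tub mask sequence as described in the text.
# Assumption: masks 5, 6 and 7 are numbered by their position in the flow.
TWIN_TUB_MASKS = [
    (1, "N-tub definition (self-aligned wells and channel-stoppers)"),
    (2, "active areas (LOCOS isolation)"),
    (3, "P-channel threshold-adjust boron implant"),
    (4, "polysilicon patterning"),
    (5, "N+ source/drain implant"),
    (6, "P+ source/drain implant"),
    (7, "contact openings"),
    (8, "metal patterning"),
]


def describe(masks):
    """Return one printable line per mask, in processing order."""
    return [f"MASK#{number}: {purpose}" for number, purpose in masks]


for line in describe(TWIN_TUB_MASKS):
    print(line)
```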
[Fig. 2.2: Twin-tub CMOS process flow. Recoverable step labels: strip resist, grow selective thick oxide; remove nitride/pad oxide, B I/I (P-well), B anneal (P-well), 2nd B I/I (channel-stoppers); N+ S/D, P+ S/D, contacts, metallization. Recoverable region labels: P-tub, N-tub, P epi-layer, P MOSFET, N MOSFET, substrate, Al.]
have been replaced by the side wall base electrodes. This allows the base area
to be almost as large as the emitter. The SICOS structure is suitable for VLSI
applications because of its density and low parasitics.
Any bipolar process typically starts with creating the buried layers and the
epitaxial layer. Fig. 2.8 illustrates the major steps of the epitaxial growth
with an N+ buried layer (BL). This buried layer is introduced to reduce the
collector resistance of a bipolar transistor, while the epitaxial layer offers the
high-quality silicon host for the bipolar transistor. The steps involved in Fig.
2.8 are the following. First, an oxide layer is grown on the substrate and is then
patterned using the buried layer mask. The photoresist on the oxide serves as
a mask against etching and ion implantation. After etching the oxide, the
exposed regions of the silicon surface are implanted by arsenic or antimony to
form the N+ buried layers. The photoresist is then removed and an annealing
step is carried out. All oxide is then stripped. An N-epitaxial layer is grown
A review of conventional bipolar technology using the junction isolation technique can
be found in [la].
[Fig. 2.8: Epitaxial growth with N+ buried layer. Recoverable step labels: grow oxide, apply photoresist, pattern/etch N+ BL mask, implant Sb.]
on the substrate as shown in Fig. 2.8(b). The thickness of this epitaxial layer
can be as low as 0.8 µm for advanced digital bipolar technology. The problems
limiting the scaling down of the thickness of the epitaxial layer are the autodoping
and out-diffusion of the buried layer.
Fig. 2.9 illustrates the sequence of a DPSA process assuming a starting
structure with N+ buried layer, N-epitaxial layer and isolation oxide as shown in
Fig. 2.9(a). First, photoresist is deposited and patterned to define the
collector contact region (deep N+ collector sink). This region is then implanted
with phosphorus to increase its doping level. The photoresist is stripped and
[Fig. 2.9: DPSA process sequence. Recoverable step labels: initial structure with oxide isolation; apply photoresist, pattern photoresist (N+ collector mask), P I/I for the N+ sink; CVD oxide, strip photoresist/oxide; deposit P+ polysilicon/oxide, pattern/etch oxide/polysilicon; deposit CVD oxide, RIE etch of oxide; level of polysilicon, P I/I (N+ poly), anneal, pattern/etch N+ polysilicon; deposit oxide, open contact holes, deposit metal, pattern/etch metal.]
Several isolation techniques have been proposed and used. The most popular
are LOCOS (Local Oxidation of Silicon) [17], trench isolation [18, 19, 20, 21],
and selective epitaxy [22]. Selective epitaxy is not studied in this chapter.
[Figure labels: P channel-stop, substrate.]
[Figure 2.12: Poly buffered LOCOS process. Recoverable labels: nitride, polysilicon.]
The Poly Buffer LOCOS process was developed to reduce the bird's beak
encroachment [23]. In this modified LOCOS process, the nitride mask thickness
has been increased to 240 nm and a polysilicon stress relief buffer layer of 50 nm
has been added between the nitride and a 10 nm pad oxide [Fig. 2.12(a)]. This
arrangement prevents deep lateral extension of the field oxide under the nitride
layer [Fig. 2.12(b)]. A 0.8 µm field oxide thickness results in 0.15 µm/side of
Fig. 2.13 illustrates the steps of the trench isolation process. First, the pad
oxide, the nitride and the thick oxide layers are patterned using the mask of
the active areas. The thick oxide serves as a mask in the trench processing
[Fig. 2.13(a)]. A deep trench is formed by dry etching (RIE). This is followed
by a boron implant to create the P+ channel-stoppers at the bottom of the
trench. The top thick oxide is removed, and the trench sidewalls are oxidized
[Fig. 2.13(b)]. The polysilicon is deposited over the whole wafer, filling the
trenches. The polysilicon is used as the trench dielectric because it fills
the trenches more uniformly than other dielectrics. The surface polysilicon is then
etched to yield the structure shown in Fig. 2.13(c). The wafer is oxidized
using the nitride as a mask. The nitride is finally removed as illustrated in Fig.
2.13(d). At this stage, conventional processing can be used to integrate the
CMOS devices.
Although trench isolation permits reduction of the separation between the
active regions, it has several drawbacks: i) it is a costly process because of the
large number of processing steps, and ii) it cannot be used as an isolation
region for the inactive parts of the chip. In this case, LOCOS is usually used.
The description of other trench isolation processes can be found in [28].
[Fig. 2.13: Trench isolation process. Recoverable step labels: grow oxide/nitride/oxide, pattern active region; RIE trench, implant boron, remove thick oxide, oxidize trench walls; oxidize, remove nitride.]
[Figure labels: B, E, C, P-substrate.]
epitaxial layer thickness. As the epitaxial thickness is being reduced for higher
device performance, the oxide isolation area becomes smaller, which means that
LOCOS may become a practical isolation technique for advanced bipolar and
BiCMOS technologies.
Fig. 2.16 illustrates the process steps for oxide isolation in a bipolar process.
After epitaxy growth, a thin layer of SiO2 is grown and a layer of Si3N4 is
deposited. A photoresist layer is applied and patterned with an isolation mask
[Fig. 2.16(a)]. Then the nitride/pad oxide layers and approximately half of
the epitaxial layer are dry etched. A boron implant is performed to form the
channel-stopper [Fig. 2.16(b)]. The photoresist is then removed and the wafer
is oxidized to grow the thick isolation oxide. This oxide is called recessed oxide.
The Si3N4 and the pad oxide are stripped at this stage. The resulting structure
is almost planar. In this structure the bird's beak is formed as in the MOS case
[Fig. 2.16(c)].
In the early 1980s, new isolation techniques such as grooves and trenches [29,
30, 31] were demonstrated. These techniques reduced the collector-substrate
capacitance and increased the packing density. Hence they improve circuit
speeds. The fabrication process is the same as the one described in CMOS
trench isolation.
Many of the steps of the advanced CMOS and bipolar processes are similar;
hence, they can be shared for the fabrication of MOS and bipolar transistors
[Fig. 2.16: Oxide isolation in a bipolar process. Recoverable step labels: N+ BL process, grow epi-layer (N-type), grow pad oxide, deposit nitride/resist, pattern resist; strip resist, grow selective oxide; remove nitride/oxide. Recoverable region labels: photoresist, oxide, nitride, epi-layer.]
when they are integrated in a BiCMOS process. Some examples of these steps
are:
1. The N-well, which can be used as the body of the PMOS transistor and
as the N-collector of the NPN transistor;
2. The N+ buried layer of the NPN can be used to form a retrograde well for
the PMOS to reduce the latch-up susceptibility;
3. The polysilicon can be used for the CMOS gates and for the emitter
contacts;
4. The shallow P-type implantation can be shared by the PMOS S/D and
the self-aligned extrinsic base of the NPN transistor;
5. The shallow N-type implantation can be shared by the NMOS S/D and
the emitter of the NPN transistor; and
6. The final annealing steps match.
1. Low-cost;
2. Medium-performance; and
3. High-performance (high-speed).
[Figure: BiCMOS process flow and cross-section (NPN, NMOS, PMOS). Recoverable step labels: P-substrate; N-well (collector); LOCOS isolation; NMOS channel implantation; PMOS channel implantation; gate oxide; polysilicon gate; S/D N+ implantation; S/D P+ implantation (P extrinsic base); base P implantation; contact opening; metallization.]
The BiCMOS process described above can be optimized to be used for
high-performance circuits. The collector resistance is low in comparison to the
low-cost process (example 1). For a 0.8 µm process, the cut-off frequency (fT) of a
bipolar can be as high as 5 GHz.
[Figure: polysilicon-emitter formation. Recoverable labels: polysilicon, NPN, P-base, N-well, thick polysilicon (450 nm), poly-emitter. Recoverable step labels: apply photoresist, pattern emitter; etch poly/oxide, strip resist, deposit LPCVD poly (250 nm, 2nd part of split poly); implant As, apply photoresist; pattern poly, dry etch poly; strip resist, anneal.]
oxide is formed near the emitter and gate edges. Fig. 2.19(b) shows the final
cross-section of this BiCMOS process.
ated with the PNP transistor are its high collector resistance, low current gain,
and high base transit time.
It has been recently reported that CBiCMOS processes can offer NPNs with
fT's of 8-20 GHz and PNPs with fT's of 2-7 GHz [45, 46, 47, 48, 49, 50]. Fig. 2.21
shows a cross-sectional view and process flow of a CBiCMOS [46]. The N+
buried layer of the NPN transistor creates a retrograde well for the PMOS
transistor. The P+ buried layer is only used for isolation isles between NPN
transistors. After the epitaxial layer growth, twin-well and LOCOS processes
are performed. The P-well of the NMOS device is used as the collector of the
PNP transistor. A second high-energy (600 keV) boron ion implantation is
carried out to form the retrograde well (2nd P-well) for the NMOS and the P+
buried layer for the PNP device. The S/D implants of the MOS transistors are used
simultaneously for the extrinsic bases of the NPN and the PNP transistors.
The emitters of the NPN and the PNP are formed by the self-aligned contact
doping technique to simplify the process flow. Finally, the metal is deposited
and patterned.
the minimum length of the MOS gate is 2λ and the minimum length and width
of the bipolar emitter contact are 2λ and 4λ, respectively. Table 2.2 describes
the basic masks used in the layout design of BiCMOS devices. The rest of the
masks are generated automatically.
Table 2.3 lists the design rules for the (design) masks only of a typical BiCMOS
technology in terms of the parameter λ. The corresponding graphical
representation of design rules is illustrated in Plate I. Plate II shows the layouts
of minimum size PMOS, NMOS and bipolar transistors in a 0.8 µm BiCMOS
technology.
⁶ The given design rules are typical of a generic 0.8 µm high-performance BiCMOS process.
[Fig. 2.21: CBiCMOS cross-section and process flow. Recoverable step labels: P-substrate; N+/P+ buried layer; N-type epitaxial layer; N/P twin well (1st P-well for PNP); field isolation; collector deep N+; deep P+ I/I for NMOS retrograde well and 2nd P-well for PNP (P+ buried layer); gate (CMOS); NMOS S/D (N+ extrinsic base for PNP); PMOS S/D (P+ extrinsic base for NPN); NPN base; PNP base; contact holes; N+ and P+ emitter implant; metallization.]
N+ deep collector (CN): The CN mask defines the area which is exposed for the N+ sink implantation.
Polysilicon (PO): The PO mask defines the gate and the emitter electrodes, and the polysilicon interconnect layer.
Emitter window (EW): The EW mask defines the opening for the emitter window.
1. N-well (NW)
1.1 minimum width 12λ
1.2 minimum spacing 12λ
2. N+ diffusion (DN)
2.1 minimum width 3λ
2.2 minimum spacing 3λ
2.3 minimum NW overlap of DN 0λ
2.4 minimum NW to external DN spacing 6λ
3. P+ diffusion (DP)
3.1 minimum width 3λ
3.2 minimum spacing 3λ
3.3 minimum NW overlap of DP 4λ
3.4 minimum NW to external DP spacing 4λ
3.5 minimum space to DN (same potential) 0λ
3.6 minimum space to DN (different potential) 3λ
6. Polysilicon (PO)
6.1 minimum width 2λ
6.2 minimum spacing 3λ
6.3 minimum space to DP or DN 2λ
6.4 gate overhang of DP or DN 2λ
6.5 minimum space to CN or CP 1λ
8. Contact (CO)
8.1 minimum size (single)
8.2 minimum size (double)
8.3 minimum spacing
8.4 minimum DN or DP overlap of CO
8.5 minimum space to gate
8.6 minimum PO overlap of CO 1λ
8.7 minimum CN or CP overlap of CO 1λ
8.8 minimum PO to CO spacing in P-base 2λ
8.9 minimum poly emitter CO to CP spacing 2λ
9. Metal 1 (M1)
9.1 minimum width 2λ
9.2 minimum spacing 3λ
9.3 minimum M1 overlap of CO 1λ
9.4 maximum current density 1 mA/µm
11. Via (VIA)
11.1 minimum size
11.2 minimum spacing
11.3 minimum M1 or M2 overlap of VIA
11.4 minimum VIA to CO spacing
11.5 minimum PO to VIA spacing
11.6 minimum PO overlap of VIA
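Because the rules are written in multiples of one scaling parameter, they convert to physical dimensions with a single multiplication. The sketch below assumes a parameter value of 0.4 µm, inferred from the statement that the minimum MOS gate length is twice the parameter in a 0.8 µm technology. The dictionary holds only a few of the legible rules, and the names `RULES_LAMBDA`, `to_microns` and `min_metal1_width_um` are hypothetical helpers, not part of any layout tool.

```python
LAMBDA_UM = 0.4  # assumption: 0.8 um minimum gate length = 2 * lambda

# A subset of the legible rules, in units of lambda.
RULES_LAMBDA = {
    "NW min width": 12,
    "NW min spacing": 12,
    "DN min width": 3,
    "DN min spacing": 3,
    "PO min width": 2,
    "PO min spacing": 3,
    "M1 min width": 2,
    "M1 min spacing": 3,
}

M1_MAX_CURRENT_MA_PER_UM = 1.0  # rule 9.4: maximum current density


def to_microns(rule_lambda, lam_um=LAMBDA_UM):
    """Convert a rule expressed in lambda units to micrometers."""
    return rule_lambda * lam_um


def min_metal1_width_um(current_ma, lam_um=LAMBDA_UM):
    """Narrowest Metal 1 line that satisfies both rule 9.1 and rule 9.4."""
    width_from_rule = to_microns(RULES_LAMBDA["M1 min width"], lam_um)
    width_from_current = current_ma / M1_MAX_CURRENT_MA_PER_UM
    return max(width_from_rule, width_from_current)


for name, value in RULES_LAMBDA.items():
    print(f"{name}: {value} lambda = {to_microns(value):.2f} um")
print(f"M1 width for 3 mA: {min_metal1_width_um(3.0):.2f} um")
```

Keeping the rules in scalable units means the same table can be reused for a different feature size by changing `LAMBDA_UM` alone, while the current-density limit stays in absolute units.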
Many techniques exist to grow silicon on insulator [51]. The most mature
technique is the epitaxial growth of Silicon On Sapphire (SOS). Many LSI/VLSI
circuits have been fabricated using SOS technology. SOI can also be produced
by using what is called SIMOX (Separation by IMplanted OXygen) [52]
technology. It is fabricated simply by the formation of buried oxide (SiO2) by
implantation of oxygen underneath the surface of the silicon as illustrated in
Fig. 2.22. Dose and energy of oxygen ions are as high as 2 × 10^18 cm^-2 and
200 keV, respectively. A subsequent thermal annealing at high temperature
is performed to improve the quality of the silicon overlayer. The buried oxide
can be several hundreds of nm thick and the thin silicon layer can have several
tens of nm thickness. Compared to SOS, SOI SIMOX materials have better
defect density and thin silicon layer control. The dislocation density can be
lower than 10^… cm^-2. One important phenomenon which exists in CMOS SOI
devices is the kink effect. It consists of a "kink" which appears in the
output characteristics of an SOI MOSFET, as illustrated in Fig. 2.23. It is due
mainly to the floating substrate of an NMOS device. An explanation of this
phenomenon can be found in [51].
[Fig. 2.23: Kink effect in the output characteristics of an SOI MOSFET. Recoverable labels: drain (vertical axis), drain voltage (horizontal axis), kink effect.]
Fig. 2.24 shows a thin-film SOI/SIMOX CMOS process cross-section. The
process starts by the formation of buried oxide in the silicon wafer as explained above
[Fig. 2.24(a)]. Then, an oxide is grown on the surface silicon and a nitride
layer is deposited. Silicon nitride is used as a mask to protect the active region
from oxidation. The nitride/oxide layers are patterned and a LOCOS isolation
is applied [Fig. 2.24(b)]. At the end, the nitride/oxide layers are removed. This
is followed by P I/I to adjust the threshold voltage of the N-channel transistor.
Similarly, the threshold voltage of the P-channel transistor is adjusted by I/I.
A thin gate oxide is then grown and a layer of polysilicon is deposited and
doped with phosphorus. Then the P+ source and drain regions of the PMOS
are patterned and implanted with boron [Fig. 2.24(c)]. Similarly, the N+ S/D
regions of the NMOS are patterned and implanted with phosphorus. A thick
oxide is then deposited as an isolation layer between the polysilicon and the
subsequent metal layer. The oxide is etched at contact locations. Next, the
[Fig. 2.24: Thin-film SOI/SIMOX CMOS process. Recoverable step labels: N-channel VT pattern, N-channel VT implant, grow gate oxide, deposit polysilicon and pattern.]
metal layer (aluminum) is deposited over the whole surface. Finally, the metal
is etched and annealed.
This simple process description shows that the SOI process is much simpler than
bulk CMOS. For instance, the wells are no longer needed, and the punchthrough
implants are also unnecessary if thin-film SOI is used. Fig. 2.25 shows a
Due to the dielectric isolation, the MOS devices have several advantages over
bulk CMOS such as: absence of latch-up, high packing density and lower
parasitic capacitances. SOI reduces the circuit capacitance by 30% [57]. It has been
discovered that if the silicon (containing the devices) is made sufficiently thin
(< 100 nm), the MOSFET devices are fully depleted [51] even when VGS = 0.
Fully depleted thin-film SOI MOS devices offer attractive characteristics for
CMOS applications such as immunity from short-channel effects, absence of
kink effect, superior subthreshold leakage and high drain saturation current
(due to low channel doping) [58, 59, 60].
[1] F. M. Wanlass and C. T. Sah, "Nanowatt Logic using Field-Effect MOS Triodes," International Solid-State Circuits Conference Tech. Dig., pp. 32-33, 1963.
[2] L. C. Parrillo, R. S. Payne, R. E. Davis, G. W. Reutlinger, and R. L. Field, "Twin-Tub CMOS: A Technology for VLSI Circuits," International Electron Devices Meeting Tech. Dig., pp. 752-755, December 1980.
[3] Y. Taur et al., "High-Performance 0.1 µm CMOS Devices with 1.5 V Power Supply," International Electron Devices Meeting Tech. Dig., pp. 127-130, December 1993.
[4] K. F. Lee et al., "Room Temperature 0.1 µm CMOS Technology with 11.8 ps Gate Delay," International Electron Devices Meeting Tech. Dig., pp. 131-134, December 1993.
[5] K. Takeuchi et al., "0.15 µm CMOS with High Reliability and Performance," International Electron Devices Meeting Tech. Dig., pp. 883-886, December 1993.
[6] T. Yamazaki, K. Goto, T. Fukano, Y. Nara, T. Sugii, and T. Ito, "21 ps Switching 0.1 µm-CMOS at Room Temperature using High Performance Co Salicide Process," International Electron Devices Meeting Tech. Dig., pp. 906-908, December 1993.
[7] A. Oyamatsu, K. Kinugawa, and M. Kakumu, "Design Methodology of Deep Submicron CMOS Devices for 1 V Operation," Symposium on VLSI Technology Tech. Dig., pp. 89-90, 1993.
[8] H. Yoshimura, F. Matsuoka, and M. Kakumu, "New CMOS Shallow Junction Well FET Structure (CMOS-SJET) for Low Power-Supply Voltage," International Electron Devices Meeting Tech. Dig., pp. 909-912, December 1992.
[9] T. Uchino, T. Shiba, T. Kikuchi, Y. Tamaki, A. Watanabe, Y. Kiyota, and M. Honda, "15-ps ECL/74-GHz fT Bipolar Technology," International Electron Devices Meeting Tech. Dig., pp. 67-70, December 1993.
[10] T. H. Ning and D. D. Tang, "Bipolar Trends," Proc. IEEE, vol. 74, no. 12, pp. 1669-1671, December 1986.
[34] H. Momose, K. M. Cham, C. I. Drowley, H. R. Grinolds, and R. S. Fu, "0.5 Micron BiCMOS Technology," International Electron Devices Meeting Tech. Dig., pp. 838-840, December 1987.
[35] A. R. Alvarez, J. Teplik, D. W. Schucker, T. Hulseweh, H. B. Liang, M. Dydyk, and I. Rahim, "Second Generation BiCMOS Gate Array Technology," Bipolar Circuits and Technology Meeting Tech. Dig., pp. 113-117, 1987.
[36] B. Bastani, C. Lage, L. Wong, J. Small, R. Lahri, L. Bouknight, T. Bowman, J. Manoliu, and T. Tuntasood, "Advanced 1 Micron BiCMOS Technology for High Speed 256k SRAM's," Symp. on VLSI Technology Tech. Dig., pp. 41-42, 1987.
[37] T. Yamaguchi and T. H. Yuzuriha, "Process Integration and Device Performance of a Submicron BiCMOS with 16-GHz fT Double Poly-Bipolar Devices," IEEE Trans. on Electron Devices, vol. 36, no. 5, pp. 890-896, May 1989.
[38] C. K. Lau, C-H. Lin, and D. L. Packwood, "Sub-micron BiCMOS Process Design for Manufacturing," Bipolar/BiCMOS Circuits and Technology Meeting Tech. Dig., pp. 76-83, 1992.
[39] C. H. Wang and J. Van Der Velden, "A Single-Poly BiCMOS Technology with a 30 GHz Bipolar fT," Bipolar/BiCMOS Circuits and Technology Meeting Tech. Dig., pp. 234-237, October 1994.
[40] H. Yoshida, H. Suzuki, Y. Kinoshita, K. Imai, T. Akimoto, K. Tokashiki, and T. Yamazaki, "Process Integration Technology for Low Process Complexity BiCMOS using Trench Collector Sink," Bipolar/BiCMOS Circuits and Technology Meeting Tech. Dig., pp. 230-233, October 1994.
[41] J. M. Sung et al., "BEST2 - A High Performance Super-Aligned 3V/5V BiCMOS Technology with Extremely Low Parasitics for Low-Power Mixed-Signal Applications," IEEE Custom Integrated Circuits Conf. Tech. Dig., pp. 15-18, May 1994.
[42] H. J. Shin, "Performance Comparison of Driver Configurations and Full-Swing Techniques for BiCMOS Logic Circuits," IEEE Journal of Solid-State Circuits, vol. 25, no. 3, pp. 863-865, June 1990.
[43] S. H. K. Embabi, A. Bellaouar, M. I. Elmasry, and R. A. Hadaway, "New Full-Voltage-Swing BiCMOS Buffers," IEEE Journal of Solid-State Circuits, vol. SC-26, pp. 150-153, February 1991.
$Q_o$ is the total of all charges in the oxide and near the oxide/silicon interface.
This charge is positive. The work function difference between the gate electrode
and the semiconductor, $\phi_{ms}$, depends on the type of the electrode and the doping
concentration of the semiconductor. For an aluminum electrode, we have
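$$\phi_{fp} = -V_t \ln\left(\frac{N_a}{n_i}\right) \quad \text{for P-type Si} \qquad (3.6)$$
$$\phi_{fn} = +V_t \ln\left(\frac{N_d}{n_i}\right) \quad \text{for N-type Si} \qquad (3.7)$$
where $V_t = kT/q$. The charge $Q_s$ is the sum of the charge in the depletion
layer $Q_B$ and the inversion layer $Q_I$. Therefore,
$$V_{GS} = V_{FB} + \phi_s - \frac{Q_B + Q_I}{C_{ox}} \qquad (3.8)$$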
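Equations (3.6) and (3.7) are easy to evaluate numerically. The sketch below assumes room temperature (300 K) and an intrinsic carrier concentration of 1.45 × 10^10 cm^-3 for silicon, a commonly quoted value that is not taken from this text. The doping level in the usage lines is likewise a hypothetical example.

```python
import math

K_BOLTZMANN = 1.380649e-23    # J/K
Q_ELECTRON = 1.602176634e-19  # C
N_I = 1.45e10                 # cm^-3, assumed intrinsic concentration of Si at 300 K


def thermal_voltage(temp_k=300.0):
    """Thermal voltage kT/q in volts."""
    return K_BOLTZMANN * temp_k / Q_ELECTRON


def fermi_potential_p(n_a, temp_k=300.0, n_i=N_I):
    """Fermi potential of P-type silicon, Eq. (3.6); the result is negative."""
    return -thermal_voltage(temp_k) * math.log(n_a / n_i)


def fermi_potential_n(n_d, temp_k=300.0, n_i=N_I):
    """Fermi potential of N-type silicon, Eq. (3.7); the result is positive."""
    return thermal_voltage(temp_k) * math.log(n_d / n_i)


# Hypothetical doping level of 1e16 cm^-3 for both substrate types.
print(f"kT/q   = {thermal_voltage() * 1e3:.2f} mV")
print(f"phi_fp = {fermi_potential_p(1e16):.3f} V")
print(f"phi_fn = {fermi_potential_n(1e16):.3f} V")
```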
The bulk depletion charge (per unit area) consists of ionized acceptors (P-type
substrate) or donors (N-type substrate). The depletion charge of a P-type bulk,
with zero bulk-source bias voltage ($V_{BS} = 0$), is given by
(3.10)
$$V_{T0} = V_{FB} + 2|\phi_f| - \frac{Q_{B0}}{C_{ox}} \qquad (3.11)$$
$Q_{B0}$ is equal to $-qN_aW_{Dm}$, where $W_{Dm} = W_D(\phi_s = 2|\phi_f|)$³. Thus, the
threshold voltage can be rewritten as
If the bulk-source is reverse biased ($|V_{BS}| > 0$), the threshold voltage becomes
(3.15)
Low- Voltage Device Modeling 67
This value is negative and is not suitable for digital circuits where a positive
$V_{T0}$ is required for switching. To get a reasonable $V_{T0}$, the device surface is
implanted with boron. The implanted dose $D_I$ causes $V_{T0}$ to increase by the
amount $qD_I/C_{ox}$. The threshold voltage is hence given by
The symbols of the NMOS and PMOS transistors are shown in Fig. 3.1(c).
Typical values of $V_T$ are -2.5 V to -4 V for depletion-mode NMOS devices.
For low-voltage CMOS they are 0.3 V to 0.8 V for enhancement-mode NMOS
devices, and -0.3 V to -0.8 V for enhancement-mode PMOS devices.
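As a numerical illustration of Equation (3.11) and of the implant shift described above, the following sketch computes the zero-bias threshold of an NMOS device on a P-type substrate and then adds the boron-implant term. Several pieces are assumptions rather than values from the text: the depletion width uses the standard one-sided abrupt-junction expression, and the flat-band voltage, doping, oxide thickness and dose in the usage lines are hypothetical.

```python
import math

Q = 1.602176634e-19      # C
EPS0 = 8.854187817e-14   # F/cm
EPS_SI = 11.7 * EPS0     # permittivity of silicon
EPS_OX = 3.9 * EPS0      # permittivity of SiO2
N_I = 1.45e10            # cm^-3, assumed intrinsic concentration at 300 K
V_THERMAL = 0.02585      # V, kT/q at about 300 K


def threshold_voltage(v_fb, n_a, t_ox_cm, dose_cm2=0.0):
    """Zero-bias threshold per Eq. (3.11), plus the implant shift q*D_I/C_ox."""
    phi_f = V_THERMAL * math.log(n_a / N_I)        # magnitude |phi_f|
    c_ox = EPS_OX / t_ox_cm                        # F/cm^2
    # Assumed standard depletion width at phi_s = 2|phi_f| (in cm).
    w_dm = math.sqrt(2.0 * EPS_SI * 2.0 * phi_f / (Q * n_a))
    q_b0 = -Q * n_a * w_dm                         # C/cm^2, negative for P-type
    v_t0 = v_fb + 2.0 * phi_f - q_b0 / c_ox        # Eq. (3.11)
    return v_t0 + Q * dose_cm2 / c_ox              # boron implant raises V_T0


# Hypothetical device: V_FB = -0.9 V, N_a = 1e16 cm^-3, 15 nm gate oxide.
vt_plain = threshold_voltage(-0.9, 1e16, 15e-7)
vt_implanted = threshold_voltage(-0.9, 1e16, 15e-7, dose_cm2=1e12)
print(f"V_T0 without implant: {vt_plain:.3f} V")
print(f"V_T0 with implant   : {vt_implanted:.3f} V")
```

The implant term is independent of the substrate doping in this simple model, so the dose can be used to set the threshold after the well profile has been fixed.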
When $V_{GS} < V_{T0}$, the transistor is in the cutoff region, since no inversion layer
exists, as shown in Fig. 3.2(a). The drain current is, therefore, approximately
zero. When $V_{GS} > V_{T0}$, the channel is formed and a drain current flows from
the drain to the source [Fig. 3.2(b)]. The transistor is in the linear region (also
called ohmic region) when $V_{GD}$ (i.e. $V_{GS} - V_{DS}$) $\geq V_T$. When $V_{GS} > V_T$ and
$V_{DS} > V_{GS} - V_T$ (i.e. $V_{GD} < V_T$) the channel is pinched off as illustrated
in Fig. 3.2(c) and the device enters the saturation region. The drain-source
voltage which causes the channel to pinch off at the drain edge is commonly
known as the saturation drain-source voltage $V_{DSsat}$ and is equal to $V_{GS} - V_T$.
The voltage drop between the pinchoff point and the source is $V_{DSsat}$. Any
$V_{DS}$ higher than $V_{DSsat}$ will appear between the pinchoff point and the drain.
If we assume that the distance between the pinchoff point and the drain is
extremely small compared with the overall length, then for $V_{DS} > V_{DSsat}$ the
drain current is constant. The carriers which reach the pinchoff point are swept
across to the drain by the potential ($V_{DS} - V_{DSsat}$) between the drain and the
end of the channel.
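The three operating regions described above reduce to two comparisons. The sketch below is a minimal classifier for an NMOS device; the function names are hypothetical, and treating the exact boundary (gate-source voltage equal to the threshold) as cutoff is an assumption, since the text only states the strict inequalities.

```python
def nmos_region(v_gs, v_ds, v_t):
    """Classify the operating region of an NMOS transistor."""
    if v_gs <= v_t:
        return "cutoff"        # no inversion layer
    if v_gs - v_ds >= v_t:     # V_GD >= V_T: channel exists at the drain end
        return "linear"
    return "saturation"        # V_GD < V_T: channel pinched off at the drain


def v_ds_sat(v_gs, v_t):
    """Saturation drain-source voltage, V_GS - V_T (zero when the device is off)."""
    return max(v_gs - v_t, 0.0)


# Hypothetical bias points for a device with V_T = 0.7 V.
for v_gs, v_ds in [(0.2, 1.0), (3.3, 0.1), (3.3, 3.3)]:
    print(v_gs, v_ds, nmos_region(v_gs, v_ds, 0.7))
```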
From Fig. 3.3 it can be shown that the element $dx$ has a resistance
(3.17)
To solve this integration, we need to express the electron inversion charge
density $Q_I(x)$ in terms of $V$. From Equation (3.8), we have
$$Q_I(x) = -\left[V_{GS} - V_{FB} - \phi_s + \frac{Q_{B0}}{C_{ox}}\right] C_{ox} \qquad (3.20)$$
The surface potential $\phi_s$ at any point $x$ along the channel is equal to $2|\phi_f| + V(x)$.
By substituting for $V_{FB} - Q_{B0}/C_{ox} + 2|\phi_f|$ by $V_{T0}$ [Equation (3.11)] in
Equation (3.20) we get
$$Q_I(x) = -\left[V_{GS} - V_{T0} - V(x)\right] C_{ox} \qquad (3.21)$$
The surface potential at the drain is larger than that at the source by $V_{DS}$.
Therefore, the magnitude of $Q_I$ decreases with the distance across the channel.
This is why the inversion layer is triangular as illustrated in Fig. 3.3. Assuming
that $Q_{B0}$ is constant across the channel and substituting for $Q_I$ from Equation
(3.21) into Equation (3.19), we obtain
The characteristics of an MOS transistor based on Equations (3.24) and (3.25)
are shown in Fig. 3.4. The current equations (3.24) and (3.25) have to be
modified if the bulk-source voltage is greater than zero by replacing $V_{T0}$ by
$V_T$ [see Equation (3.14)]. Note that when $V_{DS}$ is small (say 60 mV), Equation
(3.24) can be approximated by
This equation expresses a linear relationship between $I_{DS}$ and $V_{GS}$. Using
linear extrapolation, $V_{TO}$ and $k_p$ can be determined as shown in Fig. 3.4(b).
The measured I-V characteristics show that the drain current, in the saturation
region, is a weak function of $V_{DS}$. This is due to the channel length modulation
phenomenon, which can be explained as follows. Let us define
If we assume that $\Delta L / L_{eff} \ll 1$, then we can rewrite the current as
$\frac{\Delta L}{L_{eff}} = \lambda V_{DS}$ (3.31)
The channel modulation factor $\lambda$ is very small. A typical value of $\lambda$ is 0.01
V$^{-1}$.
The drain current model described so far is known as the LEVEL 1 (MOS1)
model in SPICE^4. This model is also called the Shichman-Hodges model.
However, this model is still too simple^5 to account for state-of-the-art CMOS
devices and might lead to a 100% error in the current, particularly for low-
voltage deep-submicrometer CMOS devices. However, $k_p$ (or $\mu$) can be used
as a fitting parameter to reduce this error. This model is most suitable for
preliminary analysis.
^4 SPICE2G6 or 3B1 or 3C1.
^5 This model was used in the 70's.
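The LEVEL 1 behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the standard square-law expressions are used because Equations (3.24) and (3.25) are not legible here, the factor $(1 + \lambda V_{DS})$ is applied in both regions to keep the current continuous (as SPICE does), and the function name, `kp` and `w_over_l` defaults are hypothetical values chosen for the example.

```python
def level1_ids(vgs, vds, vt0=0.8, kp=100e-6, w_over_l=10.0, lam=0.01):
    """Drain current (A) of an NMOS under a square-law (LEVEL 1 style) model.

    vt0 and lam follow the typical values quoted in the text; kp and
    w_over_l are hypothetical illustration values.
    """
    vov = vgs - vt0                      # gate overdrive
    if vov <= 0.0:
        return 0.0                       # cutoff (subthreshold conduction ignored)
    beta = kp * w_over_l
    clm = 1.0 + lam * vds                # channel length modulation factor
    if vds < vov:
        # linear (ohmic) region
        return beta * (vov * vds - 0.5 * vds ** 2) * clm
    # saturation region: the current is a weak function of vds through lam
    return 0.5 * beta * vov ** 2 * clm


if __name__ == "__main__":
    for vds in (0.06, 0.5, 1.0, 2.0, 3.3):
        print(vds, level1_ids(1.8, vds))
```

Sweeping `vds` at fixed `vgs` shows the nearly linear behaviour at small $V_{DS}$ and the slowly rising current in saturation.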
* A model for mobility degradation with the vertical and the horizontal
electric fields;
* A model for the threshold voltage of short- and narrow-channel devices
(the Drain Induced Barrier Lowering (DIBL) effect is accounted for);
* An improved model for the channel length modulation phenomenon;
* Weak inversion conduction (subthreshold conduction).
(3.33)
where $W_c$ is the depletion layer width of a cylindrical junction and is given
by
$W_c = 0.0831353 + 0.8013929\,\frac{W_D}{X_J} - 0.0111077\left(\frac{W_D}{X_J}\right)^2$ (3.35)
The mobility degradation due to the vertical electric field is modeled by the
following simple equation [4]
In this expression, when the device operates in saturation, $V_{DS}$ is replaced
by $V_{DS,sat}$.
where
dQs (3.42)
= dVsa
and $N_{FS}$ is a curve fitting parameter. $V_{on}$ marks the point between the weak
and strong inversion modes. Typical values of $n$ range from 1.0 to 2.5. $I_{on}$ is
related to the current of Equation (3.39) by taking $V_{GS} = V_{on}$.
Fig. 3.7 illustrates the transfer characteristics of the weak inversion and drift
model. The voltage $V_{on}$ ensures the continuity of the current, but it is clear from
the figure that at $V_{GS} = V_{on}$ a discontinuity exists in the derivative. Therefore,
the MOS3 model is not precise in simulating the intermediate region where the
diffusion and drift currents are comparable.
$I_{DS} = \mu_{eff} C_{ox} \frac{W_{eff}}{L_{eff}} \left[V_{GS} - V_T - \frac{1 + F_B}{2} V_{DS}\right] V_{DS}$ (3.47)
The saturation voltage, which takes into account the carrier velocity saturation
effect, is given by
where
Knc = (Vcs - &)/(I + F s ) (3.49)
v. = v,,.L,ffIP. (3.50)
The LEVEL 3 model approximates the device physics and relies on the proper
choice of the empirical parameters to accurately reproduce the device
characteristics.
* Drain-induced barrier lowering effect;
* Non-uniform doping in the channel surface and sub-surface regions effect;
LEVEL 3 3
VTO 0.8 -0.9
TOX 17.5 Y 10-9 17.5 x 10-9
NSUB 3.23 x 10" 3.37 Y 10'6
NFS 820 Y 10s 764 Y 10'
uo 503 165
VMAX 150 x lo8 190 x 108
ETA 45 Y lo-* 121 x 10-8
KAPPA 6.7 10-3 1.45
THETA 63.4 x lo-' 135 x 10-3
DELTA fl 728 0.336
XJ 275 x lorQ 230 x
CJ 250 x lo-' 450 x lo-'
JS 5 10-4 5 x 10-4
JSW 5.5 x 10-0 5.5 Y 10-8
MJ n.m
... 0.50
PB 0.92 0.92
CJSW 205 x lo-'' 212 x 10-'1
MJSW 0.30 0.30
CGDO 274 x 215 x lo-"
CGSO 274 x 10-12 215 Y lo-'>
CGBO 571 x 10-l' 571 x lo-''
RD 596 1189
RS 596 1189
LD 59.5 x 10-9 0.
WD 0. 0.
XL 0. 0.
xw 0. 0.
ACM 2 2
LDIF 940 x 10Wo 1 x 10-8 m
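$I_{DS} = \frac{\mu_0}{1 + U_0 (V_{GS} - V_T)} \cdot \frac{C_{ox} \frac{W_{eff}}{L_{eff}}}{1 + \frac{U_1}{L_{eff}} V_{DS}} \left[(V_{GS} - V_T) V_{DS} - \frac{a}{2} V_{DS}^2\right]$ (3.52)
where
$a = 1 + \frac{g K_1}{2 \sqrt{\phi_s + |V_{BS}|}}$ (3.53)
and
$g = 1 - \frac{1}{1.744 + 0.836 (\phi_s + |V_{BS}|)}$ (3.54)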
where
K' = I+..+J1+2.. (3.56)
2
and
(3.67)
(3.58)
(3.59)
and
(3.61)
The factor $e^{1.8}$ is empirical, chosen to achieve the best fit. The subthreshold
parameter $n$ is a function of $V_{DS}$ and $V_{BS}$.
(3.62)
Another deep-submicrometer MOSFET model called BSIM3 [8] has been
developed for circuit simulation. It uses improved threshold voltage, drain
current and channel-length modulation models. The model is also simple and
has a small number of parameters ($\approx$ 25).
The exponential factors $M_j$ and $M_{jsw}$ are in the order of 0.3-0.5. $C_j$ is the
zero-bias capacitance of the bottom junction per unit area and $C_{jsw}$ is the
zero-bias capacitance per unit perimeter.
$C_{GB} = C_{ox} W_{eff} L_{eff}$ (3.68)
When the device is in the linear region the channel extends uniformly
from the source to the drain. The channel shields the bulk and the
capacitance exists only between the gate and the channel. The gate-bulk
capacitance goes to zero. The gate-channel capacitance can be expressed
in terms of two equal lumped capacitances, a gate-source and a gate-drain
capacitance, which are denoted $C_{GS}$ and $C_{GD}$ and are given by
$C_{GS} = C_{GD} = \frac{1}{2} C_{ox} W_{eff} L_{eff}$ (3.69)
Finally, when the device enters saturation, the channel at the drain pinches
off and hence the gate-drain capacitance component becomes zero while
the gate-source capacitance can be expressed by
$C_{GS} = \frac{2}{3} C_{ox} W_{eff} L_{eff}$ (3.70)
Fig. 3.9 depicts the change of the capacitance components as a function of the
gate-source voltage (assuming that the source-bulk voltage is zero). The total
gate-source capacitance is given by the summation of $C_{GSO}$ and $C_{GS}$, and
similarly, the total gate-drain capacitance is given by the summation of $C_{GDO}$
and $C_{GD}$.
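As a rough illustration of the region-dependent gate capacitances of Equations (3.68) to (3.70), the following sketch returns the intrinsic components for each operating region. It assumes that Equation (3.68) applies to the cutoff region (where no channel shields the bulk); the function name and the region labels are hypothetical, and the overlap capacitances are not included.

```python
def gate_capacitances(region, cox, w_eff, l_eff):
    """Intrinsic gate capacitances (F) of a MOS transistor by operating region.

    cox is the oxide capacitance per unit area (F/m^2); w_eff and l_eff are
    in metres. Overlap capacitances must be added separately.
    """
    c_gate = cox * w_eff * l_eff
    if region == "cutoff":
        # assumed: the whole gate capacitance appears between gate and bulk
        return {"CGB": c_gate, "CGS": 0.0, "CGD": 0.0}
    if region == "linear":
        # channel shields the bulk; split equally between source and drain
        return {"CGB": 0.0, "CGS": 0.5 * c_gate, "CGD": 0.5 * c_gate}
    if region == "saturation":
        # channel pinched off at the drain
        return {"CGB": 0.0, "CGS": 2.0 * c_gate / 3.0, "CGD": 0.0}
    raise ValueError("region must be 'cutoff', 'linear' or 'saturation'")


if __name__ == "__main__":
    eps_ox = 3.9 * 8.854e-12           # permittivity of SiO2 (F/m)
    cox = eps_ox / 17.5e-9             # TOX = 17.5 nm
    for region in ("cutoff", "linear", "saturation"):
        print(region, gate_capacitances(region, cox, 10e-6, 0.8e-6))
```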
The above described capacitance model can be used for circuit analysis and
circuit design. SPICE uses a charge-control model, which was developed by
Ward and Dutton [9]. This model is based on the actual distribution of charge
in the MOS structure and its conservation.
$I_{leak} = \frac{W_{eff}}{W_0} I_0 10^{-V_T/S}$ (3.74)
Using the examples of Fig. 3.10, typical values for constant-current and
extrapolated threshold voltages are 0.3 V and 0.5 V respectively. The parameter
$S$ is equal to 75 mV/decade and the leakage current is equal to 1 pA/$\mu$m.
When estimating the static power dissipation, the worst-case leakage current
has to be evaluated. In this case, the worst-case threshold voltage, $V_{T,wc}$, has to
be used where
$V_{T,wc} = V_T - \Delta V_T$ (3.75)
$\Delta V_T$ is the variation of the threshold voltage due to fluctuations of process
parameters such as the oxide thickness, doping profile, junction depth, gate
length and width, etc. $\Delta V_T$ can be as high as 50 mV on the same wafer
and 150 mV for different wafers. This results in almost two decades of leakage
current increase. Also the temperature effect has to be considered when leakage
current is computed. The temperature affects both $V_T$ and $S$. A typical value
of the temperature coefficient of the threshold voltage is 1.6 mV decrease per
degree Celsius. The subthreshold swing $S$ increases by 0.25 mV/(decade·°C)
[see Equation (3.73)]. For example, if the temperature increases from 25 °C to
75 °C, the threshold voltage decreases by 80 mV and the leakage current equals
30 pA/$\mu$m (initial extrapolated $V_T$ = 0.5 V). This value is 30 times higher than
that at 25 °C. Both the temperature and process effects can result in a drastic
increase of the worst-case static power dissipation. Note that this variation of
$V_T$ greatly affects the delay of CMOS circuits at low supply voltage, since the
drive current is proportional to ($V_{DD} - V_T$).
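The numbers quoted above can be checked with a short sketch. It assumes the leakage follows $I_0 \cdot 10^{-V_T/S}$ per micrometre of width, with $V_T$ taken as the constant-current threshold voltage (0.3 V). The reference current `i0` of 10 nA/$\mu$m is not stated in the text; it is inferred here so that the model gives 1 pA/$\mu$m at 25 °C, and the function name is hypothetical.

```python
def leakage_per_um(vt_cc, swing, i0=10e-9):
    """Subthreshold leakage (A per um of width) at VGS = 0.

    vt_cc : constant-current threshold voltage (V)
    swing : subthreshold swing S (V/decade)
    i0    : assumed reference current per um at VGS = VT (inferred value)
    """
    return i0 * 10.0 ** (-vt_cc / swing)


if __name__ == "__main__":
    room = leakage_per_um(0.30, 0.075)
    # 25 C -> 75 C: VT drops by 50 * 1.6 mV, S rises by 50 * 0.25 mV/decade
    hot = leakage_per_um(0.30 - 50 * 1.6e-3, 0.075 + 50 * 0.25e-3)
    # worst-case process: VT lowered by 150 mV (wafer-to-wafer variation)
    worst = leakage_per_um(0.30 - 0.150, 0.075)
    print(room, hot, hot / room, worst / room)
```

Under these assumptions the temperature case lands close to the 30 pA/$\mu$m quoted in the text, and the 150 mV process shift corresponds to two decades of leakage.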
the vertical electric field in the inversion layer. At this point we prefer to use
the symbol $\mu_s$ for the mobility to denote its dependence on the vertical electric
field. Also, the velocity ($v$) is no longer proportional to $E$ but is given by the
following two-region piecewise empirical model [14]
where
$E_c = \frac{2 v_{sat}}{\mu_s}$ (3.77)
where the saturation velocity is equal to $8 \times 10^6$ cm/s for electrons (NMOS
device) and $6.5 \times 10^6$ cm/s for holes (PMOS device).
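A minimal sketch of a two-region velocity-field model consistent with Equation (3.77) is given below. The piecewise expression itself is not legible in this text, so the commonly used form $v = \mu_s E / (1 + E/E_c)$ for $E < E_c$ and $v = v_{sat}$ otherwise is assumed; with $E_c = 2 v_{sat}/\mu_s$ the two branches meet at $E = E_c$. The function name and the mobility value in the usage lines are hypothetical.

```python
def carrier_velocity(e_field, mu_s, v_sat):
    """Carrier velocity (cm/s) from an assumed two-region piecewise model.

    e_field : lateral electric field (V/cm)
    mu_s    : surface mobility (cm^2/V.s)
    v_sat   : saturation velocity (cm/s)
    """
    e_c = 2.0 * v_sat / mu_s            # critical field, Equation (3.77)
    if e_field >= e_c:
        return v_sat                    # velocity-saturated region
    return mu_s * e_field / (1.0 + e_field / e_c)


if __name__ == "__main__":
    mu_s, v_sat = 500.0, 8e6            # hypothetical mobility, electron v_sat
    for e in (1e2, 1e3, 1e4, 3.2e4, 1e5):
        print(e, carrier_velocity(e, mu_s, v_sat))
```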
Expression
Dimensions
Gate oxide
Doping
Voltage
Capacitance
Current
Gate Delay
Dynamic Power
Dynamic Energy
In the CE scheme all horizontal and vertical dimensions and voltages scale
linearly with the same factor. In the CV scheme, the dimensions are scaled,
while the voltages are kept constant. This scenario has been the most commonly
used. While the constant electric field scaling is natural from the device
physics point of view, the constant voltage scaling is more practical from the
systems standpoint. Changing the supply voltage every technology generation
(when the feature sizes are scaled) is too expensive because multiple power
supply generators will be required for each PC board. However, as the channel
length scales below about 0.6 $\mu$m the 5 V supply voltage must be reduced for
reliability reasons (e.g. hot carrier effects, breakdown, etc.). The quasi-constant
voltage (QCV) scaling is an intermediary scheme between the CE and CV views.
The scaling factors of the horizontal dimensions and the voltage are denoted by
$k_h$ and $k_v$, respectively. Table 3.3 summarizes the scaling of the important
device parameters according to the three theories as a function of the horizontal
scaling factor ($k_h$). Note that in the QCV scheme, the dimensions scale more
aggressively than the voltage (k, = kh'.)
This expression is not far from the one proposed by [El. Table 3.3 shows the
effect of device scaling on the delay, power and energy. It is assumed that a gate
drives other gates, where the load is mainly the gate capacitance. The threshold
voltage is scaled proportionally to the $V_{DD}$ scaling. The gate delays improve with
scaling for all the scenarios, but with a better rate in the CV scheme. However,
the dynamic power. at maximal frequency, of the gate increases by a factor k;'
in the case of CV. For the CE scheme, the power is reduced by a high factor
equal to kF6. Also in this Table, the dynamic energy dissipated by a gate is
reported. This is independent of frequency. For all schemes, it has improved
significantly, particularly for the CE case.
Scaling the supply voltage is an efficient way to reduce the power consumption.
However, to get better performance at low voltage the device sizes and the
threshold voltage have to be properly scaled. For a fixed sub-micron technology,
the supply voltage cannot be reduced aggressively, otherwise the speed is
degraded. However, for each fixed technology generation, there is a lower limit
power supply voltage $V_{DD,min}$ [16]. For $V_{DD}$'s higher than this minimum limit
the speed does not improve significantly. Typical values for $V_{DD,min}$ are 3.3
V and 2.5 V for $L_{eff}$ of 0.5 $\mu$m and 0.3 $\mu$m, respectively. On the other hand,
the higher limit of $V_{DD}$ is driven by the reliability and the power dissipation
limitation. The value of this $V_{DD}$ is proportional to the square root of design
rules ($\delta$) [15]. For 0.6 $\mu$m and 0.3 $\mu$m design rules with LDD structure, these
high limits are 4.5 V and 3.3 V, respectively.
(3.89)
This ratio has to be near unity; that is, the emitter current should mostly
be due to electrons for an NPN transistor. The ratio
$\beta = \frac{I_C}{I_B}$ (3.90)
where $n_B(0)$ and $n_B(W_B)$ are the electron concentrations at the edges of the
emitter-base and collector-base depletion regions respectively [see Fig. 3.13].
Note that the slope of the electron concentration in the base is given by the term
between the brackets as demonstrated by Fig. 3.13.
'B? app~ying KCL (i
bstuten LB and I.o.
j l s . / w e that I,.,
. I, + I~ ~ I, = 0). -
If thc recombination in the bsrc i s n&c$cd
ri L o .
scL t h t is the differcncc
(LB = 0). we can
Using the junction law, the electron concentrations $n_B(0)$ and $n_B(W_B)$ can be
expressed in terms of $V_{BE}$ and $V_{BC}$ respectively. The electron current can hence
be given by [20]
The current $I_{pC}$ is due to the holes injected from the base to the collector^8.
The base-collector junction is basically a P+NN+ structure as shown in Fig.
^8 Note that $I_{pC}$ was not included in Equation (3.88) because in deriving Equation (3.86)
we have assumed that the collector-base junction was reverse biased.
Equations (3.97) and (3.98) are called the Ebers-Moll equations. Fig. 3.14
shows the equivalent circuit of the BJT based on the Ebers-Moll equations.
The Ebers-Moll model described above is general and can be used for any region
of operation by substituting for $V_{BE}$ and $V_{BC}$ by the appropriate values. In
the forward active region, assuming that $V_{BE}$ = 0.8 V and $V_{BC}$ < 0.3 V, the
emitter and collector currents of Equations (3.97) and (3.98) reduce to
where the reverse saturation current of the base-emitter junction $I_{ES}$ can be
derived from Equation (3.99) and is given by
The simple Ebers-Moll model lacks accuracy for the following three reasons:
1. It does not account for the parasitic resistors of the emitter, base and
collector.
2. It does not account for the Early effect, which causes the collector current
to increase as the collector-emitter voltage increases.
3. It does not account for the effect of the high collector currents on the
current gain.
The effect of the parasitic resistances is important because the voltage drops
across them contribute to the external base-emitter and collector-emitter
voltages $V_{B'E'}$ and $V_{C'E'}$, respectively, as shown by the following two equations
The drop across the parasitic resistors has to be accounted for to get more
accurate results from the EM model. Neglecting these drops may lead
to erroneous results. For example, if the external collector-emitter voltage is
found to be equal to 2 V one may deduce that the BJT operates in the active
region. However, if $R_C$ = 1.8 k$\Omega$ and $R_E$ = 0.1 k$\Omega$ and $I_C \approx I_E$ = 1 mA, then
the intrinsic collector-emitter voltage ($V_{CE}$) is 0.1 V. This implies that the
bipolar transistor is actually saturated. This phenomenon is known as
Quasi-Saturation.
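The arithmetic of this example can be reproduced with a small helper. It assumes that the intrinsic collector-emitter voltage is simply the external voltage minus the drops across the collector and emitter resistors; the function name is hypothetical.

```python
def intrinsic_vce(vce_ext, i_c, i_e, r_c, r_e):
    """Intrinsic collector-emitter voltage (V) after removing parasitic drops."""
    return vce_ext - i_c * r_c - i_e * r_e


if __name__ == "__main__":
    # values from the example: 2 V external, 1 mA, RC = 1.8 kOhm, RE = 0.1 kOhm
    vce = intrinsic_vce(2.0, 1e-3, 1e-3, 1.8e3, 0.1e3)
    print(vce)  # about 0.1 V, so the intrinsic device is saturated
```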
(3.108)
The inverse of the forward Early voltage, $1/V_{AF}$, is analogous to the coefficient
$\lambda$ in an MOS transistor. A typical value of $V_{AF}$ is 50 V. The AC output resistance
of the BJT in the forward active region is related to the Early voltage and is
given by
$r_o = \frac{V_{AF}}{I_C}$ (3.109)
The Early effect in the inverse active region can be modeled by using the reverse
Early voltage ($V_{AR}$) which characterizes the slope of the collector current in that
region (inverse active region).
Ic =
where the forward knee current Ixje is defined
ev-l=v%
- (3.110)
where $\beta_F$ is the value of the gain when $I_C < I_{KF}$. The modeling of the Kirk
effect is very complex. However, a simple model for the current gain, which
can be used in first order circuit analysis, is given below [21]
(3.112)
The accuracy of the simple EM model can be enhanced by accounting for the
parasitic resistors, the Early effect and the high current effect, which can be
modeled by simple analytical expressions as shown above.
* High-level injection effects (the Kirk effect is not included)
* Base resistance variation with current.
The two back-to-back diodes on the right represent the intrinsic base-emitter
and base-collector junctions and their currents are given by [23]
$I_{CC} = \frac{I_S}{q_b}\left(e^{V_{BE}/n_F V_t} - 1\right)$ (3.113)
$I_{EC} = \frac{I_S}{q_b}\left(e^{V_{BC}/n_R V_t} - 1\right)$ (3.114)
where $I_S$ is given by [23]
(3.115)
The forward and reverse current emission coefficients ($n_F$ and $n_R$), which are
introduced in Equations (3.113) and (3.114), are used to model the low currents.
The parameter $q_b$ (base charge factor) accounts for the high current and base
9s = + 1- (3.116)
$q_1$ models the effects of base width modulation and can be expressed as
The two back-to-back diodes on the left [Fig. 3.19] account for the currents
caused by the recombination of carriers in the emitter-base and the collector-
base space-charge layers and other recombinations. These currents can be
modeled by [23]
$C_2 I_S \left(e^{V_{BE}/n_E V_t} - 1\right)$ (3.120)
$C_4 I_S \left(e^{V_{BC}/n_C V_t} - 1\right)$ (3.121)
where $C_2$, $C_4$, $n_E$ and $n_C$ have been introduced to fit the measured currents.
Further improvements to this model are possible by the inclusion of three
parasitic resistances ($R_C$, $R_E$, $R_B$); three junction capacitances ($C_E$, $C_C$, $C_S$);
and two diffusion capacitances ($C_{DE}$, $C_{DC}$) as shown in Fig. 3.19.
The model of the base resistance takes into account the effect of the current
(current crowding) through the following expression [24]
$R_B(I) = R_{BM} + 3 (R_B - R_{BM}) \frac{\tan(z) - z}{z \tan^2(z)}$ (3.122)
The empirical factor FC has a value between 0 and 1. Its default value in
SPICE is 0.5. Note that Equations (3.124) and (3.125) apply for a reverse and
forward biased junction respectively.
The diffusion capacitances model the charge associated with injected carriers.
For example, the electrons injected in the base have a corresponding storage
charge
$Q_{DE} = \tau_F I_{CC}$ (3.126)
The diffusion capacitance (associated with the injected electrons from the
emitter into the base, when the base-emitter junction is forward biased) is given
by
$C_{DE} = \frac{\partial Q_{DE}}{\partial V_{BE}}$ (3.128)
Although the SPICE models account for most of the first and second order
effects, they are not highly accurate. This originates from some weaknesses in
the theory on which the models are based. As the device features are scaled
down the currently available models become less accurate. The physics and
the theory of the scaled devices is more complex. Hence, accurate modeling
becomes very difficult. One way around that problem is to choose the model
parameters such that simulated device characteristics agree with measurements.
In practice, the models' parameters are extracted automatically using parameter
analyzers with software tools to obtain the best fit. As a result, the values
of the extracted parameters may not correspond to their actual values. For
example, it is common to find a discrepancy of 20% between the measured
current gain of a bipolar transistor and that listed in the SPICE file. Another
approach, which is equivalent to tweaking the parameters, is to use empirical
models (e.g. BSIM model), in which the empirical (fitting) parameters can be
optimized to get the best fit between simulation and measurements.
Typical GP parameters for the 0.8 $\mu$m BiCMOS presented in Chapter 2 are
shown in Tables 3.4 and 3.5.
IS    Saturation current
BF    Ideal maximum forward gain
BR    Ideal maximum reverse gain
NF    Forward current-emission coefficient
NR    Reverse current-emission coefficient
VAF   Forward Early voltage
VAR   Reverse Early voltage
IKF   Forward-knee current
IKR   Reverse-knee current
ISE   Base-emitter leakage saturation current
ISC   Base-collector leakage saturation current
NE    Base-emitter leakage emission coefficient
NC    Base-collector leakage emission coefficient
RE    Emitter resistance
RC    Collector resistance
RB    Base resistance at zero current
IRB   Base current where RB = RB(0)/2
RBM   Minimum high-current base resistance
CJE   Base-emitter zero-bias depletion cap.
VJE   Base-emitter built-in potential
MJE   Base-emitter junction grading factor
CJC   Base-collector zero-bias depletion cap.
VJC   Base-collector built-in potential
MJC   Base-collector junction grading factor
CJS   Collector-substrate zero-bias cap.
VJS   Collector-substrate built-in potential
MJS   Collector-substrate junction grading factor
XCJC  Internal base fraction of base-collector cap.
FC    Coefficient for forward-bias depletion cap.
TF    Forward transit time
XTF   TF bias-dependence coefficient
VTF   TF base-collector voltage dependence coefficient
ITF   TF high-current parameter
TR    Reverse transit time
XTB   Forward and reverse beta temperature exponent
XTI   Saturation current temperature exponent
EG    Energy gap
KF    Flicker noise coefficient
AF    Flicker noise exponent
IS Zx A
BF 100
BR 1
NF 1
NR 1
VA P ..
sn V
VAR 5 V
IKF 5n 10P A
IKR 0. A
ISE 0. A
RE 30 n
RC 87 n
RB 650 n
IRB 0 A
RBM 650 62
CJE 1 . 5 1 ~lo-'' F
VJE 0.87 V
MJE 0 265
CJC 1.15~10-14 F
VJC o 713 V
FC 0.5
TF 12.5~ Q
XTF 916.2
VTF 1.6
ITF a.7x 10-2
TR 4 x 10W8 J
XTB 1.4
XTI 3.5
EG 1.11 ev
XF 2.9x10-e -
AF 2.0
[9] D.E. Ward and R.W. Dutton, "A Charge-Oriented Model for MOS Transistor
Capacitances," IEEE Journal of Solid-State Circuits, vol. SC-13,
pp. 703-707, 1978.
In this chapter we introduce the CMOS logic gate with the development of simple
models for delay and power dissipation estimation. These analyses permit
us to understand the mechanisms that control the performance, particularly
the power dissipation, of a logic circuit. Several CMOS design styles, such as
pseudo-NMOS, dynamic logic and NORA, are presented. Other circuit variations
of the static complementary CMOS, which are suitable for low-power
applications, are discussed. These include the pass-transistor logic families such
as Complementary Pass-transistor Logic (CPL), Dual Pass-transistor Logic
(DPL), and Swing Restored Pass-transistor Logic (SRPL). Also an overview
of clocking strategy in VLSI systems is covered. Included in this chapter is
one important area, which is the I/O circuits. The power dissipation of the
I/O circuits is also analyzed. Finally, low-power techniques for CMOS design
are also reviewed at the transistor level. We will cover the low-power issues
at subsystem/system/architecture levels in Chapters 6, 7 and 8 in more detail.
Several books treat in detail other CMOS circuit design aspects [1, 2, 3]. The
reader can refer to them.
Many issues existing in today's advanced CMOS circuit structures are
considered, such as:
$V_{in} = V_{DD}$ (4.2)
In this case, $V_{GSn} > V_{Tn}$ and $|V_{GSp}| < |V_{Tp}|$. The PMOS is OFF and the
NMOS is ON. The NMOS transistor N provides a current path to ground.
The final stable value of the output voltage $V_o$ is
$V_o = 0$ (4.3)
At the steady state, the DC current from $V_{DD}$ to the ground is controlled
by the subthreshold current of the PMOS P, since this device is OFF and
the NMOS N has a $V_{DS}$ equal to zero. We assume that the junction
leakage is negligible. If $V_{Tp}$^1 is low enough (lower for example than -0.5
V), the subthreshold current is negligible (< 1 pA/$\mu$m width). If $V_{Tp}$
(negative) is high, the subthreshold current is not negligible and can be as high
as 1 $\mu$A/$\mu$m for $V_{Tp}$ = -0.05 V [see Section 3.3.2]. In this case the output is
not exactly at zero and can have a value of tens of mV. In this section we
assume that the subthreshold current is not important. Low-$V_T$ CMOS
circuits are treated in Section 4.10.
Similarly, when $V_{in}$ is low (0 V), $V_{GSn} < V_{Tn}$ and $|V_{GSp}| > |V_{Tp}|$. The
PMOS transistor is ON and the NMOS transistor is OFF. The output
voltage is given by
$V_o = V_{DD}$ (4.4)
Also we assume that the leakage current is negligible.
^1 Extrapolated threshold voltage.
Low-Voltage Low-Power VLSI CMOS Circuit Design
The logic levels of the CMOS inverter are close to $V_{DD}$ and ground and the
logic swing is equal to $V_{DD}$. This is a main feature of CMOS gates.
$I_{DSp} = -I_{DSn}$ (4.5)
The PMOS current is given by
$I_{DSp} = -\beta_p \left[(V_{in} - V_{DD} - V_{Tp})(V_o - V_{DD}) - \frac{1}{2}(V_o - V_{DD})^2\right]$ (4.6)
where
$\beta_p = k_p \frac{W_{eff}}{L_{eff}}$ (4.7)
(4.8)
where
$\beta_n = k_n \frac{W_{eff}}{L_{eff}}$ (4.11)
and
$V_{GSn} = V_{in}$ (4.12)
Using equations (4.5), (4.6) and (4.10), the output voltage is given by
$V_o = (V_{in} - V_{Tp}) + \sqrt{(V_{in} - V_{Tp})^2 - 2\left(V_{in} - \frac{V_{DD}}{2} - V_{Tp}\right)V_{DD} - \frac{\beta_n}{\beta_p}(V_{in} - V_{Tn})^2}$ (4.13)
(4.15)
where
$\beta = \frac{\beta_n}{\beta_p}$ (4.16)
This equation is very useful from a design point of view. Note, from
this equation, that the logic threshold voltage of this gate is set by the
designer, since the parameters $\beta_n$ and $\beta_p$ are dependent on $W_{eff}$ and
$L_{eff}$. Moreover, the region (C) is defined for only one point of $V_{in}$.
For symmetrical NMOS and PMOS devices we have
This ratio is a typical example. The designer should set the size ratio
as
(4.20)
We obtain
$V_{in} = V_{inv} = \frac{V_{DD}}{2}$ (4.21)
An inverter with this $V_{inv}$ is sometimes called a symmetrical gate. The
output voltage in this case is not necessarily equal to $V_{DD}/2$ and is given
by the following inequality
$V_o = (V_{in} - V_{Tn}) - \sqrt{(V_{in} - V_{Tn})^2 - \frac{\beta_p}{\beta_n}(V_{in} - V_{DD} - V_{Tp})^2}$ (4.23)
$V_o = 0$ (4.24)
The current flowing from $V_{DD}$ to ground, versus the input voltage, is plotted
in Fig. 4.2(b). It reaches its maximum when both the MOS transistors are in
saturation. It is important to note that for $V_{in} = V_{inv}$ the DC power dissipation
would be maximal.
4.1.2 Effect of $\beta$
As we discussed before, the ratio $\beta$ controls the threshold voltage of the CMOS
inverter. This parameter is set by the circuit designer through the transistor
sizes. Other parameters such as the mobility and the threshold voltage of
devices are set during the fabrication and the circuit designer cannot change
them. Fig. 4.3 illustrates the dependence of DC transfer characteristics and the
threshold voltage of the CMOS inverter on the ratio $\beta$. Increasing $\beta$ decreases
the voltage $V_{inv}$. $V_{inv}$ has a practical maximum less than $V_{DD} + V_{Tp}$ and a
practical minimum greater than $V_{Tn}$. Practical values mean that $\beta$ cannot
be zero or infinite. In general, the circuit designer tries to set $\beta$ = 1 for
symmetrical operation unless the gate is used to switch an input swing different
than a CMOS swing (from ground to $V_{DD}$).
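The dependence of the inverter threshold on $\beta$ can be illustrated numerically. Equation (4.15) is not legible in this text, so the sketch below assumes the standard square-law result obtained by equating the two saturation currents, $V_{inv} = (V_{DD} + V_{Tp} + V_{Tn}\sqrt{\beta}) / (1 + \sqrt{\beta})$, with $V_{Tp}$ negative. The function name and the threshold values in the usage lines are hypothetical.

```python
import math


def inverter_threshold(vdd, vtn, vtp, beta_ratio):
    """Logic threshold V_inv (V) of a CMOS inverter, square-law assumption.

    vtp is negative for an enhancement PMOS; beta_ratio = beta_n / beta_p.
    """
    r = math.sqrt(beta_ratio)
    return (vdd + vtp + vtn * r) / (1.0 + r)


if __name__ == "__main__":
    for beta in (0.1, 0.5, 1.0, 2.0, 10.0):
        print(beta, inverter_threshold(3.3, 0.8, -0.8, beta))
```

With symmetrical thresholds and $\beta$ = 1 the result is $V_{DD}/2$, and the value moves toward $V_{Tn}$ as $\beta$ grows and toward $V_{DD} + V_{Tp}$ as $\beta$ shrinks, in line with the limits stated above.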
words, we would define the valid logic levels such that they are restored when
they propagate through a digital circuit. The logic levels can be extracted from
the DC characteristic. As illustrated in Fig. 4.4 we define the levels at
the input by
* Logic 0: for $0 \le V_{in} \le V_{IL}$
* Logic 1: for $V_{IH} \le V_{in} \le V_{DD}$
. Logic 0 : for 0 5
The $V_{IL}$ and the $V_{IH}$ levels can be defined as the points where the slope of the
DC transfer characteristics is -1, i.e.,
These values can be deduced using equations (4.13) and (4.23). To have good
noise margins, it is desirable to have $V_{IL}$ and $V_{IH}$ each near the other, around
the point $V_{DD}/2$.
For CMOS circuits, the HIGH output voltage level $V_{OH}$ can be defined by
letting $V_{OH} = V_{DD}$ and $V_{OL} = 0$. The CMOS logic inverter has a fairly ideal
transfer function and it tends to have very good noise margins. In some
applications, either $NM_H$ or $NM_L$ is compromised to have good speed of operation.
The power dissipation issue during the switching is considered in Section 4.3.
When the input goes from low (ground) to high ($V_{DD}$), initially the output is at
$V_{DD}$ and the pull-down NMOS of Fig. 4.5 is in the saturation region. We assume
that while the output falls to $V_{DD}/2$, the NMOS drain current is approximated
by the saturation current $I_{DSn,sat}$. Referring to the equivalent circuit of Fig.
4.6(a), the delay is computed from the following differential equation
where
$I_{DSn,sat} = K_n v_{sat} C_{ox} W_{eff,n} (V_{GSn} - V_{Tn})$ (4.30)
We assume that the factor $K_n$ does not change. By integrating Equation
(4.29) from $t = t_1$, corresponding to $V_o = V_{DD}$, to $t = t_2$, corresponding
to $V_o = V_{DD}/2$, and substituting (4.30) into (4.29), we obtain
Note from this equation that the delay is inversely proportional to the width of
the MOS transistor. So by sizing the gate we can reduce the delay of the gate
alone.
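The fall-delay estimate described above can be sketched as follows. The sketch assumes that the load capacitor is discharged from $V_{DD}$ to $V_{DD}/2$ by a constant current equal to the saturation current of Equation (4.30) with $V_{GSn} = V_{DD}$, which gives $t = C_L (V_{DD}/2) / I_{DSn,sat}$. The function name and all numerical values in the usage lines (including the factor `k_n`) are hypothetical.

```python
def fall_delay(c_load, vdd, vtn, k_n, v_sat, cox, w_eff):
    """Fall delay (s) for a constant-current discharge from VDD to VDD/2.

    SI units: c_load in F, v_sat in m/s, cox in F/m^2, w_eff in m.
    """
    i_sat = k_n * v_sat * cox * w_eff * (vdd - vtn)   # Equation (4.30), VGS = VDD
    return c_load * (vdd / 2.0) / i_sat


if __name__ == "__main__":
    args = dict(c_load=100e-15, vtn=0.8, k_n=0.5, v_sat=8e4, cox=2e-3)
    print(fall_delay(vdd=3.3, w_eff=10e-6, **args))
    print(fall_delay(vdd=3.3, w_eff=20e-6, **args))   # doubling W halves the delay
    print(fall_delay(vdd=1.5, w_eff=10e-6, **args))   # delay grows as VDD nears VT
```

Because the delay is proportional to $V_{DD}/(V_{DD} - V_T)$ in this approximation, it rises sharply when the supply approaches the threshold voltage.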
When the input goes from high ($V_{DD}$) to low (ground), initially the output is at
zero. The pull-up PMOS transistor operates in the saturation region. Similarly
using the equivalent circuit of Fig. 4.6(b), the rise delay is given by
(4.32)
At $t = t_1$, $V_o = V_{DD}$
At $t = t_3$, $V_o = 0$
At $t = t_4$, $V_o = V_{DD}/2$
From the above equation we can deduce that the rise delay is greater than the
fall delay for equally sized MOS transistors. So $W_{eff,p}$ should be sized such
that the two saturation currents are almost equal in order to get symmetrical
rise and fall delays.
$t_d = \frac{1}{2}(t_{dr} + t_{df})$ (4.33)
(4.35)
Fig. 4.7(a) shows the simulated effect of the power supply voltage on the delay
of an inverter with fanout = 3, using the device parameters given in Chapter 3.
We buffer the input voltage with one inverter stage to obtain accurate results.
The delay is almost stable at high $V_{DD}$; however, when $V_{DD}$ approaches the
threshold voltage of the NMOS and PMOS devices, it increases drastically
as expected by Equation (4.35). Therefore, the threshold voltage should be
reduced to overcome this problem. In Fig. 4.7(b), the delay of the inverter is
plotted versus the ratio $V_T/V_{DD}$ at $V_{DD}$ = 2.5 V. For $V_T/V_{DD}$ > 0.5, the delay
increases rapidly. In order to maintain improvement in circuit performance at
reduced power supply voltage, $V_T/V_{DD}$ must be $\le$ 0.2.
There are three power dissipation components within the CMOS inverter.
These are:
1. Static power caused by the leakage current and other static current
$I_{st}$ due to the value of the input voltage;
2. Dynamic power caused by the total output capacitance $C_L$; and
Sometimes components (2) and (3) are merged as total dynamic power
$P_s = P_{s1} + P_{s2}$ (4.36)
Leakage current consists of MOS junction leakage currents. Fig. 4.9 shows
the parasitic diodes in a CMOS inverter. The body ties in this structure are such
that the parasitic diodes are not conducting (i.e. reverse biased and/or at zero
voltage). The current in a diode is given by
$I_d = I_s \left(\exp\frac{q V_d}{n k T} - 1\right)$ (4.37)
where $n$ is the emission coefficient of the diode (sometimes equal to 1) and $V_d$ is
the applied voltage to the diode. Note that the current parameter $I_s$ increases
with temperature. The total power dissipation due to these leakage currents is
given by
$P_{s1} = \sum I_d V_{DD}$ (4.38)
We consider now the second component of the static power, which is a function
of the input voltage $V_{in}$. Assume that the input of the pull-down NMOS of
the inverter is at a voltage $0 \le V_{in} < V_T$. In this case the current is given by
the subthreshold expression (Fig. 4.10)
$I_{DS} = \frac{W_{eff}}{W_0} I_0 10^{(V_{in} - V_T)/S}$ (4.39)
(4.45)
This equation shows that the power dissipation is proportional to the operating
frequency. Moreover, the reduction of the power supply drastically reduces the
power dissipation. Ideally, a 3.3 V supply voltage reduces the power dissipation
by 56% compared to that of 5 V. Moreover, at 1 V the power is reduced by
96% compared to 5 V. The expression of dynamic power in Equation (4.45) is
valid only for an inverter. However, for a complex gate the concept of switching
activity is introduced [see Section 4.5.3].
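The quoted savings follow directly from the quadratic dependence on the supply voltage. The sketch below assumes the usual form $P_d = C_L V_{DD}^2 f$ for Equation (4.45), which is not legible in this text; the function names and the load and frequency values are hypothetical.

```python
def dynamic_power(c_load, vdd, freq):
    """Dynamic power (W) of an inverter switching every cycle (assumed form)."""
    return c_load * vdd ** 2 * freq


def reduction_percent(vdd_new, vdd_ref):
    """Percentage of dynamic power saved by moving from vdd_ref to vdd_new."""
    return 100.0 * (1.0 - (vdd_new / vdd_ref) ** 2)


if __name__ == "__main__":
    print(dynamic_power(100e-15, 5.0, 50e6))      # hypothetical 100 fF at 50 MHz
    print(round(reduction_percent(3.3, 5.0), 1))  # about 56 %
    print(round(reduction_percent(1.0, 5.0), 1))  # 96 %
```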
During the first output transition (charging) from 0 to $V_{DD}$, the energy drawn
from the power supply is $E_d = C_L V_{DD}^2$. For this transition, the energy stored
in the load capacitor is
$V_{in}(t) = \frac{V_{DD}}{\tau} t$ (4.51)
$t_1 = \frac{V_T}{V_{DD}} \tau$ and $t_2 = \frac{\tau}{2}$ (4.52)
This equation shows that the short-circuit power dissipation is also proportional
to the frequency. The only parameters that can be controlled by the circuit
designer at given frequency and power supply to reduce $P_{sc}$ are $\beta$ and $\tau$.
The power supply scaling greatly affects the reduction of short-circuit power
dissipation. Note that this analysis was done for an unloaded inverter. For a
loaded gate, if the output signal and input signal have equal rise/fall times, the
short-circuit power dissipation will be less than 20% of the total power [5]. So
it is very important to keep the edges fast, to have negligible $P_{sc}$ or, at least, it
is desirable to have equal input and output rise/fall times.
If the load capacitance is high, the output rise/fall times become larger than
the input ones. In this case, the input changes completely before the output
changes significantly. Therefore, the short-circuit current is near zero. Note
that if $V_{DD}$ is approaching ($V_{Tn} + |V_{Tp}|$) or is less, the short circuit current can
be eliminated because both devices cannot conduct simultaneously.
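The trends discussed above can be illustrated with a short sketch. The short-circuit power expression itself is not legible in this text, so the widely used result for an unloaded symmetrical inverter, $P_{sc} = \frac{\beta}{12}(V_{DD} - 2V_T)^3 \tau f$, is assumed here (with $\beta_n = \beta_p = \beta$ and $V_{Tn} = |V_{Tp}| = V_T$). The function name and the numerical values are hypothetical.

```python
def short_circuit_power(beta, vdd, vt, tau, freq):
    """Short-circuit power (W) of an unloaded symmetrical inverter (assumed form).

    beta : gain factor of each transistor (A/V^2)
    tau  : input rise/fall time (s)
    """
    headroom = vdd - 2.0 * vt
    if headroom <= 0.0:
        return 0.0          # both devices can never conduct simultaneously
    return beta / 12.0 * headroom ** 3 * tau * freq


if __name__ == "__main__":
    for vdd in (5.0, 3.3, 2.0, 1.5):
        print(vdd, short_circuit_power(1e-3, vdd, 0.8, 1e-9, 50e6))
```

The cubic dependence on $V_{DD} - 2V_T$ shows why supply scaling reduces this component so strongly, and why it vanishes once $V_{DD}$ drops to the sum of the threshold magnitudes.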
It represents the total power of a gate when it is switching at the same rate
as the operating frequency. In Chapter 8, we will discuss how to estimate the
power dissipation of a complex circuit.
Other power dissipation issues exist, such as worst case power estimation and
temperature effect. These conditions are: maximum $V_{DD}$ and junction
temperature, and fast-fast process. Static power dissipation (subthreshold current) is
increased by the increased temperature and increased power supply. Dynamic
power is not sensitive to the temperature but it is affected greatly by the worst
case $V_{DD}$. Short-circuit power dissipation depends on the temperature just as
the short-circuit current does. It is also dependent on the power supply. The
mobility and threshold voltage decrease with increasing temperature. Each of
these two parameters has an opposite effect on the current. So it is important
to consider the worst case power consumption evaluation in any design.
The simulated average total power dissipation can be easily measured by the
SPICE simulator using POWER MEASUREMENT commands. However, several
papers in the literature have introduced a "power meter" in circuit simulation
to measure the power dissipation [6, 7, 8].
4.4 CAPACITANCE ESTIMATION
Previously we saw that the speed and power dissipation of CMOS gates depend
strongly on the total output load capacitance. This capacitance is the sum of
three components as shown in Fig. 4.15.

For simplicity we estimate, in this section, the average value of $C_L$ over the
range of the output swing. This approach is used only for initial estimation
of the design. More circuit simulation, layout extraction and post-layout
simulation are needed for more accuracy. Moreover, it is sometimes interesting
to derive a simple expression for the load capacitance to see the impact of
important parameters on the speed and the power dissipation. We first examine
the different components of the output load capacitance; then we illustrate
the estimation approach by an example.

where n is the number of transistors of the gate. This expression sums the gate
capacitances of all the transistors composing the driven circuit. For a CMOS
inverter it is given by
(4.57)
[Figure: simulated input and output voltage waveforms (Vin, Vout1, Vout2) at VDD = 3.3 V]
Figure 4.16 shows an example of the equivalent gate capacitance of the receiving
gate. The driven inverter has the following drawn sizes: $W_p = W_n = 20$ µm
and $L = 0.8$ µm. This gate can be replaced by an equivalent capacitance
$C_{gate} \approx 50$ fF, which is approximately the same as the one calculated from
Equation (4.57).
$$C_d = C_{dp} + C_{dn} + C_{jp} + C_{jn} \qquad (4.58)$$
(4.61)
where H is the thickness of the insulator layer (oxide), and $C_a$ is the
capacitance per unit area. The total capacitance of the wire is
$$C_w = l\, W\, C_a \qquad (4.64)$$
where W is the width of the wire (metal or poly), and l is the length of the
wire. Table 4.1 gives some values of the wiring capacitance per area for the
0.8 µm process presented in Chapter 2. This capacitance cannot be known in
the early design stage but can be known after layout extraction.

When the thickness of the insulator becomes comparable to that of the wire,
T, then the fringing fields at the edge of the wire become important. The effect
of the fringing fields is manifested by the increase of the effective area of the
plates [Fig. 4.19]. Many approximations have been proposed to compute the
Parallel-plate (area) capacitance          ($10^{-3}$ fF/µm²)
Metal2 to substrate                        11
Metal2 to Metal1                           25
Metal1 to substrate                        19
Metal1 to poly                             28
Metal1 to diffusion                        27
Gate poly over field oxide                 58

Fringing (perimeter) capacitance           ($10^{-3}$ fF/µm)
Metal2 to substrate                        38
Metal2 to Metal1                           47
Metal1 to substrate                        44
Metal1 to poly                             48
Metal1 to diffusion                        47
Gate poly over field oxide                 44
4.4.4 Example
Consider an inverter with $W_p = 2W_n = 20$ µm with 3 µm length of each drain
and source. This inverter is driving a line of metal1 of 100 µm length by 2 µm
width and an inverter with $W_p = 2W_n = 20$ µm operating at $V_{DD} = 3.3$ V.
The total load capacitance is computed using the 0.8 µm device parameters
presented in Chapter 3 as follows:

$$C_g = [W_p L_p + W_n L_n]\, C_{ox} = [20 \times 0.8 + 10 \times 0.8] \times 2\ \text{fF} \approx 48\ \text{fF}$$

$$C_{gd} = C_{GDO_p} W_p + C_{GDO_n} W_n$$
Then

• The drain areas are 60 µm² and 30 µm² for PMOS and NMOS respectively.
  The drain perimeters are 46 µm and 26 µm for the PMOS and NMOS
  transistors respectively. The total junction capacitance can be
  easily calculated and is
  $$C_j \approx 32\ \text{fF}$$
  Note that this capacitance increases with the power supply voltage
  reduction.

• The wire capacitance is estimated by adding the two components: parallel
  plate and fringing capacitances. The area of the wire is 200 µm²
  while its perimeter is 204 µm. We have
  $$C_w = W \times l \times C_w(\text{per area}) + 2(W + l) \times C_w(\text{per length})$$
  $$= 200\ \text{µm}^2 \times 19 \times 10^{-3}\ \text{fF/µm}^2 + 204\ \text{µm} \times 44 \times 10^{-3}\ \text{fF/µm}$$
  $$= 3.8 + 9.0 \approx 13\ \text{fF}$$
Hence the total capacitance at the output is 100 fF. Note that the contribution
of the junction capacitance is important. The contribution of each component
varies from one circuit to another and it depends on the layout style used.
Before starting any circuit layout, it is important to keep in mind an estimation
of capacitances such as the gate and output capacitance of a 1 unit size inverter
and the wire capacitance of, for example, a 100 µm poly line and a 100 µm metal1
line. With these data, when starting the design, it is possible to size different
transistors correctly.
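
The arithmetic of this example is easy to re-run when the geometry changes. The sketch below uses only numbers quoted in the example (gate capacitance per area of 2 fF/µm², 19 and 44 for the metal1-to-substrate area and fringing values, 32 fF junction capacitance, 100 fF total). The variable names are illustrative, and the overlap term is not recomputed because its parameters are not legible in the text above; it is only shown as the remainder implied by the quoted total.

```python
# All capacitances are in fF and all lengths in um.
C_OX = 2.0                 # gate capacitance per area used in the example
L_GATE = 0.8               # channel length of the driven inverter
W_P, W_N = 20.0, 10.0      # driven inverter widths (W_p = 2 W_n = 20 um)

# Gate capacitance of the driven inverter.
c_gate = (W_P * L_GATE + W_N * L_GATE) * C_OX

# Metal1 line: 100 um long, 2 um wide, running over the substrate.
WIRE_W, WIRE_L = 2.0, 100.0
C_AREA = 19e-3             # fF/um^2, metal1 to substrate
C_FRINGE = 44e-3           # fF/um, metal1 to substrate
wire_area = WIRE_W * WIRE_L
wire_perimeter = 2.0 * (WIRE_W + WIRE_L)
c_wire = wire_area * C_AREA + wire_perimeter * C_FRINGE

# Junction capacitance and total are taken as quoted, not recomputed.
C_JUNCTION = 32.0
C_TOTAL_QUOTED = 100.0

# Whatever is left is attributed to the overlap term and to rounding.
c_remaining = C_TOTAL_QUOTED - (c_gate + c_wire + C_JUNCTION)

print(f"gate = {c_gate:.1f} fF, wire = {c_wire:.1f} fF")
print(f"remainder implied by the quoted total = {c_remaining:.1f} fF")
```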
The design of these gates, or any CMOS static gate, follows that of an inverter.
As discussed in Sections 4.1 and 4.2, an inverter is designed to meet a given
DC and transient performance, then $(W/L)_p$ and $(W/L)_n$ are determined. The
$(W/L)_p$ and $(W/L)_n$ of the devices of a logic gate are determined as follows:
For example we want to design a 3-input NAND (Fig. 4.21(a)) to have the same
DC and transient performance as that of an inverter driving the same $C_L$
(Fig. 4.21(b)).
We assume that
$$W_n = W_{n1} = W_{n2} = W_{n3} \qquad (4.66)$$
and
$$W_p = W_{p1} = W_{p2} = W_{p3} \qquad (4.67)$$
The first thing to do is to approximate the gate by an equivalent inverter where
the effective $\beta$ is given by
$$\frac{1}{\beta_{neff}} = \frac{1}{\beta_{n1}} + \frac{1}{\beta_{n2}} + \frac{1}{\beta_{n3}} = \frac{3}{\beta_n} \qquad (4.68)$$
and
$$\beta_{peff} = \beta_p \qquad (4.69)$$
To have the logic threshold of the gate midway of the power supply in the DC
characteristics, the following condition should be satisfied for the 3-input
NAND gate (see Equation 4.18)
$$\beta_{peff} = \beta_{neff} \qquad (4.70)$$
which means that
$$\beta_p = \frac{\beta_n}{3} \qquad (4.71)$$
To have the same delay as an inverter with determined sizes, we should have
(assuming that L is the same)
$$W_p = W_{pi} \qquad (4.72)$$
and
$$W_{neff} = W_{ni} = \frac{W_n}{3} \qquad (4.73)$$
But in practice the size of these transistors, composing the 3-input NAND gate,
should be increased because the output parasitic capacitance of the NAND gate
(or any complex gate) is larger than that of the inverter. Hence
$$W_p > W_{pi} \qquad (4.74)$$
and
$$W_n > 3 W_{ni} \qquad (4.75)$$
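
The sizing rule above can be sketched in a few lines. The helper names below are illustrative and not from the text: `series_beta` combines the gain factors of series devices as in Equation (4.68), and `nand_min_widths` returns the lower bounds on the NAND widths for a reference inverter with widths W_pi and W_ni, assuming equal channel lengths.

```python
def series_beta(betas):
    """Effective gain factor of devices connected in series."""
    return 1.0 / sum(1.0 / b for b in betas)


def nand_min_widths(w_p_inv, w_n_inv, m):
    """Lower bounds (W_p, W_n) for an m-input NAND.

    Each of the m series NMOS devices must be m times wider than the
    inverter NMOS; in the worst case a single PMOS pulls up, so it must
    be at least as wide as the inverter PMOS.
    """
    return w_p_inv, m * w_n_inv


# Hypothetical reference inverter: W_pi = 20 um, W_ni = 10 um.
w_p, w_n = nand_min_widths(20.0, 10.0, 3)
print(f"3-input NAND: W_p >= {w_p} um, W_n >= {w_n} um")

# Three equal series devices behave like one device with a third of the gain.
print(series_beta([3.0, 3.0, 3.0]))
```

These are only starting values; as noted above, the extra output parasitic capacitance of the NAND gate pushes the final widths higher.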
Note that by circuit simulation, we can properly size the transistors. Moreover,
it should be noted that the back-gate bias effect has to be taken into
consideration in the design of the series NMOS devices in a NAND gate (or series
PMOS in NOR). The series-connected MOSFETs, during switching, exhibit a
threshold voltage increase due to a non-null source-substrate voltage as shown in
the simulation example of Fig. 4.22. In Fig. 4.22(a), the transistor N1 of the
first NAND3 gate, near the output out1, is driven by the latest signal because
N2 and N3 are already ON. Therefore, the node a1 is at the ground level and
the source of the transistor N1 is not subject to the body effect. In the other
NAND3 gate, the transistors N4 and N5 are ON, while N6 receives the input
signal. In this case, the nodes a2 and b2 are at a certain voltage level. Hence,
during the discharging period the transistors N4 and N5 are subject to the body
effect. This effect slows the discharge of the output as shown in Fig. 4.22(b).
The output out1 is discharged more rapidly than the output out2. One way
to reduce the body effect at the logic level is to put the transistor, driven by
the latest arriving signal, near the output. The early arriving signals should be
used to discharge the nodes susceptible to the body effect. For example in an
adder circuit, the transistor driven by the carry is placed near the output.
Let us derive the output parasitic capacitance of the m-input NAND gate and
compare it to that of the CMOS inverter of Fig. 4.21(b). We have
The $C_d$ of the m-input gate is larger than that of the CMOS inverter by the
ratio $W_p/W_{pi}$. From the above equation it is obvious that $C_d$ of the m-input
NAND gate is larger than that of the CMOS inverter.

Note that for the same performance and for the same number of inputs the
NAND gate consumes less silicon area than a NOR gate because of the
smaller area taken by the NMOS devices. Hence, CMOS NAND gates are more
widely used than NOR gates. Moreover, the NOR gate consumes more power
than the NAND gate.
the N block to the output capacitance in Fig. 4.23(b) is less than that of Fig.
4.23(c). There is no direct DC path between VDD and ground for any of the
logic input combinations. In practice, the complex CMOS gates are used for a
maximum fan-in of 5-6.
The dynamic power for a complex gate cannot be estimated by the simple
expression $C_L V_{DD}^2 f$, because it might not always switch when the clock is
switching. The switching activity determines how often this switching occurs
on a capacitive node. For N periods of $0 \rightarrow V_{DD}$ and $V_{DD} \rightarrow 0$ transitions,
the switching activity $\alpha$ determines how many $0 \rightarrow V_{DD}$ transitions occur at
the output. In other words, the activity $\alpha$ represents the probability that a
transition $0 \rightarrow V_{DD}$ will occur during the period $T = 1/f$, where f is the
periodicity of the inputs of the gate. The average dynamic power of a complex
gate due to the output load capacitance is
$$P_d = \alpha\, C_L V_{DD}^2 f \qquad (4.77)$$
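
As a quick numerical illustration of Equation (4.77), the sketch below evaluates the average dynamic power for a hypothetical node. The 100 fF load, the 50 MHz input rate and the activity of 3/16 are illustrative values, not figures from the text.

```python
def dynamic_power(alpha, c_load, vdd, freq):
    """Average dynamic power (W) of a node: alpha * C_L * VDD^2 * f."""
    return alpha * c_load * vdd ** 2 * freq


# Hypothetical operating point, for illustration only.
ALPHA = 3.0 / 16.0   # switching activity of the node
C_LOAD = 100e-15     # load capacitance, F
FREQ = 50e6          # input rate, Hz

p_5v = dynamic_power(ALPHA, C_LOAD, 5.0, FREQ)
p_3v3 = dynamic_power(ALPHA, C_LOAD, 3.3, FREQ)

print(f"P_d at 5.0 V: {p_5v * 1e6:.2f} uW")
print(f"P_d at 3.3 V: {p_3v3 * 1e6:.2f} uW")
print(f"ratio 3.3 V / 5 V: {p_3v3 / p_5v:.3f}")
```

The ratio depends only on the square of the supply voltages, which is why lowering the supply is the most direct lever on this component.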
The internal power dissipation, due to the internal capacitive nodes, can be
characterized by simulation. Fig. 4.24 illustrates an example of a complex gate
with internal nodes. The internal dynamic power of a cell is given by
$$P_{int} = \sum_{i=1}^{n} \alpha_i C_i V_i V_{DD} f \qquad (4.78)$$
in the next sections. First we consider the case of a NOR gate. Then we treat
several static gates. Table 4.3 illustrates the truth table of the NOR gate. From
the table the probability that the output is at zero is 3/4 and that it is at one
is 1/4. The probability for a $0 \rightarrow V_{DD}$ transition is computed by multiplying
the probability that the output will be at zero, $P_0$, by the probability it will
be at one, $P_1$.
$$P_{NOR} = P_0 \cdot P_1 = \frac{3}{4} \times \frac{1}{4} = \frac{3}{16} \qquad (4.79)$$
We assume that the inputs are uniformly distributed (i.e., the probabilities
P(A=1) = P(B=1) = 1/2).

We show that for any boolean function, the activity of a static gate is given
by
$$\alpha = P(0 \rightarrow 1) = P_0 \cdot P_1 \qquad (4.80)$$
where $P_0$ is computed by dividing the number of zeros by the total number of
input combinations ($N = 2^n$ for an n-input gate) and $P_1$ is computed by dividing
the number of ones by N. $P_0$ is also equal to $(1 - P_1)$. Fig. 4.25 shows the
probability that the output makes a $0 \rightarrow 1$ transition for several static gates.
The probabilities of transition at the inputs are assumed uniformly distributed.
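
Equation (4.80) can be evaluated mechanically from a truth table. The sketch below enumerates all input combinations of a gate, counts the ones at the output and returns $P_0 \cdot P_1$ as an exact fraction. It assumes uniformly distributed, independent inputs, as in the text; the function names are illustrative.

```python
from fractions import Fraction
from itertools import product


def transition_probability(func, n_inputs):
    """Return P(0->1) = P0 * P1 for a static gate with uniform inputs."""
    outputs = [func(*bits) for bits in product((0, 1), repeat=n_inputs)]
    p1 = Fraction(sum(outputs), len(outputs))   # fraction of ones
    p0 = 1 - p1                                 # fraction of zeros
    return p0 * p1


def nor2(a, b):
    return int(not (a or b))


def nand2(a, b):
    return int(not (a and b))


def xor2(a, b):
    return a ^ b


for name, gate in (("NOR2", nor2), ("NAND2", nand2), ("XOR2", xor2)):
    print(name, transition_probability(gate, 2))
```

For the 2-input NOR this reproduces the 3/16 of Equation (4.79).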
[Figure 4.25: 0 → 1 transition probabilities of several static gates with uniformly distributed inputs]
4.5.4.1 Example
• Implementation 1: a 6-input NAND and an inverter.
• Implementation 2: two 3-input NANDs and one 2-input NOR.
• Implementation 3: three 2-input NANDs and one 3-input NOR.
First we compare the delay and the area of the different implementations.
Using the data of Table 4.4, the results are reported in Table 4.5. The delay may
be computed or simulated by SPICE as illustrated in Table 4.5. Implementations
2 and 3 offer the best speed compared to the first one. However,
they require more area.

Let us now compare the power dissipation using the power cost function. It is
defined by
$$\text{Power cost} = \sum_i P_{0 \rightarrow 1,i}\, C_i \qquad (4.86)$$
Implementation 1        O1          Z
$P_1$                   63/64       1/64
$P_0 = 1 - P_1$         1/64        63/64
$P_{0 \rightarrow 1}$   63/4096     63/4096

Implementation 2        O1          O2          Z
$P_1$                   7/8         7/8         1/64
$P_0 = 1 - P_1$         1/8         1/8         63/64
$P_{0 \rightarrow 1}$   7/64        7/64        63/4096
Note that the node O1, in implementation 1, has a lower switching activity
compared to the other two. To compute the power cost function we will not include
the primary inputs. Table 4.7 illustrates the results of this calculation. The
results indicate that implementation 1 has the lowest power. So technology
mapping is important for low-power applications.
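
The node probabilities for the three implementations of the 6-input AND can be checked by exhaustive enumeration. The sketch below is a minimal model under the same assumptions as the text (zero-delay gates, uniformly distributed independent inputs). The node names O1, O2, O3 and Z are labels chosen here, and the power cost itself is not computed because the node capacitances of Table 4.4 are not reproduced above.

```python
from fractions import Fraction
from itertools import product


def nand(*x):
    return int(not all(x))


def nor(*x):
    return int(not any(x))


def impl1(x):
    """6-input NAND followed by an inverter."""
    o1 = nand(*x)
    return {"O1": o1, "Z": 1 - o1}


def impl2(x):
    """Two 3-input NANDs followed by a 2-input NOR."""
    o1, o2 = nand(*x[:3]), nand(*x[3:])
    return {"O1": o1, "O2": o2, "Z": nor(o1, o2)}


def impl3(x):
    """Three 2-input NANDs followed by a 3-input NOR."""
    o1, o2, o3 = nand(x[0], x[1]), nand(x[2], x[3]), nand(x[4], x[5])
    return {"O1": o1, "O2": o2, "O3": o3, "Z": nor(o1, o2, o3)}


def node_activities(impl):
    """Return {node: P0 * P1} over all 64 input vectors."""
    vectors = list(product((0, 1), repeat=6))
    ones = {}
    for x in vectors:
        for node, value in impl(x).items():
            ones[node] = ones.get(node, 0) + value
    result = {}
    for node, count in ones.items():
        p1 = Fraction(count, len(vectors))
        result[node] = (1 - p1) * p1
    return result


for name, impl in (("1", impl1), ("2", impl2), ("3", impl3)):
    print("Implementation", name, node_activities(impl))
```

All three implementations share the same activity at the final output; they differ only in the activity of the intermediate nodes, which is what the power cost function weighs against the node capacitances.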
We consider now another example using a low-area 0.8 µm CMOS standard cell
library for the 6-input AND implementation. Some characteristics of this
library are shown in Table 4.8. Compared to the library presented in Table 4.4,
this library uses small transistors with $W_p = W_n = 4$ µm. Compared to the
case of the high-performance library, the cell area unit, in the low-area case, is
smaller by a factor of 1.5. Note that the delays of different gates are higher.
However, the input gate and output parasitic capacitances are lower. Thus,
this library can be used for low-power function implementation.

The delays reported in Table 4.8 do not include the effect of the input voltage
slope. The delay of the different implementations was simulated with SPICE
and it is almost the same for all the configurations. The delay is 1.5 ns. Using
the same reasoning discussed earlier we can compute the power cost function
using this library. The transition probabilities are the same, except the total
node capacitances which are different. The results of the power cost evaluation
are illustrated in Table 4.9.

The power cost, in the case of the low-power library, is almost half of that of
the high-performance one. Still, implementation 1 has a low-power characteristic
while the speed is almost the same compared to the others. The area is also lower
than the other implementations. This example shows that the power dissipation can
be reduced at the gate level. Even if we take into account the wire capacitances
between the cells, the conclusion is still valid. The topic of low-power
at the gate-level is discussed more in Chapter 8. Keep in mind that in this
comparison, the internal power of the gates has not been considered.
4.5.5 Glitching Power
Note that in the probability discussed so far, we assumed that the gates had zero
delay. In that case, we are not taking into account the glitches and we consider
only the transitions between stable states. Glitches must be considered if we
assume non-zero delay at gates. Thus the total dynamic power of a circuit is
the sum of the dynamic power with zero delays and the glitching power. So
what is the glitching phenomenon?

In a static logic gate, the output or internal nodes can switch before the correct
logical value is stable. To illustrate this spurious transition, Fig. 4.27
• Avoid if possible the cascaded implementation; and
• Redesign the logic when the power due to the glitches is an important
  component.

• Set the sizing of the transistors composing the gate;
Note that the power line widths are drawn taking into consideration the current
consumed by the cell because the electromigration phenomenon sets the
minimum width of conductors.

The third layout methodology is the gate array. The gate arrays consist of
implemented cells and need only the personalization steps. Fig. 4.33 illustrates an
example of a gate-array core using the Sea-Of-Gates structure. It consists of I/O and
internal cell areas. The I/O cell area contains pads with input/output buffers.
The internal cell array contains a continuous array of NMOS and PMOS
transistors. Hence, the transistors and interconnects are already predefined. The
design of a logic gate consists of wiring the different transistors using
metallization and contacts. The isolation of a logic gate is performed by tying the
polysilicon gates of the limiting transistors to VSS or VDD depending on the
type of gate diffusion. Routing channels are routed over unused transistors.
This methodology permits the reduction of the design cost at the expense of
area, power and performance. One recent gate array architecture was based on
multiplexers with small size transistors to maintain low-power characteristics
[11].
Any CMOS TG logic (we call it here conventional pass-transistor logic) function
can be implemented using the TG primitive element described above. In such an
implementation the transistor count, hence the silicon area, is low compared to
a standard static CMOS implementation. This is highlighted in the implementation
of such functions as multiplexing, demultiplexing, decoding and addition.
Fig. 4.36 shows a 4:1 multiplexer, where the data lines A, B, C and D are
controlled by S1 and S2 such that

This form of logic is used when the inputs and their logic complements are
available. The implementation does not need VDD or ground lines. However,
the implementation suffers from a number of drawbacks; the driving capability
of the circuit is limited and the delay increases with long TG chains. Moreover,
the circuit does not provide a restoration of the logic levels, i.e., the logic gates
are passive with no gain elements. Fig. 4.37 shows an example of how to restore
the voltage levels in chained TGs. When 8 TGs are put in series, the output
signal changes very slowly. However, when an inverter stage is added every 4
TG stages, the level is restored as shown in the SPICE voltage waveforms of
Fig. 4.37.
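
The growth of delay with chain length can be illustrated with a first-order RC model. The sketch below treats each TG as a resistor with a node capacitance to ground and uses the Elmore approximation for a uniform ladder; this model, the buffer delay and all numeric values are assumptions made here for illustration and are not taken from the SPICE results of Fig. 4.37.

```python
def chain_delay(n_stages, r_tg, c_node):
    """Elmore delay (s) of n identical TG stages: R*C*n*(n+1)/2."""
    return r_tg * c_node * n_stages * (n_stages + 1) / 2.0


def buffered_chain_delay(n_stages, segment, r_tg, c_node, t_buf):
    """Delay when a restoring buffer follows every `segment` TG stages."""
    full, rest = divmod(n_stages, segment)
    delay = full * (chain_delay(segment, r_tg, c_node) + t_buf)
    delay += chain_delay(rest, r_tg, c_node)   # trailing unbuffered stages
    return delay


# Hypothetical values: 5 kOhm per TG, 20 fF per node, 0.2 ns per buffer.
R_TG, C_NODE, T_BUF = 5e3, 20e-15, 0.2e-9

plain = chain_delay(8, R_TG, C_NODE)
buffered = buffered_chain_delay(8, 4, R_TG, C_NODE, T_BUF)

print(f"8 TGs in series:       {plain * 1e9:.2f} ns")
print(f"buffer every 4 stages: {buffered * 1e9:.2f} ns")
```

Because the unbuffered delay grows roughly with the square of the number of stages, splitting the chain can pay for the added buffer delay, in addition to restoring the logic levels.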
controlled by the carry signal $C_{i-1}$, should be placed close to the output. This
will offset the body effect problem, since the carry is the latest arriving signal.
An optimized implementation of the full-adder is shown in Fig. 4.39(b). It uses
only 18 transistors and is based on the XOR function shown in Fig. 4.38 and
the TG gates. Hence, this adder is more compact and faster and consumes less
power than the complementary static one.
A   B   $C_{i-1}$   $S_i$   $C_i$
0   0   0           0       0
0   1   0           1       0
1   0   0           1       0
1   1   0           0       1
Thus $V_{OL}$ depends strongly on the ratio $\beta_p/\beta_n$. For example, if we need a
$V_{OL} = 0.04\,V_{DD}$ and $V_T = 0.2\,V_{DD}$, then the ratio $\beta_p/\beta_n$ should be equal at
least to 0.1. If the NMOS transistor is minimum size, the PMOS should be
weak to provide adequate noise margins (low $V_{OL}$). In this case, the rise time
of the gate is too slow. If we improve the rise time, the ratio condition tends
to increase the gate area and hence the input capacitance.

Although this circuit offers a reduction in total transistor count and ease of
layout, it has the disadvantage of non-zero static power dissipation. Since the
pull-up PMOS is always ON, a current flows from VDD to ground whenever
the pull-down section of the pseudo-NMOS is turned ON. This current is the
source of the static power dissipation. When a pseudo-NMOS gate, with output
at $V_{OL}$, is driving another one, the driven gate, with OFF pull-down section,
leaks a high subthreshold current but still this current is lower than the one
when the pull-down is ON. For an n-input pseudo-NMOS gate there are (n+1)
transistors. Fig. 4.42 illustrates an example of a complex gate implemented in
pseudo-NMOS style. This logic has been used in many applications such as
decoding logic for memories and PLA. Because of its high static power, it is
not suitable for low-power applications.
Figure 4.41 Pseudo-NMOS complex logic gate
This circuit uses a single clock phase clk. During the precharge phase (clk = 0),
the storage capacitance is charged through the PMOS pull-up P1 to VDD and
the inputs have no effect since there is no path to ground. The output of the
buffer is precharged to ground. During the evaluation phase (clk = 1), N1 is
ON, and depending on the logic performed by the N-logic block, the node A is
either discharged or it will stay precharged.
$$P_{0 \rightarrow 1} = P_0 \qquad (4.94)$$
where $P_0$ is the probability that the output is at "0". For a two-input
NAND dynamic gate, the output has only one zero for 4 input states. So,
$$P_{0 \rightarrow 1} = P_0 = \frac{1}{2^2} = \frac{1}{4} \qquad (4.96)$$
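
The difference between the two activity rules can be made concrete for the same logic function. The sketch below compares the dynamic-gate rule of Equation (4.94) with the static-gate rule $P_0 \cdot P_1$ of Equation (4.80), assuming uniformly distributed independent inputs; the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product


def output_zero_probability(func, n_inputs):
    """Fraction of input combinations for which the output is 0."""
    outputs = [func(*bits) for bits in product((0, 1), repeat=n_inputs)]
    return Fraction(outputs.count(0), len(outputs))


def dynamic_activity(func, n_inputs):
    """Precharged gate: every evaluated '0' costs a 0->1 precharge."""
    return output_zero_probability(func, n_inputs)


def static_activity(func, n_inputs):
    """Static gate: a 0->1 transition needs a 0 followed by a 1."""
    p0 = output_zero_probability(func, n_inputs)
    return p0 * (1 - p0)


def nand2(a, b):
    return int(not (a and b))


print("dynamic NAND2:", dynamic_activity(nand2, 2))
print("static NAND2: ", static_activity(nand2, 2))
```

For this gate the dynamic implementation has the higher output activity, since the output must be precharged again after every evaluation that produced a zero.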
Another refinement of the domino CMOS logic is shown in Fig. 4.48 [14], where
the CMOS buffer is removed. N and P logic blocks are alternated and each
drives the other. When clk is low (0), the first and third stages are precharged
high and the second stage is precharged low.
Fig. 4.49 shows another NP domino logic called NORA (No Race) [15]. Two
sections, clk and $\overline{clk}$, are shown in Fig. 4.49. It is constructed by cascading
N and P blocks followed by a C²MOS (clocked CMOS) latch. CMOS buffers
(inverters) are used to provide logic inversion. When clk = 1 (evaluation phase
in section clk), the C²MOS latch³ operates like an inverter. When clk = 0, the
latch moves into hold state because the output NMOS and PMOS transistors
are OFF. In this case, the old data is latched at the output. This latch is used
to avoid signal races. A NORA pipeline is shown in Fig. 4.50 and it consists
of alternating clk and $\overline{clk}$ sections. Signal races do not occur in this structure
because of the use of C²MOS. Another logic has been proposed to overcome
charge sharing by using additional clocking signals. It is called Zipper CMOS
logic. For more details refer to [16].
³See the example of the DEC Alpha chip in Section 4.8.4.
logic and improved rise time. The power dissipation consumed by this logic is
high due to the high switching activity of the clock even if the circuit is not
used. However, power-down techniques can be used to control the clock of the
logic. Using this style requires the designer to spend more design effort
than the static style to solve all the problems of dynamic logic such as charge
sharing, clock skew, precharging, etc. Finally, we note that pass-transistor logic
is very promising for high-performance low-voltage low-power applications.
insensitive, is the one shown in Fig. 4.54 [18]. The delays of clk and
$\overline{clk}$ are equalized with special buffer sizing.
4.7 CLOCKING
One way to synchronize thousands of signals in a VLSI system is to employ a
clocking strategy. The clock controls the flow of data in the digital system and
reduces the complexity of design.
4.7.1.1 D-Latch
There are a variety of implementations for this D-latch. Fig. 4.58 reviews
some of the static versions. The circuit of Fig. 4.58(a) has a weak inverter
used as feedback path for latch mode. The voltage at node A is not changed
by noise or leakage because the feedback inverter would keep the level. The
feedback inverter should have low (W/L) for NMOS and PMOS (weak inverter)
compared to the transmission gate and forward inverter. This assures that the
transmission gate is capable of overdriving the feedback inverter when data is
being written to the latch. The feedback inverter should be carefully sized to
guarantee switching for all process corners and maximum fanout condition.
The problem of ratioed design in Fig. 4.58(a) can be avoided by using the
modified version in Fig. 4.58(b), where a transmission gate is added in the
feedback path. When clk = 1, the data is passed to the storage node and the
feedback node is disconnected. When clk = 0, the feedback loop is closed, and
the latch is in store (latch) mode. Fig. 4.58(c) shows another version of Fig.
4.58(b), where the outputs are buffered. This latter latch is found in the cell
libraries of standard-cell and gate-array. All these described static latches store
their state even if the clock is stopped. Note that these latches do not dissipate
any DC power.

To reduce the size of the static latches, dynamic versions can be used as
illustrated in Fig. 4.59, Fig. 4.60 and Fig. 4.61. Fig. 4.59 shows a simple
dynamic latch, where the storage node A temporarily stores the data. Note
that latches have a property called "transparency": the output follows the input
when the clock is asserted. Otherwise they are "opaque". Fig. 4.60 shows two
other latches [19]. The circuit of Fig. 4.60(a) is transparent when the clock,
clk, is high and latches the data (opaque) when the clock is low. This latch is
positive level-sensitive. The negative level-sensitive latch is shown in Fig. 4.60(b).
Note that these latches use one clock line (clk).
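
The transparent and opaque modes can be summarized by a purely behavioral model. The sketch below is not a circuit simulation and ignores all timing, leakage and noise effects; it only shows how a positive level-sensitive latch follows D while clk is high and holds its value while clk is low. The function name and the sample sequences are illustrative.

```python
def positive_latch(d_samples, clk_samples, q_init=0):
    """Return the Q sequence of an ideal positive level-sensitive latch."""
    q = q_init
    q_samples = []
    for d, clk in zip(d_samples, clk_samples):
        if clk == 1:
            q = d            # transparent: output follows the input
        q_samples.append(q)  # opaque when clk == 0: old value is kept
    return q_samples


d_in = [1, 0, 1, 0, 1, 1]
clk = [1, 0, 0, 1, 1, 0]
print(positive_latch(d_in, clk))
```

A negative level-sensitive latch would follow the same model with the clock test inverted.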
The circuits of Fig. 4.60 have reduced noise immunity. For example, for the
circuit of Fig. 4.60(a), when the latch is opaque (clk = 0), the node A may
be tristated high with Q tristated low. The node A is isolated and may be
susceptible to noise which reduces its voltage. The reduced voltage of node A
can cause the PMOS P2 to leak current, thereby destroying the output Q. This
problem was addressed with latches designed in the DEC Alpha microprocessor
[21]. For example the circuit of Fig. 4.61 is an improved version of Yuan and
Svensson [19]. A weak PMOS device P3 is added to solve the problem of noise
in the positive level-sensitive latch. The operation of this latch follows. When clk
is high, P1, N1 and N3 function like an inverter. P2, N2 and N4 function also
like an inverter. Therefore the latch passes the input D to the output Q. If D
falls to low, then A is high and Q is low. When clk is low, N3 and N4 are OFF.
If D goes to high, P1 is OFF, while the nodes A and Q are tristated high and
low respectively. The added P3, in this case, is ON and holds P2 OFF. This
device supplies current to node A and counters any noise.
For reliability reasons many latches have been designed for the DEC Alpha chip
[21]. Some are illustrated in Fig. 4.62. These latches have been designed for all
process corners and circuit conditions (supply voltage, temperature, rise/fall
times, etc.). The results showed no appreciable evidence of race-through for
clk rise/fall times at or below 0.8 ns. With 1-ns rise/fall times, the latches
showed some signs of failure. A 0.5 ns rise/fall time was set for the clock
in this chip.
clk locally, to reduce the clock skew problem. The clock skew, in a single-phase
strategy, can lead to invalid data storage.

A dynamic version of the positive ETDFF is shown in Fig. 4.64 [19]. The
operation of this circuit is illustrated by the voltage waveforms. The value
of the hold time of this flip-flop is close to zero [20]. This dynamic flip-flop,
compared to the static one, needs only 9 transistors and one clock line. The
negative ETDFF is shown in Fig. 4.65.
4.7.1.3 Miscellaneous
Many other latches and flip-flops are available, for example in gate-array
libraries, such as the JK flip-flop and the toggle (T) flip-flop. Fig. 4.66 shows
the T flip-flop with reset control. When clk = 1, the output Q is complemented,
whereas when clk = 0, Q keeps its old state. This T flip-flop provides
divide-by-2 operation. A JK flip-flop is shown in Fig. 4.67. When J and K
inputs are low, the outputs are maintained on the positive edge of the clock. If
Fig. 4.70 shows another example of a single-phase system using ETDFFs. This
system is edge based and the minimum cycle time is given by [22]
4.8.1 CPL
The main concept behind CPL is shown in the block diagram of Fig. 4.72. It
consists of an NMOS pass-transistor logic network driven by two sets of
complementary inputs and two CMOS inverters used as buffers.
where $V_{Tn}$ is the threshold voltage subject to the body effect. So the inverting
buffers translate the swing of the output from ground to $V_{DD} - V_{Tn}$ to a
full-rail logic swing (ground to $V_{DD}$). The logic threshold voltage of the inverting
buffers should be shifted to a lower voltage than $V_{DD}/2$. Hence the $\beta$ ratio of
the inverter in this case should be higher than unity. This inverting buffer
also permits driving a large load capacitance efficiently. When the outputs of
the logic networks are at $V_{DD} - V_{Tn}$ then all the output inverters are driven by
a reduced swing, as shown in Fig. 4.74. Hence, the DC power of the inverter
increases because the pull-up PMOS device is not completely OFF. The $V_{GS}$
of the pull-up PMOS is equal to $-V_{Tn}$. Moreover, the drive capability of the
pull-down NMOS transistor is reduced, particularly if the power supply voltage
is reduced. The noise margins are also affected. To solve the problem of DC
power dissipation we can design NMOS transistors with lower $V_T$ than that
of the PMOS transistor. Also, the body effect should be controlled. Another
way to solve all the problems associated with the reduced high level is to add
to the CPL a PMOS latch as shown in the case of the AND/NAND circuit of
Fig. 4.75. In this case, the two added PMOS transistors can be sized to be
minimum, as long as the high level reaches $V_{DD}$ in the given cycle time. We
call this style PMOS latch CPL. Careful design should be considered when the
NMOS network has minimum size devices. Otherwise the high level stored in
the latch cannot be discharged.
Fig. 4.76 shows examples of CPL arrays for OR/NOR and XOR/XNOR
functions. With only 4 transistors we can produce many two-input functions
with their complement. More examples are shown in Fig. 4.77 for 3-input
AND/NAND and OR/NOR gates. In these examples 8 NMOS transistors are
needed to generate the 3-input functions. Any complex logic function can be
constructed easily using this principle of NMOS network transistors.
For example the full-adder circuit can be constructed using wired CPL as shown
in Fig. 4.78. The circuit is constructed using basic CPL primitives discussed before.
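
At the logic level, each output of a two-input CPL gate behaves like a 2:1 multiplexer built from two NMOS pass transistors, so a function and its complement need four transistors. The sketch below models one common wiring of the AND/NAND, OR/NOR and XOR/XNOR networks in this way. The wiring shown is an assumption rather than a transcription of the figures, and the model ignores the reduced high level and the inverting output buffers.

```python
from itertools import product


def pass_mux(select, when_high, when_low):
    """Two NMOS pass transistors gated by select and its complement."""
    return when_high if select else when_low


def cpl_and_nand(a, b):
    na, nb = 1 - a, 1 - b
    return pass_mux(b, a, b), pass_mux(b, na, nb)


def cpl_or_nor(a, b):
    na, nb = 1 - a, 1 - b
    return pass_mux(b, b, a), pass_mux(b, nb, na)


def cpl_xor_xnor(a, b):
    na = 1 - a
    return pass_mux(b, na, a), pass_mux(b, a, na)


# Print the truth tables of the three complementary output pairs.
for a, b in product((0, 1), repeat=2):
    print(a, b, cpl_and_nand(a, b), cpl_or_nor(a, b), cpl_xor_xnor(a, b))
```

Only the signals wired to the multiplexer inputs change from one function to another, which is why the same four-transistor array can realize all of these gates.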
Also the sizes of the transistors are shown in this figure for fast operation. The
transistors of the NMOS network, far from the output, have larger size than
those closer to the output. This is because the NMOS devices, closer to the
output, pass a reduced swing. The sizing of the transistors depends on the
circuit type, layout and device parameters. Compared to a full-adder
implemented in standard static CMOS style, the adder of Fig. 4.78 is much faster
and dissipates less power due to the low internal swing. Also the schematic of
this CPL adder is structured, resulting in a simplified layout.

One drawback associated with the CPL logic is the driving capability, which is
limited, and the delay increases with long pass-transistor chains. So buffering
is needed to restore the transmitted level and improve the driving capability.
4.8.2 DPL
The DPL is a modified version of CPL suitable for low-voltage applications. It
alleviates the problems of CPL associated with the reduced high level. An example
for an AND/NAND gate is illustrated in the schematic of Fig. 4.79. It consists
of NMOS and PMOS pass transistors in contrast to the CPL gate, where only
NMOS devices are used. In the example of the AND/NAND gate, the NMOS
transistors are used to pass the ground while the PMOS transistors are used
to pass the high level ($V_{DD}$). The output of the DPL is full rail-to-rail swing
owing to the addition of PMOS. However, this addition results in increased
input capacitance compared to CPL. This will not limit the performance of
DPL as will be explained.
improves the speed as shown in the simulation curves of Fig. 4.84. It has been
found that the size of the latch should be minimum, for a fast operation, using
the 0.8 µm device parameters of Chapter 3. If the size of the NMOS transistors
in the network is small, the output of the SRPL gate fails to switch to ground
because the equivalent impedance of the network is higher than the one seen
by the output to VDD. This problem becomes worse when many gates are
cascaded. Fig. 4.85 illustrates this problem in 2 AND/NAND cascaded gates.
When the input goes from VDD to ground, the nodes A and B, initially at VDD,
cannot be completely discharged.
4.9 I/O CIRCUITS
I/O circuits connect the on-chip logic circuitry to the external world. They play an important role in the limitation of speed and power dissipation of the whole chip. In this section many I/O circuits are discussed, such as input and output buffers, clock distribution, clock buffering and low-swing I/O. The power dissipation issues related to these circuits are also studied. Layout techniques for I/O circuits are not covered in this chapter.
peak current that flows in the diodes. Typical values of R are a few hundred Ω and are realized using the diffusion layers. The input protection circuit has a parasitic RC time constant which can limit high-speed operation. It ranges from a few tens of ps to a few hundreds of ps.
The input buffer, connected to this input pad, consists in general of a number of inverter stages to drive the internal circuitry. The input buffer for clock distribution needs special care and design and is discussed in Section 4.9.4. The individual input inverters are designed by setting their W/L ratio such that the switching point of the buffer is near 1.4 V (middle of $V_{IL}$ and $V_{IH}$). To have this switching point of 1.4 V at 5 V power supply voltage, the ratio $W_n/W_p$ of the input inverter of the buffer should be 2.9 using 0.8 µm CMOS technology. At 3.3 V, this ratio should only be equal to 0.7. However, since the TTL voltage swing is limited to 1.2 V, the input buffer is always dissipating
Figure 4.81 TTL input buffer.
The dynamic power dissipation of the input pad is mainly internal power. The total dynamic power of all the input pads (of the same type, for example) is
where $A$ is the switching activity, $N_i$ the number of the input pads and $E_{ii}$ is the internal energy of the input pad in Watt/Hz.
When the input signal has ECL levels, an ECL input buffer with an ECL-CMOS converter is used. In general these are implemented in BiCMOS technology and consume DC power. An ECL-CMOS converter can be designed in full CMOS ps].
A CMOS version of the Schmitt trigger is shown in Fig. 4.91. When the input is rising, initially the NMOS transistors are OFF. The $V_{GS}$ of the transistor $N_2$ is given by
$V_{GS2} = V_{in} - V_{FN}$ (4.103)
Figure 4.91 The CMOS Schmitt trigger schematic.
When $V_{in} = V_{T+}$, $N_2$ enters conduction, which means $V_{GS2} = V_{Tn}$; then¹
$V_{FN} = V_{T+} - V_{Tn}$ (4.104)
¹ We neglect the body effect of $N_2$.
where
(4.111)
This equation shows that the trigger point is independent of the process parameters except for $V_{Tn}$. By symmetry, the trigger point for the falling transition can also be deduced from the pull-up section. We have
(4.112)
where
(4.113)
$V_{T+} = \frac{V_{DD}}{2} + \frac{V_T}{2}$ (4.114)
$V_{T-} = \frac{V_{DD}}{2} - \frac{V_T}{2}$ (4.115)
$V_H = V_{T+} - V_{T-} = V_T$ (4.116)
In this case the hysteresis voltage can be made equal to $V_T$. The short-circuit power dissipation of the Schmitt trigger can be very important since the rise/fall times of the input signal are very long.
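The symmetric case of Equations (4.114) to (4.116) places the two trigger points at equal distances above and below $V_{DD}/2$. The short sketch below evaluates them numerically; the supply and threshold values in the example are illustrative assumptions, not values taken from the text.

```python
def schmitt_thresholds(v_dd: float, v_t: float):
    """Return (V_T+, V_T-, V_H) for the symmetric case of (4.114)-(4.116)."""
    v_t_plus = v_dd / 2 + v_t / 2    # rising trigger point
    v_t_minus = v_dd / 2 - v_t / 2   # falling trigger point
    return v_t_plus, v_t_minus, v_t_plus - v_t_minus

if __name__ == "__main__":
    # Assumed example values: 3.3 V supply, 0.7 V threshold voltage.
    vp, vm, vh = schmitt_thresholds(3.3, 0.7)
    print(f"V_T+ = {vp:.2f} V, V_T- = {vm:.2f} V, hysteresis = {vh:.2f} V")
```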
Fig. 4.92 shows the SPICE simulation of the circuit of Fig. 4.91 in 0.8 µm technology. In this example, the load capacitance is 0.1 pF and the total power dissipation is 0.85 mW. The dynamic power dissipation, due to the load and parasitic capacitances, is 0.40 mW. Therefore, the power due to the short-circuit current is 0.45 mW, which represents 53% of the total power dissipation.
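The breakdown quoted above is a simple subtraction followed by a ratio. A minimal sketch of that arithmetic, using only the two power figures given in the text (the function name is hypothetical):

```python
def short_circuit_share(p_total_mw: float, p_dynamic_mw: float):
    """Return (short-circuit power in mW, its fraction of the total power)."""
    p_sc = p_total_mw - p_dynamic_mw
    return p_sc, p_sc / p_total_mw

if __name__ == "__main__":
    p_sc, share = short_circuit_share(0.85, 0.40)
    print(f"short-circuit power = {p_sc:.2f} mW ({share:.0%} of total)")
```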
Question: What are the values of the size ratio $a$ and the number of stages $n$ that optimize the delay?
The power dissipation of a CMOS buffer is mainly dominated by dynamic power dissipation for large $V_T$. The short-circuit power dissipation can be neglected in a first-order analysis [34]. If we include the parasitic output capacitance, stage $i$ has a total output capacitance
Hence
$P_d = V_{DD}^2 f (a C_{in} + C_o) \frac{a^n - 1}{a - 1}$ (4.130)
The power efficiency of the buffer can then be defined as
$I = C \frac{\Delta V}{\Delta t} = \frac{3.2 \times 10^{-9} \times 3.3}{0.5 \times 10^{-9}} = 21$ A (4.132)
at $V_{DD}$ = 3.3 V power supply. The corresponding dynamic power dissipation due to this clock loading is
This example shows how important the clocking is as a design issue. A clocking strategy should be used to distribute the clock to the different functional blocks of the chip with minimum clock skew and low power dissipation.
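The current figure in Equation (4.132) follows from $I = C\,\Delta V/\Delta t$. The sketch below reproduces it under the assumption that 3.2 × 10⁻⁹ is the clock load in farads, 3.3 the swing in volts and 0.5 × 10⁻⁹ the edge time in seconds; that reading of the garbled equation is an interpretation, not a statement from the source.

```python
def clock_charging_current(c_load_f: float, v_swing_v: float, t_edge_s: float) -> float:
    """Average current needed to charge c_load_f through v_swing_v in t_edge_s."""
    return c_load_f * v_swing_v / t_edge_s

if __name__ == "__main__":
    i_clk = clock_charging_current(3.2e-9, 3.3, 0.5e-9)
    print(f"clock charging current = {i_clk:.1f} A")
```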
Figure 4.95 Clock skew due to the unbalanced loads at block A and block B.
Several strategies have been proposed to minimize clock skew. The first approach is to use cascaded inverters (a buffer) to drive a large load and feed all blocks as shown in Fig. 4.96. The buffer chain is designed by the approach presented in Section 4.9.3. In another approach, the clock distribution is accomplished by using a tree of well-sized clock buffers as illustrated by Fig. 4.97. Identical buffers are used in each level and each buffer sees the same load capacitance. Equalizing clock buffer loads is possible by: 1) equalizing the interconnect lengths between the buffers of different levels, and 2) the addition of dummy buffers at the slightly loaded buffer outputs. The last distribution level has buffers which drive the functional elements such as registers. This structure results in very reduced skew and the only skew that exists is the one produced by variations in process parameters. To further minimize the skew, an identical layout should be used for all the buffers. As an example of the tree approach, to distribute the clock signal to 64 elements (for example registers), 3 stages (levels) of buffering with a 1-to-4 tree structure are required. A variety of software packages have been developed for clock tree synthesis [35, 36].
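The 64-element example can be checked by counting how many levels of a fixed-fanout tree are needed to reach a given number of clocked elements. This is a minimal sketch with hypothetical function names; it assumes every buffer drives exactly `fanout` loads.

```python
def tree_levels(n_sinks: int, fanout: int) -> int:
    """Smallest number of buffer levels whose total reach covers n_sinks."""
    levels, reach = 0, 1
    while reach < n_sinks:
        reach *= fanout
        levels += 1
    return levels

def buffers_per_level(levels: int, fanout: int) -> list[int]:
    """Number of buffers at each level, from the root down to the last level."""
    return [fanout ** k for k in range(levels)]

if __name__ == "__main__":
    levels = tree_levels(64, 4)
    print(levels, buffers_per_level(levels, 4))
```

With a fanout of 4, the last level contains 16 buffers, each driving 4 of the 64 elements.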
To reduce the high dynamic power dissipation (a few Watts) in clock distribution at a fixed power supply, many techniques can be used such as:
1. Using a low-capacitance clock routing line such as metal3. This layer of metal can be, for example, dedicated to clock distribution only.
2. Using low-swing drivers at the top level of the tree or in intermediate levels.
For the second approach, a half-swing clocking scheme has been proposed [37]. Fig. 4.98 shows the half-swing clock driver, which generates half-$V_{DD}$ clock signals (four phases) to the elements (e.g., latches). Using the charge sharing principle, the node of half $V_{DD}$ can be expressed by
Fig. 4.99 shows the clocking schemes of the latches driven by the clock driver. Compared to the conventional scheme, which uses two clock phases, the half-swing scheme requires four clock phases. Two phases are for PMOSs and two are for NMOSs as shown in Fig. 4.99(b). This scheme reduces the power by 75%. However, the delay of the latch is increased by the new clocking scheme, which can be acceptable [37].
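The 75% figure is consistent with a simple scaling argument: if the energy per clock cycle scales as $C V_{swing}^2$, halving the swing leaves one quarter of the energy. The sketch below evaluates that argument; the quadratic scaling is an assumption about the scheme (it holds when the half-$V_{DD}$ level is obtained by charge sharing rather than from a lossy regulator).

```python
def swing_power_saving(swing_ratio: float) -> float:
    """Fractional power saving when the clock swing is scaled by swing_ratio,
    assuming energy per cycle is proportional to the square of the swing."""
    return 1.0 - swing_ratio ** 2

if __name__ == "__main__":
    print(f"half swing saves {swing_power_saving(0.5):.0%}")
```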
To drive the output pad, a high-drive-capability driver is needed to achieve adequate rise and fall times. In this case, an inverter chain is used to handle the large load of the pad, package wiring, and off-chip load. This capacitance can be a few tens of pF. A typical value of this capacitance is 50 pF. There are many types of output pads such as tristate, bidirectional, low-$V_{DD}$ (3.3 V) to high-$V_{DD}$ (5 V) output buffer and low-swing output.
The total power dissipation at the output pads can be divided into the static power dissipation and the dynamic power dissipation. The static power dissipation is due mainly to the leakage currents (junction and subthreshold) if the output pads are driving CMOS logic. If the $V_T$ of the devices is large enough, then the static power dissipation of the output pads is neglected. However, if $V_T$ is small, then the DC power, due to the subthreshold current, for the output pads is
$P_s = N_o \, I_{DS,mean} \, V_{DD}$ (4.136)
where $N_o$ is the number of output pads and $I_{DS,mean}$ is the average subthreshold current for both cases when the input is low and high. For low $V_T$ the $I_{DS,mean}$ value would be important, because the devices in the output buffer have a large size, particularly the output transistors. $I_{DS,mean}$ should be computed in the worst case, where $V_T$ has its minimum value. Thus for future technologies where the threshold voltage is low and the number of output pads is large, this static power dissipation would be very important and can be a limiting factor for low-power applications. Hence low-power circuit techniques are needed for output buffers.
If the CMOS output buffer is intended to drive bipolar TTL inputs (not CMOS TTL inputs), then an important current is sunk. Fig. 4.102 shows the final stage of the buffer driving a TTL logic. Since bipolar TTL inputs can source significant amounts of current, a CMOS output buffer must sink this current. For a 3.3 V power supply, this current can be in the range of 1 mA to 12 mA depending on the strength of the output driver. The static power dissipated by one output pad driving bipolar TTL inputs is
$P = V_{OL} I_{OL}$ (4.137)
where $I_{OL}$ is the current sunk by the output buffer and is equal to the sum of the currents from all the bipolar TTL inputs. $V_{OL}$ = 0.4 V for a low TTL output. This dissipated power is due to the output NMOS pull-down transistor and can be an important issue as far as the chip heat is concerned. Note that the corresponding energy is not drawn from the internal power supply.
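To give a feel for the magnitude, the product $V_{OL} I_{OL}$ can be evaluated at the two ends of the current range quoted above. A minimal sketch (hypothetical function name) using $V_{OL}$ = 0.4 V:

```python
def ttl_sink_power_mw(v_ol_v: float, i_ol_ma: float) -> float:
    """Static power in mW dissipated in the pull-down device of one output pad."""
    return v_ol_v * i_ol_ma  # V times mA gives mW

if __name__ == "__main__":
    for i_ol in (1.0, 12.0):
        print(f"I_OL = {i_ol:4.1f} mA -> P = {ttl_sink_power_mw(0.4, i_ol):.1f} mW per pad")
```

Multiplying the per-pad value by the number of pads that are low at the same time gives the chip-level heating contribution.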
Another component of the total power dissipated at the output pads is the dynamic power. It is given by
The total power dissipation of the bidirectional pads can be evaluated using the approaches developed for the input and output circuits.
problem. For example, if the 3.3 V inverter drives high into the 5 V inverter, the $|V_{GS}|$ of the PMOS transistor of the 5 V inverter is equal to 1.7 V. This value is larger than the $V_T$ of the device and thus results in large DC power dissipation, in the range of milliwatts. Since this power is dissipated by every I/O, for a whole ASIC chip it could be hundreds of mW. This situation is unacceptable for low-power applications.
Xp. We use a simple analysis to find the relationship between the sizes of the two transistors. For high input data, initially the node C is at $V_{DDH}$. Thus the NMOS $N_2$ is in saturation and the PMOS $P_f$ is in the linear region. By assuming that the drain current of $N_2$ is much higher than that of $P_f$, we have
(4.140)
where $\beta_{N_2}$ and $\beta_{P_f}$ are the βs of the NMOS transistor $N_2$ and the PMOS transistor $P_f$, respectively. The low-to-high voltage converter has a negligible DC current when the input is stable since all the devices are completely OFF. This technique can be used to interface any low voltage to a higher voltage.
$V_L = L \frac{di}{dt}$
This noise problem can also occur on the power lead, where it is termed power bounce. We will use only one name to refer to this problem. Consider a CMOS output driver driving an output pad of 50 pF at 3.3 V with 2 ns rise/fall times. It can be shown [39] that $di/dt$ is related to the fall/rise times by
(4.142)
The $di/dt$ can be as high as 165 mA/ns. If, for example, 8 drivers are allowed to switch simultaneously per $V_{DD}$/$V_{SS}$ pad pair, the resulting ground bounce for $L$ = 1 nH is 1320 mV. This value can be a problem, particularly for low-voltage applications, since this ground bounce consumes a large fraction of the digital noise margins. Some of the problems encountered are 1) false triggering, 2) double clocking, and/or 3) missing clocked pulses.
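The numbers in this example can be reproduced with a short calculation. The body of Equation (4.142) is not legible in this copy, so the sketch assumes the relation $(di/dt)_{max} = 4 C V_{DD} / t_f^2$, which is chosen only because it matches the 165 mA/ns quoted for 50 pF, 3.3 V and 2 ns; treat it as an assumption rather than the book's formula. The bounce itself follows from $V = L\,di/dt$ summed over the simultaneously switching drivers.

```python
def max_di_dt(c_load_f: float, v_dd_v: float, t_edge_s: float) -> float:
    """Peak current slew in A/s; assumed relation 4*C*V/t^2 (see lead-in)."""
    return 4.0 * c_load_f * v_dd_v / t_edge_s ** 2

def ground_bounce(n_drivers: int, inductance_h: float, di_dt_a_per_s: float) -> float:
    """Bounce voltage when n_drivers share one supply pin of given inductance."""
    return n_drivers * inductance_h * di_dt_a_per_s

if __name__ == "__main__":
    slew = max_di_dt(50e-12, 3.3, 2e-9)
    print(f"di/dt = {slew / 1e6:.0f} mA/ns")          # 1 mA/ns equals 1e6 A/s
    print(f"bounce = {ground_bounce(8, 1e-9, slew) * 1e3:.0f} mV")
```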
I/O buffers are not the only source of ground bounce in CMOS circuits. Clock buffers and, to a lesser extent, the core logic can also cause serious ground bounce in the supply leads when driving large loads. Careful power supply routing is needed when powering large buffers. The resistance of the metal should be minimized so that the voltage drop due to the current spike is reduced.
There are many techniques to reduce the ground bounce. One simple approach is to use separate supply pins for the output buffers. Some approaches, based on reducing $L$ and $di/dt$, are the following:
Placement of power and ground pins adjacent to each other reduces the effective inductance of power and ground pins by mutual
In conclusion, to reduce the ground bounce, all the techniques can be combined to reduce $L$ and $di/dt$. The reader can refer to many other techniques to reduce the ground bounce [42, 43, 44, 45].
Fig. 4.101 shows two chips connected to the bidirectional transmission line (50 Ω termination resistors) through GTL I/O (Gunning I/O) transceivers. Both ends of the transmission line are terminated to prevent reflections. The load seen by each driver is 25 Ω. The termination voltage $V_{TM}$ is about 1.2 V. The output driver is an open-drain NMOS pull-down transistor and when it is inactive the output is at the high-level signal $V_{OH}$, equal to $V_{TM}$. The input receiver uses a differential comparator with an external reference voltage of 0.8 V.
Fig. 4.109 shows the input buffer, which employs a differential comparator. This circuit switches to high (low) $V_{out}$ when $V_{in} - V_{ref}$ > 50 mV (< −50 mV),
To reduce this subthreshold current, associated with low-$V_T$ devices, there are many techniques. These techniques are based on the principle of reverse biasing the $V_{GS}$ voltage of the MOS device (in the case of NMOS) in the standby mode of operation, as shown in Fig. 4.110. With a negative $V_{GS}$, the standby state of the device moves from its original state to a new one, as indicated in the figure. We cite two techniques using this principle:
This technique has been used mainly to reduce the static power dissipation in standby mode of the memory decoded-driver [53]. The drivers in a memory comprise a large number of circuits, arranged repeatedly, but only a few of them operate simultaneously. The circuit of Fig. 4.111 can drastically reduce the subthreshold current of the drivers. The technique simply consists of inserting a PMOS transistor $P_c$ with a size $W_c$ between the power supply $V_{DD}$ and the common source node A. All the PMOS transistors ($P_{d1}, P_{d2}, \ldots, P_{dn}$) of the drivers have, in this example, the same size $W_d$ and a common source (node A). The number of drivers $n$ can be between a few hundreds and a few thousands. The MOS transistors in the drivers have low $|V_{Td}|$ (e.g., 0.1 V). The PMOS transistor $P_c$ has a threshold voltage $|V_{Tc}|$ slightly higher than $|V_{Td}|$ (e.g., 0.2 ~ 0.4 V).
In active mode, the input S is low and the transistor $P_c$ is ON. Among the drivers, only one circuit is ON. In order that the PMOS transistor $P_c$ does not affect the drive current of the drivers, its size $W_c$ should be larger than $W_d$, depending on the capacitance of the common source, which is huge for high $n$. In standby mode, the input S is high and the PMOS transistor $P_c$ is OFF. The inputs of all drivers are set to high ($V_{DD}$). Without the PMOS transistor $P_c$, the total subthreshold current would be $n$ times the current of each driver, which makes this current very high. Hence $P_c$ reduces and limits the subthreshold current. The voltage of the common source node A is reduced by an amount $\Delta V_{SRB}$ (a few hundreds of mV). This causes the PMOS transistors of all drivers to have a self-reverse-biasing gate-source voltage, which drastically reduces the subthreshold current. The time needed for the node to stabilize to $V_{DD} - \Delta V_{SRB}$ (or the time needed to switch from the active to the standby mode) is called the evolution time and can be very high (order of 1 ms) compared to the delay of the driver. The reason is that only the leakage and subthreshold currents discharge the node A in this mode. This time can be insignificant for low-power operation if the standby mode time is large enough, as in the case of many low-power applications. When the input S is turned low (active mode), the time needed for the common source A to recover (reach almost $V_{DD}$) is very short and can be lower than the delay time. Hence, it does not interrupt the start of normal operation.
Let us now derive the subthreshold current expressions before and after reduction by the SRB technique. The total subthreshold current without the self-reverse-biasing technique is given by
$I_{sub1} = n \, I_0 \frac{W_d}{W_0} \exp\left(\frac{-|V_{Td}|}{S/\ln 10}\right)$ (4.143)
With the transistor $P_c$, the subthreshold current is given by
$I_{sub2} = I_0 \frac{W_c}{W_0} \exp\left(\frac{-|V_{Tc}|}{S/\ln 10}\right)$ (4.144)
We assume that the devices have the same $I_0$, $W_0$ and $S$. By dividing the current equations (4.143) and (4.144), we have, for the subthreshold current, a reduction factor
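Dividing (4.143) by (4.144) with common $I_0$, $W_0$ and $S$ gives a ratio of the form $n\,(W_d/W_c)\exp\big((|V_{Tc}| - |V_{Td}|)/(S/\ln 10)\big)$. The sketch below evaluates both currents and their ratio numerically. All numerical values in the example (device count, widths, thresholds, subthreshold swing) are illustrative assumptions, and the function names are hypothetical.

```python
import math

def subthreshold_current(i0, width, w0, vt_abs, swing_v_per_decade, count=1):
    """Subthreshold current of `count` identical OFF devices, in the form of (4.143)/(4.144)."""
    return count * i0 * (width / w0) * math.exp(-vt_abs / (swing_v_per_decade / math.log(10)))

def reduction_factor(n, w_d, w_c, vtd_abs, vtc_abs, swing_v_per_decade):
    """Ratio I_sub1 / I_sub2; I0 and W0 cancel, so they are set to 1 here."""
    i_sub1 = subthreshold_current(1.0, w_d, 1.0, vtd_abs, swing_v_per_decade, count=n)
    i_sub2 = subthreshold_current(1.0, w_c, 1.0, vtc_abs, swing_v_per_decade)
    return i_sub1 / i_sub2

if __name__ == "__main__":
    # Assumed values: 1000 drivers, W_c = 10 W_d, |V_Td| = 0.1 V, |V_Tc| = 0.3 V, S = 100 mV/decade.
    print(f"reduction factor = {reduction_factor(1000, 1.0, 10.0, 0.1, 0.3, 0.1):.3g}")
```

Each 100 mV of extra threshold voltage on $P_c$ buys one decade of reduction at this assumed swing, while a wider $P_c$ gives part of it back.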
4.10.1.2 Multi-VT Technique
This technique is similar to the one discussed above, but it can be applied to any CMOS logic [54, 56]. The basic idea is shown in the example of the NAND gate of Fig. 4.112. Here the MOS transistors P and N have high $V_T$ (e.g., 0.6 V extrapolated) for 1 V power supply applications. The logic gate itself has MOSFETs with low $V_T$ (around 0.3 V). The signal SL is used to switch the gate between active and sleep (standby) mode. The virtual supply lines $V_{DDV}$ and $V_{SSV}$ are common to many gates. We call this logic multi-threshold CMOS logic (MT-CMOS).
In the active mode, the signal SL is low, P and N are ON, so the virtual supply lines $V_{DDV}$ and $V_{SSV}$ can be set to almost $V_{DD}$ and ground, respectively. Hence, the low-$V_T$ logic can switch efficiently, but care should be taken in the sizing of the P/N devices compared to the logic. Fig. 4.113 shows the effect of sizing the high-$V_T$ devices on the delay of the gate. The width of P/N should be at least 10 times larger than that of the logic cells. This condition depends greatly on the parasitic capacitances of the virtual supply lines, $C_1$ and $C_2$ [see Fig. 4.112]. If $C_1$ and $C_2$ are large, then the width of the P and N transistors can be reduced, because these capacitances tend to suppress the bouncing of $V_{DDV}$ and $V_{SSV}$ and hence improve the speed. The high-$V_T$ MOSFETs can be common to several logic gates (e.g., 10).
In the standby (sleep) mode, the signal SL is high, so P and N are OFF. Hence, the subthreshold current is limited by that of these high-$V_T$ devices. In this case, the static power dissipation is dramatically reduced in the sleep mode. The subthreshold reduction factor can be deduced using the analysis presented in the previous section. One problem associated with this MT logic is that the evolution and recovery times can be large.
The measured delay, as a function of the supply voltage for a 2-input NAND gate with FO = 3 and a wiring load of 1 mm (0.25 pF), is shown in Fig. 4.114. The technology is 0.5 µm CMOS with low $V_{Tn}$ = 0.25 V, low $V_{Tp}$ = −0.35 V, high $V_{Tn}$ = 0.55 V and high $V_{Tp}$ = −0.65 V. The MT-CMOS logic has almost the same speed as the full low-$V_T$ logic. The logic delay time is reduced by 70% at 1 V as compared with that of the high-$V_T$ one.
To hold the level of the output during the sleep mode, a level holder is necessary, as shown in Fig. 4.115. It consists only of cross-coupled inverters with high-$V_T$ devices powered from the power supply $V_{DD}$.
Low-$V_T$ devices are not the only source of static power dissipation. Several other issues contribute to the static power increase. These are some circuit design guidelines to reduce the static power dissipation:
where $\alpha$ is the gate activity, $V_s$ is the voltage swing, $C_L$ is the load and parasitic capacitance and $f$ is the operating frequency of the system. Equation (4.146) demonstrates that there are several ways to reduce $P_d$:
These are some guidelines for the design of low-dynamic-power circuits:
• Choose the technology that has low junction and oxide capacitances for the same performance.
• Avoid, if possible, the use of the dynamic logic design style.
• For any logic design, reduce the switching activity, by logic reordering
where $V$ is the power supply voltage as shown in Fig. 4.116(a). Half of the energy is dissipated by the resistor of the pull-up PMOS device during the charging phase. A similar argument applies to the discharge resistor of the pull-down NMOS transistor. This analysis is valid even if a step power supply voltage, $V$, is applied to the network. From Fig. 4.116(b), the voltage drop across the resistor, $R_p$, varies from $V$ (supply voltage) to zero. Hence, the energy dissipated by $R_p$ is given by
$E_R = \int V_R \, dQ = \int V_R \, C_L \, d(V - V_R)$ (4.148)
then
$E_R = \frac{1}{2} C_L V^2$ (4.149)
$E_R = C_L V \bar{V}$ (4.150)
where $\bar{V}$ is the average voltage drop across the resistor of the pull-up PMOS. If the power supply voltage has two half steps, as shown in Fig. 4.116(c), the energy dissipated by the resistor is
$E_R = \frac{1}{4} C_L V^2$ (4.151)
So less energy is dissipated by the resistor when the average voltage is reduced, while keeping the swing and load capacitance constant. This is the principle of Adiabatic Switching [61, 62, 63].
For a multi-step power supply voltage, as shown in Fig. 4.116(d), the total energy dissipated is given by [61]
$E = \frac{C_L V^2}{N} = \frac{E_{conventional}}{N}$ (4.152)
and the one dissipated by the resistor is
$E_R = \frac{1}{2} C_L \frac{V^2}{N}$ (4.153)
where $N$ is the number of uniformly distributed voltage steps. Fig. 4.117 shows an example of a driver with uniformly distributed supplies which are switched in successively. The voltage $V_i$ is given by
Another variation is to use a supply voltage with a ramp form [62]. In this case, the energy is drastically reduced if a long time period is used. For the inverter, for example, pulsed power supplies (PPS) are applied to the circuit.
The adiabatic computing becomes attractive only when the delay is not critical, because in that technique the energy is traded for delay. The energy-delay product of the adiabatic circuit is much worse than that of the conventional CMOS gates [64].
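The step-wise charging results (4.149), (4.151) and (4.153) can be checked by summing the loss of each step: every step of height $V/N$ dissipates $\frac{1}{2} C_L (V/N)^2$ in the resistor, independent of the resistance value. The sketch below performs that sum; the capacitance and voltage in the example are illustrative assumptions.

```python
def stepwise_resistor_energy(c_load_f: float, v_supply_v: float, n_steps: int) -> float:
    """Energy lost in the charging resistor when the supply rises in n_steps equal steps."""
    step = v_supply_v / n_steps
    # Each full settling step of height `step` dissipates 0.5 * C * step^2.
    return sum(0.5 * c_load_f * step ** 2 for _ in range(n_steps))

if __name__ == "__main__":
    c_load, v_supply = 1e-12, 3.3   # assumed example: 1 pF load, 3.3 V swing
    for n in (1, 2, 4, 8):
        e = stepwise_resistor_energy(c_load, v_supply, n)
        print(f"N = {n}: E_R = {e * 1e12:.3f} pJ")
```

The loss falls as 1/N, which is the energy-for-delay trade described above, since each added step needs its own settling time.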
[12] M. I. Elmasry, "Digital MOS Integrated Circuits I," IEEE Press Book, 1981.
[13] R. H. Krambeck, C. M. Lee and H-F S. Law, "High-Speed Compact Circuits with CMOS," IEEE J. Solid-State Circuits, vol. 17, no. 3, pp. 614-619, June 1982.
[14] V. Friedman and S. Liu, "Dynamic Logic CMOS Circuits," IEEE J. Solid-State Circuits, vol. 19, no. 2, pp. 263-266, April 1984.
[16] C. M. Lee and E. W. Szeto, "Zipper CMOS," IEEE Circuits and Devices Mag., vol. 2, no. 3, pp. 10-17, May 1986.
[17] N. Weste and K. Eshraghian, "Principles of CMOS VLSI Design: A Systems Perspective," Addison-Wesley, Reading, MA, 1985.
[18] F. Lu and H. Samueli, "A 200-MHz CMOS Pipelined Multiplier-Accumulator Using a Quasi-Domino Dynamic Full-Adder Cell Design," IEEE J. Solid-State Circuits, vol. 28, no. 2, pp. 123-132, February 1993.
[19] J. Yuan and C. Svensson, "High-Speed CMOS Circuit Technique," IEEE J. Solid-State Circuits, vol. 24, no. 1, pp. 62-71, February 1989.
[20] M. Afghahi and C. Svensson, "A Unified Single-Phase Clocking Scheme for VLSI Systems," IEEE J. Solid-State Circuits, vol. 25, no. 1, pp. 225-233, February 1990.
[21] D. W. Dobberpuhl et al., "A 200-MHz 64-b Dual-Issue CMOS Microprocessor," IEEE J. Solid-State Circuits, vol. 27, no. 11, pp. 1555-1567, November 1992.
[22] H. B. Bakoglu, "Circuits, Interconnects, and Packaging for VLSI," Addison-Wesley, Reading, MA, 1990.
[23] K. Yano et al., "A 3.8-ns CMOS 16x16 Multiplier Using Complementary Pass-Transistor Logic," IEEE J. Solid-State Circuits, vol. SC-25, no. 2, pp. 388-394, April 1990.
[24] M. Suzuki et al., "A 1.5-ns 32-b CMOS ALU in Double Pass-Transistor Logic," IEEE J. Solid-State Circuits, vol. SC-28, no. 11, pp. 1145-1151, November 1993.
[37] H. Kojima, S. Tanaka, and K. Sasaki, "Half-Swing Clocking Scheme for 75% Power Saving in Clocking Circuitry," Symposium on VLSI Circuits, Tech. Dig., Honolulu, pp. 23-24, June 1994.
[38] J. S. Caravella and J. H. Quigley, "Three Volt to Five Volt Interface Circuit with Device Leakage Limited DC Power Dissipation," in IEEE ASIC Intern. Conf. and Exhibit, Rochester, NY, pp. 448-451, September 1993.
[39] M. Shoji, "CMOS Digital Circuit Technology," Prentice Hall Inc., Englewood Cliffs, NJ, 1988.
[40] F. Abu-Nofal et al., "A Three-Million Transistor Microprocessor," in IEEE International Solid-State Circuits Conf., pp. 108-109, February 1992.
[41] T. Gabara and D. Thompson, "Ground Bounce Control in CMOS Integrated Circuits," in IEEE International Solid-State Circuits Conf., pp. 88-89, February 1988.
[42] T. Gabara, "Ground Bounce Control and Improved Latch-up Suppression Through Substrate Conduction," IEEE J. Solid-State Circuits, vol. 23, no. 5, pp. 1224-1232, October 1988.
[43] M. Hashimoto and O-K Kwon, "Low dI/dt Noise and Reflection Free CMOS Signal Driver," in IEEE Custom Integrated Circuits Conf., Tech. Dig., pp. 14.4.1-14.4.4, 1989.
[44] T. Wada, M. Eino and K. Anami, "Simple Noise Model and Low-Noise Data-Output Buffer for Ultra-High-Speed Memories," IEEE J. Solid-State Circuits, vol. 25, no. 6, pp. 1586-1588, December 1990.
[45] R. Senthinathan and J. L. Prince, "Application Specific CMOS Output Driver Circuit Design Techniques to Reduce Simultaneous Switching Noise," IEEE J. Solid-State Circuits, vol. 28, no. 12, pp. 1383-1388, December 1993.
[46] T. Knight and A. Krymm, "A Self-Terminating Low-Voltage-Swing CMOS Output Driver," IEEE J. Solid-State Circuits, vol. 23, no. 2, pp. 457-464, April 1988.
[47] H-J Schumacher, J. Dikken and E. Seevinck, "CMOS Subnanosecond True-ECL Output Buffer," IEEE J. Solid-State Circuits, vol. 25, no. 1, pp. 150-154, February 1990.
[48] M. Pedersen and P. Metz, "A CMOS to 100K ECL Interface Circuit," in IEEE International Solid-State Circuits Conf., Tech. Dig., pp. 226-227, February 1989.
The CMOS inverter of Fig. 5.1 suffers from limited current drive when the load capacitance is large. To increase the drive capability of CMOS, a bipolar driver can be added at the output of the CMOS inverter. Fig. 5.2 shows one possible configuration to construct what is called a conventional BiCMOS inverter. The addition of the bipolar driver stage to the basic CMOS inverter is responsible for the high current driving capability of BiCMOS over CMOS. As a result, BiCMOS offers lower delay compared to that of CMOS, especially at high loading capacitance.
The operation of this gate is straightforward. When the input is low, the PMOS P is ON and its drain current turns the transistor $Q_1$ ON. The collector current of $Q_1$ charges the output load capacitance. As the output reaches $V_{DD} - V_{BE,on}$, where $V_{BE,on}$ is the turn-on voltage of the bipolar transistor and is about 0.7 V, $Q_1$ gradually turns OFF. During this period, the NMOS transistor $N_{d2}$ is ON. Since $N_{d2}$ is conducting, $Q_2$ is in the cutoff region. Transistor $N_{d2}$ can also be controlled by the output node. However, using the base node results in faster operation because the base of $Q_1$ is pulled up faster than the output node and because the voltage level of the base node is larger. If the input is high, the NMOS transistors N and $N_{d1}$ are ON. $Q_1$ is OFF while $Q_2$ turns ON to discharge the output node. As a result, the load capacitance is pulled down. As the output $V_o$ reaches $V_{BE,on}$, transistor $Q_2$ turns OFF and the output stays at this level. The conventional BiCMOS gate provides high drive capability, zero static power dissipation and high input impedance. More discussions on this gate are given in the following sections.
Low- Voltage VLSI BiCMOS Circuit Design 259
Figure 5.2 Conventional BiCMOS inverter.
5.1.1 DC Characteristics
As the input voltage increases again, the base of $Q_2$ follows the voltage of the output since N is ON. When the input voltage reaches $V_{DD}$, the PMOS P is OFF. The discharge device, $N_{d1}$, is ON and the base of $Q_1$ is at zero. Also, the output is completely discharged and N is ON. Then, the base of $Q_2$ is at zero. In this case, the output voltage is zero and both base-emitter voltages are zero.
[Figure: waveforms versus time (ns), panel (b)]
analysis of the pull-up section. Then we show the difference in the case of the pull-down section. We assume a step input.
Fig. 5.4 shows the transient behavior of the BiCMOS inverter of Fig. 5.2. When the input falls to ground, transistor P turns ON and operates initially in the saturation region. Its drain current charges the parasitic capacitances at the base, and when $V_{BE,Q_1} = V_{BE,on}$, $Q_1$ turns ON. The emitter current increases in a relatively short time to its peak to charge the output load $C_L$, as shown in Fig. 5.4(b). The output voltage is pulled up following the base voltage of $Q_1$, as shown in Fig. 5.4(a). As the base of $Q_1$ exceeds $V_{Tn}$, $N_{d2}$ turns ON to discharge the base of $Q_2$ to ground. But due to capacitive coupling, the base voltage of $Q_2$ tends to be pulled up. When the base voltage is higher than $V_{DD} - V_{DS,sat}$, where $V_{DS,sat}$ is the saturation voltage of P, the PMOS transistor P enters the linear region and the drain (base) current drops gradually. Consequently, the emitter current of $Q_1$ starts falling. As the output voltage $V_o$ approaches the theoretical limit of $V_{DD} - V_{BE,on}$, $Q_1$ is expected to turn gradually OFF. However, due to the capacitive coupling between the base and the output node, $V_o$ exceeds this limit as shown in Fig. 5.4(a). The same reasoning can be applied when the input rises to $V_{DD}$.
A simple delay analysis is carried out in this section. The reader can refer to [4, 5, 6] for other detailed models. We take into account the parasitic capacitances and the bipolar high-current effects. We do not take into account the parasitic resistances since they have no appreciable effect with advanced bipolar technology. This model is based on the model of [7].
Fig. 5.5 illustrates the transient equivalent circuit of the pull-up section (Fig. 5.2) of the conventional BiCMOS gate driving a load capacitance $C_L$. As we are interested in the 50% rise time, the PMOS current can be modeled by the saturation current of the device. This current is given by Equation (3.82) in Chapter 3
$I_{DS,sat} = K_p C_{ox} v_{sat,p} W_p (|V_{GS}| - |V_{Tp}|)$ (5.1)
where $V_{GS}$ is equal to $(V_{iL} - V_{DD})$, where $V_{iL}$ is the low level of the input. The capacitance $C_{p1}$ accounts for the parasitic capacitances of the MOS devices P, $N_{d1}$ and $N_{d2}$ at the base of the pull-up bipolar transistor. Therefore, it is given by
$C_{p1} = C_{d,P} + C_{d,N_{d1}} + C_{g,N_{d2}}$ (5.2)
where $C_{d,P}$ and $C_{d,N_{d1}}$ are the drain junction capacitances of P and $N_{d1}$, and $C_{g,N_{d2}}$ is the gate oxide capacitance of $N_{d2}$. The overlap capacitances of P and $N_{d1}$ are assumed negligible. The bipolar parasitic capacitance $C_{p2}$ of Fig. 5.5(a) is given by
$C_{p2} = C_{C,Q_1} + C_{E,Q_1}$ (5.3)
The total load capacitance, $C_o$, shown in Fig. 5.5(b), is given by
$C_o = C_L + C_{S,Q_2} + C_{C,Q_2}$ (5.4)
where $C_L$ is the external load capacitance, $C_{S,Q_2}$ is the average collector-substrate capacitance of $Q_2$ and $C_{C,Q_2}$ is the average base-collector capacitance of $Q_2$. Recall from Section 3.5.3 that the base-emitter diffusion capacitance is given by
$C_D = \tau_f \frac{dI_{C,Q_1}}{dV_{BE}}$ (5.5)
where $\tau_f$ is the forward transit time subject to high-level effects.
1. The first component, $t_1$, is defined as the time required to turn $Q_1$ ON. The model of Fig. 5.5(a) can be used in this case. Writing the current equation at the base node of $Q_1$, we have
(5.11)
If we assume that the collector current of $Q_1$ is constant during this time [see Fig. 5.4], and that the mid-point of the output is $V_{DD}/2$, then we have
(5.12)
$t = t_1 + t_2 + t_3$ (5.13)
The first delay is associated with the parasitics at the base, the second one with the forward transit time, and the last one is a function of the load capacitance. For small loads, $t_1$ and $t_2$ dominate. However, for large output loads, the third delay term, $t_3$, dominates.
The expression of the pull-down time is similar to that of the pull-up time
except for the value of the drain current of the transistor N [see Fig. 5.2]. The
saturation current of this device is given by
I_{DS,sat,n} = K_n C_{ox} v_{sat,n} W_n (V_{GS} - V_{Tn})    (5.14)
The V_{GS} of the NMOS during the switching is affected by the V_{BE} drop, while
that of the PMOS is not. This voltage is given by
The slope of the delay-versus-load characteristic of the BiCMOS gate is smaller
than that of CMOS, since it is equal to V_{DD}/[2(I_{DS,sat} + I_C)]. For a CMOS gate,
the slope is simply V_{DD}/(2 I_{DS,sat}). The saturation current in the CMOS gate is
slightly higher than that of BiCMOS because the CMOS inverter has a slightly
wider PMOS device (see next section). However, the slope of the BiCMOS
inverter is smaller due to the large I_C. Therefore, the BiCMOS gate has a higher
drivability than CMOS.
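As a rough illustration of the load-dependent delay term, the sketch below evaluates the quantity V_DD/(2 I) for a CMOS gate and for a BiCMOS gate. All current values are hypothetical placeholders, not figures from this chapter; the only point is that adding the collector current to the drive current lowers the delay added per unit of load.

```python
def delay_load_slope(vdd, drive_current):
    """Return the delay added per unit load capacitance, in s/F.

    A constant current I moves a load C through V_DD/2 in
    t = C * (V_DD / 2) / I, so the slope dt/dC is V_DD / (2 * I).
    """
    return vdd / (2.0 * drive_current)


VDD = 5.0              # V, the supply used for Table 5.1
I_DS_CMOS = 2.0e-3     # A, hypothetical PMOS saturation current (CMOS gate)
I_DS_BICMOS = 1.8e-3   # A, hypothetical, slightly narrower PMOS (BiCMOS gate)
I_C = 8.0e-3           # A, hypothetical bipolar collector current

cmos_slope = delay_load_slope(VDD, I_DS_CMOS)
bicmos_slope = delay_load_slope(VDD, I_DS_BICMOS + I_C)

# 1 s/F equals 1 ps/pF, so dividing by 1000 gives ns/pF.
print(f"CMOS:   {cmos_slope / 1000:.3f} ns/pF")
print(f"BiCMOS: {bicmos_slope / 1000:.3f} ns/pF")
```

With these assumed currents the BiCMOS gate adds roughly a fifth of the delay per picofarad that the CMOS gate does, which is why it wins at large loads even though its intrinsic delay is higher.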
It is estimated by
The first term is due to the total parasitic capacitance at the base node of
Q1, where the swing is V_{DD}. The second term is also due to the parasitic
capacitance at the base node of Q2. The swing at this node is limited to
V_{BE,max} when the collector current reaches its peak. Finally, the third term is
related to the output load capacitance, C_L, and the parasitic capacitance at the
output. The swing is only V_H - V_L, where V_H and V_L are the high level and the
low level of the output, respectively. These levels are affected by the output load.
For small loads the power of BiCMOS is greater than that of CMOS, while
for large loads they have almost the same dynamic power. Table 5.1 shows
the simulation results of the power dissipation for both gates at 5 V power
supply. At a fanout of 1, CMOS consumes much lower power than BiCMOS
and it is faster. However, at a fanout of 10, the BiCMOS is faster (37.5% delay
reduction) and it dissipates only 24% more power than CMOS.
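The two fanout-of-10 figures quoted above can be combined into a relative power-delay product. The short sketch below does only that arithmetic; it uses no data beyond the 37.5% and 24% values in the text.

```python
delay_reduction = 0.375   # BiCMOS delay is 37.5% lower than CMOS at fanout 10
power_increase = 0.24     # BiCMOS dissipates 24% more power than CMOS

delay_ratio = 1.0 - delay_reduction     # BiCMOS delay / CMOS delay
power_ratio = 1.0 + power_increase      # BiCMOS power / CMOS power
pdp_ratio = delay_ratio * power_ratio   # relative power-delay product

print(f"relative power-delay product: {pdp_ratio:.3f}")
```

A ratio below 1 means that, at this fanout, the BiCMOS gate has the lower power-delay product despite its higher power.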
When a BiCMOS gate is driving another BiCMOS or a CMOS gate, the driven
gate exhibits a DC power dissipation. This DC current is not acceptable,
particularly when the circuit is in standby mode. This is due to the reduced
swing at the output of the first gate. Fig. 5.8 shows an example of a BiCMOS
gate driving a CMOS gate. If, for example, the output of the first gate (BiCMOS)
is at V_{BE}, the V_{GS} of the driven NMOS would be higher than zero and around
V_T, resulting in appreciable DC power. Furthermore, the drive current of
the driven gate would be reduced, particularly at low power supply voltage.
Another disadvantage of the reduced swing is the noise margin reduction.
N and N_s. When V_o falls below V_{BE}, Q2 ceases to sink current from the load
capacitance. Then the output is discharged to ground only through the
MOS transistors N and N_s. The final charging and discharging phases occur
through the shunting devices. Hence, these phases can be slow because the
MOS shunting devices have low drive capabilities. When this FS BiCMOS
gate is operating at high frequency, the output swing can be reduced.
Another drawback of this circuit is that part of the current supplied by P (N) is
wasted through the shunting transistors, which weakens the bipolar drive. The
shunting transistors P_s and N_s can be minimum size.
The problem of the base drive inherent in the "FS type" BiCMOS gate can
be overcome by using feedback (FB) from the output through an inverter, as
shown in Fig. 5.9(b). This circuit is called "FB type" [9]. During the pull-up
transition, the shunting device P_s is initially OFF and the PMOS transistor
P supplies all its current to the base of Q1. When V_o is approaching its high
level, the inverter I turns ON P_s, which itself charges the output node to V_{DD}.
The pull-down transition can be explained similarly. The shunting devices P_s
and N_s and the inverter I can be sized properly to achieve greater speed than
the other configurations, even the conventional BiCMOS gate.
Figure 5.9 Full-swing BiCMOS gate types: (a) "FS type"; (b) "FB type";
(c) "CE shunting type".
operate at 2 V power supply. The BiCMOS outperforms CMOS, but at 3 V and
sub-3 V it loses its superior performance.
The limit of operation of the conventional BiCMOS gate with the power supply
- Technology/process complexity;
- Circuit complexity by using less device count;
- Area occupied by the gate; and
- Power dissipation.
Fig. 5.11 shows the BiNMOS family of BiCMOS circuits. The basic circuit
technique used in BiNMOS [10] is the use of the NPN bipolar transistor only
in the pull-up section of the output stage [Fig. 5.11(a)]. The pull-down
section is kept as CMOS. In CMOS circuits, the PMOS transistor is two-to-three
times slower than an NMOS transistor when the same sizes are compared. In the
BiNMOS circuit, the use of the PMOS, with the bipolar driver in the pull-up
section, will balance the unsymmetrical response of CMOS.
In the basic circuit of Fig. 5.11(a), the output reaches only the V_{DD} - V_{BE} level.
This increases the delay and power dissipation of the subsequent gates. If a
resistor (in this case the gate is called BiRNMOS) or a grounded-gate PMOS
transistor is inserted between the emitter and the base of the pull-up bipolar
transistor, the output achieves full swing. However, this will degrade the speed
of the gate because the base current is bypassed by the inserted element and
hence is reduced.
Many alternatives have been proposed, such as BiPNMOS [11] and PBiNMOS
[12], to realize full-swing output. The BiPNMOS is shown in Fig. 5.11(c). A
small-size PMOS transistor and an inverter are added to the basic BiNMOS
gate. The PMOS device realizes full-swing output when the output changes
from low to high. The added PMOS turns ON only when the output reaches
the threshold voltage of the feedback inverter. Hence, the base current supplied
by the pull-up PMOS transistor is not affected by this added PMOS transistor.
Consequently, the BiPNMOS gate has higher performance than conventional
BiNMOS and BiRNMOS. One drawback of the BiPNMOS is the increased
output load capacitance due to the inverter I.
The PBiNMOS gate configuration, shown in Fig. 5.11(d), uses a small-size
PMOS device in parallel with the bipolar pull-up transistor to realize full-swing
output. This configuration results in better performance compared to the other
circuit structures but slightly increases the input capacitance of the gate. In
this section, we show that a properly optimized PBiNMOS gate is faster than
CMOS, even at low power supply and load.
Finding the proper sizing of the input MOSFETs P and N (W_p and W_n,
respectively) is not trivial. The sizing of N_a and P_s [see Fig. 5.11(d)] is not
critical. For typical applications, it is enough to use near-minimum-size devices.
When the delay of the PBiNMOS is plotted versus the width of one of the
devices P or N, for different fanouts, a common optimum width exists, as shown
in Fig. 5.12(a), with a flattened region. This optimum is due to the fact that
when increasing the size, the drivability of the gate increases. However, the
equivalent output load also increases. Then, at a certain size, an optimum delay
exists. From this figure, the optimum W_p is 9 µm and W_n = 11 µm (particularly
for low fanout). Note that in Fig. 5.12(a), we have chosen W_p = 0.8 W_n. This
is explained in more detail below.
When the BiNMOS inverter is used as a driver of a fixed load (e.g., a bus), instead
of driving gates, then we should consider the delay of the driver, including
the delay of the stage that drives it. In Fig. 5.12(b), the total delay of the
PBiNMOS driver and the CMOS inverter that drives it is plotted for two fixed
loads: 0.2 pF and 0.5 pF. The CMOS stage has a minimum size. The minimum
delay is around the point determined previously for the fanout case.
The choice of the emitter area in this gate depends on the technology and the
load. For the 0.8 µm BiCMOS at 3.3 V power supply voltage, it was found that
using the minimum emitter area (A_E × 1 = 0.8 × 4 µm) gives the minimum
delay for the range of loads ≤ 1 pF.
Fig. 5.13 shows that the optimal W_p/W_n ratio is the same for different fanouts
and is equal to 0.8. This point also gives almost symmetrical fall/rise delays. So
even if the fanout is unknown, the optimum gate is fixed and the sizes depend
only on the device parameters. This result is very important for standard cells
and gate arrays where the cells are designed with unknown loads.
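The sizing rule above can be checked with a few lines of arithmetic. The sketch below splits a total input width into W_p and W_n for a given ratio; the 20 µm total is taken from the W_p + W_n = 20 µm annotation on the ratio plot, and the function name is only illustrative.

```python
def split_widths(total_width_um, wp_over_wn):
    """Split a total input width into (W_p, W_n) for a given W_p/W_n ratio."""
    wn = total_width_um / (1.0 + wp_over_wn)
    wp = total_width_um - wn
    return wp, wn


wp, wn = split_widths(total_width_um=20.0, wp_over_wn=0.8)
print(f"W_p = {wp:.2f} um, W_n = {wn:.2f} um")
```

Rounded to whole micrometres, the result agrees with the optimum of about 9 µm and 11 µm quoted for the delay-versus-width plot.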
[Plot residue for Fig. 5.13: horizontal axis W_p/W_n ratio; V_DD = 3.3 V, W_p + W_n = 20 µm]
[Plot residue: CMOS and PBiNMOS curves; horizontal axis Fanout]
Figure 5.14 Comparison of the CMOS and PBiNMOS delays for the same
input capacitance as a function of the fanout.
Let us compare the power dissipation of the gates for different fanouts. Table 5.2
shows this comparison for small fanouts. The power dissipations of both gates
are comparable and are the same for a fanout > 3. The small-size additional
bipolar in the BiNMOS gate does not result in significant power dissipation
overhead. This result shows that the BiNMOS family is an excellent choice for
low-power and high-speed operation. However, for a fanout of 1-2, CMOS
can still be used.
Figure 5.16 Circuit schematics of: (a) PBiNMOS NOR2; (b) PBiNMOS
NAND2.
One technique to reduce the area penalty of the BJT is to use merged N-well
bipolar and PMOS devices.
[Layout legend: N-well, N-plug, N+ diffusion, P+ diffusion, gate, P-base, metal 1, metal 2, contact, via 1, emitter]
The only disadvantage of BiNMOS is its poor performance for sub-2 V
operation. The small area penalty of BiNMOS is not a problem since for complex
gates the overhead of the bipolar device is minimized.
been widely used in BiCMOS products. However, some of the logic circuits
presented in this section exhibit high performance at low voltage, down to 1 V.
The pull-up section is similar to the one in conventional BiCMOS. The
operation of the pull-down section is as follows. When the input is high, N_a pulls
the base of Q1 down to ground and P2 turns ON. The transistor P2 supplies the
base current to Q2. The bipolar transistor Q2 discharges the load capacitance
to a lower voltage, equal to or less than V_{BE,on}.
Still, this structure suffers from the 2V_{BE} losses. The only improvement in
MBiCMOS, compared to conventional BiCMOS, is the higher drive current of
the pull-down section. If the N-well of the pull-down PMOS transistor is tied
to the V_{DD} rail, its threshold voltage will experience a degradation due to the
body effect during the pull-down transient. As a result, the drivability of the
pull-down PMOS transistor is degraded. A simple solution to eliminate this
problem is to shunt the source and the substrate of the PMOS transistor, P2.
5.3.1.2 Quasi-Complementary BiCMOS
Another variation of the MBiCMOS is called "Quasi-complementary
BiCMOS" [17]. A "quasi-PNP" connection is generated in the pull-down section
of the conventional BiCMOS as shown in Fig. 5.19. It consists of PMOS and
NPN transistors (Fig. 5.19(b)). This configuration resembles the MBiCMOS
gate of Fig. 5.18. The QCBiCMOS has two attractive features. The first one is
that the drain current of the pull-down section does not suffer the V_{BE} losses
as in the case of conventional BiCMOS. The second one is that the pull-down
waveform is steep, due to the good charge retention capability of the bipolar
transistor. The feedback circuit formed by the two cross-coupled inverters, I1
and I2, permits the discharge of the base of the pull-down transistor
immediately after the pull-down transition.
Fig. 5.20 shows the use of a complementary bipolar output stage to form the
basic complementary BiCMOS circuits [18, 19]. The pull-up section is similar
to the conventional BiCMOS. The pull-down section is symmetrical to the
pull-up. The current of the NMOS transistor N does not suffer from the V_{GS}
reduction due to Q2 as in conventional BiCMOS. The static swing varies between
V_{BE} and V_{DD} - V_{BE}. However, as explained in Section 5.1.2, the actual swing
might be larger than the static design. The balanced transconductance of the
PMOS/NPN and NMOS/PNP makes it easier to obtain symmetrical fall and
rise times. Hence this circuit eliminates the degradation of the pull-down delay
with power supply voltage of the conventional BiCMOS.
The gate of Fig. 5.20 can be modified to achieve full-swing operation by using
emitter-base shunting devices. Fig. 5.21(a) shows EF CBiCMOS with the shunting
technique. The shunting MOS transistors of the base-emitters permit
restoration of the full logic level of the output. But still the full swing is achieved
with the two slow MOS devices. Some of the base current can be consumed
by the shunting devices, which weakens the drive of Q1 and Q2. To overcome
this problem, the feedback technique can be used as shown in the circuit of
Fig. 5.21(b). The turn ON of the shunting devices is delayed by the feedback
inverter, I.
only when the operating frequency is low, where the gate can complete its
full-swing operation, and/or when the load capacitance is small [20]. Full-swing
circuits with full bipolar drive are needed. In this section, a CBiCMOS variation
suitable for sub-2 V operation, called Transient Saturation (TS), is presented.
Figure 5.22 Common-emitter CBiCMOS gate.
These two problems have been solved with several implementations [21, 22].
One possible implementation is shown in Fig. 5.23. It is called Transient
Saturation Full-Swing (TS-FS) BiCMOS. This logic uses the principle of CE
CBiCMOS described in Fig. 5.22. When the input falls, we assume that the
output is charged high, so P3 is ON. P2 turns ON and the base of Q1 is
charged through P2 and P3 [Fig. 5.23(b)]. Consequently, Q1 discharges the
output (load). When the output voltage approaches zero, the inverter
turns P3 OFF and N4 ON [Fig. 5.23(c)]. The base voltage of Q1 falls
below V_{BE}, causing it to turn OFF. Although Q1 saturates, this does not
slow the next pull-up transition because the excess minority carriers of Q1
are discharged immediately after the pull-down operation. Thus, the bipolar
transistor saturates transiently. The circuit is symmetrical, hence the operation
of the pull-up section can be explained similarly. The PMOS transistor P3
cuts off the DC current path during the pull-down transition to avoid any static
power dissipation. The small-size output latch, composed of two inverters,
holds the output level because in steady state there is no path between
the output and the supply lines.
Compared to the BiCMOS logic circuits presented so far, TS-FS is faster below
2 V supply when the load is relatively large (~1 pF). At 1.5 V it is twice as
fast as CMOS for large loads. Although this circuit solves the problem of speed
degradation of BiCMOS at 1.5 V power supply, it still has several drawbacks:
process complexity due to the PNP bipolar transistor; large area; relatively
high crossover point with CMOS (~0.4 pF); and it is a noninverting circuit.
Figure 5.23 (a) Circuit configuration of TS-FS BiCMOS; (b) and (c) transient
saturation operation for the pull-down section.
strapping have been proposed to overcome the negative effect of the V_{BE} loss [20].
The full-swing operation is performed by saturating the bipolar transistor of
the pull-up section with a base current pulse, after which the base is isolated
and bootstrapped to a voltage higher than V_{DD}. These Schottky circuits
outperform all existing BiCMOS families in the sub-3 V regime down to 2 V, but they
need a BiCMOS technology with a good integrated Schottky diode. Other
examples of such a technique are the bootstrapped BiCMOS circuits published in
[23, 24, 25]. The main advantage of the bootstrapped circuits is that they can
be realized in a conventional BiCMOS process with CMOS and NPN transistors
only. In this section, we present one bootstrapped circuit which overcomes
many drawbacks of the BiCMOS logic families discussed previously.
When the input is low, the PMOS transistor P_p turns OFF (almost
instantaneously) during the bootstrapping cycle to prevent discharging the
bootstrapped node through reverse current from n1 to V_{DD}. This is achieved
through the use of the pseudo-inverter formed by P_i and N_i. During the
bootstrapping cycle (the input is low), P_i turns ON and the gate of the precharge
transistor P_p is pulled up towards the voltage of n1. Thus, P_p is completely
OFF when the voltage at n1 exceeds V_{DD}. Furthermore, the PMOS transistor
P_d is OFF completely because its gate is driven by the boosted voltage through
P_i.
As a first-order analysis, the minimum value of C_{boot} necessary for the
bootstrapping condition can be obtained as follows. During the precharge cycle,
the charge of the bootstrapped capacitor is V_{DD} C_{boot} and the charge on C_p,
the parasitic capacitance on the node n1, is V_{DD} C_p. The total charge on n1
during the precharge cycle is
Q_{n1} = V_{DD} C_{boot} + V_{DD} C_p    (5.17)
In order for V_o to reach V_{DD}, V_{n1} must reach V_{DD} + V_{BE} (during the
bootstrapping cycle). Thus the charge on C_p is (V_{DD} + V_{BE}) C_p and the
total charge on n1 is
Q'_{n1} = V_{BE} C_{boot} + (V_{DD} + V_{BE}) C_p    (5.18)
Q_b = I_b t_r    (5.20)
where I_b is the average base current of Q1 and t_r is the rise time of the output.
From Equations (5.17-5.20) we find that
As shown in Fig. 5.24 of the BFBiCMOS inverter, the N-well of the PMOS
devices P_p, P_i and P_d is connected to the bootstrapped node n1. This prevents
their source/drain-well junctions from turning ON during the bootstrapping cycle.
Also, it prevents any latch-up which might be caused by the parasitic SCR
when the drain/source-well junctions are forward-biased. The PMOS transistor
P_a also has its well connected to its source. This eliminates the body effect of
the transistor and prevents any leakage during the bootstrapping.
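The first-order bootstrapping analysis above can be turned into a small calculator. The sketch below assumes that the charge removed from node n1 between the precharge state (5.17) and the bootstrapped state (5.18) equals the base charge I_b t_r of (5.20); that balance is an assumption made here to connect the surviving equations, and the operating point is hypothetical.

```python
def min_boot_capacitance(vdd, vbe, c_par, i_base, t_rise):
    """Smallest C_boot (farads) satisfying the assumed charge balance.

    Precharge charge on n1:     V_DD * (C_boot + C_p)                 (5.17)
    Bootstrapped charge on n1:  V_BE * C_boot + (V_DD + V_BE) * C_p   (5.18)
    Assumed balance:            (5.17) - (5.18) = I_b * t_r           (5.20)
    Solving for C_boot gives (I_b * t_r + V_BE * C_p) / (V_DD - V_BE).
    """
    if vdd <= vbe:
        raise ValueError("V_DD must exceed V_BE for bootstrapping to work")
    return (i_base * t_rise + vbe * c_par) / (vdd - vbe)


# Hypothetical operating point, not taken from the text.
c_boot = min_boot_capacitance(vdd=1.5, vbe=0.8, c_par=20e-15,
                              i_base=100e-6, t_rise=0.5e-9)
print(f"C_boot >= {c_boot * 1e15:.1f} fF")
```

Under this assumption the required capacitance grows quickly as V_DD approaches V_BE, since the denominator V_DD - V_BE shrinks.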
(V_{DD} + V_{BE}) to V_{DD} by the PMOS P_j, inverter I2 holds the output level at
V_{DD}. Without this inverter, the output falls down to a level equal to (V_{DD} -
V_{BE}) due to the base-emitter coupling capacitance. The simulated waveforms
of the different voltages are shown in Fig. 5.28.
0.35pm 0 35pm
o a3pm 0 34pm
4.9 mA 24mA
B V. = V n F = 3 3v w = 10 /,m
52 fF 73 fF
30 5l 37 R
28 n 31 R
265 R 280 R
5.29(f)]. The simulations were carried out using a chain of gates. The reported
50% delay times are those of an intermediate gate.
Table 5.4 shows the delay, the average power dissipation and the power-delay
product of the different NAND gates at two supplies: 3.3 and 1.5 V. The
simulation was carried out at a typical load capacitance of 1 pF.
The bootstrapped family consumes more power than CMOS because of the
higher internal node capacitance. However, it provides a high speed of
operation, particularly the BFBiCMOS, which has a factor of 3 speed advantage
compared to CMOS at 1.5 V. Moreover, the delay-power product of the
bootstrapped family is lower than that of CMOS. Notice that at 3.3 V, PBiNMOS
has the lowest delay-power product and less delay than CMOS. BiNMOS at
1.5 V is slower than CMOS and is not reported in the table. These results
also indicate that the use of the bootstrapped BiCMOS/BiNMOS gate would
improve the delay-power product when V_{DD} is scaled down to 1.5 V.
TS-FS
20.0
18.5
26.4 7.6
TS-FS
BS-BiCMOS 962 3.84 3.1
BFBiNMOS 1175 4.60 3.2
686 3.50 4.1
5.3.6 Conclusion
We have demonstrated, throughout the previous sections, that the best family
to use for a fanout higher than 5 is the bootstrapped BiCMOS for the range
of power supply 1 to 3.3 V. However, due to the larger area it occupies, it can be
used mainly in high-speed digital applications. Note that when the load is large,
in the range of 1 pF, the bootstrapped family provides a high speed and
a good delay-power product. One drawback of this family, besides the large
area, is that the bootstrapping is sensitive to the shape of the input voltage.
One practical gate which can be used in several applications, even when the
fanout is low, is the BiNMOS family. It has good performance for 3.3 and 2.5
V power supplies. Also it provides a better delay-power product than CMOS. In the
next section, many digital applications based on the BiNMOS family are outlined.
BiNMOS logic has been used in several microprocessors [26, 27]. In this
application, BiNMOS can be used in critical path delay reduction without
increasing chip area since BiNMOS needs a low fanout to outperform CMOS.
Among the critical paths, we cite
In the microprocessor of [26], the PBiNMOS logic family is used at 3.3 V power
supply. The critical path of the control unit is reduced by 36% over CMOS. The
BiNMOS gates keep their speed advantage even in the worst case (V_{DD} = 2.7 V
and T = 125 °C).
BiCMOS logic is not limited to conventional gates; many other logic styles
can be devised. One such example is the pass-transistor BiNMOS used in the
design of a 64-bit adder [28], similar to the CMOS CPL logic family discussed
in Chapter 4. Fig. 5.30 shows an exclusive OR/NOR gate using the
pass-transistor BiNMOS gate (abbreviated PT-BiCMOS) with double rail. The
outputs of the pass-transistor network are connected to the bases of the bipolar
transistors Q1 and Q2 to reduce the intrinsic delay. The PMOS transistors P1
and P3 are cross-coupled to restore the high level of the pass logic to full
V_{DD}. The PMOS transistors P2 and P4 charge the output to full swing.
These transistors are subject to body effect, hence they turn ON later during
transitions.
Fig. 5.31(a) compares the delay of exclusive OR and NOR gates using
PT-BiCMOS, TG-type CMOS, and CPL-type CMOS in a 0.5 µm BiCMOS
process at 3.3 V power supply voltage. A fanout of 1 is equivalent to a capacitance
of 35 fF. The PT-BiCMOS gate is faster than the CMOS gates for any fanout.
The power-delay product is also shown in Fig. 5.31(b). The TG gate has the
best delay-power product for a fanout lower than 3. However, for a fanout
greater than 3, the PT-BiCMOS gate is better.
This PT-BiCMOS has been used in the design of a 64-bit adder [28]. It is used
mainly in the P, sum and carry blocks. A delay time of 3.5 ns was obtained for
the 64-bit adder at 3.3 V, which is 25% better than the CMOS version. The
area and power dissipation penalties of the PT-BiCMOS adder, compared to
the CMOS, were 13% and 14%, respectively. The speed advantage is kept down
to almost 2 V.
- BiNMOS is used in blocks such as SRAM, ROM (Read Only
Memory), ALU (Arithmetic Logic Unit), multiplier, and clock driver.
Fig. 5.33 shows a block diagram of a DSP [41]. This architecture can process
any signal processing operation. The BiNMOS inverters are used as clock
buffers to reduce the clock skew at 300 MHz clock frequency. The clock is
distributed to about 1000 registers. A high clock frequency drastically increases
power and reduces the power supply voltage due to the power noise (effect of
high dissipated current). The BiNMOS inverter used in the clock distribution
is the conventional one, which has a high level of V_{DD} - V_{BE}. Hence, the
dynamic power of the clock network is reduced by 17% compared to CMOS
when using BiNMOS.
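The link between the reduced clock swing and the lower dynamic power can be sketched with the usual first-order estimate P = C V_DD V_swing f. The values of V_DD and V_BE below are hypothetical, since the supply of this DSP is not stated in the passage, and the estimate ignores short-circuit current and internal nodes, so it is not expected to reproduce the reported 17% exactly.

```python
def dynamic_power(c_load, vdd, v_swing, freq):
    """First-order dynamic power of a node: P = C * V_DD * V_swing * f."""
    return c_load * vdd * v_swing * freq


def swing_saving(vdd, vbe):
    """Fractional power saving when the high level drops from V_DD to V_DD - V_BE."""
    full = dynamic_power(1.0, vdd, vdd, 1.0)
    reduced = dynamic_power(1.0, vdd, vdd - vbe, 1.0)
    return 1.0 - reduced / full


# Hypothetical supply and base-emitter voltage.
saving = swing_saving(vdd=3.3, vbe=0.7)
print(f"estimated saving: {saving * 100:.1f}%")
```

In this idealized model the saving is simply V_BE/V_DD, so it grows as the supply is scaled down.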
Each of the basic cells is typically made up of a number of transistors which can
be connected to form a two-input NAND or NOR gate or a simple latch. The
only processing step that can be customized is the metallization. The user of
a gate array can implement the system by specifying the required connections
between the devices in each cell and then the connections between the various
cells. This is done automatically using CAD tools. The number of metal levels
used for wiring varies from 2 to 4. The first one or two levels are used for
internal wiring of the cell and the upper levels (e.g., third and fourth) for
wiring between the cells in the horizontal and vertical directions [43].
BiCMOS technology has been used extensively for building gate arrays and
channelless gate arrays (sea-of-gates) [43, 44, 45, 46]. At 3.3 V power supply
voltage, the BiNMOS logic family has been used [10, 11]. In [11], a BiPNMOS logic
gate has been proposed for the channelless gate array. Fig. 5.35 shows a layout
of a BiPNMOS basic cell in 0.5 µm BiCMOS technology. A bipolar transistor
and a small-size MOS transistor are added to the pure CMOS basic cell. These
transistors are not only used to implement BiPNMOS gates but also flip-flops,
memory macros (RAM, ROM, and CAM), etc. A BiPNMOS two-input NAND
gate has 36% delay reduction compared to a similar CMOS gate for a fanout
of 7. The speed advantage is maintained down to 2.5 V.
[Figure residue: layout labels I/O pads, bipolar, resistor, PMOS, NMOS]
[12] H. Hara et al., "0.5-µm 3.3-V BiCMOS Standard Cells with 32-kb Cache
and Ten-Port Register File," IEEE Journal of Solid-State Circuits, vol. 27,
no. 11, pp. 1579-1584, November 1992.
[33] M. Takada et al., "A 5-ns 1-Mb ECL BiCMOS SRAM," IEEE Journal of
Solid-State Circuits, vol. 25, no. 5, pp. 1051-1062, October 1990.
[34] A. Ohba et al., "A 7-ns 1-Mb BiCMOS ECL SRAM with Program-Free
Redundancy," in Symp. VLSI Circuits Conf. Tech. Dig., pp. 41-42, May
1990.
[35] Y. Okajima et al., "A 7-ns 4-Mb BiCMOS SRAM with a Parallel Testing
Circuit," International Solid-State Circuits Conf. Tech. Dig., pp. 54-55,
February 1991.
[36] N. Tamba et al., "A 1.5 ns 256Kb BiCMOS SRAM with 11K 60 ps Logic
Gates," International Solid-State Circuits Conf. Tech. Dig., pp. 246-247,
February 1993.
[37] K. Nakamura et al., "A 200-MHz Pipelined 16-Mb BiCMOS SRAM with
PLL Proportional Self-Timing Generator," IEEE Journal of Solid-State
Circuits, vol. 29, no. 11, pp. 1317-1322, November 1994.
[38] G. Kitsukawa et al., "An Experimental 1-Mb BiCMOS DRAM," IEEE
Journal of Solid-State Circuits, vol. SC-22, no. 5, pp. 657-662, October
1987.
[39] S. Watanabe et al., "BiCMOS Circuit Technology for High Speed
DRAMs," Symposium on VLSI Circuits, Tech. Dig., pp. 79-80, 1987.
[40] G. Kitsukawa et al., "Design of ECL 1-Mb BiCMOS DRAM," Electronics
and Communications in Japan, Part 2, vol. 76, no. 5, pp. 89-102, 1992.
[41] M. Namura et al., "A 300-MHz, 16-bit, 0.5-µm BiCMOS Digital Signal
Processor Core LSI," IEEE Custom Integrated Circuits Conference, Tech.
Dig., pp. 12.6.1-12.6.4, May 1993.
[42] T. Inoue et al., "A 300-MHz 16-bit BiCMOS Video Signal Processor,"
IEEE Journal of Solid-State Circuits, vol. 28, no. 12, pp. 1321-1329,
December 1993.
[43] F. Murabayashi et al., "A 0.5 micron BiCMOS Channelless Gate Array,"
IEEE Custom Integrated Circuits Conference, Tech. Dig., pp. 8.7.1-8.7.4,
May 1989.
[44] H. Hara et al., "A 350 ps 50K 0.8 micron BiCMOS Gate Array with Shared
Bipolar Cell Structure," IEEE Custom Integrated Circuits Conference,
Tech. Dig., pp. 8.5.1-8.5.4, May 1989.
[45] J. D. Gallia et al., "High-Performance BiCMOS 100K-Gate Array," IEEE
Journal of Solid-State Circuits, vol. 25, no. 1, pp. 142-149, February 1990.
[46] T. Hanibuchi et al., "A Bipolar-PMOS Merged Basic Cell for 0.8 micron
BiCMOS Sea of Gates," IEEE Journal of Solid-State Circuits, vol. 26, no.
3, pp. 427-431, March 1991.
6
LOW-POWER CMOS RANDOM
ACCESS MEMORY CIRCUITS
SRAMs have several advantages over Dynamic RAMs (DRAMs) such as:
However, SRAMs have the great disadvantage of a large memory cell compared
to DRAMs. For this reason, their capacities are smaller than those of DRAMs.
A timing diagram during the read cycle is shown in Fig. 6.1(a). During this time
the data stored in a specific SRAM location (defined by the address) is read
out. For a read cycle, two times are shown in the figure: the read cycle time,
t_{RC}, and the address access time, t_{AA}. Fig. 6.1(b) shows the write cycle, which
permits change to the data in an SRAM. Two times are indicated: the write
cycle time, t_{WC}, and the write recovery time, t_{WR}. Some of this information
is used in this chapter. For more detail on the timing, the reader can refer to
any memory data book.
A typical SRAM architecture is shown in Fig. 6.2. The memory array
contains the memory cells, which are readable and writable. The row decoder
(X-decoder) selects 1 out of n rows, while the column decoder (Y-decoder)
selects l out of m columns, where n, l and m are powers of two. The address
(row and column) is not multiplexed as in the case of a DRAM. Sense amplifiers
detect small voltage variations on the complementary memory bit-lines, which
reduces the reading time. The conditioning circuit permits the precharge of the
bit-lines. The access time is determined by the critical path from the address
input to the data output as shown in Fig. 6.3. This path contains the address
input buffer, row decoder, memory cell array, sense amplifier and output buffer
circuits. The word-line decoding and bit-line sensing delay times are critical
delay components. To reduce the sensing time during a read operation, the
swing on the bit-lines should be as small as possible.
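The non-multiplexed addressing described above can be illustrated by splitting one address word into a row index for the X-decoder and a column index for the Y-decoder. The bit assignment in this sketch (row in the upper bits, column in the lower bits) and the function name are assumptions for illustration; a real SRAM may order the bits differently.

```python
def split_address(address, row_bits, col_bits):
    """Split a flat address into (row, column) indices.

    Assumes the row index occupies the upper row_bits and the column
    index the lower col_bits of the address word.
    """
    if not 0 <= address < (1 << (row_bits + col_bits)):
        raise ValueError("address out of range for the given array size")
    row = address >> col_bits                 # selects 1 of 2**row_bits rows
    col = address & ((1 << col_bits) - 1)     # selects within 2**col_bits columns
    return row, col


# Hypothetical array with 32 rows and 8 column groups.
row, col = split_address(0b10110110, row_bits=5, col_bits=3)
print(row, col)
```

Because both fields arrive at once, the row and column decoders can work in parallel, unlike the multiplexed row/column addressing of a DRAM.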
[Figure 6.1 labels: CS (Chip Select), OE (Output Enable), Data Out; CS (Chip Select), WE (Write Enable), t_WR]
[Figure 6.3 labels: address input, address input buffer, row decoder/driver, memory cell]
During the read cycle, the bit-lines are held high (precharged). Assume that
a "0" is stored at node A and a "1" is stored at node B. When the cell is
selected, i.e., WL is set to "1", BL is discharged through N1 and N3.
To write in the cell, one of the bit-lines is pulled low and the other high, and
then the cell is selected by WL. Assume that BL is set to "0" while initially a
"1" is stored at node A ("0" at B). N1 and P1 should be sized such that node
A is pulled down enough to turn P2 ON. This in turn causes node B to be
pulled up. The cross-coupled inverter pair has a high gain to cause the nodes
A and B to switch to opposite voltages. The data retention (standby) current
of this cell can be as low as 10-"A. Although this full-CMOS cell has low
retention current, the cell area is so large that it does not allow high-density
SRAMs. A typical cell area using 0.8 µm design rules is 75 µm².
The stability of the memory cell is its ability to hold a stable state. Fig. 6.5(a)
shows the transfer curves of full CMOS SRAMs. The box between the two
characteristics (I and II) defines the Static Noise Margin (SNM). Static
noise is DC disturbance, such as offsets and mismatches, due to the processing
and variations in process conditions. The SNM is defined as the maximum
value of V_n (static noise source as shown in Fig. 6.5(b)) that can be tolerated
by the cross-coupled inverters before altering state. An important parameter
in SNM is the memory cell ratio, r, defined by
where the two transistors are the access and driver NMOS transistors shown
in Fig. 6.4. An analysis of SNM for memory cells is given in [13]. The static
noise margin increases with the ratio r. However, it is limited by
the cell area constraint. The stability of the cell is maintained even if V_{DD} is
scaled down.
Another memory cell configuration is shown in Fig. 6.6. This cell is similar
to the full CMOS memory cell, except that the PMOS pull-up devices are
replaced by high-resistance polysilicon loads. The memory cell area can be
about 30% to 40% smaller than the CMOS six-transistor memory cell, because
the two polysilicon resistances can be formed on top of the two NMOS driver
transistors. The High Resistive Load (HRL) memory cell has been used in
several SRAM generations from 4 Kb. The high-state storage node of Fig.
6.6 will be pulled down with time due to two kinds of leakage current: the
leakage current of the drain junction and the subthreshold current. The voltage
drop across the resistance R prevents regular cell operation if the leakage
current reaches the level of the poly-Si resistor current. In several SRAM
generations using the HRL memory cell, the total standby current was set to 1 µA
per chip at room temperature for battery-backup applications. Thus, for each
memory generation with quadrupled density, the poly-Si resistance value is also
quadrupled. For a 4 Mb chip which has a total standby current less than 1 µA,
Low-Power CMOS Random Access Memow Cwcuzts 321
I
typical d u e s of &'stance me in the 5 x 1 P 0 range and the resistance current
is limited to 10-laA. This current should be mvch larger than the total leakage
current of the storage node of the cell to improve tho data retention margin.
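
The scaling rule described above (a fixed 1 µA chip budget, with the load resistance quadrupling at each density step) can be sketched numerically. This is only a rough sketch: it assumes that one load resistor per cell conducts with the full supply across it, and it ignores junction and subthreshold leakage. The function name and the 5 V supply are illustrative assumptions, not values taken from the text.

```python
def required_load_resistance(n_cells, vdd=5.0, standby_budget=1e-6):
    """Minimum poly-Si load resistance (ohms) that keeps the total load
    current of the chip within standby_budget (amperes).

    Assumption: one load resistor per cell conducts, with the full supply
    voltage vdd across it, so the chip current is n_cells * vdd / R.
    """
    return n_cells * vdd / standby_budget


MBIT = 2 ** 20
r_1mb = required_load_resistance(1 * MBIT)
r_4mb = required_load_resistance(4 * MBIT)

# Quadrupling the density quadruples the required resistance.
print(f"1 Mb: {r_1mb:.2e} ohm, 4 Mb: {r_4mb:.2e} ohm, ratio {r_4mb / r_1mb:.0f}")
```

Under these assumptions a 4-Mb chip needs a load of a few times 10^13 Ω per cell, which is why the resistor current ends up so close to the cell leakage currents.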
The leakage current cannot be scaled because, first, the subthreshold current
per channel width tends to increase, particularly with the trend to decrease
the threshold voltage for low-voltage operation. Second, the leakage current of
the drain junction per unit area tends to increase with technology scaling.
Moreover, the junction area shrinks at a rate lower than the SRAM density
increase rate. In [14], it was determined that the maximum SRAM capacity for
low-power applications using an HRL memory cell is 4 Mb, where the retention
current is 1 µA.
Note that the high-level node voltages of all poly-Si load memory cells are
(V_DD - V_T) after the write cycle, where V_T is the threshold voltage of the
access transistor, subject to body effect. These nodes need several ms to
charge up to V_DD. The SNM of the poly-Si load memory cell is more sensitive
to the cell ratio r than that of the full CMOS cell [13]. A typical value of r
is 3. Also, the cell stability is drastically degraded when V_DD is 3 V or less.
The transfer curves in the read mode can easily be plotted for different V_DD to
find that the cell cannot store the data below a certain low voltage.
For 4-Mb and higher-density SRAMs, the polysilicon load cell starts to be
replaced by a polysilicon PMOS load called the PMOS Thin Film Transistor (TFT)
for low-power applications [8, 9, 15]. Fig. 6.7 shows a cross section and
circuit diagram of the poly-Si PMOS load memory cell [8]. The TFT device is
fabricated from amorphous silicon (a-Si). This material has a grain size of
2 µm while that of the conventional poly-Si material is 0.03 µm. The thickness
of this a-Si is 100 nm and the gate oxide thickness of the TFT is 40 nm.
This technology results in improved ON/OFF currents compared to the one
using poly-Si. The N+ drain area of the NMOS transistor is used as the gate
electrode for the PMOS TFT. To obtain a small area, the polysilicon PMOS
must be stacked on the NMOS driver. The second polysilicon layer forms the
channel regions. The TFT memory cell area is more than 40% smaller than
the full CMOS one.
Fig. 6.8 shows the drain current of a PMOS TFT used in a 4-Mb SRAM as
a function of the gate voltage. An ON current of more than 10^-7 A is obtained
at a supply voltage of 3 V, while an OFF current of 10^-[?] A is attained. The
ON current is larger by more than six orders of magnitude than memory cell
leakage currents, which is much better than the current of the HRL cell. Thus,
it results in an excellent data retention characteristic. Moreover, the very low
OFF current results in a standby current less than 1 µA for a 4-Mb SRAM. This
current is low enough for battery back-up operation. At a 1.2 V power supply,
the current flowing in the PMOS TFT is more than one-and-a-half orders of
magnitude larger than the OFF current. This demonstrates the suitability of this
technology for low-voltage operation.
After the write cycle, the high storage node voltage in the cell becomes
V_DD - V_T. The time needed for charging up this node to V_DD is

    t_{ch} = \frac{C_s V_T}{I_L}    (6.2)

where I_L is the current flowing in the load device and C_s is the total parasitic
capacitance of the node. Using 4-Mb data for the TFT memory cell, V_T = 1 V,
C_s = 10 fF and I_L = 10 pA, t_ch is around 1 ms. For a poly-Si load this
charge-up time is larger than 100 ms because of its low I_L ≈ 0.1 pA. The
average interval time between two word-line selections (for the same word-line)
is given by

    t_i = \frac{N \, t_{cycle}}{M}    (6.3)

where N is the number of memory cells per SRAM chip, M is the number of
memory cells per word-line, and t_cycle (also denoted t_RC) is the operating cycle
time. For 4 Mb, a typical value of t_i is 4.5 ms when the cycle time is 70 ns.
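
The two timing relations above, t_ch = C_s V_T / I_L (Eq. 6.2) and t_i = N t_cycle / M (Eq. 6.3), are easy to check numerically. The sketch below reuses the values quoted in the text. The number of cells per word-line, M = 64, is an assumed value chosen for illustration because the text does not give it here, and the function names are hypothetical.

```python
def charge_up_time(c_node, v_t, i_load):
    """Eq. (6.2): time for the load current to lift the node by V_T."""
    return c_node * v_t / i_load


def word_line_interval(n_cells, cells_per_word_line, t_cycle):
    """Eq. (6.3): average time between two selections of one word-line."""
    return n_cells * t_cycle / cells_per_word_line


t_tft = charge_up_time(10e-15, 1.0, 10e-12)    # TFT load, 10 pA
t_poly = charge_up_time(10e-15, 1.0, 0.1e-12)  # poly-Si load, 0.1 pA

# M = 64 cells per word-line is an assumed value, not stated in the text.
t_i = word_line_interval(4 * 2 ** 20, 64, 70e-9)

print(f"t_ch (TFT)     = {t_tft * 1e3:.2f} ms")
print(f"t_ch (poly-Si) = {t_poly * 1e3:.0f} ms")
print(f"t_i            = {t_i * 1e3:.2f} ms")
```

With these numbers the TFT cell recovers its full high level well within one word-line interval, whereas the poly-Si load cell does not.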
6.1.3 Read/Write Operation
Fig. 6.9 shows a simplified readout circuitry for an SRAM. The circuit has
static bit-line loads composed of pull-up NMOS devices N1 and N2. The
bit-lines are pulled up to a voltage (V_DD - V_T), where V_T is the threshold
voltage subject to body effect. When the word-line WL is asserted, one word is
selected. At this time, the bit-line BL is pulled down to a level determined by
the pull-up NMOS N1, the word-line transistor N_a, and the driver NMOS
transistor N_d, as shown in Fig. 6.9(b). The voltage at node A should be low
(near ground) so as not to alter the RAM content during this read operation. A
small swing on BL is desirable to achieve high-speed readout, particularly if
C_BL is high. The Sense Amplifier (SA) amplifies the small swing, ΔV, on the
bit-line. Typical values of ΔV and [?] are 100 mV and [?], respectively. It should
be noted that the SA should provide a wide operating margin over all process,
temperature, and voltage corners.
Fig. 6.11(a) shows a simplified circuit configuration for the SRAM write
operation. For a write operation the memory cell state should be flipped. When
the write signal WE is asserted, the input data and its complement are placed
on the bit-lines. If, for example, a zero has to be stored at node A, initially at
V_DD, the voltage at this node should be brought below the threshold voltage of
the cell, as shown in the equivalent circuit of Fig. 6.11(b). The bit-line in this
case is pulled down to almost 0 V. The design of the write circuitry should
provide a wide operating margin over all process, temperature, and voltage
corners. Note that a DC current is consumed during the write mode; hence the WE
signal should also be short, to cut this current at the end of the write operation.
In high-speed SRAMs, write recovery time is an important component of the write
cycle time. It is defined as the time necessary to recover from the write cycle to
the read state after the WE signal is disabled. Note that the swing on the
bit-lines after a write operation is large. Thus, an equalizer circuit is needed to
reduce this swing, so that the read operation is performed quickly.
Note that the power dissipated by the pads is not included. The power dissi-
pation of the components, other than the memory array, depends on the total
capacitances, the opersting frequency and the internal voltage swing. It can
include a DC component with a major contribution from the sense amplifier.
To reduce the active power consumption many techniques can be used and are
summatized 85 follows :
- (HWL) techniques.
- Reducing the DC current by using the pulse operation technique for
  the word-line and the periphery circuits (including the sense amplifier).
- Use of multi-stage static CMOS decoding to reduce the AC current.
- Lowering the operating power supply voltage.
The standby power (sometimes called retention current) of an SRAM has a
major contribution from the memory cells in the array if the sense amplifiers
are disabled in this mode. It is given by
One way to reduce the standby current is to reduce the operating voltage.
However, note that the data-retention current will increase with memory capacity.
Moreover, the leakage current per cell tends to increase because the threshold
voltage is expected to be reduced for low-voltage operation.
In the following sections, many key circuits in an SRAM are reviewed. The
circuit techniques and memory organization to reduce the active and
data-retention currents are presented.
6.1.6 Decoders
There are several ways to build row decoders, and the choice depends on the RAM
architecture division.
The column decoder permits the selection of 1 out of m bits of the accessed row.
Fig. 6.17(a) shows the circuits involved in column selection, using an example
of 4 columns. The selected gate permits the transfer of the data from the
bit-lines to the common data-lines I/O. The signals Y_i are controlled by the
AND/NAND column decoder as shown in Fig. 6.17(b).
The threshold voltage of the load, V_T, is subject to the body effect. A typical
value of this precharge level for a 5 V power supply is 3.5 V. This level is
suitable for voltage-type sense amplifiers to provide large gain and fast sensing.
To reduce the DC current during the write cycle, a variable bit-line load
technique can be employed [Fig. 6.19]. It realizes fast sensing in the read cycle
and a short write pulse width in the write cycle. For fast sensing, the voltage
swing of the bit-line should be small. To achieve this, the load impedance
should be low. On the other hand, to obtain a low current during the write cycle,
the load impedance of the bit-lines should be high. As shown in Fig. 6.19,
during the read operation, all four NMOS transistors N1, N2, N3, and N4 are
turned ON. The bit-lines are switched into a low-impedance state so that the
voltage swing of the bit-lines is limited to a small value (e.g., 100 mV). During
the write operation, one pair of these NMOS devices is switched OFF and only
the pair of small-size transistors remains ON.
As the power supply voltage is scaled down to 3 V, the precharge level can be
lower than 2 V. Thus, during the read operation the high-level node of the memory
cell can become equal to the bit-line voltage. Hence, the noise margin of the
memory cell is drastically degraded and consequently the cell stability and
soft-error immunity are degraded. Therefore, at a 3 V power supply voltage, a
PMOS transistor can be used as the bit-lines' load [Fig. 6.20]. The bit-lines'
precharge voltage is then V_DD. For low-voltage operation with this bit-line
precharge level, special sense amplifiers should be used because conventional
sensing circuits have poor voltage gain (less than 10). A variable-impedance
bit-line load, using PMOS transistors, can also be implemented.
Various kinds of sense amplifiers have been devised for fast sensing operation
and low power dissipation. Fig. 6.21(a) shows a single-end sense amplifier with
an active current mirror. This structure forms the basis for many SRAM
sense amplifier circuits. It has two differential inputs, DL and its complement.
Noise affects both inputs equally, and only the difference is detected. One
NMOS transistor acts as a current source. Before the signal φ_SA is asserted, the
data-lines DL and its complement are high. All the nodes, A, B and C, are high.
The signal φ_SA is asserted when DL starts, for example, to drop slowly. In this
case, the corresponding NMOS transistor is ON. The output voltage (node C) drops
suddenly to a certain voltage. Thus, the input signal is amplified by the gain of
this differential amplifier.
Fig. 6.21(b) shows the voltage waveforms of the single-end sense amplifier
using SPICE simulation. The signal φ_SA is generated with an ATD pulse. It is
asserted for a time long enough to amplify the small variation (a few hundreds of
mV) on the data-lines¹, then it is deactivated. In this scheme the DC current
consumed by the sense amplifier is cut off. Usually the sense amplifier is common
to many columns through the common data-lines. The small-signal gain of this
amplifier is given by

    A = \frac{g_{mn}}{g_o}    (6.8)

where g_mn is the transconductance of the driver NMOS N_d and g_o is the
combined output conductance of the PMOS load and the NMOS driver.
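
As a quick numerical illustration of Eq. (6.8), the sketch below evaluates A = g_mn / g_o, taking the combined output conductance as the sum of the load and driver output conductances (parallel conductances add). The device values are assumed for illustration only and are not taken from the text.

```python
import math


def sense_amp_gain(g_mn, g_o_load, g_o_driver):
    """Small-signal gain A = g_mn / g_o, where g_o is the sum of the
    load and driver output conductances (all values in siemens)."""
    return g_mn / (g_o_load + g_o_driver)


# Illustrative values only (assumed, not from the text).
gain = sense_amp_gain(g_mn=1.0e-3, g_o_load=60e-6, g_o_driver=40e-6)
print(f"A = {gain:.1f} ({20 * math.log10(gain):.1f} dB)")
```

A single stage with a gain of this order explains why several stages are cascaded when a large overall voltage gain is required.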
In many SRAMs multi-stage sense amplifiers are needed to attain a large voltage
gain. In this case, the double-end sense amplifier is used as shown in Fig.
6.22. This circuit has often been used in many SRAMs. To attain high-speed
data sensing, a two- or three-stage sense amplifier technique can be adopted.
Fig. 6.23 shows a two-stage amplifier structure. An equalization technique is
used for the data-lines, using the equalization pulse φ_eq, which is generated
with an ATD pulse. It is indispensable, not only to attain faster data transfer
during the read operation, but also to suppress incorrect data before the correct
data appears in the sense amplifier [17]. For low-power applications, and also
due to the plastic packaging limitations of static memories, this type of sense
amplifier can result in high power dissipation for high-density memories even
if the current source is pulsed.

¹ The output of the sense amplifier is then latched.
Many circuits have been proposed to reduce the power of the sense amplifier
while improving the sensing delay time. One of them is the PMOS
cross-coupled amplifier [18] shown in Fig. 6.24. The PMOS loads, P1 and P2, are
cross-coupled and the differential outputs, S and its complement, are connected
to their gates. The positive feedback in this latch amplifier permits much faster
sensing than the conventional one. In this circuit the equalization technique is
used for the reasons discussed above. Fig. 6.25 shows the sense delays of both the
PMOS cross-coupled amplifier and the double-end current-mirror amplifier as
a function of the average current of the amplifier. The input voltages simulate
the common data-lines' voltages, and the sense delay t_d is defined as the delay
time from the crossover point of the input voltages to the point when the output
reaches a 1 V difference. The PMOS cross-coupled amplifier has less than half the
delay of the conventional current-mirror sense amplifier. Moreover, this latch
amplifier consumes less than one-fifth of the power of a current-mirror amplifier.
The PMOS cross-coupled latch amplifier requires much more accurate timing
of its control pulse to optimize the sensing delay [18]. This circuit also has a
low-power property compared to the current-mirror amplifier since it has nearly
full-swing outputs with positive feedback.
When the power supply is scaled to 3 V, the data-line voltage is near
V_DD, so level shifting can be performed. Fig. 6.26 shows a two-stage
sense amplifier used for a 3.3 V supply. The first stage is a cross-coupled NMOS
amplifier which also performs level shifting of the common data-line voltage.
In the second stage, a conventional sense amplifier is used, which operates at
the maximum-gain point since the levels on SA and its complement are medium
levels.
Fig. 6.27 shows another sense amplifier developed for low-voltage power supply
[IS]. This circuit is used when the bit-lines are close to V_DD, where the gain of
a conventional current-mirror amplifier is poor. The circuit is composed of a
level-shift circuit and a conventional current-mirror amplifier. The level-shifter
shifts the bit-line voltage to a medium voltage, 0.6 to 0.7 V (at a 1 V power
supply voltage), where the gain is maximum. Low-V_T NMOS devices N1 and
N2 are used to provide these medium levels. These devices are subject to the
body effect.
- The latch circuit must not delay the read access time. Such a requirement
  is met by connecting the latch in parallel with the data-bus lines.
  One input transmission gate, controlled by φ_I, is used to enter the data
  into the latch. Another transmission gate, controlled by φ_O, is used to
  put the data back onto the data-bus.
- The latched data must not be destroyed by noise entering the
  SRAM. Noise in an SRAM is generated and propagated by the following
  mechanism. On the system board, ground noise can enter the
  SRAM. When the peak level of the ground noise becomes large enough
  for the first gate of the address buffer to change the logic value of the
  address input, an ATD pulse noise is generated. This noise pulse could
  turn on the word-line and the sense amplifier for a short time, resulting
  in an unexpected signal on the data-bus. Therefore, the latched data
  could be destroyed if the input gate is ON. To avoid such a problem,
  two circuit techniques are included in the circuit of Fig. 6.28. The first
  one is the generation of φ_I only when the pulse width of the ATD is
  large enough compared to that of the noise. The other circuit technique
  is to place latch-protecting inverters [Fig. 6.28] in front of
  the output gates. The inverters prevent noise from entering the output
  gates.
- The new data must be quickly latched into the data-latch. The circuit
  of Fig. 6.28 can be optimized for fast operation.
is reduced, since only the selected columns switch. Moreover, the word-line
selection delay, which is the delay time from the address input to the divided
word-line, is reduced. This delay is composed of the main word-line select delay
and the divided word-line select delay. The main word-line selection delay is
reduced compared to the conventional one, because the total capacitance of
connected transistors is reduced. In a conventional SRAM, the word-line has the
gates of all the memory cells of a row connected to it. The main word-line delay
increases as the number of blocks increases because the number of block select
gates increases. On the other hand, the divided word-line delay decreases as
the number of connected cells is reduced with the increasing number of blocks.
Consequently, the word-line selection delay has a minimum for a certain number
of blocks.
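
The trade-off described above can be made concrete with a toy model. In this sketch the main word-line delay grows linearly with the number of blocks and the divided word-line delay is proportional to the number of cells per block. The linear form and every coefficient are assumptions chosen for illustration; they are not derived from Fig. 6.30 or from any measured data.

```python
def word_line_select_delay(n_blocks, total_columns=256,
                           t_fixed=1.0, t_per_block=0.1, t_per_cell=0.025):
    """Toy model of the DWL select delay in ns (all coefficients assumed).

    The main word-line delay grows with the number of block-select gates;
    the divided word-line delay shrinks with the number of cells per block.
    """
    main_delay = t_fixed + t_per_block * n_blocks
    divided_delay = t_per_cell * total_columns / n_blocks
    return main_delay + divided_delay


candidates = [1, 2, 4, 8, 16, 32]
delays = {b: word_line_select_delay(b) for b in candidates}
best = min(delays, key=delays.get)

for b in candidates:
    print(f"{b:2d} blocks: {delays[b]:.2f} ns")
print("minimum at", best, "blocks")
```

With these assumed coefficients the minimum falls at eight blocks; different coefficients would move the optimum, which is the point of the exercise.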
Fig. 6.30 shows the effect of the number of blocks in the DWL structure on the
word-line select delay and the column power for a 64-Kb SRAM [10]. In this
example, a number of blocks of eight can be chosen. The area penalty for this
case is only 5% compared to the conventional memory. As an example, for a
1-Mb SRAM, the cell array is divided into 16 blocks and each block consists of
512 rows by 128 columns. A 9-bit address (A0...A8) is used to select a row within
a block using a two-stage row decoder. Global block selection is done using a
4-bit address.
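
The 1-Mb organization quoted above can be checked with a few lines of arithmetic. The helper names below are hypothetical, and the column field width assumes that each column of a block is individually addressable, which the text does not state.

```python
def address_bits(count):
    """Number of address bits needed to select one of `count` items."""
    return (count - 1).bit_length()


def describe_array(n_blocks, rows_per_block, columns_per_block):
    """Capacity and address-field widths of a block-divided cell array."""
    return {
        "capacity_bits": n_blocks * rows_per_block * columns_per_block,
        "block_bits": address_bits(n_blocks),
        "row_bits": address_bits(rows_per_block),
        "column_bits": address_bits(columns_per_block),  # assumed addressing
    }


one_mb = describe_array(n_blocks=16, rows_per_block=512, columns_per_block=128)
print(one_mb)
```

The result reproduces the figures in the text: 16 x 512 x 128 cells is exactly 1 Mb, with 9 row-address bits and 4 block-address bits.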
The DWL structure has been widely used in high-density SRAMs for its
low-power, high-speed characteristics. However, in high-density SRAMs with a
capacity of more than 4 Mb, the number of blocks in the DWL structure will
have to increase. Therefore, the capacitance of the global word-line increases,
causing the delay and power to increase. To solve this problem, the concept of
Hierarchical Word Decoding (HWD) was proposed in [21], as shown in Fig.
6.31. The word select line is divided into more than two levels. The number of
levels (hierarchy) is determined by the total load capacitance of the word select
line, so as to distribute it efficiently. Hence, the delay and the power are reduced.
For 4-Mb, three levels of hierarchy have been used with 32 blocks, each block
having 128 columns by 1024 rows. Fig. 6.32 compares the delay time and the total
capacitance of the word decoding path for the optimized DWL
and HWD structures of 256-Kb, 1-Mb, and 4-Mb SRAMs. For a 256-Kb SRAM
there is no significant advantage of HWD over DWL. However, for high-density
SRAMs the advantage of HWD in terms of power and delay becomes clear.
The three-level scheme can be used efficiently for 16-Mb SRAMs.
For a 1 V power supply, a full CMOS memory cell has a lower power dissipation
in standby mode and greater immunity to transient noise and voltage variation
than other cells. It can also operate at the lowest supply voltages. Although
a full CMOS cell operates well at ultralow voltage, its area is almost double
that of the PMOS TFT cell. Hence it is not suitable for high-density memories
(size > 4 Mb).
When the full CMOS memory cell is operated at a 1 V power supply, a typical
cell ratio is 3 for stable operation. The SNM of this cell at 1 V can be almost
the same as that of a poly-Si load memory cell at 5 V. When using the full CMOS
cell, no boosting of the word-line is needed to write a high voltage level into the
cell. However, the PMOS TFT cell requires a boosted voltage (V_wh > V_DD)
on the word-line during the write cycle [19]. If the voltage of the word-line is
raised only to V_DD in the write cycle, the high node B of Fig. 6.33 is initially
at V_DD - V_T, where V_T is the threshold voltage of the access device subject to
the body effect. This low level (V_DD - V_T) of node B cannot charge up to
V_DD because of the poor drivability of the PMOS TFT device.
When the boosted word-line technique is applied to the PMOS TFT cell during
a write cycle, a problem can arise. The unselected cells connected to the boosted
common word-line suffer from an instability problem because a large current
flows through the low node of the cell. This large current is due to the high
voltage on the access transistor. Consequently, this technique is not suitable
for 1 V operation.
Fig. 6.35(b) shows the voltage waveforms for the TSW circuitry in read/write
modes. During the write cycle, the high node A is first charged to a low voltage,
then raised to its full high level. The bit-lines are initially floating, then
precharged at the end of the write cycle. In the next read cycle, the bit-lines are
floating. Before the word-line voltage rises to V_wh, the cell discharges BL
through the low node B. Thus, when the word-line has risen to V_wh, current does
not flow in the cell and node B stays at a low voltage level. Note that this
technique requires multi-V_T CMOS devices and causes a delay in writing because
the bit-lines are discharged before writing.

¹ The boosted-level generator is presented in Section 6.2.11.
In addition to the trend toward higher-density standard DRAMs, there are two
other trends: Low-Power (LP) DRAMs and high-speed DRAMs. The high-speed
DRAMs sacrifice the retention current as well as density for faster access
time. Low-voltage low-power DRAMs are becoming important, particularly
for battery operation. LP DRAMs extend the time of battery operation
as well as battery back-up operation. The active current of LP DRAMs has
been lowered. The data-retention current has also been reduced, but it is still
about one order of magnitude higher than that of SRAMs¹. The 5 V power
supply standard has been used externally for many DRAM generations, from
64 Kb to 16 Mb. This was followed by the 64-Mb DRAM, powered with an external
3.3 V supply not only to reduce the power dissipation but also to ensure
reliability. The gate oxide reliability limits the maximum voltage, which is
related to the boosted voltage inside the chip. Regarding the internal voltage,
5 V can be used up to a maximum DRAM capacity of 4 Mb. At the 16-Mb generation,
the internal voltage is 3.3 V while maintaining an external 5 V with an on-chip
voltage down converter [see Section 6.3]. However, the 3.3 V external power
supply will dominate.

¹ This comparison is made for 1-Mb memories.

[Figure: voltage (V) versus DRAM generation. Horizontal axis: density 1M, 4M,
16M, 64M, 256M, 1G (bit); feature size 1.3, 0.8, 0.5, 0.3, 0.2, 0.1 (µm);
Tox 25, 20, 15, 10, 7, 5 (nm).]
- Input/output data pins.
- External power supply pins.
It is clear that the multiplexed address penalizes the access delay, so for fast
DRAMs separate address input pins can be used. The multiplexing permits the
reduction of the pin count and the cost of packaging. An example of DRAM
timing, using address multiplexing during read mode, is shown in Fig. 6.37.
Some important times are shown, such as the access time from row, t_RAC,
the row address strobe cycle time (or cycle time), t_RC, and the row address
strobe low-state time, t_RAS.
Fig. 6.38 shows a general 4-Mb DRAM architecture. It uses almost the same
circuit techniques as an SRAM except for the memory array. Some additional
circuits are needed, such as a Back-Bias Generator (BBG), a Half-Voltage
Generator (HVG), an optional Voltage-Down Converter (VDC), a Reference Voltage
Generator (RVG), and a boosted voltage generator circuit. The substrate back-bias
voltage is indispensable for stable operation of the DRAM array. The
half-voltage generator permits generation of the precharge level for the bit-lines
at half-V_DD, as explained in the following sections. The reference voltage
generator is needed for the VDC. The boosted voltage generator uses a
charge-pump circuit and permits overdriving of the word-line WL to a voltage
higher than V_DD. More details on these circuits composing the DRAM are given in
the following sections.
where (V_MC - V_BL) is the difference between the memory cell voltage and the
bit-line voltage before the selection of the cell. A typical value of the difference
is V_DD/2. Hence, we have for the bit-line sense signal

    V_s = \frac{V_{DD}}{2 (1 + R)}

For a 3.3 V supply voltage, and using a ratio R = 8 for a 16-Mb DRAM, the sense
signal V_s = 180 mV. This small voltage change of the bit-line requires sensing
circuits. For low-voltage operation, V_s decreases; thus a low ratio R is required.
This is possible by reducing C_BL and increasing C_s.
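
The quoted sense signal can be reproduced with a short calculation. The sketch below assumes the usual charge-sharing form V_s = V_DD / (2 (1 + R)) with R = C_BL / C_s and a half-V_DD precharge, which matches the 180 mV figure for R = 8 at 3.3 V. The 1.5 V supply in the second part is an assumed example, not a value from the text.

```python
def sense_signal(vdd, ratio):
    """Bit-line sense signal for a half-VDD precharge: VDD / (2 (1 + R))."""
    return vdd / (2.0 * (1.0 + ratio))


v_s = sense_signal(vdd=3.3, ratio=8)
print(f"V_s = {v_s * 1e3:.0f} mV")  # about 183 mV

# Ratio needed to keep the same signal at a lower supply (1.5 V assumed).
ratio_low_v = 1.5 / (2.0 * v_s) - 1.0
print(f"R at 1.5 V = {ratio_low_v:.2f}")
```

Keeping the same signal at the lower supply requires R to drop from 8 to roughly 3, which is why the text calls for a smaller C_BL and a larger C_s.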
In early DRAMs, the plate of the capacitor was grounded to reduce the
noise injection from the V_DD power supply. However, for multi-Mb DRAMs,
a V_DD/2 bias of the cell plate was used. This scheme has several advantages,
such as the reduction of the stress on the thinner oxide of the storage capacitor
and the reduction of supply voltage noise. Many 1-Mb DRAMs have used this
cell biasing scheme.
For Gb DRAM cell design with reduced V_DD, the ratio R should be reduced.
This is possible by reducing the bit-line capacitance, C_BL, and increasing the
storage capacitance C_s. On the other hand, the area occupied by C_s should
be reduced to increase the chip capacity. One solution for reducing the C_s area
is the use of a capacitor insulator with extremely high permittivity ε, such as a
ferroelectric material like BaSrTiO3 film. Consequently, a simple planar-type
capacitor can be used in that case.
6.2.3 Read/Write Circuitry
Fig. 6.40 illustrates the different circuits for read, write, precharge, and
equalization functions. The read operation is performed as follows. Initially both
the bit-lines (BL and its complement) are precharged to V_p, which is equal to
V_DD/2, and equalized before the data reading operation. This half-V_DD precharge
technique permits the reduction of the active power dissipation, as discussed in
Section 6.2.9. The signal WL is selected by the row decoder. The high level of the
word-line voltage has to be greater than V_DD to increase the stored charge in
the memory cell. The selected memory cell is connected to one bit-line. Then
ΔV_BL (100 to 200 mV) appears between the bit-lines immediately after the
word-line rises. Then it is amplified by the latch-type CMOS sense amplifier,
which is connected to both bit-lines. After the sensing and the restoring
operations, the voltage levels of the bit-lines have a full-swing condition. The
bit-line differential voltage signal is transferred to the differential output-lines
(O and its complement) through a read circuit. The signal YR is selected at almost
the same time as WL. The parasitic capacitance of the output-line is large (a
typical value is 2 pF for a 4-Mb DRAM), and the readout circuit would need a long
time to amplify the output-line signal. A main sense amplifier is used to read
the output-lines; then the data is selected among several main SAs connected
to different sub-arrays. Finally it is transferred to the output buffer.
The DRAM cell readout mechanism is destructive, and hence the same data
must be rewritten to the cell on every read access. Consequently, on each
bit-line pair, a CMOS amplifier is needed to amplify and restore the level. This
mechanism is not needed in SRAMs since the read operation is non-destructive.

    P_{active} \approx C_{BL} \, \Delta V_{BL} \, V_{DD} \, f    (6.12)

To reduce this active power, many techniques can be used and are summarized
as follows:
The data retention power in a DRAM is mainly due to the refresh operation and
the DC power (I_DC) due to peripheral circuits such as the BBG, BVG, VRG, and
HVG. The refresh process is performed by reading the m cells connected to
each word-line and restoring them. Thus, n refresh cycles are needed for an n × m
DRAM. It can be estimated by
In the following sections, the circuit techniques to reduce the active and
data-retention power dissipation are presented. Also, the different circuits
constituting a DRAM are described and low-power issues of these circuits are
discussed.
6.2.5 Decoder
In a DRAM, static CMOS NAND decoders are used. The power is reduced
by using the predecoding technique. This topic is discussed further in Section
6.1.6 for SRAMs. Fig. 6.41 shows a static CMOS word-line driver. The boosted
level, V_wh, generated by an internal charge pump circuit, is used in the output
stage. When node A is high at (V_DD - V_T), the output inverter leaks a high
DC current because this level is lower than V_wh by at least two threshold
voltages, subject to body effect. Therefore, a small-size PMOS transistor P1 is
used to restore the level of node A to the V_wh level. This transistor also permits
the latching of the low output level (ground). The X_i signal, when selected, is
normally at V_DD. The unselected X_i is discharged to ground in the selected
block before the row decoder selection.
Fig. 6.42 shows the principle of multi-divided bit-line architecture for the
memory array. The m × n array is now divided into m columns by k subarrays.
Each subarray contains n/k word-lines. In this scheme the bit-line capacitance
C_BL is reduced by dividing it into k sections. Also, the signal-to-noise ratio of
the cell is improved. Fig. 6.43 illustrates an example of a 1-Mb DRAM [32]. The
memory is divided into two parts, upper and lower. One part is divided into
N = 16 sub-arrays and the total number of subarrays is k = 32. Two
sub-bit-lines share one amplifier and are selected by the isolation signals, ISO
and its complement. Thus, a partial activation is performed by selecting only one
SA along the bit-line. The switch SW is controlled by the Y signal from the shared
column decoder. This signal runs in parallel to the bit-lines and uses metal-2.
Thus, the I/O is shared by two sub-bit-lines. This principle results in reduced
power dissipation and chip size. It has been used for many DRAM generations
up to 16 Mb.
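
The benefit of dividing the bit-line can be illustrated with the charge-sharing relation used for the sense signal. The sketch below assumes that each of the k sections carries 1/k of the undivided bit-line capacitance and ignores sense-amplifier and switch parasitics. The capacitance and supply values are assumed for illustration only.

```python
def divided_bit_line_signal(k, c_bl_total, c_s, vdd):
    """Sense signal when the bit-line is split into k equal sections.

    Assumes charge sharing with a half-VDD precharge and that each section
    carries c_bl_total / k; sense-amplifier parasitics are ignored.
    """
    ratio = (c_bl_total / k) / c_s
    return vdd / (2.0 * (1.0 + ratio))


# Assumed values: 7.68 pF undivided bit-line, 30 fF storage capacitor, 3.3 V.
signals = {k: divided_bit_line_signal(k, 7.68e-12, 30e-15, 3.3) for k in (1, 8, 32)}
for k, v in signals.items():
    print(f"k = {k:2d}: V_s = {v * 1e3:6.1f} mV")
```

In this example the undivided bit-line yields only a few millivolts, while k = 32 brings the capacitance ratio down to 8 and the signal into the range that a sense amplifier can detect reliably.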
Figure 6.43 Multi-divided bit-line architecture with shared SA, I/O and
column decoder [32]. (Labels: row decoder; bit-line in metal-1; Y signal in metal-2.)
Fig. 6.44 shows the hierarchical word-line structure proposed for a 256-Mb
DRAM [26]. This scheme resembles the one used in the SRAM. The DRAM
cell array is divided into several blocks and each one is itself divided into
subarrays. The Sub-Word-Line (SWL) circuitry is embedded in the subarray.
Only one SWL is activated by the Main-Word-Line (MWL) and the row select
signal. It is common to two subarrays as shown in Fig. 6.44. Thus, only two
cell subarrays are activated, which represents a very small portion of the total
cell array. In the case of the 256-Mb DRAM, the active cell array size is 1/1024
of the total. This structure results in reduced active current and ground
bounce.
duty ratio of the HVGE signal in the data-retention mode. To solve the other
problems cited, an HVG circuit was proposed in [28], but this circuit dissipates
a DC current.
where VBBmin is the back-bias voltage when no current is pumped and is equal
to (VT - VDD) (optimum value). During the start-up a large current is pumped,
equal to (-VBBmin Cp f).
Fig. 6.49 shows a pumping circuit which avoids the VT losses and hence is
suitable for low-voltage operation [35]. When the clock (clk) is low, the voltage
of the node A reaches (|VTp| - VDD), and the PMOS transistor P1 clamps
the voltage of node B to the ground level. The VBB level is, in that case,
(|VTp| - VDD - VTn). When clk goes to a high level, the voltage of A rises to
|VTp| and the voltage of B, by capacitive coupling, becomes -VDD, causing VBB
to be equal to -VDD. Therefore the VBB will be
$$V_{BB} = \max\{-V_{DD},\ |V_{Tp}| - V_{DD} - V_{Tn}\} \tag{6.15}$$
This circuit needs a special triple-well structure to avoid minority carrier
injection of the NMOS transistor N, as discussed in [35].
To reduce the power dissipation of the BBG circuit while the DRAM is not in
an active mode, the BBG can be operated at low frequency. Fig. 6.50 shows
a simplified circuit diagram of the BBG circuits for low-power operation [Xi].
In the normal mode, the ring oscillator works all the time to retain the VBB
level. In the data retention mode, the BBG Enable (BBGE) signal is clocked
with a low duty ratio. The ring oscillator then operates at low frequency
to refresh the pumping circuit.
The boosted level cannot be directly used to drive the load. Thus a pass
transistor is needed to isolate the switching boosted level from the load, as
shown in the example circuit of Fig. 6.52(a) [28]. The charge pump
circuit CP1 generates, at node A, a boosted signal switching between VDD
and 2VDD. To control the pass transistor N, two pump circuits CP2 and CP3,
and an inverter INV are needed. The pump circuit CP2 generates, at node
B, a signal switching between 2VDD and 3VDD and uses the boosted voltage
Vch. The other pump circuit, CP3, controls the inverter INV. The output
of this inverter (node D) switches between VDD and 3VDD. The output of
this HVG circuit is Vch = 2VDD and it is stable since the load capacitance is
large. The voltage waveforms are shown in Figure 6.52(b). This circuit is
insensitive to VDD reduction and can work down to a sub-1 V power supply.
perature. One way to increase this time, and hence reduce the data retention
power dissipation, is to control the refresh period as a function of the chip
temperature. Fig. 6.53 shows an on-chip self-refresh control circuit with a
memory-cell leakage monitoring scheme. A refresh clock is generated
automatically with a period of trefresh. The monitor cell, which has a leakage
current, controls the refresh period. Initially node A is high, the NMOS
transistor N is OFF, and node B is low. When the charge on node A has decreased
to the point that the PMOS transistor P turns ON, node B rises. Then, during the
time τ, a high pulse is generated at node C, which in turn charges node A up to
the high level.
One solution to these problems is the use of low-VT devices in the DRAM array
for the CMOS SA, precharge and equalizing circuits. However, this leads to a
drastic increase in the leakage current during the active period. The leakage
current paths are shown in Fig. 6.55. To significantly reduce this leakage
current, the Well-Synchronized Sensing and Equalizing (WSSE) concept
was proposed [37]. It is based on the following two concepts:
■ The voltage levels of the transistor sources and the well are equalized
during the sensing, the restoring, and the equalizing periods. This
eliminates the body effect.
Fig. 6.56(a) shows the WSSE circuits using a triple-well structure. The N-well
and the P-well control voltages, Vwn and Vwp, respectively, are controlled by a
special logic. Fig. 6.56(b) illustrates the voltage waveforms. Before the
word-line is activated, the bit-lines and the signals φn and φp are equalized to
half VDD. The P-well and N-well levels are precharged to (1/2 VDD - VTn) and
(1/2 VDD + VTp), respectively. These voltage levels avoid any forward-biasing
of the drain-well junctions during the initial time after WL activation. During
this initial time, one bit-line differs from VDD/2. In the sensing and restoring
period, the signals φn and Vwp are pulled down while the signals φp and Vwn are
pulled up; each pair is synchronized. After this period, the bit-lines BL and
its complement are in full-swing condition. Then, the level Vwp is pulled below
GND to VBB and isolated from φn, while the level Vwn is pulled above VDD to Vch
and isolated from φp.
Fig. 6.58 shows the unselected memory cell in long cycle operation. The
bit-line has completed the sensing operation and is at ground level (GND). In
this situation, the memory cell is exposed to the worst-case leakage condition.
The charge stored in the cell leaks rapidly due to the subthreshold current. This
situation sets the lower limit of the threshold voltage. Note that the access
transistor of the memory cell has |VBB| as back-bias voltage. The threshold
voltage in this mode is given by
To meet these two requirements of the threshold voltage, the substrate voltage
should have a sufficient back-bias voltage to suppress the body effect.
For example, when the internal supply voltage is VDD = 1.5 V, VBB is set
to -1 V. The VT(1.5 V) is 1 V, the VT(0) is 0.75 V and S = 90 mV/decade.
VT here denotes the extrapolated threshold voltage.
Therefore, the leakage current of a transistor with W = 1 μm is 10 fA. In
this case, Vch must be larger than (VDD + VT(VDD)), which is 3 V.
When the VT of the memory cell is reduced, the leakage current increases
drastically. The concept of Boosted Sense Ground (BSG) [38] was proposed
to shut down the subthreshold current in the memory cell access transistor.
This is achieved by slightly boosting the low-level voltage of the bit-line. This
level is called the BSG level, and is set at 0.5 V. During a long cycle operation,
the gate-source voltage of an unselected cell is negative (-0.5 V), so the
subthreshold current is reduced by 6 orders of magnitude (for S = 80 mV/decade).
Fig. 6.59 shows the BSG circuit applied to a memory cell. The BSG line is common
to all N-channel sense amplifiers. The BSG level is generated by a circuit
similar to the VDC circuit (see Section 6.3). In active mode, the differential
amplifier and N1 are activated and the voltage of the sense ground becomes Vref.
The N2 transistor has a large width and is activated by the signal SE at the
beginning of the sensing period to suppress an unnecessary rise in the BSG level
caused by the sensing current. In the standby mode, the differential amplifier is
made inactive to reduce the standby current, as are N1 and N2. The BSG level is
then clamped to the threshold voltage of N3. Note that the boosted level, Vch,
is reduced compared to the conventional scheme because VT is reduced.
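
The "6 orders of magnitude" figure quoted above can be checked with a short calculation. The sketch below assumes the ideal exponential subthreshold model, in which the current changes by one decade for every S millivolts of gate-source voltage; the function name is illustrative only.

```python
def decades_of_reduction(vgs_shift_volts: float, swing_mv_per_decade: float) -> float:
    """Decades of subthreshold-current change for a given gate-source shift.

    Assumes the ideal exponential model: one decade per S mV of VGS.
    """
    return vgs_shift_volts * 1000.0 / swing_mv_per_decade


# Values quoted in the text: BSG level of 0.5 V and S = 80 mV/decade.
bsg_decades = decades_of_reduction(0.5, 80.0)
reduction_factor = 10.0 ** bsg_decades
print(f"{bsg_decades:.2f} decades (factor of about {reduction_factor:.1e})")
```

With these numbers the shift corresponds to 6.25 decades, which is consistent with the rounded figure in the text.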
When the threshold voltage is low, the subthreshold current of each driver is
important. Then for a DRAM the total subthreshold current of the drivers is
pump circuit cannot handle such a DC current. Note that this current should
always be evaluated in the worst case: maximum temperature and the lowest
value of VT. In the standby mode, all the drivers are turned OFF. The leakage
current is still the same.
selected driver in standby mode. When d is low, node Ai is high, then the
selected word driver is low.
One problem associated with the SRB scheme is that during the active mode,
after one selected word-line driver is activated, all the other drivers are
leaking, thereby substantially contributing to the active current. This problem
is solved by the partial activation of a hierarchical power-line scheme [39].
Fig. 6.63 shows the principle of the 2-D selection scheme. In this scheme, the
array of k blocks by n drivers is divided into k sub-blocks in columns and l
sub-blocks in rows. The total number of sub-blocks, each containing a set of
drivers, is k x l. During the active mode, only one sub-block is activated. Thus
the subthreshold current in the active mode is drastically reduced.
On-chip VDCs are used for DRAMs as well as SRAMs, ASICs and digital
processors. They are employed in commercial 16-Mb DRAMs to reduce the
external 5 V to an internal voltage of 3.3 V. For SRAMs, they have not been
as commonly used as in DRAMs, particularly in commercial ones. SRAMs
can operate over a wide range of power supply. Moreover, they already have
a low data retention current, sufficient for battery-operated applications. In
this section, we discuss the VDC circuit techniques for DRAMs, which are
basically the same as for SRAMs and other circuits.
Numerous papers have reported designs of the VDC circuit for a DRAM [32, 40,
41, 42, 43, 44, 45] and for an SRAM [46]. Fig. 6.64 shows one approach using
a VDC to reduce the internal voltage for a DRAM. The memory cell array and the
periphery circuits are powered from the internal supply voltage, while the I/O
buffers are powered with the external voltage to maintain compatibility.
However, the VDC, in this situation, should be stable when supplying a large
current to the periphery and the memory array. When the VDC is used for
battery-operated applications, the standby current should be less than 1 μA over
a wide range of temperature (0-70 °C).
Fig. 6.65 shows a schematic of the VDC structure for a DRAM, used to convert
5 V to 3.3 V. It is composed of a Reference Voltage (Vref) Generator (RVG),
a driver circuit and a time-dependent load. The buffer circuit consists of a
differential amplifier [Fig. 6.66] and a common-source drive PMOS transistor Pb.
The current load has a peak, for the memory array, of more than 100 mA in
10-30 ns, and of more than 100 mA in a few ns for the periphery circuit. To
deliver such a large current, the width of the PMOS Pb of the output stage
should be large. Moreover, when the output current changes rapidly, the output
voltage VDD decreases by ΔVDD. To minimize ΔVDD, the gate control voltage, VG,
has to change quickly. This is possible by increasing the differential amplifier
tail current, Is. The current source, Ib, is needed to clamp the output voltage
VDD when the load current becomes almost zero.
Figure 6.66 Schematic of the differential amplifier.
A VDC circuit is one of the keys for achieving a DRAM with a data-retention
current low enough for battery-based applications. The requirements for
low power are the following:
The stability of this circuit is essential for the operation of the VDC. To study
the stability, an ac small-signal analysis is carried out. Fig. 6.67 shows the
simplified equivalent circuit using the MOS small-signal techniques [47]. The
gate capacitance of the output PMOS, CG, is large and is taken into account. gm1
and gm2 are the transconductances of the differential amplifier and the output
stage, respectively. r1 and r2 are their respective equivalent output
resistances. CL is the output load capacitance, composed of the wire capacitance
Cw (footnote 1) and the switched capacitance of the memory core Cm (footnote 2).
(6.20)
The circuit has two poles: p1 = 1/(CG r1) for the differential amplifier and
p2 = 1/(CL r2) for the output stage. The two poles must be sufficiently
separated from each other to assure a good phase margin [48]. For a DRAM
application, the pole p2 varies drastically because of the load variation. Thus,
the circuit can fail to ensure a sufficient phase margin and hence it can
generate ringing or oscillation. Therefore, phase compensation has to be
applied. One
1. A typical value of Cw is 100 pF.
2. A typical value of Cm is 1200 pF.
The condition of stabilization is defined at the point of 0 dB loop gain, where
the phase margin must be larger than 45 degrees. Using the small-signal analysis
with the compensation capacitor Cc, the condition can be extracted. This
capacitor is a function of gm2, gm1, CL and CG. To determine it, gm2 has to be
known, using large-signal analysis. The PMOS driver Pb has to be sized to
satisfy the condition on ΔVDD/VDD (less than 10%) under the transient load
current variation. Hence gm2 can be determined from the size of Pb. For a 16-Mb
DRAM, the width of the output PMOS Pb can be as high as 30,000 μm and
Cc equals 200 pF. This is for 3.3 V internal power supply generation from
5 V.
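
To illustrate why the separation of the two poles matters, the following sketch evaluates the phase margin of an uncompensated two-pole loop gain T(s) = A0 / ((1 + s/p1)(1 + s/p2)). This simplified model and all numeric values are assumptions chosen for illustration; they are not taken from the circuit of Fig. 6.67.

```python
import math


def phase_margin_deg(dc_gain: float, p1: float, p2: float) -> float:
    """Phase margin of T(s) = dc_gain / ((1 + s/p1)(1 + s/p2)); poles in rad/s."""
    if dc_gain <= 1.0:
        raise ValueError("loop gain never reaches 0 dB; phase margin is undefined")
    # |T(jw)| = 1  <=>  (1 + x/p1^2)(1 + x/p2^2) = dc_gain^2, with x = w^2.
    a = 1.0 / (p1 * p1 * p2 * p2)
    b = 1.0 / (p1 * p1) + 1.0 / (p2 * p2)
    c = 1.0 - dc_gain * dc_gain
    # Positive root of a*x^2 + b*x + c = 0, written to avoid cancellation.
    x = -2.0 * c / (b + math.sqrt(b * b - 4.0 * a * c))
    w_unity = math.sqrt(x)
    phase_lag = math.degrees(math.atan(w_unity / p1) + math.atan(w_unity / p2))
    return 180.0 - phase_lag


# Hypothetical numbers: widely separated poles versus two coincident poles.
pm_separated = phase_margin_deg(dc_gain=100.0, p1=1e3, p2=1e9)
pm_coincident = phase_margin_deg(dc_gain=100.0, p1=1e6, p2=1e6)
print(f"separated poles: {pm_separated:.1f} deg, coincident poles: {pm_coincident:.1f} deg")
```

In this model the widely separated case stays close to 90 degrees, while the coincident-pole case falls well below the 45-degree criterion, which is the situation that phase compensation is meant to correct.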
The tail current of the differential amplifier can be high (a few mA) in active
mode. The driver can be deactivated in standby mode by the Chip Select (CS)
signal so that it consumes only a very small current. In this case, the internal
voltage can be supplied by a low-power voltage follower [46]. The voltage
follower has the same configuration as the driver but its tail current is in the
sub-μA range.
The former consumes a DC current which is not low enough for low-power
applications. The latter is more suitable for a CMOS technology.
(6.21)
where Ns1 and Ns2 are the surface impurity concentrations of P1 and P2,
respectively. For a temperature-stable design, the concentration ratio Ns1/Ns2,
and therefore the threshold voltage difference, should not be excessively large.
A typical value of the temperature dependency is 0.4 mV/°C, which is small for
the VDC circuit.
Since ΔVT is around 1 V, the circuit of Fig. 6.70 is used to convert this
difference to the required internal supply voltage. The voltage-up converter
amplifies ΔVT to:
$$V_{ref} = \Delta V_T \left(1 + \frac{R_2}{R_1}\right) \tag{6.22}$$
The mismatch between the two PMOS devices P1 and P2 of Fig. 6.69 can be
minimized by using large channel widths and lengths. But the deviation
of VT due to the fabrication process still has to be eliminated. This can be
done by using a fuse trimming technique to control the ratio of the resistors R1
and R2. The total current consumed by this RVG circuit is
where I1 is the current consumed by the voltage regulator [see Fig. 6.69(a)]
and I2 is the current of the differential amplifier. I3 = Vref/(R1 + R2) is the
current of the output stage. I1 can be made < 1 μA; however, I2 and I3 cannot
be made smaller, particularly I3. The resistor is implemented, for example, by
using doped polysilicon. Typical values of the resistances are of the order of
100 kΩ. They cannot be increased excessively, otherwise the area of the RVG
can become significantly large. Moreover, the substrate noise can affect the
reference voltage through the coupling capacitances of the resistors. The total
current of this type of RVG is of the order of a few tens of μA.
[2] M. Matsui et al., "An 8-ns 1-Mb ECL BiCMOS SRAM," International
Solid-State Circuits Conf. Tech. Dig., pp. 38-39, February 1989.
[3] Y. Maki et al., "A 6.5-ns 1-Mb BiCMOS ECL SRAM," International
Solid-State Circuits Conf. Tech. Dig., pp. 136-137, February 1990.
[4] M. Takada et al., "A 5-ns 1-Mb ECL BiCMOS SRAM," IEEE Journal of
Solid-State Circuits, vol. 25, no. 5, pp. 1057-1062, October 1990.
[5] A. Ohba et al., "A 7-ns 1-Mb BiCMOS ECL SRAM with Program-Free
Redundancy," in Symp. VLSI Circuits Conf. Tech. Dig., pp. 41-42, May
1990.
[6] Y. Okajima et al., "A 7-ns 4-Mb BiCMOS SRAM with a Parallel Testing
Circuit," International Solid-State Circuits Conf. Tech. Dig., pp. 54-55,
February 1991.
[7] K. Sasaki et al., "A 7-ns 140-mW 1-Mb CMOS SRAM with Current Sense
Amplifier," IEEE Journal of Solid-State Circuits, vol. 27, no. 11, pp.
1511-1518, November 1992.
[8] T. Ootani et al., "A 4-Mb CMOS SRAM with a PMOS Thin-Film
Transistor Load Cell," IEEE Journal of Solid-State Circuits, vol. 25, no. 5,
pp. 1082-1092, October 1990.
[9] S. Murakami et al., "A 21-mW 4-Mb CMOS SRAM for Battery
Operation," IEEE Journal of Solid-State Circuits, vol. 26, no. 11, pp.
1563-1570, November 1991.
[10] K. Sasaki et al., "16-Mb CMOS SRAM with a 2.3-μm² Single-Bit-Line
Memory Cell," IEEE Journal of Solid-State Circuits, vol. 28, no. 11, pp.
1125-1130, November 1993.
[11] M. Matsumiya et al., "A 15-ns 16-Mb CMOS SRAM with Interdigitated
Bit-Line Architecture," IEEE Journal of Solid-State Circuits, vol. 27, no.
11, pp. 1497-1503, November 1992.
[12] K. Seno et al., "A 9-ns 16-Mb CMOS SRAM with Offset-Compensated
Current Sense Amplifier," IEEE Journal of Solid-State Circuits, vol. 28,
no. 11, pp. 1119-1124, November 1993.
[14] H. Kato et al., "Consideration of Poly-Si Loaded Cell Capacity Limits for
Low-Power and High-Speed," IEEE Journal of Solid-State Circuits, vol.
27, no. 4, pp. 683-685, April 1992.
[15] K. Sasaki et al., "A 23-ns 4-Mb CMOS SRAM with 0.2-μA Standby
Current," IEEE Journal of Solid-State Circuits, vol. 25, no. 5, pp. 1075-1081,
October 1990.
[16] K. Ishibashi, T. Yamanaka, and K. Shimohigashi, "An α-Immune, 2-V
Supply Voltage SRAM using a Polysilicon PMOS Load Cell," IEEE Journal
of Solid-State Circuits, vol. 25, no. 1, pp. 55-60, February 1990.
[17] K. Sasaki et al., "A 15-ns 1-Mbit CMOS SRAM," IEEE Journal of
Solid-State Circuits, vol. 23, no. 5, pp. 1067-1072, October 1988.
[18] K. Sasaki et al., "A 9-ns 1-Mbit CMOS SRAM," IEEE Journal of
Solid-State Circuits, vol. 24, no. 5, pp. 1219-1225, October 1989.
[19] K. Ishibashi, K. Takasugi, T. Yamanaka, T. Hashimoto, and K. Sasaki, "A
1-V TFT-Load SRAM using a Two-Step Word-Voltage Method," IEEE
Journal of Solid-State Circuits, vol. 27, no. 11, pp. 1519-1524, May 1992.
[20] M. Yoshimoto, K. Anami, H. Shinohara, T. Yoshihara, H. Takagi, S. Nagao,
S. Kayano, and T. Nakano, "A Divided Word-Line Structure in the Static
RAM and its Application to a 64K Full CMOS RAM," IEEE Journal of
Solid-State Circuits, vol. SC-18, no. 5, pp. 479-485, October 1983.
[21] T. Hirose, H. Kuriyama, S. Murakami, K. Yuzuriha, T. Mukai, K. Tsutsumi,
Y. Nishimura, Y. Kohno, and K. Anami, "A 20-ns 4-Mb CMOS
SRAM with Hierarchical Word Decoding Architecture," IEEE Journal of
Solid-State Circuits, vol. 25, no. 5, pp. 1068-1074, October 1990.
[44] M. Horiguchi et al., "A Tunable CMOS-DRAM Voltage Limiter with
Stabilized Feedback Amplifier," IEEE Journal of Solid-State Circuits, vol. 25,
no. 5, pp. 1129-1135, October 1990.
■ Ripple Carry Adders (RCA);
■ Carry Look-Ahead Adders (CLA);
■ Carry Select Adders (CS); and
■ Conditional Sum Adders (CSA).
$$S = A \oplus B \oplus C \tag{7.2}$$
latest one in an adder chain. The schematic of Fig. 7.2 is symmetrical and
leads to a better layout and a small area. Since the outputs are complemented,
the configuration of Fig. 7.3 can be used to implement an RCA circuit.
In this case, many cells use inverted inputs.
Note that an n-bit RCA circuit is subject to the glitching problem. Fig. 7.4
shows a static simulation of a 4-bit adder, with the inputs Ai set to zero (0),
and the inputs Bi and Cin rising from 0 to 1. The outputs Si should stay
at 0; however, due to the delay of the carry signal through the chain of
full-adders, the outputs exhibit spurious transitions (glitching). These dynamic
transitions dissipate extra power and can represent an important portion of
the total power. With careful design this glitching problem can be minimized.
One advantage of the RCA is its low-power characteristic. However, its speed
is very limited, particularly when the adder is wide.
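
The glitching mechanism can be reproduced with a very small behavioral model. The sketch below is not the simulation of Fig. 7.4; it assumes a unit delay per full-adder, an all-zero starting state, and inputs that all switch at t = 0. The function name and the delay model are illustrative assumptions.

```python
def simulate_rca(a_bits, b_bits, cin, steps=None):
    """Unit-delay simulation of an n-bit ripple carry adder (bit 0 first).

    Every full-adder output at time t+1 is computed from its inputs at
    time t. All inputs are assumed to switch from 0 to the given values
    at t = 0, starting from the all-zero steady state.
    """
    n = len(a_bits)
    steps = n + 1 if steps is None else steps
    carries = [cin] + [0] * n          # carries[i] feeds full-adder i
    sums = [0] * n
    history = [list(sums)]
    for _ in range(steps):
        new_sums = [a_bits[i] ^ b_bits[i] ^ carries[i] for i in range(n)]
        new_carries = [cin] + [
            (a_bits[i] & b_bits[i]) | (carries[i] & (a_bits[i] ^ b_bits[i]))
            for i in range(n)
        ]
        sums, carries = new_sums, new_carries
        history.append(list(sums))
    # Number of output transitions seen on each sum bit.
    toggles = [
        sum(history[t][i] != history[t + 1][i] for t in range(steps))
        for i in range(n)
    ]
    return history, toggles


# A = 0000, B and Cin rise to 1: every sum bit should end at 0.
history, toggles = simulate_rca([0, 0, 0, 0], [1, 1, 1, 1], 1)
for t, sums in enumerate(history):
    print(t, sums)
print("transitions per sum bit:", toggles)
```

In this model the final sum bits are all 0, yet the upper sum bits each go high and then low again while the carry ripples through, which is the spurious switching described above.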
Another efficient full-adder cell is based on Transmission Gates (TGs). Fig. 7.5
shows an optimized version of the full-adder cell using TGs, already discussed
in Chapter 4. The carry signal propagates through only one TG. Hence, an n-bit
RCA would be faster and more compact than the conventional one. Fig. 7.6
shows the construction of an n-bit adder. In practice, an inverter is added
every four stages to reduce the degradation of the carry signal due to the
distributed RC effect. When the carry signal is inverted after four 1-bit
stages, complementary carry path adders are used for the next four 1-bit stages.
This adder structure is sometimes called the Manchester adder. This circuit is
faster than the RCA and may have lower power dissipation.
$$G_i = A_i B_i \tag{7.4}$$
$$C_1 = G_0 + P_0 C_0 \tag{7.6}$$
$$C_2 = G_1 + P_1 G_0 + P_1 P_0 C_0 \tag{7.7}$$
$$C_3 = G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 C_0 \tag{7.8}$$
$$C_4 = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 C_0 \tag{7.9}$$
Fig. 7.7 shows the block diagram of a 4-bit CLA adder. The carry generator
blocks (CLG1 to CLG4) generate the carries C1 to C4, in parallel, from the
carry-in signal C0. The different Pi and Gi signals are implemented following
the expressions given by Equations (7.4) and (7.5). The sum generator blocks
(SG1 to SG4) generate the sums. The sum, Si, is generated by
$$S_i = C_{i-1} \oplus A_i \oplus B_i \tag{7.10}$$
or
$$S_i = C_{i-1} \oplus P_i \tag{7.11}$$
if the propagate signal is given by
$$P_i = A_i \oplus B_i \tag{7.12}$$
In general, an n-bit CLA adder can be implemented efficiently using 4-bit
blocks.
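
As a sanity check on the lookahead carry expressions (7.6) to (7.9), the behavioral sketch below evaluates them for every 4-bit operand pair and compares the result with ordinary integer addition. It uses the XOR form of the propagate signal and zero-based bit indices, so the carry entering bit i is written c_i here; the function name is illustrative.

```python
def cla4(a: int, b: int, c0: int):
    """4-bit carry lookahead addition; returns (sum, carry_out)."""
    A = [(a >> i) & 1 for i in range(4)]
    B = [(b >> i) & 1 for i in range(4)]
    G = [A[i] & B[i] for i in range(4)]    # generate
    P = [A[i] ^ B[i] for i in range(4)]    # propagate (XOR form)
    # All carries are written directly in terms of G, P and c0.
    c1 = G[0] | (P[0] & c0)
    c2 = G[1] | (P[1] & G[0]) | (P[1] & P[0] & c0)
    c3 = G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0]) | (P[2] & P[1] & P[0] & c0)
    c4 = (G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1])
          | (P[3] & P[2] & P[1] & G[0])
          | (P[3] & P[2] & P[1] & P[0] & c0))
    carries_in = [c0, c1, c2, c3]
    total = sum((carries_in[i] ^ P[i]) << i for i in range(4))
    return total, c4


# Exhaustive comparison with integer addition.
mismatches = [
    (a, b, c)
    for a in range(16) for b in range(16) for c in (0, 1)
    if cla4(a, b, c) != ((a + b + c) & 0xF, (a + b + c) >> 4)
]
print("mismatches:", len(mismatches))
```

The model only checks the logic; it says nothing about the delay penalty of the wide gates, which is a circuit-level issue.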
Fig. 7.8(a) and 7.8(b) show the first and the fourth CMOS carry lookahead
generator circuits, respectively. The generate and propagate signals are
generated in parallel and are fed to all carry generators with the input carry
signal C0. The carry signals are generated simultaneously. However, because the
number of stacked MOS transistors increases, the delay of the fourth carry is
greater than that of the first and limits the adder speed. The sum generator
of the CMOS adder of Fig. 7.2 can be used in this case. The same circuit is
used for all four bits. This implementation is slow because of the large number
of stacked MOS transistors, which represent a high equivalent resistance in the
pull-up and pull-down paths.
$$P = P_{i+3} P_{i+2} P_{i+1} P_i \tag{7.13}$$
$$G = G_{i+3} + P_{i+3} G_{i+2} + P_{i+3} P_{i+2} G_{i+1} + P_{i+3} P_{i+2} P_{i+1} G_i \tag{7.14}$$
The circuits of Fig. 7.9(b) and Fig. 7.9(c) show the implementations of the
global functions P and G. Similarly, the P and G signals for the third, second
and first bit stages can be constructed. For an n-bit adder, all the P and G
signals are computed in parallel. Hence, the critical path is the carry path
from Ci to Ci+4, except for the first 4-bit adder block, where the critical path
can be from one of the inputs (A0 or B0) to the carry out C4.
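
The role of the global signals can be illustrated with a short behavioral check: once P and G of a 4-bit block are known, its carry out follows from the carry in through a single AND-OR step, Ci+4 = G + P·Ci. The sketch below assumes the XOR form of the bit-level propagate signal; the helper names are hypothetical.

```python
def block_pg(a: int, b: int):
    """Global propagate P and generate G of a 4-bit block (operands 0..15)."""
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(4)]
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(4)]
    P = p[3] & p[2] & p[1] & p[0]
    G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    return P, G


def block_carry_out(a: int, b: int, carry_in: int) -> int:
    """Carry out of the block from its global signals: G + P * carry_in."""
    P, G = block_pg(a, b)
    return G | (P & carry_in)


# The block carry must match the carry of an ordinary 4-bit addition.
all_match = all(
    block_carry_out(a, b, c) == ((a + b + c) >> 4)
    for a in range(16) for b in range(16) for c in (0, 1)
)
print("block carry matches integer addition:", all_match)
```

Because P and G do not depend on the carry in, they can be prepared while the carry of the previous block is still being resolved.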
The sum generator is implemented using the propagate signal Pi and its
complement. Fig. 7.10(a) illustrates one possible circuit using a static CMOS
implementation.
Figure 7.10 Sum generator circuits: (a) static CMOS; (b) transmission gate
version.
Another circuit, more compact and faster, is shown in Fig. 7.10(b). It uses
transmission gates and needs only 6 transistors.
Many circuit techniques for high-speed carry lookahead adders have been
proposed. One of them uses a pseudo-NMOS-like style [1]. The adder was used in
a multiplier and achieved high-speed static operation. However, it consumes
a DC current and is not suitable for low-power applications.
Other CLA implementations, which improve the carry path delay, are based on the
transmission gate and CPL families. In this section we present the one based
on CPL. The TG version is left to the reader to design. Fig. 7.11 shows the
block diagram of a 32-bit PMOS latch CPL carry lookahead adder using 4-bit
blocks. The carry generators (CLGs) of each 4-bit block generate the carries
Ci+1 through Ci+4 in parallel from the carry in, Ci. The different Pi and Gi
signals, required by each 4-bit block, are not shown for clarity. When
the carry Ci+4 is fed to the next 4-bit block it uses a buffer to distribute this
carry to the other CLGs and SGs. Therefore, the carry path is not significantly
loaded. This results in fast operation. Fig. 7.12 shows the CPL implementation
of the CLG of the fourth bit. This circuit is located in the critical path of
the carry signal. It is compact and uses only NMOS pass transistors. P and G
are the global propagate and generate signals, respectively. The fourth carry
is generated from the carry in or the G signal through only one NMOS device.
The P signal block is implemented using the AND/NAND CPL style. After every
four CLG blocks of the critical path, the carry is buffered and restored using
PMOS latch buffers. The PMOS latch restores the reduced high level to full swing
to avoid any DC leakage current, as shown in Fig. 7.11. Fig. 7.13 shows the
G signal block for the fourth-bit CLG as an example. The same circuit style
can be used to generate this G signal for the third-bit, the second-bit, and the
first-bit CLGs. In addition, the output inverter uses a PMOS latch to restore
the swing. The PMOS latch circuit is incorporated only when dual-rail signals
are available. For a single-ended signal, however, a feedback PMOS transistor
is added to restore the full-rail high level, as in the case of the sum generator
of Fig. 7.14.
(7.16)
(7.17)
(7.18)
The adder mainly uses transmission gates for the multiplexers, as shown in Fig.
7.17(c). Note that the architecture uses the signals and their complements
(dual-rail architecture) to avoid the use of inverters for the multiplexers.
Otherwise the delay of the carry path would be penalized by the addition of
inverters.
To design an n-bit (e.g., 32-bit) adder, one possible technique for fast
operation is to use staged blocks of constant or variable width. In this case,
all the conditional sum blocks compute their respective double sums and double
output carries in parallel. The true sum and carry out signals of each block are
then selected by the carry in generated by the previous stage. The architecture
at the block level uses a carry-select-like technique where the carry in of each
block is the true carry out of the previous block. The optimal staging can
be determined from circuit simulation. The architecture has two critical delay
paths within a block. One is from the carry in to the carry out, which is
affected by the layout routing since the carry in of a block is distributed to
all the final multiplexers. The other critical delay path is the one from the
LSB input of a block to the carry out.
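
The block-level selection described above can be summarized in a short behavioral model. The sketch below is only an illustration of the principle: each block precomputes its sum and carry out for both possible carry-in values, and the true carry out of the previous block selects one pair. The block widths (4, 4, 8, 16) are an arbitrary example of variable staging, not an optimized choice, and the function names are hypothetical.

```python
def conditional_block(a: int, b: int, width: int):
    """Return [(sum, carry_out) for carry_in = 0, (sum, carry_out) for carry_in = 1]."""
    mask = (1 << width) - 1
    results = []
    for carry_in in (0, 1):
        total = (a & mask) + (b & mask) + carry_in
        results.append((total & mask, total >> width))
    return results


def staged_adder(a: int, b: int, widths, carry_in: int = 0):
    """Add a and b using staged blocks; returns (sum, carry_out)."""
    result, shift, carry = 0, 0, carry_in
    for width in widths:
        mask = (1 << width) - 1
        # Both candidates depend only on the operands, so in hardware
        # every block can prepare them in parallel.
        candidates = conditional_block((a >> shift) & mask, (b >> shift) & mask, width)
        block_sum, carry = candidates[carry]   # selection by the true carry in
        result |= block_sum << shift
        shift += width
    return result, carry


widths = [4, 4, 8, 16]                         # 32 bits in total
print(staged_adder(0xFFFFFFFF, 1, widths))
print(staged_adder(123456789, 987654321, widths))
```

Only the selection step depends on the incoming carry, which is why the carry-in to carry-out path of each block reduces to a multiplexer delay.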
To reduce the power dissipation and the delay of the CSA adder, a CPL-like
circuit style can be used. Fig. 7.18 shows the different circuit cells needed
to implement such an adder. In Fig. 7.18(a), the conditional cell schematic
is shown. The output signals have a high-level voltage equal to VDD - VT.
Fig. 7.18(b) shows the compact multiplexer using NMOS pass transistors. The
control signals of the multiplexers should have full-rail swing. When using
these reduced-swing circuits in the adder, whenever a full-rail swing is needed
it can be generated with the double-rail swing-restoring circuit of Fig.
7.18(c). The output inverter of the sum signal is shown in Fig. 7.18(d). The
feedback PMOS transistor is needed to restore the high level when only a single
rail exists. The layout of such an adder is regular. Only three cells, of the
first, second and third bits, have to be drawn. Fig. 7.19 illustrates the layout
of a 4-bit block in 0.8 μm design rules.
RCA. The conditional sum adder, with variable block staging, combined with a
carry-select-like style can result in the fastest adder if well optimized. The
power dissipation of this adder can be comparable to, or maybe less than, that
of the RCA because it uses a reduced internal swing and a relatively small
transistor count if the CPL-like style is used. When considering all the
criteria, such as the power, the area and the speed, a tool can be developed to
select the adder class which satisfies the specified requirements.
For wide adders, with an operand size of more than 32 bits, the different
architectures can still be utilized. However, to optimize the speed and power of
such a wide adder, several additional algorithms can be combined. Examples of
wide adders can be found in [5, 6].
which have been used in VLSI. The reader can consult references [7, 8] for more
details on array multiplication algorithms.
(7.19)
(7.20)
i=o j=o
The delay of such a multiplier depends on the delay of the full-adder cell and
the final adder in the last row. In the multiplier array, an adder with balanced
carry and sum delays is desirable because the sum and carry signals are both on
the critical path. This differs from the case of a parallel adder, where the
carry path should be optimized and sped up compared to the sum path. For
large arrays, the speed and power of the full-adder are very important. The
CPL-like styles discussed in Chapter 4 can result in reduced power dissipation
and high-speed operation. The final adder in the last row can use the techniques
presented in Section 7.1.
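$$X = -X_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} X_i 2^i \tag{7.22}$$
$$Y = -Y_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} Y_i 2^i \tag{7.23}$$
$$P = XY = X_{n-1} Y_{n-1} 2^{2n-2} + \sum_{i=0}^{n-2} \sum_{j=0}^{n-2} X_i Y_j 2^{i+j} - X_{n-1} \sum_{i=0}^{n-2} Y_i 2^{n+i-1} - Y_{n-1} \sum_{i=0}^{n-2} X_i 2^{n+i-1} \tag{7.24}$$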
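In order to avoid the use of subtractor cells and use only adders, the negative
terms should be transformed. So
$$-X_{n-1} \sum_{i=0}^{n-2} Y_i 2^{n+i-1} = X_{n-1} \left( -2^{2n-2} + 2^{n-1} + \sum_{i=0}^{n-2} \overline{Y_i}\, 2^{n+i-1} \right) \tag{7.25}$$
Using this property in Equation (7.24), the product P becomes
$$P = XY = -2^{2n-1} + \left( \overline{X_{n-1}} + \overline{Y_{n-1}} + X_{n-1} Y_{n-1} \right) 2^{2n-2} + \sum_{i=0}^{n-2} \sum_{j=0}^{n-2} X_i Y_j 2^{i+j} + \left( X_{n-1} + Y_{n-1} \right) 2^{n-1} + X_{n-1} \sum_{i=0}^{n-2} \overline{Y_i}\, 2^{n+i-1} + Y_{n-1} \sum_{i=0}^{n-2} \overline{X_i}\, 2^{n+i-1}$$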
Using the above relation, an n x n multiplier using only adders can be
implemented. The schematic circuit diagram of a 4 x 4 two's complement multiplier
based on Baugh-Wooley's algorithm is shown in Fig. 7.22. The different cells
composing the array are also shown. In this scheme n(n - 1) + 3 full-adders are
required. So for the case of n = 4 the array needs 15 adders. When n is
relatively large, the final adder stage in the multiplier array can be
implemented with the techniques discussed in Section 7.1.
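
The Baugh-Wooley idea of replacing the negatively weighted terms by complemented bits plus constants can be checked numerically. The sketch below writes the product as a sum of positive bit products, complemented-bit terms and fixed constants, obtained by applying the identity -Σ Yi 2^(n+i-1) = -2^(2n-2) + 2^(n-1) + Σ (1 - Yi) 2^(n+i-1) to both cross terms. It is a behavioral check under that assumption, not a description of the cell array of Fig. 7.22.

```python
def baugh_wooley_product(x: int, y: int, n: int) -> int:
    """Product of two n-bit two's complement integers, Baugh-Wooley form."""
    # Python's >> on negative integers yields two's complement bits.
    X = [(x >> i) & 1 for i in range(n)]
    Y = [(y >> i) & 1 for i in range(n)]
    xs, ys = X[n - 1], Y[n - 1]                     # sign bits
    p = -(1 << (2 * n - 1))                         # fixed constant
    p += ((1 - xs) + (1 - ys) + xs * ys) << (2 * n - 2)
    p += sum(X[i] * Y[j] << (i + j) for i in range(n - 1) for j in range(n - 1))
    p += (xs + ys) << (n - 1)
    p += sum(xs * (1 - Y[i]) << (n + i - 1) for i in range(n - 1))
    p += sum(ys * (1 - X[i]) << (n + i - 1) for i in range(n - 1))
    return p


n = 4
bw_ok = all(
    baugh_wooley_product(x, y, n) == x * y
    for x in range(-8, 8) for y in range(-8, 8)
)
full_adders = n * (n - 1) + 3                       # count quoted in the text
print("all 4-bit products correct:", bw_ok, "| full-adders:", full_adders)
```

In hardware the constant -2^(2n-1) is not subtracted; in 2n-bit modular arithmetic it amounts to adding a 1 at the most significant bit position.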
This type of multiplier is suitable for applications where operands with fewer
than 16 bits are to be processed. Applications for such a multiplier are, for
example, digital filters where small operands are used (e.g., 6, 8 and 12 bits).
For low-power and high-speed operation, the array uses a CPL-like adder
as mentioned previously in Section 7.2.1, while a CSA scheme, combined with
carry select, can be utilized in the final adder. For operands equal to or
greater than 16 bits, the Baugh-Wooley scheme becomes too area-consuming and
slow. Hence, techniques to reduce the size of the array, while maintaining the
regularity, are required.
In this equation, the terms in brackets have values in the set {-2, -1, 0, +1, +2}.
The recoding of Y, using the modified Booth algorithm, generates another
number with the following five signed digits: -2, -1, 0, +1, +2. Each recoded
digit in the multiplier performs a certain operation on the multiplicand, X, as
illustrated in Table 7.1.
So the bits of the multiplier are partitioned into overlapping groups of 3 bits,
and each group permits generation of a certain partial product. The five
possible multiples of the multiplicand are relatively easy to generate following
the explanation given in Table 7.2.
The generated partial product is related to the multiplicand for each recoded
digit by the relationships presented in Table 7.3. PP_i is the partial product and
PP_n is the sign bit of the partial product, with PP_n = PP_{n-1} when no shifting
of the partial product is performed. Note that the partial product is represented
on n + 1 bits.
Recoded digit   Partial product bit (for i = 0, ..., n)   ADD
 0              PP_i = 0                                   0
+1              PP_i = x_i                                 0
+2              PP_i = x_{i-1}                             0
-1              PP_i = \bar{x}_i                           1
-2              PP_i = \bar{x}_{i-1}                       1
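The relationships of Table 7.3 can be checked with a minimal sketch. It assumes that negative multiples are formed by bit inversion plus the ADD bit (two's complement negation); the function name `partial_product` and the integer bit-vector representation are illustrative choices, not taken from the text.

```python
def partial_product(x, digit, n):
    """Return (pp, add) for an n-bit two's complement x and a recoded digit.

    pp is the (n + 1)-bit partial product pattern; add is the bit that the
    ADD cell would supply (assumed to complete the two's complement).
    """
    mask = (1 << (n + 1)) - 1
    x_ext = x & mask                      # x sign-extended to n + 1 bits
    if digit == 0:
        return 0, 0                       # PP_i = 0
    if digit == 1:
        return x_ext, 0                   # PP_i = x_i
    if digit == 2:
        return (x_ext << 1) & mask, 0     # PP_i = x_{i-1}
    if digit == -1:
        return ~x_ext & mask, 1           # PP_i = NOT x_i, ADD = 1
    if digit == -2:
        return ~(x_ext << 1) & mask, 1    # PP_i = NOT x_{i-1}, ADD = 1
    raise ValueError("digit must be one of -2, -1, 0, +1, +2")
```

Adding `pp` and `add` modulo 2^(n+1) gives the same bit pattern as the product of the digit and X, which is the property the array relies on.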
Y = 01101001(0)  ->  +2  -1  -2  +1
The bits are grouped into 3-bit groups overlapped by one bit, and a bit with
a value of zero is added on the right side of Y as Y_{-1}. So the multiplication
of two 8-bit numbers generates only 4 partial products. The number is thus
reduced by half. The partial product in this example is represented on 9 bits.
For a correct addition of the partial products, the signs are extended as shown
in Fig. 7.23. The shape of the multiplier is then trapezoidal due to the sign
extension.
[Figure 7.23: example of an 8 x 8 modified Booth multiplication with sign extension.
X = 10010101 (-107), Y = 01101001 (+105)
Operation   Bits recoded
+1          010
-2          100
-1          101
+2          011
P = 1101010000011101 (-11235)]
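The recoding used in this example can be reproduced with a short sketch. The helper names `booth_digits` and `booth_multiply` are illustrative; the digit formula -2*y_{i+1} + y_i + y_{i-1} is the standard radix-4 recoding, assumed here to be the one meant by the text.

```python
def booth_digits(y, n):
    """Modified Booth (radix-4) digits of an n-bit two's complement y, LSB first."""
    assert n % 2 == 0, "this sketch assumes an even word length"
    bits = [(y >> i) & 1 for i in range(n)]    # y_0 ... y_{n-1}
    digits = []
    prev = 0                                    # y_{-1} = 0
    for i in range(0, n, 2):
        digits.append(-2 * bits[i + 1] + bits[i] + prev)
        prev = bits[i + 1]                      # overlap of one bit between groups
    return digits


def booth_multiply(x, y, n):
    """Sum the n/2 partial products digit * x, each weighted by 4**k."""
    return sum(d * x * 4 ** k for k, d in enumerate(booth_digits(y, n)))
```

For Y = +105 the sketch yields the digits +1, -2, -1, +2 (least significant group first), which is the operation column of the example.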
In order to make the array rectangular, and thus more regular for VLSI
implementation, the problem of sign extension must be addressed. This problem
is more crucial when the operands are wide, where each partial product
must be sign-extended to the length of the product. In this section we will not
deal with all the techniques to solve the problem of the sign extension. But we
will discuss one technique, which is shown in Fig. 7.24 for the example of Fig.
7.23. The basic idea is to use two extra bits in the partial product. For the
first partial product, the two additional bits, PP_{n+1} and PP_{n+2}, are equal to
the sign bit of the partial product
Figure 7.24 The previous example of Figure 7.23 with simplified sign extension.
Let us now see the implementation of the n x n modified Booth multiplier. Fig.
7.25 shows the block diagram of the multiplier. It also gives an idea about the
floorplan of this subsystem. It is composed of the following blocks:
For the sake of simplicity, we treat the case of a 6 x 6 multiplier. All the cells
described in this example are the basic cells of any multiplier size. Fig. 7.26
shows the implementation of such a multiplier. Four types of cells are used plus
the final adder. These cells are:
The ADD cell which generates 0 or 1 [see Table 7.3]. The schematic
circuit of this cell is shown in Fig. 7.27(a). Two implementations
are possible: one using pass-transistors controlled by the five signals
defining the recoded digit code, and the other one is an AND2 gate of
the two signals -1x and -2x.
The partial product MUX (PP-MUX) which generates the partial product.
Fig. 7.27(b) shows the schematic of PP-MUX using CPL type
logic. The feedback PMOS, Pf, in this figure or in the one of Fig.
The Booth multiplier exhibits a lot of unnecessary glitches. The main reason for
glitches is the race condition between the multiplicand and the multiplier
due to the Booth encoder. The power dissipation associated with the glitches
can be an important portion of the total power and hence it needs to be reduced
by some techniques of signal synchronization.
Figure 7.28 Logic schematic of the Booth encoder including the sign extension
logic.
Fig. 7.30 shows how the 4-2 compressor can be implemented by 2 full-adders or
by custom static CMOS logic [9]. Four bits, I1, ..., I4, are added to produce 2 sums,
S and C. Hence, 4 bits of the partial products are compressed to produce two
new partial products. The compressor is implemented, using carry-save adder
construction, by two cascaded full-adders as shown in Fig. 7.30(b). Notice that
the carry-out is never generated by the carry-in. Fig. 7.31 shows the 4-2 compressor
circuit using a compact structure of multiplexers [10]. This structure is faster
than the static complementary version. Fig. 7.32 shows the interconnection of
the 4-2 compressors for block A of the example of Fig. 7.29. The carry-out is
connected to the next carry-in. Since these signals are independent, the carry is
not propagated through the row.
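The two-full-adder construction described above can be sketched as follows. The function names are illustrative, and the wiring (I1, I2, I3 into the first adder; its sum, I4 and the carry-in into the second) is one common arrangement assumed here, not necessarily the exact one of Fig. 7.30(b).

```python
def full_adder(a, b, c):
    """One-bit full adder: returns (sum, carry)."""
    s = a ^ b ^ c
    carry = (a & b) | (b & c) | (a & c)
    return s, carry


def compressor_4_2(i1, i2, i3, i4, cin):
    """4-2 compressor from two cascaded full adders.

    Returns (s, c, cout) such that i1 + i2 + i3 + i4 + cin == s + 2 * (c + cout).
    """
    s1, cout = full_adder(i1, i2, i3)   # cout depends only on i1, i2, i3
    s, c = full_adder(s1, i4, cin)      # cin only reaches s and c
    return s, c, cout
```

Because `cout` is computed before `cin` enters the cell, a row of such cells has no rippling carry, which is the independence property the paragraph points out.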
To further enhance the Wallace tree multiplier, the modified Booth algorithm
can be used to reduce the number of partial products by half in a carry-save
adder array. One example of such combined construction is the architecture
of the 32 x 32 multiplier shown in Fig. 7.33. It consists of four functions:
the Booth encoder, the partial product generator, the compressor blocks, and
the final 64-bit adder. The Wallace tree is constructed with 3 stages (levels).
The first stage has 4 blocks (A to D), with each block summing up 4 partial
products among 16. The second stage sums up the 8 newly generated partial
products from the first stage. Hence, two blocks are needed, E and F. Finally,
block G of the third stage of the tree generates two other new partial products
for the final adder. This architecture exhibits some irregularities in the layout
since it has a complicated interconnection scheme. Hence, the interconnection
wires affect the speed and power dissipation of the adder.
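The stage count quoted above follows from halving the number of partial-product rows at each level. A minimal sketch, assuming every level is built from 4-2 compressors and that the row count is a power of two (the function name is hypothetical):

```python
def compressor_tree_rows(n_bits, use_booth=True):
    """Rows of partial products entering each level of a 4-2 compressor tree."""
    rows = n_bits // 2 if use_booth else n_bits   # Booth recoding halves the rows
    history = [rows]
    while rows > 2:
        rows = (rows + 1) // 2                    # every 4 rows are compressed to 2
        history.append(rows)
    return history
```

For a 32 x 32 multiplier with Booth recoding this gives 16, 8, 4 and 2 rows, that is, three compressor levels ahead of the final adder.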
7.3 DATAPATH
A VLSI chip can be partitioned in two parts: the data path (or execution unit)
and the control unit. Data paths are often used in digital signal processors,
microprocessors and application-specific ICs (ASICs). The data path consists
of a combination of an Arithmetic Logic Unit (ALU), a shifter, a register file,
I/O ports, a multiplier, an adder, a magnitude comparator, data busses,
etc. It performs many operations on the data in the register file, to which
the results are sent back. The data busses permit communication between the
different units of the data path, such as the ALU, the shifter and the register
file. These busses have a heavy load (a few pF). In CMOS design, dynamic
techniques are used to allow fast operation. One way to reduce the power
dissipation due to the precharging transistors is to use static busses [11].
The control unit delivers the instructions to the data path. These instructions
determine the operations that the data path has to perform. The control unit
can be implemented using random logic, micro-ROM (Read Only Memory),
PLA (Programmable Logic Array) or a combination of these three
implementations. Other macrocells, such as a TLB (Translation Lookaside Buffer),
cache memory, etc., can be added to the data path and the control unit. In this
section, several blocks of a data path are discussed.
The maximum clock frequency of a VLSI circuit may be limited by the ALU
operations, especially the arithmetic ones. The critical delay of an arithmetic
operation is due mainly to the carry propagation along the width of the ALU.
There are many types of ALU, depending on the number of operations to be
performed. Fig. 7.35 shows the block diagram of a 1-bit slice of an ALU. It
has exactly the same structure as the adder, except that the P and G blocks
are programmable. Fig. 7.35(a) shows the P block with 4 control signals
(OP1 ... OP4). The feedback PMOS transistor, Pf, permits restoration of the
high level from VDD - VTn to VDD. Hence the DC current of the first inverter,
due to the reduced high level, is eliminated. Fig. 7.35(b) shows the G block
with 4 op code signals (OP5 ... OP8). The P and G blocks use the pass-transistor
style. The techniques discussed in Section 7.1 can be applied to achieve
low-power and fast operation. The carry and result (sum) blocks are shown in Fig.
7.35(c) and (d), respectively. Table 7.4 summarizes some of the functions that
can be implemented with these blocks. Several other operations can be realized
with this ALU.
Operation        Result
Add w. carry     A + B
Subtraction      A + \bar{B} + 1
Bit-wise AND     A and B
Bit-wise OR      A or B
Not A            \bar{A}
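One way to picture the programmable P and G blocks is as 4-to-1 selectors in which the operand bits (A, B) pick one of four op code bits. The sketch below rests on that assumption; the op code values in `OPS` are hypothetical settings chosen to reproduce the operations of Table 7.4, not the encoding of Fig. 7.35.

```python
def select(op, a, b):
    """Pass-transistor style selector: (a, b) chooses one of four op code bits."""
    return op[2 * a + b]


def alu(a, b, n, p_op, g_op, cin):
    """n-bit ripple ALU built from programmable P and G blocks."""
    result, carry = 0, cin
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        p = select(p_op, ai, bi)          # programmable propagate
        g = select(g_op, ai, bi)          # programmable generate
        result |= (p ^ carry) << i        # result (sum) block
        carry = g | (p & carry)           # carry block
    return result


# Hypothetical op codes, indexed by (a, b) = 00, 01, 10, 11.
OPS = {
    "add":   dict(p_op=(0, 1, 1, 0), g_op=(0, 0, 0, 1), cin=0),
    "sub":   dict(p_op=(1, 0, 0, 1), g_op=(0, 0, 1, 0), cin=1),
    "and":   dict(p_op=(0, 0, 0, 1), g_op=(0, 0, 0, 0), cin=0),
    "or":    dict(p_op=(0, 1, 1, 1), g_op=(0, 0, 0, 0), cin=0),
    "not_a": dict(p_op=(1, 1, 0, 0), g_op=(0, 0, 0, 0), cin=0),
}
```

For the logic operations the G block is programmed to zero and the carry-in is zero, so the carry chain stays at zero and the result is simply P.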
To implement an n-bit ALU, all the techniques discussed for carry speed-up in
adders can be applied. Drivers are needed to distribute the op code signals for
an n-bit ALU. For low-power design, the busses which communicate with the
ALU are in general not precharged, as in the case of many data paths.
The Absolute Value Calculator (AVC) is, in general, used in data paths
of video processors to compare the data of two pictures. Fig. 7.36 shows
the architecture of the AVC. This parallel circuit performs two subtractions
simultaneously, A - B and B - A. Using the most significant bit of these two
operations, the MUX circuit selects the positive one. Then the output gives
the absolute value |A - B|.
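The behaviour of the AVC can be sketched in a few lines. The sketch assumes unsigned n-bit operands and one extra bit in each subtractor to hold the sign; the function name is illustrative.

```python
def absolute_difference(a, b, n):
    """|a - b| for unsigned n-bit a and b using two parallel subtractions."""
    mask = (1 << (n + 1)) - 1          # one extra bit holds the sign
    d_ab = (a - b) & mask              # first subtractor:  A - B
    d_ba = (b - a) & mask              # second subtractor: B - A
    sign_ab = d_ab >> n                # most significant bit of A - B
    return d_ba if sign_ab else d_ab   # MUX selects the positive result
```

Both differences are computed unconditionally, mirroring the parallel hardware; only the final selection depends on the sign bit.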
To reduce the power dissipation and the area of an n-bit AVC, the logic of
the two n-bit adders required can be reduced by merging the common functions
of both operations. Also, the techniques described in Section 7.1 for n-bit
addition should be used.
7.3.3 Comparator
A magnitude comparator is used in many DSP applications. It permits
comparison of the magnitudes of two numbers A and B by indicating whether A < B,
A = B, or A > B. Fig. 7.37(a) shows an example of a two-bit comparator
which requires two types of cells, C1 and C2. The cell C1 is constructed by
the circuit of Fig. 7.37(b). Table 7.5 shows the truth table for this cell.
These relations can easily be used to implement the second cell, C2, of the
comparator as shown in Fig. 7.37(c).
This technique, for the two-bit comparator, can be extended to an n-bit
comparator. It can be constructed by using a parallel tree of the cells C1 and C2.
A 4-bit comparator could, for example, be constructed with two 2-bit comparators
connected in parallel and at the output the 4 E and F generated signals
7.3.4 Shifter
Another macrocell of the data path is the shifter. It performs shift or rotate
operations on the data. If the number of bits to be shifted is arbitrary, then
a barrel shifter is used [12, 13]. Fig. 7.38 shows the CMOS implementation
Table 7.6 shows the values of the output bus as a function of the input data.
Depending on the values of D<6:0>, several shift operations can be performed.
For example if D<6:4> = 0, and D<3:0> is the 4-bit input data, then
The barrel shifter is not a critical unit for the delay. A low-power operation is
achieved by using a static implementation. This shifter can be implemented
with transmission gates, in which case the feedback PMOS devices are not
required. However, for low power, the use of an NMOS array is more efficient.
The feedback PMOS should be sized to minimum.
Fig. 7.39 shows the schematic of the single-ended memory cell with 2 read ports
and 1 write port (2R-1W). The read ports are the read bit-lines BL-R1 and
BL-R2. The memory cell, composed of two cross-coupled inverters I1 and I2,
is addressed by two read word-line signals, WL-R1 and WL-R2. The NMOS
transistor N1 is controlled by the Write Enable (WE) signal. N1 is connected
serially to the write access transistor N2. The transistor N2 is controlled by
the write word-line (WL-W) signal. The transistor N1 isolates the stored data
from the write bit-line (BL-W). To write the data in the storage node A from
the write bit-line, the inverters I1 and I2 should be sized carefully. The β ratio
of the inverter I1 should be larger than 1 (e.g., 5) to set the threshold voltage
of I1 to a low level. This is due to the fact that N1 and N2 only weakly transfer a
high level (only VDD - VTn). Moreover, to ensure a correct write operation, the
The definition of β is given in Chapter 4.
A pair of three-port memory cells is shown in Fig. 7.40. This structure has a
shared access transistor N_a and write bit-line, BL-W. To read and write the
memory cell, the simplified schematic of Fig. 7.41 is used. This schematic
uses the column multiplexing scheme. For low power, the register file uses
static design and avoids the use of the conventional sense amplifier for bit-line
sensing. The sense amplifier consumes DC power. For a three-port
register file, two read and one write row decoders are required. Also, Write
Enable (WE) and column addresses are needed to produce the column write
enable for writing the data to the specified storage node. For fast operation
AND gates can be used with a maximum of 5-bit inputs.
During the read operation, if for example N_a is asserted, then the data is put
on the bit-line, BL-R1. The bit-line is selected through the pass-transistor N1.
The data is then sensed by the inverter I1 in Fig. 7.41. During this period, the
read enable signal, RE, is asserted, N2 is OFF and only the feedback PMOS Pf
is activated when a one (VDD - VTn) is on the data-line. In this situation, the
feedback PMOS charges up the data-line to VDD. Also the DC current, which
can be generated due to the reduced high level on the data-line, is completely
eliminated. The β ratio of the inverter I1 should be higher than one (e.g., 5)
to achieve a symmetrical read access time for a zero and a one. When RE = 0,
the data-lines are isolated from the bit-lines and the NMOS transistor N2
is ON. Therefore, the latch formed by the pair of inverters I1 and I2 latches
the old data.
The operation of such a register file is fully static and does not dissipate any
static power in any mode of operation. Furthermore, the read and write
operations are asynchronous. This type of register file is suitable for low-power
applications.
are usually dynamic circuits for fast operation. These dynamic circuits can be
shut down with a power management unit for power savings. If, for example,
the clock is turned OFF, all dynamic circuits go into a precharge mode with
all PMOS precharge devices ON.
PLAs have a regular architecture divided mainly in two planes as shown in Fig.
7.42. These planes perform a specific function such as OR and AND. CMOS
PLAs can be implemented in both static and dynamic styles. The style is
chosen depending on the timing strategy in the chip. Other factors such as
speed, power dissipation, and the allowed area play an important role in the
PLA design style. A CMOS PLA example, using a pseudo-NMOS like style, is
shown in Fig. 7.43. The output OR functions are realized with NOR gates.
From Fig. 7.43(a), we have
P_1 = \overline{A + B + C} = \bar{A} \cdot \bar{B} \cdot \bar{C}    (7.33)
The buffers are used when the load on the bit-line is large. They consist in
general of two inverter stages. The OR plane is in principle similar to the
AND plane [Fig. 7.43(b)]. From Fig. 7.43(b), we have
x = Pi + P, + Pa (7.37)
Y = P, + P, (7.38)
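The NOR-NOR organization can be illustrated with a small behavioural sketch. The data layout (`and_plane` as a list of required literal values, `or_plane` as lists of product indices) and the example function are hypothetical and do not reproduce the programming of Fig. 7.43.

```python
def nor(*xs):
    """NOR gate: 1 only when every input is 0."""
    return 0 if any(xs) else 1


def pla(inputs, and_plane, or_plane):
    """Evaluate a NOR-NOR PLA.

    and_plane: list of product terms, each a dict {input name: required value}.
    or_plane:  dict {output name: list of product-term indices}.
    """
    # AND plane: a product is the NOR of the literals that must all be 0.
    products = [
        nor(*[inputs[name] ^ want for name, want in term.items()])
        for term in and_plane
    ]
    # OR plane: NOR of the selected products, followed by an output inverter.
    return {
        out: 1 - nor(*[products[k] for k in idx])
        for out, idx in or_plane.items()
    }


# Hypothetical programming: X = A.B + not(C), Y = A.B
AND_PLANE = [{"A": 1, "B": 1}, {"C": 0}]
OR_PLANE = {"X": [0, 1], "Y": [0]}
```

Each product line is pulled low as soon as one of its literals is wrong, which is exactly the NOR behaviour, so a sum of products is obtained from two NOR planes plus inverters.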
For this pseudo-NMOS PLA, the NOR-NOR logic gate style is used. This example
shows that the PLA organization is useful for implementing Sum Of Products
(SOP) functions. Hence any SOP function can be realized by programming the
array with the AND and OR cells. Any type of latch or register can be used
at the input and output. ThL design style of PLAs has e n m d size area and
In dynamic CMOS style, the circuit shown in Fig. 7.44 can be used. It is a
self-timed PLA, where the AND and OR planes are both realized using a precharged
NOR configuration. In this structure, only a single clock phase is needed.
When the clock, clk, is high the bit-lines are precharged in both planes. The
NMOS transistors N_A and N_O are OFF, guaranteeing that there is no path
to ground. Tracking lines in both planes are used to generate a delayed clock
for the OR plane. When the clock is low, the precharge PMOS transistors in
the AND plane turn OFF, N_A turns ON and the products are evaluated. The
tracking lines ensure that N_O turns ON only when the inputs to the OR plane
are stable. Otherwise the outputs can be spuriously discharged. This PLA is
fast, but it has a lot of wasted dynamic power. The wasted power has several
sources such as:
Figure 7.44 Self-timed dynamic PLA using NOR-NOR style.
• The virtual ground lines are charged and discharged every cycle. The
total capacitance of the virtual ground is important, particularly for
large PLAs because, for the purpose of layout compactness, the ground
lines are in diffusion. This capacitance can be reduced using a metal
level in a multi-metal technology;
• The number of inverters forming the buffers is important. Then,
during the evaluation, several of them switch; and
• The switching activity of the dynamic NOR implementation is high [see
Chapter 4].
Consider now the PLA shown in Fig. 7.45 with AND-NOR structure. The OR
plane is still the same compared to the PLA of Fig. 7.44. However, the AND
plane is considerably simplified because:
Figure 7.45 Self-timed dynamic PLA using AND-NOR style.
The switching activity of the NAND implementation is also lower than that of
the NOR implementation, resulting in lower power in the AND plane. One problem
associated with this structure is that the use of NAND may result in a large
discharge time.
Another dynamic PLA combines the pseudo-NMOS and dynamic logic design
styles [19]. Fig. 7.46 shows an example of such a structure. The AND plane
uses a predischarged pseudo-NMOS NOR style, while the OR plane uses a
conventional dynamic precharged style. During the precharge phase, the clock
signal is high and the bit-lines in the AND plane are predischarged to ground. In
the OR plane, the bit-lines are precharged to VDD. The input signals to the
OR plane are low. During the evaluation phase (clk = 0), the PMOS loads in
the AND plane are ON, and the plane behaves as pseudo-NMOS logic. In this
case, the PMOS device should be sized correctly to ensure safe operation when
the output stays at a low level. The product terms are evaluated and then
the outputs. During this evaluation phase, the PLA dissipates static power,
mainly in the AND plane. The total power is therefore increased by this DC
component.
This PLA does not need the self-timing technique used previously. Also, it was
shown that this PLA has a fast operation [19].
Fig. 7.47 shows a simple ROM circuit architecture using NOR logic design. The
state of the memory array is retained even if the ROM is not powered. The
[Figure: ROM array layout; labels: bit-line (metal 1), word-line (metal 2),
diffusion, word-line (polysilicon).]
The ROM can be implemented in both styles: static and dynamic. In static
style, the pseudo-NMOS logic, similar to that of the static PLA, can be used.
Fig. 7.49 shows an example of a small ROM using the pseudo-NMOS circuit style.
The conditioning circuits use PMOS devices, with their gates grounded, and
the sense amplifier circuit is simply an inverter. The column decoder is also
shown. One of the column decoders selects one of the two bit-lines. Then,
node A is initially at VDD. If the selected bit-line is discharged, then node A
is discharged and the output is pulled up to VDD. The pseudo-NMOS style is easy
to design and does not need a careful design; however, the power dissipation
may be significant due to the DC current. For a relatively large ROM, like the
one used in microcontrollers, the power dissipation can be significantly reduced
using the low-power techniques of SRAMs*. They include pulse mode operation
using address transition detection, reduced swing sensing, etc.
*These techniques are discussed in more detail in Chapter 6.
A dynamic version of the ROM is shown in Fig. 7.50. During the precharge phase,
clk = 1 and the bit-lines are precharged to VDD - VT, where VT is subject
to the body effect. Node A is also precharged by the PMOS transistor Pp.
The select lines Sel1 and Sel2 are controlled by a column decoder. All the
word-lines are predischarged to ground. During evaluation, clk = 0 and if the
bit-line is discharged to ground, node A is also discharged. Then the output of
the inverter I is pulled up. If node A is not discharged, the feedback PMOS
transistor Pf maintains the high level at VDD. Since the swing on
the high-load bit-line is reduced, the power dissipation is reduced on this line
by a factor VDD/(VDD - VT).
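The reduction factor quoted above can be checked numerically. The sketch assumes that the bit-line is recharged from the VDD supply, so that the average power is C x VDD x swing x f; the capacitance, voltages and frequency below are hypothetical values chosen only for illustration.

```python
def bitline_power(c_bl, vdd, v_swing, freq):
    """Average power drawn from VDD to recharge a bit-line: C * VDD * swing * f."""
    return c_bl * vdd * v_swing * freq


VDD, VT = 3.3, 1.0          # hypothetical supply and threshold (with body effect)
C_BL, FREQ = 2e-12, 50e6    # hypothetical bit-line capacitance and cycle rate

full_swing = bitline_power(C_BL, VDD, VDD, FREQ)
reduced_swing = bitline_power(C_BL, VDD, VDD - VT, FREQ)
reduction = full_swing / reduced_swing   # equals VDD / (VDD - VT)
```

The capacitance and frequency cancel in the ratio, so the saving depends only on how far below VDD the precharge level sits.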
Figure 7.50 Dynamic ROM circuitry.
A CAM stores tags which can be compared against an input address word
(A_0 ... A_n) as shown in Fig. 7.51(a). A match detection signal is sent by the
CAM if the values stored in the CAM array match the input address word.
A CMOS implementation of the CAM cell is illustrated in Fig. 7.51(b). It can
be read and written just as an ordinary memory cell. The read/write and
decoder circuits are similar to those of a RAM.
A tag word is formed by identical cells which are repeated in a horizontal array.
The write lines are used to write data in the array. The comparison process is
described as follows. During the precharge phase, the bit-lines are predischarged
low. All the write lines are low. The Match line (ML) is precharged high.
During the evaluation phase, suppose that a "1" is stored at node A. Assume
that the CBL line is held high and the \overline{CBL} line is held low. In this
case, the transistors N3 and N1 are OFF, hence the ML line remains high,
indicating a match at this bit location. Assume now that CBL is driven low and
\overline{CBL} high. The transistor N3 is OFF, but N1 and N2 are ON. Then the
ML line is discharged, indicating a mismatch at this bit location.
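The match-line behaviour can be summarized in a behavioural sketch: the line is precharged high and any mismatching cell discharges it. The function names and the list-of-bits representation are illustrative.

```python
def match_line(stored_bits, search_bits):
    """Return 1 if the match line stays high, 0 if any cell discharges it."""
    ml = 1                                   # ML precharged high
    for a, cbl in zip(stored_bits, search_bits):
        if a != cbl:                         # pull-down path conducts on a mismatch
            ml = 0
    return ml


def cam_lookup(tags, word):
    """One match line per stored tag word."""
    return [match_line(tag, word) for tag in tags]
```

Since every cell of a word hangs on the same line, one mismatching bit is enough to discharge it, which is a wired-NOR of the per-bit mismatch signals.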
NOR circuit is used, as shown in Fig. 7.52. When the clock is low the NOR
gate is precharged along with the match lines. The inputs to the NOR gate
are predischarged to ground. When the clk signal is high (evaluation phase),
one of the match lines, ML(i), stays high and the others are discharged to
ground. When the match lines are stable, the eval signal is asserted with clk
using self-timing (similar to the PLA case). This keeps the dynamic
NOR gate from falsely discharging. The inputs to the NOR gate must not go
high until the data is stable. If one of the match lines stays high, then the NOR
gate is discharged and the output match signal goes high.
skew between the external and internal clocks is due to the clock tree.
The output data is significantly delayed compared to the external clock.
One main contribution is the clock skew. In Fig. 7.53(b), the internal
clock is deskewed via the use of a PLL. The PLL should reduce this
skew over a wide range of process, temperature and voltage variations;
To synchronize data between chips as shown in Fig. 7.54. The PLL
solves the problem of clock skew from chip to chip. An example of such
an application is discussed in [22]; and
To generate internal clocks with higher frequencies than the external
clock (system clock).
There are other applications of PLLs, such as clock recovery in serial data
communications, and these are not discussed in this section. Several theoretical
references on PLLs can be found [23, 24, 25]. This section provides an introduction
to the PLL. The CMOS circuit design of the PLL, for low-power applications, is
then discussed.
7.5.1 Charge-Pumped PLL
One interesting configuration of the PLL is the charge-pumped loop shown in
Fig. 7.55. It is a PLL-based frequency multiplier which consists of a Phase
Frequency Detector (PFD), a Charge Pump (CP), a Loop Filter (LF), a Voltage
Controlled Oscillator (VCO), and a programmable frequency divider. The
feedback of the internal clock is compared to the external clock for phase and
frequency error. The outputs of the phase/frequency detector are two digital
signals called U (for Up) and D (for Down). The charge pump and loop filter
convert these digital signals into an analog signal (control) suitable for the
VCO. The VCO generates a certain oscillation frequency as a function of the
control signal level. If the PLL generates multiples of the external clock frequency,
then a frequency divider is inserted between the generated clock and the phase
detector.
A simplified diagram of the charge pump and loop filter is shown in Fig. 7.56.
It consists of two switchable current sources driving an impedance (LF). The
pulses generated by the PFD block are used to switch the charge pump, to
charge or discharge the impedance. The loop filter filters these pulses and has
an analog output signal to control the VCO.
The charge pump for the PLL should not be confused with the one used to generate
different voltages.
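The conversion of U and D pulses into a control voltage can be sketched with an idealized model. It assumes two equal, ideal current sources and a loop filter reduced to a single capacitor; a practical loop filter would add at least a series resistor. All names and values are hypothetical.

```python
def control_voltage(pulses, i_pump, c_filter, v_start=0.0):
    """Integrate U/D pulses on a capacitor: each pulse changes V by I * t / C.

    pulses: list of (kind, width) with kind "U" (charge) or "D" (discharge).
    Returns the control voltage after each pulse, starting with v_start.
    """
    v = v_start
    trace = [v]
    for kind, width in pulses:
        sign = 1.0 if kind == "U" else -1.0
        v += sign * i_pump * width / c_filter
        trace.append(v)
    return trace
```

With a 10 uA pump and a 1 pF capacitor, for instance, a 1 ns U pulse would raise the control voltage by 10 mV in this model, and an equal D pulse would bring it back.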
Figure 7.53 PLL clock generation for deskewing: (a) a chip without PLL;
(b) a chip with PLL.
change the output control signals. In this last state both U and D signals are
at zero level. The state changes whenever clk or clk_ext falls. In no case
are U and D both activated.
Consider that clk and clk_ext have the same frequency but the falling edges of
clk_ext (clk) lead the falling edges of clk (clk_ext), respectively. Then, U (D) is
asserted with a certain duty cycle, while D (U) is never asserted. In this case,
the PFD is characterized as a phase detector.
Consider now the case where clk_ext has a higher frequency than clk. U is
asserted most of the time, since there are more falling edges of the clk_ext signal
than of clk. A similar situation occurs when clk has a higher frequency than
clk_ext, and D is asserted most of the time. In this case, the PFD is characterized
as a frequency detector.
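The two regimes can be reproduced with a small event-driven sketch of a three-state PFD triggered on falling edges. The state encoding (+1 for U asserted, -1 for D asserted, 0 for both low) and the helper names are assumptions made for illustration.

```python
def falling_edges(period, phase, t_end):
    """Times of the falling edges of a clock within [0, t_end)."""
    edges, k = [], 0
    while phase + k * period < t_end:
        edges.append(phase + k * period)
        k += 1
    return edges


def simulate_pfd(ext_edges, clk_edges, t_end):
    """Return (time with U asserted, time with D asserted) up to t_end."""
    events = sorted([(t, "ext") for t in ext_edges] +
                    [(t, "clk") for t in clk_edges])
    state, t_prev = 0, 0.0
    u_time = d_time = 0.0
    for t, source in events:
        if state == 1:
            u_time += t - t_prev
        elif state == -1:
            d_time += t - t_prev
        t_prev = t
        if source == "ext":
            state = min(state + 1, 1)    # an external edge moves towards U
        else:
            state = max(state - 1, -1)   # an internal edge moves towards D
    if state == 1:
        u_time += t_end - t_prev
    elif state == -1:
        d_time += t_end - t_prev
    return u_time, d_time
```

With equal frequencies and clk_ext leading by a fraction of the period, U is high for that fraction only and D never rises; with clk_ext at a higher frequency, U dominates, in line with the description above.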
mirror circuit. In this situation, the output current of the charge pump can be
adjusted through the control of the current mirror.
The charge pump and filter generate a control voltage for the VCO. One
important parameter of the VCO is the VCO gain. When considering
the frequency versus control voltage characteristic, the VCO gain is the slope of
this characteristic. A linear characteristic is, in general, desirable. In general the
VCO is implemented using a ring oscillator as shown in Fig. 7.60. A series
connection of delay inverter cells forms a tapped delay line which oscillates
with a frequency determined by the delay time of the cell and the odd number
The VCO of Fig. 7.61 has an excellent bandwidth characteristic, where a wide
range of frequencies can be generated [26]. It is used for video signal processors
and covers a wide range of applications. The frequency range can change by one
order of magnitude, from 50 MHz to 350 MHz. In Fig. 7.61, by turning ON and
OFF 8 CMOS TGs with control signals, the number of ring oscillator stages can
be selected among eight values (7, 9, 11, 15, 21, 29, 39, 61). Each stage of the ring
oscillator combines an inverter in parallel with a current-controlled inverter.
The inverter increases the frequency of oscillation of the VCO, whereas the
current-controlled inverter permits tuning of the frequency of the VCO.
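The dependence of the oscillation frequency on the number of stages can be sketched with the usual first-order ring oscillator relation f = 1/(2 N t_d), which is assumed here rather than quoted from the text. The stage delay used in the usage note is a hypothetical value.

```python
def ring_frequency(n_stages, t_delay):
    """First-order oscillation frequency of a ring of n_stages inverters."""
    if n_stages % 2 == 0:
        raise ValueError("a single-ended ring needs an odd number of stages")
    return 1.0 / (2 * n_stages * t_delay)


def frequency_span(stage_options, t_delay):
    """Lowest and highest frequency reachable by switching the stage count."""
    freqs = [ring_frequency(n, t_delay) for n in stage_options]
    return min(freqs), max(freqs)
```

For example, `frequency_span([7, 61], 0.2e-9)` compares the shortest and longest rings for an assumed 0.2 ns stage delay; the ratio of the two frequencies is 61/7 regardless of the delay value, which is close to one order of magnitude.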
The generated clock frequency can be N times the external clock frequency
(reference frequency). This clock then feeds the clock driver and tree. Since
the PLL discussed here is intended to be integrated on-chip, it is sensitive
to the noise generated on the power lines (called power-supply-induced clock
jitter). If the power supply changes by 100 mV the skew or phase error will
be significant before the PLL has time (tens of clock cycles) to correct this
error [27]. One way to reduce the effect of this problem is to dedicate an analog
power supply pin to the VCO and the charge pump. At the circuit level, a new
VCO delay cell was proposed by Young [27] to reduce the phase error.
The frequency divider can be implemented using toggle flip-flops. Fig. 7.63
shows an example of a divider with division ratios of 1, 1/2, 1/4, and 1/8. The
PLL discussed so far is not completely digital. Only the PFD, charge pump
and the frequency divider are digital, while the LF and VCO are analog and
operate as continuous-time systems.
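A chain of toggle flip-flops divides the clock by two at each stage. The sketch below models a ripple arrangement in which each stage toggles when the previous one falls; this clocking choice and the function name are assumptions, since the details of Fig. 7.63 are not reproduced here.

```python
def toggle_divider(n_clock_edges, n_stages):
    """Simulate a ripple chain of toggle flip-flops.

    Returns one waveform per stage, sampled after every active input edge.
    """
    q = [0] * n_stages
    waves = [[] for _ in range(n_stages)]
    for _ in range(n_clock_edges):
        k = 0
        while k < n_stages:
            q[k] ^= 1              # this stage toggles
            if q[k] == 1:          # it rose: the next stage is not triggered
                break
            k += 1                 # it fell: the next stage toggles too
        for s in range(n_stages):
            waves[s].append(q[s])
    return waves
```

Taking the input clock itself as ratio 1, the three stage outputs give the 1/2, 1/4 and 1/8 ratios; a multiplexer selecting among them would make the divider programmable.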
As an example, one way to disable the PLL is to shut down the VCO and disable
the external clock. Fig. 7.64 shows the same VCO as Fig. 7.62 but with one
inverter transformed into a two-input NAND gate. One of the inputs is controlled
by the Enable signal to shut down the PLL when it is low. The NAND gate
can be used for any of the VCOs presented previously. Also the enable signal
can be used to disable any current source used in the PLL to eliminate any DC
current. A typical power dissipation of a PLL, at 3.3 V, is in the range of tens
of mW depending on the frequency.
• High-speed addition.
• Multiplication techniques.
• PLL and clock deskewing technique.
REFERENCES
[1] J. Mori, et al., "A 10-ns 54 x 54-b Parallel Structured Full Array Multiplier
with 0.5-µm CMOS Technology," IEEE Journal of Solid-State Circuits,
vol. 26, no. 4, pp. 600-606, April 1991.
[23] F. M. Gardner, "Phaselock Techniques," John Wiley and Sons, 1979.
[24] F. M. Gardner, "Charge-Pump Phase-Locked Loops," IEEE Transactions
on Communications, COM-28(11), pp. 1849-1858, November 1980.
[25] R. E. Best, "Phase-Locked Loops," McGraw Hill, 1984.
[26] J. Goto, et al., "A Programmable Clock Generator with 50 to 350 MHz
Lock Range for Video Signal Processors," IEEE Custom Integrated Circuits
Conference, Tech. Dig., pp. 4.4.1-4.4.4, May 1993.
[27] I. A. Young, J. K. Greason, and K. L. Wong, "A PLL Clock Generator
with 5 to 110 MHz of Lock Range for Microprocessors," IEEE Journal of
Solid-State Circuits, vol. 27, no. 11, pp. 1599-1607, November 1992.
[28] M. G. Johnson, and E. L. Hudson, "A Variable Delay Line PLL for
CPU-Coprocessor Synchronization," IEEE Journal of Solid-State Circuits, vol.
23, no. 5, pp. 1218-1223, October 1988.
8
LOW-POWER
VLSI DESIGN METHODOLOGY
8.1.1 Floorplanning
Floorplanning of a circuit is the first step in VLSI layout design. It permits the
allocation of space on a chip for a given set of modules. A module can be rigid,
e.g., the module is in the library and its dimension and power dissipation are
known, or flexible, e.g., it has not been designed and has a list of parameters
such as different shapes and power consumptions for feasible implementations.
A floorplanner for low-power design should choose a suitable implementation for
each module such that the total power/area of a chip are optimized [1].
where P_i is the probability of the node i being a "1", (1 - P_i) is the probability
that node i is a "0", and C_i is the capacitance of this node. For more
information on this model see Section 8.5.2.1. To minimize the above equation, one
has to first evaluate the current value of P_i and then change it by making P_i
close to 0 or close to 1. Also in [3], zero-delay approximation is assumed. This
implies that the glitching power is neglected.
To minimize the switching activity, some techniques that can be used are:
The technology mapping step for low-power refers to the process of trans-
forming a logic function into a technology-dependent (e.g., CMOS) circuit with
minimized power consumed. This technology dependent step uses a target
technology. The first step in technology mapping is to decompose each logic
function into two-input gates. The objective of this decomposition is to mini-
mize the total power dissipation by reducing the total switching activity. Fig.
8.1 shows an example of a four-input AND gate decomposition into two dif-
ferent implementations. The probabilities of inputs being at logical "1" are
also shown in Fig. 8.1. Primary inputs are assumed to be uncorrelated. The
switching activity at each internal node is also shown in Fig. 8.1. A two-input
(i, j) AND gate is given by
[Figure 8.1: four-input AND gate decomposition, Implementation 1 and Implementation 2]
We assume also that the gate delays are zero to ignore the power due to the
glitching phenomenon. The total switching activities for implementations 1
and 2 are 0.888 and 1.056, respectively. Therefore, implementation 1 is better
than implementation 2. This problem of decomposition was addressed by [5, 6].
In [5], the power dissipation associated with glitching is neglected, while in
[6] it is not. Taking into account the power dissipation of glitches is very
important, as is discussed in Section 8.2.2.
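The comparison of the two decompositions can be sketched in Python. The sketch assumes uncorrelated inputs, zero gate delay, and a node activity of P(1 - P). The input probabilities of Fig. 8.1 are not readable in this copy, so the values below are hypothetical and the totals do not reproduce 0.888 and 1.056.

```python
def and_prob(p_i, p_j):
    """Probability that a 2-input AND output is 1 (uncorrelated inputs)."""
    return p_i * p_j

def activity(p):
    """Zero-delay switching activity of a node with signal probability p."""
    return p * (1.0 - p)

def chain_activity(probs):
    """((a AND b) AND c) AND d: sum the activity of every gate output."""
    total, p = 0.0, probs[0]
    for q in probs[1:]:
        p = and_prob(p, q)
        total += activity(p)
    return total

def tree_activity(probs):
    """(a AND b) AND (c AND d): balanced tree for exactly four inputs."""
    a, b, c, d = probs
    left, right = and_prob(a, b), and_prob(c, d)
    return activity(left) + activity(right) + activity(and_prob(left, right))

inputs = [0.5, 0.5, 0.5, 0.5]   # hypothetical input probabilities
chain_total = chain_activity(inputs)
tree_total = tree_activity(inputs)
```

With these particular probabilities the chain has the lower total activity; other input probabilities can reverse the ordering, which is why the decomposition has to be evaluated per circuit.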
In this section we do not consider the algorithms for technology mapping. The
reader can consult references [5, 7]. We illustrate this concept of technology
mapping by the following example. Fig. 8.2 shows an example for implementing
the logic circuit of Fig. 8.2(a) into two implementations. The first implemen-
tation [Fig. 8.2(a)] is for minimal area design using an OAI (OR-AND-INVERT)
gate. The second implementation [Fig. 8.2(b)] is for minimal power design
where the high switching node N of Fig. 8.2(a) is hidden using a more complex
gate.
Thus the process of technology mapping is to first decompose the logic func-
tion such that the total switching activity is minimized, and then to hide any
high switching activity node within complex gates so that the capacitance of
that node is minimized. However, making a gate too complex can trade the
delay for low-power. Typical reduction in power dissipation is on the order of
20% without any degradation in performance but at the expense of a small area
penalty.
The quality of the targeted cell library can considerably impact the results of
mapping [8]. For example, the availability of cells with different drive strengths
and double-rail outputs (signal and its complement) gives more flexibility for
logic optimization. A good library can result in 20-50% of power dissipation
reduction.
To reduce this power the first approach is to balance the path delays by chang-
ing the logic structure (e.g., tree) as explained in Section 4.5.5. Another tech-
nique is to balance the delay of the paths by sizing down the gates in the fast
paths [12]. However, this approach can increase the delay of the circuit. Also
insertion of buffers (delay elements) in the fast paths can balance the delay.
However, the added buffers increase the power dissipation.
[Figure: self-timed function; gated function; parallel adder]
sums are not switched until they are evaluated. The additional circuitry in the
conventional approach can consume more power than it may save.
is critical for power savings. Otherwise, the additional circuitry can dissipate
a relatively important power. Note that this added logic slightly increases the
area of the circuit and may also increase the clock cycle. The precomputation
technique can be applied to a multiple output function. However, if the logic
has a large number of outputs, then it may be worthwhile to selectively apply
the precomputation technique to a small number of complex outputs. This selective
partitioning will add a duplication of combinational logic and registers and this
may offset the power savings.
8.3 LP ARCHITECTURE-LEVEL DESIGN
In this section, architecture also means Register Transfer Level (RTL). The
architecture uses a set of primitives such as adders, multipliers, ROMs, register
files, etc. RTL synthesis programs are used to convert an RTL description to
a set of registers and combinational logic. The impact of low-power techniques
on the architecture level can be more significant than the gate level as will
be shown in this section. The techniques to reduce the power dissipation discussed
are: parallelism, pipelining, distributed processing and power management.
8.3.1 Parallelism
Parallelism can be used to reduce the power dissipation at the expense of area
while maintaining the same throughput [10]. To illustrate this, the quantitative
example of Fig. 8.7 is considered. In Fig. 8.7(a), a register supplies two 16-bit
operands to a 16 x 16 multiplier. We refer to this architecture as the reference
one and we use the ref notation for frequency, power supply voltage, power
dissipation, etc. This register is clocked at a maximal frequency $f_{ref}$ = 50 MHz.
We assume that the worst case delay of the multiplication is 20 ns at
$V_{ref}$ = 3.3 V power supply voltage. It is clear that we cannot reduce
$V_{ref}$ to reduce the
Hence
$P_{par} = 0.33\,P_{ref}$
Thus, the power dissipation is significantly reduced.
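The derivation leading to this ratio is missing from this copy. It can be sketched from the dynamic power relation P = C V^2 f. The sketch below assumes two multipliers each clocked at half the reference frequency, a supply scaled by 1/1.83 (3.3 V to about 1.8 V, the figure quoted for the pipelined case), and a switched capacitance of 2.2 times the reference to cover duplication and multiplexing overhead. The capacitance factor is an assumption, not a value recovered from the text.

```python
def relative_power(cap_ratio, v_ratio, f_ratio):
    """P / P_ref for P = C * V^2 * f; each argument is a ratio to the reference."""
    return cap_ratio * v_ratio ** 2 * f_ratio

# Assumed values: 2.2x capacitance, supply divided by 1.83, half clock rate
p_par = relative_power(cap_ratio=2.2, v_ratio=1 / 1.83, f_ratio=0.5)
```

Under these assumptions the ratio comes out close to the 0.33 stated above; the quadratic dependence on supply voltage is what outweighs the doubled hardware.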
The key to this power savings is the duplication of the hardware in parallel
configuration. In general, N processors can be parallelized by duplication, with
each processor running with a slower clock (by a factor of N). In this case, for
the same throughput, the power dissipation can be reduced with the increase of
N. Therefore, the power supply voltage ($V_{DD}$) can be aggressively reduced to
meet a worst case delay almost equal to the reference delay multiplied by N. To
exploit this power supply reduction, the threshold voltage ($V_T$) should also be
reduced to limit the degradation of the delay as $V_{DD}$ approaches $V_T$. Keep in
mind that the scaling of $V_T$ is also limited by the static current considerations.
When the number N is relatively large, the parallelism can lead to several
problems. A highly parallelized configuration can result in a drastic increase of
the occupied area. In addition, there is routing overhead to distribute the input
and output signals. This also increases the area and the wiring capacitance.
Therefore, the power dissipation also tends to increase, which limits the
utility of parallelism.
8.3.2 Pipelining
Pipelining is another architecture that can reduce the power dissipation [10].
As an example, let us consider the case of the 16 x 16 multiplier presented in
Section 8.3.1. The 50 MHz multiplier is broken into two equal parts as shown in
Fig. 8.8. A set of pipeline registers (or latches) is inserted, resulting in a 2-stage
pipelined version of the multiplier. Architectures with more pipeline stages can
be realized. Since the hardware between the pipeline stages is reduced, the
reference voltage $V_{ref}$ = 3.3 V can be reduced to 1.8 V ($V_{ref}$/1.83) to maintain
a worst case delay of 20 ns (50 MHz). The estimated power dissipation is given
by

The switching capacitance has increased slightly due to the pipelining. Thus,
the power dissipation is reduced by a factor of almost 2.8, which is approximately
the same as the parallel case. Also the area increase is relatively low and the
area penalty is due only to the additional registers (or latches). As the pipeline
registers reduce the logic depth, the power dissipation due to glitches is
also reduced.
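The power expression for the pipelined multiplier is not legible in this copy. A minimal sketch of the arithmetic, assuming P = C V^2 f with the clock frequency unchanged, the supply divided by 1.83, and a switched capacitance about 1.2 times the reference to cover the pipeline registers, is given below. The 1.2 factor is an assumption chosen to be consistent with the "almost 2.8" reduction quoted in the text.

```python
V_REF = 3.3          # reference supply voltage in volts (from the text)
V_SCALE = 1.83       # supply reduction factor (from the text)
CAP_RATIO = 1.2      # assumed capacitance increase due to pipeline registers

v_pipe = V_REF / V_SCALE                       # about 1.8 V
p_pipe_ratio = CAP_RATIO * (1 / V_SCALE) ** 2  # same frequency, so f ratio is 1
reduction_factor = 1 / p_pipe_ratio
```

Unlike the parallel case, the frequency term does not change here; the whole saving comes from the lower supply voltage made possible by the shorter logic depth per stage.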
the frequency of operation is reduced. However, the area would increase signifi-
cantly. For low-voltage operation, the threshold voltage should also be reduced to
reduce the power dissipation, otherwise the power supply voltage reduction is limited.
Indeed, at low voltage, $V_{DD}$ approaches $V_T$ and the delay increases drastically.
To maintain the throughput with parallelism/pipelining, the threshold voltage
should be reduced compared to $V_{DD}$.
The two terms $(C_{ai}^2 - C_{bi}^2)$ and $2(C_{ai} - C_{bi})$ are computed a priori and stored
in a memory to reduce the number of operations.
Fig. 8.10(a) shows the centralized implementation of the VQ. It has a cen-
tralized memory, processing element, and controller. This architecture is time-
multiplexed and performs operations sequentially over a large number of
clock cycles. In TSVQ, each level of the tree has specific code vectors that
are found only at that level. Therefore, the memory can be partitioned into
separate memories for each level of the tree. Fig. 8.10(b) shows the distributed
implementation of the VQ. The memory size increases from one module to the
next. The architecture is pipelined, allowing the clock frequency and supply
voltage to be reduced.
vector does not change [15]. Through this partitioning, the power dissipated
by the centralized implementation was reduced by a factor of 11 at the expense
of an area increase by a factor of 2.
From this example we can learn that proper design of the architecture, through
distributed processing, is more power-efficient than the centralized processor.
In the distributed implementation, the different local hardware resources can
be optimized more efficiently than the global hardware in the centralized imple-
mentation. The application of this technique depends on whether the executed
algorithm can be partitioned. Keep in mind that the power saving is traded
against the occupied area, while the throughput is maintained.
There are two types of power management: i) dynamic and ii) static. Dynamic
Power Management (DPM) allows selective shut-down of different blocks of
the chip based on the level of activity required to run a particular application.
Different blocks of the chip may be idle for a certain period of time when
running different applications. For example, the floating point unit can have
100% idle time when the processor is executing integer applications. The DPM
requires additional logic on the chip. This logic is controlled by signals of idle
periods.
In the PowerPC 603 [21], the DPM mode is enabled by software. The DPM
logic automatically stops the clock switching of a specific unit generated by
clock regenerators. The clock regenerators produce two clocks, C1 and C2,
which feed master and slave latches. Two "freeze" input signals control the
clocks, C1 and C2, as shown in the timing diagrams of Fig. 8.11. The logic
needed for DPM does not introduce any performance degradation and it con-
sumes 0.3% of the total die area in the PowerPC. The DPM provides a power
saving of 10-20% depending on the application to be executed. The DPM can
be implemented at either a high level (e.g., execution unit) or a low level (e.g., a
block inside a unit) of hardware.
Static Power Management (SPM) permits the saving of the power dissipation in
the standby mode. In this case, the activity of the entire system is monitored
rather than a specific unit (or block). When the system remains idle for a
significant period of time, then the entire chip is shut down. The SPM may
have several modes depending on whether the entire chip is shut down or a part
of it. For example, the PowerPC 603 has three modes which are programmable
through a hardware bit controlled by software (operating system). In this
microprocessor, one mode is called sleep mode, which allows a maximum power
savings by disabling the clock to all units. In this mode the PLL and external
input clock are disabled to bring the power dissipation down to the leakage
levels. The power of the PowerPC 603 in the sleep mode is as low as 1.8 mW
[20].

[Figure 8.11: timing diagrams of the clocks C1 and C2 and the freeze signals]

Footnote: PowerPC 603 is from IBM Corp.
In the binary TSVQ already presented in Section 8.3.3, the codebook is orga-
nized into a tree structure as shown in Fig. 8.12. The input vector is compared
with two code vectors at each node. Based on this comparison, one of the two
branches is chosen and the codebook search space is reduced compared to the
full search, since a reduced number of code vectors (16) is utilized. For each
comparison, at a specific level, an index bit is generated as shown in Fig. 8.12.
The process of comparison through the tree is repeated until a leaf node is
reached. For a codebook of 256 levels, the tree has a depth of 8 (d = 7). Com-
pared to the full search, the number of memory accesses and executed operations
is reduced considerably since only 16 code vectors are used in the TSVQ algo-
rithm. One VLSI implementation of the TSVQ algorithm uses systolic arrays
[22].

[Figure 8.12: tree-structured codebook, levels d = 0 to d = 7]
operations is reduced. Table 8.1 [15] shows the computation complexity of the
three methods of the VQ. The differential TSVQ results in a lower number of
operations to be executed for each type.
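The saving of the tree search over the full search can be illustrated with a short Python sketch. It walks a binary tree of depth 8, comparing the input with two code vectors per level and emitting one index bit per level, so only 16 code vectors are touched instead of 256. The one-dimensional toy codebook and all function names are hypothetical, chosen only to keep the example small; this is the plain TSVQ search, not the differential variant.

```python
def sq_dist(x, c):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, c))

def toy_tree(depth):
    """Hypothetical 1-D codebook: each node halves its interval of [0, 1).

    tree[level][node] holds the (left, right) child code vectors.
    """
    tree = []
    for level in range(depth):
        w = 1.0 / 2 ** (level + 1)          # width of a child interval
        tree.append([(((2 * n + 0.5) * w,), ((2 * n + 1.5) * w,))
                     for n in range(2 ** level)])
    return tree

def tsvq_search(x, tree):
    """Return (codebook index, number of code vectors compared)."""
    node, index, compared = 0, 0, 0
    for level_nodes in tree:
        left, right = level_nodes[node]
        compared += 2
        bit = 0 if sq_dist(x, left) <= sq_dist(x, right) else 1
        index = (index << 1) | bit          # one index bit per level
        node = 2 * node + bit
    return index, compared

index, compared = tsvq_search((0.7,), toy_tree(8))
```

For this toy codebook the returned index is simply the number of the 1/256-wide interval containing the input, and the comparison count matches the 16 code vectors mentioned in the text.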
Table 8.2 shows a comparison of the 3-bit representation of the binary and Gray
codes. Note that the Gray code has only one transition for sequential change
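Binary code   Gray code   Decimal equivalent
000           000         0
001           001         1
010           011         2
011           010         3
100           110         4
101           111         5
110           101         6
111           100         7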
In [23], the switching property of the address coding was measured using the
number of bit switches per executed instruction. For instruction accesses, both
the Gray and binary coding were compared using benchmark programs. The
maximum reduction in bit switches was found to be as high as 58% and the
average reduction was equal to 31%. The same study was also carried out for
data addresses. The average reduction of bit switches was 8%.
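The difference between the two codes for sequential addresses can be checked with a few lines of Python. The sketch counts bit switches over the 3-bit address sequence 0 to 7; the helper names are illustrative and the figures are for this idealized sequential sweep only, not for the benchmark programs of [23].

```python
def to_gray(n):
    """Convert a binary-coded integer to its Gray-coded value."""
    return n ^ (n >> 1)

def bit_switches(seq):
    """Total number of bit flips between consecutive words of a sequence."""
    return sum(bin(a ^ b).count("1") for a, b in zip(seq, seq[1:]))

addresses = list(range(8))                       # sequential 3-bit addresses
binary_switches = bit_switches(addresses)
gray_switches = bit_switches([to_gray(a) for a in addresses])
```

Each step of the Gray sequence flips exactly one bit, so seven steps cost seven switches, while the binary sequence costs more because carries flip several bits at once.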
The most accurate power simulator to date is still SPICE. However, it can han-
dle only very small circuits (e.g., hundreds of transistors). SPICE accurately
takes into account non-linear capacitances (junction and gate) which cannot
be captured by higher level tools. Also, it can accurately measure short-circuit
and leakage currents. The latter is very important for low-$V_T$ applications.
SPICE cannot be used to estimate the power of large circuits or chips, due to
the time consuming nature of the simulator. It is a pattern-dependent power
analysis tool.
Footnote: Dynamically computed power should not be confused with dynamic power.
PowerMill can also identify the hot spots (which consume more dynamic
power) and trouble spots (which consume unexpectedly large amounts of leakage
current). Moreover, elements with excessive short-circuit current are detected. This
allows the designer to resize the circuit to reduce the rise/fall time. Static
reduced-swing nodes are detected as shown in the example of Fig. 8.13. The
node A is charged to $V_{DD} - V_T$ when the input is low.
This approach, based on the Monte Carlo method, requires simulation over a
large number of measurements. The advantage of the statistical techniques is
that they can be built around existing simulation tools.
Footnote: PowerMill is from EPIC Design Technology.
where N is the number of nodes in the network and $C_i$ is the total physical
capacitance of node i. $\alpha_i$ is the switching activity (also called transition
probability, $P_t$) given by
$\alpha_i = P_i(1 - P_i)$ (8.13)
where $P_i$ is the probability that the node i is at high level. In this expression of
activity it is assumed that the circuit input and internal nodes are independent
(spatial independence). Also the values of the same signal, in two consecutive
clock cycles, are assumed independent (temporal independence).
$\frac{\partial y}{\partial x_i} = y(x_i = 1) \oplus y(x_i = 0)$ (8.14)
It was shown in [29] that if the $x_i$ are spatially independent, then the density of
the Boolean function is given by
(8.15)
(8.16)
The factor 1/2 is added to account for the double transition per clock period.
This model, based on transition density, ignores the spatial correlation of the
signals and computes, approximately, the power due to glitches. The work
in [28] attempts to handle both spatial and temporal correlations. One disad-
vantage of the approach in [28] is that the use of BDDs, for the whole circuit,
tends to limit the size of the network that can be analyzed.
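Equations (8.15) and (8.16) are not legible in this copy. The sketch below assumes the commonly used propagation rule in which the output density is the sum, over all inputs, of the probability that the Boolean difference of (8.14) is 1 multiplied by the density of that input. It enumerates the truth table directly, so it is only practical for small gates; the function names and the numeric densities are hypothetical.

```python
from itertools import product

def boolean_difference_prob(f, i, probs):
    """P(dy/dx_i = 1) for independent inputs with signal probabilities probs."""
    n = len(probs)
    others = [j for j in range(n) if j != i]
    total = 0.0
    for bits in product((0, 1), repeat=n - 1):
        args = [0] * n
        weight = 1.0
        for j, b in zip(others, bits):
            args[j] = b
            weight *= probs[j] if b else 1.0 - probs[j]
        args[i] = 1
        y_hi = f(*args)
        args[i] = 0
        y_lo = f(*args)
        if y_hi != y_lo:            # output is sensitive to x_i here
            total += weight
    return total

def transition_density(f, probs, densities):
    """Assumed rule: D(y) = sum_i P(dy/dx_i) * D(x_i)."""
    return sum(boolean_difference_prob(f, i, probs) * densities[i]
               for i in range(len(probs)))

def and2(a, b):
    return a & b

def xor2(a, b):
    return a ^ b

# Hypothetical inputs: probability 0.5 each, densities in transitions per unit time
d_and = transition_density(and2, [0.5, 0.5], [2.0, 4.0])
d_xor = transition_density(xor2, [0.5, 0.5], [2.0, 4.0])
```

The XOR gate passes every input transition to its output, while the AND gate passes a transition only when the other input is 1, which is why its output density is half as large in this example.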
The probabilistic techniques have the advantage that the user does not have
to supply simulation patterns and they are claimed to have fast computation
time. However, they do not account for the internal power of the gates and
static power dissipation. These techniques can be used, for example, as a fast
power estimator for logic synthesis. They might also be suited for comparing
various subsystem structures.
A set of power vectors describes all possible events where power can be dis-
sipated by the cell for dynamic and static cases. With SPICE these power
events are accurately characterized. There are two types of power vectors: i)
dynamic and ii) static. A dynamic power vector describes an event in which
power is dissipated due to a signal switching at the cell inputs. For example,
for a 2-input (A and B) AND gate, when A = 1 and B makes a transition from
0 to 1, an energy is dissipated. A static power vector describes the conditions
of logic signals under which leakage power occurs.
The designer creates a design from the cell library at gate level, then it is input
to the Aspen (A System for Power EstimatioN) system. Also the stimulus
to drive the logic simulator and the interconnect loads, representing the inter-
cell connectivity (estimates or actual values provided by back-annotation from
layout), are specified. A logic simulator such as Verilog-XL is used as an
event-driven simulator. Upon invocation, Aspen monitors the power event
occurrence (node activity) of each cell and computes the total power dissipation
as the sum of the power dissipation of all the cells in the power vector paths.
Multiple time windows can be specified for simulation to compute the average
power over different time periods. Note that Aspen uses the power vectors of
a cell to compute the total power.
The static power vector is used to compute the leakage of a cell. Note that
the static power of a cell is dependent on the logic state of a cell, as shown in
Fig. 8.15. To compute the static power dissipation, the duration of activation
time of the corresponding static power vector is measured. A transition of a net
signal may cause a static power vector to be activated and another vector to be
deactivated. Vectors are time stamped during activation and upon deactivation.
Then the total time length in which the vector is active is found. The activation
time length of the static power vector is multiplied by the power dissipation
value (per time unit) to obtain the static power of the vector. Again the static
power dissipation for all vectors associated with a cell instance is summed to
derive the total power dissipation.
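The bookkeeping described above can be sketched in Python. This is not Aspen's interface; the data structures, vector names and numeric values are hypothetical and only illustrate the accounting: each occurrence of a dynamic power vector contributes its characterized energy, and each static power vector contributes its power multiplied by the time it was active.

```python
def dynamic_energy(events, energy_per_vector):
    """Sum the characterized energy (J) of every dynamic vector occurrence."""
    return sum(energy_per_vector[name] for name in events)

def static_energy(intervals, power_per_vector):
    """intervals: (vector name, activation time, deactivation time) in seconds."""
    return sum(power_per_vector[name] * (t_off - t_on)
               for name, t_on, t_off in intervals)

def average_power(events, intervals, energy_per_vector, power_per_vector, window):
    """Average power (W) of one cell over a simulation window (s)."""
    total = (dynamic_energy(events, energy_per_vector)
             + static_energy(intervals, power_per_vector))
    return total / window

# Hypothetical characterization data for a 2-input AND cell
energy_per_vector = {"A=1,B:0->1": 2.0e-12}     # joules per event
power_per_vector = {"A=0,B=0": 1.0e-9}          # watts while active

events = ["A=1,B:0->1"] * 3                     # three switching events
intervals = [("A=0,B=0", 0.0, 60e-9)]           # active for 60 ns
p_avg = average_power(events, intervals, energy_per_vector,
                      power_per_vector, window=100e-9)
```

Summing the same quantity over all cell instances would give the chip-level figure; separate windows can be evaluated by restricting the events and intervals to each time span.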
The results reported by Aspen, such as the switching activity of nodes, can
be used to drive floorplanning, placement and routing tools. Also Aspen can
handle chips with a complexity of several hundred thousand gates and is four
orders of magnitude faster than SPICE. It produces results within 10% accuracy
of SPICE results. One disadvantage of Aspen is that it cannot handle power
due to the glitches.
The power of the on-chip memory is modeled for a certain memory architecture.
The interconnections are defined in two categories: local and intermediate
interconnections, and global busses. The local interconnection is defined as
interconnections within a logic gate. The intermediate interconnections are
used for connection between gates or functional blocks (subsystems). The
global bus includes data, control, and address busses. The lengths of local and
intermediate interconnections are modeled by Rent's rule [34]. Then the power
can be computed from the lengths using a fixed switching activity equal to the
one specified for the logic. The global interconnect is determined from the
dimensions of the chip and the number of drivers/receivers connected to it.
The power model of the clock network is based on the H-tree [34] and the chip
dimensions. The power of the on-chip drivers is also modeled in two components.
One is the power used to drive the off-chip total capacitance. The other is the
power consumed by the pad driver itself. The activity factor for the pads is
assumed fixed and is equal to 1 [33].
The tool developed in [33] is used as a power estimator in the early stage of
the design. It requires some technological parameters (feature size, gate oxide
thickness, parameters of the interconnection layers, etc.), the supply voltage,
the chip area, the switching factor and the gate count. This tool can only be
used as a rough estimator of the total power of the chip since the switching
activity is assumed fixed through the design. Therefore the power partition
between the different units can be incorrectly estimated.
where G is the number of the logic gates composing the functional block, $\alpha_i$ is
the switching activity of the ith gate, $C_i$ is the load of the ith gate, $i_{sc,i}$ is the
short circuit component, and f is the frequency. This power equation can be
expressed in more compact form as
For a VLSI chip, composed of several functional blocks, the total power dissi-
pation can be determined by summing the power of all blocks. We have
PM = niG,f, (8.20)
d, b l e r l .
Consequently, this technique does not account for the strong dependency of
power consumption on the statistics of the input data [36]. The next section
treats the case of power modeling, taking into account the correlated behavior
of the bits.
8.6.3.3 Dual Bit Type Model
In digital signal processing, correlation can exist between values of a temporal
sequence of data. The UWN model can lead to an error in estimating the
power of a circuit even if the bit-width utilization is maximized. To take into
account the data correlation, the Dual Bit Type (DBT) data model has been
proposed in [36, 37]. The DBT data model permits accurate estimation of the
power dissipation.
[Figure 8.18: transition activity P(0→1) versus bit position (14 down to 0) for ρ = -0.99, -0.80, -0.60, 0.0, 0.60, 0.80, 0.99]
Fig. 8.18 shows the transition activity for several different two's complement
data streams versus the bit (for an n-bit word). In this figure, each curve
corresponds to a different temporal correlation given by
$\rho = \frac{\mathrm{cov}(X_{t-1}, X_t)}{\sigma^2}$ (8.22)
where $X_{t-1}$ and $X_t$ are successive data (in time) and $\sigma^2$ is the variance.
ρ = 0 corresponds to the white noise case, where P(0→1) = 0.25. From Figure
8.18 it is evident that the UWN model, while sufficient for describing activity
in the Least Significant Bits (LSBs), is inadequate for the Most Significant Bit
(MSB) region. The UWN model works correctly for the LSBs up to the break
point BP0. The MSB region corresponds to the sign bits and consequently,
the signal and transition probabilities of these bits are far from random. ρ > 0
corresponds to a lower activity for positively correlated signals, while ρ < 0
corresponds to a higher activity for negatively correlated signals. The MSB
region starts from the break point BP1. The region between BP0 and BP1
can be modeled by linear interpolation. BP0 and BP1 can be determined from
the word-level statistics [37].
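The three regions described above can be written as a piecewise function. The sketch below takes the break points and the sign-bit activity as given inputs, since their derivation from word-level statistics is in [37] and not reproduced here; the function name and the example values are hypothetical.

```python
def dbt_activity(bit, bp0, bp1, msb_activity, lsb_activity=0.25):
    """Per-bit transition activity P(0->1) in a dual-bit-type style model.

    bit          : bit position (0 = least significant)
    bp0, bp1     : break points, assumed known with bp0 < bp1
    msb_activity : activity of the sign-bit region (depends on correlation)
    lsb_activity : white-noise activity of the LSB region (0.25 in the text)
    """
    if bit <= bp0:
        return lsb_activity                      # UWN region
    if bit >= bp1:
        return msb_activity                      # sign-bit region
    frac = (bit - bp0) / (bp1 - bp0)             # linear interpolation
    return lsb_activity + frac * (msb_activity - lsb_activity)

# Hypothetical positively correlated stream: low sign-bit activity
profile = [dbt_activity(b, bp0=4, bp1=10, msb_activity=0.05) for b in range(16)]
```

A negatively correlated stream would be modeled the same way with a sign-bit activity above 0.25.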
shifters, RAMs, ROMs, etc. The power dissipation is modeled for each module
by
$P = C V_{DD}^2 f$ (8.23)
where the switched capacitance C is related to the complexity and the activity
of the module. For example, for an n-bit ripple-carry subtractor, the switching
capacitance is modeled by
$C = C_{eff}\, n$ (8.24)
where $C_{eff}$ is a capacitive coefficient (in fF/bit) determined from the DBT
model. $C_{eff}$ can be a single coefficient for the UWN case. The DBT model
employs several coefficients for $C_{eff}$, which reflect the data representation and
signal statistics. For the case of the subtractor, for example, a table of $C_{eff}$ is
generated as a function of all possible data transitions, i.e., sign bit transitions
and random LSB transitions.
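Equations (8.23) and (8.24) combine into a one-line estimate for a module. The sketch below uses a single capacitive coefficient, as in the UWN case; the coefficient value of 50 fF/bit is hypothetical and would normally come from library characterization.

```python
def module_power(c_eff_ff_per_bit, n_bits, vdd, freq_hz):
    """Power (W) of an n-bit module: C = C_eff * n, P = C * VDD^2 * f."""
    cap_farads = c_eff_ff_per_bit * 1e-15 * n_bits
    return cap_farads * vdd ** 2 * freq_hz

# Hypothetical 16-bit ripple-carry subtractor at 3.3 V and 50 MHz
p_sub = module_power(c_eff_ff_per_bit=50.0, n_bits=16, vdd=3.3, freq_hz=50e6)
```

With the DBT model, the single coefficient would be replaced by a lookup of the coefficient for each transition type, weighted by how often that transition occurs.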
To extract the capacitive coefficients of each module, the library should be char-
acterized. This operation is performed one time for one library. The process of
extraction consists of several steps:
■ Capacitive coefficient extraction. The simulation step produces the
average effective switching capacitances for the entire series of applied
input transitions such as U → U, S → S, etc. The capacitive
coefficients are extracted from the effective switching capacitances and
the complexity parameters.
One approach for power estimation, at the behavioral level, has been proposed
in [38]. It is based on the combination of analytical and stochastic power
models. In this work, a class of applications such as real time DSPs is considered
for the power estimator. In the behavioral context, the power consumed by a
hardware resource is given by
$P = N_a C V^2 f$ (8.25)
where $N_a$ is the number of accesses to the resource over the period of computa-
tion, C is the average capacitance switched per access and f is the computation
frequency.
[3] A. Shen, A. Ghosh, S. Devadas, and K. Keutzer, "On Average Power Dis-
sipation and Random Pattern Testability of CMOS Combinational Logic
Networks," Proc. of the International Conference on Computer-Aided De-
sign, pp. 402-407, November 1992.
[4] K. Keutzer, "The Impact of CAD on the Design of Low Power Digital
Circuits," IEEE Symposium on Low Power Electronics, Tech. Dig., pp.
42-45, October 1994.
[5] C-Y. Tsui, M. Pedram, and A. M. Despain, "Technology Decomposition
and Mapping Targeting Low Power Dissipation," 30th ACM/IEEE Design
Automation Conference, Tech. Dig., pp. 68-73, June 1993.
[6] R. Murgai, R. K. Brayton, and A. Sangiovanni-Vincentelli, "Decomposi-
tion of Logic Functions for Minimum Transition Activity," Proc. of the
International Workshop on Low Power Design, pp. 33-38, April 1994.
[7] V. Tiwari, P. Ashar, and S. Malik, "Technology Mapping for Low Power,"
30th ACM/IEEE Design Automation Conference, Tech. Dig., pp. 74-79,
June 1993.
[8] K. Scott and K. Keutzer, "Improving Cell Libraries for Synthesis," IEEE
Custom Integrated Circuits Conference, Tech. Dig., pp. 128-151, May 1994.
[9] C. Lemonds and S. Mahant Shetti, "A Low Power 16 by 16 Multiplier using
Transition Reduction Circuitry," Proc. of the International Workshop on
Low Power Design, pp. 139-142, April 1994.
[21] S. Gary, C. Dietz, J. Eno, G. Gerosa, S. Park, and H. Sanchez, "The Pow-
erPC 603 Microprocessor: A Low-Power Design for Portable Applications,"
Proc. of COMPCON'94, Tech. Dig., pp. 307-315, February 1994.
[22] R. K. Kolagotla, S-S. Yu, and J. F. JaJa, "VLSI Implementation of a Tree
Searched Vector Quantizer," IEEE Transactions on Signal Processing, vol.
41, no. 2, pp. 901-905, February 1993.
[23] C-L. Su, C-Y. Tsui, and A. M. Despain, "Low Power Architecture Design
and Compilation Techniques for High-Performance Processors," Proceed-
ings of COMPCON'94, Tech. Dig., pp. 489-498, February 1994.
[24] A-C. Deng, "Power Analysis for CMOS/BiCMOS Circuits," Proc. of the
International Workshop on Low Power Design, pp. 3-8, April 1994.
[25] C. M. Huizer, "Power Dissipation Analysis of CMOS VLSI Circuits by
Means of Switch-Level Simulation," Proc. of the European Solid-State Cir-
cuits Conference, pp. 61-64, 1990.
[26] M. A. Cirit, "Estimating Dynamic Power Consumption of CMOS Cir-
cuits," IEEE International Conference on Computer Aided Design, pp.
534-537, November 1987.
[38] R. Mehra, and J. Rabaey, "Behavioral Level Power Estimation and Explo-
ration," Proceedings of the International Workshop on Low Power Design,
Napa, CA, pp. 197-202, April 1994.
INDEX