Clustered Voltage Scaling Technique
for Low-Power Design
Kimiyoshi Usami* and Mark Horowitz
Stanford University, Stanford, CA 94305
Abstract
This paper describes a technique to reduce power without changing circuit performance by making use of two supply voltages. Gates off the critical path are run at the lower supply to reduce power. To minimize the number of interfacing level converters needed, our algorithm clusters the circuits that operate at the reduced voltage, leading to clustered voltage scaling (CVS). We applied the CVS technique to design examples of control logic in a real microprocessor, which had been implemented using a low-power library. Minimum power is achieved by combining gate re-sizing with the CVS technique. The CVS technique was able to reduce the power by a further 10-20%.
1. Introduction
Voltage scaling is one of the most effective techniques for reducing the power consumption of CMOS circuits [1]. The majority of the power is dynamic power, which is reduced quadratically with the supply voltage VDD [2]. The current in the MOS transistor, however, also decreases with VDD, which increases the delay of the circuit and degrades performance.
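As a rough illustration of the quadratic dependence, the sketch below uses the standard first-order model P = α·C·VDD²·f for dynamic power. The activity factor, switched capacitance and frequency are hypothetical placeholders; only the ratio between the two supply voltages matters here.

```python
def dynamic_power(alpha: float, c_load: float, vdd: float, freq: float) -> float:
    """First-order CMOS dynamic (switching) power: P = alpha * C * VDD^2 * f."""
    return alpha * c_load * vdd ** 2 * freq


# Hypothetical operating point: activity 0.1, 100 pF switched, 100 MHz.
p_5v = dynamic_power(0.1, 100e-12, 5.0, 100e6)
p_3v = dynamic_power(0.1, 100e-12, 3.0, 100e6)
ratio = p_3v / p_5v  # (3/5)^2 = 0.36, i.e. a 64% reduction at the same f and C
print(f"P(5V) = {p_5v * 1e3:.1f} mW, P(3V) = {p_3v * 1e3:.1f} mW, ratio = {ratio:.2f}")
```

The model ignores short-circuit and leakage power, which is consistent with the statement that dynamic power dominates.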
In the design of most microprocessors or ASIC chips, the operating frequency is set by the target market. The timing constraints in the chip are in turn set by the operating frequency. Designers need to optimize the design to reduce the power consumption within the specified timing constraints. If the supply voltage is reduced under a constant Vth, the critical-path delay will not meet the timing constraints. Although some researchers try to compensate for the lost performance by up-sizing the transistors [?], this approach reaches a point of limited returns. Moreover, in a real design, it is likely that meeting the timing constraints will be very difficult even without a scaled supply.

*Current address is Toshiba Corp., 580-1 Horikawa-cho, Saiwai-ku, Kawasaki, Japan. This work was done while he was a visiting scholar at Stanford.

ISLPED '95, Dana Point, CA, USA. © 1995 ACM
In real control logic, not every path is critical. Non-critical paths have positive "slack", the difference between the required time and the arrival time of signals. A power reduction technique has been reported in [3], in which the transistors are downsized wherever excessive slack exists.
Since transistors become less efficient as they get very small, and there is a fixed minimum size, we propose "Clustered Voltage Scaling" (CVS), which reduces the supply voltage only for the circuits that have excessive slack. The next section presents our method, and Section 3 presents results from using this algorithm.
2. Clustered Voltage Scaling
2.1 Clustered Voltage Scaling Structure
The basic idea is to provide two different supply voltages: VDDL (the reduced voltage) and VDDH (the original, unreduced voltage). Circuits with excessive slack are made to operate at VDDL, while those along the critical paths are made to operate at VDDH. However, there are a couple of problems when two different supply voltages coexist. One is the static current flowing at the interface between the VDDL part and the VDDH part. As shown in Fig. 2.1, if the output of a circuit operating at VDDL (called the VDDL circuit) is connected directly to the input of a circuit operating at VDDH (called the VDDH circuit), static current flows in the VDDH circuit when its input is "High". Since the voltage of node N1 is not raised higher than VDDL even at the "High" level, the PMOS (MP1) cannot be cut off if VDDL < VDDH − |Vtp|. This causes static current to flow from VDDH to VSS through MP1 and MN1, which is not desirable for low-power-oriented design. A level converter, like the one shown in Fig. 2.2, blocks the static current if it is inserted at node N1. It should be noted that no level converter is needed in the reverse case, i.e., when the output of a VDDH circuit is connected to the input of a VDDL circuit. The static current does not flow because the input of the VDDL circuit is driven all the way up to VDDH.
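The interface condition can be written as a small check. This is a minimal sketch assuming an idealized threshold model, where the receiving PMOS conducts whenever its |Vgs| exceeds |Vtp| and a full-swing driver's "High" level equals its own supply; the supply and |Vtp| values are hypothetical.

```python
def static_current_flows(v_in_high: float, vdd_receiver: float, vtp_abs: float) -> bool:
    """True if the receiving gate's PMOS stays on while its input is "High".

    With the input at v_in_high, the PMOS sees |Vgs| = vdd_receiver - v_in_high,
    and (in this idealized model) it conducts when |Vgs| exceeds |Vtp|.
    """
    return vdd_receiver - v_in_high > vtp_abs


def needs_level_converter(vdd_driver: float, vdd_receiver: float, vtp_abs: float) -> bool:
    """A full-swing driver's "High" level equals its own supply voltage."""
    return static_current_flows(vdd_driver, vdd_receiver, vtp_abs)


VDDH, VDDL, VTP = 5.0, 3.0, 0.8  # hypothetical values
print(needs_level_converter(VDDL, VDDH, VTP))  # VDDL -> VDDH: True
print(needs_level_converter(VDDH, VDDL, VTP))  # VDDH -> VDDL: False
```

The asymmetry is the point: only VDDL-to-VDDH connections need a converter, which is what the clustered structure exploits.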
Fig. 2.1 Direct Connection of VDDL Circuit and VDDH Circuit
Fig. 2.2 Conventional Level Converter
Although the level converter shown in Fig. 2.2 prevents the static power, its dynamic power is fairly large when toggling. In our measurement, it dissipates as much as 0.5mW per circuit at 100MHz, with a delay of 0.5ns, using 0.3µm CMOS. If we produce a structure in which VDDH circuits and VDDL circuits alternate, such as "VDDH circuit → VDDL circuit → VDDH circuit → VDDL circuit → ...", a lot of level converters are needed at the VDDL-to-VDDH interfaces. The sum of their power could cancel out the power reduction gained by partially reducing the supply voltage.
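The cost of an alternating structure can be estimated by counting the VDDL-to-VDDH interfaces along a path. In this sketch a path is encoded as a string of hypothetical 'H'/'L' supply labels, and 0.5 mW per converter is assumed from the measurement in the text.

```python
def count_level_converters(path: str) -> int:
    """Count VDDL->VDDH interfaces along a path of cells.

    `path` lists cell supplies from input to output, e.g. "HLHLH",
    where 'H' is a VDDH cell and 'L' a VDDL cell. Only an 'L' followed
    by an 'H' needs a converter; 'H' followed by 'L' does not.
    """
    return sum(1 for a, b in zip(path, path[1:]) if a == "L" and b == "H")


CONVERTER_POWER_MW = 0.5  # assumed per-converter power at 100 MHz

for path in ["HLHLHLH", "HHHLLLL"]:
    n = count_level_converters(path)
    print(path, n, "converters,", n * CONVERTER_POWER_MW, "mW")
```

The alternating path pays for three converters, while the clustered path pays for none inside the logic (a converter may still be needed where the VDDL cluster meets a VDDH destination at the outputs).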
To avoid this problem, we propose the Clustered Voltage Scaling (CVS) structure shown in Fig. 2.3. The structure minimizes the number of level converters needed. Assuming the combinational logic is implemented using standard cells, the structure
primary_inputs → VDDH cells → VDDL cells → level converters → primary_outputs
is formed on all paths from the primary inputs to the primary outputs. This leads to the formation of one cluster of VDDH cells and one cluster of VDDL cells. Supply voltage VDDL is given to the shaded cells in Fig. 2.3. Level converters are inserted only at the VDDL-to-VDDH connections that follow the last cells of the VDDL cluster. The CVS structure has another advantage: the layout overhead is small because of the clustered structure. When a designer performs placement and routing using standard cells, he/she groups the cells in the VDDH cluster and those in the VDDL cluster, and then places them in separate rows.
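The structural rule behind CVS can be checked mechanically: in a legal netlist no VDDL cell drives a VDDH cell. The sketch below assumes a hypothetical netlist representation (a fanout dictionary plus an 'H'/'L' supply map, with primary outputs omitted); it is not the paper's data structure.

```python
def cvs_violations(fanouts: dict[str, list[str]], supply: dict[str, str]) -> list[tuple[str, str]]:
    """Return (driver, load) pairs where a VDDL cell drives a VDDH cell.

    In the CVS structure every path runs VDDH cells -> VDDL cells ->
    level converters -> primary outputs, so a legal netlist has no such pair.
    `fanouts` maps each cell to the cells it drives (primary outputs omitted).
    """
    return [
        (cell, load)
        for cell, loads in fanouts.items()
        for load in loads
        if supply[cell] == "L" and supply[load] == "H"
    ]


fanouts = {"G1": ["G2"], "G2": ["G3"], "G3": []}
clustered = {"G1": "H", "G2": "L", "G3": "L"}
mixed = {"G1": "H", "G2": "L", "G3": "H"}
print(cvs_violations(fanouts, clustered))  # []
print(cvs_violations(fanouts, mixed))      # [('G2', 'G3')]
```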
(LC: Level Converter)
Fig. 2.3 Clustered Voltage Scaling (CVS) Structure
2.2 Latch with Level-Conversion Function
If the primary output is connected to a latch, the level converter can be merged into the latch as shown in Fig. 2.4. This is a latch with level-conversion function (LCF). The reduced voltage swing at node "IN" is converted to a 5V swing at the output Q. The power and delay of the latch versus the input voltage (VIN) are plotted in Fig. 2.5. The power and delay data were obtained using SPICE, assuming an operating frequency of 100MHz and 0.3µm CMOS technology. The delay was measured from IN to Q.
Fig. 2.4 Latch with Level-Conversion Function
Fig. 2.5 Power and Delay of Latch with LCF (power and delay vs. VIN in volts)
The power and the delay both increase as VIN gets lower, because the slope of the voltage waveform at node D_bar decreases. We compared the power of a simple combination of the conventional level converter and the conventional latch with that of the latch with LCF. At VIN = 3V, the power of the simple combination is 1.05mW, because the level converter consumes 0.55mW and the latch 0.5mW, while the latch with LCF consumes much less power per circuit than the simple combination.
2.3 Algorithms and Heuristics for Creating CVS Structure
A. Basic Algorithm
For the gate-level netlist, we perform a backward graph traversal using the Depth-First-Search (DFS) algorithm, from the primary outputs toward the primary inputs. Each time we visit a cell, we try to replace it with a VDDL cell. If the timing constraints are still met after the replacement, the cell is replaced. This process is repeated until we have visited all the cells in the module. We will use Fig. 2.7 as an example.
Fig. 2.7 Initial Circuit
Graph traversal is initiated at the primary outputs (o1, o2, o3, o4). Which of them is taken first is determined by a heuristic, discussed later. When o1 is taken first, replacement is tried at the cell "G8". The degraded delay of G8 when its supply voltage is reduced from VDDH to VDDL is computed by the following equation:
$$\mathrm{Delay}_{V_{DDL}} = \frac{V_{DDL}\,(V_{DDH}-V_{th})^{2}}{V_{DDH}\,(V_{DDL}-V_{th})^{2}} \times \mathrm{Delay}_{V_{DDH}}$$
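The degraded-delay step can be sketched under the assumption that gate delay scales as VDD/(VDD − Vth)², the first-order square-law MOSFET model, which gives Delay(VDDL) = Delay(VDDH) · [VDDL (VDDH − Vth)²] / [VDDH (VDDL − Vth)²]. The supply values and Vth = 0.8 V below are hypothetical.

```python
def degraded_delay(delay_vddh: float, vddh: float, vddl: float, vth: float) -> float:
    """Scale a cell delay from VDDH to VDDL with a square-law current model.

    Gate delay is taken as proportional to VDD / (VDD - Vth)^2, so
    Delay(VDDL) = Delay(VDDH) * [VDDL (VDDH - Vth)^2] / [VDDH (VDDL - Vth)^2].
    """
    if vddl <= vth:
        raise ValueError("VDDL must exceed Vth for the model to apply")
    return delay_vddh * (vddl * (vddh - vth) ** 2) / (vddh * (vddl - vth) ** 2)


# A 1.0 ns cell at 5 V slows to about 2.19 ns at 3 V (hypothetical Vth = 0.8 V).
d = degraded_delay(1.0, vddh=5.0, vddl=3.0, vth=0.8)
print(f"{d:.2f} ns")
```

Short-channel devices follow a weaker velocity-saturated dependence, so this square-law factor should be read as an illustration rather than a calibrated model.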
Static timing analysis by PERT [4] is performed using the degraded delay to check the arrival time, required time and slack at each node. If the slack is positive, the VDDH cell at G8 is replaced with a VDDL cell and the backward traversal continues. The cell G5 is visited next, and replacement is tried by checking the slack with the static timing analysis. If G5 is successfully replaced, the fanins of G5 are examined.
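PERT-style static timing analysis computes arrival times in a forward pass, required times in a backward pass, and slack as their difference. The sketch below assumes a combinational netlist given as a fanin dictionary, primary inputs arriving at t = 0, and a single required time at every primary output; it uses the standard-library graphlib module (Python 3.9+).

```python
from graphlib import TopologicalSorter


def pert_slacks(fanins, delays, t_req):
    """PERT-style static timing analysis on a combinational netlist.

    fanins: cell -> list of driving cells (primary inputs omitted, arriving at t=0)
    delays: cell -> cell delay
    t_req:  required time at every primary output (e.g. the clock budget)
    Returns (arrival, required, slack) dictionaries keyed by cell.
    """
    order = list(TopologicalSorter(fanins).static_order())  # drivers before loads

    # Forward pass: latest arrival at each cell's output.
    arrival = {}
    for c in order:
        arrival[c] = delays[c] + max((arrival[f] for f in fanins[c]), default=0.0)

    fanouts = {c: [] for c in fanins}
    for c, drivers in fanins.items():
        for f in drivers:
            fanouts[f].append(c)

    # Backward pass: earliest required time; a cell with no fanout drives an output.
    required = {}
    for c in reversed(order):
        required[c] = min((required[g] - delays[g] for g in fanouts[c]), default=t_req)

    slack = {c: required[c] - arrival[c] for c in order}
    return arrival, required, slack


fanins = {"G1": [], "G2": [], "G3": ["G1", "G2"], "G4": ["G3"], "G5": ["G1"]}
delays = {"G1": 1.0, "G2": 2.0, "G3": 1.0, "G4": 1.0, "G5": 1.0}
_, _, slack = pert_slacks(fanins, delays, t_req=4.0)
print(slack)  # G2 -> G3 -> G4 is critical (slack 0); G5 has slack 2
```

Cells with positive slack, such as G5 here, are the candidates for VDDL replacement.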
Conversely, if negative slack is detected, the cell is not replaced. For example, if negative slack is detected when trying to replace G5, G5 is not replaced and is marked as "unreplaceable". We stop the traversal from its fanins (G2, G3, G4) toward the primary inputs and start a new backward traversal at the next primary output.
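The basic loop can be sketched as a greedy backward DFS. This is a minimal sketch, not the Power Slimmer implementation: it assumes a uniform VDDL slowdown factor for every cell, treats a worst-case arrival time at or below the required time as non-negative slack, and simply skips cells whose fanouts are still VDDH (the fanout handling described next is not included). All names are hypothetical.

```python
def cvs_basic(fanins, fanouts, output_order, delay_h, t_req, slowdown):
    """Greedy backward-DFS sketch of the basic CVS replacement loop.

    fanins / fanouts: cell -> list of cells (primary inputs/outputs omitted)
    output_order:     cells driving primary outputs, in heuristic order
    delay_h:          cell delay at VDDH
    t_req:            required time at the primary outputs
    slowdown:         assumed uniform delay factor for a VDDL cell
    Returns the set of cells assigned to VDDL.
    """
    low, unreplaceable = set(), set()

    def worst_arrival(low_set):
        memo = {}

        def arrival(c):
            if c not in memo:
                d = delay_h[c] * (slowdown if c in low_set else 1.0)
                memo[c] = d + max((arrival(f) for f in fanins[c]), default=0.0)
            return memo[c]

        return max(arrival(c) for c in fanins)

    def visit(cell):
        if cell in low or cell in unreplaceable:
            return
        # Keep the CVS ordering: a VDDL cell may only drive VDDL cells.
        if any(f not in low for f in fanouts[cell]):
            return
        if worst_arrival(low | {cell}) <= t_req:  # slack still non-negative
            low.add(cell)
            for f in fanins[cell]:
                visit(f)
        else:
            unreplaceable.add(cell)  # stop the traversal toward the inputs here

    for out_cell in output_order:
        visit(out_cell)
    return low


chain_fanins = {"a": [], "b": ["a"], "c": ["b"]}
chain_fanouts = {"a": ["b"], "b": ["c"], "c": []}
unit = {"a": 1.0, "b": 1.0, "c": 1.0}
print(cvs_basic(chain_fanins, chain_fanouts, ["c"], unit, t_req=4.0, slowdown=2.0))  # {'c'}
```

Re-running a full timing pass for every trial is simple but quadratic; an incremental timer would be the natural refinement.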
B. Algorithm for Handling Fanouts
In performing the traversal, we need to be careful about cells with multiple fanouts. For example, in the above-mentioned traversal o1 → G8 → G5, the cell G5 has two fanouts, G8 and G9. In order to replace the VDDH cell at G5 with a VDDL cell, not only G8 but also G9 must be replaced with a VDDL cell. Otherwise, the CVS structure
primary_inputs → VDDH cluster → VDDL cluster → outputs (latches with LCF)
cannot be formed. Although G8 has been replaced in the traversal, G9 has not been visited yet, so we need to try to replace G9 as well. Moreover, if G9 has multiple fanouts, we need to try to replace the VDDH cells at all of those fanouts with VDDL cells.
When trying to replace a cell with multiple fanouts, we perform a forward traversal using DFS toward the primary outputs to check whether the child cells can be replaced with VDDL cells. G5 is replaced with a VDDL cell only after every child checked in this procedure has been replaced.
Eventually, the CVS structure shown in Fig. 2.3 is completed, with the VDDH cells and the VDDL cells each forming their own cluster.
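The fanout rule can be sketched as an all-or-nothing group replacement. As a simplification of the procedure described above, this sketch checks the whole forward cone in one timing test instead of replacing children one by one; `meets_timing` is a hypothetical callback (for example, a PERT slack check) supplied by the caller.

```python
def lower_with_fanouts(cell, fanouts, low, meets_timing):
    """All-or-nothing replacement of `cell` together with its VDDH fanout cone.

    A forward DFS from `cell` toward the primary outputs collects every cell
    that is still at VDDH. Because VDDL cells already drive only VDDL cells,
    the search can stop at any cell already in `low`. The group is replaced
    only if the resulting assignment still meets timing.
    """
    group, stack = set(), [cell]
    while stack:
        c = stack.pop()
        if c in low or c in group:
            continue
        group.add(c)
        stack.extend(fanouts[c])
    if meets_timing(low | group):
        return low | group
    return low  # reject: the assignment is left unchanged


fanouts = {"G5": ["G8", "G9"], "G8": [], "G9": []}
print(lower_with_fanouts("G5", fanouts, {"G8"}, lambda s: True))           # G5, G8, G9
print(lower_with_fanouts("G5", fanouts, {"G8"}, lambda s: "G9" not in s))  # only G8
```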
C. Heuristics
Two heuristics, "larger" and "largerSlack", are provided. They determine the order of the candidates in DFS. For example, we need to determine the order of the primary outputs o1, o2, o3 and o4 when initiating DFS. If we choose the heuristic "larger", the nodes o1-o4 are arranged in descending order of their load capacitances, such as {o1, o3, o2, o4} (o1 has the largest capacitance). The backward DFS is initiated at o1, followed by o3, o2 and o4. If we choose the heuristic "largerSlack", the nodes are arranged in descending order of their slacks. We also use these heuristics to order the choices inside the DFS.
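Both heuristics amount to a descending sort on a per-node key. In this sketch the capacitance and slack values are hypothetical, chosen so that the "larger" ordering reproduces the {o1, o3, o2, o4} example.

```python
def order_candidates(nodes, load_cap, slack, heuristic="larger"):
    """Order DFS candidates (e.g. primary outputs) by one of the two heuristics.

    "larger":      descending load capacitance
    "largerSlack": descending slack
    """
    if heuristic == "larger":
        key = load_cap
    elif heuristic == "largerSlack":
        key = slack
    else:
        raise ValueError(f"unknown heuristic: {heuristic}")
    return sorted(nodes, key=lambda n: key[n], reverse=True)


outputs = ["o1", "o2", "o3", "o4"]
load_cap = {"o1": 40e-15, "o2": 20e-15, "o3": 30e-15, "o4": 10e-15}  # farads (hypothetical)
slack = {"o1": 0.1, "o2": 0.8, "o3": 0.3, "o4": 0.5}                 # ns (hypothetical)
print(order_candidates(outputs, load_cap, slack, "larger"))       # ['o1', 'o3', 'o2', 'o4']
print(order_candidates(outputs, load_cap, slack, "largerSlack"))  # ['o2', 'o4', 'o3', 'o1']
```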
3. Experimental Results and Discussions
3.1 Design Examples
We implemented the algorithms and heuristics described in Section 2 in our tool "Power Slimmer". It has been run on several design examples of control logic in a superscalar microprocessor, Torch [5]. The examples are outlined in Table 3.1. The modules "AdecSE1" and "BdecSE1" are decode and control logic. Only the combinational circuits are extracted from the original circuits.
Table 3.1 Outline of Design Examples
Module          AdecSE1    BdecSE1
# of inputs     …          …
# of outputs    30         …
# of cells      164        …
3.2 Design Flow
The design flow is shown in Fig. 3.1.
Fig. 3.1 Design Flow: Logic Synthesis, Combinational Part, …
3.3 Libraries
We tried two standard-cell libraries, a "Regular Lib" and a "Fine Lib", as shown in Table 3.2. The Regular Lib was actually used in the design of the Torch microprocessor; its minimum cell has transistors of Wp/Wn = 20µm/10µm in 0.3µm CMOS. It includes NANDs, NORs, ANDs and ORs with up to 4 inputs, and inverters, together with XORs, AOIs and OAIs. The Fine Lib was prepared for this experiment; its minimum cell has transistors of Wp/Wn = 5µm/2.5µm, which is close to the minimum transistor width. Furthermore, the Fine Lib has more variations of transistor sizes than the Regular Lib.
Table 3.2 Regular Lib and Fine Lib
                   Regular Lib        Fine Lib
Size variations    …, 8x, 12x, 16x    transistor size in 0.25 steps, …