
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 15, NO. 6, JUNE 2007

Gate Level Multiple Supply Voltage Assignment Algorithm for Power Optimization Under Timing Constraint

Jun Cheng Chi, Student Member, IEEE, Hung Hsie Lee, Sung Han Tsai, and Mely Chen Chi, Member, IEEE

Abstract—We propose a multiple supply voltage scaling algorithm for low power designs. The algorithm combines a greedy approach and an iterative improvement optimization approach. In phase I, it simultaneously scales down as many gates as possible to lower supply voltages. In phase II, a multiple way partitioning algorithm is applied to further refine the supply voltage assignment of gates to reduce the total power consumption. During both phases, the timing correctness of the circuit is maintained. Level converters (LCs) are adjusted correctly according to the local connectivity of the gates driven by different supply voltages. Experimental results show that the proposed algorithm can effectively convert the unused slack of gates into power savings. We use two of the ISPD2001 benchmarks and all of the ISCAS89 benchmarks as test cases. The 0.13-μm CMOS TSMC library is used. On average, the proposed algorithm improves the power consumption of the original design by 42.5% with a 10.6% overhead in the number of LCs. Our study shows that the key factor in achieving power savings is including the most suitable supply voltage in the scaling process.

Index Terms—Algorithms, low power, multiple voltages assignment, partition, power optimization, voltage scaling.

I. INTRODUCTION

POWER consumption is an important issue in modern VLSI designs. The power consumption of CMOS circuits consists of two components. One is dynamic power consumption. The other is static power consumption, which is caused by leakage current. In practice, leakage power contributes only 1% to the total power consumption when the circuit is in the active mode. Therefore, in this paper we focus on minimizing the dynamic power. Dynamic power consumption includes switching power and short-circuit power. Switching power is dissipated by the current that charges and discharges parasitic capacitances. Short-circuit power is consumed by the short-circuit current that flows when both the n-channel and p-channel transistors are momentarily on at the same time. Switching power consumption in CMOS circuits is proportional to the square of the supply voltage (VDD). Applying a voltage scaling technique that changes the supply voltage of gates to a lower value is therefore an effective way of reducing power consumption.
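For reference, the standard first-order expression behind this quadratic dependence can be written as follows (a textbook CMOS relation stated here for context, not an equation reproduced from this paper):

```latex
% First-order CMOS switching power: alpha_sw is the switching activity,
% C_L the switched load capacitance, f the clock frequency.
P_{\mathrm{sw}} = \alpha_{\mathrm{sw}}\, C_L\, V_{DD}^{2}\, f
% Halving V_DD (e.g., 1.2 V -> 0.6 V) reduces P_sw by a factor of four
% for the same activity, load, and frequency.
```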
Manuscript received May 3, 2006; revised September 21, 2006. This work was supported by the National Science Council of Taiwan under Grant NSC94-2215-E-033-003 and Grant NSC95-2221-E-033-077-MY3.
J. C. Chi is with the Department of Electronic Engineering, Chung Yuan Christian University, Chung Li 32023, Taiwan (e-mail: juncheng@mail2000.com.tw).
H. H. Lee, S. H. Tsai, and M. C. Chi are with the Department of Information and Computer Engineering, Chung Yuan Christian University, Chung Li 32023, Taiwan (e-mail: mlchen@cycu.edu.tw).
Digital Object Identifier 10.1109/TVLSI.2007.898650

Fig. 1. Average distribution of gates with different slacks for 16 MCNC91 benchmarks [1].

Scaling down the supply voltage of a gate will cause the gate
to have a longer gate delay. In order to maintain the correctness
of the timing, only the gates along noncritical paths are assigned
to a lower supply voltage to convert the unused slack into power
savings. The average distribution of gates with different slacks
for 16 MCNC91 benchmarks is shown in Fig. 1; these were
presented in [1]. In Fig. 1, the slack of each gate was normalized
to the longest path delay. It may be seen from this figure that
the number of gates on critical paths (i.e., gates with zero or
close-to-zero slack) accounts for only about 14% of the total
number of gates. The number of gates with a slack larger than
0.2 comprises more than 60% of the total number of gates. This means that there is plenty of room for power reduction via the utilization of lower supply voltages on the gates with large slack.
However, in a voltage-scaled circuit, if a lower supply
voltage gate (a VDDL gate) drives a higher supply voltage gate
(a VDDH gate), a level converter (an LC) must be inserted as
a bridge between these two gates [2]. This is because the output
signal of the VDDL gate will cause a static current flow from
the VDD to VSS in the VDDH gate. An example is shown in
Fig. 2. In Fig. 2, the inverter is a VDDH gate and this inverter
is driven by a VDDL gate. Since the voltage of the input signal
of the inverter will not be higher than VDDL even when the
input signal is at the HIGH level, the pMOS in this inverter may not be cut off if VDDH − VDDL > |Vtp|, where Vtp represents the threshold voltage of the pMOS. This will cause
a static current flow from VDD to VSS through the pMOS to
nMOS. Thus, an LC is needed between a VDDL and a VDDH
gate to prevent the creation of a static current. However, the LC
will also consume power and will cause a timing delay. It also
increases the chip area. An LC is not required if a VDDH gate
drives a VDDL gate. The number of LCs in a voltage-scaled



Fig. 2. Illustration of the static current flow in a VDDH inverter when it is directly connected to a VDDL gate.

Fig. 3. Conventional level converter [2].

design is determined by the voltage scaling algorithm. Fig. 3


shows an example of an LC [2].
In [2] and [3], Usami et al. proposed a dual supply voltage approach, namely clustered voltage scaling (CVS). CVS is based
on a topological constraint that only allows a transition from a
VDDH gate to a VDDL gate along paths from the primary inputs (PI) to the primary outputs (PO). Thus, no LCs are required.
In this resultant cluster structure, some gates with high value of
slack cannot be assigned to VDDL gates. The ECVS [4] relaxes
this topological constraint by allowing a VDDL gate to drive a
VDDH gate if an LC is inserted. The assignment is performed by
visiting gates in a reverse levelized manner. A design methodology and design flow for implementing this structure of netlist
was proposed in [5]. Both algorithms, CVS and ECVS, assign
the appropriate power supply to each gate by traversing a combinational circuit from the POs to the PIs in a levelized order.
The levelized approach restricts the possibility of creating a solution with lower power consumption.
Without using the levelization approach, Yeh and Kuo [6]
proposed an optimization-based multiple supply voltage scaling
algorithm, OB MVS. The OB MVS algorithm first scales all
gates to the lowest supply voltage, and then scales up the supply
voltage of the gates according to a critical order until the timing
constraint is met. Gates that are shared by more critical paths
have a higher critical order. This work does not consider the insertion of LCs. The calculation of the critical order of each gate
is very time consuming. Kulkarni et al. [7] proposed an algorithm (GECVS) that greedily assigns power supplies based on
a sensitivity measure. This sensitivity measure considers slack
changes and power saving at each voltage assignment. The algorithm starts from a VDDH netlist and then scales down the
supply voltage of the gate with the highest sensitivity. This algorithm saves more power than the CVS and ECVS algorithms.
However, these algorithms use greedy approaches without any refinement process, and thus, the power consumption may not be optimized.
Chen et al. [1] translated the power optimization problem
to a maximal-weighted-independent set (MWIS) problem. The
authors first estimate the lower bound of the supply voltage of
each gate that meets the timing constraints. Then, the voltage of
each gate is assigned by the proposed dual-voltage-power-optimization (DVPO) algorithm. In order to reduce the power
penalty of LCs, a constrained F-M algorithm is used to reduce
the number of LCs. Kang et al. [8] proposed a scheduling
and resource allocation approach for multiple voltage scaling.
They used a data-flow graph (DFG) to represent the netlist
and then calculate the timing slack of each node in the DFG.
After the initial voltage assignment of each node, a pair-wise
multiple way graph partition algorithm was performed to
further improve the power consumption while not violating
the given timing constraints. This work was enhanced in [9].
Manzak and Chakrabarti [10] present resource and latency
constrained scheduling algorithms to minimize power/energy
consumption when the resources operate at multiple voltages.
The proposed algorithms are based on efficient distribution of
slack among the nodes in the data-flow graph. The distribution
procedure tries to implement the minimum energy relation
derived using the Lagrange multiplier method in an iterative
fashion. Mohanty and Ranganathan [11] present an integer
linear programming (ILP)-based power minimization algorithm. The algorithm simultaneously minimizes the peak and
average power during behavioral level datapath scheduling. The
datapaths can function in three modes of operations: 1) single
supply voltage and single frequency; 2) multiple supply voltages and dynamic frequency clocking; and 3) multiple supply
voltages and multicycling. Recently, some research [12], [13] has combined voltage scaling and Vth assignment to further reduce the power consumption of a circuit. For example, in [13], Hung et al. proposed an algorithm that utilizes the Genetic Algorithm to simultaneously perform multi-Vdd assignment, multi-Vth assignment, device sizing, and stack forcing for low power designs.
These previous works did not consider wire delay and wire load. Thus, the timing and the total power consumption of a design might differ considerably from those of the design after layout. In this paper, wire delay and load are taken into consideration in both the timing analysis and power calculation processes.
In this paper, we focus on the voltage assignment for gates
in a gate-level netlist for dynamic power optimization. We propose an algorithm that combines both greedy and iterative optimization approaches. In phase I, we greedily assign the supply
voltage of all gates to VDDL, then reassign the gates with negative slack to a higher supply voltage to guarantee the timing correctness of the circuit. This allows the gates along the timingloose paths to be scaled down simultaneously. It results in a
smaller number of LCs on these paths. The VDDL gates will
be fixed at the lowest voltage and the rest of the gates are referred to as scalable gates. Only scalable gates will have their supply
voltages reassigned in phase II. The problem size will thus be
reduced. In phase II, we utilize a partitioning algorithm to perform the voltage assignment to optimize the power consumption. After each scaling, the LCs are inserted into or removed


from the netlist according to the local connectivity of the voltage


scaled gates. The delays and power consumptions of LCs are
counted in the timing analysis and power calculation process.
During the scaling process, the wire load and delay are estimated
by applying a wire load model. Therefore, the estimation of path
delay and the total power consumption of our algorithm may be
more accurate than the estimations used in algorithms that do
not account for wire load and delay considerations. These considerations are helpful in facilitating timing closure in the physical design flow. The details of the proposed algorithm are described in Section IV. We have carried out several experiments
to study the impact of applying different voltage domains on
power savings. We have also studied the impact of switching
activity on total power consumption. These experiments are discussed in Section V.
This paper is organized as follows. In Sections II and III, we
describe the calculation of the slack and power consumption of
each gate. In Section IV, we present the multiple supply voltage
scaling algorithm. In Section V, we show the experimental results. Finally, Section VI concludes the paper.


II. BACKGROUND-TIMING ANALYSIS


The objective of timing analysis is to calculate the slack of
each gate in the design. We first calculate the arrival time of
each gate. The breadth first search algorithm (BFS) is applied in
a levelized manner to find the maximum delay of all paths. Each
BFS starts from a PI/flip-flop and ends at a PO/flip-flop. At the
starting gate of each path, the arrival time of each PI/flip-flop is
the delay of the PI/flip-flop. The maximum delay among output
ports of a gate is the arrival time of the gate. The maximum
arrival time among all input ports of a PO/flip-flop is the arrival
time of the POs/flip-flop. The maximum arrival time among all
POs/flip-flops is the cycle time of the circuit.
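To make the traversal concrete, the sketch below (our illustration, not the authors' code; the netlist representation and the toy delay values are assumptions) propagates arrival times in a levelized order over a small combinational netlist and reads the cycle time off the latest endpoint:

```python
from collections import defaultdict, deque

def arrival_times(fanout, gate_delay, start_points):
    """Levelized (BFS/topological) arrival-time propagation.

    fanout[g]    : list of (successor, wire_delay) pairs driven by gate g
    gate_delay[g]: delay of gate g for its current load and supply voltage
    start_points : {PI or flip-flop output: launch delay}
    """
    indeg = defaultdict(int)
    for g, outs in fanout.items():
        for succ, _ in outs:
            indeg[succ] += 1

    arrive = defaultdict(float)          # arrival time at each gate
    queue = deque()
    for g, launch in start_points.items():
        arrive[g] = launch
        queue.append(g)

    while queue:
        g = queue.popleft()
        out_time = arrive[g] + gate_delay[g]
        for succ, wire_delay in fanout.get(g, []):
            # a gate's arrival time is the latest driver output plus wire delay
            arrive[succ] = max(arrive[succ], out_time + wire_delay)
            indeg[succ] -= 1
            if indeg[succ] == 0:
                queue.append(succ)
    return dict(arrive)

# toy netlist: PI -> A -> B, delays in ns (made-up numbers)
fanout = {"PI": [("A", 0.1)], "A": [("B", 0.2)], "B": []}
gate_delay = {"PI": 0.0, "A": 1.0, "B": 1.5}
arr = arrival_times(fanout, gate_delay, {"PI": 0.0})
cycle_time = arr["B"] + gate_delay["B"]   # latest endpoint arrival = cycle time
print(arr, cycle_time)
```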
During the timing analysis, both gate and wire delays are
counted. Because at this stage, the wire length of each net is
not available, we use the wire load model [14] that is given in
the standard cell library to estimate the wire length of a net. This
wire load model is a statistical experimental result obtained by
analyzing many layouts of a specific technology. This model
provides parameters for estimating the wire length of a net according
to the number of fan-out of the net and the total number of gates
in the circuit. After estimating the wire length, we may calculate the capacitance (C) and resistance (R) of the wire. Then, the
R*C delay model is used to estimate the wire delay. The summation of the wire capacitance and the input capacitance of fan-out
gates is the total load capacitance on the driver gate. Then a gate
delay is obtained from a lookup table in the standard cell library
according to the total load capacitance of the gate. The input
transition time is assumed to be the minimum transition time
in the table. The gate delay that is extracted from the library is the delay of the gate at supply voltage VDDH. The delay of a VDDL driver gate may be estimated by using the alpha-power law model (1) [15]

D(VDDL) = D(VDDH) × (VDDL/VDDH) × [(VDDH − Vth)/(VDDL − Vth)]^α    (1)

where Vth is the threshold voltage of a transistor and α is a technology dependent parameter. We run SPICE on different types of gates in the library (TSMC 0.13-μm CMOS library) to obtain the dependency of delay on various supply voltages. From these data, we can calculate the value of α. The value of α is 1.49.
Then, we calculate the require time of each gate. The BFS algorithm is applied backwards from a PO/flip-flop to a PI/flip-flop. The cycle time of the circuit is set to be the require time at each PO/flip-flop. When a gate is visited, its require time is equal to the require time of the previous gate minus the sum of the delay of the previous gate and the connected wire.
Finally, the slack of a gate is calculated by subtracting the arrival time from the require time of the gate. If the slack of any gate is negative, then the circuit has a timing violation.
We assume that the input circuit has no timing violation. Our algorithm will reduce the power consumption of the circuit by scaling down the supply voltages of gates while maintaining the timing correctness of the circuit. During the timing analysis process, the delay of an LC is also extracted from a lookup table in the cell library. The table is created according to a real LC design.
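To illustrate how the library (VDDH) delay of a gate could be rescaled by (1) and how the slack then follows, here is a small sketch (our own example; only α = 1.49 is taken from the text, and the threshold-voltage and delay numbers are illustrative assumptions):

```python
ALPHA = 1.49   # technology-dependent exponent fitted from SPICE data (per the text)
V_TH  = 0.3    # assumed threshold voltage in volts (illustrative value only)

def scale_delay(d_vddh, vddh, vddl, vth=V_TH, alpha=ALPHA):
    """Rescale a VDDH library delay to a VDDL supply using the alpha-power law:
    delay is proportional to Vdd / (Vdd - Vth)**alpha."""
    return d_vddh * (vddl / vddh) * ((vddh - vth) / (vddl - vth)) ** alpha

def slack(require_time, arrival_time):
    """Slack of a gate; a negative value means a timing violation."""
    return require_time - arrival_time

# toy usage: a 100-ps gate delay at 1.2 V grows when the gate is moved to 0.6 V
d_low = scale_delay(0.100, vddh=1.2, vddl=0.6)
print(f"delay at 0.6 V ~ {d_low * 1000:.1f} ps, slack example = {slack(5.0, 4.2):.2f} ns")
```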

III. POWER CONSUMPTION CALCULATION


The dynamic power consumption of a gate g, denoted as P(g), may be calculated by (2)

P(g) = Σ_{i=1..m} S_i · E_i · f + Σ_{j=1..n} (1/2) · S_j · C_j · V(g)^2 · f    (2)

where m and n are the numbers of input and output pins of gate g. E_i is the power consumption of the ith input pin. The values of E_i may be extracted from the lookup table in the library according to the total capacitance of the fan-out load. f represents the frequency of the circuit. S_i and S_j represent the switching activity of the ith input pin and the jth output pin, respectively. C_j is the loading capacitance on the jth output pin. The value of C_j is the sum of the capacitances of the fan-out net and the driven pins of the net. The capacitance of each net is estimated by applying the wire load model. V(g) represents the supply voltage at gate g. For example, if g is a VDDH/VDDL gate, then V(g) equals VDDH/VDDL. The power consumption of an LC is also calculated by (2). The supply voltage V(g) of each LC is assigned as VDDH.
The total power consumption of the circuit is calculated by (3)

P_total = Σ_{all gates and LCs g} P(g).    (3)
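A direct transcription of (2) and (3) might look like the sketch below (our reading of the reconstructed equations; the per-pin energies, capacitances, and activities are placeholder numbers, and E_i is treated as an energy per transition so that multiplying by the frequency yields power):

```python
def gate_power(vdd, freq, input_pins, output_pins):
    """Dynamic power of one gate following (2).

    input_pins  : list of (S_i, E_i) -- switching activity and per-transition
                  internal energy of each input pin (library lookup values)
    output_pins : list of (S_j, C_j) -- switching activity and load capacitance
                  (fan-out net plus driven pins) of each output pin
    """
    internal  = sum(s * e for s, e in input_pins) * freq
    switching = sum(0.5 * s * c * vdd * vdd for s, c in output_pins) * freq
    return internal + switching

def total_power(cells):
    """Total power following (3): the sum of (2) over all gates and LCs."""
    return sum(gate_power(*c) for c in cells)

# toy example: one VDDH gate and one LC (LCs are always evaluated at VDDH)
gate = (1.2, 1.0e9, [(1.0, 2.0e-15)], [(1.0, 5.0e-15)])
lc   = (1.2, 1.0e9, [(1.0, 1.0e-15)], [(1.0, 2.0e-15)])
print(total_power([gate, lc]))
```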

IV. PROPOSED VOLTAGE SCALING ALGORITHM FOR LOW POWER CIRCUITS

The inputs of the algorithm are a circuit which has no timing violation and a standard cell library. A set of voltage domains V = {V1, V2, ..., Vk}, where V1 > V2 > ... > Vk, represents the voltages that may be applied to the gates of the circuit. Initially, the supply voltage of all gates is V1, which is also denoted as VDDH. The lowest voltage Vk is also denoted as VDDL. The objective is to scale down the supply voltage on



the gates such that the total power consumption is reduced while maintaining the same cycle time of the design.

Fig. 4. Example of the insertion and removal of an LC. (a) The original netlist of a four terminal net is shown. When gate A is scaled down to a VDDL gate, an LC is inserted at the output of gate A, as shown in (b). When gate D is successively scaled down to a VDDL gate, the netlist becomes (c). Finally, gate A is scaled up to a VDDH gate, the LC is removed and the netlist becomes (d).
In the voltage scaling process, we need to adjust the netlist
by inserting or removing an LC according to the local connectivity of the voltage scaled gates. An example is shown in Fig. 4
in which two voltage domains VDDL and VDDH are used.
Fig. 4(a) shows the original connection of a four terminal net N. First, when gate A is scaled down to a VDDL gate, an LC is inserted at the output of gate A. The net N is divided into two nets, N1 and N2, as shown in Fig. 4(b). Then gate D is successively scaled down to a VDDL gate and the netlist becomes Fig. 4(c). Finally, gate A is scaled up to a VDDH gate, the LC is removed, and the netlist becomes Fig. 4(d). Due to the insertion and removal of LCs, the netlist changes dynamically. This change is considered in the timing analysis process. In the example shown in Fig. 4(b), the delay from an input pin of gate A to the input pin of gate B is the summation of the delays of gate A, net N1, the LC, and net N2.
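The local rule the example illustrates can be captured in a small helper: an LC is needed exactly where a lower-voltage driver feeds a higher-voltage sink, and nowhere else (our sketch; the simple tuple-based net representation is an assumption made for illustration):

```python
def lc_required(driver_vdd, sink_vdd):
    """An LC is needed only when a lower-supply driver feeds a higher-supply sink;
    a VDDH gate driving a VDDL gate needs no conversion."""
    return driver_vdd < sink_vdd

def net_needs_lc(driver_vdd, sink_vdds):
    """After a voltage move, decide whether the driver's net must be split and
    an LC inserted (Fig. 4 style): True if any sink still needs conversion."""
    return any(lc_required(driver_vdd, v) for v in sink_vdds)

# gate A scaled down to 0.6 V while B, C, D stay at 1.2 V -> insert an LC
print(net_needs_lc(0.6, [1.2, 1.2, 1.2]))   # True
# gate A scaled back up to 1.2 V -> the LC can be removed
print(net_needs_lc(1.2, [1.2, 1.2, 1.2]))   # False
```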
A. Algorithm Overview
At the beginning, we apply the timing analysis procedure to
calculate the cycle time of the circuit and the slack of each gate.
This cycle time is used as the timing constraint of the design in
order to maintain the timing correctness of the circuit. Then the
algorithm will proceed with two phases. In phase I, we apply
a greedy approach that scales down the supply voltages of as
many gates as possible. It allows all gates along the timing-loose
paths to be scaled down simultaneously and results in a smaller
number of LCs on these paths. The VDDL gates will be fixed
at the lowest voltage and the rest of the gates are referred to as scalable gates. Only the supply voltages of scalable gates will be
reassigned in phase II. In phase II, we utilize the technique of
the multiway partitioning algorithm to perform the voltage assignments. Different voltage domains are treated as different
partitions. We refer to a voltage assignment as a move. Each
scalable gate is moved to the voltage domain of the maximum
power gain. The iterative optimization process is executed until
the total power of the circuit can no longer be reduced. During both phases, the correctness of timing is maintained. The details of these two phases are described in the following.

B. Phase I: Greedily Scaling Down the Voltages of Gates While Satisfying the Timing Constraint

At this stage, we try to scale down as many gates as possible to


reduce the total power consumption while simultaneously maintaining the timing correctness of the circuit. The pseudo code of
phase I is shown in Fig. 5.
First, we scale all gates in the netlist to the lowest voltage Vk (VDDL). LCs are inserted before the primary output gates. Static timing analysis is performed to calculate the slack of all gates. The supply voltages of all gates with negative slacks are scaled up to the next higher supply voltage. The netlist is adjusted by inserting or removing LCs according to the local connectivity
of the voltage scaled gates. Then, we perform timing analysis
again to recalculate the slack of each gate. The delays of LCs
and nets along the path are also included in the timing calculation. Then the slacks of these gates are updated. If any gate
with negative slack is found, then the supply voltage of the gate
is scaled up to the next supply voltage. This process is repeated
until the slacks of all gates are positive or zero.
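In outline, phase I can be summarized as below (a paraphrase of the Fig. 5 pseudo code rather than a reproduction of it; `run_sta` and `adjust_lcs` stand in for the static timing analysis and LC-adjustment steps described above):

```python
def phase_one(gates, voltages, run_sta, adjust_lcs):
    """Greedy phase I: start every gate at the lowest supply, then bump gates
    with negative slack up one voltage level until no violation remains.

    voltages   : supply voltages sorted from highest (VDDH) to lowest (VDDL)
    run_sta    : callable returning {gate: slack} for the current assignment
    adjust_lcs : callable that inserts/removes LCs for the current assignment
    """
    vdd = {g: voltages[-1] for g in gates}      # scale every gate to VDDL
    adjust_lcs(vdd)
    while True:
        slacks = run_sta(vdd)
        violated = [g for g in gates if slacks[g] < 0]
        if not violated:
            return vdd                          # all slacks >= 0: phase I done
        for g in violated:
            idx = voltages.index(vdd[g])
            if idx > 0:
                vdd[g] = voltages[idx - 1]      # next higher supply voltage
        adjust_lcs(vdd)
```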
An example of phase I is shown in Fig. 6. In the example,
dual supply voltages, VDDH and VDDL, are used and the cycle
time is 20 ns. The delays of each gate and wire are shown above
each gate and wire. Fig. 6(a) represents a portion of the original netlist in which all gates are VDDH gates and the slack
of each gate is positive. We first scale all gates to VDDL as
shown in Fig. 6(b). Then, we scale all gates with negative slacks to VDDH gates. According to the connectivity of the scaled gates, two LCs are inserted at the VDDL-to-VDDH boundaries. Then, we recalculate the slack of each gate. We find that the slack of each gate is positive, and phase I is finished. The resultant netlist is shown in Fig. 6(c). As shown in Fig. 6(c), the supply voltages of the gates on the timing-loose path are scaled down to VDDL simultaneously.
After this phase, the delays of all paths are less than or equal
to the cycle time. The effectiveness of this phase lies in the fact
that it simultaneously scales down the supply voltage of all gates
on the timing-loose paths. The delays on these paths are still
less than the cycle time after the voltage scaling. Other paths
are composed of gates with different voltage domains.
C. Phase II: Partition Based Multiple Supply Voltage Scaling
Algorithm
At this phase, we apply a partition-based approach to perform the voltage scaling. If the number of voltage domains is
k, then this problem is treated as a k-way partition problem.
For example, if we are performing the voltage scaling with dual
supply voltages (VDDH and VDDL), the problem is formulated as a two-way partition problem. The gates are moved between the two partitions VDDH and VDDL to obtain a netlist
with the lowest power consumption. A gate is moved to the
VDDH/VDDL partition which means that the supply voltage
of the gate is assigned to VDDH/VDDL.
All VDDL gates are marked as unscalable while the others
are scalable. We only deal with the scalable gates. Thus, the


Fig. 5. Pseudo code of phase I.

problem size of phase II is smaller than the original circuit. The


scalable gates are moved among the voltage domains in order
to reduce the total power consumption. The moves that result
in timing violations are disallowed. During the moving process,
the power may be increased. It provides for the possibility that
better solutions may be discovered in later moves.
1) Cost Function of Phase II: The cost function of this phase
is the total power consumption of the circuit that may be calculated by (3). The power consumptions of LCs are also included.
Then, we define the power gain of each gate. For each scalable gate g, the power gain of g, denoted as G(g, Vj), represents the improvement of power consumption when the supply voltage of gate g is scaled to Vj. The power gain G(g, Vj) is defined as follows:

G(g, Vj) = Total Power(before the move) − Total Power(after gate g is scaled to Vj).    (4)

If the voltage movement requires an insertion or removal of a level converter, then the power associated with the level converter is added to or subtracted from the Total Power.
During the scaling process, only the power consumptions of the
fan-in gates of the scaled gate are affected. Thus, the power gain
of a gate g may be simplified by the following equation:

G(g, Vj) = Σ_{i ∈ {g} ∪ FI(g)} (P_i − P'_i)    (5)

where FI(g) is the set of fan-in gates of gate g, and P_i and P'_i represent the power values of gate i before and after gate g is scaled, respectively. The values of P_i and P'_i may be calculated by (2). By applying (5), the power gain of
each gate may be incrementally updated. An example is shown
in Fig. 7. If the supply voltages of gates A, B, and C are VDDH and the supply voltage of gate D is VDDL, then the power gain of gate B is G(B, VDDL) = (P_A + P_B + P_D) − (P'_A + P'_B + P'_D).
According to the definition of power gain, each gate has k power gains, denoted as G(g, Vj) for each Vj ∈ V. Then, we define the maximum gain value of a gate g, denoted as Gmax(g), as the maximum value among G(g, Vj) for each Vj ∈ V. It is defined as follows:

Gmax(g) = max_{Vj ∈ V} G(g, Vj).    (6)

After one gate is scaled, the Gmax values of the affected gates will be updated.
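A minimal sketch of the gain bookkeeping in (4)–(6) (our illustration; `gate_power_under` stands in for evaluating (2) for one gate under a trial voltage assignment, with any LC change already reflected in its load):

```python
def power_gain(g, v_new, fanin, gate_power_under, vdd):
    """G(g, Vj) per (5): power of g and its fan-in gates before the move minus
    the same sum after g is scaled to v_new."""
    affected = [g] + list(fanin[g])
    before = sum(gate_power_under(x, vdd) for x in affected)
    trial = dict(vdd)
    trial[g] = v_new
    after = sum(gate_power_under(x, trial) for x in affected)
    return before - after

def max_gain(g, voltages, fanin, gate_power_under, vdd):
    """Gmax(g) per (6): the best gain over all candidate supply voltages."""
    return max(power_gain(g, v, fanin, gate_power_under, vdd)
               for v in voltages if v != vdd[g])
```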
2) Proposed Algorithm: First, we set up an unlock flag for
each scalable gate and calculate the power gain of each gate.
Then we select the unlocked gate with the largest Gmax and move it to the best voltage domain. We insert or remove level converters according to local connectivity. Then we lock the moved gate and calculate the slacks of all affected gates. When a gate g is scaled, the arrival time of the gates on the downstream paths of the fan-in gates of g and the require time of the gates on the upstream paths of g are changed. Thus, the
slacks of these gates are updated. If any slack is negative, then
the move will cause a timing violation and the move is rejected.
If the move is accepted, we will update the power gain of all
connected gates.
We use the same example as shown in Fig. 7. If gate B has the maximum gain among all unlocked gates and Gmax(B) = G(B, VDDL), we select gate B and scale it to a VDDL gate. Then, a level converter is inserted between gates B and C. If the slacks of all affected gates are positive or zero, then the move is accepted. The power gains of gates A, B, and D are updated. In this process, the output load of gate D is changed. Before scaling, the load of gate D is equal to the sum of the capacitance of a two terminal net and the input capacitance of the LC. After scaling, the load of gate D becomes the capacitance of a two terminal net and the input capacitance of gate B. Therefore, both the delay and power consumption of gate D should be updated. The


difference of the power consumptions of gate D is counted in G(B, VDDL).

Fig. 6. Example of phase I.
The gate-move process is repeated until all gates are locked or
all power gains of the last 20 moves are negative. The accumulated sum of power gains of all moved gates is called the partial

sum. We calculate the partial sum after each move. Because the
power gain may be negative, the value of the partial sum fluctuates. During the gate-move process, we keep the largest partial
sum of power gain and record the corresponding state. After an iteration is terminated, the state of the largest partial sum is found.


Fig. 7. Example of the calculation of power gain when gate B is scaled. The power consumption of gate A stays the same when gate B is scaled, thus P_A = P'_A. Due to the removal of the LC, the loading capacitance of gate D is changed. It leads to a change in the power consumption of gate D. Therefore, the power gain G(B, VDDL) of gate B is equal to (P_A + P_B + P_D) − (P'_A + P'_B + P'_D).

Fig. 8. Pseudo code of phase II.

We assign the state of the largest partial sum of the previous


iteration as the initial state and unlock all scalable gates. Then,
we recalculate the slack and power gain of each gate. The iteration of the gate-move process is performed again to reduce the
power consumption. This optimization process is repeated until
the largest partial sum is equal to or less than zero. Then phase II
is completed. The pseudo code of this phase is shown in Fig. 8.
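Phase II behaves like an FM-style pass with partial-sum bookkeeping; the sketch below is our paraphrase of the Fig. 8 pseudo code, with `best_move`, `apply_move`, `undo_move`, `timing_ok`, `snapshot`, and `restore` assumed helpers over the netlist state:

```python
def phase_two(scalable_gates, best_move, apply_move, undo_move,
              timing_ok, snapshot, restore):
    """Iterative improvement: repeat passes until the best partial sum of a
    pass is no longer positive."""
    while True:
        unlocked = set(scalable_gates)
        partial_sum, best_sum = 0.0, 0.0
        best_state = snapshot()
        recent_gains = []
        while unlocked:
            g, v, gain = best_move(unlocked)   # unlocked gate with largest Gmax
            apply_move(g, v)                   # also inserts/removes LCs
            if not timing_ok():                # any negative slack: reject move
                undo_move(g, v)
            else:
                partial_sum += gain
                if partial_sum > best_sum:     # remember the best state so far
                    best_sum, best_state = partial_sum, snapshot()
            unlocked.discard(g)                # lock the moved gate
            recent_gains.append(gain)
            if len(recent_gains) >= 20 and all(x < 0 for x in recent_gains[-20:]):
                break                          # last 20 gains all negative
        restore(best_state)                    # roll back to the best partial sum
        if best_sum <= 0:
            return                             # no further improvement possible
```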
V. EXPERIMENTAL RESULTS
The proposed algorithm is implemented in the C/C++ language. We also implemented the GECVS algorithm of [7] for

comparison. The experimental platform is a SUN Blade 1000 running Solaris 9 with 8-GB RAM. Two ISPD2001 circuit benchmarks, Mac1 and Mac2 [16], and the ISCAS89 benchmarks [17] are used as the test cases. The TSMC 0.13-μm CMOS library is used in our experiments. In these experiments, in order to reduce the complexity of the problem, we only use one type of LC that can convert an input signal within the range (0.6 V, 1.2 V) to a 1.2-V output signal. As in previous research [1], [7], we assume the switching activity of all nets is
a constant. Because the value will not affect the results of the
voltage scaling, we assign 1 to the switching activity to reduce


TABLE I
COMPARISON OF THE RESULTS OF OUR ALGORITHM AFTER PHASE I AND PHASE II

the computational effort. However, we also study the impact of


switching activity on the voltage scaling in the last experiment.
Since we do not have the switching activity for each net, we use
the probability of switching activity at the output of a gate as
the switching activity of the output signal. Details are described
later in this section.
First, we would like to study the effect of phase I in the proposed algorithm on dual voltage domains. We use 1.2 and 0.6 V
as the voltage domains. In phase I, many gates are assigned
to VDDL gates and the number of VDDH gates is reduced.
These VDDH gates are the scalable gates for phase II. Thus,
the problem size is reduced. The results are listed in Table I.
The number of gates is listed in column 2. The original power
of the circuits is listed in column 3. The results after phase I are
listed in columns 4–8. Column 4 shows the ratio of the VDDH
gates over the total number of gates. The ratio of the number
of VDDL gates is shown in column 5. The number of LCs is
shown in column 6. Compared to the total power of the original design, the power saving after phase I is listed in column 7.
The CPU time after phase I is shown in column 8. The CPU time
listed in column 8 includes the CPU time required to import the
input data. The results of our algorithm after phase II are shown
in columns 9–12. The total CPU time listed in column 12 is the
total execution time of the program.
As shown in Table I, after phase I, on average, the problem
size is reduced to 73.8% of the original size. The amount of
reduction depends on the timing tightness of the paths in the
circuit. More paths with looser timing will exhibit a greater reduction in the problem size after phase 1. On average, 12.7%
of the total power of the original circuit is saved. Table I also
shows that in all test cases, phase I is completed within 7.3 s.
After phase II, there are 61.9% of gates are scaled to the low
supply voltage and 37.2% of power is reduced.
We use s13207 as an example in Fig. 9 to illustrate the variation of the slack distribution during the scaling process. The x-axis represents the slack of a gate and the y-axis shows the number
of gates. Fig. 9(a) is the slack distribution of the original design. Fig. 9(b) and (c) are the distributions after phase I and
phase II, respectively. Fig. 9 shows that this algorithm succes-

sively scales gates from high voltage to low voltage such that
the slacks of gates are decreased. We find that, originally, slacks
are heavily clustered between 7 and 11 ns. After phase I, the
slacks are shifted to the interval [4 ns, 9 ns]. The slacks are generally reduced. After phase II, 25% of slacks are less than 1 ns.
There are about 1100 gates with the slacks close to zero. The
cycle time of the design is 12.2 ns. It shows that this algorithm
utilizes the slack of each gate to scale down the voltage of the
gates. Yet, we can detect that there are gates which still have
large slacks after phase II. This is because these gates are on the
paths with delays much shorter than the cycle time. Even though
the voltages of all gates along these paths have been scaled to
the low supply voltage, the slacks of these gates are still much
larger than the slacks of other gates. This implies that the supply
voltages of these gates could be scaled further downward.
Then, we compared the results of the proposed algorithm with
dual supply voltages and the results of GECVS [7]. The two
voltage domains used in this experiment are 1.2 and 0.6 V. In
this experiment, we implement the GECVS without the backoff
on the timing of the circuit. Thus, the total power of GECVS
may be further reduced if the backoff delay is allowed. The experimental results are listed in Table II. As shown in Table II,
column 2 shows the number of gates. The original power of
these circuits is listed in column 3. Columns 4–7 show the results of our algorithm. Column 4 shows the percentage of the
number of LCs over the number of total gates (including LCs).
The percentage of total power of LCs over the total power of
the circuit is shown in column 5. Column 6 shows the saved
power compared to the original power. The CPU time is shown
in column 7. The results of GECVS are shown in columns 8–11,
respectively.
On average, our algorithm can reduce total power consumption by 37.2% with an 11.9% overhead on the total number of
LCs. The GECVS algorithm can only reduce total power consumption by 29.7% with a 16.9% overhead on total number of
LCs. In this experiment, an interesting effect is observed in the
test case s1488. As shown in Table II, our algorithm uses more
LCs but consumes less LC power than GECVS. It is because
the output load on each LC will affect its power consumption.


Fig. 9. Slack distribution of s13207 which was scaled by the proposed algorithm using dual supply voltages (1.2 and 0.6 V). (a) The slack distribution before scaling. (b) After phase I. (c) After phase II.

The power consumption of any two LCs may not be the same.
We calculated the total loading capacitance of all LCs in s1488.
The values are 2.047 and 3.259 pF using our algorithm and
GECVS, respectively. Therefore, it is possible that a netlist with
more LCs consumes less LC power.
Table III shows a comparison of the improvements our algorithm provides compared to GECVS in the number of level
converters, power consumption, and CPU time. The percentage
of the number of scalable gates in phase II with respect to the
number of total gates is shown in column 5. The table shows
that our algorithm can save more power and use fewer level converters. On average our algorithm improves the number of level
converters by 34.2% and the power consumption by 11.0% in
27.0% less CPU time. Looking at the same table we can see
that, on average, the percentage of scalable gates in phase II of
our algorithm is only 73.8% of the scalable gates of GECVS.
Thus, the problem size of our algorithm is much smaller than


that of GECVS. Therefore, even though our algorithm is an iterative optimization process, it still uses less CPU time.
We have also explored the impact on power savings by using
multiple voltage domains. We compared the results of using two
and four voltage domains in the scaling process. The experimental results of the proposed algorithm that uses four voltage
domains are listed in Table IV. Column 2 shows the percentage
of the number of LCs over the number of total gates (including
LCs). The percentage of total power of LCs over the total power
is shown in column 3. Column 4 shows the saved power compared to the original power. The original power is shown in the
column 3 of Table II. The CPU time is shown in column 5.
Comparing the results in Table IV and the results shown in
Table II, we can find that the average power saving of four
voltage domains is 5.3% more than the power savings achieved
by using two voltage domains. Fig. 10 shows a comparison of
the power savings in three programs, namely, GECVS, our algorithm with dual voltage domains, and our algorithm with four
voltage domains.
These experiments show that using more voltage domains can
save more power. We find that in the case Mac1, using four
voltage domains in our algorithm saves 12.1% more power than
that with dual supply voltages. Therefore, we use Mac1 as the
test case to further study the impact on power savings when different numbers of voltage domains are used. The experimental
results are shown in Table V. In Table V, column 1 shows the
number of voltage domains. Column 2 is the original power of
the circuit. The power saved after phase I and phase II are shown
in columns 3 and 4, respectively. The voltage domains used in
these different experiments are shown in column 5.
As shown in Table V, we can find that when dual supply voltages (1.2 and 0.6 V) are used, the power saving is 40.3%. When
the supply voltage 0.8 V is added into the voltage domains in the
experiments, the power savings on rows 3 and 4 are the same.
When the supply voltage 0.7 V is used, the power saving on
rows 5, 6, and 7 are very close. It means that 0.6 V is not a good
choice for the lower bound of the voltage domains. Therefore,
we use the same test case to study the impact on power saving
by applying different lower bounds in the voltage domain. The
experimental results are shown in Table VI.
Table VI shows that after phase II, when 1.2 and 0.7 V are
used, the largest power consumption can be saved. Tables V
and VI also show the comparable power savings when 0.7 V is
included in the voltage domains. Therefore, we find that the use of the most suitable supply voltage is the main factor in determining the total power savings. The most suitable supply voltage of different circuits is determined according to the timing tightness of the paths in the circuit.
Although using more voltage domains saves more power, it
needs more voltage islands and the designer must make extra
efforts to accommodate the extra voltage islands. Therefore, for
the proposed algorithm, if the most suitable supply voltage for each case is known, the use of dual supply voltages is a good choice when considering both power and design effort.
In the previous experiments, the switching activity of each net
is 1. In this experiment, we would like to assign different values
of switching activity to different nets. Reference [18] described
a patternless method to estimate the switching activity of simple


TABLE II
EXPERIMENTAL RESULTS OF OUR ALGORITHM AND THE GECVS [7] ALGORITHM. THE TWO SUPPLY VOLTAGES USED ARE 1.2 AND 0.6 V

TABLE III
COMPARISON OF THE IMPROVEMENTS OUR ALGORITHM
PROVIDES COMPARED TO GECVS [7] IN THE NUMBER OF
LEVEL CONVERTERS, POWER CONSUMPTION, AND CPU TIME

Fig. 10. Power saving of the three programs, GECVS, our algorithm with dual
supply voltages, and our algorithm with four supply voltages.
TABLE V
COMPARISON OF THE RESULTS OF USING DIFFERENT NUMBERS OF VOLTAGE DOMAINS ON MAC1
TABLE IV
RESULTS OF OUR ALGORITHM USING FOUR VOLTAGE DOMAINS

TABLE VI
RESULTS OF OUR ALGORITHM USING DIFFERENT LOWER
BOUNDS OF VOLTAGE DOMAINS ON MAC1

gates, including NAND, NOR, and XOR. The switching activity at


the output of a two-input NAND/NOR gate is 3/8 when the two inputs are independent. The switching activity at the output of a k-input NAND/NOR gate is approximately 1/2^(k−1) for a large k. We know

that the function of a NAND/NOR gate is the combination of an


AND/OR gate and an inverter. Therefore, the switching activity


TABLE VII
EXPERIMENTAL RESULTS OF THE PROPOSED ALGORITHM WITH ESTIMATED SWITCHING ACTIVITY

at the output of an AND/OR gate is the same as that of a NAND/NOR gate. The switching activity at the output of any XOR gate is 1/2.
Besides, we assume that the switching activity at the output of
an INV/BUF/DFF is the same as that of its input signal. For other types of
gates, the switching activity of the output is no larger than 0.5
according to [19]. Thus, the switching activity at the output of
any other type of gate is a randomly generated value within the
range of 0.01 to 0.5. Then, the switching activity is applied in the
cost function in the proposed algorithm. Dual voltage domains
(1.2 and 0.6 V) are used and the experimental results are shown
in Table VII.
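The 3/8 figure and the behavior for wider gates follow from the usual independence argument; for completeness (a standard derivation, not text reproduced from [18]):

```latex
% For a k-input NAND (or NOR) with independent, equiprobable inputs, the output
% is 1 with probability p = 1 - 2^{-k} (NAND) or p = 2^{-k} (NOR). The switching
% activity is the probability of a 0->1 or 1->0 transition between two cycles:
SW = 2\,p\,(1-p) = \frac{2^{k}-1}{2^{\,2k-1}} \approx \frac{1}{2^{\,k-1}} \quad (k \text{ large}),
% which evaluates to 3/8 for k = 2, matching the value quoted above.
```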
As shown in Table VII, on average, we can reduce the total power consumption by 39.5% with a 9.5% overhead on the total number of LCs. Compared to Table II, our algorithm with estimated switching activity produces 2.3% more power saving and 2.4% fewer LCs than with constant switching activity. Therefore, considering the switching activity in the voltage scaling process may save more power and use fewer LCs.
VI. CONCLUSION
A two-phase voltage scaling algorithm for VLSI circuits is
proposed. The proposed algorithm utilizes the slack of each gate
to scale down the voltages of the gates. It combines a greedy approach and an iterative optimization method to scale the supply
voltage of gates effectively. On average, it improves total power
consumption by 42.5% over the original circuit with a 10.6%
overhead on the total number of LCs. Phase I in the algorithm
reduces the problem size for the optimization process in phase
II. Therefore, even though our algorithm is an iterative optimization process, it still achieves a larger power reduction in less CPU time than GECVS [7]. On average, our algorithm improves the number of level converters by 34.2% and
the power consumption by 11.0% in 27.0% less CPU time. Our
study also shows that when the most suitable supply voltage is included in the voltage domains, using more voltage domains may improve the power consumption by only a small amount. The key factor in achieving power savings is including the most suitable supply voltage in the scaling process. If more voltage

domains are used, more voltage islands will be needed and designers will be burdened with the extra voltage islands in their
designs. Thus, using dual voltage domains is a good choice
both for saving power and facilitating the design effort. We also
studied the impact of considering switching activity on total
power consumption. The results show that the algorithm reduces
total power consumption by 39.5% as compared to the original
circuit. Low power design is always an important issue for VLSI
designs. By applying lower supply voltages on nontiming critical gates, we can greatly reduce the total power consumption.
REFERENCES
[1] C. Chen, A. Srivastava, and M. Sarrafzadeh, "On gate level power optimization using dual-supply voltages," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 9, no. 5, pp. 616–629, Oct. 2001.
[2] K. Usami and M. Horowitz, "Clustered voltage scaling technique for low-power design," in Proc. Int. Symp. Low Power Design, 1995, pp. 3–8.
[3] K. Usami, T. Ishikawa, M. Kanazawa, and H. Kotani, "Low-power design technique for ASICs by partially reducing supply voltage," in Proc. 9th Annu. IEEE Int. ASIC Conf., 1996, pp. 301–304.
[4] K. Usami, M. Igarashi, F. Minami, M. Ishikawa, M. Ichida, and K. Nogami, "Automated low-power technique exploiting multiple supply voltages applied to a media processor," IEEE J. Solid-State Circuits, vol. 33, no. 3, pp. 463–472, 1998.
[5] M. Igarashi, "A low-power design method using multiple supply voltages," in Proc. Int. Symp. Low Power Design, 1997, pp. 36–41.
[6] Y. J. Yeh and S. Y. Kuo, "An optimization-based low-power voltage scaling technique using multiple supply voltages," in Proc. IEEE Int. Symp. Circuits Syst., 2001, pp. 535–538.
[7] S. N. Kulkarni, A. N. Srivastava, and D. Sylvester, "A new algorithm for improved VDD assignment in low power dual VDD systems," in Proc. Int. Symp. Low Power Design, 2004, pp. 200–205.
[8] D. Kang, M. C. Johnson, and K. Roy, "Multiple-Vdd scheduling/allocation for partitioned floorplan," in Proc. 21st Int. Conf. Comput. Design, 2003, pp. 412–418.
[9] D. Kang, M. C. Johnson, and K. Roy, "Simultaneous multiple-Vdd scheduling and allocation for partitioned floorplan," in Proc. 5th Int. Symp. Quality Electron. Design, 2004, pp. 98–103.
[10] A. Manzak and C. Chakrabarti, "A low power scheduling scheme with resources operating at multiple voltages," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 10, no. 1, pp. 6–14, Feb. 2002.
[11] S. P. Mohanty and N. Ranganathan, "Simultaneous peak and average power minimization during datapath scheduling," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 6, pp. 1157–1165, Jun. 2005.
[12] A. Srivastava and D. Sylvester, "Minimizing total power by simultaneous Vdd/Vth assignment," in Proc. Asia South Pacific Design Autom. Conf., 2003, pp. 400–403.
[13] W. Hung, Y. Xie, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, and Y. Tsai, "Total power optimization through simultaneously multiple-Vdd multiple-Vth assignment and device sizing with stack forcing," in Proc. Int. Symp. Low Power Electron. Design, 2004, pp. 144–149.
[14] Synopsys, Sunnyvale, CA, "Library compiler user guide: Modeling, timing and power technology libraries," 2003.
[15] T. Sakurai and A. R. Newton, "Alpha-power law MOSFET model and its applications to CMOS inverter delay and other formulas," IEEE J. Solid-State Circuits, vol. 25, no. 2, pp. 584–594, Apr. 1990.
[16] Y. C. Chou and Y. L. Lin, "A performance-driven standard-cell placer based on a modified force-directed algorithm," in Proc. ISPD, 2001, pp. 24–29. [Online]. Available: http://www.cs.nthu.edu.tw/~ylin/ISPD2001NTHUBenchmark/placement.htm
[17] F. Brglez, D. Bryan, and K. Kozminski, "Combinational profiles of sequential benchmark circuits-preliminary results," in Proc. Int. Symp. Circuits Syst., 1989, pp. 1929–1934. [Online]. Available: http://www.fm.vslib.cz/~kes/asic/iscas/
[18] M. Pedram, "Power minimization in IC design: Principles and applications," ACM Trans. Design Autom. Electron. Syst., vol. 1, no. 1, pp. 3–56, Jan. 1996.
[19] C. Svensson and D. Liu, "Low power circuit techniques," in Low Power Design Methodologies. Norwell, MA: Kluwer, 1996, pp. 38–64.

Jun Cheng Chi (S'05) received the M.S. degree


from the Chung Yuan Christian University, Taiwan,
R.O.C., in 2001, and the Ph.D. degree from the
Graduate Institute of Electronics Engineering,
Chung Yuan Christian University, Taiwan, R.O.C.,
in 2006.
Currently, he is a Senior Engineer with Springsoft Inc., Hsinchu, Taiwan, R.O.C. His research
interests include the areas of design automation of
timing-driven physical design, signal integrity, and
low power design methodologies.

Hung Hsie Lee received the B.S. degree in information and computer engineering from Chung Yuan
Christian University, Taiwan, R.O.C., in 2006.
Currently, he is an Engineer in the SoC Technology Center, Industrial Technology Research
Institute, Hsinchu, Taiwan, R.O.C. His research
interests include the area of design automation for
low power VLSI IC design.

Sung Han Tsai received the B.S. degree in information and computer engineering from Chung Yuan
Christian University, Taiwan, R.O.C., in 2006.
Currently, he is an Engineer with Springsoft Inc.,
Hsinchu, Taiwan, R.O.C. His research interests include the general area of VLSI CAD.

Mely Chen Chi (M'80) received the B.S. degree


from National Taiwan Normal University, Taipei,
Taiwan, R.O.C., in 1970, and the M.S. and Ph.D.
degrees from the Wesleyan University, Middletown,
CT, in 1974 and 1978, respectively, all in physics.
Since 1999, she has been a Professor in the Department of Information and Computer Engineering,
Chung Yuan Christian University, Taiwan, R.O.C.,
and also the Founding Director of the Electronics and
Information Technology Center. She worked at the
University of Pennsylvania, Philadelphia, from 1977
to 1978 and worked in the Computer-Aided Design and Test Laboratory, Bell
Laboratories, Murray Hill, NJ, from 1978 to 1990. She served as a Senior Researcher and Manager at the Computer and Communication Laboratory, Industrial Technology and Research Institute, Hsinchu, Taiwan, R.O.C., from 1990 to
1999. She has worked in the area of electronics design automation since 1978.
Prof. Chi has served in the technical committees of the IEEE International
SOC Conference and the IEEE International Symposium on Quality Electronic
Design.
