
Buffer Insertion and Sizing Under Process Variations for Low Power Clock Distribution


Joe G. Xi* Wayne W.-M. Dai
Computer Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
* Joe G. Xi is also affiliated with National Semiconductor Corp, Santa Clara, CA.

32nd ACM/IEEE Design Automation Conference

Permission to copy without fee all or part of this material is granted, provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1995 ACM 0-89791-756-1/95/0006 $3.50

Abstract

Power dissipated in clock distribution is a major source of total system power dissipation. Instead of increasing wire widths or lengths to reduce skew, which results in increased power dissipation, we use a balanced buffer insertion scheme to partition a large clock tree into a number of small subtrees. Because asymmetric loads and wire width variations in small subtrees induce very small skew, minimal wire widths are used. This results in minimal wiring capacitance and dynamic power dissipation. Then the buffer sizing problem is formulated as a constrained optimization problem: minimize power subject to tolerable skew constraints. To minimize skew caused by device parameter variations from die to die, PMOS and NMOS devices in buffers are separately sized. Substantial power reduction is achieved while skews are kept at satisfiable values under all process conditions.

1 Introduction

Carrying large loads and switching at high frequency, the clock is one of the major sources of dynamic power dissipation in a digital IC. Large drivers, which are required to drive the large clock load, also dissipate short-circuit power. As reported in [3], the clock power dissipation of the DEC Alpha chip is almost 40% of the chip's total power dissipation.

For high-speed ICs, clock skews, which are the variations of delays from the clock source to the clock terminals, should be controlled to within very small or tolerable values[6]. Tolerable skew is the maximum value of clock skew with which the system can function correctly at a desired frequency. Clock phase delays, the delays from source to sinks, also have to be limited.

Most techniques in skew minimization are based on the adjustment of the interconnect lengths and widths: the length adjustment technique moves the balance points or elongates the interconnect length to achieve zero skew[10, 2, 12] (see footnote 1); the width sizing technique achieves zero skew by assigning variable widths to wires in the clock tree[13]. However, both wire elongating and widening techniques increase the wiring capacitance and dynamic power dissipation.

Footnote 1: Zero-skew in the sense that phase delays of all sinks calculated with a delay model, e.g. the Elmore delay model, are equal under ideal process conditions.

Other works on buffered clock trees[7] do not consider power dissipation as a main optimization objective. Moreover, for a clock tree with intermediate buffers, buffer delay variations are inevitable in chip fabrication, i.e. PMOS and NMOS device parameters such as carrier mobilities and threshold voltages can vary independently from die to die[9]. This causes additional skew in a buffered clock tree that is delay balanced with typical process parameters.

In this paper, we present a method to minimize both dynamic and short-circuit power dissipation in a clock tree subject to skew and phase delay constraints. The rest of this paper is organized as follows. In section 2, we describe the buffer insertion scheme which is used to partition a large clock tree into subtrees of small path-length and loads. In section 3, we formulate and solve the device sizing problem for power minimization. Experimental results are given in section 4.

2 Buffer Insertion in Clock Tree

2.1 Problem Formulation

In order to ensure fast rise/fall times of the clock edges, buffers have to be used to drive the large load capacitance on the clock[1, 3]. There are two common clock driving schemes: in the single driver scheme as shown in Fig. 1(a), a chain of buffers is used at the clock source. Wire sizing can be used
to reduce the skew caused by asymmetric loads and wire width deviations[13, 8]; in the distributed buffers scheme as shown in Fig. 1(b), intermediate buffers are located in various parts of the clock tree.

Figure 1: (a) Single driver scheme; (b) Distributed buffers scheme.

For power minimization, the distributed buffers scheme is preferred over the single driver scheme. We consider a simple example of an equal path length tree as shown in Fig. 2, where $l_1 = l_2$. $l_0, l_1, l_2$ and $w_0, w_1, w_2$ are the lengths and widths of the tree branches. $C_{L1}$ and $C_{L2}$ are the load capacitances at sinks $s_1$ and $s_2$ respectively. The skew between $s_1$ and $s_2$ can be derived as:

$$t_s = \frac{r\,l_1}{w_1} C_{L1} - \frac{r\,l_2}{w_2} C_{L2} \tag{2.1}$$

where $r$, $c$ are the sheet resistance and unit capacitance of the wires.

Figure 2: (a) An equal path-length clock tree and (b) its delay model.

The skew variation in terms of wire width variations can be stated as:

$$\Delta t_s = \frac{\partial t_s}{\partial w_1}\Delta w_1 + \frac{\partial t_s}{\partial w_2}\Delta w_2 = -\frac{r\,C_{L1}\,l_1}{w_1^2}\Delta w_1 + \frac{r\,C_{L2}\,l_2}{w_2^2}\Delta w_2 \tag{2.2}$$

Assuming width variations $\Delta w = 15\%\,w$, the worst case additional skew, reached when $w_1$ decreases by $\Delta w_1$ and $w_2$ increases by $\Delta w_2$, is:

$$\Delta t_s = 0.15\left(\frac{r\,l_1 C_{L1}}{w_1} + \frac{r\,l_2 C_{L2}}{w_2}\right) \tag{2.3}$$

Equations (2.1) and (2.3) indicate that without wire width variations, skew is a linear function of path length. However, with wire width variations, the additional skew is a function of the product of path length and total load capacitance. Increasing the wire widths will reduce skew but result in larger capacitance and power dissipation. Reducing both the path length and the load capacitance reduces skew while minimum wire width can be used and wiring capacitance is kept at a minimum. Therefore, if buffers are inserted to partition a large clock tree into subtrees with sufficiently short path-length and small loads, the skews caused by asymmetric loads and wire width variations are very small.

Buffers cannot be excessively inserted and overly sized. Both dynamic and short-circuit power dissipation of buffers need to be minimized. However, different buffer delays cause phase delay variations on different source-to-sink paths. For simplicity, we budget the given tolerable skew of a buffered clock tree, $t_s$, into two components: the tolerable skew for buffer delays, $t_s^b$, and the skew allowed for asymmetric loads and wire width deviations after buffer insertion, $t_s^w$,

$$t_s = t_s^b + t_s^w \tag{2.4}$$

In order to balance the source-to-sink delays in a clock tree, we start with an equal path-length clock tree $T$. $T$ consists of a source $s_0$ and a set of sinks $S = \{s_1, s_2, \ldots, s_m\}$. The buffer insertion problem is to find the locations on the clock tree to insert intermediate buffers. We call these locations buffer insertion points (BIPs).

Formulation 1: Given an equal path length clock tree $T$ with minimum width for all branches and the tolerable skew constraint, $t_s^w$, the problem of buffer insertion is to determine the minimum number of BIPs in $T$, such that skew due to asymmetric loads and wire width variations is less than $t_s^w$.

2.2 Buffer Insertion Scheme

To meet the skew constraint, the buffer insertion scheme should try to balance the buffer delays on source-to-sink paths independent of the clock tree topology. Our balanced buffer insertion scheme partitions the clock tree into subtrees such that every subtree is of equal path-length and all source to sink paths have an equal number of levels. The clock tree is partitioned into multiple levels and BIPs are determined at cut-lines.

Figure 3: Buffer levels are increased until the tolerable skew bound is satisfied. Left panel: iso-radius level number = 1, with level 1 at radius L/2. Right panel: iso-radius level number = 2, with levels 1 and 2 at radii L/3 and 2L/3.

Figure 4: An example of buffer insertion in an equal path-length tree: (a) balanced buffer insertion; (b) level-by-level buffer insertion.
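The wire width sensitivity argument of Section 2.1 can be checked numerically. The following is a minimal Python sketch of equations (2.1) and (2.3) for the two-sink tree of Fig. 2, not the authors' implementation. The function names and the numeric values (branch lengths, widths, and the halved "subtree" case) are illustrative assumptions; only the 15% width variation comes from the text, and the sheet resistance and sink loads are borrowed from the experimental section.

```python
def branch_term(r, length, width, c_load):
    """One term of Eq. (2.1): branch resistance (r * l / w) times the sink load."""
    return r * length / width * c_load


def nominal_skew(r, l1, w1, cl1, l2, w2, cl2):
    """Skew between sinks s1 and s2 per Eq. (2.1)."""
    return branch_term(r, l1, w1, cl1) - branch_term(r, l2, w2, cl2)


def worst_case_extra_skew(r, l1, w1, cl1, l2, w2, cl2, rel_var=0.15):
    """First-order worst-case additional skew per Eq. (2.3)."""
    return rel_var * (branch_term(r, l1, w1, cl1) + branch_term(r, l2, w2, cl2))


# Assumed example tree: 1000 um branches, 1 um wide, 40 mOhm sheet resistance,
# 25 fF and 35 fF sink loads.
R_SHEET = 0.04
nominal = nominal_skew(R_SHEET, 1000.0, 1.0, 25e-15, 1000.0, 1.0, 35e-15)
full = worst_case_extra_skew(R_SHEET, 1000.0, 1.0, 25e-15, 1000.0, 1.0, 35e-15)

# Hypothetical subtree after partitioning: half the path length, half the load.
sub = worst_case_extra_skew(R_SHEET, 500.0, 1.0, 12.5e-15, 500.0, 1.0, 17.5e-15)

print(f"nominal skew          : {nominal:.3e} s")
print(f"extra skew, full tree : {full:.3e} s")
print(f"extra skew, subtree   : {sub:.3e} s (ratio {sub / full:.2f})")
```

Because the additional skew scales with the product of path length and load, halving both in this sketch cuts the worst-case term to a quarter, which is the effect the partitioning into small subtrees relies on.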
We define each cut-line as an iso-radius level of the clock tree, that is, a circle centered at the clock source. We choose the radius $\rho$ of the first level cut-line (nearest to the clock source) as $\rho = L/(\nu + 1)$. $L$ is the path length of the clock tree and $\nu$ is the designated number of levels of cut-lines. The radius is $\rho$ for the first iso-radius level, $2\rho$ for the second iso-radius level, $\ldots$, $\nu\rho$ for the $\nu$th iso-radius level. We need to find a minimum number $\nu$ of buffer levels to satisfy the skew constraint. We determine $\nu$ as follows. We evaluate the worst case skew as we increment $\nu$ from $1, 2, \ldots$, until some number $q$ at which the worst case skew due to asymmetric loads and wire width variations is less than $t_s^w$; then $\nu = q$. The paradigm is depicted in Fig. 3.

This buffer insertion scheme has the following properties:
Property 1: Paths from the clock source to every sink are segmented by equal levels of buffers.
Property 2: Each resulting subtree is an equal path-length clock tree if the original clock tree is an equal path-length tree.
Property 3: At each increment of iso-radius levels, the skews caused by asymmetric loads and wire width variations are monotonically decreasing.

An example of the buffer insertion scheme is shown in Fig. 4(a). Previous methods[7] insert buffers level-by-level at the branch split points of the clock tree as shown in Fig. 4(b). This works well in a full binary tree where all sinks have the same number of levels. In the case of a general equal path-length tree, such as the case in Fig. 5(b), different numbers of buffers are inserted on different source to sink paths. Depending on the clock tree topology, some large subtrees may still require wire widening to reduce skew.

Figure 5: Buffer insertion in a general equal path-length clock tree: (a) the balanced buffer insertion method; (b) the level-by-level method.

We use a simple method to determine the worst case skew caused by asymmetric loads and wire width variations. According to Elmore delay, the interconnect delay from source $s_0$ to a sink $s_i$ is[10]:

$$d_i^{wire} = \sum_{e_j \in path(s_0, s_i)} r_{e_j}\left(c_{e_j}/2 + C_j\right) \tag{2.5}$$

where $r_{e_j}$ and $c_{e_j}$ are the resistance and capacitance of branch $e_j$, and $C_j$ is the subtree capacitance at the node driven by $e_j$.

Equation (2.5) indicates that the worst delay of a sink is caused by resistance increases of all branches on the path from source to this sink and subtree capacitance increases of all nodes on the path from source to this sink. Conversely, the best delay of a sink is caused by resistance decreases of all branches on the path from source to this sink and subtree capacitance decreases of all nodes on the path from source to this sink. The resistance increase is due to the decrease of wire width while the capacitance increase is due to the increase of wire width. Based on this observation, we determine the worst case delay of a sink as shown in Fig. 6(a). The widths of all branches on the path from source to a sink are incremented by $\Delta w$, i.e. $\Delta w = 0.15w$. The widths of other branches are decremented by $\Delta w$. The
best case delay of $s_1$ is shown in Fig. 6(b). After we obtain the worst case and best case delays for all sinks, the worst case skew is then the largest value of the difference between the worst case delay of one sink and the best case delay of another sink[11].

Figure 6: Delay due to wire width variations. (a) Worst case delay of $s_1$; (b) Best case delay of $s_1$.

3 Buffer Sizing Under Process Variations

3.1 Problem Formulation

The sizes of buffers can be further adjusted to reduce power dissipation. We formulate buffer sizing as a constrained optimization problem with power as the minimization objective and skew and phase delay as the constraints.

Formulation 2: Given a clock tree $T$ with intermediate buffers, the problem of power minimization by buffer sizing (PMBS) is to determine the size of each buffer in $T$ to minimize total power, $P_{tot}$, subject to the phase delay constraint, $t_p$, and skew constraint, $t_s^b$:

$$\max(d_i) \le t_p \tag{3.6}$$

$$\max(d_i - d_j) \le t_s^b \tag{3.7}$$

where $d_i$ and $d_j$ are the phase delays from $s_0$ to sinks $s_i$ and $s_j$ respectively.

The above formulation assumes typical process parameters and a fixed PMOS/NMOS transistor ratio for CMOS buffers, e.g. $w_p/w_n = 2.0$. In reality, PMOS and NMOS device parameters in the same process, such as carrier mobilities and threshold voltages, may vary independently in a remarkably wide range from die to die[9]. Depending on process conditions, additional skew will arise from the buffer delay variations even when buffer delays are balanced with typical process parameters.

For a CMOS process, we can usually extract three sets of parameters for PMOS and NMOS devices: fast, typical and slow parameters. In chip fabrication, a die can have fast PMOS and slow NMOS, or slow PMOS and fast NMOS, or other combinations of the three sets of parameters (see footnote 2). This type of process variation is different from the device geometry variations which can be overcome by increasing the device geometries[7]. The variations of device delays can be characterized by $f_P$ for PMOS and $f_N$ for NMOS, where $f_N, f_P \ge 1$. If the rise time of a buffer with typical process parameters is $T_r$, then the rise times with fast and slow parameters, $T_r^f$ and $T_r^s$ respectively, are:

$$T_r^f = \frac{T_r}{f_P}; \quad T_r^s = T_r f_P \tag{3.8}$$

Footnote 2: We assume that parameter variations of PMOS transistors in the same die are negligible; the same holds for NMOS.

Figure 7: Delay variations due to process parameter variations. Panels (a) and (b) show buffer chains on the paths to $x_1$ and $x_2$, with PMOS and NMOS widths labeled wp11 to wp23 and wn11 to wn23.

Fig. 7 demonstrates the process variation effects. In Fig. 7(a), paths from source to two different sinks $x_1$ and $x_2$ have the same number of buffers but of different sizes. Under typical process conditions, let $d_1$ be the delay from source to $x_1$, which is the sum of the three buffer delays $d_{11}$, $d_{12}$ and $d_{13}$ on that path, and let $d_2$ be the delay from source to $x_2$. We have $d_1 = d_{11} + d_{12} + d_{13}$ and $d_2 = d_{21} + d_{22} + d_{23}$. Assume the delays of the two paths are balanced with typical process parameters,

$$d_1 = d_{11} + d_{12} + d_{13} = d_{21} + d_{22} + d_{23} = d_2 \tag{3.9}$$

then with slow PMOS and fast NMOS parameters,

$$d_1' = f_P d_{11} + d_{12}/f_N + f_P d_{13}; \quad d_2' = f_P d_{21} + d_{22}/f_N + f_P d_{23} \tag{3.10}$$

We can see that $d_1' \ne d_2'$, even when $d_1 = d_2$, if $d_{12} \ne d_{22}$ (equivalently, if $d_{11} + d_{13} \ne d_{21} + d_{23}$). This problem is more obvious if the two paths have different numbers of buffers as in Fig. 7(b)[9].
One solution to this problem is to separately balance the delays through PMOS devices (the pull-up path) and NMOS devices (the pull-down path) in the two paths[9]. If $d_{11} + d_{13} = d_{21} + d_{23}$ and $d_{12} = d_{22}$, then $d_1' = d_2'$ regardless of the variations of either PMOS or NMOS parameters.

Let $d_i^p$, $d_j^p$ be the pull-up path delays and $d_i^n$, $d_j^n$ be the pull-down path delays from the clock source to any two sinks $s_i$, $s_j$. Then the phase delays with typical process parameters from source to $s_i$ and $s_j$ are $d_i = d_i^p + d_i^n$ and $d_j = d_j^p + d_j^n$.

If the pull-up and pull-down path delays with typical process parameters satisfy:

$$d_i^p - d_j^p \le \delta_p, \quad d_i^n - d_j^n \le \delta_n \tag{3.11}$$

where the pull-up path and pull-down path skews are

$$\delta_p = \frac{t_s^b}{2 f_P}; \quad \delta_n = \frac{t_s^b}{2 f_N} \tag{3.12}$$

then in the worst case, when slow PMOS and slow NMOS are used, $d_i = f_P d_i^p + f_N d_i^n$, $d_j = f_P d_j^p + f_N d_j^n$, and the skew constraint $t_s^b$ can still be satisfied:

$$d_i - d_j = f_P (d_i^p - d_j^p) + f_N (d_i^n - d_j^n) \le f_P \delta_p + f_N \delta_n = t_s^b \tag{3.13}$$

Formulation 3: Given a clock tree $T$ with intermediate buffers, the problem of power minimization by device sizing (PMDS) while satisfying delay and skew constraints irrespective of process conditions is to determine the sizes of PMOS and NMOS devices of each buffer on $T$, such that the total power dissipation, $P_{tot}$, is minimized, subject to the phase delay constraint, $t_p$, and the pull-up path and pull-down path skew constraints, $\delta_p$ and $\delta_n$:

$$\max(d_i) \le t_p \tag{3.14}$$

$$\max(d_i^p - d_j^p) \le \delta_p, \quad \max(d_i^n - d_j^n) \le \delta_n \tag{3.15}$$

3.2 Posynomial Programming Solution

Let $w^p$ and $w^n$ denote the widths of PMOS and NMOS devices respectively. The power and delays of a clock tree with intermediate buffers can be expressed as functions of $w^p$ and $w^n$[11]:

$$P_{tot} = \sum_{j \in T} \left[ \alpha_1^p w_j^p + \alpha_1^n w_j^n + (w_j^p + w_j^n)^2 \left( \frac{\alpha_2^p}{w_{j-1}^p} + \frac{\alpha_2^n}{w_{j-1}^n} \right) \right] \tag{3.16}$$

$$d_i^p = \sum_{\text{even } j \in path(s_0, s_i)} \left( t_0^p + \beta^p \frac{w_j^p + w_j^n}{w_{j-1}^p} \right) \tag{3.17}$$

$$d_i^n = \sum_{\text{odd } j \in path(s_0, s_i)} \left( t_0^n + \beta^n \frac{w_j^p + w_j^n}{w_{j-1}^n} \right) \tag{3.18}$$

where the buffers in the $j$th subtree are on the path from $s_0$ to $s_i$, and $\alpha_1^p$, $\alpha_1^n$, $\alpha_2^p$, $\alpha_2^n$, $\beta^p$, $\beta^n$ are all nonnegative constants.

We notice that the expressions on the right hand side of (3.16), (3.17) and (3.18) are all convex functions of the variables $w^p$, $w^n$[5]. They belong to a special class of functions called posynomials[4, 5]. We rewrite the skew constraint of (3.15) as:

$$\max(d_i^p) \le \delta_p + d_f^p, \quad \max(d_i^n) \le \delta_n + d_f^n \tag{3.19}$$

where $d_f^p$ and $d_f^n$ are the smallest pull-up path and pull-down path delays. Thus the skew constraints are transformed into convex functions. The PMDS problem is transformed into a Convex Programming problem for which the global minimum is obtained once a local minimum is found[11, 4]. The transformed skew functions are also posynomials, so a Posynomial Program can be applied.

Posynomial Program Posy: $g_0$; $g_1 \le 1$; $g_2 \le 1$; $g_3 \le 1$
Minimize: $g_0(w)$
subject to: $g_1(w) \le 1$; $g_2(w) \le 1$; $g_3(w) \le 1$
where $g_0(w) = P_{tot}$, $g_1(w) = \frac{\max(d_i)}{t_p}$, $g_2(w) = \frac{\max(d_i^p)}{\delta_p + d_f^p}$, $g_3(w) = \frac{\max(d_i^n)}{\delta_n + d_f^n}$.

We devised an efficient algorithm to apply the posynomial programming approach. The algorithm works as follows. At each iteration, the fastest pull-up path and pull-down path delays, $d_f^p$, $d_f^n$, are identified and used in the skew constraints defined in (3.19). The posynomial program Posy defined above is then called. After obtaining the sizing result, the path delays of all sinks are evaluated. If the skew and phase delay constraints defined in (3.14) and (3.15) are not satisfied for any pair of sinks, $s_i$ and $s_j$, then a new pair $d_f^p$, $d_f^n$ is identified and used in (3.19), and Posy is called again. This iteration continues until all constraints defined in (3.14) and (3.15) are satisfied.

4 Experimental Results

We tested the buffer insertion and sizing methods on a Sun Sparcstation10. The PMDS algorithm converges quickly, in just 5 to 10 iterations, and each iteration takes less than 2 minutes of CPU time for all tested examples. The first example, Ex1, is an equal path-length clock tree with 54 terminals. The second example, Ex2, has 106 terminals. The third and fourth examples are based on the MCNC benchmarks Primary1 and Primary2[6]. In all the examples, we choose the minimum width (1 μm) for all branches. A 3.3 V supply voltage is used. The parameters we used in our delay and power calculations are from a 0.65 μm CMOS process: sheet resistance, r = 40 mΩ, unit capacitance,
c = 0.03fF. Two different load capacitances are used at clock sinks, 25fF and 35fF.

Table 1 gives the results of skew under device parameter variations from chip to chip. We use $f_P = 1.73$ and $f_N = 1.65$ as the variation factors for PMOS and NMOS devices in our test cases.

Examples   Typical   Fast-P,Slow-N   Slow-P,Fast-N
Ex1        0.06      0.04            0.05
Ex2        0.09      0.06            0.08
Pr1        0.15      0.12            0.16
Pr2        0.38      0.34            0.40

Table 1: Clock skews under process variations.

We compared the power, skew and phase delay results of the four examples against the results of a skew minimization by wire sizing method (WS)[13], which adopts the single driver scheme. Table 2 and Table 3 give the comparisons on power, skews and phase delays using the same process parameters.

Examples   Dynamic (W)        Short-circuit (mW)   Reduction (%)
           WS       BIS       WS        BIS
Ex1        1.334    0.389     75.8      42.9       326
Ex2        1.623    0.792     96.3      48.4       200
Pr1        4.383    1.327     209.5     112.3      306
Pr2        4.578    2.218     458.3     241.2      200

Table 2: Comparisons of power dissipation. Buffer Insertion and Sizing method (BIS) versus the Wire Sizing method (WS).

Examples   Clock Skew (ns)    Phase Delay (ns)
           WS       BIS       WS        BIS
Ex1        0.07     0.06      4.21      7.11
Ex2        0.02     0.09      2.12      4.78
Pr1        0.23     0.15      2.96      3.88
Pr2        0.67     0.38      12.3      10.2

Table 3: Comparisons of skew and phase delay.

Table 4 summarizes the test cases and buffer insertion results.

Examples   Frequency   Buffer Levels   # of buffers
Ex1        200         5               62
Ex2        100         4               24
Pr1        300         6               86
Pr2        200         7               148

Table 4: Four clock tree examples.

5 Acknowledgement

The authors would like to thank Qing Zhu for his help in the writing of the manuscript and for providing the test cases as well as the result of WS. This work was supported in part by the National Science Foundation Presidential Young Investigator Award under grant MIP-9058100 and in part by the Semiconductor Research Corporation under grant SRC-93-DJ-196.

References

[1] H. Bakoglu. Circuits, Interconnections, and Packaging for VLSI. Addison-Wesley Publishing Company, 1987.
[2] T.H. Chao, Y.C. Hsu, J.M. Ho, Kenneth D. Boese, and Andrew B. Kahng. Zero skew clock net routing. IEEE Trans. on Circuits and Systems, 39(11):799-814, 1992.
[3] D. Dobberpuhl and R. Witek. A 200MHz 64b dual-issue CMOS microprocessor. In Proc. IEEE Intl. Solid-State Circuits Conf., pages 106-107, 1992.
[4] J. G. Ecker. Geometric programming methods, computations and applications. SIAM Review, 22(3):338-362, July 1980.
[5] J.P. Fishburn and A. E. Dunlop. TILOS: A posynomial programming approach to transistor sizing. In IEEE Intl. Conf. on CAD, pages 326-328, 1985.
[6] M. A. B. Jackson, A. Srinivasan, and E. S. Kuh. Clock routing for high-performance ICs. In Proc. of 27th Design Automation Conf., pages 573-579, 1990.
[7] Satyamurthy Pullela, Noel Menezes, Junaid Omar, and Lawrence T. Pillage. Skew and delay optimization for reliable buffered clock trees. In Proc. of IEEE Intl. Conf. on CAD, pages 556-562, 1993.
[8] Satyamurthy Pullela, Noel Menezes, and Lawrence T. Pillage. Reliable non-zero skew clock trees using wire width optimization. In Proc. of 30th Design Automation Conf., pages 165-170, 1993.
[9] Masakazu Shoji. Elimination of process-dependent clock skew in CMOS VLSI. IEEE Journal of Solid-State Circuits, SC-21(1):875-880, 1986.
[10] Ren-Song Tsay. An exact zero-skew clock routing algorithm. IEEE Trans. on CAD, 12(3):242-249, 1993.
[11] Joe G. Xi and Wayne W.M. Dai. Buffer insertion and sizing under process variations for low power clock distribution. Technical Report UCSC-CRL-95-12, University of California, Santa Cruz, 1995.
[12] Qing Zhu and Wayne W.M. Dai. Perfect-balance planar clock routing with minimal path-length. In Proc. of IEEE Intl. Conf. on CAD, pages 473-476, 1992.
[13] Qing Zhu, Wayne W.M. Dai, and Joe G. Xi. Optimal sizing of high speed clock networks based on distributed RC and transmission line models. In Proc. of IEEE Intl. Conf. on CAD, pages 628-633, 1993.
