
Timing issues in Digital ASIC Design

S.Sivanantham
VLSI and ES Division
VIT University
This Class + Logistics
Timing
Storage elements, Clock distribution, Clock tree synthesis
Reading
Whitepapers/datasheets on STA; papers on clock tree synthesis
Schedule
MT in one week (lab/recitation fair game); Lab #2 due Mon 1/27
HW #9: As a block's layout is compacted down to fit into a smaller and smaller
region, the timing of the block at first improves, but then worsens. Explain.
HW #10: Hold time violations mean that the chip doesn't work at any frequency.
Propose several distinct methods for fixing hold time violations (guided by post-
routing static timing analysis), and explain the pros and cons of each.
HW #11: Compare DEC's first Alpha and first StrongARM processors (look up
transistor counts, supply voltage, frequency, etc.). (a) How much of
StrongARM's power efficiency can be attributed to process, supply, and
frequency scaling? (b) What factors might contribute to the remainder?
Slide courtesy of S. P. Levitan, U. Pittsburgh
Review
Static timing analysis (Lecture 4)
Pin-based timing graph
Directed acyclic graph (DAG) of timing arcs
Longest path in a DAG is found in time linear in #arcs (edges)
Slack = required arrival time - actual arrival time (long-path
analysis)
Logic synthesis (Lecture 5)
Slide courtesy of S. P. Levitan, U. Pittsburgh
Static Analysis vs. Dynamic Analysis
Why static analysis when dynamic simulation is more
accurate?
Drawbacks of simulation
Requires input vectors (stimuli for circuit)
Long runtimes
Example: calculate worst-case rising delay from a to z
Exponential explosion with number of possible design input states
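[Figure: logic network with inputs a, b, c and output z]
The a-to-z delay depends on the state of the other inputs:
        c=0           c=1
b=0     a-z delay1    a-z delay2
b=1     a-z delay3    a-z delay4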
STA Terminology
(Actual) arrival time (AAT, or AT) = time at which a pin switches state
Usually 50% point on voltage curve, i.e., AT = t_50
Slew time = time over which signal switches
Usually difference between 10% and 90% on voltage curve, i.e., t_slew = t_90 - t_10
Required arrival time (RAT) = time at which a signal must arrive in order to avoid a chip fail
Slack = RAT - AAT
Positive slack good (= margin), negative slack bad
[Figure: voltage vs. time waveform rising to Vdd, with the 10%, 50%, and 90% points marked]
Example: What is slack at PO?
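[Figure: timing graph annotated with gate delays (d) and arrival times (at). Primary inputs have at=0; arrival times are propagated forward, taking the latest candidate at each gate output (temporary values at=3 and at=7 are overwritten by later arrivals). At the primary output, at=11 and rat=10.]
Slack = rat - at = 10 - 11 = -1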
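The forward propagation used in this example can be sketched in a few lines. This is a minimal illustration, not the algorithm of any particular STA tool; the graph, pin names, and delays below are hypothetical and do not reproduce the figure. Pins are visited in topological order, so each arc is examined once, which is the linear-time property noted in the review slide.

```python
from collections import defaultdict, deque

def arrival_times(arcs, primary_inputs):
    """Late-mode arrival times on a timing DAG.

    arcs: list of (src, dst, delay) timing arcs.
    primary_inputs: dict mapping each source pin to its arrival time.
    Assumes every pin without incoming arcs appears in primary_inputs.
    """
    fanout = defaultdict(list)
    indegree = defaultdict(int)
    for src, dst, delay in arcs:
        fanout[src].append((dst, delay))
        indegree[dst] += 1

    at = dict(primary_inputs)
    ready = deque(primary_inputs)
    while ready:
        pin = ready.popleft()
        for dst, delay in fanout[pin]:
            # Keep the latest arrival seen so far (long-path analysis).
            at[dst] = max(at.get(dst, float("-inf")), at[pin] + delay)
            indegree[dst] -= 1
            if indegree[dst] == 0:
                ready.append(dst)
    return at

# Hypothetical graph: two gates feeding a primary output "po".
arcs = [("a", "n1", 2), ("b", "n1", 1), ("n1", "po", 3), ("c", "po", 1)]
at = arrival_times(arcs, {"a": 0, "b": 0, "c": 0})
rat_po = 4                      # assumed required arrival time at the output
slack_po = rat_po - at["po"]    # Slack = RAT - AAT
print(at["po"], slack_po)
```

Early-mode analysis would use the same traversal with `min` in place of `max`.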
Example: Incremental Timing Analysis
[Figure: the same timing graph after an incremental change. The annotations show three arcs now marked d=1 and updated arrival times at=3, at=7, and at=10 at the primary output; rat=10 is unchanged.]
Slack = rat - at = 10 - 10 = 0
Amount of work is bounded by sizes of fanin, fanout
cones of logic
Early-Mode Analysis
Definitions change as follows
RAT = lower bound on arrival time
Propagate shortest possible instead of longest possible delays
Slack = Arrival - Required
Example: negative slack because AT_c is too small (early)
[Figure: circuit with signals a, b, x, c, y and two unit-delay (1) elements]
AT_a = 0, AT_b = 1, AT_x = 1, AT_c = 0, AT_y = 1
RAT_x = 2
SL_a = 0 - 0 = 0
SL_b = 1 - 0 = 1
SL_x = 1 - 2 = -1
SL_c = 0 - 1 = -1
SL_y = 1 - 1 = 0
Enhancements of STA
Incremental timing analysis
Nanometer-scale process effects: variation (probabilistic timing analysis)
Interference: crosstalk
Multiple inputs switching
Conservatism of delay propagation
(Old: HW #8: Suppose you change the size of one (combinational) gate in
your design, thus invalidating the previous timing analysis. How much work
must be done to regain a correct timing analysis?)
Courtesy K. Keutzer et al. UCB
Timing Correction
Driven by STA
Incremental performance analysis backplane
Fix electrical violations
Resize cells
Buffer nets
Copy (clone) cells
Fix timing problems
Local transforms (bag of tricks)
Path-based transforms
DAC-2002, Physical Chip Implementation
Local Synthesis Transforms
Resize cells
Buffer or clone to reduce load on critical nets
Decompose large cells
Swap connections on commutative pins or among
equivalent nets
Move critical signals forward
Pad early paths
Area recovery
DAC-2002, Physical Chip Implementation
Transform Example
Double Inverter Removal
[Figure: path before removal, Delay = 4; path after removing the back-to-back inverters, Delay = 2]
DAC-2002, Physical Chip Implementation
Resizing
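[Figure: delay d (0 to 0.05) vs. load (0 to 1) curves for three sizes of a cell, A, B, and C. A gate with inputs a, b drives sinks d, e, f with loads 0.2, 0.2, and 0.3. Implemented with size A the delay is 0.035; with size C it is 0.026.]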
DAC-2002, Physical Chip Implementation
Cloning
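[Figure: delay d (0 to 0.05) vs. load (0 to 1) curves for cells A, B, and C. A gate with inputs a, b drives five sinks d, e, f, g, h, each with load 0.2. After cloning, the sinks are split between two copies of the gate (labeled A and B).]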
DAC-2002, Physical Chip Implementation
Buffering
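[Figure: delay d (0 to 0.05) vs. load (0 to 1) curves for cells A, B, and C. A gate with inputs a, b drives five sinks d, e, f, g, h, each with load 0.2. After buffering, buffers (B) are inserted between the gate and some of the sinks, so the gate sees a smaller load.]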
DAC-2002, Physical Chip Implementation
Redesign Fan-in Tree
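[Figure: fan-in tree with inputs a, b, c, d, output e, and unit gate delays]
Arr(a)=4, Arr(b)=3, Arr(c)=1, Arr(d)=0
Original tree: Arr(e)=6
Redesigned tree (late-arriving inputs placed closest to the output): Arr(e)=5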
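The numbers on this slide can be reproduced with a small sketch. It assumes 2-input gates with unit delay; the balanced grouping (a,b),(c,d) is one arrangement that yields Arr(e)=6 and is an assumption about the original figure. The redesign is modeled here as a Huffman-style greedy heuristic that always combines the two earliest signals, which is offered as an illustration rather than as the method used by any specific tool.

```python
import heapq

def balanced_tree_arrival(arrivals, gate_delay=1):
    """Pair inputs level by level, ignoring their arrival times."""
    level = list(arrivals)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            # A leftover single signal passes to the next level unchanged.
            nxt.append(max(pair) + gate_delay if len(pair) == 2 else pair[0])
        level = nxt
    return level[0]

def greedy_tree_arrival(arrivals, gate_delay=1):
    """Repeatedly combine the two earliest signals with a 2-input gate."""
    heap = list(arrivals)
    heapq.heapify(heap)
    while len(heap) > 1:
        first = heapq.heappop(heap)
        second = heapq.heappop(heap)
        heapq.heappush(heap, max(first, second) + gate_delay)
    return heap[0]

arrivals = [4, 3, 1, 0]   # Arr(a), Arr(b), Arr(c), Arr(d) from the slide
original = balanced_tree_arrival(arrivals)
redesigned = greedy_tree_arrival(arrivals)
print(original, redesigned)
```

With these inputs the early signals c and d absorb the extra logic levels, so the late signal a passes through only one gate.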
DAC-2002, Physical Chip Implementation
Redesign Fan-out Tree
[Figure: fan-out tree before and after redesign, annotated with stage delays]
Before: Longest Path = 5
After: Longest Path = 4
Slowdown of buffer due to load
DAC-2002, Physical Chip Implementation
Decomposition
DAC-2002, Physical Chip Implementation
Swap Commutative Pins
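[Figure: gate with inputs a, b, c on commutative pins, annotated with arrival times and pin delays, shown before and after the connections are swapped; the output arrival time is smaller after the swap]
Simple sorting on arrival times and delay works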
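The sorting idea can be sketched as follows: connect the latest-arriving signal to the fastest pin, the next latest to the next fastest, and so on. The arrival times and pin delays below are hypothetical, since the figure's annotations are not recoverable, and the helper names are introduced only for this illustration.

```python
def worst_arrival(arrivals, pin_delays):
    """Output arrival time when signal i is connected to pin i."""
    return max(a + d for a, d in zip(arrivals, pin_delays))

def assign_pins(arrivals, pin_delays):
    """Pair the latest signals with the fastest pins.

    Returns a list of (signal_index, pin_index) pairs.
    """
    signals = sorted(range(len(arrivals)), key=lambda i: arrivals[i], reverse=True)
    pins = sorted(range(len(pin_delays)), key=lambda j: pin_delays[j])
    return list(zip(signals, pins))

arrivals = [2, 1, 0]      # hypothetical arrival times of a, b, c
pin_delays = [2, 1, 1]    # hypothetical pin-to-output delays; pin 0 is slow

before = worst_arrival(arrivals, pin_delays)
assignment = assign_pins(arrivals, pin_delays)
after = max(arrivals[s] + pin_delays[p] for s, p in assignment)
print(before, after)
```

This only applies when the pins are logically interchangeable, as on an AND or OR gate.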
DAC-2002, Physical Chip Implementation
Outline
Clocking
Storage elements
Clocking metrics and methodology
Clock distribution
Package and useful-skew degrees of freedom
Clock power issues
Gate timing models
Why Clocks?
Clocks provide the means to synchronize
By allowing events to happen at known timing boundaries, we
can sequence these events
Greatly simplifies building of state machines
No need to worry about variable delay through
combinational logic (CL)
All signals delayed until clock edge (clock imposes the worst
case delay)
[Figure: two clocked structures. Dataflow: register, combinational logic, register. FSM: combinational logic whose outputs feed back through a register.]
Clock Cycle Time
Cycle time is determined by the delay through the CL
Signal must arrive before the latching edge
If too late, it waits until the next cycle
- Synchronization and sequential order become incorrect
t_cycle > t_prop_delay + t_overhead
Can change circuit architecture to obtain smaller t_cycle
Pipelining
For dataflow:
Instead of a long critical path, split the critical path into chunks
Insert registers to store intermediate results
This allows 2 waves of data to coexist within the CL
Can we extend this ad infinitum?
Overhead eventually limits the pipelining
- E.g., 1.5 to 2 gate delays for latch or FF
Granularity limits as well
- Minimum time quantum: delay of a gate
Unpipelined: t_cycle > t_pd + t_overhead
Pipelined: t_cycle > max(t_pd1, t_pd2) + t_overhead
[Figure: unpipelined datapath (register, CL A+B with delay t_pd, register) next to the pipelined version (register, CL A with delay t_pd1, register, CL B with delay t_pd2, register)]
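A quick numeric sketch of the two bounds above shows why pipelining cannot be extended indefinitely. All delays are hypothetical and expressed in gate delays; the overhead of 2 gate delays is taken from the range quoted on the slide.

```python
def cycle_time(stage_delays, overhead):
    """Lower bound on t_cycle: slowest stage plus storage-element overhead."""
    return max(stage_delays) + overhead

t_overhead = 2                                   # e.g., 2 gate delays per FF

unpipelined = cycle_time([20], t_overhead)       # one 20-gate critical path
two_stage = cycle_time([12, 8], t_overhead)      # uneven split: slow stage wins
ten_stage = cycle_time([2] * 10, t_overhead)     # overhead is half the cycle

for name, t in [("1 stage", unpipelined), ("2 stages", two_stage),
                ("10 stages", ten_stage)]:
    print(name, t, "overhead fraction:", round(t_overhead / t, 2))
```

In the ten-stage case half of every cycle is spent in the registers, which is the point at which further splitting stops paying off.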
FO4 INV Delays Per Clock Period
[Figure: number of FO4 inverter delays per clock period (axis 0 to 120) vs. year (1982 to 2004) for Intel processors: 386, 486, DX2, DX4, Pentium, Pentium MMX, Pentium Pro, Pentium II, Celeron, Pentium III, Pentium 4]
FO4 INV = inverter driving 4 identical inverters (no interconnect)
Half of frequency improvement has been from reduced logic stages, i.e., pipelining
Parallelism
For FSMs:
Same functionality and performance can be achieved at half
the clock rate
However, the input and output signals must be doubled (to
account for the outputs for each original cycle)
Instead of doubling the delay, the optimized logic is often
logarithmically related to the degree of parallelism
t_cycle1 > t_pd + t_ov
t_cycle2 > N t_pd + t_ov
t_cycle3 > log(N t_pd) + t_ov
[Figure: original FSM (CL with delay t_pd and an M-bit register); unrolled version with cascaded CL blocks and a 2*M-bit interface; version with optimized (Opt.) CL]
Storage Elements
Latches
Level-sensitive: transparent when clock is high (H), hold when low (L)
[Figure: latch circuit and symbol with signals d, ck, ckb, q, p_q]
Flip-flops
Edge-triggered: data is sampled at the clock edge
Latch and Flip-Flop Gates
[Figure: gate-level schematics. Rising edge flip-flop: inputs D and clock, outputs Q and QN. Active high latch: inputs in and enable, output out.]
Latch and flip-flop schematics from TSMC 0.13um LV Artisan Sage-X Standard Cell Library.
Latch and Flip-Flop Behavior
Rising edge flip-flop: t_CQ, 4 inverter delays
Active high latch: t_DQ, 2 inverter delays
[Figure: active signal paths between D, Q, and QN in each circuit when clock is high and when clock is low]
Clock Skew and Jitter
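[Figure: clock distributed to points A and B, with three waveform panels]
(a) Clock skew: t_sk,AB between clock at A and clock at B
(b) Duty cycle jitter: t_duty variation of the high time T_high of clock at B
(c) Cycle-to-cycle edge jitter: edges of clock at B move by up to t_j/2 on either side of the nominal period T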
Flip-Flop Timing Characteristics
Rising edge flip-flops A and B with a non-ideal clock
Setup time constraint: t_CQ,max + t_comb,max + t_su + t_sk + t_j <= T
Hold time constraint: t_CQ,min + t_comb,min >= t_h + t_sk
[Figure: waveforms at A and B showing these intervals relative to the clock edges]
Latch Setup Time and Transparency
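Active high latches A and B with a non-ideal clock
[Figure: transparency. Data passes through the latches while the clock is high; the path is annotated with t_DQ, t_comb, t_DQ.]
No penalty to clock period for setup time constraint!
Setup time constraint
[Figure: waveforms at A and B annotated with t_CQ, t_comb,max, t_su, t_sk + t_j, and t_duty]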
Setup Time
Important characteristics of storage elements
Setup time, hold time, clock-to-q delay
Setup time, t_su
Time before the clock edge that the data must arrive in order
for the new data to be stored
The setup time for a F/F occurs before the latching edge.
The setup time for a Latch occurs before the transition from
transparent to hold
[Figure: waveforms of d, ck, q with t_setup marked before the clock edge]
Hold Time
A second important characteristic is the hold time, t_h
Time after the clock edge that the data must remain stable in order for
the data to be properly held
Note that hold time (and setup time) can be negative
Why isn't hold time just the negative of setup time?
Storage elements typically have some data dependence
- Capacitances, and devices may be faster for one data value
versus another
Specify the worst case for process technology and operating
condition variations
[Figure: waveforms of d, ck, q with t_hold marked after the clock edge]
Clocking Overhead
Inherent delay in any storage element
The delay is measured from
Clock transition to output data transition, t_c2q
Input data transition to output data transition, t_d2q
Flip-flop is edge-triggered
The overhead is t_c2q + t_su
Latch is level-sensitive
The overhead is t_d2q
[Figure: waveforms of d, ck, q with t_c2q and t_d2q marked]
Clock Skew
[Figure: clock source (e.g., PLL) driving CLK1 and CLK2; waveforms vs. time show latency from the source to each sink (t_1, t_2) and the skew between CLK1 and CLK2]
Most high-profile of clock network metrics
Maximum difference in arrival times of clock signal to any 2 latches/FFs fed by the network
Skew = max | t_1 - t_2 |
Fig. From Zarkesh-Ha
Sylvester / Shepard, 2001
Clock Skew Causes
Designed (unavoidable) variations: mismatch in buffer load
sizes, interconnect lengths
Process variation: process spread across die yielding
different L_eff, T_ox, etc. values
Temperature gradients: change MOSFET performance
across die
IR voltage drop in power supply: changes MOSFET
performance across die
Note: Delay from clock generator to fan-out points (clock
latency) is not important by itself
BUT: increased latency leads to larger skew for same amount of
relative variation
Sylvester / Shepard, 2001
Clock Jitter
Clock network delay uncertainty
From one clock cycle to the next, the period is not exactly the same
each time
Maximum difference in phase of clock between any two periods is
jitter
Must be considered in max path (setup) timing; typically O(50ps) for
high-end designs
Sylvester / Shepard, 2001
Clock Jitter Causes
PLL oscillation frequency
Various noise sources affecting clock generation and
distribution
E.g., power supply noise dynamically alters drive strength of
intermediate buffer stages
Jitter reduced by minimizing IR and L*(di/dt) noise
Courtesy Cypress Semi
Sylvester / Shepard, 2001
Clocking Methodology (Edge-Triggered)
[Figure: flip-flop, combinational logic, flip-flop, combinational logic, clocked with period t_per]
Max(t_pd) < t_per - t_su - t_c2q - t_skew
Delay is too long for data to be captured
Min(t_pd) > t_h - t_c2q + t_skew
Delay is too short and data can race through, skipping a state
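The two inequalities above translate directly into slack checks. The sketch below uses hypothetical numbers in picoseconds and function names introduced for illustration; it also shows that only the setup check depends on the clock period.

```python
def setup_slack(t_per, t_pd_max, t_su, t_c2q, t_skew):
    """Positive when Max(t_pd) < t_per - t_su - t_c2q - t_skew."""
    return (t_per - t_su - t_c2q - t_skew) - t_pd_max

def hold_slack(t_pd_min, t_h, t_c2q, t_skew):
    """Positive when Min(t_pd) > t_h - t_c2q + t_skew."""
    return t_pd_min - (t_h - t_c2q + t_skew)

# Hypothetical values in ps.
s_fast_clock = setup_slack(t_per=1000, t_pd_max=900, t_su=50, t_c2q=80, t_skew=40)
s_slow_clock = setup_slack(t_per=1200, t_pd_max=900, t_su=50, t_c2q=80, t_skew=40)
h = hold_slack(t_pd_min=10, t_h=60, t_c2q=80, t_skew=40)

print("setup slack at 1000 ps:", s_fast_clock)   # negative: path too long
print("setup slack at 1200 ps:", s_slow_clock)   # fixed by a slower clock
print("hold slack:", h)                          # negative at any period
```

The hold check has no t_per term, so a negative hold slack has to be fixed in the design (for example by padding the short path or reducing skew), not by changing frequency.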
Example of t_pdmax Violation
Suppose there is skew between the registers in a dataflow
(regA after regB)
i gets its input values from regA at transition in Ck
CL output o arrives after Ck transition due to skew
To correct this problem, can increase cycle time
[Figure: regA, combinational logic (delay t_pdmax), regB, with waveforms of the two clocks, i, and o. Because of t_skew, o arrives after the capturing clock edge: too late!]
Example of t_pdmin Violation: Race Through
Suppose clock skew causes regA to be clocked before regB
i passes through the CL with little delay (t_pdmin)
o arrives before the rising Ck causes the data to be latched
Cannot be fixed by changing frequency: you have a rock instead of a chip
[Figure: regA, combinational logic (delay t_pdmin), regB, with waveforms of the two clocks, i, and o. Because of t_skew, o changes before regB's clock edge: too early!]
Time Borrowing (Cycle Stealing)
Cycle steal with flip-flops using delayed clocks
[Figure: flip-flop pipeline in which the clock to one flip-flop passes through an intentional delay; intentional delay = skew]
Time borrowing with latches
Give it back in later stages
[Figure: latch pipeline with combinational logic between latches]
t_pd < t_per + t_w
Clock Distribution
General goal of clock distribution
Deliver clock to all memory elements with acceptable skew
Deliver clock edges with acceptable sharpness
Clocking network design is one of the greatest challenges
in the design of a large chip
Clocks generally distributed via wiring trees (and meshes)
Low-resistance interconnect to minimize delay
Multiple drivers to distribute driver requirements
Use optimal sizing principles to design buffers
Clock lines can create significant crosstalk
Clock Distribution Problem Statement
Objective
Minimum skew (performance and hold time issues)
Minimum cell area and metal use
(sometimes) minimal latency
(sometimes) particular latency
(sometimes) intermixed gating for power reduction
(sometimes) hold to particular duty cycle: e.g. 50:50 +- 1 percent
Subject to:
Process variation from lot-to-lot
Process variation across the die
Radically different loading (ff density) around the die
Metal variation across the die
Power variation across the die (both static IR and dynamic)
Coupling (same and other layers)
Issues in Clock Distribution Network Design
Skew
Process, voltage, and temperature
Data dependence
Noise coupling
Load balancing
Power, CV^2 f (no or )
Clock gating
Flexibility/Tunability
Compactness: fit into existing layout/design
Reliability
Electromigration
Skew: Clock Delay Varies With Position
Clock Distribution Methods
RC-Tree
Less capacitance
More accuracy
Flexible wiring
Grids
Reliable
Less data dependency
Tunable (late in design)
Shown here for final stage drivers driving F/F loads
RC-Trees
[Figure: three tree topologies: X-tree, binary tree, H-tree]
Asymmetric trees can be, and are, used due to uneven sink
distribution, hard macros in floorplan (hierarchical clock
distribution), etc.; the basic goal is to have even RC delays
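One common way to judge whether RC delays are "even" is the Elmore delay, which a later slide also mentions. The sketch below computes it for a small RC tree; the topology, resistances, and capacitances are hypothetical, and the driver resistance is ignored for simplicity.

```python
def elmore_delays(parent, resistance, capacitance):
    """Elmore delay from the root to every node of an RC tree.

    parent[i]: index of node i's parent (node 0 is the root, parent None);
               every node must appear after its parent in the list.
    resistance[i]: resistance of the wire from parent[i] to node i.
    capacitance[i]: capacitance lumped at node i.
    """
    n = len(parent)
    downstream = list(capacitance)
    for i in range(n - 1, 0, -1):          # accumulate subtree capacitance
        downstream[parent[i]] += downstream[i]
    delay = [0.0] * n
    for i in range(1, n):                  # parents are processed first
        delay[i] = delay[parent[i]] + resistance[i] * downstream[i]
    return delay

def sink_skew(parent, resistance, capacitance, sinks):
    d = elmore_delays(parent, resistance, capacitance)
    return max(d[s] for s in sinks) - min(d[s] for s in sinks)

# Root 0, branch nodes 1 and 2, sinks 3..6 (a two-level symmetric tree).
parent = [None, 0, 0, 1, 1, 2, 2]
resistance = [0.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0]
balanced_caps = [0.0, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0]
uneven_caps = [0.0, 0.5, 0.5, 1.0, 1.0, 1.0, 2.0]   # one heavier sink

sinks = [3, 4, 5, 6]
skew_balanced = sink_skew(parent, resistance, balanced_caps, sinks)
skew_uneven = sink_skew(parent, resistance, uneven_caps, sinks)
print(skew_balanced, skew_uneven)
```

A single heavier sink slows its own branch and also its sibling through the shared wire, which is why uneven sink distributions call for asymmetric trees.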
Grids
Gridded clock distribution common on
earlier DEC Alpha microprocessors
Advantages:
Skew determined by grid density, not
too sensitive to load position
Clock signals available everywhere
Tolerant to process variations
Usually yields extremely low skew
values
Disadvantages:
Huge amount of wiring and power
To minimize such penalties, need to make grid pitch coarser, which loses the grid advantage
[Figure: pre-drivers feeding a global grid]
Sylvester / Shepard, 2001
Trees
H-tree (Bakoglu)
One large central driver, recursive structure to
match wirelengths
Halve wire width at branching points to reduce
reflections
Disadvantages
Slew degradation along long RC paths
Unrealistically large central driver
- Clock drivers can create large temperature
gradients (e.g., Alpha 21064: ~30 °C)
Non-uniform load distribution
Inherently non-scalable (wire R growth)
Partial solution: intermediate buffers at branching
points
courtesy of P. Zarkesh-Ha
Sylvester / Shepard, 2001
Buffered Tree
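[Figure: buffered clock tree. The PLL feeds level-2 (L2) buffers and the global buffers NGBuf, SGBuf, EGBuf, WGBuf; each level-3 (L3) buffer drives all clock loads within its region; other branches go to other regions of the chip.]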
Sylvester / Shepard, 2001
Buffered H-tree
Advantages
Ideally zero-skew
Can be low power (depending on skew requirements)
Low area (silicon and wiring)
CAD tool friendly (regular)
Disadvantages
Sensitive to process variations
- Devices: want same-size buffers at each level of tree
- Wires: want similar segment lengths on each layer in each source-sink path
Local clocking loads inherently non-uniform
Sylvester / Shepard, 2001
Tree Balancing
Some techniques:
a) Introduce dummy loads
b) Snaking of wirelength to match delays
Con: Routing area
often more valuable
than Silicon
Sylvester / Shepard, 2001
Examples From Processor Chips
Serpentines
Intel x86
[Young ISSCC97]
Grids
DEC [Alphas]
H-Tree, Asymmetric
RC-Tree (IBM)
Examples From Processor Chips
DEC-Alpha 21064 clock spines
DEC-Alpha 21164 RC delays for Global
Distribution (Spine + Grid)
DEC-Alpha 21064 RC delays
DEC-Alpha 21164 RC local delays
ReShape Clocks Example (High-End ASIC)
Balanced, shielded H-tree for pre-clock distribution
Mesh for block level distribution
Pre-clock 2 Level H-tree
output mesh
All routes 5-6u M6/5,
shielded with 1u
grounds
~10 buffers per node
E.g., ganged BUFx20s
Output mesh must hit
every sub-block
Block Level Mesh (.18u)
Max 600u stride
1u m5 ribs every 20 - 30 u
(4 to 6 rows)
Shielded input and output m6 shorting straps
Clumps of 1-6 clock buffers, surrounded by
capacitor pads
Pre-clock connects to input shorting straps
Problems with Meshes
Burn more power at low frequencies
Blocks more routing resources (solution, integrated
power distribution with ribs can provide shielding for
free)
Difficult for spare clock domains that will not tolerate
regioning
Post placement (and routing) tuning required
No beneficial skew possible
Problems with Meshes (#2)
Clock gating only easy at root
Fighting tools to do analysis:
Clumped buffers a problem in Static Timing Analysis tools
Large shorted meshes a problem for STA tools
What does Elmore delay calculation look like for a non-tree?
Need full extractions and spice-like simulation (e.g.
Avant! Star-Sim) to determine skew
Benefits of Meshes (#3)
Deterministic since shielded all the way down to rib
distribution
No ECO placement required: all buffers preplaced
before block placement
Low latency since uses shorted (= ganged, parallel)
drivers, therefore lower skew
ECO placements of FFs later do not require rebalance of
tree
"Idealized" clocking environment for concurrent RTL
design and timing convergence "dance"
Mesh Example
~ 100k flops
6 blocks
Clock Skew Thermal Map
Pre-tuning
Clock Skew Thermal Map #2
50ps block/ 100ps global skew, post tuning
Alternative Clock Network Strategy
Globally Tree
Power requirements
reduced relative to global
grid
Smaller routing
requirements, frees up
global tracks
Trees balanced easily at
global level
Keeps global skew low
(with minimal process
variation)
Sylvester / Shepard, 2001
Vertex Locations in a Bounded-Skew Tree
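[Figure: two placements on a 0 to 6 grid showing feasible locations of internal nodes under a skew bound, next to the topology: source s_0, internal nodes v, a, b, and sinks s_1, s_2, s_3, s_4]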
Given a skew bound, where can internal nodes of the given topology
(e.g., a, b, v) be placed?
Deferred-Merge Embedding (DME) Algorithm
[Figure: topology (source s_0, internal nodes v, a, b, sinks s_1 to s_4) and a layout of the sinks with merging regions mr(a), mr(b), mr(v) for skew bound B = 4]
Bottom-Up: build tree of merging regions corresponding to given topology
Special case: skew = 0 gives merging segments
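For the zero-skew special case, the basic step of the bottom-up phase is deciding how to split the wire between two subtrees so that both see the same delay. The sketch below does this under the Elmore delay model with uniform per-unit-length wire resistance r and capacitance c; the delay model, the function names, and all numbers are assumptions made for illustration, and the geometric construction of merging segments is not shown.

```python
def zero_skew_split(t1, c1, t2, c2, length, r, c):
    """Fraction x of the connecting wire assigned to subtree 1.

    t1, t2: delays inside the two subtrees; c1, c2: their total capacitance.
    The merge point sits x * length away from subtree 1's root.
    Obtained by equating the two Elmore delays and solving for x.
    """
    alpha = r * length          # total wire resistance
    beta = c * length           # total wire capacitance
    return (t2 - t1 + alpha * (c2 + beta / 2)) / (alpha * (beta + c1 + c2))

def branch_delay(t_sub, c_sub, seg_len, r, c):
    """Elmore delay from the merge point through seg_len of wire into a subtree."""
    return t_sub + r * seg_len * (c * seg_len / 2 + c_sub)

# Hypothetical subtrees and wire parameters.
t1, c1 = 0.0, 1.0
t2, c2 = 1.0, 2.0
length, r, c = 10.0, 0.1, 0.2

x = zero_skew_split(t1, c1, t2, c2, length, r, c)
d1 = branch_delay(t1, c1, x * length, r, c)
d2 = branch_delay(t2, c2, (1 - x) * length, r, c)
print(x, d1, d2)
```

The merge point lands closer to the slower, heavier subtree. If x falls outside the range 0 to 1, no point on the direct connection balances the delays and extra wire is needed on one side; this sketch does not handle that case.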
Top-Down Embedding Phase of DME
[Figure: topology (source s_0, internal nodes v, a, b, sinks s_1 to s_4) and the same layout with a, b, v embedded inside their merging regions, B = 4]
Top-Down: choose embedding points within merging regions
Zero-Skew Example (555 sinks, 40 obstacles)
Skew Reduction Using Package
Most clock network
latency occurs at global
level (largest distances
spanned)
Higher latency leads to higher skew
With reverse scaling,
routing low-RC signals
at global level becomes
more difficult & area-
consuming
Sylvester / Shepard, 2001
Skew Reduction Using Package
[Figure: system clock routed through the package substrate and delivered to the µP/ASIC through solder bumps]
Incorporate global
clock distribution into the
package
Flip-chip packaging
allows for high density,
low parasitic access from
substrate to IC
RC of package-level wiring up
to 4 orders of magnitude smaller
than on-chip wiring
Global skew reduced
Lower capacitance, hence lower power
Opens up global routing tracks
Results not yet conclusive
Sylvester / Shepard, 2001
Useful Skew (= cycle-stealing)
Zero skew
Global skew constraint
All skew is bad
Useful skew
Local skew constraints
Shift slack to critical paths
[Figure: timing slacks (hold, setup) for a chain FF, fast path, FF, slow path, FF, shown under zero skew and under useful skew]
W. Dai, UC Santa Cruz
Skew = Local Constraint
Timing is correct as long as the signal arrives in the
permissible skew range
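D : longest path
d : shortest path
Permissible range: -d + t_hold < Skew < T_period - D - t_setup
Below the range: race condition; inside the range: safe; above the range: cycle time violation
[Figure: two FFs connected by logic, with the skew axis divided into race condition, safe, and cycle time violation regions]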
W. Dai, UC Santa Cruz
Skew Scheduling for Design Robustness
Design will be more robust if clock signal arrival time is in
the middle of permissible skew range, rather than on edge
Can solve a linear program to maximize robustness =
determine prescribed sink skews
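[Figure: three FFs in a chain with a 2 ns path between the first and second and a 6 ns path between the second and third; T = 6 ns. The permissible skew range is drawn for each path.]
Clock arrival times 0 0 0 : at verge of violation
Clock arrival times 2 0 2 : more safety margin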
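The three-flip-flop example can be checked against the permissible-range inequality. The sketch assumes that skew means launch-clock arrival minus capture-clock arrival (the sign convention implied by the inequality), that each path has a single delay (D = d), and that t_setup = t_hold = 0; none of these assumptions is stated on the slide.

```python
def permissible_range(t_period, d_long, d_short, t_setup=0.0, t_hold=0.0):
    """(-d + t_hold, T - D - t_setup): open interval of allowed skew."""
    return (-d_short + t_hold, t_period - d_long - t_setup)

def margin(skew, bounds):
    """Distance from the skew to the nearer end of its permissible range."""
    lo, hi = bounds
    return min(skew - lo, hi - skew)

T = 6.0
range_12 = permissible_range(T, 2.0, 2.0)   # 2 ns path, FF1 -> FF2
range_23 = permissible_range(T, 6.0, 6.0)   # 6 ns path, FF2 -> FF3

def schedule_margin(clk):
    """Worst margin for clock arrival times (FF1, FF2, FF3)."""
    return min(margin(clk[0] - clk[1], range_12),
               margin(clk[1] - clk[2], range_23))

zero_skew = schedule_margin((0.0, 0.0, 0.0))
useful_skew = schedule_margin((2.0, 0.0, 2.0))
print(range_12, range_23, zero_skew, useful_skew)
```

Under these assumptions the zero-skew schedule has no margin on the 6 ns path, while the 2 0 2 schedule leaves 2 ns on both paths. A skew scheduler would search for the arrival times that maximize this worst-case margin, which is the linear program mentioned above.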
W. Dai, UC Santa Cruz
Potential Advantages of Useful Skew
Reduce peak current consumption by distributing the FF switch
point in the range of permissible skew
CLK
0-skew
CLK
U-skew
Affords extra margin to increase clock frequency or reduce sizing
(= power)
W. Dai, UC Santa Cruz
Conventional Zero-Skew Flow
Placement
Synthesis
Extraction & Delay Calculation
Static Timing Analysis
0-Skew Clock Synthesis
Clock Routing
Signal Routing
Useful-Skew Flow
Existing Placement
Extraction & Delay Calculation
Static Timing Analysis
U-Skew Clock Synthesis
Clock Routing
Signal Routing
Permissible range generation
Initial skew scheduling
Clock tree topology synthesis
Clock net routing
Clock timing verification
W. Dai, UC Santa Cruz
Clock Power
Power consumption in clocks due to:
Clock drivers
Long interconnections
Large clock loads: all clocked elements (latches, FFs) are driven
Different components dominate
Depending on type of clock network used
Ex.: Grid: huge pre-drivers & wire cap. drown out load cap.
Sylvester / Shepard, 2001
Clock Power Is LARGE
Not only is the clock capacitance large, it
switches every cycle!
P = C V_dd^2 f
Sylvester / Shepard, 2001
Low-Power Clocking

Gated clocks
Prevent switching in areas of chip not being used
Easier in static designs
Edge-triggered flops in ARM rather than transparent latches in Alpha
Reduced load on clock for each latch/flop
Eliminated spurious power-consuming transitions during latch flow-through (transparency)
Sylvester / Shepard, 2001
Clock Area
Clock networks consume silicon area (clock drivers, PLL,
etc.) and routing area
Routing area is most vital
Top-level metals are used to reduce RC delays
These levels are precious resources (unscaled)
Power routing, clock routing, key global signals
Reducing area also reduces wiring capacitance and power
Typical #s: Intel Itanium 4% of M4/5 used in clock routing
Sylvester / Shepard, 2001
Clock Slew Rates
To maintain signal integrity and latch performance, minimum
slew rates are required
Too slow: clock is more susceptible to noise, latches are slowed
down, setup times eat into timing budget [T_setup = 200 + 0.33 * T_slew (ps)],
more short-circuit power for large clock drivers
Too fast: burns too much power, overdesigned network, enhanced
ground bounce
Rule-of-thumb: T_rise and T_fall of clock are each between 10-20% of
clock period (10% is an aggressive target)
1 GHz clock: T_rise = T_fall = 100-200ps
Sylvester / Shepard, 2001
Example: Alpha 21264
Grid + H-tree approach
Power = 32% of total
Wire usage = 3% of
metals 3 & 4
4 major clock quadrants, each with a large driver
connected to local grid structures
Sylvester / Shepard, 2001
Alpha 21264 Skew Map
Ref: Compaq, ASP-DAC00
Sylvester / Shepard, 2001
Power vs. Skew
Fundamental design decision
Meeting skew requirements is easy with unlimited
power budget
Wide wires reduce RC product but increase total C
Driver upsizing reduces latency (reduces skew as well)
but increases buffer cap
SOC context: plastic package power limit is 2-3 W
Sylvester / Shepard, 2001
Clock Distribution Trends
Timing
Clock period dropping fast, skew must follow
Slew rates must also scale with cycle time
Jitter: PLLs get better with CMOS scaling but other sources of noise
increase
- Power supply noise more important
- Switching-dependent temperature gradients
Materials
Cu reduces RC slew degradation, potential skew
Low-k decreases power, improves latency, skew, slews
Power
Complexity, dynamic logic, pipelining: more clock sinks
Larger chips: bigger clock networks
Sylvester / Shepard, 2001
Gate Timing Characterization
[Figure: gate with inputs A, B, D and output F driving load C_L, shown as schematic and as layout]
Extract exact transistor characteristics from layout
Transistor width, length, junction area and perimeter
Local wire length and inter-wire distance
Compute all transistor and wire capacitances
Cell Timing Characterization
Delay tables generated using a detailed transistor-level
circuit simulator SPICE (differential-equations solver)
For a number of different input slews and load
capacitances simulate the circuit of the cell
Propagation time (50% Vdd at input to 50% at output)
Output slew (10% Vdd at output to 90% Vdd at output)
[Figure: input and output waveforms vs. time, swinging to Vdd, with t_pd and t_slew marked]
Delay and Transition Measurement
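[Figure: waveform with the 20%, 50%, and 80% levels marked; transition is measured between the 20% and 80% points, cell delay at the 50% points]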
Non-linear effects reflected in tables
[Figure: gate model taking input slew and output capacitance as inputs; the resulting waveform shows the intrinsic delay, the delay at the gate, and the output slew]
D_G = f(C_L, S_in) and S_out = f(C_L, S_in)
Non-linear
Interpolate between table entries
Interpolation error is usually below 10% of SPICE
Timing Library Example (.lib)
library(my_lib) {
  delay_model : table_lookup;
  library_features (report_delay_calculation);
  time_unit : "1ns";
  voltage_unit : "1V";
  current_unit : "1mA";
  leakage_power_unit : "1uW";
  capacitive_load_unit(1,pf);
  pulling_resistance_unit : "1kohm";
  default_fanout_load : 1.0;
  default_inout_pin_cap : 1.0;
  default_input_pin_cap : 1.0;
  default_output_pin_cap : 0.0;
  default_cell_leakage_power : 0.0;
  nom_voltage : 1.08;
  nom_temperature : 125.0;
  nom_process : 1.0;
  slew_derate_from_library : 0.500000;
  operating_conditions("slow_125_1.08") {
    process : 1.0 ;
    temperature : 125 ;
    voltage : 1.08 ;
    tree_type : "worst_case_tree" ;
  }
  default_operating_conditions : slow_125_1.08 ;
  /* Table template: rows indexed by input transition, columns by output load */
  lu_table_template("load") {
    variable_1 : input_net_transition;
    variable_2 : total_output_net_capacitance;
    index_1( "1, 2, 3, 4" );
    index_2( "1, 2, 3, 4" );
  }
  cell("INV") {
    pin(A) {
      max_transition : 1.500000;
      direction : input;
      rise_capacitance : 0.0739000;
      fall_capacitance : 0.0703340;
      capacitance : 0.07278646;
    }
    pin(Z) {
      direction : output;
      function : "!A";
      max_transition : 1.500000;
      max_capacitance : 5.1139;
      timing() {
        related_pin : "A";
        cell_rise(load) {
          index_1( "0.0375, 0.2329, 0.6904, 1.5008" );
          index_2( "0.0010, 0.9788, 2.2820, 5.1139" );
          values ( \
            "0.013211, 0.071051, 0.297500, 0.642340", \
            "0.028657, 0.110849, 0.362620, 0.707070", \
            "0.053289, 0.165930, 0.496550, 0.860400", \
            "0.091041, 0.234440, 0.661840, 1.091700" );
        }
        cell_fall(load) {
          index_1( "0.0326, 0.1614, 0.5432, 1.5017" );
          index_2( "0.0010, 0.4249, 3.6538, 8.1881" );
          values ( \
            "0.009472, 0.072284, 0.317370, 0.688390", \
            "0.009992, 0.095862, 0.360530, 0.731610", \
            "0.009994, 0.126620, 0.477260, 0.867670", \
            "0.009996, 0.144150, 0.644140, 1.127700" );
        }
        /* The slide lists the two transition tables before the cell; they are
           placed inside this timing arc here, where Liberty expects them. */
        fall_transition(load) {
          index_1( "0.0326, 0.1614, 0.4192, 1.5017" );
          index_2( "0.0010, 0.4249, 2.1491, 8.1881" );
          values ( \
            "0.011974, 0.071668, 0.317800, 1.189560", \
            "0.033212, 0.101182, 0.328540, 1.189562", \
            "0.059282, 0.155052, 0.389900, 1.202360", \
            "0.162830, 0.317380, 0.628160, 1.441260" );
        }
        rise_transition(load) {
          index_1( "0.0375, 0.1650, 0.5455, 1.5078" );
          index_2( "0.0010, 0.4449, 1.7753, 5.1139" );
          values ( \
            "0.016690, 0.115702, 0.418200, 1.189060", \
            "0.038256, 0.139336, 0.422960, 1.189081", \
            "0.076248, 0.213280, 0.491820, 1.203700", \
            "0.170992, 0.353120, 0.694740, 1.384760" );
        }
      }
    }
  }
}
Delay Calculation
[Figure: cell driving a 1.0pf load, with input transitions of 0.1ns and 0.12ns]
Tables are indexed by output capacitance (Cap, rows, in pf) and input transition (Tr, columns, in ns)

Cell Fall
Cap\Tr   0.05   0.2    0.5
0.01     0.02   0.16   0.30
0.5      0.04   0.32   0.60
2.0      0.08   0.64   1.20

Cell Rise
Cap\Tr   0.05   0.2    0.5
0.01     0.03   0.18   0.33
0.5      0.06   0.36   0.66
2.0      0.09   0.72   1.32

Fall Transition
Cap\Tr   0.05   0.2    0.5
0.01     0.01   0.09   0.15
0.5      0.03   0.27   0.45
2.0      0.06   0.54   0.90

Fall delay = 0.178ns
Rise delay = 0.261ns
Fall transition = 0.147ns
Rise transition = (value not shown)
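The three results on this slide can be reproduced with plain bilinear interpolation between the four surrounding table entries. In the sketch below the function name is introduced for illustration, and the pairing of input transitions with results (0.1 ns for the fall delay and fall transition, 0.12 ns for the rise delay) is inferred from the numbers rather than stated on the slide. Production delay calculators may use other interpolation schemes.

```python
from bisect import bisect_right

def interpolate(table, caps, trs, cap, tr):
    """Bilinear interpolation in a table indexed as table[cap_index][tr_index]."""
    def bracket(axis, value):
        # Indices of the two axis points around value, plus the fraction between them.
        hi = min(max(bisect_right(axis, value), 1), len(axis) - 1)
        lo = hi - 1
        return lo, hi, (value - axis[lo]) / (axis[hi] - axis[lo])

    i0, i1, u = bracket(caps, cap)
    j0, j1, v = bracket(trs, tr)
    low = table[i0][j0] + v * (table[i0][j1] - table[i0][j0])
    high = table[i1][j0] + v * (table[i1][j1] - table[i1][j0])
    return low + u * (high - low)

caps = [0.01, 0.5, 2.0]          # output capacitance, pf
trs = [0.05, 0.2, 0.5]           # input transition, ns

cell_fall = [[0.02, 0.16, 0.30], [0.04, 0.32, 0.60], [0.08, 0.64, 1.20]]
cell_rise = [[0.03, 0.18, 0.33], [0.06, 0.36, 0.66], [0.09, 0.72, 1.32]]
fall_transition = [[0.01, 0.09, 0.15], [0.03, 0.27, 0.45], [0.06, 0.54, 0.90]]

fall_delay = interpolate(cell_fall, caps, trs, cap=1.0, tr=0.1)
rise_delay = interpolate(cell_rise, caps, trs, cap=1.0, tr=0.12)
fall_tran = interpolate(fall_transition, caps, trs, cap=1.0, tr=0.1)

print(round(fall_delay, 3), round(rise_delay, 3), round(fall_tran, 3))
```

Values outside the table range are extrapolated linearly from the nearest pair of entries, which is a simplification of this sketch.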
PVT (Process, Voltage, Temperature) Derating
Actual cell delay = Original delay x K_PVT
PVT Derating: Example + Min/Typ/Max Triples
Proc_var (0.5:1.0:1.3): K_P = 0.80 : 1.00 : 1.30
Voltage (5.5:5.0:4.5): K_V = 0.93 : 1.00 : 1.08
Temperature (0:20:50): K_T = 0.80 : 1.07 : 1.35
K_PVT = K_P x K_V x K_T = 0.60 : 1.07 : 1.90
Cell delay = 0.261ns
Derated delay = 0.157 : 0.279 : 0.496 {min : typical : max}
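The arithmetic behind these triples is a per-corner product. The short sketch below reproduces the slide's numbers; note that it rounds K_PVT to two decimals before applying it, because that is what matches the derated delays shown (this rounding order is an inference from the numbers).

```python
k_p = (0.80, 1.00, 1.30)   # process factor      (min, typ, max)
k_v = (0.93, 1.00, 1.08)   # voltage factor      (min, typ, max)
k_t = (0.80, 1.07, 1.35)   # temperature factor  (min, typ, max)

# Combined factor per corner, rounded as on the slide.
k_pvt = tuple(round(p * v * t, 2) for p, v, t in zip(k_p, k_v, k_t))

cell_delay = 0.261         # ns, rise delay from the table lookup
derated = tuple(round(cell_delay * k, 3) for k in k_pvt)

print(k_pvt)     # min : typ : max derating factors
print(derated)   # min : typ : max delays in ns
```

The spread between the min and max corners is roughly a factor of three, which is why both early-mode and late-mode analyses are run at different corners.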
Conservatism of Gate Delay Modeling
True gate delay depends on input arrival time
patterns
STA will assume that only 1 input is switching
Will use worst slope among several inputs
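[Figure: gate with inputs A, B, D, output F, and load C_L, shown twice with waveforms vs. time (swinging to Vdd). One case shows t_pd from A to F when A and B switch together; the other shows t_pd when only A switches.]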