# ECE260B – CSE241A Winter 2004 Clocking

ECE 260B – CSE 241A Clocking 1

Andrew B. Kahng, UCSD

Sequential Machine
[Figure: three combinational logic blocks separated by registers, each register driven by clk]
State is stored in registers (flip-flops or latches)
Combinational logic computes next-state and outputs from present-state and inputs

Courtesy K. Keutzer et al. UCB

Outline
- Why Clocking
- Clock Distribution
- The “Zero-Skew Tree” Problem
- The “Useful” Skew Problem
- The Timing Analysis Problem

Why Clocks?
Clocks provide the means to synchronize
- By allowing events to happen at known timing boundaries, we can sequence these events
Greatly simplifies building of state machines
No need to worry about variable delay through combinational logic (CL)
- All signals delayed until clock edge (clock imposes the worst case delay)
[Figure: FSM (combinational logic with a feedback register) and dataflow (register, combinational logic, register)]
Courtesy K. Yang, UCLA

Clock Cycle Time
Cycle time is determined by the delay through the CL
Signal must arrive before the latching edge
If too late, it waits until the next cycle
- Synchronization and sequential order become incorrect

Constraint: tcycle > tprop_delay_through_CL + toverhead
Example: 3.0 GHz Pentium-4 tcycle = 333ps

Can change circuit architecture to obtain smaller Tcycle
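As a quick check of the arithmetic above, the sketch below converts a clock frequency to a cycle time and subtracts an overhead to get the delay budget left for the CL. The function names and the 60 ps overhead are illustrative assumptions, not values from the slides.

```python
def period_ps(freq_hz):
    """Clock period in picoseconds for a frequency given in hertz."""
    return 1e12 / freq_hz


def logic_budget_ps(freq_hz, overhead_ps):
    """Upper bound on CL delay implied by tcycle > tprop + toverhead."""
    return period_ps(freq_hz) - overhead_ps


t_cycle = period_ps(3.0e9)             # about 333 ps for a 3.0 GHz clock
budget = logic_budget_ps(3.0e9, 60.0)  # 60 ps overhead is an assumed figure
```

Any real overhead figure depends on the register type and clocking scheme; only the 3.0 GHz / 333 ps pair comes from the slide.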

Courtesy K. Yang, UCLA

Pipelining
For dataflow:
- Instead of a long critical path, split the critical path into chunks
- Insert registers to store intermediate results
- This allows 2 waves of data to coexist within the CL
Can we extend this ad infinitum?
- Overhead eventually limits the pipelining
  - E.g. 1.5 to 2 gate delays for latch or FF
- Granularity limits as well
  - Minimum time quantum: delay of a gate
Unpipelined (register, CL A+B, register): tcycle > tpd + toverhead
Pipelined (register, CL A, register, CL B, register): tcycle > max(tpd1, tpd2) + toverhead
Courtesy K. Yang, UCLA
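The two pipelining bounds can be tried out numerically. This is a minimal sketch in arbitrary delay units; `cycle_time`, `equal_split`, and all the numbers are hypothetical, chosen only to show how the fixed overhead comes to dominate as stages get shorter.

```python
def cycle_time(stage_delays, overhead):
    """Lower bound on tcycle: slowest stage plus register overhead."""
    return max(stage_delays) + overhead


def equal_split(total_delay, n_stages, overhead):
    """Bound on tcycle when the logic is split into n equal stages."""
    return total_delay / n_stages + overhead


unpipelined = cycle_time([10], 2)    # 12
pipelined = cycle_time([6, 4], 2)    # 8: limited by the slower chunk
# Deeper pipelines give diminishing returns because overhead stays fixed.
sweep = [equal_split(12, n, 2) for n in (1, 2, 4, 12)]  # [14.0, 8.0, 5.0, 3.0]
```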

Intel MPU FO4 INV Delays Per Clock Period
[Figure: number of FO4 inverter delays per clock period (0 to 120) versus year (1982 to 2004) for the 386, 486, DX2, DX4, Pentium, Pentium MMX, Pentium Pro, Pentium II, Celeron, Pentium III, and Pentium 4]
FO4 INV = inverter driving 4 identical inverters (no interconnect)
Half of frequency improvement has been from reduced logic stages, i.e., pipelining

Let’s Revisit Cycle Time and Path Delay
Cycle time (T) cannot be smaller than longest path delay (Tmax)
Longest (critical) path delay is a function of:
- Total gate, wire delays
- Logic levels
Tmax ≤ T
[Figure: flip-flops Q1 and Q2, clocked by Tclock1 and Tclock2, joined by a critical path of ~5 logic levels; waveforms show clock, data, and cycle time]
Courtesy K. Keutzer et al. UCB

Cycle Time – Setup Time
For FFs to correctly capture data, data must be stable for:
- Setup time (Tsetup) before clock arrives
Tmax + Tsetup ≤ T
[Figure: flip-flops Q1 and Q2, clocked by Tclock1 and Tclock2, joined by a critical path of ~5 logic levels; waveforms show clock, data, and setup time]
Courtesy K. Keutzer et al. UCB

Cycle Time – Clock Skew
If clock network has unbalanced delay: clock skew
Cycle time is also a function of clock skew (Tskew)
Tmax + Tsetup + Tskew ≤ T
[Figure: flip-flops Q1 and Q2, clocked by Tclock1 and Tclock2, joined by a critical path of ~5 logic levels; waveforms show data, Tclock1, Tclock2, and the clock skew between them]
Courtesy K. Keutzer et al. UCB

Cycle Time – Flip-Flop Delay (Clock to Q)
Cycle time is also a function of propagation delay of FF (Tclk-to-Q or Tc2q)
- Tc2q: time from arrival of clock signal till change at FF output
Tmax + Tsetup + Tskew + Tclk-to-Q ≤ T
[Figure: flip-flops Q1 and Q2, clocked by Tclock1 and Tclock2, joined by a critical path of ~5 logic levels; waveforms show data, Tclock1, Tclock2, Q2, and the clock-to-Q delay]
Courtesy K. Keutzer et al. UCB

Min Path Delay – Hold Time
For FFs to correctly latch data, data must be stable during:
- Hold time (Thold) after clock arrives
Determined by delay of shortest path in circuit (Tmin) and clock skew (Tskew)
Tmin ≥ Thold + Tskew
[Figure: flip-flops Q1 and Q2, clocked by Tclock1 and Tclock2, joined by a short path of ~3 logic levels; waveforms show clock, data, and hold time]
Courtesy K. Keutzer et al. UCB

Setup, Hold, Cycle Times
[Figure: single-phase clock waveform marking the cycle time, the set-up time (D stable before clock), the hold time (D stable after clock), and the interval when the signal may change]
Example of a single phase clock
Courtesy K. Keutzer et al. UCB

Summary of Constraints (Edge-Triggered FFs)
[Figure: FlipFlop, Comb Logic, FlipFlop, with clock period tper]
Max(tpd) < tper – tsu – tc2q – tskew
- If violated: delay is too long for data to be captured
Min(tpd) > th – tc2q + tskew
- If violated: delay is too short and data can race through, skipping a state
Courtesy K. Yang, UCLA
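The two summary inequalities can be turned into slack checks. The sketch below is illustrative only: the function names and the picosecond values are assumptions, and a positive slack means the corresponding strict inequality holds.

```python
def setup_slack(t_per, tpd_max, t_su, t_c2q, t_skew):
    """Margin in Max(tpd) < tper - tsu - tc2q - tskew (positive = met)."""
    return (t_per - t_su - t_c2q - t_skew) - tpd_max


def hold_slack(tpd_min, t_h, t_c2q, t_skew):
    """Margin in Min(tpd) > th - tc2q + tskew (positive = met)."""
    return tpd_min - (t_h - t_c2q + t_skew)


# Hypothetical numbers, all in picoseconds.
s = setup_slack(t_per=1000, tpd_max=700, t_su=50, t_c2q=100, t_skew=50)  # 100
h = hold_slack(tpd_min=60, t_h=40, t_c2q=100, t_skew=50)                 # 70
# The period appears only in the setup check, so a hold failure
# cannot be repaired by slowing the clock.
slow_clock = setup_slack(t_per=2000, tpd_max=700, t_su=50, t_c2q=100, t_skew=50)
```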

Example of tpdmax Violation
Suppose there is skew between the registers in a dataflow (regA after regB)
- “i” gets its input values from regA at transition in Ck’
- CL output “o” arrives after Ck transition due to skew
To correct this problem, can increase cycle time
[Figure: regA (clocked by Ck’), Comb Logic, regB (clocked by Ck); waveforms for Ck, Ck’, i, and o show tskew and tpdmax, with “o” marked “Too late!”]
Courtesy K. Yang, UCLA

Example of tpdmin Violation: Race Through
Suppose clock skew causes regA to be clocked before regB
- “i” passes through the CL with little delay (tpdmin)
- “o” arrives before the rising Ck’ causes the data to be latched
Cannot be fixed by changing frequency: have rock instead of chip
[Figure: regA (clocked by Ck), Comb Logic, regB (clocked by Ck’); waveforms for Ck, Ck’, i, and o show tskew and tpdmin, with “o” marked “Too early!”]
Courtesy K. Yang, UCLA

Time Borrowing (Cycle Stealing, Useful Skew)
Cycle steal with flip-flops using delayed clocks
[Figure: FlipFlop, Comb Logic, FlipFlop, with Ck reaching the second flip-flop through an intentional delay; intentional delay = skew]
Courtesy K. Yang, UCLA

Clock Distribution
General goal of clock distribution
- Deliver clock to all memory elements with acceptable skew
- Deliver clock edges with acceptable sharpness
Clocking network design is one of the greatest challenges in the design of a large chip
Clocks generally distributed via wiring trees (and meshes)
- Low-resistance interconnect to minimize delay
- Multiple drivers to distribute driver requirements
- Use optimal sizing principles to design buffers
- Clock lines can create significant crosstalk

Clock Distribution Problem Statement
Objective
- Minimum skew (performance and hold time issues)
- Minimum cell area and metal use
- (sometimes) minimal latency
- (sometimes) particular latency
- (sometimes) intermixed gating for power reduction
- (sometimes) hold to particular duty cycle: e.g. 50:50 ± 1 percent
Subject to:
- Process variation from lot-to-lot
- Process variation across the die
- Radically different loading (ff density) around the die
- Metal variation across the die
- Power variation across the die (both static IR and dynamic)
- Coupling (same and other layers)

Issues in Clock Distribution Network Design
Skew
- Process, voltage, and temperature
- Data dependence
- Noise coupling
- Load balancing
Power, CV^2 f
- Clock gating
Flexibility/Tunability
Compactness: fit into existing layout/design
Reliability
- Electromigration

Skew: Clock Delay Varies With Position
[Figure: clock delay as a function of position]

Clock Skew Causes
Designed (unavoidable) variations: mismatch in buffer load sizes, interconnect lengths
Process variation: process spread across die yielding different Leff, Tox, etc. values
Temperature gradients: changes MOSFET performance across die
IR voltage drop in power supply: changes MOSFET performance across die
Note: Delay from clock generator to fan-out points (clock latency) is not important by itself
- BUT: increased latency leads to larger skew for same amount of relative variation
Sylvester / Shepard, 2001

Clock Distribution Methods
RC-Tree
- Less capacitance
- More accuracy
- Flexible wiring
Grids
- Reliable
- Less data dependency
- Tunable (late in design)
[Figure: an RC-tree and a grid, shown here for final stage drivers driving F/F loads]

Grids
Gridded clock distribution common on earlier DEC Alpha microprocessors
Advantages:
- Skew determined by grid density, not too sensitive to load position
- Clock signals available everywhere
- Tolerant to process variations
- Usually yields extremely low skew values
Disadvantages:
- Huge amount of wiring and power
- To minimize such penalties, need to make grid pitch coarser, which loses the grid advantage
[Figure: predrivers feeding a global grid]
Sylvester / Shepard, 2001

H-Tree
H-tree (Bakoglu)
- One large central driver, recursive structure to match wirelengths
- Halve wire width at branching points to reduce reflections
Disadvantages
- Slew degradation along long RC paths
- Unrealistically large central driver
  - Clock drivers can create large temperature gradients (ex. Alpha 21064 ~30° C)
- Non-uniform load distribution
- Inherently non-scalable (wire R growth)
- Partial solution: intermediate buffers at branching points
Sylvester / Shepard, 2001; courtesy of P. Zarkesh-Ha

Buffered H-tree
Advantages
- Ideally zero-skew
- Can be low power (depending on skew requirements)
- Low area (silicon and wiring)
- CAD tool friendly (regular)
Disadvantages
- Sensitive to process variations
  - Devices: want same size buffers at each level of tree
  - Wires: want similar segment lengths on each layer in each source-sink path
- Local clocking loads inherently non-uniform
Sylvester / Shepard, 2001

Tree Balancing
Some techniques:
a) Introduce dummy loads
Con: Routing area often more valuable than Silicon
b) Snaking of wirelength to match delays
Sylvester / Shepard, 2001

Examples From Processor Chips
- H-Tree, Asymmetric RC-Tree (IBM)
- Grids: DEC [Alphas]
- Serpentines: Intel x86 [Young ISSCC97]

Example Skews From Processor Chips
[Figure panels:]
- DEC-Alpha 21064 clock spines
- DEC-Alpha 21064 RC delays
- DEC-Alpha 21164 RC local delays
- DEC-Alpha 21164 RC delays for Global Distribution (Spine + Grid)

ReShape Clocks Example (High-End ASIC)
Balanced, shielded H-tree for pre-clock distribution
Mesh for block level distribution
All routes 5-6u M6/5, shielded with 1u grounds
~10 buffers per node
- E.g. ganged BUFx20’s
Output mesh must hit every sub-block
[Figure: output mesh]

Block Level Mesh (.18u)
Clumps of 1-6 clock buffers, surrounded by capacitor pads
Shielded input and output m6 shorting straps
Pre-clock connects to input shorting straps
1u m5 ribs every 20 - 30 u (4 to 6 rows)
Max 600u stride

Problems with Meshes
Burn more power at low frequencies
Blocks more routing resources (solution: integrated power distribution with ribs can provide shielding for ‘free’)
Difficult for ‘spare’ clock domains that will not tolerate regioning
Post placement (and routing) tuning required
No ‘beneficial skew’ possible
Clock gating only easy at root
Fighting tools to do analysis:
- Clumped buffers a problem in Static Timing Analysis tools
- Large shorted meshes a problem for STA tools
- What does Elmore delay calculation look like for a non-tree?
- Need full extraction and SPICE-like simulation to determine skew
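The Elmore question is pointed because on an RC tree the calculation is a simple two-pass traversal: the delay to a node is the sum, over the wires on its unique source path, of wire resistance times downstream capacitance. The sketch below shows that tree case only; the data layout (`parent`, `res`, `cap`) and the values are assumptions for illustration. A shorted mesh has no unique source path, so this traversal does not carry over.

```python
def elmore_delays(parent, res, cap):
    """Elmore delay from the root to every node of an RC tree.

    parent[v]: parent of v (None for the root)
    res[v]:    resistance of the wire from parent[v] to v
    cap[v]:    capacitance lumped at v
    """
    children = {v: [] for v in parent}
    root = None
    for v, p in parent.items():
        if p is None:
            root = v
        else:
            children[p].append(v)
    # Root-first ordering (the list grows while it is being scanned).
    order = [root]
    for v in order:
        order.extend(children[v])
    # Pass 1 (leaves to root): total capacitance below each node.
    subcap = dict(cap)
    for v in reversed(order):
        if parent[v] is not None:
            subcap[parent[v]] += subcap[v]
    # Pass 2 (root to leaves): accumulate R * downstream C along each path.
    delay = {root: 0.0}
    for v in order[1:]:
        delay[v] = delay[parent[v]] + res[v] * subcap[v]
    return delay


PARENT = {"s0": None, "v": "s0", "s1": "v", "s2": "v"}
RES = {"v": 1, "s1": 2, "s2": 2}
CAP = {"s0": 0, "v": 1, "s1": 3, "s2": 3}
d = elmore_delays(PARENT, RES, CAP)   # d["s1"] == d["s2"] == 13.0
```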

Benefits of Meshes
Deterministic since shielded all the way down to rib distribution
No ECO placement required: all buffers preplaced before block placement
Low latency since uses shorted (= ganged, parallel) drivers, therefore lower skew
ECO placements of FFs later do not require rebalancing of tree
“Idealized” clocking environment for “concurrent dance” of RTL design and timing convergence

Zero-Skew Tree (ZST) Problem
Zero Skew Clock Routing Problem (S,G): Given a set S of sink locations and a connection topology G, construct a ZST T(S) with topology G and having minimum cost.
- Skew = maximum value of |td(s0,si) – td(s0,sj)| over all sink pairs si, sj in S
- td = signal delay (from source s0)
- Connection topology G = rooted binary tree with nodes of S as leaves
- Edge ea in G is the edge from a to its parent
- |ea| is the (assigned) length of edge ea
- Cost = total edge length
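The definitions of skew and cost can be evaluated directly on a small tree. This sketch assumes the simplest delay model, in which td(s0,si) equals the total edge length on the path from s0 to si; the topology, the edge lengths, and the function names are made up for illustration.

```python
def source_to_sink_delays(parent, length, sinks):
    """Path length from the root to each sink (pathlength delay assumed)."""
    delays = {}
    for s in sinks:
        total, node = 0, s
        while node in parent:          # the root has no parent entry
            total += length[node]      # length[node] is |e_node|
            node = parent[node]
        delays[s] = total
    return delays


def skew(delays):
    """Max |td(s0,si) - td(s0,sj)| over all sink pairs."""
    return max(delays.values()) - min(delays.values())


def cost(length):
    """Total edge length of the tree."""
    return sum(length.values())


PARENT = {"a": "s0", "b": "s0", "s1": "a", "s2": "a", "s3": "b", "s4": "b"}
LENGTH = {"a": 3, "b": 2, "s1": 1, "s2": 1, "s3": 2, "s4": 2}
SINKS = ["s1", "s2", "s3", "s4"]

balanced = source_to_sink_delays(PARENT, LENGTH, SINKS)       # every sink at 4
unbalanced = source_to_sink_delays(PARENT, {**LENGTH, "s4": 3}, SINKS)
```

Here `skew(balanced)` is 0 with `cost(LENGTH)` equal to 11, while lengthening one sink edge by 1 gives a skew of 1.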

Zero-Skew Example (555 sinks, 40 obstacles)
[Figure: zero-skew clock tree routed over 555 sinks and 40 obstacles]

A Zero-Skew Routing Algorithm
Finds a ZST under linear delay model with minimum cost over all ZSTs with topology G and sink set S
Terms
- Manhattan Arc: line segment with slope +1 or –1
- Tilted Rectangular Region (TRR): collection of points within a fixed distance of a Manhattan arc
  - Core = Manhattan arc
  - Radius = distance
- Merging segment = locus of feasible locations for a node v in the topology, consistent with minimum wirelength
  - If v is a sink, then ms(v) = {v}
  - If v is an internal node, then ms(v) is the set of all points within distance |ea| of ms(a), and within distance |eb| of ms(b)

Phase 1: Tree of Merging Segments
Goal: Construct a tree of merging segments corresponding to topology G
Merging segment of a node depends on merging segments of its children, hence bottom-up construction
Let a, b be children of v. We want placements of v that allow TSa and TSb to be merged with minimum added wire while preserving zero skew
- Merging cost = |ea| + |eb|
Fact: The intersection of two TRRs is also a TRR and can be found in constant time
Constant time per each new merging segment, hence linear time (in size of S) to construct entire tree
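Under a linear (pathlength) delay model, the edge lengths for one merge follow from two conditions: equal delay on both sides, and minimum added wire. The sketch below works that out for a single merge. The name `merge_edge_lengths`, its arguments (subtree delays `ta`, `tb` and the distance `d` between ms(a) and ms(b)), and the detour rule for the unbalanced case are stated here as assumptions, not taken from the slides.

```python
def merge_edge_lengths(ta, tb, d):
    """Return (|ea|, |eb|) so that ta + |ea| == tb + |eb| with least wire.

    ta, tb: delay from a (resp. b) down to its sinks, already zero-skew
    d:      distance between the merging segments ms(a) and ms(b)
    """
    if abs(ta - tb) <= d:
        # Balanced case: split the distance d so both sides see equal delay.
        ea = (d + tb - ta) / 2
        eb = d - ea
    elif ta < tb:
        # Subtree a is too fast even with all of d: detour (snake) its edge.
        ea, eb = float(tb - ta), 0.0
    else:
        ea, eb = 0.0, float(ta - tb)
    return ea, eb


ea, eb = merge_edge_lengths(2, 6, 10)       # (7.0, 3.0): both sides reach 9
ea2, eb2 = merge_edge_lengths(0, 8, 4)      # (8.0, 0.0): detour beyond d = 4
```

In the balanced case the merging cost |ea| + |eb| equals d; in the detour case it exceeds d by the amount of snaking needed.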

Phase 2: Find Node Placements
Goal: Find exact locations (“embeddings”) pl(v) of internal nodes v in the ZST topology
If v is the root node, then any point on ms(v) can be chosen as pl(v)
If v is an internal node other than the root, and p is the parent of v, then v can be embedded at any point in ms(v) that is at distance |ev| or less from pl(p)
- Detail: create square TRR trrp with radius ev and core equal to pl(p); placement of v can be any point in ms(v) ∩ trrp
Each instruction executed at most once for each node in G, and TRR intersection is O(1) time
- Find_Exact_Placements is O(n)
- DME is O(n)

Non-Zero Skew Bounds
Given a skew bound, where can internal nodes of the given topology (e.g., a, b, v) be placed?
[Figure: topology with source s0, internal nodes v, a, b, and sinks s1 to s4; candidate placements of a, b, and v are labeled with the resulting skew values 0, 2, 4, 6]

BST-DME Bottom-Up Phase
Bottom-Up: build tree of merging regions corresponding to given topology
[Figure: topology with source s0, internal nodes v, a, b, and sinks s1 to s4; layout with skew bound B = 4 showing merging regions mr(a), mr(b), and mr(v)]

BST-DME Top-Down Phase
[Figure: same topology and skew bound B = 4; layout showing the embedded locations of s0, v, a, and b and the resulting routes to sinks s1 to s4]

Skew = Local Constraint
Timing is correct as long as the clock signals of sequentially adjacent FFs arrive within a permissible skew range
[Figure: two FFs joined by combinational logic; D: longest path, d: shortest path]
-d + thold < Skew < Tperiod - D - tsetup
- Below the permissible range: race condition
- Inside the permissible range: safe
- Above the permissible range: cycle time violation
W. Dai, UC Santa Cruz

“Useful Skew” Design Robustness
Design will be more robust if clock signal arrival time is in the middle of permissible skew range, rather than on edge
[Figure: FF, 2 ns path, FF, 6 ns path, FF, with T = 6 ns; permissible skew ranges plotted for each FF pair]
- “0 0 0”: at verge of violation
- “2 0 2”: more safety margin
W. Dai, UC Santa Cruz
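The three-FF example can be checked numerically. The sketch below makes several assumptions that the slide does not state: the triples “0 0 0” and “2 0 2” are read as clock arrival times (in ns) at the three FFs in order, setup and hold times are taken as zero, and each path has a single delay. The function name is hypothetical.

```python
def setup_slacks(clock_arrival, paths, period):
    """Slack of each FF-to-FF path: capture edge minus data arrival time."""
    slacks = []
    for src, dst, delay in paths:
        launch = clock_arrival[src]            # launching clock edge
        capture = clock_arrival[dst] + period  # next edge at the receiving FF
        slacks.append(capture - (launch + delay))
    return slacks


PATHS = [(0, 1, 2), (1, 2, 6)]   # (from FF, to FF, path delay in ns)
PERIOD = 6

zero_skew = setup_slacks([0, 0, 0], PATHS, PERIOD)    # [4, 0]
useful_skew = setup_slacks([2, 0, 2], PATHS, PERIOD)  # [2, 2]
```

With zero skew the 6 ns path has no slack at all; shifting the outer clocks by 2 ns moves both paths 2 ns away from violation.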

Constraints on Skews
FFi receives clock signal delayed by xi ≥ MIN_DEL
0 < α ≤ 1 ≤ β : if nominal clock delay is xi, then actual clock delay must fall within interval αxi ≤ x ≤ βxi
For FF to operate correctly when clock edge arrives at time x, the correct input data must be present and stable during the time interval (x – SETUP, x + HOLD)
For 1 ≤ i, j ≤ L (#FFs), we compute lower and upper bounds MIN(i,j) and MAX(i,j) for the time that is required for a signal edge to propagate from FFi to FFj
Avoid double-clocking (race condition)
  αxi + MIN(i,j) ≥ βxj + HOLD
Avoid zero-clocking
  βxi + SETUP + MAX(i,j) ≤ αxj + P, where P = clock period

Optimal Useful Skews by Linear Programming
LP_SPEED (clock period reduction): minimize P s.t.
  αxi - βxj ≥ HOLD - MIN(i,j)
  αxj - βxi + P ≥ SETUP + MAX(i,j)
  xi ≥ MIN_DEL
LP_SAFETY (robustness): maximize M s.t.
  αxi - βxj - M ≥ HOLD - MIN(i,j)
  αxj - βxi - M ≥ SETUP + MAX(i,j) - P
  xi ≥ MIN_DEL
J. P. Fishburn, “Clock Skew Optimization”, IEEE Trans. Computers 39(7) (1990), pp. 945-951.
T. G. Szymanski, “Computing Optimal Clock Schedules”, Proc. DAC, June 1992, pp. 399-404.
Notes
- Useful Skew optimization is similar to Retiming optimization
- Peak current reductions are a side benefit
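For a feel of what LP_SPEED computes, the sketch below handles the special case α = β = 1. This swaps in a different technique from the LP itself: with α = β = 1 every constraint bounds a difference xv - xu, so feasibility for a fixed P can be tested with Bellman-Ford negative-cycle detection, and the minimum P found by bisection. MIN_DEL is ignored because adding a constant to every xi leaves the differences unchanged. All names and the three-FF ring are hypothetical.

```python
def feasible_schedule(n_ffs, paths, period, setup=0.0, hold=0.0):
    """Return clock delays x meeting both constraints, or None if impossible.

    paths: list of (i, j, MIN(i,j), MAX(i,j)) for each FFi -> FFj path.
    """
    edges = []  # (u, v, w) encodes x[v] - x[u] <= w
    for i, j, d_min, d_max in paths:
        edges.append((i, j, d_min - hold))            # no double-clocking
        edges.append((j, i, period - setup - d_max))  # no zero-clocking
    x = [0.0] * n_ffs
    for _ in range(n_ffs):
        changed = False
        for u, v, w in edges:
            if x[u] + w < x[v] - 1e-12:
                x[v] = x[u] + w
                changed = True
        if not changed:
            return x          # converged: all difference constraints hold
    return None               # still relaxing: negative cycle, infeasible


def min_period(n_ffs, paths, setup=0.0, hold=0.0, iters=60):
    """Bisection on P; feasibility is monotone in P."""
    hi = max(setup + d_max for _, _, _, d_max in paths)
    hi += sum(max(0.0, hold - d_min) for _, _, d_min, _ in paths)
    if feasible_schedule(n_ffs, paths, hi, setup, hold) is None:
        return None           # hold constraints alone cannot be met
    lo = 0.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if feasible_schedule(n_ffs, paths, mid, setup, hold) is None:
            lo = mid
        else:
            hi = mid
    return hi


# Ring of three FFs: (i, j, MIN, MAX), arbitrary time units.
RING = [(0, 1, 1.0, 2.0), (1, 2, 3.0, 6.0), (2, 0, 1.0, 2.0)]
p_zero_skew = max(d_max for _, _, _, d_max in RING)   # 6.0
p_best = min_period(3, RING)                          # close to 10/3
```

In this ring the zero-skew period is set by the slowest path (6), while the skewed schedule is limited by the average MAX around the loop (10/3), which is the sense in which useful skew resembles retiming.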

Is Circuit Timing Correct?
[Figure: original circuit (three combinational logic blocks between registers driven by clk) and one extracted combinational logic block]
Courtesy K. Keutzer et al. UCB

Key = Delays Through Combinational Block
[Figure: combinational block with inputs A, B, C, gates W, X, Y, Z, and output f; arrival times in green, interconnect delays in red, gate delays in blue]
Question: What is the right mathematical object to use to represent this physical object?
Courtesy K. Keutzer et al. UCB

Actual Arrival Time
Actual arrival time A(v) for a node v is latest time that signal can arrive at node v
$A(v) = \max_{u \in FI(v)} \left( A(u) + d_{u \to v} \right)$
where $d_{u \to v}$ is delay from u to v and FI(v) is the set of fan-ins of v. In the figure, v = Z and FI(Z) = {X, Y}.
[Figure: nodes X and Y with arrival times A(X), A(Y) feeding node Z through delays $d_{X \to Z}$ and $d_{Y \to Z}$]
Courtesy K. Keutzer et al. UCB

Problem: Longest Path in Directed Graph
Use a labeled directed graph G = <V,E>
- Vertices represent gates, primary inputs and primary outputs
- Edges represent wires
- Labels represent delays
[Figure: the combinational block with inputs A, B, C, gates W, X, Y, Z, and output f, redrawn as a directed graph with delay labels on vertices and edges]
Courtesy K. Keutzer et al. UCB
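Because the graph is acyclic, the latest arrival times can be computed in one pass over a topological order. The sketch below assumes non-negative delays, arrival time 0 at primary inputs, and an arrival time measured at each vertex's output (gate delay included). The small graph and its integer delays are made up and are not the circuit from the figure.

```python
from collections import defaultdict, deque


def arrival_times(gate_delay, wire_delay):
    """Latest arrival time at the output of every vertex of a DAG.

    gate_delay[v]:      delay label on vertex v (0 for inputs and outputs)
    wire_delay[(u, v)]: delay label on edge u -> v
    """
    fanout = defaultdict(list)
    pending = {v: 0 for v in gate_delay}        # unprocessed fan-ins per vertex
    for (u, v), d in wire_delay.items():
        fanout[u].append((v, d))
        pending[v] += 1
    arrival = {v: 0 for v in gate_delay}        # latest input arrival so far
    ready = deque(v for v, k in pending.items() if k == 0)
    visited = 0
    while ready:
        u = ready.popleft()
        arrival[u] += gate_delay[u]             # all fan-ins of u are final
        visited += 1
        for v, d in fanout[u]:
            arrival[v] = max(arrival[v], arrival[u] + d)
            pending[v] -= 1
            if pending[v] == 0:
                ready.append(v)
    if visited != len(gate_delay):
        raise ValueError("graph has a cycle; not a combinational block")
    return arrival


GATES = {"A": 0, "B": 0, "C": 0, "X": 2, "Y": 2, "Z": 2, "f": 0}
WIRES = {("A", "X"): 1, ("B", "X"): 2, ("B", "Y"): 1, ("C", "Y"): 3,
         ("X", "Z"): 2, ("Y", "Z"): 1, ("Z", "f"): 4}
at = arrival_times(GATES, WIRES)   # at["f"] == 12 is the longest-path delay
```

Each edge is examined once, so the running time is linear in the size of the graph.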