You are on page 1of 75

Clock Distribution

Shmuel Wimer
Bar Ilan Univ. Eng. Faculty
Technion, EE Faculty

July 2010 1
Clock System Architecture

clk3

Clock
ext_clk Distribution
External Clock gclk
Clock Generator Buffers
Gaters
clk1 clk2
Clocked
Elements

Chip

Chip receives external clock through I/O pad.


Clock generator adjusts the global clock to the external clock.
Global clock is distributed across the chip.
Local drivers and gaters drive the physical clocks to clocked elements.
July 2010 2
Global Clock Generation
• Receives external clock signal and produce the global
clock distributed across the die.
• A large skew occurs between external clock and the
physical clocks at clocked elements due to delay of
distribution network (wires, buffers, gaters).
• Therefore, data at clocked elements is no more in sync
with data at I/O pins.
• Phased Locked Loop (PLL) compensates this delay.
• PLL can perform frequency multiplication to obtain the
required on-chip frequencies.

July 2010 3
Synchronous Chip Interface with PLL
Chip A communicates synchronously with chip B
Chip B uses the clock sent by chip A. Data in and out must be
synchronized to the common clock.

A PLL produces the global clock of chip B such that it is in sync


with the external clock.

Chip A Chip B

CLKin
ext_clk ref_clk
CLKout gclk
clk PLL clk_out
Din fdbk_clk
Dout

Dout
Din Clock Distribution

July 2010 4
How PLL Works?

Loop
Charge Filter
Pump
C
Up
ref_clk I R
M Voltage
Phase clk_out
Detect Controlled
Vctrl Oscillator
N
fdbk_clk I
Down

July 2010 5
Phase – Frequency Detector (PFD)
Q_A: B should go
1 D Q faster
A
CLR

CLR
B
1 Q_B: B should go
D Q
slower

The two flip-flops receive the signals at their clock input (one is usually a
reference and the other is the sampled).

The output of the leading flip-flop is 1 for the lead duration.

Once the lagging signal arrives, a reset turns both Q_A and Q_B to zero.
July 2010 6
What happens when the reference and the sampled signals are a shift of
each other?

A: reference

B: sampled

Q_A: sampled
should go faster

Q_B: sampled
should go slower

The spikes at Q_B are a result of the delay of the AND gate driving the
CLR input of flip-flip and the internal delay from CLR to Q.
July 2010 7
What happens when the reference and the sampled signals have different
frequencies?

A: reference

B: sampled

Q_A: sampled
should go faster

Q_B: sampled
should go slower

Sampled is more often 1-value than the reference is, since rising edge of
B occurs more often than rising edge of A.
July 2010 8
Charge Pump
faster
1 D Q
CLK_ref Icp
CLR
Sup Vctrl

Sdn
CLR
CLK_fdbk
Icp
1 D Q
slower

Converts PFD error (digital) to charge (analog), which then controls PLL VCO.

Charge is proportional to PFD widths: Qcp  I up  tfaster  I dn  tslower .

July 2010 9
Current Mirror
V Vcc
Iin Iout
N1 N2
P1 P2
Iin Iout
Vss V

Charge pump consists of current mirrors which are sources of constant


current. Device N1 is in saturation since its gate is connected to high voltage.
Ids (=Iin) depends only on Vgs. Vgs is similar in N2, hence Iout=Iin.

This is an ideal current source with infinite output impedance since Iout is
independent of N2 load; a change in output voltage doesn’t affect Iout.

Current mirror works similarly for P transistors.

July 2010 10
How Charge Pump Works?

R load – determines Vcc Vcc


the current through
current mirror
I I I C
P-type current mirror
faster R
Switch – open when
faster = 1
Vout

slower
Switch – open when
slower = 1
I I I

N-type current mirror


Vss

July 2010 11
Faster Mode Vout → Vcc
Vcc Vcc

faster=1 R

Vout

slower=0

Vss

July 2010 12
Slower Mode Vout → Vss
Vcc Vcc

faster=0 R

Vout

slower=1

Vss

July 2010 13
Loop Filter

Vcc

+
Vctrl Vctrl

Differential amplifier connected as a unity-gain follower is used.

July 2010 14
Voltage Controlled Oscillator (VCO)
A Ring Oscillator cascades an odd number of inverters and feeds back the
last output to first inverter (even number of inverters will be stable). It starts
to oscillate spontaneously.

Vout

If tinv is the delay of an inverter and n is the number of inverters,

oscillation frequency is f  1  2ntinv 

Frequency can be controlled by number of inverters and supply voltage


of inverter (higher voltage obtains faster inverter).

July 2010 15
Components of VCO

Buffering for
driving clk_out
clk_out
Vctrl Vcc

V-cc
- cc
V

+
- +-

Vss
Level Converter from
Ring of 5 inverters Vctrl-Vss to Vcc-Vss

July 2010 16
Delay Locked Loop (DLL)

• It is a variant of PLL that uses voltage-controlled delay


line rather than oscillator.
• It adjusts phase only. Frequency multiplication is
impossible.
• It is simpler than PLL, less sensitive to Vctrl noise and
requires simpler loop filter.
• It is very difficult to correctly design PLL and DLL. It
requires expertise in control systems and analog circuit
design.

July 2010 17
How DLL Works?

Loop
Charge Filter
Pump
C
Up
ref_clk I R
Phase Voltage clk_out
Detect Controlled
Vctrl Delay Line
fdbk_clk I
Down

July 2010 18
Clock Distribution Networks

July 2010 19
Tree Clock Network (Unconstrained)
clk1

clk2
clk3

clkn

No constraints imposed on buffers and wires.


Used mostly by automatic tools in automatic synthesis flows.
Can be used for small blocks within large design.
Tools aim at minimizing the variance of clock delays.

July 2010 20
If TCLK is the clock delay at a leaf, the variance exapression
i
2
n  1 n 
 T
i 1  CLKi

  i 1TCLK  should be minimized.
n i

Serpentine routing or extra buffers may be introduced


to obtain small variance.

Constraints on power can be imposed by limiting


number and size of clock buffers and width of wires.

July 2010 21
Clock Distribution with Grids

Grid feeds flops


directly, no local
buffers

Clock driver tree spans height of chip


Internal levels shorted together

Low skew but high power


July 2010 22
Clock drivers are on perimeter Clock drivers are on grid points

July 2010 23
Delay and Skew in Grid Distribution

July 2010 24
DEC’s Alpha Microprocessor Clocking

July 2010 25
DEC Alpha 21264 Microprocessor Clock distribution

July 2010 26
Clock Distribution with Spines

July 2010 27
Intel’s Pentium4 Clock Distribution

July 2010 28
July 2010 29
Clock Distribution with Trees
RC-Tree H-Tree

Recursive pattern to
Each branch is individually distribute signals uniformly
routed to balance RC delay with equal delay over area

More skew but less power

July 2010 30
Clock H-Tree

chip / functional block / IP

sequential elements

clock / PLL

July 2010 31
IBM / Motorola PowerPC Clock Distribution

July 2010 32
Delay Calculation

We use Elmore delay model. Sub trees are modeled


as capacitive loads
July 2010 33
Clock Skew and Jitter
• Clock should theoretically arrive simultaneously to all
sequential circuits.
• Practically it arrives in different times. The differences
are called clock skews.
• Skews result from paths mismatches, process variations
and ambient conditions, resulting physical clocks.
• Most systems distribute a global clock and then use
local clock gaters located near clocked elements.
Clock skew consists of the following components:

July 2010 34
• Systematic is the portion existing under nominal
conditions. It can be minimized by appropriate design.
• Random is caused by process variations like devices’
channel length, oxide thickness, threshold voltage, wire
thickness, width and space. It can be measured on silicon
and adjusted by delay components.
• Drift is caused by time-dependent environmental
variations, occurring relatively slowly. Compensation of
those must takes place periodically.
• Jitter is rapid clock changes, occurring by power noise
and clock generator jitter. It cannot be compensated.
July 2010 35
Factors affecting clock skew, Intel 1998, 0.25u.

July 2010 36
Skew, Clock Cycle and Design Margins

Clock Jitter is the same order as skew, but far more


difficult to compensate.
July 2010 37
Skew Modeling
Point of
divergence Q
1 2 Tclk1 m

Clock 
CL
Generator
1 2 Tclk2 n
D

TCLK1   TCLK2  
m n
T
i 1 i
 m T
i 1 i
 n

Cl - capacitive load, I d - drive current, VCC - voltage swing

  Cl VCC I d
July 2010 38
Consider a small change in the delay, taking the linear term.

   Cl VCC ClVCC
  VCC  Cl  I d  VCC  C l  I d
VCC Cl I d Id Id Id 2

 VCC Cl I d 
      
 VCC Cl Id 

Standard deviation of stage delay is     = .  is around 5%.

If clock buffers delays are normally distributed and independent

of each other, then   TCLK1   m , and   TCLK2   n .

Skew is:   Tskew     | TCLK2  TCLK1 |    


m  n  skew .

July 2010 39
Clock Distribution Switching Power
Consider m - level clock tree.
Let Cl_m be the total load of its far-end driven sequential

elements.
Assuming a fixed fanout k of each clock tree buffer, the

dynamic power is: 2


PCLK_m  Cl_mVCC f,

Cl_m 2
PCLK_  m j   VCC f , 0  j  m  1.
kj

July 2010 40
Summing over whole clock tree:
m 1
PCLK   C V 2
j 0 l_  m  j  CC
f 

m 1 Cl_m
1   1 k  m 
2
VCC f 2
 VCC fCl_m  
j 0
 1   1 k  
j
k
How much of the power is consumed by the far
end drivers of clock tree?

PCLK _ m 1  1 k 
 .
1  1 k 
m
PCLK

July 2010 41
Given the number of sequential elements in a block, at
least 50% of the switching power is consumed by the
far end drivers (clock tree is binary, k=2). This number
approaches 1 rapidly with k growth.

Example: Assume a block with 214 sequential elements


and H-tree clock distribution. Then k=4 and m=7. The
far end drivers consume nearly 75% of the clock tree
switching power, while adding the next upper level
drivers brings it to more than 90%.

July 2010 42
Active Clock De-Skewing

• Compensates process variability, temperature gradients,


imperfect design.
• Can be implemented for global fixes (small HW
overhead) or local fixes (high HW overhead).
• Can be used at testing for one time fix (variability
occurring during manufacturing), or dynamically
concurrently with chip operation.
• Its implementation is a difficult design challenge.

July 2010 43
Intel’s Pentium2 De-Skewing System

1998, 450MHz Clock


0.25u process
60pSec skew w/o fix
15pSec skew with fix

Two clock spines for


two clock regions.

A phase detector
detects relative shifts.

Clock of a region is
shifted by a delay line.

July 2010 44
Delay line consists of two cascaded inverters.
Each has a programmable load consists of eight parallel P-N gate
capacitors.
The shift register stores a thermometer code for load programming
in steps of 12pSec.
July 2010 45
Another Programmable Delay Line
The signal from input to output is delayed In
1
Out
or not according to the control bit.
0

In
2n1 1 2n2 1 1 1
Out

0 0 0

Dn 1 Dn  2 D0

Connecting a series of n delay elements, each of delay 2i , n  1  i  0,

it is possible to control the dely of the line to any value from 0 to 2 n  1

dealy units by an n-bit control word  Dn1, Dn 2 , , D0  .


July 2010 46
Intel’s IA64 Itanium1 De-Skewing System

2000, 800MHz clock


0.18u process
28pSec skew with fix
X4 increase w/o fix

30 independent de-skew
regions. PLL

Each cluster is driven


from a global H-tree.

Delay circuit in de-skew


region are similar to
Pentium3 with 20-bit
registers.

July 2010 47
July 2010 48
Proposal for H-Tree Clock De-Skew –
Hierarchical Approach

If a phase detector (PD) has a skew guard band g, then guard bands
may accumulate along tree paths.
For example, if a logic stage is shared between region B and C, it may
add 7g time units to path delay.
July 2010 49
Proposal for H-Tree Clock De-Skew –
Mesh Approach

Clock is distributed by H-
tree, but de-skew takes
place by neighbor leaves
phase detection.

A delay buffer accepts


phase inputs from its 4
neighbors and then
decides of whether to
increase, decrease or
not change its delay.

July 2010 50
Clock Characteristics of Commercial Processors

July 2010 51
Power Consumption in Chips
• Clock power may reach 50% of total (dynamic +
static).
• Clock gating is very useful and standard design
practice
• Four gating methods:
– Synthesis based, automated by EDA tools, RTL
compilers, inserted into clock-tree
– Clock enable signals manually defined by designer,
inserted into clock-tree, FFs’ clock input
– Data-Driven clock gating, inserted at FF-level
– Auto-gated FF, inserted at latch-level
52
FF Data Toggling (DSP core)

23k FFs, Test bench of 250k CLK cycles

10% of application run-time CLK is enabled.


Of which, only 1.6% CLK pulses are useful!

53
data activity / clock activity

0.03

block count
FF Data Toggling

a
b
63 control blocks of ϻP, 200k FFs

cumulative normalized clock


capacitive load
Data-Driven CLK Gating

55
tight
constraint

56
How Many FFs To Group ?

Latch overhead
Net saving per FF amortized over k FFs

csaving  q k
 cFF  cw   clatch k   1  q   cw  cOR  
FF

Gater’s disabling probability Probability of enabling FF

q k ln q  cFF  cW   clatch k 2  0
57
58
Given n flip-flops and m  1 clock cycles
a   a1, , am  is the activity (toggling) of flip-flop

ai  a j is the number of redundant clock pulses


ocurring by jointly clocking FFi and FF j
59
FF Pairwise Activity Model

G  V , E , w  : FF pairwise activity graph.

vi  V corresponds to FFi .

 
 eij  vi , v j  E is FF pairing.

ai | a j is joint toggling.

 
w eij  ai  a j is redundant clock pulses, hence a waste.

E   E: vertex matching
60
Total power:

P  2 e ai | a j 
ij E 

 v V ai   e   a i   ai | a j   
 a j  ai | a j  
i ij E  e E  
ij 

Essential + Waste
 v V ai   e ai  a j   v V ai   e E  w  eij 
i ij E  i ij

61
Flip-Flop Grouping Algorithm
FF j

Minimize:  ai  a j ai  a j
FFi

Optimal FFs pairing


can be solved by
minimal cost perfect
graph matching.

Can be repeated for groups of size 4, 8, 16 …


62
Is repeated perfect matching optimal ?

a 7 | a8
a3 | a 4
a1 | a2
a 6 | a5

a1 a3 a2 a7 a4 a6 a5 a8

63
No! Here is the optimal 4-size grouping

64
Multi-Bit Flip-Flop
Saving the power of internal CLK drivers
1-bit flip-flop

CLK 2-bit flip-flop


CLK CLK

D1 Master Slave
Master Slave Q1
D Q Latch Latch
Latch Latch

+ = CLK
CLK
CLK

CLK Master Slave


CLK CLK D2 Q2
Latch Latch

D
Master Slave Q
Latch Latch

65
66
MBFF should be combined with data-driven CG to
maximize energy savings.

Toggling vectors (VCD) are unfortunately not


always available.

Data-to-clock toggling ratio (probability) is more


often available.

How to utilize it for MBFF optimal grouping?

67
What is the energy waste in 2-bit MBFF?

 
p j  1  pi   pi 1  p j  pi  p j  2 pi p j

n n2
For n FFs grouped
in n/2 MBFFs it is  p j  2 psi pti
j 1 i 1

What is the optimal MBFF grouping?

Group the FFs such that p1  p2    pn


68
Auto-Gated FF

69
Look-Ahead CLK Gating

70
Relaxed
constraint

71
Power Saving Per FF

save
cdyn 

 1  p   cFF+CLK  cFF  cO  
k

 cFF+CLK 
p  cX  kcO    cAint  cFF  cO 
 3 

72
Which FF to Gate?

73
Results

74
75

You might also like