Clock Distribution: Shmuel Wimer

Clock Distribution
Shmuel Wimer
Bar Ilan Univ. Eng. Faculty
Technion, EE Faculty
July 2010 1
Clock System Architecture
clk3
Clock
ext_clk Distribution
External Clock gclk
Clock Generator Buffers
Gaters
clk1 clk2
Clocked
Elements
Chip
Chip receives external clock through I/O pad.

Clock generator adjusts the global clock to the external clock.
Global clock is distributed across the chip.
Local drivers and gaters drive the physical clocks to clocked elements.
July 2010 2
Global Clock Generation
• Receives external clock signal and produce the global
clock distributed across the die.
• A large skew occurs between external clock and the
physical clocks at clocked elements due to delay of
distribution network (wires, buffers, gaters).
• Therefore, data at clocked elements is no more in sync
with data at I/O pins.
• Phased Locked Loop (PLL) compensates this delay.
• PLL can perform frequency multiplication to obtain the
required on-chip frequencies.
July 2010 3
Synchronous Chip Interface with PLL
Chip A communicates synchronously with chip B
Chip B uses the clock sent by chip A. Data in and out must be
synchronized to the common clock.
A PLL produces the global clock of chip B such that it is in sync

with the external clock.
Chip A Chip B
CLKin
ext_clk ref_clk
CLKout gclk
clk PLL clk_out
Din fdbk_clk
Dout
Dout
Din Clock Distribution
July 2010 4
How PLL Works?
Loop
Charge Filter
Pump
C
Up
ref_clk I R
M Voltage
Phase clk_out
Detect Controlled
Vctrl Oscillator
N
fdbk_clk I
Down
July 2010 5
Phase – Frequency Detector (PFD)
Q_A: B should go
1 D Q faster
A
CLR
CLR
B
1 Q_B: B should go
D Q
slower
The two flip-flops receive the signals at their clock input (one is usually a
reference and the other is the sampled).
The output of the leading flip-flop is 1 for the lead duration.
Once the lagging signal arrives, a reset turns both Q_A and Q_B to zero.
July 2010 6
What happens when the reference and the sampled signals are a shift of
each other?
A: reference
B: sampled
Q_A: sampled
should go faster
Q_B: sampled
should go slower
The spikes at Q_B are a result of the delay of the AND gate driving the
CLR input of flip-flip and the internal delay from CLR to Q.
July 2010 7
What happens when the reference and the sampled signals have different
frequencies?
A: reference
B: sampled
Q_A: sampled
should go faster
Q_B: sampled
should go slower
Sampled is more often 1-value than the reference is, since rising edge of
B occurs more often than rising edge of A.
July 2010 8
Charge Pump
faster
1 D Q
CLK_ref Icp
CLR
Sup Vctrl
Sdn
CLR
CLK_fdbk
Icp
1 D Q
slower
Converts PFD error (digital) to charge (analog), which then controls PLL VCO.
Charge is proportional to PFD widths: Qcp  I up  tfaster  I dn  tslower .
July 2010 9
Current Mirror
V Vcc
Iin Iout
N1 N2
P1 P2
Iin Iout
Vss V
Charge pump consists of current mirrors which are sources of constant

current. Device N1 is in saturation since its gate is connected to high voltage.
Ids (=Iin) depends only on Vgs. Vgs is similar in N2, hence Iout=Iin.
This is an ideal current source with infinite output impedance since Iout is
independent of N2 load; a change in output voltage doesn’t affect Iout.
Current mirror works similarly for P transistors.
July 2010 10
How Charge Pump Works?
R load – determines Vcc Vcc

the current through
current mirror
I I I C
P-type current mirror
faster R
Switch – open when
faster = 1
Vout
slower
Switch – open when
slower = 1
I I I
N-type current mirror

Vss
July 2010 11
Faster Mode Vout → Vcc
Vcc Vcc
faster=1 R
Vout
slower=0
Vss
July 2010 12
Slower Mode Vout → Vss
Vcc Vcc
faster=0 R
Vout
slower=1
Vss
July 2010 13
Loop Filter
Vcc
+
Vctrl Vctrl
Differential amplifier connected as a unity-gain follower is used.
July 2010 14
Voltage Controlled Oscillator (VCO)
A Ring Oscillator cascades an odd number of inverters and feeds back the
last output to first inverter (even number of inverters will be stable). It starts
to oscillate spontaneously.
Vout
If tinv is the delay of an inverter and n is the number of inverters,
oscillation frequency is f  1  2ntinv 
Frequency can be controlled by number of inverters and supply voltage

of inverter (higher voltage obtains faster inverter).
July 2010 15
Components of VCO
Buffering for
driving clk_out
clk_out
Vctrl Vcc
V-cc
- cc
V
+
- +-
Vss
Level Converter from
Ring of 5 inverters Vctrl-Vss to Vcc-Vss
July 2010 16
Delay Locked Loop (DLL)
• It is a variant of PLL that uses voltage-controlled delay

line rather than oscillator.
• It adjusts phase only. Frequency multiplication is
impossible.
• It is simpler than PLL, less sensitive to Vctrl noise and
requires simpler loop filter.
• It is very difficult to correctly design PLL and DLL. It
requires expertise in control systems and analog circuit
design.
July 2010 17
How DLL Works?
Loop
Charge Filter
Pump
C
Up
ref_clk I R
Phase Voltage clk_out
Detect Controlled
Vctrl Delay Line
fdbk_clk I
Down
July 2010 18
Clock Distribution Networks
July 2010 19
Tree Clock Network (Unconstrained)
clk1
clk2
clk3
clkn
No constraints imposed on buffers and wires.

Used mostly by automatic tools in automatic synthesis flows.
Can be used for small blocks within large design.
Tools aim at minimizing the variance of clock delays.
July 2010 20
If TCLK is the clock delay at a leaf, the variance exapression
i
2
n  1 n 
 T
i 1  CLKi

  i 1TCLK  should be minimized.
n i
Serpentine routing or extra buffers may be introduced

to obtain small variance.
Constraints on power can be imposed by limiting

number and size of clock buffers and width of wires.
July 2010 21
Clock Distribution with Grids
Grid feeds flops

directly, no local
buffers
Clock driver tree spans height of chip

Internal levels shorted together
Low skew but high power

July 2010 22
Clock drivers are on perimeter Clock drivers are on grid points
July 2010 23
Delay and Skew in Grid Distribution
July 2010 24
DEC’s Alpha Microprocessor Clocking
July 2010 25
DEC Alpha 21264 Microprocessor Clock distribution
July 2010 26
Clock Distribution with Spines
July 2010 27
Intel’s Pentium4 Clock Distribution
July 2010 28
July 2010 29
Clock Distribution with Trees
RC-Tree H-Tree
Recursive pattern to
Each branch is individually distribute signals uniformly
routed to balance RC delay with equal delay over area
More skew but less power
July 2010 30
Clock H-Tree
chip / functional block / IP
sequential elements
clock / PLL
July 2010 31
IBM / Motorola PowerPC Clock Distribution
July 2010 32
Delay Calculation
We use Elmore delay model. Sub trees are modeled

as capacitive loads
July 2010 33
Clock Skew and Jitter
• Clock should theoretically arrive simultaneously to all
sequential circuits.
• Practically it arrives in different times. The differences
are called clock skews.
• Skews result from paths mismatches, process variations
and ambient conditions, resulting physical clocks.
• Most systems distribute a global clock and then use
local clock gaters located near clocked elements.
Clock skew consists of the following components:
July 2010 34
• Systematic is the portion existing under nominal
conditions. It can be minimized by appropriate design.
• Random is caused by process variations like devices’
channel length, oxide thickness, threshold voltage, wire
thickness, width and space. It can be measured on silicon
and adjusted by delay components.
• Drift is caused by time-dependent environmental
variations, occurring relatively slowly. Compensation of
those must takes place periodically.
• Jitter is rapid clock changes, occurring by power noise
and clock generator jitter. It cannot be compensated.
July 2010 35
Factors affecting clock skew, Intel 1998, 0.25u.
July 2010 36
Skew, Clock Cycle and Design Margins
Clock Jitter is the same order as skew, but far more

difficult to compensate.
July 2010 37
Skew Modeling
Point of
divergence Q
1 2 Tclk1 m
Clock 
CL
Generator
1 2 Tclk2 n
D
TCLK1   TCLK2  
m n
T
i 1 i
 m T
i 1 i
 n
Cl - capacitive load, I d - drive current, VCC - voltage swing
  Cl VCC I d
July 2010 38
Consider a small change in the delay, taking the linear term.
   Cl VCC ClVCC
  VCC  Cl  I d  VCC  C l  I d
VCC Cl I d Id Id Id 2
 VCC Cl I d 
      
 VCC Cl Id 
Standard deviation of stage delay is     = .  is around 5%.
If clock buffers delays are normally distributed and independent
of each other, then   TCLK1   m , and   TCLK2   n .
Skew is:   Tskew     | TCLK2  TCLK1 |    

m  n  skew .
July 2010 39
Clock Distribution Switching Power
Consider m - level clock tree.
Let Cl_m be the total load of its far-end driven sequential
elements.
Assuming a fixed fanout k of each clock tree buffer, the
dynamic power is: 2

PCLK_m  Cl_mVCC f,
Cl_m 2
PCLK_  m j   VCC f , 0  j  m  1.
kj
July 2010 40
Summing over whole clock tree:
m 1
PCLK   C V 2
j 0 l_  m  j  CC
f 
m 1 Cl_m
1   1 k  m 
2
VCC f 2
 VCC fCl_m  
j 0
 1   1 k  
j
k
How much of the power is consumed by the far
end drivers of clock tree?
PCLK _ m 1  1 k 
 .
1  1 k 
m
PCLK
July 2010 41
Given the number of sequential elements in a block, at
least 50% of the switching power is consumed by the
far end drivers (clock tree is binary, k=2). This number
approaches 1 rapidly with k growth.
Example: Assume a block with 214 sequential elements

and H-tree clock distribution. Then k=4 and m=7. The
far end drivers consume nearly 75% of the clock tree
switching power, while adding the next upper level
drivers brings it to more than 90%.
July 2010 42
Active Clock De-Skewing
• Compensates process variability, temperature gradients,

imperfect design.
• Can be implemented for global fixes (small HW
overhead) or local fixes (high HW overhead).
• Can be used at testing for one time fix (variability
occurring during manufacturing), or dynamically
concurrently with chip operation.
• Its implementation is a difficult design challenge.
July 2010 43
Intel’s Pentium2 De-Skewing System
1998, 450MHz Clock

0.25u process
60pSec skew w/o fix
15pSec skew with fix
Two clock spines for

two clock regions.
A phase detector
detects relative shifts.
Clock of a region is
shifted by a delay line.
July 2010 44
Delay line consists of two cascaded inverters.
Each has a programmable load consists of eight parallel P-N gate
capacitors.
The shift register stores a thermometer code for load programming
in steps of 12pSec.
July 2010 45
Another Programmable Delay Line
The signal from input to output is delayed In
1
Out
or not according to the control bit.
0
In
2n1 1 2n2 1 1 1
Out
0 0 0
Dn 1 Dn  2 D0
Connecting a series of n delay elements, each of delay 2i , n  1  i  0,
it is possible to control the dely of the line to any value from 0 to 2 n  1
dealy units by an n-bit control word  Dn1, Dn 2 , , D0  .

July 2010 46
Intel’s IA64 Itanium1 De-Skewing System
2000, 800MHz clock

0.18u process
28pSec skew with fix
X4 increase w/o fix
30 independent de-skew
regions. PLL
Each cluster is driven

from a global H-tree.
Delay circuit in de-skew

region are similar to
Pentium3 with 20-bit
registers.
July 2010 47
July 2010 48
Proposal for H-Tree Clock De-Skew –
Hierarchical Approach
If a phase detector (PD) has a skew guard band g, then guard bands
may accumulate along tree paths.
For example, if a logic stage is shared between region B and C, it may
add 7g time units to path delay.
July 2010 49
Proposal for H-Tree Clock De-Skew –
Mesh Approach
Clock is distributed by H-
tree, but de-skew takes
place by neighbor leaves
phase detection.
A delay buffer accepts

phase inputs from its 4
neighbors and then
decides of whether to
increase, decrease or
not change its delay.
July 2010 50
Clock Characteristics of Commercial Processors
July 2010 51
Power Consumption in Chips
• Clock power may reach 50% of total (dynamic +
static).
• Clock gating is very useful and standard design
practice
• Four gating methods:
– Synthesis based, automated by EDA tools, RTL
compilers, inserted into clock-tree
– Clock enable signals manually defined by designer,
inserted into clock-tree, FFs’ clock input
– Data-Driven clock gating, inserted at FF-level
– Auto-gated FF, inserted at latch-level
52
FF Data Toggling (DSP core)
23k FFs, Test bench of 250k CLK cycles
10% of application run-time CLK is enabled.

Of which, only 1.6% CLK pulses are useful!
53
data activity / clock activity
0.03
block count
FF Data Toggling
a
b
63 control blocks of ϻP, 200k FFs
cumulative normalized clock

capacitive load
Data-Driven CLK Gating
55
tight
constraint
56
How Many FFs To Group ?
Latch overhead
Net saving per FF amortized over k FFs
csaving  q k
 cFF  cw   clatch k   1  q   cw  cOR  
FF
Gater’s disabling probability Probability of enabling FF
q k ln q  cFF  cW   clatch k 2  0
57
58
Given n flip-flops and m  1 clock cycles
a   a1, , am  is the activity (toggling) of flip-flop
ai  a j is the number of redundant clock pulses

ocurring by jointly clocking FFi and FF j
59
FF Pairwise Activity Model
G  V , E , w  : FF pairwise activity graph.
vi  V corresponds to FFi .
 
eij  vi , v j  E is FF pairing.
ai | a j is joint toggling.
 
w eij  ai  a j is redundant clock pulses, hence a waste.
E   E: vertex matching
60
Total power:
P  2 e ai | a j 
ij E 
 v V ai   e   a i   ai | a j   
 a j  ai | a j  
i ij E  e E  
ij 
Essential + Waste
 v V ai   e ai  a j   v V ai   e E  w  eij 
i ij E  i ij
61
Flip-Flop Grouping Algorithm
FF j
Minimize:  ai  a j ai  a j
FFi
Optimal FFs pairing

can be solved by
minimal cost perfect
graph matching.
Can be repeated for groups of size 4, 8, 16 …

62
Is repeated perfect matching optimal ?
a 7 | a8
a3 | a 4
a1 | a2
a 6 | a5
a1 a3 a2 a7 a4 a6 a5 a8
63
No! Here is the optimal 4-size grouping
64
Multi-Bit Flip-Flop
Saving the power of internal CLK drivers
1-bit flip-flop
CLK 2-bit flip-flop

CLK CLK
D1 Master Slave
Master Slave Q1
D Q Latch Latch
Latch Latch
+ = CLK
CLK
CLK
CLK Master Slave

CLK CLK D2 Q2
Latch Latch
D
Master Slave Q
Latch Latch
65
66
MBFF should be combined with data-driven CG to
maximize energy savings.
Toggling vectors (VCD) are unfortunately not

always available.
Data-to-clock toggling ratio (probability) is more

often available.
How to utilize it for MBFF optimal grouping?
67
What is the energy waste in 2-bit MBFF?
 
p j  1  pi   pi 1  p j  pi  p j  2 pi p j
n n2
For n FFs grouped
in n/2 MBFFs it is  p j  2 psi pti
j 1 i 1
What is the optimal MBFF grouping?
Group the FFs such that p1  p2    pn

68
Auto-Gated FF
69
Look-Ahead CLK Gating
70
Relaxed
constraint
71
Power Saving Per FF
save
cdyn 
 1  p   cFF+CLK  cFF  cO  
k
 cFF+CLK 
p  cX  kcO    cAint  cFF  cO 
 3 
72
Which FF to Gate?
73
Results
74
75

Clock Distribution: Shmuel Wimer

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Clock Distribution: Shmuel Wimer

Uploaded by

Copyright:

Available Formats

Clock Distribution

Chip receives external clock through I/O pad.

A PLL produces the global clock of chip B such that it is in sync

The output of the leading flip-flop is 1 for the lead duration.

Charge is proportional to PFD widths: Qcp  I up  tfaster  I dn  tslower .

Charge pump consists of current mirrors which are sources of constant

Current mirror works similarly for P transistors.

R load – determines Vcc Vcc

N-type current mirror

Differential amplifier connected as a unity-gain follower is used.

If tinv is the delay of an inverter and n is the number of inverters,

oscillation frequency is f  1  2ntinv 

Frequency can be controlled by number of inverters and supply voltage

• It is a variant of PLL that uses voltage-controlled delay

No constraints imposed on buffers and wires.

Serpentine routing or extra buffers may be introduced

Constraints on power can be imposed by limiting

Grid feeds flops

Clock driver tree spans height of chip

Low skew but high power

More skew but less power

chip / functional block / IP

We use Elmore delay model. Sub trees are modeled

Clock Jitter is the same order as skew, but far more

Cl - capacitive load, I d - drive current, VCC - voltage swing

Standard deviation of stage delay is     = .  is around 5%.

If clock buffers delays are normally distributed and independent

of each other, then   TCLK1   m , and   TCLK2   n .

Skew is:   Tskew     | TCLK2  TCLK1 |    

dynamic power is: 2

Example: Assume a block with 214 sequential elements

• Compensates process variability, temperature gradients,

1998, 450MHz Clock

Two clock spines for

Connecting a series of n delay elements, each of delay 2i , n  1  i  0,

it is possible to control the dely of the line to any value from 0 to 2 n  1

dealy units by an n-bit control word  Dn1, Dn 2 , , D0  .

2000, 800MHz clock

Each cluster is driven

Delay circuit in de-skew

A delay buffer accepts

23k FFs, Test bench of 250k CLK cycles

10% of application run-time CLK is enabled.

cumulative normalized clock

Gater’s disabling probability Probability of enabling FF

ai  a j is the number of redundant clock pulses

G  V , E , w  : FF pairwise activity graph.

Optimal FFs pairing

Can be repeated for groups of size 4, 8, 16 …

CLK 2-bit flip-flop

CLK Master Slave

Toggling vectors (VCD) are unfortunately not

Data-to-clock toggling ratio (probability) is more

How to utilize it for MBFF optimal grouping?

What is the optimal MBFF grouping?

Group the FFs such that p1  p2    pn

You might also like