Professional Documents
Culture Documents
Shmuel Wimer
Bar Ilan Univ. Eng. Faculty
Technion, EE Faculty
July 2010 1
Clock System Architecture
clk3
Clock
ext_clk Distribution
External Clock gclk
Clock Generator Buffers
Gaters
clk1 clk2
Clocked
Elements
Chip
July 2010 3
Synchronous Chip Interface with PLL
Chip A communicates synchronously with chip B
Chip B uses the clock sent by chip A. Data in and out must be
synchronized to the common clock.
Chip A Chip B
CLKin
ext_clk ref_clk
CLKout gclk
clk PLL clk_out
Din fdbk_clk
Dout
Dout
Din Clock Distribution
July 2010 4
How PLL Works?
Loop
Charge Filter
Pump
C
Up
ref_clk I R
M Voltage
Phase clk_out
Detect Controlled
Vctrl Oscillator
N
fdbk_clk I
Down
July 2010 5
Phase – Frequency Detector (PFD)
Q_A: B should go
1 D Q faster
A
CLR
CLR
B
1 Q_B: B should go
D Q
slower
The two flip-flops receive the signals at their clock input (one is usually a
reference and the other is the sampled).
Once the lagging signal arrives, a reset turns both Q_A and Q_B to zero.
July 2010 6
What happens when the reference and the sampled signals are a shift of
each other?
A: reference
B: sampled
Q_A: sampled
should go faster
Q_B: sampled
should go slower
The spikes at Q_B are a result of the delay of the AND gate driving the
CLR input of flip-flip and the internal delay from CLR to Q.
July 2010 7
What happens when the reference and the sampled signals have different
frequencies?
A: reference
B: sampled
Q_A: sampled
should go faster
Q_B: sampled
should go slower
Sampled is more often 1-value than the reference is, since rising edge of
B occurs more often than rising edge of A.
July 2010 8
Charge Pump
faster
1 D Q
CLK_ref Icp
CLR
Sup Vctrl
Sdn
CLR
CLK_fdbk
Icp
1 D Q
slower
Converts PFD error (digital) to charge (analog), which then controls PLL VCO.
July 2010 9
Current Mirror
V Vcc
Iin Iout
N1 N2
P1 P2
Iin Iout
Vss V
This is an ideal current source with infinite output impedance since Iout is
independent of N2 load; a change in output voltage doesn’t affect Iout.
July 2010 10
How Charge Pump Works?
slower
Switch – open when
slower = 1
I I I
July 2010 11
Faster Mode Vout → Vcc
Vcc Vcc
faster=1 R
Vout
slower=0
Vss
July 2010 12
Slower Mode Vout → Vss
Vcc Vcc
faster=0 R
Vout
slower=1
Vss
July 2010 13
Loop Filter
Vcc
+
Vctrl Vctrl
July 2010 14
Voltage Controlled Oscillator (VCO)
A Ring Oscillator cascades an odd number of inverters and feeds back the
last output to first inverter (even number of inverters will be stable). It starts
to oscillate spontaneously.
Vout
July 2010 15
Components of VCO
Buffering for
driving clk_out
clk_out
Vctrl Vcc
V-cc
- cc
V
+
- +-
Vss
Level Converter from
Ring of 5 inverters Vctrl-Vss to Vcc-Vss
July 2010 16
Delay Locked Loop (DLL)
July 2010 17
How DLL Works?
Loop
Charge Filter
Pump
C
Up
ref_clk I R
Phase Voltage clk_out
Detect Controlled
Vctrl Delay Line
fdbk_clk I
Down
July 2010 18
Clock Distribution Networks
July 2010 19
Tree Clock Network (Unconstrained)
clk1
clk2
clk3
clkn
July 2010 20
If TCLK is the clock delay at a leaf, the variance exapression
i
2
n 1 n
T
i 1 CLKi
i 1TCLK should be minimized.
n i
July 2010 21
Clock Distribution with Grids
July 2010 23
Delay and Skew in Grid Distribution
July 2010 24
DEC’s Alpha Microprocessor Clocking
July 2010 25
DEC Alpha 21264 Microprocessor Clock distribution
July 2010 26
Clock Distribution with Spines
July 2010 27
Intel’s Pentium4 Clock Distribution
July 2010 28
July 2010 29
Clock Distribution with Trees
RC-Tree H-Tree
Recursive pattern to
Each branch is individually distribute signals uniformly
routed to balance RC delay with equal delay over area
July 2010 30
Clock H-Tree
sequential elements
clock / PLL
July 2010 31
IBM / Motorola PowerPC Clock Distribution
July 2010 32
Delay Calculation
July 2010 34
• Systematic is the portion existing under nominal
conditions. It can be minimized by appropriate design.
• Random is caused by process variations like devices’
channel length, oxide thickness, threshold voltage, wire
thickness, width and space. It can be measured on silicon
and adjusted by delay components.
• Drift is caused by time-dependent environmental
variations, occurring relatively slowly. Compensation of
those must takes place periodically.
• Jitter is rapid clock changes, occurring by power noise
and clock generator jitter. It cannot be compensated.
July 2010 35
Factors affecting clock skew, Intel 1998, 0.25u.
July 2010 36
Skew, Clock Cycle and Design Margins
Clock
CL
Generator
1 2 Tclk2 n
D
TCLK1 TCLK2
m n
T
i 1 i
m T
i 1 i
n
Cl VCC I d
July 2010 38
Consider a small change in the delay, taking the linear term.
Cl VCC ClVCC
VCC Cl I d VCC C l I d
VCC Cl I d Id Id Id 2
VCC Cl I d
VCC Cl Id
July 2010 39
Clock Distribution Switching Power
Consider m - level clock tree.
Let Cl_m be the total load of its far-end driven sequential
elements.
Assuming a fixed fanout k of each clock tree buffer, the
Cl_m 2
PCLK_ m j VCC f , 0 j m 1.
kj
July 2010 40
Summing over whole clock tree:
m 1
PCLK C V 2
j 0 l_ m j CC
f
m 1 Cl_m
1 1 k m
2
VCC f 2
VCC fCl_m
j 0
1 1 k
j
k
How much of the power is consumed by the far
end drivers of clock tree?
PCLK _ m 1 1 k
.
1 1 k
m
PCLK
July 2010 41
Given the number of sequential elements in a block, at
least 50% of the switching power is consumed by the
far end drivers (clock tree is binary, k=2). This number
approaches 1 rapidly with k growth.
July 2010 42
Active Clock De-Skewing
July 2010 43
Intel’s Pentium2 De-Skewing System
A phase detector
detects relative shifts.
Clock of a region is
shifted by a delay line.
July 2010 44
Delay line consists of two cascaded inverters.
Each has a programmable load consists of eight parallel P-N gate
capacitors.
The shift register stores a thermometer code for load programming
in steps of 12pSec.
July 2010 45
Another Programmable Delay Line
The signal from input to output is delayed In
1
Out
or not according to the control bit.
0
In
2n1 1 2n2 1 1 1
Out
0 0 0
Dn 1 Dn 2 D0
30 independent de-skew
regions. PLL
July 2010 47
July 2010 48
Proposal for H-Tree Clock De-Skew –
Hierarchical Approach
If a phase detector (PD) has a skew guard band g, then guard bands
may accumulate along tree paths.
For example, if a logic stage is shared between region B and C, it may
add 7g time units to path delay.
July 2010 49
Proposal for H-Tree Clock De-Skew –
Mesh Approach
Clock is distributed by H-
tree, but de-skew takes
place by neighbor leaves
phase detection.
July 2010 50
Clock Characteristics of Commercial Processors
July 2010 51
Power Consumption in Chips
• Clock power may reach 50% of total (dynamic +
static).
• Clock gating is very useful and standard design
practice
• Four gating methods:
– Synthesis based, automated by EDA tools, RTL
compilers, inserted into clock-tree
– Clock enable signals manually defined by designer,
inserted into clock-tree, FFs’ clock input
– Data-Driven clock gating, inserted at FF-level
– Auto-gated FF, inserted at latch-level
52
FF Data Toggling (DSP core)
53
data activity / clock activity
0.03
block count
FF Data Toggling
a
b
63 control blocks of ϻP, 200k FFs
55
tight
constraint
56
How Many FFs To Group ?
Latch overhead
Net saving per FF amortized over k FFs
csaving q k
cFF cw clatch k 1 q cw cOR
FF
q k ln q cFF cW clatch k 2 0
57
58
Given n flip-flops and m 1 clock cycles
a a1, , am is the activity (toggling) of flip-flop
vi V corresponds to FFi .
eij vi , v j E is FF pairing.
ai | a j is joint toggling.
w eij ai a j is redundant clock pulses, hence a waste.
E E: vertex matching
60
Total power:
P 2 e ai | a j
ij E
v V ai e a i ai | a j
a j ai | a j
i ij E e E
ij
Essential + Waste
v V ai e ai a j v V ai e E w eij
i ij E i ij
61
Flip-Flop Grouping Algorithm
FF j
Minimize: ai a j ai a j
FFi
a 7 | a8
a3 | a 4
a1 | a2
a 6 | a5
a1 a3 a2 a7 a4 a6 a5 a8
63
No! Here is the optimal 4-size grouping
64
Multi-Bit Flip-Flop
Saving the power of internal CLK drivers
1-bit flip-flop
D1 Master Slave
Master Slave Q1
D Q Latch Latch
Latch Latch
+ = CLK
CLK
CLK
D
Master Slave Q
Latch Latch
65
66
MBFF should be combined with data-driven CG to
maximize energy savings.
67
What is the energy waste in 2-bit MBFF?
p j 1 pi pi 1 p j pi p j 2 pi p j
n n2
For n FFs grouped
in n/2 MBFFs it is p j 2 psi pti
j 1 i 1
69
Look-Ahead CLK Gating
70
Relaxed
constraint
71
Power Saving Per FF
save
cdyn
1 p cFF+CLK cFF cO
k
cFF+CLK
p cX kcO cAint cFF cO
3
72
Which FF to Gate?
73
Results
74
75