
Unit III - Low Power Clock Distribution
Power dissipation in clock distribution, single driver versus distributed buffers, buffer & device sizing under process variations, zero skew vs. tolerable skew, chip & package co-design of clock network
Introduction
• In synchronous systems, chip performance is directly
proportional to its clock frequency.
• Clock nets need to be routed with great precision, since the actual length of the path of a net from its entry point to its terminals determines the maximum clock frequency at which a chip may operate.
• A clock router needs to take several factors into account,
including the resistance and capacitance of the metal layers,
the noise and cross talk in wires, and the type of load to be
driven.
• In addition, the clock signal must arrive simultaneously at all
functional units with little or no waveform distortion.
• Another important issue related to clock nets is buffering, which is necessary to control skew, delay and waveform distortion.
• However, buffering not only increases the transistor count, it also significantly impacts the power consumption of the chip. In some cases, the clock can consume as much as 25% of the total power and occupy 5-10% of the chip area.
Clock Tree models
Nehalem clock distribution in Intel® Core™ i7/i5/i3 processors
Spine structure for clock distribution
A Scalable, Sub-1W, Sub-10ps Clock Skew, Global Clock Distribution Architecture for Intel® Core™ i7/i5/i3 Microprocessors

• Clock Distribution Architecture


The Nehalem clock distribution, which is part of the PLL control loop, can be divided into three distinct portions, as shown in Figure 1.
1. The first part is a point-to-point distribution that transports the clock from the PLL to the center of the Nehalem core. The PLL resides outside the Nehalem core (Figure 1) to meet PLL power-supply filtering requirements and to enable aggressive scaling of the Nehalem core to the 32nm process technology node.
2. The second part is a recombinant horizontal clock spine that distributes the clock from the center of the Nehalem core across its entire width. This single horizontal clock spine feeds the clock to 21 vertical clock spines distributed across the width of the core.
3. The third part consists of a global clock grid, formed by shorting the vertical clock spines together. This global grid supplies the clock to the various functional units in the Nehalem core.
• The Nehalem global clock distribution has a distribution latency of less than 1.0 ns and a clock bandwidth (BW) in excess of 3.5 GHz at nominal voltage to support the Adaptive Frequency System [3] and the Intel® Turbo Boost Technology feature.
• High clock BW and low latency are attained by using high-performance transistors, optimizing the distance between vertical clock spines, improving clock-driver via resistance, and balancing the clock grid load.
Metal layer stack in an IC - the clock is distributed in the top two metal layers along with power
Definitions of slew, skew and latency
• Slew rate is defined as the rate of change of a signal.
• Slew is typically measured in terms of the transition time, that is, the time it takes for a signal to transition between two specific voltage levels.
• Note that the transition time is effectively the inverse of the slew rate: the larger the transition time, the slower the slew, and vice versa.
• The falling and rising slew measurement thresholds are expressed as a percentage of VDD.
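As a quick numeric illustration (a minimal sketch assuming a 1.0 V supply and hypothetical 10%/90% measurement thresholds, which are not mandated by the text above):

```python
# Minimal sketch relating transition time to slew rate.
# VDD, the 10%/90% thresholds, and the 50 ps rise time are all assumptions.
VDD = 1.0              # supply voltage in volts (assumed)
t_transition = 50e-12  # 10%-to-90% rise time in seconds (assumed)

# Voltage swing covered between the two measurement thresholds
swing = (0.9 - 0.1) * VDD

# Transition time is the inverse notion of slew rate:
# a larger transition time means a slower slew.
slew_rate = swing / t_transition  # volts per second

print(f"slew rate = {slew_rate:.2e} V/s")  # prints: slew rate = 1.60e+10 V/s
```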
Skew between Signals
• Skew is the difference in timing between two or more signals, which may be data, clock, or both.
• For example, if a clock tree has 500 end points and a skew of 50 ps, the difference in latency between the longest and the shortest clock path is 50 ps.
• Figure 2-15 shows an example of a clock tree. The
beginning point of a clock tree typically is a node
where a clock is defined. The end points of a clock
tree are typically clock pins of synchronous
elements, such as flip-flops.
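The definition above can be sketched as the spread of arrival times over the end points; the flip-flop names and latency values below are made up for illustration:

```python
# Sketch: clock skew is the difference between the longest and shortest
# source-to-endpoint latencies. Names and values are assumed examples.
latencies_ps = {"ff1": 480, "ff2": 500, "ff3": 530, "ff4": 495}

skew_ps = max(latencies_ps.values()) - min(latencies_ps.values())
print(skew_ps)  # 530 - 480 = 50 ps
```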
Clock skew
Jitter is the timing variation of a set of signal edges from their ideal values. Clock jitter is typically caused by the clock generator circuitry, noise, power supply variations, interference from nearby circuitry, etc.
Clock latency
• Clock latency is the total time the clock signal takes to travel from the clock source to an end point.
• There are two types of clock latencies: network latency and source
latency.
 Network latency is the delay from the clock definition point to the
clock pin of a flip-flop.
 Source latency, also called insertion delay, is the delay from the
clock source to the clock definition point.

 Source latency could represent either on-chip or off-chip latency.


Figure 7-9 shows both scenarios. The total clock latency at the clock pin of a flip-flop is the sum of the source and network latencies.
• One important distinction between source and network latency is that once a clock tree is built for a design, the network latency can be ignored.
• However, the source latency remains even after the clock tree is built.
• The network latency is an estimate of the delay of the clock tree prior to clock tree synthesis.
• After clock tree synthesis, the total clock latency from the clock source to a clock pin of a flip-flop is the source latency plus the actual delay of the clock tree from the clock definition point to the flip-flop.
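The sum described above can be sketched numerically; the latency values below are illustrative assumptions, not from any real design:

```python
# Sketch: total clock latency at a flip-flop clock pin is the sum of
# source latency (insertion delay) and network latency. Values assumed.
source_latency_ps = 120   # clock source -> clock definition point
network_latency_ps = 380  # clock definition point -> flip-flop clock pin

total_latency_ps = source_latency_ps + network_latency_ps
print(total_latency_ps)  # 500 ps
```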
• Clock skew is the difference in arrival times at the end points of the clock tree.
• Typically, a fixed buffered clock distribution network is used at the chip level.
• At the block level, a local clock routing scheme ensures minimal skew and delay.
• The scheme used in each block can differ, depending on the design style used in the block. The clock routing problem has a significant impact on overall chip design.
• Clock frequencies are increasing quite rapidly. When this material was first written, microprocessors operated at 500 MHz to 650 MHz, with 1.5 - 2.0 GHz parts expected within two to three years; clock rates have since risen well beyond that.
Intel 10th Generation Processor – i9
Turbo Boost is a feature that allows the processor, when fewer than the total number of cores are in use, to turn off the unused cores and increase the clock speed of the remaining cores. This increases performance (the active cores run faster) and can reduce power usage.
Turbo Boost is built into the Intel CPU. By default the processor runs at 2.3 GHz and, under heavy load, it automatically speeds the cores up to 3.3 GHz. Intel Turbo Boost Technology 2.0 is activated when the Operating System (OS) requests the highest processor performance state (P0).
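The core-count/frequency trade-off can be sketched with a toy model; the linear interpolation between the 2.3 GHz base and 3.3 GHz max turbo frequencies is a purely hypothetical assumption, not Intel's actual Turbo Boost algorithm:

```python
# Toy model of Turbo-Boost-style scaling: fewer active cores allow a
# higher clock. The linear mapping below is a hypothetical assumption,
# not Intel's real frequency tables.
def turbo_frequency_ghz(active_cores, base=2.3, max_turbo=3.3, total_cores=4):
    idle = total_cores - active_cores
    # Each idle core buys an equal share of the turbo headroom (assumed).
    return base + (max_turbo - base) * idle / (total_cores - 1)

print(turbo_frequency_ghz(4))  # all cores active -> base clock, 2.3 GHz
print(turbo_frequency_ghz(1))  # one core active -> max turbo, 3.3 GHz
```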
Power dissipation in Clock Distribution
• The dynamic power dissipated by switching the clock can be given by:

    P_clk = (C_L + C_d) · V_DD² · f_clk

• where C_L is the total load on the clock, C_d is the clock driver capacitance, V_DD is the supply voltage and f_clk is the clock frequency.
• Given the total number of clock terminals N, the nominal input capacitance at each terminal c_g, the unit-length wire capacitance c_w, and the chip dimension D, and assuming an H-tree based global clock routing of h levels, C_L can be given by:

    C_L = N·c_g + c_w·L_g(h, D) + α·c_w·L_l

where the second and third terms are the global and local wiring capacitance respectively (L_g is the total wire length of the h-level global H-tree, which grows with D and h, and L_l is the total local wire length), and α is an estimation factor depending on the algorithm used for local clock routing.
• The dynamic power dissipated by the clock increases as the number of clocked devices and the chip dimensions increase.
• The global clock may account for up to 40% of the total system power dissipation.
• For low power clock distribution, measures have to be taken to reduce:
1. the clock terminal load,
2. the routing capacitance, and
3. the driver capacitance.
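The clock-power estimate P_clk = (C_L + C_d)·V_DD²·f_clk can be turned into a quick back-of-the-envelope calculation; every parameter value below is an illustrative assumption:

```python
# Sketch of the clock dynamic-power estimate P_clk = (C_L + C_d) * VDD^2 * f.
# All parameter values are illustrative assumptions.
f_clk = 1.0e9     # clock frequency, 1 GHz
VDD = 1.0         # supply voltage, 1 V
N = 50_000        # number of clock terminals
c_g = 2e-15       # input capacitance per terminal, 2 fF
C_wire = 300e-12  # global + local clock wiring capacitance, 300 pF
C_d = 50e-12      # clock driver capacitance, 50 pF

C_L = N * c_g + C_wire            # total load on the clock
P_clk = (C_L + C_d) * VDD**2 * f_clk

print(f"C_L = {C_L * 1e12:.0f} pF, P_clk = {P_clk * 1e3:.0f} mW")
```

Even with these modest assumed values the clock burns hundreds of milliwatts, which is why reducing terminal load, routing capacitance and driver capacitance matters.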
• Clock skews are the variations of delays from the clock source to the clock terminals.
• To achieve the desired performance, clock skew has to be kept within very small or tolerable values.
• Clock phase delay, the longest delay from source to sinks, also has to be controlled in order to maximize system throughput.
• As technology advances into deep submicron, device sizes shrink rapidly, which reduces the clock terminal capacitances.
• The increase in chip dimensions also makes the interconnect capacitance increasingly important.
• Because of high frequency requirements, performance-driven clock tree construction methods, such as adjusting wire lengths or widths to reduce clock skew, increase the interconnect capacitance to a more dominant part of the total load capacitance on the clock.
• Reducing the interconnect capacitance may therefore significantly reduce the overall system power consumption.
• Low power systems with reduced supply voltages require increased device sizes to maintain the necessary speed; i.e., the sizes of clock drivers are increased substantially to ensure fast clock transitions.
• This exacerbates both the dynamic and short-circuit power dissipated by the clock.
• Clock distribution is thus a multi-objective design problem: minimizing clock power consumption has to be considered together with meeting the constraints on clock skew and phase delay.
Single Driver vs. Distributed Buffers
1. Clock driving schemes
 Single driver scheme
 Distributed buffers scheme
2. Buffer insertion in clock Tree

1. Clock Driving Scheme: To ensure fast clock transitions, buffers have to be used to drive the large load capacitance on a clock. There are two common clock driving schemes:
• In the single driver scheme, shown in Figure 5.1a, a chain of cascaded buffers with a very large buffer at the end is used at the clock source; no buffers are used elsewhere.
• In the distributed buffers scheme, shown in Figure 5.1b, intermediate buffers are inserted in various parts of the clock tree.
• For power minimization, the distributed buffers scheme is preferred over the single driver scheme.
Single driver and distributed buffer scheme
• The single driver scheme has the advantage of avoiding the
adjustment of intermediate buffer delays as in the
distributed buffers scheme. Often in conjunction with this
scheme, wire sizing is used to reduce the clock phase delay.
• Widening the branches that are close to the clock source can
also reduce skew caused by asymmetric clock tree loads and
wire width deviations
• The distributed buffers scheme has been recognized to have
the advantage that relatively small buffers are used and they
can be flexibly placed across the chip to save layout area.
• For a large clock tree with a long path length, intermediate buffers (or repeaters) can be used to reduce the clock phase delay.
• Figure 5.2 illustrates the effects of wire widening and intermediate
buffer insertion on delay reduction.
• In high speed design, a long clock path as shown in Figure 5.2a can
be treated as a distributed RC delay line.
• Widening the line as shown in Figure 5.2b will make it a capacitive
line with smaller line resistance.
• By adjusting the sizes of buffers at the source, the line delay will be
reduced.
• The other way to reduce line delay is to insert intermediate buffers
along the line as shown in Figure 5.2c.
• The intermediate buffers partition the line into short segments each
of which has small line resistance. This makes the delay of the line
more linear with the length.
• While widening wires requires increasing the sizes of the buffers at the source, and hence the short-circuit power dissipation, the small intermediate buffers used to drive the short wire segments impose little penalty on power dissipation.
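The linearizing effect of intermediate buffers described above can be sketched with a simple Elmore-style model; the per-unit resistance/capacitance values and the buffer delay are illustrative assumptions:

```python
# Sketch: Elmore delay of a distributed RC line is quadratic in length,
# T ~= 0.5 * r * c * L^2 (r, c per unit length). Splitting the line into
# k buffered segments makes the total delay roughly linear in L.
# All numbers are illustrative assumptions.
r = 100.0       # wire resistance per mm, ohm/mm
c = 0.2e-12     # wire capacitance per mm, F/mm
L = 10.0        # total line length, mm
t_buf = 20e-12  # delay of one intermediate buffer, s

def unbuffered_delay(length_mm):
    return 0.5 * r * c * length_mm ** 2

def buffered_delay(length_mm, k):
    # k equal segments separated by k - 1 intermediate buffers
    return k * unbuffered_delay(length_mm / k) + (k - 1) * t_buf

print(f"unbuffered: {unbuffered_delay(L) * 1e12:.0f} ps")
for k in (2, 4, 8):
    print(f"{k} segments: {buffered_delay(L, k) * 1e12:.0f} ps")
```

With these numbers the unbuffered line takes 1000 ps while four buffered segments take roughly a third of that, illustrating why repeaters tame the quadratic RC growth.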
• Additionally, if the chip is partitioned in such a way that the clock distributed to subsystems can be disabled (or powered down) according to the subsystem functions, a significant amount of power can be saved.
• This can be done by replacing the buffers with logic gates.
• Clock buffers dissipate short-circuit power during clock transitions. The intermediate buffers also add gate and drain capacitance to the clock tree.
• However, compared to the single driver scheme, which uses large buffers at the clock source, this scheme does not introduce more short-circuit power dissipation if buffers are not excessively inserted and overly sized.
• In the meantime, both the dynamic and short-circuit power dissipation of the buffers have to be minimized.
Example of clock distribution scheme and delay
model
• For power minimization in a clock tree, the distributed buffers scheme is preferred over the single driver scheme.
• Consider the equal path-length tree and its delay model shown in Figure 5.3, where l1 = l2.
• The lengths and widths of the tree branches are l0, l1, l2 and w0, w1, w2.
• The load capacitances at sinks s1 and s2 are C_L1 and C_L2 respectively.
Elmore’s Delay Model
• Using the Elmore delay model, let branch i have resistance R_i = r_0·l_i/w_i and capacitance C_i = c_0·l_i·w_i, where r_0 and c_0 are the unit-length, unit-width wire resistance and capacitance. Since the delay through the common root branch l_0 is identical for both sinks, the skew between s1 and s2 can be derived as:

    skew(s1, s2) = R_1·(C_1/2 + C_L1) − R_2·(C_2/2 + C_L2)

• The above equation indicates that without wire width variations, skew is a linear function of path length. However, with wire width variations, the additional skew is a function of the product of path length and total load capacitance.
• Increasing the wire widths will reduce skew, but results in larger capacitance and power dissipation.
• Reducing both the path length and the load capacitance also reduces skew, and the minimum wire width can then be used so that wiring capacitance is kept at a minimum.
• Therefore, if buffers are inserted to partition a large clock tree into sub-trees with sufficiently short path lengths and small loads, the skew caused by asymmetric loads and wire width variations becomes very small.
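The Elmore skew for the two-sink tree of Figure 5.3 can be checked numerically; the geometry, width and load values below are illustrative assumptions:

```python
# Sketch: Elmore delays and skew for a two-sink tree (root branch l0
# feeding branches l1, l2 to sinks s1, s2). r0 and c0 are the assumed
# unit-length, unit-width wire resistance and capacitance.
r0 = 100.0    # ohm per mm per unit width (assumed)
c0 = 0.2e-12  # F per mm per unit width (assumed)

def branch_rc(length, width):
    return r0 * length / width, c0 * length * width

def elmore_delays(l0, w0, l1, w1, l2, w2, CL1, CL2):
    R0, C0 = branch_rc(l0, w0)
    R1, C1 = branch_rc(l1, w1)
    R2, C2 = branch_rc(l2, w2)
    downstream = C0 / 2 + C1 + CL1 + C2 + CL2  # load seen by the root branch
    t1 = R0 * downstream + R1 * (C1 / 2 + CL1)
    t2 = R0 * downstream + R2 * (C2 / 2 + CL2)
    return t1, t2

# Equal path lengths (l1 == l2) but unequal sink loads and wire widths:
t1, t2 = elmore_delays(2.0, 1.0, 1.0, 1.0, 1.0, 1.2, 30e-15, 10e-15)
skew = t1 - t2
print(f"skew = {skew * 1e12:.2f} ps")
```

Note that the common root term cancels in the skew, so only the per-branch resistances, wire capacitances and sink loads matter, matching the expression derived above.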
2. Buffer Insertion in Clock Tree

• Different buffer delays cause phase delay variations on different source-to-sink paths.
• For simplicity, the given tolerable skew of a buffered clock tree, t_s, is divided into two components: the tolerable skew for buffer delays, t_bs, and the skew allowed for asymmetric loads and wire width deviations after buffer insertion.
• To meet the skew constraint, the buffer insertion scheme should try to balance the buffer delays on source-to-sink paths, independent of the clock tree topology.
• An equal path-length tree T consists of a source s0 and a set of sinks.
• The buffer insertion problem is to find the locations on the clock tree at which to insert intermediate buffers; these locations are called buffer insertion points (BIPs).
• Clock Grids
 Use a grid on two or more levels to carry the clock
 Make wires wide to reduce RC delay
 Ensures low skew between nearby points
 But possibly large skew across the die
Buffer insertion

The buffer insertion scheme should:
1. balance buffer delays on source-to-sink paths, independent of the clock tree topology;
2. partition the clock tree into a number of levels, with BIPs determined at the cut lines.

The resulting clock tree has the following properties:
1. Each source-to-sink path has the same number of buffers.
2. All sub-trees rooted at a given level are equal path-length trees.
• This works well in a full binary tree where all sinks have the same number of levels.
• In the case of a general equal path-length tree, such as the one in Figure 5.6, different numbers of buffers are inserted on different source-to-sink paths.
• Depending on the clock tree topology, some large sub-trees may still require wire widening to reduce skew.
• Low-voltage swing is an effective technique to reduce dynamic power consumption, especially for clocks, which are among the most active signals in a VLSI circuit and generally consume up to 50% of the total power [2]. Reduced voltage-swing clock signals can be applied at the upper levels of a clock tree for low power, while clock gates (such as inverters) amplify the signals to full swing before they reach the sequential elements.
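The saving can be sketched quantitatively, assuming the reduced swing is generated from a dedicated low-voltage supply so that the energy per cycle is C·V_swing²; all numbers below are assumptions:

```python
# Sketch: dynamic power of a reduced-swing clock versus a full-rail clock,
# assuming a dedicated low-swing supply (energy per cycle = C * V^2).
C = 400e-12    # total clock net capacitance, 400 pF (assumed)
f = 1e9        # clock frequency, 1 GHz (assumed)
VDD = 1.0      # full-rail supply (assumed)
V_swing = 0.5  # reduced clock swing (assumed)

P_full = C * VDD ** 2 * f
P_low = C * V_swing ** 2 * f

print(P_low / P_full)  # quadratic saving: (0.5 / 1.0)^2 = 0.25
```

Under this assumption, halving the swing quarters the clock's dynamic power; if the low-swing node were instead charged from the full VDD rail, the saving would be linear rather than quadratic.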
