
ISSCC 2011 Tutorial Transcription

Practical Power-Delay Design Trade-offs


Instructor: Tim Fischer

1. Introduction

Okay, thank you. So, I’m Tim Fischer and I’m going to be your presenter today.

Our goal today in this presentation is to give an awareness of some of the power-delay trade-offs that we’re
making as we go through the design process, and what the solutions to these problems look like. Today’s
presentation is oriented towards high performance design, as opposed to the ultra low power tutorial which
was given a little earlier today. So let’s get started.

2. Practical Power-Delay Tradeoffs in Modern VLSI Processes

So, first we’ll briefly cover the motivation for making trade-offs in the power-delay space. Next, as a
refresher, we’ll cover power and delay components in CMOS circuits. We’ll then go through some power
metrics and

3. Agenda

examples of power-delay trade-offs. We’ll cover the usage of transistor Vt selection for improving
power-delay, a high-level overview of power gating, and then talk about system-level power management for
trade-offs in the power-delay space using hardware and software interactions.

4. Agenda

So first of all, what’s the motivation for designing for power? Is it to reduce electricity bills? Is it to reduce heat
production and simplify the cooling solutions? Is it to make it run faster? Reliability?

5. Motivation: Product Capability

So the answer there is really, all of the above, because the main motivation is improving product capability.
Power consumption directly affects the costs, the performance, the infrastructure that’s needed for the
system, and essentially the computes per unit area.

In the mobile and handheld space, we’re looking to extend the battery life of the product, simplify the
cooling, and get it into a smaller area that can be portable. In the server space, the goals are the same, but the
intent is slightly different, in that we want to add compute capacity and increase the utilization of the system
that’s there, because there are large fixed costs dominating.

6. Motivation: Cost vs. Capacity

We’ll give some examples of server costs versus utilization in a little bit. This is from an article that was
published in IEEE Micro last summer. It shows the costs in the server space versus utilization in the
server, and there are two key takeaways here.

The first one is that the slope is pretty small and there’s a large fixed cost dominating. Across the
utilization range, the fixed costs are spread out, so a small increase in utilization is a performance gain
with essentially no cost. The second piece is that the electricity, while it’s a small part of the total, is really
the most variable piece of the cost. And so as designers, we’re going to want to optimize the power at high
utilization when the system is very active and at low utilization when it’s idle, in order to maximize the efficiency.

7. Effects of Power in the Data Center

Here’s another slide that shows, on the left side, a breakdown of data center energy consumption, and on
the right side, a pie chart of total cost of ownership. On the left side, you can see that about 75% of the
energy consumption is in cooling and in just supplying the servers and storage themselves. But that energy
cost is a total of about a quarter of the total cost of ownership. So when we’re increasing the utilization
of a server, increasing the performance in a given power envelope, what we’re really doing is avoiding capital
expenditures on buying more servers, which carry that big fixed cost.

8. Importance of Energy Efficiency Across Loads

And this kind of shows the motivation behind that, in an estimate that more than 80% of servers are
running at less than 50% utilization; so getting the performance up within the same power envelope is the
right thing to do.

9. Agenda

Now we’ll review some of the components of CMOS power and delay and talk about how we can optimize
and make trade-offs between them.

10. Why Tradeoffs?

So the first question to answer is, why tradeoffs?

And the answer here is because performance is equal to power. As we try to increase the performance by
adding new features, we’re going to increase the power. And as we make circuit level power reductions, often
times we’re decreasing the performance. Some of the examples here: when we reduce VDD, obviously
it’s going to slow the circuits down; transistor Vt reductions increase the leakage; frequency reduction
obviously reduces the performance; and clock gating can add new critical paths that have to be solved.
Activity factor reduction, detecting when a unit is idle, adds new logic that’s going to consume power. And
then system level power mode transitions, which we’ll talk about in a little bit, can actually take time and result
in a performance loss.

11. Power Components

So, what are the active components of power? Total power, to a first order, is active power plus leakage
power. If we look at CMOS switching power, the active power consists of charging and discharging the
capacitance, plus overhead or wasted power due to crossover current. We’ll talk about each one of these
components in a moment.

Leakage power is the leakage current, which to first order has three main components. There’s subthreshold
leakage, which is the current that flows between the source and the drain when the transistor is in subthreshold
conduction; there’s a junction leakage current flowing between the source/drain and the substrate or body; and then
there are gate oxide tunneling currents which flow between the gate and the substrate, or the body.
The subthreshold current is typically about 3 orders of magnitude less than the transistor on-current, but the
subthreshold leakage really dominates the total leakage current in the device.
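
To make that first-order decomposition concrete, here is a minimal sketch in Python; the structure follows the breakdown above, but all the constants are illustrative assumptions, not values from the talk.

```python
# First-order CMOS power model; all numbers below are illustrative assumptions.
def total_power(c_sw, vdd, freq, af, xover_frac, i_leak):
    """P_total ~ active (switching + crossover) + leakage, to first order."""
    p_switch = af * c_sw * vdd ** 2 * freq  # charging/discharging capacitance
    p_xover = xover_frac * p_switch         # crossover modeled as switching overhead
    p_leak = i_leak * vdd                   # dominated by subthreshold current
    return p_switch + p_xover + p_leak

# Example: 1 pF switched with AF = 0.1 at 2 GHz and 1.0 V,
# 10% crossover overhead, 1 mA of leakage current:
print(total_power(1e-12, 1.0, 2e9, 0.1, 0.10, 1e-3))  # ~0.221 W
```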

12. Active Power: Capacitance and Crossover

So let’s look at the main first-order components that the designer has control over in active power: crossover
current and capacitance.

First, the crossover power is due to the current that flows between VDD and VSS when the gate is switching.

The source drain capacitance is the diode capacitance between the diffusion and the substrate. This is a
voltage variable capacitance that is going to vary as the depletion region on the diode changes with voltage.

The interconnect capacitance is a fixed capacitance of the metal between layers or between routes on the
same layer. This is a fixed capacitance, but the far end of the capacitor may not be connected to ground; it may
be connected to a switching signal. In that case, the other signal switching can inject noise on our node and it
can also affect the switching speed of our circuit.

The gate capacitance is another voltage variable capacitance. The capacitance between the gate and the
channel is going to vary as the channel charge varies when the device is switching. The capacitance between
the gate and the diffusion is due to overlap in fringe capacitances. And that will also affect our gate as the
diffusions switch.

13. Active Power: Voltage and Frequency

So let’s look at the voltage and frequency parts of the active power. The V² term is there because the charge
that’s transferred is CV, and if you look at the energy of switching that charge, it’s CV². That’s where the V²
term comes from. If you look at frequency as just an independent variable, it’s basically the number of times
that the signal switches per second, but if you look at its relationship to VDD, it’s directly proportional to
VDD. And since, in most digital systems, we’re running at the minimum VDD that supports the frequency,
or in other words at a delay-limited frequency, overall the active power is proportional
to V³. And this is really important, because VDD is the single biggest knob for controlling the active power
that the designer has.
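
As a rough numerical illustration of that cubic relationship, under the assumption that the delay-limited frequency scales linearly with VDD (only approximately true in practice):

```python
# If f_max ~ k * VDD (delay-limited operation), then P_active ~ C * V^2 * f ~ V^3.
# A 10% supply reduction therefore cuts active power by roughly a quarter:
print(0.9 ** 3)  # ~0.729, i.e. about a 27% active power reduction
```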

14. Active Power: Activity Factor

So let’s look at the activity factor, this is kind of the last factor in switching capacitance power.

So this is the number of cycles in which the signal switches, out of the total number of cycles. The activity
factor of a node is dependent on the clock activity factor and the data activity upstream, and the gates are
considered to be leaking when the circuit is off.

15. Active Power: Clock Gating Example

So let’s look at the power distribution in a clock gating example for this circuit. First, we’ve got three kind of
major activity factor regions. The clock region typically has a high activity factor; the node is switching high
and low once per cycle. The flops have an input and output activity which is gated by the clock. Each of the
nets has an activity factor that’s dependent on its input activity, and gates can reduce activity or they can
actually multiply up the activity, depending on the timing and the type of gate. For example, glitches can be
created in XOR gates, and in almost any type of gate, depending on the arrival time of the inputs. Typically the
circuits with the highest glitching activity are complex parity structures, multiplier arrays, and adders that have
a large number of muxes or XORs, which are especially prone to glitching. In those cases, the glitch power is
carefully measured; for other circuits that don’t have so many glitches, it’s often just estimated.

16. Active Power: Clock Gating Example – Power with No Gating

So here’s kind of a rollup of power for this circuit with no clock gating. The clock circuits are
shown in blue here with an activity factor of 1 for the NAND and INV. The flops have a fairly high activity
factor, and then the rest of the gates are listed down below.
factor, and then the rest of the gates are listed down below.

The nets in the logic cone are about 68% of the total power, whereas the clocks themselves are 17%, and the
flops are about 15%.

So let’s look what happens when we reduce the activity factor on the NAND and INV on the clock tree by
2/3.

17. Active Power: Clock Gating Example – Power with Clock Gating

In this case, you can see that the activity factor is 0.3 for the clock, and the activity factor of the flops has
dropped dramatically. This is often the case: when the clock gating is gating off useless switching activity,
and that useless activity is much higher when the block is not doing useful work, then there can be a very
nonlinear improvement in the activity factor, and in the power of all the nets downstream. And so most of the
power saving here comes in the datapath. Here the total activity factor is reduced by more than 7x, and the
power, I think, is just 15% of the previous slide.

So clock gating can provide very nonlinear benefits, depending on the circuit.

It’s interesting to note that the clock power and the flop power increase as a percentage of the total, just
because of the large reduction in the logic gates.
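
Here is a toy rollup in the spirit of the two slides above, showing how gating the clock activity factor collapses downstream activity nonlinearly; the activity factors and capacitance weights are hypothetical, not the slide’s numbers.

```python
# Per-net power ~ AF * C (voltage and frequency folded into the unit system).
caps = {"clk_nand": 2.0, "clk_inv": 2.0, "flops": 5.0, "logic_cone": 30.0}

def rollup(afs):
    return sum(afs[n] * caps[n] for n in afs)

af_ungated = {"clk_nand": 1.0, "clk_inv": 1.0, "flops": 0.4, "logic_cone": 0.25}
# Gating the clock AF to ~1/3 also collapses the useless data activity downstream:
af_gated = {"clk_nand": 0.33, "clk_inv": 0.33, "flops": 0.05, "logic_cone": 0.03}

print(rollup(af_gated) / rollup(af_ungated))  # ~0.18: a very nonlinear saving
```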

18. Active Power: Crossover Current

So let’s talk about crossover current, also known as bypass current or short-circuit current. This is a current
that flows between VDD and VSS when the gate switches. It’s often modelled as a percentage of the load
capacitance, as an overhead. It’s typically 10-15% for a gate, but can be much higher depending on the edge
rate and the circuit type. It’s a strong function of VDD: it increases with VDD, increases with input edge rate,
increases with the load capacitance, and it actually goes down as the Vt increases. It’s a weak function of
the β ratio and the logic type, although some logic types have very high crossover current.

Metal resistance is, interestingly, kind of a crossover multiplier, in that it speeds up output edge rates due to
resistive isolation, and it slows down input edge rates at the end of the wire. So just by inserting resistance
between drivers and receivers, we tend to get higher crossover current in the gates.

So from the designer’s point of view, they can mostly control the amount of crossover by keeping edge rates
sharp without oversizing gates, which we’ll talk about shortly.

19. CMOS Leakage Power

So in the leakage area, this consists primarily of subthreshold current for transistors that are operating in weak
inversion. Here’s an equation showing some of the major terms of subthreshold conduction. You can see that
it’s a strong function of the threshold voltage, the bias of the transistor, the temperature, the size of the transistor,
and the mobility of the devices.
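
The transcript doesn’t reproduce the slide’s equation, but a commonly cited first-order form of subthreshold conduction, consistent with the dependencies listed above, is:

```latex
I_{\mathrm{sub}} \;\approx\; \mu C_{ox}\,\frac{W}{L}\,(n-1)\,v_T^{2}\;
e^{(V_{GS}-V_{t})/(n v_T)}\,\left(1-e^{-V_{DS}/v_T}\right),
\qquad v_T = kT/q
```

The last factor is also what drives the stacked-device effect discussed two slides ahead: as the internal node rises, the effective V_DS and V_GS of the off devices drop, and the leakage falls off exponentially.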

The drain diode and the gate leakage currents are typically small enough that we ignore them for power and
delay trade-offs, although gate leakage, which was helped a lot by the adoption of high-k metal gate dielectrics
in recent process nodes, is starting to increase again and become more significant as we move forward.

So really, within the designer’s control, leakage power can be reduced by careful Vt and VDD selection, along
with the sizing of transistors.

20. CMOS Leakage Power (cont’d)

Just a quick note on stacked device leakage: when a pair of devices in a stack are both shut off, the internal
node will typically rise, or move away from the rail, due to leakage. This reduces the subthreshold leakage of
each of those devices in proportion to e^(−Vds), so it has a large effect on reducing the leakage of those devices,
and we’ll see more about that in power gating as we go forward.

21. CMOS Leakage Power – Process Considerations

And just a last comment on the effect of process on leakage. The magnitude and the ratio of subthreshold,
junction, and gate leakage are a strong function of the process: whether the process is bulk vs.
SOI, the gate dielectric, the channel length, the doping profile of the transistor, and the source/drain engineering
all affect the leakage. The designer’s toolkit to try to offset that consists of circuit topology, the channel
length, threshold voltage, and where the transistors are operating. So the designer can help to reduce those
natural process leakage impacts through circuit design.

22. CMOS Gate Propagation Delay

Let’s just briefly touch on propagation delay. Shown here are the currents for transistors operating in the saturation
region and in the linear region. You can see that the things that designers can really affect here are the
size of the transistor, the width and length, and the threshold voltage of the transistor.

Over on the right hand side are several waveforms showing the voltage and current for an inverter. The input
voltage is falling, the output voltage is rising, and the delay is measured here at the 50% point between them.
We show the current drawn from VDD as the transistor switches.

Over on the left side, we show a DC transfer curve; it’s the output voltage vs. the input voltage and the
different regions of operation for the transistors. There’s an edit here relative to the slides that you have, I
think up here where the PFET is in the linear region.

Really, the key point of this DC transfer curve is here in the middle, where the PFET and NFET are both in
saturation. This is where the bulk of the crossover current is coming from, and from the designer’s point of
view, keeping the transistors moving as quickly through that region as possible is the goal in reducing
crossover.

23. CMOS Delay and Gain

CMOS delay comes from charging and discharging capacitances, and the loads here are primarily
capacitive. So the delay is a function of the width of the transistor, the fanout, the threshold voltage, and the
resistance of the metal interconnect. Fanout, or gain, here is defined as the output capacitance over the input
capacitance. Higher fanouts lead to slower edge rates, longer delays, and increased crossover current, but
higher fanout results in smaller gates, which gives us smaller area, smaller device widths, less active
current, and less leakage current. Typical stage fanouts for efficiency are in the 4-10 range, and there’s a
lot of work that’s been put into optimizing transistor sizing vs. delay; it kind of boils down to heuristics and
guidelines, or automation, at this point.

The complexity of tuning a buffer for optimal fanout is kind of shown in reference number 9 here in the
bibliography; that’s a very long paper concerned with just tuning a buffer chain.

The logical effort techniques really simplify the tuning for critical paths, but circuit simulation and logic
synthesis are the main automated techniques right now. Of course, there are heuristics related to custom
circuit design that are typically used to size gates, according to a fanout of 3 or fanout of 4; that’s thought to
be fairly reasonable for most circuit applications.
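
As a small logical-effort-style sketch of this fanout tradeoff (identical stages, unit parasitic delay, purely illustrative numbers):

```python
# Identical-stage buffer chain from c_in to c_load: per-stage fanout is
# f = (c_load/c_in)^(1/N), and delay ~ N * (p + f) in normalized units.
def chain_delay(c_in, c_load, n_stages, p=1.0):
    f = (c_load / c_in) ** (1.0 / n_stages)
    return n_stages * (p + f)

for n in range(1, 8):
    print(n, round(chain_delay(1.0, 1000.0, n), 1))
# The minimum lands near a per-stage fanout of ~4, consistent with the
# fanout-of-3/fanout-of-4 heuristic mentioned above.
```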

24. Delay: Timing Models

I have a comment on timing models and how we’re modeling the switching of capacitance. The
capacitance is variable and complex to model. The first thing we do is take all the capacitance to non-switching
signals and just lump that to ground from our node that’s switching. But gate capacitance and
diffusion capacitance are voltage dependent and have to be modelled properly.

And for coupling capacitance between nodes that are simultaneously switching, that coincident switching
can affect delay, and it can be modelled statically or dynamically. For example, to model it statically, when two
nodes are switching together in the same direction, one might zero out the coupling capacitance between them,
and if they’re switching in opposite directions, we might multiply the capacitance by 2x, to statically bound the
effect of the aggressor on the victim node. But there are tools such as PrimeTime SI that can handle
noise-induced delay. This delay calculation is iterative, because crosstalk affects the fanout arrival times.
PTSI takes into account things like timing windows, logical exclusions between victims and aggressors, and
what happens when there’s partial overlap in timing windows.

But the main point of this is really that the timing model is not a power model, and so it’s not going to model
the total effect on the current and power of the circuit, just the timing effect.
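
A minimal sketch of that static coupling treatment (the windowed, iterative analysis is what tools like PrimeTime SI do; this is only the crude static bound):

```python
# Static bounding of coupling capacitance for delay calculation.
def effective_coupling(c_couple, aggressor_dir, victim_dir):
    if aggressor_dir == victim_dir:
        return 0.0           # same direction: zero the coupling cap (best case)
    return 2.0 * c_couple    # opposite directions: Miller-multiply by 2 (worst case)

print(effective_coupling(0.5e-15, "rise", "rise"))  # 0.0
print(effective_coupling(0.5e-15, "rise", "fall"))  # 1e-15
```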

25. Active Power Reduction Examples

So we’ll go through a couple of examples here of reducing active power, just to illustrate some of these cases.
We’ll go through a low power master-slave flop, where the slave clock is enabled by the master data being
different from the slave data; we’ll look at reducing the activity on the register file read word line, where the
incoming address is different from the previous address; and we’ll talk a little bit about efficiencies that can be
gained from integrating function-specific blocks onto a die with a general purpose CPU.

26. Active Power: Switching Activity Reduction Using Data Comparison

So on the left side is the master-slave flop, in this fairly simple example where an XOR has been added in to
enable the clock, comparing the master data to the slave data. The slave clock power is saved when the
data is the same and doesn’t need to change, but the interesting thing here is that a new timing path has been
added from the master data output, setting up to this clock edge. So this is an example where critical paths
get added, or new paths can be added, that are different from the normal data paths.

A similar thing has happened in the register file gating over on the right hand side: an XOR comparing the
addresses is simply gating the read clock, and a new path has been added from the incoming address through
the XOR, setting up to the read word line.

Both of these are really just more forms of clock gating, data-dependent clock gating, where the enable has
been pushed as far up the clock tree as possible.

27. Active Power: Integration Efficiency

So, just to talk a little bit about integration efficiency at the system or SOC level. Function-specific logic is
often created and added to a design to improve performance, and often times this has the effect of reducing
power. When we’re looking at the effect of those function blocks, we really need to look at the energy for
the operation that’s being accomplished, for the work being done. Often the number of cycles to accomplish
that operation is a key part of it, because there are fixed costs that can be associated per cycle in emulating or
performing an operation.

Then lastly, integration at the SOC level can help to reduce IO power, associated with communication
between blocks that can make the circuit more efficient.

Some of the examples are where a floating point unit is brought on chip; this was done quite a few generations ago,
but contrast that with a typical coprocessor, or emulation through a math library: even though floating
point units are considered to be the highest power units on a CPU, they’re actually fairly efficient when you
consider the power needed to get off chip to a coprocessor, or the many, many cycles of computation needed to
emulate that floating point operation with the rest of the CPU.

A serial hardware divider is another example: it might take many cycles but could still be more efficient
than an emulation.

Then, also, if you look at more recent trends toward integrating CPUs and GPUs, one of the advantages there
is obviously that we’re eliminating the IO power and performance cost required to communicate from the
CPU to the GPU. And because it’s closer and lower energy to get to the GPU, the GPU can be used for the things
that it’s more efficient at, such as offloading highly vectorized operations.

28. Agenda

So, now we’ll move on to discussing power metrics and specific power-delay tradeoffs in circuits.

29. Power Metrics

So, what power metrics, or figures of merit, are useful for comparing designs in making these kinds of
tradeoffs?

So, one of the most useful is accounting for the power and delay of a single operation. This is the power-delay
product, or energy per operation. You can see the active energy here; I’ve simplified it by taking crossover
current out of the equation, but the active power is CV²f, and multiplying by the delay, which is 1/f, we get the
energy per operation.
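
Written out, with crossover ignored as the speaker does:

```latex
\mathrm{PDP} \;=\; P_{\mathrm{active}} \times t_{d}
\;=\; \left(C V^{2} f\right)\times \frac{1}{f}
\;=\; C V^{2} \quad \text{(energy per operation)}
```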

Other metrics are useful too, such as the energy-delay product, which is weighted for delay when delay is most
important, or the power-energy product, which is weighted for power.

There are other metrics related to timing, such as total negative slack, that can be used a lot in automation to
help trade off delay improvements vs. power.
Of course, one of the drawbacks when looking to use any of these metrics is that they might lead us to the
wrong circuit choice or the wrong conclusion. For example, a higher PDP is typically worse; but in a case
where the PDP is higher because the delay is higher, it’s possible that the delay is still good
enough for the path or the application, and that circuit has lower power. In that case, the metric can lead us to
the wrong choice: picking the circuit that has the lower PDP, but not the lower power. So we have
to be kind of careful about how we use the metrics.

30. Function: PDP vs. Cell Type

So, having said that, we’ll start using PDP to compare a few cell types here, just at the gate level.

This is the power-delay product for a couple of simple CMOS gates; many of you have probably seen this
before. This is for fanout-of-4 gates, all normalized to an inverter, and it includes estimated interconnect
between the gate and its load. You see the NAND and the NOR at about 2x the PDP of an inverter, the aoi21 at
about 3x, and the aoi22 at about 4x.

And we’ll see this aoi22 more, because this is our typical muxing function, and even XOR function, in the
static complementary form that’s used across the die.

31. Vdd Optimization: PDP vs. Vdd

So, let’s look at PDP vs. VDD and how it varies. In this graph, on the y axis is PDP and on the x axis is
voltage. The first thing you notice is that it’s a bathtub curve: there’s an optimal VDD point for these
circuits. This is for an inverter, the NAND, NOR and the aoi gates. They all tend to track and scale similarly,
so their optimal voltage is pretty similar.

On the left side, and this is a slight change from the notes that you have, I’ve noted that the rising PDP
there is mostly due to the delay rolloff at low voltages. There’s also a little higher crossover current there than
you would expect, because of the edge rates slowing down.

On the right side, the PDP is increasing due to higher switching energy; we’re applying more voltage and
more current into the design, but the delay is not improving as quickly as we’d expect. The delay is saturating at
higher voltage in this process.

32. Gain: Power, Delay, PDP vs. Fanout

So let’s look in more detail at how the components of power and delay within PDP vary with fanout. On
the y axis are delay, power, and PDP, and on the x axis is fanout.

This is for an FO4 inverter chain driving a constant 1 pF load. If we start with delay, which is shown in blue,
we see that it’s increasing fairly linearly from a fanout of 1 up to a fanout of 10. At 10 we get this kink in
it. That’s kind of an artifact of this particular design: the inverter at the head of the chain, at
the very front, saturates at the minimum design width, the minimum technology width in this design,
and so the delay levels out there because the fanout is not actually increasing for that first stage.

On the green curve, you see power decreasing exponentially, and there’s kind of a knee in the curve for
fanout between 3 and 4; that’s one of the heuristics that designers are typically using for just vanilla flavoured
custom logic. And then you can see the PDP response is just the product of those two.

33. Crossover: PDP vs. Edge Rate


So, a point about edge rate and the relationship between edge rate and PDP. Here we show PDP vs. edge
rate for a fanout-of-10 buffer in red and a fanout-of-4 buffer in blue; from the previous slide,
the fanout-of-4 buffer has a higher PDP. But the interesting thing is that the size difference of 2½x
between these can be wasted by slow edge rates going into the smaller gate. So this is a case where the delay
and the power for that fanout-of-10 buffer can very quickly increase to match the fanout-of-4 buffer as
the input edge rates slow down. So this is kind of the other component of downsizing gates: we really need to
watch the edge rates, because of their effect on crossover current.

34. Total Energy

So when we look at more complex blocks, the primary thing we want to do is account for the total power and
delay of the block. And the problem is that PDP misses the leakage energy. So what we really want to
look at is the total energy consumed by that block. When we include the leakage energy here, it’s going to
be proportional to the leakage current, times the operating voltage, times the percentage of the time that the
circuit is leaking.
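
A minimal sketch of that accounting in Python, with hypothetical numbers just to show the shape of the calculation:

```python
# Total energy over an interval; PDP alone misses the leakage term.
def total_energy(e_op, n_ops, i_leak, vdd, t_total, leak_frac=1.0):
    """e_op ~ C*V^2 per operation; leakage ~ I_leak * V * time spent leaking."""
    return n_ops * e_op + i_leak * vdd * t_total * leak_frac

# 1e9 operations at 1 pJ each over 1 s, with 10 mA of leakage at 1.0 V throughout:
print(total_energy(1e-12, int(1e9), 10e-3, 1.0, 1.0))  # ~0.011 J, mostly leakage
```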

35. Total Energy vs. Activity Factor

If we look at total energy vs. activity factor, it’s a pretty simple curve. We see that as the activity factor
drops, from right to left, the circuit energies are dropping for the INV, NAND, and NOR, and the residual
energy left at 0% activity is just the leakage energy.

36. Design Tradeoffs: Virtual Ground Circuit

So now we’re going to move on and look at a couple of specific circuit tradeoffs. We’ll illustrate some of
these examples.

First, we’ll start out with a fairly simple, straightforward circuit that’s commonly used in arrays and designs
these days. It’s a case of comparing four clocked NAND gates, which are shown on the top. There’s a clock going
into the bottom stack of each gate, and the inputs A, B, C and D are all known to be one-hot, so only one of
them is going to switch at a time. In that case, we can get about the same delay by ganging together the
bottom transistors in the NFET stacks in a virtual ground configuration. In order to offset the extra
capacitance of that virtual ground wire, we typically size up the bottom NFET by 10 or 20% to account for
it.

The way that this is drawn, if it were laid out with a single NFET at one side, that would put a lot of virtual ground
resistance in series with the pull down. So by distributing that virtual ground gate, at the same total width across
the design, we can improve the speed without affecting the power.

37. Design Tradeoffs: Virtual Ground Power

So if we look at just a quick comparison of the gate capacitance: with 6μm of gate cap in the original design,
it’s now reduced by about 47%, but the effect is that we’ve added ground resistance in series with it, in
units of standard cell widths. But we can offset that added resistance by distributing the NFET across the
design in layout.

38. Circuit Styles: Pass Transistor Logic Styles

So let’s look at a variety of other circuit styles and the power-delay of those. There’s a very thorough study
done by Zimmermann and Fichtner, back in 1997, of the power-delay tradeoffs between a number of circuit
types. This is a figure from that paper that consists of mux2 circuits; there are 8 different types here.

We’ll kind of walk through these a little bit and see what the results show us.

Starting in the upper left hand corner here, this is our aoi22 gate, the complementary CMOS mux. There’s a pass
gate mux here that’s labeled CMOS+. There’s a complementary pass logic gate here, or cascode gate, CPL.
There’s an EEPL gate here; that’s energy-economized pass logic. There’s a push-pull pass logic mux, a swing-restored
pass logic mux, and a LEAP mux, I’m not sure what that acronym is; this is an active feedback PFET
to restore the logic level coming through the NFET pass gate. And then there’s a dual pass logic mux here. The
first thing that we see when we’re looking at these is that the circuits on the bottom of this figure tend to have
issues related to either FET count, process and supply scaling, or just poor gain compared to the other
circuits. So that makes them not useful for general purpose muxing functions across the design.

EEPL is pretty similar to CPL, it’s a little less efficient, so out of these eight, we really end up with three
candidates for good general purpose muxing functions from a delay and power point of view.

So let’s look at some results.

39. Circuit Styles: CMOS vs. CPL (cont’d)

These are the results from that same paper, which was comparing a number of different gate functions as well.
We’ll first look at another gate function, a NAND4. In this table, the NAND4 function is implemented
either with a NAND4 and an INV, in the first line, or a NAND-NOR decomposition, each with two inputs, in the
second line, which is a pretty interesting comparison.

What we see is that the NAND4-INV has less area, but the same power and a lot more delay, so
overall the PDP of the NAND-NOR decomposition is actually a lot better than the NAND4-INV. So in
cases where we want to minimize the area, we go with the NAND4-INV; in a case where we really want
to emphasize power, or power-delay, we go with the NAND-NOR.

The next one is the mux case that we were just looking at. Here, the first line is the aoi22, the second line is the
pass gate mux, and the third line is the CPL or cascode gate.

We see that the pass gate mux has lower power, but it’s slower, and so it’s got a worse PDP than the
plain complementary mux gate.

40. Power Tradeoffs: Adder Algorithm Overview

The CPL gate is between the two in terms of speed,

41. Circuit Styles: CMOS vs. CPL (cont’d)

but it’s a lot higher power, so the PDP is much worse. So the case here is that the aoi22 is actually a pretty
good general purpose muxing circuit for most cases. And in some cases, the cascode can provide some speed
benefits, especially in more complex muxing and parity structures, but the aoi22 is the better general purpose
gate.

And when we look at the XOR function using CPL, it bears out the same decision,

42. Power Tradeoffs: Adder Algorithm Overview


as that CPL is faster at high voltage,

43. Circuit Styles: CMOS vs. CPL (cont’d)

but it’s much higher power than the CMOS gate. So again, it can be useful in specific applications,
especially with complex parity and muxing functions, but going with the aoi22 is the best balance of power
and delay for general purpose use.

44. Power Tradeoffs: Adder Algorithm Overview

So now we’re going to move on to looking at power tradeoffs in more complex logic blocks. We’re going to
go through adders. My goal here isn’t to really delve into the specifics of adders or make us all adder designers,
because I’m really not an expert adder designer, but the goal here is to look at the variety of power and delay
tradeoffs, because they’re especially rich in adder design.

First of all, an adder is computing the sum of inputs A and B, and it does that by decomposing the sum
into a function of, essentially, the XOR of the propagates with the carry-in of each bit. This is kind of the premise
of the prefix adder. The carry propagation is the adder critical path, but it’s very parallelizable,
because the carry terms can be computed in parallel, can be combined in an order-independent way, and
their terms can overlap.

An adder algorithm is essentially defined by its carry propagation graph, which trades off the relationship between the
radix, or the number of carry bits that are merged per logic level, and the number of logic levels you need to
get to the worst bit in the adder.

There are also tradeoffs between the wiring complexity, or the number of carry wires that are going through
each bit, and the logical fanout, or the number of bits that are driven by the carry.

Some examples of different algorithms are Kogge-Stone, Han-Carlson, and Ladner-Fischer, which are all
variants that we’re not going to go into the details of, but you’ll see them in upcoming slides.

There are other things that affect the adder topology: internal buffering, and how block-level terms such as carries
and so forth are computed at the block level or the bit level. There are differences in the way that carries are
generated, as for carry select, or transformed, such as into a pseudo-carry in the Ling transformation, and there are
hybrid trees on non-critical bits. In these cases, either carry redundancy or sparseness can be used to reduce
the area and the power of the adder by simplifying or changing around the carry graph for the noncritical
bits. In some cases, the power saved on the noncritical sections can then be given to
upsizing gates in critical paths, so that the overall delay of the adder can be improved within a given power
budget.
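
To make the prefix-adder premise concrete, here is a minimal software sketch of a Kogge-Stone-style parallel-prefix add; it is bit-level Python for clarity, not a hardware description, and the point is just that the carry-merge operator is associative, so the carries resolve in log2(n) levels.

```python
# Kogge-Stone-style parallel-prefix addition (illustrative, bit-level).
def kogge_stone_add(a, b, n=64):
    g = [(a >> i & 1) & (b >> i & 1) for i in range(n)]  # generate per bit
    p = [(a >> i & 1) ^ (b >> i & 1) for i in range(n)]  # propagate per bit
    d = 1
    while d < n:  # log2(n) carry-merge levels; radix 2, span doubles each level
        g = [g[i] | (p[i] & g[i - d]) if i >= d else g[i] for i in range(n)]
        p = [p[i] & p[i - d] if i >= d else p[i] for i in range(n)]
        d *= 2
    carry = [0] + g[:n - 1]  # carry into bit i is the group generate below it
    s = 0
    for i in range(n):       # sum = XOR of propagate with carry-in, per bit
        s |= ((a >> i & 1) ^ (b >> i & 1) ^ carry[i]) << i
    return s

assert kogge_stone_add(123456789, 987654321) == 123456789 + 987654321
```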

45. Adder Design Tradeoffs

So, some other design tradeoffs. We talked about algorithm and topology. From a clocking point of view, there
are dynamic adders, which have very low delay but obviously incur clock power, and static adders, which avoid
the clock power and have better low-voltage scaling characteristics than the dynamic adders, but are obviously
slower. The radix implies the stack height in the carry merge, and this sets the gate complexity whose tradeoffs
we looked at previously.

Differences between gate-dominated delay and RC-dominated delay affect how an adder scales across voltage.

And then redundancy trades power for speed internally and fanin and fanout can affect how transistors are
sized along the critical paths.

46. Adders: Energy Efficiency

So, this is a very complex slide from a 2003 study of adder energy-delay tradeoffs by Vojin Oklobdzija and
co-authors, and I thank them for letting me use it here. We could probably spend the entire day on this one
slide, since it shows a lot of really interesting tradeoffs, but I just want to point out a couple of key concepts
that illustrate what we’ve been talking about here.

First, this graph shows the energy vs. delay for a number of different adder designs. The adder designs are
grouped along curves by the algorithm that they’re based upon. You see the Kogge-Stone here, you
see the Han-Carlson here, and then the curves are grouped into static adders down here on the lower right
side, dynamic here, and then hybrid dynamic-static adders here.

So, to pick out a few of the key takeaways here: the first thing you notice is that this graph is comparing total
energy vs. delay; as we said, that’s really the right way to compare these complex logic blocks.

The second thing is that static is lower energy than dynamic per unit delay. The static logic is also
slower than the dynamic for a given energy budget. But we see that a pretty good compromise can be made
using this hybrid dynamic-static adder.

When you look at the structure of this, the dynamic adder consists of two dynamic stages, each a dynamic
pull-down plus inverter, and the dynamic hybrid is a dynamic pull-down stage into a complex aoi gate on the
output.

The last thing is that the adder algorithms provide very different design points that have to be explored. It’s
not obvious from just considering the adder algorithm which one is the lowest power, or which one is the
fastest; it depends on many factors and on how the adders are put together.

47. Energy Efficiency: Adder Algorithm Comparison (cont’d)

Here’s a 2007 energy-delay comparison study by Sun and Sechen at the University of Texas at Dallas. In this
study, they were comparing total energy across a number of different adder topologies, and again, there are many
interesting results from this study.

But for our purposes, again, the first thing is that comparing total energy and total delay are the best metrics for
looking at these blocks. We see that we’ve got an area column now, and that the energy is generally
tracking with area: the higher the area, the higher the energy.

Also, we see that there are large differences in the energy and delay of adders, even ones that have the same
number of logic levels. Here we see the fastest adder at 8 levels, and further down, we see the highest power
adders, also at 8 levels.

The lowest energy adders are the slowest, the highest energy adder is the fastest, and the best adder for us is
obviously the one that meets our speed targets but has the lowest energy and fits in our available area
budget.

48. Energy Efficiency: Adder VDD Scaling Example

So, just a last note on energy efficiency. This is an adder design with a single algorithm, but it’s been designed
to operate at a number of different delay points, and then the simulation is done across a range of voltages
to show how the energy and delay scale.

On the left side, I think one of the most interesting things is that the energy is going up exponentially as we
approach the 10-12 fanout-of-4 gate delay range. For us, that’s very significant, because in modern processors
the pipeline is built around single-cycle execution of integer ops, and a 64 bit adder is basically the key to
that. So this is the really fundamental cycle time limiter for modern processors.

On the right side, the really interesting part is that, if you look at the design being optimized at different voltages,
the design is fairly clearly gate limited, because you see that a higher voltage design is actually lower
energy than a lower voltage design. So again, it’s the balance between RC delay and gate delay in
the circuit that dictates these characteristics, so we really have to look across the energy-delay space to make
the right tradeoffs.

49. Agenda

50. Design Tradeoffs: Transistor Vt

So now I’ll switch gears and talk a little bit about transistor Vt and its role in reducing leakage and improving
our power-delay tradeoffs.

So modern processes provide 2 or more transistor threshold levels for designers to use. For example, they
might be called low, regular and high Vt. These threshold levels often have 50 mV or so of separation, and
there are typically 3-5X leakage step increases as we progress to lower Vt. The lower Vt transistors reduce
the delay and the device width needed for a given speed, but obviously the leakage and crossover
increase dramatically. And the power efficiency of those tradeoffs really depends on the activity factor of
the circuit. There are a couple of levels at which we can apply Vt optimization in our digital circuits: we can do it
at the circuit level, the path level, or the block level.
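
The activity-factor dependence can be seen with a toy energy-per-cycle comparison; the numbers below are hypothetical, just reflecting the 3-5X leakage step mentioned above.

```python
# Energy per cycle = switching energy (scaled by activity) + leakage over the cycle.
def energy_per_cycle(af, e_switch, p_leak, t_cycle):
    return af * e_switch + p_leak * t_cycle

# Same function, same cycle time; LVT carries ~4x the leakage of HVT:
e_hvt = energy_per_cycle(af=0.02, e_switch=1.0, p_leak=0.01, t_cycle=1.0)
e_lvt = energy_per_cycle(af=0.02, e_switch=1.0, p_leak=0.04, t_cycle=1.0)
print(e_hvt, e_lvt)  # at low activity, leakage dominates and HVT wins
```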

51. Circuit Vt Optimization: PDP vs. Vt

So let’s take a quick look at PDP vs. supply for the different Vts. This is for a single inverter, and it shows
the same bathtub shape we saw before for the range of circuits. What’s different now is that there’s a
different optimal voltage point for each one of these Vts. You can see this especially pronounced for
the high Vt gate, where the delay is dramatically increasing below about two Vts, at the low end of the scale. The low
Vt gates are better for scaling down into that range and so have much better PDPs. And then all of them are
increasing on the right side, due to the higher switching energy not really giving us as much return in
delay.

52. Vt Selection in Dynamic Circuits

So let’s talk a little bit about how to select Vt in dynamic circuits. These are wide-fanin, high-gain
circuits that are typically doing a wide logical OR, and the gate trip point is proportional to Vt. The pull
down NMOS is typically ratioed against the keeper PMOS: the keeper has to be sized to offset all
of the NFETs hanging off of that bit line, or dynamic node, so the keeper size is a function of the
width of those NMOSs and the number of them.

So typically we end up using high Vt and low leakage transistors here in order to keep the speed up and
improve the noise margin overall.

So with the keeper and high Vt overhead, how can we improve the delay of the dynamic circuits? Well there
are three different keeper structures that we can choose from that might help here.

53. With Keeper and High-Vt Overhead, How Can We Improve the Delay of Dynamic Circuits?

There’s the traditional feedback keeper, there’s a delayed keeper scheme, and we can use cross-coupled
keepers when we have dual-ended read. Just go forward and take a look at these circuits real quick before
we discuss them.

54. Improving Dynamic Circuit Delay (cont’d)

So, in the traditional feedback keeper, there’s a PMOS that’s holding the node high, and we’re ratioed against
it when we’re pulling down, until the output switches.

In the delayed keeper scheme, we insert a PMOS in series with that, so that we don’t have to fight that PFET
when the circuit is evaluating, and then we turn the PFET keeper on after a small amount of delay in order to
hold the node through the rest of the cycle. And we can take advantage of the cross-coupled keeper scheme when we
have a dual-rail read, where we cross-couple the PFETs; both of those are shut off when the circuit is
precharged, and then one turns on when the other side evaluates.

So the traditional feedback keeper is drawing crossover current during pull down, and it’s really the bit line
leakage that limits the AC speed.

55. Improving Dynamic Circuit Delay (cont’d)

In the delayed keeper scheme, that’s self-timed so we don’t have to

56. With Keeper and High Vt Overhead

ratio against the keeper when we are evaluating and leakage doesn’t limit the AC speed.

With the cross-coupled keepers, those are faster than the single-ended read anyway, because we’ve got true and
complement that we can use downstream, but also because, again, the leakage isn’t limiting the AC speed
there. However, this does have 2X the average power of the single-ended read, so we wouldn’t be choosing
it just to eliminate the keeper. And it does need careful gating.

57. Improving Dynamic Circuit Delay (cont'd)

So, I mentioned MTCMOS, multi threshold CMOS.

58. Multi-Threshold Voltage CMOS (MTCMOS)

In this scheme, the intent is to insert sleep transistors that are high Vt in series with a logic block that’s
implemented in low Vt. The idea here is that we’ve created these virtual VDD and VSS rails with which we can
isolate the circuit when it’s idle, so that we get the benefits of the low Vt from a speed point of view, and we
get the benefits of the high Vt from a leakage point of view.

In practice, the circuit in this exact configuration isn’t usually used. Having this many sleep transistors, and
having to route the sleep wire to all these separate devices, results in wiring complexity and a lot of sleep
transistor switching power.

Also, we find that the sleep transistor can usually be shared between many logic blocks, because of the naturally
low activity factors in the logic; we really don’t need separate sleep transistors in every gate. We’ll apply this
a little bit more in power gating as we go forward.

59. Mixed-Vt (MVT) CMOS

So, MVT CMOS is a very different beast than MTCMOS. In this scheme, we’re inserting low Vt and high Vt
transistors inside of a gate at strategic points, based on the critical paths, or arcs, that go through that gate. We
use low Vt on the critical timing paths, and we use high Vt on the noncritical paths. And this contrasts with the
so-called DVT, or dual Vt, scheme, where gates all of one Vt or the other are inserted in critical and noncritical
paths respectively.

In this study, there are two MVT schemes, called MVT1 and MVT2. In MVT1, the authors used all the same
Vts within the pull up network and within the pull down network. Then in the second scheme, the Vts within
the pull up network could be different thresholds, and the Vts within the pull down network could be
different thresholds.

60. Mixed-Vt (MVT) CMOS (cont’d)

Here’s a table of results for their study. In this study, there were three different insertion algorithms, in which
the Vts were tuned for area, delay, and so forth. This table shows the mapping for area and the leakage power
savings. This column is the leakage power savings relative to DVT, which we’ll talk about in a minute; the MVT
schemes here both had significant leakage power benefits over that DVT scheme.

In this study, a 32 bit adder was also able to achieve a 20% leakage savings over gate-level DVT. So clearly,
MVT can be an advantage.

61. Final MVT CMOS Considerations

There are a few considerations besides just leakage and area. The advantages are that we can
reduce leakage and improve gain and area, but when we start inserting these Vts in these gates, we end up with
a large range of custom gates, and within these gates, depending on how the transistors are ratioed, we can end
up with a wider β ratio range, which leads to reduced noise margin. With the unequal rise and fall delays, it’s a
little more difficult for the designer to do an efficient job of inserting these, so automation is typically needed
in order to get the most out of this. And since the beta ratio can shift a little bit with differences in how
the Vt devices are manufactured, the beta ratio and the noise margin can shift at manufacturing time.

So some alternatives would be to consider upsizing an HVT gate to be approximately the same delay,
incurring the active power hit, getting a leakage power benefit, and getting less manufacturing shift. But this is
very process dependent, and depends on which Vts you’re comparing for that gate. So a lot of careful tradeoffs
need to be considered for the specific process as to whether to use this and where to apply it.

62. Path Timing-Based Vt Selection

So, let’s talk about path-based Vt selection. In this scheme, we select gates of different thresholds based on
the path timing. We’ll insert LVT gates along critical paths and HVT gates along noncritical paths, and we
either let the designers do it manually, or we’ll have some automated gate Vt selection. The gate selections
occur after timing convergence, and then we can even use synthesis to do sizing and gate selection at the same
time.

63. Timing-Based Vt Selection Methods

So, within these selection methods, I talked about manual Vt selection; we could do swaps based on the
amount of slack in the paths after convergence, or we could use synthesis. If we do the post-timing swap
method, we need to converge with a base Vt type. We could either converge with LVT, which we need to hit
our most critical paths, and then opportunistically swap and reduce leakage on the noncritical stuff; or we
could do our best to hit timing with HVT, and then just use LVT where it is needed in order to get the final
piece of speed out of the design.
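
A hedged sketch of that post-timing swap flow (the gate names and data structures here are hypothetical; a production flow would re-time after each batch of swaps):

```python
# Converge with LVT, then opportunistically swap gates to HVT where the slack
# can absorb the added delay, keeping a guard band against Vt manufacturing shift.
def swap_to_hvt(gates, slack, hvt_penalty, margin=0.05):
    """gates: names; slack[g]: worst path slack through g (ns);
    hvt_penalty[g]: added delay if g becomes HVT (ns)."""
    swapped = []
    for g in sorted(gates, key=lambda g: -slack[g]):  # most slack first
        if slack[g] - hvt_penalty[g] > margin:
            swapped.append(g)  # real flows re-run timing as swaps accumulate
    return swapped

print(swap_to_hvt(["u1", "u2"], {"u1": 0.30, "u2": 0.02}, {"u1": 0.10, "u2": 0.10}))
```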

Typically, optimizing the design around HVTs can result in the lowest power, since structurally we’ll be
changing the RTL design, and thereby the number of gates per cycle, in order to meet timing with the HVT base.

This is an example of the differences in the ability to use low Vt or high Vt FETs in different block types. Data
paths typically have more critical paths and more instances of them, so when they’re implemented, they tend
to be able to use fewer low-leakage devices compared to random control blocks. And again, with all of this, the
amount of power efficiency and leakage benefit we get really depends on the activity factor of that block,
which can be quite different between a data path and a random control section.

64. Block-Level Vt Optimization

And just a few notes about block level Vt optimization. In this case, we’re typically designing or synthesizing
a block at the gate level and doing what’s called a multi-mode optimization, where we’re optimizing for power
and delay pretty much simultaneously. In order to do this, we need a cell library that’s characterized for
delay and accurately characterized for power: the active, the leakage, and the crossover power. We need to
specify an activity factor that’s representative of our workload in order to do the optimization. And ideally, as
the result coming out of the synthesis process, we end up with the lowest power design that meets our timing
goals at that activity factor.

65. Multiple -Vt Optimization Challenges

There are a few challenges here, though. The activity factor can vary pretty widely across the traces, so
unless we have a good representative set of activity factors for our workload, we might end up with a design
that’s optimized for a pretty small range. Also, transistor Vts can shift with manufacturing, so designers
have to ensure that the noncritical paths, where we’re using high Vt, don’t become critical in silicon.
As we saw in that one PDP chart, the HVT gates can roll off at a really high rate of change at low voltage
or with small manufacturing variations. So we have to be careful: it’s not just the magnitude of their delay,
but the rate of change where we’re operating, that determines whether they become critical or not. Maybe you need
more margin on some of those high Vt gates.

Also, just noting that once we’ve optimized the design, if we then make very invasive changes, we may need to
reoptimize the design in order to get the most out of it.

66. Agenda

So now we’ll move on and look at how power gating is used to reduce leakage power.

67. Power Gating Considerations

So power gating is a method of reducing idle state leakage current across large sections of the design. This
method uses sleep transistors to gate the supplies that feed idle functional blocks. And successful
implementation of power gating requires tradeoffs in these areas. Basically, how much leakage power can be
saved in a block, the latencies of transitions between the sleep and wake states, the overhead power that’s
needed to switch the sleep signal, there can be frequency loss associated with the IR drop across the power
gating device, and there can also be performance losses associated with transitioning between the active and
sleeping states.

There’s an area overhead associated with power gating devices. And then also, there’s a need to functionally
isolate the IOs of sleeping blocks from active blocks so that we get the correct functional operation.

68. Power Gating Basic Idea

These slides are all borrowed from Steve Kosonocky’s presentation from last year in the Low-Power Digital
tutorial. So the basic power gating idea is that we isolate these functional logic blocks with a sleep transistor,
either a footer switch or header switch. When the block is active, we turn on the
gating device; when the block is idle, we put it in the sleep state and turn off the device.

69. Power Gating Modes (Active)

During the active mode, we can take advantage of the naturally low activity factors inside the logic block, so
that the gating device can be a lot smaller than we might think it needs to be. This helps to reduce the
resistance requirements of that power gating device and reduce the area overhead of that device.

The objective here is to minimize the resistance of the sleep device. The overhead of the sleep
device, in terms of frequency impact, is its on-resistance times the maximum current through it, which gives us a
voltage drop that directly reduces the voltage available for running the logic block. That results in an
increased logic delay, which we can offset by increasing the supply, but at a higher active power cost.
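
In numbers, the active-mode overhead is just an IR drop (the values here are made up for illustration):

```python
# IR drop across the sleep device reduces the effective supply of the block.
r_on = 0.010   # ohms: effective on-resistance of the shared gating device
i_max = 5.0    # amps: worst-case current drawn by the gated block
v_drop = r_on * i_max
print(v_drop)  # 0.05 -> 50 mV less supply, which shows up as added logic delay
```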

70. Power Gating Modes (Sleep)

During sleep mode, in steady state, we really want to maximize the resistance of the cut-off device. Here we’re
using it to reduce the leakage current, and reductions of 10-1,000x are possible. But we have to be a little bit
careful when we turn off that device, in that we can create di/dt events. If we’re consuming a large
amount of current and we cut off that device, going to a small sleep current over a short period of time, we can
create ringing on the supply that can create long term reliability problems or damage devices.

71. Power Gating Modes (Wake)

Likewise, during wake up mode, we may need to throttle the gating device. When the device comes out of the
sleep mode and starts charging up, it’s not just the virtual rails of the supply that are charging; there’s a lot
of internal logic capacitance that needs to be charged up. That’s been discharged to some fraction of VDD during
sleep mode, and it has to be charged back up to near VDD in order to start using the block in active
mode again.

72. Comparing Power Gating to Clock Gating

So, let’s look at how the long term power reduction compares to the fraction of the time the device is active.
On the right side of this graph are more active blocks; on the left side are blocks that are sleeping or idle for
longer periods. And in the centre, there’s really a breakeven point for using power gating.
Conceptually, on the right side, if the block is not sleeping for long enough periods, then we can’t recoup the
overhead needed to switch the sleep transistor in order to get into sleep mode and back out. In these
cases, there would actually be a power loss associated with sleeping; with switching the sleep
transistor costing more than it saves, we would want to consider other types of power reduction for that idle
mode, such as clock gating.

On the left side, when we get a positive long term power reduction, then we would go into sleep mode.

73. Energy Break Even Point

So, this slide goes through a little more detail around the mechanics of transitioning from active into sleep
mode, and then back to active, and how that affects the breakeven point.

So in this slide in the graph, this red line here is the voltage on the virtual VDD, the blue line is the aggregate
saved energy in the circuit, and the green line is the overhead energy associated with the sleep devices.

So let’s walk through the sequence of time steps here. At time T0, on the left side of this graph, the logic
block becomes idle, and then at point T1, the control circuit detects that and decides to switch the power gate.
At T2, the power gating signal propagation is complete, and you see that the overhead associated with
switching it has been incurred. At that point, the voltage on the virtual rail starts discharging, saving
energy in response. At T3, we reach the energy breakeven point, where the aggregate energy saved is equal
to the total energy that will be needed to come back out of sleep mode, plus the overhead already
incurred. At T4, the supply is discharged to its final point. At T5, the control circuit detects that the circuit
needs to be powered back up. At T6, the power gating signal has been completely propagated to all the sleep
transistors; you can see that the overhead associated with switching them has been incurred, and the
virtual VDD starts charging quickly toward the rail. Then at T7, the gated logic is fully charged up to the rail and is
ready for new activity.
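
A back-of-the-envelope version of that breakeven point (T3 in the walkthrough), with hypothetical numbers:

```python
# Sleeping only pays off if the idle period exceeds the breakeven time.
def t_breakeven(e_enter_exit, p_leak_saved):
    """Energy overhead to enter and exit sleep, divided by leakage power saved."""
    return e_enter_exit / p_leak_saved

# 5 uJ of total transition overhead against 50 mW of leakage savings:
print(t_breakeven(5e-6, 0.05))  # 1e-4 s -> idle periods shorter than ~100 us lose
```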

So that’s just a brief overview of power gating.

74. Agenda

Now we'll move on to talk a little bit about power management at the system level. We're really going to
switch gears here, because in all of the circuit- and hardware-oriented techniques that we've talked about so
far, we weren't really including any software management.

In system-level power management, we typically invoke software to control the processor, using cues from
the hardware.

75. System-Level Power Issues

So in order to talk about this, we’re going to give a few definitions here of system level power.

First, average power is the average power dissipation across the die for a given benchmark or set of
benchmarks. Thermal design power (TDP) is the maximum power that can be dissipated by the heat sink or
cooling solution while maintaining the temperature within spec for an indefinite period of time, at a
given ambient temperature for a given heat sink. Thermal design current (TDC) is the current drawn from
the power plane during operation at TDP. TDP is really what the designer needs to use to design the
cooling solution and to evaluate electrical robustness, and it's what determines reliability.

A power envelope is the power dissipated as a function of frequency and voltage as a part operates across its
range.
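
As a quick illustration of how these definitions relate, thermal design current follows directly from TDP and the supply voltage at that operating point. The numbers below are assumed:

```python
# Relating TDP to thermal design current; values assumed for illustration.

TDP = 95.0   # thermal design power (W), assumed
VDD = 1.1    # supply voltage at the TDP operating point (V), assumed

TDC = TDP / VDD   # current drawn from the power plane at TDP
print(TDC)        # ~86 A: what the power delivery network must sustain
```
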
76. System-Level Power Issues

So in system-level power management, we want to manage the power of the processor dynamically as a
function of the workload, the TDP or power budget, and the performance requirements, where performance
may be throughput, response time, or both.

Some of the methods for this are hardware-based, like the ones we've already discussed. The advantages are
that hardware control is very predictable, very fast, and generally invisible to software. The drawbacks are
that it tends to have fairly small optimization windows and limited flexibility in adapting to changing
conditions.

Some examples of hardware control are clock gating and power gating, which we've talked about, or thermal
protection, which has much simpler behaviour: it might throttle the device down when the temperature
of the device reaches a critical point. That's an example of very limited flexibility and adaptability,
but applied to a critical function.

With software control, back in the 1990s for example, power management functions started to be integrated
into the BIOS. In this case, the operating system would work with the BIOS to implement power
management. But because the BIOS and the power features it implemented were different for each
platform, it was difficult for the operating system to take full advantage of them. The complexity was
limited, and the flexibility for dynamic power management was really limited.

If you fast-forward to today, there are standards such as ACPI, which stands for Advanced Configuration and
Power Interface, where power management has migrated completely to the operating system. Now
there's an abstract hardware/software interface, which defines power states, the transitions between them,
and the communication between hardware and software for managing power.

This enables a very wide range of solutions and some very complex dynamic tradeoffs between power and
performance.

77. Software Power Management: ACPI Overview

So let's take a deeper look at ACPI as an example of software power management. ACPI
defines states, transitions, and this interface between hardware and software. The operating system is
able to use those to control the power of the part by passing it through these various states based on global
inputs. The states defined are power states (C-states), performance states (P-states), and clock throttling.
Transitions between these states can take some time: for example, frequency changes might require a
PLL relock, and supply changes might incur a power gating startup latency like we discussed.

Different hardware implementations provide different features within ACPI. For example, the
P-state operating points (voltage, frequency, and power) might be very different, or the number and
behaviour of the C-states can be different. Those all get communicated to the operating system, specific to
that implementation.

78. ACPI Continued

So processor power states, or C-states, are labelled C0 through Cn. C0 is defined as the active state, where
instructions are being executed. C1 through Cn are sleep states, or halt states, where no instructions are
being executed. C1 through Cn are typically progressively lower power and generally have higher
entry and exit latencies associated with them. But the number, the latencies, and the power of those C-states
are all hardware features that can be specific to an implementation.

C0 and C1 are required, but the rest are optional. There are really no limits on the number, latency, and
power of those states; they all get declared to the operating system through this abstract description, so
that the operating system can take advantage of what's available.
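
Here is a sketch of how an operating system might use that declared table. The states, latencies, powers, and margin factor below are hypothetical, not taken from any particular implementation; the idle governor simply picks the deepest state it can amortize:

```python
# Hypothetical C-state table as it might be declared to the OS.
# (state, exit latency in us, power in W) -- all values assumed.
C_STATES = [
    ("C1", 1,   2.0),   # halt
    ("C2", 50,  0.8),   # lower-power halt, caches stay coherent
    ("C3", 500, 0.2),   # caches not snooped, cache state retained
]

def pick_c_state(predicted_idle_us):
    """Choose the deepest (lowest-power) sleep state whose exit latency
    is comfortably shorter than the predicted idle interval."""
    best = "C1"
    for name, exit_latency_us, _power in C_STATES:
        if predicted_idle_us > 3 * exit_latency_us:  # margin factor, assumed
            best = name
    return best

print(pick_c_state(10))    # C1: too short to amortize deeper states
print(pick_c_state(5000))  # C3: long idle, take the deepest state
```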

Typical definitions of the sleep states are: C1 is a halt state; C2 is a lower-power halt state in which cache
coherency is maintained; and C3 is a still lower power state in which caches aren't kept coherent but the
state of the cache is maintained.

79. ACPI Continued

Processor clock throttling is a mechanism by which the operating system can reduce the performance
and power during active operation, fairly linearly. It occurs in the active processor state C0, where the
effective clock frequency is reduced along with processor performance.

Processor performance states, or P-states, are labelled P0 through Pn. These all occur within the
active processor state C0. P0 is defined as the maximum performance state; it's also likely the
maximum power state. P1 through Pn have progressively lower power and lower performance, and these states
again are defined by hardware features. The operating system uses these states to operate the processor
across its envelope.
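
Below is a minimal sketch of how an OS governor might walk such a P-state table based on utilization. The table entries and thresholds are hypothetical, and power is modelled with the usual CMOS dynamic term, roughly proportional to V squared times f:

```python
# Hypothetical P-state table: (state, frequency GHz, voltage V), assumed.
P_STATES = [
    ("P0", 3.2, 1.20),   # maximum performance (and likely maximum power)
    ("P1", 2.4, 1.05),
    ("P2", 1.6, 0.95),
    ("P3", 0.8, 0.85),
]

def relative_power(f_ghz, v):
    """Dynamic power scales roughly as C * V^2 * f (constants folded away)."""
    return v * v * f_ghz

def pick_p_state(utilization):
    """Simple demand-based policy: high utilization earns a higher P-state.
    Thresholds are assumed, purely for illustration."""
    if utilization > 0.80:
        return P_STATES[0]
    if utilization > 0.50:
        return P_STATES[1]
    if utilization > 0.20:
        return P_STATES[2]
    return P_STATES[3]

name, f, v = pick_p_state(0.30)
print(name, relative_power(f, v) / relative_power(3.2, 1.20))
# P2 runs at half the frequency for roughly a third of P0's dynamic power
```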

80. Example C-State and P-State Relationships

Here's a cartoon to show the relationship between C-states, P-states, and clock throttling. The
y-axis is increasing power and the x-axis is increasing performance.

You can see that processor state C0, the active state, is where all of the performance states are defined.
P0 is our maximum performance state, down through P3. I tend to think of the contour through P0 to P3
as what I usually think of as the voltage-frequency curve for the processor.

In each one of the performance states, clock throttling can be used while the processor is active to reduce
its performance and power fairly linearly.
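
To see why throttling is only linear, while dropping to a lower P-state saves more, here is a small comparison at equal performance. All operating points are assumed for illustration:

```python
# Clock throttling vs. voltage/frequency scaling at the same performance.
# Values assumed; power model is the usual dynamic term ~ V^2 * f.

def dynamic_power(v, f_eff):
    return v * v * f_eff   # constants folded away

V0, F0 = 1.2, 3.2          # full-speed operating point (V, GHz), assumed

# Half performance via throttling: same V and f, 50% effective duty cycle.
p_throttle = dynamic_power(V0, F0 * 0.5)

# Half performance via a P-state: half frequency at a reduced voltage.
p_pstate = dynamic_power(0.95, F0 * 0.5)

print(p_throttle / dynamic_power(V0, F0))  # 0.50: linear reduction
print(p_pstate   / dynamic_power(V0, F0))  # ~0.31: the V^2 term wins
```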

Down below the axis, I'm showing the sleep states at a kind of negative performance, because not
only is the processor not executing instructions, but there can be a latency associated with coming out of
those states. Then of course, those states have progressively lower power, both active and leakage, as we
go deeper.

81. OS-Based Power Management

So in operating system power management, the operating system controls the power and performance
states based on the performance and workload requirements of the processor, the power envelope the
processor is operating within, and even some diverse criteria, such as the temperature of the part. This
enables very complex tradeoffs in multi-core designs and heterogeneous processors. For example, we can
perform dynamic voltage and frequency scaling based on the workload. We can trade off power between
heterogeneous processors, such as CPUs and GPUs, while staying within a given power envelope. And this
is also what enables processor boost modes: for example, when a single core is operating in a multi-core
package and the other cores are sleeping, that single core can use a much higher power budget, as a fraction
of the total, to operate at higher performance.
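
A sketch of that boost-mode budget arithmetic follows; the TDP split and per-core figures are hypothetical, purely to show the mechanism:

```python
# Illustrative boost-mode budgeting for a multi-core package; assumed values.

PACKAGE_TDP = 95.0   # W, assumed
N_CORES     = 6
UNCORE      = 20.0   # W reserved for caches, memory interface, etc., assumed
SLEEP_POWER = 0.5    # W per core in a deep C-state, assumed

def per_core_budget(active_cores):
    """Split the TDP left over after uncore and sleeping cores
    among the cores that are actually running."""
    sleeping = N_CORES - active_cores
    budget = PACKAGE_TDP - UNCORE - sleeping * SLEEP_POWER
    return budget / active_cores

print(per_core_budget(6))  # 12.5 W per core with all cores active
print(per_core_budget(1))  # 72.5 W: one core can boost well past its share
```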

82. Putting It All Together: Bulldozer CPU Power Breakdown

So, let's take a look at a modern processor and some of its scaling across applications. This is an example
from the upcoming Bulldozer paper. It shows the percentage of clocks firing across several applications: a
maximum-power application, a typical application, and idle. The fraction of clocks across the entire die
that are firing at peak power is about 25%, and you can see it scaling down to about 4% at idle.

This also shows the normalized median power breakdown between the various units; you can see flops, logic
gates, and clocks here. What's interesting is that the leakage power is about 24% at maximum power, and it's
about the same absolute power at idle, except that now it's more than 60% of the total. And it's not until we
go into processor power state C6 that we drop the idle power down to about 5% of its peak.
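
Working through the arithmetic implied by those numbers makes the point vivid: if leakage is about 24% of peak and the same absolute leakage is more than 60% of idle power, idle power without power gating is still roughly 40% of peak.

```python
# Back-of-the-envelope check on the leakage fractions quoted above.

P_peak        = 1.00    # normalize peak power to 1.0
leak_fraction = 0.24    # leakage share at max power (from the slide)
P_leak        = leak_fraction * P_peak

# If that same leakage is ~60% of the idle total, idle power is:
P_idle = P_leak / 0.60
print(P_idle)   # 0.40: idle still burns ~40% of peak without power gating

# Entering C6 power-gates the cores, dropping idle to ~5% of peak:
print(0.05 / P_idle)   # C6 cuts the remaining idle power by ~8x
```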

83. Processor Power Through Time

This is a historical graph of power and performance over time. I think the most interesting line on
this graph is the per-core TDP. If you look back at the 2003-2004 time frame with single-core processors,
each one of those cores could use up to the total TDP of the part. Now that we're at 6- and 12-core
processors, we're down to about 10 watts or less per core, and at the same time, we need to increase the
performance of each core. So our design space is getting very squeezed.

84. Power-Delay Tradeoffs

So, just to summarize: power efficiency is really what's giving us increased silicon capability. How do we
take advantage of that? First of all, by running at the lowest VDD that we can.

Next, by optimizing the number of gates per cycle and the circuit type for each cycle. That increasingly
means using static complementary gates, which give us efficiency and robustness. We want to improve
circuit gain by selecting the right circuit types and sizing them optimally.

We want to keep our edge rates fast in order to control crossover current, reduce activity factor through clock
and functional gating, and reduce leakage through careful Vt selection and power gating.

And lastly, we want to use dynamic hardware/software tradeoffs to control the CPU operating envelope.

85. Bibliography and References

So, this is a list of references that are in the paper.

86. Acknowledgements

And I’d like to acknowledge the contributions of these people to this presentation and to the data shown in
here.

So that’s the end of the presentation, and thank you for your time.
