How to provide a power-efficient architecture

It's anything but a smooth ride for processors with many challenges including the amount of energy consumed per logic operation. Here's what can be done to reduce total power consumption.

By Bob Crepps, Intel Corporation
Power Management DesignLine (07/24/2006 11:02 AM EDT)

Microprocessor performance has scaled over the last three decades from devices that could perform tens of thousands of instructions per second to tens of billions for today's products. There seems to be no limit to the demand for increasing performance. Processors have evolved from super-scalar architecture to instruction-level parallelism, where each evolution makes more efficient use of a fast single-instruction pipeline. The goal is to continue that scaling and reach a capability of 10 tera-instructions per second by the year 2015. However, there are many challenges on the way to that goal.

Figure 1: Processor Performance

As semiconductor process technology advances at the rate predicted by Moore's Law, effects that could be ignored in previous process generations are having an increasing impact on transistor performance, and ways to deal with those effects must be found.

Consider the Technology Outlook chart below. As indicated by "High Volume Manufacturing," Intel is using a 65nm process technology and can integrate up to 4 billion transistors on a single die. Look ahead just 5 years and the process technology will be 22nm, with integration capacity growing to 16-32 billion transistors! But note also that certain aspects will not scale as they have in the past. Delay (CV/I) has been scaling at a rate of about 0.7 per process step, and that rate of scaling will decrease. The amount of energy consumed per logic operation will scale less. The probability that we will use bulk planar transistors as we do today will decrease; new transistor technologies such as strained silicon, high-k dielectrics, metal gates and tri-gate structures will be used. And variability, the difference in performance measured between "identical" transistors on the same die, will increase. Moore's Law is alive and well, but there are new challenges ahead as process technology nodes shrink.

Figure 2: Technology outlook

We have developed a number of techniques to control these factors. Among them are several techniques for power reduction, including leakage control and active power reduction. In addition, power consumption can be reduced by using special-purpose hardware, multi-threading, chip multi-processing, and dual- and multi-core processors. These techniques are described in more detail below. The Power Technology Roadmap shows which technique will intercept each process technology step.

Figure 3: Power Reduction Technology Roadmap

Leakage control

The ideal logic transistor should act like a switch: when it's off, no current flows, and when it's on, current flows with very low loss. As the insulating layers of transistors become thinner, leakage current increases. Following the current trend, leakage power could soon reach 50% of active power; such transistors act much less like switches and more like dimmers. There are several techniques to help reduce leakage power. One is called Body Bias: when the transistor is off, a bias voltage is applied to the body of the transistor, reducing the leakage current by a factor of 2-10X. Another technique is the Stack Effect: instead of a single transistor, two transistors stacked together perform the switching function, reducing leakage currents by a factor of 5-10X. A third technique is Sleep Transistors, which can be used to isolate, or disconnect from the power supply, blocks of logic when they are not needed. For example, a floating-point unit can be turned off until called. Sleep Transistors can reduce leakage by a factor of 2-1000X, depending on the size of the logic block. Note that in all of these techniques, leakage power is reduced by using more, not fewer, transistors. It seems counter-intuitive, but it takes advantage of the higher transistor integration capability.
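The leakage-reduction factors quoted above can be put into a quick back-of-the-envelope model. This is only an illustrative sketch: the wattages are hypothetical, and the only numbers taken from the text are the ~50%-of-active leakage trend and the quoted reduction factors (2-10X body bias, 5-10X stack effect, 2-1000X sleep transistors).

```python
# Simple model: total power = active power + leakage power, where a
# leakage-control technique divides the leakage term by its reduction factor.

def total_power(active_w, leakage_w, leakage_reduction=1.0):
    """Total power in watts after scaling leakage by 1/leakage_reduction."""
    return active_w + leakage_w / leakage_reduction

active = 100.0   # watts (hypothetical chip)
leakage = 50.0   # watts: the ~50%-of-active worst case noted in the text

baseline = total_power(active, leakage)             # 150.0 W, no mitigation
with_body_bias = total_power(active, leakage, 10)   # 105.0 W (best-case 10X)
with_sleep = total_power(active, leakage, 1000)     # ~100.05 W (idle block)

print(baseline, with_body_bias, with_sleep)
```

The point of the model is the asymmetry: once leakage approaches half of active power, even a modest reduction factor recovers a large fraction of the total budget.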

Figure 4: Leakage Control

Active Power Reduction

We can extend the use of more transistors to active power reduction as well. By using multiple supply voltages and dividing logic circuits into faster and slower sections, the slower sections can use a lower supply voltage and consume less active power. For example, an ALU, which must operate at high speed, is connected to the higher supply voltage, while a sequencer can be connected to the lower voltage. Replicated design is another very powerful technique. Consider the figure below. We start with a single logic block whose operating frequency, supply voltage, power, die area and power density are all normalized to 1 unit, and whose throughput is 1. If we replicate that block and adjust the operating conditions, we can reduce the active power consumption. Reducing the supply voltage to 0.5, with a corresponding reduction in operating frequency to 0.5, reduces the total power to 0.25. The area must increase to 2; combined with the reduced power consumption, the power density is 0.125. The throughput per block is 0.5, so the total throughput is still 1. The power has been reduced by a factor of 4 at the same throughput: more transistors equals less power for the same performance.
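The replicated-design arithmetic above follows directly from the usual CMOS dynamic power model, P ∝ C·V²·f. A minimal sketch, with all values normalized to the single-block baseline of 1 (the model and variable names are ours, not from the article):

```python
# Active power model: P ~ V^2 * f, with block capacitance folded into the
# normalization (baseline block at V=1, f=1 has power 1).

def active_power(voltage, frequency):
    return voltage**2 * frequency

# Baseline: one block -> power 1, throughput 1, area 1.
base_power = active_power(1.0, 1.0)

# Replicated design: two blocks, each at V=0.5, f=0.5.
block_power = active_power(0.5, 0.5)      # 0.125 per block
total_power = 2 * block_power             # 0.25 for both blocks
total_area = 2.0                          # area doubles
power_density = total_power / total_area  # 0.125
throughput = 2 * 0.5                      # each block at half rate -> 1.0

print(total_power, power_density, throughput)  # 0.25 0.125 1.0
```

These are exactly the figures in the text: one-quarter the power and one-eighth the power density for the same total throughput.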

Figure 5: Active Power Reduction

Special-Purpose Hardware

So far, we've talked about using multiple but identical logic blocks or cores, but there are other ways to scale performance, such as special-purpose cores. In the chart shown below, one of the curves is labeled "GP MIPS @ 75W". This represents general-purpose MIPS (millions of instructions per second) over time. The data was derived from various processors used to service a saturated Ethernet link, with processor power normalized to 75 Watts. The curve labeled "TOE MIPS @ ~2W" comes from a special-purpose device: this TOE, or TCP Offload Engine, is a test chip made to process TCP traffic. The chart clearly shows that these specialized MIPS consume much less power to perform the specific function than general-purpose MIPS require to do the same task. The die shot shows that the TOE is a very small die and requires relatively few transistors. The high performance per watt of special-purpose hardware will find applications in network processing, multimedia, speech recognition, encryption and XML processing, to name a few.

Figure 6: Special Purpose Hardware

Code-named Paragon1

All of these techniques sound good, but can they really be implemented? The lab designed a prototype test chip called Paragon, a special-purpose test circuit for digital signal processing. The design was executed for the highest energy efficiency at the target performance level, rather than the highest obtainable performance (operating frequency). Paragon incorporated Body Bias and Sleep Transistors to control leakage power, and used dual VT with a tiled architecture for fast and slow data paths to reduce active power. The end result was a filter accelerator that operates at 110 giga-operations per watt. Paragon uses just 9 milliwatts of total power at 1GHz throughput as a 16x16 single-cycle multiplier. It achieved a leakage power reduction of 7X compared to previous designs, with 75 microwatts of standby leakage and 540 microwatts of active leakage. Overall performance is 3X more giga-operations per watt than previously known designs.

Dual and Multi-Core

We've looked at tera-scale at the logic level and the thread level; now let's look at multiple cores. First, consider this rule of thumb, derived from power, voltage and frequency, taking both active and leakage power into account: a 1% change in voltage requires a corresponding 1% change in frequency. Together they cause roughly a 3% change in power (power varies as a cubic function of matched voltage and frequency changes), while performance changes by about 0.66%.

Assume we have a single core with cache whose voltage, frequency, power and performance are normalized to 1. Now replicate the core and share the cache between the two cores. Next, reduce the voltage and frequency by 15%, so that the power for each core is about 0.5 and the total for the two cores is 1. According to the rule of thumb, the performance will be 1.8 for two cores consuming the same power as the single core. Next, consider two cores compared to multiple cores. Start with a large core and cache that consumes 4 units of power and delivers 2 units of performance. Compare that to a small core that has been scaled to use 1 unit of power with a corresponding performance of 1 unit. By combining 4 of the small cores, the total power equals that of the large core, or 4 units, and the total performance is 4 units, twice that of the large core for the same power.
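The rule-of-thumb arithmetic can be sketched in a few lines. All values are normalized; note that the cubic model gives 0.85³ ≈ 0.61 per core, which the ~0.5 figure in the text rounds aggressively. The function names and the linear 0.66%-per-1% performance approximation are our rendering of the rule, not a formula from the article.

```python
# Rule of thumb: a matched voltage+frequency change of 1% moves power ~3%
# (cubic) and performance ~0.66% (approximated here as linear).

def power_factor(vf_scale):
    return vf_scale ** 3                       # cubic power dependence

def perf_factor(vf_scale):
    return 1.0 + 0.66 * (vf_scale - 1.0)       # ~0.66% perf per 1% frequency

scale = 0.85                                   # 15% voltage/frequency cut
per_core_power = power_factor(scale)           # ~0.61 per core
per_core_perf = perf_factor(scale)             # ~0.90 per core
dual_core_perf = 2 * per_core_perf             # ~1.80 total

print(round(per_core_power, 2), round(dual_core_perf, 2))
```

Two slowed-down cores thus land close to the article's headline numbers: roughly the original power budget, for about 1.8X the scalar performance.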

Figure 7: Single- to Dual-Core

Power and Scalar Performance

Historically, increasing frequency was a major factor in increasing performance. As the total power consumed by microprocessors grew with increasing frequency, the need to find other ways to scale performance became clear. In the chart below, note that if we factor out all the contributions to performance due to process technology (frequency, delay scaling, and so on), it's clear that scalar performance has increased at about one-fourth the rate of the power increase. This increasing energy per instruction (EPI) has been the trend over the last 15 years. The exception is the Pentium M processor, which was designed to have a much lower EPI than previous versions; it shows scalar performance much closer to 1:1 with increasing power. Is there more to be done to control EPI?

Figure 8: Power and Scalar Performance

An Energy per Instruction Throttle2

Researchers began looking for ways to control EPI. There are several ways to do this, each with different advantages and ranges of control. Changing voltage and frequency is one method: by lowering voltage and frequency, power is reduced, as previously discussed. This method has a practical EPI control range of 1:2 to 1:4 and a response time of about 100 microseconds to ramp the supply voltage. Another technique is the variable-sized core, where processor resources are reduced to reduce energy consumption. This method has an EPI control range of 1:1 to 1:2 and a response time of 1 microsecond (to fill a 32KB L1 cache). A third technique is speculation control. When speculative execution is used, each missed speculation wastes energy and raises overall EPI, so by limiting speculation, EPI can be improved. This allows an EPI range of 1:1 to 1:1.4 and has a response time of 10 nanoseconds due to the latency of the pipeline. We chose the voltage and frequency method for our experiments, as it offers the widest control range.
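The three control knobs and their quoted ranges can be summarized in a small lookup, which also makes the selection criterion explicit. The dictionary layout is our own; the range and response-time figures are the ones quoted in the text.

```python
# EPI-control techniques: control range ("1:x" stored as x) and
# response time in seconds, as quoted in the article.
epi_knobs = {
    "voltage/frequency scaling": {"range": 4.0, "response_s": 100e-6},
    "variable-sized core":       {"range": 2.0, "response_s": 1e-6},
    "speculation control":       {"range": 1.4, "response_s": 10e-9},
}

# The experiments picked the knob with the widest control range,
# trading away response time.
widest = max(epi_knobs, key=lambda k: epi_knobs[k]["range"])
print(widest)  # voltage/frequency scaling
```

The table makes the trade-off visible: the knobs with the fastest response (nanoseconds for speculation control) offer the narrowest control range, and vice versa.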

Figure 9: Energy per Instruction

The experiment was intended to measure power savings and performance by varying voltage and frequency on a multi-way system. Various Linux benchmarks were used to measure performance via the wall-clock run-time of each benchmark. Linux is well suited to this experiment because of its ability to assign thread affinity, that is, to assign a thread to a processor. The benchmarks represent a combination of single- and multi-threaded applications. The experimental platform was a 4-way Xeon processor system operating at 2GHz, with 2MB of L3 cache, 4GB of main memory and three Ultra 320 disk drives. The platform was operated in four modes:

1. CPU at 2GHz as a baseline; power and performance normalized to 1.0.
2. CPUs operating at 1.5GHz; power normalized to 1.12 compared to baseline, performance normalized to 1.06.
3. CPUs operating at 1.25GHz; power normalized to 1.17, performance normalized to 1.08.
4. CPUs operating at 1GHz; power and performance normalized to 1.0.

The run-times of the 2-CPU and 3-CPU configurations were adjusted to make power exactly the same as the baseline and 4-CPU configurations. For multiple-CPU configurations, the platform was operated in two modes: Symmetric Multi-processing (SMP), where all processors run at the same speed, and Asymmetric Multi-processing (AMP), where processors run at different speeds.

Benchmark Results

The benchmark results (wall-clock run times) fell into three categories: The 4-CPU SMP and AMP performed equally well.

The AMP configurations achieved significant speedup over SMP. The AMP and 4-CPU SMP performed worse than the baseline configuration. The intuitive explanation is that during the sequential or single-thread phases of each application (and even highly threaded applications have sequential phases), the 4-CPU configuration wastes power, as only one CPU is in use. The 1-CPU configuration, on the other hand, is unable to exploit the available thread-level parallelism in applications, so performance suffers. The AMP configuration continuously varies EPI with thread-level parallelism to optimize power and performance.

Figure 10: Speeds on AMP

Why and When is AMP Better?

To understand the advantages of AMP, we used the following methodology:

* First, for each benchmark, the percentage of run-time spent on sequential and parallel portions was computed.
* Second, the run-times were compared, both as measured on the AMP prototype and as projected onto an ideal AMP system.

The results cluster into three categories. First, for applications that are mostly parallel, the SMP configuration gives the best performance. Second, for applications that are mostly sequential, the 1-CPU configuration gives the best performance. Third, for applications with a mix of sequential and parallel work, AMP gives the best overall performance. Remember, the operating conditions were set so that each configuration consumes the same amount of power. In the graph below, the results are shown for each benchmark, for SMP and AMP against the baseline system. Performance is measured as wall-clock run time, so lower is better. You can see that an AMP configuration of one CPU running at higher speed and three at lower speed (1+3) gives performance very close to the best of any configuration over the range of applications, and so the best overall EPI.

Figure 11: Why and When AMP is Better

Conclusion

Moore's Law is alive and well, and our ability to integrate more transistors onto a single device will continue to scale well into the future. However, as process technology advances, we face new challenges that require new techniques for controlling active and leakage power. We can use more transistors to achieve higher performance at lower power, as in the case of special-purpose hardware. Multi-processing will allow performance to scale while maintaining or reducing total power consumption, and will enable more efficient energy-per-instruction usage. To learn more about Moore's Law and enabling energy-efficient instruction usage, go to: http://www.intel.com/technology/eep/index.htm?ppc_cid=c96

References:

1. "Paragon: A 110GOPS/W 16b Multiplier and Reconfigurable PLA Loop in 90nm CMOS," IEEE, ISSCC 2005, Session 20.
2. "Mitigating Amdahl's Law Through EPI Throttling," IEEE, 0-7695-2270-X/05.

About the author: Bob Crepps is a ten-year veteran of Intel. He has worked as an engineer in Motherboard Products and in Industry Enabling for technologies such as USB, AGP and PCI Express. His current role is technology strategist for the Microprocessor Technology Lab in the Corporate Technology Group. Before Intel, he was an analog design engineer and information systems engineer for fifteen years at various technology companies in the Northwest. Bob lives and works in Hillsboro, Oregon. bob.a.crepps@intel.com
