FUTURE TRENDS

In an updated version of Agerwala's July 2004 keynote address at the International Symposium on Computer Architecture, the authors urge the computer architecture community to devise innovative ways of delivering continuing improvement in system performance and price-performance, while simultaneously solving the power problem.
Tilak Agerwala Siddhartha Chatterjee IBM Research
Computer architecture forms the bridge between application needs and the capabilities of the underlying technologies. As application demands change and technologies cross various thresholds, computer architects must continue innovating to produce systems that can deliver needed performance and cost effectiveness. Our challenge as computer architects is to deliver end-to-end performance growth at historical levels in the presence of technology discontinuities. We can address this challenge by focusing on power optimization at all levels. Key levers are the development of power-optimized building blocks, deployment of chip-level multiprocessors, increasing use of accelerators and ofﬂoad engines, widespread use of scale-out systems, and system-level power optimization.
To design leadership computer systems, we must thoroughly understand the nature of the workloads that such systems are intended to support. It is, therefore, worthwhile to begin with some observations on the evolving nature of workloads. The computational and storage demands of technical, scientific, digital media, and business applications continue to grow rapidly, driven by finer degrees of spatial and temporal resolution, the growth of physical simulation, and the desire to perform real-time optimization of scientific and business problems. The following are some examples of such applications:
• A computational fluid dynamics (CFD) calculation on an airplane wing of a 512 × 64 × 256 grid, with 5,000 floating-point operations per grid point and 5,000 time steps, requires 2.1 × 10^14 floating-point operations. Such a computation would take 3.5 minutes on a machine sustaining 1 trillion floating-point operations per second (1 Tflops). A similar CFD simulation of a full aircraft, on the other hand, would involve 3.5 × 10^17 grid points, for a total of 8.7 × 10^24
floating-point operations. On the same 1-Tflops machine, this computation would require more than 275,000 years to complete.1
• Materials scientists currently simulate magnetic materials at the level of 2,000-atom systems, which require 2.64 Tflops of computational power and 512 Gbytes of storage. Future investigations involving some 10,000-atom systems will require 100 Tflops of computational power and 2.5 Tbytes of storage (http://www.zurich.ibm.com/deepcomputing/parallel/projects_cpmd.html). Current investigation of electronic structures is limited to about 1,000 atoms, requiring 0.5 Tflops of computational power and 250 Gbytes of storage. In the future, simulation of a full hard-disk drive will require about 30 Tflops of computational power and 2 Tbytes of storage (http://www.zurich.ibm.com/deepcomputing/).
• Digital movies and special effects are yet another source of growing demand for computation. At around 10^14 floating-point operations per frame and 50 frames per second, a 90-minute movie represents 2.7 × 10^19 floating-point operations. It would take 2,000 1-Gflops CPUs approximately 150 days to complete this computation.
• Large amounts of computation are no longer the sole province of classical high-performance computing. New workloads, such as delivery and processing of streaming and rich media, massively multiplayer online gaming, semantic search, and national security, are increasing the demand for numerical- and data-intensive computing.

Such applications also contribute to the drive for improved performance and more cost-effective numerical computing. Applications continue to drive the growth of absolute performance and cost-performance at the historical level of an 80 percent compound annual growth rate (CAGR). This rate shows no foreseeable slowdown. If anything, application demands will grow even faster, perhaps a 90 to 100 percent CAGR, over the next few years.

Another growing workload characteristic is variability of demand for system resources, both across different workloads and within different temporal phases of a single workload. There is an industry trend toward continual optimization: rapid and frequent modeling for timely business decision support in domains as diverse as inventory planning, risk analysis, workforce scheduling, business intelligence, customer relationship management, and chip design.2 Important business and scientific applications demonstrate similar variability. Figure 1 shows an example of variable and periodic behavior of instructions per cycle (IPC) in the SPEC2000 benchmarks bzip2 and art. Designing computer architectures to adequately handle such variability is essential.

A third important characteristic of many workloads is that they are amenable to scaling out. A scale-out architecture is a collection of interconnected, modular, low-cost computers that work as a single entity to cooperatively provide applications, systems resources, and data to users. Scale-out platforms include clusters, high-density, rack-mounted blade systems, and massively parallel systems. On the other hand, conventional symmetric multiprocessor (SMP) systems are scale-up platforms. Many important workloads are scaling out. Enterprise resource planning, Web serving, streaming media, and science/engineering computations are prime examples of scale-out workloads. However, some commercially important workloads, such as online transaction processing, are difficult to scale out and continue to require the highest possible single-thread performance and symmetric multiprocessing. We will discuss later how different workload characteristics can drive computer systems to different design points.

As a community, computer architects must make a concerted effort to better characterize applications and environments to drive the design of future computing platforms. This effort should include developing a detailed understanding of applications' scale-out characteristics, developing opportunities for optimizing applications across all system stack levels, and developing tools to aid the migration of existing applications to future platforms.

Published by the IEEE Computer Society • 0272-1732/05/$20.00 © 2005 IEEE
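The arithmetic behind the workload examples above is easy to verify. This back-of-envelope sketch recomputes the wing and movie figures using only values quoted in the text:

```python
# Back-of-envelope check of the workload examples quoted above.
wing_grid = 512 * 64 * 256                 # grid points on the airplane wing
wing_ops = wing_grid * 5_000 * 5_000       # flops per point x time steps
wing_minutes = wing_ops / 1e12 / 60        # on a sustained 1-Tflops machine

movie_ops = 1e14 * 50 * 90 * 60            # flops/frame x fps x seconds
movie_days = movie_ops / (2_000 * 1e9) / 86_400   # on 2,000 1-Gflops CPUs
```

The quoted 2.1 × 10^14 operations, 3.5 minutes, 2.7 × 10^19 operations, and "approximately 150 days" all fall out directly.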
Figure 1. Variability of instructions per cycle (IPC) in SPEC2000: IPC over the entire execution for benchmark bzip2 (a) and a 1-second interval from 31 to 32 seconds for the art benchmark (b).2 (Copyright IEEE Press, 2003)

Technology
Even as application demands for computational power continue to grow, the basic silicon technology is running into some major discontinuities as it scales to smaller feature sizes.

Dennard et al.'s scaling theory, which initially stated the CMOS device scaling rules,3 is based on considerations of active (or switching) power, the dominant source of power dissipation when CMOS device features were large relative to atomic dimensions. Dennard et al. predict that scaling of device geometry, process, and operating-environment parameters by a factor of α will result in higher density (~α^2), higher speed (~α), lower switching power per circuit (~1/α^2), and constant active-power density. In the past several years, however, in our pursuit of higher operating frequency, we have not scaled operating voltage as required by this scaling theory. As a result, power densities have grown with every CMOS technology generation.

As CMOS device features shrink, additional sources of passive (or leakage) power dissipation are increasing in importance. There are two distinct forms of passive power:
• Subthreshold leakage is a thermodynamic phenomenon in which charge leaks between a MOSFET's source and drain. This effect increases as device channel lengths decrease and is also exponential in turn-off voltage, the difference between the device's power supply and threshold voltages.
• Gate leakage is a quantum tunneling effect in which electrons tunnel through the thin gate dielectric. This effect is exponential in gate voltage and oxide thickness.

Although technology scaling delivers devices with ever-finer feature sizes, power dissipation is limiting chip-level performance, making it more difficult to ramp up operating frequency at historical rates. When we study operating frequencies of microprocessors introduced over the last 10 years and projected frequencies for the next two to three years, it is clear that frequency will grow in the future at half the rate of the past decade. In the near future, therefore, chip-level performance must result from on-chip functional integration rather than continued frequency scaling.
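The constant-power-density claim can be checked directly from the scaling exponents quoted above. The sketch below encodes the ideal rules, plus what happens when, as the text notes, voltage is not scaled. It uses the standard C·V^2·f expression for switching power and the textbook assumption that capacitance scales as 1/α; the latter is an assumption for this illustration, not a figure from the article:

```python
# Ideal (Dennard) scaling by a factor alpha vs. scaling without lowering Vdd.
def active_power_density(alpha, scale_voltage=True):
    density = alpha ** 2                  # devices per unit area ~ alpha^2
    speed = alpha                         # operating frequency ~ alpha
    cap = 1 / alpha                       # gate capacitance ~ 1/alpha (assumed)
    vdd = 1 / alpha if scale_voltage else 1.0
    power_per_circuit = cap * vdd ** 2 * speed    # C * V^2 * f
    return density * power_per_circuit            # per unit area

ideal = active_power_density(2.0)                       # stays constant
no_vdd_scaling = active_power_density(2.0, scale_voltage=False)
```

With α = 2, ideal scaling leaves active-power density unchanged, while holding Vdd fixed quadruples it; that is exactly the generation-over-generation growth the text describes.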
The implication of the growth of passive power at the chip level is profound. Although scaling allows us to grow the number of devices on a chip, these devices are no longer "free"; that is, they leak significant amounts of passive power even when they are not performing useful computation or storing useful data, and ultimately they adversely affect system cost and cost-performance.

Chip-level power is already at the limits of air cooling. Liquid cooling is an option being increasingly explored, as are improvements in air cooling. But, in the end, all heat extraction and removal processes are inherently subexponential. They will thus limit the exponential growth of power density and total chip-level power that CMOS technology scaling is driving.

CMOS scaling results in another dimension of complexity: it affects variability. The critical dimensions in our designs are scaling faster than our ability to control them, and manufacturing and environmental variations are becoming critical. Such variations affect both operating frequency and chip yield.4 The implications of such variability are twofold: We can either use chip area to obtain performance, or we can design for variability. The industry is beginning to use both approaches to counteract the increasing variability of deep-submicron CMOS.

We faced a similar situation two decades ago, when the heat flux of bipolar technology was similarly exploding beyond the effective air-cooling limits of the day. However, there was a significant difference between that situation and the current one: We had CMOS available then as a mature, low-power, high-volume technology. We have no other technology with similar characteristics waiting in the wings today. Technologists are making many advances in materials and processes, but computer architects must find alternate designs within the confines of CMOS.

Challenge
We face a gap. We need 80-plus percent compound growth in system-level performance, while frequency growth has dropped to 15 to 20 percent because of power limitations. The computer architecture community's challenge, therefore, is to devise innovative ways of delivering continuing growth in system performance and price-performance while simultaneously solving the power problem.

The shift in focus implied by this challenge requires us to optimize performance at all system stack levels (both hardware and software), constrained by power dissipation and reliability issues. Rather than riding on the steady frequency growth of the past decade, system performance improvements will increasingly be driven by integration at all levels, together with hardware-software optimization. Opportunities for optimization exist at both the chip and system levels.

The term power is often used loosely in discussions like this one. Depending on context, the term can be a proxy for various quantities, including energy, maximum power, average power, power density, instantaneous power, and temperature. These quantities are not interrelated in a simple manner, and the associated physical processes often have vastly different time constants. The investigation of these issues requires appropriate methodologies for evaluating design choices, and the evaluation methodology must accommodate the subtleties of the context. The following discussion illustrates such a methodology; readers should focus less on the specific numerical values of the results and more on how the results are derived.

Microprocessors and chip-level integration
Chip-level design space includes two major options: how we trade power and performance within a single processor pipeline (core), and how we integrate multiple cores, accelerators, and off-load engines on chip to boost total chip-level performance.

Power-performance optimization in a single core
Let us consider an instruction set architecture (ISA) and a family of pipelined implementations of that ISA parameterized by the number of pipeline stages or, equivalently, the depth in fan-out of four (FO4) of each pipeline stage. (FO4 delay is the delay of one inverter driving four copies of an equal-sized inverter. The amount of logic and latch overhead per pipeline stage is often measured in terms of FO4 delay. This implies that deeper pipelines have smaller FO4 delays.) The following discussion also fixes the circuit family and assumes it to be one of the standard static CMOS circuit families.
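Before turning to detailed results, a deliberately simple model of such a design family illustrates the trade-off. Every number below is invented for this sketch (the logic depth, latch overhead, hazard penalty, and energy coefficients are not the article's simulation data), so the absolute optima differ from the article's figures; the structural point is that a performance-only metric and a cubed-performance-per-watt metric pick different pipeline depths:

```python
# Toy pipeline-depth model with invented coefficients (illustration only).
LOGIC_FO4 = 100   # assumed total logic depth per instruction, in FO4 units
LATCH_FO4 = 3     # assumed latch/clock overhead per stage, in FO4 units
HAZARD = 0.04     # assumed growth of stall penalties per added stage
E0, E1 = 1.0, 0.2 # assumed energy per instruction: E(s) = E0 + E1*s

def time_per_instr(s):
    cycle = LOGIC_FO4 / s + LATCH_FO4      # FO4 delay of one stage
    cpi = 1.0 + HAZARD * s                 # hazards cost more in deep pipes
    return cycle * cpi

def perf(s):                               # a bips-like metric
    return 1.0 / time_per_instr(s)

def perf3_per_watt(s):                     # a bips^3/W-like metric
    power = (E0 + E1 * s) * perf(s)        # energy/instr x instr/time
    return perf(s) ** 3 / power

depths = range(5, 41)
best_perf = max(depths, key=perf)           # deep pipeline wins on speed
best_pp = max(depths, key=perf3_per_watt)   # power-aware optimum is shallower
```

With these made-up coefficients the two optima land at 29 and 13 stages; the ordering, not the values, is the point: weighting power pushes the design toward fewer, longer stages, just as the bips versus bips^3/W comparison in the following discussion shows.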
Figure 2 shows plots of power-performance as a function of pipeline depth for a family of pipeline designs. The number of pipeline stages increases from left to right along the x-axis, and the y-axis shows normalized behavior for some agreed-upon workload and metric of goodness. The curve labeled "bips" (billions of instructions per second) plots performance for the SPEC2000 benchmark suite as a function of pipeline stages and shows an optimal design point of 10 FO4 per pipeline stage. Performance drops off for deeper pipelines as the effects of pipeline hazards, branch misprediction penalties, and cache and translation look-aside buffer misses play an increasing role. The curve labeled "bips^3/W" measures power-performance efficiency; the term bips^3 per watt is a proxy for (energy × delay^2)^-1, a metric commonly used to quantify the power-performance efficiency of high-performance processors. There are two key differences between this curve and the performance-only curve:
• The optimal design point for the power-performance metric is at 18 FO4 per pipeline stage, corresponding to a shallower pipeline.
• The falloff past this optimal point is much steeper than in the case of the performance-only curve, demonstrating the fundamental superlinear trade-off between performance and power.
The power model for these curves incorporates active power only. If we added passive power to the model, the optimal power-performance design point would shift somewhat to the right of the 18 FO4 design point (because combined active and passive power increases less rapidly with increasing pipeline depth).

Figure 2. Power-performance trade-off in a single-processor pipeline: bips, bips/W, bips^2/W, bips^3/W, and IPC versus total FO4 per stage.5 (Copyright IEEE Press, 2002)

Once these designs are committed to silicon and fabricated, it is possible to determine whether they meet the chip-level power budget. Figure 3 plots the same information in a different manner, again for SPEC2000, with performance decreasing from left to right on the x-axis and power increasing from bottom to top on the y-axis. FO4 numbers of individual design points appear on the curve, and the pipeline organization with the best bips^3/W value is defined as 1.0. Suppose that the 12 FO4 design, which delivers high performance (at a high power cost), exceeds the power budget, shown as the horizontal dashed line in the figure. Options exist, even at this stage of the process, to trade performance and power by reducing either the operating voltage (shown in the "Varying VDD and η" curve) or the operating frequency (the "Reducing f" curve). Either choice could return this design to an acceptable power budget, but at a significantly reduced level of single-core performance. On the other hand, suppose that the less-aggressive 18 FO4 design, which is optimal for the power-performance metric, comes in slightly below the power budget. Applying VDD scaling would
boost its performance, once again emphasizing the superlinear trade-off between performance and power.

Figure 3. Effect of pipeline depth on a single-core design: relative power P/P0 versus relative delay D/D0, with experimental points at 12, 14, 18, and 23 FO4; curves for varying depth (fixed Vdd and η), varying VDD and η (fixed depth), and reducing f (fixed depth, Vdd, and η); and a horizontal dashed line marking the maximum power budget.6 (Copyright IEEE Press, 2004)

The preceding example illustrates the importance of incorporating power as an optimization target early in the design process along with the traditional performance metric. Although voltage- and frequency-scaling techniques can certainly correct small mismatches, selecting a pipeline structure on the basis of both performance and power is critical because a fundamental error here could lead to an irrecoverable postsilicon power-performance (hence, cost-performance) deficiency.

In addition to fixing and scaling pipeline depth appropriately to match technology trends, additional enhancements to increase power efficiency at the microarchitecture level are possible and desirable. The computer architecture research community has worked for several years on power-aware microarchitectures, developing various techniques for reducing active and passive power in cores.7-13 Table 1 shows some of these techniques.

Table 1. Power-aware microarchitectural techniques.
Microarchitecture optimization goal | Techniques
Active-power reduction | Clock gating; bandwidth gating; register port gating; asynchronously clocked pipelined units and globally asynchronous, locally synchronous architectures; power-efficient thread prioritization (simultaneous multithreading); simpler cores
Active- and passive-power reduction | Voltage gating of unused functional units and cache lines; adaptive resizing of computing and storage resources; dynamic voltage and frequency scaling

Microarchitects are using an increasing number of these techniques in commercial microprocessors. However, many difficult problems remain open. For example:
• determining the proper logic-level granularity of applying clock-gating techniques to maximize power savings;
What types of cores should we integrate on a chip. as discussed earlier.5 3. • building in predictive support for voltage gating at the microarchitectural and compiler levels to minimize switchingunit overhead. Several conclusions follow from the curves in Figure 4: 64 IEEE MICRO . software. Figure 4 presents two extreme designs that illustrate the methodology: a complex. out-of-order core and a simple. CPU manufacturers are moving to the multiple-cores-per-chip design.) • reconciling pervasive clock gating’s effect on cycle time.5 0 0.5 2. 2004. module. IBM is a trailblazer in this space. Integrating multiple cores on a chip With single-core performance improvements slowing.0 1. introduced in 2001 in 180-nm technology. and how many of them should we integrate? Of course.0 2. The Power4 microprocessor. The curves show the power-performance trade-offs possible for each of these designs through variation of the pipeline depth. ” invited course at Swedish Intelect Summer School on Low-Power Systems on Chip.5 1.FUTURE TRENDS 2. of wide-issue. multiple cores per chip can help continue the exponential growth of chiplevel performance. in-order cores 1 2 4 8 Relative power 1. we assume that we could integrate up to four of the complex cores or up to eight of the simple cores on a single chip. Let’s examine the trade-offs that arise in putting multiple cores on a chip. (Courtesy of V. Zyuban. and application synergies. introduced in 2004 in 130-nm technology.0 0. was a remapping of Power4 to 130-nm technology. augments the two cores per chip with twoway simultaneous multithreading per core.14 The Power4+ microprocessor. wide-issue. and the resulting systems lead in 34 industry-standard benchmarks. “Power-Performance Optimizations across Microarchitectural and Circuit Domains. This solution exploits performance through higher chip. The Power5. narrow-issue. introduced in 2003.0 Figure 4. Given the relative difference in size between these two organizations. 
we’ll leverage what we learned in our discussion of power-performance trade-offs for a single core.15 The 389-mm2 Power5 chip contains 276 million transistors.0 No. of narrow-issue. 23 to 25 Aug. out-of-order cores 1 2 4 No. and • addressing increased design veriﬁcation complexity in the presence of these techniques. Increasingly. Power-performance trade-offs in integrating multiple cores on a chip.5 1.5 Relative chip throughput 2. system. in-order core. comprised two cores per chip. and system integration levels and optimizes for performance through technology.
• For a given power budget (consider a horizontal line at 1.5), multiple simple cores produce higher throughput (aggregate chip-level performance). The simulations used to derive the curves show that this conclusion holds for both SMP workloads and independent threads.
• A complex core provides much higher single-thread performance than a simple core (compare the curves "1 wide-issue out-of-order core" and "1 narrow-issue in-order core"). Scaling up a simple core by reducing FO4 and/or raising VDD does not achieve this level of performance.
• Integrating a heterogeneous mixture of simple and complex cores on a chip might provide acceptable performance over a wider variety of workloads. As discussed later, such a solution has significant implications on programming models and software support. In addition, the operating system must provide richer functionality in terms of coscheduling and switching threads to cores, for more effective exploitation of shared resources.

These conclusions show that no single design for chip-level integration is optimal for all workloads. We can choose the appropriate design only by weighing the relative importance of single-thread performance and chip throughput for workloads that the systems are expected to run.

Accelerators and offload engines
Special-purpose accelerators and offload engines offer an alternative means of increasing performance and reducing power. Such engines help exploit concurrency and data formats in a specialized domain for which we have a thorough understanding of the bottlenecks and expected end-to-end gains. Examples include Transmission Control Protocol/Internet Protocol (TCP/IP) offloading, security, streaming and rich media, and collective communications in high-performance computing.

Accelerators are not new. In the past, accelerators had to compete against the increasing frequency, performance, and flexibility of general-purpose processors. However, in recent years several conditions have changed, making wider deployment feasible:
• Functionality that merits acceleration has become clearer.
• Domain-specific programmable and reconfigurable accelerators have emerged, replacing fixed-function, dedicated units. Examples include SIMD instruction set architecture extensions and FPGA-based accelerators.
• The slowing of frequency growth makes accelerators more attractive.
• Increasing density allows the integration of accelerators on chips along with the CPU. This results in tighter coupling and finer-grained integration of the CPU and the accelerator, and allows the accelerator to benefit from the same technology advances as the CPU.

Systems will increasingly rely on accelerators for improved performance and cost-performance. However, accelerators are not free. The end-to-end benefit of deploying an accelerator critically depends on the workload and the ease of accessing the accelerator functionality from application code. It is extremely important to achieve high utilization of an accelerator or to clock gate and power gate it effectively. A lack of compilers, libraries, and software tools to enable acceleration is the primary bottleneck to more pervasive deployment of these engines. Much work remains in this area: deciding what functions to accelerate, understanding the system-level implications of integrating accelerators, developing the right tools (including libraries, profilers, compilers, and both link-time and dynamic compiler optimizations) for software enablement of accelerators, and developing industry-standard software interfaces and practices that support accelerator use. Programming models and tool chains for exploiting accelerators must continue to mature to make such specialized functions easier for application developers to use productively. Given the potential for improvement, the judicious use of accelerators will remain an important part of system design methodology in the foreseeable future.

Software issues
The systems just described depend on exploiting greater levels of locality and concurrency to gain performance within an acceptable power budget. Appropriate support from compilers, operating systems, runtime systems, and libraries is essential to delivering the hardware's potential at the application level. The fundamental technology discontinuities discussed earlier, which slow the rate of frequency growth, combined with vastly increased hardware system complexity, make such enablement and optimization even more important.

Processor issues involved in exploiting instruction-level parallelism, such as code generation, instruction scheduling, and register allocation, are generally well understood. Memory issues, such as latency hiding and locality enhancement, need further examination.16,17 A fundamental issue in exploiting thread-level parallelism is identifying the threads in a computation. Explicitly parallel languages such as Java make the programmer responsible for this determination. Sequential languages require either automatic parallelization techniques21,22 or OpenMP-like compiler directives (http://www.openmp.org).23 Increasing software componentization requires the development of higher-level abstractions, together with innovative compiler optimizations17,18 and high-performance libraries19,20 to sustain the performance growth levels that applications demand.
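An Amdahl-style estimate shows concretely why the accelerator discussion above stresses workload fit and utilization. The sketch below is a toy model; the offloadable fraction, accelerator speedup, and invocation overhead are invented values, not measurements from the article:

```python
# Toy Amdahl-style model of the end-to-end benefit of an offload engine.
def end_to_end_speedup(f_offload, accel_speedup, overhead_frac):
    """f_offload: fraction of runtime the accelerator can absorb;
    accel_speedup: speedup on that fraction;
    overhead_frac: extra time (fraction of original runtime) spent
    invoking the accelerator and moving data."""
    new_time = (1 - f_offload) + f_offload / accel_speedup + overhead_frac
    return 1 / new_time

# A 10x accelerator applied to 30% of the work, with 5% invocation overhead:
modest = end_to_end_speedup(0.30, 10, 0.05)
# The same accelerator applied to 90% of the work, same overhead:
high = end_to_end_speedup(0.90, 10, 0.05)
```

Even a 10x engine yields under 1.3x end to end when only 30 percent of the work can be offloaded, versus better than 4x at 90 percent coverage, which is why utilization and ease of access from application code dominate the deployment decision.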
Scale-out
Scale-out provides the opportunity to meet performance demands beyond the levels that chip-level integration can provide. Moreover, given the power issues discussed earlier, and given that the power-performance trade-off is superlinear, scale-out can provide the same computational performance for far less power. In other words, if an application is amenable to scale-out, we can execute it on a large enough collection of lower-power, lower-performance cores to satisfy the application's overall computational requirement with much less power dissipation at the system level.

An effective scale-out solution requires a balanced building block. Figure 5 shows an example of such a building block: the chip used in the Blue Gene/L machine that IBM Research is building in collaboration with Lawrence Livermore National Laboratory.24 The relatively modest-sized chip (121 mm^2 in 130-nm technology) integrates two PowerPC 440 cores (PU0 and PU1) running at 700 MHz, two enhanced floating-point units (FPU0 and FPU1), L2 and L3 caches, communication interfaces (Torus, Tree, Eth, and JTAG) tightly coupled to the processors, and performance counters, and it integrates high-bandwidth, low-latency memory and interconnects on chip to balance data transfer and computational capabilities. Each FPU is two-way SIMD, and each SIMD FPU unit performs one Fused Multiply Add operation (equivalent to two floating-point operations) per cycle. This structure produces a peak computational rate of 8 floating-point operations per cycle, or 5.6 Gflops for a 700-MHz clock rate. This chip provides 5.6 Gflops of peak computation power for approximately 5 W of power dissipation.

Figure 5. Integrated functionality on IBM's Blue Gene/L computer chip (BLC DD 1.0): processor cores PU0 and PU1; floating-point units FPU0 and FPU1; L2 and L3 caches; Torus, Tree, Eth, and JTAG interfaces; and performance counters.

On top of this balanced hardware platform, an innovative hierarchically structured system software environment and standard programming models (Message-Passing Interface) and APIs for file systems, job scheduling, and system management result in a scalable, power-efficient system. Sixteen racks (32,768 processors) of the system sustained a Linpack performance of 70.72 Tflops on a problem size of 933,887, securing the top spot on the 24th Top500 list of supercomputers (http://www.top500.org).
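The chip's peak rate can be rechecked from the figures quoted above; the system-level efficiency calculation additionally assumes that the 32,768 processors correspond to 16,384 dual-core chips, which is an inference from the text rather than a stated figure:

```python
# Peak floating-point rate of one Blue Gene/L compute chip (values from text).
CLOCK_HZ = 700e6
FPUS_PER_CHIP = 2
SIMD_LANES = 2          # each FPU is two-way SIMD
FLOPS_PER_FMA = 2       # one fused multiply-add counts as two operations

flops_per_cycle = FPUS_PER_CHIP * SIMD_LANES * FLOPS_PER_FMA
peak_per_chip = flops_per_cycle * CLOCK_HZ          # 8 flops/cycle at 700 MHz

# 16 racks = 32,768 processors; assumed to be 16,384 dual-core chips.
chips = 32768 // 2
peak_system = chips * peak_per_chip
linpack_efficiency = 70.72e12 / peak_system         # sustained / peak
```

At roughly 77 percent of peak on Linpack, and 5.6 Gflops for approximately 5 W per chip, the building block delivers better than 1 Gflops/W, which is the power efficiency that makes this scale-out design attractive.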
To effectively manage the range of components that use power. Lorraine Herger. system-level view. The power distributions in Table 2 make it clear that we can ignore none of the power components. • Researchers must develop algorithms. MICRO Acknowledgments The work cited here came from multiple individuals and groups at IBM Research. and accelerators). and buses are each capable of trading power for performance. To do this in real-time. Although the preceding discussion has concentrated primarily on the CPU. These algorithms must also dynamically rebalance maximum available power across components to achieve the required quality of service. Achieving dynamic power balancing requires three enablers: • System components must support multiple power-performance operating points. It is now a principal design constraint across the computing spectrum. William Pulley- MAY–JUNE 2005 67 . Hendrik Hamann. dual in-line memory modules. We thank Pradip Bose. we must have a holistic. system on chips. most likely at the operating system or workload manager level. scale-out and parallel computing. Pratap Pattnaik. we face a technology discontinuity: the exponential growth in device and chip-level power dissipation and the consequent slowdown in frequency growth. today’s DRAM designs have different power states and both microprocessors and bus frequencies can be dynamically voltage. while maintaining the health of the system and its components. Bruce Knaack.and frequency-scaled. As computer architects. the power densities of all computing components at all scales are increasing exponentially. chip-level integration (chip multiprocessors. Each level in the hardware/software stack needs to be aware of power consumption and must cooperate in an overall strategy for intelligent power management. and opportunities abound for innovative work to meet the challenge. 
We will need a maniacal focus on power at all architecture and design levels to bridge this gap.System-level power management Power is clearly a limiting factor at the system level. • The system’s design must exploit the extremely unlikely fact that all components will simultaneously operate at their maximum power dissipation points (while providing a safe fallback position for the rare occasion when this might actually happen). The discontinuity is stimulating renewed interest in architecture and microarchitecture. Power dissipation (percentage) 46 28 17 7 2 30 28 23 11 5 3 System and component Data center Servers Tape drives Direct-access storage devices Network Other Midrange server DRAM system Processors Fans Level-three cache I/O fans I/O and miscellaneous T he inexorable growth in applications’ requirements for performance and costperformance improvements will continue at historical rates. together with tight hardware-software integration across the system stack to optimize performance. The right building blocks (cores). our challenge over the next decade is to deliver end-to-end performance growth at historical levels in the presence of this discontinuity. Tom Keller. Power distribution across system components. Table 2. At the same time. Evelyn Duesterwald. For example. caches. Jaime Moreno. power usage information must be available at all levels of the stack and managed via a global systems view. Michael Gschwind. and systemlevel power management are key levers. Microprocessors. Eric Kronstadt. Sleep modes in disks are a mature example of this feature. Rajiv Joshi. Philip Emma. to monitor and/or predict workloads’ power-performance trade-offs over time. Dynamically rebalancing total power across system components is key to improving system-level performance.
Acknowledgments
The work cited here came from multiple individuals and groups at IBM Research. We thank Pradip Bose, Evelyn Duesterwald, Philip Emma, Michael Gschwind, Hendrik Hamann, Lorraine Herger, Rajiv Joshi, Tom Keller, Bruce Knaack, Eric Kronstadt, Jaime Moreno, Pratap Pattnaik, William Pulleyblank, Michael Rosenfield, Leon Stok, Ellen Yoffa, Victor Zyuban, and the entire Blue Gene/L team for the technical results and for helping us to coherently formulate the views discussed in this article. The Blue Gene/L project was developed in part through a partnership with the Department of Energy, National Nuclear Security Administration Advanced Simulation and Computing Program to develop computing systems suited to scientific and programmatic missions.
Tilak Agerwala is vice president, systems, at IBM Research. He is responsible for all of IBM's advanced systems research programs in servers and supercomputers. His research interests include all aspects of high-performance systems. Agerwala has a PhD in electrical engineering from The Johns Hopkins University. He is a fellow of the IEEE and a member of the ACM.

Siddhartha Chatterjee is a research staff member and manager at IBM Research. His primary research areas are high-performance computing systems and software quality. Chatterjee has a PhD in computer science from Carnegie Mellon University. He is a senior member of the IEEE and a member of the ACM and SIAM.

Direct questions and comments about this article to Tilak Agerwala, IBM T.J. Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, NY 10598; tilak@us.ibm.com.

For further information on this or any other computing topic, visit our Digital Library at http://www.computer.org/publications/dlib.