Table 1. PMC categories and candidate counters. Counters within each category are listed in increasing order of Spearman rank correlation with power; the value shown is the correlation of the top (selected) counter in that category.

  FP Units       DISPATCHED FPU:ALL
                 RETIRED MMX AND FP INSTRUCTIONS:ALL            0.23
  Inst Retired   RETIRED BRANCH INSTRUCTIONS:ALL
                 RETIRED MISPREDICTED BRANCH INSTRUCTIONS:ALL
                 RETIRED INSTRUCTIONS
                 RETIRED UOPS                                    0.39
  Stalls         DECODER EMPTY
                 DISPATCH STALLS                                 0.2
  Memory         DRAM ACCESSES PAGE:ALL
                 DATA CACHE MISSES
                 L3 CACHE MISSES:ALL
                 MEMORY CONTROLLER REQUESTS:ALL
                 L2 CACHE MISS:ALL                               0.33
Figure 1. AMD Phenom Annotated Die [1]

[Figure: % miss rate per benchmark]

• We achieve accurate per-core estimates of multi-threaded and multiprogrammed workloads on a CMP with shared resources (an L3 cache, memory controller, memory channel, and communication buses).

• We observe and quantify the interacting effects of power and temperature.
2.1 Event Selection

We split the AMD Phenom PMCs into the four categories chosen above: FP Units, Memory, Stalls, and Instructions Retired. We use a Watts Up Pro power meter and pfmon to gather power and performance data for 13 counters from the entire SPEC-OMP benchmark suite. The specific categories and the PMCs that fall within them are shown in Table 1. The PMCs in each category are in increasing order of correlation with power. We choose Spearman's rank correlation to assess the relationship between the counter value and power. Compared to vanilla correlation, Spearman's rank correlation does not require assumptions about the frequency distributions of the variables. After performing this rank correlation on the data, we select the top counter from each bucket. The AMD Phenom only tracks four counters simultaneously, thus our choice of four categories for the PMCs works well on this platform. Since our power predictor is designed to work in real time, we are bound by this four-counter limit. However, this is an architecture-specific characteristic that varies by platform. The performance counters we use are:

• e1: L2 CACHE MISS:ALL
• e2: RETIRED UOPS
• e3: RETIRED MMX AND FP INSTRUCTIONS:ALL
• e4: DISPATCH STALLS

We claim that data from these four counters capture enough information to predict core and system power. This claim is backed by results in Section 4. Next, we discuss the microbenchmarks that exercise our chosen counters, and we then form the model based on the data collected.

2.2 Microbenchmarks

We run four copies of the microbenchmarks simultaneously. We isolate the data for any single CPU, since all cores exhibit the same PMC data. To ensure none of the separate threads are out of sync with each other, we run each program phase separately. We use the SPEC-OMP benchmark suite for selecting appropriate counters only. Our code is generic, and does not use any code from benchmarks used for testing (SPEC 2006, SPEC-OMP and NAS). The microbenchmarks explore the space spanned by the four selected counters. This means future application behavior should fall within the same space, and allows us to claim that our approach works independently of the benchmark suite. We verify this by testing our model on the NAS and SPEC 2006 benchmark suites, applications that we do not use during model building. Since this is an empirical process, verification lies in the quality of predictions.

2.3 Forming the Model

For the model, we use only data from our microbenchmarks. We normalize collected performance counter values to the elapsed cycle counts, yielding an event rate, r_i, per counter. The prediction model uses these rates as input. We model per-core power for each thread using our piece-wise model based on multiple linear regression. We collect PMC values every second while running the microbenchmarks, and generate the model based on this data. Our approach produces the following function (Equation 1), mapping observed event rates r_i to core power P_core:

\hat{P}_{core} =
  \begin{cases}
    F_1(g_1(r_1), \ldots, g_n(r_n)), & \text{if condition} \\
    F_2(g_1(r_1), \ldots, g_n(r_n)), & \text{otherwise}
  \end{cases}
  \qquad (1)
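To make the counter-selection and model-fitting steps concrete, the sketch below ranks candidate counters by Spearman rank correlation and fits a two-piece linear model with a least-squares estimator. This is our illustration rather than the authors' code: the data layout, the break threshold on r1, and the specific transforms (chosen only to mirror the shape of Equations 1 and 3) are assumptions.

```python
# Illustrative sketch of the counter-ranking and model-fitting steps
# (Sections 2.1 and 2.3). Data layout, break point, and transforms are
# assumptions, chosen only to mirror the shape of Equations 1 and 3.
import numpy as np
from scipy.stats import spearmanr

CYCLES_PER_SAMPLE = 2.2e9  # one-second samples at 2.2 GHz


def rank_counters(samples, power):
    """Rank candidate PMCs by Spearman rank correlation with measured power.

    samples: dict mapping counter name -> array of per-second counts
    power:   array of per-second measured core power (W)
    """
    scores = {}
    for name, counts in samples.items():
        rho, _ = spearmanr(counts, power)
        scores[name] = rho
    # Rank by correlation strength (the FP/MMX counter correlates negatively).
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


def fit_piecewise(events, power, break_rate=1e-5):
    """Least-squares fit of a two-piece linear model over transformed rates.

    events: (N, 4) raw counter deltas per sample, columns e1..e4
    power:  (N,) measured per-core power (W)
    """
    rates = events / CYCLES_PER_SAMPLE          # r_i = e_i / elapsed cycles
    low = rates[:, 0] < break_rate              # region 1: very few L2 misses

    def design(r, log_r1_r4):
        # g_i(r_i): identity in region 1; log on r1 and r4 in region 2
        cols = ([np.log(r[:, 0]), r[:, 1], r[:, 2], np.log(r[:, 3])]
                if log_r1_r4 else [r[:, 0], r[:, 1], r[:, 2], r[:, 3]])
        return np.column_stack([np.ones(len(r))] + cols)

    w_low, *_ = np.linalg.lstsq(design(rates[low], False), power[low], rcond=None)
    w_high, *_ = np.linalg.lstsq(design(rates[~low], True), power[~low], rcond=None)
    return w_low, w_high
```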
[Figure: data points fitted by a linear function and an exponential function; F(x) vs. x]
Figure 4. An Illustrative Example of a Better Fitting Piece-Wise Function

Each function F_i is written as a linear combination of transformed event rates (Equation 2), and is easily solved using a least squares estimator as in Contreras et al. [7]. We obtain the piece-wise linear model shown in Figure 5. We find the function behavior to be significantly different for very low values of the L2 cache miss counter compared to the rest of the space. We break our function based on this counter. Since the L3 is non-inclusive, most L2 cache misses trigger off-chip accesses contributing to total power. We also observe that power grows with increasing retired uops, since the CPU is doing more work. All counters have positive correlation with power, except for the retired FP/MMX instructions PMC. This is expected, since such instructions have higher latencies; this class of instructions reduces the throughput of the system, resulting in lower power use. Finally, the dispatch stalls PMC correlates positively with power. This can be due to reservation station or reorder buffer dispatch stalls, where the processor attempts to extract higher degrees of instruction level parallelism (ILP) from the code. Dynamic power increases from this logic overhead.

2.4 Temperature Effects

We are interested in the effect of temperature on system power. Ideally, power consumption does not increase over time. However, since static power is a function of voltage, process technology and temperature, increasing temperature leads to increasing leakage power and adds to total power consumption.

3 Experimental Setup

We evaluate our work using the SPEC 2006 [14], SPEC-OMP [2], and NAS [4] benchmark suites. All are compiled for the AMD x86_64 64-bit architecture using gcc 4.2 with -O3 optimization flags (and -fopenmp where applicable). We use reference input sets for SPEC 2006 and SPEC-OMP, and the medium input set (B) for NAS. All benchmarks are run to completion. We use the pfmon utility from the perfmon2 library to access hardware performance counters from user space, and run on Linux kernel version 2.6.25. Table 2 details the system configuration. System power is based on the processors being idle and powered at their lowest frequency and voltage. All power measurements are based on current drawn from the outlet by the power supply. We use a Watts Up Pro power meter to measure power consumption (accurate to within 0.1W). The meter updates its power values every second. Baseline power for the system is 85.1W running at 2.2 GHz, and 76.8W at 1.1 GHz. For the purpose of model formation only, we assume the base power without the processor to be 68.5W. This simplifying assumption aids in faster model formation without the need for more complicated measuring techniques. We calculate per-core power by subtracting the base power and dividing by four.

Some performance counters measure shared resources, which gives statistics across the CMP and not per core. Some performance counters could be further subdivided by type. For example, cache and DRAM accesses can be broken down into cache or page hits and misses, while dispatch stalls can be broken down by branch flushes, or full queues (reservation stations, reorder buffers, floating point units). The AMD processor in this work can only sample four counters simultaneously, which is why we narrow our design space to using only four performance counters.
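The per-core measurement described earlier in this section (subtract the assumed 68.5W base power, divide by the four cores) reduces to a one-line helper; this is our illustration, and the function name is ours.

```python
BASE_POWER_WATTS = 68.5   # assumed base power without the processor (Section 3)
NUM_CORES = 4


def measured_core_power(system_watts):
    """Per-core power used as the model's regression target."""
    return (system_watts - BASE_POWER_WATTS) / NUM_CORES

# e.g., the 85.1W idle baseline at 2.2 GHz gives (85.1 - 68.5) / 4 = 4.15W per core
```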
\hat{P}_{core} =
  \begin{cases}
    1.15 + 65.88\,r_1 + 11.49\,r_2 - 3.356\,r_3 + 17.75\,r_4, & r_1 < 10^{-5} \\
    23.34 + 0.93\log(r_1) + 10.39\,r_2 - 6.425\,r_3 + 6.43\log(r_4), & r_1 \geq 10^{-5}
  \end{cases}
  \qquad (3)

where r_i = e_i / 2,200,000,000

Figure 5. Piece-wise Linear Function for Core Power Derived from Microbenchmark Data (1 Second = 2.2 Billion Cycles)
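For readers who want to apply the published function directly, the sketch below evaluates Equation 3 on one second's worth of raw counter deltas. The coefficients and the 2.2-billion-cycle normalization come from Figure 5; the function name and the assumption that log denotes the natural logarithm are ours.

```python
import math

CYCLES_PER_SECOND = 2_200_000_000  # 1 second = 2.2 billion cycles at 2.2 GHz


def predict_core_power(e1, e2, e3, e4):
    """Equation 3: per-core power (W) from one second of counter deltas.

    e1: L2 CACHE MISS:ALL                     e2: RETIRED UOPS
    e3: RETIRED MMX AND FP INSTRUCTIONS:ALL   e4: DISPATCH STALLS
    """
    r1, r2, r3, r4 = (e / CYCLES_PER_SECOND for e in (e1, e2, e3, e4))
    if r1 < 1e-5:
        return 1.15 + 65.88 * r1 + 11.49 * r2 - 3.356 * r3 + 17.75 * r4
    # log is assumed here to be the natural logarithm
    return 23.34 + 0.93 * math.log(r1) + 10.39 * r2 - 6.425 * r3 + 6.43 * math.log(r4)
```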
  Frequency                 2.2 GHz (max), 1.1 GHz (min)
  Process Technology        65nm SOI
  Processor                 AMD Phenom 9500 CMP
  Number of Cores           4
  L1 (Instruction) Size     64 KB 2-Way Set Associative
  L1 (Data) Size            64 KB 2-Way Set Associative
  L2 Cache Size (Private)   512 KB/core 8-Way Set Associative
  L3 Cache Size (Shared)    2 MB 32-Way Set Associative
  Memory Controller         Integrated On-Chip
  Memory Width              64 bits/channel
  Memory Channels           2
  Main Memory               4 GB PC2 6400 (DDR2-800)

Table 2. CMP Machine Configuration Parameters

For power-aware process scheduling, we suspend processes determined to exceed the available power envelope. Suspended processes are later restored, and they continue executing when sufficient power becomes available.

4 Evaluation

We evaluate the accuracy of our power model using single- and multi-threaded benchmarks, using the entire CMP to test our results. We later leverage this real-time estimation to make power-aware scheduling decisions, suspending processes to maintain a given power envelope. Note that the scheduler is a user-level application, and is quite simple; much more sophisticated schedulers (e.g., taking priorities into consideration) are possible, but ours serves as a proof of concept.

4.1 Evaluating Power Estimation

We test our derived power model by comparing measured to predicted power in Figures 7(a), 7(b), and 7(c) for NAS, SPEC-OMP, and SPEC 2006, respectively. Each multi-threaded benchmark is run across the entire CMP, and multiple copies are spawned for single-threaded programs. Data reported are per core. Our estimation model tracks power consumption for each benchmark fairly well. We find a large difference between estimated and actual power for some benchmarks, such as bt and ft. This can partially be attributed to the lack of dynamic temperature data in our estimation model. Actual power values range from 19.6W for ep to 26.6W for mgrid, a variation of over 35%.

Figures 8(a), 8(b), and 8(c) show percentage error for each suite. We attribute error in power estimates to temperature, the limited number of PMCs simultaneously available, and parts of the counter space possibly unexplored by our microbenchmarks. For instance, temperature can increase power consumption by up to 10%. NAS and SPEC-OMP have average median errors of 5.8% and 3.9%, respectively. SPEC 2006 has a slightly higher median error of 7.2%.

Figure 9 shows the Cumulative Distribution Function (CDF) for all three benchmark suites taken together. This gives us a picture of the coverage of our model. For example, 85% of predictions across all benchmarks have less than 10% error. The CDF helps illustrate the model's fit, showing that most predictions have very small error. To further test model robustness, we examine system power estimates for a multiprogrammed workload. Figure 11(d) shows system power consumption for ep (NAS), art (SPEC-OMP), mcf and hmmer (SPEC 2006), run concurrently. We sum per-core power to estimate system power. Our model con-
[Figure 7. Actual vs. predicted per-core power (W) for (a) NAS, (b) SPEC-OMP, and (c) SPEC 2006]

[Figure 8. % median error of the power estimates for (a) NAS, (b) SPEC-OMP, and (c) SPEC 2006]

[Figure 9. CDF of % error across all three benchmark suites]
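As a concrete reading of Figures 8 and 9, the short sketch below computes the per-sample percentage error, its median, and the fraction of predictions under a 10% error threshold (the CDF coverage quoted in the text); the array names are ours.

```python
import numpy as np


def error_stats(predicted, actual, threshold=10.0):
    """Percentage error per sample, its median, and CDF coverage below threshold."""
    pct_err = 100.0 * np.abs(predicted - actual) / actual
    return np.median(pct_err), np.mean(pct_err < threshold)
```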
[Figure: predicted vs. actual system power (W) and power envelope over time (sec); panels (a) 140W, (b) 160W, (c) 180W, (d) None]
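The envelope experiments shown above are driven by the simple user-level scheduler described in Sections 3 and 4. The paper does not list its code; the sketch below is our own minimal illustration, assuming hypothetical helpers read_counter_deltas() (e.g., gathered via pfmon) and a predict_core_power() function such as Equation 3, a fixed envelope, and a naive choice of which process to suspend.

```python
# Our illustrative sketch of a user-level power-aware scheduler: every second,
# estimate total power from the per-core model and suspend or resume processes
# so that the total stays under a fixed power envelope.
import os
import signal
import time

ENVELOPE_WATTS = 140.0        # e.g., the 140W envelope shown above
BASE_POWER_WATTS = 68.5       # assumed non-processor base power (Section 3)


def read_counter_deltas(pid):
    """Placeholder: one second of counter deltas (e1..e4) for pid, e.g. via pfmon."""
    raise NotImplementedError


def schedule(pids, predict_core_power):
    """Suspend processes when predicted power exceeds the envelope; resume later."""
    suspended = []
    while True:
        running = [p for p in pids if p not in suspended]
        total = BASE_POWER_WATTS + sum(predict_core_power(*read_counter_deltas(p))
                                       for p in running)
        if total > ENVELOPE_WATTS and running:
            victim = running[-1]                      # naive choice of victim
            os.kill(victim, signal.SIGSTOP)
            suspended.append(victim)
        elif suspended and total < ENVELOPE_WATTS:
            os.kill(suspended.pop(), signal.SIGCONT)  # resume when power allows
        time.sleep(1.0)                               # meter and PMCs update once per second
```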
Joseph and Martonosi [11] estimate power using performance counters in a simulator (Wattch [6], integrated with SimpleScalar [3]), and on real hardware (Pentium Pro). Their power estimates are derived offline after multiple benchmark runs, since they require more PMCs than are supported by their hardware. They use PMCs to estimate power consumption for different architectural resources, but are unable to estimate power for 24% of the chip, and assume peak power for those structures.

Wu et al. [16] use microbenchmarks to estimate power consumption of functional units for the Pentium 4 architecture. Dynamic activity for all portions of the chip is not available, and if it were, the approach would exceed the limit of four PMCs for run-time power prediction.

Economou et al. [9] use performance counters to predict the power of a blade server. They use custom microbenchmarks to profile the system and estimate power for different hardware components (CPU, memory, and hard drive), achieving estimates with 10% average error. Like our work, their kernel benchmarks for deriving their model are independent of the benchmarks used for testing.

Lee and Brooks [12] use statistical inference models to predict power consumption based on hardware design parameters. They build correlations based on hardware parameters, and use the most significant parameters for training a model to estimate power consumption. They randomly sample a large design space and estimate power consumption based on previous power values for the same design space, which is profiled a priori. Unfortunately, this methodology requires sampling the same applications for which they are trying to estimate power consumption. This makes the approach dependent on having already trained on the applications of interest. The method is not feasible when executing programs outside of the sampled design space. At run time, scheduling is not known a priori, and the behavior of programs changes depending on their interaction with other processes sharing the same cache resources.

Curtis-Maury et al. [8] predict optimal concurrency values using machine learning to estimate power consumption at various thread counts based on power consumption at the highest thread count for the NAS benchmark suite. They require that applications be profiled at one thread count before predicting power consumption at other thread counts, making their models application dependent.

Contreras and Martonosi [7] use PMCs on an XScale to estimate power consumption online at different frequencies. Like Joseph and Martonosi, they gather data from multiple PMCs by running benchmarks several times, and join sampled data from different runs together to derive better estimates. This methodology is not feasible at run time, since it requires three times the PMCs that can be sampled in real time. Thus it cannot be used by an OS scheduler or application developer executing multiple programs in parallel at run time. Their parameterized linear model does not apply here, as their work for an in-order single core does not scale to our platform. This is due to the increased complexity of our multicore, out-of-order, power-efficient, high-performance platform.

Merkel and Bellosa [13] use performance counters to estimate power per processor in an 8-way symmetric multiprocessing (SMP) system, and shuffle processes to reduce overheating of individual processors. Bautista et al. [5] schedule processes on a multicore system for real-time applications. They assume static power consumption for all applications.
They leverage chip-wide DVFS to reduce energy when there is slack in the application. Isci et al. [10] analyze power management policies for a given power budget. They leverage per-core and domain-wide DVFS to retain performance while remaining within a power budget. Unlike their work, we do not require on-core current sensors, but instead use real hardware to evaluate our work. While their work involves core adaptation policies, we examine OS scheduling decisions with fixed cores. We will expand our approach when per-core DVFS becomes available.

We estimate power consumption for an aggressively power-efficient, high-performance platform. Our platform differs from previous literature due to: resources being shared across cores, performance counters that do not individually exhibit high correlation with power, and significant power variation across benchmarks, depending on workload. Additionally, since we desire real-time power estimation, we are confined to using only the number of performance counters available at run time in a single run. This prevents us from making fair comparisons of this work with previous work. Our framework can estimate power consumption for new programs that we have not run before, and evaluates results on a multi-core system, providing a feasible method of estimating power consumption per core. We leverage this to perform real-time power-aware per-core scheduling on a CMP, which was not previously possible.

6 Conclusions

We derive an analytic, piece-wise linear power model that maps performance counters to power consumption accurately, independently of program behavior, for the SPEC 2006, SPEC-OMP and NAS benchmark suites. We use custom microbenchmarks to generate data for creating our estimation model. The microbenchmarks stress four counters selected based on their correlation with measured power consumption values.

We leverage our power model to perform run-time, power-aware thread scheduling. We suspend and resume processes based on power consumption, ensuring that a given power envelope is not exceeded. Our work is likely useful for consolidated data centers, where virtualization leads to multiple servers on a single CMP. Using our estimation methodology, we can accurately estimate power consumption at the core granularity, allowing for accurate billing of power usage and cooling costs. Estimating per-core power consumption is challenging, since some resources are shared across cores (such as caches, the DRAM memory controller and off-chip memory accesses). Additionally, current machines only allow us to monitor a few performance counters per core, depending on the platform (Intel, Sun, AMD).

For future work, we will expand this study to schedule multi-threaded workloads. This adds a layer of complexity, since all the threads of a specific process need to be suspended at once. If only one thread is suspended, it might be holding locks, or other threads might be forced to wait at barrier points for it. Another strategy is to force the scheduler to wait for specific thread spawn points before changing the number of threads.

We find temperature to play a significant role in power consumption. When supported by the hardware, we will incorporate thermal data into our estimation model and thread scheduling strategies. We also wish to investigate the benefits of using a power-aware scheduler to choose processor cores on which to reduce frequency (DVFS), rather than suspending processes. Much more sophisticated schedulers (e.g., with knowledge of priorities or real-time demands) are possible, and experimenting with these will be interesting for observing the consequent effects on throughput. As the number of cores and power demands grow, process scheduling will be critical for efficient computing.

References

[1] AMD Phenom(TM) Quad-Core Processor Die. www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51 104 572 5731̂50441̃17770,00.html, Jan. 2008.

[2] V. Aslot and R. Eigenmann. Performance characteristics of the SPEC OMP2001 benchmarks. In Proceedings of the European Workshop on OpenMP, Sept. 2001.

[3] T. Austin. SimpleScalar 4.0 release note. http://www.simplescalar.com/.

[4] D. Bailey, T. Harris, W. Saphir, R. Van der Wijngaart, A. Woo, and M. Yarrow. The NAS parallel benchmarks 2.0. Report NAS-95-020, NASA Ames Research Center, Dec. 1995.

[5] D. Bautista, J. Sahuquillo, H. Hassan, S. Petit, and J. Duato. A simple power-aware scheduling for multicore systems when running real-time applications. In Proc. 22nd IEEE/ACM International Parallel and Distributed Processing Symposium, pages 1–7, Apr. 2008.

[6] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A framework for architectural-level power analysis and optimizations. In Proc. 27th IEEE/ACM International Symposium on Computer Architecture, pages 83–94, June 2000.

[7] G. Contreras and M. Martonosi. Power prediction for Intel XScale processors using performance monitoring unit events. In Proc. IEEE/ACM International Symposium on Low Power Electronics and Design, pages 221–226, Aug. 2005.

[8] M. Curtis-Maury, K. Singh, S. A. McKee, F. Blagojevic, D. S. Nikolopoulos, B. R. de Supinski, and M. Schulz. Identifying energy-efficient concurrency levels using machine learning. In Proc. 1st International Workshop on Green Computing, Sept. 2007.

[9] D. Economou, S. Rivoire, C. Kozyrakis, and P. Ranganathan. Full-system power analysis and modeling for server environments. In Workshop on Modeling, Benchmarking and Simulation (MOBS) at ISCA, June 2006.

[10] C. Isci, A. Buyuktosunoglu, C.-Y. Cher, P. Bose, and M. Martonosi. An analysis of efficient multi-core global power management policies: Maximizing performance for a given power budget. In Proc. IEEE/ACM 40th Annual International Symposium on Microarchitecture, pages 347–358, Dec. 2006.
[11] R. Joseph and M. Martonosi. Run-time power estimation in high-
performance microprocessors. In Proc. IEEE/ACM International
Symposium on Low Power Electronics and Design, pages 135–140,
Aug. 2001.
[12] B. Lee and D. Brooks. Accurate and efficient regression modeling
for microarchitectural performance and power prediction. In Proc.
12th ACM Symposium on Architectural Support for Programming
Languages and Operating Systems, pages 185–194, Oct. 2006.
[13] A. Merkel and F. Bellosa. Balancing power consumption in mul-
tiprocessor systems. In EuroSys ’06: Proceedings of the ACM
SIGOPS/EuroSys European Conference on Computer Systems, pages
403–414, Apr. 2006.
[14] Standard Performance Evaluation Corporation. SPEC CPU bench-
mark suite. http://www.specbench.org/osg/cpu2006/, 2006.
[15] V. Weaver and S. A. McKee. Can hardware performance counters
be trusted? In Proc. IEEE International Symposium on Workload
Characterization, Sept. 2008.
[16] W. Wu, L. Jin, and J. Yang. A systematic method for functional
unit power estimation in microprocessors. In Proc. 43rd ACM/IEEE
Design Automation Conference, pages 554–557, July 2006.