
Real Time Power Estimation and Thread Scheduling via Performance Counters

Karan Singh, Major Bhadauria
Computer Systems Laboratory
Cornell University
Ithaca, NY, USA
{karan major}@csl.cornell.edu

Sally A. McKee
Department of Computer Science and Engineering
Chalmers University of Technology
Göteborg, Sweden
mckee@chalmers.se

Abstract

Estimating power consumption is critical for hardware and software developers, and, of the latter, particularly for OS programmers writing process schedulers. However, obtaining processor and system power consumption information can be non-trivial. Simulators are time consuming and prone to error. Power meters report whole-system consumption, but cannot give per-processor or per-thread information. More intrusive hardware instrumentation is possible, but such solutions are usually employed while designing the system, and are not meant for customer use.

Given these difficulties, plus the current availability of some form of performance counters on virtually all platforms (even though such counters were initially designed for system bring-up, and not intended for general programmer consumption), we analytically derive functions for real-time estimation of processor and system power consumption using performance counter data on real hardware. Our model uses data gathered from microbenchmarks that capture potential application behavior. The model is independent of our test benchmarks, and thus we expect it to be well suited for future applications. We target chip multiprocessors, analyzing effects of shared resources and temperature on power estimation, and leveraging our model to implement a simple, power-aware thread scheduler. The NAS and SPEC-OMP benchmarks show median errors of 5.8% and 3.9%, respectively; SPEC 2006 shows a marginally higher median error of 7.2%.

1 Introduction

Power and thermal constraints limit processor frequency. In response, computer architects have shifted to chip multiprocessors (CMPs) to retain performance improvements without increasing power envelopes; such designs trade higher frequencies and voltages for more cores. When optimizing software or evaluating new architectures for existing software, energy efficiency is now a critical part of performance analysis. If the OS is aware of the power consumption of various processes within the system, it can prioritize processes based on thermal constraints and available remaining power, or schedule processes to remain within a given power envelope. Unfortunately, it is very hard to expose run-time power consumption of a given processor to the software developer, computer architect, or OS.

Power meters retrieve total system and CMP power consumption. Although existing hardware can be modified to monitor the current and power draw of any CPU socket, per-core power consumption is difficult to measure with present multicore designs, because the processors share the same power planes. Embedding measurement devices on chip is not financially feasible. Furthermore, the accuracy of a power meter can be affected when the current draw fluctuates rapidly with program phases.

We propose using Performance Monitoring Counters (PMCs) to estimate power consumption of any processor via analytic models. On-chip performance counters are generally accurate [15] (if used correctly), and they provide significant insight into processor performance at clock-cycle granularity. PMCs are already incorporated into, and exposed to user space on, most modern architectures. Accurately estimating real-time power consumption enables the OS to make better real-time scheduling decisions, administrators to accurately estimate the maximum number of usable threads for data centers, and simulators to accurately estimate power without actually simulating it. Additionally, a power meter is not required per system. Our analytic model can be queried on multiple systems regardless of the programs or inputs used. This is possible because our model uses microbenchmark data independent of program behavior. We write these microbenchmarks to gather PMC data that contribute to the power function, and we use these data to form our power model equations. We thus estimate power for single-threaded and multithreaded benchmark suites, and make the following contributions:

ACM SIGARCH Computer Architecture News 46 Vol. 37, No. 2, May 2009
FP Units:
  DISPATCHED FPU:ALL
  RETIRED MMX AND FP INSTRUCTIONS:ALL (0.23)
Inst Retired:
  RETIRED BRANCH INSTRUCTIONS:ALL
  RETIRED MISPREDICTED BRANCH INSTRUCTIONS:ALL
  RETIRED INSTRUCTIONS
  RETIRED UOPS (0.39)
Stalls:
  DECODER EMPTY
  DISPATCH STALLS (0.2)
Memory:
  DRAM ACCESSES PAGE:ALL
  DATA CACHE MISSES
  L3 CACHE MISSES:ALL
  MEMORY CONTROLLER REQUESTS:ALL
  L2 CACHE MISS:ALL (0.33)

Table 1. PMCs Categorized By Architecture and Ordered (Increasing) by Correlation (Based on SPEC-OMP Data)
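The ordering in Table 1 comes from Spearman's rank correlation between each counter's samples and measured power (Section 2.1). As a minimal illustration of the statistic only, not the paper's actual analysis code, a rank-then-difference computation in C might look like:

```c
#include <stddef.h>

/* Assign ranks 1..n to the values in v (no tie correction; O(n^2),
 * kept simple for illustration). */
static void rank_values(const double *v, double *rank, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        size_t r = 1;
        for (size_t j = 0; j < n; j++)
            if (v[j] < v[i])
                r++;
        rank[i] = (double)r;
    }
}

/* Spearman's rho over paired samples:
 *   rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)),
 * where d_i is the rank of x[i] minus the rank of y[i].
 * Assumes n <= 64 and distinct values within each array. */
double spearman_rho(const double *x, const double *y, size_t n)
{
    double rx[64], ry[64], sum_d2 = 0.0;

    rank_values(x, rx, n);
    rank_values(y, ry, n);
    for (size_t i = 0; i < n; i++) {
        double d = rx[i] - ry[i];
        sum_d2 += d * d;
    }
    return 1.0 - 6.0 * sum_d2 / ((double)n * ((double)n * n - 1.0));
}
```

Because only ranks enter the formula, a counter that rises monotonically with power scores rho = 1 regardless of the shape of the relationship, which is why no distributional assumptions are needed.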

Figure 1. AMD Phenom Annotated Die [1]

Figure 2. L3 Cache Miss Rates for SPEC 2006

• We achieve accurate per-core estimates of multi-threaded and multiprogrammed workloads on a CMP with shared resources (an L3 cache, memory controller, memory channel, and communication buses).

• We observe and quantify the interacting effects of power and temperature.

• We achieve real-time power estimation, without the need for off-line benchmark profiling.

• We illustrate an application of our real-time predictions for making thread scheduling decisions at run-time.
We achieve median errors of 5.8%, 3.9%, and 7.2% for the NAS, SPEC-OMP, and SPEC 2006 benchmark suites, respectively. We leverage our real-time estimation to dynamically schedule workloads, achieving a reduced power envelope through suspension of appropriate threads.

2 Methodology

We separate the AMD Phenom performance counters into four categories: FP Units, Memory, Stalls, and Instructions Retired. We gather insight into these categories by examining the processor die in Figure 1. The shared L3 and private L2 caches take up significant area, as do the floating point units (FPUs) and front-end logic. Tracking retired instructions gives a good perspective on overall processor performance, and categorizing instructions as FP or INT indicates what units the instruction mix uses. Monitoring L2 misses allows us to track use of the L3 cache, since L2 cache misses often result in L3 cache misses, which then lead to off-chip memory accesses. Figure 2 graphs L3 misses normalized to total L3 cache accesses for the SPEC 2006 benchmarks using one reference input. We find that the L3 cache has a large miss rate, since it is essentially a non-inclusive victim cache.

Another architectural feature responsible for power consumption in high performance processors is the out-of-order logic. While no single PMC tracks out-of-order logic use, the number of cycles stalled due to resource limits (DISPATCH STALLS) indicates total stalls due to branches, full load/store queues, reorder buffers, and reservation stations. This PMC represents how often the CPU stalls, which provides insight into stalled issue logic, indicating potentially less power consumption. However, if the source of these stalls is the reservation station or reorder buffer, then the CPU is trying to extract instruction level parallelism, indicating increased power consumption. For example, if all instructions fetched can subsequently be issued and dispatched, the reservation stations and reorder buffers will remain a constant size (as instructions are retired in order). Should a fetched instruction stall, out-of-order logic attempts to execute another, requiring dynamic power expenditure, since more reservation stations are examined to check whether a new instruction's dependences are satisfied.

2.1 Event Selection

We split the AMD Phenom PMCs into the four categories chosen above: FP Units, Memory, Stalls, and Instructions Retired. We use a Watts Up Pro power meter and pfmon to gather power and performance data for 13 counters from the entire SPEC-OMP benchmark suite. The specific categories and the PMCs that fall within them are shown in Table 1. The PMCs in each category are in increasing order of correlation with power. We choose Spearman's rank correlation to assess the relationship between each counter value and power. Compared to vanilla correlation, Spearman's rank correlation does not require assumptions about the frequency distributions of the variables. After performing this rank correlation on the data, we select the top counter from each bucket. The AMD Phenom only tracks four counters simultaneously, thus our choice of four categories for the PMCs works well on this platform. Since our power predictor is designed to work in real time, we are bound by this four-counter limit. However, this is an architecture-specific characteristic that varies by platform. The performance counters we use are:

• e1: L2 CACHE MISS:ALL
• e2: RETIRED UOPS
• e3: RETIRED MMX AND FP INSTRUCTIONS:ALL
• e4: DISPATCH STALLS

We claim that data from these four counters capture enough information to predict core and system power. This claim is backed by results in Section 4. Next, we discuss the microbenchmarks that exercise our chosen counters, and we then form the model based on the data collected.

2.2 Microbenchmarks

Since we wish our model to be independent of the applications used, we write microbenchmarks to stress the four counters selected, and we explore the space of their cross product. This space spans values from zero to several billion, with large variations in values among benchmarks. For example, integer benchmarks have few floating point operations, and some computationally intensive benchmarks have few L2 or L3 cache misses. We try to cover the extreme boundaries along with common cases, but our microbenchmarks cannot cover all regions. A more thorough set of benchmarks could improve estimation, but would increase training time. We only target three dimensions of the space, since one of our PMCs (DISPATCH STALLS) cannot be explicitly targeted with code. Our benchmark code consists of a large for loop and a case statement that iterates through different code nests, depending on the loop index. Our synthetic benchmarks consist mainly of a set of assign statements (moves) and arithmetic/floating point operations, so we compile with no optimizations to prevent the compiler from removing redundant lines of code.

We run four copies of the microbenchmarks simultaneously. We isolate the data for any single CPU, since all cores exhibit the same PMC data. To ensure none of the separate threads are out of sync with each other, we run each program phase separately. We use the SPEC-OMP benchmark suite for selecting appropriate counters only. Our code is generic, and does not use any code from the benchmarks used for testing (SPEC 2006, SPEC-OMP, and NAS). The microbenchmarks explore the space spanned by the four selected counters. This means future application behavior should fall within the same space, and allows us to claim that our approach works independently of the benchmark suite. We verify this by testing our model on the NAS and SPEC 2006 benchmark suites, applications that we do not use during model building. Since this is an empirical process, verification lies in the quality of predictions.

2.3 Forming the Model

For the model, we use only data from our microbenchmarks. We normalize collected performance counter values to the elapsed cycle counts, yielding an event rate, ri, per counter. The prediction model uses these rates as input. We model per-core power for each thread using our piece-wise model based on multiple linear regression. We collect PMC values every second while running the microbenchmarks, and generate the model based on these data. Our approach produces the following function (Equation 1), mapping observed event rates ri to core power Pcore:

P̂core = F1(g1(r1), ..., gn(rn)), if condition
        F2(g1(r1), ..., gn(rn)), else                  (1)

where ri = ei / (cycle count)

Fn = p0 + p1 * g1(r1) + ... + pn * gn(rn)              (2)

The function consists of linear weights on transformations of event rates (Equation 2). Spearman's rank correlation operates independently of frequency distributions, thus transformations can be linear, inverse, logarithmic, exponential, or square root (whatever makes the data more amenable to linear regression, which helps prediction accuracy). We choose a piece-wise linear function because we find code to behave significantly differently for low PMC values. The approach lets us capture more detail about the core power without sacrificing the simplicity and ease of linear regression. For example, were we to form a model for the data in Figure 3, we would find that neither a linear nor an exponential transformation fits the data. However, were we to break the data into two parts, we would find a piece-wise combination of the two fits much better, as in Figure 4.

We use least squares to determine weights for function parameters. Each part of the piece-wise function can be

written as a linear combination of transformed event rates (Equation 2), and is easily solved using a least squares estimator, as in Contreras et al. [7]. We obtain the piece-wise linear model shown in Figure 5. We find the function behavior to be significantly different for very low values of the L2 cache miss counter compared to the rest of the space, so we break our function based on this counter. Since the L3 is non-inclusive, most L2 cache misses trigger off-chip accesses contributing to total power. We also observe that power grows with increasing retired uops, since the CPU is doing more work. All counters have positive correlation with power, except for the retired FP/MMX instructions PMC. This is expected, since such instructions have higher latencies; this class of instructions reduces the throughput of the system, resulting in lower power use. Finally, the dispatch stalls PMC correlates positively with power. This can be due to reservation station or reorder buffer dispatch stalls, where the processor attempts to extract higher degrees of instruction level parallelism (ILP) from the code. Dynamic power increases from this logic overhead.

Figure 3. An Illustrative Example of Best-Fit Continuous Approximation Functions

Figure 4. An Illustrative Example of a Better-Fitting Piece-Wise Function

2.4 Temperature Effects

We are interested in the effect of temperature on system power. Ideally, power consumption does not increase over time. However, since static power is a function of voltage, process technology, and temperature, increasing temperature leads to increasing leakage power, which adds to total power. We concurrently monitor the temperature and power of the CMP to see their relationship. Figure 6 graphs temperature (in Celsius) and power consumption (in Watts) over time. Results are normalized to their steady-state values. Benchmarks bt, lu, and namd are run across all four cores of the CMP, with results capped at 120 seconds. For namd, four instances are run concurrently since it is single-threaded. Performance counters and program source code are examined to ensure the work performed is constant over time. The programs exhibit varying increases in power and temperature over time. Clearly, temperature and power affect each other. Not accounting for temperature could lead to increased error in power estimates. However, not all systems support temperature sensors on die, or per core. We omit temperature readings in this model since our hardware lacks support for per-core temperature sensors.

3 Experimental Setup

We evaluate our work using the SPEC 2006 [14], SPEC-OMP [2], and NAS [4] benchmark suites. All are compiled for the AMD x86_64 64-bit architecture using gcc 4.2 with -O3 optimization flags (and OpenMP where applicable). We use reference input sets for SPEC 2006 and SPEC-OMP, and the medium input set (B) for NAS. All benchmarks are run to completion. We use the pfmon utility from the perfmon2 library to access hardware performance counters from user space, and run on Linux kernel version 2.6.25. Table 2 details the system configuration. System power is based on the processors' being idle and powered at their lowest frequency and voltage. All power measurements are based on current drawn from the outlet by the power supply. We use a Watts Up Pro power meter to measure power consumption (accurate to within 0.1W). The meter updates its power values every second. Baseline power for the system is 85.1W running at 2.2 GHz, and 76.8W at 1.1 GHz. For the purpose of model formation only, we assume the base power without the processor to be 68.5W. This simplifying assumption aids faster model formation without the need for more complicated measuring techniques. We calculate per-core power by subtracting the base power and dividing by four.

Some performance counters measure shared resources, which gives statistics across the CMP rather than per core. Some performance counters could be further subdivided by type. For example, cache and DRAM accesses can be broken down into cache or page hits and misses, while dispatch stalls can be broken down by branch flushes, or full queues (reservation stations, reorder buffers, floating point units). The AMD processor in this work can only sam-


P̂core = 1.15 + 65.88*r1 + 11.49*r2 - 3.356*r3 + 17.75*r4,           if r1 < 10^-5
P̂core = 23.34 + 0.93*log(r1) + 10.39*r2 - 6.425*r3 + 6.43*log(r4),  if r1 >= 10^-5    (3)

where ri = ei / 2,200,000,000

Figure 5. Piece-wise Linear Function for Core Power Derived from Microbenchmark Data (1 Second = 2.2 Billion Cycles)

Figure 6. Power vs. Temperature on Our 4-Core CMP: (a) bt, (b) namd, (c) lu (temperature and power normalized to steady-state values over 0-120 seconds)
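The microbenchmark shape described in Section 2.2 (a large for loop over a case statement that selects different code nests by loop index) can be sketched as below. The nest bodies, working-set size, and stride are hypothetical stand-ins for illustration; the real codes are compiled with -O0 so redundant statements survive:

```c
#include <stddef.h>

#define WSET (1 << 16)   /* hypothetical working-set size; the real runs
                            size it to overflow the L2 cache */

static unsigned char wset[WSET];   /* array walked with a large stride */

/* One synthetic kernel: the loop index picks a code nest, so different
 * phases stress different counters (FP units, retired uops, L2 misses
 * via strided accesses). Returns a checksum so the work is observable. */
double microbench(long iters)
{
    double fp_acc = 0.0;   /* sink for floating-point work */
    long   int_acc = 0;    /* sink for integer/move work   */

    for (long i = 0; i < iters; i++) {
        switch (i % 3) {
        case 0:                                    /* FP nest */
            fp_acc += 1.5 * 2.5;
            break;
        case 1:                                    /* integer/move nest */
            int_acc += i;
            break;
        default:                                   /* strided-access nest */
            wset[(size_t)(i * 4099) % WSET] += 1;
            break;
        }
    }
    return fp_acc + (double)int_acc;
}
```

Varying the mix of nests (and the stride relative to cache capacity) moves the four selected counter rates through the space the model is trained on.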

Frequency: 2.2 GHz (max), 1.1 GHz (min)
Process Technology: 65nm SOI
Processor: AMD Phenom 9500 CMP
Number of Cores: 4
L1 (Instruction) Size: 64 KB, 2-Way Set Associative
L1 (Data) Size: 64 KB, 2-Way Set Associative
L2 Cache Size (Private): 512 KB/core, 8-Way Set Associative
L3 Cache Size (Shared): 2 MB, 32-Way Set Associative
Memory Controller: Integrated On-Chip
Memory Width: 64 bits/channel
Memory Channels: 2
Main Memory: 4 GB PC2 6400 (DDR2-800)

Table 2. CMP Machine Configuration Parameters

ple four counters simultaneously, which is why we narrow our design space to using only four performance counters. For power-aware process scheduling, we suspend processes determined to exceed the available power envelope. Suspended processes are later restored, and they continue executing when sufficient power becomes available.

4 Evaluation

We evaluate the accuracy of our power model using single- and multi-threaded benchmarks, using the entire CMP to test our results. We later leverage this real-time estimation to make power-aware scheduling decisions, suspending processes to maintain a given power envelope. Note that the scheduler is a user-level application, and is quite simple; much more sophisticated schedulers (e.g., taking priorities into consideration) are possible, but ours serves as a proof of concept.

4.1 Evaluating Power Estimation

We test our derived power model by comparing measured to predicted power in Figures 7(a), 7(b), and 7(c) for NAS, SPEC-OMP, and SPEC 2006, respectively. Each multi-threaded benchmark is run across the entire CMP, and multiple copies are spawned for single-threaded programs. Data reported are per core. Our estimation model tracks power consumption for each benchmark fairly well. We find a large difference between estimated and actual power for some benchmarks, such as bt and ft. This can partially be attributed to the lack of dynamic temperature data in our estimation model. Actual power values range from 19.6W for ep to 26.6W for mgrid, a variation of over 35%.

Figures 8(a), 8(b), and 8(c) show percentage error for each suite. We attribute error in power estimates to temperature, the limited number of PMCs simultaneously available, and parts of the counter space possibly unexplored by our microbenchmarks. For instance, temperature can increase power consumption by up to 10%. NAS and SPEC-OMP have median errors of 5.8% and 3.9%, respectively. SPEC 2006 has a slightly higher median error of 7.2%.

Figure 9 shows the Cumulative Distribution Function (CDF) for all three benchmark suites taken together. This gives us a picture of the coverage of our model. For example, 85% of predictions across all benchmarks have less than 10% error. The CDF helps illustrate the model's fit, showing that most predictions have very small error. To further test model robustness, we examine system power estimates for a multiprogrammed workload. Figure 11(d) shows system power consumption for ep (NAS), art (SPEC-OMP), mcf and hmmer (SPEC 2006), run concurrently. We sum per-core power to estimate system power. Our model con-

Figure 7. Measured vs. Predicted Power: (a) NAS, (b) SPEC-OMP, (c) SPEC 2006

Figure 8. Median Errors for Given Benchmark Suites: (a) NAS, (b) SPEC-OMP, (c) SPEC 2006


Figure 9. Cumulative Distribution Function (CDF) Plot Showing Fraction of Space Predicted (y-axis) under a Given Error (x-axis)

Figure 10. Scheduler Setup and Use

servatively over-estimates power consumption. After ep finishes executing, the power prediction decreases with measured system power. System power further decreases after art completes, and only two cores continue. Even with the interaction of shared resources, our model accurately tracks CMP system power consumption.

4.2 Power-Aware Thread Scheduling

We use the power predicted for processes to schedule them within a multi-programmed workload on the CMP. We suspend processes to remain below the system power envelope. For these experiments, we assume the system power envelope to be degraded by 10, 20, or 30%.

We develop a user-space scheduler in C that spawns a process on each of the four cores of the AMD Phenom, monitors their behavior via pfmon, and schedules processes to run within a fixed power envelope. Figure 10 illustrates its setup and use. The program makes real-time predictions for per-core and system power based on collected performance counters, and suspends processes as the power envelope is breached. It suspends processes such that system power is just below the power envelope. For example, assume that current system power is 190W and the power envelope is 180W. For simplicity, we have to choose between two processes consuming 20W and 25W, respectively. The scheduler suspends the first process to bring system power down to 170W, rather than choosing the second process and ending further from the envelope (at 165W).

We present examples of running a randomly chosen workload at 180W, 160W, and 140W, which represent approximately 90%, 80%, and 70% of maximum power usage when no processes are suspended. For fairness, we run multi-threaded programs with one thread only, since if we were to suspend a multi-threaded process, it would span multiple cores and fall far below the power envelope. Benchmarks are chosen arbitrarily, consisting of ep from NAS, art from SPEC-OMP, and mcf and hmmer from SPEC CPU 2006. Figure 11(d) shows workload execution without a power envelope. In Figure 11(a), there is a sudden drop in system power during the initial execution of the workload. The total system power of 177.7W is far above the applied power envelope of 140W. At this point, ep, art, mcf, and hmmer are using 19.8W, 20.5W, 21.4W, and 22.4W, respectively. The scheduler suspends ep and art because their sum of 40.3W brings the power just below the envelope of 140W. The drop in power before the 500-second mark is where one process ends and the scheduler resumes another. Similarly, in Figure 11(c), the predicted total system power of 181.1W is just above the power envelope of 180W, and ep, art, mcf, and hmmer are using 19.7W, 24.7W, 20.7W, and 22.2W, respectively. The drop at about 20 seconds into the execution of the workload illustrates the scheduler suspending ep, since this allows total power to remain closest to the power envelope. The process is later resumed to complete execution.

Figures 11(a), 11(b), and 11(c) show that the measured and predicted power match up well. We are able to follow the power envelope strictly, and do so entirely on the basis of our prediction-based scheduler. This obviates the need for a power meter, and would be an excellent tool for software-level control of total system power.

5 Related Work

Prior work falls under two different approaches. The first strategy estimates power consumption based on monitoring functional unit usage [11, 16]. The second strategy derives mathematical correlations between performance counters and power consumed, independent of the underlying functional units [7]. In our work, we combine the two approaches, using correlation and architectural knowledge to choose appropriate performance counters, and using analytic techniques to identify the relationship between performance counters and power consumption. We briefly outline related work.

Figure 11. Scheduler with Given Power Envelope: (a) 140W, (b) 160W, (c) 180W, (d) None (predicted vs. actual system power in Watts over time in seconds)
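The suspension choice behind Figure 11 (Section 4.2: suspend the process that leaves system power just under the envelope) can be sketched with a hypothetical helper; pick_victim and its single-victim greedy policy are our illustration, not the paper's actual scheduler code:

```c
#include <stddef.h>

/* Return the index of the process whose suspension leaves system power
 * below the envelope but as close to it as possible, or -1 if
 * suspending no single process is enough. proc_watts[] holds the
 * per-process power estimates produced by the model. */
int pick_victim(const double *proc_watts, size_t n,
                double system_watts, double envelope_watts)
{
    int best = -1;
    double best_after = -1.0;  /* largest post-suspension power seen */

    for (size_t i = 0; i < n; i++) {
        double after = system_watts - proc_watts[i];
        if (after < envelope_watts && after > best_after) {
            best = (int)i;
            best_after = after;
        }
    }
    return best;
}
```

With the worked example from Section 4.2 (system at 190W, envelope at 180W, candidate processes at 20W and 25W), the helper picks the 20W process, leaving 170W rather than 165W.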

Joseph and Martonosi [11] estimate power using performance counters in a simulator (Wattch [6], integrated with SimpleScalar [3]), and on real hardware (Pentium Pro). Their power estimates are derived offline after multiple benchmark runs, since they require more PMCs than are supported by their hardware. They use PMCs to estimate power consumption for different architectural resources, but are unable to estimate power for 24% of the chip, and assume peak power for those structures.

Wu et al. [16] use microbenchmarks to estimate power consumption of functional units for the Pentium 4 architecture. Dynamic activity for all portions of the chip is not available, and if it were, the approach would exceed the limit of four PMCs for run-time power prediction.

Economou et al. [9] use performance counters to predict power of a blade server. They use custom microbenchmarks to profile the system and estimate power for different hardware components (CPU, memory, and hard drive), achieving estimates with 10% average error. Like our work, their kernel benchmarks for deriving their model are independent of the benchmarks used for testing.

Lee and Brooks [12] use statistical inference models to predict power consumption based on hardware design parameters. They build correlations based on hardware parameters, and use the most significant parameters for training a model to estimate power consumption. They randomly sample a large design space and estimate power consumption based on previous power values for the same design space, which is profiled a priori. Unfortunately, this methodology requires sampling the same applications for which they are trying to estimate power consumption. This makes the approach dependent on having already trained on the applications of interest. The method is not feasible when executing programs outside of the sampled design space. At run time, scheduling is not known a priori, and the behavior of programs changes depending on their interaction with other processes sharing the same cache resources.

Curtis-Maury et al. [8] predict optimal concurrency values using machine learning to estimate power consumption at various thread counts based on power consumption at the highest thread count for the NAS benchmark suite. They require that applications be profiled at one thread count before predicting power consumption at other thread counts, making their models application dependent.

Contreras and Martonosi [7] use PMCs on an XScale to estimate power consumption online at different frequencies. Like Joseph and Martonosi, they gather data from multiple PMCs by running benchmarks several times, and join sampled data from different runs together to derive better estimates. This methodology is not feasible at run-time, since it requires three times as many PMCs as can be sampled in real time. Thus it cannot be used by an OS scheduler or application developer executing multiple programs in parallel at run time. Their parameterized linear model does not apply here, as their work for an in-order single core does not scale to our platform. This is due to the increased complexity of our multicore, out-of-order, power-efficient, high-performance platform.

Merkel and Bellosa [13] use performance counters to estimate power per processor in an 8-way symmetric multiprocessing (SMP) system, and shuffle processes to reduce overheating of individual processors. Bautista et al. [5] schedule processes on a multicore system for real-time applications. They assume static power consumption for all appli-

Bautista et al. leverage chip-wide DVFS to reduce energy when there is slack in the application. Isci et al. [10] analyze power management policies for a given power budget. They leverage per-core and domain-wide DVFS to retain performance while remaining within a power budget. Unlike their work, we do not require on-core current sensors, but instead use real hardware to evaluate our work. While their work involves core adaptation policies, we examine OS scheduling decisions with fixed cores. We will expand our approach when per-core DVFS becomes available.

We estimate power consumption for an aggressively power-efficient, high-performance platform. Our platform differs from those in previous literature in three ways: resources are shared across cores, performance counters do not individually exhibit high correlation with power, and power varies significantly across benchmarks, depending on workload. Additionally, since we desire real-time power estimation, we are confined to the number of performance counters available at run-time in a single run. This prevents us from making fair comparisons of this work with previous work. Our framework can estimate power consumption for new programs that we have not run before, and evaluates results on a multicore system, providing a feasible method of estimating power consumption per core. We leverage this to perform real-time power-aware per-core scheduling on a CMP, which was not previously possible.

6 Conclusions

We derive an analytic, piece-wise linear power model that accurately maps performance counters to power consumption, independently of program behavior, for the SPEC 2006, SPEC-OMP, and NAS benchmark suites. We use custom microbenchmarks to generate the data for creating our estimation model. The microbenchmarks stress four counters selected based on their correlation with measured power consumption values.

We leverage our power model to perform run-time, power-aware thread scheduling. We suspend and resume processes based on power consumption, ensuring that a given power envelope is not exceeded. Our work is likely useful for consolidated data centers, where virtualization leads to multiple servers on a single CMP. Using our estimation methodology, we can accurately estimate power consumption at the core granularity, allowing for accurate billing of power usage and cooling costs. Estimating per-core power consumption is challenging, since some resources are shared across cores (such as caches, the DRAM memory controller, and off-chip memory accesses). Additionally, current machines only allow us to monitor a few performance counters per core, depending on the platform (Intel, Sun, AMD).

For future work, we will expand this study to schedule multi-threaded workloads. This adds a layer of complexity, since all the threads of a specific process need to be suspended at once: if only one thread is suspended, it might be holding locks, or other threads might be forced to wait for it at barrier points. Another strategy is to force the scheduler to wait for specific thread spawn points before changing the number of threads.

We find that temperature plays a significant role in power consumption. When supported by the hardware, we will incorporate thermal data into our estimation model and thread scheduling strategies. We also wish to investigate the benefits of a power-aware scheduler that reduces the frequency of chosen processor cores (DVFS) rather than suspending processes. Much more sophisticated schedulers (e.g., with knowledge of priorities or real-time demands) are possible, and experimenting with these to observe the effects on throughput will be interesting. As the number of cores and power demands grow, process scheduling will be critical for efficient computing.

References

[1] AMD Phenom(TM) Quad-Core Processor Die. www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51 104 572 5731̂50441̃17770,00.html, Jan. 2008.

[2] V. Aslot and R. Eigenmann. Performance characteristics of the SPEC OMP2001 benchmarks. In Proceedings of the European Workshop on OpenMP, Sept. 2001.

[3] T. Austin. SimpleScalar 4.0 release note. http://www.simplescalar.com/.

[4] D. Bailey, T. Harris, W. Saphir, R. Van der Wijngaart, A. Woo, and M. Yarrow. The NAS parallel benchmarks 2.0. Report NAS-95-020, NASA Ames Research Center, Dec. 1995.

[5] D. Bautista, J. Sahuquillo, H. Hassan, S. Petit, and J. Duato. A simple power-aware scheduling for multicore systems when running real-time applications. In Proc. 22nd IEEE/ACM International Parallel and Distributed Processing Symposium, pages 1–7, Apr. 2008.

[6] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A framework for architectural-level power analysis and optimizations. In Proc. 27th IEEE/ACM International Symposium on Computer Architecture, pages 83–94, June 2000.

[7] G. Contreras and M. Martonosi. Power prediction for Intel XScale processors using performance monitoring unit events. In Proc. IEEE/ACM International Symposium on Low Power Electronics and Design, pages 221–226, Aug. 2005.

[8] M. Curtis-Maury, K. Singh, S. A. McKee, F. Blagojevic, D. S. Nikolopoulos, B. R. de Supinski, and M. Schulz. Identifying energy-efficient concurrency levels using machine learning. In Proc. 1st International Workshop on Green Computing, Sept. 2007.

[9] D. Economou, S. Rivoire, C. Kozyrakis, and P. Ranganathan. Full-system power analysis and modeling for server environments. In Workshop on Modeling, Benchmarking and Simulation (MoBS) at ISCA, June 2006.

[10] C. Isci, A. Buyuktosunoglu, C.-Y. Cher, P. Bose, and M. Martonosi. An analysis of efficient multi-core global power management policies: Maximizing performance for a given power budget. In Proc. IEEE/ACM 40th Annual International Symposium on Microarchitecture, pages 347–358, Dec. 2006.
[11] R. Joseph and M. Martonosi. Run-time power estimation in high-
performance microprocessors. In Proc. IEEE/ACM International
Symposium on Low Power Electronics and Design, pages 135–140,
Aug. 2001.
[12] B. Lee and D. Brooks. Accurate and efficient regression modeling
for microarchitectural performance and power prediction. In Proc.
12th ACM Symposium on Architectural Support for Programming
Languages and Operating Systems, pages 185–194, Oct. 2006.
[13] A. Merkel and F. Bellosa. Balancing power consumption in mul-
tiprocessor systems. In EuroSys ’06: Proceedings of the ACM
SIGOPS/EuroSys European Conference on Computer Systems, pages
403–414, Apr. 2006.
[14] Standard Performance Evaluation Corporation. SPEC CPU bench-
mark suite. http://www.specbench.org/osg/cpu2006/, 2006.
[15] V. Weaver and S. A. McKee. Can hardware performance counters
be trusted? In Proc. IEEE International Symposium on Workload
Characterization, Sept. 2008.
[16] W. Wu, L. Jin, and J. Yang. A systematic method for functional
unit power estimation in microprocessors. In Proc. 43rd ACM/IEEE
Design Automation Conference, pages 554–557, July 2006.