
Systems administrators monitor processor utilization to gauge efficiency and to do capacity planning. One of the pain points raised by customers using POWER5 and POWER6 environments is that they cannot measure how efficiently they are using the SMT threads. IBM Power Systems introduced a new measurement counter, named PURR (Processor Utilization Resource Register), to accurately measure processor utilization in virtualized environments. PURR gives a good measure of how the processor is utilized as a whole. With SMT, however, the challenge is to understand how much spare capacity the physical processor really has. SMT stands for simultaneous multi-threading, a hardware threading technology built into the processor that helps make efficient use of the different units within a core.

What is different in POWER7


Up to POWER6, when SMT is enabled and the workload utilizes just one thread of a processor core, the processor utilization is reported as 100%, which means customers are really challenged in understanding the efficiency of SMT. From POWER7, the hardware PURR counter is enhanced to measure the spare or idle capacity available at the hardware-thread level. This helps tools on POWER7 report the idle capacity more accurately.

For example, take a physical processor core with SMT-2 enabled, which means two hardware threads, and say I start a single-threaded application and bind it to run on this core. On POWER6 the measured utilization will be 100% for this core, while on POWER7 the measured utilization will be 70-80%. Hence on POWER7 a user will be able to tell how efficiently the core (specifically in SMT mode) is utilized by the workload.

Reference: Understanding Utilization

How to interpret CPU Utilization Metrics


AIX provides six key CPU utilization metrics, as follows:

User% - Percentage of time the CPU spent in user-mode execution (application code)
Kernel% - Percentage of time the CPU spent in kernel-mode execution (system calls, OS code, etc.)
Wait% - Percentage of time the CPU is idle with pending I/O
Idle% - Percentage of time the CPU is idle (spinning idle cycles)
Physical Consumption (PC) - Fraction of physical cores assigned by the Hypervisor to, or consumed by, the LPAR
Entitlement Consumption% (%entc) - Percentage of physical cores consumed against the entitled number of cores

User% + Kernel% gives the percentage of time the CPU is actually busy doing useful work.

The percentages, except for entitlement consumption, are always relative to MAX(Entitled Cores, Physical Consumption). For example, if the partition is booted with an entitlement of 4.0 in uncapped mode, then the percentages will be relative to 4 until the Physical Consumption crosses 4; once it does, the percentages become relative to the Physical Consumption. This is done to ensure that User% + Kernel% + Wait% + Idle% always equals 100.
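To make this normalization concrete, here is a minimal C sketch (illustrative only, with hypothetical numbers; not AIX source code) that computes the base against which the percentages are reported:

#include <stdio.h>

/* Percentages are reported relative to MAX(entitled cores, physical consumption) */
static double pct_base(double entitled, double physc)
{
    return (physc > entitled) ? physc : entitled;
}

int main(void)
{
    double entitled = 4.0; /* hypothetical entitlement of an uncapped LPAR */
    printf("base below entitlement: %.1f\n", pct_base(entitled, 3.0)); /* 4.0 */
    printf("base above entitlement: %.1f\n", pct_base(entitled, 5.2)); /* 5.2 */
    return 0;
}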

A common question asked is: why is %entc vastly different from User% + Kernel%?

As mentioned earlier, on POWER6 and earlier processor models there is no mechanism to measure how effectively the core is utilized when it is running in SMT mode. Hence, if a single-threaded workload keeps the core busy, then User% + Kernel% will be exactly the same as %entc, leaving the impression that the core has no headroom to take additional workloads on the other threads.

On POWER7 processors this issue is addressed by accounting for the idle capacity of the threads. Hence for the above case User% + Kernel% will not match %entc. The wider the delta between %entc and User% + Kernel%, the more spare capacity there is in the core, and also the poorer the effective utilization of the core.

How do I know how many actual physical cores my partition is consuming to do useful work?
(Basically, non-idle work)

Non-idle work is the amount of actual work done by the cores, excluding all the time spent on the CPU just running idle cycles (that is, excluding the CPU consumption of the process named waitproc). Basically every OS has a process like waitproc or the System Idle process. This process always has the lowest priority (128), and when the CPU has nothing else to do, this process gets dispatched. AIX does the same, but AIX cedes the core back to PHYP within a few cycles of the idle process running on the CPU.

Non-idle work is represented by User% + Kernel%. The actual number of cores used for non-idle work is (User% + Kernel%) × MAX(Entitled Cores, Physical Consumption), since the percentages are relative to that base. This is also provided as the physb metric in some tools. For example, if my Physical Consumption is 3 cores and my User% + Kernel% is 50%, then my metrics would look as follows:

'User% + Kernel%' - 50%

'Idle% + Wait%' - 50%

Physical Consumption - 3.0

%entc - 75% (assuming entitlement is 4.0, hence 3/4 = 75%)

No. of cores used for non-idle work = 50% of 4.0 = 2.0 cores

%entc for non-idle work = 50% (2.0/4.0 = 50%)

Basically the Hypervisor has given 3 cores to the partition as the workload demands it, but the 3 cores are not individually utilized to their fullest, so effectively 1.0 of the 3 cores ends up just running idle cycles.
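The arithmetic above can be captured in a small C sketch (the values are the hypothetical ones from the example; this is not AIX source code):

#include <stdio.h>

int main(void)
{
    double entitled = 4.0;  /* entitled cores                   */
    double physc    = 3.0;  /* physical consumption             */
    double busy_pct = 50.0; /* User% + Kernel% from the example */

    /* Percentages are relative to MAX(entitled, physc) */
    double base       = (physc > entitled) ? physc : entitled;
    double busy_cores = busy_pct / 100.0 * base;       /* 2.0 cores of useful work     */
    double idle_cores = physc - busy_cores;            /* 1.0 core running idle cycles */
    double busy_entc  = busy_cores / entitled * 100.0; /* 50% of entitlement           */

    printf("non-idle cores = %.1f, idle cores = %.1f, non-idle %%entc = %.0f%%\n",
           busy_cores, idle_cores, busy_entc);
    return 0;
}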

AIX always schedules jobs on the primary thread of every core; once all the cores' primary threads are busy, jobs are scheduled on the secondary and tertiary threads. Hence when SMT is on and no jobs are scheduled on the other threads of a core, those threads are idle and are accounted as idle capacity.

How does Virtual Processor Folding affect the utilization metrics?

Virtual Processor Folding comes into play when virtual processors are idle, i.e. they have no jobs to run. When a virtual processor is folded, its core is not counted in Physical Consumption or %entc. This means Physical Consumption and %entc will decrease once the virtual processor is folded, equivalent to the decrease that happens when the system is idle.

Does the number of Virtual Processors impact the effective utilization of the cores?

Yes, the number of virtual processors impacts the effective utilization of the cores. The more VPs there are, the more jobs get spread across primary threads, leaving the other threads idle and in turn decreasing the effective utilization of all the threads in each core. With fewer VPs, more jobs are consolidated onto a minimal number of cores, in turn loading the other SMT threads and increasing the effective utilization of all the threads.

Does the number of SMT threads impact the effective utilization numbers?

Yes. If the core is running in single-thread mode and the workload is using the entire core, then %entc will be equal to User% + Kernel%. Now if I switch the core to SMT-2 mode, which effectively enables one more thread in the core while the workload remains single-threaded, the newly enabled thread is idle; this reduces User% + Kernel% while leaving %entc unchanged. The same argument holds when you enable SMT-4, as three additional threads are enabled in the core.

Note: The SMT behavior and workload gain also depend on the workload characteristics.

Is it possible to map the CPU utilization (physical cores, entitlement consumption) to an individual process?
Yes. The topas process panel provides the CPU utilization numbers. The percentages shown there are relative to MAX(Entitled Cores, Physical Consumption), similar to what was mentioned earlier for User% and Kernel%. Hence if a partition is using 150% of its entitlement, a user can look at a process's CPU utilization % and calculate how much of the entitlement is used by that process. For example, if process X shows 90% CPU, this translates to 90% of 150%, which is 135% of entitlement. The sum of all process CPU% will not exceed 100%.

Note: The above example applies only when the partition is running at nominal frequency.
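As a quick sketch of that conversion (using the hypothetical numbers from the example above; this is not the topas implementation):

#include <stdio.h>

int main(void)
{
    double entc_pct = 150.0; /* partition consumes 150% of its entitlement */
    double proc_pct = 90.0;  /* process CPU% as shown in the topas panel   */

    /* Process CPU% is relative to MAX(entitlement, consumption), so scale
       it by that base to express it as a percentage of entitlement */
    double base_pct  = (entc_pct > 100.0) ? entc_pct : 100.0;
    double proc_entc = proc_pct / 100.0 * base_pct; /* 135% of entitlement */

    printf("process consumes %.0f%% of entitlement\n", proc_entc);
    return 0;
}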

Performance measurement has been one of the most interesting topics in recent times. With various advancements in computer architecture and virtualization, the methods used for measuring the performance of a system have changed considerably. In this article we will walk through the various innovations that have gone in over the years.

Legacy systems
In legacy systems the method used was relatively simple and straightforward. The decrementer generates an interrupt every 10 ms, and depending on the current execution mode, that particular interval (10 ms) is charged to that mode (user, sys, idle, or wait). This method does not give an accurate picture of processor utilization, since the time period over which the decision is made is relatively large. With advancements in virtualization (the introduction of shared partitions) and computer architecture (simultaneous multi-threading), this method was no longer reliable. This led to the introduction of special-purpose registers for measuring processor utilization.
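The tick-based scheme can be sketched in a few lines of C (purely illustrative; this is not the AIX implementation):

/* Legacy tick accounting: each 10 ms decrementer interrupt charges the
   whole interval to whatever mode the CPU is in at that instant, so
   anything that happened between two interrupts is invisible. */
enum mode { MODE_USER, MODE_SYS, MODE_IDLE, MODE_WAIT };

static unsigned long ticks[4]; /* accumulated 10 ms ticks per mode */

static void decrementer_interrupt(enum mode current_mode)
{
    ticks[current_mode]++; /* the entire 10 ms is billed to one mode */
}

int main(void)
{
    decrementer_interrupt(MODE_SYS); /* e.g. a tick landing in kernel mode */
    return 0;
}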

PURR
PURR stands for Processor Utilization Resource Register. This register is available from POWER5 onward. Each processor core has one PURR per hardware thread (so two in SMT-2 mode). The registers are updated by the hypervisor and can only be read by AIX. Over a period of time, the sum of the PURR ticks across the threads is almost equal to the timebase counter. PURR-based calculations of idle and busy CPU time are more accurate than the traditional tick-based measurement in SMT and shared environments. AIX uses these counters for performance measurement and accounting purposes.
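The raw PURR ticks are exposed by libperfstat in the puser, psys, pidle, and pwait fields of perfstat_cpu_total_t, so a PURR-based busy percentage can be derived from two snapshots, as in this minimal sketch:

#include <libperfstat.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    perfstat_cpu_total_t o, n;

    /* Take two snapshots of the raw PURR tick counters, 5 seconds apart */
    perfstat_cpu_total(NULL, &o, sizeof(o), 1);
    sleep(5);
    perfstat_cpu_total(NULL, &n, sizeof(n), 1);

    double du = (double)(n.puser - o.puser), ds = (double)(n.psys - o.psys);
    double di = (double)(n.pidle - o.pidle), dw = (double)(n.pwait - o.pwait);

    /* Busy% = (user + sys) PURR ticks over all PURR ticks in the interval */
    printf("PURR-based busy%% = %.1f\n", (du + ds) / (du + ds + di + dw) * 100.0);
    return 0;
}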

SPURR
IBM's energy-saving features let the user modify the CPU frequency. The frequency can be set to a selected value (static power saver mode) or varied dynamically (dynamic power saver mode). So each PURR tick does not necessarily represent the same processing capacity, as the CPU frequency varies. This calls for a frequency-dependent counter, and hence the SPURR register was introduced.

SPURR (Scaled PURR) was introduced in POWER6. SPURR-based metrics are used for accounting, and PURR-based metrics are used to denote utilization. SPURR counters tick proportionally to the processor frequency.

The SPURR and PURR counters increment the same way when the CPU is running at nominal frequency. When running at a lower frequency the SPURR ticks are fewer than the PURR ticks, and when running at a higher frequency (turbo mode) the SPURR ticks are greater. The ratio of the SPURR and PURR deltas, multiplied by the nominal CPU frequency, gives the current frequency:

Current Frequency = (Delta SPURR / Delta PURR) × Nominal Frequency

Note: The above formula holds good only when there is no change in the frequency during the interval over which the delta is taken. Otherwise, it gives the average frequency over the interval.
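Plugging hypothetical counter deltas into the formula (and assuming, as noted, that the frequency stays constant over the interval):

#include <stdio.h>

int main(void)
{
    double delta_spurr = 500000.0;  /* hypothetical SPURR ticks in the interval */
    double delta_purr  = 1000000.0; /* hypothetical PURR ticks in the interval  */
    double nominal_mhz = 3550.0;    /* nominal processor frequency              */

    /* Half as many SPURR ticks as PURR ticks => running at half speed */
    printf("current frequency ~ %.0f MHz\n",
           delta_spurr / delta_purr * nominal_mhz); /* 1775 MHz */
    return 0;
}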

pmcycles -M
The pmcycles command has an unsupported flag -M which gives the current frequency of the
processor.

The pmcycles -M command counts the number of instructions executed in an interval to calculate the current frequency. The other pmcycles options get the frequency either from the ODM or from the IPL, which will not be correct if the frequency is being changed dynamically. pmcycles -M might be inaccurate if the hypervisor is interrupted and does not get all the CPU cycles available during the interval.

Consider a hypothetical situation where a processor is running at half its nominal speed. The SPURR-based metrics scale with the CPU's current frequency, so the SPURR-based CPU utilization shows how much CPU capacity goes unused. SPURR is used for accounting purposes, while PURR is used to determine the actual utilization.

[Figure: PURR- vs. SPURR-based physical consumption at nominal, reduced, and turbo frequencies]
Consider an LPAR running with 4 physical CPUs. When the server CPUs are running a 50% computational load at nominal frequency F, both PURR- and SPURR-based metrics report 2 physical cores consumed. When the server CPUs are running at frequency 0.5F, PURR-based physc reports 4 cores consumed, while SPURR-based physc reports 2. Here, SPURR indicates additional available CPU capacity, because the CPUs are running at a reduced frequency and increasing the frequency could provide additional capacity. PURR-based metrics indicate that all CPU capacity is being consumed. So SPURR-based metrics are used extensively for accounting and capacity planning. The same applies when running in turbo mode, as can be seen in the above graph.
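The scaling implied by this example can be sketched in one line: SPURR-based physc is approximately PURR-based physc scaled by the ratio of current to nominal frequency (an illustration of the behaviour described above, not a library API):

#include <stdio.h>

int main(void)
{
    double purr_physc = 4.0; /* PURR-based cores consumed (hypothetical) */
    double freq_ratio = 0.5; /* current frequency / nominal frequency    */

    /* 4 cores busy at half speed ~ 2 cores of nominal-frequency capacity */
    printf("SPURR-based physc ~ %.1f cores\n", purr_physc * freq_ratio);
    return 0;
}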

lparstat

The lparstat -E and -w flags are now available to show both the SPURR- and PURR-based utilization metrics.

lparstat -E sample output:

[Figure: lparstat -E sample output]
The above output was collected from a server with a nominal frequency of 3550 MHz. There are 128 logical CPUs running in 4-way SMT mode, and hence a processing capacity of 32 cores. The actual metrics use the PURR counters and the normalized metrics use the SPURR counters. The values shown for each mode are the actual physical cores consumed in that mode. Adding up all the values (user, sys, idle, wait) gives the total entitlement of the partition in both the actual and normalized views. But in shared uncapped mode this need not be true, as the cores consumed can exceed the entitlement; in that case the totals of all values might not be the same. Also, the idle value has been modified to show the actual entitlement available, so the values shown in this view should not be compared with the default views of lparstat. The idle value shown here is the available capacity.

idle = Entitlement - ( user + sys + wait )

As you can see, when the partition is running at a reduced frequency, the available capacity (idle) shown by the two counters differs. The current idle capacity is shown by PURR. The idle value shown by SPURR is an estimate of the idle capacity if the CPU were running at nominal frequency.

lparstat -Ew sample output:

Notes:
lparstat -Ew displays the same metrics in long output lines. The values in square brackets indicate the actual percentage of the entitlement that the partition is consuming. Also, the -Ew flags dynamically adjust the display precision based on the value.

All the utilization reporting commands (sar, iostat, mpstat, lparstat) have been modified to use only PURR-based counters. All the SPURR-based metrics are available through the libperfstat APIs, which can be exploited by user applications. The lparstat -E and -w flags are available starting with AIX V5.3 TL09 SP7 and AIX V6.1 TL02 SP7.

Programming CPU Utilization using the new perfstat interfaces

In this article we will look at the new interfaces provided by the perfstat library that can be used to get CPU utilization. Traditionally, the perfstat library provided the raw CPU counters, which had to be used to calculate the metrics needed by the user. But with AIX 6.1 TL07 and AIX 7.1 TL01, APIs are available to calculate these metrics, so that the user can readily use the values rather than doing their own calculation, which could sometimes result in discrepancies when compared with the AIX tools.

Introduction:
Prior to the releases mentioned above, users had to use interfaces like perfstat_cpu_total( ) and perfstat_partition_total( ) to get CPU counters for system-level utilization, and perfstat_cpu( ) for per-CPU utilization. These interfaces provide only the raw counter values. Also, there was no interface to gather metrics at a per-process level. With the latest levels of AIX, new interfaces have been introduced that provide calculated utilization values like cpu% and process%, similar to the values shown by topas.
The new interfaces added are
1. perfstat_partition_config( )
2. perfstat_cpu_util( )
3. perfstat_process( )
4. perfstat_process_util( )
perfstat_partition_config:
perfstat_partition_config provides information about the LPAR's configuration, the underlying hardware, and the operating system running on the system. Most of the information provided by the "lparstat -i" command can be obtained using this API.
perfstat_cpu_util:
perfstat_cpu_util takes the raw counters provided by either perfstat_cpu( ) or perfstat_cpu_total( ), along with some additional information, as input and provides the CPU utilization values. The same interface is used to calculate both CPU-level and LPAR-level utilization. This removes a lot of the calculation code that users who program their own monitoring tools have had to write until now. The sample code below demonstrates the use of these two interfaces.
Sample Code:

#include <libperfstat.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    perfstat_rawdata_t data;
    perfstat_cpu_util_t util;
    perfstat_cpu_total_t newt, oldt;
    perfstat_partition_config_t config;

    /* Use perfstat_partition_config to get the LPAR's configuration */
    if (perfstat_partition_config(NULL, &config, sizeof(perfstat_partition_config_t), 1) < 1) {
        printf("perfstat_partition_config failed\n");
        exit(1);
    }

    /* Get the current CPU counters using perfstat_cpu_total */
    if (perfstat_cpu_total(NULL, &oldt, sizeof(perfstat_cpu_total_t), 1) < 1) {
        printf("perfstat_cpu_total failed\n");
        exit(1);
    }

    /* Print the headers */
    printf("%s Configuration ", config.partitionname);
    printf("type=%s mode=%s smt=%u lcpu=%u mem=%llu\n",
           (config.conf.b.shared_enabled ? "Shared" : "Dedicated"),
           (config.conf.b.shared_enabled ?
               (config.conf.b.capped ? "Capped" : "Uncapped") :
               (config.conf.b.donate_enabled ? "Donating" : "Capped")),
           config.smtthreads, config.lcpus, config.mem.online);
    printf("%5s %5s %5s %5s %6s\n", "User", "Sys", "Idle", "Wait", "Physc");
    printf("%5s %5s %5s %5s %6s\n", "----", "----", "----", "----", "-----");

    /* Fill the raw-data structure that describes the CPU counter source
       (cpu_total here, as opposed to cpu or partition_total), the number
       of structures passed, and the addresses of the old and current values */
    data.type = UTIL_CPU_TOTAL;
    data.curstat = &newt;
    data.prevstat = &oldt;
    data.sizeof_data = sizeof(perfstat_cpu_total_t);
    data.cur_elems = 1;
    data.prev_elems = 1;

    while (1) {
        sleep(5);
        if (perfstat_cpu_total(NULL, &newt, sizeof(perfstat_cpu_total_t), 1) < 1) {
            printf("perfstat_cpu_total failed\n");
            exit(1);
        }

        /* Use perfstat_cpu_util to turn the two snapshots into utilization values */
        if (perfstat_cpu_util(&data, &util, sizeof(perfstat_cpu_util_t), 1) < 1) {
            printf("perfstat_cpu_util failed\n");
            exit(1);
        }

        printf("%5.1f %5.1f %5.1f %5.1f %6.1f\n", util.user_pct, util.kern_pct,
               util.idle_pct, util.wait_pct, util.physical_consumed);

        /* The current snapshot becomes the previous one for the next interval */
        memcpy(&oldt, &newt, sizeof(perfstat_cpu_total_t));
    }
}
Note:
perfstat_cpu_util uses the PURR counters to calculate the CPU utilization, not the logical ticks.
Output

perfstat_process:
perfstat_process provides the values of process-related metrics, including the process PID, process name, priority, nice value, process size, CPU metrics, and disk I/O. Most of the metrics provided by getprocs64( ) are provided by this interface.

perfstat_process_util:
perfstat_process_util takes the raw counters provided by perfstat_process and provides the calculated utilization metrics. The sample code below illustrates the usage of these two interfaces. The user% and kernel% are relative to the partition's entitlement.

#include <libperfstat.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    perfstat_process_t new, old, util;
    perfstat_id_t id;
    perfstat_rawdata_t buf;

    /* Get the process id as input */
    if (argc < 2) {
        printf("usage: %s <pid>\n", argv[0]);
        exit(1);
    }
    strcpy(id.name, argv[1]);

    /* Get the current values for the provided pid */
    perfstat_process(&id, &old, sizeof(perfstat_process_t), 1);

    /* Print the headers */
    printf("Process    SizeKB  Priority      User%%    Kernel%%\n");

    while (1) {
        sleep(5);
        perfstat_process(&id, &new, sizeof(perfstat_process_t), 1);

        /* Fill the raw-data structure with the type of structure being
           passed, the number of structures, and the addresses of the
           current and previous structures */
        bzero(&buf, sizeof(perfstat_rawdata_t));
        buf.type = UTIL_PROCESS;
        buf.curstat = &new;
        buf.prevstat = &old;
        buf.sizeof_data = sizeof(perfstat_process_t);
        buf.cur_elems = 1;
        buf.prev_elems = 1;

        /* Use perfstat_process_util to get process-specific utilization */
        perfstat_process_util(&buf, &util, sizeof(perfstat_process_t), 1);

        printf("%6s %9lld %9d %10.2f %10.2f\n", util.proc_name, util.proc_size,
               util.proc_priority, (double)util.ucpu_time, (double)util.scpu_time);

        memcpy(&old, &new, sizeof(perfstat_process_t));
    }
}
Output
The above interfaces should help reduce the code changes that are needed when new hardware and software features change the formula used to calculate the utilization values.
