Chapter 1: Measuring & Understanding Performance
José Mª González Linares

Dept. de Arquitectura de Computadores
[Adapted from chapter 1 of Computer Organization and Design, 5th Edition, Patterson & Hennessy, © 2013, MK]
Index
Introduction
The Computer Revolution
The Power Wall
Concluding Remarks

§1.1 Introduction
What is a computer?
A computer is a general-purpose device that can be programmed to carry out a finite set of arithmetic or logical operations.
◦ We use programs, i.e. sequences of instructions, to solve problems.
◦ The operating system helps us to code and execute our programs.

Classes of Computers
Desktop computers
◦ General purpose, variety of software
◦ Subject to cost/performance tradeoff
Server computers
◦ Network based
◦ High capacity, performance, reliability
◦ Range from small servers to building sized
Embedded computers
◦ Hidden as components of systems
◦ Stringent power/performance/cost constraints
What You Will Learn
How programs are translated into the
machine language
◦ And how the hardware executes them
The hardware/software interface
What determines program performance
◦ And how it can be improved
How hardware designers improve performance
How software designers improve performance
◦ And what parallel processing is

Does my computer know how to multiply?
Try the following Java code, for example by pasting it into https://ideone.com/
class Ideone
{
    public static void main(String[] args)
    {
        // Multiply two decimal literals; both are stored as binary doubles.
        System.out.println(123.456 * 0.001);
    }
}
Why is the result not exact?
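One way to explore the question: decimal literals such as 0.001 generally have no finite binary expansion, so Java stores the nearest double instead. The sketch below is only an illustration (the class name ExactDouble and the method storedValue are made up for this example); it uses java.math.BigDecimal to print the exact value held by each double, next to exact decimal arithmetic on the same digits.

```java
import java.math.BigDecimal;

class ExactDouble {
    // Returns the exact decimal expansion of the binary double x.
    static BigDecimal storedValue(double x) {
        return new BigDecimal(x);
    }

    public static void main(String[] args) {
        // What the machine really stores for each literal.
        System.out.println(storedValue(0.001));
        System.out.println(storedValue(123.456));
        // Binary floating-point product, as in the slide.
        System.out.println(123.456 * 0.001);
        // Exact decimal product of the written digits, for comparison.
        System.out.println(new BigDecimal("123.456").multiply(new BigDecimal("0.001")));
    }
}
```

Comparing the first two printed lines with the literals shows that the rounding happens before the multiplication even starts.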

§1.2 The Computer Revolution
History of computing hardware
First generation: mechanical and electromechanical devices built
to solve some particular problem, like
◦ The abacus for arithmetic tasks
◦ The Antikythera mechanism for astronomical calculations
◦ The Pascaline for automatic basic arithmetic operations
◦ Model K, the first relay-based calculator
In 1837 Charles Babbage described the first general-purpose programmable computer, the Analytical Engine:
◦ It used punch cards for input
◦ A steam engine powered the computer
◦ Gears and shafts represented numbers
◦ Ada Lovelace described how to program it
◦ Important concepts are first introduced: input/output system,
control unit and stored program

Second generation, vacuum tubes are used:
◦ Shannon rediscovers Boolean algebra
◦ Von Neumann architecture is proposed
◦ Machines like Z3, Colossus, ENIAC or EDVAC
are built
[Figure: von Neumann architecture, with memory system, processing unit, control unit and input/output system]
Third generation, transistors substitute
vacuum tubes:
◦ Smaller, require less power, more reliable, …
◦ Disk data storage units are built
◦ A secondary processor controls the
input/output system: multitasking is possible
◦ The first family of computers, the IBM
System/360, is a great success

Fourth generation, VLSI integrated circuits
greatly increase available computing power
◦ Microprocessors like Intel 4004 are built.
◦ Large capacity RAM appears.
◦ Advances in integration allow performance
improvements, like fast cache memory or
multi-core CPUs.

Progress in computer technology
◦ Underpinned by Moore’s Law
In 1965, Gordon Moore (later a co-founder of Intel) predicted that the number of transistors that can be integrated on a single chip would double every year; in 1975 he revised the rate to about every two years

Makes novel applications feasible
◦ Computers in automobiles
◦ Cell phones
◦ Human genome project
◦ World Wide Web
◦ Search Engines
Computers are pervasive
The Processor Market

The PostPC Era
Personal Mobile Device (PMD)
◦ Battery operated
◦ Connects to the Internet
◦ Hundreds of dollars
◦ Smart phones, tablets, electronic glasses
Cloud computing
◦ Warehouse Scale Computers (WSC)
◦ Software as a Service (SaaS)
◦ A portion of the software runs on a PMD and a
portion runs in the cloud
◦ Amazon and Google

§1.3 Understanding Performance
Why?
Purchasing perspective
◦ Given a collection of machines, which has the
best performance?
least cost?
best cost/performance?
Design perspective
◦ Faced with design options, which has the
best performance improvement?
least cost?
best cost/performance?
Both require
◦ A basis for comparison
◦ A metric for evaluation
Our goal is to understand which factors affect
performance and their relative weight

Algorithm
◦ Determines number of operations executed
Programming language, compiler, architecture
◦ Determine number of machine instructions executed
per operation
Processor and memory system
◦ Determine how fast instructions are executed
I/O system (including OS)
◦ Determines how fast I/O operations are executed

Below Your Program
Application software
◦ Written in high-level language
System software
◦ Compiler: translates HLL code to
machine code
◦ Operating System: service code
Handling input/output
Managing memory and storage
Scheduling tasks & sharing resources
Hardware
◦ Processor, memory, I/O controllers

Levels of Program Code
High-level language
◦ Level of abstraction closer to
problem domain
◦ Provides for productivity and
portability
Assembly language
◦ Textual representation of
instructions
Hardware representation
◦ Binary digits (bits)
◦ Encoded instructions and data
Abstraction
Abstraction helps us deal with complexity
◦ Hide lower-level detail
Instruction set architecture (ISA)
◦ The hardware/software interface
Application binary interface
◦ The ISA plus system software interface
Implementation
◦ The details underlying an interface

Response Time and Throughput
Response time
◦ How long it takes to do a task
Throughput
◦ Total work done per unit time
e.g., tasks/transactions/… per hour
How are response time and throughput affected by
◦ Replacing the processor with a faster version?
◦ Adding more processors?

“If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?”
Seymour Cray

Why would Seymour Cray, the father of supercomputing, make such a statement?
He didn't like the idea of using massive parallelism to improve computational power.

Let's extend the analogy. Assume you need to plow a field of 10 hectares (100 m × 1000 m), with one furrow every 0.25 m (4000 furrows, each 100 m long).
We will try to reduce the time needed to complete all the furrows (i.e. tasks) without increasing the cost too much.
Basic solution:
Use a hoe.
Assume a man can plow at a speed of 1 m/min.
The latency of plowing one complete furrow would be 100 minutes.
The complete task would be finished in 400,000 minutes, that is, more than 6,666 hours or 277 days!
The throughput is
$\frac{4 \times 10^{3}\ \text{furrows}}{4 \times 10^{5}\ \text{min}} = 0.01\ \text{furrows/min}$
First improvement:
Increase the plowing speed, for example by using a horse.
Assume a speed of 10 m/min.
The latency would decrease to 10 minutes.
The task would be finished in 40,000 minutes (< 400,000 minutes), that is, around 28 days.
The throughput increases to
$\frac{4 \times 10^{3}\ \text{furrows}}{4 \times 10^{4}\ \text{min}} = 0.1\ \text{furrows/min}$

Second improvement:
Increase the number of ploughs using stronger (but probably slower) animals, for example a pair of oxen.
Assume the oxen can pull an eight-furrow plough at a speed of 5 m/min.
The latency would increase to 20 minutes.
But the task would be finished in 10,000 minutes (<< 400,000 minutes), that is, only 7 days!
The throughput increases to
$\frac{4 \times 10^{3}\ \text{furrows}}{1 \times 10^{4}\ \text{min}} = 0.4\ \text{furrows/min}$
Third improvement:
Use many weaker and slower animals, for example 1024 chickens.
Assume the chickens can each pull their own plough at a speed of 0.1 m/min.
The latency would increase to 1,000 minutes.
But the task would be finished in less than 4,000 minutes (<<< 400,000 minutes), that is, less than 3 days!
The throughput increases to approximately
$\frac{4 \times 10^{3}\ \text{furrows}}{4 \times 10^{3}\ \text{min}} = 1\ \text{furrow/min}$

In brief
Solution | Latency | Throughput | Total time
One man with a hoe | 100 min | 0.01 furrows/min | 400,000 min
One horse with a plough | 10 min | 0.1 furrows/min | 40,000 min
2 oxen with an 8-furrow plough | 20 min | 0.4 furrows/min | 10,000 min
1024 chickens with ploughs | 1,000 min | 1 furrow/min | 4,000 min
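The figures in the plowing analogy can be recomputed with a few lines of code. The following is a minimal sketch, not part of the original slides: the class Plowing and its method names are invented for illustration, and it assumes that work proceeds in whole passes (a ceiling division), which is why the chicken case comes out at the 4,000 minutes used as the rounded total.

```java
class Plowing {
    static final int FURROWS = 4000;              // furrows in the field
    static final double FURROW_LENGTH_M = 100.0;  // length of one furrow

    // Latency: minutes needed to finish one furrow at the given speed.
    static double latencyMin(double speedMPerMin) {
        return FURROW_LENGTH_M / speedMPerMin;
    }

    // Total time when `parallel` furrows are plowed in each pass.
    static double totalMin(double speedMPerMin, int parallel) {
        int passes = (FURROWS + parallel - 1) / parallel; // ceiling division
        return passes * latencyMin(speedMPerMin);
    }

    // Throughput in furrows per minute over the whole task.
    static double throughput(double speedMPerMin, int parallel) {
        return FURROWS / totalMin(speedMPerMin, parallel);
    }

    public static void main(String[] args) {
        String[] names = {"man", "horse", "oxen", "chickens"};
        double[] speeds = {1.0, 10.0, 5.0, 0.1};
        int[] parallel = {1, 1, 8, 1024};
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%s: latency %.0f min, total %.0f min, %.2f furrows/min%n",
                    names[i], latencyMin(speeds[i]),
                    totalMin(speeds[i], parallel[i]),
                    throughput(speeds[i], parallel[i]));
        }
    }
}
```

Note how latency gets worse from the horse to the chickens while throughput keeps improving; the two metrics move independently.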
Computer designers get comparable results by
◦ Increasing frequency and integration scale, using instruction-level parallelism, etc.
◦ Using hyper-threading, several cores, etc.
◦ Using massive parallelism, as in GPUs
Relative performance
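Define performance as $\text{Performance} = 1 / \text{Execution time}$
"X is n times faster than Y" means
$\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = n$
Example: time taken to run a program
◦ 10 s on A, 15 s on B
◦ $\frac{\text{Execution time}_B}{\text{Execution time}_A} = \frac{15\,\text{s}}{10\,\text{s}} = 1.5$
◦ So A is 1.5 times faster than B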

Measuring Execution Time
Elapsed time
◦ Total response time, including all aspects
Processing, I/O, OS overhead, idle time
◦ Determines system performance
CPU time
◦ Time spent processing a given job
Discounts I/O time, other jobs’ shares
◦ Comprises user CPU time and system CPU
time
◦ Different programs are affected differently by
CPU and system performance

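Example: ‘time’ in Unix
90.7u 12.9s 2:39 65%
◦ 90.7u: user CPU time (90.7 s)
◦ 12.9s: system CPU time (12.9 s)
◦ 2:39: elapsed time (2 min 39 s)
◦ 65%: CPU use, $\frac{90.7 + 12.9}{2 \times 60 + 39} \approx 0.65$
◦ The rest of the elapsed time goes to I/O and other processes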

CPU Clocking
Operation of digital hardware is governed by a constant-rate clock
[Figure: clock timing diagram showing the clock period, the clock cycles, data transfer and computation within a cycle, and the state update]
Clock period: duration of a clock cycle
◦ e.g., 250 ps = 0.25 ns = 250×10⁻¹² s
Clock frequency (rate): cycles per second
◦ e.g., 4.0 GHz = 4000 MHz = 4.0×10⁹ Hz

Processor frequency
Why has the processor clock rate barely increased in the last ten years?


CPU Time
$\text{CPU Time} = \text{CPU Clock Cycles} \times \text{Clock Cycle Time} = \frac{\text{CPU Clock Cycles}}{\text{Clock Rate}}$
Performance improved by
◦ Reducing number of clock cycles
◦ Increasing clock rate
◦ Hardware designer must often trade off clock rate
against cycle count

CPU Time Example
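Computer A: 2 GHz clock, 10 s CPU time
Designing Computer B
◦ Aim for 6 s CPU time
◦ Can do faster clock, but causes 1.2 × clock cycles
How fast must Computer B's clock be?
$\text{Clock Rate}_B = \frac{\text{Clock Cycles}_B}{\text{CPU Time}_B} = \frac{1.2 \times \text{Clock Cycles}_A}{6\,\text{s}}$
$\text{Clock Cycles}_A = \text{CPU Time}_A \times \text{Clock Rate}_A = 10\,\text{s} \times 2\,\text{GHz} = 20 \times 10^{9}$
$\text{Clock Rate}_B = \frac{1.2 \times 20 \times 10^{9}}{6\,\text{s}} = \frac{24 \times 10^{9}}{6\,\text{s}} = 4\,\text{GHz}$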
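The same calculation can be written as two small functions. This is a sketch under the numbers of the example above; the class ClockRate and its method names are hypothetical and only serve the illustration.

```java
class ClockRate {
    // Clock cycles consumed = CPU time (s) × clock rate (Hz).
    static double clockCycles(double cpuTimeS, double clockRateHz) {
        return cpuTimeS * clockRateHz;
    }

    // Clock rate (Hz) needed to execute `cycles` cycles in `targetTimeS` seconds.
    static double requiredRateHz(double cycles, double targetTimeS) {
        return cycles / targetTimeS;
    }

    public static void main(String[] args) {
        double cyclesA = clockCycles(10.0, 2.0e9); // Computer A: 10 s at 2 GHz
        double cyclesB = 1.2 * cyclesA;            // B needs 1.2× as many cycles
        double rateB = requiredRateHz(cyclesB, 6.0);
        System.out.printf("Computer B needs about %.1f GHz%n", rateB / 1e9);
    }
}
```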
Instruction Count and CPI
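$\text{Clock Cycles} = \text{Instruction Count} \times \text{Cycles per Instruction}$
$\text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}$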
Instruction Count for a program

◦ Determined by program, ISA and compiler
Average cycles per instruction
◦ Determined by CPU hardware
◦ If different instructions have different CPI
Average CPI affected by instruction mix

CPI Example
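Computer A: Cycle Time = 250 ps, CPI = 2.0
Computer B: Cycle Time = 500 ps, CPI = 1.2
Same ISA, so both execute the same instruction count I
Which is faster, and by how much?
$\text{CPU Time}_A = \text{Instruction Count} \times \text{CPI}_A \times \text{Cycle Time}_A = I \times 2.0 \times 250\,\text{ps} = I \times 500\,\text{ps}$ (A is faster…)
$\text{CPU Time}_B = \text{Instruction Count} \times \text{CPI}_B \times \text{Cycle Time}_B = I \times 1.2 \times 500\,\text{ps} = I \times 600\,\text{ps}$
$\frac{\text{CPU Time}_B}{\text{CPU Time}_A} = \frac{I \times 600\,\text{ps}}{I \times 500\,\text{ps}} = 1.2$ (…by this much)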
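Because both computers run the same instruction count, it is enough to compare the time per instruction, CPI × cycle time. A minimal sketch with the example's numbers follows; the class CpiCompare and its method are illustrative names, not part of the slides.

```java
class CpiCompare {
    // Average time per instruction in picoseconds: CPI × cycle time.
    static double timePerInstructionPs(double cpi, double cycleTimePs) {
        return cpi * cycleTimePs;
    }

    public static void main(String[] args) {
        double a = timePerInstructionPs(2.0, 250.0); // Computer A
        double b = timePerInstructionPs(1.2, 500.0); // Computer B
        // The instruction count cancels out in the ratio.
        System.out.printf("A: %.0f ps, B: %.0f ps, B/A = %.2f%n", a, b, b / a);
    }
}
```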
CPI in More Detail
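If different instruction classes take different numbers of cycles
$\text{Clock Cycles} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \text{Instruction Count}_i \right)$
The weighted average CPI is
$\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}} \right)$
where $\text{Instruction Count}_i / \text{Instruction Count}$ is the relative frequency of class $i$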

CPI Example
Alternative compiled code sequences using instructions in classes A, B, C
Class | A | B | C
CPI for class | 1 | 2 | 3
IC in sequence 1 | 2 | 1 | 2
IC in sequence 2 | 4 | 1 | 1
Sequence 1: IC = 5
◦ Clock Cycles = 2×1 + 1×2 + 2×3 = 10
◦ Avg. CPI = 10/5 = 2.0
Sequence 2: IC = 6
◦ Clock Cycles = 4×1 + 1×2 + 1×3 = 9
◦ Avg. CPI = 9/6 = 1.5
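The two sequences can be checked with a short program. The sketch below assumes the per-class CPIs implied by the arithmetic above (1, 2 and 3 cycles for classes A, B and C); the class WeightedCpi and its method names are made up for this example.

```java
class WeightedCpi {
    // Total clock cycles: sum over classes of CPI_i × instruction count_i.
    static int clockCycles(int[] cpi, int[] counts) {
        int cycles = 0;
        for (int i = 0; i < cpi.length; i++) {
            cycles += cpi[i] * counts[i];
        }
        return cycles;
    }

    // Average CPI: total cycles divided by total instruction count.
    static double averageCpi(int[] cpi, int[] counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        return (double) clockCycles(cpi, counts) / total;
    }

    public static void main(String[] args) {
        int[] cpi = {1, 2, 3};   // classes A, B, C
        int[] seq1 = {2, 1, 2};  // instruction counts, sequence 1
        int[] seq2 = {4, 1, 1};  // instruction counts, sequence 2
        System.out.println(clockCycles(cpi, seq1) + " cycles, CPI " + averageCpi(cpi, seq1));
        System.out.println(clockCycles(cpi, seq2) + " cycles, CPI " + averageCpi(cpi, seq2));
    }
}
```

Sequence 2 executes more instructions yet needs fewer cycles, so instruction count alone does not decide which sequence is faster.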
Performance Summary
$\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$
Performance depends on
◦ Algorithm: affects IC, possibly CPI
◦ Programming language: affects IC, CPI
◦ Compiler: affects IC, CPI
◦ Instruction set architecture: affects IC, CPI, Tc

A Simple Example
Op | Freq | CPI_i | Freq × CPI_i | Load in 2 cycles | Branch in 1 cycle | Two ALU ops at once
ALU | 50% | 1 | 0.5 | 0.5 | 0.5 | 0.25
Load | 20% | 5 | 1.0 | 0.4 | 1.0 | 1.0
Store | 10% | 3 | 0.3 | 0.3 | 0.3 | 0.3
Branch | 20% | 2 | 0.4 | 0.4 | 0.2 | 0.4
Σ | | | 2.2 | 1.6 | 2.0 | 1.95
How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
CPU time new = 1.6 × IC × CC, so 2.2/1.6 = 1.375, i.e. 37.5% faster
How does this compare with using branch prediction to shave a cycle off the branch time?
CPU time new = 2.0 × IC × CC, so 2.2/2.0 = 1.10, i.e. 10% faster
What if two ALU instructions could be executed at once?
CPU time new = 1.95 × IC × CC, so 2.2/1.95 ≈ 1.128, i.e. 12.8% faster
(IC is the instruction count and CC the clock cycle time; both are unchanged by the three options.)
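The three what-if scenarios above all reduce to recomputing a weighted average CPI with one entry changed. A minimal sketch, assuming the instruction mix of the table (the class CpiMix and its methods are hypothetical names):

```java
class CpiMix {
    // Weighted average CPI: sum of frequency × CPI over the instruction classes.
    static double averageCpi(double[] freq, double[] cpi) {
        double sum = 0.0;
        for (int i = 0; i < freq.length; i++) {
            sum += freq[i] * cpi[i];
        }
        return sum;
    }

    // Percentage by which the new design is faster, with IC and cycle time unchanged.
    static double percentFaster(double oldCpi, double newCpi) {
        return (oldCpi / newCpi - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        double[] freq = {0.5, 0.2, 0.1, 0.2};        // ALU, load, store, branch
        double base = averageCpi(freq, new double[]{1, 5, 3, 2});
        double cache = averageCpi(freq, new double[]{1, 2, 3, 2});   // loads take 2 cycles
        double branch = averageCpi(freq, new double[]{1, 5, 3, 1});  // branches take 1 cycle
        double dualAlu = averageCpi(freq, new double[]{0.5, 5, 3, 2}); // two ALU ops per cycle
        System.out.printf("cache: %.1f%% faster%n", percentFaster(base, cache));
        System.out.printf("branch: %.1f%% faster%n", percentFaster(base, branch));
        System.out.printf("dual ALU: %.1f%% faster%n", percentFaster(base, dualAlu));
    }
}
```

Loads are only 20% of the mix but cost the most cycles, which is why improving them pays off more than halving the cost of the far more frequent ALU operations.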
Pitfall: Amdahl’s Law
Improving an aspect of a computer and expecting a proportional improvement in overall performance
$T_{\text{improved}} = \frac{T_{\text{affected}}}{\text{improvement factor}} + T_{\text{unaffected}}$
Example: if multiply accounts for 80 s of 100 s,
◦ How much improvement in multiply performance is needed to get 5× overall?
◦ $20 = \frac{80}{n} + 20$ → can't be done!
Total speed-up
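How much speed-up is obtained by improving a part?
$\text{Speedup} = \frac{T_{\text{CPU, original}}}{T_{\text{CPU, improved}}} = \frac{1}{\frac{F}{S} + (1 - F)}$
where $F$ is the fraction of the original time affected by the enhancement and $S$ is the factor of enhancement.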
The opportunity for improvement is affected by how
much time the modified event consumes
◦ Corollary 1: make the common case fast
The performance enhancement possible with a given
improvement is limited by the amount the improved
feature is used
◦ Corollary 2: the bottleneck will limit the improvement
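To see the limit numerically, the speed-up formula can be evaluated for growing enhancement factors. This is a sketch only; the class Amdahl and its parameter names f (fraction of time affected) and s (factor of enhancement) are chosen for this illustration, with f = 0.8 taken from the multiply example.

```java
class Amdahl {
    // Overall speed-up when a fraction f of the time is sped up by a factor s.
    static double speedup(double f, double s) {
        return 1.0 / (f / s + (1.0 - f));
    }

    // Upper bound on the speed-up as s grows without limit.
    static double limit(double f) {
        return 1.0 / (1.0 - f);
    }

    public static void main(String[] args) {
        double f = 0.8; // multiply takes 80 s out of 100 s
        for (double s : new double[]{2, 10, 100, 1000}) {
            System.out.printf("s = %6.0f -> overall speed-up %.3f%n", s, speedup(f, s));
        }
        System.out.printf("bound for f = %.1f: %.3f%n", f, limit(f));
    }
}
```

However large s becomes, the overall speed-up stays below the bound, which is why a 5× overall improvement cannot be reached in the example.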

Pitfall: MIPS as a Performance Metric
MIPS: Millions of Instructions Per Second
$\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^{6}} = \frac{\text{Clock rate}}{\text{CPI} \times 10^{6}}$
Doesn't account for

◦ Differences in ISAs between computers
◦ Differences in complexity between instructions

SPEC CPU Benchmark
Programs used to measure performance
◦ Supposedly typical of actual workload
Standard Performance Evaluation Corp (SPEC)
◦ Develops benchmarks for CPU, I/O, Web, …
SPEC CPU2006
◦ Elapsed time to execute a selection of programs
Negligible I/O, so focuses on CPU performance
◦ Normalize relative to reference machine
◦ Summarize as geometric mean of performance ratios
CINT2006 (integer) and CFP2006 (floating-point)
$\sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$

CINT2006 for Opteron X4 2356
Name | Description | IC × 10⁹ | CPI | Tc (ns) | Tcpu (s) | Ref time (s) | SPECratio
perl | Interpreted string processing | 2118 | 0.75 | 0.40 | 637 | 9777 | 15.3
bzip2 | Block-sorting compression | 2389 | 0.85 | 0.40 | 817 | 9650 | 11.8
gcc | GNU C Compiler | 1050 | 1.72 | 0.40 | 724 | 8050 | 11.1
mcf | Combinatorial optimization | 336 | 10.00 | 0.40 | 1345 | 9120 | 6.8
go | Go game (AI) | 1658 | 1.09 | 0.40 | 721 | 10490 | 14.6
hmmer | Search gene sequence | 2783 | 0.80 | 0.40 | 890 | 9330 | 10.5
sjeng | Chess game (AI) | 2176 | 0.96 | 0.40 | 837 | 12100 | 14.5
libquantum | Quantum computer simulation | 1623 | 1.61 | 0.40 | 1047 | 20720 | 19.8
h264avc | Video compression | 3102 | 0.80 | 0.40 | 993 | 22130 | 22.3
omnetpp | Discrete event simulation | 587 | 2.94 | 0.40 | 690 | 6250 | 9.1
astar | Games/path finding | 1082 | 1.79 | 0.40 | 773 | 7020 | 9.1
xalancbmk | XML parsing | 1058 | 2.70 | 0.40 | 1143 | 6900 | 6.0
Geometric mean | | | | | | | 11.7
High cache miss rates
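A SPECratio is the reference time divided by the measured time, and the summary figure is the geometric mean of the ratios. The sketch below is illustrative only (class and method names are invented); it uses three rows of the table (perl, bzip2, mcf) rather than the full suite, so its mean differs from the 11.7 reported for all twelve programs.

```java
class SpecRatio {
    // SPECratio: reference machine time divided by measured time.
    static double ratio(double refTimeS, double timeS) {
        return refTimeS / timeS;
    }

    // Geometric mean computed through logarithms, which avoids
    // overflow or underflow of the running product.
    static double geometricMean(double[] xs) {
        double logSum = 0.0;
        for (double x : xs) {
            logSum += Math.log(x);
        }
        return Math.exp(logSum / xs.length);
    }

    public static void main(String[] args) {
        double[] ratios = {
            ratio(9777, 637),   // perl
            ratio(9650, 817),   // bzip2
            ratio(9120, 1345)   // mcf
        };
        for (double r : ratios) {
            System.out.printf("%.1f%n", r);
        }
        System.out.printf("geometric mean of these three: %.2f%n", geometricMean(ratios));
    }
}
```

The geometric mean is used because it gives the same relative ranking of machines regardless of which one is chosen as the reference.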

CINT2006 for Intel Core i7 920

§1.4 The Power Wall
Power Trends
In CMOS IC technology
Power = Capacitive load × Voltage² × Frequency
[Figure annotations: power ×30, voltage 5 V → 1 V, frequency ×1000]

Reducing Power
Suppose a new CPU has
◦ 85% of capacitive load of old CPU
◦ 15% voltage and 15% frequency reduction
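$\frac{P_{\text{new}}}{P_{\text{old}}} = \frac{C_{\text{old}} \times 0.85 \times (V_{\text{old}} \times 0.85)^{2} \times F_{\text{old}} \times 0.85}{C_{\text{old}} \times V_{\text{old}}^{2} \times F_{\text{old}}} = 0.85^{4} = 0.52$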
But a power wall has been reached

◦ We can’t reduce voltage further
◦ We can’t remove more heat
How else can we improve performance?
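The power ratio of the Reducing Power example follows directly from the dynamic power relation, since the old values cancel and only the scale factors remain. A small sketch (the class PowerRatio and its method are hypothetical names):

```java
class PowerRatio {
    // Dynamic power is proportional to C × V² × f, so the new/old ratio
    // is the product of the scale factors, with the voltage factor squared.
    static double relativePower(double cScale, double vScale, double fScale) {
        return cScale * vScale * vScale * fScale;
    }

    public static void main(String[] args) {
        // 85% capacitive load, 15% lower voltage, 15% lower frequency.
        System.out.printf("P_new / P_old = %.2f%n", relativePower(0.85, 0.85, 0.85));
    }
}
```

Voltage enters squared, which is why lowering it was historically the most effective lever, and why hitting the voltage floor creates the power wall.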
Uniprocessor Performance
Constrained by power, instruction-level parallelism, memory latency

Multiprocessors
Multicore microprocessors
◦ More than one processor per chip
Requires explicitly parallel programming
◦ Compare with instruction level parallelism
Hardware executes multiple instructions at once
Hidden from the programmer
◦ Hard to do
Programming for performance
Load balancing
Optimizing communication and synchronization

SPEC Power Benchmark
Power consumption of server at different
workload levels
◦ Performance: ssj_ops/sec
◦ Power: Watts (Joules/sec)
$\text{Overall ssj\_ops per Watt} = \left( \sum_{i=0}^{10} \text{ssj\_ops}_i \right) \Big/ \left( \sum_{i=0}^{10} \text{power}_i \right)$

SPECpower_ssj2008 for X4
Target Load % | Performance (ssj_ops/sec) | Average Power (Watts)
100% | 231,867 | 295
90% | 211,282 | 286
80% | 185,803 | 275
70% | 163,427 | 265
60% | 140,160 | 256
50% | 118,324 | 246
40% | 92,035 | 233
30% | 70,500 | 222
20% | 47,126 | 206
10% | 23,066 | 180
0% | 0 | 141
Overall sum | 1,283,590 | 2,605
∑ssj_ops / ∑power | 493 |
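The overall figure can be reproduced by summing both columns and dividing. In the sketch below the 40% row is taken as 92,035 ssj_ops, the value consistent with the stated overall sum of 1,283,590; the class and field names are invented for this example.

```java
class SsjOpsPerWatt {
    // Performance (ssj_ops/sec) at target loads 100%, 90%, ..., 0%.
    static final long[] OPS = {231867, 211282, 185803, 163427, 140160,
                               118324, 92035, 70500, 47126, 23066, 0};
    // Average power (Watts) at the same target loads.
    static final long[] WATTS = {295, 286, 275, 265, 256, 246, 233, 222, 206, 180, 141};

    static long sum(long[] xs) {
        long total = 0;
        for (long x : xs) {
            total += x;
        }
        return total;
    }

    // Overall ssj_ops per Watt: sum of performance over sum of power.
    static double overall(long[] ops, long[] watts) {
        return (double) sum(ops) / sum(watts);
    }

    public static void main(String[] args) {
        System.out.println(sum(OPS) + " ssj_ops, " + sum(WATTS) + " W");
        System.out.printf("overall: %.0f ssj_ops per Watt%n", overall(OPS, WATTS));
    }
}
```

The idle row contributes 141 W and no work at all, so it lowers the overall score; this is the effect discussed in the fallacy that follows.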

SPECpower_ssj2008 for Xeon X5650

Fallacy: Low Power at Idle
Look back at X4 power benchmark
◦ At 100% load: 295W
◦ At 50% load: 246W (83%)
◦ At 10% load: 180W (61%)
Google data center
◦ Mostly operates at 10% – 50% load
◦ At 100% load less than 1% of the time
Consider designing processors to make
power proportional to load
§1.5 Concluding Remarks
Concluding Remarks
Cost/performance is improving
◦ Due to underlying technology development
Hierarchical layers of abstraction
◦ In both hardware and software
Instruction set architecture
◦ The hardware/software interface
Execution time: the best performance
measure
Power is a limiting factor
◦ Use parallelism to improve performance

Test
Find the clock rate (in MHz) required for a processor, given that the execution of a program takes 9.4 μs, the number of cycles per instruction is 4.6 and the instruction count is 540 (note: if the result has a decimal part, write only the integer part; for example, if the solution is 4.7, you should write 4).
The ISA of a processor has four classes of instructions: arithmetic, store, load and branch, with CPI 2, 2, 4 and 2 respectively. Assume a program composed of 412 arithmetic operations, 341 stores, 104 loads and 488 branches. What is the execution time of the program on a 2.4 GHz processor? Give the result in microseconds (μs).

Test
Find the MIPS rating achieved by a program with a CPI of 6.0, assuming a processor frequency of 3.2 GHz (give the result with one decimal; for example, if the result is 78.56, write 78.5).
Find the speed-up of a processor in which we have improved the ALU. With the new ALU, the speed-up of the arithmetic-logic operations is 1.37 (the remaining, non-arithmetic instructions keep the same speed). We also know that 24% of the computation time is dedicated to arithmetic-logic operations in the original processor (give the answer using two decimals).

Test
Find the instruction count (IC) in the execution of a program which takes 38 μs. We know that the average number of cycles per instruction (CPI) is 2.1 and that the processor works at 191 MHz (note: if the result has a decimal part, write only the integer part; for example, if the solution is 3.6, you should write 3).
Find the average number of cycles per instruction in the execution of a program which takes 19 μs, given that the number of executed instructions (instruction count) is 5386 and the processor works at 985 MHz (give the solution with only one decimal).

Test
We want to evaluate the performance of a new computer. To do that, we use the SPEC CPU benchmark. The following table shows the number of instructions executed by different benchmarks of SPECint2006 as well as the CPI for the execution of each benchmark on the new computer.
The clock rate of the new computer is 2.0 GHz.
How many times faster is the new computer than the reference machine?
Notes: for the reference machine use the values that appear in the class slides; provide the geometric mean with at least two decimals
Program | IC × 10¹¹ | CPI
perl | 9.5 | 2.0
bzip2 | 8.6 | 4.3
gcc | 1.9 | 1.7
mcf | 4.3 | 2.0
go | 6.6 | 0.7
hmmer | 7.8 | 1.4
sjeng | 8.8 | 0.7
libquantum | 7.0 | 0.9
h264avc | 1.2 | 1.0
omnetpp | 9.0 | 1.9
astar | 2.4 | 0.6
xalancbmk | 5.8 | 1.7

Test
Program | IC × 10¹¹ | CPI | Tc (ns) | Tcpu (s) | Tref (s) | Execution ratio
perl | 9.5 | 2.0 | 0.5 | 950 | 9777 | 10.2915789
bzip2 | 8.6 | 4.3 | 0.5 | 1849 | 9650 | 5.21903732
gcc | 1.9 | 1.7 | 0.5 | 161.5 | 8050 | 49.8452012
mcf | 4.3 | 2.0 | 0.5 | 430 | 9120 | 21.2093023
go | 6.6 | 0.7 | 0.5 | 231 | 10490 | 45.4112554
hmmer | 7.8 | 1.4 | 0.5 | 546 | 9330 | 17.0879121
sjeng | 8.8 | 0.7 | 0.5 | 308 | 12100 | 39.2857143
libquantum | 7.0 | 0.9 | 0.5 | 315 | 20720 | 65.7777778
h264avc | 1.2 | 1.0 | 0.5 | 60 | 22130 | 368.833333
omnetpp | 9.0 | 1.9 | 0.5 | 855 | 6250 | 7.30994152
astar | 2.4 | 0.6 | 0.5 | 72 | 7020 | 97.5
xalancbmk | 5.8 | 1.7 | 0.5 | 493 | 6900 | 13.9959432
Geometric mean | | | | | | 29.4
