
Chapter 1: Measuring &

Understanding Performance

José Mª González Linares


Dept. de Arquitectura de Computadores

[Adapted from chapter 1 of Computer Organization and Design, 5th Edition, Patterson & Hennessy, © 2013, MK]
Index
Introduction
The Computer Revolution
Understanding Performance
The Power Wall
Concluding Remarks



§1.1 Introduction
What is a computer?
A computer is a general-purpose device that can be programmed to carry out a finite set of arithmetic or logical operations.
◦ We use programs, i.e. sequences of instructions, to solve problems.
◦ The operating system helps us to code and execute our programs.



Classes of Computers
Desktop computers
◦ General purpose, variety of software
◦ Subject to cost/performance tradeoff
Server computers
◦ Network based
◦ High capacity, performance, reliability
◦ Range from small servers to building-sized installations
Embedded computers
◦ Hidden as components of systems
◦ Stringent power/performance/cost constraints
What You Will Learn
How programs are translated into the
machine language
◦ And how the hardware executes them
The hardware/software interface
What determines program performance
◦ And how it can be improved
How hardware designers improve
performance
How software designers improve
performance
◦ And what parallel processing is



Does my computer know how to multiply?
Try the following Java code, for example by pasting it into https://ideone.com/
import java.util.*;
import java.lang.*;
import java.io.*;

class Ideone
{
    public static void main (String[] args) throws java.lang.Exception
    {
        // Multiply two decimal literals; Java stores both as binary doubles
        System.out.println( 123.456 * 0.001 );
    }
}

Why is the result not exact?
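One way to explore the question is sketched below. It assumes the cause is binary floating point: a Java double stores a binary fraction, and decimal values such as 0.001 have no finite binary expansion. The class name ExactProduct and its methods are illustrative only. The sketch prints the value the double really holds and compares the binary product with a decimal product computed with java.math.BigDecimal.

```java
import java.math.BigDecimal;

class ExactProduct {
    // Product computed with binary floating point (IEEE 754 double).
    static double binaryProduct() {
        return 123.456 * 0.001;
    }

    // Same product in decimal arithmetic: operands built from strings are stored exactly.
    static BigDecimal decimalProduct() {
        return new BigDecimal("123.456").multiply(new BigDecimal("0.001"));
    }

    public static void main(String[] args) {
        // new BigDecimal(double) exposes the exact value held by the double 0.001.
        System.out.println(new BigDecimal(0.001));
        System.out.println(binaryProduct());
        System.out.println(decimalProduct());
    }
}
```

BigDecimal is slower than double, so it is usually reserved for cases such as money where decimal exactness matters.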


§1.2 The Computer Revolution
History of computing hardware
First generation: mechanical and electromechanical devices built
to solve some particular problem, like
◦ The abacus for arithmetic tasks
◦ The Antikythera mechanism for astronomical calculations
◦ The Pascaline for automatic basic arithmetic operations
◦ Model K, the first relay-based calculator

In 1837 Charles Babbage described the first general-purpose programmable computer, the analytical engine:
◦ It used punch cards for input
◦ A steam engine powered the computer
◦ Gears and shafts represented numbers
◦ Ada Lovelace described how to program it
◦ Important concepts are first introduced: input/output system,
control unit and stored program



History of computing hardware
Second generation, vacuum tubes are used:
◦ Shannon rediscovers Boolean algebra
◦ Von Neumann architecture is proposed
◦ Machines like Z3, Colossus, ENIAC or EDVAC
are built
[Figure: Von Neumann architecture with memory system, processing unit, control unit and input/output system]
History of computing hardware
Third generation, transistors replace vacuum tubes:
◦ Smaller, require less power, more reliable, …
◦ Disk data storage units are built
◦ A secondary processor controls the input/output system: multitasking is possible
◦ The first family of computers, the IBM System/360, is a great success



History of computing hardware
Fourth generation, VLSI integrated circuits greatly increase available computing power
◦ Microprocessors like Intel 4004 are built.
◦ Large capacity RAM appears.
◦ Advances in integration allow performance
improvements, like fast cache memory or
multi-core CPUs.



The Computer Revolution
Progress in computer technology
◦ Underpinned by Moore’s Law

In 1965, Gordon Moore (later a co-founder of Intel) predicted that the number of transistors that can be integrated on a single chip would double every year; in 1975 he revised the rate to about every two years


The Computer Revolution
Progress in computer technology
◦ Underpinned by Moore’s Law
Makes novel applications feasible
◦ Computers in automobiles
◦ Cell phones
◦ Human genome project
◦ World Wide Web
◦ Search Engines
Computers are pervasive
The Processor Market



The PostPC Era



The PostPC Era
Personal Mobile Device (PMD)
◦ Battery operated
◦ Connects to the Internet
◦ Hundreds of dollars
◦ Smart phones, tablets, electronic glasses
Cloud computing
◦ Warehouse Scale Computers (WSC)
◦ Software as a Service (SaaS)
◦ A portion of the software runs on a PMD and a portion runs in the cloud
◦ Amazon and Google



§1.3 Understanding Performance
Why?
Purchasing perspective
◦ given a collection of machines, which has the
best performance?
least cost?
best cost/performance?
Design perspective
◦ faced with design options, which has the
best performance improvement?
least cost?
best cost/performance?
Both require
◦ basis for comparison
◦ metric for evaluation
Our goal is to identify which factors affect performance and their relative weight



Understanding Performance
Algorithm
◦ Determines number of operations executed
Programming language, compiler, architecture
◦ Determine number of machine instructions executed
per operation
Processor and memory system
◦ Determine how fast instructions are executed
I/O system (including OS)
◦ Determines how fast I/O operations are executed



Below Your Program
Application software
◦ Written in high-level language
System software
◦ Compiler: translates HLL code to
machine code
◦ Operating System: service code
Handling input/output
Managing memory and storage
Scheduling tasks & sharing resources
Hardware
◦ Processor, memory, I/O controllers



Levels of Program Code
High-level language
◦ Level of abstraction closer to
problem domain
◦ Provides for productivity and
portability

Assembly language
◦ Textual representation of
instructions

Hardware representation
◦ Binary digits (bits)
◦ Encoded instructions and data
Abstraction
Abstraction helps us deal with complexity
◦ Hide lower-level detail
Instruction set architecture (ISA)
◦ The hardware/software interface
Application binary interface
◦ The ISA plus system software interface
Implementation
◦ The details underlying an interface


Response Time and Throughput
Response time
◦ How long it takes to do a task
Throughput
◦ Total work done per unit time
e.g., tasks/transactions/… per hour
How are response time and throughput affected by
◦ Replacing the processor with a faster version?
◦ Adding more processors?



Response Time and Throughput

“If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?”

Seymour Cray


Response Time and Throughput
Why would Seymour Cray, the father of supercomputing, make such a statement?

He didn't like the idea of using massive parallelism to improve computational power


Response Time and Throughput
Let’s extend the analogy. Assume you need to plow a field of 10 hectares (100 m × 1000 m), with one furrow every 0.25 m (4000 furrows, each 100 m long).

We will try to reduce the time needed to complete all the furrows (i.e. tasks) without increasing the cost too much.
Response Time and Throughput
Basic solution:
Use a hoe.
Assume a man can plow at a speed of 1 meter per minute.
The latency of plowing one complete furrow would be 100 minutes.
The complete task would be finished in 400,000 minutes, that is, about 6,667 hours or 278 days!
The throughput is
$$\text{Throughput} = \frac{4 \times 10^{3}\ \text{furrows}}{4 \times 10^{5}\ \text{min}} = 0.01\ \text{furrows/min}$$
Response Time and Throughput
First improvement:
Increase the plowing
speed using for
example a horse.
Assume a speed of 10 m per minute.
The latency would decrease to 10 minutes.
The task would be finished in 40,000 minutes (< 400,000 minutes), that is, around 28 days.
The throughput increases to
$$\text{Throughput} = \frac{4 \times 10^{3}\ \text{furrows}}{4 \times 10^{4}\ \text{min}} = 0.1\ \text{furrows/min}$$


Response Time and Throughput
Second improvement:
Increase the number of ploughs
using stronger (but probably
slower) animals, for example
a pair of oxen.
Assume the oxen can pull an eight-furrow plough at a speed of 5 m/min.
The latency would increase to 20 minutes, but each pass completes 8 furrows.
So the task would be finished in 500 passes × 20 minutes = 10,000 minutes (<< 400,000 minutes), that is, only 7 days!
The throughput increases to
$$\text{Throughput} = \frac{4 \times 10^{3}\ \text{furrows}}{1 \times 10^{4}\ \text{min}} = 0.4\ \text{furrows/min}$$
Response Time and Throughput
Third improvement:
Use many weaker and
slower animals, for
example 1024 chickens.
Assume chickens can pull their plough at a speed of 0.1 m/min.
The latency would increase to 1,000 minutes.
But the task would be finished in less than 4,000 minutes (<<< 400,000 minutes), that is, less than 3 days!
The throughput increases to approximately
$$\text{Throughput} \approx \frac{4 \times 10^{3}\ \text{furrows}}{4 \times 10^{3}\ \text{min}} = 1\ \text{furrow/min}$$


In brief
Solution                          Latency      Throughput          Total time
One man with a hoe                100 min      0.01 furrows/min    400,000 min
One horse with a plough           10 min       0.1 furrows/min     40,000 min
2 oxen with an 8-furrow plough    20 min       0.4 furrows/min     10,000 min
1024 chickens with ploughs        1,000 min    1 furrow/min        4,000 min

Computer designers get comparable results by
◦ Increasing frequency and integration scale, using instruction-level parallelism, etc.
◦ Using hyper-threading, several cores, etc.
◦ Using massive parallelism, as in GPUs
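The figures in the summary can be reproduced with a short sketch. It assumes 4000 furrows of 100 m each and that work proceeds in whole rounds, so the chickens take exactly 4 rounds (the slides say "less than 4,000 minutes"). The class Plowing and its method names are hypothetical.

```java
class Plowing {
    static final double FURROW_LENGTH_M = 100.0;
    static final int FURROWS = 4000;

    // Latency: time to complete one pass along the field.
    static double latencyMin(double speedMPerMin) {
        return FURROW_LENGTH_M / speedMPerMin;
    }

    // Total time: number of rounds (ceiling division) times the latency of one round.
    static double totalTimeMin(double speedMPerMin, int furrowsPerRound) {
        int rounds = (FURROWS + furrowsPerRound - 1) / furrowsPerRound;
        return rounds * latencyMin(speedMPerMin);
    }

    // Throughput: furrows completed per minute over the whole job.
    static double throughput(double speedMPerMin, int furrowsPerRound) {
        return FURROWS / totalTimeMin(speedMPerMin, furrowsPerRound);
    }

    public static void main(String[] args) {
        String[] names = {"man", "horse", "oxen", "chickens"};
        double[] speeds = {1.0, 10.0, 5.0, 0.1};
        int[] furrowsPerRound = {1, 1, 8, 1024};
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%s: latency %.0f min, throughput %.2f furrows/min, total %.0f min%n",
                    names[i], latencyMin(speeds[i]),
                    throughput(speeds[i], furrowsPerRound[i]),
                    totalTimeMin(speeds[i], furrowsPerRound[i]));
        }
    }
}
```

The oxen and chickens rows show the point of the analogy: latency gets worse while throughput and total time improve.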
Relative performance
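Define Performance = 1 / Execution Time
"X is n times faster than Y" means
$$\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = n$$

Example: time taken to run a program
◦ 10 s on A, 15 s on B
◦ $\text{Execution time}_B / \text{Execution time}_A = 15\,\text{s} / 10\,\text{s} = 1.5$
◦ So A is 1.5 times faster than B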


Measuring Execution Time
Elapsed time
◦ Total response time, including all aspects
Processing, I/O, OS overhead, idle time
◦ Determines system performance
CPU time
◦ Time spent processing a given job
Discounts I/O time, other jobs’ shares
◦ Comprises user CPU time and system CPU
time
◦ Different programs are affected differently by
CPU and system performance



Example: ‘time’ in Unix
90.7u 12.9s 2:39 65%
◦ 90.7u: user CPU time (seconds)
◦ 12.9s: system CPU time (seconds)
◦ 2:39: elapsed time (minutes:seconds)
◦ 65%: CPU use, (90.7 + 12.9) / (2 × 60 + 39); the remainder goes to I/O and other processes


CPU Clocking
Operation of digital hardware governed by a
constant-rate clock
[Figure: clock signal timing diagram marking the clock period; data transfer and computation happen during each cycle, followed by a state update]

Clock period: duration of a clock cycle
◦ e.g., 250 ps = 0.25 ns = $250 \times 10^{-12}$ s
Clock frequency (rate): cycles per second
◦ e.g., 4.0 GHz = 4000 MHz = $4.0 \times 10^{9}$ Hz


Processor frequency
Why have processor clock rates barely increased in the last ten years?




CPU Time
$$\text{CPU Time} = \text{CPU Clock Cycles} \times \text{Clock Cycle Time} = \frac{\text{CPU Clock Cycles}}{\text{Clock Rate}}$$

Performance improved by
◦ Reducing number of clock cycles
◦ Increasing clock rate
◦ Hardware designer must often trade off clock rate
against cycle count



CPU Time Example
Computer A: 2GHz clock, 10s CPU time
Designing Computer B
◦ Aim for 6s CPU time
◦ Can do faster clock, but causes 1.2 × clock cycles
How fast must Computer B clock be?
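$$\text{Clock Rate}_B = \frac{\text{Clock Cycles}_B}{\text{CPU Time}_B} = \frac{1.2 \times \text{Clock Cycles}_A}{6\,\text{s}}$$
$$\text{Clock Cycles}_A = \text{CPU Time}_A \times \text{Clock Rate}_A = 10\,\text{s} \times 2\,\text{GHz} = 20 \times 10^{9}$$
$$\text{Clock Rate}_B = \frac{1.2 \times 20 \times 10^{9}}{6\,\text{s}} = \frac{24 \times 10^{9}}{6\,\text{s}} = 4\,\text{GHz}$$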
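The same calculation can be checked in code. This is a minimal sketch of the example's arithmetic; the class ClockRateExample and its method names are hypothetical.

```java
class ClockRateExample {
    // Clock cycles = CPU time × clock rate.
    static double clockCycles(double cpuTimeSeconds, double clockRateHz) {
        return cpuTimeSeconds * clockRateHz;
    }

    // Clock rate needed to execute the given cycles within the target CPU time.
    static double requiredClockRate(double cycles, double targetCpuTimeSeconds) {
        return cycles / targetCpuTimeSeconds;
    }

    public static void main(String[] args) {
        double cyclesA = clockCycles(10.0, 2.0e9);      // Computer A: 10 s at 2 GHz
        double cyclesB = 1.2 * cyclesA;                 // B needs 1.2 × as many cycles
        double rateB = requiredClockRate(cyclesB, 6.0); // B must finish in 6 s
        System.out.println("Clock rate B (GHz): " + rateB / 1e9);
    }
}
```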
Instruction Count and CPI
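$$\text{Clock Cycles} = \text{Instruction Count} \times \text{Cycles per Instruction}$$
$$\text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}$$

Instruction Count for a program
◦ Determined by program, ISA and compiler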
Average cycles per instruction
◦ Determined by CPU hardware
◦ If different instructions have different CPI
Average CPI affected by instruction mix



CPI Example
Computer A: Cycle Time = 250ps, CPI = 2.0
Computer B: Cycle Time = 500ps, CPI = 1.2
Same ISA
Which is faster, and by how much?

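$$\text{CPU Time}_A = \text{Instruction Count} \times \text{CPI}_A \times \text{Cycle Time}_A = I \times 2.0 \times 250\,\text{ps} = I \times 500\,\text{ps}$$
$$\text{CPU Time}_B = \text{Instruction Count} \times \text{CPI}_B \times \text{Cycle Time}_B = I \times 1.2 \times 500\,\text{ps} = I \times 600\,\text{ps}$$
A is faster…
$$\frac{\text{CPU Time}_B}{\text{CPU Time}_A} = \frac{I \times 600\,\text{ps}}{I \times 500\,\text{ps}} = 1.2$$
…by this much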
CPI in More Detail
If different instruction classes take different
numbers of cycles
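$$\text{Clock Cycles} = \sum_{i=1}^{n} (\text{CPI}_i \times \text{Instruction Count}_i)$$

The weighted average CPI is
$$\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left(\text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}}\right)$$
where $\text{Instruction Count}_i / \text{Instruction Count}$ is the relative frequency of class i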


CPI Example
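Alternative compiled code sequences using instructions in classes A, B, C
(the computations below use CPI 1, 2 and 3 for classes A, B and C, with 2, 1 and 2 instructions of each class in sequence 1, and 4, 1 and 1 in sequence 2)

Sequence 1: IC = 5
◦ Clock Cycles = 2×1 + 1×2 + 2×3 = 10
◦ Avg. CPI = 10/5 = 2.0
Sequence 2: IC = 6
◦ Clock Cycles = 4×1 + 1×2 + 1×3 = 9
◦ Avg. CPI = 9/6 = 1.5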
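The two sequences can be evaluated with a small sketch of the weighted-CPI formula. The per-class CPIs (1, 2, 3) and instruction counts are read off the products in the computations above; the class WeightedCpi and its method names are hypothetical.

```java
class WeightedCpi {
    // Clock cycles = sum over classes of (instruction count × CPI of that class).
    static int clockCycles(int[] counts, int[] cpiPerClass) {
        int cycles = 0;
        for (int i = 0; i < counts.length; i++) {
            cycles += counts[i] * cpiPerClass[i];
        }
        return cycles;
    }

    // Average CPI = clock cycles / total instruction count.
    static double averageCpi(int[] counts, int[] cpiPerClass) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        return (double) clockCycles(counts, cpiPerClass) / total;
    }

    public static void main(String[] args) {
        int[] cpi = {1, 2, 3};        // classes A, B, C
        int[] sequence1 = {2, 1, 2};  // IC = 5
        int[] sequence2 = {4, 1, 1};  // IC = 6
        System.out.println(clockCycles(sequence1, cpi) + " cycles, CPI " + averageCpi(sequence1, cpi));
        System.out.println(clockCycles(sequence2, cpi) + " cycles, CPI " + averageCpi(sequence2, cpi));
    }
}
```

Sequence 2 executes more instructions but fewer cycles, so instruction count alone does not decide which sequence is faster.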
Performance Summary

$$\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$

Performance depends on
◦ Algorithm: affects IC, possibly CPI
◦ Programming language: affects IC, CPI
◦ Compiler: affects IC, CPI
◦ Instruction set architecture: affects IC, CPI, Tc



A Simple Example
Op        Freq    CPIi    Freq × CPIi    Load CPI = 2    Branch CPI = 1    ALU CPI = 0.5
ALU       50%     1       0.5            0.5             0.5               0.25
Load      20%     5       1.0            0.4             1.0               1.0
Store     10%     3       0.3            0.3             0.3               0.3
Branch    20%     2       0.4            0.4             0.2               0.4
Σ =                       2.2            1.6             2.0               1.95

How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
CPU time new = 1.6 × IC × CC, so 2.2/1.6 means 37.5% faster
How does this compare with using branch prediction to shave a cycle off the branch time?
CPU time new = 2.0 × IC × CC, so 2.2/2.0 means 10% faster
What if two ALU instructions could be executed at once?
CPU time new = 1.95 × IC × CC, so 2.2/1.95 means 12.8% faster
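The three what-if scenarios can be recomputed from the instruction mix. The sketch below assumes that IC and the clock cycle time stay the same, so the speed-up is just the ratio of average CPIs; the class InstructionMix and its method names are hypothetical.

```java
class InstructionMix {
    // Average CPI = sum of (relative frequency × CPI) over all instruction classes.
    static double averageCpi(double[] freq, double[] cpi) {
        double sum = 0.0;
        for (int i = 0; i < freq.length; i++) {
            sum += freq[i] * cpi[i];
        }
        return sum;
    }

    // With IC and cycle time unchanged, speed-up = old CPI / new CPI.
    static double speedUp(double oldCpi, double newCpi) {
        return oldCpi / newCpi;
    }

    public static void main(String[] args) {
        double[] freq = {0.5, 0.2, 0.1, 0.2};  // ALU, Load, Store, Branch
        double base = averageCpi(freq, new double[]{1, 5, 3, 2});
        double betterCache = averageCpi(freq, new double[]{1, 2, 3, 2});
        double branchPrediction = averageCpi(freq, new double[]{1, 5, 3, 1});
        double dualAlu = averageCpi(freq, new double[]{0.5, 5, 3, 2});
        System.out.printf("cache: %.3f, branch: %.3f, ALU: %.3f%n",
                speedUp(base, betterCache), speedUp(base, branchPrediction), speedUp(base, dualAlu));
    }
}
```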
Pitfall: Amdahl’s Law
Improving an aspect of a computer and expecting
a proportional improvement in overall
performance

$$T_{\text{improved}} = \frac{T_{\text{affected}}}{\text{improvement factor}} + T_{\text{unaffected}}$$

Example: if multiply accounts for 80 s of 100 s,
◦ How much improvement in multiply performance is needed to get 5× overall?
◦ $20 = \frac{80}{n} + 20$ → can't be done!
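A short sketch makes the impossibility concrete: solving the equation for the improvement factor gives a division by zero once the target time equals the unaffected time. The class AmdahlExample, its method names and the 4× comparison case are illustrative assumptions, not part of the slides.

```java
class AmdahlExample {
    // Execution time after speeding up only the affected part by the given factor.
    static double improvedTime(double affected, double unaffected, double factor) {
        return affected / factor + unaffected;
    }

    // Factor needed to reach targetTime; infinite when the target does not exceed the unaffected time.
    static double requiredFactor(double affected, double unaffected, double targetTime) {
        if (targetTime <= unaffected) {
            return Double.POSITIVE_INFINITY;
        }
        return affected / (targetTime - unaffected);
    }

    public static void main(String[] args) {
        // 4× overall (100 s -> 25 s) is reachable with a finite factor.
        System.out.println(requiredFactor(80.0, 20.0, 25.0));
        // 5× overall (100 s -> 20 s) would need an infinitely fast multiplier.
        System.out.println(requiredFactor(80.0, 20.0, 20.0));
    }
}
```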



Total speed-up
How much speed-up is obtained by improving a part?
$$\text{Speed-up} = \frac{T_{\text{original}}}{T_{\text{improved}}} = \frac{1}{\frac{F}{S} + (1 - F)}$$
where F is the fraction of the original time affected by the enhancement and S is the factor of enhancement.
The opportunity for improvement is affected by how
much time the modified event consumes
◦ Corollary 1: make the common case fast
The performance enhancement possible with a given
improvement is limited by the amount the improved
feature is used
◦ Corollary 2: the bottleneck will limit the improvement
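Both corollaries can be seen numerically with a sketch of the speed-up formula. The parameter names fraction (share of the original time that is enhanced) and factor (how much faster that share becomes) are hypothetical, as are the class SpeedUp and the sample values.

```java
class SpeedUp {
    // Overall speed-up when `fraction` of the time is made `factor` times faster.
    static double speedUp(double fraction, double factor) {
        return 1.0 / ((1.0 - fraction) + fraction / factor);
    }

    // Upper bound on the speed-up as the factor grows without limit.
    static double limit(double fraction) {
        return 1.0 / (1.0 - fraction);
    }

    public static void main(String[] args) {
        // Enhancing a common case (80% of the time) pays off far more than a rare one (10%).
        System.out.println(speedUp(0.8, 16.0));
        System.out.println(speedUp(0.1, 16.0));
        // The unenhanced 20% bounds the total speed-up, whatever the factor.
        System.out.println(limit(0.8));
    }
}
```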



Pitfall: MIPS as a Performance Metric
MIPS: Millions of Instructions Per Second

$$\text{MIPS} = \frac{\text{Instruction Count}}{\text{Execution Time} \times 10^{6}} = \frac{\text{Clock Rate}}{\text{CPI} \times 10^{6}}$$

Doesn't account for
◦ Differences in ISAs between computers
◦ Differences in complexity between instructions
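The two forms of the MIPS formula can be checked against each other in a brief sketch. The sample machine (4 GHz, CPI 2.0) and the instruction count of 10^9 are assumed values chosen for illustration; the class Mips and its method names are hypothetical.

```java
class Mips {
    // MIPS from instruction count and execution time in seconds.
    static double fromCount(double instructionCount, double executionTimeSeconds) {
        return instructionCount / (executionTimeSeconds * 1e6);
    }

    // Equivalent form: MIPS from clock rate (Hz) and average CPI.
    static double fromClockRate(double clockRateHz, double cpi) {
        return clockRateHz / (cpi * 1e6);
    }

    public static void main(String[] args) {
        double clockRateHz = 4.0e9;
        double cpi = 2.0;
        double instructionCount = 1.0e9;
        double executionTime = instructionCount * cpi / clockRateHz; // CPU time equation
        System.out.println(fromCount(instructionCount, executionTime));
        System.out.println(fromClockRate(clockRateHz, cpi));
    }
}
```

Because instruction count cancels out of the second form, a machine can raise its MIPS rating by executing many simple instructions without finishing the program sooner.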



SPEC CPU Benchmark
Programs used to measure performance
◦ Supposedly typical of actual workload
Standard Performance Evaluation Corp (SPEC)
◦ Develops benchmarks for CPU, I/O, Web, …
SPEC CPU2006
◦ Elapsed time to execute a selection of programs
Negligible I/O, so focuses on CPU performance
◦ Normalize relative to reference machine
◦ Summarize as geometric mean of performance ratios
CINT2006 (integer) and CFP2006 (floating-point)

$$\sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$$


CINT2006 for Opteron X4 2356
Name         Description                     IC×10^9   CPI     Tc (ns)   Tcpu (s)   Ref time (s)   SPECratio
perl         Interpreted string processing   2118      0.75    0.40      637        9777           15.3
bzip2        Block-sorting compression       2389      0.85    0.40      817        9650           11.8
gcc          GNU C Compiler                  1050      1.72    0.40      724        8050           11.1
mcf          Combinatorial optimization      336       10.00   0.40      1345       9120           6.8
go           Go game (AI)                    1658      1.09    0.40      721        10490          14.6
hmmer        Search gene sequence            2783      0.80    0.40      890        9330           10.5
sjeng        Chess game (AI)                 2176      0.96    0.40      837        12100          14.5
libquantum   Quantum computer simulation     1623      1.61    0.40      1047       20720          19.8
h264avc      Video compression               3102      0.80    0.40      993        22130          22.3
omnetpp      Discrete event simulation       587       2.94    0.40      690        6250           9.1
astar        Games/path finding              1082      1.79    0.40      773        7020           9.1
xalancbmk    XML parsing                     1058      2.70    0.40      1143       6900           6.0
Geometric mean                                                                                     11.7

Annotation: high cache miss rates (marking the highest CPI values)
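The geometric mean in the last row can be recomputed from the SPECratio column. The sketch below averages logarithms rather than multiplying the ratios directly, a common choice to avoid overflow with long lists; the class GeometricMean and its method name are hypothetical.

```java
class GeometricMean {
    // n-th root of the product of the ratios, computed as exp(mean of logs).
    static double geometricMean(double[] ratios) {
        double logSum = 0.0;
        for (double r : ratios) {
            logSum += Math.log(r);
        }
        return Math.exp(logSum / ratios.length);
    }

    public static void main(String[] args) {
        // SPECratio column of the CINT2006 table for the Opteron X4 2356.
        double[] specRatios = {15.3, 11.8, 11.1, 6.8, 14.6, 10.5, 14.5, 19.8, 22.3, 9.1, 9.1, 6.0};
        System.out.printf("Geometric mean = %.1f%n", geometricMean(specRatios));
    }
}
```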



CINT2006 for Intel Core i7 920



§1.4 The Power Wall
Power Trends

In CMOS IC technology
$$\text{Power} = \text{Capacitive load} \times \text{Voltage}^{2} \times \text{Frequency}$$
◦ Power: ×30
◦ Voltage: 5 V → 1 V
◦ Frequency: ×1000


Reducing Power
Suppose a new CPU has
◦ 85% of capacitive load of old CPU
◦ 15% voltage and 15% frequency reduction
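$$\frac{P_{\text{new}}}{P_{\text{old}}} = \frac{C_{\text{old}} \times 0.85 \times (V_{\text{old}} \times 0.85)^{2} \times F_{\text{old}} \times 0.85}{C_{\text{old}} \times V_{\text{old}}^{2} \times F_{\text{old}}} = 0.85^{4} = 0.52$$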
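The ratio can be verified with a sketch of the CMOS power equation in which each quantity is expressed as a scale factor relative to the old CPU. The class PowerRatio and its method name are hypothetical.

```java
class PowerRatio {
    // P_new / P_old when load, voltage and frequency are scaled by the given factors.
    // Voltage enters squared, as in Power = Capacitive load × Voltage² × Frequency.
    static double powerRatio(double loadScale, double voltageScale, double frequencyScale) {
        return loadScale * voltageScale * voltageScale * frequencyScale;
    }

    public static void main(String[] args) {
        // 85% capacitive load, 15% lower voltage, 15% lower frequency.
        System.out.printf("P_new / P_old = %.2f%n", powerRatio(0.85, 0.85, 0.85));
    }
}
```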

But a power wall has been reached


◦ We can’t reduce voltage further
◦ We can’t remove more heat
How else can we improve performance?
Uniprocessor Performance

Constrained by power, instruction-level parallelism, memory latency


Multiprocessors
Multicore microprocessors
◦ More than one processor per chip
Requires explicitly parallel programming
◦ Compare with instruction level parallelism
Hardware executes multiple instructions at once
Hidden from the programmer
◦ Hard to do
Programming for performance
Load balancing
Optimizing communication and synchronization



SPEC Power Benchmark
Power consumption of server at different
workload levels
◦ Performance: ssj_ops/sec
◦ Power: Watts (Joules/sec)

 10   10 
Overall ssj_ops per Watt =  ∑ ssj_ops i   ∑ poweri 
 i=0   i=0 

Dept. Arquitectura de Computadores 55


SPECpower_ssj2008 for X4
Target Load %     Performance (ssj_ops/sec)    Average Power (Watts)
100%              231,867                      295
90%               211,282                      286
80%               185,803                      275
70%               163,427                      265
60%               140,160                      256
50%               118,324                      246
40%               92,035                       233
30%               70,500                       222
20%               47,126                       206
10%               23,066                       180
0%                0                            141
Overall sum       1,283,590                    2,605
∑ssj_ops / ∑power                              493
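The totals and the overall metric can be recomputed from the table rows. This is a sketch of the summation only; the class SsjOpsPerWatt and its method names are hypothetical.

```java
class SsjOpsPerWatt {
    // Sum of all entries in a column of the table.
    static long sum(long[] values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    // Overall metric: total ssj_ops divided by total average power.
    static double overall(long[] ssjOps, long[] watts) {
        return (double) sum(ssjOps) / sum(watts);
    }

    public static void main(String[] args) {
        // Rows from 100% down to 0% target load.
        long[] ssjOps = {231867, 211282, 185803, 163427, 140160, 118324, 92035, 70500, 47126, 23066, 0};
        long[] watts = {295, 286, 275, 265, 256, 246, 233, 222, 206, 180, 141};
        System.out.println(sum(ssjOps) + " ssj_ops, " + sum(watts) + " W");
        System.out.println(Math.round(overall(ssjOps, watts)) + " ssj_ops per Watt");
    }
}
```

Note that the idle row (0% load) adds 141 W to the denominator while contributing no operations.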



SPECpower_ssj2008 for Xeon X5650



Fallacy: Low Power at Idle
Look back at X4 power benchmark
◦ At 100% load: 295W
◦ At 50% load: 246W (83%)
◦ At 10% load: 180W (61%)
Google data center
◦ Mostly operates at 10% – 50% load
◦ At 100% load less than 1% of the time
Consider designing processors to make
power proportional to load
§1.5 Concluding Remarks
Concluding Remarks
Cost/performance is improving
◦ Due to underlying technology development
Hierarchical layers of abstraction
◦ In both hardware and software
Instruction set architecture
◦ The hardware/software interface
Execution time: the best performance
measure
Power is a limiting factor
◦ Use parallelism to improve performance



Test
Find the clock rate (in MHz) required for a processor, given that the execution of a program takes 9.4 μs, the number of cycles per instruction is 4.6 and the instruction count is 540 (note: if the result has a decimal part, please write only the integer part; for example, if the solution is 4.7, you should write 4).

The ISA of a processor has four classes of instructions: arithmetic, store, load and branch, with CPI 2, 2, 4 and 2, respectively. Assume a program composed of 412 arithmetic operations, 341 stores, 104 loads and 488 branches. What is the execution time of the program on a 2.4 GHz processor? Give the result in microseconds (μs).


Test
Find the MIPS rating of a program with a CPI of 6.0, assuming a processor frequency of 3.2 GHz (give the result with one decimal; for example, if the result is 78.56, write 78.5).

Find the speed-up of a processor in which we have improved the ALU. With the new ALU, the speed-up of the arithmetic-logic operations is 1.37 (the remaining, non-arithmetic instructions keep the same speed). We also know that 24% of the computation time is dedicated to arithmetic-logic operations in the original processor (give the answer using two decimals).


Test
Find the instruction count (IC) in the execution of a program which takes 38 μs. We know that the average number of cycles per instruction (CPI) is 2.1 and that the processor works at 191 MHz (note: if the result has a decimal part, please write only the integer part; for example, if the solution is 3.6, you should write 3).

Find the average number of cycles per instruction in the execution of a program which takes 19 μs to execute, given that the number of executed instructions (instruction count) is 5386 and the processor works at 985 MHz (give the solution with only one decimal).


Test
We want to evaluate the performance of a new computer. To do that, we use the SPEC CPU benchmark. The following table shows the number of instructions executed by different benchmarks of SPECint2006 as well as the CPI for the execution of each benchmark in the new computer.
The clock rate of the new computer is 2.0 GHz.
How many times is the new computer faster than the reference machine?
Notes: for the reference machine use the values that appear in the class slides; provide the geometric mean with at least two decimals

Program       IC × 10^11    CPI
perl          9.5           2.0
bzip2         8.6           4.3
gcc           1.9           1.7
mcf           4.3           2.0
go            6.6           0.7
hmmer         7.8           1.4
sjeng         8.8           0.7
libquantum    7.0           0.9
h264avc       1.2           1.0
omnetpp       9.0           1.9
astar         2.4           0.6
xalancbmk     5.8           1.7


Test
Program       IC × 10^11    CPI    Tc (ns)    Tcpu (s)    Tref (s)    Execution ratio
perl          9.5           2.0    0.5        950         9777        10.2915789
bzip2         8.6           4.3    0.5        1849        9650        5.21903732
gcc           1.9           1.7    0.5        161.5       8050        49.8452012
mcf           4.3           2.0    0.5        430         9120        21.2093023
go            6.6           0.7    0.5        231         10490       45.4112554
hmmer         7.8           1.4    0.5        546         9330        17.0879121
sjeng         8.8           0.7    0.5        308         12100       39.2857143
libquantum    7.0           0.9    0.5        315         20720       65.7777778
h264avc       1.2           1.0    0.5        60          22130       368.833333
omnetpp       9.0           1.9    0.5        855         6250        7.30994152
astar         2.4           0.6    0.5        72          7020        97.5
xalancbmk     5.8           1.7    0.5        493         6900        13.9959432
Geometric mean                                                        29.4
