
Principles of Scalable Performance

Performance measures and laws


Speedup
Efficiency
Amdahl's Law

Principles of scalable performance

Predict performance of parallel programs
  Accurate prediction of the performance of a parallel algorithm helps determine whether coding it is worthwhile.

Understand barriers to higher performance
  Knowing the barriers lets you determine how much improvement can be realized by increasing the number of processors used.

Speedup
Speedup measures the performance gain (the reduction in running time) due to parallelism.

The number of PEs is given by n.

Based on running times, S(n) = ts/tp, where
  ts is the execution time on a single processor, using the fastest known sequential algorithm, and
  tp is the execution time using a parallel computer.

For theoretical analysis, S(n) = ts/tp, where
  ts is the worst-case running time of the fastest known sequential algorithm for the problem, and
  tp is the worst-case running time of the parallel algorithm using n PEs.

Speedup

    Speedup = Sequential execution time / Parallel execution time

Linear Speedup Usually Optimal

Theorem: The maximum possible speedup for parallel computers with n PEs for traditional problems is n.

Proof:
  Assume a computation is partitioned perfectly into n processes of equal duration.
  Assume no overhead is incurred as a result of this partitioning of the computation (e.g., the partitioning process, information passing, coordination of processes, etc.).
  Under these ideal conditions, the parallel computation executes n times faster than the sequential computation.
  The parallel running time is ts/n.
  Then the parallel speedup of this computation is

      S(n) = ts/(ts/n) = n

Linear Speedup Usually Optimal

This proof is not valid for certain types of nontraditional problems.
Unfortunately, the best speedup possible for most applications is much smaller than n:
  The optimal performance assumed in the proof is usually infeasible.
  Usually some parts of programs are sequential and allow only one PE to be active.
  Sometimes a large number of processors are idle for certain portions of the program.
    During parts of the execution, many PEs may be waiting to receive or to send data.
    E.g., recall that blocking can occur in message passing.

Superlinearity

Superlinear algorithms are required for many nonstandard problems.
  If a problem cannot be solved in the required time without the use of parallel computation, it seems fair to say that ts = ∞.
  Since for a fixed tp > 0, S(n) = ts/tp is greater than 1 for all sufficiently large values of ts, it seems reasonable to consider these solutions superlinear.
Examples include nonstandard problems involving:
  Real-time requirements, where deadlines are part of the problem requirements.
  Problems where all data is not initially available, but has to be processed as it arrives.
Some problems are natural to solve using parallelism; their sequential solutions are inefficient.

Efficiency

Speedup:
    S = Time(most efficient sequential algorithm) / Time(parallel algorithm)

Efficiency:
    E = S / N, where N is the number of processors

Amdahl's Law

Gene Amdahl, chief architect of IBM's first mainframe series and founder of Amdahl Corporation and other companies,
  found that there were some fairly stringent restrictions on how much of a speedup one could get for a given parallelized task.
  These observations were wrapped up in Amdahl's Law.

Those intending to write parallel software MUST have a deep understanding of Amdahl's Law if they want to avoid unrealistic expectations.

Amdahl's Law

The main objective is to produce the results as soon as possible,
  e.g., in video compression, computer graphics, VLSI routing, etc.

Implication: make the sequential bottleneck as small as possible.

Amdahl's Law

Let f be the fraction of operations in a computation that must be performed sequentially, where 0 ≤ f ≤ 1.

The maximum speedup S achievable by a parallel computer with n processors is

    S(n) ≤ 1 / (f + (1 − f)/n) ≤ 1/f

Amdahl's Law

Proof for Traditional Problems:
  If the fraction of the computation that cannot be divided into concurrent tasks is f,
  and no overhead is incurred when the computation is divided into concurrent parts,
  then the time to perform the computation with n processors is given by

      tp ≥ f·ts + [(1 − f)·ts]/n, as shown in the figure.

Amdahl's Law

[Figure: execution time divided into a sequential part f·ts and a parallel part (1 − f)·ts/n]

Proof of Amdahl's Law (cont.)

Using the preceding expression for tp:

    S(n) = ts/tp ≤ ts / (f·ts + (1 − f)·ts/n) = 1 / (f + (1 − f)/n)

Proof of Amdahl's Law (cont.)

The last expression is obtained by dividing numerator and denominator by ts, which establishes Amdahl's law.
Multiplying numerator and denominator by n produces the following alternate version of this formula:

    S(n) ≤ n / (n·f + (1 − f)) = n / (1 + (n − 1)·f)

Amdahl's Law Effect

Sometimes Amdahl's law is just stated as S(n) ≤ 1/f.
  Note that S(n) never exceeds 1/f.

Amdahl Effect: speedup is usually an increasing function of the problem size.
  As the problem size increases, the sequential portion of the algorithm typically shrinks relative to the parallel portion, so the speedup increases.

Amdahl's Law

If F is the fraction of a calculation that is sequential, and (1 − F) is the fraction that can be parallelized, then the maximum speedup that can be achieved by using n processors is 1/(F + (1 − F)/n).

Amdahl's Law

If 90% of a calculation can be parallelized (i.e., 10% is sequential), then the maximum speedup which can be achieved on 5 processors is 1/(0.1 + (1 − 0.1)/5), or roughly 3.6 (i.e., the program can theoretically run 3.6 times faster on five processors than on one).

Amdahl's Law

If 90% of a calculation can be parallelized, then the maximum speedup on 10 processors is 1/(0.1 + (1 − 0.1)/10), or about 5.3 (i.e., investing twice as much hardware speeds the calculation up by about 50%).

If 90% of a calculation can be parallelized, then the maximum speedup on 20 processors is 1/(0.1 + (1 − 0.1)/20), or about 6.9 (i.e., doubling the hardware again speeds up the calculation by only 30%).

Amdahl's Law

If 90% of a calculation can be parallelized, then the maximum speedup on 1000 processors is 1/(0.1 + (1 − 0.1)/1000), or about 9.9 (i.e., throwing an absurd amount of hardware at the calculation results in a maximum theoretical speedup of 9.9 vs. a single processor; actual results will be worse).
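The worked numbers above can be reproduced with a short helper. This is a minimal sketch of the formula S(n) = 1/(f + (1 − f)/n); the function name `amdahl_speedup` is chosen here for illustration:

```python
def amdahl_speedup(f, n):
    """Maximum speedup for sequential fraction f on n processors,
    per Amdahl's law: S(n) = 1 / (f + (1 - f)/n)."""
    return 1.0 / (f + (1.0 - f) / n)

# 10% sequential fraction, as in the examples above
for n in (5, 10, 20, 1000):
    print(n, round(amdahl_speedup(0.1, n), 1))
# 5 -> 3.6, 10 -> 5.3, 20 -> 6.9, 1000 -> 9.9
```

Note how quickly the returns diminish: going from 5 to 1000 processors less than triples the speedup, which is exactly the point the slides make.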

[Figure: speedup vs. number of processors, plotted for problem sizes n = 100, n = 1,000, and n = 10,000]

Example

95% of a program's execution time occurs inside a loop that can be executed in parallel. What is the maximum speedup we should expect from a parallel version of the program executing on 8 CPUs?

    S ≤ 1 / (0.05 + (1 − 0.05)/8) ≈ 5.9

Essence

The point that Amdahl was trying to make is that using lots of parallel processors is not a practical way of achieving the sort of speedup that people were looking for.
  I.e., it is essentially an argument in support of putting effort into making single-processor systems run faster.

Why are actual speedups always less?

Distributing work to the parallel processors and collecting the results back together is extra work required in the parallel version that isn't required in the serial version.

The straggler problem: when the program is executing the parallel parts of the code, it is unlikely that all of the processors will be computing all of the time;
  some of them will likely run out of work to do before others have finished their part of the parallel work.

Maximum theoretical speedup

Amdahl's Law is a statement of the maximum theoretical speedup you can ever hope to achieve. Actual speedups are always less than the speedup predicted by Amdahl's Law.

Parallelism Profile in Programs

DOP — Degree of Parallelism
  The number of processors used at any instant to execute a program is called the degree of parallelism (DOP); this can vary over time.
  DOP assumes an infinite number of processors is available; this is not achievable in real machines, so some parallel program segments must be executed sequentially as smaller parallel segments.
Program execution on a parallel computer may use different numbers of processors at different time periods during the execution cycle.

Parallelism profile

A plot of DOP vs. time is called a parallelism profile.
Fluctuations of the profile during an observation period depend on the algorithmic structure, program optimization, resource utilization, and run-time conditions.
When the DOP exceeds the maximum number of available processors in a system, some parallel branches must be executed in chunks sequentially.
  Parallelism still exists within each chunk, limited by the machine size.
DOP may also be limited by memory and by other non-processor resources.

Average Parallelism

We consider a parallel computer consisting of n homogeneous processors.
The maximum parallelism in a profile is m.
The computing capacity Δ of a single processor is approximated by its execution rate in MIPS or Mflops, without considering the penalties from memory access, communication latency, or system overhead.
When i processors are busy during an observation period, we have DOP = i.

Average Parallelism

The total amount of work W (instructions or computations) performed is proportional to the area under the profile curve:

    W = Δ ∫[t1, t2] DOP(t) dt

In discrete form, this integral is often computed as

    W = Δ Σ(i=1..m) i·ti

where ti is the total amount of time that DOP = i, and Σ(i=1..m) ti = t2 − t1 is the total elapsed time.

The average parallelism A is computed as

    A = (1/(t2 − t1)) ∫[t1, t2] DOP(t) dt = (Σ(i=1..m) i·ti) / (Σ(i=1..m) ti)

Example 3.1

[Figure: parallelism profile of a divide-and-conquer algorithm over (t1, t2)]

Example 3.1

The parallelism profile of the divide-and-conquer algorithm increases from 1 to its peak value m = 8 and then decreases to 0 over the observation period (t1, t2).

The average parallelism is A = 3.72.

The total workload is W = A·Δ·(t2 − t1).
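The definition A = Σ i·ti / Σ ti above can be sketched in code. The profile below is a hypothetical discrete profile (not the one from Example 3.1), chosen only to illustrate the computation:

```python
def average_parallelism(profile):
    """profile: dict mapping DOP value i -> total time ti spent at that DOP.
    Returns A = sum(i * ti) / sum(ti)."""
    total_time = sum(profile.values())
    weighted_work = sum(i * t for i, t in profile.items())
    return weighted_work / total_time

# hypothetical profile: 2 time units at DOP 1, 3 at DOP 4, 1 at DOP 8
profile = {1: 2.0, 4: 3.0, 8: 1.0}
print(average_parallelism(profile))  # (2 + 12 + 8) / 6, about 3.67
```

Multiplying A by Δ·(t2 − t1) then gives the total workload W, as on the slide above.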

Available Parallelism

Various studies have shown that the potential parallelism in scientific and engineering calculations can be high:
  High DOP due to data parallelism — hundreds or thousands of instructions per clock cycle.
  But in real machines, the actual exploited parallelism is much smaller (e.g., 10 or 20).

Available-parallelism studies show that there is less parallelism in ordinary (non-numeric) code than in scientific (numeric) codes;
  such code has relatively little parallelism even when basic block boundaries are ignored.

Available Parallelism

A basic block is a sequence (block) of instructions with one entry and one exit.
Basic blocks are frequently used as the unit of optimization in compilers (since it is easier to manage the use of registers within a block).
Compiler optimization and algorithm redesign may increase the available parallelism in an application.
Limiting optimization to basic blocks limits the instruction-level parallelism that can be obtained (to about 2 to 5 in typical code).
DOP may reach thousands in some scientific codes when multiple processors are used.

Asymptotic Speedup - 1

    Wi = i·Δ·ti                (work done when DOP = i)

    W = Σ(i=1..m) Wi           (relates the sum of the Wi terms to W)

    ti(1) = Wi/Δ               (execution time of Wi with a single processor)
    ti(k) = Wi/(k·Δ)           (execution time of Wi with k processors)
    ti(∞) = Wi/(i·Δ)           (execution time of Wi with an infinite number of processors, for 1 ≤ i ≤ m)

Asymptotic Speedup - 2

    T(1) = Σ(i=1..m) ti(1) = Σ(i=1..m) Wi/Δ          (response time with 1 processor)

    T(∞) = Σ(i=1..m) ti(∞) = Σ(i=1..m) Wi/(i·Δ)      (response time with ∞ processors)

    S∞ = T(1)/T(∞) = (Σ(i=1..m) Wi) / (Σ(i=1..m) Wi/i)    (asymptotic speedup — S∞ = A in the ideal case)

Asymptotic Speedup - 3

    A = (Σ(i=1..m) i·ti) / (Σ(i=1..m) ti)

    S∞ = T(1)/T(∞) = (Σ(i=1..m) Wi) / (Σ(i=1..m) Wi/i)

Comparing these two equations (using Wi = i·Δ·ti):
  In the ideal case, S∞ = A.
  S∞ ≤ A if communication latency and other system overhead are considered.
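The equality S∞ = A in the ideal (no-overhead) case can be checked numerically. This sketch reuses the same hypothetical DOP profile shape as before; the function names are illustrative:

```python
def asymptotic_speedup(profile, delta=1.0):
    """profile: dict mapping DOP i -> time ti. Work per level: Wi = i * delta * ti.
    Returns S_inf = T(1)/T(inf) = sum(Wi) / sum(Wi / i) (delta cancels)."""
    work = {i: i * delta * t for i, t in profile.items()}
    t_one = sum(work.values()) / delta                    # T(1) = sum Wi / delta
    t_inf = sum(w / (i * delta) for i, w in work.items()) # T(inf) = sum Wi/(i*delta)
    return t_one / t_inf

profile = {1: 2.0, 4: 3.0, 8: 1.0}
# In the ideal case this equals the average parallelism A = (2 + 12 + 8)/6
print(asymptotic_speedup(profile))
```

With overhead included, T(∞) would grow and the computed ratio would fall below A, matching the inequality S∞ ≤ A on the slide.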

Mean Performance Calculation

We seek a measure that characterizes the mean, or average, performance of a set of benchmark programs with potentially many different execution modes (e.g., scalar, vector, sequential, parallel).

We may also wish to associate weights with these programs to emphasize the different modes and yield a more meaningful performance measure.

Consider a parallel computer with n processors executing m programs in various modes at different performance levels.

Arithmetic Mean

The arithmetic mean is familiar (the sum of the terms divided by the number of terms).
Our measures will use execution rates expressed in MIPS or Mflops.
Let Ri be the execution rate of program i, for i = 1, 2, ..., m.
The arithmetic mean execution rate is defined as

    Ra = (1/m) Σ(i=1..m) Ri

Ra assumes an equal weight 1/m on all m programs.
A weight distribution is a set {fi | i = 1, 2, ..., m}.

Weighted Arithmetic Mean

The weighted arithmetic mean execution rate is

    Ra* = Σ(i=1..m) fi·Ri

The arithmetic mean of a set of execution rates is proportional to the sum of the inverses of the execution times;
  it is not inversely proportional to the sum of the execution times.
Thus the arithmetic mean fails to represent the real time consumed by the benchmark programs when executed.

Harmonic Mean

Instead of the arithmetic mean, we use the harmonic mean execution rate, which is just the inverse of the arithmetic mean of the execution times (thus guaranteeing the inverse relation not exhibited by the other means).
Let Ti = 1/Ri be the mean execution time per instruction for program i.
The arithmetic mean execution time per instruction is

    Ta = (1/m) Σ(i=1..m) Ti = (1/m) Σ(i=1..m) 1/Ri

Weighted Harmonic Mean

The harmonic mean execution rate across m programs is defined by Rh = 1/Ta:

    Rh = m / Σ(i=1..m) (1/Ri)

If we associate weights fi with the m programs, then we can compute the weighted harmonic mean execution rate:

    Rh* = 1 / Σ(i=1..m) (fi/Ri)
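The difference between the two means can be seen with a small sketch. The rates below are hypothetical MIPS ratings, chosen only to make the gap visible:

```python
def arithmetic_mean_rate(rates):
    """Ra = (1/m) * sum(Ri)."""
    return sum(rates) / len(rates)

def harmonic_mean_rate(rates):
    """Rh = m / sum(1/Ri): the inverse of the mean execution time per instruction."""
    return len(rates) / sum(1.0 / r for r in rates)

def weighted_harmonic_mean_rate(rates, weights):
    """Rh* = 1 / sum(fi / Ri); the weights fi must sum to 1."""
    return 1.0 / sum(f / r for f, r in zip(weights, rates))

rates = [10.0, 40.0]  # hypothetical execution rates of two programs
print(arithmetic_mean_rate(rates))                      # 25.0
print(harmonic_mean_rate(rates))                        # 16.0
print(weighted_harmonic_mean_rate(rates, [0.5, 0.5]))   # 16.0 (equal weights)
```

The harmonic mean (16.0) is well below the arithmetic mean (25.0) because the slow program dominates the actual elapsed time, which is exactly the property the slides argue for.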

Weighted Harmonic Mean Speedup

The program is executed in mode i if i processors are used; the execution rate Ri reflects the collective speed of i processors.
T1 = 1/R1 = 1 is the sequential execution time on a single processor with rate R1 = 1.
Ti = 1/Ri = 1/i is the execution time using i processors with a combined execution rate of Ri = i.
Now suppose a program has n execution modes with associated weights f1, ..., fn. The weighted harmonic mean speedup is defined as

    S = T1/T* = 1 / Σ(i=1..n) (fi/Ri)

where T* = 1/Rh* is the weighted harmonic mean execution time.

Example 3.2

Amdahl's Law

Assume Ri = i, and the weights are (α, 0, ..., 0, 1 − α).
  Basically, this means the system is used sequentially (with probability α) or all n processors are used (with probability 1 − α).
This yields the speedup equation known as Amdahl's law:

    Sn = n / (1 + (n − 1)·α)

The implication is that the best speedup possible is 1/α, regardless of n, the number of processors (as n → ∞).
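This reduction can be checked numerically: with Ri = i and weights (α, 0, ..., 0, 1 − α), the weighted harmonic mean speedup collapses to n/(1 + (n − 1)α). The values of n and α below are arbitrary test inputs:

```python
def weighted_harmonic_speedup(weights, rates):
    """S = 1 / sum(fi / Ri): the weighted harmonic mean speedup."""
    return 1.0 / sum(f / r for f, r in zip(weights, rates))

def amdahl(alpha, n):
    """Amdahl's law in its weighted-mode form: Sn = n / (1 + (n - 1) * alpha)."""
    return n / (1.0 + (n - 1) * alpha)

n, alpha = 16, 0.1
rates = [float(i) for i in range(1, n + 1)]           # Ri = i for mode i
weights = [alpha] + [0.0] * (n - 2) + [1.0 - alpha]   # (alpha, 0, ..., 0, 1 - alpha)
print(weighted_harmonic_speedup(weights, rates))      # 6.4
print(amdahl(alpha, n))                               # 6.4
```

Both expressions give the same value, confirming that Amdahl's law is the special case of the weighted harmonic mean speedup in which only the two extreme modes carry weight.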

Amdahl's Law

[Figure: speedup curves predicted by Amdahl's law]

System Efficiency 1

Assume the following definitions:
  O(n) = total number of unit operations performed by an n-processor system in completing a program P.
  T(n) = execution time required to execute the program P on an n-processor system.

O(n) can be considered similar to the total number of instructions executed by the n processors, perhaps scaled by a constant factor.
O(1) = T(1) for a uniprocessor system.
T(n) < O(n) if more than one operation is performed by the n processors per unit time, i.e., when n ≥ 2.

System Efficiency 2

The speedup factor (how much faster the program runs with n processors) can now be expressed as

    S(n) = T(1)/T(n)

System efficiency is defined as

    E(n) = S(n)/n = T(1) / (n·T(n))

It indicates the actual degree of speedup achieved in a system as compared with the maximum possible speedup.
Thus 1/n ≤ E(n) ≤ 1. The value is 1/n when only one processor is used (regardless of n), and the value is 1 when all processors are fully utilized.

Redundancy

The redundancy in a parallel computation is defined as

    R(n) = O(n)/O(1)

What values can R(n) take?
  R(n) = 1 when O(n) = O(1), i.e., when the number of operations performed is independent of the number of processors n. This is the ideal case.
  R(n) = n when all processors perform the same number of operations as when only a single processor is used; this implies that n completely redundant computations are performed!

R(n) indicates the extent to which the software parallelism is carried over to the hardware implementation without extra operations being performed.

System Utilization

System utilization is defined as

    U(n) = R(n)·E(n) = O(n) / (n·T(n))

It indicates the degree to which the system resources were kept busy during execution of the program.
Since 1 ≤ R(n) ≤ n and 1/n ≤ E(n) ≤ 1, the best possible value for U(n) is 1, and the worst is 1/n:

    1/n ≤ E(n) ≤ U(n) ≤ 1
    1 ≤ R(n) ≤ 1/E(n) ≤ n

Quality of Parallelism

The quality of a parallel computation is defined as

    Q(n) = S(n)·E(n)/R(n) = T³(1) / (n·T²(n)·O(n))

This measure is directly related to speedup (S) and efficiency (E), and inversely related to redundancy (R).
The quality measure is bounded above by the speedup (that is, Q(n) ≤ S(n)).
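The five metrics defined above can be computed together from T(1), T(n), O(1), and O(n). The numbers in the example run below are hypothetical, chosen only to exercise the formulas:

```python
def parallel_metrics(t1, tn, o1, on, n):
    """Compute speedup, efficiency, redundancy, utilization, and quality
    for an n-processor run, from T(1), T(n), O(1), O(n)."""
    s = t1 / tn        # S(n) = T(1)/T(n)
    e = s / n          # E(n) = S(n)/n
    r = on / o1        # R(n) = O(n)/O(1)
    u = r * e          # U(n) = R(n)*E(n) = O(n)/(n*T(n))
    q = s * e / r      # Q(n) = S(n)*E(n)/R(n)
    return s, e, r, u, q

# hypothetical run: T(1)=100, T(4)=30, O(1)=100, O(4)=120, n=4
s, e, r, u, q = parallel_metrics(100.0, 30.0, 100.0, 120.0, 4)
print(s, e, r, u, q)
```

In this example U(4) = 120/(4·30) = 1.0 (the processors are fully busy), yet R(4) = 1.2 shows that 20% extra operations were the price of keeping them busy; Q stays below S, as the bound requires.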

Standard Industry Performance Measures

MIPS and Mflops, while easily understood, are poor measures of system performance, since their interpretation depends on machine clock cycles and instruction sets. For example, which of these machines is faster?
  a 10 MIPS CISC computer
  a 20 MIPS RISC computer

It is impossible to tell without knowing more details about the instruction sets on the machines. Even the question "which machine is faster?" is suspect, since we really need to ask "faster at doing what?"

Doing What?

To answer the "doing what?" question, several standard benchmark programs are frequently used.

The Dhrystone benchmark uses no floating-point instructions, system calls, or library functions.
  It uses exclusively integer data items.
  Each execution of the entire set of high-level language statements is a Dhrystone, and a machine is rated as having a performance of some number of Dhrystones per second (sometimes reported as KDhrystones/sec).

The Whetstone benchmark uses a more complex program involving floating-point and integer data, arrays, subroutines with parameters, conditional branching, and library functions.

The performance of a machine on these benchmarks depends in large measure on the compiler used to generate the machine language.

Other Measures

Transactions per second (TPS)
  TPS is a measure that is appropriate for online systems like those used to support ATMs, reservation systems, and point-of-sale terminals.
  The measure may include communication overhead, database search and update, and logging operations. The benchmark is also useful for rating relational database performance.

Kilo Logical Inferences per Second (KLIPS)
  KLIPS is a measure of the number of logical inferences per second that can be performed by a system, used to indicate how well that system will perform at certain AI applications.
  Since one inference requires about 100 instructions (in the benchmark), a rating of 400 KLIPS is roughly equivalent to 40 MIPS.
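The KLIPS-to-MIPS conversion above is simple arithmetic; this sketch makes the assumed 100-instructions-per-inference figure an explicit parameter (the function name is illustrative):

```python
def klips_to_mips(klips, instructions_per_inference=100):
    """Convert KLIPS (kilo logical inferences per second) to an approximate
    MIPS rating, assuming a fixed instruction count per inference."""
    inferences_per_sec = klips * 1_000
    instructions_per_sec = inferences_per_sec * instructions_per_inference
    return instructions_per_sec / 1_000_000

print(klips_to_mips(400))  # 40.0, matching the 400 KLIPS ~ 40 MIPS figure above
```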

Problem 1

For the purpose of solving a given application problem, you benchmark a program on two computer systems.
  On system A, the object code executed 80 million Arithmetic Logic Unit operations (ALU ops), 40 million load instructions, and 25 million branch instructions.
  On system B, the object code executed 50 million ALU ops, 50 million loads, and 40 million branch instructions.
  In both systems, each ALU op takes 1 clock cycle, each load takes 3 clock cycles, and each branch takes 5 clock cycles.

A. Compute the relative frequency of occurrence of each type of instruction executed in both systems.
B. Find the CPI for each system.

Problem 1 Solution

a) Relative frequency of each instruction type:
   System A executes 145 million instructions in total, so the frequencies are
   ALU: 80/145 ≈ 0.55, load: 40/145 ≈ 0.28, branch: 25/145 ≈ 0.17.
   System B executes 140 million instructions in total, so the frequencies are
   ALU: 50/140 ≈ 0.36, load: 50/140 ≈ 0.36, branch: 40/140 ≈ 0.29.

b) CPI for each system:
   System A: (80·1 + 40·3 + 25·5) / 145 = 325/145 ≈ 2.24 cycles per instruction.
   System B: (50·1 + 50·3 + 40·5) / 140 = 400/140 ≈ 2.86 cycles per instruction.
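The arithmetic for Problem 1 can be checked with a short script (instruction counts are in millions; the helper name is illustrative):

```python
def frequencies_and_cpi(counts, cycles):
    """counts: instruction count per type; cycles: clock cycles per type.
    Returns (relative frequencies, CPI = total cycles / total instructions)."""
    total_instr = sum(counts.values())
    freqs = {k: c / total_instr for k, c in counts.items()}
    total_cycles = sum(counts[k] * cycles[k] for k in counts)
    return freqs, total_cycles / total_instr

cycles = {"alu": 1, "load": 3, "branch": 5}
freq_a, cpi_a = frequencies_and_cpi({"alu": 80, "load": 40, "branch": 25}, cycles)
freq_b, cpi_b = frequencies_and_cpi({"alu": 50, "load": 50, "branch": 40}, cycles)
print(round(cpi_a, 2), round(cpi_b, 2))  # 2.24 2.86
```

Despite executing fewer instructions in total, system B has the higher CPI because its mix is weighted toward the slower load and branch instructions.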

Problem 2 (Book 3.1)

(a) Average CPI
(b) MIPS rate for 4 40-MHz processors
(c) Speedup
(d) Efficiency