
Principles of scalable performance

Performance measures and laws

Amdahl's law

Principles of scalable performance

Predict performance of parallel programs

Accurate predictions of the performance of a parallel algorithm help determine whether coding it is worthwhile.

Understand barriers to higher performance

Allows you to determine how much improvement

can be realized by increasing the number of
processors used.

Speedup measures the reduction in running time due to parallelism.

The number of PEs is given by n.

Based on running times, S(n) = ts/tp , where

ts is the execution time on a single processor, using

the fastest known sequential algorithm
tp is the execution time using a parallel processor.

For theoretical analysis, S(n) = ts/tp , where

ts is the worst-case running time of the fastest known sequential algorithm for the problem
tp is the worst-case running time of the parallel algorithm using n PEs.


Speedup = (sequential execution time) / (parallel execution time)

Linear Speedup Usually Optimal

Theorem: The maximum possible speedup for parallel computers with n PEs for traditional problems is n.
Assume a computation is partitioned perfectly into n processes of equal duration.
Assume no overhead is incurred as a result of this partitioning of the computation (e.g., partitioning process, information passing, coordination of processes, etc.).
Under these ideal conditions, the parallel computation will execute n times faster than the sequential computation.
The parallel running time is ts/n.
Then the parallel speedup of this computation is
S(n) = ts/(ts/n) = n

Linear Speedup Usually Optimal

This proof is not valid for certain types of

nontraditional problems.
Unfortunately, the best speedup possible for most
applications is much smaller than n

The optimal performance assumed in the proof is unattainable.

Usually some parts of programs are sequential and
allow only one PE to be active.
Sometimes a large number of processors are idle for
certain portions of the program.

During parts of the execution, many PEs may be waiting

to receive or to send data.
E.g., recall blocking can occur in message passing


Superlinear algorithms are required for many nonstandard problems

If a problem cannot be solved in the required time without the use of parallel computation, it seems fair to say that ts = ∞.
Since for a fixed tp > 0, S(n) = ts/tp is greater than n for all sufficiently large values of ts, it seems reasonable to consider these solutions to be superlinear.
Examples include nonstandard problems involving
Real-time requirements, where deadlines are part of the problem requirements.
Problems where all data is not initially available, but has to be processed after it arrives.

Some problems are natural to solve using parallelism, and their sequential solutions are inefficient.

S = Time(the most efficient sequential algorithm) / Time(parallel algorithm),
where N is the number of processors.


Amdahl's Law

Gene Amdahl

Chief architect of IBM's first mainframe series and founder of Amdahl Corporation and other companies.
Amdahl found that there were some fairly stringent restrictions on how much of a speedup one could get for a given parallelized task.
These observations were wrapped up in Amdahl's Law.

Those who intend to write parallel software MUST have a very deep understanding of Amdahl's Law if they want to avoid having unrealistic expectations.


Amdahl's Law

The main objective is to produce the results as soon as possible
(e.g., video compression, computer graphics, VLSI routing, etc.)

Make the sequential bottleneck as small as possible


Amdahl's Law

Let f be the fraction of operations in a computation that must be performed sequentially, where 0 ≤ f ≤ 1.

The maximum speedup S achievable by a parallel computer with n processors is

$$S(n) \le \frac{1}{f + (1 - f)/n} \le \frac{1}{f}$$


Amdahl's Law

Proof for Traditional Problems:

If the fraction of the computation that cannot be divided into concurrent tasks is f, and no overhead is incurred when the computation is divided into concurrent parts, the time to perform the computation with n processors is given by

$$t_p \ge f t_s + \frac{(1 - f)\, t_s}{n}$$

as shown in the figure.



Proof of Amdahl's Law (cont.)

Using the preceding expression for tp,

$$S(n) = \frac{t_s}{t_p} \le \frac{t_s}{f t_s + \frac{(1 - f)\, t_s}{n}} = \frac{1}{f + \frac{1 - f}{n}}$$


Proof of Amdahl's Law (cont.)

The last expression is obtained by dividing numerator and denominator by ts, which establishes Amdahl's law.
Multiplying numerator and denominator by n produces the following alternate version of this formula:

$$S(n) \le \frac{n}{nf + (1 - f)} = \frac{n}{1 + (n - 1) f}$$

Amdahl's Law and the Amdahl Effect

Sometimes Amdahl's law is just stated as
S(n) ≤ 1/f
Note that S(n) never exceeds 1/f.

As the problem size increases,
the sequential portion of the algorithm typically decreases
the speedup increases

Amdahl Effect: Speedup is usually an increasing function of the problem size

Amdahl's law

If F is the fraction of a calculation that is

sequential, and (1-F) is the fraction that can
be parallelized, then the maximum speed-up
that can be achieved by using n processors
is 1/(F+(1-F)/n).


Amdahl's law

If 90% of a calculation can be parallelized (i.e. 10% is sequential), then the maximum speed-up which can be achieved on 5 processors is 1/(0.1+(1-0.1)/5) or roughly 3.6 (i.e. the program can theoretically run 3.6 times faster on five processors than on one).


Amdahl's law

If 90% of a calculation can be parallelized, then the maximum speed-up on 10 processors is 1/(0.1+(1-0.1)/10) or 5.3 (i.e. investing twice as much hardware speeds the calculation up by about 50%).

If 90% of a calculation can be parallelized, then the maximum speed-up on 20 processors is 1/(0.1+(1-0.1)/20) or 6.9 (i.e. doubling the hardware again speeds up the calculation by only 30%).


Amdahl's law

If 90% of a calculation can be parallelized, then the maximum speed-up on 1000 processors is 1/(0.1+(1-0.1)/1000) or 9.9 (i.e. throwing an absurd amount of hardware at the calculation results in a maximum theoretical speed-up of 9.9 versus a single processor; actual results will be worse).
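The worked figures above can be checked with a few lines of Python. This is a minimal sketch rather than part of the original slides: the function name `amdahl_speedup` is chosen here for illustration, and it simply evaluates the bound 1/(F+(1-F)/n) stated earlier.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Upper bound on speedup for sequential fraction f on n processors."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / (f + (1.0 - f) / n)


if __name__ == "__main__":
    # 10% sequential, 90% parallelizable, as in the examples above.
    for n in (5, 10, 20, 1000):
        print(f"n = {n:4d}: S <= {amdahl_speedup(0.1, n):.1f}")
    # prints 3.6, 5.3, 6.9 and 9.9 for the four processor counts
```

However large n becomes, the bound stays below 1/f = 10 for f = 0.1.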


[Figure: curves labeled n = 100, n = 1,000, and n = 10,000]


95% of a program's execution time occurs inside a loop that can be executed in parallel. What is the maximum speedup we should expect from a parallel version of the program executing on 8 CPUs?

$$S \le \frac{1}{0.05 + (1 - 0.05)/8} \approx 5.9$$


The point that Amdahl was trying to make is that using lots of parallel processors is not a practical way of achieving the sort of speed-up that people were looking for,
i.e. it is essentially an argument in support of putting effort into making single-processor systems run faster.


Why are actual speed-ups always less?

Distributing work to the parallel processors and collecting the results back together is extra work required in the parallel version which isn't required in the serial version.

Straggler problem: when the program is executing in the parallel parts of the code, it is unlikely that all of the processors will be computing all of the time; some of them will likely run out of work to do before others have finished their part of the parallel work.


Maximum theoretical speed-up

Amdahl's Law is a statement of the maximum theoretical speed-up you can ever hope to achieve. The actual speed-ups are always less than the speed-up predicted by Amdahl's Law.


Parallelism Profile in Programs

Degree of Parallelism (DOP)

The number of processors used at any instant to execute a program is called the degree of parallelism (DOP); this can vary over time.

DOP assumes an infinite number of processors are available; this is not achievable in real machines, so some parallel program segments must be executed sequentially as smaller parallel segments.
The execution of a program on a parallel computer may use different numbers of processors at different time periods during the execution cycle.


Parallelism profile

A plot of DOP vs. time is called a parallelism profile

Fluctuation of the profile during an observation period depends on the algorithmic structure, program optimization, resource utilization, and run-time conditions.

If DOP exceeds the maximum number of available processors in a system, some parallel branches must be executed in chunks sequentially.
Parallelism still exists within each chunk, limited by the machine size.
DOP may also be limited by memory and by other nonprocessor resources.

Average Parallelism

We consider a parallel computer consisting of n homogeneous processors.
The maximum parallelism in a profile is m.
The computing capacity Δ of a single processor is approximated by its execution rate (such as MIPS or Mflops), without considering the penalties from memory access, communication latency, or system overhead.
When i processors are busy during an observation period, we have DOP = i.


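Average Parallelism

The total amount of work W (instructions or computations) performed is proportional to the area under the profile curve:

$$W = \Delta \int_{t_1}^{t_2} \mathrm{DOP}(t)\, dt$$

where Δ is the computing capacity of a single processor. This integral is often computed with the discrete summation

$$W = \Delta \sum_{i=1}^{m} i \cdot t_i$$

where t_i is the total amount of time that DOP = i, and $\sum_{i=1}^{m} t_i = t_2 - t_1$ is the total elapsed time.
The average parallelism A is computed as

$$A = \frac{1}{t_2 - t_1} \int_{t_1}^{t_2} \mathrm{DOP}(t)\, dt = \frac{\sum_{i=1}^{m} i \cdot t_i}{\sum_{i=1}^{m} t_i}$$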


Example 3.1

The parallelism profile of a divide-and-conquer algorithm increases from 1 to its peak value m = 8 and then decreases to 0 during the observation period (t1, t2).

The average parallelism is A = 3.72.

The total workload is W = A Δ (t2 - t1), where Δ is the computing capacity of a single processor.
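The average parallelism of a profile can be computed directly from the time spent at each DOP level. In the sketch below the per-level durations are assumed values, since the profile figure for Example 3.1 is not reproduced here; they are chosen so that the result matches A = 3.72. The function name and the dictionary layout are also illustrative choices.

```python
def average_parallelism(profile: dict[int, float]) -> float:
    """A = sum(i * t_i) / sum(t_i) for a profile {DOP level i: time t_i}."""
    total_time = sum(profile.values())
    if total_time <= 0:
        raise ValueError("profile must cover a positive amount of time")
    return sum(i * t for i, t in profile.items()) / total_time


# Assumed durations (in arbitrary time units) for DOP = 1..8.
profile = {1: 5, 2: 3, 3: 4, 4: 6, 5: 2, 6: 2, 7: 0, 8: 3}

A = average_parallelism(profile)      # 93 / 25 = 3.72
elapsed = sum(profile.values())       # t2 - t1 = 25
delta = 1.0                           # assumed per-processor capacity
W = A * delta * elapsed               # total workload, 93 units of work
```

With Δ = 1 the workload W equals the area under the profile curve, which is the numerator 93 of the average.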


Available Parallelism

Various studies have shown that the potential parallelism in scientific and engineering calculations can be very high.

High DOP due to data parallelism
Hundreds or thousands of instructions per clock cycle
But in real machines, the actual parallelism is much smaller (e.g. 10 or 20).

Studies of available parallelism show that it is lower in non-numeric code than in scientific code.
Non-numeric code has relatively little parallelism even when basic block boundaries are ignored.

Available Parallelism

A basic block is a sequence or block of instructions with one entry and one exit.
Basic blocks are frequently used as the unit of optimization in compilers (since it is easier to manage the use of registers within a block).
Compiler optimization and algorithm redesign may increase the available parallelism in the application.
Limiting optimization to basic blocks limits the instruction-level parallelism that can be obtained (to about 2 to 5 in typical code).
DOP may be in the thousands in some scientific codes when multiple processors are used.

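Asymptotic Speedup - 1

Here Δ is the computing capacity of a single processor and t_i is the total time during which DOP = i.

$W_i = i \Delta t_i$ (work done when DOP = i)

$W = \sum_{i=1}^{m} W_i$ (relates sum of W_i terms to W)

$t_i(1) = W_i / \Delta$ (execution time of W_i with a single processor)

$t_i(k) = W_i / (k \Delta)$ (execution time of W_i with k processors)

$t_i(\infty) = W_i / (i \Delta)$ (execution time of W_i with an infinite number of processors, for 1 ≤ i ≤ m)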

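Asymptotic Speedup - 2

$$T(1) = \sum_{i=1}^{m} t_i(1) = \sum_{i=1}^{m} \frac{W_i}{\Delta}$$
(Response time with 1 processor)

$$T(\infty) = \sum_{i=1}^{m} t_i(\infty) = \sum_{i=1}^{m} \frac{W_i}{i \Delta}$$
(Response time with an infinite number of processors)

$$S_\infty = \frac{T(1)}{T(\infty)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} W_i / i}$$
(Asymptotic speedup in the ideal case)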

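Asymptotic Speedup - 3

$$A = \frac{\sum_{i=1}^{m} i \cdot t_i}{\sum_{i=1}^{m} t_i}$$

$$S_\infty = \frac{T(1)}{T(\infty)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} W_i / i}$$

Comparing these two equations (substituting $W_i = i \Delta t_i$ into the second gives the first):
In the ideal case, $S_\infty = A$.
$S_\infty \le A$ if communication latency and other system overhead are considered.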
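The equality between asymptotic speedup and average parallelism in the ideal case can be checked numerically. The sketch below assumes a small made-up profile and a per-processor capacity `delta` (the symbol lost from the formulas above); the function names are illustrative only.

```python
def asymptotic_speedup(times: dict[int, float], delta: float = 1.0) -> float:
    """S_inf = T(1) / T(inf) for a profile {DOP level i: time t_i}."""
    if any(i < 1 for i in times):
        raise ValueError("DOP levels must be >= 1")
    work = {i: i * delta * t for i, t in times.items()}    # W_i = i * delta * t_i
    t_one = sum(w / delta for w in work.values())          # T(1): one processor
    t_inf = sum(w / (i * delta) for i, w in work.items())  # T(inf): i processors for W_i
    return t_one / t_inf


def average_parallelism(times: dict[int, float]) -> float:
    """A = sum(i * t_i) / sum(t_i)."""
    return sum(i * t for i, t in times.items()) / sum(times.values())


# Hypothetical profile: 2 time units at DOP 1, 4 at DOP 2, 2 at DOP 4.
profile = {1: 2.0, 2: 4.0, 4: 2.0}
s_inf = asymptotic_speedup(profile)   # 18 / 8 = 2.25
a = average_parallelism(profile)      # 18 / 8 = 2.25
```

The value of `delta` cancels in the ratio, so the result does not depend on the processor's execution rate.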


Mean Performance Calculation

We seek to obtain a measure that characterizes the

mean, or average, performance of a set of benchmark
programs with potentially many different execution
modes (e.g. scalar, vector, sequential, parallel).

We may also wish to associate weights with these

programs to emphasize these different modes and yield
a more meaningful performance measure.

Consider a parallel computer with n processors executing m programs in various modes with different performance levels.

Arithmetic Mean

The arithmetic mean is familiar (sum of the terms divided by the number of terms).
Our measures will use execution rates expressed in MIPS or Mflops.
Let Ri be the execution rates of the programs i = 1, 2, ..., m.
The arithmetic mean execution rate is defined as

$$R_a = \sum_{i=1}^{m} \frac{R_i}{m}$$

Ra assumes equal weighting 1/m on all m programs.

Weight distribution: {fi | i = 1, 2, ..., m}

Weighted Arithmetic Mean

Weighted arithmetic mean execution rate:

$$R_a^* = \sum_{i=1}^{m} f_i R_i$$

The arithmetic mean of a set of execution rates is proportional to the sum of the inverses of the execution times; it is not inversely proportional to the sum of the execution times.
Thus the arithmetic mean fails to represent the real time consumed by the benchmark programs when they are executed.


Harmonic Mean

Instead of the arithmetic mean, we use the harmonic mean execution rate, which is just the inverse of the arithmetic mean of the execution times (thus guaranteeing the inverse relation not exhibited by the other means).
Ti = 1/Ri is the mean execution time per instruction for program i.
The arithmetic mean execution time per instruction is

$$T_a = \frac{1}{m} \sum_{i=1}^{m} T_i = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{R_i}$$

Weighted Harmonic Mean

The harmonic mean execution rate across m programs is defined by Rh = 1/Ta:

$$R_h = \frac{m}{\sum_{i=1}^{m} 1/R_i}$$

If we associate weights fi with the m programs, then we can compute the weighted harmonic mean:

$$R_h^* = \frac{1}{\sum_{i=1}^{m} f_i / R_i}$$
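The difference between the two means is easiest to see with numbers. The following sketch uses two hypothetical programs with made-up rates of 10 and 100 MIPS; the function names and the assumption that both programs execute the same number of instructions are illustrative, not taken from the slides.

```python
def arithmetic_mean_rate(rates, weights=None):
    """Weighted arithmetic mean of execution rates (equal weights by default)."""
    if weights is None:
        weights = [1.0 / len(rates)] * len(rates)
    return sum(f * r for f, r in zip(weights, rates))


def harmonic_mean_rate(rates, weights=None):
    """Weighted harmonic mean: 1 / sum(f_i / R_i) (equal weights by default)."""
    if weights is None:
        weights = [1.0 / len(rates)] * len(rates)
    return 1.0 / sum(f / r for f, r in zip(weights, rates))


rates = [10.0, 100.0]                 # hypothetical rates in MIPS
ra = arithmetic_mean_rate(rates)      # 55.0 MIPS
rh = harmonic_mean_rate(rates)        # 200 / 11, about 18.18 MIPS

# Cross-check against real time: each program runs 100 million instructions.
instructions = 100.0                                # millions, per program
total_time = sum(instructions / r for r in rates)   # 10 s + 1 s = 11 s
observed_rate = len(rates) * instructions / total_time
```

The observed rate over the whole run equals the harmonic mean, not the arithmetic mean, which is why the harmonic mean is the one that tracks execution time.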

Weighted Harmonic Mean Speedup

A program is executed in mode i if i processors are used.
The execution rate Ri reflects the combined speed of i processors.
T1 = 1/R1 = 1 is the sequential execution time on a single processor with rate R1 = 1.
Ti = 1/Ri = 1/i is the execution time using i processors with a combined execution rate of Ri = i.
Now suppose a program has n execution modes with associated weights f1, ..., fn. The weighted harmonic mean speedup is defined as:

$$S = \frac{T_1}{T^*} = \frac{1}{\sum_{i=1}^{n} f_i / R_i}$$

where $T^* = 1/R_h^*$ is the weighted arithmetic mean execution time.

Example 3.2


Amdahl's Law

Assume Ri = i, and the weights are w = (α, 0, ..., 0, 1 - α).

Basically this means the system is used sequentially (with probability α) or all n processors are used (with probability 1 - α).
This yields the speedup equation known as Amdahl's law:

$$S_n = \frac{n}{1 + (n - 1)\alpha}$$

The implication is that the best speedup possible is 1/α, regardless of n, the number of processors (the limit as n → ∞).
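To see that Amdahl's law is a special case of the weighted harmonic mean speedup, one can evaluate both forms side by side. This is a sketch under the stated assumptions (Ri = i, all weight on mode 1 and mode n); the helper names are hypothetical.

```python
def weighted_harmonic_speedup(weights):
    """S = 1 / sum(f_i / R_i) with R_i = i; weights[i - 1] holds f_i."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return 1.0 / sum(f / i for i, f in enumerate(weights, start=1))


def amdahl_weights(alpha, n):
    """Weight vector (alpha, 0, ..., 0, 1 - alpha) over n execution modes."""
    weights = [0.0] * n
    weights[0] += alpha           # sequential mode
    weights[-1] += 1.0 - alpha    # fully parallel mode (same slot when n == 1)
    return weights


def amdahl_closed_form(alpha, n):
    """S_n = n / (1 + (n - 1) * alpha)."""
    return n / (1.0 + (n - 1) * alpha)


alpha, n = 0.1, 10
s_weighted = weighted_harmonic_speedup(amdahl_weights(alpha, n))  # 1 / 0.19
s_closed = amdahl_closed_form(alpha, n)                           # 10 / 1.9
```

Both expressions give about 5.26 here, and the closed form approaches 1/α = 10 as n grows.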


System Efficiency 1

Assume the following definitions:

O(n) = total number of unit operations performed by an n-processor system in completing a program P.
T(n) = execution time required to execute the program P on an n-processor system.

O(n) can be considered similar to the total number of instructions executed by the n processors, perhaps scaled by a constant factor.
T(1) = O(1) for a uniprocessor system.
T(n) < O(n) if more than one operation is performed by the n processors per unit time, where n ≥ 2.

System Efficiency 2

The speedup factor (how much faster the program runs with n processors) can now be expressed as
S(n) = T(1) / T(n)

System efficiency is defined as
E(n) = S(n) / n = T(1) / (n T(n))

It indicates the actual degree of speedup achieved in a system as compared with the maximum possible speedup.
Thus 1/n ≤ E(n) ≤ 1. The value is 1/n when only one processor is used (regardless of n), and the value is 1 when all processors are fully utilized.


Redundancy

The redundancy in a parallel computation is defined as
R(n) = O(n) / O(1)
What values can R(n) obtain?

R(n) = 1 when O(n) = O(1), or when the number of operations performed is independent of the number of processors, n. This is the ideal case.
R(n) = n when all processors perform the same number of operations as when only a single processor is used; this implies that n completely redundant computations are performed!

The R(n) figure indicates to what extent the software parallelism is carried over to the hardware implementation without extra operations being performed.

System Utilization

System utilization is defined as
U(n) = R(n) E(n) = O(n) / (n T(n))

It indicates the degree to which the system resources were kept busy during execution of the program.
Since 1 ≤ R(n) ≤ n and 1/n ≤ E(n) ≤ 1, the best possible value for U(n) is 1, and the worst is 1/n.
1/n ≤ E(n) ≤ U(n) ≤ 1
1 ≤ R(n) ≤ 1/E(n) ≤ n

Quality of Parallelism

The quality of a parallel computation is defined as

$$Q(n) = \frac{S(n)\, E(n)}{R(n)} = \frac{T^3(1)}{n\, T^2(n)\, O(n)}$$

This measure is directly related to speedup (S) and efficiency (E), and inversely related to redundancy (R).
The quality measure is bounded by the speedup (that is, Q(n) ≤ S(n)).
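The five measures defined above (speedup, efficiency, redundancy, utilization, quality) all follow from four inputs. The sketch below bundles them in a small class; the class name, field names, and the sample measurements are hypothetical, and T(1) = O(1) is assumed as in the definitions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelRun:
    n: int        # number of processors
    t1: float     # T(1), which also equals O(1) on a uniprocessor
    tn: float     # T(n)
    on: float     # O(n), total unit operations on n processors

    @property
    def speedup(self) -> float:       # S(n) = T(1) / T(n)
        return self.t1 / self.tn

    @property
    def efficiency(self) -> float:    # E(n) = S(n) / n
        return self.speedup / self.n

    @property
    def redundancy(self) -> float:    # R(n) = O(n) / O(1)
        return self.on / self.t1

    @property
    def utilization(self) -> float:   # U(n) = R(n) E(n) = O(n) / (n T(n))
        return self.redundancy * self.efficiency

    @property
    def quality(self) -> float:       # Q(n) = S(n) E(n) / R(n)
        return self.speedup * self.efficiency / self.redundancy


# Made-up measurements: 4 processors, T(1) = 100, T(4) = 40, O(4) = 120.
run = ParallelRun(n=4, t1=100.0, tn=40.0, on=120.0)
# S = 2.5, E = 0.625, R = 1.2, U = 0.75, Q is about 1.30
```

For these values the bounds stated above hold: 1/n ≤ E ≤ U ≤ 1, 1 ≤ R ≤ 1/E ≤ n, and Q ≤ S.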

Standard Industry Performance Measures

MIPS and Mflops, while easily understood, are poor

measures of system performance, since their
interpretation depends on machine clock cycles and
instruction sets. For example, which of these machines
is faster?

a 10 MIPS CISC computer

a 20 MIPS RISC computer

It is impossible to tell without knowing more details about the instruction sets on the machines. Even the question, "which machine is faster," is suspect, since we really need to say "faster at doing what?"

Doing What?

To answer the "doing what?" question, several standard programs are frequently used.

The Dhrystone benchmark uses no floating point

instructions, system calls, or library functions.
It uses exclusively integer data items.
Each execution of the entire set of high-level
language statements is a Dhrystone, and a machine
is rated as having a performance of some number of
Dhrystones per second (sometimes reported as

The Whetstone benchmark uses a more complex

program involving floating point and integer data, arrays,
subroutines with parameters, conditional branching, and
library functions.

The performance of a machine on these benchmarks

depends in large measure on the compiler used to
generate the machine language.


Other Measures

Transactions per second (TPS)

TPS is a measure that is appropriate for online systems like those used to support ATMs, reservation systems, and point-of-sale terminals.
The measure may include communication overhead, database search and update, and logging operations. The benchmark is also useful for rating relational database performance.

Kilo Logical Inferences per Second (KLIPS)

KLIPS is the measure of the number of logical inferences per second that can be performed by a system, to relate how well that system will perform at certain AI applications.
Since one inference requires about 100 instructions (in the benchmark), a rating of 400 KLIPS is roughly equivalent to 40 MIPS.

Problem 1



For the purpose of solving a given application problem, you benchmark a program on two computer systems.
On system A, the object code executed 80 million Arithmetic Logic Unit operations (ALU ops), 40 million load instructions, and 25 million branch instructions.
On system B, the object code executed 50 million ALU ops, 50 million loads, and 40 million branch instructions.
In both systems, each ALU op takes 1 clock cycle, each load takes 3 clock cycles, and each branch takes 5 clock cycles.
a) Compute the relative frequency of occurrence of each type of instruction executed in both systems.
b) Find the CPI for each system.


Problem 1 Solution

a) Compute the relative frequency of occurrence of each type of instruction executed in both systems.

b) Find the CPI for each system.
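Both parts of Problem 1 follow from the instruction counts and cycle costs given in the problem statement. The following is a minimal sketch of that calculation; the dictionary layout and function names are illustrative, while the numbers are the ones stated in the problem.

```python
# Clock cycles per instruction type (same on both systems).
CYCLES = {"alu": 1, "load": 3, "branch": 5}

# Instruction counts in millions, from the problem statement.
COUNTS = {
    "A": {"alu": 80, "load": 40, "branch": 25},
    "B": {"alu": 50, "load": 50, "branch": 40},
}


def relative_frequencies(counts):
    """Fraction of executed instructions belonging to each type."""
    total = sum(counts.values())
    return {kind: c / total for kind, c in counts.items()}


def cpi(counts, cycles=CYCLES):
    """Average cycles per instruction = total cycles / total instructions."""
    total_cycles = sum(counts[kind] * cycles[kind] for kind in counts)
    return total_cycles / sum(counts.values())


for name, counts in COUNTS.items():
    freqs = relative_frequencies(counts)
    shares = ", ".join(f"{k} {v:.1%}" for k, v in freqs.items())
    print(f"System {name}: {shares}; CPI = {cpi(counts):.2f}")
# System A: alu 55.2%, load 27.6%, branch 17.2%; CPI = 2.24
# System B: alu 35.7%, load 35.7%, branch 28.6%; CPI = 2.86
```

System A executes 145 million instructions in 325 million cycles, and system B executes 140 million instructions in 400 million cycles, so B has the higher CPI despite executing fewer instructions.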


Problem 2 (Book 3.1)

(a) Average CPI

(b) MIPS Rate for 4 40-MHz Processors
(c) Speed-up
(d) Efficiency