
Parallel Computing

Parallel performance
Thorsten Grahs, 01.06.2015

Overview
Performance
Speedup & efficiency
Amdahl's Law
Gustafson's Law
Efficiency and scalability metrics
Timing models
Examples: Inner product


Parallel Performance
Primary purpose of parallelization
The primary purpose of parallelizing a system or a program is performance.
So what is performance?
Usually it is about one of the following
Reducing the total time it takes to compute a single result
Increasing the rate at which a series of results can be
computed
Reducing the power consumption of a computation

Speed up by parallelization
Question
Could one build a TeraFlop computer by using 1000
GigaFlop machines?
Yes, . . . but
Technical restrictions
Influenced by CPU speed, memory, network
Not all program parts can be parallelized
Speed-Up?
Efficiency?

Parallel run time


Parallel execution time Tp(n)

Time between the start of the parallel program and the end of all involved parallel processes.
p  number of processors
n  model size
The run time of a parallel program (distributed system) consists of
Local computation
Computation of one processor on its local data
Data exchange
Communication necessary for the data exchange
Waiting time
e.g. due to unbalanced loads on the processors
Synchronisation
Coordination between the involved processes

Cost of a parallel program


Costs Cp(n)
The costs (work, processor-time product) of a parallel program are defined as

Cp(n) = Tp(n) · p

The costs Cp(n)
take into account the time that all involved processors need to fulfil their tasks
are a measure of the work performed by all processors
are optimal when the parallel program performs as many operations as the fastest sequential algorithm with run time T(n), i.e.

Cp(n) = T(n).

Speed-up
Serial vs. parallel fraction
Speed-up Sp(n)
The speed-up of a parallel program with run time Tp(n) is defined as

Sp(n) = T(n) / Tp(n)

T(n)  run time of the optimal serial implementation

Measure for
the comparison between serial and parallel implementation
the relative gain in speed
Ideal case: linear speed-up, i.e. T = Tp · p and Sp = p

Efficiency
Serial vs. parallel fraction
Efficiency Ep(n)
The efficiency of a parallel program is defined as

Ep(n) = Sp(n) / p = T(n) / (p · Tp(n)) = T(n) / Cp(n)

Measures the proportion of the run time in which a processor performs computation that is also present in the serial program
Low efficiency = large parallel overhead
Linear speed-up Sp = p corresponds to Ep = 1
E100(n) = 0.4 means that each of the 100 processors spends about 60 % of its time on communication and other overhead
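For illustration, a minimal C sketch (function and variable names are illustrative, not from the slides) that evaluates these metrics from measured run times:

#include <stdio.h>

/* Speed-up, efficiency and cost from a measured serial time t_serial
   and a measured parallel time t_parallel on p processors. */
void report_metrics(double t_serial, double t_parallel, int p)
{
    double s = t_serial / t_parallel;        /* speed-up   Sp = T / Tp */
    double e = s / (double)p;                /* efficiency Ep = Sp / p */
    double c = t_parallel * (double)p;       /* cost       Cp = Tp * p */
    printf("p = %d: S = %.2f, E = %.3f, C = %.2f\n", p, s, e, c);
}

int main(void)
{
    report_metrics(100.0, 1.6, 100);         /* example: E = 0.625 on 100 processors */
    return 0;
}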

Amdahl's objection
Gene Amdahl, 1967
. . . the effort expended on achieving high parallel processing rates is wasted unless it is accompanied by achievements in sequential processing rates of nearly the same magnitude
What does this mean?
Inclusion of the sequential proportion
In general, a parallel algorithm always has inherent sequential components. This has implications for the achievable speed-up.
We assume
f      relative proportion of the serial part
1 - f  relative proportion of the ideally parallelizable part

Amdahl's Law
Gene Amdahl (born 1922)
Computer pioneer & businessman
Famous for his work on large-capacity computers at IBM
Founder of the Amdahl Corp.

Assume
T(n)  = Tser + Tpar      sequential execution
Tp(n) = Tser + Tpar/p    parallel execution on p processors

Amdahl's Law

Sp(n) = (Tser + Tpar) / (Tser + Tpar/p)


Amdahl's Law | Example

Assume
a program needs 10 hours on a single processor
one hour is sequential, the remaining 9 hours can be parallelized
Consequences
The program cannot run in less than one hour
The speed-up is at most 10
The speed-up is limited by the sequential part
Note: the problem size is fixed!

Amdahl's Law – relative formulation

If an implementation has a
relative proportion f (0 ≤ f ≤ 1) of inherently serial tasks,
relative proportion (1 - f) of ideally parallelizable tasks,
the overall run time of the program consists of the
run time for the serial part:    f · T(n)
run time for the parallel part:  (1 - f)/p · T(n)

Amdahl's Law (relative)

Sp(n) = T(n) / (f · T(n) + (1 - f)/p · T(n)) = 1 / (f + (1 - f)/p) ≤ 1/f

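As a small numerical illustration of the relative form, a minimal C sketch (with an assumed serial fraction of 1 %) shows how the speed-up saturates at 1/f:

#include <stdio.h>

/* Amdahl speed-up for serial fraction f and p processors */
double amdahl_speedup(double f, int p)
{
    return 1.0 / (f + (1.0 - f) / (double)p);
}

int main(void)
{
    double f = 0.01;                               /* assumed: 1 % serial fraction */
    int procs[] = { 1, 10, 100, 1000, 10000 };
    for (int i = 0; i < 5; i++)
        printf("p = %5d   Sp = %6.2f\n", procs[i], amdahl_speedup(f, procs[i]));
    /* Sp approaches 1/f = 100, no matter how many processors are used */
    return 0;
}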

Amdahl's Law – Example


Amdahl's Law – Speed-up


Amdahl's Law – Consequence

The serial part does not benefit from massively parallel execution
For a sequential proportion of 1 %
the maximum achievable speed-up is 100,
regardless of how many processors are used
For a sequential proportion of 20 %
the maximum achievable speed-up is 5,
regardless of how many processors are used

Conclusion
Massively parallel systems were for a long time considered inefficient and only interesting from a theoretical point of view

Further sources of limitations


Further delays due to
Load balancing
i.e. unequal work packages (loads)
Communication
e.g. waiting for data/processes

Parallelization-induced communication

Consider the communication time Tpc(n) due to the necessary data exchange between the participating processors:

Tp(n) = T(n) · [f + (1 - f)/p] + Tpc(n)

Tpc(n) is monotonically increasing in the number of processors p.



Communication Master-Worker
Assumptions
Excellent load balancing
Communication only between master and workers
Tpc(n) is a linear function of p
A processor does not communicate with itself
T2c(n)  minimal communication time (master & one worker)

Tpc(n) = T2c(n) · (p - 1)

This gives

S(p) = 1 / (f - r + (1 - f)/p + r·p)

with

r = Tc(2) / T(1)   (minimal communication time / serial computation time)


Comm. Master-Worker – Max. speed-up

New situation
For a large number of processors p the speed-up can decline
Maximal speed-up

S(p*) = 1 / (f - r + 2·√((1 - f)·r))    with    p* = √((1 - f)/r)

Pay-off area
Only for p ≪ p* does S(p) ≈ p hold. This means that parallel performance can only be assured for processor numbers much smaller than p*.
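A minimal sketch evaluating these two formulas for assumed values of f and r (both values are illustrative, not taken from the slides):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double f = 0.01;    /* assumed serial fraction              */
    double r = 1e-4;    /* assumed ratio Tc(2)/T(1)             */
    double p_opt = sqrt((1.0 - f) / r);                  /* optimal processor number p* */
    double s_max = 1.0 / (f - r + 2.0 * sqrt((1.0 - f) * r));
    printf("p* = %.1f, S(p*) = %.1f\n", p_opt, s_max);   /* roughly 99.5 and 33.6       */
    return 0;
}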

Scaling
An example
One decorator wallpapers a room in 60 minutes.
Two decorators wallpaper a room in 30 minutes.
How long does it take when 60 decorators work in the same room?
Remedy
Use the 60 decorators for a hotel with 60 rooms. . .

Problem scaling
With increasing # processors the problem size should be
increased, in order to guarantee parallel
performance/efficiency.

Scalability
Saturation of the speed-up
According to Amdahl's Law, for a fixed problem size n the speed-up saturates as the number of processors p increases
60 decorators in one room . . .
In scientific computing one is often interested in solving a bigger problem in the same time.
Growing problem size n combined with an increasing number of processors p
Instead of decorating one room, the 60 decorators can renovate the whole hotel (60 rooms).
This behaviour (speed-up for increasing n) is not captured by Amdahl's Law!

Scaled speed-up
Assumption
The sequential proportion of a parallel program decreases with increasing model size, i.e. it is not a constant fraction of the total computation as assumed in Amdahl's Law.
For each number of processors p the maximum speed-up Sp(n) ≤ p can be achieved,
namely by a correspondingly large model size
The behaviour of the run time T for larger problem sizes and a correspondingly increased number of processors is described by Gustafson's Law.

Gustafson's Law
John Gustafson (born 1955)
. . . speed-up should be measured by scaling the problem to the number of processors, not by fixing the problem size.
J. L. Gustafson: Reevaluating Amdahl's Law, Comm. of the ACM, 31:532-533, 1988
This implies:
inclusion of the model size n
sequential proportion f constant
Serial run time:    Ts(n) = f + n·(1 - f)
Parallel run time:  Tp(n) = f + n·(1 - f)/p

Gustafson's Law – scaled speed-up

Gustafson's Law (scaled speed-up)
If the problem size increases while the serial portion grows slowly or remains fixed, the speed-up grows as workers are added:

Sp(n) = Ts(n) / Tp(n)
      = (f + n·(1 - f)) / (f + n·(1 - f)/p)
      = p·(f + n·(1 - f)) / (f·p + n·(1 - f))

lim (n → ∞) Sp(n) = p

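A minimal C sketch of the scaled speed-up, with an assumed serial portion f = 0.05 and p = 100 workers, showing the approach to the limit p:

#include <stdio.h>

/* Scaled (Gustafson) speed-up with Ts(n) = f + n(1-f), Tp(n) = f + n(1-f)/p */
double gustafson_speedup(double f, double n, int p)
{
    return (f + n * (1.0 - f)) / (f + n * (1.0 - f) / (double)p);
}

int main(void)
{
    double f = 0.05;                       /* assumed serial portion */
    int p = 100;
    for (double n = 1.0; n <= 1000.0; n *= 10.0)
        printf("n = %6.0f   Sp = %6.2f\n", n, gustafson_speedup(f, n, p));
    /* the speed-up approaches p = 100 as the problem size n grows */
    return 0;
}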

Gustafson's Law – constant run time

Constant run time

To keep the run time constant while the problem size increases, one has to choose n = p. That is, for double the model size (e.g. number of grid points) one has to double the number of processors.

Karp-Flatt metric
Both Amdahl and Gustafson ignore the parallel overhead of the parallelization.
This may lead to an overestimated speedup.
Karp & Flatt introduced an
experimentally determined serial fraction

Karp-Flatt metric

f = (1/S - 1/p) / (1 - 1/p)

S  measured speedup
p  number of processors
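A minimal C helper for this metric (the function name is illustrative); for instance, karp_flatt(4.7, 8) gives roughly 0.10, as in the examples on the following slides:

/* Experimentally determined serial fraction (Karp-Flatt metric) */
double karp_flatt(double s, int p)   /* s: measured speedup, p: number of processors */
{
    return (1.0 / s - 1.0 / (double)p) / (1.0 - 1.0 / (double)p);
}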

Karp-Flatt metric
Advantage
Takes the parallel overhead into account
Detects other sources of overhead or inefficiency
that are ignored in the speedup models:
Process start-up time
Process synchronization time
Imbalanced workload
Architectural overhead


Karp-Flatt metric Example 1


Measured speedup

p   2     3     4     5     6     7     8
S   1.8   2.5   3.1   3.6   4.0   4.4   4.7

What is the primary reason for speedup of only 4.7 on 8 CPUs?

Karp-Flatt metric

f   0.1   0.1   0.1   0.1   0.1   0.1   0.1

Since f is constant (i.e. not increasing with growing p),
the limiting constraint is limited opportunity for parallelism
(a large fraction that is inherently sequential).

Karp-Flatt metric Example 2


Measured speedup

p   2      3      4      5      6      7      8
S   1.9    2.6    3.2    3.7    4.1    4.5    4.7

What is the primary reason for speedup of only 4.7 on 8 processors?

Karp-Flatt metric

f   0.070  0.075  0.080  0.085  0.090  0.095  0.100

Since f is steadily increasing, parallel overhead is the primary reason
(e.g. time spent in process start-up, communication, synchronization,
or an architectural constraint).

Scalability again
Scalability
The scalability of a parallel system
(i.e. parallel program executing on a parallel computer)
is a measure of its ability to increase performance as the
number of processors increases.
Speedup (and hence efficiency) is typically an increasing
function of the problem size.
This is called the Amdahl effect.
In order to maintain efficiency when processors are added,
we can increase the problem size.
This idea is formalized by the
isoefficiency relation

Efficiency relation
TO is the total amount of time spent by all processes doing work that is not done by the sequential algorithm (total overhead).
TS is the sequential execution time. We have

p·TP = TS + TO,   i.e.   TP = (TS + TO) / p

Speed-up

S = TS / TP = p·TS / (TS + TO)

Efficiency

E = S / p = TS / (TS + TO) = 1 / (1 + TO/TS)


Isoefficiency relation

E(n, p) ≤ 1 / (1 + TO(n, p)/TS(n))

⇒  TO(n, p) / TS(n) ≤ (1 - E(n, p)) / E(n, p)

⇒  TS(n) ≥ [E(n, p) / (1 - E(n, p))] · TO(n, p)

If we wish to maintain a constant level of efficiency,
E(n, p) / (1 - E(n, p)) =: C has to be a constant.
Consequently, the isoefficiency relation simplifies to

TS(n) ≥ C · TO(n, p)


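As an illustration (not taken from the slides): for the parallel summation of n numbers one typically has TS(n) ~ n·ta, while the reduction over p processors causes a total overhead of roughly TO(n, p) ~ p·log2 p. The isoefficiency relation TS(n) ≥ C·TO(n, p) then requires the problem size n to grow at least like C·p·log2 p to keep the efficiency constant.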

Scalability function
Suppose a parallel system has the isoefficiency relation n ≥ f(p).
Let M(n) be the memory required to store a problem of size n.
Maintaining efficiency therefore requires memory of at least M(f(p)).
Thus the function

M(f(p)) / p

shows how the memory usage per processor must increase to maintain the same efficiency.
This function is called the scalability function.
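Continuing the illustration above: if the isoefficiency relation is n ≥ C·p·log2 p and the memory requirement is linear, M(n) = n, the scalability function becomes M(f(p))/p = C·log2 p. The memory per processor must grow logarithmically with p, so such a system scales well, but not perfectly.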

Meaning of scalability function


To maintain the same efficiency when increasing p, we must increase n.
The maximum problem size is limited by the available memory, which grows linearly in p.
The scalability function shows how the memory usage per processor must grow to maintain efficiency.
If the complexity of the scalability function is constant, the parallel system is perfectly scalable.

Meaning of scalability function


Performance models
Analysing performance
Amdahl's & Gustafson's Laws make simple statements about the behaviour of a model with changing
number of processors p
model/problem size n
There is a need for analysing the behaviour of parallel algorithms in more detail.

Performance models
Abstract machine models for the design and performance analysis of parallel algorithms.
They allow detailed statements about the run-time behaviour (time, memory, efficiency, . . . )

Timing model
Timing model (Van de Velde)
A simple performance model for run-time analysis
Assumptions
All tasks run independently of each other and start simultaneously
All processors are identical
All arithmetic operations take the same time ta
Messages are exchanged in data words of unit length (16 bit)
Communication and calculation do not overlap
Communication tasks do not interfere with each other
No global communication, only point-to-point communication

Timing model – Modelling data exchange

Data exchange
Linear model for the exchange (send & receive) of a message M of length l (i.e. consisting of l data words):

tk(l) = α + β·l

α  start-up time (latency), i.e. the time to
   pack the message,
   send it,
   and unpack it on the recipient's side
β  reciprocal bandwidth of the bus/network
M  message of length l to be sent

Timing model – Modelling data exchange

In general
α, β ≫ ta
ta  arithmetic time,
i.e. the time for processing one standard arithmetic operation
Properties
The communication time depends linearly on the message length l.
The communication time depends on the system/hardware (network).
The slope of the line is the reciprocal bandwidth β.


Timing model Example

Figure: time response of the timing model, depending on the message length



Example: Inner product


for(s=0., i=0; i<N; i++) s += x[i] * y[i];   /* serial inner product */

N        vector length (problem size)
p        number of processors
n = N/p  uniform distribution of the vectors over the processors
Constraint
s should be known on all processors.
Serial work
T1 = 2·N·ta

To consider for parallelization

Distribution of data
Distribution of work
Algorithm

Example: Inner product Version I


Version I – local inner products

s=0.;
for(i=0; i<n; i++) s += x[i] * y[i];                          /* local inner product */
for(i=0; i<p; i++) if(i != myself) send(s to proc i);         /* send own subtotal   */
for(i=0; i<p; i++) if(i != myself) s += receive(from proc i); /* add other subtotals */

2·n·ta         accumulation of the local sum
(p - 1)·tk(1)  send and receive of the (p - 1) subtotals
(p - 1)·ta     computation of the total sum

Tp = 2·n·ta + (p - 1)·tk(1) + (p - 1)·ta


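In practice the all-to-all exchange of the subtotal is usually expressed with a collective operation. A minimal MPI sketch of Version I (dummy local data, assuming MPI is available) could look as follows; note that a good MPI library will internally realise MPI_Allreduce with a tree or butterfly scheme similar to Versions II/III:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int p, myself;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &myself);

    int n = 4;                                   /* local vector length n = N/p (dummy) */
    double x[4], y[4], s = 0.0, total;
    for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }   /* dummy local data */

    for (int i = 0; i < n; i++) s += x[i] * y[i];             /* local inner product */
    MPI_Allreduce(&s, &total, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);               /* exchange and add all subtotals;
                                                    the result is known on all ranks */

    if (myself == 0) printf("inner product = %f\n", total);
    MPI_Finalize();
    return 0;
}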

Example: Inner product Version I


Speed-up

S(p) = T1 / Tp
     = 2·N·ta / ((2n + p - 1)·ta + (p - 1)·tk(1))
     = 2N / ((2n + p - 1) + (p - 1)·tr)

with tr = tk(1)/ta   (the smaller, the better)

Speed-up Version I

S(p) = 2N / ((2n + p - 1) + (p - 1)·tr)


Example: Inner product Version II


Version 2 – recursive doubling
Pairs of local sums are formed recursively.

2·n·ta        formation of the local sums
ta + 2·tk(1)  receiving and adding two sub-sums (per level)
log2 p        number of inner nodes (propagation levels)

Tp = 2·n·ta + log2 p · [ta + 2·tk(1)]



Example: Inner product Version II


Speed-up

S(p) = T1 / Tp
     = 2·N·ta / ((2n + log2 p)·ta + 2·log2 p·tk(1))
     = 2N / (2n + log2 p·(1 + 2·tr))

with tr = tk(1)/ta   (the smaller, the better)

Speed-up Version II

S(p) = 2N / (2n + log2 p·(1 + 2·tr))

Only logarithmic dependence on the number of processors!


Example: Inner product Version III


Version 3 – Butterfly

s=0.;
for(i=0; i<n; i++) s += x[i] * y[i];   /* local inner product, n = N/p             */
for(i=0; i<(int)log2(p); i++) {
  k = 1<<i;      /* bit shift to the left, i.e. the sequence 1,2,4,...             */
  j = myself^k;  /* exclusive OR gives the exchange partner                        */
  send(s to proc j);
  s += recv(from proc j);
}

log2(p) steps

In each step, the processors exchange their subtotals s in disjoint pairs.
Each processor accumulates the total itself (no final broadcast is needed).
Algorithm known from the Fast Fourier Transform (FFT)
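A minimal MPI sketch of the butterfly exchange (dummy local data, p assumed to be a power of two). The pairwise exchange is written with MPI_Sendrecv so that both partners block symmetrically:

#include <stdio.h>
#include <math.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int p, myself;
    MPI_Comm_size(MPI_COMM_WORLD, &p);           /* assumed to be a power of two */
    MPI_Comm_rank(MPI_COMM_WORLD, &myself);

    int n = 4;                                   /* local vector length n = N/p (dummy) */
    double x[4], y[4], s = 0.0, r;
    for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = (double)myself; }

    for (int i = 0; i < n; i++) s += x[i] * y[i];          /* local inner product    */

    for (int i = 0; i < (int)log2((double)p); i++) {       /* log2(p) exchange steps */
        int j = myself ^ (1 << i);                         /* butterfly partner      */
        MPI_Sendrecv(&s, 1, MPI_DOUBLE, j, 0,
                     &r, 1, MPI_DOUBLE, j, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);   /* exchange subtotals     */
        s += r;                                            /* every rank accumulates */
    }

    printf("rank %d: inner product = %f\n", myself, s);
    MPI_Finalize();
    return 0;
}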

Example: Inner product Version III


Version 3 – Butterfly
Pairs of local sums are formed recursively.

2·n·ta      formation of the local sums
ta + tk(1)  summation of two subtotals (per level)
log2 p      number of inner nodes
s(0:3) means the sum over s(0), . . . , s(3)

Tp = 2·n·ta + log2 p · (ta + tk(1))



Example: Inner product Version III


Speed-up

S(p) = T1 / Tp
     = 2·N·ta / ((2n + log2 p)·ta + log2 p·tk(1))
     = 2N / (2n + log2 p·(1 + tr))

with tr = tk(1)/ta   (the smaller, the better)

Speed-up Version III

S(p) = 2N / (2n + log2 p·(1 + tr))

Only logarithmic dependence on the number of processors!
Half the communication time compared to recursive doubling


Example: Inner product Comparison

Figure: comparison of Versions I, II & III (N = 100,000)


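A minimal sketch that evaluates the timing-model run times of the three versions; the machine parameters ta and tk(1) are assumed values (the slides do not specify them), so only the qualitative behaviour matters:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double N   = 100000.0;     /* vector length, as in the comparison   */
    double ta  = 1.0;          /* arithmetic time (normalised)          */
    double tk1 = 100.0;        /* assumed time to exchange one word     */

    printf("    p        T_I       T_II      T_III\n");
    for (int p = 1; p <= 1024; p *= 2) {
        double n  = N / p;
        double l2 = log2((double)p);
        double t1 = 2*n*ta + (p - 1)*tk1 + (p - 1)*ta;   /* Version I   */
        double t2 = 2*n*ta + l2*(ta + 2*tk1);            /* Version II  */
        double t3 = 2*n*ta + l2*(ta + tk1);              /* Version III */
        printf("%5d  %9.0f  %9.0f  %9.0f\n", p, t1, t2, t3);
    }
    return 0;
}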

Example: Inner product Conclusions


Simple performance model is helpful
It can analyse different realisations of parallel algorithms
Take into account network speed
Local inner products
linear dependence of the run time of processor number p
prevents massive parallelism
Recursive doubling
logarithmic dependence of the run time leads to better
speed-up
Butterfly
Lower communication overhead In addition

Further readings
Literature
G. M. Amdahl: Validity of the single-processor approach to achieving large scale computing capabilities. In Proc. AFIPS Spring Joint Computer Conference, AFIPS Press, 1967, pp. 483-485.
J. L. Gustafson: Reevaluating Amdahl's Law. Comm. of the ACM, 31:532-533, 1988.
E. F. Van de Velde: Concurrent Scientific Computing. Springer Texts in Applied Mathematics 16, 1994.
M. McCool, A. D. Robison, J. Reinders: Structured Parallel Programming. Morgan Kaufmann Publishers, 2012.
