
Parallel Computing

Parallel performance
Thorsten Grahs, 01.06.2015

Overview
Performance
Speedup & efficiency
Amdahl's Law
Gustafson's Law
Efficiency and scalability metrics
Timing models
Examples: Inner product


Parallel Performance
Primary purpose of parallelization
The primary purpose of parallelizing a system or a program is performance.
So what is performance?
Usually it is about one of the following
Reducing the total time it takes to compute a single result
Increasing the rate at which a series of results can be
computed
Reducing the power consumption of a computation

Speed up by parallelization
Question
Could one build a TeraFlop computer by using 1000
GigaFlop machines?
Yes, . . . but
Technical restrictions
Influenced by CPU speed, memory, network
Not all program parts can be parallelized
Speed-Up?
Efficiency?

Parallel run time


Parallel execution time Tp(n)

Time between the start of the parallel program and the end of all involved parallel processes.
p  number of processors
n  model size
The run time of a parallel program (distributed system) consists of
Local computation
Computation of one processor on its local data
Data exchange
Communication necessary for the data exchange
Waiting time
e.g. due to unbalanced loads on the processors
Synchronisation
Coordination between the involved processes

Cost of a parallel program


Costs Cp(n)
The costs (work, processor-time product) of a parallel program are defined as

Cp(n) = Tp(n) · p

The costs Cp(n)
take into account the time that all involved processors need to fulfil their tasks
are a measure of the work performed by all processors
are optimal when the parallel program performs as many operations as the fastest sequential algorithm with run time T(n), i.e.

Cp(n) = T(n).

Speed-up
Serial vs. parallel fraction
Speed-up Sp(n)
The speed-up of a parallel program with run time Tp(n) is defined as

Sp(n) = T(n) / Tp(n)

T(n)  run time of the optimal serial implementation

Measure for
the comparison between serial and parallel implementation
the relative gain in speed
Ideal case: linear speed-up, i.e. T = Tp · p and Sp = p

Efficiency
Serial vs. parallel fraction
Efficiency Ep(n)
The efficiency of a parallel program is defined as

Ep(n) = Sp(n) / p = T(n) / (p · Tp(n)) = T(n) / Cp(n)

Measures the proportion of the run time in which a processor performs computation that is also present in the serial program
Low efficiency = large parallel overhead
Linear speed-up Sp = p corresponds to Ep = 1
E100(n) = 0.4 means that each of the 100 processors spends about 60 % of its time on communication and other overhead
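For illustration, a minimal C sketch (function and variable names are illustrative, not from the slides) that evaluates these metrics from measured run times:

#include <stdio.h>

/* Speed-up, efficiency and cost from a measured serial time t_serial
   and a measured parallel time t_parallel on p processors. */
void report_metrics(double t_serial, double t_parallel, int p)
{
    double s = t_serial / t_parallel;        /* speed-up   Sp = T / Tp */
    double e = s / (double)p;                /* efficiency Ep = Sp / p */
    double c = t_parallel * (double)p;       /* cost       Cp = Tp * p */
    printf("p = %d: S = %.2f, E = %.3f, C = %.2f\n", p, s, e, c);
}

int main(void)
{
    report_metrics(100.0, 1.6, 100);         /* example: E = 0.625 on 100 processors */
    return 0;
}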

Amdahl's objection
Gene Amdahl, 1967
. . . the effort expended on achieving high parallel processing rates is wasted unless it is accompanied by achievements in sequential processing rates of nearly the same magnitude
What does this mean?
Inclusion of the sequential proportion
In general, a parallel algorithm always has inherent sequential components. This has implications for the achievable speed-up.
We assume
f      relative proportion of the serial part
1 - f  relative proportion of the ideally parallelizable part

Amdahl's Law
Gene Amdahl (born 1922)
Computer pioneer & businessman
Famous for his work on large-capacity computers at IBM
Founder of the Amdahl Corp.

Assume
T(n)  = Tser + Tpar      sequential execution
Tp(n) = Tser + Tpar/p    parallel execution on p processors

Amdahl's Law

Sp(n) = (Tser + Tpar) / (Tser + Tpar/p)


Amdahl's Law | Example

Assume
a program needs 10 hours on a single processor
one hour is sequential, the remaining 9 hours can be parallelized
Consequences
The program cannot run in less than one hour
The speed-up is at most 10
The speed-up is limited by the sequential part
Note: the problem size is fixed!

Amdahl's Law – relative formulation

If an implementation has a
relative proportion f (0 ≤ f ≤ 1) of inherently serial tasks,
relative proportion (1 - f) of ideally parallelizable tasks,
the overall run time of the program consists of the
run time for the serial part:    f · T(n)
run time for the parallel part:  (1 - f)/p · T(n)

Amdahl's Law (relative)

Sp(n) = T(n) / (f · T(n) + (1 - f)/p · T(n)) = 1 / (f + (1 - f)/p) ≤ 1/f

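As a small numerical illustration of the relative form, a minimal C sketch (with an assumed serial fraction of 1 %) shows how the speed-up saturates at 1/f:

#include <stdio.h>

/* Amdahl speed-up for serial fraction f and p processors */
double amdahl_speedup(double f, int p)
{
    return 1.0 / (f + (1.0 - f) / (double)p);
}

int main(void)
{
    double f = 0.01;                               /* assumed: 1 % serial fraction */
    int procs[] = { 1, 10, 100, 1000, 10000 };
    for (int i = 0; i < 5; i++)
        printf("p = %5d   Sp = %6.2f\n", procs[i], amdahl_speedup(f, procs[i]));
    /* Sp approaches 1/f = 100, no matter how many processors are used */
    return 0;
}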

Amdahl's Law – Example


Amdahl's Law – Speed-up


Amdahl's Law – Consequence

The serial part does not benefit from massively parallel execution
For a sequential proportion of 1 %
the maximum achievable speed-up is 100,
regardless of how many processors are used
For a sequential proportion of 20 %
the maximum achievable speed-up is 5,
regardless of how many processors are used

Conclusion
Massively parallel systems were for a long time considered inefficient and only interesting from a theoretical point of view

Further sources of limitations


Further delays due to
Load balancing
i.e. unequal work packages (loads)
Communication
e.g. waiting for data/processes

Parallelization-induced communication

Consider the communication time Tpc(n) due to the necessary data exchange between the participating processors:

Tp(n) = T(n) · [f + (1 - f)/p] + Tpc(n)

Tpc(n) is monotonically increasing in the number of processors p.



Communication Master-Worker
Assumptions
Excellent load balancing
Communication only between master and workers
Tpc(n) is a linear function of p
A processor does not communicate with itself
T2c(n)  minimal communication time (master & one worker)

Tpc(n) = T2c(n) · (p - 1)

This gives

S(p) = 1 / (f - r + (1 - f)/p + r·p)

with

r = Tc(2) / T(1)   (minimal communication time / serial computation time)


Comm. Master-Worker – Max. speed-up

New situation
For a large number of processors p the speed-up can decline
Maximal speed-up

S(p*) = 1 / (f - r + 2·√((1 - f)·r))    with    p* = √((1 - f)/r)

Pay-off area
Only for p ≪ p* does S(p) ≈ p hold. This means that parallel performance can only be assured for processor numbers much smaller than p*.
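A minimal sketch evaluating these two formulas for assumed values of f and r (both values are illustrative, not taken from the slides):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double f = 0.01;    /* assumed serial fraction              */
    double r = 1e-4;    /* assumed ratio Tc(2)/T(1)             */
    double p_opt = sqrt((1.0 - f) / r);                  /* optimal processor number p* */
    double s_max = 1.0 / (f - r + 2.0 * sqrt((1.0 - f) * r));
    printf("p* = %.1f, S(p*) = %.1f\n", p_opt, s_max);   /* roughly 99.5 and 33.6       */
    return 0;
}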

Scaling
An example
One decorator wallpapers a room in 60 minutes.
Two decorators wallpaper a room in 30 minutes.
How long does it take when 60 decorators work in the same room?
Remedy
Use the 60 decorators for a hotel with 60 rooms. . .

Problem scaling
With increasing # processors the problem size should be
increased, in order to guarantee parallel
performance/efficiency.

Scalability
Saturation of the speed-up
According to Amdahl's Law, for a fixed problem size n the speed-up saturates as the number of processors p increases
60 decorators in one room . . .
In scientific computing one is often interested in solving a bigger problem in the same time.
Growing problem size n combined with an increasing number of processors p
Instead of decorating one room, the 60 decorators can renovate the whole hotel (60 rooms).
This behaviour (speed-up for increasing n) is not captured by Amdahl's Law!

Scaled speed-up
Assumption
The sequential proportion of a parallel program decreases with increasing model size, i.e. it is not a constant fraction of the total computation as assumed in Amdahl's Law.
For each number of processors p the maximum speed-up Sp(n) ≤ p can be achieved,
namely by a correspondingly large model size
The behaviour of the run time T for larger problem sizes and a correspondingly increased number of processors is described by Gustafson's Law.

Gustafson's Law
John Gustafson (born 1955)
. . . speed-up should be measured by scaling the problem to the number of processors, not by fixing the problem size.
J. L. Gustafson: Reevaluating Amdahl's Law, Comm. of the ACM, 31:532-533, 1988
This implies:
inclusion of the model size n
sequential proportion f constant
Serial run time:    Ts(n) = f + n·(1 - f)
Parallel run time:  Tp(n) = f + n·(1 - f)/p

Gustafson's Law – scaled speed-up

Gustafson's Law (scaled speed-up)
If the problem size increases while the serial portion grows slowly or remains fixed, the speed-up grows as workers are added:

Sp(n) = Ts(n) / Tp(n)
      = (f + n·(1 - f)) / (f + n·(1 - f)/p)
      = p·(f + n·(1 - f)) / (f·p + n·(1 - f))

lim (n → ∞) Sp(n) = p

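A minimal C sketch of the scaled speed-up, with an assumed serial portion f = 0.05 and p = 100 workers, showing the approach to the limit p:

#include <stdio.h>

/* Scaled (Gustafson) speed-up with Ts(n) = f + n(1-f), Tp(n) = f + n(1-f)/p */
double gustafson_speedup(double f, double n, int p)
{
    return (f + n * (1.0 - f)) / (f + n * (1.0 - f) / (double)p);
}

int main(void)
{
    double f = 0.05;                       /* assumed serial portion */
    int p = 100;
    for (double n = 1.0; n <= 1000.0; n *= 10.0)
        printf("n = %6.0f   Sp = %6.2f\n", n, gustafson_speedup(f, n, p));
    /* the speed-up approaches p = 100 as the problem size n grows */
    return 0;
}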

Gustafson's Law – constant run time

Constant run time

To keep the run time constant while the problem size increases, one has to choose n = p. That is, for double the model size (e.g. number of grid points) one has to double the number of processors.

Karp-Flatt metric
Both Amdahl and Gustafson ignore the parallel overhead of the parallelization.
This may lead to an overestimated speedup.
Karp & Flatt introduced an
experimentally determined serial fraction

Karp-Flatt metric

f = (1/S - 1/p) / (1 - 1/p)

S  measured speedup
p  number of processors
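A minimal C helper for this metric (the function name is illustrative); for instance, karp_flatt(4.7, 8) gives roughly 0.10, as in the examples on the following slides:

/* Experimentally determined serial fraction (Karp-Flatt metric) */
double karp_flatt(double s, int p)   /* s: measured speedup, p: number of processors */
{
    return (1.0 / s - 1.0 / (double)p) / (1.0 - 1.0 / (double)p);
}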

Karp-Flatt metric
Advantage
Takes the parallel overhead into account
Detects other sources of overhead or inefficiency
that are ignored in the speedup models:
Process start-up time
Process synchronization time
Imbalanced workload
Architectural overhead


Karp-Flatt metric Example 1


Measured speedup

p   2     3     4     5     6     7     8
S   1.8   2.5   3.1   3.6   4.0   4.4   4.7

What is the primary reason for speedup of only 4.7 on 8 CPUs?

Karp-Flatt metric

f   0.1   0.1   0.1   0.1   0.1   0.1   0.1

Since f is constant (i.e. not increasing with growing p),
the limiting constraint is limited opportunity for parallelism
(a large fraction that is inherently sequential).

Karp-Flatt metric Example 2


Measured speedup

p   2      3      4      5      6      7      8
S   1.9    2.6    3.2    3.7    4.1    4.5    4.7

What is the primary reason for speedup of only 4.7 on 8 processors?

Karp-Flatt metric

f   0.070  0.075  0.080  0.085  0.090  0.095  0.100

Since f is steadily increasing, parallel overhead is the primary reason
(e.g. time spent in process start-up, communication, synchronization,
or an architectural constraint).

Scalability again
Scalability
The scalability of a parallel system
(i.e. parallel program executing on a parallel computer)
is a measure of its ability to increase performance as the
number of processors increases.
Speedup (and hence efficiency) is typically an increasing
function of the problem size.
This is called the Amdahl effect.
In order to maintain efficiency when processors are added,
we can increase the problem size.
This idea is formalized by the
isoefficiency relation

Efficiency relation
TO is the total amount of time spent by all processes doing work that is not done by the sequential algorithm (total overhead).
TS is the sequential execution time. We have

p·TP = TS + TO,   i.e.   TP = (TS + TO) / p

Speed-up

S = TS / TP = p·TS / (TS + TO)

Efficiency

E = S / p = TS / (TS + TO) = 1 / (1 + TO/TS)


Isoefficiency relation

E(n, p) ≤ 1 / (1 + TO(n, p)/TS(n))

⇒  TO(n, p) / TS(n) ≤ (1 - E(n, p)) / E(n, p)

⇒  TS(n) ≥ [E(n, p) / (1 - E(n, p))] · TO(n, p)

If we wish to maintain a constant level of efficiency,
E(n, p) / (1 - E(n, p)) =: C has to be a constant.
Consequently, the isoefficiency relation simplifies to

TS(n) ≥ C · TO(n, p)


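As an illustration (not taken from the slides): for the parallel summation of n numbers one typically has TS(n) ~ n·ta, while the reduction over p processors causes a total overhead of roughly TO(n, p) ~ p·log2 p. The isoefficiency relation TS(n) ≥ C·TO(n, p) then requires the problem size n to grow at least like C·p·log2 p to keep the efficiency constant.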

Scalability function
Suppose a parallel system has the isoefficiency relation n ≥ f(p).
Let M(n) be the memory required to store a problem of size n.
Maintaining efficiency therefore requires memory of at least M(f(p)).
Thus the function

M(f(p)) / p

shows how the memory usage per processor must increase to maintain the same efficiency.
This function is called the scalability function.
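Continuing the illustration above: if the isoefficiency relation is n ≥ C·p·log2 p and the memory requirement is linear, M(n) = n, the scalability function becomes M(f(p))/p = C·log2 p. The memory per processor must grow logarithmically with p, so such a system scales well, but not perfectly.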

Meaning of scalability function


To maintain the same efficiency when increasing p, we must increase n.
The maximum problem size is limited by the available memory, which grows linearly in p.
The scalability function shows how the memory usage per processor must grow to maintain efficiency.
If the complexity of the scalability function is constant, the parallel system is perfectly scalable.

Meaning of scalability function


Performance models
Analysing performance
Amdahl's & Gustafson's Laws make simple statements about the behaviour of a model with changing
number of processors p
model/problem size n
There is a need for analysing the behaviour of parallel algorithms in more detail.

Performance models
Abstract machine models for the design and performance analysis of parallel algorithms.
They allow detailed statements about the run-time behaviour (time, memory, efficiency, . . . )

Timing model
Timing model (Van de Velde)
A simple performance model for run-time analysis
Assumptions
All tasks run independently of each other and start simultaneously
All processors are identical
All arithmetic operations take the same time ta
Messages are exchanged in data words of unit length (16 bit)
Communication and calculation do not overlap
Communication tasks do not interfere with each other
No global communication, only point-to-point communication

Timing model – Modelling data exchange

Data exchange
Linear model for the exchange (send & receive) of a message M of length l (i.e. consisting of l data words):

tk(l) = α + β·l

α  start-up time (latency), i.e. the time to
   pack the message,
   send it,
   and unpack it on the recipient's side
β  reciprocal bandwidth of the bus/network
M  message of length l to be sent

Timing model – Modelling data exchange

In general
α, β ≫ ta
ta  arithmetic time,
i.e. the time for processing one standard arithmetic operation
Properties
The communication time depends linearly on the message length l.
The communication time depends on the system/hardware (network).
The slope of the line is the reciprocal bandwidth β.


Timing model Example

Figure: time response of the timing model, depending on the message length



Example: Inner product


for(s=0., i=0; i<N; i++) s += x[i] * y[i];   /* serial inner product */

N        vector length (problem size)
p        number of processors
n = N/p  uniform distribution of the vectors over the processors
Constraint
s should be known on all processors.
Serial work
T1 = 2·N·ta

To consider for parallelization

Distribution of data
Distribution of work
Algorithm

Example: Inner product Version I


Version I – local inner products

s=0.;
for(i=0; i<n; i++) s += x[i] * y[i];                          /* local inner product */
for(i=0; i<p; i++) if(i != myself) send(s to proc i);         /* send own subtotal   */
for(i=0; i<p; i++) if(i != myself) s += receive(from proc i); /* add other subtotals */

2·n·ta         accumulation of the local sum
(p - 1)·tk(1)  send and receive of the (p - 1) subtotals
(p - 1)·ta     computation of the total sum

Tp = 2·n·ta + (p - 1)·tk(1) + (p - 1)·ta


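In practice the all-to-all exchange of the subtotal is usually expressed with a collective operation. A minimal MPI sketch of Version I (dummy local data, assuming MPI is available) could look as follows; note that a good MPI library will internally realise MPI_Allreduce with a tree or butterfly scheme similar to Versions II/III:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int p, myself;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &myself);

    int n = 4;                                   /* local vector length n = N/p (dummy) */
    double x[4], y[4], s = 0.0, total;
    for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }   /* dummy local data */

    for (int i = 0; i < n; i++) s += x[i] * y[i];             /* local inner product */
    MPI_Allreduce(&s, &total, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);               /* exchange and add all subtotals;
                                                    the result is known on all ranks */

    if (myself == 0) printf("inner product = %f\n", total);
    MPI_Finalize();
    return 0;
}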

Example: Inner product Version I


Speed-up

S(p) = T1 / Tp
     = 2·N·ta / ((2n + p - 1)·ta + (p - 1)·tk(1))
     = 2N / ((2n + p - 1) + (p - 1)·tr)

with tr = tk(1)/ta   (the smaller, the better)

Speed-up Version I

S(p) = 2N / ((2n + p - 1) + (p - 1)·tr)


Example: Inner product Version II


Version 2 – recursive doubling
Pairs of local sums are formed recursively.

2·n·ta        formation of the local sums
ta + 2·tk(1)  receiving and adding two sub-sums (per level)
log2 p        number of inner nodes (propagation levels)

Tp = 2·n·ta + log2 p · [ta + 2·tk(1)]



Example: Inner product Version II


Speed-up

S(p) = T1 / Tp
     = 2·N·ta / ((2n + log2 p)·ta + 2·log2 p·tk(1))
     = 2N / (2n + log2 p·(1 + 2·tr))

with tr = tk(1)/ta   (the smaller, the better)

Speed-up Version II

S(p) = 2N / (2n + log2 p·(1 + 2·tr))

Only logarithmic dependence on the number of processors!


Example: Inner product Version III


Version 3 – Butterfly

s=0.;
for(i=0; i<n; i++) s += x[i] * y[i];   /* local inner product, n = N/p             */
for(i=0; i<(int)log2(p); i++) {
  k = 1<<i;      /* bit shift to the left, i.e. the sequence 1,2,4,...             */
  j = myself^k;  /* exclusive OR gives the exchange partner                        */
  send(s to proc j);
  s += recv(from proc j);
}

log2(p) steps

In each step, the processors exchange their subtotals s in disjoint pairs.
Each processor accumulates the total itself (no final broadcast is needed).
Algorithm known from the Fast Fourier Transform (FFT)
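A minimal MPI sketch of the butterfly exchange (dummy local data, p assumed to be a power of two). The pairwise exchange is written with MPI_Sendrecv so that both partners block symmetrically:

#include <stdio.h>
#include <math.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int p, myself;
    MPI_Comm_size(MPI_COMM_WORLD, &p);           /* assumed to be a power of two */
    MPI_Comm_rank(MPI_COMM_WORLD, &myself);

    int n = 4;                                   /* local vector length n = N/p (dummy) */
    double x[4], y[4], s = 0.0, r;
    for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = (double)myself; }

    for (int i = 0; i < n; i++) s += x[i] * y[i];          /* local inner product    */

    for (int i = 0; i < (int)log2((double)p); i++) {       /* log2(p) exchange steps */
        int j = myself ^ (1 << i);                         /* butterfly partner      */
        MPI_Sendrecv(&s, 1, MPI_DOUBLE, j, 0,
                     &r, 1, MPI_DOUBLE, j, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);   /* exchange subtotals     */
        s += r;                                            /* every rank accumulates */
    }

    printf("rank %d: inner product = %f\n", myself, s);
    MPI_Finalize();
    return 0;
}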

Example: Inner product Version III


Version 3 – Butterfly
Pairs of local sums are formed recursively.

2·n·ta      formation of the local sums
ta + tk(1)  summation of two subtotals (per level)
log2 p      number of inner nodes
s(0:3) means the sum over s(0), . . . , s(3)

Tp = 2·n·ta + log2 p · (ta + tk(1))



Example: Inner product Version III


Speed-up

S(p) = T1 / Tp
     = 2·N·ta / ((2n + log2 p)·ta + log2 p·tk(1))
     = 2N / (2n + log2 p·(1 + tr))

with tr = tk(1)/ta   (the smaller, the better)

Speed-up Version III

S(p) = 2N / (2n + log2 p·(1 + tr))

Only logarithmic dependence on the number of processors!
Half the communication time compared to recursive doubling


Example: Inner product Comparison

Figure: comparison of Versions I, II & III (N = 100,000)


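A minimal sketch that evaluates the timing-model run times of the three versions; the machine parameters ta and tk(1) are assumed values (the slides do not specify them), so only the qualitative behaviour matters:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double N   = 100000.0;     /* vector length, as in the comparison   */
    double ta  = 1.0;          /* arithmetic time (normalised)          */
    double tk1 = 100.0;        /* assumed time to exchange one word     */

    printf("    p        T_I       T_II      T_III\n");
    for (int p = 1; p <= 1024; p *= 2) {
        double n  = N / p;
        double l2 = log2((double)p);
        double t1 = 2*n*ta + (p - 1)*tk1 + (p - 1)*ta;   /* Version I   */
        double t2 = 2*n*ta + l2*(ta + 2*tk1);            /* Version II  */
        double t3 = 2*n*ta + l2*(ta + tk1);              /* Version III */
        printf("%5d  %9.0f  %9.0f  %9.0f\n", p, t1, t2, t3);
    }
    return 0;
}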

Example: Inner product Conclusions


Simple performance model is helpful
It can analyse different realisations of parallel algorithms
Take into account network speed
Local inner products
linear dependence of the run time of processor number p
prevents massive parallelism
Recursive doubling
logarithmic dependence of the run time leads to better
speed-up
Butterfly
Lower communication overhead In addition

Further readings
Literature
G. M. Amdahl: Validity of the single-processor approach to achieving large scale computing capabilities. In Proc. AFIPS Spring Joint Computer Conference, AFIPS Press, 1967, pp. 483-485.
J. L. Gustafson: Reevaluating Amdahl's Law. Comm. of the ACM, 31:532-533, 1988.
E. F. Van de Velde: Concurrent Scientific Computing. Springer Texts in Applied Mathematics 16, 1994.
M. McCool, A. D. Robison, J. Reinders: Structured Parallel Programming. Morgan Kaufmann Publishers, 2012.
