Parallel performance
Thorsten Grahs, 01.06.2015
Overview
Performance
Speedup & efficiency
Amdahl's Law
Gustafson's Law
Efficiency and scalability metrics
Timing models
Examples: Inner product
Parallel Performance
Primary purpose of parallelization
The primary goal of parallelizing a system or a
program is performance.
So what is performance?
Usually it means one of the following:
Reducing the total time it takes to compute a single result
Increasing the rate at which a series of results can be
computed
Reducing the power consumption of a computation
Speed-up by parallelization
Question
Could one build a TeraFlop computer by using 1000
GigaFlop machines?
Yes, . . . but
Technical restrictions: influenced by CPU speed, memory, network
Not all program parts can be parallelized
Speed-up?
Efficiency?
Parallel run time T_p(n)
Time between the start of the parallel program and the end of all
involved parallel processes, where
p # processors
n model size
The run time of a parallel program (on a distributed system) consists of:
Local computation: computation of one processor with local data
Data exchange: necessary communication between the processes
Waiting time: e.g. due to unbalanced loads on the processors
Synchronisation: coordination between the involved processes
Speed-up
Serial vs. parallel fraction
Speed-up S_p(n)
The speed-up of a parallel program with run time T_p(n) is
defined as
    S_p(n) = T(n) / T_p(n)
where T(n) is the run time of the sequential program.
Efficiency
Serial vs. parallel fraction
Efficiency E_p(n)
The efficiency of a parallel program is defined as
    E_p(n) = S_p(n) / p = T(n) / (p · T_p(n)) = T(n) / C_p(n)
where C_p(n) = p · T_p(n) denotes the cost of the parallel program.
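As a quick illustration (numbers our own, not from the lecture): if the sequential run time is T(n) = 100 s and four processors need T_4(n) = 30 s, then S_4(n) = 100/30 ≈ 3.3 and E_4(n) = 3.3/4 ≈ 0.83, i.e. 83% of the ideal linear speed-up.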
Amdahl's objection
Gene Amdahl, 1967
". . . the effort expended on achieving high parallel processing
rates is wasted unless it is accompanied by achievements in
sequential processing rates of nearly the same magnitude"
What does this mean?
Inclusion of the sequential proportions
In general, a parallel algorithm always has inherent
sequential components. This has implications for the
achievable speed-up.
We assume
f: relative proportion of the serial tasks
1 − f: relative proportion of ideally parallelizable tasks
Amdahl's Law
Gene Amdahl (* 1922)
Computer pioneer & businessman
Famous for his work on large-capacity computers at IBM
Founder of the Amdahl Corp.
Assume
    T(n) = T_ser + T_par          (sequential execution)
    T_p(n) = T_ser + T_par / p    (parallel execution)
Amdahl's Law
    S_p(n) = (T_ser + T_par) / (T_ser + T_par / p)
           = T(n) / (f · T(n) + ((1 − f) / p) · T(n))
           = 1 / (f + (1 − f) / p)
           ≤ 1 / f
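A minimal sketch (our own, not from the slides) that evaluates Amdahl's bound for a given serial fraction; the function name amdahl_speedup is our choice:

#include <stdio.h>

/* Amdahl's Law: speed-up for serial fraction f on p processors */
double amdahl_speedup(double f, int p) {
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void) {
    /* f = 0.05: even for p -> infinity the speed-up is bounded by 1/f = 20 */
    for (int p = 1; p <= 1024; p *= 2)
        printf("p = %4d   S = %6.2f\n", p, amdahl_speedup(0.05, p));
    return 0;
}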
Conclusion
Massively parallel systems were for a long time considered
inefficient and only interesting from a theoretical point of view.
Communication Master-Worker
Assumptions
Excellent load balancing
Communication between master and workers: T_p^c(n) is a linear function of p
A processor does not monologise (no communication with itself)
T_2^c(n): minimal communication time (master and one worker)
i.e.
    T_p^c(n) = T_2^c(n) · (p − 1)
With the communication ratio
    r = T^c(2) / T(1)
we obtain
    S(p) = 1 / (f − r + (1 − f)/p + r·p)
which attains its maximum
    S(p*) = 1 / (f − r + 2·sqrt((1 − f)·r))
at
    p* = sqrt((1 − f) / r)
Pay-off area
Only for p ≪ p* does S(p) ≈ p hold. This means that parallel
performance can only be assured for processor numbers much
smaller than p*.
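A small sketch (our own, assuming the reconstructed model above) that evaluates S(p) and the optimal processor count p*; the values of f and r are illustrative:

#include <math.h>
#include <stdio.h>

/* Master-worker speed-up with serial fraction f and comm. ratio r */
double s_mw(double f, double r, double p) {
    return 1.0 / (f - r + (1.0 - f) / p + r * p);
}

int main(void) {
    double f = 0.01, r = 0.001;          /* illustrative values */
    double p_opt = sqrt((1.0 - f) / r);  /* p* = sqrt((1 - f)/r) */
    printf("p* = %.1f, S(p*) = %.1f\n", p_opt, s_mw(f, r, p_opt));
    return 0;
}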
Scaling
An example
One decorator wallpapers a room in 60 minutes.
Two decorators wallpaper a room in 30 minutes.
How long does it take when 60 decorators work in the same
room?
Remedy
Use the 60 decorators for a hotel with 60 rooms. . .
Problem scaling
With increasing # processors the problem size should be
increased in order to maintain parallel
performance/efficiency.
Scalability
Saturation of speed-up
According to Amdahl's Law, for a fixed problem size n the
speed-up saturates as the number of processors p increases.
60 decorators in one room . . .
In scientific computing one is quite often interested in
solving a bigger problem in the same time.
Growing problem size n combined with increasing
# processors p
Instead of decorating one room, 60 painters can renovate
the whole hotel (60 rooms).
This behaviour (speed-up for increasing n) is not captured
by Amdahl's Law!
Scaled speed-up
Assumption
The sequential proportion of a parallel program decreases
with increasing model size, i.e. it is not a constant fraction of
the total computation as assumed in Amdahl's Law.
For each number of processors p the maximum speed-up
S_p(n) ≤ p can be approached,
namely by a correspondingly large model size.
The behaviour of the run time T with larger problem size and
correspondingly increased number of processors is
described by Gustafson's Law.
Gustafson's Law
John Gustafson (* 1955)
". . . speed-up should be measured by scaling the
problem to the number of processors, not by fixing the problem size."
J. L. Gustafson: Reevaluating Amdahl's Law,
Comm. of the ACM, 31:532-533, 1988
This implies:
inclusion of the model size n
sequential proportion f constant
Serial run time
    T_s(n) = f + n·(1 − f)
Parallel run time
    T_p(n) = f + n·(1 − f) / p
Scaled speed-up (Gustafson)
    S_p(n) = T_s(n) / T_p(n)
           = (f + n·(1 − f)) / (f + n·(1 − f)/p)
           = p · (f + n·(1 − f)) / (f·p + n·(1 − f))
    lim_{n→∞} S_p(n) = p
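A small sketch (our own) evaluating the scaled speed-up from the formula above; for fixed f the speed-up approaches p as the model size n grows:

#include <stdio.h>

/* Gustafson: scaled speed-up for serial fraction f, model size n, p procs */
double gustafson_speedup(double f, double n, double p) {
    return (f + n * (1.0 - f)) / (f + n * (1.0 - f) / p);
}

int main(void) {
    /* f = 0.1, p = 64: S approaches 64 with growing n */
    for (double n = 1.0; n <= 1e6; n *= 100.0)
        printf("n = %8.0f   S = %5.1f\n", n, gustafson_speedup(0.1, n, 64.0));
    return 0;
}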
Karp-Flatt metric
Both Amdahl and Gustafson ignore the overhead
of the parallelization.
This may lead to an overestimated speed-up.
Karp & Flatt introduce an
experimentally determined serial fraction:
Karp-Flatt metric
    f = (1/S − 1/p) / (1 − 1/p)
S measured speed-up
p # processors
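A one-line helper (our own naming) to compute the metric from a measured speed-up; e.g. for S = 4.7 on p = 8 it yields f ≈ 0.10:

#include <stdio.h>

/* Karp-Flatt: experimentally determined serial fraction */
double karp_flatt(double S, int p) {
    return (1.0 / S - 1.0 / p) / (1.0 - 1.0 / p);
}

int main(void) {
    printf("f = %.3f\n", karp_flatt(4.7, 8));  /* ~0.100 */
    return 0;
}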
Karp-Flatt metric
Advantage
Takes into account parallel overhead
Detects other sources of overhead or inefficiency
ignored by the simple speed-up models:
Process start-up time
Process synchronization time
Imbalanced workload
Architectural overhead
Example: measured speed-ups S for p = 4, . . . , 8 in two test cases

p            4     5     6     7     8
S (case 1)   3.1   3.6   4.0   4.4   4.7
S (case 2)   3.2   3.7   4.1   4.5   4.7

The corresponding experimentally determined serial fractions f
range from 0.070 to 0.100.
Scalability again
Scalability
The scalability of a parallel system
(i.e. parallel program executing on a parallel computer)
is a measure of its ability to increase performance as the
number of processors increases.
Speedup (and hence efficiency) is typically an increasing
function of the problem size.
This is called the Amdahl effect.
In order to maintain efficiency when processors are added,
we can increase the problem size.
This idea is formalized by the
Isoefficiency relation
Efficiency relation
T_O: total overhead, i.e. the total amount of time spent by all
processes doing work not done by the sequential algorithm.
T_S: the sequential execution time. We have
    p · T_P = T_S + T_O   ⇒   T_P = (T_S + T_O) / p
Speed-up
    S = T_S / T_P = p · T_S / (T_S + T_O)
Efficiency
    E = S / p = T_S / (T_S + T_O) = 1 / (1 + T_O / T_S) ≤ 1
Isoefficiency relation
    E(n, p) ≤ 1 / (1 + T_O(n, p) / T_S(n))
    ⇒ T_O(n, p) / T_S(n) ≤ (1 − E(n, p)) / E(n, p)
    ⇒ T_S(n) ≥ (E(n, p) / (1 − E(n, p))) · T_O(n, p)
If we wish to maintain a constant level of efficiency,
    (1 − E(n, p)) / E(n, p)
has to be a constant. Consequently, the isoefficiency relation
simplifies to
    T_S(n) ≥ C · T_O(n, p)   with C = E(n, p) / (1 − E(n, p))
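For instance (numbers our own): to hold the efficiency at E = 0.8, the constant becomes C = 0.8 / 0.2 = 4, so the problem size must grow fast enough that T_S(n) ≥ 4 · T_O(n, p) remains true as p increases.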
Scalability function
Suppose a parallel system has the isoefficiency relation
    n ≥ f(p)
M(n) is the memory required to store a problem of size n.
Maintaining constant efficiency thus requires total memory of
at least M(f(p)).
The function
    M(f(p)) / p
shows how the memory usage per processor must increase to
maintain the same efficiency.
This function is called the scalability function.
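A classic textbook example (ours to add, not on the slide): for a parallel reduction of n values the total overhead behaves roughly like T_O(n, p) = Θ(p log p), giving the isoefficiency relation n ≥ C·p·log p. With M(n) = n the scalability function becomes M(f(p))/p = C·log p, i.e. the memory per processor only has to grow logarithmically; such a system scales well.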
Performance models
Analysing performance
Amdahl's and Gustafson's laws make simple statements
about the behaviour of a model with changing
# processors p
model/problem size n
There is a need for analysing the behaviour of parallel
algorithms in more detail.
Performance models
Abstract machine models for the design and performance
analysis of parallel algorithms.
They allow detailed statements about run-time behaviour (time,
memory, efficiency, . . . )
Timing model
Model (Van de Velde)
Simple performance model for run-time analysis
Assumptions
All tasks run independently of each other and start
simultaneously
All processors are identical
All arithmetic operations take the same time t_a
Messages are exchanged in data words of unit length (16 bit)
Communication and computation do not overlap
Communication tasks do not interfere with each other
No global communication, only point-to-point communication
Example: inner product, Version I
Each processor computes its local partial sum and sends it to all
other processors:

s = 0.;
for (i = 0; i < n; i++) s += x[i] * y[i];                            /* 2 n t_a */
for (i = 0; i < p; i++) if (i != myself) send(s to proc i);          /* (p-1) t_k(1) */
for (i = 0; i < p; i++) if (i != myself) s += receive(from proc i);  /* (p-1) t_a */

With S(p) = T_1 / T_p, the total vector length N = n·p and the
ratio t_r = t_k(1) / t_a (smaller is better):

Speed-up Version I
    S(p) = 2N / ((2n + p − 1) + (p − 1) · t_r)
Example: inner product, Version II
Recursive doubling: partial sums are exchanged pairwise over
log2(p) stages:

s = 0.;
for (i = 0; i < np; i++) s += x[i] * y[i];  /* local partial sum; np = local length */
for (i = 0; i < (int)log2(p); i++) {
    k = 1 << i;        /* bit shift to left, i.e. the sequence 1, 2, 4, ... */
    j = myself ^ k;    /* exclusive OR: communication partner in stage i */
    send(s to proc j);
    s += receive(from proc j);
}

Speed-up Version II
    S(p) = 2N / (2n + log2(p) · (1 + 2·t_r))

Only logarithmic dependence on the number of processors!
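For reference, a minimal MPI sketch (our own, not from the lecture) that realizes the same pattern: MPI_Allreduce combines the partial sums across all processes in O(log p) steps, just like Version II:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { NP = 1000 };                    /* local slice length (illustrative) */
    double x[NP], y[NP], s_local = 0.0, s;
    for (int i = 0; i < NP; i++) { x[i] = 1.0; y[i] = 2.0; }

    for (int i = 0; i < NP; i++) s_local += x[i] * y[i];  /* local inner product */
    /* Tree-like combination of the partial sums, O(log p) stages */
    MPI_Allreduce(&s_local, &s, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("inner product = %.1f\n", s);
    MPI_Finalize();
    return 0;
}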
Further readings
Literature
G. M. Amdahl: Validity of the single-processor approach
to achieving large scale computing capabilities. In Proc.
Amer. Fed. Information Processing Societies Spring Joint
Computer Conference, AFIPS Press, 1967, pp. 483-485.
J. L. Gustafson: Reevaluating Amdahl's Law. Comm. of
the ACM, 31:532-533, 1988.
E. F. Van de Velde: Concurrent Scientific Computing.
Springer Texts in Applied Mathematics 16, 1994.
M. McCool, A. D. Robison, J. Reinders: Structured
Parallel Programming. Morgan Kaufmann Publishers, 2012.