The Changing Face of Computing
• Large Mainframes
• Desktop Computing
• Servers
• Embedded Computers
Three computer classes
Technology Trends
A taxonomy of parallel architectures
• control mechanism
  - SIMD
  - MIMD
• address-space organization
  - message-passing architecture
  - shared-address-space architecture
    - UMA
    - NUMA
• interconnection networks
  - static
  - dynamic
• processor granularity
  - coarse-grain computers
  - medium-grain computers
  - fine-grain computers
Flynn’s classification
– Single-instruction-stream, single-data-stream (SISD) computers
  - Typical uniprocessors
  - Parallelism achieved through pipelining
– Multiple-instruction-stream, single-data-stream (MISD) computers
  - Rarely used in practice
– Single-instruction-stream, multiple-data-stream (SIMD) computers
  - Vector and array processors
– Multiple-instruction-stream, multiple-data-stream (MIMD) computers
  - Multiprocessors
Typical shared-address-space architectures
[Figure: left, a uniform-memory-access (UMA) computer with processors (P) connected through an interconnection network to shared memory modules (M); right, a non-uniform-memory-access (NUMA) computer with local memory only.]
A typical message-passing architecture.
[Figure: processors (P), each with its own local memory (M), connected by an interconnection network.]
Dynamic interconnection networks
Bus-based networks
A completely nonblocking crossbar switch connecting p processors to b memory banks
[Figure: a crossbar with processors P0 … Pp-1 on the rows and memory banks M0 … Mb-1 on the columns, with a switching element at each crosspoint.]
A typical bus-based architecture with no cache.
[Figure: processors connected by a shared bus to global memory.]
Multistage interconnection network
[Figure: p processors (0 … p-1) connected to b memory banks (0 … b-1) through a multistage interconnection network.]
Cost and performance
[Figure: plots of performance and cost versus the number of processors for bus-based and multistage networks.]
Omega network
[Figure: an 8×8 omega network connecting inputs 000-111 to outputs 000-111; each 2×2 switching element is set to either pass-through or cross-over.]
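These switch settings can be derived with destination-tag routing: after the perfect shuffle at each stage, the switch output is chosen to match the corresponding bit of the destination address. A minimal Python sketch of the idea (illustrative code, not from the slides):

```python
def omega_route(src: int, dst: int, n_bits: int) -> list[str]:
    """Route through an n_bits-stage omega network of 2x2 switches.

    Each stage applies a perfect shuffle (rotate the address left by one
    bit); the switch then outputs on the port given by the corresponding
    destination bit (destination-tag routing). Returns the setting chosen
    at every stage.
    """
    mask = (1 << n_bits) - 1
    settings = []
    addr = src
    for stage in range(n_bits):
        # Perfect shuffle: rotate the address left by one bit.
        addr = ((addr << 1) | (addr >> (n_bits - 1))) & mask
        # Desired output at this stage: the stage-th destination bit, MSB first.
        want = (dst >> (n_bits - 1 - stage)) & 1
        if (addr & 1) == want:
            settings.append("pass-through")
        else:
            settings.append("cross-over")
            addr ^= 1  # the exchange flips the lowest address bit
    assert addr == dst
    return settings

print(omega_route(0b010, 0b111, 3))  # ['cross-over', 'pass-through', 'cross-over']
```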
An example of blocking in an omega network
[Figure: an 8×8 omega network (inputs and outputs 000-111) in which two messages contend for the same link, so one of them is blocked.]
Static interconnection networks
• Completely-connected network
• Star-connected network
• Linear array
• Ring
• Hypercube
Examples of static interconnection networks.
[Figure: a three-dimensional mesh of processors and switching elements, and hypercubes of dimension 0 through 4 with binary node labels.]
Evaluating Static Interconnection Networks
Communication costs in static interconnection networks
Principal parameters
- startup time ($t_s$)
- per-hop time ($t_h$)
- per-word transfer time ($t_w$)
Routing techniques
- store-and-forward routing
- cut-through routing
Passing a message from processor P0 to P3.
[Figure: time diagrams of a single message broken into two parts and sent over a cut-through network, and of the same message broken into four parts and sent over a cut-through network.]
Communication costs depend on routing strategy
$t_{comm} = t_s + (m\,t_w + t_h)\,l$
(store-and-forward routing of an m-word message over l links: each intermediate processor receives the entire message before forwarding it)
Cut-through routing: the message is divided into parts which are sent between processors without waiting for the whole message to arrive
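To see why cut-through routing pays off, a quick numerical sketch of both cost models (the cut-through formula $t_{comm} = t_s + l\,t_h + m\,t_w$ is the standard one; the parameter values below are made up for illustration):

```python
def store_and_forward(ts, th, tw, m, l):
    # Every one of the l hops buffers the whole m-word message before forwarding.
    return ts + (m * tw + th) * l

def cut_through(ts, th, tw, m, l):
    # The header pays the per-hop cost once per link; the m words then stream
    # through the intermediate processors without being buffered.
    return ts + l * th + m * tw

# Illustrative parameters (arbitrary time units): startup 50, hop 2, word 1.
ts, th, tw, m, l = 50.0, 2.0, 1.0, 1000, 8
print(store_and_forward(ts, th, tw, m, l))  # 50 + (1000*1 + 2)*8 = 8066.0
print(cut_through(ts, th, tw, m, l))        # 50 + 8*2 + 1000*1  = 1066.0
```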
Program Performance Metrics
⇒ The parallel run time (Tpar) is the time from the moment the computation starts to the moment the last processor finishes its execution
⇒ The speedup (S) is defined as the ratio of the time needed to solve the problem on a single processor (Tseq) to the time required to solve the same problem on a parallel system with p processors (Tpar)
– relative: Tseq is the execution time of the parallel algorithm executing on one of the processors of the parallel computer
– real: Tseq is the execution time of the best-known algorithm on one of the processors of the parallel computer
– absolute: Tseq is the execution time of the best-known algorithm on the best-known computer
Program Performance Metrics
⇒ The cost is usually defined as the product of the parallel run time and the number of processors
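These definitions translate directly into code; a minimal sketch (the function names are illustrative):

```python
def speedup(t_seq: float, t_par: float) -> float:
    """Ratio of single-processor time to parallel run time."""
    return t_seq / t_par

def cost(t_par: float, p: int) -> float:
    """Parallel run time multiplied by the number of processors."""
    return t_par * p

def efficiency(t_seq: float, t_par: float, p: int) -> float:
    """Speedup per processor, i.e. t_seq / cost."""
    return speedup(t_seq, t_par) / p

# Example: a job taking 100 s sequentially and 16 s on 8 processors.
print(speedup(100, 16))        # 6.25
print(cost(16, 8))             # 128
print(efficiency(100, 16, 8))  # 0.78125
```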
Gustafson’s Law (scaled speedup)
⇒ Let’s assume that the execution of the maximum-size problem using a parallel algorithm on p processors takes Pseq + Ppar = 1 time, where Pseq denotes the sequential part of the program and Ppar the parallel part, respectively
⇒ Its sequential time (using one processor) will be Pseq + p·Ppar
⇒ We then obtain the following expression for the speedup
$S_G = \frac{P_{seq} + p \cdot P_{par}}{P_{seq} + P_{par}} = P_{seq} + p \cdot P_{par} \geq p \cdot P_{par}$
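A worked instance of the formula (a sketch; the 10% sequential fraction is an assumed value):

```python
def scaled_speedup(p_seq: float, p: int) -> float:
    """Scaled speedup with normalized parallel time p_seq + p_par = 1."""
    p_par = 1.0 - p_seq
    return p_seq + p * p_par

# With a 10% sequential part, 100 processors still give ~90x scaled speedup.
print(scaled_speedup(0.10, 100))  # 90.1
```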
Other laws
⇒ Lee’s law (1980): $S \leq \frac{p}{\log_2 p}$
(e.g. for p = 1024 the speedup is bounded by 1024/10 ≈ 102)
It is a generalisation of Amdahl’s Law
Other laws
Stone’s Table

Speedup (S)           Type of Algorithms
------------------------------------------------------------------------
α · p                 matrix computations, discretization
α · p / log₂ p        sorts, linear recursions, evaluation of polynomials
α · log₂ p            search for an element in a set
α                     certain non-linear recursions
------------------------------------------------------------------------
The Scalability of Parallel Systems
In the first step each processor locally adds its n/p numbers; in the following steps half of the partial sums are transmitted to adjacent processors and added, and the procedure finishes when one chosen processor holds the total sum
Assume that it takes one unit of time both to add two numbers and to communicate a number between two directly connected processors
Then adding the n/p numbers local to each processor takes n/p - 1 time
The p partial sums are added in log p steps, each consisting of one addition and one communication
The Scalability of Parallel Systems
$T_p = \frac{n}{p} + 2 \log p$
Since the serial run time can be approximated by n, the expressions for speedup and efficiency are as follows:
$S = \frac{n p}{n + 2 p \log p}$    $E = \frac{n}{n + 2 p \log p}$
These expressions can be used to calculate the speedup and efficiency for any pair of n and p
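A quick tabulation of these expressions (a sketch; log is taken base 2, matching the analysis above):

```python
import math

def t_par(n: int, p: int) -> float:
    return n / p + 2 * math.log2(p)

def speedup(n: int, p: int) -> float:
    return n / t_par(n, p)      # = n*p / (n + 2*p*log2(p))

def efficiency(n: int, p: int) -> float:
    return speedup(n, p) / p    # = n / (n + 2*p*log2(p))

# Speedup and efficiency on p = 32 processors for growing problem sizes.
for n in (64, 192, 320, 512):
    print(n, round(speedup(n, 32), 2), round(efficiency(n, 32), 2))
```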
The Scalability of Parallel Systems
Speedup versus the number of processors
[Figure: speedup S versus the number of processors for n = 64, 192, 320, and 512, together with the linear-speedup line.]
The Scalability of Parallel Systems
⇒ For a given problem instance, the speedup does not increase linearly as the number of processors increases
⇒ A larger instance of the same problem yields a higher speedup and efficiency for the same number of processors
Measuring and Reporting Performance
• Two key aspects
  - execution time
  - throughput
  - making one faster may slow the other
• Comparing performance
  - performance = 1/execution time
  - if X is n times faster than Y: n = Execution timeY / Execution timeX (e.g. if Y takes 15 s and X takes 10 s, X is 1.5 times faster)
  - similar for throughput comparisons
  - improved performance ⇒ decreased execution time
Measuring Performance
Unix time command
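As a reminder, `time ./program` prints the elapsed (real), user CPU, and system CPU time of a command. A rough Python analogue of the same measurement (a sketch; the busy loop is just an illustrative workload):

```python
import time

start_wall = time.perf_counter()
start_cpu = time.process_time()

# Illustrative workload to be measured.
total = sum(i * i for i in range(10_000_000))

wall = time.perf_counter() - start_wall  # comparable to "real"
cpu = time.process_time() - start_cpu    # comparable to "user" + "sys" of this process
print(f"wall: {wall:.2f} s   cpu: {cpu:.2f} s")
```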
Reporting Performance Results
However:
• A system’s software configuration can significantly affect the
performance results for a benchmark.
• Similarly, compiler technology can play a big role in the
performance of compute-oriented benchmarks.
Other Problems
• Which is better?
• By how much?
• Are the programs equally important?
Total Execution Time: A Consistent Summary Measure
Weighted Execution Time
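The weighted execution time referred to here is presumably the standard weighted arithmetic mean:

$\text{Weighted time} = \sum_{i=1}^{n} w_i \times \text{Time}_i$

where $w_i$ is the weight (frequency) of program i in the workload and the weights sum to 1.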
Normalized Execution Time and the Pros and Cons of Geometric Means
• A second approach to an unequal mixture of programs in the workload is to normalize execution times to a reference machine and then take the average of the normalized execution times.
• Average normalized execution time can be expressed as either an arithmetic or geometric mean.
• The formula for the geometric mean is
$\sqrt[n]{\prod_{i=1}^{n} \text{execution time ratio}_i}$
where execution time ratio_i is the execution time, normalized to the reference machine, for the i-th of n programs in the workload.
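A small numeric sketch of why the geometric mean is preferred for normalized times (the numbers are illustrative):

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    return math.prod(xs) ** (1.0 / len(xs))

# Execution times of two programs on machine B, normalized to machine A:
# B is 10x slower on one program and 10x faster on the other.
normalized = [10.0, 0.1]
print(arithmetic_mean(normalized))  # 5.05 -- changes if we normalize to B instead
print(geometric_mean(normalized))   # 1.0  -- independent of the reference machine
```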
Amdahl’s Law – hardware approach
What is speedup?
• Speedup = execution time for the task without the enhancement / execution time for the task with the enhancement
• Alternatively: speedup tells us how much faster a task will run using the machine with the enhancement as opposed to the original one
What does Amdahl’s Law say?
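In its standard form (as given by Hennessy and Patterson), which the following slides build on:

$\text{Speedup}_{overall} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$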
How do we calculate speedup?
The execution time using the original machine with the enhanced
mode will be the time spent using the unenhanced portion of the
machine plus the time spent using the enhancement:
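In formula form this is presumably the standard expression:

$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left[ (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right]$

so that $\text{Speedup}_{overall} = \text{ExTime}_{old} / \text{ExTime}_{new}$.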
An Example
• Implementations of floating-point (FP) square root vary significantly in performance, especially among processors designed for graphics.
• Suppose FP square root (FPSQR) is responsible for 20% of the execution time of a critical graphics benchmark. One proposal is to enhance the FPSQR hardware and speed up this operation by a factor of 10. The other alternative is just to try to make all FP instructions in the graphics processor run faster by a factor of 1.6.
• FP instructions are responsible for a total of 50% of the execution time for the application. The design team believes that they can make all FP instructions run 1.6 times faster with the same effort as required for the fast square root.
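Working the numbers with Amdahl’s Law (a sketch):

```python
def amdahl(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only a fraction of the time is enhanced."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Option 1: speed up FPSQR (20% of execution time) by a factor of 10.
print(amdahl(0.2, 10))   # ~1.22
# Option 2: speed up all FP instructions (50% of execution time) by 1.6.
print(amdahl(0.5, 1.6))  # ~1.23 -- slightly better overall
```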
Calculating CPU Performance
$\text{CPU time} = \text{CPU clock cycles for a program} \times \text{Clock cycle time}$
or
$\text{CPU time} = \frac{\text{CPU clock cycles for a program}}{\text{Clock rate}}$
Clock cycles per instruction
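The definition that the next slide transposes is presumably the standard one:

$CPI = \frac{\text{CPU clock cycles for a program}}{\text{Instruction count (IC)}}$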
Calculating CPU time I
• By transposing the instruction count in the formula for CPI, clock cycles can be defined as IC × CPI. The execution time can then be calculated using the following formula:
$\text{CPU time} = IC \times CPI \times \text{Clock cycle time}$
Calculating CPU time II
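Presumably the per-instruction-class refinement of the same formula:

$\text{CPU time} = \left( \sum_{i=1}^{n} IC_i \times CPI_i \right) \times \text{Clock cycle time}$

where $IC_i$ is the instruction count and $CPI_i$ the cycles per instruction for instruction class i.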
An example I
An example – answer I
An example – answer II