Parallel Computing
Fundamental Concepts
[Plot (circa 2012): microprocessor performance from KIPS through MIPS to GIPS vs. calendar year, 1980-2020, improving at roughly 1.6x per year; data points include the 68000, 80286, 80386, 68040, 80486, Pentium, Pentium II, and R10000. Annotation: the number of cores has been increasing, from a few in 2005 to the current 10s, and is projected to reach 100s by 2020.]
Fig. 1.1 The exponential growth of microprocessor performance,
known as Moore’s Law, shown over the past two decades (extrapolated).
Evolution of Computer Performance/Cost
From: "Robots After All," by H. Moravec, CACM, pp. 90-97, October 2003.
[Plot: processor performance (GIPS) growing at roughly 1.6x per year through denser circuits and architectural improvements (data points include the 68040 and Pentium), together with curves for clock speed, power, and number of cores vs. year of introduction.]
NRC Report (2011): The Future of Computing Performance: Game Over or Next Level?
Trends in Processor Chip Density, Performance, Clock Speed, Power, and Number of Cores
[Plot: the above quantities vs. year of introduction. Original data up to 2010 collected and plotted by M. Horowitz et al.; data for the 2010-2017 extension collected by K. Rupp.]
Koomey's Law: exponential improvement in energy-efficient computing, with computations performed per kWh doubling every 1.57 years.
[Plot: supercomputer performance from MFLOPS through GFLOPS to TFLOPS vs. calendar year, 1980-2010; data points include vector supercomputers (Cray X-MP, Y-MP), MPPs (CM-2, CM-5), and microprocessors (80386, 80860, Alpha), with dotted lines for $30M and $240M MPPs. Annotation: 300 GFLOP per iteration, 300 000 iterations per 6-yr period, 10^16 FLOP.]
Fig. 1.2 The exponential growth in supercomputer performance over
the past two decades (from [Bell92], with ASCI performance goals and
microprocessor peak FLOPS superimposed as dotted lines).
The ASCI Program
[Chart: performance (TFLOPS, 1 to 1000) vs. calendar year, 1995-2010, with Plan / Develop / Use phases for each machine:
ASCI Red — 1+ TFLOPS, 0.5 TB
ASCI Blue — 3+ TFLOPS, 1.5 TB
ASCI White — 10+ TFLOPS, 5 TB
ASCI Q — 30+ TFLOPS, 10 TB
ASCI Purple — 100+ TFLOPS, 20 TB]
Fig. 24.1 Milestones in the Accelerated Strategic (Advanced Simulation &)
Computing Initiative (ASCI) program, sponsored by the US Department of
Energy, with extrapolation up to the PFLOPS level.
The Quest for Higher Performance
Top Three Supercomputers in 2005 (IEEE Spectrum, Feb. 2005, pp. 15-16)
[Figure residue from the prime-finding (sieve of Eratosthenes) example: a bit vector of length n, indexed 1 to n, with the primes 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31 marked; the vector is divided among p processors in segments of n/p indices (n/p + 1 to 2n/p, ..., n − n/p + 1 to n); shown for p = 3, t = 499. Accompanying plots compare computation, communication, and I/O components of the solution time, and actual vs. ideal speedup.]
[Chart, 1960-2000: for each of several fields (networking, RISC, parallelism), bands show the progression from government research (GovRes) to industrial research (IndRes) to industrial development (IndDev) to a $1B business, with transfer of ideas/people between stages.]
The evolution of parallel processing has been quite different from that of other high-tech fields.
Development of some technical fields into $1B businesses and the roles played by government research and industrial R&D over time (IEEE Computer, early 1990s?).
Trends in Hi-Tech Development (2003)
The Flynn-Johnson classification:

Flynn's categories (single vs. multiple instruction streams, single vs. multiple data streams):
SISD — uniprocessors
SIMD — array or vector processors
MISD — rarely used
MIMD — multiprocessors or multicomputers

Johnson's expansion of MIMD (global vs. distributed memory, shared variables vs. message passing):
GMSV — shared-memory multiprocessors
GMMP — rarely used
DMSV — distributed shared memory
DMMP — distributed-memory multicomputers
Amdahl's law: if a fraction f of a computation cannot be improved and the remaining 1 − f is sped up by a factor p (the enhancement factor, i.e., the speedup of the rest), the overall speedup is
s = 1 / (f + (1 − f)/p) ≤ min(p, 1/f).
[Plot: s vs. p for f = 0.01, 0.02, 0.05, 0.1; each curve levels off near 1/f.]
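To make the formula concrete, here is a minimal Python sketch (mine, not from the slides) that evaluates the Amdahl speedup and its min(p, 1/f) ceiling for the plotted values of f.

```python
# Minimal sketch (not from the slides): Amdahl's law speedup.
def amdahl_speedup(f, p):
    """Speedup when a fraction f is serial and the rest is sped up by a factor p."""
    return 1.0 / (f + (1.0 - f) / p)

if __name__ == "__main__":
    for f in (0.01, 0.02, 0.05, 0.1):
        s = amdahl_speedup(f, p=50)
        print(f"f = {f}: s(50) = {s:5.2f}, ceiling min(p, 1/f) = {min(50, 1 / f):.0f}")
```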
[Figure: a task graph of 13 unit-time computations (nodes 1-13), so W(1) = 13, T(1) = 13, T(∞) = 8.]
Measures for a p-processor run performing W(p) unit computations in time T(p):
Speedup S(p) = T(1) / T(p)
Efficiency E(p) = T(1) / [p T(p)]
Redundancy R(p) = W(p) / W(1)
Utilization U(p) = W(p) / [p T(p)]
Quality Q(p) = T³(1) / [p T²(p) W(p)]
Reduction or Fan-in Computation
Example: Adding 16 numbers, 8 processors, unit-time additions
[Figure: the 16 numbers are added in a binary-tree (fan-in) pattern — 8 additions, then 4, then 2, then 1, for 15 additions in all.]

Zero-time communication: T(8) = 4, W(8) = 15
S(8) = 15 / 4 = 3.75
E(8) = 15 / (8 × 4) ≈ 47%
R(8) = 15 / 15 = 1
Q(8) = 1.76

Unit-time communication: T(8) = 7, W(8) = 22 (each inter-processor transfer counted as one unit of work)
S(8) = 15 / 7 = 2.14
E(8) = 15 / (8 × 7) ≈ 27%
R(8) = 22 / 15 = 1.47
Q(8) = 0.39
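As a cross-check, this small Python sketch (mine, not the author's code) plugs the figure's counts into the measures defined above and reproduces both sets of numbers.

```python
# Sketch: performance measures for adding 16 numbers on 8 processors.
def measures(p, t1, tp, w1, wp):
    s = t1 / tp                    # speedup    S(p) = T(1)/T(p)
    e = s / p                      # efficiency E(p) = S(p)/p
    r = wp / w1                    # redundancy R(p) = W(p)/W(1)
    u = wp / (p * tp)              # utilization U(p) = W(p)/[p T(p)]
    q = t1**3 / (p * tp**2 * wp)   # quality    Q(p) = T^3(1)/[p T^2(p) W(p)]
    return s, e, r, u, q

# Zero-time communication: 15 additions done in 4 steps.
print(measures(p=8, t1=15, tp=4, w1=15, wp=15))   # S ≈ 3.75, E ≈ 0.47, R = 1, Q ≈ 1.76
# Unit-time communication: 15 additions plus 7 transfers, done in 7 steps.
print(measures(p=8, t1=15, tp=7, w1=15, wp=22))   # S ≈ 2.14, E ≈ 0.27, R ≈ 1.47, Q ≈ 0.39
```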
Cost-Effectiveness Adage
Real news – The most cost-effective parallel solution may not be
the one with highest peak OPS (communication?), greatest speed-up
(at what cost?), or best utilization (hardware busy doing what?).
Analogy: Mass transit might be more cost-effective than private cars
even if it is slower and leads to many empty seats.
2 A Taste of Parallel Algorithms
Learn about the nature of parallel algorithms and complexity:
• By implementing 5 building-block parallel computations
• On 4 simple parallel architectures (20 combinations)
[Diagrams: a nine-processor linear array (and its ring variant), a nine-processor binary tree, and a 3 × 3 mesh (and its torus variant).]
We will spend more time on the linear array and the binary tree.
Semigroup computation viewed as a tree or fan-in computation, with log2 n levels: s = x0 ⊗ x1 ⊗ … ⊗ xn–1, where ⊗ is an associative binary operation.
Prefix computation on a uniprocessor: the prefixes s0, s1, …, sn–1 are computed one after another.
The Five Building-Block Computations
Reduction computation: aka tree, semigroup, fan-in comp.
All processors to get the result at the end
Scan computation: aka parallel prefix comp.
The ith processor to hold the ith prefix result at the end
Packet routing:
Send a packet from a source to a destination processor
Broadcasting:
Send a packet from a source to all processors
Sorting:
Arrange a set of keys, stored one per processor, so that
the ith processor holds the ith key in ascending order
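Before turning to the parallel algorithms, here is a minimal sequential Python sketch (mine; ordinary addition stands in for the generic operator ⊗) of what the first two building blocks must produce.

```python
# Sequential reference semantics (sketch); '+' stands in for the operator ⊗.
from functools import reduce
from itertools import accumulate

x = [5, 2, 8, 6, 3, 7, 9, 1, 4]

total = reduce(lambda a, b: a + b, x)   # reduction: every processor ends with this value
prefixes = list(accumulate(x))          # scan: processor i ends with prefixes[i]

print(total)      # 45
print(prefixes)   # [5, 7, 15, 21, 24, 31, 40, 41, 45]
```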
2.2 Some Simple Architectures
Fig. 2.2 A linear array of nine processors and its ring variant.
[Diagrams: a 3 × 3 mesh of nine processors and its torus variant; a nonsquare mesh (r rows, p/r columns) is also possible.]
[Diagram: a nine-processor shared-variable architecture — costly to implement and not scalable, but conceptually simple and easy to program.]
P0  P1  P2  P3  P4  P5  P6  P7  P8
Initial values:  5   2   8   6   3   7   9   1   4
After step 1:    5   7   8   6   3   7   9   1   4
After step 2:    5   7  15   6   3   7   9   1   4
After step 3:    5   7  15  21   3   7   9   1   4
After step 4:    5   7  15  21  24   7   9   1   4
After step 5:    5   7  15  21  24  31   9   1   4
After step 6:    5   7  15  21  24  31  40   1   4
After step 7:    5   7  15  21  24  31  40  41   4
Final results:   5   7  15  21  24  31  40  41  45
Fig. 2.7 Computing prefix sums on a linear array of nine processors.
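A minimal Python sketch (mine, not the textbook's code) that simulates the behavior in Fig. 2.7: in each step, the next processor adds the running prefix received from its left neighbor to its own value.

```python
# Sketch: prefix sums on a linear array, one value per processor.
def linear_array_prefix_sums(values):
    vals = list(values)
    snapshots = [list(vals)]
    # Step i: processor i passes its prefix to processor i+1, which adds it in.
    for i in range(len(vals) - 1):
        vals[i + 1] += vals[i]
        snapshots.append(list(vals))
    return snapshots

for row in linear_array_prefix_sums([5, 2, 8, 6, 3, 7, 9, 1, 4]):
    print(row)
# Final row: [5, 7, 15, 21, 24, 31, 40, 41, 45], as in Fig. 2.7.
```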
Prefix sums on a linear array with two items per processor:
P0  P1  P2  P3  P4  P5  P6  P7  P8
Initial values:       5   2   8   6   3   7   9   1   4
                      1   6   3   2   5   3   6   7   5
Local prefixes:       5   2   8   6   3   7   9   1   4
                      6   8  11   8   8  10  15   8   9
Diminished prefixes:  0   6  14  25  33  41  51  66  74   (exclusive prefixes of the local totals)
Final results:        5   8  22  31  36  48  60  67  78   (local prefixes + diminished prefix)
                      6  14  25  33  41  51  66  74  83
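A sketch (mine) of this coarser-grained scheme: each processor forms local prefixes of its own items, a diminished prefix of the local totals is computed across processors, and each processor then adds its incoming value to its local prefixes.

```python
# Sketch: prefix sums with several items per processor.
from itertools import accumulate

def blocked_prefix_sums(blocks):
    local = [list(accumulate(b)) for b in blocks]        # local prefixes per processor
    totals = [lp[-1] for lp in local]                    # one total per processor
    diminished = [0] + list(accumulate(totals))[:-1]     # exclusive prefix of the totals
    return [[d + v for v in lp] for d, lp in zip(diminished, local)]

blocks = [[5, 1], [2, 6], [8, 3], [6, 2], [3, 5], [7, 3], [9, 6], [1, 7], [4, 5]]
print(blocked_prefix_sums(blocks))
# [[5, 6], [8, 14], [22, 25], [31, 33], [36, 41], [48, 51], [60, 66], [67, 74], [78, 83]]
```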
Linear Array Sorting (Externally Supplied Keys)
[Diagram (from the routing discussion): left-moving and right-moving packets on the linear array.]
[Figure: the keys 5, 2, 8, 6, 3, 7, 9, 1, 4 are fed into P0 from the left, one per step; each processor keeps the smaller of its stored key and the incoming key and passes the larger one to the right, so the array ends up holding 1 2 3 4 5 6 7 8 9 in sorted order.]
Fig. 2.9 Sorting on a linear array with the keys input sequentially from the left.
P0  P1  P2  P3  P4  P5  P6  P7  P8
Initial: 5   2   8   6   3   7   9   1   4
Step 1:  5   2   8   3   6   7   9   1   4
Step 2:  2   5   3   8   6   7   1   9   4
Step 3:  2   3   5   6   8   1   7   4   9
Step 4:  2   3   5   6   1   8   4   7   9
Step 5:  2   3   5   1   6   4   8   7   9
Step 6:  2   3   1   5   4   6   7   8   9
Step 7:  2   1   3   4   5   6   7   8   9
Step 8:  1   2   3   4   5   6   7   8   9
In odd steps (1, 3, 5, ...), odd-numbered processors compare-exchange values with their right neighbors; in even steps, even-numbered processors do the same.
Fig. 2.10 Odd-even transposition sort on a linear array.
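A direct Python rendering (mine) of the compare-exchange rule in Fig. 2.10; each pass of the loop corresponds to one time step of the p-processor algorithm.

```python
# Sketch: odd-even transposition sort (one list element per processor).
def odd_even_transposition_sort(keys):
    a = list(keys)
    p = len(a)
    for step in range(1, p + 1):
        start = 1 if step % 2 == 1 else 0      # odd steps: odd-indexed processors act
        for i in range(start, p - 1, 2):       # compare-exchange with the right neighbor
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
        print(step, a)
    return a

odd_even_transposition_sort([5, 2, 8, 6, 3, 7, 9, 1, 4])
# Reproduces the rows of Fig. 2.10; the array is sorted after 8 of the 9 steps.
```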
Binary Tree Scan Computation
[Fig. 2.11: the leaves hold x0, x1, x2, x3, x4. Upward propagation: each inner node computes the combination (⊗) of the values in its subtree, e.g., x0 ⊗ x1 and x3 ⊗ x4. Downward propagation: the root sends the identity down; each node passes the value it receives to its left child, and passes that value combined with its left subtree's total to its right child. Each leaf then combines the received value with its own, yielding the results x0, x0 ⊗ x1, x0 ⊗ x1 ⊗ x2, x0 ⊗ x1 ⊗ x2 ⊗ x3, x0 ⊗ x1 ⊗ x2 ⊗ x3 ⊗ x4.]
Fig. 2.11 Scan computation on a binary tree of processors.
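A recursive Python sketch (my reconstruction of the scheme in Fig. 2.11, not the book's code): the upward pass computes subtree totals, and the downward pass hands each right subtree the combined value of everything to its left.

```python
# Sketch: scan (prefix) computation organized as on a binary tree of leaves.
def tree_scan(leaves, op=lambda a, b: a + b):
    n = len(leaves)
    if n == 1:
        return [leaves[0]], leaves[0]            # (prefix results, subtree total)

    mid = n // 2
    left_res, left_tot = tree_scan(leaves[:mid], op)
    right_res, right_tot = tree_scan(leaves[mid:], op)
    # Downward propagation: the right subtree receives the left subtree's total.
    right_res = [op(left_tot, r) for r in right_res]
    return left_res + right_res, op(left_tot, right_tot)

results, total = tree_scan([5, 2, 8, 6, 3, 7, 9, 1, 4])
print(results)   # [5, 7, 15, 21, 24, 31, 40, 41, 45]
print(total)     # 45
```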
Parallel prefix combines intervals: [0, i − 1] ⊗ [i, j − 1] = [0, j − 1], and similarly with [j, k].
Carry-lookahead network: each digit position produces a carry signal g (generate), p (propagate), or a (annihilate) — e.g., the string g a g g p p p g a followed by cin. The signals are combined, in the direction of indexing, with the operator ¢ defined by
g ¢ x = g,  p ¢ x = x,  a ¢ x = a,
so each combined result is g or a (carry or no carry), and all carries can be obtained by a parallel prefix computation over the signal string.
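A tiny Python sketch (mine) of the carry operator: scanning the per-position signals from the cin end, g forces a carry, a kills it, and p passes the incoming state through.

```python
# Sketch: carries as a prefix computation over g/p/a signals.
def combine(signal, incoming):
    """The ¢ operator: g ¢ x = g, a ¢ x = a, p ¢ x = x."""
    return incoming if signal == 'p' else signal

def all_carries(signals, cin='a'):
    """Carry state produced at each position, scanning from the cin end."""
    state, carries = cin, []
    for s in reversed(signals):        # signals written left to right, cin at the right
        state = combine(s, state)
        carries.append(state)
    return list(reversed(carries))

print(all_carries(list("gaggpppga")))  # 'g' = carry, 'a' = no carry, per position
```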
Binary Tree Packet Routing
[Figure: the nine tree nodes labeled both by preorder indexing (P0-P8) and by the path from the root: XXX for the root; LXX, RXX at the next level; LLX, LRX, RLX, RRX below; RRL, RRR at the bottom.]
A node's index is a representation of the path from the tree root to that node.
[Figure panels (a)-(d): the first few snapshots of the keys 5, 2, 3, 1, 4 moving through the tree.]
Small values "bubble up," causing the root to "see" the values in ascending order.
Linear-time sorting (no better than the linear array).
Fig. 2.12 The first few steps of the sorting algorithm on a binary tree.
The Bisection-Width Bottleneck in a Binary Tree
Bisection Width = 1
Linear-time sorting is the best possible, due to B = 1.
[Figure: finding the max value on a 2D mesh — row maximums are computed first, then the column maximum of the row results.]
Computing prefix sums on a 2D mesh (row-major order):
Row prefix sums:                       5  7 15 / 6  9 16 / 9 10 14
Diminished prefix sums, last column:   0, 15, 31
Broadcast in rows and combine:         5  7 15 / 21 24 31 / 40 41 45
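A small Python sketch (mine) of this three-phase scheme for row-major prefix sums on a mesh.

```python
# Sketch: prefix sums on a 2D mesh, row-major order, in three phases.
from itertools import accumulate

def mesh_prefix_sums(grid):
    # Phase 1: prefix sums within each row.
    rows = [list(accumulate(r)) for r in grid]
    # Phase 2: diminished (exclusive) prefix sums of the last column.
    row_totals = [r[-1] for r in rows]
    carry = [0] + list(accumulate(row_totals))[:-1]
    # Phase 3: broadcast each row's carry within the row and combine.
    return [[c + v for v in r] for c, r in zip(carry, rows)]

print(mesh_prefix_sums([[5, 2, 8],
                        [6, 3, 7],
                        [9, 1, 4]]))
# [[5, 7, 15], [21, 24, 31], [40, 41, 45]]
```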
Routing and Broadcasting on a 2D Mesh
[Diagram: a 3 × 3 mesh of nine processors and its torus variant; a nonsquare mesh (r rows, p/r columns) is also possible.]
Routing: Send the packet along its row to the correct column, then route it within that column.
Broadcasting: Broadcast within the source row, then broadcast within all columns.
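For instance, a row-first route can be computed as in this sketch (mine; processors identified by (row, column) coordinates).

```python
# Sketch: row-first (dimension-order) routing on a 2D mesh.
def row_first_route(src, dst):
    """List of (row, col) hops from src to dst: along the row, then down/up the column."""
    r, c = src
    dr, dc = dst
    path = [(r, c)]
    while c != dc:                      # move along the row to the destination column
        c += 1 if dc > c else -1
        path.append((r, c))
    while r != dr:                      # then move within the column to the destination row
        r += 1 if dr > r else -1
        path.append((r, c))
    return path

print(row_first_route((0, 0), (2, 2)))
# [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]
```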
[Figure: three panels plotting f(n) against c g(n) beyond a threshold n0, illustrating f(n) = O(g(n)), f(n) = Ω(g(n)), and f(n) = Θ(g(n)).]
Fig. 3.1 Graphical representation of the notions of asymptotic complexity.
[Figure: typical complexity classes arranged on a scale — log n, log² n, n / log n, n (linear, O(n)), n log log n, n log n, n² — grouped into sublinear, linear, and superlinear; over time, lower bounds shift upward and upper bounds improve downward.]
Fig. 3.2 Upper and lower bounds may tighten over time.
Complexity History of Some Real Problems
Examples from the book Algorithmic Graph Theory and Perfect Graphs [GOLU04]:
Complexity of determining whether an n-vertex graph is planar
Exponential — Kuratowski, 1930
O(n³) — Auslander and Porter, 1961; Goldstein, 1963; Shirey, 1969
O(n²) — Lempel, Even, and Cederbaum, 1967
O(n log n) — Hopcroft and Tarjan, 1972
O(n) — Hopcroft and Tarjan, 1974; Booth and Lueker, 1976
A second, more complex example: Max network flow, n vertices, e edges:
ne², n²e, n³, n²e^(1/2), n^(5/3) e^(2/3), ne log² n, ne log(n²/e),
ne + n^(2+ε), ne log_(e/(n log n)) n, ne log_(e/n) n + n² log^(2+ε) n
[Figure: two routes to the same solution — machine or algorithm A takes 4 steps, machine or algorithm B takes 20 steps.]
For example, one algorithm may need 20 GFLOP and another 4 GFLOP (but float division is a factor of 10 slower than float multiplication), so operation counts alone do not determine which is faster.
[Diagram: complexity classes — P (polynomial, tractable) within NP (nondeterministic polynomial); NP-complete problems (e.g., the subset sum problem); the open question "P = NP?"; and the efficiently parallelizable problems. This diagram has been replaced with a more complete one.]
The Flynn-Johnson classification (control stream(s): single vs. multiple; memory: global vs. distributed):
SISD — "uniprocessor"
SIMD — "array processor"
MISD — rarely used
MIMD — subdivided by memory organization and communication style:
  GMSV — "shared-memory multiprocessor"
  GMMP — rarely used
  DMSV — "distributed shared memory"
  DMMP — "distributed-memory multicomputer"
[Diagram: processors 0 to p−1 connected by a processor-to-processor network and, through a processor-to-memory network, to memory modules 0 to m−1, with parallel I/O. Options for the processor-to-memory connection: a bus (a bottleneck), a crossbar (expensive), or a multistage interconnection network, MIN (complex).]
[A second diagram adds a cache to each processor; the challenge is cache coherence.]
Fig. 4.4 A parallel processor with global memory and processor caches.
Distributed Shared Memory
[Diagram: processors 0 to p−1, each paired with a local memory, connected through an interconnection network, with parallel I/O.]
Some terminology:
NUMA — nonuniform memory access (distributed shared memory)
UMA — uniform memory access (global shared memory)
COMA — cache-only memory architecture
[Diagram: low-level clusters of processors and memories joined through bus switches (gateways) into a larger system.]
Scalability dictates hierarchical connectivity
Computer as: industrial installation vs. office machine.
[Plot: distribution (0.0-1.0) of wire lengths (0-6 mm) for hypercube, 2-D torus, and 2-D mesh layouts.]
Scaling up (Fig. 4.11): Scaled-up ant on the rampage! What is wrong with this picture? (If the weight of the ant grows …)