CPS343 (Parallel and HPC)
Parallel Algorithm Analysis and Design
Spring 2020
Outline
Foster’s model
Four-phase design process: PCAM
Partitioning
Domain decomposition
Domain decomposition example: 3-D cube of data
1-D decomposition: split the cube into a 1-D array of slices (each slice is 2-D; coarse granularity)
2-D decomposition: split the cube into a 2-D array of columns (each column is 1-D)
3-D decomposition: split the cube into a 3-D array of individual data elements (fine granularity)
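As a concrete sketch (function names are hypothetical; assumes each process count divides N evenly), here is how a task might compute the extents of its block under the 1-D and 2-D decompositions. In the 3-D decomposition each task owns exactly one element, so no extents are needed.

    #include <stdio.h>

    /* 1-D decomposition: task p of P owns a slab of N/P consecutive z-slices */
    void extents_1d(int N, int P, int p, int *z0, int *z1)
    {
        *z0 = p * (N / P);
        *z1 = *z0 + N / P;
    }

    /* 2-D decomposition: task (py, pz) of a Py x Pz grid owns a 1-D column:
       all of x, a block in y and a block in z */
    void extents_2d(int N, int Py, int Pz, int py, int pz,
                    int *y0, int *y1, int *z0, int *z1)
    {
        *y0 = py * (N / Py);  *y1 = *y0 + N / Py;
        *z0 = pz * (N / Pz);  *z1 = *z0 + N / Pz;
    }

    int main(void)
    {
        int z0, z1;
        extents_1d(64, 8, 3, &z0, &z1);   /* task 3 of 8: slices 24..31 */
        printf("1-D: z in [%d, %d)\n", z0, z1);
        return 0;
    }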
Functional decomposition
Partitioning design checklist
Outline
Communication
Communication in domain and functional decomposition
Patterns of communication
Local communication
Local communication: Jacobi finite differences
[Figure: the communication channels for a particular node, indicated by arrows to and from its four nearest neighbors.]
Define channels linking each task that requires a value with the task
that generates that value.
Each task then executes the following logic:
for t = 0 to T − 1
    send X_{i,j}^{(t)} to each neighbor
    receive X_{i-1,j}^{(t)}, X_{i,j-1}^{(t)}, X_{i+1,j}^{(t)}, and X_{i,j+1}^{(t)} from neighbors
    compute X_{i,j}^{(t+1)}
endfor
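A minimal MPI sketch of this per-point task, assuming a 2-D Cartesian communicator with one grid point per task and MPI_PROC_NULL neighbors at the boundary; the update rule shown (a plain four-neighbor average) is one common choice, not necessarily the course's stencil.

    #include <mpi.h>

    /* one Jacobi task owning a single point x of the grid; comm2d is a
       2-D Cartesian communicator (boundary neighbors are MPI_PROC_NULL) */
    double jacobi_point(MPI_Comm comm2d, double x, int T)
    {
        int up, down, left, right;
        MPI_Cart_shift(comm2d, 0, 1, &up, &down);     /* neighbors in i */
        MPI_Cart_shift(comm2d, 1, 1, &left, &right);  /* neighbors in j */

        for (int t = 0; t < T; t++) {
            /* n[] defaults to x, so boundary points reuse their own value */
            double n[4] = { x, x, x, x };

            /* send x^(t) to each neighbor and receive their values */
            MPI_Sendrecv(&x, 1, MPI_DOUBLE, up,    0,
                         &n[0], 1, MPI_DOUBLE, down,  0, comm2d, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&x, 1, MPI_DOUBLE, down,  0,
                         &n[1], 1, MPI_DOUBLE, up,    0, comm2d, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&x, 1, MPI_DOUBLE, left,  0,
                         &n[2], 1, MPI_DOUBLE, right, 0, comm2d, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&x, 1, MPI_DOUBLE, right, 0,
                         &n[3], 1, MPI_DOUBLE, left,  0, comm2d, MPI_STATUS_IGNORE);

            /* compute x^(t+1) */
            x = 0.25 * (n[0] + n[1] + n[2] + n[3]);
        }
        return x;
    }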
Global communication
Distributing communication and computation
Uncovering concurrency: Divide and conquer
Divide and conquer algorithm
Divide and conquer analysis
Asynchronous communication
In this case, tasks that possess data (producers) cannot determine when other tasks (consumers) will require that data.
Consumers must therefore explicitly request data from producers, as in the sketch below.
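One common realization in MPI is polling (a hedged sketch; the tags, message layout, and function names here are illustrative assumptions): the producer probes for request messages between steps of its own computation.

    #include <mpi.h>

    #define TAG_REQUEST 1   /* illustrative tag values */
    #define TAG_REPLY   2

    /* consumer side: explicitly request element `index` from task `producer` */
    double request_datum(int producer, int index)
    {
        double value;
        MPI_Send(&index, 1, MPI_INT, producer, TAG_REQUEST, MPI_COMM_WORLD);
        MPI_Recv(&value, 1, MPI_DOUBLE, producer, TAG_REPLY,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        return value;
    }

    /* producer side: called periodically between local computation steps */
    void poll_for_requests(const double *data)
    {
        int pending;
        MPI_Status st;
        MPI_Iprobe(MPI_ANY_SOURCE, TAG_REQUEST, MPI_COMM_WORLD, &pending, &st);
        while (pending) {
            int index;
            MPI_Recv(&index, 1, MPI_INT, st.MPI_SOURCE, TAG_REQUEST,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&data[index], 1, MPI_DOUBLE, st.MPI_SOURCE, TAG_REPLY,
                     MPI_COMM_WORLD);
            MPI_Iprobe(MPI_ANY_SOURCE, TAG_REQUEST, MPI_COMM_WORLD, &pending, &st);
        }
    }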
Communication checklist
Outline
Agglomeration
Agglomeration: Conflicting goals
Increasing granularity
Increasing granularity: Fine grained version
Fine-grained partition of an 8 × 8 grid.
Partitioned into 64 tasks.
Each task is responsible for a single point.
64 × 4 = 256 communications are required, 4 per task.
A total of 256 data values are transferred.
[Figure: outgoing messages are dark shaded; incoming messages are light shaded.]
Increasing granularity: Coarse grained version
Coarse-grained partition of the same 8 × 8 grid.
Partitioned into 4 tasks.
Each task is responsible for 16 points.
4 × 4 = 16 communications are required.
A total of 16 × 4 = 64 data values are transferred.
[Figure: outgoing messages are dark shaded; incoming messages are light shaded.]
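The two cases are instances of one formula. A small sketch (function name hypothetical) for an n × n grid partitioned into square blocks of side b:

    /* communication cost of partitioning an n x n grid into (n/b) x (n/b)
       square blocks of side b (assumes b divides n, and counts 4 messages
       per task as the slide counts do, ignoring boundary effects) */
    void comm_cost(int n, int b, int *messages, int *values)
    {
        int tasks = (n / b) * (n / b);
        *messages = 4 * tasks;      /* each task exchanges with 4 neighbors */
        *values   = 4 * b * tasks;  /* each message carries one block edge: b values */
    }
    /* n = 8, b = 1: 64 tasks, 256 messages, 256 values
       n = 8, b = 4:  4 tasks,  16 messages,  64 values */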
Surface-to-volume effects
Replicating computation
Replicating computation example
Sum followed by broadcast: N tasks each have a value that must be combined into a single sum and made available to all tasks. Two natural approaches:
1. Each task receives a partial sum from its neighbor in a linear array, adds its own value, and passes the updated sum along. Task 0 completes the sum and sends it back down the line. This requires 2(N − 1) communication steps (sketched below).
2. Alternatively, the partial sums can be combined pairwise in a tree, taking log N steps, with the result broadcast back down the tree in another log N steps.
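A sketch of approach 1 in MPI, assuming one value x per task (tag values are illustrative): the partial sum flows from task N − 1 down to task 0, then the total flows back, 2(N − 1) steps in all.

    #include <mpi.h>

    /* linear-array sum followed by broadcast back: 2(size-1) steps */
    double chain_sum(double x, int rank, int size)
    {
        double sum = x;
        if (rank < size - 1) {           /* wait for the partial sum from above */
            double partial;
            MPI_Recv(&partial, 1, MPI_DOUBLE, rank + 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += partial;
        }
        if (rank > 0) {                  /* pass the updated partial sum down... */
            MPI_Send(&sum, 1, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD);
            /* ...and wait for the completed total to come back */
            MPI_Recv(&sum, 1, MPI_DOUBLE, rank - 1, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        if (rank < size - 1)             /* forward the total back up the line */
            MPI_Send(&sum, 1, MPI_DOUBLE, rank + 1, 1, MPI_COMM_WORLD);
        return sum;
    }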
These algorithms are optimal in the sense that they do not perform
any unnecessary computation or communication.
To improve the first summation, assume that tasks are connected in a
ring rather than an array, and all N tasks execute the same algorithm
so that N partial sums are in motion simultaneously. After N − 1
steps, the complete sum is replicated in every task.
This strategy avoids the need for a subsequent broadcast operation, but at the expense of (N − 1)² redundant additions and (N − 1)² unnecessary (but simultaneous) communications.
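A sketch of this ring variant (again assuming one value per task): every task forwards, at each step, the value it received in the previous step, so all N partial sums circulate at once and each task accumulates the total in N − 1 steps.

    #include <mpi.h>

    /* ring sum: after size-1 steps every task holds the total */
    double ring_sum(double x, int rank, int size)
    {
        int left  = (rank - 1 + size) % size;
        int right = (rank + 1) % size;
        double sum = x, outgoing = x;

        for (int step = 0; step < size - 1; step++) {
            double incoming;
            /* pass the most recently received value on around the ring */
            MPI_Sendrecv(&outgoing, 1, MPI_DOUBLE, right, 0,
                         &incoming, 1, MPI_DOUBLE, left,  0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += incoming;
            outgoing = incoming;
        }
        return sum;
    }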
The tree summation algorithm can also be modified so that after log N steps each task has a copy of the sum. Using a butterfly communication structure, this requires only O(N log N) operations. For N = 8 the structure looks like:
[Figure: butterfly communication structure for N = 8; in each of the log N = 3 stages, every task exchanges a partial sum with a partner at distance 1, 2, then 4.]
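A sketch of the butterfly (recursive-doubling) version in MPI, assuming the number of tasks is a power of two: at stage k, each task exchanges its partial sum with the partner whose rank differs in bit k.

    #include <mpi.h>

    /* butterfly sum: after log2(size) stages every task holds the total
       (assumes size is a power of two) */
    double butterfly_sum(double x, int rank, int size)
    {
        double sum = x;
        for (int mask = 1; mask < size; mask <<= 1) {
            int partner = rank ^ mask;   /* distance 1, 2, 4, ... */
            double other;
            MPI_Sendrecv(&sum,   1, MPI_DOUBLE, partner, 0,
                         &other, 1, MPI_DOUBLE, partner, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += other;
        }
        return sum;
    }

With N tasks each performing log N exchanges and additions, the total is O(N log N) operations, matching the count above.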
Avoiding communication
Preserving flexibility
Reducing software engineering costs
Agglomeration design checklist
Agglomeration design checklist (continued)
Outline
Mapping
Load balancing
Recursive Bisection
Local Algorithms
Probabilistic Methods
Cyclic Methods
Recursive bisection
Local algorithms
Probabilistic methods
Cyclic mappings
Task-scheduling algorithms
Manager/Worker
Hierarchical Manager/Worker
Decentralized schemes
No central manager.
Separate task pool is maintained on each processor.
Idle workers request problems from other processors (see the sketch below).
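A hedged sketch of the idle-worker side of such a scheme (the tags, the random choice of victim, and the integer task encoding are illustrative assumptions; a complete scheme also needs the serving side and termination detection, discussed next):

    #include <mpi.h>
    #include <stdlib.h>

    #define TAG_WORK_REQ 10   /* illustrative tag values */
    #define TAG_WORK_REP 11

    /* ask a randomly chosen processor for a task; the victim is assumed to
       reply with a task id, or -1 if its local pool is empty */
    int request_work(int rank, int size, int *task)
    {
        int victim = rand() % size;
        if (victim == rank)
            victim = (victim + 1) % size;
        MPI_Send(&rank, 1, MPI_INT, victim, TAG_WORK_REQ, MPI_COMM_WORLD);
        MPI_Recv(task, 1, MPI_INT, victim, TAG_WORK_REP,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        return *task >= 0;
    }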
Termination detection
Mapping design checklist
Acknowledgements
Material used in creating these slides comes from “Designing and Building
Parallel Programs” by Ian Foster, Addison-Wesley, 1995. Available on-line
at http://www.mcs.anl.gov/~itf/dbpp/