Algorithm Design
Prof V B More
MET-BKC IOE
1
Algorithms and Concurrency
Introduction to Parallel Algorithms
• Tasks and Decomposition
• Processes and Mapping
• Processes Versus Processors
Decomposition / Partitioning
Techniques
• Recursive Decomposition
• Exploratory Decomposition
• Hybrid Decomposition
2
Algorithms and Concurrency
3
Concurrency and Mapping
• Mapping Techniques for Load
Balancing
–Static and Dynamic Mapping
6
Decomposition, Tasks, and Dependency
Graphs
• Tasks can be decomposed into sub-tasks in various ways. The decomposed tasks may be of:
– the same size,
– different sizes, or
– intermediate sizes.
7
Decomposition, Tasks, and Dependency
Graphs
• A decomposition can be illustrated in the form of a directed graph, with nodes corresponding to tasks and edges indicating that the result of one task is required for processing the next. Such a graph is called a task-dependency graph.
8
Example: Decomposition of a task into nodes and edges
Figure: a main task decomposed into sub-tasks and drawn as a task-dependency graph; the nodes are the decomposed tasks (with their weights) and the edges are the dependencies between them.
9
Example: Multiplying a Dense Matrix with a Vector
• Computation of each element of the output vector y = A·b is independent of the other elements.
• Based on this, a dense matrix-vector product can be decomposed into n tasks (Task 1, Task 2, ..., Task n), one per element of y.
• The shaded portion of the matrix and vector in the figure is the data accessed by Task 1.
10
Example: Multiplying a Dense Matrix with a Vector
Findings: While tasks share data (the vector b), they do not have any control dependencies, i.e., no task needs to wait for the (partial) completion of any other. All tasks are of the same size in terms of the number of operations.
11
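A minimal C sketch of this row-wise decomposition (function names are illustrative, not from the slides): each of the n independent tasks computes one element of y and reads one row of A plus the whole of b.

/* Task i of the decomposition: compute element i of y = A*b. */
void matvec_task(int i, int n, const double A[], const double b[], double y[]) {
    double sum = 0.0;
    for (int j = 0; j < n; j++)
        sum += A[i * n + j] * b[j];   /* reads row i of A and all of b */
    y[i] = sum;                       /* no other task writes y[i]     */
}

/* Sequential driver: each loop iteration is one task, so the n
   iterations could equally well be executed concurrently. */
void matvec(int n, const double A[], const double b[], double y[]) {
    for (int i = 0; i < n; i++)
        matvec_task(i, n, A, b, y);
}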
Example: Database Query Processing
Consider the execution of the query:
MODEL = "CIVIC" AND YEAR = 2001 AND
(COLOR = "GREEN" OR COLOR = "WHITE")
on the following database:
ID# Model Year Color Dealer Price
4523 Civic 2002 Blue MN $18,000
3476 Corolla 1999 White IL $15,000
7623 Camry 2001 Green NY $21,000
9834 Prius 2001 Green CA $18,000
6734 Civic 2001 White OR $17,000
5342 Altima 2001 Green FL $19,000
3845 Maxima 2001 Blue NY $22,000
8354 Accord 2000 Green VT $18,000
4395 Civic 2001 Red CA $17,000
7352 Civic 2002 Red WA $18,000
12
Example: Database Query Processing
13
Example: Database Query Processing
15
Example: Database Query Processing
Figure: decomposing the given query MODEL = "CIVIC" AND YEAR = 2001 AND (COLOR = "GREEN" OR COLOR = "WHITE") into a number of tasks. The leaf tasks each produce the table of records matching one predicate (MODEL = "CIVIC"; YEAR = 2001; COLOR = "WHITE"; COLOR = "GREEN"); intermediate tasks combine these tables (Civic AND 2001, White OR Green), and the edges of the graph denote that the output of one task is needed to accomplish the next. The final result is the single record: 6734 Civic 2001 White.
16
Example: Database Query Processing
Task: MODEL = "CIVIC"
ID#   Model
4523  Civic
6734  Civic
4395  Civic
7352  Civic
17
Example: Database Query Processing
Task: YEAR = 2001
ID#   Year
7623  2001
9834  2001
6734  2001
5342  2001
3845  2001
4395  2001
19
Example: Database Query Processing
Task: COLOR = "WHITE"
ID#   Color
3476  White
6734  White
21
Example: Database Query Processing
Task: COLOR = "GREEN"
ID#   Color
7623  Green
9834  Green
5342  Green
8354  Green
23
Example: Database Query Processing
Task: MODEL = "CIVIC" AND YEAR = 2001 (combines the MODEL = "CIVIC" and YEAR = 2001 tables)
ID#   Model   Year
6734  Civic   2001
4395  Civic   2001
25
Example: Database Query Processing
Task: COLOR = "GREEN" OR COLOR = "WHITE" (combines the two color tables)
ID#   Color
3476  White
7623  Green
9834  Green
6734  White
5342  Green
8354  Green
27
Example: Database Query Processing
Task: MODEL = "CIVIC" AND YEAR = 2001 AND (COLOR = "GREEN" OR COLOR = "WHITE")
Combining the (MODEL = "CIVIC" AND YEAR = 2001) table with the (COLOR = "GREEN" OR COLOR = "WHITE") table gives the final result:
ID#   Model   Year   Color
6734  Civic   2001   White
An alternate decomposition of the given query into subtasks, along with their dependencies, first computes 2001 AND (White OR Green) and then combines that result with MODEL = "CIVIC". Different task decomposition methods may lead to different parallel performance.
31
Granularity of Task Decompositions
• The number of tasks into which a problem is decomposed determines its granularity.
• Decomposition into a large number of tasks results in a fine-grained decomposition, and decomposition into a small number of tasks results in a coarse-grained decomposition.
32
Granularity of Task Decompositions
Figure: a coarse-grained decomposition of the dense matrix-vector product y = A·b into only four tasks, each computing a contiguous block of elements of y.
38
Critical Path Length
39
Critical Path Length
40
Critical Path Length
• Consider the task-dependency graphs of the two database query decompositions:
Figure: graphs (a) and (b). In both, Tasks 1–4 have weight 10. In (a), Task 5 has weight 6, Task 6 weight 9 and Task 7 weight 8; in (b), Task 5 has weight 6, Task 6 weight 11 and Task 7 weight 7.
41
Questions:
• What are the critical path lengths for the
two task dependency graphs?
• If each task takes 10 time units, what is the
shortest parallel execution time for each
decomposition?
• How many processors are needed in each
case to achieve this minimum parallel
execution time?
• What is the maximum degree of
concurrency?
42
Limits on Parallel Performance
• It may appear that the parallel time can be made arbitrarily small by making the decomposition finer in granularity.
• However, there is an inherent bound on how fine the granularity of a computation can be. For example, in the case of multiplying a dense matrix with a vector, there can be no more than n² concurrent tasks.
43
Limits on Parallel Performance
Figure: a 3×3 matrix (elements 1–9, with rows and columns labelled) used to illustrate the finest-grained decomposition of the matrix-vector product, in which each task handles a single element.
47
Partitioning Techniques
• There is no single recipe that works
for all problems.
• We can benefit from some commonly
used techniques:
– Recursive Decomposition
– Data Decomposition
– Exploratory Decomposition
– Speculative Decomposition
48
Recursive Decomposition
• Generally suited to problems that are
solved using a divide and conquer
strategy.
• Decompose based on sub-problems
• Often results in natural concurrency
as sub-problems can be solved in
parallel.
• Need to think recursively
– parallel not sequential
49
Recursive Decomposition: Quicksort
51
Recursive Decomposition:
Finding the Min/Max/Sum
• Rewrite using recursion and max partitioning
– Don't make a serial recursive routine
Note: divide the work in half each time.
1. procedure RECURSIVE_MIN (A, n)
2. begin
3.   if ( n = 1 ) then
4.     min := A[0];
5.   else
6.     lmin := RECURSIVE_MIN ( A, n/2 );
7.     rmin := RECURSIVE_MIN ( &(A[n/2]), n - n/2 );
8.     if (lmin < rmin) then
9.       min := lmin;
10.    else
11.      min := rmin;
12.    endelse;
13.  endelse;
14.  return min;
15. end RECURSIVE_MIN
52
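A hedged C sketch of the same divide-in-half recursion using OpenMP tasks (OpenMP is used only for illustration; the slides do not prescribe an API): the two halves are independent sub-problems and can be solved in parallel.

#include <omp.h>

/* Recursive decomposition: the two halves are independent tasks. */
int recursive_min(const int *A, int n) {
    if (n == 1) return A[0];
    int lmin, rmin;
    #pragma omp task shared(lmin) firstprivate(A, n)
    lmin = recursive_min(A, n / 2);              /* left half as a separate task */
    rmin = recursive_min(A + n / 2, n - n / 2);  /* right half in this task      */
    #pragma omp taskwait                         /* wait for the left half       */
    return (lmin < rmin) ? lmin : rmin;
}

/* Usage sketch: call once inside a parallel region. */
int parallel_min(const int *A, int n) {
    int result;
    #pragma omp parallel
    #pragma omp single
    result = recursive_min(A, n);
    return result;
}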
Recursive Decomposition:
Finding the Min/Max/Sum
• Example: Find min of {4, 9, 1, 7, 8, 11, 2, 12}
Step 1: split the list into {4, 9, 1, 7} and {8, 11, 2, 12}
Step 2: split again into {4, 9}, {1, 7}, {8, 11}, {2, 12}
Step 3: take the minimum of each pair and combine the partial minima back up to obtain 1
53
Recursive Decomposition:
Finding the Min/Max/Sum
• Some problems are hard to divide exactly in half
• Often, can be mapped to a hypercube
for a very efficient algorithm
• The overhead of dividing the
computation is important.
– How much does it cost to communicate
necessary dependencies?
54
Recursive Decomposition:
Sequential Merge Sort
Pseudo code of Sequential Merge Sort with Recursion
void mergeSort(int *a, int first, int last, int *aux) {
    if (last <= first) return;
    int mid = (first + last) / 2;
    mergeSort(a, first, mid, aux);      // first half
    mergeSort(a, mid + 1, last, aux);   // second half
    mergeArrays(a, first, mid, a, mid + 1, last, aux, first, last);
    for (int i = first; i <= last; i++) a[i] = aux[i];
}

void mergeArrays(int *a, int afirst, int alast, int *b, int bfirst, int blast,
                 int *c, int cfirst, int clast) {
    int i = afirst, j = bfirst, k = cfirst;
    while (i <= alast && j <= blast) {
        if (a[i] < b[j]) c[k++] = a[i++];
        else             c[k++] = b[j++];
    }
    while (i <= alast) c[k++] = a[i++];
    while (j <= blast) c[k++] = b[j++];
}
55
Recursive Decomposition:
Parallel Merge Sort
Pseudo code of Parallel Merge Sort
void parallel_mergeSort(int *a, int size, int *aux) {
    if (proc_id > 0) {                       // non-root: receive sub-array from parent
        Recv(size, parent);
        Recv(a, size, parent);
    }
    int mid = size / 2;
    if (both children exist) {               // split the work between the two children
        Send(mid, child1);        Send(size - mid, child2);
        Send(a, mid, child1);     Send(a + mid, size - mid, child2);
        Recv(a, mid, child1);     Recv(a + mid, size - mid, child2);
        mergeArrays(a, 0, mid - 1, a, mid, size - 1, aux, 0, size - 1);
        for (int i = 0; i < size; i++) a[i] = aux[i];
    }
    else mergeSort(a, 0, size - 1, aux);     // leaf process sorts its part sequentially
    if (proc_id > 0) Send(a, size, parent);  // return the sorted sub-array to the parent
}
57
Output Data Decomposition
• Each element of the output can be
computed independently of the others
– A function of the input
– All may be able to share the input or have a
copy of their own
• Decompose the problem naturally.
• Embarrassingly Parallel
– Output data decomposition with no need for
communication
– Mandelbrot, Simple Ray Tracing, etc.
58
Mandelbrot Fractal Zoom
https://youtu.be/PD2XgQOyCCk
3D Fractal
https://youtu.be/S530Vwa33G0
64
Output Data Decomposition:
Example
A partitioning of output data does
not result in a unique
decomposition into tasks. For
example, with identical output data
distribution, we can derive the
following two (other)
decompositions:
65
Output Data Decomposition:
Example
Decomposition I Decomposition II
67
Output Data Decomposition
• Count the instances of given itemsets
68
Input Data Decomposition
• Applicable if the output can be naturally
computed as a function of the input.
• In many cases, this is the only natural
decomposition because the output is not
clearly known a-priori
– finding minimum in list, sorting, etc.
• Associate a task with each input data
partition.
• Tasks communicate where necessary input is
“owned” by another task.
69
Input Data Decomposition
• Count the instances of given itemsets
• Each task generates partial counts for all itemsets
which must be aggregated.
71
Intermediate Data Partitioning
72
Intermediate Data Partitioning: Example
Let us revisit the example of dense matrix multiplication. We first show how we can visualize this computation in terms of intermediate matrices D.
Figure: C = A·B via intermediate results D, where D_{k,i,j} = A_{i,k} · B_{k,j}; for example D_{1,1,1} = A_{1,1}·B_{1,1} and D_{2,1,1} = A_{1,2}·B_{2,1}, and each C_{i,j} = D_{1,i,j} + D_{2,i,j}.
73
Multiplication of two 2x2 matrices A and B:

    A = |  2  3 |        B = |  3  -2 |
        | -1  4 |            |  3   1 |

Intermediate result D1 (column 1 of A times row 1 of B):
    |  2x3     2x(-2)   |   =  |  6  -4 |
    | (-1)x3  (-1)x(-2) |      | -3   2 |

Intermediate result D2 (column 2 of A times row 2 of B):
    | 3x3  3x1 |             =  |  9   3 |
    | 4x3  4x1 |                | 12   4 |

Result C = D1 + D2 = | 15  -1 |
                     |  9   6 |
74
Domain Decomposition
• Often can be viewed as input data
decomposition
– May not be input data
– Just domain of calculation
• Split up the domain among tasks
• Each task is responsible for computing the
answer for its partition of the domain
• Tasks may end up needing to
communicate boundary values to perform
necessary calculations
75
Domain Decomposition
• Evaluate the integral of 4/(1 + x²) over [0, 1] (whose value is π).
• Each task evaluates the integral over its own partition of the domain, and the partial results are added.
76
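A minimal C sketch of this domain decomposition (the task count and the midpoint rule are illustrative assumptions): the interval [0, 1] is split into one sub-interval per task, each task integrates 4/(1 + x^2) over its sub-interval, and the partial results are summed.

#include <stdio.h>

#define NTASKS 4          /* illustrative: one sub-interval per task */
#define STEPS  100000     /* midpoint-rule steps per task            */

/* One task: integrate 4/(1+x^2) over [a, b] with the midpoint rule. */
double task_integral(double a, double b) {
    double h = (b - a) / STEPS, sum = 0.0;
    for (int i = 0; i < STEPS; i++) {
        double x = a + (i + 0.5) * h;
        sum += 4.0 / (1.0 + x * x) * h;
    }
    return sum;
}

int main(void) {
    double pi = 0.0;
    for (int t = 0; t < NTASKS; t++)   /* tasks are independent; could run in parallel */
        pi += task_integral((double)t / NTASKS, (double)(t + 1) / NTASKS);
    printf("pi ~= %.10f\n", pi);
    return 0;
}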
Domain Decomposition
• Often a natural approach for grid/matrix
problems
77
Exploratory Decomposition
• In many cases, the decomposition of a
problem goes hand-in-hand with its
execution.
• Typically, these problems involve the
exploration of a state space.
– Discrete optimization
– Theorem proving
– Game playing
78
Exploratory Decomposition
• 15 puzzle – put the numbers in order
– only move one piece at a time to a blank spot
79
Exploratory Decomposition
• Generate successor states and assign them to independent tasks.
80
Figure: the successor states of an initial 15-puzzle configuration are generated and assigned to Tasks 1–4; each task then searches its own subtree until the solved state is reached.
86
Exploratory Decomposition
• Exploratory decomposition techniques may change the
amount of work done by the parallel implementation.
• Change can result in super- or sub-linear speedups
87
Speculative Decomposition
• Sometimes, dependencies are not known
a-priori
• Two approaches
– conservative – identify independent tasks only
when they are guaranteed to not have
dependencies
• May yield little concurrency
– optimistic – schedule tasks even when they
may be erroneous
• May require a roll-back mechanism in the case of
an error.
88
Speculative Decomposition
• The speedup due to speculative
decomposition can add up if there
are multiple speculative stages
• Examples
– Concurrently evaluating all branches of a C switch statement
– Discrete event simulation
89
Speculative Decomposition
A switch statement works based on the value of an expression; the corresponding case statement executes.

Sequential switch:
    compute expr;
    switch (expr)
    {
      case 1: compute-e1; break;
      case 2: compute-e2; break;
      ......
    }

Parallel (speculative) switch:
    Slave(i)
    {
      compute ei;               /* speculatively evaluate case i */
      Wait(request);
      if (request) Send(ei, 0);
    }
    Master()
    {
      compute expr;
      switch (expr)
      {
        case 1:
          Send(request, 1);
          Receive(a1, i);
        ...
      }
    }
90
Speculative Decomposition
Discrete Event Simulation
• The central data structure is a time-
ordered event list.
• Events are extracted precisely in time
order, processed, and if required,
resulting events are inserted back
into the event list.
91
Speculative Decomposition
Discrete Event Simulation
• Consider MET-UTSAV-21 as a discrete event system –
– Every day a number of events are scheduled one after the other, and on the last day there is a Musical Night.
92
Speculative Decomposition
Discrete Event Simulation
• Each of these events may be processed independently:
– since there is no concrete dependency of one event on another, they can be executed independently.
93
Speculative Decomposition
Discrete Event Simulation
– In certain situations, such as natural calamities, the scheduled execution of the events can be hampered, which may become cumbersome to manage.
• Therefore, optimistic scheduling of the other events is possible only if there is a backup plan.
94
Speculative Decomposition
Discrete Event Simulation
• Simulate a network of nodes
– various inputs, node delay parameters, queue
sizes, service rates, etc.
95
Hybrid Decomposition
• Often, a mix of decomposition techniques
is necessary
• In quicksort, recursive decomposition
alone limits concurrency (Why?). A mix of
data and recursive decompositions is
more desirable.
96
Hybrid Decomposition
• In discrete event simulation, there might
be concurrency in task processing. A mix
of speculative decomposition and data
decomposition may work well.
• Even for simple problems like finding a
minimum of a list of numbers, a mix of
data and recursive decomposition works
well.
97
Task Characterization
98
Task Characterization
• Task characteristics can have a dramatic impact on performance. The basic characteristics of tasks are:
• Task generation
– Static
– Dynamic
99
Task Characterization
• Size of Task
– Uniform
– Non-uniform
• Data Size
– Size
– Uniformity
100
Task Generation
• From the algorithm's point of view, tasks are the units that are executed in parallel.
• Tasks are generated in two ways: static and dynamic.
101
Task Generation
• Static
– Tasks are known in advance, before their execution, and are executed in a previously defined order.
– The number of tasks, task sizes and data sizes are all known a priori, so their execution is deterministic.
– E.g. image processing, matrix and graph algorithms
102
Task Generation
• Dynamic
– Tasks are created dynamically, based on the decomposition of data in particular situations.
– Tasks are not all available before execution begins; the set of tasks changes throughout the run.
– They are difficult to launch during the run in a scheduled environment
103
Task Generation
–most often dealt with using
dynamic load balancing techniques
–Recursive and exploratory
decomposition techniques are
considered as examples of
dynamic task generation.
104
Task Size – Data size
• Execution time
–uniform – synchronous steps
–non-uniform – difficult to determine
synchronization points
• often handled using Master-Worker
paradigm
• otherwise polling is necessary in
message passing
105
Task Size – Data size
• Data Size
– Data size is one of the crucial properties of a task. The required data should be made available when the task is mapped onto a process.
– The overhead associated with data movement can be reduced when the size of the data and its memory locations are known at the time of processing.
106
Task Interactions
• Static interactions: The tasks and
their interactions are known a-priori.
These are relatively simple to code
into programs.
• Dynamic interactions: The timing of interactions, or the set of interacting tasks, cannot be determined a-priori. These interactions are harder to code, especially using message passing APIs.
107
Task Interactions
• Regular interactions: There is a
definite pattern (in the graph sense)
to the interactions. These patterns
can be exploited for efficient
implementation.
• Irregular interactions: When
interactions are irregular, they lack
well-defined topologies.
108
Static Task Interaction Patterns
• Regular patterns are easier to code.
• Both producer and consumer are aware of when communication is required.
• Explicit and simple code.
110
Example - Hotplate
• Use Domain Decomposition
– domain is the hotplate
– split the hotplate up among tasks
• row-wise or block-wise
Consider the communication costs:
– Row-wise: 2 neighbors
– Block-wise: 4 neighbors
Tasks don't know when to receive a message, so they must poll periodically.
112
Static vs Dynamic Task Interactions

Static interactions:
– For each task, the interactions happen at predetermined times.
– The task-interaction graph, and the stage at which each interaction happens, are known in advance.
– Can be programmed easily in the MPI paradigm.
– Easy to code in the shared address space model.

Dynamic interactions:
– The timing of an interaction is not known prior to execution of the program.
– The stage at which an interaction is needed is decided dynamically.
– Uncertainty in the interactions makes it hard for both sender and receiver to participate in an interaction at the same time in MPI.
– Easy to code in the shared address space model.
114
Regular vs Irregular Task Interactions

Regular interactions:
– An interaction pattern is considered regular if it has some definite pattern of interaction.
– Easy to handle.
– Example: in the image dithering problem, the color of each pixel in an image is determined as the weighted average of its original value and the values of its neighboring pixels. The image is decomposed into square regions, and each region is assigned to a task that carries out the dither operation independently.

Irregular interactions:
– No regular interaction pattern exists.
– Irregular and dynamic patterns are difficult to handle, especially in the MPI model.
– Example: in sparse matrix-vector multiplication, a task cannot know in advance which entries of the vector it requires, because of the irregular structure of its chunk of the work.
116
Read-only vs Read-write Task Interactions

Read-only interactions:
– Tasks require only read access to the data shared among many concurrent tasks.
– Example: in the decomposition for parallel matrix multiplication, the tasks only need to read the shared input matrices A and B.

Read-write interactions:
– Multiple tasks need to read and write some shared data.
– Example: in the 15-puzzle problem, the priority queue constitutes shared data, and tasks need both read and write access to it.
118
One-way vs Two-way Task Interactions

One-way interactions:
– One task pushes data to another.
– Cannot be programmed directly in MPI, but easy to handle in the shared address space model.

Two-way interactions:
– Both tasks push data to each other.
– Suitable for MPI, and also easy to handle in the shared address space model.
119
Questions based on Task Interactions
• Explain characteristics of tasks
• Write short note on task generation
• Differentiate between static and dynamic task
generation
• Discuss the impact of task size on task generation
• Compare between static interaction and dynamic task
interactions
• Compare between regular and irregular task interaction
• Compare between read only and read write task
interaction
• Compare between one way and two way task
interaction
120
Mapping Techniques for Load
Balancing
1
Mapping
• Once a problem has been
decomposed into concurrent tasks,
the tasks must be mapped to
processes.
–Mapping and Decomposition are
often interrelated steps
• Mapping must minimize overhead
–Inter-process communication and
–Time for which the processes are
idle 2
Mapping
Main
Task
Decomposition
Sub-Tasks 3
Mapping
Sub-Tasks
Mapping
processes 4
Mapping
Mapping
7
Mapping
Main
Task
Mapping to only
one process
processes 8
Mapping
Mapping to only
one process
processes 9
Mapping
idle idle
processes processes
no inter-process communication
10
Mapping
• Due to load imbalance, some processes finish their work early.
• Based on the constraints in the task-dependency graph, some processes may have to wait for other processes to finish their work.
11
Mapping
Main
Task
fine grain 12
Mapping
Mapping
processes 13
Mapping
Mapping
processes 14
Mapping
processes
completes
execution early 15
Mapping
processes
completes
execution late 16
Mapping
• To reduce the overhead caused by interaction, one way is to assign tasks that need to interact to the same process.
• However, this leads to an imbalance of workload among processes: heavily loaded processes stay busy while lightly loaded processes become idle.
17
Mapping
• A good mapping scheme must balance the computations and the interactions among processes.
• If synchronization among the interacting tasks is improper, the waiting time for sending and receiving data among processes will increase.
18
Mapping Techniques for Minimum Idling
Figure: two mappings of twelve tasks onto four processes P1–P4. (a) P1: 1, 5, 9; P2: 2, 6, 10; P3: 3, 7, 11; P4: 4, 8, 12 – all processes finish by t = 3. (b) P1: 1, 2, 3; P2: 4, 5, 6; P3: 7, 8, 9; P4: 10, 11, 12 – because of dependencies some processes idle and the work finishes only at t = 6, so mapping (a) minimizes idling.
19
Mapping
• There are two types of mapping techniques:
– static mapping
– dynamic mapping
20
Mapping
• Static
–tasks mapped to processes a-priori.
–Tasks are distributed among
available processes prior to
execution of algorithms
–need a good estimate of the task
size, data size, and inter-task
interactions.
21
Mapping
–often based on data or task graph
partitioning
–algorithm with static mapping are
easy to design
–since everything is known apriory,
static mapping schemes are
suitable for both shared address
space and message passing
programming models.
22
Mapping
• Dynamic
– Tasks are mapped to processes at runtime.
– Task generation and mapping are done dynamically.
– Task sizes are not known a priori, and processing times may be indeterminate.
23
Mapping
24
Schemes for Static Mapping
• Hybrid mappings.
25
Mapping – Data Partitioning
Block-wise distribution 27
Mapping – Data Partitioning
Block Distribution
• In block distributions, uniform contiguous portions of the array are distributed to different processes.
• E.g., consider a d-dimensional array in which each process receives a contiguous block of the array along a subset of the array dimensions.
29
Block Distribution
32
Block Distribution
40
Cyclic and Block Cyclic Distribution
• Cyclic distributions often “spread the load”
41
Cyclic and Block Cyclic Distributions
Cyclic Distribution:
When the computational load associated with different parts of the array is not identical, a cyclic distribution is used to spread the load evenly over the processes.
42
Cyclic and Block Cyclic Distributions
43
Cyclic and Block Cyclic Distributions
44
Cyclic Distribution
• Ex, m = 23 elements and P = 3 processes (0 to 2)
m 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
P 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1
i 0 0 0 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7
45
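A small C sketch of the index arithmetic behind this table (the block size in the block-cyclic case is an illustrative parameter): element j of an m-element array is owned by process j mod P with local index j / P.

/* Cyclic distribution: global index j -> (owner process, local index). */
void cyclic_map(int j, int P, int *owner, int *local)
{
    *owner = j % P;        /* e.g. j = 4, P = 3 -> process 1 (matches the table above) */
    *local = j / P;        /*                    -> local index i = 1                  */
}

/* Block-cyclic distribution with block size b: whole blocks are dealt out cyclically. */
void block_cyclic_map(int j, int P, int b, int *owner, int *local)
{
    int block = j / b;                    /* which block element j lies in        */
    *owner = block % P;                   /* blocks are assigned round-robin      */
    *local = (block / P) * b + j % b;     /* offset inside the owner's local data */
}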
Cyclic Distribution
• 1-D cyclic distribution of a 10-element array A on 4 processes (0 to 3):
A[0] A[1] A[2] A[3] A[4] A[5] A[6] A[7] A[8] A[9]
 P0   P1   P2   P3   P0   P1   P2   P3   P0   P1
47
Block Cyclic Distributions
48
Block Cyclic Distributions
Figure: the kth iteration of Gaussian elimination (LU factorization). The rows and columns above and to the left of k form the inactive part; in the active part, row k is divided by the pivot (A[k,j] := A[k,j] / A[k,k]) and the remaining entries are updated (A[i,j] := A[i,j] - A[i,k] x A[k,j]). Because the active part shrinks as k grows, a plain block distribution would leave some processes idle, which motivates block-cyclic distributions.
50
Block-Cyclic Distribution: Examples
Figure: one- and two-dimensional block-cyclic distributions of a matrix among processes; blocks are dealt out to the processes in round-robin order.
51
Block-Cyclic Distribution
• A cyclic distribution is a special case in which block size is one.
Figure: (a) 1-D and (b) 2-D block-cyclic distribution of a two-dimensional array over processes P0–P3.
52
Graph Partitioning Based Data
Decomposition
• The array-based distribution schemes
that we described so far are quite
effective in balancing the
computations and minimizing the
interactions for a wide range of
algorithms that use dense matrices
and have structured and regular
interaction patterns.
53
Graph Partitioning Based Data
Decomposition
• However, there are many algorithms
that operate on sparse data structures
and for which the pattern of interaction
among data elements is data
dependent and highly irregular.
54
Graph Partitioning Based Data
Decomposition
• In these computations, the physical
domain is discretized and represented
by a mesh of elements.
55
Graph Partitioning Based Data
Decomposition
• The computation at a mesh point
usually requires data corresponding to
that mesh point and to the points that
are adjacent to it in the mesh.
56
Graph Partitioning Based Data
Decomposition
Figure 5 A mesh used to model Lake
Superior.
57
Graph Partitioning Based Data
Decomposition
• The simulation of a physical phenomenon, such as the dispersion of a water contaminant in the lake, would now involve computing the level of contamination at each vertex of this mesh at various intervals of time.
58
Graph Partitioning Based Data
Decomposition
• Since, in general, the amount of
computation at each point is the same,
the load can be easily balanced by
simply assigning the same number of
mesh points to each process.
• However, if a distribution of the mesh
points to processes does not strive to
keep nearby mesh points together, then
it may lead to high interaction
overheads due to excessive data
sharing. 59
Graph Partitioning Based Data
Decomposition
• For example, if each process receives
a random set of points as illustrated in
Figure 6, then each process will need
to access a large set of points
belonging to other processes to
complete computations for its
assigned portion of the mesh.
60
Partitioning the Graph of Lake
Superior
Random Partitioning
63
Graph Partitioning Based Data
Decomposition
• A partitioning that keeps nearby mesh points together avoids this: each process is assigned a contiguous region of the mesh such that the total number of mesh points that need to be accessed across partition boundaries is minimized.
67
Task Partitioning: Mapping a Sparse Graph
Figure: the sparse graph for computing a sparse matrix-vector product y = A·b and its mapping onto three processes (0, 1, 2); for each process, Ci lists the indices of the vector entries it must obtain from other processes. A mapping based on the graph structure reduces this interaction overhead in sparse matrix-vector multiplication.
68
Hierarchical Mappings
• Sometimes a single mapping technique is inadequate.
Figure: a hierarchical mapping in which the task-dependency graph is mapped at the top level and the work of each high-level task is further partitioned among processes P0–P7.
70
Questions on the topic
71
Schemes for Dynamic Mapping
• Dynamic mapping is sometimes also referred to as
dynamic load balancing, since load balancing is
the primary motivation for dynamic mapping.
• Dynamic mapping is used:
– where static mapping would cause a highly imbalanced distribution of work
– where the task-dependency graph itself is dynamic in nature
72
Dynamic Mapping with Centralized Schemes
Processes are managed in a master-slave fashion.
When a process runs out of work, it requests more work from the master.
When the number of processes increases, the master may become a bottleneck.
To overcome this, a process may pick up a number of tasks (a chunk) at one time. This is called chunk scheduling.
73
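A minimal shared-memory sketch of chunk scheduling (OpenMP, the chunk size and the do_task body are illustrative assumptions): instead of asking the master for one task at a time, an idle worker grabs a whole chunk of task indices from a shared counter.

#include <omp.h>
#include <stdio.h>

#define NTASKS 1000
#define CHUNK  16                         /* illustrative chunk size */

static void do_task(int t) {              /* stand-in for the real task body */
    printf("task %d done by thread %d\n", t, omp_get_thread_num());
}

void chunk_scheduled_run(void)
{
    int next = 0;                         /* shared counter of unassigned tasks */
    #pragma omp parallel
    {
        for (;;) {
            int start;
            #pragma omp atomic capture
            { start = next; next += CHUNK; }          /* grab a whole chunk atomically */
            if (start >= NTASKS) break;               /* no work left                  */
            int end = (start + CHUNK < NTASKS) ? start + CHUNK : NTASKS;
            for (int t = start; t < end; t++)
                do_task(t);
        }
    }
}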
Dynamic Mapping with Centralized Schemes
74
Distributed Dynamic Mapping
• Each process can send or receive work
from other processes.
75
Distributed Dynamic Mapping
• There are four critical questions:
o how are sending and receiving processes paired together?
o who initiates the work transfer?
o how much work is transferred?
o when is a transfer triggered?
76
Methods for Containing Interaction Overhead
77
Methods for Containing Interaction Overhead
78
Minimizing Interaction Overheads
•Maximize data locality: Where possible,
reuse intermediate data. Restructure
computation so that data can be reused
in smaller time windows.
81
Parallel Algorithm Models
82
Basic Communication Operations
V.B.More
MET’s IOE, BKC, Nashik
Thanks to Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar for providing slides
1
Topic Overview
2
Basic Communication Operations:
Introduction
• Computations and communication are
two important factors of any parallel
algorithm.
4
Basic Communication Operations:
Introduction
• Group communication operations are
built using point-to-point messaging
primitives.
6
One-to-All Broadcast and All-to-One
Reduction
• A processor has a piece of data (of size m)
it needs to send to every other processor.
• The dual of one-to-all broadcast is all-to-
one reduction.
• In all-to-one reduction, each processor has
m units of data. These data items must
be combined piece-wise (using some
associative operator, such as addition or
min), and the result made available at a
target processor.
7
One-to-All Broadcast and All-to-One
Reduction
8
One-to-All Broadcast and All-to-One
Reduction
One-to-all broadcast
• One-to-all broadcast is the operation in which a single processor sends identical data to all other processors.
• Most parallel algorithms often need this operation.
• Consider data of size m that is to be sent to all the processors.
• Initially only the source processor has the data.
• After completion of the algorithm, there will be a copy of the initial data with each processor.
9
One-to-All Broadcast and All-to-One
Reduction
All-to-One Reduction
• All-to-One reduction is the operation in
which data from all processors are
combined at a single destination
processor.
• Various operations like sum, product, max,
min, avg of numbers can be performed by
all-to-one reduction operation.
10
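A hedged MPI illustration of these two collectives (the values are illustrative): MPI_Bcast performs the one-to-all broadcast and MPI_Reduce performs the all-to-one reduction with an associative operator such as MPI_SUM or MPI_MIN.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, data = 0, sum = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) data = 42;                         /* only the source has the data initially */
    MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* one-to-all broadcast from processor 0  */

    int my_value = rank + 1;                          /* each processor contributes one item    */
    MPI_Reduce(&my_value, &sum, 1, MPI_INT, MPI_SUM,  /* all-to-one reduction (sum) at proc 0   */
               0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("broadcast value %d, reduced sum %d\n", data, sum);

    MPI_Finalize();
    return 0;
}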
One-to-All Broadcast and All-to-One
Reduction
All-to-One Reduction
11
One-to-All Broadcast and All-to-One
Reduction
12
One-to-All Broadcast and All-to-One
Reduction on Rings
• Simplest way is to send p − 1 messages
from the source to the other p − 1
processors – this is not very efficient.
• Use recursive doubling: source sends a
message to a selected processor. We
now have two independent problems
defined over halves of machines.
Figure: one-to-all broadcast on an eight-node ring (nodes 0-7) by recursive doubling; the edge labels give the time step in which each message is sent, so all nodes have the data after three steps.
17
Broadcast and Reduction: Matrix-Vector
Multiplication Example
Figure: matrix-vector multiplication on a grid of processes P0-P11 using a one-to-all broadcast of the input vector followed by an all-to-one reduction of the partial results.
20
Broadcast and Reduction on a Mesh: Example
Figure: one-to-all broadcast on a 16-node mesh with source node 0. The edge labels give the time step: the data is first sent along the source row (steps 1-2) and then broadcast down every column (steps 3-4).
The reduction process on a linear array can be carried out on two- and three-dimensional meshes as well, by reversing the direction and the order of the messages.
24
Broadcast and Reduction on a Hypercube
25
Broadcast and Reduction on a Hypercube
Figure: one-to-all broadcast on a three-dimensional (8-node) hypercube with source node 0 (000). In step 1, node 0 sends to node 4 (100); in step 2, nodes 0 and 4 send to nodes 2 (010) and 6 (110); in step 3, nodes 0, 2, 4 and 6 send to nodes 1 (001), 3 (011), 5 (101) and 7 (111). The binary representations of node labels are shown in parentheses.
26
Broadcast and Reduction on a Hypercube
27
Broadcast and Reduction on a Hypercube:
Example
28
Broadcast and Reduction on a Balanced
Binary Tree
• Consider a binary tree in which processors
are (logically) at the leaves and internal
nodes are routing nodes i.e. switching units.
• The communicating nodes have the same
labels as in the hypercube
• The communication pattern will be same as
that of hypercube algorithm.
• There will not be any congestion on any of
the communication link at any time.
29
Broadcast and Reduction on a Balanced
Binary Tree
Figure: one-to-all broadcast on an eight-processor balanced binary tree; processors 0-7 are at the leaves, the internal nodes are switches, and the edge labels give the communication step, which matches the hypercube algorithm.
40
All-to-All Broadcast and Reduction
42
All-to-All Broadcast and Reduction
43
All-to-All Broadcast and Reduction on a Ring
• All-to-all Broadcast:
• Simplest approach: perform p one-to-all broadcasts. This is not the most efficient way, though, since performed one after another it takes p times as long as a single broadcast.
• Communication links can be used more efficiently by performing all p one-to-all broadcasts simultaneously.
• All messages traversing the same path at the same time are then concatenated into a single message.
• The algorithm terminates in p − 1 steps.
44
All-to-All Broadcast and Reduction on a Ring
Figure: in each of the seven communication steps, every node forwards the most recently received message to its successor and appends it to its collection; after the seventh step every node holds all eight messages (0, 1, ..., 7).
46
All-to-all broadcast on an eight-node ring.
All-to-All Broadcast and Reduction on a Ring
In these figures an edge label of the form n (m) means that message m is transferred in the nth time step.
49
All-to-all broadcast on an eight-node ring.
All-to-All Broadcast and Reduction on a Ring
• Detailed Algorithm
• At every node, my_msg contains initial message
to be broadcast.
• At the end of the algorithm, all p messages are
collected at each node.
50
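As a sketch of that algorithm, here is a hedged C/MPI version (MPI and single-integer messages are illustrative assumptions; the slides use generic send/receive primitives):

#include <mpi.h>

/* All-to-all broadcast on a logical ring: after p - 1 steps every
   process holds all p messages in result[0..p-1] (one int each here). */
void all_to_all_bc_ring(int my_msg, int result[], MPI_Comm comm)
{
    int p, id;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &id);
    int left = (id - 1 + p) % p, right = (id + 1) % p;

    result[id] = my_msg;
    int msg = my_msg;
    for (int i = 1; i <= p - 1; i++) {
        /* forward the most recently received message and pick up a new one */
        MPI_Sendrecv_replace(&msg, 1, MPI_INT, right, 0, left, 0,
                             comm, MPI_STATUS_IGNORE);
        result[(id - i + p) % p] = msg;   /* that message originated i hops upstream */
    }
}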
All-to-All Broadcast and Reduction on a Ring
52
All-to-all Broadcast on a Mesh
Figure: all-to-all broadcast on a 3x3 mesh. (a) Initial data distribution: node i holds message (i). (b) After the row-wise phase, each node in a row holds that row's messages, e.g. (6,7,8) across the top row. (c) Final result: every node holds (0,1,2,3,4,5,6,7,8).
53
All-to-all Broadcast on a Mesh
• The first (row-wise) phase is an all-to-all broadcast within each row; the second (column-wise) phase repeats the same procedure within each column using the accumulated row data.
• After completion of the second phase, each node has all p pieces of m-word data, i.e., every node holds the messages (0,1,2,3,4,5,6,7,8).
57
All-to-all Broadcast on a Mesh
1. procedure ALL_TO_ALL_BC_MESH(my_id, my_msg, p, result)
2. begin
/* Communication along rows */
3. left := my_id − (my_id mod √p) + (my_id − 1)mod√p;
4. right := my_id − (my_id mod √p) + (my_id + 1) mod √p;
5. result := my_msg;
6. msg := result;
7. for i := 1 to √p − 1 do
8. send msg to right;
9. receive msg from left;
10. result := result ∪ msg;
11. endfor;
/* Communication along columns */
12. up := (my_id − √p) mod p;
13. down := (my_id + √p) mod p;
14. msg := result;
15. for i := 1 to √p − 1 do
16. send msg to down;
17. receive msg from up;
18. result := result ∪ msg;
19. endfor;
20. end ALL_TO_ALL_BC_MESH
59
All-to-all broadcast on a Hypercube
Figure: all-to-all broadcast on an eight-node hypercube. (a) Initially node i holds only its own message (i). (b) After the exchange along the first dimension, pairs of nodes hold (0,1), (2,3), (4,5) and (6,7). (c) After the second dimension, groups of four nodes hold (0,1,2,3) or (4,5,6,7). (d) After the third dimension, every node holds all messages (0,...,7).
60
All-to-all broadcast on a Hypercube
61
66
All-to-all Broadcast
67
All-to-all reduction on a Hypercube
68
All-to-all Reduction
69
Cost Analysis
• On a hypercube, we have:
T = Σ_{i=1..log p} (t_s + 2^(i−1) t_w m) = t_s log p + t_w m (p − 1)
70
Cost Analysis of all-to-all broadcast and all-to-all reduction operation
72
Cost Analysis of all-to-all broadcast and all-to-all reduction operation
T
i 1
(ts 2 i 1 t w m ) t s log p tw m( p 1)
74
All-to-all broadcast: Notes
75
Figure: an eight-node hypercube on which multiple messages contend for a single channel; the edge labels show several messages mapped onto the same link in the same step, illustrating why a congestion-free communication schedule is needed.
77
All-Reduce and Prefix-Sum Operations
78
All-Reduce and Prefix-Sum Operations
79
All-Reduce and Prefix-Sum Operations
80
The Prefix-Sum Operation
81
The Prefix-Sum Operation
s0 = n0
s1 = n0 + n1
s2 = n0 + n1 + n2
s3 = n0 + n1 + n2 + n3
s4 = n0 + n1 + n2 + n3 + n4
...
sk = n0 + n1 + n2 + ... + nk
82
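A small C sketch of the operation itself (serial, for illustration only): each s[k] is the sum of the first k+1 inputs, computed with the recurrence s[k] = s[k-1] + n[k].

/* Prefix sums (inclusive scan): s[k] = n[0] + n[1] + ... + n[k]. */
void prefix_sum(const int n[], int s[], int count)
{
    s[0] = n[0];
    for (int k = 1; k < count; k++)
        s[k] = s[k - 1] + n[k];   /* the recurrence behind the parallel scan */
}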
The Prefix-Sum Operation
T = (ts+twm) log p
83
The Prefix-Sum Operation
Figure: the prefix-sum operation on an eight-node hypercube. At each node, square brackets show the local prefix sum accumulated in the result buffer, and parentheses enclose the contents of the outgoing message buffer for the next step. The panels show the initial values, the distributions before the second and third steps, and the final distribution of prefix sums [0], [0+1], ..., [0+1+...+7].
89
The Prefix-Sum Operation or Scan Operation
91
The Prefix-Sum Operation or Scan Operation
92
The Prefix-Sum Operation
• At each node, square brackets show the local prefix sum accumulated in the result buffer and parentheses enclose the contents of the outgoing message buffer for the next step.
• These contents are updated with every incoming message.
• Not all of the messages received by a node contribute to its final result; some of the messages it receives are redundant.
94
Questions on Prefix sum operations
95
Scatter and Gather
96
Scatter and Gather
97
Gather and Scatter Operations
99
Example of the Scatter Operation
Figure: scatter on an eight-node hypercube. Initially node 0 holds all eight messages (0,1,2,3,4,5,6,7). In the first communication step, node 0 transfers half of the messages (4,5,6,7) to one of its neighbours (node 4). In the next step, nodes 0 and 4 each pass on half of what they hold to nodes 2 and 6; in the final step each node forwards a single message to its remaining neighbour, so that node i ends up with message (i). This process involves log p communication steps, one for each dimension of the hypercube.
• There are log p steps, in each step, the machine size halves
and the data size halves.
• We have the time for this operation to be:
T = ts log p + twm(p − 1).
• This time is same for a linear array as well as a 2-D mesh.
• In scatter operation, at least m(p-1) data must be transmitted
out of the source node,
• and in gather operation at least m(p-1) data must be received
by the destination node.
• Therefore, twm(p-1) time, is the lower bound on the
communication in scatter and gather operations.
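A hedged MPI illustration of the two operations (the block size and buffer names are assumptions, not from the slides): MPI_Scatter sends a distinct m-word block from the source to every process, and MPI_Gather is the dual.

#include <mpi.h>
#include <stdlib.h>

#define M 4                               /* illustrative: m words per process */

int main(int argc, char **argv)
{
    int p, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int *all = NULL, mine[M];
    if (rank == 0) {                      /* only the source owns all p*m words initially */
        all = malloc(p * M * sizeof(int));
        for (int i = 0; i < p * M; i++) all[i] = i;
    }

    /* Scatter: node 0 sends a distinct M-word block to every process. */
    MPI_Scatter(all, M, MPI_INT, mine, M, MPI_INT, 0, MPI_COMM_WORLD);

    for (int i = 0; i < M; i++) mine[i] *= 2;      /* local work on the received block */

    /* Gather: the dual operation collects one block from every process at node 0. */
    MPI_Gather(mine, M, MPI_INT, all, M, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) free(all);
    MPI_Finalize();
    return 0;
}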
Topic Questions
• Explain Scatter and Gather operations with example.
109
All-to-All Personalized Communication
110
All-to-All Personalized Communication
Figure: all-to-all personalized communication among nodes 0, 1, ..., p−1: every node starts with a distinct message M for every other node, and the operation delivers to each node the p messages destined for it (equivalent to a transpose of the message matrix).
111
All-to-All Personalized Communication: Example
112
All-to-All Personalized Communication: Example
P0 P 0 = [0,0],[1,0],[2,0],[3,0]
P1 P 1 = [0,1],[1,1],[2,1],[3,1] Result of
n Transpose
of matrix
P2
P 3 = [0,3],[1,3],[2,3],[3,3]
P3
114
All-to-All Personalized Communication on a Ring
115
All-to-All Personalized Communication on a Ring
116
All-to-All Personalized Communication on a Ring
All-to-all personalized
is the label of the node
communication on a six-
that originally owned the
node ring. The label of
message, and y is the
each message is of the
label of the node that is
form {x, y}, where x
the final
destination of the message. The label ({x1 , y1 }, {x2 , y2 }, . . . , {xn, yn}) indicates that a
message is formed by concatenating n individual messages.
119
All-to-All Personalized Communication on a Ring
Communication Step 1
120
All-to-All Personalized Communication on a Ring
Communication Step 2
121
All-to-All Personalized Communication on a Ring
Communication Step 3
122
All-to-All Personalized Communication on a Ring
Communication Step 4
123
All-to-All Personalized Communication on a Ring
Communication Step 5
124
All-to-All Personalized Communication on a Ring: Cost
T = Σ_{i=1..p−1} (t_s + t_w m (p − i))
  = t_s (p − 1) + Σ_{i=1..p−1} i t_w m
  = (t_s + t_w m p / 2) (p − 1)
126
All-to-All Personalized Communication on a Mesh
127
All-to-All Personalized Communication on a Mesh
128
All-to-All Personalized Communication on a Mesh
({0,4},{1,4}, ({0,5},{1,5},
{2,4},{3,4}, {2,5},{3,5},
{4,4},{5,4}, {4,5},{5,5},
{6,4},{7,4}, {6,5},{7,5},
({0,3},{1,3},{2,3}, {8,4}) {8,5})
3 4 5
{3,3},{4,3},{5,3},
{6,3},{7,3},{8,3})
({0,1},{1,1}, ({0,2},{1,2},
{2,1},{3,1}, {2,2},{3,2},
{4,1},{5,1}, {4,2},{5,2},
({0,0},{1,0},{2,0}, {6,1},{7,1}, {6,2},{7,2},
{3,0},{4,0},{5,0}, 0 {8,1}) 1 {8,2}) 2
{6,0},{7,0},{8,0})
133
(c) Final Data distribution after second phase
All-to-All Personalized Communication on a Mesh: Cost
134
All-to-All Personalized Communication on a Mesh: Cost
Figure: intermediate message distributions during all-to-all personalized communication on an eight-node hypercube; the panels show, for each node, the set of {source, destination} message pairs it holds before successive communication steps (the last panel is the distribution before the third step).
139
All-to-All Personalized Communication on a
Hypercube
Figure: all-to-all personalized communication on an eight-node hypercube. Initially node i holds the messages {i,0} ... {i,7}; in each of the log p = 3 steps, every node exchanges with its neighbour along one dimension the half of its data destined for the other side of that dimension; finally node j holds the messages {0,j} ... {7,j}.
142
All-to-All Personalized Communication on a
Hypercube: Optimal Algorithm
143
All-to-All Personalized Communication on a
Hypercube: Optimal Algorithm
144
All-to-All Personalized Communication on a
Hypercube: Optimal Algorithm
Figure: the seven steps (a)-(g) of the optimal algorithm for all-to-all personalized communication on an eight-node hypercube; in the jth step every node i exchanges one message directly with node i XOR j, and E-cube routing keeps the paths congestion-free.
145
All-to-All Personalized Communication on a
Hypercube: Optimal Algorithm
147
Circular Shift
• Circular shift can be applied in some matrix computations and in string and image pattern matching.
• It is a member of a broader class of global communication operations known as permutations. In a permutation, every node sends a message of m words to a unique node.
• In a circular q-shift, node i sends its data to node (i + q) mod p in a group of p nodes, where 0 < q < p.
148
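A hedged C/MPI sketch of the index arithmetic (MPI and one-integer messages are illustrative assumptions): node i sends to (i + q) mod p and receives from (i − q) mod p.

#include <mpi.h>

/* Circular q-shift of one integer among the p processes in comm:
   process i sends its value to (i + q) mod p and receives the value
   originally held by (i - q) mod p. */
int circular_shift(int value, int q, MPI_Comm comm)
{
    int p, i;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &i);
    int dest = (i + q) % p;
    int src  = ((i - q) % p + p) % p;
    MPI_Sendrecv_replace(&value, 1, MPI_INT, dest, 0, src, 0,
                         comm, MPI_STATUS_IGNORE);
    return value;
}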
Circular Shift on a Mesh
150
Circular Shift on a Mesh
152
Circular Shift on a Mesh
159
Circular Shift on a Hypercube
• The time for this is upper bounded by:
T = (ts + twm)(2 log p − 1).
• If E-cube routing is used, this time can be
reduced to
T = ts + twm.
160
Circular Shift on a Hypercube
Figure: the 4-shift is performed in two communication steps; the panels show the data held by each hypercube node after the first and second steps.
(a) The first phase (a 4-shift)
The mapping of an eight-node linear array onto a three-dimensional hypercube to
perform a circular 5-shift as a combination of a 4-shift and a 1-shift.
161
Circular Shift on a Hypercube
Figure: the data held by each node during the 1-shift phase.
(b) The second phase (a 1-shift)
The mapping of an eight-node linear array onto a three-dimensional hypercube to
perform a circular 5-shift as a combination of a 4-shift and a 1-shift.
162
Circular Shift on a Hypercube
Figure: after the combined 4-shift and 1-shift, node i holds the data that was originally on node (i − 5) mod 8.
(c) Final data distribution after the 5-shift
The mapping of an eight-node linear array onto a three-dimensional hypercube to
perform a circular 5-shift as a combination of a 4-shift and a 1-shift.
163
Circular Shift on a Hypercube
Figure: the communication steps of circular q-shifts on an eight-node hypercube for q = 1 through 7 (panels (a)-(g)).
• Splitting and routing messages into parts: If the message can be split into p parts, a one-to-all broadcast can be implemented as a scatter operation followed by an all-to-all broadcast operation. The time for this is:
T = 2 × (t_s log p + t_w (p − 1) m / p)
  ≈ 2 × (t_s log p + t_w m).    (10)
• All-to-one reduction can be performed by an all-to-all reduction (the dual of all-to-all broadcast) followed by a gather operation (the dual of scatter).
169
Improving Performance of Operations
170
Improving Performance of Operations
171
University Questions on Unit 3
• August 2018 (Insem)
• 1. Explain Broadcast and Reduction example for
multiplying matrix with a vector.(6)
• 2. Explain the concept of scatter and gather (4)
• 3. Compare the one-to-all broadcast operation for Ring,
Mesh and Hypercube topologies (6)
• 4. Explain the prefix-sum operation for an eight-node
hypercube (4)
172
University Questions on Unit 3
• Nov-Dec 2019 (Endsem)
• 1 Explain term of all-to-all broadcast on linear array,
mesh & Hypercube
• topologies. [8]
• 2 Write short note on circular shift on a mesh. [6]
174
Parallel Algorithm Models
Prof V B More
MET’s IOE, BKC, Nashik
Parallel Algorithm Models
These models specify how the data are partitioned and how the partitioned data are processed.
A model provides structure for a parallel algorithm based on two techniques:
• the selection of a partitioning and mapping technique;
• the appropriate use of a strategy to minimize interaction.
Prof V B More
MET-BKC IOE
1
The Age of Parallel Processing
2
The Age of Parallel Processing
• From dual-core, low-end notebook machines to 8- and 16-core workstation computers, today's machines offer capabilities that were once found only in supercomputers or mainframes.
3
The Age of Parallel Processing
• Electronic devices such as mobile phones and portable music players now include parallel computing capabilities to enhance their performance.
5
The Age of Parallel Processing
6
The Age of Parallel Processing
Evolution of the CPUs
7
The Age of Parallel Processing
Evolution of the CPUs
8
The Age of Parallel Processing
Evolution of the CPUs
9
The Rise of GPU Computing
• A graphics processing unit (GPU) is a
specialized electronic circuit designed
to rapidly manipulate and alter memory
to accelerate the creation of images in
a frame buffer used for outputting to a
display device.
10
The Rise of GPU Computing
• GPUs are used in embedded systems, mobile phones, personal computers, workstations, research labs, and game consoles.
11
The Rise of GPU Computing
• Their highly parallel structure makes
them more efficient than general-
purpose CPUs for algorithms where the
processing of large blocks of data is
done in parallel.
12
The Rise of GPU Computing
• In a personal computer, a GPU can be
present on a video card, or it can be
embedded on the motherboard or in
certain CPUs - on the CPU die.
13
The Rise of GPU Computing
• In comparison to the central
processor's traditional data processing
pipeline, performing general-purpose
computations on a graphics
processing unit (GPU) is a new concept
(GPGPU).
14
The Rise of GPU Computing
• In fact, GPU itself is relatively new
compared to the computing field at
large. However, the idea of computing
on graphics processors is not new.
15
A Brief History of GPUs, Early GPU
• We have already looked at how CPUs
evolved in both clock speeds and core
count.
16
A Brief History of GPUs, Early GPU
17
A Brief History of GPUs, Early GPU
18
A Brief History of GPUs, Early GPU
19
A Brief History of GPUs, Early GPU
20
A Brief History of GPUs, Early GPU
• By the mid-1990s, computer-based first-person games such as Doom, Duke Nukem 3D, and Quake came to market.
21
A Brief History of GPUs, Early GPU
22
A Brief History of GPUs, Early GPU
23
A Brief History of GPUs, Early GPU
24
A Brief History of GPUs, Early GPU
25
A Brief History of GPUs, Early GPU
26
NVIDIA GPU Development History
Basic Communication Operations
V.B.More
MET’s IOE, BKC, Nashik
Thanks to Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar for providing slides
1
Broadcast and Reduction on a Mesh
• Consider a 2D square mesh with √p rows and √p columns for the one-to-all broadcast operation.
• First, the source sends the data to the remaining √p − 1 nodes in its row using a one-to-all broadcast.
• In the second phase, the data is broadcast down each column, again by one-to-all broadcast.
• Thus, each node of the mesh ends up with a copy of the initial message.
20
Broadcast and Reduction on a Hypercube: Example
• In each subsequent step, communication proceeds along the next lower dimension.
• The source and destination nodes in the three communication steps of the algorithm are similar to the nodes in the broadcast algorithm on a linear array.
• A hypercube broadcast does not suffer from congestion.
28
Broadcast and Reduction on a Balanced Binary
Tree
•Consider a binary tree in which processors are
(logically) at the leaves and internal nodes are
routing nodes i.e. switching units.
•The communicating nodes have the same labels
as in the hypercube
•The communication pattern will be same as that
of hypercube algorithm.
•There will not be any congestion on any of the
communication link at any time.
29
Broadcast and Reduction on a Balanced Binary
Tree
2
2
3 3 3 3
0 1 2 3 4 5 6 7
30
Broadcast and Reduction on a Balanced
Binary Tree
3 3 3 3
0 1 2 3 4 5 6 7
3 3 3 3
0 1 2 3 4 5 6 7
40
All-to-All Broadcast and Reduction
All-to-All Broadcast and Reduction on a Ring
• All-to-all broadcast: every node simultaneously broadcasts its own m-word message to all other nodes.
• Simplest approach: perform p separate one-to-all broadcasts. This is not efficient, since it takes p times as long as a single one-to-all broadcast.
• The communication links can be used more efficiently by performing all p one-to-all broadcasts simultaneously.
• All messages traversing the same link at the same time are concatenated into a single message.
• The algorithm terminates in p − 1 steps.
All-to-All Broadcast and Reduction on a Ring
[Figure: all-to-all broadcast on an eight-node ring. A link label of the form n (m) means that message m traverses that link in the nth time step; the tuple shown at each node lists the messages it has accumulated so far, so after step p − 1 = 7 every node holds all eight messages.]
All-to-All Broadcast and Reduction on a Ring
• Detailed algorithm (a runnable sketch follows below).
• At every node, my_msg contains the initial message to be broadcast by that node.
• At the end of the algorithm, all p messages are collected at each node.
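A runnable Python sketch of this ring algorithm, simulated sequentially (not from the slides): the list result plays the role of each node's result buffer and msg its outgoing message buffer.

def ring_all_to_all_broadcast(my_msgs):
    p = len(my_msgs)
    result = [[m] for m in my_msgs]          # each node starts with its own message
    msg = [m for m in my_msgs]               # outgoing message buffer per node
    for _ in range(p - 1):                   # the algorithm terminates in p - 1 steps
        # every node sends msg to its right neighbour and receives from its left
        incoming = [msg[(i - 1) % p] for i in range(p)]
        for i in range(p):
            result[i].append(incoming[i])
            msg[i] = incoming[i]             # forward what was just received
    return result

if __name__ == "__main__":
    out = ring_all_to_all_broadcast(list(range(8)))
    print(out[0])   # node 0 ends up with all eight messages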
All-to-all Broadcast on a Mesh
• The all-to-all broadcast on a mesh is performed in two phases: first an all-to-all broadcast within each row (treating every row as a linear array/ring), then an all-to-all broadcast of the accumulated row results within each column.
[Figure: all-to-all broadcast on a 3 × 3 mesh: (a) initial data distribution, (b) data distribution after the row-wise broadcast, (c) final result of the all-to-all broadcast.]
• After completion of the second phase, each node holds all p pieces of m-word data, i.e. every node of the 3 × 3 mesh ends up with the messages (0, 1, 2, 3, 4, 5, 6, 7, 8).
All-to-all Broadcast on a Mesh
1. procedure ALL_TO_ALL_BC_MESH(my_id, my_msg, p, result)
2. begin
/* Communication along rows */
3. left := my_id − (my_id mod √p) + (my_id − 1)mod√p;
4. right := my_id − (my_id mod √p) + (my_id + 1) mod √p;
5. result := my_msg;
6. msg := result;
7. for i := 1 to √p − 1 do
8. send msg to right;
9. receive msg from left;
10. result := result ∪ msg;
11. endfor;
/* Communication along columns */
12. up := (my_id − √p) mod p;
13. down := (my_id + √p) mod p;
14. msg := result;
15. for i := 1 to √p − 1 do
16. send msg to down;
17. receive msg from up;
18. result := result ∪ msg;
19. endfor;
20. end ALL_TO_ALL_BC_MESH
All-to-all broadcast on a Hypercube
[Figure: all-to-all broadcast on an eight-node hypercube: (a) initial distribution of messages, (b) distribution before the second step, (c) distribution before the third step, (d) final distribution of messages, in which every node holds (0, . . . , 7). In each step, pairs of nodes exchange their accumulated data along one dimension and concatenate it (a sketch follows below).]
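A Python sketch (not from the slides) simulating the hypercube all-to-all broadcast: in step k every node exchanges its accumulated messages with its neighbour across dimension k and concatenates the two sets, as in panels (a)–(d) above.

def hypercube_all_to_all_broadcast(d):
    p = 1 << d
    result = [[i] for i in range(p)]             # node i starts with message i
    for k in range(d):                           # one step per dimension
        partner = [i ^ (1 << k) for i in range(p)]
        snapshot = [list(r) for r in result]     # messages before this exchange
        for i in range(p):
            result[i] = snapshot[i] + snapshot[partner[i]]
    return result

if __name__ == "__main__":
    out = hypercube_all_to_all_broadcast(3)
    print(sorted(out[5]))    # node 5 holds (0, ..., 7) after log p = 3 steps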
All-to-all broadcast: Notes
• Contention for a single channel by multiple messages.
[Figure: contention for a single channel by multiple messages on an 8-node hypercube.]
All-Reduce and Prefix-Sum Operations
The Prefix-Sum Operation
Given p numbers n0, n1, . . . , np−1, one on each node, the goal is to compute all the partial sums
s0 = n0
s1 = n0 + n1
s2 = n0 + n1 + n2
s3 = n0 + n1 + n2 + n3
s4 = n0 + n1 + n2 + n3 + n4
...
sk = n0 + n1 + n2 + . . . + nk
so that node k ends up with sk.
The Prefix-Sum Operation
Cost on a p-node hypercube: T = (ts + twm) log p.
The Prefix-Sum Operation
[Figure: computing prefix sums on an eight-node hypercube: (a) initial distribution of values, (b) distribution of sums before the second step, (c) distribution of sums before the third step, (d) final distribution of prefix sums.]
The Prefix-Sum Operation or Scan Operation
• At each node, the square brackets show the local prefix sum accumulated in the result buffer, and the parentheses enclose the contents of the outgoing message buffer for the next step.
• These contents are updated with every incoming message.
• Not all messages received by a node contribute to its final result; some of the messages it receives are redundant.
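A runnable Python sketch of the hypercube prefix-sum algorithm just described (not from the slides): result models the square-bracket buffer and msg the parenthesised outgoing buffer, and only messages arriving from lower-labelled partners are added into result. p is assumed to be a power of two.

def hypercube_prefix_sum(values):
    p = len(values)
    d = p.bit_length() - 1                    # log2(p)
    result = list(values)                     # local prefix-sum result buffer
    msg = list(values)                        # outgoing message buffer
    for k in range(d):
        partner = [i ^ (1 << k) for i in range(p)]
        incoming = [msg[partner[i]] for i in range(p)]
        for i in range(p):
            msg[i] = msg[i] + incoming[i]     # message buffer always accumulates
            if partner[i] < i:                # only lower-labelled partners count
                result[i] = result[i] + incoming[i]
    return result

if __name__ == "__main__":
    print(hypercube_prefix_sum([0, 1, 2, 3, 4, 5, 6, 7]))
    # [0, 1, 3, 6, 10, 15, 21, 28]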
Scatter and Gather Operations
• In the scatter operation, a single source node sends a distinct message of size m to every other node; this is also known as one-to-all personalized communication.
• The gather operation is the dual of scatter: a single destination node collects a unique message from every other node.
Example of the Scatter Operation
• In the first communication step, node 0 transfers half of the messages to one of its neighbours (node 4).
• This process involves log p communication steps, one for each of the log p dimensions of the hypercube.
[Figure: scatter of eight messages (0, 1, . . . , 7) from node 0 on a three-dimensional hypercube; in every step each node that still holds several messages passes half of them along the next dimension, until each node i holds only message (i).]
• There are log p steps; in each step, the machine size halves and the data size halves.
• The time for this operation is:
T = ts log p + twm(p − 1).
• This time is the same on a linear array as well as on a 2-D mesh.
• In the scatter operation, at least m(p − 1) words of data must be transmitted out of the source node,
• and in the gather operation at least m(p − 1) words of data must be received by the destination node.
• Therefore, twm(p − 1) is a lower bound on the communication time of the scatter and gather operations.
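A minimal Python sketch of the scatter operation on a d-dimensional hypercube (sequentially simulated, node 0 assumed to be the source and p a power of two), mirroring the halving of the data in each of the log p steps.

def hypercube_scatter(messages):
    p = len(messages)
    d = p.bit_length() - 1                     # log2(p)
    buf = {0: list(messages)}                  # node 0 initially holds all p messages
    for k in reversed(range(d)):               # highest dimension first
        for node in list(buf):
            half = 1 << k
            keep, send = buf[node][:half], buf[node][half:]
            buf[node] = keep
            buf[node ^ (1 << k)] = send        # partner receives the second half
    return buf                                 # buf[i] == [messages[i]] for every i

if __name__ == "__main__":
    print(sorted(hypercube_scatter(list("abcdefgh")).items()))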
Topic Questions
• Explain Scatter and Gather operations with example.
All-to-All Personalized Communication
[Figure: all-to-all personalized communication among p nodes. Initially node i holds the messages M(0, i), M(1, i), . . . , M(p − 1, i), where M(x, y) is the message that node y must deliver to node x; after the operation, node i holds M(i, 0), M(i, 1), . . . , M(i, p − 1).]
All-to-All Personalized Communication: Example
[Figure: transpose of an n × n matrix (n = 4) distributed row-wise over processes P0–P3. After the all-to-all personalized communication, P0 holds [0,0], [1,0], [2,0], [3,0]; P1 holds [0,1], [1,1], [2,1], [3,1]; . . . ; P3 holds [0,3], [1,3], [2,3], [3,3].]
• Initially, processor Pi contains the elements of the matrix with indices [i, 0], [i, 1], . . . , [i, n − 1], i.e. row i.
• In the transpose AT, P0 must hold the elements [i, 0], P1 the elements [i, 1], and so on: element [i, j] moves from Pi to Pj.
• The figure above shows a 4 × 4 matrix mapped onto four processes using one-dimensional row-wise partitioning; the transpose is therefore an instance of all-to-all personalized communication (a sketch follows below).
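A small Python sketch (not from the slides) of the matrix transpose expressed as an all-to-all personalized exchange; outgoing[i][j] is the single element that process i addresses to process j.

def transpose_by_all_to_all(A):
    n = len(A)
    # Each process i prepares one personalized message per destination j:
    # the element A[i][j] that belongs to process j after the transpose.
    outgoing = [[A[i][j] for j in range(n)] for i in range(n)]
    # All-to-all personalized communication: process j receives, from every
    # process i, the message that i addressed to j.
    received = [[outgoing[i][j] for i in range(n)] for j in range(n)]
    return received          # process j now holds row j of the transpose

if __name__ == "__main__":
    A = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]
    print(transpose_by_all_to_all(A))   # rows of A's transpose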
All-to-All Personalized Communication on a Ring
All-to-all personalized communication on a six-node ring. The label of each message is of the form {x, y}, where x is the label of the node that originally owned the message and y is the label of the node that is its final destination. A label ({x1, y1}, {x2, y2}, . . . , {xn, yn}) indicates that a message is formed by concatenating n individual messages.
[Figure: the p − 1 = 5 communication steps of all-to-all personalized communication on a six-node ring; in step i, messages of size m(p − i) are transferred between neighbouring nodes.]
All-to-All Personalized Communication on a Ring: Cost
T = Σ (from i = 1 to p − 1) (ts + twm(p − i))
  = ts(p − 1) + twm Σ (from i = 1 to p − 1) i
  = (ts + twmp/2)(p − 1)
All-to-All Personalized Communication on a Mesh
[Figure: all-to-all personalized communication on a 3 × 3 mesh, final data distribution after the second (column-wise) phase: node j holds the messages {0, j}, {1, j}, . . . , {8, j}; for example, node 4 holds ({0,4}, {1,4}, . . . , {8,4}).]
All-to-All Personalized Communication on a
Hypercube
[Figure: all-to-all personalized communication on a three-dimensional hypercube: (a) initial distribution of messages, where node i holds ({i,0} . . . {i,7}); (b) distribution before the second step; (c) distribution before the third step; (d) final distribution of messages, where node j holds ({0,j} . . . {7,j}).]
All-to-All Personalized Communication on a
Hypercube: Optimal Algorithm
[Figure: the seven steps of all-to-all personalized communication on an eight-node hypercube using the optimal algorithm.]
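The slides show only the figure of the seven steps. One common way to realize this optimal pattern is a pairwise exchange in which, at step s, node i trades messages directly with node i XOR s; the Python sketch below (an assumption on my part, not taken from the slides) simulates that idea sequentially.

def optimal_all_to_all_personalized(d):
    p = 1 << d
    # pending[i][j] holds the message {i, j} that node i must deliver to node j
    pending = [{j: (i, j) for j in range(p) if j != i} for i in range(p)]
    recv = [{(i, i)} for i in range(p)]          # every node keeps its own message
    for step in range(1, p):                     # p - 1 pairwise exchange steps
        for i in range(p):
            partner = i ^ step                   # in step s, node i pairs with i XOR s
            recv[i].add(pending[partner].pop(i)) # receive {partner, i} directly
    return recv

if __name__ == "__main__":
    recv = optimal_all_to_all_personalized(3)
    print(sorted(recv[2]))   # node 2 ends with the messages {x, 2} from every node x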
Circular Shift
•Circular shift arises in some matrix computations and in string and image pattern matching.
•It is a member of a broader class of global communication operations known as permutations: in a permutation, every node sends a message of m words to a unique node.
•In a circular q-shift, node i sends its data to node (i + q) mod p, in a group of p nodes, where 0 < q < p (a sketch follows below).
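A tiny Python sketch of the circular q-shift as a data movement, simulated sequentially (not from the slides).

def circular_shift(values, q):
    p = len(values)
    shifted = [None] * p
    for i in range(p):
        shifted[(i + q) % p] = values[i]      # data of node i lands on node (i + q) mod p
    return shifted

if __name__ == "__main__":
    print(circular_shift([0, 1, 2, 3, 4, 5, 6, 7], 5))
    # node 5 now holds the data of node 0, node 6 that of node 1, and so on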
Circular Shift on a Mesh
Circular Shift on a Hypercube
• On a hypercube, a circular q-shift is decomposed according to the binary representation of q into a sequence of shifts by powers of two. For example, a 5-shift is performed as a 4-shift (2^2) followed by a 1-shift (2^0).
• Each power-of-two shift takes at most two communication steps; only a 1-shift takes a single step. For example, the first phase of the 5-shift (the 4-shift) consists of two steps and the second phase (the 1-shift) consists of one step.
• The total number of steps for any q in a p-node hypercube is at most 2 log p − 1.
Circular Shift on a Hypercube
• The time for this is upper bounded by:
T = (ts + twm)(2 log p − 1).
• If E-cube routing is used, this time can be
reduced to
T = ts + twm.
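A Python sketch of the decomposition described above, performing a q-shift as a sequence of power-of-two shifts driven by the binary representation of q; the helper rotate is illustrative, not from the slides.

def rotate(values, s):
    # helper: circular s-shift of a list (node i's data moves to node (i + s) mod p)
    p = len(values)
    return [values[(i - s) % p] for i in range(p)]

def q_shift_by_powers_of_two(values, q):
    result = list(values)
    k = 0
    while (1 << k) <= q:
        if q & (1 << k):                 # bit k of q is set: perform a 2^k-shift
            result = rotate(result, 1 << k)
        k += 1
    return result

if __name__ == "__main__":
    data = list(range(8))
    print(q_shift_by_powers_of_two(data, 5))   # same result as a direct 5-shift
    print(rotate(data, 5))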
Circular Shift on a Hypercube
[Figure: the mapping of an eight-node linear array onto a three-dimensional hypercube to perform a circular 5-shift as a combination of a 4-shift and a 1-shift: (a) the first phase (the 4-shift, carried out in two communication steps), (b) the second phase (the 1-shift), (c) the final data distribution after the 5-shift.]
Circular Shift on a Hypercube
[Figure: circular q-shifts on an eight-node hypercube for q = 1, 2, . . . , 7.]
Improving Performance of Operations
• Splitting and routing messages in parts: if a message of size m can be split into p parts, a one-to-all broadcast can be implemented as a scatter operation followed by an all-to-all broadcast operation. The time for this is:
T = 2 × (ts log p + tw(p − 1)m/p)
University Questions on Unit 3
• August 2018 (Insem)
• 1. Explain broadcast and reduction with the example of multiplying a matrix with a vector. (6)
• 2. Explain the concept of scatter and gather. (4)
• 3. Compare the one-to-all broadcast operation for Ring, Mesh and Hypercube topologies. (6)
• 4. Explain the prefix-sum operation for an eight-node hypercube. (4)
University Questions on Unit 3
• Nov-Dec 2019 (Endsem)
• 1. Explain all-to-all broadcast on linear array, mesh and hypercube topologies. [8]
• 2. Write a short note on circular shift on a mesh. [6]
University Questions on Unit 3
• May-June-2019 (Endsem)
• 1. Explain the circular shift operation on mesh and hypercube networks. [8]