
Principles of Parallel

Algorithm Design

Prof V B More
MET-BKC IOE

1
Algorithms and Concurrency
 Introduction to Parallel Algorithms
• Tasks and Decomposition
• Processes and Mapping
• Processes Versus Processors

 Decomposition / Partitioning
Techniques
• Recursive Decomposition
• Exploratory Decomposition
• Hybrid Decomposition
2
Algorithms and Concurrency

 Characteristics of Tasks and


Interactions
• Task Generation, Granularity, and
Context
• Characteristics of Task
Interactions.

3
Concurrency and Mapping
• Mapping Techniques for Load
Balancing
–Static and Dynamic Mapping

• Methods for Minimizing Interaction


Overheads
–Maximizing Data Locality
–Minimizing Contention and Hot-
Spots
4
Concurrency and Mapping
–Overlapping Communication and
Computations
–Replication vs. Communication
–Group Communications vs. Point-
to-Point Communication
• Parallel Algorithm Design Models
–Data-Parallel, Work-Pool, Task
Graph, Master-Slave, Pipeline, and
Hybrid Models
5
Decomposition, Tasks, and Dependency
Graphs
• The very first step in the design of a
parallel algorithm is to decompose the
problem into tasks that can be
executed concurrently

6
Decomposition, Tasks, and Dependency
Graphs
• Tasks can be decomposed into sub-
tasks in various ways. Decomposed
tasks may be of:
– the same size,
– different sizes, or
– intermediate sizes.

7
Decomposition, Tasks, and Dependency
Graphs
• A decomposition can be illustrated in
the form of a directed graph with
nodes corresponding to tasks and
edges indicating that the result of
one task is required for processing
the next. Such a graph is called a
task dependency graph.

8
Example Decomposition of task into nodes and edges

[Figure: a task-dependency graph. Tasks 1-4 (weight 10 each) are the decomposed tasks; their results feed Task 5 (weight 6) and Task 6 (weight 9), whose outputs feed Task 7 (weight 8), which produces the result of the main task.]

9
Example: Multiplying a Dense Matrix with a Vector
[Figure: an n x n matrix A multiplied by a vector b to produce y; the shaded row of A and all of b are the data accessed by Task 1.]
Computation of each element of the output vector y is independent of the other elements. Based on this, a dense matrix-vector product can be decomposed into n tasks.
10
Example: Multiplying a Dense Matrix with a Vector
[Figure: the same n-task decomposition, highlighting the row of A and the vector b accessed by Task 1.]
Findings: While tasks share data (vector b), they do not have any control dependencies, i.e., no task needs to wait for the (partial) completion of any other. All tasks are of the same size in terms of number of operations.
11
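A minimal C sketch of this decomposition (names are illustrative, not from the slides): each of the n tasks computes one element of y, reading one row of A and all of b but writing a disjoint output element, so the tasks have no control dependencies.

/* Task i of the n-task decomposition of y = A*b (A stored row-major). */
void task_i(int i, int n, const double *A, const double *b, double *y)
{
    double sum = 0.0;
    for (int j = 0; j < n; j++)
        sum += A[i * n + j] * b[j];   /* reads row i of A and all of b */
    y[i] = sum;                       /* writes only y[i]              */
}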
Example: Database Query Processing
Consider the execution of the query:
MODEL = ”CIVIC” AND YEAR = 2001 AND
(COLOR = ”GREEN” OR COLOR = “WHITE”)
on the following database:
ID# Model Year Color Dealer Price
4523 Civic 2002 Blue MN $18,000
3476 Corolla 1999 White IL $15,000
7623 Camry 2001 Green NY $21,000
9834 Prius 2001 Green CA $18,000
6734 Civic 2001 White OR $17,000
5342 Altima 2001 Green FL $19,000
3845 Maxima 2001 Blue NY $22,000
8354 Accord 2000 Green VT $18,000
4395 Civic 2001 Red CA $17,000
7352 Civic 2002 Red WA $18,000

12
Example: Database Query Processing

Consider the execution of the query:

MODEL = "CIVIC" AND YEAR = 2001 AND
(COLOR = "GREEN" OR COLOR = "WHITE")

13
Example: Database Query Processing
ID# Model Year Color Dealer Price
4523 Civic 2002 Blue MN $18,000
3476 Corolla 1999 White IL $15,000
7623 Camry 2001 Green NY $21,000
9834 Prius 2001 Green CA $18,000
6734 Civic 2001 White OR $17,000
5342 Altima 2001 Green FL $19,000
3845 Maxima 2001 Blue NY $22,000
8354 Accord 2000 Green VT $18,000
4395 Civic 2001 Red CA $17,000
7352 Civic 2002 Red WA $18,000
14
Example: Database Query Processing

• The execution of the query can be divided into different subtasks.
• Each task can produce an intermediate table of the tuples that satisfy a particular clause.

15
Example: Database Query Processing
[Figure: decomposing the given query into a number of tasks. Leaf tasks compute the intermediate tables for MODEL = "CIVIC", YEAR = 2001, COLOR = "WHITE" and COLOR = "GREEN"; these feed the tasks for MODEL = "CIVIC" AND YEAR = 2001 and for COLOR = "GREEN" OR COLOR = "WHITE", whose outputs are combined by the final task, which yields the single tuple (6734, Civic, 2001, White). Edges in this graph denote that the output of one task is needed to accomplish the next.]
16
Example: Database Query Processing
MODEL = "CIVIC"
[Intermediate table produced by the MODEL = "CIVIC" task:]
ID# Model
4523 Civic
6734 Civic
4395 Civic
7352 Civic
17
Example: Database Query Processing
ID# Model Year Color Dealer Price
4523 Civic 2002 Blue MN $18,000
3476 Corolla 1999 White IL $15,000
7623 Camry 2001 Green NY $21,000
9834 Prius 2001 Green CA $18,000
6734 Civic 2001 White OR $17,000
5342 Altima 2001 Green FL $19,000
3845 Maxima 2001 Blue NY $22,000
8354 Accord 2000 Green VT $18,000
4395 Civic 2001 Red CA $17,000
7352 Civic 2002 Red WA $18,000
18
Example: Database Query Processing
YEAR = 2001
[Intermediate table produced by the YEAR = 2001 task:]
ID# Year
7623 2001
9834 2001
6734 2001
5342 2001
3845 2001
4395 2001
19
Example: Database Query Processing
ID# Model Year Color Dealer Price
4523 Civic 2002 Blue MN $18,000
3476 Corolla 1999 White IL $15,000
7623 Camry 2001 Green NY $21,000
9834 Prius 2001 Green CA $18,000
6734 Civic 2001 White OR $17,000
5342 Altima 2001 Green FL $19,000
3845 Maxima 2001 Blue NY $22,000
8354 Accord 2000 Green VT $18,000
4395 Civic 2001 Red CA $17,000
7352 Civic 2002 Red WA $18,000
20
Example: Database Query Processing
COLOR = "WHITE"
[Intermediate table produced by the COLOR = "WHITE" task:]
ID# Color
3476 White
6734 White
21
Example: Database Query Processing
ID# Model Year Color Dealer Price
4523 Civic 2002 Blue MN $18,000
3476 Corolla 1999 White IL $15,000
7623 Camry 2001 Green NY $21,000
9834 Prius 2001 Green CA $18,000
6734 Civic 2001 White OR $17,000
5342 Altima 2001 Green FL $19,000
3845 Maxima 2001 Blue NY $22,000
8354 Accord 2000 Green VT $18,000
4395 Civic 2001 Red CA $17,000
7352 Civic 2002 Red WA $18,000
22
Example: Database Query Processing
COLOR = "GREEN"
[Intermediate table produced by the COLOR = "GREEN" task:]
ID# Color
7623 Green
9834 Green
5342 Green
8354 Green
23
Example: Database Query Processing
ID# Model Year Color Dealer Price
4523 Civic 2002 Blue MN $18,000
3476 Corolla 1999 White IL $15,000
7623 Camry 2001 Green NY $21,000
9834 Prius 2001 Green CA $18,000
6734 Civic 2001 White OR $17,000
5342 Altima 2001 Green FL $19,000
3845 Maxima 2001 Blue NY $22,000
8354 Accord 2000 Green VT $18,000
4395 Civic 2001 Red CA $17,000
7352 Civic 2002 Red WA $18,000
24
Example: Database Query Processing
MODEL = "CIVIC" AND YEAR = 2001
[This task combines the MODEL = "CIVIC" table with the YEAR = 2001 table:]
ID# Model Year
6734 Civic 2001
4395 Civic 2001
25
Example: Database Query Processing
ID# Model Year Color Dealer Price
4523 Civic 2002 Blue MN $18,000
3476 Corolla 1999 White IL $15,000
7623 Camry 2001 Green NY $21,000
9834 Prius 2001 Green CA $18,000
6734 Civic 2001 White OR $17,000
5342 Altima 2001 Green FL $19,000
3845 Maxima 2001 Blue NY $22,000
8354 Accord 2000 Green VT $18,000
4395 Civic 2001 Red CA $17,000
7352 Civic 2002 Red WA $18,000
26
Example: Database Query Processing
COLOR = "GREEN" OR COLOR = "WHITE"
[This task combines the COLOR = "WHITE" table with the COLOR = "GREEN" table:]
ID# Color
3476 White
7623 Green
9834 Green
6734 White
5342 Green
8354 Green
27
Example: Database Query Processing
ID# Model Year Color Dealer Price
4523 Civic 2002 Blue MN $18,000
3476 Corolla 1999 White IL $15,000
7623 Camry 2001 Green NY $21,000
9834 Prius 2001 Green CA $18,000
6734 Civic 2001 White OR $17,000
5342 Altima 2001 Green FL $19,000
3845 Maxima 2001 Blue NY $22,000
8354 Accord 2000 Green VT $18,000
4395 Civic 2001 Red CA $17,000
7352 Civic 2002 Red WA $18,000
28
Example: Database Query Processing
MODEL = "CIVIC" AND YEAR = 2001 AND (COLOR = "GREEN" OR COLOR = "WHITE")
[The final task combines the MODEL = "CIVIC" AND YEAR = 2001 table with the COLOR = "GREEN" OR COLOR = "WHITE" table:]
ID# Model Year Color
6734 Civic 2001 White
29
Example: Database Query Processing
ID# Model Year Color Dealer Price
4523 Civic 2002 Blue MN $18,000
3476 Corolla 1999 White IL $15,000
7623 Camry 2001 Green NY $21,000
9834 Prius 2001 Green CA $18,000
6734 Civic 2001 White OR $17,000
5342 Altima 2001 Green FL $19,000
3845 Maxima 2001 Blue NY $22,000
8354 Accord 2000 Green VT $18,000
4395 Civic 2001 Red CA $17,000
7352 Civic 2002 Red WA $18,000
30
Example: Database Query Processing
The same problem can be decomposed into subtasks in other ways as well.
[Figure: an alternate decomposition of the given query into subtasks, along with their data dependencies. Here COLOR = "GREEN" OR COLOR = "WHITE" is evaluated first; its result is combined with YEAR = 2001 to give 2001 AND (White OR Green) (tuples 7623, 6734, 5342), which is then combined with MODEL = "CIVIC", again yielding (6734, Civic, 2001, White).]
Different task decomposition methods may lead to different parallel performance.
31
Granularity of Task Decompositions
• The number of tasks into which a
problem is decomposed determines
its granularity.
• Decomposition into a large number
of tasks results in fine grained
decomposition and that into a small
number of tasks results in a
coarse grained decomposition.

32
Granularity of Task Decompositions
[Figure: matrix A (columns 0, 1, ..., n), vector b and result y, partitioned among Task 1 to Task 4.]
A coarse grain decomposition of dense matrix-vector multiplication into four tasks. Each task computes three elements of the result vector.
33
Degree of Concurrency
• The number of tasks that can be
executed in parallel is called the
degree of concurrency of a
decomposition.
• Since the number of tasks that can
be executed in parallel may change
during program execution, the
maximum degree of concurrency is
the maximum number of such tasks
at any point during execution. 34
Degree of Concurrency
• What is the maximum degree of concurrency of the database query examples?
• The total amount of work is obtained by summing over all tasks executed, from the initial decomposed tasks through the intermediate tasks, until the final result is obtained. In the database query example, 7 tasks are executed in both decompositions.
35
• The critical path in a task-dependency graph is the longest directed path between any pair of start and finish nodes; its length is the sum of the weights of the nodes along that path.
• The average degree of
concurrency is = (Total amount of
work) / (critical path length)
36
• Assuming that each task in the database example takes identical processing time, what is the average degree of concurrency in each decomposition?
• The degree of concurrency
increases as the decomposition
becomes finer in granularity and
vice versa.
37
Critical Path Length
• A directed path in the task
dependency graph represents a
sequence of tasks that must be
processed one after the other.

38
Critical Path Length

• The longest such path determines


the shortest time in which the
program can be executed in parallel.

39
Critical Path Length

• The length of the longest path in a


task dependency graph is called the
critical path length.

40
Critical Path Length
• Consider the task dependency graphs of the two database query decompositions:
[Figure: (a) Tasks 1-4 (weight 10 each) feed Task 5 (weight 6) and Task 6 (weight 9), which feed Task 7 (weight 8). (b) Tasks 1-4 (weight 10 each); two of them feed Task 5 (weight 6), Task 5 and a third feed Task 6 (weight 11), and Task 6 and the fourth feed Task 7 (weight 7).]
41
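A worked reading of the node weights shown above: for graph (a) the total amount of work (sum of node weights) is 4 x 10 + 6 + 9 + 8 = 63 and the critical path (a level-1 task, then Task 6, then Task 7) has length 10 + 9 + 8 = 27, giving an average degree of concurrency of 63/27 ≈ 2.33. For graph (b) the total work is 4 x 10 + 6 + 11 + 7 = 64 and the critical path length is 10 + 6 + 11 + 7 = 34, giving 64/34 ≈ 1.88.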
Questions:
• What are the critical path lengths for the
two task dependency graphs?
• If each task takes 10 time units, what is the
shortest parallel execution time for each
decomposition?
• How many processors are needed in each
case to achieve this minimum parallel
execution time?
• What is the maximum degree of
concurrency?

42
Limits on Parallel Performance
• It can be observed that the parallel
time can be made arbitrarily small by
making the decomposition finer in
granularity.
• There is an inherent bound on how fine the granularity of a computation can be. For example, when multiplying a dense n x n matrix with a vector, there can be no more than n² concurrent tasks (one per scalar multiplication).

43
Limits on Parallel Performance
[Figure: a 3 x 3 grid, rows 1-3 by columns 1-3, with cells numbered 1-9.]
44
Limits on Parallel Performance
[Figure: the same 3 x 3 grid with each row assigned to one task: 3 rows give 3 tasks.]
45
Limits on Parallel Performance
[Figure: the same 3 x 3 grid with each cell assigned to one task: 3 rows x 3 columns = 9 cells = 9 tasks.]
46
Limits on Parallel Performance
• Concurrent tasks may also have to
exchange data with other tasks. This
results in communication overhead.
• The tradeoff between the granularity
of a decomposition and associated
overheads often determines
performance bottleneck.

47
Partitioning Techniques
• There is no single recipe that works
for all problems.
• We can benefit from some commonly
used techniques:
– Recursive Decomposition
– Data Decomposition
– Exploratory Decomposition
– Speculative Decomposition

48
Recursive Decomposition
• Generally suited to problems that are
solved using a divide and conquer
strategy.
• Decompose based on sub-problems
• Often results in natural concurrency
as sub-problems can be solved in
parallel.
• Need to think recursively
– parallel not sequential
49
Recursive Decomposition: Quicksort

Once each sublist has been partitioned around the pivot, each sub-sublist can be processed concurrently.
50
Recursive Decomposition:
Finding the Min/Max/Sum
• Any associative and commutative operation.
1. procedure SERIAL_MIN (A, n)
2. begin
3. min = A[0];
4. for i := 1 to n − 1 do
5. if (A[i] < min) min := A[i];
6. endfor;
7. return min;
8. end SERIAL_MIN

51
Recursive Decomposition:
Finding the Min/Max/Sum
• Rewrite using recursion and max partitioning
– Don’t make a serial recursive routine
1. procedure RECURSIVE_MIN (A, n)
2. begin
3. if ( n = 1 ) then
4. min := A [0] ;
5. else Note: Divide the work
6. lmin := RECURSIVE_MIN ( A, n/2 ); in half each time.
7. rmin := RECURSIVE_MIN ( &(A[n/2]), n - n/2 );
8. if (lmin < rmin) then
9. min := lmin;
10. else
11. min := rmin;
12. endelse;
13. endelse;
14. return min;
15. end RECURSIVE_MIN 52
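A hedged C sketch of the same recursive decomposition using OpenMP tasks (OpenMP is not covered in these slides; it is used here only to show that the two half-problems can genuinely run in parallel):

#include <limits.h>

/* Recursive min over A[0..n-1]; the two halves run as independent tasks. */
int recursive_min(const int *A, int n)
{
    if (n == 1) return A[0];
    int lmin = INT_MAX, rmin = INT_MAX;
    #pragma omp task shared(lmin)
    lmin = recursive_min(A, n / 2);              /* left half  */
    #pragma omp task shared(rmin)
    rmin = recursive_min(A + n / 2, n - n / 2);  /* right half */
    #pragma omp taskwait                         /* wait for both halves */
    return (lmin < rmin) ? lmin : rmin;
}

/* Call it from inside a parallel region, e.g.:
   #pragma omp parallel
   #pragma omp single
   m = recursive_min(A, n);                      */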
Recursive Decomposition:
Finding the Min/Max/Sum
• Example: Find min of {4,9,1,7,8,11,2,12}

[Figure: the list 4 9 1 7 8 11 2 12 shown at steps 1-3 of the recursion, with the minimum computed over progressively larger parts of the list.]
53
Recursive Decomposition:
Finding the Min/Max/Sum
• May struggle to divide the work exactly in half
• Often, can be mapped to a hypercube
for a very efficient algorithm
• The overhead of dividing the
computation is important.
– How much does it cost to communicate
necessary dependencies?

54
Recursive Decomposition:
Sequential Merge Sort
Pseudo code of Sequential Merge Sort with Recursion
void mergeSort(int *a, int first, int last, int *aux) {
if (last <= first) return;
int mid = (first + last) / 2;
mergeSort (a, first, mid, aux); //first half
mergeSort (a, mid+1, last, aux); //next half
mergeArrays(a, first, mid, a, mid+1, last, aux, first, last);
for (int i=first; i<= last; i++) a[i] = aux[i];
}

void mergeArrays(int *a, int afirst, int alast, int *b, int bfirst, int blast, int *c, int cfirst, int
clast) {
int i = afirst, j = bfirst, k = cfirst;
while (i <= alast && j <= blast ) {
if ( a[i] < b[j]) c[k++] = a[i++];
else c[k++] = b[j++];
}
while (i <= alast ) c[k++] = a[i++];
while (j <= blast ) c[k++] = b[j++];
}
55
Recursive Decomposition:
Parallel Merge Sort
Pseudo code of Parallel Merge Sort
void parallel_mergeSort() {
  if (proc_id > 0) {                       // non-root: receive sub-array from parent
    Recv(size, parent);
    Recv(a, size, parent);
  }
  mid = size / 2;
  if (both children) {                     // split the array between the two children
    Send(mid, child1);
    Send(size - mid, child2);
    Send(a, mid, child1);
    Send(a + mid, size - mid, child2);
    Recv(a, mid, child1);                  // receive the two sorted halves back
    Recv(a + mid, size - mid, child2);
    mergeArrays(a, 0, mid - 1, a, mid, size - 1, aux, 0, size - 1);
    for (int i = 0; i < size; i++) a[i] = aux[i];
  }
  else mergeSort(a, 0, size - 1, aux);     // leaf process: sort locally

  if (proc_id > 0) Send(a, size, parent);  // return the sorted sub-array
}
56
Data Decomposition
• This is the most common approach
• Identify the data and partition across
tasks
• Three approaches
– Output Data Decomposition
– Input Data Decomposition
– Domain Decomposition

57
Output Data Decomposition
• Each element of the output can be
computed independently of the others
– A function of the input
– All may be able to share the input or have a
copy of their own
• Decompose the problem naturally.
• Embarrassingly Parallel
– Output data decomposition with no need for
communication
– Mandelbrot, Simple Ray Tracing, etc.

58
Mandlebrot Fractal Zoom
https://youtu.be/PD2XgQOyCCk

The Next Dimension - 3D Mandelbrot Fractal


Zoom (MMY3D)
https://youtu.be/hRrBnI5L0u8

3D Fractal
https://youtu.be/S530Vwa33G0

Sapphires - Mandlebrot Fractal Zoom


https://youtu.be/8cgp2WNNKmQ

Prof V B More, MET IOE BKC 63


Output Data Decomposition
• Matrix Multiplication: A * B = C
• Can partition output matrix C

64
Output Data Decomposition:
Example
A partitioning of output data does
not result in a unique
decomposition into tasks. For
example, with identical output data
distribution, we can derive the
following two (other)
decompositions:

65
Output Data Decomposition:
Example
Decomposition I:
Task 1: C1,1 = A1,1 B1,1
Task 2: C1,1 = C1,1 + A1,2 B2,1
Task 3: C1,2 = A1,1 B1,2
Task 4: C1,2 = C1,2 + A1,2 B2,2
Task 5: C2,1 = A2,1 B1,1
Task 6: C2,1 = C2,1 + A2,2 B2,1
Task 7: C2,2 = A2,1 B1,2
Task 8: C2,2 = C2,2 + A2,2 B2,2

Decomposition II:
Task 1: C1,1 = A1,1 B1,1
Task 2: C1,1 = C1,1 + A1,2 B2,1
Task 3: C1,2 = A1,2 B2,2
Task 4: C1,2 = C1,2 + A1,1 B1,2
Task 5: C2,1 = A2,2 B2,1
Task 6: C2,1 = C2,1 + A2,1 B1,1
Task 7: C2,2 = A2,1 B1,2
Task 8: C2,2 = C2,2 + A2,2 B2,2
66
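A minimal C sketch of one of these tasks (the names and block size b are illustrative, not from the slides); it assumes the output block starts zeroed, so the odd-numbered task initializes the block and the following task accumulates into it, which is why the two form a dependent pair.

/* C_blk += A_blk * B_blk, where each block is a b x b row-major array. */
void block_multiply_add(double *C_blk, const double *A_blk,
                        const double *B_blk, int b)
{
    for (int i = 0; i < b; i++)
        for (int j = 0; j < b; j++)
            for (int k = 0; k < b; k++)
                C_blk[i * b + j] += A_blk[i * b + k] * B_blk[k * b + j];
}
/* Decomposition I, Tasks 1 and 2 (assuming C1,1 is zero-initialized):
   block_multiply_add(C11, A11, B11, b);    // Task 1
   block_multiply_add(C11, A12, B21, b);    // Task 2 depends on Task 1 */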


Output Data Decomposition
• Count the instances of given itemsets

67
Output Data Decomposition
• Count the instances of given itemsets

68
Input Data Decomposition
• Applicable if the output can be naturally
computed as a function of the input.
• In many cases, this is the only natural
decomposition because the output is not
clearly known a-priori
– finding minimum in list, sorting, etc.
• Associate a task with each input data
partition.
• Tasks communicate where necessary input is
“owned” by another task.

69
Input Data Decomposition
• Count the instances of given itemsets
• Each task generates partial counts for all itemsets
which must be aggregated.

• Must combine partial results at the end 70


Input & Output Data Decomposition
• Often, partitioning either input data or output data
forces a partition of the other.
• Can also consider partitioning both

71
Intermediate Data Partitioning

Computation can often be viewed as a


sequence of transformation from the
input to the output data.

In these cases, it is often beneficial to


use one of the intermediate stages as
a basis for decomposition.

72
Intermediate Data Partitioning: Example
Let us revisit the example of dense matrix
multiplication. We first show how we can visualize
this computation in terms of intermediate matrices D.

[Figure: A is split into block columns (A1,1/A2,1 and A1,2/A2,2) and B into block rows (B1,1/B1,2 and B2,1/B2,2). Their products form the intermediate blocks D1,i,j = A_i,1 B_1,j and D2,i,j = A_i,2 B_2,j, which are added to give C1,1, C1,2, C2,1 and C2,2.]
73
Multiplication of two 2 x 2 matrices A and B via intermediate results:

  A = |  2   3 |        B = | 3  -2 |
      | -1   4 |            | 3   1 |

Intermediate result D1 (first column of A times first row of B):

  |  2 |               |   2x3      2x(-2) |   |  6  -4 |
  | -1 | x | 3  -2 | = | (-1)x3  (-1)x(-2) | = | -3   2 |

Intermediate result D2 (second column of A times second row of B):

  | 3 |              | 3x3  3x1 |   |  9   3 |
  | 4 | x | 3  1 | = | 4x3  4x1 | = | 12   4 |

Combine intermediate results:

  C = D1 + D2 = |  6+9   -4+3 |   | 15  -1 |
                | -3+12   2+4 | = |  9   6 |

74
Domain Decomposition
• Often can be viewed as input data
decomposition
– May not be input data
– Just domain of calculation
• Split up the domain among tasks
• Each task is responsible for computing the
answer for its partition of the domain
• Tasks may end up needing to
communicate boundary values to perform
necessary calculations

75
Domain Decomposition
• Evaluate the integral of 4/(1 + x²) over the domain [0, 1] (its exact value is π).
• Each task evaluates the integral over its own partition of the domain.
• Once all have finished, sum each task's answer for the total.
[Figure: the domain [0, 1] split at 0.25, 0.5 and 0.75 into four sub-intervals, one per task.]

76
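A small C sketch of this domain decomposition (function and parameter names are illustrative): each task integrates its own sub-interval with the midpoint rule, and the partial results are summed afterwards.

/* Integrate 4/(1+x^2) over [a, b] with 'steps' midpoint-rule steps. */
double partial_integral(double a, double b, int steps)
{
    double h = (b - a) / steps, sum = 0.0;
    for (int i = 0; i < steps; i++) {
        double x = a + (i + 0.5) * h;      /* midpoint of sub-step i */
        sum += 4.0 / (1.0 + x * x) * h;
    }
    return sum;
}
/* Four tasks as in the figure:
   pi_estimate = partial_integral(0.0, 0.25, N) + partial_integral(0.25, 0.5, N)
               + partial_integral(0.5, 0.75, N) + partial_integral(0.75, 1.0, N); */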
Domain Decomposition
• Often a natural approach for grid/matrix
problems

There are algorithms for more complex


domain decomposition problems

77
Exploratory Decomposition
• In many cases, the decomposition of a
problem goes hand-in-hand with its
execution.
• Typically, these problems involve the
exploration of a state space.
– Discrete optimization
– Theorem proving
– Game playing

78
Exploratory Decomposition
• 15 puzzle – put the numbers in order
– only move one piece at a time to a blank spot

79
Exploratory
Decomposition
• Generate
successor
states and
assign to
independent
tasks.

80
[Figures, slides 81-86: a sequence of 15-puzzle board states. The successor states of the initial configuration are assigned to Task 1 through Task 4, and the exploration continues until the solved state is reached.]
86
Exploratory Decomposition
• Exploratory decomposition techniques may change the
amount of work done by the parallel implementation.
• Change can result in super- or sub-linear speedups

87
Speculative Decomposition
• Sometimes, dependencies are not known
a-priori
• Two approaches
– conservative – identify independent tasks only
when they are guaranteed to not have
dependencies
• May yield little concurrency
– optimistic – schedule tasks even when they
may be erroneous
• May require a roll-back mechanism in the case of
an error.

88
Speculative Decomposition
• The speedup due to speculative
decomposition can add up if there
are multiple speculative stages
• Examples
– Concurrently evaluating all branches of
a C switch stmt
– Discrete event simulation

89
Speculative Decomposition
A switch statement works based on the value of its expression; the corresponding case statement executes.

Sequential switch:
  compute expr;
  switch (expr) {
    case 1: compute-e1; break;
    case 2: compute-e2; break;
    ...
  }

Parallel (speculative) switch:
  Slave(i) {
    compute ei;
    Wait(request);
    if (request) Send(ei, 0);
  }
  Master() {
    compute expr;
    switch (expr) {
      case 1:
        Send(request, 1);
        Receive(a1, i);
      ...
    }
  }
90
Speculative Decomposition
Discrete Event Simulation
• The central data structure is a time-
ordered event list.
• Events are extracted precisely in time
order, processed, and if required,
resulting events are inserted back
into the event list.

91
Speculative Decomposition
Discrete Event Simulation
• Consider MET-UTSAV-21 as a discrete
event system -
–Every day a number of events are scheduled one after another, and on the last day there is a Musical Night.

92
Speculative Decomposition
Discrete Event Simulation
• Each of these events may be
processed independently,
–Since, there is no concrete
dependency of one event on other,
they can be executed independently

93
Speculative Decomposition
Discrete Event Simulation
–In certain situations, such as natural calamities, the scheduled execution of the events can be hampered, which may become cumbersome to manage.
• Therefore, an optimistic scheduling of the other events is possible if there is a backup plan.

94
Speculative Decomposition
Discrete Event Simulation
• Simulate a network of nodes
– various inputs, node delay parameters, queue
sizes, service rates, etc.

95
Hybrid Decomposition
• Often, a mix of decomposition techniques
is necessary
• In quicksort, recursive decomposition
alone limits concurrency (Why?). A mix of
data and recursive decompositions is
more desirable.

96
Hybrid Decomposition
• In discrete event simulation, there might
be concurrency in task processing. A mix
of speculative decomposition and data
decomposition may work well.
• Even for simple problems like finding a
minimum of a list of numbers, a mix of
data and recursive decomposition works
well.

97
Task Characterization

98
Task Characterization
• Task characteristics can have a
dramatic impact on performance.
There are following basic
characteristics of task:
• Generation of Task
– Static
– Dynamic

99
Task Characterization
• Size of Task
– Uniform
– Non-uniform
• Data Size
– Size
– Uniformity

100
Task Generation
• From the algorithm's point of view, tasks are the units of work that are executed in parallel.
• Tasks are generated in two ways: static and dynamic.

101
Task Generation
• Static
–Tasks are known in advance before
their execution. Tasks are executed
in order which are defined
previously.
–number of tasks, task size, data
size are all known in prior, therefore
their execution is deterministic.
– E.g.image processing, matrix & graph algorithms

102
Task Generation
• Dynamic
–Tasks are created dynamically
based on decomposition of data in
particular situations.
–Tasks are not all available before execution begins; the set of tasks changes throughout the run.
–This makes them difficult to launch in a statically scheduled environment.
103
Task Generation
–most often dealt with using
dynamic load balancing techniques
–Recursive and exploratory
decomposition techniques are
considered as examples of
dynamic task generation.

104
Task Size – Data size
• Execution time
–uniform – synchronous steps
–non-uniform – difficult to determine
synchronization points
• often handled using Master-Worker
paradigm
• otherwise polling is necessary in
message passing

105
Task Size – Data size
• Data Size
– Data size is one of the crucial properties of a task. The required data should be available when the task is mapped onto a process.
– The overhead associated with data movement can be reduced when the size of the data and its memory location are known at the time of processing.
106
Task Interactions
• Static interactions: The tasks and
their interactions are known a-priori.
These are relatively simple to code
into programs.
• Dynamic interactions: The timing of interactions, or the set of interacting tasks, cannot be determined a-priori. These
interactions are harder to code,
especially using message passing
APIs. 107
Task Interactions
• Regular interactions: There is a
definite pattern (in the graph sense)
to the interactions. These patterns
can be exploited for efficient
implementation.
• Irregular interactions: When
interactions are irregular, they lack
well-defined topologies.
108
Static Task Interaction Patterns
Regular patterns are easier to code: both producer and consumer know when communication is required, so the code is explicit and simple (e.g., the typical block partitioning used in image processing).
Irregular patterns must take into account the variable number of neighbors of each task, and timing becomes more difficult (e.g., a sparse matrix and its associated irregular task-interaction graph).
109
Static & Regular Interaction
• Algorithm has phases of computation and
communication
• Example - Hotplate
– Communicate initial conditions
– Loop
• Communicate dependencies
• Calculate “owned” values
• Check for convergence in “owned” values
• Communicate to determine convergence
– Communicate final conditions

110
Example - Hotplate
• Use Domain Decomposition
– domain is the hotplate
– split the hotplate up among tasks
• row-wise or block-wise
Consider the communication costs

Row-wise  2 neighbors

Block-wise  4 neighbors

Consider data sharing and


computational needs & efficiency

About the same for row or block 111


Dynamic Interaction
Interactions among tasks are unpredictable in dynamic communication. Synchronization between sender and receiver is a challenging issue: problems can arise when both try to communicate at the same time, and a task does not know when to expect a message, so it must poll periodically.
112
Static vs Dynamic Task Interactions
Static Interactions: For each task, the interactions happen at predetermined times; the task-interaction graph, and the stage at which each interaction happens, are known in advance.
Dynamic Interactions: The timing of an interaction is not known prior to the execution of the program; the stage at which an interaction is needed is decided dynamically.
113
Static vs Dynamic Task Interactions
Static Interactions: Can be programmed easily in the MPI paradigm; also easy to code in the shared address space model.
Dynamic Interactions: Uncertainty in the interactions makes it hard for sender and receiver to participate in an interaction at the same time in MPI; still easy to code in the shared address space model.
114
Regular vs Irregular Task Interactions
Regular Interactions: The interaction pattern has some definite structure (in the graph sense), which makes it easy to handle.
Irregular Interactions: No regular interaction pattern exists; irregular and dynamic patterns are difficult to handle, especially in the MPI model.
115
Regular vs Irregular Task Interactions
Regular Interactions: In the image dithering problem, the color of each pixel is determined as the weighted average of its original value and the values of its neighboring pixels. The image is decomposed into square regions, each assigned to a task that carries out the dither operation for its region independently.
Irregular Interactions: In sparse matrix-vector multiplication, a task cannot know in advance which entries of the vector it requires, because of the irregular structure of its chunk of work.
116
Read only vs Read write Task Interactions
Read-only Interactions: Tasks require only read access to the data shared among many concurrent tasks.
Read-write Interactions: Multiple tasks need to read and write some shared data.
117
Read only vs Read write Task Interactions
Read-only Interactions: In the decomposition for parallel matrix multiplication, the tasks only need to read the shared input matrices A and B.
Read-write Interactions: In the 15-puzzle problem, the priority queue constitutes shared data, and tasks need both read and write access to it.
118
One way vs Two way Task Interactions
One-way Interactions: Only one task pushes data to the other. Cannot be programmed directly in MPI; easy to handle in the shared address space model.
Two-way Interactions: Both tasks push data to each other. Suitable for MPI, and also easy to handle in the shared address space model.

119
Questions based on Task Interactions
• Explain characteristics of tasks
• Write short note on task generation
• Differentiate between static and dynamic task
generation
• Discuss the impact of task size on task generation
• Compare between static interaction and dynamic task
interactions
• Compare between regular and irregular task interaction
• Compare between read only and read write task
interaction
• Compare between one way and two way task
interaction

120
Mapping Techniques for Load
Balancing

1
Mapping
• Once a problem has been
decomposed into concurrent tasks,
the tasks must be mapped to
processes.
–Mapping and Decomposition are
often interrelated steps
• Mapping must minimize overhead
–Inter-process communication and
–Time for which the processes are
idle 2
Mapping
[Figures, slides 3-6: the main task is decomposed into sub-tasks; the sub-tasks are then mapped onto processes, and the arrows between the processes indicate the resulting inter-process communication.]
6
Mapping
• Minimizing overhead is a trade-off
game
–Assigning all work to one
processor trivially minimizes
communication at the expense of
significant idling of other
processors.
• Goal: Performance

7
Mapping
[Figures, slides 8-10: mapping all sub-tasks of the main task to only one process: there is no inter-process communication, but the remaining processes stay idle.]
10
Mapping
• Due to load imbalance, some processes finish their work early.
• Based on the constraints in the task dependency graph, some processes may have to wait for other processes to finish their work.

11
Mapping
[Figures, slides 12-16: an uneven decomposition of the main task into coarse-grained and fine-grained sub-tasks and their mapping onto processes; with such an imbalanced mapping, some processes complete execution early while others complete late.]
16
Mapping
• To reduce the overhead caused by interaction, one approach is to assign tasks that need to interact to the same process.
• However, this can lead to workload imbalance among processes: heavily loaded processes stay busy while lightly loaded processes become idle.

17
Mapping
• Good mapping scheme must ensure
the balance between computations
and interactions among processes.
• If synchronization among the
interacting tasks is improper then
waiting time for sending and
receiving data among processes will
increase.

18
Mapping Techniques for Minimum Idling

Mapping must simultaneously minimize idling and balance the load; merely balancing the load does not minimize idling.
[Figure: twelve equal tasks mapped onto four processes P1-P4 with a synchronization point. (a) P1: 1, 5, 9; P2: 2, 6, 10; P3: 3, 7, 11; P4: 4, 8, 12 - the synchronization is reached at t = 2 and execution finishes at t = 3. (b) P1: 1, 2, 3; P2: 4, 5, 6; P3: 7, 8, 9; P4: 10, 11, 12 - the synchronization is only reached at t = 3 and, because of idling, execution finishes at t = 6.]
19
Mapping
• There are two types of mapping
techniques:
– static mapping
– dynamic mapping

20
Mapping
• Static
–tasks mapped to processes a-priori.
–Tasks are distributed among
available processes prior to
execution of algorithms
–need a good estimate of the task
size, data size, and inter-task
interactions.

21
Mapping
–often based on data or task graph
partitioning
–algorithm with static mapping are
easy to design
–since everything is known a priori, static mapping schemes are suitable for both shared address space and message passing programming models.
22
Mapping

• Dynamic
–tasks are mapped to processes at
runtime.
–task generation and mapping are done dynamically
–task sizes are not known a priori
–task processing times may be indeterminate
23
Mapping

• unknown inter-task processing times


• dynamic mapping schemes are more
complicated with message passing
programming model, whereas, it can
work well with shared address space
models.

24
Schemes for Static Mapping

• Mappings based on data partitioning.

• Mappings based on task graph


partitioning.

• Hybrid mappings.

25
Mapping – Data Partitioning

• Based on “owner-computes” rule


• We can combine data partitioning
with the “owner-computes” rule
to partition the computation into
subtasks.
• The simplest data decomposition
schemes for dense matrices are
1-D block distribution schemes.
Block-wise distribution 26
Mapping – Data Partitioning

Array Distribution Scheme


• In data decomposition, the tasks are
responsible for execution of the data
associated with them according to
owner computes rules.
• Mapping tasks onto processes is
same as mapping data onto
processes

Block-wise distribution 27
Mapping – Data Partitioning

Figure 1 Example of One dimensional partitioning of an


array among eight processes
28
Mapping – Data Partitioning

Block Distribution
• In block distributions the uniform
contiguous portions of the array are
distributed to different processes.
• Eg. consider d - dimensional array in
which each process will receive
contiguous block of array along array
dimensions.
29
Block Distribution

• Consider an n x n two-dimensional array with n rows and n columns, distributed among eight processes as shown in Figure 1.
• Block distributions of arrays are
particularly suitable when there is a
locality of interaction, i.e.,
computation of an element of an
array requires other nearby elements
in the array. 30
Block Distribution
• For example, consider an n x n two-
dimensional array A with n rows and
n columns.
• We can now select one of these
dimensions, e.g., the first dimension,
and partition the array into p parts
such that the kth part contains rows
kn/p...(k + 1)n/p - 1, where 0 <= k < p.
• That is, each partition contains a
block of n/p consecutive rows of A.
31
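A small C sketch of these bounds (helper names are illustrative): process k owns rows k*n/p through (k+1)*n/p - 1, using integer division so the rows are split as evenly as possible.

/* 1-D block distribution of n rows over p processes. */
int block_low (int k, int n, int p) { return  k      * n / p;      }
int block_high(int k, int n, int p) { return (k + 1) * n / p - 1;  }
int block_size(int k, int n, int p) { return block_high(k, n, p) - block_low(k, n, p) + 1; }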
Block Distribution
• Similarly, if we partition A along the
second dimension (column), then
each partition contains a block of n/p
consecutive columns.
• These row- and column-wise array
distributions are shown in figure 1

32
Block Distribution

Figure 2 Two dimensional distributions of an array on 4x4


process grid and 2x8 process grid
33
Block Distribution
• Now consider the case in which
multiple dimensions are considered
instead of single dimension partition.
• In this case, both the dimensions (i.e.
rows and columns) are selected at a
time and matrix is divided in number
of blocks.
• Example is shown in figure 2.
• Each block has size n/p1 x n/p2, where p = p1 x p2 is the total number of processes.
Block Distribution
For a d-dimensional array we can have a block distribution in up to d dimensions, e.g., for n x n matrix multiplication A x B = C.
For a two-dimensional distribution over a √p x √p process grid, each block has size n/√p x n/√p.
In each case (one- or two-dimensional), one partition is assigned to one process, which computes it.
In higher dimensions, more blocks are generated.
35
Block Distribution
• More processes are used for
computation.
• In a single dimension, if there are n rows (or n columns), at most n processes can be used for the computation.
• With a two-dimensional distribution, up to n² processes can be used.
Advantages of higher dimensions:
• Higher degree of concurrency
• Reduced interaction among processes
36
Block Distribution

Figure 3 Data sharing needed for matrix


multiplication with one dimensional
partitioning of the output matrix.
Shaded portion of input matrix A and B
are required by the process that
computes the shaded portion of the
output matrix C. 37
Block Distribution

Figure 4 Data sharing needed for matrix


multiplication with two dimensional
partitioning of the output matrix.
Shaded portions of input matrices A and B are required by the process that computes the shaded portion of the output matrix C.
38
Block Distribution
One-dimensional distribution along rows: each process accesses n/p rows of A and the complete matrix B, so the total data accessed by each process is n²/p + n².
Two-dimensional distribution: each process accesses only n/√p rows of A and n/√p columns of B, so the total data accessed by each process is O(n²/√p).
39
Block Distribution
• Block distribution is useful if same
work is performed on each element.
• If amount of work is different for
different elements block distribution
results in load imbalance.

40
Cyclic and Block Cyclic Distribution
• Cyclic distributions often “spread the load”

41
Cyclic and Block Cyclic Distributions

Cyclic Distribution:
When the computational load is not spread evenly over the array, a cyclic distribution can be used to spread the load across the processes.

42
Cyclic and Block Cyclic Distributions

Consecutive entries of the global vector are assigned to successive processes.
For a cyclic distribution, the mapping m -> (p, i) of global index m to (process, local index) is defined as
m -> (m mod P, floor(m / P))

43
Cyclic and Block Cyclic Distributions

The load imbalances occur when


amount of work is different for
different part of matrix.

This can be avoided by using cyclic


or block cyclic distribution.

44
Cyclic Distribution
• Ex, m = 23 elements and P = 3 processes (0 to 2)

m 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22

P 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1

i 0 0 0 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7

45
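A tiny C sketch of this mapping (function names are illustrative), matching the table above:

/* Cyclic distribution: global element m lives on process m mod P
   at local index floor(m / P). */
int owner(int m, int P)       { return m % P; }
int local_index(int m, int P) { return m / P; }
/* e.g. m = 4, P = 3  ->  owner 1, local index 1, as in the table. */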
Cyclic Distribution
• 1-D cyclic distribution of array 'A' (elements 0-9) on 4 processes (P0-P3):
element:  0  1  2  3  4  5  6  7  8  9
process: P0 P1 P2 P3 P0 P1 P2 P3 P0 P1
47
Block Cyclic Distributions

• There are variation of the block


distribution scheme that can be used
to resolve the load-imbalance and
idling problems.

48
Block Cyclic Distributions

• Partition an array into many more


blocks than the number of available
processes.

• Blocks are assigned to processes in a


round-robin manner so that each
process gets several non-adjacent
blocks.
49
Block-Cyclic Distribution for Gaussian Elimination

The active part of the matrix in Gaussian Elimination changes as the elimination proceeds. By assigning blocks in a block-cyclic fashion, each processor receives blocks from different parts of the matrix.
[Figure: at step k, the rows and columns before row/column k are inactive. In the active part, row k is updated as A[k,j] := A[k,j] / A[k,k], and the remaining rows i > k as A[i,j] := A[i,j] - A[i,k] x A[k,j].]

50
Block-Cyclic Distribution: Examples
One- and two-dimensional block-cyclic distributions among 4 processes.
[Figure: the matrix is divided into many more blocks than there are processes; blocks are assigned round-robin so each process receives several non-adjacent blocks of the matrix.]
51
Block-Cyclic Distribution
• A cyclic distribution is a special case in which the block size is one.
• A block distribution is a special case in which the block size is n/p, where n is the dimension of the matrix and p is the number of processes.
[Figure: (a) one-dimensional and (b) two-dimensional block-cyclic distribution of a two-dimensional array over processes P0-P3.]
52
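A one-line C sketch of the ownership rule behind these distributions (names are illustrative):

/* 1-D block-cyclic distribution with block size b over p processes:
   global row i belongs to block i/b, which is dealt round-robin.   */
int bc_owner(int i, int b, int p) { return (i / b) % p; }
/* b = 1   gives the cyclic distribution;
   b = n/p gives the plain block distribution of n rows.            */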
Graph Partitioning Based Data
Decomposition
• The array-based distribution schemes
that we described so far are quite
effective in balancing the
computations and minimizing the
interactions for a wide range of
algorithms that use dense matrices
and have structured and regular
interaction patterns.

53
Graph Partitioning Based Data
Decomposition
• However, there are many algorithms
that operate on sparse data structures
and for which the pattern of interaction
among data elements is data
dependent and highly irregular.

• Numerical simulations of physical


phenomena provide a large source of
such type of computations.

54
Graph Partitioning Based Data
Decomposition
• In these computations, the physical
domain is discretized and represented
by a mesh of elements.

• The simulation of the physical


phenomenon being modeled then
involves computing the values of
certain physical quantities at each
mesh point.

55
Graph Partitioning Based Data
Decomposition
• The computation at a mesh point
usually requires data corresponding to
that mesh point and to the points that
are adjacent to it in the mesh.

• For example, Figure 5 shows a mesh


imposed on Lake Superior.

56
Graph Partitioning Based Data
Decomposition
Figure 5 A mesh used to model Lake
Superior.

57
Graph Partitioning Based Data
Decomposition
• The simulation of a physical
phenomenon such the dispersion of a
water contaminant in the lake would
now involve computing the level of
contamination at each vertex of this
mesh at various intervals of time.

58
Graph Partitioning Based Data
Decomposition
• Since, in general, the amount of
computation at each point is the same,
the load can be easily balanced by
simply assigning the same number of
mesh points to each process.
• However, if a distribution of the mesh
points to processes does not strive to
keep nearby mesh points together, then
it may lead to high interaction
overheads due to excessive data
sharing. 59
Graph Partitioning Based Data
Decomposition
• For example, if each process receives
a random set of points as illustrated in
Figure 6, then each process will need
to access a large set of points
belonging to other processes to
complete computations for its
assigned portion of the mesh.

60
Partitioning the Graph of Lake
Superior

Random Partitioning

Figure 6 A random distribution of the mesh


elements to eight processes
61
Graph Partitioning Based Data
Decomposition
• Ideally, we would like to distribute the
mesh points in a way that balances the
load and at the same time minimizes
the amount of data that each process
needs to access for completing its
computations.
• Therefore, we need to partition the mesh into p parts such that each part contains roughly the same number of mesh points or vertices, and the number of edges that cross partition boundaries is minimized.
62
Graph Partitioning Based Data
Decomposition
• Finding an exact optimal partition is an NP-complete problem. However, algorithms based on powerful heuristics are available that compute reasonable partitions.

• After partitioning the mesh in this


manner, each one of these p partitions
is assigned to one of the p processes.

63
Graph Partitioning Based Data
Decomposition
• As a result, each process is assigned a
contiguous region of the mesh such
that the total number of mesh points
that needs to be accessed across
partition boundaries is minimized.

• Figure 7 shows a good partitioning of


the Lake Superior mesh - the kind that
a typical graph partitioning software
would generate.
64
Partitioning the Graph of Lake
Superior

Figure 7 Partitioning for minimum edge-cut: A distribution


of the mesh elements to eight processes, by using a graph-
partitioning algorithm.
65
Mappings Based on Task Partitioning
Partitioning a given task-dependency graph, with
tasks of known sizes, across processes.

Determining an optimal mapping for a general


task-dependency graph is an NP-complete
problem.
Ex.
•Task-dependency graph that is a perfect binary
tree.
•Mapping on a hypercube.

A good mapping should also minimize the interaction overhead, for example by mapping interdependent tasks onto the same process where possible.
66
Task Partitioning: Mapping a Binary Tree
Dependency Graph
Example illustrates the dependency graph of one view of
quick-sort and how it can be assigned to processes in a
hypercube.
0

0 4

0 2 4 6

0 1 2 3 4 5 6 7

67
Task Partitioning: Mapping a Sparse Graph
Sparse graph for computing a sparse matrix-vector product, and its mapping.
[Figure: a sparse matrix A and vector b with rows 0-11 distributed over three processes. With the straightforward mapping, Process 0: C0 = (4,5,6,7,8); Process 1: C1 = (0,1,2,3,8,9,10,11); Process 2: C2 = (0,4,5,6). Partitioning the task-interaction graph instead reduces the interaction overhead: C0 = (1,2,6,9), C1 = (0,5,6), C2 = (1,2,4,5,7,8).]
Reducing interaction overhead in sparse matrix-vector multiplication.
68
Hierarchical Mappings
•Sometimes a single mapping technique is
inadequate.

•For example, the task mapping of the binary


tree (quicksort) cannot use a large number
of processors.

•For this reason, task mapping can be used


at the top level and data partitioning within
each level.
69
Hierarchical Mapping: Example
An example of task partitioning at the top level with data partitioning at the lower level.
[Figure: a binary task tree in which each task is handled by a group of processes (P0-P7); within each group, the data is partitioned among the processes.]
Quicksort has a task-dependency graph that is an ideal candidate for such a hierarchical mapping.
70
Questions on the topic

Explain dynamic mapping


Explain cyclic and block cyclic mapping

71
Schemes for Dynamic Mapping
• Dynamic mapping is sometimes also referred to as
dynamic load balancing, since load balancing is
the primary motivation for dynamic mapping.
• Dynamic mapping is used in :
• where highly imbalanced distribution caused by
static mapping
• where task-dependency graph itself is dynamic
by nature

• Dynamic mapping schemes can be centralized or


distributed.

72
Dynamic Mapping with Centralized Schemes
Processes are managed in a master-slave fashion.
When a process runs out of work, it
requests the master for more work.
When the number of processes
increases, the master may become the
bottleneck.
To overcome this, a process may pick up
a number of tasks (a chunk) at one
time. This is called Chunk scheduling.
73
Dynamic Mapping with Centralized Schemes

Selecting large chunk sizes may lead to


significant load imbalances as well.
A number of schemes have been used
to gradually decrease chunk size as
the computation progresses.

74
Distributed Dynamic Mapping
• Each process can send or receive work
from other processes.

• This overcomes the bottleneck in


centralized schemes.

75
Distributed Dynamic Mapping
•There are four critical questions:
ohow are sending and receiving
processes paired together,
owho initiates work transfer,
ohow much work is transferred, and
owhen is a transfer triggered?

• Answers to these questions are generally


application specific.

76
Methods for Containing Interaction Overhead

• Parallel programs perform efficiently when the interactions among the concurrent tasks are handled efficiently.
• Many factors are responsible for increased interaction overhead:
- the amount of data exchanged during an interaction
77
Methods for Containing Interaction Overhead

- the spatial and temporal pattern of the interaction
• Some of these interaction overheads are already reduced by the choice of decomposition and mapping schemes.

78
Minimizing Interaction Overheads
•Maximize data locality: Where possible,
reuse intermediate data. Restructure
computation so that data can be reused
in smaller time windows.

•Minimize volume of data exchange:


There is a cost associated with each
word that is communicated. For this
reason, we must minimize the volume
of data communicated.
79
Minimizing Interaction Overheads
Minimize frequency of interactions:
There is a startup cost associated with
each interaction. Therefore, try to merge
multiple interactions to one, where
possible. Ex. Sorting of an array before
distributing.

•Minimize contention and hot-spots: Use


decentralized techniques, replicate data
where necessary.
80
Minimizing Interaction Overheads
• Overlap communication with computation
– Uses non-blocking communications
– Multithreading, and prefetching can be used to
hide latencies.
• Replicate data or computations
– It may be less expensive to recalculate or
store redundantly than to communicate
• Use group communication instead of
point to point primitives
– They are more optimized generally

81
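A hedged C/MPI sketch of the first point above (overlapping communication with computation using non-blocking calls); the buffer names, neighbour ranks and the commented-out compute routines are illustrative placeholders, not part of the slides.

#include <mpi.h>

void exchange_and_compute(double *halo_in, double *halo_out, int m,
                          int left, int right, MPI_Comm comm)
{
    MPI_Request req[2];
    MPI_Irecv(halo_in,  m, MPI_DOUBLE, left,  0, comm, &req[0]);  /* post receive */
    MPI_Isend(halo_out, m, MPI_DOUBLE, right, 0, comm, &req[1]);  /* post send    */

    /* compute_interior();  work that does not need the incoming halo */

    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);   /* communication now complete */

    /* compute_boundary();  work that needed the received halo */
}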
Parallel Algorithm Models

82
Basic Communication Operations

V.B.More
MET’s IOE, BKC, Nashik

Thanks to Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar for providing slides

1
Topic Overview

1) One-to-All Broadcast and All-to-One


Reduction
2) All-to-All Broadcast and Reduction
3) All-Reduce and Prefix-Sum Operations
4) Scatter and Gather
5) All-to-All Personalized Communication
6) Circular Shift
7) Improving the Speed of Some
Communication Operations

2
Basic Communication Operations:
Introduction
• Computations and communication are
two important factors of any parallel
algorithm.

• Many interactions in practical parallel


programs occur in well-defined patterns
involving groups of processors.

• Efficient implementations of these


operations can improve performance,
reduce development effort and cost, and
improve software quality. 3
Basic Communication Operations:
Introduction
• Efficient implementations must leverage
underlying architecture. For this reason,
we refer to specific architectures here.

• We select a descriptive set of


architectures to illustrate the process of
algorithm design.

4
Basic Communication Operations:
Introduction
• Group communication operations are
built using point-to-point messaging
primitives.

• We assume an architecture in which communicating a message of size m over an uncongested network takes time ts + tw m.
• We use this as the basis for our analyses.


Where necessary, we take congestion
into account explicitly by scaling the tw
5
term.
Basic Communication Operations:
Introduction
• We assume that the network is
bidirectional and that communication is
single-ported.

6
One-to-All Broadcast and All-to-One
Reduction
• A processor has a piece of data (of size m)
it needs to send to every other processor.
• The dual of one-to-all broadcast is all-to-
one reduction.
• In all-to-one reduction, each processor has
m units of data. These data items must
be combined piece-wise (using some
associative operator, such as addition or
min), and the result made available at a
target processor.
7
One-to-All Broadcast and All-to-One
Reduction

One-to-all broadcast and all-to-one reduction


among p processors.

8
One-to-All Broadcast and All-to-One
Reduction
One-to-all broadcast
• One-to-All broadcast is the operation in
which a single processor send identical
data to all other processors.
• Most parallel algorithm often need this
operation.
• Consider data size m is to be sent to all
the processors.
• Initially only source processor has the
data.
• After completion of algorithm, there will be
a copy of initial data with each processor. 9
One-to-All Broadcast and All-to-One
Reduction
All-to-One Reduction
• All-to-One reduction is the operation in
which data from all processors are
combined at a single destination
processor.
• Various operations like sum, product, max,
min, avg of numbers can be performed by
all-to-one reduction operation.

10
One-to-All Broadcast and All-to-One
Reduction
All-to-One Reduction

• Each of the p processors has a buffer M containing m words.
• After completion of the algorithm, the ith word of the accumulated M is the sum, product, maximum, or minimum of the ith words of the original buffers.

11
One-to-All Broadcast and All-to-One
Reduction

One-to-all broadcast and all-to-one reduction


among p processors.

12
One-to-All Broadcast and All-to-One
Reduction on Rings
• Simplest way is to send p − 1 messages
from the source to the other p − 1
processors – this is not very efficient.
• Use recursive doubling: source sends a
message to a selected processor. We
now have two independent problems
defined over halves of machines.

• Reduction can be performed in an


identical fashion by inverting the process.
13
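Under the ts + tw m cost model introduced earlier, each of the log p recursive-doubling steps transfers one message of size m, so this one-to-all broadcast (and, by symmetry, the all-to-one reduction) takes roughly
T = (ts + tw m) log p.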
One-to-All Broadcast
[Figure: an eight-node ring, nodes 0-7; dotted arrows labeled with time steps 1, 2 and 3 show the recursive-doubling transfers.]
One-to-all broadcast on an eight-node ring. Node 0 is


the source of the broadcast. Each message transfer
step is shown by a numbered, dotted arrow from the
source of the message to its destination. The
number on an arrow indicates the time step during
which the message is transferred. 14
All-to-One Reduction

[Figure: the same eight-node ring; arrows labeled with time steps 1, 2 and 3 show the reduction messages converging on node 0.]

Reduction on an eight-node ring with node 0


as the destination of the reduction.
15
Broadcast and Reduction: Example

Consider the problem of multiplying a matrix


with a vector.

• The n × n matrix is assigned to an n × n


(virtual) processor grid. The vector is
assumed to be on the first row of
processors.

• The first step of the product requires a one-to-all broadcast of each vector element along the corresponding column of processors. This can be done concurrently for all columns.
16
Broadcast and Reduction: Example
• The processors compute local product of
the vector element and the local matrix
entry.

• In the final step, the results of these


products are accumulated to the first row
using n concurrent all-to-one reduction
operations along the columns (using the
sum operation).

17
Broadcast and Reduction: Matrix-Vector
Multiplication Example
[Figure: a 4 x 4 grid of processes P0-P15 holding the matrix; the input vector is broadcast from the first row of processes along the columns (one-to-all broadcast), each process computes its local product, and the results are accumulated by concurrent all-to-one reductions to form the output vector.]
One-to-all broadcast and all-to-one reduction in the multiplication of a 4 x 4 matrix with a 4 x 1 vector.
18
Broadcast and Reduction on a Mesh
• We can view each row and column of a
square mesh of p nodes as a linear array
of √p nodes.
• Broadcast and reduction operations can
be performed in two steps – the first step
does the operation along a row and the
second step along each column
concurrently.
• This process generalizes to higher
dimensions as well.
19
Broadcast and Reduction on a Mesh
• 2D square mesh with √p rows and √p
columns for one to all broadcast
operation.
• Firstly data is sent to remaining all √p -1
nodes in a row by source using one-to-all
broadcast operation.
• In second phase, the data is sent to the
respective column by one-to-all broadcast.
• Thus, each node of the mesh will have a copy of the initial message.

20
Broadcast and Reduction on a Mesh: Example
[Figure: one-to-all broadcast on a 16-node (4 x 4) mesh, with node 0 as the source; arrows are labeled with the time step of each transfer.]
First phase (steps 1 and 2), row data transfer: in step 1, node 0 sends to node 8; in step 2, node 0 sends to node 4 and node 8 sends to node 12. After this phase, nodes 0, 4, 8 and 12 of the first row have the data.
21
Broadcast and Reduction on a Mesh:
Example
[Figure: the same 16-node mesh; the second phase proceeds along the columns.]
Second phase (steps 3 and 4), column data transfer: in step 3, node 0 sends to node 2, node 4 to node 6, node 8 to node 10 and node 12 to node 14. In step 4, column 1: node 0 sends to node 1 and node 2 to node 3; column 2: node 4 to 5 and node 6 to 7; column 3: node 8 to 9 and node 10 to 11; column 4: node 12 to 13 and node 14 to 15.
22
Broadcast and Reduction on a Mesh:
Example
[Figure: the 16-node mesh with all transfer steps shown; after both phases, every node has the data.]
A similar process for one-to-all broadcast on a three-dimensional mesh can be carried out by treating the rows of nodes in each of the three dimensions as linear arrays.
23
Broadcast and Reduction on a Mesh:
Example
[Figure: the same 16-node mesh, used here for the reduction.]
The reduction process for a linear array can be carried out on two- and three-dimensional meshes as well, by reversing the direction and order of the messages.
24
Broadcast and Reduction on a Hypercube

• A hypercube with 2^d nodes can be
regarded as a d-dimensional mesh with
two nodes in each dimension.
• The mesh algorithm can be generalized to
a hypercube and the operation is carried
out in d (= log p) steps, one in each
dimension.

• Example of 8-node hypercube.

25
Broadcast and Reduction on a Hypercube

One-to-all broadcast on an eight-node (three-dimensional) hypercube.
The binary representations of node labels are shown in
parentheses. 26
Broadcast and Reduction on a Hypercube

• Each node is identified by a unique 3-bit
binary label.
• Communication starts along the highest
dimension, i.e. the dimension specified by the
MSB of the node label. For example, in step 1,
node 0 (000) sends data to node 4 (100), which
differs from it in the highest dimension.

27
Broadcast and Reduction on a Hypercube:
Example

● In the next step, communication will be


done for lower dimension.
● The source and the destination nodes in
three communication steps of the
algorithm are similar to the nodes in the
broadcast algorithm on a linear array.
● Hypercube broadcast will not suffer from
congestion

28
Broadcast and Reduction on a Balanced
Binary Tree
• Consider a binary tree in which processors
are (logically) at the leaves and internal
nodes are routing nodes i.e. switching units.
• The communicating nodes have the same
labels as in the hypercube
• The communication pattern will be same as
that of hypercube algorithm.
• There will not be any congestion on any of
the communication link at any time.

29
Broadcast and Reduction on a Balanced
Binary Tree

One-to-all broadcast on an eight-node tree; the processing nodes 0-7 are
at the leaves and the numbered arrows give the communication steps. 31

• Different paths pass through different numbers of switching nodes,
which makes the communication characteristics different from those of
the hypercube.
• E.g. assume that the source processor is the root of this tree. In the
first step, the source sends the data to the right child (assuming the
source is also the left child). The problem has now been decomposed into
two problems with half the number of nodes. 32
Broadcast and Reduction Algorithms
• All of the algorithms described so far are
adaptations of the same algorithmic
template.

• We illustrate the algorithm for a hypercube,


but the algorithm can be adapted to other
architectures.

• The hypercube has 2^d nodes and my_id is


the label for a node.
• X is the message to be broadcast, which
initially resides at the source node 0. 33
Broadcast and Reduction Algorithms

1. procedure GENERAL_ONE_TO_ALL_BC(d, my_id, source, X)
2. begin
3.    my_virtual_id := my_id XOR source;
4.    mask := 2^d − 1;
5.    for i := d − 1 downto 0 do                /* Outer loop */
6.       mask := mask XOR 2^i;                  /* Set bit i of mask to 0 */
7.       if (my_virtual_id AND mask) = 0 then
8.          if (my_virtual_id AND 2^i) = 0 then
9.             virtual_dest := my_virtual_id XOR 2^i;
10.            send X to (virtual_dest XOR source);
               /* Convert virtual_dest to the label of the physical destination */
11.         else
12.            virtual_source := my_virtual_id XOR 2^i;
13.            receive X from (virtual_source XOR source);
               /* Convert virtual_source to the label of the physical source */
14.         endelse;
15.   endfor;
16. end GENERAL_ONE_TO_ALL_BC

One-to-all broadcast of a message X initiated by source on a
d-dimensional p-node hypercube, where d = log p.
36
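To make the schedule concrete, here is a minimal Python sketch (not part of the original slides; names and the synchronous-step model are assumptions) that simulates which node forwards the message in each step of GENERAL_ONE_TO_ALL_BC.

# Illustrative simulation of the hypercube one-to-all broadcast schedule.
# Only the message movement is modelled; startup and transfer times are ignored.
def one_to_all_broadcast(d, source, X):
    p = 1 << d                              # number of nodes, p = 2^d
    data = {source: X}                      # node label -> received message
    for i in range(d - 1, -1, -1):          # outer loop, dimension d-1 down to 0
        mask = (1 << i) - 1                 # lower i bits set
        for node in range(p):
            v = node ^ source               # virtual id relative to the source
            if v & mask == 0 and v & (1 << i) == 0 and node in data:
                dest = (v ^ (1 << i)) ^ source   # physical label of the partner
                data[dest] = data[node]          # "send X to partner"
    return data

if __name__ == "__main__":
    result = one_to_all_broadcast(3, source=5, X="hello")
    assert all(result[n] == "hello" for n in range(8))
    print("all 8 nodes received the message in 3 steps")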
Broadcast and Reduction Algorithms

1. procedure ALL_TO_ONE_REDUCE(d, my_id, m, X, sum)
2. begin
3.    for j := 0 to m − 1 do sum[j] := X[j];
4.    mask := 0;
5.    for i := 0 to d − 1 do
         /* Select nodes whose lower i bits are 0 */
6.       if (my_id AND mask) = 0 then
7.          if (my_id AND 2^i) ≠ 0 then
8.             msg_destination := my_id XOR 2^i;
9.             send sum to msg_destination;
10.         else
11.            msg_source := my_id XOR 2^i;
12.            receive X from msg_source;
13.            for j := 0 to m − 1 do
14.               sum[j] := sum[j] + X[j];
15.         endelse;
16.      mask := mask XOR 2^i;               /* Set bit i of mask to 1 */
17.   endfor;
18. end ALL_TO_ONE_REDUCE

Single-node accumulation on a d-dimensional hypercube. Each node contributes
a message X containing m words, and node 0 is the destination. 37
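As a sanity check of the reduction schedule, the following Python sketch (an illustrative simulation under the same assumptions as above, not the slides' own code) applies the same partner selection with node 0 as the destination and verifies the element-wise sums.

# Illustrative simulation of ALL_TO_ONE_REDUCE on a 2^d-node hypercube.
# Each node contributes a vector of m words; node 0 ends with the element-wise sum.
def all_to_one_reduce(d, vectors):
    p = 1 << d
    sums = [list(v) for v in vectors]            # per-node partial sums
    for i in range(d):
        mask = (1 << i) - 1                      # only nodes whose lower i bits are 0 take part
        for node in range(p):
            if node & mask == 0 and node & (1 << i) != 0:
                dest = node ^ (1 << i)           # partner with bit i cleared
                sums[dest] = [a + b for a, b in zip(sums[dest], sums[node])]
    return sums[0]

if __name__ == "__main__":
    vecs = [[n, 2 * n] for n in range(8)]        # 8 nodes, m = 2 words each
    print(all_to_one_reduce(3, vecs))            # [28, 56]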
Cost Analysis

• The one-to-all broadcast or all-to-one
reduction procedure involves log p point-
to-point simple message transfers.

• Each message transfer will have a time


cost of ts + twm.

• The total time is therefore given by:

T = (ts + twm) log p.


38
Questions based on one-to-all Broadcast
and all-to-one Reduction
• Explain one-to-all broadcast and all-to-
one reduction operation in brief.
• Explain with example one-to-all
broadcast and all-to-one reduction
operation on ring.
• Explain how matrix-vector multiplication
can be performed using one-to-all
broadcast and all-to-one reduction
operation.
• Explain one-to-all broadcast operation on
16-node mesh.
39
Questions based on one-to-all Broadcast
and all-to-one Reduction

• Write and explain algorithm of one-to-all


broadcast on a hypercube network.
• Explain one-to-all broadcast algorithm
for arbitrary source on d-dimensional
hypercube.
• Explain all-to-one reduction operation on
d-dimensional hypercube.

40
All-to-All Broadcast and Reduction

• Generalization of broadcast in which


each processor is the source as well as
destination.
• All-to-all broadcast operation is used in
matrix operations like matrix
multiplication and matrix-vector
multiplication.
• In all-to-all broadcast operation, all p
nodes simultaneously broadcast the
message.
41
All-to-All Broadcast and Reduction

• Note that, a process sends same


message of m-word to all the
processes, but it is not compulsory that
every process should send same
message. Different processes can
broadcast different messages.
• In all-to-all reduction, every node is the
destination of its own reduction.

42
All-to-All Broadcast and Reduction

All-to-all broadcast and all-to-all reduction.

43
All-to-All Broadcast and Reduction on a Ring
• All-to-all Broadcast:
• Simplest approach: perform p one-to-all
broadcasts. This is not the most efficient way,
though.
• Performed one after another, this would take p
times as long as a single broadcast.
• Communication links can be used more
efficiently by performing all p one-to-all
broadcasts simultaneously.
• In this case, all messages traversing the same
path at the same time are concatenated into a
single message.
• The algorithm terminates in p − 1 steps.
44
All-to-All Broadcast and Reduction on a Ring

• Linear Array and Ring:


• Each node first sends the data to one of its
neighbors it needs to broadcast.
• In subsequent steps, it forwards the data
received from one of its neighbors to its
other neighbor.
• This process is continued in subsequent
steps so all the communication links can be
kept busy.
• As the communication is performed circularly
in a single direction, each node receives all
(p-1) pieces of information from the other nodes
in (p-1) steps.
45
All-to-All Broadcast and Reduction on a Ring

[Figure: all-to-all broadcast on an eight-node ring, showing the 1st, 2nd and
7th communication steps. Each arrow is labelled n (m), where n is the time
step and m identifies the message being forwarded; the set next to each node
lists the messages it has accumulated so far.]

All-to-all broadcast on an eight-node ring.
49
All-to-All Broadcast and Reduction on a Ring

• Detailed Algorithm
• At every node, my_msg contains initial message
to be broadcast.
• At the end of the algorithm, all p messages are
collected at each node.

50
All-to-All Broadcast and Reduction on a Ring

1. procedure ALL_TO_ALL_BC_RING(my_id, my_msg, p, result)


2. begin
3. left := (my_id − 1) mod p;
4. right := (my_id + 1) mod p;
5. result := my_msg;
6. msg := result;
7. for i := 1 to p − 1 do
8. send msg to right;
9. receive msg from left;
10. result := result ∪ msg;
11. endfor;
12. end ALL_TO_ALL_BC_RING

All-to-all broadcast on a p-node ring.

All-to-all reduction is simply a dual of this operation and


can be performed in an identical fashion. 51
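A hedged Python sketch of this procedure (a synchronous simulation with illustrative names, not the slides' own code) is shown below; after p − 1 steps every node's result set contains all p messages.

# Illustrative simulation of ALL_TO_ALL_BC_RING (synchronous steps assumed).
def all_to_all_bc_ring(msgs):
    p = len(msgs)
    result = [{i: msgs[i]} for i in range(p)]    # result := my_msg
    outgoing = [dict(r) for r in result]         # msg := result
    for _ in range(p - 1):
        incoming = [None] * p
        for node in range(p):
            right = (node + 1) % p
            incoming[right] = outgoing[node]     # send msg to right neighbour
        for node in range(p):
            result[node].update(incoming[node])  # result := result U msg
            outgoing[node] = incoming[node]      # forward what was just received
    return result

if __name__ == "__main__":
    res = all_to_all_bc_ring([f"m{i}" for i in range(8)])
    assert all(len(r) == 8 for r in res)
    print("every node collected all 8 messages in 7 steps")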
All-to-all Broadcast on a Mesh
• Performed in two phases –
• in the first phase, each row of the mesh performs an
all-to-all broadcast using the procedure for the linear
array.

• In this phase, all nodes collect √p messages


corresponding to the √p nodes of their respective rows.
Each node consolidates this information into a single
message of size m√p.

• The second communication phase is a columnwise all-


to-all broadcast of the consolidated messages.

52
All-to-all Broadcast on a Mesh

All-to-all broadcast on a 3 × 3 mesh. The groups of nodes
communicating with each other in each phase are enclosed by dotted
boundaries. Panels: (a) initial data distribution, (b) data distribution after
the rowwise broadcast, (c) final result of the all-to-all broadcast on the mesh.
By the end of the second phase, all nodes get (0,1,2,3,4,5,6,7,8)
(that is, a message from each node).

• After completion of the second phase each node obtains all p pieces of
m-word data, i.e. all nodes hold the messages (0,1,2,3,4,5,6,7,8), one
from each node.
53
All-to-all Broadcast on a Mesh

(a) Initial data distribution: node i holds message (i).
(b) Data distribution after the rowwise broadcast: each node holds the
messages of its entire row, e.g. (0,1,2) in the bottom row.
(c) Final result of the all-to-all broadcast on the mesh: every node holds
(0,1,2,3,4,5,6,7,8).
56
All-to-all Broadcast on a Mesh

• All-to-all broadcast on a 3 × 3 mesh. The


groups of nodes communicating with each
other in each phase are enclosed by dotted
boundaries.
• By the end of the second phase, all nodes
get (0,1,2,3,4,5,6,7,8) (that is, a message from
each node).
• After completion of second phase each node
obtains all p pieces of m-word data i.e. all
nodes will get (0,1,2,3,4,5,6,7,8) message
from each node.

57
All-to-all Broadcast on a Mesh
1. procedure ALL_TO_ALL_BC_MESH(my_id, my_msg, p, result)
2. begin
/* Communication along rows */
3. left := my_id − (my_id mod √p) + (my_id − 1)mod√p;
4. right := my_id − (my_id mod √p) + (my_id + 1) mod √p;
5. result := my_msg;
6. msg := result;
7. for i := 1 to √p − 1 do
8. send msg to right;
9. receive msg from left;
10. result := result ∪ msg;
11. endfor;
/* Communication along columns */
12.   up := (my_id − √p) mod p;
13.   down := (my_id + √p) mod p;
14. msg := result;
15. for i := 1 to √p − 1 do
16. send msg to down;
17. receive msg from up;
18. result := result ∪ msg;
19. endfor;
20. end ALL_TO_ALL_BC_MESH

All-to-all broadcast on a square mesh of p nodes. 58
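The index arithmetic in lines 3, 4, 12 and 13 can be checked with a small Python sketch (illustrative only; it assumes p is a perfect square and row-major node labels).

# Neighbour computation for a row-major sqrt(p) x sqrt(p) mesh, mirroring
# lines 3, 4, 12 and 13 of ALL_TO_ALL_BC_MESH.
import math

def mesh_neighbours(my_id, p):
    q = math.isqrt(p)                        # sqrt(p), assumed exact
    row_start = my_id - (my_id % q)
    left  = row_start + (my_id - 1) % q      # wraparound within the row
    right = row_start + (my_id + 1) % q
    up    = (my_id - q) % p                  # wraparound within the column
    down  = (my_id + q) % p
    return left, right, up, down

print(mesh_neighbours(4, 9))                 # node 4 of a 3x3 mesh -> (3, 5, 1, 7)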


All-to-all broadcast on a Hypercube

• The all-to-all broadcast operation can be performed on a
hypercube by extending the mesh algorithm to log p
dimensions.
• In each step, communication is carried out along a different
dimension. Figure (a) shows the first step, in which
communication takes place within each row.
• In figure (b), communication is carried out along the columns in
the second step.
• Pairs of nodes exchange data in each step.
• The received message is concatenated with the current data in
every step.
• A hypercube with bidirectional communication links is assumed.

59
All-to-all broadcast on a Hypercube

All-to-all broadcast on an eight-node hypercube.
(a) Initial distribution of messages: node i holds (i).
(b) Distribution before the second step: e.g. nodes 0 and 1 hold (0,1).
(c) Distribution before the third step: e.g. nodes 0-3 hold (0,1,2,3).
(d) Final distribution of messages: every node holds (0,...,7).
60
All-to-all broadcast on a Hypercube

1. procedure ALL_TO_ALL_BC_HCUBE(my_id, my_msg, d, result)


2. begin
3. result := my_msg;
4. for i := 0 to d − 1 do
5. partner := my_id XOR 2^i;
6. send result to partner;
7. receive msg from partner;
8. result := result ∪ msg;
9. endfor;
10. end ALL_TO_ALL_BC_HCUBE

All-to-all broadcast on a d-dimensional hypercube.

61
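A minimal Python sketch of this procedure (assuming synchronous pairwise exchanges; names are illustrative, not the original code) is:

# Illustrative simulation of ALL_TO_ALL_BC_HCUBE: in step i every node exchanges
# its accumulated set with the partner that differs from it in bit i.
def all_to_all_bc_hypercube(d, msgs):
    p = 1 << d
    result = [{i: msgs[i]} for i in range(p)]
    for i in range(d):
        snapshot = [dict(r) for r in result]        # both partners send their old data
        for node in range(p):
            partner = node ^ (1 << i)
            result[node].update(snapshot[partner])  # result := result U msg
    return result

if __name__ == "__main__":
    res = all_to_all_bc_hypercube(3, list("abcdefgh"))
    assert all(len(r) == 8 for r in res)
    print("all-to-all broadcast finished in 3 (= log p) steps")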
All-to-all broadcast on a Hypercube

All-to-all broadcast on an eight-node hypercube (panels (a)-(d) as above):
(a) initial distribution of messages, (b) distribution before the second step,
(c) distribution before the third step, (d) final distribution of messages,
in which every node holds (0,...,7).
65
All-to-all Broadcast

• Similar communication pattern to all-to-all


broadcast, except in the reverse order.
• On receiving a message, a node must combine it
with the local copy of the message that has the
same destination as the received message
before forwarding the combined message to the
next neighbor.
• As per the algorithm, the communication starts
from the lowest dimension.
• Variable i is used to represent the dimension.
• According to line 4, in the first iteration the value of i is 0.

66
All-to-all Broadcast

• In each iteration, nodes communicate in pairs.


• In line 5, the label of the partner node is
calculated by an XOR operation.
• For example, if my_id = 000, then in the first step partner = 000 XOR 001
= 001; in general, the partner differs in the ith bit.
• After communication of the data, each node
concatenates the received data with its own data
as shown in line 8.
• This concatenated message is then transmitted in
the next iteration.

67
All-to-all reduction on a Hypercube

1. procedure ALL_TO_ALL_RED_HCUBE(my_id, msg, d, result)
2. begin
3.    recloc := 0;
4.    for i := d − 1 downto 0 do
5.       partner := my_id XOR 2^i;
6.       j := my_id AND 2^i;
7.       k := (my_id XOR 2^i) AND 2^i;
8.       senloc := recloc + k;
9.       recloc := recloc + j;
10.      send msg[senloc .. senloc + 2^i − 1] to partner;
11.      receive temp[0 .. 2^i − 1] from partner;
12.      for j := 0 to 2^i − 1 do
13.         msg[recloc + j] := msg[recloc + j] + temp[j];
14.      endfor;
15.   endfor;
16.   result := msg[my_id];
17. end ALL_TO_ALL_RED_HCUBE

All-to-all reduction on a d-dimensional hypercube.
68
All-to-all Reduction

• The order and direction of messages is reversed


for all-to-all reduction operation.
• The buffers are used to send and accumulate the
received messages in each iteration.
• Variable senloc is used to give the starting
location of the outgoing message.
• Variable recloc is used to give the location where
the incoming message is added in each iteration.

69
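The input/output relation of all-to-all reduction can be summarized with a short Python sketch (this shows only what the operation computes, not the hypercube schedule; names and the example data are illustrative).

# Semantic sketch of all-to-all reduction: node dst ends up with the
# element-wise sum of the blocks destined for it from every node.
def all_to_all_reduce_semantics(blocks):
    # blocks[src][dst] is the m-word contribution of node src for node dst
    p = len(blocks)
    m = len(blocks[0][0])
    return [[sum(blocks[src][dst][w] for src in range(p)) for w in range(m)]
            for dst in range(p)]

blocks = [[[src * 10 + dst] for dst in range(4)] for src in range(4)]   # m = 1
print(all_to_all_reduce_semantics(blocks))     # [[60], [64], [68], [72]]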
Cost Analysis

• On a ring, the time is given by:
T = (ts + twm)(p − 1).

• On a mesh, the time is given by:
T = 2ts(√p − 1) + twm(p − 1).

• On a hypercube, we have:
T = Σ_{i=1}^{log p} (ts + 2^{i−1} twm) = ts log p + twm(p − 1).
70
Cost Analysis of all-to-all broadcast and all-to-all reduction operation

• On a ring, the time is given by:


(ts + twm)(p − 1).
• =>> All-to-all broadcast can be performed in (p-1)
communicating steps on a ring or a linear array
for nearest neighbors.
• Time taken in each step is (ts + twm) where ts is
the startup time of the message, tw is the per
word transfer time, and m is the size of
message,
• Total time taken for the operation is :
(ts + twm)(p − 1).
71
Cost Analysis of all-to-all broadcast and all-to-all reduction operation

• On a mesh, the time is given by:


2ts(√p − 1) + twm(p − 1).
=>> All-to-all broadcast can be performed on
mesh. The first phase of √p simultaneous all-to-
all broadcast will be completed in time
(ts + twm √p)(√p − 1).
For two dimensional square mesh of p-nodes, the
total time for all-to-all broadcast is addition of
time spent in each pass, and which is :
2ts(√p − 1) + twm(p − 1)

72
Cost Analysis of all-to-all broadcast and all-to-all reduction operation

• On a hypercube, the time is given by:

T = Σ_{i=1}^{log p} (ts + 2^{i−1} twm) = ts log p + twm(p − 1)

=>> For a p-node hypercube, the time taken by a pair of nodes to send and
receive messages in the ith step is ts + 2^{i−1} twm.

The total time taken for the operation is given by:

T = Σ_{i=1}^{log p} (ts + 2^{i−1} twm)
  = ts log p + twm(p − 1).
73
All-to-all broadcast: Notes

• All of the algorithms presented


here are asymptotically optimal in
message size.

• That is, twm(p-1) is the term


associated with each architecture.

74
All-to-all broadcast: Notes

•It is not possible to port/map


algorithms for higher dimensional
networks (such as a hypercube) into
a ring because this would cause
contention.
• For large messages, a ring is as good as a
hypercube, because the twm(p − 1) term
dominates the cost on both.

75
All-to-all broadcast: Notes

[Figure: an 8-node hypercube mapped onto an eight-node ring; several
hypercube messages contend for a single ring channel.]

Contention for a channel when the hypercube is mapped onto a ring.
76
All-to-all broadcast: Questions

1. Explain all-to-all broadcast and reduction


operation
2. Write algorithm and explain all-to-all broadcast
on eight node ring/hypercube.
3. Explain with example and algorithm all-to-all
broadcast on 3x3 mesh.
4. Explain all-to-all reduction on d-dimensional
hypercube.
5. Explain all-to-all broadcast on d-dimensional
hypercube.
6. Explain cost analysis of all-to-all broadcast
operation

77
All-Reduce and Prefix-Sum Operations

• In all-reduce, each node starts with a


buffer of size m and the final results
of the operation are identical buffers
of size m on each node that are
formed by combining the original p
buffers using an associative operator.

78
All-Reduce and Prefix-Sum Operations

•It is identical to all-to-one reduction


followed by a one-to-all broadcast.
•This formulation is not the most
efficient.
•Uses the pattern of all-to-all broadcast,
instead. The only difference is that
message size does not increase here.
•Time for this operation is
(ts + twm) log p.

79
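Because the message size stays fixed at m, the all-reduce can be written as a recursive-doubling exchange; the Python sketch below (an assumption-laden illustration, not the slides' code) shows the pattern for the sum operation.

# Illustrative all-reduce on a 2^d-node hypercube: the all-to-all broadcast
# pattern, but partners combine (add) instead of concatenating.
def all_reduce_hypercube(d, values):
    p = 1 << d
    buf = list(values)
    for i in range(d):
        old = list(buf)
        for node in range(p):
            buf[node] = old[node] + old[node ^ (1 << i)]   # combine with partner
    return buf

print(all_reduce_hypercube(3, list(range(8))))   # every node ends with 28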
All-Reduce and Prefix-Sum Operations

•It is different from all-to-all reduction,


in which p simultaneous all-to-one
reductions take place, each with a
different destination for the result.

80
The Prefix-Sum Operation

• Given p numbers n0, n1, . . . , np−1 (one


on each node), the problem is to
compute the sums sk = Σ_{i=0}^{k} ni for all k
between 0 and p − 1.
• Initially, nk resides on the node labeled
k, and at the end of the procedure, the
same node holds sk.

81
The Prefix-Sum Operation

s0 = n0
s1 = n0 + n1
s2 = n0 + n1 + n2
s3 = n0 + n1 + n2 + n3
S4 = n0 + n1 + n2 + n3 + n4
...
sk = n0 + n1 + n2 + .. + nk

82
The Prefix-Sum Operation

Initially, n0 resides on node 0, n1 resides


on node 1 and so on.

After completion of the prefix-sum
operation, every node holds the sum of
its predecessor nodes including itself.

T = (ts+twm) log p

83
The Prefix-Sum Operation

Computing prefix sums on an eight-node hypercube.
(a) Initial distribution of values.
(b) Distribution of sums before the second step.
(c) Distribution of sums before the third step.
(d) Final distribution of prefix sums: node k holds [0 + 1 + ... + k].

At each node, square brackets show the local prefix
sum accumulated in the result buffer and
parentheses enclose the contents of the outgoing
message buffer for the next step.
88
The Prefix-Sum Operation or Scan Operation

• The operation can be implemented


using the all-to-all broadcast kernel.

• We must account for the fact that in


prefix sums the node with label k uses
information from only the k-node
subset whose labels are less than or
equal to k.

89
The Prefix-Sum Operation or Scan Operation

•This is implemented using an


additional result buffer. The content of
an incoming message is added to the
result buffer only if the message
comes from a node with a smaller label
than the recipient node.

• The contents of the outgoing message


(denoted by parentheses in the figure)
are updated with every incoming
message.
90
The Prefix-Sum Operation or Scan Operation

• Prefix sum operation also uses the same


communication pattern which is used in all-to-all
broadcast and all reduce operations.
• The sum sk = Σ_{i=0}^{k} ni, for all k between 0 and
p-1, for p numbers n0, n1, .., np-1 (one on each node), is
calculated.
• E.g. if the original sequence of numbers is
<3, 1, 4, 0, 2> then the sequence of prefix sum is
<3, 4, 8, 8, 10>
i.e. 3+null=3, 3+1=4, 4+4=8, 8+0=8, 8+2=10

91
The Prefix-Sum Operation or Scan Operation

• At start the number nk will be present with node k.


After termination of algorithm, same node holds
sum sk.
• Instead of single number, each node will have a
buffer or a vector of size m and result will be sum
of elements of buffers.
• Each node contain additional buffer denoted by
square brackets to collect the correct prefix sum.
• After every communication step, the message
from a node with a smaller label than that of the
recipient node is added to the result buffer.

92
The Prefix-Sum Operation

1. procedure PREFIX_SUMS_HCUBE(my_id, my_number, d, result)


2. begin
3. result := my_number;
4. msg := result;
5. for i := 0 to d − 1 do
6. partner := my_id XOR 2^i;
7. send msg to partner;
8. receive number from partner;
9. msg := msg + number;
10. if (partner < my_id) then result := result + number;
11. endfor;
12. end PREFIX_SUMS_HCUBE

Prefix sums on a d-dimensional hypercube.


93
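The procedure can be checked with the following Python sketch (a synchronous simulation under illustrative assumptions, not the original code); for the values 0, 1, ..., 7 it returns the prefix sums 0, 1, 3, 6, 10, 15, 21, 28.

# Illustrative simulation of PREFIX_SUMS_HCUBE on a 2^d-node hypercube.
def prefix_sums_hypercube(d, numbers):
    p = 1 << d
    result = list(numbers)                   # result := my_number
    msg = list(numbers)                      # msg := result
    for i in range(d):
        new_msg, new_result = list(msg), list(result)
        for node in range(p):
            partner = node ^ (1 << i)
            new_msg[node] = msg[node] + msg[partner]        # msg := msg + number
            if partner < node:                              # only smaller labels contribute
                new_result[node] = result[node] + msg[partner]
        msg, result = new_msg, new_result
    return result

print(prefix_sums_hypercube(3, list(range(8))))
# -> [0, 1, 3, 6, 10, 15, 21, 28]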
The Prefix-Sum Operation

Computing prefix sums on an eight-node hypercube (panels (a)-(d) as above).

• At each node, square brackets show the local prefix sum accumulated
in the result buffer and parentheses enclose the contents of the outgoing
message buffer for the next step.
• These contents are updated with every incoming message.
• Not all of the messages received by a node contribute to its final
result; some of the messages it receives may be redundant.
94
Questions on Prefix sum operations

• Explain the difference between


all-to-all reduction and all reduce
operations.

• Explain with example prefix sum


operations.

95
Scatter and Gather

• In the scatter operation, a single node


sends a unique message of size m to every
other node.
• This is called a one-to-all personalized
communication.

• In the gather operation, a single node


collects a unique message from each node.
• It is the dual of scatter operation.

96
Scatter and Gather

• While the scatter operation is


fundamentally different from broadcast,
the algorithmic structure is similar, except
for differences in message sizes
(messages get smaller in scatter and stay
constant in broadcast).

• The gather operation is exactly the inverse


of the scatter operation.

97
Gather and Scatter Operations

Scatter and gather operations.

• Consider the example of 8-node hypercube


• The communication patterns of all-to-all
broadcast and scatter operation are identical, the
only difference is in size and contents of the
message.
• As in figure above, initially source node 0 will have
all the messages.
98
Gather and Scatter Operations

Scatter and gather operations.

99
Example of the Scatter Operation

(a) Initial distribution of messages: node 0 holds (0,1,2,3,4,5,6,7). In the
first communication step, node 0 transfers half of the messages to one of
its neighbours (node 4).
(b) Distribution before the second step: node 0 holds (0,1,2,3) and node 4
holds (4,5,6,7). In the next step, every process that holds data transfers
half of it to a neighbour that has not yet received any data.
(c) Distribution before the third step: nodes 0, 2, 4 and 6 hold (0,1), (2,3),
(4,5) and (6,7) respectively.
(d) Final distribution of messages: node i holds (i). The process involves
log p communication steps, one for each of the log p dimensions of the
hypercube.

The scatter operation on an eight-node hypercube.
103
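The halving-and-forwarding pattern can be sketched in Python as follows (a simulation under simplifying assumptions, not the slides' algorithm text); after log p steps each node holds exactly the message addressed to it.

# Illustrative simulation of the hypercube scatter: in the step for dimension i,
# every node that holds data passes the half destined for the other sub-cube
# to its neighbour across that dimension.
def scatter_hypercube(d, messages):
    p = 1 << d
    held = {0: dict(enumerate(messages))}        # node 0 initially owns everything
    for i in range(d - 1, -1, -1):
        for node in list(held):
            bit = 1 << i
            give = {dst: m for dst, m in held[node].items() if dst & bit}
            held[node] = {dst: m for dst, m in held[node].items() if not (dst & bit)}
            held[node ^ bit] = give              # hand the other half to the partner
    return {node: msgs[node] for node, msgs in held.items()}

if __name__ == "__main__":
    print(scatter_hypercube(3, [f"msg{j}" for j in range(8)]))
    # {0: 'msg0', 1: 'msg1', ..., 7: 'msg7'}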
Gather Operations

Scatter and gather operations.


• The gather operation is reverse of scatter operation.
• Every node will have m word message.
• In the first step, each odd numbered node sends its buffer to an
even numbered neighbor behind it.
• The neighbor node concatenates the received message with its
own buffer.
• In the next communication step only even numbered nodes
participate in communication.
• The nodes whose labels are multiples of four gather more data and
double the size of their data.
• This process continues until node 0 has gathered all the data.
104
Example of the Gather Operation

(a) Initial distinct messages: node i holds (i).
(b) Collection before the second step: even-numbered nodes hold pairs,
e.g. node 0 holds (0,1).
(c) Collection before the third step: node 0 holds (0,1,2,3) and node 4
holds (4,5,6,7).
(d) Final collection of messages: node 0 holds (0,1,2,3,4,5,6,7).

The gather operation on an eight-node hypercube.
108
Cost of Scatter and Gather

• There are log p steps, in each step, the machine size halves
and the data size halves.
• We have the time for this operation to be:
T = ts log p + twm(p − 1).
• This time is same for a linear array as well as a 2-D mesh.
• In scatter operation, at least m(p-1) data must be transmitted
out of the source node,
• and in gather operation at least m(p-1) data must be received
by the destination node.
• Therefore, twm(p-1) time, is the lower bound on the
communication in scatter and gather operations.

Topic Questions
• Explain Scatter and Gather operations with example.
109
All-to-All Personalized Communication

• All-to-all personalized communication operation can be


applied in variety of parallel algorithms such as Fast
Fourier Transform, matrix transpose, sample sort, and
some parallel database join operations
• Each node has a distinct message of size m for every
other node.
• This is opposite of all-to-all broadcast, in which each
node sends the same message to all other nodes.
• All-to-all personalized communication is also known
as total exchange.

110
All-to-All Personalized Communication

[Figure: before the operation, node i holds the messages M(i,0), M(i,1), ...,
M(i,p−1); after the operation, node j holds M(0,j), M(1,j), ..., M(p−1,j),
where M(i,j) is the message originating on node i and destined for node j.]

All-to-all personalized communication.
111
All-to-All Personalized Communication: Example

Consider the problem of transposing a matrix.

• Each processor contains one full row of the matrix.


• The transpose operation in this case is identical to
an all-to-all personalized communication
operation.
• Let A is n x n matrix, transpose of matrix A is AT.
• AT will have same size as A and AT[i,j] = A[j,i] for
0<= i, j < n.

112
All-to-All Personalized Communication: Example

Consider the problem of transposing a matrix.

• Considering 1D row major partitioning of array, n x


n matrix can be mapped onto n processors such
that each processor contains one full row of the
matrix.
• Each processor sends a distinct element of the
matrix to every other processor as
all-to-all personalized communication.
• For p processes, where p ≤ n, each process will
have n/p rows (n^2/p elements of the matrix)
• For finding out the transpose, all-to-all
personalized communication of matrix blocks of
size n/p x n/p will be done.
113
All-to-All Personalized Communication: Example

[Figure: an n × n matrix partitioned row-wise onto processes P0-P3. After the
transpose, P0 holds [0,0],[1,0],[2,0],[3,0], P1 holds [0,1],[1,1],[2,1],[3,1],
..., and P3 holds [0,3],[1,3],[2,3],[3,3].]

All-to-all personalized communication in transposing a 4 × 4 matrix using four processes.

• Processor Pi contains the elements of the matrix with indices
[i,0], [i,1], .., [i,n-1].
• In the transpose AT, P0 holds the elements [i,0], P1 holds the elements
[i,1], and so on.
• Initially processor Pi holds element [i,j]; after the transpose, it
moves to Pj.
• The figure above shows the example of a 4 × 4 matrix mapped onto
four processes using one-dimensional row-wise partitioning.
114
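Viewed this way, the transpose is just a data remapping; the Python sketch below (illustrative, a sequential stand-in for the parallel exchange) shows the element movement for the 1-D row-wise partitioning with one row per process.

# Illustrative sketch: with one matrix row per process, the transpose is an
# all-to-all personalized exchange in which process i sends element [i, j]
# to process j.
def transpose_by_total_exchange(rows):
    n = len(rows)
    # the "message" from process i to process j is rows[i][j]
    return [[rows[i][j] for i in range(n)] for j in range(n)]

A = [[1, 2], [3, 4]]
print(transpose_by_total_exchange(A))   # [[1, 3], [2, 4]]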
All-to-All Personalized Communication on a Ring

• Each node sends (p − 1) pieces of


data of size m as one consolidated
message to one of its neighbors.
• These pieces are identified by label {x,
y}, where x is the label of the node
that originally owned the message,
and y is the label of the node that is
the final destination of the message.

115
All-to-All Personalized Communication on a Ring

• The label ({x1, y1}, {x2, y2}, . . . , {xn, yn})


indicates that a message is formed
by concatenating n individual
messages. For eg. ({0,1},{1,2},..,{4,5}).
• Each node extracts the information
meant for it from the message of size
m(p − 1) received, and forwards the
remaining (p − 2) data pieces of size
m each to the next node.

116
All-to-All Personalized Communication on a Ring

• The algorithm continued for (p − 1)


steps.
In (p − 1) steps every node receives
information from all the nodes in the
group.
• The size of the message reduces by m
at each step.
• In each step, each node extracts and keeps one
m-word packet (originating from a different
node) from the message it receives.
117
All-to-All Personalized Communication on a Ring

• All messages are sent in the same


direction.
• To reduce the communication cost
due to tw by factor of two, half of
the messages are sent in one
direction and remaining in reverse
direction to use communication
channels fully.
118
All-to-All Personalized Communication on a Ring

All-to-all personalized communication on a six-node ring, shown as
communication steps 1 through 5. The label of each message is of the form
{x, y}, where x is the label of the node that originally owned the message
and y is the label of the node that is the final destination of the message.
The label ({x1, y1}, {x2, y2}, . . . , {xn, yn}) indicates that a message is
formed by concatenating n individual messages.
124
All-to-All Personalized Communication on a Ring: Cost

• All-to-all personalized communication on a ring requires p − 1
communication steps in all.
• The size of the message transferred in the ith step is m(p − i).
• Therefore, the total time taken by this operation is given by:

T = Σ_{i=1}^{p−1} (ts + twm(p − i))
  = ts(p − 1) + twm Σ_{i=1}^{p−1} i
  = (ts + twmp/2)(p − 1).

• The tw term in this equation can be reduced by a factor of 2 by
communicating messages in both directions.
125
All-to-All Personalized Communication on a Mesh

• For all-to-all personalized


communication on mesh √p x √p,
each node first groups its p messages
according to the columns of their
destination nodes.
• Consider the example of 3 x 3 mesh.
• Each node have 9 m-word messages
one for each node.

126
All-to-All Personalized Communication on a Mesh

• For each node, three groups of three


messages are formed.
• The first group contains the messages
for destination nodes labelled 0, 3, and
6; the second group contains the
messages for nodes 1, 4, and 7; and
the last group of messages for nodes
labelled 2, 5, and 8.

127
All-to-All Personalized Communication on a Mesh

• After grouping, each row will contain


cluster of messages of size m√p.
• Each cluster contains information for
all the nodes of a column.
• Now in the first phase, all-to-all
personalized communication is
performed in each row.

128
All-to-All Personalized Communication on a Mesh

• After first phase, the messages


present with each node are sorted
again according to the rows of their
destination nodes.
• In the second phase, similar
communication is carried out.
• After completion of second phase,
node i will have the messages ({0,i},..,
{8,i}) where 0 <= i <= 8. So each node
will have the a message from every
other node. 129
All-to-All Personalized Communication on a Mesh

The label of each message is of the form


{x, y}, where x is the label of the node that
originally owned the message, and y is the label
of the node that is the final destination of the
message.
The distribution of messages at the beginning
of each phase of all-to-all personalized
communication on a 3 × 3 mesh. At the end of
the second phase, node i has messages ({0,i}, . . .
,{8,i}), where 0 ≤ i ≤ 8. The groups of nodes
communicating together in each phase are
enclosed in dotted boundaries.
130
All-to-All Personalized Communication on a Mesh

(a) Data distribution at the beginning of the first phase 131


All-to-All Personalized Communication on a Mesh

(b) Data distribution at the beginning of the second phase 132


All-to-All Personalized Communication on a Mesh

[Figure: final data distribution after the second phase on the 3 × 3 mesh;
node i holds the messages ({0,i},{1,i}, . . . ,{8,i}), e.g. node 0 holds
({0,0},{1,0},...,{8,0}).]

(c) Final data distribution after the second phase
133
All-to-All Personalized Communication on a Mesh: Cost

• Time for the first phase is identical to that in a


ring with √p processors, i.e.,
(ts + twmp/2)(√p − 1).

• The time in the second phase is identical to the first
phase. Therefore, the total time is twice this time, i.e.,

T = (2ts + twmp)(√p − 1).

134
All-to-All Personalized Communication on a Mesh: Cost

• It is noted that, time required for sorting the


messages by row and column is not considered in
calculation of T.
• It is assumed that the data is ready for first
communication phase, so in second communication
phase, the rearrangement of mp words of data is
done.
• Let tr is the time to read and write single word data
in a node’s local memory.
• So, total time spent in data rearrangement by a node
in complete process is tr x m x p.
• This time is very small as compared to
communication time T above.
135
All-to-All Personalized Communication on a
Hypercube

• Generalize the mesh algorithm to log p steps.


• At any stage in all-to-all personalized
communication on p node hypercube, every
node holds p packets of size m each.
• While communicating in a particular
dimension, every node sends p/2 of these
packets (consolidated as one message).
• A node must rearrange its messages locally
before each of the log p communication steps
takes place.
• In each step, the data is exchanged by pairs of
nodes for a different dimension.
136
All-to-All Personalized Communication on a
Hypercube

An all-to-all personalized communication algorithm on a three-
dimensional hypercube.
(a) Initial distribution of messages: node i holds ({i,0} ... {i,7}).
(b) Distribution before the second step.
(c) Distribution before the third step.
(d) Final distribution of messages: node j holds ({0,j} ... {7,j}).
141
All-to-All Personalized Communication on a
Hypercube: Cost

• We have log p iterations and mp/2 words are


communicated in each iteration. Therefore, the
cost is:
T = (ts + twmp/2) log p.

• This is not optimal!

142
All-to-All Personalized Communication on a
Hypercube: Optimal Algorithm

• Each node simply performs p − 1 communication


steps, exchanging m words of data with a
different node in every step.

• A node must choose its communication partner in


each step so that the hypercube links do not
suffer congestion.

143
All-to-All Personalized Communication on a
Hypercube: Optimal Algorithm

• In the jth communication step, node i exchanges


data with node (i XOR j).

• In this schedule, all paths in every communication


step are congestion-free, and none of the
bidirectional links carry more than one message
in the same direction.

144
All-to-All Personalized Communication on a
Hypercube: Optimal Algorithm

[Figure panels (a)-(g): in step j every node i exchanges data with node
i XOR j.]

Seven steps in all-to-all personalized communication on an
eight-node hypercube.
145
All-to-All Personalized Communication on a
Hypercube: Optimal Algorithm

1. procedure ALL_TO_ALL_PERSONAL(d, my_id)
2. begin
3.    for i := 1 to 2^d − 1 do
4.    begin
5.       partner := my_id XOR i;
6.       send M_{my_id, partner} to partner;
7.       receive M_{partner, my_id} from partner;
8.    endfor;
9. end ALL_TO_ALL_PERSONAL

A procedure to perform all-to-all personalized communication on a d-
dimensional hypercube. The message M_{i,j} initially resides on node i
and is destined for node j.
146
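The schedule can be simulated with a few lines of Python (an illustrative sketch, not the source's implementation); the assertion checks that after p − 1 steps node n holds the message destined for it from every source s.

# Illustrative simulation of ALL_TO_ALL_PERSONAL: in step j, node i exchanges
# one m-word message with node i XOR j.
def all_to_all_personal(d, M):
    # M[i][j] is the message that initially resides on node i, destined for node j
    p = 1 << d
    received = [{i: M[i][i]} for i in range(p)]      # each node keeps its own block
    for step in range(1, p):                         # p - 1 exchange steps
        for node in range(p):
            partner = node ^ step
            received[node][partner] = M[partner][node]   # receive M(partner, node)
    return received

if __name__ == "__main__":
    M = [[f"{i}->{j}" for j in range(8)] for i in range(8)]
    res = all_to_all_personal(3, M)
    assert all(res[n][s] == f"{s}->{n}" for n in range(8) for s in range(8))
    print("total exchange complete in 7 steps")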
All-to-All Personalized Communication on a
Hypercube: Cost Analysis of Optimal Algorithm

• There are p − 1 steps and each step involves non-


congesting message transfer of m words.
• We have:

T=(ts + twm)(p − 1).

• This is asymptotically optimal in message size.

147
Circular Shift
• Circular shift can be applied in some
matrix computations and in string and
image pattern matching.
• It is a member of a broader class of global
communication operations known as permutation.
• In a permutation, every node sends a message of
m words to a unique node.
• In a circular q-shift, node i sends data to node
(i + q) mod p in a group of p nodes, where 0 < q < p.
148
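Before looking at specific topologies, the effect of the operation itself can be written down in a few lines of Python (a semantic sketch only, not a mesh or hypercube schedule).

# Semantic sketch of a circular q-shift on p nodes: the data that started on
# node i ends up on node (i + q) mod p.
def circular_shift(data, q):
    p = len(data)
    return [data[(i - q) % p] for i in range(p)]

print(circular_shift([0, 1, 2, 3, 4, 5, 6, 7], 5))
# [3, 4, 5, 6, 7, 0, 1, 2]: node i now holds the data from node (i - 5) mod 8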
Circular Shift on a Mesh

• Mesh algorithms for circular shift can be


derived by using ring algorithm.
• Wraparound connections are assumed in the
mesh, i.e. in a row of 4 nodes 0,1,2,3, node
3 can communicate and send data directly to node
0.
• Implementation can be performed by
min(q,p-q) neighbor-to-neighbor
communications in one direction where p
is number of nodes and q is the number of
shifts to be performed.
149
Circular Shift on a Mesh

• In p-node square wraparound mesh, for


nodes with row major labels, a circular q-
shift is performed in two stages.
• Example, q=5 shifts, p=16 (4x4 mesh)
• In the first stage, the data is shifted
simultaneously by (q mod √p) steps in all
the rows, i.e. by (5 mod √16) = 1 step in our example.
• In second phase, it is shifted by [q/√p]
steps along the columns.

150
Circular Shift on a Mesh

• Due to wraparound connection while


circular row shifts, the data moves from
highest to lowest labeled nodes of the row.
For e.g. data with node 3 will be shifted to
node 0 in the first row.
• Note that to compensate for the distance
√p that they lost while traversing the
backward edge in their respective rows,
the data packets must be shifted by an
additional step.
151
Circular Shift on a Mesh

• In the example, after the row shifts there is one
compensatory column shift, followed by the
remaining column shifts.

• Total time for any circular q-shift on a


p-node mesh using packets of size m is :
T = (ts + twm)(√p + 1).

152
Circular Shift on a Mesh

The communication steps in a circular 5-shift on a 4 × 4 mesh.
(a) Initial data distribution and the first communication step (row shifts).
(b) Step to compensate for the backward row shifts: data from node 3 was
supposed to shift to node 4, but due to the wraparound row shift it
arrived at node 0, so an extra column step compensates for this.
(c) Column shifts in the third communication step.
(d) Final distribution of the data.
156
Circular Shift on a Hypercube
• For the shift operation on a hypercube, a linear
array with 2^d nodes is mapped onto the
d-dimensional hypercube.
• Node i of the linear array is assigned to
node j of the hypercube, where j is the d-bit
binary Reflected Gray Code (RGC) of i.
• Consider the eight-node hypercube shown
in the figure: any two nodes at distance
2^i on the linear array are separated by exactly two links.
• The exception is i = 0, where the nodes are
directly connected and only one
hypercube link separates them. 157
Circular Shift on a Hypercube
• For a q-shift operation, q is expanded as a
sum of distinct powers of 2. For example,
the number 5 can be expanded as 2^2 + 2^0.
• Note that the number of terms in the sum equals
the number of 1's in the binary representation
of q, e.g. for 5 (101) there are two terms in the
sum, corresponding to bit position 2 and bit
position 0, i.e. 2^2 + 2^0.
• A circular q-shift on a hypercube is
performed in s phases, where s is the number
of distinct powers of 2 in the expansion.
158
Circular Shift on a Hypercube
• For example, a 5-shift operation is
performed as a 4-shift (2^2) followed by a
1-shift (2^0).
• Each shift takes two communication
steps; only a 1-shift takes a single step.
For example, the first phase of a 4-shift
consists of two steps and the second
phase (a 1-shift) consists of one step.
• The total number of steps for any q in a p-
node hypercube is at most 2 log p − 1.
159
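The decomposition of q into phases can be computed directly from its binary representation, as in the short Python sketch below (illustrative; it lists the powers of 2 from highest to lowest).

# Decompose a circular q-shift into sub-shifts that are powers of two,
# one per 1-bit in the binary representation of q.
def shift_phases(q):
    return [1 << k for k in reversed(range(q.bit_length())) if q & (1 << k)]

print(shift_phases(5))   # [4, 1]: a 5-shift = a 4-shift followed by a 1-shift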
Circular Shift on a Hypercube
• The time for this is upper bounded by:
T = (ts + twm)(2 log p − 1).
• If E-cube routing is used, this time can be
reduced to
T = ts + twm.

160
Circular Shift on a Hypercube

(a) The first phase (a 4-shift), carried out in two communication steps.
(b) The second phase (a 1-shift).
(c) Final data distribution after the 5-shift.

The mapping of an eight-node linear array onto a three-dimensional hypercube to
perform a circular 5-shift as a combination of a 4-shift and a 1-shift.
163
Circular Shift on a Hypercube

[Figure panels (a)-(g): circular q-shifts on an 8-node hypercube for
q = 1, 2, 3, 4, 5, 6, 7.]

Circular q-shifts on an 8-node hypercube for 1 ≤ q < 8. 167


Circular Shift on a Hypercube

Circular q-shifts on an 8-node hypercube for 1 ≤ q < 8 (panels (a)-(g) as
above). 168


Improving Performance of Operations

• Splitting and routing messages into parts: If the message can be split into p
parts, a one-to-all broadcast can be implemented as a scatter operation followed
by an all-to-all broadcast operation. The time for this is:

T = 2 × (ts log p + tw(p − 1)(m/p))
  ≈ 2 × (ts log p + twm).

• All-to-one reduction can be performed by performing an all-to-all reduction (dual of
all-to-all broadcast) followed by a gather operation (dual of scatter).

169
Improving Performance of Operations

• Since an all-reduce operation is semantically equivalent to an all-to-one reduction


followed by a one-to-all broadcast, the asymptotically optimal algorithms for
these two operations can be used to construct a similar algorithm for the all-
reduce operation.
• The intervening gather and scatter operations cancel each other. Therefore, an all-
reduce operation requires an all-to-all reduction and an all-to-all broadcast.

170
Improving Performance of Operations

• The communication algorithms are based on two assumptions:


• 1) original message cannot be divided into small parts
• 2) each node uses a single port for sending and receiving data

• We can analyse the effect of not following these two assumptions:

Splitting and Routing messages in parts:

171
University Questions on Unit 3
• August 2018 (Insem)
• 1. Explain Broadcast and Reduction example for
multiplying matrix with a vector.(6)
• 2. Explain the concept of scatter and gather (4)
• 3. Compare the one-to-all broadcast operation for Ring,
Mesh and Hypercube topologies (6)
• 4. Explain the prefix-sum operation for an eight-node
hypercube (4)

• Nov-Dec 2018 (Endsem)


• 1. Write a short note on All-to-one reduction with
suitable example. [6]

172
University Questions on Unit 3
• Nov-Dec 2019 (Endsem)
• 1 Explain term of all-to-all broadcast on linear array,
mesh & Hypercube
• topologies. [8]
• 2 Write short note on circular shift on a mesh. [6]

• Oct 2019 (Insem)


• 1. Explain broadcast & reduce operation with diagram.
[4]
• 2. Explain prefix- sum operation for an eight-node
hypercube. [6]
• 3. Explain scatter and gather operation? [4]
• 4. Explain all to one broadcast and reduction on a ring?
[6] 173
University Questions on Unit 3
• May-June-2019 (Endsem)
• 1. Explain Circular shift operation on mesh and
hypercube network. [8]

174
Parallel Algorithm Models

Prof V B More
MET’s IOE, BKC, Nashik
Parallel Algorithm Models
These models are used to specify details for
partitioning data and how these data are
processed.
A model is used to provide structure of
parallel algorithms based on two techniques:
• selection of partitioning and mapping
technique;
• appropriate use of technique for
minimization of interaction.

Prof V B More, MET’s IOE, BKC, Nashik


Parallel Algorithm Models
There are various parallel algorithm models :
• The data parallel model
• The task graph model
• The work pool model
• The master slave model
• The pipeline or producer consumer model
• Hybrid models

Prof V B More, MET’s IOE, BKC, Nashik


The Data-Parallel Model
• The tasks are statically or semi-statically
attached onto processes.
• Each task performs identical operations on
a variety of data.
• Single operations being applied on multiple
data items is called data parallelism (SIMD
model).
• The task may be executed in phases.
• Data is different in different phases.
Prof V B More, MET’s IOE, BKC, Nashik
The Data-Parallel Model.. contd
• Since all tasks perform same
computations, the decomposition of the
problem into tasks is usually based on data
partitioning to guarantee the load balance.
• Data-parallel algorithms can be
implemented in both shared-address-space
and message-passing paradigms.

Prof V B More, MET’s IOE, BKC, Nashik


The Data-Parallel Model.. contd
• Partitioned address-space and locality of
reference allow better control of placement
of data in message-passing interface
paradigm.
• If the distribution of data is different in
different phases, the shared-address space
paradigm reduces programming efforts.
• Interaction overheads in the data-parallel
model can be minimized by locality
Prof V B More, MET’s IOE, BKC, Nashik
The Data-Parallel Model.. contd
• A key characteristic of data-parallel
problems is that for most problems, the
degree of data parallelism increases with
the size of the problem, which leads to use
more processes to solve larger problems
effectively.
• An example of a data-parallel algorithm is
dense matrix multiplication problem.

Prof V B More, MET’s IOE, BKC, Nashik


Example: Multiplying a Dense Matrix with a Vector
A n b y
Task 1
01
Computation of each
2 element of output vector y
is independent of other
elements. Based on this, a
dense matrix-vector
product can be
decomposed into n tasks.
n-1 shaded portion of the
Task n matrix and vector is
accessed by Task 1.
Findings: While tasks share data (the vector b), they do not
have any control dependencies – i.e., no task needs to
wait for the (partial) completion of any other. All tasks are
of the same size in terms of number of operations.
8
Prof V B More, MET’s IOE, BKC, Nashik
The Task Graph Model
• The computations in any parallel algorithm
can be viewed as a task graph.
• The task graph may be either trivial or
nontrivial.
• The type of parallelism that is expressed by
the task graph is called task parallelism.
• In certain parallel algorithms, the task
graph is explicitly used in establishing
relationship between various tasks.
Prof V B More, MET’s IOE, BKC, Nashik
The Task Graph Model
• Interrelationships among the tasks are
utilized to promote locality or to reduce
interaction costs.
• This model is applied to solve problems in
which the amount of data associated with
the tasks is huge relative to the amount of
computation associated with them.

Prof V B More, MET’s IOE, BKC, Nashik


The Task Graph Model ..contd
• The tasks are mapped statically to optimize
the cost of data movement among tasks.
• Work is more easily shared in a globally
addressable space, but mechanisms are also
available to share work across disjoint address
spaces.

Prof V B More, MET’s IOE, BKC, Nashik


The Task Graph Model ..contd
• Typical interaction-reducing techniques
applicable to this model include reducing
the volume and frequency of interaction by
promoting locality while mapping the tasks
based on the interaction pattern of tasks.
• Asynchronous interaction methods are
used to overlap the interaction with
computation.
Prof V B More, MET’s IOE, BKC, Nashik
The Task Graph Model ..contd
• Examples of algorithms based on the task
graph model include parallel quicksort,
sparse matrix factorization, and many
parallel algorithms derived via divide-and-
conquer approach.

Prof V B More, MET’s IOE, BKC, Nashik


The Work Pool Model
• In the work pool or the task pool model, the
dynamic mapping of tasks onto processes
is performed for load balancing.
• The task may be executed by any process.
• There is no desired pre-mapping of tasks
onto processes.
• The mapping may be centralized or
decentralized.

Prof V B More, MET’s IOE, BKC, Nashik


The Work Pool Model
• Pointers to the tasks may be stored in a
physically shared list, priority queue, hash
table, tree, or in a physically distributed
data structure.
• The work may be statically available in the
beginning, or could be dynamically
generated; i.e., the processes may generate
work and add it to the global work pool.
Prof V B More, MET’s IOE, BKC, Nashik
The Work Pool Model ..contd
• If the work is generated dynamically and a
decentralized mapping is used, then a
termination detection algorithm is required
to detect completion of the entire program.
• In message-passing paradigm, the work
pool model is used when the amount of
data associated with tasks is relatively
small as compared to the computation
associated with it.
Prof V B More, MET’s IOE, BKC, Nashik
The Work Pool Model ..contd
• Therefore, tasks can be easily moved
around without causing too much data
interaction overhead.
• The granularity of the tasks can be
adjusted to obtain the desired tradeoff
between load imbalance and the
overhead of accessing the work pool for
adding and extracting tasks.
Prof V B More, MET’s IOE, BKC, Nashik
The Work Pool Model ..contd
• Parallelization of loops by chunk
scheduling or related methods is an
example of the use of the work pool model.
• Parallel tree search where the work is
represented by a centralized or distributed
data structure is an example of the use of
the work pool model where the tasks are
generated dynamically.
Prof V B More, MET’s IOE, BKC, Nashik
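A minimal sketch of the work pool model (not from the slides), assuming a thread-based shared task queue; all names here (worker, pool, results) are illustrative.

# Hypothetical sketch of the work pool model: tasks live in a shared queue and
# idle workers pull the next task dynamically (dynamic mapping for load balancing).
import queue, threading

def worker(pool, results):
    while True:
        try:
            task = pool.get_nowait()     # grab the next available task
        except queue.Empty:
            return                       # pool exhausted: this worker terminates
        results.append(task * task)      # stand-in for the real computation

pool = queue.Queue()
for t in range(20):                      # work is available statically at the start;
    pool.put(t)                          # workers could also pool.put() new tasks dynamically

results = []
threads = [threading.Thread(target=worker, args=(pool, results)) for _ in range(4)]
for th in threads: th.start()
for th in threads: th.join()
print(sorted(results))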
The Master-Slave Model
• In this master-slave or the manager-worker
model, one or more master processes
generate work and allocate it to slave
processes.
• The tasks may be allocated a priori if the
manager can estimate their sizes.
• Alternatively, random mapping can be performed
for better load balancing.
Prof V B More, MET’s IOE, BKC, Nashik
The Master-Slave Model
• Workers are assigned smaller pieces of
work at different times.
• Work may need to be performed in phases,
and the work in each phase must finish before
the work in the next phase can be generated.
• The manager may cause all workers to
synchronize after each phase.
• There is no desired pre-mapping of work to
processes, and any worker can do any job
assigned to it.

Prof V B More, MET’s IOE, BKC, Nashik


The Master-Slave Model ..contd
• The manager-worker model can be
generalized to the hierarchical or multi-level
manager-worker model in which the top
level manager feeds large chunks of tasks
to second-level managers, who further
subdivide the tasks among their own
workers and may perform part of the work
themselves.
• Care should be taken to ensure that the
master does not become a bottleneck.
Prof V B More, MET’s IOE, BKC, Nashik
The Master-Slave Model ..contd
• The granularity of tasks should be chosen
such that the cost of doing work dominates
the cost of communication and
synchronization.
• Asynchronous interaction is useful for
overlapping interaction with the computation
associated with work generation by the
master.
• It may also reduce the time workers spend waiting for work.
Prof V B More, MET’s IOE, BKC, Nashik
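A minimal master-worker sketch (not from the slides), using Python multiprocessing as a stand-in for real processes; the names worker and master and the chunk_size parameter are hypothetical.

# Hypothetical master-worker sketch: the master generates chunks of work and
# hands them to worker processes; chunk_size controls the task granularity.
from multiprocessing import Pool

def worker(chunk):
    # Each worker processes whatever chunk the master assigns to it.
    return sum(x * x for x in chunk)

def master(data, n_workers=4, chunk_size=8):
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with Pool(n_workers) as workers:
        # imap_unordered lets fast workers pick up new chunks as soon as they finish,
        # which helps balance the load when chunk costs vary.
        return sum(workers.imap_unordered(worker, chunks))

if __name__ == "__main__":
    print(master(list(range(100))))      # 328350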
Producer-Consumer Model
• In the producer-consumer (or pipeline)
model, a stream of data is passed through
a succession of processes, each of which
performs some task on it.
• The simultaneous execution of different
programs on a data stream is called
stream parallelism.
• The arrival of new data triggers the
execution of a new task by a process in
the pipeline.
Prof V B More, MET’s IOE, BKC, Nashik
Producer-Consumer Model
• The processes form a pipeline in the shape
of a linear or multidimensional array, a tree,
or a general graph with or without cycles.
• A pipeline is a chain of producers and
consumers.
• Each process in the pipeline can be
viewed as a consumer of a data stream for
the process preceding it in the pipeline
and as a producer of data for the process
following it in the pipeline.
Prof V B More, MET’s IOE, BKC, Nashik
Producer-Consumer Model ..contd
• The pipeline does not need to be a linear
chain; it can be a directed graph.
• The pipeline model usually involves a
static mapping of tasks onto processes.
• Load balancing is a function of task
granularity.
• The larger the granularity, the longer it
takes to fill up the pipeline, i.e. for the
trigger produced by the first process in the
chain to propagate to the last process,
thereby keeping some of the processes waiting.
Prof V B More, MET’s IOE, BKC, Nashik
Producer-Consumer Model ..contd
• However, too fine a granularity may increase
interaction overheads, because processes
will need to interact to receive fresh data
after smaller pieces of computation.
• The most common interaction reduction
technique applicable to this model is
overlapping interaction with computation.
An example of a two-dimensional pipeline
is the parallel LU factorization algorithm.

Prof V B More, MET’s IOE, BKC, Nashik
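A minimal sketch of the pipeline structure (not from the slides) using Python generators; this only illustrates the producer-consumer chaining. In a real setting each stage would run in its own process connected by queues; the stage names are hypothetical.

# Hypothetical producer-consumer (pipeline) sketch: each stage consumes the
# stream produced by the previous stage and produces a new stream for the next.
def produce(n):                 # stage 1: producer
    for i in range(n):
        yield i

def square(stream):             # stage 2: consumer of stage 1, producer for stage 3
    for x in stream:
        yield x * x

def accumulate(stream):         # stage 3: final consumer, emits running totals
    total = 0
    for x in stream:
        total += x
        yield total

for value in accumulate(square(produce(5))):
    print(value)                # 0, 1, 5, 14, 30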


Hybrid Models
• In some cases, more than one model may
be applicable to the problem at hand,
resulting in a hybrid algorithm model.
• A hybrid model may be composed either
of multiple models applied hierarchically
or multiple models applied sequentially to
different phases of a parallel algorithm.

Prof V B More, MET’s IOE, BKC, Nashik


Hybrid Models
• In some cases, an algorithm formulation
may have characteristics of more than one
algorithm model. For example, data may
flow in a pipelined manner in a pattern
guided by a task graph. In another case,
the major computation may be described
by a task graph, but each node of the
graph may represent a supertask consisting
of multiple subtasks that may be suitable
for data-parallel or pipelined parallelism.
Ex. Parallel quicksort.
Prof V B More, MET’s IOE, BKC, Nashik
The Age of Parallel
Processing

Prof V B More
MET-BKC IOE

1
The Age of Parallel Processing

• In recent years, much has been made
of the computing industry's widespread
shift to parallel computing.

• Nearly all consumer computers in the


year 2020 are manufactured with
multicore central processors.

2
The Age of Parallel Processing
• From dual-core, low-end notebook
machines to 8- and 16-core workstation
computers, consumer machines are no less
than the supercomputers or mainframes of the past.

• Command prompts are out and


multithreaded graphical interfaces are
in.

3
The Age of Parallel Processing
• Electronic devices such as mobile
phones and portable music players have
come to include parallel computing
capabilities to enhance their
performance.

• Cellular phones that only make calls


are out; phones that can simultaneously
play music, browse the Web, and provide
GPS services are in.
4
The Age of Parallel Processing

• As a result, software developers now


need to cope with a variety of parallel
computing platforms and technologies
in order to provide novel and rich
experiences for an increasingly
sophisticated base of users.

5
The Age of Parallel Processing

Evolution of the CPUs


• For 30 years, one of the important
methods for improving the
performance of consumer computing
devices has been to increase the speed
at which the processor's clock operates.

6
The Age of Parallel Processing
Evolution of the CPUs

• Starting with the first personal


computers of the early 1980s,
consumer CPUs ran with internal
clocks operating around 1 MHz.

7
The Age of Parallel Processing
Evolution of the CPUs

• 30 years later (2010), most desktop


processors have clock speeds between
1 GHz and 4 GHz, nearly 1000 times
faster than the clock on the original
personal computer.

8
The Age of Parallel Processing
Evolution of the CPUs

• Although increasing the CPU clock
speed is certainly not the only method
by which computing performance has
been improved, it has always been a
reliable source of improved
performance.

9
The Rise of GPU Computing
• A graphics processing unit (GPU) is a
specialized electronic circuit designed
to rapidly manipulate and alter memory
to accelerate the creation of images in
a frame buffer used for outputting to a
display device.

10
The Rise of GPU Computing
• GPUs are used in embedded systems,
mobile phones, personal computers,
workstations, research labs, and game
consoles.

• Modern GPUs are very efficient at


manipulating computer graphics and
image processing.

11
The Rise of GPU Computing
• Their highly parallel structure makes
them more efficient than general-
purpose CPUs for algorithms where the
processing of large blocks of data is
done in parallel.

12
The Rise of GPU Computing
• In a personal computer, a GPU can be
present on a video card, or it can be
embedded on the motherboard or in
certain CPUs - on the CPU die.

13
The Rise of GPU Computing
• In comparison to the central
processor's traditional data processing
pipeline, performing general-purpose
computations on a graphics
processing unit (GPGPU) is a relatively
new concept.

14
The Rise of GPU Computing
• In fact, GPU itself is relatively new
compared to the computing field at
large. However, the idea of computing
on graphics processors is not new.

15
A Brief History of GPUs, Early GPU
• We have already looked at how CPUs
evolved in both clock speeds and core
count.

• In the meantime, the state of graphics
processing underwent a dramatic
revolution.

16
A Brief History of GPUs, Early GPU

• In the late 1980s and early 1990s, the


growth in popularity of graphically
driven operating systems such as MS
Windows helped create a market for a
new type of processor.

17
A Brief History of GPUs, Early GPU

• In the early 1990s, users began purchasing
2D display accelerators for their
personal computers, with hardware-
assisted bitmap operations for their
graphical operating systems.

18
A Brief History of GPUs, Early GPU

• In the 1980s, Silicon Graphics used three-
dimensional graphics in a variety of
markets, including government and defense
applications and scientific and
technical visualization.

19
A Brief History of GPUs, Early GPU

• In 1992, Silicon Graphics opened the
programming interface to its hardware
by releasing the OpenGL library as a
standardized, platform-independent
method for writing 3D graphics
applications.

20
A Brief History of GPUs, Early GPU
• By the mid-1990s, computer-based first-
person games such as Doom, Duke
Nukem 3D, and Quake came to market.

21
A Brief History of GPUs, Early GPU

• In the mid-1990s, NVIDIA, ATI Technologies,
and 3dfx Interactive began releasing
graphics accelerators that were
affordable. NVIDIA's GeForce 256 used a
graphics pipeline architecture.

22
A Brief History of GPUs, Early GPU

• The term GPU was popularized by


NVIDIA in 1999, who marketed the
GeForce 256 as “the world's first GPU”.

23
A Brief History of GPUs, Early GPU

• NVIDIA's GeForce 3 series, released in
2001, was the computing industry's first
chip to implement Microsoft's DirectX
8.0 standard (which was very new at
that time).

24
A Brief History of GPUs, Early GPU

• Rival ATI Technologies coined the term


“visual processing unit” or VPU with the
release of the Radeon 9700 in 2002.

25
A Brief History of GPUs, Early GPU

NVIDIA Comparative Chart

26
NVIDIA GPU Development History
Basic Communication Operations

V.B.More
MET’s IOE, BKC, Nashik

Thanks to Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar for providing slides

1
Topic Overview

1) One-to-All Broadcast and All-to-One


Reduction
2) All-to-All Broadcast and Reduction
3) All-Reduce and Prefix-Sum Operations
4) Scatter and Gather
5) All-to-All Personalized Communication
6) Circular Shift
7) Improving the Speed of Some
Communication Operations
2
Basic Communication Operations: Introduction

• Computations and communication are two


important factors of any parallel algorithm.

• Many interactions in practical parallel


programs occur in well-defined patterns
involving groups of processors.

• Efficient implementations of these


operations can improve performance,
reduce development effort and cost, and
improve software quality.
3
Basic Communication Operations: Introduction

• Efficient implementations must leverage


underlying architecture. For this reason, we
refer to specific architectures here.

• We select a descriptive set of architectures


to illustrate the process of algorithm
design.

4
Basic Communication Operations: Introduction

• Group communication operations are built


using point-to-point messaging primitives.

• We assume architectures in which communicating a message
of size m over an uncongested network takes
time ts + twm.

• We use this as the basis for our analyses.


Where necessary, we take congestion into
account explicitly by scaling the tw term.
5
Basic Communication Operations: Introduction

• We assume that the network is bidirectional


and that communication is single-ported.

6
One-to-All Broadcast and All-to-One Reduction

• A processor has a piece of data (of size m) it


needs to send to every other processor.
• The dual of one-to-all broadcast is all-to-one
reduction.
• In all-to-one reduction, each processor has m
units of data. These data items must be
combined piece-wise (using some associative
operator, such as addition or min), and the
result made available at a target processor.

7
One-to-All Broadcast and All-to-One
Reduction

One-to-all broadcast and all-to-one reduction


among p processors.

8
One-to-All Broadcast and All-to-One Reduction

One-to-all broadcast
• One-to-all broadcast is the operation in which
a single processor sends identical data to all
other processors.
• Many parallel algorithms need this
operation.
• Consider data of size m that is to be sent to all the
processors.
• Initially, only the source processor has the data.
• After completion of the algorithm, each processor
holds a copy of the initial data.
9
One-to-All Broadcast and All-to-One Reduction

All-to-One Reduction
• All-to-One reduction is the operation in which
data from all processors are combined at a
single destination processor.
• Various operations like sum, product, max,
min, avg of numbers can be performed by
all-to-one reduction operation.

10
One-to-All Broadcast and All-to-One Reduction

All-to-One Reduction

• Each of the p processors has a buffer M
containing m words.
• After completion of the algorithm, the ith word of
the accumulated M is the sum, product,
maximum, or minimum of the ith words of
each of the original buffers.

11
One-to-All Broadcast and All-to-One
Reduction

One-to-all broadcast and all-to-one reduction


among p processors.

12
One-to-All Broadcast and All-to-One Reduction
on Rings
• Simplest way is to send p − 1 messages from
the source to the other p − 1 processors –
this is not very efficient.
• Use recursive doubling: the source sends the
message to a selected processor. We now
have two independent problems defined
over the two halves of the machine.

• Reduction can be performed in an identical


fashion by inverting the process.
13
One-to-All Broadcast
[Figure: eight-node ring, nodes 0–7; dotted arrows labeled 1–3 show the message-transfer steps from node 0.]
One-to-all broadcast on an eight-node ring. Node 0 is the source of the broadcast. Each message transfer step is shown by a numbered, dotted arrow from the source of the message to its destination. The number on an arrow indicates the time step during which the message is transferred.
14
All-to-One Reduction

[Figure: eight-node ring; dotted arrows labeled 1–3 show the reduction steps toward node 0.]
Reduction on an eight-node ring with node 0 as
the destination of the reduction.
15
Broadcast and Reduction: Example

Consider the problem of multiplying a matrix


with a vector.

• The n × n matrix is assigned to an n × n (virtual)


processor grid. The vector is assumed to be on
the first row of processors.

• The first step of the product requires a


one-to-all broadcast of the vector element
along the corresponding column of
processors. This can be done concurrently for all n columns.
16
Broadcast and Reduction: Example

• The processors compute local product of the


vector element and the local matrix entry.

• In the final step, the results of these products


are accumulated to the first row using n
concurrent all-to-one reduction operations
along the columns (using the sum operation).

17
Broadcast and Reduction: Matrix-Vector
Multiplication Example
[Figure: a 4 × 4 grid of processes P0–P15 computing the matrix-vector product; the input vector is distributed by one-to-all broadcasts and the partial products are combined by all-to-one reductions to form the output vector.]
One-to-all broadcast and all-to-one reduction in the multiplication of a 4 × 4 matrix with a 4 × 1 vector.
18
Broadcast and Reduction on a Mesh
•We can view each row and column of a square
mesh of p nodes as a linear array of √p nodes.
•Broadcast and reduction operations can be
performed in two steps – the first step does
the operation along a row and the second
step along each column concurrently.
•This process generalizes to higher dimensions
as well.

19
Broadcast and Reduction on a Mesh
•Consider a 2D square mesh with √p rows and √p
columns for the one-to-all broadcast operation.
•First, the source sends the data to the other
√p − 1 nodes of its row using a one-to-all
broadcast.
•In the second phase, the nodes of that row broadcast
the data along their respective columns, again by
one-to-all broadcast.
•Thus, each node of the mesh ends up with a copy of
the initial message.

20
Broadcast and Reduction on a Mesh: Example
[Figure: one-to-all broadcast on a 16-node (4 × 4) mesh; source node 0; arrows are labeled with time steps 1–4.]
First phase (row data transfer): steps 1 and 2. Step 1: node 0 sends to node 8. Step 2: node 0 sends to node 4 and node 8 sends to node 12. After this phase, the first-row nodes 0, 4, 8, and 12 hold the data.
21
Broadcast and Reduction on a Mesh: Example
[Figure: the same 16-node mesh; arrows labeled 3 and 4 show the column data transfers.]
Second phase (column data transfer): steps 3 and 4. Step 3: node 0 sends to node 2, node 4 to node 6, node 8 to node 10, and node 12 to node 14. Step 4: in the first column, node 0 sends to node 1 and node 2 to node 3; in the second, node 4 to node 5 and node 6 to node 7; in the third, node 8 to node 9 and node 10 to node 11; in the fourth, node 12 to node 13 and node 14 to node 15.
22
Broadcast and Reduction on a Mesh: Example
[Figure: the same 16-node mesh after both phases.]
A similar process for one-to-all broadcast on a three-dimensional mesh can be carried out by treating the rows of nodes in each of the three dimensions as linear arrays.
23
Broadcast and Reduction on a Mesh: Example
[Figure: the same 16-node mesh with the arrow directions reversed for reduction.]
The reduction process for the linear array can be carried out on two- and three-dimensional meshes as well by reversing the direction and order of the messages.
24
Broadcast and Reduction on a Hypercube

•A hypercube with 2^d nodes can be
regarded as a d-dimensional mesh with two
nodes in each dimension.
•The mesh algorithm can be generalized to a
hypercube and the operation is carried out in
d (= log p) steps, one in each dimension.

•Example of 8-node hypercube.

25
Broadcast and Reduction on a Hypercube
8-node hypercube.
[Figure: a three-dimensional hypercube with nodes 0 (000) through 7 (111); arrows labeled 1–3 show the broadcast steps from node 0.]
One-to-all broadcast on a three-dimensional hypercube. The binary representations of node labels are shown in parentheses.
26
Broadcast and Reduction on a Hypercube

•Each node is identified by a unique 3-bit binary
label.
•Communication starts along the highest
dimension, i.e., the dimension specified by the
most significant bit (MSB) of the node label.
Ex. In step 1, node 0 (000) sends the data to
node 4 (100) along this highest dimension.

27
Broadcast and Reduction on a Hypercube:
Example
● In the subsequent steps, communication proceeds
along successively lower dimensions.
● The source and destination nodes in the three
communication steps of the algorithm are
similar to the nodes in the broadcast algorithm
on a linear array.
● The hypercube broadcast does not suffer from
congestion.

28
Broadcast and Reduction on a Balanced Binary
Tree
•Consider a binary tree in which processors are
(logically) at the leaves and internal nodes are
routing nodes i.e. switching units.
•The communicating nodes have the same labels
as in the hypercube
•The communication pattern will be same as that
of hypercube algorithm.
•There will not be any congestion on any of the
communication link at any time.

29
Broadcast and Reduction on a Balanced Binary
Tree

[Figure: an eight-node balanced binary tree; processors 0–7 at the leaves, switching nodes internally; arrows labeled 2 and 3 show broadcast steps.]

30
Broadcast and Reduction on a Balanced
Binary Tree

[Figure: the same eight-node tree; arrows labeled 3 show the final broadcast steps to the leaves.]

One-to-all broadcast on an eight-node tree.

•Different paths pass through different numbers of
switching nodes, making the communication
characteristics different from those of the hypercube.
31
Broadcast and Reduction on a Balanced
Binary Tree
[Figure: the same eight-node tree; the arrow labeled 1 shows the first broadcast step.]

One-to-all broadcast on an eight-node tree.

•E.g. Assume that source processor is the root


of this tree. In the first step, the source sends
the data to the right child (assuming the
source is also the left child). The problem has
now been decomposed into two problems
with half the number of processors. 32
Broadcast and Reduction Algorithms
•All of the algorithms described so far are
adaptations of the same algorithmic template.

•We illustrate the algorithm for a hypercube,


but the algorithm can be adapted to other
architectures.

•The hypercube has 2^d nodes, and my_id is the
label of a node.
• X is the message to be broadcast, which
initially resides at the source node 0.
• X is the message to be broadcast, which
initially resides at the source node 0.
33
Broadcast and Reduction Algorithms
1. procedure GENERAL_ONE_TO_ALL_BC(d, my_id,
source, X)
2. begin
3. my_virtual_id := my_id XOR source;
4. mask := 2^d − 1;
5. for i := d − 1 downto 0 do /* Outer loop */
6. mask := mask XOR 2^i; /* Set bit i of mask to 0 */
7. if (my_virtual_id AND mask) = 0 then
8. if (my_virtual_id AND 2^i) = 0 then
9. virtual_dest := my_virtual_id XOR 2^i;

One-to-all broadcast of a message X initiated by source on a


d-dimensional p-node hypercube. d = log (p)
34
Broadcast and Reduction Algorithms
10. send X to (virtual_dest XOR source);
/* Convert virtual_dest to the label of the physical
destination */
11. else
12. virtual_source := my_virtual_id XOR 2^i;
13. receive X from (virtual_source XOR
source);
/* Convert virtual_source to the label of the physical
source */
14. endelse;
15. endfor;
16. end GENERAL_ONE_TO_ALL_BC
One-to-all broadcast of a message X initiated by source on a
d-dimensional p-node hypercube. d = log (p) 35
Broadcast and Reduction Algorithms

1. procedure GENERAL_ONE_TO_ALL_BC(d, my_id, source, X)


2. begin
3. my_virtual_id := my_id XOR source;
4. mask := 2^d − 1;
5. for i := d − 1 downto 0 do /* Outer loop */
6. mask := mask XOR 2^i; /* Set bit i of mask to 0 */
7. if (my_virtual_id AND mask) = 0 then
8. if (my_virtual_id AND 2^i) = 0 then
9. virtual_dest := my_virtual_id XOR 2^i;
10. send X to (virtual_dest XOR source);
/* Convert virtual_dest to the label of the physical destination */
11. else
12. virtual_source := my_virtual_id XOR 2^i;
13. receive X from (virtual_source XOR source);
/* Convert virtual_source to the label of the physical source */
14. endelse;
15. endfor;
16. end GENERAL_ONE_TO_ALL_BC

One-to-all broadcast of a message X initiated by source on a


d-dimensional p-node hypercube. d = log (p)
36
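The following hedged sketch (not part of the slides) simulates the recursive-doubling broadcast of GENERAL_ONE_TO_ALL_BC in plain Python; "sends" are modelled by copying into a dictionary of buffers, and the function name one_to_all_broadcast is hypothetical.

# Hypothetical simulation of one-to-all broadcast on a d-dimensional hypercube.
def one_to_all_broadcast(d, source, X):
    p = 2 ** d
    buffers = {source: X}                         # only the source holds X initially
    mask = p - 1
    for i in range(d - 1, -1, -1):                # outer loop over dimensions
        mask ^= 2 ** i                            # set bit i of mask to 0
        for node in range(p):
            v = node ^ source                     # virtual id relative to the source
            if v & mask == 0 and v & 2 ** i == 0:
                dest = (v ^ 2 ** i) ^ source      # physical label of the partner
                buffers[dest] = buffers[node]     # send X along dimension i
    return buffers

print(one_to_all_broadcast(3, source=2, X="msg"))   # all 8 nodes end up with "msg"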
Broadcast and Reduction Algorithms

1. procedure ALL_TO_ONE_REDUCE(d, my_id, m, X, sum)


2. begin
3. for j := 0 to m − 1 do sum[j] := X[j];
4. mask := 0;
5. for i := 0 to d − 1 do
/* Select nodes whose lower i bits are 0 */
6. if (my_id AND mask) = 0 then
7. if (my_id AND 2^i) ≠ 0 then
8. msg_destination := my_id XOR 2^i;
9. send sum to msg_destination;
10. else
11. msg_source := my_id XOR 2^i;
12. receive X from msg_source;
13. for j := 0 to m − 1 do
14. sum[j] := sum[j] + X[j];
15. endelse;
16. mask := mask XOR 2^i; /* Set bit i of mask to 1 */
17. endfor;
18. end ALL_TO_ONE_REDUCE

Single-node accumulation on a d-dimensional hypercube. Each node contributes a


message X containing m words, and node 0 is the destination. 37
Cost Analysis

•The one-to-all broadcast or all-to-one
reduction procedure involves log p
point-to-point simple message transfers.

•Each message transfer will have a time cost


of ts + twm.

•The total time is therefore given by:

T = (ts + twm) log p.


38
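In practice, one-to-all broadcast and all-to-one reduction are provided directly by MPI (MPI_Bcast and MPI_Reduce). A minimal mpi4py sketch follows; mpi4py and the script name are assumptions, not part of the slides.

# Run with: mpiexec -n 8 python bcast_reduce.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# One-to-all broadcast: node 0 sends the same message to every node.
data = comm.bcast("hello from node 0" if rank == 0 else None, root=0)

# All-to-one reduction: every node contributes its rank; node 0 gets the sum.
total = comm.reduce(rank, op=MPI.SUM, root=0)
if rank == 0:
    print(data, "| sum of ranks =", total)   # sum of ranks 0..7 = 28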
Questions based on one-to-all Broadcast
and all-to-one Reduction
• Explain one-to-all broadcast and all-to-one
reduction operation in brief.
• Explain with example one-to-all broadcast
and all-to-one reduction operation on ring.
• Explain how matrix-vector multiplication can
be performed using one-to-all broadcast and
all-to-one reduction operation.
• Explain one-to-all broadcast operation on
16-node mesh.
• Explain one-to-all broadcast operation on a
hypercube. 39
Questions based on one-to-all Broadcast
and all-to-one Reduction

• Write and explain algorithm of one-to-all


broadcast on a hypercube network.
• Explain one-to-all broadcast algorithm for
arbitrary source on d-dimensional
hypercube.
• Explain all-to-one reduction operation on
d-dimensional hypercube.

40
All-to-All Broadcast and Reduction

• Generalization of broadcast in which each


processor is the source as well as
destination.
• All-to-all broadcast operation is used in
matrix operations like matrix multiplication
and matrix-vector multiplication.
• In all-to-all broadcast operation, all p nodes
simultaneously broadcast the message.

41
All-to-All Broadcast and Reduction

• Note that a process sends the same m-word
message to all the processes, but it is not
compulsory that every process sends the
same message; different processes can
broadcast different messages.
• In all-to-all reduction, the reduction happens at every
node, i.e., every node is a destination of a reduction.

42
All-to-All Broadcast and Reduction

All-to-all broadcast and all-to-all reduction.

43
All-to-All Broadcast and Reduction on a Ring
• All-to-all Broadcast:
• Simplest approach: perform p one-to-all
broadcasts. This is not the most efficient way,
though.
• It would take p times as long as a single one-to-all broadcast.
• Communication links can be used more efficiently
by performing all p one-to-all broadcasts
simultaneously.
• In this way, all messages traversing the same path at
the same time are concatenated into a
single message.
• The algorithm terminates in p − 1 steps.
44
All-to-All Broadcast and Reduction on a Ring

• Linear Array and Ring:


• Each node first sends to one of its neighbors the
data it needs to broadcast.
• In subsequent steps, it forwards the data received
from one of its neighbors to its other neighbor.
• This process continues in subsequent steps so that
all the communication links can be kept busy.
• As the communication is performed circularly in a
single direction, each node receives all (p − 1)
pieces of information from all other nodes in (p − 1)
steps.

45
All-to-All Broadcast and Reduction on a Ring
[Figure: all-to-all broadcast on an eight-node ring, showing the 1st, 2nd, and 7th communication steps. An arrow label n (m) denotes message m being transferred during time step n; after 7 steps every node holds messages 0 through 7.]
All-to-all broadcast on an eight-node ring.
46
All-to-All Broadcast and Reduction on a Ring

[Figure: 1st communication step. An arrow label n (m) denotes message m transferred during time step n.]

47
All-to-all broadcast on an eight-node ring.
All-to-All Broadcast and Reduction on a Ring

2nd Communication Step

48
All-to-all broadcast on an eight-node ring.
All-to-All Broadcast and Reduction on a Ring

7th communication step

49
All-to-all broadcast on an eight-node ring.
All-to-All Broadcast and Reduction on a Ring

• Detailed Algorithm
• At every node, my_msg contains initial message to be
broadcast.
• At the end of the algorithm, all p messages are
collected at each node.

50
All-to-All Broadcast and Reduction on a Ring

1. procedure ALL_TO_ALL_BC_RING(my_id, my_msg, p, result)


2. begin
3. left := (my_id − 1) mod p;
4. right := (my_id + 1) mod p;
5. result := my_msg;
6. msg := result;
7. for i := 1 to p − 1 do
8. send msg to right;
9. receive msg from left;
10. result := result ∪ msg;
11. endfor;
12. end ALL_TO_ALL_BC_RING

All-to-all broadcast on a p-node ring.

All-to-all reduction is simply a dual of this operation and can


be performed in an identical fashion. 51
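A hedged Python simulation of ALL_TO_ALL_BC_RING (not from the slides), stepping all p nodes in lockstep; the function name mirrors the pseudocode but the implementation details are illustrative.

# Hypothetical simulation: in every step each node forwards to its right
# neighbour the message it received from its left neighbour.
def all_to_all_bc_ring(p, my_msgs):
    result = [{my_msgs[i]} for i in range(p)]    # result buffer at each node
    msg = list(my_msgs)                          # message currently held for forwarding
    for _ in range(p - 1):                       # p - 1 communication steps
        incoming = [msg[(i - 1) % p] for i in range(p)]   # receive from the left neighbour
        for i in range(p):
            result[i].add(incoming[i])           # result := result U msg
        msg = incoming                           # forward this message in the next step
    return result

print(all_to_all_bc_ring(8, [f"m{i}" for i in range(8)]))  # every node collects all 8 messages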
All-to-all Broadcast on a Mesh
• Performed in two phases –
• in the first phase, each row of the mesh performs an
all-to-all broadcast using the procedure for the linear array.

• In this phase, all nodes collect √p messages corresponding


to the √p nodes of their respective rows. Each node
consolidates this information into a single message of size
m√p.

• The second communication phase is a columnwise all-to-all


broadcast of the consolidated messages.

52
All-to-all Broadcast on a Mesh
[Figure: all-to-all broadcast on a 3 × 3 mesh in three panels: (a) initial data distribution, (b) data distribution after the rowwise broadcast, (c) final result of the all-to-all broadcast.]

All-to-all broadcast on a 3 × 3 mesh. The groups of nodes communicating


with each other in each phase are enclosed by dotted boundaries. By the end
of the second phase, all nodes get (0,1,2,3,4,5,6,7,8) (that is, a message from
each node).

• After completion of second phase each node obtains all p pieces of m-word
data i.e. all nodes will get (0,1,2,3,4,5,6,7,8) message from each node.

53
All-to-all Broadcast on a Mesh

[Figure: 3 × 3 mesh; node i initially holds message (i).]
(a) Initial data distribution
• After completion of second phase each node obtains all p pieces of m-word
data i.e. all nodes will get (0,1,2,3,4,5,6,7,8) message from each node.

54
All-to-all Broadcast on a Mesh
[Figure: after the rowwise broadcast each node of a row holds that row's three messages, e.g. (0,1,2) in the bottom row.]

(b) Data distribution after rowwise broadcast


• After completion of second phase each node obtains all p pieces of m-word
data i.e. all nodes will get (0,1,2,3,4,5,6,7,8) message from each node.

55
All-to-all Broadcast on a Mesh

(0,1,2,3,4,5,6,7,8) (0,1,2,3,4,5,6,7,8) (0,1,2,3,4,5,6,7,8)


6 7 8

(0,1,2,
3,4,5, 3 4 5
(0,1,2, (0,1,2,
6,7,8) 3,4,5, 3,4,5,
6,7,8) 6,7,8)

0 1 2

(0,1,2,3,4,5,6,7,8) (0,1,2,3,4,5,6,7,8) (0,1,2,3,4,5,6,7,8)

(c) Final result of all-to-all broadcast on Mesh

56
All-to-all Broadcast on a Mesh

• All-to-all broadcast on a 3 × 3 mesh. The groups


of nodes communicating with each other in each
phase are enclosed by dotted boundaries.
• By the end of the second phase, all nodes get
(0,1,2,3,4,5,6,7,8) (that is, a message from each
node).

• After completion of second phase each node


obtains all p pieces of m-word data i.e. all nodes
will get (0,1,2,3,4,5,6,7,8) message from each
node.

57
All-to-all Broadcast on a Mesh
1. procedure ALL_TO_ALL_BC_MESH(my_id, my_msg, p, result)
2. begin
/* Communication along rows */
3. left := my_id − (my_id mod √p) + (my_id − 1)mod√p;
4. right := my_id − (my_id mod √p) + (my_id + 1) mod √p;
5. result := my_msg;
6. msg := result;
7. for i := 1 to √p − 1 do
8. send msg to right;
9. receive msg from left;
10. result := result ∪ msg;
11. endfor;
/* Communication along columns */
12. up := (my_id − √p) mod p;
13. down := (my_id + √p) mod p;
14. msg := result;
15. for i := 1 to √p − 1 do
16. send msg to down;
17. receive msg from up;
18. result := result ∪ msg;
19. endfor;
20. end ALL_TO_ALL_BC_MESH

All-to-all broadcast on a square mesh of p nodes. 58


All-to-all broadcast on a Hypercube

• The all-to-all broadcast operation can be performed on a hypercube by
extending the mesh algorithm to log p dimensions.
• Communication is carried out along a different dimension in each step.
Figure (a) shows the first step, in which communication takes place within
each row.
• In figure (b), communication is carried out along the columns in the second step.
• Pairs of nodes exchange data in each step.
• The received message is concatenated with the current data in every
step.
• A hypercube with bidirectional communication links is assumed.

59
All-to-all broadcast on a Hypercube
[Figure: all-to-all broadcast on an eight-node hypercube in four panels:
(a) Initial distribution of messages; (b) Distribution before the second step;
(c) Distribution before the third step; (d) Final distribution of messages.]

All-to-all broadcast on an eight-node hypercube.

60
All-to-all broadcast on a Hypercube

1. procedure ALL_TO_ALL_BC_HCUBE(my_id, my_msg, d, result)


2. begin
3. result := my_msg;
4. for i := 0 to d − 1 do
5. partner := my_id XOR 2^i;
6. send result to partner;
7. receive msg from partner;
8. result := result ∪ msg;
9. endfor;
10. end ALL_TO_ALL_BC_HCUBE

All-to-all broadcast on a d-dimensional hypercube.

61
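All-to-all broadcast corresponds to MPI's allgather collective; a minimal mpi4py sketch follows (mpi4py and the script name are assumptions, not part of the slides).

# Run with: mpiexec -n 8 python allgather.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

everything = comm.allgather(f"msg from node {rank}")
print(rank, "now holds", len(everything), "messages")   # every node holds p messages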
All-to-all broadcast on a Hypercube

[Figure: each node i initially holds its own message (i).]

(a) Initial distribution of messages

All-to-all broadcast on an eight-node hypercube.

62
All-to-all broadcast on a Hypercube
[Figure: after the first step, nodes 0 and 1 hold (0,1), nodes 2 and 3 hold (2,3), nodes 4 and 5 hold (4,5), and nodes 6 and 7 hold (6,7).]

(b) Distribution before the second step


All-to-all broadcast on an eight-node hypercube.
63
All-to-all broadcast on a Hypercube

[Figure: before the third step, nodes 0–3 hold (0,1,2,3) and nodes 4–7 hold (4,5,6,7).]

(c) Distribution before the third step

All-to-all broadcast on an eight-node hypercube.


64
All-to-all broadcast on a Hypercube

[Figure: after the final step, every node holds (0,...,7).]

(d) Final distribution of messages


All-to-all broadcast on an eight-node hypercube.
65
All-to-all Broadcast

• Similar communication pattern to all-to-all broadcast,


except in the reverse order.
• On receiving a message, a node must combine it with
the local copy of the message that has the same
destination as the received message before forwarding
the combined message to the next neighbor.
• As per the algorithm the communication starts from
lowest dimension.
• Variable i is used to represent the dimension.
• According to line 4, in the first iteration the value of i is 0.

66
All-to-all Broadcast

• In each iteration, nodes communicate in pairs.


• In line 5, the label of receiver node will be calculated
by XOR operation.
• For example, if my_id = 000 then partner = 000 XOR 001 =
001. Hence, the partner differs from my_id in the ith least significant bit.
• After communication of the data, each node
concatenates the received data with its own data as
shown in line 8.
• This concatenated message is then transmitted in the
next iteration.

67
All-to-all reduction on a Hypercube

1. procedure ALL_TO_ALL_RED_HCUBE(my_id, msg, d, result)


2. begin
3. recloc := 0;
4. for i := d − 1 downto 0 do
5. partner := my_id XOR 2^i;
6. j := my_id AND 2^i;
7. k := (my_id XOR 2^i) AND 2^i;
8. senloc := recloc + k;
9. recloc := recloc + j;
10. send msg[senloc .. senloc + 2^i − 1] to partner;
11. receive temp[0 .. 2^i − 1] from partner;
12. for j := 0 to 2^i − 1 do
13. msg[recloc+j] := msg [recloc +j] + temp[j];
14. endfor;
15. endfor;
16. result := msg[my_id];
17. end ALL_TO_ALL_RED_HCUBE

All-to-all reduction on a d-dimensional hypercube.

68
All-to-all Reduction

• The order and direction of messages is reversed for


all-to-all reduction operation.
• The buffers are used to send and accumulate the
received messages in each iteration.
• Variable senloc is used to give the starting location of
the outgoing message.
• Variable recloc is used to give the location where the
incoming message is added in each iteration.

69
Cost Analysis

• On a ring, the time is given by:


T=(ts + twm)(p − 1).
• On a mesh, the time is given by:

T= 2ts(√p − 1) + twm(p − 1).


• On a hypercube, the time is given by:
T = ts log p + twm(p − 1).
70
Cost Analysis of all-to-all broadcast and all-to-all reduction operation

• On a ring, the time is given by:


(ts + twm)(p − 1).
• =>> All-to-all broadcast can be performed in (p − 1)
communication steps on a ring or a linear array of
nearest neighbors.
• Time taken in each step is (ts + twm) where ts is the
startup time of the message, tw is the per word
transfer time, and m is the size of message,
• Total time taken for the operation is :
(ts + twm)(p − 1).

71
Cost Analysis of all-to-all broadcast and all-to-all reduction operation

• On a mesh, the time is given by:


2ts(√p − 1) + twm(p − 1).
=>> All-to-all broadcast can be performed on a mesh in two phases.
The first phase of √p simultaneous all-to-all broadcasts along the rows, with messages of size m, completes in time
(ts + twm)(√p − 1);
the second phase repeats this along the columns with consolidated messages of size m√p, taking (ts + twm√p)(√p − 1).
For a two-dimensional square mesh of p nodes, the total
time for all-to-all broadcast is the sum of the time spent in
the two phases, which is:
2ts(√p − 1) + twm(p − 1)

72
Cost Analysis of all-to-all broadcast and all-to-all reduction operation

• On a hypercube: for a p-node hypercube, the time taken by a pair
of nodes to send and receive messages in the ith step is
ts + 2^(i−1) twm.

Total time taken for the operation is given by:

T = Σ(i=1 to log p) (ts + 2^(i−1) twm)
= ts log p + twm(p − 1).

73
All-to-all broadcast: Notes

•All of the algorithms presented here


are asymptotically optimal in message
size.

•That is, twm(p-1) is the term


associated with each architecture.

74
All-to-all broadcast: Notes

•It is not possible to port/map algorithms designed
for higher-dimensional networks (such
as a hypercube) onto a ring without
causing contention.
•For large messages, the ring algorithm is therefore
the better choice on such lower-connectivity networks.

75
All-to-all broadcast: Notes
[Figure: the communication pattern of the 8-node hypercube algorithm mapped onto an eight-node ring; multiple messages contend for a single channel.]

Contention for a channel when the hypercube is mapped onto a ring.


76
All-to-all broadcast: Questions

1. Explain all-to-all broadcast and reduction operation


2. Write algorithm and explain all-to-all broadcast on
eight node ring/hypercube.
3. Explain with example and algorithm all-to-all
broadcast on 3x3 mesh.
4. Explain all-to-all reduction on d-dimensional
hypercube.
5. Explain all-to-all broadcast on d-dimensional
hypercube.
6. Explain cost analysis of all-to-all broadcast operation

77
All-Reduce and Prefix-Sum Operations

•In all-reduce, each node starts with a


buffer of size m and the final results of
the operation are identical buffers of size
m on each node that are formed by
combining the original p buffers using an
associative operator.

78
All-Reduce and Prefix-Sum Operations

•It is identical to all-to-one reduction


followed by a one-to-all broadcast.
•This formulation is not the most efficient.
•Uses the pattern of all-to-all broadcast,
instead. The only difference is that message
size does not increase here.
•Time for this operation is
(ts + twm) log p.

79
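The all-reduce operation corresponds to MPI's allreduce collective; a minimal mpi4py sketch follows (mpi4py and the script name are assumptions, not part of the slides).

# Run with: mpiexec -n 8 python allreduce.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Every node contributes its rank; every node ends up with the same combined result.
total = comm.allreduce(rank, op=MPI.SUM)
print(rank, "sees total =", total)   # 0+1+...+7 = 28 on all 8 nodes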
All-Reduce and Prefix-Sum Operations

•It is different from all-to-all reduction, in


which p simultaneous all-to-one reductions
take place, each with a different destination
for the result.

80
The Prefix-Sum Operation

•Given p numbers n0, n1, . . . , np−1 (one on


each node), the problem is to compute the
sums sk = Σ(i=0 to k) ni for all k between 0 and
p − 1.
•Initially, nk resides on the node labeled k,
and at the end of the procedure, the same
node holds sk.

81
The Prefix-Sum Operation

s0 = n0
s1 = n0 + n1
s2 = n0 + n1 + n2
s3 = n0 + n1 + n2 + n3
s4 = n0 + n1 + n2 + n3 + n4
...
sk = n0 + n1 + n2 + .. + nk

82
The Prefix-Sum Operation

Initially, n0 resides on node 0, n1 resides on


node 1 and so on.

After completion of the prefix-sum operation, every node holds the sum of the values on all its predecessor nodes, including itself.

T = (ts+twm) log p

83
The Prefix-Sum Operation
[Figure: computing prefix sums on an eight-node hypercube, in four panels: (a) initial distribution of values, (b) distribution of sums before the second step, (c) distribution of sums before the third step, (d) final distribution of prefix sums.]

Computing prefix sums on an eight-node hypercube. At


each node, square brackets show the local prefix sum
accumulated in the result buffer and parentheses enclose
the contents of the outgoing message buffer for the next
step.
84
The Prefix-Sum Operation
[Figure: (a) node k starts with value (k) in its outgoing message buffer and [k] in its result buffer.]

(a) Initial distribution of values


At each node, square brackets show the local prefix sum accumulated in the result buffer
and parentheses enclose the contents of the outgoing message buffer for the next step.
85
The Prefix-Sum Operation
[Figure: (b) after the first exchange each pair holds the pairwise sum in its message buffer, e.g. nodes 0 and 1 hold (0+1); node 1's result buffer is [0+1] while node 0's is still [0].]

(b) distribution of sums before second step


At each node, square brackets show the local prefix sum accumulated in the result buffer
and parentheses enclose the contents of the outgoing message buffer for the next step.
86
The Prefix-Sum Operation
[Figure: (c) before the third step, e.g. nodes 0–3 hold the outgoing message (0+1+2+3), nodes 6 and 7 hold (4+5+6+7), and node 3's result buffer is [0+1+2+3].]

(c) distribution of sums before third step


At each node, square brackets show the local prefix sum accumulated in the result buffer
and parentheses enclose the contents of the outgoing message buffer for the next step.
87
The Prefix-Sum Operation
[Figure: (d) final distribution; node k's result buffer holds [0+1+...+k].]

(d) final distribution of prefix sums


At each node, square brackets show the local prefix sum accumulated in the result buffer
and parentheses enclose the contents of the outgoing message buffer for the next step.
88
The Prefix-Sum Operation or Scan Operation

•The operation can be implemented using


the all-to-all broadcast kernel.

•We must account for the fact that in


prefix sums the node with label k uses
information from only the k-node subset
whose labels are less than or equal to k.

89
The Prefix-Sum Operation or Scan Operation

•This is implemented using an additional


result buffer. The content of an incoming
message is added to the result buffer only if
the message comes from a node with a
smaller label than the recipient node.

•The contents of the outgoing message


(denoted by parentheses in the figure) are
updated with every incoming message.

90
The Prefix-Sum Operation or Scan Operation

• Prefix sum operation also uses the same


communication pattern which is used in all-to-all
broadcast and all reduce operations.
• The sum sk = Σ(i=0 to k) ni, for all k between 0 and
p − 1, is calculated for the p numbers n0, n1, ..., np−1
(one on each node).
• E.g. if the original sequence of numbers is
<3, 1, 4, 0, 2> then the sequence of prefix sum is <3,
4, 8, 8, 10>
i.e. 3+null=3, 3+1=4, 4+4=8, 8+0=8, 8+2=10

91
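The running-sum example above can be checked sequentially with Python's itertools.accumulate (an illustrative aside, not part of the slides).

from itertools import accumulate
print(list(accumulate([3, 1, 4, 0, 2])))   # [3, 4, 8, 8, 10]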
The Prefix-Sum Operation or Scan Operation

• At start the number nk will be present with node k.


After termination of algorithm, same node holds sum
sk.
• Instead of single number, each node will have a buffer
or a vector of size m and result will be sum of
elements of buffers.
• Each node contains an additional result buffer, denoted by
square brackets, to collect the correct prefix sum.
• After every communication step, the message from a
node with a smaller label than that of the recipient
node is added to the result buffer.

92
The Prefix-Sum Operation

1. procedure PREFIX_SUMS_HCUBE(my_id, my_number, d, result)


2. begin
3. result := my_number;
4. msg := result;
5. for i := 0 to d − 1 do
6. partner := my_id XOR 2^i;
7. send msg to partner;
8. receive number from partner;
9. msg := msg + number;
10. if (partner < my_id) then result := result + number;
11. endfor;
12. end PREFIX_SUMS_HCUBE

Prefix sums on a d-dimensional hypercube.


93
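A hedged lockstep simulation of PREFIX_SUMS_HCUBE in plain Python (not from the slides); the function name and data layout are illustrative.

# Hypothetical simulation of prefix sums on a hypercube with p = 2**d nodes.
def prefix_sums_hypercube(d, numbers):
    p = 2 ** d
    result = list(numbers)              # result buffer at each node
    msg = list(numbers)                 # outgoing message buffer at each node
    for i in range(d):
        partner = [node ^ 2 ** i for node in range(p)]
        incoming = [msg[partner[node]] for node in range(p)]   # exchange along dimension i
        for node in range(p):
            msg[node] += incoming[node]                        # msg := msg + number
            if partner[node] < node:                           # only data from smaller labels
                result[node] += incoming[node]                 # contributes to the prefix sum
    return result

print(prefix_sums_hypercube(3, [0, 1, 2, 3, 4, 5, 6, 7]))   # [0, 1, 3, 6, 10, 15, 21, 28]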
The Prefix-Sum Operation
[Figure: computing prefix sums on an eight-node hypercube, panels (a)–(d): initial values, sums before the second step, sums before the third step, and the final distribution of prefix sums.]

• At each node, square brackets show the local prefix sum accumulated in
the result buffer and parentheses enclose the contents of the outgoing
message buffer for the next step.
• These contents are updated with every incoming message.
• All the messages received by a node will not contribute to its final result,
some of the messages it receives may be redundant.
94
Questions on Prefix sum operations

•Explain the difference between


all-to-all reduction and all reduce
operations.

•Explain with example prefix sum


operations.

95
Scatter and Gather

• In the scatter operation, a single node sends a


unique message of size m to every other node.
• This is called a one-to-all personalized
communication.

• In the gather operation, a single node collects a


unique message from each node.
• It is the dual of scatter operation.

96
Scatter and Gather

• While the scatter operation is fundamentally


different from broadcast, the algorithmic
structure is similar, except for differences in
message sizes (messages get smaller in scatter
and stay constant in broadcast).

• The gather operation is exactly the inverse of


the scatter operation.

97
Gather and Scatter Operations

Scatter and gather operations.

• Consider the example of 8-node hypercube


• The communication pattern of the scatter operation is
identical to that of one-to-all broadcast; the only difference
is in the size and contents of the messages.
• As in figure above, initially source node 0 will have all
the messages.

98
Gather and Scatter Operations

Scatter and gather operations.

99
Example of the Scatter Operation

In the first communication step, node 0 transfers half of the messages to one of its neighbours (node 4).
[Figure: node 0 initially holds (0,1,2,3,4,5,6,7); all other nodes are empty.]

(a) Initial distribution of messages


The scatter operation on an eight-node hypercube.
100
Example of the Scatter Operation

In the next step, every process that has data transfers half of it to a neighbour that has not yet received any data.
[Figure: node 0 holds (0,1,2,3) and node 4 holds (4,5,6,7).]

(b) Distribution before the second step


The scatter operation on an eight-node hypercube.
101
Example of the Scatter Operation
[Figure: nodes 0, 2, 4, and 6 hold (0,1), (2,3), (4,5), and (6,7) respectively.]

(c) Distribution before the third step


The scatter operation on an eight-node hypercube.
102
Example of the Scatter Operation

This process involves log p communication steps, one for each of the log p dimensions of the hypercube.
[Figure: every node i now holds its own message (i).]

(d) Final distribution of messages


The scatter operation on an eight-node hypercube.
103
Gather Operations

Scatter and gather operations.


• The gather operation is reverse of scatter operation.
• Every node will have m word message.
• In the first step, each odd numbered node sends its buffer to an even
numbered neighbor behind it.
• The neighbor node concatenates the received message with its own
buffer.
• In the next communication step, only even-numbered nodes participate
in communication.
• The nodes whose labels are multiples of four gather more data, doubling
the size of their data.
• This process continues until node 0 has gathered all the data.
104
Example of the Gather Operation
[Figure: each node i holds its own distinct message (i).]

(a) Initial distinct messages

The gather operation on an eight-node hypercube.


105
Example of the Gather Operation

[Figure: nodes 0, 2, 4, and 6 hold (0,1), (2,3), (4,5), and (6,7) respectively.]

(b) Collection before the second step

The gather operation on an eight-node hypercube.


106
Example of the Gather Operation

[Figure: node 0 holds (0,1,2,3) and node 4 holds (4,5,6,7).]

(c) Collection before the third step


The gather operation on an eight-node hypercube.
107
Example of the Gather Operation

[Figure: node 0 has gathered (0,1,2,3,4,5,6,7).]

(d) Final Collection of messages


The gather operation on an eight-node hypercube.
108
Cost of Scatter and Gather

• There are log p steps, in each step, the machine size halves and the
data size halves.
• We have the time for this operation to be:
T = ts log p + twm(p − 1).
• This time is same for a linear array as well as a 2-D mesh.
• In scatter operation, at least m(p-1) data must be transmitted out of
the source node,
• and in gather operation at least m(p-1) data must be received by the
destination node.
• Therefore, twm(p-1) time, is the lower bound on the communication
in scatter and gather operations.

Topic Questions
• Explain Scatter and Gather operations with example.
109
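Scatter and gather correspond directly to MPI's scatter and gather collectives; a minimal mpi4py sketch follows (mpi4py and the script name are assumptions, not part of the slides).

# Run with: mpiexec -n 8 python scatter_gather.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Scatter: node 0 sends a distinct piece to every node (one-to-all personalized).
pieces = [f"piece {i}" for i in range(size)] if rank == 0 else None
mine = comm.scatter(pieces, root=0)

# Gather: node 0 collects one distinct piece from every node (the dual of scatter).
collected = comm.gather(mine.upper(), root=0)
if rank == 0:
    print(collected)    # ['PIECE 0', ..., 'PIECE 7']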
All-to-All Personalized Communication

• All-to-all personalized communication operation can be


applied in variety of parallel algorithms such as Fast Fourier
Transform, matrix transpose, sample sort, and some parallel
database join operations
• Each node has a distinct message of size m for every other
node.
• This is opposite of all-to-all broadcast, in which each node
sends the same message to all other nodes.
• All-to-all personalized communication is also known as
total exchange.

110
All-to-All Personalized Communication

[Figure: before the operation, node i holds the messages M(i,0), M(i,1), ..., M(i,p−1); after all-to-all personalized communication, node i holds M(0,i), M(1,i), ..., M(p−1,i).]

All-to-all personalized communication.

111
All-to-All Personalized Communication: Example

Consider the problem of transposing a matrix.

• Each processor contains one full row of the matrix.


• The transpose operation in this case is identical to an
all-to-all personalized communication operation.
• Let A be an n × n matrix; the transpose of A is AT.
• AT has the same size as A, with AT[i, j] = A[j, i] for
0 ≤ i, j < n.

112
All-to-All Personalized Communication: Example

Consider the problem of transposing a matrix.

• Considering 1D row major partitioning of array, n x n


matrix can be mapped onto n processors such that each
processor contains one full row of the matrix.
• Each processor sends a distinct element of the matrix
to every other processor as
all-to-all personalized communication.
• For p processes, where p ≤ n, each process will have
n/p rows (n^2/p elements of the matrix).
• To compute the transpose, an all-to-all personalized
communication of matrix blocks of size (n/p) × (n/p)
is performed.

113
All-to-All Personalized Communication: Example

[Figure: a 4 × 4 matrix distributed row-wise over processes P0–P3. After the transpose, P0 holds [0,0],[1,0],[2,0],[3,0]; P1 holds [0,1],[1,1],[2,1],[3,1]; ...; P3 holds [0,3],[1,3],[2,3],[3,3].]

All-to-all personalized communication in transposing a 4 × 4 matrix using four processes.

• Processor Pi will contain the elements of the matrix with indices [i,0],
[i,1], .., [i,n-1]
• In transpose AT, P0 will have element [i,0], P1 will have element [i,1]
and so on.
• Initially processor Pi will have element [i,j] and after transpose, it
moves to Pj
• Figure above shows the example of 4 x 4 matrix mapped onto four
processes using one dimensional row wise partitioning.

114
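The row-wise matrix transpose described above maps onto MPI's alltoall collective; a minimal mpi4py sketch with one row per process follows (mpi4py and the script name are assumptions, not part of the slides).

# Run with: mpiexec -n 4 python transpose.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
p = comm.Get_size()

row = [10 * rank + j for j in range(p)]   # process i owns row i of a p x p matrix
# Element [i, j] must go to process j; alltoall delivers one distinct element per process.
transposed_row = comm.alltoall(row)       # process i now owns row i of the transpose
print(rank, transposed_row)               # e.g. rank 1 prints [1, 11, 21, 31]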
All-to-All Personalized Communication on a Ring

•Each node sends (p − 1) pieces of data of


size m as one consolidated message to
one of its neighbors.
•These pieces are identified by label {x,y},
where x is the label of the node that
originally owned the message, and y is
the label of the node that is the final
destination of the message.

115
All-to-All Personalized Communication on a Ring

•The label ({x1, y1}, {x2, y2}, . . . , {xn,


yn}) indicates that a message is formed
by concatenating n individual messages.
For example, ({0,1},{1,2},...,{4,5}).
•Each node extracts the information meant
for it from the message of size m(p − 1)
received, and forwards the remaining (p
− 2) data pieces of size m each to the
next node.

116
All-to-All Personalized Communication on a Ring

•The algorithm continues for (p − 1) steps.
In (p − 1) steps, every node receives
information from all the other nodes in the
group.
•The size of the forwarded message reduces by m at
each step.
•In each step, each node retains one m-word packet
originating from a different node.

117
All-to-All Personalized Communication on a Ring

•All messages are sent in the same


direction.
•To reduce the communication cost due
to tw by a factor of two, half of the
messages are sent in one direction and
the remaining half in the reverse direction,
so that the communication channels are fully used.

118
All-to-All Personalized Communication on a Ring

[Figure: all-to-all personalized communication on a six-node ring.]
All-to-all personalized communication on a six-node ring. The label of each message is of the form {x, y}, where x is the label of the node that originally owned the message, and y is the label of the node that is the final destination of the message. The label ({x1, y1}, {x2, y2}, . . . , {xn, yn}) indicates that a message is formed by concatenating n individual messages.

119
All-to-All Personalized Communication on a Ring

Communication Step 1
120
All-to-All Personalized Communication on a Ring

Communication Step 2
121
All-to-All Personalized Communication on a Ring

Communication Step 3
122
All-to-All Personalized Communication on a Ring

Communication Step 4
123
All-to-All Personalized Communication on a Ring

Communication Step 5
124
All-to-All Personalized Communication on a Ring: Cost

• All-to-all personalized communication on ring requires p − 1


communication steps in all.
• The size of message transfer in ith step is m(p − i).
• Therefore, total time taken by this operation is given by:

p−1
T = Σ (t s + tw m(p −
i=1 i))
p−1
= t s(p −w 1) + Σ it
i=1 m

= (ts + twmp/2)(p − 1).

• The tw term in this equation can be reduced by a factor of 2 by


communicating messages in both directions.
125
All-to-All Personalized Communication on a Mesh

•For all-to-all personalized


communication on mesh √p x √p, each
node first groups its p messages
according to the columns of their
destination nodes.
•Consider the example of 3 x 3 mesh.
•Each node have 9 m-word messages one
for each node.

126
All-to-All Personalized Communication on a Mesh

•For each node, three groups of three


messages are formed.
•The first group contains the messages
for destination nodes labelled 0, 3, and
6; the second group contains the
messages for nodes 1, 4, and 7; and the
last group of messages for nodes
labelled 2, 5, and 8.

127
All-to-All Personalized Communication on a Mesh

•After grouping, each node holds √p clusters
of messages, each of size m√p.
•Each cluster contains the information destined
for all the nodes of one column.
•Now in the first phase, all-to-all
personalized communication is performed
in each row.

128
All-to-All Personalized Communication on a Mesh

•After first phase, the messages present


with each node are sorted again according
to the rows of their destination nodes.
•In the second phase, similar
communication is carried out.
•After completion of the second phase, node i
has the messages ({0,i}, ..., {8,i}),
where 0 ≤ i ≤ 8. So each node has
a message from every other node.

129
All-to-All Personalized Communication on a Mesh

The label of each message is of the form


{x, y}, where x is the label of the node that
originally owned the message, and y is the label of
the node that is the final destination of the message.
The distribution of messages at the beginning of each
phase of all-to-all personalized communication on a
3 × 3 mesh. At the end of the second phase, node i
has messages ({0,i}, . . . ,{8,i}), where 0 ≤ i ≤ 8.
The groups of nodes communicating together in
each phase are enclosed in dotted boundaries.

130
All-to-All Personalized Communication on a Mesh

(a) Data distribution at the beginning of first phase 131


All-to-All Personalized Communication on a Mesh

(b) Data distribution at the beginning of second phase 132


All-to-All Personalized Communication on a Mesh

node 0: ({0,0},{1,0},{2,0},{3,0},{4,0},{5,0},{6,0},{7,0},{8,0})
node 1: ({0,1},{1,1},{2,1},{3,1},{4,1},{5,1},{6,1},{7,1},{8,1})
node 2: ({0,2},{1,2},{2,2},{3,2},{4,2},{5,2},{6,2},{7,2},{8,2})
node 3: ({0,3},{1,3},{2,3},{3,3},{4,3},{5,3},{6,3},{7,3},{8,3})
node 4: ({0,4},{1,4},{2,4},{3,4},{4,4},{5,4},{6,4},{7,4},{8,4})
node 5: ({0,5},{1,5},{2,5},{3,5},{4,5},{5,5},{6,5},{7,5},{8,5})
node 6: ({0,6},{1,6},{2,6},{3,6},{4,6},{5,6},{6,6},{7,6},{8,6})
node 7: ({0,7},{1,7},{2,7},{3,7},{4,7},{5,7},{6,7},{7,7},{8,7})
node 8: ({0,8},{1,8},{2,8},{3,8},{4,8},{5,8},{6,8},{7,8},{8,8})
133
(c) Final Data distribution after second phase
All-to-All Personalized Communication on a Mesh: Cost

• The time for the first phase is identical to that of all-to-all
personalized communication on a ring with √p nodes carrying
messages of size m√p, i.e.,
(ts + twmp/2)(√p − 1).

• The time for the second phase is identical to that of the first phase.
Therefore, the total time is twice this time, i.e.,

T = (2ts + twmp)(√p − 1).

134
All-to-All Personalized Communication on a Mesh: Cost

• Note that the time required for sorting the messages by
row and column is not included in the calculation of T.
• It is assumed that the data is already arranged for the first
communication phase; before the second communication
phase, mp words of data must be rearranged at each node.
• Let tr be the time to read and write a single word of data in a
node's local memory.
• Then the total time spent in data rearrangement by a node
during the complete procedure is tr x m x p.
• This time is very small compared to the communication
time T above.

135
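For a feel of the relative magnitudes, the following illustrative arithmetic (all parameter values assumed) compares the communication time T with the rearrangement time tr x m x p.

# Illustrative arithmetic (all values assumed) comparing the communication
# time T = (2*ts + tw*m*p)*(sqrt(p) - 1) with the rearrangement time tr*m*p.
ts, tw, tr, m, p = 50.0, 1.0, 0.05, 4, 64

T_comm  = (2 * ts + tw * m * p) * (p ** 0.5 - 1)
T_rearr = tr * m * p
print(T_comm, T_rearr)      # 2492.0 vs 12.8: rearrangement is comparatively cheap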
All-to-All Personalized Communication on a
Hypercube

• Generalize the mesh algorithm to log p steps.
• At any stage of all-to-all personalized
communication on a p-node hypercube, every node
holds p packets of size m each.
• While communicating along a particular dimension,
every node sends p/2 of these packets (consolidated
as one message).
• A node must rearrange its messages locally before
each of the log p communication steps takes place.
• In each step, pairs of nodes exchange data along a
different dimension of the hypercube.

136
All-to-All Personalized Communication on a
Hypercube
node 0: ({0,0} ... {0,7})    node 1: ({1,0} ... {1,7})
node 2: ({2,0} ... {2,7})    node 3: ({3,0} ... {3,7})
node 4: ({4,0} ... {4,7})    node 5: ({5,0} ... {5,7})
node 6: ({6,0} ... {6,7})    node 7: ({7,0} ... {7,7})

(a) Initial distribution of messages 137


All-to-All Personalized Communication on a
Hypercube
node 0: ({0,0},{0,2},{0,4},{0,6},{1,0},{1,2},{1,4},{1,6})
node 1: ({1,1},{1,3},{1,5},{1,7},{0,1},{0,3},{0,5},{0,7})
node 2: ({2,0},{2,2},{2,4},{2,6},{3,0},{3,2},{3,4},{3,6})
node 5: ({4,1},{4,3},{4,5},{4,7},{5,1},{5,3},{5,5},{5,7})
node 6: ({6,0},{6,2},{6,4},{6,6},{7,0},{7,2},{7,4},{7,6})
node 7: ({6,1},{6,3},{6,5},{6,7},{7,1},{7,3},{7,5},{7,7})

(b) Distribution before the second step 138


All-to-All Personalized Communication on a
Hypercube
node 0: ({0,0},{0,4},{2,0},{2,4},{1,0},{1,4},{3,0},{3,4})
node 1: ({1,1},{1,5},{3,1},{3,5},{0,1},{0,5},{2,1},{2,5})
node 2: ({0,2},{2,2},{0,6},{2,6},{1,2},{3,2},{1,6},{3,6})
node 5: ({4,1},{6,1},{4,5},{6,5},{5,1},{7,1},{5,5},{7,5})
node 6: ({6,2},{6,6},{4,2},{4,6},{7,2},{7,6},{5,2},{5,6})
node 7: ({7,3},{7,7},{5,3},{5,7},{6,3},{6,7},{4,3},{4,7})
139
(c) Distribution before the third step
All-to-All Personalized Communication on a
Hypercube
node 0: ({0,0} ... {7,0})    node 1: ({0,1} ... {7,1})
node 2: ({0,2} ... {7,2})    node 3: ({0,3} ... {7,3})
node 4: ({0,4} ... {7,4})    node 5: ({0,5} ... {7,5})
node 6: ({0,6} ... {7,6})    node 7: ({0,7} ... {7,7})

(d) Final distribution of messages 140


All-to-All Personalized Communication on a
Hypercube
(a) Initial distribution of messages      (b) Distribution before the second step
(c) Distribution before the third step    (d) Final distribution of messages

An all-to-all personalized communication algorithm on a
three-dimensional hypercube.
141
All-to-All Personalized Communication on a
Hypercube: Cost

• We have log p iterations, and mp/2 words are
communicated in each iteration. Therefore, the cost is:

T = (ts + twmp/2) log p.

• This is not optimal: the same words are transmitted several
times as they are forwarded through intermediate nodes, so
the tw term is larger than necessary.

142
All-to-All Personalized Communication on a
Hypercube: Optimal Algorithm

• Each node simply performs p − 1 communication


steps, exchanging m words of data with a different
node in every step.

• A node must choose its communication partner in each


step so that the hypercube links do not suffer
congestion.

143
All-to-All Personalized Communication on a
Hypercube: Optimal Algorithm

• In the jth communication step, node i exchanges data


with node (i XOR j).

• In this schedule, all paths in every communication step


are congestion-free, and none of the bidirectional links
carry more than one message in the same direction.

144
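The pairing rule can be verified directly. The Python sketch below (hypercube size assumed) checks that the schedule partner = i XOR j is symmetric within each step and pairs every node with every other node exactly once over the p − 1 steps.

# Check the XOR schedule: in step j, node i exchanges data with node i XOR j.
p = 8                                   # number of hypercube nodes (power of two assumed)
met = {i: set() for i in range(p)}

for j in range(1, p):                   # p - 1 communication steps
    for i in range(p):
        partner = i ^ j
        assert (partner ^ j) == i       # pairing is symmetric within the step
        met[i].add(partner)

for i in range(p):
    assert met[i] == set(range(p)) - {i}   # every node met every other node exactly once
print("XOR schedule pairs each node with every other node exactly once")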
All-to-All Personalized Communication on a
Hypercube: Optimal Algorithm
Seven steps (a)-(g) in all-to-all personalized
communication on an eight-node hypercube.
145
All-to-All Personalized Communication on a
Hypercube: Optimal Algorithm

procedure ALL_TO_ALL_PERSONAL(d, my_id)
begin
   for i := 1 to 2^d − 1 do
   begin
      partner := my_id XOR i;
      send M_{my_id,partner} to partner;
      receive M_{partner,my_id} from partner;
   endfor;
end ALL_TO_ALL_PERSONAL

A procedure to perform all-to-all personalized communication on a
d-dimensional hypercube. The message M_{i,j} initially resides on node i and is
destined for node j.
146
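A minimal executable Python sketch of the same procedure is shown below (not part of the slides); the send and receive operations are modelled with a shared dictionary rather than a real message-passing library.

# Model of the pairwise-exchange procedure on a d-dimensional hypercube.
# M[(i, j)] is the message that initially resides on node i and is destined for node j.
d = 3                                   # hypercube dimension, p = 2**d nodes (assumed)
p = 2 ** d
M = {(i, j): f"msg {i}->{j}" for i in range(p) for j in range(p)}

inbox = {j: {} for j in range(p)}       # what each node has received so far
for i in range(1, 2 ** d):              # for i := 1 to 2^d - 1
    for my_id in range(p):
        partner = my_id ^ i             # partner := my_id XOR i
        # "send M[my_id, partner] to partner" modelled as a dictionary write
        inbox[partner][my_id] = M[(my_id, partner)]

# every node has received one personalized message from every other node
for j in range(p):
    assert sorted(inbox[j]) == [i for i in range(p) if i != j]
print("all-to-all personalized exchange complete on a", d, "dimensional hypercube")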
All-to-All Personalized Communication on a
Hypercube: Cost Analysis of Optimal Algorithm

• There are p − 1 steps, and each step involves a congestion-free
transfer of m words.
• We have:

T = (ts + twm)(p − 1).

• This is asymptotically optimal in message size.

147
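The two hypercube algorithms can be compared numerically. In the illustrative sketch below (ts, tw and a large message size m are assumed), the (ts + twm)(p − 1) schedule wins once the twm term dominates, because each word is transmitted only once instead of being forwarded through intermediate nodes.

# Compare the log p-step algorithm with the p - 1 step pairwise-exchange algorithm.
import math

ts, tw, m = 50.0, 1.0, 1000   # assumed startup time, per-word time, words per message
for p in (8, 64, 1024):
    T_logp = (ts + tw * m * p / 2) * math.log2(p)   # (ts + tw*m*p/2) log p
    T_pair = (ts + tw * m) * (p - 1)                # (ts + tw*m)(p - 1)
    print(p, T_logp, T_pair)
# when the tw*m term dominates, the pairwise schedule is cheaper because it
# does not forward data through intermediate nodes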
Circular Shift
•Circular shift is used in some matrix
computations and in string and image
pattern matching.
•It is a member of a broader class of global
communication operations called permutations.
•In a permutation, every node sends a message
of m words to a unique destination node.
•In a circular q-shift, node i sends its data to
node (i + q) mod p in a group of p nodes,
where 0 < q < p.

148
Circular Shift on a Mesh

•Mesh algorithms for circular shift can be
derived from the ring algorithm.
•A wraparound mesh is assumed, i.e. in a row of
4 nodes 0, 1, 2, 3, node 3 can communicate
directly and send data to node 0.
•On a ring, a circular q-shift can be performed by
min(q, p − q) neighbor-to-neighbor
communications in one direction, where p is the
number of nodes and q is the number of shifts
to be performed.
149
Circular Shift on a Mesh

•In a p-node square wraparound mesh with
row-major node labels, a circular q-shift is
performed in two stages.
•Example: q = 5 shifts, p = 16 (4 x 4 mesh).
•In the first stage, the data is shifted
simultaneously by (q mod √p) steps along all the
rows, i.e. (5 mod √16) = 1 step in our example.
•In the second stage, it is shifted by ⌊q/√p⌋ steps
along the columns, i.e. ⌊5/4⌋ = 1 step.

150
Circular Shift on a Mesh

•Due to the wraparound connection used during the
circular row shifts, some data moves from the
highest to the lowest labeled node of its row. For
example, data at node 3 is shifted to node 0 in
the first row.
•To compensate for the distance √p that these
packets lost while traversing the backward edge in
their respective rows, they must be shifted by one
additional step along the columns.

151
Circular Shift on a Mesh

•In the example, the row shift is followed by one
compensating column step for the wrapped packets,
and then by the column shift.

•The total time for any circular q-shift on a
p-node mesh using packets of size m is:
T = (ts + twm)(√p + 1).

152
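The staged procedure can be checked against a direct shift. The Python sketch below (a 4 x 4 wraparound mesh with row-major labels and q = 5 are assumed) performs the row stage, one compensating column step for wrapped packets, and the column stage, and asserts that every packet lands on node (i + q) mod p.

# Staged circular q-shift on a sqrt(p) x sqrt(p) wraparound mesh, checked
# against the direct destination (i + q) mod p.
p, q = 16, 5                           # assumed example values
side = 4                               # sqrt(p)

for i in range(p):
    r, c = divmod(i, side)
    # stage 1: shift along the row, with wraparound inside the row
    c_new = (c + q % side) % side
    wrapped = c + q % side >= side     # the packet crossed the end of its row
    node = r * side + c_new
    # one compensating column step for wrapped packets, then the column-shift stage
    col_steps = (1 if wrapped else 0) + q // side
    r_new = (node // side + col_steps) % side
    node = r_new * side + node % side
    assert node == (i + q) % p         # same destination as a direct q-shift
print("staged mesh shift delivers every packet to node (i + q) mod p")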
Circular Shift on a Mesh

nodes 12 13 14 15 hold (12) (13) (14) (15)
nodes  8  9 10 11 hold  (8)  (9) (10) (11)
nodes  4  5  6  7 hold  (4)  (5)  (6)  (7)
nodes  0  1  2  3 hold  (0)  (1)  (2)  (3)

(a) Initial data distribution and the first communication step


The communication steps in a circular 5-shift on a 4 × 4 mesh. 153
Circular Shift on a Mesh
nodes 12 13 14 15 hold (15) (12) (13) (14)
nodes  8  9 10 11 hold (11)  (8)  (9) (10)
nodes  4  5  6  7 hold  (7)  (4)  (5)  (6)
nodes  0  1  2  3 hold  (3)  (0)  (1)  (2)
Data from node 3 was supposed to shift to node 4, but due to the
wraparound row shift it reached node 0; the compensating column
step now moves it to node 4.

(b) Step to compensate for backward row shifts


The communication steps in a circular 5-shift on a 4 × 4 mesh. 154
Circular Shift on a Mesh
nodes 12 13 14 15 hold (11) (12) (13) (14)
nodes  8  9 10 11 hold  (7)  (8)  (9) (10)
nodes  4  5  6  7 hold  (3)  (4)  (5)  (6)
nodes  0  1  2  3 hold (15)  (0)  (1)  (2)
The shift is carried out every time on a unique node.

(c) Column shifts in the third communication step


The communication steps in a circular 5-shift on a 4 × 4 mesh. 155
Circular Shift on a Mesh
nodes 12 13 14 15 hold  (7)  (8)  (9) (10)
nodes  8  9 10 11 hold  (3)  (4)  (5)  (6)
nodes  4  5  6  7 hold (15)  (0)  (1)  (2)
nodes  0  1  2  3 hold (11) (12) (13) (14)

(d) Final distribution of the data


The communication steps in a circular 5-shift on a 4 × 4 mesh. 156
Circular Shift on a Hypercube
• For the shift operation on a hypercube, a linear
array with 2^d nodes is mapped onto the
d-dimensional hypercube.
• Node i of the linear array is assigned to
node j of the hypercube, where j is the d-bit
binary Reflected Gray Code (RGC) of i.
• In the eight-node hypercube shown in the figure,
any two nodes at distance 2^i in the linear array
are separated by exactly two hypercube links.
• The case i = 0 is the exception: such nodes are
directly connected, so only one hypercube link
separates them. 157
Circular Shift on a Hypercube
• For a q-shift operation, q is expanded as a sum
of distinct powers of 2. For example, the number 5
can be expanded as 2^2 + 2^0.
• Note that the number of terms in the sum equals the
number of 1's in the binary representation of q. E.g.
for 5 (101), two terms appear in the sum,
corresponding to bit position 2 and bit
position 0, i.e. 2^2 + 2^0.
• A circular q-shift on a hypercube is performed
in s phases, where s is the number of distinct
powers of 2 in the expansion of q.
158
Circular Shift on a Hypercube
• For example, a 5-shift operation is performed
as a 4-shift (2^2) followed by a 1-shift (2^0).
• Each phase takes two communication steps;
only a 1-shift phase takes a single step. For
example, the first phase of the 5-shift (the 4-shift)
consists of two steps and the second phase (the
1-shift) consists of one step.
• The total number of steps for any q in a p-node
hypercube is at most 2 log p − 1.

159
Circular Shift on a Hypercube
• The time for this is upper bounded by:
T = (ts + twm)(2 log p − 1).
• If E-cube routing is used, this time can be
reduced to
T = ts + twm.

160
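Following the counting rule above, the short helper below (illustrative only) computes the number of communication steps for any q: two steps per 2^i term of q, except the 2^0 term, which needs only one.

# Count communication steps for a circular q-shift on a p-node hypercube.
import math

def shift_steps(q):
    # set bit positions of q: each contributes 2 steps, except bit 0 which contributes 1
    bits = [i for i in range(q.bit_length()) if (q >> i) & 1]
    return sum(1 if i == 0 else 2 for i in bits)

p = 8
print([shift_steps(q) for q in range(1, p)])                                  # [1, 2, 3, 2, 3, 4, 5]
print(max(shift_steps(q) for q in range(1, p)), 2 * int(math.log2(p)) - 1)    # 5 5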
Circular Shift on a Hypercube
First communication step of the 4-shift.   Second communication step of the 4-shift.
(a) The first phase (a 4-shift)
The mapping of an eight-node linear array onto a three-dimensional hypercube to
perform a circular 5-shift as a combination of a 4-shift and a 1-shift.
161
Circular Shift on a Hypercube

(b) The second phase (a 1-shift)
The mapping of an eight-node linear array onto a three-dimensional hypercube to
perform a circular 5-shift as a combination of a 4-shift and a 1-shift.
162
Circular Shift on a Hypercube
(c) Final data distribution after the 5-shift
The mapping of an eight-node linear array onto a three-dimensional hypercube to
perform a circular 5-shift as a combination of a 4-shift and a 1-shift.
163
Circular Shift on a Hypercube


(a) 1-shift (b) 2-shift

Circular q-shifts on an 8-node hypercube for 1 ≤ q < 8. 164


Circular Shift on a Hypercube

(c) 3-shift (d) 4-shift

Circular q-shifts on an 8-node hypercube for 1 ≤ q < 8. 165


Circular Shift on a Hypercube

(e) 5-shift (f) 6-shift

Circular q-shifts on an 8-node hypercube for 1 ≤ q < 8. 166


Circular Shift on a Hypercube

(g) 7-shift

Circular q-shifts on an 8-node hypercube for 1 ≤ q < 8. 167


Circular Shift on a Hypercube

Circular q-shifts on an 8-node hypercube for 1 ≤ q < 8. 168


Improving Performance of Operations

• Splitting and routing messages into parts: If the message can be split into p
parts, a one-to-all broadcast can be implemented as a scatter operation followed by an
all-to-all broadcast operation. The time for this is:

T = 2 × (ts log p + tw(p − 1)m/p)
  ≈ 2 × (ts log p + twm).          (10)

• All-to-one reduction can be performed as an all-to-all reduction (the dual of all-to-all
broadcast) followed by a gather operation (the dual of scatter).

169
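The benefit of splitting can be seen by comparing equation (10) with the standard hypercube one-to-all broadcast time (ts + twm) log p. The arithmetic below uses assumed values of ts, tw and a long message m.

# Compare a plain one-to-all broadcast with the scatter + all-to-all broadcast version.
import math

ts, tw, m = 50.0, 1.0, 10000     # assumed startup time, per-word time, message words
for p in (8, 64, 1024):
    plain = (ts + tw * m) * math.log2(p)                     # (ts + tw*m) log p
    split = 2 * (ts * math.log2(p) + tw * (p - 1) * m / p)   # equation (10)
    print(p, plain, split)
# splitting pays off for long messages: the tw term no longer grows with log p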
Improving Performance of Operations

• Since an all-reduce operation is semantically equivalent to an all-to-one reduction


followed by a one-to-all broadcast, the asymptotically optimal algorithms for these two
operations can be used to construct a similar algorithm for the all-reduce operation.
• The intervening gather and scatter operations cancel each other. Therefore, an all-reduce
operation requires an all-to-all reduction and an all-to-all broadcast.

170
Improving Performance of Operations

• The communication algorithms discussed so far are based on two assumptions:
• 1) the original message cannot be divided into smaller parts;
• 2) each node uses a single port for sending and receiving data.

• We can analyse the effect of relaxing these two assumptions:

Splitting and routing messages in parts:

171
University Questions on Unit 3
• August 2018 (Insem)
• 1. Explain Broadcast and Reduction example for multiplying
matrix with a vector.(6)
• 2. Explain the concept of scatter and gather (4)
• 3. Compare the one-to-all broadcast operation for Ring,
Mesh and Hypercube topologies (6)
• 4. Explain the prefix-sum operation for an eight-node
hypercube (4)

• Nov-Dec 2018 (Endsem)


• 1. Write a short note on All-to-one reduction with suitable
example. [6]

172
University Questions on Unit 3
• Nov-Dec 2019 (Endsem)
• 1. Explain all-to-all broadcast on linear array, mesh
& hypercube topologies. [8]
• 2 Write short note on circular shift on a mesh. [6]

• Oct 2019 (Insem)


• 1. Explain broadcast & reduce operation with diagram. [4]
• 2. Explain prefix-sum operation for an eight-node
hypercube. [6]
• 3. Explain scatter and gather operations. [4]
• 4. Explain all-to-one broadcast and reduction on a ring. [6]

173
University Questions on Unit 3
• May-June-2019 (Endsem)
• 1. Explain Circular shift operation on mesh and hypercube
network. [8]

174
