
Principles of Parallel

Algorithm Design

Prof V B More
MET-BKC IOE

1
Algorithms and Concurrency
 Introduction to Parallel Algorithms
• Tasks and Decomposition
• Processes and Mapping
• Processes Versus Processors

 Decomposition / Partitioning
Techniques
• Recursive Decomposition
• Exploratory Decomposition
• Hybrid Decomposition
2
Algorithms and Concurrency

 Characteristics of Tasks and


Interactions
• Task Generation, Granularity, and
Context
• Characteristics of Task
Interactions.

3
Concurrency and Mapping
• Mapping Techniques for Load
Balancing
–Static and Dynamic Mapping

• Methods for Minimizing Interaction


Overheads
–Maximizing Data Locality
–Minimizing Contention and Hot-
Spots
4
Concurrency and Mapping
–Overlapping Communication and
Computations
–Replication vs. Communication
–Group Communications vs. Point-
to-Point Communication
• Parallel Algorithm Design Models
–Data-Parallel, Work-Pool, Task
Graph, Master-Slave, Pipeline, and
Hybrid Models
5
Decomposition, Tasks, and Dependency
Graphs
• The very first step in the design of a
parallel algorithm is to decompose the
problem into tasks that can be
executed concurrently

6
Decomposition, Tasks, and Dependency
Graphs
• Tasks can be decomposed into sub-
tasks in various ways. Decomposed
tasks may be of:
– the same size,
– different sizes, or
– intermediate sizes.

7
Decomposition, Tasks, and Dependency
Graphs
• A decomposition can be illustrated in
the form of a directed graph with
nodes corresponding to tasks and
edges indicating that the result of
one task is required for processing
the next. Such a graph is called a
task dependency graph.

8
Example Decomposition of task into nodes and edges

[Figure: a task-dependency graph. Tasks 1-4 (weight 10 each) are the decomposed tasks; their results feed Task 5 (weight 6) and Task 6 (weight 9), whose outputs feed Task 7 (weight 8), which produces the result of the main task.]

9
Example: Multiplying a Dense Matrix with a Vector
[Figure: an n x n matrix A multiplied by a vector b to produce y; the shaded row of A and all of b are the data accessed by Task 1.]
Computation of each element of the output vector y is independent of the other elements. Based on this, a dense matrix-vector product can be decomposed into n tasks.
10
Example: Multiplying a Dense Matrix with a Vector
[Figure: the same n-task decomposition, highlighting the row of A and the vector b accessed by Task 1.]
Findings: While tasks share data (vector b), they do not have any control dependencies, i.e., no task needs to wait for the (partial) completion of any other. All tasks are of the same size in terms of number of operations.
11
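A minimal C sketch of this decomposition (names are illustrative, not from the slides): each of the n tasks computes one element of y, reading one row of A and all of b but writing a disjoint output element, so the tasks have no control dependencies.

/* Task i of the n-task decomposition of y = A*b (A stored row-major). */
void task_i(int i, int n, const double *A, const double *b, double *y)
{
    double sum = 0.0;
    for (int j = 0; j < n; j++)
        sum += A[i * n + j] * b[j];   /* reads row i of A and all of b */
    y[i] = sum;                       /* writes only y[i]              */
}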
Example: Database Query Processing
Consider the execution of the query:
MODEL = ”CIVIC” AND YEAR = 2001 AND
(COLOR = ”GREEN” OR COLOR = “WHITE”)
on the following database:
ID# Model Year Color Dealer Price
4523 Civic 2002 Blue MN $18,000
3476 Corolla 1999 White IL $15,000
7623 Camry 2001 Green NY $21,000
9834 Prius 2001 Green CA $18,000
6734 Civic 2001 White OR $17,000
5342 Altima 2001 Green FL $19,000
3845 Maxima 2001 Blue NY $22,000
8354 Accord 2000 Green VT $18,000
4395 Civic 2001 Red CA $17,000
7352 Civic 2002 Red WA $18,000

12
Example: Database Query Processing

Consider the execution of the query:

MODEL = "CIVIC" AND YEAR = 2001 AND
(COLOR = "GREEN" OR COLOR = "WHITE")

13
Example: Database Query Processing
ID# Model Year Color Dealer Price
4523 Civic 2002 Blue MN $18,000
3476 Corolla 1999 White IL $15,000
7623 Camry 2001 Green NY $21,000
9834 Prius 2001 Green CA $18,000
6734 Civic 2001 White OR $17,000
5342 Altima 2001 Green FL $19,000
3845 Maxima 2001 Blue NY $22,000
8354 Accord 2000 Green VT $18,000
4395 Civic 2001 Red CA $17,000
7352 Civic 2002 Red WA $18,000
14
Example: Database Query Processing

• The execution of the query can be divided into different subtasks.
• Each task can produce an intermediate table of the tuples that satisfy a particular clause.

15
Example: Database Query Processing
[Figure: decomposing the given query into a number of tasks. Leaf tasks compute the intermediate tables for MODEL = "CIVIC", YEAR = 2001, COLOR = "WHITE" and COLOR = "GREEN"; these feed the tasks for MODEL = "CIVIC" AND YEAR = 2001 and for COLOR = "GREEN" OR COLOR = "WHITE", whose outputs are combined by the final task, which yields the single tuple (6734, Civic, 2001, White). Edges in this graph denote that the output of one task is needed to accomplish the next.]
16
Example: Database Query Processing
MODEL = "CIVIC"
[Intermediate table produced by the MODEL = "CIVIC" task:]
ID# Model
4523 Civic
6734 Civic
4395 Civic
7352 Civic
17
Example: Database Query Processing
ID# Model Year Color Dealer Price
4523 Civic 2002 Blue MN $18,000
3476 Corolla 1999 White IL $15,000
7623 Camry 2001 Green NY $21,000
9834 Prius 2001 Green CA $18,000
6734 Civic 2001 White OR $17,000
5342 Altima 2001 Green FL $19,000
3845 Maxima 2001 Blue NY $22,000
8354 Accord 2000 Green VT $18,000
4395 Civic 2001 Red CA $17,000
7352 Civic 2002 Red WA $18,000
18
Example: Database Query Processing
YEAR = 2001
[Intermediate table produced by the YEAR = 2001 task:]
ID# Year
7623 2001
9834 2001
6734 2001
5342 2001
3845 2001
4395 2001
19
Example: Database Query Processing
ID# Model Year Color Dealer Price
4523 Civic 2002 Blue MN $18,000
3476 Corolla 1999 White IL $15,000
7623 Camry 2001 Green NY $21,000
9834 Prius 2001 Green CA $18,000
6734 Civic 2001 White OR $17,000
5342 Altima 2001 Green FL $19,000
3845 Maxima 2001 Blue NY $22,000
8354 Accord 2000 Green VT $18,000
4395 Civic 2001 Red CA $17,000
7352 Civic 2002 Red WA $18,000
20
Example: Database Query Processing
COLOR = "WHITE"
[Intermediate table produced by the COLOR = "WHITE" task:]
ID# Color
3476 White
6734 White
21
Example: Database Query Processing
ID# Model Year Color Dealer Price
4523 Civic 2002 Blue MN $18,000
3476 Corolla 1999 White IL $15,000
7623 Camry 2001 Green NY $21,000
9834 Prius 2001 Green CA $18,000
6734 Civic 2001 White OR $17,000
5342 Altima 2001 Green FL $19,000
3845 Maxima 2001 Blue NY $22,000
8354 Accord 2000 Green VT $18,000
4395 Civic 2001 Red CA $17,000
7352 Civic 2002 Red WA $18,000
22
Example: Database Query Processing
COLOR = "GREEN"
[Intermediate table produced by the COLOR = "GREEN" task:]
ID# Color
7623 Green
9834 Green
5342 Green
8354 Green
23
Example: Database Query Processing
ID# Model Year Color Dealer Price
4523 Civic 2002 Blue MN $18,000
3476 Corolla 1999 White IL $15,000
7623 Camry 2001 Green NY $21,000
9834 Prius 2001 Green CA $18,000
6734 Civic 2001 White OR $17,000
5342 Altima 2001 Green FL $19,000
3845 Maxima 2001 Blue NY $22,000
8354 Accord 2000 Green VT $18,000
4395 Civic 2001 Red CA $17,000
7352 Civic 2002 Red WA $18,000
24
Example: Database Query Processing
MODEL = "CIVIC" AND YEAR = 2001
[This task combines the MODEL = "CIVIC" table with the YEAR = 2001 table:]
ID# Model Year
6734 Civic 2001
4395 Civic 2001
25
Example: Database Query Processing
ID# Model Year Color Dealer Price
4523 Civic 2002 Blue MN $18,000
3476 Corolla 1999 White IL $15,000
7623 Camry 2001 Green NY $21,000
9834 Prius 2001 Green CA $18,000
6734 Civic 2001 White OR $17,000
5342 Altima 2001 Green FL $19,000
3845 Maxima 2001 Blue NY $22,000
8354 Accord 2000 Green VT $18,000
4395 Civic 2001 Red CA $17,000
7352 Civic 2002 Red WA $18,000
26
Example: Database Query Processing
COLOR = "GREEN" OR COLOR = "WHITE"
[This task combines the COLOR = "WHITE" table with the COLOR = "GREEN" table:]
ID# Color
3476 White
7623 Green
9834 Green
6734 White
5342 Green
8354 Green
27
Example: Database Query Processing
ID# Model Year Color Dealer Price
4523 Civic 2002 Blue MN $18,000
3476 Corolla 1999 White IL $15,000
7623 Camry 2001 Green NY $21,000
9834 Prius 2001 Green CA $18,000
6734 Civic 2001 White OR $17,000
5342 Altima 2001 Green FL $19,000
3845 Maxima 2001 Blue NY $22,000
8354 Accord 2000 Green VT $18,000
4395 Civic 2001 Red CA $17,000
7352 Civic 2002 Red WA $18,000
28
Example: Database Query Processing
MODEL = "CIVIC" AND YEAR = 2001 AND (COLOR = "GREEN" OR COLOR = "WHITE")
[The final task combines the MODEL = "CIVIC" AND YEAR = 2001 table with the COLOR = "GREEN" OR COLOR = "WHITE" table:]
ID# Model Year Color
6734 Civic 2001 White
29
Example: Database Query Processing
ID# Model Year Color Dealer Price
4523 Civic 2002 Blue MN $18,000
3476 Corolla 1999 White IL $15,000
7623 Camry 2001 Green NY $21,000
9834 Prius 2001 Green CA $18,000
6734 Civic 2001 White OR $17,000
5342 Altima 2001 Green FL $19,000
3845 Maxima 2001 Blue NY $22,000
8354 Accord 2000 Green VT $18,000
4395 Civic 2001 Red CA $17,000
7352 Civic 2002 Red WA $18,000
30
Example: Database Query Processing
The same problem can be decomposed into subtasks in other ways as well.
[Figure: an alternate decomposition of the given query into subtasks, along with their data dependencies. Here COLOR = "GREEN" OR COLOR = "WHITE" is evaluated first; its result is combined with YEAR = 2001 to give 2001 AND (White OR Green) (tuples 7623, 6734, 5342), which is then combined with MODEL = "CIVIC", again yielding (6734, Civic, 2001, White).]
Different task decomposition methods may lead to different parallel performance.
31
Granularity of Task Decompositions
• The number of tasks into which a
problem is decomposed determines
its granularity.
• Decomposition into a large number
of tasks results in fine grained
decomposition and that into a small
number of tasks results in a
coarse grained decomposition.

32
Granularity of Task Decompositions
[Figure: matrix A (columns 0, 1, ..., n), vector b and result y, partitioned among Task 1 to Task 4.]
A coarse grain decomposition of dense matrix-vector multiplication into four tasks. Each task computes three elements of the result vector.
33
Degree of Concurrency
• The number of tasks that can be
executed in parallel is called the
degree of concurrency of a
decomposition.
• Since the number of tasks that can
be executed in parallel may change
during program execution, the
maximum degree of concurrency is
the maximum number of such tasks
at any point during execution. 34
Degree of Concurrency
• What is the maximum degree of concurrency of the database query examples?
• The total amount of work is obtained by summing over all tasks executed, from the initial decomposed tasks through the intermediate tasks, until the final result is obtained. In the database query example, 7 tasks are executed in both decompositions.
35
• The critical path in a task-dependency graph is the longest directed path between any pair of start and finish nodes; its length is the sum of the weights of the nodes along that path.
• The average degree of
concurrency is = (Total amount of
work) / (critical path length)
36
• Assuming that each task in the database example takes identical processing time, what is the average degree of concurrency in each decomposition?
• The degree of concurrency
increases as the decomposition
becomes finer in granularity and
vice versa.
37
Critical Path Length
• A directed path in the task
dependency graph represents a
sequence of tasks that must be
processed one after the other.

38
Critical Path Length

• The longest such path determines


the shortest time in which the
program can be executed in parallel.

39
Critical Path Length

• The length of the longest path in a


task dependency graph is called the
critical path length.

40
Critical Path Length
• Consider the task dependency graphs of the two database query decompositions:
[Figure: (a) Tasks 1-4 (weight 10 each) feed Task 5 (weight 6) and Task 6 (weight 9), which feed Task 7 (weight 8). (b) Tasks 1-4 (weight 10 each); two of them feed Task 5 (weight 6), Task 5 and a third feed Task 6 (weight 11), and Task 6 and the fourth feed Task 7 (weight 7).]
41
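A worked reading of the node weights shown above: for graph (a) the total amount of work (sum of node weights) is 4 x 10 + 6 + 9 + 8 = 63 and the critical path (a level-1 task, then Task 6, then Task 7) has length 10 + 9 + 8 = 27, giving an average degree of concurrency of 63/27 ≈ 2.33. For graph (b) the total work is 4 x 10 + 6 + 11 + 7 = 64 and the critical path length is 10 + 6 + 11 + 7 = 34, giving 64/34 ≈ 1.88.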
Questions:
• What are the critical path lengths for the
two task dependency graphs?
• If each task takes 10 time units, what is the
shortest parallel execution time for each
decomposition?
• How many processors are needed in each
case to achieve this minimum parallel
execution time?
• What is the maximum degree of
concurrency?

42
Limits on Parallel Performance
• It can be observed that the parallel
time can be made arbitrarily small by
making the decomposition finer in
granularity.
• There is an inherent bound on how fine the granularity of a computation can be. For example, when multiplying a dense n x n matrix with a vector, there can be no more than n² concurrent tasks (one per scalar multiplication).

43
Limits on Parallel Performance
[Figure: a 3 x 3 grid, rows 1-3 by columns 1-3, with cells numbered 1-9.]
44
Limits on Parallel Performance
[Figure: the same 3 x 3 grid with each row assigned to one task: 3 rows give 3 tasks.]
45
Limits on Parallel Performance
[Figure: the same 3 x 3 grid with each cell assigned to one task: 3 rows x 3 columns = 9 cells = 9 tasks.]
46
Limits on Parallel Performance
• Concurrent tasks may also have to
exchange data with other tasks. This
results in communication overhead.
• The tradeoff between the granularity
of a decomposition and associated
overheads often determines
performance bottleneck.

47
Partitioning Techniques
• There is no single recipe that works
for all problems.
• We can benefit from some commonly
used techniques:
– Recursive Decomposition
– Data Decomposition
– Exploratory Decomposition
– Speculative Decomposition

48
Recursive Decomposition
• Generally suited to problems that are
solved using a divide and conquer
strategy.
• Decompose based on sub-problems
• Often results in natural concurrency
as sub-problems can be solved in
parallel.
• Need to think recursively
– parallel not sequential
49
Recursive Decomposition: Quicksort

Once each sublist has been partitioned around the pivot, each sub-sublist can be processed concurrently.
50
Recursive Decomposition:
Finding the Min/Max/Sum
• Any associative and commutative operation.
1. procedure SERIAL_MIN (A, n)
2. begin
3. min = A[0];
4. for i := 1 to n − 1 do
5. if (A[i] < min) min := A[i];
6. endfor;
7. return min;
8. end SERIAL_MIN

51
Recursive Decomposition:
Finding the Min/Max/Sum
• Rewrite using recursion and max partitioning
– Don’t make a serial recursive routine
1. procedure RECURSIVE_MIN (A, n)
2. begin
3. if ( n = 1 ) then
4. min := A [0] ;
5. else Note: Divide the work
6. lmin := RECURSIVE_MIN ( A, n/2 ); in half each time.
7. rmin := RECURSIVE_MIN ( &(A[n/2]), n - n/2 );
8. if (lmin < rmin) then
9. min := lmin;
10. else
11. min := rmin;
12. endelse;
13. endelse;
14. return min;
15. end RECURSIVE_MIN 52
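A hedged C sketch of the same recursive decomposition using OpenMP tasks (OpenMP is not covered in these slides; it is used here only to show that the two half-problems can genuinely run in parallel):

#include <limits.h>

/* Recursive min over A[0..n-1]; the two halves run as independent tasks. */
int recursive_min(const int *A, int n)
{
    if (n == 1) return A[0];
    int lmin = INT_MAX, rmin = INT_MAX;
    #pragma omp task shared(lmin)
    lmin = recursive_min(A, n / 2);              /* left half  */
    #pragma omp task shared(rmin)
    rmin = recursive_min(A + n / 2, n - n / 2);  /* right half */
    #pragma omp taskwait                         /* wait for both halves */
    return (lmin < rmin) ? lmin : rmin;
}

/* Call it from inside a parallel region, e.g.:
   #pragma omp parallel
   #pragma omp single
   m = recursive_min(A, n);                      */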
Recursive Decomposition:
Finding the Min/Max/Sum
• Example: Find min of {4,9,1,7,8,11,2,12}

[Figure: the list 4 9 1 7 8 11 2 12 shown at steps 1-3 of the recursion, with the minimum computed over progressively larger parts of the list.]
53
Recursive Decomposition:
Finding the Min/Max/Sum
• May struggle to divide the work exactly in half
• Often, can be mapped to a hypercube
for a very efficient algorithm
• The overhead of dividing the
computation is important.
– How much does it cost to communicate
necessary dependencies?

54
Recursive Decomposition:
Sequential Merge Sort
Pseudo code of Sequential Merge Sort with Recursion
void mergeSort(int *a, int first, int last, int *aux) {
if (last <= first) return;
int mid = (first + last) / 2;
mergeSort (a, first, mid, aux); //first half
mergeSort (a, mid+1, last, aux); //next half
mergeArrays(a, first, mid, a, mid+1, last, aux, first, last);
for (int i=first; i<= last; i++) a[i] = aux[i];
}

void mergeArrays(int *a, int afirst, int alast, int *b, int bfirst, int blast, int *c, int cfirst, int
clast) {
int i = afirst, j = bfirst, k = cfirst;
while (i <= alast && j <= blast ) {
if ( a[i] < b[j]) c[k++] = a[i++];
else c[k++] = b[j++];
}
while (i <= alast ) c[k++] = a[i++];
while (j <= blast ) c[k++] = b[j++];
}
55
Recursive Decomposition:
Parallel Merge Sort
Pseudo code of Parallel Merge Sort
void parallel_mergeSort() {
  if (proc_id > 0) {                       // non-root: receive sub-array from parent
    Recv(size, parent);
    Recv(a, size, parent);
  }
  mid = size / 2;
  if (both children) {                     // split the array between the two children
    Send(mid, child1);
    Send(size - mid, child2);
    Send(a, mid, child1);
    Send(a + mid, size - mid, child2);
    Recv(a, mid, child1);                  // receive the two sorted halves back
    Recv(a + mid, size - mid, child2);
    mergeArrays(a, 0, mid - 1, a, mid, size - 1, aux, 0, size - 1);
    for (int i = 0; i < size; i++) a[i] = aux[i];
  }
  else mergeSort(a, 0, size - 1, aux);     // leaf process: sort locally

  if (proc_id > 0) Send(a, size, parent);  // return the sorted sub-array
}
56
Data Decomposition
• This is the most common approach
• Identify the data and partition across
tasks
• Three approaches
– Output Data Decomposition
– Input Data Decomposition
– Domain Decomposition

57
Output Data Decomposition
• Each element of the output can be
computed independently of the others
– A function of the input
– All may be able to share the input or have a
copy of their own
• Decompose the problem naturally.
• Embarrassingly Parallel
– Output data decomposition with no need for
communication
– Mandelbrot, Simple Ray Tracing, etc.

58
Mandlebrot Fractal Zoom
https://youtu.be/PD2XgQOyCCk

The Next Dimension - 3D Mandelbrot Fractal


Zoom (MMY3D)
https://youtu.be/hRrBnI5L0u8

3D Fractal
https://youtu.be/S530Vwa33G0

Sapphires - Mandlebrot Fractal Zoom


https://youtu.be/8cgp2WNNKmQ

Prof V B More, MET IOE BKC 63


Output Data Decomposition
• Matrix Multiplication: A * B = C
• Can partition output matrix C

64
Output Data Decomposition:
Example
A partitioning of output data does
not result in a unique
decomposition into tasks. For
example, with identical output data
distribution, we can derive the
following two (other)
decompositions:

65
Output Data Decomposition:
Example
Decomposition I:
Task 1: C1,1 = A1,1 B1,1
Task 2: C1,1 = C1,1 + A1,2 B2,1
Task 3: C1,2 = A1,1 B1,2
Task 4: C1,2 = C1,2 + A1,2 B2,2
Task 5: C2,1 = A2,1 B1,1
Task 6: C2,1 = C2,1 + A2,2 B2,1
Task 7: C2,2 = A2,1 B1,2
Task 8: C2,2 = C2,2 + A2,2 B2,2

Decomposition II:
Task 1: C1,1 = A1,1 B1,1
Task 2: C1,1 = C1,1 + A1,2 B2,1
Task 3: C1,2 = A1,2 B2,2
Task 4: C1,2 = C1,2 + A1,1 B1,2
Task 5: C2,1 = A2,2 B2,1
Task 6: C2,1 = C2,1 + A2,1 B1,1
Task 7: C2,2 = A2,1 B1,2
Task 8: C2,2 = C2,2 + A2,2 B2,2
66
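A minimal C sketch of one of these tasks (the names and block size b are illustrative, not from the slides); it assumes the output block starts zeroed, so the odd-numbered task initializes the block and the following task accumulates into it, which is why the two form a dependent pair.

/* C_blk += A_blk * B_blk, where each block is a b x b row-major array. */
void block_multiply_add(double *C_blk, const double *A_blk,
                        const double *B_blk, int b)
{
    for (int i = 0; i < b; i++)
        for (int j = 0; j < b; j++)
            for (int k = 0; k < b; k++)
                C_blk[i * b + j] += A_blk[i * b + k] * B_blk[k * b + j];
}
/* Decomposition I, Tasks 1 and 2 (assuming C1,1 is zero-initialized):
   block_multiply_add(C11, A11, B11, b);    // Task 1
   block_multiply_add(C11, A12, B21, b);    // Task 2 depends on Task 1 */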


Output Data Decomposition
• Count the instances of given itemsets

67
Output Data Decomposition
• Count the instances of given itemsets

68
Input Data Decomposition
• Applicable if the output can be naturally
computed as a function of the input.
• In many cases, this is the only natural
decomposition because the output is not
clearly known a-priori
– finding minimum in list, sorting, etc.
• Associate a task with each input data
partition.
• Tasks communicate where necessary input is
“owned” by another task.

69
Input Data Decomposition
• Count the instances of given itemsets
• Each task generates partial counts for all itemsets
which must be aggregated.

• Must combine partial results at the end 70


Input & Output Data Decomposition
• Often, partitioning either input data or output data
forces a partition of the other.
• Can also consider partitioning both

71
Intermediate Data Partitioning

Computation can often be viewed as a


sequence of transformation from the
input to the output data.

In these cases, it is often beneficial to


use one of the intermediate stages as
a basis for decomposition.

72
Intermediate Data Partitioning: Example
Let us revisit the example of dense matrix
multiplication. We first show how we can visualize
this computation in terms of intermediate matrices D.

[Figure: A is split into block columns (A1,1/A2,1 and A1,2/A2,2) and B into block rows (B1,1/B1,2 and B2,1/B2,2). Their products form the intermediate blocks D1,i,j = A_i,1 B_1,j and D2,i,j = A_i,2 B_2,j, which are added to give C1,1, C1,2, C2,1 and C2,2.]
73
Multiplication of two 2 x 2 matrices A and B via intermediate results:

  A = |  2   3 |        B = | 3  -2 |
      | -1   4 |            | 3   1 |

Intermediate result D1 (first column of A times first row of B):

  |  2 |               |   2x3      2x(-2) |   |  6  -4 |
  | -1 | x | 3  -2 | = | (-1)x3  (-1)x(-2) | = | -3   2 |

Intermediate result D2 (second column of A times second row of B):

  | 3 |              | 3x3  3x1 |   |  9   3 |
  | 4 | x | 3  1 | = | 4x3  4x1 | = | 12   4 |

Combine intermediate results:

  C = D1 + D2 = |  6+9   -4+3 |   | 15  -1 |
                | -3+12   2+4 | = |  9   6 |

74
Domain Decomposition
• Often can be viewed as input data
decomposition
– May not be input data
– Just domain of calculation
• Split up the domain among tasks
• Each task is responsible for computing the
answer for its partition of the domain
• Tasks may end up needing to
communicate boundary values to perform
necessary calculations

75
Domain Decomposition
• Evaluate the integral of 4/(1 + x²) over the domain [0, 1] (its exact value is π).
• Each task evaluates the integral over its own partition of the domain.
• Once all have finished, sum each task's answer for the total.
[Figure: the domain [0, 1] split at 0.25, 0.5 and 0.75 into four sub-intervals, one per task.]

76
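A small C sketch of this domain decomposition (function and parameter names are illustrative): each task integrates its own sub-interval with the midpoint rule, and the partial results are summed afterwards.

/* Integrate 4/(1+x^2) over [a, b] with 'steps' midpoint-rule steps. */
double partial_integral(double a, double b, int steps)
{
    double h = (b - a) / steps, sum = 0.0;
    for (int i = 0; i < steps; i++) {
        double x = a + (i + 0.5) * h;      /* midpoint of sub-step i */
        sum += 4.0 / (1.0 + x * x) * h;
    }
    return sum;
}
/* Four tasks as in the figure:
   pi_estimate = partial_integral(0.0, 0.25, N) + partial_integral(0.25, 0.5, N)
               + partial_integral(0.5, 0.75, N) + partial_integral(0.75, 1.0, N); */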
Domain Decomposition
• Often a natural approach for grid/matrix
problems

There are algorithms for more complex


domain decomposition problems

77
Exploratory Decomposition
• In many cases, the decomposition of a
problem goes hand-in-hand with its
execution.
• Typically, these problems involve the
exploration of a state space.
– Discrete optimization
– Theorem proving
– Game playing

78
Exploratory Decomposition
• 15 puzzle – put the numbers in order
– only move one piece at a time to a blank spot

79
Exploratory
Decomposition
• Generate
successor
states and
assign to
independent
tasks.

80
[Figures, slides 81-86: a sequence of 15-puzzle board states. The successor states of the initial configuration are assigned to Task 1 through Task 4, and the exploration continues until the solved state is reached.]
86
Exploratory Decomposition
• Exploratory decomposition techniques may change the
amount of work done by the parallel implementation.
• Change can result in super- or sub-linear speedups

87
Speculative Decomposition
• Sometimes, dependencies are not known
a-priori
• Two approaches
– conservative – identify independent tasks only
when they are guaranteed to not have
dependencies
• May yield little concurrency
– optimistic – schedule tasks even when they
may be erroneous
• May require a roll-back mechanism in the case of
an error.

88
Speculative Decomposition
• The speedup due to speculative
decomposition can add up if there
are multiple speculative stages
• Examples
– Concurrently evaluating all branches of
a C switch stmt
– Discrete event simulation

89
Speculative Decomposition
A switch statement works based on the value of its expression; the corresponding case statement executes.

Sequential switch:
  compute expr;
  switch (expr) {
    case 1: compute-e1; break;
    case 2: compute-e2; break;
    ...
  }

Parallel (speculative) switch:
  Slave(i) {
    compute ei;
    Wait(request);
    if (request) Send(ei, 0);
  }
  Master() {
    compute expr;
    switch (expr) {
      case 1:
        Send(request, 1);
        Receive(a1, i);
      ...
    }
  }
90
Speculative Decomposition
Discrete Event Simulation
• The central data structure is a time-
ordered event list.
• Events are extracted precisely in time
order, processed, and if required,
resulting events are inserted back
into the event list.

91
Speculative Decomposition
Discrete Event Simulation
• Consider MET-UTSAV-21 as a discrete
event system -
–Every day a number of events are scheduled one after another, and on the last day there is a Musical Night.

92
Speculative Decomposition
Discrete Event Simulation
• Each of these events may be
processed independently,
–Since, there is no concrete
dependency of one event on other,
they can be executed independently

93
Speculative Decomposition
Discrete Event Simulation
–In certain situations, such as natural calamities, the scheduled execution of the events can be hampered, which may become cumbersome to manage.
• Therefore, an optimistic scheduling of the other events is possible if there is a backup plan.

94
Speculative Decomposition
Discrete Event Simulation
• Simulate a network of nodes
– various inputs, node delay parameters, queue
sizes, service rates, etc.

95
Hybrid Decomposition
• Often, a mix of decomposition techniques
is necessary
• In quicksort, recursive decomposition
alone limits concurrency (Why?). A mix of
data and recursive decompositions is
more desirable.

96
Hybrid Decomposition
• In discrete event simulation, there might
be concurrency in task processing. A mix
of speculative decomposition and data
decomposition may work well.
• Even for simple problems like finding a
minimum of a list of numbers, a mix of
data and recursive decomposition works
well.

97
Task Characterization

98
Task Characterization
• Task characteristics can have a
dramatic impact on performance.
There are following basic
characteristics of task:
• Generation of Task
– Static
– Dynamic

99
Task Characterization
• Size of Task
– Uniform
– Non-uniform
• Data Size
– Size
– Uniformity

100
Task Generation
• From the algorithm's point of view, tasks are the units of work that are executed in parallel.
• Tasks are generated in two ways: static and dynamic.

101
Task Generation
• Static
–Tasks are known in advance before
their execution. Tasks are executed
in order which are defined
previously.
–number of tasks, task size, data
size are all known in prior, therefore
their execution is deterministic.
– E.g.image processing, matrix & graph algorithms

102
Task Generation
• Dynamic
–Tasks are created dynamically
based on decomposition of data in
particular situations.
–Tasks are not all available before execution begins; the set of tasks changes throughout the run.
–This makes them difficult to launch in a statically scheduled environment.
103
Task Generation
–most often dealt with using
dynamic load balancing techniques
–Recursive and exploratory
decomposition techniques are
considered as examples of
dynamic task generation.

104
Task Size – Data size
• Execution time
–uniform – synchronous steps
–non-uniform – difficult to determine
synchronization points
• often handled using Master-Worker
paradigm
• otherwise polling is necessary in
message passing

105
Task Size – Data size
• Data Size
– Data size is one of the crucial properties of a task. The required data should be available when the task is mapped onto a process.
– The overhead associated with data movement can be reduced when the size of the data and its memory location are known at the time of processing.
106
Task Interactions
• Static interactions: The tasks and
their interactions are known a-priori.
These are relatively simple to code
into programs.
• Dynamic interactions: The timing of interactions, or the set of interacting tasks, cannot be determined a-priori. These
interactions are harder to code,
especially using message passing
APIs. 107
Task Interactions
• Regular interactions: There is a
definite pattern (in the graph sense)
to the interactions. These patterns
can be exploited for efficient
implementation.
• Irregular interactions: When
interactions are irregular, they lack
well-defined topologies.
108
Static Task Interaction Patterns
Regular patterns are easier to code: both producer and consumer know when communication is required, so the code is explicit and simple (e.g., the typical block partitioning used in image processing).
Irregular patterns must take into account the variable number of neighbors of each task, and timing becomes more difficult (e.g., a sparse matrix and its associated irregular task-interaction graph).
109
Static & Regular Interaction
• Algorithm has phases of computation and
communication
• Example - Hotplate
– Communicate initial conditions
– Loop
• Communicate dependencies
• Calculate “owned” values
• Check for convergence in “owned” values
• Communicate to determine convergence
– Communicate final conditions

110
Example - Hotplate
• Use Domain Decomposition
– domain is the hotplate
– split the hotplate up among tasks
• row-wise or block-wise
Consider the communication costs

Row-wise  2 neighbors

Block-wise  4 neighbors

Consider data sharing and


computational needs & efficiency

About the same for row or block 111


Dynamic Interaction
Interactions among tasks are unpredictable in dynamic communication. Synchronization between sender and receiver is a challenging issue: problems can arise when both try to communicate at the same time, and a task does not know when to expect a message, so it must poll periodically.
112
Static vs Dynamic Task Interactions
Static Interactions: For each task, the interactions happen at predetermined times; the task-interaction graph, and the stage at which each interaction happens, are known in advance.
Dynamic Interactions: The timing of an interaction is not known prior to the execution of the program; the stage at which an interaction is needed is decided dynamically.
113
Static vs Dynamic Task Interactions
Static Interactions: Can be programmed easily in the MPI paradigm; also easy to code in the shared address space model.
Dynamic Interactions: Uncertainty in the interactions makes it hard for sender and receiver to participate in an interaction at the same time in MPI; still easy to code in the shared address space model.
114
Regular vs Irregular Task Interactions
Regular Interactions: The interaction pattern has some definite structure (in the graph sense), which makes it easy to handle.
Irregular Interactions: No regular interaction pattern exists; irregular and dynamic patterns are difficult to handle, especially in the MPI model.
115
Regular vs Irregular Task Interactions
Regular Interactions: In the image dithering problem, the color of each pixel is determined as the weighted average of its original value and the values of its neighboring pixels. The image is decomposed into square regions, each assigned to a task that carries out the dither operation for its region independently.
Irregular Interactions: In sparse matrix-vector multiplication, a task cannot know in advance which entries of the vector it requires, because of the irregular structure of its chunk of work.
116
Read only vs Read write Task Interactions
Read-only Interactions: Tasks require only read access to the data shared among many concurrent tasks.
Read-write Interactions: Multiple tasks need to read and write some shared data.
117
Read only vs Read write Task Interactions
Read-only Interactions: In the decomposition for parallel matrix multiplication, the tasks only need to read the shared input matrices A and B.
Read-write Interactions: In the 15-puzzle problem, the priority queue constitutes shared data, and tasks need both read and write access to it.
118
One way vs Two way Task Interactions
One-way Interactions: Only one task pushes data to the other. Cannot be programmed directly in MPI; easy to handle in the shared address space model.
Two-way Interactions: Both tasks push data to each other. Suitable for MPI, and also easy to handle in the shared address space model.

119
Questions based on Task Interactions
• Explain characteristics of tasks
• Write short note on task generation
• Differentiate between static and dynamic task
generation
• Discuss the impact of task size on task generation
• Compare between static interaction and dynamic task
interactions
• Compare between regular and irregular task interaction
• Compare between read only and read write task
interaction
• Compare between one way and two way task
interaction

120
Mapping Techniques for Load
Balancing

1
Mapping
• Once a problem has been
decomposed into concurrent tasks,
the tasks must be mapped to
processes.
–Mapping and Decomposition are
often interrelated steps
• Mapping must minimize overhead
–Inter-process communication and
–Time for which the processes are
idle 2
Mapping
[Figures, slides 3-6: the main task is decomposed into sub-tasks; the sub-tasks are then mapped onto processes, and the arrows between the processes indicate the resulting inter-process communication.]
6
Mapping
• Minimizing overhead is a trade-off
game
–Assigning all work to one
processor trivially minimizes
communication at the expense of
significant idling of other
processors.
• Goal: Performance

7
Mapping
[Figures, slides 8-10: mapping all sub-tasks of the main task to only one process: there is no inter-process communication, but the remaining processes stay idle.]
10
Mapping
• Due to load imbalance, some processes finish their work early.
• Based on the constraints in the task dependency graph, some processes may have to wait for other processes to finish their work.

11
Mapping
[Figures, slides 12-16: an uneven decomposition of the main task into coarse-grained and fine-grained sub-tasks and their mapping onto processes; with such an imbalanced mapping, some processes complete execution early while others complete late.]
16
Mapping
• To reduce the overhead caused by interaction, one approach is to assign tasks that need to interact to the same process.
• However, this can lead to workload imbalance among processes: heavily loaded processes stay busy while lightly loaded processes become idle.

17
Mapping
• Good mapping scheme must ensure
the balance between computations
and interactions among processes.
• If synchronization among the
interacting tasks is improper then
waiting time for sending and
receiving data among processes will
increase.

18
Mapping Techniques for Minimum Idling

Mapping must simultaneously minimize idling and balance the load; merely balancing the load does not minimize idling.
[Figure: twelve equal tasks mapped onto four processes P1-P4 with a synchronization point. (a) P1: 1, 5, 9; P2: 2, 6, 10; P3: 3, 7, 11; P4: 4, 8, 12 - the synchronization is reached at t = 2 and execution finishes at t = 3. (b) P1: 1, 2, 3; P2: 4, 5, 6; P3: 7, 8, 9; P4: 10, 11, 12 - the synchronization is only reached at t = 3 and, because of idling, execution finishes at t = 6.]
19
Mapping
• There are two types of mapping
techniques:
– static mapping
– dynamic mapping

20
Mapping
• Static
–tasks mapped to processes a-priori.
–Tasks are distributed among
available processes prior to
execution of algorithms
–need a good estimate of the task
size, data size, and inter-task
interactions.

21
Mapping
–often based on data or task graph
partitioning
–algorithm with static mapping are
easy to design
–since everything is known a priori, static mapping schemes are suitable for both shared address space and message passing programming models.
22
Mapping

• Dynamic
–tasks are mapped to processes at
runtime.
–task generation and mapping are done dynamically
–task sizes are not known a priori
–task processing times may be indeterminate
23
Mapping

• unknown inter-task processing times


• dynamic mapping schemes are more
complicated with message passing
programming model, whereas, it can
work well with shared address space
models.

24
Schemes for Static Mapping

• Mappings based on data partitioning.

• Mappings based on task graph


partitioning.

• Hybrid mappings.

25
Mapping – Data Partitioning

• Based on “owner-computes” rule


• We can combine data partitioning
with the “owner-computes” rule
to partition the computation into
subtasks.
• The simplest data decomposition
schemes for dense matrices are
1-D block distribution schemes.
Block-wise distribution 26
Mapping – Data Partitioning

Array Distribution Scheme


• In data decomposition, the tasks are
responsible for execution of the data
associated with them according to
owner computes rules.
• Mapping tasks onto processes is
same as mapping data onto
processes

Block-wise distribution 27
Mapping – Data Partitioning

Figure 1 Example of One dimensional partitioning of an


array among eight processes
28
Mapping – Data Partitioning

Block Distribution
• In block distributions the uniform
contiguous portions of the array are
distributed to different processes.
• Eg. consider d - dimensional array in
which each process will receive
contiguous block of array along array
dimensions.
29
Block Distribution

• Consider an n x n two-dimensional array with n rows and n columns, distributed among eight processes as shown in Figure 1.
• Block distributions of arrays are
particularly suitable when there is a
locality of interaction, i.e.,
computation of an element of an
array requires other nearby elements
in the array. 30
Block Distribution
• For example, consider an n x n two-
dimensional array A with n rows and
n columns.
• We can now select one of these
dimensions, e.g., the first dimension,
and partition the array into p parts
such that the kth part contains rows
kn/p...(k + 1)n/p - 1, where 0 <= k < p.
• That is, each partition contains a
block of n/p consecutive rows of A.
31
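A small C sketch of these bounds (helper names are illustrative): process k owns rows k*n/p through (k+1)*n/p - 1, using integer division so the rows are split as evenly as possible.

/* 1-D block distribution of n rows over p processes. */
int block_low (int k, int n, int p) { return  k      * n / p;      }
int block_high(int k, int n, int p) { return (k + 1) * n / p - 1;  }
int block_size(int k, int n, int p) { return block_high(k, n, p) - block_low(k, n, p) + 1; }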
Block Distribution
• Similarly, if we partition A along the
second dimension (column), then
each partition contains a block of n/p
consecutive columns.
• These row- and column-wise array
distributions are shown in figure 1

32
Block Distribution

Figure 2 Two dimensional distributions of an array on 4x4


process grid and 2x8 process grid
33
Block Distribution
• Now consider the case in which
multiple dimensions are considered
instead of single dimension partition.
• In this case, both the dimensions (i.e.
rows and columns) are selected at a
time and matrix is divided in number
of blocks.
• Example is shown in figure 2.
• Each block has size n/p1 x n/p2, where p = p1 x p2 is the total number of processes.
Block Distribution
For a d-dimensional array we can have a block distribution in up to d dimensions, e.g., for n x n matrix multiplication A x B = C.
For a two-dimensional distribution over a √p x √p process grid, each block has size n/√p x n/√p.
In each case (one- or two-dimensional), one partition is assigned to one process, which computes it.
In higher dimensions, more blocks are generated.
35
Block Distribution
• More processes are used for
computation.
• In a single dimension, if there are n rows (or n columns), at most n processes can be used for the computation.
• With a two-dimensional distribution, up to n² processes can be used.
Advantages of higher dimensions:
• Higher degree of concurrency
• Reduced interaction among processes
36
Block Distribution

Figure 3 Data sharing needed for matrix


multiplication with one dimensional
partitioning of the output matrix.
Shaded portion of input matrix A and B
are required by the process that
computes the shaded portion of the
output matrix C. 37
Block Distribution

Figure 4 Data sharing needed for matrix


multiplication with two dimensional
partitioning of the output matrix.
Shaded portions of input matrices A and B are required by the process that computes the shaded portion of the output matrix C.
38
Block Distribution
One-dimensional distribution along rows: each process accesses n/p rows of A and the complete matrix B, so the total data accessed by each process is n²/p + n².
Two-dimensional distribution: each process accesses only n/√p rows of A and n/√p columns of B, so the total data accessed by each process is O(n²/√p).
39
Block Distribution
• Block distribution is useful if same
work is performed on each element.
• If amount of work is different for
different elements block distribution
results in load imbalance.

40
Cyclic and Block Cyclic Distribution
• Cyclic distributions often “spread the load”

41
Cyclic and Block Cyclic Distributions

Cyclic Distribution:
When the computational load is not spread evenly over the array, a cyclic distribution can be used to spread the load across the processes.

42
Cyclic and Block Cyclic Distributions

Consecutive entries of the global vector are assigned to successive processes.
For a cyclic distribution, the mapping m -> (p, i) of global index m to (process, local index) is defined as
m -> (m mod P, floor(m / P))

43
Cyclic and Block Cyclic Distributions

The load imbalances occur when


amount of work is different for
different part of matrix.

This can be avoided by using cyclic


or block cyclic distribution.

44
Cyclic Distribution
• Ex, m = 23 elements and P = 3 processes (0 to 2)

m 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22

P 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1

i 0 0 0 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7

45
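A tiny C sketch of this mapping (function names are illustrative), matching the table above:

/* Cyclic distribution: global element m lives on process m mod P
   at local index floor(m / P). */
int owner(int m, int P)       { return m % P; }
int local_index(int m, int P) { return m / P; }
/* e.g. m = 4, P = 3  ->  owner 1, local index 1, as in the table. */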
Cyclic Distribution
• 1-D cyclic distribution of array 'A' (elements 0-9) on 4 processes (P0-P3):
element:  0  1  2  3  4  5  6  7  8  9
process: P0 P1 P2 P3 P0 P1 P2 P3 P0 P1
47
Block Cyclic Distributions

• There are variation of the block


distribution scheme that can be used
to resolve the load-imbalance and
idling problems.

48
Block Cyclic Distributions

• Partition an array into many more


blocks than the number of available
processes.

• Blocks are assigned to processes in a


round-robin manner so that each
process gets several non-adjacent
blocks.
49
Block-Cyclic Distribution for Gaussian Elimination

The active part of the matrix in Gaussian Elimination changes as the elimination proceeds. By assigning blocks in a block-cyclic fashion, each processor receives blocks from different parts of the matrix.
[Figure: at step k, the rows and columns before row/column k are inactive. In the active part, row k is updated as A[k,j] := A[k,j] / A[k,k], and the remaining rows i > k as A[i,j] := A[i,j] - A[i,k] x A[k,j].]

50
Block-Cyclic Distribution: Examples
One- and two-dimensional block-cyclic distributions among 4 processes.
[Figure: the matrix is divided into many more blocks than there are processes; blocks are assigned round-robin so each process receives several non-adjacent blocks of the matrix.]
51
Block-Cyclic Distribution
• A cyclic distribution is a special case in which the block size is one.
• A block distribution is a special case in which the block size is n/p, where n is the dimension of the matrix and p is the number of processes.
[Figure: (a) one-dimensional and (b) two-dimensional block-cyclic distribution of a two-dimensional array over processes P0-P3.]
52
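A one-line C sketch of the ownership rule behind these distributions (names are illustrative):

/* 1-D block-cyclic distribution with block size b over p processes:
   global row i belongs to block i/b, which is dealt round-robin.   */
int bc_owner(int i, int b, int p) { return (i / b) % p; }
/* b = 1   gives the cyclic distribution;
   b = n/p gives the plain block distribution of n rows.            */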
Graph Partitioning Based Data
Decomposition
• The array-based distribution schemes
that we described so far are quite
effective in balancing the
computations and minimizing the
interactions for a wide range of
algorithms that use dense matrices
and have structured and regular
interaction patterns.

53
Graph Partitioning Based Data
Decomposition
• However, there are many algorithms
that operate on sparse data structures
and for which the pattern of interaction
among data elements is data
dependent and highly irregular.

• Numerical simulations of physical


phenomena provide a large source of
such type of computations.

54
Graph Partitioning Based Data
Decomposition
• In these computations, the physical
domain is discretized and represented
by a mesh of elements.

• The simulation of the physical


phenomenon being modeled then
involves computing the values of
certain physical quantities at each
mesh point.

55
Graph Partitioning Based Data
Decomposition
• The computation at a mesh point
usually requires data corresponding to
that mesh point and to the points that
are adjacent to it in the mesh.

• For example, Figure 5 shows a mesh


imposed on Lake Superior.

56
Graph Partitioning Based Data
Decomposition
Figure 5 A mesh used to model Lake
Superior.

57
Graph Partitioning Based Data
Decomposition
• The simulation of a physical
phenomenon such the dispersion of a
water contaminant in the lake would
now involve computing the level of
contamination at each vertex of this
mesh at various intervals of time.

58
Graph Partitioning Based Data
Decomposition
• Since, in general, the amount of
computation at each point is the same,
the load can be easily balanced by
simply assigning the same number of
mesh points to each process.
• However, if a distribution of the mesh
points to processes does not strive to
keep nearby mesh points together, then
it may lead to high interaction
overheads due to excessive data
sharing. 59
Graph Partitioning Based Data
Decomposition
• For example, if each process receives
a random set of points as illustrated in
Figure 6, then each process will need
to access a large set of points
belonging to other processes to
complete computations for its
assigned portion of the mesh.

60
Partitioning the Graph of Lake
Superior

Random Partitioning

Figure 6 A random distribution of the mesh


elements to eight processes
61
Graph Partitioning Based Data
Decomposition
• Ideally, we would like to distribute the
mesh points in a way that balances the
load and at the same time minimizes
the amount of data that each process
needs to access for completing its
computations.
• Therefore, we need to partition the mesh into p parts such that each part contains roughly the same number of mesh points or vertices, and the number of edges that cross partition boundaries is minimized.
62
Graph Partitioning Based Data
Decomposition
• Finding an exact optimal partition is an NP-complete problem. However, algorithms based on powerful heuristics are available that compute reasonable partitions.

• After partitioning the mesh in this


manner, each one of these p partitions
is assigned to one of the p processes.

63
Graph Partitioning Based Data
Decomposition
• As a result, each process is assigned a
contiguous region of the mesh such
that the total number of mesh points
that needs to be accessed across
partition boundaries is minimized.

• Figure 7 shows a good partitioning of


the Lake Superior mesh - the kind that
a typical graph partitioning software
would generate.
64
Partitioning the Graph of Lake
Superior

Figure 7 Partitioning for minimum edge-cut: A distribution


of the mesh elements to eight processes, by using a graph-
partitioning algorithm.
65
Mappings Based on Task Partitioning
Partitioning a given task-dependency graph, with
tasks of known sizes, across processes.

Determining an optimal mapping for a general


task-dependency graph is an NP-complete
problem.
Ex.
•Task-dependency graph that is a perfect binary
tree.
•Mapping on a hypercube.

A good mapping should also minimize the interaction overhead, for example by mapping interdependent tasks onto the same process where possible.
66
Task Partitioning: Mapping a Binary Tree
Dependency Graph
Example illustrates the dependency graph of one view of
quick-sort and how it can be assigned to processes in a
hypercube.
0

0 4

0 2 4 6

0 1 2 3 4 5 6 7

67
Task Partitioning: Mapping a Sparse Graph
Sparse graph for computing a sparse matrix-vector product, and its mapping.
[Figure: a sparse matrix A and vector b with rows 0-11 distributed over three processes. With the straightforward mapping, Process 0: C0 = (4,5,6,7,8); Process 1: C1 = (0,1,2,3,8,9,10,11); Process 2: C2 = (0,4,5,6). Partitioning the task-interaction graph instead reduces the interaction overhead: C0 = (1,2,6,9), C1 = (0,5,6), C2 = (1,2,4,5,7,8).]
Reducing interaction overhead in sparse matrix-vector multiplication.
68
Hierarchical Mappings
•Sometimes a single mapping technique is
inadequate.

•For example, the task mapping of the binary


tree (quicksort) cannot use a large number
of processors.

•For this reason, task mapping can be used


at the top level and data partitioning within
each level.
69
Hierarchical Mapping: Example
An example of task partitioning at the top level with data partitioning at the lower level.
[Figure: a binary task tree in which each task is handled by a group of processes (P0-P7); within each group, the data is partitioned among the processes.]
Quicksort has a task-dependency graph that is an ideal candidate for such a hierarchical mapping.
70
Questions on the topic

Explain dynamic mapping


Explain cyclic and block cyclic mapping

71
Schemes for Dynamic Mapping
• Dynamic mapping is sometimes also referred to as
dynamic load balancing, since load balancing is
the primary motivation for dynamic mapping.
• Dynamic mapping is used in :
• where highly imbalanced distribution caused by
static mapping
• where task-dependency graph itself is dynamic
by nature

• Dynamic mapping schemes can be centralized or


distributed.

72
Dynamic Mapping with Centralized Schemes
Processes are managed in a master-slave fashion.
When a process runs out of work, it
requests the master for more work.
When the number of processes
increases, the master may become the
bottleneck.
To overcome this, a process may pick up
a number of tasks (a chunk) at one
time. This is called Chunk scheduling.
73
Dynamic Mapping with Centralized Schemes

Selecting large chunk sizes may lead to


significant load imbalances as well.
A number of schemes have been used
to gradually decrease chunk size as
the computation progresses.

74
Distributed Dynamic Mapping
• Each process can send or receive work
from other processes.

• This overcomes the bottleneck in


centralized schemes.

75
Distributed Dynamic Mapping
•There are four critical questions:
ohow are sending and receiving
processes paired together,
owho initiates work transfer,
ohow much work is transferred, and
owhen is a transfer triggered?

• Answers to these questions are generally


application specific.

76
Methods for Containing Interaction Overhead

• Parallel programs perform efficiently when the interactions among the concurrent tasks are handled efficiently.
• Many factors are responsible for increased interaction overhead:
- the amount of data exchanged during an interaction
77
Methods for Containing Interaction Overhead

- the spatial and temporal pattern of the interaction
• Some of these interaction overheads are already reduced by the choice of decomposition and mapping schemes.

78
Minimizing Interaction Overheads
•Maximize data locality: Where possible,
reuse intermediate data. Restructure
computation so that data can be reused
in smaller time windows.

•Minimize volume of data exchange:


There is a cost associated with each
word that is communicated. For this
reason, we must minimize the volume
of data communicated.
79
Minimizing Interaction Overheads
Minimize frequency of interactions:
There is a startup cost associated with
each interaction. Therefore, try to merge
multiple interactions to one, where
possible. Ex. Sorting of an array before
distributing.

•Minimize contention and hot-spots: Use


decentralized techniques, replicate data
where necessary.
80
Minimizing Interaction Overheads
• Overlap communication with computation
– Uses non-blocking communications
– Multithreading, and prefetching can be used to
hide latencies.
• Replicate data or computations
– It may be less expensive to recalculate or
store redundantly than to communicate
• Use group communication instead of
point to point primitives
– They are more optimized generally

81
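A hedged C/MPI sketch of the first point above (overlapping communication with computation using non-blocking calls); the buffer names, neighbour ranks and the commented-out compute routines are illustrative placeholders, not part of the slides.

#include <mpi.h>

void exchange_and_compute(double *halo_in, double *halo_out, int m,
                          int left, int right, MPI_Comm comm)
{
    MPI_Request req[2];
    MPI_Irecv(halo_in,  m, MPI_DOUBLE, left,  0, comm, &req[0]);  /* post receive */
    MPI_Isend(halo_out, m, MPI_DOUBLE, right, 0, comm, &req[1]);  /* post send    */

    /* compute_interior();  work that does not need the incoming halo */

    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);   /* communication now complete */

    /* compute_boundary();  work that needed the received halo */
}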
Parallel Algorithm Models

82
Basic Communication Operations

V.B.More
MET’s IOE, BKC, Nashik

Thanks to Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar for providing slides

1
Topic Overview

1) One-to-All Broadcast and All-to-One


Reduction
2) All-to-All Broadcast and Reduction
3) All-Reduce and Prefix-Sum Operations
4) Scatter and Gather
5) All-to-All Personalized Communication
6) Circular Shift
7) Improving the Speed of Some
Communication Operations

2
Basic Communication Operations:
Introduction
• Computations and communication are
two important factors of any parallel
algorithm.

• Many interactions in practical parallel


programs occur in well-defined patterns
involving groups of processors.

• Efficient implementations of these


operations can improve performance,
reduce development effort and cost, and
improve software quality. 3
Basic Communication Operations:
Introduction
• Efficient implementations must leverage
underlying architecture. For this reason,
we refer to specific architectures here.

• We select a descriptive set of


architectures to illustrate the process of
algorithm design.

4
Basic Communication Operations:
Introduction
• Group communication operations are
built using point-to-point messaging
primitives.

• We assume an architecture in which communicating a message of size m over an uncongested network takes time ts + tw m.
• We use this as the basis for our analyses.


Where necessary, we take congestion
into account explicitly by scaling the tw
5
term.
Basic Communication Operations:
Introduction
• We assume that the network is
bidirectional and that communication is
single-ported.

6
One-to-All Broadcast and All-to-One
Reduction
• A processor has a piece of data (of size m)
it needs to send to every other processor.
• The dual of one-to-all broadcast is all-to-
one reduction.
• In all-to-one reduction, each processor has
m units of data. These data items must
be combined piece-wise (using some
associative operator, such as addition or
min), and the result made available at a
target processor.
7
One-to-All Broadcast and All-to-One
Reduction

One-to-all broadcast and all-to-one reduction


among p processors.

8
One-to-All Broadcast and All-to-One
Reduction
One-to-all broadcast
• One-to-All broadcast is the operation in
which a single processor send identical
data to all other processors.
• Most parallel algorithm often need this
operation.
• Consider data size m is to be sent to all
the processors.
• Initially only source processor has the
data.
• After completion of algorithm, there will be
a copy of initial data with each processor. 9
One-to-All Broadcast and All-to-One
Reduction
All-to-One Reduction
• All-to-One reduction is the operation in
which data from all processors are
combined at a single destination
processor.
• Various operations like sum, product, max,
min, avg of numbers can be performed by
all-to-one reduction operation.

10
One-to-All Broadcast and All-to-One
Reduction
All-to-One Reduction

• Each of the p processors has a buffer M containing m words.
• After completion of the algorithm, the ith word of the accumulated M is the sum, product, maximum, or minimum of the ith words of the original buffers.

11
One-to-All Broadcast and All-to-One
Reduction

One-to-all broadcast and all-to-one reduction


among p processors.

12
One-to-All Broadcast and All-to-One
Reduction on Rings
• Simplest way is to send p − 1 messages
from the source to the other p − 1
processors – this is not very efficient.
• Use recursive doubling: source sends a
message to a selected processor. We
now have two independent problems
defined over halves of machines.

• Reduction can be performed in an


identical fashion by inverting the process.
13
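Under the ts + tw m cost model introduced earlier, each of the log p recursive-doubling steps transfers one message of size m, so this one-to-all broadcast (and, by symmetry, the all-to-one reduction) takes roughly
T = (ts + tw m) log p.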
One-to-All Broadcast
[Figure: an eight-node ring, nodes 0-7; dotted arrows labeled with time steps 1, 2 and 3 show the recursive-doubling transfers.]
One-to-all broadcast on an eight-node ring. Node 0 is


the source of the broadcast. Each message transfer
step is shown by a numbered, dotted arrow from the
source of the message to its destination. The
number on an arrow indicates the time step during
which the message is transferred. 14
All-to-One Reduction

[Figure: the same eight-node ring; arrows labeled with time steps 1, 2 and 3 show the reduction messages converging on node 0.]

Reduction on an eight-node ring with node 0


as the destination of the reduction.
15
Broadcast and Reduction: Example

Consider the problem of multiplying a matrix


with a vector.

• The n × n matrix is assigned to an n × n


(virtual) processor grid. The vector is
assumed to be on the first row of
processors.

• The first step of the product requires a one-to-all broadcast of each vector element along the corresponding column of processors. This can be done concurrently for all columns.
16
Broadcast and Reduction: Example
• The processors compute local product of
the vector element and the local matrix
entry.

• In the final step, the results of these


products are accumulated to the first row
using n concurrent all-to-one reduction
operations along the columns (using the
sum operation).

17
Broadcast and Reduction: Matrix-Vector
Multiplication Example
[Figure: a 4 x 4 grid of processes P0-P15 holding the matrix; the input vector is broadcast from the first row of processes along the columns (one-to-all broadcast), each process computes its local product, and the results are accumulated by concurrent all-to-one reductions to form the output vector.]
One-to-all broadcast and all-to-one reduction in the multiplication of a 4 x 4 matrix with a 4 x 1 vector.
18
Broadcast and Reduction on a Mesh
• We can view each row and column of a
square mesh of p nodes as a linear array
of √p nodes.
• Broadcast and reduction operations can
be performed in two steps – the first step
does the operation along a row and the
second step along each column
concurrently.
• This process generalizes to higher
dimensions as well.
19
Broadcast and Reduction on a Mesh
• 2D square mesh with √p rows and √p
columns for one to all broadcast
operation.
• Firstly data is sent to remaining all √p -1
nodes in a row by source using one-to-all
broadcast operation.
• In second phase, the data is sent to the
respective column by one-to-all broadcast.
• Thus, each node of the mesh will have a copy of the initial message.

20
Broadcast and Reduction on a Mesh: Example
[Figure: one-to-all broadcast on a 16-node (4 x 4) mesh, with node 0 as the source; arrows are labeled with the time step of each transfer.]
First phase (steps 1 and 2), row data transfer: in step 1, node 0 sends to node 8; in step 2, node 0 sends to node 4 and node 8 sends to node 12. After this phase, nodes 0, 4, 8 and 12 of the first row have the data.
21
Broadcast and Reduction on a Mesh:
Example
[Figure: the same 16-node mesh; the second phase proceeds along the columns.]
Second phase (steps 3 and 4), column data transfer: in step 3, node 0 sends to node 2, node 4 to node 6, node 8 to node 10 and node 12 to node 14. In step 4, column 1: node 0 sends to node 1 and node 2 to node 3; column 2: node 4 to 5 and node 6 to 7; column 3: node 8 to 9 and node 10 to 11; column 4: node 12 to 13 and node 14 to 15.
22
Broadcast and Reduction on a Mesh:
Example
[Figure: the 16-node mesh with all transfer steps shown; after both phases, every node has the data.]
A similar process for one-to-all broadcast on a three-dimensional mesh can be carried out by treating the rows of nodes in each of the three dimensions as linear arrays.
23
Broadcast and Reduction on a Mesh:
Example
[Figure: the same 16-node mesh, used here for the reduction.]
The reduction process for a linear array can be carried out on two- and three-dimensional meshes as well, by reversing the direction and order of the messages.
24
Broadcast and Reduction on a Hypercube

• A hypercube with 2^d nodes can be
regarded as a d-dimensional mesh with
two nodes in each dimension.
• The mesh algorithm can be generalized to
a hypercube and the operation is carried
out in d (= log p) steps, one in each
dimension.

• Example of 8-node hypercube.

25
Broadcast and Reduction on a Hypercube

One-to-all broadcast on an eight-node (three-dimensional) hypercube.
The binary representations of node labels are shown in
parentheses. 26
Broadcast and Reduction on a Hypercube

• Each node is identified by a unique 3-bit
binary label.
• Communication starts along the highest
dimension, i.e. the dimension specified by the
MSB of the node label. For example, in step 1,
node 0 (000) sends data to node 4 (100), which
differs from it in the highest dimension.

27
Broadcast and Reduction on a Hypercube:
Example

● In the next step, communication will be


done for lower dimension.
● The source and the destination nodes in
three communication steps of the
algorithm are similar to the nodes in the
broadcast algorithm on a linear array.
● Hypercube broadcast will not suffer from
congestion

28
Broadcast and Reduction on a Balanced
Binary Tree
• Consider a binary tree in which processors
are (logically) at the leaves and internal
nodes are routing nodes i.e. switching units.
• The communicating nodes have the same
labels as in the hypercube
• The communication pattern will be same as
that of hypercube algorithm.
• There will not be any congestion on any of
the communication link at any time.

29
Broadcast and Reduction on a Balanced
Binary Tree

One-to-all broadcast on an eight-node tree; the processing nodes 0-7 are
at the leaves and the numbered arrows give the communication steps. 31

• Different paths pass through different numbers of switching nodes,
which makes the communication characteristics different from those of
the hypercube.
• E.g. assume that the source processor is the root of this tree. In the
first step, the source sends the data to the right child (assuming the
source is also the left child). The problem has now been decomposed into
two problems with half the number of nodes. 32
Broadcast and Reduction Algorithms
• All of the algorithms described so far are
adaptations of the same algorithmic
template.

• We illustrate the algorithm for a hypercube,


but the algorithm can be adapted to other
architectures.

• The hypercube has 2^d nodes and my_id is


the label for a node.
• X is the message to be broadcast, which
initially resides at the source node 0. 33
Broadcast and Reduction Algorithms

1. procedure GENERAL_ONE_TO_ALL_BC(d, my_id, source, X)
2. begin
3.    my_virtual_id := my_id XOR source;
4.    mask := 2^d − 1;
5.    for i := d − 1 downto 0 do                /* Outer loop */
6.       mask := mask XOR 2^i;                  /* Set bit i of mask to 0 */
7.       if (my_virtual_id AND mask) = 0 then
8.          if (my_virtual_id AND 2^i) = 0 then
9.             virtual_dest := my_virtual_id XOR 2^i;
10.            send X to (virtual_dest XOR source);
               /* Convert virtual_dest to the label of the physical destination */
11.         else
12.            virtual_source := my_virtual_id XOR 2^i;
13.            receive X from (virtual_source XOR source);
               /* Convert virtual_source to the label of the physical source */
14.         endelse;
15.   endfor;
16. end GENERAL_ONE_TO_ALL_BC

One-to-all broadcast of a message X initiated by source on a
d-dimensional p-node hypercube, where d = log p.
36
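To make the schedule concrete, here is a minimal Python sketch (not part of the original slides; names and the synchronous-step model are assumptions) that simulates which node forwards the message in each step of GENERAL_ONE_TO_ALL_BC.

# Illustrative simulation of the hypercube one-to-all broadcast schedule.
# Only the message movement is modelled; startup and transfer times are ignored.
def one_to_all_broadcast(d, source, X):
    p = 1 << d                              # number of nodes, p = 2^d
    data = {source: X}                      # node label -> received message
    for i in range(d - 1, -1, -1):          # outer loop, dimension d-1 down to 0
        mask = (1 << i) - 1                 # lower i bits set
        for node in range(p):
            v = node ^ source               # virtual id relative to the source
            if v & mask == 0 and v & (1 << i) == 0 and node in data:
                dest = (v ^ (1 << i)) ^ source   # physical label of the partner
                data[dest] = data[node]          # "send X to partner"
    return data

if __name__ == "__main__":
    result = one_to_all_broadcast(3, source=5, X="hello")
    assert all(result[n] == "hello" for n in range(8))
    print("all 8 nodes received the message in 3 steps")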
Broadcast and Reduction Algorithms

1. procedure ALL_TO_ONE_REDUCE(d, my_id, m, X, sum)
2. begin
3.    for j := 0 to m − 1 do sum[j] := X[j];
4.    mask := 0;
5.    for i := 0 to d − 1 do
         /* Select nodes whose lower i bits are 0 */
6.       if (my_id AND mask) = 0 then
7.          if (my_id AND 2^i) ≠ 0 then
8.             msg_destination := my_id XOR 2^i;
9.             send sum to msg_destination;
10.         else
11.            msg_source := my_id XOR 2^i;
12.            receive X from msg_source;
13.            for j := 0 to m − 1 do
14.               sum[j] := sum[j] + X[j];
15.         endelse;
16.      mask := mask XOR 2^i;               /* Set bit i of mask to 1 */
17.   endfor;
18. end ALL_TO_ONE_REDUCE

Single-node accumulation on a d-dimensional hypercube. Each node contributes
a message X containing m words, and node 0 is the destination. 37
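As a sanity check of the reduction schedule, the following Python sketch (an illustrative simulation under the same assumptions as above, not the slides' own code) applies the same partner selection with node 0 as the destination and verifies the element-wise sums.

# Illustrative simulation of ALL_TO_ONE_REDUCE on a 2^d-node hypercube.
# Each node contributes a vector of m words; node 0 ends with the element-wise sum.
def all_to_one_reduce(d, vectors):
    p = 1 << d
    sums = [list(v) for v in vectors]            # per-node partial sums
    for i in range(d):
        mask = (1 << i) - 1                      # only nodes whose lower i bits are 0 take part
        for node in range(p):
            if node & mask == 0 and node & (1 << i) != 0:
                dest = node ^ (1 << i)           # partner with bit i cleared
                sums[dest] = [a + b for a, b in zip(sums[dest], sums[node])]
    return sums[0]

if __name__ == "__main__":
    vecs = [[n, 2 * n] for n in range(8)]        # 8 nodes, m = 2 words each
    print(all_to_one_reduce(3, vecs))            # [28, 56]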
Cost Analysis

• The one-to-all broadcast or all-to-one
reduction procedure involves log p point-
to-point simple message transfers.

• Each message transfer will have a time


cost of ts + twm.

• The total time is therefore given by:

T = (ts + twm) log p.


38
Questions based on one-to-all Broadcast
and all-to-one Reduction
• Explain one-to-all broadcast and all-to-
one reduction operation in brief.
• Explain with example one-to-all
broadcast and all-to-one reduction
operation on ring.
• Explain how matrix-vector multiplication
can be performed using one-to-all
broadcast and all-to-one reduction
operation.
• Explain one-to-all broadcast operation on
16-node mesh.
39
Questions based on one-to-all Broadcast
and all-to-one Reduction

• Write and explain algorithm of one-to-all


broadcast on a hypercube network.
• Explain one-to-all broadcast algorithm
for arbitrary source on d-dimensional
hypercube.
• Explain all-to-one reduction operation on
d-dimensional hypercube.

40
All-to-All Broadcast and Reduction

• Generalization of broadcast in which


each processor is the source as well as
destination.
• All-to-all broadcast operation is used in
matrix operations like matrix
multiplication and matrix-vector
multiplication.
• In all-to-all broadcast operation, all p
nodes simultaneously broadcast the
message.
41
All-to-All Broadcast and Reduction

• Note that, a process sends same


message of m-word to all the
processes, but it is not compulsory that
every process should send same
message. Different processes can
broadcast different messages.
• In all-to-all reduction, every node is the
destination of its own reduction.

42
All-to-All Broadcast and Reduction

All-to-all broadcast and all-to-all reduction.

43
All-to-All Broadcast and Reduction on a Ring
• All-to-all Broadcast:
• Simplest approach: perform p one-to-all
broadcasts. This is not the most efficient way,
though.
• Performed one after another, this would take p
times as long as a single broadcast.
• Communication links can be used more
efficiently by performing all p one-to-all
broadcasts simultaneously.
• In this case, all messages traversing the same
path at the same time are concatenated into a
single message.
• The algorithm terminates in p − 1 steps.
44
All-to-All Broadcast and Reduction on a Ring

• Linear Array and Ring:


• Each node first sends the data to one of its
neighbors it needs to broadcast.
• In subsequent steps, it forwards the data
received from one of its neighbors to its
other neighbor.
• This process is continued in subsequent
steps so all the communication links can be
kept busy.
• As the communication is performed circularly
in a single direction, each node receives all
(p-1) pieces of information from the other nodes
in (p-1) steps.
45
All-to-All Broadcast and Reduction on a Ring

[Figure: all-to-all broadcast on an eight-node ring, showing the 1st, 2nd and
7th communication steps. Each arrow is labelled n (m), where n is the time
step and m identifies the message being forwarded; the set next to each node
lists the messages it has accumulated so far.]

All-to-all broadcast on an eight-node ring.
49
All-to-All Broadcast and Reduction on a Ring

• Detailed Algorithm
• At every node, my_msg contains initial message
to be broadcast.
• At the end of the algorithm, all p messages are
collected at each node.

50
All-to-All Broadcast and Reduction on a Ring

1. procedure ALL_TO_ALL_BC_RING(my_id, my_msg, p, result)


2. begin
3. left := (my_id − 1) mod p;
4. right := (my_id + 1) mod p;
5. result := my_msg;
6. msg := result;
7. for i := 1 to p − 1 do
8. send msg to right;
9. receive msg from left;
10. result := result ∪ msg;
11. endfor;
12. end ALL_TO_ALL_BC_RING

All-to-all broadcast on a p-node ring.

All-to-all reduction is simply a dual of this operation and


can be performed in an identical fashion. 51
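A hedged Python sketch of this procedure (a synchronous simulation with illustrative names, not the slides' own code) is shown below; after p − 1 steps every node's result set contains all p messages.

# Illustrative simulation of ALL_TO_ALL_BC_RING (synchronous steps assumed).
def all_to_all_bc_ring(msgs):
    p = len(msgs)
    result = [{i: msgs[i]} for i in range(p)]    # result := my_msg
    outgoing = [dict(r) for r in result]         # msg := result
    for _ in range(p - 1):
        incoming = [None] * p
        for node in range(p):
            right = (node + 1) % p
            incoming[right] = outgoing[node]     # send msg to right neighbour
        for node in range(p):
            result[node].update(incoming[node])  # result := result U msg
            outgoing[node] = incoming[node]      # forward what was just received
    return result

if __name__ == "__main__":
    res = all_to_all_bc_ring([f"m{i}" for i in range(8)])
    assert all(len(r) == 8 for r in res)
    print("every node collected all 8 messages in 7 steps")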
All-to-all Broadcast on a Mesh
• Performed in two phases –
• in the first phase, each row of the mesh performs an
all-to-all broadcast using the procedure for the linear
array.

• In this phase, all nodes collect √p messages


corresponding to the √p nodes of their respective rows.
Each node consolidates this information into a single
message of size m√p.

• The second communication phase is a columnwise all-


to-all broadcast of the consolidated messages.

52
All-to-all Broadcast on a Mesh

All-to-all broadcast on a 3 × 3 mesh. The groups of nodes
communicating with each other in each phase are enclosed by dotted
boundaries. Panels: (a) initial data distribution, (b) data distribution after
the rowwise broadcast, (c) final result of the all-to-all broadcast on the mesh.
By the end of the second phase, all nodes get (0,1,2,3,4,5,6,7,8)
(that is, a message from each node).

• After completion of the second phase each node obtains all p pieces of
m-word data, i.e. all nodes hold the messages (0,1,2,3,4,5,6,7,8), one
from each node.
53
All-to-all Broadcast on a Mesh

(a) Initial data distribution: node i holds message (i).
(b) Data distribution after the rowwise broadcast: each node holds the
messages of its entire row, e.g. (0,1,2) in the bottom row.
(c) Final result of the all-to-all broadcast on the mesh: every node holds
(0,1,2,3,4,5,6,7,8).
56
All-to-all Broadcast on a Mesh

• All-to-all broadcast on a 3 × 3 mesh. The


groups of nodes communicating with each
other in each phase are enclosed by dotted
boundaries.
• By the end of the second phase, all nodes
get (0,1,2,3,4,5,6,7,8) (that is, a message from
each node).
• After completion of second phase each node
obtains all p pieces of m-word data i.e. all
nodes will get (0,1,2,3,4,5,6,7,8) message
from each node.

57
All-to-all Broadcast on a Mesh
1. procedure ALL_TO_ALL_BC_MESH(my_id, my_msg, p, result)
2. begin
/* Communication along rows */
3. left := my_id − (my_id mod √p) + (my_id − 1)mod√p;
4. right := my_id − (my_id mod √p) + (my_id + 1) mod √p;
5. result := my_msg;
6. msg := result;
7. for i := 1 to √p − 1 do
8. send msg to right;
9. receive msg from left;
10. result := result ∪ msg;
11. endfor;
/* Communication along columns */
12.   up := (my_id − √p) mod p;
13.   down := (my_id + √p) mod p;
14. msg := result;
15. for i := 1 to √p − 1 do
16. send msg to down;
17. receive msg from up;
18. result := result ∪ msg;
19. endfor;
20. end ALL_TO_ALL_BC_MESH

All-to-all broadcast on a square mesh of p nodes. 58
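The index arithmetic in lines 3, 4, 12 and 13 can be checked with a small Python sketch (illustrative only; it assumes p is a perfect square and row-major node labels).

# Neighbour computation for a row-major sqrt(p) x sqrt(p) mesh, mirroring
# lines 3, 4, 12 and 13 of ALL_TO_ALL_BC_MESH.
import math

def mesh_neighbours(my_id, p):
    q = math.isqrt(p)                        # sqrt(p), assumed exact
    row_start = my_id - (my_id % q)
    left  = row_start + (my_id - 1) % q      # wraparound within the row
    right = row_start + (my_id + 1) % q
    up    = (my_id - q) % p                  # wraparound within the column
    down  = (my_id + q) % p
    return left, right, up, down

print(mesh_neighbours(4, 9))                 # node 4 of a 3x3 mesh -> (3, 5, 1, 7)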


All-to-all broadcast on a Hypercube

• The all-to-all broadcast operation can be performed on a
hypercube by extending the mesh algorithm to log p
dimensions.
• In each step, communication is carried out along a different
dimension. Figure (a) shows the first step, in which
communication takes place within each row.
• In figure (b), communication is carried out along the columns in
the second step.
• Pairs of nodes exchange data in each step.
• The received message is concatenated with the current data in
every step.
• A hypercube with bidirectional communication links is assumed.

59
All-to-all broadcast on a Hypercube

All-to-all broadcast on an eight-node hypercube.
(a) Initial distribution of messages: node i holds (i).
(b) Distribution before the second step: e.g. nodes 0 and 1 hold (0,1).
(c) Distribution before the third step: e.g. nodes 0-3 hold (0,1,2,3).
(d) Final distribution of messages: every node holds (0,...,7).
60
All-to-all broadcast on a Hypercube

1. procedure ALL_TO_ALL_BC_HCUBE(my_id, my_msg, d, result)


2. begin
3. result := my_msg;
4. for i := 0 to d − 1 do
5. partner := my_id XOR 2^i;
6. send result to partner;
7. receive msg from partner;
8. result := result ∪ msg;
9. endfor;
10. end ALL_TO_ALL_BC_HCUBE

All-to-all broadcast on a d-dimensional hypercube.

61
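A minimal Python sketch of this procedure (assuming synchronous pairwise exchanges; names are illustrative, not the original code) is:

# Illustrative simulation of ALL_TO_ALL_BC_HCUBE: in step i every node exchanges
# its accumulated set with the partner that differs from it in bit i.
def all_to_all_bc_hypercube(d, msgs):
    p = 1 << d
    result = [{i: msgs[i]} for i in range(p)]
    for i in range(d):
        snapshot = [dict(r) for r in result]        # both partners send their old data
        for node in range(p):
            partner = node ^ (1 << i)
            result[node].update(snapshot[partner])  # result := result U msg
    return result

if __name__ == "__main__":
    res = all_to_all_bc_hypercube(3, list("abcdefgh"))
    assert all(len(r) == 8 for r in res)
    print("all-to-all broadcast finished in 3 (= log p) steps")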
All-to-all broadcast on a Hypercube

All-to-all broadcast on an eight-node hypercube (panels (a)-(d) as above):
(a) initial distribution of messages, (b) distribution before the second step,
(c) distribution before the third step, (d) final distribution of messages,
in which every node holds (0,...,7).
65
All-to-all Broadcast

• Similar communication pattern to all-to-all


broadcast, except in the reverse order.
• On receiving a message, a node must combine it
with the local copy of the message that has the
same destination as the received message
before forwarding the combined message to the
next neighbor.
• As per the algorithm, the communication starts
from the lowest dimension.
• Variable i is used to represent the dimension.
• According to line 4, in the first iteration the value of i is 0.

66
All-to-all Broadcast

• In each iteration, nodes communicate in pairs.


• In line 5, the label of the partner node is
calculated by an XOR operation.
• For example, if my_id = 000, then in the first step partner = 000 XOR 001
= 001; in general, the partner differs in the ith bit.
• After communication of the data, each node
concatenates the received data with its own data
as shown in line 8.
• This concatenated message is then transmitted in
the next iteration.

67
All-to-all reduction on a Hypercube

1. procedure ALL_TO_ALL_RED_HCUBE(my_id, msg, d, result)
2. begin
3.    recloc := 0;
4.    for i := d − 1 downto 0 do
5.       partner := my_id XOR 2^i;
6.       j := my_id AND 2^i;
7.       k := (my_id XOR 2^i) AND 2^i;
8.       senloc := recloc + k;
9.       recloc := recloc + j;
10.      send msg[senloc .. senloc + 2^i − 1] to partner;
11.      receive temp[0 .. 2^i − 1] from partner;
12.      for j := 0 to 2^i − 1 do
13.         msg[recloc + j] := msg[recloc + j] + temp[j];
14.      endfor;
15.   endfor;
16.   result := msg[my_id];
17. end ALL_TO_ALL_RED_HCUBE

All-to-all reduction on a d-dimensional hypercube.
68
All-to-all Reduction

• The order and direction of messages is reversed


for all-to-all reduction operation.
• The buffers are used to send and accumulate the
received messages in each iteration.
• Variable senloc is used to give the starting
location of the outgoing message.
• Variable recloc is used to give the location where
the incoming message is added in each iteration.

69
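The input/output relation of all-to-all reduction can be summarized with a short Python sketch (this shows only what the operation computes, not the hypercube schedule; names and the example data are illustrative).

# Semantic sketch of all-to-all reduction: node dst ends up with the
# element-wise sum of the blocks destined for it from every node.
def all_to_all_reduce_semantics(blocks):
    # blocks[src][dst] is the m-word contribution of node src for node dst
    p = len(blocks)
    m = len(blocks[0][0])
    return [[sum(blocks[src][dst][w] for src in range(p)) for w in range(m)]
            for dst in range(p)]

blocks = [[[src * 10 + dst] for dst in range(4)] for src in range(4)]   # m = 1
print(all_to_all_reduce_semantics(blocks))     # [[60], [64], [68], [72]]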
Cost Analysis

• On a ring, the time is given by:
T = (ts + twm)(p − 1).

• On a mesh, the time is given by:
T = 2ts(√p − 1) + twm(p − 1).

• On a hypercube, we have:
T = Σ_{i=1}^{log p} (ts + 2^{i−1} twm) = ts log p + twm(p − 1).
70
Cost Analysis of all-to-all broadcast and all-to-all reduction operation

• On a ring, the time is given by:


(ts + twm)(p − 1).
• =>> All-to-all broadcast can be performed in (p-1)
communicating steps on a ring or a linear array
for nearest neighbors.
• Time taken in each step is (ts + twm) where ts is
the startup time of the message, tw is the per
word transfer time, and m is the size of
message,
• Total time taken for the operation is :
(ts + twm)(p − 1).
71
Cost Analysis of all-to-all broadcast and all-to-all reduction operation

• On a mesh, the time is given by:


2ts(√p − 1) + twm(p − 1).
=>> All-to-all broadcast can be performed on
mesh. The first phase of √p simultaneous all-to-
all broadcast will be completed in time
(ts + twm √p)(√p − 1).
For two dimensional square mesh of p-nodes, the
total time for all-to-all broadcast is addition of
time spent in each pass, and which is :
2ts(√p − 1) + twm(p − 1)

72
Cost Analysis of all-to-all broadcast and all-to-all reduction operation

• On a hypercube, the time is given by:

T = Σ_{i=1}^{log p} (ts + 2^{i−1} twm) = ts log p + twm(p − 1)

=>> For a p-node hypercube, the time taken by a pair of nodes to send and
receive messages in the ith step is ts + 2^{i−1} twm.

The total time taken for the operation is given by:

T = Σ_{i=1}^{log p} (ts + 2^{i−1} twm)
  = ts log p + twm(p − 1).
73
All-to-all broadcast: Notes

• All of the algorithms presented


here are asymptotically optimal in
message size.

• That is, twm(p-1) is the term


associated with each architecture.

74
All-to-all broadcast: Notes

•It is not possible to port/map


algorithms for higher dimensional
networks (such as a hypercube) into
a ring because this would cause
contention.
• For large messages, a ring is as good as a
hypercube, because the twm(p − 1) term
dominates the cost on both.

75
All-to-all broadcast: Notes

[Figure: an 8-node hypercube mapped onto an eight-node ring; several
hypercube messages contend for a single ring channel.]

Contention for a channel when the hypercube is mapped onto a ring.
76
All-to-all broadcast: Questions

1. Explain all-to-all broadcast and reduction


operation
2. Write algorithm and explain all-to-all broadcast
on eight node ring/hypercube.
3. Explain with example and algorithm all-to-all
broadcast on 3x3 mesh.
4. Explain all-to-all reduction on d-dimensional
hypercube.
5. Explain all-to-all broadcast on d-dimensional
hypercube.
6. Explain cost analysis of all-to-all broadcast
operation

77
All-Reduce and Prefix-Sum Operations

• In all-reduce, each node starts with a


buffer of size m and the final results
of the operation are identical buffers
of size m on each node that are
formed by combining the original p
buffers using an associative operator.

78
All-Reduce and Prefix-Sum Operations

•It is identical to all-to-one reduction


followed by a one-to-all broadcast.
•This formulation is not the most
efficient.
•Uses the pattern of all-to-all broadcast,
instead. The only difference is that
message size does not increase here.
•Time for this operation is
(ts + twm) log p.

79
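Because the message size stays fixed at m, the all-reduce can be written as a recursive-doubling exchange; the Python sketch below (an assumption-laden illustration, not the slides' code) shows the pattern for the sum operation.

# Illustrative all-reduce on a 2^d-node hypercube: the all-to-all broadcast
# pattern, but partners combine (add) instead of concatenating.
def all_reduce_hypercube(d, values):
    p = 1 << d
    buf = list(values)
    for i in range(d):
        old = list(buf)
        for node in range(p):
            buf[node] = old[node] + old[node ^ (1 << i)]   # combine with partner
    return buf

print(all_reduce_hypercube(3, list(range(8))))   # every node ends with 28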
All-Reduce and Prefix-Sum Operations

•It is different from all-to-all reduction,


in which p simultaneous all-to-one
reductions take place, each with a
different destination for the result.

80
The Prefix-Sum Operation

• Given p numbers n0, n1, . . . , np−1 (one


on each node), the problem is to
compute the sums sk = Σ_{i=0}^{k} ni for all k
between 0 and p − 1.
• Initially, nk resides on the node labeled
k, and at the end of the procedure, the
same node holds sk.

81
The Prefix-Sum Operation

s0 = n0
s1 = n0 + n1
s2 = n0 + n1 + n2
s3 = n0 + n1 + n2 + n3
S4 = n0 + n1 + n2 + n3 + n4
...
sk = n0 + n1 + n2 + .. + nk

82
The Prefix-Sum Operation

Initially, n0 resides on node 0, n1 resides


on node 1 and so on.

After completion of the prefix-sum
operation, every node holds the sum of
its predecessor nodes including itself.

T = (ts+twm) log p

83
The Prefix-Sum Operation

Computing prefix sums on an eight-node hypercube.
(a) Initial distribution of values.
(b) Distribution of sums before the second step.
(c) Distribution of sums before the third step.
(d) Final distribution of prefix sums: node k holds [0 + 1 + ... + k].

At each node, square brackets show the local prefix
sum accumulated in the result buffer and
parentheses enclose the contents of the outgoing
message buffer for the next step.
88
The Prefix-Sum Operation or Scan Operation

• The operation can be implemented


using the all-to-all broadcast kernel.

• We must account for the fact that in


prefix sums the node with label k uses
information from only the k-node
subset whose labels are less than or
equal to k.

89
The Prefix-Sum Operation or Scan Operation

•This is implemented using an


additional result buffer. The content of
an incoming message is added to the
result buffer only if the message
comes from a node with a smaller label
than the recipient node.

• The contents of the outgoing message


(denoted by parentheses in the figure)
are updated with every incoming
message.
90
The Prefix-Sum Operation or Scan Operation

• Prefix sum operation also uses the same


communication pattern which is used in all-to-all
broadcast and all reduce operations.
• The sum sk = Σ_{i=0}^{k} ni, for all k between 0 and
p-1, for p numbers n0, n1, .., np-1 (one on each node), is
calculated.
• E.g. if the original sequence of numbers is
<3, 1, 4, 0, 2> then the sequence of prefix sum is
<3, 4, 8, 8, 10>
i.e. 3+null=3, 3+1=4, 4+4=8, 8+0=8, 8+2=10

91
The Prefix-Sum Operation or Scan Operation

• At start the number nk will be present with node k.


After termination of algorithm, same node holds
sum sk.
• Instead of single number, each node will have a
buffer or a vector of size m and result will be sum
of elements of buffers.
• Each node contain additional buffer denoted by
square brackets to collect the correct prefix sum.
• After every communication step, the message
from a node with a smaller label than that of the
recipient node is added to the result buffer.

92
The Prefix-Sum Operation

1. procedure PREFIX_SUMS_HCUBE(my_id, my_number, d, result)


2. begin
3. result := my_number;
4. msg := result;
5. for i := 0 to d − 1 do
6. partner := my_id XOR 2^i;
7. send msg to partner;
8. receive number from partner;
9. msg := msg + number;
10. if (partner < my_id) then result := result + number;
11. endfor;
12. end PREFIX_SUMS_HCUBE

Prefix sums on a d-dimensional hypercube.


93
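The procedure can be checked with the following Python sketch (a synchronous simulation under illustrative assumptions, not the original code); for the values 0, 1, ..., 7 it returns the prefix sums 0, 1, 3, 6, 10, 15, 21, 28.

# Illustrative simulation of PREFIX_SUMS_HCUBE on a 2^d-node hypercube.
def prefix_sums_hypercube(d, numbers):
    p = 1 << d
    result = list(numbers)                   # result := my_number
    msg = list(numbers)                      # msg := result
    for i in range(d):
        new_msg, new_result = list(msg), list(result)
        for node in range(p):
            partner = node ^ (1 << i)
            new_msg[node] = msg[node] + msg[partner]        # msg := msg + number
            if partner < node:                              # only smaller labels contribute
                new_result[node] = result[node] + msg[partner]
        msg, result = new_msg, new_result
    return result

print(prefix_sums_hypercube(3, list(range(8))))
# -> [0, 1, 3, 6, 10, 15, 21, 28]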
The Prefix-Sum Operation

Computing prefix sums on an eight-node hypercube (panels (a)-(d) as above).

• At each node, square brackets show the local prefix sum accumulated
in the result buffer and parentheses enclose the contents of the outgoing
message buffer for the next step.
• These contents are updated with every incoming message.
• Not all of the messages received by a node contribute to its final
result; some of the messages it receives may be redundant.
94
Questions on Prefix sum operations

• Explain the difference between


all-to-all reduction and all reduce
operations.

• Explain with example prefix sum


operations.

95
Scatter and Gather

• In the scatter operation, a single node


sends a unique message of size m to every
other node.
• This is called a one-to-all personalized
communication.

• In the gather operation, a single node


collects a unique message from each node.
• It is the dual of scatter operation.

96
Scatter and Gather

• While the scatter operation is


fundamentally different from broadcast,
the algorithmic structure is similar, except
for differences in message sizes
(messages get smaller in scatter and stay
constant in broadcast).

• The gather operation is exactly the inverse


of the scatter operation.

97
Gather and Scatter Operations

Scatter and gather operations.

• Consider the example of 8-node hypercube


• The communication patterns of all-to-all
broadcast and scatter operation are identical, the
only difference is in size and contents of the
message.
• As in figure above, initially source node 0 will have
all the messages.
98
Gather and Scatter Operations

Scatter and gather operations.

99
Example of the Scatter Operation

(a) Initial distribution of messages: node 0 holds (0,1,2,3,4,5,6,7). In the
first communication step, node 0 transfers half of the messages to one of
its neighbours (node 4).
(b) Distribution before the second step: node 0 holds (0,1,2,3) and node 4
holds (4,5,6,7). In the next step, every process that holds data transfers
half of it to a neighbour that has not yet received any data.
(c) Distribution before the third step: nodes 0, 2, 4 and 6 hold (0,1), (2,3),
(4,5) and (6,7) respectively.
(d) Final distribution of messages: node i holds (i). The process involves
log p communication steps, one for each of the log p dimensions of the
hypercube.

The scatter operation on an eight-node hypercube.
103
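The halving-and-forwarding pattern can be sketched in Python as follows (a simulation under simplifying assumptions, not the slides' algorithm text); after log p steps each node holds exactly the message addressed to it.

# Illustrative simulation of the hypercube scatter: in the step for dimension i,
# every node that holds data passes the half destined for the other sub-cube
# to its neighbour across that dimension.
def scatter_hypercube(d, messages):
    p = 1 << d
    held = {0: dict(enumerate(messages))}        # node 0 initially owns everything
    for i in range(d - 1, -1, -1):
        for node in list(held):
            bit = 1 << i
            give = {dst: m for dst, m in held[node].items() if dst & bit}
            held[node] = {dst: m for dst, m in held[node].items() if not (dst & bit)}
            held[node ^ bit] = give              # hand the other half to the partner
    return {node: msgs[node] for node, msgs in held.items()}

if __name__ == "__main__":
    print(scatter_hypercube(3, [f"msg{j}" for j in range(8)]))
    # {0: 'msg0', 1: 'msg1', ..., 7: 'msg7'}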
Gather Operations

Scatter and gather operations.


• The gather operation is reverse of scatter operation.
• Every node will have m word message.
• In the first step, each odd numbered node sends its buffer to an
even numbered neighbor behind it.
• The neighbor node concatenates the received message with its
own buffer.
• In the next communication step only even numbered nodes
participate in communication.
• The nodes whose labels are multiples of four gather more data and
double the size of their data.
• This process continues until node 0 has gathered all the data.
104
Example of the Gather Operation

(a) Initial distinct messages: node i holds (i).
(b) Collection before the second step: even-numbered nodes hold pairs,
e.g. node 0 holds (0,1).
(c) Collection before the third step: node 0 holds (0,1,2,3) and node 4
holds (4,5,6,7).
(d) Final collection of messages: node 0 holds (0,1,2,3,4,5,6,7).

The gather operation on an eight-node hypercube.
108
Cost of Scatter and Gather

• There are log p steps, in each step, the machine size halves
and the data size halves.
• We have the time for this operation to be:
T = ts log p + twm(p − 1).
• This time is same for a linear array as well as a 2-D mesh.
• In scatter operation, at least m(p-1) data must be transmitted
out of the source node,
• and in gather operation at least m(p-1) data must be received
by the destination node.
• Therefore, twm(p-1) time, is the lower bound on the
communication in scatter and gather operations.

Topic Questions
• Explain Scatter and Gather operations with example.
109
All-to-All Personalized Communication

• All-to-all personalized communication operation can be


applied in variety of parallel algorithms such as Fast
Fourier Transform, matrix transpose, sample sort, and
some parallel database join operations
• Each node has a distinct message of size m for every
other node.
• This is opposite of all-to-all broadcast, in which each
node sends the same message to all other nodes.
• All-to-all personalized communication is also known
as total exchange.

110
All-to-All Personalized Communication

[Figure: before the operation, node i holds the messages M(i,0), M(i,1), ...,
M(i,p−1); after the operation, node j holds M(0,j), M(1,j), ..., M(p−1,j),
where M(i,j) is the message originating on node i and destined for node j.]

All-to-all personalized communication.
111
All-to-All Personalized Communication: Example

Consider the problem of transposing a matrix.

• Each processor contains one full row of the matrix.


• The transpose operation in this case is identical to
an all-to-all personalized communication
operation.
• Let A is n x n matrix, transpose of matrix A is AT.
• AT will have same size as A and AT[i,j] = A[j,i] for
0<= i, j < n.

112
All-to-All Personalized Communication: Example

Consider the problem of transposing a matrix.

• Considering 1D row major partitioning of array, n x


n matrix can be mapped onto n processors such
that each processor contains one full row of the
matrix.
• Each processor sends a distinct element of the
matrix to every other processor as
all-to-all personalized communication.
• For p processes, where p ≤ n, each process will
have n/p rows (n^2/p elements of the matrix)
• For finding out the transpose, all-to-all
personalized communication of matrix blocks of
size n/p x n/p will be done.
113
All-to-All Personalized Communication: Example

[Figure: an n × n matrix partitioned row-wise onto processes P0-P3. After the
transpose, P0 holds [0,0],[1,0],[2,0],[3,0], P1 holds [0,1],[1,1],[2,1],[3,1],
..., and P3 holds [0,3],[1,3],[2,3],[3,3].]

All-to-all personalized communication in transposing a 4 × 4 matrix using four processes.

• Processor Pi contains the elements of the matrix with indices
[i,0], [i,1], .., [i,n-1].
• In the transpose AT, P0 holds the elements [i,0], P1 holds the elements
[i,1], and so on.
• Initially processor Pi holds element [i,j]; after the transpose, it
moves to Pj.
• The figure above shows the example of a 4 × 4 matrix mapped onto
four processes using one-dimensional row-wise partitioning.
114
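Viewed this way, the transpose is just a data remapping; the Python sketch below (illustrative, a sequential stand-in for the parallel exchange) shows the element movement for the 1-D row-wise partitioning with one row per process.

# Illustrative sketch: with one matrix row per process, the transpose is an
# all-to-all personalized exchange in which process i sends element [i, j]
# to process j.
def transpose_by_total_exchange(rows):
    n = len(rows)
    # the "message" from process i to process j is rows[i][j]
    return [[rows[i][j] for i in range(n)] for j in range(n)]

A = [[1, 2], [3, 4]]
print(transpose_by_total_exchange(A))   # [[1, 3], [2, 4]]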
All-to-All Personalized Communication on a Ring

• Each node sends (p − 1) pieces of


data of size m as one consolidated
message to one of its neighbors.
• These pieces are identified by label {x,
y}, where x is the label of the node
that originally owned the message,
and y is the label of the node that is
the final destination of the message.

115
All-to-All Personalized Communication on a Ring

• The label ({x1, y1}, {x2, y2}, . . . , {xn, yn})


indicates that a message is formed
by concatenating n individual
messages. For eg. ({0,1},{1,2},..,{4,5}).
• Each node extracts the information
meant for it from the message of size
m(p − 1) received, and forwards the
remaining (p − 2) data pieces of size
m each to the next node.

116
All-to-All Personalized Communication on a Ring

• The algorithm continued for (p − 1)


steps.
In (p − 1) steps every node receives
information from all the nodes in the
group.
• The size of the message reduces by m
at each step.
• In each step, each node extracts and keeps one
m-word packet (originating from a different
node) from the message it receives.
117
All-to-All Personalized Communication on a Ring

• All messages are sent in the same


direction.
• To reduce the communication cost
due to tw by factor of two, half of
the messages are sent in one
direction and remaining in reverse
direction to use communication
channels fully.
118
All-to-All Personalized Communication on a Ring

All-to-all personalized communication on a six-node ring, shown as
communication steps 1 through 5. The label of each message is of the form
{x, y}, where x is the label of the node that originally owned the message
and y is the label of the node that is the final destination of the message.
The label ({x1, y1}, {x2, y2}, . . . , {xn, yn}) indicates that a message is
formed by concatenating n individual messages.
124
All-to-All Personalized Communication on a Ring: Cost

• All-to-all personalized communication on a ring requires p − 1
communication steps in all.
• The size of the message transferred in the ith step is m(p − i).
• Therefore, the total time taken by this operation is given by:

T = Σ_{i=1}^{p−1} (ts + twm(p − i))
  = ts(p − 1) + twm Σ_{i=1}^{p−1} i
  = (ts + twmp/2)(p − 1).

• The tw term in this equation can be reduced by a factor of 2 by
communicating messages in both directions.
125
All-to-All Personalized Communication on a Mesh

• For all-to-all personalized


communication on mesh √p x √p,
each node first groups its p messages
according to the columns of their
destination nodes.
• Consider the example of 3 x 3 mesh.
• Each node have 9 m-word messages
one for each node.

126
All-to-All Personalized Communication on a Mesh

• For each node, three groups of three


messages are formed.
• The first group contains the messages
for destination nodes labelled 0, 3, and
6; the second group contains the
messages for nodes 1, 4, and 7; and
the last group of messages for nodes
labelled 2, 5, and 8.

127
All-to-All Personalized Communication on a Mesh

• After grouping, each row will contain


cluster of messages of size m√p.
• Each cluster contains information for
all the nodes of a column.
• Now in the first phase, all-to-all
personalized communication is
performed in each row.

128
All-to-All Personalized Communication on a Mesh

• After first phase, the messages


present with each node are sorted
again according to the rows of their
destination nodes.
• In the second phase, similar
communication is carried out.
• After completion of second phase,
node i will have the messages ({0,i},..,
{8,i}) where 0 <= i <= 8. So each node
will have the a message from every
other node. 129
All-to-All Personalized Communication on a Mesh

The label of each message is of the form


{x, y}, where x is the label of the node that
originally owned the message, and y is the label
of the node that is the final destination of the
message.
The distribution of messages at the beginning
of each phase of all-to-all personalized
communication on a 3 × 3 mesh. At the end of
the second phase, node i has messages ({0,i}, . . .
,{8,i}), where 0 ≤ i ≤ 8. The groups of nodes
communicating together in each phase are
enclosed in dotted boundaries.
130
All-to-All Personalized Communication on a Mesh

(a) Data distribution at the beginning of the first phase 131


All-to-All Personalized Communication on a Mesh

(b) Data distribution at the beginning of the second phase 132


All-to-All Personalized Communication on a Mesh

[Figure: final data distribution after the second phase on the 3 × 3 mesh;
node i holds the messages ({0,i},{1,i}, . . . ,{8,i}), e.g. node 0 holds
({0,0},{1,0},...,{8,0}).]

(c) Final data distribution after the second phase
133
All-to-All Personalized Communication on a Mesh: Cost

• Time for the first phase is identical to that in a


ring with √p processors, i.e.,
(ts + twmp/2)(√p − 1).

• The time in the second phase is identical to the first
phase. Therefore, the total time is twice this time, i.e.,

T = (2ts + twmp)(√p − 1).

134
All-to-All Personalized Communication on a Mesh: Cost

• It is noted that, time required for sorting the


messages by row and column is not considered in
calculation of T.
• It is assumed that the data is ready for first
communication phase, so in second communication
phase, the rearrangement of mp words of data is
done.
• Let tr is the time to read and write single word data
in a node’s local memory.
• So, total time spent in data rearrangement by a node
in complete process is tr x m x p.
• This time is very small as compared to
communication time T above.
135
All-to-All Personalized Communication on a
Hypercube

• Generalize the mesh algorithm to log p steps.


• At any stage in all-to-all personalized
communication on p node hypercube, every
node holds p packets of size m each.
• While communicating in a particular
dimension, every node sends p/2 of these
packets (consolidated as one message).
• A node must rearrange its messages locally
before each of the log p communication steps
takes place.
• In each step, the data is exchanged by pairs of
nodes for a different dimension.
136
All-to-All Personalized Communication on a
Hypercube

An all-to-all personalized communication algorithm on a three-
dimensional hypercube.
(a) Initial distribution of messages: node i holds ({i,0} ... {i,7}).
(b) Distribution before the second step.
(c) Distribution before the third step.
(d) Final distribution of messages: node j holds ({0,j} ... {7,j}).
141
All-to-All Personalized Communication on a
Hypercube: Cost

• We have log p iterations and mp/2 words are


communicated in each iteration. Therefore, the
cost is:
T = (ts + twmp/2) log p.

• This is not optimal!

142
All-to-All Personalized Communication on a
Hypercube: Optimal Algorithm

• Each node simply performs p − 1 communication


steps, exchanging m words of data with a
different node in every step.

• A node must choose its communication partner in


each step so that the hypercube links do not
suffer congestion.

143
All-to-All Personalized Communication on a
Hypercube: Optimal Algorithm

• In the jth communication step, node i exchanges


data with node (i XOR j).

• In this schedule, all paths in every communication


step are congestion-free, and none of the
bidirectional links carry more than one message
in the same direction.

144
All-to-All Personalized Communication on a
Hypercube: Optimal Algorithm

[Figure panels (a)-(g): in step j every node i exchanges data with node
i XOR j.]

Seven steps in all-to-all personalized communication on an
eight-node hypercube.
145
All-to-All Personalized Communication on a
Hypercube: Optimal Algorithm

1. procedure ALL_TO_ALL_PERSONAL(d, my_id)
2. begin
3.    for i := 1 to 2^d − 1 do
4.    begin
5.       partner := my_id XOR i;
6.       send M_{my_id, partner} to partner;
7.       receive M_{partner, my_id} from partner;
8.    endfor;
9. end ALL_TO_ALL_PERSONAL

A procedure to perform all-to-all personalized communication on a d-
dimensional hypercube. The message M_{i,j} initially resides on node i
and is destined for node j.
146
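The schedule can be simulated with a few lines of Python (an illustrative sketch, not the source's implementation); the assertion checks that after p − 1 steps node n holds the message destined for it from every source s.

# Illustrative simulation of ALL_TO_ALL_PERSONAL: in step j, node i exchanges
# one m-word message with node i XOR j.
def all_to_all_personal(d, M):
    # M[i][j] is the message that initially resides on node i, destined for node j
    p = 1 << d
    received = [{i: M[i][i]} for i in range(p)]      # each node keeps its own block
    for step in range(1, p):                         # p - 1 exchange steps
        for node in range(p):
            partner = node ^ step
            received[node][partner] = M[partner][node]   # receive M(partner, node)
    return received

if __name__ == "__main__":
    M = [[f"{i}->{j}" for j in range(8)] for i in range(8)]
    res = all_to_all_personal(3, M)
    assert all(res[n][s] == f"{s}->{n}" for n in range(8) for s in range(8))
    print("total exchange complete in 7 steps")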
All-to-All Personalized Communication on a
Hypercube: Cost Analysis of Optimal Algorithm

• There are p − 1 steps and each step involves non-


congesting message transfer of m words.
• We have:

T=(ts + twm)(p − 1).

• This is asymptotically optimal in message size.

147
Circular Shift
• Circular shift can be applied in some
matrix computations and in string and
image pattern matching.
• It is a member of a broader class of global
communication operations known as permutation.
• In a permutation, every node sends a message of
m words to a unique node.
• In a circular q-shift, node i sends data to node
(i + q) mod p in a group of p nodes, where 0 < q < p.
148
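Before looking at specific topologies, the effect of the operation itself can be written down in a few lines of Python (a semantic sketch only, not a mesh or hypercube schedule).

# Semantic sketch of a circular q-shift on p nodes: the data that started on
# node i ends up on node (i + q) mod p.
def circular_shift(data, q):
    p = len(data)
    return [data[(i - q) % p] for i in range(p)]

print(circular_shift([0, 1, 2, 3, 4, 5, 6, 7], 5))
# [3, 4, 5, 6, 7, 0, 1, 2]: node i now holds the data from node (i - 5) mod 8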
Circular Shift on a Mesh

• Mesh algorithms for circular shift can be


derived by using ring algorithm.
• Wraparound connections are assumed in the
mesh, i.e. in a row of 4 nodes 0,1,2,3, node
3 can communicate and send data directly to node
0.
• Implementation can be performed by
min(q,p-q) neighbor-to-neighbor
communications in one direction where p
is number of nodes and q is the number of
shifts to be performed.
149
Circular Shift on a Mesh

• In p-node square wraparound mesh, for


nodes with row major labels, a circular q-
shift is performed in two stages.
• Example, q=5 shifts, p=16 (4x4 mesh)
• In the first stage, the data is shifted
simultaneously by (q mod √p) steps in all
the rows, i.e. by (5 mod √16) = 1 step in our example.
• In second phase, it is shifted by [q/√p]
steps along the columns.

150
Circular Shift on a Mesh

• Due to wraparound connection while


circular row shifts, the data moves from
highest to lowest labeled nodes of the row.
For e.g. data with node 3 will be shifted to
node 0 in the first row.
• Note that to compensate for the distance
√p that they lost while traversing the
backward edge in their respective rows,
the data packets must be shifted by an
additional step.
151
Circular Shift on a Mesh

• In the example, after the row shifts there is one
compensatory column shift, followed by the
remaining column shifts.

• Total time for any circular q-shift on a


p-node mesh using packets of size m is :
T = (ts + twm)(√p + 1).

152
Circular Shift on a Mesh

The communication steps in a circular 5-shift on a 4 × 4 mesh.
(a) Initial data distribution and the first communication step (row shifts).
(b) Step to compensate for the backward row shifts: data from node 3 was
supposed to shift to node 4, but due to the wraparound row shift it
arrived at node 0, so an extra column step compensates for this.
(c) Column shifts in the third communication step.
(d) Final distribution of the data.
156
Circular Shift on a Hypercube
• For the shift operation on a hypercube, a linear
array with 2^d nodes is mapped onto the
d-dimensional hypercube.
• Node i of the linear array is assigned to
node j of the hypercube, where j is the d-bit
binary Reflected Gray Code (RGC) of i.
• Consider the eight-node hypercube shown
in the figure: any two nodes at distance
2^i on the linear array are separated by exactly two links.
• The exception is i = 0, where the nodes are
directly connected and only one
hypercube link separates them. 157
Circular Shift on a Hypercube
• For a q-shift operation, q is expanded as a
sum of distinct powers of 2. For example,
the number 5 can be expanded as 2^2 + 2^0.
• Note that the number of terms in the sum equals
the number of 1's in the binary representation
of q, e.g. for 5 (101) there are two terms in the
sum, corresponding to bit position 2 and bit
position 0, i.e. 2^2 + 2^0.
• A circular q-shift on a hypercube is
performed in s phases, where s is the number
of distinct powers of 2 in the expansion.
158
Circular Shift on a Hypercube
• For example, a 5-shift operation is
performed as a 4-shift (2^2) followed by a
1-shift (2^0).
• Each shift takes two communication
steps; only a 1-shift takes a single step.
For example, the first phase of a 4-shift
consists of two steps and the second
phase (a 1-shift) consists of one step.
• The total number of steps for any q in a p-
node hypercube is at most 2 log p − 1.
159
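The decomposition of q into phases can be computed directly from its binary representation, as in the short Python sketch below (illustrative; it lists the powers of 2 from highest to lowest).

# Decompose a circular q-shift into sub-shifts that are powers of two,
# one per 1-bit in the binary representation of q.
def shift_phases(q):
    return [1 << k for k in reversed(range(q.bit_length())) if q & (1 << k)]

print(shift_phases(5))   # [4, 1]: a 5-shift = a 4-shift followed by a 1-shift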
Circular Shift on a Hypercube
• The time for this is upper bounded by:
T = (ts + twm)(2 log p − 1).
• If E-cube routing is used, this time can be
reduced to
T = ts + twm.

160
Circular Shift on a Hypercube

(a) The first phase (a 4-shift), carried out in two communication steps.
(b) The second phase (a 1-shift).
(c) Final data distribution after the 5-shift.

The mapping of an eight-node linear array onto a three-dimensional hypercube to
perform a circular 5-shift as a combination of a 4-shift and a 1-shift.
163
Circular Shift on a Hypercube

[Figure panels (a)-(g): circular q-shifts on an 8-node hypercube for
q = 1, 2, 3, 4, 5, 6, 7.]

Circular q-shifts on an 8-node hypercube for 1 ≤ q < 8. 167


Circular Shift on a Hypercube

Circular q-shifts on an 8-node hypercube for 1 ≤ q < 8 (panels (a)-(g) as
above). 168


Improving Performance of Operations

• Splitting and routing messages into parts: If the message can be split into p
parts, a one-to-all broadcast can be implemented as a scatter operation followed
by an all-to-all broadcast operation. The time for this is:

T = 2 × (ts log p + tw(p − 1)(m/p))
  ≈ 2 × (ts log p + twm).

• All-to-one reduction can be performed by performing an all-to-all reduction (dual of
all-to-all broadcast) followed by a gather operation (dual of scatter).

169
Improving Performance of Operations

• Since an all-reduce operation is semantically equivalent to an all-to-one reduction


followed by a one-to-all broadcast, the asymptotically optimal algorithms for
these two operations can be used to construct a similar algorithm for the all-
reduce operation.
• The intervening gather and scatter operations cancel each other. Therefore, an all-
reduce operation requires an all-to-all reduction and an all-to-all broadcast.

170
Improving Performance of Operations

• The communication algorithms are based on two assumptions:


• 1) original message cannot be divided into small parts
• 2) each node uses a single port for sending and receiving data

• We can analyse the effect of not following these two assumptions:

Splitting and Routing messages in parts:

171
University Questions on Unit 3
• August 2018 (Insem)
• 1. Explain Broadcast and Reduction example for
multiplying matrix with a vector.(6)
• 2. Explain the concept of scatter and gather (4)
• 3. Compare the one-to-all broadcast operation for Ring,
Mesh and Hypercube topologies (6)
• 4. Explain the prefix-sum operation for an eight-node
hypercube (4)

• Nov-Dec 2018 (Endsem)


• 1. Write a short note on All-to-one reduction with
suitable example. [6]

172
University Questions on Unit 3
• Nov-Dec 2019 (Endsem)
• 1 Explain term of all-to-all broadcast on linear array,
mesh & Hypercube
• topologies. [8]
• 2 Write short note on circular shift on a mesh. [6]

• Oct 2019 (Insem)


• 1. Explain broadcast & reduce operation with diagram.
[4]
• 2. Explain prefix- sum operation for an eight-node
hypercube. [6]
• 3. Explain scatter and gather operation? [4]
• 4. Explain all to one broadcast and reduction on a ring?
[6] 173
University Questions on Unit 3
• May-June-2019 (Endsem)
• 1. Explain Circular shift operation on mesh and
hypercube network. [8]

174
Parallel Algorithm Models

Prof V B More
MET’s IOE, BKC, Nashik
Parallel Algorithm Models
These models are used to specify details for
partitioning data and how these data are
processed.
A model is used to provide structure of
parallel algorithms based on two techniques:
• selection of partitioning and mapping
technique;
• appropriate use of technique for
minimization of interaction.

Prof V B More, MET’s IOE, BKC, Nashik


Parallel Algorithm Models
There are various parallel algorithm models :
• The data parallel model
• The task graph model
• The work pool model
• The master slave model
• The pipeline or producer consumer model
• Hybrid models

Prof V B More, MET’s IOE, BKC, Nashik


The Data-Parallel Model
• The tasks are statically or semi-statically
attached onto processes.
• Each task performs identical operations on
a variety of data.
• Single operations being applied on multiple
data items is called data parallelism (SIMD
model).
• The task may be executed in phases.
• Data is different in different phases.
Prof V B More, MET’s IOE, BKC, Nashik
The Data-Parallel Model.. contd
• Since all tasks perform same
computations, the decomposition of the
problem into tasks is usually based on data
partitioning to guarantee the load balance.
• Data-parallel algorithms can be
implemented in both shared-address-space
and message-passing paradigms.

Prof V B More, MET’s IOE, BKC, Nashik


The Data-Parallel Model.. contd
• Partitioned address-space and locality of
reference allow better control of placement
of data in message-passing interface
paradigm.
• If the distribution of data is different in
different phases, the shared-address space
paradigm reduces programming efforts.
• Interaction overheads in the data-parallel
model can be minimized by locality
Prof V B More, MET’s IOE, BKC, Nashik
The Data-Parallel Model.. contd
• A key characteristic of data-parallel
problems is that for most problems, the
degree of data parallelism increases with
the size of the problem, which leads to use
more processes to solve larger problems
effectively.
• An example of a data-parallel algorithm is
dense matrix multiplication problem.

Prof V B More, MET’s IOE, BKC, Nashik


Example: Multiplying a Dense Matrix with a Vector
A n b y
Task 1
01
Computation of each
2 element of output vector y
is independent of other
elements. Based on this, a
dense matrix-vector
product can be
decomposed into n tasks.
n-1 shaded portion of the
Task n matrix and vector is
accessed by Task 1.
Findings: While tasks share data (the vector b), they do not
have any control dependencies – i.e., no task needs to
wait for the (partial) completion of any other. All tasks are
of the same size in terms of number of operations.
8
Prof V B More, MET’s IOE, BKC, Nashik
The Task Graph Model
• The computations in any parallel algorithm
can be viewed as a task graph.
• The task graph may be either trivial or
nontrivial.
• The type of parallelism that is expressed by
the task graph is called task parallelism.
• In certain parallel algorithms, the task
graph is explicitly used in establishing
relationship between various tasks.
Prof V B More, MET’s IOE, BKC, Nashik
The Task Graph Model
• Interrelationships among the tasks are
utilized to promote locality or to reduce
interaction costs.
• This model is applied to solve problems in
which the amount of data associated with
the tasks is huge relative to the amount of
computation associated with them.

Prof V B More, MET’s IOE, BKC, Nashik


The Task Graph Model ..contd
• The tasks are mapped statically to optimize
the cost of data movement among tasks.
• Work is more easily shared in a globally
addressable space, but mechanisms are also
available to share work across disjoint address
spaces.

Prof V B More, MET’s IOE, BKC, Nashik


The Task Graph Model ..contd
• Typical interaction-reducing techniques
applicable to this model include reducing
the volume and frequency of interaction by
promoting locality while mapping the tasks
based on the interaction pattern of tasks.
• Asynchronous interaction methods are
used to overlap the interaction with
computation.
Prof V B More, MET’s IOE, BKC, Nashik
The Task Graph Model ..contd
• Examples of algorithms based on the task
graph model include parallel quicksort,
sparse matrix factorization, and many
parallel algorithms derived via divide-and-
conquer approach.

Prof V B More, MET’s IOE, BKC, Nashik


The Work Pool Model
• In the work pool or the task pool model, the
dynamic mapping of tasks onto processes
is performed for load balancing.
• The task may be executed by any process.
• There is no desired pre-mapping of tasks
onto processes.
• The mapping may be centralized or
decentralized.

Prof V B More, MET’s IOE, BKC, Nashik


The Work Pool Model
• Pointers to the tasks may be stored in a
physically shared list, priority queue, hash
table, tree, or in a physically distributed
data structure.
• The work may be statically available in the
beginning, or could be dynamically
generated; i.e., the processes may generate
work and add it to the global work pool.
Prof V B More, MET’s IOE, BKC, Nashik
The Work Pool Model ..contd
• If the work is generated dynamically and a
decentralized mapping is used, then a
termination detection algorithm is required
to detect completion of the entire program.
• In message-passing paradigm, the work
pool model is used when the amount of
data associated with tasks is relatively
small as compared to the computation
associated with it.
Prof V B More, MET’s IOE, BKC, Nashik
The Work Pool Model ..contd
• Therefore, tasks can be easily moved
around without causing too much data
interaction overhead.
• The granularity of the tasks can be
adjusted to obtain the desired tradeoff
between load imbalance and the
overhead of accessing the work pool for
adding and extracting tasks.
Prof V B More, MET’s IOE, BKC, Nashik
The Work Pool Model ..contd
• Parallelization of loops by chunk
scheduling or related methods is an
example of the use of the work pool model.
• Parallel tree search where the work is
represented by a centralized or distributed
data structure is an example of the use of
the work pool model where the tasks are
generated dynamically.
Prof V B More, MET’s IOE, BKC, Nashik
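A minimal sketch of the work pool model (not from the slides), assuming a thread-based shared task queue; all names here (worker, pool, results) are illustrative.

# Hypothetical sketch of the work pool model: tasks live in a shared queue and
# idle workers pull the next task dynamically (dynamic mapping for load balancing).
import queue, threading

def worker(pool, results):
    while True:
        try:
            task = pool.get_nowait()     # grab the next available task
        except queue.Empty:
            return                       # pool exhausted: this worker terminates
        results.append(task * task)      # stand-in for the real computation

pool = queue.Queue()
for t in range(20):                      # work is available statically at the start;
    pool.put(t)                          # workers could also pool.put() new tasks dynamically

results = []
threads = [threading.Thread(target=worker, args=(pool, results)) for _ in range(4)]
for th in threads: th.start()
for th in threads: th.join()
print(sorted(results))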
The Master-Slave Model
• In this master-slave or the manager-worker
model, one or more master processes
generate work and allocate it to slave
processes.
• The tasks may be allocated a priori if the
manager can estimate their sizes.
• Alternatively, random mapping can be performed
for better load balancing.
Prof V B More, MET’s IOE, BKC, Nashik
The Master-Slave Model
• Workers are assigned smaller pieces of
work at different times.
• Work may need to be performed in phases,
and the work in each phase must finish before
the work in the next phase can be generated.
• The manager may cause all workers to
synchronize after each phase.
• There is no desired pre-mapping of work to
processes, and any worker can do any job
assigned to it.

Prof V B More, MET’s IOE, BKC, Nashik


The Master-Slave Model ..contd
• The manager-worker model can be
generalized to the hierarchical or multi-level
manager-worker model in which the top
level manager feeds large chunks of tasks
to second-level managers, who further
subdivide the tasks among their own
workers and may perform part of the work
themselves.
• Care should be taken to ensure that the
master does not become a bottleneck.
Prof V B More, MET’s IOE, BKC, Nashik
The Master-Slave Model ..contd
• The granularity of tasks should be chosen
such that the cost of doing work dominates
the cost of communication and
synchronization.
• Asynchronous interaction is useful for
overlapping interaction with the computation
associated with work generation by the
master.
• It may also reduce the time workers spend waiting for work.
Prof V B More, MET’s IOE, BKC, Nashik
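A minimal master-worker sketch (not from the slides), using Python multiprocessing as a stand-in for real processes; the names worker and master and the chunk_size parameter are hypothetical.

# Hypothetical master-worker sketch: the master generates chunks of work and
# hands them to worker processes; chunk_size controls the task granularity.
from multiprocessing import Pool

def worker(chunk):
    # Each worker processes whatever chunk the master assigns to it.
    return sum(x * x for x in chunk)

def master(data, n_workers=4, chunk_size=8):
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with Pool(n_workers) as workers:
        # imap_unordered lets fast workers pick up new chunks as soon as they finish,
        # which helps balance the load when chunk costs vary.
        return sum(workers.imap_unordered(worker, chunks))

if __name__ == "__main__":
    print(master(list(range(100))))      # 328350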
Producer-Consumer Model
• In the producer-consumer (or pipeline)
model, a stream of data is passed through
a succession of processes, each of which
performs some task on it.
• The simultaneous execution of different
programs on a data stream is called
stream parallelism.
• The arrival of new data triggers the
execution of a new task by a process in
the pipeline.
Prof V B More, MET’s IOE, BKC, Nashik
Producer-Consumer Model
• The processes form a pipeline in the shape
of a linear or multidimensional array, a tree,
or a general graph with or without cycles.
• A pipeline is a chain of producers and
consumers.
• Each process in the pipeline can be
viewed as a consumer of a data stream for
the process preceding it in the pipeline
and as a producer of data for the process
following it in the pipeline.
Prof V B More, MET’s IOE, BKC, Nashik
Producer-Consumer Model ..contd
• The pipeline does not need to be a linear
chain; it can be a directed graph.
• The pipeline model usually involves a
static mapping of tasks onto processes.
• Load balancing is a function of task
granularity.
• The larger the granularity, the longer it
takes to fill up the pipeline, i.e. for the
trigger produced by the first process in the
chain to propagate to the last process,
thereby keeping some of the processes waiting.
Prof V B More, MET’s IOE, BKC, Nashik
Producer-Consumer Model ..contd
• However, too fine a granularity may increase
interaction overheads, because processes
will need to interact to receive fresh data
after smaller pieces of computation.
• The most common interaction reduction
technique applicable to this model is
overlapping interaction with computation.
An example of a two-dimensional pipeline
is the parallel LU factorization algorithm.

Prof V B More, MET’s IOE, BKC, Nashik
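A minimal sketch of the pipeline structure (not from the slides) using Python generators; this only illustrates the producer-consumer chaining. In a real setting each stage would run in its own process connected by queues; the stage names are hypothetical.

# Hypothetical producer-consumer (pipeline) sketch: each stage consumes the
# stream produced by the previous stage and produces a new stream for the next.
def produce(n):                 # stage 1: producer
    for i in range(n):
        yield i

def square(stream):             # stage 2: consumer of stage 1, producer for stage 3
    for x in stream:
        yield x * x

def accumulate(stream):         # stage 3: final consumer, emits running totals
    total = 0
    for x in stream:
        total += x
        yield total

for value in accumulate(square(produce(5))):
    print(value)                # 0, 1, 5, 14, 30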


Hybrid Models
• In some cases, more than one model may
be applicable to the problem at hand,
resulting in a hybrid algorithm model.
• A hybrid model may be composed either
of multiple models applied hierarchically
or multiple models applied sequentially to
different phases of a parallel algorithm.

Prof V B More, MET’s IOE, BKC, Nashik


Hybrid Models
• In some cases, an algorithm formulation
may have characteristics of more than one
algorithm model. For example, data may
flow in a pipelined manner in a pattern
guided by a task graph. In another case,
the major computation may be described
by a task graph, but each node of the
graph may represent a supertask consisting
of multiple subtasks that may be suitable
for data-parallel or pipelined parallelism.
Ex. Parallel quicksort.
Prof V B More, MET’s IOE, BKC, Nashik
The Age of Parallel
Processing

Prof V B More
MET-BKC IOE

1
The Age of Parallel Processing

• In recent years, much has been made
of the computing industry's widespread
shift to parallel computing.

• Nearly all consumer computers in the


year 2020 are manufactured with
multicore central processors.

2
The Age of Parallel Processing
• From dual-core, low-end notebook
machines to 8- and 16-core workstation
computers, consumer machines are no less
than the supercomputers or mainframes of the past.

• Command prompts are out and


multithreaded graphical interfaces are
in.

3
The Age of Parallel Processing
• Electronic devices such as mobile
phones and portable music players have
come to include parallel computing
capabilities to enhance their
performance.

• Cellular phones that only make calls


are out; phones that can simultaneously
play music, browse the Web, and provide
GPS services are in.
4
The Age of Parallel Processing

• As a result, software developers now


need to cope with a variety of parallel
computing platforms and technologies
in order to provide novel and rich
experiences for an increasingly
sophisticated base of users.

5
The Age of Parallel Processing

Evolution of the CPUs


• For 30 years, one of the important
methods for improving the
performance of consumer computing
devices has been to increase the speed
at which the processor's clock operates.

6
The Age of Parallel Processing
Evolution of the CPUs

• Starting with the first personal


computers of the early 1980s,
consumer CPUs ran with internal
clocks operating around 1 MHz.

7
The Age of Parallel Processing
Evolution of the CPUs

• 30 years later (2010), most desktop


processors have clock speeds between
1 GHz and 4 GHz, nearly 1000 times
faster than the clock on the original
personal computer.

8
The Age of Parallel Processing
Evolution of the CPUs

• Although increasing the CPU clock
speed is certainly not the only method
by which computing performance has
been improved, it has always been a
reliable source of improved
performance.

9
The Rise of GPU Computing
• A graphics processing unit (GPU) is a
specialized electronic circuit designed
to rapidly manipulate and alter memory
to accelerate the creation of images in
a frame buffer used for outputting to a
display device.

10
The Rise of GPU Computing
• GPUs are used in embedded systems,
mobile phones, personal computers,
workstations, research labs, and game
consoles.

• Modern GPUs are very efficient at


manipulating computer graphics and
image processing.

11
The Rise of GPU Computing
• Their highly parallel structure makes
them more efficient than general-
purpose CPUs for algorithms where the
processing of large blocks of data is
done in parallel.

12
The Rise of GPU Computing
• In a personal computer, a GPU can be
present on a video card, or it can be
embedded on the motherboard or in
certain CPUs - on the CPU die.

13
The Rise of GPU Computing
• In comparison to the central
processor's traditional data processing
pipeline, performing general-purpose
computations on a graphics
processing unit (GPGPU) is a relatively
new concept.

14
The Rise of GPU Computing
• In fact, GPU itself is relatively new
compared to the computing field at
large. However, the idea of computing
on graphics processors is not new.

15
A Brief History of GPUs, Early GPU
• We have already looked at how CPUs
evolved in both clock speeds and core
count.

• In the meantime, the state of graphics
processing underwent a dramatic
revolution.

16
A Brief History of GPUs, Early GPU

• In the late 1980s and early 1990s, the


growth in popularity of graphically
driven operating systems such as MS
Windows helped create a market for a
new type of processor.

17
A Brief History of GPUs, Early GPU

• In the early 1990s, users began purchasing
2D display accelerators for their
personal computers, with hardware-
assisted bitmap operations for their
graphical operating systems.

18
A Brief History of GPUs, Early GPU

• In the 1980s, Silicon Graphics used three-
dimensional graphics in a variety of
markets, including government and defense
applications and scientific and
technical visualization.

19
A Brief History of GPUs, Early GPU

• In 1992, Silicon Graphics opened the
programming interface to its hardware
by releasing the OpenGL library as a
standardized, platform-independent
method for writing 3D graphics
applications.

20
A Brief History of GPUs, Early GPU
• By the mid-1990s, computer-based first-
person games such as Doom, Duke
Nukem 3D, and Quake came to market.

21
A Brief History of GPUs, Early GPU

• In the mid-1990s, NVIDIA, ATI Technologies,
and 3dfx Interactive began releasing
graphics accelerators that were
affordable. NVIDIA's GeForce 256 used a
graphics pipeline architecture.

22
A Brief History of GPUs, Early GPU

• The term GPU was popularized by


NVIDIA in 1999, who marketed the
GeForce 256 as “the world's first GPU”.

23
A Brief History of GPUs, Early GPU

• NVIDIA's GeForce 3 series, released in
2001, was the computing industry's first
chip to implement Microsoft's DirectX
8.0 standard (which was very new at
that time).

24
A Brief History of GPUs, Early GPU

• Rival ATI Technologies coined the term


“visual processing unit” or VPU with the
release of the Radeon 9700 in 2002.

25
A Brief History of GPUs, Early GPU

NVIDIA Comparative Chart

26
NVIDIA GPU Development History
Basic Communication Operations

V.B.More
MET’s IOE, BKC, Nashik

Thanks to Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar for providing slides

1
Topic Overview

1) One-to-All Broadcast and All-to-One


Reduction
2) All-to-All Broadcast and Reduction
3) All-Reduce and Prefix-Sum Operations
4) Scatter and Gather
5) All-to-All Personalized Communication
6) Circular Shift
7) Improving the Speed of Some
Communication Operations
2
Basic Communication Operations: Introduction

• Computations and communication are two


important factors of any parallel algorithm.

• Many interactions in practical parallel


programs occur in well-defined patterns
involving groups of processors.

• Efficient implementations of these


operations can improve performance,
reduce development effort and cost, and
improve software quality.
3
Basic Communication Operations: Introduction

• Efficient implementations must leverage


underlying architecture. For this reason, we
refer to specific architectures here.

• We select a descriptive set of architectures


to illustrate the process of algorithm
design.

4
Basic Communication Operations: Introduction

• Group communication operations are built


using point-to-point messaging primitives.

• We assume architectures in which communicating a message
of size m over an uncongested network takes
time ts + twm.

• We use this as the basis for our analyses.


Where necessary, we take congestion into
account explicitly by scaling the tw term.
5
Basic Communication Operations: Introduction

• We assume that the network is bidirectional


and that communication is single-ported.

6
One-to-All Broadcast and All-to-One Reduction

• A processor has a piece of data (of size m) it


needs to send to every other processor.
• The dual of one-to-all broadcast is all-to-one
reduction.
• In all-to-one reduction, each processor has m
units of data. These data items must be
combined piece-wise (using some associative
operator, such as addition or min), and the
result made available at a target processor.

7
One-to-All Broadcast and All-to-One
Reduction

One-to-all broadcast and all-to-one reduction


among p processors.

8
One-to-All Broadcast and All-to-One Reduction

One-to-all broadcast
• One-to-all broadcast is the operation in which
a single processor sends identical data to all
other processors.
• Many parallel algorithms need this
operation.
• Consider data of size m that is to be sent to all the
processors.
• Initially, only the source processor has the data.
• After completion of the algorithm, each processor
holds a copy of the initial data.
9
One-to-All Broadcast and All-to-One Reduction

All-to-One Reduction
• All-to-One reduction is the operation in which
data from all processors are combined at a
single destination processor.
• Various operations like sum, product, max,
min, avg of numbers can be performed by
all-to-one reduction operation.

10
One-to-All Broadcast and All-to-One Reduction

All-to-One Reduction

• Each of the p processors has a buffer M
containing m words.
• After completion of the algorithm, the ith word of
the accumulated M is the sum, product,
maximum, or minimum of the ith words of
each of the original buffers.

11
One-to-All Broadcast and All-to-One
Reduction

One-to-all broadcast and all-to-one reduction


among p processors.

12
One-to-All Broadcast and All-to-One Reduction
on Rings
• Simplest way is to send p − 1 messages from
the source to the other p − 1 processors –
this is not very efficient.
• Use recursive doubling: the source sends the
message to a selected processor. We now
have two independent problems defined
over the two halves of the machine.

• Reduction can be performed in an identical


fashion by inverting the process.
13
One-to-All Broadcast
[Figure: eight-node ring, nodes 0–7; dotted arrows labeled 1–3 show the message-transfer steps from node 0.]
One-to-all broadcast on an eight-node ring. Node 0 is the source of the broadcast. Each message transfer step is shown by a numbered, dotted arrow from the source of the message to its destination. The number on an arrow indicates the time step during which the message is transferred.
14
All-to-One Reduction

[Figure: eight-node ring; dotted arrows labeled 1–3 show the reduction steps toward node 0.]
Reduction on an eight-node ring with node 0 as
the destination of the reduction.
15
Broadcast and Reduction: Example

Consider the problem of multiplying a matrix


with a vector.

• The n × n matrix is assigned to an n × n (virtual)


processor grid. The vector is assumed to be on
the first row of processors.

• The first step of the product requires a


one-to-all broadcast of the vector element
along the corresponding column of
processors. This can be done concurrently for all n columns.
16
Broadcast and Reduction: Example

• The processors compute local product of the


vector element and the local matrix entry.

• In the final step, the results of these products


are accumulated to the first row using n
concurrent all-to-one reduction operations
along the columns (using the sum operation).

17
Broadcast and Reduction: Matrix-Vector
Multiplication Example
[Figure: a 4 × 4 grid of processes P0–P15 computing the matrix-vector product; the input vector is distributed by one-to-all broadcasts and the partial products are combined by all-to-one reductions to form the output vector.]
One-to-all broadcast and all-to-one reduction in the multiplication of a 4 × 4 matrix with a 4 × 1 vector.
18
Broadcast and Reduction on a Mesh
•We can view each row and column of a square
mesh of p nodes as a linear array of √p nodes.
•Broadcast and reduction operations can be
performed in two steps – the first step does
the operation along a row and the second
step along each column concurrently.
•This process generalizes to higher dimensions
as well.

19
Broadcast and Reduction on a Mesh
•Consider a 2D square mesh with √p rows and √p
columns for the one-to-all broadcast operation.
•First, the source sends the data to the other
√p − 1 nodes of its row using a one-to-all
broadcast.
•In the second phase, the nodes of that row broadcast
the data along their respective columns, again by
one-to-all broadcast.
•Thus, each node of the mesh ends up with a copy of
the initial message.

20
Broadcast and Reduction on a Mesh: Example
[Figure: one-to-all broadcast on a 16-node (4 × 4) mesh; source node 0; arrows are labeled with time steps 1–4.]
First phase (row data transfer): steps 1 and 2. Step 1: node 0 sends to node 8. Step 2: node 0 sends to node 4 and node 8 sends to node 12. After this phase, the first-row nodes 0, 4, 8, and 12 hold the data.
21
Broadcast and Reduction on a Mesh: Example
[Figure: the same 16-node mesh; arrows labeled 3 and 4 show the column data transfers.]
Second phase (column data transfer): steps 3 and 4. Step 3: node 0 sends to node 2, node 4 to node 6, node 8 to node 10, and node 12 to node 14. Step 4: in the first column, node 0 sends to node 1 and node 2 to node 3; in the second, node 4 to node 5 and node 6 to node 7; in the third, node 8 to node 9 and node 10 to node 11; in the fourth, node 12 to node 13 and node 14 to node 15.
22
Broadcast and Reduction on a Mesh: Example
[Figure: the same 16-node mesh after both phases.]
A similar process for one-to-all broadcast on a three-dimensional mesh can be carried out by treating the rows of nodes in each of the three dimensions as linear arrays.
23
Broadcast and Reduction on a Mesh: Example
[Figure: the same 16-node mesh with the arrow directions reversed for reduction.]
The reduction process for the linear array can be carried out on two- and three-dimensional meshes as well by reversing the direction and order of the messages.
24
Broadcast and Reduction on a Hypercube

•A hypercube with 2^d nodes can be
regarded as a d-dimensional mesh with two
nodes in each dimension.
•The mesh algorithm can be generalized to a
hypercube and the operation is carried out in
d (= log p) steps, one in each dimension.

•Example of 8-node hypercube.

25
Broadcast and Reduction on a Hypercube
8-node hypercube.
[Figure: a three-dimensional hypercube with nodes 0 (000) through 7 (111); arrows labeled 1–3 show the broadcast steps from node 0.]
One-to-all broadcast on a three-dimensional hypercube. The binary representations of node labels are shown in parentheses.
26
Broadcast and Reduction on a Hypercube

•Each node is identified by a unique 3-bit binary
label.
•Communication starts along the highest
dimension, i.e., the dimension specified by the
most significant bit (MSB) of the node label.
Ex. In step 1, node 0 (000) sends the data to
node 4 (100) along this highest dimension.

27
Broadcast and Reduction on a Hypercube:
Example
● In the subsequent steps, communication proceeds
along successively lower dimensions.
● The source and destination nodes in the three
communication steps of the algorithm are
similar to the nodes in the broadcast algorithm
on a linear array.
● The hypercube broadcast does not suffer from
congestion.

28
Broadcast and Reduction on a Balanced Binary
Tree
•Consider a binary tree in which processors are
(logically) at the leaves and internal nodes are
routing nodes i.e. switching units.
•The communicating nodes have the same labels
as in the hypercube
•The communication pattern will be same as that
of hypercube algorithm.
•There will not be any congestion on any of the
communication link at any time.

29
Broadcast and Reduction on a Balanced Binary
Tree

[Figure: an eight-node balanced binary tree; processors 0–7 at the leaves, switching nodes internally; arrows labeled 2 and 3 show broadcast steps.]

30
Broadcast and Reduction on a Balanced
Binary Tree

[Figure: the same eight-node tree; arrows labeled 3 show the final broadcast steps to the leaves.]

One-to-all broadcast on an eight-node tree.

•Different paths pass through different numbers of
switching nodes, making the communication
characteristics different from those of the hypercube.
31
Broadcast and Reduction on a Balanced
Binary Tree
[Figure: the same eight-node tree; the arrow labeled 1 shows the first broadcast step.]

One-to-all broadcast on an eight-node tree.

•E.g. Assume that source processor is the root


of this tree. In the first step, the source sends
the data to the right child (assuming the
source is also the left child). The problem has
now been decomposed into two problems
with half the number of processors. 32
Broadcast and Reduction Algorithms
•All of the algorithms described so far are
adaptations of the same algorithmic template.

•We illustrate the algorithm for a hypercube,


but the algorithm can be adapted to other
architectures.

•The hypercube has 2^d nodes, and my_id is the
label of a node.
• X is the message to be broadcast, which
initially resides at the source node 0.
• X is the message to be broadcast, which
initially resides at the source node 0.
33
Broadcast and Reduction Algorithms
1. procedure GENERAL_ONE_TO_ALL_BC(d, my_id,
source, X)
2. begin
3. my_virtual_id := my_id XOR source;
4. mask := 2^d − 1;
5. for i := d − 1 downto 0 do /* Outer loop */
6. mask := mask XOR 2^i; /* Set bit i of mask to 0 */
7. if (my_virtual_id AND mask) = 0 then
8. if (my_virtual_id AND 2^i) = 0 then
9. virtual_dest := my_virtual_id XOR 2^i;

One-to-all broadcast of a message X initiated by source on a


d-dimensional p-node hypercube. d = log (p)
34
Broadcast and Reduction Algorithms
10. send X to (virtual_dest XOR source);
/* Convert virtual_dest to the label of the physical
destination */
11. else
12. virtual_source := my_virtual_id XOR 2^i;
13. receive X from (virtual_source XOR
source);
/* Convert virtual_source to the label of the physical
source */
14. endelse;
15. endfor;
16. end GENERAL_ONE_TO_ALL_BC
One-to-all broadcast of a message X initiated by source on a
d-dimensional p-node hypercube. d = log (p) 35
Broadcast and Reduction Algorithms

1. procedure GENERAL_ONE_TO_ALL_BC(d, my_id, source, X)


2. begin
3. my_virtual_id := my_id XOR source;
4. mask := 2^d − 1;
5. for i := d − 1 downto 0 do /* Outer loop */
6. mask := mask XOR 2^i; /* Set bit i of mask to 0 */
7. if (my_virtual_id AND mask) = 0 then
8. if (my_virtual_id AND 2^i) = 0 then
9. virtual_dest := my_virtual_id XOR 2^i;
10. send X to (virtual_dest XOR source);
/* Convert virtual_dest to the label of the physical destination */
11. else
12. virtual_source := my_virtual_id XOR 2^i;
13. receive X from (virtual_source XOR source);
/* Convert virtual_source to the label of the physical source */
14. endelse;
15. endfor;
16. end GENERAL_ONE_TO_ALL_BC

One-to-all broadcast of a message X initiated by source on a


d-dimensional p-node hypercube. d = log (p)
36
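The following hedged sketch (not part of the slides) simulates the recursive-doubling broadcast of GENERAL_ONE_TO_ALL_BC in plain Python; "sends" are modelled by copying into a dictionary of buffers, and the function name one_to_all_broadcast is hypothetical.

# Hypothetical simulation of one-to-all broadcast on a d-dimensional hypercube.
def one_to_all_broadcast(d, source, X):
    p = 2 ** d
    buffers = {source: X}                         # only the source holds X initially
    mask = p - 1
    for i in range(d - 1, -1, -1):                # outer loop over dimensions
        mask ^= 2 ** i                            # set bit i of mask to 0
        for node in range(p):
            v = node ^ source                     # virtual id relative to the source
            if v & mask == 0 and v & 2 ** i == 0:
                dest = (v ^ 2 ** i) ^ source      # physical label of the partner
                buffers[dest] = buffers[node]     # send X along dimension i
    return buffers

print(one_to_all_broadcast(3, source=2, X="msg"))   # all 8 nodes end up with "msg"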
Broadcast and Reduction Algorithms

1. procedure ALL_TO_ONE_REDUCE(d, my_id, m, X, sum)


2. begin
3. for j := 0 to m − 1 do sum[j] := X[j];
4. mask := 0;
5. for i := 0 to d − 1 do
/* Select nodes whose lower i bits are 0 */
6. if (my_id AND mask) = 0 then
7. if (my_id AND 2^i) ≠ 0 then
8. msg_destination := my_id XOR 2^i;
9. send sum to msg_destination;
10. else
11. msg_source := my_id XOR 2^i;
12. receive X from msg_source;
13. for j := 0 to m − 1 do
14. sum[j] := sum[j] + X[j];
15. endelse;
16. mask := mask XOR 2^i; /* Set bit i of mask to 1 */
17. endfor;
18. end ALL_TO_ONE_REDUCE

Single-node accumulation on a d-dimensional hypercube. Each node contributes a


message X containing m words, and node 0 is the destination. 37
Cost Analysis

•The one-to-all broadcast or all-to-one
reduction procedure involves log p
point-to-point simple message transfers.

•Each message transfer will have a time cost


of ts + twm.

•The total time is therefore given by:

T = (ts + twm) log p.


38
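In practice, one-to-all broadcast and all-to-one reduction are provided directly by MPI (MPI_Bcast and MPI_Reduce). A minimal mpi4py sketch follows; mpi4py and the script name are assumptions, not part of the slides.

# Run with: mpiexec -n 8 python bcast_reduce.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# One-to-all broadcast: node 0 sends the same message to every node.
data = comm.bcast("hello from node 0" if rank == 0 else None, root=0)

# All-to-one reduction: every node contributes its rank; node 0 gets the sum.
total = comm.reduce(rank, op=MPI.SUM, root=0)
if rank == 0:
    print(data, "| sum of ranks =", total)   # sum of ranks 0..7 = 28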
Questions based on one-to-all Broadcast
and all-to-one Reduction
• Explain one-to-all broadcast and all-to-one
reduction operation in brief.
• Explain with example one-to-all broadcast
and all-to-one reduction operation on ring.
• Explain how matrix-vector multiplication can
be performed using one-to-all broadcast and
all-to-one reduction operation.
• Explain one-to-all broadcast operation on
16-node mesh.
• Explain one-to-all broadcast operation on a
hypercube. 39
Questions based on one-to-all Broadcast
and all-to-one Reduction

• Write and explain algorithm of one-to-all


broadcast on a hypercube network.
• Explain one-to-all broadcast algorithm for
arbitrary source on d-dimensional
hypercube.
• Explain all-to-one reduction operation on
d-dimensional hypercube.

40
All-to-All Broadcast and Reduction

• Generalization of broadcast in which each


processor is the source as well as
destination.
• All-to-all broadcast operation is used in
matrix operations like matrix multiplication
and matrix-vector multiplication.
• In all-to-all broadcast operation, all p nodes
simultaneously broadcast the message.

41
All-to-All Broadcast and Reduction

• Note that a process sends the same m-word
message to all the processes, but it is not
compulsory that every process sends the
same message; different processes can
broadcast different messages.
• In all-to-all reduction, the reduction happens at every
node, i.e., every node is a destination of a reduction.

42
All-to-All Broadcast and Reduction

All-to-all broadcast and all-to-all reduction.

43
All-to-All Broadcast and Reduction on a Ring
• All-to-all Broadcast:
• Simplest approach: perform p one-to-all
broadcasts. This is not the most efficient way,
though.
• It would take p times as long as a single one-to-all broadcast.
• Communication links can be used more efficiently
by performing all p one-to-all broadcasts
simultaneously.
• In this way, all messages traversing the same path at
the same time are concatenated into a
single message.
• The algorithm terminates in p − 1 steps.
44
All-to-All Broadcast and Reduction on a Ring

• Linear Array and Ring:


• Each node first sends to one of its neighbors the
data it needs to broadcast.
• In subsequent steps, it forwards the data received
from one of its neighbors to its other neighbor.
• This process continues in subsequent steps so that
all the communication links can be kept busy.
• As the communication is performed circularly in a
single direction, each node receives all (p − 1)
pieces of information from all other nodes in (p − 1)
steps.

45
All-to-All Broadcast and Reduction on a Ring
[Figure: all-to-all broadcast on an eight-node ring, showing the 1st, 2nd, and 7th communication steps. An arrow label n (m) denotes message m being transferred during time step n; after 7 steps every node holds messages 0 through 7.]
All-to-all broadcast on an eight-node ring.
46
All-to-All Broadcast and Reduction on a Ring

[Figure: 1st communication step. An arrow label n (m) denotes message m transferred during time step n.]

47
All-to-all broadcast on an eight-node ring.
All-to-All Broadcast and Reduction on a Ring

2nd Communication Step

48
All-to-all broadcast on an eight-node ring.
All-to-All Broadcast and Reduction on a Ring

7th communication step

49
All-to-all broadcast on an eight-node ring.
All-to-All Broadcast and Reduction on a Ring

• Detailed Algorithm
• At every node, my_msg contains initial message to be
broadcast.
• At the end of the algorithm, all p messages are
collected at each node.

50
All-to-All Broadcast and Reduction on a Ring

1. procedure ALL_TO_ALL_BC_RING(my_id, my_msg, p, result)


2. begin
3. left := (my_id − 1) mod p;
4. right := (my_id + 1) mod p;
5. result := my_msg;
6. msg := result;
7. for i := 1 to p − 1 do
8. send msg to right;
9. receive msg from left;
10. result := result ∪ msg;
11. endfor;
12. end ALL_TO_ALL_BC_RING

All-to-all broadcast on a p-node ring.

All-to-all reduction is simply a dual of this operation and can


be performed in an identical fashion. 51
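A hedged Python simulation of ALL_TO_ALL_BC_RING (not from the slides), stepping all p nodes in lockstep; the function name mirrors the pseudocode but the implementation details are illustrative.

# Hypothetical simulation: in every step each node forwards to its right
# neighbour the message it received from its left neighbour.
def all_to_all_bc_ring(p, my_msgs):
    result = [{my_msgs[i]} for i in range(p)]    # result buffer at each node
    msg = list(my_msgs)                          # message currently held for forwarding
    for _ in range(p - 1):                       # p - 1 communication steps
        incoming = [msg[(i - 1) % p] for i in range(p)]   # receive from the left neighbour
        for i in range(p):
            result[i].add(incoming[i])           # result := result U msg
        msg = incoming                           # forward this message in the next step
    return result

print(all_to_all_bc_ring(8, [f"m{i}" for i in range(8)]))  # every node collects all 8 messages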
All-to-all Broadcast on a Mesh
• Performed in two phases –
• in the first phase, each row of the mesh performs an
all-to-all broadcast using the procedure for the linear array.

• In this phase, all nodes collect √p messages corresponding


to the √p nodes of their respective rows. Each node
consolidates this information into a single message of size
m√p.

• The second communication phase is a columnwise all-to-all


broadcast of the consolidated messages.

52
All-to-all Broadcast on a Mesh
[Figure: all-to-all broadcast on a 3 × 3 mesh in three panels: (a) initial data distribution, (b) data distribution after the rowwise broadcast, (c) final result of the all-to-all broadcast.]

All-to-all broadcast on a 3 × 3 mesh. The groups of nodes communicating


with each other in each phase are enclosed by dotted boundaries. By the end
of the second phase, all nodes get (0,1,2,3,4,5,6,7,8) (that is, a message from
each node).

• After completion of second phase each node obtains all p pieces of m-word
data i.e. all nodes will get (0,1,2,3,4,5,6,7,8) message from each node.

53
All-to-all Broadcast on a Mesh

[Figure: 3 × 3 mesh; node i initially holds message (i).]
(a) Initial data distribution
• After completion of second phase each node obtains all p pieces of m-word
data i.e. all nodes will get (0,1,2,3,4,5,6,7,8) message from each node.

54
All-to-all Broadcast on a Mesh
[Figure: after the rowwise broadcast each node of a row holds that row's three messages, e.g. (0,1,2) in the bottom row.]

(b) Data distribution after rowwise broadcast


• After completion of second phase each node obtains all p pieces of m-word
data i.e. all nodes will get (0,1,2,3,4,5,6,7,8) message from each node.

55
All-to-all Broadcast on a Mesh

(0,1,2,3,4,5,6,7,8) (0,1,2,3,4,5,6,7,8) (0,1,2,3,4,5,6,7,8)


6 7 8

(0,1,2,
3,4,5, 3 4 5
(0,1,2, (0,1,2,
6,7,8) 3,4,5, 3,4,5,
6,7,8) 6,7,8)

0 1 2

(0,1,2,3,4,5,6,7,8) (0,1,2,3,4,5,6,7,8) (0,1,2,3,4,5,6,7,8)

(c) Final result of all-to-all broadcast on Mesh

56
All-to-all Broadcast on a Mesh

• All-to-all broadcast on a 3 × 3 mesh. The groups


of nodes communicating with each other in each
phase are enclosed by dotted boundaries.
• By the end of the second phase, all nodes get
(0,1,2,3,4,5,6,7,8) (that is, a message from each
node).

• After completion of second phase each node


obtains all p pieces of m-word data i.e. all nodes
will get (0,1,2,3,4,5,6,7,8) message from each
node.

57
All-to-all Broadcast on a Mesh
1. procedure ALL_TO_ALL_BC_MESH(my_id, my_msg, p, result)
2. begin
/* Communication along rows */
3. left := my_id − (my_id mod √p) + (my_id − 1)mod√p;
4. right := my_id − (my_id mod √p) + (my_id + 1) mod √p;
5. result := my_msg;
6. msg := result;
7. for i := 1 to √p − 1 do
8. send msg to right;
9. receive msg from left;
10. result := result ∪ msg;
11. endfor;
/* Communication along columns */
12. up := (my_id − √p) mod p;
13. down := (my_id + √p) mod p;
14. msg := result;
15. for i := 1 to √p − 1 do
16. send msg to down;
17. receive msg from up;
18. result := result ∪ msg;
19. endfor;
20. end ALL_TO_ALL_BC_MESH

All-to-all broadcast on a square mesh of p nodes. 58


All-to-all broadcast on a Hypercube

• The all-to-all broadcast operation can be performed on a hypercube by
extending the mesh algorithm to log p dimensions.
• Communication is carried out along a different dimension in each step.
Figure (a) shows the first step, in which communication takes place within
each row.
• In figure (b), communication is carried out along the columns in the second step.
• Pairs of nodes exchange data in each step.
• The received message is concatenated with the current data in every
step.
• A hypercube with bidirectional communication links is assumed.

59
All-to-all broadcast on a Hypercube
[Figure: all-to-all broadcast on an eight-node hypercube in four panels:
(a) Initial distribution of messages; (b) Distribution before the second step;
(c) Distribution before the third step; (d) Final distribution of messages.]

All-to-all broadcast on an eight-node hypercube.

60
All-to-all broadcast on a Hypercube

1. procedure ALL_TO_ALL_BC_HCUBE(my_id, my_msg, d, result)


2. begin
3. result := my_msg;
4. for i := 0 to d − 1 do
5. partner := my_id XOR 2^i;
6. send result to partner;
7. receive msg from partner;
8. result := result ∪ msg;
9. endfor;
10. end ALL_TO_ALL_BC_HCUBE

All-to-all broadcast on a d-dimensional hypercube.

61
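All-to-all broadcast corresponds to MPI's allgather collective; a minimal mpi4py sketch follows (mpi4py and the script name are assumptions, not part of the slides).

# Run with: mpiexec -n 8 python allgather.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

everything = comm.allgather(f"msg from node {rank}")
print(rank, "now holds", len(everything), "messages")   # every node holds p messages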
All-to-all broadcast on a Hypercube

[Figure: each node i initially holds its own message (i).]

(a) Initial distribution of messages

All-to-all broadcast on an eight-node hypercube.

62
All-to-all broadcast on a Hypercube
[Figure: after the first step, nodes 0 and 1 hold (0,1), nodes 2 and 3 hold (2,3), nodes 4 and 5 hold (4,5), and nodes 6 and 7 hold (6,7).]

(b) Distribution before the second step


All-to-all broadcast on an eight-node hypercube.
63
All-to-all broadcast on a Hypercube

[Figure: before the third step, nodes 0–3 hold (0,1,2,3) and nodes 4–7 hold (4,5,6,7).]

(c) Distribution before the third step

All-to-all broadcast on an eight-node hypercube.


64
All-to-all broadcast on a Hypercube

[Figure: after the final step, every node holds (0,...,7).]

(d) Final distribution of messages


All-to-all broadcast on an eight-node hypercube.
65
All-to-all Broadcast

• Similar communication pattern to all-to-all broadcast,


except in the reverse order.
• On receiving a message, a node must combine it with
the local copy of the message that has the same
destination as the received message before forwarding
the combined message to the next neighbor.
• As per the algorithm the communication starts from
lowest dimension.
• Variable i is used to represent the dimension.
• According to line 4, in the first iteration the value of i is 0.

66
All-to-all Broadcast

• In each iteration, nodes communicate in pairs.


• In line 5, the label of receiver node will be calculated
by XOR operation.
• For example, if my_id = 000 then partner = 000 XOR 001 =
001. Hence, the partner differs from my_id in the ith least significant bit.
• After communication of the data, each node
concatenates the received data with its own data as
shown in line 8.
• This concatenated message is then transmitted in the
next iteration.

67
All-to-all reduction on a Hypercube

1. procedure ALL_TO_ALL_RED_HCUBE(my_id, msg, d, result)


2. begin
3. recloc := 0;
4. for i := d − 1 downto 0 do
5. partner := my_id XOR 2^i;
6. j := my_id AND 2^i;
7. k := (my_id XOR 2^i) AND 2^i;
8. senloc := recloc + k;
9. recloc := recloc + j;
10. send msg[senloc .. senloc + 2^i − 1] to partner;
11. receive temp[0 .. 2^i − 1] from partner;
12. for j := 0 to 2^i − 1 do
13. msg[recloc+j] := msg [recloc +j] + temp[j];
14. endfor;
15. endfor;
16. result := msg[my_id];
17. end ALL_TO_ALL_RED_HCUBE

All-to-all reduction on a d-dimensional hypercube.

68
All-to-all Reduction

• The order and direction of messages is reversed for


all-to-all reduction operation.
• The buffers are used to send and accumulate the
received messages in each iteration.
• Variable senloc is used to give the starting location of
the outgoing message.
• Variable recloc is used to give the location where the
incoming message is added in each iteration.

69
Cost Analysis

• On a ring, the time is given by:


T=(ts + twm)(p − 1).
• On a mesh, the time is given by:

T= 2ts(√p − 1) + twm(p − 1).


• On a hypercube, the time is given by:
T = ts log p + twm(p − 1).
70
Cost Analysis of all-to-all broadcast and all-to-all reduction operation

• On a ring, the time is given by:


(ts + twm)(p − 1).
• =>> All-to-all broadcast can be performed in (p − 1)
communication steps on a ring or a linear array of
nearest neighbors.
• Time taken in each step is (ts + twm) where ts is the
startup time of the message, tw is the per word
transfer time, and m is the size of message,
• Total time taken for the operation is :
(ts + twm)(p − 1).

71
Cost Analysis of all-to-all broadcast and all-to-all reduction operation

• On a mesh, the time is given by:


2ts(√p − 1) + twm(p − 1).
=>> All-to-all broadcast can be performed on a mesh in two phases.
The first phase of √p simultaneous all-to-all broadcasts along the rows, with messages of size m, completes in time
(ts + twm)(√p − 1);
the second phase repeats this along the columns with consolidated messages of size m√p, taking (ts + twm√p)(√p − 1).
For a two-dimensional square mesh of p nodes, the total
time for all-to-all broadcast is the sum of the time spent in
the two phases, which is:
2ts(√p − 1) + twm(p − 1)

72
Cost Analysis of all-to-all broadcast and all-to-all reduction operation

• On a hypercube: for a p-node hypercube, the time taken by a pair
of nodes to send and receive messages in the ith step is
ts + 2^(i−1) twm.

Total time taken for the operation is given by:

T = Σ(i=1 to log p) (ts + 2^(i−1) twm)
= ts log p + twm(p − 1).

73
All-to-all broadcast: Notes

•All of the algorithms presented here


are asymptotically optimal in message
size.

•That is, twm(p-1) is the term


associated with each architecture.

74
All-to-all broadcast: Notes

•It is not possible to port/map algorithms designed
for higher-dimensional networks (such
as a hypercube) onto a ring without
causing contention.
•For large messages, the ring algorithm is therefore
the better choice on such lower-connectivity networks.

75
All-to-all broadcast: Notes
[Figure: the communication pattern of the 8-node hypercube algorithm mapped onto an eight-node ring; multiple messages contend for a single channel.]

Contention for a channel when the hypercube is mapped onto a ring.


76
All-to-all broadcast: Questions

1. Explain all-to-all broadcast and reduction operation


2. Write algorithm and explain all-to-all broadcast on
eight node ring/hypercube.
3. Explain with example and algorithm all-to-all
broadcast on 3x3 mesh.
4. Explain all-to-all reduction on d-dimensional
hypercube.
5. Explain all-to-all broadcast on d-dimensional
hypercube.
6. Explain cost analysis of all-to-all broadcast operation

77
All-Reduce and Prefix-Sum Operations

•In all-reduce, each node starts with a


buffer of size m and the final results of
the operation are identical buffers of size
m on each node that are formed by
combining the original p buffers using an
associative operator.

78
All-Reduce and Prefix-Sum Operations

•It is identical to all-to-one reduction


followed by a one-to-all broadcast.
•This formulation is not the most efficient.
•Uses the pattern of all-to-all broadcast,
instead. The only difference is that message
size does not increase here.
•Time for this operation is
(ts + twm) log p.

79
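The all-reduce operation corresponds to MPI's allreduce collective; a minimal mpi4py sketch follows (mpi4py and the script name are assumptions, not part of the slides).

# Run with: mpiexec -n 8 python allreduce.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Every node contributes its rank; every node ends up with the same combined result.
total = comm.allreduce(rank, op=MPI.SUM)
print(rank, "sees total =", total)   # 0+1+...+7 = 28 on all 8 nodes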
All-Reduce and Prefix-Sum Operations

•It is different from all-to-all reduction, in


which p simultaneous all-to-one reductions
take place, each with a different destination
for the result.

80
The Prefix-Sum Operation

•Given p numbers n0, n1, . . . , np−1 (one on


each node), the problem is to compute the
sums sk = Σ(i=0 to k) ni for all k between 0 and
p − 1.
•Initially, nk resides on the node labeled k,
and at the end of the procedure, the same
node holds sk.

81
The Prefix-Sum Operation

s0 = n0
s1 = n0 + n1
s2 = n0 + n1 + n2
s3 = n0 + n1 + n2 + n3
s4 = n0 + n1 + n2 + n3 + n4
...
sk = n0 + n1 + n2 + .. + nk

82
The Prefix-Sum Operation

Initially, n0 resides on node 0, n1 resides on


node 1 and so on.

After completion of the prefix-sum operation, every node holds the sum of the values on all its predecessor nodes, including itself.

T = (ts+twm) log p

83
The Prefix-Sum Operation
[Figure: computing prefix sums on an eight-node hypercube, in four panels: (a) initial distribution of values, (b) distribution of sums before the second step, (c) distribution of sums before the third step, (d) final distribution of prefix sums.]

Computing prefix sums on an eight-node hypercube. At


each node, square brackets show the local prefix sum
accumulated in the result buffer and parentheses enclose
the contents of the outgoing message buffer for the next
step.
84
The Prefix-Sum Operation
[Figure: (a) node k starts with value (k) in its outgoing message buffer and [k] in its result buffer.]

(a) Initial distribution of values


At each node, square brackets show the local prefix sum accumulated in the result buffer
and parentheses enclose the contents of the outgoing message buffer for the next step.
85
The Prefix-Sum Operation
[Figure: (b) after the first exchange each pair holds the pairwise sum in its message buffer, e.g. nodes 0 and 1 hold (0+1); node 1's result buffer is [0+1] while node 0's is still [0].]

(b) distribution of sums before second step


At each node, square brackets show the local prefix sum accumulated in the result buffer
and parentheses enclose the contents of the outgoing message buffer for the next step.
86
The Prefix-Sum Operation
[Figure: (c) before the third step, e.g. nodes 0–3 hold the outgoing message (0+1+2+3), nodes 6 and 7 hold (4+5+6+7), and node 3's result buffer is [0+1+2+3].]

(c) distribution of sums before third step


At each node, square brackets show the local prefix sum accumulated in the result buffer
and parentheses enclose the contents of the outgoing message buffer for the next step.
87
The Prefix-Sum Operation
[Figure: (d) final distribution; node k's result buffer holds [0+1+...+k].]

(d) final distribution of prefix sums


At each node, square brackets show the local prefix sum accumulated in the result buffer
and parentheses enclose the contents of the outgoing message buffer for the next step.
88
The Prefix-Sum Operation or Scan Operation

•The operation can be implemented using


the all-to-all broadcast kernel.

•We must account for the fact that in


prefix sums the node with label k uses
information from only the k-node subset
whose labels are less than or equal to k.

89
The Prefix-Sum Operation or Scan Operation

•This is implemented using an additional


result buffer. The content of an incoming
message is added to the result buffer only if
the message comes from a node with a
smaller label than the recipient node.

•The contents of the outgoing message


(denoted by parentheses in the figure) are
updated with every incoming message.

90
The Prefix-Sum Operation or Scan Operation

• Prefix sum operation also uses the same


communication pattern which is used in all-to-all
broadcast and all reduce operations.
• The sum sk = Σ(i=0 to k) ni, for all k between 0 and
p − 1, is calculated for the p numbers n0, n1, ..., np−1
(one on each node).
• E.g. if the original sequence of numbers is
<3, 1, 4, 0, 2> then the sequence of prefix sum is <3,
4, 8, 8, 10>
i.e. 3+null=3, 3+1=4, 4+4=8, 8+0=8, 8+2=10

91
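The running-sum example above can be checked sequentially with Python's itertools.accumulate (an illustrative aside, not part of the slides).

from itertools import accumulate
print(list(accumulate([3, 1, 4, 0, 2])))   # [3, 4, 8, 8, 10]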
The Prefix-Sum Operation or Scan Operation

• At start the number nk will be present with node k.


After termination of algorithm, same node holds sum
sk.
• Instead of single number, each node will have a buffer
or a vector of size m and result will be sum of
elements of buffers.
• Each node contains an additional result buffer, denoted by
square brackets, to collect the correct prefix sum.
• After every communication step, the message from a
node with a smaller label than that of the recipient
node is added to the result buffer.

92
The Prefix-Sum Operation

1. procedure PREFIX_SUMS_HCUBE(my_id, my_number, d, result)


2. begin
3. result := my_number;
4. msg := result;
5. for i := 0 to d − 1 do
6. partner := my_id XOR 2^i;
7. send msg to partner;
8. receive number from partner;
9. msg := msg + number;
10. if (partner < my_id) then result := result + number;
11. endfor;
12. end PREFIX_SUMS_HCUBE

Prefix sums on a d-dimensional hypercube.


93
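A hedged lockstep simulation of PREFIX_SUMS_HCUBE in plain Python (not from the slides); the function name and data layout are illustrative.

# Hypothetical simulation of prefix sums on a hypercube with p = 2**d nodes.
def prefix_sums_hypercube(d, numbers):
    p = 2 ** d
    result = list(numbers)              # result buffer at each node
    msg = list(numbers)                 # outgoing message buffer at each node
    for i in range(d):
        partner = [node ^ 2 ** i for node in range(p)]
        incoming = [msg[partner[node]] for node in range(p)]   # exchange along dimension i
        for node in range(p):
            msg[node] += incoming[node]                        # msg := msg + number
            if partner[node] < node:                           # only data from smaller labels
                result[node] += incoming[node]                 # contributes to the prefix sum
    return result

print(prefix_sums_hypercube(3, [0, 1, 2, 3, 4, 5, 6, 7]))   # [0, 1, 3, 6, 10, 15, 21, 28]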
The Prefix-Sum Operation
[Figure: computing prefix sums on an eight-node hypercube, panels (a)–(d): initial values, sums before the second step, sums before the third step, and the final distribution of prefix sums.]

• At each node, square brackets show the local prefix sum accumulated in
the result buffer and parentheses enclose the contents of the outgoing
message buffer for the next step.
• These contents are updated with every incoming message.
• All the messages received by a node will not contribute to its final result,
some of the messages it receives may be redundant.
94
Questions on Prefix sum operations

•Explain the difference between


all-to-all reduction and all reduce
operations.

•Explain with example prefix sum


operations.

95
Scatter and Gather

• In the scatter operation, a single node sends a


unique message of size m to every other node.
• This is called a one-to-all personalized
communication.

• In the gather operation, a single node collects a


unique message from each node.
• It is the dual of scatter operation.

96
Scatter and Gather

• While the scatter operation is fundamentally


different from broadcast, the algorithmic
structure is similar, except for differences in
message sizes (messages get smaller in scatter
and stay constant in broadcast).

• The gather operation is exactly the inverse of


the scatter operation.

97
Gather and Scatter Operations

Scatter and gather operations.

• Consider the example of 8-node hypercube


• The communication pattern of the scatter operation is
identical to that of one-to-all broadcast; the only difference
is in the size and contents of the messages.
• As in figure above, initially source node 0 will have all
the messages.

98
Gather and Scatter Operations

Scatter and gather operations.

99
Example of the Scatter Operation

In the first communication step, node 0 transfers half of the messages to one of its neighbours (node 4).
[Figure: node 0 initially holds (0,1,2,3,4,5,6,7); all other nodes are empty.]

(a) Initial distribution of messages


The scatter operation on an eight-node hypercube.
100
Example of the Scatter Operation

In the next step, every process that has data transfers half of it to a neighbour that has not yet received any data.
[Figure: node 0 holds (0,1,2,3) and node 4 holds (4,5,6,7).]

(b) Distribution before the second step


The scatter operation on an eight-node hypercube.
101
Example of the Scatter Operation
[Figure: nodes 0, 2, 4, and 6 hold (0,1), (2,3), (4,5), and (6,7) respectively.]

(c) Distribution before the third step


The scatter operation on an eight-node hypercube.
102
Example of the Scatter Operation

This process involves log p communication steps, one for each of the log p dimensions of the hypercube.
[Figure: every node i now holds its own message (i).]

(d) Final distribution of messages


The scatter operation on an eight-node hypercube.
103
Gather Operations

Scatter and gather operations.


• The gather operation is reverse of scatter operation.
• Every node will have m word message.
• In the first step, each odd numbered node sends its buffer to an even
numbered neighbor behind it.
• The neighbor node concatenates the received message with its own
buffer.
• In the next communication step, only even-numbered nodes participate
in communication.
• The nodes whose labels are multiples of four gather more data, doubling
the size of their data.
• This process continues until node 0 has gathered all the data.
104
Example of the Gather Operation
[Figure: each node i holds its own distinct message (i).]

(a) Initial distinct messages

The gather operation on an eight-node hypercube.


105
Example of the Gather Operation

[Figure: nodes 0, 2, 4, and 6 hold (0,1), (2,3), (4,5), and (6,7) respectively.]

(b) Collection before the second step

The gather operation on an eight-node hypercube.


106
Example of the Gather Operation

[Figure: node 0 holds (0,1,2,3) and node 4 holds (4,5,6,7).]

(c) Collection before the third step


The gather operation on an eight-node hypercube.
107
Example of the Gather Operation

[Figure: node 0 has gathered (0,1,2,3,4,5,6,7).]

(d) Final Collection of messages


The gather operation on an eight-node hypercube.
108
Cost of Scatter and Gather

• There are log p steps, in each step, the machine size halves and the
data size halves.
• We have the time for this operation to be:
T = ts log p + twm(p − 1).
• This time is same for a linear array as well as a 2-D mesh.
• In scatter operation, at least m(p-1) data must be transmitted out of
the source node,
• and in gather operation at least m(p-1) data must be received by the
destination node.
• Therefore, twm(p-1) time, is the lower bound on the communication
in scatter and gather operations.

Topic Questions
• Explain Scatter and Gather operations with example.
109
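Scatter and gather correspond directly to MPI's scatter and gather collectives; a minimal mpi4py sketch follows (mpi4py and the script name are assumptions, not part of the slides).

# Run with: mpiexec -n 8 python scatter_gather.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Scatter: node 0 sends a distinct piece to every node (one-to-all personalized).
pieces = [f"piece {i}" for i in range(size)] if rank == 0 else None
mine = comm.scatter(pieces, root=0)

# Gather: node 0 collects one distinct piece from every node (the dual of scatter).
collected = comm.gather(mine.upper(), root=0)
if rank == 0:
    print(collected)    # ['PIECE 0', ..., 'PIECE 7']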
All-to-All Personalized Communication

• All-to-all personalized communication operation can be


applied in variety of parallel algorithms such as Fast Fourier
Transform, matrix transpose, sample sort, and some parallel
database join operations
• Each node has a distinct message of size m for every other
node.
• This is opposite of all-to-all broadcast, in which each node
sends the same message to all other nodes.
• All-to-all personalized communication is also known as
total exchange.

110
All-to-All Personalized Communication

[Figure: before the operation, node i holds the messages M(i,0), M(i,1), ..., M(i,p−1); after all-to-all personalized communication, node i holds M(0,i), M(1,i), ..., M(p−1,i).]

All-to-all personalized communication.

111
All-to-All Personalized Communication: Example

Consider the problem of transposing a matrix.

• Each processor contains one full row of the matrix.


• The transpose operation in this case is identical to an
all-to-all personalized communication operation.
• Let A be an n × n matrix; the transpose of A is AT.
• AT has the same size as A, with AT[i, j] = A[j, i] for
0 ≤ i, j < n.

112
All-to-All Personalized Communication: Example

Consider the problem of transposing a matrix.

• Considering 1D row major partitioning of array, n x n


matrix can be mapped onto n processors such that each
processor contains one full row of the matrix.
• Each processor sends a distinct element of the matrix
to every other processor as
all-to-all personalized communication.
• For p processes, where p ≤ n, each process will have
n/p rows (n^2/p elements of the matrix).
• To compute the transpose, an all-to-all personalized
communication of matrix blocks of size (n/p) × (n/p)
is performed.

113
All-to-All Personalized Communication: Example

[Figure: a 4 × 4 matrix distributed row-wise over processes P0–P3. After the transpose, P0 holds [0,0],[1,0],[2,0],[3,0]; P1 holds [0,1],[1,1],[2,1],[3,1]; ...; P3 holds [0,3],[1,3],[2,3],[3,3].]

All-to-all personalized communication in transposing a 4 × 4 matrix using four processes.

• Processor Pi will contain the elements of the matrix with indices [i,0],
[i,1], .., [i,n-1]
• In transpose AT, P0 will have element [i,0], P1 will have element [i,1]
and so on.
• Initially processor Pi will have element [i,j] and after transpose, it
moves to Pj
• Figure above shows the example of 4 x 4 matrix mapped onto four
processes using one dimensional row wise partitioning.

114
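The row-wise matrix transpose described above maps onto MPI's alltoall collective; a minimal mpi4py sketch with one row per process follows (mpi4py and the script name are assumptions, not part of the slides).

# Run with: mpiexec -n 4 python transpose.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
p = comm.Get_size()

row = [10 * rank + j for j in range(p)]   # process i owns row i of a p x p matrix
# Element [i, j] must go to process j; alltoall delivers one distinct element per process.
transposed_row = comm.alltoall(row)       # process i now owns row i of the transpose
print(rank, transposed_row)               # e.g. rank 1 prints [1, 11, 21, 31]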
All-to-All Personalized Communication on a Ring

•Each node sends (p − 1) pieces of data of


size m as one consolidated message to
one of its neighbors.
•These pieces are identified by label {x,y},
where x is the label of the node that
originally owned the message, and y is
the label of the node that is the final
destination of the message.

115
All-to-All Personalized Communication on a Ring

•The label ({x1, y1}, {x2, y2}, . . . , {xn,


yn}) indicates that a message is formed
by concatenating n individual messages.
For example, ({0,1},{1,2},...,{4,5}).
•Each node extracts the information meant
for it from the message of size m(p − 1)
received, and forwards the remaining (p
− 2) data pieces of size m each to the
next node.

116
All-to-All Personalized Communication on a Ring

•The algorithm continues for (p − 1) steps.
In (p − 1) steps, every node receives
information from all the other nodes in the
group.
•The size of the forwarded message reduces by m at
each step.
•In each step, each node retains one m-word packet
originating from a different node.

117
All-to-All Personalized Communication on a Ring

•All messages are sent in the same


direction.
•To reduce the communication cost due
to tw by a factor of two, half of the
messages are sent in one direction and
the remaining half in the reverse direction,
so that the communication channels are fully used.

118
All-to-All Personalized Communication on a Ring

[Figure: all-to-all personalized communication on a six-node ring.]
All-to-all personalized communication on a six-node ring. The label of each message is of the form {x, y}, where x is the label of the node that originally owned the message, and y is the label of the node that is the final destination of the message. The label ({x1, y1}, {x2, y2}, . . . , {xn, yn}) indicates that a message is formed by concatenating n individual messages.

119
All-to-All Personalized Communication on a Ring

Communication Step 1
120
All-to-All Personalized Communication on a Ring

Communication Step 2
121
All-to-All Personalized Communication on a Ring

Communication Step 3
122
All-to-All Personalized Communication on a Ring

Communication Step 4
123
All-to-All Personalized Communication on a Ring

Communication Step 5
124
All-to-All Personalized Communication on a Ring: Cost

• All-to-all personalized communication on ring requires p − 1


communication steps in all.
• The size of message transfer in ith step is m(p − i).
• Therefore, total time taken by this operation is given by:

p−1
T = Σ (t s + tw m(p −
i=1 i))
p−1
= t s(p −w 1) + Σ it
i=1 m

= (ts + twmp/2)(p − 1).

• The tw term in this equation can be reduced by a factor of 2 by


communicating messages in both directions.
125
All-to-All Personalized Communication on a Mesh

•For all-to-all personalized


communication on mesh √p x √p, each
node first groups its p messages
according to the columns of their
destination nodes.
•Consider the example of 3 x 3 mesh.
•Each node have 9 m-word messages one
for each node.

126
All-to-All Personalized Communication on a Mesh

•For each node, three groups of three


messages are formed.
•The first group contains the messages
for destination nodes labelled 0, 3, and
6; the second group contains the
messages for nodes 1, 4, and 7; and the
last group of messages for nodes
labelled 2, 5, and 8.

127
All-to-All Personalized Communication on a Mesh

•After grouping, each node holds √p clusters
of messages, each of size m√p.
•Each cluster contains the information destined
for all the nodes of one column.
•Now in the first phase, all-to-all
personalized communication is performed
in each row.

128
All-to-All Personalized Communication on a Mesh

•After first phase, the messages present


with each node are sorted again according
to the rows of their destination nodes.
•In the second phase, similar
communication is carried out.
•After completion of the second phase, node i
has the messages ({0,i}, ..., {8,i}),
where 0 ≤ i ≤ 8. So each node has
a message from every other node.

129
All-to-All Personalized Communication on a Mesh

The label of each message is of the form


{x, y}, where x is the label of the node that
originally owned the message, and y is the label of
the node that is the final destination of the message.
The distribution of messages at the beginning of each
phase of all-to-all personalized communication on a
3 × 3 mesh. At the end of the second phase, node i
has messages ({0,i}, . . . ,{8,i}), where 0 ≤ i ≤ 8.
The groups of nodes communicating together in
each phase are enclosed in dotted boundaries.

130
All-to-All Personalized Communication on a Mesh

(a) Data distribution at the beginning of first phase 131


All-to-All Personalized Communication on a Mesh

(b) Data distribution at the beginning of second phase 132


All-to-All Personalized Communication on a Mesh

node 0: ({0,0},{1,0},{2,0},{3,0},{4,0},{5,0},{6,0},{7,0},{8,0})
node 1: ({0,1},{1,1},{2,1},{3,1},{4,1},{5,1},{6,1},{7,1},{8,1})
node 2: ({0,2},{1,2},{2,2},{3,2},{4,2},{5,2},{6,2},{7,2},{8,2})
node 3: ({0,3},{1,3},{2,3},{3,3},{4,3},{5,3},{6,3},{7,3},{8,3})
node 4: ({0,4},{1,4},{2,4},{3,4},{4,4},{5,4},{6,4},{7,4},{8,4})
node 5: ({0,5},{1,5},{2,5},{3,5},{4,5},{5,5},{6,5},{7,5},{8,5})
node 6: ({0,6},{1,6},{2,6},{3,6},{4,6},{5,6},{6,6},{7,6},{8,6})
node 7: ({0,7},{1,7},{2,7},{3,7},{4,7},{5,7},{6,7},{7,7},{8,7})
node 8: ({0,8},{1,8},{2,8},{3,8},{4,8},{5,8},{6,8},{7,8},{8,8})
133
(c) Final Data distribution after second phase
All-to-All Personalized Communication on a Mesh: Cost

• The time for the first phase is identical to that of all-to-all
personalized communication on a ring with √p nodes carrying
messages of size m√p, i.e.,
(ts + twmp/2)(√p − 1).

• The time for the second phase is identical to that of the first phase.
Therefore, the total time is twice this time, i.e.,

T = (2ts + twmp)(√p − 1).

134
All-to-All Personalized Communication on a Mesh: Cost

• Note that the time required for sorting the messages by
row and column is not included in the calculation of T.
• It is assumed that the data is already arranged for the first
communication phase; before the second communication
phase, mp words of data must be rearranged at each node.
• Let tr be the time to read and write a single word of data in a
node's local memory.
• Then the total time spent in data rearrangement by a node
during the complete procedure is tr x m x p.
• This time is very small compared to the communication
time T above.

135
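For a feel of the relative magnitudes, the following illustrative arithmetic (all parameter values assumed) compares the communication time T with the rearrangement time tr x m x p.

# Illustrative arithmetic (all values assumed) comparing the communication
# time T = (2*ts + tw*m*p)*(sqrt(p) - 1) with the rearrangement time tr*m*p.
ts, tw, tr, m, p = 50.0, 1.0, 0.05, 4, 64

T_comm  = (2 * ts + tw * m * p) * (p ** 0.5 - 1)
T_rearr = tr * m * p
print(T_comm, T_rearr)      # 2492.0 vs 12.8: rearrangement is comparatively cheap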
All-to-All Personalized Communication on a
Hypercube

• Generalize the mesh algorithm to log p steps.
• At any stage of all-to-all personalized
communication on a p-node hypercube, every node
holds p packets of size m each.
• While communicating along a particular dimension,
every node sends p/2 of these packets (consolidated
as one message).
• A node must rearrange its messages locally before
each of the log p communication steps takes place.
• In each step, pairs of nodes exchange data along a
different dimension of the hypercube.

136
All-to-All Personalized Communication on a
Hypercube
node 0: ({0,0} ... {0,7})    node 1: ({1,0} ... {1,7})
node 2: ({2,0} ... {2,7})    node 3: ({3,0} ... {3,7})
node 4: ({4,0} ... {4,7})    node 5: ({5,0} ... {5,7})
node 6: ({6,0} ... {6,7})    node 7: ({7,0} ... {7,7})

(a) Initial distribution of messages 137


All-to-All Personalized Communication on a
Hypercube
node 0: ({0,0},{0,2},{0,4},{0,6},{1,0},{1,2},{1,4},{1,6})
node 1: ({1,1},{1,3},{1,5},{1,7},{0,1},{0,3},{0,5},{0,7})
node 2: ({2,0},{2,2},{2,4},{2,6},{3,0},{3,2},{3,4},{3,6})
node 5: ({4,1},{4,3},{4,5},{4,7},{5,1},{5,3},{5,5},{5,7})
node 6: ({6,0},{6,2},{6,4},{6,6},{7,0},{7,2},{7,4},{7,6})
node 7: ({6,1},{6,3},{6,5},{6,7},{7,1},{7,3},{7,5},{7,7})

(b) Distribution before the second step 138


All-to-All Personalized Communication on a
Hypercube
node 0: ({0,0},{0,4},{2,0},{2,4},{1,0},{1,4},{3,0},{3,4})
node 1: ({1,1},{1,5},{3,1},{3,5},{0,1},{0,5},{2,1},{2,5})
node 2: ({0,2},{2,2},{0,6},{2,6},{1,2},{3,2},{1,6},{3,6})
node 5: ({4,1},{6,1},{4,5},{6,5},{5,1},{7,1},{5,5},{7,5})
node 6: ({6,2},{6,6},{4,2},{4,6},{7,2},{7,6},{5,2},{5,6})
node 7: ({7,3},{7,7},{5,3},{5,7},{6,3},{6,7},{4,3},{4,7})
139
(c) Distribution before the third step
All-to-All Personalized Communication on a
Hypercube
node 0: ({0,0} ... {7,0})    node 1: ({0,1} ... {7,1})
node 2: ({0,2} ... {7,2})    node 3: ({0,3} ... {7,3})
node 4: ({0,4} ... {7,4})    node 5: ({0,5} ... {7,5})
node 6: ({0,6} ... {7,6})    node 7: ({0,7} ... {7,7})

(d) Final distribution of messages 140


All-to-All Personalized Communication on a
Hypercube
(a) Initial distribution of messages      (b) Distribution before the second step
(c) Distribution before the third step    (d) Final distribution of messages

An all-to-all personalized communication algorithm on a
three-dimensional hypercube.
141
All-to-All Personalized Communication on a
Hypercube: Cost

• We have log p iterations, and mp/2 words are
communicated in each iteration. Therefore, the cost is:

T = (ts + twmp/2) log p.

• This is not optimal: the same words are transmitted several
times as they are forwarded through intermediate nodes, so
the tw term is larger than necessary.

142
All-to-All Personalized Communication on a
Hypercube: Optimal Algorithm

• Each node simply performs p − 1 communication


steps, exchanging m words of data with a different
node in every step.

• A node must choose its communication partner in each


step so that the hypercube links do not suffer
congestion.

143
All-to-All Personalized Communication on a
Hypercube: Optimal Algorithm

• In the jth communication step, node i exchanges data


with node (i XOR j).

• In this schedule, all paths in every communication step


are congestion-free, and none of the bidirectional links
carry more than one message in the same direction.

144
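The pairing rule can be verified directly. The Python sketch below (hypercube size assumed) checks that the schedule partner = i XOR j is symmetric within each step and pairs every node with every other node exactly once over the p − 1 steps.

# Check the XOR schedule: in step j, node i exchanges data with node i XOR j.
p = 8                                   # number of hypercube nodes (power of two assumed)
met = {i: set() for i in range(p)}

for j in range(1, p):                   # p - 1 communication steps
    for i in range(p):
        partner = i ^ j
        assert (partner ^ j) == i       # pairing is symmetric within the step
        met[i].add(partner)

for i in range(p):
    assert met[i] == set(range(p)) - {i}   # every node met every other node exactly once
print("XOR schedule pairs each node with every other node exactly once")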
All-to-All Personalized Communication on a
Hypercube: Optimal Algorithm
Seven steps (a)-(g) in all-to-all personalized
communication on an eight-node hypercube.
145
All-to-All Personalized Communication on a
Hypercube: Optimal Algorithm

procedure ALL_TO_ALL_PERSONAL(d, my_id)
begin
   for i := 1 to 2^d − 1 do
   begin
      partner := my_id XOR i;
      send M_{my_id,partner} to partner;
      receive M_{partner,my_id} from partner;
   endfor;
end ALL_TO_ALL_PERSONAL

A procedure to perform all-to-all personalized communication on a
d-dimensional hypercube. The message M_{i,j} initially resides on node i and is
destined for node j.
146
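A minimal executable Python sketch of the same procedure is shown below (not part of the slides); the send and receive operations are modelled with a shared dictionary rather than a real message-passing library.

# Model of the pairwise-exchange procedure on a d-dimensional hypercube.
# M[(i, j)] is the message that initially resides on node i and is destined for node j.
d = 3                                   # hypercube dimension, p = 2**d nodes (assumed)
p = 2 ** d
M = {(i, j): f"msg {i}->{j}" for i in range(p) for j in range(p)}

inbox = {j: {} for j in range(p)}       # what each node has received so far
for i in range(1, 2 ** d):              # for i := 1 to 2^d - 1
    for my_id in range(p):
        partner = my_id ^ i             # partner := my_id XOR i
        # "send M[my_id, partner] to partner" modelled as a dictionary write
        inbox[partner][my_id] = M[(my_id, partner)]

# every node has received one personalized message from every other node
for j in range(p):
    assert sorted(inbox[j]) == [i for i in range(p) if i != j]
print("all-to-all personalized exchange complete on a", d, "dimensional hypercube")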
All-to-All Personalized Communication on a
Hypercube: Cost Analysis of Optimal Algorithm

• There are p − 1 steps, and each step involves a congestion-free
transfer of m words.
• We have:

T = (ts + twm)(p − 1).

• This is asymptotically optimal in message size.

147
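The two hypercube algorithms can be compared numerically. In the illustrative sketch below (ts, tw and a large message size m are assumed), the (ts + twm)(p − 1) schedule wins once the twm term dominates, because each word is transmitted only once instead of being forwarded through intermediate nodes.

# Compare the log p-step algorithm with the p - 1 step pairwise-exchange algorithm.
import math

ts, tw, m = 50.0, 1.0, 1000   # assumed startup time, per-word time, words per message
for p in (8, 64, 1024):
    T_logp = (ts + tw * m * p / 2) * math.log2(p)   # (ts + tw*m*p/2) log p
    T_pair = (ts + tw * m) * (p - 1)                # (ts + tw*m)(p - 1)
    print(p, T_logp, T_pair)
# when the tw*m term dominates, the pairwise schedule is cheaper because it
# does not forward data through intermediate nodes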
Circular Shift
•Circular shift is used in some matrix
computations and in string and image
pattern matching.
•It is a member of a broader class of global
communication operations called permutations.
•In a permutation, every node sends a message
of m words to a unique destination node.
•In a circular q-shift, node i sends its data to
node (i + q) mod p in a group of p nodes,
where 0 < q < p.

148
Circular Shift on a Mesh

•Mesh algorithms for circular shift can be
derived from the ring algorithm.
•A wraparound mesh is assumed, i.e. in a row of
4 nodes 0, 1, 2, 3, node 3 can communicate
directly and send data to node 0.
•On a ring, a circular q-shift can be performed by
min(q, p − q) neighbor-to-neighbor
communications in one direction, where p is the
number of nodes and q is the number of shifts
to be performed.
149
Circular Shift on a Mesh

•In a p-node square wraparound mesh with
row-major node labels, a circular q-shift is
performed in two stages.
•Example: q = 5 shifts, p = 16 (4 x 4 mesh).
•In the first stage, the data is shifted
simultaneously by (q mod √p) steps along all the
rows, i.e. (5 mod √16) = 1 step in our example.
•In the second stage, it is shifted by ⌊q/√p⌋ steps
along the columns, i.e. ⌊5/4⌋ = 1 step.

150
Circular Shift on a Mesh

•Due to the wraparound connection used during the
circular row shifts, some data moves from the
highest to the lowest labeled node of its row. For
example, data at node 3 is shifted to node 0 in
the first row.
•To compensate for the distance √p that these
packets lost while traversing the backward edge in
their respective rows, they must be shifted by one
additional step along the columns.

151
Circular Shift on a Mesh

•In the example, the row shift is followed by one
compensating column step for the wrapped packets,
and then by the column shift.

•The total time for any circular q-shift on a
p-node mesh using packets of size m is:
T = (ts + twm)(√p + 1).

152
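The staged procedure can be checked against a direct shift. The Python sketch below (a 4 x 4 wraparound mesh with row-major labels and q = 5 are assumed) performs the row stage, one compensating column step for wrapped packets, and the column stage, and asserts that every packet lands on node (i + q) mod p.

# Staged circular q-shift on a sqrt(p) x sqrt(p) wraparound mesh, checked
# against the direct destination (i + q) mod p.
p, q = 16, 5                           # assumed example values
side = 4                               # sqrt(p)

for i in range(p):
    r, c = divmod(i, side)
    # stage 1: shift along the row, with wraparound inside the row
    c_new = (c + q % side) % side
    wrapped = c + q % side >= side     # the packet crossed the end of its row
    node = r * side + c_new
    # one compensating column step for wrapped packets, then the column-shift stage
    col_steps = (1 if wrapped else 0) + q // side
    r_new = (node // side + col_steps) % side
    node = r_new * side + node % side
    assert node == (i + q) % p         # same destination as a direct q-shift
print("staged mesh shift delivers every packet to node (i + q) mod p")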
Circular Shift on a Mesh

nodes 12 13 14 15 hold (12) (13) (14) (15)
nodes  8  9 10 11 hold  (8)  (9) (10) (11)
nodes  4  5  6  7 hold  (4)  (5)  (6)  (7)
nodes  0  1  2  3 hold  (0)  (1)  (2)  (3)

(a) Initial data distribution and the first communication step


The communication steps in a circular 5-shift on a 4 × 4 mesh. 153
Circular Shift on a Mesh
nodes 12 13 14 15 hold (15) (12) (13) (14)
nodes  8  9 10 11 hold (11)  (8)  (9) (10)
nodes  4  5  6  7 hold  (7)  (4)  (5)  (6)
nodes  0  1  2  3 hold  (3)  (0)  (1)  (2)
Data from node 3 was supposed to shift to node 4, but due to the
wraparound row shift it reached node 0; the compensating column
step now moves it to node 4.

(b) Step to compensate for backward row shifts


The communication steps in a circular 5-shift on a 4 × 4 mesh. 154
Circular Shift on a Mesh
nodes 12 13 14 15 hold (11) (12) (13) (14)
nodes  8  9 10 11 hold  (7)  (8)  (9) (10)
nodes  4  5  6  7 hold  (3)  (4)  (5)  (6)
nodes  0  1  2  3 hold (15)  (0)  (1)  (2)
The shift is carried out every time on a unique node.

(c) Column shifts in the third communication step


The communication steps in a circular 5-shift on a 4 × 4 mesh. 155
Circular Shift on a Mesh
nodes 12 13 14 15 hold  (7)  (8)  (9) (10)
nodes  8  9 10 11 hold  (3)  (4)  (5)  (6)
nodes  4  5  6  7 hold (15)  (0)  (1)  (2)
nodes  0  1  2  3 hold (11) (12) (13) (14)

(d) Final distribution of the data


The communication steps in a circular 5-shift on a 4 × 4 mesh. 156
Circular Shift on a Hypercube
• For the shift operation on a hypercube, a linear
array with 2^d nodes is mapped onto the
d-dimensional hypercube.
• Node i of the linear array is assigned to
node j of the hypercube, where j is the d-bit
binary Reflected Gray Code (RGC) of i.
• In the eight-node hypercube shown in the figure,
any two nodes at distance 2^i in the linear array
are separated by exactly two hypercube links.
• The case i = 0 is the exception: such nodes are
directly connected, so only one hypercube link
separates them. 157
Circular Shift on a Hypercube
• For a q-shift operation, q is expanded as a sum
of distinct powers of 2. For example, the number 5
can be expanded as 2^2 + 2^0.
• Note that the number of terms in the sum equals the
number of 1's in the binary representation of q. E.g.
for 5 (101), two terms appear in the sum,
corresponding to bit position 2 and bit
position 0, i.e. 2^2 + 2^0.
• A circular q-shift on a hypercube is performed
in s phases, where s is the number of distinct
powers of 2 in the expansion of q.
158
Circular Shift on a Hypercube
• For example, a 5-shift operation is performed
as a 4-shift (2^2) followed by a 1-shift (2^0).
• Each phase takes two communication steps;
only a 1-shift phase takes a single step. For
example, the first phase of the 5-shift (the 4-shift)
consists of two steps and the second phase (the
1-shift) consists of one step.
• The total number of steps for any q in a p-node
hypercube is at most 2 log p − 1.

159
Circular Shift on a Hypercube
• The time for this is upper bounded by:
T = (ts + twm)(2 log p − 1).
• If E-cube routing is used, this time can be
reduced to
T = ts + twm.

160
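Following the counting rule above, the short helper below (illustrative only) computes the number of communication steps for any q: two steps per 2^i term of q, except the 2^0 term, which needs only one.

# Count communication steps for a circular q-shift on a p-node hypercube.
import math

def shift_steps(q):
    # set bit positions of q: each contributes 2 steps, except bit 0 which contributes 1
    bits = [i for i in range(q.bit_length()) if (q >> i) & 1]
    return sum(1 if i == 0 else 2 for i in bits)

p = 8
print([shift_steps(q) for q in range(1, p)])                                  # [1, 2, 3, 2, 3, 4, 5]
print(max(shift_steps(q) for q in range(1, p)), 2 * int(math.log2(p)) - 1)    # 5 5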
Circular Shift on a Hypercube
First communication step of the 4-shift.   Second communication step of the 4-shift.
(a) The first phase (a 4-shift)
The mapping of an eight-node linear array onto a three-dimensional hypercube to
perform a circular 5-shift as a combination of a 4-shift and a 1-shift.
161
Circular Shift on a Hypercube

(b) The second phase (a 1-shift)
The mapping of an eight-node linear array onto a three-dimensional hypercube to
perform a circular 5-shift as a combination of a 4-shift and a 1-shift.
162
Circular Shift on a Hypercube
(c) Final data distribution after the 5-shift
The mapping of an eight-node linear array onto a three-dimensional hypercube to
perform a circular 5-shift as a combination of a 4-shift and a 1-shift.
163
Circular Shift on a Hypercube


(a) 1-shift (b) 2-shift

Circular q-shifts on an 8-node hypercube for 1 ≤ q < 8. 164


Circular Shift on a Hypercube

(c) 3-shift (d) 4-shift

Circular q-shifts on an 8-node hypercube for 1 ≤ q < 8. 165


Circular Shift on a Hypercube

(e) 5-shift (f) 6-shift

Circular q-shifts on an 8-node hypercube for 1 ≤ q < 8. 166


Circular Shift on a Hypercube

(g) 7-shift

Circular q-shifts on an 8-node hypercube for 1 ≤ q < 8. 167


Circular Shift on a Hypercube

Circular q-shifts on an 8-node hypercube for 1 ≤ q < 8. 168


Improving Performance of Operations

• Splitting and routing messages into parts: If the message can be split into p
parts, a one-to-all broadcast can be implemented as a scatter operation followed by an
all-to-all broadcast operation. The time for this is:

T = 2 × (ts log p + tw(p − 1)m/p)
  ≈ 2 × (ts log p + twm).          (10)

• All-to-one reduction can be performed as an all-to-all reduction (the dual of all-to-all
broadcast) followed by a gather operation (the dual of scatter).

169
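The benefit of splitting can be seen by comparing equation (10) with the standard hypercube one-to-all broadcast time (ts + twm) log p. The arithmetic below uses assumed values of ts, tw and a long message m.

# Compare a plain one-to-all broadcast with the scatter + all-to-all broadcast version.
import math

ts, tw, m = 50.0, 1.0, 10000     # assumed startup time, per-word time, message words
for p in (8, 64, 1024):
    plain = (ts + tw * m) * math.log2(p)                     # (ts + tw*m) log p
    split = 2 * (ts * math.log2(p) + tw * (p - 1) * m / p)   # equation (10)
    print(p, plain, split)
# splitting pays off for long messages: the tw term no longer grows with log p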
Improving Performance of Operations

• Since an all-reduce operation is semantically equivalent to an all-to-one reduction


followed by a one-to-all broadcast, the asymptotically optimal algorithms for these two
operations can be used to construct a similar algorithm for the all-reduce operation.
• The intervening gather and scatter operations cancel each other. Therefore, an all-reduce
operation requires an all-to-all reduction and an all-to-all broadcast.

170
Improving Performance of Operations

• The communication algorithms discussed so far are based on two assumptions:
• 1) the original message cannot be divided into smaller parts;
• 2) each node uses a single port for sending and receiving data.

• We can analyse the effect of relaxing these two assumptions:

Splitting and routing messages in parts:

171
University Questions on Unit 3
• August 2018 (Insem)
• 1. Explain Broadcast and Reduction example for multiplying
matrix with a vector.(6)
• 2. Explain the concept of scatter and gather (4)
• 3. Compare the one-to-all broadcast operation for Ring,
Mesh and Hypercube topologies (6)
• 4. Explain the prefix-sum operation for an eight-node
hypercube (4)

• Nov-Dec 2018 (Endsem)


• 1. Write a short note on All-to-one reduction with suitable
example. [6]

172
University Questions on Unit 3
• Nov-Dec 2019 (Endsem)
• 1. Explain all-to-all broadcast on linear array, mesh
& hypercube topologies. [8]
• 2 Write short note on circular shift on a mesh. [6]

• Oct 2019 (Insem)


• 1. Explain broadcast & reduce operation with diagram. [4]
• 2. Explain prefix-sum operation for an eight-node
hypercube. [6]
• 3. Explain scatter and gather operations. [4]
• 4. Explain all-to-one broadcast and reduction on a ring. [6]

173
University Questions on Unit 3
• May-June-2019 (Endsem)
• 1. Explain Circular shift operation on mesh and hypercube
network. [8]

174
