
Parallel Computing

MPI Collective communication


Thorsten Grahs, 18. May 2015

Table of contents

Collective Communication
Communicator
Intercommunicator


Collective Communication
Communication involving a group of processes
The participating group is selected by a suitable communicator
All members of the group issue the identical call
No message tags are used

Collective communication
...does not necessarily involve all processes
(i.e. it need not be global communication)

Collective Communication
The amount of data sent must exactly match the amount of data received
Collective routines are collective across an entire communicator and must be called in the same order from all processes within the communicator
Collective routines are all blocking
Buffers can be reused upon return
Collective routines may return as soon as the calling process's participation is complete
No mixing of collective and point-to-point communication


Collective Communication functions


Barrier operation
MPI_Barrier()
All tasks wait for each other
Broadcast operation
MPI_Bcast()
One task sends to all
Accumulation operation
MPI_Reduce()
One task combines/reduces distributed data with an operation
Gather operation
MPI_Gather()
One task collects/gathers data
Scatter operation
MPI_Scatter()
One task scatters data (e.g. a vector)

Multi-Task functions
Multi-Broadcast operation
MPI_Allgather()
All participating tasks make the data available to other
participating tasks
Multi-Accumulation operation
MPI_Allreduce()
All participating tasks get result of the operation
Total exchange
MPI_Alltoall()
Each involved task sends to and receives from all others

Synchronisation
Barrier operation

MPI_Barrier(comm)
All tasks in comm wait for each other at the barrier
The only collective routine which provides explicit synchronization
Returns at any process only after all processes have entered the call
Can be used to ensure that all processes have reached a certain point in the computation
Mostly used to synchronize a sequence of tasks (e.g. for debugging)
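
A minimal sketch of the barrier call (the printed messages and variable names are illustrative, not from the slides):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int rank;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  printf("Rank %d: before the barrier\n", rank);
  MPI_Barrier(MPI_COMM_WORLD);   /* no rank continues until all ranks have entered the call */
  if (rank == 0)
    printf("All ranks have reached the barrier\n");

  MPI_Finalize();
  return 0;
}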

Example: MPI_Barrier

Tasks wait for each other at the barrier
An MPI_Isend is not yet completed
Its data cannot be accessed yet

Broadcast operation
MPI_Bcast(buffer,count,datatype,root,communicator)
All processes in the communicator use the same function call
Data from the root process is distributed to all processes in the communicator
The call is blocking, but does not imply synchronization
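
A minimal usage sketch (the variable nsteps and its value are illustrative assumptions):

int nsteps = 0;
if (rank == 0)
  nsteps = 1000;                 /* only the root knows the value initially */
MPI_Bcast(&nsteps, 1, MPI_INT, 0, MPI_COMM_WORLD);
/* afterwards nsteps == 1000 on every rank of the communicator */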


Accumulation operation
MPI_Reduce(sendbf,recvbf,count,type,op,master,comm)
The process with rank master acts as root
Combining operation op (e.g. summation)
All involved processes put their local data into sendbf
The root master collects the combined result in recvbf


Reduce operation
Pre-defined operations

MPI_MAX      maximum
MPI_MAXLOC   maximum and index of the maximum
MPI_MIN      minimum
MPI_SUM      summation
MPI_PROD     product
MPI_LXOR     logical exclusive OR
MPI_BXOR     bitwise exclusive OR
...


Example: Reduce Summation


MPI_Reduce(teil,s,1,MPI_DOUBLE,MPI_SUM,0,comm)
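
A sketch of the surrounding code, assuming teil and s are scalar doubles and rank holds the caller's rank (the helper local_partial_sum() is hypothetical):

double teil = local_partial_sum();   /* hypothetical helper computing the local part */
double s = 0.0;
MPI_Reduce(&teil, &s, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
if (rank == 0)
  printf("global sum = %f\n", s);    /* only rank 0 holds the result */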


Gather operation
MPI_Gather(sbf,scount,stype,rbf,rcount,rtype,ma,comm)
sbf   local send-buffer
rbf   receive-buffer on the root ma
Each process sends scount elements of type stype; the root ma receives rcount elements of type rtype from each process
The order of the data in rbf corresponds to the rank order in the communicator comm
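
A minimal sketch (rank, size and comm are assumed to be set up as usual):

int myval = rank * rank;             /* one contribution per process */
int *all = NULL;
if (rank == 0)
  all = malloc(sizeof(int) * size);  /* receive buffer is needed on the root only */
MPI_Gather(&myval, 1, MPI_INT, all, 1, MPI_INT, 0, comm);
/* on rank 0: all[i] now holds the value contributed by rank i */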


Scatter operation
MPI_Scatter(sbf,scount,stype,rbf,rcount,rtype,ma,comm)
The root ma distributes/scatters the data from sbf
Each process receives its sub-block of sbf in the local receive buffer rbf
The root ma also sends to itself
The i-th block of sbf goes to the process with rank i in the communicator comm


Example: Scatter
Three processes involved in comm
Send-buffer: int sbuf[6]={3,14,15,92,65,35};
Receive-buffer: int rbuf[2];
Function call
MPI_Scatter(sbuf,2,MPI_INT,rbuf,2,MPI_INT,0,comm);
leads to the following distribution:

Process   rbuf
0         { 3, 14}
1         {15, 92}
2         {65, 35}

Example Scatter-Gather: Averaging


// Root creates all random numbers
if (world_rank == 0)
  rand_nums = create_rand_nums(elements_per_proc * world_size);

// Create a buffer that will hold a subset of the random numbers
float *sub_rand_nums = malloc(sizeof(float) * elements_per_proc);

// Scatter the random numbers to all processes
MPI_Scatter(rand_nums, elements_per_proc, MPI_FLOAT, sub_rand_nums,
            elements_per_proc, MPI_FLOAT, 0, MPI_COMM_WORLD);

// Compute the average of your subset
float sub_avg = compute_avg(sub_rand_nums, elements_per_proc);

// Gather all partial averages down to the root process
float *sub_avgs = NULL;
if (world_rank == 0)
  sub_avgs = malloc(sizeof(float) * world_size);
MPI_Gather(&sub_avg, 1, MPI_FLOAT, sub_avgs, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);

// Compute the total average of all numbers on the root
if (world_rank == 0) {
  float avg = compute_avg(sub_avgs, world_size);
}

Multi-broadcast operation
MPI_Allgather(sbuf,scount,stype,rbuf,rcount,rtype,comm)
Data from local sbuf are sent to all in rbuf
Specifying a root is redundant since all processes receive the same data
MPI_Allgather corresponds to MPI_Gather followed by a
MPI_Bcast


Example Allgather: Averaging


// Gather all partial averages to all the processes
float *sub_avgs = (float *)malloc(sizeof(float) * world_size);
MPI_Allgather(&sub_avg, 1, MPI_FLOAT, sub_avgs, 1, MPI_FLOAT,
              MPI_COMM_WORLD);

// Compute the total average of all numbers.
float avg = compute_avg(sub_avgs, world_size);

Output
/home/th/: mpirun -n 4 ./average 100
Avg of all elements from proc 1 is 0.479736
Avg of all elements from proc 3 is 0.479736
Avg of all elements from proc 0 is 0.479736
Avg of all elements from proc 2 is 0.479736

Total exchange
MPI_Alltoall(sbuf,scount,stype,rbuf,rcount,rtype,comm)
Matrix view
Before MPI_Alltoall process k has row k of the matrix
After MPI_Alltoall process k has column k of the matrix
MPI_Alltoall corresponds to
MPI_Gather followed by a MPI_Scatter
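
A minimal sketch of the matrix view (buffer contents are illustrative):

int p, rank;
MPI_Comm_size(comm, &p);
MPI_Comm_rank(comm, &rank);
int *sbuf = malloc(p * sizeof(int));
int *rbuf = malloc(p * sizeof(int));
for (int i = 0; i < p; i++)
  sbuf[i] = 100*rank + i;            /* row "rank" of the matrix: element destined for rank i */
MPI_Alltoall(sbuf, 1, MPI_INT, rbuf, 1, MPI_INT, comm);
/* rbuf[j] == 100*j + rank: process "rank" now holds column "rank" of the matrix */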


Variable exchange operations


Variable scatter & Gather variants
MPI_Scatterv & MPI_Gatherv
Variable quantities are:
the number of data elements distributed to each individual process
their position in the send-buffer sbuf


Variable Scatter & Gather


Variable scatter
MPI_Scatterv(sbf,scount,displs,styp,
rbf,rcount,rtyp,ma,comm)
scount[i] contains the number of data elements that have to be sent to process i.
displs[i] defines the start of the data block for process i relative to sbuf (counted in elements of styp).

Variable gather
MPI_Gatherv(sbuf,scount,styp,
rbuf,rcount,displs,rtyp,ma,comm)
Variable variants also exist for MPI_Allgather (MPI_Allgatherv) and MPI_Alltoall (MPI_Alltoallv)

Example MPI_Scatterv
/* Initialising */
if(myrank==root) init(sbuf,N);
/* Splitting work and data */
MPI_Comm_size(comm,&size);
Nopt=N/size;
Rest=N-Nopt*size;
displs[0]=0;
for(i=0;i<size;i++) {                          /* one entry per process */
  scount[i]=Nopt;
  if(i>0) displs[i]=displs[i-1]+scount[i-1];   /* displacements counted in elements */
  if(Rest>0) { scount[i]++; Rest--; }          /* distribute the remainder */
}
/* Distributing data */
MPI_Scatterv(sbuf,scount,displs,MPI_DOUBLE,rbuf,
             scount[myrank],MPI_DOUBLE,root,comm);

Comparison between BLAS & Reduce


Matrix-vector multiplication y = Ax


Example comparison
Compare different approaches
A ∈ R^(N×M), N rows, M columns; compute y = Ax
Row-wise distribution → BLAS routine
Column-wise distribution → reduction operation

Example row-wise
Row-wise distribution
Result vector y distributed


Example row-wise BLAS


Building block: matrix-vector multiplication
BLAS (Basic Linear Algebra Subprograms) routine dgemv

void local_mv(int N, int M, double *y, const double *A, int lda, const double *x)
{
  double s;
  /* partial sums -- local operation */
  for (int i = 0; i < M; i++) {
    s = 0;
    for (int j = 0; j < N; j++)
      s += A[i*lda+j]*x[j];
    y[i] = s;
  }
}

Timing
arithmetic:     2*N*M*Ta
memory access:  x:  M*Tm(N,1)
                y:  Tm(M,1)
                A:  M*Tm(N,1)

Example row-wise vector


Task
Initial distribution:
All data at process 0
Result vector y expected at process 0


Example row-wise matrix


Operations
Distribute vector x to all processes:  MPI_Bcast    (p-1)*Tk(N)
Distribute the rows of matrix A:       MPI_Scatter  (p-1)*Tk(M*N)

Example row-wise results


Operations
Arithmetic:     2*N*M*Ta
Communication:  (p-1)*[Tk(N) + Tk(M*N) + Tk(M)]
Memory access:  2*M*Tm(N,1) + Tm(M,1)

Example column-wise
Task
Distribution column-wise
Solution vector assembled by reduction operation


Example column-wise vector

Distributing vector x:  MPI_Scatter   (p-1)*Tk(M)

Example column-wise matrix


Distributing matrix A
Pack the column blocks into a buffer
Memory access:  N*Tm(M,1) + M*Tm(N,1)
Sending:        (p-1)*Tk(M*N)

Example column-wise result


Assemble vector y with MPI_Reduce
Cost for the reduction of y:  log2(p)*(Tk(N) + N*Ta + 2*Tm(N,1))
Arithmetic:     2*N*M*Ta
Communication:  (p-1)*[Tk(M) + Tk(M*N)] + log2(p)*Tk(N)
Memory access:  N*Tm(M,1) + M*Tm(N,1) + 2*log2(p)*Tm(N,1)
The column-wise algorithm is slightly faster
Parallelization is only useful if the corresponding data distribution is already available before the algorithm starts

Communicator


Communicators
Motivation
Communicator: Distinguish different contexts
Conflict-free organization of groups
Integration of third party software
Example: distinction between library functions and the application

Predefined communicators
MPI_COMM_WORLD
MPI_COMM_SELF
MPI_COMM_NULL

Duplicate communicators
MPI_Comm_dup(MPI_Comm comm, MPI_Comm *newcomm);
Creates a copy newcomm of comm
Identical process group
Allows a clear delineation and characterisation of process groups

Example
MPI_Comm myworld;
...
MPI_Comm_dup(MPI_COMM_WORLD, &myworld);
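
A sketch of the typical use case from the motivation above: a (hypothetical) library duplicates the user's communicator so that its internal messages cannot collide with messages of the application.

static MPI_Comm lib_comm = MPI_COMM_NULL;

void lib_init(MPI_Comm user_comm)
{
  MPI_Comm_dup(user_comm, &lib_comm);   /* private communication context for the library */
}

void lib_finalize(void)
{
  MPI_Comm_free(&lib_comm);             /* release the duplicated communicator */
}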

Splitting communicators
MPI_Comm_split(MPI_Comm comm, int color, int key,
MPI_Comm *newcomm);
Divides the communicator comm into multiple communicators with disjoint process groups
MPI_Comm_split has to be called by all processes in comm
Processes with the same value of color form a new communicator group; key determines the rank order within each new group


Example Splitting communicator


MPI_Comm comm1, comm2, newcomm;

MPI_Comm_size(comm,&size);
MPI_Comm_rank(comm,&rank);
i=rank%3;        /* color */
j=size-rank;     /* key   */
if(i==0)
  MPI_Comm_split(comm,MPI_UNDEFINED,0,&newcomm);
else if(i==1)
  MPI_Comm_split(comm,i,j,&comm1);
else
  MPI_Comm_split(comm,i,j,&comm2);

Color MPI_UNDEFINED returns the null handle MPI_COMM_NULL.

Example Splitting communicator


MPI_COMM_WORLD (color = rank%3; color 0 is replaced by MPI_UNDEFINED):

Rank    P0  P1  P2  P3  P4  P5  P6  P7  P8
color    -   1   2   -   1   2   -   1   2
key      -   7   6   5   4   3   2   1   0

Resulting communicators (new ranks follow ascending key):

comm1:  P7 -> 0,  P4 -> 1,  P1 -> 2
comm2:  P8 -> 0,  P5 -> 1,  P2 -> 2
P0, P3, P6 (color MPI_UNDEFINED) obtain MPI_COMM_NULL

Free communicator group


Clean up
MPI_Comm_free(MPI_Comm *comm);

Deletes the communicator comm
Resources occupied by comm are released by MPI
After the function call the communicator handle has the value of the null handle MPI_COMM_NULL
MPI_Comm_free has to be called by all processes which belong to comm


Grouping communicators
MPI_Comm_group(MPI_Comm comm, MPI_Group *grp)
Creates a process group from a communicator
More group constructors
MPI_Comm_create
Generates a communicator from a group
MPI_Group_incl
Include processes into a group
MPI_Group_excl
Exclude processes from a group
MPI_Group_range_incl
Forms a group from a simple pattern
MPI_Group_range_excl
Excludes processes from a group by simple pattern

Example: create a group


Group
grp=(a,b,c,d,e,f,g),  n=3,  ranks=[5,0,2]

MPI_Group_incl(grp, n, ranks, &newgrp)
Includes in the new group newgrp the n=3 processes defined by ranks=[5,0,2]
newgrp=(f,a,c)

MPI_Group_excl(grp, n, ranks, &newgrp)
Excludes from grp the n=3 processes defined by ranks=[5,0,2]
newgrp=(b,d,e,g)
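
A sketch in code, assuming the group operations are used to build a sub-communicator from the first three ranks of MPI_COMM_WORLD:

MPI_Group world_grp, sub_grp;
MPI_Comm  sub_comm;
int ranks[3] = {0, 1, 2};

MPI_Comm_group(MPI_COMM_WORLD, &world_grp);
MPI_Group_incl(world_grp, 3, ranks, &sub_grp);
MPI_Comm_create(MPI_COMM_WORLD, sub_grp, &sub_comm);  /* MPI_COMM_NULL on ranks not in sub_grp */
MPI_Group_free(&sub_grp);
MPI_Group_free(&world_grp);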

Example: create a group II


Group
grp=(a,b,c,d,e,f,g,h,i,j),  n=3,
ranges=[[6,7,1],[1,6,2],[0,9,4]]
Each range is a triple [start, end, stride]
MPI_Group_range_incl(grp, 3, ranges, &newgrp)
Includes in the new group newgrp the ranks covered by the n=3 range triples [[6,7,1],[1,6,2],[0,9,4]]
newgrp=(g,h,b,d,f,a,e,i)
MPI_Group_range_excl(grp, 3, ranges, &newgrp)
Excludes from grp the ranks covered by the n=3 range triples [[6,7,1],[1,6,2],[0,9,4]]
newgrp=(c,j)
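
The same example as a code sketch (assuming MPI_COMM_WORLD has at least 10 processes):

MPI_Group grp, newgrp;
int ranges[3][3] = { {6, 7, 1}, {1, 6, 2}, {0, 9, 4} };  /* [start, end, stride] triples */

MPI_Comm_group(MPI_COMM_WORLD, &grp);
MPI_Group_range_incl(grp, 3, ranges, &newgrp);  /* keeps ranks 6,7, 1,3,5, 0,4,8 */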

Operations on communicator groups


More grouping functions
Merging groups:          MPI_Group_union
Intersection of groups:  MPI_Group_intersection
Difference of groups:    MPI_Group_difference
Comparing groups:        MPI_Group_compare
Delete/free a group:     MPI_Group_free
Size of a group:         MPI_Group_size
Rank within a group:     MPI_Group_rank
...


Intercommunicator


Intercommunicator
Intracommunicator
Up to now we have only handled communication inside a contiguous group.
This communication takes place inside (intra) a communicator.
Intercommunicator
A communicator that establishes a context between two groups
Intercommunicators are associated with two groups of disjoint processes
Intercommunicators are associated with a remote group and a local group
The target process (destination for a send, source for a receive) is addressed by its rank in the remote group
A communicator is either intra or inter, never both
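
A minimal sketch of addressing over an intercommunicator (intercomm, in_group_A and value are assumed/hypothetical variables):

int local_rank, value = 42;
MPI_Comm_rank(intercomm, &local_rank);    /* rank within the LOCAL group */
if (in_group_A && local_rank == 0)
  MPI_Send(&value, 1, MPI_INT, 0, 99, intercomm);   /* dest 0 = rank 0 of the REMOTE group */
else if (!in_group_A && local_rank == 0)
  MPI_Recv(&value, 1, MPI_INT, 0, 99, intercomm, MPI_STATUS_IGNORE);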

Create intercommunicator
MPI_Intercomm_create(local_comm, local_bridge,
bridge_comm, remote_bridge, tag, &newcomm )
local_comm
local Intracommunicator (handle)
local_bridge
Rank of a distinguished process in local_comm (integer)
bridge_comm
Peer (remote) intracommunicator, which is to be connected to local_comm by the newly built intercommunicator newcomm
remote_bridge
Rank of the distinguished bridge-head process of the remote group within bridge_comm

Communication between groups

The function uses point-to-point communication with the specified tag between the two processes defined as bridge heads.

Example
int main(int argc, char **argv)
{
  MPI_Comm myComm;        /* intra-communicator of local sub-group */
  MPI_Comm myFirstComm;   /* inter-communicator */
  MPI_Comm mySecondComm;  /* second inter-communicator (group 1 only) */
  int memberKey, rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* User code must generate memberKey in the range [0, 1, 2] */
  memberKey = rank % 3;

  /* Build intra-communicator for local sub-group */
  MPI_Comm_split(MPI_COMM_WORLD, memberKey, rank, &myComm);

Example
  /* Build inter-communicators. Tags are hard-coded. */
  if (memberKey == 0)
  {
    /* Group 0 communicates with group 1. */
    MPI_Intercomm_create(myComm, 0, MPI_COMM_WORLD, 1,
                         01, &myFirstComm);
  }
  else if (memberKey == 1)
  {
    /* Group 1 communicates with groups 0 and 2. */
    MPI_Intercomm_create(myComm, 0, MPI_COMM_WORLD, 0,
                         01, &myFirstComm);
    MPI_Intercomm_create(myComm, 0, MPI_COMM_WORLD, 2,
                         12, &mySecondComm);
  }
  else if (memberKey == 2)
  {
    /* Group 2 communicates with group 1. */
    MPI_Intercomm_create(myComm, 0, MPI_COMM_WORLD, 1,
                         12, &mySecondComm);
  }

Example
  /* Do work ... */

  switch (memberKey)  /* free communicators appropriately */
  {
  case 0:
    MPI_Comm_free(&myFirstComm);
    break;
  case 1:
    MPI_Comm_free(&myFirstComm);
    MPI_Comm_free(&mySecondComm);
    break;
  case 2:
    MPI_Comm_free(&mySecondComm);
    break;
  }

  MPI_Finalize();
}

Motivation Intercommunicator
Used for
Meta-Computing
Cloud-Computing
Low bandwidth between the components
e.g. cluster <-> PC
The bridge head controls the communication with the remote computer

