
Parallel Computing

MPI Collective communication


Thorsten Grahs, 18. May 2015

Table of contents

Collective Communication
Communicator
Intercommunicator


Collective Communication
Communication involving a group of processes
The participating group is selected by a suitable communicator
All members of the group issue the identical call
No message tags are used

Collective communication
...does not necessarily involve all processes
(i.e. it need not be global communication)

Collective Communication
The amount of data sent must exactly match the amount of data received
Collective routines are collective across an entire communicator and must be called in the same order from all processes within the communicator
Collective routines are all blocking
Buffers can be reused upon return
Collective routines may return as soon as the calling process's participation is complete
No mixing of collective and point-to-point communication


Collective Communication functions


Barrier operation
MPI_Barrier()
All tasks wait for each other
Broadcast operation
MPI_Bcast()
One task sends to all
Accumulation operation
MPI_Reduce()
One task combines/reduces distributed data with an operation
Gather operation
MPI_Gather()
One task collects/gathers data
Scatter operation
MPI_Scatter()
One task scatters data (e.g. a vector)

Multi-Task functions
Multi-Broadcast operation
MPI_Allgather()
All participating tasks make the data available to other
participating tasks
Multi-Accumulation operation
MPI_Allreduce()
All participating tasks get result of the operation
Total exchange
MPI_Alltoall()
Each involved task sends to and receives from all others

Synchronisation
Barrier operation

MPI_Barrier(comm)
All tasks in comm wait for each other at the barrier
The only collective routine which provides explicit synchronization
Returns at any process only after all processes have entered the call
Can be used to ensure that all processes have reached a certain point in the computation
Mostly used to synchronize a sequence of tasks (e.g. for debugging)
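
A minimal sketch of the barrier call (the printed messages and variable names are illustrative, not from the slides):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int rank;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  printf("Rank %d: before the barrier\n", rank);
  MPI_Barrier(MPI_COMM_WORLD);   /* no rank continues until all ranks have entered the call */
  if (rank == 0)
    printf("All ranks have reached the barrier\n");

  MPI_Finalize();
  return 0;
}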

Example: MPI_Barrier

Tasks wait for each other at the barrier
An MPI_Isend is not yet completed
Its data cannot be accessed yet

Broadcast operation
MPI_Bcast(buffer,count,datatype,root,communicator)
All processes in the communicator use the same function call
Data from the root process is distributed to all processes in the communicator
The call is blocking, but does not imply synchronization
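
A minimal usage sketch (the variable nsteps and its value are illustrative assumptions):

int nsteps = 0;
if (rank == 0)
  nsteps = 1000;                 /* only the root knows the value initially */
MPI_Bcast(&nsteps, 1, MPI_INT, 0, MPI_COMM_WORLD);
/* afterwards nsteps == 1000 on every rank of the communicator */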


Accumulation operation
MPI_Reduce(sendbf,recvbf,count,type,op,master,comm)
The process with rank master acts as root
Combining operation op (e.g. summation)
All involved processes put their local data into sendbf
The root master collects the combined result in recvbf


Reduce operation
Pre-defined operations

MPI_MAX      maximum
MPI_MAXLOC   maximum and index of the maximum
MPI_MIN      minimum
MPI_SUM      summation
MPI_PROD     product
MPI_LXOR     logical exclusive OR
MPI_BXOR     bitwise exclusive OR
...


Example: Reduce Summation


MPI_Reduce(teil,s,1,MPI_DOUBLE,MPI_SUM,0,comm)
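
A sketch of the surrounding code, assuming teil and s are scalar doubles and rank holds the caller's rank (the helper local_partial_sum() is hypothetical):

double teil = local_partial_sum();   /* hypothetical helper computing the local part */
double s = 0.0;
MPI_Reduce(&teil, &s, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
if (rank == 0)
  printf("global sum = %f\n", s);    /* only rank 0 holds the result */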


Gather operation
MPI_Gather(sbf,scount,stype,rbf,rcount,rtype,ma,comm)
sbf   local send-buffer
rbf   receive-buffer on the root ma
Each process sends scount elements of type stype; the root ma receives rcount elements of type rtype from each process
The order of the data in rbf corresponds to the rank order in the communicator comm
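
A minimal sketch (rank, size and comm are assumed to be set up as usual):

int myval = rank * rank;             /* one contribution per process */
int *all = NULL;
if (rank == 0)
  all = malloc(sizeof(int) * size);  /* receive buffer is needed on the root only */
MPI_Gather(&myval, 1, MPI_INT, all, 1, MPI_INT, 0, comm);
/* on rank 0: all[i] now holds the value contributed by rank i */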


Scatter operation
MPI_Scatter(sbf,scount,stype,rbf,rcount,rtype,ma,comm)
The root ma distributes/scatters the data from sbf
Each process receives its sub-block of sbf in the local receive buffer rbf
The root ma also sends to itself
The i-th block of sbf goes to the process with rank i in the communicator comm


Example: Scatter
Three processes involved in comm
Send-buffer: int sbuf[6]={3,14,15,92,65,35};
Receive-buffer: int rbuf[2];
Function call
MPI_Scatter(sbuf,2,MPI_INT,rbuf,2,MPI_INT,0,comm);
leads to the following distribution:

Process   rbuf
0         { 3, 14}
1         {15, 92}
2         {65, 35}

Example Scatter-Gather: Averaging


// Root creates all random numbers
if (world_rank == 0)
  rand_nums = create_rand_nums(elements_per_proc * world_size);

// Create a buffer that will hold a subset of the random numbers
float *sub_rand_nums = malloc(sizeof(float) * elements_per_proc);

// Scatter the random numbers to all processes
MPI_Scatter(rand_nums, elements_per_proc, MPI_FLOAT, sub_rand_nums,
            elements_per_proc, MPI_FLOAT, 0, MPI_COMM_WORLD);

// Compute the average of your subset
float sub_avg = compute_avg(sub_rand_nums, elements_per_proc);

// Gather all partial averages down to the root process
float *sub_avgs = NULL;
if (world_rank == 0)
  sub_avgs = malloc(sizeof(float) * world_size);
MPI_Gather(&sub_avg, 1, MPI_FLOAT, sub_avgs, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);

// Compute the total average of all numbers on the root
if (world_rank == 0) {
  float avg = compute_avg(sub_avgs, world_size);
}

Multi-broadcast operation
MPI_Allgather(sbuf,scount,stype,rbuf,rcount,rtype,comm)
Data from local sbuf are sent to all in rbuf
Specifying a root is redundant since all processes receive the same data
MPI_Allgather corresponds to MPI_Gather followed by a
MPI_Bcast


Example Allgather: Averaging


// Gather all partial averages to all the processes
float *sub_avgs = (float *)malloc(sizeof(float) * world_size);
MPI_Allgather(&sub_avg, 1, MPI_FLOAT, sub_avgs, 1, MPI_FLOAT,
              MPI_COMM_WORLD);

// Compute the total average of all numbers.
float avg = compute_avg(sub_avgs, world_size);

Output
/home/th/: mpirun -n 4 ./average 100
Avg of all elements from proc 1 is 0.479736
Avg of all elements from proc 3 is 0.479736
Avg of all elements from proc 0 is 0.479736
Avg of all elements from proc 2 is 0.479736

Total exchange
MPI_Alltoall(sbuf,scount,stype,rbuf,rcount,rtype,comm)
Matrix view
Before MPI_Alltoall process k has row k of the matrix
After MPI_Alltoall process k has column k of the matrix
MPI_Alltoall corresponds to
MPI_Gather followed by a MPI_Scatter
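
A minimal sketch of the matrix view (buffer contents are illustrative):

int p, rank;
MPI_Comm_size(comm, &p);
MPI_Comm_rank(comm, &rank);
int *sbuf = malloc(p * sizeof(int));
int *rbuf = malloc(p * sizeof(int));
for (int i = 0; i < p; i++)
  sbuf[i] = 100*rank + i;            /* row "rank" of the matrix: element destined for rank i */
MPI_Alltoall(sbuf, 1, MPI_INT, rbuf, 1, MPI_INT, comm);
/* rbuf[j] == 100*j + rank: process "rank" now holds column "rank" of the matrix */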


Variable exchange operations


Variable scatter & Gather variants
MPI_Scatterv & MPI_Gatherv
Variable quantities are:
the number of data elements distributed to each individual process
their position in the send-buffer sbuf


Variable Scatter & Gather


Variable scatter
MPI_Scatterv(sbf,scount,displs,styp,
rbf,rcount,rtyp,ma,comm)
scount[i] contains the number of data elements that have to be sent to process i.
displs[i] defines the start of the data block for process i relative to sbuf (counted in elements of styp).

Variable gather
MPI_Gatherv(sbuf,scount,styp,
rbuf,rcount,displs,rtyp,ma,comm)
Variable variants also exist for MPI_Allgather (MPI_Allgatherv) and MPI_Alltoall (MPI_Alltoallv)

Example MPI_Scatterv
/* Initialising */
if(myrank==root) init(sbuf,N);
/* Splitting work and data */
MPI_Comm_size(comm,&size);
Nopt=N/size;
Rest=N-Nopt*size;
displs[0]=0;
for(i=0;i<size;i++) {                          /* one entry per process */
  scount[i]=Nopt;
  if(i>0) displs[i]=displs[i-1]+scount[i-1];   /* displacements counted in elements */
  if(Rest>0) { scount[i]++; Rest--; }          /* distribute the remainder */
}
/* Distributing data */
MPI_Scatterv(sbuf,scount,displs,MPI_DOUBLE,rbuf,
             scount[myrank],MPI_DOUBLE,root,comm);

Comparison between BLAS & Reduce


Matrix-vector multiplication y = Ax


Example comparison
Compare different approaches
A ∈ R^(N×M), N rows, M columns; compute y = Ax
Row-wise distribution → BLAS routine
Column-wise distribution → reduction operation

Example row-wise
Row-wise distribution
Result vector y distributed


Example row-wise BLAS


Building block: matrix-vector multiplication
BLAS (Basic Linear Algebra Subprograms) routine dgemv

void local_mv(int N, int M, double *y, const double *A, int lda, const double *x)
{
  double s;
  /* partial sums -- local operation */
  for (int i = 0; i < M; i++) {
    s = 0;
    for (int j = 0; j < N; j++)
      s += A[i*lda+j]*x[j];
    y[i] = s;
  }
}

Timing
arithmetic:     2*N*M*Ta
memory access:  x:  M*Tm(N,1)
                y:  Tm(M,1)
                A:  M*Tm(N,1)

Example row-wise vector


Task
Initial distribution:
All data at process 0
Result vector y expected at process 0


Example row-wise matrix


Operations
Distribute vector x to all processes:  MPI_Bcast    (p-1)*Tk(N)
Distribute the rows of matrix A:       MPI_Scatter  (p-1)*Tk(M*N)

Example row-wise results


Operations
Arithmetic:     2*N*M*Ta
Communication:  (p-1)*[Tk(N) + Tk(M*N) + Tk(M)]
Memory access:  2*M*Tm(N,1) + Tm(M,1)

Example column-wise
Task
Distribution column-wise
Solution vector assembled by reduction operation


Example column-wise vector

Distributing vector x:  MPI_Scatter   (p-1)*Tk(M)

Example column-wise matrix


Distributing matrix A
Pack the column blocks into a buffer
Memory access:  N*Tm(M,1) + M*Tm(N,1)
Sending:        (p-1)*Tk(M*N)

Example column-wise result


Assemble vector y with MPI_Reduce
Cost for the reduction of y:  log2(p)*(Tk(N) + N*Ta + 2*Tm(N,1))
Arithmetic:     2*N*M*Ta
Communication:  (p-1)*[Tk(M) + Tk(M*N)] + log2(p)*Tk(N)
Memory access:  N*Tm(M,1) + M*Tm(N,1) + 2*log2(p)*Tm(N,1)
The column-wise algorithm is slightly faster
Parallelization is only useful if the corresponding data distribution is already available before the algorithm starts

Communicator


Communicators
Motivation
Communicator: Distinguish different contexts
Conflict-free organization of groups
Integration of third party software
Example: distinction between library functions and the application

Predefined communicators
MPI_COMM_WORLD
MPI_COMM_SELF
MPI_COMM_NULL

Duplicate communicators
MPI_Comm_dup(MPI_Comm comm, MPI_Comm *newcomm);
Creates a copy newcomm of comm
Identical process group
Allows a clear delineation and characterisation of process groups

Example
MPI_Comm myworld;
...
MPI_Comm_dup(MPI_COMM_WORLD, &myworld);
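
A sketch of the typical use case from the motivation above: a (hypothetical) library duplicates the user's communicator so that its internal messages cannot collide with messages of the application.

static MPI_Comm lib_comm = MPI_COMM_NULL;

void lib_init(MPI_Comm user_comm)
{
  MPI_Comm_dup(user_comm, &lib_comm);   /* private communication context for the library */
}

void lib_finalize(void)
{
  MPI_Comm_free(&lib_comm);             /* release the duplicated communicator */
}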

Splitting communicators
MPI_Comm_split(MPI_Comm comm, int color, int key,
MPI_Comm *newcomm);
Divides the communicator comm into multiple communicators with disjoint process groups
MPI_Comm_split has to be called by all processes in comm
Processes with the same value of color form a new communicator group; key determines the rank order within each new group


Example Splitting communicator


MPI_Comm comm1, comm2, newcomm;

MPI_Comm_size(comm,&size);
MPI_Comm_rank(comm,&rank);
i=rank%3;        /* color */
j=size-rank;     /* key   */
if(i==0)
  MPI_Comm_split(comm,MPI_UNDEFINED,0,&newcomm);
else if(i==1)
  MPI_Comm_split(comm,i,j,&comm1);
else
  MPI_Comm_split(comm,i,j,&comm2);

Color MPI_UNDEFINED returns the null handle MPI_COMM_NULL.

Example Splitting communicator


MPI_COMM_WORLD (color = rank%3; color 0 is replaced by MPI_UNDEFINED):

Rank    P0  P1  P2  P3  P4  P5  P6  P7  P8
color    -   1   2   -   1   2   -   1   2
key      -   7   6   5   4   3   2   1   0

Resulting communicators (new ranks follow ascending key):

comm1:  P7 -> 0,  P4 -> 1,  P1 -> 2
comm2:  P8 -> 0,  P5 -> 1,  P2 -> 2
P0, P3, P6 (color MPI_UNDEFINED) obtain MPI_COMM_NULL

Free communicator group


Clean up
MPI_Comm_free(MPI_Comm *comm);

Deletes the communicator comm
Resources occupied by comm are released by MPI
After the function call the communicator handle has the value of the null handle MPI_COMM_NULL
MPI_Comm_free has to be called by all processes which belong to comm


Grouping communicators
MPI_Comm_group(MPI_Comm comm, MPI_Group *grp)
Creates a process group from a communicator
More group constructors
MPI_Comm_create
Generates a communicator from a group
MPI_Group_incl
Include processes into a group
MPI_Group_excl
Exclude processes from a group
MPI_Group_range_incl
Forms a group from a simple pattern
MPI_Group_range_excl
Excludes processes from a group by simple pattern

Example: create a group


Group
grp=(a,b,c,d,e,f,g),  n=3,  ranks=[5,0,2]

MPI_Group_incl(grp, n, ranks, &newgrp)
Includes in the new group newgrp the n=3 processes defined by ranks=[5,0,2]
newgrp=(f,a,c)

MPI_Group_excl(grp, n, ranks, &newgrp)
Excludes from grp the n=3 processes defined by ranks=[5,0,2]
newgrp=(b,d,e,g)
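
A sketch in code, assuming the group operations are used to build a sub-communicator from the first three ranks of MPI_COMM_WORLD:

MPI_Group world_grp, sub_grp;
MPI_Comm  sub_comm;
int ranks[3] = {0, 1, 2};

MPI_Comm_group(MPI_COMM_WORLD, &world_grp);
MPI_Group_incl(world_grp, 3, ranks, &sub_grp);
MPI_Comm_create(MPI_COMM_WORLD, sub_grp, &sub_comm);  /* MPI_COMM_NULL on ranks not in sub_grp */
MPI_Group_free(&sub_grp);
MPI_Group_free(&world_grp);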

Example: create a group II


Group
grp=(a,b,c,d,e,f,g,h,i,j),  n=3,
ranges=[[6,7,1],[1,6,2],[0,9,4]]
Each range is a triple [start, end, stride]
MPI_Group_range_incl(grp, 3, ranges, &newgrp)
Includes in the new group newgrp the ranks covered by the n=3 range triples [[6,7,1],[1,6,2],[0,9,4]]
newgrp=(g,h,b,d,f,a,e,i)
MPI_Group_range_excl(grp, 3, ranges, &newgrp)
Excludes from grp the ranks covered by the n=3 range triples [[6,7,1],[1,6,2],[0,9,4]]
newgrp=(c,j)
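
The same example as a code sketch (assuming MPI_COMM_WORLD has at least 10 processes):

MPI_Group grp, newgrp;
int ranges[3][3] = { {6, 7, 1}, {1, 6, 2}, {0, 9, 4} };  /* [start, end, stride] triples */

MPI_Comm_group(MPI_COMM_WORLD, &grp);
MPI_Group_range_incl(grp, 3, ranges, &newgrp);  /* keeps ranks 6,7, 1,3,5, 0,4,8 */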

Operations on communicator groups


More grouping functions
Merging groups:          MPI_Group_union
Intersection of groups:  MPI_Group_intersection
Difference of groups:    MPI_Group_difference
Comparing groups:        MPI_Group_compare
Delete/free a group:     MPI_Group_free
Size of a group:         MPI_Group_size
Rank within a group:     MPI_Group_rank
...


Intercommunicator


Intercommunicator
Intracommunicator
Up to now we have only handled communication inside a contiguous group.
This communication takes place inside (intra) a communicator.
Intercommunicator
A communicator that establishes a context between two groups
Intercommunicators are associated with two groups of disjoint processes
Intercommunicators are associated with a remote group and a local group
The target process (destination for a send, source for a receive) is addressed by its rank in the remote group
A communicator is either intra or inter, never both
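
A minimal sketch of addressing over an intercommunicator (intercomm, in_group_A and value are assumed/hypothetical variables):

int local_rank, value = 42;
MPI_Comm_rank(intercomm, &local_rank);    /* rank within the LOCAL group */
if (in_group_A && local_rank == 0)
  MPI_Send(&value, 1, MPI_INT, 0, 99, intercomm);   /* dest 0 = rank 0 of the REMOTE group */
else if (!in_group_A && local_rank == 0)
  MPI_Recv(&value, 1, MPI_INT, 0, 99, intercomm, MPI_STATUS_IGNORE);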

Create intercommunicator
MPI_Intercomm_create(local_comm, local_bridge,
bridge_comm, remote_bridge, tag, &newcomm )
local_comm
local Intracommunicator (handle)
local_bridge
Rank of a distinguished process in local_comm (integer)
bridge_comm
Peer (remote) intracommunicator, which is to be connected to local_comm by the newly built intercommunicator newcomm
remote_bridge
Rank of the distinguished bridge-head process of the remote group within bridge_comm

Communication between groups

The function uses point-to-point communication with the specified tag between the two processes defined as bridge heads.

Example
int main(int argc, char **argv)
{
  MPI_Comm myComm;        /* intra-communicator of local sub-group */
  MPI_Comm myFirstComm;   /* inter-communicator */
  MPI_Comm mySecondComm;  /* second inter-communicator (group 1 only) */
  int memberKey, rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* User code must generate memberKey in the range [0, 1, 2] */
  memberKey = rank % 3;

  /* Build intra-communicator for local sub-group */
  MPI_Comm_split(MPI_COMM_WORLD, memberKey, rank, &myComm);

Example
  /* Build inter-communicators. Tags are hard-coded. */
  if (memberKey == 0)
  {
    /* Group 0 communicates with group 1. */
    MPI_Intercomm_create(myComm, 0, MPI_COMM_WORLD, 1,
                         01, &myFirstComm);
  }
  else if (memberKey == 1)
  {
    /* Group 1 communicates with groups 0 and 2. */
    MPI_Intercomm_create(myComm, 0, MPI_COMM_WORLD, 0,
                         01, &myFirstComm);
    MPI_Intercomm_create(myComm, 0, MPI_COMM_WORLD, 2,
                         12, &mySecondComm);
  }
  else if (memberKey == 2)
  {
    /* Group 2 communicates with group 1. */
    MPI_Intercomm_create(myComm, 0, MPI_COMM_WORLD, 1,
                         12, &mySecondComm);
  }

Example
  /* Do work ... */

  switch (memberKey)  /* free communicators appropriately */
  {
  case 0:
    MPI_Comm_free(&myFirstComm);
    break;
  case 1:
    MPI_Comm_free(&myFirstComm);
    MPI_Comm_free(&mySecondComm);
    break;
  case 2:
    MPI_Comm_free(&mySecondComm);
    break;
  }

  MPI_Finalize();
}

Motivation Intercommunicator
Used for
Meta-Computing
Cloud-Computing
Low bandwidth between the components
e.g. cluster <-> PC
The bridge head controls the communication with the remote computer

