MPI

Rohit Banga, Prakher Anand, K Swagat, Manoj Gupta
Advanced Computer Architecture, Spring 2010

ORGANIZATION
- Basics of MPI
- Point to Point Communication
- Collective Communication
- Demo

GOALS
- Explain basics of MPI
- Start coding today!
- Keep It Short and Simple

MESSAGE PASSING INTERFACE
- A message-passing library specification
- Extended message-passing model
- Not a language or compiler specification
- Not a specific implementation – several implementations exist (like pthread)
- A standard for message passing in distributed-memory parallel computing
- Distributed Memory – Shared Nothing approach!
- Some interconnection technology – TCP, INFINIBAND (on our cluster)

GOALS OF MPI SPECIFICATION
- Provide source code portability
- Allow efficient implementations
- Flexible to port different algorithms on different hardware environments
- Support for heterogeneous architectures – processors not identical

REASONS FOR USING MPI
- Standardization – virtually all HPC platforms
- Portability – same code runs on another platform
- Performance – vendor implementations should exploit native hardware features
- Functionality – 115 routines
- Availability – a variety of implementations available

BASIC MODEL
- Communicators and Groups
- Group
  - ordered set of processes
  - each process is associated with a unique integer rank
  - rank from 0 to (N-1) for N processes
  - an object in system memory accessed by handle
  - MPI_GROUP_EMPTY, MPI_GROUP_NULL

BASIC MODEL (CONTD.)
- Communicator
  - Group of processes that may communicate with each other
  - MPI messages must specify a communicator
  - An object in memory, with a handle to access the object
- There is a default communicator (automatically defined):
  - MPI_COMM_WORLD
  - identifies the group of all processes

COMMUNICATORS
- Intra-Communicator – all processes from the same group
- Inter-Communicator – processes picked up from several groups

COMMUNICATOR AND GROUPS
- For a programmer, group and communicator are one
- Allow you to organize tasks, based upon function, into task groups
- Enable Collective Communications (later) – operations across a subset of related tasks
- Safe communications
- Many communicators can exist at the same time
- Dynamic – can be created and destroyed at run time
- A process may be in more than one group/communicator – it has a unique rank in every group/communicator
- Used for implementing user-defined virtual topologies

VIRTUAL TOPOLOGIES
- Processes can be arranged in a virtual grid, e.g. a 2x2 Cartesian grid where each of ranks 0-3 maps to a coordinate pair: coord (0,0), (0,1), (1,0), (1,1) (diagram in the original slide; see the sketch below)
- Attach graph topology information to an existing communicator
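As a hedged illustration (not part of the original slides), such a 2x2 grid can be built with the standard Cartesian topology routines; the variable names here are purely illustrative.

/* Sketch: attach a 2x2 Cartesian topology and query this process' coordinates.
   Run with exactly 4 processes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, coords[2];
    int dims[2] = {2, 2};        /* 2x2 grid */
    int periods[2] = {0, 0};     /* non-periodic in both dimensions */
    MPI_Comm cart_comm;

    MPI_Init(&argc, &argv);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart_comm);
    MPI_Comm_rank(cart_comm, &rank);
    MPI_Cart_coords(cart_comm, rank, 2, coords);
    printf("rank %d is at coordinates (%d,%d)\n", rank, coords[0], coords[1]);
    MPI_Finalize();
    return 0;
}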

SEMANTICS
- Header file: #include <mpi.h> (C), include mpif.h (Fortran); bindings also exist for Java, Python, etc.
- Format:     rc = MPI_Xxxxx(parameter, ...)
- Example:    rc = MPI_Bsend(&buf, count, type, dest, tag, comm)
- Error code: returned as "rc"; MPI_SUCCESS if successful

MPI PROGRAM STRUCTURE

MPI FUNCTIONS – MINIMAL SUBSET
- MPI_Init – initialize MPI
- MPI_Comm_size – size of group associated with the communicator
- MPI_Comm_rank – identify the process
- MPI_Send
- MPI_Recv
- MPI_Finalize
We will discuss the simple ones first.

CLASSIFICATION OF MPI ROUTINES
- Environment Management: MPI_Init, MPI_Finalize
- Information on the Processes: MPI_Comm_rank, MPI_Get_processor_name
- Point-to-Point Communication: MPI_Send, MPI_Recv
- Collective Communication: MPI_Reduce, MPI_Bcast

MPI_INIT
- All MPI programs call this before using other MPI functions
- int MPI_Init(int *pargc, char ***pargv);
- Must be called in every MPI program
- Must be called only once and before any other MPI functions are called
- Passes command line arguments to all processes

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    ...
}

MPI_COMM_SIZE
- Number of processes in the group associated with a communicator
- int MPI_Comm_size(MPI_Comm comm, int *psize);
- Find out the number of processes being used by your application

int main(int argc, char **argv) {
    int p;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    ...
}

MPI_COMM_RANK
- Rank of the calling process within the communicator
- Unique rank between 0 and (p-1); can be thought of as a task ID
- int MPI_Comm_rank(MPI_Comm comm, int *rank);
- Unique rank for a process in each communicator it belongs to
- Used to identify the work for the processor

int main(int argc, char **argv) {
    int p, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    ...
}

MPI_FINALIZE
- Terminates the MPI execution environment
- Last MPI routine to be called in any MPI program
- int MPI_Finalize(void);

int main(int argc, char **argv) {
    int p, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("no. of processors: %d\n rank: %d", p, rank);
    MPI_Finalize();
}


HOW TO COMPILE THIS
- Open MPI implementation on our cluster
- mpicc -o test_1 test_1.c
- Works like gcc; mpicc is not a special compiler
    $ mpicc
    gcc: no input files
- MPI is implemented just as any other library
- mpicc is just a wrapper around gcc that includes the required command line parameters

HOW TO RUN THIS
- mpirun -np X test_1
- Will run X copies of the program in your current run-time environment
- The -np option specifies the number of copies of the program

MPIRUN
- Only the rank 0 process can receive standard input
  - mpirun redirects standard input of all others to /dev/null
  - Open MPI redirects standard input of mpirun to standard input of the rank 0 process
- The node that invoked mpirun need not be the same as the node for the MPI_COMM_WORLD rank 0 process
- mpirun directs standard output and error of remote nodes to the node that invoked mpirun
- SIGTERM, SIGKILL: kill all processes in the communicator
- SIGUSR1, SIGUSR2: propagated to all processes
- All other signals: ignored

A NOTE ON IMPLEMENTATION
- I want to implement my own version of MPI
- (Diagram in the original slide: processes calling MPI_Init, each with an MPI thread running alongside the application.)

SOME MORE FUNCTIONS
- int MPI_Initialized(int *flag)
  - Checks whether MPI_Init has been called. Why? Because MPI_Init may be called at most once.
- double MPI_Wtime()
  - Returns elapsed wall clock time in seconds (double precision) on the calling processor
- double MPI_Wtick()
  - Returns the resolution in seconds (double precision) of MPI_Wtime()
- Message Passing Functionality
  - That is what MPI is meant for!
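A small usage sketch (not from the slides): the usual idiom is to take two MPI_Wtime readings around the region of interest; the loop here is just placeholder work.

/* Sketch: time a code region with MPI_Wtime and report the clock resolution. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    double t0 = MPI_Wtime();               /* wall-clock seconds */
    double s = 0.0;
    for (long i = 1; i <= 10000000L; i++)  /* placeholder work to be timed */
        s += 1.0 / i;
    double t1 = MPI_Wtime();

    printf("work took %f s (sum = %f), clock resolution = %g s\n",
           t1 - t0, s, MPI_Wtick());
    MPI_Finalize();
    return 0;
}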

POINT TO POINT COMMUNICATION .

POINT-TO-POINT COMMUNICATION
- Communication between 2 and only 2 processes
- One sending and one receiving
- Types:
  - Synchronous send
  - Blocking send / blocking receive
  - Non-blocking send / non-blocking receive
  - Buffered send
  - Combined send/receive
  - "Ready" send

POINT-TO-POINT COMMUNICATION
- Processes can be collected into groups
- Each message is sent in a context, and must be received in the same context
- A group and context together form a communicator
- A process is identified by its rank in the group associated with a communicator
- Messages are sent with an accompanying user-defined integer tag, to assist the receiving process in identifying the message
- MPI_ANY_TAG matches any tag on receive (see the sketch below)
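A hedged sketch (not from the slides) of how the tag and the wild-cards are used on the receiving side: the MPI_Status object reveals the actual source and tag of the message that was matched.

/* Sketch: every non-root rank sends its rank number; the root receives with
   wild-cards and inspects the status for the real source and tag. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (rank == 0) {
        for (int i = 1; i < p; i++) {
            int value;
            MPI_Status status;
            MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            printf("got %d from rank %d (tag %d)\n",
                   value, status.MPI_SOURCE, status.MPI_TAG);
        }
    } else {
        MPI_Send(&rank, 1, MPI_INT, 0, rank /* tag */, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}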

POINT-TO-POINT COMMUNICATION
- How is "data" described?
- How are processes identified?
- How does the receiver recognize messages?
- What does it mean for these operations to complete?


BLOCKING SEND/RECEIVE
int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm communicator)
- buf: pointer to the data to send
- count: number of elements in the buffer
- datatype: the kind of data in the buffer
- dest: rank of the receiver
- tag: the label of the message
- communicator: set of processes involved (e.g. MPI_COMM_WORLD)

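For completeness, a minimal matched pair (a sketch, not part of the original slides): MPI_Recv takes the same kinds of arguments plus an MPI_Status output, with count acting as an upper bound on the message size.

/* Sketch: rank 0 sends one double to rank 1 with tag 0; run with >= 2 processes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double x = 3.14;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status status;
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %f\n", x);
    }
    MPI_Finalize();
    return 0;
}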

BLOCKING SEND/RECEIVE (CONTD.)
- (Diagram in the original slide: the sending process' application buffer on Processor 1 is copied through system buffers to the receiving process' application buffer on Processor 2.)

A WORD ABOUT SPECIFICATION (BUFFERING)
The user does not know whether the MPI implementation:
- copies the buffer into an internal buffer, starts communication, and returns control before all the data are transferred
- creates links between processors, sends the data, and returns control when all the data are sent (but NOT received)
- uses a combination of the above methods

BLOCKING SEND/RECEIVE (CONTD.)
- A blocking send "returns" after it is safe to modify the application buffer
- Safe means that modifications will not affect the data intended for the receive task; it does not imply that the data was actually received
- A blocking send can be synchronous, which means there is handshaking with the receive task to confirm a safe send
- A blocking send can be asynchronous if a system buffer is used to hold the data for eventual delivery to the receiver
- A blocking receive only "returns" after the data has arrived and is ready for use by the program

NON-BLOCKING SEND/RECEIVE
- Return almost immediately
- Simply "request" that the MPI library perform the operation when it is able
- Cannot predict when that will happen
- Request a send/receive and start doing other work!
- Unsafe to modify the application buffer (your variable space) until you know that the non-blocking operation has been completed
- MPI_Isend(&buf, count, datatype, dest, tag, comm, &request)
- MPI_Irecv(&buf, count, datatype, source, tag, comm, &request)

NON-BLOCKING SEND/RECEIVE (CONTD.)
- (Diagram in the original slide: the same application buffer / system buffer data path between the two processors as in the blocking case.)

NON-BLOCKING SEND/RECEIVE (CONTD.)
- To check whether the send/receive operations have completed:
- int MPI_Irecv(void *buf, int count, MPI_Datatype type, int source, int tag, MPI_Comm comm, MPI_Request *req);
- int MPI_Wait(MPI_Request *req, MPI_Status *status);
  - req: input/output, identifier associated with a communication event (initiated by MPI_ISEND or MPI_IRECV)
  - A call to this subroutine causes the code to wait until the communication pointed to by req is complete

NON-BLOCKING SEND/RECEIVE (CONTD.)
- int MPI_Test(MPI_Request *req, int *flag, MPI_Status *status);
- A call to this subroutine sets flag to true if the communication pointed to by req is complete, and sets flag to false otherwise.
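A hedged sketch (not in the slides) of the typical pattern: post the non-blocking calls, do independent work, then call MPI_Wait before touching either buffer. It assumes exactly two processes exchanging one integer.

/* Sketch: overlap communication with computation using Isend/Irecv + Wait. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, sendval, recvval;
    MPI_Request send_req, recv_req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    sendval = rank;

    int partner = (rank == 0) ? 1 : 0;   /* assumes exactly 2 processes */
    MPI_Irecv(&recvval, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &recv_req);
    MPI_Isend(&sendval, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &send_req);

    /* ... do useful work that does not touch sendval or recvval ... */

    MPI_Wait(&recv_req, &status);   /* recvval is now safe to read */
    MPI_Wait(&send_req, &status);   /* sendval is now safe to modify */
    printf("rank %d received %d\n", rank, recvval);

    MPI_Finalize();
    return 0;
}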

STANDARD MODE
- Returns when the sender is free to access and overwrite the send buffer
- It is up to MPI to decide whether outgoing messages will be buffered
- The message might be copied directly into the matching receive buffer, or might be copied into a temporary system buffer
- Message buffering decouples the send and receive operations
- Message buffering can be expensive
- The standard mode send is non-local

SYNCHRONOUS MODE
- Send can be started whether or not a matching receive was posted
- Send completes successfully only if a corresponding receive was already posted and has already started to receive the message
- Blocking send & blocking receive in synchronous mode simulate a synchronous communication
- Synchronous send is non-local

BUFFERED MODE
- Send operation can be started whether or not a matching receive has been posted
- It may complete before a matching receive is posted
- The operation is local
- MPI must buffer the outgoing message
- An error will occur if there is insufficient buffer space
- The amount of available buffer space is controlled by the user

BUFFER MANAGEMENT
- int MPI_Buffer_attach(void *buffer, int size)
  - Provides MPI with a buffer in the user's memory to be used for buffering outgoing messages
- int MPI_Buffer_detach(void *buffer_addr, int *size)
  - Detaches the buffer currently associated with MPI

MPI_Buffer_attach(malloc(BUFFSIZE), BUFFSIZE);
/* a buffer of BUFFSIZE bytes can now be used by MPI_Bsend */
MPI_Buffer_detach(&buff, &size);
/* Buffer size reduced to zero */
MPI_Buffer_attach(buff, size);
/* Buffer of BUFFSIZE bytes available again */
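A hedged usage sketch (not from the slides) tying buffer attachment to MPI_Bsend; BUFFSIZE is an arbitrary illustrative size, padded with MPI_BSEND_OVERHEAD as required per buffered message.

/* Sketch: attach a user buffer, do one buffered send, then detach the buffer. */
#include <mpi.h>
#include <stdlib.h>

#define BUFFSIZE (1024 + MPI_BSEND_OVERHEAD)

int main(int argc, char **argv) {
    int rank, data = 42, size;
    void *buff;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Buffer_attach(malloc(BUFFSIZE), BUFFSIZE);

    if (rank == 0)
        MPI_Bsend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* completes locally */
    else if (rank == 1)
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Buffer_detach(&buff, &size);  /* waits until buffered messages are delivered */
    free(buff);
    MPI_Finalize();
    return 0;
}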

READY MODE
- A send may be started only if the matching receive is already posted
- The user must be sure of this (see the sketch below)
- If the receive is not already posted, the operation is erroneous and its outcome is undefined
- Completion of the send operation does not depend on the status of a matching receive
- Completion merely indicates that the send buffer can be reused
- Ready-send could be replaced by a standard-send with no effect on the behavior of the program other than performance
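A hedged sketch (not in the slides) of one way to satisfy that requirement: the receiver posts its receive first, a barrier then tells the sender the receive exists, and only after that is MPI_Rsend legal.

/* Sketch: ready-mode send after the receive is known to be posted. */
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, data = 7;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1)
        MPI_Irecv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);  /* receive posted first */

    MPI_Barrier(MPI_COMM_WORLD);  /* after this, rank 0 knows the receive is posted */

    if (rank == 0)
        MPI_Rsend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}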

ORDER AND FAIRNESS
- Order:
  - MPI messages are non-overtaking
  - When a sent message matches 2 receive statements, or a receive matches 2 messages, they are matched in the order in which they were posted
  - Message-passing code is deterministic, unless the processes are multi-threaded or the wild-card MPI_ANY_SOURCE is used in a receive statement
- Fairness:
  - MPI does not guarantee fairness
  - Example: task 0 sends a message to task 2. However, task 1 sends a competing message that matches task 2's receive. Only one of the sends will complete.

EXAMPLE OF NON-OVERTAKING MESSAGES
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_BSEND(buf1, count, MPI_REAL, 1, tag, comm, ierr)
    CALL MPI_BSEND(buf2, count, MPI_REAL, 1, tag, comm, ierr)
ELSE    ! rank.EQ.1
    CALL MPI_RECV(buf1, count, MPI_REAL, 0, MPI_ANY_TAG, comm, status, ierr)
    CALL MPI_RECV(buf2, count, MPI_REAL, 0, tag, comm, status, ierr)
END IF

EXAMPLE OF INTERTWINGLED MESSAGES
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_BSEND(buf1, count, MPI_REAL, 1, tag1, comm, ierr)
    CALL MPI_SSEND(buf2, count, MPI_REAL, 1, tag2, comm, ierr)
ELSE    ! rank.EQ.1
    CALL MPI_RECV(buf1, count, MPI_REAL, 0, tag2, comm, status, ierr)
    CALL MPI_RECV(buf2, count, MPI_REAL, 0, tag1, comm, status, ierr)
END IF

DEADLOCK EXAMPLE
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_RECV(recvbuf, count, MPI_REAL, 1, tag, comm, status, ierr)
    CALL MPI_SEND(sendbuf, count, MPI_REAL, 1, tag, comm, ierr)
ELSE    ! rank.EQ.1
    CALL MPI_RECV(recvbuf, count, MPI_REAL, 0, tag, comm, status, ierr)
    CALL MPI_SEND(sendbuf, count, MPI_REAL, 0, tag, comm, ierr)
END IF
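One common fix (sketched here in C, not part of the original slides) is MPI_Sendrecv, which lets the library pair the send and receive so the exchange cannot deadlock the way the two blocking receives above do.

/* Sketch: deadlock-free exchange between ranks 0 and 1 using MPI_Sendrecv. */
#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    double sendbuf = 1.0, recvbuf;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int partner = (rank == 0) ? 1 : 0;   /* assumes exactly 2 processes, as in the example */
    MPI_Sendrecv(&sendbuf, 1, MPI_DOUBLE, partner, 0,
                 &recvbuf, 1, MPI_DOUBLE, partner, 0,
                 MPI_COMM_WORLD, &status);

    MPI_Finalize();
    return 0;
}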

EXAMPLE OF BUFFERING
CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_SEND(buf1, count, MPI_REAL, 1, tag, comm, ierr)
    CALL MPI_RECV(buf2, count, MPI_REAL, 1, tag, comm, status, ierr)
ELSE    ! rank.EQ.1
    CALL MPI_SEND(sendbuf, count, MPI_REAL, 0, tag, comm, ierr)
    CALL MPI_RECV(recvbuf, count, MPI_REAL, 0, tag, comm, status, ierr)
END IF

COLLECTIVE COMMUNICATIONS .

COLLECTIVE ROUTINES
- Collective routines provide a higher-level way to organize a parallel program
- Communications involve a group of processes in a communicator
- Each process executes the same communication operations
- Three classes of operations: synchronization, data movement, collective computation
- Tags are not used; different communicators deliver similar functionality
- Groups and communicators can be constructed "by hand" or using topology routines
- No non-blocking collective operations

COLLECTIVE ROUTINES (CONTD.)
- int MPI_Barrier(MPI_Comm comm)
- Stops processes until all processes within a communicator reach the barrier
- Occasionally useful in measuring performance

COLLECTIVE ROUTINES (CONTD.)
- Broadcast: int MPI_Bcast(void *buf, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
- One-to-all communication: the same data is sent from the root process to all the others in the communicator
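A short usage sketch (not in the slides): only the root initializes the value, and after the call every rank has it.

/* Sketch: rank 0 broadcasts n to every process in MPI_COMM_WORLD. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, n = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        n = 1024;                               /* only the root knows the value initially */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d sees n = %d\n", rank, n);   /* every rank now prints 1024 */

    MPI_Finalize();
    return 0;
}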

COLLECTIVE ROUTINES (CONTD.)
- Reduction: the reduction operation allows you to:
  - Collect data from each process
  - Reduce the data to a single value
  - Store the result on the root process, or store the result on all processes
- Reduction functions work with arrays
- Other operations: product, min, max, and, …
- Internally it is usually implemented with a binary tree

COLLECTIVE ROUTINES (CONTD.)
int MPI_Reduce/MPI_Allreduce(void *snd_buf, void *rcv_buf, int count, MPI_Datatype type, MPI_Op op, int root, MPI_Comm comm)
- snd_buf: input array
- rcv_buf: output array
- count: number of elements in snd_buf and rcv_buf
- type: MPI type of snd_buf and rcv_buf
- op: parallel operation to be performed
- root: MPI rank of the process storing the result (MPI_Allreduce takes no root argument)
- comm: communicator of the processes involved in the operation
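A usage sketch (not from the slides): every rank contributes one value and rank 0 ends up with the sum; with MPI_Allreduce (dropping the root argument) every rank would get the sum.

/* Sketch: sum one value from every process onto rank 0 with MPI_SUM. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double local, global = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    local = rank + 1.0;                        /* each rank's contribution */

    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global sum = %f\n", global);   /* 1 + 2 + ... + p */

    MPI_Finalize();
    return 0;
}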

MPI OPERATIONS
MPI_OP        Operator
MPI_MIN       Minimum
MPI_SUM       Sum
MPI_PROD      Product
MPI_MAX       Maximum
MPI_LAND      Logical and
MPI_BAND      Bitwise and
MPI_LOR       Logical or
MPI_BOR       Bitwise or
MPI_LXOR      Logical xor
MPI_BXOR      Bitwise xor
MPI_MAXLOC    Max value and location
MPI_MINLOC    Min value and location

COLLECTIVE ROUTINES (CONTD.)

Learn by Examples .

Parallel Trapezoidal Rule
Output: estimate of the integral from a to b of f(x) using the trapezoidal rule and n trapezoids.
Algorithm:
1. Each process calculates "its" interval of integration.
2. Each process estimates the integral of f(x) over its interval using the trapezoidal rule.
3a. Each process != 0 sends its integral to 0.
3b. Process 0 sums the calculations received from the individual processes and prints the result.
Notes:
1. f(x), a, b, and n are all hardwired.
2. The number of processes (p) should evenly divide the number of trapezoids (n = 1024).

Parallelizing the Trapezoidal Rule
#include <stdio.h>
#include "mpi.h"

main(int argc, char** argv) {
    int my_rank;          /* My process rank */
    int p;                /* The number of processes */
    double a = 0.0;       /* Left endpoint */
    double b = 1.0;       /* Right endpoint */
    int n = 1024;         /* Number of trapezoids */
    double h;             /* Trapezoid base length */
    double local_a;       /* Left endpoint my process */
    double local_b;       /* Right endpoint my process */
    int local_n;          /* Number of trapezoids for my calculation */
    double integral;      /* Integral over my interval */
    double total;         /* Total integral */
    int source;           /* Process sending integral */
    int dest = 0;         /* All messages go to 0 */
    int tag = 0;
    MPI_Status status;

Continued…
    double Trap(double local_a, double local_b, int local_n, double h);  /* Calculate local integral */

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double elapsed_time = -MPI_Wtime();

    h = (b-a)/n;       /* h is the same for all processes */
    local_n = n/p;     /* So is the number of trapezoids */

    /* Length of each process' interval of integration = local_n*h.
       So my interval starts at: */
    local_a = a + my_rank*local_n*h;
    local_b = local_a + local_n*h;
    integral = Trap(local_a, local_b, local_n, h);

Continued…
    /* Add up the integrals calculated by each process */
    if (my_rank == 0) {
        total = integral;
        for (source = 1; source < p; source++) {
            MPI_Recv(&integral, 1, MPI_DOUBLE, source, tag, MPI_COMM_WORLD, &status);
            total = total + integral;
        }  /* End for */
    } else
        MPI_Send(&integral, 1, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);
    elapsed_time += MPI_Wtime();

    /* Print the result */
    if (my_rank == 0) {
        printf("With n = %d trapezoids, our estimate\n", n);
        printf("of the integral from %lf to %lf = %lf\n", a, b, total);
        printf("time taken: %lf\n", elapsed_time);
    }

Continued…
    /* Shut down MPI */
    MPI_Finalize();
}  /* main */

double Trap(double local_a, double local_b, int local_n, double h) {
    double integral;      /* Store result in integral */
    double x;
    int i;
    double f(double x);   /* function we're integrating */

    integral = (f(local_a) + f(local_b))/2.0;
    x = local_a;
    for (i = 1; i <= local_n-1; i++) {
        x = x + h;
        integral = integral + f(x);
    }
    integral = integral*h;
    return integral;
}  /* Trap */

Continued…
double f(double x) {
    double return_val;
    /* Calculate f(x).  Store calculation in return_val. */
    return_val = 4 / (1+x*x);
    return return_val;
}  /* f */

Program 2
Each process other than root generates a random value less than 1 and sends it to root. Root sums up the values and displays the sum.

#include <stdio.h>
#include <string.h>
#include <mpi.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv) {
    int myrank, p;
    int tag = 0, source, dest = 0;
    int i;
    double randIn, randOut;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    MPI_Comm_size(MPI_COMM_WORLD, &p);
    if (myrank == 0)  /* I am the root */
    {
        double total = 0, average = 0;
        for (source = 1; source < p; source++) {
            MPI_Recv(&randIn, 1, MPI_DOUBLE, source, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            printf("Message from root: From %d received number %f\n", source, randIn);
            total += randIn;
        }  /* End for */
        average = total/(p-1);
    }  /* End if */

    else  /* I am other than root */
    {
        srand48((long int) myrank);
        randOut = drand48();
        printf("randout=%f, myrank=%d\n", randOut, myrank);
        MPI_Send(&randOut, 1, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
    }  /* End If-Else */
    MPI_Finalize();
    return 0;
}

MPI REFERENCES
- The Standard itself: http://www.mpi-forum.org – all MPI official releases, in both postscript and HTML
- Books:
  - Using MPI: Portable Parallel Programming with the Message-Passing Interface, by Gropp, Lusk, and Skjellum, MIT Press, 2nd Edition, 1999. Also Using MPI-2, by Gropp, Lusk, and Thakur.
  - MPI: The Complete Reference, 2 vols., MIT Press, 1999.
  - Designing and Building Parallel Programs, by Ian Foster, Addison-Wesley, 1995.
  - Parallel Programming with MPI, by Peter Pacheco, Morgan-Kaufmann, 1997.
- Other information on the Web: http://www.mcs.anl.gov/mpi
- For man pages of Open MPI on the web: http://www.open-mpi.org/doc/v1.4/; also: apropos mpi

THANK YOU .
