
Message Passing Interface

Unit III

Parallel Computing & Algorithms

Srinivasa Ramanujan Centre

SASTRA University

June 2009
• MPI, the Message Passing Interface, is a standardized
and portable message-passing system
• Standardization of message passing in a distributed-memory
environment, i.e., scalable parallel computers (SPC) with
distributed memory (DM) and networks of workstations (NOWs)
• The MPI standard defines the syntax and semantics of a
core of library routines for FORTRAN 77 and C
• The MPI standard was developed by a collaborative effort of 80
people from 40 organizations through the MPI Forum
(representing vendors of parallel systems, industrial
users, industrial and national research laboratories, and
universities)
• The MPI draft was released in Nov 1993 at the Supercomputing
Conference. MPI 1.0 was released in June 1994.
MPI intro
• MPI has been strongly influenced by work at the
IBM T. J. Watson Research Center, Intel's
NX/2, Express, nCUBE's Vertex, p4, and
PARMACS. Other important contributions have
come from Zipcode, Chimp, PVM, Chameleon,
and PICL.
• The MPI Forum identified some critical
shortcomings of existing message-passing
systems, e.g., complex data layouts or support
for modularity and safe communication.
• The major goal of MPI, as with most standards, is
a degree of portability across different machines.
Supported Platform
• Runs on
– distributed-memory (DM) parallel computers
– shared-memory (SM) parallel computers
– as a set of processes on a single workstation
• Develop on a single workstation and deploy on an SPC for
the production run.
• Runs transparently on heterogeneous systems –
supports a virtual computing model hiding many
architectural differences
• An MPI implementation will automatically do any necessary
data conversion and utilize the correct communications protocol
• Portability is central, but performance is not sacrificed
• More portable than vendor specific systems.
Efficient Implementation
• An important design goal of MPI was to allow
efficient implementations across machines of
differing characteristics.
– MPI can be easily implemented on systems that
buffer messages at the sender / receiver or do no
buffering at all.
• Implementations can take advantage of specific
features of the communication subsystem of
various machines.
– On machines with intelligent communication
coprocessors, much of the message-passing protocol
can be offloaded to this coprocessor. On other
systems, most of the communication code is executed
by the main processor.
Opaque Objects
• By hiding the details of how MPI-specific
objects are represented, each
implementation is free to do whatever is
best under the circumstances.
Minimize work
• avoid a requirement for large amounts of extra
information with each message
• avoid a need for complex encoding or decoding
of message headers
• avoid extra computation or tests in critical
routines, since these can degrade performance
• avoid the need for extra copying and buffering
of data
– data can be moved from the user's memory directly to
the wire, and received directly from the wire into the
receiver's memory
• Encourage the reuse of previous computations
through persistent communication requests and
caching of attributes on communicators
Overlap computation with communication

• achieved by the use of non-blocking
communication calls, which separate the
initiation of a communication from its completion
• hides communication latencies
• Sub-groups of processes
• Collective communication
• Topology
– Cartesian and graph topologies
• Reliability of sending and receiving messages
Goals summarized
• to develop a widely used standard for writing
message-passing programs
• a practical, portable, efficient, and flexible
standard for message passing
• Efficient communication
– avoid memory-to-memory copying; overlap computation with communication
– Offload to communication co-processors when available
• Support heterogeneous environments
• C & Fortran 77 bindings – the semantics of the MPI
APIs are language independent
• Reliable communication
• APIs not too different from PVM, NX, Express,
p4, etc.
• MPI APIs are thread safe.
Supported Architectures
• runs on distributed-memory multicomputers,
shared-memory multiprocessors, networks of
workstations, and combinations of all of these
• MPI provides many features intended to improve
performance on scalable parallel computers with
specialized interprocessor communication hardware
– Thus we expect that native high-performance
implementations of MPI will be provided on such machines
• At the same time, implementations of MPI on top
of standard Unix interprocessor communication
protocols will provide portability to workstation
clusters and heterogeneous networks of workstations
What is included in MPI
• Point-to-point communication
• Collective operations
• Process groups
• Communication domains
• Process topologies
• Environmental Management and inquiry
• Profiling interface
• Bindings for Fortran and C languages
P2P communication
• the transmittal of data between a pair of
processes, one side sending, the other receiving
• Almost all the constructs of MPI are built
around the point-to-point operations.
P2P arguments
• MPI procedures are specified using a language
independent notation.
• The arguments of procedure calls are marked
as IN, OUT or INOUT.
• IN: the call uses but does not update an
argument marked IN
• OUT: the call may update an argument marked
OUT
• INOUT: the call both uses and updates an
argument marked INOUT
Message Type
• MPI provides a set of send and receive functions that
allow the communication of typed data with an
associated tag.
– Typing of the message contents is necessary for heterogeneous support:
the type information is needed so that correct data representation
conversions can be performed as data is sent from one architecture to another.
Message Tag
• The tag allows selectivity of messages at the receiving end: one can
receive on a particular tag, or one can wildcard this quantity, allowing
reception of messages with any tag
• Integer value
• Used by the application to distinguish the messages
• Range 0..UB
• UB is implementation dependent. Typical value no less than 32767.
• Can be found by querying the value of the attribute MPI_TAG_UB
Communication domain
• a communicator serves to define a set of
processes that can be contacted.
• Each such process is labeled by a
process rank.
• Process ranks are integers and are
discovered by inquiry to a communicator.
• MPI_COMM_WORLD is a default
communicator provided upon startup that
defines an initial communication domain
for all the processes that participate in the
computation.
Example: p0 sending a message to p1
MPI_Send() args – blocking send
MPI_Send (msg, strlen(msg)+1, MPI_CHAR,
1, tag, MPI_COMM_WORLD);
• Arg1 : the outgoing data is to be taken from msg
• Arg2 : it consists of strlen(msg)+1 entries
(+1 for the terminating ‘\0’)
• Arg3 : each entry is of type MPI_CHAR
• Arg4 : specifies the message destination, which
is process 1
• Arg5 : specifies the message tag
• Arg6 : communicator that specifies a
communication domain for this communication
MPI_Recv() args - blocking receive
MPI_Recv (msg,20,MPI_CHAR, 0, tag,
MPI_COMM_WORLD, &status)
• Arg1 : Place incoming data in msg.
• Arg2 : max size of incoming data is 20
• Arg3 : type MPI_CHAR
• Arg4 : receive message from process 0
• Arg5 : specifies the message tag to retrieve
• Arg6 : communicator that specifies a
communication domain for this communication
• Arg7 : variable status set by MPI_Recv()
MPI_Send()/MPI_Recv() – blocking send/receive

• The send call blocks until the send buffer
can be reclaimed, i.e., until the sending
process can safely overwrite the contents
of msg
• the receive function blocks until the
receive buffer actually contains the
contents of the message
Modes of p2p communication
• The mode allows one to choose the semantics
of the send operation and in effect to influence
the underlying protocol of the transfer of data.
• Both blocking and non-blocking
communications have modes
• 4 Modes
– Standard mode
– Buffered mode
– Synchronous mode
– Ready mode
4 modes of p2p communication
– Standard mode
• the completion of the send does not necessarily mean that the
matching receive has started, and no assumption should be made in
the application program about whether the outgoing data is buffered
by MPI
– Buffered mode
• the user can guarantee that a certain amount of buffering space is
available; the catch is that the space must be explicitly provided by
the application program
– Synchronous mode
• a rendezvous (ron’de-voo – a meeting place fixed beforehand)
semantics between sender and receiver is used
– Ready mode
• this allows the user to exploit extra knowledge to simplify the
protocol and potentially achieve higher performance: in a ready-
mode send, the user asserts that the matching receive has already
been posted
P2P Blocking Message Passing Routines
• MPI_Send (&buf,count,datatype,dest,tag,comm)
Routine returns only after the application buffer in the sending task is free
for reuse.
• MPI_Recv (&buf,count,datatype,source,tag,comm,&status)
Receive a message and block until the requested data is available in the
application buffer in the receiving task.
• MPI_Ssend (&buf,count,datatype,dest,tag,comm)
Send a message and block until the application buffer in the sending task is
free for reuse and the destination process has started to receive the message.
• MPI_Bsend (&buf,count,datatype,dest,tag,comm)
Buffered blocking send: permits the programmer to allocate the required
amount of buffer space into which data can be copied until it is delivered.
• MPI_Rsend (&buf,count,datatype,dest,tag,comm)
Should only be used if the programmer is certain that the matching receive
has already been posted
• MPI_Sendrecv (&sendbuf,sendcount,sendtype,dest,sendtag,
&recvbuf,recvcount,recvtype,source,recvtag,comm,&status)
Send a message and post a receive before blocking. Will block until the
sending application buffer is free for reuse and until the receiving application
buffer contains the received message.
User defined buffering used along
with MPI_Bsend()
• MPI_Buffer_attach (&buffer,size)
MPI_Buffer_detach (&buffer,size)
– Used by programmer to allocate/deallocate
message buffer space to be used by the
MPI_Bsend routine.
– The size argument is specified in actual
data bytes - not a count of data elements.
– Only one buffer can be attached to a
process at a time.
P2P Non-Blocking Message Passing Routines
• MPI_Isend (&buf,count,datatype,dest,tag,comm,&request)
• MPI_Irecv (&buf,count,datatype,source,tag,comm,&request)
– Identifies an area in memory to serve as a send/recv buffer.
– Processing continues immediately without waiting for the
message to be copied out/in from the application buffer.
– A communication request handle is returned for handling the
pending message status.
– The program should not use the application buffer until
subsequent calls to MPI_Wait or MPI_Test indicate that
the non-blocking send/recv has completed.
• MPI_Issend (&buf,count,datatype,dest,tag,comm,&request)
– Non-blocking synchronous send.
• MPI_Ibsend (&buf,count,datatype,dest,tag,comm,&request)
– Non-blocking buffered send.
• MPI_Irsend (&buf,count,datatype,dest,tag,comm,&request)
– Non-blocking ready send.
• MPI_Wait (&request,&status)
MPI_Waitany (count,&array_of_requests,&index,&status)
MPI_Waitall (count,&array_of_requests,&array_of_statuses)
MPI_Waitsome (incount,&array_of_requests,&outcount,
&array_of_indices,&array_of_statuses)
– blocks until a specified non-blocking send or
receive operation has completed.
• MPI_Test (&request,&flag,&status)
MPI_Testany (count,&array_of_requests,&index,&flag,&status)
MPI_Testall (count,&array_of_requests,&flag,&array_of_statuses)
MPI_Testsome (incount,&array_of_requests,&outcount,
&array_of_indices,&array_of_statuses)
– checks the status of a specified non-blocking send or
receive operation.
– The "flag" parameter is returned logical true (1) if the
operation has completed, and logical false (0) if not.
• MPI_Probe (source,tag,comm,&status)
– Performs a blocking test for a message.
– The "wildcards"
may be used to test for a message from any source
or with any tag.
• MPI_Iprobe (source,tag,comm,&flag,&status)
– Performs a non-blocking test for a message.
– The integer "flag" parameter is returned logical true
(1) if a message has arrived, and logical false (0) if not.
Collective Communications
• All-or-none property
• Collective communication must involve all
processes in the scope of a communicator
• All processes are, by default, members of
the communicator MPI_COMM_WORLD.
• It is the programmer's responsibility to
ensure that all processes within a
communicator participate in any collective operation
Types of Collective Operations
• Synchronization - processes wait until all
members of the group have reached the
synchronization point.
• Data Movement - broadcast, scatter/gather, all
to all.
• Collective Computation (reductions) - one
member of the group collects data from the
other members and performs an operation (min,
max, add, multiply, etc.) on that data.
Programming Considerations
and Restrictions
• Collective operations are blocking
• Collective communication routines do not take
message tag arguments.
• Collective operations within subsets of
processes are accomplished by first partitioning
the processes into new groups and then attaching
the new groups to new communicators
• Can only be used with MPI predefined datatypes
- not with MPI Derived Data Types.
Collective Communication Routines
• MPI_Barrier (comm)
– Creates a barrier synchronization in a group. Each task, when reaching the MPI_Barrier call, blocks until all
tasks in the group reach the same MPI_Barrier call.
• MPI_Bcast (&buffer,count,datatype,root,comm)
– Broadcasts (sends) a message from the process with rank "root" to all other processes in the group.
• MPI_Scatter (&sendbuf,sendcnt,sendtype,&recvbuf,recvcnt,recvtype,root,comm)
– Distributes distinct messages from a single source task to each task in the group.
• MPI_Gather (&sendbuf,sendcnt,sendtype,&recvbuf,recvcnt,recvtype,root,comm)
– Gathers distinct messages from each task in the group to a single destination task. This routine is the
reverse operation of MPI_Scatter.
• MPI_Reduce (&sendbuf,&recvbuf,count,datatype,op,root,comm)
– Applies a reduction operation on all tasks in the group and places the result in one task.
• MPI_Allgather (&sendbuf,sendcount,sendtype,&recvbuf,recvcount,recvtype,comm)
– Concatenation of data to all tasks in a group. Each task in the group, in effect, performs a one-to-all
broadcasting operation within the group.
• MPI_Allreduce (&sendbuf,&recvbuf,count,datatype,op,comm)
– Applies a reduction operation and places the result in all tasks in the group. This is equivalent to an
MPI_Reduce followed by an MPI_Bcast.
• MPI_Reduce_scatter (&sendbuf,&recvbuf,recvcounts,datatype,op,comm)
– First does an element-wise reduction on a vector across all tasks in the group. Next, the result vector is split
into disjoint segments and distributed across the tasks. This is equivalent to an MPI_Reduce followed by an
MPI_Scatter operation.
• MPI_Alltoall (&sendbuf,sendcount,sendtype,&recvbuf,recvcount,recvtype,comm)
– Each task in a group performs a scatter operation, sending a distinct message to all the tasks in the group
in order by index.
• MPI_Scan (&sendbuf,&recvbuf,count,datatype,op,comm)
– Performs a scan operation with respect to a reduction operation across a task group.
C Language - Collective Communications
#include "mpi.h" #include <stdio.h>
#define SIZE 4
int main(argc,argv)
int argc; char *argv[];
int numtasks, rank, sendcount, recvcount, source;
float sendbuf[SIZE][SIZE] = {
{1.0, 2.0, 3.0, 4.0},
{5.0, 6.0, 7.0, 8.0},
{9.0, 10.0, 11.0, 12.0},
{13.0, 14.0, 15.0, 16.0}

float recvbuf[SIZE];

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
if (numtasks == SIZE) {
source = 1;
sendcount = SIZE;
recvcount = SIZE;
MPI_Scatter(sendbuf,sendcount,MPI_FLOAT,recvbuf,recvcount, MPI_FLOAT, source,
printf("rank= %d Results: %f %f %f %f\n",rank,recvbuf[0], recvbuf[1],recvbuf[2],recvbuf[3]);
} else
printf("Must specify %d processors. Terminating.\n",SIZE);
Sample program output
rank= 0 Results: 1.000000 2.000000 3.000000 4.000000
rank= 1 Results: 5.000000 6.000000 7.000000 8.000000
rank= 2 Results: 9.000000 10.000000 11.000000 12.000000
rank= 3 Results: 13.000000 14.000000 15.000000 16.000000
Process Topologies
• Logical process arrangement is Virtual Topology.
• Virtual Topology Vs Physical topology
• Used for efficient mapping of processes to physical processors
• Virtual topology describes a mapping/ordering of MPI
processes into a geometric "shape".
• The two main types of topologies supported by MPI are
Cartesian (grid) and Graph
• MPI topologies are virtual - there may be no relation
between the physical structure of the parallel machine
and the process topology.
• Virtual topologies are built upon MPI communicators and groups.
• Must be "programmed" by the application developer.
Why we need Topologies
• Some parallel algorithms and data layouts suit a specific
topology, e.g., mesh – matrix computations, tree – reduction and
broadcast, etc.
• Virtual topologies may be useful for applications with
specific communication patterns - patterns that match
an MPI topology structure.
• A particular implementation may optimize process
mapping based upon the physical characteristics of
a given parallel machine.
• Provide a convenient naming mechanism of processes
of group.
• In many parallel applications a linear ranking of
processes does not adequately reflect the logical
communication pattern of the processes.
– which is usually determined by the underlying problem
geometry and the numerical algorithm used.
Graph Vs Cart
• Graph topology is sufficient for all applications
• In many applications the graph structure is regular
• The detailed set up of the graph would be inconvenient
for the user and might be less efficient at run time.
• A large fraction of all parallel applications use process
topologies like rings, two- or higher-dimensional grids, or
tori.
• The mapping of grids and tori is generally an easier
problem than that of general graphs.
– completely defined by the number of dimensions and the
number of processes in each coordinate direction
Cart topology
User Defined Data types and
Process Topologies
Environmental Management
The MPI Profiling Interface
• MPI: The Complete Reference –
Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, Jack Dongarra