Dr. Rajkumar Buyya


Message-Passing Programming with MPI
Outline

- Introduction to Message Passing Environments
- Hello World MPI Program
- Compiling and Running MPI programs
  - On interactive clusters and batch clusters
- Elements of the Hello World Program
- MPI Routines Listing
- Communication in MPI programs
- Summary
Message-Passing Programming Paradigm

- Each processor in a message-passing program runs a sub-program:
  - written in a conventional sequential language
  - all variables are private
  - communicate via special subroutine calls

[Figure: multiple processes, each with private memory and variables, communicating by passing messages over an interconnection network]
SPMD: A dominant paradigm for writing data parallel applications

main(int argc, char **argv)
{
    if (process is assigned Master role)
    {
        /* Assign work and coordinate workers and collect results */
        MasterRoutine(/*arguments*/);
    }
    else /* it is a worker process */
    {
        /* Interact with master and other workers. Do the work and
           send results to the master */
        WorkerRoutine(/*arguments*/);
    }
}
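In MPI this master/worker split is normally made on the process rank; below is a minimal self-contained sketch of the pattern (taking rank 0 as the master is a convention, not a requirement, and the printed messages are placeholders for real master/worker routines):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int numtasks, rank;

    MPI_Init(&argc, &argv);                      /* every process runs the same program */
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);    /* how many processes in total */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);        /* which one am I */

    if (rank == 0)                               /* rank 0 plays the master role */
        printf("Master coordinating %d processes\n", numtasks);
    else                                         /* every other rank is a worker */
        printf("Worker %d doing its share of the work\n", rank);

    MPI_Finalize();
    return 0;
}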
Messages

- Messages are packets of data moving between sub-programs.
- The message passing system has to be told the following information:
  - Sending processor
  - Source location
  - Data type
  - Data length
  - Receiving processor(s)
  - Destination location
  - Destination size
Messages

- Access:
  - Each sub-program needs to be connected to a message passing system
- Addressing:
  - Messages need to have addresses to be sent to
- Reception:
  - It is important that the receiving process is capable of dealing with the messages it is sent
- A message passing system is similar to:
  - Post office, phone line, fax, e-mail, etc.
- Message Types:
  - Point-to-Point, Collective, Synchronous (telephone) / Asynchronous (postal)
Message Passing Systems and MPI - www.mpi-forum.org

- Initially each manufacturer developed their own message passing interface
  - Wide range of features, often incompatible
- The MPI Forum brought together several vendors and users of HPC systems from the US and Europe to overcome the above limitations.
- It produced a document defining a standard, called the Message Passing Interface (MPI), derived from the experience and common features/issues addressed by many message-passing libraries. It aimed:
  - to provide source-code portability
  - to allow efficient implementation
  - to provide a high level of functionality
  - to support heterogeneous parallel architectures
  - to support parallel I/O (in MPI 2.0)
- MPI 1.0 contains over 115 routines/functions that can be grouped into 8 categories.
General MPI Program Structure

[Figure: typical layout of an MPI program - MPI include file, declarations, serial code, initialize the MPI environment, do work and make message-passing calls, terminate the MPI environment, serial code]
MPI programs

- MPI is a library - there are NO language changes
- Header Files
  - C: #include <mpi.h>
- MPI Function Format
  - C: error = MPI_Xxxx(parameter, ...);
       MPI_Xxxx(parameter, ...);
Example - C

#include <mpi.h>

/* other header files and declarations */

int main(int argc, char **argv)
{
    /* no MPI calls before this point */
    MPI_Init(&argc, &argv);

    /* ... do work and make message-passing calls ... */

    MPI_Finalize();
    /* no MPI calls after this point */
    return 0;
}
MPI helloworld.c

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int numtasks, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("Hello World from process %d of %d\n", rank, numtasks);

    MPI_Finalize();
    return 0;
}
MPI Programs Compilation and Execution
Manjra: RIDS Lab Linux Cluster

Π   ΠMaster: manjra.cs.mu.oz.au


  

 ΠInternal worker nodes:
Π
   Πnode1
Π!"  # Πnode2
Π $"       Π....
Π% &' Πnode13
Π()*+ ,,# - 
Π(  &
.  /0
1 2 
Π3 4   ! !0
Π52 2 !06 4  
   2  6
Π|  
7
Π!"  #
Π7$"2  4
Π% &'
Π(  &
.  /0
1 2 


"
# ! 
How Manjra cluster looks

- Front View
- Back View

A snapshot of the Manjra cluster
Compile and Run Commands

- Compile:
  - manjra> mpicc helloworld.c -o helloworld
- Run:
  - manjra> mpirun -np 3 helloworld [hosts picked from configuration]
  - manjra> mpirun -np 3 -machinefile machines.list helloworld
- The file machines.list contains the node list:
  - manjra.cs.mu.oz.au
  - node1
  - node2
  - ..
  - node6
  - node13
- Some nodes may not work today if they have failed!
Sample Run and Output

- A Run with 3 Processes:
  - manjra> mpirun -np 3 -machinefile machines.list helloworld
    Hello World from process 0 of 3
    Hello World from process 1 of 3
    Hello World from process 2 of 3
- A Run by default:
  - manjra> helloworld
    Hello World from process 0 of 1
Sample Run and Output

- A Run with 6 Processes:
  - manjra> mpirun -np 6 -machinefile machines.list helloworld
    Hello World from process 0 of 6
    Hello World from process 3 of 6
    Hello World from process 1 of 6
    Hello World from process 5 of 6
    Hello World from process 4 of 6
    Hello World from process 2 of 6
- Note: Process execution need not be in process number order.
Sample Run and Output

- A Run with 6 Processes:
  - manjra> mpirun -np 6 -machinefile machines.list helloworld
    Hello World from process 0 of 6
    Hello World from process 3 of 6
    Hello World from process 1 of 6
    Hello World from process 2 of 6
    Hello World from process 5 of 6
    Hello World from process 4 of 6
- Note: the process output order changes. For each run, the process-to-machine mapping can be different, and the machines may carry different loads; hence the difference.
Running Applications using PBS (Portable Batch System) on the Manjra cluster
PBS

- PBS is a batch system - jobs get submitted to a queue
- The job is a shell script to execute your program
- The shell script can contain job management instructions (note that these instructions can also be given on the command line)
- PBS will allocate your job to some other computer, log in as you, and execute your script; i.e. your script must contain cd's or absolute references to access files (or Globus objects)
- Useful PBS commands:
  - qsub - submits a job
  - qstat - monitors status
  - qdel - deletes a job from a queue
PBS directives

- Some PBS directives to insert at the start of your shell script (a sketch of a typical directive header is shown below)
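As a rough illustration only, a typical directive header might look like the following sketch; the job name, queue, resource values and output file names are placeholders, and the exact set of directives depends on the local PBS installation:

#!/bin/bash
#PBS -N helloworld            # job name
#PBS -q workq                 # queue to submit to (site-specific)
#PBS -l nodes=4               # number of nodes requested
#PBS -l walltime=00:10:00     # maximum wall-clock time
#PBS -o hello.out             # file for standard output
#PBS -e hello.err             # file for standard error
cd $PBS_O_WORKDIR             # start in the directory qsub was run from
mpirun -np 4 ./helloworld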
Manjra and PBS

- <manjra.cs.mu.oz.au> runs a batch system called PBS:
  - You submit a script telling the system how to run your job
  - The script requests the number of nodes in DEDICATED mode
  - The batch system is PBS
mpich on manjra

- Run with:
  - qsub <jobscript>
- where jobscript is a shell script containing the #PBS directives and the mpirun command (see the PBS Script slide below)
PBS Script

> [raj@manjra mpi]$ cat hello.bat
  cd mpi
  /usr/local/mpich/mpich-1.2.5.2/bin/mpirun -np 5 helloworld-hostname

> [raj@manjra mpi]$ cat hello.sh
  #!/bin/bash
  cd /home/mpi678-2010/mpi
  mpirun -np 5 ./helloworld
Submitting to a Queue

- [raj@manjra mpi]$ qsub hello.bat
  2811.manjra.cs.mu.oz.au

- Or, with -V to export your environment variables to the job:
- [raj@manjra mpi]$ qsub -V hello.sh
  2811.manjra.cs.mu.oz.au
Q Status

- [raj@manjra mpi]$ qstat
[qstat output: Job id, Name, User, Time Use, S (state) and Queue columns for the submitted jobs]
Output - Result/Error

- Output:
  - hello.bat.oXXXXX
- Error, if any:
  - hello.bat.eXXXXX
- Where XXXXX is the ID assigned to your job by PBS
References

- PBS User Guide:
  - http://www.doesciencegrid.org/public/pbs
More on MPI Program Elements and Error Checking
Handles

- MPI controls its own internal data structures
- MPI releases 'handles' to allow programmers to refer to these
- 'C' handles are of distinct typedef'd types, and arrays are indexed from 0
- Some arguments can be of any type - in C these are declared as void *
Initializing MPI

- The first MPI routine called in any MPI program must be MPI_Init.
- The C version accepts the arguments to main:
  int MPI_Init(int *argc, char ***argv);
- MPI_Init must be called by every MPI program
- Making multiple MPI_Init calls is erroneous
- MPI_Initialized is an exception to the first rule (see the sketch below)
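A small sketch of how MPI_Initialized can be used to test whether MPI_Init has already been called:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int flag;

    MPI_Initialized(&flag);          /* the one call that is legal before MPI_Init */
    if (!flag)
        MPI_Init(&argc, &argv);      /* initialise only if it has not been done yet */

    /* ... rest of the program ... */

    MPI_Finalize();
    return 0;
}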
MPI_COMM_WORLD

- MPI_Init defines a communicator called MPI_COMM_WORLD for every process that calls it.
- All MPI communication calls require a communicator argument
- MPI processes can only communicate if they share a communicator.
- A communicator contains a group, which is a list of processes
- Each process has its rank within the communicator
- A process can have several communicators
Communicators

- MPI uses objects called communicators that define which collection of processes communicate with each other.
- Every process has a unique integer identifier (a rank) assigned by the system when the process initialises. A rank is sometimes called a process ID.
- Processes can request information from a communicator:
  - MPI_Comm_rank(MPI_Comm comm, int *rank)
    - Returns the rank of the calling process in comm
  - MPI_Comm_size(MPI_Comm comm, int *size)
    - Returns the size of the group in comm
Finishing up

- An MPI program should call MPI_Finalize when all communications have completed.
- Once called, no other MPI calls can be made
- Aborting:
  - MPI_Abort(comm, errcode)
  - Attempts to abort all processes listed in comm
  - If comm = MPI_COMM_WORLD, the whole program terminates
Hello World with Error Check
Display Hostname of MPI Process

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int numtasks, rank;
    int resultlen;
    static char mpi_hostname[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(mpi_hostname, &resultlen);

    printf("Hello World from process %d of %d running on %s\n",
           rank, numtasks, mpi_hostname);

    MPI_Finalize();
    return 0;
}
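The listing above does not actually test return codes; a common error-checking pattern (a sketch, not part of the original listing) is to compare each return value with MPI_SUCCESS and call MPI_Abort on failure:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rc, numtasks, rank;

    rc = MPI_Init(&argc, &argv);
    if (rc != MPI_SUCCESS) {                 /* initialisation failed */
        printf("Error starting MPI program. Terminating.\n");
        MPI_Abort(MPI_COMM_WORLD, rc);       /* bring down all processes */
    }

    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Process %d of %d started\n", rank, numtasks);

    MPI_Finalize();
    return 0;
}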
MPI Routines

MPI Routines - C and Fortran

- Environment Management
- Point-to-Point Communication
- Collective Communication
- Process Group
- Communicators
- Derived Types
- Virtual Topology
- Miscellaneous Routines
Environment Management Routines
Point-to-Point Communication

- The simplest form of message passing
- One process sends a message to another
- Several variations on how sending a message can interact with execution of the sub-program
Point-to-Point variations

- Synchronous Sends
  - provide information about the completion of the message
  - e.g. fax machines
- Asynchronous Sends
  - only know when the message has left
  - e.g. post cards
- Blocking operations
  - only return from the call when the operation has completed
- Non-blocking operations
  - return straight away - can test/wait later for completion
Point-to-Point Communication Routines
Collective Communications

- Collective communication routines are higher-level routines involving several processes at a time
- They can be built out of point-to-point communications
- Barriers
  - synchronise processes
- Broadcast
  - one-to-many communication
- Reduction operations
  - combine data from several processes to produce a single (usually) result
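A minimal sketch that exercises the three operations just listed (MPI_Barrier, MPI_Bcast and MPI_Reduce), assuming rank 0 acts as the root:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, value = 0, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        value = 42;                                          /* data known only to the root */

    MPI_Barrier(MPI_COMM_WORLD);                             /* synchronise all processes */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);        /* one-to-many: root sends value to all */
    MPI_Reduce(&value, &sum, 1, MPI_INT, MPI_SUM, 0,
               MPI_COMM_WORLD);                              /* combine contributions at the root */

    if (rank == 0)
        printf("Sum of value over all processes = %d\n", sum);

    MPI_Finalize();
    return 0;
}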
Collective Communication Routines

Process Group Routines

Communicators Routines

Derived Types Routines

Virtual Topology Routines

Miscellaneous Routines
MPI Communication Routines and Examples
MPI Messages

- A message contains a number of elements of some particular data type
- MPI data types
  - Basic types
  - Derived types
- Derived types can be built up from basic types
- 'C' types are different from Fortran types
MPI Basic Data types - C

MPI Datatype           C Datatype
MPI_CHAR               signed char
MPI_SHORT              signed short int
MPI_INT                signed int
MPI_LONG               signed long int
MPI_UNSIGNED_CHAR      unsigned char
MPI_UNSIGNED_SHORT     unsigned short int
MPI_UNSIGNED           unsigned int
MPI_UNSIGNED_LONG      unsigned long int
MPI_FLOAT              float
MPI_DOUBLE             double
MPI_LONG_DOUBLE        long double
MPI_BYTE               (no C equivalent)
MPI_PACKED             (no C equivalent)
Point-to-Point Communication

- Communication between two processes
- Source process sends a message to destination process
- Communication takes place within a communicator
- Destination process is identified by its rank in the communicator
- MPI provides four communication modes for sending messages
  - standard, synchronous, buffered, and ready
- Only one mode for receiving
Standard Send

- Completes once the message has been sent
  - Note: it may or may not have been received
- Programs should obey the following rules:
  - They should not assume the send will complete before the receive begins - this can lead to deadlock
  - They should not assume the send will complete after the receive begins - this can lead to non-determinism
  - Processes should be eager readers - they should guarantee to receive all messages sent to them - else network overload
- Can be implemented as either a buffered send or a synchronous send
Standard Send (cont.)

MPI_Send(buf, count, datatype, dest, tag, comm, ierror)

  buf       the address of the data to be sent
  count     the number of elements of datatype that buf contains
  datatype  the MPI datatype
  dest      the rank of the destination in communicator comm
  tag       a marker used to distinguish different message types
  comm      the communicator shared by sender and receiver
  ierror    the Fortran return value of the send
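For reference, the C binding of MPI_Send has no ierror argument; the error code is the return value:

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm);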
Standard Blocking Receive

- Note: all sends so far have been blocking (but this only makes a difference for synchronous sends)
- Completes when the message has been received

MPI_Recv(buf, count, datatype, source, tag, comm, status)

  source  the rank of the source process in communicator comm
  status  returns information about the message

- Synchronous Blocking Message-Passing
  - processes synchronise
  - the sender process specifies the synchronous mode
  - blocking - both processes wait until the transaction has completed
For a communication to succeed

- The sender must specify a valid destination rank
- The receiver must specify a valid source rank
- The communicator must be the same
- Tags must match
- Message types must match
- The receiver's buffer must be large enough
- The receiver can use wildcards:
  - source = MPI_ANY_SOURCE
  - tag = MPI_ANY_TAG
  - the actual source and tag are returned in the status parameter (see the sketch below)
Standard/Blocked Send/Receive
MPI Send/Receive a Character (cont...)

// mpi_com.c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int numtasks, rank, dest, source, rc, tag = 1;
    char inmsg, outmsg = 'X';
    MPI_Status Stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        dest = 1;
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        printf("Rank0 sent: %c\n", outmsg);
        source = 1;
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
    }
MPI Send/Receive a Character

    else if (rank == 1) {
        source = 0;
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
        printf("Rank1 received: %c\n", inmsg);
        dest = 0;
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    }

    MPI_Finalize();
}
Execution Demo

- mpicc mpi_com.c
- [raj@manjra mpi]$ mpirun -np 2 a.out
  Rank0 sent: X
  Rank0 recv: Y
  Rank1 received: X
Non Blocking Message Passing
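The usual non-blocking pattern posts MPI_Isend/MPI_Irecv and completes them later with a wait call, so computation can overlap communication; a minimal sketch (assuming exactly two processes):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int numtasks, rank, other, sendval, recvval;
    MPI_Request reqs[2];
    MPI_Status stats[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (numtasks != 2) {                 /* this sketch assumes exactly two processes */
        if (rank == 0) printf("Run with exactly 2 processes\n");
        MPI_Finalize();
        return 1;
    }

    other = 1 - rank;                    /* the partner process */
    sendval = rank;

    /* post both operations; neither call blocks */
    MPI_Irecv(&recvval, 1, MPI_INT, other, 1, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendval, 1, MPI_INT, other, 1, MPI_COMM_WORLD, &reqs[1]);

    /* ... useful computation could overlap with communication here ... */

    MPI_Waitall(2, reqs, stats);         /* complete both operations */
    printf("Rank %d received %d\n", rank, recvval);

    MPI_Finalize();
    return 0;
}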
Exercise: Ping Pong

1. Write a program in which two processes repeatedly pass a message back and forth.
2. Insert timing calls to measure the time taken for one message.
3. Investigate how the time taken to exchange messages varies with the size of the message.
A simple Ping Pong.c (cont..)

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    int numtasks, rank, dest, source, rc, tag = 1;
    char pingmsg[10];
    char pongmsg[10];
    char buff[100];
    MPI_Status Stat;

    strcpy(pingmsg, "ping");
    strcpy(pongmsg, "pong");

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

A simple Ping Pong.c (cont..)

    if (rank == 0) { /* Send Ping, Receive Pong */
        dest = 1;
        source = 1;
        rc = MPI_Send(pingmsg, strlen(pingmsg) + 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        rc = MPI_Recv(buff, strlen(pongmsg) + 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
        printf("Rank0 Sent: %s & Received: %s\n", pingmsg, buff);
    }
A simple Ping Pong.c

    else if (rank == 1) { /* Receive Ping, Send Pong */
        dest = 0;
        source = 0;
        rc = MPI_Recv(buff, strlen(pingmsg) + 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
        printf("Rank1 received: %s & Sending: %s\n", buff, pongmsg);
        rc = MPI_Send(pongmsg, strlen(pongmsg) + 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
Timers

- C: double MPI_Wtime(void);
- Returns an elapsed wall-clock time in seconds (double precision) on the calling processor.
- Time is measured in seconds.
- Time to perform a task is measured by consulting the timer before and after.
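A sketch of how MPI_Wtime could be used for the ping-pong exercise above; halving the round-trip time approximates the time for one message (assumes two processes):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, msg = 0;
    double t_start, t_end;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* run with exactly two processes */

    t_start = MPI_Wtime();                  /* wall-clock time before the exchange */
    if (rank == 0) {
        MPI_Send(&msg, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
        MPI_Recv(&msg, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &status);
    } else if (rank == 1) {
        MPI_Recv(&msg, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
        MPI_Send(&msg, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }
    t_end = MPI_Wtime();                    /* wall-clock time after the exchange */

    if (rank == 0)
        printf("One message took about %f seconds\n", (t_end - t_start) / 2.0);

    MPI_Finalize();
    return 0;
}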
Upcoming Evaluations

- Mid-term exam: "peer" evaluation
  - Review your understanding of topics covered so far.
  - No official marking - a "how you are going" test.
  - Date: April 27 (Monday)
  - Time: 20 min (exam), 15 min (for peer marking)
- Microsoft Guest Lecture (May?)
- Assignment 2:
  - Implementation of "parallel" matrix multiplication (using MPI)
  - Deadline: April 30 (1: 10-12; 2: 2-4pm)
Acknowledgements: MPI Slides are Derived from

- Dirk van der Knijff, High Performance Parallel Programming, PPT Slides
- MPI Notes, Maui HPC Centre:
  - http://www.buyya.com/csc433/MPITut.pdf
- Melbourne Advanced Research Computing Center:
  - http://www.hpc.unimelb.edu.au
