Programming Clusters using Message-Passing Interface (MPI)


Dr. Rajkumar Buyya
Cloud Computing and Distributed Systems (CLOUDS) Laboratory

The University of Melbourne
Melbourne, Australia


- Introduction to Message Passing Environments
- HelloWorld MPI Program
- Compiling and Running MPI programs
    - On interactive clusters
    - And batch clusters
- Elements of Hello World Program
- MPI Routines Listing
- Communication in MPI programs
- Summary

Message-Passing Programming Paradigm 

- Each processor in a message-passing program runs a sub-program
    - written in a conventional sequential language
    - all variables are private
    - communicate via special subroutine calls

[Figure: nodes, each a processor (P) with its own memory (M), connected by an interconnection network]

SPMD: A dominant paradigm for writing data parallel applications

    main(int argc, char **argv)
    {
        if (process is assigned Master role) {
            /* Assign work and coordinate workers and collect results */
            MasterRoutine(/*arguments*/);
        } else { /* it is worker process */
            /* interact with master and other workers.
               Do the work and send results to the master */
            WorkerRoutine(/*arguments*/);
        }
    }

Messages

- Messages are packets of data moving between subprograms.
- The message passing system has to be told the following information:
    - Sending processor
    - Source location
    - Data type
    - Data length
    - Receiving processor(s)
    - Destination location
    - Destination size

Messages

- Access:
    - Each sub-program needs to be connected to a message passing system
- Addressing:
    - Messages need to have addresses to be sent to
- Reception:
    - It is important that the receiving process is capable of dealing with the messages it is sent
- A message passing system is similar to:
    - Post-office, Phone line, Fax, E-mail, etc.
- Message Types:
    - Point-to-Point, Collective, Synchronous (telephone)/Asynchronous (Postal)

Message Passing Systems and MPI - www.mpi-forum.

- Initially each manufacturer developed their own message passing interface
    - Wide range of features, often incompatible
- MPI Forum brought together several vendors and users of HPC systems from US and Europe to overcome the above
- Produced a document defining a standard, called Message Passing Interface (MPI), which is derived from experience or common features/issues addressed by many message-passing libraries. It aimed:
    - to provide source-code portability
    - to allow efficient implementation
    - it provides a high level of functionality
    - support for heterogeneous parallel architectures
    - parallel I/O (in MPI 2.0)
- MPI 1.0 contains over 115 routines/functions that can be grouped into 8 categories.

General MPI Program Structure

    1. MPI Include File
    2. Initialise MPI Environment
    3. Do work and perform message communication
    4. Terminate MPI Environment

MPI programs

- MPI is a library - there are NO language changes
- Header Files
    - C: #include <mpi.h>
- MPI Function Format
    - C: error = MPI_Xxxx(parameter, ...);
         MPI_Xxxx(parameter, ...);

Example - C

    #include <mpi.h>
    #include <stdlib.h>   /* for exit() */
    /* include other usual header files */

    int main(int argc, char **argv)
    {
        /* initialize MPI */
        MPI_Init(&argc, &argv);

        /* main part of program */

        /* terminate MPI */
        MPI_Finalize();
        exit(0);
    }

MPI helloworld.c

    #include <mpi.h>
    #include <stdio.h>    /* for printf() */

    int main(int argc, char **argv)
    {
        int numtasks, rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);   /* total number of processes */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);       /* this process's rank */
        printf("Hello World from process %d of %d\n", rank, numtasks);
        MPI_Finalize();
        return 0;
    }

MPI Programs Compilation and Execution

Manjra: GRIDS Lab Linux Cluster

- Master Node:
    - Dual Xeon 2GHz
    - 512 MB memory
    - 250 GB integrated storage
    - Gigabit LAN
    - CDROM & Floppy Drives
    - Red Hat Linux release 7.3 (Valhalla)
- Worker Nodes (node1..node13). Each of the 13 worker nodes consists of the following:
    - Pentium 4 2GHz
    - 512 MB memory
    - 40 GB harddisk
    - Gigabit LAN
    - Red Hat Linux release 7.3 (Valhalla)
- Manjra Linux cluster
    - Master: manjra.cs.mu.oz.au
    - Internal worker nodes: node1, node2, ..., node13

[Figure: How Manjra cluster looks - front view and back view]

[Figure: A snapshot of Manjra cluster]

Compile and Run Commands

- Compile:
    manjra> mpicc helloworld.c -o helloworld
- Run:
    manjra> mpirun -np 3 helloworld    [hosts picked from configuration]
    manjra> mpirun -np 3 -machinefile machines.list helloworld
    - -np gives the number of processes
- The file machines.list contains the nodes list:
    node1
    node2
    ..
    node6
    node13
- Some nodes may not work, if they had failed!

Sample Run and Output

- A Run with 3 Processes:
    manjra> mpirun -np 3 -machinefile machines.list helloworld
    Hello World from process 0 of 3
    Hello World from process 1 of 3
    Hello World from process 2 of 3
- A Run by default
    manjra> helloworld
    Hello World from process 0 of 1

Sample Run and Output

- A Run with 6 Processes:
    manjra> mpirun -np 6 -machinefile machines.list helloworld
    Hello World from process 0 of 6
    Hello World from process 3 of 6
    Hello World from process 1 of 6
    Hello World from process 5 of 6
    Hello World from process 4 of 6
    Hello World from process 2 of 6
- Note: Process execution need not be in process number order.

Sample Run and Output

- A Run with 6 Processes:
    manjra> mpirun -np 6 -machinefile machines.list helloworld
    Hello World from process 0 of 6
    Hello World from process 3 of 6
    Hello World from process 1 of 6
    Hello World from process 2 of 6
    Hello World from process 5 of 6
    Hello World from process 4 of 6
- Note: Change in process output order. For each run, process mapping can be different. They may run on machines with different load. Hence such difference.

Running Applications using PBS (Portable Batch System) on Manjra cluster

PBS

- PBS is a batch system - jobs get submitted to a queue
- The job is a shell script to execute your program
- The shell script can contain job management instructions (note that these instructions can also be in the command line)
- PBS will allocate your job to some other computer, log in as you, and execute your script, i.e. your script must contain cd's or absolute references to access files (or globus objects)
- Useful PBS commands:
    - qsub - submits a job
    - qstat - monitors status
    - qdel - deletes a job from a queue

PBS directives

- Some PBS directives to insert at the start of your shell script:
    #PBS -q <queuename>
    #PBS -e <filename>             (stderr location)
    #PBS -o <filename>             (stdout location)
    #PBS -eo                       (combines stderr and stdout)
    #PBS -t <seconds>              (maximum time)
    #PBS -l <attribute>=<value>    (eg -l nodes=2)

Manjra and PBS

- <manjra.cs.mu.oz.au> runs a batch system - called PBS:
    - You submit a script telling the system how to run your job
    - The script requests the number of nodes in DEDICATED mode. The batch system is PBS
- Queue Details
    [raj@manjra mpi]$ qstat -q
    - Columns: Queue, Memory, CPU Time, Walltime, Node, Run, Que, Lm, State
    - The listing includes the queue defaultq; it shows CPU Time and Walltime limits of 10000:00, 13 nodes, 0 running and 0 queued jobs, and state E R

mpich on Manjra

- Run with qsub <jobscript>
- where jobscript is
    #PBS -l nodes=2
    mpirun <progname>

PBS Script

    > [raj@manjra mpi]$ cat hello.bat
    cd mpi
    /usr/local/mpich/mpich-1.2.5.2/bin/mpirun -np 5 helloworld-hostname

    #!/bin/bash
    cd /home/mpi678-2010/mpi
    mpirun -np 5 ./helloworld

    > [raj@manjra mpi]$ cat

- Give the full path of your working directory for your program's execution.

Submitting to a Queue

    [raj@manjra mpi]$ qsub hello.bat
    ...au    (ID assigned to your job)
    [raj@manjra mpi]$ qsub -V hello.bat

Q Status

    [raj@manjra mpi]$ qstat
    Job id           Name             User             Time Use S Queue
    ---------------- ---------------- ---------------- -------- - -----
    2807.manjra      hello.bat        raj              0        Q workq
    2813.manjra      hello.bat        raj              0        E workq

Output - Result/Error

- Output
    - hello.bat.oXXXXX
- Error, if any
    - hello.bat.eXXXXX
- Where XXXXX is the ID assigned to your job by PBS

References  PBS User Guide: .doesciencegrid.

More on MPI Program Elements and Error Checking

Handles

- MPI controls its own internal data structures
- MPI releases 'handles' to allow programmers to refer to these
- "C" handles are of distinct typedef'd types and arrays are indexed from 0
- Some arguments can be of any type - in C these are declared as void *

Initializing MPI

- The first MPI routine called in any MPI program must be MPI_Init.
- The C version accepts the arguments to main
    int MPI_Init(int *argc, char ***argv);
- MPI_Init must be called by every MPI program
- Making multiple MPI_Init calls is erroneous
- MPI_INITIALIZED is an exception to the first rule

MPI_COMM_WORLD

- MPI_INIT defines a communicator called MPI_COMM_WORLD for every process that calls it.
- All MPI communication calls require a communicator argument
- MPI processes can only communicate if they share a communicator.
- A communicator contains a group which is a list of processes
- Each process has its rank within the communicator
- A process can have several communicators

Communicators

- MPI uses objects called Communicators that define which collection of processes communicate with each other.
- Every process has a unique integer identifier (its rank) assigned by the system when the process initialises. A rank is sometimes called process ID.
- Processes can request information from a communicator
    - MPI_Comm_rank(MPI_Comm comm, int *rank)
        - Returns the rank of the process in comm
    - MPI_Comm_size(MPI_Comm comm, int *size)
        - Returns the size of the group in comm

Finishing up

- An MPI program should call MPI_Finalize when all communications have completed.
- Once called no other MPI calls can be made
- Aborting: MPI_Abort(comm, errorcode)
    - Attempts to abort all processes listed in comm
    - if comm = MPI_COMM_WORLD the whole program terminates

Hello World with Error Check

Display Hostname of MPI Process

    #include <mpi.h>
    #include <stdio.h>    /* for printf() */

    int main(int argc, char **argv)
    {
        int numtasks, rank;
        int resultlen;    /* length of the name returned by MPI */
        static char mpi_hostname[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(mpi_hostname, &resultlen);
        printf("Hello World from process %d of %d running on %s\n",
               rank, numtasks, mpi_hostname);
        MPI_Finalize();
        return 0;
    }

MPI Routines

MPI Routines - C and Fortran

- Environment Management
- Point-to-Point Communication
- Collective Communication
- Process Group Management
- Communicators
- Derived Type
- Virtual Topologies
- Miscellaneous Routines

Environment Management Routines

Point-to-Point Communication

- The simplest form of message passing
- One process sends a message to another
- Several variations on how sending a message can interact with execution of the subprogram

Point-to-Point variations

- Synchronous Sends
    - provide information about the completion of the message
    - e.g. fax machines
- Asynchronous Sends
    - Only know when the message has left
    - e.g. post cards
- Blocking operations
    - only return from the call when operation has completed
- Non-blocking operations
    - return straight away - can test/wait later for completion

Point-to-Point Communication

Collective Communications

- Collective communication routines are higher level routines involving several processes at a time
- Can be built out of point-to-point communications
- Barriers
    - synchronise processes
- Broadcast
    - one-to-many communication
- Reduction operations
    - combine data from several processes to produce a single (usually) result

Collective Communication Routines

Process Group Management Routines

Communicators Routines

Derived Type Routines

Virtual Topologies Routines

Miscellaneous Routines

MPI Communication Routines and Examples

MPI Messages

- A message contains a number of elements of some particular data type
- MPI data types
    - Basic Types
    - Derived types
- Derived types can be built up from basic types
- "C" types are different from Fortran types

MPI Basic Data types - C

    MPI datatype          C datatype
    ------------------    ------------------
    MPI_CHAR              signed char
    MPI_SHORT             signed short int
    MPI_INT               signed int
    MPI_LONG              signed long int
    MPI_UNSIGNED_CHAR     unsigned char
    MPI_UNSIGNED_SHORT    unsigned short int
    MPI_UNSIGNED          unsigned int
    MPI_UNSIGNED_LONG     unsigned long int
    MPI_FLOAT             float
    MPI_DOUBLE            double
    MPI_LONG_DOUBLE       long double
    MPI_BYTE
    MPI_PACKED

Point-to-Point Communication

- Communication between two processes
- Source process sends message to destination process
- Communication takes place within a communicator
- Destination process is identified by its rank in the communicator
- MPI provides four communication modes for sending messages
    - standard, synchronous, buffered, and ready
- Only one mode for receiving

Standard Send

- Completes once the message has been sent
    - Note: it may or may not have been received
- Programs should obey the following rules:
    - It should not assume the send will complete before the receive begins - can lead to deadlock
    - It should not assume the send will complete after the receive begins - can lead to non-determinism
    - processes should be eager readers - they should guarantee to receive all messages sent to them - else network overload
- Can be implemented as either a buffered send or synchronous send

Standard Send (cont.)

    MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

    buf         the address of the data to be sent
    count       the number of elements of datatype buf contains
    datatype    the MPI datatype
    dest        rank of destination in communicator comm
    tag         a marker used to distinguish different message types
    comm        the communicator shared by sender and receiver
    ierror      the fortran return value of the send

Standard Blocking Receive

- Note: all sends so far have been blocking (but this only makes a difference for synchronous sends)
- Completes when message received
    MPI_Recv(buf, count, datatype, source, tag, comm, status)
    - source - rank of source process in communicator comm
    - status - returns information about message
- Synchronous Blocking Message-Passing
    - processes synchronise
    - sender process specifies the synchronous mode
    - blocking - both processes wait until transaction completed

For a communication to succeed

- Sender must specify a valid destination rank
- Receiver must specify a valid source rank
- The communicator must be the same
- Tags must match
- Message types must match
- Receiver's buffer must be large enough
- Receiver can use wildcards
    - actual source and tag are returned in status parameter

Standard/Blocked Send/Receive

MPI Send/Receive a Character

    // mpi_com.c
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int numtasks, rank, dest, source, rc, tag=1;
        char inmsg, outmsg='X';
        MPI_Status Stat;

        MPI_Init(&argc,&argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* rank 0 sends first, then waits for the reply */
            dest = 1;
            rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
            printf("Rank0 sent: %c\n", outmsg);
            source = 1;
            rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
            printf("Rank0 recv: %c\n", inmsg);
        }
        else if (rank == 1) {
            /* rank 1 receives first, then replies */
            source = 0;
            rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
            printf("Rank1 received: %c\n", inmsg);
            dest = 0;
            outmsg = 'Y';
            rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }

Execution Demo

    mpicc mpi_com.c
    [raj@manjra mpi]$ mpirun -np 2 a.out
    Rank0 sent: X
    Rank0 recv: Y
    Rank1 received: X

Non Blocking Message Passing

Exercise: Ping Pong

1. Write a program in which two processes repeatedly pass a message back and forth.
2. Insert timing calls to measure the time taken for one message.
3. Investigate how the time taken to exchange messages varies with the size of the message.

A simple Ping Pong.c

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>   /* for strcpy() and strlen() */

    int main(int argc, char *argv[])
    {
        int numtasks, rank, dest, source, rc, tag=1;
        char inmsg, outmsg='X';
        char pingmsg[10];
        char pongmsg[10];
        char buff[100];
        MPI_Status Stat;

        strcpy(pingmsg, "ping");
        strcpy(pongmsg, "pong");

        MPI_Init(&argc,&argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) { /* Send Ping, Receive Pong */
            dest = 1;
            source = 1;
            /* Why + 1 ? */
            rc = MPI_Send(pingmsg, (int)strlen(pingmsg)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
            rc = MPI_Recv(buff, (int)strlen(pongmsg)+1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
            printf("Rank0 Sent: %s & Received: %s\n", pingmsg, buff);
        }
        else if (rank == 1) { /* Receive Ping, Send Pong */
            dest = 0;
            source = 0;
            rc = MPI_Recv(buff, (int)strlen(pingmsg)+1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
            printf("Rank1 received: %s & Sending: %s\n", buff, pongmsg);
            rc = MPI_Send(pongmsg, (int)strlen(pongmsg)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }

Timers

- C: double MPI_Wtime(void);
- Returns an elapsed wall clock time in seconds (double precision) on the calling processor.
- Time is measured in seconds
- Time to perform a task is measured by consulting the time before and after

Upcoming Evaluations

- Mid term exam: "peer" evaluation
    - Review your understanding of topics covered so far.
    - No official marking - "How you are going test".
    - Date: April 27 (Monday)
    - Time: 20 min (exam), 15 min (for peer marking)
- Microsoft Guest Lecture (May?)
- Assignment 2:
    - Implementation of "parallel" Matrix multiplication (using MPI)
    - Deadline: April 30 from G1: 10-12, G2: 2-4pm

Acknowledgements: MPI Slides are Derived from   Dirk van der Knijff. Maui HPC Centre:  http://www.unimelb. High Performance Parallel .com/csc433/MPITut. PPT Slides MPI  Melbourne Advanced Research Computing Center  http://www.hpc.
