
Parallel Programming and MPI
A course for IIT-M, September 2008
R Badrinath, STSD Bangalore
(ramamurthy.badrinath@hp.com)

© 2006 Hewlett-Packard Development Company, L.P.
The information contained herein is subject to change without notice.

Context and Background
• IIT-Madras has recently added a good deal of compute power.
• Why –
− Further R&D in sciences, engineering
− Provide computing services to the region
− Create new opportunities in education and skills
− …
• Why this course –
− Update skills to program modern cluster computers
• Length – 2 theory and 2 practice sessions, 4 hrs each

2 September 2008 IIT-Madras


Audience Check

3
Contents
1. MPI_Init
2. MPI_Comm_rank
3. MPI_Comm_size
4. MPI_Send
5. MPI_Recv
6. MPI_Bcast
7. MPI_Comm_create
8. MPI_Sendrecv
9. MPI_Scatter
10. MPI_Gather
………………

Instead we
• Understand Issues
• Understand Concepts
• Learn enough to pick up from the manual
• Go by motivating examples
• Try out some of the examples

4 September 2008 IIT-Madras


Outline
• Sequential vs Parallel programming
• Shared vs Distributed Memory
• Parallel work breakdown models
• Communication vs Computation
• MPI Examples
• MPI Concepts
• The role of IO

5 September 2008 IIT-Madras


Sequential vs Parallel
• We are used to sequential programming – C, Java, C++,
etc. E.g., Bubble Sort, Binary Search, Strassen
Multiplication, FFT, BLAST, …
• Main idea – Specify the steps in perfect order
• Reality – We are used to parallelism a lot more than
we think – as a concept; not for programming
• Methodology – Launch a set of tasks; communicate to
make progress. E.g., Sorting 500 answer papers by –
making 5 equal piles, have them sorted by 5 people,
merge them together.

6 September 2008 IIT-Madras


Shared vs Distributed Memory
Programming
• Shared Memory – All tasks access the same memory, hence the
same data (e.g., pthreads).
• Distributed Memory – All memory is local. Data sharing is by
explicitly transporting data from one task to another (send-
receive pairs in MPI, e.g.)

[Figure: programs, each with its own memory, connected by a communications channel]

• HW – Programming model relationship – Tasks vs CPUs
• SMPs vs Clusters

7 September 2008 IIT-Madras


Designing Parallel Programs

8
Simple Parallel Program – sorting numbers
in a large array A
• Notionally divide A into 5 pieces
[0..99;100..199;200..299;300..399;400..499].
• Each part is sorted by an independent sequential
algorithm and left within its region.

• The resultant parts are merged by simply reordering
among adjacent parts.
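
A minimal sequential sketch of this idea, assuming 5 equal pieces of a
500-element array (no MPI yet; the sizes and the simple merge pass are
illustrative assumptions): each region is sorted with qsort, then the
pieces are combined one after the other.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 500
#define PIECES 5

static int cmp_int(const void *a, const void *b) {
    return (*(const int *)a > *(const int *)b) - (*(const int *)a < *(const int *)b);
}

int main(void) {
    int A[N], tmp[N];
    for (int i = 0; i < N; i++) A[i] = rand() % 1000;

    /* Sort each of the 5 regions in place (the part that independent
       tasks could do in parallel). */
    int chunk = N / PIECES;
    for (int p = 0; p < PIECES; p++)
        qsort(A + p * chunk, chunk, sizeof(int), cmp_int);

    /* Merge the sorted regions, one adjacent region at a time. */
    memcpy(tmp, A, sizeof(A));
    int merged = chunk;                 /* length of the sorted prefix */
    for (int p = 1; p < PIECES; p++) {
        int i = 0, j = p * chunk, k = 0;
        while (i < merged && j < (p + 1) * chunk)
            A[k++] = (tmp[i] <= tmp[j]) ? tmp[i++] : tmp[j++];
        while (i < merged) A[k++] = tmp[i++];
        while (j < (p + 1) * chunk) A[k++] = tmp[j++];
        merged += chunk;
        memcpy(tmp, A, sizeof(A));
    }

    printf("first=%d last=%d\n", A[0], A[N - 1]);
    return 0;
}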

9 September 2008 IIT-Madras


What is different – Think about…
• How many people doing the work. (Degree of
Parallelism)
• What is needed to begin the work. (Initialization)
• Who does what. (Work distribution)
• Access to work part. (Data/IO access)
• Whether they need info from each other to finish their
own job. (Communication)
• When are they all done. (Synchronization)
• What needs to be done to collate the result.

10 September 2008 IIT-Madras


Work Break-down
• Parallel algorithm
• Prefer simple intuitive breakdowns
• Usually highly optimized sequential algorithms are not
easily parallelizable
• Breaking work often involves some pre- or post-
processing (much like divide and conquer)
• Fine vs large grain parallelism and relationship to
communication

11 September 2008 IIT-Madras


Digression – Let’s get a simple MPI Program to work
#include <mpi.h>
#include <stdio.h>

int main()
{
    int total_size, my_rank;

    MPI_Init(NULL, NULL);

    MPI_Comm_size(MPI_COMM_WORLD, &total_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    printf("\n Total number of programs = %d, out of which rank of this process is %d\n",
           total_size, my_rank);

    MPI_Finalize();
    return 0;
}
12 September 2008 IIT-Madras
Getting it to work
• Compile it:
− mpicc -o simple simple.c    # If you want HP-MPI, set your path to include
                              # /opt/hpmpi/bin
• Run it
− This depends a bit on the system
− mpirun -np 2 simple
− qsub -l ncpus=2 -o simple.out /opt/hpmpi/bin/mpirun <your
program location>/simple
− [Fun: qsub -l ncpus=2 -I hostname ]

• Results are in the output file.


• What is mpirun ?
• What does qsub have to do with MPI?... More about qsub in a separate
talk.

13 September 2008 IIT-Madras


What goes on
• Same program is run at the same time on 2 different
CPUs
• Each is slightly different in that each returns different
values for some simple calls like MPI_Comm_rank.
• This gives each instance its identity
• We can make different instances run different pieces
of code based on this identity difference
• Typically it is an SPMD model of computation
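
A minimal sketch of this idea (illustrative only): every instance runs the
same program, but branches on its rank.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Same executable everywhere; behaviour differs by identity. */
    if (rank == 0)
        printf("I am the coordinator\n");
    else
        printf("I am worker %d\n", rank);

    MPI_Finalize();
    return 0;
}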

14 September 2008 IIT-Madras


Continuing work breakdown…
Simple Example: Find shortest distances
PROBLEM: Find shortest path distances.

[Figure: a small weighted directed graph on nodes 0..4 with edge weights]

Let nodes be numbered 0, 1, …, n-1.
Let us put all of this in a matrix: A[i][j] is the distance from i to j.

  0   2   1  ..   6
  7   0  ..  ..  ..
  1   5   0   2   3
 ..  ..   2   0   2
 ..  ..  ..  ..   0

15 September 2008 IIT-Madras


Floyd’s (sequential) algorithm
for (k = 0; k < n; k++)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[i][j] = min(a[i][j], a[i][k] + a[k][j]);

Observation:
For a fixed k,
Computing i-th row needs i-th row and k-th row
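
For reference, a minimal self-contained sequential version of the above
(the 5x5 matrix, the INF value and the min helper are illustrative
assumptions, not part of the slides):

#include <stdio.h>

#define N   5
#define INF 9999   /* stands in for "no edge" (the ".." entries) */

static int min(int x, int y) { return x < y ? x : y; }

int main(void)
{
    /* Distance matrix corresponding to the earlier slide (illustrative). */
    int a[N][N] = {
        {   0,   2,   1, INF,   6 },
        {   7,   0, INF, INF, INF },
        {   1,   5,   0,   2,   3 },
        { INF, INF,   2,   0,   2 },
        { INF, INF, INF, INF,   0 },
    };

    /* Floyd's algorithm: allow node k as an intermediate, for each k. */
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = min(a[i][j], a[i][k] + a[k][j]);

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            printf("%5d", a[i][j]);
        printf("\n");
    }
    return 0;
}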

16 September 2008 IIT-Madras


Parallelizing Floyd
• Actually we just need n² tasks, with each task iterating
n times (once for each value of k).
• After each iteration we need to make sure everyone
sees the matrix.
• ‘Ideal’ for shared-memory programming…
• What if we have fewer than n² tasks? Say p < n.
• Need to divide the work among the p tasks.
• We can simply divide up the rows.

17 September 2008 IIT-Madras


Dividing the work
• Each task gets [n/p] rows, with the last possibly getting
a little more.

[Figure: the matrix split row-wise into blocks of [n/p] rows; task Tq owns the
block starting at row q x [n/p]; the i-th row is local, while the k-th row may
belong to another task – remember the observation]
18 September 2008 IIT-Madras


The MPI model…
− All nodes run the same code!! p replica tasks!! (Distributed Memory Model)
− Sometimes they need to do different things
Note that each node calls its own matrix by the same name a[ ][ ] but
has only [n/p] rows of it.

/* “id” is the TASK NUMBER; each node has only the part of A that
   it owns. This is approximate code. */
for (k = 0; k < n; k++) {
    current_owner_task = GET_BLOCK_OWNER(k);
    if (id == current_owner_task) {
        k_here = k - LOW_END_OF_MY_BLOCK(id);
        for (j = 0; j < n; j++)
            rowk[j] = a[k_here][j];
    }
    /* rowk is broadcast by the owner and received by the others..
       The MPI code will come here later */
    for (i = 0; i < GET_MY_BLOCK_SIZE(id); i++)
        for (j = 0; j < n; j++)
            a[i][j] = min(a[i][j], a[i][k] + rowk[j]);
}

19 September 2008 IIT-Madras
The MPI model
• Recall MPI tasks are typically created when the jobs
are launched – not inside the MPI program (no
forking).
− mpirun usually creates the task set
− mpirun –np 2 a.out <args to a.out>
− a.out is run on all nodes and a communication channel is
setup between them
• Functions allow for tasks to find out
− Size of the task group
− One's own position within the group

20 September 2008 IIT-Madras


MPI Notions [ Taking from the example ]
• Communicator – A group of tasks in a program
• Rank – Each task’s ID in the group
− MPI_Comm_rank() … /* use this to set “id” */
• Size – Of the group
− MPI_Comm_size() … /* use to set “p” */
• Notion of send/receive/broadcast…
− MPI_Bcast() … /* use to broadcast rowk[] */

• For actual syntax use a good MPI book or manual


• Online resource: http://www-unix.mcs.anl.gov/mpi/www/

21 September 2008 IIT-Madras


MPI Prologue to our Floyd example
int a[MAX][MAX];
int n = 20;   /* real size of the matrix, can be read in */
int id, p;

MPI_Init(&argc, &argv);

MPI_Comm_rank(MPI_COMM_WORLD, &id);
MPI_Comm_size(MPI_COMM_WORLD, &p);
.
. /* This is where all the real work happens */
.
MPI_Finalize();   /* Epilogue */

22 September 2008 IIT-Madras


This is the time to try out several
simple MPI programs using the
few functions we have seen.
- use mpicc
- use mpirun

23
Visualizing the execution

[Figure: the job is launched; the scheduler ensures one task per CPU; tasks run
on CPUs – multiple tasks/CPUs may be on the same node]

• MPI_Init, MPI_Comm_rank, MPI_Comm_size etc.
• Other initializations, like reading in the array
• For the initial values of k, the task with rank 0 broadcasts row k, the others receive
• For each value of k they do their computation with the correct rowk
• Loop the above for all values of k
• Task 0 receives all blocks of the final array and prints them out
• MPI_Finalize
24 September 2008 IIT-Madras
Communication vs Computation
• Often communication is needed between iterations to complete
the work.
• Often, the more tasks there are, the more communication there is.
− In Floyd, a bigger “p” means that “rowk” must be sent to a larger
number of tasks.
− If each iteration depends on more data, it can get very busy.
• This may mean network contention, i.e., delays.
• Try it: count the number of “a”s in a string. Time vs p.
• This is why, for a fixed problem size, increasing the number of
CPUs does not continually increase performance.
• This needs experimentation – it is problem specific.

25 September 2008 IIT-Madras


Communication primitives
• MPI_Send(sendbuffer, senddatalength, datatype,
           destination, tag, communicator);
• MPI_Send("Hello", strlen("Hello"), MPI_CHAR,
           2, 100, MPI_COMM_WORLD);
• MPI_Recv(recvbuffer, recvdatalength, MPI_CHAR,
           source, tag, MPI_COMM_WORLD, &status);
• Send-Recv happen in pairs.
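
A minimal sketch of such a pair (illustrative; the message length and tag
are arbitrary, and it needs at least two tasks, e.g. mpirun -np 2): rank 0
sends a greeting and rank 1 receives it.

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    int rank;
    char buf[32];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        strcpy(buf, "Hello");
        /* Send 6 chars (including '\0') to rank 1 with tag 100. */
        MPI_Send(buf, strlen(buf) + 1, MPI_CHAR, 1, 100, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* The matching receive: same tag, source rank 0. */
        MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 100, MPI_COMM_WORLD, &status);
        printf("Rank 1 received: %s\n", buf);
    }

    MPI_Finalize();
    return 0;
}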

26 September 2008 IIT-Madras


Collectives
• Broadcast is one-to-all communication
• Both receivers and sender call the same function
• All MUST call it. All end up with SAME result.
• MPI_Bcast (buffer, count, type, root, comm);
• Examples
− MPI_Bcast(&k, 1, MPI_INT, 0, MPI_COMM_WORLD);
  Task 0 sends its integer k and all others receive it.
− MPI_Bcast(rowk, n, MPI_INT, current_owner_task, MPI_COMM_WORLD);
  current_owner_task sends rowk to all others.
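
As a sketch, under the assumptions of the earlier approximate code (where
id, rowk, GET_BLOCK_OWNER and LOW_END_OF_MY_BLOCK come from), the broadcast
that the slide-19 comment deferred would sit inside the k loop roughly like this:

/* Inside: for (k = 0; k < n; k++) { ... } */
current_owner_task = GET_BLOCK_OWNER(k);
if (id == current_owner_task) {
    k_here = k - LOW_END_OF_MY_BLOCK(id);
    for (j = 0; j < n; j++)
        rowk[j] = a[k_here][j];
}
/* Everyone calls the same broadcast: the owner's rowk is copied
   into every other task's rowk. */
MPI_Bcast(rowk, n, MPI_INT, current_owner_task, MPI_COMM_WORLD);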

27 September 2008 IIT-Madras


Try out a simple MPI program with
send-recvs and broadcasts.

Try out Floyd’s algorithm.


What if you have to read a file to
initialize Floyd’s algorithm?

28
A bit more on Broadcast
Ranks:      0                        1                        2
x before:   0                        1                        2
Call:       MPI_Bcast(&x,1,..,0,..)  MPI_Bcast(&x,1,..,0,..)  MPI_Bcast(&x,1,..,0,..)
x after:    0                        0                        0

29 September 2008 IIT-Madras


Other useful collectives
• MPI_Reduce(&values, &results, count, type, operator,
             root, comm);
• MPI_Reduce(&x, &res, 1, MPI_INT, MPI_SUM,
             9, MPI_COMM_WORLD);
• Task number 9 gets, in the variable res, the sum of
whatever was in x in all of the tasks (including itself).
• Must be called by ALL tasks.
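
A minimal runnable sketch (root 0 instead of 9, purely for illustration):
each task contributes its rank and the root prints the total.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, res = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Every task must make this call; only the root gets the result. */
    MPI_Reduce(&rank, &res, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Sum of ranks 0..%d = %d\n", size - 1, res);

    MPI_Finalize();
    return 0;
}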

30 September 2008 IIT-Madras


Scattering as opposed to broadcasting
• MPI_Scatterv(sndbuf, sndcount[], send_disp[], type,
  recvbuf, recvcount, recvtype, root, comm);
• All nodes MUST call it.

[Figure: the send buffer on Rank 0 is cut into pieces and one piece is delivered
to each of Rank 0, Rank 1, Rank 2, Rank 3]
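
A minimal sketch of MPI_Scatterv (the uneven counts and the 4-task
assumption are illustrative): rank 0 scatters a 10-element array, giving
each task a different-sized piece.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    int sendbuf[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    int counts[4]   = {1, 2, 3, 4};     /* how much each rank gets */
    int displs[4]   = {0, 1, 3, 6};     /* where each piece starts */
    int recvbuf[4];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* sketch assumes 4 tasks */

    /* Everyone calls it; counts/displs matter only at the root. */
    MPI_Scatterv(sendbuf, counts, displs, MPI_INT,
                 recvbuf, counts[rank], MPI_INT, 0, MPI_COMM_WORLD);

    printf("Rank %d received %d element(s), first = %d\n",
           rank, counts[rank], recvbuf[0]);

    MPI_Finalize();
    return 0;
}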

31 September 2008 IIT-Madras


Common Communication pitfalls!!
• Make sure that communication
primitives are called by the right
number of tasks.
• Make sure they are called in the right
sequence.
• Make sure that you use the proper
tags.
• If not, you can easily get into
deadlock (“My program seems to be
hung”)
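
For instance, a sketch of a deadlock-prone pattern (illustrative; assumes
exactly two tasks): if both tasks make a blocking send to each other before
either posts a receive, neither may ever return. MPI_Sendrecv is one safe
alternative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, in, out;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    out = rank;
    int other = 1 - rank;               /* assumes exactly 2 tasks */

    /* Risky pattern (commented out): both tasks call a blocking send
       first, so each may wait forever for the other's receive.
       MPI_Send(&out, 1, MPI_INT, other, 0, MPI_COMM_WORLD);
       MPI_Recv(&in, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &status); */

    /* Safer: the combined call pairs the send and receive for us. */
    MPI_Sendrecv(&out, 1, MPI_INT, other, 0,
                 &in,  1, MPI_INT, other, 0,
                 MPI_COMM_WORLD, &status);

    printf("Rank %d exchanged values with rank %d, got %d\n", rank, other, in);
    MPI_Finalize();
    return 0;
}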

32 September 2008 IIT-Madras


More on work breakdown
• Finding the right work breakdown can be challenging
• Sometimes dynamic work breakdown is good
• Master (usually task 0) decides who will do what and
collects the results.
• E.g., you have a huge number of 5x5 matrices to
multiply (chained matrix multiplication).
• E.g., Search for a substring in a huge collection of
strings.

33 September 2008 IIT-Madras


Master-slave dynamic work assignment
[Figure: the master (task 0) hands out work items to slave tasks 1, 2, … and
collects their results]
34 September 2008 IIT-Madras


Master slave example – Reverse strings
Slave(){
  char work[MAX]; MPI_Status stat; int n;
  do {
    /* Work items come from the master (rank 0). */
    MPI_Recv(work, MAX, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat);
    n = strlen(work);
    if (n == 0) break;    /* empty string signals the end */

    reverse(work);

    MPI_Send(work, n+1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
  } while (1);

  MPI_Finalize();
}

35 September 2008 IIT-Madras


Master slave example – Reverse strings
Master(){ /* rank 0 task; approximate code, np and buffers as in the other slides */
  initialize_work_items();
  for (i = 1; i < np; i++) {             /* Initial work distribution */
    work = next_work_item();
    n = strlen(work) + 1;
    MPI_Send(work, n, MPI_CHAR, i, 0, MPI_COMM_WORLD);
  }
  unfinished_work = np - 1;              /* one outstanding item per slave */
  while (unfinished_work != 0) {
    MPI_Recv(res, MAX, MPI_CHAR, MPI_ANY_SOURCE, 0,
             MPI_COMM_WORLD, &status);
    process(res);
    work = next_work_item();
    if (work == NULL) {                  /* no more work: tell this slave to stop */
      MPI_Send("", 1, MPI_CHAR, status.MPI_SOURCE, 0, MPI_COMM_WORLD);
      unfinished_work--;
    } else {
      n = strlen(work) + 1;
      MPI_Send(work, n, MPI_CHAR, status.MPI_SOURCE, 0, MPI_COMM_WORLD);
    }
  }
}
36 September 2008 IIT-Madras
Master slave example
main(){
  ...
  MPI_Comm_rank(MPI_COMM_WORLD, &id);
  MPI_Comm_size(MPI_COMM_WORLD, &np);
  if (id == 0)
    Master();
  else
    Slave();
  ...
}

37 September 2008 IIT-Madras


Matrix Multiply and Communication
Patterns

38
Block Distribution of Matrices
• Matrix Multiply: Cij = Σk (Aik * Bkj)
• Each task owns a block – its own part of A, B and C
• The old formula holds for blocks!
• BMR Algorithm – Example:
  C21 = A20*B01 + A21*B11 + A22*B21 + A23*B31
  (each term is a smaller block – a submatrix)


39 September 2008 IIT-Madras
Block Distribution of Matrices (continued)
• BMR Algorithm for C21 = A20*B01 + A21*B11 + A22*B21 + A23*B31
  (each term is a smaller block – a submatrix):
  • A22 is row-broadcast
  • A22*B21 is added into C21
  • B_1 is rolled up one slot
  • Our task now has B31
  • Now repeat the above block, except the item to broadcast is A23
40 September 2008 IIT-Madras
Attempt doing this with just Send-
Recv and Broadcast

41
Communicators and Topologies
• The BMR example shows the limitations of broadcast…
although there is a pattern
• Communicators can be created on subgroups of
processes.
• Communicators can be created that have a topology
− Will make programming natural
− Might improve performance by matching to hardware

42 September 2008 IIT-Madras


for (k = 0; k < s; k++) {
    sender = (my_row + k) % s;
    if (sender == my_col) {
        MPI_Bcast(&my_A, m*m, MPI_INT, sender, row_comm);
        T = my_A;                         /* pseudocode: copy the block */
    } else {
        MPI_Bcast(&T, m*m, MPI_INT, sender, row_comm);
    }
    my_C = my_C + T x my_B;               /* pseudocode: block multiply-add */

    /* Roll my_B up one slot along the column communicator. */
    MPI_Sendrecv_replace(my_B, m*m, MPI_INT, dest, 0,
                         source, 0, col_comm, &status);
}

43 September 2008 IIT-Madras


Creating topologies and communicators
• Creating a grid
• MPI_Cart_create(MPI_COMM_WORLD, 2, dim_sizes,
  istorus, canreorder, &grid_comm);
− int dim_sizes[2]; int istorus[2]; int canreorder; MPI_Comm grid_comm;

• Divide a grid into rows – each with its own communicator
• MPI_Cart_sub(grid_comm, free, &rowcomm);
− MPI_Comm rowcomm; int free[2];
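
A minimal sketch putting these together (the 2x2 torus is an illustrative
assumption; run with 4 tasks): create the grid, split it into row and column
communicators, and use MPI_Cart_shift to find the roll partners used by the
BMR loop.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int dim_sizes[2] = {2, 2};      /* 2x2 process grid (assumes 4 tasks) */
    int istorus[2]   = {1, 1};      /* wrap around in both dimensions     */
    int canreorder   = 1;
    int coords[2], grid_rank, source, dest;
    MPI_Comm grid_comm, row_comm, col_comm;

    MPI_Init(&argc, &argv);

    MPI_Cart_create(MPI_COMM_WORLD, 2, dim_sizes, istorus, canreorder, &grid_comm);
    MPI_Comm_rank(grid_comm, &grid_rank);
    MPI_Cart_coords(grid_comm, grid_rank, 2, coords);

    /* Row communicator: keep dimension 1 free; column: keep dimension 0. */
    int free_row[2] = {0, 1};
    int free_col[2] = {1, 0};
    MPI_Cart_sub(grid_comm, free_row, &row_comm);
    MPI_Cart_sub(grid_comm, free_col, &col_comm);

    /* Partners for "rolling" a block up one slot along the column. */
    MPI_Cart_shift(col_comm, 0, 1, &source, &dest);

    printf("Grid rank %d at (%d,%d): roll source=%d dest=%d\n",
           grid_rank, coords[0], coords[1], source, dest);

    MPI_Finalize();
    return 0;
}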

44 September 2008 IIT-Madras


Try implementing the BMR
algorithm with communicators

45
A brief on other MPI Topics – The last leg
• MPI+Multi-threaded / OpenMP
• One-sided Communication
• MPI and IO

46 September 2008 IIT-Madras


MPI and OpenMP
• Grain
• Communication
• Where does the interesting pragma omp for fit in our MPI Floyd?
• How do I assign exactly one MPI task per CPU?
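
One plausible answer to the first question, as a hedged sketch that reuses
the variables of the earlier approximate Floyd code: the update of the locally
owned rows is independent across i and j, so an omp for over the local rows
can use the cores within one node while MPI works across nodes.

/* Sketch only: assumes the variables of the earlier approximate code. */
for (k = 0; k < n; k++) {
    /* ... owner fills rowk and MPI_Bcast(rowk, ...) as before ... */

    /* Each MPI task updates its own block of rows using its cores. */
    #pragma omp parallel for private(j)
    for (i = 0; i < GET_MY_BLOCK_SIZE(id); i++)
        for (j = 0; j < n; j++)
            a[i][j] = min(a[i][j], a[i][k] + rowk[j]);
}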

47 September 2008 IIT-Madras


One-Sided Communication
• Have no corresponding send-recv pairs!
• RDMA
• Get
• Put
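
A minimal sketch of the Get style (the window setup and fence synchronisation
are illustrative choices, not from the slides): every task exposes one integer
in a window, and each task reads its right neighbour's value without that
neighbour calling any send.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, mine, theirs = -1;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    mine = 100 + rank;

    /* Expose "mine" so that other tasks can read it directly. */
    MPI_Win_create(&mine, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    /* One-sided read from the right neighbour; it calls nothing here. */
    MPI_Get(&theirs, 1, MPI_INT, (rank + 1) % size, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);

    printf("Rank %d read %d from rank %d\n", rank, theirs, (rank + 1) % size);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}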

48 September 2008 IIT-Madras


IO in Parallel Programs
• Typically a root task does the IO.
− Simpler to program
− Natural because of some post-processing occasionally needed
(sorting)
− All nodes generating IO requests might overwhelm the
fileserver, essentially sequentializing it.
• Performance is not the limitation for Lustre/SFS.
• Parallel IO interfaces such as MPI-IO can make use of
parallel filesystems such as Lustre.
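
A minimal MPI-IO sketch (the file name and record layout are illustrative
assumptions): every task writes its own rank into a shared file at an offset
determined by its rank, with no single root gathering the data.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* All tasks open the same file collectively. */
    MPI_File_open(MPI_COMM_WORLD, "ranks.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each task writes one int at its own offset; no central bottleneck. */
    MPI_Offset offset = (MPI_Offset)rank * sizeof(int);
    MPI_File_write_at(fh, offset, &rank, 1, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}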

49 September 2008 IIT-Madras


MPI-BLAST exec time vs other time[4]

50 September 2008 IIT-Madras


How IO/Comm Optimizations help MPI-BLAST [4]

51 September 2008 IIT-Madras


What did we learn?
• Distributed Memory Programming Model
• Parallel Algorithm Basics
• Work Breakdown
• Topologies in Communication
• Communication Overhead vs Computation
• Impact of Parallel IO

52 September 2008 IIT-Madras


What MPI Calls did we see here?
1. MPI_Init
2. MPI_Finalize
3. MPI_Comm_size
4. MPI_Comm_rank
5. MPI_Send
6. MPI_Recv
7. MPI_Sendrecv_replace
8. MPI_Bcast
9. MPI_Reduce
10. MPI_Cart_create
11. MPI_Cart_sub
12. MPI_Scatter

53 September 2008 IIT-Madras


References
1. Parallel Programming in C with MPI and OpenMP, M. J. Quinn,
   TMH. This is an excellent practical book; it motivated much of
   the material here, specifically Floyd's algorithm.
2. The BMR algorithm for matrix multiply and the topology ideas are
   motivated by
   http://www.cs.indiana.edu/classes/b673/notes/matrix_mult.html
3. MPI online manual:
   http://www-unix.mcs.anl.gov/mpi/www/
4. Efficient Data Access for Parallel BLAST, IPDPS'05

54 September 2008 IIT-Madras
