
Parallel Systems

- Lecture 5 -

Parallel Abstractions and MPI

Outline
 Fixed, Unlimited, and Scalable parallelism
 Load Balancing
 Granularity
 Data and Task Parallelism
 Programming Models
 MPI

2
Fixed, Unlimited, and Scalable
parallelism

Three Types of Parallelism


 Three types of parallelism presented in the course book:
– Fixed parallelism
– Unlimited parallelism
– Scalable parallelism
 The last of the three is normally the one to aim for!
– Fixed parallelism might be OK for hardware platforms that are
fixed, e.g. gaming consoles
– Unlimited parallelism is not possible in reality

4
Fixed Parallelism
 The number of tasks is hard-coded into the problem
formulation
– Does not scale at all with the number of available processors
 Example: Count 3s solution using 4 tasks
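The slide's figure is not reproduced here. As a rough sketch of what such a fixed
solution can look like (an assumption for illustration, not the book's code), the
program below hard-codes four pthreads, each counting the 3s in one quarter of the
array; compile with gcc -pthread.

/* Fixed parallelism: the number of tasks (4) is hard-coded, so the
   program cannot exploit more than 4 processors. */
#include <pthread.h>
#include <stdio.h>

#define N      1024
#define NTASKS 4                 /* fixed: baked into the formulation */

static int array[N];
static int partial[NTASKS];      /* one private counter per task */

static void *count3s(void *arg) {
    int id = *(int *)arg;
    int lo = id * (N / NTASKS), hi = lo + N / NTASKS;
    for (int i = lo; i < hi; i++)
        if (array[i] == 3)
            partial[id]++;
    return NULL;
}

int main(void) {
    pthread_t t[NTASKS];
    int id[NTASKS], total = 0;

    for (int i = 0; i < N; i++)
        array[i] = i % 10;       /* some test data */

    for (int i = 0; i < NTASKS; i++) {
        id[i] = i;
        pthread_create(&t[i], NULL, count3s, &id[i]);
    }
    for (int i = 0; i < NTASKS; i++) {
        pthread_join(t[i], NULL);
        total += partial[i];
    }
    printf("number of 3s: %d\n", total);
    return 0;
}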

Unlimited Parallelism
 An elegant solution?

 Problems:
– The number of array items (n) is much larger than the number
of available processors (P)
– Many of the threads created by the forall statement must
execute in sequence on the same processor
 Conclusion:
– Identifying parallelism is usually not problematic
– The difficulty lies in structuring the parallelism to manage and
reduce interaction among threads
6
Scalable Parallelism
 “Formulate a set S of substantial subproblems in which
natural units of the solution, of size s, are assigned to
each subproblem and solved as independently as
possible”
 “Substantial”
– There should be enough local work in a thread to amortize
parallel overhead
 “Natural”
– Computations are not always that easy to partition
 “Independent”
– Reducing interaction among subproblems leads to less idle time,
communication, etc.
7

Example: Scalable Count 3s

8
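The original figure is not included here. Below is a minimal illustrative sketch of a
scalable formulation, assuming a block decomposition where each of the P processes
owns a substantial chunk of s items, counts its local 3s independently, and a single
reduction at the end combines the partial results. It uses MPI routines that are
introduced later in this lecture, and the local chunk size N_PER_PROC is made up for
the example.

#include <mpi.h>
#include <stdio.h>

#define N_PER_PROC 100000            /* local problem size s (assumed) */

int main(int argc, char **argv) {
    int rank, size, local = 0, total = 0;
    int data[N_PER_PROC];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < N_PER_PROC; i++)     /* some local test data */
        data[i] = (rank + i) % 10;

    for (int i = 0; i < N_PER_PROC; i++)     /* substantial, independent local work */
        if (data[i] == 3)
            local++;

    /* one interaction at the end: combine the P partial counts */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("number of 3s: %d\n", total);
    MPI_Finalize();
    return 0;
}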
Load Balancing
 Scalability and good speedup can only be achieved if the
parallel workload is relatively equally spread over the
available processors
 If the workload is unevenly spread, overall performance is
bounded by the “slowest” processor (i.e. the processor with the
most workload)
 Can be difficult to influence
– On a shared-memory architecture it is often up to the operating
system scheduler
– In a distributed system the network often becomes the bottleneck
9

Example: Load Balancing


 Task to be performed:
– Performing 10000 integer summations
– Sequential time = 10000 × t
 P = 100 processors:
– Assume each processor gets 1/P = 1/100 of workload
– Time = 10000/100 × t = 100 × t
– Speedup = (10000 × t)/(100 × t) = 100 (100% of potential)
 P = 100 processors:
– Assume 1 processor gets 2/100 of workload, rest is equally
distributed among the remaining 99 processors
– Time = max(9800/99, 200) × t = 200 × t
– Speedup = (10000 × t)/(200 × t) = 50 (50% of potential)
11
Granularity

12

Recap: Parallel Computing


 In the simplest sense, parallel computing is the
simultaneous use of multiple computing resources to
solve a computational problem:
– Problem is broken into discrete parts that can be solved
concurrently
– Each part is further broken down to a series of instructions

13
Synchronization and
Communication
 Parallel programs need synchronization and
communication to ensure correct program behavior
 Synchronization and communication add overhead and
thus reduce parallel efficiency

14

Granularity
 Computation / Communication ratio:
– In parallel computing, granularity is a qualitative measure of the
ratio of computation to communication
– Periods of computation are typically separated from periods of
communication by synchronization events
– The granularity of parallelism denotes the frequency of
interaction among parallel activities

15
Granularity
 Fine-grained parallelism:
– Relatively small amounts of computational work
are done between communication events
– Facilitates load balancing
– Implies high communication overhead and less
opportunity for performance enhancement
– If too fine-grained the overhead required for
comm./synch. may exceed computation time
 Coarse-grained parallelism:
– Relatively large amounts of computational work
are done between communication events
– High computation to communication ratio
– Implies more opportunity for performance increase
– Harder to load balance efficiently
16

Granularity
 Embarrassingly parallel
– Computations extremely easy to parallelize
– Consists of a large number of threads that clearly have no
dependencies among them

17
Data and Task Parallelism

18

Data and Task Parallelism


 Parallel computations can generally be divided into two
broad classes:
– Data parallel
– Task parallel
 A data parallel computation is one in which parallelism is
applied by performing the same operation to different
items of data at the same time
– The amount of parallelism grows with the size of the data
 A task parallel computation is one in which parallelism is
applied by performing distinct computations (or tasks) at
the same time
19
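As an illustration of the two classes (my own sketch, not part of the lecture
material), the OpenMP/C program below contrasts a data-parallel loop, where the same
operation is applied to every element and the parallelism grows with N, with
task-parallel sections, where two distinct computations run at the same time.
do_task_A and do_task_B are placeholder names; compile with e.g. gcc -fopenmp.

#include <stdio.h>

#define N 1000000

static double a[N], b[N];

static void do_task_A(void) { /* some distinct computation */ }
static void do_task_B(void) { /* another distinct computation */ }

int main(void) {
    /* data parallel: same operation on different data items */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        b[i] = 2.0 * a[i];

    /* task parallel: distinct computations at the same time;
       parallelism limited by the number of different tasks */
    #pragma omp parallel sections
    {
        #pragma omp section
        do_task_A();
        #pragma omp section
        do_task_B();
    }
    printf("done\n");
    return 0;
}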
Illustrative Example
 Task to perform:
– Prepare a banquet
 Data parallel approach:
– Each meal is a unit of parallelism
– P chefs and N meals
– Each chef creates N/P complete meals
– If N increases we can increase P if we have sufficient resources,
such as stoves, refrigerators, etc.

20

Illustrative Example
 Task to perform:
– Prepare a banquet
 Task parallel approach:
– Partition the task of preparing a meal
into subtasks
• Preparing appetizers, salad, main course, dessert, …
– Each chef can focus on one or a few subtasks
– If more chefs are added, subtasks can be subdivided further
• E.g. salad preparation can be divided into washing, dicing, and
assembling
– There are dependencies among the subtasks
• Vegetables should be washed before dicing
• Vegetables should be diced before being assembled into a salad
21
Illustrative Example
 Task to perform:
– Prepare a banquet
 Hybrid approach:
– First partition banquet preparation
into a number of subtasks
– Then apply data parallelism on each of the subtasks
• E.g., multiple cooks could dice vegetables for the salad

22

Data and Task Parallelism


Granularity
 Data parallelism is often fine grained
– Work often assigned statically to processes
based on data items

 Task parallelism is typically coarse grained


– Processes often created in a more dynamic
fashion

23
Programming Models
 Many types of algorithmic paradigms used in both data
and task parallel programming
– Event driven (often task parallel problem)
– Work pool (often task/data parallel problem)
– Master-slave (often data parallel problem)
– Peer (often data parallel problem)
– Divide-and-conquer (often data parallel problem)
– Pipeline (often task parallel problem)

24

Event Driven
 Concerns a group of independent tasks that are
interacting, but in a somewhat irregular fashion
 Suitable for distributed memory programming models with
asynchronous communication
– A task sends an event but does not wait for a response
 In a shared-memory system, a queue may be used to
represent message-passing among the tasks (see the sketch
after this list)
– This requires safe, concurrent access, e.g. protected by mutex
variables
 Challenges
– Avoiding deadlock and starvation
• Deadlock: Task waits for an event that will never occur
– Load-balancing the tasks across processor elements
25
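A minimal sketch of the shared-memory variant mentioned above (my own illustration,
not from the course material): a small FIFO event queue protected by a pthread mutex,
with a condition variable so the receiving task can block until an event arrives.
Handling of a full queue is omitted for brevity; compile with gcc -pthread.

#include <pthread.h>
#include <stdio.h>

#define QSIZE 16

typedef struct {
    int items[QSIZE];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
} event_queue;

static event_queue q = { .lock = PTHREAD_MUTEX_INITIALIZER,
                         .nonempty = PTHREAD_COND_INITIALIZER };

static void send_event(int ev) {          /* sender does not wait for a response */
    pthread_mutex_lock(&q.lock);
    q.items[q.tail] = ev;
    q.tail = (q.tail + 1) % QSIZE;
    q.count++;
    pthread_cond_signal(&q.nonempty);
    pthread_mutex_unlock(&q.lock);
}

static int receive_event(void) {          /* blocks until an event is available */
    pthread_mutex_lock(&q.lock);
    while (q.count == 0)                  /* guard against spurious wake-ups */
        pthread_cond_wait(&q.nonempty, &q.lock);
    int ev = q.items[q.head];
    q.head = (q.head + 1) % QSIZE;
    q.count--;
    pthread_mutex_unlock(&q.lock);
    return ev;
}

static void *consumer(void *arg) {
    for (int i = 0; i < 3; i++)
        printf("got event %d\n", receive_event());
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, consumer, NULL);
    for (int i = 0; i < 3; i++)
        send_event(i);
    pthread_join(t, NULL);
    return 0;
}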
Master-Slave
 Master process/thread often executes the “sequential
part” and spawns slaves to execute the parallel part
– Master responsible for deciding the number of slaves
– Master responsible for distributing the workload among the
slaves
– Master waits for slaves to finish
 Challenges
– Master could become a bottleneck
– Master could have a lot of idle-time

 Example:
– Web server
26

Peer
 Like Master-Slave, but the Master assigns part of the
workload to itself to minimize the overhead
 Example: OpenMP fork-join model
[Diagram: a Master/Slave process together with three Slave processes]

27
Work Pool
 Tasks fetch a piece of work from the pool, execute it, and
(sometimes) produce new work which is then put into the
pool
– Book example: FIFO work queue
– OpenMP: dynamically scheduled for-loop (see the snippet
after this slide)
 Challenges
– Difficult to implement the work pool in an
efficient way (a shared data structure which
may become a bottleneck)

28
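A minimal snippet of the OpenMP work-pool variant named above (an assumed example,
not taken from the book): loop iterations are handed out in small chunks from a shared
pool as threads become free, so threads that finish cheap items simply fetch more
work; the reduction combines the per-thread results. Compile with -fopenmp.

#include <stdio.h>

/* placeholder for a piece of work whose cost may vary per item */
static double process(int i) { return (double)i * i; }

int main(void) {
    double sum = 0.0;

    #pragma omp parallel for schedule(dynamic, 8) reduction(+:sum)
    for (int i = 0; i < 100000; i++)
        sum += process(i);

    printf("sum = %f\n", sum);
    return 0;
}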

Divide-and-Conquer
 A parent task divides its workload into several smaller
pieces, one for each of its children
– Dividing and merging can be done recursively
– Natural for computations such as Quicksort
– Presented in more detail in Björn’s lectures
 Challenges
– For many problems difficult to
achieve a balanced workload

29
Pipeline
 Suitable when overall computation involves feeding
data through a series of operations
– Graphics applications often fall into this category
– Used heavily by GPUs
 Important to clearly express ordering constraints
– I.e., which operations must occur before others
 The key to good parallelism is to distribute the
stages of the pipeline to processor elements in a
balanced manner
 Pipelining can be expressed at both the algorithmic
and the SIMD-processing level
30

Example: Task Parallelism

31
Task Parallelism Summary
 “Natural” approach to parallelism
 Typically good efficiency
– Often coarse-grained granularity
– Tasks can often proceed without interactions
– Synchronization/communication needed at the end
 In practice scalability is limited
– Problem can be split only into a finite set of different tasks

32

Data Parallelism
 Will be handled in lectures to come

33
Threads and processes – Recap
 Threads live within a process
 Threads usually use shared memory (e.g. global
variables) to communicate

34

Threads and processes – Recap

35
Threads and processes – Recap
 Threads live within a process
 Threads usually use shared memory (e.g. global
variables) to communicate
 Processes (usually) do not share their address space
with other processes

36

Threads and processes – Recap

37
Threads and processes – Recap
 Threads live within a process
 Threads usually use shared memory (e.g. global
variables) to communicate
 Processes (usually) do not share their address space
with other processes
 Processes (usually) use message passing to
communicate

38

Threads and processes – Recap

39
Threads and processes – Recap
 In this lecture, we only focus on processes!
 Threads were considered in a previous lecture

40

MPI
 A nice tutorial can be found at:
https://computing.llnl.gov/tutorials/mpi/

 Chapter 7 in the course literature


 The MPI standards are available at:
http://www.mpi-forum.org/
 To see all the routines and utilities:
man -ik ^mpi

41
MPI

43

MPI
 Local View Programming (Explicit parallelism)
 The programmer must keep track of the state of each
process in the system
 You define the communication points
 Often kind of like threaded programming, except the
address space is not shared (SIMD/SPMD)
 Threads can be used within each process
 The interface is extremely rich!

47
MPI
 Over 100 routines (see man -k ^MPI_ | wc -l)

 Call Interface:
– rc = MPI_Xxxxx(arg, ...)
 The most essential are:
– MPI_Init()
– MPI_Comm_size()
– MPI_Comm_rank()
– MPI_Finalize()

51
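A minimal program using only these four routines (a generic "hello world" sketch,
not one of the lecture's own examples):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);                  /* start up MPI */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* how many processes in total? */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which one am I? */
    printf("Hello from process %d of %d\n", rank, size);
    MPI_Finalize();                          /* shut down MPI */
    return 0;
}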

MPI
 Communicators and groups
– Define which processes can communicate
– MPI_COMM_WORLD includes all processes
– You can define your own communicators and groups (see the
sketch after this slide)
– Within a communicator, each process belonging to it has a
unique rank (ID) in [0 .. Pcomm-1], where Pcomm is the number
of processes in the communicator
– The rank is typically used to control program flow and to
specify the source and destination of messages

56
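As a sketch of defining your own communicator (assumed example, not from the slides):
MPI_Comm_split groups the processes of MPI_COMM_WORLD by a "color", here even vs. odd
world rank, and every process gets a new rank within its own sub-communicator.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int world_rank, sub_rank;
    MPI_Comm subcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* processes with the same color end up in the same new communicator */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &subcomm);
    MPI_Comm_rank(subcomm, &sub_rank);

    printf("world rank %d has rank %d in its sub-communicator\n",
           world_rank, sub_rank);

    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}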
MPI
/* mpi.h */

int MPI_Init( int *argc,              /* pointers to command */
              char ***argv);          /* line args from main() */

int MPI_Comm_size( MPI_Comm comm,     /* communicator */
                   int *size);        /* number of processes */

int MPI_Comm_rank( MPI_Comm comm,     /* communicator */
                   int *rank);        /* rank a.k.a. ID */

int MPI_Finalize(void);

57

MPI
 Point-to-point communication types:
– Blocking send / blocking receive
– Non-blocking send / non-blocking receive
– Synchronous send
– Buffered send
– Combined send/receive
– "Ready" send

63
MPI
 Blocking calls:
– MPI_Xxxx( buffer, count, type, src/dest,
tag, communicator [, status] )
 Non-blocking calls:
– MPI_Ixxx( buffer, count, type, src/dest,
tag, communicator, request )

64

MPI
/* mpi.h – blocking send and receive */

int MPI_Send( void *buf,               /* send data buffer */
              int count,               /* items to send */
              MPI_Datatype datatype,   /* type of data */
              int dest,                /* receiving task */
              int tag,                 /* message tag */
              MPI_Comm comm);          /* communicator */

int MPI_Recv( void *buf,               /* recv data buffer */
              int count,               /* size of buffer */
              MPI_Datatype datatype,   /* type of data */
              int source,              /* sending task */
              int tag,                 /* message tag */
              MPI_Comm comm,           /* communicator */
              MPI_Status *status);     /* status */
65
MPI
/* mpi.h – datatypes */

/* MPI type */          /* Corresponding type in C/C++ */

MPI_CHAR                /* signed char */
MPI_SHORT               /* signed short int */
MPI_INT                 /* signed int */
MPI_LONG                /* signed long int */
MPI_UNSIGNED_CHAR       /* unsigned char */
MPI_UNSIGNED_SHORT      /* unsigned short int */
MPI_UNSIGNED             /* unsigned int */
MPI_UNSIGNED_LONG       /* unsigned long int */
MPI_FLOAT               /* float */
MPI_DOUBLE              /* double */
MPI_LONG_DOUBLE         /* long double */
MPI_BYTE                /* 8 binary digits */
MPI_PACKED              /* data packed with MPI_Pack() */
...                     /* ... */
User defined            /* built with the MPI_Type_* constructors */
66

MPI
/* mpi.h – non-blocking send and receive */

int MPI_Isend( void *buf,               /* send data buffer */
               int count,               /* items to send */
               MPI_Datatype datatype,   /* type of data */
               int dest,                /* receiving task */
               int tag,                 /* message tag */
               MPI_Comm comm,           /* communicator */
               MPI_Request *request);   /* communication req */

int MPI_Irecv( void *buf,               /* recv data buffer */
               int count,               /* size of buffer */
               MPI_Datatype datatype,   /* type of data */
               int source,              /* sending task */
               int tag,                 /* message tag */
               MPI_Comm comm,           /* communicator */
               MPI_Request *request);   /* communication req */
67
MPI
/* mpi.h – checking completion of non-blocking communication */

int MPI_Wait( MPI_Request *request,     /* communication req */
              MPI_Status *status);      /* status variable */

int MPI_Probe( int source,              /* sending task */
               int tag,                 /* message tag */
               MPI_Comm comm,           /* communicator */
               MPI_Status *status);     /* status variable */

int MPI_Test( MPI_Request *request,     /* communication req */
              int *flag,                /* request completed */
              MPI_Status *status);      /* status variable */

68
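An assumed usage sketch of the non-blocking calls together with MPI_Wait: every
process posts a receive and a send towards its neighbours on a ring, could perform
independent computation while the messages are (possibly) in transfer, and only then
waits for completion.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, out, in = -1;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    out = rank;
    int right = (rank + 1) % size;             /* neighbours on a ring */
    int left  = (rank + size - 1) % size;

    MPI_Irecv(&in,  1, MPI_INT, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&out, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... independent computation could overlap with the communication here ... */

    MPI_Wait(&reqs[0], MPI_STATUS_IGNORE);     /* message has now arrived */
    MPI_Wait(&reqs[1], MPI_STATUS_IGNORE);     /* send buffer may now be reused */
    printf("rank %d received %d from rank %d\n", rank, in, left);

    MPI_Finalize();
    return 0;
}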

MPI
/* mpi.h – synchronous and buffered send */

int MPI_Ssend( void *buf,               /* send data buffer */
               int count,               /* items to send */
               MPI_Datatype datatype,   /* type of data */
               int dest,                /* receiving task */
               int tag,                 /* message tag */
               MPI_Comm comm);          /* communicator */

int MPI_Bsend( void *buf,               /* send data buffer */
               int count,               /* items to send */
               MPI_Datatype datatype,   /* type of data */
               int dest,                /* receiving task */
               int tag,                 /* message tag */
               MPI_Comm comm);          /* communicator */
69
MPI
/* mpi.h – combined send and receive */

int MPI_Sendrecv( void *sbuf,            /* send data buffer */
                  int scount,            /* items to send */
                  MPI_Datatype stype,    /* type of data to send */
                  int dest,              /* receiving task */
                  int stag,              /* send message tag */

                  void *rbuf,            /* recv data buffer */
                  int rcount,            /* size of recv buffer */
                  MPI_Datatype rtype,    /* type of data to recv */
                  int source,            /* sending task */
                  int rtag,              /* recv message tag */

                  MPI_Comm comm,         /* communicator */
                  MPI_Status *stat);     /* status */
70

MPI
/* mpi.h – ready send */

int MPI_Rsend( void *buf,               /* send data buffer */
               int count,               /* items to send */
               MPI_Datatype datatype,   /* type of data */
               int dest,                /* receiving task */
               int tag,                 /* message tag */
               MPI_Comm comm);          /* communicator */

71
MPI
 Order and fairness:
– Messages will not overtake each other
• Multiple messages sent between the same pair of tasks are
received in send order
• Multiple receives posted for those messages are matched in
that order
– Note: the above bullets do not apply for multiple senders /
receivers
– Starvation and deadlock can occur (illustrated after this slide)

72
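An illustration of the deadlock risk mentioned above (assumed example, not from the
slides): if two processes both call a blocking, unbuffered send first, each one waits
for the other's receive and neither makes progress; MPI_Sendrecv performs the
exchange safely.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, other, out, in;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (size >= 2 && rank < 2) {       /* only ranks 0 and 1 take part */
        other = 1 - rank;
        out = rank;

        /* risky pattern (may deadlock if the sends are not buffered):
           MPI_Send(&out, 1, MPI_INT, other, 0, MPI_COMM_WORLD);
           MPI_Recv(&in, 1, MPI_INT, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); */

        /* safe combined exchange */
        MPI_Sendrecv(&out, 1, MPI_INT, other, 0,
                     &in,  1, MPI_INT, other, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank %d got %d\n", rank, in);
    }
    MPI_Finalize();
    return 0;
}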

MPI
#include <mpi.h>
#include <stdio.h>

int main(int argc, char ** argv) {
    int rank, size, data = 0;
    MPI_Status s;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (size > 1 && rank == 0) {
        data = 42;
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (size > 1 && rank == 1) {
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &s);
        printf("data = %d\n", data);
    }

    MPI_Finalize();
    return 0;
}
73
MPI
 Remember to include mpi.h
 Compile with:
– mpicc prog.c -o prog
 Execute with:
– mpirun -np 12 prog
 ... or:
– mpirun -np 12 -hostfile hf prog

74

MPI
 Collective communication
– Operations must involve all (or none) processes within a
communicator
– Operations are blocking
– No tag for messages
– Some overhead (operation can create new process
groupings and communicators)
– Does not support user defined data types

75
MPI
 Collective communication
– Types:
• Synchronization
– Broadcast
– Barrier
• Data movement
– Scatter
– Gather
• Collective computation
– Reduce
– Scan

76

MPI
/* mpi.h – synchronisation */

int MPI_Barrier( MPI_Comm comm);        /* communicator */

int MPI_Bcast( void *buf,               /* send data buffer */
               int count,               /* items to send */
               MPI_Datatype datatype,   /* type of data */
               int root,                /* rank of sender */
               MPI_Comm comm);          /* communicator */

77
MPI
/* mpi.h – data movement */

int MPI_Scatter( void *sbuf,             /* send data buffer */
                 int scount,             /* items to send */
                 MPI_Datatype stype,     /* type of data */
                 void *rbuf,             /* recv data buffer */
                 int rcount,             /* items to recv */
                 MPI_Datatype rtype,     /* type of data */
                 int root,               /* rank of sender */
                 MPI_Comm comm);         /* communicator */

int MPI_Gather( void *sbuf,              /* send data buffer */
                int scount,              /* items to send */
                MPI_Datatype stype,      /* type of data */
                void *rbuf,              /* recv data buffer */
                int rcount,              /* items to recv */
                MPI_Datatype rtype,      /* type of data */
                int root,                /* rank of receiver */
                MPI_Comm comm);          /* communicator */
78
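An assumed usage sketch of the two prototypes above: the root scatters one chunk of
an array to every process, each process doubles its chunk locally, and the root
gathers the chunks back in rank order; the chunk size CHUNK is made up for the
example.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK 4                       /* items per process (assumed) */

int main(int argc, char **argv) {
    int rank, size, local[CHUNK];
    int *data = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                  /* only the root holds the full array */
        data = malloc(size * CHUNK * sizeof(int));
        for (int i = 0; i < size * CHUNK; i++)
            data[i] = i;
    }

    MPI_Scatter(data, CHUNK, MPI_INT, local, CHUNK, MPI_INT, 0, MPI_COMM_WORLD);

    for (int i = 0; i < CHUNK; i++)   /* independent local work */
        local[i] *= 2;

    MPI_Gather(local, CHUNK, MPI_INT, data, CHUNK, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("first items after gather: %d %d %d %d\n",
               data[0], data[1], data[2], data[3]);
        free(data);
    }
    MPI_Finalize();
    return 0;
}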

MPI
/* mpi.h – computation */

int MPI_Reduce( void *sbuf,              /* send data buffer */
                void *rbuf,              /* recv data buffer */
                int count,               /* items in send buf */
                MPI_Datatype datatype,   /* type of send data */
                MPI_Op op,               /* reduce operation */
                int root,                /* rank of receiver */
                MPI_Comm comm);          /* communicator */

int MPI_Scan( void *sbuf,                /* send data buffer */
              void *rbuf,                /* recv data buffer */
              int count,                 /* items in send buf */
              MPI_Datatype datatype,     /* type of send data */
              MPI_Op op,                 /* scan operation */
              MPI_Comm comm);            /* communicator */

79
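A small assumed sketch of MPI_Scan: every process contributes one value and receives
the inclusive prefix sum over ranks 0..rank; MPI_Reduce is called the same way but
delivers only the final result, and only to the root.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, my_value, prefix;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    my_value = rank + 1;              /* contribute 1, 2, 3, ... */
    MPI_Scan(&my_value, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: prefix sum = %d\n", rank, prefix);
    MPI_Finalize();
    return 0;
}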
MPI
/* mpi.h – operators */

/* MPI operator */      /* Operation */

MPI_MAX                 /* maximum */
MPI_MIN                 /* minimum */
MPI_SUM                 /* sum */
MPI_PROD                /* product */
MPI_LAND                /* logical AND */
MPI_BAND                /* bit-wise AND */
MPI_LOR                 /* logical OR */
MPI_BOR                 /* bit-wise OR */
MPI_LXOR                /* logical XOR */
MPI_BXOR                /* bit-wise XOR */
MPI_MAXLOC              /* max value and location */
MPI_MINLOC              /* min value and location */
...                     /* ... */
User defined            /* created with MPI_Op_create() */
80

MPI
#include <stdio.h>
#include <math.h>   /* for pow(); link with -lm */

int main(int argc, char ** argv) {
    int i, rank = 0, size = 1, interval;   /* a single sequential process */

    interval = 1000 * (int)pow(10, rank);
    for (i = 1; i <= interval*5; ++i)
        if (i % interval == 0)
            printf("P with interval %d is at %d\n", interval, i);

    printf("Done!\n");

    return 0;
}
81
MPI
#include <mpi.h>
#include <stdio.h>
#include <math.h>   /* for pow(); link with -lm */

int main(int argc, char ** argv) {
    int i, rank, size, interval;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    interval = 1000 * (int)pow(10, rank);
    for (i = 1; i <= interval*5; ++i)
        if (i % interval == 0)
            printf("P with interval %d is at %d\n", interval, i);

    MPI_Barrier(MPI_COMM_WORLD);   /* wait here until all processes are done */
    printf("Done!\n");
    MPI_Finalize();
    return 0;
}
86

Lab 2
 Introduction to distributed memory programming using
MPI
 Preparation: Read chapter 7
 Bonus points deadline is 2011-11-29!

87
More Information
 Further reading:
– Principles of Parallel Programming
• Chapter 4 (First steps toward parallel programming)
• Chapter 5 (Scalable algorithmic techniques)
• Chapter 7 (MPI)
• Chapter 9 (Assessing the state of the art)
• Chapter 10 (Future directions in parallel programming)
• Chapter 11 (Writing parallel programs)
– Introduction to MPI:
https://computing.llnl.gov/tutorials/mpi
 Acknowledgments:
– Several slides based on material created by Prof. Erwin Laure,
erwinl@pdc.kth.se, and Andreas Ermedahl,
andreas.ermedahl@ericsson.com
88
