
Parallel Systems

- Lecture 5 -

Parallel Abstractions and MPI

Outline
 Fixed, Unlimited, and Scalable parallelism
 Load Balancing
 Granularity
 Data and Task Parallelism
 Programming Models
 MPI

2
Fixed, Unlimited, and Scalable
parallelism

Three Types of Parallelism


 Three types of parallelism presented in the course book:
– Fixed parallelism
– Unlimited parallelism
– Scalable parallelism
 The last of the three is normally the one to aim for!
– Fixed parallelism might be OK for hardware platforms that are
fixed, e.g. gaming consoles
– Unlimited parallelism is not possible in reality

4
Fixed Parallelism
 The number of tasks is hard-coded into the problem
formulation
– Does not scale at all with the number of available processors
 Example: Count 3s solution using 4 tasks
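The slide's figure is not reproduced here. As a rough sketch of what such a fixed
solution can look like (an assumption for illustration, not the book's code), the
program below hard-codes four pthreads, each counting the 3s in one quarter of the
array; compile with gcc -pthread.

/* Fixed parallelism: the number of tasks (4) is hard-coded, so the
   program cannot exploit more than 4 processors. */
#include <pthread.h>
#include <stdio.h>

#define N      1024
#define NTASKS 4                 /* fixed: baked into the formulation */

static int array[N];
static int partial[NTASKS];      /* one private counter per task */

static void *count3s(void *arg) {
    int id = *(int *)arg;
    int lo = id * (N / NTASKS), hi = lo + N / NTASKS;
    for (int i = lo; i < hi; i++)
        if (array[i] == 3)
            partial[id]++;
    return NULL;
}

int main(void) {
    pthread_t t[NTASKS];
    int id[NTASKS], total = 0;

    for (int i = 0; i < N; i++)
        array[i] = i % 10;       /* some test data */

    for (int i = 0; i < NTASKS; i++) {
        id[i] = i;
        pthread_create(&t[i], NULL, count3s, &id[i]);
    }
    for (int i = 0; i < NTASKS; i++) {
        pthread_join(t[i], NULL);
        total += partial[i];
    }
    printf("number of 3s: %d\n", total);
    return 0;
}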

Unlimited Parallelism
 An elegant solution?

 Problems:
– The number of array items (n) is much larger than the number
of available processors (P)
– Many of the threads created by the forall statement must
execute in sequence on the same processor
 Conclusion:
– Identifying parallelism is usually not problematic
– The difficulty lies in structuring the parallelism to manage and
reduce interaction among threads
6
Scalable Parallelism
 “Formulate a set S of substantial subproblems in which
natural units of the solution, of size s, are assigned to
each subproblem and solved as independently as
possible”
 “Substantial”
– There should be enough local work in a thread to amortize
parallel overhead
 “Natural”
– Computations are not always that easy to partition
 “Independent”
– Reducing interaction among subproblems leads to less idle time,
communication, etc.
7

Example: Scalable Count 3s

8
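The original figure is not included here. Below is a minimal illustrative sketch of a
scalable formulation, assuming a block decomposition where each of the P processes
owns a substantial chunk of s items, counts its local 3s independently, and a single
reduction at the end combines the partial results. It uses MPI routines that are
introduced later in this lecture, and the local chunk size N_PER_PROC is made up for
the example.

#include <mpi.h>
#include <stdio.h>

#define N_PER_PROC 100000            /* local problem size s (assumed) */

int main(int argc, char **argv) {
    int rank, size, local = 0, total = 0;
    int data[N_PER_PROC];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < N_PER_PROC; i++)     /* some local test data */
        data[i] = (rank + i) % 10;

    for (int i = 0; i < N_PER_PROC; i++)     /* substantial, independent local work */
        if (data[i] == 3)
            local++;

    /* one interaction at the end: combine the P partial counts */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("number of 3s: %d\n", total);
    MPI_Finalize();
    return 0;
}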
Load Balancing
 Scalability and good speedup can only be achieved if the
parallel workload is relatively equally spread over the
available processors
 If the workload is unevenly spread, overall performance is
bounded by the “slowest” processor (i.e. the processor with the
most workload)
 Can be difficult to influence
– On a shared-memory architecture it is often up to the operating
system scheduler
– In a distributed system the network often becomes the bottleneck
9

Example: Load Balancing


 Task to be performed:
– Performing 10000 integer summations
– Sequential time = 10000 × t
 P = 100 processors:
– Assume each processor gets 1/P = 1/100 of workload
– Time = 10000/100 × t = 100 × t
– Speedup = (10000 × t)/(100 × t) = 100 (100% of potential)
 P = 100 processors:
– Assume 1 processor gets 2/100 of workload, rest is equally
distributed among the remaining 99 processors
– Time = max(9800/99, 200) × t = 200 × t
– Speedup = (10000 × t)/(200 × t) = 50 (50% of potential)
11
Granularity

12

Recap: Parallel Computing


 In the simplest sense, parallel computing is the
simultaneous use of multiple computing resources to
solve a computational problem:
– Problem is broken into discrete parts that can be solved
concurrently
– Each part is further broken down to a series of instructions

13
Synchronization and
Communication
 Parallel programs need synchronization and
communication to ensure correct program behavior
 Synchronization and communication add overhead and
thus reduce parallel efficiency

14

Granularity
 Computation / Communication ratio:
– In parallel computing, granularity is a qualitative measure of the
ratio of computation to communication
– Periods of computation are typically separated from periods of
communication by synchronization events
– The granularity of parallelism denotes the frequency of
interaction among parallel activities

15
Granularity
 Fine-grained parallelism:
– Relatively small amounts of computational work
are done between communication events
– Facilitates load balancing
– Implies high communication overhead and less
opportunity for performance enhancement
– If too fine-grained the overhead required for
comm./synch. may exceed computation time
 Coarse-grained parallelism:
– Relatively large amounts of computational work
are done between communication events
– High computation to communication ratio
– Implies more opportunity for performance increase
– Harder to load balance efficiently
16

Granularity
 Embarrassingly parallel
– Computations extremely easy to parallelize
– Consists of a large number of threads that clearly have no
dependencies among them

17
Data and Task Parallelism

18

Data and Task Parallelism


 Parallel computations can generally be divided into two
broad classes:
– Data parallel
– Task parallel
 A data parallel computation is one in which parallelism is
applied by performing the same operation to different
items of data at the same time
– The amount of parallelism grows with the size of the data
 A task parallel computation is one in which parallelism is
applied by performing distinct computations (or tasks) at
the same time
19
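As an illustration of the two classes (my own sketch, not part of the lecture
material), the OpenMP/C program below contrasts a data-parallel loop, where the same
operation is applied to every element and the parallelism grows with N, with
task-parallel sections, where two distinct computations run at the same time.
do_task_A and do_task_B are placeholder names; compile with e.g. gcc -fopenmp.

#include <stdio.h>

#define N 1000000

static double a[N], b[N];

static void do_task_A(void) { /* some distinct computation */ }
static void do_task_B(void) { /* another distinct computation */ }

int main(void) {
    /* data parallel: same operation on different data items */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        b[i] = 2.0 * a[i];

    /* task parallel: distinct computations at the same time;
       parallelism limited by the number of different tasks */
    #pragma omp parallel sections
    {
        #pragma omp section
        do_task_A();
        #pragma omp section
        do_task_B();
    }
    printf("done\n");
    return 0;
}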
Illustrative Example
 Task to perform:
– Prepare a banquet
 Data parallel approach:
– Each meal is a unit of parallelism
– P chefs and N meals
– Each chef creates N/P complete meals
– If N increases we can increase P if we have sufficient resources,
such as stoves, refrigerators, etc.

20

Illustrative Example
 Task to perform:
– Prepare a banquet
 Task parallel approach:
– Partition the task of preparing a meal
into subtasks
• Preparing appetizers, salad, main course, dessert, …
– Each chef can focus on one or a few subtasks
– If more chefs are added, subtasks can be subdivided further
• E.g. salad preparation can be divided into washing, dicing, and
assembling
– There are dependencies among the subtasks
• Vegetables should be washed before dicing
• Vegetables should be diced before being assembled into a salad
21
Illustrative Example
 Task to perform:
– Prepare a banquet
 Hybrid approach:
– First partition banquet preparation
into a number of subtasks
– Then apply data parallelism on each of the subtasks
• E.g., multiple cooks could dice vegetables for the salad

22

Data and Task Parallelism


Granularity
 Data parallelism is often fine grained
– Work often assigned statically to processes
based on data items

 Task parallelism is typically coarse grained


– Processes often created in a more dynamic
fashion

23
Programming Models
 Many types of algorithmic paradigms used in both data
and task parallel programming
– Event driven (often task parallel problem)
– Work pool (often task/data parallel problem)
– Master-slave (often data parallel problem)
– Peer (often data parallel problem)
– Divide-and-conquer (often data parallel problem)
– Pipeline (often task parallel problem)

24

Event Driven
 Concerns a group of independent tasks that are
interacting, but in a somewhat irregular fashion
 Suitable for distributed memory programming models with
asynchronous communication
– A task sends an event but does not wait for a response
 In a shared-memory system, a queue may be used to
represent message-passing among the tasks (see the sketch
after this list)
– This requires safe, concurrent access, e.g. protected by mutex
variables
 Challenges
– Avoiding deadlock and starvation
• Deadlock: Task waits for an event that will never occur
– Load-balancing the tasks across processor elements
25
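A minimal sketch of the shared-memory variant mentioned above (my own illustration,
not from the course material): a small FIFO event queue protected by a pthread mutex,
with a condition variable so the receiving task can block until an event arrives.
Handling of a full queue is omitted for brevity; compile with gcc -pthread.

#include <pthread.h>
#include <stdio.h>

#define QSIZE 16

typedef struct {
    int items[QSIZE];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
} event_queue;

static event_queue q = { .lock = PTHREAD_MUTEX_INITIALIZER,
                         .nonempty = PTHREAD_COND_INITIALIZER };

static void send_event(int ev) {          /* sender does not wait for a response */
    pthread_mutex_lock(&q.lock);
    q.items[q.tail] = ev;
    q.tail = (q.tail + 1) % QSIZE;
    q.count++;
    pthread_cond_signal(&q.nonempty);
    pthread_mutex_unlock(&q.lock);
}

static int receive_event(void) {          /* blocks until an event is available */
    pthread_mutex_lock(&q.lock);
    while (q.count == 0)                  /* guard against spurious wake-ups */
        pthread_cond_wait(&q.nonempty, &q.lock);
    int ev = q.items[q.head];
    q.head = (q.head + 1) % QSIZE;
    q.count--;
    pthread_mutex_unlock(&q.lock);
    return ev;
}

static void *consumer(void *arg) {
    for (int i = 0; i < 3; i++)
        printf("got event %d\n", receive_event());
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, consumer, NULL);
    for (int i = 0; i < 3; i++)
        send_event(i);
    pthread_join(t, NULL);
    return 0;
}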
Master-Slave
 Master process/thread often executes the “sequential
part” and spawns slaves to execute the parallel part
– Master responsible for deciding the number of slaves
– Master responsible for distributing the workload among the
slaves
– Master waits for slaves to finish
 Challenges
– Master could become a bottleneck
– Master could have a lot of idle-time

 Example:
– Web server
26

Peer
 Like Master-Slave, but the Master assigns part of the
workload to itself to minimize the overhead
 Example: OpenMP fork-join model
[Diagram: a Master/Slave process together with three Slave processes]

27
Work Pool
 Tasks fetch a piece of work from the pool, execute it, and
(sometimes) produce new work which is then put into the
pool
– Book example: FIFO work queue
– OpenMP: dynamically scheduled for-loop (see the snippet
after this slide)
 Challenges
– Difficult to implement the work pool in an
efficient way (a shared data structure which
may become a bottleneck)

28
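A minimal snippet of the OpenMP work-pool variant named above (an assumed example,
not taken from the book): loop iterations are handed out in small chunks from a shared
pool as threads become free, so threads that finish cheap items simply fetch more
work; the reduction combines the per-thread results. Compile with -fopenmp.

#include <stdio.h>

/* placeholder for a piece of work whose cost may vary per item */
static double process(int i) { return (double)i * i; }

int main(void) {
    double sum = 0.0;

    #pragma omp parallel for schedule(dynamic, 8) reduction(+:sum)
    for (int i = 0; i < 100000; i++)
        sum += process(i);

    printf("sum = %f\n", sum);
    return 0;
}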

Divide-and-Conquer
 A parent task divides its workload into several smaller
pieces, one for each of its children
– Dividing and merging can be done recursively
– Natural for computations such as Quicksort
– Presented in more detail in Björn’s lectures
 Challenges
– For many problems difficult to
achieve a balanced workload

29
Pipeline
 Suitable when overall computation involves feeding
data through a series of operations
– Graphics applications often fall into this category
– Used heavily by GPUs
 Important to clearly express ordering constraints
– I.e., which operations must occur before others
 The key to good parallelism is to distribute the
stages of the pipeline to processor elements in a
balanced manner
 Pipelining can be expressed at both the algorithmic
and the SIMD-processing level
30

Example: Task Parallelism

31
Task Parallelism Summary
 “Natural” approach to parallelism
 Typically good efficiency
– Often coarse-grained granularity
– Tasks can often proceed without interactions
– Synchronization/communication needed at the end
 In practice scalability is limited
– Problem can be split only into a finite set of different tasks

32

Data Parallelism
 Will be handled in lectures to come

33
Threads and processes – Recap
 Threads live within a process
 Threads usually use shared memory (e.g. global
variables) to communicate

34

Threads and processes – Recap

35
Threads and processes – Recap
 Threads live within a process
 Threads usually use shared memory (e.g. global
variables) to communicate
 Processes (usually) do not share their address space
with other processes

36

Threads and processes – Recap

37
Threads and processes – Recap
 Threads live within a process
 Threads usually use shared memory (e.g. global
variables) to communicate
 Processes (usually) do not share their address space
with other processes
 Processes (usually) use message passing to
communicate

38

Threads and processes – Recap

39
Threads and processes – Recap
 In this lecture, we only focus on processes!
 Threads were considered in a previous lecture

40

MPI
 A nice tutorial can be found at:
https://computing.llnl.gov/tutorials/mpi/

 Chapter 7 in the course literature


 The MPI standards are available at:
http://www.mpi-forum.org/
 To see all the routines and utilities:
man -ik ^mpi

41
MPI

43

MPI
 Local View Programming (Explicit parallelism)
 The programmer must keep track of the state of each
process in the system
 You define the communication points
 Often kind of like threaded programming, except the
address space is not shared (SIMD/SPMD)
 Threads can be used within each process
 The interface is extremely rich!

47
MPI
 Over 100 routines (see man -k ^MPI_ | wc -l)

 Call Interface:
– rc = MPI_Xxxxx(arg, ...)
 The most essential are:
– MPI_Init()
– MPI_Comm_size()
– MPI_Comm_rank()
– MPI_Finalize()

51
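A minimal program using only these four routines (a generic "hello world" sketch,
not one of the lecture's own examples):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);                  /* start up MPI */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* how many processes in total? */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which one am I? */
    printf("Hello from process %d of %d\n", rank, size);
    MPI_Finalize();                          /* shut down MPI */
    return 0;
}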

MPI
 Communicators and groups
– Define which processes can communicate
– MPI_COMM_WORLD includes all processes
– You can define your own communicators and groups (see the
sketch after this slide)
– Within a communicator, each process belonging to it has a
unique rank (ID) in [0 .. Pcomm-1], where Pcomm is the number
of processes in the communicator
– The rank is typically used to control program flow and to
specify the source and destination of messages

56
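As a sketch of defining your own communicator (assumed example, not from the slides):
MPI_Comm_split groups the processes of MPI_COMM_WORLD by a "color", here even vs. odd
world rank, and every process gets a new rank within its own sub-communicator.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int world_rank, sub_rank;
    MPI_Comm subcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* processes with the same color end up in the same new communicator */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &subcomm);
    MPI_Comm_rank(subcomm, &sub_rank);

    printf("world rank %d has rank %d in its sub-communicator\n",
           world_rank, sub_rank);

    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}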
MPI
/* mpi.h */

int MPI_Init( int *argc,              /* pointers to command */
              char ***argv);          /* line args from main() */

int MPI_Comm_size( MPI_Comm comm,     /* communicator */
                   int *size);        /* number of processes */

int MPI_Comm_rank( MPI_Comm comm,     /* communicator */
                   int *rank);        /* rank a.k.a. ID */

int MPI_Finalize(void);

57

MPI
 Point-to-point communication types:
– Blocking send / blocking receive
– Non-blocking send / non-blocking receive
– Synchronous send
– Buffered send
– Combined send/receive
– "Ready" send

63
MPI
 Blocking calls:
– MPI_Xxxx( buffer, count, type, src/dest,
tag, communicator [, status] )
 Non-blocking calls:
– MPI_Ixxx( buffer, count, type, src/dest,
tag, communicator, request )

64

MPI
/* mpi.h – blocking send and receive */

int MPI_Send( void *buf,               /* send data buffer */
              int count,               /* items to send */
              MPI_Datatype datatype,   /* type of data */
              int dest,                /* receiving task */
              int tag,                 /* message tag */
              MPI_Comm comm);          /* communicator */

int MPI_Recv( void *buf,               /* recv data buffer */
              int count,               /* size of buffer */
              MPI_Datatype datatype,   /* type of data */
              int source,              /* sending task */
              int tag,                 /* message tag */
              MPI_Comm comm,           /* communicator */
              MPI_Status *status);     /* status */
65
MPI
/* mpi.h – datatypes */

/* MPI type */          /* Corresponding type in C/C++ */

MPI_CHAR                /* signed char */
MPI_SHORT               /* signed short int */
MPI_INT                 /* signed int */
MPI_LONG                /* signed long int */
MPI_UNSIGNED_CHAR       /* unsigned char */
MPI_UNSIGNED_SHORT      /* unsigned short int */
MPI_UNSIGNED             /* unsigned int */
MPI_UNSIGNED_LONG       /* unsigned long int */
MPI_FLOAT               /* float */
MPI_DOUBLE              /* double */
MPI_LONG_DOUBLE         /* long double */
MPI_BYTE                /* 8 binary digits */
MPI_PACKED              /* data packed with MPI_Pack() */
...                     /* ... */
User defined            /* built with the MPI_Type_* constructors */
66

MPI
/* mpi.h – non-blocking send and receive */

int MPI_Isend( void *buf,               /* send data buffer */
               int count,               /* items to send */
               MPI_Datatype datatype,   /* type of data */
               int dest,                /* receiving task */
               int tag,                 /* message tag */
               MPI_Comm comm,           /* communicator */
               MPI_Request *request);   /* communication req */

int MPI_Irecv( void *buf,               /* recv data buffer */
               int count,               /* size of buffer */
               MPI_Datatype datatype,   /* type of data */
               int source,              /* sending task */
               int tag,                 /* message tag */
               MPI_Comm comm,           /* communicator */
               MPI_Request *request);   /* communication req */
67
MPI
/* mpi.h – checking completion of non-blocking communication */

int MPI_Wait( MPI_Request *request,     /* communication req */
              MPI_Status *status);      /* status variable */

int MPI_Probe( int source,              /* sending task */
               int tag,                 /* message tag */
               MPI_Comm comm,           /* communicator */
               MPI_Status *status);     /* status variable */

int MPI_Test( MPI_Request *request,     /* communication req */
              int *flag,                /* request completed */
              MPI_Status *status);      /* status variable */

68
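An assumed usage sketch of the non-blocking calls together with MPI_Wait: every
process posts a receive and a send towards its neighbours on a ring, could perform
independent computation while the messages are (possibly) in transfer, and only then
waits for completion.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, out, in = -1;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    out = rank;
    int right = (rank + 1) % size;             /* neighbours on a ring */
    int left  = (rank + size - 1) % size;

    MPI_Irecv(&in,  1, MPI_INT, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&out, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... independent computation could overlap with the communication here ... */

    MPI_Wait(&reqs[0], MPI_STATUS_IGNORE);     /* message has now arrived */
    MPI_Wait(&reqs[1], MPI_STATUS_IGNORE);     /* send buffer may now be reused */
    printf("rank %d received %d from rank %d\n", rank, in, left);

    MPI_Finalize();
    return 0;
}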

MPI
/* mpi.h – synchronous and buffered send */

int MPI_Ssend( void *buf,               /* send data buffer */
               int count,               /* items to send */
               MPI_Datatype datatype,   /* type of data */
               int dest,                /* receiving task */
               int tag,                 /* message tag */
               MPI_Comm comm);          /* communicator */

int MPI_Bsend( void *buf,               /* send data buffer */
               int count,               /* items to send */
               MPI_Datatype datatype,   /* type of data */
               int dest,                /* receiving task */
               int tag,                 /* message tag */
               MPI_Comm comm);          /* communicator */
69
MPI
/* mpi.h – combined send and receive */

int MPI_Sendrecv( void *sbuf,            /* send data buffer */
                  int scount,            /* items to send */
                  MPI_Datatype stype,    /* type of data to send */
                  int dest,              /* receiving task */
                  int stag,              /* send message tag */

                  void *rbuf,            /* recv data buffer */
                  int rcount,            /* size of recv buffer */
                  MPI_Datatype rtype,    /* type of data to recv */
                  int source,            /* sending task */
                  int rtag,              /* recv message tag */

                  MPI_Comm comm,         /* communicator */
                  MPI_Status *stat);     /* status */
70

MPI
/* mpi.h – ready send */

int MPI_Rsend( void *buf,               /* send data buffer */
               int count,               /* items to send */
               MPI_Datatype datatype,   /* type of data */
               int dest,                /* receiving task */
               int tag,                 /* message tag */
               MPI_Comm comm);          /* communicator */

71
MPI
 Order and fairness:
– Messages will not overtake each other
• Multiple messages sent between the same pair of tasks are
received in send order
• Multiple receives posted for those messages are matched in
that order
– Note: the above bullets do not apply for multiple senders /
receivers
– Starvation and deadlock can occur (illustrated after this slide)

72
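An illustration of the deadlock risk mentioned above (assumed example, not from the
slides): if two processes both call a blocking, unbuffered send first, each one waits
for the other's receive and neither makes progress; MPI_Sendrecv performs the
exchange safely.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, other, out, in;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (size >= 2 && rank < 2) {       /* only ranks 0 and 1 take part */
        other = 1 - rank;
        out = rank;

        /* risky pattern (may deadlock if the sends are not buffered):
           MPI_Send(&out, 1, MPI_INT, other, 0, MPI_COMM_WORLD);
           MPI_Recv(&in, 1, MPI_INT, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); */

        /* safe combined exchange */
        MPI_Sendrecv(&out, 1, MPI_INT, other, 0,
                     &in,  1, MPI_INT, other, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank %d got %d\n", rank, in);
    }
    MPI_Finalize();
    return 0;
}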

MPI
#include <mpi.h>
#include <stdio.h>

int main(int argc, char ** argv) {
    int rank, size, data = 0;
    MPI_Status s;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (size > 1 && rank == 0) {
        data = 42;
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (size > 1 && rank == 1) {
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &s);
        printf("data = %d\n", data);
    }

    MPI_Finalize();
    return 0;
}
73
MPI
 Remember to include mpi.h
 Compile with:
– mpicc prog.c -o prog
 Execute with:
– mpirun -np 12 prog
 ... or:
– mpirun -np 12 -hostfile hf prog

74

MPI
 Collective communication
– Operations must involve all (or none) processes within a
communicator
– Operations are blocking
– No tag for messages
– Some overhead (operation can create new process
groupings and communicators)
– Does not support user defined data types

75
MPI
 Collective communication
– Types:
• Synchronization
– Broadcast
– Barrier
• Data movement
– Scatter
– Gather
• Collective computation
– Reduce
– Scan

76

MPI
/* mpi.h – synchronisation */

int MPI_Barrier( MPI_Comm comm);        /* communicator */

int MPI_Bcast( void *buf,               /* send data buffer */
               int count,               /* items to send */
               MPI_Datatype datatype,   /* type of data */
               int root,                /* rank of sender */
               MPI_Comm comm);          /* communicator */

77
MPI
/* mpi.h – data movement */

int MPI_Scatter( void *sbuf,             /* send data buffer */
                 int scount,             /* items to send */
                 MPI_Datatype stype,     /* type of data */
                 void *rbuf,             /* recv data buffer */
                 int rcount,             /* items to recv */
                 MPI_Datatype rtype,     /* type of data */
                 int root,               /* rank of sender */
                 MPI_Comm comm);         /* communicator */

int MPI_Gather( void *sbuf,              /* send data buffer */
                int scount,              /* items to send */
                MPI_Datatype stype,      /* type of data */
                void *rbuf,              /* recv data buffer */
                int rcount,              /* items to recv */
                MPI_Datatype rtype,      /* type of data */
                int root,                /* rank of receiver */
                MPI_Comm comm);          /* communicator */
78
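An assumed usage sketch of the two prototypes above: the root scatters one chunk of
an array to every process, each process doubles its chunk locally, and the root
gathers the chunks back in rank order; the chunk size CHUNK is made up for the
example.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK 4                       /* items per process (assumed) */

int main(int argc, char **argv) {
    int rank, size, local[CHUNK];
    int *data = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                  /* only the root holds the full array */
        data = malloc(size * CHUNK * sizeof(int));
        for (int i = 0; i < size * CHUNK; i++)
            data[i] = i;
    }

    MPI_Scatter(data, CHUNK, MPI_INT, local, CHUNK, MPI_INT, 0, MPI_COMM_WORLD);

    for (int i = 0; i < CHUNK; i++)   /* independent local work */
        local[i] *= 2;

    MPI_Gather(local, CHUNK, MPI_INT, data, CHUNK, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("first items after gather: %d %d %d %d\n",
               data[0], data[1], data[2], data[3]);
        free(data);
    }
    MPI_Finalize();
    return 0;
}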

MPI
/* mpi.h – computation */

int MPI_Reduce( void *sbuf,              /* send data buffer */
                void *rbuf,              /* recv data buffer */
                int count,               /* items in send buf */
                MPI_Datatype datatype,   /* type of send data */
                MPI_Op op,               /* reduce operation */
                int root,                /* rank of receiver */
                MPI_Comm comm);          /* communicator */

int MPI_Scan( void *sbuf,                /* send data buffer */
              void *rbuf,                /* recv data buffer */
              int count,                 /* items in send buf */
              MPI_Datatype datatype,     /* type of send data */
              MPI_Op op,                 /* scan operation */
              MPI_Comm comm);            /* communicator */

79
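A small assumed sketch of MPI_Scan: every process contributes one value and receives
the inclusive prefix sum over ranks 0..rank; MPI_Reduce is called the same way but
delivers only the final result, and only to the root.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, my_value, prefix;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    my_value = rank + 1;              /* contribute 1, 2, 3, ... */
    MPI_Scan(&my_value, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: prefix sum = %d\n", rank, prefix);
    MPI_Finalize();
    return 0;
}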
MPI
/* mpi.h – operators */

/* MPI operator */      /* Operation */

MPI_MAX                 /* maximum */
MPI_MIN                 /* minimum */
MPI_SUM                 /* sum */
MPI_PROD                /* product */
MPI_LAND                /* logical AND */
MPI_BAND                /* bit-wise AND */
MPI_LOR                 /* logical OR */
MPI_BOR                 /* bit-wise OR */
MPI_LXOR                /* logical XOR */
MPI_BXOR                /* bit-wise XOR */
MPI_MAXLOC              /* max value and location */
MPI_MINLOC              /* min value and location */
...                     /* ... */
User defined            /* created with MPI_Op_create() */
80

MPI
#include <stdio.h>
#include <math.h>   /* for pow(); link with -lm */

int main(int argc, char ** argv) {
    int i, rank = 0, size = 1, interval;   /* a single sequential process */

    interval = 1000 * (int)pow(10, rank);
    for (i = 1; i <= interval*5; ++i)
        if (i % interval == 0)
            printf("P with interval %d is at %d\n", interval, i);

    printf("Done!\n");

    return 0;
}
81
MPI
#include <mpi.h>
#include <stdio.h>
#include <math.h>   /* for pow(); link with -lm */

int main(int argc, char ** argv) {
    int i, rank, size, interval;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    interval = 1000 * (int)pow(10, rank);
    for (i = 1; i <= interval*5; ++i)
        if (i % interval == 0)
            printf("P with interval %d is at %d\n", interval, i);

    MPI_Barrier(MPI_COMM_WORLD);   /* wait here until all processes are done */
    printf("Done!\n");
    MPI_Finalize();
    return 0;
}
86

Lab 2
 Introduction to distributed memory programming using
MPI
 Preparation: Read chapter 7
 Bonus points deadline is 2011-11-29!

87
More Information
 Further reading:
– Principles of Parallel Programming
• Chapter 4 (First steps toward parallel programming)
• Chapter 5 (Scalable algorithmic techniques)
• Chapter 7 (MPI)
• Chapter 9 (Assessing the state of the art)
• Chapter 10 (Future directions in parallel programming)
• Chapter 11 (Writing parallel programs)
– Introduction to MPI:
https://computing.llnl.gov/tutorials/mpi
 Acknowledgments:
– Several slides based on material created by Prof. Erwin Laure,
erwinl@pdc.kth.se, and Andreas Ermedahl,
andreas.ermedahl@ericsson.com
88
