- Lecture 5 -
Outline
Fixed, Unlimited, and Scalable parallelism
Load Balancing
Granularity
Data and Task Parallelism
Programming Models
MPI
Fixed, Unlimited, and Scalable Parallelism
Fixed Parallelism
The number of tasks is hard-coded into the problem formulation
Does not scale with the number of available processors
Example: Count 3s solution using 4 tasks
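A sketch of such a fixed formulation (assuming a shared array a of length n; POSIX threads with the task count hard-coded to 4):

#include <pthread.h>
#include <stdio.h>

#define T 4                             /* parallelism fixed at 4 tasks */

int a[1000], n = 1000, count = 0;
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

void *count3s(void *arg) {
    int id = *(int *)arg, c = 0;
    for (int i = id * n / T; i < (id + 1) * n / T; i++)
        if (a[i] == 3) c++;             /* scan this task's fixed quarter */
    pthread_mutex_lock(&m);
    count += c;                         /* combine the private counts */
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void) {
    pthread_t th[T];
    int id[T];
    for (int i = 0; i < T; i++) {
        id[i] = i;
        pthread_create(&th[i], NULL, count3s, &id[i]);
    }
    for (int i = 0; i < T; i++)
        pthread_join(th[i], NULL);
    printf("%d threes\n", count);
    return 0;
}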
Unlimited Parallelism
An elegant solution?
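Sketched in OpenMP-style C, the unlimited formulation conceptually creates one activity per array element:

#include <stdio.h>

int main(void) {
    int n = 1000, a[1000] = {0}, count = 0;
    /* forall i in 0..n-1: conceptually one parallel activity per element */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        if (a[i] == 3) {
            #pragma omp atomic
            count++;                    /* updates are serialized */
        }
    printf("%d threes\n", count);
    return 0;
}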
Problems:
The number of array items (n) is much larger than the number
of available processors (P)
Many of the threads created by the forall statement must execute sequentially on the same processor
Conclusion:
Identifying parallelism is usually not problematic
The difficulty lies in structuring the parallelism to manage and reduce interaction among threads
Scalable Parallelism
Formulate a set S of substantial subproblems in which
natural units of the solution, of size s, are assigned to
each subproblem and solved as independently as
possible
Substantial
There should be enough local work in a thread to amortize
parallel overhead
Natural
Computations are not always that easy to partition
Independent
Reducing interaction among subproblems leads to less idle time, communication, etc.
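Returning to Count 3s, a scalable formulation assigns each of P processes one contiguous block of about n/P elements; a sketch:

/* count threes in the block owned by process p out of P */
int count3s_block(const int *a, int n, int p, int P) {
    int first = p * n / P;              /* start of this process's block */
    int last = (p + 1) * n / P;         /* exclusive end */
    int c = 0;
    for (int i = first; i < last; i++)
        if (a[i] == 3)
            c++;
    return c;                           /* partial counts are combined once, at the end */
}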
Load Balancing
Scalability and good speedup can only be achieved if the
parallel workload is relatively equally spread over the
available processors
If the workload is unevenly spread, overall performance is bounded by the slowest processor (i.e., the processor with the most work)
Can be difficult to influence
Often up to the operating system scheduler when using a shared memory architecture
Network often becomes the bottleneck in a distributed system
Synchronization and Communication
Parallel programs need synchronization and
communication to ensure correct program behavior
Synchronization and communication add overhead and thus reduce parallel efficiency
Granularity
Computation / Communication ratio:
In parallel computing, granularity is a qualitative measure of the
ratio of computation to communication
Periods of computation are typically separated from periods of
communication by synchronization events
The granularity of the parallelism denotes the frequency of interaction among parallel activities
Granularity
Fine-grained parallelism:
Relatively small amounts of computational work
are done between communication events
Facilitates load balancing
Implies high communication overhead and less
opportunity for performance enhancement
If too fine-grained, the overhead for communication/synchronization may exceed the computation time
Coarse-grained parallelism:
Relatively large amounts of computational work
are done between communication events
High computation to communication ratio
Implies more opportunity for performance increase
Harder to load balance efficiently
Granularity
Embarrassingly parallel
Computations extremely easy to parallelize
Consists of a large number of threads that clearly have no
dependencies among them
Data and Task Parallelism
Illustrative Example
Task to perform:
Prepare a banquet
Task parallel approach:
Partition the task of preparing a meal
into subtasks
Preparing appetizers, salad, main course, dessert, etc.
Each chef can focus on one or a few subtasks
If more chefs are added, subtasks can be subdivided further
E.g. salad preparation can be divided into washing, dicing, and
assembling
There are dependencies among the subtasks
Vegetables should be washed before dicing
Vegetables should be diced before being assembled into a salad
Illustrative Example
Task to perform:
Prepare a banquet
Hybrid approach:
First partition banquet preparation
into a number of subtasks
Then apply data parallelism on each of the subtasks
E.g., multiple cooks could dice vegetables for the salad
Programming Models
Many types of algorithmic paradigms used in both data
and task parallel programming
Event driven (often task parallel problem)
Work pool (often task/data parallel problem)
Master-slave (often data parallel problem)
Peer (often data parallel problem)
Divide-and-conquer (often data parallel problem)
Pipeline (often task parallel problem)
Event Driven
Concerns a group of independent tasks that are
interacting, but in a somewhat irregular fashion
Suitable for distributed memory programming models with
asynchronous communication
A task sends an event but does not wait for a response
In a shared-memory system, a queue may be used to
represent message-passing among the tasks
This requires safe concurrent access, e.g., using mutex variables
Challenges
Avoiding deadlock and starvation
Deadlock: Task waits for an event that will never occur
Load-balancing the tasks across processor elements
Master-Slave
Master process/thread often executes the sequential
part and spawns slaves to execute the parallel part
Master responsible for deciding the number of slaves
Master responsible for distributing the workload among the slaves
Master waits for the slaves to finish
Challenges
Master could become a bottleneck
Master could have a lot of idle time
[Figure: a master process coordinating three slave processes]
Example:
Web server
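A minimal master-slave sketch in MPI; rank 0 is the master and work() is a hypothetical per-item computation:

#include <mpi.h>
#include <stdio.h>

static int work(int item) { return item * item; }   /* hypothetical work */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) {                        /* master distributes and collects */
        for (int s = 1; s < size; s++)
            MPI_Send(&s, 1, MPI_INT, s, 0, MPI_COMM_WORLD);
        for (int s = 1; s < size; s++) {
            int r;
            MPI_Recv(&r, 1, MPI_INT, s, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("result from slave %d: %d\n", s, r);
        }
    } else {                                /* slaves compute their item */
        int item, r;
        MPI_Recv(&item, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        r = work(item);
        MPI_Send(&r, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}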
Peer
Like Master-Slave, but the master assigns part of the workload to itself to minimize overhead
Example: the OpenMP fork-join model
[Figure: fork-join of a master thread and slave threads]
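A minimal fork-join sketch in OpenMP C; the master thread takes its share of the work:

#include <omp.h>
#include <stdio.h>

int main(void) {
    /* fork: the master creates a team and joins in the work itself */
    #pragma omp parallel
    printf("thread %d of %d working\n",
           omp_get_thread_num(), omp_get_num_threads());
    /* implicit join: only the master continues past this point */
    return 0;
}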
Work Pool
Tasks fetch a piece of work from the pool, execute it, and (sometimes) produce new work that is put back into the pool
Book example: FIFO work queue
OpenMP: dynamically scheduled for-loop
Challenges
Difficult to implement the work pool in an
efficient way (shared data structure which
may become a bottleneck)
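A sketch of the OpenMP variant mentioned above; do_task() is a hypothetical per-item function:

void do_task(int i);                    /* hypothetical work item */

void run_pool(int ntasks) {
    /* idle threads repeatedly fetch the next unprocessed iteration */
    #pragma omp parallel for schedule(dynamic, 1)
    for (int i = 0; i < ntasks; i++)
        do_task(i);
}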
Divide-and-Conquer
A parent task divides its workload into several smaller pieces, one for each of its children
Dividing and merging can be done recursively
Natural for computations such as Quicksort
Presented in more detail in Björn's lectures
Challenges
For many problems it is difficult to achieve a balanced workload
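A sketch of a divide-and-conquer sum using OpenMP tasks (names and threshold are illustrative); for parallelism it must be started from inside #pragma omp parallel / #pragma omp single:

long dc_sum(const int *a, int n) {
    if (n < 1024) {                     /* base case: solve directly */
        long s = 0;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }
    long left, right;
    #pragma omp task shared(left)       /* divide: left half in a new task */
    left = dc_sum(a, n / 2);
    right = dc_sum(a + n / 2, n - n / 2);
    #pragma omp taskwait                /* merge once both halves are done */
    return left + right;
}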
Pipeline
Suitable when overall computation involves feeding
data through a series of operations
Graphics applications often fall into this category
Used heavily by GPUs
Important to clearly express ordering constraints
I.e., which operations must occur before others
The key to good parallelism is to distribute the
stages of the pipeline to processor elements in a
balanced manner
Pipelining can be expressed at both the algorithmic and the SIMD-processing level
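As an illustration, a pipeline sketch in MPI with one stage per rank; stage() is a hypothetical per-stage transformation:

#include <mpi.h>
#include <stdio.h>

static int stage(int rank, int x) { return x + rank; }  /* hypothetical work */

int main(int argc, char **argv) {
    int rank, size, x;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    for (int i = 0; i < 10; i++) {              /* a stream of 10 items */
        if (rank == 0)
            x = i;                              /* first stage produces */
        else
            MPI_Recv(&x, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        x = stage(rank, x);
        if (rank < size - 1)
            MPI_Send(&x, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
        else
            printf("item %d -> %d\n", i, x);    /* last stage consumes */
    }
    MPI_Finalize();
    return 0;
}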
Task Parallelism Summary
Natural approach to parallelism
Typically good efficiency
Often coarse-grained granularity
Tasks can often proceed without interactions
Synchronization/communication needed at the end
In practice, scalability is limited
The problem can only be split into a finite set of different tasks
Data Parallelism
Will be handled in lectures to come
Threads and processes Recap
Threads live within a process
Threads usually use shared memory (e.g. global
variables) to communicate
Processes (usually) do not share their address space
with other processes
Processes (usually) use message passing to
communicate
Threads and processes Recap
In this lecture, we only focus on processes!
Threads were considered in a previous lecture
MPI
A nice tutorial can be found at:
https://computing.llnl.gov/tutorials/mpi/
MPI
Local View Programming (Explicit parallelism)
The programmer must keep track of the state of each
process in the system
You define the communication points
Similar to threaded programming, except that the address space is not shared (SIMD/SPMD)
Threads can be used within each process
The interface is extremely rich!
MPI
Over 100 routines (see: man -k '^MPI_' | wc -l)
Call Interface:
rc = MPI_Xxxxx(arg,...)
The most essential are:
MPI_Init()
MPI_Comm_size()
MPI_Comm_rank()
MPI_Finalize()
MPI
Communicators and groups
Define which processes can communicate
MPI_COMM_WORLD includes all processes
You can define your own communicators and groups
Within a communicator, each member process has a unique rank (ID) in [0..Pcomm-1]
The rank is typically used to control program flow and
specify the source and destination of msgs
MPI
/* mpi.h */
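int MPI_Init(int *argc, char ***argv);
int MPI_Comm_size(MPI_Comm comm, int *size);
int MPI_Comm_rank(MPI_Comm comm, int *rank);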
int MPI_Finalize(void);
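Put together, a minimal complete MPI program using only these four calls:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);                 /* start MPI */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* my ID in the communicator */
    printf("Hello from process %d of %d\n", rank, size);
    MPI_Finalize();                         /* shut down MPI */
    return 0;
}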
MPI
Point-to-point communication types:
Blocking send / blocking receive
Non-blocking send / non-blocking receive
Synchronous send
Buffered send
Combined send/receive
"Ready" send
MPI
Blocking calls:
MPI_Xxxx( buffer, count, type, src/dest, tag, communicator [, status] )
Non-blocking calls:
MPI_Ixxx( buffer, count, type, src/dest, tag, communicator, request )
MPI
/* mpi.h – blocking send and receive */
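int MPI_Send(void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm);
int MPI_Recv(void *buf, int count, MPI_Datatype type,
             int src, int tag, MPI_Comm comm, MPI_Status *status);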
MPI
/* mpi.h – non-blocking send and receive */
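int MPI_Isend(void *buf, int count, MPI_Datatype type,
              int dest, int tag, MPI_Comm comm, MPI_Request *request);
int MPI_Irecv(void *buf, int count, MPI_Datatype type,
              int src, int tag, MPI_Comm comm, MPI_Request *request);
int MPI_Wait(MPI_Request *request, MPI_Status *status); /* completion */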
MPI
/* mpi.h – synchronous and buffered send */
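int MPI_Ssend(void *buf, int count, MPI_Datatype type,
              int dest, int tag, MPI_Comm comm);
int MPI_Bsend(void *buf, int count, MPI_Datatype type,
              int dest, int tag, MPI_Comm comm);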
/* mpi.h – combined send/receive */
int MPI_Sendrecv(
    void *sbuf,           /* send data buffer */
    int scount,           /* items to send */
    MPI_Datatype stype,   /* type of data to send */
    int dest,             /* receiving task */
    int stag,             /* send message tag */
    void *rbuf,           /* receive data buffer */
    int rcount,           /* max items to receive */
    MPI_Datatype rtype,   /* type of data to receive */
    int src,              /* sending task */
    int rtag,             /* receive message tag */
    MPI_Comm comm,        /* communicator */
    MPI_Status *status ); /* status of the receive */
MPI
/* mpi.h – ready send */
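int MPI_Rsend(void *buf, int count, MPI_Datatype type,
              int dest, int tag, MPI_Comm comm);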
MPI
Order and fairness:
Messages will not overtake each other
Multiple messages sent in sequence are received in send order
Multiple receives posted for matching messages complete in posting order
Note: the above bullets do not apply with multiple senders/receivers
Starvation and deadlock can occur
MPI
#include <mpi.h>
#include <stdio.h>
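/* A sketch (assumed example): a blocking ping-pong between ranks 0
   and 1, illustrating matched blocking sends and receives. */
int main(int argc, char **argv) {
    int rank, msg = 0;
    MPI_Status st;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &st);
        printf("rank 0 got %d back\n", msg);
    } else if (rank == 1) {
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &st);
        msg++;                          /* modify and return the message */
        MPI_Send(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}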
MPI
Collective communication
Operations must involve all (or none) processes within a
communicator
Operations are blocking
No tag for messages
Some overhead (operation can create new process
groupings and communicators)
Does not support user defined data types
MPI
Collective communication
Types:
Synchronization
Broadcast
Barrier
Data movement
Scatter
Gather
Collective computation
Reduce
Scan
MPI
/* mpi.h – synchronisation */
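int MPI_Barrier(MPI_Comm comm);
int MPI_Bcast(void *buf, int count, MPI_Datatype type,
              int root, MPI_Comm comm);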
MPI
/* mpi.h – data movement */
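int MPI_Scatter(void *sbuf, int scount, MPI_Datatype stype,
                void *rbuf, int rcount, MPI_Datatype rtype,
                int root, MPI_Comm comm);
int MPI_Gather(void *sbuf, int scount, MPI_Datatype stype,
               void *rbuf, int rcount, MPI_Datatype rtype,
               int root, MPI_Comm comm);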
MPI
/* mpi.h – computation */
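int MPI_Reduce(void *sbuf, void *rbuf, int count, MPI_Datatype type,
               MPI_Op op, int root, MPI_Comm comm);
int MPI_Scan(void *sbuf, void *rbuf, int count, MPI_Datatype type,
             MPI_Op op, MPI_Comm comm);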
MPI
/* mpi.h – operators */
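/* predefined reduction operators include: */
MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD,
MPI_LAND, MPI_LOR, MPI_BAND, MPI_BOR,
MPI_MAXLOC, MPI_MINLOC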
MPI
#include <mpi.h>
#include <stdio.h>
#include <math.h>
int main(int argc, char **argv) {
    int rank, i, interval;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* the rank selects the interval */
    interval = 1000 * (int)pow(10, rank);
    for (i = 1; i <= interval * 5; ++i)
        if (i % interval == 0)
            printf("P with interval %d is at %d\n", interval, i);
    printf("Done!\n");
    MPI_Finalize();
    return 0;
}
MPI
#include <mpi.h>
#include <stdio.h>
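/* A sketch (assumed example): every rank contributes its rank number
   and rank 0 prints the global sum computed with MPI_Reduce. */
int main(int argc, char **argv) {
    int rank, sum;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of ranks = %d\n", sum);
    MPI_Finalize();
    return 0;
}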
Lab 2
Introduction to distributed memory programming using
MPI
Preparation: Read chapter 7
Bonus points deadline is 2011-11-29!
More Information
Further reading:
Principles of Parallel Programming
Chapter 4 (First steps toward parallel programming)
Chapter 5 (Scalable algorithmic techniques)
Chapter 7 (MPI)
Chapter 9 (Assessing the state of the art)
Chapter 10 (Future directions in parallel programming)
Chapter 11 (Writing parallel programs)
Introduction to MPI:
https://computing.llnl.gov/tutorials/mpi
Acknowledgments:
Several slides based on material created by Prof. Erwin Laure, erwinl@pdc.kth.se, and Andreas Ermedahl, andreas.ermedahl@ericsson.com