
Disclaimer

 In preparation of these slides, material has been taken from different online sources in the form of books, websites, research papers and presentations, etc. However, the author does not have any intention to take any benefit of these in her/his own name. This lecture (audio, video, slides, etc.) is prepared and delivered only for educational purposes and is not intended to infringe upon the copyrighted material. Sources have been acknowledged where applicable. The views expressed are the presenter's alone and do not necessarily represent the actual author(s) or the institution.
Designing Parallel Programs

Steps in designing parallel programs
The main steps:
1. Understand the problem

2. Partitioning (separation into main tasks)
3. Decomposition & Granularity
4. Communications
5. Identify data dependencies
6. Synchronization
7. Load balancing
8. Performance analysis and tuning
Automatic vs. Manual Parallelization

 Designing and developing parallel programs has characteristically been a completely manual process. The programmer is typically responsible for both identifying and actually implementing parallelism.

 Very often, manually developing parallel codes is a time-consuming, complex, error-prone and iterative process.

 For a number of years now, various tools have been available to assist
the programmer with converting serial programs into parallel
programs.

 The most common type of tool used to automatically parallelize a serial program is a parallelizing compiler or pre-processor.
Parallelizing Compiler

 Fully Automatic
The compiler analyzes the source code and identifies
opportunities for parallelism
The analysis includes identifying inhibitors to parallelism and
possibly a cost weighting on whether or not the parallelism would
actually improve performance
Loops (do-while, for, while) are the most frequent target for
automatic parallelization.
 Programmer Directed
"compiler directives" or possibly compiler flags
The programmer explicitly tells the compiler how to parallelize
the code
May be used in combination with some degree of automatic parallelization (a minimal directive example is sketched below)
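As a rough illustration (not taken from the original slides), the sketch below shows programmer-directed parallelization in C with an OpenMP compiler directive: the pragma tells the compiler that the loop iterations are independent and may be split across threads. Compiling with, e.g., gcc -fopenmp enables the directive; without the flag, the same code simply runs serially.

```c
/* Minimal sketch of a programmer-directed compiler directive (OpenMP).
 * Compile with e.g. gcc -fopenmp; without the flag the pragma is ignored. */
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) {        /* serial initialization */
        a[i] = i * 0.5;
        b[i] = i * 2.0;
    }

    /* The directive tells the compiler this loop is safe to split
     * across threads: the iterations are independent of one another. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f\n", c[N - 1]);
    return 0;
}
```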
Automatic Parallelization:
The Good and the Bad
 The “Good”
Beginning with an existing serial code
Time or budget constraints

 The “Bad”
Wrong results may be produced
Performance may actually degrade
Much less flexible than manual parallelization
Limited to a subset (mostly loops) of code
May actually not parallelize code if the analysis suggests there are
inhibitors or the code is too complex
Most automatic parallelization tools are for Fortran
Steps to Parallelization

Understand the Problem and the Program

Partitioning
(domain vs functional decomposition)

Communication
(cost, latency, bandwidth, visibility,
synchronization, etc.)

Data Dependencies
Step 1: Understanding the problem
 Problem A: Calculate the potential energy for each of several thousand
independent conformations of a molecule. When done, find the minimum
energy conformation.

 Problem B: Calculation of the Fibonacci series (1, 1, 2, 3, 5, 8, 13, 21, ...) by use of the formula: F(k+2) = F(k+1) + F(k)

Which one can be parallelized? Why?
Step 1: Understanding the problem
 Make sure that you understand the problem, and that it is the right problem, before you attempt to formulate a solution (i.e., 'problem validation')
 Some problems aren't suited to parallelization: the problem just might not be parallelizable
Example of a non-parallelizable problem:
Calculating the Fibonacci series, e.g., Fib(1020), as quickly as possible, using:
Fib(n) = n, if n < 2
Fib(n) = Fib(n-1) + Fib(n-2), otherwise
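A minimal sketch of the contrast in C, assuming a hypothetical energy() routine that stands in for the real conformation-energy calculation: in Problem A every conformation is independent and can be evaluated concurrently, while in Problem B each Fibonacci term depends on the two previous terms and must be computed in order.

```c
/* Sketch contrasting the two problems; energy() is a made-up placeholder.
 * Compile with e.g. gcc -fopenmp -lm. */
#include <math.h>
#include <stdio.h>

#define NCONF 10000

static double energy(int conf)            /* placeholder workload */
{
    return sin((double)conf) * cos((double)conf);
}

int main(void)
{
    static double e[NCONF];

    /* Problem A: conformations are independent, so the loop parallelizes. */
    #pragma omp parallel for
    for (int i = 0; i < NCONF; i++)
        e[i] = energy(i);                 /* finding the minimum would follow */

    /* Problem B: F(k+2) depends on F(k+1) and F(k), so the work is serial. */
    unsigned long long f0 = 1, f1 = 1;
    for (int k = 2; k < 90; k++) {
        unsigned long long f2 = f0 + f1;  /* must wait for the previous terms */
        f0 = f1;
        f1 = f2;
    }

    printf("e[0] = %f, F(90) = %llu\n", e[0], f1);
    return 0;
}
```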
Step 1: Understanding the problem

 Many problems may have obvious parallel solutions.
 Others may need some concerted effort to discover the parallel nature.

Easily parallel example:
Given two tables, client (clientname, clientnum) and orders (clientnum, ordernum, date), find the names of clients who ordered something on 1 March 2010.
Obvious solution: divide orders into smaller parts; each task checks the dates in its partition; if the date matches, the clientnum is added to a temporary list; at the end, all the temporary lists are combined and each clientnum is replaced by the clientname (a small sketch of this approach follows).
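A rough sketch of this approach in C (the table layout and the sample rows are made up for illustration): each task scans its own slice of the orders table in parallel and records its matches; a final serial step combines them. The lookup of clientname in the client table would follow the same pattern.

```c
/* Partitioned date-filter sketch; data and layout are hypothetical. */
#include <stdio.h>
#include <string.h>

#define NORDERS 8

struct order { int clientnum; int ordernum; char date[11]; };

int main(void)
{
    struct order orders[NORDERS] = {
        {1, 100, "2010-03-01"}, {2, 101, "2010-02-28"},
        {3, 102, "2010-03-01"}, {1, 103, "2010-03-02"},
        {4, 104, "2010-03-01"}, {2, 105, "2010-03-01"},
        {5, 106, "2010-01-15"}, {3, 107, "2010-03-03"},
    };
    int match[NORDERS];                  /* per-row result, no sharing */

    /* Each task checks the dates in its own portion of the table. */
    #pragma omp parallel for
    for (int i = 0; i < NORDERS; i++)
        match[i] = (strcmp(orders[i].date, "2010-03-01") == 0);

    /* Combine step: gather the matching clientnums. */
    for (int i = 0; i < NORDERS; i++)
        if (match[i])
            printf("clientnum %d ordered on 1 March 2010\n",
                   orders[i].clientnum);
    return 0;
}
```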
Understand the Problem

 Identify the program's hotspots
Know where most of the real work is being done
Most programs accomplish most of their work in a few places (use profilers and performance analysis tools)
Focus on parallelizing the hotspots

 Identify bottlenecks in the program
Are there areas that are disproportionately slow, or cause parallelizable work to halt or be deferred? (e.g., I/O)
May be possible to restructure the program or use a different algorithm to
reduce or eliminate unnecessary slow areas

 Identify inhibitors to parallelism
Data dependence, …

 Investigate other algorithms if possible
Step 2: Partitioning the Problem

 This step involves breaking the problem into separate chunks of work, which can then be implemented as multiple distributed tasks.
 Two typical methods to partition
computation among parallel tasks:
1. Functional decomposition
2. Domain decomposition
Functional Decomposition
(Focus on the computation)
 The focus is on the computation that is to be performed rather than on the data
manipulated by the computation
 The problem is decomposed according to the work that must be done. Each task
then performs a portion of the overall work.
 Functional decomposition lends itself well to problems that can be split into
different tasks.
Functional decomposition
[Figure: the problem is decomposed into Task #1 to Task #4, each performing a different list of sub-steps]
Functional decomposition is suited to problems that can be split into different tasks
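For illustration only (not part of the original slides), a minimal sketch of functional decomposition using OpenMP sections: each section carries out a different piece of the overall work. The task names are hypothetical.

```c
/* Functional decomposition sketch: different tasks do different work.
 * Compile with e.g. gcc -fopenmp. */
#include <stdio.h>

int main(void)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        { printf("task 1: read and validate input\n"); }

        #pragma omp section
        { printf("task 2: update model state\n"); }

        #pragma omp section
        { printf("task 3: write diagnostics\n"); }
    }
    return 0;
}
```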
Example of Functional Decomposition:
Ecosystem Modeling
 Each program calculates the population of a given group, where each group's
growth depends on that of its neighbors.
 As time progresses, each process calculates its current state, then exchanges
information with the neighbor populations.
 All tasks then progress to calculate the state at the next time step
Example of Functional Decomposition:
Climate Modeling
 Each model component can be thought of as a separate task.
 Arrows represent exchanges of data between components during computation: the
atmosphere model generates wind velocity data that are used by the ocean model,
the ocean model generates sea surface temperature data that are used by the
atmosphere model, and so on.
Domain decomposition
Involves separating the data, or taking different ‘views’ of
the same data. E.g.
View 1 : looking at frequencies (e.g. FFTs)
View 2 : looking at amplitudes
Each parallel task works on its own portion of the data,
or does something different with the same data
[Figure: the same data under two views. View 1 splits the data among Task1, Task2 and Task3; View 2 splits it among Task1 and Task2; each path produces its own result]
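A minimal domain-decomposition sketch in C (illustrative only, using OpenMP): the data array is divided into equal portions and each task works only on its own portion.

```c
/* Domain decomposition sketch: each task processes its own chunk of data.
 * Compile with e.g. gcc -fopenmp. */
#include <stdio.h>

#define N 16
#define NTASKS 4

int main(void)
{
    double data[N];
    for (int i = 0; i < N; i++)
        data[i] = (double)i;

    /* One "task" per chunk; here the chunks are handled by threads. */
    #pragma omp parallel for
    for (int t = 0; t < NTASKS; t++) {
        int lo = t * (N / NTASKS);
        int hi = lo + (N / NTASKS);
        for (int i = lo; i < hi; i++)    /* work restricted to own portion */
            data[i] = data[i] * data[i];
    }

    for (int i = 0; i < N; i++)
        printf("%g ", data[i]);
    printf("\n");
    return 0;
}
```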
Domain decomposition

 Good to use for problems where:
 Data is static (e.g., factoring; matrix calculations)
 Dynamic data structures are linked to a single entity, where the entity can be made into subsets (e.g., multi-body problems)
 Domain is fixed, but computation within certain regions of the domain is dynamic (e.g., fluid vortices models)
Domain decomposition
(terms introduced in Lecture 3)

Many possible ways to divide things up. If you want to look at it visually…

[Figure: example partitionings: continuous blocked (same-size partitions), blocked, interlaced, and interleaved (cyclic)]
Other methods are also possible (a sketch comparing blocked and cyclic mappings follows).
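As a rough sketch (assuming N elements distributed over P tasks), the fragment below prints which task owns each index under a blocked mapping versus an interleaved (cyclic) mapping.

```c
/* Sketch of two common ways to map N elements onto P tasks. */
#include <stdio.h>

#define N 12
#define P 3

int main(void)
{
    int owner_blocked[N], owner_cyclic[N];

    for (int i = 0; i < N; i++) {
        owner_blocked[i] = i / (N / P);   /* blocked: same-size contiguous partitions */
        owner_cyclic[i]  = i % P;         /* interleaved / cyclic: round-robin */
    }

    printf("index :");
    for (int i = 0; i < N; i++) printf(" %2d", i);
    printf("\nblock :");
    for (int i = 0; i < N; i++) printf(" %2d", owner_blocked[i]);
    printf("\ncyclic:");
    for (int i = 0; i < N; i++) printf(" %2d", owner_cyclic[i]);
    printf("\n");
    return 0;
}
```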
Step 3: Decomposition and
Granularity
 Decomposition – how the problem can
be divided up; looked at earlier:
 Functional decomposition
 Domain (or data) decomposition

 Granularity
 How big or small are the parts that the
problem has been decomposed into?
 How interrelated are the sub-tasks?

This was introduced in Lecture 5.


Granularity
 In parallel computing, granularity (or grain size) of a task is a
measure of the amount of work (or computation) which is performed
by that task.

 Another definition of granularity takes into account the communication overhead between multiple processors or processing elements. It defines granularity as the ratio of computation time to communication time, wherein computation time is the time required to perform the computation of a task and communication time is the time required to exchange data between processors.

 Depending on the amount of work which is performed by a parallel task, parallelism can be classified into three categories:
 fine-grained
 medium-grained
 coarse-grained
The Ratio of
Computation : Communication
 This ratio can help to decide whether a problem is fine- or coarse-grained.
 1 : 1 = each intermediate result needs a communication operation
 100 : 1 = 100 computations (or intermediate results) require only one communication operation
 1 : 100 = each computation needs 100 communication operations
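A toy calculation of the ratio, with made-up timings; the 100 : 1 cutoff used for the label simply echoes the slide's example and is not a hard rule.

```c
/* Toy granularity ratio G = Tcomp / Tcomm (numbers are invented). */
#include <stdio.h>

int main(void)
{
    double t_comp = 10.0e-3;   /* 10 ms of computation per task          */
    double t_comm = 0.1e-3;    /* 0.1 ms spent exchanging data per task  */
    double g = t_comp / t_comm;

    /* The 100:1 threshold mirrors the slide's example, nothing more. */
    printf("granularity ratio = %.0f : 1 (%s-grained)\n",
           g, g >= 100.0 ? "coarse" : "fine");
    return 0;
}
```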
Granularity: Fine Grained

 In fine-grained parallelism, a program is broken down into a large number of small tasks. These tasks are assigned individually to many processors.
 The amount of work associated with a parallel task is low and
the work is evenly distributed among the processors. Hence,
fine-grained parallelism facilitates load balancing.
 As each task processes less data, the number of processors
required to perform the complete processing is high. This in
turn, increases the communication and synchronization
overhead.
 Shared-memory architectures, which have low communication overhead, are most suitable for fine-grained parallelism.
Granularity : Fine Grained

 An example of a fine-grained system (from outside the parallel computing domain) is the system of neurons in our brain.
 Connection Machine (CM-2) and J-Machine are examples of fine-grain
parallel computers that have grain size in the range of 4-5 μs.
 The ratio computation : communication is low, approaching 1:1
Granularity: Coarse Grained

 In coarse-grained parallelism, a program is split into large tasks. Due to this, a large amount of computation takes place in each processor.
 This might result in load imbalance, wherein certain tasks process the bulk of the data while others might be idle.
 The advantage of this type of parallelism is low
communication and synchronization overhead.
 A message-passing architecture takes a long time to communicate data among processes, which makes it suitable for coarse-grained parallelism.
Granularity : Coarse Grained

 Cray Y-MP is an example of a coarse-grained parallel computer which has a grain size of about 20 s.
 The computation : communication ratio is high, say around 100:1.
Decomposition and Granularity
 Many image processing problems are suited to coarse-grained solutions, e.g., calculations can be performed on individual pixels or small sets of pixels without requiring knowledge of any other pixel in the image.
 Scientific problems tend to be between coarse and fine
granularity. These solutions may require some amount
of interaction between regions, therefore the individual
processors doing the work need to collaborate and
exchange results (i.e., need for synchronization and
message passing).
 E.g., any one element in the data set may depend on the
values of its nearest neighbors. If data is decomposed into two
parts that are each processed by a separate CPU, then the
CPUs will need to exchange boundary information.
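A hedged sketch of such a boundary ("halo") exchange between two processes, written here with MPI purely for illustration (the slides do not prescribe a particular library). Each rank owns a block of cells plus one ghost cell at each end, and only the shared edge values are exchanged.

```c
/* Two-rank boundary exchange sketch.
 * Compile with mpicc; run with: mpirun -np 2 ./halo */
#include <mpi.h>
#include <stdio.h>

#define LOCAL_N 4

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank owns LOCAL_N cells plus one ghost cell at each end. */
    double u[LOCAL_N + 2];
    u[0] = u[LOCAL_N + 1] = 0.0;
    for (int i = 1; i <= LOCAL_N; i++)
        u[i] = rank * 100.0 + i;

    if (size == 2) {
        int other = 1 - rank;
        if (rank == 0)      /* send my right edge, receive neighbour's edge */
            MPI_Sendrecv(&u[LOCAL_N], 1, MPI_DOUBLE, other, 0,
                         &u[LOCAL_N + 1], 1, MPI_DOUBLE, other, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        else                /* send my left edge, receive neighbour's edge */
            MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, other, 0,
                         &u[0], 1, MPI_DOUBLE, other, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank %d ghost cells after exchange: %g %g\n",
               rank, u[0], u[LOCAL_N + 1]);
    }

    MPI_Finalize();
    return 0;
}
```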
Class Activity: Discussion
 Which of the following are more fine-grained, and which are more coarse-grained?
 Matrix multiply
 FFTs
 Decryption code breaking
 (deterministic) Finite state machine validation /
termination checking
 Map navigation (e.g., shortest path)
 Population modelling
Class Activity: Discussion
 Which of the following are more fine-grained, and which are more coarse-grained?
 Matrix multiply - fine grain data, coarse func.
 FFTs - fine grained
 Decryption code breaking - coarse grained
 (deterministic) Finite state machine validation /
termination checking - coarse grained
 Map navigation (e.g., shortest path) - coarse
 Population modelling - coarse grained
