
Parallelizing Applications and

Performance Aspects
(CS 526)

Dr. Muhammad Aleem,

Department of Computer Science,


National University of Computer & Emerging Sciences,
Islamabad Campus
Lecture Outline
• Designing Parallel Programs
– Automatic vs. Manual Parallelization
– Understand the Problem and the Program
– Partitioning the Problem
– Communications
– Synchronization
– Data Dependencies
– Load Balancing
– Granularity
– Limits and Costs of Parallel Programming
• Performance Analysis
What is Parallel Computing? (1)
• Traditionally, software has been written for serial
computation:
– To be run on a single computer having a single Central
Processing Unit (CPU);
– A problem is broken into a discrete series of
instructions.
– Instructions are executed one after another.
– Only one instruction may execute at any moment in
time.
What is Parallel Computing? (2)
• In the simplest sense, parallel computing is the simultaneous use of
multiple compute resources to solve a computational problem
– To be run using multiple CPUs
– A problem is broken into discrete parts that can be solved
concurrently
– Each part is further broken down to a series of instructions
• Instructions from each part execute simultaneously on different
CPUs
Some General Parallel Terminologies
• Task
– A logically discrete section of computational work.
– A task is typically a program or program-like set of
instructions that is executed by a processor.

• Parallel Task
– A task that can be executed by multiple processors safely
(producing correct results)

• Serial Execution
– Execution of a program sequentially, one statement at a
time.
– In the simplest sense, this is what happens on a one
processor machine.
Some General Parallel Terminologies
• Parallel Execution
– Execution of a program by more than one task (threads)
– Each task being able to execute the same or different
statement at the same moment in time.

• Shared Memory
– where all processors have direct (usually bus based) access
to common physical memory
– In a programming sense, it describes a model where parallel
tasks all have the same "picture" of memory

• Distributed Memory
– Network based memory access for physical memory that is
not common.
– Tasks can only logically "see" local machine memory and
must use communications to access memory on other
nodes.
Some General Parallel Terminologies
• Communications
– Parallel tasks typically need to exchange data. This can be
accomplished through shared memory or over a network.
– The actual event of data exchange is commonly referred to
as communications, regardless of the method employed.

• Synchronization
– The coordination of parallel tasks in real time, very often
associated with communications
– Often implemented by establishing a synchronization point
within an application where a task may not proceed further
until another task(s) reaches the same or logically
equivalent point.
Some General Parallel Terminologies
• Granularity
– In parallel computing, granularity is a measure of the ratio
of computation to communication.
– Coarse: relatively large amount of computational work are
done between communication events
– Fine: relatively small amounts of computational work are
done between communication events

• Observed Speedup:
– Observed speedup of a code which has been parallelized:

      wall-clock time of serial execution
      -------------------------------------
      wall-clock time of parallel execution

– One of the simplest and most widely used indicators for a
parallel program's performance.
Some General Parallel Terminologies
• Parallel Overhead
– Amount of time required to coordinate parallel tasks, as
opposed to doing useful work. Parallel overhead can
include factors such as:
• Task start-up time
• Synchronisations
• Data communications
• Software overhead imposed by parallel compilers, libraries,
tools, operating system, etc.
• Task termination time

• Massively Parallel
– Refers to the hardware that comprises a given parallel
system: one having many processors (hundreds or more)
Some General Parallel Terminologies
• Scalability
– Refers to a parallel system's (hardware and/or software)
ability to demonstrate a proportionate increase in
parallel speedup with the addition of more processors.

– Factors that contribute to scalability include:


• Hardware: particularly Memory-CPU bandwidths and
network communications
• Application algorithm
• Parallel overhead related
• Characteristics of your specific application and coding
Designing Parallel Programs
• Automatic vs. Manual Parallelization
• Understand the Problem and the Program
• Partitioning the Problem
• Communications
• Synchronization
• Data Dependencies
• Load Balancing
• Granularity
• Performance Analysis and Tuning
• Automatic vs. Manual Parallelization
Automatic vs. Manual Parallelization
• Designing and developing parallel programs has commonly been a
very manual process

• The programmer is typically responsible for both identifying and
actually implementing parallelism

• Very often, manually developing parallel codes is a time-consuming,
complex, error-prone and iterative process

• Today, various tools are available to assist the programmer with
converting serial programs into parallel programs

• The most common type of tool used to automatically parallelize a
serial program is a parallelizing compiler or pre-processor.
• A parallelizing compiler generally works in two different ways:
– Fully Automatic:
• Compiler analyzes the source code and identifies
opportunities for parallelism
• Loops (do, for) are the most frequent target for automatic
parallelization. Examples: Paralax compiler, Insieme compiler
– Programmer Directed:
• Using "compiler directives" or possibly compiler flags,
the programmer explicitly tells the compiler how to
parallelize the code. Examples: OpenMP, OpenACC
(see the sketch below)
• May be used in conjunction with some degree of
automatic parallelization.
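
As a concrete illustration of the programmer-directed approach, here is a minimal OpenMP sketch in C; the array size and function name are made up for illustration, while the #pragma omp parallel for directive itself is standard OpenMP.

    #define N 1000000

    /* Scale one array by another; compile with -fopenmp.
     * The directive tells the compiler to split the loop
     * iterations among the available threads.              */
    void scale(double *a, const double *b)
    {
        int i;
        #pragma omp parallel for
        for (i = 0; i < N; i++)
            a[i] = 2.0 * b[i];
    }

Without the directive (or without -fopenmp) the same code simply runs serially, which is part of the appeal of the directive-based approach.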
• Some disadvantages of automatic parallelization:
– Wrong results may be produced
– Performance may actually degrade
– Less flexible than manual parallelization
– Limited to a subset (mostly loops) of code
– May actually not parallelize code if the code is too
complex

• The remainder of the lecture applies to the manual
method of developing parallel codes.
• Automatic vs. Manual Parallelization
• Understand the Problem and the Program
• Partitioning
• Communications
• Synchronization
• Data Dependencies
• Load Balancing
• Granularity
• I/O
• Limits and Costs of Parallel Programming
• Performance Analysis and Tuning
1. Understanding the Problem
• Undoubtedly, the first step in developing a parallel
application is to understand the problem that you
wish to solve

• If you are starting with a serial program, this
requires understanding the existing code

• Before spending time in an attempt to develop a
parallel solution, determine whether or not the
problem is actually suitable to be parallelized
Identify the program's hot-spots
• Know where most of the real work is being done.
(The majority of scientific and technical programs
usually accomplish most of their work in a few
places.)

• Profilers and performance analysis tools can help
here

• Focus on parallelizing the hotspots and ignore
those sections of the program that account for
little CPU usage.
Identify bottlenecks in the program
• Are there areas that are disproportionately slow, or that
cause parallelizable work to halt?
– For example: I/O is usually something that slows a
program down.

• It may be possible to restructure the program or use
a different algorithm to reduce or eliminate
unnecessary slow areas
– For example, overlap communication with computation
Other considerations
• Identify obstacles to parallelism. One common
class of obstacle is data dependence

• Investigate other algorithms if possible:
– This may be the single most important consideration
when designing a parallel application
• Automatic vs. Manual Parallelization
• Understand the Problem and the Program
• Partitioning
• Communications
• Synchronization
• Data Dependencies
• Load Balancing
• Granularity
• Limits and Costs of Parallel Programming
Partitioning
• One of the first steps in designing a parallel
program is to break the problem into discrete
"chunks"

• These chunks can then be distributed to
multiple tasks. This is known as decomposition or
partitioning

• There are two basic ways to partition
computational work among parallel tasks:
1. Domain decomposition
2. Functional decomposition
Domain Decomposition
• In this type of partitioning, the data associated
with a problem is decomposed. Each parallel task
then works on a portion of the data
Partitioning Data
• There are different ways to partition data, e.g. in
contiguous blocks or cyclically (see the sketch below)
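
A minimal C sketch of block-wise domain decomposition of a one-dimensional array; the element count, task count and the idea of a per-task processing loop are illustrative assumptions, used only to show how each task derives the index range it owns.

    #define N      1000000   /* total number of data elements (illustrative) */
    #define NTASKS 4         /* number of parallel tasks (illustrative)      */

    /* Compute the block of indices [start, end) owned by a given task.
     * The last task absorbs the remainder when N is not divisible.     */
    void block_range(int task_id, int *start, int *end)
    {
        int chunk = N / NTASKS;
        *start = task_id * chunk;
        *end   = (task_id == NTASKS - 1) ? N : *start + chunk;
    }

    /* Each task would then do something like:
     *   block_range(my_id, &lo, &hi);
     *   for (i = lo; i < hi; i++)  ... process element i ...
     */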
Functional Decomposition
• In this approach, the focus is on the computation that is
to be performed rather than on the data manipulated by
the computation.

• The problem is decomposed according to the work that
must be done.

• Each task then performs a portion of the overall work.

• Functional decomposition is useful for problems that
can be split into different kinds of tasks. For example:
– Ecosystem Modeling
– Signal Processing
– Climate Modeling
• Automatic vs. Manual Parallelization
• Understand the Problem and the Program
• Partitioning
• Communications
• Synchronization
• Data Dependencies
• Load Balancing
• Granularity
• Limits and Costs of Parallel Programming
Who Needs Communications?
• You DON'T need communications
– Some types of problems can be decomposed and
executed in parallel with virtually no need for tasks
to share data. Example:
• Imagine an image processing operation where
every pixel in a black-and-white image needs to
have its color reversed (see the sketch below)

– These types of problems are often called
embarrassingly parallel because they are so
straightforward.
• Very little or no inter-task communication is required
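
A minimal sketch of the pixel-inversion example using OpenMP in C; the image dimensions and the 8-bit grayscale representation are assumptions made for illustration. Every pixel is independent, so no task ever needs data owned by another task.

    #include <stdint.h>

    #define WIDTH  1024      /* assumed image dimensions */
    #define HEIGHT 768

    /* Invert every pixel of an 8-bit grayscale image.
     * Each iteration touches exactly one pixel and nothing else,
     * so the rows can be split among threads with no communication. */
    void invert(uint8_t img[HEIGHT][WIDTH])
    {
        int x, y;
        #pragma omp parallel for private(x)
        for (y = 0; y < HEIGHT; y++)
            for (x = 0; x < WIDTH; x++)
                img[y][x] = 255 - img[y][x];
    }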
Who Needs Communications?
• You DO need communications
– Most parallel applications are not quite so simple;
they do require tasks to share data with each
other.
• For example: a 3-D heat diffusion problem requires
a task to know the temperatures calculated by the
tasks that hold neighbouring data

• Changes to the neighbouring data have a direct
effect on that task's data.
Factors to Consider (1)
Cost of communications
– Inter-task communication virtually always implies
overhead
– Machine cycles and resources that could be used for
computation are instead used to package and
transmit data.
– Communications frequently require some type of
synchronization between tasks, which can result in
tasks spending time "waiting" instead of doing
work.
– Communication traffic can saturate the available
network bandwidth, further reducing performance
Factors to Consider (2)
• Latency vs. Bandwidth:
– Latency is the time it takes to send a message from
point A to point B. Commonly expressed in
microseconds
– Bandwidth is the amount of data that can be
communicated per unit of time
• Commonly expressed as megabytes/sec

– Sending many small messages can cause latency to
dominate communication overheads

– Often it is more efficient to package small messages
into a larger message, thus increasing the effective
communications bandwidth.
Factors to Consider (3)
• Visibility of communications
– With the Message Passing Model, communications are
explicit and generally quite visible and under the control of
the programmer

– With the Data Parallel Model (and Shared Memory model),
communications often occur transparently to the
programmer

– The programmer may not even be able to know exactly how
inter-task communications are being accomplished.
Factors to Consider (4) - Synchronous vs. Asynchronous
Communications
– Synchronous communications require some type of "handshaking"
between tasks. This can be explicitly structured in code by the
programmer, or it may happen at a lower level unknown to the
programmer.

– Synchronous communications are often referred to as blocking
communications, since other work must wait until the
communications have completed

– Asynchronous communications allow tasks to transfer data
independently from one another

– Asynchronous communications are often referred to as non-blocking
communications

– Interleaving computation with communication is the single greatest
benefit of using asynchronous communications (see the sketch below)
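
A minimal MPI sketch in C contrasting a blocking send with a non-blocking send that overlaps communication and computation; the buffer, destination rank and the commented-out do_local_work() call are assumptions made for illustration, while MPI_Send, MPI_Isend and MPI_Wait are standard MPI calls.

    #include <mpi.h>

    void exchange(double *buf, int n, int dest)
    {
        MPI_Request req;
        MPI_Status  status;

        /* Blocking (synchronous-style) send: the call does not
         * return until the buffer may safely be reused.        */
        MPI_Send(buf, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);

        /* Non-blocking send: the call returns immediately, so the
         * task can compute while the transfer is in progress.     */
        MPI_Isend(buf, n, MPI_DOUBLE, dest, 1, MPI_COMM_WORLD, &req);
        /* do_local_work();   hypothetical computation overlapped
                              with the communication               */
        MPI_Wait(&req, &status);   /* complete the transfer before
                                      touching buf again           */
    }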
Factors to Consider (5)
• Scope of communications
– Knowing which tasks must communicate with each other
is critical during the design stage of a parallel code.
– Both of the two scopes (described below) can be
implemented as synchronous or asynchronous.

1. Point-to-point - involves two tasks, with one task acting as
the sender/producer of data and the other acting as the
receiver/consumer.

2. Collective - involves data sharing between more than
two tasks, which are often specified as being members of
a common group, or collective.
Collective Communications, Examples (e.g. broadcast, scatter, gather, reduction)
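
A minimal MPI sketch of two common collective operations, broadcast and reduction, in C; the array size, variable names and root rank are illustrative assumptions, while MPI_Bcast and MPI_Reduce are standard MPI collectives.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int    rank;
        double params[4] = {0};      /* data to broadcast (illustrative) */
        double local = 0.0, total;   /* per-task partial result and sum  */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Broadcast: rank 0 sends params to every member of the group.  */
        MPI_Bcast(params, 4, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* ... each task computes its partial result into 'local' ...    */

        /* Reduction: sum all partial results; only rank 0 gets 'total'. */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }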
• Automatic vs. Manual Parallelization
• Understand the Problem and the Program
• Partitioning
• Communications
• Synchronization
• Data Dependencies
• Load Balancing
• Granularity
• Limits and Costs of Parallel Programming
Types of Synchronization
1. Barrier
– Usually implies that all tasks are involved

– Each task performs its work until it reaches the barrier.
It then stops, or "blocks"

– When the last task reaches the barrier, all tasks are
synchronized (see the sketch below)
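
A minimal OpenMP barrier sketch in C; phase_one() and phase_two() are hypothetical work routines, left as comments, used only to show where the synchronization point sits.

    void two_phase_work(void)
    {
        #pragma omp parallel
        {
            /* phase_one();     each thread does its share of phase 1 */

            /* No thread may start phase 2 until every thread has
             * finished phase 1.                                      */
            #pragma omp barrier

            /* phase_two();     safe: all phase-1 results are ready   */
        }
    }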
Types of Synchronization
2. Lock / semaphore
– Can involve any number of tasks

– Typically used to serialize (protect) access to global
data or a section of code. Only one task at a time may
use (own) the lock / semaphore / flag.

– The first task to acquire the lock "sets" it. This task can
then safely (serially) access the protected data or
code.

– Other tasks can attempt to acquire the lock but must
wait until the task that owns the lock releases it.

– Can be blocking or non-blocking (see the sketch below)
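
A minimal POSIX-threads mutex sketch in C protecting a shared counter; the counter and the increment routine are illustrative, while pthread_mutex_lock/unlock are the standard blocking lock calls (pthread_mutex_trylock would be the non-blocking variant).

    #include <pthread.h>

    static long shared_counter = 0;                          /* protected data */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Called concurrently by many threads. */
    void *increment(void *arg)
    {
        pthread_mutex_lock(&lock);      /* blocks until the lock is acquired */
        shared_counter++;               /* only one thread is ever in here   */
        pthread_mutex_unlock(&lock);    /* release so other threads proceed  */
        return NULL;
    }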


Types of Synchronization
3. Synchronous Communication Operations
– Involve only those tasks executing a communication
operation

– When a task performs a communication operation,
some form of coordination is required with the other
task(s) participating in the communication.

– For example: before a task can perform a send
operation, it must first receive an acknowledgment
from the receiving task that it is OK to send.
• Automatic vs. Manual Parallelization
• Understand the Problem and the Program
• Partitioning
• Communications
• Synchronization
• Data Dependencies
• Load Balancing
• Granularity
• Limits and Costs of Parallel Programming
Definitions: Data Dependence
• A dependence exists between program statements
when the order of statement execution affects the
results of the program

• A data dependence results from multiple uses of the
same location(s) in storage by different tasks

• Dependencies are important to parallel programming
because they are one of the primary obstacles to
parallelism
Examples (1): Loop carried data dependence

      DO 500 J = MYSTART,MYEND
         A(J) = A(J-1) * 2.0
500   CONTINUE

• The value of A(J-1) must be computed before the
value of A(J); therefore A(J) exhibits a data
dependency on A(J-1). Parallelism is not possible.

• If Task 2 has A(J) and Task 1 has A(J-1), computing the
correct value of A(J) requires:
– Distributed memory architecture - Task 2 must obtain the
value of A(J-1) from Task 1 after Task 1 finishes its
computation
– Shared memory architecture - Task 2 must read A(J-1) after
Task 1 updates it
(see the contrasting sketch below)
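
For contrast, a minimal C sketch of a loop with no loop-carried dependence: each iteration reads and writes only its own elements, so (unlike the example above) the iterations can safely run in parallel, e.g. with an OpenMP directive. The array names and size are illustrative.

    #define N 100000

    /* No iteration depends on a value produced by another iteration,
     * so the loop may be parallelized safely.                         */
    void independent(double a[N], const double b[N])
    {
        int j;
        #pragma omp parallel for
        for (j = 1; j < N; j++)
            a[j] = b[j] * 2.0;    /* reads b[j] only, writes a[j] only */

        /* By contrast, a[j] = a[j-1] * 2.0 would reintroduce the
         * loop-carried dependence shown above.                    */
    }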
• Automatic vs. Manual Parallelization
• Understand the Problem and the Program
• Partitioning
• Communications
• Synchronization
• Data Dependencies
• Load Balancing
• Granularity
• Limits and Costs of Parallel Programming
Definitions: Granularity
• Computation / Communication Ratio:
– In parallel computing, granularity is a qualitative
measure of the ratio of computation to
communication
– Periods of computation are typically separated from
periods of communication by synchronization events.

1. Fine-grain parallelism
2. Coarse-grain parallelism
Fine-grain Parallelism
• Relatively small amounts of computational work are done between
communication events
• Low computation to communication ratio
• Implies high communication overhead and less opportunity for
performance enhancement
• If granularity is too fine it is possible that the overhead required for
communications and synchronization between tasks takes longer
than the computation.
Coarse-grain Parallelism
• Relatively large amounts of computational
work are done between
communication/synchronization events

• High computation to communication ratio

• Implies more opportunity for performance
increase

• Harder to load balance efficiently


• Automatic vs. Manual Parallelization
• Understand the Problem and the Program
• Partitioning
• Communications
• Synchronization
• Data Dependencies
• Load Balancing
• Granularity
• Limits and Costs of Parallel Programming
Amdahl's Law
Amdahl's Law states that potential program
speedup is defined by the fraction of code (P)
that can be parallelized:

                        1
      Max. speedup = -------
                      1 - P

• If none of the code can be parallelized, P = 0 and the speedup
= 1 (no speedup). If all of the code is parallelized, P = 1 and
the speedup is infinite (in theory).

• If 50% of the code can be parallelized, maximum speedup = 2,
meaning the code will run twice as fast.
Amdahl's Law
• It soon becomes obvious that there are limits to the
scalability of parallelism.

• For example, at P = .50, .90 and .99 (50%, 90% and


99% of the code is parallelizable)

speedup
--------------------------------
N P = .50 P = .90 P = .99
----- ------- ------- -------
10 1.82 5.26 9.17
100 1.98 9.17 50.25
1000 1.99 9.91 90.99
10000 1.99 9.91 99.02
Amdahl's Law

• Equivalently, with F = serial fraction:

      Max. speedup = 1 / F

• E.g., F = 0.05 (5% serial) gives 1/0.05 = 20x speedup (maximum)
Maximum Speedup (Amdahl's Law)
Maximum speedup is usually p with p processors
(linear speedup).

Possible to get super-linear speedup (greater than p),
but usually for a specific reason such as:
• Extra memory in multiprocessor system
• Nondeterministic algorithm
Maximum Speedup (Amdahl's Law)
Speedup?

               ts
      S(p) = ------
               tp

where ts is execution time on a single processor and tp is
execution time on a multiprocessor.

• S(p) gives the increase in speed gained by using the multiprocessor.

• Use the best sequential algorithm on a single-processor
system for ts, not the parallel program run with 1
processor. The underlying algorithm for the parallel
implementation might be (and usually is) different.
Speedup Factor?
• Speedup factor can also be expressed in terms of
computational steps:

              number of computational steps using one processor
      S(p) = ----------------------------------------------------------
              number of parallel computational steps with p processors

Speedup Factor
• With f as the fraction of the computation that is serial,
Amdahl's law gives:

                      ts                   p
      S(p) = -------------------- = --------------
              f*ts + (1 - f)*ts/p    1 + (p - 1)*f

• e.g. if f == 1 (all the code is serial), then the speedup will be 1
no matter how many processors are used
Speedup (with N CPUs or Machines)
• Introducing the number of processors performing the
parallel fraction of work, the relationship can be
modelled by:

                        1
      speedup = ----------------
                 fS + fP / Proc

• where fP = parallel fraction,
Proc = number of processors, and
fS = serial fraction (see the sketch below)
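
A small C sketch that evaluates this formula for a few processor counts; the chosen fractions (fS = 0.10, fP = 0.90) are illustrative only. It reproduces the kind of numbers shown in the table on the earlier Amdahl's Law slide.

    #include <stdio.h>

    /* speedup = 1 / (fS + fP / nproc), with fS + fP = 1 */
    static double amdahl(double fs, int nproc)
    {
        double fp = 1.0 - fs;
        return 1.0 / (fs + fp / nproc);
    }

    int main(void)
    {
        int procs[] = {10, 100, 1000, 10000};
        int i;

        for (i = 0; i < 4; i++)
            printf("fS = 0.10, Proc = %5d  ->  speedup = %.2f\n",
                   procs[i], amdahl(0.10, procs[i]));
        /* Prints 5.26, 9.17, 9.91, 9.99 for 10, 100, 1000, 10000 procs. */
        return 0;
    }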
Speedup
• However, certain problems demonstrate increased performance
by increasing the problem size. For example:
– 2D Grid Calculations: 85 seconds (85%)
– Serial fraction:      15 seconds (15%)

• We can increase the problem size by doubling the grid dimensions
and halving the time step. This results in four times the number of
grid points and twice the number of time steps, i.e. 8x the parallel
work (8 x 85 = 680 seconds), while the serial part stays the same.
The timings then look like:
– 2D Grid Calculations: 680 seconds (97.84%)
– Serial fraction:      15 seconds (2.16%)

• Problems that increase the percentage of parallel time with their
size are more scalable than problems with a fixed percentage of
parallel time.
Why use parallel processing?
• Save time: wall clock time
• Solve larger problems: increased extensibility
and configurability
• Possible better fault tolerance: advantage of
non-local resources
• Cost savings
• Overcoming memory constraints
• Scientific interest
Other Metrics for Performance Evaluation
Parallel Computers
• Task system: Set of tasks with a dependence
relation solving a given problem.
• Concurrent: 2 tasks are called concurrent, if they
are not dependent on each other.
• Simultaneous/parallel: 2 tasks are executed in
parallel if at some point in time both tasks have
been started and none terminated.
• Parallel computer: Computer executing concurrent
tasks of task systems in parallel.
Parallelism Granularity
Parallel Computers
• Graph approach: as complexity increases, new
metrics need to be introduced:

• Number of operations
• Volume of data manipulated
• Type of data: temporary, read only, etc.
• Volume of data communicated between nodes
Regularity versus Irregularity
• Data structures: dense vectors/matrices versus sparse
(stored as such) matrices

• Data access: regular vector access (with strides) versus
indirect access (scatter/gather)

• Computation: uniform computation on a grid versus
computation highly dependent upon the grid point

• Communication: regular communication versus highly
irregular.
Static vs. Dynamic Program Structure and Behavior
Communication Structure
• LOCAL: neighbor-type communication

• GLOBAL: everyone communicates with everyone

• In practice, a mixture of both!
Architecture Application Match
• Architectures are good at uniform and regular,
coarse-grain computations with local
communications, everything being static.

• Everything else generates problems!

• Computer architects' assumption:
– Software will solve it!

• Challenging problems are left to software
developers
Interaction in Parallel Systems
• Programming model specifying the interaction abstraction:
– Shared Memory
• Global addresses
• Explicit synchronization

– Message Passing
• Explicit exchange of messages
• Implicit synchronization

• Communication hardware:
– Shared memory: bus-based shared memory systems,
Symmetric Multiprocessors (SMPs)

– Message passing: network-based, such as Ethernet,
InfiniBand, etc.
Research in Parallel Systems
• There is no strict limit for contributors to the area
of parallel processing:
– Computer Architecture,
– Operating Systems,
– High-level Languages,
– Compilers,
– Databases,
– Computer Networks; all have a role to play

• This makes it a hot topic of research
Programming for Parallel Architectures
(Trick-1)

• Highway rule: Functions which consume most of
the time should be optimized the most

• But be aware of Amdahl's Law!
– Over-provisioning of the resources (CPUs)
– Large serial part
Programming for Parallel Architectures
(Trick-2)
• Plumber's rule: a lot of care has to be given to
matching the bandwidths of communicating subsystems

• Use buffers to limit the effect of sudden
bandwidth variations

• Try to overlap communication with useful
computations
Memory Hierarchy
Unified vs. Split Caches
1. One cache for both data and instructions (Unified)
2. Two caches, one for data and one for instructions (Split)

• Advantages of the unified cache:
– Higher hit rate
• Balances the load of instruction and data fetches
– Only one cache to design & implement

• Advantages of the split cache:
– Eliminates cache contention between the instruction
fetch/decode unit and the execution unit
• Important in pipelining
Principle of Locality
• Principle of Locality: the tendency to reference
data items that are near other recently referenced
data items, OR that were recently referenced

• Two categories:
1. Temporal Locality: a location that is referenced once is
likely to be referenced multiple times in the near future:

    for (int i = 0; i < 1000; i++)
        for (int j = 0; j < 1000; j++)
            a[j] = b[i] * PI;   // b[i] is re-read on every inner iteration

Principle of Locality
2. Spatial Locality: when a memory location is referenced
once, the program is likely to reference a nearby
memory location soon after:

    for (int i = 0; i < 1000; i++)
        for (int j = 0; j < 1000; j++)
            a[j] = b[i] * PI;   // a[j], a[j+1], ... are accessed consecutively
Vector Product Example

float dot_prod(float x[1024], float y[1024])
{
    float sum = 0.0;
    int i;

    for (i = 0; i < 1024; i++)
        sum = sum + (x[i] * y[i]);

    return sum;
}
Vector Product Example

Assumptions:

• Cache line size: 4 floats (16 bytes), so x[0]..x[3] share one line

• Cache mapping function: Direct Mapping
Thrashing Example: Good Case

   x[0]  x[1]  x[2]  x[3]      x[] and y[] are loaded into
   y[0]  y[1]  y[2]  y[3]      DIFFERENT cache lines

• Access Sequence
– Read x[0]: miss; x[0], x[1], x[2], x[3] loaded
– Read y[0]: miss; y[0], y[1], y[2], y[3] loaded
– Read x[1]: hit
– Read y[1]: hit
– ...
– 2 misses / 8 reads

• Analysis
– x[i] and y[i] map to different cache lines
– Cache miss rate = 25% (2 misses / 8 loads)
– Two memory accesses per iteration
– After every 4th iteration we have two misses
Thrashing Example: Bad Case

   x[0]  x[1]  x[2]  x[3]      x[] and y[] are loaded into
   y[0]  y[1]  y[2]  y[3]      the SAME cache lines

• Access Pattern
– Read x[0]: miss; x[0], x[1], x[2], x[3] loaded
– Read y[0]: miss; y[0], y[1], y[2], y[3] loaded (evicting x[0]..x[3])
– Read x[1]: miss; x[0], x[1], x[2], x[3] loaded again
– Read y[1]: miss; y[0], y[1], y[2], y[3] loaded again
– ...
– 8 misses / 8 reads (Thrashing)

• Analysis
– x[i] and y[i] map to the same cache lines
– Miss rate = 100%
– Two memory accesses per iteration
– On every iteration we have two misses
Matrix Sum Example-1
// <Get START time here>

for (kk = 0; kk < 1000; kk++) {
    sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += A[i][j];   // row-major order: consecutive elements,
                              // good spatial locality
}

// <Get END time>


Matrix Sum Example-2
// <Get START time here>

for (kk = 0; kk < 1000; kk++) {
    sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += A[j][i];   // column-wise access: stride of N elements,
                              // poor spatial locality, more cache misses
}

// <Get END time>


Programming for Parallel Architectures
(Trick-3)

• Prophet's rule: knowing the future always allows
you to make the best decisions.

• OR:
– Use the past to predict the future
– Use the compiler
– Bet on several horses
Programming for Parallel Architectures
(Trick-4)

• No matter how well thought out your architecture is,
some applications will perform poorly
– Use the compiler
– Change the applications/algorithms!
