
Parallelizing Applications and

Performance Aspects
(CS 526)

Dr. Muhammad Aleem,

Department of Computer Science,


National University of Computer & Emerging Sciences,
Islamabad Campus
Lecture Outline
• Designing Parallel Programs
– Automatic vs. Manual Parallelization
– Understand the Problem and the Program
– Partitioning the Problem
– Communications
– Synchronization
– Data Dependencies
– Load Balancing
– Granularity
– Limits and Costs of Parallel Programming
• Performance Analysis
What is Parallel Computing? (1)
• Traditionally, software has been written for serial
computation:
– To be run on a single computer having a single Central
Processing Unit (CPU);
– A problem is broken into a discrete series of
instructions.
– Instructions are executed one after another.
– Only one instruction may execute at any moment in
time.
What is Parallel Computing? (2)
• In the simplest sense, parallel computing is the simultaneous use of
multiple compute resources to solve a computational problem
– To be run using multiple CPUs
– A problem is broken into discrete parts that can be solved
concurrently
– Each part is further broken down to a series of instructions
• Instructions from each part execute simultaneously on different
CPUs
Some General Parallel Terminologies
• Task
– A logically discrete section of computational work.
– A task is typically a program or program-like set of
instructions that is executed by a processor.

• Parallel Task
– A task that can be executed by multiple processors safely
(producing correct results)

• Serial Execution
– Execution of a program sequentially, one statement at a
time.
– In the simplest sense, this is what happens on a one
processor machine.
Some General Parallel Terminologies
• Parallel Execution
– Execution of a program by more than one task (threads)
– Each task being able to execute the same or different
statement at the same moment in time.

• Shared Memory
– where all processors have direct (usually bus based) access
to common physical memory
– In a programming sense, it describes a model where parallel
tasks all have the same "picture" of memory

• Distributed Memory
– Network based memory access for physical memory that is
not common.
– Tasks can only logically "see" local machine memory and
must use communications to access memory on other
nodes.
Some General Parallel Terminologies
• Communications
– Parallel tasks typically need to exchange data. This can be
accomplished through shared memory or over a network.
– The actual event of data exchange is commonly referred to
as communications, regardless of the method employed.

• Synchronization
– The coordination of parallel tasks in real time, very often
associated with communications
– Often implemented by establishing a synchronization point
within an application where a task may not proceed further
until another task(s) reaches the same or logically
equivalent point.
Some General Parallel Terminologies
• Granularity
– In parallel computing, granularity is a measure of the ratio
of computation to communication.
– Coarse: relatively large amount of computational work are
done between communication events
– Fine: relatively small amounts of computational work are
done between communication events

• Observed Speedup:
– Observed speedup of a code which has been parallelized:

      wall-clock time of serial execution
      -------------------------------------
      wall-clock time of parallel execution

– One of the simplest and most widely used indicators for a
parallel program's performance.
Some General Parallel Terminologies
• Parallel Overhead
– Amount of time required to coordinate parallel tasks, as
opposed to doing useful work. Parallel overhead can
include factors such as:
• Task start-up time
• Synchronisations
• Data communications
• Software overhead imposed by parallel compilers, libraries,
tools, operating system, etc.
• Task termination time

• Massively Parallel
– Refers to the hardware that comprises a given parallel
system: one having many processors (hundreds or more)
Some General Parallel Terminologies
• Scalability
– Refers to a parallel system's (hardware and/or software)
ability to demonstrate a proportionate increase in
parallel speedup with the addition of more processors.

– Factors that contribute to scalability include:


• Hardware: particularly Memory-CPU bandwidths and
network communications
• Application algorithm
• Parallel overhead related
• Characteristics of your specific application and coding
Designing Parallel Programs
• Automatic vs. Manual Parallelization
• Understand the Problem and the Program
• Partitioning the Problem
• Communications
• Synchronization
• Data Dependencies
• Load Balancing
• Granularity
• Performance Analysis and Tuning
• Automatic vs. Manual Parallelization
Automatic vs. Manual Parallelization
• Designing and developing parallel programs has commonly been a
very manual process

• The programmer is typically responsible for both identifying and
actually implementing parallelism

• Very often, manually developing parallel codes is a time-consuming,
complex, error-prone and iterative process

• Today, various tools are available to assist the programmer with
converting serial programs into parallel programs

• The most common type of tool used to automatically parallelize a
serial program is a parallelizing compiler or pre-processor.
• A parallelizing compiler generally works in two different ways:
– Fully Automatic:
• Compiler analyzes the source code and identifies
opportunities for parallelism
• Loops (do, for) are the most frequent target for automatic
parallelization. Examples: Paralax compiler, Insieme compiler
– Programmer Directed:
• Using "compiler directives" or possibly compiler flags,
the programmer explicitly tells the compiler how to
parallelize the code. Examples: OpenMP, OpenACC
(see the sketch below)
• May be used in conjunction with some degree of
automatic parallelization.
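
As a concrete illustration of the programmer-directed approach, here is a minimal OpenMP sketch in C; the array size and function name are made up for illustration, while the #pragma omp parallel for directive itself is standard OpenMP.

    #define N 1000000

    /* Scale one array by another; compile with -fopenmp.
     * The directive tells the compiler to split the loop
     * iterations among the available threads.              */
    void scale(double *a, const double *b)
    {
        int i;
        #pragma omp parallel for
        for (i = 0; i < N; i++)
            a[i] = 2.0 * b[i];
    }

Without the directive (or without -fopenmp) the same code simply runs serially, which is part of the appeal of the directive-based approach.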
• Some disadvantages of automatic parallelization:
– Wrong results may be produced
– Performance may actually degrade
– Less flexible than manual parallelization
– Limited to a subset (mostly loops) of code
– May actually not parallelize code if the code is too
complex

• The remainder of the lecture applies to the manual
method of developing parallel codes.
• Automatic vs. Manual Parallelization
• Understand the Problem and the Program
• Partitioning
• Communications
• Synchronization
• Data Dependencies
• Load Balancing
• Granularity
• I/O
• Limits and Costs of Parallel Programming
• Performance Analysis and Tuning
1. Understanding the Problem
• Undoubtedly, the first step in developing a parallel
application is to understand the problem that you
wish to solve

• If you are starting with a serial program, this
requires understanding the existing code

• Before spending time in an attempt to develop a
parallel solution, determine whether or not the
problem is actually suitable to be parallelized
Identify the program's hot-spots
• Know where most of the real work is being done.
(The majority of scientific and technical programs
usually accomplish most of their work in a few
places.)

• Profilers and performance analysis tools can help
here

• Focus on parallelizing the hotspots and ignore
those sections of the program that account for
little CPU usage.
Identify bottlenecks in the program
• Are there areas that are disproportionately slow, or that
cause parallelizable work to halt?
– For example: I/O is usually something that slows a
program down.

• It may be possible to restructure the program or use
a different algorithm to reduce or eliminate
unnecessary slow areas
– For example, overlap communication with computation
Other considerations
• Identify obstacles to parallelism. One common
class of obstacle is data dependence

• Investigate other algorithms if possible:
– This may be the single most important consideration
when designing a parallel application
• Automatic vs. Manual Parallelization
• Understand the Problem and the Program
• Partitioning
• Communications
• Synchronization
• Data Dependencies
• Load Balancing
• Granularity
• Limits and Costs of Parallel Programming
Partitioning
• One of the first steps in designing a parallel
program is to break the problem into discrete
"chunks"

• These chunks can then be distributed to
multiple tasks. This is known as decomposition or
partitioning

• There are two basic ways to partition
computational work among parallel tasks:
1. Domain decomposition
2. Functional decomposition
Domain Decomposition
• In this type of partitioning, the data associated
with a problem is decomposed. Each parallel task
then works on a portion of the data
Partitioning Data
• There are different ways to partition data, e.g. in
contiguous blocks or cyclically (see the sketch below)
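
A minimal C sketch of block-wise domain decomposition of a one-dimensional array; the element count, task count and the idea of a per-task processing loop are illustrative assumptions, used only to show how each task derives the index range it owns.

    #define N      1000000   /* total number of data elements (illustrative) */
    #define NTASKS 4         /* number of parallel tasks (illustrative)      */

    /* Compute the block of indices [start, end) owned by a given task.
     * The last task absorbs the remainder when N is not divisible.     */
    void block_range(int task_id, int *start, int *end)
    {
        int chunk = N / NTASKS;
        *start = task_id * chunk;
        *end   = (task_id == NTASKS - 1) ? N : *start + chunk;
    }

    /* Each task would then do something like:
     *   block_range(my_id, &lo, &hi);
     *   for (i = lo; i < hi; i++)  ... process element i ...
     */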
Functional Decomposition
• In this approach, the focus is on the computation that is
to be performed rather than on the data manipulated by
the computation.

• The problem is decomposed according to the work that
must be done.

• Each task then performs a portion of the overall work.

• Functional decomposition is useful for problems that
can be split into different kinds of tasks. For example:
– Ecosystem Modeling
– Signal Processing
– Climate Modeling
• Automatic vs. Manual Parallelization
• Understand the Problem and the Program
• Partitioning
• Communications
• Synchronization
• Data Dependencies
• Load Balancing
• Granularity
• Limits and Costs of Parallel Programming
Who Needs Communications?
• You DON'T need communications
– Some types of problems can be decomposed and
executed in parallel with virtually no need for tasks
to share data. Example:
• Imagine an image processing operation where
every pixel in a black-and-white image needs to
have its color reversed (see the sketch below)

– These types of problems are often called
embarrassingly parallel because they are so
straightforward.
• Very little or no inter-task communication is required
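
A minimal sketch of the pixel-inversion example using OpenMP in C; the image dimensions and the 8-bit grayscale representation are assumptions made for illustration. Every pixel is independent, so no task ever needs data owned by another task.

    #include <stdint.h>

    #define WIDTH  1024      /* assumed image dimensions */
    #define HEIGHT 768

    /* Invert every pixel of an 8-bit grayscale image.
     * Each iteration touches exactly one pixel and nothing else,
     * so the rows can be split among threads with no communication. */
    void invert(uint8_t img[HEIGHT][WIDTH])
    {
        int x, y;
        #pragma omp parallel for private(x)
        for (y = 0; y < HEIGHT; y++)
            for (x = 0; x < WIDTH; x++)
                img[y][x] = 255 - img[y][x];
    }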
Who Needs Communications?
• You DO need communications
– Most parallel applications are not quite so simple;
they do require tasks to share data with each
other.
• For example: a 3-D heat diffusion problem requires
a task to know the temperatures calculated by the
tasks that hold neighbouring data

• Changes to the neighbouring data have a direct
effect on that task's data.
Factors to Consider (1)
Cost of communications
– Inter-task communication virtually always implies
overhead
– Machine cycles and resources that could be used for
computation are instead used to package and
transmit data.
– Communications frequently require some type of
synchronization between tasks, which can result in
tasks spending time "waiting" instead of doing
work.
– Communication traffic can saturate the available
network bandwidth, further reducing performance
Factors to Consider (2)
• Latency vs. Bandwidth:
– Latency is the time it takes to send a message from
point A to point B. Commonly expressed in
microseconds
– Bandwidth is the amount of data that can be
communicated per unit of time
• Commonly expressed as megabytes/sec

– Sending many small messages can cause latency to
dominate communication overheads

– Often it is more efficient to package small messages
into a larger message, thus increasing the effective
communications bandwidth.
Factors to Consider (3)
• Visibility of communications
– With the Message Passing Model, communications are
explicit and generally quite visible and under the control of
the programmer

– With the Data Parallel Model (and Shared Memory model),
communications often occur transparently to the
programmer

– The programmer may not even be able to know exactly how
inter-task communications are being accomplished.
Factors to Consider (4) - Synchronous vs. Asynchronous
Communications
– Synchronous communications require some type of "handshaking"
between tasks. This can be explicitly structured in code by the
programmer, or it may happen at a lower level unknown to the
programmer.

– Synchronous communications are often referred to as blocking
communications, since other work must wait until the
communications have completed

– Asynchronous communications allow tasks to transfer data
independently from one another

– Asynchronous communications are often referred to as non-blocking
communications

– Interleaving computation with communication is the single greatest
benefit of using asynchronous communications (see the sketch below)
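
A minimal MPI sketch in C contrasting a blocking send with a non-blocking send that overlaps communication and computation; the buffer, destination rank and the commented-out do_local_work() call are assumptions made for illustration, while MPI_Send, MPI_Isend and MPI_Wait are standard MPI calls.

    #include <mpi.h>

    void exchange(double *buf, int n, int dest)
    {
        MPI_Request req;
        MPI_Status  status;

        /* Blocking (synchronous-style) send: the call does not
         * return until the buffer may safely be reused.        */
        MPI_Send(buf, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);

        /* Non-blocking send: the call returns immediately, so the
         * task can compute while the transfer is in progress.     */
        MPI_Isend(buf, n, MPI_DOUBLE, dest, 1, MPI_COMM_WORLD, &req);
        /* do_local_work();   hypothetical computation overlapped
                              with the communication               */
        MPI_Wait(&req, &status);   /* complete the transfer before
                                      touching buf again           */
    }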
Factors to Consider (5)
• Scope of communications
– Knowing which tasks must communicate with each other
is critical during the design stage of a parallel code.
– Both of the two scopes (described below) can be
implemented as synchronous or asynchronous.

1. Point-to-point - involves two tasks, with one task acting as
the sender/producer of data and the other acting as the
receiver/consumer.

2. Collective - involves data sharing between more than
two tasks, which are often specified as being members of
a common group, or collective.
Collective Communications, Examples (e.g. broadcast, scatter, gather, reduction)
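
A minimal MPI sketch of two common collective operations, broadcast and reduction, in C; the array size, variable names and root rank are illustrative assumptions, while MPI_Bcast and MPI_Reduce are standard MPI collectives.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int    rank;
        double params[4] = {0};      /* data to broadcast (illustrative) */
        double local = 0.0, total;   /* per-task partial result and sum  */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Broadcast: rank 0 sends params to every member of the group.  */
        MPI_Bcast(params, 4, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* ... each task computes its partial result into 'local' ...    */

        /* Reduction: sum all partial results; only rank 0 gets 'total'. */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }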
• Automatic vs. Manual Parallelization
• Understand the Problem and the Program
• Partitioning
• Communications
• Synchronization
• Data Dependencies
• Load Balancing
• Granularity
• Limits and Costs of Parallel Programming
Types of Synchronization
1. Barrier
– Usually implies that all tasks are involved

– Each task performs its work until it reaches the barrier.
It then stops, or "blocks"

– When the last task reaches the barrier, all tasks are
synchronized (see the sketch below)
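
A minimal OpenMP barrier sketch in C; phase_one() and phase_two() are hypothetical work routines, left as comments, used only to show where the synchronization point sits.

    void two_phase_work(void)
    {
        #pragma omp parallel
        {
            /* phase_one();     each thread does its share of phase 1 */

            /* No thread may start phase 2 until every thread has
             * finished phase 1.                                      */
            #pragma omp barrier

            /* phase_two();     safe: all phase-1 results are ready   */
        }
    }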
Types of Synchronization
2. Lock / semaphore
– Can involve any number of tasks

– Typically used to serialize (protect) access to global
data or a section of code. Only one task at a time may
use (own) the lock / semaphore / flag.

– The first task to acquire the lock "sets" it. This task can
then safely (serially) access the protected data or
code.

– Other tasks can attempt to acquire the lock but must
wait until the task that owns the lock releases it.

– Can be blocking or non-blocking (see the sketch below)
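
A minimal POSIX-threads mutex sketch in C protecting a shared counter; the counter and the increment routine are illustrative, while pthread_mutex_lock/unlock are the standard blocking lock calls (pthread_mutex_trylock would be the non-blocking variant).

    #include <pthread.h>

    static long shared_counter = 0;                          /* protected data */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Called concurrently by many threads. */
    void *increment(void *arg)
    {
        pthread_mutex_lock(&lock);      /* blocks until the lock is acquired */
        shared_counter++;               /* only one thread is ever in here   */
        pthread_mutex_unlock(&lock);    /* release so other threads proceed  */
        return NULL;
    }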


Types of Synchronization
3. Synchronous Communication Operations
– Involve only those tasks executing a communication
operation

– When a task performs a communication operation,
some form of coordination is required with the other
task(s) participating in the communication.

– For example: before a task can perform a send
operation, it must first receive an acknowledgment
from the receiving task that it is OK to send.
• Automatic vs. Manual Parallelization
• Understand the Problem and the Program
• Partitioning
• Communications
• Synchronization
• Data Dependencies
• Load Balancing
• Granularity
• Limits and Costs of Parallel Programming
Definitions: Data Dependence
• A dependence exists between program statements
when the order of statement execution affects the
results of the program

• A data dependence results from multiple uses of the
same location(s) in storage by different tasks

• Dependencies are important to parallel programming
because they are one of the primary obstacles to
parallelism
Examples (1): Loop carried data dependence

      DO 500 J = MYSTART,MYEND
         A(J) = A(J-1) * 2.0
500   CONTINUE

• The value of A(J-1) must be computed before the
value of A(J); therefore A(J) exhibits a data
dependency on A(J-1). Parallelism is not possible.

• If Task 2 has A(J) and Task 1 has A(J-1), computing the
correct value of A(J) requires:
– Distributed memory architecture - Task 2 must obtain the
value of A(J-1) from Task 1 after Task 1 finishes its
computation
– Shared memory architecture - Task 2 must read A(J-1) after
Task 1 updates it
(see the contrasting sketch below)
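
For contrast, a minimal C sketch of a loop with no loop-carried dependence: each iteration reads and writes only its own elements, so (unlike the example above) the iterations can safely run in parallel, e.g. with an OpenMP directive. The array names and size are illustrative.

    #define N 100000

    /* No iteration depends on a value produced by another iteration,
     * so the loop may be parallelized safely.                         */
    void independent(double a[N], const double b[N])
    {
        int j;
        #pragma omp parallel for
        for (j = 1; j < N; j++)
            a[j] = b[j] * 2.0;    /* reads b[j] only, writes a[j] only */

        /* By contrast, a[j] = a[j-1] * 2.0 would reintroduce the
         * loop-carried dependence shown above.                    */
    }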
• Automatic vs. Manual Parallelization
• Understand the Problem and the Program
• Partitioning
• Communications
• Synchronization
• Data Dependencies
• Load Balancing
• Granularity
• Limits and Costs of Parallel Programming
Definitions: Granularity
• Computation / Communication Ratio:
– In parallel computing, granularity is a qualitative
measure of the ratio of computation to
communication
– Periods of computation are typically separated from
periods of communication by synchronization events.

1. Fine-grain parallelism
2. Coarse-grain parallelism
Fine-grain Parallelism
• Relatively small amounts of computational work are done between
communication events
• Low computation to communication ratio
• Implies high communication overhead and less opportunity for
performance enhancement
• If granularity is too fine it is possible that the overhead required for
communications and synchronization between tasks takes longer
than the computation.
Coarse-grain Parallelism
• Relatively large amounts of computational
work are done between
communication/synchronization events

• High computation to communication ratio

• Implies more opportunity for performance
increase

• Harder to load balance efficiently


• Automatic vs. Manual Parallelization
• Understand the Problem and the Program
• Partitioning
• Communications
• Synchronization
• Data Dependencies
• Load Balancing
• Granularity
• Limits and Costs of Parallel Programming
Amdahl's Law
Amdahl's Law states that potential program
speedup is defined by the fraction of code (P)
that can be parallelized:

                        1
      Max. speedup = -------
                      1 - P

• If none of the code can be parallelized, P = 0 and the speedup
= 1 (no speedup). If all of the code is parallelized, P = 1 and
the speedup is infinite (in theory).

• If 50% of the code can be parallelized, maximum speedup = 2,
meaning the code will run twice as fast.
Amdahl's Law
• It soon becomes obvious that there are limits to the
scalability of parallelism.

• For example, at P = .50, .90 and .99 (50%, 90% and


99% of the code is parallelizable)

speedup
--------------------------------
N P = .50 P = .90 P = .99
----- ------- ------- -------
10 1.82 5.26 9.17
100 1.98 9.17 50.25
1000 1.99 9.91 90.99
10000 1.99 9.91 99.02
Amdahl's Law

• Equivalently, with F = serial fraction:

      Max. speedup = 1 / F

• E.g., F = 0.05 (5% serial) gives 1/0.05 = 20x speedup (maximum)
Maximum Speedup (Amdahl's Law)
Maximum speedup is usually p with p processors
(linear speedup).

Possible to get super-linear speedup (greater than p),
but usually for a specific reason such as:
• Extra memory in multiprocessor system
• Nondeterministic algorithm
Maximum Speedup (Amdahl's Law)
Speedup?

               ts
      S(p) = ------
               tp

where ts is execution time on a single processor and tp is
execution time on a multiprocessor.

• S(p) gives the increase in speed gained by using the multiprocessor.

• Use the best sequential algorithm on a single-processor
system for ts, not the parallel program run with 1
processor. The underlying algorithm for the parallel
implementation might be (and usually is) different.
Speedup Factor?
• Speedup factor can also be expressed in terms of
computational steps:

              number of computational steps using one processor
      S(p) = ----------------------------------------------------------
              number of parallel computational steps with p processors

Speedup Factor
• With f as the fraction of the computation that is serial,
Amdahl's law gives:

                      ts                   p
      S(p) = -------------------- = --------------
              f*ts + (1 - f)*ts/p    1 + (p - 1)*f

• e.g. if f == 1 (all the code is serial), then the speedup will be 1
no matter how many processors are used
Speedup (with N CPUs or Machines)
• Introducing the number of processors performing the
parallel fraction of work, the relationship can be
modelled by:

                        1
      speedup = ----------------
                 fS + fP / Proc

• where fP = parallel fraction,
Proc = number of processors, and
fS = serial fraction (see the sketch below)
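
A small C sketch that evaluates this formula for a few processor counts; the chosen fractions (fS = 0.10, fP = 0.90) are illustrative only. It reproduces the kind of numbers shown in the table on the earlier Amdahl's Law slide.

    #include <stdio.h>

    /* speedup = 1 / (fS + fP / nproc), with fS + fP = 1 */
    static double amdahl(double fs, int nproc)
    {
        double fp = 1.0 - fs;
        return 1.0 / (fs + fp / nproc);
    }

    int main(void)
    {
        int procs[] = {10, 100, 1000, 10000};
        int i;

        for (i = 0; i < 4; i++)
            printf("fS = 0.10, Proc = %5d  ->  speedup = %.2f\n",
                   procs[i], amdahl(0.10, procs[i]));
        /* Prints 5.26, 9.17, 9.91, 9.99 for 10, 100, 1000, 10000 procs. */
        return 0;
    }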
Speedup
• However, certain problems demonstrate increased performance
by increasing the problem size. For example:
– 2D Grid Calculations: 85 seconds (85%)
– Serial fraction:      15 seconds (15%)

• We can increase the problem size by doubling the grid dimensions
and halving the time step. This results in four times the number of
grid points and twice the number of time steps, i.e. 8x the parallel
work (8 x 85 = 680 seconds), while the serial part stays the same.
The timings then look like:
– 2D Grid Calculations: 680 seconds (97.84%)
– Serial fraction:      15 seconds (2.16%)

• Problems that increase the percentage of parallel time with their
size are more scalable than problems with a fixed percentage of
parallel time.
Why use parallel processing?
• Save time: wall clock time
• Solve larger problems: increased extensibility
and configurability
• Possible better fault tolerance: advantage of
non-local resources
• Cost savings
• Overcoming memory constraints
• Scientific interest
Other Metrics for Performance Evaluation
Parallel Computers
• Task system: Set of tasks with a dependence
relation solving a given problem.
• Concurrent: 2 tasks are called concurrent, if they
are not dependent on each other.
• Simultaneous/parallel: 2 tasks are executed in
parallel if at some point in time both tasks have
been started and none terminated.
• Parallel computer: Computer executing concurrent
tasks of task systems in parallel.
Parallelism Granularity
Parallel Computers
• Graph approach: as complexity increases, new
metrics need to be introduced:

• Number of operations
• Volume of data manipulated
• Type of data: temporary, read only, etc.
• Volume of data communicated between nodes
Regularity versus Irregularity
• Data structures: dense vectors/matrices versus sparse
(stored as such) matrices

• Data access: regular vector access (with strides) versus
indirect access (scatter/gather)

• Computation: uniform computation on a grid versus
computation highly dependent upon the grid point

• Communication: regular communication versus highly
irregular.
Static vs. Dynamic Program Structure and Behavior
Communication Structure
• LOCAL: neighbor-type communication

• GLOBAL: everyone communicates with everyone

• In practice, a mixture of both!
Architecture Application Match
• Architectures are good at uniform and regular,
coarse-grain computations with local
communications, everything being static.

• Everything else generates problems!

• Computer architects' assumption:
– Software will solve it!

• Challenging problems are left to software
developers
Interaction in Parallel Systems
• Programming model specifying the interaction abstraction:
– Shared Memory
• Global addresses
• Explicit synchronization

– Message Passing
• Explicit exchange of messages
• Implicit synchronization

• Communication hardware:
– Shared memory: bus-based shared memory systems,
Symmetric Multiprocessors (SMPs)

– Message passing: network-based, such as Ethernet,
InfiniBand, etc.
Research in Parallel Systems
• There is no strict limit for contributors to the area
of parallel processing:
– Computer Architecture,
– Operating Systems,
– High-level Languages,
– Compilers,
– Databases,
– Computer Networks; all have a role to play

• This makes it a hot topic of research
Programming for Parallel Architectures
(Trick-1)

• Highway rule: Functions which consume most of
the time should be optimized the most

• But be aware of Amdahl's Law!
– Over-provisioning of the resources (CPUs)
– Large serial part
Programming for Parallel Architectures
(Trick-2)
• Plumber's rule: a lot of care has to be given to
matching the bandwidths of communicating subsystems

• Use buffers to limit the effect of sudden
bandwidth variations

• Try to overlap communication with useful
computations
Memory Hierarchy
Unified vs. Split Caches
1. One cache for both data and instructions (Unified)
2. Two caches, one for data and one for instructions (Split)

• Advantages of the unified cache:
– Higher hit rate
• Balances the load of instruction and data fetches
– Only one cache to design & implement

• Advantages of the split cache:
– Eliminates cache contention between the instruction
fetch/decode unit and the execution unit
• Important in pipelining
Principle of Locality
• Principle of Locality: the tendency to reference
data items that are near other recently referenced
data items, OR that were recently referenced

• Two categories:
1. Temporal Locality: a location that is referenced once is
likely to be referenced multiple times in the near future:

    for (int i = 0; i < 1000; i++)
        for (int j = 0; j < 1000; j++)
            a[j] = b[i] * PI;   // b[i] is re-read on every inner iteration

Principle of Locality
2. Spatial Locality: when a memory location is referenced
once, the program is likely to reference a nearby
memory location soon after:

    for (int i = 0; i < 1000; i++)
        for (int j = 0; j < 1000; j++)
            a[j] = b[i] * PI;   // a[j], a[j+1], ... are accessed consecutively
Vector Product Example

float dot_prod(float x[1024], float y[1024])
{
    float sum = 0.0;
    int i;

    for (i = 0; i < 1024; i++)
        sum = sum + (x[i] * y[i]);

    return sum;
}
Vector Product Example

Assumptions:

• Cache line size: 4 floats (16 bytes), so x[0]..x[3] share one line

• Cache mapping function: Direct Mapping
Thrashing Example: Good Case

   x[0]  x[1]  x[2]  x[3]      x[] and y[] are loaded into
   y[0]  y[1]  y[2]  y[3]      DIFFERENT cache lines

• Access Sequence
– Read x[0]: miss; x[0], x[1], x[2], x[3] loaded
– Read y[0]: miss; y[0], y[1], y[2], y[3] loaded
– Read x[1]: hit
– Read y[1]: hit
– ...
– 2 misses / 8 reads

• Analysis
– x[i] and y[i] map to different cache lines
– Cache miss rate = 25% (2 misses / 8 loads)
– Two memory accesses per iteration
– After every 4th iteration we have two misses
Thrashing Example: Bad Case

   x[0]  x[1]  x[2]  x[3]      x[] and y[] are loaded into
   y[0]  y[1]  y[2]  y[3]      the SAME cache lines

• Access Pattern
– Read x[0]: miss; x[0], x[1], x[2], x[3] loaded
– Read y[0]: miss; y[0], y[1], y[2], y[3] loaded (evicting x[0]..x[3])
– Read x[1]: miss; x[0], x[1], x[2], x[3] loaded again
– Read y[1]: miss; y[0], y[1], y[2], y[3] loaded again
– ...
– 8 misses / 8 reads (Thrashing)

• Analysis
– x[i] and y[i] map to the same cache lines
– Miss rate = 100%
– Two memory accesses per iteration
– On every iteration we have two misses
Matrix Sum Example-1
// <Get START time here>

for (kk = 0; kk < 1000; kk++) {
    sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += A[i][j];   // row-major order: consecutive elements,
                              // good spatial locality
}

// <Get END time>


Matrix Sum Example-2
// <Get START time here>

for (kk = 0; kk < 1000; kk++) {
    sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += A[j][i];   // column-wise access: stride of N elements,
                              // poor spatial locality, more cache misses
}

// <Get END time>


Programming for Parallel Architectures
(Trick-3)

• Prophet's rule: knowing the future always allows
you to make the best decisions.

• OR:
– Use the past to predict the future
– Use the compiler
– Bet on several horses
Programming for Parallel Architectures
(Trick-4)

• No matter how well thought out your architecture is,
some applications will perform poorly
– Use the compiler
– Change the applications/algorithms!
