Parallel & Distributed Computing: Prof. Dr. Aman Ullah Khan

Synchronization types
• Barrier
• Each task performs its work until it reaches the barrier. It then stops, or "blocks".
• When the last task reaches the barrier, all tasks are synchronized.
• Lock / semaphore
• The first task to acquire the lock "sets" it. This task can then safely (serially)
access the protected data or code.
• Other tasks can attempt to acquire the lock but must wait until the task that
owns the lock releases it.
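• Acquiring a lock can be blocking (the task waits until the lock is free) or non-blocking (the attempt returns immediately if the lock is held).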
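As a minimal sketch of both synchronization types, the Python snippet below uses the standard `threading` module. The function names `run_tasks` and `try_acquire`, the task count, and the stand-in "work" are illustrative assumptions, not part of the original slides.

```python
import threading


def run_tasks(num_tasks=4):
    """Run num_tasks threads that meet at a barrier, then update shared data under a lock."""
    barrier = threading.Barrier(num_tasks)   # releases once num_tasks threads are waiting
    lock = threading.Lock()
    state = {"total": 0}                     # shared data protected by the lock
    events = []                              # order in which tasks pass each phase

    def task(task_id):
        with lock:
            events.append(("before", task_id))
        barrier.wait()                       # block here until every task has arrived
        with lock:                           # blocking acquire; released when the block exits
            state["total"] += task_id
            events.append(("after", task_id))

    threads = [threading.Thread(target=task, args=(i,)) for i in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["total"], events


def try_acquire(lock):
    """Non-blocking attempt: returns False at once if the lock is already held."""
    acquired = lock.acquire(blocking=False)
    if acquired:
        lock.release()
    return acquired
```

Because no task can pass `barrier.wait()` before the last one arrives, every "before" event is recorded ahead of every "after" event, whatever the thread scheduling.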
Load balancing
• Keep all tasks busy all of the time. Minimize idle time.
• The slowest task will determine the overall performance.
Achieving Load Balancing
• Equally partition the work each task receives
• For array/matrix operations where each task performs similar work,
evenly distribute the data set among the tasks.
• For loop iterations where the work done in each iteration is similar,
evenly distribute the iterations across the tasks.
• If a heterogeneous mix of machines with varying performance
characteristics is being used, use a performance analysis tool
to detect any load imbalances, and adjust the work accordingly.
• Use dynamic work assignment
• Certain classes of problems result in load imbalances even if data is evenly
distributed among tasks. For example:
• Sparse arrays - some tasks will have actual data to work on while others have mostly "zeros".
• Adaptive grid methods - some tasks may need to refine their mesh while others don’t.
• N-body simulations - particles may migrate across task domains requiring more work for
some tasks.
[Figure: three example panels: sparse arrays, adaptive grid, N-body]

• Use dynamic work assignment
• When the amount of work each task will perform is intentionally variable, or is
unable to be predicted, it may be helpful to use a scheduler-task pool approach.
As each task finishes its work, it receives a new piece from the work queue.
• Ultimately, it may become necessary to design an algorithm that detects and
handles load imbalances as they occur dynamically within the code.
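The scheduler-task pool idea can be sketched with a shared work queue, as below. This is only an illustration under stated assumptions: `run_pool`, the worker count, and the `sum(range(item))` stand-in for variable-cost work are hypothetical, and the work items are assumed to be distinct integers. Python threads do not speed up CPU-bound work because of the interpreter lock, so the sketch shows the assignment pattern rather than a performance gain.

```python
import queue
import threading


def run_pool(work_items, num_workers=3):
    """Process work_items from a shared queue; returns (results, items_done_per_worker)."""
    work_queue = queue.Queue()
    for item in work_items:
        work_queue.put(item)                 # the queue is filled before workers start

    results = {}
    per_worker = [0] * num_workers
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            try:
                item = work_queue.get_nowait()   # take the next piece of work
            except queue.Empty:
                return                           # queue drained: this worker is done
            value = sum(range(item))             # stand-in for work of varying size
            with lock:
                results[item] = value
                per_worker[worker_id] += 1

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, per_worker
```

A worker that draws a cheap item simply returns to the queue sooner, so the number of items per worker adapts to the cost of each item instead of being fixed in advance.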
Granularity
• In parallel computing, granularity is a qualitative measure of the ratio of
computation to communication.
• Fine-grain Parallelism:
  ◦ Relatively small amounts of computational work are done between communication events
  ◦ Low computation to communication ratio
  ◦ Facilitates load balancing
  ◦ Implies high communication overhead and less opportunity for performance enhancement
  ◦ If granularity is too fine it is possible that the overhead required for communications and synchronization between tasks takes longer than the computation.
• Coarse-grain Parallelism:
  ◦ Relatively large amounts of computational work are done between communication / synchronization events
  ◦ High computation to communication ratio
  ◦ Implies more opportunity for performance increase
  ◦ Harder to load balance efficiently
Parallel Computing Performance Metrics
• Let T(n,p) be the time to solve a problem of size n using p
processors, then:
Speedup: S(n,p) = T(n,1)/T(n,p)
Efficiency: E(n,p) = S(n,p)/p
Scaled Efficiency: SE(n,p) = T(n,1)/T(pn,p)
• An algorithm is scalable if there exists C > 0 with SE(n,p) ≥ C for all p.
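The three metrics translate directly into code. The sketch below is illustrative only: the function names are made up here, and the timings in the usage note are hypothetical values, not measurements from the slides.

```python
def speedup(t_serial, t_parallel):
    """S(n,p) = T(n,1) / T(n,p)."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, p):
    """E(n,p) = S(n,p) / p."""
    return speedup(t_serial, t_parallel) / p


def scaled_efficiency(t_serial, t_scaled):
    """SE(n,p) = T(n,1) / T(pn,p), where t_scaled is the time for size p*n on p processors."""
    return t_serial / t_scaled
```

For example, with hypothetical timings T(n,1) = 100 s, T(n,4) = 30 s and T(4n,4) = 110 s, `speedup(100, 30)` is about 3.33, `efficiency(100, 30, 4)` is about 0.83, and `scaled_efficiency(100, 110)` is about 0.91.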
Scalability
• Two common types of software scaling are:
1. Strong scaling
2. Weak scaling
• Strong scaling concerns the speedup for a fixed problem size
with respect to the number of processors, and is governed by
Amdahl’s law.
• Weak scaling concerns the speedup for a scaled problem size
with respect to the number of processors, and is governed by
Gustafson’s law.
Scalability is important for parallel computing to be efficient.
When using HPC clusters, it is almost always worthwhile to measure
the scaling of your jobs.
The results of strong and weak scaling tests give a good indication
of the best match between job size and the amount of resources that
should be requested for a particular job.
Strong Scalability
• Strong Scalability: Strong scaling describes how the solution time varies
with the number of processors for a fixed total problem size, that is, what
happens to the run time as the number of processors increases.
• The total problem size stays fixed as more processors are added.
• The goal is to run the same problem faster.
• Perfect scaling means that with p processors the problem is solved in 1/p of the serial run time.
Weak Scalability
• Weak Scalability: Weak scaling describes how the solution time varies
with the number of processors for a fixed problem size per processor,
that is, what happens to the run time when the total problem size grows
together with the number of processors.
• The problem size per processor stays fixed as more processors are added, so the
total problem size is proportional to the number of processors used.
• The goal is to run a larger problem in the same amount of time.
• Perfect scaling means that a problem p times larger runs on p processors in the same time as the original problem on a single processor.
Amdahl’s Law
• If P is the fraction of the code that can be parallelized, then the maximum speedup = 1/(1-P).
• If none of the code can be parallelized, i.e. P = 0, then speedup = 1: there is no speedup, no matter how many
processors are used.
• If all of the code can be parallelized, i.e. P = 1, then the speedup is infinite (in theory).
• If 50% of the code can be parallelized, i.e. P = 0.5, then the maximum speedup = 2, meaning the
code will run at most twice as fast.
• With N processors working on the parallel fraction P, the speedup becomes:
Speedup = 1/((P/N)+S), where N is the number of processors and S = 1 - P is the serial fraction of the
code.
Speedup = 1/((P/N)+S) for several values of N and P (rounded to two decimals):
N        P=0.5   P=0.9   P=0.99
10       1.82    5.26    9.17
100      1.98    9.17    50.25
1000     2.00    9.91    90.99
10000    2.00    9.99    99.02
100000   2.00    10.00   99.90
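Amdahl’s formula is easy to evaluate directly. The sketch below is a minimal illustration; the function name `amdahl_speedup` and the helper `speedup_table` are assumptions introduced here.

```python
def amdahl_speedup(parallel_fraction, n_procs):
    """Speedup = 1 / (P/N + S), with S = 1 - P the serial fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (parallel_fraction / n_procs + serial_fraction)


def speedup_table(fractions, proc_counts):
    """Return one row per processor count: [N, speedup for each P], rounded to 2 decimals."""
    return [
        [n] + [round(amdahl_speedup(p, n), 2) for p in fractions]
        for n in proc_counts
    ]


if __name__ == "__main__":
    for row in speedup_table([0.5, 0.9, 0.99], [10, 100, 1000, 10000, 100000]):
        print(row)
```

As N grows, each column approaches the limit 1/(1-P): 2 for P = 0.5, 10 for P = 0.9, and 100 for P = 0.99.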
Gustafson’s law
Amdahl’s law gives the upper limit of speedup for a problem of fixed size. This limit is a
bottleneck for parallel computing and can appear to discourage its use.
• For example, to obtain a speedup of 500 on 1000 processors, Amdahl’s law requires
that the serial fraction not exceed about 0.1%.
Gustafson’s law, on the other hand, says that the problem size should scale with the
amount of available resources. If a problem requires only a small amount of
resources, it is not beneficial to use a large amount of resources to carry out the
computation. A more reasonable choice is to use small amounts of resources for small
problems and larger quantities of resources for big problems.
Gustafson’s law says that if the parallel part scales linearly with the amount of resources and
the serial part does not increase with the size of the problem, then the scaled
speedup is:
scaled speedup = S + P × N
• where S, P and N have the same meaning as in Amdahl’s law. With Gustafson’s law the scaled
speedup increases linearly with the number of processors, and there is no upper
limit on the scaled speedup.
• According to Gustafson’s law, for S = 0.05 and P = 0.95 the scaled speedup grows without bound as the
number of processors goes to infinity.
• Realistically, with N = 1000 the scaled speedup is 0.05 + 0.95 × 1000 ≈ 950.
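To make the contrast concrete, the sketch below evaluates both laws for the same serial fraction. The function names are illustrative assumptions; the formulas are the ones stated in the text.

```python
def gustafson_scaled_speedup(serial_fraction, n_procs):
    """Scaled speedup = S + P * N, with P = 1 - S."""
    parallel_fraction = 1.0 - serial_fraction
    return serial_fraction + parallel_fraction * n_procs


def amdahl_speedup(serial_fraction, n_procs):
    """Fixed-size speedup = 1 / (S + P/N), with P = 1 - S."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_procs)
```

With S = 0.05 and N = 1000, `gustafson_scaled_speedup(0.05, 1000)` gives about 950, while `amdahl_speedup(0.05, 1000)` stays below the fixed-size limit of 1/S = 20. The two numbers answer different questions: the first assumes the problem grows with N, the second keeps the problem size fixed.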
Strong Scaling Example
Table 1: Strong scaling for Julia set generator code
height   width   threads   time (sec)   speedup
10000    2000    1         3.932        1.00
10000    2000    2         2.006        1.96
10000    2000    4         1.088        3.61
10000    2000    8         0.613        6.41
10000    2000    12        0.441        8.92
10000    2000    16        0.352        11.17
10000    2000    24        0.262        15.01
The problem size is fixed while the number of threads is increased. Speedup is the 1-thread time divided by the time for the given number of threads.
Figure 1: Plot of strong scaling for Julia set generator code

The dashed line shows the fitted curve based on Amdahl’s law.
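One simple way to fit Amdahl’s law to timings such as those in Table 1 is a least-squares fit of the serial fraction s, using the fact that 1/Speedup = s + (1 - s)/N is linear in s. The sketch below is an assumption about how such a fit could be done, not necessarily the procedure used to draw the dashed line in Figure 1; the function names are hypothetical.

```python
# Thread counts and run times (seconds) copied from Table 1.
THREADS = [1, 2, 4, 8, 12, 16, 24]
TIMES = [3.932, 2.006, 1.088, 0.613, 0.441, 0.352, 0.262]


def fit_serial_fraction(threads, times):
    """Least-squares estimate of s in 1/S = s*(1 - 1/N) + 1/N; times[0] must be the 1-thread run."""
    t1 = times[0]
    num = 0.0
    den = 0.0
    for n, t in zip(threads, times):
        inv_speedup = t / t1          # measured 1/S for this thread count
        x = 1.0 - 1.0 / n             # coefficient multiplying s
        y = inv_speedup - 1.0 / n     # excess over ideal 1/N scaling
        num += x * y
        den += x * x
    return num / den


def amdahl_speedup(serial_fraction, n):
    """Speedup predicted by Amdahl's law for serial fraction s on n threads."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)
```

For these timings the fitted serial fraction comes out at roughly 0.03, which would cap the achievable speedup for this problem size at around 1/0.03 ≈ 33 regardless of thread count.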
Weak Scaling Example
Table 2: Weak scaling for Julia set generator code
height   width   threads   time (sec)   scaled speedup
10000    2000    1         3.940        1.00
20000    2000    2         3.874        2.03
40000    2000    4         3.977        3.96
80000    2000    8         4.258        7.40
120000   2000    12        4.335        10.91
160000   2000    16        4.324        14.58
240000   2000    24        4.378        21.60
The problem size is not fixed: it grows in proportion to the number of threads. The last column equals threads × (1-thread time) / (time for the given number of threads).
Figure 2: Plot of weak scaling for Julia set generator code

The dashed line shows the fitted curve based on Gustafson’s law.
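The weak-scaling figures can be recomputed from the timings in Table 2. In the sketch below, the scaled efficiency is T(n,1)/T(pn,p) as defined earlier, and the scaled speedup is taken to be p times that value; the function name is an illustrative assumption.

```python
# (threads, run time in seconds) pairs copied from Table 2.
WEAK_RUNS = [(1, 3.940), (2, 3.874), (4, 3.977), (8, 4.258),
             (12, 4.335), (16, 4.324), (24, 4.378)]


def weak_scaling_metrics(runs):
    """Return (threads, scaled efficiency, scaled speedup) for each run; runs[0] is the 1-thread run."""
    t1 = runs[0][1]
    rows = []
    for p, tp in runs:
        eff = t1 / tp                  # SE(n,p) = T(n,1) / T(pn,p)
        rows.append((p, eff, p * eff)) # scaled speedup = p * SE(n,p)
    return rows
```

For these runs the scaled efficiency stays close to 0.9 or above up to 24 threads, which is the behaviour expected of a code that weak-scales well.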
Inter-task Communication Factors
• Communication overhead
• Inter-task communication virtually always implies overhead.
• Machine cycles and resources that could be used for computation are instead used to package
and transmit data.
• Communications frequently require some type of synchronization between tasks, which can
result in tasks spending time "waiting" instead of doing work.
• Competing communication traffic can saturate the available network bandwidth, further
aggravating performance problems
• Latency vs. Bandwidth
• Latency is the time it takes to send a minimal (0 byte) message from point A to point B.
It is commonly expressed in microseconds.
• Bandwidth is the amount of data that can be communicated per unit of time. It is commonly
expressed in megabytes/sec or gigabytes/sec.
• Sending many small messages can cause latency to dominate communication overheads. Often it
is more efficient to package small messages into a larger message, thus increasing the effective
communications bandwidth.
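The effect of packaging small messages can be estimated with a simple cost model in which each message costs one latency plus its size divided by the bandwidth. The latency and bandwidth values below are hypothetical round numbers chosen for illustration, and the function names are assumptions.

```python
LATENCY_S = 1e-6            # hypothetical: 1 microsecond per message
BANDWIDTH_BPS = 1e10        # hypothetical: 10 gigabytes/sec


def transfer_time(n_bytes, latency_s=LATENCY_S, bandwidth_bps=BANDWIDTH_BPS):
    """Estimated time to send one message of n_bytes."""
    return latency_s + n_bytes / bandwidth_bps


def many_small(count, size_bytes):
    """Send `count` separate messages of size_bytes each: latency is paid `count` times."""
    return count * transfer_time(size_bytes)


def one_large(count, size_bytes):
    """Send the same data as a single aggregated message: latency is paid once."""
    return transfer_time(count * size_bytes)
```

Under these assumed parameters, 1000 messages of 100 bytes cost about 1.01 ms when sent separately but about 0.011 ms when packed into one message, because the separate sends are dominated by latency.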
• Visibility of communications
• With the Message Passing Model, communications are explicit and generally quite visible and under
the control of the programmer.
• With the Data Parallel Model, communications often occur transparently to the programmer,
particularly on distributed memory architectures. The programmer may not even be able to know
exactly how inter-task communications are being accomplished.
• Synchronous vs. asynchronous communications
• Synchronous communications require some type of "handshaking" between tasks that are sharing
data. This can be explicitly structured in code by the programmer, or it may happen at a lower level
unknown to the programmer.
• Synchronous communications are often referred to as blocking communications since other work must
wait until the communications have completed.
• Asynchronous communications allow tasks to transfer data independently from one another. For
example, task 1 can prepare and send a message to task 2, and then immediately begin doing other
work. When task 2 actually receives the data doesn't matter.
• Asynchronous communications are often referred to as non-blocking communications since other
work can be done while the communications are taking place.
• Interleaving computation with communication is the single greatest benefit of using asynchronous
communications.
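The difference between blocking and non-blocking communication can be imitated in plain Python. In this sketch `fake_send` is a hypothetical stand-in for a slow transfer (a real program would call a messaging library instead), and a background thread from the standard `concurrent.futures` module plays the role of the asynchronous transfer.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_send(payload, delay=0.2):
    """Hypothetical stand-in for a slow data transfer; returns the number of bytes 'sent'."""
    time.sleep(delay)
    return len(payload)


def local_work():
    """Stand-in for computation that does not depend on the transfer."""
    return sum(i * i for i in range(1000))


def blocking_version(payload):
    sent = fake_send(payload)      # the caller waits until the transfer completes
    local = local_work()           # computation starts only afterwards
    return sent, local


def nonblocking_version(payload):
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fake_send, payload)   # returns immediately
        local = local_work()                        # overlaps with the transfer
        sent = pending.result()                     # wait only when the result is needed
    return sent, local
```

Both versions return the same values; the non-blocking one simply hides the local computation behind the transfer time.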
• Scope of communications
• Knowing which tasks must communicate with each other is critical during the
design stage of a parallel code. Both scopes described below can be
implemented synchronously or asynchronously.
• Point-to-point - involves two tasks with one task acting as the sender/producer of data, and the
other acting as the receiver/consumer.
• Collective - involves data sharing between more than two tasks, which are often specified as
being members in a common group, or collective. Some common variations (there are more):
Performance Model for distributed memory
• Time for n floating point operations: tcomp = f × n, where 1/f is the computation rate (flops per second, e.g. quoted in Mflops)
• Time for communicating n words: tcomm = α + β × n, where α is the latency and 1/β is the bandwidth
• Total time: tcomp + tcomm
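The model can be written out as a small function. The parameter values in the usage note are hypothetical and chosen only to make the arithmetic easy to follow; the function name `model_time` is an assumption.

```python
def model_time(n_flops, n_words, f, alpha, beta):
    """Return (t_comp, t_comm, total) for the distributed-memory performance model.

    f     : seconds per floating point operation (1/f is the computation rate)
    alpha : latency in seconds per message
    beta  : seconds per word (1/beta is the bandwidth)
    """
    t_comp = f * n_flops
    t_comm = alpha + beta * n_words
    return t_comp, t_comm, t_comp + t_comm
```

For instance, with assumed values f = 1e-9 s per operation, α = 1e-6 s and β = 1e-9 s per word, a task doing 1e6 operations and sending 1e3 words has tcomp = 1e-3 s and tcomm = 2e-6 s, a computation to communication ratio of 500, which corresponds to coarse-grain parallelism.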
