
Disclaimer

In preparation of these slides, materials have been taken from different online sources in the form of books, websites, research papers, presentations, etc. However, the author does not intend to claim any of this material as her/his own. This lecture (audio, video, slides, etc.) is prepared and delivered only for educational purposes and is not intended to infringe upon copyrighted material. Sources have been acknowledged where applicable. The views expressed are the presenter's alone and do not necessarily represent those of the original author(s) or the institution.
Steps in designing parallel programs
The main steps:
1. Understand the problem
2. Partitioning (separation into main tasks)
3. Decomposition & Granularity
4. Communications
5. Identify data dependencies
6. Synchronization
7. Load balancing
8. Performance analysis and tuning
(steps 4 to 8 are covered in this lecture)

Step 4: Communications
Factors related to Communication
 Cost of communications
 Latency vs. Bandwidth
 Visibility of communications
 Synchronous vs. asynchronous
communications
 Scope of communications
 Efficiency of communications
 Overhead and Complexity
Cost of communications
Communication between tasks incurs overheads, such as:
 CPU cycles, memory and other resources that could otherwise be used for computation are instead used to package and transmit data.
 Communications frequently require some type of synchronization between tasks, which can result in tasks spending time "waiting" instead of doing work.
 Competing communication traffic could also
saturate the network bandwidth, causing
performance loss.
Latency vs. Bandwidth
Latency
 The time it takes to send a minimal (0 byte) message from point A to point B. Commonly expressed in microseconds.
Bandwidth
 The amount of data that can be communicated per unit of time. Usually expressed as megabytes/sec or gigabytes/sec.
Latency vs. Bandwidth
Sending many small messages can cause latency to dominate communication overheads. It is often more efficient to package small messages into a larger message, thus increasing the effective communications bandwidth (see the sketch below).
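For illustration, a minimal C sketch (MPI is assumed as the message passing library; the array size, tag and two-task layout are arbitrary choices) of packing many small values into one larger message so the latency cost is paid once instead of many times:

#include <mpi.h>
#include <stdio.h>

#define N 1000

int main(int argc, char **argv)
{
    double values[N];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < N; i++)
            values[i] = (double)i;
        /* The naive alternative would be N separate sends of one double
         * each, paying the per-message latency N times. One aggregated
         * message amortizes the latency and makes better use of the
         * available bandwidth.                                          */
        MPI_Send(values, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(values, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("received %d values in a single message\n", N);
    }

    MPI_Finalize();
    return 0;
}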
Visibility of Communications
Communication is usually both explicit and highly visible when using the message passing programming model.
 Communication may be much less visible when using the data parallel programming model.
 For a data parallel design on a distributed system, communications may be entirely invisible, in that the programmer may have no understanding of (and no easily obtainable means to accurately determine) what inter-task communication is actually happening.
Synchronous vs. asynchronous
 Synchronous communications
Require some kind of handshaking between tasks that share data / results (the handshaking may be explicitly coded or transparent to the programmer).
 Synchronous communications are also referred to as blocking communications, because other work must wait until the communications have completed (see the sketch below).
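A minimal C/MPI sketch of blocking, synchronous communication between two tasks (MPI and a two-task job are assumptions, not prescribed by the slides); MPI_Ssend is the explicitly synchronous send, which only completes once the matching receive has started:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    double buf[100] = {0};
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Blocks until task 1 has posted its matching receive
         * (the handshake); no other work happens meanwhile.   */
        MPI_Ssend(buf, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocks until the message from task 0 has arrived.   */
        MPI_Recv(buf, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("task 1 received the data\n");
    }

    MPI_Finalize();
    return 0;
}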
Synchronous vs. asynchronous
 Asynchronous communications
 Allow tasks to transfer data between one
another independently.
Asynchronous communications are non-blocking, since other work can be done while the communications are taking place.
 Interleaving computation with communication is the single greatest benefit of using asynchronous communications (as sketched below).
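A minimal C/MPI sketch of non-blocking communication overlapped with computation (MPI is an assumption; do_other_work is a hypothetical placeholder for computation that does not depend on the message):

#include <mpi.h>
#include <stdio.h>

/* Hypothetical stand-in for useful work that is independent
 * of the data being transferred.                             */
static double do_other_work(void)
{
    double s = 0.0;
    for (int i = 0; i < 1000000; i++)
        s += (double)i;
    return s;
}

int main(int argc, char **argv)
{
    double buf[100] = {0};
    int rank;
    MPI_Request req = MPI_REQUEST_NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        MPI_Isend(buf, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    else if (rank == 1)
        MPI_Irecv(buf, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);

    /* The transfer proceeds in the background while we compute. */
    double s = do_other_work();

    /* Block only when the transferred data is actually needed.  */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    printf("rank %d finished (work result %.0f)\n", rank, s);
    MPI_Finalize();
    return 0;
}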
Scope of communications
 Scope of communications:
 Knowing which tasks must communicate with
each other is critical during the design stage
of a parallel code
 Two general types of scope:
 Point-to-point (P2P)
 Collective / broadcasting
Scope of communications
 Point-to-point (P2P)
Involves only two tasks: one task acting as the sender/producer of data, and the other as the receiver/consumer.
 Collective
 Data sharing between more than two tasks
(sometimes specified as a common group or
collective).
 Both P2P and collective communications can be
synchronous or asynchronous.
Collective communications
Typical techniques used for collective communications include broadcast (one task sends the same data to all tasks in the group), scatter (one task distributes different pieces of data to each task), gather (one task collects data from all tasks), and reduction (one task combines, e.g. sums, a value contributed by every task). Two of these are sketched below.
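A minimal C/MPI sketch (MPI collectives are one common realization; any number of tasks works) showing a broadcast followed by a sum reduction:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    double param = 0.0, local, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        param = 3.14;                   /* data initially known only to task 0 */

    /* Broadcast: every task receives task 0's copy of param.   */
    MPI_Bcast(&param, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    local = param * rank;               /* each task's contribution */

    /* Reduction: the local values are summed onto task 0.      */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d tasks = %f\n", size, total);

    MPI_Finalize();
    return 0;
}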
Efficiency of communications
The programmer often has a choice with regard to factors
that can affect communications performance.
Which implementation for a given model should be used?
Using the Message Passing Model as an example, one MPI
implementation may be faster on a given hardware platform
than another.
What type of communication operations should be used? As
mentioned previously, asynchronous communication
operations can improve overall program performance.
Network media - some platforms may offer more than one
network for communications. Which one is best?
Step 5: Identify Data Dependencies
Data dependencies
A dependence exists between program
statements when the order of statement
execution affects the results of the
program.
 A data dependence results from multiple
use of the same location(s) in storage by
different tasks.
 Dependencies are important to parallel
programming because they are one of the
primary inhibitors to parallelism.
Common data dependencies
• Loop carried data dependence: a dependence between statements in different iterations (see the first loop in the sketch below).

• Loop independent data dependence: a dependence between statements in the same iteration (see the second loop).
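A small C sketch of the two cases (illustrative only; the arrays and loop bodies are arbitrary):

#include <stdio.h>

#define N 8

int main(void)
{
    double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) {
        a[i] = (double)i;
        b[i] = 2.0 * i;
    }

    /* Loop-carried dependence: iteration i reads a[i-1], which is
     * written by iteration i-1, so the iterations cannot simply be
     * run in parallel.                                             */
    for (int i = 1; i < N; i++)
        a[i] = a[i - 1] + b[i];

    /* Loop-independent dependence: c[i] depends on the a[i] written
     * earlier in the same iteration; different iterations are
     * independent of each other and can run in parallel.           */
    for (int i = 0; i < N; i++) {
        a[i] = b[i] * 2.0;
        c[i] = a[i] + 1.0;
    }

    printf("a[%d] = %.1f, c[%d] = %.1f\n", N - 1, a[N - 1], N - 1, c[N - 1]);
    return 0;
}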
Step 6: Synchronization
Types of Synchronization
 Barrier
 Locking / Semaphores
 Synchronous communication operations
Barrier

All tasks are involved.
 Each task does its work until it reaches the barrier, and then blocks.
 When the last task reaches the barrier, all the tasks are synchronized (see the sketch below).
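A minimal C/MPI sketch of a barrier (MPI is an assumption; the loop is just a stand-in for work of varying size per task):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double s = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each task does its own (differently sized) share of work...   */
    for (long i = 0; i < 1000000L * (rank + 1); i++)
        s += 1.0;

    /* ...then blocks here until every task has reached the barrier. */
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0)
        printf("all tasks have passed the barrier (s=%.0f)\n", s);

    MPI_Finalize();
    return 0;
}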
Locking / Semaphores

Can involve any number of tasks.
 Usually used to serialize / protect access to global data or a critical section of code.
 Only one task at a time may use the lock / semaphore. The first task to acquire the lock "sets" it; this task can then safely (serially) access the protected data or code. Other tasks can attempt to acquire the lock but must wait until the task that owns the lock releases it, typically granted on a FCFS (first come, first served) basis.
 Can be blocking or non-blocking (a mutex sketch follows below).
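A minimal C sketch using a POSIX mutex to protect a shared counter (the thread count and workload are arbitrary; POSIX threads are an assumption, not the only option):

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

static long counter = 0;                          /* shared global data        */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);                /* blocks if already held    */
        counter++;                                /* protected critical section */
        pthread_mutex_unlock(&lock);              /* let the next task in      */
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];

    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    printf("counter = %ld\n", counter);           /* 4 * 100000 = 400000       */
    return 0;
}

The blocking acquire shown here (pthread_mutex_lock) has a non-blocking counterpart, pthread_mutex_trylock, which returns immediately if the lock is already held.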
Synchronous communication
operations
Involve only those tasks executing a communication operation.
 When a task performs a communication operation, some form of coordination is required with the other task(s) participating in the communication.
 For example, before a task can perform a send operation, it first needs to receive a clear to send (CTS) signal from the task it wants to send to.
Step 7: Load Balancing
Load Balancing
 Distributing work among tasks so
that all tasks are kept busy most of the time
 Mainly a minimization of idle time
 Load balancing is important to parallel programs for
performance reasons.
 For example, if all tasks are subject to a barrier
synchronization point, the slowest task will determine the
overall performance.
Achieving load balance
Two general techniques…
1. Equally partition work as a preprocess
 Suitable when the workload can be determined accurately before starting the operation.
2. Dynamic work assignment
 Suitable when the workload is unknown, or cannot be effectively calculated, before starting the operation; work is handed out to tasks as they become idle.
How to Achieve Load Balance?
Equally Partition the Work

Examples include:
 For array/matrix operations where each task performs similar work, evenly distribute the data set among the tasks.
 For loop iterations where the work done in each iteration is similar, evenly distribute the iterations across the tasks (as sketched below).
 If a heterogeneous mix of machines with varying performance characteristics is being used, be sure to use some type of performance analysis tool to detect any load imbalances, and adjust the work accordingly.
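A minimal C/MPI sketch of evenly distributing n loop iterations across the tasks (n and the per-iteration work are arbitrary placeholders; MPI is an assumption):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int n = 1000;                   /* total number of iterations */
    int rank, size;
    double local = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Block-distribute the iterations as evenly as possible:
     * the first (n % size) tasks get one extra iteration.     */
    int base  = n / size;
    int extra = n % size;
    int start = rank * base + (rank < extra ? rank : extra);
    int count = base + (rank < extra ? 1 : 0);

    for (int i = start; i < start + count; i++)
        local += (double)i;               /* stand-in for similar per-iteration work */

    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of 0..%d = %.0f\n", n - 1, total);

    MPI_Finalize();
    return 0;
}

For dynamic work assignment, a scheduler (e.g. a master task) would instead hand out chunks of iterations to tasks as they finish their previous chunk.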

Step 8: Performance analysis
Performance analysis
Simple approaches involve using timers (as sketched below).
 Usually quite straightforward for a serial situation or global memory.
 Monitoring and analyzing parallel programs can get a whole lot more challenging than for the serial case:
 The instrumentation may add communication overhead.
 Synchronization may be needed to avoid corrupting the results, … and could be a significant algorithm in itself.
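A minimal C/MPI timing sketch (the work loop is a placeholder; the barriers are there so every task measures the same phase; MPI_Wtime is the standard MPI wall-clock timer):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double s = 0.0, t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);          /* start everyone together   */
    t0 = MPI_Wtime();

    for (long i = 0; i < 5000000L; i++)   /* placeholder for real work */
        s += 1.0;

    MPI_Barrier(MPI_COMM_WORLD);          /* wait for the slowest task */
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("parallel section took %f seconds (s=%.0f)\n", t1 - t0, s);

    MPI_Finalize();
    return 0;
}

Note that the extra barrier and the timing calls themselves add a small amount of overhead, which is exactly the effect the bullets above warn about.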
