
Parallel Algorithm Design

Principles of Parallel Algorithm Design


Parallel algorithms are designed to solve computational problems
efficiently using multiple processors or cores. Here are some principles of
parallel algorithm design:

1. Decompose the problem: Break the problem into smaller sub-problems that can be solved independently or concurrently.
2. Minimize communication: Communication between processors
can be time-consuming, so minimize the amount of data that needs
to be shared among processors.
3. Balance the workload: Ensure that the workload is evenly
distributed among the processors to maximize efficiency.
4. Use parallel data structures: Use data structures that can be
accessed concurrently by multiple processors without interference.
5. Use parallel operations: Use operations that can be executed
concurrently by multiple processors without interference.
6. Minimize synchronization: Synchronization between processors
can also be time-consuming, so minimize the amount of
synchronization required.
7. Exploit locality: Exploit locality in the data access pattern to
minimize communication and synchronization overheads.
8. Choose the right parallel architecture: Choose the architecture
that is best suited for the problem and the available resources.
9. Measure performance: Measure the performance of the algorithm
on the target platform to identify bottlenecks and optimize
performance.
10. Consider fault tolerance: Consider the possibility of processor
or system failures and design the algorithm to handle such failures
gracefully.

By following these principles, parallel algorithms can be designed to achieve high performance and scalability, making them ideal for solving large-scale computational problems.
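
To make a few of these principles concrete, here is a minimal C/OpenMP sketch (the array contents and size are placeholders chosen for illustration): the loop iterations are the independent sub-problems, the only synchronization is the final reduction, and the runtime distributes iterations across the available cores.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000   /* illustrative problem size */

    int main(void) {
        static double a[N];
        double sum = 0.0;

        for (int i = 0; i < N; i++)   /* prepare the data sequentially */
            a[i] = 1.0;

        /* Decomposition: independent iterations. Minimal synchronization:
           the only coordination is the final reduction of partial sums. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %f\n", sum);
        return 0;
    }

Compiled with an OpenMP-capable compiler (for example, gcc -fopenmp), the same code also runs sequentially when OpenMP is disabled, which is convenient when measuring performance against a baseline.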

Preliminaries of Parallel Algorithm Design


Before designing a parallel algorithm, there are several preliminary
considerations that need to be taken into account. Here are some of them:

1. Problem characteristics: The characteristics of the problem being solved play a crucial role in the design of a parallel algorithm. For instance, some problems may be inherently parallelizable while others may not.
2. Parallel architecture: The parallel architecture being used also
affects the design of the algorithm. Different architectures have
different strengths and weaknesses, and the algorithm needs to be
designed accordingly.
3. Communication model: The communication model used by the
parallel architecture also affects the design of the algorithm. Different
communication models have different overheads and latencies, and
the algorithm needs to be designed to minimize these.
4. Scalability: The scalability of the algorithm is an important
consideration. The algorithm should be designed to work efficiently
even as the number of processors increases.
5. Load balancing: Load balancing is another important consideration. The
workload should be evenly distributed among the processors to
ensure that all processors are utilized efficiently.
6. Data partitioning: Data partitioning is the process of dividing the
data among the processors. The algorithm should be designed to
minimize the communication overheads associated with data
partitioning.
7. Synchronization: Synchronization is the process of coordinating
the activities of the processors. The algorithm should be designed to
minimize the synchronization overheads.
8. Granularity: Granularity refers to the size of the computational tasks
assigned to each processor. The algorithm should be designed to
achieve an optimal granularity to maximize efficiency.
9. Fault tolerance: Fault tolerance is the ability of the algorithm to
continue working even in the presence of hardware or software
failures. The algorithm should be designed to handle such failures
gracefully.

By taking into account these preliminary considerations, the design of a parallel algorithm can be optimized to achieve high performance, scalability, and fault tolerance.
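
As a small, simplified example of the data partitioning and load balancing considerations above, the sketch below computes a block distribution of n data items over a number of processors so that block sizes differ by at most one element. The function name block_range and the sizes used in main are made up for the example.

    #include <stdio.h>

    /* Block partitioning: compute the half-open range [lo, hi) of elements
       owned by processor `rank` out of `nprocs`, for a problem of size n.
       The first (n % nprocs) processors receive one extra element so that
       block sizes differ by at most one. */
    static void block_range(long n, int nprocs, int rank, long *lo, long *hi) {
        long base = n / nprocs;
        long rem  = n % nprocs;
        *lo = rank * base + (rank < rem ? rank : rem);
        *hi = *lo + base + (rank < rem ? 1 : 0);
    }

    int main(void) {
        long lo, hi;
        for (int r = 0; r < 4; r++) {        /* e.g. 10 elements over 4 processors */
            block_range(10, 4, r, &lo, &hi);
            printf("processor %d owns [%ld, %ld)\n", r, lo, hi);
        }
        return 0;
    }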

Preliminaries in Parallel Computing

Preliminaries in parallel computing refer to the basic concepts and tools required for understanding and implementing parallel algorithms. These include:

1. Parallel Architectures: Different parallel architectures like shared memory, distributed memory, and hybrid architectures are used in parallel computing. Understanding the architecture of the system is essential for designing and implementing parallel algorithms.
2. Parallel Programming Models and Languages: Parallel programs are written using programming models, languages, and interfaces such as MPI, OpenMP, and CUDA. Knowledge of these is essential for developing parallel algorithms.
3. Parallel Algorithms: Designing and implementing efficient parallel
algorithms is crucial in parallel computing. Understanding the
techniques and approaches used in parallel algorithms is necessary
for optimizing the performance of parallel computations.
4. Parallel Tools: Various tools like debuggers, profilers, performance
analyzers, and visualization tools are used in parallel computing to
facilitate the development, testing, and analysis of parallel
algorithms.
5. Parallel Libraries: There are many parallel libraries like BLAS,
LAPACK, and FFTW that provide pre-implemented parallel algorithms
for common tasks like linear algebra and signal processing.

Overall, a good understanding of the above preliminaries is essential for effectively using parallel computing to solve large-scale problems.
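
As a starting point with one of the programming interfaces listed above, here is a minimal MPI program in C in which every process reports its rank. It is typically compiled with an MPI wrapper compiler such as mpicc and launched with mpirun or mpiexec; the exact commands depend on the MPI installation.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;

        MPI_Init(&argc, &argv);                   /* start the MPI runtime */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* this process's id */
        MPI_Comm_size(MPI_COMM_WORLD, &size);     /* total number of processes */

        printf("Hello from process %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }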

Decomposition Techniques
Decomposition is the process of breaking down a problem into smaller
sub-problems that can be solved independently or concurrently.
Decomposition is a critical technique in parallel algorithm design. Here are
some of the key decomposition techniques used in parallel algorithm
design:

1. Task Parallelism: Task parallelism is the decomposition of a problem into smaller independent tasks that can be executed concurrently on different processors. In this approach, each processor is assigned a specific task, and the results are combined at the end.
2. Data Parallelism: Data parallelism is the decomposition of a problem
into smaller independent data sets that can be processed
concurrently by different processors. In this approach, each
processor is assigned a specific data set, and the processing is
performed in parallel.
3. Domain Decomposition: Domain decomposition is the decomposition
of a problem into smaller sub-problems that can be solved
independently by different processors. In this approach, the problem
is divided into non-overlapping sub-domains, and each processor is
assigned a specific sub-domain to solve.
4. Pipelining: Pipelining is the decomposition of a problem into a series of stages executed on different processors. The output of one stage is used as the input of the next, so each piece of data moves through the stages in order while different processors work on different pieces concurrently.
5. Recursive Decomposition: Recursive decomposition is the
decomposition of a problem into smaller sub-problems using a
recursive algorithm. In this approach, the problem is divided into
smaller sub-problems, and the same algorithm is applied recursively
to solve each sub-problem.

Each of these decomposition techniques has its strengths and weaknesses. The choice of the decomposition technique depends on the nature of the problem being solved, the parallel architecture being used, and other factors such as load balancing and communication overheads.
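
To make one of these techniques concrete, the sketch below applies recursive decomposition with OpenMP tasks to sum an array: each call splits its range in half and spawns two tasks, and the cutoff of 10,000 elements (an arbitrary value chosen for the example) keeps the task granularity reasonable.

    #include <omp.h>
    #include <stdio.h>

    /* Recursive decomposition: the sum over [lo, hi) is split into two halves
       computed as independent tasks, down to a cutoff below which the work is
       done sequentially (the cutoff controls granularity). */
    static long rsum(const long *a, long lo, long hi) {
        if (hi - lo < 10000) {                  /* sequential base case */
            long s = 0;
            for (long i = lo; i < hi; i++) s += a[i];
            return s;
        }
        long mid = lo + (hi - lo) / 2, left = 0, right = 0;
        #pragma omp task shared(left)
        left = rsum(a, lo, mid);
        #pragma omp task shared(right)
        right = rsum(a, mid, hi);
        #pragma omp taskwait                    /* wait for both sub-problems */
        return left + right;
    }

    int main(void) {
        enum { N = 1000000 };
        static long a[N];
        for (long i = 0; i < N; i++) a[i] = 1;

        long total = 0;
        #pragma omp parallel
        #pragma omp single                      /* one thread starts the recursion */
        total = rsum(a, 0, N);

        printf("total = %ld\n", total);
        return 0;
    }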

Characteristics of Tasks and Interactions


Parallel computing is a type of computation where multiple calculations are
executed simultaneously by dividing a large task into smaller sub-tasks that
can be processed concurrently. In parallel computing, there are certain
characteristics of tasks and interactions that are important to consider:

1. Task Granularity: The size of the sub-tasks that are assigned to individual processors in parallel computing is referred to as task granularity. It is important to balance the size of the tasks to ensure that each processor is utilized optimally.
2. Data Dependency: In parallel computing, the processing of tasks is
often dependent on the results of other tasks. This data dependency
can impact the overall efficiency of the computation.
3. Communication Overhead: Communication between processors
is required in order to coordinate the processing of tasks in parallel
computing. This communication overhead can impact the overall
performance of the computation.
4. Load Balancing: In parallel computing, it is important to ensure that
the workload is evenly distributed among the processors to avoid any
processor being overloaded.
5. Synchronization: In order to coordinate the processing of tasks in
parallel computing, synchronization mechanisms are required. These
mechanisms ensure that tasks are executed in the correct order and
that all processors are working together towards the same goal.

Overall, parallel computing requires careful consideration of these


characteristics in order to optimize the performance of the computation.
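
The data dependency point can be illustrated with a short C/OpenMP sketch (the function names and sizes are invented for the example): the first loop has fully independent iterations and parallelizes directly, while the second has a loop-carried dependency and cannot be parallelized in this form without restructuring it, for example as a parallel prefix sum.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000

    /* Independent iterations: each output element depends only on the inputs,
       so the loop can be decomposed across threads without synchronization. */
    void scale(double *y, const double *x, double c) {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            y[i] = c * x[i];
    }

    /* Loop-carried dependency: iteration i needs the result of iteration i-1,
       so this form must run sequentially; a parallel scan algorithm would be
       needed to parallelize it. */
    void prefix_sum(double *x) {
        for (int i = 1; i < N; i++)
            x[i] += x[i - 1];
    }

    int main(void) {
        static double x[N], y[N];
        for (int i = 0; i < N; i++) x[i] = 1.0;
        scale(y, x, 2.0);
        prefix_sum(x);
        printf("y[N-1] = %.1f, prefix sum = %.1f\n", y[N - 1], x[N - 1]);
        return 0;
    }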

Mapping Techniques for Load Balancing


Load balancing is an important aspect of parallel algorithm design that
involves distributing the workload across multiple processors to ensure
that all processors are utilized efficiently.

Mapping techniques are used to assign tasks to processors in a load-balanced manner. Here are some mapping techniques for load balancing in parallel algorithm design:

1. Static mapping: In static mapping, the assignment of tasks to processors is determined before the execution of the parallel algorithm and remains fixed throughout its execution. Typically, each processor is assigned an equal share of the tasks.
2. Dynamic mapping: In dynamic mapping, the mapping of tasks to processors is determined during the execution of the algorithm. The mapping can be adjusted dynamically, based on the current workload of each processor, using a load-balancing algorithm.
3. Hierarchical mapping: In hierarchical mapping, tasks are grouped
into clusters based on their dependencies and communication
requirements. Each cluster is then assigned to a processor, and the
workload of each processor is balanced by assigning clusters of
similar size and complexity.
4. Task partitioning: Task partitioning involves dividing the workload
into smaller subtasks that can be executed independently on different
processors. Each subtask is assigned to a processor based on its
workload and resource requirements.
5. Space-filling curves: A space-filling curve is a curve that passes through every point of a multidimensional region, such as a square. Ordering tasks along such a curve can be used to map them to processors in a load-balanced manner while ensuring that tasks that are spatially close to each other tend to be assigned to the same processor.
6. Topological mapping: Topological mapping involves mapping
tasks to processors based on their topological relationships. This
technique is particularly useful for algorithms that involve
graph-based computations.

Overall, effective load balancing in parallel algorithm design requires careful consideration of the mapping technique used.
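
As a small illustration of static versus dynamic mapping, the C/OpenMP sketch below runs the same loop under schedule(static) and schedule(dynamic). The function uneven_work is a made-up stand-in for a task whose cost varies from iteration to iteration, which is exactly the situation where dynamic mapping pays off.

    #include <omp.h>

    #define N 100000

    /* Hypothetical task whose cost depends on the iteration index. */
    static double uneven_work(int i) {
        double s = 0.0;
        for (int k = 0; k < (i % 1000) + 1; k++)
            s += k * 1e-6;
        return s;
    }

    /* Static mapping: iterations are split into equal contiguous chunks before
       the loop starts; lowest overhead, but the load may end up unbalanced. */
    void run_static(double *out) {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; i++)
            out[i] = uneven_work(i);
    }

    /* Dynamic mapping: each thread grabs the next chunk of 64 iterations when
       it becomes idle; balances uneven work at the cost of scheduling overhead. */
    void run_dynamic(double *out) {
        #pragma omp parallel for schedule(dynamic, 64)
        for (int i = 0; i < N; i++)
            out[i] = uneven_work(i);
    }

    int main(void) {
        static double out[N];
        run_static(out);
        run_dynamic(out);
        return 0;
    }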

Methods for Containing Interaction Overheads


In parallel computing, interaction overheads refer to the time and resources
spent on communication and synchronization between processors. These
overheads can significantly impact the performance of parallel programs.
Here are some methods for containing interaction overheads:
1. Minimize communication: One of the most effective ways to
reduce interaction overheads is to minimize communication between
processors. This can be done by using techniques such as data locality, where data is kept as close as possible to the processor that needs it, and message batching, where multiple small messages are combined into a larger message to reduce the number of messages sent.
2. Overlap communication with computation: Overlapping communication with computation is another effective way to reduce interaction overheads. This can be done by
using techniques such as non-blocking communication, where
communication is initiated but does not block the processor from
performing other tasks, and pipelining, where multiple stages of
computation and communication are overlapped.
3. Use efficient communication protocols: Efficient communication protocols such as Remote Direct Memory Access (RDMA) can significantly reduce the overheads associated with data transfers.
4. Employ load balancing techniques: Load balancing techniques can
help distribute the workload evenly among processors, which can
reduce the interaction overheads associated with uneven workloads.
Techniques such as task and data partitioning, hierarchical mapping,
and dynamic mapping can all help balance the workload among
processors.
5. Use hardware accelerators: Hardware accelerators such as GPUs
and FPGAs can offload certain types of computations from the CPU,
reducing the overall workload and the associated interaction
overheads.

Overall, containing interaction overheads in parallel computing requires a combination of careful algorithm design, efficient communication protocols, and hardware optimization, which leads to improved performance and scalability of parallel programs.
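
As one example of overlapping communication with computation, the sketch below performs a halo exchange with non-blocking MPI calls and updates the interior of a local array while the messages are in flight. The array layout and the neighbor ranks left and right are assumptions made for the example; at the ends of a processor chain they would be MPI_PROC_NULL, to which sends and receives complete immediately.

    #include <mpi.h>

    /* Overlapping communication with computation (a sketch): exchange halo
       values with the two neighbors using non-blocking calls, update the
       interior of the local array while messages are in flight, then wait
       and finish the boundary. `n` is the local length including two halo
       cells; `left` and `right` are the neighbor ranks. */
    void halo_exchange_step(double *u, int n, int left, int right) {
        MPI_Request reqs[4];

        MPI_Irecv(&u[0],     1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(&u[n - 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
        MPI_Isend(&u[1],     1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[2]);
        MPI_Isend(&u[n - 2], 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[3]);

        /* ... compute on the interior cells u[2] .. u[n-3] here, which does
           not need the halo values ... */

        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

        /* ... now compute the two boundary cells u[1] and u[n-2] ... */
    }
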
Parallel Algorithm Models
Data Model
A data model in parallel computing refers to the way that data is organized
and accessed in a parallel computing system.

Some common data models used in parallel computing include:

1. Shared memory model: In this model, all processors in a parallel system have access to a single shared memory space. This allows for efficient communication and sharing of data between processors, but can also lead to issues such as data races and cache coherence problems.
2. Distributed memory model: In this model, each processor in a
parallel system has its own local memory, and communication
between processors is done through message passing. This can be
more scalable and flexible than the shared memory model, but can
also be more complex to program and manage.
3. Hybrid model: The hybrid model combines elements of the shared
memory and distributed memory models, allowing for both shared
memory and distributed memory access in a single system. This can
be useful for applications that require both efficient communication
and flexible memory access.
4. Object-based model: In this model, data is organized as objects
that can be accessed and manipulated by multiple processors. This
can be useful for applications that require fine-grained data sharing
and coordination.
5. Dataflow model: In this model, data is organized as streams of
input and output, with computation occurring as data is passed
through the system. This can be useful for applications that require
dynamic and flexible data processing.
Overall, the choice of data model in parallel computing depends on the
specific requirements of the application being developed, as well as the
hardware and software resources available. Careful consideration of the
data model can help to ensure efficient and effective parallel processing.
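
To illustrate the distributed memory model in its simplest form, the sketch below sends a single integer from rank 0 to rank 1 with explicit message passing; each process owns its own copy of value, and no data is shared implicitly. It assumes the program is started with at least two MPI processes, and the value 42 is only a placeholder.

    #include <mpi.h>
    #include <stdio.h>

    /* Distributed-memory model: each process owns its local `value`; the only
       way to share it is an explicit message. Rank 0 sends, rank 1 receives. */
    int main(int argc, char **argv) {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }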

Task Model
A task model is a representation of the work to be done in a parallel
computing system. It defines the tasks that need to be executed and the
dependencies between them. The task model is used to create a parallel
program that can be executed on a parallel computing system.

There are several types of task models, including:

1. Data-parallel task model: This model breaks down the computation into small, independent data elements that can be processed in parallel. This is commonly used in scientific and engineering applications where large data sets are processed.
2. Task-parallel task model: This model decomposes the
computation into a set of tasks that can be executed in parallel. This
is commonly used in applications where the computation is more
complex and cannot be easily decomposed into data elements.
3. Pipeline task model: This model breaks down the computation
into a set of stages where each stage performs a specific task. Data
is passed from one stage to the next in a pipeline fashion. This is
commonly used in applications where the computation can be broken
down into a set of sequential steps.
4. Task-graph task model: This model represents the computation
as a directed graph where the nodes represent tasks and the edges
represent dependencies between tasks. This is commonly used in
applications where the computation has complex dependencies
between tasks.
The choice of task model depends on the nature of the computation and
the characteristics of the parallel computing system. A good task model
can improve the performance of parallel programs by reducing
communication overhead and increasing parallelism.
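
As a small sketch of the task-graph task model, the C/OpenMP example below expresses a four-node graph with depend clauses: tasks B and C both depend on A, and D depends on B and C, so the runtime may execute B and C in parallel once A finishes. The variables and arithmetic are placeholders invented for the example.

    #include <omp.h>
    #include <stdio.h>

    /* Task-graph model (a sketch): four tasks A, B, C, D. The depend clauses
       describe the edges of the graph; the runtime schedules the tasks so
       that dependencies are respected while B and C may run concurrently. */
    int main(void) {
        int a = 0, b = 0, c = 0, d = 0;

        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task depend(out: a)
            a = 1;                               /* task A */

            #pragma omp task depend(in: a) depend(out: b)
            b = a + 1;                           /* task B, needs A */

            #pragma omp task depend(in: a) depend(out: c)
            c = a + 2;                           /* task C, needs A */

            #pragma omp task depend(in: b, c) depend(out: d)
            d = b + c;                           /* task D, needs B and C */
        }

        printf("d = %d\n", d);
        return 0;
    }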

Work Pool Model


In parallel computing, a Work Pool is a shared data structure that contains
a set of tasks that need to be executed.

It is typically used in the Work Pool Model, where a set of worker processes consume tasks from the pool and process them independently.

The Work Pool can be implemented as a queue or a stack, depending on the requirements of the application. New tasks are added to the pool as they become available, and workers consume tasks from the pool as they become available. When a worker finishes processing a task, it returns the result to a designated output.

The Work Pool model is commonly used in parallel computing systems where the workload can be broken down into independent, non-overlapping tasks. This allows multiple workers to operate on different tasks simultaneously, improving overall performance.

Advantages of Work Pool Model


One of the advantages of using a Work Pool is that it allows the workload
to be dynamically balanced across the workers. This helps to minimize idle
time and maximize overall throughput.

Another advantage is that it is a scalable model. As the size of the workload increases, additional worker processes can be easily added to the system to handle the additional load. This makes the Work Pool model well-suited for applications that need to scale up to handle large amounts of work.
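
A minimal sketch of the Work Pool Model in C with OpenMP is shown below. The pool is reduced to a shared counter of task indices: each worker thread atomically claims the next index and processes it, so faster workers automatically take on more tasks and the load balances itself. The task body is a placeholder; a real application would replace it with actual work, and a queue- or stack-based pool would be needed when tasks are produced dynamically.

    #include <omp.h>
    #include <stdio.h>

    #define NTASKS 1000

    /* Work pool (a sketch): the "pool" is a shared counter of remaining task
       indices. Each worker atomically grabs the next index and processes it. */
    int main(void) {
        int next = 0;                  /* next unclaimed task in the pool */
        double results[NTASKS];

        #pragma omp parallel
        {
            for (;;) {
                int t;
                #pragma omp atomic capture
                t = next++;            /* claim one task from the pool */
                if (t >= NTASKS)
                    break;             /* pool is empty */
                results[t] = t * 0.5;  /* stand-in for the real task body */
            }
        }

        printf("processed %d tasks\n", NTASKS);
        return 0;
    }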

Master Slave Model


In parallel computing, the Master Slave Model is a task model that divides a
computation into two types of tasks: master tasks and slave tasks.

The master task manages the distribution of tasks to a set of slave tasks.
The slave tasks perform the actual computation and report their results
back to the master task.

The Master Slave Model is commonly used in parallel computing systems where the computation involves complex dependencies or coordination requirements.

For example, in a simulation, the master task might be responsible for initializing the simulation and coordinating the output of results, while the slave tasks perform the actual computation for each time step.

The Master Slave Model typically operates as follows:

1. The master task initializes the computation and assigns tasks to the slave tasks.
2. The slave tasks request tasks from the master task and perform the
computation.
3. When a slave task completes a task, it reports the result back to the
master task.
4. The master task collects the results from the slave tasks and
performs any necessary coordination or output.
5. The computation continues until all tasks have been completed.

Advantages of Master Slave Model
One of the advantages of the Master Slave Model is that it can handle more
complex computations than the Work Pool Model.

It allows the master to perform control and coordination tasks while the slaves perform the actual computation. This makes it possible to handle
computations that have complex dependencies or coordination
requirements.

Disadvantages of Master Slave Model


However, the Master Slave Model can suffer from scalability issues as the
number of slave tasks increases. The master task can become a bottleneck
if it cannot keep up with the demand for new tasks from the slave tasks.
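
The sketch below is one possible MPI realization of the master/slave workflow described above, not code from the text: rank 0 hands out task indices and collects results, while every other rank computes a placeholder result per task until it receives a stop message. For simplicity it assumes at least one slave process and that the number of tasks is at least the number of slaves.

    #include <mpi.h>
    #include <stdio.h>

    #define NTASKS 100    /* assumed to be at least the number of slave ranks */
    #define TAG_WORK 1
    #define TAG_STOP 2

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                              /* master */
            int sent = 0, done = 0, result;
            MPI_Status st;
            for (int w = 1; w < size; w++) {          /* one task per slave to start */
                MPI_Send(&sent, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                sent++;
            }
            while (done < NTASKS) {
                MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                         MPI_COMM_WORLD, &st);
                done++;
                if (sent < NTASKS) {                  /* hand this slave its next task */
                    MPI_Send(&sent, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                             MPI_COMM_WORLD);
                    sent++;
                } else {                              /* nothing left: stop this slave */
                    MPI_Send(&sent, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                             MPI_COMM_WORLD);
                }
            }
            printf("master: %d tasks completed\n", done);
        } else {                                      /* slave */
            int task, result;
            MPI_Status st;
            for (;;) {
                MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP)
                    break;
                result = task * task;                 /* stand-in computation */
                MPI_Send(&result, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }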

Sequential and Parallel Computational Complexity

Computational complexity is a measure of the amount of resources, such as time and memory, required to solve a computational problem. There are two main types of computational complexity: sequential and parallel.

1. Sequential computational complexity
1.1. It refers to the resources required to solve a problem using a single processor, or sequentially.
1.2. The most common measure of sequential complexity is time complexity, which is the amount of time required to solve a problem on a single processor. Time complexity is typically measured in terms of the number of operations or instructions required to solve a problem, as a function of the input size.
1.3. For example, sorting n items with a comparison-based algorithm such as merge sort has time complexity O(n log n),
1.4. which means that the number of operations required to sort n items grows in proportion to n log n as the input size increases.
2. Parallel computational complexity
2.1. It refers to the resources required to solve a problem using
multiple processors in parallel.
2.2. The most common measure of parallel complexity is parallel
time complexity, which is the amount of time required to solve a
problem using p processors.
2.3. Parallel time complexity is typically measured in terms of the
number of operations or instructions required to solve a
problem, as a function of both the input size and the number of
processors.
2.4. For the sorting example above, an idealized parallel formulation on p processors takes roughly O((n/p) log n) time, ignoring communication and merging costs.

3. Overall, the choice between sequential and parallel computation depends on the problem being solved, the available resources, and the desired performance goals.
4. While parallel computation can offer significant speedup over
sequential computation for certain problems, it also introduces
additional challenges such as load balancing, communication, and
synchronization.
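
A rough worked example makes the comparison concrete (communication and synchronization costs are ignored here). The usual derived measures are

    Speedup:    S(p) = T_seq / T_par(p)
    Efficiency: E(p) = S(p) / p

For sorting n = 2^20 items, T_seq is proportional to n log n ≈ 2^20 × 20 ≈ 2.1 × 10^7 operations, while with p = 16 processors the idealized parallel bound above gives T_par proportional to (n/p) log n ≈ 2^16 × 20 ≈ 1.3 × 10^6 operations, for a speedup of about 16 and an efficiency close to 1. In practice, the overheads listed in point 4 reduce both figures.
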
Anomalies in Parallel Algorithms
Anomalies in parallel algorithms refer to unexpected or undesirable
behaviors that can occur during parallel computation. These anomalies can
arise due to a variety of factors, such as load imbalances, communication
delays, and synchronization issues.

Here are some common types of anomalies in parallel algorithms:

1. Deadlock: Deadlock occurs when two or more processes are blocked, waiting for each other to release resources. This can happen when processes acquire locks in a different order, leading to a circular wait.
2. Data races: Data races occur when two or more processes access
the same data simultaneously, resulting in undefined behavior. Data
races can occur when processes do not synchronize their accesses
to shared data properly.
3. Starvation: Starvation occurs when a process is never able to
acquire the resources it needs to make progress. This can happen
when resources are allocated to other processes repeatedly, leaving
some processes waiting indefinitely.
4. Load imbalance: Load imbalance occurs when some processes
have significantly more work to do than others, leading to
underutilization of some processors and overloading of others.
5. Communication delays: Communication delays occur when
communication between processes takes longer than expected. This
can happen due to network congestion, hardware failures, or other
factors.
6. Scalability issues: Scalability issues occur when the performance
of a parallel algorithm does not improve as the number of processors
increases. This can happen due to limitations in communication
bandwidth, synchronization overhead, or other factors.
To mitigate these anomalies, parallel algorithms need to be carefully
designed and tested to ensure correct behavior.
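
As a concrete, deliberately broken illustration of the first anomaly, the pthreads sketch below has two threads acquire the same two locks in opposite order; depending on timing it may hang forever in a circular wait. The lock and thread names are invented for the example, and the standard fix is to impose a single global lock-acquisition order.

    #include <pthread.h>
    #include <stdio.h>

    /* Deadlock (a sketch): two threads take the same two locks in opposite
       order. If thread 1 holds lock_a and thread 2 holds lock_b, each waits
       forever for the other. Always locking lock_a before lock_b removes
       the circular wait. */
    pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

    void *thread1(void *arg) {
        pthread_mutex_lock(&lock_a);
        pthread_mutex_lock(&lock_b);   /* waits if thread 2 already holds lock_b */
        /* ... critical section ... */
        pthread_mutex_unlock(&lock_b);
        pthread_mutex_unlock(&lock_a);
        return NULL;
    }

    void *thread2(void *arg) {
        pthread_mutex_lock(&lock_b);   /* opposite order: the source of deadlock */
        pthread_mutex_lock(&lock_a);
        /* ... critical section ... */
        pthread_mutex_unlock(&lock_a);
        pthread_mutex_unlock(&lock_b);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, thread1, NULL);
        pthread_create(&t2, NULL, thread2, NULL);
        pthread_join(t1, NULL);        /* may never return if the deadlock occurs */
        pthread_join(t2, NULL);
        puts("finished (no deadlock this time)");
        return 0;
    }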
