
HPC:

Unit 1:

Q.1) Describe the scope of parallel computing. What are applications of parallel
computing?

Scope of Parallel Computing

Parallel computing tackles problems that can be broken down into smaller, independent (or
loosely coupled) tasks that can be solved simultaneously. This approach leverages multiple
processing units (cores, processors, or even entire computers) working in concert to achieve
faster execution times compared to traditional serial computing where tasks are executed one
after another.

Here's a breakdown of the scope:

● Problem size and complexity: Parallel computing is particularly suited for large-scale,
computationally intensive problems that would take an unreasonable amount of time to
solve on a single processor. As the problem size increases, the potential speedup from
parallelization becomes more significant.
● Task decomposition: The problem needs to be divisible into subtasks that can be
executed concurrently with minimal overhead for communication and coordination
between processing units.
● Scalability: Ideally, parallel computing should exhibit good scalability, meaning the
performance gain should increase as you add more processing units. However,
achieving perfect scalability can be challenging due to factors like communication
overhead and synchronization requirements.

Applications of Parallel Computing

Parallel computing permeates various scientific and engineering domains due to its ability to
handle complex simulations and data analysis. Here are some prominent examples:

● Scientific Simulations: Modeling weather patterns, simulating fluid dynamics, and performing large-scale molecular simulations all leverage parallel computing for faster and more accurate results.
● Data Analysis: Processing massive datasets from fields like astronomy, genomics, and
social media analysis benefits tremendously from parallel processing to extract
meaningful insights.
● Machine Learning and Artificial Intelligence: Training complex machine learning
models and deep neural networks often involves intensive computations that are
accelerated by parallel processing on hardware like GPUs (Graphics Processing Units).
● Financial Modeling: Complex financial simulations and risk assessments can be
performed much faster using parallel computing techniques.
● Computer Graphics: Rendering high-fidelity 3D graphics and animation for movies,
games, and simulations often relies on parallel processing for real-time performance.

Q.2) What are the types of dataflow execution model?

In the context of HPC, there are two main categories of dataflow execution models:

1. Batch Sequential:

○ Focus: Processes data in large batches, one after another.


○ Implementation: Often used in traditional HPC workflows where a series of
independent jobs are submitted and executed sequentially on available compute
resources.
○ Advantages: Simple to implement and manage, efficient for large, independent
tasks.
○ Disadvantages: Limited parallelism, not suitable for problems with inherent
dependencies between tasks.
2. Streaming (or Pipelined):

○ Focus: Processes data in a continuous flow, breaking it down into smaller chunks and applying operations as they arrive.
○ Implementation: Commonly used in big data processing and real-time analytics
where data streams are continuously analyzed.
○ Subcategories:
■ Pipe and Filter: Data flows through a series of processing stages (filters)
connected by pipelines. Each filter performs a specific operation on the
data before passing it to the next stage.
■ Process Control: Employs control flow logic to manage the flow of data
based on conditions or events. This allows for more dynamic and flexible
processing compared to a simple pipe and filter model.
○ Advantages: Enables high parallelism, efficient for handling continuous data
streams, allows for real-time processing.
○ Disadvantages: Can be more complex to implement and manage compared to
batch sequential, may require additional overhead for managing data flow and
synchronization between stages.
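
The pipe-and-filter style described above maps naturally onto Python generators. Here is a minimal, illustrative sketch; the stage names (source, square, keep_even) are invented placeholders, not part of any particular framework:

# Minimal pipe-and-filter sketch: each filter consumes a stream and yields
# transformed items, so processing can begin as soon as data arrives.

def source():
    for value in range(10):              # pretend this is an incoming data stream
        yield value

def square(stream):                      # filter 1: square each element
    for value in stream:
        yield value * value

def keep_even(stream):                   # filter 2: drop odd results
    for value in stream:
        if value % 2 == 0:
            yield value

# Compose the pipeline: source -> square -> keep_even
pipeline = keep_even(square(source()))
for item in pipeline:
    print(item)                          # prints 0, 4, 16, 36, 64

Because each stage pulls items one at a time, data flows through the pipeline continuously rather than in one large batch, which is exactly the contrast drawn above between batch sequential and streaming execution.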

Q.3) Explain cache coherence in multiprocessor system.


In a multiprocessor system with private caches for each processor, cache coherence ensures
that all processors have a consistent view of the data stored in main memory. This is crucial
because multiple processors can access the same data, and if each cache has its own
uncoordinated copy, inconsistencies can arise.

Here's a breakdown of cache coherence:

The Problem:

● Each processor has a private cache to speed up data access.


● Multiple processors can cache copies of the same data from main memory.
● If one processor modifies the cached data, other processors with copies become stale.
● Reading stale data can lead to incorrect program behavior.

Cache Coherence Protocols:

● These protocols define how processors communicate and coordinate cache updates to
maintain consistency.
● They involve states (modified, shared, exclusive, etc.) for cache lines indicating the copy
status.
● Transitions between states occur based on read/write operations and communication
between caches or a directory.

Common Approaches:

● Write Invalidation: When a processor writes to cached data, it invalidates copies in other caches, forcing them to fetch the updated data from main memory on subsequent reads.
● Write Update: Similar to write invalidation, but the writing processor broadcasts the
update to other caches, keeping them consistent.

Benefits:

● Ensures data consistency across processors.


● Prevents incorrect program behavior due to stale data.
● Improves overall system performance by utilizing private caches.

Challenges:

● Adds communication overhead between processors or caches.


● Increased complexity in cache design and management.
● Different protocols offer trade-offs between performance and complexity.

Real-world Example:
Imagine two processors working on a shared document. Cache coherence ensures that both
processors always see the latest version of the document, regardless of which processor made
the last edit. This prevents inconsistencies, such as one processor seeing an outdated version
while the other has the latest changes.
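
The write-invalidation approach above can be illustrated with a toy model. This is a deliberately simplified sketch, assuming a single shared variable, a write-through policy, and two caches; all class and variable names are invented:

# Toy write-invalidate model: each cache holds a copy of one variable plus a
# valid flag. A write by one processor invalidates every other copy, so the
# next read elsewhere must re-fetch the value from main memory.

main_memory = {"x": 0}

class ToyCache:
    def __init__(self, name):
        self.name = name
        self.value = None
        self.valid = False

    def read(self):
        if not self.valid:                  # miss, or copy was invalidated
            self.value = main_memory["x"]   # fetch the current value
            self.valid = True
        return self.value

    def write(self, new_value, others):
        main_memory["x"] = new_value        # write-through, kept simple here
        self.value = new_value
        self.valid = True
        for cache in others:                # invalidate all other copies
            cache.valid = False

c0, c1 = ToyCache("P0"), ToyCache("P1")
print(c0.read(), c1.read())   # both processors read 0
c0.write(42, others=[c1])     # P0 writes; P1's cached copy becomes invalid
print(c1.read())              # P1 re-fetches and sees 42, not a stale 0

A real protocol (e.g., MESI) tracks more states per cache line and snoops a shared bus or consults a directory, but the core idea is the same: a write makes every other cached copy unusable until it is refreshed.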

Q.4) Explain Store-and-Forward & packet routing with its communication cost.

Store-and-Forward Packet Routing: Reliable Delivery at a Cost

Store-and-forward is a fundamental technique used in packet routing within networks like the
internet. It ensures reliable data transmission by acting like a digital post office for data packets.
Here's how it works:

1. Receiving: When a packet arrives at a router (the network device responsible for
forwarding packets), it's entirely received and stored in a temporary buffer memory.

2. Error Checking: The router performs error checks on the packet, typically using
techniques like Cyclic Redundancy Check (CRC) to detect any data corruption during
transmission.

3. Routing Decision: Based on the destination address within the packet header, the
router consults its routing table to determine the next hop (the next router) on the path
towards the final destination.

4. Forwarding: If the error check passes and the next hop is determined, the router
forwards the entire packet out the appropriate outgoing link.

5. Buffer Management: If the buffer is full due to network congestion, the router might
employ strategies like queuing or packet dropping to manage the incoming data flow.

Communication Cost of Store-and-Forward:

While reliable, store-and-forward introduces some communication overhead:

● Latency: There's a delay introduced at each router as the entire packet needs to be
received and processed before forwarding. This cumulative delay across multiple routers
can impact real-time applications.
● Buffer Management: Routers need additional memory to store packets temporarily.
Buffer overflow can lead to packet drops, reducing overall network efficiency.
● Processing Overhead: Error checking and routing table lookups add to the processing
workload at each router.
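
These overheads can be made concrete with the cost model commonly used for parallel machines. Using notation assumed here (t_s = startup time per message, t_h = per-hop time, t_w = per-word transfer time, m = message size in words, l = number of links traversed), the store-and-forward communication time is

t_comm = t_s + (m * t_w + t_h) * l, which is approximately t_s + m * t_w * l when t_h is small.

For example, with t_s = 10 µs, t_w = 0.1 µs per word, m = 1,000 words and l = 4 hops, the cost is roughly 10 + 1,000 * 0.1 * 4 = 410 µs. The entire packet is received and retransmitted at every hop, which is why the per-word term is multiplied by the number of links.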
Advantages of Store-and-Forward:

● Reliable Delivery: It minimizes the risk of corrupted data reaching the destination by
discarding packets with errors.
● Congestion Control: Routers can implement buffer management techniques to prevent
network congestion.
● Flexibility: Store-and-forward works with various network protocols and data types.

Q.5) Discuss the applications that benefit from multi-core architecture.

Multi-core processors have revolutionized computing by enabling significant performance gains for a wide range of applications. Here's a breakdown of some key application areas that reap the benefits of multi-core architecture:

Scientific Computing and Simulations:

● Complex simulations in fields like physics, chemistry, and engineering leverage multiple
cores to perform intensive calculations faster. This allows for more accurate and detailed
modeling of real-world phenomena.
● Examples: weather forecasting, climate modeling, molecular dynamics simulations for
drug discovery.

Data Analysis and Machine Learning:

● Processing massive datasets often involves parallel tasks like data filtering, sorting, and
aggregations. Multi-core architectures accelerate these operations, enabling faster
analysis and insights extraction.
● Machine learning algorithms, especially deep learning models with many layers and
parameters, benefit from parallel processing on multiple cores for faster training and
inference.
● Examples: large-scale genomics analysis, financial data analysis, training complex
image recognition models.

Multimedia Applications:

● Video editing, encoding, and decoding are computationally demanding tasks that can be
significantly accelerated by multi-core processors. This allows for faster rendering,
real-time editing of high-resolution videos, and smoother playback experiences.
● 3D graphics rendering in games and animation software utilizes multiple cores to handle
complex lighting effects, object interactions, and high-resolution textures, leading to
more immersive visuals.

High-Performance Computing (HPC):


● HPC applications often involve running multiple independent tasks or simulations
simultaneously. Multi-core processors enable parallel execution of these tasks,
drastically reducing overall processing time.
● Examples: drug discovery simulations, protein folding analysis, large-scale climate
modeling.

Web Servers and Database Management:

● Handling high volumes of user requests on web servers benefits from parallel processing
capabilities. Multiple cores can efficiently handle concurrent user connections and
database queries, improving responsiveness and scalability.
● Database management systems can leverage multi-core architecture for faster data
processing, indexing, and complex data manipulation tasks.

Overall, any application that involves:

● High computational workloads: Multi-core processors can distribute the workload across multiple cores, leading to faster execution.
● Task parallelism: If a problem can be broken down into independent or loosely coupled
subtasks, multi-core architecture can significantly improve performance by executing
these tasks concurrently.
● Real-time processing: For applications demanding immediate response times,
multi-core processors can handle multiple tasks simultaneously, minimizing delays.

Q.6) Explain N-wide superscalar architecture in detail.

N-wide Superscalar Architecture: Unleashing Parallel Processing Power

N-wide superscalar architecture is a technique used in modern CPUs to extract more performance by executing multiple instructions concurrently. It builds upon the foundation of traditional scalar processors, which execute instructions one at a time, and injects parallelism at the instruction level.

Here's a breakdown of the key concepts:

● N-Wide: This refers to the processor's issue width, i.e. the number of instructions it can issue to its execution units in a single clock cycle. A wider design (higher N) allows more instructions to execute in parallel; for example, a 2-wide core can issue up to two instructions per cycle and a 4-wide core up to four. Note that N counts issue slots within one core, not the number of cores in a multi-core chip.
● Superscalar Execution: The processor employs sophisticated hardware mechanisms
to achieve superscalar execution. It involves:
○ Instruction Fetch: The processor fetches multiple instructions from memory in a
single clock cycle, typically exceeding the number of execution units (N).
○ Instruction Decode and Dispatch: A decoder unit analyzes the fetched
instructions and identifies independent ones suitable for parallel execution. These
are then dispatched to available execution units.
○ Out-of-Order Execution: To maximize utilization of execution units, the
processor might execute instructions out of their program order if earlier
instructions have dependencies that haven't been resolved yet. This requires
careful instruction scheduling and data dependency checks to ensure correct
program execution.
○ Retirement: Once an instruction finishes execution, it's retired, and its results are
written back to the register file or memory.

Benefits of N-wide Superscalar Architecture:

● Improved Performance: By executing multiple instructions concurrently, N-wide superscalar processors can achieve significant performance gains compared to traditional scalar processors, especially for programs with inherent parallelism.
● Increased Efficiency: The ability to exploit parallelism within a single instruction stream
improves overall processor utilization and reduces idle time.

Challenges of N-wide Superscalar Architecture:

● Complexity: Designing and managing the hardware for instruction fetching, decoding,
scheduling, and out-of-order execution adds complexity to the processor architecture.
● Limited Benefits for Serial Programs: Programs with inherent dependencies between
instructions might not see significant performance improvement with N-wide superscalar
architecture.
● Diminishing Returns: As the issue width (N) increases, the benefits of additional execution units can diminish, because typical programs rarely expose enough independent instructions per cycle and the hardware for tracking dependencies and scheduling instructions grows increasingly complex.

Q.7) List applications of parallel programming.

Here's a list of applications that benefit from parallel programming:

Scientific Computing and Simulations:

● Complex simulations in physics, chemistry, engineering, and other scientific fields leverage parallel processing to perform intensive calculations faster, leading to more accurate and detailed models.
○ Examples: weather forecasting, climate modeling, molecular dynamics
simulations for drug discovery.

Data Analysis and Machine Learning:


● Processing massive datasets often involves parallel tasks like data filtering, sorting, and
aggregations. Parallel programming accelerates these operations, enabling faster
analysis and insights extraction.
○ Examples: large-scale genomics analysis, financial data analysis, training
complex image recognition models.

Multimedia Applications:

● Video editing, encoding, and decoding are computationally demanding. Parallel programming can significantly speed up these tasks, allowing for faster rendering, real-time editing of high-resolution videos, and smoother playback experiences.
● 3D graphics rendering in games and animation software utilizes multiple cores to handle
complex lighting effects, object interactions, and high-resolution textures, leading to
more immersive visuals.

High-Performance Computing (HPC):

● HPC applications often involve running multiple independent tasks or simulations simultaneously. Parallel programming enables parallel execution of these tasks, drastically reducing overall processing time.
○ Examples: drug discovery simulations, protein folding analysis, large-scale
climate modeling.

Web Servers and Database Management:

● Handling high volumes of user requests on web servers benefits from parallel processing
capabilities. Parallel programming allows for efficient handling of concurrent user
connections and database queries, improving responsiveness and scalability.
● Database management systems can leverage parallel programming for faster data
processing, indexing, and complex data manipulation tasks.

Other Applications:

● Signal processing: Parallel programming can accelerate tasks like image and audio
processing, filtering, and analysis.
● Cryptography: Breaking encryption codes or implementing complex cryptographic
algorithms can benefit from parallel processing techniques.
● Bioinformatics: Analyzing large genetic datasets for research purposes can be
significantly faster with parallel programming.
● Financial modeling: Complex financial simulations and risk assessments can be
performed much faster using parallel programming techniques.

In general, any application that involves:

● High computational workloads: Parallel programming can distribute the workload across multiple cores or processors, leading to faster execution.
● Task parallelism: If a problem can be broken down into independent or loosely coupled
subtasks, parallel programming can significantly improve performance by executing
these tasks concurrently.
● Real-time processing: For applications demanding immediate response times, parallel
programming can handle multiple tasks simultaneously, minimizing delays.

Q.8) Explain (with suitable diagram): SIMD, MIMD & SIMT architecture.

1. SIMD (Single Instruction Multiple Data)

● Concept: Executes a single instruction on multiple data streams simultaneously.


● Data Streams: Data elements are typically identical or have a simple structure.
● Processing: All processing elements perform the same operation on their respective
data elements at the same time.
● Applications: Well-suited for tasks with high data parallelism, like image and video
processing, scientific simulations with regular data access patterns.

Diagram:

+--------------------+
| Instruction Fetch | (Single Instruction)
+--------------------+
|
v
+---------+---------+---------+---------+
| PE 0 | PE 1 | PE 2 | PE 3 |
| Data 0 | Data 1 | Data 2 | Data 3 | (Multiple Data Elements)
+---------+---------+---------+---------+

2. MIMD (Multiple Instruction Multiple Data)

● Concept: Executes different instructions on different data streams simultaneously.


● Data Streams: Data elements can be independent or have complex structures.
● Processing: Each processing element has its own program and operates
independently.
● Applications: Versatile for various tasks, including general-purpose computing,
scientific simulations with irregular data access patterns, and running multiple programs
concurrently.

Diagram:

+---------+     +---------+     +---------+
|  PE 0   |     |  PE 1   |     |  PE 2   |   (Different Instructions)
+---------+     +---------+     +---------+
     |               |               |
     v               v               v
+---------+     +---------+     +---------+
| Data 0  |     | Data 1  |     | Data 2  |   (Multiple Data Elements)
+---------+     +---------+     +---------+

3. SIMT (Single Instruction Multiple Thread)

● Concept: Similar to SIMD, executes a single instruction on multiple data streams, but
with more flexibility.
● Data Streams: Data elements can have varying structures and complex dependencies.
● Processing: Threads within a processing element can diverge from the main instruction
stream based on specific conditions within their data. This allows for some level of
conditional execution within the overall SIMT model.
● Applications: Often used in graphics processing units (GPUs) for tasks like image
processing, scientific simulations with some level of data branching, and real-time ray
tracing.

Diagram:

+--------------------+
| Instruction Fetch | (Single Instruction)
+--------------------+
|
v
+---------+---------+---------+---------+
| PE 0 | PE 1 | PE 2 | PE 3 |
| Thread | Thread | Thread | Thread | (Multiple Threads)
|0 |1 |2 |3 |
+---------+---------+---------+---------+
| | | |
v v v v
+-----+ +-----+ +-----+ +-----+ +-----+ (Data with Potential Variations)
| D00 | | D10 | | D20 | | D30 | | ... |
+-----+ +-----+ +-----+ +-----+ +-----+
| | | |
v v v v
+-----+ +-----+ +-----+ +-----+ +-----+
| D01 | | D11 | | D21 | | D31 | | ... |
+-----+ +-----+ +-----+ +-----+ +-----+
Key Differences:

● Instruction Variety: SIMD executes a single instruction, MIMD allows different instructions, and SIMT offers a balance with a single instruction but conditional execution within threads.
● Data Structure: SIMD typically requires simpler data structures, while MIMD and SIMT
can handle more complex data variations.
● Processing Style: SIMD is highly synchronous, MIMD is fully asynchronous, and SIMT
offers a mix with some conditional branching within threads.

Q.9) Explain the impact of Memory Latency & Memory Bandwidth on system
performance.

Memory Latency:

● Concept: Refers to the time it takes for the processor to access data from main memory
after it issues a request. It's essentially the waiting time for data retrieval.
● Impact:
○ Increased latency leads to performance degradation. The processor stalls while
waiting for data, hindering its ability to execute instructions efficiently. This is
particularly significant for tasks that require frequent memory access.
○ Cache plays a vital role: Modern processors employ caches (high-speed
memory closer to the processor) to mitigate high memory latency. Frequently
accessed data is stored in the cache, reducing reliance on main memory and
improving overall performance.

Memory Bandwidth:

● Concept: Refers to the rate at which data can be transferred between the processor and
main memory. It's analogous to the width of a data pipeline, determining how much data
can flow through in a given time unit.
● Impact:
○ Limited bandwidth can bottleneck performance, especially when dealing with
large datasets or applications that require significant data movement between
memory and the processor.
○ High bandwidth enables faster data transfer, improving performance for tasks
that involve heavy data processing or frequent memory access patterns.
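
A rough worked example makes the distinction concrete (the figures below are illustrative assumptions, not a specific machine): with 100 ns memory latency and 25 GB/s bandwidth, fetching a single 64-byte cache line takes about 100 ns + 64 B / 25 GB/s ≈ 100 ns + 2.6 ns ≈ 103 ns, so latency dominates; streaming a 100 MB array takes about 100 ns + 100 MB / 25 GB/s ≈ 4 ms, so bandwidth dominates. Latency therefore hurts pointer-chasing and random-access workloads most, while bandwidth limits large, streaming workloads.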

Q.10) Explain Message Passing Costs in Parallel Computers (Parallel Machines).


In parallel computing, message passing involves sending data and instructions between
processors to work on tasks collectively. While it enables efficient problem-solving, message
passing incurs communication costs that can impact overall performance.

Here's a breakdown of the key factors affecting message passing costs:

● Startup Time: This includes the time spent on both the sending and receiving
processors to prepare the message for transmission and handle routing overhead.
● Data Transfer Time: The actual time it takes to transfer the data across the network
connection between processors. This depends on the data size and network bandwidth.
● Network Topology: The physical layout of the network (e.g., mesh, hypercube) can
influence communication costs due to varying path lengths between processors.

Minimizing message passing costs is crucial for optimal performance in parallel computing.
Strategies include:

● Reducing message frequency: Sending fewer, larger messages is more efficient than
sending numerous small ones.
● Data locality: Organizing data such that frequently accessed data resides on the same
processor or nearby processors minimizes communication needs.
● Overlapping communication and computation: When possible, processors can
perform computations while data is being transferred, improving overall efficiency.
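
The first strategy can be quantified with the simple per-message cost model time(m) = t_s + t_w * m. A small sketch (the constants T_S and T_W are assumed values chosen only for illustration):

# Compare sending many small messages vs. one aggregated message
# using the cost model: time(m) = t_s + t_w * m.

T_S = 10.0    # startup (latency) cost per message, in microseconds (assumed)
T_W = 0.1     # transfer cost per word, in microseconds (assumed)

def message_time(words):
    return T_S + T_W * words

total_words = 10_000
many_small = 1_000 * message_time(total_words // 1_000)   # 1000 messages of 10 words
one_large = message_time(total_words)                     # one 10,000-word message

print(f"1000 x 10-word messages: {many_small:.0f} us")    # 11000 us
print(f"1 x 10000-word message:  {one_large:.0f} us")     # 1010 us

The startup cost t_s is paid once per message, so aggregating the same data into a single message is roughly ten times cheaper in this example.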

Q.11) Describe Uniform-memory-access and Non-uniform-memory-access with diagrammatic representation.

Uniform Memory Access (UMA):

● Concept: In UMA architecture, all processors share a single memory space and have
equal access time to any memory location regardless of the processor's physical
location.
● Diagram:

+--------------+   +--------------+   +--------------+
| Processor 0  |   | Processor 1  |   | Processor N  |
+--------------+   +--------------+   +--------------+
        |                  |                  |
        v                  v                  v
       +----------------------------------------+
       |              Shared Memory             |
       +----------------------------------------+

● Benefits:

○ Simpler architecture: Easier to manage and program due to uniform access patterns.
○ Reduced communication overhead: No need for complex message passing
between processors to access memory.
○ Fair resource allocation: All processors have equal access to memory resources.
● Drawbacks:

○ Scalability limitations: As the number of processors increases, the shared memory becomes a bottleneck, impacting performance.
○ Limited memory bandwidth: With multiple processors accessing the same
memory, bandwidth can become a constraint.

Non-uniform Memory Access (NUMA):

● Concept: In NUMA architecture, each processor has its own local memory that it can
access faster than non-local memory (memory associated with another processor).
Accessing non-local memory involves additional communication overhead.
● Diagram:

+--------------+   +--------------+   +--------------+
| Processor 0  |   | Processor 1  |   | Processor N  |
+--------------+   +--------------+   +--------------+
        |                  |                  |
        v                  v                  v
+--------------+   +--------------+   +--------------+
|  Loc Mem 0   |   |  Loc Mem 1   |   |  Loc Mem N   |   (Local Memory)
+--------------+   +--------------+   +--------------+
        |                  |                  |
        +------------------+------------------+
                           |
                           v
                 +--------------------+
                 |   Shared Memory    |   (Optional)
                 +--------------------+

● Benefits:

○ Improved scalability: NUMA architecture scales better with increasing core count as processors primarily rely on faster local memory.
○ Potentially higher memory bandwidth: By distributing memory across processors, the overall memory bandwidth can be higher compared to UMA.
● Drawbacks:

○ Increased complexity: Managing data locality and communication between processors adds complexity to programming and system management.
○ Potential performance penalties: Accessing non-local memory can incur significant overhead, impacting performance if not carefully managed.

Q.12) Write a short note on: (i) Dataflow Models, (ii) Demand Driven Computation, (iii)
Cache Memory

(i) Dataflow Models:

Dataflow models describe how data flows through a system and how computations are triggered
by the availability of data. They are particularly useful in parallel and distributed computing
environments. Common dataflow models include:

● Batch Sequential: Processes data in large batches, one after another. Simple to
implement but not suitable for problems with inherent dependencies between tasks.
● Streaming (or Pipelined): Processes data in a continuous flow, breaking it down into
smaller chunks and applying operations as they arrive. Enables high parallelism but can
be more complex to manage.

(ii) Demand Driven Computation:

This approach focuses on calculating only the information that is actually needed. It's useful for
avoiding unnecessary computations and optimizing resource usage. In the context of dataflow
models, demand for data or computation is triggered by the arrival of specific data elements.

Here's how it works:


1. The system identifies the data or computation required to fulfill a request.
2. It checks if the data is already available (e.g., in a cache).
3. If not, the computation is initiated, and the result is stored for future use.

(iii) Cache Memory:

Cache memory is a small, high-speed memory located closer to the processor than main
memory. It stores frequently accessed data or instructions to reduce the average time it takes to
access data. This significantly improves performance because accessing cache memory is
much faster than accessing main memory.

Here are some key points about cache memory:

● Cache Hierarchy: Modern systems often employ multiple levels of cache (L1, L2, L3)
with varying sizes and access times.
● Cache Coherence: In multiprocessor systems, cache coherence protocols ensure that
all processors have a consistent view of the data stored in main memory.

Q.13) Describe the merits of Multi-threading over Pre-fetching techniques.

Multi-threading Advantages:

● Improved Performance for Parallel Tasks: When a program can be broken down into
independent or loosely coupled subtasks, multi-threading allows multiple threads to
execute these tasks concurrently. This can lead to significant performance gains
compared to pre-fetching, which is primarily focused on fetching data in anticipation of
future needs.

● Efficient Resource Utilization: Multi-threading enables a single process to utilize multiple processor cores effectively. Even if a task is waiting for data, other threads within the same process can continue executing, improving overall resource utilization. Pre-fetching, on the other hand, doesn't directly address resource utilization beyond potentially reducing main memory access delays.

● Flexibility for Unpredictable Workloads: Multi-threading works well for workloads with
unpredictable data access patterns or dependencies that arise during runtime. Threads
can adapt to changing conditions and synchronize execution as needed. Pre-fetching is
less flexible as it relies on static predictions of future data requirements, which might not
always be accurate.

● Reduced Context Switching Overhead: Compared to pre-fetching, which might involve switching between the main program and pre-fetching routines, multi-threading leverages existing context within a single process. This can minimize context switching overhead, further improving performance.

● Granularity of Parallelism: Multi-threading offers finer-grained parallelism compared to pre-fetching. Multiple threads can work on smaller, independent tasks within a larger program, leading to more efficient utilization of processing resources.
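
The resource-utilization point above can be sketched with two threads: one blocks on a slow "data fetch" (simulated here with a sleep), while the other keeps computing. The names and timings are invented for illustration:

# While one thread is blocked waiting for data, another thread keeps computing.
import threading
import time

def fetch_data(result):
    time.sleep(0.5)                              # simulate a slow memory / I/O access
    result["data"] = list(range(1000))

def independent_work():
    return sum(i * i for i in range(100_000))    # useful work that needs no fetched data

result = {}
fetcher = threading.Thread(target=fetch_data, args=(result,))

start = time.time()
fetcher.start()                # the fetch proceeds in the background
partial = independent_work()   # the main thread computes meanwhile
fetcher.join()                 # wait for the data before using it
print(f"elapsed: {time.time() - start:.2f}s, fetched {len(result['data'])} items")

The elapsed time is close to the longer of the two activities rather than their sum, which is the kind of overlap that pre-fetching alone cannot provide when the access pattern is not predictable.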

Q.14) Explain memory hierarchy and thread organization. Give a summarized response.

Memory Hierarchy and Thread Organization: A High-Level View

Memory Hierarchy:

● Addresses the trade-off between speed and capacity of memory.


● Organizes memory into levels based on access speed and size:
○ Registers (fastest, smallest): Located within the CPU, hold frequently used
data for immediate access.
○ Cache (fast, medium size): Stores recently accessed data or instructions closer
to the CPU than main memory.
○ Main Memory (slower, larger): Holds the bulk of program data and instructions.
○ Secondary Storage (slowest, largest): Used for long-term data storage (e.g.,
hard drives).

Thread Organization:

● Enables parallel execution within a single process.


● Multiple threads share the same memory space and resources of the process.
● Organization can impact performance:
○ Thread scheduling: Determines which thread gets access to the CPU at a given
time.
○ Synchronization: Ensures threads cooperate and access shared data safely to
avoid conflicts.

Together they impact performance:

● Effective use of memory hierarchy reduces main memory access, improving speed.
● Thread organization allows parallel processing of independent tasks within a program.

Key considerations:

● Data locality: Keeping frequently accessed data closer to the CPU (registers, cache)
improves performance.
● Thread workload: Choose the right number of threads to match available processing
cores and task complexity.
Q.15) Explain control structure of parallel platforms in detail.

The control structure of parallel platforms refers to how tasks are coordinated and executed on
multiple processors within a computer system. Here's a concise breakdown of the key aspects:

1. Levels of Parallelism:

● Instruction-level: Explores parallelism within a single program by executing independent instructions concurrently. (e.g., superscalar processors)
● Data-level: Focuses on parallel processing of independent data elements using the
same instruction. (e.g., SIMD architecture)
● Task-level: Involves parallel execution of entire tasks or subtasks within a program.
(e.g., multi-core processors with multi-threading)

2. Communication Models:

● Shared-address space: Processors access data from a common memory space, requiring synchronization mechanisms to avoid conflicts. (e.g., cache coherence protocols)
● Message passing: Processors exchange data and instructions by explicitly sending
messages to each other. (e.g., MPI libraries)

3. Thread Management:

● Scheduling: Determines how threads are allocated processing resources (CPU cores)
and the order in which they are executed. (e.g., round-robin scheduling)
● Synchronization: Ensures threads cooperate and access shared data safely to prevent
race conditions and data corruption. (e.g., mutexes, semaphores)

4. Programming Paradigms:

● Fork-join: A parent process creates child processes (or threads) that execute tasks in
parallel and then rejoin the parent process upon completion.
● Data parallelism: The same operation is applied to different data elements concurrently.
● Task parallelism: Different tasks within a program are executed concurrently.
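
The fork-join paradigm above can be sketched with Python's standard library; the work function and the four-way split below are invented placeholders:

# Fork-join: the parent forks several workers, they run in parallel,
# and the parent joins by collecting all results before continuing.
from concurrent.futures import ProcessPoolExecutor

def work(chunk):
    return sum(x * x for x in chunk)             # placeholder per-task computation

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]      # split the work into 4 tasks

    with ProcessPoolExecutor(max_workers=4) as pool:   # fork
        partials = list(pool.map(work, chunks))        # tasks run in parallel

    total = sum(partials)                              # join: combine the results
    print(total)

The same structure also illustrates data parallelism (the same operation applied to different chunks); task parallelism would simply submit different functions to the pool.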

Q.16) Define latency and bandwidth of memory.

● Memory Latency: Refers to the time it takes for the processor to access data from main
memory after it issues a request. It's essentially the waiting time for data retrieval.
● Memory Bandwidth: Represents the rate at which data can be transferred between the
processor and main memory. It's analogous to the width of a data pipeline, determining
how much data can flow through in a given time unit.

Here's a table summarizing the key differences:

Feature                   Memory Latency      Memory Bandwidth
Measures                  Time                Rate
Analogy                   Waiting time        Width of a pipe
Impact on performance     Lower is better     Higher is better

In simpler terms:

● Latency is how long you wait for the data.


● Bandwidth is how much data you can get in a given amount of time.

Q.17) Explain basic working principal of: (i) Superscalar processor, (ii) VLIW processor

(i) Superscalar Processor:

● Concept: A superscalar processor can execute multiple instructions concurrently in a single clock cycle. It achieves this by employing sophisticated hardware mechanisms for:
○ Instruction Fetch: Fetches multiple instructions from memory in a single cycle,
exceeding the number of execution units available.
○ Instruction Decode and Dispatch: Analyzes the fetched instructions, identifies
independent ones, and dispatches them to available execution units within the
processor.
○ Out-of-Order Execution: To maximize utilization of execution units, the
processor might execute instructions out of their program order if earlier
instructions have dependencies that haven't been resolved yet. This requires
careful instruction scheduling and data dependency checks to ensure correct
program execution.
○ Retirement: Once an instruction finishes execution, it's retired, and its results are
written back to the register file or memory.

Benefits:

● Improved Performance: By executing multiple instructions concurrently, superscalar processors can achieve significant performance gains compared to traditional scalar processors, especially for programs with inherent parallelism.
● Increased Efficiency: The ability to exploit parallelism within a single instruction stream
improves overall processor utilization and reduces idle time.

Challenges:

● Complexity: Designing and managing the hardware for instruction fetching, decoding,
scheduling, and out-of-order execution adds complexity to the processor architecture.
● Limited Benefits for Serial Programs: Programs with inherent dependencies between
instructions might not see significant performance improvement with superscalar
architecture.
● Diminishing Returns: As the issue width (number of execution units) increases, the benefits of additional units can diminish due to limited instruction-level parallelism in typical programs and the growing complexity of managing instruction dependencies.

(ii) VLIW Processor (Very Long Instruction Word):

● Concept: A VLIW processor relies on static instruction scheduling to achieve parallelism. The compiler plays a crucial role in analyzing the program and generating
code with multiple operations bundled into a single, very long instruction word (VLIW).
This VLIW is then issued to the processor, which executes the operations within the
instruction concurrently.

Benefits:

● Reduced Hardware Complexity: Since the compiler performs instruction scheduling, the VLIW processor itself can be simpler compared to a superscalar processor that requires complex hardware for dynamic scheduling.
● Potentially Higher Performance: With efficient static scheduling by the compiler, VLIW
processors can achieve higher instruction-level parallelism compared to some
superscalar designs.

Challenges:

● Increased Compiler Complexity: Compilers for VLIW processors need to be sophisticated to effectively schedule instructions and exploit parallelism within the code.
● Less Flexibility: Reliance on static scheduling can make VLIW processors less
adaptable to code with unpredictable data dependencies or runtime variations.
● Instruction Set Challenges: Designing an instruction set that allows for efficient
packing of multiple operations within a VLIW can be complex.

Q.18) Explain Superscalar execution in terms of horizontal waste and vertical waste with
example.

Superscalar processors aim to boost performance by executing multiple instructions concurrently within a single clock cycle. However, achieving this parallelism isn't always perfect, and two types of "waste" can occur: horizontal waste and vertical waste.

Horizontal Waste:

● Concept: Refers to a situation where not all execution units within the superscalar
processor are utilized in a particular clock cycle. This happens when there aren't enough
independent instructions available to fill all the execution units.
● Analogy: Imagine a restaurant kitchen with multiple chefs (execution units). Horizontal
waste occurs when there aren't enough orders (instructions) to keep all chefs busy in a
particular cycle. Some chefs stand idle, even though there's cooking capacity.
● Example: Consider a 4-wide superscalar processor that can issue up to four instructions per clock cycle. In a given cycle, the fetch and decode logic finds only three instructions that are independent of one another, so three issue slots are filled and the fourth execution unit sits idle for that cycle. This is horizontal waste: part of the issue width goes unused because there are not enough independent instructions to fill it.

Vertical Waste:

● Concept: Occurs when a clock cycle is wasted entirely because no instruction can be
completed. This can happen due to dependencies between instructions or limitations in
the processor's design.
● Analogy: Vertical waste is like a stall in the kitchen workflow. Perhaps a chef (execution
unit) is waiting for ingredients (data) from another chef who hasn't finished their task yet
(dependency). The entire kitchen (processor) stalls until the ingredient (data) becomes
available.
● Example: In the same 4-wide processor, suppose the next instructions all depend on the result of a load that misses in the cache. Until the load completes, no instruction can be issued at all, so entire cycles pass with every issue slot empty. This is vertical waste: whole cycles are lost, rather than just a fraction of the issue width.

Minimizing Waste:
● Instruction Fetch: Superscalar processors often fetch more instructions than available
execution units to increase the chance of finding independent instructions.
● Out-of-Order Execution: Advanced techniques like out-of-order execution allow the
processor to reorder instructions and execute them even if earlier instructions haven't
finished, as long as there are no dependencies. This helps reduce vertical waste.
● Compiler Optimizations: Compilers can play a role in optimizing code to improve
instruction-level parallelism and reduce dependencies, leading to less horizontal and
vertical waste.

Unit 2:

Q.1) Explain: (i) granularity, (ii) concurrency and (iii) dependency graph.

(i) Granularity:

● Concept: Refers to the size and complexity of tasks that are broken down for parallel
execution.
● Impact:
○ Fine-grained: Smaller, simpler tasks offer more opportunities for parallelism but
can introduce more overhead in managing them.
○ Coarse-grained: Larger, more complex tasks reduce management overhead but
might limit parallelism if tasks lack sufficient internal independence.

(ii) Concurrency:

● Concept: The ability of a system to execute multiple tasks seemingly at the same
time. This doesn't necessarily mean true simultaneous execution; it can involve rapid
switching between tasks.
● Importance: Enables efficient utilization of multiple processing cores or resources to
improve overall performance for problems that can be broken down into independent or
loosely coupled subtasks.

(iii) Dependency Graph:

● Concept: A visual representation of the relationships between tasks in a program. Nodes represent tasks, and arrows depict dependencies between them. A task can only start if all its predecessor tasks (connected by incoming arrows) have finished execution.
● Benefits:
○ Visualization: Provides a clear picture of task dependencies and potential
bottlenecks.
○ Optimization: Helps identify opportunities for parallel execution by highlighting
independent tasks and potential restructuring to minimize dependencies.
Q.2) Define and explain the following terms - (i) Degree of concurrency, (ii) Task
interaction graph.

1. Degree of Concurrency (DoC):

● Concept: The maximum number of tasks that can be executed concurrently in a parallel program at any given time. It essentially reflects the level of parallelism achievable within the program.

● Impact:

○ A higher DoC indicates the program can potentially utilize more processing
resources simultaneously, leading to faster execution.
○ However, the DoC is not always equal to the total number of tasks in the program
due to dependencies between tasks.
● Factors Affecting DoC:

○ Task Granularity: Finer-grained tasks offer more opportunities for concurrency, but managing them can introduce overhead. Coarser-grained tasks might limit concurrency if they lack internal independence.
○ Resource Availability: The number of available processing cores or resources
in the system limits the DoC.
○ Task Dependencies: The presence of dependencies between tasks restricts
concurrent execution until the dependencies are resolved.

2. Task Interaction Graph (TIG):

● Concept: A visual representation of the interactions among tasks in a parallel program, where:

○ Nodes represent individual tasks.
○ Edges connect tasks that share or exchange data during execution. Unlike the task-dependency graph, whose directed edges express precedence, the edges of a TIG capture communication and are usually undirected.
● Benefits:

○ Visualization: The TIG provides a clear picture of which tasks need to communicate and how heavily, exposing potential communication bottlenecks.
○ Mapping Optimization: By analyzing the TIG, tasks that interact frequently can be mapped to the same or nearby processors, reducing interaction overhead and improving concurrency.
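
The degree of concurrency defined in point 1 can be read off a task-dependency graph by walking it level by level: at each step, the tasks whose predecessors have all finished form the ready set, and the largest ready set seen is the maximum degree of concurrency. A minimal sketch with an invented five-task graph:

# deps[t] = set of tasks that must finish before task t can start.
deps = {
    "A": set(),          # A and B have no prerequisites
    "B": set(),
    "C": {"A"},          # C waits for A
    "D": {"A", "B"},     # D waits for A and B
    "E": {"C", "D"},     # E waits for C and D
}

finished = set()
max_concurrency = 0
while len(finished) < len(deps):
    ready = [t for t, pre in deps.items()
             if t not in finished and pre <= finished]
    max_concurrency = max(max_concurrency, len(ready))
    print("can run concurrently:", ready)
    finished.update(ready)        # pretend this whole wave executes in parallel

print("maximum degree of concurrency:", max_concurrency)   # 2 for this graph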

Q.3) Explain decomposition, task and dependency graph.


Decomposition:

● Concept: The process of breaking down a large, complex problem into smaller, more
manageable sub-problems that can be executed concurrently. This is the foundation for
achieving parallelism in a program.
● Benefits:
○ Enables efficient utilization of multiple processors or resources.
○ Simplifies problem-solving by focusing on smaller, independent units.
○ Improves code readability and maintainability.

Tasks:

● Concept: The individual units of work created after decomposing a problem. Tasks
represent the smallest elements that can be executed independently (or with minimal
dependencies) in a parallel program.
● Characteristics:
○ Can be of varying sizes and complexity depending on the problem and
decomposition strategy.
○ May require communication and data exchange with other tasks for overall
program execution.

Dependency Graph:

● Concept: A visual representation of the relationships between tasks generated during decomposition. It uses a directed acyclic graph (DAG) where:
○ Nodes represent individual tasks.
○ Directed Edges depict dependencies between tasks. An arrow from task A to
task B indicates that task B cannot start execution until task A finishes.
● Benefits:
○ Visualization: Provides a clear picture of task dependencies and potential
bottlenecks in parallel execution.
○ Optimization: Helps identify opportunities for parallel execution by highlighting
independent tasks (not connected by edges) and areas where dependencies can
be minimized for improved performance.

Relationship Between Decomposition, Tasks, and Dependency Graphs:

● Decomposition defines the granularity of tasks and their relationships.


● Tasks are the individual units of work derived from decomposition.
● Dependency Graph visually depicts the relationships (dependencies) between these
tasks created during decomposition.

Q.4) What are the limitations of parallelization of any algorithm?


While parallelization offers significant performance gains for many algorithms, it's not a perfect
solution and has limitations to consider:

1. Amdahl's Law: This law sets a theoretical limit on the speedup achievable through parallelization. If a fraction p of the program can be parallelized over N processors, the speedup is at most 1 / ((1 - p) + p / N). Even with infinite processing resources, the sequential portion (1 - p) of the algorithm limits the overall improvement (a worked example follows this list).

2. Overhead of Parallelization: Introducing parallelism adds complexities like:

○ Thread creation and management: Creating and managing multiple threads can incur overhead compared to a simpler sequential execution.
○ Communication and synchronization: Threads might need to communicate
and synchronize access to shared data, introducing additional overhead
compared to a single thread operating on private data.
○ Load balancing: Ensuring an even distribution of work across processing cores
is crucial for optimal performance. Uneven load balancing can lead to idle cores
and reduced efficiency.

3. Limited Scalability: Not all algorithms scale perfectly with increasing processing cores. As
the number of cores grows, the communication and synchronization overhead can become
significant, potentially outweighing the benefits of parallelism.

4. Algorithm Suitability: Not all algorithms can be effectively parallelized. Some algorithms
have inherent dependencies between steps that make it difficult to break them down into
independent tasks suitable for concurrent execution.

5. Debugging Challenges: Debugging parallel programs can be more complex compared to sequential programs. Issues like race conditions, deadlocks, and data inconsistencies can be harder to identify and resolve.
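
Amdahl's law from point 1 can be made concrete with a short calculation; the parallel fraction p = 0.90 below is an assumed value chosen only for illustration:

# Amdahl's law: speedup(N) = 1 / ((1 - p) + p / N),
# where p is the parallelizable fraction and N the number of processors.

def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

for n in (2, 8, 64, 1_000_000):
    print(f"p = 0.90, N = {n:>7}: speedup = {amdahl_speedup(0.90, n):.2f}")
# Even with a million processors the speedup approaches only 1 / (1 - 0.90) = 10x,
# because the 10% sequential portion always runs serially.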

Q.5) Explain with suitable examples: (a) Recursive decomposition, (b) Exploratory
decomposition, (c) Data decomposition

(a) Recursive Decomposition:

● Concept: Breaks down a problem into smaller, self-similar subproblems of the same
type. The process is repeated recursively until the subproblems become simple enough
to be solved directly. This is a natural approach for problems that can be divided into
smaller versions of themselves with the same structure.

● Example:
Consider the problem of sorting a list of numbers using the Merge Sort algorithm.

1. Base Case: If the list has only one element, it's already sorted (nothing to do).
2. Recursive Step: Divide the list into roughly equal halves.
○ Recursively sort the first half.
○ Recursively sort the second half.
○ Merge the two sorted halves into a single sorted list.

Here, each subproblem (sorting half the list) is a smaller version of the original problem (sorting
the entire list). Recursive decomposition allows efficient use of multiple processors as each
subproblem can be sorted concurrently.
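
A minimal sketch of this recursive decomposition, sorting the two halves in separate processes with Python's standard library (illustrative only, not a tuned implementation):

# Recursive decomposition of merge sort: the two halves are independent
# subproblems, so the top-level split can be sorted by separate processes.
from concurrent.futures import ProcessPoolExecutor

def merge(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    return out + left[i:] + right[j:]

def merge_sort(data):                      # ordinary sequential merge sort
    if len(data) <= 1:
        return data
    mid = len(data) // 2
    return merge(merge_sort(data[:mid]), merge_sort(data[mid:]))

def parallel_merge_sort(data):             # parallelize only the top-level split
    mid = len(data) // 2
    with ProcessPoolExecutor(max_workers=2) as pool:
        left, right = pool.map(merge_sort, [data[:mid], data[mid:]])
    return merge(left, right)

if __name__ == "__main__":
    print(parallel_merge_sort([5, 2, 9, 1, 7, 3, 8, 6]))   # [1, 2, 3, 5, 6, 7, 8, 9]

Deeper levels of the recursion could be farmed out in the same way, which is how the decomposition exposes more and more concurrent subproblems.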

(b) Exploratory Decomposition:

● Concept: Involves breaking down a problem into subproblems based on exploring the
search space of possible solutions. This technique is often used for problems where the
solution path is not entirely known in advance, and different subproblems might lead to
different solutions.

● Example:

Imagine searching for the best route in a maze.

1. Initial State: Start at the entrance of the maze.


2. Exploration: Explore possible paths by moving through available directions
(subproblems).
○ Evaluate each path by checking if it leads to the exit.
○ Prune paths that reach dead ends.
○ Explore promising paths further, potentially splitting the exploration across
multiple processors.

This approach allows parallel exploration of multiple promising paths to find the best route
(solution) faster.

(c) Data Decomposition:

● Concept: Focuses on dividing the data associated with a problem into smaller chunks.
These chunks can then be processed independently or with minimal communication
across processors. This technique is effective for problems where the same operation
needs to be applied to different parts of the data.

● Example:

Consider processing a large image and applying a filter (e.g., blur) to each pixel.
1. Data Partitioning: Divide the image into smaller tiles (subproblems) containing a subset
of pixels.
2. Parallel Processing: Assign each tile to a different processor, which independently
applies the blur filter to its assigned pixels.
3. Result Aggregation: Once all tiles are processed, combine the filtered tiles back into
the final filtered image.

Data decomposition allows for efficient parallel processing of the image data, where each
processor can independently apply the filter to its assigned tile.
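
A minimal sketch of this data decomposition, treating the "image" as a nested list of pixel values and the "filter" as a simple horizontal average (both are invented stand-ins for a real image and blur):

# Data decomposition: split the image into row-tiles, filter each tile in a
# separate process, then stitch the results back together.
from concurrent.futures import ProcessPoolExecutor

def blur_tile(tile):
    # Placeholder "filter": average each pixel with its left/right neighbours.
    blurred = []
    for row in tile:
        new_row = []
        for i in range(len(row)):
            neighbours = row[max(0, i - 1): i + 2]
            new_row.append(sum(neighbours) // len(neighbours))
        blurred.append(new_row)
    return blurred

if __name__ == "__main__":
    image = [[r * 10 + c for c in range(8)] for r in range(8)]   # fake 8x8 image
    tiles = [image[i:i + 2] for i in range(0, len(image), 2)]    # 4 tiles of 2 rows

    with ProcessPoolExecutor(max_workers=4) as pool:
        filtered_tiles = list(pool.map(blur_tile, tiles))        # parallel processing

    filtered_image = [row for tile in filtered_tiles for row in tile]   # aggregation
    print(filtered_image[0])

A real blur also needs a pixel's neighbours from adjacent rows, so tiles would have to exchange a small halo of boundary rows; that exchange is exactly the kind of task interaction discussed later in this unit.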

Q.6) Explain various data decomposition techniques with suitable examples.

1. Block Distribution:

● Concept: Divides the data into equally sized contiguous blocks along a single
dimension. This approach is simple to implement and works well for problems where
operations are independent across different data blocks.

● Example: Consider performing statistical calculations (e.g., mean, standard deviation) on a large dataset. The data can be split into equal-sized blocks, and each block can be assigned to a different processor for independent calculation.

2. Cyclic Distribution:

● Concept: Divides the data into equally sized chunks and distributes them cyclically
across processors. This technique ensures a more balanced distribution of workload
compared to block distribution, especially when data items have varying processing
times.

● Example: Imagine processing a large log file where each line represents an event.
Cyclic distribution ensures each processor receives a mix of potentially short and long
log entries, leading to better load balancing compared to assigning entire blocks that
might have skewed processing times.

3. Scatter Decomposition:

● Concept: Distributes specific data elements based on a key associated with each
element. This technique is useful when operations depend on specific data values rather
than their position in the original data set.
● Example: Consider a database where customer information needs to be processed
based on location. Scatter decomposition can distribute customer records to processors
based on their city or region, allowing processors to efficiently handle queries or
operations specific to those locations.

4. Hashing:

● Concept: Uses a hash function to map each data element to a specific processor. This
technique is useful for situations where the workload associated with each data element
is unpredictable or the data needs to be grouped based on certain attributes.

● Example: Imagine processing a large collection of social media posts and analyzing
sentiment. Hashing can map each post to a processor based on the dominant sentiment
(positive, negative, neutral) expressed in the text. This allows processors to efficiently
analyze posts with similar sentiment.
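
The difference between block and cyclic distribution can be seen directly in how data indices are assigned to processors; a short sketch (12 items and 4 processors are assumed purely for illustration):

# Block vs. cyclic distribution of 12 data items over 4 processors.
NUM_ITEMS, NUM_PROCS = 12, 4

# Block: contiguous chunks of size NUM_ITEMS // NUM_PROCS per processor.
block_size = NUM_ITEMS // NUM_PROCS
block = {p: list(range(p * block_size, (p + 1) * block_size))
         for p in range(NUM_PROCS)}

# Cyclic: item i goes to processor i % NUM_PROCS.
cyclic = {p: [i for i in range(NUM_ITEMS) if i % NUM_PROCS == p]
          for p in range(NUM_PROCS)}

print("block :", block)    # {0: [0, 1, 2], 1: [3, 4, 5], 2: [6, 7, 8], 3: [9, 10, 11]}
print("cyclic:", cyclic)   # {0: [0, 4, 8], 1: [1, 5, 9], 2: [2, 6, 10], 3: [3, 7, 11]}

If the cost of processing an item grows with its index, the block scheme loads processor 3 far more heavily than processor 0, while the cyclic scheme gives every processor a mix of cheap and expensive items.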

Q.7) What are the different characteristics of task and interactions?

Task Characteristics:

● Task Generation:
○ Static: Tasks are pre-defined and known in advance, allowing for efficient
allocation and scheduling. (e.g., matrix multiplication with fixed matrix sizes)
○ Dynamic: Tasks are generated during runtime based on the program's execution
or data encountered. (e.g., search algorithms exploring a dynamic search space)
● Task Granularity:
○ Fine-grained: Smaller, more focused tasks offer more parallelism but can
introduce overhead in managing them. (e.g., individual arithmetic operations)
○ Coarse-grained: Larger, more complex tasks reduce management overhead but
might limit parallelism if tasks lack internal independence. (e.g., processing an
entire image file)
● Data Association:
○ Independent: Tasks operate on independent data sets, allowing for true parallel
execution without communication.
○ Shared Data Access: Tasks might need to access or modify shared data,
requiring synchronization mechanisms to avoid conflicts.

Interaction Characteristics:

● Read-Only vs. Read-Write:


○ Read-Only: Tasks only read data associated with other tasks, requiring no
synchronization for data access.
○ Read-Write: Tasks read and modify shared data, necessitating synchronization
mechanisms (e.g., mutexes) to ensure data consistency.
● Communication Patterns:
○ Regular: Interactions follow a predictable pattern, allowing for efficient
communication strategies. (e.g., processor grid communication in cellular
automata)
○ Irregular: Communication patterns are unpredictable or dynamic, making
communication management more challenging. (e.g., message passing in
unstructured search algorithms)
● Frequency:
○ Frequent: Tasks interact frequently, requiring careful handling to minimize
communication overhead.
○ Infrequent: Tasks interact less often, potentially simplifying communication
management.

Q.8) Differentiate between static and dynamic mapping techniques for load balancing.

● Definition: Static mapping pre-defines the assignment of tasks to processors before execution; dynamic mapping assigns tasks to processors during program execution.
● Knowledge required: Static mapping requires knowledge of task characteristics and processor capabilities; dynamic mapping does not require this knowledge upfront.
● Flexibility: Static mapping is less flexible and requires re-mapping when the workload changes; dynamic mapping adapts to changing workloads and processor availability.
● Overhead: Static mapping has lower overhead because the mapping is fixed in advance; dynamic mapping may have higher overhead due to runtime task analysis and assignment.
● Suitability: Static mapping suits applications with stable workloads and predictable task characteristics; dynamic mapping suits dynamic workloads, unpredictable task sizes, or unknown processor capabilities.
● Examples: Static: round-robin scheduling, block distribution. Dynamic: task stealing, work stealing, adaptive load balancing.
● Benefits: Static mapping is simpler to implement with lower runtime overhead; dynamic mapping is more efficient for dynamic workloads and avoids processor idle time.
● Drawbacks: Static mapping can lead to load imbalance if task characteristics or the workload change; dynamic mapping may introduce additional complexity and overhead in managing assignments at runtime.

Q.9) Explain mapping techniques for load balancing.

Load balancing is crucial for ensuring efficient utilization of processing resources in parallel
computing. Mapping techniques define how tasks are assigned to processors to achieve this
balance. Here's a breakdown of some common mapping techniques:

Static Mapping:

● Involves pre-defining the assignment of tasks to processors before program execution begins. This approach is simpler to implement and has lower runtime overhead. However, it requires upfront knowledge of:
○ Task characteristics: Size, complexity, and communication requirements of
tasks.
○ Processor capabilities: Processing power, memory availability, and
communication bandwidth.

Common Static Mapping Techniques:

● Round-Robin Scheduling: Tasks are assigned to processors in a circular fashion, ensuring a fair distribution of workload. Effective for tasks with similar execution times.
● Block Distribution: Data is divided into equally sized blocks, and each block is
assigned to a different processor. Suitable for problems where operations are
independent across different data blocks.
● Cyclic Distribution: Similar to block distribution, but data is distributed cyclically across
processors to ensure better load balancing, especially for data items with varying
processing times.

Dynamic Mapping:

● Assigns tasks to processors during program execution. This approach is more flexible
and adapts to changing workloads or unforeseen variations in task characteristics.
However, it can introduce additional overhead for runtime task analysis and assignment.

Common Dynamic Mapping Techniques:

● Work Stealing: Idle processors (or threads) can "steal" work from overloaded
processors, promoting better load balancing.
● Task Queues: Tasks are placed in a central queue, and any available processor picks
up the next task for execution. This approach simplifies load balancing but can introduce
overhead for managing the central queue.
● Adaptive Load Balancing: Monitors system performance and dynamically adjusts task
assignments based on processor load and communication patterns. Requires
sophisticated algorithms to analyze runtime behavior.
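
On shared memory, the task-queue idea is often implemented as nothing more than a shared counter that threads advance atomically: whichever thread becomes idle claims the next task index. The OpenMP sketch below assumes the tasks are independent and identified by an integer id; run_task() is a placeholder for real work.

```c
#include <stdio.h>

#define NTASKS 100

/* Placeholder for real work; the cost varies with the task id. */
static void run_task(int id) {
    volatile double s = 0.0;
    for (int k = 0; k < (id % 7 + 1) * 100000; k++) s += 1.0;
}

int main(void) {
    int next = 0;                          /* the shared "queue" of task indices */

    #pragma omp parallel
    {
        for (;;) {
            int my_task;
            #pragma omp atomic capture      /* claim the next unassigned task */
            my_task = next++;
            if (my_task >= NTASKS) break;
            run_task(my_task);
        }
    }
    printf("all %d tasks done\n", NTASKS);
    return 0;
}
```

Work stealing refines this by giving each thread its own local deque of tasks and touching other threads' deques only when the local one is empty, which removes contention on the single shared counter.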

Q.10) Explain the methods for containing interaction overheads.

1. Maximizing Data Locality:

● Concept: This strategy aims to keep the data a task needs close to the processor that
will use it. This reduces the need for remote data access and communication across the
network.
● Techniques:
○ Data partitioning and distribution: Distribute data strategically across
processor memory based on access patterns.
○ Loop transformations: Reorganize loops to improve data reuse and minimize
redundant data access within a processor's cache.
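
A classic loop transformation for locality is tiling (blocking). The sketch below tiles a matrix-matrix multiply so that each TILE x TILE block stays in cache while it is reused; the matrix size and tile size are arbitrary placeholders that would be tuned to the actual cache in practice.

```c
#include <stdlib.h>

#define N 512
#define TILE 64     /* placeholder; pick so a few tiles fit comfortably in cache */

/* Tiled (blocked) C += A * B. The inner three loops work on small blocks,
 * so every element brought into cache is reused many times before eviction. */
static void matmul_tiled(const double *A, const double *B, double *C) {
    for (int ii = 0; ii < N; ii += TILE)
        for (int kk = 0; kk < N; kk += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++) {
                        double a = A[i * N + k];
                        for (int j = jj; j < jj + TILE; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
}

int main(void) {
    double *A = calloc(N * N, sizeof *A);
    double *B = calloc(N * N, sizeof *B);
    double *C = calloc(N * N, sizeof *C);
    if (!A || !B || !C) return 1;
    for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }
    matmul_tiled(A, B, C);
    free(A); free(B); free(C);
    return 0;
}
```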

2. Minimizing Volume of Data Exchanged:

● Concept: Focuses on reducing the amount of data that needs to be exchanged between
processors during communication.
● Techniques:
○ Sending only relevant data: Identify and transmit only the specific data
elements required by a task, avoiding unnecessary data transfer.
○ Data compression: Compress data before transmission to reduce the
communication overhead, especially for large datasets.
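
A standard instance of "send only the relevant data" is halo (ghost-cell) exchange in a row-partitioned grid: each rank's neighbours need only its boundary rows, not its entire block. A minimal MPI sketch follows; the grid dimensions are placeholder values.

```c
#include <mpi.h>
#include <stdlib.h>

#define NX 1024            /* columns per row (placeholder) */
#define LOCAL_ROWS 256     /* interior rows owned by each rank (placeholder) */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Local block plus one halo row above and one below. */
    double *u = calloc((LOCAL_ROWS + 2) * NX, sizeof *u);
    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Exchange only the single boundary row each neighbour needs,
     * instead of shipping the whole LOCAL_ROWS x NX block. */
    MPI_Sendrecv(&u[1 * NX], NX, MPI_DOUBLE, up, 0,
                 &u[(LOCAL_ROWS + 1) * NX], NX, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[LOCAL_ROWS * NX], NX, MPI_DOUBLE, down, 1,
                 &u[0], NX, MPI_DOUBLE, up, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    free(u);
    MPI_Finalize();
    return 0;
}
```
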
3. Minimizing Frequency of Interactions:

● Concept: Aims to reduce the number of times processors need to communicate with
each other.
● Techniques:
○ Bulk data transfers: Aggregate multiple data elements into a single message for
transmission, reducing the overhead of individual message exchanges.
○ Asynchronous communication: Allow processors to continue working while
communication happens in the background, improving overall utilization.
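
The effect of bulk transfers is easiest to see in message-passing code: each message pays a fixed latency cost, so aggregating many small items into one message pays that cost once. The MPI sketch below (run with at least two ranks) contrasts the two patterns.

```c
#include <mpi.h>
#include <stdio.h>

#define N 1000

int main(int argc, char **argv) {
    int rank, size;
    double buf[N];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 0; }   /* needs two ranks */

    if (rank == 0) {
        for (int i = 0; i < N; i++) buf[i] = (double)i;
        /* Costly pattern: N separate messages, each paying the full latency.
         *   for (int i = 0; i < N; i++)
         *       MPI_Send(&buf[i], 1, MPI_DOUBLE, 1, i, MPI_COMM_WORLD);
         * Preferred: one bulk transfer of the whole buffer. */
        MPI_Send(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d values in a single message\n", N);
    }

    MPI_Finalize();
    return 0;
}
```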

4. Minimizing Contention and Hot Spots:

● Concept: Addresses situations where multiple processors compete for access to shared
resources (e.g., memory, communication channels). Contention creates bottlenecks and
slows down communication.
● Techniques:
○ Locking strategies: Use efficient locking mechanisms to prevent data
inconsistencies during concurrent access but minimize the time processors
spend waiting to acquire locks.
○ Load balancing: Ensure tasks are evenly distributed across processors to
prevent overloading specific resources and creating communication bottlenecks.
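
A simple way to reduce lock contention is to let each thread accumulate into private data and touch the shared structure only once at the end. The OpenMP histogram sketch below contrasts the contended and low-contention versions; the data and bin count are placeholders.

```c
#include <stdio.h>

#define N 1000000
#define BINS 16

int main(void) {
    static int data[N];
    int hist[BINS] = {0};
    for (int i = 0; i < N; i++) data[i] = i % BINS;

    /* Contended version (hot spot): every single update fights for one lock.
     *   #pragma omp parallel for
     *   for (int i = 0; i < N; i++) {
     *       #pragma omp critical
     *       hist[data[i]]++;
     *   }
     */

    /* Low-contention version: private histograms, merged once per thread. */
    #pragma omp parallel
    {
        int local[BINS] = {0};
        #pragma omp for nowait
        for (int i = 0; i < N; i++) local[data[i]]++;

        #pragma omp critical              /* entered once per thread, not per item */
        for (int b = 0; b < BINS; b++) hist[b] += local[b];
    }

    printf("hist[0] = %d\n", hist[0]);
    return 0;
}
```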

5. Overlapping Computations with Interactions:

● Concept: While communication occurs, try to keep processors busy with other
independent computations that don't require data from other processors.
● Techniques:
○ Pipelining: Break down tasks into smaller stages and execute them concurrently
on different processors, overlapping communication with computation steps.
○ Multithreading: Utilize multiple threads within a processor. While one thread
communicates, another can perform independent computations.
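
With MPI, this overlap is typically expressed with non-blocking calls: start the transfer, do independent work, and wait only when the incoming data is actually needed. The sketch below is a generic ring exchange; the buffer size and the "independent computation" are placeholders.

```c
#include <mpi.h>
#include <stdio.h>

#define N 100000

int main(int argc, char **argv) {
    int rank, size;
    static double send[N], recv[N];
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < N; i++) send[i] = rank + i * 1e-6;
    int right = (rank + 1) % size, left = (rank - 1 + size) % size;

    /* Start the exchange, but do not wait for it yet. */
    MPI_Irecv(recv, N, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Independent computation that never touches recv[] runs while the
     * messages are in flight. */
    double local = 0.0;
    for (int i = 0; i < N; i++) local += send[i] * send[i];

    /* Block only when the received data is actually required. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    double total = local;
    for (int i = 0; i < N; i++) total += recv[i];

    printf("rank %d: total = %f\n", rank, total);
    MPI_Finalize();
    return 0;
}
```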

6. Replicating Data or Computations (if applicable):

● Concept: In some cases, it might be beneficial to replicate frequently accessed data on
multiple processors, or to perform a specific computation on each processor instead of
relying on communication.
● Trade-off: This approach reduces communication overhead but increases memory
usage. It's suitable for situations where the benefits of reduced communication outweigh
the memory overhead.
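
A minimal example of the replication idea, assuming a read-only lookup table that every rank consults frequently: broadcast it once so that all later accesses are purely local.

```c
#include <mpi.h>
#include <stdio.h>

#define TABLE_SIZE 4096    /* placeholder size for a hypothetical lookup table */

int main(int argc, char **argv) {
    int rank;
    static double table[TABLE_SIZE];   /* replicated copy on every rank */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Rank 0 builds the table once... */
    if (rank == 0)
        for (int i = 0; i < TABLE_SIZE; i++) table[i] = i * 0.001;

    /* ...and replicates it with a single collective, trading memory for the
     * elimination of repeated remote requests later on. */
    MPI_Bcast(table, TABLE_SIZE, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    printf("rank %d: table[100] = %f\n", rank, table[100]);
    MPI_Finalize();
    return 0;
}
```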

Q.11) Explain the classification of dynamic mapping techniques.

1. Centralized vs. Distributed:


● Centralized Dynamic Mapping:
○ A central entity (e.g., master process) maintains a global view of tasks and
processors.
○ The central entity analyzes task requirements and processor availability to make
assignment decisions.
○ Examples: Task queue with central scheduler, work stealing coordinator.
● Distributed Dynamic Mapping:
○ Processors collaborate and exchange information to make mapping decisions
locally.
○ Reduces reliance on a central point of control, potentially improving scalability.
○ Examples: Gossip protocols, distributed work stealing.

2. Information Sharing Strategies:

● Static Information Based:
○ Relies on task characteristics pre-defined before execution (similar to static
mapping but with runtime adjustments).
○ Limited ability to adapt to dynamic changes during execution.
● Dynamic Information Based:
○ Leverages information gathered at runtime, such as processor load and
communication patterns.
○ Offers greater flexibility and adaptability to changing workloads.

3. Communication Patterns:

● Explicit Communication:
○ Processors explicitly exchange messages or signals to coordinate task
assignment.
○ Examples: Work stealing with message passing, centralized queue updates.
● Implicit Communication:
○ Processors infer task availability and workload based on actions or events
without explicit messages.
○ Can reduce communication overhead but might require more complex algorithms
for coordination.
○ Examples: Observing processor load through hardware counters or shared
memory access patterns.

4. Frequency of Re-mapping:

● Fine-grained Dynamic Mapping:
○ Tasks are frequently assigned and re-assigned based on dynamic information.
○ Suitable for highly dynamic workloads but can introduce significant
communication overhead.
● Coarse-grained Dynamic Mapping:
○ Tasks are assigned less frequently, focusing on stability and reducing re-mapping
overhead.
○ May not be as responsive to rapid workload changes.

Q.12) Explain various parallel algorithm models with suitable examples.

Here's a breakdown of some common parallel algorithm models with illustrative examples:

1. Data-Parallel Model:

● Concept: This model focuses on applying the same operation concurrently to different
data items. Tasks are typically independent, requiring minimal communication or
synchronization.
● Example: Consider performing a mathematical operation (e.g., addition, multiplication)
on all elements of a large array. Each processor can be assigned a portion of the array
and independently perform the operation on its assigned elements.
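
In code, the data-parallel model often reduces to a parallel loop in which every iteration applies the same operation to a different element. A minimal OpenMP sketch of the array example:

```c
#include <stdio.h>

#define N (1 << 20)

int main(void) {
    static double a[N], b[N];
    for (int i = 0; i < N; i++) a[i] = (double)i;

    /* Same operation on every element; each thread processes its own slice. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) b[i] = 2.0 * a[i] + 1.0;

    printf("b[N-1] = %f\n", b[N - 1]);
    return 0;
}
```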

2. Task-Parallel Model:

● Concept: This model breaks down a problem into smaller, independent tasks that can
be executed concurrently. Tasks might have different functionalities but don't require
frequent communication or data sharing.
● Example: Imagine processing a large collection of images and applying filters (e.g.,
resize, grayscale conversion) to each image. Each processor can be assigned a
separate image and independently apply the desired filter.

3. Work Pool Model:

● Concept: This model uses a central pool of tasks that workers (processors or threads)
can access and execute dynamically. Tasks are typically independent and don't require
specific ordering.
● Example: Consider processing a queue of customer orders in an e-commerce system.
Each order represents a task, and worker threads can pick up tasks from the central
queue, process them (e.g., validate payment, prepare shipment), and update the order
status.

4. Master-Slave Model:

● Concept: This model involves a master process that coordinates the execution of tasks
on slave processes (workers). The master distributes tasks, manages communication,
and collects results.
● Example: Imagine performing a scientific simulation that requires calculations across
different spatial or temporal segments. The master process can divide the simulation
domain into smaller sub-domains, distribute them as tasks to slave processes, and
collect the partial results to assemble the final simulation outcome.
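
A small MPI sketch of the master-slave structure, using a hypothetical task (summing a range of integers) in place of a real simulation: the master splits the range, hands one piece to each worker, and adds up the partial results. Run with at least two ranks.

```c
#include <mpi.h>
#include <stdio.h>

#define N 1000000   /* placeholder problem size */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {                          /* needs at least one worker */
        if (rank == 0) printf("run with at least 2 MPI ranks\n");
        MPI_Finalize();
        return 0;
    }

    if (rank == 0) {                         /* master: distribute, then collect */
        int workers = size - 1, chunk = N / workers;
        for (int w = 1; w <= workers; w++) {
            int range[2] = { (w - 1) * chunk,
                             (w == workers) ? N : w * chunk };
            MPI_Send(range, 2, MPI_INT, w, 0, MPI_COMM_WORLD);
        }
        double total = 0.0, partial;
        for (int w = 1; w <= workers; w++) {
            MPI_Recv(&partial, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            total += partial;
        }
        printf("sum of 0..%d = %.0f\n", N - 1, total);
    } else {                                  /* worker: compute assigned piece */
        int range[2];
        MPI_Recv(range, 2, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        double partial = 0.0;
        for (int i = range[0]; i < range[1]; i++) partial += i;
        MPI_Send(&partial, 1, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```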

5. Producer-Consumer Model (Pipeline Model):

● Concept: This model involves a producer that generates data, a consumer that
processes the data, and potentially intermediate stages (filters, transformers) that
perform specific operations on the data stream. Stages operate concurrently, with one
stage producing data for the next stage in the pipeline.
● Example: Consider processing a video stream. A producer can continuously read video
frames, a filter stage might convert them to a different format, and finally, a consumer
could display the processed frames on the screen. Each stage operates concurrently,
forming a processing pipeline.
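
A minimal pthreads sketch of the producer-consumer idea, with an integer standing in for a video frame and a small bounded buffer connecting the two stages. The queue size and frame count are arbitrary placeholders.

```c
#include <pthread.h>
#include <stdio.h>

#define QUEUE_SIZE 8
#define NFRAMES 32          /* hypothetical number of "video frames" */

/* Bounded buffer connecting the producer stage to the consumer stage. */
static int queue[QUEUE_SIZE];
static int head = 0, tail = 0, count = 0;
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg) {
    (void)arg;
    for (int frame = 0; frame < NFRAMES; frame++) {
        pthread_mutex_lock(&lock);
        while (count == QUEUE_SIZE)                 /* wait for space */
            pthread_cond_wait(&not_full, &lock);
        queue[tail] = frame;                        /* "read" the next frame */
        tail = (tail + 1) % QUEUE_SIZE;
        count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    for (int i = 0; i < NFRAMES; i++) {
        pthread_mutex_lock(&lock);
        while (count == 0)                          /* wait for data */
            pthread_cond_wait(&not_empty, &lock);
        int frame = queue[head];
        head = (head + 1) % QUEUE_SIZE;
        count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&lock);
        printf("processed frame %d\n", frame);      /* "display" the frame */
    }
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```

Both stages run concurrently: while the consumer processes one frame, the producer is already queuing the next, which is exactly the overlap the pipeline model aims for.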

6. Hybrid Models:

● Concept: Many real-world parallel algorithms combine aspects of different models. This
allows for efficient execution by leveraging the strengths of each model for specific
sub-problems within the larger application.
● Example: A scientific computing application might use a master-slave model to
distribute large calculations across processors, while each slave process utilizes a
data-parallel model to perform computations on smaller data chunks within its assigned
task.

Q.13) Draw the task-dependency graph for finding the minimum number in the sequence
{4, 9, 1, 7, 8, 11, 2, 12}, where each node in the tree represents the task of finding the
minimum of a pair of numbers. Compare this with the serial version of finding the minimum
number from an array.
Task-dependency graphs are a way to visualize the dependencies between tasks in a parallel
computation. In this case, we can represent finding the minimum number in a sequence as a
series of pairwise comparisons.

Here's the task-dependency graph for finding the minimum number in the sequence {4, 9, 1, 7,
8, 11, 2, 12}:

                  [4, 9, 1, 7, 8, 11, 2, 12]
                  /                        \
        [4, 9, 1, 7]                      [8, 11, 2, 12]
         /        \                        /           \
    [4, 9]      [1, 7]                [8, 11]        [2, 12]
    /    \      /    \                /     \        /     \
   4      9    1      7              8      11      2       12

In this graph, the leaves are the individual input numbers and each internal node represents
the task of finding the minimum of the two values produced by its children. The root task
therefore yields the minimum of the whole sequence. The tree has log2(8) = 3 levels of
comparison tasks, and the tasks within a level are independent of one another, so they can
execute concurrently.

Now, let's compare this with the serial version of finding the minimum number from an array. In
the serial version, we iterate through the array once, keeping track of the minimum number
encountered so far. Here's how it would look:

Original array: {4, 9, 1, 7, 8, 11, 2, 12}

Step 1: min = 4
Step 2: min = 4
Step 3: min = 1
Step 4: min = 1
Step 5: min = 1
Step 6: min = 1
Step 7: min = 1
Step 8: min = 1

Final minimum number: 1

In the serial version, we perform a single linear scan through the array, comparing each
element with the current minimum found so far. Each comparison depends on the running
minimum produced by the previous step, so the 7 comparisons must execute strictly one after
another.

Comparing the two approaches: both perform the same total number of comparisons (7 for 8
elements), but the parallel version arranges them in a tree of log2(8) = 3 levels whose
comparisons are mutually independent within each level, whereas the serial version has a
critical path of 7 dependent steps. With enough processors, the parallel version can therefore
finish in 3 comparison steps instead of 7.
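
A small C sketch makes the comparison concrete. tree_min() mirrors the task-dependency graph above; it is written sequentially here, but its two recursive calls are independent and could be spawned as parallel tasks, giving a critical path of 3 levels instead of 7 sequential comparisons.

```c
#include <stdio.h>

/* Serial version: one linear scan, 7 strictly sequential comparisons. */
static int serial_min(const int *a, int n) {
    int m = a[0];
    for (int i = 1; i < n; i++)
        if (a[i] < m) m = a[i];
    return m;
}

/* Tree version: the minimum of [lo, hi) is the minimum of its two halves.
 * The two recursive calls do not depend on each other, so they correspond to
 * tasks on the same level of the dependency graph and could run in parallel. */
static int tree_min(const int *a, int lo, int hi) {
    if (hi - lo == 1) return a[lo];
    int mid = lo + (hi - lo) / 2;
    int left  = tree_min(a, lo, mid);    /* independent subtask */
    int right = tree_min(a, mid, hi);    /* independent subtask */
    return (left < right) ? left : right;
}

int main(void) {
    int a[] = {4, 9, 1, 7, 8, 11, 2, 12};
    printf("serial min = %d, tree min = %d\n",
           serial_min(a, 8), tree_min(a, 0, 8));
    return 0;
}
```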

Q.14) Give the characteristics of GPUs and various applications of GPU processing.

Characteristics of GPUs:

● Highly Parallel Architecture: GPUs are designed for massive parallelism, containing
thousands of cores compared to a CPU's limited number of cores. This allows them to
efficiently handle tasks involving a large number of independent calculations.
● Focus on Memory Bandwidth: GPUs prioritize high memory bandwidth to move data
quickly between cores and memory. This is crucial for processing large datasets that
don't fit entirely in the processor cache.
● Specialized Instruction Sets: GPUs have instruction sets optimized for specific tasks
like graphics processing and manipulating large data vectors. While less versatile than
CPUs, they excel at these specialized operations.
● Limited Control Flow: GPUs are less efficient at handling complex branching and
control flow logic compared to CPUs. They are better suited for problems with
predictable execution patterns.
● Lower Clock Speeds: Individual GPU cores typically have lower clock speeds than
CPU cores. However, the sheer number of cores often compensates for this in terms of
overall processing power for suitable workloads.

Applications of GPU Processing:

● Graphics Processing: The primary application of GPUs is in rendering complex 3D
graphics, textures, and lighting effects in video games, movies, and other visual media
applications.
● Scientific Computing: GPUs excel at scientific simulations and calculations involving
large datasets, such as weather forecasting, molecular modeling, and protein folding
simulations.
● Machine Learning and Deep Learning: The highly parallel nature of GPUs makes
them ideal for training and running complex neural networks used in machine learning
and deep learning applications like image recognition, natural language processing, and
recommendation systems.
● Financial Modeling: Complex financial simulations and risk analysis can leverage GPU
processing power to analyze vast amounts of data and perform calculations faster.
● Video Processing: Encoding and decoding high-resolution videos, applying filters, and
real-time video editing can benefit significantly from GPU acceleration.
● Cryptography and Signal Processing: GPUs can accelerate encryption and
decryption algorithms as well as signal processing tasks involving large datasets.
● General-Purpose Parallel Computing: Beyond the specialized areas mentioned
above, researchers are exploring utilizing GPUs for various parallel computing tasks that
can be broken down into smaller, independent operations.
