
MODULE:1

Theory of Parallelism:
What is the theory of parallelism in computer architecture?

Parallel computing refers to the process of executing an application or computation on several processors simultaneously. Generally, it is a kind of computing architecture in which large problems are broken into independent, smaller, usually similar parts that can be processed in one go. It is done by multiple CPUs communicating via shared memory, which combines results upon completion. It helps in performing large computations by dividing the large problem between more than one processor.

Parallel computing also helps in faster application processing and task resolution by increasing the available computation power of systems. Most supercomputers operate on parallel computing principles. Parallel processing is commonly used in operational scenarios that need massive processing power or computation.

Typically, this infrastructure is housed in a server rack where various processors are installed; the application server distributes the computational requests into small chunks, and the requests are then processed simultaneously on each server. The earliest computer software was written for serial computation, able to execute only a single instruction at a time, whereas parallel computing executes an application or computation on several processors at the same time.

There are many reasons to use parallel computing, such as saving time and money, providing concurrency, and solving larger problems. Furthermore, parallel computing reduces complexity. A real-life example of parallel computing is two queues for buying tickets: if two cashiers are issuing tickets to two persons simultaneously, it saves time and reduces complexity.

Types of parallel computing

Across open-source and proprietary parallel computing vendors, there are generally three types of parallel computing available, which are discussed below:

1. Bit-level parallelism: The form of parallel computing in which every task depends on the processor word size. When performing a task on large-sized data, a larger word size reduces the number of instructions the processor must execute; otherwise the operation has to be split into a series of instructions. For example, with an 8-bit processor, an operation on 16-bit numbers must first operate on the 8 lower-order bits and then on the 8 higher-order bits, so two instructions are needed to execute the operation. A 16-bit processor can perform the operation with one instruction.
2. Instruction-level parallelism: In instruction-level parallelism, the processor decides how many instructions are executed in the same CPU clock cycle; without it, a processor can issue at most one instruction per clock cycle phase. The software approach to instruction-level parallelism relies on static parallelism, where the compiler decides which instructions to execute simultaneously.
3. Task Parallelism: Task parallelism is the form of parallelism in which tasks are decomposed into subtasks. Each subtask is then allocated for execution, and the subtasks are executed concurrently by the processors, as illustrated by the sketch below.
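
As a concrete illustration of task parallelism, here is a minimal Python sketch (the subtask body and the chunking scheme are hypothetical placeholders, not from the notes) that decomposes a job into subtasks and runs them concurrently on several processors:

```python
# Minimal task-parallelism sketch: independent subtasks execute concurrently.
from concurrent.futures import ProcessPoolExecutor

def subtask(chunk):
    # Placeholder work: each subtask processes its own slice of the data.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(100_000))
    chunks = [data[i::4] for i in range(4)]          # decompose into 4 subtasks
    with ProcessPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(subtask, chunks))   # subtasks run in parallel
    print(sum(partials))                             # combine the partial results
```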

Applications of Parallel Computing

There are various applications of Parallel Computing, which are as follows:

o One of the primary applications of parallel computing is databases and data mining.
o The real-time simulation of systems is another use of parallel computing.
o Technologies such as networked video and multimedia.
o Science and engineering.
o Collaborative work environments.
o The concept of parallel computing is used by augmented reality, advanced graphics, and virtual reality.

Advantages of Parallel computing

Parallel computing advantages are discussed below:

o In parallel computing, more resources are used to complete the task, which decreases the time and cuts possible costs. Also, cheap components can be used to construct parallel clusters.
o Compared with serial computing, parallel computing can solve larger problems in a short time.
o Parallel computing is much more appropriate than serial computing for simulating, modeling, and understanding complex, real-world phenomena.
o When local resources are finite, parallel computing can take advantage of non-local resources.
o There are multiple problems that are very large and may be impractical or impossible to solve on a single computer; the concept of parallel computing helps to remove these kinds of issues.
o One of the best advantages of parallel computing is that it allows you to do several things at a time by using multiple computing resources.
o Furthermore, parallel computing makes better use of the hardware, whereas serial computing wastes potential computing power.

Disadvantages of Parallel Computing

There are many limitations of parallel computing, which are as follows:

o Parallel architecture can be difficult to achieve.
o In the case of clusters, better cooling technologies are needed for parallel computing.
o It requires managed algorithms that can be handled by a parallel mechanism.
o Multi-core architectures have high power consumption.
o A parallel computing system needs low coupling and high cohesion, which is difficult to create.
o Code for a parallelism-based program can be written only by the most technically skilled and expert programmers.
o Although parallel computing helps to resolve computationally and data-intensive problems by using multiple processors, it sometimes affects the convergence of the system, and some control algorithms do not provide good outcomes when run in parallel.
o Due to synchronization, thread creation, data transfers, and more, the extra cost can sometimes be quite large and may even exceed the gains from parallelization.
o Moreover, to improve performance, a parallel computing system needs different code tweaking for different target architectures.

Fundamentals of Parallel Computer Architecture

Parallel computer architecture is classified on the basis of the level at which the
hardware supports parallelism. There are different classes of parallel computer
architectures, which are as follows:

Multi-core computing

A computer processor integrated circuit containing two or more distinct processing cores is known as a multi-core processor, which has the capability of executing program instructions simultaneously. Cores may implement architectures like VLIW, superscalar, multithreading, or vector, and are integrated on a single integrated circuit die or onto multiple dies in a single chip package. Multi-core architectures are classified as heterogeneous, consisting of cores that are not identical, or homogeneous, consisting of only identical cores.

Symmetric multiprocessing

In symmetric multiprocessing, a single operating system handles a multiprocessor computer architecture having two or more homogeneous, independent processors and treats all processors equally. Each processor can work on any task without worrying about whether the data for that task is available in memory, and the processors may be connected with the help of on-chip mesh networks. Also, each processor has its own private cache memory.

Distributed computing

The components of a distributed system are located on different networked computers. These networked computers coordinate their actions by communicating through HTTP, RPC-like message queues, and connectors. The concurrency of components and the independent failure of components are characteristics of distributed systems. Typically, distributed programming is classified as peer-to-peer, client-server, n-tier, or three-tier architecture. Sometimes, the terms parallel computing and distributed computing are used interchangeably, as there is much overlap between the two.

Massively parallel computing

In massively parallel computing, several computers are used simultaneously to execute a set of instructions in parallel. Grid computing is a related approach in which numerous distributed computer systems execute simultaneously and communicate over the Internet to solve a specific problem.

Why parallel computing?

There are various reasons why we need parallel computing, as discussed below:

o Parallel computing deals with larger problems. In the real world, many things happen at a certain time but in numerous places simultaneously, which is difficult to manage. In this case, parallel computing helps to manage this kind of extensively huge data.
o Parallel computing is the key to better data modeling and dynamic simulation, and to achieving them. Therefore, parallel computing is needed for the real world too.
o Unlike serial computing, parallel computing is well suited to implementing real-time systems; it also offers concurrency and saves time and money.
o Only the concept of parallel computing can organize large, complex datasets and their management.

o The parallel computing approach ensures that resources are used effectively and guarantees the effective use of hardware, whereas in serial computation only some parts of the hardware are used and the rest are rendered idle.

Future of Parallel Computing

From serial computing to parallel computing, the computational landscape has completely changed. Tech giants like Intel have already started to include multicore processors in their systems, which is a great step towards parallel computing. For a better future, parallel computation will bring a revolution in the way computers work. Parallel computing plays an important role in connecting the world more than ever before. Moreover, the parallel computing approach becomes more necessary with multi-processor computers, faster networks, and distributed systems.

Difference Between Serial Computing and Parallel Computing

Serial computing, also known as sequential computing, refers to the use of a single processor to execute a program; the program is divided into a sequence of instructions, and each instruction is processed one by one. Traditionally, software has been programmed sequentially, which offers a simpler approach, but the processor's speed significantly limits its ability to execute each series of instructions. Uni-processor machines use sequential data structures, whereas the data structures used in parallel computing environments are concurrent.

Compared to benchmarking in parallel computing, measuring performance in sequential programming is far less important and complex, because it mainly involves identifying bottlenecks in the system. Benchmarks in parallel computing can be achieved with the help of benchmarking and performance regression testing frameworks, which include measurement methodologies such as multiple repetitions and statistical treatment. The ability to avoid the bottleneck of moving data through the memory hierarchy is especially evident in parallel computing. Parallel computing comes at a greater cost and may be more complex, but it deals with larger problems and helps to solve problems faster.

History of Parallel Computing

Interest in parallel computing dates back to the late 1950s, with advances in supercomputers following throughout the '60s and '70s. In April 1958, Stanley Gill (Ferranti) discussed parallel programming and the need for branching and waiting. Also in 1958, IBM researchers Daniel Slotnick and John Cocke discussed the first use of parallelism in numerical calculations.

In 1962, a four-processor computer, the D825, was released by Burroughs Corporation. At the American Federation of Information Processing Societies conference held in 1967, a debate about the feasibility of parallel processing was published by Amdahl and Slotnick. Honeywell introduced its first Multics system, a symmetric multiprocessor system able to run up to eight processors in parallel, in 1969.
In the 1970s, the multi-processor project C.mmp at Carnegie Mellon University was among the first multiprocessors with more than a few processors. Later, a supercomputer built from 64 Intel 8086/8087 processors was launched for scientific applications, and a new type of parallel computing started. In 1984, the Synapse N+1, with snooping caches, was the first bus-connected multiprocessor. Slotnick had proposed building a large-scale parallel computer for the Lawrence Livermore National Laboratory in 1964; the ILLIAC IV, designed with US Air Force involvement, was the earliest SIMD parallel-computing effort.

The theory of parallelism deals with the principles, concepts, and models associated
with designing and analyzing systems that can perform multiple tasks or
computations simultaneously. Parallelism aims to harness the power of multiple
processing units or cores to solve complex problems more efficiently and quickly.
It's a fundamental concept in computer science and is essential for modern
computing systems.

Here are some key aspects of the theory of parallelism:

1. **Parallelism vs. Concurrency**:

- Parallelism involves performing multiple tasks at the same time, while concurrency deals with managing multiple tasks that may or may not execute simultaneously.

- Parallelism focuses on achieving speedup by distributing the workload across multiple processors, whereas concurrency addresses task management and interaction.

2. **Types of Parallelism**:

- **Task Parallelism**: Dividing a program into smaller tasks that can be executed
concurrently. Each task operates on a separate portion of data or performs a
distinct computation.

- **Data Parallelism**: Processing the same operation on multiple data elements simultaneously. This is often used in scientific and multimedia applications.
- **Pipeline Parallelism**: Breaking down tasks into stages and having different processors work on different stages simultaneously.

3. **Amdahl's Law**:

- Amdahl's Law quantifies the potential speedup achievable by parallelizing a computation. It states that the speedup is limited by the portion of the computation that cannot be parallelized (see the speedup sketch after this list).

4. **Gustafson's Law**:

- Gustafson's Law focuses on scaling problems to larger data sets or more processors. It suggests that if the problem size grows with the number of processors, the relative speedup can be maintained.

5. **Parallel Programming Models**:

- **Shared Memory**: Multiple processors access a common memory space and communicate through shared variables. Synchronization mechanisms are crucial to manage access to shared data.

- **Message Passing**: Processors communicate by sending and receiving messages. Each processor has its own private memory.

- **Dataflow**: Computation is driven by the availability of data. Processes execute when their input data is available.

6. **Parallel Algorithms**: Parallel algorithms are designed to take advantage of parallel computing resources. They often require careful consideration of data dependencies, load balancing, and synchronization.

7. **Granularity**:

- Granularity refers to the size of the tasks or units of work that are executed in
parallel. Fine-grained parallelism involves small tasks, while coarse-grained
parallelism involves larger tasks.

8. **Scalability**:
- Scalability measures how well a system can maintain performance as the
number of processors or cores increases. Linear scalability implies that adding more
resources results in proportional speedup.

9. **Load Balancing**:

- Load balancing ensures that the workload is distributed evenly among processors to maximize resource utilization and performance.

10. **Parallelism Challenges**:

- Managing synchronization and data dependencies.

- Overhead of communication and coordination.

- Debugging and testing complex parallel programs.

- Scalability and efficient utilization of resources.
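
Amdahl's Law and Gustafson's Law (points 3 and 4 above) reduce to simple formulas. The sketch below is a straightforward encoding of the standard formulas; the parallel fraction p = 0.95 is an illustrative assumption, not a value from the notes.

```python
# Amdahl's Law: speedup = 1 / ((1 - p) + p / n), where p is the parallelizable
# fraction of the work and n is the number of processors.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# Gustafson's Law: scaled speedup = n - (1 - p) * (n - 1), assuming the problem
# size grows with the number of processors.
def gustafson_speedup(p, n):
    return n - (1.0 - p) * (n - 1)

if __name__ == "__main__":
    p = 0.95                       # illustrative: 95% of the work parallelizes
    for n in (2, 8, 64, 1024):
        print(f"n={n:5d}  Amdahl={amdahl_speedup(p, n):7.2f}  "
              f"Gustafson={gustafson_speedup(p, n):8.2f}")
```

With p = 0.95, Amdahl's speedup can never exceed 1/(1 - p) = 20 no matter how many processors are added, which is exactly the limitation stated in point 3, while Gustafson's scaled speedup keeps growing with n.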

The theory of parallelism is vital in designing efficient computing systems, especially as technology advances lead to more processors and cores within a single system. It requires careful consideration of both hardware and software aspects to harness the benefits of parallel processing while managing the associated challenges.

Explain parallel computer models.


Parallel computer models are designed to enhance computational performance by
simultaneously executing multiple tasks or computations. They utilize multiple processing
units, such as processors or cores, to divide a problem into smaller subproblems and
process them simultaneously. This approach allows for faster execution and improved
efficiency compared to sequential computing models.
There are several parallel computer models commonly used in practice. Here are some of
the most notable ones:
1. Flynn's Taxonomy: Proposed by Michael Flynn in 1966, this taxonomy categorizes
parallel computer systems based on the number of instruction streams (I) and data
streams (D) they can process simultaneously. The four categories are:
- Single Instruction, Single Data (SISD): Traditional sequential computers.
- Single Instruction, Multiple Data (SIMD): A single instruction is applied to multiple data
elements in parallel. This model is often found in graphics processing units (GPUs) and
digital signal processors (DSPs).
- Multiple Instruction, Single Data (MISD): Multiple instructions operate on the same data
stream, which is a rare model with limited practical applications.
- Multiple Instruction, Multiple Data (MIMD): Multiple instructions process multiple
independent data streams simultaneously. This model is widely used in modern parallel
computing.
2. Shared Memory Model: In this model, multiple processors access a shared physical
memory. Each processor has its own local cache, and communication between processors
occurs through shared memory. This model simplifies programming but may suffer from
issues such as memory contention and synchronization overhead.
3. Message Passing Model: In this model, processors communicate by explicitly sending and receiving messages. Each processor has its own private memory, and data exchange occurs through message passing libraries or protocols. This model is often used in distributed computing systems and clusters (a small sketch contrasting this model with the shared memory model appears at the end of this answer).
4. Data Parallel Model: This model focuses on dividing data into smaller chunks and
processing them simultaneously using multiple processors. Each processor operates on a
subset of the data independently. Data parallelism is commonly used in applications such
as scientific simulations, image processing, and data analytics.
5. Task Parallel Model: This model divides a program into smaller tasks or threads that can
be executed independently. Each task is assigned to a processor, and parallelism is
achieved by executing multiple tasks simultaneously. Task parallelism is well-suited for
applications with different types of computations or with irregular workloads.
It's important to note that different parallel computer models have their own strengths
and limitations. The choice of a particular model depends on the nature of the problem,
the available resources, programming requirements, and the desired level of parallelism.
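
To make the contrast between the shared memory model (point 2) and the message passing model (point 3) concrete, here is a minimal Python sketch using only the standard library; the worker functions and the tiny workload are illustrative assumptions, not part of the notes.

```python
# Shared memory vs. message passing, sketched with Python's multiprocessing module.
from multiprocessing import Process, Value, Queue, Lock

def shared_memory_worker(counter, lock, n):
    # Shared memory model: every worker updates one shared variable;
    # a lock (synchronization mechanism) keeps the concurrent updates consistent.
    for _ in range(n):
        with lock:
            counter.value += 1

def message_passing_worker(inbox, outbox):
    # Message passing model: the worker owns its data and communicates
    # only by receiving and sending messages (queues stand in for the network).
    total = sum(inbox.get() for _ in range(3))
    outbox.put(total)

if __name__ == "__main__":
    # Shared memory: two processes increment a common counter.
    counter, lock = Value("i", 0), Lock()
    workers = [Process(target=shared_memory_worker, args=(counter, lock, 1000))
               for _ in range(2)]
    for w in workers: w.start()
    for w in workers: w.join()
    print("shared counter:", counter.value)          # expected: 2000

    # Message passing: the parent sends three values, the worker replies with their sum.
    inbox, outbox = Queue(), Queue()
    w = Process(target=message_passing_worker, args=(inbox, outbox))
    w.start()
    for x in (1, 2, 3):
        inbox.put(x)
    print("message-passing result:", outbox.get())   # expected: 6
    w.join()
```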

Multiprocessors and Multicomputers:

Multiprocessors and multicomputers are both types of parallel computing architectures that involve multiple processors or computing units working together to solve computational tasks. However, they have different characteristics and approaches to achieving parallelism.

**Multiprocessors**:
A multiprocessor system is a type of parallel computing architecture where multiple
processors or cores share a single memory system. These processors typically have
a high degree of integration and communication, allowing them to access the same
memory locations directly. Multiprocessor systems can be further categorized into
two main types:

1. **Shared-Memory Multiprocessors (SMP)**:

- In an SMP system, all processors share a common memory space, which is accessible by all cores. This allows for easy communication between processors through shared variables.

- SMP systems often require cache coherence protocols to ensure that the data in
different processor caches remains consistent.

2. **NUMA (Non-Uniform Memory Access) Multiprocessors**:

- NUMA is an extension of SMP where processors are divided into groups, and
each group has its own local memory. Processors can access local memory faster
than remote memory.

- NUMA architectures aim to reduce memory contention and improve memory access latency, especially in large multiprocessor systems.

**Multicomputers**:

A multicomputer system, also known as a cluster or distributed-memory system, involves multiple individual computers (nodes) connected together through a network. Each node typically has its own memory and processors, and they communicate by passing messages. Multicomputers allow for efficient parallel processing of tasks that can be divided into independent parts.

There are two main types of multicomputers:


1. **Homogeneous Multicomputers**:

- Homogeneous multicomputers consist of nodes that are identical in terms of hardware and architecture.

- Each node runs its own operating system, and communication between nodes is
achieved through message passing.

2. **Heterogeneous Multicomputers**:

- Heterogeneous multicomputers consist of nodes with varying hardware and architecture.

- These systems can leverage different node capabilities to optimize performance for specific tasks.

**Key Differences**:

- **Memory Architecture**:

- Multiprocessors share a common memory system, while multicomputers have distributed memory among nodes.

- **Communication**:

- In multiprocessors, communication between processors is often through shared memory, while multicomputers use message passing for communication.

- **Scalability**:

- Multiprocessors can face scalability challenges due to contention for shared memory, while multicomputers can scale more easily by adding more nodes.

- **Programming Model**:
- Multiprocessors may have a more straightforward programming model due to
shared memory, while programming multicomputers requires explicit message
passing.

- **Latency and Bandwidth**:

- Multiprocessors may have lower latency and higher bandwidth for interprocessor
communication due to shared memory, while multicomputers rely on network
communication.

Both multiprocessors and multicomputers have their advantages and trade-offs, and the choice between them depends on factors such as the nature of the computation, the communication patterns, and the desired level of scalability and performance.

1. Multiprocessor: A multiprocessor is a computer system with two or more central processing units (CPUs) sharing full access to a common RAM. The main objective of using a multiprocessor is to boost the system's execution speed, with other objectives being fault tolerance and application matching. There are two types of multiprocessors: shared memory multiprocessors and distributed memory multiprocessors. In a shared memory multiprocessor, all the CPUs share the common memory, but in a distributed memory multiprocessor, every CPU has its own private memory.

Applications of Multiprocessor –


1. As a uniprocessor, such as single instruction, single data stream (SISD).
2. As a multiprocessor, such as single instruction, multiple data stream (SIMD),
which is usually used for vector processing.
3. Multiple series of instructions in a single perspective, such as multiple
instruction, single data stream (MISD), which is used for describing hyper-
threading or pipelined processors.
4. Inside a single system for executing multiple, individual series of instructions in
multiple perspectives, such as multiple instruction, multiple data stream
(MIMD).
Benefits of using a Multiprocessor –
 Enhanced performance.
 Multiple applications.
 Multi-tasking inside an application.
 High throughput and responsiveness.
 Hardware sharing among CPUs.

Advantages:

Improved performance: Multiprocessor systems can execute tasks faster than single-
processor systems, as the workload can be distributed across multiple processors.
Better scalability: Multiprocessor systems can be scaled more easily than single-
processor systems, as additional processors can be added to the system to handle increased
workloads.
Increased reliability: Multiprocessor systems can continue to operate even if one
processor fails, as the remaining processors can continue to execute tasks.
Reduced cost: Multiprocessor systems can be more cost-effective than building multiple
single-processor systems to handle the same workload.
Enhanced parallelism: Multiprocessor systems allow for greater parallelism, as different
processors can execute different tasks simultaneously.

Disadvantages:

Increased complexity: Multiprocessor systems are more complex than single-processor systems, and they require additional hardware, software, and management resources.
Higher power consumption: Multiprocessor systems require more power to operate than
single-processor systems, which can increase the cost of operating and maintaining the
system.
Difficult programming: Developing software that can effectively utilize multiple
processors can be challenging, and it requires specialized programming skills.
Synchronization issues: Multiprocessor systems require synchronization between
processors to ensure that tasks are executed correctly and efficiently, which can add
complexity and overhead to the system.
Limited performance gains: Not all applications can benefit from multiprocessor
systems, and some applications may only see limited performance gains when running on
a multiprocessor system.
2. Multicomputer: A multicomputer system is a computer system with multiple processors that are connected together to solve a problem. Each processor has its own memory, accessible only by that particular processor, and the processors communicate with each other via an interconnection network.
Since a multicomputer is capable of message passing between the processors, it is possible to divide a task between the processors to complete it. Hence, a multicomputer can be used for distributed computing. It is more cost-effective and easier to build a multicomputer than a multiprocessor.
Difference between multiprocessor and Multicomputer:
1. A multiprocessor is a system with two or more central processing units (CPUs) that is capable of performing multiple tasks, whereas a multicomputer is a system with multiple processors that are attached via an interconnection network to perform a computation task.
2. A multiprocessor system is a single computer that operates with multiple CPUs, whereas a multicomputer system is a cluster of computers that operate as a single computer.
3. Construction of a multicomputer is easier and more cost-effective than that of a multiprocessor.
4. In a multiprocessor system, programming tends to be easier, whereas in a multicomputer system, programming tends to be more difficult.
5. A multiprocessor supports parallel computing, whereas a multicomputer supports distributed computing.
Explain the state of computing?
Ans. Modern computers are equipped with powerful hardware facilities driven by extensive software packages. To assess the state of computing, we first review historical milestones in the development of computers.

Computer Generations:
Over the past five decades, electronic computers have gone through five generations of development. Each of the first three generations lasted about 10 years. The fourth generation covered a time span of 15 years. We have just entered the fifth generation with the use of processors and memory devices with more than 1 million transistors on a single silicon chip. The table below shows the new hardware and software features introduced with each generation. Most features introduced in earlier generations have been passed on to later generations.
Five Generations of Electronic Computers

First (1945-54)
Technology & Architecture: vacuum tubes and relay memories; CPU driven by PC and accumulator; fixed-point arithmetic.
Software & Applications: machine/assembly languages; single user; no subroutine linkage; programmed I/O using CPU.
Representative Systems: ENIAC, Princeton IAS, IBM 701.

Second (1955-64)
Technology & Architecture: discrete transistors and core memories; floating-point arithmetic; I/O processors; multiplexed memory access.
Software & Applications: HLLs used with compilers; subroutine libraries; batch processing monitor.
Representative Systems: IBM 7090, CDC 1604, Univac LARC.

Third (1965-74)
Technology & Architecture: integrated circuits (SSI-MSI); microprogramming; pipelining; cache and lookahead processors.
Software & Applications: multiprogramming and time-sharing OS; multi-user applications.
Representative Systems: IBM 360/370, CDC 6600, TI-ASC, PDP-8.

Fourth (1975-90)
Technology & Architecture: LSI/VLSI and semiconductor memory; multiprocessors; vector supercomputers; multicomputers.
Software & Applications: multiprocessor OS; languages, compilers and environments for parallel processing.
Representative Systems: VAX 9000, Cray X-MP, IBM 3090, BBN TC 2000.

Fifth (1991-present)
Technology & Architecture: ULSI/VHSIC processors, memory and switches; high-density packaging; scalable architectures.
Software & Applications: massively parallel processing; grand challenge applications; heterogeneous processing.
Representative Systems: Fujitsu VPP 500, Cray/MPP, TMC/CM-5, Intel Paragon.

In other words, the latest generation of computers has inherited all the features found in previous generations.

SIMD Computers:
2) Discuss about SIMD Computers
Ans) SIMD (Single Instruction, Multiple Data) computers are a type of parallel computing
architecture designed to perform operations on multiple data elements simultaneously
using a single instruction. In SIMD architecture, a single instruction is applied to a group or
vector of data elements, allowing for parallel processing and efficient execution of
repetitive tasks. SIMD computers are commonly used in areas that require massive
parallelism and data-level parallel processing, such as multimedia processing, scientific
simulations, and signal processing.
Here are some key characteristics and features of SIMD computers:
1. Vector Processing: SIMD computers use vector processing, where a single instruction
operates on multiple data elements in parallel. The data elements, often referred to as
vectors, are typically fixed-size arrays or registers that hold multiple values of the same
data type. The SIMD instruction applies the same operation to each element of the vector
simultaneously.
2. Data-Level Parallelism: SIMD architecture exploits data-level parallelism by performing
the same operation on multiple data elements simultaneously. This approach is highly
efficient for tasks that involve a large amount of regular data processing, as it allows for
significant speedup by performing multiple operations in parallel.
3. Vector Registers: SIMD computers include special-purpose registers called vector
registers, which can hold multiple data elements simultaneously. These registers are wider
than traditional scalar registers to accommodate multiple data values, and the SIMD
instructions operate on these vector registers.
4. Single Instruction Stream: SIMD computers execute a single instruction stream across
multiple data elements. This means that all the data elements in a vector execute the same
instruction in a lockstep fashion. This characteristic is useful for tasks where the same
operation needs to be applied to a large amount of data simultaneously.
5. Instruction-Level Parallelism: SIMD computers achieve instruction-level parallelism by
executing multiple instructions simultaneously on different data elements. The processor
pipelines are designed to process multiple instructions in parallel, improving overall
throughput.
6. SIMD Extensions: Many modern processors incorporate SIMD extensions or instructions
sets, such as Intel's SSE (Streaming SIMD Extensions) or ARM's NEON, which provide
additional capabilities for SIMD operations. These extensions offer specialized instructions
and enhanced performance for multimedia processing, gaming, and other SIMD-intensive
applications.
7. Programming Model: Programming SIMD computers requires the use of specialized programming models or libraries that provide support for SIMD instructions. Examples include SIMD intrinsics, where low-level SIMD instructions are embedded within the source code, or higher-level SIMD libraries that provide APIs for SIMD operations.
The SIMD architecture offers significant performance benefits for tasks that can be parallelized at the data level. By leveraging parallelism and executing the same instruction on multiple data elements simultaneously, SIMD computers can achieve high computational throughput and accelerate the execution of tasks that involve massive amounts of regular data processing (a small vectorization sketch follows).
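
As a concrete analogy for the data-level parallelism described above, the sketch below contrasts an element-by-element scalar loop with a single vectorized operation. It uses NumPy, which is an assumption (the notes do not name a library); NumPy's array operations are typically mapped to SIMD units such as SSE/AVX or NEON on most platforms.

```python
# Data-level parallelism analogy: one vectorized "instruction" is applied to
# every element of the arrays, instead of looping element by element.
import numpy as np

a = np.arange(1_000_000, dtype=np.float32)
b = np.arange(1_000_000, dtype=np.float32)

# Scalar view: the same operation is issued once per element (slow in Python).
scalar_result = [a[i] + b[i] for i in range(len(a))]

# Vector view: a single array operation over all elements; NumPy's backend
# typically uses SIMD instructions and processes several elements per cycle.
vector_result = a + b

print(scalar_result[:3], vector_result[:3])
```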

A multi-vector computer is a type of parallel computing architecture that focuses on enhancing performance by processing multiple data elements (vectors) in parallel using specialized vector instructions. It is designed to accelerate computations for tasks that involve repetitive operations on large amounts of data, such as scientific simulations and other data-intensive applications. Multi-vector computers are an example of exploiting data-level parallelism.

Key features of multi-vector computers include:


1. **Vector Registers**: Multi-vector computers include specialized vector registers that
can store multiple data elements (vectors) simultaneously. Each vector register can hold
multiple values of the same data type.
2. **Vector Instructions**: These computers are equipped with vector instructions that
operate on entire vectors of data in a single operation. These instructions perform
operations such as addition, subtraction, multiplication, and division on all corresponding
elements of two vectors simultaneously.
3. **Vector Pipelines**: Multi-vector processors often have vector pipelines, which allow
for efficient processing of vector instructions. Each stage of the pipeline performs a specific
operation on the vector elements.
4. **Parallel Processing**: Multi-vector architectures exploit parallel processing by
applying the same instruction to multiple data elements in parallel, thereby achieving high
throughput.
5. **Software and Programming**: Programming for multi-vector computers involves
utilizing vector instructions through programming languages, libraries, or compilers that
support vectorization. Efficient utilization of vector instructions often requires careful
coding and data alignment.
6. **Performance Benefits**: Multi-vector computers excel at tasks that require repeated
application of the same operations to large sets of data, as they reduce the overhead of
instruction fetching and decoding.
7. **Applications**: Multi-vector computers are well-suited for scientific computing,
simulations, image processing, signal processing, numerical analysis, and other data-
intensive tasks.
It's worth noting that the term "multi-vector" may be less commonly used in modern
computing discussions, as vector processing concepts have often been integrated into
wider parallel computing architectures, such as SIMD (Single Instruction, Multiple Data)
extensions in modern CPUs and GPUs. These extensions enable processors to perform
vectorized operations on multiple data elements.

Overall, multi-vector computers represent an early example of harnessing data-level parallelism for performance improvement, and their principles continue to influence modern parallel computing architectures.
Vector Computer:
 In a vector computer, a vector processor is attached to the scalar processor as an optional feature.
 The host computer first loads the program and data into the main memory.
 Then the scalar control unit decodes all the instructions.
 If the decoded instructions are scalar operations or program operations, the scalar processor executes those operations using scalar functional pipelines.
 On the other hand, if the decoded instructions are vector operations, then the instructions are sent to the vector control unit.

SIMD Computer:
In SIMD computers, ‘N’ number of processors are connected to a control unit and
all the processors have their individual memory units. All the processors are
connected by an interconnection network.

Q.1. What are the Architectural development tracks?


Ans. The architectures of today's systems pursue particular development tracks. There are mainly 3 tracks. These tracks are distinguished by similarity in computational model and technological basis.
1. Multiple Processor track: a multiple processor system can be shared memory or distributed memory.

(a) Shared Memory track:

[Figure: Shared memory track, from CMU C.mmp to the Illinois Cedar and NYU Ultracomputer projects, and on to Stanford Dash, KSR-1, Fujitsu VPP 500, IBM RP3, and BBN Butterfly]

Fig. Shared Memory track


It shows the track of multiprocessor development employing a single address space in the entire system. C.mmp was a UMA multiprocessor. The C.mmp project pioneered shared memory multiprocessor development, not only in the crossbar architecture but also in multiprocessor operating system development.
The Illinois Cedar project and the NYU Ultracomputer project were both developed with a single address space; both use multistage networks as the system interconnect.
Stanford Dash is a NUMA multiprocessor with distributed memory forming a global address space; cache coherence is maintained with distributed directories.
KSR-1 is a COMA model.
Fujitsu VPP 500 is a processor system with a crossbar interconnect; shared memories are distributed to all processor nodes.

(b) Message Passing track:

[Figure: Message passing track, from the Cosmic Cube to nCUBE-2/6400 and to the Intel iPSC series and Intel Paragon, and from Mosaic to the MIT J-Machine]

(2) Multivector & SIMD tracks

(a) Multivector track

[Figure: Multivector track, from the CDC 7600 to the CDC Cyber 205 and ETA 10 on one subtrack, and to the Cray 1, Cray Y-MP, and Cray/MPP (with Fujitsu, NEC, and Hitachi models) on the other]

The CDC 7600 was the first vector dual-processor system. There are 2 subtracks derived from the CDC 7600. The latest Cray/MPP is a massively parallel system with distributed shared memory.

(b) SIMD track

[Figure: SIMD track, including the Illiac IV, Goodyear MPP, DAP 610, BSP, IBM GF/11, MasPar MP1, and CM5]

3. Multithreaded and Dataflow tracks:

In a multithreaded system, each processor can execute multiple contexts at the same time, so multithreading means there are multiple threads of control in each processor. Multithreading hides long latencies in constructing large-scale multiprocessors. This track has mostly been tried out in laboratories.

Multithreaded track
[Figure: Multithreaded track, including the CDC 6600, HEP, Tera, and MIT/Alewife]

Dataflow track
[Figure: Dataflow track, including static dataflow, the MIT tagged-token machine, and Manchester]

Q.2. What is Parallelism? What are the various conditions of parallelism?

Ans. Parallelism is a major concept used in today's computers; the use of multiple functional units is a form of parallelism within the CPU. In early computers there was only one arithmetic/functional unit, so only one operation could execute at a time. The ALU function can instead be distributed to multiple functional units operating in parallel.
H.T. Kung has recognized that there is a need to move in three areas, namely the computation model for parallel computing, interprocess communication in parallel architectures, and system integration for incorporating parallel systems into a general computing environment.

Conditions of Parallelism:
1. Data and resource dependencies: A program is made up of several parts, so the ability to execute several program segments in parallel requires that each segment be independent of the other segments. Dependencies among the various segments of a program may take various forms, such as resource dependency, control dependency, and data dependency. A dependence graph is used to describe the relations: program statements are represented by nodes, and directed edges with different labels show the ordered relations among the statements. By analyzing the dependence graph, it can be shown where opportunities exist for parallelization and vectorization.
Data Dependencies: The relations between statements are shown by data dependences. There are 5 types of data dependencies, given below:
(a) Antidependence: A statement ST2 is antidependent on statement ST1 if ST2 follows ST1 in program order and if the output of ST2 overlaps the input of ST1.
(b) Input dependence: Read and write are input statements; input dependence occurs not because the same variables are involved but because the same file is referenced by both input statements.
(c) Unknown dependence: The dependence relation between two statements cannot be determined in the following situations: the subscript of a variable is itself subscripted; the subscript does not contain the loop index variable; or the subscript is nonlinear in the loop index variable.
(d) Output dependence: Two statements are output dependent if they produce the same output variable.
(e) Flow dependence: A statement ST2 is flow dependent on a statement ST1 if an execution path exists from ST1 to ST2 and at least one output of ST1 feeds in as an input to ST2.
2. Bernstein's conditions: Bernstein discovered a set of conditions under which two processes can execute in parallel. A process is a program that is in execution; it is an active entity. Actually, it is a block of a program fragment defined at various processing levels. Ii is the input set of process Pi, which is the set of all input variables needed to execute the process; similarly, the output set Oi consists of all output variables generated after execution of process Pi. Input variables are the operands fetched from memory or registers; output variables are the results to be stored in working registers or memory locations.

Let there be 2 processes P1 and P2 with input sets I1 and I2 and output sets O1 and O2.

The two processes P1 and P2 can execute in parallel, denoted P1 || P2, if and only if they are independent and do not create confusing results, that is, if I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, and O1 ∩ O2 = ∅ (see the sketch at the end of this answer).
3. Software Parallelism: Software parallelism is defined by the control and data dependencies of programs. The degree of parallelism is revealed in the program profile or in the program flow graph. Software parallelism is a function of the algorithm, programming style, and compiler optimization. The program flow graph shows the patterns of simultaneously executable operations. Parallelism in a program varies during the execution period.
4. Hardware Parallelism: Hardware parallelism is defined by hardware multiplicity and machine hardware. It is a function of cost and performance trade-offs. It presents the resource utilization patterns of simultaneously executable operations, and it also indicates the performance of the processor resources.
One method of identifying parallelism in hardware is by means of the number of instructions issued per machine cycle.
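
A minimal sketch of Bernstein's conditions follows; the variable names and example statements are hypothetical. Two processes can run in parallel only if neither reads what the other writes and they do not write the same variables.

```python
# Bernstein's conditions: P1 || P2 (they can execute in parallel) iff
#   I1 ∩ O2 = ∅,  I2 ∩ O1 = ∅,  O1 ∩ O2 = ∅,
# where Ii is the input (read) set and Oi the output (write) set of process Pi.
def bernstein_parallel(i1, o1, i2, o2):
    return not (i1 & o2) and not (i2 & o1) and not (o1 & o2)

# Hypothetical example statements:
#   P1: c = a + b      -> reads {a, b}, writes {c}
#   P2: d = a * 2      -> reads {a},    writes {d}
#   P3: a = c + 1      -> reads {c},    writes {a}   (flow-dependent on P1)
print(bernstein_parallel({"a", "b"}, {"c"}, {"a"}, {"d"}))   # True:  P1 || P2
print(bernstein_parallel({"a", "b"}, {"c"}, {"c"}, {"a"}))   # False: P1 and P3 conflict
```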

Q.3. What are the different levels of parallelism?

Ans. The levels of parallelism are described below:
1. Instruction Level: At the instruction level, a grain consists of less than 20 instructions, called fine grain. Fine-grain parallelism at this level may range from two to thousands depending on the individual program; single-instruction-stream parallelism is greater than two, but the average parallelism at the instruction level is around five, rarely exceeding seven, in an ordinary program. For scientific applications, the average parallelism is in the range of 500 to 3000 Fortran statements executing concurrently in an idealized environment.
2. Loop Level: This embraces iterative loop operations. A loop may contain less than 500 instructions. Some loop-independent operations can be vectorized for pipelined execution or for lock-step execution on SIMD machines. Loop-level parallelism is the most optimized program construct to execute on a parallel or vector computer. However, recursive loops are difficult to parallelize. Vector processing is mostly exploited at the loop level by vectorizing compilers.
3. Procedural Level: This corresponds to medium grain size at the task, procedure, and subroutine levels. A grain at this level has less than 2000 instructions. Detection of parallelism at this level is much more difficult than at a finer-grain level. The communication requirement is much less compared with that of the MIMD execution mode, but major efforts are required by the programmer to reorganize a program at this level.
4. Subprogram Level: The subprogram level corresponds to job steps and related subprograms. A grain at this level has less than 1000 instructions. Job steps can overlap across diverse jobs. Multiprogramming on a uniprocessor or multiprocessor is conducted at this level.
5. Job Level: This corresponds to the parallel execution of independent tasks on a parallel computer. The grain size here can be tens of thousands of instructions. It is handled by the program loader and by the operating system. Time-sharing and space-sharing multiprocessors explore this level of parallelism.

Q.1. What are program flow mechanisms?

Ans. Traditional computers are founded on the control flow mechanism, by which the order of program execution is explicitly stated in the user program. Data flow computers have a high degree of parallelism at the fine-grain instruction level. Reduction computers are based on a demand-driven method, which commences an operation based on the demand for its result by other computations.
Data flow and control flow computers: There are mainly two sorts of computers. Control flow computers are conventional computers based on the von Neumann machine; they carry out instructions under program flow control, whereas data flow computers execute instructions based on the availability of data.
Control Flow Computers: Control flow computers employ shared memory to hold program instructions and data objects. Variables in shared memory are updated by many instructions. The execution of one instruction may produce side effects on other instructions, since memory is shared. In many cases, the side effects prevent parallel processing from taking place. In fact, a uniprocessor computer is inherently sequential due to the use of the control-driven mechanism.
Data Flow Computers: In a data flow computer, the running of an instruction is determined by data availability instead of being guided by a program counter. In theory, any instruction should be ready for execution whenever its operands become available. The instructions in a data-driven program are not ordered in any way. Instead of being stored in shared memory, data are directly held inside instructions. Computational results are passed directly between instructions. The data generated by an instruction will be duplicated into many copies and forwarded directly to all needy instructions.
This data-driven scheme requires no shared memory, no program counter, and no control sequencer. However, it requires special mechanisms to detect data availability, to match data tokens with needy instructions, and to enable the chain reaction of asynchronous instruction execution (a toy sketch of this data-driven firing rule follows).
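
A toy sketch of the data-driven firing rule just described; the three-instruction program and operand names are made up for illustration. An instruction executes as soon as all of its operand tokens are available, with no program counter imposing an order.

```python
# Toy data-driven execution: an instruction "fires" when all its operands are available.
program = {
    # name: (operation, names of the operands it consumes)
    "t1": (lambda x, y: x + y, ("a", "b")),
    "t2": (lambda x, y: x * y, ("t1", "c")),   # needs t1's result before it can fire
    "t3": (lambda x, y: x - y, ("a", "c")),    # independent of t1/t2, could fire in parallel
}
tokens = {"a": 2, "b": 3, "c": 4}              # initially available data tokens

fired = set()
while len(fired) < len(program):
    for name, (op, deps) in program.items():
        # Fire every not-yet-executed instruction whose operand tokens have all arrived.
        if name not in fired and all(d in tokens for d in deps):
            tokens[name] = op(*(tokens[d] for d in deps))
            fired.add(name)

print(tokens["t2"], tokens["t3"])              # (2 + 3) * 4 = 20 and 2 - 4 = -2
```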

Q.2. Explain data flow architecture?


Ans. There are more than a few new data flow computer projects.
Arvind and his associates at MIT have developed a tagged token
architecture for constructing data flow computers.
[Figure: The global architecture, showing processing elements PE1, PE2, ..., PEn interconnected by an n x n routing network, with a global path]

The global architecture incorporates n processing elements (PEs) interconnected by an n x n routing network. The complete system supports pipelined data flow operations in all n PEs. Inter-PE communications are done through the pipelined routing network.
Within each PE, the machine offers a low-level token-matching mechanism, which dispatches only those instructions whose input data are already available. Each datum is tagged with the address of the instruction to which it belongs and the context in which the instruction is being executed. Instructions are stored in program memory. Tagged tokens enter the PE through a local path. The tokens can also be passed to other PEs through the routing network. All internal circulation operations are pipelined without blocking.
Explain about System Interconnect Architecture with diagram?
System interconnect architecture refers to the design and organization of the communication pathways within a computer system or a complex electronic system. It involves the arrangement and connectivity of various components, such as processors, memory modules, peripherals, and input/output devices, to facilitate data transfer and communication between them.
While system interconnect architectures can vary depending on the specific system requirements and design choices, a commonly used architecture is the bus-based architecture, explained here along with a diagram to illustrate its components.
In a bus-based architecture, a shared communication bus serves as the primary interconnect between the different components of the system. The bus is a collection of wires or conductive paths that carry data, control signals, and address information. It acts as a communication medium through which components can send and receive information.
Here is a simplified diagram of a bus-based system interconnect architecture:

```
+-------------------------+
| Processor 1 |
+-------------------------+
|
| Data Bus
|
+-------------------------+
| Memory |
+-------------------------+
|
| Address Bus
|
+-------------------------+
| I/O Devices |
+-------------------------+
|
| Control Bus
|
+-------------------------+
| Processor 2 |
+-------------------------+
```

In this diagram, we have two processors (Processor 1 and Processor 2) connected to a shared bus, along with memory and I/O devices.
The data bus is responsible for transferring data between the different components. It
carries the actual data being exchanged between processors, memory, and I/O devices.
The address bus is used to specify the memory location or I/O device that the processors
are accessing. It carries the address information for read and write operations.
The control bus carries various control signals that coordinate the activities of the
components in the system. These control signals can include signals for read and write
operations, interrupt requests, and synchronization signals.
Each component in the system is connected to the appropriate bus lines to enable data,
address, and control signal transfer. The bus arbitration logic ensures that only one
component can use the bus at a time to avoid conflicts and maintain proper data integrity.
It's important to note that this is a simplified representation, and real-world system
interconnect architectures can be much more complex, incorporating multiple buses,
additional levels of hierarchy, and specialized protocols for high-speed communication.
Different system interconnect architectures, such as point-to-point interconnects or
network-on-chip (NoC) architectures, may also be used in specific applications or advanced
systems to address scalability, performance, and power efficiency requirements.
Overall, the system interconnect architecture plays a critical role in facilitating
communication and data transfer within a computer system, ensuring efficient and reliable
operation of the system's components.
https://www.slideshare.net/pankajjain12382923/system-interconnect-architectures-in-aca

[Figure: Interior design of a processing element, showing tokens arriving from the routing network or a local path, a token match unit backed by program memory, a compute (ALU and tag) stage, and a form-token stage]


You can think of the instruction address in a dataflow computer as replacing the program counter, and the context identifier as replacing the frame base register in a control flow computer. It is the machine's job to match up data having the same tag with needy instructions. In so doing, new data will be produced with a new tag indicating the successor instructions. Thus, each instruction represents a synchronization operation. New tokens are formed and circulated along the PE pipeline for reuse or passed to other PEs through the global path, which is also pipelined.

Q.3. Explain Grain Sizes and Latency.

Ans. Grain size, or granularity, is the amount of computation and manipulation involved in a software process. The simplest way to measure it is to count the number of instructions in a given program segment. Grain size decides the basic program segment chosen for parallel processing. Grain sizes are usually described as fine, medium, or coarse, depending on the processing levels involved.
Latency is a time measure of the communication overhead incurred between machine subsystems; for example, memory latency is the time required by a processor to access the memory. The time required for two processes to synchronize with each other is called synchronization latency. Computational granularity and communication latency are closely related.

Q.4. How can we partition a program into parallel branches, program modules, microtasks or grains to yield the shortest possible execution time?

Ans. There exists a tradeoff between parallelism and scheduling overheads. The time complexity involves both computation and communication overheads. Program partitioning involves the algorithm designer, programmer, compiler, operating system support, etc.
The concept of grain packing is to apply fine grain first in order to achieve a higher degree of parallelism, and then to combine multiple fine-grain nodes into a coarse-grain node if doing so can remove redundant communication delays or lessen the overall scheduling overhead.
Usually, all fine-grain operations within a single coarse-grain node are assigned to the same processor for execution. Fine-grain partitioning of a program often demands more interprocessor communication than that required in a coarse-grain partition; thus grain packing offers a tradeoff between parallelism and scheduling overhead. Internal delays among fine-grain operations within the same coarse-grain node are negligible, because the communication delay is contributed chiefly by interprocessor delays rather than by delays within the same processor. The selection of the optimal grain size is meant to achieve the shortest schedule for the nodes on a parallel system.
Explain about Program partitioning and Scheduling?
Program partitioning and scheduling are two important concepts in the field of computer science and software
engineering, particularly in the domain of real-time systems. They are used to optimize the execution of programs and
ensure that tasks are completed within their specified deadlines.
Program partitioning involves dividing a software application or system into smaller tasks or modules, also known as
partitions or components. This partitioning is typically based on the functionality or criticality of the tasks. Each partition
represents a distinct unit of work that can be executed independently.
The main goal of program partitioning is to allocate the system resources efficiently and effectively. By breaking down a
program into smaller partitions, it becomes easier to analyze and optimize the execution of each partition separately. This
allows for better resource utilization, improved performance, and easier maintenance.
Once the program is partitioned, the next step is scheduling. Scheduling is the process of determining the order and timing
of executing the individual partitions or tasks. The scheduler decides which task should be executed, when it should be
executed, and on which processing resource it should be executed.
In real-time systems, scheduling is particularly crucial because tasks often have strict timing requirements, known as
deadlines. There are various scheduling algorithms and techniques available to ensure that these deadlines are met. Some
commonly used scheduling algorithms include:

1. Rate Monotonic Scheduling (RMS): This is a priority-based scheduling algorithm where tasks with shorter periods
(i.e., higher rates) are assigned higher priorities. The task with the highest priority is executed first.
2. Earliest Deadline First (EDF): This algorithm assigns priorities based on the deadlines of the tasks. The task with the earliest deadline is given the highest priority (see the sketch at the end of this answer).

3. Fixed-Priority Scheduling: In this approach, each task is assigned a fixed priority, and the scheduler follows a pre-defined priority order to determine the execution order.

The choice of scheduling algorithm depends on the specific requirements of the system, such as the nature of tasks, their deadlines, and the available resources. The objective is to schedule the tasks in a way that minimizes the chances of missing deadlines and maximizes system efficiency.
Overall, program partitioning and scheduling play vital roles in optimizing the execution of real-time systems. They help
ensure that tasks are completed within their deadlines, resources are utilized effectively, and system performance is
optimized.

MODULE:2
Describe Principles of Scalable Performance
Scalable performance refers to the ability of a system to handle increasing workloads and deliver higher levels of
performance as the demand grows. Achieving scalable performance is crucial in today's computing systems, as data
volumes and processing requirements continue to expand. There are several principles that guide the design and
implementation of scalable performance. Let's discuss them in detail:
1. Load Balancing: Load balancing is a fundamental principle for achieving scalable performance. It involves distributing the
workload evenly across multiple resources (e.g., processors, servers) to ensure that no single resource is overwhelmed
while others remain underutilized. By balancing the load effectively, the system can utilize its resources efficiently and
avoid bottlenecks, thereby maximizing performance.
2. Parallelism: Parallelism is a key principle in achieving scalable performance. It involves dividing a task into smaller sub-
tasks that can be executed simultaneously or in parallel. By leveraging multiple processors, cores, or threads, parallel
processing allows for increased throughput and reduced execution time. This principle is particularly effective for
computationally intensive tasks and data processing operations.
3. Modularity: Modularity refers to the design principle of breaking down a system into smaller, independent modules or
components. Each module can be developed, scaled, and maintained separately, allowing for easier scalability. With a
modular architecture, it becomes possible to add or remove resources or modules as needed without impacting the
overall system performance.
4. Distributed Computing: Distributed computing is a principle that involves the use of multiple interconnected systems or
nodes to execute tasks or process data. By distributing the workload across multiple machines, distributed computing
enables horizontal scalability, where the system can handle increasing demands by adding more machines to the network.
It allows for higher performance and fault tolerance through redundancy and load sharing.
5. Caching and Memoization: Caching and memoization are techniques used to reduce the computational overhead and
improve performance. Caching involves storing frequently accessed data or results in a cache memory for quick retrieval,
reducing the need for repeated computations. Memoization is a similar concept where the results of expensive function
calls are cached for subsequent use. Both techniques improve performance by avoiding redundant computations.
6. Asynchronous Processing: Asynchronous processing is a principle that allows tasks to be executed independently and
concurrently, without requiring strict synchronization. By decoupling tasks and allowing them to progress independently,
asynchronous processing reduces idle time and improves overall system throughput. It is particularly useful in scenarios
where there are dependencies between tasks, but their execution order can be flexible.
7. Scalable Data Structures and Algorithms: To achieve scalable performance, it is essential to employ data structures and
algorithms that can handle growing data volumes efficiently. Scalable data structures, such as hash tables, B-trees, or
distributed databases, allow for efficient storage and retrieval of data even as the dataset size increases. Scalable
algorithms, optimized for parallel processing or distributed computing, ensure that computational tasks can scale with the
growing workload.
By following these principles, system designers and developers can build scalable performance into their applications and
infrastructure. Scalability is a critical aspect of modern computing systems, enabling them to handle increasing demands
and deliver high-performance experiences to users.
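As a rough illustration of the parallelism (principle 2) and caching/memoization (principle 5) ideas above, here is a minimal Python sketch; the worker-pool size and the stand-in workload functions are arbitrary choices for demonstration.

```python
from functools import lru_cache
from multiprocessing import Pool

@lru_cache(maxsize=None)          # memoization: cache results of repeated calls
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def heavy_task(x):
    """Stand-in for an independent, CPU-bound sub-task."""
    return sum(i * i for i in range(x))

if __name__ == "__main__":
    # Parallelism: split the work into independent sub-tasks and
    # run them on several worker processes at once.
    with Pool(processes=4) as pool:
        results = pool.map(heavy_task, [10_000, 20_000, 30_000, 40_000])
    print(results[:2], fib(200))  # fib(200) is cheap thanks to memoization
```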

Illustrate short notes on the following:
a) Performance metrics b) Performance measures
a) Performance Metrics:
Performance metrics are measurements used to evaluate and quantify the performance of a system or a component. They
provide objective indicators of how well a system is performing in terms of speed, efficiency, throughput, response time,
and other relevant characteristics. Here are some commonly used performance metrics:
1. Response Time: Response time measures the time taken for a system or component to respond to a given request or
input. It is a crucial performance metric for applications where quick response is essential, such as real-time systems or
interactive applications.

2. Throughput: Throughput refers to the rate at which a system or component can process or handle a certain number of
tasks or transactions per unit of time. It measures the system's capacity or processing power and is often expressed in
terms of tasks per second or transactions per second.
3. Latency: Latency measures the time delay between initiating a request or action and receiving the corresponding
response. It is often used in networking and communication systems to assess the time taken for data to travel from the
source to the destination.
4. Scalability: Scalability refers to the ability of a system or component to handle increasing workloads or user demands
without a significant decrease in performance. Scalability metrics evaluate how well a system can scale up or out by adding
more resources, such as processors, servers, or storage, to accommodate growing requirements.
5. Efficiency: Efficiency measures the ratio of useful work performed by a system or component to the resources
consumed. It assesses how effectively the system utilizes its resources to achieve its intended tasks or goals. Efficiency
metrics can include CPU utilization, memory utilization, energy efficiency, or algorithmic efficiency.
6. Reliability: Reliability measures the ability of a system to perform consistently and predictably over a certain period of
time. It quantifies the system's availability, uptime, and the probability of failure or errors occurring. Reliability metrics are
crucial for critical systems where downtime or failures can have significant consequences.
7. Error Rate: Error rate measures the occurrence of errors or failures during system operation. It can include metrics such
as the number of errors, fault rates, or failure rates. Minimizing the error rate is important to ensure system stability and
data integrity.
8. Quality of Service (QoS): QoS metrics evaluate the performance of a system in delivering a certain level of service or
meeting specific user requirements. QoS metrics can include metrics related to network performance, response time, data
accuracy, availability, and other service-level indicators.

Performance metrics are essential for assessing the effectiveness, efficiency, and overall performance of systems and
components. They provide valuable insights for optimization, capacity planning, and decision-making to ensure optimal
system performance and user satisfaction.
b) Performance Measures:
Performance measures are specific quantifiable values or indicators used to assess and evaluate the performance of a
system, process, or component. These measures help in understanding how well a system meets its objectives, performs
its functions, and delivers expected outcomes. Here are some common performance measures:
1. Accuracy: Accuracy measures how closely the output or results of a system or process match the expected or desired
values. It assesses the correctness and precision of the system's output and is particularly important in data analysis,
modeling, and prediction systems.
2. Speed: Speed measures the rate at which a system or process can complete a given task or operation. It evaluates the
efficiency and timeliness of the system and is often expressed in terms of processing time, response time, or execution
time.
3. Productivity: Productivity measures the amount of work or output that a system or process can generate within a given
timeframe or with a given set of resources. It quantifies the efficiency and effectiveness of the system in delivering desired
results.
4. Utilization: Utilization measures the extent to which a resource, such as CPU, memory, or network bandwidth, is utilized
or occupied during a specific period. It assesses the efficiency and capacity usage of resources and can help identify
potential bottlenecks or underutilized resources.
5. Availability: Availability measures the percentage of time that a system or component is operational and accessible to
users. It quantifies the reliability and uptime of the system and is particularly important for mission-critical systems that
require high availability.
6. Customer Satisfaction: Customer satisfaction measures the level of satisfaction or happiness of customers or users with
the system's performance, functionality, and usability. It can be assessed through surveys, feedback, or ratings and plays a
crucial role in determining the success and acceptance of a system.
7. Cost-effectiveness: Cost-effectiveness measures the efficiency of a system or process in terms of the value it delivers
relative to the resources or costs invested. It evaluates the benefits achieved per unit of cost and helps in making informed
decisions regarding resource allocation and investment.
8. Compliance: Compliance measures the adherence of a system or process to established standards, regulations, or
requirements. It assesses the system's ability to meet specific criteria or specifications and ensures compliance with
industry or regulatory guidelines.
Performance measures provide objective data and benchmarks for evaluating the performance of systems, processes, or
components. They help in monitoring, analyzing, and improving system performance, identifying areas of improvement,
and making informed decisions to optimize performance and achieve desired outcomes.
Discuss about Parallel Processing applications in detail.
Parallel processing refers to the simultaneous execution of multiple tasks or instructions in a computing system. It involves
dividing a task into smaller sub-tasks that can be processed independently and then executing them concurrently using
multiple processors or processor cores. Parallel processing offers several advantages, including increased computational
speed, improved efficiency, and the ability to handle large and complex datasets. There are numerous applications of
parallel processing across various fields. Let's discuss some of them in detail:
1. Scientific Simulations: Parallel processing is extensively used in scientific simulations, such as weather forecasting,
climate modeling, and fluid dynamics. These simulations involve complex mathematical calculations and simulations that
can be divided into smaller tasks and executed in parallel. By distributing the workload across multiple processors,
simulations can be performed faster, allowing researchers to obtain results more quickly.
2. Image and Video Processing: Parallel processing plays a vital role in image and video processing applications. Tasks such
as image filtering, compression, object recognition, and video encoding/decoding can be efficiently parallelized. For
example, in video encoding, individual frames can be processed simultaneously on different processors, reducing the time
required for encoding the entire video.
3. Data Mining and Machine Learning: Parallel processing is crucial for handling large-scale data mining and machine
learning tasks. Algorithms like clustering, classification, and regression can be parallelized to process large datasets in
parallel. By distributing the data and computation across multiple processors, parallel processing enables faster model
training and prediction, making it feasible to analyze and extract insights from massive datasets.
4. Database Systems: Parallel processing is used in database systems to improve query performance and data processing
efficiency. Parallel database architectures employ multiple processors to simultaneously execute queries or process
transactions, leading to faster response times. This is especially beneficial for complex queries that involve joining large
tables or aggregating massive amounts of data.
5. High-Performance Computing: Parallel processing is a fundamental aspect of high-performance computing (HPC)
systems. HPC applications, such as computational fluid dynamics, molecular dynamics simulations, and finite element
analysis, require enormous computational power. Parallel processing allows these applications to divide the workload
among multiple processors or nodes, enabling faster and more accurate simulations or calculations.
6. Financial Modeling and Risk Analysis: Parallel processing is employed in financial modeling and risk analysis applications
to perform complex calculations, simulations, and Monte Carlo simulations. These tasks involve performing a large number
of calculations with different input parameters, which can be effectively parallelized to speed up the analysis and provide
real-time insights for decision making.
7. Cryptography and Security: Parallel processing is utilized in cryptographic algorithms and security systems to enhance
performance and ensure robustness. Tasks such as encryption, decryption, and hashing can be parallelized to accelerate
the cryptographic operations. Additionally, parallel processing can be used in network security systems to analyze and
process network traffic in real-time for detecting and preventing security threats.
These are just a few examples of the wide range of applications for parallel processing. As technology continues to
advance, parallel processing will continue to play a vital role in accelerating computations, handling big data, and solving
complex problems in various domains.

Explain Speed up performance laws in detail.


Speedup performance laws are mathematical principles that help in understanding the potential improvement in
performance when using parallel processing or distributed computing systems. These laws provide insights into the
scalability and efficiency of parallel algorithms and architectures. There are three fundamental speedup performance laws:
Amdahl's Law, Gustafson's Law, and the Universal Scalability Law. Let's discuss each of them in detail:
1. Amdahl's Law:
Amdahl's Law, formulated by computer architect Gene Amdahl in 1967, provides an estimate of the potential speedup
achievable by parallelizing a computation when a fraction of the task cannot be parallelized. The law states that the
maximum speedup achievable is limited by the non-parallelizable portion of the computation.
The formula for Amdahl's Law is as follows:
Speedup = 1 / [(1 - p) + (p / n)]
where:
- p represents the fraction of the computation that can be parallelized.
- n represents the number of processors or threads used for parallel execution.

Amdahl's Law highlights the diminishing returns of parallel processing. As the number of processors increases, the impact
of the non-parallelizable portion diminishes, but it never completely disappears. Therefore, there is an upper limit to the
speedup that can be achieved.
2. Gustafson's Law:
Gustafson's Law, proposed by computer scientist John L. Gustafson in 1988, presents an alternative perspective to
Amdahl's Law. It focuses on scaling the problem size rather than the number of processors. According to Gustafson's Law,
when the problem size increases, the amount of parallelizable work also increases, allowing for potential speedup.
The formula for Gustafson's Law is as follows:
Speedup = S + (1 - S) * p
where:
- S represents the serial fraction of the computation.
- p represents the number of processors or threads used for parallel execution.
Unlike Amdahl's Law, Gustafson's Law assumes that the problem size can be scaled up to keep the parallelizable portion
dominant. It emphasizes that the overall execution time can be reduced by distributing the workload across more
processors.
3. Universal Scalability Law (USL):
The Universal Scalability Law (USL), introduced by computer scientist Neil J. Gunther, extends the concepts of Amdahl's
Law and Gustafson's Law to account for system contention effects and bottlenecks that may impact scalability. The USL
provides a more comprehensive model for analyzing scalability in real-world systems.
The formula for the Universal Scalability Law is as follows:
Speedup = N / [1 + α * (N - 1) + β * N * (N - 1)]
where:
- N represents the number of processors or threads used for parallel execution.
- α represents the inherent system contention overhead.
- β represents the coordination overhead.
The USL considers both contention overhead, which arises from shared resources and bottlenecks, and coordination
overhead, which accounts for the cost of communication and synchronization between processors. It recognizes that as
the number of processors increases, contention and coordination overhead can limit the achievable speedup.
These speedup performance laws provide valuable insights into the potential benefits and limitations of parallel processing
and distributed computing systems. By understanding these laws, system designers and developers can make informed
decisions about the scalability and efficiency of their parallel algorithms and architectures.
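To make the three laws concrete, here is a small numeric sketch using the formulas above. In the sketch, p is the parallelizable fraction, s the serial fraction, and n the number of processors; the 90% parallel fraction, 10% serial fraction, and the α/β overhead coefficients are purely illustrative assumptions.

```python
def amdahl(p, n):
    """Amdahl's Law: p = parallelizable fraction, n = processors."""
    return 1.0 / ((1.0 - p) + p / n)

def gustafson(s, n):
    """Gustafson's Law: s = serial fraction, n = processors."""
    return s + (1.0 - s) * n

def usl(n, alpha, beta):
    """Universal Scalability Law: alpha = contention, beta = coordination."""
    return n / (1.0 + alpha * (n - 1) + beta * n * (n - 1))

for n in (1, 8, 32, 128):
    print(n,
          round(amdahl(0.9, n), 1),        # saturates near 1/(1-p) = 10x
          round(gustafson(0.1, n), 1),     # keeps growing with the scaled problem size
          round(usl(n, 0.03, 0.0005), 1))  # peaks, then falls as overheads dominate
```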

Explain Scalability Analysis and its Approaches

Scalability analysis is the process of evaluating how well a system or application can handle increasing workloads or
accommodate growing demands while maintaining acceptable performance levels. It involves assessing the system's ability
to scale up (handle larger workloads) or scale out (distribute workloads across multiple resources) to meet evolving
requirements. Scalability analysis helps identify potential bottlenecks, limitations, and areas for improvement to ensure that
the system can effectively handle increased loads. There are several approaches to scalability analysis, including vertical
scaling, horizontal scaling, and load testing.
1. Vertical Scaling:
Vertical scaling, also known as scaling up, involves increasing the capacity of a single resource within a system, such as
adding more CPU cores, increasing memory, or upgrading to a faster disk. This approach focuses on improving the
performance of individual components by enhancing their capabilities. Vertical scaling is typically easier to implement but
may have limitations as the capacity of a single resource is finite.
2. Horizontal Scaling:
Horizontal scaling, also known as scaling out, involves adding more resources to a system by distributing the workload
across multiple instances or nodes. This approach aims to improve the overall system capacity by leveraging additional
resources in parallel. Horizontal scaling often involves adding more servers, instances, or nodes to a distributed system. It
requires effective load balancing and coordination mechanisms to ensure efficient distribution of workloads across the
added resources.
3. Load Testing:
Load testing is a technique used to evaluate the performance and scalability of a system by subjecting it to various simulated
workloads. Load testing involves generating a high volume of concurrent user requests or transactions to determine how the
system behaves under different load conditions. It helps identify performance bottlenecks, such as resource saturation, slow
response times, or system failures, and provides insights into the system's scalability limits. Load testing can be performed
using specialized tools that simulate realistic user behavior and generate the desired load on the system.

Approaches to scalability analysis typically involve a combination of these techniques. It's essential to perform thorough
analysis and testing at different levels, including hardware, software, and network infrastructure, to understand the
scalability characteristics and limitations of the system. By identifying and addressing scalability issues early in the system
design or optimization process, organizations can ensure that their systems can efficiently handle increased workloads,
accommodate growth, and deliver the desired performance levels.
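As a rough sketch of the load-testing idea (not a substitute for dedicated load-testing tools), the snippet below fires batches of concurrent requests at a stand-in service function and reports throughput; the service delay, request count, and worker counts are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_service(_):
    time.sleep(0.01)        # stand-in for an I/O-bound request handler
    return "ok"

def run_load(workers, requests=200):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fake_service, range(requests)))
    elapsed = time.perf_counter() - start
    return requests / elapsed   # throughput in requests per second

for w in (1, 4, 16, 64):
    print(w, "workers ->", round(run_load(w)), "req/s")
# Throughput rises with concurrency until some resource saturates,
# which is exactly the scalability limit a load test is meant to expose.
```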

Hardware Technologies
Computer architecture encompasses various hardware technologies that contribute to the design and functionality of
computing systems. Here are some key hardware technologies in computer architecture:

1. **Processors (CPUs)**:
- Processors execute instructions and perform calculations. Modern processors often have multiple cores to enable
parallel processing.

2. **Instruction Set Architectures (ISAs)**:


- ISAs define the instructions that a processor can execute and the memory model it uses. Examples include x86, ARM,
and RISC-V.

3. **Caches**:
- Caches store frequently accessed data and instructions closer to the processor to reduce memory access latency. Levels
include L1, L2, and L3 caches.

4. **Memory Hierarchy**:
- The memory hierarchy consists of different levels of memory, from registers to main memory (RAM) and storage
devices. It balances speed and capacity.

5. **Memory Controllers**:
- Memory controllers manage data flow between the processor and memory, ensuring efficient data transfer and access.

6. **Bus and Interconnects**:


- Buses and interconnects link different hardware components, facilitating data and control signal exchange.

7. **Parallel Processing**:
- Technologies like multi-core processors, SIMD (Single Instruction, Multiple Data), and GPUs enable concurrent execution
of tasks.

8. **Vector Processing**:
- Vector processors execute operations on entire arrays or vectors of data simultaneously, enhancing performance for
specific workloads.

9. **Pipelining**:
- Pipelining divides instruction execution into stages, allowing multiple instructions to overlap in execution for improved
throughput.

10. **Superscalar Processors**:


- These processors can execute multiple instructions in a single clock cycle, exploiting instruction-level parallelism.

11. **Out-of-Order Execution**:


- Processors reorder instructions dynamically to maximize resource utilization and avoid stalls due to data dependencies.

12. **Branch Prediction**:


- Branch prediction techniques reduce the impact of branch instruction delays by guessing the likely branch outcome.

13. **Floating-Point Units (FPUs)**:


- FPUs handle floating-point arithmetic operations, essential for scientific and mathematical computations.

14. **Memory Management Units (MMUs)**:


- MMUs handle virtual memory translation, enabling efficient memory allocation and protection.

15. **Cache Coherence Protocols**:


- Protocols maintain consistency between multiple caches when shared data is updated in a multi-processor system.

16. **SIMD and GPU Architectures**:


- SIMD (Single Instruction, Multiple Data) and GPUs (Graphics Processing Units) accelerate parallelizable tasks like
graphics rendering, AI, and scientific simulations.

17. **Hardware Accelerators**:


- Specialized hardware accelerators are designed for specific tasks like cryptography, machine learning, and video
encoding.
18. **Memory Technologies**:
- Different memory technologies include DRAM, SRAM, Flash memory, and emerging non-volatile memory like Phase
Change Memory (PCM).

19. **I/O Interfaces**:


- Interfaces like USB, PCIe, SATA, and Ethernet connect peripherals, storage, and networking devices to the system.

20. **Power Management Technologies**:


- Techniques like dynamic voltage and frequency scaling (DVFS) help manage power consumption and heat generation.

These hardware technologies collectively shape the performance, efficiency, and capabilities of modern computer systems,
addressing various application demands and workloads.
Processes and Memory Hierarchy

Processes and the memory hierarchy are two fundamental concepts in computer architecture that play a crucial role in the
operation and performance of modern computer systems.
**Processes**:
A process is an executing instance of a program. It represents a program in execution along with its current state, which
includes the values of its variables, registers, program counter, and other relevant information. Processes are managed by
the operating system and provide isolation, protection, and multitasking capabilities. Here are some key points about
processes:

1. **Program vs. Process**: A program is a static set of instructions, while a process is a dynamic entity representing the
execution of those instructions.

2. **Context Switching**: The operating system performs context switches to switch between different processes. This
involves saving the state of the currently executing process and restoring the state of another process.
3. **Process States**: Processes typically transition between various states such as "running," "waiting," "ready," and
"terminated." The operating system manages these state transitions.
4. **Process Scheduling**: The operating system uses scheduling algorithms to determine which processes are allocated
CPU time and in what order. This ensures fair and efficient utilization of resources.
5. **Memory Isolation**: Processes are isolated from each other's memory space to prevent unauthorized access and
interference.

**Memory Hierarchy**:
The memory hierarchy refers to the organization of different types of memory in a computer system. It consists of various
levels of memory, each with different characteristics in terms of capacity, access speed, and cost. The goal of the memory
hierarchy is to provide a balance between speed and cost-effectiveness. Here are the key levels of the memory hierarchy:
1. **Registers**: These are the fastest and smallest storage locations located directly in the CPU. Registers hold data and
instructions being actively processed.
2. **Cache Memory**: Caches are small, high-speed memory units that store frequently accessed data and instructions,
reducing the need to access slower main memory.
3. **Main Memory (RAM)**: This is the primary memory used to store data and program instructions during execution.
It's larger than cache memory but slower.
4. **Virtual Memory**: Virtual memory is an abstraction that allows programs to use more memory than is physically
available by using a combination of RAM and disk space.
5. **Secondary Storage**: Secondary storage devices (e.g., hard drives, SSDs) provide long-term storage for programs,
data, and files. They have larger capacities but slower access times compared to main memory.
6. **Tertiary Storage**: This level includes backup storage and archival systems, which have even larger capacities but
slower access times.
The memory hierarchy ensures that frequently accessed data is kept in faster and smaller memory levels to reduce access
latency, while less frequently used data is stored in larger and slower memory levels. Efficient memory hierarchy
management is crucial for optimizing system performance and minimizing the impact of memory bottlenecks on overall
computation.
Parallel Random Access Machine, also called PRAM, is a model considered for most parallel algorithms. It helps
to write a precursor parallel algorithm without any architecture constraints and also allows parallel-algorithm designers to
treat processing power as unlimited. It ignores the complexity of inter-process communication. PRAM algorithms are
mostly theoretical but can be used as a basis for developing an efficient parallel algorithm for practical machines and can
also motivate building specialized machines.

PRAM Architecture Model

A PRAM consists of the following components:
1. It consists of a control unit, global memory, and an unbounded set of similar processors, each with its own
private memory.
2. An active processor reads from global memory, performs required computation, and then writes to global
memory.
3. Therefore, if there are N processors in a PRAM, then N number of independent operations can be performed in a
particular unit of time.

Models of PRAM

While accessing the shared memory, read and write conflicts can occur, i.e., a processor may access a memory block that
is already being accessed by another processor. Therefore, various constraints are placed on a PRAM model to handle such
read or write conflicts. They are:

- EREW: also called Exclusive Read Exclusive Write, is a constraint that doesn't allow two processors to read from or
write to the same memory location at the same instant.
- CREW: also called Concurrent Read Exclusive Write, is a constraint that allows all the processors to read from the
same memory location but does not allow them to write to the same memory location at the same time.
- ERCW: also called Exclusive Read Concurrent Write, is a constraint that allows all the processors to write to the same
memory location but does not allow them to read from the same memory location at the same time.
- CRCW: also called Concurrent Read Concurrent Write, is a constraint that allows all the processors to read from and
write to the same memory location in parallel.
Example: Suppose we wish to add an array consisting of N numbers. Sequentially we iterate through the array and use N
steps to find the sum; if each step takes 1 second, the sum takes N seconds. The same operation can be performed more
efficiently on a CRCW PRAM. With N/2 parallel processors adding disjoint pairs of elements in each step, the sum is
obtained in about log2(N) steps, so for an array of N = 6 elements the parallel version finishes in roughly 3 seconds
instead of 6.
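The following minimal sketch simulates that pairwise (tree) summation in software, counting parallel steps rather than using real PRAM hardware; the input array is arbitrary.

```python
def pram_sum(values):
    """Simulate a PRAM-style tree reduction: every 'processor' adds one
    disjoint pair per step, so the step count grows as log2(N), not N."""
    data, steps = list(values), 0
    while len(data) > 1:
        # All pair additions in this round happen in the same parallel step.
        data = [data[i] + data[i + 1] for i in range(0, len(data) - 1, 2)] + \
               ([data[-1]] if len(data) % 2 else [])
        steps += 1
    return data[0], steps

total, steps = pram_sum(range(1, 17))   # 16 elements
print(total, steps)                     # 136 in 4 parallel steps (vs ~15 sequential additions)
```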
Discuss about VLSI models
VLSI (Very Large Scale Integration) models refer to the design and representation techniques used in the field of integrated
circuit (IC) design. These models are essential for designing complex integrated circuits that can contain millions or even
billions of transistors on a single chip.
There are various VLSI models that are used at different stages of the design process. Here are some commonly used VLSI
models:
1. Gate-Level Model: The gate-level model represents an integrated circuit using logic gates and their interconnections. It
describes the circuit in terms of its logic functions and the relationships between the gates. Gate-level modeling is used for
designing the digital logic of the circuit and is often the starting point for VLSI design.
2. Register-Transfer Level (RTL) Model: The RTL model represents a circuit at a higher level of abstraction than the gate-
level model. It describes the flow of data between registers and the operations performed on that data. RTL modeling is
commonly used in digital circuit design, and it forms the basis for designing and verifying digital systems using hardware
description languages (HDLs) like Verilog or VHDL.
3. Behavioral Model: The behavioral model focuses on the functionality of the circuit rather than its implementation
details. It describes the desired behavior of the circuit in terms of inputs, outputs, and the relationship between them.
Behavioral modeling is used for system-level design and is often implemented using HDLs. It allows designers to simulate
and validate the functionality of the circuit before proceeding to lower-level designs.
4. Switch-Level Model: The switch-level model represents the behavior of a circuit at the transistor level. It captures the
detailed switching behavior of individual transistors and their interconnections. Switch-level modeling is used for analyzing
the performance and power characteristics of the circuit and is often employed in the optimization and verification stages
of the design process.
5. Physical Design Models: Physical design models focus on the layout and placement of components on the chip. These
models include representations of the physical geometry, interconnects, and other layout-related aspects of the integrated
circuit. Physical design models are critical for achieving optimal circuit performance, minimizing signal delays, and ensuring
manufacturability.
It's important to note that these models are often hierarchical, with each level of abstraction building upon the previous
one. Starting from the behavioral model, the design is refined and transformed into successively lower-level models, taking
into account various design constraints and optimizations at each stage.
VLSI models are used by designers to simulate, analyze, and validate the behavior, performance, and physical
characteristics of integrated circuits before they are manufactured. These models enable designers to optimize circuit
designs, ensure functionality, and address potential issues early in the design process, saving time and resources.
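To illustrate the difference in abstraction between the gate-level and behavioral models described above, here is a rough Python sketch of a 1-bit full adder, with Python standing in for an HDL such as Verilog or VHDL.

```python
# Gate-level view: the circuit is built from explicit logic gates.
def full_adder_gates(a, b, cin):
    s1 = a ^ b                       # XOR gate
    total = s1 ^ cin                 # XOR gate
    carry = (a & b) | (s1 & cin)     # AND and OR gates
    return total, carry

# Behavioral view: only the intended function is stated, not its structure.
def full_adder_behavioral(a, b, cin):
    s = a + b + cin
    return s % 2, s // 2

for bits in [(0, 0, 0), (1, 0, 1), (1, 1, 1)]:
    assert full_adder_gates(*bits) == full_adder_behavioral(*bits)
print("gate-level and behavioral models agree")
```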

Discuss about Superscalar processors with diagram


Superscalar processors are advanced microprocessors designed to execute multiple instructions simultaneously or in parallel
within a single clock cycle. They achieve this by incorporating multiple execution units and pipelines, allowing for
increased instruction-level parallelism. Let's discuss the architecture and operation of superscalar processors along with a
diagram.

Superscalar Processor Architecture:


A superscalar processor consists of the following key components:

1. Instruction Fetch Unit (IFU): The IFU is responsible for fetching instructions from the memory hierarchy, such as caches
and main memory. It retrieves instructions in the program order and sends them to the subsequent stages for processing.

2. Instruction Decode Unit (IDU): The IDU decodes the fetched instructions and determines their types, operands, and
dependencies. It identifies the instructions that can be executed independently or in parallel, and assigns them to available
execution units.

3. Reservation Stations: The reservation stations act as buffers that hold decoded instructions along with their operands until
all dependencies are resolved. Each reservation station is associated with a specific execution unit.

4. Execution Units: Superscalar processors have multiple execution units, such as integer units, floating-point units,
load/store units, etc. These units perform the actual computation or data manipulation operations. Each execution unit is
responsible for executing specific types of instructions.
5. Register Files: The register files store temporary data and intermediate results during instruction execution. These files
are accessed by the execution units to read input operands and write back the results.

6. Issue Unit: The issue unit determines which instructions from the reservation stations can be dispatched to the available
execution units in a given clock cycle. It ensures that the dependencies are satisfied and the execution units are fully
utilized.

7. Write-Back Unit: The write-back unit handles the results produced by the execution units. It updates the register files and
memory with the computed values.

Superscalar Processor Operation:


The operation of a superscalar processor can be summarized in the following steps:

1. Instruction Fetch: The IFU fetches instructions from memory based on the program counter.

2. Instruction Decode: The IDU decodes the fetched instructions and identifies their types, dependencies, and operands.

3. Dispatch and Issue: The issue unit determines which instructions can be dispatched from the reservation stations to the
available execution units in the current clock cycle. It checks for dependencies and resource availability.

4. Instruction Execution: The execution units perform the computations or data manipulations specified by the dispatched
instructions. Multiple instructions are executed simultaneously if they are independent and resources are available.

5. Write-Back: The write-back unit updates the register files and memory with the computed results produced by the
execution units.
6. Looping: The processor repeats the fetch, decode, dispatch, execution, and write-back stages for the subsequent
instructions until the program is completed.

Diagram:
Here is a simplified diagram illustrating the basic architecture of a superscalar processor:

```
+---------------------------------------------------+
|   +-----+      +-----+                             |
|   | IFU | ---> | IDU |                             |
|   +-----+      +-----+                             |
|                   |                                |
|            +------+------+                         |
|            v             v                         |
|        +-------+     +-------+                     |
|        |  RS1  |     |  RS2  |   Reservation       |
|        +-------+     +-------+   Stations          |
|            |             |                         |
|            v             v                         |
|   +----------+  +--------+  +------------+         |
|   | INT Unit |  | FP Add |  | Load/Store |         |
|   +----------+  +--------+  +------------+         |
+---------------------------------------------------+
```

In the above diagram:


- IFU represents the Instruction Fetch Unit.
- IDU represents the Instruction Decode Unit.
- RS1 and RS2 represent the Reservation Stations.
- INT Unit represents the Integer Execution Unit.
- FP Add represents the Floating-Point Addition Execution Unit.
- Load/Store represents the Load/Store Execution Unit.
Describe Vector processors in detail
Vector processors, also known as vector processing units or vector computers, are specialized types of processors designed
to perform parallel computations on vectors or arrays of data elements. They excel at executing operations that involve the
same instruction applied to multiple data elements simultaneously. Vector processors were initially developed in the 1970s
and 1980s and have found applications in various fields, including scientific computing, signal processing, and multimedia.

A vector processor is basically a central processing unit that has the ability to execute a complete vector input with a single
instruction. More specifically, it is a complete unit of hardware resources that executes a sequential set of similar data items
in memory using a single instruction, whereas a scalar processor operates on only one data item per instruction.
Classification of Vector Processor

The classification of vector processor relies on the ability of vector formation as well as the presence of vector instruction
for processing. So, depending on these criteria, vector processing is classified as follows:
Register to Register Architecture
This architecture is highly used in vector computers. As in this architecture, the fetching of the operand or previous results
indirectly takes place through the main memory by the use of registers.

The several vector pipelines present in the vector computer help in retrieving the data from the registers and also storing the
results in the desired register. These vector registers are user instruction programmable.

This means that according to the register address present in the instruction, the data is fetched and stored in the desired
register. These vector registers hold fixed length like the register length in a normal processing unit.
Some examples of a supercomputer using the register to register architecture are Cray – 1, Fujitsu etc.
Memory to Memory Architecture
Here in memory to memory architecture, the operands or the results are directly fetched from the memory despite using
registers. However, it is to be noted here that the address of the desired data to be accessed must be present in the vector
instruction.

This architecture enables the fetching of data of size 512 bits from memory to pipeline. However, due to high memory
access time, the pipelines of the vector computer requires higher startup time, as higher time is required to initiate the vector
instruction.

Some examples of supercomputers that possess memory to memory architecture are Cyber 205, CDC etc.
Key Characteristics of Vector Processors:
1. Vector Instructions: Vector processors are designed to execute vector instructions, which operate on arrays or vectors of
data elements in a single instruction. These instructions allow for parallel processing of data elements, enabling significant
performance improvements over scalar processors.
2. Vector Registers: Vector processors contain vector registers, which are specialized registers capable of holding multiple
data elements simultaneously. Each vector register can store a vector of fixed-length data elements, such as floating-point
numbers or integers.
3. Vector Pipelines: Vector processors typically incorporate vector pipelines, which are parallel execution units dedicated to
processing vector instructions. These pipelines can simultaneously perform arithmetic or logical operations on multiple data
elements within a single clock cycle.
4. Vector Length: The vector length refers to the number of data elements that can be processed in parallel within a single
vector instruction. It is determined by the size of the vector registers and the width of the vector pipelines. Longer vector
lengths allow for greater parallelism and potentially higher performance.
5. Data Alignment: Vector processors often require data to be aligned in memory to optimize performance. Data alignment
ensures that vector instructions can efficiently access and operate on consecutive memory locations. If the data is not
aligned, additional operations may be required to load or store the data, leading to performance penalties.
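These characteristics can be illustrated, very loosely, in software: NumPy's array-at-a-time operations mimic the "one instruction, many elements" style of a vector instruction operating on vector registers. This is only a conceptual sketch, not an actual vector ISA.

```python
import numpy as np

n = 1_000_000
a = np.random.rand(n)
b = np.random.rand(n)

# Scalar style: one element handled per "instruction".
def scalar_add(x, y):
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = x[i] + y[i]
    return out

# Vector style: one operation applied to whole arrays at once,
# analogous to a vector instruction operating on vector registers.
vector_result = a + b

# Both styles compute the same values; the vector form exposes the parallelism.
assert np.allclose(scalar_add(a[:1000], b[:1000]), vector_result[:1000])
```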
Benefits of Vector Processors:
1. Increased Performance: Vector processors exploit parallelism by performing operations on multiple data elements
simultaneously. This parallelism leads to significant speedups in computational-intensive tasks compared to traditional
scalar processors.
2. Efficient Memory Access: Vector processors can efficiently access memory by loading or storing multiple data elements
in a single operation. This reduces the number of memory accesses and enhances data transfer rates, improving overall
performance.
3. Simplified Programming Model: Vector processors provide a higher level of abstraction by operating on entire arrays or
vectors of data elements in a single instruction. This simplifies the programming model and allows for efficient utilization
of the hardware resources.
4. Energy Efficiency: Vector processors can achieve high computational throughput while consuming less power compared
to scalar processors. By performing multiple operations simultaneously, they minimize the amount of time spent on
instruction fetching, decoding, and control logic.

Applications of Vector Processors:


Vector processors have been extensively used in scientific and technical computing applications that involve
computationally intensive tasks, such as:
1. Numerical Simulations: Vector processors excel at performing large-scale simulations, including weather modeling, fluid
dynamics, and molecular dynamics, by efficiently processing arrays of data elements.
2. Signal and Image Processing: Vector processors are well-suited for applications that involve processing and analyzing
large volumes of digital signals or images, such as audio and video processing, medical imaging, and computer vision.
3. Data Analytics: Vector processors can accelerate data analytics tasks, such as data mining, machine learning, and big data
processing, by leveraging the parallelism inherent in these computations.
4. Graphics and Multimedia: Vector processors are commonly used in graphics processing units (GPUs) to accelerate
rendering, image manipulation, and video encoding/decoding tasks.
5. High-Performance Computing: Vector processors have been employed in supercomputers and high-performance
computing clusters to achieve high computational throughput and accelerate scientific simulations and complex
calculations.
Overall, vector processors offer significant performance advantages for applications that can leverage their parallel
processing capabilities. While their usage has been overshadowed by other parallel computing architectures, such as
multicore processors and GPUs, vector processors continue to play a crucial role in specific domains where vector
operations are prevalent.

Module:3

Sequential and weak consistency models:


Sequential consistency and weak consistency are two different memory consistency models used in shared-memory
architectures. They define the ordering and visibility of memory operations as observed by different processors in a
multiprocessor system. Here's a comparison of the two models:
1. Sequential Consistency:
Sequential consistency provides a strong and intuitive memory consistency model. In this model, the execution of a
program is equivalent to a sequential execution where all memory operations appear to occur in a global, total order. This
means that the order of memory operations observed by any processor is the same as if they were executed in a single,
sequential program.
Key features of sequential consistency include:
- Program Order: Memory operations issued by each processor must be observed in the order specified by the program. In
other words, the order of operations within each processor's execution stream is preserved.
- Global Order: Memory operations from different processors are ordered consistently, forming a global order. This ensures
that memory operations are observed by all processors in the same order.
- Visibility: Memory operations become visible to all processors instantaneously. Once a memory operation completes, its
effects are immediately visible to other processors.
Sequential consistency provides a straightforward programming model and simplifies reasoning about the behavior of
concurrent programs. However, it may impose a high overhead due to the need for synchronization and ordering
constraints, especially in large-scale multiprocessor systems.
2. Weak Consistency:
Weak consistency relaxes the ordering constraints imposed by sequential consistency to provide better performance and
scalability in shared-memory systems. Unlike sequential consistency, weak consistency allows for reordering and
overlapping of memory operations among processors. This flexibility enables better parallelism and reduces
synchronization overhead.
Key features of weak consistency include:
- Out-of-Order Execution: Processors are allowed to execute and complete memory operations out of program order, as
long as they respect certain dependencies and constraints.
- Partial Order: The order in which memory operations are observed by different processors may vary. Processors can
observe different interleavings of memory operations, leading to different partial orders.
- Synchronization: Explicit synchronization primitives, such as locks or barriers, are required to establish ordering and
enforce dependencies between memory operations.
- Delayed Visibility: The effects of a memory operation may not become immediately visible to other processors. There can
be delays in propagating updates, leading to inconsistent views of shared memory.
Weak consistency allows for greater concurrency and optimization opportunities but requires careful synchronization and
explicit handling of dependencies to ensure correct program behavior. It places the burden on the programmer to reason
about and manage the ordering and visibility of memory operations.
In summary, sequential consistency provides a strong and intuitive memory consistency model but may lead to increased
synchronization overhead. Weak consistency relaxes the ordering constraints to enhance performance and scalability but
requires explicit synchronization and careful management of memory dependencies. The choice of consistency model
depends on the specific requirements of the application and the trade-off between simplicity and performance.
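The sketch below illustrates the kind of explicit synchronization weak consistency demands, using a Python threading.Event as an acquire/release-style publication point. CPython already serializes these operations internally, so this is only a sketch of the programming pattern; on a weakly consistent machine it is the synchronization primitive that guarantees the write becomes visible in the right order.

```python
import threading

data = 0
ready = threading.Event()            # explicit synchronization point

def producer():
    global data
    data = 42                        # write the payload first...
    ready.set()                      # ...then publish it (release-like operation)

def consumer():
    ready.wait()                     # acquire-like operation: wait for the publication
    print("consumer sees", data)     # ordering is enforced by the Event, not by luck

t_consumer = threading.Thread(target=consumer)
t_producer = threading.Thread(target=producer)
t_consumer.start()
t_producer.start()
t_producer.join()
t_consumer.join()
```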

Sequential consistency and weak consistency are memory consistency models that define the order in which memory
operations (read and write instructions) appear to be executed by different processors or threads in a multiprocessor
system. These models play a crucial role in ensuring that the behavior of concurrent programs is well-defined and
predictable.

1. **Sequential Consistency**:
- In a sequentially consistent memory model, the execution of memory operations appears as if they were executed in
some sequential order that respects program order.
- This means that if two memory operations from different processors are not causally related (i.e., they don't have a
direct dependency between them), then the system should guarantee that they appear to be executed in the same order
on all processors.
- While sequential consistency provides a simple and intuitive programming model, it can limit potential optimizations
and parallelism because it requires memory operations to be globally ordered.

2. **Weak Consistency**:
- Weak consistency relaxes the ordering requirements compared to sequential consistency. It allows for more reordering
of memory operations to improve performance.
- Memory operations are categorized into different memory barriers or synchronization operations (like acquire and
release operations), which provide constraints on the reordering of memory operations around them.
- Weak consistency models generally allow out-of-order execution and relaxed memory visibility, which can lead to
different processors observing different orders of memory operations, as long as the constraints of memory barriers are
satisfied.
- While weak consistency can enable better performance and optimization opportunities, it can also lead to complex
programming and debugging challenges due to the less intuitive behavior of memory operations.

It's important to note that both sequential consistency and weak consistency models have their own advantages and
challenges:

- **Sequential Consistency Advantages**:


- Easier to reason about and program for.
- Provides a more intuitive view of memory operations' behavior.
- Makes it simpler to detect and understand data races.

- **Weak Consistency Advantages**:


- Allows for greater optimization and parallelism by allowing more relaxed memory ordering.
- Can lead to better performance by enabling hardware and compiler optimizations.
- Often used in high-performance computing environments where performance is critical.

The choice between these models depends on the specific use case, application requirements, and hardware architecture.
Most modern multiprocessor systems and programming languages offer a range of memory consistency models, allowing
developers to choose the one that best suits their needs while considering the trade-offs between programmability and
performance optimization.
Shared Memory Organization
Shared memory is a fast inter-process communication mechanism. It allows cooperating processes to access the
same pieces of data concurrently. It speeds up the computation power of the system by dividing long tasks into
smaller sub-tasks that can be executed in parallel. Modularity is also achieved in a shared memory system.

Shared-memory organization refers to a type of computer architecture where multiple processors or cores share
a common memory system. In this arrangement, all processors can access the same physical memory locations,
which enables easier communication and data sharing among the processors. This is in contrast to distributed-
memory architectures where each processor has its own separate memory space.

There are different ways to implement shared-memory organizations:

1. **Uniform Memory Access (UMA)**: In a UMA architecture, all processors have equal and uniform access
times to the shared memory. This means that accessing any memory location takes roughly the same amount of
time, regardless of which processor initiates the access. However, as the number of processors increases, the
memory bus or interconnect can become a bottleneck, leading to performance degradation.
2. **Non-Uniform Memory Access (NUMA)**: NUMA is an extension of UMA where processors are grouped
together, and each group has its own local memory. Processors can access their local memory faster than
remote memory belonging to other groups. While this reduces contention on the memory bus, accessing remote
memory takes longer due to interconnect latency. NUMA architectures aim to optimize memory access by
minimizing remote memory accesses.

3. **Cache Coherence**: In shared-memory architectures, maintaining cache coherence becomes crucial.


Cache coherence protocols ensure that if one processor updates a memory location, all other processors see
the updated value. This can involve complex protocols to manage read and write operations to shared memory
locations.

4. **Directory-Based Coherence**: In larger-scale shared-memory systems, directory-based coherence protocols


are often used. Each memory block has an associated directory that tracks which processors have cached
copies of that block. This helps in reducing the overhead of broadcast-based invalidations and can be more
efficient in systems with a large number of processors.

5. **Synchronization and Consistency**: Shared-memory architectures require mechanisms for synchronization


and memory consistency. This involves providing operations like locks, semaphores, and atomic operations to
ensure that concurrent accesses to shared data are coordinated properly and that the order of memory
accesses appears consistent to all processors.

Shared-memory architectures offer programming simplicity as communication and data sharing among
processors are relatively straightforward compared to distributed-memory systems. However, they also introduce
challenges related to contention for shared resources, cache coherence complexity, and scalability as the
number of processors increases.

Modern multi-core processors often utilize shared-memory architectures to leverage the benefits of parallelism
and efficient communication between cores. In addition, high-performance computing clusters and large-scale
multiprocessor systems often employ shared-memory architectures with NUMA characteristics to strike a
balance between performance and scalability.
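A minimal sketch of the shared-memory idea using Python's multiprocessing module: several worker processes update a single counter placed in shared memory, with a lock providing the coordination discussed above. The counter type, worker count, and iteration count are arbitrary.

```python
from multiprocessing import Process, Value, Lock

def worker(counter, lock, n):
    for _ in range(n):
        with lock:                     # coordinate access to the shared location
            counter.value += 1

if __name__ == "__main__":
    counter = Value("i", 0)            # one integer placed in shared memory
    lock = Lock()
    procs = [Process(target=worker, args=(counter, lock, 10_000)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value)               # 40000: all four workers saw the same memory
```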

The fundamental feature of a shared-memory computer is that all the CPU-cores are connected to the same
piece of memory.
This is achieved by having a memory bus that takes requests for data from multiple sources (here, each of the
four separate CPU-cores) and fetches the data from a single piece of memory. The term bus apparently comes
from the Latin omnibus meaning for all, indicating that it is a single resource shared by many CPU-cores.

This is the basic architecture of a modern mobile phone, laptop or desktop PC. If you buy a system with a quad
core processor and 4 GBytes of RAM, each of the 4 CPU-cores will be connected to the same 4 Gbytes of RAM,
and they’ll therefore have to play nicely and share the memory fairly between each other.

A good analogy here is to think of four office-mates or workers (the CPU-cores) sharing a single office (the
computer) with a single whiteboard (the memory). Each worker has their own set of whiteboard pens and an
eraser, but they are not allowed to talk to each other: they can only communicate by writing to and reading from
the whiteboard.

Later on, we’ll start to think about how we can use this shared whiteboard to get the four workers to cooperate to
solve the same problem more quickly than they can do it alone. However, the analogy already illustrates two key
limitations of this approach:

1. memory capacity: there is a limit to the size of the whiteboard that you can fit into an office, i.e. there is a
limit to the amount of memory you can put into a single shared-memory computer;
2. memory access speed: imagine that there were ten people in the same office – although they can in
principle all read and write to the whiteboard, there’s simply not enough room for more than around four of
them to do so at the same time as they start to get in each other’s way. Although you can fill the office full
of more and more workers, their productivity will stall after about 4 workers because the additional workers
will spend more and more time idle as they have to queue up to access the shared whiteboard.

Limitations
It turns out that memory access speed is a real issue in shared-memory machines. If you look at the processor
diagram above, you’ll see that all the CPU-cores share the same bus: the connection between the bus and the
memory eventually becomes a bottleneck and there is simply no point in adding additional CPU-cores. Coupled
with the fact that the kinds of programs we run on supercomputers tend to read and write large quantities of
data, it is often memory access speed that is the limiting factor controlling how fast we can do a calculation, not
the floating-point performance of the CPU-cores.

There are various tricks to overcoming these two issues, but the overcrowded office clearly illustrates the
fundamental challenges of this approach if we require many hundreds of thousands of CPU-cores.
Characteristics of Pipeline Processors
Pipelining refers to the temporal overlapping of processing. Pipelines are nothing more than assembly lines in computing
that can be used for instruction processing. A basic pipeline processes a sequence of tasks or instructions according to
the following principle of operation.
Each task is subdivided into a number of successive subtasks. The processing of each single instruction can be broken
down into four subtasks:
1. Instruction Fetch
2. Instruction Decode
3. Execute
4. Write Back
It is assumed that there is a pipeline stage associated with each subtask. The same amount of time is available in each
stage for performing the required subtask. All the pipeline stages operate like an assembly line, that is, receiving their
input from the previous stage and delivering their output to the next stage. We also assume that the basic pipeline
operates clocked, in other words synchronously. This means that each stage accepts a new input at the start of a clock
cycle, each stage has a single clock cycle available for performing the required operation, and each stage passes its
result on to the next stage by the beginning of the subsequent clock cycle.
Linear Pipeline Processors:
A linear pipeline processor is a cascade of processing stages which are linearly connected to perform a fixed function over a stream of data flowing from one end to the other. In modern computers, linear pipelines are applied for instruction execution, arithmetic computation and memory access operations.
A linear pipeline processor is constructed with k processing stages. External inputs are fed into the pipeline at the first stage S1. The processed results are passed from stage Si to stage Si+1 for all i = 1, 2, ..., k-1. The final result emerges from the pipeline at the last stage Sk. Depending on the control of data flow along the pipeline, linear pipelines are modelled in two categories.
Asynchronous Model: Data flow between adjacent stages in an asynchronous pipeline is controlled by a handshaking protocol. When stage Si is ready to transmit, it sends a ready signal to stage Si+1. After stage Si+1 receives the incoming data, it returns an acknowledge signal to Si.
Clock period:
The logic circuitry in each stage Si has a time delay denoted by τi. Let τl be the time delay of each interface latch. The clock period of a linear pipeline is defined by
τ = max{τi} + τl
The reciprocal of the clock period is called the frequency, f = 1/τ.
Ideally, a linear pipeline with k stages can process n tasks in
Tk = [k + (n − 1)] τ
time, where k cycles are used to fill up the pipeline or to complete execution of the first task and n − 1 cycles are needed to complete the remaining n − 1 tasks. The same number of tasks (operand pairs) can be executed in a nonpipelined processor with an equivalent function in
T1 = n k τ
time delay.
Speedup:
We define the speedup of a k-stage linear pipeline processor over an equivalent nonpipelined processor as
Sk = T1 / Tk = n k τ / [k + (n − 1)] τ = n k / [k + (n − 1)]
It should be noted that the maximum speedup is Sk → k for n >> k. In other words, the maximum speedup that a linear pipeline can provide is k, where k is the number of stages in the pipe. The maximum speedup is never fully achievable because of data dependencies between instructions, interrupts, and other factors.
Efficiency:
The efficiency of a linear pipeline is measured by the percentage of busy time-space spans over the total time-space span, which equals the sum of all busy and idle time-space spans. Let n, k and τ be the number of tasks (instructions), the number of pipeline stages, and the clock period of a linear pipeline, respectively. The pipeline efficiency is defined by
η = Sk / k = n / [k + (n − 1)]
Note that η → 1 as n → ∞. This implies that the larger the number of tasks flowing through the pipeline, the better is its efficiency. Writing η = Sk / k also provides another view of the efficiency of a linear pipeline: it is the ratio of the actual speedup to the ideal speedup k. In the steady state of a pipeline we have n >> k, so the efficiency η approaches 1. However, this ideal case may not hold all the time because of program branches and interrupts, data dependency, and other reasons.
Throughput:
The number of results (tasks) that can be completed by a pipeline per unit time is called its throughput. This rate reflects the computing power of a pipeline. In terms of the efficiency η and clock period τ of a linear pipeline, we define the throughput as follows:
w = n / {[k + (n − 1)] τ} = η / τ = η f
where n equals the total number of tasks being processed during an observation period kτ + (n − 1)τ. In the ideal case, w = 1/τ = f when η → 1. This means that the maximum throughput of a linear pipeline is equal to its frequency, which corresponds to one output result per clock period.
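As a quick worked illustration of these formulas: for a k = 4 stage pipeline processing n = 100 tasks with clock period τ, Tk = (4 + 99)τ = 103τ while T1 = 400τ, so Sk = 400/103 ≈ 3.88, η = Sk/k ≈ 0.97, and the throughput w = η/τ ≈ 0.97 results per clock period.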
According to the levels of processing, pipeline processors can be classified into the following classes: arithmetic, instruction and processor pipelines, and further as unifunction vs. multifunction, static vs. dynamic, and scalar vs. vector pipelines.
Reservation Table in linear pipelining:
The utilization pattern of successive stages in a synchronous pipeline is specified by a reservation table. The table is essentially a space-time diagram depicting the precedence relationship in using the pipeline stages. For a k-stage linear pipeline, k clock cycles are needed for data to flow through the pipeline.
Reservation tables in Non-linear pipelining:
Reservation tables for a dynamic pipeline become more complex and interesting because a non-linear pattern is followed. For a given non-linear pipeline configuration, multiple reservation tables can be generated; each reservation table represents the evaluation of a different function. Each reservation table displays the time-space flow of data through the pipeline for one function evaluation. Different functions may follow different paths through the reservation table. As an example, consider a three-stage non-linear pipeline with stages S1, S2 and S3 evaluating a function X.
Processing sequence for function X:
S1 → S2 → S3 → S2 → S3 → S1 → S3 → S1
Reservation table for function ‘X’
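The reservation table figure itself is not reproduced in these notes; a table consistent with the processing sequence above (an assumed reconstruction of the usual three-stage example) marks the stages against clock cycles 1 to 8 as follows:
Stage |  1   2   3   4   5   6   7   8
------+-------------------------------
 S1   |  X   .   .   .   .   X   .   X
 S2   |  .   X   .   X   .   .   .   .
 S3   |  .   .   X   .   X   .   X   .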
Latency Analysis:
 Latency: the number of time units (clock cycles) between two initiations of a pipeline is the latency between them.
 A latency value k means that two initiations are separated by k clock cycles.
 Any attempt by two or more initiations to use the same pipeline stage at the same time will cause a collision.
 A collision implies a resource conflict between two initiations in the pipeline.
Collision with scheduling latency 2:
 Latencies that cause collisions are called forbidden latencies.
 The forbidden latencies for function X are 2, 4, 5 and 7.
 Latencies 1, 3 and 6 do not cause collisions.
 The maximum forbidden latency is m, where m ≤ n − 1 and n is the number of columns in the reservation table.
 All latencies greater than m do not cause collisions.
 A permissible latency p lies in the range 1 ≤ p ≤ m − 1.
 The value of p should be as small as possible.
 The permissible latency p = 1 corresponds to the ideal case and can be achieved by a static pipeline.
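As a check against the assumed reservation table sketched above, the forbidden latencies are the distances between marks within the same row: row S1 (marks at cycles 1, 6, 8) gives 5, 7 and 2; row S2 (marks at 2, 4) gives 2; row S3 (marks at 3, 5, 7) gives 2, 4 and 2. Their union is {2, 4, 5, 7}, matching the forbidden latencies listed above, while 1, 3 and 6 remain permissible.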
Non-Linear Pipeline: Collision Vectors
 The collision vector combines the set of permissible and forbidden latencies into an m-bit binary vector C = (Cm Cm−1 ... C2 C1).
 The value of Ci = 1 if the latency i causes a collision; Ci = 0 if the latency i is permissible.
 Cm = 1 always; it corresponds to the maximum forbidden latency.
State Diagrams
 State diagrams can be constructed to specify the permissible transitions among successive initiations.
 The collision vector corresponding to the initial state of the pipeline at time 1 is called the initial collision vector.
 The next state of the pipeline at time t + p can be obtained by using an m-bit right shift register.
 The initial CV is loaded into the register.
 The register is then shifted to the right.
 When a 0 emerges from the right end after p shifts, p is a permissible latency.
 When a 1 emerges, the corresponding latency should be forbidden.
 A logical 0 enters from the left end of the shift register.
 The next state after p shifts is obtained by bitwise-ORing the initial CV with the shifted register contents.
 This bitwise-ORing of the shifted contents is meant to prevent collisions from future initiations starting at time t + 1 and onward.
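The shift-register procedure can be sketched in C (an illustrative example only). For the forbidden latencies 2, 4, 5 and 7 of function X, the initial collision vector is C = (1011010), held below as the integer 0x5A with bit i−1 standing for Ci; the program checks each latency from the initial state and prints the next state for every permissible latency.

#include <stdio.h>

int main(void) {
    int m = 7;                 /* maximum forbidden latency for function X        */
    unsigned init_cv = 0x5A;   /* binary 1011010: latencies 2,4,5,7 are forbidden */
    unsigned state = init_cv;

    for (int p = 1; p <= m; p++) {
        int emerging_bit = state & 1;  /* bit leaving the right end on this shift */
        state >>= 1;                   /* a logical 0 enters from the left        */
        if (emerging_bit == 0) {
            unsigned next_state = state | init_cv;  /* bitwise-OR with initial CV */
            printf("latency %d is permissible, next state = 0x%02X\n", p, next_state);
        } else {
            printf("latency %d is forbidden\n", p);
        }
    }
    return 0;
}

Running it reports latencies 1, 3 and 6 as permissible, in agreement with the analysis above.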
Latency Cycles
 Simple cycles: latency cycles in which each state appears only once.
 Greedy cycles: simple cycles whose edges are all made with the minimum latencies from their respective starting states.
 MAL: minimum average latency.
 At least one of the greedy cycles will lead to the MAL.
Collision-free scheduling
 Collision-free scheduling amounts to finding the greedy cycles from the set of simple cycles.
 The greedy cycle yielding the MAL is the final choice.
Optimization technique:
 Insertion of delay stages
 Modification of the reservation table
 New CV
 Improved state diagram
 These steps are applied to yield an optimal latency cycle.
Bounds on MAL
 The MAL is lower-bounded by the maximum number of checkmarks in any row of the reservation table.
 The MAL is lower than or equal to the average latency of any greedy cycle in the state diagram.
 The average latency of any greedy cycle is upper-bounded by the number of 1's in the initial CV plus 1.
 The optimal latency cycle is selected from one of the lowest greedy cycles.
Instruction Pipeline Design:
A stream of instructions can be executed by a pipeline in an overlapped manner. A typical instruction execution consists of a sequence of operations, including
(1) Instruction fetch
(2) Decode
(3) Operand fetch
(4) Execute
(5) Write back phases
Pipeline instruction processing:
A typical instruction pipeline has seven stages, as depicted below in the figures;
· The fetch stage (F) fetches instructions from a cache memory.
· The decode stage (D) decodes the instruction in order to find the function to be performed and identifies the resources needed.
· The issue stage (I) reserves resources. Resources include GPRs, buses and functional units.
· The instructions are executed in one or several execute stages (E).
· The write-back stage (WB) is used to write results into the registers.
· Memory load and store (L/S) operations are treated as part of execution.
· Floating-point add and multiply operations take four execution clock cycles.
· In many RISC processors fewer cycles are needed.
 Idle cycles arise when instruction issues are blocked due to resource conflicts, for example before the data Y and Z are loaded in.
 The store of the sum to memory location X must wait three cycles for the add to finish, due to flow dependence.

Non-linear pipelines are dynamic pipelines because they can be reconfigured to perform variable functions at different times.

Arithmetic pipeline design is a technique used in computer architecture to accelerate the execution of arithmetic
and mathematical operations by breaking them down into multiple stages and processing multiple operations
simultaneously. This approach improves the throughput of arithmetic computations and takes advantage of
pipelining to achieve higher performance. Arithmetic pipelines are commonly used for tasks like integer and
floating-point arithmetic operations.

Here's a high-level overview of how an arithmetic pipeline works and the key stages involved:
1. Stage 1 - Operand Fetch: In this stage, the operands required for the arithmetic operation are fetched
from registers or memory. Operand fetching involves reading data from the appropriate storage locations
and making them available for subsequent stages.
2. Stage 2 - Decode and Issue: The fetched operands are decoded, and the arithmetic operation to be
performed is determined based on the instruction. The operation is then issued to the next stage, where
actual computation takes place.
3. Stage 3 - Execute: In this stage, the actual arithmetic operation is performed. This includes arithmetic
computations like addition, subtraction, multiplication, and division. Execution may also involve logical
operations and other mathematical functions.
4. Stage 4 - Write-Back: The result of the arithmetic operation is written back to registers or memory. This
stage updates the appropriate destination location with the computed value.

By dividing the arithmetic computation into these stages and allowing multiple operations to progress through
the pipeline simultaneously, the arithmetic pipeline achieves better utilization of hardware resources and
improves performance. This approach is particularly effective for executing a sequence of similar arithmetic
operations, as each stage of the pipeline can work on a different operation at the same time.

It's important to note that while arithmetic pipelines can significantly improve throughput, they may introduce
latency due to the pipeline stages and associated overhead. Pipeline hazards such as data hazards (dependencies
between instructions) and control hazards (branches and jumps) must also be carefully managed to avoid stalls
and ensure correct execution.

Modern processors often employ advanced techniques to enhance arithmetic pipeline performance, including:
 Superscalar Execution: Combining multiple arithmetic pipelines to process multiple instructions in
parallel.
 Out-of-Order Execution: Allowing instructions to be executed out of their original order, reducing
pipeline stalls caused by dependencies.
 Register Renaming: Using physical registers to eliminate false dependencies and improve pipeline
efficiency.
 Speculative Execution: Predicting the outcome of branch instructions and speculatively executing
instructions to reduce pipeline stalls caused by mispredictions.
 Vector Processing: Processing multiple data elements (vectors) in parallel using vector operations, which
can accelerate certain arithmetic workloads.

Arithmetic pipeline design is a critical aspect of modern processors, enabling them to efficiently handle the
intensive mathematical computations required by a wide range of applications, including scientific simulations,
graphics rendering, and machine learning.
1. Arithmetic Pipeline :
An arithmetic pipeline divides an arithmetic problem into various sub problems for execution in various pipeline segments.
It is used for floating point operations, multiplication and various other computations. The flowchart of the arithmetic pipeline for floating point addition is shown in the diagram.
Floating point addition using arithmetic pipeline :
The following sub operations are performed in this case:
1. Compare the exponents.
2. Align the mantissas.
3. Add or subtract the mantissas.
4. Normalise the result
First of all the two exponents are compared and the larger of the two exponents is chosen as the result exponent. The difference in the exponents then decides how many times we must shift the mantissa of the number with the smaller exponent to the right. After this shifting, both the mantissas are aligned. Finally the addition of both numbers takes place, followed by normalisation of the result in the last segment.
Example:
Let us consider two numbers,
X=0.3214*10^3 and Y=0.4500*10^2
Explanation:
First of all the two exponents are subtracted to give 3 − 2 = 1. Thus 3 becomes the exponent of the result, and the mantissa of the number with the smaller exponent is shifted 1 place to the right to give
Y=0.0450*10^3
Finally the two numbers are added to produce
Z=0.3664*10^3
As the result is already normalized the result remains the same.
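The four sub-operations can be sketched in C as follows (an illustration only, using base-10 mantissa/exponent pairs to mirror the example above; real hardware operates on binary floating-point fields):

#include <stdio.h>
#include <math.h>

/* (mantissa, exponent) pair representing mantissa * 10^exponent */
typedef struct { double mantissa; int exponent; } Dec;

Dec fp_add(Dec a, Dec b) {
    /* 1. Compare the exponents: the result takes the larger one. */
    if (a.exponent < b.exponent) { Dec t = a; a = b; b = t; }
    int diff = a.exponent - b.exponent;
    /* 2. Align the mantissas: shift the smaller number's mantissa right. */
    b.mantissa /= pow(10.0, diff);
    /* 3. Add the mantissas. */
    Dec r = { a.mantissa + b.mantissa, a.exponent };
    /* 4. Normalise (left-normalisation of small results omitted in this sketch). */
    while (fabs(r.mantissa) >= 1.0) { r.mantissa /= 10.0; r.exponent++; }
    return r;
}

int main(void) {
    Dec x = { 0.3214, 3 }, y = { 0.4500, 2 };
    Dec z = fp_add(x, y);
    printf("Z = %.4f * 10^%d\n", z.mantissa, z.exponent);  /* prints 0.3664 * 10^3 */
    return 0;
}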
2. Instruction Pipeline :
In this a stream of instructions can be executed by overlapping fetch, decode and execute phases of an instruction cycle.
This type of technique is used to increase the throughput of the computer system. An instruction pipeline reads instruction
from the memory while previous instructions are being executed in other segments of the pipeline. Thus we can execute
multiple instructions simultaneously. The pipeline will be more efficient if the instruction cycle is divided into segments of
equal duration.
In the most general case computer needs to process each instruction in following sequence of steps:
1. Fetch the instruction from memory (FI)
2. Decode the instruction (DA)
3. Calculate the effective address
4. Fetch the operands from memory (FO)
5. Execute the instruction (EX)
6. Store the result in the proper place
The flowchart for instruction pipeline is shown below.
Let us see an example of instruction pipeline.
Example:

Here the instruction is fetched in the first clock cycle in segment 1.
It is then decoded in the next clock cycle, after which the operands are fetched and finally the instruction is executed. We can see that here the fetch and decode phases overlap due to pipelining: by the time the first instruction is being decoded, the next instruction is fetched by the pipeline.
In the case of the third instruction, we see that it is a branch instruction. While it is being decoded, the fourth instruction is fetched simultaneously. But as it is a branch instruction, it may point to some other instruction once it is decoded. Thus the fourth instruction is kept on hold until the branch instruction is executed. When the branch has executed, the fourth instruction is brought back into the pipeline and the other phases continue as usual.
superscalar pipeline design:
What is a superscalar pipeline design?
In a superscalar computer, the central processing unit (CPU) manages multiple instruction pipelines to execute
several instructions concurrently during a clock cycle. This is achieved by feeding the different pipelines through
a number of execution units within the processor.

A basic pipelined processor uses a 4-stage instruction pipeline with the following stages: Instruction Fetch (IF), Instruction Decode (ID), Execute (EX) and Writeback (WB). A superscalar pipeline extends this with additional stages and mechanisms:

1. Instruction Fetch (IF): The first stage of the pipeline involves fetching instructions from memory. In a
superscalar pipeline, multiple instructions are fetched in parallel, often using techniques like instruction
prefetching and branch prediction to minimize pipeline stalls.
2. Instruction Decode (ID): In this stage, fetched instructions are decoded into their corresponding
operations and operands. The decoder identifies dependencies among instructions and checks for hazards,
such as data hazards and control hazards.
3. Issue Stage: This is a crucial stage in superscalar pipelines. Instructions are issued to multiple execution
units based on availability and dependencies. The issue logic determines which instructions can be
executed in parallel, considering factors like data availability, resource availability (like functional units), and
instruction dependencies.
4. Execution Stage: Instructions proceed to the execution units, where operations like arithmetic, logic, and
memory access are performed. Superscalar processors typically have multiple execution units of different
types (e.g., arithmetic logic units, floating-point units), allowing for the concurrent execution of diverse
instructions.
5. Memory Access Stage: Instructions that involve memory access (load and store) interact with the memory
subsystem. Memory access may include caching, memory address calculations, and interfacing with the
memory hierarchy.
6. Write-Back Stage: The results of executed instructions are written back to appropriate registers or
memory locations. This stage ensures that the updated values are available for subsequent instructions.
7. Register Renaming: To avoid data hazards and improve instruction-level parallelism, register renaming
techniques are used. They allow multiple instructions to use the same architectural registers without
causing data conflicts.
8. Dynamic Scheduling: Superscalar processors dynamically reorder instructions at runtime to exploit
available execution resources more efficiently.
MODULE:4
https://www.slideshare.net/sumitmittu/aca2-07-new
Parallel and scalable architectures:
Parallel and scalable architectures refer to computer system designs that focus on increasing computational power and
performance by leveraging parallelism, while also accommodating growing workloads and data sizes. These architectures
are essential for addressing the ever-increasing demand for processing power in various domains, including scientific
research, data analysis, artificial intelligence, and more.

Here are key concepts related to parallel and scalable architectures:


1. **Parallelism**: Parallelism involves executing multiple tasks or instructions concurrently to improve performance.
There are two main types of parallelism:
- **Task Parallelism**: Multiple independent tasks are executed simultaneously. Each task can run on its own processing
unit.
- **Data Parallelism**: A single task is executed on multiple data sets concurrently. This is common in SIMD architectures
like GPUs.
2. **Scalability**: Scalability is the ability of a system to handle growing workloads efficiently. It can be categorized into
two types:
- **Vertical Scalability (Scale Up)**: Increasing the capacity of a single machine by adding more resources like CPUs,
memory, or storage.
- **Horizontal Scalability (Scale Out)**: Adding more machines to a system to distribute the workload. This approach is
often used in clusters and cloud computing.
3. **Architectural Approaches**:
- **Multiprocessor Systems**: These involve multiple processors (CPUs) working together on a single machine. They can
be symmetric multiprocessing (SMP) or non-uniform memory access (NUMA) architectures.
- **Multicore Processors**: These involve multiple processing cores on a single chip, promoting parallelism within a
single machine.
- **Clustered Systems**: Multiple independent machines (nodes) are connected to form a cluster, sharing resources and
distributing tasks.
- **Distributed Systems**: These involve a network of computers that communicate and collaborate on tasks. They can
range from clusters to data centers.
4. **Architectural Features**:
- **Interconnects**: High-speed interconnects like InfiniBand or Ethernet enable efficient communication between
processors or nodes.
- **Memory Hierarchy**: Efficient memory access and management are crucial for maintaining high performance in
parallel systems.
- **Load Balancing**: Distributing tasks evenly across processors or nodes helps ensure optimal resource utilization.
- **Fault Tolerance**: Systems should be designed to handle hardware failures without causing downtime.
- **Parallel Programming Models**: Languages and frameworks like MPI (Message Passing Interface) and OpenMP
facilitate the development of parallel applications.
5. **Examples**:
- **Clusters**: Beowulf clusters use commodity hardware and open-source software to create cost-effective parallel
systems.
- **High-Performance Computing (HPC) Systems**: Supercomputers like those from Cray, IBM, and others are designed
for massive parallelism and are used in scientific research.
- **Cloud Computing**: Cloud providers offer scalable resources for various workloads, allowing users to scale their
applications up or down as needed.
6. **Challenges**:
- **Amdahl's Law**: The performance improvement from parallelization is limited by the serial (non-parallelizable)
portion of the code.
- **Data Dependencies**: Managing data dependencies and synchronization can introduce overhead in parallel systems.
- **Programming Complexity**: Writing efficient parallel code is challenging and requires understanding parallel
programming models.
Parallel and scalable architectures play a critical role in modern computing, enabling the efficient processing of large data
sets and complex computations. They are essential for meeting the demands of applications ranging from scientific
simulations to real-time analytics and machine learning.
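To illustrate the Amdahl's Law limit mentioned above: if a fraction P of a program can be parallelized and the remaining (1 − P) must run serially, the speedup on N processors is bounded by
Speedup = 1 / ((1 − P) + P/N)
For example, with P = 0.9 and N = 16 this gives 1 / (0.1 + 0.05625) = 6.4, and even with an unlimited number of processors the speedup can never exceed 1 / (1 − P) = 10.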
scalable architecture:
A scalable architecture is an architecture that can scale up to meet increased work loads. In other words, if the
work load all of a sudden exceeds the capacity of your existing software + hardware combination, you can scale
up the system (software + hardware) to meet the increased work load.

Scalability Factor

When you scale up your system's hardware capacity, you want the workload it is able to handle to scale up to
the same degree. If you double hardware capacity of your system, you would like your system to be able to
handle double the workload too. This situation is called "linear scalability".

Linear scalability is often not the case though. Very often there is an overhead associated with scaling up, which
means that when you double hardware capacity, your system can handle less than double the workload.

The extra workload your system can handle when you scale up your hardware capacity is your system's
scalability factor. The scalability factor may vary depending on what part of your system you scale up.

Vertical and Horizontal Scalability

There are two primary ways to scale up a system: Vertical scaling and horizontal scaling.

Vertical scaling means that you scale up the system by deploying the software on a computer with higher
capacity than the computer it is currently deployed on. The new computer may have a faster CPU, more
memory, faster and larger hard disk, faster memory bus etc. than the current computer.
Horizontal scaling means that you scale up the system by adding more computers with your software deployed
on. The added computers typically have about the same capacity as the current computers the system is
running on, or whatever capacity you would get for the same money at the time of purchase (computers tend to
get more powerful for the same money over time).

Architecture Scalability Requirements

The easiest way to scale up your software from a developer perspective is vertical scalability. You just deploy on
a bigger machine, and the software performs better. However, once you get past standard hardware
requirements, buying faster CPUs, bigger and faster RAM modules, bigger and faster hard disks, faster and
bigger motherboards etc. gets really, really expensive compared to the extra performance you get. Also, if you
add more CPUs to a computer, and your software is not explicitly implemented to take advantage of them, you
will not get any increased performance out of the extra CPUs.

Scaling horizontally is not as easy when seen from a software developer's perspective. In order to make your software take advantage of multiple computers (or even multiple CPUs within the same computer), your software needs to be able to parallelize its tasks. In fact, the better your software is at parallelizing its tasks, the better your software scales horizontally.

Task Parallelization

Parallelization of tasks can be done at several levels:

 Distributing separate tasks onto separate computers.
 Distributing separate tasks onto separate CPUs on the same computer.
 Distributing separate tasks onto separate threads on the same CPU.

You may also take advantage of special hardware the computers might have, like graphics cards with lots of
CPU cores, or InfiniBand network interface cards etc.
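As a small sketch of the last two levels (separate threads or CPUs within one computer), the following C program uses OpenMP sections to hand two independent tasks to different threads. It is illustrative only and assumes an OpenMP-capable compiler (for example gcc with -fopenmp).

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* two independent tasks distributed onto separate threads */
    #pragma omp parallel sections
    {
        #pragma omp section
        { printf("task A running on thread %d\n", omp_get_thread_num()); }

        #pragma omp section
        { printf("task B running on thread %d\n", omp_get_thread_num()); }
    }
    return 0;
}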
Multiprocessors and Multicomputers:
Multiprocessors and multicomputers are two types of parallel computing architectures designed to achieve high-
performance computing by utilizing multiple processing units or nodes. While both architectures aim to harness
parallelism, they have different approaches to achieving this goal.
**Multiprocessors:**
A multiprocessor architecture involves multiple processing units (CPUs) within a single machine, sharing common
resources such as memory and I/O devices. The primary idea is to enhance performance by executing multiple tasks
concurrently within a single system.
Key characteristics of multiprocessor architectures:
1. **Shared Memory**: All processors in a multiprocessor system share a single address space and memory. This allows
for easy communication and data sharing between processors but also requires efficient memory management and
synchronization mechanisms to avoid conflicts.
2. **Symmetric Multiprocessing (SMP)**: In SMP systems, all processors are identical and have equal access to memory
and I/O. Operating systems and software can run on any processor, and load balancing is critical for optimal performance.
3. **Uniform Memory Access (UMA)**: In UMA systems, memory access time is approximately the same for all
processors. This ensures balanced access to memory and avoids performance bottlenecks.
4. **NUMA (Non-Uniform Memory Access)**: In NUMA systems, processors are divided into nodes, each with its own
local memory. Accessing local memory is faster than accessing remote memory, leading to potential performance
variations depending on memory location.
**Multicomputers:**
A multicomputer architecture, on the other hand, involves multiple independent computers (nodes) connected by a high-
speed interconnect. Each node has its own local memory and resources, and communication between nodes is typically
achieved through message passing.
Key characteristics of multicomputer architectures:
1. **Distributed Memory**: Each node in a multicomputer system has its own local memory, and nodes communicate by
sending messages to each other. This architecture is more scalable and can handle larger systems compared to shared
memory architectures.
2. **Message Passing**: Communication between nodes is achieved through message-passing protocols like MPI
(Message Passing Interface). While this approach allows for greater scalability, it requires careful design to manage
communication efficiently.
3. **Scalability**: Multicomputers are designed with scalability in mind, allowing for easy expansion by adding more
nodes to the system. This approach is well-suited for large-scale computing clusters and data centers.
4. **Data Locality**: Multicomputers can achieve good data locality by distributing data across nodes and minimizing the
need for data movement.
In summary, multiprocessors and multicomputers are two distinct parallel computing architectures. Multiprocessors focus
on achieving parallelism by using multiple processors within a single machine, with shared memory and symmetric
multiprocessing. Multicomputers, on the other hand, utilize multiple independent computers connected by a high-speed
interconnect, relying on distributed memory and message-passing communication for parallel processing. Both
architectures have their strengths and considerations, and the choice depends on the specific requirements of the
application and workload.
Multiprocessor system interconnects:
Multiprocessor System Interconnects
Parallel processing needs the use of efficient system interconnects for fast communication among the Input/Output and
peripheral devices, multiprocessors and shared memory.
Hierarchical Bus Systems
A hierarchical bus system consists of a hierarchy of buses connecting various systems and sub-systems/components in a
computer. Each bus is made up of a number of signal, control, and power lines. Different buses like local buses, backplane
buses and I/O buses are used to perform different interconnection functions.
Local buses are the buses implemented on the printed-circuit boards. A backplane bus is a printed circuit on which many
connectors are used to plug in functional boards. Buses which connect input/output devices to a computer system are known
as I/O buses.
Crossbar switch and Multiport Memory
Switched networks give dynamic interconnections among the inputs and outputs. Small or medium size systems mostly use crossbar networks. Multistage networks can be expanded to larger systems if the increased latency problem can be solved.
Both the crossbar switch and the multiport memory organization are single-stage networks. Though a single-stage network is cheaper to build, multiple passes may be needed to establish certain connections. A multistage network has more than one stage of switch boxes. These networks should be able to connect any input to any output.
Multistage and Combining Networks
Multistage networks, or multistage interconnection networks, are a class of high-speed computer networks which are mainly composed of processing elements on one end of the network and memory elements on the other end, connected by switching elements.
These networks are applied to build larger multiprocessor systems. This includes Omega Network, Butterfly Network and
many more.

In multiprocessor systems, the interconnect plays a crucial role in facilitating communication and data exchange between
the various processing units (CPUs) and memory modules. The choice of interconnect architecture can significantly impact
the system's performance, scalability, and overall efficiency. Here are some common types of interconnects used in
multiprocessor systems:
1. **Bus-Based Interconnect**:
- **Description**: A shared bus is used to connect all processors, memory, and I/O devices. Processors take turns
accessing the bus, resulting in contention and potential bottlenecks.
- **Pros**: Simple and easy to implement. Suitable for smaller systems.
- **Cons**: Limited scalability due to contention. Can result in lower performance as the number of processors
increases.
2. **Crossbar Interconnect**:
- **Description**: A full crossbar switch connects every processor to every memory module directly. This eliminates
contention but can become impractical as the number of processors grows due to the complexity of the interconnect.
- **Pros**: High throughput and low contention.
- **Cons**: Complexity and cost increase as the number of processors grows.
3. **Switched Interconnect**:
- **Description**: Similar to crossbar interconnect, but with switches that allow flexible connections between
processors, memory, and I/O devices. Can be hierarchical with multiple layers of switches.
- **Pros**: Offers scalability and flexibility. Can accommodate large-scale multiprocessor systems.
- **Cons**: Complexity increases with the size of the system.
4. **NUMA (Non-Uniform Memory Access) Interconnect**:
- **Description**: Used in NUMA architectures where each processor has its own local memory, and remote memory
access incurs higher latency. Interconnect aims to optimize memory access by minimizing remote accesses.
- **Pros**: Efficient memory access within local nodes. Suitable for larger systems with distributed memory.
- **Cons**: Complexity in managing memory hierarchy and ensuring balanced access.
5. **Hypercube Interconnect**:
- **Description**: Used in some parallel computing systems. Each node is connected to a fixed number of neighbors,
creating a hypercube topology. Communication pathways are well-defined.
- **Pros**: Efficient and predictable communication paths. Can handle large-scale systems.
- **Cons**: Complexity increases with the number of nodes.
6. **Mesh Interconnect**:
- **Description**: Similar to hypercube but in a grid-like topology. Each node is connected to its adjacent nodes. Used in
some parallel computing and multicore systems.
- **Pros**: Simple and regular topology. Scalable for moderately sized systems.
- **Cons**: Limited scalability compared to more complex interconnects.
7. **Fat-Tree Interconnect**:
- **Description**: Hierarchical interconnect designed to balance bandwidth and reduce contention. Common in high-
performance computing clusters.
- **Pros**: Scalable with efficient bandwidth allocation. Can handle large-scale systems.
- **Cons**: Complexity increases with the number of levels.
The choice of interconnect depends on factors such as the scale of the system, the workload's communication patterns,
the desired level of scalability, and the budget. High-performance systems often use more advanced and complex
interconnects to ensure efficient communication and avoid bottlenecks. Additionally, advances in interconnect technology
continue to shape the design of modern multiprocessor systems.
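As a small illustration of the hypercube interconnect described above (a sketch, not tied to any particular machine): in a d-dimensional hypercube each node is directly connected to the d nodes whose binary labels differ from its own in exactly one bit, which the following C fragment enumerates.

#include <stdio.h>

void print_neighbors(unsigned node, int dimensions) {
    printf("node %u connects to:", node);
    for (int k = 0; k < dimensions; k++) {
        printf(" %u", node ^ (1u << k));   /* flip bit k to get a neighbour */
    }
    printf("\n");
}

int main(void) {
    /* a 3-dimensional hypercube has 8 nodes, each with 3 neighbours */
    for (unsigned n = 0; n < 8; n++) {
        print_neighbors(n, 3);
    }
    return 0;
}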
cache coherence and synchronization mechanism:
Cache coherence and synchronization mechanisms are critical aspects of parallel and multiprocessor computing systems,
ensuring data consistency and correct execution of concurrent programs. Let's delve deeper into these concepts:
**Cache Coherence:**

Cache coherence refers to the consistent and correct behavior of multiple caches in a system that store copies of the same
memory location. It ensures that all processors or cores observe a single, coherent view of memory. Cache coherence
protocols manage how data is shared and updated across caches. Some commonly used protocols include MESI, MOESI,
and MESIF.
Key points about cache coherence:
1. **Reads and Writes**: When one processor writes to a memory location, other processors' caches holding that
location's data need to be updated to reflect the new value. Reads also need to ensure that the latest value is obtained
from memory or the appropriate cache.
2. **Invalidation and Propagation**: Cache coherence protocols use invalidation and update mechanisms. Invalidation
invalidates copies of the data in other caches, while updates propagate the new value to other caches.
3. **Coherence States**: Each cache line has different coherence states like Invalid, Shared, Modified, etc. These states
dictate the behavior of the cache when a read or write operation occurs.
4. **Snooping**: Processors "snoop" the bus to observe memory transactions. They monitor bus traffic to determine
whether a memory location is being accessed by another processor.
**Synchronization Mechanisms:**
Synchronization mechanisms ensure proper coordination among concurrent threads or processes to avoid issues like race
conditions, deadlocks, and data inconsistencies. They provide ways for threads or processes to communicate and
coordinate their execution.
Key synchronization mechanisms:
1. **Locks and Mutexes**: Locks (mutual exclusion) are used to protect shared resources. Threads acquire locks to access
resources and release them afterward. Mutexes (mutual exclusion semaphores) are commonly used to provide exclusive
access to a resource.
2. **Semaphores**: Semaphores are used to control access to a certain number of identical resources. They allow a
specified number of threads to access the resource simultaneously, while others wait.
3. **Condition Variables**: Condition variables allow threads to wait for a certain condition to be met before proceeding.
They're often used in scenarios where threads need to wait for specific events or signals.
4. **Barriers**: Barriers synchronize a group of threads by forcing them to wait until all threads in the group reach a
certain point in the program before proceeding.
5. **Atomic Operations**: Atomic operations are indivisible operations that ensure no other thread can interrupt or
observe the operation midway. They're used for operations like incrementing counters or updating shared variables.
6. **Memory Fences/Barriers**: These ensure memory ordering and visibility of changes across threads or cores,
preventing reordering of memory operations that could lead to unexpected behavior.
7. **Read-Modify-Write Operations**: These operations guarantee that read, modification, and write steps are executed
atomically. Common examples include Compare-and-Swap (CAS) and Test-and-Set.
Effective use of synchronization mechanisms ensures that threads or processes work together harmoniously, avoiding
conflicts and ensuring the correctness of concurrent programs.
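The lock/mutex idea above can be sketched with POSIX threads (an illustrative example only; it assumes a pthreads environment, e.g. compiling with -pthread). Two threads increment a shared counter, and the mutex serialises the read-modify-write so no updates are lost.

#include <pthread.h>
#include <stdio.h>

long counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);
        counter++;                    /* protected critical section */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* 200000 with the lock in place */
    return 0;
}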
Both cache coherence and synchronization mechanisms are essential for creating high-performance and reliable parallel
systems, whether in multiprocessor CPUs, multicore processors, or distributed computing environments. However,
improper use or overuse of synchronization mechanisms can lead to performance bottlenecks or even deadlocks, so
careful consideration is crucial during system design and programming.

Three Generations of Multicomputers


Multicomputers have evolved over different generations to meet the demands of increasingly complex and data-intensive
applications. Here are three generations of multicomputers:
**First Generation: Early Multicomputers (1970s - 1980s)**
During this period, multicomputers emerged as a solution to address the limitations of single-processor systems in
handling computationally intensive tasks. These systems were characterized by distributed memory architectures and
message-passing communication mechanisms.
Key characteristics of the first generation of multicomputers:
1. **Distributed Memory**: Each processing node had its own local memory, and communication between nodes was
achieved through message passing.
2. **Limited Scalability**: These systems were relatively small-scale, often consisting of tens to a few hundred nodes.
3. **Custom Interconnects**: Interconnects were custom-designed and optimized for the specific applications they
targeted.
4. **Message-Passing**: Communication between nodes was explicitly managed through message-passing libraries or
APIs, such as PVM (Parallel Virtual Machine) and MPI (Message Passing Interface).
5. **Programming Challenges**: Developing applications for these systems required explicit management of data
distribution and communication, making programming complex.
**Second Generation: Cluster Computing (1990s - Early 2000s)**
The second generation saw the rise of cluster computing, where commodity hardware components were interconnected
to create cost-effective and scalable parallel computing systems.
Key characteristics of the second generation of multicomputers:
1. **Commodity Hardware**: Clusters used off-the-shelf components like x86 processors and Ethernet networking.
2. **Scalability**: Clusters were designed for scalability, often comprising hundreds or thousands of nodes.
3. **Message-Passing and Shared-Memory Models**: Clusters supported both message-passing and shared-memory
programming models, making it easier to port applications.
4. **Linux and Open Source**: Linux became a popular operating system for cluster environments due to its flexibility and
cost-effectiveness.
5. **High-Performance Interconnects**: Some clusters used high-speed interconnects like InfiniBand to reduce
communication overhead.
**Third Generation: Modern High-Performance Computing (Mid-2000s - Present)**
The third generation of multicomputers is characterized by advancements in high-performance computing (HPC)
architectures and technologies.
Key characteristics of the third generation of multicomputers:
1. **Hybrid Architectures**: Many modern multicomputers combine both traditional CPUs and specialized accelerators
like GPUs or FPGAs for enhanced performance.
2. **Massive Scalability**: Supercomputers, like those in the TOP500 list, now consist of tens of thousands of nodes and
can achieve exascale performance.
3. **Advanced Interconnects**: High-speed, low-latency interconnects are crucial for efficient communication in these
systems. Examples include Cray's Aries interconnect and Intel's Omni-Path Architecture.
4. **Energy Efficiency**: With concerns about power consumption, modern multicomputers focus on energy-efficient
designs and heterogeneous architectures.
5. **Programming Models and Libraries**: Programming models and libraries have evolved to simplify parallel
programming, such as OpenMP, CUDA, and parallelism frameworks like OpenACC.
6. **AI and Data-Intensive Workloads**: Multicomputers are now increasingly used for AI and data-intensive tasks, leading
to the convergence of HPC and data science.
These generations illustrate the evolution of multicomputers from custom-built, distributed memory systems to modern,
highly scalable supercomputers capable of tackling some of the most complex computational challenges in various fields.

Message-passing
https://pt.slideshare.net/keerthikaA8/message-passing-mechanisms

Message-passing mechanisms are fundamental communication techniques used in parallel and distributed
computing systems to enable different processes or threads to exchange information and coordinate their
activities. These mechanisms facilitate communication between separate entities, such as processes running on
different processors or nodes, and are particularly important in systems where shared memory is not directly
accessible to all components. Message-passing is commonly used in multicomputers, clusters, and distributed
computing environments.

Here are some key concepts related to message-passing mechanisms:


1. Point-to-Point Communication:
 In point-to-point communication, a message is sent from one specific sender to a specific receiver.
 It involves two main operations: sending a message and receiving a message.
2. Message Passing Interface (MPI):
 MPI is a widely used standard for message-passing programming in parallel and distributed
computing.
 It provides a set of library routines and communication protocols that allow processes to exchange
data and coordinate their activities.
 MPI supports both blocking and non-blocking communication operations.
3. Communication Modes:
 Blocking Communication: The sender and receiver are synchronized during the communication
operation. The sender waits until the receiver is ready to receive the message.
 Non-Blocking Communication: The sender initiates the communication and continues its work
without waiting for the receiver to complete the operation. This can improve overlapping of
computation and communication.
4. Collective Communication:
 Collective communication involves communication operations that involve multiple processes in a
coordinated manner.
 Examples include broadcasting data to all processes, gathering data from all processes, and
performing reductions (sum, product, etc.) across processes.
5. Message Buffering:
 Messages are often placed in a buffer before being sent. This buffer can be in the memory of the
sender or in a dedicated communication buffer.
 Buffering allows asynchronous communication, where the sender can continue its computation while
the message is being transmitted.
6. Synchronous vs. Asynchronous Communication:
 Synchronous Communication: The sender and receiver synchronize during the communication
operation. Both parties must be ready for communication to occur.
 Asynchronous Communication: The sender and receiver can continue their work independently
after the communication operation is initiated. The actual data exchange happens at a later time.
7. Message Passing Libraries and APIs:
 In addition to MPI, various programming languages and frameworks offer message-passing libraries
or APIs for communication in distributed and parallel systems. Examples include OpenMP,
OpenSHMEM, and CUDA.
8. Interconnects:
 Efficient message passing often requires a reliable and high-speed interconnect between nodes or
processors. Interconnects like InfiniBand, Ethernet, and proprietary technologies are used to facilitate
communication.

Message-passing mechanisms are essential for achieving effective communication and synchronization in
distributed systems, enabling parallel processes to collaborate and exchange data efficiently. These mechanisms
form the foundation for developing scalable and high-performance parallel and distributed applications.
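A minimal point-to-point, blocking message-passing sketch using MPI is shown below (illustrative only; it assumes an installed MPI implementation and would typically be launched with something like mpirun -np 2).

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* blocking send to rank 1 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }
    MPI_Finalize();
    return 0;
}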
**Multi-vector Computers:**

Multi-vector computers are a type of parallel computing architecture designed to process multiple data elements
simultaneously using vector operations. In a multi-vector computer, multiple vector processors work together to perform
vector computations. A vector processor operates on arrays (vectors) of data elements in a single instruction, allowing for
efficient execution of operations like addition, multiplication, and other mathematical functions.

Key features of multi-vector computers:


- **Vector Pipelines**: Multi-vector computers often employ vector pipelines to efficiently process multiple vector
operations in parallel.
- **Vector Registers**: These systems use specialized vector registers to store and manipulate vector data.
- **Array Processing**: They excel at performing computations on large arrays of data, making them suitable for scientific
simulations, numerical analysis, and signal processing.
- **Data Parallelism**: Multi-vector computers exploit data parallelism, where the same operation is applied to multiple
data elements simultaneously.
- **Programming Models**: Writing software for multi-vector computers requires understanding vector instructions and
programming models, which can be more complex than scalar programming.

**SIMD (Single Instruction, Multiple Data) Computers:**

SIMD computers are another type of parallel architecture where a single instruction is executed on multiple data elements
in parallel. In a SIMD system, a single control unit (instruction fetch/decode unit) broadcasts the same instruction to
multiple processing elements, and each processing element operates on its corresponding data element.

Key features of SIMD computers:


- **Data-Level Parallelism**: SIMD systems excel at tasks that involve data-level parallelism, such as image and video
processing.
- **Massive Parallelism**: These systems can have a large number of processing elements, allowing them to process
multiple data elements concurrently.
- **Streaming Data**: SIMD architecture is well-suited for tasks that involve streaming data, where the same operation is
performed on successive data elements.
- **Vector Registers**: Like in multi-vector computers, SIMD systems often use vector registers to hold the data elements
being processed.
- **Programming**: SIMD programming requires specialized techniques and tools to express parallelism at the data level.
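A tiny data-parallel sketch using x86 SSE intrinsics makes the single-instruction, multiple-data idea concrete (an illustration only; it assumes an SSE-capable CPU and compiler):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);      /* load 4 floats into a 128-bit vector register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);   /* one SIMD instruction performs four additions */
    _mm_storeu_ps(c, vc);

    printf("%.1f %.1f %.1f %.1f\n", c[0], c[1], c[2], c[3]);
    return 0;
}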

Vector Computer:
 In a vector computer, a vector processor is attached to the scalar processor as an optional feature.
 The host computer first loads program and data to the main memory.
 Then the scalar control unit decodes all the instructions.
 If the decoded instructions are scalar operations or program operations, the scalar processor executes
those operations using scalar functional pipelines.
 On the other hand, if the decoded instructions are vector operations then the instructions will be sent to
vector control unit.
SIMD Computer:
In SIMD computers, ‘N’ number of processors are connected to a control unit and all the processors have their
individual memory units. All the processors are connected by an interconnection network.
MODULE:5
Vector processing is a computer architecture and processing technique that involves the use of vector operations to
process multiple data elements in parallel. It's designed to handle tasks that involve a large amount of data that can be
processed simultaneously, such as in scientific simulations, simulations, graphics rendering, and other high-performance
computing applications. Vector processing is particularly effective when the same operation needs to be applied to a large
set of data elements, as it can perform these operations in a single instruction, greatly enhancing processing efficiency.
The two commonly used architectures for vector processing are pipeline processors and parallel array processors. In a pipeline, data processing elements are connected in series, and the output of one process is the input for the next successive process.
Features of Vector Processing
There are various features of Vector Processing which are as follows −
 A vector is a structured set of elements. The elements in a vector are scalar quantities. A vector operand
includes an ordered set of n elements, where n is known as the length of the vector.
 Each clock period processes two successive pairs of elements. During one single clock period, the dual vector
pipes and the dual sets of vector functional units allow the processing of two pairs of elements.
As the completion of each pair of operations takes place, the results are delivered to the appropriate elements of the result register. The operation continues until the number of elements processed equals the count specified by the vector length register.
 In parallel vector processing, more than two results are generated per clock cycle. The parallel vector
operations are automatically started under the following two circumstances −
o When successive vector instructions facilitate different functional units and multiple vector
registers.
o When successive vector instructions use the resulting flow from one vector register as the operand
of another operation utilizing a different functional unit. This phase is known as chaining.
 A vector processor performs better with longer vectors because of the startup (pipeline fill) delay in a pipeline.
 Vector processing decreases the overhead related to the maintenance of loop-control variables, which makes it more efficient than scalar processing.

Here are some key principles of vector processing:


1. **Data-Level Parallelism:** Vector processing takes advantage of data-level parallelism, where a single instruction
operates on multiple data elements (vectors) in a single clock cycle. This is in contrast to scalar processing, where each
instruction operates on only one data element.
2. **Vector Registers:** Vector processors contain specialized registers known as vector registers, which can hold multiple
data elements simultaneously. These registers are much wider than the standard scalar registers found in conventional
processors.

3. **Vector Instructions:** Vector processors utilize vector instructions that define operations to be performed on entire
vectors of data. These instructions include arithmetic operations (addition, subtraction, multiplication, division), logical
operations, and memory operations. A single vector instruction can operate on all the elements in a vector register
simultaneously.
4. **SIMD (Single Instruction, Multiple Data):** Vector processing often falls under the category of SIMD architectures.
This means that a single instruction is applied to multiple data elements at the same time. SIMD architectures can provide
significant performance improvements for tasks that involve repetitive operations on large datasets.
5. **Compiler Optimization:** Compilers play a crucial role in vector processing. They need to identify opportunities for
applying vectorization and generating the appropriate vector instructions from high-level programming code.
6. **Memory Access Patterns:** Efficient memory access is essential for vector processing. Data elements in vectors are
often stored in contiguous memory locations to enable efficient loading and storing of vector data.
7. **Loop Structures:** Many vector processing tasks involve loops where the same operation is applied to multiple data
elements iteratively. Vectorizing these loops involves transforming them into vector operations, which can greatly
accelerate processing.
8. **Software Challenges:** While vector processing can provide significant performance gains, it requires software to be
properly optimized for it. Not all algorithms are well-suited for vectorization, and some tasks might require modifications
to take full advantage of vector processing capabilities.
9. **Vector Length:** The length of a vector refers to the number of data elements that can be processed in parallel in a
single instruction. Longer vectors generally lead to higher performance, but there's a trade-off between vector length and
the complexity of the hardware.
10. **Applications:** Vector processing is commonly used in fields such as scientific computing, graphics processing,
simulations, weather forecasting, and more, where large amounts of data need to be processed efficiently.
It's important to note that while vector processing can offer significant performance benefits for certain types of tasks, not
all applications can take full advantage of this architecture. The effectiveness of vector processing depends on the nature
of the task and the ability to parallelize operations on large sets of data.
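As a small illustration of the loop-structure and compiler-optimization points above, the following C loop has contiguous memory access and no cross-iteration dependences, so an optimizing compiler (for example gcc or clang at -O3) can typically auto-vectorize it into SIMD/vector instructions; whether it actually does depends on the compiler and target.

#include <stddef.h>

/* y[i] = a * x[i] + y[i]: the same operation applied element by element */
void saxpy(float a, const float *x, float *y, size_t n) {
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}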
Multivector Multiprocessors
Multivector multiprocessors, also known as vector multiprocessors or parallel vector processors, are a type of
computer architecture that combines the principles of vector processing with multiprocessing. These systems are
designed to deliver high-performance computing by utilizing multiple processors (or processing elements)
working in parallel, each capable of performing vector operations on large datasets. This architecture was
particularly popular in the 1970s and 1980s for scientific and engineering applications that required intensive
calculations.

Key features and concepts of multivector multiprocessors include:


1. Multiple Processors: A multivector multiprocessor system consists of multiple processing elements or
CPUs that work together to execute tasks. Each processor is capable of performing scalar and vector
operations.
2. Vector Registers: Each processor has its own set of vector registers that can store multiple data elements
(vectors). These registers are wider than typical scalar registers to accommodate the larger data sets used
in vector operations.
3. Vector Instructions: The processors in a multivector multiprocessor are equipped with vector instructions
that allow them to perform operations on entire vectors of data in a single instruction. These instructions
are essential for exploiting data-level parallelism.
4. Memory System: Efficient memory access is critical for the performance of multivector multiprocessors.
These systems typically have high-speed memory subsystems that can deliver data to processors at a fast
rate to keep up with their computational capabilities.
5. Interconnection Network: Processors in a multivector multiprocessor need to communicate and share
data. An interconnection network, often in the form of a high-speed bus or a network topology, facilitates
communication and data sharing among processors.
6. Vector Length: The vector length refers to the number of data elements that a vector instruction can
process in parallel. Longer vector lengths can enhance performance, but they also increase the complexity
of the hardware and memory requirements.
7. Software Optimization: To fully utilize the capabilities of multivector multiprocessors, software
applications need to be written or optimized to take advantage of vector instructions and data parallelism.
This often involves transforming algorithms into vectorized code and spreading independent work across the
processors (a brief sketch follows this list).
8. Performance Benefits: Multivector multiprocessors excel at tasks that involve intensive numerical
calculations, simulations, scientific computing, and engineering applications. These architectures can
achieve significant speedup compared to scalar processors for tasks that can be effectively parallelized.
9. Scalability: Multivector multiprocessors can be scaled by adding more processors to the system, allowing
for even greater parallelism and processing power. However, scalability might be limited by factors like
memory bandwidth and communication overhead.
10. Limitations: While multivector multiprocessors provide significant performance advantages for certain
workloads, they are not always the best choice for all types of applications. Some tasks may not be well-
suited for vector processing, and the overhead of managing parallelism and communication can impact
overall performance.
11. Evolution: Over time, the computing landscape has evolved, and newer parallel processing architectures,
such as multicore processors, GPUs, and specialized accelerators, have gained prominence. These
architectures often offer more flexible and efficient solutions for parallel computing tasks.
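As a rough illustration of the software-optimization point (item 7), the sketch below combines processor-level parallelism with a vectorizable inner loop, mirroring the "many processors, each performing vector operations" idea. OpenMP threads stand in for the multiple processors here; the function name, arrays, and the use of OpenMP are assumptions made for the example, not how historical vector multiprocessors were actually programmed.

```c
#include <stddef.h>

/* Each thread receives a contiguous slice of the iteration space; within a
 * slice the unit-stride loop body is a natural target for the compiler's
 * vectorizer, so each "processor" ends up executing vector operations. */
void saxpy_parallel(float alpha, const float *x, float *y, size_t n) {
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        y[i] = alpha * x[i] + y[i];   /* vectorizable element-wise update */
}
```

Compiled with, for example, gcc -O3 -fopenmp, the outer loop is divided among cores while the body on each core can be vectorized, which is the same division of labor a multivector multiprocessor exploits in hardware.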

Overall, multivector multiprocessors were a significant milestone in high-performance computing history and
played a crucial role in advancing scientific and technical research. However, they have been largely replaced by
more modern and versatile parallel computing architectures in today's computing landscape.
SIMD computer Organizations:
SIMD, or Single Instruction, Multiple Data, is a type of computer architecture that is designed to perform the same
operation on multiple data elements in parallel using a single instruction. This architecture is particularly well-suited for
tasks that involve a high degree of data-level parallelism, such as multimedia processing, scientific simulations, and certain
types of signal processing.
In a SIMD computer organization, there is a single control unit that fetches a single instruction from memory, and this
instruction is then broadcast to multiple processing units, each of which operates on a separate data element. The
processing units operate synchronously, meaning they all execute the same instruction at the same time, but on different
data elements. This allows for efficient parallel processing.

Here are some key characteristics and features of SIMD computer organizations:
1. **Vector Registers**: SIMD architectures utilize vector registers to hold multiple data elements that are being operated
on simultaneously. These registers can store a fixed number of data elements, such as 4, 8, or 16, depending on the
architecture.
2. **Single Instruction Stream**: As the name suggests, a single instruction is executed across multiple data elements. This
instruction is fetched from memory and decoded by the control unit, and then the same operation is applied to all data
elements within the vector registers.

3. **Data-Level Parallelism**: SIMD architectures excel at exploiting data-level parallelism, where the same operation is
performed on different pieces of data. This makes them efficient for tasks like graphics processing, image manipulation,
and simulations.
4. **Performance**: SIMD architectures can deliver high performance for specific tasks, but their effectiveness depends
on the nature of the computation. They are particularly useful when the same operation needs to be applied to a large
amount of data.
5. **Instruction Set**: SIMD architectures have specialized instruction sets optimized for parallel data processing. These
instructions include operations like addition, multiplication, and bit manipulation, applied element-wise to the data in
vector registers.
6. **Examples of SIMD Architectures**:
- **Intel's SSE and AVX**: These are extensions to the x86 instruction set architecture, adding SIMD capabilities. AVX
(Advanced Vector Extensions) introduced wider vector registers and more advanced instructions.
- **ARM NEON**: This is a SIMD extension to the ARM architecture, widely used in mobile devices and embedded systems.
- **NVIDIA CUDA**: While not a traditional CPU SIMD architecture, CUDA is a parallel computing platform and
programming model that enables SIMD-like parallelism on NVIDIA GPUs.
7. **Programming Considerations**: Programming for SIMD architectures often involves using compiler intrinsics or
specialized libraries to take advantage of the parallelism. These tools help ensure that the compiler generates the
appropriate SIMD instructions (a short intrinsics sketch follows this list).
8. **Limitations**: SIMD architectures are most effective when the data and operations are uniform across multiple
processing units. If there is significant divergence in the computation, some processing units may remain idle, limiting
overall efficiency.
In recent years, SIMD architectures have become more prevalent due to their usefulness in tasks like graphics rendering,
scientific simulations, and machine learning. However, it's important to choose the right architecture based on the specific
requirements of your application.
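The following is a small sketch of the programming-considerations and divergence points (items 7 and 8), written in C with SSE compiler intrinsics. It rewrites a branchy per-element computation as a masked, branch-free vector computation so that all lanes execute the same instruction stream; the function names and the specific operation are assumptions invented for the example.

```c
#include <immintrin.h>   /* SSE intrinsics; assumes an x86 CPU with SSE support */
#include <stddef.h>

/* Scalar form: the branch lets each element take a different path.
 * c[i] = (a[i] > 0) ? 2 * a[i] : 0                                   */
void scale_positive_scalar(const float *a, float *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = (a[i] > 0.0f) ? 2.0f * a[i] : 0.0f;
}

/* SIMD form: the branch becomes a per-lane mask, so all four lanes
 * follow the same instruction stream, which SIMD hardware requires.  */
void scale_positive_simd(const float *a, float *c, size_t n) {
    const __m128 zero = _mm_setzero_ps();
    const __m128 two  = _mm_set1_ps(2.0f);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va   = _mm_loadu_ps(&a[i]);
        __m128 mask = _mm_cmpgt_ps(va, zero);          /* per-lane a[i] > 0     */
        __m128 prod = _mm_mul_ps(va, two);             /* 2 * a[i] in all lanes */
        _mm_storeu_ps(&c[i], _mm_and_ps(prod, mask));  /* keep result or zero   */
    }
    for (; i < n; i++)                                 /* scalar tail           */
        c[i] = (a[i] > 0.0f) ? 2.0f * a[i] : 0.0f;
}
```

When the per-element work genuinely diverges (different elements needing entirely different code paths), lanes whose mask is false effectively do wasted work, which is the efficiency limitation described in item 8.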
The Connection Machine CM-5
The Connection Machine CM-5 is a massively parallel supercomputer designed and manufactured by Thinking Machines
Corporation in the late 1980s and early 1990s. It grew out of the SIMD (Single Instruction, Multiple Data) Connection
Machines CM-1 and CM-2 and retained their data-parallel programming style, although the CM-5 hardware itself was
organized as a MIMD machine built from SPARC-based processing nodes. It was designed to perform high-speed
calculations for tasks like scientific simulations and data analysis, was one of the most powerful computers of its time,
and contributed to advancements in parallel computing.
Here are some key features and characteristics of the Connection Machine CM-5:
1. **Massively Parallel Architecture**: The CM-5 was designed to scale to hundreds or thousands of processing elements
(PEs), each built around a SPARC processor with optional vector units. Each PE can execute data-parallel operations on
multiple data elements held in its local memory.
2. **Fat-Tree Network**: The CM-5's processors were connected by a fat-tree data network, together with a separate
control network for synchronization and broadcast. This arrangement allowed for efficient data communication and
coordination between processors. (The hypercube topology often associated with the Connection Machines belongs to
the earlier CM-1 and CM-2.)

3. **Data-Parallel Execution Model**: CM-5 programs were typically written in a data-parallel (SIMD-style) fashion, in
which every PE applies the same operation to its own local data simultaneously; the control network provided the fast
synchronization and broadcast needed to support this style. The model is well-suited to tasks that involve regular, data-
parallel computations.
4. **Memory Organization**: The CM-5 was a distributed-memory machine: each PE had its own local memory, and data
shared between PEs had to move over the network. Efficient management of this data movement was crucial for
achieving good performance.
5. **I/O Subsystem**: The CM-5 featured a high-speed I/O subsystem that allowed the system to interact with external
devices and networks for data input, output, and communication.
6. **Applications**: The CM-5 was used for a wide range of applications, including scientific simulations, weather
forecasting, fluid dynamics, and other compute-intensive tasks that could benefit from massive parallelism.
7. **Programming Model**: Writing software for the CM-5 involved dealing with the complexities of parallel
programming. The machine was programmed mainly in data-parallel languages such as C* (a data-parallel extension of
C) and CM Fortran, which let the programmer specify parallel operations and data distribution (a small illustrative
sketch follows this list).
8. **Legacy and Impact**: The CM-5 was influential in the field of parallel computing and contributed to advancements in
the understanding of massively parallel architectures. It demonstrated the potential of parallelism for solving complex
problems, even though programming and managing such systems posed significant challenges.
9. **Downfall**: While the CM-5 was powerful and groundbreaking, its commercial success was limited due to its high
cost, complexity of programming, and competition from other parallel computing architectures. Thinking Machines
Corporation faced financial difficulties and eventually went bankrupt in the mid-1990s.
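To give a flavour of the data-parallel programming model from item 7, the sketch below imitates in plain C the "every node applies the same operation to its own slice of a distributed array" style that languages such as C* and CM Fortran expressed directly. The node count, array names, and the simulation of nodes by an ordinary loop are illustrative assumptions; this is not actual C* code, and on the real machine each node would run the inner loop on its own local memory at the same time as the others.

```c
#include <stddef.h>

#define NUM_NODES 4   /* illustrative number of processing nodes */

/* Node p conceptually owns elements [p*chunk, (p+1)*chunk) of the array.
 * The outer loop merely simulates the nodes one after another; on the
 * CM-5 all nodes would perform their inner loops concurrently. */
void scale_distributed(float *x, size_t n, float alpha) {
    size_t chunk = (n + NUM_NODES - 1) / NUM_NODES;
    for (int p = 0; p < NUM_NODES; p++) {
        size_t lo = (size_t)p * chunk;
        size_t hi = (lo + chunk < n) ? lo + chunk : n;
        for (size_t i = lo; i < hi; i++)     /* same operation, local data */
            x[i] *= alpha;
    }
}
```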
The Connection Machine CM-5 serves as a historical milestone in the development of supercomputing and parallel
computing. While the specific architecture of the CM-5 may not be as prevalent today, its legacy continues to influence the
design and understanding of parallel and high-performance computing systems.
