
Basic computer architecture:-

A computer system is basically a machine that simplifies complicated tasks. It should maximize performance and reduce costs as well as power consumption. The different components in the Computer System Architecture are the Input Unit, Output Unit, Storage Unit, Arithmetic Logic Unit, Control Unit etc.

A diagram that shows the flow of data between these units is as follows −

The input data travels from the input unit to the ALU. Similarly, the computed data travels from the ALU to the output unit. Data constantly moves between the storage unit and the ALU, because stored data is operated on before being stored again. The control unit controls all the other units as well as the data that flows between them.

Details about all the computer units are −

• Input Unit

The input unit provides data to the computer system from the outside. So, basically, it links the external environment with the computer. It takes data from the input devices, converts it into machine language and then loads it into the computer system. Keyboard, mouse etc. are the most commonly used input devices.

• Output Unit

The output unit provides the results of computer processing to the users, i.e. it links the computer with the external environment. Most of the output data is in the form of audio or video. The different output devices are monitors, printers, speakers, headphones etc.

• Storage Unit

The storage unit contains many computer components that are used to store data. It is traditionally divided into primary storage and secondary storage. Primary storage is also known as the main memory and is the memory directly accessible by the CPU. Secondary or external storage is not directly accessible by the CPU. The data from secondary storage needs to be brought into the primary storage before the CPU can use it. Secondary storage contains a large amount of data permanently.

• Arithmetic Logic Unit

All the calculations related to the computer system are performed by the arithmetic logic unit. It can perform operations like addition, subtraction, multiplication, division etc. The control unit transfers data from the storage unit to the arithmetic logic unit when calculations need to be performed. The arithmetic logic unit and the control unit together form the central processing unit.

• Control Unit

This unit controls all the other units of the computer system
and so is known as its central nervous system. It transfers
data throughout the computer as required including from
storage unit to central processing unit and vice versa. The
control unit also dictates how the memory, input output
devices, arithmetic logic unit etc. should behave.

Quantitative Computer Design:-

• Principles that are useful in the design and analysis of computers:
• Make the common case fast!
o If a design trade-off is necessary, favor the frequent case (which is often simpler) over the infrequent case.
o For example, given that overflow in addition is infrequent, favor optimizing the case when no overflow occurs.

• Objective:
• Determine the frequent case.

• Determine how much improvement in performance is possible by
making it faster.

• Amdahl's law can be used to quantify the latter, given that we have information concerning the former.

Quantitative Computer Design

• Amdahl's Law :
o The performance improvement to be gained from
using some faster mode of execution is limited by
the fraction of the time the faster mode can be
used.

o Amdahl's law defines the speedup obtained by using a particular feature:

Speedup = Execution time without the enhancement / Execution time with the enhancement

o Two factors:
• Fraction enhanced: the fraction of compute time in the original machine that can be converted to take advantage of the enhancement.
▪ Always <= 1.
• Speedup enhanced: the improvement gained by the enhanced execution mode, i.e. how much faster the enhanced portion runs in the enhanced mode.

Quantitative Computer Design

• Amdahl's Law (cont):

o Execution time using the original machine with the enhancement:

Execution time_new = Execution time_old × [(1 − Fraction enhanced) + Fraction enhanced / Speedup enhanced]

o Speedup overall using Amdahl's Law:

Speedup_overall = Execution time_old / Execution time_new = 1 / [(1 − Fraction enhanced) + Fraction enhanced / Speedup enhanced]

Quantitative Computer Design

• Amdahl's Law (cont): An example:

o Consider an enhancement that takes 20 ns on a machine with the enhancement and 100 ns on a machine without it. Assume the enhancement can only be used 30% of the time.

o What is the overall speedup?
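As a quick check of the numbers, the following sketch (plain Python, not part of the original notes) plugs this example into Amdahl's law; the enhanced mode is 100/20 = 5 times faster and is usable 30% of the time.

def overall_speedup(fraction_enhanced, speedup_enhanced):
    # Amdahl's law: speedup of the whole task when only part of it is enhanced.
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Example from the notes: 20 ns with the enhancement vs. 100 ns without,
# and the enhancement is usable 30% of the time.
speedup_enhanced = 100 / 20                      # = 5
print(overall_speedup(0.30, speedup_enhanced))   # ~1.32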

Quantitative Computer Design

• Amdahl's Law expresses the law of diminishing returns, i.e.

o Assume the first improvement costs us $1,000.

o Assume we are thinking about spending $100,000 to speed up the 30% by a factor of 500.

o Is this a worthy investment? (We are spending 100 times as much; will we get anything close to a 100-fold increase in performance?)

o NO! The best that we can do, even with an infinite speedup of that 30%, is 1/0.7 ≈ 1.43!

Quantitative Computer Design

• CPU Performance Equation:


o Often it is difficult to measure the improvement in time from a new enhancement directly.

o A second method that decomposes the CPU execution time into three components makes this task simpler.

o CPU Performance Equation:

CPU time = CPU clock cycles for a program × Clock cycle time

▪ where, for example, Clock cycle time = 2 ns for a 500 MHz clock rate.

Quantitative Computer Design

• CPU Performance Equation:

o An alternative to "number of clock cycles" is "number of instructions executed", or the Instruction Count (IC).

o Given both the "number of clock cycles" and the IC of a program, the average Clocks Per Instruction (CPI) is given by:

CPI = CPU clock cycles for a program / IC

o Therefore, CPU performance is dependent on three characteristics: instruction count, clocks per instruction, and clock cycle time:

CPU time = IC × CPI × Clock cycle time

o Note that CPU time is equally dependent on these,
i.e. a 10% improvement in any one leads to a 10%
improvement in CPU time.
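A small numeric sketch of the three-factor equation (the instruction count and CPI values below are hypothetical, chosen only to illustrate the point):

def cpu_time(instruction_count, cpi, clock_cycle_time_s):
    # CPU time = IC x CPI x clock cycle time
    return instruction_count * cpi * clock_cycle_time_s

# Hypothetical program: 1 million instructions, average CPI of 2, 2 ns clock cycle (500 MHz).
base = cpu_time(1_000_000, 2.0, 2e-9)              # 4.0 ms
better_cpi = cpu_time(1_000_000, 2.0 * 0.9, 2e-9)  # a 10% better CPI -> 3.6 ms
print(base, better_cpi, base / better_cpi)         # CPU time also improves by 10%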

Quantitative Computer Design

• CPU Performance Equation:

o One difficulty: it is difficult to change one in isolation from the others:
• Clock cycle time: Hardware and Organization.
• CPI: Organization and Instruction set architecture.
• Instruction count: Instruction set architecture and Compiler technology.

• A variation of this equation is:

CPU clock cycles = Σi (ICi × CPIi), so CPU time = [ Σi (ICi × CPIi) ] × Clock cycle time

• where ICi represents the number of times instruction i is executed in a program and CPIi represents the average number of clock cycles for instruction i.
Measuring and Reporting Performance

The computer user is interested in reducing response time (the time between the start and the completion of an event), also referred to as execution time. The manager of a large data processing center may be interested in increasing throughput (the total amount of work done in a given time).

Even execution time can be defined in different ways depending on what we count. The most straightforward definition of time is called wall-clock time, response time, or elapsed time, which is the latency to complete a task, including disk accesses, memory accesses, input/output activities, operating system overhead, and so on.

Choosing Programs to Evaluate Performance

A computer user who runs the same programs day in and day out would be the perfect candidate to evaluate a new computer. To evaluate a new system the user would simply compare the execution time of her workload, the mixture of programs and operating system commands that users run on a machine. There are five levels of programs used in such circumstances, listed below in decreasing order of accuracy of prediction.

1. Real applications—Although the buyer may not know what fraction of time is spent on these programs, she knows that some users will run them to solve real problems. Examples are compilers for C, text-processing software like Word, and other applications like Photoshop. Real applications have input, output, and options that a user can select when running the program. There is one major downside to using real applications as benchmarks: real applications often encounter portability problems arising from dependences on the operating system or compiler. Enhancing portability often means modifying the source and sometimes eliminating some important activity, such as interactive graphics, which tends to be more system-dependent.

2. Modified (or scripted) applications—In many cases, real applications are used as the building block for a benchmark, either with modifications to the application or with a script that acts as stimulus to the application. Applications are modified for two primary reasons: to enhance portability or to focus on one particular aspect of system performance. For example, to create a CPU-oriented benchmark, I/O may be removed or restructured to minimize its impact on execution time. Scripts are used to reproduce interactive behavior, which might occur on a desktop system, or to simulate complex multiuser interaction, which occurs in a server system.

3. Kernels—Several attempts have been made to extract small, key pieces from real programs and use them to evaluate performance. Livermore Loops and Linpack are the best known examples. Unlike real programs, no user would run kernel programs, for they exist solely to evaluate performance. Kernels are best used to isolate the performance of individual features of a machine to explain the reasons for differences in the performance of real programs.

4. Toy benchmarks—Toy benchmarks are typically between 10 and 100 lines of code and produce a result the user already knows before running the toy program. Programs like Sieve of Eratosthenes, Puzzle, and Quicksort are popular because they are small, easy to type, and run on almost any computer. The best use of such programs is beginning programming assignments.

5. Synthetic benchmarks—Similar in philosophy to kernels, synthetic
benchmarks try to match the average frequency of operations and
operands of a large set of programs. Whetstone and Dhrystone are the
most popular synthetic benchmarks.

Benchmark Suites

Recently, it has become popular to put together collections of benchmarks to try to measure the performance of processors with a variety of applications. One of the most successful attempts to create standardized benchmark application suites has been SPEC (Standard Performance Evaluation Corporation), which had its roots in late 1980s efforts to deliver better benchmarks for workstations. Just as the computer industry has evolved over time, so has the need for different benchmark suites, and there are now SPEC benchmarks to cover different application classes, as well as other suites based on the SPEC model, as shown in the figure.

Desktop Benchmarks

Desktop benchmarks divide into two broad classes: CPU-intensive benchmarks and graphics-intensive benchmarks (although many graphics benchmarks include intensive CPU activity). SPEC originally created a benchmark set focusing on CPU performance (initially called SPEC89), which has evolved into its fourth generation: SPEC CPU2000, which follows SPEC95 and SPEC92.

Although SPEC CPU2000 is aimed at CPU performance, two different types of graphics benchmarks were created by SPEC: SPECviewperf is used for benchmarking systems supporting the OpenGL graphics library, while SPECapc consists of applications that make extensive use of graphics. SPECviewperf measures the 3D rendering performance of systems running under OpenGL using a 3-D model and a series of OpenGL calls that transform the model. SPECapc consists of runs of three large applications:

1. Pro/Engineer: a solid modeling application that does extensive 3-D rendering. The input script is a model of a photocopying machine consisting of 370,000 triangles.

2. SolidWorks 99: a 3-D CAD/CAM design tool running a series of five tests varying from I/O intensive to CPU intensive. The largest test input is a model of an assembly line consisting of 276,000 triangles.

3. Unigraphics V15: The benchmark is based on an aircraft model and covers a wide spectrum of Unigraphics functionality, including assembly, drafting, numeric control machining, solid modeling, and optimization. The inputs are all part of an aircraft design.

Server Benchmarks

Just as servers have multiple functions, so there are multiple types of benchmarks. The simplest benchmark is perhaps a CPU throughput oriented benchmark. SPEC CPU2000 uses the SPEC CPU benchmarks to
construct a simple throughput benchmark where the processing rate
of a multiprocessor can be measured by running multiple copies
(usually as many as there are CPUs) of each SPEC CPU benchmark and
converting the CPU time into a rate. This leads to a measurement called
the SPECRate. Other than SPECRate, most server applications and
benchmarks have significant I/O activity arising from either disk or
network traffic, including benchmarks for file server systems, for web
servers, and for database and transaction processing systems. SPEC
offers both a file server benchmark (SPECSFS) and a web server
benchmark (SPECWeb). SPECSFS is a benchmark for measuring NFS
(Network File System) performance using a script of file server
requests; it tests the performance of the I/O system (both disk and
network I/O) as well as the CPU. SPECSFS is a throughput oriented
benchmark but with important response time requirements.

Transaction processing benchmarks measure the ability of a system to handle transactions, which consist of database accesses and
updates. All the TPC benchmarks measure performance in
transactions per second. In addition, they include a response-time
requirement, so that throughput performance is measured only when
the response time limit is met. To model real-world systems, higher
transaction rates are also associated with larger systems, both in terms
of users and the data base that the transactions are applied to. Finally,
the system cost for a benchmark system must also be included,
allowing accurate comparisons of cost-performance.

Embedded Benchmarks

Benchmarks for embedded computing systems are in a far more nascent state than those for either desktop or server environments. In
fact, many manufacturers quote Dhrystone performance, a benchmark
that was criticized and given up by desktop systems more than 10
years ago! As mentioned earlier, the enormous variety in embedded
applications, as well as differences in performance requirements (hard
real-time, soft real-time, and overall cost-performance), make the use
of a single set of benchmarks unrealistic.

In practice, many designers of embedded systems devise benchmarks that reflect their application, either as kernels or as stand-alone versions of the entire application. For those embedded applications that can be characterized well by kernel performance, the best standardized set of benchmarks appears to be a new benchmark set: the EDN Embedded Microprocessor Benchmark Consortium (or EEMBC, pronounced "embassy"). The EEMBC benchmarks fall into five classes: automotive/industrial, consumer, networking, office automation, and telecommunications. The figure shows the five different application classes, which include 34 benchmarks.

Basic Pipelining:-

Pipelining is the process of accumulating instructions from the processor through a pipeline. It allows storing and executing instructions in an orderly process. It is also known as pipeline processing.

Pipelining is a technique where multiple instructions are overlapped during execution. The pipeline is divided into stages and these stages are connected with one another to form a pipe-like structure. Instructions enter from one end and exit from the other end.

Pipelining increases the overall instruction throughput.
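As a rough illustration of that throughput gain (the equal-delay k-stage timing model below is an assumption, not something stated in the notes): with k stages, n instructions finish in k + n - 1 cycles instead of k * n.

def pipelined_cycles(n_instructions, k_stages):
    # The first instruction takes k cycles; each later one completes one cycle after the previous.
    return k_stages + (n_instructions - 1)

def unpipelined_cycles(n_instructions, k_stages):
    # Without pipelining, every instruction occupies all k stages by itself.
    return k_stages * n_instructions

n, k = 100, 5
print(unpipelined_cycles(n, k) / pipelined_cycles(n, k))   # ~4.8x, approaching k for large n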

In a pipeline system, each segment consists of an input register followed by a combinational circuit. The register is used to hold data and the combinational circuit performs operations on it. The output of the combinational circuit is applied to the input register of the next segment.

10
Pipeline system is like the modern day assembly line setup in
factories. For example in a car manufacturing industry, huge assembly
lines are setup and at each point, there are robotic arms to perform a
certain task, and then the car moves on ahead to the next arm.

Types of Pipeline

It is divided into 2 categories:

1. Arithmetic Pipeline
2. Instruction Pipeline

Arithmetic Pipeline

Arithmetic pipelines are usually found in most of the computers. They are used for floating point operations, multiplication of fixed point numbers etc. For example, the input to the Floating Point Adder pipeline is:

X = A*2^a

Y = B*2^b
Here A and B are mantissas (significant digit of floating point
numbers), while a and b are exponents.

The floating point addition and subtraction is done in 4 parts:

1. Compare the exponents.


2. Align the mantissas.
3. Add or subtract mantissas
4. Produce the result.

Registers are used for storing the intermediate results between the
above operations.

Instruction Pipeline

In this, a stream of instructions can be executed by overlapping the fetch, decode and execute phases of an instruction cycle. This type of technique is used to increase the throughput of the computer system.

An instruction pipeline reads instructions from the memory while previous instructions are being executed in other segments of the pipeline. Thus we can execute multiple instructions simultaneously. The pipeline will be more efficient if the instruction cycle is divided into segments of equal duration.

Pipeline Conflicts

There are some factors that cause the pipeline to deviate from its normal performance. Some of these factors are given below:

1. Timing Variations:- All stages cannot take the same amount of time. This problem generally occurs in instruction processing, where different instructions have different operand requirements and thus different processing times.
2. Data Hazards:- When several instructions are in partial execution, a problem arises if they reference the same data. We must ensure that the next instruction does not attempt to access the data before the current instruction has finished with it, because this will lead to incorrect results.

3. Branching

In order to fetch and execute the next instruction, we must know what
that instruction is. If the present instruction is a conditional branch,
and its result will lead us to the next instruction, then the next
instruction may not be known until the current one is processed.

4. Interrupts

Interrupts insert unwanted instructions into the instruction stream. Interrupts affect the execution of instructions.

5. Data Dependency

It arises when an instruction depends upon the result of a previous instruction but this result is not yet available.

Advantages of Pipelining

1. The cycle time of the processor is reduced.
2. It increases the throughput of the system.
3. It makes the system reliable.

Disadvantages of Pipelining

1. The design of a pipelined processor is complex and costly to manufacture.
2. The instruction latency is higher.

Arithmetic Pipeline and Instruction Pipeline:-

1. Arithmetic Pipeline: An arithmetic pipeline divides an arithmetic problem into various sub-problems for execution in various pipeline segments. It is used for floating point operations, multiplication and various other computations. The process or flowchart of the arithmetic pipeline for floating point addition is shown in the diagram.

Floating point addition using arithmetic pipeline:
The following sub-operations are performed in this case:

1. Compare the exponents.


2. Align the mantissas.
3. Add or subtract the mantissas.
4. Normalise the result

First of all the two exponents are compared and the larger of the two exponents is chosen as the exponent of the result. The difference between the exponents then decides how many places the mantissa of the number with the smaller exponent must be shifted to the right. After this shift, both mantissas are aligned. Finally the addition of the two mantissas takes place, followed by normalisation of the result in the last segment.
Example:
Let us consider two numbers,

X=0.3214*10^3 and Y=0.4500*10^2

Explanation:
First of all the two exponents are subtracted to give 3-2=1. Thus 3 becomes the exponent of the result, and the mantissa of the number with the smaller exponent is shifted 1 place to the right to give

Y=0.0450*10^3

Finally the two numbers are added to produce

Z=0.3664*10^3

As the result is already normalized the result remains the same.
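A minimal Python sketch of the four sub-operations, applied to the worked example above (the decimal, 4-digit mantissa representation is an assumption chosen only to mirror those numbers):

def fp_add(a_mant, a_exp, b_mant, b_exp, digits=4):
    # 1. Compare the exponents and keep the larger one as the result exponent.
    if a_exp < b_exp:
        a_mant, a_exp, b_mant, b_exp = b_mant, b_exp, a_mant, a_exp
    # 2. Align the mantissas: shift the mantissa of the smaller-exponent operand right.
    b_mant = b_mant / (10 ** (a_exp - b_exp))
    # 3. Add the mantissas.
    mant, exp = a_mant + b_mant, a_exp
    # 4. Normalise the result so the mantissa stays below 1.0.
    while abs(mant) >= 1.0:
        mant, exp = mant / 10, exp + 1
    return round(mant, digits), exp

print(fp_add(0.3214, 3, 0.4500, 2))   # (0.3664, 3), i.e. Z = 0.3664 * 10^3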


2. Instruction Pipeline: In this a stream of instructions can be
executed by overlapping fetch, decode and execute phases of an
instruction cycle. This type of technique is used to increase the
throughput of the computer system. An instruction pipeline
reads instruction from the memory while previous instructions
are being executed in other segments of the pipeline. Thus we
can execute multiple instructions simultaneously. The pipeline
will be more efficient if the instruction cycle is divided into
segments of equal duration.
In the most general case computer needs to process each instruction
in following sequence of steps:

1. Fetch the instruction from memory (FI)


2. Decode the instruction (DA)
3. Calculate the effective address
4. Fetch the operands from memory (FO)
5. Execute the instruction (EX)
6. Store the result in the proper place

The flowchart for instruction pipeline is shown below.

Let us see an example of instruction pipeline.
Example:

Here the instruction is fetched in the first clock cycle in segment 1. It is decoded in the next clock cycle, then its operands are fetched and finally the instruction is executed. We can see that here the fetch and decode phases overlap due to pipelining: by the time the first instruction is being decoded, the next instruction is fetched by the pipeline.
In the case of the third instruction, we see that it is a branch instruction. While it is being decoded, the 4th instruction is fetched simultaneously. But as it is a branch instruction it may point to some other instruction once it is decoded. Thus the fourth instruction is kept on hold until the branch instruction is executed. When it gets executed, the fourth instruction is copied back and the other phases continue as usual.
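The sketch below is a simplified, assumed 4-segment (FI/DA/FO/EX) model of this behaviour, not something taken from the notes; for simplicity the held fetch is just delayed rather than fetched and discarded. It prints the cycle in which each instruction enters each segment, with instruction 3 treated as the branch:

STAGES = ["FI", "DA", "FO", "EX"]

def schedule(n_instr, branch_at=3):
    # Returns {instruction: {stage: cycle}} for a 4-segment pipeline where the
    # fetch of the instruction after `branch_at` waits for the branch's EX stage.
    start = {}
    next_fetch_cycle = 1
    for i in range(1, n_instr + 1):
        start[i] = {s: next_fetch_cycle + k for k, s in enumerate(STAGES)}
        if i == branch_at:
            next_fetch_cycle = start[i]["EX"] + 1   # hold the next fetch
        else:
            next_fetch_cycle += 1
    return start

for instr, cycles in schedule(5).items():
    print(f"I{instr}:", cycles)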

Data hazard, structural hazard, and control hazard are three categories of common hazards. Hazards in the domain of central processing unit (CPU) design are problems with the instruction pipeline in CPU micro-architectures when the next instruction cannot execute in the next clock cycle, which might possibly result in inaccurate calculation results.

A data hazard arises if an instruction accesses a register that a preceding instruction overwrites in a future cycle. Unless data hazards are dealt with, the pipeline will produce erroneous outputs. In this article, we shall dig further into data hazards.

What is a Data hazard:-

A data hazard in pipelining arises when one instruction is dependent on the result of a preceding instruction and that result has not yet been calculated. Whenever two distinct instructions make use of the same storage location, that location must appear to be accessed in sequential (program) order.

In other words, a data hazard in computer architecture occurs when instructions with a data dependency change data at several stages of a pipeline. Ignoring possible data hazards might lead to race conditions (also termed race hazards).

Types of Data Hazard in Pipelining

The data hazard or dependency between instructions exists in the three types that are as follows:

• Read after Write (RAW)
• Write after Read (WAR)
• Write after Write (WAW)

Read after Write (RAW)

Read after Write (RAW) is also known as true dependency or flow dependency. A read-after-write (RAW) data hazard occurs when an instruction refers to a result that has not yet been computed or retrieved. This can happen because, even when an instruction is executed after another, the previous instruction has only been processed partially through the pipeline.

For example, consider the two instructions:

I1. R2 <- R5 + R3

I2. R4 <- R2 + R3

The first instruction computes a value to be saved in register R2, and the second uses this value to compute a result for register R4. However, in a pipeline, when the operands for the second operation are retrieved, the results of the first operation have not yet been stored, resulting in a data dependence.

Instruction I2 has a data dependence since it is dependent on the execution of instruction I1.

Write after Read (WAR)

Write after Read (WAR) is also known as anti-dependency. These data hazards arise when an instruction writes to a register shortly after that register has been read by a previous instruction.

As an example, consider the two instructions:

I1. R4 <- R1 + R5

I2. R5 <- R1 + R2

When there is a chance that I2 will end before I1 (i.e., when there is
concurrent execution), it must be assured that the result of register R5
is not saved before I1 has had a chance to obtain the operands.

Write after Write (WAW)

Write after Write (WAW) is also known as output dependency. These data hazards arise when an instruction writes to an output register that a previous instruction also writes, so the two writes must complete in program order.

As an example, consider the two instructions:

I1. R2 <- R4 + R7

I2. R2 <- R1 + R3

The write-back (WB) of I2 must be postponed until I1 has completed its execution.
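A small illustrative Python sketch (not part of the original notes) that classifies the dependency of a second instruction on a first from their destination and source registers, reproducing the three examples above:

def classify(dest1, srcs1, dest2, srcs2):
    # Return the data dependencies of instruction 2 on instruction 1.
    deps = []
    if dest1 in srcs2:
        deps.append("RAW")     # I2 reads what I1 writes
    if dest2 in srcs1:
        deps.append("WAR")     # I2 writes what I1 still reads
    if dest1 == dest2:
        deps.append("WAW")     # both write the same register
    return deps or ["none"]

print(classify("R2", {"R5", "R3"}, "R4", {"R2", "R3"}))   # ['RAW']
print(classify("R4", {"R1", "R5"}, "R5", {"R1", "R2"}))   # ['WAR']
print(classify("R2", {"R4", "R7"}, "R2", {"R1", "R3"}))   # ['WAW']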

Handling Data Hazard

There are various methods to handle the data hazards that occur in a program. The main ones are forwarding, code reordering and stall insertion, explained as follows below.

1. Forwarding:
Forwarding adds special circuitry to the pipeline. This method works because the needed values take less time to travel through a wire than a pipeline segment takes to compute its result, so a result can be fed directly to a waiting instruction before it is written to the register file.
2. Code reordering:
We need a special type of software to reorder code so that dependent instructions are separated. We call this type of software a hardware-dependent compiler.
3. Stall Insertion:
It inserts one or more stalls (no-op instructions) into the pipeline, which delays the execution of the current instruction until the required operand is written to the register file. Unfortunately, this method reduces pipeline efficiency and throughput.
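As an illustration of stall insertion, the sketch below counts the no-ops needed between a RAW-dependent pair when there is no forwarding. It assumes a 5-stage model in which the producer writes the register file in stage 5 and the consumer reads it in stage 2; these stage numbers are assumptions, not values from the notes.

def stalls_needed(write_stage=5, read_stage=2, distance=1):
    # No-ops required so the consumer reaches its read stage only after the
    # producer's write-back stage has completed (no forwarding, no same-cycle bypass).
    return max(0, (write_stage - read_stage + 1) - distance)

print(stalls_needed(distance=1))   # back-to-back dependent instructions -> 3 no-ops
print(stalls_needed(distance=3))   # two independent instructions in between -> 1 no-op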

Control Hazards

A control hazard occurs whenever the pipeline makes incorrect branch prediction decisions, resulting in instructions entering the pipeline that must be discarded. A control hazard is often referred to as a branch hazard.
Branch hazards are caused by branch instructions and are known as control hazards. The flow of program/instruction execution is controlled by branch instructions. In higher-level languages, conditional statements are used for repetitive loops or condition testing (correlate with while, for, if, case statements). These are converted into one of the BRANCH instruction variations. To determine the program flow, you must know the value of the condition being tested, and that value is not available until the branch is resolved.
As a result, when the decision to execute one instruction is reliant on
the result of another instruction, such as a conditional branch, which
examines the condition’s consequent value, a conditional hazard
develops.
The Program Counter (PC) is loaded with the appropriate place for the
branch and jump instructions, which determines the programme flow.
The next instruction that is to be fetched and executed by the given
CPU is stored in the PC. Take a look at the instructions that follow:
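(The original notes show a short instruction listing here; the hypothetical sequence below has the same shape, with I2 being an unconditional jump so that I3 lies on the not-taken path.)

I1. ADD R1, R2, R3
I2. JMP L1
I3. SUB R4, R5, R6
...
L1: ...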

In this scenario, fetching I3 is pointless. What is the status of the pipeline? The fetch of I3 must be terminated, but this can only be determined once I2 has been decoded as JMP. As a result, the pipeline cannot continue at its current rate, resulting in a control dependency (hazard). If I3 is fetched in the meantime, it is not just unnecessary work; it is also possible that some data in registers has been changed and needs to be reversed.
Handling Control Hazards
The objectives of this module are to discuss how to handle control
hazards, to differentiate between static and dynamic branch
prediction and to study the concept of delayed branching.

A branch in a sequence of instructions causes a problem. An instruction must be fetched at every clock cycle to sustain the pipeline. However, until the branch is resolved, we will not know
where to fetch the next instruction from and this causes a problem.
This delay in determining the proper instruction to fetch is called
a control hazard or branch hazard, in contrast to the data hazards we
examined in the previous modules. Control hazards are caused by
control dependences. An instruction that is control dependent on a
branch cannot be moved in front of the branch, so that the branch no
longer controls it and an instruction that is not control dependent on
a branch cannot be moved after the branch so that the branch controls
it. This will give rise to control hazards.
The two major issues related to control dependences are exception
behavior and handling and preservation of data flow. Preserving
exception behavior requires that any changes in instruction execution
order must not change how exceptions are raised in program. That is,
no new exceptions should be generated.

• Example:
ADD R2,R3,R4

BEQZ R2,L1

LD R1,0(R2)
L1:

What will happen if we move LD before BEQZ? This may lead to a memory protection violation. The branch instruction is a guarding branch that checks for an address of zero and jumps to L1. If the load is moved ahead of it, an additional exception may be raised. Data flow is the actual flow of data values among instructions that produce results and those that consume them. Branches make the flow dynamic and determine which instruction is the supplier of data.

• Example:
DADDU R1,R2,R3

BEQZ R4,L

DSUBU R1,R5,R6
L:…
OR R7,R1,R8

Does the instruction OR depend on DADDU or on DSUBU? We must ensure that we preserve the data flow on execution.

The general rule to reduce branch penalties is to resolve branches as early as possible. In the MIPS pipeline, the comparison of registers and the target address calculation are normally done at the execution stage. This gives rise to a three clock cycle penalty, as indicated in Figure 13.1. If we do a more aggressive implementation by adding hardware to resolve the branch in the ID stage, the penalty can be reduced.

Resolving the branch earlier requires two actions to occur: computing the branch target address and evaluating the branch decision early.
The easy part of this change is to move up the branch address
calculation. We already have the PC value and the immediate field in
the IF/ID pipeline register, so we just move the branch adder from the
EX stage to the ID stage; of course, the branch target address
calculation will be performed for all instructions, but only used when
needed. The harder part is the branch decision itself. For branch equal,
we would compare the two registers read during the ID stage to see if
they are equal. Equality can be tested by first exclusive ORing their
respective bits and then ORing all the results. Moving the branch test
to the ID stage also implies additional forwarding and hazard
detection hardware, since a branch dependent on a result still in the
pipeline must still work properly with this optimization. For example,
to implement branch-on-equal (and its inverse), we will need to
forward results to the equality test logic that operates during ID. There
are two complicating factors:

• During ID, we must decode the instruction, decide whether a bypass to the equality unit is needed, and complete the equality comparison so that if the instruction is a branch, we can set the PC to the branch target address. Forwarding for the operands of branches was formerly handled by the ALU forwarding logic, but the introduction of the equality test unit in ID will require new forwarding logic. Note that the bypassed source operands of a branch can come from either the ALU/MEM or MEM/WB pipeline latches.
• Because the values in a branch comparison are needed during ID
but may be produced later in time, it is possible that a data
hazard can occur and a stall will be needed. For example, if an
ALU instruction immediately preceding a branch produces one of
the operands for the comparison in the branch, a stall will be
required, since the EX stage for the ALU instruction will occur
after the ID cycle of the branch.

Despite these difficulties, moving the branch execution to the ID stage is an improvement since it reduces the penalty of a branch to only one instruction if the branch is taken, namely, the one currently being fetched.
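To see why this matters, the sketch below compares the effective CPI for a three-cycle versus a one-cycle taken-branch penalty; the base CPI of 1 and the 20% taken-branch frequency are assumed values, not figures from the notes.

def effective_cpi(base_cpi, branch_fraction, branch_penalty):
    # Average CPI when a fraction of instructions are taken branches,
    # each adding `branch_penalty` stall cycles.
    return base_cpi + branch_fraction * branch_penalty

taken_branches = 0.20                          # assumed: 20% of instructions are taken branches
print(effective_cpi(1.0, taken_branches, 3))   # branch resolved in EX -> CPI = 1.6
print(effective_cpi(1.0, taken_branches, 1))   # branch resolved in ID -> CPI = 1.2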

There are basically two ways of handling control hazards:

1. Stall until the branch outcome is known, or perform the fetch again
2. Predict the behavior of branches
a. Static prediction by the compiler
b. Dynamic prediction by the hardware

The first option of stalling the pipeline till the branch is resolved, or fetching again from the resolved address, leads to too much penalty. Branches are very frequent and not handling them effectively brings down the performance. We would also be violating the principle of "make the common case fast".

The second option is predicting the behavior of branches. Branch prediction is the ability to make an educated guess about which way a branch will go: will the branch be taken or not. First of all, we shall discuss static prediction done by the compiler. This is based on typical branch behavior. For example, for loop and if-statement branches, we can predict that backward branches will be taken and forward branches will not be taken. So, there are primarily three methods adopted. They are:

– Predict not taken approach
• Assume that the branch is not taken, i.e. the condition will not evaluate to be true
– Predict taken approach
• Assume that the branch is taken, i.e. the condition will evaluate to be true
– Delayed branching
• A more effective solution

In the predict not taken approach, treat every branch as “not taken”.
Remember that the registers are read during ID, and we also perform
an equality test to decide whether to branch or not. We simply load in
the next instruction (PC+4) and continue. The complexity arises when
the branch evaluates to be true and we end up needing to actually take
the branch. In such a case, the pipeline is cleared of any code loaded
from the “not-taken” path, and the execution continues.

In the predict-taken approach, we assume that the branch is always taken. This method will work for processors that have the target
address computed in time for the IF stage of the next instruction so
there is no delay, and the condition alone may not be evaluated. This
will not work for the MIPS architecture with a 5-stage pipeline. Here,
the branch target is computed during the ID cycle or later and the
condition is also evaluated in the same clock cycle.

The third approach is the delayed branching approach. In this case, an instruction that is useful and not dependent on whether the branch is taken or not is inserted into the pipeline. It is the job of the compiler to determine the delayed branch instructions. The slots filled up by instructions which may or may not get executed, depending on the outcome of the branch, are called the branch delay slots. The compiler has to fill these slots with useful/independent instructions. It is easier for the compiler if there are fewer delay slots.
There are three different ways of introducing instructions in the delay slots:

• From before the branch instruction
• From the target address: only valuable when the branch is taken
• From fall through: only valuable when the branch is not taken
• Cancelling branches allow more slots to be filled

A structural hazard arises when two (or more) pipelined instructions need the same resource. The instructions must therefore be carried out in series rather than in parallel for a segment of the pipeline. Structural hazards are occasionally also referred to as resource hazards. Here we discuss structural hazards.

What are Structural Hazards?

A structural hazard is one of the three hazards in the pipeline. A structural hazard is caused by a resource conflict in a pipeline stage. When two different instructions access the same resource in the same stage, this situation is termed a structural hazard.

These structural hazards cause stalls in the pipeline. To minimize or
eliminate the stalls due to the structural dependency in the pipeline,
we use a hardware mechanism called the Renaming technique.

Structural Hazards in Pipelining

A cycle in the pipeline without a new input initiation is called an extra cycle, also called a stall or hazard. When a stall is present in the pipeline, the CPI (Cycles per Instruction) ≠ 1. There are three types of hazards possible in the pipeline, namely:

• Structural Hazards
• Data Hazards
• Control Hazards

A structural dependency causes a structural hazard in pipelining. This dependency occurs due to a resource conflict. A resource may be a memory, a register or a functional unit like the ALU (Arithmetic Logical Unit) in the CPU. In simple words, when different instructions collide while trying to access the same resource in the same segment of the pipeline, it causes a structural hazard.

Handling Structural Hazards in Pipelining

Structural hazards are minimized using a hardware technique called renaming. The renaming mechanism splits the memory into two independent sub-modules to store instructions and data separately. The module used to store instructions is called code memory (CM), and the module used to store data is called data memory (DM).

In this technique, as we proceed further, we refer to the code memory in the first stage and refer to the data memory later, so accessing these two memories in the same cycle does not create any conflict or hazard. In this manner, we eliminate structural hazards in pipelining.

Structural Hazard Example

Let us understand the structural hazard through an example. Consider four instructions, I1, I2, I3, and I4, accessing the Memory (Mem), Instruction Decode (Decod), and ALU stages of the pipeline as shown in the figure below:

In the above execution sequence, instructions I1 and I4 are both trying to access the same resource, which is Mem (Memory), in the same clock cycle CC4 (Clock Cycle 4). This situation in the pipeline is called a structural hazard.

A conflict is an unsuccessful operation, so to handle this problem, we need to stop the I4 instruction fetch from memory until the resource becomes available. This waiting creates stalls in the pipeline, as shown in the figure below:

In the cycle diagram, instruction I1 refers to the memory in the fourth stage to read or write data. Simultaneously, instruction I4 refers to the memory in the first stage of the pipeline to access data in the same cycle CC4. This situation results in a conflict. Now, to eliminate this, we use the renaming technique: we divide the memory into two modules, code memory (CM) and data memory (DM), and refer to the CM in the first stage to fetch instructions and refer to the DM in the fourth stage to read or write data. This implementation is shown in the figure below:

Using the renaming technique to overcome the structural hazard, we achieve a pipeline CPI = 1 with 0 (zero) stalls.
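A small sketch of the scenario above; the 4-stage layout and the cycle numbering are assumptions chosen to mirror the figure. With a single unified memory, I1's memory stage and I4's fetch collide in clock cycle 4; with split code/data memories the collision disappears.

def memory_conflicts(stages, unified_memory=True):
    # stages[i] = list of (cycle, resource) pairs used by instruction i.
    usage = {}
    conflicts = []
    for instr, accesses in stages.items():
        for cycle, resource in accesses:
            if unified_memory and resource in ("CM", "DM"):
                resource = "Mem"              # one shared memory port
            key = (cycle, resource)
            if key in usage:
                conflicts.append((cycle, resource, usage[key], instr))
            usage[key] = instr
    return conflicts

# 4-stage pipeline: fetch from code memory in stage 1, data memory access in stage 4.
pipeline = {f"I{i}": [(i, "CM"), (i + 1, "Decod"), (i + 2, "ALU"), (i + 3, "DM")]
            for i in range(1, 5)}

print(memory_conflicts(pipeline, unified_memory=True))    # [(4, 'Mem', 'I1', 'I4')] - structural hazard
print(memory_conflicts(pipeline, unified_memory=False))   # [] - renaming (split CM/DM) removes it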

What is exception handling?


Exception handling is the process of responding to unwanted or
unexpected events when a computer program runs. Exception
handling deals with these events to avoid the program or system
crashing, and without this process, exceptions would disrupt the
normal operation of a program.

Exceptions occur for numerous reasons, including invalid user input, code errors, device failure, the loss of a network connection, insufficient memory to run an application, a memory conflict with another program, a program attempting to divide by zero, or a user attempting to open files that are unavailable.

When an exception occurs, specialized programming language constructs, interrupt hardware mechanisms, or operating system interprocess communication facilities handle the exception.

Exception handling differs from error handling in that the former involves conditions an application might catch versus serious problems an application might want to avoid. In contrast, error handling helps maintain the normal flow of software program execution.

How is exception handling used?

If a program has a lot of statements and an exception happens halfway through its execution, the statements after the exception do not execute, and the program crashes. Exception handling helps ensure this does not happen when an exception occurs.

The try block detects and throws any found exceptions to the catch blocks, which then handle them.

Exception handling can catch and throw exceptions. If a detecting function in a block of code cannot deal with an anomaly, the exception is thrown to a function that can handle the exception. A catch statement is a group of statements that handle the specific thrown exception. Catch parameters determine the specific type of exception that is thrown.

Exception handling is useful for dealing with exceptions that cannot be handled locally. Instead of showing an error status in the program, the exception handler transfers control to where the error can be handled. A function can throw exceptions or can choose to handle exceptions.

Error handling code can also be separated from normal code with the use of try blocks, which are blocks of code enclosed in curly braces or brackets that could cause an exception. Try blocks can help programmers to categorize exception objects.

The try bracket contains the code that encounters the exception and
prevents the application from crashing.
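A minimal sketch of the try/catch idea, written in Python (where the catch construct is spelled except); the file name below is made up purely for illustration:

def read_first_line(path):
    try:
        # Code that may raise an exception goes in the try block.
        with open(path) as f:
            return f.readline()
    except FileNotFoundError as exc:
        # The handler catches this specific exception type and keeps the program running.
        print(f"Could not open {path}: {exc}")
        return None

print(read_first_line("missing_config.txt"))   # prints the error message, then None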

What are the types of exceptions?

Exceptions come in the following two classes:

1. Checked exceptions. Also called compile-time exceptions, the compiler checks these exceptions during the compilation process to confirm if the exception is being handled by the programmer. If not, then a compilation error displays on the system. Checked exceptions include SQLException and ClassNotFoundException.
2. Unchecked exceptions. Also called runtime exceptions, these
exceptions occur during program execution. These exceptions
are not checked at compile time, so the programmer is
responsible for handling these exceptions. Unchecked
exceptions do not give compilation errors. Examples of
unchecked exceptions include NullPointerException and
IllegalArgumentException.

Interrupts and Exceptions


Exceptions and interrupts are unexpected events that disrupt the normal flow of execution of the instruction that the processor is currently executing. An exception is an unexpected event from within the processor. An interrupt is an unexpected event from outside the processor.

Whenever an exception or interrupt occurs, the hardware starts
executing the code that performs an action in response to the
exception. This action may involve killing a process, outputting an
error message, communicating with an external device, or horribly
crashing the entire computer system by initiating a “Blue Screen of
Death” and halting the CPU. The instructions responsible for this
action reside in the operating system kernel, and the code that
performs this action is called the interrupt handler code. Now, We can
think of handler code as an operating system subroutine. Then, After
the handler code is executed, it may be possible to continue execution
after the instruction where the execution or interrupt occurred.
Exception and Interrupt Handling:
Whenever an exception or interrupt occurs, execution transitions from user mode to kernel mode, where the exception or interrupt is handled. In detail, the following steps must be taken to handle an exception or interrupt.
While entering the kernel, the context (the values of all CPU registers) of the currently executing process must first be saved to memory. The kernel is now ready to handle the exception/interrupt.

1. Determine the cause of the exception/interrupt.
2. Handle the exception/interrupt.

When the exception/interrupt has been handled, the kernel performs the following steps:

1. Select a process to restore and resume.
2. Restore the context of the selected process.
3. Resume execution of the selected process.

At any point in time, the values of all the registers in the CPU define the context of the CPU. Another name used for the CPU context is the CPU state.
The exception/interrupt handler uses the same CPU as the currently executing process. When entering the exception/interrupt handler, the values in all CPU registers to be used by the exception/interrupt handler must be saved to memory. The saved register values can later be restored before resuming execution of the process.
The handler may have been invoked for a number of reasons. The handler thus needs to determine the cause of the exception or interrupt. Information about what caused the exception or interrupt can be stored in dedicated registers or at predefined addresses in memory.

Next, the exception or interrupt needs to be serviced. For instance, if
it was a keyboard interrupt, then the key code of the key press is
obtained and stored somewhere or some other appropriate action is
taken. If it was an arithmetic overflow exception, an error message
may be printed or the program may be terminated.
The exception/interrupt has now been handled by the kernel. The kernel may choose to resume the same process that was executing prior to handling the exception/interrupt, or to resume execution of any other process currently in memory.
The context of the CPU can now be restored for the chosen process by
reading and restoring all register values from memory.
The process selected to be resumed must be resumed at the same point it was stopped. The address of this instruction was saved by the machine when the interrupt occurred, so it is simply a matter of getting this address and making the CPU continue to execute at this address.

Pipeline Optimization techniques:-

Pipeline optimization is the process of maximizing the rendering speed and then allowing stages that are not bottlenecks to consume as much time as the bottleneck.

Pipeline Optimization :-

1. Stages execute in parallel.
2. The slowest stage is always the bottleneck of the pipeline.
3. The bottleneck determines throughput (i.e., maximum speed).
4. The bottleneck is the average bottleneck over a frame.
5. Intra-frame bottlenecks cannot be measured easily.
6. Bottlenecks can change over a frame.
7. Most important: find the bottleneck, then optimize that stage!

Locating the Bottleneck :-

Two bottleneck location techniques:

Technique 1:
◦ Make a certain stage work less.
◦ If performance is better, then that stage is the bottleneck.

Technique 2:
◦ Make the other two stages work less or (better) not at all.
◦ If performance is the same, then the stage not included above is the bottleneck.

Complication: the bus between the CPU and the graphics card may be the bottleneck (not a typical stage).

What are the Design and Characteristics of Memory Hierarchy?

In Computer System Design, the Memory Hierarchy is an enhancement that organises the memory so that access time is minimised. The Memory Hierarchy was developed based on a program behaviour known as locality of references. Here is a figure that demonstrates the various levels of the memory hierarchy:

Memory Hierarchy Design

This Hierarchy Design of Memory is divided into two main types. They are:

External or Secondary Memory
It consists of Magnetic Tape, Optical Disk, Magnetic Disk, i.e. it includes peripheral storage devices that are accessible by the system’s processor via an I/O Module.

Internal Memory or Primary Memory
It consists of CPU registers, Cache Memory, and Main Memory. It is accessible directly by the processor.

Characteristics of Memory Hierarchy

One can infer these characteristics of a Memory Hierarchy Design from the figure given above:

Inclusion Property:
It implies that all information items are originally stored in level Mn. During processing, subsets of Mn are copied into Mn-1. Similarly, subsets of Mn-1 are copied into Mn-2, and so on. Multi-level caches can be designed in various ways depending on whether the content of one cache is present in other levels of caches. If all blocks in the higher level cache are also present in the lower level cache, then the lower level cache is said to be inclusive of the higher level cache.

Coherence Property

The coherence property requires that copies of the same information item at successive memory levels be consistent. If a word is modified in the cache, copies of that word must be updated at the higher levels.

There are two strategies for maintaining coherence in a memory hierarchy:

1. Write-through, which demands an immediate update of Mi+1 if a word is modified in Mi.
2. Write-back, which delays the update of Mi+1 until the word modified in Mi is replaced or removed from Mi.

Locality of references

• Temporal Locality - Recently referenced items like instructions or data are likely to be referenced again shortly, for example in iterative loops, process stacks, temporary variables or subroutines.

32
• Spatial Locality - This refers to the tendency for a process to
access items whose addresses are nearer to another for example
operations on tables or arrays involve access of a certain
clustered area in the address space.
• Sequential Locality - In typical programs the execution of
instructions follows a sequential order unless branch
instructions create out of order execution. The ratio of in order
execution to out of order execution is roughly 5:1 in ordinary
programs.

1. Capacity
It refers to the total volume of data that a system’s memory can store.
The capacity increases moving from the top to the bottom in the
Memory Hierarchy.

2. Access Time
It refers to the time interval present between the request for
read/write and the data availability. The access time increases as we
move from the top to the bottom in the Memory Hierarchy.

3. Performance
When a computer system was designed earlier without the Memory
Hierarchy Design, the gap in speed increased between the given CPU
registers and the Main Memory due to a large difference in the
system’s access time. It ultimately resulted in the system’s lower
performance, and thus, enhancement was required. Such a kind of
enhancement was introduced in the form of Memory Hierarchy Design,
and because of this, the system’s performance increased. One of the primary ways to increase the performance of a system is to minimise how far down the memory hierarchy the processor must go to manipulate data.

4. Cost per bit
The cost per bit increases as one moves from the bottom to the top of the Memory Hierarchy, i.e. External Memory is cheaper than Internal Memory.

Design of Memory Hierarchy

In computers, the memory hierarchy primarily includes the following:

1. Registers
The register is usually an SRAM or static RAM in the computer
processor that is used to hold the data word that is typically 64 bits
or 128 bits. A majority of processors make use of a status word register and an accumulator. The accumulator is primarily used to store the data of mathematical operations, and the status word register is primarily used for decision making.

2. Cache Memory
The cache basically holds a chunk of information that is used frequently from the main memory. We can also find cache memory in the processor. If the processor has a single core, it will rarely have multiple cache levels. Present multi-core processors typically have three levels: two levels for every individual core, and one level that is shared.

3. Main Memory
In a computer, the main memory is nothing but the memory unit that communicates directly with the CPU. It is the primary storage unit of a computer system. The main memory is a very fast and very large memory that is used for storing information throughout the computer's operations. This type of memory is made up of ROM as well as RAM.

4. Magnetic Disks
In a computer, magnetic disks are circular plates fabricated from metal or plastic coated with a magnetised material. Both faces of a disk are frequently used, and several disks may be stacked on a single spindle, with read/write heads available for every surface. All the disks on a spindle rotate together at high speed.

5. Magnetic Tape
Magnetic tape is a normal magnetic recording medium consisting of a slender magnetizable coating on an extended, thin strip of plastic film. It is used mainly to back up huge chunks of data. When a computer needs to access a tape, it first mounts it to access the information; once the access is complete, the tape is unmounted. The access time of magnetic tape is much slower than that of other memories, and it can take a few minutes to access a strip.
Cache Memory Organization:-

Computer memory is organized into a hierarchy. At the highest level are the processor registers; next comes one or more levels of cache; then main memory, which is usually made out of dynamic random-access memory (DRAM). All of these are considered internal to the computer system. Finally, there is external memory composed of magnetic disks and tapes. Cache is a small high-speed memory, usually Static RAM (SRAM), that contains the most recently accessed pieces of main memory. Cache memories are high-speed buffers which are interposed between the processors and main memory to capture those portions of the contents of main memory which are currently in use. Since cache memories are typically 5-10 times faster than main memory, they can reduce the effective memory access time if carefully designed and implemented. In this section, we discuss the architectural specification, cache mapping techniques, write policies and performance optimization in detail, with a case study of Pentium processors implementing cache. There are basically four placement policies, which are named direct, fully associative, set associative and sector mapping.

Cache Memory is a special very high-speed memory. It is used to speed up and synchronize with a high-speed CPU. Cache memory is costlier than main memory or disk memory but more economical than CPU registers. Cache memory is an extremely fast memory type that acts as a buffer between RAM and the CPU. It holds frequently requested data and instructions so that they are immediately available to the CPU when needed. Cache memory is used to reduce the average time to access data from the main memory. The cache is a smaller and faster memory that stores copies of the data from frequently used main memory locations. There are various independent caches in a CPU, which store instructions and data.
Levels of memory:

• Level 1 or Register – It is a type of memory in which data is stored and accepted that is immediately stored in the CPU. The most commonly used registers are the accumulator, program counter, address register etc.
• Level 2 or Cache memory – It is the fastest memory which has a faster access time, where data is temporarily stored for faster access.

• Level 3 or Main Memory – It is memory on which computer
works currently. It is small in size and once power is off data
no longer stays in this memory.
• Level 4 or Secondary Memory – It is external memory which
is not as fast as main memory but data stays permanently in
this memory.

Cache Performance: When the processor needs to read or write a location in main memory, it first checks for a corresponding entry in the cache.

• If the processor finds that the memory location is in the cache, a cache hit has occurred and data is read from the cache.
• If the processor does not find the memory location in the cache, a cache miss has occurred. For a cache miss, the cache allocates a new entry and copies in data from main memory; then the request is fulfilled from the contents of the cache.

The performance of cache memory is frequently measured in terms of a quantity called the Hit ratio.

Hit ratio (H) = hits / (hits + misses) = no. of hits / total accesses

Miss ratio = misses / (hits + misses) = no. of misses / total accesses = 1 - Hit ratio (H)

Cache performance can be improved by using a larger cache block size
and higher associativity, and by reducing the miss rate, the miss
penalty, and the time to hit in the cache.
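
As a quick illustration of these formulas, here is a minimal Python sketch (the hit/miss counts and access times are assumed values, not measurements) that computes the hit ratio, the miss ratio and the resulting average memory access time:

# Minimal sketch: hit ratio, miss ratio and average access time for a
# single-level cache. The counts and latencies below are assumptions.
hits = 950          # accesses served by the cache (assumed)
misses = 50         # accesses that had to go to main memory (assumed)
t_cache = 2         # cache access time in ns (assumed)
t_main = 60         # main memory access penalty in ns (assumed)

total = hits + misses
hit_ratio = hits / total            # H = hits / total accesses
miss_ratio = misses / total         # 1 - H

# Every access pays the cache lookup; misses additionally pay the
# main-memory penalty.
avg_access_time = t_cache + miss_ratio * t_main

print(f"Hit ratio  = {hit_ratio:.2f}")
print(f"Miss ratio = {miss_ratio:.2f}")
print(f"Average access time = {avg_access_time:.1f} ns")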
Cache Mapping: There are three different types of mapping used for
the purpose of cache memory, which are as follows: direct mapping,
associative mapping, and set-associative mapping. These are explained
below.
A. Direct Mapping
The simplest technique, known as direct mapping, maps each block of
main memory into only one possible cache line; in other words, direct
mapping assigns each memory block to a specific line in the cache. If
a line is already occupied by a memory block when a new block needs
to be loaded, the old block is discarded. The memory address is split
into two parts, an index field and a tag field: the tag is stored in
the cache along with the block, while the index selects the cache
line. Direct mapping's performance is directly proportional to the
hit ratio.

i = j mod m

where

i = cache line number
j = main memory block number
m = number of lines in the cache

For purposes of cache access, each main memory address can be viewed
as consisting of three fields. The least significant w bits identify
a unique word or byte within a block of main memory; in most
contemporary machines, the address is at the byte level. The
remaining s bits specify one of the 2^s blocks of main memory. The
cache logic interprets these s bits as a tag of s - r bits (the most
significant portion) and a line field of r bits. This latter field
identifies one of the m = 2^r lines of the cache. The line field is
the index in direct mapping.
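
The following Python sketch shows how a byte address could be split into tag, line and word fields for a direct-mapped cache. The geometry (16-byte blocks, 16 cache lines) is an assumption chosen only for the example:

# Minimal sketch of direct-mapped address decoding.
# Assumed geometry: 16-byte blocks (w = 4 offset bits) and
# 16 cache lines (r = 4 line bits); the remaining bits form the tag.
BLOCK_SIZE = 16
NUM_LINES = 16

def split_address(addr: int):
    word = addr % BLOCK_SIZE          # least significant w bits
    block = addr // BLOCK_SIZE        # main memory block number j
    line = block % NUM_LINES          # i = j mod m
    tag = block // NUM_LINES          # most significant s - r bits
    return tag, line, word

# Two addresses that map to the same cache line but carry different
# tags, i.e. they would conflict in a direct-mapped cache.
for addr in (0x0123, 0x1123):
    tag, line, word = split_address(addr)
    print(f"addr {addr:#06x}: tag={tag}, line={line}, word={word}")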

B. Associative Mapping
In this type of mapping, the associative memory is used to store
content and addresses of the memory word. Any block can go into any
line of the cache. This means that the word id bits are used to identify
which word in the block is needed, but the tag becomes all of the
remaining bits. This enables the placement of any word at any place
in the cache memory. It is considered to be the fastest and the most
flexible mapping form. In associative mapping the index bits are zero.

C. Set-associative Mapping
This form of mapping is an enhanced form of direct mapping in which
the drawbacks of direct mapping are removed. Set-associative mapping
addresses the problem of possible thrashing in the direct mapping
method. Instead of having exactly one line that a block can map to in
the cache, a few lines are grouped together to create a set, and a
block in memory can then map to any one of the lines of a specific
set. Set-associative mapping therefore allows two or more blocks that
share the same index address to be present in the cache at the same
time. It combines the best of the direct and associative cache
mapping techniques. In set-associative mapping the index bits are
given by the set offset bits. In this case, the cache consists of a
number of sets, each of which consists of a number of lines. The
relationships are

m = v * k

i = j mod v

where

i = cache set number
j = main memory block number
v = number of sets
m = number of lines in the cache
k = number of lines in each set
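
A small Python sketch of set-associative placement is shown below. The geometry (v = 2 sets, k = 2 lines per set) and the block reference string are assumptions used only to demonstrate i = j mod v and what happens when a set fills up:

# Minimal sketch of set-associative placement (v sets, k lines per set).
V_SETS = 2      # v
K_WAYS = 2      # k; the cache has m = v * k lines in total

# cache[set_index] holds the block numbers currently resident in that set
cache = [[] for _ in range(V_SETS)]

def access(block: int) -> str:
    s = block % V_SETS                  # i = j mod v
    if block in cache[s]:
        return f"block {block} -> set {s}: hit"
    if len(cache[s]) >= K_WAYS:         # set is full: evict the oldest block
        evicted = cache[s].pop(0)
        cache[s].append(block)
        return f"block {block} -> set {s}: miss, evicted block {evicted}"
    cache[s].append(block)
    return f"block {block} -> set {s}: miss, placed"

for b in [0, 2, 4, 0, 1, 3]:
    print(access(b))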

Application of Cache Memory:

1. Usually, the cache memory can store a reasonable number of blocks
at any given time, but this number is small compared to the total
number of blocks in the main memory.
2. The correspondence between the main memory blocks and those in the
cache is specified by a mapping function.
3. Primary Cache – A primary cache is always located on the processor
chip. This cache is small and its access time is comparable to that
of processor registers.
4. Secondary Cache – Secondary cache is placed between the primary
cache and the rest of the memory. It is referred to as the level 2
(L2) cache. Often, the level 2 cache is also housed on the processor
chip.
5. Spatial locality of reference – If a memory location is referenced,
locations in close proximity to it are likely to be referenced soon.
This is why a whole block, and not just a single word, is brought
into the cache on a miss.
6. Temporal locality of reference – A recently referenced location is
likely to be referenced again in the near future. Replacement
policies such as least recently used (LRU) exploit this by keeping
the most recently used blocks in the cache.

Techniques to Reduce Cache Misses (3 Key Tips):-

1. Set an Expiry Date for the Cache Lifespan

Every time your cache is purged, the data has to be written into the
cache again after the first request that follows. This is why, at
Kinsta, we use the Kinsta MU plugin so that only certain sections of
the cache are purged.

The more you purge your cache, the more likely cache misses are to
occur. Of course, sometimes clearing your cache is necessary.

However, one way you can prevent this problem is to expand the
lifespan of your cache by increasing its expiry time. Keep in mind that
the expiry time should coincide with how often you update your
website to ensure that the changes appear to your users.

For example, if you don’t frequently update your site, you can
probably set the expiry time to two weeks. Alternatively, if site
updates are a weekly occurrence, your expiry time shouldn’t exceed a
day or two.

Your options for doing this will vary depending on your hosting
provider. If you rely on caching plugins, you can use the WP Rocket
plugin. Once installed and activated, you can navigate
to Settings > WP Rocket, followed by the Cache tab.

Under the Cache Lifespan section, you will be able to specify the
global expiry time for when the cache is cleared. When you’re done,
you can click on the Save Changes button at the bottom of the page.

2. Increase the Size of Your Cache or Random Access Memory (RAM)

Another option for reducing cache misses is to increase the size of
your cache or RAM. Obviously, the larger your cache, the more data it
can hold and, thus, the fewer cache misses you're likely to deal with.

However, increasing your RAM can be a bit pricey. You may want to
check with your hosting provider to see what your options are. For
example, at Kinsta we offer scalable hosting. This means that you can
easily scale up your plan without having to worry about downtime.

3. Use the Optimal Cache Policies for Your Specific Circumstances

A third way to reduce cache misses is by testing out different cache
policies for your environment. Understanding what your options are
and how they work is key.

The four main cache policies are:

1. First In First Out (FIFO): This policy means that the data that was
added the earliest to the cache will be the first to be evicted.
2. Last In First Out (LIFO): This means that the data entries added
last to the cache will be the first to be removed.
3. Least Recently Used (LRU): True to its name, this policy first
evicts the data accessed the longest time ago.
4. Most Recently Used (MRU): With this policy, the data most
recently accessed is evicted first.
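
As a rough illustration of how the LRU policy behaves, the Python sketch below keeps a fixed-size cache in an OrderedDict and evicts the least recently used key when the cache is full. The capacity and the access pattern are assumptions for the example:

from collections import OrderedDict

# Minimal LRU cache sketch: the least recently used entry is evicted
# when the cache is full.
class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def access(self, key, value=None):
        if key in self.data:
            self.data.move_to_end(key)                   # mark as most recently used
            return f"{key}: hit"
        if len(self.data) >= self.capacity:
            evicted, _ = self.data.popitem(last=False)   # drop the LRU entry
            self.data[key] = value
            return f"{key}: miss (evicted {evicted})"
        self.data[key] = value
        return f"{key}: miss"

cache = LRUCache(capacity=3)
for page in ["A", "B", "C", "A", "D", "B"]:
    print(cache.access(page))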

Virtual memory organization:-

Virtual memory is a method that computers use to manage storage space
to keep systems running quickly and efficiently. Using this
technique, operating systems can transfer data between different
types of storage, such as random access memory (RAM), also known as
main memory, and hard drive or solid-state disk storage. At any
particular time, the computer only needs enough active memory to
support active processes; the system can move the dormant ones into
virtual memory until they are needed.

A virtual memory system has many advantages, including:-

1. Allowing users to operate multiple applications at the same time,
or applications that are larger than the main memory
2. Freeing applications from having to compete for shared memory
space and allowing multiple applications to run at the same time
3. Allowing core processes to share memory between libraries, which
consist of written code that provides the foundation for a program's
operations
4. Improving security by isolating and segmenting where the computer
stores information
5. Improving efficiency and speed by allowing more processes to sit
in virtual memory
6. Lowering the cost of computer systems, since the right balance
between main memory and virtual memory can be found
7. Increasing the amount of memory available by working outside the
limits of a computer's physical main memory space
8. Optimizing central processing unit (CPU) usage

Virtual memory is a built-in component of most modern desktop
computers. It is managed jointly by the CPU's memory management
hardware and the operating system, and it is a more cost-effective
way of extending memory than expanding the physical memory of the
system. Some specialized computers might not rely on virtual memory
because it could cause inconsistencies in how the computer processes
information and runs tasks.

Professionals in specific fields, such as scientific or statistical
modeling, may avoid virtual memory for functions requiring stability
and predictability. Most everyday personal or business computers
don't need this level of consistency and benefit more from the
advantages of virtual memory than from the predictability of other
memory systems.

Limitations of virtual memory:-

Virtual memory has many advantages, but it also has some limitations.
Here are a few to keep in mind:

1. Virtual memory runs slower than physical memory, so most computers
prioritize using physical memory when possible.
2. Moving data between a computer's virtual and physical memory
places extra demands on the computer's hardware.
3. The amount of storage that virtual memory can provide depends on
the amount of secondary storage a computer has.
4. If a computer only has a small amount of RAM, virtual memory can
cause thrashing, which is when the computer must constantly swap data
between virtual and physical memory, resulting in significant
performance delays.
5. It can take longer for applications to load, or for a computer to
switch between applications, when using virtual memory.

Mapping techniques

The process of transferring data from main memory to cache memory is
called mapping. Three kinds of mapping techniques are used in cache
memory:

1. Associative mapping
2. Direct mapping
3. Set-associative mapping

Components present in each line are:

1. Valid bit: This gives the status of the data block. If it is 0,
the line does not hold a valid data block; if it is 1, it does.
2. Tag: This is the part of the main memory address stored with the
line.
3. Data: This is the data block itself.

1) Associative mapping

In this technique, no fixed mapping function restricts where a block
can be placed: any main memory block can be mapped into any cache
line. Therefore, a cache line number is not derived from the memory
address; the associative cache controller interprets the request
using the main memory address format. During the mapping process, the
complete data block is transferred to cache memory along with its
complete tag.

• The associative cache controller interprets the CPU-generated
request as a tag field and a word offset.
• The existing tags in the cache are compared with the CPU-generated
tag.
• If any of the tags match, the operation is a hit, and the
respective data is transferred to the CPU based on the word offset.
• If none of the tags match, the operation is a miss, and the
reference is forwarded to the main memory.
• According to the main memory address format, the respective main
memory block is enabled and then transferred to the cache memory
using associative mapping. Later the data is transferred to the CPU.
• In this mapping technique, replacement algorithms are used to
replace a cache block when the cache is full.
• Tag memory size = number of lines * number of tag bits in each
line.

For example, with 4 cache lines and 3 tag bits per line:

Tag memory size = 4 * 3 bits = 12 bits

2) Direct mapping

In this mapping technique, a fixed mapping function is used to
transfer the data from main memory to cache memory. The mapping
function is:

K mod N = i

where

• K is the main memory block number,
• N is the number of cache lines,
• and i is the cache memory line number.

The direct cache controller interprets the CPU-generated request as a
tag, a line offset and a word offset.

• The line offset is directly connected to the address logic of the
cache memory, so the corresponding cache line is enabled.
• The existing tag in the enabled cache line is compared with the
CPU-generated tag.
• If they match, the operation is a hit, and the respective data is
transferred to the CPU based on the word offset.
• If they do not match, the operation is a miss, and the reference is
forwarded to the main memory.
• According to the main memory address format, the corresponding
block is enabled and then transferred to the cache memory using the
direct mapping function. Later the data is transferred to the CPU
based on the word offset.
• In this mapping technique, a replacement algorithm is not required,
because the mapping function itself determines which block is
replaced.
• The disadvantage of direct mapping is that each cache line is able
to hold only one block at a time, so the number of conflict misses
increases.
• To avoid this disadvantage of direct mapping, an alternative cache
organization is used in which each line is able to hold more than one
block (and tag) at a time. This alternative organization is called a
set-associative cache organization.
• Tag memory size = number of lines * number of tag bits in each
line.

For example, with 4 cache lines and 1 tag bit per line:

Tag memory size = 4 * 1 bits = 4 bits

3) Set Associative Mapping

In this mapping technique, the mapping function is used to transfer
the data from main memory to cache memory. The mapping function is:

K mod S = i

where

• K is the main memory block number,
• S is the number of cache sets,
• and i is the cache memory set number.

The set-associative cache controller interprets the CPU-generated
request as a tag, a set offset and a word offset.

• The set offset is directly connected to the address logic of the
cache memory, so the respective set is enabled.
• A set contains multiple blocks, so to identify the block that hits,
the existing tags in the enabled set are compared with the
CPU-generated tag, one after another, with a multiplexer selecting
each tag based on the selection bits.
• If any of them matches, the operation is a hit, and the data is
transferred to the CPU. If none of them matches, the operation is a
miss, and the reference is forwarded to the main memory.
• The main memory block is then transferred to the cache memory using
the set-associative mapping function. Later the data is transferred
to the CPU.
• In this technique, replacement algorithms are used to replace a
block in the set when the set is full.
• Tag memory size = number of sets in the cache * number of blocks in
each set * tag bits.

For example, with 2 sets of 2 blocks and 2 tag bits per block:

Tag memory size = 2 * 2 * 2 bits = 8 bits
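
The worked tag-memory sizes above can be reproduced with a small helper like the one below; the line counts and tag widths are simply the values assumed in the examples (4 lines with 3-bit tags, 4 lines with 1-bit tags, and 2 sets of 2 lines with 2-bit tags):

# Minimal sketch: tag memory size = number of lines * tag bits per line
# (for a set-associative cache: sets * lines per set * tag bits).
def tag_memory_bits(num_lines: int, tag_bits: int) -> int:
    return num_lines * tag_bits

print("associative:    ", tag_memory_bits(4, 3), "bits")      # 4 lines, 3-bit tags -> 12
print("direct mapped:  ", tag_memory_bits(4, 1), "bits")      # 4 lines, 1-bit tags -> 4
print("set associative:", tag_memory_bits(2 * 2, 2), "bits")  # 2 sets * 2 ways, 2-bit tags -> 8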

Memory replacement policies:-

When a page fault occurs, the program currently in execution is
suspended until the required page is transferred into main memory.
Because loading a page from auxiliary memory to main memory is an I/O
operation, the operating system assigns this task to the I/O
processor.
In this interval, control is passed to the next program in main
memory that is waiting to be processed by the CPU. As soon as the
memory block has been assigned and the page moved, the suspended
program can resume execution.
If the main memory is full, a new page cannot be brought in, so a
page must be removed from a memory block to make room for the new
page. The decision of which page to remove from memory is determined
by the replacement algorithm.
Two common replacement algorithms are first-in, first-out (FIFO) and
least recently used (LRU).
The FIFO algorithm chooses to replace the page that has been in
memory for the longest time. Every time a page is loaded into memory,
its identification number is pushed into a FIFO queue.
FIFO replacement takes effect whenever memory has no more empty
blocks: when a new page must be loaded, the page brought in longest
ago is removed. The page to be removed is easily determined, because
its identification number is at the head of the FIFO queue.
The FIFO replacement policy has the benefit of being simple to
implement. Its drawback is that, under certain circumstances, pages
are removed and loaded from memory too frequently.
The LRU policy is more complex to implement, but it is more
attractive on the presumption that the least recently used page is a
better candidate for removal than the least recently loaded page, as
in FIFO. The LRU algorithm can be implemented by associating a
counter with each page that is in main memory.
When a page is referenced, its associated counter is set to zero. At
fixed intervals of time, the counters of all pages currently in
memory are incremented by 1.
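
To make the difference between the two policies concrete, the Python sketch below simulates FIFO and LRU on the same page reference string with three page frames; the reference string is an assumption chosen so that the two policies give different fault counts:

# Minimal sketch comparing FIFO and LRU page replacement with 3 frames.
def simulate(refs, frames, policy):
    memory, faults = [], 0
    for page in refs:
        if page in memory:
            if policy == "LRU":       # a hit refreshes the page's recency
                memory.remove(page)
                memory.append(page)
            continue
        faults += 1
        if len(memory) >= frames:
            memory.pop(0)             # evict front: oldest page (FIFO)
                                      # or least recently used page (LRU)
        memory.append(page)
    return faults

refs = [1, 2, 3, 1, 4, 1, 5, 2, 1, 2]
for policy in ("FIFO", "LRU"):
    print(policy, "page faults:", simulate(refs, 3, policy))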

Instruction level parallelism:-

Instruction Level Parallelism (ILP) refers to architectures in which
multiple operations can be performed in parallel within a particular
process, with its own set of resources – address space, registers,
identifiers, state, program counter. It also refers to the compiler
design techniques and processors designed to execute operations, like
memory load and store, integer addition and float multiplication, in
parallel to improve the performance of the processors. Examples of
architectures that exploit ILP are VLIWs and superscalar
architectures.
ILP processors have the same execution hardware as RISC processors.
Machines without ILP use complex hardware which is hard to implement.
A typical ILP machine allows multiple-cycle operations to be
pipelined.
Example:
Suppose 4 operations can be carried out in a single clock cycle.
There will then be 4 functional units, each attached to one of the
operations, a branch unit, and a common register file in the ILP
execution hardware. The operations that can be performed by the
functional units are integer ALU, integer multiplication,
floating-point operations, load, and store, with respective latencies
of 1, 2, 3, 2, and 1 cycles.
Let the sequence of instructions be –
1. y1 = x1*1010
2. y2 = x2*1100
3. z1 = y1+0010
4. z2 = y2+0101
5. t1 = t1+1
6. p = q*1000
7. clr = clr+0010
8. r = r+0001

Instruction-level parallelism is achieved when multiple operations
are performed in a single cycle, either by executing them
simultaneously or by utilizing the gaps between two successive
operations that are created by their latencies.
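
Under the latencies listed above, one way the eight operations could be packed into wide instructions is sketched below. The greedy scheduler and the 4-issue-per-cycle limit are simplifying assumptions, not a real compiler algorithm:

# Minimal sketch: greedily schedule the eight operations above onto a
# 4-issue ILP machine, respecting the assumed latencies (multiply = 2
# cycles, add = 1 cycle) and the data dependencies between them.
ops = {
    "y1 = x1*1010":   ([], 2),
    "y2 = x2*1100":   ([], 2),
    "z1 = y1+0010":   (["y1 = x1*1010"], 1),
    "z2 = y2+0101":   (["y2 = x2*1100"], 1),
    "t1 = t1+1":      ([], 1),
    "p = q*1000":     ([], 2),
    "clr = clr+0010": ([], 1),
    "r = r+0001":     ([], 1),
}
ISSUE_WIDTH = 4

ready_at = {}        # op -> cycle at which its result becomes available
schedule = {}        # cycle -> list of ops issued in that cycle
issued = set()
cycle = 1
while len(issued) < len(ops):
    slots = 0
    for op, (deps, latency) in ops.items():
        if op in issued or slots >= ISSUE_WIDTH:
            continue
        # an operation may issue once all of its operands are available
        if all(ready_at.get(d, float("inf")) <= cycle for d in deps):
            schedule.setdefault(cycle, []).append(op)
            ready_at[op] = cycle + latency
            issued.add(op)
            slots += 1
    cycle += 1

for c in sorted(schedule):
    print(f"cycle {c}: {schedule[c]}")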

Now, the decision of when to execute an operation depends largely on
the compiler rather than the hardware. However, the extent of the
compiler's control depends on the type of ILP architecture, which
determines how much information about parallelism the compiler must
convey to the hardware through the program. ILP architectures can be
classified in the following ways –
1. Sequential Architecture :
Here, program is not expected to explicitly convey any
information regarding parallelism to hardware, like
superscalar architecture.
2. Dependence Architectures :
Here, program explicitly mentions information regarding
dependencies between operations like dataflow architecture.
3. Independence Architecture :
Here, program gives information regarding which operations
are independent of each other so that they can be executed
instead of the ‘nop’s.
In order to apply ILP, compiler and hardware must determine data
dependencies, independent operations, and scheduling of these
independent operations, assignment of functional unit, and register
to store data.
Techniques for increasing instruction level parallelism:-
Instruction Level Parallelism (ILP) is the number of instructions in
a program that can be executed simultaneously in a clock cycle.
Microprocessors exploit ILP by means of several techniques that have
been implemented over the last decades as hardware has advanced. The
following sections present the different techniques that have been
used successfully to execute multiple instructions of a single
program in a single clock cycle.
Superscalar Architecture:-
A more aggressive approach is to equip the processor with multiple
processing units to handle several instructions in parallel in each
processing stage. With this arrangement, several instructions start
execution in the same clock cycle and the process is said to use
multiple issue. Such processors are capable of achieving an
instruction execution throughput of more than one instruction per
cycle. They are known as ‘Superscalar Processors’.

Consider, for example, a processor with two execution units: one for
integer and one for floating-point operations. The instruction fetch
unit is capable of reading two instructions at a time and storing
them in the instruction queue. In each cycle, the dispatch unit
retrieves and decodes up to two instructions from the front of the
queue. If there is one integer instruction, one floating-point
instruction and no hazards, both instructions are dispatched in the
same clock cycle.
Advantages of Superscalar Architecture :
• The compiler can avoid many hazards through judicious
selection and ordering of instructions.
• The compiler should strive to interleave floating point and
integer instructions. This would enable the dispatch unit to
keep both the integer and floating point units busy most of
the time.
• In general, high performance is achieved if the compiler is
able to arrange program instructions to take maximum
advantage of the available hardware units.
Disadvantages of Superscalar Architecture :
• In a Superscalar Processor, the detrimental effect on
performance of various hazards becomes even more
pronounced.
Due to this type of architecture, scheduling problems can
occur.

Superpipelining:-

Superpipelining is based on dividing the stages of a pipeline into
several sub-stages, and thus increasing the number of instructions
which are handled by the pipeline at the same time. For example, by
dividing each stage into two sub-stages, a pipeline can perform at
twice the speed in the ideal situation. Many pipeline stages may
perform tasks that require less than half a clock cycle, and no
duplication of hardware is needed for these stages.
For a given architecture and the corresponding instruction set there
is an optimal number of pipeline stages/sub-stages. Increasing the
number of stages/sub-stages beyond this limit reduces the overall
performance because of:
• the overhead of data buffering between the stages,
• the fact that not all stages can be divided into (equal-length)
sub-stages,
• hazards that become more difficult to resolve, and
• more complex hardware.

VLIW Processors :-

A key benefit of very long instruction word (VLIW) architectures is
their ability to support large amounts of hardware parallelism with
relatively simple control logic. While VLIW processors have been
built that issued as many as 28 operations per cycle, superscalar
architectures have not yet demonstrated such high levels of
parallelism. [15] The VLIW approach exposes the hardware resources of
the processor to the compiler. The compiler is responsible for all
decisions about instruction scheduling, including the assignment of
null operations to functional units that cannot be assigned a useful
task during a given cycle. VLIW advocates believe that since the
compiler has complete information about the entire program, it is
best suited to manage the available resources (like ALUs, FPUs and
I/O units), and is best able to increase instruction execution
bandwidth in areas of the code that were previously
performance-limited by resource constraints. However, VLIW does not
support out-of-order execution, and any change of the hardware
description requires all code to be recompiled for the program to
work correctly.

Array processor:-

An array processor performs computations on a large array of data.
There are two types of array processors: the attached array processor
and the SIMD array processor. These are explained below.
1. Attached Array Processor :

To improve the performance of the host computer in numerical
computational tasks, an auxiliary processor is attached to it.

An attached array processor has two interfaces:
1. An input/output interface to a common processor.
2. An interface with a local memory.

The local memory is interconnected with the main memory. The host
computer is a general-purpose computer, and the attached processor is
a back-end machine driven by the host computer.
The array processor is connected through an I/O controller to the
computer, and the computer treats it as an external interface.
2. SIMD array processor :
This is a computer with multiple processing units operating in
parallel. Both types of array processors manipulate vectors, but
their internal organization is different.


The processing units are synchronized to perform the same operation
under the control of a common control unit. Thus providing a single
instruction stream, multiple data stream (SIMD) organization. As
shown in the figure, a SIMD array processor contains a set of
identical processing elements (PEs), each having a local memory M.
Each PE includes –
• ALU
• Floating point arithmetic unit
• Working registers

The master control unit controls the operation of the PEs. Its
function is to decode each instruction and determine how the
instruction is to be executed. If the instruction is a scalar or
program-control instruction, it is executed directly within the
master control unit.
Main memory is used for storage of the program while each PE uses
operands stored in its local memory.
Vector Processor:-
Vector processor is basically a central processing unit that has the
ability to execute the complete vector input in a single instruction.
More specifically we can say, it is a complete unit of hardware
resources that executes a sequential set of similar data items in the
memory using a single instruction.
The elements of a vector are stored so that they occupy successive
addresses in memory, which is why the data is said to be processed
sequentially. A vector processor has a single control unit but
multiple execution units that perform the same operation on different
data elements of the vector.

Unlike scalar processors, which operate on only a single pair of data
at a time, a vector processor operates on multiple pairs of data.
Scalar code can be converted into vector code; this conversion
process is known as vectorization. So we can say that vector
processing allows operations on multiple data elements with the help
of a single instruction.

These instructions are called single instruction, multiple data
(SIMD) or vector instructions. CPUs used in recent times make use of
vector processing because it is more advantageous than scalar
processing.
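
Here is a toy illustration of vectorization in Python: the scalar version processes one pair of elements per loop iteration, while the NumPy version expresses the same computation as a single whole-array operation (NumPy is used here only as a convenient stand-in for vector hardware):

import numpy as np

# Scalar code: one pair of operands per loop iteration.
def scalar_add(a, b):
    c = [0.0] * len(a)
    for i in range(len(a)):
        c[i] = a[i] + b[i]
    return c

# "Vectorized" code: the whole arrays are added in one high-level
# operation, analogous to a single vector instruction.
def vector_add(a, b):
    return np.asarray(a) + np.asarray(b)

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
print(scalar_add(a, b))   # [11.0, 22.0, 33.0, 44.0]
print(vector_add(a, b))   # [11. 22. 33. 44.]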

The functional units of a vector computer are as follows:

• IPU or instruction processing unit


• Vector register
• Scalar register
• Scalar processor
• Vector instruction controller
• Vector access controller
• Vector processor

Multiprocessor architecture:-
A multiprocessor is a computer system in which two or more central
processing units (CPUs) share full access to a common RAM. The main
objective of using a multiprocessor is to boost the system's
execution speed, with other objectives being fault tolerance and
application matching.
There are two types of multiprocessors: shared memory multiprocessors
and distributed memory multiprocessors. In a shared memory
multiprocessor, all the CPUs share the common memory, but in a
distributed memory multiprocessor, every CPU has its own private
memory.

Applications of Multiprocessor –
1. As a uniprocessor, such as single instruction, single data stream
(SISD).
2. As a multiprocessor, such as single instruction, multiple data
stream (SIMD), which is usually used for vector processing.
3. For multiple series of instructions operating on a single data
stream, i.e. multiple instruction, single data stream (MISD), which
is used to describe hyper-threading or pipelined processors.
4. Inside a single system for executing multiple, individual series
of instructions on multiple data streams, i.e. multiple instruction,
multiple data stream (MIMD).
Benefits of using a Multiprocessor –

• Enhanced performance.
• Multiple applications.
• Multi-tasking inside an application.
• High throughput and responsiveness.
• Hardware sharing among CPUs.

Advantages of Multiprocessor Systems

There are multiple advantages to multiprocessor systems. Some of
these are −
More reliable Systems
In a multiprocessor system, even if one processor fails, the system
will not halt. This ability to continue working despite hardware
failure is known as graceful degradation. For example, if there are 5
processors in a multiprocessor system and one of them fails, the
remaining 4 processors keep working, so the system only becomes
slower and does not grind to a halt.
Enhanced Throughput
If multiple processors are working in tandem, then the throughput of
the system increases, i.e. the number of processes executed per unit
of time increases. If there are N processors, the throughput
increases by an amount just under N.
More Economic Systems
Multiprocessor systems are cheaper than single processor systems in
the long run because they share the data storage, peripheral devices,
power supplies etc. If there are multiple processes that share data, it
is better to schedule them on multiprocessor systems with shared
data than have different computer systems with multiple copies of
the data.

Disadvantages of Multiprocessor Systems

There are some disadvantages as well to multiprocessor systems. Some
of these are:
Increased Expense
Even though multiprocessor systems are cheaper in the long run than
using multiple computer systems, they are still quite expensive. It
is much cheaper to buy a simple single-processor system than a
multiprocessor system.
Complicated Operating System Required
There are multiple processors in a multiprocessor system that share
peripherals, memory, etc., so it is much more complicated to schedule
processes and allocate resources to processes than in
single-processor systems. Hence, a more complex and sophisticated
operating system is required in multiprocessor systems.
Large Main Memory Required
All the processors in the multiprocessor system share the memory. So
a much larger pool of memory is required as compared to single
processor systems.
Taxonomy of parallel architecture (Flynn's taxonomy):-
Parallel computing is computing in which jobs are broken into
discrete parts that can be executed concurrently. Each part is
further broken down into a series of instructions, and the
instructions from each part execute simultaneously on different CPUs.
Parallel systems deal with the simultaneous use of multiple computer
resources, which can include a single computer with multiple
processors, a number of computers connected by a network to form a
parallel processing cluster, or a combination of both.
Parallel systems are more difficult to program than computers with a
single processor, because the architecture of parallel computers
varies and the processes of multiple CPUs must be coordinated and
synchronized.
The crux of parallel processing is the CPU. Based on the number of
instruction and data streams that can be processed simultaneously,
computing systems are classified into four major categories:

Flynn’s classification –

1. Single-instruction, single-data (SISD) systems –


An SISD computing system is a uniprocessor machine which
is capable of executing a single instruction, operating on a
single data stream. In SISD, machine instructions are
processed in a sequential manner and computers adopting
this model are popularly called sequential computers. Most
conventional computers have SISD architecture. All the
instructions and data to be processed have to be stored in
primary memory.

The speed of the processing element in the SISD model is
limited by (dependent on) the rate at which the computer can
transfer information internally. Dominant representative SISD
systems are the IBM PC and workstations.
2. Single-instruction, multiple-data (SIMD) systems –
An SIMD system is a multiprocessor machine capable of
executing the same instruction on all the CPUs but operating
on different data streams. Machines based on an SIMD model
are well suited to scientific computing, since it involves lots
of vector and matrix operations. The data elements of vectors
can be organized into multiple sets (N sets for N-PE systems)
so that the information can be passed to all the processing
elements (PEs), and each PE can process one data set.

A dominant representative SIMD system is Cray's vector
processing machine.
3. Multiple-instruction, single-data (MISD) systems –
An MISD computing system is a multiprocessor machine
capable of executing different instructions on different PEs
but all of them operating on the same dataset .

Example: Z = sin(x) + cos(x) + tan(x)
The system performs different operations on the same data set.
Machines built using the MISD model are not useful in most
applications; a few machines have been built, but none of them
are available commercially.

4. Multiple-instruction, multiple-data (MIMD) systems –
An MIMD system is a multiprocessor machine which is capable
of executing multiple instructions on multiple data sets. Each
PE in the MIMD model has separate instruction and data
streams; therefore machines built using this model are
capable of handling any kind of application. Unlike SIMD and
MISD machines, PEs in MIMD machines work asynchronously.

MIMD machines are broadly categorized into shared-memory


MIMD and distributed-memory MIMD based on the way PEs
are coupled to the main memory.
In the shared memory MIMD model (tightly coupled
multiprocessor systems), all the PEs are connected to a single
global memory and they all have access to it. The
communication between PEs in this model takes place through
the shared memory; a modification of the data stored in the
global memory by one PE is visible to all other PEs. Dominant
representative shared memory MIMD systems are Silicon
Graphics machines and Sun/IBM’s SMP (Symmetric Multi-
Processing).
In Distributed memory MIMD machines (loosely coupled
multiprocessor systems) all PEs have a local memory. The
communication between PEs in this model takes place through
the interconnection network (the inter process
communication channel, or IPC). The network connecting PEs
can be configured to tree, mesh or in accordance with the
requirement.
The shared-memory MIMD architecture is easier to program
but is less tolerant to failures and harder to extend with
respect to the distributed memory MIMD model. Failures in a
shared-memory MIMD affect the entire system, whereas this
is not the case of the distributed model, in which each of the
PEs can be easily isolated. Moreover, shared memory MIMD
architectures are less likely to scale because the addition of
more PEs leads to memory contention. This is a situation that

does not happen in the case of distributed memory, in which
each PE has its own memory. As a result of practical outcomes
and users' requirements, the distributed memory MIMD
architecture is considered superior to the other existing models.

Centralized Shared-Memory Architectures


• The use of large multilevel caches can substantially reduce
the memory bandwidth demands of a processor.

• This has made it possible for several (micro)processors to
share the same memory through a shared bus.

• Caching supports both private and shared data.
o For private data, once cached, its treatment is
identical to that of a uniprocessor.
o For shared data, the shared value may be
replicated in many caches.

• Replication has several advantages:
• Reduced latency and memory bandwidth requirements.
• Reduced contention for data items that are read by multiple
processors simultaneously.

• However, it also introduces a problem: cache coherence.


Cache Coherence
• With multiple caches, one CPU can modify memory at
locations that other CPUs have cached.

• For example:
• CPU A reads location x, getting the value N .
• Later, CPU B reads the same location, getting the value N .
• Next, CPU A writes location x with the value N - 1 .
• At this point, any reads from CPU B will get the value N ,
while reads from CPU A will get the value N - 1 .

• This problem occurs both with write-through caches and
(more seriously) with write-back caches; a small sketch
illustrating this scenario appears after the coherence
definition below.

• Cache coherence, informal definition:
o A memory system is coherent if any read of a data
item returns the most recently written value of
that data item.

• Upon closer inspection, there are several aspects that need
to be addressed.
Cache Coherence
• Coherence defines what values can be returned by a read.

• A memory system is coherent if:


• Read after write works for a single processor.
o If CPU A writes N to location X, all future reads of
location X will return N if no other processor
writes location X after CPU A.
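
The staleness scenario from the example above can be mimicked with the toy Python model below: two private write-back caches in front of one shared memory, with no invalidation protocol, so CPU B keeps returning the old value after CPU A's write. The structure and the values (N = 5) are assumptions for illustration only:

# Toy model of the coherence problem: two private write-back caches
# over one shared memory, with no invalidation/snooping protocol.
memory = {"x": 5}                     # location x initially holds N = 5

class Cache:
    def __init__(self, name):
        self.name = name
        self.lines = {}
    def read(self, addr):
        if addr not in self.lines:    # miss: fetch the value from memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]
    def write(self, addr, value):
        self.lines[addr] = value      # write-back: memory is not updated yet

cache_a, cache_b = Cache("A"), Cache("B")
print("A reads x:", cache_a.read("x"))   # 5  (N)
print("B reads x:", cache_b.read("x"))   # 5  (N)
cache_a.write("x", 4)                    # A writes N - 1
print("A reads x:", cache_a.read("x"))   # 4  (N - 1)
print("B reads x:", cache_b.read("x"))   # 5  -- stale: B never sees the update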

Memory consistency :
Memory consistency defines the order in which memory
operations (from any process) appear to execute with respect to
one another.
• Which orders are preserved?
• Given a load, what are the possible values it can return?
It is impossible to reason about the execution of a shared
address space (SAS) program without a consistency model. This
has consequences for both programmers and system designers: the
programmer uses the model to reason about correctness and
possible outcomes, while system designers can use it to decide
how far accesses may be reordered by the compiler or hardware.
The consistency model is, in effect, an agreement between the
programmer and the system.

Interconnection Network:-

Interconnection networks are composed of switching elements.
Topology is the pattern to connect the individual switches to other
elements, like processors, memories and other switches. A network
allows exchange of data between processors in the parallel system.

• Direct connection networks − Direct networks have point-to-point
connections between neighboring nodes. These networks are static,
which means that the point-to-point connections are fixed. Some
examples of direct networks are rings, meshes and cubes.
• Indirect connection networks − Indirect networks have no
fixed neighbors. The communication topology can be
changed dynamically based on the application demands.
Indirect networks can be subdivided into three parts: bus
networks, multistage networks and crossbar switches.
o Bus networks − A bus network is composed of a
number of bit lines onto which a number of
resources are attached. When busses use the same
physical lines for data and addresses, the data and
the address lines are time multiplexed. When there
are multiple bus-masters attached to the bus, an
arbiter is required.
o Multistage networks − A multistage network
consists of multiple stages of switches. It is
composed of a×b switches which are connected
using a particular interstage connection pattern
(ISC). Small 2x2 switch elements are a common
choice for many multistage networks. The number
of stages determine the delay of the network. By
choosing different interstage connection patterns,
various types of multistage network can be
created.
o Crossbar switches − A crossbar switch contains a
matrix of simple switch elements that can switch
on and off to create or break a connection. Turning
on a switch element in the matrix, a connection
between a processor and a memory can be made.
Crossbar switches are non-blocking, that is all
communication permutations can be performed
without blocking.
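
A crossbar can be pictured as a matrix of on/off switch points; the small Python sketch below sets up non-conflicting processor-to-memory connections and rejects a request for a memory module that is already in use (the 4x4 size is an assumption for the example):

# Toy crossbar switch: a matrix of switch points connecting 4 processors
# (rows) to 4 memory modules (columns). A new connection can be made as
# long as both the requested processor and memory module are free.
N = 4
switch = [[False] * N for _ in range(N)]

def connect(proc: int, mem: int) -> bool:
    row_busy = any(switch[proc])                      # processor already connected
    col_busy = any(switch[p][mem] for p in range(N))  # memory module already in use
    if row_busy or col_busy:
        return False
    switch[proc][mem] = True
    return True

print(connect(0, 2))   # True  - P0 connected to M2
print(connect(1, 3))   # True  - P1 connected to M3, in parallel with P0-M2
print(connect(2, 2))   # False - M2 is already connected to P0 (blocked request)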

Architecture of Distributed Shared Memory (DSM):-

Distributed Shared Memory (DSM) implements the shared memory model in
a distributed system that has no physically shared memory. The shared
memory model provides a virtual address space shared between all
nodes, and it is used to overcome the high cost of communication in
distributed systems. DSM systems move data to the location of access:
data moves between main memory and secondary memory (within a node)
and between the main memories of different nodes.
Every data object is owned by a node. The initial owner is the node
that created the object, and ownership can change as the object moves
from node to node. When a process accesses data in the shared address
space, a mapping manager maps the shared memory address to physical
memory (local or remote).
DSM allows programs running on separate machines to share data
without the programmer having to deal with sending messages; instead,
the underlying implementation sends the messages needed to keep the
DSM consistent between computers. DSM allows programs that used to
run on the same computer to be easily adapted to run on separate
machines. Programs access what appears to them to be ordinary memory.
Hence, programs that use DSM are usually shorter and easier to
understand than programs that use message passing. However, DSM is
not suitable for all situations. Client-server systems are generally
less suited to DSM, although a server may use DSM to provide shared
data to its clients.
Architecture of Distributed Shared Memory (DSM) :
Every node consists of one or more CPUs and a memory unit. A
high-speed communication network is employed for connecting the
nodes, and a straightforward message passing system allows processes
on different nodes to exchange messages with one another.
Memory mapping manager unit :
The memory mapping manager routine in every node maps the local
memory onto the shared virtual memory. For the mapping operation, the
shared memory space is divided into blocks.
Data caching is a well-known solution for dealing with memory access
latency, and DSM uses data caching to reduce network latency: the
main memory of the individual nodes is used to cache pieces of the
shared memory space.
The memory mapping manager of each node views its local memory as a
big cache of the shared memory space for its associated processors.
The basic unit of caching is a memory block. In systems that support
DSM, data moves between secondary memory and main memory as well as
between the main memories of different nodes.
Communication Network Unit :
When a process accesses data in the shared address space, the mapping
manager maps the shared memory address to the physical memory. This
mapping layer of code is implemented either within the operating
system kernel or as a runtime routine.
The physical memory on each node holds pages of the shared virtual
address space. Local pages are present in the node's own memory,
while remote pages reside in some other node's memory.
Cluster computing:-
Cluster computing is a collection of tightly or loosely connected
computers that work together so that they act as a single entity. The
connected computers execute operations all together, thus creating
the impression of a single system. The clusters are generally
connected through fast local area networks (LANs).

(Figure: A simple cluster computing layout)

Types of Cluster computing :


1. High performance (HP) clusters :
HP clusters use computer clusters and supercomputers to solve
advanced computational problems. They are used to perform functions
that need the nodes to communicate as they perform their jobs. They
are designed to take advantage of the parallel processing power of
several nodes.
2. Load-balancing clusters :
Incoming requests are distributed for resources among several nodes
running similar programs or having similar content. This prevents
any single node from receiving a disproportionate amount of task.
This type of distribution is generally used in a web-hosting
environment.
3. High Availability (HA) Clusters :
HA clusters are designed to maintain redundant nodes that can act as
backup systems in case any failure occurs. Consistent computing
services like business activities, complicated databases, customer
services like e-websites and network file distribution are provided.
They are designed to give uninterrupted data availability to the
customers.
Classification of Cluster :
1. Open Cluster :

IPs are needed by every node and those are accessed only through
the internet or web. This type of cluster causes enhanced security
concerns.
2. Close Cluster :
The nodes are hidden behind the gateway node, and they provide
increased protection. They need fewer IP addresses and are good for
computational tasks.
Cluster Computing Architecture :
• It is designed with an array of interconnected individual
computers and the computer systems operating collectively
as a single standalone system.
• It is a group of workstations or computers working together
as a single, integrated computing resource connected via high
speed interconnects.
• A node – Either a single or a multiprocessor network having
memory, input and output functions and an operating system.
• Two or more nodes are connected on a single line or every
node might be connected individually through a LAN
connection.

Cluster Computing Architecture

Components of a Cluster Computer :


1. Cluster Nodes
2. Cluster Operating System
3. The switch or node interconnect

4. Network switching hardware

Cluster Components

Advantages of Cluster Computing :


1. High Performance :
The systems offer better and enhanced performance than that of
mainframe computer networks.
2. Easy to manage :
Cluster Computing is manageable and easy to implement.
3. Scalable :
Resources can be added to the clusters accordingly.
4. Expandability :
Computer clusters can be expanded easily by adding additional
computers to the network. Cluster computing is capable of combining
several additional resources or the networks to the existing computer
system.
5. Availability :
If one node fails, the other nodes remain active and can act as a
proxy for the failed node, ensuring enhanced availability.
6. Flexibility :
It can be upgraded to the superior specification or additional nodes
can be added.

Disadvantages of Cluster Computing :

1. High cost :
It is not very cost-effective due to the high cost of its hardware
and its design.
2. Problem in finding fault :
It is difficult to find which component has a fault.
3. More space is needed :
Infrastructure requirements may increase, as more servers are needed
to manage and monitor the cluster.
Applications of Cluster Computing :
• Various complex computational problems can be solved.
• It can be used in the applications of aerodynamics,
astrophysics and in data mining.
• Weather forecasting.
• Image Rendering.
• Various e-commerce applications.
• Earthquake Simulation.
• Petroleum reservoir simulation.
