
1 Microprocessor

A microprocessor is the controlling unit of a micro-computer, fabricated on a single small chip, capable of performing ALU (Arithmetic Logic Unit) operations and communicating with the other devices connected to it.

A microprocessor consists of an ALU, a register array, and a control unit. The ALU performs arithmetic and logical operations on data received from memory or an input device.

1.1 Block Diagram of a Basic Microcomputer

 Instruction Set: It is the set of instructions that the microprocessor can understand.
 Bandwidth: It is the number of bits processed in a single instruction.
 Clock Speed: It determines the number of operations per second the processor can perform. It is expressed in megahertz (MHz) or gigahertz (GHz). It is also known as the clock rate.
 Word Length: It depends upon the width of the internal data bus, registers, ALU, etc. An 8-bit microprocessor can process 8-bit data at a time. The word length ranges from 4 bits to 64 bits depending upon the type of the microcomputer.
 Data Types: The microprocessor supports multiple data type formats such as binary, BCD, ASCII, and signed and unsigned numbers.
1.2 Features of a Microprocessor
 Cost-effective: Microprocessor chips are available at low prices, which keeps system cost low.
 Size: The microprocessor is a small-sized chip and hence is portable.
 Low Power Consumption: Microprocessors are manufactured using metal-oxide semiconductor technology, which has low power consumption.
 Versatility: Microprocessors are versatile, as the same chip can be used in a number of applications simply by changing the software program.
 Reliability: The failure rate of an IC in microprocessors is very low, hence they are reliable.
1.3 Microcontrollers
A microcontroller is a small, low-cost microcomputer designed to perform the specific tasks of embedded systems, such as displaying a microwave oven's information or receiving remote-control signals.
A typical microcontroller consists of a processor, memory (RAM, ROM, EPROM), serial ports, and peripherals (timers, counters), etc.

1.4 Difference between Microprocessor and Microcontroller


| Microcontroller | Microprocessor |
|---|---|
| Microcontrollers are used to execute a single task within an application. | Microprocessors are used for big applications. |
| Its designing and hardware cost is low. | Its designing and hardware cost is high. |
| Easy to replace. | Not so easy to replace. |
| It is built with CMOS technology, which requires less power to operate. | Its power consumption is high because it has to control the entire system. |
| It consists of a CPU, RAM, ROM, and I/O ports. | It does not contain RAM, ROM, or I/O ports; it uses its pins to interface with peripheral devices. |

1.5 Applications of Microcontrollers


Microcontrollers are widely used in various devices such as:
 Light-sensing and controlling devices, such as LED lighting controls.
 Temperature-sensing and controlling devices, such as microwave ovens and chimneys.
 Fire-detection and safety devices, such as fire alarms.
 Measuring devices, such as voltmeters.
[Figure: Architecture of the 8051 Microcontroller]
2 Instruction Set Architecture
2.1 Introduction

Instruction set architecture is the part of processor architecture that is necessary for creating machine-level programs to perform mathematical or logical operations. The instruction set architecture acts as an interface between hardware and software: it prepares the processor to respond to commands such as execute, delete, etc. given by the user.

The performance of the processor is determined by the instruction set architecture designed into it.

The Instruction Set Architecture (ISA) is the part of the processor that is visible to the programmer or compiler writer. The ISA serves as the boundary between software and hardware. We will briefly describe the instruction sets found in many of the microprocessors used today.

The ISA of a processor can be described using 5 categories:

Operand storage in the CPU
Where are the operands kept other than in memory?

Number of explicitly named operands
How many operands are named in a typical instruction?

Operand location
Can any ALU instruction operand be located in memory, or must all operands be kept internally in the CPU?

Operations
What operations are provided in the ISA?

Type and size of operands
What is the type and size of each operand, and how is it specified?

Of all of the above, the most distinguishing factor is the first.

The 3 most common types of ISAs are:

1. Stack - The operands are implicitly on top of the stack.
2. Accumulator - One operand is implicitly the accumulator.
3. General Purpose Register (GPR) - All operands are explicitly named; they are either registers or memory locations.

Stack

Advantages: Simple model of expression evaluation (reverse Polish). Short instructions.

Disadvantages: A stack can't be randomly accessed, which makes it hard to generate efficient code. The stack itself is accessed on every operation and becomes a bottleneck.
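As a small illustration of the stack model (not part of the original notes), the following sketch evaluates an expression written in reverse Polish notation, the form a stack ISA executes naturally; the function name rpn_eval is an illustrative choice.

```python
# Minimal sketch of stack-machine evaluation of reverse Polish notation.
# Operands are pushed; each operator pops two values and pushes the result,
# mirroring how a stack ISA computes without explicitly named operands.
def rpn_eval(tokens):
    ops = {'+': lambda a, b: a + b,
           '-': lambda a, b: a - b,
           '*': lambda a, b: a * b,
           '/': lambda a, b: a / b}
    stack = []
    for tok in tokens:
        if tok in ops:
            b = stack.pop()          # top of stack is the second operand
            a = stack.pop()
            stack.append(ops[tok](a, b))
        else:
            stack.append(float(tok)) # operand: push onto the stack
    return stack.pop()

# (A + B) * C with A=2, B=3, C=4 is written "2 3 + 4 *" in reverse Polish.
print(rpn_eval("2 3 + 4 *".split()))   # 20.0
```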

Accumulator

Advantages: Short instructions.

Disadvantages: The accumulator is only temporary storage, so memory traffic is the highest for this approach.

GPR

Advantages: Makes code generation easy. Data can be stored for long periods in registers.
Disadvantages: All operands must be named, leading to longer instructions.

Earlier CPUs were of the first two types, but in the last 15 years nearly all CPUs made have been GPR processors. The two major reasons are that registers are faster than memory, and the more data that can be kept internally in the CPU, the faster the program will run. The other reason is that registers are easier for a compiler to use.

2.2 What Makes a Good Instruction Set?


Implementability
– supports a range of implementations (performance/cost)
– implies support for high-performance implementations

Programmability
– easy to express programs

Backward/forward/upward compatibility
– implementability and programmability across generations
2.3 RISC vs CISC

2.3.1 CISC Architecture

In the early days, machines were programmed in assembly language and memory access was slow. To perform complex arithmetic operations, compilers had to create long sequences of machine code.

This led designers to build an architecture that accesses memory less frequently and reduces the burden on the compiler. The result was a very powerful but complex instruction set.

Advantages of CISC Architecture

 Microprogramming is easy to implement and much less expensive than hard-wiring a control unit.
 It is easy to add new commands to the chip without changing the structure of the instruction set, as the architecture uses general-purpose hardware to carry out commands.
 This architecture makes efficient use of main memory, since the complexity (or greater capability) of each instruction allows a given task to be achieved with fewer instructions.
 The compiler need not be very complicated, as the microprogram instruction sets can be written to match the constructs of high-level languages.

Disadvantages of CISC Architecture

 Each new or succeeding version of a CISC processor includes the earlier-generation processors in its instruction subset. Therefore, the chip hardware and instruction set become more complex with each generation of the processor.
 The overall performance of the machine is reduced because of the slower clock speed.
 CISC designs include complex hardware and on-chip software to perform many functions.

2.3.2 RISC Architecture

In RISC architecture, the instruction set of the processor is simplified to reduce the execution time. It uses a small and highly optimized set of instructions, which are generally register-to-register operations.

The speed of execution is increased by using a smaller number of simpler instructions, and pipelining is used for the execution of instructions.

Advantages of RISC Architecture

 The performance of RISC processors is often two to four times that of CISC processors because of the simplified instruction set.
 This architecture uses less chip space due to the reduced instruction set, which makes it possible to place extra functions such as floating-point arithmetic units or memory-management units on the same chip.
 Per-chip cost is reduced because this architecture uses smaller chips, allowing more components on a single silicon wafer.
 RISC processors can be designed more quickly than CISC processors due to the simpler architecture.
 The execution of instructions in RISC processors is fast due to the use of many registers for holding and passing operands, as compared to CISC processors.

Disadvantages of RISC Architecture

 The performance of a RISC processor depends on the code that is being executed. When a compiler does a poor job of scheduling instruction execution, the processor spends much time waiting for the result of the first instruction before it can proceed with the next instruction.
 RISC processors require very fast memory systems to feed instructions. Typically, a large memory cache is provided on the chip in most RISC-based systems.
| | RISC | CISC |
|---|---|---|
| Acronym | It stands for ‘Reduced Instruction Set Computer’. | It stands for ‘Complex Instruction Set Computer’. |
| Definition | RISC processors have a smaller set of instructions with few addressing modes. | CISC processors have a larger set of instructions with many addressing modes. |
| Memory unit | It has no memory unit and uses separate hardware to implement instructions. | It has a memory unit to implement complex instructions. |
| Program | It has a hard-wired programming unit. | It has a micro-programming unit. |
| Design | It is an easy compiler design. | It is a complex compiler design. |
| Calculations | The calculations are faster and precise. | The calculations are slow and precise. |
| Decoding | Decoding of instructions is simple. | Decoding of instructions is complex. |
| Time | Execution time is very low. | Execution time is very high. |
| External memory | It does not require external memory for calculations. | It requires external memory for calculations. |
| Pipelining | Pipelining functions correctly. | Pipelining does not function correctly. |
| Stalling | Stalling is mostly reduced. | The processors often stall. |
| Code expansion | Code expansion can be a problem. | Code expansion is not a problem. |
| Disc space | The space is saved. | The space is wasted. |
| Applications | Used in high-end applications such as video processing, telecommunications and image processing. | Used in low-end applications such as security systems, home automation, etc. |

2.4 Moore's law


Moore's law is the observation that over the history of computing hardware, the
number of transistors on integrated circuits doubles approximately every two years.
The period often quoted as "18 months" is due to Intel executive David House, who
predicted that period for a doubling in chip performance (being a combination of the
effect of more transistors and their being faster).
The capabilities of many digital electronic devices are strongly linked to Moore's law:
processing speed, memory capacity, sensors and even the number and size of
pixels in digital cameras. All of these are improving at (roughly) exponential rates as
well.
2.5 SPEC Rating
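A SPEC rating is commonly computed per benchmark as the ratio of a reference machine's execution time to the measured execution time, with the overall rating taken as the geometric mean of those ratios. A minimal sketch with made-up times (the function name spec_rating is illustrative):

```python
# Sketch of a SPEC-style rating: ratio of reference time to measured time per
# benchmark, combined with a geometric mean (illustrative numbers only).
from math import prod

def spec_rating(ref_times, measured_times):
    ratios = [r / m for r, m in zip(ref_times, measured_times)]
    return prod(ratios) ** (1.0 / len(ratios))   # geometric mean of the ratios

print(spec_rating([500, 800, 1200], [250, 200, 600]))  # ratios 2, 4, 2 -> ~2.52
```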

2.6 Amdahl’s Law
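Amdahl's law states that if a fraction f of a computation's execution time benefits from an enhancement that speeds that fraction up by a factor s, the overall speedup is 1 / ((1 - f) + f/s). A minimal sketch (the function name and example numbers are illustrative):

```python
# Amdahl's law: overall speedup when only a fraction f of the work is sped up by s.
def amdahl_speedup(f, s):
    return 1.0 / ((1.0 - f) + f / s)

# If 80% of a program is made 10x faster, the overall speedup is only about 3.6x,
# because the untouched 20% limits the benefit.
print(round(amdahl_speedup(0.8, 10), 2))   # 3.57
```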


2.7 Parallel Processing System (Parallel Computer)
A computer system is said to be a parallel processing system, or parallel computer, if it provides facilities for the simultaneous processing of various sets of data or the simultaneous execution of multiple instructions.
On a computer with more than one processor, each of several processes can be assigned to its own processor, allowing the processes to progress simultaneously.
If only one processor is available, the effect of parallel processing can be simulated by having the processor run each process in turn for a short time.

Parallelism in Uniprocessor Systems


A uniprocessor (one CPU) system can perform two or more tasks simultaneously. The tasks are not related to each
other. So, a system that processes two different instructions simultaneously could be considered to perform
parallel processing.
A typical uniprocessor computer consists of three major components: the main memory, the central processing
unit (CPU), and the input-output (I/O) subsystem. The architectures of two commercially available uniprocessor
computers are given below to show the possible interconnection of structures among the three subsystems.
A number of parallel processing mechanisms have been developed in uniprocessor computers.
 multiplicity of functional units
 parallelism and pipelining within the CPU
 overlapped CPU and I/O operations
 use of a hierarchical memory system
 multiprogramming and time sharing

2.8 Instruction-level parallelism


Instruction-level parallelism (ILP) is a measure of how many of the instructions in a computer program can be
executed simultaneously.

There are two approaches to instruction level parallelism:

Hardware & Software

The hardware level works upon dynamic parallelism, whereas the software level works on static parallelism. Dynamic parallelism means the processor decides at run time which instructions to execute in parallel, whereas static parallelism means the compiler decides which instructions to execute in parallel.
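As a small illustration (not from the original notes), the first two operations below are independent of each other and could be issued in parallel, while the third depends on both of their results and must wait:

```python
# Illustrative only: the degree of instruction-level parallelism is limited
# by data dependences between operations.
a, b, c, d = 1, 2, 3, 4
e = a + b      # independent of the next line -> could execute in parallel with it
f = c + d      # independent of the previous line
g = e * f      # depends on e and f -> must wait for both results
print(g)       # 21
```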

Hardware Approach of ILP

• Hardware approach works upon dynamic parallelism.

• Dynamic parallelism means the processor decides at run time which instructions to execute in parallel.

• The Pentium processor, for example, works on dynamic parallel execution decided at run time.

Software Approach of ILP

• Software approach works on static parallelism.

• static parallelism means the compiler decides which instructions to execute in parallel.

• The Itanium processor works on static-level (compiler-determined) parallelism.


3 Pipelining
3.1 What Is Pipelining?
Pipelining is an implementation technique whereby multiple instructions are overlapped in
execution; it takes advantage of parallelism that exists among the actions needed to execute an
instruction.
A pipeline is like an assembly line. In an automobile assembly line, there are many steps, each
contributing something to the construction of the car. Each step operates in parallel with the
other steps, although on a different car. In a computer pipeline, each step in the pipeline
completes a part of an instruction. Like the assembly line, different steps are completing different
parts of different instructions in parallel. Each of these steps is called a pipe stage or a pipe
segment. The stages are connected one to the next to form a pipe—instructions enter at one end,
progress through the stages, and exit at the other end, just as cars would in an assembly line.

If the stages are perfectly balanced, then the time per instruction on the pipelined processor—assuming ideal conditions—is equal to the time per instruction on the unpipelined machine divided by the number of pipe stages.

CPU Datapath Design


• A datapath is a collection of functional units (such as arithmetic logic units or multipliers that perform data-processing operations), registers, and buses.
• Instruction execution consists of 5 parts, namely the Fetch (IF), Decode (ID), Execute (EX), Memory access (MEM), and Write-back (WB) stages.
Datapath for MIPS
Every instruction in this RISC subset can be implemented in at most 5 clock cycles. The 5 clock
cycles are as follows.
1. Instruction fetch cycle (IF):
Send the program counter (PC) to memory and fetch the current instruction from memory.
Update the PC to the next sequential PC by adding 4 (since each instruction is 4 bytes) to the PC.

2. Instruction decode/register fetch cycle (ID):


Decode the instruction and read the registers corresponding to register source specifiers from the
register file. Do the equality test on the registers as they are read, for a possible branch. Sign-
extend the offset field of the instruction in case it is needed. Compute the possible branch target
address by adding the sign-extended offset to the incremented PC. In an aggressive
implementation, which we explore later, the branch can be completed at the end of this stage by
storing the branch-target address into the PC, if the condition test yielded true. Decoding is done
in parallel with reading registers, which is possible because the register specifiers are at a fixed
location in a RISC architecture.
This technique is known as fixed-field decoding. Note that we may read a register we don’t use,
which doesn’t help but also doesn’t hurt performance. (It does waste energy to read an unneeded
register, and power-sensitive designs might avoid this.) Because the immediate portion of an
instruction is also located in an identical place, the sign-extended immediate is also calculated
during this cycle in case it is needed.

3. Execution/effective address cycle (EX):


The ALU operates on the operands prepared in the prior cycle, performing one of three functions
depending on the instruction type.
■ Memory reference—The ALU adds the base register and the offset to form the effective
address.
■ Register-Register ALU instruction—The ALU performs the operation specified by the ALU
opcode on the values read from the register file.
■ Register-Immediate ALU instruction—The ALU performs the operation specified by the ALU
opcode on the first value read from the register file and the sign-extended immediate.
In a load-store architecture the effective address and execution cycles can be combined into a
single clock cycle, since no instruction needs to simultaneously calculate a data address and
perform an operation on the data.

4. Memory access (MEM):


If the instruction is a load, the memory does a read using the effective address computed in the
previous cycle. If it is a store, then the memory writes the data from the second register read
from the register file using the effective address.
5. Write-back cycle (WB):
■ Register-Register ALU instruction or load instruction:
Write the result into the register file, whether it comes from the memory system (for a load) or
from the ALU (for an ALU instruction).

3.2 Pipeline Speedup


Now let us assume that in a non-pipelined processor a task is executed in time Tn. Then the speedup obtained from a k-stage, n-task pipeline can be given as:

Sk = Time taken in the non-pipelined system / Time taken in the k-stage pipelined system

   = n Tn / (k + n - 1) Tp

As the number of tasks increases, n becomes much larger than k - 1, and k + n - 1 approaches the value of n, so the speedup becomes

Sk = Tn / Tp

If we assume that the time it takes to process a task is the same in the pipelined and non-pipelined circuits, we have Tn = k Tp. With this assumption, the speedup becomes:

Sk = k Tp / Tp = k

Example: Let us consider a pipeline system with the following specifications:

time taken to process a sub-operation in each segment, Tp = 20 ns

number of segments in the pipeline, k = 4

number of tasks executed in sequence, n = 100

time taken to complete the pipeline = (k + n - 1) Tp
                                    = (4 + 100 - 1) × 20 = 2060 ns
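A quick check of these numbers (variable names are just for illustration); the non-pipelined time assumes Tn = k Tp, as in the derivation above:

```python
# Pipeline timing for the example above: k segments, n tasks, Tp per segment.
k, n, Tp = 4, 100, 20            # Tp in nanoseconds

pipelined = (k + n - 1) * Tp     # (4 + 100 - 1) * 20 = 2060 ns
nonpipelined = n * k * Tp        # with Tn = k * Tp: 100 * 4 * 20 = 8000 ns
speedup = nonpipelined / pipelined

print(pipelined, nonpipelined, round(speedup, 2))   # 2060 8000 3.88
```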
3.3 Basic Performance Issues in Pipelining
Pipelining increases the CPU instruction throughput—the number of instructions completed per
unit of time—but it does not reduce the execution time of an individual instruction.
The increase in instruction throughput means that a program runs faster and has lower total
execution time.
Example Consider the unpipelined processor in the previous section. Assume that it has a 1
ns clock cycle and that it uses 4 cycles for ALU operations and branches and 5 cycles for
memory operations. Assume that the relative frequencies of these operations are 40%, 20%,
and 40%, respectively. Suppose that due to clock skew and setup, pipelining the processor
adds 0.2 ns of overhead to the clock. Ignoring any latency impact, how much speedup in the
instruction execution rate will we gain from a pipeline?

Answer
The average instruction execution time on the unpipelined processor is
Average instruction execution time = Clock cycle x Average CPI
= 1 ns x [(40% + 20%) × 4 + 40% x 5]
= 1 ns x 4.4
= 4.4 ns
In the pipelined implementation, the clock must run at the speed of the slowest stage plus overhead, which will be 1 + 0.2 or 1.2 ns; this is the average instruction execution time. Thus, the speedup from pipelining is 4.4 ns / 1.2 ns ≈ 3.7 times.

The 0.2 ns overhead essentially establishes a limit on the effectiveness of pipelining. If the
overhead is not affected by changes in the clock cycle, Amdahl’s law tells us that the overhead
limits the speedup.
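The same arithmetic as the example above, written out as a short check (illustrative variable names):

```python
# Unpipelined vs. pipelined instruction execution time for the example above.
cycle = 1.0                                 # ns
avg_cpi = 0.40 * 4 + 0.20 * 4 + 0.40 * 5    # ALU 40%, branch 20%, memory 40%
unpipelined = cycle * avg_cpi               # 4.4 ns per instruction

pipelined = 1.0 + 0.2                       # slowest stage plus 0.2 ns overhead
print(round(unpipelined / pipelined, 2))    # 3.67, i.e. roughly 3.7x speedup
```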

3.4 INSTRUCTION PIPELINE


In a von Neumann architecture, the process of executing an instruction involves several
steps. First, the control unit of a processor fetches the instruction from the cache (or from
memory). Then the control unit decodes the instruction to determine the type of operation to
be performed. When the operation requires operands, the control unit also determines the
address of each operand and fetches them from cache (or memory). Next, the operation is
performed on the operands and, finally, the result is stored in the specified location.

An instruction pipeline increases the performance of a processor by overlapping the


processing of several different instructions. Often, this is done by dividing the instruction
execution process into several stages. As shown in Figure , an instruction pipeline often
consists of five stages, as follows:

1. Instruction fetch (IF): retrieval of instructions from cache (or main memory).

2. Instruction decoding (ID): identification of the operation to be performed.

3. Operand fetch (OF): decoding and retrieval of any required operands.

4. Execution (EX): performing the operation on the operands.

5. Write-back (WB): updating the destination operands.
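As an illustration of how these stages overlap (not part of the original notes), the following sketch prints an ideal five-stage pipeline chart of the same kind as the diagrams in Section 3.6.2; the stage names and helper function are illustrative only:

```python
# Print an ideal 5-stage pipeline chart: instruction i enters IF in cycle i+1
# and occupies one stage per clock cycle with no stalls.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_chart(n_instructions):
    total_cycles = n_instructions + len(STAGES) - 1
    for i in range(n_instructions):
        row = ["    "] * total_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = f"{name:<4}"          # stage s of instruction i in cycle i+s+1
        print(f"I{i + 1}: " + "".join(row))

pipeline_chart(4)
```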


3.5 Arithmetic pipelining
An arithmetic pipeline divides an arithmetic operation into sub-operations for execution in the pipeline segments. It is used to implement floating-point operations, multiplication of fixed-point numbers, and similar computations encountered in scientific problems.
• Floating-point execution cannot be completed in one cycle during the EX stage.
• Allowing much more time would increase the pipeline cycle time, or subsequent instructions would have to be stalled.
• The solution is to break the FP EX stage into several stages whose delay can match the cycle time of the instruction pipeline.
• Such an FP or arithmetic pipeline does not reduce latency, but it can be decoupled from the integer unit and increases throughput for a sequence of FP instructions.
3.6 Pipeline Hazards
There are situations, called hazards, that prevent the next instruction in the instruction stream
from executing during its designated clock cycle. Hazards reduce the performance from the
ideal speedup gained by pipelining.
There are three classes of hazards:
1. Structural hazards arise from resource conflicts when the hardware cannot support all
possible combinations of instructions simultaneously in overlapped execution.
2. Data hazards arise when an instruction depends on the results of a previous instruction in
a way that is exposed by the overlapping of instructions in the pipeline.
3. Control hazards arise from the pipelining of branches and other instructions
that change the PC.
3.6.1 Structural Hazards
When a processor is pipelined, the overlapped execution of instructions requires pipelining of
functional units and duplication of resources to allow all possible combinations of instructions
in the pipeline. If some combination of instructions cannot be accommodated because of
resource conflicts, the processor is said to have a structural hazard.
The most common instances of structural hazards arise when some functional unit is not fully
pipelined. Then a sequence of instructions using that unpipelined unit cannot proceed at the
rate of one per clock cycle. Another common way that structural hazards appear is when some
resource has not been duplicated enough to allow all combinations of instructions in the
pipeline to execute. For example, a processor may have only one register-file write port, but
under certain circumstances, the pipeline might want to perform two writes in a clock cycle.
This will generate a structural hazard.
To resolve such a hazard, the pipeline is stalled for 1 clock cycle when the conflicting access occurs. A stall is commonly called a pipeline bubble, or just bubble, since it floats through the pipeline taking space but carrying no useful work.
3.6.2 Data Hazards
A major effect of pipelining is to change the relative timing of instructions by overlapping
their execution. This overlap introduces data and control hazards. Data hazards occur when
the pipeline changes the order of read/write accesses to operands so that the order differs
from the order seen by sequentially executing instructions on an unpipelined processor.
Consider a program that contains two instructions, I1 followed by I2 . When this program
is executed in a pipeline, the execution of I2 can begin before the execution of I1 is
completed. This means that the results generated by I1 may not be available for use by I2.
We must ensure that the results obtained when instructions are executed in a pipelined
processor are identical to those obtained when the same instructions are executed
sequentially. The potential for obtaining incorrect results when operations are performed
concurrently can be demonstrated by a simple example. Assume that A = 5, and consider the following two operations:

A <- 3 + A
B <- 4 × A

When these operations are performed in the order given, the result is B = 32. But if they
are performed concurrently, the value of A used in computing B would be the original
value, 5, leading to an incorrect result.
I1. R2 <- R1 + R3
I2. R4 <- R2 + R3     (I2 reads R2 written by I1: a read-after-write dependence)

I1. R4 <- R1 + R5
I2. R5 <- R1 + R2     (I2 writes R5 read by I1: a write-after-read dependence)

I1. R2 <- R4 + R7
I2. R2 <- R1 + R3     (both instructions write R2: a write-after-write dependence)
3.6.2.1 OPERAND FORWARDING
Operand forwarding (or data forwarding) is an optimization in pipelined CPUs to limit performance
deficits which occur due to pipeline stalls. A data hazard can lead to a pipeline stall when the current
operation has to wait for the results of an earlier operation which has not yet finished.

The data hazard arises because one instruction, instruction I2, is waiting for data to be written in the
register file. However, these data are available at the output of the ALU once the Execute stage
completes step E (Execution) of I1. Hence, the delay can be reduced, or possibly eliminated, if we
arrange for the result of instruction I1 to be forwarded directly for use in step E (Execution) of I2.
Consider the pipelined execution of these instructions:

Minimizing Data Hazard Stalls by Forwarding

Clock cycle:  1    2    3    4    5    6    7    8    9
I1:           F1   D1   E1   M1   W1
I2:                F2   D2   E2   M2   W2
I3:                     F3   D3   E3   M3   W3
I4:                          F4   D4   E4   M4   W4
I5:                               F5   D5   E5   M5   W5
Data Hazards Requiring Stalls

Clock cycle:  1    2    3      4      5      6    7    8    9    10   11   12
I1:           F1   D1   E1     M1     W1
I2:                F2   stall  stall  stall  D2   E2   M2   W2
I3:                                          F3   D3   E3   M3   W3
I4:                                               F4   D4   E4   M4   W4
I5:                                                    F5   D5   E5   M5   W5

Register half read/write

Clock cycle:  1    2    3      4      5    6    7    8    9    10   11
I1:           F1   D1   E1     M1     W1
I2:                F2   stall  stall  D2   E2   M2   W2
I3:                                   F3   D3   E3   M3   W3
I4:                                        F4   D4   E4   M4   W4
I5:                                             F5   D5   E5   M5   W5
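The following is a minimal sketch, not an actual hardware description, of the test a forwarding unit performs: the destination register of the earlier instruction is compared against the source registers of the instruction entering the Execute stage, and on a match the ALU result is forwarded instead of stalling. The tuple encoding and function name are illustrative.

```python
# Illustrative forwarding check for a RAW hazard between two instructions.
# Each instruction is encoded as (dest, src1, src2) with symbolic register names.
def needs_forwarding(producer, consumer):
    dest, _, _ = producer
    _, src1, src2 = consumer
    return dest in (src1, src2)

i1 = ("R2", "R1", "R3")   # I1: R2 <- R1 + R3
i2 = ("R4", "R2", "R3")   # I2: R4 <- R2 + R3  (reads R2 produced by I1)

if needs_forwarding(i1, i2):
    print("forward I1's ALU result to I2's Execute stage")  # avoids the stall
```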

3.6.3 Branch/Control Hazards


Branching hazards (also termed control hazards) occur with branches. On many instruction pipeline
microarchitectures, the processor will not know the outcome of the branch when it needs to insert a
new instruction into the pipeline (normally the fetch stage).

Whenever the stream of instructions supplied by the instruction fetch unit is interrupted, the pipeline
stalls.

3.6.3.1 Branch Penalty


When a branch is taken, the PC is updated with the address of the branch target, and the instructions that were already fetched into the pipeline along the wrong path must be cleared from it. The additional clock cycles required to discard these instructions and refill the pipeline are called the branch penalty.
3.6.3.2 Instruction Queue and Prefetching
This is a sophisticated fetch unit that can simultaneously fetch and decode instructions. By the time the branch target is calculated, there is a probability that the fetch unit has already fetched and decoded the branch instruction. So, the branch penalty can be reduced by this technique.
3.6.3.3 Conditional Branches
A conditional branch instruction introduces the added hazard caused by the dependency of the branch
condition on the result of a preceding instruction.

The decision to branch cannot be made until the execution of that instruction has been completed.

Branch instructions represent about 20% of the dynamic instruction count of most programs.

3.6.3.4 Delayed Branch


In the delayed branch technique, the instruction slot(s) immediately following a branch are called delay slots. The instructions in the delay slots are always fetched; therefore, we would like to arrange for them to be fully executed whether or not the branch is taken.

The objective is to place useful instructions in these slots.

The effectiveness of the delayed branch approach depends on how often it is possible to reorder
instructions.
3.6.3.5 Loop Unrolling
Loop unrolling, also known as loop unwinding, is a loop transformation technique that attempts to
optimize a program's execution speed at the expense of its binary size, which is an approach known as
the space-time tradeoff. The transformation can be undertaken manually by the programmer or by an
optimizing compiler.
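A small illustration of the transformation (in Python purely for readability): the unrolled loop does the work of four iterations per pass, reducing loop-control overhead at the cost of larger code, and it assumes the array length is a multiple of four.

```python
# Loop unrolling: the unrolled version performs four additions per iteration,
# reducing loop overhead (increment, test, branch) at the cost of larger code.
a = list(range(16))          # length assumed to be a multiple of 4

# Rolled loop
total = 0
for i in range(len(a)):
    total += a[i]

# Unrolled by a factor of 4
total_unrolled = 0
for i in range(0, len(a), 4):
    total_unrolled += a[i] + a[i + 1] + a[i + 2] + a[i + 3]

print(total, total_unrolled)  # both print 120
```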

3.6.3.6 Branch folding

Branch folding is a technique where, on the prediction of most branches, the branch instruction is
completely removed from the instruction stream presented to the execution pipeline. Branch folding
can significantly improve the performance of branches, taking the CPI for branches significantly below 1.

When the prefetch unit (PU) prefetches a predicted-taken branch, it evaluates whether the branch:

 is not a branch instruction
   o the reason branch instructions are not evaluated is to avoid losing the link
 does not point to a code sequence that contains a branch in the first two instructions
 is not breakpointed
 is not aborted.

3.7 Superscalar Architecture


Base Scalar Processor:

It is defined as a machine with one instruction issued per cycle.


The CPU is essentially a scalar processor consisting of multiple functional units.
The floating-point unit can be built on a coprocessor attached to the CPU.

Superscalar Architectures
Superscalar is a computer designed to improve the performance of the execution of scalar instructions.
A scalar is a variable that can hold only one atomic value at a time, e.g., an integer or a real.
A scalar architecture processes one data item at a time
Examples of non-scalar variables: Arrays, Matrices, Records

In a superscalar architecture (SSA), several scalar instructions can be initiated simultaneously and executed
independently.
Pipelining also allows several instructions to be executed at the same time, but they have to be in different pipeline stages at a given moment.

SSA includes all the features of pipelining but, in addition, there can be several instructions executing simultaneously in the same pipeline stage.

SSA therefore introduces a new level of parallelism, called instruction-level parallelism.

In a superscalar processor, multiple instruction pipelines are required. This implies that multiple instructions are
issued per cycle and multiple results are generated per cycle.

Superscalar processors are designed to exploit more instruction-level parallelism in user programs. Only independent instructions can be executed in parallel without causing a wait state.

The instruction-issue degree in a superscalar processor is limited to 2-5 in practice.

The effective CPI of a superscalar processor should be lower than that of a generic scalar RISC processor.
Example: IBM RS/6000

3.8 The VLIW Architecture: Very Long Instruction Word..


Very Long Instruction Word (VLIW) architecture is generalized from two concepts: horizontal microcoding and superscalar processing.
A typical VLIW machine has:
– instruction words hundreds of bits in length;
– multiple functional units;
– a common large register file shared by all functional units.
VLIW and Superscalar Processors:
VLIW machines behave much like superscalar machines, with 3 differences:
1. The decoding of VLIW instructions is easier than that of superscalar instructions.
2. The code density of the superscalar machine is better when the available instruction-level parallelism is less than that exploitable by the VLIW machine.
3. A superscalar machine can be object-code compatible with a large family of non-parallel machines. On the contrary, VLIW machines exploiting different amounts of parallelism would require different instruction sets.
VLIW: Advantages:
• The main advantage of VLIW architecture is its simplicity in hardware structure and instruction set.
• The VLIW processor can potentially perform well in scientific applications where the program behavior (branch
predictions) is more predictable.

3.9 Vector Processors:


• A vector processor is a coprocessor specially designed to perform vector computations.
• A vector instruction involves a large array of operands. The same operation will be performed over a string of data.
• Vector processors are often used in a multi-pipelined supercomputer.
• Vector processors can assume either:
  • a register-to-register architecture using shorter instructions and vector register files, or
  • a memory-to-memory architecture using memory-based instructions.
• The vector pipelines can be attached to any scalar processor (whether it is superscalar, superpipelined, or both).

Vector Pipelines:
• In a scalar processor, each scalar instruction executes only one operation over one data element.
• Each vector instruction executes a string of operations, one for each element in the vector

3.10 Array Processor:


• An array processor consists of a large number of identical processors that perform the same sequence of
instructions on different sets of data
Example: ILLIAC IV
• Announced by University of Illinois in 1972.
• The original plan was to build a machine consisting of 4 quadrants, each having 8*8 square grid of
processor/memory elements.
• Only one quadrant was built due to cost. It did achieve a performance of 50 megaflops
3.11 SPARC
SPARC stands for Scalable Processor Architecture. It was developed by Sun Microsystems in the 1980s and is based on the RISC design developed at the University of California at Berkeley in the early 1980s.

It is a load/store architecture: operations are always performed on registers. It uses the “register window” concept, thus offering a large number of registers, uses delay slots to optimize branch instructions, and passes arguments using registers and the stack.

Modules

The Integer Unit (IU)


Contains the general-purpose registers and controls the overall operation of the processor.
It may contain from 64 to 528 general-purpose 64-bit r registers. They are partitioned into 8 global registers, 8 alternate global registers, plus a circular stack of from 3 to 32 sets of 16 registers each, known as register windows.
It executes the integer arithmetic instructions and computes memory addresses for loads and stores.
It maintains the program counters and controls instruction execution for the FPU.
The Register Window
At any time, an instruction can access the 8 global registers and a 24-register window.
A register window comprises a 16-register set (divided into 8 in and 8 local registers) together with the 8 in registers of an adjacent register set, addressable from the current window as its out registers.
When a procedure is called, the register window shifts by sixteen registers, hiding the old input registers and old local registers and making the old output registers the new input registers.
Input registers: where arguments are passed to a function.
Local registers: used to store any local data.
Output registers: when calling a function, the caller puts its arguments in these registers.
Coprocessor Unit (CU)
The instruction set includes support for a single, implementation-dependent coprocessor. The coprocessor has its own set of registers.
Coprocessor load/store instructions are used to move data between the coprocessor registers and memory.
The coprocessor load/store instructions mirror the floating-point load/store instructions.
3.12 ARM PROCESSOR
ARM is short for Advanced RISC Machines Ltd., founded in 1990 and owned by Acorn, Apple and VLSI.
ARM cores are used especially in portable devices due to their low power consumption and reasonable performance (MIPS/watt).
Processor cores: ARM6, ARM7, ARM9, ARM10, ARM11

ARM architecture
• 32-bit RISC processor core (32-bit instructions)
• 37 32-bit integer registers (16 available at a time)
• Pipelined (ARM7: 3 stages)
• Cached (depending on the implementation)
• von Neumann-type bus structure (ARM7), Harvard (ARM9)
• 8/16/32-bit data types
• 7 modes of operation (usr, fiq, irq, svc, abt, sys, und)
• Simple structure -> reasonably good speed / power consumption ratio
ARM7TDMI
ARM7TDMI is a core processor module embedded in many ARM7 microprocessors, such as the ARM720T, ARM710T, ARM740T, and Samsung’s KS32C50100. It is the most complex processor core module in the ARM7 series.
– T: capable of executing the Thumb instruction set.
– D: features the IEEE Std. 1149.1 JTAG boundary-scan debugging interface.
– M: features a Multiply-and-Accumulate (MAC) unit for DSP applications.
– I: features support for an embedded In-Circuit Emulator.
• Three pipeline stages: instruction fetch, decode, and execute.
Features
• A 32-bit RISC processor core capable of executing 16-bit instructions (von Neumann architecture)
– High code density
• The Thumb set’s 16-bit instruction length allows it to approach about 65% of standard ARM code size while retaining ARM 32-bit processor performance.
– Smaller die size
• About 72,000 transistors
• Occupies only about 4.8 mm² in a 0.6 µm semiconductor technology.
– Lower power consumption
• Dissipates about 2 mW/MHz with 0.6 µm technology.

• Memory access
– Data can be
  • 8-bit (bytes)
  • 16-bit (half words)
  • 32-bit (words)
• Memory interface
– Can interface to SRAM, ROM, DRAM
– Has four basic types of memory cycle:
  • idle cycle
  • non-sequential cycle
  • sequential cycle
  • coprocessor register cycle
• 32-bit address bus
• 32-bit data bus
– D[31:0]: bidirectional data bus
– DIN[31:0]: unidirectional input bus
– DOUT[31:0]: unidirectional output bus
• Control signals
– Specify the size of the data to be transferred and the direction of the transfer
[Figure: ARM7TDMI Block Diagram]

3.13 Superpipelined Processors


In contrast to a superscalar processor, a superpipelined one has split the main computational pipeline
into more stages. Each stage is simpler (does less work) and thus the clock speed can be increased.
However the latency, measured in clock cycles, for any instruction to complete has increased from 4
cycles in early RISC processors to 8 or more.

Benefit
The major benefit of superpipelining is the increase in the number of instructions which can be in the
pipeline at one time and hence the level of parallelism.

Drawbacks
The larger number of instructions "in flight" (i.e., in some part of the pipeline) at any time increases the potential for data dependencies to introduce stalls. Simulation studies have suggested that a pipeline depth of more than 8 stages tends to be counter-productive.

Note that some recent processors, e.g. the MIPS R10000, can be described as both superscalar (they have multiple processing units) and superpipelined (there are more than 5 stages in the pipeline).
4 MEMORY

4.1 Memory Hierarchy

4.2 Cache Memories

• CPU requests contents of memory location

• Check cache for this data


• If present, get from cache (fast)

• If not present, read required block from main memory to cache

• Then deliver from cache to CPU

• Cache includes tags to identify which block of main memory is in each cache slot

4.2.3 Cache Mapping


There are several possible methods for determining where memory blocks are placed in the cache. It is instructive to
describe these methods using a specific small example. Consider a cache consisting of 128 blocks of 16 words each,
for a total of 2048 (2K) words, and assume that the main memory is addressable by a 16-bit address. The main
memory has 64K words, which we will view as 4K blocks of 16 words each. For simplicity, we have assumed that
consecutive addresses refer to consecutive words.

Direct Mapping

The simplest way to determine the cache location in which to store a memory block is the direct-mapping technique. In this technique, block j of the main memory maps onto block (j modulo 128) of the cache.
Associative Mapping
The most flexible mapping method is one in which a main memory block can be placed into any cache block position. In this case, 12 tag bits are required to identify a memory block when it is resident in the cache. The tag bits of an address received from the processor are compared to the tag bits of each block of the cache to see if the desired block is present. This is called the associative-mapping technique.
Set-Associative Mapping
The blocks of the cache are grouped into sets, and the mapping allows a block of the main memory to reside in any
block of a specific set. Hence, the contention problem of the direct method is eased by having a few choices for
block placement. At the same time, the hardware cost is reduced by decreasing the size of the associative search.
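For the example cache above (128 blocks of 16 words each, 16-bit word addresses), direct mapping splits an address into a 4-bit word field, a 7-bit cache-block field, and a 5-bit tag, while associative mapping keeps a 12-bit tag. A short sketch of the direct-mapped split (the function name is illustrative):

```python
# Split a 16-bit word address for the example cache: 128 blocks x 16 words.
# Direct mapping: memory block j goes to cache block (j mod 128).
def direct_map(address):
    word = address & 0xF              # low 4 bits: word within the block
    mem_block = address >> 4          # 12-bit main-memory block number
    cache_block = mem_block % 128     # 7-bit cache block field
    tag = mem_block >> 7              # remaining 5 bits form the tag
    return tag, cache_block, word

print(direct_map(0xABCD))   # (tag, cache block, word) = (21, 60, 13)
```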
4.3 Cache coherence
In computer architecture, cache coherence is the uniformity of shared resource data that ends up
stored in multiple local caches. When clients in a system maintain caches of a common memory
resource, problems may arise with incoherent data, which is particularly the case with CPUs in a
multiprocessing system.

Cache Coherence Solutions


There are two types of solutions: software-based and hardware-based.

Software-based:
– compiler-based, or with run-time system support
– with or without hardware assist
– a tough problem, because perfect information is needed in the presence of memory aliasing and explicit parallelism

Hardware-based:
The schemes can be classified based on:
– shared caches vs. snoopy schemes vs. directory schemes
– write-through vs. write-back (ownership-based) protocols
– update vs. invalidation protocols
– dirty-sharing vs. no-dirty-sharing protocols

Snoopy Cache Coherence Schemes


A distributed cache coherence scheme based on the notion of a snoop that watches all activity on a
global bus, or is informed about such activity by some global broadcast mechanism.
Most commonly used method in commercial multiprocessors

Write-Through Schemes

All processor writes result in:
– an update of the local cache, and
– a global bus write that:
  o updates main memory
  o invalidates/updates all other caches holding that item
Advantage: simple to implement.
Disadvantage: since roughly 15% of references are writes, this scheme consumes tremendous bus bandwidth, so only a few processors can be supported.

Write-Back/Ownership Schemes
When a single cache has ownership of a block, processor writes do not result in bus writes, thus conserving bandwidth.
Most bus-based multiprocessors nowadays use such schemes.
Many variants of ownership-based protocols exist:
 Goodman’s write-once scheme
 Berkeley ownership scheme
 Firefly update protocol

Types of Cache Misses


Compulsory: On the first access to a block; the block must be brought into the cache; also called cold
start misses, or first reference misses.
Capacity: Occur because blocks are being discarded from cache because cache cannot contain all blocks
needed for program execution (program working set is much larger than cache capacity).
Conflict: In the case of set associative or direct mapped block placement strategies, conflict misses occur
when several blocks are mapped to the same set or block frame; also called collision misses or
interference misses.

4.4 Unified vs Split I and D (Instruction and Data) Caches

Given a fixed total size (in bytes) for the cache, is it better to have two caches, one for instructions and one for data, or is it better to have a single unified cache?
Unified is better because it automatically performs load balancing: if the current program needs more data references than instruction references, the cache will accommodate that, and similarly if more instruction references are needed.
Split is better because it can do two references at once (one instruction reference and one data reference).
The better cache is the split I and D (at least for L1), but the unified cache has the better (i.e., higher) hit ratio.

4.5 Local vs Global Miss rate


Local miss rate—This rate is simply the number of misses in a cache divided by the total number of memory accesses to this cache. As you would expect, for the first-level cache it is equal to Miss rate(L1), and for the second-level cache it is Miss rate(L2).

Global miss rate—The number of misses in the cache divided by the total number of memory accesses generated by the processor. Using the terms above, the global miss rate for the first-level cache is still just Miss rate(L1), but for the second-level cache it is Miss rate(L1) × Miss rate(L2).
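A quick numeric check of the two definitions, using assumed miss rates of 4% for L1 and 50% (local) for L2:

```python
# Local vs. global miss rates for a two-level cache (illustrative numbers).
l1_miss = 0.04          # L1 local miss rate (equal to its global miss rate)
l2_local_miss = 0.50    # of the accesses that reach L2, half miss

l2_global_miss = l1_miss * l2_local_miss
print(l2_global_miss)   # 0.02 -> only 2% of all processor accesses go to memory
```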

4.6 Six Basic Cache Optimizations

1. First Optimization: Larger Block Size to Reduce Miss Rate


Larger block sizes will also reduce compulsory misses. This reduction occurs because the principle of locality has two components: temporal locality and spatial locality. Larger blocks take advantage of spatial locality.
2. Second Optimization: Larger Caches to Reduce Miss Rate
The obvious way to reduce capacity misses is to increase capacity of the cache. The obvious
drawback is potentially longer hit time and higher cost and power.
3. Third Optimization: Higher Associativity to Reduce Miss Rate
An eight-way set-associative cache is, for practical purposes, as effective in reducing misses for these sized caches as a fully associative one. A cache rule of thumb is that a direct-mapped cache of size N has about the same miss rate as a two-way set-associative cache of size N/2.
4. Fourth Optimization: Multilevel Caches to Reduce Miss Penalty
Adding another level of cache between the original cache and memory simplifies the decision. The
first-level cache can be small enough to match the clock cycle time of the fast processor. Yet, the
second-level cache can be large enough to capture many accesses that would go to main memory,
thereby lessening the effective miss penalty.
5. Fifth Optimization: Giving Priority to Read Misses over Writes to Reduce
Miss Penalty
This optimization serves reads before writes have been completed.
With a write-through cache the most important improvement is a write buffer of the proper size.
Write buffers, however, do complicate memory accesses because they might hold the updated
value of a location needed on a read miss.
The simplest way out of this dilemma is for the read miss to wait until the write buffer is empty.
The alternative is to check the contents of the write buffer on a read miss, and if there are no
conflicts and the memory system is available, let the read miss continue.
6. Sixth Optimization: Avoiding Address Translation during Indexing of the
Cache to Reduce Hit Time
We use virtual addresses for the cache, since hits are much more common than misses. Such
caches are termed virtual caches, with physical cache used to identify the traditional cache that
uses physical addresses. It is important to distinguish two tasks: indexing the cache and comparing
addresses. Thus, the issues are whether a virtual or physical address is used to index the cache and
whether a virtual or physical address is used in the tag comparison. Full virtual addressing for both
indices and tags eliminates address translation time from a cache hit.

4.7 Princeton Architecture

The memory interface unit is responsible for arbitrating access to the memory space between reading
instructions (based upon the current program counter) and passing data back and forth with the processor
and its internal registers.

It might at first seem that the memory interface unit is a bottleneck between the processor and the variable/RAM space (especially with the requirement for fetching instructions at the same time); however, in many Princeton-architected processors this is not the case, because the time required to execute a given instruction can be used to fetch the next instruction (this is known as pre-fetching and is a feature of many Princeton-architected processors).

The Von Neumann architecture's largest advantage is that it simplifies the microcontroller chip
design because only one memory is accessed. For microcontrollers, its biggest asset is that the
contents of RAM (random-access memory) can be used for both variable (data) storage as well as
program instruction storage. An advantage for some applications is the program counter stack
contents that are available for access by the program. This allows greater flexibility in developing
software, primarily in the areas of real-time operating systems.
4.8 Harvard Architecture

Harvard's response was a design that used separate memory banks for program store, the processor
stack, and variable RAM.

The Harvard architecture executes instructions in fewer instruction cycles than the von Neumann architecture. This is because a much greater amount of instruction parallelism is possible in the Harvard architecture. Parallelism means that fetches for the next instruction can take place during the execution of the current instruction, without having to either wait for a "dead" cycle of the instruction's execution or stop the processor's operation while the next instruction is being fetched.

4.9 Uniform Memory Access (UMA)


Uniform memory access (UMA) is a shared memory architecture used in parallel computers. All the
processors in the UMA model share the physical memory uniformly. In a UMA architecture, access time
to a memory location is independent of which processor makes the request or which memory chip
contains the transferred data.

In this model, all the processors share the physical memory uniformly. All the processors have
equal access time to all the memory words. Each processor may have a private cache memory.
Same rule is followed for peripheral devices.

When all the processors have equal access to all the peripheral devices, the system is called a
symmetric multiprocessor. When only one or a few processors can access the peripheral
devices, the system is called an asymmetric multiprocessor.
4.10 Non-uniform Memory Access (NUMA)

NUMA (non-uniform memory access) is a method of configuring a cluster of microprocessors in a multiprocessing system so that they can share memory locally, improving performance and the ability of the system to be expanded. NUMA is used in symmetric multiprocessing (SMP) systems.

In NUMA multiprocessor model, the access time varies with the location of the memory word.
Here, the shared memory is physically distributed among all the processors, called local
memories. The collection of all local memories forms a global address space which can be
accessed by all the processors.
4.11 Virtual Memory
In most modern computer systems, the physical main memory is not as large as the address space of the
processor. For example, a processor that issues 32-bit addresses has an addressable space of 4G bytes.
The size of the main memory in a typical computer with a 32-bit processor may range from 1G to 4G bytes.
If a program does not completely fit into the main memory, the parts of it not currently being executed
are stored on a secondary storage device, typically a magnetic disk. As these parts are needed for
execution, they must first be brought into the main memory, possibly replacing other parts that are
already in the memory. These actions are performed automatically by the operating system, using a
scheme known as virtual memory.
Under a virtual memory system, programs, and hence the processor, reference instructions and data in an address
space that is independent of the available physical main memory space. The binary addresses that the processor
issues for either instructions or data are called virtual or logical addresses. These addresses are translated into
physical addresses by a combination of hardware and software actions. If a virtual address refers to a part of the
program or data space that is currently in the physical memory, then the contents of the appropriate location in
the main memory are accessed immediately. Otherwise, the contents of the referenced address must be brought
into a suitable location in the memory before they can be used.

[Figure: Virtual memory organization]

Address Translation
A virtual-memory address-translation method based on the concept of fixed-length pages is shown schematically
in Figure. Each virtual address generated by the processor, whether it is for an instruction fetch or an operand
load/store operation, is interpreted as a virtual page number (high-order bits) followed by an offset (low-order
bits) that specifies the location of a particular byte (or word) within a page. Information about the main memory
location of each page is kept in a page table. This information includes the main memory address where the page
is stored and the current status of the page. An area in the main memory that can hold one page is called a page
frame. The starting address of the page table is kept in a page table base register. By adding the virtual page
number to the contents of this register, the address of the corresponding entry in the page table is obtained. The
contents of this location give the starting address of the page if that page currently resides in the main memory.
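A minimal sketch of this translation, assuming 4 KB pages and a plain dictionary standing in for the page table; the page size, table contents, and function name are illustrative, not taken from the notes:

```python
# Virtual-to-physical translation with fixed-length pages (4 KB assumed here).
PAGE_SIZE = 4096
OFFSET_BITS = 12                       # 2**12 = 4096

page_table = {0x12345: 0x00042}        # virtual page number -> page frame number

def translate(virtual_address):
    vpn = virtual_address >> OFFSET_BITS          # high-order bits: page number
    offset = virtual_address & (PAGE_SIZE - 1)    # low-order bits: byte within page
    if vpn not in page_table:
        raise LookupError("page fault: page not in main memory")
    return (page_table[vpn] << OFFSET_BITS) | offset

print(hex(translate(0x12345ABC)))      # 0x42abc
```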
Translation Look-aside Buffer
The page table information is used by the MMU for every read and write access. Ideally, the page table should be
situated within the MMU. Unfortunately, the page table may be rather large. Since the MMU is normally
implemented as part of the processor chip, it is impossible to include the complete table within the MMU. Instead,
a copy of only a small portion of the table is accommodated within the MMU, and the complete table is kept in the
main memory. The portion maintained within the MMU consists of the entries corresponding to the most recently
accessed pages. They are stored in a small table, usually called the Translation Lookaside Buffer (TLB). The TLB
functions as a cache for the page table in the main memory. Each entry in the TLB includes a copy of the information
in the corresponding entry in the page table. In addition, it includes the virtual address of the page, which is needed
to search the TLB for a particular page.
5 INTERCONNECTION NETWORK
An ICN could be either static or dynamic. Connections in a static network are fixed links, while connections in a
dynamic network are established on the fly as needed.

Static networks can be further classified according to their interconnection pattern as one-dimensional (1D), two-dimensional (2D), or hypercube (HC).

Two types of static networks can be identified. These are:


completely connected networks (CCNs)
limited connection networks (LCNs).

Dynamic networks can be classified based on interconnection scheme as bus-based versus switch-based.
Static Interconnection Networks:
• Static (fixed) interconnection networks are characterized by having fixed paths, unidirectional or bidirectional, between processors. Two types of static networks can be identified:
– Completely Connected Networks (CCNs)
– Limited Connection Networks (LCNs)

Completely Connected Networks:


• In a completely connected network each node is connected to all other nodes in the network.
• CCNs guarantee fast delivery of messages from any source node to any destination node (only one link has to be
traversed).
• Every node is connected to every other node in the network, therefore routing of messages between nodes
becomes a straightforward task.

CCN Characteristics:
• CCNs guarantee fast delivery of messages from any source node to any destination node. Only one link has to be
traversed.
• CCNs are expensive in terms of the number of links needed for their construction. This disadvantage becomes more and more apparent for higher values of N.
• The number of links in a CCN is given by N(N-1)/2.
• The delay complexity of CCNs, measured in terms of the number of links traversed as messages are routed from any source to any destination, is constant.
For example, a CCN having N = 6 nodes requires a total of 15 links in order to satisfy the complete interconnectivity of the network.

Limited Connection Networks:
• Limited connection networks (LCNs) do not provide a direct link from every node to every other node in the network. Instead, communications between some nodes have to be routed through other nodes in the network.
• The length of the path between nodes, measured in terms of the number of links that have to be traversed, is expected to be longer compared to the case of CCNs.
• Two other conditions are imposed by the limited interconnectivity in LCNs:
– the need for a pattern of interconnection among nodes, and
– the need for a mechanism for routing messages around the network until they reach their destinations.
• Regular interconnection patterns for LCNs include:
– linear arrays;
– ring (loop) networks;
– two-dimensional arrays (nearest-neighbor mesh);
– tree networks; and
– cube networks.
Linear Array Static LCN:
• Each node is connected to its two immediate neighboring nodes.
• If node i needs to communicate with node j, j > i, then the message from node i has to traverse nodes i+1, i+2, . . . , j-1.
• In the worst possible case, when node 1 has to send a message to node N, the message has to traverse a total of N-1 links before it can reach its destination.
• Linear arrays are simple in their architecture and have simple routing mechanisms.
• Linear arrays are slow, particularly when the number of nodes N is large.
• The network complexity of the linear array is O(N) and its time complexity is O(N).
• If the two nodes at the extreme ends of a linear array network are connected, the resultant network has a ring (loop) architecture. A hop-count sketch for both cases is given below.
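As a quick illustration of these routing costs, the following C sketch (function names are ours; nodes numbered from 0) counts the links a message traverses in a linear array and in the ring obtained by joining the two end nodes:

    #include <stdlib.h>   /* abs() */

    /* Links traversed between node i and node j in a linear array. */
    int linear_array_hops(int i, int j)
    {
        return abs(j - i);                 /* worst case: N - 1 links */
    }

    /* Links traversed in a ring of N nodes: the message may go
       either way around, so take the shorter direction. */
    int ring_hops(int i, int j, int N)
    {
        int d = abs(j - i);
        return (d < N - d) ? d : N - d;    /* worst case: about N / 2 links */
    }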
Tree Network Static LCN:
• If a node at level i needs to communicate with a node at level j, where i > j and the destination node belongs to the same root's child subtree, then it will have to send its message up the tree, traversing nodes at levels i-1, i-2, . . . , j+1 until it reaches the destination node.
• If a node at level i needs to communicate with another node at the same level i (or with a node at some level j where the destination node belongs to a different root's child subtree), it will have to send its message up the tree until the message reaches the root node at level 0. The message will then be sent down from the root node until it reaches its destination.
• The number of nodes (processors) in a binary tree system having k levels can be calculated as 2^k - 1 (a short sketch follows).
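A short C sketch of the node-count formula and of the hop count between two nodes routed through a common ancestor (levels numbered from 0 at the root; function names are illustrative):

    /* Nodes in a complete binary tree with k levels (root at level 0). */
    unsigned long tree_nodes(unsigned k)
    {
        return (1UL << k) - 1;             /* 2^k - 1 */
    }

    /* Links traversed between a node at level i and a node at level j
       when the message must climb to their common ancestor at level a
       (a = 0, the root, in the worst case). */
    int tree_hops(int i, int j, int a)
    {
        return (i - a) + (j - a);          /* at most i + j links */
    }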
Network Topologies
• A variety of network topologies have been proposed and implemented.
• These topologies trade off performance for cost.
• Commercial machines often implement hybrids of multiple topologies for reasons of packaging, cost, and available components.
Network Topologies: Buses
• Some of the simplest and earliest parallel machines used buses.
• All processors access a common bus for exchanging data.
• The distance between any two nodes is O(1) in a bus. The bus also provides a convenient broadcast medium.
• However, the bandwidth of the shared bus is a major bottleneck.
• Typical bus-based machines are limited to dozens of nodes. Sun Enterprise servers and Intel Pentium based shared-bus multiprocessors are examples of such architectures.
• Since much of the data accessed by processors is local to the processor, a local memory can improve the performance of bus-based machines.
Network Topologies: Crossbars
• A crossbar network uses a p × m grid of switches to connect p inputs to m outputs in a non-blocking manner.
• The cost of a crossbar of p processors grows as O(p^2).
• This is generally difficult to scale for large values of p.
• Examples of machines that employ crossbars include the Sun Ultra HPC 10000 and the Fujitsu VPP500.
Network Topologies: Multistage Networks
• Crossbars have excellent performance scalability but poor cost scalability.
• Buses have excellent cost scalability, but poor performance scalability.
• Multistage interconnects strike a compromise between these extremes.
(Figures omitted: switch configurations; the Omega network.)
Network Topologies: Completely Connected Network
• Each processor is connected to every other processor.
• The number of links in the network scales as O(p^2).
• While the performance scales very well, the hardware complexity is not realizable for large values of p.
• In this sense, these networks are static counterparts of crossbars.
Network Topologies: Star Connected Network
• Every node is connected only to a common node at the center.
• The distance between any pair of nodes is O(1). However, the central node becomes a bottleneck.
• In this sense, star connected networks are static counterparts of buses.
Network Topologies: Linear Arrays, Meshes, and k-d Meshes
• In a linear array, each node has two neighbors, one to its left and one to its right. If the nodes at either end are connected, we refer to it as a 1-D torus or a ring.
• A generalization to 2 dimensions has nodes with 4 neighbors, to the north, south, east, and west.
• A further generalization to d dimensions has nodes with 2d neighbors.
• A special case of a d-dimensional mesh is a hypercube. Here, d = log p, where p is the total number of nodes.
Network Topologies: Two- and Three-Dimensional Meshes
(Figures of two- and three-dimensional mesh layouts omitted.)
Properties of Hypercubes
• The distance between any two nodes is at most log p.
• Each node has log p neighbors.
• The distance between two nodes is given by the number of bit positions at which the two nodes differ (see the sketch below).
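This last property is simply the Hamming distance between the binary node labels, as the small C sketch below shows:

    /* Distance between hypercube nodes a and b: the number of
       differing bit positions in their labels (Hamming distance). */
    int hypercube_distance(unsigned a, unsigned b)
    {
        unsigned x = a ^ b;
        int d = 0;
        while (x) {            /* count the set bits of a XOR b */
            d += x & 1u;
            x >>= 1;
        }
        return d;
    }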
Network Topologies: Tree-Based Networks
(Figure omitted: complete binary tree networks — (a) a static tree network; and (b) a dynamic tree network.)
Tree Properties
• The distance between any two nodes is no more than 2 log p.
• Links higher up the tree potentially carry more traffic than those at the lower levels.
• For this reason, a variant called a fat tree fattens the links as we go up the tree.
• Trees can be laid out in 2D with no wire crossings. This is an attractive property of trees.
Evaluating Interconnection Networks
• Diameter: The distance between the farthest two nodes in the network. The diameter of a linear array is p - 1, that of a tree and of a hypercube is log p, and that of a completely connected network is O(1).
• Bisection Width: The minimum number of wires that must be cut to divide the network into two equal parts. The bisection width of a linear array and of a tree is 1, that of a hypercube is p/2, and that of a completely connected network is p^2/4.
• Cost: The number of links or switches (whichever is asymptotically higher) is a meaningful measure of the cost. However, a number of other factors, such as the ability to lay out the network, the length of wires, etc., also factor into the cost.
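Several of these figures can be restated directly in code. The C sketch below (helper names are ours, and p is assumed to be a power of two wherever log p appears) returns diameter and bisection-width values quoted above, plus the N(N-1)/2 link count of a completely connected network from earlier:

    #include <math.h>

    /* Diameters of p-node networks, as listed above. */
    int diameter_linear_array(int p)  { return p - 1; }
    int diameter_hypercube(int p)     { return (int)log2(p); }
    int diameter_complete(int p)      { (void)p; return 1; }   /* O(1) */

    /* Bisection widths of p-node networks, as listed above. */
    int bisection_linear_array(int p) { (void)p; return 1; }
    int bisection_tree(int p)         { (void)p; return 1; }
    int bisection_hypercube(int p)    { return p / 2; }
    int bisection_complete(int p)     { return (p * p) / 4; }

    /* Link count of a completely connected (static) network. */
    long links_complete(int p)        { return (long)p * (p - 1) / 2; }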
6 I/O Organization
Programmed I/O

Programmed I/O (PIO) refers to data transfers initiated by the CPU under driver software control to access registers or memory on a device.

The CPU issues a command and then waits for the I/O operation to complete. As the CPU is faster than the I/O module, the problem with programmed I/O is that the CPU has to wait a long time for the I/O module of concern to be ready for either reception or transmission of data. The CPU, while waiting, must repeatedly check the status of the I/O module; this process is known as polling. As a result, the performance of the entire system is severely degraded.
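The polling loop itself can be pictured with a short C sketch; the device register addresses and the READY bit below are invented purely for illustration:

    #include <stdint.h>

    /* Hypothetical memory-mapped device registers. */
    #define DEV_STATUS   (*(volatile uint8_t *)0x4000F000u)
    #define DEV_DATA     (*(volatile uint8_t *)0x4000F004u)
    #define STATUS_READY 0x01u   /* set by the I/O module when data is ready */

    /* Programmed (polled) input of one byte: the CPU busy-waits,
       repeatedly checking the status bits until the module is ready. */
    uint8_t pio_read_byte(void)
    {
        while ((DEV_STATUS & STATUS_READY) == 0)
            ;                    /* polling loop: CPU does no useful work */
        return DEV_DATA;         /* read the data register */
    }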
Programmed I/O basically works in these ways:

• CPU requests an I/O operation
• I/O module performs the operation
• I/O module sets the status bits
• CPU checks the status bits periodically
• I/O module does not inform the CPU directly
• I/O module does not interrupt the CPU
• CPU may wait or come back later
Interrupt

The CPU issues commands to the I/O module and then proceeds with its normal work until it is interrupted by the I/O device on completion of its work.

For input, the device interrupts the CPU when new data has arrived and is ready to be retrieved by the system processor. The actual actions to perform depend on whether the device uses I/O ports or memory mapping.

For output, the device delivers an interrupt either when it is ready to accept new data or to acknowledge a successful data transfer. Memory-mapped and DMA-capable devices usually generate interrupts to tell the system they are done with the buffer.

Although interrupt-driven I/O relieves the CPU of having to wait for the devices, it is still inefficient for transferring large amounts of data because the CPU has to transfer the data word by word between the I/O module and memory.
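The division of work can be sketched in C as follows; the register addresses, bit masks, and handler name are hypothetical, and real interrupt registration is platform-specific:

    #include <stdint.h>

    /* Hypothetical memory-mapped device registers. */
    #define DEV_CONTROL (*(volatile uint8_t *)0x4000F008u)
    #define DEV_DATA    (*(volatile uint8_t *)0x4000F004u)
    #define CMD_READ     0x01u
    #define CTRL_IRQ_EN  0x80u

    static volatile uint8_t rx_byte;
    static volatile int     rx_done = 0;

    /* CPU issues the read command with interrupts enabled,
       then returns immediately to other work. */
    void start_read(void)
    {
        DEV_CONTROL = CMD_READ | CTRL_IRQ_EN;
    }

    /* When the I/O module has fetched the data from the peripheral,
       it interrupts the CPU; this handler (which a real system would
       register in its vector table) transfers the word. */
    void device_isr(void)
    {
        rx_byte = DEV_DATA;      /* CPU requests and receives the data */
        rx_done = 1;
    }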
Below are the basic operations of interrupt-driven I/O:

• CPU issues a read command
• I/O module gets data from the peripheral whilst the CPU does other work
• I/O module interrupts the CPU
• CPU requests the data
• I/O module transfers the data
Direct Memory Access (DMA)

Direct Memory Access (DMA) means the CPU grants an I/O module the authority to read from or write to memory without CPU involvement. The DMA module controls the exchange of data between main memory and the I/O device. Because a DMA device can transfer data directly to and from memory, rather than using the CPU as an intermediary, it can relieve congestion on the bus. The CPU is involved only at the beginning and end of the transfer and is interrupted only after the entire block has been transferred.
Direct Memory Access needs a special piece of hardware called a DMA controller (DMAC) that manages the data transfers and arbitrates access to the system bus. The controller is programmed with source and destination pointers (where to read/write the data), counters to track the number of transferred bytes, and settings, which include the I/O and memory types, interrupts, and states for the CPU cycles.
DMA increases system concurrency by allowing the CPU to perform tasks while the DMA system transfers data via the system and memory buses. Hardware design is complicated because the DMA controller must be integrated into the system, and the system must allow the DMA controller to be a bus master. Cycle stealing may also be necessary to allow the CPU and DMA controller to share use of the memory bus.
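A register-level sketch of programming such a transfer is given below; all DMAC register names, addresses, and control bits are invented for illustration, since real controllers differ:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical DMA controller registers. */
    #define DMA_SRC    (*(volatile uint32_t *)0x40010000u)  /* source pointer      */
    #define DMA_DST    (*(volatile uint32_t *)0x40010004u)  /* destination pointer */
    #define DMA_COUNT  (*(volatile uint32_t *)0x40010008u)  /* bytes to transfer   */
    #define DMA_CTRL   (*(volatile uint32_t *)0x4001000Cu)  /* control / start     */
    #define DMA_START   0x01u
    #define DMA_IRQ_EN  0x02u    /* interrupt the CPU when the block is done */

    /* CPU involvement is limited to programming the controller ... */
    void dma_start_transfer(const void *src, void *dst, size_t nbytes)
    {
        DMA_SRC   = (uint32_t)(uintptr_t)src;
        DMA_DST   = (uint32_t)(uintptr_t)dst;
        DMA_COUNT = (uint32_t)nbytes;
        DMA_CTRL  = DMA_START | DMA_IRQ_EN;   /* DMAC now moves the block itself */
    }

    /* ... and to handling the completion interrupt raised only after
       the entire block has been transferred. */
    void dma_done_isr(void)
    {
        /* block transfer finished; the buffer may now be used */
    }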