Microcomputer Architecture
3. RISC versus CISC characteristics, overlapped register windows, pipelining-general considerations, RISC
pipeline, instruction pipeline, parallel processing, vector processing, array processing, Superscalar processors
– overview, design issues, PowerPC, Pentium, CISC scalar & RISC scalar processors. Linear and nonlinear
pipeline processors.
6. DSP architecture and ASIC design: Basic architecture, operation, pipelining, Application Specific Instruction-Set
Processors (ASIPs) – Micro Controllers and Digital Signal Processors.
REFERENCE BOOKS
1. Computer system Architecture (PHI) by M.Morris Mano
2. Digital Design (PHI) by M.Morris Mano
3. Computer Organization & Architecture – (PHI) by William Stallings.
4. Advanced Computer Architecture (McGraw Hill) by Kai Hwang.
5. ARM Architecture Reference Manual, 2nd Ed. (Addison-Wesley, 2001), edited by David Seal.
Devasi Chocha Mobile -9662739107 Email-chochadevh@gmail.com
19-08-2020 12:46:44 Copyright@2020-21 Electrical Engg. Dept. of The M S University of Baroda 1
The Maharaja Sayajirao University of Baroda
Faculty of Technology and Engineering
Electrical Engineering Department
BE-IV-Electronics Microcomputer Architecture
Characteristics of RISC versus CISC:
1. RISC stands for Reduced Instruction Set Computer. CISC stands for Complex Instruction Set Computer.
2. RISC processors have simple instructions taking about one clock cycle each; the average clock cycles per instruction (CPI) is 1.5. CISC processors have complex instructions that take multiple clocks to execute; the average CPI is in the range of 2 to 15.
Example (RISC approach: multiply the operands at memory locations 2:3 and 5:2 using explicit loads and stores):
LOAD A, 2:3
LOAD B, 5:2
PROD A, B
STORE 2:3, A
Pipeline principle
1. Fetch Instruction (FI): Read the next expected instruction into a buffer.
2. Decode Instruction (DI): Determine the opcode and the operand specifiers.
3. Calculate Operands (CO): Calculate the effective address of each source operand.
4. Fetch Operands (FO): Fetch each operand from memory. Operands in registers
need not be fetched.
5. Execute Instruction (EI): Perform the indicated operation and store the result, if
any, in the specified destination operand location.
INSTRUCTION PIPELINING
➢ First stage fetches the instruction and buffers it.
➢ When the second stage is free, the first stage passes it the buffered instruction.
➢ While the second stage is executing the instruction, the first stage takes advantage of any unused memory cycles to fetch and buffer the next instruction.
➢ This is called instruction prefetch or fetch overlap.
❖ Fetch instruction (FI)
❖ Decode instruction (DI)
❖ Calculate operands (CO)
❖ Fetch operands (FO)
❖ Execute instruction (EI)
❖ Write operand (WO)
❖ Fetch Instruction (FI): Read the next expected instruction into a buffer.
❖ Decode Instruction (DI): Determine the opcode and the operand specifiers.
❖ Calculate Operands (CO): Calculate the effective address of each source operand.
❖ Fetch Operands (FO): Fetch each operand from memory. Operands in registers need not be fetched.
❖ Execute Instruction (EI): Perform the indicated operation and store the result.
❖ Write Operand (WO): Store the result in memory.
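The six stages above can be visualised with a short sketch. This is an illustrative Python model (not part of the course material) that prints the space-time diagram of an ideal pipeline, assuming one clock per stage and no stalls:

```python
# Illustrative sketch: space-time diagram for the six-stage pipeline
# (FI, DI, CO, FO, EI, WO), assuming one clock per stage and no stalls.

STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]

def timeline(n_instructions):
    """Return one {cycle: stage} mapping per instruction for an ideal pipeline."""
    rows = []
    for i in range(n_instructions):
        # Instruction i enters FI in cycle i+1 and advances one stage per cycle.
        rows.append({i + 1 + s: STAGES[s] for s in range(len(STAGES))})
    return rows

def total_cycles(n_instructions, k=len(STAGES)):
    # The first instruction takes k cycles; each later one adds a single cycle.
    return k + (n_instructions - 1)

if __name__ == "__main__":
    n = 3
    for num, row in enumerate(timeline(n), start=1):
        cells = [row.get(c, "  ") for c in range(1, total_cycles(n) + 1)]
        print(f"I{num}: " + " ".join(f"{c:>2}" for c in cells))
```

Note how instruction i+1 occupies the stage that instruction i has just vacated, which is exactly the fetch overlap described above.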
Stages Si and Si+1 are separated by latches.
Stage delay: τm (delay of the slowest stage); latch delay: d
Clock period: τ = max{τm} + d
Pipeline frequency: f = 1/τ
Tk = [k + (n-1)] τ   (time for n instructions on a k-stage pipeline)
T1 = n k τ           (time for n instructions without pipelining)
Speedup factor:
Sk = T1 / Tk = n k τ / ([k + (n-1)] τ) = n k / [k + (n-1)]
Efficiency:
Ek = Sk / k = n / [k + (n-1)]
Throughput:
Hk = n / ([k + (n-1)] τ) = n f / [k + (n-1)]
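These formulas are easy to check numerically. A minimal sketch, with k stages, n instructions and clock period tau as inputs:

```python
# Minimal numeric check of the pipeline performance formulas.
# k: number of stages, n: number of instructions, tau: clock period.

def pipelined_time(k, n, tau):
    """Tk = [k + (n - 1)] * tau"""
    return (k + (n - 1)) * tau

def nonpipelined_time(k, n, tau):
    """T1 = n * k * tau"""
    return n * k * tau

def speedup(k, n):
    """Sk = n*k / (k + (n - 1)); approaches k as n grows large."""
    return n * k / (k + (n - 1))

def efficiency(k, n):
    """Ek = Sk / k = n / (k + (n - 1))"""
    return speedup(k, n) / k

def throughput(k, n, tau):
    """Hk = n / ([k + (n - 1)] * tau) instructions per unit time."""
    return n / pipelined_time(k, n, tau)

if __name__ == "__main__":
    # With many instructions the speedup approaches the stage count k = 6.
    print(round(speedup(6, 10000), 3))
```

For a single instruction (n = 1) the speedup is 1, as expected: pipelining only pays off when a stream of instructions keeps all stages busy.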
Simple example
Consider a nonpipelined machine with 6 execution stages of lengths 50 ns, 50 ns, 60
ns, 60 ns, 50 ns, and 50 ns.
- Find the instruction latency on this machine.
- How much time does it take to execute 100 instructions?
Solution:
Instruction latency = 50+50+60+60+50+50= 320 ns
Time to execute 100 instructions = 100*320 = 32000 ns
Solution:
Remember that in the pipelined implementation, the length of the pipe stages must
all be the same, i.e., the speed of the slowest stage plus overhead. With 5 ns
overhead it comes to: 60 + 5 = 65 ns per stage.
Solution:
Speedup is the ratio of the average instruction time without pipelining to the
average instruction time with pipelining: 320 ns / 65 ns ≈ 4.9.
(Here we do not consider any stalls introduced by different types of hazards, which
we will look at in the next section.)
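The arithmetic of this example can be sketched in a few lines of Python, assuming the 5 ns latch overhead per stage used in the pipelined case:

```python
# Numeric check of the worked example: six stages of
# 50, 50, 60, 60, 50 and 50 ns, with an assumed 5 ns latch overhead.

stage_delays = [50, 50, 60, 60, 50, 50]  # ns
overhead = 5                             # ns latch overhead per stage

# Non-pipelined: one instruction passes through all stages sequentially.
instruction_latency = sum(stage_delays)            # total ns per instruction
time_100_nonpipelined = 100 * instruction_latency  # ns for 100 instructions

# Pipelined: every stage is stretched to the slowest stage plus overhead.
clock = max(stage_delays) + overhead               # ns per clock
# 100 instructions need k + (n - 1) = 6 + 99 clocks.
time_100_pipelined = (len(stage_delays) + 99) * clock

# Speedup = average instruction time without / with pipelining.
speedup = instruction_latency / clock

print(instruction_latency, clock, round(speedup, 2))
```

This reproduces the 320 ns latency of the non-pipelined machine and shows the pipelined clock of 65 ns, for a speedup just under the six-fold ideal.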
Pipeline Hazards
There are situations, called hazards, that prevent the next instruction in the
instruction stream from executing during its designated clock cycle.
Hazards reduce the performance from the ideal speedup gained by pipelining.
Structural Hazards
Structural hazards are those that occur because of resource conflicts.
Example 1
• For cost-saving reasons, a CPU may be designed with a single
interface to memory.
• This interface is always used during IF.
• It is also used during MEM for Load or Store operations.
• When a Load or Store gets to the MEM stage, the instruction in the
IF stage must be stalled.
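Example 1 can be sketched as a tiny scheduling model. This is a simplified illustration (not from the course material): in a hypothetical 5-stage pipeline (IF ID EX MEM WB) with one memory port, an instruction fetch must stall in any cycle where a load or store occupies MEM.

```python
# Simplified sketch of the single-memory-port structural hazard:
# a fetch scheduled for the same cycle as a load/store's MEM stage stalls.
# Each instruction's MEM stage falls 3 cycles after its own IF
# (this sketch ignores any further stalls between an instruction's stages).

def fetch_cycles(instrs):
    """instrs: list of booleans, True if the instruction is a load/store.
    Returns the cycle in which each instruction performs its IF."""
    fetch = []
    cycle = 1
    for i in range(len(instrs)):
        # Stall while some earlier load/store occupies MEM in this cycle.
        while any(fetch[j] + 3 == cycle
                  for j, is_mem in enumerate(instrs[:i]) if is_mem):
            cycle += 1  # memory port busy: stall the fetch
        fetch.append(cycle)
        cycle += 1
    return fetch

# A load in slot 0 reaches MEM in cycle 4, so the 4th fetch slips to cycle 5.
print(fetch_cycles([True, False, False, False]))
```

Without the load, the four fetches would simply occupy cycles 1 through 4.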
Structural Hazards
Example 2
❖ Suppose the branch target address is computed by the main ALU, which the branch needs during its MEM cycle.
❖ In such a case, the MEM cycle of a branch would interfere with the EX cycle of
the following instruction, causing a stall.
❖ In both cases, the problem could be solved with additional CPU hardware.
❖ In the first case, a second memory port.
❖ In the second case, an additional ALU.
❖ Therefore, structural hazards are caused solely by insufficient hardware.
Data Hazards
• Soon we will discuss machines that allow Loads and Stores to be executed
out of order.
Data Hazards
(i2 tries to read a source before i1 writes to it) A read after write (RAW) data hazard
refers to a situation where an instruction refers to a result that has not yet been
calculated or retrieved. This can occur because even though an instruction is
executed after a prior instruction, the prior instruction has been processed only partly
through the pipeline.
For example:
i1. R2 <- R5 + R3   IF  ID  EX  WB
i2. R4 <- R2 + R3       IF  ID  EX  WB
The first instruction is calculating a value to be saved in register R2, and the
second is going to use this value to compute a result for register R4. However, in a
pipeline, when operands are fetched for the 2nd operation, the results from the first
have not yet been saved, and hence a data dependency occurs.
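The RAW condition on the i1/i2 pair above is mechanical to detect. A minimal sketch, representing each instruction as a (destination, sources) pair:

```python
# Minimal sketch: detecting a RAW hazard between two instructions.
# Each instruction is (destination register, tuple of source registers).

def has_raw_hazard(i1, i2):
    """True if i2 reads a register that i1 writes (read-after-write)."""
    dest1, _ = i1
    _, sources2 = i2
    return dest1 in sources2

# i1. R2 <- R5 + R3 ; i2. R4 <- R2 + R3
i1 = ("R2", ("R5", "R3"))
i2 = ("R4", ("R2", "R3"))
print(has_raw_hazard(i1, i2))  # i2 reads R2 before i1 has written it back
```

Real hazard-detection logic in the ID stage does essentially this comparison in hardware, against every instruction still in flight.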
Data Hazards
•After instruction B has executed, the value of the register should be B's
result, but A's result is stored instead.
•This can only happen with pipelines that write values in more than one
stage, or in variable-length pipelines (e.g., FP pipelines).
•It does not happen in our version of the DLX pipeline, but a modified version
might allow it.
•More on this later.
(i2 tries to write an operand before it is written by i1) A write after write (WAW) data
hazard may occur in a concurrent execution environment.
For example:
Data Hazards
•This type of hazard is rare because most pipelines read values early and write
results late.
•However, it might happen in a CPU that has complex addressing modes, e.g.,
autoincrement.
(i2 tries to write a destination before it is read by i1) A write after read (WAR) data
hazard represents a problem with concurrent execution.
For example:
In any situation with a chance that i2 may finish before i1 (i.e., with concurrent
execution), it must be ensured that the result of register R5 is not stored before i1
has had a chance to fetch the operands.
Data Hazards
•This is NOT a hazard since the register value does NOT change.
The problem with data hazards introduced by this sequence of instructions can
be solved with a simple hardware technique called forwarding.
The key insight in forwarding is that the result is not really needed by SUB until
after the ADD actually produces it. The only problem is to make it available for
SUB when it needs it.
If the result can be moved from where the ADD produces it (EX/MEM register), to
where the SUB needs it (ALU input latch), then the need for a stall can be
avoided.
Forwarding works as follows:
❖ The ALU result from the EX/MEM register is always fed back to the ALU input
latches.
❖ If the forwarding hardware detects that the previous ALU operation has written the
register corresponding to the source for the current ALU operation, control logic
selects the forwarded result as the ALU input rather than the value read from the
register file.
Forwarding of results to the ALU requires the addition of three extra inputs on each
ALU multiplexer and the addition of three paths to the new inputs.
As our example shows, we need to forward results not only from the immediately
previous instruction, but possibly from an instruction that started three cycles earlier.
Forwarding can be arranged from MEM/WB latch to ALU input also. Using those
forwarding paths the code sequence can be executed without stalls:
Cycle:           1   2   3      4       5      6    7
ADD R1, R2, R3   IF  ID  EXadd  MEMadd  WB
SUB R4, R5, R1       IF  ID     EXsub   MEM    WB
AND R6, R1, R7           IF     ID      EXand  MEM  WB
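The forwarding-control decision just described can be sketched as a selection function. This is an illustrative model (latch and register names are simplified, not the actual hardware interface): the ALU source comes from the EX/MEM or MEM/WB latch when a newer result for that register is in flight, otherwise from the register file.

```python
# Illustrative sketch of forwarding control: the ALU source is taken from
# the EX/MEM or MEM/WB latch when it holds a newer result for the register,
# otherwise from the register file.

def alu_input(reg, regfile, ex_mem, mem_wb):
    """Select the value for source register `reg`.
    ex_mem / mem_wb: (destination_register, value) latches, or None."""
    if ex_mem and ex_mem[0] == reg:   # most recent in-flight result wins
        return ex_mem[1]
    if mem_wb and mem_wb[0] == reg:   # result from three cycles earlier
        return mem_wb[1]
    return regfile[reg]               # no forwarding needed

regs = {"R1": 0, "R2": 7, "R3": 5}
# ADD R1, R2, R3 has computed 12 and sits in EX/MEM; SUB needs R1 now.
print(alu_input("R1", regs, ex_mem=("R1", 12), mem_wb=None))
```

Checking EX/MEM before MEM/WB matters: if both latches carry a result for the same register, the newer one must be forwarded.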
Control Hazards
Control hazards can cause a greater performance loss for a pipeline than data
hazards. When a branch is executed, it may or may not change the PC (program
counter) to something other than its current value.
If instruction i is a taken branch, then the PC is normally not changed until the end
of the MEM stage, after the completion of the address calculation and comparison.
The simplest method of dealing with branches is to stall the pipeline as soon as
the branch is detected until we reach the MEM stage, which determines the new
PC. The pipeline behaviour looks like:
Branch instruction   IF  ID  EX  MEM  WB
Branch successor         IF  stall  stall  IF  ID  EX  MEM  WB
Branch successor+1                         IF  ID  EX  MEM  WB
Control Hazards
This control hazard stall must be implemented differently from a data hazard stall,
since the IF cycle of the instruction following the branch must be repeated as soon
as we know the branch outcome. Thus, the first IF cycle is essentially a stall
(because it never performs useful work), which comes to a total of three stall cycles.
Three clock cycles wasted for every branch is a significant loss. With a 30% branch
frequency and an ideal CPI of 1, the machine with branch stalls achieves only half
the ideal speedup from pipelining!
The number of clock cycles can be reduced by two steps:
❖ Find out whether the branch is taken or not taken earlier in the pipeline;
❖ Compute the taken PC (i.e., the address of the branch target) earlier.
Both steps should be taken as early in the pipeline as possible. In some machines,
branch hazards are even more expensive in clock cycles. For example, a machine
with separate decode and register fetch stages will probably have a branch delay -
the length of the control hazard - that is at least one clock cycle longer. The branch
delay, unless it is dealt with, turns into a branch penalty. Many older machines that
implement more complex instruction sets have branch delays of four clock cycles or
more.
In general, the deeper the pipeline, the worse the branch penalty in clock cycles.
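The "only half the ideal speedup" claim above follows from simple CPI arithmetic, which can be checked directly:

```python
# Numeric check: with a 30% branch frequency, an ideal CPI of 1 and
# 3 stall cycles per branch, the pipeline reaches only about half
# of its ideal speedup.

def effective_cpi(ideal_cpi, branch_freq, branch_penalty):
    # Every branch adds branch_penalty stall cycles on average.
    return ideal_cpi + branch_freq * branch_penalty

cpi = effective_cpi(1.0, 0.30, 3)   # 1 + 0.3 * 3 = 1.9
fraction_of_ideal = 1.0 / cpi       # ~0.53: roughly half the ideal speedup
print(cpi, round(fraction_of_ideal, 2))
```

The same function shows why reducing the penalty matters: cutting the branch delay from three cycles to one brings the effective CPI down from 1.9 to 1.3.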
There are many methods to deal with the pipeline stalls caused by the branch delay:
❖ Stall pipeline
❖ Predict taken
❖ Predict not taken
❖ Delayed branch
Stall pipeline
The simplest scheme to handle branches is to freeze or flush the pipeline, holding
or deleting any instructions after the branch until the branch destination is known.
Advantage: simple for both software and hardware.
A higher performance, and only slightly more complex, scheme is to predict the
branch as not taken, simply allowing the hardware to continue as if the branch
were not executed. Care must be taken not to change the machine state until the
branch outcome is definitely known.
Because in this pipeline the target address is not known any earlier than the branch
outcome, there is no advantage in predicting the branch as taken. In some machines
where the target address is known before the branch outcome, a predict-taken
scheme might make sense.
Delayed Branch
Cont..
Taken branch instr        IF  ID  EX  MEM  WB
Branch delay instr (i+1)      IF  ID  EX   MEM  WB
Branch target                     IF  ID   EX   MEM  WB
Branch target+1                       IF   ID   EX   MEM  WB
Branch target+2                            IF   ID   EX   MEM  WB
The job of the compiler is to make the successor instructions valid and useful.
Then those instructions which now follow the branch can be executed while
the branch target is being determined.
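The compiler's delay-slot scheduling described above can be sketched as a list transformation. This is a deliberately simplified illustration: it only checks that the moved instruction does not write a register the branch reads, and ignores dependences among the instructions it moves over.

```python
# Simplified sketch of filling a branch delay slot: move the nearest
# earlier instruction whose destination the branch does not read into
# the slot after the branch. Ignores dependences between the moved
# instruction and the instructions it hops over.

def fill_delay_slot(instrs, branch_idx, branch_sources):
    """instrs: list of (name, destination_register) pairs.
    branch_sources: registers the branch at branch_idx reads."""
    for i in range(branch_idx - 1, -1, -1):
        _, dest = instrs[i]
        if dest not in branch_sources:
            moved = instrs[i]
            # Slide the intervening instructions up, put `moved` after the branch.
            return (instrs[:i] + instrs[i + 1:branch_idx + 1]
                    + [moved] + instrs[branch_idx + 1:])
    return instrs  # nothing safe to move: the slot would hold a NOP

code = [("ADD", "R1"), ("SUB", "R2"), ("BEQZ", None)]
# The branch tests R2, so SUB cannot move, but ADD (writes R1) can.
print(fill_delay_slot(code, 2, {"R2"}))
```

When no candidate is safe, a real compiler would fall back to a NOP in the slot, or to an instruction from the branch target or fall-through path.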
1. Multiple streams
2. Prefetch branch target.
3. Loop buffer
4. Branch prediction
5. Delayed branch (Discussed in last topic)
Multiple Streams
Challenges:
Leads to bus & register contention
Multiple branches lead to further pipelines being needed
Branch Prediction
➢ Predict by opcode
VMIPS
Vector Supercomputers
Epitomized by Cray-1, 1976:
Cray-1 (1976)