
Unit III - Basic Processing Unit

Slide Courtesy : “Computer Organization” by Carl Hamacher, Zvonko Vranesic, Safwat Zaky
Overview

• Instruction Set Processor (ISP)
• Central Processing Unit (CPU)
• A typical computing task consists of a series of steps specified by a sequence of machine instructions that constitute a program.
• An instruction is executed by carrying out a sequence of more rudimentary operations.
Some Fundamental Concepts
Fundamental Concepts

• The processor fetches one instruction at a time and performs the operation specified.
• Instructions are fetched from successive memory locations until a branch or a jump instruction is encountered.
• The processor keeps track of the address of the memory location containing the next instruction to be fetched using the Program Counter (PC).
• The Instruction Register (IR) holds the instruction currently being executed.
Executing an Instruction

• Fetch the contents of the memory location pointed to by the PC. The contents of this location are loaded into the IR (fetch phase).
IR ← [[PC]]
• Assuming that the memory is byte addressable, increment the contents of the PC by 4 (fetch phase).
PC ← [PC] + 4
• Carry out the actions specified by the instruction in the IR (execution phase).
Processor Organization

[Figure 7.1 shows the single-bus organization of the datapath inside a processor: PC, MAR, MDR, IR, the register file R0 through R(n-1), temporary registers Y, Z, and TEMP, a MUX selecting between Y and the constant 4, and the ALU, all connected to a single internal processor bus. The instruction decoder and control logic issue the control signals; MAR drives the address lines and MDR connects to the data lines of the memory bus. Note that MDR has two inputs and two outputs.]

Textbook Page 413

Figure 7.1. Single-bus organization of the datapath inside a processor.

Executing an Instruction

• Transfer a word of data from one processor register to another or to the ALU.
• Perform an arithmetic or a logic operation and store the result in a processor register.
• Fetch the contents of a given memory location and load them into a processor register.
• Store a word of data from a processor register into a given memory location.
Register Transfers

[Figure 7.2 shows the input and output gating for the registers of Figure 7.1: each register Ri has gating signals Riin and Riout; register Y has Yin; the MUX selects between Y and the constant 4 for ALU input A; register Z has Zin and Zout.]

Figure 7.2. Input and output gating for the registers in Figure 7.1.
Register Transfers

• All operations and data transfers are controlled by the processor clock.

[Figure 7.3 shows one register bit: a D flip-flop clocked by the processor clock, with Riin gating the bus value into the flip-flop and Riout gating the stored bit back onto the bus.]

Figure 7.3. Input and output gating for one register bit.
Performing an Arithmetic or
Logic Operation

• The ALU is a combinational circuit that has no internal storage.
• The ALU gets its two operands from the MUX and the bus. The result is temporarily stored in register Z.
• What is the sequence of operations to add the contents of register R1 to those of R2 and store the result in R3?
1. R1out, Yin
2. R2out, SelectY, Add, Zin
3. Zout, R3in
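The three control steps above can be mimicked in a few lines of Python. This is a minimal simulation sketch, not part of the textbook: the register dictionary, its initial values, and the `bus` variable are illustrative assumptions.

```python
# Sketch: the three control steps for R3 <- [R1] + [R2] on the
# single-bus datapath of Figure 7.1 (register values are made up).
regs = {"R1": 5, "R2": 7, "R3": 0, "Y": 0, "Z": 0}

# Step 1: R1out, Yin -- R1 drives the bus, Y latches it
bus = regs["R1"]
regs["Y"] = bus

# Step 2: R2out, SelectY, Add, Zin -- ALU adds Y (via the MUX) and the bus
bus = regs["R2"]
regs["Z"] = regs["Y"] + bus

# Step 3: Zout, R3in -- Z drives the bus, R3 latches the result
bus = regs["Z"]
regs["R3"] = bus

print(regs["R3"])  # 12
```

The point of the three-step split is that only one value can be on the single bus per step, which is why Y and Z are needed as staging registers.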
Fetching a Word from Memory

• Load the address into MAR; issue a Read operation; the data arrive in MDR.

[Figure 7.4 shows the connections for register MDR: MDRout and MDRoutE gate its contents onto the internal processor bus and the memory-bus data lines, respectively; MDRin and MDRinE gate data in from those two sources.]

Figure 7.4. Connection and control signals for register MDR.
Fetching a Word from Memory

• The response time of each memory access varies (cache miss, memory-mapped I/O, …).
• To accommodate this, the processor waits until it receives an indication that the requested operation has been completed (Memory-Function-Completed, MFC).
• Move (R1), R2
 MAR ← [R1]
 Start a Read operation on the memory bus
 Wait for the MFC response from the memory
 Load MDR from the memory bus
 R2 ← [MDR]
Timing

[Figure 7.5 shows the timing of a memory Read operation over steps 1–3: MARin is asserted in step 1 (MAR ← [R1]); assuming MAR is always available on the address lines of the memory bus, Read/MR start the Read operation and MDRinE is enabled; the processor then waits for the MFC response from the memory; finally MDRout loads the data (R2 ← [MDR]).]

Figure 7.5. Timing of a memory Read operation.


Execution of a Complete
Instruction

• Add (R3), R1
• Fetch the instruction
• Fetch the first operand (the contents of the
memory location pointed to by R3)
• Perform the addition
• Load the result into R1
Architecture

[Figure 7.2, repeated for reference: input and output gating for the registers in Figure 7.1.]
Execution of a Complete
Instruction

Add (R3), R1

Step  Action

1  PCout, MARin, Read, Select4, Add, Zin
2  Zout, PCin, Yin, WMFC
3  MDRout, IRin
4  R3out, MARin, Read
5  R1out, Yin, WMFC
6  MDRout, SelectY, Add, Zin
7  Zout, R1in, End

Figure 7.6. Control sequence for execution of the instruction Add (R3),R1.


Execution of Branch
Instructions

• A branch instruction replaces the contents of the PC with the branch target address, which is usually obtained by adding an offset X given in the branch instruction.
• The offset X is usually the difference between the branch target address and the address immediately following the branch instruction.
• Conditional branch
Execution of Branch
Instructions

Step  Action

1  PCout, MARin, Read, Select4, Add, Zin
2  Zout, PCin, Yin, WMFC
3  MDRout, IRin
4  Offset-field-of-IRout, Add, Zin
5  Zout, PCin, End

Figure 7.7. Control sequence for an unconditional branch instruction.


Multiple-Bus Organization

[Figure 7.8 shows the three-bus organization of the datapath: buses A, B, and C; the PC with its own incrementer; a general-purpose register file; the constant 4 and a MUX feeding ALU input A; the ALU result register R driving bus C; the instruction decoder fed by IR; and MDR and MAR connecting to the data and address lines of the memory bus.]

Figure 7.8. Three-bus organization of the datapath.


Multiple-Bus Organization

• Add R4, R5, R6

Step  Action

1  PCout, R=B, MARin, Read, IncPC
2  WMFC
3  MDRoutB, R=B, IRin
4  R4outA, R5outB, SelectA, Add, R6in, End

Figure 7.9. Control sequence for the instruction Add R4,R5,R6,
for the three-bus organization in Figure 7.8.
Quiz

• What is the control sequence for execution of the instruction

Add R1, R2

including the instruction fetch phase? (Assume the single-bus architecture of Figure 7.1.)

Hardwired Control
Overview

• To execute instructions, the processor must have some means of generating the control signals needed in the proper sequence.
• Two categories: hardwired control and microprogrammed control.
• A hardwired system can operate at high speed, but offers little flexibility.
Control Unit Organization

[Figure 7.10 shows the control unit: the clock (CLK) drives a control step counter; the counter, IR, external inputs, and condition codes feed a decoder/encoder block that produces the control signals.]

Figure 7.10. Control unit organization.


Detailed Block Description

[Figure 7.11 separates the decoding and encoding functions: the clock drives the control step counter (with Reset); a step decoder produces the timing signals T1, T2, …, Tn; an instruction decoder produces INS1 … INSm from IR; the encoder combines these with the external inputs and condition codes to generate the control signals, including Run and End.]

Figure 7.11. Separation of the decoding and encoding functions.


Generating Zin

• Zin = T1 + T6 • ADD + T4 • BR + …

[Figure 7.12 shows the logic: an AND of T6 with Add, an AND of T4 with Branch, and T1 are ORed together to produce Zin.]

Figure 7.12. Generation of the Zin control signal for the processor in Figure 7.1.
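The same AND/OR structure can be expressed as a boolean function. This is a hedged sketch of the idea only: the integer step numbers and opcode strings are assumptions for illustration, not how the hardware is specified.

```python
# Sketch of the hardwired encoder logic Zin = T1 + T6.ADD + T4.BR.
# Step numbers and opcode names are illustrative assumptions.
def zin(step, opcode):
    """Return True when the Zin control signal should be asserted."""
    return (step == 1                          # fetch: PC + 4 latched into Z
            or (step == 6 and opcode == "ADD") # Add result latched into Z
            or (step == 4 and opcode == "BR")) # branch target latched into Z

print(zin(1, "ADD"), zin(6, "ADD"), zin(4, "BR"), zin(5, "ADD"))
```

In the real circuit this function is just the OR gate of Figure 7.12; each `and` term corresponds to one AND gate in the encoder.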
Generating End

• End = T7 • ADD + T5 • BR + (T5 • N + T4 • N̄) • BRN + …

[Figure 7.13 shows the corresponding logic combining T7 with Add, T5 with Branch, and T4/T5 with the N condition code for Branch<0 to produce End.]

Figure 7.13. Generation of the End control signal.


A Complete Processor

[Figure 7.14 shows the block diagram of a complete processor: an instruction unit, an integer unit, and a floating-point unit, backed by separate instruction and data caches, connected through a bus interface to the system bus, which also serves the main memory and input/output.]

Figure 7.14. Block diagram of a complete processor.


Microprogrammed Control
Overview

• Control signals are generated by a program similar to machine language programs.
• Control Word (CW); microroutine; microinstruction

Signal order (one bit per column): PCin, PCout, MARin, Read, MDRout, IRin, Yin, Select, Add, Zin, Zout, R1out, R1in, R3out, WMFC, End

Micro-
instruction
1   0 1 1 1 0 0 0 1 1 1 0 0 0 0 0 0
2   1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0
3   0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
4   0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0
5   0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0
6   0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0
7   0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1

Figure 7.15. An example of microinstructions for Figure 7.6.


Overview

Step  Action

1  PCout, MARin, Read, Select4, Add, Zin
2  Zout, PCin, Yin, WMFC
3  MDRout, IRin
4  R3out, MARin, Read
5  R1out, Yin, WMFC
6  MDRout, SelectY, Add, Zin
7  Zout, R1in, End

Figure 7.6. Control sequence for execution of the instruction Add (R3),R1.
Overview

• Control store

[Figure 7.16 shows the basic organization of a microprogrammed control unit: IR feeds a starting address generator, which loads the μPC; the clock advances the μPC, which addresses the control store; the control store emits one control word (CW) per step. One function cannot be carried out by this simple organization.]

Figure 7.16. Basic organization of a microprogrammed control unit.


Overview

• The previous organization cannot handle the situation when the control unit is required to check the status of the condition codes or external inputs to choose between alternative courses of action.
• Use conditional branch microinstructions.

Address  Microinstruction

0   PCout, MARin, Read, Select4, Add, Zin
1   Zout, PCin, Yin, WMFC
2   MDRout, IRin
3   Branch to starting address of appropriate microroutine
    ..........................................................
25  If N=0, then branch to microinstruction 0
26  Offset-field-of-IRout, SelectY, Add, Zin
27  Zout, PCin, End

Figure 7.17. Microroutine for the instruction Branch<0.


Overview

[Figure 7.18 extends Figure 7.16 to allow conditional branching: the starting and branch address generator now receives the external inputs and condition codes in addition to IR, and loads the μPC accordingly; the control store then emits the CW.]

Figure 7.18. Organization of the control unit to allow
conditional branching in the microprogram.
Microinstructions

• A straightforward way to structure microinstructions is to assign one bit position to each control signal.
• However, this is very inefficient.
• The length can be reduced: most signals are not needed simultaneously, and many signals are mutually exclusive.
• All mutually exclusive signals are placed in the same group and encoded in binary.
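The saving from binary coding is easy to quantify. A quick sketch (the helper name and the example signal count are illustrative, not from the slides):

```python
import math

# A group of n mutually exclusive signals needs ceil(log2(n + 1)) bits
# when binary coded (the +1 reserves a "no action" code), versus n bits
# when each signal gets its own bit position.
def encoded_bits(n_signals):
    return math.ceil(math.log2(n_signals + 1))

# e.g. a group of 15 mutually exclusive "out" gating signals
print(encoded_bits(15))  # 4  (instead of 15 one-hot bits)
```

This is exactly the trade made by the field-encoded format shown next: each field is a binary-coded group of mutually exclusive signals.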
Partial Format for the
Microinstructions

Microinstruction fields: F1 F2 F3 F4 F5 F6 F7 F8

F1 (4 bits): 0000: No transfer, 0001: PCout, 0010: MDRout, 0011: Zout, 0100: R0out, 0101: R1out, 0110: R2out, 0111: R3out, 1010: TEMPout, 1011: Offsetout
F2 (3 bits): 000: No transfer, 001: PCin, 010: IRin, 011: Zin, 100: R0in, 101: R1in, 110: R2in, 111: R3in
F3 (3 bits): 000: No transfer, 001: MARin, 010: MDRin, 011: TEMPin, 100: Yin
F4 (4 bits): 0000: Add, 0001: Sub, …, 1111: XOR (16 ALU functions)
F5 (2 bits): 00: No action, 01: Read, 10: Write
F6 (1 bit): 0: SelectY, 1: Select4
F7 (1 bit): 0: No action, 1: WMFC
F8 (1 bit): 0: Continue, 1: End

What is the price paid for this scheme?

Figure 7.19. An example of a partial format for field-encoded microinstructions.

Further Improvement

• Enumerate the patterns of required signals in all possible microinstructions. Each meaningful combination of active control signals can then be assigned a distinct code.
• Vertical organization
• Horizontal organization
Microprogram Sequencing

• If all microprograms required only straightforward sequential execution of microinstructions except for branches, letting a μPC govern the sequencing would be efficient.
• However, there are two disadvantages:
 Having a separate microroutine for each machine instruction results in a large total number of microinstructions and a large control store.
 Longer execution time, because it takes more time to carry out the required branches.
• Example: Add src, Rdst
• Four addressing modes: register, autoincrement, autodecrement, and indexed (with indirect forms).
- Bit-ORing
- Wide-branch addressing
- WMFC
Contents of IR:  OP code | Mode = 010 | Rsrc | Rdst
                 (bit 11 and above; bits 10–8; bits 7–4; bits 3–0)

Address  Microinstruction
(octal)

000  PCout, MARin, Read, Select4, Add, Zin
001  Zout, PCin, Yin, WMFC
002  MDRout, IRin
003  μBranch {μPC ← 101 (from Instruction decoder);
     μPC5,4 ← [IR10,9]; μPC3 ← [IR10]·[IR9]·[IR8]}
121  Rsrcout, MARin, Read, Select4, Add, Zin
122  Zout, Rsrcin
123  μBranch {μPC ← 170; μPC0 ← [IR8]}, WMFC
170  MDRout, MARin, Read, WMFC
171  MDRout, Yin
172  Rdstout, SelectY, Add, Zin
173  Zout, Rdstin, End

Figure 7.21. Microroutine for Add (Rsrc)+,Rdst.

Note: The microinstruction at location 170 is not executed for this addressing mode.
Microinstructions with Next-
Address Field
• The microprogram we discussed requires several branch microinstructions, which perform no useful operation in the datapath.
• A powerful alternative approach is to include an address field as part of every microinstruction to indicate the location of the next microinstruction to be fetched.
• Pros: separate branch microinstructions are virtually eliminated; there are few limitations in assigning addresses to microinstructions.
• Cons: additional bits for the address field (around 1/6 of the microinstruction length).
Microinstructions with Next-
Address Field

[Figure 7.22 shows the microinstruction-sequencing organization: IR, the external inputs, and the condition codes feed the decoding circuits, which load the microinstruction address register μAR; μAR addresses the control store; each microinstruction's next-address field, after decoding, supplies the next value of μAR along with the control signals.]

Figure 7.22. Microinstruction-sequencing organization.


Microinstruction

Fields: F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 F10

F0 (8 bits): address of the next microinstruction
F1 (3 bits): 000: No transfer, 001: PCout, 010: MDRout, 011: Zout, 100: Rsrcout, 101: Rdstout, 110: TEMPout
F2 (3 bits): 000: No transfer, 001: PCin, 010: IRin, 011: Zin, 100: Rsrcin, 101: Rdstin
F3 (3 bits): 000: No transfer, 001: MARin, 010: MDRin, 011: TEMPin, 100: Yin
F4 (4 bits): 0000: Add, 0001: Sub, …, 1111: XOR
F5 (2 bits): 00: No action, 01: Read, 10: Write
F6 (1 bit): 0: SelectY, 1: Select4
F7 (1 bit): 0: No action, 1: WMFC
F8 (1 bit): 0: NextAdrs, 1: InstDec
F9 (1 bit): 0: No action, 1: ORmode
F10 (1 bit): 0: No action, 1: ORindsrc

Figure 7.23. Format for microinstructions in the example of Section 7.5.3.


Implementation of the
Microroutine

[Figure 7.24 lists, for each octal address of the microroutine of Figure 7.21 (000–003, 121–122, 170–173), the bit patterns of fields F0–F10 encoded as in Figure 7.23; the next-address field F0 chains the microinstructions, and F8–F10 invoke the instruction decoder and the bit-ORing of the mode bits.]

Figure 7.24. Implementation of the microroutine of Figure 7.21 using a
next-microinstruction address field. (See Figure 7.23 for encoded signals.)
[Figure 7.25 shows some details of the control-signal-generating circuitry: decoders driven by the Rsrc and Rdst fields of IR produce the register gating signals R0in/R0out through R15in/R15out; the decoding circuits combine the external inputs, condition codes, InstDecout, ORmode, and ORindsrc to form the next address in μAR; the microinstruction decoder combines fields F1, F2, F8, F9, and F10 from the control store with Rsrcin/Rsrcout and Rdstin/Rdstout to produce the control signals. This is where bit-ORing is implemented.]

Figure 7.25. Some details of the control-signal-generating circuitry.
Pipelining
What Is A Pipeline?

• Pipelining is used by virtually all modern microprocessors to enhance performance by overlapping the execution of instructions.
• A common analogy for a pipeline is a factory assembly line. Assume that there are three stages:
1. Welding
2. Painting
3. Polishing
• For simplicity, assume that each task takes one hour.
What Is A Pipeline?

• If a single person were to work on the product, it would take three hours to produce one product.
• If we had three people, one person could work on each stage; upon completing their stage, they could pass their product on to the next person (since each stage takes one hour, there will be no waiting).
• We could then produce one product per hour, assuming the assembly line has been filled.
Characteristics Of Pipelining

• If the stages of a pipeline are not balanced and one stage is slower than another, the throughput of the entire pipeline is affected.
• In terms of a pipeline within a CPU, each instruction is broken up into different stages. Ideally, if each stage is balanced (all stages are ready to start at the same time and take an equal amount of time to execute), the time taken per instruction (pipelined) is:

Time per instruction (unpipelined) / Number of stages
Characteristics Of Pipelining

• The previous expression is ideal. We will see later that there are many ways in which a pipeline cannot function in a perfectly balanced fashion.
• In terms of a CPU, the implementation of pipelining has the effect of reducing the average instruction time, therefore reducing the average CPI.
• EX: If each instruction in a microprocessor takes 5 clock cycles (unpipelined) and we have a 4-stage pipeline, the ideal average CPI with the pipeline will be 1.25.
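The example can be checked directly (values taken from the slide itself):

```python
# Ideal pipelined time per instruction = unpipelined time / number of stages.
unpipelined_cpi = 5   # clock cycles per instruction, unpipelined
stages = 4            # pipeline depth
ideal_cpi = unpipelined_cpi / stages
print(ideal_cpi)  # 1.25
```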
RISC Instruction Set Basics
(from Hennessy and Patterson)

• Properties of RISC architectures:
– All operations on data apply to data in registers and typically change the entire register (32 bits or 64 bits).
– The only operations that affect memory are load/store operations: memory to register, and register to memory.
– Load and store operations on data less than the full size of a register (32, 16, 8 bits) are often available.
– Usually instructions are few in number (this can be relative) and are typically one size.
RISC Instruction Set Basics
Types Of Instructions
• ALU Instructions:
• Arithmetic operations either take two registers as operands or take one register and a sign-extended immediate value as an operand. The result is stored in a third register.
• Logical operations AND, OR, XOR do not usually differentiate between 32-bit and 64-bit.
• Load/Store Instructions:
• Usually take a register (base register) as an operand and a 16-bit immediate value. The sum of the two creates the effective address. A second register acts as the destination in the case of a load operation.
RISC Instruction Set Basics
Types Of Instructions (continued)

• In the case of a store operation, the second register contains the data to be stored.
• Branches and Jumps
• Conditional branches are transfers of control. As described before, a branch causes an immediate value to be added to the current program counter.
• Appendix A has a more detailed description of the RISC instruction set. Also, the inside back cover has a listing of a subset of the MIPS64 instruction set.
RISC Instruction Set
Implementation
• We first need to look at how instructions in the MIPS64 instruction set
are implemented without pipelining. We’ll assume that any
instruction of the subset of MIPS64 can be executed in at most 5 clock
cycles.
• The five clock cycles will be broken up into the following steps:
• Instruction Fetch Cycle
• Instruction Decode/Register Fetch Cycle
• Execution Cycle
• Memory Access Cycle
• Write-Back Cycle
Instruction Fetch (IF) Cycle

• The value in the PC represents an address in memory. The MIPS64 instructions are all 32 bits in length. Figure 2.27 shows how the 32 bits (4 bytes) are arranged depending on the instruction.
• First we load the 4 bytes in memory into the CPU.
• Second we increment the PC by 4 because memory addresses are arranged in byte ordering. This will now represent the next instruction. (Is this certain???)
Instruction Decode
(ID)/Register Fetch Cycle
• Decode the instruction and at the same time read in the values of the registers involved. As the registers are being read, perform an equality test in case the instruction decodes as a branch or jump.
• The offset field of the instruction is sign-extended in case it is needed. The possible branch effective address is computed by adding the sign-extended offset to the incremented PC. The branch can be completed at this stage if the equality test is true and the instruction decoded as a branch.
Instruction Decode (ID)/Register
Fetch Cycle (continued)

• The instruction can be decoded in parallel with reading the registers because the register addresses are at fixed locations.
Execution (EX)/Effective
Address Cycle

• If a branch or jump did not occur in the previous cycle, the arithmetic logic unit (ALU) can execute the instruction.
• At this point the instruction falls into one of three types:
• Memory Reference: the ALU adds the base register and the offset to form the effective address.
• Register-Register: the ALU performs the arithmetic, logical, etc. operation as per the opcode.
• Register-Immediate: the ALU performs an operation based on the register and the sign-extended immediate value.
Memory Access (MEM) Cycle

• If a load, the effective address computed from


the previous cycle is referenced and the
memory is read. The actual data transfer to
the register does not occur until the next
cycle.
• If a store, the data from the register is written
to the effective address in memory.
Write-Back (WB) Cycle

• Occurs with Register-Register ALU instructions or load instructions.
• A simple operation: whether it is a register-register operation or a memory load operation, the resulting data is written to the appropriate register.
Looking At The Big Picture

• Overall, the most time that a non-pipelined instruction can take is 5 clock cycles. Below is a summary:
• Branch - 2 clock cycles
• Store - 4 clock cycles
• Other - 5 clock cycles
• EX: Assuming branch instructions account for 12% of all instructions and stores account for 10%, what is the average CPI of a non-pipelined CPU?
ANS: 0.12*2 + 0.10*4 + 0.78*5 = 4.54
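The answer is a frequency-weighted average, which a short script makes explicit (the instruction mix is taken from the slide; the dictionary encoding is an illustrative choice):

```python
# Average CPI = sum over instruction classes of (fraction * cycles).
mix = {"branch": (0.12, 2), "store": (0.10, 4), "other": (0.78, 5)}
cpi = sum(frac * cycles for frac, cycles in mix.values())
print(round(cpi, 2))  # 4.54
```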
The Classical RISC 5 Stage
Pipeline

• In an ideal case, to implement a pipeline we just need to start a new instruction at each clock cycle.
• Unfortunately there are many problems with trying to implement this. Obviously we cannot have the ALU performing an ADD operation and a MULTIPLY at the same time. But if we look at each stage of instruction execution as being independent, we can see how instructions can be "overlapped".
ENGR9861 Winter 2007 RV
Problems With The Previous
Figure
• The memory is accessed twice during each clock cycle. This problem is avoided by using separate data and instruction caches.
• It is important to note that if the clock period is the same for a pipelined processor and a non-pipelined processor, the memory must work five times faster.
• Another problem that we can observe is that the registers are accessed twice every clock cycle. To try to avoid a resource conflict, we perform the register write in the first half of the cycle and the read in the second half of the cycle.
Problems With The Previous
Figure (continued)
• We write in the first half so that a value written by one instruction can be read in the same cycle by another instruction further down the pipeline.
• A third problem arises with the interaction of the pipeline with the PC. We use an adder to increment the PC by the end of IF. Within ID we may branch and modify the PC. How does this affect the pipeline?
• The use of pipeline registers gives the CPU the storage needed to implement the pipeline. Remember that the previous figure has only one resource in use in each stage.
Pipeline Hazards

• The performance gain from using pipelining occurs because we can start the execution of a new instruction each clock cycle. In a real implementation this is not always possible.
• Another important note is that in a pipelined processor, a particular instruction still takes at least as long to execute as it would non-pipelined.
• Pipeline hazards prevent the execution of the next instruction during the appropriate clock cycle.
Types Of Hazards

• There are three types of hazards in a pipeline; they are as follows:
• Structural Hazards: created when the datapath hardware in the pipeline cannot support all of the overlapped instructions in the pipeline.
• Data Hazards: occur when an instruction in the pipeline affects the result of another instruction in the pipeline.
• Control Hazards: caused by the pipelining of branches and other instructions that change the PC.
A Hazard Will Cause A Pipeline
Stall

• Some performance expressions involving a realistic pipeline, in terms of CPI. It is assumed that the clock period is the same for pipelined and unpipelined implementations.

Speedup = CPI unpipelined / CPI pipelined
        = Pipeline depth / (1 + Stalls per instruction)
        = Avg instruction time unpipelined / Avg instruction time pipelined
A Hazard Will Cause A Pipeline
Stall (continued)

• We can look at pipeline performance in terms of a faster clock cycle time as well:

Speedup = (CPI unpipelined / CPI pipelined) x (Clock cycle time unpipelined / Clock cycle time pipelined)

Clock cycle time pipelined = Clock cycle time unpipelined / Pipeline depth

Speedup = (1 / (1 + Pipeline stalls per instruction)) x Pipeline depth
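The last relation can be sketched as a tiny helper. The depth and stall rates below are illustrative assumptions, not figures from the slides:

```python
# Speedup = Pipeline depth / (1 + stalls per instruction).
def speedup(depth, stalls_per_ins):
    return depth / (1 + stalls_per_ins)

print(speedup(5, 0.0))   # 5.0 -- ideal: speedup equals pipeline depth
print(speedup(5, 0.25))  # 4.0 -- one stall per four instructions erodes it
```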
Dealing With Structural Hazards

• Structural hazards result from the CPU datapath not having the resources to service all the required overlapping stages.
• Suppose a processor can only read or write the registers once in one clock cycle. This would cause a problem during the ID and WB stages.
• Assume that there are no separate instruction and data caches, and only one memory access can occur during one clock cycle. A hazard would be caused during the IF and MEM cycles.
Dealing With Structural Hazards

• A structural hazard is dealt with by inserting a stall or pipeline bubble into the pipeline. This means that for that clock cycle, nothing happens for that instruction. This effectively "slides" that instruction, and subsequent instructions, by one clock cycle.
• This effectively increases the average CPI.
• EX: Assume that you need to compare two processors, one with a structural hazard that occurs 40% of the time, causing a stall. Assume that the processor with the hazard has a clock rate 1.05 times faster than the processor without the hazard. How fast is the processor with the hazard compared to the one without the hazard?
Dealing With Structural Hazards
(continued)

Speedup = (CPI no haz / CPI haz) x (Clock cycle time no haz / Clock cycle time haz)

Speedup = (1 / (1 + 0.4*1)) x (1 / (1/1.05))

        = 0.75
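The slide's numbers can be reproduced directly (a sketch using the values stated above):

```python
# Comparing the processor with the structural hazard to the one without.
cpi_no_haz = 1.0
cpi_haz = 1.0 + 0.4 * 1          # 40% of instructions stall for 1 cycle
clock_ratio = 1.05               # hazard machine's clock is 1.05x faster

speedup = (cpi_no_haz / cpi_haz) * clock_ratio
print(round(speedup, 2))  # 0.75
```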
Dealing With Structural Hazards
(continued)

• We can see that even though the clock speed of the processor with the hazard is a little faster, the speedup is still less than 1.
• Therefore the hazard has quite an effect on the performance.
• Sometimes computer architects will opt to design a processor that exhibits a structural hazard. Why?
• A: The improvement to the processor datapath is too costly.
• B: The hazard occurs rarely enough that the processor will still perform to specifications.
Data Hazards (A Programming
Problem?)

• We haven't looked at assembly programming in detail at this point.
• Consider the following operations:
DADD R1, R2, R3
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11

• What are the problems?
Data Hazard Avoidance

• In this trivial example, we cannot expect the programmer to reorder his/her operations, assuming this is the only code we want to execute.
• Data forwarding can be used to solve this problem.
• To implement data forwarding we need to bypass the pipeline register flow:
– Output from the EX/MEM and MEM/WB registers must be fed back into the ALU input.
– We need routing hardware that detects when the next instruction depends on the write of a previous instruction.
General Data Forwarding

• It is easy to see how data forwarding can be used by drawing out the pipelined execution of each instruction.
• Now consider the following instructions:

DADD R1, R2, R3
LD R4, 0(R1)
SD R4, 12(R1)
Problems

• Can data forwarding prevent all data hazards?
• NO!
• The following operations will still cause a data hazard. This happens because the further down the pipeline we get, the less we can use forwarding.
LD R1, 0(R2)
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
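A minimal sketch of how an interlock-style check might spot the load-use case above. The tuple encoding of instructions and the rule it applies are illustrative assumptions, not a real MIPS64 decoder; in the classic 5-stage pipeline, a load's data is only ready after MEM, so the instruction immediately after a load cannot receive it by forwarding in time.

```python
# Detect load-use RAW hazards that forwarding cannot hide.
# Each instruction is (opcode, destination, source registers).
prog = [
    ("LD",   "R1", ["R2"]),
    ("DSUB", "R4", ["R1", "R5"]),
    ("AND",  "R6", ["R1", "R7"]),
    ("OR",   "R8", ["R1", "R9"]),
]

stalls = []
for i in range(len(prog) - 1):
    op, dest, _ = prog[i]
    nxt_op, _, nxt_srcs = prog[i + 1]
    # A load followed immediately by a consumer of its result must stall.
    if op == "LD" and dest in nxt_srcs:
        stalls.append(nxt_op)

print(stalls)  # ['DSUB']
```

Only DSUB needs the stall here: by the time AND and OR reach EX, the loaded value can be forwarded from MEM/WB.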
Problems

• We can avoid the hazard by using a pipeline interlock.
• The pipeline interlock will detect when data forwarding will not be able to get the data to the next instruction in time.
• A stall is introduced until the instruction can get the appropriate data from the previous instruction.
Control Hazards

• Control hazards are caused by branches in the code.
• During the IF stage, remember that the PC is incremented by 4 in preparation for the next IF cycle of the next instruction.
• What happens if there is a branch performed and we aren't simply incrementing the PC by 4?
• The easiest way to deal with the occurrence of a branch is to perform the IF stage again once the branch occurs.
Performing IF Twice

• We take a big performance hit by performing the instruction fetch again whenever a branch occurs. Note, this happens whether the branch is taken or not. This guarantees that the PC will get the correct value.

          IF  ID  EX  MEM  WB
branch        IF  ID  EX   MEM  WB
                  IF  IF   ID   EX  MEM  WB
Performing IF Twice

• This method will work, but as always in computer architecture we should try to make the most common operation fast and efficient.
• With MIPS64, branch instructions are quite common.
• By performing IF twice we will encounter a performance hit of between 10% and 30%.
• Next class we will look at some methods for dealing with control hazards.
Control Hazards (other
solutions)

• The following solutions assume that we are dealing with static branches, meaning that the actions taken during a branch do not change.
• We already saw the first example: we stall the pipeline until the branch is resolved (in our case we repeated the IF stage until the branch resolved and modified the PC).
• The next two examples will always make an assumption about the branch instruction.
Control Hazards (other
solutions)

• What if we treat every branch as "not taken"? Remember that not only do we read the registers during ID, but we also perform an equality test to decide whether or not to branch.
• We can improve performance by assuming that the branch will not be taken.
• In this case we can simply load in the next instruction (PC+4) and continue. The complexity arises when the branch evaluates and we end up needing to actually take the branch.
Control Hazards (other
solutions)

• If the branch is actually taken, we need to clear the pipeline of any code loaded in from the "not-taken" path.
• Likewise, we can assume that the branch is always taken. Does this work in our "5-stage" pipeline?
• No, the branch target is computed during the ID cycle. Some processors will have the target address computed in time for the IF stage of the next instruction so there is no delay.
Control Hazards (other
solutions)
• The "branch-not-taken" scheme is the same as performing the IF stage a second time in our 5-stage pipeline if the branch is taken.
• If not, there is no performance degradation.
• The "branch-taken" scheme is of no benefit in our case because we evaluate the branch target address in the ID stage.
• The fourth method for dealing with a control hazard is to implement a "delayed" branch scheme.
• In this scheme an instruction is inserted into the pipeline that is useful and not dependent on whether the branch is taken or not. It is the job of the compiler to determine the delayed branch instruction.
How To Implement a Pipeline

• From page A-29 in the text we will now look at the datapath implementation of a 5-stage pipeline.
Multi-clock Operations

• Sometimes operations require more than one clock cycle to complete. Examples are:
• Floating Point Multiply
• Floating Point Divide
• Floating Point Add
• We can assume that there is hardware available on the processor for performing these operations.
• Assume that the FP multiply and add units are fully pipelined, and the divide unit is un-pipelined.
Avoiding Structural Hazards

• The multiplier and the adder are fully pipelined; the divider is not pipelined at all.
• Take a look at Figure A.34 for a good example of how pipelining functions in the case of longer instruction execution. The author assumes a single floating-point register write port.
• Structural hazards are avoided in the ID stage by reserving a bit in a shift register. Incoming instructions can then check to see if they should stall.
Instruction Level Parallelism (ILP)
Chapter 3

• The reason we can implement pipelining in a
microprocessor is instruction-level parallelism:
operations that can be overlapped in execution
exhibit ILP.
• The main limit on ILP is branches. A "basic block" is a
straight-line code sequence with no branches into it
except at the entry and no branches out except at the exit.
• In MIPS, an average basic block is only 4-7
instructions long, so significant ILP must be
exploited across branches.
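As an illustration of the basic-block idea, the sketch below splits a MIPS-like instruction list into blocks: a block ends after a branch or jump, and a label (a branch target) starts a new one. The opcode set and the sample code are invented for this example.

```python
# Sketch: partition a linear instruction list into basic blocks.
BRANCHES = {"BEQ", "BNE", "J", "JR"}   # illustrative set of control transfers

def basic_blocks(instructions):
    blocks, current = [], []
    for instr in instructions:
        if instr.endswith(":"):            # a label is a branch target: new block
            if current:
                blocks.append(current)
            current = [instr]
        else:
            current.append(instr)
            if instr.split()[0] in BRANCHES:   # a branch ends the block
                blocks.append(current)
                current = []
    if current:
        blocks.append(current)
    return blocks

code = ["loop:", "LD R1,0(R2)", "ADD R1,R1,R3", "SD R1,0(R2)",
        "ADDI R2,R2,8", "BNE R2,R4 loop", "SD R0,0(R5)"]
blocks = basic_blocks(code)
assert len(blocks) == 2                    # the loop body, then the fall-through
assert blocks[0][-1].startswith("BNE")     # loop body ends at the branch
```

Note the loop body here is only a handful of instructions, matching the 4-7 instruction average quoted above.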
Dependences and
Hazards
• Data Dependence:
– Instruction j is data dependent on instruction i if i
produces a result that j uses, or if j is data dependent
on an instruction that is in turn data dependent on i.
• Name Dependence:
– Occurs when two instructions use the same register or
memory location but there is no flow of data between
the instructions. Instruction order must still be preserved.
• Antidependence: instruction j writes a location that the
earlier instruction i reads.
• Output Dependence: two instructions write to the same
location.
Dependences and
Hazards
• Types of data hazards:
– RAW: read after write
– WAW: write after write
– WAR: write after read
• We have already seen a RAW hazard. WAW
hazards arise from output dependences.
• WAR hazards rarely occur in our simple pipeline
because registers are read early (in ID) and
written late (in WB).
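The three hazard types can be detected mechanically from the register fields of two instructions. The function below is an illustrative sketch (the name and interface are invented): given the destination and sources of an earlier instruction i and a later instruction j, it names the hazards between them.

```python
# Sketch: classify the data hazard(s) between an earlier instruction i and a
# later instruction j, given destination registers and source-register sets.

def hazards(i_dst, i_srcs, j_dst, j_srcs):
    kinds = []
    if i_dst in j_srcs:
        kinds.append("RAW")   # j reads what i writes (true data dependence)
    if j_dst == i_dst and i_dst is not None:
        kinds.append("WAW")   # both write the same register (output dependence)
    if j_dst in i_srcs:
        kinds.append("WAR")   # j writes what i reads (antidependence)
    return kinds

# DIV.D F0,F2,F4  followed by  ADD.D F6,F0,F8   -> RAW on F0
assert hazards("F0", {"F2", "F4"}, "F6", {"F0", "F8"}) == ["RAW"]
# ADD.D F6,F0,F8  followed by  SUB.D F8,F10,F14 -> WAR on F8
assert hazards("F6", {"F0", "F8"}, "F8", {"F10", "F14"}) == ["WAR"]
# ADD.D F6,F0,F8  followed by  MUL.D F6,F10,F8  -> WAW on F6
assert hazards("F6", {"F0", "F8"}, "F6", {"F10", "F8"}) == ["WAW"]
```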
Control Dependence

• Assume we have the following piece of code:

if (p1) {
    S1
}
if (p2) {
    S2
}

• S1 is control dependent on p1, and S2 is control
dependent on p2.
Control Dependence

• Control dependences have the following
properties:
– An instruction that is control dependent on a
branch cannot be moved in front of the branch, so
that the branch no longer controls it.
– An instruction that is not control dependent on a
branch cannot be moved after the branch, so that
the branch comes to control it.
Dynamic Scheduling

• The previous example we looked at was a
statically scheduled pipeline.
• Instructions are fetched and then issued. If the
user's code has a data or control dependence, it
is hidden by forwarding.
• If the dependence cannot be hidden, a stall occurs.
• Dynamic scheduling is an important technique in
which both the dataflow and the exception behavior
of the program are maintained.
Dynamic Scheduling
(continued)

• Data dependences can cause stalls in a pipeline
when instructions with "long" execution times are
followed by instructions that depend on them.
• EX: Consider this code (.D denotes floating point):

DIV.D F0,F2,F4
ADD.D F10,F0,F8
SUB.D F12,F8,F14

• ADD.D must wait for DIV.D's result in F0, and in an
in-order pipeline the independent SUB.D stalls behind it.
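The point of the example can be checked mechanically. In this sketch (register fields encoded by hand; the helper name is invented), a simple RAW test shows that SUB.D depends on neither earlier instruction, so a dynamically scheduled pipeline could let it proceed while ADD.D waits:

```python
# Each instruction as (destination, set of sources).
div_d = ("F0",  {"F2", "F4"})    # DIV.D F0,F2,F4
add_d = ("F10", {"F0", "F8"})    # ADD.D F10,F0,F8
sub_d = ("F12", {"F8", "F14"})   # SUB.D F12,F8,F14

def depends_on(later, earlier):
    """RAW check: does the later instruction read the earlier one's result?"""
    return earlier[0] in later[1]

assert depends_on(add_d, div_d)        # ADD.D must wait for the divide
assert not depends_on(sub_d, div_d)    # SUB.D is independent of DIV.D...
assert not depends_on(sub_d, add_d)    # ...and of ADD.D: it could run early
```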
Dynamic Scheduling
(continued)

• The longer execution times of certain floating-point
operations open the possibility of WAW and WAR
hazards. EX:

DIV.D F0, F2, F4
ADD.D F6, F0, F8
SUB.D F8, F10, F14
MUL.D F6, F10, F8

• Here SUB.D has a WAR hazard on F8 with ADD.D, and
MUL.D has a WAW hazard on F6 with ADD.D.
Dynamic Scheduling
(continued)

• If we want hardware to execute instructions out of
order (when they are not dependent, etc.), we need
to modify the ID stage of our 5-stage pipeline.
• Split ID into the following stages:
– Issue: decode instructions, check for structural hazards.
– Read Operands: wait until no data hazards remain, then
read the operands.
• IF still precedes ID and will store the instruction
into a register or queue.
Still More Dynamic Scheduling
• Tomasulo's Algorithm was invented by Robert
Tomasulo and was used in the IBM 360/91.
• The algorithm avoids RAW hazards by executing an
instruction only when its operands are available.
WAR and WAW hazards are avoided by register renaming.

DIV.D F0,F2,F4          DIV.D F0,F2,F4
ADD.D F6,F0,F8          ADD.D Temp,F0,F8
S.D F6,0(R1)            S.D Temp,0(R1)
SUB.D F8,F10,F14        SUB.D Temp2,F10,F14
MUL.D F6,F10,F8         MUL.D F6,F10,Temp2
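The renaming transformation shown above can be sketched as a simple pass: each write gets a fresh name, and each source is read through the current mapping. This is an illustrative model of renaming in general, not of Tomasulo's reservation-station hardware; the function and temporary names are invented.

```python
# Sketch: rename destinations so only true (RAW) dependences remain.

def rename(instrs):
    mapping, out = {}, []
    fresh = (f"T{n}" for n in range(1000))   # unbounded supply of new names
    for op, dst, srcs in instrs:
        srcs = [mapping.get(s, s) for s in srcs]   # read through current map
        new = next(fresh)                          # fresh name per write
        mapping[dst] = new
        out.append((op, new, srcs))
    return out

code = [("DIV.D", "F0", ["F2", "F4"]),
        ("ADD.D", "F6", ["F0", "F8"]),
        ("SUB.D", "F8", ["F10", "F14"]),
        ("MUL.D", "F6", ["F10", "F8"])]
renamed = rename(code)
# The two writes to F6 now target different names: the WAW hazard is gone.
assert renamed[1][1] != renamed[3][1]
# MUL.D reads SUB.D's new name instead of F8: the WAR hazard is gone too.
assert renamed[3][2] == ["F10", renamed[2][1]]
```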
Branch Prediction In
Hardware
• While data hazards can be overcome by dynamic
hardware scheduling, control hazards must also be
addressed.
• Branch prediction is extremely useful for repetitive
branches, such as loops.
• A simple branch predictor can be implemented using
a small memory indexed by the low-order bits of
the address of the branch instruction.
• Each entry needs to hold only one bit,
recording whether the branch was last taken or not.
Branch Prediction In
Hardware
• If the branch is taken, the bit is set to 1. The next
time the branch instruction is fetched, we will know
that the branch occurred last time and can predict
that it will be taken again.
• This scheme adds some "history" to our
previous discussion of "branch taken" and
"branch not taken" control hazard avoidance.
• This single-bit method will fail at least 20% of
the time. Why?
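A minimal sketch (hypothetical code, not a hardware description) answers the question: for a loop branch that is taken nine times and then falls through, a 1-bit predictor mispredicts twice per pass through the loop, once on the exit and once again on re-entry, i.e. 2 out of 10 = 20%.

```python
# Sketch: count mispredictions of a single-bit predictor over a branch trace.

def one_bit_mispredicts(outcomes, state=False):
    misses = 0
    for taken in outcomes:
        if state != taken:
            misses += 1
        state = taken             # remember only the most recent outcome
    return misses

loop_pass = [True] * 9 + [False]      # nine taken iterations, then fall through
assert one_bit_mispredicts(loop_pass * 2) == 4   # 2 misses per pass = 20%
```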
2-bit Prediction Scheme

• This method is more reliable than using a
single bit to record whether the branch
was recently taken or not.
• A 2-bit predictor allows branches that
strongly favor taken (or not taken) to be
mispredicted less often than in the one-bit case.
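The standard 2-bit scheme is a saturating counter with states 0-3, predicting taken when the counter is 2 or more. The sketch below (names are illustrative) shows the benefit on the same 90%-taken loop branch: the single not-taken outcome moves the counter only one step, so the next iteration is still predicted taken.

```python
# Sketch: 2-bit saturating-counter predictor (0-3; predict taken when >= 2).

def two_bit_mispredicts(outcomes, state=3):      # start in "strongly taken"
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        state = min(3, state + 1) if taken else max(0, state - 1)
    return misses

loop_pass = [True] * 9 + [False]
# Only the loop-exit branch is mispredicted: 1 miss per pass instead of 2.
assert two_bit_mispredicts(loop_pass * 2) == 2
```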
Branch Predictors

• Increasing the size of a branch-predictor memory
improves its effectiveness only up to a point.
• We also need to address the effectiveness of the
scheme itself: simply adding more bits per
predictor entry does not help much either.
• Some other predictors include:
– Correlating Predictors
– Tournament Predictors
Branch Predictors

• Correlating predictors use the history of the
local branch AND some global information on
how recent branches have resolved to decide
whether the branch will be taken or not.
• Tournament predictors are even more
sophisticated: they run multiple predictors,
local and global, and use a selector to choose
between them to improve accuracy.
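The correlating idea can be sketched with a gshare-style predictor (an assumption for illustration; the slides do not name a specific scheme): a global history register of recent outcomes is XORed with the branch address to index a table of 2-bit counters, so global context disambiguates a branch's behavior.

```python
# Sketch: gshare-style correlating predictor.

def gshare_mispredicts(trace, hist_bits=4):
    table = {}                         # index -> 2-bit saturating counter
    history, misses = 0, 0
    mask = (1 << hist_bits) - 1
    for pc, taken in trace:
        idx = (pc ^ history) & mask    # fold global history into the index
        ctr = table.get(idx, 2)        # default: weakly taken
        if (ctr >= 2) != taken:
            misses += 1
        table[idx] = min(3, ctr + 1) if taken else max(0, ctr - 1)
        history = ((history << 1) | int(taken)) & mask
    return misses

# A branch that strictly alternates taken/not-taken defeats a plain 2-bit
# counter, but the history pattern identifies each case after warm-up:
trace = [(0x40, i % 2 == 0) for i in range(64)]
assert gshare_mispredicts(trace) == gshare_mispredicts(trace[:32])
```

The assertion shows the second half of the trace runs with no new mispredictions at all.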
SUPERSCALAR ARCHITECTURE

Definition and Characteristics

• Superscalar processing is the ability to initiate
multiple instructions during the same clock
cycle.
• A typical superscalar processor fetches and
decodes the incoming instruction stream
several instructions at a time.
• Superscalar architectures exploit the potential
of ILP (Instruction Level Parallelism).

Fetching and dispatching two instructions per cycle
Uninterrupted stream of instructions

• The outcomes of conditional branch instructions
are usually predicted in advance to ensure an
uninterrupted stream of instructions.
• Instructions are initiated for execution in parallel
based on the availability of operand data, rather
than their original program sequence. This is
referred to as dynamic instruction scheduling.
• Upon completion, instruction results are
re-sequenced into the original program order.
Superscalar Execution Example

• With register renaming for WAR
and WAW dependencies.

Register Renaming Example

A WAR dependency exists between the LD r7,(r3) and SUB r3,r12,r11 instructions.

With register renaming, the first write to r3 maps to hw3, while the second write
maps to hw20. This converts a four-instruction dependency chain into two
two-instruction chains, which can then be executed in parallel if the processor
allows out-of-order execution.

Hardware Organization of a superscalar processor

CONCLUSION

• Superscalar execution thereby allows higher CPU throughput
than would otherwise be possible at the same clock rate.
• Virtually all general-purpose CPUs developed since about 1998
are superscalar.
• The major problem in executing multiple instructions from a
scalar program is the handling of data dependencies. If data
dependencies are not handled effectively, it is difficult to
achieve an execution rate of more than one instruction per
clock cycle.

