
Unit III - Basic Processing Unit

Slide Courtesy : “Computer Organization” by Carl Hamacher, Zvonko Vranesic, Safwat Zaky
Overview

• Instruction Set Processor (ISP)
• Central Processing Unit (CPU)
• A typical computing task consists of a series of steps specified by a sequence of machine instructions that constitute a program.
• An instruction is executed by carrying out a sequence of more rudimentary operations.
Some Fundamental Concepts
Fundamental Concepts

• The processor fetches one instruction at a time and performs the operation specified.
• Instructions are fetched from successive memory locations until a branch or a jump instruction is encountered.
• The processor keeps track of the address of the memory location containing the next instruction to be fetched using the Program Counter (PC).
• The Instruction Register (IR) holds the instruction currently being executed.
Executing an Instruction

• Fetch the contents of the memory location pointed to by the PC. The contents of this location are loaded into the IR (fetch phase).
IR ← [[PC]]
• Assuming that the memory is byte addressable, increment the contents of the PC by 4 (fetch phase).
PC ← [PC] + 4
• Carry out the actions specified by the instruction in the IR (execution phase).
Processor Organization

[Figure 7.1 shows the single-bus organization of the datapath inside a processor: PC, MAR, MDR, IR, the register file R0 through R(n-1), temporary registers Y, Z, and TEMP, a MUX selecting between Y and the constant 4, and the ALU, all connected to a single internal processor bus. The instruction decoder and control logic issue the control signals; MAR drives the address lines and MDR connects to the data lines of the memory bus. Note that MDR has two inputs and two outputs.]

Textbook Page 413

Figure 7.1. Single-bus organization of the datapath inside a processor.

Executing an Instruction

• Transfer a word of data from one processor register to another or to the ALU.
• Perform an arithmetic or a logic operation and store the result in a processor register.
• Fetch the contents of a given memory location and load them into a processor register.
• Store a word of data from a processor register into a given memory location.
Register Transfers

[Figure 7.2 shows the input and output gating for the registers of Figure 7.1: each register Ri has gating signals Riin and Riout; register Y has Yin; the MUX selects between Y and the constant 4 for ALU input A; register Z has Zin and Zout.]

Figure 7.2. Input and output gating for the registers in Figure 7.1.
Register Transfers

• All operations and data transfers are controlled by the processor clock.

[Figure 7.3 shows one register bit: a D flip-flop clocked by the processor clock, with Riin gating the bus value into the flip-flop and Riout gating the stored bit back onto the bus.]

Figure 7.3. Input and output gating for one register bit.
Performing an Arithmetic or
Logic Operation

• The ALU is a combinational circuit that has no internal storage.
• The ALU gets its two operands from the MUX and the bus. The result is temporarily stored in register Z.
• What is the sequence of operations to add the contents of register R1 to those of R2 and store the result in R3?
1. R1out, Yin
2. R2out, SelectY, Add, Zin
3. Zout, R3in
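The three control steps above can be mimicked in a few lines of Python. This is a minimal simulation sketch, not part of the textbook: the register dictionary, its initial values, and the `bus` variable are illustrative assumptions.

```python
# Sketch: the three control steps for R3 <- [R1] + [R2] on the
# single-bus datapath of Figure 7.1 (register values are made up).
regs = {"R1": 5, "R2": 7, "R3": 0, "Y": 0, "Z": 0}

# Step 1: R1out, Yin -- R1 drives the bus, Y latches it
bus = regs["R1"]
regs["Y"] = bus

# Step 2: R2out, SelectY, Add, Zin -- ALU adds Y (via the MUX) and the bus
bus = regs["R2"]
regs["Z"] = regs["Y"] + bus

# Step 3: Zout, R3in -- Z drives the bus, R3 latches the result
bus = regs["Z"]
regs["R3"] = bus

print(regs["R3"])  # 12
```

The point of the three-step split is that only one value can be on the single bus per step, which is why Y and Z are needed as staging registers.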
Fetching a Word from Memory

• Load the address into MAR; issue a Read operation; the data arrive in MDR.

[Figure 7.4 shows the connections for register MDR: MDRout and MDRoutE gate its contents onto the internal processor bus and the memory-bus data lines, respectively; MDRin and MDRinE gate data in from those two sources.]

Figure 7.4. Connection and control signals for register MDR.
Fetching a Word from Memory

• The response time of each memory access varies (cache miss, memory-mapped I/O, …).
• To accommodate this, the processor waits until it receives an indication that the requested operation has been completed (Memory-Function-Completed, MFC).
• Move (R1), R2
 MAR ← [R1]
 Start a Read operation on the memory bus
 Wait for the MFC response from the memory
 Load MDR from the memory bus
 R2 ← [MDR]
Timing

[Figure 7.5 shows the timing of a memory Read operation over steps 1–3: MARin is asserted in step 1 (MAR ← [R1]); assuming MAR is always available on the address lines of the memory bus, Read/MR start the Read operation and MDRinE is enabled; the processor then waits for the MFC response from the memory; finally MDRout loads the data (R2 ← [MDR]).]

Figure 7.5. Timing of a memory Read operation.


Execution of a Complete
Instruction

• Add (R3), R1
• Fetch the instruction
• Fetch the first operand (the contents of the
memory location pointed to by R3)
• Perform the addition
• Load the result into R1
Architecture

[Figure 7.2, repeated for reference: input and output gating for the registers in Figure 7.1.]
Execution of a Complete
Instruction

Add (R3), R1

Step  Action

1  PCout, MARin, Read, Select4, Add, Zin
2  Zout, PCin, Yin, WMFC
3  MDRout, IRin
4  R3out, MARin, Read
5  R1out, Yin, WMFC
6  MDRout, SelectY, Add, Zin
7  Zout, R1in, End

Figure 7.6. Control sequence for execution of the instruction Add (R3),R1.


Execution of Branch
Instructions

• A branch instruction replaces the contents of the PC with the branch target address, which is usually obtained by adding an offset X given in the branch instruction.
• The offset X is usually the difference between the branch target address and the address immediately following the branch instruction.
• Conditional branch
Execution of Branch
Instructions

Step  Action

1  PCout, MARin, Read, Select4, Add, Zin
2  Zout, PCin, Yin, WMFC
3  MDRout, IRin
4  Offset-field-of-IRout, Add, Zin
5  Zout, PCin, End

Figure 7.7. Control sequence for an unconditional branch instruction.


Multiple-Bus Organization

[Figure 7.8 shows the three-bus organization of the datapath: buses A, B, and C; the PC with its own incrementer; a general-purpose register file; the constant 4 and a MUX feeding ALU input A; the ALU result register R driving bus C; the instruction decoder fed by IR; and MDR and MAR connecting to the data and address lines of the memory bus.]

Figure 7.8. Three-bus organization of the datapath.


Multiple-Bus Organization

• Add R4, R5, R6

Step  Action

1  PCout, R=B, MARin, Read, IncPC
2  WMFC
3  MDRoutB, R=B, IRin
4  R4outA, R5outB, SelectA, Add, R6in, End

Figure 7.9. Control sequence for the instruction Add R4,R5,R6,
for the three-bus organization in Figure 7.8.
Quiz

• What is the control sequence for execution of the instruction

Add R1, R2

including the instruction fetch phase? (Assume the single-bus architecture of Figure 7.1.)

Hardwired Control
Overview

• To execute instructions, the processor must have some means of generating the control signals needed in the proper sequence.
• Two categories: hardwired control and microprogrammed control.
• A hardwired system can operate at high speed, but offers little flexibility.
Control Unit Organization

[Figure 7.10 shows the control unit: the clock (CLK) drives a control step counter; the counter, IR, external inputs, and condition codes feed a decoder/encoder block that produces the control signals.]

Figure 7.10. Control unit organization.


Detailed Block Description

[Figure 7.11 separates the decoding and encoding functions: the clock drives the control step counter (with Reset); a step decoder produces the timing signals T1, T2, …, Tn; an instruction decoder produces INS1 … INSm from IR; the encoder combines these with the external inputs and condition codes to generate the control signals, including Run and End.]

Figure 7.11. Separation of the decoding and encoding functions.


Generating Zin

• Zin = T1 + T6 • ADD + T4 • BR + …

[Figure 7.12 shows the logic: an AND of T6 with Add, an AND of T4 with Branch, and T1 are ORed together to produce Zin.]

Figure 7.12. Generation of the Zin control signal for the processor in Figure 7.1.
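The same AND/OR structure can be expressed as a boolean function. This is a hedged sketch of the idea only: the integer step numbers and opcode strings are assumptions for illustration, not how the hardware is specified.

```python
# Sketch of the hardwired encoder logic Zin = T1 + T6.ADD + T4.BR.
# Step numbers and opcode names are illustrative assumptions.
def zin(step, opcode):
    """Return True when the Zin control signal should be asserted."""
    return (step == 1                          # fetch: PC + 4 latched into Z
            or (step == 6 and opcode == "ADD") # Add result latched into Z
            or (step == 4 and opcode == "BR")) # branch target latched into Z

print(zin(1, "ADD"), zin(6, "ADD"), zin(4, "BR"), zin(5, "ADD"))
```

In the real circuit this function is just the OR gate of Figure 7.12; each `and` term corresponds to one AND gate in the encoder.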
Generating End

• End = T7 • ADD + T5 • BR + (T5 • N + T4 • N̄) • BRN + …

[Figure 7.13 shows the corresponding logic combining T7 with Add, T5 with Branch, and T4/T5 with the N condition code for Branch<0 to produce End.]

Figure 7.13. Generation of the End control signal.


A Complete Processor

[Figure 7.14 shows the block diagram of a complete processor: an instruction unit, an integer unit, and a floating-point unit, backed by separate instruction and data caches, connected through a bus interface to the system bus, which also serves the main memory and input/output.]

Figure 7.14. Block diagram of a complete processor.


Microprogrammed Control
Overview

• Control signals are generated by a program similar to machine language programs.
• Control Word (CW); microroutine; microinstruction

Signal order (one bit per column): PCin, PCout, MARin, Read, MDRout, IRin, Yin, Select, Add, Zin, Zout, R1out, R1in, R3out, WMFC, End

Micro-
instruction
1   0 1 1 1 0 0 0 1 1 1 0 0 0 0 0 0
2   1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0
3   0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
4   0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0
5   0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0
6   0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0
7   0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1

Figure 7.15. An example of microinstructions for Figure 7.6.


Overview

Step  Action

1  PCout, MARin, Read, Select4, Add, Zin
2  Zout, PCin, Yin, WMFC
3  MDRout, IRin
4  R3out, MARin, Read
5  R1out, Yin, WMFC
6  MDRout, SelectY, Add, Zin
7  Zout, R1in, End

Figure 7.6. Control sequence for execution of the instruction Add (R3),R1.
Overview

• Control store

[Figure 7.16 shows the basic organization of a microprogrammed control unit: IR feeds a starting address generator, which loads the μPC; the clock advances the μPC, which addresses the control store; the control store emits one control word (CW) per step. One function cannot be carried out by this simple organization.]

Figure 7.16. Basic organization of a microprogrammed control unit.


Overview

• The previous organization cannot handle the situation when the control unit is required to check the status of the condition codes or external inputs to choose between alternative courses of action.
• Use conditional branch microinstructions.

Address  Microinstruction

0   PCout, MARin, Read, Select4, Add, Zin
1   Zout, PCin, Yin, WMFC
2   MDRout, IRin
3   Branch to starting address of appropriate microroutine
    ..........................................................
25  If N=0, then branch to microinstruction 0
26  Offset-field-of-IRout, SelectY, Add, Zin
27  Zout, PCin, End

Figure 7.17. Microroutine for the instruction Branch<0.


Overview

[Figure 7.18 extends Figure 7.16 to allow conditional branching: the starting and branch address generator now receives the external inputs and condition codes in addition to IR, and loads the μPC accordingly; the control store then emits the CW.]

Figure 7.18. Organization of the control unit to allow
conditional branching in the microprogram.
Microinstructions

• A straightforward way to structure microinstructions is to assign one bit position to each control signal.
• However, this is very inefficient.
• The length can be reduced: most signals are not needed simultaneously, and many signals are mutually exclusive.
• All mutually exclusive signals are placed in the same group and encoded in binary.
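The saving from binary coding is easy to quantify. A quick sketch (the helper name and the example signal count are illustrative, not from the slides):

```python
import math

# A group of n mutually exclusive signals needs ceil(log2(n + 1)) bits
# when binary coded (the +1 reserves a "no action" code), versus n bits
# when each signal gets its own bit position.
def encoded_bits(n_signals):
    return math.ceil(math.log2(n_signals + 1))

# e.g. a group of 15 mutually exclusive "out" gating signals
print(encoded_bits(15))  # 4  (instead of 15 one-hot bits)
```

This is exactly the trade made by the field-encoded format shown next: each field is a binary-coded group of mutually exclusive signals.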
Partial Format for the
Microinstructions

Microinstruction fields: F1 F2 F3 F4 F5 F6 F7 F8

F1 (4 bits): 0000: No transfer, 0001: PCout, 0010: MDRout, 0011: Zout, 0100: R0out, 0101: R1out, 0110: R2out, 0111: R3out, 1010: TEMPout, 1011: Offsetout
F2 (3 bits): 000: No transfer, 001: PCin, 010: IRin, 011: Zin, 100: R0in, 101: R1in, 110: R2in, 111: R3in
F3 (3 bits): 000: No transfer, 001: MARin, 010: MDRin, 011: TEMPin, 100: Yin
F4 (4 bits): 0000: Add, 0001: Sub, …, 1111: XOR (16 ALU functions)
F5 (2 bits): 00: No action, 01: Read, 10: Write
F6 (1 bit): 0: SelectY, 1: Select4
F7 (1 bit): 0: No action, 1: WMFC
F8 (1 bit): 0: Continue, 1: End

What is the price paid for this scheme?

Figure 7.19. An example of a partial format for field-encoded microinstructions.

Further Improvement

• Enumerate the patterns of required signals in all possible microinstructions. Each meaningful combination of active control signals can then be assigned a distinct code.
• Vertical organization
• Horizontal organization
Microprogram Sequencing

• If all microprograms required only straightforward sequential execution of microinstructions except for branches, letting a μPC govern the sequencing would be efficient.
• However, there are two disadvantages:
 Having a separate microroutine for each machine instruction results in a large total number of microinstructions and a large control store.
 Longer execution time, because it takes more time to carry out the required branches.
• Example: Add src, Rdst
• Four addressing modes: register, autoincrement, autodecrement, and indexed (with indirect forms).
- Bit-ORing
- Wide-branch addressing
- WMFC
Contents of IR:  OP code | Mode = 010 | Rsrc | Rdst
                 (bit 11 and above; bits 10–8; bits 7–4; bits 3–0)

Address  Microinstruction
(octal)

000  PCout, MARin, Read, Select4, Add, Zin
001  Zout, PCin, Yin, WMFC
002  MDRout, IRin
003  μBranch {μPC ← 101 (from Instruction decoder);
     μPC5,4 ← [IR10,9]; μPC3 ← [IR10]·[IR9]·[IR8]}
121  Rsrcout, MARin, Read, Select4, Add, Zin
122  Zout, Rsrcin
123  μBranch {μPC ← 170; μPC0 ← [IR8]}, WMFC
170  MDRout, MARin, Read, WMFC
171  MDRout, Yin
172  Rdstout, SelectY, Add, Zin
173  Zout, Rdstin, End

Figure 7.21. Microroutine for Add (Rsrc)+,Rdst.

Note: The microinstruction at location 170 is not executed for this addressing mode.
Microinstructions with Next-
Address Field
• The microprogram we discussed requires several branch microinstructions, which perform no useful operation in the datapath.
• A powerful alternative approach is to include an address field as part of every microinstruction to indicate the location of the next microinstruction to be fetched.
• Pros: separate branch microinstructions are virtually eliminated; there are few limitations in assigning addresses to microinstructions.
• Cons: additional bits for the address field (around 1/6 of the microinstruction length).
Microinstructions with Next-
Address Field

[Figure 7.22 shows the microinstruction-sequencing organization: IR, the external inputs, and the condition codes feed the decoding circuits, which load the microinstruction address register μAR; μAR addresses the control store; each microinstruction's next-address field, after decoding, supplies the next value of μAR along with the control signals.]

Figure 7.22. Microinstruction-sequencing organization.


Microinstruction

Fields: F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 F10

F0 (8 bits): address of the next microinstruction
F1 (3 bits): 000: No transfer, 001: PCout, 010: MDRout, 011: Zout, 100: Rsrcout, 101: Rdstout, 110: TEMPout
F2 (3 bits): 000: No transfer, 001: PCin, 010: IRin, 011: Zin, 100: Rsrcin, 101: Rdstin
F3 (3 bits): 000: No transfer, 001: MARin, 010: MDRin, 011: TEMPin, 100: Yin
F4 (4 bits): 0000: Add, 0001: Sub, …, 1111: XOR
F5 (2 bits): 00: No action, 01: Read, 10: Write
F6 (1 bit): 0: SelectY, 1: Select4
F7 (1 bit): 0: No action, 1: WMFC
F8 (1 bit): 0: NextAdrs, 1: InstDec
F9 (1 bit): 0: No action, 1: ORmode
F10 (1 bit): 0: No action, 1: ORindsrc

Figure 7.23. Format for microinstructions in the example of Section 7.5.3.


Implementation of the
Microroutine

[Figure 7.24 lists, for each octal address of the microroutine of Figure 7.21 (000–003, 121–122, 170–173), the bit patterns of fields F0–F10 encoded as in Figure 7.23; the next-address field F0 chains the microinstructions, and F8–F10 invoke the instruction decoder and the bit-ORing of the mode bits.]

Figure 7.24. Implementation of the microroutine of Figure 7.21 using a
next-microinstruction address field. (See Figure 7.23 for encoded signals.)
[Figure 7.25 shows some details of the control-signal-generating circuitry: decoders driven by the Rsrc and Rdst fields of IR produce the register gating signals R0in/R0out through R15in/R15out; the decoding circuits combine the external inputs, condition codes, InstDecout, ORmode, and ORindsrc to form the next address in μAR; the microinstruction decoder combines fields F1, F2, F8, F9, and F10 from the control store with Rsrcin/Rsrcout and Rdstin/Rdstout to produce the control signals. This is where bit-ORing is implemented.]

Figure 7.25. Some details of the control-signal-generating circuitry.
Pipelining
What Is A Pipeline?

• Pipelining is used by virtually all modern microprocessors to enhance performance by overlapping the execution of instructions.
• A common analogy for a pipeline is a factory assembly line. Assume that there are three stages:
1. Welding
2. Painting
3. Polishing
• For simplicity, assume that each task takes one hour.
What Is A Pipeline?

• If a single person were to work on the product, it would take three hours to produce one product.
• If we had three people, one person could work on each stage; upon completing their stage, they could pass their product on to the next person (since each stage takes one hour, there will be no waiting).
• We could then produce one product per hour, assuming the assembly line has been filled.
Characteristics Of Pipelining

• If the stages of a pipeline are not balanced and one stage is slower than another, the throughput of the entire pipeline is affected.
• In terms of a pipeline within a CPU, each instruction is broken up into different stages. Ideally, if each stage is balanced (all stages are ready to start at the same time and take an equal amount of time to execute), the time taken per instruction (pipelined) is:

Time per instruction (unpipelined) / Number of stages
Characteristics Of Pipelining

• The previous expression is ideal. We will see later that there are many ways in which a pipeline cannot function in a perfectly balanced fashion.
• In terms of a CPU, the implementation of pipelining has the effect of reducing the average instruction time, therefore reducing the average CPI.
• EX: If each instruction in a microprocessor takes 5 clock cycles (unpipelined) and we have a 4-stage pipeline, the ideal average CPI with the pipeline will be 1.25.
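The example can be checked directly (values taken from the slide itself):

```python
# Ideal pipelined time per instruction = unpipelined time / number of stages.
unpipelined_cpi = 5   # clock cycles per instruction, unpipelined
stages = 4            # pipeline depth
ideal_cpi = unpipelined_cpi / stages
print(ideal_cpi)  # 1.25
```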
RISC Instruction Set Basics
(from Hennessy and Patterson)

• Properties of RISC architectures:
– All operations on data apply to data in registers and typically change the entire register (32 bits or 64 bits).
– The only operations that affect memory are load/store operations: memory to register, and register to memory.
– Load and store operations on data less than the full size of a register (32, 16, 8 bits) are often available.
– Usually instructions are few in number (this can be relative) and are typically one size.
RISC Instruction Set Basics
Types Of Instructions
• ALU Instructions:
• Arithmetic operations either take two registers as operands or take one register and a sign-extended immediate value as an operand. The result is stored in a third register.
• Logical operations AND, OR, XOR do not usually differentiate between 32-bit and 64-bit.
• Load/Store Instructions:
• Usually take a register (base register) as an operand and a 16-bit immediate value. The sum of the two creates the effective address. A second register acts as the destination in the case of a load operation.
RISC Instruction Set Basics
Types Of Instructions (continued)

• In the case of a store operation, the second register contains the data to be stored.
• Branches and Jumps
• Conditional branches are transfers of control. As described before, a branch causes an immediate value to be added to the current program counter.
• Appendix A has a more detailed description of the RISC instruction set. Also, the inside back cover has a listing of a subset of the MIPS64 instruction set.
RISC Instruction Set
Implementation
• We first need to look at how instructions in the MIPS64 instruction set
are implemented without pipelining. We’ll assume that any
instruction of the subset of MIPS64 can be executed in at most 5 clock
cycles.
• The five clock cycles will be broken up into the following steps:
• Instruction Fetch Cycle
• Instruction Decode/Register Fetch Cycle
• Execution Cycle
• Memory Access Cycle
• Write-Back Cycle
Instruction Fetch (IF) Cycle

• The value in the PC represents an address in memory. The MIPS64 instructions are all 32 bits in length. Figure 2.27 shows how the 32 bits (4 bytes) are arranged depending on the instruction.
• First we load the 4 bytes in memory into the CPU.
• Second we increment the PC by 4 because memory addresses are arranged in byte ordering. This will now represent the next instruction. (Is this certain???)
Instruction Decode
(ID)/Register Fetch Cycle
• Decode the instruction and at the same time read in the values of the registers involved. As the registers are being read, perform an equality test in case the instruction decodes as a branch or jump.
• The offset field of the instruction is sign-extended in case it is needed. The possible branch effective address is computed by adding the sign-extended offset to the incremented PC. The branch can be completed at this stage if the equality test is true and the instruction decoded as a branch.
Instruction Decode (ID)/Register
Fetch Cycle (continued)

• The instruction can be decoded in parallel with reading the registers because the register addresses are at fixed locations.
Execution (EX)/Effective
Address Cycle

• If a branch or jump did not occur in the previous cycle, the arithmetic logic unit (ALU) can execute the instruction.
• At this point the instruction falls into one of three types:
• Memory Reference: the ALU adds the base register and the offset to form the effective address.
• Register-Register: the ALU performs the arithmetic, logical, etc. operation as per the opcode.
• Register-Immediate: the ALU performs an operation based on the register and the sign-extended immediate value.
Memory Access (MEM) Cycle

• If a load, the effective address computed from


the previous cycle is referenced and the
memory is read. The actual data transfer to
the register does not occur until the next
cycle.
• If a store, the data from the register is written
to the effective address in memory.
Write-Back (WB) Cycle

• Occurs with Register-Register ALU instructions or load instructions.
• A simple operation: whether it is a register-register operation or a memory load operation, the resulting data is written to the appropriate register.
Looking At The Big Picture

• Overall, the most time that a non-pipelined instruction can take is 5 clock cycles. Below is a summary:
• Branch - 2 clock cycles
• Store - 4 clock cycles
• Other - 5 clock cycles
• EX: Assuming branch instructions account for 12% of all instructions and stores account for 10%, what is the average CPI of a non-pipelined CPU?
ANS: 0.12*2 + 0.10*4 + 0.78*5 = 4.54
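The answer is a frequency-weighted average, which a short script makes explicit (the instruction mix is taken from the slide; the dictionary encoding is an illustrative choice):

```python
# Average CPI = sum over instruction classes of (fraction * cycles).
mix = {"branch": (0.12, 2), "store": (0.10, 4), "other": (0.78, 5)}
cpi = sum(frac * cycles for frac, cycles in mix.values())
print(round(cpi, 2))  # 4.54
```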
The Classical RISC 5 Stage
Pipeline

• In an ideal case, to implement a pipeline we just need to start a new instruction at each clock cycle.
• Unfortunately there are many problems with trying to implement this. Obviously we cannot have the ALU performing an ADD operation and a MULTIPLY at the same time. But if we look at each stage of instruction execution as being independent, we can see how instructions can be "overlapped".
ENGR9861 Winter 2007 RV
Problems With The Previous
Figure
• The memory is accessed twice during each clock cycle. This problem is avoided by using separate data and instruction caches.
• It is important to note that if the clock period is the same for a pipelined processor and a non-pipelined processor, the memory must work five times faster.
• Another problem that we can observe is that the registers are accessed twice every clock cycle. To try to avoid a resource conflict, we perform the register write in the first half of the cycle and the read in the second half of the cycle.
Problems With The Previous
Figure (continued)
• We write in the first half so that a value written by one instruction can be read in the same cycle by another instruction further down the pipeline.
• A third problem arises with the interaction of the pipeline with the PC. We use an adder to increment the PC by the end of IF. Within ID we may branch and modify the PC. How does this affect the pipeline?
• The use of pipeline registers gives the CPU the storage needed to implement the pipeline. Remember that the previous figure has only one resource in use in each stage.
Pipeline Hazards

• The performance gain from using pipelining occurs because we can start the execution of a new instruction each clock cycle. In a real implementation this is not always possible.
• Another important note is that in a pipelined processor, a particular instruction still takes at least as long to execute as it would non-pipelined.
• Pipeline hazards prevent the execution of the next instruction during the appropriate clock cycle.
Types Of Hazards

• There are three types of hazards in a pipeline; they are as follows:
• Structural Hazards: created when the datapath hardware in the pipeline cannot support all of the overlapped instructions in the pipeline.
• Data Hazards: occur when an instruction in the pipeline affects the result of another instruction in the pipeline.
• Control Hazards: caused by the pipelining of branches and other instructions that change the PC.
A Hazard Will Cause A Pipeline
Stall

• Some performance expressions involving a realistic pipeline, in terms of CPI. It is assumed that the clock period is the same for pipelined and unpipelined implementations.

Speedup = CPI unpipelined / CPI pipelined
        = Pipeline depth / (1 + Stalls per instruction)
        = Avg instruction time unpipelined / Avg instruction time pipelined
A Hazard Will Cause A Pipeline
Stall (continued)

• We can look at pipeline performance in terms of a faster clock cycle time as well:

Speedup = (CPI unpipelined / CPI pipelined) x (Clock cycle time unpipelined / Clock cycle time pipelined)

Clock cycle time pipelined = Clock cycle time unpipelined / Pipeline depth

Speedup = (1 / (1 + Pipeline stalls per instruction)) x Pipeline depth
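The last relation can be sketched as a tiny helper. The depth and stall rates below are illustrative assumptions, not figures from the slides:

```python
# Speedup = Pipeline depth / (1 + stalls per instruction).
def speedup(depth, stalls_per_ins):
    return depth / (1 + stalls_per_ins)

print(speedup(5, 0.0))   # 5.0 -- ideal: speedup equals pipeline depth
print(speedup(5, 0.25))  # 4.0 -- one stall per four instructions erodes it
```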
Dealing With Structural Hazards

• Structural hazards result from the CPU datapath not having the resources to service all the required overlapping stages.
• Suppose a processor can only read or write the registers once in one clock cycle. This would cause a problem during the ID and WB stages.
• Assume that there are no separate instruction and data caches, and only one memory access can occur during one clock cycle. A hazard would be caused during the IF and MEM cycles.
Dealing With Structural Hazards

• A structural hazard is dealt with by inserting a stall or pipeline bubble into the pipeline. This means that for that clock cycle, nothing happens for that instruction. This effectively "slides" that instruction, and subsequent instructions, by one clock cycle.
• This effectively increases the average CPI.
• EX: Assume that you need to compare two processors, one with a structural hazard that occurs 40% of the time, causing a stall. Assume that the processor with the hazard has a clock rate 1.05 times faster than the processor without the hazard. How fast is the processor with the hazard compared to the one without the hazard?
Dealing With Structural Hazards
(continued)

Speedup = (CPI no haz / CPI haz) x (Clock cycle time no haz / Clock cycle time haz)

Speedup = (1 / (1 + 0.4*1)) x (1 / (1/1.05))

        = 0.75
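The slide's numbers can be reproduced directly (a sketch using the values stated above):

```python
# Comparing the processor with the structural hazard to the one without.
cpi_no_haz = 1.0
cpi_haz = 1.0 + 0.4 * 1          # 40% of instructions stall for 1 cycle
clock_ratio = 1.05               # hazard machine's clock is 1.05x faster

speedup = (cpi_no_haz / cpi_haz) * clock_ratio
print(round(speedup, 2))  # 0.75
```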
Dealing With Structural Hazards
(continued)

• We can see that even though the clock speed of the processor with the hazard is a little faster, the speedup is still less than 1.
• Therefore the hazard has quite an effect on the performance.
• Sometimes computer architects will opt to design a processor that exhibits a structural hazard. Why?
• A: The improvement to the processor datapath is too costly.
• B: The hazard occurs rarely enough that the processor will still perform to specifications.
Data Hazards (A Programming
Problem?)

• We haven't looked at assembly programming in detail at this point.
• Consider the following operations:
DADD R1, R2, R3
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11

• What are the problems?
Data Hazard Avoidance

• In this trivial example, we cannot expect the programmer to reorder his/her operations, assuming this is the only code we want to execute.
• Data forwarding can be used to solve this problem.
• To implement data forwarding we need to bypass the pipeline register flow:
– Output from the EX/MEM and MEM/WB registers must be fed back into the ALU input.
– We need routing hardware that detects when the next instruction depends on the write of a previous instruction.
General Data Forwarding

• It is easy to see how data forwarding can be used by drawing out the pipelined execution of each instruction.
• Now consider the following instructions:

DADD R1, R2, R3
LD R4, 0(R1)
SD R4, 12(R1)
Problems

• Can data forwarding prevent all data hazards?
• NO!
• The following operations will still cause a data hazard. This happens because the further down the pipeline we get, the less we can use forwarding.
LD R1, 0(R2)
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
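A minimal sketch of how an interlock-style check might spot the load-use case above. The tuple encoding of instructions and the rule it applies are illustrative assumptions, not a real MIPS64 decoder; in the classic 5-stage pipeline, a load's data is only ready after MEM, so the instruction immediately after a load cannot receive it by forwarding in time.

```python
# Detect load-use RAW hazards that forwarding cannot hide.
# Each instruction is (opcode, destination, source registers).
prog = [
    ("LD",   "R1", ["R2"]),
    ("DSUB", "R4", ["R1", "R5"]),
    ("AND",  "R6", ["R1", "R7"]),
    ("OR",   "R8", ["R1", "R9"]),
]

stalls = []
for i in range(len(prog) - 1):
    op, dest, _ = prog[i]
    nxt_op, _, nxt_srcs = prog[i + 1]
    # A load followed immediately by a consumer of its result must stall.
    if op == "LD" and dest in nxt_srcs:
        stalls.append(nxt_op)

print(stalls)  # ['DSUB']
```

Only DSUB needs the stall here: by the time AND and OR reach EX, the loaded value can be forwarded from MEM/WB.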
Problems

• We can avoid the hazard by using a pipeline interlock.
• The pipeline interlock will detect when data forwarding will not be able to get the data to the next instruction in time.
• A stall is introduced until the instruction can get the appropriate data from the previous instruction.
Control Hazards

• Control hazards are caused by branches in the code.
• During the IF stage, remember that the PC is incremented by 4 in preparation for the next IF cycle of the next instruction.
• What happens if there is a branch performed and we aren't simply incrementing the PC by 4?
• The easiest way to deal with the occurrence of a branch is to perform the IF stage again once the branch occurs.
Performing IF Twice

• We take a big performance hit by performing the instruction fetch again whenever a branch occurs. Note, this happens whether the branch is taken or not. This guarantees that the PC will get the correct value.

          IF  ID  EX  MEM  WB
branch        IF  ID  EX   MEM  WB
                  IF  IF   ID   EX  MEM  WB
Performing IF Twice

• This method will work, but as always in computer architecture we should try to make the most common operation fast and efficient.
• With MIPS64, branch instructions are quite common.
• By performing IF twice we will encounter a performance hit of between 10% and 30%.
• Next class we will look at some methods for dealing with control hazards.
Control Hazards (other
solutions)

• The following solutions assume that we are dealing with static branches, meaning that the actions taken during a branch do not change.
• We already saw the first example: we stall the pipeline until the branch is resolved (in our case we repeated the IF stage until the branch resolved and modified the PC).
• The next two examples will always make an assumption about the branch instruction.
Control Hazards (other
solutions)

• What if we treat every branch as "not taken"? Remember that not only do we read the registers during ID, but we also perform an equality test to decide whether or not to branch.
• We can improve performance by assuming that the branch will not be taken.
• In this case we can simply load in the next instruction (PC+4) and continue. The complexity arises when the branch evaluates and we end up needing to actually take the branch.
Control Hazards (other
solutions)

• If the branch is actually taken, we need to clear the pipeline of any code loaded in from the "not-taken" path.
• Likewise, we can assume that the branch is always taken. Does this work in our "5-stage" pipeline?
• No, the branch target is computed during the ID cycle. Some processors will have the target address computed in time for the IF stage of the next instruction so there is no delay.
Control Hazards (other
solutions)
• The "branch-not-taken" scheme is the same as performing the IF stage a second time in our 5-stage pipeline if the branch is taken.
• If not, there is no performance degradation.
• The "branch-taken" scheme is of no benefit in our case because we evaluate the branch target address in the ID stage.
• The fourth method for dealing with a control hazard is to implement a "delayed" branch scheme.
• In this scheme an instruction is inserted into the pipeline that is useful and not dependent on whether the branch is taken or not. It is the job of the compiler to determine the delayed branch instruction.
How To Implement a Pipeline

• From page A-29 in the text we will now look at the datapath implementation of a 5-stage pipeline.
Multi-clock Operations

• Sometimes operations require more than one clock cycle to complete. Examples are:
• Floating Point Multiply
• Floating Point Divide
• Floating Point Add
• We can assume that there is hardware available on the processor for performing these operations.
• Assume that the FP multiply and add units are fully pipelined, and the divide unit is un-pipelined.
Avoiding Structural Hazards

• The multiplier and the adder are fully pipelined; the divider is not pipelined at all.
• Take a look at Figure A.34 for a good example of how pipelining functions in the case of longer instruction execution. The author assumes a single floating-point register write port.
• Structural hazards are avoided in the ID stage by reserving a bit in a shift register. Incoming instructions can then check to see if they should stall.
Instruction Level Parallelism (ILP)
Chapter 3

• The reason we can implement pipelining in a
microprocessor is instruction-level parallelism:
operations that can be overlapped in execution
exhibit ILP.
• The main limit on ILP is branches. A "basic block" is a
straight-line code sequence with no branches into it
except at the entry and no branches out except at the exit.
• In MIPS, an average basic block is only 4-7
instructions long, so significant ILP must be
exploited across branches.
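As an illustration of the basic-block idea, the sketch below splits a MIPS-like instruction list into blocks: a block ends after a branch or jump, and a label (a branch target) starts a new one. The opcode set and the sample code are invented for this example.

```python
# Sketch: partition a linear instruction list into basic blocks.
BRANCHES = {"BEQ", "BNE", "J", "JR"}   # illustrative set of control transfers

def basic_blocks(instructions):
    blocks, current = [], []
    for instr in instructions:
        if instr.endswith(":"):            # a label is a branch target: new block
            if current:
                blocks.append(current)
            current = [instr]
        else:
            current.append(instr)
            if instr.split()[0] in BRANCHES:   # a branch ends the block
                blocks.append(current)
                current = []
    if current:
        blocks.append(current)
    return blocks

code = ["loop:", "LD R1,0(R2)", "ADD R1,R1,R3", "SD R1,0(R2)",
        "ADDI R2,R2,8", "BNE R2,R4 loop", "SD R0,0(R5)"]
blocks = basic_blocks(code)
assert len(blocks) == 2                    # the loop body, then the fall-through
assert blocks[0][-1].startswith("BNE")     # loop body ends at the branch
```

Note the loop body here is only a handful of instructions, matching the 4-7 instruction average quoted above.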
Dependences and
Hazards
• Data Dependence:
– Instruction j is data dependent on instruction i if i
produces a result that j uses, or if j is data dependent
on an instruction that is in turn data dependent on i.
• Name Dependence:
– Occurs when two instructions use the same register or
memory location but there is no flow of data between
the instructions. Instruction order must still be preserved.
• Antidependence: instruction j writes a location that the
earlier instruction i reads.
• Output Dependence: two instructions write to the same
location.
Dependences and
Hazards
• Types of data hazards:
– RAW: read after write
– WAW: write after write
– WAR: write after read
• We have already seen a RAW hazard. WAW
hazards arise from output dependences.
• WAR hazards rarely occur in our simple pipeline
because registers are read early (in ID) and
written late (in WB).
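The three hazard types can be detected mechanically from the register fields of two instructions. The function below is an illustrative sketch (the name and interface are invented): given the destination and sources of an earlier instruction i and a later instruction j, it names the hazards between them.

```python
# Sketch: classify the data hazard(s) between an earlier instruction i and a
# later instruction j, given destination registers and source-register sets.

def hazards(i_dst, i_srcs, j_dst, j_srcs):
    kinds = []
    if i_dst in j_srcs:
        kinds.append("RAW")   # j reads what i writes (true data dependence)
    if j_dst == i_dst and i_dst is not None:
        kinds.append("WAW")   # both write the same register (output dependence)
    if j_dst in i_srcs:
        kinds.append("WAR")   # j writes what i reads (antidependence)
    return kinds

# DIV.D F0,F2,F4  followed by  ADD.D F6,F0,F8   -> RAW on F0
assert hazards("F0", {"F2", "F4"}, "F6", {"F0", "F8"}) == ["RAW"]
# ADD.D F6,F0,F8  followed by  SUB.D F8,F10,F14 -> WAR on F8
assert hazards("F6", {"F0", "F8"}, "F8", {"F10", "F14"}) == ["WAR"]
# ADD.D F6,F0,F8  followed by  MUL.D F6,F10,F8  -> WAW on F6
assert hazards("F6", {"F0", "F8"}, "F6", {"F10", "F8"}) == ["WAW"]
```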
Control Dependence

• Assume we have the following piece of code:

if (p1) {
    S1
}
if (p2) {
    S2
}

• S1 is control dependent on p1, and S2 is control
dependent on p2.
Control Dependence

• Control dependences have the following
properties:
– An instruction that is control dependent on a
branch cannot be moved in front of the branch, so
that the branch no longer controls it.
– An instruction that is not control dependent on a
branch cannot be moved after the branch, so that
the branch comes to control it.
Dynamic Scheduling

• The previous example we looked at was a
statically scheduled pipeline.
• Instructions are fetched and then issued. If the
user's code has a data or control dependence, it
is hidden by forwarding.
• If the dependence cannot be hidden, a stall occurs.
• Dynamic scheduling is an important technique in
which both the dataflow and the exception behavior
of the program are maintained.
Dynamic Scheduling
(continued)

• Data dependences can cause stalls in a pipeline
when instructions with "long" execution times are
followed by instructions that depend on them.
• EX: Consider this code (.D denotes floating point):

DIV.D F0,F2,F4
ADD.D F10,F0,F8
SUB.D F12,F8,F14

• ADD.D must wait for DIV.D's result in F0, and in an
in-order pipeline the independent SUB.D stalls behind it.
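The point of the example can be checked mechanically. In this sketch (register fields encoded by hand; the helper name is invented), a simple RAW test shows that SUB.D depends on neither earlier instruction, so a dynamically scheduled pipeline could let it proceed while ADD.D waits:

```python
# Each instruction as (destination, set of sources).
div_d = ("F0",  {"F2", "F4"})    # DIV.D F0,F2,F4
add_d = ("F10", {"F0", "F8"})    # ADD.D F10,F0,F8
sub_d = ("F12", {"F8", "F14"})   # SUB.D F12,F8,F14

def depends_on(later, earlier):
    """RAW check: does the later instruction read the earlier one's result?"""
    return earlier[0] in later[1]

assert depends_on(add_d, div_d)        # ADD.D must wait for the divide
assert not depends_on(sub_d, div_d)    # SUB.D is independent of DIV.D...
assert not depends_on(sub_d, add_d)    # ...and of ADD.D: it could run early
```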
Dynamic Scheduling
(continued)

• The longer execution times of certain floating-point
operations open the possibility of WAW and WAR
hazards. EX:

DIV.D F0, F2, F4
ADD.D F6, F0, F8
SUB.D F8, F10, F14
MUL.D F6, F10, F8

• Here SUB.D has a WAR hazard on F8 with ADD.D, and
MUL.D has a WAW hazard on F6 with ADD.D.
Dynamic Scheduling
(continued)

• If we want hardware to execute instructions out of
order (when they are not dependent, etc.), we need
to modify the ID stage of our 5-stage pipeline.
• Split ID into the following stages:
– Issue: decode instructions, check for structural hazards.
– Read Operands: wait until no data hazards remain, then
read the operands.
• IF still precedes ID and will store the instruction
into a register or queue.
Still More Dynamic Scheduling
• Tomasulo's Algorithm was invented by Robert
Tomasulo and was used in the IBM 360/91.
• The algorithm avoids RAW hazards by executing an
instruction only when its operands are available.
WAR and WAW hazards are avoided by register renaming.

DIV.D F0,F2,F4          DIV.D F0,F2,F4
ADD.D F6,F0,F8          ADD.D Temp,F0,F8
S.D F6,0(R1)            S.D Temp,0(R1)
SUB.D F8,F10,F14        SUB.D Temp2,F10,F14
MUL.D F6,F10,F8         MUL.D F6,F10,Temp2
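The renaming transformation shown above can be sketched as a simple pass: each write gets a fresh name, and each source is read through the current mapping. This is an illustrative model of renaming in general, not of Tomasulo's reservation-station hardware; the function and temporary names are invented.

```python
# Sketch: rename destinations so only true (RAW) dependences remain.

def rename(instrs):
    mapping, out = {}, []
    fresh = (f"T{n}" for n in range(1000))   # unbounded supply of new names
    for op, dst, srcs in instrs:
        srcs = [mapping.get(s, s) for s in srcs]   # read through current map
        new = next(fresh)                          # fresh name per write
        mapping[dst] = new
        out.append((op, new, srcs))
    return out

code = [("DIV.D", "F0", ["F2", "F4"]),
        ("ADD.D", "F6", ["F0", "F8"]),
        ("SUB.D", "F8", ["F10", "F14"]),
        ("MUL.D", "F6", ["F10", "F8"])]
renamed = rename(code)
# The two writes to F6 now target different names: the WAW hazard is gone.
assert renamed[1][1] != renamed[3][1]
# MUL.D reads SUB.D's new name instead of F8: the WAR hazard is gone too.
assert renamed[3][2] == ["F10", renamed[2][1]]
```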
Branch Prediction In
Hardware
• While data hazards can be overcome by dynamic
hardware scheduling, control hazards must also be
addressed.
• Branch prediction is extremely useful for repetitive
branches, such as loops.
• A simple branch predictor can be implemented using
a small memory indexed by the low-order bits of
the address of the branch instruction.
• Each entry needs to hold only one bit,
recording whether the branch was last taken or not.
Branch Prediction In
Hardware
• If the branch is taken, the bit is set to 1. The next
time the branch instruction is fetched, we will know
that the branch occurred last time and can predict
that it will be taken again.
• This scheme adds some "history" to our
previous discussion of "branch taken" and
"branch not taken" control hazard avoidance.
• This single-bit method will fail at least 20% of
the time. Why?
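A minimal sketch (hypothetical code, not a hardware description) answers the question: for a loop branch that is taken nine times and then falls through, a 1-bit predictor mispredicts twice per pass through the loop, once on the exit and once again on re-entry, i.e. 2 out of 10 = 20%.

```python
# Sketch: count mispredictions of a single-bit predictor over a branch trace.

def one_bit_mispredicts(outcomes, state=False):
    misses = 0
    for taken in outcomes:
        if state != taken:
            misses += 1
        state = taken             # remember only the most recent outcome
    return misses

loop_pass = [True] * 9 + [False]      # nine taken iterations, then fall through
assert one_bit_mispredicts(loop_pass * 2) == 4   # 2 misses per pass = 20%
```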
2-bit Prediction Scheme

• This method is more reliable than using a
single bit to record whether the branch
was recently taken or not.
• A 2-bit predictor allows branches that
strongly favor taken (or not taken) to be
mispredicted less often than in the one-bit case.
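The standard 2-bit scheme is a saturating counter with states 0-3, predicting taken when the counter is 2 or more. The sketch below (names are illustrative) shows the benefit on the same 90%-taken loop branch: the single not-taken outcome moves the counter only one step, so the next iteration is still predicted taken.

```python
# Sketch: 2-bit saturating-counter predictor (0-3; predict taken when >= 2).

def two_bit_mispredicts(outcomes, state=3):      # start in "strongly taken"
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        state = min(3, state + 1) if taken else max(0, state - 1)
    return misses

loop_pass = [True] * 9 + [False]
# Only the loop-exit branch is mispredicted: 1 miss per pass instead of 2.
assert two_bit_mispredicts(loop_pass * 2) == 2
```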
Branch Predictors

• Increasing the size of a branch-predictor memory
improves its effectiveness only up to a point.
• We also need to address the effectiveness of the
scheme itself: simply adding more bits per
predictor entry does not help much either.
• Some other predictors include:
– Correlating Predictors
– Tournament Predictors
Branch Predictors

• Correlating predictors use the history of the
local branch AND some global information on
how recent branches have resolved to decide
whether the branch will be taken or not.
• Tournament predictors are even more
sophisticated: they run multiple predictors,
local and global, and use a selector to choose
between them to improve accuracy.
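The correlating idea can be sketched with a gshare-style predictor (an assumption for illustration; the slides do not name a specific scheme): a global history register of recent outcomes is XORed with the branch address to index a table of 2-bit counters, so global context disambiguates a branch's behavior.

```python
# Sketch: gshare-style correlating predictor.

def gshare_mispredicts(trace, hist_bits=4):
    table = {}                         # index -> 2-bit saturating counter
    history, misses = 0, 0
    mask = (1 << hist_bits) - 1
    for pc, taken in trace:
        idx = (pc ^ history) & mask    # fold global history into the index
        ctr = table.get(idx, 2)        # default: weakly taken
        if (ctr >= 2) != taken:
            misses += 1
        table[idx] = min(3, ctr + 1) if taken else max(0, ctr - 1)
        history = ((history << 1) | int(taken)) & mask
    return misses

# A branch that strictly alternates taken/not-taken defeats a plain 2-bit
# counter, but the history pattern identifies each case after warm-up:
trace = [(0x40, i % 2 == 0) for i in range(64)]
assert gshare_mispredicts(trace) == gshare_mispredicts(trace[:32])
```

The assertion shows the second half of the trace runs with no new mispredictions at all.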
SUPERSCALAR ARCHITECTURE

Definition and Characteristics

• Superscalar processing is the ability to initiate
multiple instructions during the same clock
cycle.
• A typical superscalar processor fetches and
decodes the incoming instruction stream
several instructions at a time.
• Superscalar architectures exploit the potential
of ILP (Instruction Level Parallelism).

Fetching and dispatching two instructions per cycle
Uninterrupted stream of instructions

• The outcomes of conditional branch instructions
are usually predicted in advance to ensure an
uninterrupted stream of instructions.
• Instructions are initiated for execution in parallel
based on the availability of operand data, rather
than their original program sequence. This is
referred to as dynamic instruction scheduling.
• Upon completion, instruction results are
re-sequenced into the original program order.
Superscalar Execution Example

• With register renaming for WAR
and WAW dependencies.

Register Renaming Example

A WAR dependency exists between the LD r7,(r3) and SUB r3,r12,r11 instructions.

With register renaming, the first write to r3 maps to hw3, while the second write
maps to hw20. This converts a four-instruction dependency chain into two
two-instruction chains, which can then be executed in parallel if the processor
allows out-of-order execution.

Hardware Organization of a superscalar processor

CONCLUSION

• Superscalar execution thereby allows higher CPU throughput
than would otherwise be possible at the same clock rate.
• Virtually all general-purpose CPUs developed since about 1998
are superscalar.
• The major problem in executing multiple instructions from a
scalar program is the handling of data dependencies. If data
dependencies are not handled effectively, it is difficult to
achieve an execution rate of more than one instruction per
clock cycle.

