You are on page 1of 35

ECE 2300

Digital Logic & Computer Organization


Spring 2018

More Pipelined Microprocessor

Lecture 18: 1
Announcements
• No instructor office hour today
– Rescheduled to Monday April 16, 4:00-5:30pm

• Prelim 2 review sessions


– Friday April 13, 4:30-6:00pm, PHL 219
– Monday April 16, 7:00-8:30pm, PHL 203

Lecture 18: 2
Pipelined Microprocessor
Control
CU Signals

VCZN
Adder
+2 Fm … F0 Data
RF RAM
M LD
P
Decoder
U Inst SA M
ALU
X C RAM SB U
DR M D_IN
U X
PCJ VCZN
X MW MD
PCL D_in

SE

IF/ID ID/EX MB EX/MEM MEM/WB

Key idea: Keep all resources fully utilized

Lecture 18: 3
Data Hazard Problem
CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9

A
ADD R1,R2,R3 IM Reg L DM Reg
U

A
OR R4,R1,R3 IM Reg L DM Reg
U

A
SUB R5,R2,R1 IM Reg L DM Reg
U

A
AND R6,R1,R2 IM Reg L DM Reg
U

A
ADDI R7,R7,3 IM Reg L DM Reg
U

The OR, SUB, and AND instructions are


data dependent on the ADD instruction
Lecture 18: 4
Example: Data Hazards
• Identify all data hazards in the following
instruction sequences by circling each source
register that is read before the updated value is
written back
SUB R4, R2, R3
ADDI R1, R1, 1
ADDI R2, R4, 1
ADDI R3, R4, 1
SUB R5, R3, R4

Lecture 18: 5
Review: Compiler Inserts NOPs (Solution 1)
CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9

A
ADD R1,R2,R3 IM Reg L DM Reg
U

A
NOP IM Reg L DM Reg
U

A
NOP IM Reg L DM Reg
U

A
NOP IM Reg L DM Reg
U

A
OR R4,R1,R3 IM Reg L DM Reg
U

Lecture 18: 6
Review: HW Stalls the Pipeline (Solution 2)
CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9

A
ADD R1,R2,R3 IM Reg L DM Reg
U

A
OR R4,R1,R3 IM bubble bubble bubble Reg L DM Reg
U

SUB R5,R2,R1 IM Reg A


L
DM
U

A
AND R6,R1,R2 IM Reg L
U

ADDI R7,R7,3 IM Reg

The pipeline is stalled for three cycles

Lecture 18: 7
Solution 3: HW Forwarding (Bypassing)
CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9

A
ADD R1,R2,R3 IM Reg L DM Reg
U

A
OR R4,R1,R3 IM Reg L DM Reg
U

A
SUB R5,R2,R1 IM Reg L DM Reg
U

A
AND R6,R1,R2 IM Reg L DM Reg
U

A
ADDI R7,R7,3 IM Reg L DM Reg
U

Lecture 18: 8
Pipeline Modifications for Forwarding?

IM Reg DM
A
L
U

IF/ID ID/EX EX/MEM MEM/WB

Lecture 18: 9
Pipelined Microprocessor w/o Forwarding
Control
CU Signals

VCZN
Adder
+2 Fm … F0 Data
RF RAM
M LD
P
Decoder
U Inst SA M
ALU
X C RAM SB U
DR M D_IN
U X
PCJ VCZN
X MW MD
PCL D_in

SE

IF/ID ID/EX MB EX/MEM MEM/WB

Lecture 18: 10
Pipelined Processor with Forwarding
Control
CU Signals

VCZN
Adder
+2 Fm … F0 Data
M
RF U M RAM
M LD X
U
P Inst
Decoder
SA
U
X M
M ALU
X C RAM SB
M
U U
DR X D_IN
U
X
M
U
X
PCJ X MB V C Z N
D_in MW MD
PCL
SE
IF/ID ID/EX EX/MEM MEM/WB

Lecture 18: 11
Forwarding in Action
Control
CU Signals

VCZN
Adder
Fm … F0 Data
M
RF U M RAM
LD X U
Decoder

SA X M
M ALU
SB
M
U U
DR X D_IN
U
X
M
U
X
X MB V C Z N
D_in MW MD

SE
IF/ID ID/EX EX/MEM MEM/WB

OR R4,R1,R3 ADD R1,R2,R3

Lecture 18: 12
Forwarding in Action
Control
CU Signals

VCZN
Adder
Fm … F0 Data
M
RF U M RAM
LD X U
Decoder

SA X M
M ALU
SB
M
U U
DR X D_IN
U
X
M
U
X
X MB V C Z N
D_in MW MD

SE
IF/ID ID/EX EX/MEM MEM/WB

SUB R5,R2,R1 OR R4,R1,R3 ADD R1,R2,R3

Lecture 18: 13
Forwarding in Action
Control
CU Signals

VCZN
Adder
Fm … F0 Data
M
RF U M RAM
LD X U
Decoder

SA X M
M ALU
SB
M
U U
DR X D_IN
U
X
M
U
X
X MB V C Z N
D_in MW MD

SE
IF/ID ID/EX EX/MEM MEM/WB

AND R6,R1,R2 SUB R5,R2,R1 OR R4,R1,R3 ADD R1,R2,R3

Lecture 18: 14
HW Forwarding
Control
CU Signals

Adder
+2 Fm … F0 Data
M
RF U M RAM
M LD X
U
P Inst
Decoder
SA
U
X M
M ALU
X C RAM SB
M
U U
DR X D_IN
U
X
M
U
X
PCJ X MB V C Z N
D_in MW MD
PCL
SE
IF/ID ID/EX EX/MEM MEM/WB

Trade-off between performance and cost

Lecture 18: 15
Example: Data Hazards w/o Forwarding
• Identify all data hazards in the following instruction
sequences by circling each source register that is read
before the updated value is written back
ADD R1, R2, R3
OR R4, R1, R3
SUB R5, R2, R1
SUB R6, R1, R2

Lecture 18: 16
Example: Data Hazards w/ Forwarding
• Identify all data hazards in the following instruction
sequences by circling each source register that is read
before the updated value is written back
ADD R1, R2, R3
OR R4, R1, R3
SUB R5, R2, R1
SUB R6, R1, R2

Data hazards resolved by R-type to R-type forwarding

Lecture 18: 17
Another Example: Data Hazards w/ Forwarding
• Identify all data hazards in the following instruction
sequences by circling each source register that is read
before the updated value is written back
LW R1, 0(R2)
OR R4, R1, R3
SUB R5, R2, R1

Data hazard not resolved by R-type to R-type forwarding

Lecture 18: 18
Data Hazards Caused by Load
CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9

A
LW R1,0(R2) IM Reg L DM Reg
U

A
OR R4,R1,R3 IM Reg L DM Reg
U

A
SUB R5,R2,R1 IM Reg L DM Reg
U

A
AND R6,R1,R2 IM Reg L DM Reg
U

A
ADDI R7,R7,3 IM Reg L DM Reg
U

Lecture 18: 19
Load Instructions and Forwarding
CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9

A
LW R1,0(R2) IM Reg L DM Reg
U

A
OR R4,R1,R3 IM Reg L DM Reg
U

A
SUB R5,R2,R1 IM Reg L DM Reg
U

A
AND R6,R1,R2 IM Reg L DM Reg
U

A
ADDI R7,R7,3 IM Reg L DM Reg
U

Lecture 18: 20
Solution 1: Compiler Inserts NOP Instruction
CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9

A
LW R1,0(R2) IM Reg L DM Reg
U

A
NOP IM Reg L DM Reg
U

A
OR R4,R1,R3 IM Reg L DM Reg
U

A
SUB R5,R2,R1 IM Reg L DM Reg
U

A
AND R6,R1,R2 IM Reg L DM Reg
U

Lecture 18: 21
Solution 2: HW Stalls the Pipeline
CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9

A
LW R1,0(R2) IM Reg L DM Reg
U

OR R4,R1,R3 IM Reg bubble


A
L
DM Reg
U

A
SUB R5,R2,R1 IM bubble Reg L DM Reg
U

AND R6,R1,R2 IM Reg A


L
DM Reg
U

ADDI R7,R7,3 IM Reg


A
L
DM
U

Lecture 18: 22
Solution 3: Delay Slots
• A delay slot is a location in the program where
the compiler is required to insert an instruction
between dependent instructions

• The ISA defines the delay slots

• The compiler can fill delay slots with NOPs

• Even better: Move a non-dependent instruction


from elsewhere in the program into the delay slot
– Doing so must not change the function of the program

Lecture 18: 23
Filling the Load Delay Slot
CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9

A
LW R1,0(R2) IM Reg L DM Reg
U

A
ADDI
NOP R7,R7,3 IM Reg L DM Reg
U

A
OR R4,R1,R3 IM Reg L DM Reg
U

A
SUB R5,R2,R1 IM Reg L DM Reg
U

A
AND R6,R1,R2 IM Reg L DM Reg
U

ADDI R7,R7,3

Lecture 18: 24
Filling the Load Delay Slot ?
CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9

A
LW R1,0(R2) IM Reg L DM Reg
U

A
NOP IM Reg L DM Reg
U

A
OR R4,R1,R3 IM Reg L DM Reg
U

A
SUB R5,R2,R1 IM Reg L DM Reg
U

A
AND R6,R1,R2 IM Reg L DM Reg
U

ADDI R7,R6,3

Lecture 18: 25
The Problem with Branches
Control
CU Signals

Adder
+2 Fm … F0 Data
M
RF U M RAM
M LD X
U
P Inst
Decoder
SA
U
X M
M ALU
X C RAM SB
M
U U
DR X D_IN
U
X
M
U
X
PCJ X MB V C Z N
D_in MW MD
PCL
SE
IF/ID ID/EX EX/MEM MEM/WB

If the condition is met, the PC is updated at the end of EX,


after we’ve already fetched the next two instructions

Lecture 18: 26
The Problem with Branches
CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9

A
BEQ R2,R3,X IM Reg L DM Reg
U

A
OR R4,R1,R3 IM Reg L DM Reg
U

A
SUB R5,R2,R1 IM Reg L DM Reg
U

A
X: AND R6,R1,R2 IM Reg L DM Reg
U

A
ADDI R7,R7,3 IM Reg L DM Reg
U

...

OR and SUB are fetched before the branch


condition is evaluated
Lecture 18: 27
Control Hazard
• Occurs when instructions following a branch are
fetched before the branch outcome is known
BEQ R2, R3, X IF ID ALU MEM WB
OR R4, R1, R3 IF ID ALU MEM WB
SUB R5, R2, R1 IF ID ALU MEM WB
X: AND R6,R1,R2

• What should happen


– If branch is not taken, next fetched instruction should be at
address PC+2 (OR)
– If branch is taken, next fetched instruction should be at
address X (AND)

• What actually happens


– Instructions at PC+2 and PC+4 are fetched before branch
outcome is known

Lecture 18: 28
Branch Delay Slot
• If the ISA defines a branch delay slot, the
instruction immediately following a branch is
always executed after the branch

• The compiler finds an instruction to put there, or


puts in a NOP

• The hardware must execute the instruction


immediately following the branch, regardless of
whether the branch is taken or not

Lecture 18: 29
Reducing the Branch Delay
• We already calculate the branch target in ID

• Put dedicated hardware to also evaluate the


condition in ID
– Hence only 1 branch delay slot needed

Lecture 18: 30
Evaluating Branch Condition in ID
Control
EvaluatingCU
Branch Condition in ID Signals
Control
=
CUsign bit Signals
=?
Adder
sign bit
+2 Fm … F0 Data


M +2 !Adder
LD
RF
M
U
X
M Fm … F0 Data
RAM
P Inst U
Decoder
M
U M SARF X RAM M
U M
ALU
X UC P RAM
LD M
SB X U
U
Decoder

U
Inst SA
DR M X M
M X ALU D_IN X
X C RAM !SB U M
MX U U U
PCJ !DR X
MB D_IN
U MX X
PCJ D_in X U MW MD
PCL X MB
D_in MW MD
PCL
SE
SE
IF/ID ID/EX EX/MEM MEM/WB
IF/ID ID/EX EX/MEM MEM/WB

Lecture
Lecture 18: 31 19: !2
Filling the Branch Delay Slot with a NOP
CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9

A
BEQ R2,R3,X IM Reg L DM Reg
U

A
NOP IM Reg L DM Reg
U

A
OR R4,R1,R3 IM Reg L DM Reg
U

A
SUB R5,R2,R1 IM Reg L DM Reg
U

A
X: AND R6,R1,R2 IM Reg L DM Reg
U

ADDI R7,R7,3
...

Lecture 18: 32
Filling the Branch Delay Slot
CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9

A
BEQ R2,R3,X IM Reg L DM Reg
U

A
ADDI
NOP R7,R7,3 IM Reg L DM Reg
U

A
OR R4,R1,R3 IM Reg L DM Reg
U

A
SUB R5,R2,R1 IM Reg L DM Reg
U

A
X: AND R6,R1,R2 IM Reg L DM Reg
U

ADDI R7,R7,3
The ADDI is always executed after the BEQ.
...
If the BEQ is taken, executing the ADDI
must not cause incorrect behavior
Lecture 18: 33
Branch Target Address (PC+2+OFF)
Control
CU Signals

VCZN
Adder
+2 Fm … F0 Data
RF RAM
M LD
P
Decoder
U Inst SA M
ALU
X C RAM SB U
DR M D_IN
U X
PCJ VCZN
X MW MD
PCL D_in

SE

IF/ID ID/EX MB EX/MEM MEM/WB

Branch delay slot is accounted for in the


branch target calculation

Lecture 18: 34
Before Next Class

Next Time

More Pipelined Microprocessor

Lecture 18: 35

You might also like