You are on page 1of 76

EEE440 - Computer Architecture

Enhancing Performance with Pipelining


(Chapter 6)
Pipelining
Start work ASAP!! Do not waste time!
6 PM 7 8 9 10 11 12 1 2 AM
Time
Task
order
A
Not pipelined
B

Assume 30 min. each task – wash, dry, fold, store – and that
separate tasks use separate hardware and so can be overlapped
6 PM 7 8 9 10 11 12 1 2 AM
Time

Task
order

A Pipelined
B

D
Pipelined vs. Single-Cycle Instruction
Execution: the Plan

P rogra m
e xecution 2 4 6 8 10 12 14 16 18
order Time
(in instructions ) Single-cycle
Instruction Data
lw $1, 100($0) Reg ALU Reg
fetch access

Instruction Data
lw $2, 200($0) 8 ns fe tch
Reg ALU
access
Reg

Instruction
lw $3, 300($0) 8 ns fe tch
...
8 ns
Assume 2 ns for memory access, ALU operation; 1 ns for register access:
therefore, single cycle clock 8 ns; pipelined clock cycle 2 ns.
Pr ogr a m
exec uti on 2 4 6 8 10 12 14
Time
or der
(i ni nstr ucti ons)
Instruction Da ta
lw $1, 100($0) Reg ALU Reg
fetch access
Pipelined
Ins truction Da ta
lw $2, 200($0) 2 ns Reg ALU Reg
fetch access

I nstr ucti on Dat a


lw $3, 300($0) 2 ns Re g ALU Re g
f et ch access

2 ns 2 ns 2 ns 2 ns 2 ns
Pipelining: Keep in Mind
Pipelining does not reduce latency of a single task, it
increases throughput of entire workload
Pipeline rate limited by longest stage
potential speedup = number pipe stages
unbalanced lengths of pipe stages reduces speedup
Time to fill pipeline and time to drain it – when there is
slack in the pipeline – reduces speedup
Example Problem
Problem: for the laundry fill in the following table when
1. the stage lengths are 30, 30, 30 30 min., resp.
2. the stage lengths are 20, 20, 60, 20 min., resp.

Person Unpipelined Pipeline 1 Ratio unpipelined Pipeline 2 Ratio unpiplelined


finish time finish time to pipeline 1 finish time to pipeline 2
1
2
3
4
n
Come up with a formula for pipeline speed-up!
Pipelining MIPS
What makes it easy with MIPS?
all instructions are same length
so fetch and decode stages are similar for all instructions
just a few instruction formats
simplifies instruction decode and makes it possible in one stage
memory operands appear only in load/stores
so memory access can be deferred to exactly one later stage
operands are aligned in memory
one data transfer instruction requires one memory access stage
Pipelining MIPS
What makes it hard?
structural hazards: different instructions, at different stages,
in the pipeline want to use the same hardware resource
control hazards: succeeding instruction, to put into pipeline,
depends on the outcome of a previous branch instruction,
already in pipeline
data hazards: an instruction in the pipeline requires data to
be computed by a previous instruction still in the pipeline

Before actually building the pipelined datapath and control


we first briefly examine these potential hazards
individually…
Structural hazard: inadequate hardware to simultaneously
support all instructions in the pipeline in the same clock cycle
E.g., suppose single – not separate – instruction and data
Structural Hazards
memory in pipeline below with one read port
then a structural hazard between first and fourth lw instructions
P rogra m
e xecution 2 4 6 8 10 12 14
Time
order
(in instructions)
Instruction Da ta
lw $1, 100($0)
fetch
Reg ALU
acces s
Reg
Pipelined
Instruction Da ta
lw $2, 200($0) 2 ns Reg ALU Reg
fetch access

Instruction Data
Hazard if
lw $3, 300($0) 2 ns fetch
Reg ALU
access
Reg
single memory
Instruction Da ta
lw $4, 400($0) Reg ALU Reg
2 ns fetch access

MIPS was designed to be pipelined:


2 ns structural
2 ns 2 ns hazards
2 ns 2 ns are easy
to avoid!
Control Hazards
Control hazard: need to make a decision based on the result of a
previous instruction still executing in pipeline
Solution 1 Stall the pipeline

Pr ogr a m
exe c uti o n 2 4 6 8 10 12 14 16
or der Ti me
(i ni nstr ucti ons)
Ins truction Data Note that branch outcome
a dd $4, $5, $6 Reg ALU Reg
fetch acces s
is
be q $1, $2, 40
Ins truction
fe tch
Reg ALU
Data
access
Re g computed in ID stage with
2ns
Ins truction
added
Data
hardware (later…)
lw $3, 300($0) bubble Reg ALU Re g
fe tch access

4 ns 2ns

Pipeline stall
Control Hazards
Solution 2 Predict branch outcome
e.g., predict branch-not-taken :
Program
execution 2 4 6 8 10 12 14
order Time
(in instructions)
Instruction Data
add $4, $5, $6 fetch
Reg ALU
access
Reg

Instruction Data
beq $1, $2, 40 Reg ALU Reg
2 ns fetch access

Instruction Data
lw $3, 300($0) Reg ALU Reg
2 ns fetch access

Prediction
success
Program
execution 2 4 6 8 10 12 14
order Time
(in instructions)
Instruction Data
add $4, $5 ,$6 Reg ALU Reg
fetch access

Instruction Data
beq $1, $2, 40 Reg ALU Reg
fetch access
2 ns
bubble bubble bubble bubble bubble

Instruction Data
or $7, $8, $9 Reg ALU Reg
fetch access
4 ns
Prediction failure: undo (=flush) lw
Control Hazards
Solution 3 Delayed branch: always execute the sequentially next
statement with the branch executing after one instruction delay
– compiler’s job to find a statement that can be put in the slot
that is independent of branch outcome
MIPS does this – but it is an option in SPIM (Simulator -> Settings)
Program
executi on 2 4 6 8 10 12 14
order Time
(i ni nstructions)
beq$1, $2, 40 Insfetruction
tch
Re g ALU
Da ta
acce ss
Reg

Ins truction Data


add$4, $5, $6 fe tch
Re g ALU
acce ss
Re g
(d el ayedbranchsl ot) 2ns
Ins truction Da ta
lw$3, 300($0) fe tch
Reg ALU
access
Re g
2ns
2ns
Delayed branch beq is followed by add that is
independent of branch outcome
Data Hazards
Data hazard: instruction needs data from the result of a previous
instruction still executing in pipeline
Solution Forward data if possible…

2 4 6 8 10
Time

IF ID EX MEM
Instruction
add $s0, $t0, $t1 WB
pipeline diagram:
shade indicates
use –
Pr ogr a m
executi on
left=write,
2 4 6 8 10
or der Ti me
(i ni nstr ucti ons)
right=read
a dd $s 0, $t 0, $t 1 IF ID EX ME M WB Without forwarding
– blue line –
s ub $t 2, $s 0, $t 3 IF ID EX ME M WB data has to go back
in time;
with forwarding –
red line –
Data Hazards
Forwarding may not be enough
e.g., if an R-type instruction following a load uses the result of the
load – called load-use data hazard
2 4 6 8 10 12 14
Progra m Time
execution
order
(in instructions)

lw $s0, 20($t1) IF ID EX ME M WB Without a stall it is


impossible
s ub $t2, $s0, $t3 IF ID EX ME M WB to provide input to
the sub
2 4 6 8 10 12 14instruction in time
Program Time
execution
order
(in instructions)

lw $s0, 20($t1) IF ID EX MEM WB With a one-stage


stall, forwarding
bubble bubble bubble bubble bubble can get the data to
the sub
sub $t2, $s0, $t3 IF ID EX MEM WB
instruction in time
Reordering Code to Avoid Pipeline Stall
(Software Solution)
Example:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1) Data
hazard
sw $t0, 4($t1)

Reordered code:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t0, 4($t1)
sw $t2, 0($t1) Interch
anged
Pipelined Datapath
We now move to actually building a pipelined datapath
First recall the 5 steps in instruction execution
1. Instruction Fetch & PC Increment (IF)
2. Instruction Decode and Register Read (ID)
3. Execution or calculate address (EX)
4. Memory access (MEM)
5. Write result into register (WB)
Review: single-cycle processor
all 5 steps done in a single clock cycle
dedicated hardware required for each step

What happens if we break the execution into multiple cycles, but keep
the extra hardware?
Review - Single-Cycle Datapath “Steps”

A
D
D
4 A
D
D
P Instructi I <
C A
D
R
D 3 1
on3
5 5 5 <
D Instru 2 6 2
2
R Me
ction
R
N
R
N
W
NR Ze
mo Registe
1 2 D
1
A ro

ry W
D r File R L
M
D
2
U
X
U A
D
D D R
1
E
X 3
R Mea D
M
U
X
6 T
N
2 W
D mot
D rya

IF ID EX MEM WB
Instruction Instruction Execute/ Memory Write
Pipelined Datapath – Key Idea
What happens if we break the execution into multiple cycles, but keep
the extra hardware?
Answer: We may be able to start executing a new instruction at each
clock cycle - pipelining
…but we shall need extra registers to hold data between cycles
– pipeline registers
Pipelined Datapath
Pipeline registers wide enough to hold data coming in

A
D
D
A
4
64 128 D
D
P bits bits < 97 64
Instruction
C AD
DR
R
D 3 1 3
I
< bits bits
Instruc 2 2 5 5 5
6
Mem R R W 2
tion N N NR Ze
ory Register
1 2 D
1
A ro
W
D File L
R M
D
2
U
X
U AD
DR
D R
1
E
X 3
Mem
at D
M
U
X
6 T
N
2 W
D ory
a
D

IF/ID ID/ EX/ MEM/


EX MEM WB
Pipelined Datapath
Pipeline registers wide enough to hold data coming in

A
D
D
A
4
64 128 D
D
P bits bits < 97 64
Instruction
C AD
DR
R
D 3 1 3
I
< bits bits
Instruc 2 2 5 5 5
6
Mem R R W 2
tion N N NR Ze
ory Register
1 2 D
1
A ro
W
D File L
R M
D
2
U
X
U AD
DR
D R
1
E
X 3
Mem
at D
M
U
X
6 T
N
2 W
D ory
a
D

IF/ID ID/ EX/ MEM/


EX MEM WB
Only data flowing right to left may cause hazard…, why?
Bug in the Datapath

I I EX/ ME
A
D F D ME M/
4
D
/ / A M WB
D
I E D
P D Instructi I
X <
C A
D
R
D 3 1
on3
<
Instru 2 2 5 5 5
D 6
R Me R R W 2
ction N N NR
mo Registe
1 2 D
1
A
ry W
D r File R L
M
D
2
U
X
U A
D
D D R
1
E
X 3
R Mea D
M
U
X
6 T
N
2 W
D mot
D rya

Write register number comes from another later instruction!


Corrected Datapath
I I EX/ ME
F D ME M/
A
D / / M WB
A
4
D
I 64 E 133 D
D bits X bits < D
102 69
bits bits
P <
C AD R
3 5
R R 2 Z
DR
Instru
D 2
5
N
R
1
D A e
Me
ction
N
W Reg 1
L r

mo
5 2
N F
iste
R M
o
W D U
U AD
ri
D 2 X
ry DR
D
l1 E Me R M

X 3 a D U

e6
X
T
N
2 W
D mot
5 D rya

Destination register number is also passed through ID/EX, EX/MEM


and MEM/WB registers, which are now wider by 5 bits
Pipelined Example
Consider the following instruction sequence:
lw $t0, 10($t1)
sw $t3, 20($t4)
add $t5, $t6, $t7
sub $t8, $t9, $t10
Single-Clock-Cycle Diagram:Clock Cycle
L1
W
IF/ID ID/EX EX/MEM MEM/WB

ADD
ADD
4
<<2
PC
ADDR RD RN1 RD1
32 5
ALU Zero
Instruction RN2
5
Memory Register
WN File RD2
5
M
WD U ADDR
X
Data
E Memory RD M
U
16 X 32 X
T WD
N
5
D
Single-Clock-Cycle Diagram: Clock
S Cycle 2 L
W W
IF/ID ID/EX EX/MEM MEM/WB

ADD
ADD
4
<<2
PC
ADDR RD RN1 RD1
32 5
ALU Zero
Instruction RN2
5
Memory Register
WN File RD2
5
M
WD U ADDR
X
Data
E Memory RD M
U
16 X 32 X
T WD
N
5
D
Single-Clock-Cycle Diagram: Clock
ADCycle 3 S L
D W W
IF/ID ID/EX EX/MEM MEM/WB

ADD
ADD
4
<<2
PC
ADDR RD RN1 RD1
32 5
ALU Zero
Instruction RN2
5
Memory Register
WN File RD2
5
M
WD U ADDR
X
Data
E Memory RD M
U
16 X 32 X
T WD
N
5
D
Single-Clock-Cycle Diagram: Clock
SUCycle 4 AD S L
B D W W
IF/ID ID/EX EX/MEM MEM/WB

ADD
ADD
4
<<2
PC
ADDR RD RN1 RD1
32 5
ALU Zero
Instruction RN2
5
Memory Register
WN File RD2
5
M
WD U ADDR
X
Data
E Memory RD M
U
16 X 32 X
T WD
N
5
D
Single-Clock-Cycle Diagram: Clock
Cycle 5 SU AD S L
B D W W
IF/ID ID/EX EX/MEM MEM/WB

ADD
ADD
4
<<2
PC
ADDR RD RN1 RD1
32 5
ALU Zero
Instruction RN2
5
Memory Register
WN File RD2
5
M
WD U ADDR
X
Data
E Memory RD M
U
16 X 32 X
T WD
N
5
D
Single-Clock-Cycle Diagram: Clock
Cycle 6 SU AD S
B D W
IF/ID ID/EX EX/MEM MEM/WB

ADD
ADD
4
<<2
PC
ADDR RD RN1 RD1
32 5
ALU Zero
Instruction RN2
5
Memory Register
WN File RD2
5
M
WD U ADDR
X
Data
E Memory RD M
U
16 X 32 X
T WD
N
5
D
Single-Clock-Cycle Diagram: Clock
Cycle 7 SU AD
D
B
IF/ID ID/EX EX/MEM MEM/WB

ADD
ADD
4
<<2
PC
ADDR RD RN1 RD1
32 5
ALU Zero
Instruction RN2
5
Memory Register
WN File RD2
5
M
WD U ADDR
X
Data
E Memory RD M
U
16 X 32 X
T WD
N
5
D
Single-Clock-Cycle Diagram: Clock
Cycle 8 SU
B
IF/ID ID/EX EX/MEM MEM/WB

ADD
ADD
4
<<2
PC
ADDR RD RN1 RD1
32 5
ALU Zero
Instruction RN2
5
Memory Register
WN File RD2
5
M
WD U ADDR
X
Data
E Memory RD M
U
16 X 32 X
T WD
N
5
D
Alternative View – Multiple-Clock-Cycle
Diagram
C C C C C C C C
C
1
C
2
C
3
C
4
C
5
C
6
Time
C
7
C
8
axis
lw $t0, I
M
R
E
A
L
D
M
R
E
10($t1) G U G

sw $t3, I
M
R
E
A
L
D
M
R
E
20($t4) G U G

add $t5, I
M
R
E
A
L
D
M
R
E
$t6, $t7 G U G

s $t8, I
M
R
E
A
L
D
M
R
E
u $t9, G U G

b $t10
Notes
One significant difference in the execution of an R-type
instruction between multicycle and pipelined implementations:
register write-back for the R-type instruction is the 5th (the last
write-back) pipeline stage vs. the 4th stage for the multicycle
implementation. Why?
think of structural hazards when writing to the register file…
Worth repeating: the essential difference between the pipeline
and multicycle implementations is the insertion of pipeline
registers to decouple the 5 stages
The CPI of an ideal pipeline (no stalls) is 1. Why?
The RaVi Architecture Visualization Project of Dortmund U. has
pipeline simulations – see link in our Additional Resources page
As we develop control for the pipeline keep in mind that the text
does not consider jump – should not be too hard to implement!
Recall Single-Cycle Control – the
Datapath
0
M
u
x
ALU
Add result 1
Add Shift PCSrc
RegDst left 2
4 Branch
MemRead
Instruction [31 26] MemtoReg
Control
ALUOp
MemWrite
ALUSrc
RegWrite

Instruction [25 21] Read


PC Read register 1
address Read
Instruction [20 16] data 1
Read
register 2 Zero
Instruction 0 Registers Read ALU ALU
[31– 0] 0 Read
M Write data 2 result Address 1
Instruction u register M data
u M
memory Instruction [15 11] x u
1 Write x Data
data x
1 memory 0
Write
data
16 32
Instruction [15 0] Sign
extend ALU
control

Instruction [5 0]
Instruction AluOp Instruction Funct Field Desired ALU
control
opcode operation ALU action input
Recall Single-Cycle – ALU Control
LW
SW
00
00
load word xxxxxx add
store word xxxxxx add
010
010
Branch eq 01 branch eq xxxxxx subtract 110
R-type 10 add 100000 add 010
R-type 10 subtract 100010 subtract 110
R-type 10 AND 100100 and 000
R-type 10 OR 100101 or 001
R-type 10 set on less 101010 set on less 111
ALUO Funct field Operatio
ALUOp p ALUOp F F F F F F n
1 0 0 0 5X 4X 3X 2X 1X 0X 01
0 1 X X X X X X 0
11
1 X X X 0 0 0 0 0
01
1 X X X 0 0 1 0 0
11
1 X X X 0 1 0 0 0
00
1 X X X 0 1 0 1 0
00
1 X X X 1 0 1 0 1
11
Truth table for ALU control bits 1
Recall Single-Cycle – Control Signals
Effect of control bits

Signal Name Effect when deasserted Effect when asserted

RegDst The register destination number for the The register destination number for the
Write register comes from the rt field (bits 20-16) Write register comes from the rd field (bits 15-11)
RegWrite None The register on the Write register input is written
with the value on the Write data input
AlLUSrc The second ALU operand comes from the The second ALU operand is the sign-extended,
second register file output (Read data 2) lower 16 bits of the instruction
PCSrc The PC is replaced by the output of the adder The PC is replaced by the output of the adder
that computes the value of PC + 4 that computes the branch target
MemRead None Data memory contents designated by the address
input are put on the first Read data output
MemWrite None Data memory contents designated by the address
input are replaced by the value of the Write data input
MemtoReg The value fed to the register Write data input The value fed to the register Write data input
comes from the ALU comes from the data memory
Memto- Reg Mem Mem
Deter- Instruction RegDst ALUSrc Reg Write Read Write Branch ALUOp1 ALUp0
mining R-format 1 0 0 1 0 0 0 1 0
control lw 0 1 1 1 1 0 0 0 0
bits sw X 1 X 0 0 1 0 0 0
beq X 0 X 0 0 0 1 0 1
Pipeline Control
Initial design – motivated by single-cycle datapath control – use
the same control signals
Observe:
No separate write signal for the PC as it is written every cycle Will
No separate write signals for the pipeline registers as they are be
written every cycle modifi
ed
No separate read signal for instruction memory as it is read every
by
clock cycle hazar
No separate read signal for register file as it is read every clock cycle
d
Need to set control signals during each pipeline stage detecti
on
Since control signals are associated with components active unit!!
during a single pipeline stage, can group control lines into five
groups according to pipeline stage
Pipelined Datapath with Control I
PCSrc

0
M
u
x
1

IF/ID ID/EX EX/MEM MEM/WB

Add

Add
4 Add
result
Branch
Shift
RegWrite left 2

Read MemWrite
Instruction

PC Address register 1 Read


Read data 1 ALUSrc
register 2 Zero
Zero MemtoReg
Instruction
Registers Read ALU ALU
memory Write 0 Read
data 2 result Address 1
register M data
u M
Data u
Write x memory
data x
1
0
Write
data
Instruction
[15– 0] 16 32 6
Sign ALU
extend control MemRead

Same control Instruction


[20– 16]
signals as the Instruction
0
M ALUOp

single-cycle
u
[15– 11] x
1
datapath RegDst
Pipeline Control Signals
There are five stages in the pipeline
instruction fetch / PC increment
Nothing to control as
instruction decode / register fetch
instruction memory
execution / address calculation
read and PC write are
memory access
always enabled
write back

Write-back
Execution/Address Calculation Memory access stage stage control
stage control lines control lines lines
Reg ALU ALU ALU Mem Mem Reg Mem to
Instruction Dst Op1 Op0 Src Branch Read Write write Reg
R-format 1 1 0 0 0 0 0 1 0
lw 0 0 0 1 0 1 0 1 1
sw X 0 0 1 0 0 1 0 X
beq X 0 1 0 1 0 0 0 X
Pipeline Control Implementation
Pass control signals along just like the data – extend each pipeline register to hold needed control bits for succeeding
stages

WB

Instruction
Control M WB

EX M WB

Note: The 6-bit funct field of the instruction required in the EX stage to generate ALU control can be retrieved as the 6
least significant bits of the immediate field which is sign-extended and passed from the IF/ID register to the ID/EX register

IF/ID ID/EX EX/MEM MEM/WB


Pipelined Datapath with Control II
PCSrc

ID/EX
0
M
u WB
x EX/MEM
1
Control M WB
MEM/WB

EX M WB
IF/ID

Add

Add
4 Add result

RegWrite Shift
Branch
left 2

MemWrite
ALUSrc
Read

MemtoReg
Instruction

PC Address register 1
Read
data 1
Read
register 2 Zero
Instruction
Registers Read ALU ALU
memory Write 0 Read
data 2 result Address 1
register M data
u Data M
Write x memory u
data x
1
0
Write
data

Instruction 16 32 6
[15– 0]
Control signals Sign
extend
ALU
control
MemRead

emanate from Instruction


[20– 16]
the control 0
M
u
ALUOp

portions of the Instruction


[15– 11]
1
x

pipeline registers RegDst


IF: lw $10, 20($1) ID: before<1> EX: before<2> MEM: before<3> WB: before<4>

IF/ID ID/EX EX/MEM MEM/WB


0
M 00 00
u WB
x
1 000 000 00
Control M WB
0 0 0
0000 00 0
EX M WB 0
0 0

Add

Add
4 Add result

RegWrite
Shift Branch
left 2

MemWrite
ALUSrc
Read

MemtoReg
Instruction
PC Address register 1 Read
Read data 1
register 2 Zero
Instruction
Registers Read ALU ALU
memory Write 0 Read
data 2 result Address 1
register M data
u Data M

PipelinedExecutionandControl
Write x memory u
data x
1
0
Write
data

Instruction
[15– 0] Sign ALU MemRead
extend control

Instruction
[20– 16]

Clock cycle 1
0 ALUOp
M
Instruction u
x

Instruction
[15– 11]
1
Clock 1 RegDst

sequence: IF: sub $11, $2, $3 ID: lw $10, 20($1) EX: before<1> MEM: before<2> WB: before<3>

IF/ID ID/EX EX/MEM MEM/WB


0

lw $10, 20($1) M

1
u
x
11
WB
00

sub $11, $2, $3


lw 010 000 00
Control M WB
0 0 0
0001 00 0
EX M WB 0

and $12, $4, $7


0 0

Add

or $13, $6, $7 4 Add


Add result

RegWrite
Branch

add $14, $8, $9


Shift
left 2

MemWrite
ALUSrc
1 Read

MemtoReg
Instruction

register 1
PC Address Read $1
X data 1
Read
register 2 Zero
Instruction
Registers Read $X ALU ALU
memory Write 0 Read
data 2 result Address 1
register M data
u Data M
Write x memory u

Label “before<i>” means


data x
1
0
Write
data

i th instruction before
Instruction
20 [15– 0] Sign 20 ALU MemRead
extend control

lw 10
Instruction
[20– 16] 10
0

Clock cycle 2
ALUOp
M
Instruction u
X [15– 11] X x
1
Clock 2 RegDst
IF: and $12, $4, $5 ID: sub $11, $2, $3 EX: lw $10, . . . MEM: before<1> WB: before<2>

IF/ID ID/EX EX/MEM MEM/WB


0
M 10 11
u WB
x
1 sub 000 010 00
Control M WB
0 0 0
1100 00 0
EX M WB 0
1 0

Add

Add
4 Add result

RegWrite
Shift Branch
left 2

MemWrite
ALUSrc
2 Read

MemtoReg
Instruction
PC Address register 1 Read $2 $1
3 Read data 1
register 2 Zero
Instruction
Registers Read $3 ALU ALU
memory Write 0 Read
data 2 result Address 1
register M data
u Data M
u

PipelinedExecutionandControl
Write x memory
data x
1
0
Write
data

Instruction
X [15– 0] Sign X 20 ALU MemRead
extend control

Instruction
X [20– 16] X 10

Clock cycle 3
0 ALUOp
M
Instruction u
x

Instruction
11 [15– 11] 11
1
Clock 3 RegDst

sequence: IF: or $13, $6, $7 ID: and $12, $2, $3 EX: sub $11, . . . MEM: lw $10, . . . WB: before<1>

IF/ID ID/EX EX/MEM MEM/WB


0
M 10 10

lw $10, 20($1)
u WB
x
1 and 000 000 11
Control M WB

sub $11, $2, $3 1100


EX
1
10
0
M
0
1
0
0
WB 0

and $12, $4, $7 Add

or $13, $6, $7 4 Add


Add result

RegWrite
Shift Branch

add $14, $8, $9


left 2

MemWrite
ALUSrc
4 Read

MemtoReg
Instruction

register 1
PC Address Read $4 $2
5 data 1
Read
register 2 Zero
Instruction
Registers Read $5 $3 ALU ALU
memory Write 0 Address Read
data 2 result 1
register M data
u Data M
Write x u
memory x
data 1
0
Write
data

Instruction
X [15– 0] Sign X ALU MemRead
extend control

Instruction
X [20– 16] X
0 ALUOp

Clock cycle 4 12
Instruction
[15– 11] 12 11
M

1
u
x
10

Clock 4 RegDst
IF: add $14, $8, $9 ID: or $13, $6, $7 EX: and $12, . . . MEM: sub $11, . . . WB: lw $10, . . .

IF/ID ID/EX EX/MEM MEM/WB


0
M 10 10
u WB
x
1 or 000 000 10
Control M WB
1 0 1
1100 10 0
EX M WB 1
0 0

Add

Add
4 Add result

RegWrite
Shift Branch
left 2

MemWrite
ALUSrc
6 Read

MemtoReg
Instruction
PC Address register 1 Read $6 $4
7 Read data 1
register 2 Zero
Instruction $5
Registers Read $7 ALU ALU
memory 10 Write 0 Read
data 2 result Address 1
register M data
u Data M
Write x memory u
x

PipelinedExecutionandControl
data 1
0
Write
data

Instruction
X [15– 0] Sign X ALU MemRead
extend control

Instruction
X [20– 16] X

Clock cycle 5
0 ALUOp
M 11 10
Instruction u
13 [15– 11] 13 12 x
Clock 5 1

Instruction
RegDst

sequence:
IF: after<1> ID: add $14, $8, $9 EX: or $13, . . . MEM: and $12, . . . WB: sub $11, . . .

IF/ID ID/EX EX/MEM MEM/WB


0
M 10 10

lw $10, 20($1)
u WB
x
1 add 000 000 10
Control M WB

sub $11, $2, $3


1 0 1
1100 10 0
EX M WB 0
0 0

and $12, $4, $7 Add

or $13, $6, $7 4 Add


Add result

RegWrite
Shift Branch

add $14, $8, $9


left 2

MemWrite
ALUSrc
8 Read

MemtoReg
Instruction

register 1
PC Address Read $8 $6
9 data 1
Read
register 2 Zero
Instruction
Registers Read $9 $7 ALU ALU
memory 11 Write 0 Read
data 2 result Address 1
register M data
u Data M
Write x memory u

Label “after<i>” means


data x
1
0
Write
data

i th instruction after add


Instruction
X [15– 0] Sign X ALU MemRead
extend control

Instruction
X [20– 16] X

Clock cycle 6
0 ALUOp
M 12 11
Instruction u
14 [15– 11] 14 13 x
1
Clock 6 RegDst
IF: after<2> ID: after<1> EX: add $14, . . . MEM: or $13, . . . WB: and $12, . . .

IF/ID ID/EX EX/MEM MEM/WB


0
M 00 10
u WB
x
1 000 000 10
Control M WB
1 0 1
0000 10 0
EX M WB 0
0 0

Add

Add
4 Add result

RegWrite
Shift Branch
left 2

MemWrite
ALUSrc
Read

MemtoReg
Instruction
PC Address register 1 Read $8
Read data 1
register 2 Zero
Instruction $9
Registers Read ALU ALU
memory 12 Write 0 Read
data 2 result Address 1
register M data
u Data M
Write x memory u
x

PipelinedExecutionandControl
data 1
0
Write
data

Instruction
[15– 0] Sign ALU MemRead
extend control

Instruction
[20– 16]

Clock cycle 7
0 ALUOp
M 13 12
Instruction u
[15– 11] 14 x
1
Clock 7 RegDst

Instruction IF: after<3> ID: after<2> EX: after<1> MEM: add $14, . . . WB: or $13, . . .

sequence: 0
M
IF/ID

00
ID/EX

WB
00
EX/MEM MEM/WB

u
x
1

lw $10, 20($1)
000 000 10
Control M WB
0 0 1
0000 00 0
EX M WB 0

sub $11, $2, $3


0 0

Add

and $12, $4, $7 4 Add


Add result

RegWrite
Branch

or $13, $6, $7
Shift
left 2

MemWrite
ALUSrc
Read

add $14, $8, $9

MemtoReg
Instruction

PC Address register 1
Read
data 1
Read
register 2 Zero
Instruction
Registers Read ALU ALU
memory 13 Write 0 Read
data 2 result Address 1
register M data
u Data M
Write x memory u
data x
1
0
Write
data

Instruction
[15– 0] Sign ALU MemRead
extend control

Instruction
[20– 16]

Clock cycle 8
0 ALUOp
M 14 13
Instruction u
[15– 11] x
1
Clock 8 RegDst
Pipelined Execution and Control
Instruction IF: after<4> ID: after<3> EX: after<2> MEM: after<1> WB: add $14, . . .

sequence: 0
IF/ID ID/EX EX/MEM MEM/WB
M 00 00
u WB
x
1

lw $10, 20($1)
000 000 00
Control M WB
0 0 1

sub $11, $2, $3


0000 00 0
EX M WB 0
0 0

and $12, $4, $7 Add

or $13, $6, $7
Add
4 Add result

RegWrite
Shift Branch

add $14, $8, $9 left 2

MemWrite
ALUSrc
Read

MemtoReg
Instruction

PC Address register 1 Read


Read data 1
register 2 Zero
Instruction
Registers Read ALU ALU
memory 14 Write 0 Read
data 2 result Address 1
register M data
u Data M
Write x memory u
data x
1
0
Write
data

Instruction
[15– 0] Sign ALU MemRead
extend control

Instruction
[20– 16]
0 ALUOp
M 14

Clock cycle 9 Instruction


[15– 11]
1
u
x

Clock 9 RegDst
Revisiting Hazards
So far our datapath and control have ignored hazards
We shall revisit data hazards and control hazards and
enhance our datapath and control to handle them in
hardware…
Data Hazards and Forwarding
Problem with starting an instruction before previous are finished:
data dependencies that go backward in time – called data hazards
Time (in clock cycles)

$2 = 10 Value of CC 1
register $2: 10
CC 2
10
CC 3
10
CC 4
10
CC 5
10/– 20
CC 6
– 20
CC 7
– 20
CC 8
– 20
CC 9
– 20
before sub; Program
execution
$2 = -20 after order
(in instructions)
sub sub $2, $1, $3 IM Reg DM Reg

sub $2, $1, $3


and $12, $2, $5 and $12, $2, $5 IM Reg DM Reg

or $13, $6, $2
add $14, $2, $2
or $13, $6, $2 IM DM Reg
sw $15, 100($2) Reg

add $14, $2, $2 IM Reg DM Reg

sw $15, 100($2) IM Reg DM Reg


Software Solution
Have compiler guarantee never any data hazards!
by rearranging instructions to insert independent instructions
between instructions that would otherwise have a data hazard
between them,
or, if such rearrangement is not possible, insert nops

sub $2, $1, $3 sub $2, $1, $3


lw $10, 40($3) nop
slt $5, $6, $7 and $12, $2, $5 onop and $12, $2, $5 or
or $13, $6, $2 add $14, $13,
r $6, $2 add $14, $2, $2 sw
$2, $2 sw $15, 100($2) $15, 100($2)
Such compiler solutions may not always be possible, and nops
slow the machine down

MIPS: nop = “no operation” = 00…0 (32bits)


= sll $0, $0, 0
Hardware Solution: Forwarding
Idea: use intermediate data, do not wait for result to be finally
written to the destination register. Two steps:
1. Detect data hazard
2. Forward intermediate data to resolve hazard
Pipelined Datapath with Control II (as
before)
PCSrc

ID/EX
0
M
u WB
x EX/MEM
1
Control M WB
MEM/WB

EX M WB
IF/ID

Add

Add
4 Add result

RegWrite Shift
Branch
left 2

MemWrite
ALUSrc
Read

MemtoReg
Instruction

PC Address register 1
Read
data 1
Read
register 2 Zero
Instruction
Registers Read ALU ALU
memory Write 0 Read
data 2 result Address 1
register M data
u Data M
Write x memory u
data x
1
0
Write
data

Instruction 16 32 6
[15– 0]
Control signals Sign
extend
ALU
control
MemRead

emanate from Instruction


[20– 16]
the control 0
M
u
ALUOp

portions of the Instruction


[15– 11]
1
x

pipeline registers RegDst


Hazard Detection
Hazard conditions:
1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt

Eg., in the earlier example, first hazard between sub $2, $1, $3 and
and $12, $2, $5 is detected when the and is in EX stage and the
sub is in MEM stage because
EX/MEM.RegisterRd = ID/EX.RegisterRs = $2 (1a)

Whether to forward also depends on:


if the later instruction is going to write a register – if not, no need to
forward, even if there is register number match as in conditions above
if the destination register of the later instruction is $0 – in which case
there is no need to forward value ($0 is always 0 and never
overwritten)
Plan:
Data Forwarding
allow inputs to the ALU not just from ID/EX, but also later
pipeline registers, and
use multiplexors and control signals to choose appropriate
inputs to ALU
Time (in clock cycles)
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9
Value of register $2 : 10 10 10 10 10/– 20 – 20 – 20 – 20 – 20
Value of EX/MEM : X X X – 20 X X X X X
Value of MEM/WB : X X X X – 20 X X X X

Program
execution order
(in instructions)
sub $2, $1, $3 IM Reg DM Reg

sub $2, $1, $3


and $12, $2, $5 and $12, $2, $5 IM Reg DM Reg
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2) or $13, $6, $2 IM Reg DM Reg

add $14, $2, $2 IM Reg DM Reg

sw $15, 100($2) IM Reg DM Reg

Dependencies between pipelines move forward in time


ID/EX EX/MEM MEM/WB

ForwardingHardware
Registers ALU

Data
memory M
u
x

Datapath before adding forwarding hardware


a. No forwarding

ID/EX EX/MEM MEM/WB

M
u
x
Registers
ForwardA ALU

M Data
u memory
x M
u
x

Rs ForwardB
Rt
Rt M
u EX/MEM.RegisterRd
Rd
x
Forwarding MEM/WB.RegisterRd
unit

b. With forwarding Datapath after adding forwarding hardware


Forwarding Hardware: Multiplexor
Control
Mux control Source Explanation
ForwardA = 00 ID/EX The first ALU operand comes from the register file
ForwardA = 10 EX/MEM The first ALU operand is forwarded from prior ALU result
ForwardA = 01 MEM/WB The first ALU operand is forwarded from data memory
or an earlier ALU result
ForwardB = 00 ID/EX The second ALU operand comes from the register file
ForwardB = 10 EX/MEM The second ALU operand is forwarded from prior ALU
result
ForwardB = 01 MEM/WB The second ALU operand is forwarded from data
memory
or an earlier ALU result

Depending on the selection in the


rightmost multiplexor
(see datapath with control diagram)
Data Hazard: Detection and
Forwarding
Forwarding unit determines multiplexor control according to the
following rules:

1. EX hazard
if ( EX/MEM.RegWrite // if there is a write…
and ( EX/MEM.RegisterRd ≠ 0 ) // to a non-$0 register…
and ( EX/MEM.RegisterRd = ID/EX.RegisterRs ) ) // which matches,
then… ForwardA = 10
2.

if ( EX/MEM.RegWrite // if there is a write…


and ( EX/MEM.RegisterRd ≠ 0 ) // to a non-$0 register…
and ( EX/MEM.RegisterRd = ID/EX.RegisterRt ) ) // which matches,
then… ForwardB = 10
Data Hazard: Detection and
Forwarding
1. MEM hazard
if ( MEM/WB.RegWrite // if there is a write…
and ( MEM/WB.RegisterRd ≠ 0 ) // to a non-$0 register…
and ( EX/MEM.RegisterRd ≠ ID/EX.RegisterRs )// and not already a register match
// with earlier pipeline register…
and ( MEM/WB.RegisterRd = ID/EX.RegisterRs ) ) // but match with later pipeline
register, then…
ForwardA = 01

if ( MEM/WB.RegWrite // if there is a write…


and ( MEM/WB.RegisterRd ≠ 0 ) // to a non-$0 register…
and ( EX/MEM.RegisterRd ≠ ID/EX.RegisterRt )// and not already a register match
// with earlier pipeline register…
and ( MEM/WB.RegisterRd = ID/EX.RegisterRt ) ) // but match with later pipeline
register, then…
ForwardB = 01

This check is necessary, e.g., for sequences such as add $1, $1,
$2; add $1, $1, $3; add $1, $1, $4;
Forwarding Hardware with Control
Called forwarding unit, not hazard
ID/EX
detection unit,
WB because once data is forwarded
EX/MEM

Control M there is no hazard!


WB
MEM/WB

IF/ID EX M WB

M
Instruction

u
x
Registers
Instruction Data
PC ALU
memory memory M
u
M x
u
x

IF/ID.RegisterRs Rs
IF/ID.RegisterRt Rt
IF/ID.RegisterRt Rt
M EX/MEM.RegisterRd
IF/ID.RegisterRd Rd u
x
Forwarding MEM/WB.RegisterRd
unit

Datapath with forwarding hardware and control wires – certain details,


e.g., branching hardware, are omitted to simplify the drawing
Note: so far we have only handled forwarding to R-type instructions…!
or $4, $4, $2 and $4, $2, $5 sub $2, $1, $3 before<1> before<2>

ID/EX
10 10
WB
EX/MEM

Forwarding
Control M WB
MEM/WB

IF/ID EX M WB

2 $2 $1
M

Instruction
5 u
x
Registers
Instruction Data
PC ALU
memory memory M
$5 $3
u
M x
u
x

2 1
5 3
M
4 2 u
x

Clock cycle 3
Forwarding
unit

Execution Clock 3

example: add $9, $4, $2 or $4, $4, $2 and $4, $2, $5 sub $2, . . . before<1>

ID/EX
10 10

sub $2, $1, $3 WB


EX/MEM
10

and $4, $2, $5 Control M WB


MEM/WB

or $4, $4, $2 IF/ID EX M WB

add $9, $4, $2 4 $4 $2


M
Instruction

6 u
x
Registers
Instruction Data
PC ALU
memory memory M
$2 $5
u
M x
u
x

2 2
6 5
M 2
4 4 u
x
Forwarding

Clock cycle 4 unit

Clock 4
after<1> add $9, $4, $2 or $4, $4, $2 and $4, . . . sub $2, . . .

ID/EX
10 10
WB
EX/MEM

Forwarding
10
Control M WB
MEM/WB
1
IF/ID EX M WB

4 $4 $4
M

Instruction
2 u
x
Registers
Instruction 2 Data
PC ALU
memory memory M
$2 $2
u
M x
u
x

4 4
2 2
M 4 2
u

Execution
9 4
x

Clock cycle 5
Forwarding
unit

example Clock 5

(cont.): after<2> after<1> add $9, $4, $2 or $4, . . . and $4, . . .

ID/EX
10

sub $2, $1, $3 WB


EX/MEM
10

and $4, $2, $5 Control M WB


MEM/WB
1

or $4, $4, $2 IF/ID EX M WB

add $9, $4, $2 $4


M
Instruction

u
x
Registers
Instruction 4 Data
PC ALU
memory memory M
$2
u
M x
u
x

4
2

M 4 4
9 u
x
Forwarding

Clock cycle 6 unit

Clock 6
Data Hazards and Stalls
Load word can still cause a hazard:
an instruction tries to read a register following a load instruction that writes to
the same register

lw $2, 20($1) Time (in clock cycles)


Program CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9
and $4, $2, $5 execution

or $8, $2, $6 order


(in instructions)
add $9, $4, $2 lw $2, 20($1) IM Reg DM Reg

Slt $1, $6, $7


and $4, $2, $5 IM Reg DM Reg

As even a pipeline
dependency goes or $8, $2, $6 IM Reg DM Reg

backward therefore,
in time we need a hazard detection unit to stall the pipeline after the load
forwardinginstruction
will not add $9, $4, $2 IM Reg DM Reg

solve the hazard


slt $1, $6, $7 IM Reg DM Reg
Pipelined Datapath with Control II (as
before)
PCSrc

ID/EX
0
M
u WB
x EX/MEM
1
Control M WB
MEM/WB

EX M WB
IF/ID

Add

Add
4 Add result

RegWrite Shift
Branch
left 2

MemWrite
ALUSrc
Read

MemtoReg
Instruction

PC Address register 1
Read
data 1
Read
register 2 Zero
Instruction
Registers Read ALU ALU
memory Write 0 Read
data 2 result Address 1
register M data
u Data M
Write x memory u
data x
1
0
Write
data

Instruction 16 32 6
[15– 0]
Control signals Sign
extend
ALU
control
MemRead

emanate from Instruction


[20– 16]
the control 0
M
u
ALUOp

portions of the Instruction


[15– 11]
1
x

pipeline registers RegDst


Hazard Detection Logic to Stall
Hazard detection unit implements the following check if to stall

if ( ID/EX.MemRead // if the instruction in the EX stage is a


load…
and ( ( ID/EX.RegisterRt = IF/ID.RegisterRs ) // and the destination register
or ( ID/EX.RegisterRt = IF/ID.RegisterRt ) ) ) // matches either source register
// of the instruction in the ID stage,
then…
stall the pipeline
Mechanics of Stalling
If the check to stall verifies, then the pipeline needs to stall
only 1 clock cycle after the load as after that the forwarding
unit can resolve the dependency
What the hardware does to stall the pipeline 1 cycle:
does not let the IF/ID register change (disable write!) – this will
cause the instruction in the ID stage to repeat, i.e., stall
therefore, the instruction, just behind, in the IF stage must be
stalled as well – so hardware does not let the PC change (disable
write!) – this will cause the instruction in the IF stage to repeat,
i.e., stall
changes all the EX, MEM and WB control fields in the ID/EX
pipeline register to 0, so effectively the instruction just behind
the load becomes a nop – a bubble is said to have been inserted
into the pipeline
note that we cannot turn that instruction into an nop by 0ing all the
bits in the instruction itself – recall nop = 00…0 (32 bits) – because
it has already been decoded and control signals generated
Hazard Detection Unit Hazard
detection
ID/EX.MemRead

unit ID/EX

WB
IF/IDWrite EX/MEM
M
Control u M WB
x MEM/WB
0
IF/ID EX M WB
PCWrite

M
Instruction

u
x
Registers
Instruction Data
PC ALU
memory memory M
u
M x
u
x

IF/ID.RegisterRs
IF/ID.RegisterRt
IF/ID.RegisterRt Rt M EX/MEM.RegisterRd
IF/ID.RegisterRd Rd u
x
ID/EX.RegisterRt Rs Forwarding MEM/WB.RegisterRd
Rt unit

Datapath with forwarding hardware, the hazard detection unit and


controls wires – certain details, e.g., branching hardware are omitted
to simplify the drawing
Stalling Resolves a Hazard
Same instruction sequence as before for which forwarding by
itself could not resolve the hazard:
Program Time (in clock cycles)
execution CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9 CC 10
order
(in instructions)

lw $2, 20($1) lw $2, 20($1) IM Reg DM Reg


and $4, $2, $5
or $8, $2, $6
add $9, $4, $2 and $4, $2, $5 IM Reg Reg DM Reg

Slt $1, $6, $7


or $8, $2, $6 IM IM Reg DM Reg

bubble

add $9, $4, $2 IM Reg DM Reg

slt $1, $6, $7 IM Reg DM Reg

Hazard detection unit inserts a 1-cycle bubble in the pipeline, after


which all pipeline register dependencies go forward so then the
forwarding unit can handle them and there are no more hazards
and $4, $2, $5 lw $2, 20($1) before<1> before<2> before<3>
Hazard
ID/EX.MemRead
detection
1 unit ID/EX
X
11
WB

IF/IDWrite
EX/MEM
M
Control u M WB
MEM/WB

Stalling
x
0
IF/ID EX M WB

1 $1

PCWrite
M

Instruction
X u
x
Registers
Instruction Data
PC ALU
memory memory M
$X
u
M x
u
x

Execution 1
X
2
M

example:
u
x
ID/EX.RegisterRt Forwarding
unit

Clock cycle 2
Clock 2
lw $2, 20($1) or $4, $4, $2 and $4, $2, $5 lw $2, 20($1) before<1> before<2>
and $4, $2, $5 2
Hazard
detection ID/EX.MemRead

or $4, $4, $2
unit ID/EX
5
00 11
WB
IF/IDWrite

add $9, $4, $2


EX/MEM
M
Control u M WB
x MEM/WB
0
IF/ID EX M WB

2 $2 $1
PCWrite

M
Instruction

5 u
x
Registers
Instruction Data
PC ALU
memory memory M
$5 $X
u
M x
u
x

2 1
5 X
2 M
4 u
x
ID/EX.RegisterRt Forwarding
unit

Clock cycle 3
Clock 3
or $4, $4, $2 and $4, $2, $5 bubble lw $2, . . . before<1>
Hazard
ID/EX.MemRead
detection
2 unit ID/EX
5
10 00
WB

IF/IDWrite
EX/MEM
M

Stalling
11
Control u M WB
x MEM/WB
0
IF/ID EX M WB

2 $2 $2

PCWrite
M

Instruction
5 u
x
Registers
Instruction Data
PC ALU
memory memory M
$5 $5
u
M x
u
x

Execution
2 2
5 5

M 2
4 4 u

example
x
ID/EX.RegisterRt Forwarding

Clock cycle 4
unit

(cont.): Clock 4

add $9, $4, $2 or $4, $4, $2 and $4, $2, $5 bubble lw $2, . . .
Hazard
ID/EX.MemRead

lw $2, 20($1)
detection
4 unit ID/EX
2
10 10

and $4, $2, $5 WB


IF/IDWrite

EX/MEM
M 0

or $4, $4, $2 Control


0
u
x
M WB
MEM/WB

11

add $9, $4, $2 IF/ID EX M WB

4 $4 $2
PCWrite

M
Instruction

2 u
x
Registers
Instruction 2 Data
PC ALU
memory memory M
$2 $5
u
M x
u
x

4 2
2 5
M 2
4 4 u
x
ID/EX.RegisterRt Forwarding
unit

Clock cycle 5
Clock 5
after<1> add $9, $4, $2 or $4, $4, $2 and $4, . . . bubble
Hazard ID/EX.MemRead
detection
4
unit ID/EX
2
10 10
WB

IF/IDWrite
EX/MEM
M 10
Control u M WB
x MEM/WB

Stalling
0
0
IF/ID EX M WB

4 $4 $4

PCWrite
M

Instruction
2 u
x
Registers
Instruction Data
PC ALU
memory memory M
$2 $2
u
M x
u
x

4 4

Execution 2

9
2

4
M
u
4
x

example ID/EX.RegisterRt Forwarding


unit

Clock cycle 6
(cont.): Clock 6

after<2> after<1> add $9, $4, $2 or $4, . . . and $4, . . .


Hazard

lw $2, 20($1) detection ID/EX.MemRead


unit ID/EX

and $4, $2, $5


10 10
WB
IF/IDWrite

EX/MEM

or $4, $4, $2
M 10
Control u M WB
x MEM/WB
0

add $9, $4, $2


1
IF/ID EX M WB

$4
PCWrite

M
Instruction

u
x
Registers
Instruction 4 Data
PC ALU
memory memory M
$2
u
M x
u
x

4
2

M 4 4
9 u
x
ID/EX.RegisterRt Forwarding

Clock cycle 7
unit

Clock 7
Control (or Branch) Hazards
Problem with branches in the pipeline we have so far is that the
branch decision is not made till the MEM stage – so what
instructions, if at all, should we insert into the pipeline following
the branch instructions?

Possible solution: stall the pipeline till branch decision is known


not efficient, slow the pipeline significantly!

Another solution: predict the branch outcome


e.g., always predict branch-not-taken – continue with next
sequential instructions
if the prediction is wrong have to flush the pipeline behind the
branch – discard instructions already fetched or decoded – and
continue execution at the branch target
Predicting Branch-not-taken:
Misprediction delay
Program Time (in clock cycles)
execution CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9
order
(in instructions)

40 beq $1, $3, 7 IM Reg DM Reg

44 and $12, $2, $5 IM Reg DM Reg

48 or $13, $6, $2 IM Reg DM Reg

52 add $14, $2, $2 IM Reg DM Reg

72 lw $4, 50($7) IM Reg DM Reg

The outcome of branch taken (prediction wrong) is decided only when


beq is in the MEM stage, so the following three sequential instructions
already in the pipeline have to be flushed and execution resumes at lw
Optimizing the Pipeline to Reduce
Branch Delay
Move the branch decision from the MEM stage (as in our
current pipeline) earlier to the ID stage
calculating the branch target address involves moving the branch
adder from the MEM stage to the ID stage – inputs to this adder,
the PC value and the immediate fields are already available in
the IF/ID pipeline register
calculating the branch decision is efficiently done, e.g., for
equality test, by XORing respective bits and then ORing all the
results and inverting, rather than using the ALU to subtract and
then test for zero (when there is a carry delay)
with the more efficient equality test we can put it in the ID stage
without significantly lengthening this stage – remember an objective
of pipeline design is to keep pipeline stages balanced
we must correspondingly make additions to the forwarding and
hazard detection units to forward to or stall the branch at the ID
stage in case the branch decision depends on an earlier result
Flushing on Misprediction
Same strategy as for stalling on load-use data hazard…
Zero out all the control values (or the instruction itself) in
pipeline registers for the instructions following the branch that
are already in the pipeline – effectively turning them into nops
– so they are flushed
in the optimized pipeline, with branch decision made in the ID
stage, we have to flush only one instruction in the IF stage – the
branch delay penalty is then only one clock cycle
Optimized Datapath for Branch
IF.Flush

Hazard
detection
unit IF.Flush control zeros out the
instruction in the IF/ID
M ID/EX
u
x
pipeline register (which follows the
WB
EX/MEM
M
Control
0
u
x
M
branch) WB
MEM/WB

IF/ID EX M WB

4 Shift
left 2
M
u
x
Registers =
Instruction Data
PC ALU
memory memory M
u
M x
u
x

Sign
extend

M
u
x
Forwarding
unit

Branch decision is moved from the MEM stage to the ID stage – simplified drawing not showing
enhancements to the forwarding and hazard detection units
and $12, $2, $5 beq $1, $3, 7 sub $10, $4, $8 before<1> before<2>

IF.Flush

Hazard
detection
unit
72 ID/EX
M

PipelinedBranch
u
48 x WB
EX/MEM
M
Control u M WB
x MEM/WB
28
0
IF/ID EX M WB
48 44 72

4
$1
Shift M $4
left 2 u
x
=
Registers
Instruction Data
PC ALU
memory memory M
72 44 $3
u
M $8 x
7 u
x

Execution Sign
extend

example: 10

Forwarding

Clock cycle 3 unit

36 sub $10, $4, $8 Clock 3

40 beq $1, $3, 7 lw $4, 50($7) bubble (nop) beq $1, $3, 7 sub $10, . . . before<1>

44 and $12 $2, $5 IF.Flush

48 or $13 $2, $6
Hazard
detection
unit
ID/EX

52 add $14, $4, $2


M
u
76 x WB
EX/MEM

56 slt $15, $6, $7


M
Control u M WB
x MEM/WB
0


IF/ID EX M WB
76 72

72 lw $4, 50($7) Shift


left 2
=
M
u
x
$1

Registers
Instruction Data
PC ALU
memory memory M
76 72
u
M $3 x

Optimized pipeline with


u
x

only one bubble as a result


Sign
extend

of the taken branch 10

Forwarding

Clock cycle 4
unit

Clock 4
Simple Example: Comparing
Performance
Compare performance for single-cycle, multicycle, and pipelined
datapaths using the gcc instruction mix
assume 2 ns for memory access, 2 ns for ALU operation, 1 ns
for register read or write
assume gcc instruction mix 23% loads, 13% stores, 19%
branches, 2% jumps, 43% ALU
for pipelined execution assume
50% of the loads are followed immediately by an instruction that
uses the result of the load
25% of branches are mispredicted
branch delay on misprediction is 1 clock cycle
jumps always incur 1 clock cycle delay so their average time is 2
clock cycles
Simple Example: Comparing
Performance
Single-cycle (p. 373): average instruction time 8 ns
Multicycle (p. 397): average instruction time 8.04 ns
Pipelined:
loads use 1 cc (clock cycle) when no load-use dependency and 2 cc
when there is dependency – given 50% of loads are followed by
dependency the average cc per load is 1.5
stores use 1 cc each
branches use 1 cc when predicted correctly and 2 cc when not –
given 25% misprediction average cc per branch is 1.25
jumps use 2 cc each
ALU instructions use 1 cc each
therefore, average CPI is
1.5 × 23% + 1 × 13% + 1.25 × 19% + 2 × 2% + 1 × 43% = 1.18
therefore, average instruction time is 1.18 × 2 = 2.36 ns

You might also like