Pipelining

Enhancing Performance with Pipelining
Pipelining
- Start work ASAP!! Do not waste time!
[Figure: laundry analogy with loads A to D between 6 PM and 2 AM; top panel "Not pipelined", bottom panel "Pipelined".]
Assume each task (wash, dry, fold, store) takes 30 min. and that separate tasks use separate hardware and so can be overlapped.
Pipelined vs. Single-Cycle
Instruction Execution: the Plan
[Figure: three instructions, lw $1, 100($0); lw $2, 200($0); lw $3, 300($0), each passing through instruction fetch, Reg, ALU, data access and Reg. Top panel, "Single-cycle": a new instruction starts every 8 ns. Bottom panel, "Pipelined": a new instruction starts every 2 ns.]
Assume 2 ns for memory access and ALU operation, and 1 ns for register access; therefore the single-cycle clock is 8 ns and the pipelined clock cycle is 2 ns.
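As a rough check of these numbers, the sketch below computes the total time for n instructions under the slide's delays. It assumes no hazards or stalls, and the mapping of delays to stages (IF, EX, MEM at 2 ns; ID, WB at 1 ns) is our reading of the assumption above. The function names are illustrative.

```python
# Stage delays assumed on the slide: 2 ns for memory access and ALU work,
# 1 ns for register-file access (this mapping of stages is our reading).
STAGE_NS = {"IF": 2, "ID": 1, "EX": 2, "MEM": 2, "WB": 1}


def single_cycle_time(n_instructions: int) -> int:
    """Each instruction uses one clock as long as all five steps together."""
    clock_ns = sum(STAGE_NS.values())  # 8 ns
    return n_instructions * clock_ns


def pipelined_time(n_instructions: int) -> int:
    """Clock set by the slowest stage; the first instruction needs one cycle
    per stage, then one more instruction completes every cycle (no stalls)."""
    clock_ns = max(STAGE_NS.values())  # 2 ns
    return (len(STAGE_NS) + n_instructions - 1) * clock_ns


if __name__ == "__main__":
    for n in (3, 1_000_000):
        s, p = single_cycle_time(n), pipelined_time(n)
        print(f"{n} instructions: single-cycle {s} ns, pipelined {p} ns, "
              f"speedup {s / p:.3f}")
```

For the three lw instructions this gives 24 ns single-cycle vs. 14 ns pipelined. A single instruction alone takes 10 ns pipelined vs. 8 ns single-cycle, and for a long run the speedup approaches 8/2 = 4 rather than 5, because the 1 ns register stages are stretched to the 2 ns clock.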
Pipelining: Keep in Mind
- Pipelining does not reduce the latency of a single task; it increases the throughput of the entire workload
- Pipeline rate is limited by the longest stage
  - potential speedup = number of pipe stages
  - unbalanced lengths of pipe stages reduce speedup
- Time to fill the pipeline and time to drain it, when there is slack in the pipeline, reduces speedup
Example Problem
- Problem: for the laundry, fill in the following table when
  1. the stage lengths are 30, 30, 30, 30 min., respectively
  2. the stage lengths are 20, 20, 60, 20 min., respectively
Person | Unpipelined finish time | Pipeline 1 finish time | Ratio unpipelined to pipeline 1 | Pipeline 2 finish time | Ratio unpipelined to pipeline 2
1 | | | | |
2 | | | | |
3 | | | | |
4 | | | | |
- Come up with a formula for pipeline speed-up!
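One way to check a filled-in table is to simulate the laundry directly. The sketch below assumes each station starts a load as soon as both the load and the station are free (a finished load may wait in a basket, and there is no common clock); a clocked pipeline would instead run every stage at the pace of the slowest one. The function names are illustrative.

```python
from typing import List


def finish_times(stages: List[int], n_loads: int) -> List[int]:
    """Finish time of each load when every stage is a separate station that
    starts a load as soon as both the load and the station are free."""
    station_free = [0] * len(stages)  # time each station next becomes free
    finished = []
    for _ in range(n_loads):
        t = 0  # all loads are ready at time 0 and queue at the first station
        for s, length in enumerate(stages):
            t = max(t, station_free[s]) + length
            station_free[s] = t
        finished.append(t)
    return finished


def unpipelined_finish_times(stages: List[int], n_loads: int) -> List[int]:
    """One load at a time: the next load starts when the previous one is stored."""
    return [sum(stages) * k for k in range(1, n_loads + 1)]


def closed_form(stages: List[int], k: int) -> int:
    """Finish time of the k-th load (1-based) under the same assumption."""
    return sum(stages) + (k - 1) * max(stages)


if __name__ == "__main__":
    case1, case2 = [30, 30, 30, 30], [20, 20, 60, 20]
    un = unpipelined_finish_times(case1, 4)  # both cases total 120 min per load
    p1, p2 = finish_times(case1, 4), finish_times(case2, 4)
    for k in range(4):
        print(f"{k + 1} {un[k]:4} {p1[k]:4} {un[k] / p1[k]:.2f} "
              f"{p2[k]:4} {un[k] / p2[k]:.2f}")
```

Under this assumption the k-th load finishes at sum(stages) + (k - 1) * max(stages), so for n loads the speed-up is n * sum / (sum + (n - 1) * max). As n grows this approaches sum / max: 4 for the balanced case, but only 2 for 20, 20, 60, 20.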

Pipelining MIPS
- What makes it easy with MIPS?
  - all instructions are the same length
    - so the fetch and decode stages are similar for all instructions
  - just a few instruction formats
    - simplifies instruction decode and makes it possible in one stage
  - memory operands appear only in loads/stores
    - so memory access can be deferred to exactly one later stage
  - operands are aligned in memory
    - one data transfer instruction requires one memory access stage
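The first two points can be seen in a few lines of code: since every instruction is one 32-bit word with its fields at fixed bit positions, the fields can be extracted with the same shifts and masks before the opcode is known. The sketch below uses the standard MIPS field layout; the function name and dictionary keys are illustrative.

```python
def decode(word: int) -> dict:
    """Split a 32-bit MIPS instruction word into its fixed-position fields.

    Every instruction is one 32-bit word and the fields sit at the same bit
    positions in every format, so all of these can be extracted in parallel,
    before the opcode says which of them are meaningful.
    """
    return {
        "opcode": (word >> 26) & 0x3F,  # bits 31-26, all formats
        "rs": (word >> 21) & 0x1F,      # bits 25-21, R- and I-format
        "rt": (word >> 16) & 0x1F,      # bits 20-16, R- and I-format
        "rd": (word >> 11) & 0x1F,      # bits 15-11, R-format only
        "shamt": (word >> 6) & 0x1F,    # bits 10-6, R-format only
        "funct": word & 0x3F,           # bits 5-0, R-format only
        "imm": word & 0xFFFF,           # bits 15-0, I-format only
    }


if __name__ == "__main__":
    print(decode(0x8C010064))  # lw $1, 100($0)
    print(decode(0x012A4020))  # add $t0, $t1, $t2
```

For example, lw $1, 100($0) encodes as 0x8C010064 (opcode 35, rs 0, rt 1, immediate 100), and add $t0, $t1, $t2 as 0x012A4020 (rs 9, rt 10, rd 8, funct 32).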
Pipelining MIPS
- What makes it hard?
  - structural hazards: different instructions, at different stages in the pipeline, want to use the same hardware resource
  - control hazards: which instruction to put into the pipeline next depends on the outcome of a previous branch instruction already in the pipeline
  - data hazards: an instruction in the pipeline requires data to be computed by a previous instruction still in the pipeline
- Before actually building the pipelined datapath and control, we first briefly examine these potential hazards individually…
Structural Hazards
- Structural hazard: inadequate hardware to simultaneously support all instructions in the pipeline in the same clock cycle
- E.g., suppose a single (not separate) instruction and data memory, with one read port, in the pipeline below
  - then there is a structural hazard between the first and fourth lw instructions
[Figure: four lw instructions issued 2 ns apart; the data access of the first lw and the instruction fetch of the fourth lw occur in the same cycle, labeled "Hazard if single memory".]
- MIPS was designed to be pipelined: structural hazards are easy to avoid!
Control Hazards
- Control hazard: need to make a decision based on the result of a previous instruction still executing in the pipeline
- Solution 1: stall the pipeline
[Figure: add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0). The beq is fetched 2 ns after the add, but the lw is fetched 4 ns after the beq, one bubble later. Caption: "Pipeline stall". Note on the figure: the branch outcome is computed in the ID stage with added hardware (later…).]
Control Hazards
- Solution 2: predict the branch outcome
  - e.g., predict branch-not-taken:
[Figure, top, "Prediction success": add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0) are fetched 2 ns apart with no delay.]
[Figure, bottom: after add $4, $5, $6 and beq $1, $2, 40, the wrongly fetched lw becomes a bubble, and the branch target or $7, $8, $9 is fetched 4 ns after the beq.]
Prediction failure: undo (=flush) lw
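A back-of-the-envelope way to compare Solutions 1 and 2 is the average CPI. The sketch assumes an ideal CPI of 1 and a penalty of one cycle per stalled or flushed branch (as in the figures, where the branch is resolved in ID); the branch and taken fractions in the example are hypothetical inputs, not measurements, and the function names are illustrative.

```python
from fractions import Fraction


def cpi_always_stall(branch_frac, penalty=1):
    """Solution 1: every branch waits `penalty` cycles for its outcome."""
    return 1 + branch_frac * penalty


def cpi_predict_not_taken(branch_frac, taken_frac, penalty=1):
    """Solution 2: only taken branches (mispredictions) flush `penalty` cycles."""
    return 1 + branch_frac * taken_frac * penalty


if __name__ == "__main__":
    # Hypothetical workload: 1 instruction in 5 is a branch, half are taken.
    b, t = Fraction(1, 5), Fraction(1, 2)
    print("always stall:", cpi_always_stall(b))
    print("predict not taken:", cpi_predict_not_taken(b, t))
```

With these made-up inputs, stalling on every branch gives a CPI of 6/5, while predicting not-taken gives 11/10, since only the taken half of the branches pays the penalty.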
Control Hazards
- Solution 3: delayed branch: always execute the sequentially next instruction, with the branch taking effect after a one-instruction delay
  - it is the compiler's job to find an instruction to put in the slot that is independent of the branch outcome
- MIPS does this, but it is an option in SPIM (Simulator -> Settings)
[Figure: beq $1, $2, 40; add $4, $5, $6 (delayed branch slot); lw $3, 300($0), fetched 2 ns apart. Caption: the delayed branch beq is followed by an add that is independent of the branch outcome.]
Data Hazards
- Data hazard: an instruction needs data from the result of a previous instruction still executing in the pipeline
- Solution: forward data if possible…
[Figure: instruction pipeline diagram of add $s0, $t0, $t1 through IF ID EX MEM WB; shading indicates use (left half = write, right half = read).]
[Figure: add $s0, $t0, $t1 followed by sub $t2, $s0, $t3. Without forwarding (blue line) the data has to go back in time; with forwarding (red line) the data is available in time.]
Data Hazards
- Forwarding may not be enough
  - e.g., if an R-type instruction following a load uses the result of the load; this is called a load-use data hazard
[Figure, top: lw $s0, 20($t1) followed immediately by sub $t2, $s0, $t3. Without a stall it is impossible to provide input to the sub instruction in time.]
[Figure, bottom: the same pair with a bubble between them. With a one-stage stall, forwarding can get the data to the sub instruction in time.]
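The timing argument behind these figures fits in a small formula. The sketch counts cycle offsets from the producer's IF: with forwarding, a result is usable from the end of EX (ALU result) or MEM (load), and the consumer needs it at the start of its own EX. The no-forwarding case assumes the register file is written in the first half of a cycle and read in the second half, as the shading convention above suggests. The function name is illustrative.

```python
# Cycle offsets of the five stages, counted from an instruction's IF cycle.
IF, ID, EX, MEM, WB = range(5)


def stalls_needed(distance: int, producer_is_load: bool,
                  forwarding: bool = True) -> int:
    """Bubbles needed before a consumer fetched `distance` cycles after its
    producer can receive the producer's result (ideal pipeline otherwise)."""
    if forwarding:
        # Result exists at the end of EX (ALU) or MEM (load); the consumer's
        # EX cycle, distance + EX + stalls, must come strictly later.
        ready = MEM if producer_is_load else EX
        return max(0, ready + 1 - (distance + EX))
    # No forwarding: the consumer's ID (register read, second half of the
    # cycle) may not come before the producer's WB (write, first half).
    return max(0, WB - (distance + ID))


if __name__ == "__main__":
    print(stalls_needed(1, producer_is_load=False))  # add $s0 -> sub
    print(stalls_needed(1, producer_is_load=True))   # lw $s0 -> sub
```

For add $s0 followed immediately by sub, stalls_needed(1, False) is 0; for lw $s0 followed by sub, stalls_needed(1, True) is 1, the one-stage stall in the lower figure.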

Reordering Code to Avoid
Pipeline Stall (Software Solution)
- Example:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)     # data hazard: uses $t2 loaded by the previous lw
sw $t0, 4($t1)
- Reordered code:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t0, 4($t1)     # interchanged
sw $t2, 0($t1)     # interchanged
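The effect of the reordering can be checked mechanically. The sketch below counts a load-use stall whenever an instruction reads the register written by the load immediately before it; it treats the register stored by sw as a source, since it must be read too. The Instr record and the function name are illustrative, not from the text.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str
    is_load: bool
    dest: Optional[str]      # register written, None for sw
    srcs: Tuple[str, ...]    # registers read (sw reads base and data registers)


def load_use_stalls(program: List[Instr]) -> int:
    """Count 1-cycle stalls: an instruction reads the register loaded by the
    instruction right before it, the case forwarding alone cannot cover."""
    return sum(1 for prev, cur in zip(program, program[1:])
               if prev.is_load and prev.dest in cur.srcs)


ORIGINAL = [
    Instr("lw $t0, 0($t1)", True, "$t0", ("$t1",)),
    Instr("lw $t2, 4($t1)", True, "$t2", ("$t1",)),
    Instr("sw $t2, 0($t1)", False, None, ("$t1", "$t2")),
    Instr("sw $t0, 4($t1)", False, None, ("$t1", "$t0")),
]
REORDERED = [ORIGINAL[0], ORIGINAL[1], ORIGINAL[3], ORIGINAL[2]]

if __name__ == "__main__":
    print("original:", load_use_stalls(ORIGINAL), "stall(s)")
    print("reordered:", load_use_stalls(REORDERED), "stall(s)")
```

The original order costs one stall (sw $t2 right after lw $t2); after the interchange, every loaded register is read at least two instructions after its load, so forwarding covers it without a stall.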
Pipelined Datapath
- We now move to actually building a pipelined datapath
- First recall the 5 steps in instruction execution
  1. Instruction Fetch & PC Increment (IF)
  2. Instruction Decode and Register Read (ID)
  3. Execution or Address Calculation (EX)
  4. Memory Access (MEM)
  5. Write Result into Register (WB)
- Review: single-cycle processor
  - all 5 steps done in a single clock cycle
  - dedicated hardware required for each step
- What happens if we break the execution into multiple cycles, but keep the extra hardware?
Review - Single-Cycle Datapath
[Figure: the single-cycle datapath (PC, instruction memory, register file, sign extend, ALU, shift left 2, adders, data memory and multiplexors) divided into five "steps": IF (Instruction Fetch), ID (Instruction Decode), EX (Execute/Address Calc.), MEM (Memory Access), WB (Write Back).]
Pipelined Datapath – Key Idea
- What happens if we break the execution into multiple cycles, but keep the extra hardware?
  - Answer: we may be able to start executing a new instruction at each clock cycle: pipelining
- …but we shall need extra registers to hold data between cycles: pipeline registers
Pipelined Datapath
[Figure: the datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB inserted between the stages. The pipeline registers are wide enough to hold the data coming in: 64, 128, 97 and 64 bits, respectively.]

Pipelined Datapath
[Figure: the same pipelined datapath with IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers.]
Only data flowing right to left may cause a hazard…, why?
Bug in the Datapath
[Figure: the pipelined datapath as drawn so far; the register file's write register number (WN) is taken from the instruction currently in the IF/ID register.]
Write register number comes from another, later instruction!
Corrected Datapath
[Figure: the corrected pipelined datapath; the 5-bit destination register number travels with the instruction and returns from MEM/WB to the register file's write register input (WN). Pipeline register widths: IF/ID 64 bits, ID/EX 133 bits, EX/MEM 102 bits, MEM/WB 69 bits.]
Destination register number is also passed through the ID/EX, EX/MEM and MEM/WB registers, which are now wider by 5 bits
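The quoted widths are sums of the fields each register must carry. The slides give only the totals, so the field breakdown below is an assumption chosen to match them (32-bit values plus the 1-bit Zero flag); the extra 5 bits model the destination register number added by the fix. Names are illustrative.

```python
# Assumed field breakdown; the slides state only the totals.
FIELDS = {
    "IF/ID": {"PC + 4": 32, "instruction": 32},
    "ID/EX": {"PC + 4": 32, "read data 1": 32, "read data 2": 32,
              "sign-extended immediate": 32},
    "EX/MEM": {"branch target": 32, "Zero": 1, "ALU result": 32,
               "read data 2": 32},
    "MEM/WB": {"memory read data": 32, "ALU result": 32},
}
DEST_REG_BITS = 5  # destination register number carried along after the fix


def widths(corrected: bool) -> dict:
    """Total bits per pipeline register, before or after the bug fix."""
    result = {}
    for name, fields in FIELDS.items():
        bits = sum(fields.values())
        if corrected and name != "IF/ID":  # IF/ID already holds the instruction
            bits += DEST_REG_BITS
        result[name] = bits
    return result


if __name__ == "__main__":
    print(widths(corrected=False))  # totals quoted for the buggy datapath
    print(widths(corrected=True))   # totals quoted for the corrected datapath
```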
Pipelined Example
- Consider the following instruction sequence:
lw $t0, 10($t1)
sw $t3, 20($t4)
add $t5, $t6, $t7
sub $t8, $t9, $t10
Single-Clock-Cycle Diagram (stage occupied in each clock cycle):
Clock cycle | IF | ID | EX | MEM | WB
1 | LW | | | |
2 | SW | LW | | |
3 | ADD | SW | LW | |
4 | SUB | ADD | SW | LW |
5 | | SUB | ADD | SW | LW
6 | | | SUB | ADD | SW
7 | | | | SUB | ADD
8 | | | | | SUB
Alternative View –
Multiple-Clock-Cycle Diagram
Instruction | CC 1 | CC 2 | CC 3 | CC 4 | CC 5 | CC 6 | CC 7 | CC 8
lw $t0, 10($t1) | IM | REG | ALU | DM | REG | | |
sw $t3, 20($t4) | | IM | REG | ALU | DM | REG | |
add $t5, $t6, $t7 | | | IM | REG | ALU | DM | REG |
sub $t8, $t9, $t10 | | | | IM | REG | ALU | DM | REG
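Both diagrams follow one rule: with no stalls, instruction i (counting from 0) is in stage s during clock cycle i + s + 1. The sketch below builds the single-clock-cycle table from that rule; the names are illustrative.

```python
from typing import Dict, List

STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def occupancy(program: List[str]) -> Dict[int, Dict[str, str]]:
    """Map clock cycle -> {stage: instruction} for an ideal pipeline (no
    stalls): instruction i (from 0) is in stage s during cycle i + s + 1."""
    table: Dict[int, Dict[str, str]] = {}
    for i, instr in enumerate(program):
        for s, stage in enumerate(STAGES):
            table.setdefault(i + s + 1, {})[stage] = instr
    return table


if __name__ == "__main__":
    table = occupancy(["LW", "SW", "ADD", "SUB"])
    print("CC  " + "".join(f"{s:<5}" for s in STAGES))
    for cc in sorted(table):
        print(f"{cc:<4}" + "".join(f"{table[cc].get(s, ''):<5}" for s in STAGES))
```

For n instructions the table spans n + 4 cycles; here the 4 instructions finish in cycle 8.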

Notes
- One significant difference in the execution of an R-type instruction between the multicycle and pipelined implementations:
  - register write-back for the R-type instruction is in the 5th (the last, write-back) pipeline stage vs. the 4th stage for the multicycle implementation. Why?
    - think of structural hazards when writing to the register file…
- Worth repeating: the essential difference between the pipeline and multicycle implementations is the insertion of pipeline registers to decouple the 5 stages
- The CPI of an ideal pipeline (no stalls) is 1. Why?
- The RaVi Architecture Visualization Project of Dortmund U. has pipeline simulations; see the link on our Additional Resources page
- As we develop control for the pipeline, keep in mind that the text does not consider jump; it should not be too hard to implement!
Recall Single-Cycle Control –
the Datapath
[Figure: the single-cycle datapath with the main Control unit (input Instruction [31-26]; outputs RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite) and the ALU control unit (inputs ALUOp and Instruction [5-0]). Instruction [25-21] and [20-16] select the registers to read, a RegDst multiplexor chooses Instruction [20-16] or [15-11] as the write register, Instruction [15-0] is sign-extended from 16 to 32 bits, and PCSrc selects between PC + 4 and the branch target.]
Recall Single-Cycle – ALU Control
Instruction opcode | ALUOp | Instruction operation | Funct field | Desired ALU action | ALU control input
LW | 00 | load word | xxxxxx | add | 010
SW | 00 | store word | xxxxxx | add | 010
Branch eq | 01 | branch eq | xxxxxx | subtract | 110
R-type | 10 | add | 100000 | add | 010
R-type | 10 | subtract | 100010 | subtract | 110
R-type | 10 | AND | 100100 | and | 000
R-type | 10 | OR | 100101 | or | 001
R-type | 10 | set on less than | 101010 | set on less than | 111

ALUOp1 | ALUOp0 | F5 | F4 | F3 | F2 | F1 | F0 | Operation
0 | 0 | X | X | X | X | X | X | 010
0 | 1 | X | X | X | X | X | X | 110
1 | X | X | X | 0 | 0 | 0 | 0 | 010
1 | X | X | X | 0 | 0 | 1 | 0 | 110
1 | X | X | X | 0 | 1 | 0 | 0 | 000
1 | X | X | X | 0 | 1 | 0 | 1 | 001
1 | X | X | X | 1 | 0 | 1 | 0 | 111
Truth table for ALU control bits
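The truth table translates directly into a small function. The sketch below mirrors its rows: it checks ALUOp1 first and, for R-type instructions, looks only at F3 to F0, as the table does. The function name is illustrative, and funct values outside the table are rejected.

```python
# ALU control inputs for R-type instructions, keyed by funct bits F3..F0.
R_TYPE = {
    0b0000: 0b010,  # add  (funct 100000)
    0b0010: 0b110,  # sub  (funct 100010)
    0b0100: 0b000,  # and  (funct 100100)
    0b0101: 0b001,  # or   (funct 100101)
    0b1010: 0b111,  # slt  (funct 101010)
}


def alu_control(alu_op: int, funct: int) -> int:
    """3-bit ALU control input from the 2-bit ALUOp and the 6-bit funct field."""
    if alu_op & 0b10:            # ALUOp1 = 1: R-type, decided by F3..F0
        low = funct & 0b1111
        if low not in R_TYPE:
            raise ValueError(f"funct {funct:06b} is not in the truth table")
        return R_TYPE[low]
    if alu_op & 0b01:            # ALUOp = 01: beq compares by subtracting
        return 0b110
    return 0b010                 # ALUOp = 00: lw/sw add to form the address
```

For instance, alu_control(0b10, 0b101010) gives 0b111 (set on less than), while alu_control(0b00, funct) gives 0b010 whatever the funct bits are.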
Recall Single-Cycle – Control Signals
Effect of control bits
Signal name | Effect when deasserted | Effect when asserted
RegDst | The register destination number for the Write register comes from the rt field (bits 20-16) | The register destination number for the Write register comes from the rd field (bits 15-11)
RegWrite | None | The register on the Write register input is written with the value on the Write data input
ALUSrc | The second ALU operand comes from the second register file output (Read data 2) | The second ALU operand is the sign-extended, lower 16 bits of the instruction
PCSrc | The PC is replaced by the output of the adder that computes the value of PC + 4 | The PC is replaced by the output of the adder that computes the branch target
MemRead | None | Data memory contents designated by the address input are put on the Read data output
MemWrite | None | Data memory contents designated by the address input are replaced by the value on the Write data input
MemtoReg | The value fed to the register Write data input comes from the ALU | The value fed to the register Write data input comes from the data memory

Determining control bits
Instruction | RegDst | ALUSrc | MemtoReg | RegWrite | MemRead | MemWrite | Branch | ALUOp1 | ALUOp0
R-format | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0
lw | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0
sw | X | 1 | X | 0 | 0 | 1 | 0 | 0 | 0
beq | X | 0 | X | 0 | 0 | 0 | 1 | 0 | 1
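The table of control bits is just a lookup keyed by opcode. The sketch below uses the standard MIPS opcodes for these four instructions (R-format 0, lw 35, sw 43, beq 4) and represents don't-care (X) entries as None; the names are illustrative.

```python
SIGNALS = ("RegDst", "ALUSrc", "MemtoReg", "RegWrite", "MemRead",
           "MemWrite", "Branch", "ALUOp1", "ALUOp0")

# Rows of the table above; None marks a don't-care (X) entry.
CONTROL = {
    0: (1, 0, 0, 1, 0, 0, 0, 1, 0),           # R-format
    35: (0, 1, 1, 1, 1, 0, 0, 0, 0),          # lw
    43: (None, 1, None, 0, 0, 1, 0, 0, 0),    # sw
    4: (None, 0, None, 0, 0, 0, 1, 0, 1),     # beq
}


def main_control(opcode: int) -> dict:
    """Single-cycle main control: opcode (Instruction [31-26]) -> signals."""
    if opcode not in CONTROL:
        raise ValueError(f"opcode {opcode} is not handled in this sketch")
    return dict(zip(SIGNALS, CONTROL[opcode]))
```

The don't-cares are safe because sw and beq have RegWrite = 0, so neither the chosen write register (RegDst) nor the write data source (MemtoReg) ever matters for them.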
Pipeline Control
- Initial design: motivated by single-cycle datapath control, use the same control signals
- Observe:
  - No separate write signal for the PC, as it is written every cycle
  - No separate write signals for the pipeline registers, as they are written every cycle
    (Will be modified by the hazard detection unit!!)
  - No separate read signal for instruction memory, as it is read every clock cycle
  - No separate read signal for the register file, as it is read every clock cycle
- Need to set control signals during each pipeline stage
- Since control signals are associated with components active during a single pipeline stage, we can group control lines into five groups according to pipeline stage
Pipelined Datapath with Control I
[Figure: the pipelined datapath with the single-cycle control signals added (PCSrc, Branch, RegWrite, MemWrite, MemRead, MemtoReg, ALUSrc, ALUOp, RegDst, and the ALU control unit fed by Instruction [15-0]). Caption: same control signals as the single-cycle datapath.]
Pipeline Control Signals
- There are five stages in the pipeline
  - instruction fetch / PC increment: nothing to control, as instruction memory read and PC write are always enabled
  - instruction decode / register fetch
  - execution / address calculation
  - memory access
  - write back
Execution/address calculation stage control lines: RegDst, ALUOp1, ALUOp0, ALUSrc; memory access stage control lines: Branch, MemRead, MemWrite; write-back stage control lines: RegWrite, MemtoReg.
Instruction | RegDst | ALUOp1 | ALUOp0 | ALUSrc | Branch | MemRead | MemWrite | RegWrite | MemtoReg
R-format | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0
lw | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1
sw | X | 0 | 0 | 1 | 0 | 0 | 1 | 0 | X
beq | X | 0 | 1 | 0 | 1 | 0 | 0 | 0 | X
Pipeline Control
Implementation
- Pass control signals along just like the data: extend each pipeline register to hold the control bits needed by succeeding stages
[Figure: Control produces the WB, M and EX fields, which are stored in ID/EX; EX/MEM carries M and WB; MEM/WB carries WB.]
- Note: the 6-bit funct field of the instruction, required in the EX stage to generate ALU control, can be retrieved as the 6 least significant bits of the immediate field, which is sign-extended and passed from the IF/ID register to the ID/EX register
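A minimal sketch of passing control along, assuming the stage grouping from the pipeline control table and writing its don't-care (X) entries as 0: Control produces all three fields in ID, and at each clock edge a pipeline register keeps only the fields still needed downstream (EX, M, WB into ID/EX; M, WB into EX/MEM; WB into MEM/WB). Ctrl, DECODE, clock_edge and run are hypothetical names.

```python
from typing import List, NamedTuple, Optional, Tuple


class Ctrl(NamedTuple):
    ex: str  # RegDst, ALUOp1, ALUOp0, ALUSrc
    m: str   # Branch, MemRead, MemWrite
    wb: str  # RegWrite, MemtoReg


# Pipeline control table, with X (don't care) written as 0.
DECODE = {
    "R-format": Ctrl("1100", "000", "10"),
    "lw": Ctrl("0001", "010", "11"),
    "sw": Ctrl("0001", "001", "00"),
    "beq": Ctrl("0010", "100", "00"),
}

# Control portions of (ID/EX, EX/MEM, MEM/WB); None means empty.
State = Tuple[Optional[Ctrl], Optional[Tuple[str, str]], Optional[str]]


def clock_edge(in_id: Optional[str], state: State) -> State:
    """One clock edge: each pipeline register keeps only the control fields
    that its later stages still need."""
    id_ex, ex_mem, _ = state
    new_id_ex = DECODE[in_id] if in_id else None            # EX, M, WB
    new_ex_mem = (id_ex.m, id_ex.wb) if id_ex else None     # M, WB
    new_mem_wb = ex_mem[1] if ex_mem else None              # WB
    return new_id_ex, new_ex_mem, new_mem_wb


def run(decoded: List[Optional[str]]) -> List[State]:
    """Contents of (ID/EX, EX/MEM, MEM/WB) after each clock edge."""
    state: State = (None, None, None)
    history = []
    for instr in decoded + [None, None, None]:
        state = clock_edge(instr, state)
        history.append(state)
    return history


if __name__ == "__main__":
    for edge, regs in enumerate(run(["lw", "R-format"]), start=1):
        print(edge, regs)
```

After the third edge the lw's WB field (11, i.e. RegWrite and MemtoReg) has reached MEM/WB while the following R-type's M and WB fields sit in EX/MEM.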
Pipelined Datapath with Control II
[Figure: the pipelined datapath with control; Control writes the EX, M and WB fields into ID/EX, the M and WB fields move on to EX/MEM, and the WB field to MEM/WB. Caption: control signals emanate from the control portions of the pipeline registers.]
Pipelined Execution and Control
[Figures, clock cycles 1 to 9: the pipelined datapath with control executing the instruction sequence below; each figure shows the instruction in each stage, the register numbers and values being read, and the control fields held in the pipeline registers.]
Instruction sequence:
lw $10, 20($1)
sub $11, $2, $3
and $12, $4, $5
or $13, $6, $7
add $14, $8, $9
Clock | IF | ID | EX | MEM | WB
1 | lw $10, 20($1) | before<1> | before<2> | before<3> | before<4>
2 | sub $11, $2, $3 | lw $10, 20($1) | before<1> | before<2> | before<3>
3 | and $12, $4, $5 | sub $11, $2, $3 | lw $10, . . . | before<1> | before<2>
4 | or $13, $6, $7 | and $12, $4, $5 | sub $11, . . . | lw $10, . . . | before<1>
5 | add $14, $8, $9 | or $13, $6, $7 | and $12, . . . | sub $11, . . . | lw $10, . . .
6 | after<1> | add $14, $8, $9 | or $13, . . . | and $12, . . . | sub $11, . . .
7 | after<2> | after<1> | add $14, . . . | or $13, . . . | and $12, . . .
8 | after<3> | after<2> | after<1> | add $14, . . . | or $13, . . .
9 | after<4> | after<3> | after<2> | after<1> | add $14, . . .
Label "before<i>" means the ith instruction before lw; label "after<i>" means the ith instruction after add.
Control fields (EX, M, WB) written into ID/EX when an instruction is decoded: lw: 0001, 010, 11; R-type (sub, and, or, add): 1100, 000, 10.
Revisiting Hazards
- So far our datapath and control have ignored hazards
- We shall revisit data hazards and control hazards and enhance our datapath and control to handle them in hardware…
Data Hazards and Forwarding
- Problem with starting an instruction before previous ones are finished:
  - data dependencies that go backward in time, called data hazards
Instruction sequence:
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
$2 = 10 before sub; $2 = -20 after sub.
Time (in clock cycles) | CC 1 | CC 2 | CC 3 | CC 4 | CC 5 | CC 6 | CC 7 | CC 8 | CC 9
Value of register $2 | 10 | 10 | 10 | 10 | 10/-20 | -20 | -20 | -20 | -20
[Figure: multiple-clock-cycle diagram (IM, Reg, ALU, DM, Reg) of the sequence; the and and or read $2 in CC 3 and CC 4, before the sub writes it in CC 5, so their dependencies go backward in time.]
Software Solution
- Have the compiler guarantee that there are never any data hazards!
  - by rearranging instructions to insert independent instructions between instructions that would otherwise have a data hazard between them,
  - or, if such rearrangement is not possible, by inserting nops
Rearranged code | Or, with nops
sub $2, $1, $3 | sub $2, $1, $3
lw $10, 40($3) | nop
slt $5, $6, $7 | nop
and $12, $2, $5 | and $12, $2, $5
or $13, $6, $2 | or $13, $6, $2
add $14, $2, $2 | add $14, $2, $2
sw $15, 100($2) | sw $15, 100($2)
- Such compiler solutions may not always be possible, and nops slow the machine down
MIPS: nop = "no operation" = 00…0 (32 bits) = sll $0, $0, 0
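The nop-insertion rule can be sketched for a machine without forwarding, assuming the register file is written in the first half of a cycle and read in the second half: a consumer must then sit at least 3 instructions after its producer, so two adjacent dependent instructions need 2 nops, as in the right-hand column above. This is a greedy sketch that only pads; a compiler would first try to fill the slots with independent instructions. The Instr record and insert_nops are hypothetical helpers.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str
    dest: Optional[int]     # register number written, None if none
    srcs: Tuple[int, ...]   # register numbers read


NOP = Instr("nop", None, ())  # the all-zero word, i.e. sll $0, $0, 0


def insert_nops(program: List[Instr], min_gap: int = 3) -> List[Instr]:
    """Pad with nops so every reader sits at least `min_gap` slots after the
    most recent writer of each register it reads (no forwarding; $0 ignored)."""
    out: List[Instr] = []
    for instr in program:
        for reg in instr.srcs:
            if reg == 0:
                continue
            for back in range(1, min(min_gap, len(out) + 1)):
                if out[-back].dest == reg:   # nearest writer is too close
                    out.extend([NOP] * (min_gap - back))
                    break
        out.append(instr)
    return out


PROGRAM = [
    Instr("sub $2, $1, $3", 2, (1, 3)),
    Instr("and $12, $2, $5", 12, (2, 5)),
    Instr("or $13, $6, $2", 13, (6, 2)),
    Instr("add $14, $2, $2", 14, (2, 2)),
    Instr("sw $15, 100($2)", None, (2, 15)),
]

if __name__ == "__main__":
    for instr in insert_nops(PROGRAM):
        print(instr.text)
```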
Hardware Solution:
Forwarding
- Idea: use intermediate data; do not wait for the result to be finally written to the destination register. Two steps:
  1. Detect the data hazard
  2. Forward the intermediate data to resolve the hazard
Control II (as before)
[Figure: the pipelined datapath with control, as shown earlier; control signals emanate from the control portions of the pipeline registers.]
Hazard Detection
- Hazard conditions:
  1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
  1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
  2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
  2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
- E.g., in the earlier example, the first hazard, between sub $2, $1, $3 and and $12, $2, $5, is detected when the and is in the EX stage and the sub is in the MEM stage, because
  - EX/MEM.RegisterRd = ID/EX.RegisterRs = $2 (1a)
- Whether to forward also depends on:
  - whether the instruction in the later pipeline stage is going to write a register; if not, there is no need to forward, even if there is a register number match as in the conditions above
  - whether the destination register of that instruction is $0; in that case there is no need to forward the value ($0 is always 0 and never overwritten)

Data Forwarding Plan:
- allow inputs to the ALU not just from ID/EX, but also from later pipeline registers, and
- use multiplexors and control signals to choose the appropriate inputs to the ALU
Time (in clock cycles) | CC 1 | CC 2 | CC 3 | CC 4 | CC 5 | CC 6 | CC 7 | CC 8 | CC 9
Value of register $2 | 10 | 10 | 10 | 10 | 10/-20 | -20 | -20 | -20 | -20
Value of EX/MEM | X | X | X | -20 | X | X | X | X | X
Value of MEM/WB | X | X | X | X | -20 | X | X | X | X
[Figure: multiple-clock-cycle diagram of sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2), with the new value of $2 taken from EX/MEM and MEM/WB. Caption: dependencies between pipeline registers move forward in time.]
Forwarding Hardware
[Figure a, "No forwarding": the datapath before adding forwarding hardware; the ALU inputs come only from the ID/EX register.]
[Figure b, "With forwarding": the datapath after adding forwarding hardware; multiplexors controlled by ForwardA and ForwardB select each ALU input from ID/EX, EX/MEM or MEM/WB. The forwarding unit compares Rs and Rt with EX/MEM.RegisterRd and MEM/WB.RegisterRd, and a multiplexor chooses Rt or Rd as the destination register number.]
Forwarding Hardware:
Multiplexor Control
Mux control | Source | Explanation
ForwardA = 00 | ID/EX | The first ALU operand comes from the register file
ForwardA = 10 | EX/MEM | The first ALU operand is forwarded from the prior ALU result
ForwardA = 01 | MEM/WB | The first ALU operand is forwarded from data memory or an earlier ALU result
ForwardB = 00 | ID/EX | The second ALU operand comes from the register file
ForwardB = 10 | EX/MEM | The second ALU operand is forwarded from the prior ALU result
ForwardB = 01 | MEM/WB | The second ALU operand is forwarded from data memory or an earlier ALU result
(For MEM/WB, data memory or an earlier ALU result, depending on the selection in the rightmost multiplexor; see the datapath with control diagram.)
Data Hazard: Detection and
Forwarding
- The forwarding unit determines the multiplexor control according to the following rules:
1. EX hazard
if ( EX/MEM.RegWrite                                // if there is a write…
     and ( EX/MEM.RegisterRd ≠ 0 )                  // to a non-$0 register…
     and ( EX/MEM.RegisterRd = ID/EX.RegisterRs ) ) // which matches, then…
    ForwardA = 10
if ( EX/MEM.RegWrite                                // if there is a write…
     and ( EX/MEM.RegisterRd ≠ 0 )                  // to a non-$0 register…
     and ( EX/MEM.RegisterRd = ID/EX.RegisterRt ) ) // which matches, then…
    ForwardB = 10
Data Hazard: Detection and
Forwarding
2. MEM hazard
if ( MEM/WB.RegWrite                                // if there is a write…
     and ( MEM/WB.RegisterRd ≠ 0 )                  // to a non-$0 register…
     and ( EX/MEM.RegisterRd ≠ ID/EX.RegisterRs )   // and not already a register match
                                                    // with the earlier pipeline register…
     and ( MEM/WB.RegisterRd = ID/EX.RegisterRs ) ) // but a match with the later pipeline
                                                    // register, then…
    ForwardA = 01
if ( MEM/WB.RegWrite                                // if there is a write…
     and ( MEM/WB.RegisterRd ≠ 0 )                  // to a non-$0 register…
     and ( EX/MEM.RegisterRd ≠ ID/EX.RegisterRt )   // and not already a register match
                                                    // with the earlier pipeline register…
     and ( MEM/WB.RegisterRd = ID/EX.RegisterRt ) ) // but a match with the later pipeline
                                                    // register, then…
    ForwardB = 01
This check is necessary, e.g., for sequences such as add $1, $1, $2; add $1, $1, $3; add $1, $1, $4; (array summing?), where the earlier pipeline register (EX/MEM) holds more recent data than MEM/WB
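The rules above can be sketched as one function per ALU operand. This sketch tests the complete EX-hazard condition before the MEM-hazard condition, which is how it implements "not already a register match with the earlier pipeline register"; that is slightly stricter than comparing register numbers alone, since an EX/MEM instruction that does not write a register then cannot block forwarding from MEM/WB. PipeReg and the function names are illustrative.

```python
from typing import NamedTuple, Tuple


class PipeReg(NamedTuple):
    reg_write: bool  # RegWrite control bit held in the pipeline register
    rd: int          # destination register number carried along


def forward_select(src: int, ex_mem: PipeReg, mem_wb: PipeReg) -> str:
    """Mux control for one ALU operand whose register number is `src`
    (ID/EX.RegisterRs for ForwardA, ID/EX.RegisterRt for ForwardB)."""
    if ex_mem.reg_write and ex_mem.rd != 0 and ex_mem.rd == src:
        return "10"  # EX hazard: the most recent result, from EX/MEM
    if mem_wb.reg_write and mem_wb.rd != 0 and mem_wb.rd == src:
        return "01"  # MEM hazard, reached only if the EX rule did not fire
    return "00"      # no hazard: the value read from the register file


def forwarding_unit(rs: int, rt: int, ex_mem: PipeReg,
                    mem_wb: PipeReg) -> Tuple[str, str]:
    """Return (ForwardA, ForwardB) for the instruction in the EX stage."""
    return forward_select(rs, ex_mem, mem_wb), forward_select(rt, ex_mem, mem_wb)


if __name__ == "__main__":
    # or $4, $4, $2 in EX; and $4, ... in EX/MEM; sub $2, ... in MEM/WB
    print(forwarding_unit(4, 2, PipeReg(True, 4), PipeReg(True, 2)))
```

For the array-summing sequence, when the third add is in EX both EX/MEM and MEM/WB hold a new $1, and ForwardA is 10, so the most recent sum wins.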
Forwarding Hardware with Control
[Figure: the datapath with forwarding hardware and control wires. ID/EX carries Rs and Rt (from IF/ID.RegisterRs and IF/ID.RegisterRt) to the forwarding unit, which also receives EX/MEM.RegisterRd and MEM/WB.RegisterRd; a multiplexor selects Rt or Rd (IF/ID.RegisterRd) as the destination register number. Certain details, e.g., branching hardware, are omitted to simplify the drawing.]
Called the forwarding unit, not the hazard detection unit, because once data is forwarded there is no hazard!
Note: so far we have only handled forwarding to R-type instructions…!
Forwarding Execution Example
Instruction sequence:
sub $2, $1, $3
and $4, $2, $5
or $4, $4, $2
add $9, $4, $2
[Figures, clock cycles 3 to 6: the datapath with forwarding hardware executing the sequence.]
Clock | IF | ID | EX | MEM | WB | Forwarding to the instruction in EX
3 | or $4, $4, $2 | and $4, $2, $5 | sub $2, $1, $3 | before<1> | before<2> |
4 | add $9, $4, $2 | or $4, $4, $2 | and $4, $2, $5 | sub $2, . . . | before<1> | ForwardA = 10 ($2 from EX/MEM)
5 | after<1> | add $9, $4, $2 | or $4, $4, $2 | and $4, . . . | sub $2, . . . | ForwardA = 10 ($4 from EX/MEM), ForwardB = 01 ($2 from MEM/WB)
6 | after<2> | after<1> | add $9, $4, $2 | or $4, . . . | and $4, . . . | ForwardA = 10 ($4 from EX/MEM)
Data Hazards and Stalls
- Load word can still cause a hazard:
  - an instruction tries to read a register following a load instruction that writes to the same register
Instruction sequence:
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
[Figure: multiple-clock-cycle diagram (CC 1 to CC 9) of the sequence; the loaded value of $2 is available only at the end of CC 4, but the and needs it at the start of CC 4 for its ALU stage. Caption: as even a pipeline dependency goes backward in time, forwarding will not solve the hazard.]
- Therefore, we need a hazard detection unit to stall the pipeline after the load instruction
Control II (as before)
[Figure: the pipelined datapath with control, as shown earlier; control signals emanate from the control portions of the pipeline registers.]
Hazard Detection Logic to Stall
- The hazard detection unit implements the following check on whether to stall:
if ( ID/EX.MemRead                                        // if the instruction in the EX stage is a load…
     and ( ( ID/EX.RegisterRt = IF/ID.RegisterRs )        // and the destination register
           or ( ID/EX.RegisterRt = IF/ID.RegisterRt ) ) ) // matches either source register
                                                          // of the instruction in the ID stage, then…
    stall the pipeline
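The stall check is a single Boolean expression. In the sketch below its result drives the three actions of a 1-cycle stall (hold the PC, hold IF/ID, zero the control fields entering ID/EX); the "PCWrite" and "IF/IDWrite" keys mirror the signal names in the hazard detection unit figure, while the function name and the "stall" and "ZeroControls" keys are hypothetical.

```python
def hazard_detection(id_ex_mem_read: bool, id_ex_rt: int,
                     if_id_rs: int, if_id_rt: int) -> dict:
    """Load-use check on the instruction in EX (ID/EX) and the one in ID (IF/ID)."""
    stall = id_ex_mem_read and (id_ex_rt == if_id_rs or id_ex_rt == if_id_rt)
    return {
        "stall": stall,
        "PCWrite": not stall,      # hold the PC: refetch the same instruction
        "IF/IDWrite": not stall,   # hold IF/ID: decode the same instruction again
        "ZeroControls": stall,     # zero the control fields written into ID/EX
    }
```

Like the rule above, this compares against both source fields of the instruction in ID, so it may stall even when that instruction does not actually use its rt field as a source; that costs a cycle but never produces a wrong result.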
Mechanics of Stalling
- If the stall check is true, the pipeline needs to stall only 1 clock cycle after the load, as after that the forwarding unit can resolve the dependency
- What the hardware does to stall the pipeline 1 cycle:
  - does not let the IF/ID register change (disable write!); this causes the instruction in the ID stage to repeat, i.e., stall
  - therefore, the instruction just behind it, in the IF stage, must be stalled as well, so the hardware does not let the PC change (disable write!); this causes the instruction in the IF stage to repeat, i.e., stall
  - changes all the EX, MEM and WB control fields in the ID/EX pipeline register to 0, so effectively the instruction just behind the load becomes a nop; a bubble is said to have been inserted into the pipeline
    - note that we cannot turn that instruction into a nop by zeroing all the bits in the instruction itself (recall nop = 00…0 (32 bits)), because it has already been decoded and its control signals generated
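The stall mechanics can be checked with a small cycle-by-cycle simulation. The sketch models only which instruction occupies each stage: when the load-use check fires, the PC and IF/ID are held and a bubble (zeroed controls) enters EX. It assumes no branches and that forwarding resolves everything after the 1-cycle stall; the Instr record, BUBBLE and simulate are hypothetical names.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str
    is_load: bool
    dest: Optional[int]    # register written (rt for a load)
    srcs: Tuple[int, ...]  # registers read


BUBBLE = Instr("bubble", False, None, ())  # zeroed control fields: a nop


def simulate(program: List[Instr], cycles: int) -> List[Dict[str, Optional[str]]]:
    """Stage contents per clock cycle, with the 1-cycle load-use stall."""
    def fetch(pc: int) -> Optional[Instr]:
        return program[pc] if pc < len(program) else None

    pc = 0
    stage: Dict[str, Optional[Instr]] = {
        "IF": fetch(0), "ID": None, "EX": None, "MEM": None, "WB": None}
    trace = []
    for _ in range(cycles):
        trace.append({k: (v.text if v else None) for k, v in stage.items()})
        ex, in_id = stage["EX"], stage["ID"]
        stall = (ex is not None and ex.is_load and in_id is not None
                 and ex.dest in in_id.srcs)
        nxt = {"WB": stage["MEM"], "MEM": stage["EX"]}
        if stall:
            nxt["EX"] = BUBBLE        # controls into ID/EX set to 0
            nxt["ID"] = stage["ID"]   # IF/IDWrite = 0: same instruction decoded again
            nxt["IF"] = stage["IF"]   # PCWrite = 0: same instruction fetched again
        else:
            nxt["EX"] = stage["ID"]
            nxt["ID"] = stage["IF"]
            pc += 1
            nxt["IF"] = fetch(pc)
        stage = nxt
    return trace


PROGRAM = [
    Instr("lw $2, 20($1)", True, 2, (1,)),
    Instr("and $4, $2, $5", False, 4, (2, 5)),
    Instr("or $8, $2, $6", False, 8, (2, 6)),
    Instr("add $9, $4, $2", False, 9, (4, 2)),
    Instr("slt $1, $6, $7", False, 1, (6, 7)),
]

if __name__ == "__main__":
    for cc, row in enumerate(simulate(PROGRAM, 10), start=1):
        print(cc, row)
```

For this sequence the and sits in ID in cycles 3 and 4, a bubble is in EX in cycle 4, and the slt writes back in cycle 10, one cycle later than without the hazard.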
Hazard Detection Unit
[Figure: the datapath with forwarding hardware plus a hazard detection unit. Its inputs are ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs and IF/ID.RegisterRt; its outputs are PCWrite, IF/IDWrite and the select of a multiplexor that puts 0s, instead of the Control outputs, into the control fields of ID/EX.]
Datapath with forwarding hardware, the hazard detection unit and control wires; certain details, e.g., branching hardware, are omitted to simplify the drawing.
Stalling Resolves a Hazard
 Same instruction sequence as before, for which forwarding by itself could not resolve the hazard:
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
[Figure: pipeline diagram over clock cycles CC 1 to CC 10 (stages IM, Reg, ALU, DM, Reg) showing a bubble inserted after lw; and repeats its register-read (ID) stage and or repeats its instruction-fetch (IF) stage]
The hazard detection unit inserts a 1-cycle bubble into the pipeline, after which all dependences point forward in time through the pipeline registers, so the forwarding unit can handle them and no hazards remain.
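Counting cycles makes the cost of the bubble concrete. Assuming a 5-stage pipeline with full forwarding, where only a load immediately followed by a reader of its result needs a 1-cycle stall, the sketch below counts the bubbles and the cycles until the last instruction finishes WB. The parser handles only the R-type and lw formats used here; parse, count_load_use_bubbles and total_cycles are hypothetical helpers.

```python
import re
from typing import List, Tuple


def parse(instr: str) -> Tuple[str, int, List[int]]:
    """Return (opcode, destination, sources) for two MIPS formats only:
    R-type 'op $rd, $rs, $rt' and 'lw $rt, offset($rs)'."""
    op, operands = instr.split(maxsplit=1)
    regs = [int(r) for r in re.findall(r"\$(\d+)", operands)]
    return op.lower(), regs[0], regs[1:]


def count_load_use_bubbles(program: List[str]) -> int:
    """One bubble for each lw immediately followed by a reader of its result."""
    bubbles = 0
    for prev, cur in zip(program, program[1:]):
        op, dest, _ = parse(prev)
        _, _, srcs = parse(cur)
        if op == "lw" and dest in srcs:
            bubbles += 1
    return bubbles


def total_cycles(program: List[str], stages: int = 5) -> int:
    """Cycles until the last instruction completes WB: fill the pipeline,
    then one completion per cycle, plus one cycle per bubble."""
    if not program:
        return 0
    return stages + (len(program) - 1) + count_load_use_bubbles(program)


program = ["lw $2, 20($1)", "and $4, $2, $5", "or $8, $2, $6",
           "add $9, $4, $2", "slt $1, $6, $7"]
print(count_load_use_bubbles(program), total_cycles(program))  # 1 10
```

The result, 10 cycles, matches the CC 1 to CC 10 span of the diagram: 9 cycles for five instructions in a 5-stage pipeline plus one bubble.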
Stalling Execution Example
Instruction sequence: lw $2, 20($1); and $4, $2, $5; or $4, $4, $2; add $9, $4, $2
(before<i> denotes the ith instruction before lw; after<i> denotes the ith instruction after add)

Clock cycle  IF               ID               EX               MEM              WB
2            and $4, $2, $5   lw $2, 20($1)    before<1>        before<2>        before<3>
3            or $4, $4, $2    and $4, $2, $5   lw $2, 20($1)    before<1>        before<2>
4            or $4, $4, $2    and $4, $2, $5   bubble           lw $2, 20($1)    before<1>
5            add $9, $4, $2   or $4, $4, $2    and $4, $2, $5   bubble           lw $2, 20($1)
6            after<1>         add $9, $4, $2   or $4, $4, $2    and $4, $2, $5   bubble
7            after<2>         after<1>         add $9, $4, $2   or $4, $4, $2    and $4, $2, $5

 In clock cycle 3 the hazard detection unit sees a load in EX (ID/EX.MemRead = 1) whose destination, $2, matches a source register of the and in ID, so it holds the PC and IF/ID and zeroes the control fields entering ID/EX
 In clock cycle 4 the bubble occupies EX while and and or repeat their ID and IF stages; from then on the forwarding unit resolves the remaining dependences
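The table can be regenerated by a cycle-by-cycle simulation of the five stage registers. This is a sketch under simplifying assumptions: instructions carry hand-written destination and source registers, later stages never stall, and the only hazard handled is the load-use case detected above; Instr, load_use_hazard and simulate are hypothetical names.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str
    is_load: bool = False
    dest: Optional[int] = None
    srcs: Tuple[int, ...] = ()


BUBBLE = Instr("bubble")


def load_use_hazard(ex: Optional[Instr], idd: Optional[Instr]) -> bool:
    """A load in EX writes a register read by the instruction in ID."""
    return bool(ex and idd and ex.is_load and ex.dest in idd.srcs)


def simulate(initial: List[Optional[Instr]], to_fetch: List[Instr],
             first: int, last: int) -> List[Tuple[int, List[str]]]:
    """Return (cycle, [IF, ID, EX, MEM, WB] texts) for cycles first..last."""
    stages = list(initial)  # [IF, ID, EX, MEM, WB] in cycle `first`
    fetch = iter(to_fetch)
    rows = []
    for cycle in range(first, last + 1):
        rows.append((cycle, [s.text if s else "-" for s in stages]))
        if_s, id_s, ex_s, mem_s, _ = stages
        if load_use_hazard(ex_s, id_s):
            # PC and IF/ID hold, so IF and ID repeat; a bubble enters EX
            stages = [if_s, id_s, BUBBLE, ex_s, mem_s]
        else:
            stages = [next(fetch, None), if_s, id_s, ex_s, mem_s]
    return rows


lw = Instr("lw $2, 20($1)", is_load=True, dest=2, srcs=(1,))
and_ = Instr("and $4, $2, $5", dest=4, srcs=(2, 5))
or_ = Instr("or $4, $4, $2", dest=4, srcs=(4, 2))
add = Instr("add $9, $4, $2", dest=9, srcs=(4, 2))
before = [Instr(f"before<{i}>") for i in (1, 2, 3)]
after = [Instr(f"after<{i}>") for i in (1, 2)]

# Pipeline contents in clock cycle 2, then simulate through cycle 7
rows = simulate([and_, lw, *before], [or_, add, *after], first=2, last=7)
for cycle, row in rows:
    print(cycle, " | ".join(f"{t:<15}" for t in row))
```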
Control (or Branch) Hazards
 The problem with branches in our current pipeline is that the branch decision is not made until the MEM stage – so which instructions, if any, should be fetched into the pipeline after the branch?
 Possible solution: stall the pipeline until the branch decision is known
 not efficient: it slows the pipeline significantly!
 Another solution: predict the branch outcome
 e.g., always predict branch-not-taken – continue with the next sequential instructions
 if the prediction is wrong, flush the pipeline behind the branch (discard the instructions already fetched or decoded) and continue execution at the branch target
Predicting Branch-not-taken:
Misprediction delay
Time (in clock cycles) runs across; program execution order (in instructions) runs down:
                    CC 1  CC 2  CC 3  CC 4  CC 5  CC 6  CC 7  CC 8  CC 9
40 beq $1, $3, 7    IM    Reg   ALU   DM    Reg
44 and $12, $2, $5        IM    Reg   ALU   DM    Reg
48 or $13, $6, $2               IM    Reg   ALU   DM    Reg
52 add $14, $2, $2                    IM    Reg   ALU   DM    Reg
72 lw $4, 50($7)                            IM    Reg   ALU   DM    Reg
The branch outcome (taken, so the not-taken prediction is wrong) is decided only when beq is in the MEM stage, so the three sequential instructions already in the pipeline behind it have to be flushed, and execution resumes at lw
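Under predict-not-taken, one sequential instruction is fetched behind the branch in every cycle until the branch resolves, so the number flushed on a taken branch equals the number of stages between IF and the resolving stage. A minimal Python sketch, assuming one fetch per cycle, 4-byte instructions and no other stalls (flushed_on_taken is a hypothetical helper):

```python
from typing import List

STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def flushed_on_taken(branch_pc: int, resolve_stage: str) -> List[int]:
    """Addresses fetched sequentially behind a branch under predict-not-taken.
    If the branch is taken, these are flushed: one per cycle between the
    branch's fetch and the stage where its outcome becomes known."""
    fetched_behind = STAGES.index(resolve_stage)  # IF -> 0, ID -> 1, ..., MEM -> 3
    return [branch_pc + 4 * k for k in range(1, fetched_behind + 1)]


print(flushed_on_taken(40, "MEM"))  # [44, 48, 52]: 3-cycle penalty
print(flushed_on_taken(40, "ID"))   # [44]: 1-cycle penalty
```

Moving the decision from MEM to ID shrinks the flush from three instructions to one, which is the optimization the following slides describe.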
Optimizing the Pipeline to
Reduce Branch Delay
 Move the branch decision from the MEM stage (as in our current pipeline) earlier, to the ID stage
 calculating the branch target address involves moving the branch adder from the MEM stage to the ID stage – the inputs to this adder, the PC value and the immediate field, are already available in the IF/ID pipeline register
 the branch decision can be computed efficiently, e.g., for an equality test, by XORing the corresponding bits of the two registers, ORing all the results and inverting, rather than using the ALU to subtract and then test for zero (subtraction suffers a carry-propagation delay)
 with this more efficient equality test we can put the decision in the ID stage without significantly lengthening that stage – remember that an objective of pipeline design is to keep pipeline stages balanced
 we must correspondingly extend the forwarding and hazard detection units to forward to, or stall, the branch in the ID stage in case the branch decision depends on an earlier result
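The two pieces of ID-stage branch hardware described above can be sketched in Python: the branch adder, which adds the sign-extended, word-aligned offset to the PC + 4 value held in IF/ID, and an XOR/OR/invert equality test that needs no carry chain. The function names are hypothetical, and the loop only models the wide OR gate, which hardware evaluates in parallel.

```python
MASK32 = 0xFFFF_FFFF


def sign_extend16(imm: int) -> int:
    """Interpret the low 16 bits of imm as a two's-complement number."""
    imm &= 0xFFFF
    return imm - 0x1_0000 if imm & 0x8000 else imm


def branch_target(pc_plus_4: int, imm16: int) -> int:
    """Branch adder moved to ID: (PC + 4) + (sign-extended offset << 2)."""
    return (pc_plus_4 + (sign_extend16(imm16) << 2)) & MASK32


def regs_equal(a: int, b: int, width: int = 32) -> bool:
    """Equality without subtraction: XOR bit pairs, OR the results, invert."""
    diff = (a ^ b) & ((1 << width) - 1)  # bit i is 1 where a and b differ
    any_diff = 0
    for i in range(width):               # models the wide OR gate
        any_diff |= (diff >> i) & 1
    return any_diff == 0                 # the final inverter


# beq $1, $3, 7 at address 40: IF/ID holds PC + 4 = 44
print(branch_target(44, 7))                # 72
print(regs_equal(5, 5), regs_equal(5, 4))  # True False
```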
Flushing on Misprediction
 The same strategy as for stalling on a load-use data hazard:
 zero out all the control values (or the instruction itself) in the pipeline registers of the instructions that follow the branch and are already in the pipeline – effectively turning them into nops – so they are flushed
 in the optimized pipeline, with the branch decision made in the ID stage, only the one instruction in the IF stage has to be flushed – the branch delay penalty is then only one clock cycle
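Flushing the instruction in IF/ID works because an all-zero word decodes as sll $0, $0, 0 (opcode 0, funct 0), which writes register $0 and therefore changes nothing. The sketch below assumes IF.Flush simply selects a zero word in place of the fetched one; if_id_next and r_fields are hypothetical names, and the encoding of and $12, $2, $5 is worked out by hand.

```python
from typing import Dict

MASK32 = 0xFFFF_FFFF


def if_id_next(fetched_word: int, if_flush: bool) -> int:
    """Instruction latched into IF/ID at the next clock edge. When IF.Flush
    is asserted (taken branch decided in ID), the wrongly fetched word is
    replaced with all zeros."""
    return 0 if if_flush else fetched_word & MASK32


def r_fields(word: int) -> Dict[str, int]:
    """Split a 32-bit MIPS word into its R-format fields."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "shamt": (word >> 6) & 0x1F,
        "funct": word & 0x3F,
    }


AND_12_2_5 = 0x00456024  # and $12, $2, $5: opcode 0, rs 2, rt 5, rd 12, funct 0x24
print(r_fields(AND_12_2_5))
print(r_fields(if_id_next(AND_12_2_5, if_flush=True)))  # every field is 0: sll $0, $0, 0
```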
Optimized Datapath for Branch
[Figure: optimized datapath for branches. The branch decision (register comparison "=", shift-left-2 and branch adder) is moved from the MEM stage to the ID stage; the IF.Flush control zeros out the instruction in the IF/ID pipeline register (which follows the branch). Simplified drawing, not showing the enhancements to the forwarding and hazard detection units.]
Pipelined Branch Execution Example
Instruction sequence (address, instruction):
36 sub $10, $4, $8
40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $2, $6
52 add $14, $4, $2
56 slt $15, $6, $7
…
72 lw $4, 50($7)

Clock cycle  IF                ID                EX                MEM               WB
3            and $12, $2, $5   beq $1, $3, 7     sub $10, $4, $8   before<1>         before<2>
4            lw $4, 50($7)     bubble (nop)      beq $1, $3, 7     sub $10, $4, $8   before<1>

 In clock cycle 3 the beq is in ID: the branch adder computes the target 44 + (7 × 4) = 44 + 28 = 72 from the PC + 4 value held in IF/ID, and the register comparison finds the branch taken
 IF.Flush then zeros out the and in IF/ID, so in clock cycle 4 the lw at address 72 is fetched and a single bubble follows the branch
Optimized pipeline with only one bubble as a result of the taken branch
Simple Example: Comparing
Performance
 Compare performance for single-cycle, multicycle, and pipelined
datapaths using the gcc instruction mix
 assume 2 ns for memory access, 2 ns for ALU operation, 1 ns
for register read or write
 assume the gcc instruction mix: 23% loads, 13% stores, 19% branches, 2% jumps, 43% ALU
 for pipelined execution assume
 50% of the loads are followed immediately by an instruction that uses the result of the load
 25% of branches are mispredicted
 branch delay on misprediction is 1 clock cycle
 jumps always incur a 1 clock cycle delay, so their average time is 2 clock cycles
Simple Example: Comparing
Performance
 Single-cycle (p. 373): average instruction time 8 ns
 Multicycle (p. 397): average instruction time 8.04 ns
 Pipelined:
 loads use 1 cc (clock cycle) when there is no load-use dependency and 2 cc when there is one – given that 50% of loads are followed by a dependent instruction, the average cc per load is 1.5
 stores use 1 cc each
 branches use 1 cc when predicted correctly and 2 cc when not – given 25% misprediction, the average cc per branch is 1.25
 jumps use 2 cc each
 ALU instructions use 1 cc each
 therefore, the average CPI is
$$\text{CPI} = 1.5 \times 23\% + 1 \times 13\% + 1.25 \times 19\% + 2 \times 2\% + 1 \times 43\% \approx 1.18$$
 therefore, the average instruction time is $1.18 \times 2\ \text{ns} = 2.36\ \text{ns}$
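The CPI arithmetic above can be checked in a few lines of Python. The sketch recomputes the per-class CPIs from the stated assumptions; the variable names are hypothetical, and the single-cycle and multicycle times are copied from the slide rather than derived.

```python
# gcc instruction mix from the example (fractions of all instructions)
mix = {"load": 0.23, "store": 0.13, "branch": 0.19, "jump": 0.02, "alu": 0.43}

load_use_rate = 0.50    # loads immediately followed by a user of the result
mispredict_rate = 0.25  # branches that are mispredicted
penalty = 1             # cycles lost per load-use stall, misprediction or jump

cpi_by_class = {
    "load": 1 + load_use_rate * penalty,      # 1.5
    "store": 1,
    "branch": 1 + mispredict_rate * penalty,  # 1.25
    "jump": 1 + penalty,                      # 2
    "alu": 1,
}

cpi = sum(mix[c] * cpi_by_class[c] for c in mix)  # 1.1825; the slide rounds to 1.18
clock_ns = 2.0  # pipelined clock: longest stage (memory access or ALU operation)
pipelined_ns = cpi * clock_ns

# Single-cycle and multicycle times are taken from the slide, not computed here
for name, t in [("single-cycle", 8.0), ("multicycle", 8.04), ("pipelined", pipelined_ns)]:
    print(f"{name:<12} {t:6.3f} ns   speedup over single-cycle {8.0 / t:.2f}x")
```

Without rounding, the CPI is 1.1825 and the instruction time 2.365 ns; the slide rounds the CPI to 1.18 first, which gives 2.36 ns.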
