11/27/2010 1
Enhancing Performance
with Pipelining
Pipelining
Start work ASAP!! Do not waste time!
[Figure: laundry timeline, not pipelined. Task order (A, B, ...) against time, 6 PM to 2 AM.]
Assume 30 min. each task – wash, dry, fold, store – and that
separate tasks use separate hardware and so can be overlapped
[Figure: laundry timeline, pipelined. Task order (A, B, ..., D) against time, 6 PM to 2 AM.]
Pipelined vs. Single-Cycle
Instruction Execution: the Plan
[Figure: program execution order against time for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Single-cycle: each instruction (instruction fetch, Reg, ALU, data access, Reg) takes 8 ns, so successive instructions start 8 ns apart. Pipelined: each of the five stages takes 2 ns, and successive instructions start 2 ns apart.]
Pipelining: Keep in Mind
Pipelining does not reduce the latency of a single
task; it increases the throughput of the entire workload
Pipeline rate is limited by the longest stage
potential speedup = number of pipe stages
unbalanced lengths of pipe stages reduce
speedup
Time to fill the pipeline and time to drain it, when
there is slack in the pipeline, reduce
speedup
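The fill and drain effect can be checked with a little arithmetic. The sketch below is only an illustration: it takes the 8 ns single-cycle time and the five 2 ns stages from the lw example, and the function names are made up for this note.

```python
def single_cycle_time_ns(n_instructions, cycle_ns=8):
    """Every instruction occupies one long clock cycle."""
    return n_instructions * cycle_ns


def pipelined_time_ns(n_instructions, n_stages=5, stage_ns=2):
    """The first instruction needs n_stages cycles to fill the pipeline;
    each later instruction completes one cycle after its predecessor."""
    return (n_stages + n_instructions - 1) * stage_ns


for n in (3, 1000):
    single = single_cycle_time_ns(n)
    piped = pipelined_time_ns(n)
    print(n, single, piped, round(single / piped, 2))
```

For three instructions the fill time dominates and the speedup is well below the number of stages. For long runs it approaches 8 / 2 = 4 rather than 5, because the stages are unbalanced: each gets the 2 ns of the longest stage.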
Example Problem
Problem: for the laundry fill in the following table when
1. the stage lengths are 30, 30, 30, 30 min., resp.
2. the stage lengths are 20, 20, 60, 20 min., resp.
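One way to work such a problem is to simulate the loads moving through the stages. This is a sketch under stated assumptions, since the table itself is not reproduced here: four loads are assumed, a finished load may wait in a basket until the next machine is free, and the function names are hypothetical.

```python
def sequential_minutes(stage_minutes, n_loads):
    """Not pipelined: each load finishes every stage before the next load starts."""
    return n_loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, n_loads):
    """Pipelined: a load enters a stage as soon as it has left the previous
    stage and the stage's hardware is free."""
    stage_free_at = [0] * len(stage_minutes)
    for _ in range(n_loads):
        ready = 0  # time at which this load leaves the previous stage
        for s, length in enumerate(stage_minutes):
            start = max(ready, stage_free_at[s])
            stage_free_at[s] = start + length
            ready = stage_free_at[s]
    return stage_free_at[-1]


for stages in ([30, 30, 30, 30], [20, 20, 60, 20]):
    seq = sequential_minutes(stages, 4)
    pipe = pipelined_minutes(stages, 4)
    print(stages, seq, pipe, round(seq / pipe, 2))
```

Both stage sets total 120 minutes per load, so the sequential time is the same. The unbalanced set pipelines worse because the 60-minute stage sets the rate.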
Pipelining MIPS
What makes it hard?
structural hazards: different instructions, at different stages
in the pipeline, want to use the same hardware resource
control hazards: the next instruction to put into the pipeline
depends on the outcome of a previous branch instruction that is
already in the pipeline
data hazards: an instruction in the pipeline requires data that is
still to be computed by a previous instruction in the pipeline
[Figure: structural hazard. Pipelined execution of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0); lw $4, 400($0), with 2 ns stages (instruction fetch, Reg, ALU, data access, Reg). Hazard if there is a single memory: the data access of the first lw coincides with the instruction fetch of the fourth.]
[Figure: pipeline stall on a branch. Sequence: add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0). A bubble delays the fetch of lw, which starts 4 ns after beq instead of 2 ns. Note that the branch outcome is computed in the ID stage with added hardware (later…).]
Control Hazards
Solution 2 Predict branch outcome
e.g., predict branch-not-taken :
[Figure: prediction success. Sequence: add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0). The lw is fetched 2 ns after beq and no time is lost.]
[Figure: prediction failure. Sequence: add $4, $5, $6; beq $1, $2, 40; or $7, $8, $9. The lw fetched after the branch is undone (= flushed) and becomes a bubble; or starts 4 ns after beq.]
Control Hazards
Solution 3 Delayed branch: always execute the sequentially next
statement with the branch executing after one instruction delay –
compiler’s job to find a statement that can be put in the slot that is
independent of branch outcome
MIPS does this – but it is an option in SPIM (Simulator -> Settings)
[Figure: delayed-branch pipeline timing with 2 ns stages; the surviving row shows lw $3, 300($0).]
Instruction pipeline diagram:
add $s0, $t0, $t1 IF ID EX MEM WB
shade indicates use: left = write, right = read
[Figure: add $s0, $t0, $t1 followed by sub $t2, $s0, $t3, each passing through IF ID EX MEM WB. Without forwarding (blue line) data has to go back in time; with forwarding (red line) data is available in time.]
Data Hazards
Forwarding may not be enough
e.g., if an R-type instruction following a load uses the result of the load –
called load-use data hazard
[Figure: lw $s0, 20($t1) followed immediately by sub $t2, $s0, $t3, each passing through IF ID EX MEM WB. Without a stall it is impossible to provide input to the sub instruction in time.]
[Figure: the same sequence with a bubble between lw and sub. With a one-stage stall, forwarding can get the data to the sub instruction in time.]
Reordered code (the two sw instructions are interchanged):
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t0, 4($t1)
sw $t2, 0($t1)
Pipelined Datapath
We now move to actually building a pipelined datapath
First recall the 5 steps in instruction execution
1. Instruction Fetch & PC Increment (IF)
2. Instruction Decode and Register Read (ID)
3. Execution or calculate address (EX)
4. Memory access (MEM)
5. Write result into register (WB)
Review: single-cycle processor
all 5 steps done in a single clock cycle
dedicated hardware required for each step
What happens if we break the execution into multiple cycles, but keep
the extra hardware?
Review - Single-Cycle Datapath
[Figure: single-cycle datapath (PC, instruction memory, register file, sign extend, ALU, data memory, multiplexors) divided into the five “steps”: IF (Instruction Fetch), ID (Instruction Decode), EX (Execute/Address Calc.), MEM (Memory Access), WB (Write Back).]
Pipelined Datapath – Key Idea
What happens if we break the execution into
multiple cycles, but keep the extra hardware?
Answer: We may be able to start executing a new
instruction at each clock cycle - pipelining
…but we shall need extra registers to hold data
between cycles – pipeline registers
Pipelined Datapath
[Figure: pipelined datapath with a pipeline register between each pair of stages. Pipeline register widths: IF/ID 64 bits, ID/EX 128 bits, EX/MEM 97 bits, MEM/WB 64 bits.]
[Figure: pipelined datapath in which the 5-bit write register number (WN) is also carried through the pipeline registers. Pipeline register widths: IF/ID 64 bits, ID/EX 133 bits, EX/MEM 102 bits, MEM/WB 69 bits.]
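The pipeline register widths in the datapath figures (64, 128, 97 and 64 bits, or 64, 133, 102 and 69 bits once the write register number is carried along) can be reproduced by adding up the values that cross each stage boundary. The field breakdown below is an assumption that happens to match the stated totals; it is not copied from the figures.

```python
# Assumed contents of each pipeline register (bits per field).
FIELDS = {
    "IF/ID": {"PC+4": 32, "instruction": 32},
    "ID/EX": {"PC+4": 32, "read data 1": 32, "read data 2": 32,
              "sign-extended immediate": 32},
    "EX/MEM": {"branch target": 32, "Zero": 1, "ALU result": 32,
               "read data 2": 32},
    "MEM/WB": {"memory read data": 32, "ALU result": 32},
}
WRITE_REGISTER_BITS = 5  # register numbers are 5 bits wide


def register_widths(carry_write_register):
    """Total width of each pipeline register, optionally carrying the
    write register number past the ID stage."""
    widths = {}
    for name, fields in FIELDS.items():
        total = sum(fields.values())
        if carry_write_register and name != "IF/ID":
            total += WRITE_REGISTER_BITS
        widths[name] = total
    return widths


print(register_widths(False))
print(register_widths(True))
```

IF/ID does not grow because the register number is still part of the 32-bit instruction at that point.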
Single-Clock-Cycle Diagrams: Clock Cycles 1 to 8
Instruction occupying each stage of the pipelined datapath in each clock cycle (sequence LW, SW, ADD, SUB; "-" marks an empty stage):
Clock cycle   IF    ID    EX    MEM   WB
1             LW    -     -     -     -
2             SW    LW    -     -     -
3             ADD   SW    LW    -     -
4             SUB   ADD   SW    LW    -
5             -     SUB   ADD   SW    LW
6             -     -     SUB   ADD   SW
7             -     -     -     SUB   ADD
8             -     -     -     -     SUB
Alternative View –
Multiple-Clock-Cycle Diagram
[Figure: multiple-clock-cycle diagram with time axis CC 1 to CC 8; the surviving row shows lw $t0, 10($t1) passing through IM, REG, ALU, DM, REG.]
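The two views carry the same information: with no stalls, the instruction issued i-th (counting from 0) is in stage s during clock cycle i + s + 1. A minimal sketch that builds the per-cycle view for the LW, SW, ADD, SUB sequence of the single-clock-cycle diagrams (the function name is hypothetical):

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def occupancy(instructions):
    """Return one dict per clock cycle mapping each stage to the
    instruction occupying it ('' if the stage is empty)."""
    n_cycles = len(instructions) + len(STAGES) - 1
    table = []
    for cycle in range(1, n_cycles + 1):
        row = {}
        for s, stage in enumerate(STAGES):
            i = cycle - 1 - s  # index of the instruction in this stage
            row[stage] = instructions[i] if 0 <= i < len(instructions) else ""
        table.append(row)
    return table


for cycle, row in enumerate(occupancy(["LW", "SW", "ADD", "SUB"]), start=1):
    print(cycle, [row[stage] or "-" for stage in STAGES])
```

Four instructions take 4 + 5 - 1 = 8 cycles. For n instructions the count is n + 4, so cycles per instruction approach 1 as n grows.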
Notes
One significant difference in the execution of an R-type instruction
between multicycle and pipelined implementations:
register write-back for the R-type instruction is the 5th (the last
write-back) pipeline stage vs. the 4th stage for the multicycle
implementation. Why?
think of structural hazards when writing to the register file…
Worth repeating: the essential difference between the pipeline
and multicycle implementations is the insertion of pipeline
registers to decouple the 5 stages
The CPI of an ideal pipeline (no stalls) is 1. Why?
The RaVi Architecture Visualization Project of Dortmund U. has
pipeline simulations – see link in our Additional Resources page
As we develop control for the pipeline keep in mind that the text
does not consider jump – should not be too hard to implement!
Recall Single-Cycle Control –
the Datapath
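[Figure: single-cycle datapath with control. The Control unit decodes Instruction [31-26] and produces RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc and RegWrite; the ALU control also takes Instruction [5-0]; PCSrc selects between PC + 4 and the branch target (adder fed by shift left 2).]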
Recall Single-Cycle – ALU Control
Instruction   ALUOp   Instruction     Funct    Desired       ALU control
opcode                operation       field    ALU action    input
LW            00      load word       xxxxxx   add           010
SW            00      store word      xxxxxx   add           010
Branch eq     01      branch eq       xxxxxx   subtract      110
R-type        10      add             100000   add           010
R-type        10      subtract        100010   subtract      110
R-type        10      AND             100100   and           000
R-type        10      OR              100101   or            001
R-type        10      set on less     101010   set on less   111
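The table can be read as a two-level decode: ALUOp decides on its own for loads, stores and branches, and defers to the funct field for R-type instructions. A minimal sketch of that lookup, using only the encodings listed in the table (the function name is hypothetical):

```python
# funct field -> 3-bit ALU control input, for R-type instructions only
FUNCT_TO_CONTROL = {
    0b100000: 0b010,  # add
    0b100010: 0b110,  # subtract
    0b100100: 0b000,  # AND
    0b100101: 0b001,  # OR
    0b101010: 0b111,  # set on less
}


def alu_control(alu_op, funct=0):
    """Map the 2-bit ALUOp (plus funct for R-type) to the ALU control input."""
    if alu_op == 0b00:  # LW / SW: add to form the address
        return 0b010
    if alu_op == 0b01:  # branch eq: subtract and test Zero
        return 0b110
    if alu_op == 0b10:  # R-type: funct field selects the operation
        return FUNCT_TO_CONTROL[funct]
    raise ValueError("ALUOp value not listed in the table")


print(format(alu_control(0b10, 0b101010), "03b"))
```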
Effect of each control signal when deasserted (0) and when asserted (1):
RegDst
  deasserted: The register destination number for the Write register comes from the rt field (bits 20-16)
  asserted: The register destination number for the Write register comes from the rd field (bits 15-11)
RegWrite
  deasserted: None
  asserted: The register on the Write register input is written with the value on the Write data input
ALUSrc
  deasserted: The second ALU operand comes from the second register file output (Read data 2)
  asserted: The second ALU operand is the sign-extended, lower 16 bits of the instruction
PCSrc
  deasserted: The PC is replaced by the output of the adder that computes the value of PC + 4
  asserted: The PC is replaced by the output of the adder that computes the branch target
MemRead
  deasserted: None
  asserted: Data memory contents designated by the address input are put on the first Read data output
MemWrite
  deasserted: None
  asserted: Data memory contents designated by the address input are replaced by the value of the Write data input
MemtoReg
  deasserted: The value fed to the register Write data input comes from the ALU
  asserted: The value fed to the register Write data input comes from the data memory
Pipelined Datapath with Control I
[Figure: pipelined datapath with the control signals PCSrc, Branch, RegWrite, MemWrite and RegDst marked at the points where they act.]
Pipeline Control Signals
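[Figure: the Control unit decodes the instruction into three groups of control signals, EX, M and WB. All three are stored in ID/EX, the M and WB groups are passed on to EX/MEM, and the WB group is passed on to MEM/WB.]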
Note: The 6-bit funct field of the instruction required in the EX stage
to generate ALU control can be retrieved as the 6 least significant
bits of the immediate field which is sign-extended and passed from
the IF/ID register to the ID/EX register
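A sketch of the two ideas on this slide: the control bits travel in EX, M and WB groups, and the funct field can be recovered from the sign-extended immediate. The bit order within each group (EX: RegDst, ALUOp1, ALUOp0, ALUSrc; M: Branch, MemRead, MemWrite; WB: RegWrite, MemtoReg) and the sw and beq rows are assumptions based on the usual textbook settings, with don't-care bits written as 0. The lw and R-type rows match the values that appear in the execution example.

```python
# Control groups generated in ID and carried down the pipeline (assumed layout).
CONTROL = {
    "R-type": {"EX": "1100", "M": "000", "WB": "10"},
    "lw":     {"EX": "0001", "M": "010", "WB": "11"},
    "sw":     {"EX": "0001", "M": "001", "WB": "00"},
    "beq":    {"EX": "0010", "M": "100", "WB": "00"},
}


def sign_extend16(value):
    """Sign-extend a 16-bit value to 32 bits."""
    return value | 0xFFFF0000 if value & 0x8000 else value


def funct_from_immediate(instruction):
    """The funct field is the 6 least significant bits of the sign-extended
    immediate, so no separate funct field is needed in ID/EX."""
    return sign_extend16(instruction & 0xFFFF) & 0x3F


# sub $17, $2, $3 encoded by hand: opcode 0, rs=2, rt=3, rd=17, shamt 0, funct 100010
sub_word = (2 << 21) | (3 << 16) | (17 << 11) | 0b100010
print(format(funct_from_immediate(sub_word), "06b"))
```

Sign extension only replicates bit 15 upward, so the low 6 bits survive unchanged even when rd is 16 or more.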
Pipelined Datapath with Control II
[Figure: pipelined datapath with control. The Control unit in ID writes the EX, M and WB control fields into ID/EX; M and WB continue into EX/MEM and WB into MEM/WB. Signals shown: PCSrc, RegWrite, Branch, MemWrite, MemRead, ALUSrc, ALUOp, RegDst, MemtoReg; the ALU control takes the 6 low bits of the sign-extended Instruction [15-0].]
Pipelined Execution Example: Clocks 1 to 9
Instruction sequence: lw $10, 20($1); sub $11, $2, $3; and $12, $4, $5; or $13, $6, $7; add $14, $8, $9
Control fields written into ID/EX: lw: EX 0001, M 010, WB 11; R-type (sub, and, or, add): EX 1100, M 000, WB 10
Clock 1: IF: lw $10, 20($1) ID: before<1> EX: before<2> MEM: before<3> WB: before<4>
Clock 2: IF: sub $11, $2, $3 ID: lw $10, 20($1) EX: before<1> MEM: before<2> WB: before<3>
Clock 3: IF: and $12, $4, $5 ID: sub $11, $2, $3 EX: lw $10, . . . MEM: before<1> WB: before<2>
Clock 4: IF: or $13, $6, $7 ID: and $12, $4, $5 EX: sub $11, . . . MEM: lw $10, . . . WB: before<1>
Clock 5: IF: add $14, $8, $9 ID: or $13, $6, $7 EX: and $12, . . . MEM: sub $11, . . . WB: lw $10, . . .
Clock 6: IF: after<1> ID: add $14, $8, $9 EX: or $13, . . . MEM: and $12, . . . WB: sub $11, . . .
Clock 7: IF: after<2> ID: after<1> EX: add $14, . . . MEM: or $13, . . . WB: and $12, . . .
Clock 8: IF: after<3> ID: after<2> EX: after<1> MEM: add $14, . . . WB: or $13, . . .
Clock 9: IF: after<4> ID: after<3> EX: after<2> MEM: after<1> WB: add $14, . . .
[Figures: one datapath snapshot per clock cycle, showing the register numbers, register values and control fields held in IF/ID, ID/EX, EX/MEM and MEM/WB.]
Revisiting Hazards
So far our datapath and control have ignored
hazards
We shall revisit data hazards and control
hazards and enhance our datapath and control
to handle them in hardware…
Data Hazards and Forwarding
Problem with starting an instruction before previous are finished:
data dependencies that go backward in time – called data hazards
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the sequence sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2). The later instructions read $2, which sub writes.]
Software Solution
Have the compiler guarantee that there are never any data hazards!
by rearranging instructions to insert independent instructions
between instructions that would otherwise have a data hazard
between them,
or, if such rearrangement is not possible, by inserting nops
With independent instructions inserted:
sub $2, $1, $3
lw $10, 40($3)
slt $5, $6, $7
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
or, with nops inserted:
sub $2, $1, $3
nop
nop
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Such compiler solutions may not always be possible, and nops
slow the machine down
MIPS: nop = “no operation” = 00…0 (32 bits) = sll $0, $0, 0
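The nop-insertion pass can be sketched in a few lines. This is an illustration rather than a real compiler pass: instructions are hand-written (op, destination, sources) tuples, and it assumes that two instructions between a producer and a consumer are enough, which holds if the register file is written in the first half of a cycle and read in the second half.

```python
NOP = ("nop", None, ())


def insert_nops(program, gap=2):
    """Pad with nops so that no instruction reads a register written by
    one of the `gap` instructions immediately before it."""
    out = []
    for op, dest, srcs in program:
        needed = 0
        # back = 1 is the instruction immediately before, back = 2 the one before that
        for back, (_, prev_dest, _) in enumerate(reversed(out[-gap:]), start=1):
            if prev_dest is not None and prev_dest != "$0" and prev_dest in srcs:
                needed = max(needed, gap - back + 1)
        out.extend([NOP] * needed)
        out.append((op, dest, srcs))
    return out


program = [
    ("sub", "$2", ("$1", "$3")),
    ("and", "$12", ("$2", "$5")),
    ("or", "$13", ("$6", "$2")),
    ("add", "$14", ("$2", "$2")),
    ("sw", None, ("$15", "$2")),
]
print([op for op, _, _ in insert_nops(program)])
```

On this sequence the pass places two nops after sub, as in the nop version of the example. The later readers of $2 are already far enough from sub.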
Hardware Solution: Forwarding
[Figure: Pipelined Datapath with Control II (as before).]
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the sequence starting with sub $2, $1, $3. Dependencies between the pipeline registers move forward in time.]
Forwarding Hardware
[Figure b, with forwarding: datapath after adding forwarding hardware. The forwarding unit receives Rs and Rt from ID/EX together with EX/MEM.RegisterRd and MEM/WB.RegisterRd, and drives the ForwardA and ForwardB multiplexors at the two ALU inputs.]
Forwarding Hardware:
Multiplexor Control
Data Hazard: Detection and
Forwarding
Forwarding unit determines multiplexor control according to the
following rules:
1. EX hazard
if ( EX/MEM.RegWrite // if there is a write…
and ( EX/MEM.RegisterRd ≠ 0 ) // to a non-$0 register…
and ( EX/MEM.RegisterRd = ID/EX.RegisterRs ) ) // which matches, then…
ForwardA = 10
2. MEM hazard
if ( MEM/WB.RegWrite // if there is a write…
and ( MEM/WB.RegisterRd ≠ 0 ) // to a non-$0 register…
and ( EX/MEM.RegisterRd ≠ ID/EX.RegisterRs ) // and not already a register match with the earlier pipeline register…
and ( MEM/WB.RegisterRd = ID/EX.RegisterRs ) ) // but a match with the later pipeline register, then…
ForwardA = 01
The EX/MEM.RegisterRd ≠ ID/EX.RegisterRs check is necessary, e.g., for sequences such as add $1, $1, $2; add $1, $1, $3; add $1, $1, $4;
(array summing?), where the earlier pipeline register (EX/MEM) has the more recent data
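The two rules can be combined into one selection function per ALU operand. A minimal sketch: testing the EX hazard first plays the role of the "not already a register match" condition, the 00 encoding for "take the value read from the register file" is assumed, and ForwardB is assumed to use the same rules with ID/EX.RegisterRt in place of ID/EX.RegisterRs.

```python
def forward_select(id_ex_reg, ex_mem_regwrite, ex_mem_rd,
                   mem_wb_regwrite, mem_wb_rd):
    """Return the 2-bit multiplexor control for one ALU operand.
    id_ex_reg is ID/EX.RegisterRs for ForwardA or ID/EX.RegisterRt for ForwardB."""
    # 1. EX hazard: the most recent result is still in EX/MEM.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_reg:
        return 0b10
    # 2. MEM hazard: reached only if EX/MEM does not already supply the operand.
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_reg:
        return 0b01
    # No hazard: use the value read from the register file (assumed encoding 00).
    return 0b00


# Third add of: add $1, $1, $2; add $1, $1, $3; add $1, $1, $4
# Both EX/MEM and MEM/WB hold a result for $1; EX/MEM has the newer one.
print(format(forward_select(1, True, 1, True, 1), "02b"))
```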
Forwarding Hardware with Control
Called forwarding unit, not hazard detection unit,
because once data is forwarded there is no hazard!
[Figure: pipelined datapath with control and forwarding unit. IF/ID.RegisterRs, IF/ID.RegisterRt and IF/ID.RegisterRd are carried into ID/EX as Rs, Rt and Rd; the forwarding unit compares Rs and Rt with EX/MEM.RegisterRd and MEM/WB.RegisterRd and controls the multiplexors at the ALU inputs.]
Forwarding Execution Example: Clocks 3 to 6
Instruction sequence: sub $2, $1, $3; and $4, $2, $5; or $4, $4, $2; add $9, $4, $2
Clock 3: IF: or $4, $4, $2 ID: and $4, $2, $5 EX: sub $2, . . . MEM: before<1> WB: before<2>
Clock 4: IF: add $9, $4, $2 ID: or $4, $4, $2 EX: and $4, . . . MEM: sub $2, . . . WB: before<1>
Clock 5: IF: after<1> ID: add $9, $4, $2 EX: or $4, $4, $2 MEM: and $4, . . . WB: sub $2, . . .
Clock 6: IF: after<2> ID: after<1> EX: add $9, $4, $2 MEM: or $4, . . . WB: and $4, . . .
[Figures: one datapath snapshot per clock cycle, showing the register numbers presented to the forwarding unit and the WB control fields (10 for the R-type instructions) in the pipeline registers.]
Data Hazards and Stalls
Load word can still cause a hazard:
an instruction tries to read a register following a load instruction that writes
to the same register
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for a load followed by or $8, $2, $6; add $9, $4, $2; slt $1, $6, $7.]
Because the dependency goes backward in time even between pipeline registers,
forwarding will not solve the hazard
therefore, we need a hazard detection unit to stall the pipeline after the
load instruction
[Figure: Pipelined Datapath with Control II (as before).]
Mechanics of Stalling
If the stall check succeeds, the pipeline needs to stall for only 1
clock cycle after the load, because after that the forwarding unit can
resolve the dependency
What the hardware does to stall the pipeline for 1 cycle:
does not let the IF/ID register change (disable write!) – this will cause
the instruction in the ID stage to repeat, i.e., stall
therefore, the instruction just behind it, in the IF stage, must be stalled
as well – so the hardware does not let the PC change (disable write!) –
this will cause the instruction in the IF stage to repeat, i.e., stall
changes all the EX, MEM and WB control fields in the ID/EX pipeline
register to 0, so effectively the instruction just behind the load
becomes a nop – a bubble is said to have been inserted into the
pipeline
note that we cannot turn that instruction into a nop by zeroing all the bits
in the instruction itself – recall nop = 00…0 (32 bits) – because it has
already been decoded and its control signals generated
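The text refers to the stall check without spelling it out. The sketch below assumes the usual load-use condition: the instruction in EX is a load (ID/EX.MemRead) whose destination ID/EX.RegisterRt matches a source register of the instruction in ID. The output names PCWrite and IF/IDWrite follow the datapath labels; zero_control is a made-up name for the multiplexor that zeroes the control fields.

```python
def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Assumed load-use check: the load in EX writes a register that the
    instruction in ID wants to read."""
    return bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)


def hazard_unit_outputs(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """On a stall: freeze PC and IF/ID, and zero the control fields
    entering ID/EX so that a bubble moves down the pipeline."""
    stall = must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt)
    return {"PCWrite": not stall, "IF/IDWrite": not stall, "zero_control": stall}


# lw $2, 20($1) is in EX (rt = 2); and $4, $2, $5 is in ID (rs = 2, rt = 5)
print(hazard_unit_outputs(True, 2, 2, 5))
```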
Hazard Detection Unit
[Figure: pipelined datapath with hazard detection unit and forwarding unit. The hazard detection unit takes ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs and IF/ID.RegisterRt as inputs and controls PCWrite, IF/IDWrite and the multiplexor that selects 0 for the control fields written into ID/EX.]
Stalling Execution Example: Clocks 2 to 7
Instruction sequence: lw $2, 20($1); and $4, $2, $5; or $4, $4, $2; add $9, $4, $2
Clock 2: IF: and $4, $2, $5 ID: lw $2, 20($1) EX: before<1> MEM: before<2> WB: before<3>
Clock 3: IF: or $4, $4, $2 ID: and $4, $2, $5 EX: lw $2, 20($1) MEM: before<1> WB: before<2>
Clock 4: IF: or $4, $4, $2 ID: and $4, $2, $5 EX: bubble MEM: lw $2, . . . WB: before<1>
Clock 5: IF: add $9, $4, $2 ID: or $4, $4, $2 EX: and $4, $2, $5 MEM: bubble WB: lw $2, . . .
Clock 6: IF: after<1> ID: add $9, $4, $2 EX: or $4, $4, $2 MEM: and $4, . . . WB: bubble
Clock 7: IF: after<2> ID: after<1> EX: add $9, $4, $2 MEM: or $4, . . . WB: and $4, . . .
[Figures: one datapath snapshot per clock cycle, showing the hazard detection unit inputs (ID/EX.MemRead, ID/EX.RegisterRt), PCWrite, IF/IDWrite and the forwarding unit.]
Predicting Branch-not-taken:
Misprediction delay
[Figure: multiple-clock-cycle diagram, program execution order against time in clock cycles CC 1 to CC 9, showing the delay after a mispredicted branch.]
Optimized Datapath for Branch
IF.Flush control zeros out the instruction in the IF/ID
pipeline register (which follows the branch)
Branch decision is moved from the MEM stage to the ID stage
[Figure: optimized datapath for branches, with a register comparator (=) and the shift left 2 placed in the ID stage and an IF.Flush control line. Simplified drawing, not showing enhancements to the forwarding and hazard detection units.]
Pipelined Branch Execution Example
Program (byte address, instruction):
36 sub $10, $4, $8
40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $2, $6
…
72 lw $4, 50($7)
Clock 3: IF: and $12, $2, $5 ID: beq $1, $3, 7 EX: sub $10, $4, $8 MEM: before<1> WB: before<2>
[Figures: datapath snapshots for clock cycles 3 and 4. In clock 3 the branch target 72 is computed in ID while the instruction at 44 is fetched; in clock 4 lw $4, 50($7) is fetched from address 72 and the PC advances to 76.]
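The target address 72 in this example follows from the MIPS branch encoding: the 16-bit offset counts instructions (words) relative to the instruction after the branch. A one-function sketch, assuming beq sits at address 40 (the slot between the instructions at 36 and 44); the function name is hypothetical.

```python
def branch_target(branch_address, word_offset):
    """PC-relative target: (address of the branch + 4) + (offset shifted left 2)."""
    return branch_address + 4 + (word_offset << 2)


print(branch_target(40, 7))
```

This is the computation performed by the adder fed by "Shift left 2" in the datapath.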
Simple Example: Comparing
Performance
Single-cycle (p. 373): average instruction time 8 ns
Multicycle (p. 397): average instruction time 8.04 ns
Pipelined:
loads use 1 cc (clock cycle) when no load-use dependency
and 2 cc when there is dependency – given 50% of loads
are followed by dependency the average cc per load is 1.5
stores use 1 cc each
branches use 1 cc when predicted correctly and 2 cc when
not – given 25% misprediction average cc per branch is 1.25
jumps use 2 cc each
ALU instructions use 1 cc each
therefore, average CPI is
1.5 × 23% + 1 × 13% + 1.25 × 19% + 2 × 2% + 1 × 43% = 1.18
therefore, average instruction time is 1.18 × 2 = 2.36 ns
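The arithmetic above can be reproduced directly. The sketch below uses only the instruction mix and per-class cycle counts from this slide; the dictionary layout and names are just for illustration.

```python
# Per-class cycle counts, derived as on the slide
load_cc = 0.5 * 1 + 0.5 * 2        # 50% of loads are followed by a dependency
branch_cc = 0.75 * 1 + 0.25 * 2    # 25% of branches are mispredicted

# instruction class -> (fraction of instructions, clock cycles per instruction)
MIX = {
    "load": (0.23, load_cc),
    "store": (0.13, 1.0),
    "branch": (0.19, branch_cc),
    "jump": (0.02, 2.0),
    "alu": (0.43, 1.0),
}

CLOCK_NS = 2

cpi = sum(fraction * cycles for fraction, cycles in MIX.values())
print(round(cpi, 2), round(round(cpi, 2) * CLOCK_NS, 2))
```

The unrounded CPI is 1.1825; the slide rounds it to 1.18 before multiplying by the 2 ns clock.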