Assume 30 min. each task – wash, dry, fold, store – and that
separate tasks use separate hardware and so can be overlapped
[Figure: pipelined laundry schedule. Time axis from 6 PM to 2 AM; loads listed in task order.]
Pipelined vs. Single-Cycle Instruction
Execution: the Plan
[Figure: single-cycle execution of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Each instruction passes through instruction fetch, Reg, ALU, data access, Reg and takes 8 ns; the next instruction starts only when the previous one finishes. Time axis in ns; instructions in program execution order.]
Assume 2 ns for memory access and ALU operation, 1 ns for register access:
therefore, the single-cycle clock is 8 ns (2 + 1 + 2 + 2 + 1) and the pipelined clock cycle is 2 ns (the slowest stage).
[Figure: pipelined execution of lw $1, 100($0) and lw $2, 200($0). Each instruction passes through instruction fetch, Reg, ALU, data access, Reg in 2 ns steps, and each instruction starts 2 ns after the previous one. Time axis in ns; instructions in program execution order.]
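The clock periods quoted above follow directly from the stage latencies. The sketch below is only an illustration: the stage names and the assignment of the 1 ns register accesses to ID and WB are assumptions made for the example, not something the slide spells out.

```python
# Stage latencies in ns, following the slide's assumptions
# (2 ns memory access and ALU operation, 1 ns register access).
STAGE_NS = {"IF": 2, "ID": 1, "EX": 2, "MEM": 2, "WB": 1}

def single_cycle_clock(stages):
    """One instruction per clock, so the clock must cover every stage in sequence."""
    return sum(stages.values())

def pipelined_clock(stages):
    """All stages advance together, so the clock must cover the slowest stage."""
    return max(stages.values())

def total_time(n, stages, pipelined):
    """Time in ns to finish n instructions, ignoring hazards."""
    if pipelined:
        # The first instruction needs len(stages) cycles; after that one finishes per cycle.
        return (len(stages) + n - 1) * pipelined_clock(stages)
    return n * single_cycle_clock(stages)

print(single_cycle_clock(STAGE_NS))        # 8
print(pipelined_clock(STAGE_NS))           # 2
print(total_time(3, STAGE_NS, False))      # 24
print(total_time(3, STAGE_NS, True))       # 14
```

For the three lw instructions of the single-cycle figure this gives 24 ns without pipelining and 14 ns with it.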
Pipelining: Keep in Mind
Pipelining does not reduce latency of a single task, it
increases throughput of entire workload
Pipeline rate limited by longest stage
potential speedup = number of pipe stages
unbalanced lengths of pipe stages reduce speedup
Time to fill pipeline and time to drain it – when there is
slack in the pipeline – reduce speedup
Example Problem
Problem: for the laundry fill in the following table when
1. the stage lengths are 30, 30, 30, 30 min., resp.
2. the stage lengths are 20, 20, 60, 20 min., resp.
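One way to work the two cases is sketched below. The table itself is not reproduced in the text, so the number of loads (4) is a hypothetical choice, and the model assumes that a load may wait between stages and that each stage handles one load at a time.

```python
def sequential_minutes(stage_minutes, loads):
    """No overlap: every load runs all stages before the next load starts."""
    return loads * sum(stage_minutes)

def pipelined_minutes(stage_minutes, loads):
    """Overlapped: the first load takes the full latency, and each later load
    finishes one bottleneck-stage time after the one before it."""
    return sum(stage_minutes) + (loads - 1) * max(stage_minutes)

LOADS = 4  # hypothetical; the slide's table is not available
for stages in ([30, 30, 30, 30], [20, 20, 60, 20]):
    seq = sequential_minutes(stages, LOADS)
    pipe = pipelined_minutes(stages, LOADS)
    print(stages, "latency:", sum(stages), "sequential:", seq,
          "pipelined:", pipe, "speedup:", round(seq / pipe, 2))
```

With balanced stages the four loads take 210 minutes instead of 480. With the 60-minute stage they take 300 minutes, which shows how an unbalanced stage limits the pipeline rate.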
[Figure: pipelined execution of lw $3, 300($0) and lw $4, 400($0), each starting 2 ns after the previous instruction; annotation: hazard if single memory.]
[Figure: pipeline stall on a branch. add $4, $5, $6 is followed by beq $1, $2, 40 after 2 ns; a bubble is inserted, so lw $3, 300($0) starts 4 ns after beq. Time axis in ns; instructions in program execution order.]
Note that branch outcome is computed in ID stage with added hardware (later…)
Pipeline stall
Control Hazards
Solution 2 Predict branch outcome
e.g., predict branch-not-taken :
[Figure: add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0), each starting 2 ns after the previous instruction with no bubble. Time axis in ns.]
Prediction success
[Figure: add $4, $5, $6; beq $1, $2, 40; the instruction fetched after the branch becomes a bubble in every stage; or $7, $8, $9 starts 4 ns after beq. Time axis in ns.]
Prediction failure: undo (=flush) lw
Control Hazards
Solution 3 Delayed branch: always execute the sequentially next
statement with the branch executing after one instruction delay
– compiler’s job to find a statement that can be put in the slot
that is independent of branch outcome
MIPS does this – but it is an option in SPIM (Simulator -> Settings)
[Figure: delayed-branch pipeline diagram beginning with beq $1, $2, 40 (instruction fetch, Reg, ALU, data access, Reg). Time axis in ns.]
[Figure: add $s0, $t0, $t1 passing through IF, ID, EX, MEM, WB; time axis 2 to 10.]
Pipeline diagram: shade indicates use – left=write, right=read
[Figure: add $s0, $t0, $t1 followed by sub $t2, $s0, $t3, each passing through IF, ID, EX, MEM, WB.]
Without forwarding – blue line – data has to go back in time;
with forwarding – red line – …
Data Hazards
Forwarding may not be enough
e.g., if an R-type instruction following a load uses the result of the
load – called load-use data hazard
[Figure: multiple-clock-cycle diagram of a load-use data hazard; time axis 2 to 14, instructions in program execution order.]
Reordered code:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t0, 4($t1)    (interchanged)
sw $t2, 0($t1)    (interchanged)
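The effect of the interchange can be checked mechanically. The sketch below counts load-use stalls with a simple rule: one stall whenever the instruction immediately after a lw reads the loaded register. The original order is not shown on the slide, so it is an assumption here (the reordered code with the two sw instructions swapped back), and the tuple encoding of instructions is purely illustrative.

```python
# Each instruction: (opcode, destination register or None, list of source registers)
ORIGINAL = [  # assumed original order: the two stores swapped back
    ("lw", "$t0", ["$t1"]),
    ("lw", "$t2", ["$t1"]),
    ("sw", None, ["$t2", "$t1"]),
    ("sw", None, ["$t0", "$t1"]),
]
REORDERED = [  # order given on the slide
    ("lw", "$t0", ["$t1"]),
    ("lw", "$t2", ["$t1"]),
    ("sw", None, ["$t0", "$t1"]),
    ("sw", None, ["$t2", "$t1"]),
]

def load_use_stalls(program):
    """Count adjacent pairs where a load's destination is read by the next instruction."""
    stalls = 0
    for load, user in zip(program, program[1:]):
        op, dest, _sources = load
        if op == "lw" and dest in user[2]:
            stalls += 1
    return stalls

print(load_use_stalls(ORIGINAL))   # 1
print(load_use_stalls(REORDERED))  # 0
```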
Pipelined Datapath
We now move to actually building a pipelined datapath
First recall the 5 steps in instruction execution
1. Instruction Fetch & PC Increment (IF)
2. Instruction Decode and Register Read (ID)
3. Execution or calculate address (EX)
4. Memory access (MEM)
5. Write result into register (WB)
Review: single-cycle processor
all 5 steps done in a single clock cycle
dedicated hardware required for each step
What happens if we break the execution into multiple cycles, but keep
the extra hardware?
Review - Single-Cycle Datapath “Steps”
[Figure: single-cycle datapath divided into five steps: IF (instruction fetch), ID (instruction decode/register read), EX (execute/address calculation), MEM (memory access), WB (write back).]
Pipelined Datapath – Key Idea
What happens if we break the execution into multiple cycles, but keep
the extra hardware?
Answer: We may be able to start executing a new instruction at each
clock cycle - pipelining
…but we shall need extra registers to hold data between cycles
– pipeline registers
Pipelined Datapath
Pipeline registers wide enough to hold data coming in
[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB between the stages; the registers are 64, 128, 97, and 64 bits wide, respectively.]
Single-Clock-Cycle Diagrams: Clock Cycles 1 to 8
[Figures: one datapath snapshot per clock cycle (PC, instruction memory, register file, sign extend, ALU, data memory, and the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB), labelled with the instruction occupying each stage.]
Cycle   IF     ID     EX     MEM    WB
1       LW
2       SW     LW
3       ADD    SW     LW
4       SUB    ADD    SW     LW
5              SUB    ADD    SW     LW
6                     SUB    ADD    SW
7                            SUB    ADD
8                                   SUB
Alternative View – Multiple-Clock-Cycle Diagram
(time axis: clock cycles CC1 to CC8)
                       CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8
lw  $t0, 10($t1)       IM    REG   ALU   DM    REG
sw  $t3, 20($t4)             IM    REG   ALU   DM    REG
add $t5, $t6, $t7                  IM    REG   ALU   DM    REG
sub $t8, $t9, $t10                       IM    REG   ALU   DM    REG
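A multiple-clock-cycle diagram like the one above can be generated by shifting each instruction one cycle to the right of its predecessor. This is a minimal sketch that ignores hazards; the function name and output layout are illustrative choices.

```python
STAGES = ["IM", "REG", "ALU", "DM", "REG"]

def multi_cycle_diagram(instructions):
    """Return (instruction, cells) rows; cells[c] names the stage used in cycle c + 1."""
    n_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, instr in enumerate(instructions):
        cells = [""] * n_cycles
        for s, name in enumerate(STAGES):
            cells[i + s] = name  # instruction i enters stage s in cycle i + s + 1
        rows.append((instr, cells))
    return rows

program = ["lw $t0, 10($t1)", "sw $t3, 20($t4)",
           "add $t5, $t6, $t7", "sub $t8, $t9, $t10"]
rows = multi_cycle_diagram(program)
for instr, cells in rows:
    print(f"{instr:<20}" + "".join(f"{c:<5}" for c in cells))
```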
Notes
One significant difference in the execution of an R-type
instruction between multicycle and pipelined implementations:
register write-back for the R-type instruction is the 5th (the last
write-back) pipeline stage vs. the 4th stage for the multicycle
implementation. Why?
think of structural hazards when writing to the register file…
Worth repeating: the essential difference between the pipeline
and multicycle implementations is the insertion of pipeline
registers to decouple the 5 stages
The CPI of an ideal pipeline (no stalls) is 1. Why?
The RaVi Architecture Visualization Project of Dortmund U. has
pipeline simulations – see link in our Additional Resources page
As we develop control for the pipeline keep in mind that the text
does not consider jump – should not be too hard to implement!
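The claim that an ideal pipeline has a CPI of 1 can be checked with a short calculation: with k stages and no stalls, n instructions finish in k + n - 1 cycles, so the cycles per instruction approach 1 as n grows. The sketch below is only an illustration of that arithmetic.

```python
def ideal_cpi(n_instructions, n_stages=5):
    """Cycles per instruction for a stall-free pipeline, including fill time."""
    cycles = n_stages + n_instructions - 1
    return cycles / n_instructions

for n in (4, 100, 1_000_000):
    print(n, ideal_cpi(n))
```

For the 4-instruction example above the CPI is 2.0 because fill time dominates. For a long instruction stream the fill time is amortized and the CPI tends to 1.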
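Recall Single-Cycle Control – the Datapath
[Figure: single-cycle datapath with control. The Control unit takes Instruction [31–26] and produces RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite; ALU control takes Instruction [5–0]; PCSrc selects between PC + 4 and the branch target (Add result after Shift left 2).]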
Recall Single-Cycle – ALU Control
Instruction   ALUOp   Instruction     Funct     Desired       ALU control
opcode                operation       field     ALU action    input
LW            00      load word       xxxxxx    add           010
SW            00      store word      xxxxxx    add           010
Branch eq     01      branch eq       xxxxxx    subtract      110
R-type        10      add             100000    add           010
R-type        10      subtract        100010    subtract      110
R-type        10      AND             100100    and           000
R-type        10      OR              100101    or            001
R-type        10      set on less     101010    set on less   111
ALUOp             Funct field                Operation
ALUOp1  ALUOp0    F5  F4  F3  F2  F1  F0
0       0         X   X   X   X   X   X      010
0       1         X   X   X   X   X   X      110
1       X         X   X   0   0   0   0      010
1       X         X   X   0   0   1   0      110
1       X         X   X   0   1   0   0      000
1       X         X   X   0   1   0   1      001
1       X         X   X   1   0   1   0      111
Truth table for ALU control bits
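The truth table can be read as a small lookup function. The sketch below uses bit strings for readability; the function name and the string encoding are illustrative assumptions, while the bit patterns are those of the table.

```python
# Funct bits F3..F0 -> ALU operation, for R-type instructions (ALUOp1 = 1).
# F5 and F4 are don't-cares in the truth table.
R_TYPE_OPERATION = {
    "0000": "010",  # add
    "0010": "110",  # subtract
    "0100": "000",  # and
    "0101": "001",  # or
    "1010": "111",  # set on less than
}

def alu_control(alu_op, funct):
    """alu_op: 2-bit string ALUOp1 ALUOp0; funct: 6-bit string F5..F0."""
    if alu_op == "00":        # lw / sw: add to form the address
        return "010"
    if alu_op == "01":        # beq: subtract to compare
        return "110"
    if alu_op[0] == "1":      # R-type: decode the funct field
        return R_TYPE_OPERATION[funct[2:]]
    raise ValueError(f"unexpected ALUOp {alu_op!r}")

print(alu_control("00", "xxxxxx"))  # 010
print(alu_control("10", "100010"))  # 110
print(alu_control("10", "101010"))  # 111
```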
Recall Single-Cycle – Control Signals
Effect of control bits
Signal name: effect when deasserted | effect when asserted
RegDst: The register destination number for the Write register comes from the rt field (bits 20-16) | The register destination number for the Write register comes from the rd field (bits 15-11)
RegWrite: None | The register on the Write register input is written with the value on the Write data input
ALUSrc: The second ALU operand comes from the second register file output (Read data 2) | The second ALU operand is the sign-extended, lower 16 bits of the instruction
PCSrc: The PC is replaced by the output of the adder that computes the value of PC + 4 | The PC is replaced by the output of the adder that computes the branch target
MemRead: None | Data memory contents designated by the address input are put on the first Read data output
MemWrite: None | Data memory contents designated by the address input are replaced by the value of the Write data input
MemtoReg: The value fed to the register Write data input comes from the ALU | The value fed to the register Write data input comes from the data memory
Determining control bits
Instruction   RegDst   ALUSrc   MemtoReg   RegWrite   MemRead   MemWrite   Branch   ALUOp1   ALUOp0
R-format      1        0        0          1          0         0          0        1        0
lw            0        1        1          1          1         0          0        0        0
sw            X        1        X          0          0         1          0        0        0
beq           X        0        X          0          0         0          1        0        1
Pipeline Control
Initial design – motivated by single-cycle datapath control – use
the same control signals
Observe:
No separate write signal for the PC as it is written every cycle (will be modified by hazard detection unit!!)
No separate write signals for the pipeline registers as they are written every cycle (will be modified by hazard detection unit!!)
No separate read signal for instruction memory as it is read every clock cycle
No separate read signal for register file as it is read every clock cycle
Need to set control signals during each pipeline stage
Since control signals are associated with components active
during a single pipeline stage, can group control lines into five
groups according to pipeline stage
Pipelined Datapath with Control I
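[Figure: pipelined datapath with the control signals identified: PCSrc, Branch, RegWrite, MemWrite, RegDst (selecting between Instruction [20–16] and Instruction [15–11]), among others. An annotation on the figure refers to the single-cycle datapath.]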
Pipeline Control Signals
There are five stages in the pipeline
instruction fetch / PC increment (nothing to control as instruction memory read and PC write are always enabled)
instruction decode / register fetch
execution / address calculation
memory access
write back
              Execution/Address Calculation      Memory access stage          Write-back stage
              stage control lines                control lines                control lines
Instruction   RegDst  ALUOp1  ALUOp0  ALUSrc     Branch  MemRead  MemWrite    RegWrite  MemtoReg
R-format      1       1       0       0          0       0        0           1         0
lw            0       0       0       1          0       1        0           1         1
sw            X       0       0       1          0       0        1           0         X
beq           X       0       1       0          1       0        0           0         X
Pipeline Control Implementation
Pass control signals along just like the data – extend each pipeline register to hold needed control bits for succeeding stages
[Figure: the Control unit generates the EX, M, and WB control groups from the instruction; ID/EX holds EX, M, and WB; EX/MEM holds M and WB; MEM/WB holds WB.]
Note: The 6-bit funct field of the instruction required in the EX stage to generate ALU control can be retrieved as the 6 least significant bits of the immediate field which is sign-extended and passed from the IF/ID register to the ID/EX register
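The idea of passing control bits along with the data can be sketched as follows. The control values are those of the pipeline control table; the dictionary layout, the use of None for a don't-care (X), and the function name are illustrative assumptions.

```python
X = None  # don't-care

# Control lines grouped by the stage that consumes them.
CONTROL = {
    "R-format": {"EX": {"RegDst": 1, "ALUOp1": 1, "ALUOp0": 0, "ALUSrc": 0},
                 "M": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 0}},
    "lw":       {"EX": {"RegDst": 0, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
                 "M": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 1}},
    "sw":       {"EX": {"RegDst": X, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
                 "M": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
                 "WB": {"RegWrite": 0, "MemtoReg": X}},
    "beq":      {"EX": {"RegDst": X, "ALUOp1": 0, "ALUOp0": 1, "ALUSrc": 0},
                 "M": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 0, "MemtoReg": X}},
}

def control_in_pipeline_registers(instr):
    """Control bits each pipeline register must hold for the stages still ahead."""
    g = CONTROL[instr]
    return {
        "ID/EX": {**g["EX"], **g["M"], **g["WB"]},   # used in EX, MEM, WB
        "EX/MEM": {**g["M"], **g["WB"]},             # EX group already consumed
        "MEM/WB": dict(g["WB"]),                     # only write-back bits remain
    }

regs = control_in_pipeline_registers("lw")
print({name: len(bits) for name, bits in regs.items()})
```

Under these assumptions ID/EX carries 9 control bits, EX/MEM carries 5, and MEM/WB carries 2.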
[Figure: pipelined datapath with control. The Control unit writes the EX, M, and WB control groups into ID/EX; M and WB continue into EX/MEM, and WB into MEM/WB. Signals shown include RegWrite, Branch, MemWrite, MemRead, ALUSrc, and MemtoReg; ALU control is fed from Instruction [15–0] after sign extension (16 to 32 bits, low 6 bits used).]
Pipelined Execution and Control
Instruction sequence:
lw $10, 20($1)
sub $11, $2, $3
and $12, $4, $5
or $13, $6, $7
add $14, $8, $9
[Figures: one snapshot of the pipelined datapath with control per clock cycle, clocks 1 through 9, showing the register numbers, operand values, and control bits held in each pipeline register.]
Clock   IF                ID                EX               MEM              WB
1       lw $10, 20($1)    before<1>         before<2>        before<3>        before<4>
2       sub $11, $2, $3   lw $10, 20($1)    before<1>        before<2>        before<3>
3       and $12, $4, $5   sub $11, $2, $3   lw $10, . . .    before<1>        before<2>
4       or $13, $6, $7    and $12, $4, $5   sub $11, . . .   lw $10, . . .    before<1>
5       add $14, $8, $9   or $13, $6, $7    and $12, . . .   sub $11, . . .   lw $10, . . .
6       after<1>          add $14, $8, $9   or $13, . . .    and $12, . . .   sub $11, . . .
7       after<2>          after<1>          add $14, . . .   or $13, . . .    and $12, . . .
8       after<3>          after<2>          after<1>         add $14, . . .   or $13, . . .
9       after<4>          after<3>          after<2>         after<1>         add $14, . . .
Revisiting Hazards
So far our datapath and control have ignored hazards
We shall revisit data hazards and control hazards and
enhance our datapath and control to handle them in
hardware…
Data Hazards and Forwarding
Problem with starting an instruction before previous are finished:
data dependencies that go backward in time – called data hazards
[Figure: multiple-clock-cycle diagram (IM, Reg, ALU, DM, Reg) for the sequence below; time in clock cycles, CC 1 to CC 9.]
$2 = 10 before sub; $2 = –20 after sub
Value of register $2: 10 in CC 1 to CC 4; 10/–20 in CC 5; –20 in CC 6 to CC 9
Program execution order (in instructions):
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
E.g., in the earlier example, the first hazard, between sub $2, $1, $3 and
and $12, $2, $5, is detected when the and is in the EX stage and the
sub is in the MEM stage because
EX/MEM.RegisterRd = ID/EX.RegisterRs = $2 (1a)
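Forwarding Hardware
[Figure: the EX stage without forwarding (Registers, ALU, Data memory, multiplexors) and with forwarding. With forwarding, multiplexors controlled by ForwardA and ForwardB select each ALU operand; the Forwarding unit takes Rs and Rt from ID/EX together with EX/MEM.RegisterRd and MEM/WB.RegisterRd.]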
1. EX hazard
if ( EX/MEM.RegWrite // if there is a write…
and ( EX/MEM.RegisterRd ≠ 0 ) // to a non-$0 register…
and ( EX/MEM.RegisterRd = ID/EX.RegisterRs ) ) // which matches,
then… ForwardA = 10
2.
This check is necessary, e.g., for sequences such as
add $1, $1, $2; add $1, $1, $3; add $1, $1, $4;
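A forwarding unit built around the EX hazard condition above can be sketched as below. Only the ForwardA = 10 case appears in the text. The ForwardB outputs, the 01 encoding for forwarding from MEM/WB, and the rule that the MEM hazard applies only when no EX hazard supplies the same operand are assumptions based on the usual textbook formulation, and the dictionary field names are hypothetical.

```python
def forwarding_unit(id_ex, ex_mem, mem_wb):
    """Return (ForwardA, ForwardB) as 2-bit strings.

    00: operand from the register file, 10: from EX/MEM, 01: from MEM/WB
    (the 01 encoding and ForwardB are assumptions, see the lead-in).
    """
    forward_a = forward_b = "00"

    # EX hazard: the instruction one ahead writes a non-$0 register we read.
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0:
        if ex_mem["Rd"] == id_ex["Rs"]:
            forward_a = "10"
        if ex_mem["Rd"] == id_ex["Rt"]:
            forward_b = "10"

    # MEM hazard: the instruction two ahead writes the register, and the
    # more recent result in EX/MEM is not already being forwarded.
    if mem_wb["RegWrite"] and mem_wb["Rd"] != 0:
        if forward_a == "00" and mem_wb["Rd"] == id_ex["Rs"]:
            forward_a = "01"
        if forward_b == "00" and mem_wb["Rd"] == id_ex["Rt"]:
            forward_b = "01"

    return forward_a, forward_b

# Third add of: add $1,$1,$2; add $1,$1,$3; add $1,$1,$4 (both earlier adds write $1)
print(forwarding_unit({"Rs": 1, "Rt": 4},
                      {"RegWrite": 1, "Rd": 1},
                      {"RegWrite": 1, "Rd": 1}))  # ('10', '00')
```

In the add sequence both earlier instructions write $1, so the extra check makes the ALU take the most recent value from EX/MEM rather than the stale one in MEM/WB.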
Forwarding Hardware with Control
Called forwarding unit, not hazard detection unit, because once data is forwarded …
[Figure: pipelined datapath with the Forwarding unit. IF/ID.RegisterRs, IF/ID.RegisterRt, and IF/ID.RegisterRd are carried into ID/EX as Rs, Rt, and Rd; the Forwarding unit compares Rs and Rt with EX/MEM.RegisterRd and MEM/WB.RegisterRd and controls the multiplexors at the ALU inputs.]
Forwarding Execution Example
Instruction sequence:
sub $2, $1, $3
and $4, $2, $5
or $4, $4, $2
add $9, $4, $2
[Figures: datapath snapshots with the Forwarding unit for clocks 3 through 6, showing register numbers, operand values, and the forwarding multiplexor settings.]
Clock   IF               ID               EX               MEM            WB
3       or $4, $4, $2    and $4, $2, $5   sub $2, $1, $3   before<1>      before<2>
4       add $9, $4, $2   or $4, $4, $2    and $4, $2, $5   sub $2, . . .  before<1>
5       after<1>         add $9, $4, $2   or $4, $4, $2    and $4, . . .  sub $2, . . .
6       after<2>         after<1>         add $9, $4, $2   or $4, . . .   and $4, . . .
Data Hazards and Stalls
Load word can still cause a hazard:
an instruction tries to read a register following a load instruction that writes to
the same register
therefore, we need a hazard detection unit to stall the pipeline after the load instruction
[Figure: multiple-clock-cycle diagram (IM, Reg, DM, Reg) of a load followed by dependent instructions, including or $8, $2, $6 and add $9, $4, $2; annotation: as even a pipeline dependency goes backward in time, forwarding will not …]
[Figure: pipelined datapath with a hazard detection unit and the Forwarding unit. The hazard detection unit reads ID/EX.RegisterRt, IF/ID.RegisterRs, and IF/ID.RegisterRt, and controls PCWrite, IF/IDWrite, and a multiplexor that replaces the control signals entering ID/EX with 0.]
Stalling Execution Example
Instruction sequence:
lw $2, 20($1)
and $4, $2, $5
or $4, $4, $2
add $9, $4, $2
[Figures: datapath snapshots with the hazard detection unit (inputs ID/EX.MemRead and ID/EX.RegisterRt; outputs PCWrite and IF/IDWrite) and the Forwarding unit for clocks 2 through 7.]
Clock   IF               ID               EX               MEM            WB
2       and $4, $2, $5   lw $2, 20($1)    before<1>        before<2>      before<3>
3       or $4, $4, $2    and $4, $2, $5   lw $2, 20($1)    before<1>      before<2>
4       or $4, $4, $2    and $4, $2, $5   bubble           lw $2, . . .   before<1>
5       add $9, $4, $2   or $4, $4, $2    and $4, $2, $5   bubble         lw $2, . . .
6       after<1>         add $9, $4, $2   or $4, $4, $2    and $4, . . .  bubble
7       after<2>         after<1>         add $9, $4, $2   or $4, . . .   and $4, . . .
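The stall decision in the example can be sketched as a small function. The signal names ID/EX.MemRead, ID/EX.RegisterRt, PCWrite, and IF/IDWrite come from the figures. The exact comparison (the load's Rt against both source registers of the instruction in ID) is the usual textbook load-use condition and is an assumption here, as is the dictionary returned.

```python
def hazard_detection(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Decide whether the instruction in ID must wait one cycle for a load in EX."""
    stall = bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)
    return {
        "PCWrite": not stall,       # hold the PC so the same instruction is refetched
        "IF/IDWrite": not stall,    # hold IF/ID so the instruction in ID is decoded again
        "insert_bubble": stall,     # zero the control bits entering ID/EX
    }

# Clock 3: lw $2, 20($1) in EX (Rt = 2), and $4, $2, $5 in ID (Rs = 2, Rt = 5)
print(hazard_detection(1, 2, 2, 5))
# Clock 4: the bubble is in EX (MemRead = 0), and $4, $2, $5 is still in ID
print(hazard_detection(0, 0, 2, 5))
```

The first call reports a stall, which matches the repeated IF and ID entries in clock 4 of the table. The second call lets the pipeline advance.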
Control (or Branch) Hazards
Problem with branches in the pipeline we have so far is that the
branch decision is not made till the MEM stage – so what
instructions, if at all, should we insert into the pipeline following
the branch instructions?
[Figure: pipelined datapath with the branch decision made in ID: a register comparator (=) and the branch-target adder (Sign extend, Shift left 2) sit in the ID stage, alongside the Hazard detection unit and the Forwarding unit.]
IF.Flush control zeros out the instruction in the IF/ID pipeline register (which follows the branch)
Branch decision is moved from the MEM stage to the ID stage – simplified drawing not showing
enhancements to the forwarding and hazard detection units
Pipelined Branch Execution Example
Code (byte addresses):
36 sub $10, $4, $8
40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $2, $6
…
72 lw $4, 50($7)
[Figures: datapath snapshots for clocks 3 and 4. The branch is taken: IF.Flush turns the fetched and into a bubble (nop), and fetch continues at the branch target 40 + 4 + 7 × 4 = 72.]
Clock   IF                ID              EX                MEM             WB
3       and $12, $2, $5   beq $1, $3, 7   sub $10, $4, $8   before<1>       before<2>
4       lw $4, 50($7)     bubble (nop)    beq $1, $3, 7     sub $10, . . .  before<1>
Simple Example: Comparing
Performance
Compare performance for single-cycle, multicycle, and pipelined
datapaths using the gcc instruction mix
assume 2 ns for memory access, 2 ns for ALU operation, 1 ns
for register read or write
assume gcc instruction mix 23% loads, 13% stores, 19%
branches, 2% jumps, 43% ALU
for pipelined execution assume
50% of the loads are followed immediately by an instruction that
uses the result of the load
25% of branches are mispredicted
branch delay on misprediction is 1 clock cycle
jumps always incur 1 clock cycle delay so their average time is 2
clock cycles
Simple Example: Comparing
Performance
Single-cycle (p. 373): average instruction time 8 ns
Multicycle (p. 397): average instruction time 8.04 ns
Pipelined:
loads use 1 cc (clock cycle) when no load-use dependency and 2 cc
when there is dependency – given 50% of loads are followed by
dependency the average cc per load is 1.5
stores use 1 cc each
branches use 1 cc when predicted correctly and 2 cc when not –
given 25% misprediction average cc per branch is 1.25
jumps use 2 cc each
ALU instructions use 1 cc each
therefore, average CPI is
1.5 × 23% + 1 × 13% + 1.25 × 19% + 2 × 2% + 1 × 43% = 1.18
therefore, average instruction time is 1.18 × 2 = 2.36 ns
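The pipelined figure can be reproduced with a few lines of arithmetic. The sketch below uses the instruction mix and penalties stated above; the variable names are illustrative.

```python
CLOCK_NS = 2  # pipelined clock cycle

# gcc instruction mix from the slide
MIX = {"load": 0.23, "store": 0.13, "branch": 0.19, "jump": 0.02, "alu": 0.43}

# Average clock cycles per instruction class
CYCLES = {
    "load": 0.5 * 1 + 0.5 * 2,      # half of the loads have a load-use stall
    "store": 1,
    "branch": 0.75 * 1 + 0.25 * 2,  # 25% mispredicted, 1 extra cycle each
    "jump": 2,
    "alu": 1,
}

cpi = sum(MIX[kind] * CYCLES[kind] for kind in MIX)
print(f"average CPI = {cpi:.4f}")                           # average CPI = 1.1825
print(f"average instruction time = {cpi * CLOCK_NS:.3f} ns")  # average instruction time = 2.365 ns
```

The slide rounds the CPI to 1.18 before multiplying, which gives 2.36 ns. Carrying the unrounded 1.1825 gives 2.365 ns. Either way the pipelined design is roughly 3.4 times faster than the 8 ns single-cycle design.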