7 Performance With Pipelining

EEE440 - Computer Architecture
Enhancing Performance with Pipelining

(Chapter 6)
Pipelining
Start work ASAP!! Do not waste time!
6 PM 7 8 9 10 11 12 1 2 AM
Time
Task
order
A
Not pipelined
B
Assume 30 min. each task – wash, dry, fold, store – and that
separate tasks use separate hardware and so can be overlapped
6 PM 7 8 9 10 11 12 1 2 AM
Time
Task
order
A Pipelined
B
D
Pipelined vs. Single-Cycle Instruction
Execution: the Plan
P rogra m
e xecution 2 4 6 8 10 12 14 16 18
order Time
(in instructions ) Single-cycle
Instruction Data
lw $1, 100($0) Reg ALU Reg
fetch access
Instruction Data
lw $2, 200($0) 8 ns fe tch
Reg ALU
access
Reg
Instruction
lw $3, 300($0) 8 ns fe tch
...
8 ns
Assume 2 ns for memory access, ALU operation; 1 ns for register access:
therefore, single cycle clock 8 ns; pipelined clock cycle 2 ns.
Pr ogr a m
exec uti on 2 4 6 8 10 12 14
Time
or der
(i ni nstr ucti ons)
Instruction Da ta
fetch access
Pipelined
Ins truction Da ta
lw $2, 200($0) 2 ns Reg ALU Reg
fetch access
I nstr ucti on Dat a

lw $3, 300($0) 2 ns Re g ALU Re g
f et ch access
2 ns 2 ns 2 ns 2 ns 2 ns
Pipelining: Keep in Mind
Pipelining does not reduce latency of a single task, it
increases throughput of entire workload
Pipeline rate limited by longest stage
potential speedup = number pipe stages
unbalanced lengths of pipe stages reduces speedup
Time to fill pipeline and time to drain it – when there is
slack in the pipeline – reduces speedup
Example Problem
Problem: for the laundry fill in the following table when
1. the stage lengths are 30, 30, 30 30 min., resp.
2. the stage lengths are 20, 20, 60, 20 min., resp.
Person Unpipelined Pipeline 1 Ratio unpipelined Pipeline 2 Ratio unpiplelined

finish time finish time to pipeline 1 finish time to pipeline 2
1
2
3
4
n
Come up with a formula for pipeline speed-up!
Pipelining MIPS
What makes it easy with MIPS?
all instructions are same length
so fetch and decode stages are similar for all instructions
just a few instruction formats
simplifies instruction decode and makes it possible in one stage
memory operands appear only in load/stores
so memory access can be deferred to exactly one later stage
operands are aligned in memory
one data transfer instruction requires one memory access stage
Pipelining MIPS
What makes it hard?
structural hazards: different instructions, at different stages,
in the pipeline want to use the same hardware resource
control hazards: succeeding instruction, to put into pipeline,
depends on the outcome of a previous branch instruction,
already in pipeline
data hazards: an instruction in the pipeline requires data to
be computed by a previous instruction still in the pipeline
Before actually building the pipelined datapath and control

we first briefly examine these potential hazards
individually…
Structural hazard: inadequate hardware to simultaneously
support all instructions in the pipeline in the same clock cycle
E.g., suppose single – not separate – instruction and data
Structural Hazards
memory in pipeline below with one read port
then a structural hazard between first and fourth lw instructions
P rogra m
e xecution 2 4 6 8 10 12 14
Time
order
(in instructions)
Instruction Da ta
lw $1, 100($0)
fetch
Reg ALU
acces s
Reg
Pipelined
Instruction Da ta
lw $2, 200($0) 2 ns Reg ALU Reg
fetch access
Instruction Data
Hazard if
lw $3, 300($0) 2 ns fetch
Reg ALU
access
Reg
single memory
Instruction Da ta
2 ns fetch access
MIPS was designed to be pipelined:

2 ns structural
2 ns 2 ns hazards
2 ns 2 ns are easy
to avoid!
Control Hazards
Control hazard: need to make a decision based on the result of a
previous instruction still executing in pipeline
Solution 1 Stall the pipeline
Pr ogr a m
exe c uti o n 2 4 6 8 10 12 14 16
or der Ti me
Ins truction Data Note that branch outcome
a dd $4, $5, $6 Reg ALU Reg
fetch acces s
is
be q $1, $2, 40
Ins truction
fe tch
Reg ALU
Data
access
Re g computed in ID stage with
2ns
Ins truction
added
Data
hardware (later…)
lw $3, 300($0) bubble Reg ALU Re g
fe tch access
4 ns 2ns
Pipeline stall
Control Hazards
Solution 2 Predict branch outcome
e.g., predict branch-not-taken :
Program
execution 2 4 6 8 10 12 14
order Time
(in instructions)
Instruction Data
add $4, $5, $6 fetch
Reg ALU
access
Reg
Instruction Data
beq $1, $2, 40 Reg ALU Reg
2 ns fetch access
Instruction Data
2 ns fetch access
Prediction
success
Program
execution 2 4 6 8 10 12 14
order Time
(in instructions)
Instruction Data
add $4, $5 ,$6 Reg ALU Reg
fetch access
Instruction Data
beq $1, $2, 40 Reg ALU Reg
fetch access
2 ns
bubble bubble bubble bubble bubble
Instruction Data
or $7, $8, $9 Reg ALU Reg
fetch access
4 ns
Prediction failure: undo (=flush) lw
Control Hazards
Solution 3 Delayed branch: always execute the sequentially next
statement with the branch executing after one instruction delay
– compiler’s job to find a statement that can be put in the slot
that is independent of branch outcome
MIPS does this – but it is an option in SPIM (Simulator -> Settings)
Program
executi on 2 4 6 8 10 12 14
order Time
(i ni nstructions)
beq$1, $2, 40 Insfetruction
tch
Re g ALU
Da ta
acce ss
Reg
Ins truction Data

add$4, $5, $6 fe tch
Re g ALU
acce ss
Re g
(d el ayedbranchsl ot) 2ns
Ins truction Da ta
lw$3, 300($0) fe tch
Reg ALU
access
Re g
2ns
2ns
Delayed branch beq is followed by add that is
independent of branch outcome
Data Hazards
Data hazard: instruction needs data from the result of a previous
instruction still executing in pipeline
Solution Forward data if possible…
2 4 6 8 10
Time
IF ID EX MEM
Instruction
add $s0, $t0, $t1 WB
pipeline diagram:
shade indicates
use –
Pr ogr a m
executi on
left=write,
2 4 6 8 10
or der Ti me
right=read
a dd $s 0, $t 0, $t 1 IF ID EX ME M WB Without forwarding
– blue line –
s ub $t 2, $s 0, $t 3 IF ID EX ME M WB data has to go back
in time;
with forwarding –
red line –
Data Hazards
Forwarding may not be enough
e.g., if an R-type instruction following a load uses the result of the
load – called load-use data hazard
2 4 6 8 10 12 14
Progra m Time
execution
order
(in instructions)
lw $s0, 20($t1) IF ID EX ME M WB Without a stall it is

impossible
s ub $t2, $s0, $t3 IF ID EX ME M WB to provide input to
the sub
2 4 6 8 10 12 14instruction in time
Program Time
execution
order
(in instructions)
lw $s0, 20($t1) IF ID EX MEM WB With a one-stage

stall, forwarding
bubble bubble bubble bubble bubble can get the data to
the sub
sub $t2, $s0, $t3 IF ID EX MEM WB
instruction in time
Reordering Code to Avoid Pipeline Stall
(Software Solution)
Example:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1) Data
hazard
sw $t0, 4($t1)
Reordered code:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t0, 4($t1)
sw $t2, 0($t1) Interch
anged
Pipelined Datapath
We now move to actually building a pipelined datapath
First recall the 5 steps in instruction execution
1. Instruction Fetch & PC Increment (IF)
2. Instruction Decode and Register Read (ID)
3. Execution or calculate address (EX)
4. Memory access (MEM)
5. Write result into register (WB)
Review: single-cycle processor
all 5 steps done in a single clock cycle
dedicated hardware required for each step
What happens if we break the execution into multiple cycles, but keep
the extra hardware?
Review - Single-Cycle Datapath “Steps”
A
D
D
4 A
D
D
P Instructi I <
C A
D
R
D 3 1
on3
5 5 5 <
D Instru 2 6 2
2
R Me
ction
R
N
R
N
W
NR Ze
mo Registe
1 2 D
1
A ro
ry W
D r File R L
M
D
2
U
X
U A
D
D D R
1
E
X 3
R Mea D
M
U
X
6 T
N
2 W
D mot
D rya
IF ID EX MEM WB
Instruction Instruction Execute/ Memory Write
Pipelined Datapath – Key Idea
What happens if we break the execution into multiple cycles, but keep
the extra hardware?
Answer: We may be able to start executing a new instruction at each
clock cycle - pipelining
…but we shall need extra registers to hold data between cycles
– pipeline registers
Pipelined Datapath
Pipeline registers wide enough to hold data coming in
A
D
D
A
4
64 128 D
D
P bits bits < 97 64
Instruction
C AD
DR
R
D 3 1 3
I
< bits bits
Instruc 2 2 5 5 5
6
Mem R R W 2
tion N N NR Ze
ory Register
1 2 D
1
A ro
W
D File L
R M
D
2
U
X
U AD
DR
D R
1
E
X 3
Mem
at D
M
U
X
6 T
N
2 W
D ory
a
D
IF/ID ID/ EX/ MEM/

EX MEM WB
Pipelined Datapath
Pipeline registers wide enough to hold data coming in
A
D
D
A
4
64 128 D
D
P bits bits < 97 64
Instruction
C AD
DR
R
D 3 1 3
I
< bits bits
Instruc 2 2 5 5 5
6
Mem R R W 2
tion N N NR Ze
ory Register
1 2 D
1
A ro
W
D File L
R M
D
2
U
X
U AD
DR
D R
1
E
X 3
Mem
at D
M
U
X
6 T
N
2 W
D ory
a
D
IF/ID ID/ EX/ MEM/

EX MEM WB
Only data flowing right to left may cause hazard…, why?
Bug in the Datapath
I I EX/ ME
A
D F D ME M/
4
D
/ / A M WB
D
I E D
P D Instructi I
X <
C A
D
R
D 3 1
on3
<
Instru 2 2 5 5 5
D 6
R Me R R W 2
ction N N NR
mo Registe
1 2 D
1
A
ry W
D r File R L
M
D
2
U
X
U A
D
D D R
1
E
X 3
R Mea D
M
U
X
6 T
N
2 W
D mot
D rya
Write register number comes from another later instruction!

Corrected Datapath
I I EX/ ME
F D ME M/
A
D / / M WB
A
4
D
I 64 E 133 D
D bits X bits < D
102 69
bits bits
P <
C AD R
3 5
R R 2 Z
DR
Instru
D 2
5
N
R
1
D A e
Me
ction
N
W Reg 1
L r
mo
5 2
N F
iste
R M
o
W D U
U AD
ri
D 2 X
ry DR
D
l1 E Me R M
X 3 a D U
e6
X
T
N
2 W
D mot
5 D rya
Destination register number is also passed through ID/EX, EX/MEM

and MEM/WB registers, which are now wider by 5 bits
Pipelined Example
Consider the following instruction sequence:
lw $t0, 10($t1)
sw $t3, 20($t4)
add $t5, $t6, $t7
sub $t8, $t9, $t10
Single-Clock-Cycle Diagram:Clock Cycle
L1
W
IF/ID ID/EX EX/MEM MEM/WB
ADD
ADD
4
<<2
PC
ADDR RD RN1 RD1
32 5
ALU Zero
Instruction RN2
5
Memory Register
WN File RD2
5
M
WD U ADDR
X
Data
E Memory RD M
U
16 X 32 X
T WD
N
5
D
Single-Clock-Cycle Diagram: Clock
S Cycle 2 L
W W
ADD
ADD
4
<<2
PC
ADDR RD RN1 RD1
32 5
ALU Zero
Instruction RN2
5
Memory Register
WN File RD2
5
M
WD U ADDR
X
Data
E Memory RD M
U
16 X 32 X
T WD
N
5
D
ADCycle 3 S L
D W W
ADD
ADD
4
<<2
PC
ADDR RD RN1 RD1
32 5
ALU Zero
Instruction RN2
5
Memory Register
WN File RD2
5
M
WD U ADDR
X
Data
E Memory RD M
U
16 X 32 X
T WD
N
5
D
SUCycle 4 AD S L
B D W W
ADD
ADD
4
<<2
PC
ADDR RD RN1 RD1
32 5
ALU Zero
Instruction RN2
5
Memory Register
WN File RD2
5
M
WD U ADDR
X
Data
E Memory RD M
U
16 X 32 X
T WD
N
5
D
Cycle 5 SU AD S L
B D W W
ADD
ADD
4
<<2
PC
ADDR RD RN1 RD1
32 5
ALU Zero
Instruction RN2
5
Memory Register
WN File RD2
5
M
WD U ADDR
X
Data
E Memory RD M
U
16 X 32 X
T WD
N
5
D
Cycle 6 SU AD S
B D W
ADD
ADD
4
<<2
PC
ADDR RD RN1 RD1
32 5
ALU Zero
Instruction RN2
5
Memory Register
WN File RD2
5
M
WD U ADDR
X
Data
E Memory RD M
U
16 X 32 X
T WD
N
5
D
Cycle 7 SU AD
D
B
ADD
ADD
4
<<2
PC
ADDR RD RN1 RD1
32 5
ALU Zero
Instruction RN2
5
Memory Register
WN File RD2
5
M
WD U ADDR
X
Data
E Memory RD M
U
16 X 32 X
T WD
N
5
D
Cycle 8 SU
B
ADD
ADD
4
<<2
PC
ADDR RD RN1 RD1
32 5
ALU Zero
Instruction RN2
5
Memory Register
WN File RD2
5
M
WD U ADDR
X
Data
E Memory RD M
U
16 X 32 X
T WD
N
5
D
Alternative View – Multiple-Clock-Cycle
Diagram
C C C C C C C C
C
1
C
2
C
3
C
4
C
5
C
6
Time
C
7
C
8
axis
lw $t0, I
M
R
E
A
L
D
M
R
E
10($t1) G U G
sw $t3, I
M
R
E
A
L
D
M
R
E
20($t4) G U G
add $t5, I
M
R
E
A
L
D
M
R
E
$t6, $t7 G U G
s $t8, I
M
R
E
A
L
D
M
R
E
u $t9, G U G
b $t10
Notes
One significant difference in the execution of an R-type
instruction between multicycle and pipelined implementations:
register write-back for the R-type instruction is the 5th (the last
write-back) pipeline stage vs. the 4th stage for the multicycle
implementation. Why?
think of structural hazards when writing to the register file…
Worth repeating: the essential difference between the pipeline
and multicycle implementations is the insertion of pipeline
registers to decouple the 5 stages
The CPI of an ideal pipeline (no stalls) is 1. Why?
The RaVi Architecture Visualization Project of Dortmund U. has
pipeline simulations – see link in our Additional Resources page
As we develop control for the pipeline keep in mind that the text
does not consider jump – should not be too hard to implement!
Recall Single-Cycle Control – the
Datapath
0
M
u
x
ALU
Add result 1
Add Shift PCSrc
RegDst left 2
4 Branch
MemRead
Instruction [31 26] MemtoReg
Control
ALUOp
MemWrite
ALUSrc
RegWrite
Instruction [25 21] Read

PC Read register 1
address Read
Instruction [20 16] data 1
Read
register 2 Zero
Instruction 0 Registers Read ALU ALU
[31– 0] 0 Read
M Write data 2 result Address 1
Instruction u register M data
u M
memory Instruction [15 11] x u
1 Write x Data
data x
1 memory 0
Write
data
16 32
Instruction [15 0] Sign
extend ALU
control
Instruction [5 0]
Instruction AluOp Instruction Funct Field Desired ALU
control
opcode operation ALU action input
Recall Single-Cycle – ALU Control
LW
SW
00
00
load word xxxxxx add
store word xxxxxx add
010
010
Branch eq 01 branch eq xxxxxx subtract 110
R-type 10 add 100000 add 010
R-type 10 subtract 100010 subtract 110
R-type 10 AND 100100 and 000
R-type 10 OR 100101 or 001
R-type 10 set on less 101010 set on less 111
ALUO Funct field Operatio
ALUOp p ALUOp F F F F F F n
1 0 0 0 5X 4X 3X 2X 1X 0X 01
0 1 X X X X X X 0
11
1 X X X 0 0 0 0 0
01
1 X X X 0 0 1 0 0
11
1 X X X 0 1 0 0 0
00
1 X X X 0 1 0 1 0
00
1 X X X 1 0 1 0 1
11
Truth table for ALU control bits 1
Recall Single-Cycle – Control Signals
Effect of control bits
Signal Name Effect when deasserted Effect when asserted
RegDst The register destination number for the The register destination number for the
Write register comes from the rt field (bits 20-16) Write register comes from the rd field (bits 15-11)
RegWrite None The register on the Write register input is written
with the value on the Write data input
AlLUSrc The second ALU operand comes from the The second ALU operand is the sign-extended,
second register file output (Read data 2) lower 16 bits of the instruction
PCSrc The PC is replaced by the output of the adder The PC is replaced by the output of the adder
that computes the value of PC + 4 that computes the branch target
MemRead None Data memory contents designated by the address
input are put on the first Read data output
MemWrite None Data memory contents designated by the address
input are replaced by the value of the Write data input
MemtoReg The value fed to the register Write data input The value fed to the register Write data input
comes from the ALU comes from the data memory
Memto- Reg Mem Mem
Deter- Instruction RegDst ALUSrc Reg Write Read Write Branch ALUOp1 ALUp0
mining R-format 1 0 0 1 0 0 0 1 0
control lw 0 1 1 1 1 0 0 0 0
bits sw X 1 X 0 0 1 0 0 0
beq X 0 X 0 0 0 1 0 1
Pipeline Control
Initial design – motivated by single-cycle datapath control – use
the same control signals
Observe:
No separate write signal for the PC as it is written every cycle Will
No separate write signals for the pipeline registers as they are be
written every cycle modifi
ed
No separate read signal for instruction memory as it is read every
by
clock cycle hazar
No separate read signal for register file as it is read every clock cycle
d
Need to set control signals during each pipeline stage detecti
on
Since control signals are associated with components active unit!!
during a single pipeline stage, can group control lines into five
groups according to pipeline stage
Pipelined Datapath with Control I
PCSrc
0
M
u
x
1
Add
Add
4 Add
result
Branch
Shift
RegWrite left 2
Read MemWrite
Instruction
PC Address register 1 Read

Read data 1 ALUSrc
register 2 Zero
Zero MemtoReg
Instruction
Registers Read ALU ALU
memory Write 0 Read
data 2 result Address 1
register M data
u M
Data u
Write x memory
data x
1
0
Write
data
Instruction
[15– 0] 16 32 6
Sign ALU
extend control MemRead
Same control Instruction

[20– 16]
signals as the Instruction
0
M ALUOp
single-cycle
u
[15– 11] x
1
datapath RegDst
Pipeline Control Signals
There are five stages in the pipeline
instruction fetch / PC increment
Nothing to control as
instruction decode / register fetch
instruction memory
execution / address calculation
read and PC write are
memory access
always enabled
write back
Write-back
Execution/Address Calculation Memory access stage stage control
stage control lines control lines lines
Reg ALU ALU ALU Mem Mem Reg Mem to
Instruction Dst Op1 Op0 Src Branch Read Write write Reg
R-format 1 1 0 0 0 0 0 1 0
lw 0 0 0 1 0 1 0 1 1
sw X 0 0 1 0 0 1 0 X
beq X 0 1 0 1 0 0 0 X
Pipeline Control Implementation
Pass control signals along just like the data – extend each pipeline register to hold needed control bits for succeeding
stages
WB
Instruction
Control M WB
EX M WB
Note: The 6-bit funct field of the instruction required in the EX stage to generate ALU control can be retrieved as the 6
least significant bits of the immediate field which is sign-extended and passed from the IF/ID register to the ID/EX register

Pipelined Datapath with Control II
PCSrc
ID/EX
0
M
u WB
x EX/MEM
1
Control M WB
MEM/WB
EX M WB
IF/ID
Add
Add
4 Add result
RegWrite Shift
Branch
left 2
MemWrite
ALUSrc
Read
MemtoReg
Instruction
PC Address register 1
Read
data 1
Read
register 2 Zero
Instruction
memory Write 0 Read
register M data
u Data M
Write x memory u
data x
1
0
Write
data
Instruction 16 32 6
[15– 0]
Control signals Sign
extend
ALU
control
MemRead
emanate from Instruction

[20– 16]
the control 0
M
u
ALUOp
portions of the Instruction

[15– 11]
1
x
pipeline registers RegDst

IF: lw $10, 20($1) ID: before<1> EX: before<2> MEM: before<3> WB: before<4>

0
M 00 00
u WB
x
1 000 000 00
Control M WB
0 0 0
0000 00 0
EX M WB 0
0 0
Add
Add
4 Add result
RegWrite
Shift Branch
left 2
MemWrite
ALUSrc
Read
MemtoReg
Instruction
Read data 1
register 2 Zero
Instruction
memory Write 0 Read
register M data
u Data M
PipelinedExecutionandControl
Write x memory u
data x
1
0
Write
data
Instruction
[15– 0] Sign ALU MemRead
extend control
Instruction
[20– 16]
Clock cycle 1
0 ALUOp
M
Instruction u
x
Instruction
[15– 11]
1
Clock 1 RegDst
sequence: IF: sub $11, $2, $3 ID: lw $10, 20($1) EX: before<1> MEM: before<2> WB: before<3>

0
lw $10, 20($1) M
1
u
x
11
WB
00
sub $11, $2, $3

lw 010 000 00
Control M WB
0 0 0
0001 00 0
EX M WB 0
and $12, $4, $7

0 0
Add
or $13, $6, $7 4 Add

Add result
RegWrite
Branch
add $14, $8, $9

Shift
left 2
MemWrite
ALUSrc
1 Read
MemtoReg
Instruction
register 1
PC Address Read $1
X data 1
Read
register 2 Zero
Instruction
Registers Read $X ALU ALU
memory Write 0 Read
register M data
u Data M
Write x memory u
Label “before<i>” means

data x
1
0
Write
data
i th instruction before
Instruction
20 [15– 0] Sign 20 ALU MemRead
extend control
lw 10
Instruction
[20– 16] 10
0
Clock cycle 2
ALUOp
M
Instruction u
X [15– 11] X x
1
Clock 2 RegDst
IF: and $12, $4, $5 ID: sub $11, $2, $3 EX: lw $10, . . . MEM: before<1> WB: before<2>

0
M 10 11
u WB
x
1 sub 000 010 00
Control M WB
0 0 0
1100 00 0
EX M WB 0
1 0
Add
Add
4 Add result
RegWrite
Shift Branch
left 2
MemWrite
ALUSrc
2 Read
MemtoReg
Instruction
PC Address register 1 Read $2 $1
3 Read data 1
register 2 Zero
Instruction
Registers Read $3 ALU ALU
memory Write 0 Read
register M data
u Data M
u
Write x memory
data x
1
0
Write
data
Instruction
X [15– 0] Sign X 20 ALU MemRead
extend control
Instruction
X [20– 16] X 10
Clock cycle 3
0 ALUOp
M
Instruction u
x
Instruction
11 [15– 11] 11
1
Clock 3 RegDst
sequence: IF: or $13, $6, $7 ID: and $12, $2, $3 EX: sub $11, . . . MEM: lw $10, . . . WB: before<1>

0
M 10 10
lw $10, 20($1)
u WB
x
1 and 000 000 11
Control M WB
sub $11, $2, $3 1100

EX
1
10
0
M
0
1
0
0
WB 0
and $12, $4, $7 Add
or $13, $6, $7 4 Add

Add result
RegWrite
Shift Branch
add $14, $8, $9

left 2
MemWrite
ALUSrc
4 Read
MemtoReg
Instruction
register 1
PC Address Read $4 $2
5 data 1
Read
register 2 Zero
Instruction
Registers Read $5 $3 ALU ALU
memory Write 0 Address Read
data 2 result 1
register M data
u Data M
Write x u
memory x
data 1
0
Write
data
Instruction
X [15– 0] Sign X ALU MemRead
extend control
Instruction
X [20– 16] X
0 ALUOp
Clock cycle 4 12
Instruction
[15– 11] 12 11
M
1
u
x
10
Clock 4 RegDst
IF: add $14, $8, $9 ID: or $13, $6, $7 EX: and $12, . . . MEM: sub $11, . . . WB: lw $10, . . .

0
M 10 10
u WB
x
1 or 000 000 10
Control M WB
1 0 1
1100 10 0
EX M WB 1
0 0
Add
Add
4 Add result
RegWrite
Shift Branch
left 2
MemWrite
ALUSrc
6 Read
MemtoReg
Instruction
PC Address register 1 Read $6 $4
7 Read data 1
register 2 Zero
Instruction $5
Registers Read $7 ALU ALU
memory 10 Write 0 Read
register M data
u Data M
Write x memory u
x
data 1
0
Write
data
Instruction
extend control
Instruction
X [20– 16] X
Clock cycle 5
0 ALUOp
M 11 10
Instruction u
13 [15– 11] 13 12 x
Clock 5 1
Instruction
RegDst
sequence:
IF: after<1> ID: add $14, $8, $9 EX: or $13, . . . MEM: and $12, . . . WB: sub $11, . . .

0
M 10 10
lw $10, 20($1)
u WB
x
1 add 000 000 10
Control M WB
sub $11, $2, $3

1 0 1
1100 10 0
EX M WB 0
0 0
and $12, $4, $7 Add
or $13, $6, $7 4 Add

Add result
RegWrite
Shift Branch
add $14, $8, $9

left 2
MemWrite
ALUSrc
8 Read
MemtoReg
Instruction
register 1
PC Address Read $8 $6
9 data 1
Read
register 2 Zero
Instruction
Registers Read $9 $7 ALU ALU
register M data
u Data M
Write x memory u
Label “after<i>” means

data x
1
0
Write
data
i th instruction after add

Instruction
extend control
Instruction
X [20– 16] X
Clock cycle 6
0 ALUOp
M 12 11
Instruction u
14 [15– 11] 14 13 x
1
Clock 6 RegDst
IF: after<2> ID: after<1> EX: add $14, . . . MEM: or $13, . . . WB: and $12, . . .

0
M 00 10
u WB
x
1 000 000 10
Control M WB
1 0 1
0000 10 0
EX M WB 0
0 0
Add
Add
4 Add result
RegWrite
Shift Branch
left 2
MemWrite
ALUSrc
Read
MemtoReg
Instruction
PC Address register 1 Read $8
Read data 1
register 2 Zero
Instruction $9
register M data
u Data M
Write x memory u
x
data 1
0
Write
data
Instruction
extend control
Instruction
[20– 16]
Clock cycle 7
0 ALUOp
M 13 12
Instruction u
[15– 11] 14 x
1
Clock 7 RegDst
Instruction IF: after<3> ID: after<2> EX: after<1> MEM: add $14, . . . WB: or $13, . . .
sequence: 0
M
IF/ID
00
ID/EX
WB
00
EX/MEM MEM/WB
u
x
1
lw $10, 20($1)
000 000 10
Control M WB
0 0 1
0000 00 0
EX M WB 0
sub $11, $2, $3

0 0
Add
and $12, $4, $7 4 Add

Add result
RegWrite
Branch
or $13, $6, $7
Shift
left 2
MemWrite
ALUSrc
Read
add $14, $8, $9
MemtoReg
Instruction
Read
data 1
Read
register 2 Zero
Instruction
register M data
u Data M
Write x memory u
data x
1
0
Write
data
Instruction
extend control
Instruction
[20– 16]
Clock cycle 8
0 ALUOp
M 14 13
Instruction u
[15– 11] x
1
Clock 8 RegDst
Pipelined Execution and Control
Instruction IF: after<4> ID: after<3> EX: after<2> MEM: after<1> WB: add $14, . . .
sequence: 0
M 00 00
u WB
x
1
lw $10, 20($1)
000 000 00
Control M WB
0 0 1
sub $11, $2, $3

0000 00 0
EX M WB 0
0 0
and $12, $4, $7 Add
or $13, $6, $7
Add
4 Add result
RegWrite
Shift Branch
add $14, $8, $9 left 2
MemWrite
ALUSrc
Read
MemtoReg
Instruction

Read data 1
register 2 Zero
Instruction
register M data
u Data M
Write x memory u
data x
1
0
Write
data
Instruction
extend control
Instruction
[20– 16]
0 ALUOp
M 14
Clock cycle 9 Instruction

[15– 11]
1
u
x
Clock 9 RegDst
Revisiting Hazards
So far our datapath and control have ignored hazards
We shall revisit data hazards and control hazards and
enhance our datapath and control to handle them in
hardware…
Data Hazards and Forwarding
Problem with starting an instruction before previous are finished:
data dependencies that go backward in time – called data hazards
Time (in clock cycles)
$2 = 10 Value of CC 1
register $2: 10
CC 2
10
CC 3
10
CC 4
10
CC 5
10/– 20
CC 6
– 20
CC 7
– 20
CC 8
– 20
CC 9
– 20
before sub; Program
execution
$2 = -20 after order
(in instructions)
sub sub $2, $1, $3 IM Reg DM Reg
sub $2, $1, $3

and $12, $2, $5 and $12, $2, $5 IM Reg DM Reg
or $13, $6, $2
add $14, $2, $2
or $13, $6, $2 IM DM Reg
sw $15, 100($2) Reg
add $14, $2, $2 IM Reg DM Reg
sw $15, 100($2) IM Reg DM Reg

Software Solution
Have compiler guarantee never any data hazards!
by rearranging instructions to insert independent instructions
between instructions that would otherwise have a data hazard
between them,
or, if such rearrangement is not possible, insert nops
sub $2, $1, $3 sub $2, $1, $3

lw $10, 40($3) nop
slt $5, $6, $7 and $12, $2, $5 onop and $12, $2, $5 or
or $13, $6, $2 add $14, $13,
r $6, $2 add $14, $2, $2 sw
$2, $2 sw $15, 100($2) $15, 100($2)
Such compiler solutions may not always be possible, and nops
slow the machine down
MIPS: nop = “no operation” = 00…0 (32bits)

= sll $0, $0, 0
Hardware Solution: Forwarding
Idea: use intermediate data, do not wait for result to be finally
written to the destination register. Two steps:
1. Detect data hazard
2. Forward intermediate data to resolve hazard
Pipelined Datapath with Control II (as
before)
PCSrc
ID/EX
0
M
u WB
x EX/MEM
1
Control M WB
MEM/WB
EX M WB
IF/ID
Add
Add
4 Add result
RegWrite Shift
Branch
left 2
MemWrite
ALUSrc
Read
MemtoReg
Instruction
Read
data 1
Read
register 2 Zero
Instruction
memory Write 0 Read
register M data
u Data M
Write x memory u
data x
1
0
Write
data
Instruction 16 32 6
[15– 0]
extend
ALU
control
MemRead

[20– 16]
the control 0
M
u
ALUOp

[15– 11]
1
x

Hazard Detection
Hazard conditions:
1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
Eg., in the earlier example, first hazard between sub $2, $1, $3 and
and $12, $2, $5 is detected when the and is in EX stage and the
sub is in MEM stage because
EX/MEM.RegisterRd = ID/EX.RegisterRs = $2 (1a)
Whether to forward also depends on:

if the later instruction is going to write a register – if not, no need to
forward, even if there is register number match as in conditions above
if the destination register of the later instruction is $0 – in which case
there is no need to forward value ($0 is always 0 and never
overwritten)
Plan:
Data Forwarding
allow inputs to the ALU not just from ID/EX, but also later
pipeline registers, and
use multiplexors and control signals to choose appropriate
inputs to ALU
Time (in clock cycles)
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9
Value of register $2 : 10 10 10 10 10/– 20 – 20 – 20 – 20 – 20
Value of EX/MEM : X X X – 20 X X X X X
Value of MEM/WB : X X X X – 20 X X X X
Program
execution order
(in instructions)
sub $2, $1, $3 IM Reg DM Reg
sub $2, $1, $3

and $12, $2, $5 and $12, $2, $5 IM Reg DM Reg
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2) or $13, $6, $2 IM Reg DM Reg
sw $15, 100($2) IM Reg DM Reg
Dependencies between pipelines move forward in time

ID/EX EX/MEM MEM/WB
ForwardingHardware
Registers ALU
Data
memory M
u
x
Datapath before adding forwarding hardware

a. No forwarding
ID/EX EX/MEM MEM/WB
M
u
x
Registers
ForwardA ALU
M Data
u memory
x M
u
x
Rs ForwardB
Rt
Rt M
u EX/MEM.RegisterRd
Rd
x
Forwarding MEM/WB.RegisterRd
unit
b. With forwarding Datapath after adding forwarding hardware

Forwarding Hardware: Multiplexor
Control
Mux control Source Explanation
ForwardA = 00 ID/EX The first ALU operand comes from the register file
ForwardA = 10 EX/MEM The first ALU operand is forwarded from prior ALU result
ForwardA = 01 MEM/WB The first ALU operand is forwarded from data memory
or an earlier ALU result
ForwardB = 00 ID/EX The second ALU operand comes from the register file
ForwardB = 10 EX/MEM The second ALU operand is forwarded from prior ALU
result
ForwardB = 01 MEM/WB The second ALU operand is forwarded from data
memory
or an earlier ALU result
Depending on the selection in the

rightmost multiplexor
(see datapath with control diagram)
Data Hazard: Detection and
Forwarding
Forwarding unit determines multiplexor control according to the
following rules:
1. EX hazard
if ( EX/MEM.RegWrite // if there is a write…
and ( EX/MEM.RegisterRd ≠ 0 ) // to a non-$0 register…
and ( EX/MEM.RegisterRd = ID/EX.RegisterRs ) ) // which matches,
then… ForwardA = 10
2.
if ( EX/MEM.RegWrite // if there is a write…

and ( EX/MEM.RegisterRd ≠ 0 ) // to a non-$0 register…
and ( EX/MEM.RegisterRd = ID/EX.RegisterRt ) ) // which matches,
then… ForwardB = 10
Data Hazard: Detection and
Forwarding
1. MEM hazard
if ( MEM/WB.RegWrite // if there is a write…
and ( MEM/WB.RegisterRd ≠ 0 ) // to a non-$0 register…
and ( EX/MEM.RegisterRd ≠ ID/EX.RegisterRs )// and not already a register match
// with earlier pipeline register…
and ( MEM/WB.RegisterRd = ID/EX.RegisterRs ) ) // but match with later pipeline
register, then…
ForwardA = 01
if ( MEM/WB.RegWrite // if there is a write…

and ( MEM/WB.RegisterRd ≠ 0 ) // to a non-$0 register…
and ( EX/MEM.RegisterRd ≠ ID/EX.RegisterRt )// and not already a register match
// with earlier pipeline register…
and ( MEM/WB.RegisterRd = ID/EX.RegisterRt ) ) // but match with later pipeline
register, then…
ForwardB = 01
This check is necessary, e.g., for sequences such as add $1, $1,
$2; add $1, $1, $3; add $1, $1, $4;
Forwarding Hardware with Control
Called forwarding unit, not hazard
ID/EX
detection unit,
WB because once data is forwarded
EX/MEM
Control M there is no hazard!

WB
MEM/WB
IF/ID EX M WB
M
Instruction
u
x
Registers
Instruction Data
PC ALU
memory memory M
u
M x
u
x
IF/ID.RegisterRs Rs
IF/ID.RegisterRt Rt
IF/ID.RegisterRt Rt
M EX/MEM.RegisterRd
IF/ID.RegisterRd Rd u
x
Forwarding MEM/WB.RegisterRd
unit
Datapath with forwarding hardware and control wires – certain details,

e.g., branching hardware, are omitted to simplify the drawing
Note: so far we have only handled forwarding to R-type instructions…!
or $4, $4, $2 and $4, $2, $5 sub $2, $1, $3 before<1> before<2>
ID/EX
10 10
WB
EX/MEM
Forwarding
Control M WB
MEM/WB
IF/ID EX M WB
2 $2 $1
M
Instruction
5 u
x
Registers
Instruction Data
PC ALU
memory memory M
$5 $3
u
M x
u
x
2 1
5 3
M
4 2 u
x
Clock cycle 3
Forwarding
unit
Execution Clock 3
example: add $9, $4, $2 or $4, $4, $2 and $4, $2, $5 sub $2, . . . before<1>
ID/EX
10 10
sub $2, $1, $3 WB

EX/MEM
10
and $4, $2, $5 Control M WB

MEM/WB
or $4, $4, $2 IF/ID EX M WB
add $9, $4, $2 4 $4 $2

M
Instruction
6 u
x
Registers
Instruction Data
PC ALU
memory memory M
$2 $5
u
M x
u
x
2 2
6 5
M 2
4 4 u
x
Forwarding
Clock cycle 4 unit
Clock 4
after<1> add $9, $4, $2 or $4, $4, $2 and $4, . . . sub $2, . . .
ID/EX
10 10
WB
EX/MEM
Forwarding
10
Control M WB
MEM/WB
1
IF/ID EX M WB
4 $4 $4
M
Instruction
2 u
x
Registers
Instruction 2 Data
PC ALU
memory memory M
$2 $2
u
M x
u
x
4 4
2 2
M 4 2
u
Execution
9 4
x
Clock cycle 5
Forwarding
unit
example Clock 5
(cont.): after<2> after<1> add $9, $4, $2 or $4, . . . and $4, . . .
ID/EX
10
sub $2, $1, $3 WB

EX/MEM
10
and $4, $2, $5 Control M WB

MEM/WB
1
or $4, $4, $2 IF/ID EX M WB
add $9, $4, $2 $4

M
Instruction
u
x
Registers
Instruction 4 Data
PC ALU
memory memory M
$2
u
M x
u
x
4
2
M 4 4
9 u
x
Forwarding
Clock cycle 6 unit
Clock 6
Data Hazards and Stalls
Load word can still cause a hazard:
an instruction tries to read a register following a load instruction that writes to
the same register
lw $2, 20($1) Time (in clock cycles)

Program CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9
and $4, $2, $5 execution
or $8, $2, $6 order

(in instructions)
add $9, $4, $2 lw $2, 20($1) IM Reg DM Reg
Slt $1, $6, $7

and $4, $2, $5 IM Reg DM Reg
As even a pipeline
dependency goes or $8, $2, $6 IM Reg DM Reg
backward therefore,
in time we need a hazard detection unit to stall the pipeline after the load
forwardinginstruction
will not add $9, $4, $2 IM Reg DM Reg
solve the hazard

slt $1, $6, $7 IM Reg DM Reg
Pipelined Datapath with Control II (as
before)
PCSrc
ID/EX
0
M
u WB
x EX/MEM
1
Control M WB
MEM/WB
EX M WB
IF/ID
Add
Add
4 Add result
RegWrite Shift
Branch
left 2
MemWrite
ALUSrc
Read
MemtoReg
Instruction
Read
data 1
Read
register 2 Zero
Instruction
memory Write 0 Read
register M data
u Data M
Write x memory u
data x
1
0
Write
data
Instruction 16 32 6
[15– 0]
extend
ALU
control
MemRead

[20– 16]
the control 0
M
u
ALUOp

[15– 11]
1
x

Hazard Detection Logic to Stall
Hazard detection unit implements the following check if to stall
if ( ID/EX.MemRead // if the instruction in the EX stage is a

load…
and ( ( ID/EX.RegisterRt = IF/ID.RegisterRs ) // and the destination register
or ( ID/EX.RegisterRt = IF/ID.RegisterRt ) ) ) // matches either source register
// of the instruction in the ID stage,
then…
stall the pipeline
Mechanics of Stalling
If the check to stall verifies, then the pipeline needs to stall
only 1 clock cycle after the load as after that the forwarding
unit can resolve the dependency
What the hardware does to stall the pipeline 1 cycle:
does not let the IF/ID register change (disable write!) – this will
cause the instruction in the ID stage to repeat, i.e., stall
therefore, the instruction, just behind, in the IF stage must be
stalled as well – so hardware does not let the PC change (disable
write!) – this will cause the instruction in the IF stage to repeat,
i.e., stall
changes all the EX, MEM and WB control fields in the ID/EX
pipeline register to 0, so effectively the instruction just behind
the load becomes a nop – a bubble is said to have been inserted
into the pipeline
note that we cannot turn that instruction into an nop by 0ing all the
bits in the instruction itself – recall nop = 00…0 (32 bits) – because
it has already been decoded and control signals generated
Hazard Detection Unit Hazard
detection
ID/EX.MemRead
unit ID/EX
WB
IF/IDWrite EX/MEM
M
Control u M WB
x MEM/WB
0
IF/ID EX M WB
PCWrite
M
Instruction
u
x
Registers
Instruction Data
PC ALU
memory memory M
u
M x
u
x
IF/ID.RegisterRs
IF/ID.RegisterRt
IF/ID.RegisterRt Rt M EX/MEM.RegisterRd
IF/ID.RegisterRd Rd u
x
ID/EX.RegisterRt Rs Forwarding MEM/WB.RegisterRd
Rt unit
Datapath with forwarding hardware, the hazard detection unit and

controls wires – certain details, e.g., branching hardware are omitted
to simplify the drawing
Stalling Resolves a Hazard
Same instruction sequence as before for which forwarding by
itself could not resolve the hazard:
Program Time (in clock cycles)
execution CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9 CC 10
order
(in instructions)
lw $2, 20($1) lw $2, 20($1) IM Reg DM Reg

and $4, $2, $5
or $8, $2, $6
add $9, $4, $2 and $4, $2, $5 IM Reg Reg DM Reg
Slt $1, $6, $7

or $8, $2, $6 IM IM Reg DM Reg
bubble
slt $1, $6, $7 IM Reg DM Reg
Hazard detection unit inserts a 1-cycle bubble in the pipeline, after

which all pipeline register dependencies go forward so then the
forwarding unit can handle them and there are no more hazards
and $4, $2, $5 lw $2, 20($1) before<1> before<2> before<3>
Hazard
ID/EX.MemRead
detection
1 unit ID/EX
X
11
WB
IF/IDWrite
EX/MEM
M
Control u M WB
MEM/WB
Stalling
x
0
IF/ID EX M WB
1 $1
PCWrite
M
Instruction
X u
x
Registers
Instruction Data
PC ALU
memory memory M
$X
u
M x
u
x
Execution 1
X
2
M
example:
u
x
ID/EX.RegisterRt Forwarding
unit
Clock cycle 2
Clock 2
lw $2, 20($1) or $4, $4, $2 and $4, $2, $5 lw $2, 20($1) before<1> before<2>
and $4, $2, $5 2
Hazard
detection ID/EX.MemRead
or $4, $4, $2
unit ID/EX
5
00 11
WB
IF/IDWrite
add $9, $4, $2

EX/MEM
M
Control u M WB
x MEM/WB
0
IF/ID EX M WB
2 $2 $1
PCWrite
M
Instruction
5 u
x
Registers
Instruction Data
PC ALU
memory memory M
$5 $X
u
M x
u
x
2 1
5 X
2 M
4 u
x
unit
Clock cycle 3
Clock 3
or $4, $4, $2 and $4, $2, $5 bubble lw $2, . . . before<1>
Hazard
ID/EX.MemRead
detection
2 unit ID/EX
5
10 00
WB
IF/IDWrite
EX/MEM
M
Stalling
11
Control u M WB
x MEM/WB
0
IF/ID EX M WB
2 $2 $2
PCWrite
M
Instruction
5 u
x
Registers
Instruction Data
PC ALU
memory memory M
$5 $5
u
M x
u
x
Execution
2 2
5 5
M 2
4 4 u
example
x
Clock cycle 4
unit
(cont.): Clock 4
add $9, $4, $2 or $4, $4, $2 and $4, $2, $5 bubble lw $2, . . .
Hazard
ID/EX.MemRead
lw $2, 20($1)
detection
4 unit ID/EX
2
10 10
and $4, $2, $5 WB

IF/IDWrite
EX/MEM
M 0
or $4, $4, $2 Control

0
u
x
M WB
MEM/WB
11
add $9, $4, $2 IF/ID EX M WB
4 $4 $2
PCWrite
M
Instruction
2 u
x
Registers
Instruction 2 Data
PC ALU
memory memory M
$2 $5
u
M x
u
x
4 2
2 5
M 2
4 4 u
x
unit
Clock cycle 5
Clock 5
after<1> add $9, $4, $2 or $4, $4, $2 and $4, . . . bubble
Hazard ID/EX.MemRead
detection
4
unit ID/EX
2
10 10
WB
IF/IDWrite
EX/MEM
M 10
Control u M WB
x MEM/WB
Stalling
0
0
IF/ID EX M WB
4 $4 $4
PCWrite
M
Instruction
2 u
x
Registers
Instruction Data
PC ALU
memory memory M
$2 $2
u
M x
u
x
4 4
Execution 2
9
2
4
M
u
4
x
example ID/EX.RegisterRt Forwarding

unit
Clock cycle 6
(cont.): Clock 6
after<2> after<1> add $9, $4, $2 or $4, . . . and $4, . . .

Hazard
lw $2, 20($1) detection ID/EX.MemRead

unit ID/EX
and $4, $2, $5

10 10
WB
IF/IDWrite
EX/MEM
or $4, $4, $2
M 10
Control u M WB
x MEM/WB
0
add $9, $4, $2

1
IF/ID EX M WB
$4
PCWrite
M
Instruction
u
x
Registers
Instruction 4 Data
PC ALU
memory memory M
$2
u
M x
u
x
4
2
M 4 4
9 u
x
Clock cycle 7
unit
Clock 7
Control (or Branch) Hazards
Problem with branches in the pipeline we have so far is that the
branch decision is not made till the MEM stage – so what
instructions, if at all, should we insert into the pipeline following
the branch instructions?
Possible solution: stall the pipeline till branch decision is known

not efficient, slow the pipeline significantly!
Another solution: predict the branch outcome

e.g., always predict branch-not-taken – continue with next
sequential instructions
if the prediction is wrong have to flush the pipeline behind the
branch – discard instructions already fetched or decoded – and
continue execution at the branch target
Predicting Branch-not-taken:
Misprediction delay
Program Time (in clock cycles)
execution CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9
order
(in instructions)
40 beq $1, $3, 7 IM Reg DM Reg
44 and $12, $2, $5 IM Reg DM Reg
48 or $13, $6, $2 IM Reg DM Reg
52 add $14, $2, $2 IM Reg DM Reg
72 lw $4, 50($7) IM Reg DM Reg
The outcome of branch taken (prediction wrong) is decided only when

beq is in the MEM stage, so the following three sequential instructions
already in the pipeline have to be flushed and execution resumes at lw
Optimizing the Pipeline to Reduce
Branch Delay
Move the branch decision from the MEM stage (as in our
current pipeline) earlier to the ID stage
calculating the branch target address involves moving the branch
adder from the MEM stage to the ID stage – inputs to this adder,
the PC value and the immediate fields are already available in
the IF/ID pipeline register
calculating the branch decision is efficiently done, e.g., for
equality test, by XORing respective bits and then ORing all the
results and inverting, rather than using the ALU to subtract and
then test for zero (when there is a carry delay)
with the more efficient equality test we can put it in the ID stage
without significantly lengthening this stage – remember an objective
of pipeline design is to keep pipeline stages balanced
we must correspondingly make additions to the forwarding and
hazard detection units to forward to or stall the branch at the ID
stage in case the branch decision depends on an earlier result
Flushing on Misprediction
Same strategy as for stalling on load-use data hazard…
Zero out all the control values (or the instruction itself) in
pipeline registers for the instructions following the branch that
are already in the pipeline – effectively turning them into nops
– so they are flushed
in the optimized pipeline, with branch decision made in the ID
stage, we have to flush only one instruction in the IF stage – the
branch delay penalty is then only one clock cycle
Optimized Datapath for Branch
IF.Flush
Hazard
detection
unit IF.Flush control zeros out the
instruction in the IF/ID
M ID/EX
u
x
pipeline register (which follows the
WB
EX/MEM
M
Control
0
u
x
M
branch) WB
MEM/WB
IF/ID EX M WB
4 Shift
left 2
M
u
x
Registers =
Instruction Data
PC ALU
memory memory M
u
M x
u
x
Sign
extend
M
u
x
Forwarding
unit
Branch decision is moved from the MEM stage to the ID stage – simplified drawing not showing
enhancements to the forwarding and hazard detection units
and $12, $2, $5 beq $1, $3, 7 sub $10, $4, $8 before<1> before<2>
IF.Flush
Hazard
detection
unit
72 ID/EX
M
PipelinedBranch
u
48 x WB
EX/MEM
M
Control u M WB
x MEM/WB
28
0
IF/ID EX M WB
48 44 72
4
$1
Shift M $4
left 2 u
x
=
Registers
Instruction Data
PC ALU
memory memory M
72 44 $3
u
M $8 x
7 u
x
Execution Sign
extend
example: 10
Forwarding
Clock cycle 3 unit
36 sub $10, $4, $8 Clock 3
40 beq $1, $3, 7 lw $4, 50($7) bubble (nop) beq $1, $3, 7 sub $10, . . . before<1>
44 and $12 $2, $5 IF.Flush
48 or $13 $2, $6
Hazard
detection
unit
ID/EX
52 add $14, $4, $2

M
u
76 x WB
EX/MEM
56 slt $15, $6, $7

M
Control u M WB
x MEM/WB
0
…
IF/ID EX M WB
76 72
72 lw $4, 50($7) Shift

left 2
=
M
u
x
$1
Registers
Instruction Data
PC ALU
memory memory M
76 72
u
M $3 x
Optimized pipeline with

u
x
only one bubble as a result

Sign
extend
of the taken branch 10
Forwarding
Clock cycle 4
unit
Clock 4
Simple Example: Comparing
Performance
Compare performance for single-cycle, multicycle, and pipelined
datapaths using the gcc instruction mix
assume 2 ns for memory access, 2 ns for ALU operation, 1 ns
for register read or write
assume gcc instruction mix 23% loads, 13% stores, 19%
branches, 2% jumps, 43% ALU
for pipelined execution assume
50% of the loads are followed immediately by an instruction that
uses the result of the load
25% of branches are mispredicted
branch delay on misprediction is 1 clock cycle
jumps always incur 1 clock cycle delay so their average time is 2
clock cycles
Simple Example: Comparing
Performance
Single-cycle (p. 373): average instruction time 8 ns
Multicycle (p. 397): average instruction time 8.04 ns
Pipelined:
loads use 1 cc (clock cycle) when no load-use dependency and 2 cc
when there is dependency – given 50% of loads are followed by
dependency the average cc per load is 1.5
stores use 1 cc each
branches use 1 cc when predicted correctly and 2 cc when not –
given 25% misprediction average cc per branch is 1.25
jumps use 2 cc each
ALU instructions use 1 cc each
therefore, average CPI is
1.5 × 23% + 1 × 13% + 1.25 × 19% + 2 × 2% + 1 × 43% = 1.18
therefore, average instruction time is 1.18 × 2 = 2.36 ns

7 Performance With Pipelining

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

7 Performance With Pipelining

Uploaded by

Copyright:

Available Formats

EEE440 - Computer Architecture

Enhancing Performance with Pipelining

I nstr ucti on Dat a

Person Unpipelined Pipeline 1 Ratio unpipelined Pipeline 2 Ratio unpiplelined

Before actually building the pipelined datapath and control

MIPS was designed to be pipelined:

Ins truction Data

lw $s0, 20($t1) IF ID EX ME M WB Without a stall it is

lw $s0, 20($t1) IF ID EX MEM WB With a one-stage

IF/ID ID/ EX/ MEM/

IF/ID ID/ EX/ MEM/

Write register number comes from another later instruction!

Destination register number is also passed through ID/EX, EX/MEM

Instruction [25 21] Read

Signal Name Effect when deasserted Effect when asserted

IF/ID ID/EX EX/MEM MEM/WB

PC Address register 1 Read

Same control Instruction

IF/ID ID/EX EX/MEM MEM/WB

emanate from Instruction

portions of the Instruction

pipeline registers RegDst

IF/ID ID/EX EX/MEM MEM/WB

IF/ID ID/EX EX/MEM MEM/WB

sub $11, $2, $3

and $12, $4, $7

or $13, $6, $7 4 Add

add $14, $8, $9

Label “before<i>” means

IF/ID ID/EX EX/MEM MEM/WB

IF/ID ID/EX EX/MEM MEM/WB

sub $11, $2, $3 1100

and $12, $4, $7 Add

or $13, $6, $7 4 Add

add $14, $8, $9

IF/ID ID/EX EX/MEM MEM/WB

IF/ID ID/EX EX/MEM MEM/WB

sub $11, $2, $3

and $12, $4, $7 Add

or $13, $6, $7 4 Add

add $14, $8, $9

Label “after<i>” means

i th instruction after add

IF/ID ID/EX EX/MEM MEM/WB

sub $11, $2, $3

and $12, $4, $7 4 Add

add $14, $8, $9

sub $11, $2, $3

and $12, $4, $7 Add

add $14, $8, $9 left 2

PC Address register 1 Read

Clock cycle 9 Instruction

sub $2, $1, $3

add $14, $2, $2 IM Reg DM Reg

sw $15, 100($2) IM Reg DM Reg

sub $2, $1, $3 sub $2, $1, $3

MIPS: nop = “no operation” = 00…0 (32bits)

emanate from Instruction

portions of the Instruction

pipeline registers RegDst

Whether to forward also depends on:

sub $2, $1, $3

add $14, $2, $2 IM Reg DM Reg