
Computer Architecture

Nguyễn Trí Thành


Information Systems Department
Faculty of Technology
College of Technology
ntthanh@vnu.edu.vn

11/27/2010 1
Enhancing Performance
with Pipelining

Pipelining
• Start work ASAP!! Do not waste time!

[Figure: laundry timeline from 6 PM to 2 AM, task order A, B, …, not pipelined]

Assume 30 min. each task (wash, dry, fold, store) and that
separate tasks use separate hardware and so can be overlapped

[Figure: the same timeline from 6 PM to 2 AM, task order A, B, …, D, pipelined]
Pipelined vs. Single-Cycle
Instruction Execution: the Plan
[Figure: single-cycle execution of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Each instruction passes through instruction fetch, Reg, ALU, data access, Reg; successive instructions start 8 ns apart]

Assume 2 ns for memory access and ALU operation, and 1 ns for register access;
therefore, the single-cycle clock is 8 ns and the pipelined clock cycle is 2 ns.

[Figure: pipelined execution of the same three lw instructions; every stage takes one 2 ns cycle and successive instructions start 2 ns apart]
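The timing argument above can be checked with a few lines of Python. This is only a sketch under the slide's stated latencies; the stage names and the helper functions are illustrative, not part of the original slides.

```python
# Stage latencies from the slide: memory access and ALU take 2 ns, register access 1 ns.
STAGE_NS = {"IF": 2, "ID": 1, "EX": 2, "MEM": 2, "WB": 1}


def single_cycle_time(n_instructions):
    """Each instruction takes one long cycle equal to the sum of all stage latencies."""
    return n_instructions * sum(STAGE_NS.values())


def pipelined_time(n_instructions):
    """The clock is set by the slowest stage; the first instruction needs one
    cycle per stage and every later instruction finishes one cycle after it."""
    cycle = max(STAGE_NS.values())
    return (len(STAGE_NS) + n_instructions - 1) * cycle


if __name__ == "__main__":
    for n in (3, 1000):
        print(n, single_cycle_time(n), pipelined_time(n))
```

For the three lw instructions this gives 24 ns single-cycle against 14 ns pipelined; the gap widens as the instruction count grows because the fill time is paid only once.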
Pipelining: Keep in Mind
• Pipelining does not reduce the latency of a single
task; it increases the throughput of the entire workload
• Pipeline rate is limited by the longest stage
• potential speedup = number of pipe stages
• unbalanced lengths of pipe stages reduce
speedup
• Time to fill the pipeline and time to drain it, when
there is slack in the pipeline, reduce
speedup
Example Problem
• Problem: for the laundry, fill in the following table when
  1. the stage lengths are 30, 30, 30, 30 min., resp.
  2. the stage lengths are 20, 20, 60, 20 min., resp.

Person   Unpipelined   Pipeline 1    Ratio unpipelined   Pipeline 2    Ratio unpipelined
         finish time   finish time   to pipeline 1       finish time   to pipeline 2
1
2
3
4

• Come up with a formula for pipeline speed-up!
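One way to explore the table is to simulate the laundry pipeline. The sketch below assumes that every load is ready at time 0 and that a load may wait between stages until the next stage is free; the function name and these assumptions are illustrative and not taken from the slides.

```python
def finish_times(stage_minutes, n_people, pipelined):
    """Return the finish time (minutes after the start) of each person's laundry."""
    if not pipelined:
        total = sum(stage_minutes)
        return [total * (k + 1) for k in range(n_people)]
    # stage_free[s] is the time at which stage s finishes its previous load
    stage_free = [0] * len(stage_minutes)
    finishes = []
    for _ in range(n_people):
        t = 0  # assumption: every load is ready at time 0
        for s, length in enumerate(stage_minutes):
            start = max(t, stage_free[s])  # wait until the stage is free
            t = start + length
            stage_free[s] = t
        finishes.append(t)
    return finishes


if __name__ == "__main__":
    for stages in ([30, 30, 30, 30], [20, 20, 60, 20]):
        unpiped = finish_times(stages, 4, pipelined=False)
        piped = finish_times(stages, 4, pipelined=True)
        for person, (u, p) in enumerate(zip(unpiped, piped), start=1):
            print(stages, person, u, p, round(u / p, 2))
```

Under these assumptions each extra person adds only the length of the longest stage to the pipelined finish time, which is the observation a speed-up formula can be built on.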
Pipelining MIPS

• What makes it easy with MIPS?
  - all instructions are the same length
      so fetch and decode stages are similar for all instructions
  - just a few instruction formats
      simplifies instruction decode and makes it possible in one stage
  - memory operands appear only in load/stores
      so memory access can be deferred to exactly one later stage
  - operands are aligned in memory
      one data transfer instruction requires one memory access stage
Pipelining MIPS
• What makes it hard?
  - structural hazards: different instructions, at different stages
    in the pipeline, want to use the same hardware resource
  - control hazards: the succeeding instruction, to put into the pipeline,
    depends on the outcome of a previous branch instruction
    already in the pipeline
  - data hazards: an instruction in the pipeline requires data to
    be computed by a previous instruction still in the pipeline
• Before actually building the pipelined datapath and control we first
briefly examine these potential hazards individually…
Structural Hazards
• Structural hazard: inadequate hardware to simultaneously support
all instructions in the pipeline in the same clock cycle
• E.g., suppose a single (not separate) instruction and data memory
with one read port in the pipeline below
  - then there is a structural hazard between the first and fourth lw instructions:
    the data access of the first lw and the instruction fetch of the fourth lw
    fall in the same 2 ns cycle

[Figure: pipelined execution of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0); lw $4, 400($0), starting 2 ns apart; marked "Hazard if single memory"]

• MIPS was designed to be pipelined: structural hazards are easy to avoid!
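The single-memory conflict can be found mechanically. The sketch below assumes a stream of back-to-back lw instructions, one fetched per cycle, each using memory in its IF cycle and again three cycles later in MEM; the function name is illustrative.

```python
from collections import defaultdict


def memory_conflicts(n_loads):
    """Return {cycle: [instruction numbers]} for every cycle in which more than
    one lw needs the single, single-ported memory."""
    users = defaultdict(list)
    for i in range(n_loads):        # instruction i + 1 is fetched in cycle i + 1
        users[i + 1].append(i + 1)  # IF: instruction fetch
        users[i + 4].append(i + 1)  # MEM: data access, three cycles later
    return {c: insts for c, insts in sorted(users.items()) if len(insts) > 1}


if __name__ == "__main__":
    print(memory_conflicts(4))
```

With four loads the only conflict is in cycle 4, between instructions 1 and 4, matching the hazard marked in the figure; separate instruction and data memories remove it.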
Control Hazards
• Control hazard: need to make a decision based on the
result of a previous instruction still executing in the pipeline
• Solution 1: Stall the pipeline

[Figure: add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0). The lw is fetched 4 ns after the beq instead of 2 ns, i.e. one bubble (pipeline stall)]

Note that the branch outcome is computed in the ID stage with added hardware (later…)
Control Hazards
• Solution 2: Predict branch outcome
  - e.g., predict branch-not-taken:

[Figure, prediction success: add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0) start 2 ns apart with no bubble]

[Figure, prediction failure: add $4, $5, $6; beq $1, $2, 40; a bubble; then or $7, $8, $9 fetched 4 ns after the beq. Undo (= flush) the lw]
Control Hazards
• Solution 3: Delayed branch: always execute the sequentially next
statement, with the branch taking effect after a one-instruction delay.
It is the compiler's job to find a statement that is independent of the
branch outcome and can be put in the slot
• MIPS does this, but it is an option in SPIM (Simulator -> Settings)

[Figure: beq $1, $2, 40; add $4, $5, $6 (delayed branch slot); lw $3, 300($0), starting 2 ns apart]

The delayed-branch beq is followed by an add that is independent of the branch outcome
Data Hazards
• Data hazard: instruction needs data from the result of a
previous instruction still executing in the pipeline
• Solution: Forward data if possible…

Instruction pipeline diagram (shade indicates use: left = write, right = read):

Time                2    4    6    8    10
add $s0, $t0, $t1   IF   ID   EX   MEM  WB

add $s0, $t0, $t1   IF   ID   EX   MEM  WB
sub $t2, $s0, $t3        IF   ID   EX   MEM  WB

Without forwarding (blue line) data has to go back in time;
with forwarding (red line) data is available in time
Data Hazards
• Forwarding may not be enough
  - e.g., if an R-type instruction following a load uses the result of the load,
    called a load-use data hazard

lw  $s0, 20($t1)    IF   ID   EX   MEM  WB
sub $t2, $s0, $t3        IF   ID   EX   MEM  WB

Without a stall it is impossible to provide input to the sub instruction in time

lw  $s0, 20($t1)    IF   ID   EX   MEM  WB
bubble
sub $t2, $s0, $t3             IF   ID   EX   MEM  WB

With a one-stage stall, forwarding can get the data to the sub instruction in time
Reordering Code to Avoid
Pipeline Stall (Software Solution)
• Example:
    lw $t0, 0($t1)
    lw $t2, 4($t1)
    sw $t2, 0($t1)    # data hazard: $t2 is used right after it is loaded
    sw $t0, 4($t1)

• Reordered code (the two sw instructions are interchanged):
    lw $t0, 0($t1)
    lw $t2, 4($t1)
    sw $t0, 4($t1)
    sw $t2, 0($t1)
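The effect of the reordering can be counted with a small checker. This is a sketch under the slide's model, in which any instruction that reads a register loaded by the immediately preceding lw costs one stall; the parsing helper is hypothetical and only handles the simple operand syntax used here.

```python
import re


def parse(instr):
    """Split 'lw $t0, 0($t1)' into (opcode, [registers in textual order])."""
    op, rest = instr.split(None, 1)
    return op, re.findall(r"\$\w+", rest)


def load_use_stalls(program):
    """Count instructions that read a register loaded by the lw just before them."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_regs = parse(prev)
        cur_op, cur_regs = parse(cur)
        if prev_op != "lw":
            continue
        loaded = prev_regs[0]  # lw writes its first operand
        # sw reads all of its registers; other instructions write the first one
        sources = cur_regs if cur_op == "sw" else cur_regs[1:]
        if loaded in sources:
            stalls += 1
    return stalls


original = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t2, 0($t1)", "sw $t0, 4($t1)"]
reordered = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t0, 4($t1)", "sw $t2, 0($t1)"]

if __name__ == "__main__":
    print(load_use_stalls(original), load_use_stalls(reordered))
```

The original order has one load-use pair and the reordered code has none, which is the point of interchanging the two stores.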
Pipelined Datapath
• We now move to actually building a pipelined datapath
• First recall the 5 steps in instruction execution
  1. Instruction Fetch & PC Increment (IF)
  2. Instruction Decode and Register Read (ID)
  3. Execution or address calculation (EX)
  4. Memory access (MEM)
  5. Write result into register (WB)
• Review: single-cycle processor
  - all 5 steps done in a single clock cycle
  - dedicated hardware required for each step
• What happens if we break the execution into multiple cycles, but keep
the extra hardware?
Review - Single-Cycle Datapath “Steps”

[Figure: single-cycle datapath (PC, instruction memory, register file, sign extend, ALU, data memory, multiplexors) divided into five steps: IF Instruction Fetch, ID Instruction Decode, EX Execute/Address Calc., MEM Memory Access, WB Write Back]
Pipelined Datapath – Key Idea
• What happens if we break the execution into
multiple cycles, but keep the extra hardware?
  - Answer: We may be able to start executing a new
    instruction at each clock cycle: pipelining
  - …but we shall need extra registers to hold data
    between cycles: pipeline registers
Pipelined Datapath

Pipeline registers wide enough to hold data coming in

[Figure: the single-cycle datapath with pipeline registers inserted between the steps: IF/ID (64 bits), ID/EX (128 bits), EX/MEM (97 bits), MEM/WB (64 bits)]
Pipelined Datapath

[Figure: the same pipelined datapath as on the previous slide]

Only data flowing right to left may cause a hazard… why?
Bug in the Datapath

[Figure: the pipelined datapath with IF/ID, ID/EX, EX/MEM and MEM/WB registers, before the write register number is carried along the pipeline]

Write register number comes from another, later instruction!
Corrected Datapath
[Figure: corrected pipelined datapath; the pipeline registers are now IF/ID 64 bits, ID/EX 133 bits, EX/MEM 102 bits, MEM/WB 69 bits]

Destination register number is also passed through ID/EX, EX/MEM
and MEM/WB registers, which are now wider by 5 bits
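The register widths quoted on these slides can be reproduced by adding up the values each pipeline register has to hold. The field breakdown below is an assumption based on the usual textbook datapath (the slides give only the totals), and all names are illustrative.

```python
WORD = 32     # bits in a MIPS word
REG_NUM = 5   # bits in a register number

# Assumed contents of each pipeline register in the first (buggy) datapath.
FIELDS = {
    "IF/ID": {"PC+4": WORD, "instruction": WORD},
    "ID/EX": {"PC+4": WORD, "read data 1": WORD, "read data 2": WORD,
              "sign-extended immediate": WORD},
    "EX/MEM": {"branch target": WORD, "Zero": 1, "ALU result": WORD,
               "read data 2": WORD},
    "MEM/WB": {"memory read data": WORD, "ALU result": WORD},
}


def widths(carry_write_register):
    """Total width of each pipeline register, optionally carrying the 5-bit
    destination register number through ID/EX, EX/MEM and MEM/WB."""
    result = {}
    for name, fields in FIELDS.items():
        extra = REG_NUM if carry_write_register and name != "IF/ID" else 0
        result[name] = sum(fields.values()) + extra
    return result


if __name__ == "__main__":
    print(widths(False))
    print(widths(True))
```

Under this breakdown the totals come out as 64/128/97/64 bits before the fix and 64/133/102/69 bits after it, in line with the figures.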
Pipelined Example
 Consider the following instruction sequence:
lw $t0, 10($t1)
sw $t3, 20($t4)
add $t5, $t6, $t7
sub $t8, $t9, $t10

Single-Clock-Cycle Diagrams: Clock Cycles 1 to 8

[Figures: the corrected pipelined datapath, one snapshot per clock cycle, labelled with the instruction in each stage]

Clock cycle  IF    ID    EX    MEM   WB
1            LW
2            SW    LW
3            ADD   SW    LW
4            SUB   ADD   SW    LW
5                  SUB   ADD   SW    LW
6                        SUB   ADD   SW
7                              SUB   ADD
8                                    SUB
Alternative View –
Multiple-Clock-Cycle Diagram
Time axis             CC 1  CC 2  CC 3  CC 4  CC 5  CC 6  CC 7  CC 8
lw  $t0, 10($t1)      IM    REG   ALU   DM    REG
sw  $t3, 20($t4)            IM    REG   ALU   DM    REG
add $t5, $t6, $t7                 IM    REG   ALU   DM    REG
sub $t8, $t9, $t10                      IM    REG   ALU   DM    REG
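Both views of the example can be generated from one rule: in an ideal pipeline with no stalls, instruction k (counting from 0) is in stage s during clock cycle k + s + 1. The sketch below applies that rule; the function names are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def occupancy(program, cycle):
    """Instruction in each stage during a 1-based clock cycle (None = empty)."""
    row = []
    for s in range(len(STAGES)):
        k = cycle - 1 - s  # instruction k (0-based) is fetched in cycle k + 1
        row.append(program[k] if 0 <= k < len(program) else None)
    return row


def diagram(program):
    """Text table with one row per clock cycle and one column per stage."""
    n_cycles = len(program) + len(STAGES) - 1
    rows = ["cycle  " + "".join(f"{s:<5}" for s in STAGES).rstrip()]
    for c in range(1, n_cycles + 1):
        cells = "".join(f"{(i or '-'):<5}" for i in occupancy(program, c))
        rows.append(f"{c:<7}" + cells.rstrip())
    return "\n".join(rows)


if __name__ == "__main__":
    print(diagram(["lw", "sw", "add", "sub"]))
```

Reading the printed table by rows gives the single-clock-cycle snapshots; following one instruction diagonally gives its line in the multiple-clock-cycle diagram.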
Notes
• One significant difference in the execution of an R-type instruction
between multicycle and pipelined implementations:
  - register write-back for the R-type instruction is the 5th (the last,
    write-back) pipeline stage vs. the 4th stage for the multicycle
    implementation. Why?
  - think of structural hazards when writing to the register file…
• Worth repeating: the essential difference between the pipeline
and multicycle implementations is the insertion of pipeline
registers to decouple the 5 stages
• The CPI of an ideal pipeline (no stalls) is 1. Why?
• The RaVi Architecture Visualization Project of Dortmund U. has
pipeline simulations; see the link in our Additional Resources page
• As we develop control for the pipeline, keep in mind that the text
does not consider jump; it should not be too hard to implement!
Recall Single-Cycle Control –
the Datapath
[Figure: single-cycle datapath with control. The Control unit decodes Instruction [31-26] into RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc and RegWrite; the ALU control takes ALUOp and Instruction [5-0]; PCSrc selects between PC + 4 and the branch target]
Recall Single-Cycle – ALU Control
Instruction  ALUOp  Instruction   Funct    Desired      ALU control
opcode              operation     field    ALU action   input
LW           00     load word     xxxxxx   add          010
SW           00     store word    xxxxxx   add          010
Branch eq    01     branch eq     xxxxxx   subtract     110
R-type       10     add           100000   add          010
R-type       10     subtract      100010   subtract     110
R-type       10     AND           100100   and          000
R-type       10     OR            100101   or           001
R-type       10     set on less   101010   set on less  111

Truth table for ALU control bits

ALUOp1  ALUOp0  F5  F4  F3  F2  F1  F0  Operation
0       0       X   X   X   X   X   X   010
0       1       X   X   X   X   X   X   110
1       X       X   X   0   0   0   0   010
1       X       X   X   0   0   1   0   110
1       X       X   X   0   1   0   0   000
1       X       X   X   0   1   0   1   001
1       X       X   X   1   0   1   0   111
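The truth table translates directly into a small function. This is a sketch of the table as given, not of any particular hardware implementation; the function name and the use of bit strings for the result are illustrative choices.

```python
# Low four funct bits (F3..F0) -> ALU control, for R-type instructions (ALUOp1 = 1).
R_TYPE = {
    0b0000: "010",  # add
    0b0010: "110",  # subtract
    0b0100: "000",  # and
    0b0101: "001",  # or
    0b1010: "111",  # set on less than
}


def alu_control(alu_op, funct=0):
    """alu_op is the 2-bit ALUOp value, funct the 6-bit funct field.
    Returns the 3-bit ALU control input as a string."""
    if alu_op == 0b00:   # lw / sw: add to form the address
        return "010"
    if alu_op == 0b01:   # beq: subtract to compare
        return "110"
    low = funct & 0b1111  # F5 and F4 are don't-cares in the table
    if low not in R_TYPE:
        raise ValueError(f"funct {funct:06b} is not in the truth table")
    return R_TYPE[low]


if __name__ == "__main__":
    print(alu_control(0b10, 0b100010))  # R-type subtract
```

Only the low four funct bits are inspected because F5 and F4 are don't-cares in every row of the table.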
Recall Single-Cycle – Control Signals
Effect of control bits

Signal name  Effect when deasserted / Effect when asserted
RegDst       deasserted: the register destination number for the Write register comes from the rt field (bits 20-16)
             asserted:   the register destination number for the Write register comes from the rd field (bits 15-11)
RegWrite     deasserted: none
             asserted:   the register on the Write register input is written with the value on the Write data input
ALUSrc       deasserted: the second ALU operand comes from the second register file output (Read data 2)
             asserted:   the second ALU operand is the sign-extended lower 16 bits of the instruction
PCSrc        deasserted: the PC is replaced by the output of the adder that computes the value of PC + 4
             asserted:   the PC is replaced by the output of the adder that computes the branch target
MemRead      deasserted: none
             asserted:   data memory contents designated by the address input are put on the Read data output
MemWrite     deasserted: none
             asserted:   data memory contents designated by the address input are replaced by the value of the Write data input
MemtoReg     deasserted: the value fed to the register Write data input comes from the ALU
             asserted:   the value fed to the register Write data input comes from the data memory

Determining control bits

Instruction  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
R-format     1       0       0         1         0        0         0       1       0
lw           0       1       1         1         1        0         0       0       0
sw           X       1       X         0         0        1         0       0       0
beq          X       0       X         0         0        0         1       0       1
Pipeline Control
• Initial design, motivated by single-cycle datapath control: use
the same control signals
• Observe:
  - No separate write signal for the PC as it is written every cycle
  - No separate write signals for the pipeline registers as they are written
    every cycle
    (these two will be modified by the hazard detection unit!!)
  - No separate read signal for instruction memory as it is read every clock
    cycle
  - No separate read signal for register file as it is read every clock cycle
• Need to set control signals during each pipeline stage
• Since control signals are associated with components active
during a single pipeline stage, can group control lines into five
groups according to pipeline stage
Pipelined Datapath with Control I
[Figure: pipelined datapath with the control signals PCSrc, Branch, RegWrite, MemWrite, MemRead, ALUSrc, MemtoReg, ALUOp and RegDst attached to the components they control; the 6-bit funct field feeds the ALU control; Instruction [20-16] and Instruction [15-11] feed the RegDst multiplexor]

Same control signals as the single-cycle datapath
Pipeline Control Signals

• There are five stages in the pipeline
  - instruction fetch / PC increment
  - instruction decode / register fetch
      (nothing to control, as instruction memory read and PC write are always enabled)
  - execution / address calculation
  - memory access
  - write back

             Execution/Address Calculation    Memory access stage        Write-back stage
             stage control lines              control lines              control lines
Instruction  RegDst  ALUOp1  ALUOp0  ALUSrc   Branch  MemRead  MemWrite  RegWrite  MemtoReg
R-format     1       1       0       0        0       0        0         1         0
lw           0       0       0       1        0       1        0         1         1
sw           X       0       0       1        0       0        1         0         X
beq          X       0       1       0        1       0        0         0         X
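The grouping in the table can be written down as data. The sketch below copies the nine control bits per instruction class from the table ("X" is a don't-care) and splits them into the EX, M and WB groups; the function names, and the count of control bits per pipeline register, are illustrative additions that follow from the grouping rather than from the slides.

```python
EX_SIGNALS = ("RegDst", "ALUOp1", "ALUOp0", "ALUSrc")
M_SIGNALS = ("Branch", "MemRead", "MemWrite")
WB_SIGNALS = ("RegWrite", "MemtoReg")

# One row per instruction class: EX bits, then M bits, then WB bits.
TABLE = {
    "R-format": "1100" "000" "10",
    "lw":       "0001" "010" "11",
    "sw":       "X001" "001" "0X",
    "beq":      "X010" "100" "0X",
}


def control_groups(kind):
    """Split the control bits of an instruction class by pipeline stage."""
    bits = TABLE[kind]
    n_ex, n_m = len(EX_SIGNALS), len(M_SIGNALS)
    return {"EX": bits[:n_ex], "M": bits[n_ex:n_ex + n_m], "WB": bits[n_ex + n_m:]}


def control_bits_held():
    """Control bits each pipeline register carries for the stages after it."""
    ex, m, wb = len(EX_SIGNALS), len(M_SIGNALS), len(WB_SIGNALS)
    return {"ID/EX": ex + m + wb, "EX/MEM": m + wb, "MEM/WB": wb}


if __name__ == "__main__":
    print(control_groups("lw"), control_bits_held())
```

Each group is consumed in its own stage, so the number of control bits carried forward shrinks from one pipeline register to the next.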
Pipeline Control
Implementation
• Pass control signals along just like the data: extend each pipeline
register to hold the needed control bits for succeeding stages

[Figure: the Control unit decodes the instruction from IF/ID and writes the EX, M and WB control groups into ID/EX; EX/MEM keeps M and WB; MEM/WB keeps WB]

• Note: The 6-bit funct field of the instruction required in the EX stage
to generate ALU control can be retrieved as the 6 least significant
bits of the immediate field, which is sign-extended and passed from
the IF/ID register to the ID/EX register
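The note about the funct field can be checked on a concrete encoding. The sketch below uses the standard MIPS R-format layout (rs in bits 25-21, rt in 20-16, rd in 15-11, shamt in 10-6, funct in 5-0, opcode 0); the helper names are illustrative.

```python
def encode_r(rs, rt, rd, funct, shamt=0):
    """Encode an R-format instruction (opcode 0)."""
    return (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def sign_extend16(imm16):
    """Sign-extend a 16-bit value to 32 bits, as the ID stage does."""
    imm16 &= 0xFFFF
    return (imm16 | 0xFFFF0000) if imm16 & 0x8000 else imm16


def funct_from_extended(imm32):
    """Recover the funct field from the sign-extended immediate in ID/EX."""
    return imm32 & 0x3F


if __name__ == "__main__":
    sub = encode_r(rs=2, rt=3, rd=11, funct=0b100010)  # sub $11, $2, $3
    print(bin(funct_from_extended(sign_extend16(sub & 0xFFFF))))
```

Sign extension only changes bits 16 and above, so the low 6 bits survive even when rd is large enough to set bit 15.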
Pipelined Datapath with Control II
[Figure: pipelined datapath with the Control unit in the ID stage; the EX, M and WB control groups are stored in ID/EX, EX/MEM and MEM/WB and drive RegDst, ALUOp, ALUSrc, Branch, MemRead, MemWrite, RegWrite and MemtoReg in their respective stages]

Control signals emanate from the control portions of the pipeline registers
Pipelined Execution and Control
• Instruction sequence:
    lw  $10, 20($1)
    sub $11, $2, $3
    and $12, $4, $5
    or  $13, $6, $7
    add $14, $8, $9
• Label “before<i>” means the i-th instruction before lw;
  label “after<i>” means the i-th instruction after add

[Figures: pipelined datapath with control for clock cycles 1 to 9, showing the control bits held in each pipeline register and the instruction in each stage]

Clock cycle  IF         ID         EX         MEM        WB
1            lw         before<1>  before<2>  before<3>  before<4>
2            sub        lw         before<1>  before<2>  before<3>
3            and        sub        lw         before<1>  before<2>
4            or         and        sub        lw         before<1>
5            add        or         and        sub        lw
6            after<1>   add        or         and        sub
7            after<2>   after<1>   add        or         and
8            after<3>   after<2>   after<1>   add        or
9            after<4>   after<3>   after<2>   after<1>   add
Revisiting Hazards
• So far our datapath and control have ignored
hazards
• We shall revisit data hazards and control
hazards and enhance our datapath and control
to handle them in hardware…
Data Hazards and Forwarding
• Problem with starting an instruction before previous ones are finished:
  - data dependencies that go backward in time, called data hazards

    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)

$2 = 10 before sub; $2 = -20 after sub

Time (in clock cycles)  CC 1    CC 2    CC 3    CC 4    CC 5    CC 6    CC 7    CC 8    CC 9
Value of register $2:   10      10      10      10      10/-20  -20     -20     -20     -20

sub $2, $1, $3          IM      Reg     ALU     DM      Reg
and $12, $2, $5                 IM      Reg     ALU     DM      Reg
or  $13, $6, $2                         IM      Reg     ALU     DM      Reg
add $14, $2, $2                                 IM      Reg     ALU     DM      Reg
sw  $15, 100($2)                                        IM      Reg     ALU     DM      Reg
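Which of the later instructions actually read a stale $2 can be worked out from the cycle numbers. The sketch below assumes no forwarding and a register file that is written in the first half of a cycle and read in the second half, so a consumer is safe once it is at least three instructions behind the producer; the tuple representation of the program is an illustrative simplification.

```python
def hazards_without_forwarding(program):
    """program: list of (text, destination register or None, source registers).
    Returns (producer, consumer) pairs in which the consumer reads its operand
    in ID before the producer has written it in WB."""
    found = []
    for w, (w_text, dest, _) in enumerate(program):
        if dest is None or dest == "$0":
            continue
        # WB of instruction w is in cycle w + 5, ID of instruction r in cycle r + 2,
        # so only the next two instructions read too early.
        for r in range(w + 1, min(w + 3, len(program))):
            r_text, _, sources = program[r]
            if dest in sources:
                found.append((w_text, r_text))
    return found


PROGRAM = [
    ("sub $2, $1, $3", "$2", ["$1", "$3"]),
    ("and $12, $2, $5", "$12", ["$2", "$5"]),
    ("or $13, $6, $2", "$13", ["$6", "$2"]),
    ("add $14, $2, $2", "$14", ["$2", "$2"]),
    ("sw $15, 100($2)", None, ["$15", "$2"]),
]

if __name__ == "__main__":
    for pair in hazards_without_forwarding(PROGRAM):
        print(pair)
```

Only the and and the or read $2 too early; the add reads it in the same cycle in which the sub writes it back, and the sw reads it one cycle later.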
Software Solution
• Have the compiler guarantee there are never any data hazards!
  - by rearranging instructions to insert independent instructions
    between instructions that would otherwise have a data hazard
    between them,
  - or, if such rearrangement is not possible, insert nops

    sub $2, $1, $3              sub $2, $1, $3
    lw  $10, 40($3)             nop
    slt $5, $6, $7              nop
    and $12, $2, $5     or      and $12, $2, $5
    or  $13, $6, $2             or  $13, $6, $2
    add $14, $2, $2             add $14, $2, $2
    sw  $15, 100($2)            sw  $15, 100($2)

• Such compiler solutions may not always be possible, and nops
slow the machine down

MIPS: nop = “no operation” = 00…0 (32 bits) = sll $0, $0, 0
Hardware Solution: Forwarding

• Idea: use intermediate data, do not wait for the result to
be finally written to the destination register. Two steps:
  1. Detect data hazard
  2. Forward intermediate data to resolve hazard
Pipelined Datapath with Control II (as before)

[Figure: the pipelined datapath with control shown earlier, repeated]

Control signals emanate from the control portions of the pipeline registers
Hazard Detection
• Hazard conditions:
  1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
  1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
  2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
  2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
• E.g., in the earlier example, the first hazard between sub $2, $1, $3 and
and $12, $2, $5 is detected when the and is in the EX stage and the
sub is in the MEM stage because
  - EX/MEM.RegisterRd = ID/EX.RegisterRs = $2 (1a)
• Whether to forward also depends on:
  - if the instruction in the later pipeline stage (MEM or WB) is going to write a register; if not, no need to forward,
    even if there is a register number match as in the conditions above
  - if the destination register of that instruction is $0, in which case
    there is no need to forward the value ($0 is always 0 and never overwritten)
Data Forwarding
• Plan:
  - allow inputs to the ALU not just from ID/EX, but also from later
    pipeline registers, and
  - use multiplexors and control signals to choose appropriate
    inputs to the ALU

    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)

Time (in clock cycles)  CC 1    CC 2    CC 3    CC 4    CC 5    CC 6    CC 7    CC 8    CC 9
Value of register $2:   10      10      10      10      10/-20  -20     -20     -20     -20
Value of EX/MEM:        X       X       X       -20     X       X       X       X       X
Value of MEM/WB:        X       X       X       X       -20     X       X       X       X

sub $2, $1, $3          IM      Reg     ALU     DM      Reg
and $12, $2, $5                 IM      Reg     ALU     DM      Reg
or  $13, $6, $2                         IM      Reg     ALU     DM      Reg
add $14, $2, $2                                 IM      Reg     ALU     DM      Reg
sw  $15, 100($2)                                        IM      Reg     ALU     DM      Reg

Dependencies between pipeline registers move forward in time
Forwarding Hardware

[Figure a. No forwarding: datapath before adding forwarding hardware (ID/EX, EX/MEM and MEM/WB registers, register file, ALU, data memory)]

[Figure b. With forwarding: datapath after adding forwarding hardware. Multiplexors controlled by ForwardA and ForwardB select each ALU input; the forwarding unit receives Rs and Rt from ID/EX together with EX/MEM.RegisterRd and MEM/WB.RegisterRd]
Forwarding Hardware:
Multiplexor Control

Mux control     Source   Explanation
ForwardA = 00   ID/EX    The first ALU operand comes from the register file
ForwardA = 10   EX/MEM   The first ALU operand is forwarded from the prior ALU result
ForwardA = 01   MEM/WB   The first ALU operand is forwarded from data memory
                         or an earlier ALU result (*)
ForwardB = 00   ID/EX    The second ALU operand comes from the register file
ForwardB = 10   EX/MEM   The second ALU operand is forwarded from the prior ALU result
ForwardB = 01   MEM/WB   The second ALU operand is forwarded from data memory
                         or an earlier ALU result (*)

(*) Depending on the selection in the rightmost multiplexor
(see datapath with control diagram)
Data Hazard: Detection and
Forwarding
• Forwarding unit determines multiplexor control according to the
following rules:

1. EX hazard
   if ( EX/MEM.RegWrite                                   // if there is a write…
        and ( EX/MEM.RegisterRd ≠ 0 )                     // to a non-$0 register…
        and ( EX/MEM.RegisterRd = ID/EX.RegisterRs ) )    // which matches, then…
      ForwardA = 10

   if ( EX/MEM.RegWrite                                   // if there is a write…
        and ( EX/MEM.RegisterRd ≠ 0 )                     // to a non-$0 register…
        and ( EX/MEM.RegisterRd = ID/EX.RegisterRt ) )    // which matches, then…
      ForwardB = 10
Data Hazard: Detection and
Forwarding
2. MEM hazard
   if ( MEM/WB.RegWrite                                   // if there is a write…
        and ( MEM/WB.RegisterRd ≠ 0 )                     // to a non-$0 register…
        and ( EX/MEM.RegisterRd ≠ ID/EX.RegisterRs )      // and not already a register match
                                                          // with earlier pipeline register…
        and ( MEM/WB.RegisterRd = ID/EX.RegisterRs ) )    // but match with later pipeline register, then…
      ForwardA = 01

   if ( MEM/WB.RegWrite                                   // if there is a write…
        and ( MEM/WB.RegisterRd ≠ 0 )                     // to a non-$0 register…
        and ( EX/MEM.RegisterRd ≠ ID/EX.RegisterRt )      // and not already a register match
                                                          // with earlier pipeline register…
        and ( MEM/WB.RegisterRd = ID/EX.RegisterRt ) )    // but match with later pipeline register, then…
      ForwardB = 01

The EX/MEM.RegisterRd ≠ ID/EX.RegisterRs (or Rt) check is necessary, e.g., for sequences such as
add $1, $1, $2; add $1, $1, $3; add $1, $1, $4; (array summing?), where an earlier pipeline
register (EX/MEM) has more recent data
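The EX and MEM hazard rules above can be combined into one function. This is a sketch that follows the conditions as written on the slides; the function and parameter names are illustrative, register numbers are plain integers, and the mux selects are returned as bit strings.

```python
def forwarding_unit(id_ex_rs, id_ex_rt,
                    ex_mem_rd, ex_mem_regwrite,
                    mem_wb_rd, mem_wb_regwrite):
    """Return (ForwardA, ForwardB) for the instruction currently in EX."""

    def select(src):
        # 1. EX hazard: the instruction in MEM writes the register we need
        ex_hazard = ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src
        # 2. MEM hazard: the instruction in WB writes it, and EX/MEM does not match
        mem_hazard = (mem_wb_regwrite and mem_wb_rd != 0
                      and ex_mem_rd != src and mem_wb_rd == src)
        if ex_hazard:
            return "10"
        if mem_hazard:
            return "01"
        return "00"

    return select(id_ex_rs), select(id_ex_rt)


if __name__ == "__main__":
    # and $12, $2, $5 in EX while sub $2, $1, $3 is in MEM
    print(forwarding_unit(2, 5, 2, True, 0, False))
    # or $13, $6, $2 in EX, and $12, ... in MEM, sub $2, ... in WB
    print(forwarding_unit(6, 2, 12, True, 2, True))
```

For the array-summing sequence, with $1 as the destination in both EX/MEM and MEM/WB, the function picks EX/MEM (ForwardA = 10), the more recent value.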
Forwarding Hardware with Control

Called forwarding unit, not hazard detection unit,
because once data is forwarded there is no hazard!

[Figure: datapath with forwarding hardware and control wires; certain details, e.g., branching hardware, are omitted to simplify the drawing. IF/ID.RegisterRs, IF/ID.RegisterRt and IF/ID.RegisterRd are passed into ID/EX as Rs, Rt and Rd; the forwarding unit compares Rs and Rt with EX/MEM.RegisterRd and MEM/WB.RegisterRd]

Note: so far we have only handled forwarding to R-type instructions…!
Forwarding: Execution example
Instruction sequence:
sub $2, $1, $3
and $4, $2, $5
or $4, $4, $2
add $9, $4, $2

Pipeline contents per clock cycle (from the datapath figures for clock cycles 3 to 6):

Clock cycle   IF               ID               EX               MEM              WB
3             or $4, $4, $2    and $4, $2, $5   sub $2, $1, $3   before<1>        before<2>
4             add $9, $4, $2   or $4, $4, $2    and $4, $2, $5   sub $2, . . .    before<1>
5             after<1>         add $9, $4, $2   or $4, $4, $2    and $4, . . .    sub $2, . . .
6             after<2>         after<1>         add $9, $4, $2   or $4, . . .     and $4, . . .
Data Hazards and Stalls
 Load word can still cause a hazard:
 an instruction tries to read a register following a load instruction that writes
to the same register


Program execution order (in instructions) against time (in clock cycles):

Instruction        CC 1  CC 2  CC 3  CC 4  CC 5  CC 6  CC 7  CC 8  CC 9
lw  $2, 20($1)     IM    Reg   ALU   DM    Reg
and $4, $2, $5           IM    Reg   ALU   DM    Reg
or  $8, $2, $6                 IM    Reg   ALU   DM    Reg
add $9, $4, $2                       IM    Reg   ALU   DM    Reg
slt $1, $6, $7                             IM    Reg   ALU   DM    Reg

The loaded value of $2 is available only after the DM stage of lw (CC 4), but and needs it in its ALU stage (also CC 4). As even the pipeline register dependency goes backward in time, forwarding will not solve the hazard
 therefore, we need a hazard detection unit to stall the pipeline after the load instruction
Pipelined Datapath with Control II (as before)
[Figure: pipelined datapath with the control signals PCSrc, RegWrite, Branch, MemRead, MemWrite, MemtoReg, ALUSrc, ALUOp and RegDst. Control signals emanate from the control portions of the pipeline registers.]
Hazard Detection Logic to Stall

 Hazard detection unit implements the following check to decide whether to stall:

if ( ID/EX.MemRead                                        // if the instruction in the EX stage is a load…
     and ( ( ID/EX.RegisterRt = IF/ID.RegisterRs )        // and the destination register
           or ( ID/EX.RegisterRt = IF/ID.RegisterRt ) ) ) // matches either source register
                                                          // of the instruction in the ID stage, then…
   stall the pipeline
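
As a minimal sketch, the same check can be written as a Boolean function. The function and parameter names are hypothetical; they only mirror the pipeline register fields used in the rule above.

```python
def must_stall(id_ex_mem_read: bool, id_ex_rt: int,
               if_id_rs: int, if_id_rt: int) -> bool:
    """Load-use check: True when the pipeline has to stall for one cycle."""
    # The instruction in EX is a load (MemRead) and its destination (Rt)
    # matches either source register of the instruction in ID.
    return id_ex_mem_read and (id_ex_rt == if_id_rs or id_ex_rt == if_id_rt)


# lw $2, 20($1) in EX, and $4, $2, $5 in ID: $2 is both loaded and read.
stall_needed = must_stall(True, id_ex_rt=2, if_id_rs=2, if_id_rt=5)
```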

Mechanics of Stalling
 If the stall check succeeds, the pipeline needs to stall for only 1
clock cycle after the load, because after that the forwarding unit can
resolve the dependency
 What the hardware does to stall the pipeline 1 cycle:
 does not let the IF/ID register change (disable write!) – this will cause
the instruction in the ID stage to repeat, i.e., stall
 therefore, the instruction, just behind, in the IF stage must be stalled
as well – so hardware does not let the PC change (disable write!) –
this will cause the instruction in the IF stage to repeat, i.e., stall
 changes all the EX, MEM and WB control fields in the ID/EX pipeline
register to 0, so effectively the instruction just behind the load
becomes a nop – a bubble is said to have been inserted into the
pipeline
 note that we cannot turn that instruction into a nop by zeroing all the bits
in the instruction itself (recall nop = 00…0, 32 bits) because it has
already been decoded and its control signals generated
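
The effect of the one-cycle stall on stage occupancy can be sketched with a small timeline model. This is an illustration under simplifying assumptions (five stages, in-order issue, forwarding handles every dependency except load-use on the immediately following instruction); the tuple layout and the name `stage_timeline` are hypothetical.

```python
def stage_timeline(program):
    """program: list of (text, is_load, dest, sources).

    Returns one {clock_cycle: stage} dict per instruction.
    """
    rows = []
    # First instruction: IF in CC 1, ID in CC 2, EX in CC 3.
    if_start, id_start, ex = 1, 2, 3
    for i, (_, _, _, sources) in enumerate(program):
        if i > 0:
            _, prev_load, prev_dest, _ = program[i - 1]
            # Load-use hazard: hold PC and IF/ID for one cycle (the bubble).
            stall = 1 if prev_load and prev_dest in sources else 0
            # Fetch when the previous instruction enters ID, decode when it
            # enters EX, execute one cycle later (plus the stall, if any).
            if_start, id_start, ex = id_start, ex, ex + 1 + stall
        row = {c: "IF" for c in range(if_start, id_start)}
        row.update({c: "ID" for c in range(id_start, ex)})
        row.update({ex: "EX", ex + 1: "MEM", ex + 2: "WB"})
        rows.append(row)
    return rows


program = [
    ("lw $2, 20($1)", True, 2, [1]),
    ("and $4, $2, $5", False, 4, [2, 5]),
    ("or $8, $2, $6", False, 8, [2, 6]),
    ("add $9, $4, $2", False, 9, [4, 2]),
    ("slt $1, $6, $7", False, 1, [6, 7]),
]
timeline = stage_timeline(program)
```

In this model the and repeats its ID stage and the or repeats its IF stage, which is exactly what disabling the IF/ID and PC writes achieves.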
Hazard Detection Unit
[Figure: datapath with forwarding hardware, the hazard detection unit and control wires. The hazard detection unit reads ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs and IF/ID.RegisterRt, and drives PCWrite, IF/IDWrite and the multiplexor that selects 0 for the control fields. Certain details, e.g., branching hardware, are omitted to simplify the drawing.]
Stalling Resolves a Hazard
 Same instruction sequence as before for which forwarding by
itself could not resolve the hazard:

Program execution order (in instructions) against time (in clock cycles):

Instruction        CC 1  CC 2  CC 3  CC 4  CC 5  CC 6  CC 7  CC 8  CC 9  CC 10
lw  $2, 20($1)     IM    Reg   ALU   DM    Reg
and $4, $2, $5           IM    Reg   Reg   ALU   DM    Reg
or  $8, $2, $6                 IM    IM    Reg   ALU   DM    Reg
add $9, $4, $2                             IM    Reg   ALU   DM    Reg
slt $1, $6, $7                                   IM    Reg   ALU   DM    Reg

(bubble inserted in CC 4)

Hazard detection unit inserts a 1-cycle bubble in the pipeline, after which all pipeline register dependencies go forward, so the forwarding unit can handle them and there are no more hazards
Stalling: Execution example
Instruction sequence:
lw $2, 20($1)
and $4, $2, $5
or $4, $4, $2
add $9, $4, $2

Pipeline contents per clock cycle (from the datapath figures for clock cycles 2 to 7):

Clock cycle   IF               ID               EX               MEM              WB
2             and $4, $2, $5   lw $2, 20($1)    before<1>        before<2>        before<3>
3             or $4, $4, $2    and $4, $2, $5   lw $2, 20($1)    before<1>        before<2>
4             or $4, $4, $2    and $4, $2, $5   bubble           lw $2, . . .     before<1>
5             add $9, $4, $2   or $4, $4, $2    and $4, $2, $5   bubble           lw $2, . . .
6             after<1>         add $9, $4, $2   or $4, $4, $2    and $4, . . .    bubble
7             after<2>         after<1>         add $9, $4, $2   or $4, . . .     and $4, . . .
Control (or Branch) Hazards
 Problem with branches in the pipeline we have so far is that the
branch decision is not made till the MEM stage – so what
instructions, if at all, should we insert into the pipeline following the
branch instructions?

 Possible solution: stall the pipeline till branch decision is known


 not efficient: it slows the pipeline significantly!

 Another solution: predict the branch outcome


 e.g., always predict branch-not-taken – continue with next sequential
instructions
 if the prediction is wrong have to flush the pipeline behind the branch –
discard instructions already fetched or decoded – and continue
execution at the branch target

Predicting Branch-not-taken:
Misprediction delay
Program execution order (in instructions) against time (in clock cycles):

Instruction            CC 1  CC 2  CC 3  CC 4  CC 5  CC 6  CC 7  CC 8  CC 9
40 beq $1, $3, 7       IM    Reg   ALU   DM    Reg
44 and $12, $2, $5           IM    Reg   ALU   DM    Reg
48 or $13, $6, $2                  IM    Reg   ALU   DM    Reg
52 add $14, $2, $2                       IM    Reg   ALU   DM    Reg
72 lw $4, 50($7)                               IM    Reg   ALU   DM    Reg

The outcome of branch taken (prediction wrong) is decided only when beq is in the MEM stage, so the following three sequential instructions already in the pipeline have to be flushed and execution resumes at lw
Optimizing the Pipeline to
Reduce Branch Delay
 Move the branch decision from the MEM stage (as in our current
pipeline) earlier to the ID stage
 calculating the branch target address involves moving the branch adder
from the EX stage to the ID stage; the inputs to this adder, the PC value
and the immediate field, are already available in the IF/ID pipeline
register
 calculating the branch decision is efficiently done, e.g., for equality test,
by XORing respective bits and then ORing all the results and inverting,
rather than using the ALU to subtract and then test for zero (when there
is a carry delay)
 with the more efficient equality test we can put it in the ID stage without
significantly lengthening this stage – remember an objective of pipeline
design is to keep pipeline stages balanced
 we must correspondingly make additions to the forwarding and hazard
detection units to forward to or stall the branch at the ID stage in case
the branch decision depends on an earlier result
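
The XOR/OR/invert equality test can be sketched bit by bit as below. This is a software illustration of the gate structure, not a timing model; the function name and the 32-bit default width are assumptions.

```python
def equal_by_xor(a: int, b: int, width: int = 32) -> bool:
    """Equality test built from XOR, OR and one inversion, with no carry chain."""
    mask = (1 << width) - 1
    diff = (a ^ b) & mask            # XOR respective bits: 1 where they differ
    any_diff = 0
    for i in range(width):           # OR all the XOR outputs together
        any_diff |= (diff >> i) & 1
    return any_diff == 0             # invert: equal when no bit differs


def equal_by_subtract(a: int, b: int, width: int = 32) -> bool:
    """The ALU-style alternative: subtract, then test the result for zero."""
    return ((a - b) & ((1 << width) - 1)) == 0


same = equal_by_xor(0x1234, 0x1234)
different = equal_by_xor(0x1234, 0x1235)
```

Both functions give the same answer; the point of the first form is that in hardware every XOR works independently, so no carry has to ripple through the word.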
Flushing on Misprediction
 Same strategy as for stalling on load-use data hazard…
 Zero out all the control values (or the instruction itself) in pipeline
registers for the instructions following the branch that are already
in the pipeline – effectively turning them into nops – so they are
flushed
 in the optimized pipeline, with branch decision made in the ID stage,
we have to flush only one instruction in the IF stage – the branch
delay penalty is then only one clock cycle
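
A small sketch of why the penalty shrinks: every sequential instruction fetched before the branch decision is known must be flushed when the branch is taken. The helper names are hypothetical; the target formula PC + 4 + (offset × 4) is the standard MIPS branch target computation.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def branch_target(branch_pc: int, offset: int) -> int:
    """MIPS branch target: address after the branch plus the word offset."""
    return branch_pc + 4 + (offset << 2)


def flushed_addresses(branch_pc: int, decision_stage: str) -> list:
    """Sequential instructions already fetched when a taken branch resolves."""
    # One instruction enters the pipeline per stage the branch has passed.
    count = STAGES.index(decision_stage)
    return [branch_pc + 4 * k for k in range(1, count + 1)]


# beq $1, $3, 7 at address 40
target = branch_target(40, 7)
flushed_mem = flushed_addresses(40, "MEM")   # decision in MEM stage
flushed_id = flushed_addresses(40, "ID")     # decision moved to ID stage
```

With the decision in MEM the instructions at 44, 48 and 52 are discarded; with the decision in ID only the one at 44 is.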

Optimized Datapath for Branch
[Figure: optimized datapath for branches, with the shift-left-2 and sign-extend branch target hardware and the register equality comparator (=) in the ID stage, plus the IF.Flush control line.]
IF.Flush control zeros out the instruction in the IF/ID pipeline register (which follows the branch)
Branch decision is moved from the MEM stage to the ID stage. Simplified drawing, not showing enhancements to the forwarding and hazard detection units
Pipelined Branch: Execution example
Instruction sequence:
36 sub $10, $4, $8
40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $2, $6
52 add $14, $4, $2
56 slt $15, $6, $7
…
72 lw $4, 50($7)

Pipeline contents per clock cycle (from the datapath figures for clock cycles 3 and 4):

Clock cycle   IF                ID               EX                MEM              WB
3             and $12, $2, $5   beq $1, $3, 7    sub $10, $4, $8   before<1>        before<2>
4             lw $4, 50($7)     bubble (nop)     beq $1, $3, 7     sub $10, . . .   before<1>

Optimized pipeline with only one bubble as a result of the taken branch
Simple Example: Comparing
Performance
 Compare performance for single-cycle, multicycle, and pipelined
datapaths using the gcc instruction mix
 assume 2 ns for memory access, 2 ns for ALU operation, 1 ns for
register read or write
 assume gcc instruction mix 23% loads, 13% stores, 19% branches,
2% jumps, 43% ALU
 for pipelined execution assume
 50% of the loads are followed immediately by an instruction that uses
the result of the load
 25% of branches are mispredicted
 branch delay on misprediction is 1 clock cycle
 jumps always incur 1 clock cycle delay so their average time is 2 clock
cycles

Simple Example: Comparing
Performance
 Single-cycle (p. 373): average instruction time 8 ns
 Multicycle (p. 397): average instruction time 8.04 ns
 Pipelined:
 loads use 1 cc (clock cycle) when no load-use dependency
and 2 cc when there is dependency – given 50% of loads
are followed by dependency the average cc per load is 1.5
 stores use 1 cc each
 branches use 1 cc when predicted correctly and 2 cc when
not – given 25% misprediction average cc per branch is 1.25
 jumps use 2 cc each
 ALU instructions use 1 cc each
 therefore, average CPI is
1.5 × 23% + 1 × 13% + 1.25 × 19% + 2 × 2% + 1 × 43% = 1.18
 therefore, average instruction time is 1.18 × 2 = 2.36 ns
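
The CPI arithmetic can be checked with a short script. The instruction mix and penalties are the ones assumed above; the variable names are only illustrative.

```python
# gcc instruction mix (fractions of all instructions), as assumed above.
mix = {"load": 0.23, "store": 0.13, "branch": 0.19, "jump": 0.02, "alu": 0.43}

# Average extra clock cycles per instruction class, on top of the base 1 cc.
penalty = {
    "load": 0.50 * 1,    # 50% of loads are followed by a dependent instruction
    "store": 0.0,
    "branch": 0.25 * 1,  # 25% mispredicted, 1 cc delay each
    "jump": 1.0,         # jumps always incur 1 cc delay
    "alu": 0.0,
}


def average_cpi(mix, penalty):
    """Weighted average of (1 + penalty) over the instruction mix."""
    return sum(fraction * (1 + penalty[kind]) for kind, fraction in mix.items())


clock_ns = 2                                  # pipelined clock cycle
cpi = average_cpi(mix, penalty)               # about 1.18
avg_time_ns = round(cpi, 2) * clock_ns        # 1.18 x 2 ns, as on the slide
speedup_vs_single_cycle = 8 / avg_time_ns     # single-cycle time is 8 ns
```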