Professional Documents
Culture Documents
Chapter04 2PipelinedProcessor
Chapter04 2PipelinedProcessor
2013
COMPUTER ARCHITECTURE
CSE Fall 2013
BK
TP.HCM
Vo Tan Phuong
http://www.cse.hcmut.edu.vn/~vtphuong
dce
2013
Chapter 4.2
Pipelined Processor Design
Fall 2013, CS
dce
2013
Presentation Outline
Pipelining versus Serial Execution
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction
Fall 2013, CS
dce
2013
Pipelining Example
Laundry Example: Three Stages
1. Wash dirty load of clothes
2. Dry wet clothes
Fall 2013, CS
dce
Sequential Laundry
2013
6 PM
Time 30
7
30
8
30
30
9
30
30
10
30
30
11
30
30
12 AM
30
30
B
C
Fall 2013, CS
dce
2013
7
30
30
8
30
30
30
30
30
30
9 PM
Time
30
30
30
C
D
dce
2013
k
1 2
1 2
k
1 2
1 2
Without Pipelining
One completion every k time units
Computer Architecture Chapter 4.2
1 2
With Pipelining
One completion every 1 time unit
Fall 2013, CS
dce
Synchronous Pipeline
2013
Sk
Register
S2
Register
S1
Register
Input
Register
Output
Clock
Computer Architecture Chapter 4.2
Fall 2013, CS
dce
Pipeline Performance
2013
Sk =
k+n1
Sk k for large n
Fall 2013, CS
dce
2013
Fall 2013, CS
10
dce
2013
Reg
ALU
MEM
Reg
900 ps
IF
Reg
ALU
MEM
Reg
900 ps
Computer Architecture Chapter 4.2
Fall 2013, CS
11
dce
2013
Reg
200
IF
200
ALU
Reg
IF
200
MEM
Reg
ALU
MEM
Reg
ALU
MEM
200
200
Reg
200
Reg
200
Fall 2013, CS
12
dce
2013
Fall 2013, CS
13
dce
Next . . .
2013
Fall 2013, CS
14
dce
Single-Cycle Datapath
2013
ID = Decode &
Register Read
EX = Execute
MEM = Memory
Access
J
Next
PC
Beq
Bne
30
00
30
Instruction
Memory
Instruction
PC
0
1
ALU result
Imm26
+1
PCSrc
Rs 5
32
Rt 5
Address
Rd
Imm16
zero
32
BusA
RA
Registers
RB
BusB
0
1
WB =
Write
Back
RW
BusW
32
A
L
U
32
Data
Memory
Address
Data_out
Data_in
32
32
1
32
RegDst
clk
Reg
Write
ExtOp ALUSrc ALUCtrl
Mem Mem
Read Write
Mem
toReg
Fall 2013, CS
15
dce
Pipelined Datapath
2013
Address
RB
0
1
Rd
RW
32
Imm
E
BusB
BusW
32
zero
A
L
U
Data
Memory
ALUout
RA
ALU result
Imm16
NPC
Rt 5
BusA
Next
PC
32
Data_out
32
32
Address
WB Data
PC
Rs 5
Instruction
Imm26
Register File
Instruction
Memory
Instruction
+1
MEM = Memory
Access
WB = Write Back
EX = Execute
ID = Decode &
Register Read
NPC2
IF = Instruction Fetch
Data_in
32
clk
Fall 2013, CS
16
dce
2013
Address
RB
0
1
Rd
RW
ALU result
Imm16
E
BusB
BusW
32
zero
32
Imm
Next
PC
A
L
U
Data
Memory
ALUout
Rt 5
RA
BusA
MEM =
Memory Access
32
32
32
Address
Data_out
PC
Rs 5
Instruction
Imm26
Register File
Instruction
Memory
Instruction
+1
NPC
NPC2
EX = Execute
WB Data
ID = Decode &
Register Read
IF = Instruction Fetch
WB = Write Back
Data_in
32
clk
Computer Architecture Chapter 4.2
Fall 2013, CS
17
dce
2013
EX
RW
BusB
BusW
0
1
32
A
L
U
Data
Memory
ALUout
Imm
32
32
zero
32
Data_out
32
32
Address
WB Data
Rd
ALU result
Imm16
Data_in
Rd4
RB
WB
Next
PC
RA
Address
Rt 5
BusA
MEM
Rd3
PC
Rs 5
Rd2
Instruction
Imm26
Register File
Instruction
Memory
Instruction
+1
NPC
NPC2
IF
clk
Computer Architecture Chapter 4.2
Fall 2013, CS
18
dce
2013
Figure shows the use of resources at each stage and each cycle
Time (in cycles)
CC1
CC2
CC3
CC4
CC5
lw $t6, 8($s5)
IM
Reg
ALU
DM
Reg
IM
Reg
ALU
DM
Reg
IM
Reg
ALU
DM
Reg
IM
Reg
ALU
DM
Reg
IM
Reg
ALU
DM
CC6
CC7
CC8
Fall 2013, CS
19
dce
Instruction-Time Diagram
2013
Instruction Order
$t7, 8($s3)
lw
$t6, 8($s5)
IF
ID
EX MEM WB
IF
ID
EX MEM WB
IF
ID
EX
WB
IF
ID
EX
IF
ID
$s2, 10($s3)
CC1
WB
EX MEM
Time
Fall 2013, CS
20
dce
Control Signals
ID
EX
RW
Imm16
BusB
32
Data
Memory
32
32
32
Address
Data_out
BusW
0
1
A
L
U
ALUout
32
zero
32
Imm
ALU result
WB Data
Rd
Bne
Data_in
Rd4
Address
RB
Beq
Rd3
PC
Rt 5
RA
BusA
WB
Next
PC
Instruction
Rs 5
MEM
Rd2
Instruction
Memory
Imm26
Register File
PCSrc
Instruction
+1
NPC
NPC2
IF
2013
clk
Reg
Dst
Reg
Write
Ext
Op
ALU
Src
ALU
Ctrl
Mem Mem
Read Write
Mem
toReg
Fall 2013, CS
21
dce
Pipelined Control
32
0
1
32
Data_out
BusW
32
32
Address
WB Data
Data
Memory
ALUout
Imm
BusB
A
L
U
Data_in
Rd4
RW
32
32
zero
Rd
Op
Address
RB
Bne
Rd3
PC
Rt 5
RA
BusA
Beq
ALU result
Imm16
Instruction
Rs 5
Next
PC
Rd2
Instruction
Memory
Imm26
Register File
PCSrc
Instruction
+1
NPC
NPC2
2013
Main
& ALU
Control
Ext
Op
ALU
Src
J
ALU Beq
Ctrl Bne
Mem Mem
Read Write
Mem
toReg
WB
Reg
Write
MEM
Reg
Dst
EX
Pass control
signals along
pipeline just
like the data
func
clk
Fall 2013, CS
22
dce
2013
Memory Stage
Fall 2013, CS
23
dce
2013
Decode
Stage
Execute Stage
Memory Stage
Write
Control Signals
Control Signals
Back
Op
RegDst ALUSrc ExtOp
R-Type
1=Rd
0=Reg
addi
0=Rt
slti
Beq Bne
ALUCtrl
func
1=Imm 1=sign
ADD
0=Rt
1=Imm 1=sign
SLT
andi
0=Rt
1=Imm 0=zero
AND
ori
0=Rt
1=Imm 0=zero
OR
lw
0=Rt
1=Imm 1=sign
ADD
sw
1=Imm 1=sign
ADD
beq
0=Reg
SUB
bne
0=Reg
SUB
Fall 2013, CS
24
dce
Next . . .
2013
Fall 2013, CS
25
dce
Pipeline Hazards
2013
1. Structural hazards
Caused by resource contention
Using same resource by two instructions during the same cycle
2. Data hazards
An instruction may compute a result needed by next instruction
Hardware can detect dependencies between instructions
3. Control hazards
Caused by instructions that change control flow (branches/jumps)
Delays in changing the flow of control
Fall 2013, CS
26
dce
Structural Hazards
2013
Problem
Attempt to use the same hardware resource by two different
instructions during the same cycle
Structural Hazard
Two instructions are
attempting to write
the register file
during same cycle
Example
Writing back ALU result in stage 4
Instructions
lw
$t6, 8($s5)
IF
ID
EX MEM WB
IF
ID
EX
WB
IF
ID
EX
WB
IF
ID
EX MEM
$s2, 10($s3)
CC1
Time
Fall 2013, CS
27
dce
2013
Serious Hazard:
Hazard cannot be ignored
Fall 2013, CS
28
dce
Next . . .
2013
Fall 2013, CS
29
dce
2013
Data Hazards
Dependency between instructions causes a data hazard
The dependent instructions are close to each other
Pipelined execution might change the order of operand access
# $s1 is written
# $s1 is read
Fall 2013, CS
30
dce
2013
Time (cycles)
value of $s2
sub $s2, $t1, $t3
CC1
CC2
CC3
CC4
CC5
CC6
CC7
CC8
10
10
10
10
10
20
20
20
IM
Reg
ALU
DM
Reg
IM
Reg
ALU
DM
Reg
IM
Reg
ALU
DM
Reg
IM
Reg
ALU
DM
Reg
IM
Reg
ALU
DM
Fall 2013, CS
31
dce
Instruction Order
2013
CC1
CC2
CC3
CC4
CC5
CC6
CC7
CC8
CC9
10
10
10
10
10
20
20
20
20
IM
Reg
ALU
DM
Reg
IM
Reg
Reg
Reg
Reg
ALU
DM
Reg
stall
stall
stall
IM
Reg
ALU
DM
Fall 2013, CS
32
dce
2013
Time (cycles)
value of $s2
sub $s2, $t1, $t3
CC1
CC2
CC3
CC4
CC5
CC6
CC7
CC8
10
10
10
10
10
20
20
20
IM
Reg
ALU
DM
Reg
IM
Reg
ALU
DM
Reg
IM
Reg
ALU
DM
Reg
IM
Reg
ALU
DM
Reg
IM
Reg
ALU
DM
Fall 2013, CS
33
dce
Implementing Forwarding
2013
Rd
32
Result
Data
Memory
0
32
32
Data_out
0
32
1
WData
Address
Data_in
Rd4
0
1
BusW
A
L
U
RW
BusB
0
1
2
3
32 ALU result
32
Rd3
RB
0
1
2
3
Rt
RA
BusA
Imm16
Rd2
Instruction
Rs
Register File
Imm26
Im26
ForwardA
clk
ForwardB
Fall 2013, CS
34
dce
2013
Signal
Explanation
ForwardA = 0 First ALU operand comes from register file = Value of (Rs)
ForwardA = 1 Forward result of previous instruction to A (from ALU stage)
ForwardA = 2 Forward result of 2nd previous instruction to A (from MEM stage)
ForwardA = 3 Forward result of 3rd previous instruction to A (from WB stage)
ForwardB = 0 Second ALU operand comes from register file = Value of (Rt)
ForwardB = 1 Forward result of previous instruction to B (from ALU stage)
ForwardB = 2 Forward result of 2nd previous instruction to B (from MEM stage)
ForwardB = 3 Forward result of 3rd previous instruction to B (from WB stage)
Fall 2013, CS
35
dce
Forwarding Example
2013
Instruction sequence:
lw
$t4, 4($t0)
ori $t7, $t1, 2
sub $t3, $t4, $t7
sub $t3,$t4,$t7
ori $t7,$t1,2
lw $t4,4($t0)
Rd
32
Result
32
32
Data_out
0
32
1
WData
Data
Memory
Data_in
Rd4
0
1
BusW
Address
RW
BusB
0
1
2
3
A
L
U
Rd3
RB
32 ALU result
32
RA
0
1
2
3
Rt
BusA
Rd2
Instruction
Rs
ext
Register File
Imm16
Imm
Imm26
clk
Computer Architecture Chapter 4.2
Fall 2013, CS
36
dce
2013
ForwardA 1
Else if
Else if
Else
ForwardA 3
ForwardA 0
ForwardB 1
Else if
Else if
Else
ForwardB 3
ForwardB 0
Fall 2013, CS
37
dce
BusW
0
1
Result
Data
Memory
0
32
Data_out
32
0
32
1
WData
Im26
0
1
2
3
Address
Data_in
32
ALUCtrl
Rd4
RW
BusB
Rd
RB
A
L
U
Rd3
Rt
RA
0
1
2
3
BusA
32 ALU result
32
Rd2
Instruction
Rs
Register File
Imm26
2013
clk
RegDst
ForwardB
ForwardA
Hazard Detect
and Forward
func
RegWrite
WB
Main
& ALU
Control
RegWrite
MEM
Op
EX
RegWrite
Fall 2013, CS
38
dce
Next . . .
2013
Pipeline Hazards
Data Hazards and Forwarding
Fall 2013, CS
39
dce
Load Delay
2013
Program Order
Time (cycles)
lw
$s2, 20($t1)
CC1
CC2
CC3
CC4
CC5
IF
Reg
ALU
DM
Reg
IF
Reg
ALU
DM
Reg
IF
Reg
ALU
DM
Reg
IF
Reg
ALU
DM
CC6
CC7
CC8
Reg
Fall 2013, CS
40
dce
2013
Fall 2013, CS
41
dce
2013
Program Order
Time (cycles)
lw
$s2, 20($s1)
CC1
CC2
CC3
CC4
CC5
IM
Reg
ALU
DM
Reg
IM
stall
bubble
bubble
bubble
Reg
ALU
DM
Reg
IM
Reg
ALU
DM
CC6
CC7
CC8
Reg
Fall 2013, CS
42
dce
2013
$s1, ($t5)
lw
$s2, 8($s1)
IF
ID
IF
EX MEM WB
Stall
ID
IF
EX MEM WB
Stall
ID
EX MEM WB
IF
ID
EX MEM WB
CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8 CC9 CC10 Time
Computer Architecture Chapter 4.2
Fall 2013, CS
43
dce
32
0
1
0
32
Data_out
32
WData
Data
Memory
0
1
2
3
BusW
Result
Im26
A
BusB
Address
Data_in
32
Rd4
RW
A
L
U
Rd
RB
Rt
RA
0
1
2
3
BusA
32 ALU result
32
Rd2
PC
Instruction
Rs
Register File
Imm26
Rd3
2013
clk
Disable PC
RegDst
ForwardB
func
ForwardA
Hazard Detect
Forward, & Stall
MemRead
Stall
Bubble
=0
RegWrite
0
1
WB
Control Signals
MEM
EX
Op
RegWrite
RegWrite
Fall 2013, CS
44
dce
2013
Slow code:
lw
lw
add
sw
lw
lw
sub
sw
$t0, 4($s0)
$t1, 8($s0)
$t2, $t0, $t1
$t2, 0($s0)
$t3, 16($s0)
$t4, 20($s0)
$t5, $t3, $t4
$t5, 12($0)
lw
lw
lw
lw
add
sw
sub
sw
$t0,
$t1,
$t3,
$t4,
$t2,
$t2,
$t5,
$t5,
4($s0)
8($s0)
16($s0)
20($s0)
$t0, $t1
0($s0)
$t3, $t4
12($s0)
Fall 2013, CS
45
dce
2013
Fall 2013, CS
46
dce
2013
# $t1 is written
# $t1 is written
Fall 2013, CS
47
dce
Next . . .
2013
Pipeline Hazards
Data Hazards and Forwarding
Fall 2013, CS
48
dce
Control Hazards
2013
PC + 4 + 4 immediate
If Branch is Taken
Fall 2013, CS
49
dce
2013
Beq $t1,$t2,L1
cc1
cc2
cc3
IF
Reg
ALU
IF
Next1
Next2
cc4
cc5
cc6
Reg
Bubble
Bubble
Bubble
IF
Bubble
Bubble
Bubble
Bubble
IF
Reg
ALU
DM
Branch
Target
Addr
cc7
Fall 2013, CS
50
dce
Bne
32
2
3
BusB
0
1
BusW
0
1
zero
A
L
U
ALUout
Imm16
RW
Beq
32
32
Rd3
Rd
Im26
NPC
Address
RB
BusA
Rt 5
RA
0
1
Next
PC
Rd2
Instruction
0
Rs 5
Register File
Instruction
Memory
Imm26
Op
PCSrc
Instruction
+1
PC
2013
J, Beq, Bne
Control Signals
Bubble = 0
0
1
MEM
Reg
Dst
EX
func
clk
Fall 2013, CS
51
dce
2013
Beq $t1,$t2,L1
Next1
cc1
cc2
cc3
IF
Reg
IF
Next2
cc4
cc5
cc6
Reg
ALU
DM
Reg
IF
Reg
ALU
DM
cc7
Reg
Fall 2013, CS
52
dce
2013
Fall 2013, CS
53
dce
Longer Cycle
=
RW
A
L
U
2
3
BusB
0
1
BusW
32
32
0
1
Rd
32
Rd3
RB
BusA
Address
Op
Rt 5
RA
0
1
Rd2
Instruction
0
Rs 5
Register File
Instruction
Memory
Instruction
Imm16
PCSrc
Data forwarded
then compared
ALUout
Next
PC
Reset
+1
PC
Zero
Im16
2013
Reg
Dst
J, Beq, Bne
Control Signals
Bubble = 0
ALUCtrl
0
1
MEM
EX
func
clk
Fall 2013, CS
54
dce
Next . . .
2013
Pipeline Hazards
Data Hazards and Forwarding
Fall 2013, CS
55
dce
2013
Delayed Branch
Define branch to take place AFTER the next instruction
Compiler/assembler fills the branch delay slot (for 1 delay cycle)
Fall 2013, CS
56
dce
Delayed Branch
2013
(next instruction)
label:
branch target
. . .
add $t2,$t3,$t4
beq $s1,$s0,label
Delay Slot
label:
. . .
beq $s1,$s0,label
add $t2,$t3,$t4
Fall 2013, CS
57
dce
2013
Fall 2013, CS
58
dce
Zero-Delayed Branching
2013
Solution
Introduce a Branch Target Buffer (BTB) in the IF stage
Store the target address of recent branch and jump instructions
Fall 2013, CS
59
dce
2013
Inc
mux
PC
Target
Predict
Addresses
Bits
low-order bits
used as index
=
predict_taken
Fall 2013, CS
60
dce
2013
Fall 2013, CS
61
dce
2013
IF
Increment PC
No
ID
No
Jump
or taken
branch?
Found
BTB entry with predict
taken?
Yes
EX
Normal
Execution
Mispredicted Jump/branch
Enter jump/branch address, target
address, and set prediction in BTB entry.
Flush fetched instructions
Restart PC at target address
Computer Architecture Chapter 4.2
Yes
No
Jump
or taken
branch?
Yes
Correct Prediction
No stall cycles
Mispredicted branch
Branch not taken
Update prediction bits
Flush fetched instructions
Restart PC after branch
Fall 2013, CS
62
dce
2013
Not
Taken
Taken
Taken
Predict
Not Taken
Predict
Taken
Not Taken
Fall 2013, CS
63
dce
2013
outer:
inner:
bne , , inner
bne , , outer
Fall 2013, CS
64
dce
2013
Not Taken
Strong
Predict
Not Taken
Taken
Taken
Weak
Predict
Not Taken
Not Taken
Taken
Not Taken
Weak
Predict
Taken
Taken
Strong
Predict
Taken
Not Taken
Fall 2013, CS
65
dce
2013
Fall 2013, CS
66
dce
2013
Fall 2013, CS
67