Professional Documents
Culture Documents
ETH Zürich
Spring 2020
2 April 2020
Required Readings
This week
Pipelining
H&H, Chapter 7.5
Pipelining Issues
H&H, Chapter 7.8.1-7.8.3
Next week
Out-of-order execution
H&H, Chapter 7.8-7.9
Smith and Sohi, “The Microarchitecture of Superscalar
Processors,” Proceedings of the IEEE, 1995
More advanced pipelining
Interrupt and exception handling
Out-of-order and superscalar execution concepts
2
Agenda for Today & Next Few Lectures
Last week
Single-cycle Microarchitectures
Multi-cycle Microarchitectures
This week
Pipelining
Issues in Pipelining: Control & Data Dependence Handling,
State Maintenance and Recovery, …
Next week
Out-of-Order Execution
Issues in OoO Execution: Load-Store Handling, …
3
Can We Do Better?
4
Can We Do Better?
What limitations do you see with the multi-cycle design?
Limited concurrency
Some hardware resources are idle during different phases of
instruction processing cycle
“Fetch” logic is idle when an instruction is being “decoded” or
“executed”
Most of the datapath is idle when a memory access is
happening
5
Can We Use the Idle Hardware to Improve
Concurrency?
Goal: More concurrency Higher instruction throughput
(i.e., more “work” completed in one cycle)
7
Pipelining: Basic Idea
More systematically:
Pipeline the execution of multiple instructions
Analogy: “Assembly line processing” of instructions
Idea:
Divide the instruction processing cycle into distinct “stages” of
processing
Ensure there are enough hardware resources to process one
instruction in each stage
Process a different instruction in each stage
Instructions consecutive in program order are processed in
consecutive stages
F D E W
F D E W
Is life always this beautiful?
F D E W
F D E W
Time
9
The Laundry Analogy
6 PM 7 8 9 10 11 12 1 2 AM
Time
Task
order
A
6 PM 7 8 9 10 11 12 1 2 AM
Time
Task
order
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
10
Pipelining Multiple Loads of Laundry
6 PM 7 8 9 10 11 12 1 2 AM
Time
Task
order
A
6 PM 7 8 9 10 11 12 1 2 AM
Time
Task
order
6 PM 7 8 9 10 11 12 1 2 AM
Time
Task
order
A
6 PM 7 8 9 10
- 4 loads of laundry in parallel
11 12 1 2 AM
- no additional resources
Time
Task
order
- throughput increased by 4
A
D
- latency per load is the same
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
11
Pipelining Multiple Loads of Laundry: In
Practice Time
Task
6 PM 7 8 9 10 11 12 1 2 AM
order
A
6 PM 7 8 9 10 11 12 1 2 AM
Time
Task
order
6 PM 7 8 9 10 11 12 1 2 AM
Time
Task
order
A
6 PM 7 8 9 10 11 12 1 2 AM
Time
Task
order
Task
order
A
6 PM 7 8 9 10 11 12 1 2 AM
Time
Task
order
6 PM 7 8 9 10 11 12 1 2 AM
Time
Task
order
A
A
6 PM 7 8 9 10 11 12 1 2 AM
Time
Task
order
B
A
B A
C
D
B
throughput restored (2 loads per hour) using 2 dryers
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
13
An Ideal Pipeline
Goal: Increase throughput with little increase in cost
(hardware cost, in case of instruction processing)
Repetition of identical operations
The same operation is repeated on a large number of different
inputs (e.g., all laundry loads go through the same steps)
Repetition of independent operations
No dependencies between repeated operations
Uniformly partitionable suboperations
Processing can be evenly divided into uniform-latency
suboperations (that do not share resources)
15
More Realistic Pipeline: Throughput
Nonpipelined version with delay T
BW = 1/(T+S) where S = latch delay
T ps
T/k T/k
ps ps
16
More Realistic Pipeline: Cost
Nonpipelined version with combinational cost G
Cost = G+L where L = latch cost
G gates
G/k G/k
17
Pipelining Instruction
Processing
18
Remember: The Instruction Processing Cycle
Fetch
1. Instruction fetch (IF)
Decodedecode and
2. Instruction
register operand
Evaluate fetch (ID/RF)
Address
3. Execute/Evaluate
Fetch Operands memory address (EX/AG)
4. Memory operand fetch (MEM)
Execute
5. Store/writeback result (WB)
Store Result
19
Remember the Single-Cycle Uarch
Instruction [25– 0] Shift Jump address [31– 0]
left 2
26 28 0 PCSrc
1 1=Jump
PC+4 [31– 28] M M
u u
x x
ALU
Add result 1 0
Add Shift
RegDst
Jump left 2
4 Branch
MemRead
Instruction [31– 26]
Control MemtoReg PCSrc2=Br Taken
ALUOp
MemWrite
ALUSrc
RegWrite
Instruction [5– 0]
ALU operation
T BW=~(1/T)
Based on original figure from [P&H CO&D, COPYRIGHT 2004
Elsevier. ALL RIGHTS RESERVED.]
20
Dividing Into Stages
200ps 100ps 200ps 200ps 100ps
IF: Instruction fetch ID: Instruction decode/ EX: Execute/ MEM: Memory access WB: Write back
register file read address calculation
0
M
u
x
1
ignore
for now
Add
4 Add Add
result
Shift
left 2
Read
PC Address register 1 Read
data 1
Read
register 2 Zero
Instruction Registers Read ALU ALU
Write 0 Read
data 2 result Address 1
Instruction
memory
register M
u Data
data
M
u
RF
Write x memory
data 1
Write
x
0 write
data
16 32
Sign
extend
Instruction Data
lw $2, 200($0) 8 ns
800ps fetch
Reg ALU
access
Reg
Instruction
lw $3, 300($0) 8 ns fetch
800ps
...
8 ns
800ps
Program
2 200 4 400 6 600 8 800 1000
10 1200
12 1400
14
execution
Time
order
(in instructions)
Instruction Data
lw $1, 100($0) Reg ALU Reg
fetch access
Instruction Data
lw $2, 200($0) 2 ns Reg ALU Reg
200ps fetch access
Instruction Data
lw $3, 300($0) 2 ns
200ps Reg ALU Reg
fetch access
2 ns
200ps 2 ns
200ps 2200ps
ns 2 ns
200ps 2 ns
200ps
PCE+4
nPCM
Add
Add
Add
44 Add Add
Add result
result
ShiftShift
leftleft
22
Read
Read
Instruction
Address register
register 11
AE
PCPC
PCF
Address Read
Read
AoutM
data
data 11
Read
Read
register
22 Zero
Zero
MDRW
Instruction register
Instruction Registers Read
Registers Read ALU ALU
ALU ALU
IRD
AoutW
BM
ImmE
1616 3232
Sign
Sign
extend
extend
T/k T/k
ps T ps
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
23
Pipelined Operation Example
lw All instruction classes must follow the same path
and timing through the pipeline stages.
Instruction fetch
0 lw
lw
M 0
u M Memory
Instruction fetch lw
lw
x u
x
1 0
1
M
u
Memory
x0
00 M
Any performance impact?
1
Execution
u I F/ID ID/EX EX/MEM MEM/WB
M1x
M
IF/ID ID/EX EX/MEM MEM/WB
xx
Add Add
4 Add r esult
Add
IF/ID ID/EX EX/MEM MEM/WB
11
4
Shift
Add Add
Add
4 left Add
2 result
result
Add Shift
Read Shift
Instruction
Address register 1 left 2
PC Read left 2
4 Read data 1 Add Add
Read result Zer o
Instruction
PC Address Instruction register 1register 2
Read Read
Registers Read ALU ALU
Instruction
memory Write data 1 Shift 0 Read
register
Read 1 data 2 res ult Address 1
PC Address register 2register Read left 2 M Zero Data
dat a
Instructi on u M
data 1
IF/ID
Regi sters
ID/EX EX/MEM MEM/WB
Read ALU ALU memory u
memory Read Write 0 x Read
Instruction IF/ID Write
register
Read2
register data data 2
ID/EX M 1
result
Zero EX/MEM Address
Data
dat a MEM/WB 1
M 0
x
Instruction
u Writ e
Write Registers ALU
PC Address register 1 Read
Read ALU memory u
memory Write 0x Address
data Read x
data data 2 1
data 1 result 1
Read
register 16 32 M Write
data 0
register 2 Sign Zero M
Instruction u data Data
Write Registers Read extend
x 0 ALU ALU u
memory memory Read
Add
Add
Write 16
data
register
Signdata32
2
extend
1 M
u
result Address
data
x 1
0 M
Wri te Data u
Write x data memory
data x
1
0
16 32
Add Add
Wri te
44
Sign
extend
Add Add data
16
Sign
32 result
result
extend
0
M
Shift
Shift lw
0
M
u
x
1
left 2
left 2 lwWrite back
u
x Write back
1
I F/ID
Read
Read
ID/EX EX/MEM MEM/WB
Instruction
lw 1
Instruction
PC Address0 IF/ID
register
lw
register 1
ID/EX
Read
EX/MEM MEM/WB
PC Address 0
Add
Instruction Read
decode
4 Add M
u M
Instruction decode data 1
data 1 Add Add
Read
result
4 x u
x Read Shift
Add Add
Zero
1
Instruction
1 register 22
register
left 2
result
Zero
Instruction
Shift
Read
Registers Read
Registers
l eft 2
ALU ALU
ALU
Instruction
PC Address
memory
memory
register 1
Write ReadRead
0
0 ALU Read
Read
Write data 2
2
data 1
result Address
Address Read 11
Instruction
Read
PC Address
Instructi on
IF/ID
IF/ID register 1register 2Read
register data ID/EX
ID/EX
M
Zero
result EX/MEM
EX/MEM
data
MEM/WB
MEM/WB
memory Read
register 2Write
data 1
Registers
register
Read
data 2
0
M Zero
ALU ALU
result Address data
Read 1
M
Instructi on
memory Add
register
Registers Read
M
u
ALU
u
u
ALU
Data
Data dat a
MM
uu
Add Write 0 Address Read
memory u
Write data 2 x result 1
data
Write xx
register M Data x
data M
Write memory
u 1
4
4 Write
data
data
x
Add
AddAddAdd
result
memory
Write
data
u
x
0
xx
1
1
data 16
Sign
32 Shift
1
result Writ e
data
0
00
16
Sign
extend
32
Shift
left 2
left 2 Write
Write
extend
Read data
data
Instruction
register
Read1 data 1
PC Address
register 2 16
Read
16 32 Zero
Sign 32
Instr uction
Read data 1
Registers ALU
memory Read ALU
0 Zero Read
Instruction
Write2
register
register
Registers
data 2
Read
Sign M ALU
result
ALU
Address
data
1
M
memory Write
Write
register
data 2 extend
extend
0 u
M x
result Address Data Read
memory data
1 u
x
data M
u 1 Data
Write x Wri te u 0
data data memory x
1
16 32 0
Write
Sign
data
extend
16 32
Sign
extend
97108/Patterson
97108/Patterson
Figure 06.15
Figure 06.15
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
24
Pipelined Operation Example
lw $10, 20($1)
Instruction fetch
lw $10, 20($1) sub $11, $2, $3 lw $10, 20($1)
Instruction fetch sub $11, $2, $3 lw $10, 20($1)
0
M
Instruction decode Execution
u Instruction decode Execution sub $11, $2, $3 lw $10, 20($1)
x0
0
1 0
MM
M
u0u
u
Memory Write back
xMx
11
0
1
x
u
x
sub $11, $2, $3 lw $10, 20($1)
IF/ID ID/EX EX/MEM MEM/WB
1
M
u Memory Write back
x
Add
1 IF/ID
IF/ID ID/EX
ID/EX EX/MEM
EX/MEM MEM/WB
MEM/WB
IF/ID ID/EX EX/MEM MEM/WB
IF/ID ID/EX EX/MEM MEM/WB
4 Add
Add result
Add
Add
Add
Add IF/ID ID/EX Shift EX/MEM MEM/WB
4 Add Add
4 left 2 Add
Addresult
4 Add
Add result
Add
4 Add result
Add Shift result
Instruction Read Shift
Address register 1 left 2
Shift
PC Read Shift
left 2
4 data 1 left
left 2
2 Add Add
Read
Read result
Zero
Instruction
Instruction
Address register
Read 2
register 1
PC
Instruction
Read Registers
register 1 Read
Read Shift ALU ALU
PC Address Read
Instruction
memory Read left 2 0
Instruction
Write
register data 1 Read
PC
PC Address
Address register 1
Read 1 data 2
Read
data
Read1 result Address
data 1
register
Read 2
register M Zero
Instruction data 1 M
Read
register
Read 2
Registers data 1 u ALU Zero Data
Instruction
memory Read
registerRegisters
Write 2 Read x ALU
Zero u
Instruction Write 0 ALU Zero memory Read
Instruction
Instruction register
register
data 2
1 data 2
Read ALU
result Address x
1
PC Address memory
memory Write Registers
register Read
Read
data
Registers data 2 1M
0
0 ALU ALU
ALU result Address Read
data
Read 0
M 1
memory Write
register Read2 uM
0 ALU
result Address
Write data 1
Write
register
Read data
data12 M result AddressData Read
data uM
1
Write
register 2 xMu Zero data Data
memory data M
register u Data xu
Instruction data
Write
Write 1x x
u memory uM
data Registers
16 Read32 ALU ALU Datay
memor 0xx
u
memory Write
data
Write Sign 0
11x Write Read
data 2 result Address
data memory 10
0 x
data
register extend M1 Write
Write data
u data M 0
16 32 data
Write Data u
Write Sign x data memory
data 16
16 32
32 x
extend
Sign 1
Sign 0
16 32 Write
Clock 1 extend
extend
Sign data
extend
16 32
ClockClock
5 Sign
Clock 1 3 extend
Clock 3
sub $11, $2, $3 lw $10, 20($1)
Clock 5
Instruction fetch Instruction decode
sub $11, $2, $3 lw $10, 20($1) sub $11, $2, $3
0
0 sub $11, $2, $3 lw $10, 20($1)
Instruction
M
M0 fetch Instruction decode Write back
uM
u
xxu subExecution
$11, $2, $3 lw $10, 20($1)
Memory
0
1 x
1
0
M
1
0 Execution Memory sub $11, $2, $3
Mu
M
1
x
u
u
x Write back
1x IF/ID
IF/ID ID/EX
ID/EX EX/MEM
EX/MEM MEM/WB
MEM/WB
1 IF/ID ID/EX EX/MEM MEM/WB
Add
Add
Add IF/ID ID/EX EX/MEM MEM/WB
IF/ID ID/EX EX/MEM MEM/WB
IF/ID ID/EX Add
Add EX/MEM MEM/WB
result
on
Read 1
register
register 1 Shift
InstructionInstruction
PC
PC Address
Address Read Shift
Read
Instructi
PC Addressmemory Write
register
Read
Write 1 data 2
2 0 result
result Address Read
data 1
1
register 1 Read Address
Instruction
PC Address register
register
register 1 data
Read2 MM result data 1
PC Address register Read M data M
M
Read data
data 1
1 uu Data
Data Mu
Read
Write
Write
Read data 1 xxu Zero
Data
memory
uu
register
register 2
Write
data 2 x Zero memory
memory xx
Instruction
Instruction data
registerRegisters
2 1
1 Zero x
Instruction data Registers Read 1 ALU
ALU ALU 0
0
memory
memory Write Read
Registers data 0
0 ALU
ALU result Write Read 0
memory Write Read
data 2
2 0 ALU
result Write
Address
Write
Address Read 1
Write
register
register data 2 M
M result data
Address
data
data
Read
data
data 11
register M u data M
M
16 32 u Data
Data M u
Write 16 32 u
Write
16 Sign
Sign
Sign
32 x
xx
Data
memory
memory uu
x
data extend 1 memory x
data extend
extend 1 0
Write 0
Write
data
data
16
16 32
32
Sign
Clock
Clock62
Clock 4 Sign
extend
extend
Clock62 4
Clock
Clock
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
25
Illustrating Pipeline Operation: Operation View
t0 t1 t2 t3 t4 t5
Inst0 IF ID EX MEM WB
Inst1 IF ID EX MEM WB
Inst2 IF ID EX MEM WB
Inst3 IF ID EX MEM WB
Inst4 IF ID EX MEM
IF ID EX
steady state
IF ID
(full pipeline) IF
26
Illustrating Pipeline Operation: Resource
View
t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10
IF I0 I1 I2 I3 I4 I5 I6 I7 I8 I9 I10
ID I0 I1 I2 I3 I4 I5 I6 I7 I8 I9
EX I0 I1 I2 I3 I4 I5 I6 I7 I8
MEM I0 I1 I2 I3 I4 I5 I6 I7
WB I0 I1 I2 I3 I4 I5 I6
27
Control Points in a Pipeline
PCSrc
0
M
u
x
1
Add
Add
4 Add
result
Branch
Shift
RegWrite left 2
Read MemWrite
Instruction
Instruction
[20– 16]
0
M ALUOp
Instruction u
[15– 11] x
Based on original figure from [P&H CO&D, 1
COPYRIGHT 2004 Elsevier. ALL RIGHTS
RESERVED.] RegDst
Instruction
Control M WB
EX M WB
ID/EX
0
M
u WB
x EX/MEM
1
Control M WB
MEM/WB
EX M WB
IF/ID
Add
Add
4 Add result
RegWrite
Branch
Shift
left 2
MemWrite
ALUSrc
Read
MemtoReg
Instruction
PC Address register 1
Read
data 1
Read
register 2 Zero
Instruction
Registers Read ALU ALU
memory Write 0 Read
data 2 result Address 1
register M data
u Data M
Write x memory u
data x
1
0
Write
data
Instruction 16 32 6
[15– 0] Sign ALU MemRead
extend control
Instruction
[20– 16]
0 ALUOp
M
Instruction u
[15– 11] x
1
RegDst
ALU
1 ALUResult ReadData
A RD 1
Instruction 20:16
A2 RD2 0 SrcB Data
Memory
A3 1 Memory
Register WriteData
WD3 WD
File
20:16
0 WriteReg4:0
15:11
1
PCPlus4
+
SignImm
4 15:0 <<2
Sign Extend
PCBranch
+
Result
CLK
CLK ALUOutW
CLK CLK CLK CLK
CLK
25:21
WE3 SrcAE ZeroM WE
0 PC' PCF InstrD A1 RD1 0
A RD
ALU
1 ALUOutM ReadDataW
A RD 1
Instruction 20:16
A2 RD2 0 SrcBE Data
Memory
A3 1 Memory
Register WriteDataE WriteDataM
WD3 WD
File
20:16
RtE
0 WriteRegE4:0
15:11
RdE
1
+
SignImmE
4 15:0
<<2
Sign Extend PCBranchM
+
ResultW
CLK
CLK ALUOutW
CLK CLK CLK CLK
CLK
25:21
WE3 SrcAE ZeroM WE
0 PC' PCF InstrD A1 RD1 0
A RD
ALU
ALUOutM ReadDataW
1 A RD 1
Instruction 20:16
A2 RD2 0 SrcBE Data
Memory
A3 1 Memory
Register WriteDataE WriteDataM
WD3 WD
File
20:16
RtE
0 WriteRegE4:0 WriteRegM4:0 WriteRegW 4:0
15:11
RdE
1
SignImmE
+
15:0 <<2
Sign Extend
4 PCBranchM
+
PCPlus4F PCPlus4D PCPlus4E
ResultW
ALU
1 ALUOutM ReadDataW
A RD 1
Instruction 20:16
A2 RD2 0 SrcBE Data
Memory
A3 1 Memory
Register WriteDataE WriteDataM
WD3 WD
File
20:16
RtE
0 WriteRegE4:0 WriteRegM4:0 WriteRegW 4:0
15:11
RdE
1
+
15:0
<<2
Sign Extend SignImmE
4 PCBranchM
+
PCPlus4F PCPlus4D PCPlus4E
ResultW
Resource contention
37
Dependences and Their Types
Also called “dependency” or less desirably “hazard”
Two types
Data dependence
Control dependence
39
Carnegie Mellon
1 2 3 4 5 6 7 8
Time (cycles)
$s2
add DM $s0
add $s0, $s2, $s3 IM RF $s3 + RF
$s0
and DM $t0
and $t0, $s0, $s1 IM RF $s1 & RF
$s4
or DM $t1
or $t1, $s4, $s0 IM RF $s0 | RF
$s0
sub DM $t2
sub $t2, $s0, $s5 IM RF $s5 - RF
40
Data Dependences
Types of data dependences
Flow dependence (true data dependence – read after write)
Output dependence (write after write)
Anti dependence (write after read)
Anti dependence
r3 r1 op r2 Write-after-Read
r1 r4 op r5 (WAR)
Output-dependence
r3 r1 op r2 Write-after-Write
r5 r3 op r4 (WAW)
r3 r6 op r7 42
Pipelined Operation Example
lw $10, 20($1)
Instruction fetch
lw $10, 20($1) sub $11, $2, $3 lw $10, 20($1)
Instruction fetch sub $11, $2, $3 lw $10, 20($1)
0
M
Instruction decode Execution
u Instruction decode Execution sub $11, $2, $3 lw $10, 20($1)
x0
0
1 0
MM
M
u0u
u
Memory Write back
xMx
11
0
1
x
u
x
sub $11, $2, $3 lw $10, 20($1)
IF/ID ID/EX EX/MEM MEM/WB
1
M
u Memory Write back
x
Add
1 IF/ID
IF/ID ID/EX
ID/EX EX/MEM
EX/MEM MEM/WB
MEM/WB
IF/ID ID/EX EX/MEM MEM/WB
IF/ID ID/EX EX/MEM MEM/WB
4 Add
Add result
Add
Add
Add
Add IF/ID ID/EX Shift EX/MEM MEM/WB
4 Add Add
4 left 2 Add
Addresult
4 Add
Add result
Add
4 Add result
Add Shift result
Instruction Read Shift
Address register 1 left 2
Shift
PC Read Shift
left 2
4 data 1 left
left 2
2 Add Add
Read
Read result
Zero
Instruction
Instruction
Address register
Read 2
register 1
PC
Instruction
Read Registers
register 1 Read
Read Shift ALU ALU
PC Address Read
Instruction
memory Read left 2 0
Instruction
Write
register data 1 Read
PC
PC Address
Address register 1
Read 1 data 2
Read
data
Read1 result Address
data 1
register
Read 2
register M Zero
Instruction data 1 M
Read
register
Read 2
Registers data 1 u ALU Zero Data
Instruction
memory Read
registerRegisters
Write 2 Read x ALU
Zero u
Instruction Write 0 ALU Zero memory Read
Instruction
Instruction register
register
data 2
1 data 2
Read ALU
result Address x
1
PC Address memory
memory Write Registers
register Read
Read
data
Registers data 2 1M
0
0 ALU ALU
ALU result Address Read
data
Read 0
M 1
memory Write
register Read2 uM
0 ALU
result Address
Write data 1
Write
register
Read data
data12 M result AddressData Read
data uM
1
Write
register 2 xMu Zero data Data
memory data M
register u Data xu
Instruction data
Write
Write 1x x
u memory uM
data Registers
16 Read32 ALU ALU Datay
memor 0xx
u
memory Write
data
Write Sign 0
11x Write Read
data 2 result Address
data memory 10
0 x
data
register extend M1 Write
Write data
u data M 0
16 32 data
Write Data u
Write Sign x data memory
data 16
16 32
32 x
extend
Sign 1
Sign 0
16 32 Write
Clock 1 extend
extend
Sign data
extend
16 32
ClockClock
5 Sign
Clock 1 3 extend
Clock 3
sub $11, $2, $3 lw $10, 20($1)
Clock 5
Instruction fetch Instruction decode
sub $11, $2, $3 lw $10, 20($1) sub $11, $2, $3
0
0 sub $11, $2, $3 lw $10, 20($1)
Instruction
M
M0 fetch Instruction decode Write back
uM
u
xxu subExecution
$11, $2, $3 lw $10, 20($1)
Memory
0
1 x
1
0
M
1
0 Execution Memory sub $11, $2, $3
Mu
M
1
x
u
u
x Write back
1x IF/ID
IF/ID ID/EX
ID/EX EX/MEM
EX/MEM MEM/WB
MEM/WB
1 IF/ID ID/EX EX/MEM MEM/WB
Add
Add
Add IF/ID ID/EX EX/MEM MEM/WB
IF/ID ID/EX EX/MEM MEM/WB
IF/ID ID/EX Add
Add EX/MEM MEM/WB
result
on
Read 1
register
register 1 Shift
InstructionInstruction
PC
PC Address
Address Read Shift
Read
Instructi
PC Addressmemory Write
register
Read
Write 1 data 2
2 0 result
result Address Read
data 1
1
register 1 Read Address
Instruction
PC Address register
register
register 1 data
Read2 MM result data 1
PC Address register Read M data M
M
Read data
data 1
1 uu Data
Data Mu
Read
Write
Write
Read data 1 xxu Zero
Data
memory
uu
register
register 2
Write
data 2 x Zero memory
memory xx
Instruction
Instruction data
registerRegisters
2 1
1 Zero x
Instruction data Registers Read 1 ALU
ALU ALU 0
0
memory
memory Write Read
Registers data 0
0 ALU
ALU result Write Read 0
memory Write Read
data 2
2 0 ALU
result Write
Address
Write
Address Read 1
Write
register
register data 2 M
M result data
Address
data
data
Read
data
data 11
register M u data M
M
16 32 u Data
Data M u
Write 16 32 u
Write
16 Sign
Sign
Sign
32 x
xx
Data
memory
memory uu
x
data extend 1 memory x
data extend
extend 1 0
Write 0
Write
data
data
16
16 32
32
Sign
Clock
Clock62
Clock 4 Sign
extend
extend
Clock62 4
Clock
Clock
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
43
Data Dependence Handling
44
Reading for Next Few Lectures
H&H, Chapter 7.5-7.9
45
How to Handle Data Dependences
Anti and output dependences are easier to handle
write to the destination in one stage and in program order
46
Interlocking
Detection of dependence between instructions in a
pipelined processor to guarantee correct execution
MIPS acronym?
47
Approaches to Dependence Detection (I)
Scoreboarding
Each register in register file has a Valid bit associated with it
An instruction that is writing to the register resets the Valid bit
An instruction in Decode stage checks if all its source and
destination registers are Valid
Yes: No need to stall… No dependence
No: Stall the instruction
Advantage:
Simple. 1 bit per register
Disadvantage:
Need to stall for all types of dependences, not only flow dep.
48
Not Stalling on Anti and Output Dependences
What changes would you make to the scoreboard to enable
this?
49
Approaches to Dependence Detection
(II)
Combinational dependence check logic
Special logic that checks if any instruction in later stages is
supposed to write to any source register of the instruction that
is being decoded
Yes: stall the instruction/pipeline
No: no need to stall… no flow dependence
Advantage:
No need to stall on anti and output dependences
Disadvantage:
Logic is more complex than a scoreboard
Logic becomes more complex as we make the pipeline deeper
and wider (flash-forward: think superscalar execution)
50
Once You Detect the Dependence in Hardware
What do you do afterwards?
51
Data Forwarding/Bypassing
Problem: A consumer (dependent) instruction has to wait in
decode stage until the producer instruction writes its value
in the register file
Goal: We do not want to stall the pipeline unnecessarily
Observation: The data value needed by the consumer
instruction can be supplied directly from a later stage in the
pipeline (instead of only from the register file)
Idea: Add additional dependence check logic and data
forwarding paths (buses) to supply the producer’s value to
the consumer right after the value is available
Benefit: Consumer can move in the pipeline until the point
the value can be supplied less stalling
52
A Special Case of Data Dependence
Control dependence
Data dependence on the Instruction Pointer / Program Counter
53
Control Dependence
Question: What should the fetch PC be in the next cycle?
Answer: The address of the next instruction
All instructions are control dependent on previous ones. Why?
55
Remember: Data Dependence Types
Flow dependence
r3 r1 op r2 Read-after-Write
r5 r3 op r4 (RAW)
Anti dependence
r3 r1 op r2 Write-after-Read
r1 r4 op r5 (WAR)
Output-dependence
r3 r1 op r2 Write-after-Write
r5 r3 op r4 (WAW)
r3 r6 op r7 56
RAW Dependence Handling
Which one of the following flow dependences lead to
conflicts in the 5-stage pipeline?
addi ra r- -
IF ID EX MEM WB
addi r- ra - IF ID EX MEM WB
addi r- ra - IF ID EX MEM
addi r- ra - IF ID EX
addi r- ra - IF ?ID
addi r- ra - IF
57
Pipeline Stall: Resolving Data
Dependence
t0 t1 t2 t3 t4 t5
Insth IF ID ALU MEM WB
Insti i IF ID ALU MEM WB
Instj j IF ID ALU
ID MEM
ALU
ID ID
WB
MEM
ALU ALU
WB
MEM
Instk IF ID
IF ALU
ID
IF MEM
ALU
ID
IF WB
MEM
ALU
ID
Instl IF ID
IF ALU
ID
IF MEM
ALU
ID
IF
IF ID
IF ALU
ID
IF
i: rx _
IF ID
IF
_ rx
j:bubble dist(i,j)=1Stall = make the dependent instruction
j: _ rx
bubble IF
dist(i,j)=2 wait until its source data value is available
j: _ rx
bubble dist(i,j)=3 1. stop all up-stream stages
j: _ rx dist(i,j)=4 2. drain all down-stream stages
58
How to Implement Stalling
PCSrc
ID/EX
0
M
u WB
x EX/MEM
1
Control M WB
MEM/WB
EX M WB
IF/ID
Add
Add
4 Add result
RegWrite
Branch
Shift
left 2
MemWrite
ALUSrc
Read
MemtoReg
Instruction
PC Address register 1
Read
data 1
Read
register 2 Zero
Instruction
Registers Read ALU ALU
memory Write 0 Read
data 2 result Address 1
register M data
u Data M
Write x memory u
data x
1
0
Write
data
Instruction 16 32 6
[15– 0] Sign ALU MemRead
extend control
Instruction
[20– 16]
0 ALUOp
M
Instruction u
[15– 11] x
1
RegDst
Stall
disable PC and IF/ID latching; ensure stalled instruction stays in its stage
Insert “invalid” instructions/nops into the stage following the stalled one
(called “bubbles”)
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
59
Carnegie Mellon
Time (cycles)
$s2
add DM $s0
add $s0, $s2, $s3 IM RF $s3 + RF
$s0
and DM $t0
and $t0, $s0, $s1 IM RF $s1 & RF
$s4
or DM $t1
or $t1, $s4, $s0 IM RF $s0 | RF
$s0
sub DM $t2
sub $t2, $s0, $s5 IM RF $s5 - RF
60
Carnegie Mellon
Time (cycles)
$s2
add DM $s0
add $s0, $s2, $s3 IM RF $s3 + RF
nop DM
nop IM RF RF
nop DM
nop IM RF RF
$s0
and DM $t0
and $t0, $s0, $s1 IM RF $s1 & RF
$s4
or DM $t1
or $t1, $s4, $s0 IM RF $s0 | RF
$s0
sub DM $t2
sub $t2, $s0, $s5 IM RF $s5 - RF
Data Forwarding
Also called Data Bypassing
We have already seen the basic idea before
Forward the result value to the dependent instruction
as soon as the value is available
Remember dataflow?
Data value supplied to dependent instruction as soon as it is available
Instruction executes when all its operands are available
62
Carnegie Mellon
Data Forwarding
1 2 3 4 5 6 7 8
Time (cycles)
$s2
add DM $s0
add $s0, $s2, $s3 IM RF $s3 + RF
$s0
and DM $t0
and $t0, $s0, $s1 IM RF $s1 & RF
$s4
or DM $t1
or $t1, $s4, $s0 IM RF $s0 | RF
$s0
sub DM $t2
sub $t2, $s0, $s5 IM RF $s5 - RF
63
Carnegie Mellon
Data Forwarding
CLK CLK CLK
ALU
ALUOutM ReadDataW
1 10 A RD
Instruction 20:16
A2 RD2 00 0 SrcBE Data
Memory 01
A3 10 1 Memory
Register WriteDataE WriteDataM
WD3 WD
File 1
25:21
RsD RsE ALUOutW
0
20:16
RtD RtE
0 WriteRegE4:0 WriteRegM 4:0 WriteRegW 4:0
15:11
RdD RdE
1
SignImmD SignImmE
Sign
+
15:0
Extend
4
<<2
+
PCPlus4F PCPlus4D PCPlus4E
PCBranchM
ResultW
RegWriteW
ForwardBE
RegWriteM
ForwardAE
Hazard Unit
64
Carnegie Mellon
Data Forwarding
Forward to Execute stage from either:
Memory stage or
Writeback stage
65
Carnegie Mellon
Data Forwarding
Forward to Execute stage from either:
Memory stage or
Writeback stage
Forwarding logic for ForwardBE same, but replace rsE with rtE
66
Carnegie Mellon
Stalling
1 2 3 4 5 6 7 8
Time (cycles)
$0
lw DM $s0
lw $s0, 40($0) IM RF 40 + RF
Trouble!
$s0
and DM $t0
and $t0, $s0, $s1 IM RF $s1 & RF
$s4
or DM $t1
or $t1, $s4, $s0 IM RF $s0 | RF
$s0
sub DM $t2
sub $t2, $s0, $s5 IM RF $s5 - RF
Stalling
1 2 3 4 5 6 7 8
Time (cycles)
$0
lw DM $s0
lw $s0, 40($0) IM RF 40 + RF
Trouble!
$s0
and DM $t0
and $t0, $s0, $s1 IM RF $s1 & RF
$s4
or DM $t1
or $t1, $s4, $s0 IM RF $s0 | RF
$s0
sub DM $t2
sub $t2, $s0, $s5 IM RF $s5 - RF
The lw instruction does not finish reading data until the end of the
Memory stage, so its result cannot be forwarded to the Execute stage of
the next instruction.
68
Carnegie Mellon
Stalling
1 2 3 4 5 6 7 8
Time (cycles)
$0
lw DM $s0
lw $s0, 40($0) IM RF 40 + RF
Trouble!
$s0
and DM $t0
and $t0, $s0, $s1 IM RF $s1 & RF
$s4
or DM $t1
or $t1, $s4, $s0 IM RF $s0 | RF
$s0
sub DM $t2
sub $t2, $s0, $s5 IM RF $s5 - RF
The lw instruction receives data from memory at the end of cycle 4. But
the and instruction needs that data as a source operand at the beginning
of cycle 4. There is no way to supply the data with forwarding.
69
Carnegie Mellon
Stalling
1 2 3 4 5 6 7 8 9
Time (cycles)
$0
lw DM $s0
lw $s0, 40($0) IM RF 40 + RF
$s0 $s0
and DM $t0
and $t0, $s0, $s1 IM RF $s1 RF $s1 & RF
$s4
or or DM $t1
or $t1, $s4, $s0 IM IM RF $s0 | RF
Stall $s0
sub DM $t2
sub $t2, $s0, $s5 IM RF $s5 - RF
70
Carnegie Mellon
Stalling Hardware
Stalls are supported by:
adding enable inputs (EN) to the Fetch and Decode pipeline
registers
and a synchronous reset/clear (CLR) input to the Execute pipeline
register
or an INV bit associated with each pipeline register
71
Digital Design & Computer Arch.
Lecture 13: Pipelining
ETH Zürich
Spring 2020
2 April 2020
We did not cover the following
slides. They are for your benefit.
73
How to Handle Data Dependences
Anti and output dependences are easier to handle
write to the destination in one stage and in program order
74
Fine-Grained Multithreading
75
Fine-Grained Multithreading
Idea: Hardware has multiple thread contexts (PC+registers).
Each cycle, fetch engine fetches from a different thread.
By the time the fetched branch/instruction resolves, no
instruction is fetched from the same thread
Branch/instruction resolution latency overlapped with execution
of other threads’ instructions
77
Fine-Grained Multithreading: History
CDC 6600’s peripheral processing unit is fine-grained
multithreaded
Thornton, “Parallel Operation in the Control Data 6600,” AFIPS 1964.
Processor executes a different I/O thread every cycle
An operation from the same thread is executed every 10 cycles
78
Fine-Grained Multithreading in HEP
Cycle time: 100ns
8 stages 800 ns to
complete an
instruction
assuming no memory
access
79
Multithreaded Pipeline Example
Kongetira et al., “Niagara: A 32-Way Multithreaded Sparc Processor,” IEEE Micro 2005.
81
Fine-grained Multithreading
Advantages
+ No need for dependency checking between instructions
(only one instruction in pipeline from a single thread)
+ No need for branch prediction logic
+ Otherwise-bubble cycles used for executing useful instructions from
different threads
+ Improved system throughput, latency tolerance, utilization
Disadvantages
- Extra hardware complexity: multiple hardware contexts (PCs, register
files, …), thread selection logic
- Reduced single thread performance (one instruction fetched every N
cycles from the same thread)
- Resource contention between threads in caches and memory
- Some dependency checking logic between threads remains (load/store)
82
Modern GPUs Are FGMT Machines
83
NVIDIA GeForce GTX 285 “core”
64 KB of storage
… for thread contexts
(registers)
84
Slide credit: Kayvon Fatahalian
NVIDIA GeForce GTX 285 “core”
64 KB of storage
… for thread contexts
(registers)
Tex Tex
… … … … … …
Tex Tex
… … … … … …
Tex Tex
… … … … … …
Tex Tex
… … … … … …
Tex Tex
… … … … … …
87