
Digital Design & Computer Arch.

Lecture 13: Pipelining

Prof. Onur Mutlu

ETH Zürich
Spring 2020
2 April 2020
Required Readings
 This week
 Pipelining
 H&H, Chapter 7.5
 Pipelining Issues
 H&H, Chapter 7.8.1-7.8.3

 Next week
 Out-of-order execution
 H&H, Chapter 7.8-7.9
 Smith and Sohi, “The Microarchitecture of Superscalar
Processors,” Proceedings of the IEEE, 1995
 More advanced pipelining
 Interrupt and exception handling
 Out-of-order and superscalar execution concepts

2
Agenda for Today & Next Few Lectures
 Last week
 Single-cycle Microarchitectures
 Multi-cycle Microarchitectures

 This week
 Pipelining
 Issues in Pipelining: Control & Data Dependence Handling,
State Maintenance and Recovery, …

 Next week
 Out-of-Order Execution
 Issues in OoO Execution: Load-Store Handling, …

3
Can We Do Better?

4
Can We Do Better?
 What limitations do you see with the multi-cycle design?

 Limited concurrency
 Some hardware resources are idle during different phases of
instruction processing cycle
 “Fetch” logic is idle when an instruction is being “decoded” or
“executed”
 Most of the datapath is idle when a memory access is
happening

5
Can We Use the Idle Hardware to Improve Concurrency?
 Goal: More concurrency → Higher instruction throughput
(i.e., more “work” completed in one cycle)

 Idea: When an instruction is using some resources in its
processing phase, process other instructions on idle
resources not needed by that instruction
 E.g., when an instruction is being decoded, fetch the next
instruction
 E.g., when an instruction is being executed, decode another
instruction
 E.g., when an instruction is accessing data memory (ld/st),
execute the next instruction
 E.g., when an instruction is writing its result into the register
file, access data memory for the next instruction
6
Pipelining

7
Pipelining: Basic Idea
 More systematically:
 Pipeline the execution of multiple instructions
 Analogy: “Assembly line processing” of instructions

 Idea:
 Divide the instruction processing cycle into distinct “stages” of
processing
 Ensure there are enough hardware resources to process one
instruction in each stage
 Process a different instruction in each stage
 Instructions consecutive in program order are processed in
consecutive stages

 Benefit: Increases instruction processing throughput (1/CPI)


 Downside: Start thinking about this…
8
Example: Execution of Four Independent ADDs
 Multi-cycle: 4 cycles per instruction

F D E W
        F D E W
                F D E W
                        F D E W
→ Time

 Pipelined: 4 cycles per 4 instructions (steady state)

F D E W
  F D E W
    F D E W
      F D E W
→ Time

Is life always this beautiful?
9
The Laundry Analogy
[Figure: laundry timeline from 6 PM to 2 AM. Loads A to D are done one after another; each load goes through washer, dryer, folding, and put-away in sequence.]
 “place one dirty load of clothes in the washer”


 “when the washer is finished, place the wet load in the dryer”
 “when the dryer is finished, take out the dry load and fold”
 “when folding is finished, ask your roommate (??) to put the clothes
away”
- steps to do a load are sequentially dependent
- no dependence between different loads
- different steps do not share resources

Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
10
Pipelining Multiple Loads of Laundry
[Figure: pipelined laundry from 6 PM onward. Loads A to D overlap across washer, dryer, folding, and put-away.]

- 4 loads of laundry in parallel
- no additional resources
- throughput increased by 4
- latency per load is the same
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
11
Pipelining Multiple Loads of Laundry: In Practice

[Figure: pipelined laundry with a dryer that is slower than the washer; later loads queue up waiting for the dryer.]

the slowest step decides throughput

Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
12
Pipelining Multiple Loads of Laundry: In Practice

[Figure: pipelined laundry, now with two dryers; loads A to D overlap without queuing behind the slow dryer.]

throughput restored (2 loads per hour) using 2 dryers
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
13
An Ideal Pipeline
 Goal: Increase throughput with little increase in cost
(hardware cost, in case of instruction processing)
 Repetition of identical operations
 The same operation is repeated on a large number of different
inputs (e.g., all laundry loads go through the same steps)
 Repetition of independent operations
 No dependencies between repeated operations
 Uniformly partitionable suboperations
 Processing can be evenly divided into uniform-latency
suboperations (that do not share resources)

 Fitting examples: automobile assembly line, doing laundry


 What about the instruction processing “cycle”?
14
Ideal Pipelining

combinational logic (F,D,E,M,W): T ps → BW ≈ 1/T

2 stages: T/2 ps (F,D,E) + T/2 ps (M,W) → BW ≈ 2/T

3 stages: T/3 ps (F,D) + T/3 ps (E,M) + T/3 ps (M,W) → BW ≈ 3/T
15
More Realistic Pipeline: Throughput
 Nonpipelined version with delay T
BW = 1/(T + S), where S = latch delay

 k-stage pipelined version
BW_k-stage = 1/(T/k + S)
Latch delay reduces throughput (switching overhead between stages)
BW_max = 1/(1 gate delay + S)
16
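The two throughput formulas above can be played with numerically; a minimal sketch (the T, S, and k values are illustrative, not from the slide):

```python
# Throughput with latch overhead: BW = 1 / (T/k + S), in instructions per ps.
def bw_nonpipelined(T, S):
    # one instruction completes every T + S picoseconds
    return 1.0 / (T + S)

def bw_k_stage(T, S, k):
    # clock period is the per-stage logic delay T/k plus the latch delay S
    return 1.0 / (T / k + S)

T, S = 1000.0, 50.0  # illustrative: 1000 ps of logic, 50 ps latch delay
print(bw_nonpipelined(T, S))        # 1/1050
for k in (2, 5, 10):
    print(k, bw_k_stage(T, S, k))   # returns diminish as S dominates T/k
```

Deepening the pipeline beyond the point where T/k approaches S buys little, which is why BW_max is set by one gate delay plus S.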
More Realistic Pipeline: Cost
 Nonpipelined version with combinational cost G
Cost = G + L, where L = latch cost

 k-stage pipelined version
Cost_k-stage = G + Lk
Latches increase hardware cost
17
Pipelining Instruction
Processing

18
Remember: The Instruction Processing Cycle

The generic steps (Fetch, Decode, Evaluate Address, Fetch Operands, Execute, Store Result) map onto five stages:

1. Instruction fetch (IF)
2. Instruction decode and register operand fetch (ID/RF)
3. Execute / evaluate memory address (EX/AG)
4. Memory operand fetch (MEM)
5. Store/writeback result (WB)

19
Remember the Single-Cycle Uarch
[Figure: MIPS single-cycle datapath with jump and branch support; control signals include PCSrc (1 = Jump), PCSrc2 (Br Taken), RegDst, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite. Total combinational delay T, so BW ≈ 1/T.]

Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
20
Dividing Into Stages
Five stages and example latencies:
IF: Instruction fetch (200 ps); ID: Instruction decode/register file read (100 ps); EX: Execute/address calculation (200 ps); MEM: Memory access (200 ps); WB: Write back (100 ps)

[Figure: single-cycle datapath partitioned at the five stage boundaries; the branch/jump path is ignored for now.]
Is this the correct partitioning?


Why not 4 or 6 stages? Why not different boundaries?
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
21
Instruction Pipeline Throughput
Nonpipelined: each lw takes 800 ps (Instruction fetch 200 ps, Reg 100 ps, ALU 200 ps, Data access 200 ps, Reg 100 ps), and a new instruction starts only every 800 ps.

Pipelined: the same sequence (lw $1, 100($0); lw $2, 200($0); lw $3, 300($0)) overlaps the stages, so a new instruction starts every 200 ps, the latency of the slowest stage.

[Figure: program execution timelines for the nonpipelined and pipelined designs.]
5-stage speedup is 4, not 5 as predicted by the ideal model. Why?


22
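The “speedup of 4, not 5” can be checked with a small back-of-the-envelope script using the stage latencies above (the instruction count is arbitrary):

```python
# Stage latencies from the slide, in ps: IF, ID, EX, MEM, WB
stages = [200, 100, 200, 200, 100]

single_cycle = sum(stages)   # 800 ps per instruction when not pipelined
clock = max(stages)          # 200 ps: the clock is set by the slowest stage

def pipelined_time(n, k=len(stages)):
    # k cycles to fill the pipeline, then one instruction per cycle
    return (k + n - 1) * clock

n = 1_000_000
speedup = (n * single_cycle) / pipelined_time(n)
print(round(speedup, 2))  # 4.0, not 5.0: the stages are not uniform latency
```

With ideal (uniform 160 ps) stages the same calculation would give a speedup of 5; internal fragmentation in the 100 ps stages is what caps it at 4.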
Enabling Pipelined Processing: Pipeline Registers
No resource is used by more than 1 stage!

Pipeline registers (IF/ID, ID/EX, EX/MEM, MEM/WB) separate the stages. They carry, for example, the fetched instruction (IRD), incremented PCs (PCD+4, PCE+4, nPCM), the register operands (AE, BE), the sign-extended immediate (ImmE), the ALU result (AoutM, AoutW), the store data (BM), and the loaded data (MDRW). Each stage now takes ~T/k ps instead of T ps.

[Figure: the five-stage datapath with the four pipeline registers inserted at the stage boundaries.]

Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
23
Pipelined Operation Example
All instruction classes must follow the same path and timing through the pipeline stages. Any performance impact?

[Figure: snapshots of an lw instruction advancing through the Instruction fetch, Instruction decode, Execution, Memory, and Write back stages of the pipelined datapath.]

Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
24
Pipelined Operation Example
lw $10, 20($1) followed by sub $11, $2, $3: in each clock cycle (Clock 1 through 6), lw advances one stage and sub follows one stage behind, so both instructions are in flight at once.

Is life always this beautiful?

[Figure: cycle-by-cycle snapshots of the two instructions flowing through the pipelined datapath.]

Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
25
Illustrating Pipeline Operation: Operation View

        t0   t1   t2   t3   t4   t5  ...
Inst0   IF   ID   EX   MEM  WB
Inst1        IF   ID   EX   MEM  WB
Inst2             IF   ID   EX   MEM ...
Inst3                  IF   ID   EX  ...
Inst4                       IF   ID  ...
Inst5                            IF  ...

steady state (full pipeline) once five instructions are in flight

26
Illustrating Pipeline Operation: Resource View
t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10

IF I0 I1 I2 I3 I4 I5 I6 I7 I8 I9 I10

ID I0 I1 I2 I3 I4 I5 I6 I7 I8 I9

EX I0 I1 I2 I3 I4 I5 I6 I7 I8

MEM I0 I1 I2 I3 I4 I5 I6 I7

WB I0 I1 I2 I3 I4 I5 I6

27
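Both views come from the same schedule: instruction i occupies stage (t − i) at time t. A tiny sketch that prints both tables (names are illustrative):

```python
stages = ["IF", "ID", "EX", "MEM", "WB"]

def stage_of(inst, t):
    # instruction `inst` enters IF at cycle `inst`, then advances one
    # stage per cycle; "." means it is not in the pipeline at time t
    s = t - inst
    return stages[s] if 0 <= s < len(stages) else "."

# operation view: one row per instruction
for i in range(4):
    print(f"Inst{i}:", " ".join(stage_of(i, t) for t in range(8)))

# resource view: one row per stage, holding instruction t - s at time t
for s, name in enumerate(stages):
    print(f"{name}:", " ".join(f"I{t - s}" if t >= s else "." for t in range(8)))
```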
Control Points in a Pipeline
[Figure: pipelined datapath annotated with its control points: PCSrc, Branch, RegWrite, MemWrite, ALUSrc, MemtoReg, MemRead, ALUOp, RegDst.]

Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
Identical set of control points as the single-cycle datapath!! 28


Control Signals in a Pipeline
 For a given instruction
 same control signals as single-cycle, but
 control signals required at different cycles, depending on stage
⇒ Option 1: decode once using the same logic as the single-cycle design and buffer the signals until each is consumed

[Figure: the WB, M, and EX control signal groups carried through the ID/EX, EX/MEM, and MEM/WB pipeline registers.]

⇒ Option 2: carry the relevant instruction word/fields down the pipeline and decode locally within each stage (or in a previous stage)
Which one is better? 29
Pipelined Control Signals
[Figure: pipelined datapath with control. The EX, M, and WB signal groups (RegWrite, Branch, MemWrite, ALUSrc, MemtoReg, MemRead, ALUOp, RegDst) travel with the instruction through the ID/EX, EX/MEM, and MEM/WB registers.]

Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
30
Carnegie Mellon

Another Example: Single-Cycle and Pipelined


CLK CLK
CLK
25:21 WE3 SrcA Zero WE
0 PC' PC Instr A1 RD1 0
A RD

ALU
1 ALUResult ReadData
A RD 1
Instruction 20:16
A2 RD2 0 SrcB Data
Memory
A3 1 Memory
Register WriteData
WD3 WD
File
20:16
0 WriteReg4:0
15:11
1
PCPlus4
+

SignImm
4 15:0 <<2
Sign Extend
PCBranch

+
Result

CLK
CLK ALUOutW
CLK CLK CLK CLK
CLK
25:21
WE3 SrcAE ZeroM WE
0 PC' PCF InstrD A1 RD1 0
A RD

ALU
1 ALUOutM ReadDataW
A RD 1
Instruction 20:16
A2 RD2 0 SrcBE Data
Memory
A3 1 Memory
Register WriteDataE WriteDataM
WD3 WD
File
20:16
RtE
0 WriteRegE4:0
15:11
RdE
1
+

SignImmE
4 15:0
<<2
Sign Extend PCBranchM
+

PCPlus4F PCPlus4D PCPlus4E

ResultW

Fetch Decode Execute Memory Writeback 31



Another Example: Correct Pipelined Datapath

[Figure: corrected pipelined datapath. The destination register number is carried through the pipeline (WriteRegE, WriteRegM, WriteRegW) so it arrives at the register file in the same cycle as ResultW. Stage boundaries: Fetch, Decode, Execute, Memory, Writeback.]

 WriteReg must arrive at the same time as Result


32

Another Example: Pipelined Control


CLK CLK CLK

RegWriteD RegWriteE RegWriteM RegWriteW


Control MemtoRegE MemtoRegM MemtoRegW
MemtoRegD
Unit
MemWriteD MemWriteE MemWriteM

BranchD BranchE BranchM


31:26 PCSrcM
Op ALUControlD ALUControlE2:0
5:0
Funct ALUSrcD ALUSrcE
RegDstD RegDstE
ALUOutW
CLK CLK CLK
CLK
25:21 WE3 SrcAE ZeroM WE
0 PC' PCF InstrD A1 RD1 0
A RD

ALU
1 ALUOutM ReadDataW
A RD 1
Instruction 20:16
A2 RD2 0 SrcBE Data
Memory
A3 1 Memory
Register WriteDataE WriteDataM
WD3 WD
File
20:16
RtE
0 WriteRegE4:0 WriteRegM4:0 WriteRegW 4:0
15:11
RdE
1
+

15:0
<<2
Sign Extend SignImmE
4 PCBranchM

+
PCPlus4F PCPlus4D PCPlus4E

ResultW

 Same control unit as single-cycle processor


Control delayed to proper pipeline stage 33
Remember: An Ideal Pipeline
 Goal: Increase throughput with little increase in cost
(hardware cost, in case of instruction processing)
 Repetition of identical operations
 The same operation is repeated on a large number of different
inputs (e.g., all laundry loads go through the same steps)
 Repetition of independent operations
 No dependencies between repeated operations
 Uniformly partitionable suboperations
Processing can be evenly divided into uniform-latency
suboperations (that do not share resources)

 Fitting examples: automobile assembly line, doing laundry


 What about the instruction processing “cycle”?
34
Instruction Pipeline: Not An Ideal Pipeline
 Identical operations ... NOT!
 different instructions → not all need the same stages
Forcing different instructions to go through the same pipe stages
→ external fragmentation (some pipe stages idle for some instructions)

 Uniform suboperations ... NOT!
 different pipeline stages → not the same latency
Need to force each stage to be controlled by the same clock
→ internal fragmentation (some pipe stages are too fast but all take
the same clock cycle time)

 Independent operations ... NOT!
 instructions are not independent of each other
Need to detect and resolve inter-instruction dependencies to ensure
the pipeline provides correct results
→ pipeline stalls (pipeline is not always moving) 35
Issues in Pipeline Design
 Balancing work in pipeline stages
 How many stages and what is done in each stage

 Keeping the pipeline correct, moving, and full in the


presence of events that disrupt pipeline flow
 Handling dependences
 Data
 Control
 Handling resource contention
 Handling long-latency (multi-cycle) operations

 Handling exceptions, interrupts


 Advanced: Improving pipeline throughput
 Minimizing stalls
36
Causes of Pipeline Stalls
 Stall: A condition when the pipeline stops moving

 Resource contention

 Dependences (between instructions)


 Data
 Control

 Long-latency (multi-cycle) operations

37
Dependences and Their Types
 Also called “dependency” or less desirably “hazard”

 Dependences dictate ordering requirements between


instructions

 Two types
 Data dependence
 Control dependence

 Resource contention is sometimes called resource


dependence
 However, this is not fundamental to (dictated by) program
semantics, so we will treat it separately
38
Handling Resource Contention
 Happens when instructions in two pipeline stages need the
same resource

 Solution 1: Eliminate the cause of contention


 Duplicate the resource or increase its throughput
 E.g., use separate instruction and data memories (caches)
 E.g., use multiple ports for memory structures

 Solution 2: Detect the resource contention and stall one of


the contending stages
 Which stage do you stall?
 Example: What if you had a single read and write port for the
register file?

39

Example Resource Dependence: RegFile


 The register file can be read and written in the same cycle:
 write takes place during the 1st half of the cycle
 read takes place during the 2nd half of the cycle => no problem!!!
 However, operations that involve the register file have only half a
clock cycle to complete!

[Figure: pipeline timing diagram over cycles 1 to 8 for add $s0, $s2, $s3; and $t0, $s0, $s1; or $t1, $s4, $s0; sub $t2, $s0, $s5. Each instruction flows through IM, RF read, ALU, DM, and RF write; the register file writes in the first half-cycle and reads in the second.]
40
Data Dependences
 Types of data dependences
 Flow dependence (true data dependence – read after write)
 Output dependence (write after write)
 Anti dependence (write after read)

 Which ones cause stalls in a pipelined machine?


 For all of them, we need to ensure semantics of the program
is correct
 Flow dependences always need to be obeyed because they
constitute true dependence on a value
 Anti and output dependences exist due to limited number of
architectural registers
 They are dependence on a name, not a value
 We will later see what we can do about them
41
Data Dependence Types
Flow dependence (Read-after-Write, RAW):
r3 ← r1 op r2
r5 ← r3 op r4

Anti dependence (Write-after-Read, WAR):
r3 ← r1 op r2
r1 ← r4 op r5

Output dependence (Write-after-Write, WAW):
r3 ← r1 op r2
r5 ← r3 op r4
r3 ← r6 op r7
42
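The three cases above can be expressed as a small classifier; a sketch (the register numbers in the usage lines follow the slide's examples):

```python
# Classify the dependences of a later instruction on an earlier one,
# given each instruction's source registers and destination register.
def dependences(early_srcs, early_dst, late_srcs, late_dst):
    deps = []
    if early_dst in late_srcs:
        deps.append("RAW (flow)")    # read after write: a true value dep.
    if late_dst in early_srcs:
        deps.append("WAR (anti)")    # write after read: a name dep.
    if late_dst == early_dst:
        deps.append("WAW (output)")  # write after write: a name dep.
    return deps

# r3 <- r1 op r2 ; r5 <- r3 op r4  : flow dependence on r3
assert dependences({1, 2}, 3, {3, 4}, 5) == ["RAW (flow)"]
# r3 <- r1 op r2 ; r1 <- r4 op r5  : anti dependence on r1
assert dependences({1, 2}, 3, {4, 5}, 1) == ["WAR (anti)"]
```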
Pipelined Operation Example
lw $10, 20($1) followed by sub $11, $2, $3, shown cycle by cycle as in the earlier Pipelined Operation Example (slide 25).

What if the SUB were dependent on LW?

[Figure: cycle-by-cycle snapshots of the two instructions flowing through the pipelined datapath.]

Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
43
Data Dependence Handling

44
Reading for Next Few Lectures
 H&H, Chapter 7.5-7.9

 Smith and Sohi, “The Microarchitecture of Superscalar


Processors,” Proceedings of the IEEE, 1995
 More advanced pipelining
 Interrupt and exception handling
 Out-of-order and superscalar execution concepts

45
How to Handle Data Dependences
 Anti and output dependences are easier to handle
 write to the destination in one stage and in program order

 Flow dependences are more interesting

 Five fundamental ways of handling flow dependences


 Detect and wait until value is available in register file
 Detect and forward/bypass data to dependent instruction
 Detect and eliminate the dependence at the software level
 No need for the hardware to detect dependence
 Predict the needed value(s), execute “speculatively”, and verify
 Do something else (fine-grained multithreading)
 No need to detect

46
Interlocking
 Detection of dependence between instructions in a
pipelined processor to guarantee correct execution

 Software based interlocking


vs.
 Hardware based interlocking

 MIPS acronym?

47
Approaches to Dependence Detection (I)
 Scoreboarding
 Each register in register file has a Valid bit associated with it
 An instruction that is writing to the register resets the Valid bit
 An instruction in Decode stage checks if all its source and
destination registers are Valid
 Yes: No need to stall… No dependence
 No: Stall the instruction

 Advantage:
 Simple. 1 bit per register

 Disadvantage:
 Need to stall for all types of dependences, not only flow dep.
48
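The Valid-bit scheme above can be sketched in a few lines (the register count and method names are illustrative, not tied to any particular ISA):

```python
# Minimal scoreboard: one Valid bit per architectural register.
class Scoreboard:
    def __init__(self, num_regs=32):
        self.valid = [True] * num_regs

    def can_issue(self, srcs, dst):
        # stall unless every source and the destination register are Valid
        return all(self.valid[r] for r in srcs) and self.valid[dst]

    def issue(self, dst):
        self.valid[dst] = False   # a writer is now in flight

    def writeback(self, dst):
        self.valid[dst] = True    # value is now in the register file

sb = Scoreboard()
assert sb.can_issue(srcs=[1, 2], dst=3)
sb.issue(dst=3)
assert not sb.can_issue(srcs=[3, 4], dst=5)  # flow dependence: stall
assert not sb.can_issue(srcs=[6, 7], dst=3)  # output dependence: stall
```

Note that the second stall is on an output dependence, illustrating the slide's disadvantage: a single Valid bit cannot tell flow dependences apart from anti and output dependences.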
Not Stalling on Anti and Output Dependences
 What changes would you make to the scoreboard to enable
this?

49
Approaches to Dependence Detection (II)
 Combinational dependence check logic
 Special logic that checks if any instruction in later stages is
supposed to write to any source register of the instruction that
is being decoded
 Yes: stall the instruction/pipeline
 No: no need to stall… no flow dependence

 Advantage:
 No need to stall on anti and output dependences

 Disadvantage:
 Logic is more complex than a scoreboard
 Logic becomes more complex as we make the pipeline deeper
and wider (flash-forward: think superscalar execution)
50
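A minimal model of this combinational check, written as a hypothetical helper function (each later pipeline stage is assumed to expose its destination register and a RegWrite flag; register 0 is hardwired to zero, as in MIPS):

```python
def must_stall(decode_srcs, later_stages):
    """decode_srcs: source registers of the instruction in Decode.
    later_stages: list of (dest_reg, reg_write) pairs for the
    instructions currently in Execute, Memory, ...

    Stall only on a true flow dependence; unlike a scoreboard,
    anti and output dependences never stall here."""
    for dest, reg_write in later_stages:
        if reg_write and dest != 0 and dest in decode_srcs:
            return True
    return False
```

The comparator count grows with pipeline depth and width: every source of every decoding instruction must be checked against every in-flight destination, which is why this logic gets expensive for superscalar pipelines.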
Once You Detect the Dependence in Hardware
 What do you do afterwards?

 Observation: Dependence between two instructions is detected before the communicated data value becomes available

 Option 1: Stall the dependent instruction right away


 Option 2: Stall the dependent instruction only when
necessary  data forwarding/bypassing
 Option 3: …

51
Data Forwarding/Bypassing
 Problem: A consumer (dependent) instruction has to wait in
decode stage until the producer instruction writes its value
in the register file
 Goal: We do not want to stall the pipeline unnecessarily
 Observation: The data value needed by the consumer
instruction can be supplied directly from a later stage in the
pipeline (instead of only from the register file)
 Idea: Add additional dependence check logic and data
forwarding paths (buses) to supply the producer’s value to
the consumer right after the value is available
 Benefit: Consumer can move in the pipeline until the point
the value can be supplied  less stalling
52
A Special Case of Data Dependence
 Control dependence
 Data dependence on the Instruction Pointer / Program Counter

53
Control Dependence
 Question: What should the fetch PC be in the next cycle?
 Answer: The address of the next instruction
 All instructions are control dependent on previous ones. Why?

 If the fetched instruction is a non-control-flow instruction:


 Next Fetch PC is the address of the next-sequential instruction
 Easy to determine if we know the size of the fetched instruction

 If the instruction that is fetched is a control-flow instruction:


 How do we determine the next Fetch PC?

 In fact, how do we know whether or not the fetched instruction is a control-flow instruction?
54
Data Dependence Handling:
Concepts and Implementation

55
Remember: Data Dependence Types
Flow dependence
r3  r1 op r2 Read-after-Write
r5  r3 op r4 (RAW)

Anti dependence
r3  r1 op r2 Write-after-Read
r1  r4 op r5 (WAR)

Output-dependence
r3  r1 op r2 Write-after-Write
r5  r3 op r4 (WAW)
r3  r6 op r7 56
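The three dependence types can be checked mechanically. Here is a hypothetical sketch that classifies the dependence of a younger instruction j on an older instruction i, given each instruction's source and destination registers:

```python
def classify(i_srcs, i_dst, j_srcs, j_dst):
    """Return the dependences of later instruction j on earlier
    instruction i as a list of labels."""
    deps = []
    if i_dst in j_srcs:
        deps.append("RAW")   # flow: j reads what i writes
    if j_dst in i_srcs:
        deps.append("WAR")   # anti: j overwrites what i reads
    if i_dst == j_dst:
        deps.append("WAW")   # output: both write the same register
    return deps
```

Running it on the slide's three examples reproduces the labels: (r3←r1 op r2, r5←r3 op r4) is RAW, (r3←r1 op r2, r1←r4 op r5) is WAR, and (r3←r1 op r2, r3←r6 op r7) is WAW.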
RAW Dependence Handling
 Which of the following flow dependences lead to conflicts in the 5-stage pipeline?

addi ra r- -  IF ID EX MEM WB
addi r- ra -     IF ID EX MEM WB
addi r- ra -        IF ID EX MEM
addi r- ra -           IF ID EX
addi r- ra -              IF ID
addi r- ra -                 IF
57
Pipeline Stall: Resolving Data Dependence

[Pipeline timing diagram, cycles t0–t5: producer Insth (i: rx  _) flows
through IF ID ALU MEM WB; a dependent instruction j (_  rx) waits in ID,
with 3 bubbles inserted for dist(i,j)=1, 2 bubbles for dist(i,j)=2,
1 bubble for dist(i,j)=3, and none for dist(i,j)=4]

Stall = make the dependent instruction
wait until its source data value is available
1. stop all up-stream stages
2. drain all down-stream stages
58
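The bubble counts in the diagram above follow a simple formula for this 5-stage pipeline without forwarding, assuming the register file is written in the first half of WB and read in the second half of ID (a hypothetical helper for illustration):

```python
def bubbles_needed(dist):
    """dist(i, j) = number of instructions the consumer j trails the
    producer i by (dist = 1 means back-to-back).  Without forwarding,
    j's ID read must be at least 4 cycles after i's ID, so any shortfall
    is filled with bubbles."""
    return max(0, 4 - dist)
```

This reproduces the diagram: 3 bubbles for dist 1, 2 for dist 2, 1 for dist 3, and none from dist 4 onward.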
How to Implement Stalling
[Datapath figure, based on P&H: the pipelined MIPS datapath with IF/ID,
ID/EX, EX/MEM, and MEM/WB pipeline registers, control unit, register file,
ALU, and data memory; stalling gates the write enables of the PC and the
IF/ID register, and a mux zeroes the control signals entering ID/EX]
 Stall
 disable PC and IF/ID latching; ensure stalled instruction stays in its stage
 Insert “invalid” instructions/nops into the stage following the stalled one
(called “bubbles”)
Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
59
Carnegie Mellon

RAW Data Dependence Example


 One instruction writes a register ($s0) and next instructions
read this register => read after write (RAW) dependence.
 add writes into $s0 in the first half of cycle 5
 and reads $s0 in cycle 3, obtaining the wrong value (but only if the
pipeline handles data dependences incorrectly!)
 or reads $s0 in cycle 4, again obtaining the wrong value.
 sub reads $s0 in the second half of cycle 5, obtaining the correct value
 subsequent instructions read the correct value of $s0
[Pipeline timing diagram, cycles 1–8:]
add $s0, $s2, $s3
and $t0, $s0, $s1
or  $t1, $s4, $s0
sub $t2, $s0, $s5
60
Compile-Time Detection and Elimination


[Pipeline timing diagram, cycles 1–10:]
add $s0, $s2, $s3
nop
nop
and $t0, $s0, $s1
or  $t1, $s4, $s0
sub $t2, $s0, $s5
 Insert enough NOPs for the required result to be ready


 Or (if you can) move independent useful instructions up 61
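A compiler pass that inserts the NOPs can be sketched as follows. This is a hypothetical model: each instruction is a (destination, sources) pair, and MIN_DIST = 3 matches the no-forwarding pipeline above (register file written in the first half of WB, read in the second half of ID):

```python
MIN_DIST = 3  # a consumer must trail its producer by at least 3 instructions

def insert_nops(program):
    """program: list of (dest, sources) pairs; a nop is (None, [])."""
    out = []
    for dest, srcs in program:
        # Look back for the nearest producer of any source register.
        for back, (d, _) in enumerate(reversed(out), start=1):
            if back >= MIN_DIST:
                break  # all remaining producers are far enough away
            if d is not None and d in srcs:
                # Pad with nops until the producer is MIN_DIST away.
                out.extend([(None, [])] * (MIN_DIST - back))
                break
        out.append((dest, srcs))
    return out
```

On the slide's sequence this inserts exactly two nops between add and and, after which or and sub are already far enough from the add that writes $s0.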
Data Forwarding
 Also called Data Bypassing
 We have already seen the basic idea before
 Forward the result value to the dependent instruction
as soon as the value is available
 Remember dataflow?
 Data value supplied to dependent instruction as soon as it is available
 Instruction executes when all its operands are available

 Data forwarding brings a pipeline closer to data flow execution principles

62
Data Forwarding
[Pipeline timing diagram, cycles 1–8, with add's $s0 forwarded from its
Memory and Writeback stages to the dependent instructions' Execute stages:]
add $s0, $s2, $s3
and $t0, $s0, $s1
or  $t1, $s4, $s0
sub $t2, $s0, $s5
63
Data Forwarding
[Datapath figure, from H&H: the pipelined datapath extended with
forwarding; the Hazard Unit compares RsE and RtE against WriteRegM and
WriteRegW (qualified by RegWriteM and RegWriteW) and drives ForwardAE and
ForwardBE, which select each ALU source from the register file, ALUOutM
(Memory stage), or ResultW (Writeback stage)]
64
Data Forwarding
 Forward to Execute stage from either:
 Memory stage or
 Writeback stage

 When should we forward from either the Memory or Writeback stage?
 If that stage will write a destination register and the destination register
matches the source register.
 If both the Memory and Writeback stages contain matching destination
registers, the Memory stage should have priority, because it contains the
more recently executed instruction.

65
Carnegie Mellon

Data Forwarding
 Forward to Execute stage from either:
 Memory stage or
 Writeback stage

 Forwarding logic for ForwardAE (pseudo code):


if ((rsE != 0) AND (rsE == WriteRegM) AND RegWriteM) then
ForwardAE = 10 # forward from Memory stage
else if ((rsE != 0) AND (rsE == WriteRegW) AND RegWriteW) then
ForwardAE = 01 # forward from Writeback stage
else
ForwardAE = 00 # no forwarding

 Forwarding logic for ForwardBE same, but replace rsE with rtE
66
Stalling
[Pipeline timing diagram, cycles 1–8 (Trouble!):]
lw  $s0, 40($0)
and $t0, $s0, $s1
or  $t1, $s4, $s0
sub $t2, $s0, $s5
 Forwarding is sufficient to resolve RAW data dependences, but …
 There are cases when forwarding is not possible due to
pipeline design and instruction latencies
67
Stalling
[Pipeline timing diagram, cycles 1–8 (Trouble!):]
lw  $s0, 40($0)
and $t0, $s0, $s1
or  $t1, $s4, $s0
sub $t2, $s0, $s5
The lw instruction does not finish reading data until the end of the
Memory stage, so its result cannot be forwarded to the Execute stage of
the next instruction.

68
Stalling
[Pipeline timing diagram, cycles 1–8 (Trouble!):]
lw  $s0, 40($0)
and $t0, $s0, $s1
or  $t1, $s4, $s0
sub $t2, $s0, $s5
The lw instruction has a two-cycle latency; therefore, a dependent
instruction cannot use its result until two cycles later.

The lw instruction receives data from memory at the end of cycle 4. But
the and instruction needs that data as a source operand at the beginning
of cycle 4. There is no way to supply the data with forwarding.
69
Carnegie Mellon

Stalling
[Pipeline timing diagram, cycles 1–9: and repeats Decode and or repeats
Fetch in cycle 4 (Stall), so lw's result can be forwarded to and's
Execute stage in cycle 5:]
lw  $s0, 40($0)
and $t0, $s0, $s1
or  $t1, $s4, $s0
sub $t2, $s0, $s5
70
Stalling Hardware
 Stalls are supported by:
 adding enable inputs (EN) to the Fetch and Decode pipeline
registers
 and a synchronous reset/clear (CLR) input to the Execute pipeline
register
 or an INV bit associated with each pipeline register

 When a lw stall occurs


 StallD and StallF are asserted to force the Decode and Fetch stage
pipeline registers to hold their old values.
 FlushE is also asserted to clear the contents of the Execute stage
pipeline register, introducing a bubble

71
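The lw stall condition described above can be sketched as combinational logic in the style of H&H: if the instruction in Execute is a load (MemtoRegE asserted) whose destination rtE matches a source of the instruction in Decode, assert StallF, StallD, and FlushE together. A behavioral sketch (signal names follow the slides):

```python
def lw_stall(rs_d, rt_d, rt_e, memtoreg_e):
    """Load-use hazard detection.
    rs_d, rt_d: source registers of the instruction in Decode.
    rt_e: destination of the instruction in Execute.
    memtoreg_e: True if the instruction in Execute is a load."""
    lwstall = memtoreg_e and (rs_d == rt_e or rt_d == rt_e)
    # Hold Fetch and Decode, and bubble the Execute stage.
    stall_f = stall_d = flush_e = lwstall
    return stall_f, stall_d, flush_e
```

Asserting all three signals from the same condition is what freezes the Fetch/Decode registers while injecting exactly one bubble into Execute.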
Digital Design & Computer Arch.
Lecture 13: Pipelining

Prof. Onur Mutlu

ETH Zürich
Spring 2020
2 April 2020
We did not cover the following
slides. They are for your benefit.

73
How to Handle Data Dependences
 Anti and output dependences are easier to handle
 write to the destination in one stage and in program order

 Flow dependences are more interesting

 Five fundamental ways of handling flow dependences


 Detect and wait until value is available in register file
 Detect and forward/bypass data to dependent instruction
 Detect and eliminate the dependence at the software level
 No need for the hardware to detect dependence
 Predict the needed value(s), execute “speculatively”, and verify
 Do something else (fine-grained multithreading)
 No need to detect

74
Fine-Grained Multithreading

75
Fine-Grained Multithreading
 Idea: Hardware has multiple thread contexts (PC+registers).
Each cycle, fetch engine fetches from a different thread.
 By the time the fetched branch/instruction resolves, no
instruction is fetched from the same thread
 Branch/instruction resolution latency overlapped with execution
of other threads’ instructions

+ No logic needed for handling control and data dependences within a thread
-- Single thread performance suffers
-- Extra logic for keeping thread contexts
-- Does not overlap latency if not enough
threads to cover the whole pipeline
76
Fine-Grained Multithreading (II)
 Idea: Switch to another thread every cycle such that no two
instructions from a thread are in the pipeline concurrently

 Tolerates the control and data dependency latencies by overlapping the latency with useful work from other threads
 Improves pipeline utilization by taking advantage of multiple
threads

 Thornton, “Parallel Operation in the Control Data 6600,” AFIPS 1964.
 Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978.

77
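A toy model of the fetch engine illustrates the round-robin interleaving (a hypothetical helper; real machines such as the HEP also distinguish available from waiting threads, which this sketch omits):

```python
from collections import deque

def fgmt_fetch_order(thread_pcs, cycles):
    """thread_pcs: {tid: pc} for each hardware thread context.
    Returns the (tid, pc) fetched each cycle; every fetch advances
    that thread's PC by one 4-byte instruction."""
    ready = deque(sorted(thread_pcs))
    trace = []
    for _ in range(cycles):
        tid = ready[0]
        ready.rotate(-1)          # next cycle fetches from another thread
        trace.append((tid, thread_pcs[tid]))
        thread_pcs[tid] += 4      # sequential next PC for this thread
    return trace
```

With at least as many ready threads as pipeline stages, no two in-flight instructions belong to the same thread, so no dependence checking is needed; with fewer threads, bubbles reappear.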
Fine-Grained Multithreading: History
 CDC 6600’s peripheral processing unit is fine-grained
multithreaded
 Thornton, “Parallel Operation in the Control Data 6600,” AFIPS 1964.
 Processor executes a different I/O thread every cycle
 An operation from the same thread is executed every 10 cycles

 Denelcor HEP (Heterogeneous Element Processor)


 Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978.
 120 threads/processor
 available queue vs. unavailable (waiting) queue for threads
 each thread can have only 1 instruction in the processor pipeline; each thread
independent
 to each thread, processor looks like a non-pipelined machine
 system throughput vs. single thread performance tradeoff

78
Fine-Grained Multithreading in HEP
 Cycle time: 100ns

 8 stages  800 ns to complete an instruction
 assuming no memory
access

 No control and data dependency checking

79
Multithreaded Pipeline Example

Slide credit: Joel Emer 80


Sun Niagara Multithreaded Pipeline

Kongetira et al., “Niagara: A 32-Way Multithreaded Sparc Processor,” IEEE Micro 2005.
81
Fine-grained Multithreading
 Advantages
+ No need for dependency checking between instructions
(only one instruction in pipeline from a single thread)
+ No need for branch prediction logic
+ Otherwise-bubble cycles used for executing useful instructions from
different threads
+ Improved system throughput, latency tolerance, utilization

 Disadvantages
- Extra hardware complexity: multiple hardware contexts (PCs, register
files, …), thread selection logic
- Reduced single thread performance (one instruction fetched every N
cycles from the same thread)
- Resource contention between threads in caches and memory
- Some dependency checking logic between threads remains (load/store)
82
Modern GPUs Are FGMT Machines

83
NVIDIA GeForce GTX 285 “core”

64 KB of storage
… for thread contexts
(registers)

[Figure legend: data-parallel (SIMD) functional units with instruction
stream decode/control shared across 8 units; multiply-add and multiply
units; execution context storage]

84
Slide credit: Kayvon Fatahalian
NVIDIA GeForce GTX 285 “core”

64 KB of storage
… for thread contexts
(registers)

 Groups of 32 threads share instruction stream (each group is


a Warp): they execute the same instruction on different data
 Up to 32 warps are interleaved in an FGMT manner
 Up to 1024 thread contexts can be stored
85
Slide credit: Kayvon Fatahalian
NVIDIA GeForce GTX 285

[Chip diagram: the 30 cores arranged in groups, each group sharing
texture (Tex) units]
30 cores on the GTX 285: 30,720 threads


86
Slide credit: Kayvon Fatahalian
End of
Fine-Grained Multithreading

87