Professional Documents
Culture Documents
004 Pipelining PDF
004 Pipelining PDF
Principles of pipelining
Simple pipelining
Structural Hazards
Data Hazards
Control Hazards
Interrupts
Multicycle operations
Pipeline clocking
Fetch
Decode
Read Operands
Operation
Writeback Result
Determine Next Instruction
1998 by Hill, Wood, Sohi,
Smith and Vijaykumar and
Moshovos
time
1/Throughput
instructions
latency
Pipelined
1/Throughput
time
instructions
latency
Time sequential
Ideally: Time pipeline = -----------------------------------------PipelineDepth
Time
fetch I4
decode I4
r0 = r0 + r2
fetch I5
decode I5
r1 = r1 + 1
in-progress
fetch I6
decode I6
r3 = r1 != 10
committed
in-progress
Latency = n max(t i) T
t i
Speedup = ------------------- n
max(t i)
Pipelining Limits
After a certain number of stages benefits level off and later they
start diminishing
Pipeline Utility is limited by:
1. Implementation
a. Logic Delay
b. Clock Skew
c. Latch Delay
2. Structures
3. Programs
2 & 3 will be called HAZARDS
1998 by Hill, Wood, Sohi,
Smith and Vijaykumar and
Moshovos
latch
latch
T = clock period
T >= tl
Todays Procs:
clock
clock
computation starts
tl
T
1998 by Hill, Wood, Sohi,
Smith and Vijaykumar and
Moshovos
10
latch
latch
CLOCK SKEW
stage
i+1
t
CLK
tl
DIE
CLKA
B
tw
1998 by Hill, Wood, Sohi,
Smith and Vijaykumar and
Moshovos
CLKB
Clock Skew
ECE D52 Lecture Notes: Chapter 3
11
tl
T
Worst case scenario:
- Output from stage with late clock feeds stage with early
clock
12
tl
setup
hold
T
1998 by Hill, Wood, Sohi,
Smith and Vijaykumar and
Moshovos
13
latch overhead
clock/data skew
X limits the useful pipeline depth
With n-stage pipeline (all ti equal) (T = n x t)
1
n
----------- throughput =
< X+t T
latency = n ( X + t ) = n X + T
T
speedup = ---------------- n
(X + t)
14
Simple pipelines
F- fetch, D - decode, X - execute, M - memory, W -writeback
i+1
i+2
i+3
i+4
1
0
1
1
1
2
1
3
15
16
T
Non-pipelined Implementation
1998 by Hill, Wood, Sohi,
Smith and Vijaykumar and
Moshovos
17
fetch
decode
execute
memory
writeback
T
Pipelined Implementation: Ideally 5x performance
1998 by Hill, Wood, Sohi,
Smith and Vijaykumar and
Moshovos
18
MEM
REG
MEM
MEM
REG
MEM
REG
REG
REG
MEM
REG
19
Hazards
Hazards
20
Structural Hazards
When two or more different instructions
want to use the same hardware resource in the same cycle
e.g., load and stores use the same memory port as IF
MEM
MEM
REG
MEM
MEM
REG
MEM
REG
MEM
REG
MEM
REG
REG
REG
MEM
REG
21
22
+ good performance
increases cost
increased interconnect delay ?
use for cheap or divisible resources
demux
mux
23
24
Data Hazards
When two different instructions use the same storage location It
must appear as if they executed in sequential order
25
Data Hazards
26
Examples of RAW
add r1, --, -sub --, r1, --
IF ID EX MEM WB
r1 written
IF ID EX MEM WB
r1 read - not OK
IF ID EX MEM WB
r1 written
IF ID EX MEM WB
r1 read - not OK
sw r1, 100(r2)
lw r1, 100(r2)
IF ID EX MEM WB
IF ID EX MEM WB
OK
27
IF ID EX MEM WB
IF stall stall
r1 written
IF ID EX MEM WB
28
Stall Methods
Compare ahead in pipe
29
IF ID EX MEM WB
IF ID EX MEM WB
data used
+ reduces/avoids stalls
complex
Deeper pipelines -> more places to bypass from
crucial for common RAW hazards
1998 by Hill, Wood, Sohi,
Smith and Vijaykumar and
Moshovos
30
Bypass
Interlock logic
detect hazard
bypass correct result to ALU
Hardware detection requires extra hardware
31
Bypass Example
32
33
RAW solutions
Hybrid (i.e., stall and bypass) solutions required sometimes
IF ID EX MEM WB
stall IF ID
EX MEM WB
DLX has one cycle bubble if load result used in next instruction
Try to separate stall logic from bypass logic
34
Rb,
lw
Rc,
stall
add Ra,
Rb,
sw
a,
ra
lw
Re,
lw
Rf,
Rc
stall
sub Rd,
Re,
sw
Rd
d,
Rf
ECE D52 Lecture Notes: Chapter 3
35
Pipeline Scheduling
After scheduling
lw
Rb,
lw
Rc,
lw
Re,
add Ra,
Rb,
lw
Rf,
sw
a,
ra
sub Rd,
Re,
sw
Rd1
d,
Rc
Rf
No Stalls
36
Delayed Load
Avoid hardware solutions - Let the compiler deal with it
Instruction Immediately after load cant/shouldnt see load result
lw Rb, b
lw Rc, c
nop
add Ra, Rb, Rc
sw a, Ra
lw Re, efs
lw Rf, f
nop
add Rd, Re, Rf ...
lw Rb, b
SCHEDULED
UNSCHEDULED
37
IF
ID
EX MEM1 MEM2
IF
ID
WB
EX MEM1 MEM2
WB
--, (r1++)
ECE D52 Lecture Notes: Chapter 3
38
update r1
39
Control Hazards
When an instruction affects which instruction execute next
or changes the PC
sw $4, 0($5)
bne $2, $3, loop
sub -, - , -
sw
X*
bne
??
40
Control Hazards
Handling control hazards is very important
VAX e.g.,
41
target := PC + immediate
if (Rs1 op 0) PC := target
42
W
squashed
43
must be fast
cant afford to subtract
compares with 0 are simple
gt, lt test sign-bit
eq, ne must OR all bits
More general conditions need ALU
44
static - by compiler
dynamic - by hardware
MORE ON THIS LATER ON
45
46
i
i+1
i+2
47
i+8
i+9
48
i:
beqz r1, #8
i
i+1 (delay slot)
i+8
49
50
control-independent of A
Nullifying or Cancelling or Likely Branches:
Specify when delay slot is execute and when is squashed
Why? Increase fill opportunities
Major Concern w/ DS: Exposes implementation optimization
1998 by Hill, Wood, Sohi,
Smith and Vijaykumar and
Moshovos
51
52
taken penalty
not-taken pen.
CPI penalty
naive
0.420
fast branch
0.140
not-taken
0.091
taken
0.049
delayed branch
0.5
0.5
0.070
53
scheme
taken penalty
not-taken pen.
CPI penalty
naive
0.840
fast branch
0.280
not-taken
0.182
taken
0.098
delayed branch
54
Interrupts
Examples:
55
Classifying Interrupts
1a. synchronous
OS call
2b. coerced
56
Classifying Interrupts
3a. User Maskable
User can disable processing
3b. Non-Maskable
User cannot disable processing
4a. Between Instructions
Usually asynchronous
4b. Within an instruction
Usually synchronous - Harder to deal with
5a. Resume
As if nothing happened? Program will continue execution
5b. Termination
1998 by Hill, Wood, Sohi,
Smith and Vijaykumar and
Moshovos
57
Restartable Pipelines
Interrupts within an instruction are not catastrohic
Most machines today support this
Needed for virtual memory
Some machines did not support this
Why?
Cost
Slowdown
Key: Precice Interrupts
Will return to this soon
First lets consider a simple DLX-style pipeline
58
Handling Interrupts
Precise interrupts (sequential semantics)
59
Interrupts
E.g., data page fault
i
i+1
i+2
i+3
i+4
9
<- page fault
<- squash
<- squash
x
x+1
trap ->
<- squash
D
60
Interrupts
Preceding instructions already complete
Squash succeeding instructions
61
Interrupts
E.g., arithmetic exception
i
i+1
i+2
i+3
i+4
x
x+1
W
<- exception
<- squash
F
trap ->
trap handler ->
<- squash
F
62
Interrupts
E.g., Instruction fetch page fault
i
i+1
i+2
i+3
i+4
x
x+1
F
trap ->
trap handler ->
63
Interrupts
Let preceding instructions complete
No succeeding instructions
What happens if i+3 causes a data page fault?
64
Interrupts
Out-of-order interrupts
i
i+1
i+2
i+3
9
page fault (Mem)
page fault (fetch)
65
Out-of-Order Interrupts
Post interrupts
66
Interrupts
Other complications
67
68
Multicycle operations
Not all operations complete in 1 cycle
69
EX: integer
E+: FP adder
Assume
70
E*
int
fp/
int
fp/
fp*
E*
E*
E*
EX
(1)
E/
E/
E/
E/
EX
--
--
E/
E/
(3)
--
--
(4)
10
11
W
(2)
71
72
FP Instruction Issue
Check for structural hazards
contention in WB
possible WAR/WAW hazards
interrupt headaches
1998 by Hill, Wood, Sohi,
Smith and Vijaykumar and
Moshovos
73
Overlapping Instructions
Contention in WB
static priority
e.g., FU with longest latency
instructions stall after issue
WAR hazards
74
Multicycle Operations
Problems with interrupts
Out-Of-Order completion
Possible imprecise interrupts
What if divf excepts after addf/subf complete?
Precice Interrupts Paper
75
Precice Interrupts
Simple solution: Modify state only when all preceding insts. are
known to be exception free.
stage
FU
DR
PC
div
R1
1000
...
...
...
...
...
add
R2
1001
oldest
youngest
motion
76
Reorder Buffer
reorder buffer
st. FU V tag
tag DR Result V E PC
add
div
motion
head
4
5
r1
r2
tail
motion
div (4 cycles)
add (1 cycle)
Out-of-order completition
Commit: Write results to register file or memory
Reorder buffer holds not yet committed state
1998 by Hill, Wood, Sohi,
Smith and Vijaykumar and
Moshovos
77
Two Solutions:
RF
History Buffer
Future File
RB
1998 by Hill, Wood, Sohi,
Smith and Vijaykumar and
Moshovos
results
78
History Buffer
Allow out-of-oder register file updates
At decode record current value of target register in reorder buffer
entry.
On commit: do nothing
On exception: scan following reorder buffer entries restoring
register values
results
RF
exception
dst.
HB
79
Future File
Two register files:
One updated out-of-order (FUTURE)
assume no exceptions will occur
One updated in order (ARCHITECTURAL)
Advantage: No delay to restore state on exception
results
FF
exception
AF
RB
80