Pipeline Hazards

Lecture 7
Hazards CS510 Computer Architectures Lecture 7 - 1

Pipelining Lessons
• Pipelining doesn’t help latency of a single task; it helps throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup
[Figure: tasks A, B, C, and D in task order against time (6 PM to 9), with segment lengths 30, 40, 40, 40, 40, 20]
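The effect of the slowest stage and of fill and drain time can be checked with a small calculation. The sketch below assumes four identical tasks and three stages lasting 30, 40, and 20 time units (one reading of the segment lengths 30, 40, 40, 40, 40, 20 in the figure); the function names are illustrative only.

```python
def sequential_time(stage_times, n_tasks):
    """Each task runs start to finish before the next one begins."""
    return n_tasks * sum(stage_times)


def pipelined_time(stage_times, n_tasks):
    """Tasks overlap; one task completes every slowest-stage interval.

    The first task needs sum(stage_times) to pass through the pipeline,
    and each later task finishes max(stage_times) after the one before it.
    """
    return sum(stage_times) + (n_tasks - 1) * max(stage_times)


stage_times = [30, 40, 20]  # assumed stage lengths (time units)
n_tasks = 4                 # tasks A, B, C, D

seq = sequential_time(stage_times, n_tasks)
pipe = pipelined_time(stage_times, n_tasks)
speedup = seq / pipe

print(f"sequential: {seq}, pipelined: {pipe}, speedup: {speedup:.2f}")
```

Under these assumptions the speedup is about 1.71, well below the 3 that the number of stages would suggest, because the stages are unbalanced and only four tasks share the fill and drain cost.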

It’s Not That Easy to Achieve
the Promised Performance
• Limits to pipelining: Hazards prevent the next instruction
from executing during its designated clock cycle
– Structural hazards: HW cannot support this combination of
instructions
– Data hazards: Instruction depends on result of prior
instruction still in the pipeline
– Control hazards: Pipelining of branches and other
instructions that change the PC
• Common solution is to stall the pipeline until the hazard is
resolved, inserting one or more “bubbles”, i.e., idle clock
cycles, in the pipeline

Structural Hazards / Memory
[Figure: pipeline diagram over clock cycles CC1 to CC9 for a LOAD followed by Instr 1 through Instr 4, each passing through Mem, Reg, ALU, Mem, Reg. Callout: operation on memory by 2 different instructions in the same clock cycle.]
Structural Hazards
with Single-Port Memory
[Figure: pipeline diagram over clock cycles for a LOAD and the instructions that follow it, with Stall rows inserted before Instr 3. Callout: 3 cycles stall with 1-port memory.]

Avoiding Structural Hazard
with Dual-Port Memory
[Figure: pipeline diagram over clock cycles for a LOAD followed by Instr 1 through Instr 5, each passing through IM, Reg, ALU, DM, Reg. Callout: no stall with 2-port memory.]

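Speed Up Equation
for Pipelining
\[
\text{Speedup from pipelining}
= \frac{\text{Ave Instr Time}_{\text{unpipelined}}}{\text{Ave Instr Time}_{\text{pipelined}}}
= \frac{\text{CPI}_{\text{unpipelined}} \times \text{Clock Cycle}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}} \times \text{Clock Cycle}_{\text{pipelined}}}
= \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}
\]
\[
\text{Ideal CPI} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{Pipeline depth}} \quad \text{(Pipeline depth = number of pipeline stages)}
\]
\[
\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}
\]
Ideal CPI for pipelined machines is almost always 1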

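Speed Up Equation
for Pipelining
\[
\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instr}
= 1 + \text{Pipeline stall clock cycles per instr}
\]
\[
\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}
\]
\[
\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}
\]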

Dual-Port vs Single-Port Memory
• Machine A: 2-port memory (needs no stall for Load); same clock cycle
as unpipelined machine
• Machine B: 1-port memory (needs 3 cycles stall for Load); 1.05 times
faster clock rate than the unpipelined machine
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed
SpeedUpA = [Pipeline Depth/(1 + 0)] x (clockunpipe/clockpipe)
= Pipeline Depth
SpeedUpB = [Pipeline Depth/(1 + 0.4 x 3)] x [clockunpipe/(clockunpipe/1.05)]
= (Pipeline Depth/2.2) x 1.05
= 0.48 x Pipeline Depth
SpeedUpA / SpeedUpB = 2.2/1.05 = 2.1
Machine A is about 2.1 times faster
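The arithmetic of this comparison can be checked by evaluating the speedup equation directly. The sketch below is illustrative: the function name and the pipeline depth of 5 are assumptions (the depth cancels in the ratio), while the load fraction, stall length, and clock advantage come from the bullets above. With a 3-cycle stall on 40% of instructions, the stall term is 0.4 x 3 = 1.2 cycles per instruction, so the pipelined CPI of Machine B is 2.2.

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0, ideal_cpi=1.0):
    """Speedup over the unpipelined machine.

    clock_ratio is Clock Cycle (unpipelined) / Clock Cycle (pipelined).
    """
    return (ideal_cpi * depth) / (ideal_cpi + stall_cpi) * clock_ratio


DEPTH = 5             # assumed pipeline depth; it cancels in the A/B ratio
LOAD_FRACTION = 0.40  # loads are 40% of instructions
LOAD_STALL = 3        # stall cycles per load with 1-port memory

speedup_a = pipeline_speedup(DEPTH, stall_cpi=0.0)
speedup_b = pipeline_speedup(DEPTH, stall_cpi=LOAD_FRACTION * LOAD_STALL,
                             clock_ratio=1.05)
ratio = speedup_a / speedup_b

print(f"A: {speedup_a:.2f}  B: {speedup_b:.2f}  A/B: {ratio:.2f}")
```

The faster clock of Machine B does not come close to compensating for a stall on every load under these parameters.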

Data Hazard on Registers
[Figure: pipeline diagram over clock cycles (Mem, Reg, ALU, Mem, Reg) for the sequence below. ADD writes R1 in its final Reg stage; the following instructions read R1 in their first Reg stage.]
ADD R1,R2,R3
SUB R4,R1,R3
AND R6,R1,R7
OR R8,R1,R9
XOR R10,R11,R1

Registers can be made to read and store in the same cycle
such that data is stored in the first half of the clock cycle, and
that data can be read in the second half of the same clock cycle
[Figure: one clock cycle split into two halves: store into register Ri in the first half, read from register Ri in the second half]

[Figure: pipeline diagram over clock cycles for the same sequence (ADD, SUB, AND, OR, XOR), with the instructions that read R1 delayed until R1 has been written. Callout: needs to stall 2 cycles.]

Three Generic Data Hazards
Instr_i followed by Instr_j

• Read After Write (RAW)
Instr_j tries to read operand before Instr_i writes it
Instr_i: LW R1, 0(R2)
Instr_j: SUB R4, R1, R5

• Write After Read (WAR)
Instr_j tries to write operand before Instr_i reads it
Instr_i: ADD R1, R2, R3
Instr_j: LW R2, 0(R5)
Can’t happen in DLX 5 stage pipeline because:
– All instructions take 5 stages,
– Reads are always in stage 2, and
– Writes are always in stage 5

• Write After Write (WAW)
Instr_j tries to write operand before Instr_i writes it
– Leaves wrong result (Instr_i, not Instr_j)
Instr_i: LW R1, 0(R2)
Instr_j: LW R1, 0(R3)
Can’t happen in DLX 5 stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
Will see WAR and WAW in later, more complicated pipes
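The three categories can be told apart by comparing the registers each instruction reads and writes. The sketch below is a minimal illustration using the instruction pairs from these slides; the `Instr` type and `classify_hazards` function are hypothetical names, and the result is the set of dependences between the pair. Whether a dependence becomes an actual hazard depends on the pipeline, as noted above for DLX.

```python
from typing import List, NamedTuple


class Instr(NamedTuple):
    text: str
    writes: frozenset  # registers written
    reads: frozenset   # registers read


def classify_hazards(first: Instr, second: Instr) -> List[str]:
    """Name the dependences when `second` follows `first` in program order."""
    hazards = []
    if first.writes & second.reads:
        hazards.append("RAW")  # second reads what first writes
    if first.reads & second.writes:
        hazards.append("WAR")  # second writes what first reads
    if first.writes & second.writes:
        hazards.append("WAW")  # both write the same register
    return hazards


lw_r1 = Instr("LW R1, 0(R2)", frozenset({"R1"}), frozenset({"R2"}))
sub_r4 = Instr("SUB R4, R1, R5", frozenset({"R4"}), frozenset({"R1", "R5"}))
add_r1 = Instr("ADD R1, R2, R3", frozenset({"R1"}), frozenset({"R2", "R3"}))
lw_r2 = Instr("LW R2, 0(R5)", frozenset({"R2"}), frozenset({"R5"}))
lw_r1b = Instr("LW R1, 0(R3)", frozenset({"R1"}), frozenset({"R3"}))

for a, b in [(lw_r1, sub_r4), (add_r1, lw_r2), (lw_r1, lw_r1b)]:
    print(f"{a.text:16s} -> {b.text:16s} {classify_hazards(a, b)}")
```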

Forwarding
to Avoid Data Hazards
[Figure: pipeline diagram over clock cycles (Mem, Reg, ALU, Mem, Reg) for the sequence below, with no stall cycles]
ADD R1,R2,R3
SUB R4,R1,R3
AND R6,R1,R7
OR R8,R1,R9
XOR R10,R11,R1

HW Change
for Forwarding
[Figure: datapath showing the D/A, A/M, and M/W buffers, the ALU with its Zero? test, Data Memory, and MUXes]

Load Delay Due to Data Hazard
[Figure: pipeline diagram over clock cycles (IM, Reg, ALU, DM, Reg) for the sequence below. Callout: Load Delay = 2 cycles.]
LOAD R1,0(R2)
SUB R4,R1,R6
AND R6,R1,R7
OR R8,R1,R9

Load Delay
with Forwarding
We need to add HW, called Pipeline Interlock
[Figure: pipeline diagram over clock cycles (IM, Reg, ALU, DM, Reg) for the sequence below. Callout: Load Delay with Forwarding = 1 cycle.]
LOAD R1,0(R2)
SUB R4,R1,R6
AND R6,R1,R7
OR R8,R1,R9

Software Scheduling
to Avoid Load Hazards
Try to produce fast code for
a = b + c;
d = e - f;
assuming a, b, c, d, e, and f are in memory.
Slow code (with forwarding):          Fast code:
LW  Rb,b                              LW  Rb,b
LW  Rc,c                              LW  Rc,c
ADD Ra,Rb,Rc   RAW, stall after load  LW  Re,e
SW  a,Ra       RAW                    ADD Ra,Rb,Rc
LW  Re,e                              LW  Rf,f
LW  Rf,f                              SW  a,Ra
SUB Rd,Re,Rf   RAW, stall after load  SUB Rd,Re,Rf
SW  d,Rd       RAW                    SW  d,Rd
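The benefit of the reordering can be checked by counting load-use stalls in each sequence. The sketch below is a simplified model, not the DLX interlock logic: it assumes forwarding and a one-cycle load delay, and charges one stall whenever an instruction uses the register loaded by the instruction immediately before it. The tuple encoding and function name are illustrative.

```python
def count_load_use_stalls(program):
    """Count stalls under a one-cycle load delay with forwarding.

    program is a list of (opcode, dest, sources) tuples. A stall is charged
    when an instruction reads the destination of the LW directly before it.
    """
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_sources = curr
        if prev_op == "LW" and prev_dest in curr_sources:
            stalls += 1
    return stalls


slow = [
    ("LW", "Rb", ()), ("LW", "Rc", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
fast = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()),
    ("SW", None, ("Ra",)), ("SUB", "Rd", ("Re", "Rf")),
    ("SW", None, ("Rd",)),
]

slow_stalls = count_load_use_stalls(slow)
fast_stalls = count_load_use_stalls(fast)
print(f"slow: {slow_stalls} stalls, fast: {fast_stalls} stalls")
```

In this model the slow sequence stalls at ADD and SUB, and the fast sequence places an independent instruction after each load that feeds a later use.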

Compiler Avoiding Load Stalls
% loads stalling pipeline
          unscheduled   scheduled
gcc       54%           31%
spice     42%           14%
tex       65%           25%

Pipelined DLX Datapath
[Figure: datapath with IF, ID, EX, Mem, and WB stages: PC, +4 adder, Instr. Memory, F/D Buffer, Reg File, Sign Ext (16 to 32), D/A Buffer, ALU, Add, Zero? test, A/M Buffer, Data Memory, M/W Buffer, LMD, SMD, and MUXes. Callouts: branch address calculation; decide condition; branch decision for target address.]
Control Hazard on Branches:
Three Stall Cycles
[Figure: pipeline diagram over clock cycles (IM, Reg, ALU, DM, Reg), program execution order in instructions. Callouts: the instructions after the branch shouldn’t be executed when the branch condition is true; branch target available; Branch Delay = 3 cycles.]
40 BEQ R1,R3, 36
44 AND R12,R2, R5
48 OR R13,R6, R2
52 ADD R14,R2, R2
80 LD R4,R7, 100

Control Hazard on Branches:
Three Stall Cycles
Branch instruction     IF    ID    EX    MEM   WB
Branch successor             IF    ID    EX    MEM
Branch successor + 1               IF    ID    EX
Branch successor + 2                     IF    ID
Callouts, in time order:
– We don’t know yet that the instruction being executed is a branch. Fetch the branch successor.
– Now we know the instruction being executed is a branch, but stall until the branch target address is known.
– Now the target address is available.
3 wasted clock cycles for the TAKEN branch

Branch Stall Impact
• If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1 + 0.3 x 3 = 1.9
– About half of the ideal speed
• Two part solution:
– Determine whether the branch is TAKEN or NOT TAKEN sooner, AND
– Compute TAKEN Branch Address (Branch Target) earlier
• DLX branch tests if register = 0 or ≠ 0
DLX Solution: Get New PC earlier
- Move Zero test to ID stage
- Additional ADDER to calculate New PC (taken PC) in ID stage
- 1 clock cycle penalty for branch in contrast to 3 cycles

Pipelined DLX Datapath
[Figure: revised datapath with IF, ID, EX, Mem, and WB stages, showing a second Add unit and the Zero? test alongside PC, +4 adder, Instr. Memory, F/D Buffer, Reg File, Sign Ext (16 to 32), D/A Buffer, ALU, A/M Buffer, Data Memory, M/W Buffer, LMD, SMD, and MUXes. Callouts: to get target addr. earlier; to get the condition earlier; target address available after ID; when a branch instruction is in Execute stage, next address is available here.]

Branch Behavior in Programs
• Conditional branch frequencies
– integer average --- 14 to 16 %
– floating point --- 3 to 12 %
• Forward and backward taken branches
– forward taken --- 60 %
– backward taken --- 85 %
– the average of all conditional branches ---- 67 %

4 Branch Hazard Alternatives
• Stall until branch direction is clear
• Predict branch NOT TAKEN
• Predict branch TAKEN
• Delayed branch

4 Branch Hazard Alternatives:
(1) STALL
Stall until branch direction is clear
Branch instruction     IF    ID    EX    MEM   WB
Branch successor             stall stall stall IF    ID    EX    MEM
Branch successor + 1                                 IF    ID    EX
3 cycle penalty
Revised DLX pipeline (branch address available when the branch enters EX)
Branch instruction     IF    ID    EX    MEM   WB
Branch successor             stall IF    ID    EX    MEM   WB
Branch successor + 1                     IF    ID    EX    MEM
1 cycle penalty (Branch Delay Slot)

4 Branch Hazard Alternatives:
(2) Predict Branch “NOT TAKEN”
• Execute successor instructions in the sequence
• PC+4 is already calculated, so use it to get the next instruction
• Flush instructions in the pipeline if branch is actually TAKEN
• Advantage of late pipeline state update
• 47% of DLX branches are NOT TAKEN on the average
NOT TAKEN branch instruction i   IF    ID    EX    MEM   WB
instruction i+1                        IF    ID    EX    MEM   WB
instruction i+2                              IF    ID    EX    MEM   WB
No penalty

TAKEN branch instruction i       IF    ID    EX    MEM   WB
instruction i+1                        IF    ID    EX    MEM   WB    (flush this instruction in progress)
instruction T                                IF    ID    EX    MEM   WB
1 cycle penalty

(3) Predict Branch “TAKEN”
– 53% DLX branches TAKEN on average
– Branch target address available after ID in DLX
• DLX still incurs 1 cycle branch penalty for TAKEN branch
• Other machines: branch target known before outcome
NOT TAKEN branch (TAKEN address not available at this time):
NOT TAKEN branch instruction i   IF    ID    EX    MEM   WB
Instruction T                          stall IF
Instruction i+1                                    IF    ID    EX    MEM   WB
2 cycle penalty in DLX (1 in other machines)

TAKEN branch (TAKEN address available):
TAKEN branch instruction i       IF    ID    EX    MEM   WB
Instruction T                          stall IF    ID    EX    MEM   WB
Instruction T+1                                    IF    ID    EX    MEM   WB
1 cycle penalty in DLX (0 in other machines)

4 Branch Hazard Alternatives:
(4) Delayed Branch
Delayed Branch
– Delay branch to take place AFTER a successor instruction
branch instruction
sequential successor 1
sequential successor 2
........                 (Delayed Branch of length n)
sequential successor n
branch target if taken
– A 1-slot delayed branch allows time for the branch decision and the branch target
address in the 5 stage DLX pipeline with the control hazard improvement

Delayed Branch
• Where to get instructions to fill branch delay slot?
– Before branch instruction
– From the target address: only valuable when branch TAKEN
– From fall through: only valuable when branch NOT TAKEN
– Canceling branches allow more slots to be filled
• Compiler effectiveness for single delayed branch slot:

– Fills about 60% of delayed branch slots
– About 80% of instructions executed in delayed branch slots are
useful in computation
– About 50% (60% x 80%) of slots usefully filled

Delayed Branch
From before
Before scheduling:
ADD R1, R2, R3
if R2=0 then
    [Delay slot]
After scheduling:
if R2=0 then
    ADD R1, R2, R3
- Always improves performance
- Branch must not depend on rescheduled instructions

From target
Before scheduling:
SUB R4, R5, R6
ADD R1, R2, R3
if R1=0 then
    [Delay slot]
After scheduling:
ADD R1, R2, R3
if R1=0 then
    SUB R4, R5, R6
- Improves performance when TAKEN (loop)
- Must be alright to execute rescheduled instructions if Not Taken
- May need to duplicate the instruction if it is the target of another branch instr.

From fall through
Before scheduling:
ADD R1, R2, R3
if R1=0 then
    [Delay slot]
SUB R4, R5, R6
After scheduling:
ADD R1, R2, R3
if R1=0 then
    SUB R4, R5, R6
- Improves performance when NOT TAKEN
- Must be alright to execute instructions if Taken

Limitations on Delayed
Branch
• Difficulty in finding useful instructions to fill the delayed
branch slots
• Solution - Squashing
– Delayed branch associated with a branch prediction
– Instructions in the predicted path are executed in the
delayed branch slot
– If the branch outcome is mispredicted, instructions in the
delayed branch slot are squashed(discarded)

Canceling Branch
• Used when the delayed branch scheduling, i.e., filling the delay
slot cannot be done due to
– Restrictions on scheduling instructions at the delay slots
– Limitations on the ability to predict whether it will TAKE or NOT
TAKE at compile time
• Instruction includes the direction that the branch was predicted
– When the branch behaves as predicted, the instructions in the
delay slot are executed
– When branch is incorrectly predicted, the instructions in the delay
slot are turned into No-OPs
• A Canceling Branch allows the delay slot to be filled even if the
instruction placed in it does not meet the requirements for an ordinary
delayed branch
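The rule for the delay slot of a canceling branch can be written down in a few lines. The sketch below is illustrative only: the function names and the outcome list are hypothetical, and it models a single delay slot whose instruction is either executed or turned into a no-op.

```python
def delay_slot_action(predicted_taken, actually_taken):
    """What happens to the delay-slot instruction of a canceling branch."""
    # Executed when the branch behaves as predicted, turned into a no-op otherwise.
    return "execute" if predicted_taken == actually_taken else "nop"


def useful_slot_fraction(predicted_taken, outcomes):
    """Fraction of branch executions in which the delay slot does useful work."""
    executed = sum(
        delay_slot_action(predicted_taken, outcome) == "execute"
        for outcome in outcomes
    )
    return executed / len(outcomes)


# Hypothetical loop branch: taken three times, then falls through once.
outcomes = [True, True, True, False]
fraction = useful_slot_fraction(predicted_taken=True, outcomes=outcomes)
print(f"delay slot useful in {fraction:.0%} of executions")
```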

Evaluating Branch
Alternatives
\[
\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{\text{CPI}}
= \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}
\]
Conditional and Unconditional branches collectively 14% frequency, 65% of branches are TAKEN

Scheduling scheme   Branch penalty   CPI                 Speedup vs unpipelined   Speedup vs stall
Stall pipeline      3                1+0.14x3=1.42       5/1.42=3.52              1.00
Predict Taken       1                1+0.14x1=1.14       5/1.14=4.39              1.25
Predict Not Taken   1                1+0.14x0.65=1.091   5/1.091=4.58             1.30
Delayed branch      0.5              1+0.14x0.5=1.07     5/1.07=4.67              1.33
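The rows of this comparison can be recomputed from the speedup formula. The sketch below is illustrative (the function and variable names are assumptions) and uses the slide's parameters: pipeline depth 5, 14% branch frequency, and 65% of branches taken. It carries no intermediate rounding, so the last digit can differ from a table rounded by hand.

```python
def branch_cpi(branch_freq, penalty, fraction_penalized=1.0):
    """CPI = 1 + branch frequency x penalty x fraction of branches that pay it."""
    return 1.0 + branch_freq * penalty * fraction_penalized


DEPTH = 5            # pipeline depth
BRANCH_FREQ = 0.14   # conditional and unconditional branches
TAKEN = 0.65         # fraction of branches that are taken

schemes = {
    "Stall pipeline": branch_cpi(BRANCH_FREQ, 3),
    "Predict Taken": branch_cpi(BRANCH_FREQ, 1),
    # Only taken branches pay the penalty when predicting not taken.
    "Predict Not Taken": branch_cpi(BRANCH_FREQ, 1, TAKEN),
    "Delayed branch": branch_cpi(BRANCH_FREQ, 0.5),
}

stall_cpi = schemes["Stall pipeline"]
for name, cpi in schemes.items():
    print(f"{name:18s} CPI={cpi:.3f}  speedup={DEPTH / cpi:.2f}  "
          f"vs stall={stall_cpi / cpi:.2f}")
```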

Static (Compiler) Prediction of
Taken/Untaken Branches
Code Motion
   LW   R1, 0(R2)
   SUB  R1, R1, R3      (depends on LW, need to stall)
   BEQZ R1, L
   OR   R4, R5, R6
   ADD  R10, R4, R3
L: ADD  R7, R8, R9
• If the branch is almost always NOT TAKEN, and R4 is not needed on the taken path,
and R5 and R6 are not modified in the following instruction(s), moving
OR R4, R5, R6 up can increase speed
• If the branch is almost always TAKEN, and R7 is not needed, and R8 and R9
are not modified on the fall-through path, moving ADD R7, R8, R9 up can
increase speed


Static(Compiler) Prediction
of Taken/Untaken Branches
• Improves strategy for placing instructions in delay slot
• Two strategies
– Direction-based Prediction:
TAKEN backward branch, NOT TAKEN forward branch
– Profile-based prediction:
Record branch behaviors, predict branch based on the prior run(s)
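Direction-based prediction needs only the branch address and its target. The sketch below is a minimal illustration: the function names, addresses, and outcome trace are all hypothetical and chosen only to show how a misprediction rate would be measured for this rule.

```python
def predict_taken(branch_pc, target_pc):
    """Direction-based rule: backward branches TAKEN, forward branches NOT TAKEN."""
    return target_pc <= branch_pc


def misprediction_rate(trace):
    """trace is a list of (branch_pc, target_pc, actually_taken) tuples."""
    wrong = sum(
        predict_taken(pc, target) != taken for pc, target, taken in trace
    )
    return wrong / len(trace)


# Hypothetical trace: a backward loop branch at address 100 (taken three times,
# then exits) and a forward branch at address 200 (not taken once, taken once).
trace = [
    (100, 60, True), (100, 60, True), (100, 60, True), (100, 60, False),
    (200, 240, False), (200, 240, True),
]

rate = misprediction_rate(trace)
print(f"misprediction rate: {rate:.1%}")
```

A profile-based predictor would replace `predict_taken` with a lookup of the direction each branch took most often in a prior run.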
Frequency of Misprediction
[Figure: two bar charts of misprediction rate per benchmark (doduc, gcc, ora, tomcatv, alvinn, hydro2d, compress, espresso, mdljsp2, swm256), one with an axis from 0% to 70% and one with an axis from 0% to 14%. Panel labels: “Always taken” and “Taken backwards, Not Taken forwards”.]
Evaluating Static Branch
Prediction Strategies
• Misprediction rate ignores frequency of branch
• Instructions between mispredicted branches is a better metric
[Figure: bar chart of instructions per mispredicted branch (log scale, 10 to 100000) per benchmark (gcc, doduc, ora, tomcatv, alvinn, hydro2d, compress, espresso, mdljsp2, swm256), comparing Profile-based and Direction-based prediction]
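The reason the second metric is more informative can be seen with a short calculation: the average number of instructions between mispredictions is the reciprocal of branch frequency times misprediction rate. The sketch below uses two hypothetical programs; none of the numbers come from the benchmarks in the figure.

```python
def instructions_per_misprediction(branch_freq, mispredict_rate):
    """Average instructions executed between two mispredicted branches."""
    return 1.0 / (branch_freq * mispredict_rate)


# Hypothetical programs: A mispredicts less often per branch but branches
# far more frequently than B.
program_a = instructions_per_misprediction(branch_freq=0.20, mispredict_rate=0.05)
program_b = instructions_per_misprediction(branch_freq=0.05, mispredict_rate=0.10)

print(f"A: {program_a:.0f} instructions between mispredictions")
print(f"B: {program_b:.0f} instructions between mispredictions")
```

Program B has twice the misprediction rate of program A, yet runs twice as long between mispredictions because it executes far fewer branches.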

Pipelining Summary
• Just overlap tasks, and easy if tasks are independent
• Speed Up <= Pipeline Depth; if ideal CPI is 1, then:
\[
\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}
\]
• Hazards limit performance on computers:
– Structural: need more HW resources
– Data: need forwarding, compiler scheduling
– Control: Dynamic Prediction, Delayed branch slot, Static (compiler) Prediction
