
Lecture 7

Pipeline Hazards

Hazards CS510 Computer Architectures Lecture 7 - 1


Pipelining Lessons
• Pipelining doesn’t help latency of a single task; it helps throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup
[Figure: tasks A to D in task order versus time, 6 PM to 9 PM; stage times 30, 40, 40, 40, 40, 20]
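The lessons above can be checked with a small timing model. This is only a sketch under stated assumptions: every stage advances in lockstep on a clock set by the slowest stage, and the stage times and task count are hypothetical values, not the ones in the figure.

```python
def pipeline_times(stage_times, n_tasks):
    """Return (unpipelined_time, pipelined_time) for n_tasks identical tasks.

    Assumption: a lockstep pipeline whose clock period equals the slowest stage.
    """
    depth = len(stage_times)
    cycle = max(stage_times)  # pipeline rate is limited by the slowest stage
    unpipelined = n_tasks * sum(stage_times)
    # The first task takes `depth` cycles (fill); each later task finishes one cycle after.
    pipelined = (depth + n_tasks - 1) * cycle
    return unpipelined, pipelined


# Hypothetical stage times: same total work, balanced versus unbalanced stages.
balanced = pipeline_times([10, 10, 10, 10, 10], 1000)
unbalanced = pipeline_times([5, 10, 20, 10, 5], 1000)

speedup_balanced = balanced[0] / balanced[1]
speedup_unbalanced = unbalanced[0] / unbalanced[1]
print(f"balanced speedup:   {speedup_balanced:.2f}")
print(f"unbalanced speedup: {speedup_unbalanced:.2f}")
```

With balanced stages the speedup approaches, but never reaches, the number of stages because of fill and drain time; unbalanced stages lower it further.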


It’s Not That Easy to Achieve
the Promised Performance
• Limits to pipelining: Hazards prevent the next instruction
from executing during its designated clock cycle
– Structural hazards: HW cannot support this combination of
instructions
– Data hazards: Instruction depends on result of prior
instruction still in the pipeline
– Control hazards: Pipelining of branches and other
instructions that change the PC
• Common solution is to stall the pipeline until the hazard is
resolved, inserting one or more “bubbles”, i.e., idle clock
cycles, in the pipeline



Structural Hazards / Memory
[Figure: pipeline diagram, time (clock cycles CC1 to CC9) versus instruction order (LOAD, Instr 1 to Instr 4); each instruction passes through Mem, Reg, ALU, Mem, Reg]
Operation on memory by 2 different instructions in the same clock cycle

Structural Hazards
with Single-Port Memory
[Figure: pipeline diagram, time (clock cycles CC1 to CC9) versus instruction order (LOAD, Instr 1, Instr 2, Stall, Instr 3); each instruction passes through Mem, Reg, ALU, Mem, Reg]
3 cycles stall with 1-port memory


Avoiding Structural Hazard
with Dual-Port Memory
[Figure: pipeline diagram, time (clock cycles CC1 to CC9) versus instruction order (LOAD, Instr 1 to Instr 5); each instruction passes through IM, Reg, ALU, DM, Reg]
No stall with 2-port memory


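Speed Up Equation
for Pipelining

$$\text{Speedup from pipelining} = \frac{\text{Ave Instr Time}_{\text{unpipelined}}}{\text{Ave Instr Time}_{\text{pipelined}}} = \frac{\text{CPI}_{\text{unpipelined}} \times \text{Clock Cycle}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}} \times \text{Clock Cycle}_{\text{pipelined}}} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$

$$\text{Ideal CPI} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{Pipeline depth (number of pipeline stages)}}$$

$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$

Ideal CPI for pipelined machines is almost always 1
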
Speed Up Equation
for Pipelining

$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instr} = 1 + \text{Pipeline stall clock cycles per instr}$$

$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$
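The speedup equation can be written as a small function. This is a minimal sketch; the function and parameter names are illustrative choices, and the stall value in the example is hypothetical.

```python
def pipeline_speedup(depth, stall_cpi, clock_unpipelined=1.0,
                     clock_pipelined=1.0, ideal_cpi=1.0):
    """Speedup = (Ideal CPI x depth) / (Ideal CPI + stall CPI) x clock cycle ratio."""
    cpi_pipelined = ideal_cpi + stall_cpi
    clock_ratio = clock_unpipelined / clock_pipelined
    return (ideal_cpi * depth) / cpi_pipelined * clock_ratio


# With no stalls and equal clocks, the speedup equals the pipeline depth.
no_stall = pipeline_speedup(depth=5, stall_cpi=0.0)

# Hypothetical case: 0.25 stall cycles per instruction on a 5-stage pipeline.
with_stalls = pipeline_speedup(depth=5, stall_cpi=0.25)

print(no_stall, with_stalls)
```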


Dual-Port vs Single-Port Memory
• Machine A: 2-port memory (needs no stall for Load); same clock cycle
as unpipelined machine
• Machine B: 1-port memory (needs 3 cycles stall for Load); 1.05 times
faster clock rate than the unpipelined machine
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed

SpeedUpA = [Pipeline Depth/(1 + 0)] x (clockunpipe/clockpipe)
         = Pipeline Depth
SpeedUpB = [Pipeline Depth/(1 + 0.4 x 3)] x [clockunpipe/(clockunpipe/1.05)]
         = (Pipeline Depth/2.2) x 1.05
         = 0.48 x Pipeline Depth
SpeedUpA / SpeedUpB = 2.2/1.05 = 2.1

Machine A is about 2.1 times faster

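The comparison can be re-evaluated directly from the stated assumptions (40% loads, 3 stall cycles per load on Machine B, a 1.05 times faster clock on Machine B), which makes the arithmetic easy to check. This is a sketch: the pipeline depth of 5 is a hypothetical placeholder and cancels out of the ratio.

```python
def speedup(depth, stall_cpi, clock_ratio):
    """depth / (1 + stall CPI) x (unpipelined cycle time / pipelined cycle time)."""
    return depth / (1.0 + stall_cpi) * clock_ratio


DEPTH = 5              # hypothetical; cancels out of the A/B ratio
LOAD_FRACTION = 0.4    # loads are 40% of instructions executed
LOAD_STALL = 3         # stall cycles per load with 1-port memory

speedup_a = speedup(DEPTH, 0.0, 1.0)                          # no stalls, same clock
speedup_b = speedup(DEPTH, LOAD_FRACTION * LOAD_STALL, 1.05)  # stalls, faster clock
ratio = speedup_a / speedup_b

print(f"SpeedUpA = {speedup_a:.2f}")
print(f"SpeedUpB = {speedup_b:.2f}")
print(f"SpeedUpA / SpeedUpB = {ratio:.2f}")
```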
Data Hazard on Registers
[Figure: pipeline diagram, time (clock cycles CC1 to CC9), for the instruction sequence below; R1 is written by ADD in its final Reg stage]
ADD R1,R2,R3
SUB R4,R1,R3
AND R6,R1,R7
OR R8,R1,R9
XOR R10,R11,R1


Data Hazard on Registers
Registers can be made to read and store in the same cycle:
data is stored in the first half of the clock cycle, and
that data can be read in the second half of the same clock cycle
[Figure: one clock cycle of register Ri; store into Ri in the first half, read from Ri in the second half]


Data Hazard on Registers
[Figure: pipeline diagram, time (clock cycles CC1 to CC9), for the instruction sequence below]
ADD R1,R2,R3
SUB R4,R1,R3
AND R6,R1,R7
OR R8,R1,R9
XOR R10,R11,R1
Needs to Stall 2 cycles

Three Generic Data Hazards
Instr i followed by Instr j

• Read After Write (RAW)
Instr j tries to read operand before Instr i writes it

Instr i: LW R1, 0(R2)
Instr j: SUB R4, R1, R5


Three Generic Data Hazards
Instr i followed by Instr j

• Write After Read (WAR)
Instr j tries to write operand before Instr i reads it

Instr i: ADD R1, R2, R3
Instr j: LW R2, 0(R5)

Can’t happen in DLX 5 stage pipeline because:
– All instructions take 5 stages,
– Reads are always in stage 2, and
– Writes are always in stage 5


Three Generic Data Hazards
Instr i followed by Instr j

• Write After Write (WAW)
Instr j tries to write operand before Instr i writes it
– Leaves wrong result (that of Instr i, not Instr j)

Instr i: LW R1, 0(R2)
Instr j: LW R1, 0(R3)

Can’t happen in DLX 5 stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
Will see WAR and WAW in later, more complicated pipes
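The three definitions can be expressed as a check on the registers each instruction reads and writes. This is a minimal sketch: the `Instr` record and the `data_dependences` helper are hypothetical names, and the function only reports which orderings must be preserved. Whether a dependence becomes an actual hazard depends on the pipeline (in the 5 stage DLX pipe only RAW can).

```python
from typing import NamedTuple, Optional, Set, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]        # register written, or None
    srcs: Tuple[str, ...]      # registers read


def data_dependences(first: Instr, second: Instr) -> Set[str]:
    """Classify dependences when `first` (Instr i) is followed by `second` (Instr j)."""
    found = set()
    if first.dest is not None and first.dest in second.srcs:
        found.add("RAW")   # j reads what i writes
    if second.dest is not None and second.dest in first.srcs:
        found.add("WAR")   # j writes what i reads
    if first.dest is not None and first.dest == second.dest:
        found.add("WAW")   # both write the same register
    return found


# The instruction pairs used on the three slides.
raw = data_dependences(Instr("LW", "R1", ("R2",)), Instr("SUB", "R4", ("R1", "R5")))
war = data_dependences(Instr("ADD", "R1", ("R2", "R3")), Instr("LW", "R2", ("R5",)))
waw = data_dependences(Instr("LW", "R1", ("R2",)), Instr("LW", "R1", ("R3",)))
print(raw, war, waw)
```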


Forwarding
to Avoid Data Hazards
[Figure: pipeline diagram, time (clock cycles CC1 to CC9), for the instruction sequence below; each instruction passes through Mem, Reg, ALU, Mem, Reg]
ADD R1,R2,R3
SUB R4,R1,R3
AND R6,R1,R7
OR R8,R1,R9
XOR R10,R11,R1


HW Change
for Forwarding
[Figure: forwarding datapath showing the D/A, A/M, and M/W buffers, the ALU with its input MUXes, the Zero? test, and Data Memory]

Load Delay Due to Data Hazard
[Figure: pipeline diagram, time (clock cycles), for the instruction sequence below; each instruction passes through IM, Reg, ALU, DM, Reg]
LOAD R1,0(R2)
SUB R4,R1,R6
AND R6,R1,R7
OR R8,R1,R9
Load Delay = 2 cycles


Load Delay
with Forwarding
[Figure: pipeline diagram, time (clock cycles), for the instruction sequence below; each instruction passes through IM, Reg, ALU, DM, Reg]
LOAD R1,0(R2)
SUB R4,R1,R6
AND R6,R1,R7
OR R8,R1,R9
Load Delay with Forwarding = 1 cycle
We need to add HW, called Pipeline Interlock


Software Scheduling
to Avoid Load Hazards
Try to produce fast code for
a = b + c;
d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code (with forwarding):
    LW  Rb,b
    LW  Rc,c
    ADD Ra,Rb,Rc      RAW, Stall
    SW  a,Ra          RAW
    LW  Re,e
    LW  Rf,f
    SUB Rd,Re,Rf      RAW, Stall
    SW  d,Rd          RAW

Fast code:
    LW  Rb,b
    LW  Rc,c
    LW  Re,e
    ADD Ra,Rb,Rc
    LW  Rf,f
    SW  a,Ra
    SUB Rd,Re,Rf
    SW  d,Rd
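The effect of the reordering can be counted mechanically. This is a sketch under one stated assumption: with forwarding, the only remaining stall is one cycle when an instruction uses a register loaded by the instruction immediately before it. The tuple encoding of the instructions is a hypothetical representation chosen for this example.

```python
def count_load_stalls(program):
    """Count 1-cycle load-use stalls in a list of (op, dest, srcs) tuples.

    Assumption: full forwarding, so only a load immediately followed by a
    consumer of the loaded register stalls the pipeline.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "LW" and prev_dest in cur_srcs:
            stalls += 1
    return stalls


slow_code = [
    ("LW", "Rb", ()), ("LW", "Rc", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
fast_code = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()),
    ("SW", None, ("Ra",)), ("SUB", "Rd", ("Re", "Rf")),
    ("SW", None, ("Rd",)),
]

slow_stalls = count_load_stalls(slow_code)
fast_stalls = count_load_stalls(fast_code)
print(slow_stalls, fast_stalls)
```

Both sequences contain the same eight instructions; only the order differs, so any difference in stall count comes from scheduling alone.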


Compiler Avoiding Load Stalls

% loads stalling pipeline

Benchmark    scheduled    unscheduled
gcc          31%          54%
spice        14%          42%
tex          25%          65%


Pipelined DLX Datapath
[Figure: five-stage datapath (IF, ID, EX, Mem, and WB stages) with PC, +4 adder, Instr. Memory, Reg File, 16-to-32 Sign Ext, ALU, Zero? test, Data Memory, LMD, SMD, MUXes, and the F/D, D/A, A/M, and M/W buffers]
• Branch address calculation
• Decide condition
• Branch decision for target address

Control Hazard on Branches:
Three Stall Cycles
[Figure: pipeline diagram, time (clock cycles CC1 to CC9) versus program execution order in instructions; each instruction passes through IM, Reg, ALU, DM, Reg]
40 BEQ R1,R3,36
44 AND R12,R2,R5
48 OR R13,R6,R2
52 ADD R14,R2,R2
80 LD R4,R7,100
The instructions after the branch (44, 48, 52) shouldn’t be executed when the branch condition is true!
The instruction at 80 is fetched once the branch target is available.
Branch Delay = 3 cycles

Control Hazard on Branches:
Three Stall Cycles

Branch instruction      IF    ID    EX    MEM   WB
Branch successor              IF    ID    EX    MEM
Branch successor + 1                IF    ID    EX
Branch successor + 2                      IF    ID

• At first, we don’t know yet that the instruction being executed is a branch. Fetch the branch successor.
• Next, we know the instruction being executed is a branch. But stall until branch target address is known.
• Finally, the target address is available.

3 wasted clock cycles for the TAKEN branch


Branch Stall Impact
• If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1 + 0.3 x 3 = 1.9
– About half of the ideal speed
• Two part solution:
– Determine whether the branch is TAKEN or NOT TAKEN sooner, AND
– Compute TAKEN Branch Address (Branch Target) earlier
• DLX branch tests if register = 0 or ≠ 0

DLX Solution: Get New PC earlier
- Move Zero test to ID stage
- Additional ADDER to calculate New PC (taken PC) in ID stage
- 1 clock cycle penalty for branch in contrast to 3 cycles
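The CPI impact of branch stalls follows from one line of arithmetic. This sketch uses the slide's figures (base CPI of 1, 30% branches) and compares the 3 cycle and 1 cycle penalties; the function name is an illustrative choice.

```python
def cpi_with_branch_stalls(base_cpi, branch_fraction, stall_cycles):
    """CPI = base CPI + branch frequency x stall cycles per branch."""
    return base_cpi + branch_fraction * stall_cycles


cpi_three_cycle = cpi_with_branch_stalls(1.0, 0.30, 3)  # original pipeline
cpi_one_cycle = cpi_with_branch_stalls(1.0, 0.30, 1)    # zero test and adder moved to ID

print(f"3 cycle penalty: CPI = {cpi_three_cycle:.2f}")
print(f"1 cycle penalty: CPI = {cpi_one_cycle:.2f}")
```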


Pipelined DLX Datapath
[Figure: revised five-stage datapath (IF, ID, EX, Mem, and WB stages) with PC, +4 adder, a second adder, Instr. Memory, Reg File, 16-to-32 Sign Ext, ALU, Zero? test, Data Memory, LMD, SMD, MUXes, and the F/D, D/A, A/M, and M/W buffers]
• Second adder: to get target addr. earlier
• Zero? test: to get the condition earlier
• Target address available after ID
• When a branch instruction is in Execute stage, Next Address is available

Branch Behavior in Programs
• Conditional branch frequencies
– integer average --- 14 to 16 %
– floating point --- 3 to 12 %
• Forward and backward taken branches
– forward taken --- 60 %
– backward taken --- 85 %
– average over all conditional branches --- 67 % taken



4 Branch Hazard Alternatives
• Stall until branch direction is clear
• Predict branch NOT TAKEN
• Predict branch TAKEN
• Delayed branch



4 Branch Hazard Alternatives:
(1) STALL
Stall until branch direction is clear

Branch instruction      IF    ID    EX    MEM   WB
Branch successor              stall stall stall IF    ID    EX    MEM
Branch successor + 1                                  IF    ID    EX
Branch successor + 2                                        IF    ID

3 cycle penalty

Revised DLX pipeline (get the branch address at EX)
Branch instruction      IF    ID    EX    MEM   WB
Branch successor              stall IF    ID    EX    MEM   WB
Branch successor + 1                      IF    ID    EX    MEM
Branch successor + 2                            IF    ID

1 cycle penalty (Branch Delay Slot)

4 Branch Hazard Alternatives:
(2) Predict Branch “NOT TAKEN”
• Execute successor instructions in the sequence
• PC+4 is already calculated, so use it to get the next instruction
• Flush instructions in the pipeline if branch is actually TAKEN
• Advantage of late pipeline state update
• 47% of DLX branches are NOT TAKEN on the average

NOT TAKEN branch: no penalty
branch instruction i    IF    ID    EX    MEM   WB
instruction i+1               IF    ID    EX    MEM   WB
instruction i+2                     IF    ID    EX    MEM   WB

TAKEN branch: 1 cycle penalty
branch instruction i    IF    ID    EX    MEM   WB
instruction i+1               IF    ID    EX    MEM   WB    (flush this instruction in progress)
instruction T                       IF    ID    EX    MEM   WB


4 Branch Hazard Alternatives:
(3) Predict Branch “TAKEN”
– 53% DLX branches TAKEN on average
– Branch target address available after ID in DLX
• DLX still incurs 1 cycle branch penalty for TAKEN branch
• Other machines: branch target known before outcome

NOT TAKEN branch: 2 cycle penalty in DLX (1 in other machines)
(TAKEN address not available during the stall cycle)
NOT TAKEN instruction i IF    ID    EX    MEM   WB
Instruction T                 stall IF
Instruction i+1                           IF    ID    EX    MEM   WB

TAKEN branch: 1 cycle penalty in DLX (0 in other machines)
(TAKEN address available after the stall cycle)
TAKEN instruction i     IF    ID    EX    MEM   WB
Instruction T                 stall IF    ID    EX    MEM   WB
Instruction T+1                           IF    ID    EX    MEM   WB


4 Branch Hazard Alternatives:
(4) Delayed Branch
– Delay branch to take place AFTER a successor instruction

branch instruction
sequential successor 1
sequential successor 2
........                      Delayed Branch of length n
sequential successor n
branch target if taken

– 1 slot delayed branch allows proper decision and branch target
address in 5 stage DLX pipeline with control hazard improvement


Delayed Branch
• Where to get instructions to fill branch delay slot?
– Before branch instruction
– From the target address: only valuable when branch TAKEN
– From fall through: only valuable when branch NOT TAKEN
– Canceling branches allow more slots to be filled

• Compiler effectiveness for single delayed branch slot:


– Fills about 60% of delayed branch slots
– About 80% of instructions executed in delayed branch slots are
useful in computation
– About 50% (60% x 80%) of slots usefully filled

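The compiler-effectiveness figures combine by simple multiplication. This is a minimal sketch using the slide's 60% and 80%; the step from "usefully filled" to an effective per-branch penalty is an added assumption (each slot that is not usefully filled is counted as one wasted cycle).

```python
FILL_RATE = 0.60     # fraction of delayed branch slots the compiler fills
USEFUL_RATE = 0.80   # fraction of filled slots doing useful work

usefully_filled = FILL_RATE * USEFUL_RATE

# Assumption: with a single delay slot, a slot that is not usefully filled
# costs one cycle, so the average penalty per branch is the remainder.
effective_penalty = 1.0 - usefully_filled

print(f"usefully filled slots: {usefully_filled:.2f}")
print(f"effective branch penalty: {effective_penalty:.2f} cycles")
```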


4 Branch Hazard Alternatives:
Delayed Branch

From before
  Before scheduling:            After scheduling:
    ADD R1, R2, R3                if R2=0 then
    if R2=0 then                    ADD R1, R2, R3    (delay slot)
      (delay slot)
- Always improves performance
- Branch must not depend on rescheduled instructions

From target
  Before scheduling:            After scheduling:
    SUB R4, R5, R6                ADD R1, R2, R3
    ADD R1, R2, R3                if R1=0 then
    if R1=0 then                    SUB R4, R5, R6    (delay slot)
      (delay slot)
- Improves performance when TAKEN (loop)
- Must be alright to execute rescheduled instructions if NOT TAKEN
- May need to duplicate the instruction if it is the target of another branch instr.

From fall through
  Before scheduling:            After scheduling:
    ADD R1, R2, R3                ADD R1, R2, R3
    if R1=0 then                  if R1=0 then
      (delay slot)                  SUB R4, R5, R6    (delay slot)
    SUB R4, R5, R6
- Improves performance when NOT TAKEN
- Must be alright to execute instructions if TAKEN

Limitations on Delayed
Branch
• Difficulty in finding useful instructions to fill the delayed
branch slots
• Solution - Squashing
– Delayed branch associated with a branch prediction
– Instructions in the predicted path are executed in the
delayed branch slot
– If the branch outcome is mispredicted, instructions in the
delayed branch slot are squashed(discarded)



Canceling Branch
• Used when delayed branch scheduling, i.e., filling the delay
slot, cannot be done due to
– Restrictions on scheduling instructions into the delay slots
– Limitations on the ability to predict at compile time whether the
branch will be TAKEN or NOT TAKEN
• Instruction includes the direction in which the branch was predicted
– When the branch behaves as predicted, the instructions in the
delay slot are executed
– When the branch is incorrectly predicted, the instructions in the delay
slot are turned into No-OPs
• Canceling Branch allows the delay slot to be filled even if the
instruction placed in the delay slot does not meet the
requirements


Evaluating Branch
Alternatives

$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{\text{CPI}} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$

Conditional and unconditional branches collectively have 14% frequency;
65% of branches are TAKEN

Scheduling scheme    Branch penalty   CPI                 Speedup vs unpipelined   Speedup vs stall
Stall pipeline       3                1+0.14x3=1.42       5/1.42=3.5               1.0
Predict Taken        1                1+0.14x1=1.14       5/1.14=4.4               1.26
Predict Not Taken    1                1+0.14x0.65=1.09    5/1.09=4.5               1.29
Delayed branch       0.5              1+0.14x0.5=1.07     5/1.07=4.6               1.31
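The table can be regenerated from its inputs. This is a sketch using the slide's parameters (pipeline depth 5, 14% branch frequency, 65% taken); the dictionary layout and helper name are illustrative choices. Computed values may differ from the table in the last digit because of rounding.

```python
DEPTH = 5
BRANCH_FREQ = 0.14
TAKEN_FRACTION = 0.65

# Average penalty per branch for each scheme. Predict Not Taken pays its
# 1 cycle penalty only on the 65% of branches that are taken.
average_penalty = {
    "Stall pipeline": 3.0,
    "Predict Taken": 1.0,
    "Predict Not Taken": 1.0 * TAKEN_FRACTION,
    "Delayed branch": 0.5,
}


def evaluate(penalty):
    """Return (CPI, speedup versus unpipelined) for an average branch penalty."""
    cpi = 1.0 + BRANCH_FREQ * penalty
    return cpi, DEPTH / cpi


results = {name: evaluate(p) for name, p in average_penalty.items()}
stall_cpi = results["Stall pipeline"][0]

for name, (cpi, speedup) in results.items():
    print(f"{name:18s} CPI={cpi:.2f}  vs unpipelined={speedup:.2f}  "
          f"vs stall={stall_cpi / cpi:.2f}")
```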


Static (Compiler) Prediction of
Taken/Untaken Branches

Code Motion
      LW   R1, 0(R2)
      SUB  R1, R1, R3      (depends on LW, needs to stall)
      BEQZ R1, L
      OR   R4, R5, R6
      ADD  R10, R4, R3
L:    ADD  R7, R8, R9

• If the branch is almost always NOT TAKEN, and R4 is not needed on the
taken path, and R5 and R6 are not modified in the following
instruction(s), moving OR R4, R5, R6 up, ahead of the branch, can
increase speed
• If the branch is almost always TAKEN, and R7 is not needed, and R8 and R9
are not modified, on the fall-through path, moving ADD R7, R8, R9 up,
ahead of the branch, can increase speed


Static(Compiler) Prediction
of Taken/Untaken Branches
• Improves strategy for placing instructions in delay slot
• Two strategies
– Direction-based Prediction:
TAKEN backward branch, NOT TAKEN forward branch
– Profile-based prediction:
Record branch behaviors, predict branch based on the prior run(s)
[Figure: frequency of misprediction (misprediction rate) for the benchmarks alvinn, compress, doduc, espresso, gcc, hydro2d, mdljsp2, ora, swm256, and tomcatv; two panels, "Always taken" and "Taken backwards, Not Taken forwards", with vertical axes running 0% to 70% and 0% to 14%]

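The two static strategies reduce to very small decision rules. This is a hedged sketch: the function names and the address and count arguments are hypothetical, and a real compiler would attach the prediction to each branch instruction rather than call a function at run time.

```python
def direction_based_prediction(branch_pc, target_pc):
    """Predict TAKEN for backward branches, NOT TAKEN for forward branches."""
    return target_pc < branch_pc


def profile_based_prediction(taken_count, not_taken_count):
    """Predict the direction the branch took more often in the prior run(s)."""
    return taken_count > not_taken_count


# A loop-closing branch jumps backward, so it is predicted TAKEN.
loop_branch = direction_based_prediction(branch_pc=0x40, target_pc=0x20)

# A forward branch is predicted NOT TAKEN.
forward_branch = direction_based_prediction(branch_pc=0x40, target_pc=0x80)

# A branch that was taken 90 times out of 100 in a profiling run is predicted TAKEN.
profiled_branch = profile_based_prediction(taken_count=90, not_taken_count=10)

print(loop_branch, forward_branch, profiled_branch)
```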
Evaluating Static Branch
Prediction Strategies
• Misprediction rate ignores frequency of branch
• Instructions between mispredicted branches is a better metric
[Figure: instructions per mispredicted branch (log scale, 10 to 100000) for the benchmarks alvinn, compress, doduc, espresso, gcc, hydro2d, mdljsp2, ora, swm256, and tomcatv; profile-based versus direction-based prediction]
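A short calculation shows why the second metric is more informative. This is a sketch with hypothetical numbers for two programs; it assumes mispredictions are spread evenly, so the average distance is the reciprocal of mispredicted branches per instruction.

```python
def instructions_per_misprediction(branch_fraction, misprediction_rate):
    """Average instructions executed between mispredicted branches."""
    mispredicts_per_instr = branch_fraction * misprediction_rate
    if mispredicts_per_instr <= 0:
        return float("inf")
    return 1.0 / mispredicts_per_instr


# Hypothetical program X: many branches, lower misprediction rate.
program_x = instructions_per_misprediction(branch_fraction=0.15, misprediction_rate=0.10)

# Hypothetical program Y: few branches, higher misprediction rate.
program_y = instructions_per_misprediction(branch_fraction=0.05, misprediction_rate=0.20)

print(f"X: {program_x:.1f} instructions between mispredictions")
print(f"Y: {program_y:.1f} instructions between mispredictions")
```

Program Y has the worse misprediction rate, yet it runs longer between mispredictions because it executes fewer branches.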


Pipelining Summary
• Just overlap tasks, and easy if tasks are independent
• Speed Up <= Pipeline Depth; if ideal CPI is 1, then:

$$\text{Speedup} = \frac{\text{Pipeline Depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$

• Hazards limit performance on computers:
– Structural: need more HW resources
– Data: need forwarding, compiler scheduling
– Control: Dynamic Prediction, Delayed branch slot, Static (compiler) Prediction