Lec3 Pipe

Graduate Computer Architecture I
Lecture 3: Branch Prediction
Young Cho
Cycles Per Instruction
“Average Cycles per Instruction”
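$$\text{CPI} = \frac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} = \frac{\text{Cycles}}{\text{Instruction Count}}$$

$$\text{CPU time} = \text{Cycle Time} \times \sum_{j=1}^{n} \text{CPI}_j \times I_j$$

$$\text{CPI} = \sum_{j=1}^{n} \text{CPI}_j \times F_j \quad \text{where} \quad F_j = \frac{I_j}{\text{Instruction Count}}$$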
“Instruction Frequency”
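
As a rough illustration of the CPI definitions on this slide, the sketch below computes the average CPI and CPU time from a per-class instruction mix. The function names and the example mix are hypothetical and are not taken from the lecture.

```python
def average_cpi(mix):
    """mix maps an instruction class to (instruction count I_j, cycles CPI_j)."""
    total = sum(count for count, _ in mix.values())
    # CPI = sum over j of CPI_j * F_j, with F_j = I_j / instruction count
    return sum(cpi * count / total for count, cpi in mix.values())


def cpu_time(mix, cycle_time):
    # CPU time = cycle time * sum over j of CPI_j * I_j
    return cycle_time * sum(cpi * count for count, cpi in mix.values())


# Hypothetical instruction mix, chosen only to exercise the formulas.
example_mix = {
    "alu": (50, 1),
    "load": (20, 2),
    "store": (10, 2),
    "branch": (20, 3),
}

print(average_cpi(example_mix))
print(cpu_time(example_mix, cycle_time=2))
```

With this mix, 100 instructions take 170 cycles, so the frequency-weighted sum and the direct ratio Cycles / Instruction Count agree.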
2 - CSE/ESE 560M – Graduate Computer Architecture I

Typical Load/Store Processor
[Figure: pipelined datapath with PC, Instruction Memory, Control, Register File, ALU, and Data Memory, separated by the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB]

Pipelining Laundry
30 minutes 35 minutes 35
25 minutes
3X Increase in
Productivity!!!
With a large number of sets, each load takes an average of ~35 min to finish
Three sets of Clean Clothes in 2 hours 40 minutes
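
The laundry timing can be checked with a small sketch. It assumes three stages of 30, 35, and 25 minutes, which is one reading of the figure labels rather than something the slide states outright; under that assumption the model reproduces the 2 hours 40 minutes total for three sets. The function name is hypothetical.

```python
def pipelined_finish_time(stage_minutes, loads):
    """Finish time of the last load when every stage handles one load at a time."""
    done = [0] * len(stage_minutes)  # time at which each stage last became free
    for _ in range(loads):
        prev = 0  # time this load left the previous stage
        for s, minutes in enumerate(stage_minutes):
            start = max(prev, done[s])  # wait for the load and for the stage
            done[s] = start + minutes
            prev = done[s]
    return done[-1]


stages = [30, 35, 25]  # assumed stage times in minutes
print(pipelined_finish_time(stages, 3))  # pipelined, three sets
print(3 * sum(stages))                   # one set at a time
```

In this model the slowest stage (35 minutes) sets the steady-state rate, which is where the "~35 min per load" figure comes from.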

Introducing Problems
• Hazards prevent next instruction from
executing during its designated clock cycle
– Structural hazards: HW cannot support this
combination of instructions (single person to
dry and iron clothes simultaneously)
– Data hazards: Instruction depends on result of
prior instruction still in the pipeline (missing
sock – needs both before putting them away)
– Control hazards: Caused by delay between the
fetching of instructions and decisions about
changes in control flow (branches and jumps)

Data Hazards
• Read After Write (RAW)
– Instr2 tries to read operand before Instr1 writes it
– Called a “dependence” in compiler terms
• Write After Read (WAR)
– Instr2 writes operand before Instr1 reads it
– Called an “anti-dependence” in compiler terms
• Write After Write (WAW)
– Instr2 writes operand before Instr1 writes it
– Called an “output dependence” in compiler terms
• WAR and WAW hazards show up in more complex pipelines
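
The three cases can be told apart by comparing register names. The sketch below is a minimal illustration, assuming each instruction is described as a (destination, sources) pair; it reports which dependences exist between two instructions, not whether a particular pipeline would actually stall on them. The function name and tuple layout are hypothetical.

```python
def classify_hazards(instr1, instr2):
    """instr1 precedes instr2 in program order; each is (dest, sources)."""
    dest1, srcs1 = instr1
    dest2, srcs2 = instr2
    hazards = set()
    if dest1 is not None and dest1 in srcs2:
        hazards.add("RAW")  # instr2 reads what instr1 writes
    if dest2 is not None and dest2 in srcs1:
        hazards.add("WAR")  # instr2 writes what instr1 reads
    if dest1 is not None and dest1 == dest2:
        hazards.add("WAW")  # both write the same register
    return hazards


# add r1,r2,r3 followed by sub r4,r1,r5: sub reads r1 written by add
print(classify_hazards(("r1", ("r2", "r3")), ("r4", ("r1", "r5"))))
```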

Branch Hazard (Control)
10: beq r1,r3,36      Ifetch Reg    ALU    DMem   Reg
14: and r2,r3,r5             Ifetch Reg    ALU    DMem   Reg
18: or r6,r1,r7                     Ifetch Reg    ALU    DMem   Reg
22: add r8,r1,r9                           Ifetch Reg    ALU    DMem   Reg
36: xor r10,r1,r11                                Ifetch Reg    ALU    DMem
3 instructions are in the pipeline before the new instruction can be fetched.

Branch Hazard Alternatives
• Stall until branch direction is clear
• Predict Branch Not Taken
– Execute successor instructions in sequence
– “Squash” instructions in pipeline if branch actually taken
– Advantage of late pipeline state update
– 47% DLX branches not taken on average
– PC+4 already calculated, so use it to get next instr
• Predict Branch Taken
– 53% DLX branches taken on average
– DLX still incurs 1 cycle branch penalty
– Other machines: branch target known before outcome

Branch Hazard Alternatives
• Delayed Branch
– Define branch to take place AFTER a following
instruction (Fill in Branch Delay Slot)
    branch instruction
    sequential successor 1
    sequential successor 2
    ........                  (branch delay of length n)
    sequential successor n
    branch target if taken
– 1 slot delay allows proper decision and branch target
address in 5 stage pipeline

Evaluating Branch Alternatives
$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$

Scheduling scheme    Branch penalty   CPI    Speedup v. unpipelined   Speedup v. stall
Stall pipeline       3                1.42   3.5                      1.0
Predict taken        1                1.14   4.4                      1.26
Predict not taken    1                1.09   4.5                      1.29
Delayed branch       0.5              1.07   4.6                      1.31

Conditional & Unconditional = 14%, 65% change PC
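
The CPI column can be reproduced from the speedup formula. The sketch below assumes a pipeline depth of 5 (inferred from the 5 stage pipeline mentioned earlier and from 5 / 1.42 ≈ 3.5) and assumes that under predict-not-taken only the 65% of branches that change the PC pay the penalty. The names are hypothetical.

```python
PIPELINE_DEPTH = 5      # assumed five-stage pipeline
BRANCH_FREQ = 0.14      # conditional and unconditional branches
TAKEN_FRACTION = 0.65   # fraction of branches that change the PC


def branch_cpi(penalty, fraction_paying=1.0):
    # CPI = 1 + branch frequency * fraction that pays * branch penalty
    return 1 + BRANCH_FREQ * fraction_paying * penalty


def speedup_vs_unpipelined(cpi):
    return PIPELINE_DEPTH / cpi


schemes = {
    "Stall pipeline": branch_cpi(3),
    "Predict taken": branch_cpi(1),
    "Predict not taken": branch_cpi(1, TAKEN_FRACTION),
    "Delayed branch": branch_cpi(0.5),
}

for name, cpi in schemes.items():
    print(f"{name:18s} CPI={cpi:.2f} speedup={speedup_vs_unpipelined(cpi):.2f}")
```

The computed CPI values match the table. The speedup columns can differ from the table in the last digit, since the table's figures appear to have been rounded or truncated at intermediate steps.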

Solution to Hazards
• Structural Hazards
– Delay the instruction that needs the busy hardware
– Increase Resources (e.g. dual port memory)
• Data Hazards
– Data Forwarding
– Software Scheduling
• Control Hazards
– Pipeline Stalling
– Predict and Flush
– Fill Delay Slots with Previous Instructions

Administrative
• Literature Survey
– One Q&A per Literature
– Q&A should show that you read the paper
• Changes in Schedule
– Need to be out of town on Oct 4th (Tuesday)
– Quiz 2 moved up 1 lecture
• Tool and VHDL help

Typical Pipeline
• Example: MIPS R4000
[Figure: after IF and ID, an instruction enters one of four parallel units, then MEM and WB]
– Integer unit: ex
– FP/int multiply: m1 m2 m3 m4 m5 m6 m7
– FP adder: a1 a2 a3 a4
– FP/int divider: Div (lat = 25, Init inv = 25)

Prediction
• Easy to fetch multiple (consecutive)
instructions per cycle
– Essentially speculating on sequential flow
• Jump: unconditional change of control flow
– Always taken
• Branch: conditional change of control flow
– Taken typically ~50% of the time in applications
• Backward: 30% of branches × 80% taken = ~24%
• Forward: 70% of branches × 40% taken = ~28%

Current Ideas
• Reactive
– Adapt Current Action based on the Past
– TCP windows
– URL completion, ...
• Proactive
– Anticipate Future Action based on the Past
– Branch prediction
– Long Cache block
– Tracing

Branch Prediction Schemes
• Static Branch Prediction
• Dynamic Branch Prediction
– 1-bit Branch-Prediction Buffer
– 2-bit Branch-Prediction Buffer
– Correlating Branch Prediction Buffer
– Tournament Branch Predictor
• Branch Target Buffer
• Integrated Instruction Fetch Units
• Return Address Predictors

Static Branch Prediction
• Execution profiling
– Very accurate if you actually take the time to profile
– Inconvenient
• Heuristics based on nesting and coding
– Simple heuristics are very inaccurate
• Programmer supplied hints...
– Inconvenient and potentially inaccurate

Dynamic Branch Prediction
• Performance = ƒ(accuracy, cost of mis-prediction)
• 1-bit Branch History Table
– Bitmap for Lower bits of PC address
– Says whether or not branch taken last time
– If Inst is Branch, predict and update the table
• Problem
– 1-bit BHT will cause 2 mis-predictions for each loop
• First time through the loop, it predicts exit instead of loop
• At the end of the loop, it predicts loop instead of exit
– Example: an average of 9 iterations before exit
• Only 80% accuracy even though the branch loops 90% of the time
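
The 80% figure can be reproduced with a minimal simulation of a single 1-bit entry, assuming a loop branch that is taken 9 times and then falls through once, executed repeatedly. The function name and the initial state are assumptions.

```python
def one_bit_accuracy(outcomes, initial=False):
    """Predict that each branch repeats its previous outcome."""
    last = initial
    correct = 0
    for taken in outcomes:
        correct += (last == taken)
        last = taken  # the single bit records the most recent outcome
    return correct / len(outcomes)


# Loop branch: taken 9 times, then not taken at loop exit, repeated 100 times.
loop = ([True] * 9 + [False]) * 100
print(one_bit_accuracy(loop))
```

Each pass through the loop mispredicts twice: once on re-entry (the bit still says "not taken" from the last exit) and once on exit.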

N-bit Dynamic Branch Prediction
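• N-bit scheme: change the prediction only after repeated
mispredictions (twice in a row for the 2-bit scheme shown)
[Figure: four-state diagram with two “Predict Taken” states and two “Predict Not Taken” states; transitions are labeled T (taken) and NT (not taken)]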
2-bit Scheme: Saturates the prediction up to 2 times
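
A 2-bit scheme is often implemented as a saturating counter. The sketch below assumes that variant (states 0 and 1 predict not taken, states 2 and 3 predict taken); the state diagram on the slide may use slightly different transitions. Class and function names are hypothetical.

```python
class TwoBitCounter:
    def __init__(self, state=0):
        self.state = state  # 0, 1: predict not taken; 2, 3: predict taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Count up on taken, down on not taken, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def accuracy(outcomes, counter):
    correct = 0
    for taken in outcomes:
        correct += (counter.predict() == taken)
        counter.update(taken)
    return correct / len(outcomes)


# Same loop pattern as a 9-iteration loop: taken 9 times, then exit.
loop = ([True] * 9 + [False]) * 100
print(accuracy(loop, TwoBitCounter(state=3)))
```

Starting from the strongly-taken state, only the loop exit is mispredicted, so the loop branch costs one misprediction per pass instead of two.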

Correlating Branches
• (2,2) predictor
– 2-bit global: indicates the behavior of the last two branches
– 2-bit local (2-bit Dynamic Branch Prediction)
• Branch History Table Prediction
– Global branch history is used to choose one of four history bitmap tables
– Predicts the branch behavior, then updates only the selected bitmap table
[Figure: the branch address (4 bits) and the 2-bit recent global branch history (01 = not taken then taken) together select the prediction]
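
A (2,2) predictor along these lines can be sketched as four tables of 2-bit counters, with the global history picking the table and the low branch-address bits picking the entry. The class name, the word-aligned indexing, and the initial counter value are assumptions made for illustration.

```python
class CorrelatingPredictor:
    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        self.history = 0  # last two outcomes, most recent in bit 0
        # One table of 2-bit counters per global-history value (00, 01, 10, 11).
        self.tables = [[1] * (1 << index_bits) for _ in range(4)]

    def _index(self, pc):
        return (pc >> 2) & self.mask  # assumes 4-byte, word-aligned instructions

    def predict(self, pc):
        return self.tables[self.history][self._index(pc)] >= 2

    def update(self, pc, taken):
        table = self.tables[self.history]  # only the selected table is updated
        i = self._index(pc)
        table[i] = min(3, table[i] + 1) if taken else max(0, table[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & 0b11


def count_correct(predictor, pc, outcomes):
    correct = 0
    for taken in outcomes:
        correct += (predictor.predict(pc) == taken)
        predictor.update(pc, taken)
    return correct


# A branch that alternates taken / not taken defeats a lone 2-bit counter,
# but the history bits let each table learn one half of the pattern.
print(count_correct(CorrelatingPredictor(), 0x40, [True, False] * 50))
```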

Accuracy of Different Schemes
[Figure: frequency of mispredictions (0% to 18%) on nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, and li for three predictors: 4096 entries 2-bit BHT, unlimited entries 2-bit BHT, and 1024 entries (2,2) BHT]

BHT Accuracy
• Mispredict because either:
– Wrong guess for the branch
– Wrong Index for the branch
• 4096 entry table
– programs vary from 1% misprediction (nasa7,
tomcatv) to 18% (eqntott), with spice at 9% and
gcc at 12%
• For SPEC92
– 4096 about as good as infinite table

Tournament Branch Predictors
• Correlating Predictor
– 2-bit predictor failed on important branches
– Better results by also using global information
• Tournament Predictors
– 1 Predictor based on global information
– 1 Predictor based on local information
– Use the predictor that guesses better
[Figure: the branch address (addr) feeds Predictor A and Predictor B]
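
One way to "use the predictor that guesses better" is a table of 2-bit chooser counters that moves toward whichever predictor was right when the two disagree. The sketch below assumes that design; the two component predictors are trivial stand-ins so the example stays self-contained, and all names are hypothetical.

```python
class Static:
    """Trivial stand-in predictor used only for this demonstration."""

    def __init__(self, guess):
        self.guess = guess

    def predict(self, pc):
        return self.guess

    def update(self, pc, taken):
        pass  # a real predictor would train here


class Tournament:
    def __init__(self, a, b, index_bits=12):
        self.a, self.b = a, b
        self.mask = (1 << index_bits) - 1
        self.chooser = [1] * (1 << index_bits)  # 0-1 favor A, 2-3 favor B

    def predict(self, pc):
        use_b = self.chooser[pc & self.mask] >= 2
        return self.b.predict(pc) if use_b else self.a.predict(pc)

    def update(self, pc, taken):
        i = pc & self.mask
        a_ok = self.a.predict(pc) == taken
        b_ok = self.b.predict(pc) == taken
        # Move the chooser only when exactly one predictor was right.
        if b_ok and not a_ok:
            self.chooser[i] = min(3, self.chooser[i] + 1)
        elif a_ok and not b_ok:
            self.chooser[i] = max(0, self.chooser[i] - 1)
        self.a.update(pc, taken)
        self.b.update(pc, taken)


t = Tournament(Static(False), Static(True))
for _ in range(3):
    t.update(0x40, True)   # branch at 0x40 is always taken
print(t.predict(0x40))
```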

Alpha 21264
• 4K 2-bit counters to choose from among a global predictor and a
local predictor
• Global predictor also has 4K entries and is indexed by the history of
the last 12 branches; each entry in the global predictor is a standard
2-bit predictor
– 12-bit pattern: ith bit 0 => ith prior branch not taken;
ith bit 1 => ith prior branch taken;
• Local predictor consists of a 2-level predictor:
– Top level: a local history table consisting of 1024 10-bit entries;
each 10-bit entry corresponds to the most recent 10 branch
outcomes for the entry. 10-bit history allows patterns of up to
10 branches to be discovered and predicted.
– Next level: the selected entry from the local history table is used to
index a table of 1K entries consisting of 3-bit saturating counters,
which provide the local prediction
• Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits!
(~180,000 transistors)
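
The storage total can be checked with a few lines of arithmetic. The variable names below are hypothetical; the sizes are the ones listed on the slide.

```python
K = 1024

chooser_bits = 4 * K * 2         # 4K 2-bit chooser counters
global_bits = 4 * K * 2          # 4K 2-bit counters, indexed by 12 history bits
local_history_bits = 1 * K * 10  # 1024 entries of 10-bit local history
local_counter_bits = 1 * K * 3   # 1K 3-bit saturating counters

total_bits = chooser_bits + global_bits + local_history_bits + local_counter_bits
print(total_bits, total_bits // K)

# Table sizes are consistent with the index widths.
print(2 ** 12 == 4 * K, 2 ** 10 == 1 * K)
```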

Branch Prediction Accuracy
Benchmark   Profile-based   2-bit dynamic   Tournament
tomcatv     99%             99%             100%
doduc       95%             84%             97%
fpppp       86%             82%             98%
li          88%             77%             98%
espresso    86%             82%             96%
gcc         88%             70%             94%

Accuracy versus Size
[Figure: conditional branch misprediction rate (0% to 10%) versus total predictor size (0 to 128 Kbits) for local, correlating, and tournament predictors]

Branch Target Buffer
• Branch Target Buffer (BTB): the address of the branch is used as an index to get
the prediction AND the branch target address (if taken)
– Note: must check for branch match now, since can’t use wrong
branch address
[Figure: the PC of the instruction in FETCH is compared (=?) against the Branch PC field of the BTB; each entry holds Branch PC, Predicted PC, and extra prediction state bits]
– Yes: instruction is a branch; use predicted PC as next PC
– No: branch not predicted, proceed normally (Next PC = PC+4)
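
The lookup can be sketched as a small direct-mapped table. This is a simplified model, assuming 4-byte instructions and a single taken/not-taken bit in place of the extra prediction state bits; the class and method names are hypothetical.

```python
class BranchTargetBuffer:
    def __init__(self, index_bits=6):
        self.mask = (1 << index_bits) - 1
        # Each entry is None or (branch_pc, predicted_pc, predict_taken).
        self.entries = [None] * (1 << index_bits)

    def _index(self, pc):
        return (pc >> 2) & self.mask

    def next_pc(self, pc):
        entry = self.entries[self._index(pc)]
        # The stored branch PC must match, or the target belongs to another branch.
        if entry is not None and entry[0] == pc and entry[2]:
            return entry[1]
        return pc + 4  # not a predicted branch: proceed sequentially

    def update(self, pc, target, taken):
        self.entries[self._index(pc)] = (pc, target, taken)


btb = BranchTargetBuffer()
print(hex(btb.next_pc(0x10)))   # miss: falls through to PC+4
btb.update(0x10, 0x24, True)
print(hex(btb.next_pc(0x10)))   # hit: predicted target
print(hex(btb.next_pc(0x110)))  # same index, different PC: tag check rejects it
```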

Predicated Execution
• Built in Hardware Support
– Bit for predicated instruction execution
– Both paths are in the code
– Execution based on the result of the condition
• No Branch Prediction is Required
– Instructions not selected are ignored
– Sort of inserting Nop

Zero Cycle Jump
• What really has to be done at runtime?
– Once an instruction has been detected as a jump or JAL, we might
recode it in the internal cache.
– Very limited form of dynamic compilation?
• Use of “Pre-decoded” instruction cache
– Called “branch folding” in the Bell-Labs CRISP processor.
– Original CRISP cache had two addresses and could thus fold a
complete branch into the previous instruction
– Notice that JAL introduces a structural hazard on write
Code                  Internal Cache state:
and r3,r1,r5          A: and r3,r1,r5     N   A+4
addi r2,r3,#4            addi r2,r3,#4    N   A+8
sub r4,r2,r1             sub r4,r2,r1     L   doit
jal doit                 ---              --  ---
subi r1,r1,#1            subi r1,r1,#1    N   A+20

Dynamic Branch Prediction Summary
• Prediction is becoming an important part of scalar execution
• Branch History Table
– 2 bits for loop accuracy
• Correlation
– Recently executed branches correlated with next branch.
– Either different branches
– Or different executions of same branches
• Tournament Predictor
– Devote more resources to competing predictors and pick between them
• Branch Target Buffer
– Branch address & prediction
• Predicated Execution
– No need for Prediction
– Hardware Support needed
