
ECE 338

Parallel Computer Architecture


Spring 2022

Branches
Nikos Bellas

Electrical and Computer Engineering Department


University of Thessaly

ECE338 Parallel Computer Architecture


1
Control Hazards

The 5-stage microarchitecture executes beq/bne instructions in the EX stage.

BUT: if the beq instruction is TAKEN, three instructions from the
wrong path have already entered the pipeline.



Stalls due to Mispredicted Branches
• We can delay the pipeline by three cycles each time we decode a
branch (delay till branch execution)
[Timing diagram, cycles 1–8: beq $1, $3, 28 proceeds through IM–Reg–EX–DM–Reg;
the three slots behind it (? ? ?) are bubbles, so the next instruction is not
fetched until the branch resolves.]

• Wasting three cycles for each branch is quite a lot



Performance Analysis
• NOT TAKEN (NT) → no penalty
• TAKEN (T) → 3-cycle penalty
• 20% of all instructions are branches
• 50% of branches are TAKEN
• Base CPI = 1
  – CPI = 1 + (0.20 × 0.50) × 3 = 1 + 0.1 × 3 = 1.3 (30% slowdown)

The 0.20 × 0.50 factor is the probability of being on the wrong path;
the 3 is the misprediction penalty.

Need to reduce both these terms
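The arithmetic above can be checked with a small sketch (the function name is ours, not from the slides):

```python
# CPI with a branch penalty: base CPI 1, 20% branches, 50% of them
# taken (mispredicted under predict-not-taken), 3-cycle penalty each.
def cpi_with_branch_penalty(base_cpi, branch_frac, mispredict_frac, penalty):
    return base_cpi + branch_frac * mispredict_frac * penalty

cpi = cpi_with_branch_penalty(1.0, 0.20, 0.50, 3)
print(cpi)  # 1.3 -> a 30% slowdown over the ideal CPI of 1
```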


The Branch Problem
• Control flow instructions (branches) are frequent
– 15-25% of all instructions

• Problem: the next fetch address after a control-flow instruction is
  not determined until N cycles later in a pipelined processor
  – N cycles: (minimum) branch resolution latency

• If we are fetching W instructions per cycle (i.e., if the pipeline
  is W wide)
  – A branch misprediction leads to N × W wasted instruction slots



Importance of the Branch Problem
• Assume N = 20 stages, W = 5 (5 wide fetch)
– Assume: 1 out of 5 instructions is a branch (each 5-instruction block ends
  with a branch)

• How long does it take to fetch 500 useful instructions (i.e. from the
correct path)?
– 100% accuracy
• 100 cycles (all instructions fetched on the correct path)
• No wasted work; IPC = 500/100
– 99% accuracy
• 100 (correct path) + 20 * 1 (wrong path) = 120 cycles
• 20% extra instructions fetched; IPC = 500/120
– 90% accuracy
• 100 (correct path) + 20 * 10 (wrong path) = 300 cycles
• 200% extra instructions fetched; IPC = 500/300
– 60% accuracy
• 100 (correct path) + 20 * 40 (wrong path) = 900 cycles
• 800% extra instructions fetched; IPC = 500/900
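The cycle counts above follow from one formula; a short sketch (names are ours) reproduces them:

```python
# Cycles to fetch 500 useful instructions with a 5-wide fetch,
# 20-stage resolution latency, and one branch per 5-instruction block.
def fetch_cycles(useful_insts=500, width=5, resolve_latency=20, accuracy=1.0):
    blocks = useful_insts // width          # one branch per block, 1 cycle/block
    wrong = round(blocks * (1 - accuracy))  # mispredicted branches
    return blocks + resolve_latency * wrong # correct-path cycles + penalties

for acc in (1.0, 0.99, 0.90, 0.60):
    c = fetch_cycles(accuracy=acc)
    print(acc, c, 500 / c)  # accuracy, total cycles, effective fetch IPC
```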
Branches in real programs

if ((*ptr1==0) && ((*ptr2=*ptr3)==1) && (*ptr4>2))
    val5++;
else
    val6++;

        lw   $t0, ptr1($0)
B1:     bnez $t0, L_val6
        lw   $t1, ptr3($0)
        sw   $t1, ptr2($0)
B2:     bne  $t1, $t8, L_val6      # assuming $t8 holds 1
        lw   $t1, ptr4($0)
B3:     ble  $t1, $t9, L_val6      # assuming $t9 holds 2
        lw   $s0, val5($0)
        addi $s0, $s0, 1
        sw   $s0, val5($0)
        j    Exit
L_val6:
        lw   $s0, val6($0)
        addi $s0, $s0, 1
        sw   $s0, val6($0)
Exit:

B1, B2, B3 mark the three conditional branches of the compound condition.
Branch prediction
• Pentium 4 (2003) has 20 pipe stages
• Branches are executed in stage 15
• There can be tens of instructions somewhere in the
pipeline at any time
• Multiple branch instructions at the same time
• Need to predict correctly all of them
– Otherwise, pipeline flush and large performance
penalties

1. Reduce Misprediction Penalty

• Branch execution (beq rs, rt, Label) requires two operations:
  1. Decide if the branch is TAKEN or NOT TAKEN
  2. Compute the target address: PC <= Label or PC <= PC+4
• Both need to happen earlier to reduce the misprediction penalty
• Transfer (2) from the EX to the ID stage (this is easy)
• (1) needs extra hardware in the ID stage to check whether the two registers
  rs and rt are equal
• This can be done with 32 XOR gates, but it may increase the clock period



New pipeline with reduced penalty
What if we have the following instructions?
    sub $2, $1, $3
    beq $4, $2, L
More complex forwarding and an extra stall cycle are needed
(beq reads $2 right after sub produces it).

[Figure: the 5-stage datapath with the branch-target adder and a register
comparator (= 0) moved into the ID stage, so branches resolve in ID.]


2. Reduce Misprediction Penalty
(Delayed Branches)
• Idea: fetch N useful instructions (from the fall-through path) before
  the branch is executed
  – Initial architecture: N=3
  – New architecture: N=1
• The N instructions after the branch are executed as usual
  – No pipeline flush
• This technique is called Delayed Branch
• N should be small and is architecture dependent



Delayed Branch
The compiler must find N useful instructions and place them after the branch.
Assume N=1 in the following examples.

a. From before:
       add $s1, $s2, $s3
       if $s2 = 0 then
           <delay slot>
   Becomes:
       if $s2 = 0 then
           add $s1, $s2, $s3
   (Easy)

b. From target:
       sub $t4, $t5, $t6
       ...
       add $s1, $s2, $s3
       if $s1 = 0 then
           <delay slot>
   Becomes:
       add $s1, $s2, $s3
       if $s1 = 0 then
           sub $t4, $t5, $t6
   (May need code restructuring by the compiler)

c. From fall through:
       add $s1, $s2, $s3
       if $s1 = 0 then
           <delay slot>
       sub $t4, $t5, $t6
   Becomes:
       add $s1, $s2, $s3
       if $s1 = 0 then
           sub $t4, $t5, $t6
   (May need code restructuring by the compiler)

This is not always possible; if it is not, just insert nops.
Typically N should be small (e.g. N=1).
3. Reduce Probability of Wrong Path
Branch prediction
• Try to predict the next instruction to fetch (predict the outcome of
the branch)
  – Easiest approach: always predict NOT TAKEN
  – Set PC <= PC+4 and fetch the next instruction from memory
• If the prediction is correct, there is no branch penalty

[Timing diagram, cycles 1–7: beq $2, $3, Label is fetched in cycle 1; the
next two sequential instructions are fetched in cycles 2 and 3 with no
bubbles.]
Branch prediction
• But if we are wrong (the branch is actually TAKEN), we have fetched an
  instruction I from the wrong path.
  1. We need to flush instruction I from the pipeline by placing a nop in the
     IF/ID register
  2. We need to change the PC so that we immediately start fetching from Label
• We pay an N-cycle penalty on each misprediction

[Timing diagram, cycles 1–8: beq $2, $3, Label is fetched in cycle 1;
"next instruction 1" is fetched in cycle 2 and flushed; fetch restarts
from Label in cycle 3.]


Static branch prediction
• Each branch is predicted statically, i.e. at compile time. This prediction
  (T/NT) never changes at execution time.
• Methods for static branch prediction:
  1. All branches are predicted NOT TAKEN
  2. Backward taken, forward not taken (BTFN):
     – All forward branches are predicted NOT TAKEN
     – All backward branches are predicted TAKEN
  3. Profile-driven:
     – Use profiling data from previous runs to learn the behavior of each
       branch, and assign predictions accordingly.
  4. User-specified branch prediction:
     – if (likely(x)) { ... }
     – if (unlikely(error)) { ... }



Dynamic branch prediction
• In modern superscalar and superpipelined CPUs with many stages (~20),
  the misprediction penalty is higher
  – A high branch misprediction rate can severely slow down such a CPU
• Dynamic branch prediction attempts to predict the branch direction
  from past program behavior



Dynamic branch prediction
• Rule 1: The outcome of a branch (T / NT) correlates with previous
  executions of that same branch
  – Local branch correlation



Dynamic branch prediction
1-bit prediction scheme
• The simplest dynamic prediction algorithm: predict that the
branch will do the same as its last execution

outer: …

inner: …

beq …, …, inner

beq …, …, outer

• Simple but somewhat unstable (changes its prediction “too easily”)
• The beq of the inner loop will be mispredicted twice for each
  execution of the outer loop.
1-bit prediction scheme

[FSM diagram: two states, “predict taken” and “predict not taken”; a taken
branch moves to / stays in “predict taken”, a not-taken branch moves to /
stays in “predict not taken”.]
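The double misprediction on the nested loop can be seen in a small sketch (function name and loop sizes are illustrative, not from the slides):

```python
# A 1-bit predictor on the inner-loop branch of a nested loop: the
# branch is taken (T) except on loop exit (NT), and the 1-bit scheme
# flips its prediction on every miss, so it mispredicts on the exit
# AND again on re-entry to the inner loop.
def mispredictions_1bit(outcomes, initial="T"):
    pred, misses = initial, 0
    for actual in outcomes:
        if pred != actual:
            misses += 1
        pred = actual  # 1-bit: always remember only the last outcome
    return misses

# Inner loop runs 4 times per outer iteration; 3 outer iterations.
trace = (["T", "T", "T", "NT"]) * 3
print(mispredictions_1bit(trace))  # 5: roughly two misses per outer iteration
```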



2-bit prediction scheme
• A 4-state FSM is used to implement the 2-bit prediction scheme
• A prediction will have to miss twice before it is changed.
• It makes predictions more “stable” by adding hysteresis
• 2-bit (or sometimes 3-bit) counters are the basis of the more advanced
  predictors used in modern CPUs
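The hysteresis effect can be shown with a sketch of the 4-state FSM (same illustrative nested-loop trace as for the 1-bit scheme; names are ours):

```python
# 2-bit saturating counter: 0-1 predict NOT TAKEN, 2-3 predict TAKEN.
# A prediction must miss twice in a row before it flips.
def mispredictions_2bit(outcomes, counter=3):
    misses = 0
    for actual in outcomes:
        pred = "T" if counter >= 2 else "NT"
        if pred != actual:
            misses += 1
        counter = min(counter + 1, 3) if actual == "T" else max(counter - 1, 0)
    return misses

trace = (["T", "T", "T", "NT"]) * 3   # inner-loop branch, 3 outer iterations
print(mispredictions_2bit(trace))     # 3: only the loop-exit NT mispredicts
```

Compared with the 1-bit scheme's 5 misses on the same trace, the hysteresis removes the re-entry misprediction.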



Dynamic branch prediction
• Rule 1: The outcome of a branch (T / NT) correlates with previous
  executions of that same branch
  – Local branch correlation

• Rule 2: The outcome of a branch (T / NT) also correlates with previous
  executions of neighboring branches.
  – Global branch correlation



Dynamic branch prediction
Recent executions of one branch correlate with the execution of a
subsequent branch:
  – if cond1 is False, then the second conditional is False
  – if branch Y is True, then branch X is False


Dynamic branch prediction
if (aa==2) ;; B1
aa=0;
if (bb==2) ;; B2
bb=0;
if (aa!=bb) { ;; B3
….
}

if (B1 == True) && (B2 == True) then B3 = False

In branch outcomes: NT, NT → T
Correlating branch prediction
• Use both local and global branch predictions to improve the prediction rate
  – Correlating branch predictors
• Make a prediction based on the outcome of the branch the last time the
  same global branch history was encountered
• Use a global-history branch predictor
  – Use an m-bit Global History Register (GHR) to capture the history of
    the m most recent branches
• Uses two levels of history (the GHR, plus the history at that GHR value
  for the specific branch)

[Figure: a control-flow graph of branches B1–B8; the path taken through
them is encoded as a bit pattern (e.g. 111, 101) in the GHR.]
(m, n) branch predictors
• m = number of bits of the GHR (global history)
• n = number of bits of the per-branch predictor (an n-bit saturating counter)
• Size of the predictor in bits =
  2^m × n × (number of prediction entries selected by the branch address)
• (0,2): a predictor with no global history and 2-bit counters

[Figure: the branch address (k bits) selects a set of n-bit saturating
counters for local prediction; the m-bit GHR of recent global branch
history selects which of the 2^m counters provides the prediction.]
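The size formula can be checked numerically (the function name and the 1K-entry example are ours):

```python
# Size of an (m, n) predictor in bits, per the formula above:
# 2**m counters of n bits for each of the 2**k branch-address-selected entries.
def predictor_bits(m, n, k):
    return (2 ** m) * n * (2 ** k)

# A (2, 2) predictor with 1K entries (k = 10 branch-address bits):
print(predictor_bits(2, 2, 10))  # 8192 bits = 1 KiB
# The (0, 2) case degenerates to a plain 2-bit-counter table:
print(predictor_bits(0, 2, 10))  # 2048 bits
```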
Gshare predictors
• The gshare predictor is a type of (m, n) predictor in which the GHR is
  combined with the branch PC (using a hash function)
• Pattern History Table (PHT)
  – Contains the n-bit saturating counters used for prediction
  – Indexed by a hash function, which can be as simple as an XOR
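A minimal gshare sketch, with illustrative sizes (an 8-bit GHR and 256-entry PHT are our assumptions, not any real CPU's organization):

```python
# gshare: PHT index = (branch PC) XOR (global history register);
# each PHT entry is a 2-bit saturating counter.
class Gshare:
    def __init__(self, m=8):
        self.m = m
        self.ghr = 0                      # m-bit global history register
        self.pht = [1] * (1 << m)         # 2-bit counters, weakly not-taken

    def _index(self, pc):
        return (pc ^ self.ghr) & ((1 << self.m) - 1)  # the XOR hash

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2         # True = predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        self.pht[i] = min(self.pht[i] + 1, 3) if taken else max(self.pht[i] - 1, 0)
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.m) - 1)

bp = Gshare()
for _ in range(10):                       # train on an always-taken branch
    bp.update(0x4000, True)
print(bp.predict(0x4000))  # True once the counter saturates
```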



Can We Do Even Better?
• The predictability of branches varies

• Some branches are more predictable using local history
• Some using global history
• For others, a simple two-bit counter is enough
• Yet for others, a single bit is enough

• Observation: there is heterogeneity in the predictability behavior of
  branches
  – No one-size-fits-all branch prediction algorithm works for all branches

• Idea: exploit that heterogeneity by designing heterogeneous branch
  predictors
Tournament Branch Predictors
• Idea: Use more than one type of predictor (i.e., multiple
algorithms) and select dynamically the “best” prediction
• Tournament predictors combine Local and Global history
adaptively
• Advantages:
+ Better accuracy: different predictors are better for different branches
+ Reduced warmup time (faster-warmup predictor used until the slower-warmup
predictor warms up)
• Disadvantages:
-- Need “meta-predictor” or “selector”
-- Longer access latency
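The selector idea can be sketched as a 2-bit meta-counter per branch; the predictor internals are abstracted away here, and all names are illustrative:

```python
# "Meta-predictor" sketch: a 2-bit selector chooses between two component
# predictions and is trained only when the components disagree, moving
# toward whichever component was right.
class Tournament:
    def __init__(self):
        self.select = 2   # 0-1 favor predictor A, 2-3 favor predictor B

    def choose(self, pred_a, pred_b):
        return pred_b if self.select >= 2 else pred_a

    def update(self, pred_a, pred_b, actual):
        if pred_a != pred_b:              # agreement teaches the selector nothing
            if pred_b == actual:
                self.select = min(self.select + 1, 3)
            else:
                self.select = max(self.select - 1, 0)

t = Tournament()
# Predictor A is right twice in a row while B is wrong: selector swings to A.
t.update(True, False, True)
t.update(True, False, True)
print(t.choose(True, False))  # True: A's prediction is now selected
```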

[Figure: the Alpha 21264 tournament predictor]


Prediction Using Multiple History Lengths
• Observation: different branches require different history lengths for
  good prediction accuracy

• Idea: have multiple PHTs, indexed by GHR/branch-PC combinations with
  different history lengths, and intelligently allocate PHT entries to
  different branches
• The PHTs are tagged, like caches



TAGE: tagged tables, with the prediction taken from the longest-history
matching entry

[Figure: a tagless base predictor plus several tagged tables, indexed by the
PC hashed with increasingly long slices of global history (h[0:L1], h[0:L2],
h[0:L3]); each entry holds a prediction counter (ctr), a partial tag, and a
useful bit (u). Tag matches (=?) select the prediction from the table with
the longest matching history; on a miss everywhere, the tagless base
predictor provides the prediction.]


TAGE: Which Table to Use?
• General case:
– Longest history-matching component provides the prediction
• Special case:
– Many mispredictions on newly allocated entries: weak Cntr

Cntr: 3-bit prediction counter
U: 1- or 2-bit “useful” counter (was the entry recently useful?)
Tag: partial tag matched against the branch PC

Entry layout:  U | Tag | Cntr

TAGE is the basis of state-of-the-art branch predictors (winners of the
branch prediction championships)
Hybrid predictor performance

[Figure: comparison of misprediction rates achieved by hybrid predictors.]


Branch Confidence Estimation
• Idea: Estimate if the prediction is likely to be
correct
– i.e., estimate how “confident” you are in the
prediction

• Why?
– Could be very useful in deciding how to speculate:
• What predictor/PHT to choose/use
• Whether to keep fetching on this path
• Whether to switch to some other way of handling the
branch, e.g. dual-path execution (eager execution) or
dynamic predication
How to Estimate Confidence
• An example estimator:
– Keep a record of correct/incorrect outcomes for the past
N instances of the “branch”
– Based on the correct/incorrect patterns, guess if the
current prediction will likely be correct/incorrect
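The estimator described above can be sketched directly; N=4 and the threshold are illustrative choices, not from any specific design:

```python
# Confidence estimation: remember whether the last N predictions of a
# branch were correct, and report "high confidence" only when most were.
from collections import deque

class ConfidenceEstimator:
    def __init__(self, n=4, threshold=3):
        self.history = deque(maxlen=n)    # 1 = that prediction was correct
        self.threshold = threshold

    def record(self, was_correct):
        self.history.append(1 if was_correct else 0)

    def high_confidence(self):
        return sum(self.history) >= self.threshold

ce = ConfidenceEstimator()
for ok in (True, True, False, True, True):
    ce.record(ok)
print(ce.high_confidence())  # True: 3 of the last 4 predictions were correct
```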



What to Do With Confidence Estimation?
• An example application: Pipeline Gating
– Throttle instruction fetching/execution when too many
branches are predicted with low confidence



Branch Target Buffer (BTB)
• Branches require three things:
1. Predict the branch direction (we covered that already)
2. Predict the new target address
3. Identify that the instruction is a branch BEFORE it is decoded (IF stage)
• Branch target buffer (BTB) covers 2 and 3
– Cache memory that stores the predicted address of the next
instruction after a branch
– Contains the predicted-taken branches (since in an untaken branch
we can just fetch the next sequential address)
– It is accessed at the same time as the I-Cache using the PC (IF
stage)
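A BTB lookup in the fetch stage can be sketched as a small direct-mapped table; the 16-entry organization and index/tag split are our illustrative assumptions:

```python
# BTB sketch: indexed by low PC bits, holding (tag, predicted target)
# only for predicted-taken branches; a miss just falls through to PC+4.
class BTB:
    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries     # each entry: (tag, target) or None

    def _slot(self, pc):
        word = pc >> 2                    # drop byte offset within a word
        return word % self.entries, word  # (index, tag)

    def lookup(self, pc):
        idx, tag = self._slot(pc)
        entry = self.table[idx]
        if entry and entry[0] == tag:
            return entry[1]               # hit: fetch from the predicted target
        return pc + 4                     # miss: fetch sequentially

    def install(self, pc, target):        # on a resolved taken branch
        idx, tag = self._slot(pc)
        self.table[idx] = (tag, target)

btb = BTB()
btb.install(0x1010A, 0x1010A000)          # an entry like the slide's example
print(hex(btb.lookup(0x1010A)))   # 0x1010a000 (BTB hit)
print(hex(btb.lookup(0x20000)))   # 0x20004 (miss: fall through)
```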



Branch Target Buffer (BTB)
• If the PC of the fetched
instruction matches an
address in the BTB, the
predicted PC becomes
the new PC of the CPU
• Instruction fetching
begins immediately at
that address



Branch Target Buffer (BTB)
• Two sources of penalties
• Branch predicted Taken
(BTB Hit), but ends up Not
Taken
• Branch predicted Not
Taken (BTB Miss), but ends
up Taken
• In each case, we need to flush
the pipeline
• Flush all non-committed
instructions in the ROB
after the mispredicted
branch
• Performance penalty can
be severe
Branch Target Buffer (BTB) in the 5-stage pipeline

[Figure: the 5-stage datapath with a Branch Target Buffer accessed in the
IF stage, in parallel with the instruction memory. Example BTB contents:

  Tag       V   Target Addr.
  0x1010A   1   0x1010A000
  0x10200   1   0x10110B00
]
Integration of Branch Prediction and the BTB

[Figure: the Program Counter (address of the current instruction) indexes
both the branch predictor (taken?) and the BTB (hit? + target address). If
the branch is predicted taken and the BTB hits, the next fetch address is
the BTB target address; otherwise it is PC + instruction size.]


4. Speculation
• If branch-prediction confidence is low, execute both paths (T and NT)
  – dual-path (eager) execution
  – This is widely used in high-performance CPUs
  – Higher hardware cost / higher complexity, potentially higher energy
  – Need to be able to limit speculation at run time

[Figure: on a low-confidence prediction at block A, both Path 1 and Path 2
(blocks B and C) are fetched and executed; the paths reconverge at blocks
D, E, F.]


Speculation
• How much to speculate
– Mis-speculation degrades performance and power
relative to no speculation
• May cause additional misses (cache, TLB)
– Need to prevent speculative code from causing
higher cost misses (e.g. L2)
• Speculating through multiple branches
– Complicates speculation recovery
• Speculation and energy efficiency
– Note: speculation is only energy efficient when it
significantly improves performance



How much to speculate

[Figure: for the SPEC2000 benchmark suite, the fraction of instructions
executed as a result of mis-speculation is much higher for the integer
benchmarks than for the FP benchmarks.]
5. Predication
• Completely get rid of branch instructions
• Convert all control hazards to data hazards → predication
• Many modern ISAs support predication
  – ARM
  – x86 (partially)

Predication Example

if (a!=c) {
    b = 0;
}
else {
    b = 1;
}

With branches:                With predicates:
A:  beq a, c, TARGET          A:  p1 = (a!=c)
B:  move b, 0                 B:  move b, 0, (p1)
    j END                     C:  move b, 1, (!p1)
C: TARGET:
    move b, 1
D:  ...                       D:  ...

Each instruction gets an extra predicate operand:
  If p1 is TRUE, the instruction executes
  If p1 is FALSE, the instruction becomes a NOP



ARM processor is fully predicated
• Instructions are predicated using the appropriate postfix:
  – add   r0, r1, r2   ; regular ADD: r0 = r1 + r2
  – addeq r0, r1, r2   ; r0 = r1 + r2 only if the ZERO condition flag is set
• Conditional move (CMOV) is the best-known x86 predicated instruction:
    if (A==0) {S=T}
  becomes
    p1 = (R1 == 0)
    CMOV R2, R3, <p1>



Takeaways
• Branches are the second-largest culprit for CPU performance loss
  (memory instructions are the first)
  – Wrong branch predictions carry large penalties
  – Many in-flight instructions have to be flushed from the pipeline
• All modern CPUs use dynamic branch prediction and BTBs
  – Accurate branch prediction (>95%) is very important for achieving
    high performance and instruction throughput.
  – Branch prediction heuristics are rather complex
