
ECE 338

Parallel Computer Architecture


Spring 2022

Branches
Nikos Bellas

Electrical and Computer Engineering Department


University of Thessaly

ECE338 Parallel Computer Architecture


1
Control Hazards

The 5-stage microarchitecture executes beq/bne instructions in the EX stage.

BUT: if the beq instruction is TAKEN, three instructions from the
wrong path have already entered the pipeline.



Stalls due to Mispredicted Branches
• We can delay the pipeline by three cycles each time we decode a
branch (delay till branch execution)
[Timing diagram, cycles 1–8: beq $1, $3, 28 proceeds through IM–Reg–EX–DM–Reg;
the three slots behind it (? ? ?) are bubbles, so the next instruction is not
fetched until the branch resolves.]

• Wasting three cycles for each branch is quite a lot



Performance Analysis
• NOT TAKEN (NT) → no penalty
• TAKEN (T) → 3-cycle penalty
• 20% of all instructions are branches
• 50% of branches are TAKEN
• Base CPI = 1
  – CPI = 1 + (0.20 × 0.50) × 3 = 1 + 0.1 × 3 = 1.3 (30% slowdown)

The 0.20 × 0.50 factor is the probability of being on the wrong path;
the 3 is the misprediction penalty.

Need to reduce both these terms
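The arithmetic above can be checked with a small sketch (the function name is ours, not from the slides):

```python
# CPI with a branch penalty: base CPI 1, 20% branches, 50% of them
# taken (mispredicted under predict-not-taken), 3-cycle penalty each.
def cpi_with_branch_penalty(base_cpi, branch_frac, mispredict_frac, penalty):
    return base_cpi + branch_frac * mispredict_frac * penalty

cpi = cpi_with_branch_penalty(1.0, 0.20, 0.50, 3)
print(cpi)  # 1.3 -> a 30% slowdown over the ideal CPI of 1
```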


The Branch Problem
• Control flow instructions (branches) are frequent
– 15-25% of all instructions

• Problem: the next fetch address after a control-flow instruction is
  not determined until N cycles later in a pipelined processor
  – N cycles: (minimum) branch resolution latency

• If we are fetching W instructions per cycle (i.e., if the pipeline
  is W wide)
  – A branch misprediction leads to N × W wasted instruction slots



Importance of the Branch Problem
• Assume N = 20 stages, W = 5 (5 wide fetch)
– Assume: 1 out of 5 instructions is a branch (each 5-instruction block ends
  with a branch)

• How long does it take to fetch 500 useful instructions (i.e. from the
correct path)?
– 100% accuracy
• 100 cycles (all instructions fetched on the correct path)
• No wasted work; IPC = 500/100
– 99% accuracy
• 100 (correct path) + 20 * 1 (wrong path) = 120 cycles
• 20% extra instructions fetched; IPC = 500/120
– 90% accuracy
• 100 (correct path) + 20 * 10 (wrong path) = 300 cycles
• 200% extra instructions fetched; IPC = 500/300
– 60% accuracy
• 100 (correct path) + 20 * 40 (wrong path) = 900 cycles
• 800% extra instructions fetched; IPC = 500/900
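The cycle counts above follow from one formula; a short sketch (names are ours) reproduces them:

```python
# Cycles to fetch 500 useful instructions with a 5-wide fetch,
# 20-stage resolution latency, and one branch per 5-instruction block.
def fetch_cycles(useful_insts=500, width=5, resolve_latency=20, accuracy=1.0):
    blocks = useful_insts // width          # one branch per block, 1 cycle/block
    wrong = round(blocks * (1 - accuracy))  # mispredicted branches
    return blocks + resolve_latency * wrong # correct-path cycles + penalties

for acc in (1.0, 0.99, 0.90, 0.60):
    c = fetch_cycles(accuracy=acc)
    print(acc, c, 500 / c)  # accuracy, total cycles, effective fetch IPC
```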
Branches in real programs

if ((*ptr1==0) && ((*ptr2=*ptr3)==1) && (*ptr4>2))
    val5++;
else
    val6++;

        lw   $t0, ptr1($0)
B1:     bnez $t0, L_val6
        lw   $t1, ptr3($0)
        sw   $t1, ptr2($0)
B2:     bne  $t1, $t8, L_val6      # assuming $t8 holds 1
        lw   $t1, ptr4($0)
B3:     ble  $t1, $t9, L_val6      # assuming $t9 holds 2
        lw   $s0, val5($0)
        addi $s0, $s0, 1
        sw   $s0, val5($0)
        j    Exit
L_val6:
        lw   $s0, val6($0)
        addi $s0, $s0, 1
        sw   $s0, val6($0)
Exit:

B1, B2, B3 mark the three conditional branches of the compound condition.
Branch prediction
• Pentium 4 (2003) has 20 pipe stages
• Branches are executed in stage 15
• There can be tens of instructions somewhere in the
pipeline at any time
• Multiple branch instructions at the same time
• Need to predict correctly all of them
– Otherwise, pipeline flush and large performance
penalties

1. Reduce Misprediction Penalty

• Branch execution (beq rs, rt, Label) requires two operations:
  1. Decide if the branch is TAKEN or NOT TAKEN
  2. Compute the target address: PC <= Label or PC <= PC+4
• Both need to happen earlier to reduce the misprediction penalty
• Transfer (2) from the EX to the ID stage (this is easy)
• (1) needs extra hardware in the ID stage to check whether the two registers
  rs and rt are equal
• This can be done with 32 XOR gates, but it may increase the clock period



New pipeline with reduced penalty
What if we have the following instructions?
    sub $2, $1, $3
    beq $4, $2, L
More complex forwarding and an extra stall cycle are needed
(beq reads $2 right after sub produces it).

[Figure: the 5-stage datapath with the branch-target adder and a register
comparator (= 0) moved into the ID stage, so branches resolve in ID.]


2. Reduce Misprediction Penalty
(Delayed Branches)
• Idea: fetch N useful instructions (from the fall-through path) before
  the branch is executed
  – Initial architecture: N=3
  – New architecture: N=1
• The N instructions after the branch are executed as usual
  – No pipeline flush
• This technique is called Delayed Branch
• N should be small and is architecture dependent



Delayed Branch
The compiler must find N useful instructions and place them after the branch.
Assume N=1 in the following examples.

a. From before:
       add $s1, $s2, $s3
       if $s2 = 0 then
           <delay slot>
   Becomes:
       if $s2 = 0 then
           add $s1, $s2, $s3
   (Easy)

b. From target:
       sub $t4, $t5, $t6
       ...
       add $s1, $s2, $s3
       if $s1 = 0 then
           <delay slot>
   Becomes:
       add $s1, $s2, $s3
       if $s1 = 0 then
           sub $t4, $t5, $t6
   (May need code restructuring by the compiler)

c. From fall through:
       add $s1, $s2, $s3
       if $s1 = 0 then
           <delay slot>
       sub $t4, $t5, $t6
   Becomes:
       add $s1, $s2, $s3
       if $s1 = 0 then
           sub $t4, $t5, $t6
   (May need code restructuring by the compiler)

This is not always possible; if it is not, just insert nops.
Typically N should be small (e.g. N=1).
3. Reduce Probability of Wrong Path
Branch prediction
• Try to predict the next instruction to fetch (predict the outcome of
the branch)
  – Easiest approach: always predict NOT TAKEN
  – Set PC <= PC+4 and fetch the next instruction from memory
• If the prediction is correct, there is no branch penalty

[Timing diagram, cycles 1–7: beq $2, $3, Label is fetched in cycle 1; the
next two sequential instructions are fetched in cycles 2 and 3 with no
bubbles.]
Branch prediction
• But if we are wrong (the branch is actually TAKEN), we have fetched an
  instruction I from the wrong path.
  1. We need to flush instruction I from the pipeline by placing a nop in the
     IF/ID register
  2. We need to change the PC so that we immediately start fetching from Label
• We pay an N-cycle penalty on each misprediction

[Timing diagram, cycles 1–8: beq $2, $3, Label is fetched in cycle 1;
"next instruction 1" is fetched in cycle 2 and flushed; fetch restarts
from Label in cycle 3.]


Static branch prediction
• Each branch is predicted statically, i.e. at compile time. This prediction
  (T/NT) never changes at execution time.
• Methods for static branch prediction:
  1. All branches are predicted NOT TAKEN
  2. Backward taken, forward not taken (BTFN):
     – All forward branches are predicted NOT TAKEN
     – All backward branches are predicted TAKEN
  3. Profile-driven:
     – Use profiling data from previous runs to learn the behavior of each
       branch, and assign predictions accordingly.
  4. User-specified branch prediction:
     – if (likely(x)) { ... }
     – if (unlikely(error)) { ... }



Dynamic branch prediction
• In modern superscalar and superpipelined CPUs with many stages (~20),
  the misprediction penalty is higher
  – A high branch misprediction rate can severely slow down such a CPU
• Dynamic branch prediction attempts to predict the branch direction
  from past program behavior



Dynamic branch prediction
• Rule 1: The outcome of a branch (T / NT) correlates with previous
  executions of that same branch
  – Local branch correlation



Dynamic branch prediction
1-bit prediction scheme
• The simplest dynamic prediction algorithm: predict that the
branch will do the same as its last execution

outer: …

inner: …

beq …, …, inner

beq …, …, outer

• Simple but somewhat unstable (changes its prediction “too easily”)
• The beq of the inner loop will be mispredicted twice for each
  execution of the outer loop.
1-bit prediction scheme

[FSM diagram: two states, “predict taken” and “predict not taken”; a taken
branch moves to / stays in “predict taken”, a not-taken branch moves to /
stays in “predict not taken”.]
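The double misprediction on the nested loop can be seen in a small sketch (function name and loop sizes are illustrative, not from the slides):

```python
# A 1-bit predictor on the inner-loop branch of a nested loop: the
# branch is taken (T) except on loop exit (NT), and the 1-bit scheme
# flips its prediction on every miss, so it mispredicts on the exit
# AND again on re-entry to the inner loop.
def mispredictions_1bit(outcomes, initial="T"):
    pred, misses = initial, 0
    for actual in outcomes:
        if pred != actual:
            misses += 1
        pred = actual  # 1-bit: always remember only the last outcome
    return misses

# Inner loop runs 4 times per outer iteration; 3 outer iterations.
trace = (["T", "T", "T", "NT"]) * 3
print(mispredictions_1bit(trace))  # 5: roughly two misses per outer iteration
```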



2-bit prediction scheme
• A 4-state FSM is used to implement the 2-bit prediction scheme
• A prediction will have to miss twice before it is changed.
• It makes predictions more “stable” by adding hysteresis
• 2-bit (or sometimes 3-bit) counters are the basis of the more advanced
  predictors used in modern CPUs
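The hysteresis effect can be shown with a sketch of the 4-state FSM (same illustrative nested-loop trace as for the 1-bit scheme; names are ours):

```python
# 2-bit saturating counter: 0-1 predict NOT TAKEN, 2-3 predict TAKEN.
# A prediction must miss twice in a row before it flips.
def mispredictions_2bit(outcomes, counter=3):
    misses = 0
    for actual in outcomes:
        pred = "T" if counter >= 2 else "NT"
        if pred != actual:
            misses += 1
        counter = min(counter + 1, 3) if actual == "T" else max(counter - 1, 0)
    return misses

trace = (["T", "T", "T", "NT"]) * 3   # inner-loop branch, 3 outer iterations
print(mispredictions_2bit(trace))     # 3: only the loop-exit NT mispredicts
```

Compared with the 1-bit scheme's 5 misses on the same trace, the hysteresis removes the re-entry misprediction.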



Dynamic branch prediction
• Rule 1: The outcome of a branch (T / NT) correlates with previous
  executions of that same branch
  – Local branch correlation

• Rule 2: The outcome of a branch (T / NT) also correlates with previous
  executions of neighboring branches.
  – Global branch correlation



Dynamic branch prediction
Recent executions of one branch correlate with the execution of a
subsequent branch:
  – if cond1 is False, then the second conditional is False
  – if branch Y is True, then branch X is False


Dynamic branch prediction
if (aa==2) ;; B1
aa=0;
if (bb==2) ;; B2
bb=0;
if (aa!=bb) { ;; B3
….
}

if (B1 == True) && (B2 == True) then B3 = False

In branch outcomes: NT, NT → T
Correlating branch prediction
• Use both local and global branch predictions to improve the prediction rate
  – Correlating branch predictors
• Make a prediction based on the outcome of the branch the last time the
  same global branch history was encountered
• Use a global-history branch predictor
  – Use an m-bit Global History Register (GHR) to capture the history of
    the m most recent branches
• Uses two levels of history (the GHR, plus the history at that GHR value
  for the specific branch)

[Figure: a control-flow graph of branches B1–B8; the path taken through
them is encoded as a bit pattern (e.g. 111, 101) in the GHR.]
(m, n) branch predictors
• m = number of bits of the GHR (global history)
• n = number of bits of the per-branch predictor (an n-bit saturating counter)
• Size of the predictor in bits =
  2^m × n × (number of prediction entries selected by the branch address)
• (0,2): a predictor with no global history and 2-bit counters

[Figure: the branch address (k bits) selects a set of n-bit saturating
counters for local prediction; the m-bit GHR of recent global branch
history selects which of the 2^m counters provides the prediction.]
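The size formula can be checked numerically (the function name and the 1K-entry example are ours):

```python
# Size of an (m, n) predictor in bits, per the formula above:
# 2**m counters of n bits for each of the 2**k branch-address-selected entries.
def predictor_bits(m, n, k):
    return (2 ** m) * n * (2 ** k)

# A (2, 2) predictor with 1K entries (k = 10 branch-address bits):
print(predictor_bits(2, 2, 10))  # 8192 bits = 1 KiB
# The (0, 2) case degenerates to a plain 2-bit-counter table:
print(predictor_bits(0, 2, 10))  # 2048 bits
```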
Gshare predictors
• The gshare predictor is a type of (m, n) predictor in which the GHR is
  combined with the branch PC (using a hash function)
• Pattern History Table (PHT)
  – Contains the n-bit saturating counters used for prediction
  – Indexed by a hash function, which can be as simple as an XOR
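A minimal gshare sketch, with illustrative sizes (an 8-bit GHR and 256-entry PHT are our assumptions, not any real CPU's organization):

```python
# gshare: PHT index = (branch PC) XOR (global history register);
# each PHT entry is a 2-bit saturating counter.
class Gshare:
    def __init__(self, m=8):
        self.m = m
        self.ghr = 0                      # m-bit global history register
        self.pht = [1] * (1 << m)         # 2-bit counters, weakly not-taken

    def _index(self, pc):
        return (pc ^ self.ghr) & ((1 << self.m) - 1)  # the XOR hash

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2         # True = predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        self.pht[i] = min(self.pht[i] + 1, 3) if taken else max(self.pht[i] - 1, 0)
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.m) - 1)

bp = Gshare()
for _ in range(10):                       # train on an always-taken branch
    bp.update(0x4000, True)
print(bp.predict(0x4000))  # True once the counter saturates
```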



Can We Do Even Better?
• The predictability of branches varies

• Some branches are more predictable using local history
• Some using global history
• For others, a simple two-bit counter is enough
• Yet for others, a single bit is enough

• Observation: there is heterogeneity in the predictability behavior of
  branches
  – No one-size-fits-all branch prediction algorithm works for all branches

• Idea: exploit that heterogeneity by designing heterogeneous branch
  predictors
Tournament Branch Predictors
• Idea: Use more than one type of predictor (i.e., multiple
algorithms) and select dynamically the “best” prediction
• Tournament predictors combine Local and Global history
adaptively
• Advantages:
+ Better accuracy: different predictors are better for different branches
+ Reduced warmup time (faster-warmup predictor used until the slower-warmup
predictor warms up)
• Disadvantages:
-- Need “meta-predictor” or “selector”
-- Longer access latency
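The selector idea can be sketched as a 2-bit meta-counter per branch; the predictor internals are abstracted away here, and all names are illustrative:

```python
# "Meta-predictor" sketch: a 2-bit selector chooses between two component
# predictions and is trained only when the components disagree, moving
# toward whichever component was right.
class Tournament:
    def __init__(self):
        self.select = 2   # 0-1 favor predictor A, 2-3 favor predictor B

    def choose(self, pred_a, pred_b):
        return pred_b if self.select >= 2 else pred_a

    def update(self, pred_a, pred_b, actual):
        if pred_a != pred_b:              # agreement teaches the selector nothing
            if pred_b == actual:
                self.select = min(self.select + 1, 3)
            else:
                self.select = max(self.select - 1, 0)

t = Tournament()
# Predictor A is right twice in a row while B is wrong: selector swings to A.
t.update(True, False, True)
t.update(True, False, True)
print(t.choose(True, False))  # True: A's prediction is now selected
```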

[Figure: the Alpha 21264 tournament predictor]


Prediction Using Multiple History Lengths
• Observation: different branches require different history lengths for
  good prediction accuracy

• Idea: have multiple PHTs, indexed by GHR/branch-PC combinations with
  different history lengths, and intelligently allocate PHT entries to
  different branches
• The PHTs are tagged, like caches



TAGE: tagged tables, with the prediction taken from the longest-history
matching entry

[Figure: a tagless base predictor plus several tagged tables, indexed by the
PC hashed with increasingly long slices of global history (h[0:L1], h[0:L2],
h[0:L3]); each entry holds a prediction counter (ctr), a partial tag, and a
useful bit (u). Tag matches (=?) select the prediction from the table with
the longest matching history; on a miss everywhere, the tagless base
predictor provides the prediction.]


TAGE: Which Table to Use?
• General case:
– Longest history-matching component provides the prediction
• Special case:
– Many mispredictions on newly allocated entries: weak Cntr

Cntr: 3-bit prediction counter
U: 1- or 2-bit “useful” counter (was the entry recently useful?)
Tag: partial tag matched against the branch PC

Entry layout:  U | Tag | Cntr

TAGE is the basis of state-of-the-art branch predictors (winners of the
branch prediction championships)
Hybrid predictor performance

[Figure: comparison of misprediction rates achieved by hybrid predictors.]


Branch Confidence Estimation
• Idea: Estimate if the prediction is likely to be
correct
– i.e., estimate how “confident” you are in the
prediction

• Why?
– Could be very useful in deciding how to speculate:
• What predictor/PHT to choose/use
• Whether to keep fetching on this path
• Whether to switch to some other way of handling the
branch, e.g. dual-path execution (eager execution) or
dynamic predication
How to Estimate Confidence
• An example estimator:
– Keep a record of correct/incorrect outcomes for the past
N instances of the “branch”
– Based on the correct/incorrect patterns, guess if the
current prediction will likely be correct/incorrect
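The estimator described above can be sketched directly; N=4 and the threshold are illustrative choices, not from any specific design:

```python
# Confidence estimation: remember whether the last N predictions of a
# branch were correct, and report "high confidence" only when most were.
from collections import deque

class ConfidenceEstimator:
    def __init__(self, n=4, threshold=3):
        self.history = deque(maxlen=n)    # 1 = that prediction was correct
        self.threshold = threshold

    def record(self, was_correct):
        self.history.append(1 if was_correct else 0)

    def high_confidence(self):
        return sum(self.history) >= self.threshold

ce = ConfidenceEstimator()
for ok in (True, True, False, True, True):
    ce.record(ok)
print(ce.high_confidence())  # True: 3 of the last 4 predictions were correct
```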



What to Do With Confidence Estimation?
• An example application: Pipeline Gating
– Throttle instruction fetching/execution when too many
branches are predicted with low confidence



Branch Target Buffer (BTB)
• Branches require three things:
1. Predict the branch direction (we covered that already)
2. Predict the new target address
3. Identify that the instruction is a branch BEFORE it is decoded (IF stage)
• Branch target buffer (BTB) covers 2 and 3
– Cache memory that stores the predicted address of the next
instruction after a branch
– Contains the predicted-taken branches (since in an untaken branch
we can just fetch the next sequential address)
– It is accessed at the same time as the I-Cache using the PC (IF
stage)
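A BTB lookup in the fetch stage can be sketched as a small direct-mapped table; the 16-entry organization and index/tag split are our illustrative assumptions:

```python
# BTB sketch: indexed by low PC bits, holding (tag, predicted target)
# only for predicted-taken branches; a miss just falls through to PC+4.
class BTB:
    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries     # each entry: (tag, target) or None

    def _slot(self, pc):
        word = pc >> 2                    # drop byte offset within a word
        return word % self.entries, word  # (index, tag)

    def lookup(self, pc):
        idx, tag = self._slot(pc)
        entry = self.table[idx]
        if entry and entry[0] == tag:
            return entry[1]               # hit: fetch from the predicted target
        return pc + 4                     # miss: fetch sequentially

    def install(self, pc, target):        # on a resolved taken branch
        idx, tag = self._slot(pc)
        self.table[idx] = (tag, target)

btb = BTB()
btb.install(0x1010A, 0x1010A000)          # an entry like the slide's example
print(hex(btb.lookup(0x1010A)))   # 0x1010a000 (BTB hit)
print(hex(btb.lookup(0x20000)))   # 0x20004 (miss: fall through)
```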



Branch Target Buffer (BTB)
• If the PC of the fetched
instruction matches an
address in the BTB, the
predicted PC becomes
the new PC of the CPU
• Instruction fetching
begins immediately at
that address



Branch Target Buffer (BTB)
• Two sources of penalties
• Branch predicted Taken
(BTB Hit), but ends up Not
Taken
• Branch predicted Not
Taken (BTB Miss), but ends
up Taken
• In each case, we need to flush
the pipeline
• Flush all non-committed
instructions in the ROB
after the mispredicted
branch
• Performance penalty can
be severe
Branch Target Buffer (BTB) in the 5-stage pipeline

[Figure: the 5-stage datapath with a Branch Target Buffer accessed in the
IF stage, in parallel with the instruction memory. Example BTB contents:

  Tag       V   Target Addr.
  0x1010A   1   0x1010A000
  0x10200   1   0x10110B00
]
Integration of Branch Prediction and the BTB

[Figure: the Program Counter (address of the current instruction) indexes
both the branch predictor (taken?) and the BTB (hit? + target address). If
the branch is predicted taken and the BTB hits, the next fetch address is
the BTB target address; otherwise it is PC + instruction size.]


4. Speculation
• If branch-prediction confidence is low, execute both paths (T and NT)
  – dual-path (eager) execution
  – This is widely used in high-performance CPUs
  – Higher hardware cost / higher complexity, potentially higher energy
  – Need to be able to limit speculation at run time

[Figure: on a low-confidence prediction at block A, both Path 1 and Path 2
(blocks B and C) are fetched and executed; the paths reconverge at blocks
D, E, F.]


Speculation
• How much to speculate
– Mis-speculation degrades performance and power
relative to no speculation
• May cause additional misses (cache, TLB)
– Need to prevent speculative code from causing
higher cost misses (e.g. L2)
• Speculating through multiple branches
– Complicates speculation recovery
• Speculation and energy efficiency
– Note: speculation is only energy efficient when it
significantly improves performance



How much to speculate

[Figure: for the SPEC2000 benchmark suite, the fraction of instructions
executed as a result of mis-speculation is much higher for the integer
benchmarks than for the FP benchmarks.]
5. Predication
• Completely get rid of branch instructions
• Convert all control hazards to data hazards → predication
• Many modern ISAs support predication
  – ARM
  – x86 (partially)

Predication Example

if (a!=c) {
    b = 0;
}
else {
    b = 1;
}

With branches:                With predicates:
A:  beq a, c, TARGET          A:  p1 = (a!=c)
B:  move b, 0                 B:  move b, 0, (p1)
    j END                     C:  move b, 1, (!p1)
C: TARGET:
    move b, 1
D:  ...                       D:  ...

Each instruction gets an extra predicate operand:
  If p1 is TRUE, the instruction executes
  If p1 is FALSE, the instruction becomes a NOP



ARM processor is fully predicated
• Instructions are predicated using the appropriate postfix:
  – add   r0, r1, r2   ; regular ADD: r0 = r1 + r2
  – addeq r0, r1, r2   ; r0 = r1 + r2 only if the ZERO condition flag is set
• Conditional move (CMOV) is the best-known x86 predicated instruction:
    if (A==0) {S=T}
  becomes
    p1 = (R1 == 0)
    CMOV R2, R3, <p1>



Takeaways
• Branches are the second-largest culprit for CPU performance loss
  (memory instructions are the first)
  – Wrong branch predictions carry large penalties
  – Many in-flight instructions have to be flushed from the pipeline
• All modern CPUs use dynamic branch prediction and BTBs
  – Accurate branch prediction (>95%) is very important for achieving
    high performance and instruction throughput.
  – Branch prediction heuristics are rather complex
