Computer Architecture
Lecture 10: Branch Prediction II
Rachata Ausavarungnirun
Carnegie Mellon University
Spring 2015, 2/6/2015
HW 1 Grades
[Figure: histogram of HW 1 grades; x-axis: score, y-axis: number of students. The summary statistics (including the standard deviation) are not recoverable.]
Multi-cycle and Microprogrammed Microarchitectures
Pipelining
Out-of-Order Execution
Smith and Sohi, "The Microarchitecture of Superscalar Processors," Proceedings of the IEEE, 1995.
More advanced pipelining
Interrupt and exception handling
Out-of-order and superscalar execution concepts
Branch Prediction
Reducing misprediction penalty (branch resolution latency)
Branch target buffer (BTB)
Recitation Session on Monday
Lab 2
HW 2 (due Wednesday)
Lectures 1-10
Reading assignments
Always not taken
Always taken
BTFN (Backward taken, forward not taken)
Profile based (likely direction)
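The static schemes listed above can be sketched as plain functions. This is a minimal illustration, not code from the lecture: the function names and the (pc, target) interface are assumptions, and profile-based prediction is omitted because it needs profile data.

```python
def predict_always_not_taken(pc: int, target: int) -> bool:
    """Static scheme: every branch is predicted not taken."""
    return False


def predict_always_taken(pc: int, target: int) -> bool:
    """Static scheme: every branch is predicted taken."""
    return True


def predict_btfn(pc: int, target: int) -> bool:
    """BTFN: a backward branch (target below the branch address, as in a
    loop) is predicted taken; a forward branch is predicted not taken."""
    return target < pc


if __name__ == "__main__":
    # A loop-closing branch at 0x1000 that jumps back to 0x0FF0.
    print(predict_btfn(0x1000, 0x0FF0))
    # A forward branch at 0x1000 that skips ahead to 0x1010.
    print(predict_btfn(0x1000, 0x1010))
```

None of these functions keeps state, which is what makes them static: the prediction for a given branch never changes at run time.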
Last time prediction (single-bit)
Two-bit counter based prediction
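The two dynamic schemes above can be compared on a loop branch. The sketch below is illustrative only: the class names, the initial states, and the accuracy helper are assumptions, not part of the lecture.

```python
class LastTimePredictor:
    """Single bit: predict whatever the branch did last time."""

    def __init__(self) -> None:
        self.last_taken = False  # assumed initial state

    def predict(self) -> bool:
        return self.last_taken

    def update(self, taken: bool) -> None:
        self.last_taken = taken


class TwoBitCounterPredictor:
    """Two-bit saturating counter: states 0 and 1 predict not taken,
    states 2 and 3 predict taken."""

    def __init__(self, state: int = 1) -> None:
        self.state = state  # assumed to start at weakly not taken

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def accuracy(predictor, outcomes) -> float:
    """Fraction of outcomes predicted correctly, updating after each one."""
    correct = 0
    for taken in outcomes:
        correct += predictor.predict() == taken
        predictor.update(taken)
    return correct / len(outcomes)


if __name__ == "__main__":
    # Loop branch: taken three times, then not taken, repeated ten times.
    loop = [True, True, True, False] * 10
    print(accuracy(LastTimePredictor(), loop))
    print(accuracy(TwoBitCounterPredictor(), loop))
```

On this pattern the single-bit scheme mispredicts twice per loop (at the exit and again at re-entry), while the counter's hysteresis limits it to one misprediction per loop once it is warmed up.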
100% accuracy
99% accuracy
98% accuracy
95% accuracy
Last-time and 2BC predictors exploit last-time predictability
Realization 1: A branch's outcome can be correlated with other branches' outcomes
  Global branch correlation
Realization 2: A branch's outcome can be correlated with past outcomes of the same branch (other than the outcome of the branch the last time it was executed)
Recently executed branch outcomes in the execution path are correlated with the outcome of the next branch
Global Branch Correlation (II)
[Code example with three branches, marked B1, B2, and B3; the code itself is not recoverable]
Idea: Associate branch outcomes with global T/NT history of all branches
Make a prediction based on the outcome of the branch the last time the same global branch history was encountered
Implementation:
  Keep track of the global T/NT history of all branches in a register: Global History Register (GHR)
  Use GHR to index into a table that recorded the outcome that was seen for each GHR value in the recent past: Pattern History Table (table of 2-bit counters)
First level: Global branch history register (N bits)
  The direction of last N branches
Second level: Table of saturating counters for each history entry
  The direction the branch took the last time the same history was seen
[Figure: the GHR (global branch history register), which holds the previous branches' directions (e.g., 1 1 .. 1 0), is used as the index into the Pattern History Table (PHT), whose entries run from 00...00, 00...01, 00...10 up to 11...11]
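The two levels described above (a global history register indexing a table of 2-bit counters) can be sketched as follows. The class name, the default history length, and the initial counter value are assumptions made for illustration.

```python
class GlobalTwoLevelPredictor:
    """First level: an N-bit global history register (GHR).
    Second level: a pattern history table (PHT) of 2-bit saturating
    counters, one per possible GHR value."""

    def __init__(self, history_bits: int = 4) -> None:
        self.mask = (1 << history_bits) - 1
        self.ghr = 0                           # newest outcome in the low bit
        self.pht = [1] * (1 << history_bits)   # assumed start: weakly not taken

    def predict(self) -> bool:
        return self.pht[self.ghr] >= 2

    def update(self, taken: bool) -> None:
        counter = self.pht[self.ghr]
        self.pht[self.ghr] = min(3, counter + 1) if taken else max(0, counter - 1)
        # Shift the resolved direction into the history register.
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask


if __name__ == "__main__":
    predictor = GlobalTwoLevelPredictor(history_bits=2)
    outcomes = [True, False] * 20  # strictly alternating branch
    mispredictions = 0
    for taken in outcomes:
        mispredictions += predictor.predict() != taken
        predictor.update(taken)
    print(mispredictions)
```

An alternating branch defeats a plain 2-bit counter, but here each history value is always followed by the same outcome, so the table learns the pattern after a short warm-up.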
Idea: Add more context information to the global predictor to take into account which branch is being predicted
+ Better utilization of PHT
-- Increases access latency
McFarling, "Combining Branch Predictors," DEC WRL Technical Report, 1993.
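One common way to add the branch's identity to the index is to XOR the branch address with the global history, a scheme usually called gshare. The sketch below is an assumption-laden illustration: the function name, the 4-byte instruction alignment (hence the shift by 2), and the example addresses are all hypothetical.

```python
def gshare_index(pc: int, ghr: int, index_bits: int) -> int:
    """Hash the branch address with the global history to pick a PHT entry.
    Assumes 4-byte aligned instructions, so the low two PC bits are dropped."""
    mask = (1 << index_bits) - 1
    return ((pc >> 2) ^ ghr) & mask


if __name__ == "__main__":
    history = 0b1010
    # Two different branches that see the same global history.
    print(gshare_index(0x400, history, index_bits=4))
    print(gshare_index(0x404, history, index_bits=4))
```

With a history-only index both branches would share one counter; with the XOR they land in different PHT entries, which is the better utilization noted above. The extra hashing step sits on the prediction path, which is where the added access latency comes from.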
[Figure: fetch stage with a BTB. The Program Counter (address of the current instruction) indexes a cache of target addresses (BTB: Branch Target Buffer); the Next Fetch Address is either PC + inst size or, on a hit, the target address.]
[Figure: the same fetch stage with a direction predictor (2-bit counters) that supplies taken? alongside the BTB's hit? and target address.]
[Figure: the same fetch stage with the direction predictor (2-bit counters) indexed by the Program Counter combined through XOR with the global branch history.]
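The next-fetch-address selection in the figures can be sketched as below. This is a simplified model under stated assumptions: the BTB is a plain dictionary from branch address to target, the direction predictor is passed in as a function, and instructions are 4 bytes; none of these names come from the lecture.

```python
from typing import Callable, Dict

INST_SIZE = 4  # assumed fixed instruction size in bytes


def next_fetch_address(
    pc: int,
    btb: Dict[int, int],
    predict_taken: Callable[[int], bool],
) -> int:
    """Pick the next fetch address for the instruction at pc.

    A BTB hit means the instruction is a known branch with a cached target.
    The target is used only if the direction predictor also says taken;
    otherwise fetch continues sequentially.
    """
    target = btb.get(pc)  # None models a BTB miss
    if target is not None and predict_taken(pc):
        return target
    return pc + INST_SIZE


if __name__ == "__main__":
    btb = {0x100: 0x40}  # one known branch at 0x100 that jumps to 0x40
    print(hex(next_fetch_address(0x100, btb, lambda pc: True)))
    print(hex(next_fetch_address(0x100, btb, lambda pc: False)))
    print(hex(next_fetch_address(0x200, btb, lambda pc: True)))
```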
Can We Do Better?
Last-time and 2BC predictors exploit only last-time predictability for a given branch
Realization 1: A branch's outcome can be correlated with other branches' outcomes
  Global branch correlation
Realization 2: A branch's outcome can be correlated with past outcomes of the same branch (in addition to the outcome of the branch the last time it was executed)
Local Branch Correlation
McFarling, "Combining Branch Predictors," DEC WRL TR 1993.
Make a prediction based on the outcome of the branch the last time the same local branch history was encountered
First level: A set of local history registers (N bits each)
  Select the history register based on the PC of the branch
Second level: Table of saturating counters for each history entry
  The direction the branch took the last time the same history was seen
[Figure: one of the local history registers (e.g., 1 1 .. 1 0) is used as the index into the Pattern History Table (PHT), whose entries run from 00...00, 00...01, 00...10 up to 11...11]
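The local variant differs from the global one only in its first level: a history register is chosen by the branch's PC. The sketch below is illustrative; the class name, table sizes, the modulo-based register selection, and the 4-byte instruction assumption are not from the lecture.

```python
class LocalTwoLevelPredictor:
    """First level: a set of per-branch local history registers, selected
    by the branch PC. Second level: a shared pattern history table of
    2-bit saturating counters indexed by the selected local history."""

    def __init__(self, history_bits: int = 3, num_history_regs: int = 16) -> None:
        self.mask = (1 << history_bits) - 1
        self.num_history_regs = num_history_regs
        self.histories = [0] * num_history_regs
        self.pht = [1] * (1 << history_bits)  # assumed start: weakly not taken

    def _select(self, pc: int) -> int:
        # Assumes 4-byte instructions; low PC bits pick the history register.
        return (pc >> 2) % self.num_history_regs

    def predict(self, pc: int) -> bool:
        return self.pht[self.histories[self._select(pc)]] >= 2

    def update(self, pc: int, taken: bool) -> None:
        reg = self._select(pc)
        history = self.histories[reg]
        counter = self.pht[history]
        self.pht[history] = min(3, counter + 1) if taken else max(0, counter - 1)
        self.histories[reg] = ((history << 1) | int(taken)) & self.mask


if __name__ == "__main__":
    predictor = LocalTwoLevelPredictor(history_bits=3)
    loop_branch_pc = 0x400
    outcomes = [True, True, True, False] * 10  # 4-iteration loop
    mispredictions = 0
    for taken in outcomes:
        mispredictions += predictor.predict(loop_branch_pc) != taken
        predictor.update(loop_branch_pc, taken)
    print(mispredictions)
```

With three bits of local history, each position in the four-iteration loop has a distinct history, so after warm-up even the loop exit is predicted correctly.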
[Figure: fetch stage with a BTB. The Program Counter (address of the current instruction) indexes a cache of target addresses (BTB: Branch Target Buffer); the Next Fetch Address is either PC + inst size or, on a hit, the target address.]
Idea: Use more than one type of predictor (i.e., multiple algorithms) and select the best prediction
Advantages:
Disadvantages:
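One way to select between two predictors is a small chooser counter that drifts toward whichever component has been right more often. The sketch below is an assumption, not the lecture's design: the class name, the 2-bit chooser, and the rule of training only when the components disagree are illustrative choices.

```python
class TournamentChooser:
    """2-bit chooser: values 0 and 1 select predictor A, values 2 and 3
    select predictor B."""

    def __init__(self) -> None:
        self.counter = 1  # assumed start: weakly prefer A

    def choose(self, pred_a: bool, pred_b: bool) -> bool:
        return pred_b if self.counter >= 2 else pred_a

    def update(self, pred_a: bool, pred_b: bool, taken: bool) -> None:
        # Only learn when the components disagree; otherwise there is
        # nothing to tell them apart.
        if pred_a == pred_b:
            return
        if pred_b == taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)


if __name__ == "__main__":
    chooser = TournamentChooser()
    # Predictor A says not taken, B says taken, and the branch is taken.
    for _ in range(3):
        chooser.update(pred_a=False, pred_b=True, taken=True)
    print(chooser.choose(pred_a=False, pred_b=True))
```

In a full design there would typically be a table of such choosers indexed by branch address, and both component predictors would still be updated on every branch.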
Biased Branches
Observation: Many branches are biased in one direction (e.g., 99% taken)
Solution: Detect such biased branches, and predict them with a simpler predictor (e.g., last time, static, ...)
Works well for loops with small number of iterations, where iteration count is predictable
Learns the direction correlations between individual branches
Assigns weights to correlations
Jimenez and Lin, "Dynamic Branch Prediction with Perceptrons," HPCA 2001.
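The weight-learning idea can be sketched with a single perceptron over the global history. This is a simplified illustration under assumptions: one perceptron rather than a table selected by branch address, an arbitrary history length, and an arbitrary training threshold; the class and parameter names are hypothetical.

```python
class PerceptronPredictor:
    """One perceptron over the global history. Each history bit is +1
    (taken) or -1 (not taken); each weight records how strongly that past
    branch correlates with the current one."""

    def __init__(self, history_len: int = 8, threshold: int = 10) -> None:
        self.weights = [0] * (history_len + 1)  # weights[0] is the bias
        self.history = [-1] * history_len       # newest outcome first
        self.threshold = threshold              # assumed training threshold

    def _output(self) -> int:
        dot = sum(w * x for w, x in zip(self.weights[1:], self.history))
        return self.weights[0] + dot

    def predict(self) -> bool:
        return self._output() >= 0

    def update(self, taken: bool) -> None:
        output = self._output()
        t = 1 if taken else -1
        # Train on a misprediction or when the output is not yet confident.
        if (output >= 0) != taken or abs(output) <= self.threshold:
            self.weights[0] += t
            for i, x in enumerate(self.history):
                self.weights[i + 1] += t * x
        self.history = [t] + self.history[:-1]


if __name__ == "__main__":
    predictor = PerceptronPredictor()
    outcomes = [True, False] * 100  # outcome is the opposite of the last one
    results = []
    for taken in outcomes:
        results.append(predictor.predict() == taken)
        predictor.update(taken)
    print(all(results[-50:]))
```

After training, a large positive weight means "agrees with that past branch" and a large negative weight means "disagrees with it", which is the sense in which weights are assigned to correlations.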
Your predictor?
Potential solutions if the instruction is a control-flow instruction:
  Stall the pipeline until we know the next fetch address
  Guess the next fetch address (branch prediction)
  Employ delayed branching (branch delay slot)
  Do something else (fine-grained multithreading)
3 conditional branches
Problem: This increases the number of control dependencies
Idea: Combine predicate operations to feed a single branch instruction
  Predicates stored and operated on using condition registers
  A single branch checks the value of the combined predicate
Fewer mispredictions/stalls
Predication (Predicated Execution)
Control dependence is converted into data dependence: the branch is eliminated
Each instruction has a predicate bit set based on the predicate computation
if (cond) {
    b = 0;
}
else {
    b = 1;
}

(normal branch code)
        p1 = (cond)
        branch p1, TARGET
        mov b, 1
        jmp JOIN
TARGET: mov b, 0
JOIN:   add x, b, 1

(predicated code)
        p1 = (cond)
        (!p1) mov b, 1
        (p1) mov b, 0
        add x, b, 1
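The predicated sequence above can be modeled with a tiny interpreter in which every instruction is fetched but only those whose predicate is true update state. This is a sketch for intuition only: the tuple encoding, the function name, and the returned fetch count are assumptions.

```python
def run_predicated(cond: bool) -> dict:
    """Execute the predicated version of: if (cond) b = 0; else b = 1;
    followed by add x, b, 1. No instruction changes the fetch order."""
    regs = {"b": 0, "x": 0}
    p1 = cond  # p1 = (cond)

    # (predicate, operation, destination, operand)
    program = [
        (not p1, "mov", "b", 1),  # (!p1) mov b, 1
        (p1, "mov", "b", 0),      # (p1)  mov b, 0
        (True, "add1", "x", "b"), # add x, b, 1
    ]

    fetched = 0
    for predicate, op, dest, operand in program:
        fetched += 1
        if not predicate:
            continue  # a false predicate turns the instruction into a no-op
        if op == "mov":
            regs[dest] = operand
        elif op == "add1":
            regs[dest] = regs[operand] + 1

    regs["fetched"] = fetched
    return regs


if __name__ == "__main__":
    print(run_predicated(True))
    print(run_predicated(False))
```

Both runs fetch the same three instructions, which shows the trade: there is no branch to mispredict, but one of the two mov instructions is always wasted work.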
Conditional Move Operations
Very limited form of predicated execution
CMOV R1 ← R2
  R1 = (ConditionCode == true) ? R2 : R1
Employed in most modern ISAs (x86, Alpha)
Review: CMOV Operation
Suppose we had a Conditional Move instruction
  CMOV condition, R1 ← R2
  R1 = (condition == true) ? R2 : R1
Employed in most modern ISAs (x86, Alpha)
[Code example only partly recoverable; the surviving fragments are "4;", "3;", and "CMOV !condition, b"]
Branch Prediction
[Figure: pipeline diagram ending in "Pipeline flush!!"]
Predicated Execution (III)
Advantages:
Disadvantages:
  Loop branches
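Predicated Execution in Intel Itanium
Each instruction can be separately predicated
[Figure: branch code (cmp, br, else1, else2, br, then1, then2, join1, join2) next to its predicated form, in which cmp sets p1 and p2, then1 and then2 are guarded by p1, else1 and else2 are guarded by p2, and join1 and join2 are unguarded]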
Conditional Execution in ARM ISA
Almost all ARM instructions can include an optional condition code.
Idealism
Wouldn't it be nice
  If the branch is eliminated (predicated) only when it would actually be mispredicted
  If the branch were predicted only when it would actually be correctly predicted
Improving Predicated Execution
Three major limitations of predication
1.
2.
3.
Wish Branches
Dynamic Predicated Execution
  Solves 1, 2 (partially), 3
Wish Branches
[Figure: Wish Jump/Join. The same if/else is shown as normal branch code (p1 = (cond); branch p1, TARGET; mov b, 1; jmp JOIN; TARGET: ...; JOIN: ...), as predicated code, and as wish jump/join code, with separate High Confidence and Low Confidence cases]
Potential solutions if the instruction is a control-flow instruction:
  Stall the pipeline until we know the next fetch address
  Guess the next fetch address (branch prediction)
  Employ delayed branching (branch delay slot)
  Do something else (fine-grained multithreading)
Multi-Path Execution
Idea: Execute both paths after a conditional branch
  For all branches: Riseman and Foster, "The inhibition of potential parallelism by conditional jumps," IEEE Transactions on Computers, 1972.
  For a hard-to-predict branch: Use dynamic confidence estimation
Advantages:
Disadvantages:
What happens when the machine encounters another hard-to-predict branch? Execute both paths again?
-- Paths followed quickly become exponential
-- Each followed path requires its own context (registers, PC, GHR)
-- Wasted work (and reduced performance) if paths merge
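[Figure: control-flow graph with blocks A through F in which a hard-to-predict branch is followed along path 1 and path 2 until the paths reach a common CFM point]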
Type           Direction at   Number of possible      When is next fetch
               fetch time     next fetch addresses?   address resolved?
Conditional    Unknown        2                       Execution (register dependent)
Unconditional  Always taken   1                       Decode (PC + offset)
Call           Always taken   1                       Decode (PC + offset)
Return         Always taken   Many                    Execution (register dependent)
Indirect       Always taken   Many                    Execution (register dependent)
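[Figure: the same function X is called from three call sites (Call X), and each Return goes back to a different caller]
A return can have many target addresses: the next instruction after each call point for the same function
A fetched return: pops the stack and uses the address as its predicted target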
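The stack-based return prediction described above can be sketched as follows. The class name, the fixed depth, the overflow policy (drop the oldest entry), and the 4-byte instruction size are assumptions for illustration.

```python
from typing import List, Optional


class ReturnAddressStack:
    """Small hardware-style stack of predicted return addresses."""

    def __init__(self, depth: int = 16, inst_size: int = 4) -> None:
        self.depth = depth          # assumed capacity
        self.inst_size = inst_size  # assumed fixed instruction size
        self.stack: List[int] = []

    def on_call(self, call_pc: int) -> None:
        """A fetched call pushes the address of the next instruction."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # assumed policy: overwrite the oldest entry
        self.stack.append(call_pc + self.inst_size)

    def predict_return(self) -> Optional[int]:
        """A fetched return pops the stack and uses the address as its
        predicted target. None means there is no prediction available."""
        return self.stack.pop() if self.stack else None


if __name__ == "__main__":
    ras = ReturnAddressStack()
    ras.on_call(0x100)  # outer call
    ras.on_call(0x200)  # nested call
    print(hex(ras.predict_return()))  # return from the nested call
    print(hex(ras.predict_return()))  # return from the outer call
```

Because calls and returns nest, the most recently pushed address is the right target for the next return, even though the same return instruction goes to a different place each time.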
[Figure: an indirect jump (branch R1) whose next fetch address may be A+1 or one of several targets (TARG)]
Used to implement
  Switch-case statements
  Virtual function calls
Indirect Branch Prediction (II)
Issues in Branch Prediction (I)
Need to identify a branch before it is fetched
How do we do this?
  BTB hit indicates that the fetched instruction is a branch
  BTB entry contains the type of the branch
  Pre-decoded branch type information stored in the instruction cache identifies type of branch
What if no BTB?
  Bubble in the pipeline until target address is computed
  E.g., IBM POWER4
Issues in Branch Prediction (II)
Latency: Prediction is latency critical
[Figure: the Next Fetch Address is selected among PC + inst size, BTB target, Return Address Stack target, Indirect Branch Predictor target, and Resolved target from Backend; the select signal is marked ???]
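Complications in Superscalar Processors
Superscalar processors attempt to execute more than one instruction per cycle
(case 1) Neither of the insts is a taken control flow inst
  nPC = PC + 8
(case 2) One of the insts is a taken control flow inst
  If the 1st instruction is the taken one, nullify the 2nd instruction fetched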
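The two fetch cases above can be sketched for a 2-wide fetch of 4-byte instructions, which matches nPC = PC + 8. The function name, its arguments, and the returned validity flags are assumptions for illustration.

```python
from typing import Optional, Tuple

INST_SIZE = 4    # assumed instruction size, so two instructions span 8 bytes
FETCH_WIDTH = 2  # assumed 2-way superscalar fetch


def next_pc_two_way(
    pc: int,
    first_taken: bool,
    first_target: Optional[int],
    second_taken: bool,
    second_target: Optional[int],
) -> Tuple[int, Tuple[bool, bool]]:
    """Return (nPC, (first_valid, second_valid)) for one fetch group.

    The first predicted-taken instruction in program order decides nPC.
    If that is the 1st instruction, the 2nd one fetched is nullified.
    """
    if first_taken:
        assert first_target is not None
        return first_target, (True, False)
    if second_taken:
        assert second_target is not None
        return second_target, (True, True)
    return pc + FETCH_WIDTH * INST_SIZE, (True, True)


if __name__ == "__main__":
    # Case 1: no taken control-flow instruction in the group.
    print(next_pc_two_way(0x100, False, None, False, None))
    # Case 2: the 1st instruction is a taken branch to 0x40.
    print(next_pc_two_way(0x100, True, 0x40, False, None))
```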
Review of Last Few Lectures
Delayed branching
Fine-grained multithreading
Branch prediction
  Last time predictor
  Hysteresis: 2BC predictor
  Two-level global predictor
  Two-level local predictor
Predicated execution
Multipath execution
Return address stack & Indirect branch prediction