Branch Prediction
BTB basics, return address prediction, correlating prediction
Why do we care?
All the unwanted things that happen in a pipeline, stalls and the hazards that cause them, are bound up with one another. We want to break this bond and approach the ideal pipeline CPI of 1, which we know is impossible to reach in practice.
But the natural human tendency is to take up the challenge. So here we think through how to handle branch problems. Hennessy and Patterson have already worked this out; we are retracing their steps.
Speculation: what does it mean?
Speculative execution is an optimization technique in which a computer system performs some task that may not actually be needed. The main idea is to do work before it is known whether that work will be needed at all, so as to avoid the delay that would be incurred by doing the work only after it is known to be needed. If the work turns out not to be needed after all, any changes it made are reverted and its results are ignored.
Branch Target Buffer
Branch Target Buffer (BTB): the address of the branch is used as an index to get the prediction AND the branch target address (if taken).
Note: we must check for a branch match now, since we cannot use the wrong branch's address.
Example: a BTB combined with a BHT.










[Figure: BTB organization. The PC of the instruction being fetched is compared (=?) against the Branch PC column of the table; each entry also holds a Predicted PC and optional extra prediction state bits. Yes: the instruction is a branch, so use the predicted PC as the next PC. No: the branch is not predicted; proceed normally (next PC = PC + 4).]
5
What does the BTB do?
The PC of the instruction being fetched is matched against a set of instruction addresses stored in the first column; these represent the addresses of known branches. If the PC matches one of these entries, then the instruction being fetched is a taken branch, and the second field, the predicted PC, contains the prediction for the next PC after the branch. Fetching begins immediately at that address. The third field, which is optional, may be used for extra prediction state bits.
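As a concrete illustration, here is a minimal sketch of that lookup in Python. The direct-mapped organization, the 64-entry size, and all method names are illustrative choices, not taken from any particular processor:

```python
# Minimal sketch of a direct-mapped branch target buffer (BTB).
# Only taken branches are stored; a miss predicts fall-through.

class BTB:
    def __init__(self, entries=64):
        self.entries = entries
        # Each slot holds (tag, predicted_target) or None.
        self.table = [None] * entries

    def _index(self, pc):
        return (pc >> 2) % self.entries   # assume word-aligned PCs

    def predict(self, pc):
        """Return the predicted next PC, or pc + 4 on a BTB miss."""
        entry = self.table[self._index(pc)]
        if entry is not None and entry[0] == pc:   # tag match: known taken branch
            return entry[1]
        return pc + 4                              # not predicted: fall through

    def update(self, pc, taken, target):
        """After the branch resolves, insert taken branches, evict not-taken ones."""
        i = self._index(pc)
        if taken:
            self.table[i] = (pc, target)
        elif self.table[i] is not None and self.table[i][0] == pc:
            self.table[i] = None
```

For example, before a branch at 0x40 is seen, `predict(0x40)` returns 0x44 (fall-through); after `update(0x40, True, 0x100)`, fetch is redirected to 0x100.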
Return Address Prediction
Why?
Register-indirect branches are hard to predict: many callers, one callee, so a single instruction jumps to multiple return addresses (no PC-target correlation).
In SPEC89, 85% of such branches are procedure returns.
Since procedure calls follow a stack discipline, save the return address in a small buffer that acts like a stack: 8 to 16 entries gives a small miss rate.
How? The return address stack (RAS)
Return address stacks are also very simple: they are fixed-size stacks of return addresses.
To use a return address stack, we push PC + 4 onto the stack when we execute a procedure call instruction. This pushes the call's return address onto the stack: when the call finishes, it will return to PC + 4 of the procedure call instruction. When we execute a return instruction, we pop an address off the stack and predict that the return instruction will return to the popped address.
Since return instructions almost always return to the instruction after the most recent procedure call, return address stacks are highly accurate.
Remember that return address stacks only generate predictions for return instructions. They do not help at all with procedure call instructions (we use the BTB to predict calls).
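The push-on-call, pop-on-return behavior described above can be sketched as follows; the 8-entry size and the wrap-around overflow policy are illustrative choices:

```python
# Minimal sketch of a fixed-size return address stack (RAS).
# On overflow the oldest entries are silently overwritten (wrap-around).

class ReturnAddressStack:
    def __init__(self, size=8):
        self.size = size
        self.stack = [0] * size
        self.top = 0          # index of the next free slot (wraps around)

    def push_call(self, pc):
        """On a procedure call at `pc`, push the return address pc + 4."""
        self.stack[self.top % self.size] = pc + 4
        self.top += 1

    def predict_return(self):
        """On a return instruction, pop and predict the saved address."""
        self.top -= 1
        return self.stack[self.top % self.size]
```

For example, after calls at 0x100 and then 0x200, the first return is predicted to go to 0x204 and the second to 0x104, matching the nesting of the calls.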


Problems
Suppose I have a standard 5-stage pipeline where branches and jumps are resolved in the execute stage. 20% of my instructions are branches, and they are taken 60% of the time. 5% of my instructions are return instructions (return instructions are not branches!).
What is the CPI of my system if I always stall the processor on branches and jumps? What if I always predict that branches and jumps are taken?
One tempting answer is to give up on chasing CPI, drop the parallelism, and settle for serial execution. But again, the human tendency is not to give up, so let us press on.
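Under common textbook assumptions the arithmetic for the problem above works out as sketched below. The assumptions are mine, not stated in the problem: a redirect resolved in EX of a 5-stage pipeline costs 2 bubbles; under "predict taken" the taken target is available at fetch; and returns always pay the full penalty because their register-indirect target is only known in EX.

```python
# Worked sketch of the CPI problem, under the assumptions stated above.

base_cpi    = 1.0
branch_frac = 0.20   # fraction of instructions that are branches
taken_frac  = 0.60   # fraction of branches that are taken
return_frac = 0.05   # fraction of instructions that are returns
penalty     = 2      # bubbles until resolution in EX

# (a) Always stall: every branch and every return pays the full penalty.
cpi_stall = base_cpi + (branch_frac + return_frac) * penalty
print(round(cpi_stall, 2))   # 1.5

# (b) Predict taken: only not-taken branches (40% of them) pay the
# penalty; returns still pay it, since their target cannot be guessed.
cpi_taken = (base_cpi + branch_frac * (1 - taken_frac) * penalty
             + return_frac * penalty)
print(round(cpi_taken, 2))   # 1.26
```

So always stalling gives a CPI of 1.5, while predict-taken improves it to 1.26; the remaining 0.26 is dominated by not-taken branches and unpredictable returns, which is exactly what the return address stack attacks.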
What to do?
Integrated Instruction Fetch Units
To meet the demands of multiple-issue processors, many recent designers have chosen to implement an integrated instruction fetch unit: a separate autonomous unit that feeds instructions to the rest of the pipeline. Essentially, this amounts to recognizing that characterizing instruction fetch as a simple single pipe stage is no longer valid, given the complexities of multiple issue.
Instead, recent designs have used an integrated instruction fetch unit that integrates several functions:
1. Integrated branch prediction: the branch predictor becomes part of the instruction fetch unit and is constantly predicting branches, so as to drive the fetch pipeline.
2. Instruction prefetch: to deliver multiple instructions per clock, the instruction fetch unit will likely need to fetch ahead. The unit autonomously manages the prefetching of instructions (see Chapter 5).
Because Patterson says so.
3. Instruction memory access and buffering: when fetching multiple instructions per cycle, a variety of complexities are encountered, including the difficulty that fetching multiple instructions may require accessing multiple cache lines. The instruction fetch unit encapsulates this complexity, using prefetch to try to hide the cost of crossing cache blocks. The instruction fetch unit also provides buffering, essentially acting as an on-demand unit that provides instructions to the issue stage as needed and in the quantity needed.
Terminology: ROB
Which place, a bank or a mint? Sorry, it is the reorder buffer. So what does that mean?
A reorder buffer (ROB) is used in the Tomasulo algorithm for out-of-order instruction execution. It allows instructions to be committed in order. Normally there are three stages for an instruction: issue, execute, and write result. With a ROB, there is an additional stage, commit. In this stage, the result of an instruction is stored in a register or in memory. In the write-result stage, the result is only placed in the reorder buffer. The contents of this buffer can then be used by other instructions that depend on them.
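The issue / write-result / commit flow above can be sketched as a circular queue; the entry format and method names are illustrative, not taken from any real design:

```python
# Minimal sketch of a reorder buffer. Entries are allocated at issue,
# filled at write-result, and retired strictly in program order at commit.
from collections import deque

class ReorderBuffer:
    def __init__(self, size=8):
        self.size = size
        self.entries = deque()   # FIFO: oldest instruction at the left

    def issue(self, dest_reg):
        """Allocate an entry at issue; the result is not known yet."""
        if len(self.entries) == self.size:
            raise RuntimeError("ROB full: stall issue")
        entry = {"dest": dest_reg, "value": None, "ready": False}
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value):
        """Write-result stage: park the value in the ROB, not the register file."""
        entry["value"] = value
        entry["ready"] = True

    def commit(self, regfile):
        """Commit stage: retire the oldest entry only once its result is ready."""
        if self.entries and self.entries[0]["ready"]:
            e = self.entries.popleft()
            regfile[e["dest"]] = e["value"]   # architectural state updated in order
            return True
        return False
```

Note the key property: even if a younger instruction writes its result first, it cannot commit until every older instruction ahead of it in the queue has committed.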
Register Renaming versus Reorder Buffers
With the addition of speculation, register values may also temporarily reside in the ROB. In either case, if the processor does not issue new instructions for a period of time, all existing instructions will commit, and the register values will appear in the register file, which directly corresponds to the architecturally visible registers.
So what?
In the register-renaming approach, an extended set of physical registers is used to hold both the architecturally visible registers and temporary values. Thus, the extended registers replace the function of both the ROB and the reservation stations. During instruction issue, a renaming process maps the names of architectural registers to physical register numbers in the extended register set, allocating a new, unused register for the destination.
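A minimal sketch of that renaming process, with a map table and a free list; the register counts and names are illustrative only:

```python
# Minimal sketch of register renaming with an extended physical register
# file: a map table (architectural -> physical) plus a free list.

class RenameTable:
    def __init__(self, arch_regs=4, phys_regs=8):
        # Initially architectural register r_i maps to physical register p_i.
        self.map = {f"r{i}": f"p{i}" for i in range(arch_regs)}
        self.free = [f"p{i}" for i in range(arch_regs, phys_regs)]

    def rename(self, dest, srcs):
        """Rename one instruction: sources read the current mapping,
        and the destination gets a fresh physical register."""
        phys_srcs = [self.map[s] for s in srcs]
        old_dest = self.map[dest]
        new_dest = self.free.pop(0)      # allocate an unused register
        self.map[dest] = new_dest
        # `old_dest` is freed once the new mapping commits (not shown here).
        return new_dest, phys_srcs, old_dest
```

For example, two successive writes to r1 are given different physical registers, so the WAW hazard between them disappears, and a later reader of r1 is pointed at whichever physical register holds the newest value.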

An advantage?
An advantage of the renaming approach over the ROB approach is that instruction commit is simplified, since it requires only two simple actions: record that the mapping between an architectural register number and a physical register number is no longer speculative, and free up any physical register that was holding the older value of the architectural register. In a design with reservation stations, a station is freed up when the instruction using it completes execution, and a ROB entry is freed up when the corresponding instruction commits.
How do we ever know which registers are the architectural registers if they are constantly changing? Most of the time when the program is executing, it does not matter. There are clearly cases, however, where another process, such as the operating system, must be able to know exactly where the contents of a certain architectural register reside. To understand how this capability is provided, assume the processor does not issue instructions for some period of time. Eventually all instructions in the pipeline will commit, and the mapping between the architecturally visible registers and the physical registers will become stable. At that point, a subset of the physical registers contains the architecturally visible registers, and the value of any physical register not associated with an architectural register is unneeded. It is then easy to move the architectural registers to a fixed subset of physical registers so that the values can be communicated to another process.
How Much to Speculate
One of the significant advantages of speculation is its ability to uncover events that would otherwise stall the pipeline early, such as cache misses. This potential advantage, however, comes with a significant potential disadvantage. Speculation is not free: it takes time and energy, and the recovery of incorrect speculation further reduces performance. In addition, to support the higher instruction execution rate needed to benefit from speculation, the processor must have additional resources, which take silicon area and power. Finally, if speculation causes an exceptional event to occur, such as a cache or TLB miss, the potential for significant performance loss increases, if that event would not have occurred without speculation.
To maintain most of the advantage while minimizing the disadvantages, most pipelines with speculation will allow only low-cost exceptional events (such as a first-level cache miss) to be handled in speculative mode. If an expensive exceptional event occurs, such as a second-level cache miss or a translation lookaside buffer (TLB) miss, the processor will wait until the instruction causing the event is no longer speculative before handling the event. Although this may slightly degrade the performance of some programs, it avoids significant performance losses in others, especially those that suffer from a high frequency of such events coupled with less-than-excellent branch prediction.
Value prediction
Attempts to predict the value produced by an instruction, e.g., a load of a value that changes infrequently
Value prediction is useful only if it significantly increases ILP
The focus of research has been on loads; results have been so-so, and no processor uses value prediction
A related topic is address-aliasing prediction: will a load and a store (RAW), or two stores (WAW), touch the same address?
Address-alias prediction is both more stable and simpler, since we need not actually predict the address values, only whether such values conflict
It has been used by a few processors
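To make the idea concrete, here is a sketch of the simplest scheme, a last-value predictor for loads: predict that a load will produce the same value it produced last time. This is purely illustrative; as noted above, no shipping processor uses value prediction.

```python
# Minimal sketch of a last-value predictor for loads, indexed by load PC.

class LastValuePredictor:
    def __init__(self):
        self.last = {}        # load PC -> last value produced

    def predict(self, pc):
        """Return (hit, value): a prediction if this load has been seen before."""
        if pc in self.last:
            return True, self.last[pc]
        return False, None

    def update(self, pc, value):
        """After the load completes, remember the value it actually produced."""
        self.last[pc] = value
```

A load that keeps returning the same value is predicted correctly from its second execution onward; dependents can then run speculatively with the predicted value and must be squashed if verification against the real load fails.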
Accuracy of Return Address Predictor
