
Implementation of Branch Prediction by Dynamic Dataflow-based Identification of Correlated Branches
Srinivas Rao Narne
Department of Electrical and Computer Engineering
University of Florida
Gainesville, USA
snarne1@ufl.edu

Manmeet Singh Khurana
Department of Electrical and Computer Engineering
University of Florida
Gainesville, USA
mkhurana@ufl.edu
Abstract
Advances aimed at meeting the ever-growing demand for processor speed have led to longer pipelines and wider issue widths in pursuit of higher throughput. If this trend continues, the branch misprediction penalty will become very high; branch misprediction is the single most significant limiter when improving processor performance through deeper pipelining. To cope with deeper pipelines and wider issue widths, the default branch prediction schemes are no longer sufficient.
Our branch prediction techniques therefore rely on using a long global history and identifying correlated branches in this history using runtime dataflow information. Such a predictor uses a collection of predictors, each of which provides its prediction at a different stage of the pipeline front-end. A simple 1-cycle-latency line predictor provides predictions in the first stage, followed a couple of stages later by predictions from a more accurate global predictor. Finally, one or two stages later, a highly accurate corrector predictor selectively corrects the global predictor's prediction.
Introduction
The vibrant information technology industry today relies on computers as its core products to deal with all sorts of information, and computers in turn must deliver on the expectations of a fast-evolving IT industry. For this reason, researchers and computer architects have been striving for the past 30-40 years to improve computer performance in every way possible, and they have regularly come up with new ideas ranging from increasing clock speed, exploiting parallelism, and pipelining, to developing multicore processors, dealing with branch hazards, and using different memory hierarchies.
The execution of many computer programs depends on conditions that cannot be known beforehand, such as the outcome of a conditional branch. To deal with this, digital circuits called branch predictors are designed, which guess the direction a branch will take before it is known for sure. This is done to improve instruction pipeline flow, and the technique is used in almost all modern ISAs, such as x86, because it helps achieve effective performance. Essentially, it resolves a branch hazard by prediction: execution is carried out under the predicted outcome, and if the prediction goes wrong, the instructions executed after the prediction are flushed and the correct address is fetched. There are many possible ways to deal with branch hazards (also known as control hazards): detect and wait, detect and forward, detect and eliminate, use of branch delay slots, and branch prediction (using a BTB), among others. But processor pipelines have been growing deeper and issue widths wider over the years, and deep pipelines and fast clock rates are necessitating the development of high-accuracy, multi-stage branch predictors for future processors.
In particular, we identify for each dynamic branch a set of branches called affectors, which control the computation that determines that branch's outcome. The next step is building an Affector Register File (ARF) using the affector information. The number of entries in the ARF is the same as the number of architectural registers, and each ARF register corresponds to an architectural register. An Affector Branch Bitmap (ABB) is then generated from the affector register file: when the processor encounters a conditional branch instruction, the ARF entries corresponding to its source registers are read and ORed to generate the ABB. Finally, we obtain the Modified Global History by performing an AND operation between the ABB and the global history, which in turn is used in branch predictors to make predictions based on the affector information.
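To make the flow concrete, the following is a minimal C sketch of forming the ABB and the modified global history for one conditional branch (C being the implementation language of SimpleScalar, which we use later). The table size, the XOR index hash, and the fixed two-source-register branch are illustrative assumptions, not the exact design.

/* Minimal sketch: ABB and modified global history for one branch. */
#include <stdint.h>

#define NUM_REGS 32
#define PHT_SIZE 4096

static uint32_t arf[NUM_REGS];    /* Affector Register File */
static uint32_t global_history;   /* outcomes of recent branches */
static uint8_t  pht[PHT_SIZE];    /* 2-bit prediction counters */

int predict_branch(uint32_t pc, int src1, int src2) {
    uint32_t abb   = arf[src1] | arf[src2];  /* OR source entries -> ABB */
    uint32_t mhist = global_history & abb;   /* AND -> modified history */
    uint32_t idx   = (mhist ^ (pc >> 2)) & (PHT_SIZE - 1);
    return pht[idx] >= 2;                    /* predict taken if MSB set */
}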
Other Techniques and Motivation
The importance of efficient branch prediction techniques has soared in the recent past due to the desire for increased levels of parallelism in processors. Different branch prediction techniques have been implemented; some of the most popular are:
1. Taken: all conditional branches are always predicted to be taken.
2. Not-Taken: all conditional branches are predicted to be not taken.
3. Two-Level: consists of two levels of branch prediction tables, correlating a branch's outcome with recent branch history through lookup tables.
4. Bimodal: implements a per-branch state machine with four states (a 2-bit saturating counter), updated based on the branch's own past outcomes; a minimal sketch follows this list.
5. Combinational: a combination of both the two-level and bimodal predictors.
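For reference, item 4 can be sketched in a few lines of C. The table size and PC hash are illustrative choices, not SimpleScalar's exact defaults.

/* Minimal sketch of a bimodal predictor: a table of 2-bit saturating
   counters indexed by the low bits of the branch PC. */
#include <stdint.h>

#define BIMOD_SIZE 2048

static uint8_t bimod[BIMOD_SIZE];  /* 0-1 predict not-taken, 2-3 taken */

int bimod_predict(uint32_t pc) {
    return bimod[(pc >> 2) & (BIMOD_SIZE - 1)] >= 2;
}

void bimod_update(uint32_t pc, int taken) {
    uint8_t *ctr = &bimod[(pc >> 2) & (BIMOD_SIZE - 1)];
    if (taken) { if (*ctr < 3) (*ctr)++; }   /* strengthen toward taken */
    else       { if (*ctr > 0) (*ctr)--; }   /* strengthen toward not-taken */
}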
After analyzing these techniques, which have already been implemented in SimpleScalar (the results of the analysis were shown in the progress report submitted earlier), the importance of better branch prediction techniques has become obvious. For processors with deeper pipelines and wider issue widths, a new idea is needed. Global branch predictors improve prediction accuracy by correlating a branch's outcome with the history of the preceding dynamic branches, but the conventional approach runs into difficulty with large predictors. For smaller predictors (less than 16KB), whose accuracies are mainly limited by destructive interference, a small increase in predictor size delivers a large improvement in prediction accuracy, and these accuracies can be further improved by applying specific interference-reduction techniques. By contrast, for larger predictors (more than around 64KB), the lack of enough information in the global history to generate an accurate prediction limits the prediction accuracy. For example, a 1MB 2bc-gskew predictor provides only nominal prediction accuracy improvements compared to a 64KB 2bc-gskew predictor.

This is because existing global history predictors use the most recent branches for correlation, and a linear increase in the global history length causes an exponential increase in predictor size: each additional history bit doubles the number of pattern-table entries. Further, the additional branches included may not all be correlated, and they preclude the inclusion of highly correlated branches from farther in the past of the global history. Thus, only the affectors need to be determined and worked upon. A branch becomes an affector for a future branch if it can affect the outcome of the future branch by choosing whether or not certain instructions that directly affect the future branch's source operands are executed. Because affectors have a direct effect on a future branch's outcome, they have an unusually high correlation with the branch they affect. Evers et al. identified two primary reasons for two branches to be correlated. The first is that the preceding branch's outcome affects the computation that determines the outcome of the succeeding branch; in our terminology, the former branch is considered an affector of the latter. The second is that the computations affecting their outcomes are (fully or partially) based on the same (or related) information; in this case, we call the former branch a forerunner of the latter.
We explored two prediction schemes that use the affector branch information for prediction. In the first scheme (zeroing), the identified affector branches are retained at their respective positions in the long global history when creating the predictor history. In the second scheme (packing), the identified affector branches are collected together as an ordered set when creating the predictor history. These affector histories are then used in a corrector predictor that selectively corrects the predictions of a large primary global predictor.
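The following is a minimal C sketch of the two schemes, assuming a 32-bit global history and a 32-bit affector bitmap (bit i set meaning the i-th most recent branch is an affector); the real design uses a much longer history.

#include <stdint.h>

/* Zeroing: non-affector outcomes are cleared to 0; affector outcomes
   stay at their original positions in the history. */
uint32_t zeroing(uint32_t ghist, uint32_t abb) {
    return ghist & abb;
}

/* Packing: affector outcomes are collected together as an ordered set
   at the low end of the history, dropping the gaps. */
uint32_t packing(uint32_t ghist, uint32_t abb) {
    uint32_t packed = 0;
    int n = 0;
    for (int i = 0; i < 32; i++)
        if (abb & (1u << i))
            packed |= ((ghist >> i) & 1u) << n++;
    return packed;
}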
We require a longer global history than a conventional predictor because, for a branch under prediction, some of the correlated branches may have appeared a large distance back in the dynamic instruction stream. This can happen, for instance, if two correlated branches are separated by a call to a function containing many branches; by the time the fetch unit returns from the function, the recorded global history may contain only the outcomes of the branches in that function. A longer history is likely to capture the outcome of a correlated branch that appeared in the dynamic instruction stream prior to the function call.

Implementation

1. Identification of Affector Branches

Although static analysis can be employed for affector identification, its effectiveness may vary compared to dynamic identification of correlated branches. To understand the dynamic identification of correlated branches, consider the control flow graph (CFG) given in Figure 1. Assume that control flows through the shaded path and that we are interested in predicting branch B8, the last branch in the CFG. In a conventional global branch predictor, its history will be a pattern that records the latest outcomes of branches B0, B2, B3, B5, and B7 (i.e., TNNTN, assuming the pattern to be 5 bits long).

Figure 1: Example Dataflow Graph

The source operands of branch B8 are obtained from registers R2 and R3. The R2 value is produced in BB3, which in turn is fed with a value (R1) produced in BB2. The R3 value used by B8 is produced in BB7. Thus, the affector basic blocks of B8 within this CFG are BB2, BB3, and BB7 (marked with darker shades in Figure 1). The branches that decided that control should go through these basic blocks are B0, B2, and B5; these branches are marked with circles in Figure 1 and are the affectors of this instance of B8.

The affector information is obtained from the dataflow graph as a bitmap at each node: a 1 in the bitmap indicates an affector branch and a 0 indicates a non-affector branch. Therefore, in the figure, the bitmap for B8 would be 11010.

2. Generation of Affector Register File and Affector Bitmap of Branches

In order to generate the affector information dynamically, a separate record of the affector information of each architectural register must be maintained. For this purpose, an Affector Register File (ARF) is created. We've implemented the affector register file in SimpleScalar as a two-dimensional array. The number of entries in the ARF is the number of architectural registers, so we set the number of rows in the array to 32, and the number of columns (the width of each affector bitmap) to 32 as well. The ARF structure and an example ARF entry are shown in Figure 2.

Figure 2: Affector Register File

The ARF entry of an architectural register is updated based on the type of instruction being executed. Instructions can be classified into three categories:

1. Conditional branch instructions
2. Register-writing instructions
3. Non-register-writing instructions

For a conditional branch instruction, the ARF entries corresponding to its operands are ORed and the result is loaded into the Affector Branch Bitmap (ABB); then all entries of the ARF are shifted left by one bit and a zero is placed in the least significant position of each ARF entry. For a register-writing instruction, the ARF entry corresponding to the destination register is replaced with the ORed result of the ARF entries corresponding to the source registers, and a 1 is placed in the least significant position. Non-register-writing instructions leave the ARF unchanged.
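The update rules above can be sketched in C as follows. The fixed two-source-operand signature is a simplifying assumption; real instructions may read one or several registers.

#include <stdint.h>

#define NUM_REGS 32

static uint32_t arf[NUM_REGS];   /* one 32-bit affector bitmap per register */

/* Conditional branch: OR the source registers' entries into the ABB,
   then shift every ARF entry left and insert 0 (the new branch is not
   yet an affector of any register). */
uint32_t on_cond_branch(int src1, int src2) {
    uint32_t abb = arf[src1] | arf[src2];
    for (int r = 0; r < NUM_REGS; r++)
        arf[r] <<= 1;
    return abb;
}

/* Register-writing instruction: the destination inherits the affectors
   of its sources, and the most recent branch (LSB) is marked as an
   affector of the destination register. */
void on_reg_write(int dest, int src1, int src2) {
    arf[dest] = arf[src1] | arf[src2] | 1u;
}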
An informal affector bitmap generation algorithm is shown in Figure 3.

Figure 3: Algorithm of Implementation of ARF & ABB

An illustration of the affector bitmap determination for the example given in Figure 1 is shown in Figure 4.

Figure 4: Contents of Affector Register File (ARF) and Affector Bitmap of Branch (ABB)

The code snippet for the implementation of the affector register file and the affector bitmap of branches is shown in Figure 5. The functionality for identifying and operating on the source and destination information of an instruction is not available in sim-bpred.c; this functionality has been added by referring to sim-outorder.c, and Figure 5 shows part of the code used to find the sources and destinations.

Figure 5: Code Snippet for Creation of ARF and ABB

3. Elimination of Non-affector Information

After determining the affector information in the form of the affector bitmap of branches, it is necessary to eliminate all the non-affector information and use only the affector information in branch prediction. This can be done in two ways: the zeroing scheme or the packing scheme.

In the zeroing scheme, the global history and the generated affector bitmap are ANDed and folded; the result is then used with a lookup table to select a second-level counter/branch predictor.

Figure 6: Zeroing Scheme

In the packing scheme, the masked result is packed before folding it. A flow chart depicting the packing scheme is shown in Figure 7.

Figure 7: Packing Scheme
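Both schemes fold the masked (or packed) long history down to the width of a table index before the lookup. A common way to fold, and one plausible reading of the folding step here, is to XOR fixed-width chunks of the history together; the 64-bit history width below is an assumption.

#include <stdint.h>

/* Fold a long history into an index of index_bits bits by XORing
   successive chunks together. */
uint32_t fold_history(uint64_t hist, int index_bits) {
    uint32_t mask = (1u << index_bits) - 1;
    uint32_t folded = 0;
    while (hist != 0) {
        folded ^= (uint32_t)hist & mask;   /* XOR in the low chunk */
        hist >>= index_bits;               /* move to the next chunk */
    }
    return folded;   /* index into the second-level table */
}

For example, fold_history(masked_history, 12) would yield a 12-bit index into a 4096-entry table of counters.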

Results

The performance of the dynamic dataflow-based branch predictor is evaluated using the benchmarks Gzip, Crafty, Vortex, test-math, test-fmath, test-llong, and test-printf. The results show significant improvements on most of the benchmarks. The performance of our implementation is compared to the two-level branch predictor implemented in SimpleScalar by default. The following data gives a comparative account of the performance parameters of both techniques.

Benchmarks     Default    Modified
Test-math      2038       1488
Test-fmath     735        660
Test-llong     384        352
Test-printf    21212      17799

Figure 8: Number of misses of the default technique in SimpleScalar and the dynamic dataflow-based implementation

Figure 9: Comparison of the number of misses of the new implementation with the default technique in SimpleScalar (test-fmath and test-llong)

Figure 10: Comparison of the number of misses of the new implementation with the default technique in SimpleScalar (test-math and test-printf)

The comparison of miss rates (misses/total branch instructions) is shown in the following figures.

Benchmarks     Default    Modified
Test-math      27.57%     20.13%
Test-fmath     23.79%     21.37%
Test-llong     20.56%     18.84%
Test-printf    13.30%     11.16%

Figure 10: Misprediction percentage for both techniques

Figure 11: Misprediction rate comparison

The following figures show the performance for another set of benchmarks.

Benchmarks     Default    Modified
Gzip           35718      32263
Crafty         102269     86513
Vortex         47904      47609

Figure 12: Number of misses for the first 5,000,000 instructions

Figure 13: Comparison of the number of misses of the new implementation with the default technique (Gzip, Crafty, Vortex)


Benchmarks     Default    Modified
Gzip           7.15%      6.46%
Crafty         12.21%     10.33%
Vortex         7.91%      7.86%

Figure 14: Comparison of miss rate (misses/total branches)
Figure 15: Comparison of miss rates (Gzip, Crafty, Vortex)


Conclusion

The effect of dynamic dataflow-based identification of affector branches on branch prediction using a large global history has been observed and analyzed. From the comparisons of all the performance parameters, it is evident that this technique is more efficient at predicting branches than the conventional branch prediction techniques.
References

[1] R. Thomas, M. Franklin, C. Wilkerson, and J. Stark. Improving branch prediction by dynamic dataflow-based identification of correlated branches from a large global history. In Proc. 30th Int'l Symp. on Computer Architecture (ISCA-2003), June 2003.
[2] D. A. Patterson and J. L. Hennessy. Computer Organization and Design, 3rd Edition. Morgan Kaufmann Publishers, Inc., 2004.
[3] J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach, 5th Edition. Morgan Kaufmann Publishers, Inc., 2012.
[4] A. Seznec, S. Felix, V. Krishnan, and Y. Sazeides. Design tradeoffs for the Alpha EV8 conditional branch predictor. In Proc. 29th Int'l Symp. on Computer Architecture, 2002.
[5] E. Sprangle and D. Carmean. Increasing processor performance by implementing deeper pipelines. In Proc. 29th Int'l Symp. on Computer Architecture, 2002.
[6] M. Evers, S. J. Patel, R. S. Chappell, and Y. N. Patt. An analysis of correlation and predictability: What makes two-level branch predictors work. In Proc. 25th Int'l Symp. on Computer Architecture, pages 52-61, 1998.
