Dynamic BTB Resizing for Variable Stages Superscalar Architecture
Authors : Tomoyuki Nakabayashi, Takahiro Sasaki, and Toshio Kondo
Presented by : Mohammed Ahmed Bhati (7858), Apoorva Bhole (7859)
Outline :
• Introduction
• Problem Statement
• Variable Stages Pipeline
• Related Work
• Methodology
• Motivation and BTB Resizing Technique for VSP Architecture
• Evaluation
  A. Evaluation of prediction accuracy
  B. Dynamic energy optimization
  C. Leakage energy reduction
• Summary and Future Work
• References
Introduction :
• Recent advances in computers have increased processor energy consumption,
so many techniques that achieve high performance with low energy
consumption have been developed across the research fields of devices,
microarchitecture, and software.
• A variable stages pipeline (VSP) technique, similar to dynamic pipeline
scaling (DPS) and pipeline stage unification (PSU), has been proposed.
• A VSP processor dynamically varies its pipeline depth and clock frequency
according to program behaviour, and unifies multiple pipeline stages into one
stage for low-energy operation when the processor workload is light.
• In modern high-performance computers, superscalar architecture is widely
used to improve performance by extracting instruction-level parallelism
(ILP) and thread-level parallelism (TLP).
• The sizes of a superscalar processor's units are chosen for the intended
energy-performance trade-off at peak performance. The VSP technique is
therefore extended so that the sizes and structures of the units change after
pipeline unification to rebalance the energy-performance trade-off.
• Superscalar processors use a branch target buffer (BTB) to obtain target
addresses for branches that are predicted taken. The large BTBs adopted in
superscalar processors consume a large amount of energy, so low-energy BTB
techniques have been proposed.
• The proposed technique resizes the BTB along with pipeline scaling. In
addition, to keep prediction accuracy from degrading, BTB updates are
limited by branch instruction type after the BTB size is reduced.
• The BTB size can be reduced to one-eighth after pipeline unification with
only 0.02% degradation in prediction accuracy. This yields a 9.2% reduction
in the dynamic energy of the processor core, and the leakage energy
consumption in the BTB is reduced by 87.5%.
Problem Statement :
• A deeper superscalar pipeline achieves higher performance but consumes
more energy. To reduce the energy of a deeply pipelined processor, a variable
stages pipeline (VSP) architecture is proposed, which reduces energy
consumption by dynamically unifying pipeline stages according to program
behaviour.
Variable Stages Pipeline :
• Figure 1 shows the basic concept of the approaches. Generally, a deeper
superscalar pipeline achieves a higher clock frequency and performance, but
consumes more energy.
• In contrast, a shallower superscalar pipeline lowers the clock frequency and
performance, but it is more energy efficient because it has fewer pipeline
registers and resolves data and control dependencies earlier.
• When the workload on a VSP processor is light, the processor lowers the
clock frequency and unifies multiple stages to form a shallower pipeline,
which saves energy.
• To vary the pipeline depth dynamically, pipeline registers are replaced with
the circuit shown in Fig. 2.
• The proposed technique can be adopted not only in VSP but also in PSU and
DPS.
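The mode switch described above can be sketched as a small software model. This is only an illustration: the `PipelineMode` fields, the 0.5 workload threshold, the 2-to-1 unification factor, and the halved clock are hypothetical values chosen for the example, not figures from the paper, and a real VSP processor makes this decision in hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineMode:
    name: str
    unify_factor: int   # how many adjacent stages are merged into one
    clock_scale: float  # clock frequency relative to the deep mode


# Hypothetical modes: the deep mode is the baseline; the shallow mode
# merges every two stages and halves the clock frequency.
DEEP = PipelineMode("deep", unify_factor=1, clock_scale=1.0)
SHALLOW = PipelineMode("shallow", unify_factor=2, clock_scale=0.5)


def select_mode(workload: float, light_threshold: float = 0.5) -> PipelineMode:
    """Pick the shallow mode when the measured workload is light."""
    return SHALLOW if workload < light_threshold else DEEP


def effective_depth(base_depth: int, mode: PipelineMode) -> int:
    """Number of stages left after unification (rounded up)."""
    return -(-base_depth // mode.unify_factor)
```

How the workload is measured is left open here; the sketch only shows that a light workload selects the unified, slower-clocked pipeline.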
Related Work :
• An earlier BTB resizing technique was proposed to reduce energy
consumption, but it requires a profiling phase and relies on software. By
contrast, the proposed resizing technique is implemented purely in hardware
and does not require any profiling.
• Lazy BTB aims at BTB energy reduction by filtering out redundant BTB
lookups using dynamic profiling. However, Lazy BTB degrades performance
by 1.7% and adds a small penalty (two cycles) to a branch misprediction. The
proposed technique incurs only a 0.05% performance loss on average on a
4-width superscalar processor.
Methodology :
• For performance evaluation and power estimation, FabScalar is used as the
baseline processor.
• A 2-way BTB is implemented in FabScalar, and the baseline size is set to
2K entries because a 2K-entry BTB achieves almost the maximum
performance (a 4K-entry BTB improves performance by only 0.1%).
• Table I shows the configurations for FabScalar.
• For RTL simulation to evaluate performance, the SPEC CPU2000 integer
benchmark suite is used.
• Table II shows the six benchmark programs and their reference input sets.
Each benchmark program is fast-forwarded to its single simulation point
specified by SimPoint.
• Other multi-ported RAMs are implemented by duplicating the dual-ported
memory macros provided by the chip vendor.
• Power consumption is estimated with PrimeTime PX version D-2010.06,
using a synthesized netlist with the clock tree inserted.
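To make the baseline configuration concrete, the following is a minimal software sketch of a 2-way set-associative BTB with 2K entries (1,024 sets). It is not FabScalar's RTL: the class name, the 4-byte instruction alignment, and the replace-the-oldest-update policy are assumptions made for the example.

```python
class SetAssociativeBTB:
    """Toy 2-way set-associative BTB keyed by branch PC (illustrative only)."""

    def __init__(self, entries: int = 2048, ways: int = 2):
        assert entries % ways == 0
        self.ways = ways
        self.num_sets = entries // ways
        # Each set holds up to `ways` (tag, target) pairs,
        # most recently updated first.
        self.sets = [[] for _ in range(self.num_sets)]

    def _index_and_tag(self, pc: int):
        word = pc >> 2  # assumption: 4-byte aligned instructions
        return word % self.num_sets, word // self.num_sets

    def lookup(self, pc: int):
        """Return the stored target for a branch at `pc`, or None on a miss."""
        index, tag = self._index_and_tag(pc)
        for stored_tag, target in self.sets[index]:
            if stored_tag == tag:
                return target
        return None

    def update(self, pc: int, target: int) -> None:
        """Insert or refresh an entry, evicting the oldest update if full."""
        index, tag = self._index_and_tag(pc)
        kept = [(t, tgt) for t, tgt in self.sets[index] if t != tag]
        kept.insert(0, (tag, target))
        self.sets[index] = kept[: self.ways]
```

With the default parameters, branch PCs that are 4,096 bytes apart map to the same set, so a third such branch evicts the entry that was updated least recently.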
Motivation and BTB Resizing Technique for VSP Architecture :
Figure 3 shows the block diagram of a basic branch predictor.
Figure 4 shows the banked BTB design, which is interleaved into four banks
for 4-width fetch.
• In modern superscalar processors, the fetch stage of the pipeline consists of
multiple sub-stages, as shown in Fig. 5.
• Figure 6 shows how an instruction goes through the fetch stages under a
deeper and a shallower pipeline, respectively.
Figure 7 describes the basic microarchitecture of our BTB resizing technique.
Figure 8 shows the detailed BTB implementation of the proposed technique.
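The two ideas of the technique, shrinking the BTB when the pipeline is unified and limiting BTB updates by branch type while it is small, can be sketched in software as follows. This is a hedged illustration, not the microarchitecture of Figures 7 and 8: the direct-mapped organization, the `BranchType` categories, and the choice of which types may still update the small BTB (`small_mode_types`) are hypothetical, since the slides do not state them.

```python
from enum import Enum


class BranchType(Enum):
    CONDITIONAL = "conditional"
    DIRECT_JUMP = "direct_jump"
    CALL = "call"
    INDIRECT = "indirect"


class ResizableBTB:
    """Toy direct-mapped BTB whose active size follows the pipeline mode."""

    def __init__(self, full_entries=2048, small_entries=256,
                 small_mode_types=frozenset({BranchType.CONDITIONAL})):
        self.full_entries = full_entries
        self.small_entries = small_entries
        # Hypothetical policy: branch types allowed to update the small BTB.
        self.small_mode_types = small_mode_types
        self.active_entries = full_entries
        self.table = [None] * full_entries  # slot -> (pc, target)

    def resize(self, shallow: bool) -> None:
        """Use the small BTB in the shallow (unified) pipeline mode."""
        self.active_entries = self.small_entries if shallow else self.full_entries
        # Slots outside the active region are modeled as powered off.
        for i in range(self.active_entries, self.full_entries):
            self.table[i] = None

    def _index(self, pc: int) -> int:
        return (pc >> 2) % self.active_entries  # assumes 4-byte instructions

    def lookup(self, pc: int):
        entry = self.table[self._index(pc)]
        return entry[1] if entry is not None and entry[0] == pc else None

    def update(self, pc: int, target: int, branch_type: BranchType) -> bool:
        """Write an entry unless the small-mode update filter rejects it."""
        resized = self.active_entries < self.full_entries
        if resized and branch_type not in self.small_mode_types:
            return False
        self.table[self._index(pc)] = (pc, target)
        return True
```

Storing the full PC as the tag keeps lookups correct across resizes in this model; a hardware design would instead widen the stored tag by the index bits that are dropped.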
Evaluation :
A. Evaluation of prediction accuracy :
• Figure 9 shows the prediction accuracy when the BTB size is reduced from
2K to 64 entries. All the other parameters are the same as in Table I.
• Figure 10 compares the number of unwieldy branch mispredictions to make
clear the effectiveness of limiting the BTB update.
B. Dynamic energy optimization :
• Reducing the BTB size is not always effective for energy reduction, because
it degrades performance. Although the proposed BTB resizing technique loses
less performance than simple BTB resizing, the remaining performance loss
may still increase the energy consumption.
• Figure 11 shows the dynamic energy consumption of the superscalar
processor when the BTB size under the proposed technique is reduced from
2K to 64 entries.
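The trade-off in this subsection can be illustrated with a first-order model in which energy is average power multiplied by execution time. The function names and the inputs in the usage note are illustrative assumptions, not measurements from the paper.

```python
def relative_energy(power_saving: float, slowdown: float) -> float:
    """Energy relative to the baseline, assuming E = P * T.

    power_saving: fractional drop in average core power (0.10 means 10%).
    slowdown:     fractional increase in execution time (0.02 means 2%).
    """
    return (1.0 - power_saving) * (1.0 + slowdown)


def break_even_slowdown(power_saving: float) -> float:
    """Slowdown at which the power saving is exactly cancelled out."""
    return power_saving / (1.0 - power_saving)
```

In this model, a BTB that is too small can save a little power and still raise total energy: a hypothetical 1% power saving combined with a 2% slowdown gives a relative energy above 1.0, because the whole core runs for longer.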
C. Leakage energy reduction :
• The drowsy cache has been proposed as a low-leakage cache technique.
Applied to the BTB, the drowsy strategy lets individual BTB entries be put
into a low-energy mode, which reduces leakage energy.
• Although the drowsy strategy can significantly reduce leakage energy in the
BTB, it is difficult to implement because it requires a separate voltage
controller for each BTB entry.
• In contrast, this technique controls leakage energy at the granularity of a
large block of contiguous BTB entries or a memory macro, so a coarse-grained
method such as dynamic voltage and frequency scaling (DVFS) or power
gating can be used for leakage control.
• BTB access filtering (BAF) reduces the BTB leakage energy by 83% with the
drowsy strategy, whereas this technique reduces it by up to 87.5% with an
easily implemented leakage control technique.
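The 87.5% figure is consistent with a simple proportional model, sketched below. The assumptions are that leakage scales with the number of powered BTB entries and that entries switched off after resizing leak nothing, which is why it is an upper bound; the function name is hypothetical.

```python
def leakage_reduction(full_entries: int, active_entries: int) -> float:
    """Upper bound on the fraction of BTB leakage energy saved."""
    if not 0 < active_entries <= full_entries:
        raise ValueError("active_entries must be in (0, full_entries]")
    # Assumption: leakage is proportional to the number of powered entries.
    return 1.0 - active_entries / full_entries
```

For the 2K-to-256 resizing reported here, `leakage_reduction(2048, 256)` returns 0.875, that is, 87.5%.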
Summary and Future Work :
• A BTB resizing technique for variable stages superscalar architecture is
proposed.
• The performance evaluation shows that the proposed technique can reduce
the BTB size from 2K to 256 entries with negligible performance
degradation.
• This yields a 9.2% reduction in the dynamic energy of the processor core
with a 0.05% performance loss on average. Furthermore, it reduces the
leakage energy consumption in the BTB by 87.5% with a practical leakage
control technique.
• As future work, the energy-performance trade-off for other components such
as the physical register file and reorder buffer will be explored. The overall
effect on a variable stages superscalar processor that includes the proposed
BTB resizing will also be investigated.
References :
• [1] Y. Ichikawa, T. Sasaki, T. Hironaka, T. Kitamura, and T. Kondo, Low
Energy Consumption by a Variable Stages Pipeline Technique, Proceedings of
International Technical Conference on Circuits/Systems Computers and
Communications, 0358, 6C1L-4, July 2004.
• [2] T. Sasaki, Y. Ichikawa, T. Hironaka, T. Kitamura, and T. Kondo,
Evaluation of Low Energy and High Performance Processor using Variable
Stages Pipeline Technique, IET Journal of Computer and Digital Techniques,
Vol. 2, No. 3, pp. 230-238, April 2008.
• [3] T. Nakabayashi, T. Sasaki, K. Ohno, and T. Kondo, Design and
Evaluation of Variable Stages Pipeline Processor with Low Energy
Techniques, IET Journal of Computers and Digital Techniques, Vol. 6,
Issue 1, pp. 43-49, January 2012.
• [4] J. Koppanalil, P. Ramrakhyani, S. Desai, A. Vaidyanathan, and E.
Rotenberg, A Case for Dynamic Pipeline Scaling, Proceedings of International
Conference on Compilers, Architecture, and Synthesis for Embedded Systems
2002, pp. 1-8, October 2002.
• [5] H. Shimada, H. Ando, and T. Shimada, Pipeline Stage Unification: A
Low-Energy Consumption Technique for Future Mobile Processors,
Proceedings of the International Symposium on Low Power Electronics and
Design 2003, pp. 326-329, August 2003.
• [6] H. Shimada, H. Ando, and T. Shimada, A Hybrid Power Reduction
Scheme Using Pipeline Stage Unification and Dynamic Voltage Scaling,
Proceedings of the 9th IEEE Symposium on Low-Power and High-Speed
Chips, pp. 201-214, April 2006.
• [7] M.C. Huang, D. Chaver, L. Pinuel, M. Prieto, and F. Tirado, Customizing
the Branch Predictor to Reduce Complexity and Energy Consumption, IEEE
Micro, Volume 23, Issue 5, pp. 12-25, 2003.
• [8] Yen-Jen Chang, Lazy BTB: Reduce BTB Energy Consumption using
Dynamic Profiling, Proceedings of the 11th Asia and South Pacific Design
Automation Conference (ASP-DAC 2006), pp. 917-922, January 2006.
• [9] S. Wang, J. Hu, and S.G. Ziavras, Exploring Branch Target Buffer Access
Filtering for Low-energy and High-performance Microarchitectures, IET
Computers & Digital Techniques, Volume 6, Issue 1, pp. 50-58, 2012.
• [10] K. Flautner, N. Kim, S. Martin, D. Blaauw, and T. Mudge, Drowsy
Caches: Simple Techniques for Reducing Leakage Power, Proceedings of the
29th International Symposium on Computer Architecture (ISCA-29),
pp. 148-157, May 2002.
• [11] N. K. Choudhary, S. V. Wadhavkar, T. A. Shah, H. Mayukh, J. Gandhi,
B. H. Dwiel, S. Navada, H. H. Najaf-abadi, and E. Rotenberg, FabScalar:
Composing synthesizable RTL designs of arbitrary cores within a canonical
superscalar template, Proceedings of the 38th IEEE/ACM International
Symposium on Computer Architecture (ISCA-38), pp. 11-22, June 2011.
• [12] G. F. Grohoski, Machine Organization of the IBM RS/6000 Processor,
IBM Journal of R&D, Vol. 34, No. 1, pp. 37-58, January 1990.
• [13] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, Automatically
Characterizing Large Scale Program Behavior, 10th International Conference
on Architectural Support for Programming Languages and Operating
Systems, pp. 45-57, October 2002.
• [14] N. K. Choudhary, B. H. Dwiel, and E. Rotenberg, A Physical Design
Study of FabScalar-generated Superscalar Cores, Proceedings of the 2012
IEEE/IFIP 20th International Conference on VLSI and System-on-Chip
(VLSI-SoC), pp. 165-170, October 2012.
