ECE 684
Branch Prediction

Reference: "PC Processor Microarchitecture" (ExtremeTech), http://www.extremetech.com/article2/0,1558,1155321,00.asp, and additional references.
There really are three different kinds of branches:

- Forward conditional branches: based on a run-time condition, the PC (Program Counter) is changed to point to an address forward in the instruction stream.
- Backward conditional branches: the PC is changed to point backward in the instruction stream. The branch is based on some condition, such as branching backwards to the beginning of a program loop when a test at the end of the loop indicates the loop should be executed again.
- Unconditional branches: this includes jumps, procedure calls, and returns that have no specific condition. For example, an unconditional jump instruction might be coded in assembly language as simply "jmp", and the instruction stream must immediately be directed to the target location pointed to by the jump instruction, whereas a conditional jump coded as "jmpne" would redirect the instruction stream only if the result of a comparison of two values by a previous "compare" instruction shows the values to be unequal. (The segmented addressing scheme used by the x86 architecture adds extra complexity, since jumps can be either "near" (within a segment) or "far" (outside the segment). Each type has different effects on branch prediction algorithms.)
A Closer Look At Branch Prediction

Static branch prediction predicts the same direction for a given branch throughout the whole program execution.

It comprises hardware-fixed prediction and compiler-directed prediction.

Simple hardware-fixed direction mechanisms include:
- Predict always not taken
- Predict always taken
- Backward branches predicted taken, forward branches predicted not taken

Sometimes a bit in the branch opcode allows the compiler to decide the prediction direction.
Static Branch Prediction
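The backward-taken/forward-not-taken rule above can be sketched in a few lines. This is a minimal illustration, not any particular hardware implementation; it assumes the predictor sees only the sign of the branch displacement:

```python
def static_predict(branch_offset):
    """Hardware-fixed static rule: predict taken for backward branches
    (negative displacement, typically loop-closing branches) and not
    taken for forward branches."""
    return branch_offset < 0   # True = predict taken

# A loop-closing branch jumps backward, so it is predicted taken;
# a forward skip (e.g., over an error handler) is predicted not taken.
backward_taken = static_predict(-8)
forward_taken = static_predict(12)
```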

Dynamic branch prediction: the hardware adjusts the prediction as execution proceeds.

The prediction is based on the computation history of the program.

During the start-up phase of program execution, where static branch prediction might be effective, the history information is gathered; once enough history exists, dynamic branch prediction becomes effective.

In general, dynamic branch prediction gives better results than static branch prediction, but at the cost of increased hardware complexity.
Dynamic Branch Prediction
Forward branches dominate backward branches by about 4 to 1 (whether
conditional or not). About 60% of the forward conditional branches are taken, while
approximately 85% of the backward conditional branches are taken (because of the
prevalence of program loops).
Just knowing this data about average code behavior, we could optimize our architecture
for the common cases. A "Static Predictor" can just look at the offset (distance forward
or backward from current PC) for conditional branches as soon as the instruction is
decoded.
Backward branches will be predicted to be taken, since that is the most common
case. The accuracy of the static predictor will depend on the type of code being
executed, as well as the coding style used by the programmer.
These statistics were derived from the SPEC suite of benchmarks, and many PC
software workloads will favor slightly different static behavior.
Using Branch Statistics for Static Prediction
Static Profile-Based Compiler Branch Misprediction Rates for SPEC92
[Figure: floating-point benchmarks (more loops) average 9% misprediction, i.e., 91% prediction accuracy; integer benchmarks average 15%, i.e., 85% prediction accuracy.]
Dynamic branch prediction schemes are different from static mechanisms because they
utilize hardware-based mechanisms that use the run-time behavior of branches to make
more accurate predictions than possible using static prediction.
Usually information about outcomes of previous occurrences of branches (branching
history) is used to dynamically predict the outcome of the current branch. Some of the
proposed dynamic branch prediction mechanisms include:
- One-level or bimodal: uses a Branch History Table (BHT), a table of (usually two-bit) saturating counters indexed by a portion of the branch address (low bits of the address). (First proposed mid-1980s.)
- Two-level adaptive branch prediction. (First proposed early 1990s.)
- McFarling's two-level prediction with index sharing (gshare, 1993).
- Hybrid or tournament predictors: use a combination of two or more (usually two) branch prediction mechanisms (1993).
To reduce the stall cycles resulting from correctly predicted taken branches to zero
cycles, a Branch Target Buffer (BTB) that includes the addresses of conditional
branches that were taken along with their targets is added to the fetch stage.
Dynamic Conditional Branch Prediction
How to further reduce the impact of branches on pipeline processor performance

Dynamic Branch Prediction:
Hardware-based schemes that utilize run-time
behavior of branches to make dynamic predictions:
Information about the outcomes of previous occurrences
of branches is used to dynamically predict the
outcome of the current branch.
Why? Better branch prediction accuracy and
thus fewer branch stalls

Branch Target Buffer (BTB):
A hardware mechanism that aims at reducing the
stall cycles resulting from correctly predicted taken
branches to zero cycles.
To refine our branch prediction, we could create a buffer that is indexed by the low-order
address bits of recent branch instructions. In this BHB (sometimes called a "Branch History
Table (BHT)"), for each branch instruction, we'd store a bit that indicates whether the branch
was recently taken. A simple way to implement a dynamic branch predictor would be to check
the BHB for every branch instruction. If the BHB's prediction bit indicates the branch should
be taken, then the pipeline can go ahead and start fetching instructions from the new address
(once it computes the target address).

By the time the branch instruction works its way down the pipeline and actually causes a
branch, then the correct instructions are already in the pipeline. If the BHB was wrong, a
"misprediction" occurred, and we'll have to flush out the incorrectly fetched instructions and
invert the BHB prediction bit.
Dynamic Branch Prediction with a Branch History Buffer (BHB)
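A minimal sketch of such a one-bit BHB in Python (illustrative, not any particular hardware; the table size and indexing scheme are assumptions):

```python
class OneBitBHB:
    """One-bit Branch History Buffer indexed by the low bits of the
    branch instruction's address."""

    def __init__(self, index_bits=12):
        self.table = [0] * (1 << index_bits)   # 0 = not taken, 1 = taken
        self.mask = (1 << index_bits) - 1

    def predict(self, pc):
        """Predict taken if the entry's bit is set."""
        return self.table[pc & self.mask] == 1

    def update(self, pc, taken):
        """After the branch resolves, record the actual outcome
        (on a misprediction this inverts the stored bit)."""
        self.table[pc & self.mask] = 1 if taken else 0
```

Running this on a loop branch that is taken nine times and then falls through shows the behavior described above: the single bit mispredicts twice per loop execution (first iteration and loop exit).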
It turns out that a single bit in the BHB will be wrong twice for a loop: once on the
first pass of the loop and once at the end of the loop. We can get better prediction
accuracy by using more bits to create a "saturating counter" that is incremented on
a taken branch and decremented on an untaken branch. It turns out that a 2-bit
predictor does about as well as you could get with more bits, achieving anywhere
from 82% to 99% prediction accuracy with a table of 4096 entries.
This table size is at the point of diminishing returns for 2-bit entries, so there isn't
much point in storing more. Since we're only indexing by the lower address bits,
notice that two different branch addresses might have the same low-order bits and
could point to the same place in our table: one reason not to let the table get too
small.
Refining Our BHB by Storing More Bits
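The 2-bit saturating counter can be sketched and compared against the one-bit scheme on the loop pattern described above (a sketch; the 9-taken/1-not-taken loop workload is illustrative):

```python
class TwoBitPredictor:
    """2-bit saturating counter: the counter stays in 0..3, and the high
    bit gives the prediction (counter >= 2 means predict taken)."""

    def __init__(self):
        self.counter = 0               # start at strongly not taken

    def predict(self):
        return self.counter >= 2

    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

# Run a loop branch (9 taken iterations, then exit) twice and count
# mispredictions per run.
p = TwoBitPredictor()
miss_per_run = []
for _ in range(2):
    misses = 0
    for taken in [True] * 9 + [False]:
        if p.predict() != taken:
            misses += 1
        p.update(taken)
    miss_per_run.append(misses)
```

In steady state the 2-bit counter mispredicts only the loop exit (one miss per execution), whereas a one-bit entry also mispredicts the first iteration of the next execution.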

There is a further refinement we can make to our BHB by correlating the behavior of
other branches. Often called a "Global History Counter", this "two-level predictor"
allows the behavior of other branches to also update the predictor bits for a particular
branch instruction and achieve slightly better overall prediction accuracy. One
implementation is called the "GShare algorithm".
This approach uses a "Global Branch History Register" (a register that stores the
global result of recent branches) that gets "hashed" with bits from the address of the
branch being predicted. The resulting value is used as an index into the BHB where
the prediction entry at that location is used to dynamically predict the branch direction.
Yes, this is complicated stuff, but it's being used in several modern processors.
Two-Level Predictors and the GShare Algorithm
Combined branch prediction*
Scott McFarling proposed combined branch prediction in his 1993 paper. Combined branch prediction
is about as accurate as local prediction, and almost as fast as global prediction.
Combined branch prediction uses three predictors in parallel: bimodal, gshare, and a bimodal-like
predictor to pick which of bimodal or gshare to use on a branch-by-branch basis. The choice predictor
is yet another 2-bit up/down saturating counter, in this case the MSB choosing the prediction to use.
In this case the counter is updated whenever the bimodal and gshare predictions disagree, to favor
whichever predictor was actually right.
On the SPEC'89 benchmarks, such a predictor is about as good as the local predictor.
Another way of combining branch predictors is to have e.g. 3 different branch predictors, and merge
their results by a majority vote.
Predictors like gshare use multiple table entries to track the behavior of any particular branch.
This multiplication of entries makes it much more likely that two branches will map to the same
table entry (a situation called aliasing), which in turn makes it much more likely that prediction
accuracy will suffer for those branches. Once you have multiple predictors, it is beneficial to arrange
that each predictor will have different aliasing patterns, so that it is more likely that at least one
predictor will have no aliasing. Combined predictors with different indexing functions for the different
predictors are called gskew predictors, and are analogous to skewed associative caches used
for data and instruction caching.
* From: http://en.wikipedia.org/wiki/Branch_prediction
In addition to a large BHB, most predictors also include a buffer that stores the actual target
address of taken branches (along with optional prediction bits). This table allows the CPU to
look to see if an instruction is a branch and start fetching at the target address early on in
the pipeline processing. By storing the instruction address and the target address, even
before the processor decodes the instruction, it can know that it is a branch.

A large BTB can completely remove most branch penalties (for correctly predicted
branches) if the CPU looks far enough ahead to make sure the target instructions are
pre-fetched.

Using a return address buffer to predict the return from a subroutine: one technique
for dealing with the unconditional branch at the end of a subroutine is to create a
buffer of the most recent return addresses.
There are usually some subroutines that get called quite often in a program, and a return
address buffer can make sure that the correct instructions are in the pipeline after the return
instruction.
Using a Branch Target Buffer (BTB) to Further Reduce the
Branch Penalty
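The return address buffer described above behaves like a small stack: push the fall-through address on a call, pop it to predict the target of the matching return. A minimal sketch (the depth and overflow policy are assumptions):

```python
class ReturnAddressStack:
    """Small return-address predictor: push on call, pop on return."""

    def __init__(self, depth=8):
        self.stack = []
        self.depth = depth

    def on_call(self, return_addr):
        if len(self.stack) == self.depth:
            self.stack.pop(0)          # discard the oldest entry on overflow
        self.stack.append(return_addr)

    def predict_return(self):
        """Predicted target of a return, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None
```

Nested calls return in last-in, first-out order, which is exactly what the stack captures.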
Branch Target Buffer (BTB)
Effective branch prediction requires the target of the branch at an early
pipeline stage (i.e., resolve the branch early in the pipeline).
One can use additional adders to calculate the target as soon as the branch
instruction is decoded. This would mean waiting until the ID stage before the
target of the branch can be fetched, so taken branches would be fetched with a
one-cycle penalty (this was done in the enhanced MIPS pipeline).
To avoid this problem one can use a Branch Target Buffer (BTB). A typical
BTB is an associative memory where the addresses of taken branch
instructions are stored together with their target addresses.
Some designs store n prediction bits as well, implementing a combined
BTB and Branch History Table (BHT).
Instructions are fetched from the target stored in the BTB when the branch
is predicted taken and found in the BTB. After the branch has been resolved, the
BTB is updated. If a branch is encountered for the first time, a new entry is
created once it is resolved as taken.
Branch Target Instruction Cache (BTIC): a variation of the BTB which also
caches the code of the branch target instruction in addition to its address. This
eliminates the need to fetch the target instruction from the instruction cache
or from memory.
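A BTB of this kind can be sketched as an associative map from taken-branch addresses to their targets. This is illustrative only: a real BTB has limited associativity, partial tags, and a replacement policy.

```python
class BTB:
    """Sketch of a Branch Target Buffer: addresses of taken branches
    stored together with their target addresses."""

    def __init__(self):
        self.entries = {}              # branch PC -> target PC

    def lookup(self, pc):
        """BTB hit returns the predicted target; a miss returns None
        (fetch then falls through to the next sequential address)."""
        return self.entries.get(pc)

    def update(self, pc, taken, target):
        """Called after the branch resolves: allocate/refresh on a taken
        branch, drop the entry if the branch was not taken."""
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)
```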
BTB
[Figure: Branch Target Buffer organization.]
BTB Flow
[Figure: BTB operation across the Fetch, Decode, and Execute stages, with the prediction output.]
BTB Penalties
Branch penalty cycles using a Branch Target Buffer (BTB), assuming one more
stall cycle to update the BTB:

Instruction in BTB   Prediction   Actual branch   Penalty cycles
Yes                  Taken        Taken           0
Yes                  Taken        Not taken       2
No                   --           Taken           2
No                   --           Not taken       0

Misprediction/miss penalty = 1 + 1 = 2 cycles.
Base pipeline taken-branch penalty = 1 cycle (i.e., branches resolved in ID).
Dynamic Branch Prediction
Simplest method (one-level):
A branch prediction buffer or Branch History Table (BHT) is indexed by low
address bits of the branch instruction.
Each buffer location (or BHT entry) contains one bit indicating whether the
branch was recently taken or not (e.g., 0 = not taken, 1 = taken).
This always mispredicts in the first and last loop iterations.

To improve prediction accuracy, two-bit prediction is used (the Smith algorithm):
A prediction must miss twice before it is changed.
Thus, a branch involved in a loop will be mispredicted only once when
encountered the next time, as opposed to twice when one bit is used.
Two-bit prediction is a specific case of an n-bit saturating counter,
incremented when the branch is taken and decremented when the branch is not taken.
Two-bit prediction counters are usually used, based on observations that the
performance of two-bit BHT prediction is comparable to that of n-bit predictors.
The counter (predictor) used is updated after the branch is resolved.

[Figure: the N low bits of the branch address index the BHT; each one-bit
entry records 0 = NT = not taken, 1 = T = taken.]
One-Level (Bimodal) Branch Predictors
One-level or bimodal branch prediction uses only one level of branch
history.
These mechanisms usually employ a table indexed by the lower N
bits of the branch address.
Each table entry (or predictor) consists of n history bits, which form an
n-bit automaton or saturating counter.
Smith proposed such a scheme, known as the Smith algorithm, that uses
a table of two-bit saturating counters (1985).
One rarely finds the use of more than 3 history bits in the literature.
Two variations of this mechanism:
- Pattern History Table (PHT): consists of directly mapped entries.
- Branch History Table (BHT): stores the branch address as a tag.
  It is associative and enables one to identify the branch
  instruction during IF by comparing the address of an instruction
  with the stored branch addresses in the table (similar to a BTB).
One-Level (Bimodal) Branch Predictors
[Figure: the N low bits of the branch address index a table with 2^N entries
(also called predictors). Each entry is a 2-bit saturating counter (00, 01,
10, 11); the high bit determines the branch prediction (0 = NT = not taken,
1 = T = taken).]

Example: for N = 12, the table has 2^N = 2^12 = 4096 = 4K entries, so the
number of bits needed is 2 x 4K = 8K bits.

The table is sometimes referred to as a Decode History Table (DHT) or
Branch History Table (BHT).

Update the counter after the branch is resolved:
- Increment the counter used if the branch is taken.
- Decrement the counter used if the branch is not taken.

What if different branches map to the same predictor (counter)?
This is called branch address aliasing; it leads to interference with the current
branch's prediction by other branches and may lower branch prediction accuracy for
programs with aliasing.
Branch History Table (BHT)
[Figure: the N low bits of the branch address index the BHT of 2-bit
saturating counters (00, 01, 10, 11); the high bit determines the branch
prediction (0 = NT = not taken, 1 = T = taken).]
Basic Dynamic Two-Bit Branch Prediction:
Two-Bit Predictor State Transition Diagram
(Or: two-bit saturating counter predictor state transition diagram, Smith algorithm.)
[Figure: the four counter states 11, 10, 01, 00. A taken branch moves the
counter toward 11; a not-taken branch moves it toward 00. The high bit gives
the prediction: 0 = NT = not taken, 1 = T = taken.]
* From: "New Algorithm Improves Branch Prediction", Vol. 9, No. 4,
March 27, 1995, MicroDesign Resources.
Prediction Accuracy of a 4096-Entry Basic One-Level Dynamic Two-Bit
Branch Predictor
[Figure: misprediction rates for N = 12 (2^N = 4096 entries). Integer
benchmarks average 11%; FP benchmarks average 4%. The FP misprediction rate
is lower due to more loops; integer code has more branches involved in
if-then-else constructs than FP code.]
McFarling's gshare Predictor
McFarling noted (1993) that using global history information might be less
efficient than simply using the address of the branch instruction, especially
for small predictors.
He suggested using both global history (BHR) and the branch address by
hashing them together. He proposed using the XOR of the global branch
history register (BHR) and the branch address, since he expected this value
to carry more information than either of its components. The result is that
this mechanism outperforms the GAp scheme by a small margin.
The hardware cost for k history bits is k + 2 x 2^k bits, neglecting the
cost of logic.
gshare = global history with index sharing
gshare is one of the most widely implemented two-level dynamic branch
prediction schemes.
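The gshare scheme described above can be sketched as follows. This is an illustrative simulation, not any shipping design; the history length k and initial counter values are assumptions:

```python
class Gshare:
    """gshare sketch: a k-bit global branch history register (BHR) is
    XORed with the low bits of the branch address, and the result indexes
    a single pattern history table (PHT) of 2-bit saturating counters."""

    def __init__(self, k=12):
        self.k = k
        self.mask = (1 << k) - 1
        self.bhr = 0                      # global branch history register
        self.pht = [1] * (1 << k)         # counters start weakly not taken

    def _index(self, pc):
        return (self.bhr ^ pc) & self.mask   # bitwise XOR index sharing

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2   # high bit = predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.pht[i] = min(3, self.pht[i] + 1)
        else:
            self.pht[i] = max(0, self.pht[i] - 1)
        # shift the outcome into the global history register
        self.bhr = ((self.bhr << 1) | int(taken)) & self.mask
```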
gshare Predictor
Branch and pattern history are kept globally. The history and the branch
address are XORed (bitwise), and the result is used to index the pattern
history table.
[Figure: first level: the k-bit global Branch History Register (BHR).
Second level: one Pattern History Table (PHT) with 2^k entries of 2-bit
saturating counters (predictors). The BHR XORed with the branch address
indexes the second level; here m = N = k. gshare = global history with
index sharing.]
gshare Performance
[Figure: prediction accuracy of gshare versus the GAp scheme and a
one-level predictor. GAp = Global, Adaptive, per-address branch predictor.]
Hybrid Predictors
(Also known as tournament or combined predictors)
Hybrid predictors are simply combinations of two or more branch
prediction mechanisms.
This approach takes into account that different mechanisms may perform
best for different branch scenarios.
McFarling presented (1993) a number of different combinations of two
branch prediction mechanisms.
He proposed to use an additional 2-bit counter selector array which serves
to select the appropriate predictor for each branch.
One predictor is chosen for the higher two counts, the second one for the
lower two counts.
If the first predictor is wrong and the second one is right, the counter is
decremented; if the first one is right and the second one is wrong, the
counter is incremented. No change is made if both predictors are correct or
both are wrong.
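The selector scheme can be sketched as follows. This is illustrative: here the chooser's high counts favor the second component predictor, and the chooser moves toward whichever component was right when the two disagree (the AlwaysPredictor stub exists only for the demo and is not part of any real design):

```python
class Tournament:
    """Combining predictor: a 2-bit saturating chooser selects between
    two component predictors (any objects providing predict(pc) and
    update(pc, taken))."""

    def __init__(self, p1, p2):
        self.p1, self.p2 = p1, p2
        self.chooser = 1            # 0..1 selects p1, 2..3 selects p2

    def predict(self, pc):
        return (self.p2 if self.chooser >= 2 else self.p1).predict(pc)

    def update(self, pc, taken):
        r1 = self.p1.predict(pc) == taken
        r2 = self.p2.predict(pc) == taken
        if r1 != r2:                # chooser changes only on disagreement
            if r2:
                self.chooser = min(3, self.chooser + 1)
            else:
                self.chooser = max(0, self.chooser - 1)
        self.p1.update(pc, taken)
        self.p2.update(pc, taken)

class AlwaysPredictor:
    """Trivial fixed-direction component, used only for the demo."""
    def __init__(self, value):
        self.value = value
    def predict(self, pc):
        return self.value
    def update(self, pc, taken):
        pass

t = Tournament(AlwaysPredictor(False), AlwaysPredictor(True))
pred_before = t.predict(0)   # chooser starts low: first component is used
t.update(0, True)            # first wrong, second right: chooser moves up
pred_after = t.predict(0)    # second component now selected
```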
Intel Pentium 1
It uses a single-level 2-bit Smith algorithm BHT associated with a
four way associative BTB which contains the branch history
information.
The Pentium does not fetch non-predicted targets and does not
employ a return address stack (RAS) for subroutine return
addresses.
It does not allow multiple branches to be in flight at the same time.
Due to the short Pentium pipeline, the misprediction penalty is only
three or four cycles, depending on which pipeline the branch takes.
Intel P6,II,III
Like the Pentium, the P6 uses a BTB that retains both branch history
information and the predicted target of the branch. However, the
BTB of the P6 has 512 entries, reducing BTB misses.
The average misprediction penalty is 15 cycles. Misses in the
BTB cause a significant 7-cycle penalty if the branch is backward.
To improve prediction accuracy, a two-level branch history
algorithm is used.
Although the P6 has a fairly satisfactory accuracy of about 90%,
the enormous misprediction penalty should lead to reduced
performance. Assuming a branch every 5 instructions and 10%
mispredicted branches with 15 cycles per misprediction, the overall
penalty resulting from mispredicted branches is 0.3 cycles per
instruction. This number may be slightly lower since BTB misses
take only seven cycles.
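The 0.3 cycles-per-instruction figure follows directly from the stated assumptions:

```python
# Reproducing the slide's estimate: one branch every 5 instructions,
# 10% of branches mispredicted, 15-cycle misprediction penalty.
branch_frequency = 1 / 5        # branches per instruction
mispredict_rate = 0.10          # fraction of branches mispredicted
penalty_cycles = 15             # cycles lost per misprediction

cpi_penalty = branch_frequency * mispredict_rate * penalty_cycles
# = 0.2 * 0.10 * 15 = 0.3 extra cycles per instruction
```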
AMD K6
Uses a two-level adaptive branch history algorithm implemented in a BHT
(gshare) with 8192 entries (16 times the size of the P6).
However, the size of the BHT prevents AMD from using a BTB or even
storing branch target address information in the instruction cache. Instead,
the branch target addresses are calculated on the fly using ALUs during the
decode stage. The adders calculate all possible target addresses before
the instructions are fully decoded, and the processor chooses which
addresses are valid.
A small branch target cache (BTC) is implemented to avoid a one cycle
fetch penalty when a branch is predicted taken.
The BTC supplies the first 16 bytes of instructions directly to the instruction
buffer.
Like the Cyrix 6x86 the K6 employs a return address stack (RAS) for
subroutines.
The K6 is able to support up to 7 outstanding branches.
With a prediction accuracy of more than 95% the K6 outperformed all other
microprocessors when introduced in 1997 (except the Alpha).
Motorola PowerPC 750
A dynamic branch prediction algorithm is combined with static branch
prediction which enables or disables the dynamic prediction mode
and predicts the outcome of branches when the dynamic mode is
disabled.
Uses a single-level Smith algorithm 512-entry BHT and a 64-entry
Branch Target Instruction Cache (BTIC), which contains the most
recently used branch target instructions, typically in pairs. When an
instruction fetch does not hit in the BTIC the branch target address is
calculated by adders.
The return address for subroutine calls is also calculated and stored
in user-controlled special purpose registers.
The PowerPC 750 supports up to two branches, although
instructions from the second predicted instruction stream can only be
fetched but not dispatched.
The SUN UltraSparc
Uses a dynamic single-level BHT Smith algorithm.
It employs static prediction to initialize the state
machine (saturating up/down counters).
However, the UltraSparc maintains a large number of branch
history entries (up to 2048 or every other line of the I-cache).
To predict branch target addresses a branch following mechanism
is implemented in the instruction cache. The branch following
mechanism also allows several levels of speculative execution.
The overall claimed prediction accuracy of the UltraSparc is 94% for FP
applications and 88% for integer applications.
Pentium Architecture

Excerpted from:
The Pentium: An Architectural History of the World's Most
Famous Desktop Processor (Part I)
By Jon Stokes
Sunday, July 11, 2004
General Features
Introduction date: March 22, 1993
Process: 0.8 micron
Transistor Count: 3.1 million
Clock speed at introduction: 60 and 66 MHz
Cache sizes: L1: 8K instruction, 8K data
Features: MMX added in 1997
The Pentium's two-issue superscalar architecture was fairly straightforward.
It had two five-stage integer pipelines, which Intel designated U and V, and
one six-stage floating-point pipeline. The chip's front end could do dynamic
branch prediction.

(See the Pentium '97 datasheet.)
Pipeline
The Pentium's basic integer pipeline is five stages long, with the stages
broken down as follows:
1. Prefetch/Fetch: Instructions are fetched from the instruction
cache and aligned in prefetch buffers for decoding.
2. Decode1: Instructions are decoded into the Pentium's internal
instruction format. Branch prediction also takes place at this stage.
3. Decode2: Same as above, and microcode ROM kicks in here, if
necessary. Also, address computations take place at this stage.
4. Execute: The integer hardware executes the instruction.
5. Write-back: The results of the computation are written back to
the register file.
Pipeline
The main difference between the Pentium's five-stage pipeline and the
four-stage pipelines prevalent at the time lies in the second decode
stage.
RISC ISAs support only simple addressing modes, but x86's multiple
complex addressing modes, which were originally designed to make
assembly language programmers' lives easier but ended up making
everyone's lives more difficult, require extra address computations.
These computations are relegated to the second decode stage, where
dedicated address-computation hardware handles them before
dispatching the instruction to the execution units.
X86 Legacy support
- A whopping 30% of the Pentium's transistors were dedicated solely to providing
x86 legacy support.
- The Pentium's entire front end was bloated and distended with hardware that
was there solely to support x86 (mis)features which were rapidly falling out of use.
- Today, x86 support accounts for well under 10% of the transistors on the
Pentium 4, a drastic improvement over the original Pentium, and one
that has contributed significantly to the ability of x86 hardware to catch
up to and even surpass its RISC competitors in both integer and
floating-point performance.
Pentium Pipeline
[Figure: block diagram of pipeline operations.]
The Pentium's U and V integer pipes were not fully symmetric. U, as the
default pipe, was slightly more capable and contained a shifter, which V
lacked.
Floating point, however, simply went from awful on the 486 to just mediocre
with the Pentium: an improvement, to be sure, but not enough to make it even
remotely competitive with comparable RISC chips on the market at that time.
The Pentium Pro did manage to raise the x86 performance bar significantly.
Its out-of-order execution engine, dual integer pipelines, and improved
floating-point unit gave it enough oomph to get x86 into the commodity
server market.
Pentium Architectural Improvements: The P6
The P6 architecture evolution

                        Pentium Pro             Pentium II                 Pentium III
Introduction date       November 1, 1995        May 7, 1997                February 26, 1999
Process                 0.60/0.35 micron        0.35 micron                0.25 micron
Transistor count        5.5 million             7.5 million                9.5 million
Clock speed at intro    150/166/180/200 MHz     233/266/300 MHz            450/500 MHz
L1 cache size           8K instr, 8K data       16K instr, 16K data        16K instr, 16K data
L2 cache size           256K or 512K (on-die)   512K (off-die)             512K (on-die)
Features                No MMX                  MMX                        MMX, SSE, processor serial number
Pentium Pro Architecture
Decoupling the front end from the back end

In the Pentium and its predecessors, instructions traveled directly
from the decoding hardware to the execution hardware. As noted, the
Pentium had some hardwired rules (see the next three slides)
dictating which instructions could go to which execution units and in
what combinations; once the instructions were decoded, the
rules took over and the dispatch logic shuffled them off to the proper
execution unit.
The control unit is responsible for implementing and executing the
rules that decide which instructions go where, and in what
combinations.
This static, rules-based approach is rigid and simplistic, and it has
two major drawbacks, both stemming from the fact that though the
code stream is inherently sequential, a superscalar processor
attempts to execute parts of it in parallel:
1. It adapts poorly to the dynamic and ever-changing code
stream, and
2. It would make poor use of wider superscalar hardware.
Pipeline Instruction Pairing Rules
- Both instructions must be simple:
  - Hardwired, no microcode support
  - Must execute in 1 clock cycle
- No data dependencies between the instructions (either memory or
  registers)
- Neither instruction may contain both a displacement and an
  immediate value
- Instructions with prefixes can only be issued in the U-pipe
- Branches can only be the 2nd of a pair:
  - Must execute in the V-pipe
Pipeline Instruction Pairing Rules
Pseudocode:

IF   I1 is simple
AND  I2 is simple
AND  I1 is not a jump
AND  dest. of I1 is not a source of I2
AND  dest. of I1 is not dest. of I2
THEN issue I1 to U-pipe
     issue I2 to V-pipe
ELSE issue I1 to U-pipe
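The pseudocode above can be expressed as a small function. This is a sketch: the instruction fields "simple", "jump", "dest", and "srcs" are illustrative names for this example, not Intel's:

```python
def pair(i1, i2):
    """Pentium-style pairing check: returns the pipes the two
    instructions are issued to this cycle ('U'/'V'), or (U, None)
    when only I1 can issue."""
    if (i1["simple"] and i2["simple"]
            and not i1["jump"]
            and i1["dest"] not in i2["srcs"]     # no RAW dependency
            and i1["dest"] != i2["dest"]):       # no WAW dependency
        return ("U", "V")    # issue I1 to U-pipe, I2 to V-pipe
    return ("U", None)       # only I1 issues, to the U-pipe

# Two independent simple instructions pair; a dependent one does not.
add1 = {"simple": True, "jump": False, "dest": "eax", "srcs": ["ebx"]}
add2 = {"simple": True, "jump": False, "dest": "ecx", "srcs": ["edx"]}
dep  = {"simple": True, "jump": False, "dest": "ecx", "srcs": ["eax"]}
```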
Efficiency of Instruction Pairing Rules
C code:

for (k = i + prime; k <= SIZE; k += prime)
    flags[k] = FALSE;

Compiler assembly:

; prime in ecx, k in edx, FALSE in al
inner_loop:
    MOV  byte ptr flags[edx], al
    ADD  edx, ecx
    CMP  edx, SIZE
    JLE  inner_loop

Execution cycles:

Instruction                 80486   Pentium
MOV  byte ptr flags[edx]    1       paired
ADD  edx, ecx               2       1
CMP  edx, SIZE              1       paired
JLE  inner_loop             2       1
Total                       6       2
Since the Pentium is a two-issue machine (i.e., it can issue at most two operations
simultaneously from its decode hardware to its execution hardware on each clock
cycle), its dispatch rules look at only two instructions at a time to see whether
they can be dispatched simultaneously.

If more execution hardware were added, and the issue width were increased to three
instructions per cycle (as it is in the P6), then the rules determining which instructions
go where would need to be able to account for various possible combinations of two
and three instructions at a time, in order to get those instructions to the right
execution unit at the right time.

Furthermore, such rules would inevitably be difficult for coders to optimize for, and if
they weren't to be overly complex then there would necessarily exist many common
instruction sequences that would perform suboptimally under the default rule set.

The makeup of the code stream would change from application to application and from
moment to moment, but the rules responsible for scheduling the code stream's
execution would be forever fixed.
Dispatch Issues
Out-of-order processing

In-order pipeline:
1. Instruction fetch.
2. If the input operands are available (in registers, for instance), the
instruction is dispatched to the appropriate functional unit. If one or more
operands is unavailable during the current clock cycle (generally because
they are being fetched from memory), the processor stalls until they are
available.
3. The instruction is executed by the appropriate functional unit.
4. The functional unit writes the results back to the register file.

Out-of-order pipeline:
1. Instruction fetch.
2. Instruction dispatch to an instruction queue (also called an instruction
buffer or reservation stations).
3. The instruction waits in the queue until its input operands are
available. The instruction is then allowed to leave the queue before
earlier, older instructions.
4. The instruction is issued to the appropriate functional unit and
executed by that unit.
5. The results are queued.

Only after all older instructions have had their results written back to the
register file is this result written back to the register file. This is called
the graduation or retire stage.

The key concept of OoO processing is to allow the processor to avoid a
class of stalls that occur when the data needed to perform an operation are
unavailable. In the outline above, the OoO processor avoids the stall that
occurs in step 2 of the in-order processor when the instruction is not
completely ready to be processed due to missing data.

Out-of-order processing

OoO processors fill these "slots" in time with other instructions that are
ready, then re-order the results at the end to make it appear that the
instructions were processed as normal. The way the instructions are
ordered in the original computer code is known as program order; in the
processor they are handled in data order, the order in which the data
(operands) become available in the processor's registers. Fairly complex
circuitry is needed to convert from one ordering to the other and maintain a
logical ordering of the output; the processor itself runs the instructions in
seemingly random order.

The benefit of OoO processing grows as the instruction pipeline deepens
and the speed difference between main memory (or cache memory) and
the processor widens. On modern machines, the processor runs many
times faster than the memory, so during the time an in-order processor
spends waiting for data to arrive, it could have processed a large number
of instructions.
Out-of-order processing
The solution to the above dilemma is to place the newly decoded instructions in a buffer,
and then issue them to the execution core whenever they're ready to be executed,
even if that means executing them not just in parallel but in reverse order.

This way, the current context in which a particular instruction finds itself executing
can have much more of an impact on when and how it's executed. In replacing the
control unit with a buffer, the P6 core replaces fixed rules with flexibility.

The P6 architecture feeds each decoded instruction into a buffer called the reservation
station (RS), where it waits until all of its execution requirements are met. Once they're
met, the instruction then moves out of the reservation station into the proper execution
unit, where it executes.

The reservation station

54
ECE 684
The reorder buffer

After the instructions are decoded, they must travel through the reorder buffer (ROB)
before flowing into the reservation station.

The ROB is like a large log book in which the P6 can record all the essential information
about each instruction that enters the execution core.

The primary function of the ROB is to ensure that instructions come out one end of the
out-of-order execution core in the same order in which they entered it.

So newly decoded instructions flow into the ROB, where their relevant information
is logged in one of 40 available entries. From there, they pass on to the reservation
station, and then on to the execution core. Once they're done executing, their results
go back to the ROB where they're stored until they're ready to be written back to
the architectural registers.

This final write-back, which is called retirement and which permanently alters
the programmer-visible machine state, cannot happen until all of the instructions
prior to the newly finished instruction have written back their results, a requirement
which is necessary for maintaining the appearance of sequential execution.
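The in-order retirement rule can be sketched as a toy model. The 40-entry size matches the P6 figure above; everything else (entry format, method names) is illustrative:

```python
from collections import deque

class ROB:
    """Toy reorder buffer: retires strictly in program order, even though
    execution may finish out of order."""
    def __init__(self, size=40):              # P6-like: 40 entries
        self.entries = deque(maxlen=size)

    def dispatch(self, name):                 # instruction enters in program order
        self.entries.append({"name": name, "done": False})

    def complete(self, name):                 # execution finished (any order)
        for e in self.entries:
            if e["name"] == name:
                e["done"] = True

    def retire(self):
        """Pop finished entries only from the head: program order."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["name"])
        return retired

rob = ROB()
for n in ["i1", "i2", "i3"]:
    rob.dispatch(n)
rob.complete("i3")            # i3 finishes first...
assert rob.retire() == []     # ...but cannot retire past unfinished i1
rob.complete("i1")
rob.complete("i2")
assert rob.retire() == ["i1", "i2", "i3"]
```

The key point is the `while` in `retire`: a finished instruction waits at the head check until everything older has also finished, which is exactly the requirement stated above.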
55
ECE 684
The instruction window
A common metaphor for thinking about and talking about the P6's RS + ROB
combination, or analogous structures on other processors, is that of an instruction
window.

The P6's ROB can track up to 40 instructions in various stages of execution,
and its reservation station can hold and examine up to 20 instructions to determine
the optimal time for them to execute.

You can think of the reservation station's 20-instruction buffer as a window that
moves along the sequentially ordered code stream; on any given cycle, the P6 is
looking through this window at that visible segment of the code stream and thinking
about how its hardware can optimally execute the 20 or so instructions that it sees there.
56
ECE 684
Register renaming
Register renaming does for the data stream what the instruction window does for
the code stream: it allows the processor some flexibility in adapting its resources
to fit the needs of the currently executing program.

The x86 ISA has only eight general-purpose registers (GPRs) and eight floating-point
registers (FPRs), a paltry number by today's standards (e.g., the PowerPC ISA specifies
32 of each register type), and a half to a quarter of what many of the P6's RISC
contemporaries had.

Register renaming allows a processor to have a larger number of actual registers
than the ISA specifies, thereby enabling the chip to do more computations
simultaneously without running out of registers.

Each of the P6 core's 40 ROB entries has a data field, which holds program data
just like an x86 register. These fields give the P6's execution core 40 microarchitectural
registers to work with, and they're used in combination with the P6's register
allocation table (RAT) to implement register renaming in the P6 core.
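A minimal sketch of RAT-based renaming, assuming a simplified instruction form; the slot names like `rob0` are invented for illustration and do not reflect the real P6 encoding:

```python
class Renamer:
    """Toy register allocation table: each architectural register name maps
    to whichever ROB-entry 'slot' most recently wrote it."""
    def __init__(self):
        self.rat = {}                  # architectural reg -> physical slot
        self.next_slot = 0

    def rename(self, dest, srcs):
        # read sources through the RAT *before* allocating the new dest slot
        phys_srcs = [self.rat.get(s, s) for s in srcs]
        slot = f"rob{self.next_slot}"
        self.next_slot += 1
        self.rat[dest] = slot          # future readers of dest see this slot
        return slot, phys_srcs

r = Renamer()
# eax = eax + ebx ; then eax = eax + 1 -- the two writes to eax get
# distinct physical slots, removing the false (write-after-write) dependency
d0, s0 = r.rename("eax", ["eax", "ebx"])
d1, s1 = r.rename("eax", ["eax"])
assert d0 == "rob0" and s0 == ["eax", "ebx"]
assert d1 == "rob1" and s1 == ["rob0"]   # 2nd op reads the renamed eax
```

This is how eight architectural x86 registers can fan out across 40 microarchitectural ones: each new write simply claims a fresh slot.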
57
ECE 684
The P6 execution core
The P6's execution core is significantly wider than that of the Pentium. Like the Pentium,
it contains two symmetrical integer ALUs and a separate floating-point unit, but its
load-store capabilities have been beefed up to include three execution units devoted
solely to memory accesses: a load address unit, a store address unit, and a store data
unit. The load address and store address units each contain a pair of four-input adders
for calculating addresses and checking segment limits; these are the adders in the
decode stage of the original Pentium.

Up to five instructions per cycle can pass from the reservation station through the
issue ports and into the execution units. This five issue-port structure is one of the
most recognizable features of the P6 core, and when later designs (like the P-II) added
execution units to the core (like MMX), they had to be added on one of the existing
five issue ports.
58
ECE 684
The P6 Pipeline
The P6 has a 12-stage pipeline, considerably longer than the Pentium's 5-stage pipeline.

BTB access and instruction fetch: The first three and a half pipeline stages are dedicated
to accessing the branch target buffer and fetching the next instruction. The P6's two-cycle
instruction fetch phase is longer than the Pentium's 1-cycle fetch, but it keeps the L1
cache access latency from holding back the clock speed of the processor as a whole.

Decode: The next two-and-a-half stages are dedicated to decoding x86 instructions and
breaking them down into the P6's internal, RISC-like instruction format.

Register rename: This stage takes care of register renaming and logging instructions
in the ROB.
[Reservation Station]
Write to RS: Moving instructions from the ROB to the RS takes one cycle, and occurs here.

Read from RS: It takes another cycle to move instructions out of the RS, through the issue
ports, and into the execution units.

Execute: Instruction execution can take one cycle, as in the case of simple integer
instructions, or multiple cycles, as in the case of floating-point instructions.

Retire: These two final cycles are dedicated to writing the results of the instruction
execution back into the ROB, and then retiring the instructions by writing their results
from the ROB into the architectural register file.
59
ECE 684
The P6 Pipeline
Lengthening the P6's pipeline as described above has two primary beneficial effects.
First, it allows Intel to crank up the processor's clock speed, since each of the stages
is shorter and simpler and can be completed more quickly; but this is fairly common knowledge.

The second effect is a little more subtle and less widely appreciated. The P6's longer
pipeline, when combined with its buffered decoupling of fetch/decode bandwidth from
issue bandwidth, allows the processor to hide hiccups in the fetch and decode stages.
In short, the nine pipeline stages that lie ahead of the execute stage combine with the
RS to form a deep buffer for instructions, and this buffer can hide gaps and hang-ups
in the flow of instructions in much the same way that a large UPS can hide fluctuations
and gaps in the flow of electricity to a device or a large water reservoir can hide
interruptions in the flow of water to a facility.
60
ECE 684
General Branch Prediction
In computer architecture, a branch predictor is the part of a processor
that determines whether a conditional branch in the instruction flow of a program
is likely to be taken or not.

Branch predictors are crucial in today's modern, superscalar processors for achieving
high performance. They allow processors to fetch and execute instructions without
waiting for a branch to be resolved.

Almost all pipelined processors do branch prediction of some form, because they
must guess the address of the next instruction to fetch before the current instruction
has been executed. Many earlier microprogrammed CPUs did not do branch prediction
because there was little or no performance penalty for altering the flow of the instruction
stream.

Branch prediction is not the same as branch target prediction. Branch prediction
attempts to guess whether a conditional branch will be taken or not. Branch target
prediction attempts to guess the target of the branch or unconditional jump before
it is computed from parsing the instruction itself.*
* Wikipedia definition
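As a concrete illustration of dynamic prediction, here is the classic 2-bit saturating-counter predictor, a standard textbook scheme rather than anything specific to the processors discussed here:

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0,1 predict not-taken; 2,3 taken.
    Two consecutive wrong outcomes are needed to flip the prediction."""
    def __init__(self):
        self.state = 0

    def predict(self):
        return self.state >= 2         # True means "predict taken"

    def update(self, taken):
        # saturate at 0 and 3 instead of wrapping around
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
hits = 0
for taken in [True, True, True, False, True, True]:   # loop-like pattern
    hits += (p.predict() == taken)
    p.update(taken)
# the lone not-taken outcome (the loop exit) does not flip the predictor:
# it still predicts taken on the next iteration
```

The hysteresis is the point: a single loop-exit mispredict costs one flush, but the predictor stays biased toward taken for the next execution of the loop.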
61
ECE 684
Branch prediction on the P6
The P6 expended considerably more resources than its predecessor on branch
prediction, and managed to boost dynamic branch prediction accuracy from the
Pentium's ~75% rate to upwards of 90%. As we'll see when we look at the P4,
branch prediction gets more important as pipelines get longer, because a pipeline
flush due to a mispredict means more lost cycles.

Consider the case of a conditional branch whose outcome depends on the result
of an integer calculation. On the original Pentium, the calculation happens in the
fourth pipeline stage, and if the branch prediction unit (BPU) has guessed
wrongly, only three cycles' worth of work is lost in the pipeline flush.
On the P6, though, the conditional calculation isn't performed until stage 10,
which means nine cycles' worth of work gets flushed if the BPU guesses wrongly.
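The figures above can be turned into a rough expected-cost calculation. This is illustrative arithmetic only, using the accuracy and flush-depth numbers quoted in the text:

```python
def avg_penalty(accuracy, flush_cycles):
    """Average cycles lost per branch = mispredict rate * flush penalty."""
    return (1 - accuracy) * flush_cycles

pentium = avg_penalty(0.75, 3)   # ~75% accuracy, 3-cycle flush -> about 0.75
p6      = avg_penalty(0.90, 9)   # ~90% accuracy, 9-cycle flush -> about 0.90
# Despite much better accuracy, the P6's deeper pipeline makes each branch
# slightly *more* expensive on average -- which is why prediction matters
# more and more as pipelines lengthen.
```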
62
ECE 684
The Pentium is a fifth-generation x86 architecture microprocessor from Intel,
developed by Vinod Dham. It was the successor to the 486 line.

The Pentium was expected to be named 80586 or i586, to follow the naming
convention of previous generations. However, Intel was unable to convince a
court to allow them to trademark a number (such as 486), in order to prevent
competitors such as Advanced Micro Devices from branding their processors
with similar names (such as AMD's Am486). Intel enlisted the help of Lexicon
Branding to create a brand that could be trademarked. The Pentium brand was
very successful, and was maintained through several generations of processors,
from the Pentium Pro to the Pentium Extreme Edition. Although not used for
marketing purposes, Pentium series processors are still given numerical product
codes, starting with 80500 for the original Pentium chip.
Additional Pentium Facts from Wikipedia
63
ECE 684
P5, P54C, P54CS
The original Pentium microprocessor had the internal code name P5 and the
product code 80501 (80500 for the earliest steppings). This was a pipelined
in-order superscalar microprocessor, produced using a 0.8 µm process. It was
followed by the P54C (80502), a shrink of the P5 to a 0.6 µm process, which
was dual-processor ready and had an internal clock speed different from the
front side bus (it's much more difficult to increase the bus speed than to
increase the internal clock). In turn, the P54C was followed by the P54CS,
which used a 0.35 µm process - a pure CMOS process, as opposed to the
Bipolar CMOS process that was used for the earlier Pentiums.
Additional Pentium Facts - from Wikipedia
64
ECE 684
The early versions of 60-100 MHz Pentiums had a problem in the floating point
unit that, in rare cases, resulted in reduced precision of division operations.

This bug, discovered in Lynchburg, Virginia in 1994, became known as the
Pentium FDIV bug (see article) and caused great embarrassment for Intel,
which created an exchange program to replace the faulty processors with
corrected ones.

Additional Pentium Facts - from Wikipedia
65
ECE 684
Pentium FDIV bug details



The Pentium FDIV bug is the most famous (or infamous) of the Intel microprocessor bugs. It was caused by an error in a lookup
table that was part of Intel's SRT division algorithm, which was intended to be faster and more accurate.
With a goal of boosting the execution of floating-point scalar code by 3 times and vector code by 5 times compared to the 486DX chip,
Intel decided to use the SRT algorithm, which can generate two quotient bits per clock cycle, while the traditional 486 shift-and-subtract
algorithm generated only one quotient bit per cycle. The SRT algorithm uses a lookup table to calculate the intermediate
quotients necessary for floating-point division. Intel's lookup table consists of 1066 table entries, of which, due to a programming
error, five were not downloaded into the programmable logic array (PLA). When any of these five cells is accessed by the
floating-point unit (FPU), the FPU fetches zero instead of +2, which was supposed to be contained in the "missing" cells. This
throws off the calculation and results in a less precise number than the correct answer (Byte Magazine, March 1995).
At its worst, this error can occur as high as the fourth significant digit of a decimal number, but the probability of that happening is
1 in 360 billion. Most commonly the error appears in the 9th or 10th decimal digit, which has a probability of 1 in 9 billion.
Intel has classified the bug (or the flaw, as they refer to it) with the following characteristics:

On certain input data, the FPDI (Floating Point Divide Instructions) on the Pentium processor produce inaccurate results.
The error can occur in any of the three operating precisions, namely single, double, or extended, for the divide instruction. However,
it has been noted that far fewer failures are found in single precision than in double or extended precisions.
The incidence of the problem is independent of the processor rounding modes.
The occurrence of the problem is highly dependent on the input data. Only certain data will trigger the problem. There is a
probability of 1 in 9 billion that randomly fed divide or remainder instructions will produce inaccurate results.
The degree of inaccuracy depends on the input data and upon the instruction involved.
The problem does not occur on the specific use of the divide instruction to compute the reciprocal of the input operand in single
precision.
Furthermore, the bug affects any instruction that references the lookup table or calls FDIV. Related instructions that are affected by
the bug are FDIVP, FDIVR, FDIVRP, FIDIV, FIDIVR, FPREM, and FPREM1. The instructions FPTAN and FPATAN are also
susceptible. The instructions FYL2X, FYL2XP1, FSIN, FCOS, and FSINCOS, were a suspect but are now considered safe.
66
ECE 684

A 3-D plot of the ratio 4195835/3145727 calculated on a Pentium with the FDIV bug. The
depressed triangular areas indicate where incorrect values have been computed. The
correct values would all round to 1.3338, but the returned values are 1.3337, an error in
the fifth significant digit (Byte Magazine, March 1995). Intel adopted a no-questions-asked
replacement policy for its customers with the Pentium FDIV bug. It has also done
statistical research and provided information on the bug at its site at intel.com
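The ratio in the caption is also behind the widely circulated FDIV self-test: on correct hardware, x − (x/y)·y is essentially zero, whereas flawed Pentiums famously returned 256 for this particular pair. A quick sketch of what correct hardware produces (the 256 result is as reported at the time; this snippet on a modern machine can only show the correct behavior):

```python
# The classic FDIV check, using the operand pair from the caption above.
x, y = 4195835.0, 3145727.0
residue = x - (x / y) * y
# On a correct FPU the residue is (essentially) zero; a flawed Pentium
# famously returned 256 here.
assert abs(residue) < 1e-6
# The correct quotient rounds to 1.3338; the buggy chip returned 1.3337...
assert round(x / y, 4) == 1.3338
```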
Pentium FDIV bug details

67
ECE 684
The 60 and 66 MHz 0.8 µm versions of the Pentium processors were also
known for their fragility and their (for the time) high levels of heat production
- in fact, the Pentium 60 and 66 were often nicknamed "coffee warmers".
They were also known as "high voltage Pentiums", due to their 5V operation.
The heat problems were removed with the P54C, which ran at a much lower
voltage (3.3V). P5 Pentiums used Socket 4, while P54C started out on Socket
5 before moving to Socket 7 in later revisions.

All desktop Pentiums from the P54CS onwards used Socket 7. Another bug,
known as the f00f bug, was discovered soon afterwards; fortunately,
operating system vendors responded by implementing workarounds that
prevented the crash.
Additional Pentium Facts - from Wikipedia
68
ECE 684
Pentium (FDIV) Jokes
At Intel, Quality is Job 0.99989960954

Q: What is Intel's follow-on to the Pentium? A: Repentium.

The Pentium doesn't have bugs or produce errors; it's just Precision-Impaired.

Q: How many Pentium designers does it take to screw in a light bulb? A: 1.99904274017, but
that's close enough for non-technical people.

Q: What's another name for the "Intel Inside" sticker they put on Pentiums? A: The warning
label.

Q: What do you call a series of FDIV instructions on a Pentium? A1: Successive
approximations. A2: A random number generator.

Q: Why didn't Intel call the Pentium the 586? A: Because they added 486 and 100 on the first
Pentium and got 585.999983605.


69
ECE 684
70
ECE 684
71
ECE 684
72
ECE 684
Pentium Pro
The Pentium Pro micro-architecture is a three-way superscalar, pipelined architecture. The
three-way superscalar architecture is capable of decoding, dispatching, and retiring three
instructions per clock cycle. The Pentium Pro processor family utilizes a decoupled 14-stage
superpipeline that supports out-of-order instruction execution to facilitate the high level of
instruction throughput. The Pentium Pro micro-architecture is illustrated in figure 3. The
Pentium Pro micro-architecture pipeline is divided into four sections: the 1st-level and
2nd-level caches, the front end, the out-of-order execution core, and the retire section.
The sections of the pipeline are supplied instructions and data through the bus interface unit.
The Pentium Pro processor micro-architecture utilizes two cache
levels to provide a steady stream of instructions and data to the
instruction execution pipeline. The L1 cache provides an 8-Kbyte
instruction cache and an 8-Kbyte data cache, both closely coupled
to the pipeline. The L2 cache is a 256-Kbyte, 512-Kbyte, 1-Mbyte,
or 2-Mbyte static RAM that is coupled to the core processor through
a full clock-speed 64-bit cache bus. The four-way set associative
L2 cache employs 32-byte cache lines and contains 8 bits of error
correcting code for each 64 bits of data. The nonblocking L1 and L2
caches permit multiple cache misses to proceed in parallel; cache
hits proceed during outstanding cache misses to other addresses.
73
ECE 684
Steppings
74
ECE 684
75
ECE 684
76
ECE 684
77
ECE 684
78
ECE 684
79
ECE 684
NetBurst is the name Intel gave to the new architecture that succeeded its P6
microarchitecture. The concept behind NetBurst was to improve the throughput, improve the
efficiency of the out-of-order execution engine, and to create a processor that can reach much
higher frequencies with higher performance relative to the P5 and P6 microarchitectures, while
maintaining backward compatibility.

Initially launched in Intel's seventh-generation Pentium 4 processors (the Willamette core) in
late 2000, the NetBurst architecture represented the biggest change to the IA-32 architecture
since the Pentium Pro in 1995. One of the most important changes was to the processor's
internal pipeline, referred to as Hyper Pipeline. This comprised 20 pipeline stages versus the
ten for the P6 microarchitecture and was instrumental in allowing the processor to process
more instructions per clock and to operate at significantly higher clock speeds than its
predecessor.

80
ECE 684
The NetBurst microarchitecture has only one decoder (as opposed to the three in the P6 microarchitecture), and the
out-of-order execution unit now has the execution trace cache that stores decoded µops. The core's ability to execute
instructions out of order remains a key factor in enabling parallelism, several buffers being employed to smooth the flow
of µops, and longer pipelines and the improved out-of-order execution engine allow the processor to achieve higher
frequencies and improve throughput.

Ultimately, the NetBurst microarchitecture was to prove to be something of a disappointment in comparison to Intel's
mobile-processor technology. It was therefore not entirely surprising when it transpired that NetBurst's successor would
build on the energy-efficient philosophy adopted in Intel's mobile microarchitecture and embodied in its Pentium M family
of processors.
81
ECE 684
82
ECE 684
83
ECE 684
84
ECE 684
85
ECE 684
A Detailed Look Inside the Intel NetBurst Micro-Architecture of the Intel Pentium 4 Processor (Page 9)
Intel NetBurst Micro-architecture
The Pentium 4 processor is the first hardware implementation of a new micro-architecture, the Intel NetBurst
micro-architecture. To help the reader understand this new micro-architecture, this section examines in detail the
following:
the design considerations of the Intel NetBurst micro-architecture
the building blocks that make up this new micro-architecture
the operation of key functional units of this micro-architecture based on the implementation in the Pentium 4
processor.
The Intel NetBurst micro-architecture is designed to achieve high performance for both integer and floating-point
computations at very high clock rates. It has the following features:
hyper pipelined technology to enable high clock rates and frequency headroom to well above 1GHz
rapid execution engine to reduce the latency of basic integer instructions
high-performance, quad-pumped bus interface to the 400 MHz Intel NetBurst micro-architecture system bus.
execution trace cache to shorten branch delays
cache line sizes of 64 and 128 bytes
hardware prefetch
aggressive branch prediction to minimize pipeline delays
out-of-order speculative execution to enable parallelism
superscalar issue to enable parallelism
hardware register renaming to avoid register name space limitations
The Design Considerations of the Intel NetBurst Micro-architecture
The design goals of Intel NetBurst micro-architecture are: (a) to execute both the legacy IA-32 code and applications
based on single-instruction, multiple-data (SIMD) technology at high processing rates; (b) to operate at high clock
rates, and to scale to higher performance and clock rates in the future. To accomplish these design goals, the Intel
NetBurst micro-architecture has many advanced features and improvements over the Pentium Pro processor micro-
architecture.
The major design considerations of the Intel NetBurst micro-architecture to enable high performance and highly
scalable clock rates are as follows:
It uses a deeply pipelined design to enable high clock rates with different parts of the chip running at different
clock rates, some faster and some slower than the nominally-quoted clock frequency of the processor. The
Intel NetBurst micro-architecture allows the Pentium 4 processor to achieve significantly higher clock rates as
compared with the Pentium III processor. These clock rates will reach well above 1 GHz.
Its pipeline provides high performance by optimizing for the common case of frequently executed
instructions. This means that the most frequently executed instructions in common circumstances (such as a
cache hit) are decoded efficiently and executed with short latencies, such that frequently encountered code
sequences are processed with high throughput.
It employs many techniques to hide stall penalties. Among these are parallel execution, buffering, and
speculation. Furthermore, the Intel NetBurst micro-architecture executes instructions dynamically and
out-of-order, so the time it takes to execute each individual instruction is not always deterministic. Performance of a
particular code sequence may vary depending on the state the machine was in when that code sequence was
entered.
86
ECE 684
Overview of the Intel NetBurst Micro-architecture Pipeline
The pipeline of the Intel NetBurst micro-architecture contains three sections:
the in-order issue front end
the out-of-order superscalar execution core
the in-order retirement unit.
The front end supplies instructions in program order to
the out-of-order core. It fetches and decodes IA-32
instructions. The decoded IA-32 instructions are
translated into micro-operations (µops). The front end's
primary job is to feed a continuous stream of µops to
the execution core in original program order.
The core can then issue multiple µops per cycle, and
aggressively reorder µops so that those µops whose
inputs are ready and have execution resources available
can execute as soon as possible. The retirement section
ensures that the results of execution of the µops are
processed according to original program order and that
the proper architectural states are updated.
Figure 3 illustrates a block diagram view of the major
functional blocks associated with the Intel NetBurst
micro-architecture pipeline. The paragraphs that follow
Figure 3 provide an overview of each of the three
sections in the pipeline.
The Front End
The front end of the Intel NetBurst micro-architecture consists of two parts:
fetch/decode unit
execution trace cache.
The front end performs several basic functions:
prefetches IA-32 instructions that are likely to be executed
fetches instructions that have not already been prefetched
decodes instructions into ops
generates microcode for complex instructions and special-purpose code
delivers decoded instructions from the execution trace cache
predicts branches using a highly advanced algorithm.
The front end of the Intel NetBurst micro-architecture is designed to address some of the common problems in high-
speed, pipelined microprocessors. Two of these problems contribute to major sources of delays:
the time to decode instructions fetched from the target
wasted decode bandwidth due to branches or branch target in the middle of cache lines.
The execution trace cache addresses both of these problems by storing decoded IA-32 instructions. Instructions are
fetched and decoded by a translation engine. The translation engine builds the decoded instructions into sequences of
µops called traces, which are stored in the execution trace cache.

[Figure 3: The Intel NetBurst Micro-architecture. Block diagram showing the Front End (Fetch/Decode, Trace Cache,
Microcode ROM, BTBs/Branch Prediction), the Out-Of-Order Core (Execution), Retirement, the 4-way 1st-level cache,
the 8-way 2nd-level cache, an optional 3rd-level cache (server product only), the Branch History Update path, and the
Bus Unit connecting to the System Bus; frequently used and less frequently used paths are distinguished.]
87
ECE 684
Prefetching
The Intel NetBurst micro-architecture supports three prefetching mechanisms:
the first is for instructions only
the second is for data only
the third is for code or data.
The first mechanism is a hardware instruction fetcher that automatically prefetches instructions. The second is a
software-controlled mechanism that fetches data into the caches using the prefetch instructions. The third is a
hardware mechanism that automatically fetches data and instructions into the unified second-level cache.
The hardware instruction fetcher reads instructions along the path predicted by the BTB into the instruction
streaming buffers. Data is read in 32-byte chunks starting at the target address. The second and third mechanisms are
described in Data Prefetch.
Decoder
The front end of the Intel NetBurst micro-architecture has a single decoder that can decode instructions at the
maximum rate of one instruction per clock. Complex instructions must enlist the help of the microcode ROM. The
decoder operation is connected to the execution trace cache discussed in the section that follows.
Execution Trace Cache
The execution trace cache (TC) is the primary instruction cache in the Intel NetBurst micro-architecture. The TC
stores decoded IA-32 instructions, or µops. This removes decoding costs on frequently-executed code, such as
template restrictions and the extra latency to decode instructions upon a branch misprediction.
In the Pentium 4 processor implementation, the TC can hold up to 12K µops and can deliver up to three µops per
cycle. The TC does not hold all of the µops that need to be executed in the execution core. In some situations, the
execution core may need to execute a microcode flow, instead of the µop traces that are stored in the trace cache.
The Pentium 4 processor is optimized so that most frequently-executed IA-32 instructions come from the trace
cache, efficiently and continuously, while only a few instructions involve the microcode ROM.
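The decode-once idea behind the trace cache can be sketched with a toy model; the `decode` stand-in, the addresses, and the dict-based cache are invented for illustration and say nothing about the real hardware organization:

```python
decode_count = 0

def decode(instr):
    """Stand-in for the costly x86 -> µop translation step."""
    global decode_count
    decode_count += 1
    return ("uop", instr)

trace_cache = {}          # start address of a hot sequence -> decoded µops

def fetch(pc, instrs):
    if pc not in trace_cache:                  # build the trace once
        trace_cache[pc] = [decode(i) for i in instrs]
    return trace_cache[pc]                     # later fetches skip decode

hot = ["add", "cmp", "jne"]
fetch(0x100, hot)
fetch(0x100, hot)            # second fetch hits the trace cache
assert decode_count == len(hot)   # each instruction was decoded only once
```

This is the payoff described above: for hot loops, the decode cost (and its latency after a mispredict) is paid once at trace-build time instead of on every pass.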
Branch Prediction
Branch prediction is very important to the performance of a deeply pipelined processor. Branch prediction enables
the processor to begin executing instructions long before the branch outcome is certain. Branch delay is the penalty
that is incurred in the absence of a correct prediction. For the Pentium 4 processor, the branch delay for a correctly
predicted instruction can be as few as zero clock cycles. The branch delay for a mispredicted branch can be many
cycles; typically this is equivalent to the depth of the pipeline.
The branch prediction in the Intel NetBurst micro-architecture predicts all near branches, including conditional,
unconditional calls and returns, and indirect branches. It does not predict far transfers, for example, far calls, irets,
and software interrupts.
In addition, several mechanisms are implemented to aid in predicting branches more accurately and in reducing the
cost of taken branches:
dynamically predict the direction and target of branches based on the instruction's linear address using the
branch target buffer (BTB)
if no dynamic prediction is available or if it is invalid, statically predict the outcome based on the offset of the
target: a backward branch is predicted to be taken, a forward branch is predicted to be not taken
return addresses are predicted using the 16-entry return address stack
traces of instructions are built across predicted taken branches to avoid branch penalties.
88
ECE 684
The Static Predictor. Once the branch instruction is decoded, the direction of the branch (forward or backward) is
known. If there was no valid entry in the BTB for the branch, the static predictor makes a prediction based on the
direction of the branch. The static prediction mechanism predicts backward conditional branches (those with
negative displacement), such as loop-closing branches, as taken. Forward branches are predicted not taken.
Branch Target Buffer. Once branch history is available, the Pentium 4 processor can predict the branch outcome
before the branch instruction is even decoded, based on a history of previously-encountered branches. It uses a
branch history table and a branch target buffer (collectively called the BTB) to predict the direction and target of
branches based on an instruction's linear address. Once the branch is retired, the BTB is updated with the target
address.
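The BTB-with-static-fallback scheme just described can be sketched as follows; the addresses and the dict-based BTB are illustrative, not the real hardware organization:

```python
btb = {}   # branch linear address -> (predicted_taken, predicted_target)

def predict(pc, target):
    """Dynamic prediction from the BTB if an entry exists; otherwise the
    static rule: backward (target < pc) taken, forward not taken."""
    if pc in btb:
        return btb[pc]                     # dynamic prediction
    return (target < pc, target)           # static fallback

def update(pc, taken, target):
    """On retirement, record the actual outcome in the BTB."""
    btb[pc] = (taken, target)

# a loop-closing backward branch at 0x120 jumping back to 0x100
assert predict(0x120, 0x100) == (True, 0x100)    # static: backward => taken
update(0x120, False, 0x100)                       # the loop finally exits
assert predict(0x120, 0x100) == (False, 0x100)   # now predicted dynamically
```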
Return Stack. Returns are always taken, but since a procedure may be invoked from several call sites, a single
predicted target will not suffice. The Pentium 4 processor has a Return Stack that can predict return addresses for a
series of procedure calls. This increases the benefit of unrolling loops containing function calls. It also mitigates the
need to put certain procedures inline since the return penalty portion of the procedure call overhead is reduced.
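A toy model of the 16-entry return stack described above. The overwrite-oldest behavior on overflow is an assumption made for this sketch, not a documented detail:

```python
class ReturnStack:
    """Predicts return addresses: calls push, returns pop, so nested call
    sites each get the right predicted target."""
    def __init__(self, depth=16):              # Pentium 4: 16 entries
        self.stack, self.depth = [], depth

    def call(self, return_addr):
        if len(self.stack) == self.depth:      # assumed: drop oldest entry
            self.stack.pop(0)
        self.stack.append(return_addr)

    def ret(self):
        # no prediction available on an empty stack
        return self.stack.pop() if self.stack else None

rs = ReturnStack()
rs.call(0x1004)            # call from site A: push A's return address
rs.call(0x2008)            # nested call from site B
assert rs.ret() == 0x2008  # inner return predicted correctly...
assert rs.ret() == 0x1004  # ...and so is the outer one
```

A single BTB entry per return instruction could not do this, since a procedure invoked from several call sites has several different correct return targets.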
Even if the direction and target address of the branch are correctly predicted well in advance, a taken branch may
reduce available parallelism in a typical processor, since the decode bandwidth is wasted for instructions which
immediately follow the branch and precede the target, if the branch does not end the line and target does not begin
the line. The branch predictor allows a branch and its target to coexist in a single trace cache line, maximizing
instruction delivery from the front end.
Branch Hints
The Pentium 4 processor provides a feature that permits software to provide hints to the branch prediction and trace
formation hardware to enhance their performance. These hints take the form of prefixes to conditional branch
instructions. These prefixes have no effect for pre-Pentium 4 processor implementations. Branch hints are not
guaranteed to have any effect, and their function may vary across implementations. However, since branch hints are
architecturally visible, and the same code could be run on multiple implementations, they should be inserted only in
cases which are likely to be helpful across all implementations.
Branch hints are interpreted by the translation engine, and are used to assist branch prediction and trace construction
hardware. They are only used at trace build time, and have no effect within already-built traces. Directional hints
override the static (forward-not-taken, backward-taken) prediction in the event that a BTB prediction is not
available. Because branch hints increase code size slightly, the preferred approach to providing directional hints is
by the arrangement of code so that
(i) forward branches that are more probable should be in the not-taken path, and
(ii) backward branches that are more probable should be in the taken path. Since the branch prediction information
that is available when the trace is built is used to predict which path or trace through the code will be taken,
directional branch hints can help traces be built along the most likely path.
Execution Core Detail
The execution core is designed to optimize overall performance by handling the most common cases most
efficiently. The hardware is designed to execute the most frequent operations in the most common context as fast as
possible, at the expense of less-frequent operations in rare context. Some parts of the core may speculate that a
common condition holds to allow faster execution. If it does not, the machine may stall. An example of this pertains
to store forwarding. If a load is predicted to be dependent on a store, it gets its data from that store and tentatively
proceeds. If the load turned out not to depend on the store, the load is delayed until the real data has been loaded
from memory, then it proceeds.
Instruction Latency and Throughput
The superscalar, out-of-order core contains multiple execution hardware resources that can execute multiple µops in
parallel. The core's ability to make use of available parallelism can be enhanced by:
89
ECE 684
A Detailed Look Inside the Intel NetBurst Micro-Architecture of the Intel Pentium 4 Processor (Page 13)
The Static Predictor. Once the branch instruction is decoded, the direction of the branch (forward or backward) is
known. If there was no valid entry in the BTB for the branch, the static predictor makes a prediction based on the
direction of the branch. The static prediction mechanism predicts backward conditional branches (those with
negative displacement), such as loop-closing branches, as taken. Forward branches are predicted not taken.
Branch Target Buffer. Once branch history is available, the Pentium 4 processor can predict the branch outcome
before the branch instruction is even decoded, based on a history of previously-encountered branches. It uses a
branch history table and a branch target buffer (collectively called the BTB) to predict the direction and target of
branches based on an instruction's linear address. Once the branch is retired, the BTB is updated with the target
address.
90
ECE 684
IA-32 Architecture
Richard Eckert
Anthony Marino
Matt Morrison
Steve Sonntag
91
ECE 684
IA-32 Overview
IA-32 Overview
Pentium 4 / Netburst Architecture
SSE2
Hyper Pipeline
Overview
Branch Prediction
Execution Types
Rapid Execution Engine
Advanced Dynamic Execution
Memory Management
Segmentation
Paging
Virtual Memory
Address Modes / Instruction Format
Address Translation
Cache
Levels of Cache (L1 & L2) / Execution Trace Cache
Instruction Decoder
System Bus
Register Files
Enhanced Floating Point & Multi-Media Unit
Summary / Conclusion
92
ECE 684
IA-32 Background
Traced to 1969
Intel 4004
P4
1st IA-32 processor based on the Intel NetBurst microarchitecture.
Netburst
Allows
Higher Performance Levels
Performance at Higher Clock Speeds
Compatible with existing applications and operating
systems
Written to run on Intel IA-32 architecture Processors
93
ECE 684
1st Implementation of the Intel NetBurst Architecture
Rapid Execution Engine
Hyper Pipelined
Technology
Advanced Dynamic
Execution
Innovative Cache
Subsystem
Streaming SIMD
Extensions 2 (SSE2)
400 MHz System Bus
94
ECE 684
Netburst Architecture
95
ECE 684
SSE2
Internet Streaming SIMD Extensions 2 (SSE2)
What is it?
What does it do?
How is this helpful?
96
ECE 684
IA-32 Overview
IA-32 Overview
Pentium 4 / Netburst Architecture
SSE2
Hyper Pipeline
Overview
Branch Prediction
Execution Types
Rapid Execution Engine
Advanced Dynamic Execution
Memory Management
Segmentation
Paging
Virtual Memory
Address Modes / Instruction Format
Address Translation
Cache
Levels of Cache (L1 & L2) / Execution Trace Cache
Instruction Decoder
System Bus
Register Files
Enhanced Floating Point & Multi-Media Unit
Summary / Conclusion
97
ECE 684
Hyper Pipelined
What is hyper pipeline technology?
Deeper pipeline
Fewer gates per pipeline stage
What are the benefits of hyper pipeline?
Increased clock rate
Increased performance
98
ECE 684
NetBurst vs. P6

Typical P6 pipeline (10 stages):
1 Fetch | 2 Fetch | 3 Decode | 4 Decode | 5 Decode | 6 Rename | 7 ROB Rd | 8 Rdy/Sch | 9 Dispatch | 10 Exec

Typical Pentium 4 pipeline (20 stages):
1-2 TC Nxt IP | 3-4 TC Fetch | 5 Drive | 6 Alloc | 7-8 Rename | 9 Que | 10-12 Sch | 13-14 Disp | 15-16 RF | 17 Ex | 18 Flgs | 19 BrCk | 20 Drive
99
ECE 684
[Block diagram: Pentium 4 microarchitecture - 3.2 GB/s System Interface; L2 Cache and Control; BTB; BTB & I-TLB; Decoder; Trace Cache; Rename/Alloc; µop Queues; Schedulers; Integer RF; FP RF; Code ROM; Store AGU; Load AGU; 4 ALUs; FP move; FP store; Fmul; Fadd; MMX; SSE; L1 D-Cache and D-TLB]

Pipeline stages: 1-2 TC Nxt IP | 3-4 TC Fetch | 5 Drive | 6 Alloc | 7-8 Rename | 9 Que | 10-12 Sch | 13-14 Disp | 15-16 RF | 17 Ex | 18 Flgs | 19 BrCk | 20 Drive
100
ECE 684
Netburst Architecture
101
ECE 684
Branch Prediction
Centerpiece of dynamic execution
Delivers high performance in a pipelined architecture
Allows continuous fetching and execution
Predicts next instruction address
A branch is predictable when its pattern repeats within four or fewer iterations
Branch prediction decreases the number of instructions
that would otherwise be flushed from the pipeline
102
ECE 684
Examples

Predictable:
If (a == 5)
    a = 7;
Else
    a = 5;

Not predictable (the taken/not-taken pattern repeats only every 5 iterations):
L1: lpcnt++;
If ((lpcnt % 5) == 0)
    printf ("Loop count is divisible by 5\n");
103
ECE 684
IA-32 Overview
IA-32 Overview
Pentium 4 / Netburst Architecture
SSE2
Hyper Pipeline
Overview
Branch Prediction
Execution Types
Rapid Execution Engine
Advanced Dynamic Execution
Memory Management
Segmentation
Paging
Virtual Memory
Address Modes / Instruction Format
Address Translation
Cache
Levels of Cache (L1 & L2) / Execution Trace Cache
Instruction Decoder
System Bus
Register Files
Enhanced Floating Point & Multi-Media Unit
Summary / Conclusion
104
ECE 684
Rapid Execution Engine
Contains 2 ALUs
Twice core processor frequency
Allows basic integer instructions to execute in half a
clock cycle
Up to 126 instructions, 48 loads, and 24 stores can be
in flight at the same time
Example
Rapid Execution Engine on a 1.50 GHz P4 Processor runs
at _________Hz?
105
ECE 684
Out-of-Order
Execution
Logic
Retirement
Logic
Branch History Update
106
ECE 684
Advanced Dynamic Execution
Out-of-Order Engine
Reorders Instructions
Executes as input operands are ready
ALUs kept busy
Reports Branch History Information
Increases overall speed
107
ECE 684
IA-32 Overview
IA-32 Overview
Pentium 4 / Netburst Architecture
SSE2
Hyper Pipeline
Overview
Branch Prediction
Execution Types
Rapid Execution Engine
Advanced Dynamic Execution
Memory Management
Paging
Virtual Memory
Segmentation
Address Modes / Instruction Format
Address Translation
Cache
Levels of Cache (L1 & L2) / Execution Trace Cache
Instruction Decoder
System Bus
Register Files
Enhanced Floating Point & Multi-Media Unit
Summary / Conclusion
108
ECE 684
Memory Management
Management Facilities divided into two parts:
Segmentation - isolates individual processes so that multiple
programs can run on the same processor without
interfering with each other.
Demand Paging - provides a mechanism for implementing a
virtual-memory that is much larger than the
actual memory, seemingly infinite.
109
ECE 684
Memory Management
Address Translation
Ex: Comp. Arch. I
Logical Address
Segmentation
& Paging
Physical Address
Control
Word
Memory
Instruction
Address
Instruction
Decoder
Instruction
Control
Word
IA-32
Memory
(Virtual Address)
110
ECE 684
Modes of Operation
Concentration on:
Protected mode - Native operating mode of the processor. All
features available, providing highest performance
and capability.
- Must use segmentation, paging optional.
Other modes:
Real-address mode - 8086 processor programming environment
System management mode (SMM) - Standard arch. feature in
all later IA-32 processors. Power
management, OEM differentiation features
Virtual-8086 mode - used while in protected mode, allows
processor to execute 8086 software in a
protected, multitasked environment.
111
ECE 684
Paging
Subdivide memory into small fixed-size chunks called
frames or page frames
Divide programs into same sized chunks, called pages
Loading a program in memory requires the allocation of the
required number of pages
Limits wasted memory to a fraction of the last page
Page frames used in loading process need not be contiguous
- Each program has a page table associated with it that maps
each program page to a memory page frame
112
ECE 684
[Diagram: IA-32 2-level paging - a Logical Address passes through Segmentation to form a Linear Address (Dir | Page | Offset); the Dir field indexes the Page Directory, the Page field indexes a Page Table, and the resulting frame address plus Offset forms the Physical Address in Main Memory]
Virtual Memory:
Only program pages required for
execution of the program are actually
loaded
Only a few pages of any one
program might be in memory at a time
Possible to run program consisting
of more pages than can fit in memory
Demand Paging
113
ECE 684
Segmentation
Programmer subdivides the program into logical units called
segments
- Programs subdivided by function
- Data array items grouped together as a unit
Paging - invisible to programmer, Segmentation - usually
visible to programmer
- Convenience for organizing programs and data, and a
means for associating access and usage rights with
instructions and data
- Sharing, segment could be addressed by other
processes, ex: table of data
- Dynamic size, growing data structure
114
ECE 684
Address Translation
[Diagram: a Logical Address (Segment | Offset) is translated via the Segment Table into a Linear Address (Dir | Page | Offset), which paging maps through the Page Directory and Page Table to a Physical Address in Main Memory]
Segment selector fields: Index | TI | RPL
Index: The number of the segment. Serves as
an index to the segment Table.
TI: (one bit) Table indicator indicates either
global or local segment table to be used for
translation
RPL: (two bits) Requested privilege level,
0=high privilege, 3 = low
115
ECE 684
IA-32 Overview
IA-32 Overview
Pentium 4 / Netburst Architecture
SSE2
Hyper Pipeline
Overview
Branch Prediction
Execution Types
Rapid Execution Engine
Advanced Dynamic Execution
Memory Management
Paging
Virtual Memory
Segmentation
Address Modes / Instruction Format
Address Translation
Cache
Levels of Cache (L1 & L2) / Execution Trace Cache
Instruction Decoder
System Bus
Register Files
Enhanced Floating Point & Multi-Media Unit
Summary / Conclusion
116
ECE 684
Addressing Modes
- Determine technique for offset generation
[Diagram: Effective Address (Offset) = Displacement (in instruction; 0, 8, or 32 bits) + Base Register + Index Register x Scale (1, 2, 4, or 8). The Offset is added to the Segment Base Address from the descriptor registers (Access Rights, Limit, Base Address) to form the Linear Address, which paging (invisible to the programmer) maps to Main Memory]
117
ECE 684
Mode Algorithm
Immediate Operand = A
Register operand LA = R
Displacement LA = (SR) + A
Base LA = (SR) + (B)
Base with displacement LA = (SR) + (B) + A
Scaled index with displacement LA = (SR) + (I) x S + A
Base with index and displacement LA = (SR) + (B) + (I) + A
Base with scaled index and displacement LA = (SR) + (I) x S + (B) + A
Relative LA = (PC) + A
LA = linear address
(X) = contents of X
SR = segment register
PC = program counter
A = contents of an address field in the instruction
R = register
B = base register
I = index register
S = scaling factor
Addressing Modes
118
ECE 684
[Diagram - Ex: scaled index with displacement: Effective Address (Offset) = Displacement (in instruction; 0, 8, or 32 bits) + Index Register x Scale (1, 2, 4, or 8); the Offset plus the Segment Base Address from the descriptor registers (Access Rights, Limit, Base Address) forms the Linear Address]
119
ECE 684
Instruction Format

Instruction Prefixes (0 to 4 bytes) | Opcode (1 or 2) | Mod R/M (0 or 1) | SIB (0 or 1) | Displacement (0, 1, 2, or 4) | Immediate (0, 1, 2, or 4)

Prefixes, each 0 or 1 byte: Instruction Prefix | Operand Size Override | Address Size Override | Segment Override

Mod R/M byte: Mod (bits 7-6) | Reg/Opcode (bits 5-3) | R/M (bits 2-0)
SIB byte: Scale (bits 7-6) | Index (bits 5-3) | Base (bits 2-0)
120
ECE 684
IA-32 Overview
IA-32 Overview
Pentium 4 / Netburst Architecture
SSE2
Hyper Pipeline
Overview
Branch Prediction
Execution Types
Rapid Execution Engine
Advanced Dynamic Execution
Memory Management
Segmentation
Paging
Virtual Memory
Address Modes / Instruction Format
Address Translation
Cache
Levels of Cache (L1 & L2) / Execution Trace Cache
Instruction Decoder
System Bus
Register Files
Enhanced Floating Point & Multi-Media Unit
Summary / Conclusion
121
ECE 684
Cache Organization
Physical
Memory
System Bus
(External)
Bus Interface Unit
L2 Cache
Instruction Decoder Trace Cache
Instruction
TLBs
Data Cache
Unit (L1)
Store Buffer
Data TLBs
122
ECE 684
IA-32 Overview
IA-32 Overview
Pentium 4 / Netburst Architecture
SSE2
Hyper Pipeline
Overview
Branch Prediction
Execution Types
Rapid Execution Engine
Advanced Dynamic Execution
Memory Management
Segmentation
Paging
Virtual Memory
Address Modes / Instruction Format
Address Translation
Cache
Levels of Cache (L1 & L2) / Execution Trace Cache
Instruction Decoder
System Bus
Register Files
Enhanced Floating Point & Multi-Media Unit
Summary / Conclusion
123
ECE 684
Enhanced FP &
Multi-Media Unit
Expands Registers
128-bit
Adds One Additional Register
Data Movement
Improves performance on applications
Floating Point
Multi-Media
124
ECE 684
When the name Prescott was
first heard, people started to
assume a Pentium 5 was coming,
as there are a number of changes
that differentiate Prescott from the
Northwood core: a 90 nm process,
1 MB of L2 cache rather than 512 KB,
an L1 data cache doubled to
16 KB, 13 new instructions referred to
as SSE3, and a new pipeline
extended from 20 to 31 stages, officially
part of Intel's NetBurst architecture.
P4 Prescott
Source: "Intel's New Weapon: Pentium 4 Prescott," February 1, 2004,
Patrick Schmid, Achim Roos, Bert Töpelt
125
ECE 684
It looks like millions of other Pentium 4 processors, but there's something new:
1 MB L2 cache, 16 KB L1 data cache and SSE3, the fourth instruction-set extension that
Intel has added to the Pentium family (after MMX, SSE, SSE2).
Package
126
ECE 684
Intel's nomenclature is very simple: they basically just add an E
after the clock speed number, e.g. Pentium 4 3.0E GHz. Besides the
three versions we reviewed in this article (2.8E / 3.0E / 3.2E GHz),
Intel is also launching a low-cost Prescott version at 2.8A GHz with a
133 MHz FSB and without HyperThreading.

That is of particular importance as the TDP (thermal design power)
reached a new record: It is 103 Watts for the 3.4E and 3.2E GHz versions.
Even more interesting is the TDP of the new P4 Extreme
Edition at 3.4 GHz: 102.9 Watts.
Numbering - thermal
127
ECE 684
With the advantage of a very small 90 nm production process,
Intel was easily able to increase the L2 cache size. Instead of Northwood's 512
KB, Prescott can now access 1 MB. Despite the higher transistor count, the die
size dropped from 127 mm² to 112 mm². At 3.4E GHz, Prescott has a maximum
cache bandwidth of 108 GB/s.

Additionally, Intel doubled the L1 data cache from 8 KB to 16 KB. Let's look back
to 2000 when Intel launched the Pentium 4 Willamette, with a reduction in the
cache size to 8 KB. Back then, the L1 cache had to be reduced to 8 KB in order to
keep the latency at two clock cycles. Slower cache access would have worsened
the performance gap with the Pentium III even more. Still today, it is very important
to have fast caches, since both AGUs (address generation units) need to access them
frequently.
cache
128
ECE 684
After Intel's success with the Pentium 4's SSE2 instruction set (Streaming SIMD
Extensions, 144 instructions), SSE3 is supposed to be a reaction to the wishes and
desires of big software companies. This time, there are only 13 new instructions to
make the programmer's life easier:

fisttp: fp to int conversion
addsubps, addsubpd, movsldup, movshdup, movddup: complex arithmetic
lddqu: video encoding
haddps, hsubps, haddpd, hsubpd: graphics (SIMD FP / AOS)
monitor, mwait: thread synchronization

SSE
129
ECE 684
NetBurst Architecture: Now 31 Pipeline Stages
Architecture
130
ECE 684
P4 Pipeline
Instructions are received over the 64-bit wide, quad-pumped 200 MHz system bus, good for 6.4 GB/s.
Then they enter the L2 cache. The prefetcher analyses the instructions and activates
the BTB (Branch Target Buffer) in order to get a branch prediction, accomplished by a
determination on what data could be required next. The modified instruction set is sent
through the instruction decoder that translates the x86 data into micro operations.

The x86 instructions can be complex and frequently feature loops, which is why Intel
abandoned the classic L1 instruction cache back with the first Pentium 4 Willamette in
favor of the Execution Trace Cache. It is based on micro operations and is located
behind the Instruction Decoder, making it the much smarter solution by eliminating
unnecessary decoding work. The Execution Trace Cache stores and reorganizes
chains of multiple micro operations in order to pass them to the Rapid Execution
Engine in an efficient manner.

131
ECE 684
If the BTB does not provide a branch prediction, the Instruction Decoder will perform a
static prediction that is supposed to have only a small impact on performance in case the
prediction is wrong. The small impact is achieved by an improved loop detection
process. The dynamic branch prediction has also been updated, and integer multiplication is
now done within a dedicated unit.

Predicting branches is a core element in order to enable high performance. If the processor
knows or at least guesses what comes next, it will be able to fill its pipeline in an efficient
manner. This has become even more important since the pipeline has been stretched from
20 stages to now 31 stages. Intel tries to reduce the complexity of each stage in order to run
higher clock speeds. In exchange, the processor becomes more vulnerable to misprediction.

Now it's quite obvious why Intel tried to increase all the caches. In case of a misprediction, it's more
important than ever to "keep the system running": the right data must be available in order
to refill the pipeline. To support that, the L1 data cache is eight-way set associative, which
helps determine whether the requested data is already located inside the cache.


Prescott Branch Prediction
132
ECE 684
Wafer & die size
133
ECE 684
Wafer & die size
In contrast to the 200 mm wafers AMD uses, Intel's 300 mm pizza-pie-sized models offer
much more space. We have analyzed the theoretical number of processors on each of those
wafers in order to talk about availability, prices and finally the success of a processor (see
above).

It's either delightful or depressing (that depends on your personal view) to see how many
processors can be made of one single wafer. The theoretical limit should be 588 Prescott
processors in case of Intel's 300 mm wafers and 148 Opteron/Athlon64 FX CPUs with
AMD's 200 mm models. Even if Intel yielded only 40%, it would still get more than
double the number of processors that AMD gets with a 60% yield. Still, you should not forget
that Intel usually has to supply larger customers than AMD, and has the fab capacity to do
so.

In a wafer fab, 85% yields are definitely possible and are hit from time to time, but in mass
production facilities even a 70% yield is considered sufficiently high. When a production
facility begins producing a new product, yield rates are usually tremendously lower
until the production process ramps up to mass-scale volumes.

134
ECE 684
Multicore - Nehalem
Here you can see a die shot of the new Nehalem processor - in this iteration a
four core design with two separate QPI links and large L3 cache in relation to the
rest of the chip.
135
ECE 684
Intel can easily create a range of processors from 1 core to 8 cores depending on
the application and market demands. Eight core CPUs will be found in servers while
you'll find dual core machines in the mobile market several months after the initial
desktop introduction.
SSE instructions get a bump to revision 4.2, branch prediction and prefetch algorithms
improve, and simultaneous multi-threading (SMT) makes a return after a hiatus
following the NetBurst architecture.
136
ECE 684
HyperThreading Returns
SMT (simultaneous multi-threading) or HyperThreading is also a key to keeping the 4-wide execution engine
fed with work and tasks to complete. With the larger caches and much higher memory bandwidth that the
chip provides this is a very important addition.

137
ECE 684
138
ECE 684
139
ECE 684
Power control
The Nehalem core also has a new trick in its bag that enables it to lower the power consumption of a core to
nearly 0 watts - something that wasn't possible on previous designs. You can see in the image above what
the total power consumption of a core was typically made up of with the Core 2 series of processors: clocks
and logic are the majority of it, but a third or more is related to transistor leakage, which was
something that couldn't be turned off in prior designs.
Well with the independent power controller in the PCU and the different power planes that each core rests on,
the power consumption for each core is completely independent from the others. You can see in this diagram
that though Core 3 is loaded the entire time, both Core 2 and Core 0 are able to power down to practically 0
watts when their work load is complete.

140
ECE 684
IBM Power PC (Power 4):
p690 Architecture
141
ECE 684
Power4 CPUs

Caches

Memory

Prefetching
Overview
142
ECE 684
Basic features
1.3 GHz clock speed
two independent floating point units
single instruction for floating point multiply-add (FMA)
theoretical peak is therefore 5.2 GFlops per CPU

Many typical features of modern RISC processors

Difficult to attain high percentage of peak performance
dense linear algebra is an exception
good applications realise 10-20% of peak

easy to get much less than this!
IBM Power4 CPU
143
ECE 684
Superscalar processor
capable of issuing up to 5 instructions per clock cycle
execution units: 2 FP, 2 integer, 2 load/store, 1 branch, 1 CR logical

Two integer addition/logical units

Two floating point units
Single instruction for multiply-add
Non-pipelined divide and square root

80 integer, 72 FP registers
only 32 virtual registers in the instruction set
hardware maps virtual registers to physical ones on the fly.
Power4 processors
144
ECE 684

Long pipeline
up to 20 cycles for each instruction from start to finish
FMA takes 6 cycles from reading registers to
delivering result back to registers
not enough virtual registers to keep both FPUs busy
all the time

not even Linpack approaches 100% of peak

Out-of-order execution
hardware can reorder instructions to make best use
of the hardware resources
requires a great deal of internal bookkeeping!
Other features
145
ECE 684
Branch prediction
lots of hardware to try and predict branches
mispredicted branches cause pipeline to stall
16 Kbit local and global branch predictor tables
overkill for scientific codes

most branches are back to the start of a loop

Speculative execution
can issue instructions ahead of branches
instructions are killed if they are not required
keeps pipeline full
Other features (continued)
146
ECE 684

Caches rely on temporal and spatial locality

Caches are divided into lines (a.k.a blocks)

Lines are organized as sets

A memory location is mapped to a set
depending on its address

It can occupy any line within that set
Caches
147
ECE 684
A cache with 1 line per set is called direct mapped
A cache with k lines per set is called k-way set associative
A cache with only 1 set is called fully associative
Cache terminology
148
ECE 684
When a line is loaded into the cache, its address
determines which set it goes into.

In a direct mapped cache, it simply replaces the
only line in the set

In a k-way set associative cache, there are k lines
which could be ejected to make room for the new one
usual policy is to replace the least recently used
(LRU)
better than random, but not always optimal
LRU line may still be the one required next!
Replacement policy
149
ECE 684
Caches may be:
write-through

data written to cache line and to lower memory level
write-back

data is only written to the cache. Lower levels updated when cache
line is replaced

Caches may also be:
write allocate

if the write location is not in the cache, the enclosing line is loaded into the
cache (usual for write-back)
no write allocate

if the write location is not in the cache, only the underlying level is modified
(usual for write-through)
Caches and writes
150
ECE 684

p690 has 3 levels of cache
separate L1 data and instruction caches
unified L2 shared between 2 CPUs on a chip
global L3 cache (more of a memory buffer)
p690 Memory System
151
ECE 684

Instruction cache
64Kbytes, direct mapped
128 byte lines

Data cache
32Kbytes
2-way set associative
LRU replacement
128-byte lines

write-through, no write allocate
2x8-byte reads and 1x8-byte write per cycle.
4-5 cycle latency.
L1 caches
152
ECE 684
A single chip comprises
2 independent CPUs
shared L2 cache
Power4 Chip
153
ECE 684
1440 KB unified (data + instructions)

8-way set associative

Shared by both CPUs on the chip.
effectively each processor has 720 KB of cache

128-byte lines
write-through, write allocate
loads in 32-byte chunks.

14-20 cycle latency L2 -> registers

Cache has 3 independent sections of 480 KB
Lines within the 1440 KB unit are hashed to sections
(consecutive lines never go to the same section).
L2 cache
154
ECE 684
Power4 Chip
155
ECE 684

Chips are packaged up in groups
of four
each Multi-Chip Module (MCM)
has eight CPUs
all sharing the same L3 cache
Multi-Chip Modules
156
ECE 684
Really a memory buffer rather than a cache.

128 MB per MCM (4 chips, 8 CPUs)

8-way set-associative

512-byte lines

approx. 100 cycle latency

Usually only caches memory locations attached to the MCM

Shared by all CPUs
single CPU jobs get access to ALL the L3 cache in the system.

Does not allocate if already busy
L3 cache
157
ECE 684
8 Gbytes of main memory per MCM
1 Gbyte per processor

Accessible by all CPUs

350-400 cycles latency from main memory to registers

Running one CPU on an MCM, a memory bandwidth
of around 2.5 Gbyte/s is observed.

However, when running all 8 CPUs the aggregate
bandwidth is around 8 Gbyte/s
poor scaling, or good single CPU performance?
beware of single CPU benchmarking
Main memory
158
ECE 684
Translation lookaside buffer
processor works on effective addresses
memory works on real addresses
TLB is a cache for the effective->real mapping

1024 entries, 4 way set associative
each entry corresponds to a page (4 Kbytes)
whole TLB addresses 4 Mbytes
larger than L2 cache
TLB
159
ECE 684
Four MCMs make up a p690 frame
also called Regatta H
32 CPUs and 32 GB memory per frame
peak of 166.4 Gflops

Each frame configured as 4 machines
called Logical PARtitions
each LPAR maps to one MCM
Larger Shared-Memory Nodes
160
ECE 684
LPARs are almost completely independent
run separate operating systems
cannot access memory on a different LPAR

The 4 MCMs in a frame are connected by multiple
busses
some cross-LPAR traffic does occur
cache coherency mechanisms cannot be turned off

Single LPAR performance can be impacted by jobs
running on other LPARs in the same frame
can be on the order of 10% in worst case
not drastic, but noticeable on some benchmarks.
Larger Shared-Memory Nodes (continued)
161
ECE 684
p690 has a hardware prefetch capability
helps to hide the long latencies
make use of the available memory bandwidth

Simple algorithm for guessing which cache lines will
be required in the near future
fetch them before they are requested

Prefetch engine monitors loads to cache lines
detects accesses to consecutive cache lines (128 bytes)
in either ascending or descending order in memory
two consecutive accesses trigger a prefetch stream
Hardware prefetching
162
ECE 684
Accesses to subsequent consecutive cache lines
cause data to be fetched into the different caches
next line in sequence is fetched to L1 cache
line 5 ahead is fetched into L2 cache
lines 17, 18, 19 & 20 ahead (512 bytes) are fetched
into L3 cache.

Distance ahead is long enough to hide the memory
latency

Up to 8 streams can be active at the same time

Stream stops when page boundary is crossed
every 4 Kbytes, unless large pages enabled
Hardware prefetching (continued)
163
ECE 684
The Power4 Processor Introduction and Tuning Guide
http://www.redbooks.ibm.com/redbooks/SG247041.html
Where to find out more
Newest Supercomputer