You are on page 1of 20





The Itanium processor is the first ic runtime optimizations to enable the com-
implementation of the IA-64 instruction set piled code schedule to flow through at high
architecture (ISA). The design team opti- throughput. This strategy increases the syn-
mized the processor to meet a wide range of ergy between hardware and software, and
requirements: high performance on Internet leads to higher overall performance.
servers and workstations, support for 64-bit The processor provides a six-wide and 10-
addressing, reliability for mission-critical stage deep pipeline, running at 800 MHz on
applications, full IA-32 instruction set com- a 0.18-micron process. This combines both
patibility in hardware, and scalability across a abundant resources to exploit ILP and high
range of operating systems and platforms. frequency for minimizing the latency of each
The processor employs EPIC (explicitly instruction. The resources consist of four inte-
parallel instruction computing) design con- ger units, four multimedia units, two
Harsh Sharangpani cepts for a tighter coupling between hardware load/store units, three branch units, two
and software. In this design style the hard- extended-precision floating-point units, and
Ken Arora ware-software interface lets the software two additional single-precision floating-point
exploit all available compilation time infor- units (FPUs). The hardware employs dynam-
Intel mation and efficiently deliver this informa- ic prefetch, branch prediction, nonblocking
tion to the hardware. It addresses several caches, and a register scoreboard to optimize
fundamental performance bottlenecks in for compilation time nondeterminism. Three
modern computers, such as memory latency, levels of on-package cache minimize overall
memory address disambiguation, and control memory latency. This includes a 4-Mbyte
flow dependencies. level-3 (L3) cache, accessed at core speed, pro-
EPIC constructs provide powerful archi- viding over 12 Gbytes/s of data bandwidth.
tectural semantics and enable the software to The system bus provides glueless multi-
make global optimizations across a large processor support for up to four-processor sys-
scheduling scope, thereby exposing available tems and can be used as an effective building
instruction-level parallelism (ILP) to the hard- block for very large systems. The advanced
ware. The hardware takes advantage of this FPU delivers over 3 Gflops of numeric capa-
enhanced ILP, providing abundant execution bility (6 Gflops for single precision). The bal-
resources. Additionally, it focuses on dynam- anced core and memory subsystems provide

24 0272-1732/00/$10.00  2000 IEEE

Compiler-programmed features:
Branch parallelism; Data and control Memory
stack, Predication
hints instruction speculation hints

Hardware features:

Fetch Issue Register Control Parallel resources Memory

handling subsystem

Bypasses and dependencies

4 integer,
4 MMX units

Fast, simple 6-issue

128 GR,
128 FR, 2 + 2 FMACs
cache, register Three levels
branch remap, 2 load/store units of cache
predictors stack (L1, L2, L3)
engine 3 branch units

32-entry ALAT

Speculation deferral management

Figure 1. Conceptual view of EPIC hardware. GR: general register file; FR: floating-point regis-
ter file

high performance for a wide range of appli- events that are unpredictable at compi-
cations ranging from commercial workloads lation time so that the compiled code
to high-performance technical computing. flows through the pipeline at high
In contrast to traditional processors, the throughput.
machine’s core is characterized by hardware
support for the key ISA constructs that Figure 1 presents a conceptual view of the
embody the EPIC design style.1,2 This EPIC hardware. It illustrates how the various
includes support for speculation, predication, EPIC instruction set features map onto the
explicit parallelism, register stacking and rota- micropipelines in the hardware.
tion, branch hints, and memory hints. In this The core of the machine is the wide execu-
article we describe the hardware support for tion engine, designed to provide the compu-
these novel constructs, assuming a basic level tational bandwidth needed by ILP-rich EPIC
of familiarity with the IA-64 architecture (see code that abounds in speculative and predi-
the “IA-64 Architecture Overview” article in cated operations.
this issue). The execution control is augmented with a
bookkeeping structure called the advanced
EPIC hardware load address table (ALAT) to support data
The Itanium processor introduces a num- speculation and, with hardware, to manage
ber of unique microarchitectural features to the deferral of exceptions on speculative exe-
support the EPIC design style.2 These features cution. The hardware control for speculation
focus on the following areas: is quite simple: adding an extra bit to the data
path supports deferred exception tokens. The
• supplying plentiful fast, parallel, and controls for both the register scoreboard and
pipelined execution resources, exposed bypass network are enhanced to accommo-
directly to the software; date predicated execution.
• supporting the bookkeeping and control Operands are fed into this wide execution
for new EPIC constructs such as predi- core from the 128-entry integer and floating-
cation and speculation; and point register files. The register file addressing
• providing dynamic support to handle undergoes register remapping, in support of


M F I M F I 6 instructions provide: semantically richer register-remapping hard-

• 12 parallel ops/clock ware. Expensive register dependency-detec-
for scientific computing tion logic is eliminated via the explicit
Load 4 DP 2 ALU ops • 20 parallel ops/clock for parallelism directives that are precomputed by
digital content creation
(8 SP) ops via the software.
2 ldf-pair and 2 4 DP flops
ALU ops (8 SP flops) Using EPIC constructs, the compiler opti-
(postincrement) mizes the code schedule across a very large
scope. This scope of optimization far exceeds
M I I M B B 6 instructions provide: the limited hardware window of a few hun-
• 8 parallel ops/clock dred instructions seen on contemporary
for enterprise and
Internet applications dynamically scheduled processors. The result
2 loads and 2 ALU ops is an EPIC machine in which the close col-
2 ALU ops 2 branch
(postincrement) instructions laboration of hardware and software enables
high performance with a greater degree of
Figure 2. Two examples illustrating supported parallelism. SP: single preci- overall efficiency.
sion, DP: double precision
Overview of the EPIC core
The engineering team designed the EPIC
register stacking and rotation. The register core of the Itanium processor to be a parallel,
management hardware is enhanced with a deep, and dynamic pipeline that enables ILP-
control engine called the register stack engine rich compiled code to flow through at high
that is responsible for saving and restoring throughput. At the highest level, three impor-
registers that overflow or underflow the reg- tant directions characterize the core pipeline:
ister stack.
An instruction dispersal network feeds the • wide EPIC hardware delivering a new
execution pipeline. This network uses explic- level of parallelism (six instructions/
it parallelism and instruction templates to effi- clock),
ciently issue fetched instructions onto the • deep pipelining (10 stages) enabling high
correct instruction ports, both eliminating frequency of operation, and
complex dependency detection logic and • dynamic hardware for runtime opti-
streamlining the instruction routing network. mization and handling of compilation
A decoupled fetch engine exploits advanced time indeterminacies.
prefetch and branch hints to ensure that the
fetched instructions will come from the cor- New level of parallel execution
rect path and that they will arrive early enough The processor provides hardware for these
to avoid cache miss penalties. Finally, memo- execution units: four integer ALUs, four mul-
ry locality hints are employed by the cache timedia ALUs, two extended-precision float-
subsystem to improve the cache allocation and ing-point units, two additional single-precision
replacement policies, resulting in a better use floating-point units, two load/store units, and
of the three levels of on-package cache and all three branch units. The machine can fetch,
associated memory bandwidth. issue, execute, and retire six instructions each
EPIC features allow software to more effec- clock cycle. Given the powerful semantics of
tively communicate high-level semantic infor- the IA-64 instructions, this expands to many
mation to the hardware, thereby eliminating more operations being executed each cycle.
redundant or inefficient hardware and lead- The “Machine resources per port” sidebar on
ing to a more effective design. Notably absent p. 31 enumerates the full processor execution
from this machine are complex hardware resources.
structures seen in dynamically scheduled con- Figure 2 illustrates two examples demon-
temporary processors. Reservation stations, strating the level of parallel operation support-
reorder buffers, and memory ordering buffers ed for various workloads. For enterprise and
are all replaced by simpler hardware for spec- commercial codes, the MII/MBB template
ulation. Register alias tables used for register combination in a bundle pair provides six
renaming are replaced with the simpler and instructions or eight parallel operations per

clock (two load/store, two Execution core
Front end 4 single-cycle ALUs, 2 load/stores
general-purpose ALU opera- Prefetch/fetch of 6 instructions/clock Advanced load control
tions, two postincrement ALU Hierarchy of branch predictors Predicate delivery and branch
operations, and two branch Decoupling buffer NaT/exceptions/retirement
instructions). Alternatively, an
same mix of operations, but
with one branch hint and one Instruction delivery Operand delivery
branch operation, instead of Dispersal of 6 instructions onto Register file read and bypass
9 issue ports Register scoreboard
two branch operations. Register remapping Predicated dependencies
For scientific code, the use Register save engine
of the MFI template in each
bundle enables 12 parallel Figure 3. Itanium processor core pipeline.
operations per clock (loading
four double-precision oper-
ands to the registers and executing four double- levels of on-package cache. The pipeline
precision floating-point, two integer ALU, and design employs robust, scalable circuit design
two postincrement ALU operations). For digi- techniques. We consciously attempted to
tal content creation codes that use single- manage interconnect lengths and pipeline
precision floating point, the SIMD (single away secondary paths.
instruction, multiple data) features in the Figure 3 illustrates the 10-stage core
machine effectively enable up to 20 parallel pipeline. The bold line in the middle of the
operations per clock (loading eight single- core pipeline indicates a point of decoupling in
precision operands, executing eight single- the pipeline. The pipeline accommodates the
precision floating-point, two integer ALU, and decoupling buffer in the ROT (instruction
two postincrementing ALU operations). rotation) stage, dedicated register-remapping
hardware in the REN (register rename) stage,
Deep pipelining (10 stages) and pipelined access of the large register file
The cycle time and the core pipeline are bal- across the WLD (word line decode) and REG
anced and optimized for the sequential exe- (register read) stages. The DET (exception
cution in integer scalar codes, by minimizing detection) stage accommodates delayed branch
the latency of the most frequent operations, execution as well as memory exception man-
thus reducing dead time in the overall com- agement and speculation support.
putation. The high frequency (800 MHz) and
careful pipelining enable independent opera- Dynamic hardware for runtime optimization
tions to flow through this pipeline at high While the processor relies on the compiler
throughput, thus also optimizing vector to optimize the code schedule based upon the
numeric and multimedia computation. The deterministic latencies, the processor provides
cycle time accommodates the following key special support to dynamically optimize for
critical paths: several compilation time indeterminacies.
These dynamic features ensure that the com-
• single-cycle ALU, globally bypassed piled code flows through the pipeline at high
across four ALUs and two loads; throughput. To tolerate additional latency on
• two cycles of latency for load data data cache misses, the data caches are non-
returned from a dual-ported level-1 (L1) blocking, a register scoreboard enforces
cache of 16 Kbytes; and dependencies, and the machine stalls only on
• scoreboard and dependency control to encountering the use of unavailable data.
stall the machine on an unresolved regis- We focused on reducing sensitivity to branch
ter dependency. and fetch latency. The machine employs hard-
ware and software techniques beyond those
The feature set complies with the high- used in conventional processors and provides
frequency target and the degree of pipelin- aggressive instruction prefetch and advanced
ing—aggressive branch prediction, and three branch prediction through a hierarchy of


L1 instruction cache
and ITLB
fetch/prefetch engine
prediction Decoupling IA-32
buffer 8 bundles decode

Register stack engine/remapping

96-Kbyte 4-Mbyte
L2 Branch and 128 integer 128 floating-point L3
Scoreboard, predicate NaTs, exceptions

cache predicate registers registers cache

Integer Dual-
Branch and port
units MM L1
units data ALAT Floating-
cache point


Bus controller

Figure 4. Itanium processor block diagram.

branch prediction structures. A decoupling sor (six instructions per clock), an aggressive
buffer allows the front end to speculatively fetch front end is needed to keep the machine effec-
ahead, further hiding instruction cache laten- tively fed, especially in the presence of dis-
cy and branch prediction latency. ruptions due to branches and cache misses.
The machine’s front end is decoupled from
Block diagram the back end.
Figure 4 provides the block diagram of the Acting in conjunction with sophisticated
Itanium processor. Figure 5 provides a die plot branch prediction and correction hardware,
of the silicon database. A few top-level metal the machine speculatively fetches instructions
layers have been stripped off to create a suit- from a moderate-size, pipelined instruction
able view. cache into a decoupling buffer. A hierarchy of
branch predictors, aided by branch hints, pro-
Details of the core pipeline vides up to four progressively improving
The following describes details of the core instruction pointer resteers. Software-initiat-
processor microarchitecture. ed prefetch probes for future misses in the
instruction cache and then prefetches such
Decoupled software-directed front end target code from the level-2 (L2) cache into a
Given the high execution rate of the proces- streaming buffer and eventually into the

Core processor die

4 x 1Mbyte L3 cache

Figure 5. Die plot of the silicon database.

instruction cache. Figure 6 To dispersal

illustrates the front-end Instruction cache, Decoupling buffer

microarchitecture. ITLB, streaming buffers


Loop exit
Speculative fetches. The 16- Return stack buffer
Kbyte, four-way set-associative Branch
Target Adaptive Branch Branch
instruction cache is fully multiway
address address
address address
pipelined and can deliver 32 registers 2-level predictor calculation 1 calculation 2
bytes of code (two instruction
bundles or six instructions)
every clock. The cache is sup-
ported by a single-cycle, 64- Stages
entry instruction translation
look-aside buffer (TLB) that Figure 6. Processor front end.
is fully associative and backed
up by an on-chip hardware
page walker. be nine cycles of pipeline bubbles before the
The fetched code is fed into a decoupling pipeline is full again. This would mean a
buffer that can hold eight bundles of code. As heavy performance loss. Hence, we’ve placed
a result of this buffer, the machine’s front end significant emphasis on boosting the overall
can continue to fetch instructions into the branch prediction rate as well as reducing the
buffer even when the back end stalls. Con- branch prediction and correction latency.
versely, the buffer can continue to feed the The branch prediction hardware is assisted
back end even when the front end is disrupt- by branch hint directives provided by the
ed by fetch bubbles due to branches or compiler (in the form of explicit branch pre-
instruction cache misses. dict, or BRP, instructions as well as hint spec-
ifiers on branch instructions). The directives
Hierarchy of branch predictors. The processor provide branch target addresses, static hints
employs a hierarchy of branch prediction on branch direction, as well as indications on
structures to deliver high-accuracy and low- when to use dynamic prediction. These direc-
penalty predictions across a wide spectrum of tives are programmed into the branch predic-
workloads. Note that if a branch mispredic- tion structures and used in conjunction with
tion led to a full pipeline flush, there would dynamic prediction schemes. The machine


provides up to four progressive predictions by branch hints (using BRP and move-
and corrections to the fetch pointer, greatly BR hint instructions) and also managed
reducing the likelihood of a full-pipeline flush dynamically. Having the compiler pro-
due to a mispredicted branch. gram the structure with the upcoming
footprint of the program is an advantage
• Resteer 1: Single-cycle predictor. A special and enables a small, 64-entry structure to
set of four branch prediction registers be effective even on large commercial
(called target address registers, or TARs) workloads, saving die area and imple-
provides single-cycle turnaround on cer- mentation complexity. The BPT and
tain branches (for example, loop branch- MBPT cause a front-end resteer only if
es in numeric code), operating under tight the target address for the resteer is present
compiler control. The compiler programs in the target address cache. In the case of
these registers using BRP hints, distin- misses in the BPT and MBPT, a hit in the
guishing these hints with a special “impor- target address cache also provides a branch
tance” bit designator and indicating that direction prediction of taken.
these directives must get allocated into this A return stack buffer (RSB) provides
small structure. When the instruction predictions for return instructions. This
pointer of the candidate branch hits in buffer contains eight entries and stores
these registers, the branch is predicted return addresses along with correspond-
taken, and these registers provide the tar- ing register stack frame information.
get address for the resteer. On such taken • Resteers 3 and 4: Branch address calcula-
branches no bubbles appear in the execu- tion and correction. Once branch instruc-
tion schedule due to branching. tion opcodes are available (ROT stage),
• Resteer 2: Adaptive multiway and return it’s possible to apply a correction to pre-
predictors. For scalar codes, the processor dictions made earlier. The BAC1 stage
employs a dynamic, adaptive, two-level applies a correction for the exit condition
prediction scheme3,4 to achieve well over on modulo-scheduled loops through a
90% prediction rates on branch direction. special “perfect-loop-exit-predictor”
The branch prediction table (BPT) con- structure that keeps track of the loop
tains 512 entries (128 sets × 4 ways). Each count extracted during the loop initial-
entry, selected by the branch address, ization code. Thus, loop exits should
tracks the four most recent occurrences never see a branch misprediction in the
of that branch. This 4-bit value then back end. Additionally, in case of misses
indexes into one of 128 pattern tables in the earlier prediction structures, BAC1
(one per set). The 16 entries in each pat- extracts static prediction information and
tern table use a 2-bit, saturating, up-down addresses from branch instructions in the
counter to predict branch direction. rightmost slot of a bundle and uses these
The branch prediction table structure to provide a correction. Since most tem-
is additionally enhanced for multiway plates will place a branch in the rightmost
branches with a 64-entry, multiway slot, BAC1 should handle most branch-
branch prediction table (MBPT) that es. BAC2 applies a more general correc-
employs a similar algorithm but keeps tion for branches located in any slot.
three history registers per bundle entry.
A find-first-taken selection provides the Software-initiated prefetch. Another key ele-
first taken branch indication for the mul- ment of the front end is its software-initiated
tiway branch bundle. Multiway branch- instruction prefetch. Prefetch is triggered by
es are expected to be common in EPIC prefetch hints (encoded in the BRP instruc-
code, where multiple basic blocks are tions as well as in actual branch instructions)
expected to collapse after use of specula- as they pass through the ROT stage. Instruc-
tion and predication. tions get prefetched from the L2 cache into
Target addresses for this branch resteer an instruction-streaming buffer (ISB) con-
are provided by a 64-entry target address taining eight 32-byte entries. Support exists
cache (TAC). This structure is updated to prefetch either a short burst of 64 bytes of

code (typically, a basic block residing in up to
four bundles) or a long sequential instruction Machine resources per port
stream. Short burst prefetch is initiated by a Tables A-C describe the vocabulary of operations supported on the different issue ports
BRP instruction hoisted well above the actu- in the Itanium processor. The issue ports feed into memory (M), integer (I), floating-point (F),
al branch. For longer code streams, the and branch (B) execution data paths.
sequential streaming (“many”) hint from the
branch instruction triggers a continuous Table A. Memory and integer execution resources.
stream of additional prefetch requests until a
taken branch is encountered. The instruction Ports for issuing an instruction Latency
cache filters prefetch requests. The cache tags Instruction class M0 M1 I0 I1 (no. of clock cycles)
and the TLB have been enhanced with an
additional port to check whether an address ALU (add, shift-add, logical,
will lead to a miss. Such requests are sent to addp4, compare) • • • •
the L2 cache. Sign/zero extend, move long • • 1
The compiler can improve overall fetch per- Fixed extract/deposit, Tbit, TNaT • 1
formance by aggressive issue and hoisting of Multimedia ALU (add/avg./etc.) • • • • 2
BRP instructions, and by issuing sequential MM shift, avg, mix, pack • • 2
prefetch hints on the branch instruction when Move to/from branch/predicates/
branching to long sequential codes. To fully ARs, packed multiply, pop count • 2
hide the latency of returns from the L2 cache, Load/store/prefetch/setf/break.m/
BRP instructions that initiate prefetch should cache control/memory fence • • 2+
be hoisted 12 fetch cycles ahead of the branch. Memory management/system/getf • 2+
Hoisting by five cycles breaks even with no
prefetch at all. Every hoisted cycle above five
cycles has the potential of shaving one fetch Table B. Floating-point execution resources.
bubble. Although this kind of hoisting of BRP
instructions is a tall order, it does provide a Ports for issuing an instruction Latency
mechanism for the compiler to eliminate Instruction class F0 F1 (no. of clock cycles)
instruction fetch bubbles.
Efficient instruction and operand delivery Fixed multiply • • 7
After instructions are fetched in the front FClrf • • 1
end, they move into the middle pipeline that Fchk • • 1
disperses instructions, implements the archi- Fcompare • 2
tectural renaming of registers, and delivers Floating-point logicals, class,
operands to the wide parallel hardware. The min/max, pack, select • 5
hardware resources in the back end of the
machine are organized around nine issue ports.
The instruction and operand delivery hardware Table C. Branch execution resources.
maps the six incoming instructions onto the
nine issue ports and remaps the virtual register Ports for issuing an instruction
identifiers specified in the source code onto Instruction class B0 B1 B2
physical registers used to access the register file.
It then provides the source data to the execution Conditional or unconditional branch • • •
core. The dispersal and renaming hardware Call/return/indirect • • •
exploits high-level semantic information pro- Loop-type branch, BSW, cover •
vided by the IA-64 software, efficiently RFI •
enabling greater ILP and reduced instruction BRP (branch hint) • • •
path length.

Explicit parallelism directives. The instruction

dispersal mechanism disperses instructions pre- sor’s issue ports. The processor has a total of
sented by the decoupling buffer to the proces- nine issue ports capable of issuing up to two


tion resources to process all the instructions

Table 1. Instruction bundles capable of
full-bandwidth dispersal.
that will be issued in parallel. This over-
subscription problem is facilitated by the
First bundle* Second bundle IA-64 ISA feature of instruction bundle
MIH MLI, MFI, MIB, MBB, or MFB templates. Each instruction bundle not
MFI or MLI MLI, MFI, MIB, MBB, BBB, or MFB only specifies three instructions but also
MII MBB, BBB, or MFB contains a 4-bit template field, indicating
MMI BBB the type of each instruction: memory (M),
MFH MII, MLI, MFI, MIB, MBB, MFB integer (I), branch (B), and so on. By
examining template fields from the two
* B slots support branches and branch hints. bundles (a total of only 8 bits), the disper-
* H designates a branch hint operation in the B slot. sal logic can quickly determine the num-
ber of memory, integer, floating-point, and
branch instructions incoming every clock.
memory instructions (ports M0 and M1), two This is a hardware simplification resulting
integer (ports I0 and I1), two floating-point from the IA-64 instruction set architecture.
(ports F0 and F1), and three branch instruc- Unlike conventional instruction set archi-
tions (ports B0, B1, and B2) per clock. The tectures, the instruction encoding itself
processor’s 17 execution units are fed through doesn’t need to be examined to determine
the M, I, F, and B groups of issue ports. the type of each operation. This feature
The decoupling buffer feeds the dispersal removes decoders that would otherwise be
in a bundle granular fashion (up to two bun- required to examine many bits of the
dles or six instructions per cycle), with a fresh encoded instruction to determine the in-
bundle being presented each time one is con- struction’s type and associated issue port.
sumed. Dispersal from the two bundles is A second key advantage of the tem-
instruction granular—the processor disperses plate-based dispersal strategy is that cer-
as many instructions as can be issued (up to tain instruction types can only occur on
six) in left-to-right order. The dispersal algo- specific locations within any bundle. As
rithm is fast and simple, with instructions a result, the dispersal interconnection
being dispersed to the first available issue port, network can be significantly optimized;
subject to two constraints: detection of in- the routing required from dispersal to
struction independence and detection of issue ports is roughly only half of that
resource oversubscription. required for a fully connected crossbar.

• Independence. The processor must ensure Table 1 illustrates the effectiveness of the
that all instructions issued in parallel are dispersal strategy by enumerating the instruc-
either independent or contain only tion bundles that may be issued at full band-
allowed dependencies (such as a compare width. As can be seen, a rich mix of
instruction feeding a dependent condi- instructions can be issued to the machine at
tional branch). This question is easily dealt high throughput (six per clock). The combi-
with by using the stop-bits feature of the nation of stop bits and bundle templates, as
IA-64 ISA to explicitly communicate par- specified in the IA-64 instruction set, allows
allel instruction semantics. Instructions the compiler to indicate the independence and
between consecutive stop bits are deemed instruction-type information directly and
independent, so the instruction indepen- effectively to the dispersal hardware. As a
dence detection hardware is trivial. This result, the hardware is greatly simplified, there-
contrasts with traditional RISC processors by allowing an efficient implementation of
that are required to perform O(n2) (typi- instruction dispersal to a wide execution core.
cally dozens) comparisons between source
and destination register specifiers to deter- Efficient register remapping. After dispersal, the
mine independence. next step in preparing incoming instructions
• Oversubscription. The processor must also for execution involves implementing the reg-
guarantee that there are sufficient execu- ister stacking and rotation functions.

Register stacking is an IA-64 technique that nation registers. The total area taken by this
significantly reduces function call and return function is less than 0.25 square mm.
overhead. It ensures that all procedural input The register-stacking model also requires
and output parameters are in specific register special handling when software allocates more
locations, without requiring the compiler to virtual registers than are currently physically
perform register-register or memory-register available in the register file. A special state
moves. On procedure calls, a fresh register machine, the register stack engine (RSE), han-
frame is simply stacked on top of existing dles this case—termed stack overflow. This
frames in the large register file, without the engine observes all stacked register allocation
need for an explicit save of the caller’s registers. or deallocation requests. When an overflow is
This enables low-overhead procedure calls, detected on a procedure call, the engine silent-
providing significant performance benefit on ly takes control of the pipeline, spilling regis-
codes that are heavy in calls and returns, such ters to a backing store in memory until
as those in object-oriented languages. sufficient physical registers are available. In a
Register rotation is an IA-64 technique that similar manner, the engine handles the con-
allows very low overhead, software-pipelined verse situation—termed stack underflow—
loops. It broadens the applicability of com- when registers need to be restored from a
piler-driven software pipelining to a wide vari- backing store in memory. While these registers
ety of integer codes. Rotation provides a form are being spilled or filled, the engine simply
of register renaming that allows every itera- stalls instructions waiting on the registers; no
tion of a software-pipelined loop to have a pipeline flushes are needed to implement the
fresh copy of loop variables. This is accom- register spill/restore operations.
plished by accessing the registers through an Register stacking and rotation combine to
indirection based on the iteration count. provide significant performance benefits for a
Both stacking and rotation require the variety of applications, at the modest cost of
hardware to remap the register names. This a number of small adders, an additional
remapping translates the incoming virtual reg- pipeline stage, and control logic for a pro-
ister specifiers onto outgoing physical register grammer-invisible register stack engine.
specifiers, which are then used to perform the
actual lookup of the various register files. Large, multiported register files. The processor
Stacking can be thought of as simply adding provides an abundance of registers and execu-
an offset to the virtual register specifier. In a tion resources. The 128-entry integer register
similar fashion, rotation can also be viewed as file supports eight read ports and six write
an offset-modulo add. The remapping func- ports. Note that four ALU operations require
tion supports both stacking and rotation for eight read ports and four write ports from the
the integer register specifiers, but only register register file, while pending load data returns
rotation for the floating-point and predicate need two additional write ports (two returns
register specifiers. per cycle). The read and write ports can ade-
The Itanium processor efficiently supports quately support two memory and two integer
the register remapping for both register stack- instructions every clock. The IA-64 instruc-
ing and rotation with a set of adders and mul- tion set includes a feature known as postin-
tiplexers contained in the pipeline’s REN crement. Here, the address register of a
stage. The stacking logic requires only one 7- memory operation can be incremented as a
bit adder for each specifier, and the rotation side effect of the operation. This is supported
logic requires either one (predicate or float- by simply using two of the four ALU write
ing-point) or two (integer) additional 7-bit ports. (These two ALUs and write ports would
adders. The extra adder on the integer side is otherwise have been idle when memory oper-
needed due to the interaction of stacking with ations are issued off their ports).
rotation. Therefore, for full six-syllable exe- The floating-point register file also consists
cution, a total of ninety-eight 7-bit adders and of 128 registers, supports double extended-
42 multiplexers implement the combination precision arithmetic, and can sustain two
of integer, floating-point, and predicate memory ports in parallel with two multiply-
remapping for all incoming source and desti- accumulate units. This combination of


Clock 1: cmp.eq rl,r2 → pl, p3 Compute predicates P1, P3 resulting data may potentially not be con-
cmp.eq r3, r4 → p2, p4;; Compute predicates P2,P4 sumed. For such cases, it is key that 1) the
pipeline not be interrupted because of a cache
Clock 2: (p1) ld4 [r3] → r4;; Load nullified if P1=False
(Producer nullification)
miss, and 2) the pipeline only be interrupted
if and when the unavailable data is needed.
ClockN: (p4) add r4, r1 → r5 Add nullified if P4=False
(Consumer nullification) Thus, to achieve high performance, the
Note that a hazard exists only if strategy for dealing with detected data haz-
(a) p1=p2=true AND ards is based on stalls—the pipeline only stalls
(b) the r4 result is not available when the add collects its source data
when unavailable data is needed and stalls
Figure 7. Predicated producer-consumer dependencies. only as long as the data is unavailable. This
strategy allows the entire processor pipeline
to remain filled, and the in-flight dependent
resources requires eight read and four write instructions to be immediately ready to con-
ports. The register write ports are separated tinue as soon as the required data is available.
in even and odd banks, allowing each mem- This contrasts with other high-frequency
ory return to update a pair of floating-point designs, which are based on flushing and
registers. require that the pipeline be emptied when a
The other large register file is the predicate hazard is detected, resulting in reduced per-
register file. This register file has several formance. On the Itanium processor, innov-
unique characteristics: each entry is 1 bit, it ative techniques reap the performance benefits
has many read and write ports (15 reads/11 of a stall-based strategy and yet enable high-
writes), and it supports a “broadside” read or frequency operation on this wide machine.
write of the entire register file. As a result, it The scoreboard control is also enhanced to
has a distinct implementation, as described support predication. Since most operations
in the “Implementing predication elegantly” within the IA-64 instruction set architecture
section (next page). can be predicated, either the producer or the
consumer of a given piece of data may be nul-
High ILP execution core lified by having a false predicate. Figure 7 illus-
The execution core is the heart of the EPIC trates an example of such a case. Note that if
implementation. It supports data-speculative either the producer or consumer operation is
and control-speculative execution, as well as nullified via predication, there are no hazards.
predicated execution and the traditional func- The processor scoreboard therefore considers
tions of hazard detection and branch execu- both the producer and consumer predicates,
tion. Furthermore, the processor’s execution in addition to the normal operand availabili-
core provides these capabilities in the context ty, when evaluating whether a hazard exists.
of the wide execution width and powerful This hazard evaluation occurs in the REG
instruction semantics that characterize the (register read) pipeline stage.
EPIC design philosophy. Given the high frequency of the processor
pipeline, there’s not sufficient time to both
Stall-based scoreboard control strategy. As men- compute the existence of a hazard, and effect a
tioned earlier, the frequency target of the Ita- global pipeline stall in a single clock cycle.
nium processor was governed by several key Hence, we use a unique deferred-stall strategy.
timing paths such as the ALU plus bypass and This approach allows any dependent consumer
the two-cycle data cache. All of the control instructions to proceed from the REG into the
paths within the core pipeline fit within the EXE (execute) pipeline stage, where they are
given cycle time—detecting and dealing with then stalled—hence the term deferred stall.
data hazards was one such key control path. However, the instructions in the EXE stage
To achieve high performance, we adopted no longer have read port access to the register
a nonblocking cache with a scoreboard-based file to obtain new operand data. Therefore, to
stall-on-use strategy. This is particularly valu- ensure that the instructions in the EXE stage
able in the context of speculation, in which procure the correct data, the latches at the start
certain load operations may be aggressively of the EXE stage (which contain the source
boosted to avoid cache miss latencies, and the data values) continuously snoop all returning

data values, intercepting any data that the
instruction requires. The logic used to per- IA-32 compatibility
form this data interception is identical to the Another key feature of the Itanium processor is its full support of the IA-32 instruction set
register bypass network used to collect in hardware (see Figure A). This includes support for running a mix of IA-32 applications and
operands for instructions in the REG stage. IA-64 applications on an IA-64 operating system, as well as IA-32 applications on an IA-32
By noting that instructions observing a operating system, in both uniprocessor and multiprocessor configurations. The IA-32 engine
deferred stall in the REG stage don’t require makes use of the EPIC machine’s registers, caches, and execution resources. To deliver high
the use of the bypass network, the EXE stage performance on legacy binaries, the IA-32 engine dynamically schedules instructions.1,2 The
instructions can usurp the bypass network for IA-64 Seamless Architecture is defined to enable running IA-32 system functions in native
the deferred stall. By reusing existing register IA-64 mode, thus delivering native performance levels on the system functionality.
bypass hardware, the deferred stall strategy is
implemented in an area-efficient manner.
This allows the processor to combine the ben- References
efits of high frequency with stall-based 1. R. Colwell and R. Steck, “A 0.6µm BICMOS Microprocessor with Dynamic
pipeline control, thereby precluding the Execution,” Proc. Int’l Solid-State Circuits Conf., IEEE Press, Piscataway, N.J.,
penalty of pipeline flushes due to replays on 1995, pp. 176-177.
register hazards. 2. D. Papworth, “Tuning the Pentium Pro Microarchitecture,” IEEE Micro,
Mar./Apr. 1996, pp. 8-15.
Execution resources. The processor provides an
abundance of execution resources to exploit
ILP. The integer execution core includes two IA-32 instruction Shared
fetch and decode instruction cache
memory and two integer ports, with all four and TLB
ports capable of executing arithmetic, shift-
and-add, logical, compare, and most integer
SIMD multimedia operations. The memory
IA-32 dynamic
ports can also perform load and store opera- and scheduler
tions, including loads and stores with postin- Shared
crement functionality. The integer ports add execution
the ability to perform the less-common inte- IA-32 retirement core
and exceptions
ger instructions, such as test bit, look for zero
byte, and variable shift. Additional uncom-
mon instructions are also implemented on Figure A. IA-32 compatibility microarchitecture.
only the first integer port.
See the earlier sidebar for a full enumera-
tion of the per-port capabilities and associat- tions get introduced during predicated exe-
ed instruction latencies on the processor. In cution, the benefit of branch misprediction
general, we designed the method used to map elimination will be squandered. Care was
instructions onto each port to maximize over- taken to ensure that predicates are imple-
all performance, by balancing the instruction mented transparently in the pipeline.
frequency with the area and timing impact of The basic strategy for predicated execution
additional execution resources. is to allow all instructions to read the register
file and get issued to the hardware regardless
Implementing predication elegantly. Predication of their predicate value. Predicates are used to
is another key feature of the IA-64 architec- configure the data-forwarding network, detect
ture, allowing higher performance by elimi- the presence of hazards, control pipeline
nating branches and their associated advances, and conditionally nullify the exe-
misprediction penalties.5 However, predica- cution and retirement of issued operations.
tion affects several key aspects of the pipeline Predicates also feed the branching hardware.
design. Predication turns a control depen- The predicate register file is a highly multi-
dency (branching on the condition) into a ported structure. It is accessed in parallel with
data dependency (execution and forwarding the general registers in the REG stage. Since
of data dependent upon the value of the pred- predicates themselves are generated in the exe-
icate). If spurious stalls and pipeline disrup- cution core (from compare instructions, for


Register file data sumption. The costly bypass

logic that would have been

Source bypass
"True" source data Added to support predication needed for this is eliminated

by taking advantage of the
Bypass forwarded data fact that all predicate-writing
Source RegID
instructions have determinis-
tic latency. Instead, a specula-
Forwarded destination RegID
=? tive predicate register file
AND Forwarded instruction predicate (SPRF) is used and updated
Bypass forwarded data
as soon as predicate data is
computed. The source pred-
Figure 8. Predicated bypass control. icate of any dependent
instruction is then read
directly from this register file,
example) and may be in flight when they’re obviating the need for bypass logic. A sepa-
needed, they must be forwarded quickly to rate architectural predicate register file (APRF)
the specific hardware that consumes them. is only updated when a predicate-writing
Note that predication affects the hazard instruction retires and is only then allowed to
detection logic by nullifying either data pro- update the architectural state.
ducer or consumer instructions. Consumer In case of an exception or pipeline flush, the
nullification is performed after reading the SPRF is copied from the APRF in the shadow
predicate register file (PRF) for the predicate of the flush latency, undoing the effect of any
sources of the six instructions in the REG misspeculative predicate writes. The combi-
pipeline stage. Producer nullification is per- nation of latch-based implementation and the
formed after reading the predicate register file two-file strategy allow an area-efficient and
for the predicate sources for the six instruc- timing-efficient implementation of the high-
tions in the EXE stage. ly ported predicate registers.
Finally, three conditional branches can be Figure 8 shows one of the six EXE stage pred-
executed in the DET pipeline stage; this icates that allow or nullify data forwarding in
requires reading three additional predicate the data-forwarding network. The other five
sources. Thus, a total of 15 read ports are predicates are handled identically. Predication
needed to access the predicate register file. control of the bypass network is implemented
From a write port perspective, 11 predicates very efficiently by ANDing the predicate value
can be written every clock: eight from four with the destination-valid signal present in con-
parallel integer compares, two from a float- ventional bypass logic networks. Instructions
ing-point compare, and one via the stage pred- with false predicates are treated as merely not
icate write feature of loop branches. These writing to their destination register. Thus, the
read and write ports are in addition to a broad- impact of predication on the operand-
side read and write capability that allows a sin- forwarding network is fairly minimal.
gle instruction to read or write the entire
64-entry predicate register into or from a sin- Optimized speculation support in hardware.
gle 64-bit integer register. The predicate reg- With minimal hardware impact, the Itanium
ister file is implemented as a single 64-bit latch processor enables software to hide the latency
with 15 simple 64:1 multiplexers being used of load instructions and their dependent uses
as the read ports. Similarly, the 11 write ports by boosting them out of their home basic
are efficiently implemented, with each being block. This is termed speculation. To perform
a 6:64 decoder, with an AND-OR structure effective speculation, two key issues must be
used to update the actual predicate register file addressed. First, any exceptions that are detect-
latch. Broadside reads and writes are easily ed must be deferrable until an operation’s
implemented by reading or writing the con- home basic block is encountered; this is termed
tents of the entire 64 bit latch. control speculation. Second, all stores between
In-flight predicates must be forwarded the boosted load and its home location must
quickly after generation to the point of con- be checked for address overlap. If there is an

overlap, the latest store should forward the cor-
rect data; this is termed data speculation. The With minimal hardware
Itanium processor provides effective support
for both forms of speculation. impact, the Itanium processor
In case of control speculation, normal
exception checks are performed for a control- enables software to hide the
speculative load instruction. In the common
case, no exception is encountered, and there- latency of load instructions
fore no special handling is required. On a
detected exception, the hardware examines and their dependent uses by
the exception type, software-managed archi-
tectural control registers, and page attributes boosting them out of their
to determine whether the exception should be
handled immediately (such as for a TLB miss) home basic block.
or deferred for future handling.
For a deferral, a special deferred exception
token called NaT (Not a Thing) bit is retained
for each integer register, and a special float-
ing-point value, called NaTVal and encoded
in the NaN space, is set for floating-point reg- ware encounters an advanced load, it places
isters. This token indicates that a deferred the address, size, and destination register of
exception was detected. The deferred excep- the load into the ALAT structure. The ALAT
tion token is then propagated into result reg- then observes all subsequent explicit store
isters when any of the source registers indicates instructions, checking for overlaps of the valid
such a token. The exception is reported when advanced load addresses present in the ALAT.
either a speculation check or nonspeculative In the common case, there’s no match, the
use (such as a store instruction) consumes a ALAT state is unchanged, and the advanced
register that is flagged with the deferred excep- load result is used normally. In the case of an
tion token. In this way, NaT generation lever- overlap, all address-matching advanced loads
ages traditional exception logic simply, and in the ALAT are invalidated.
NaT propagation uses straightforward data After the last undisambiguated store prior
path logic. to the load’s home basic block, an instruction
The existence of NaT bits and NaTVals also can query the ALAT and find that the
affect the register spill-and-fill logic. For advanced load was matched by an interven-
explicit software-driven register spills and fills, ing store address. In this situation recovery is
special move instructions (store.spill and needed. When only the load and no depen-
load.fill) are supported that don’t take excep- dent instructions were boosted, a load-check
tions when encountering NaT’ed data. For (ld.c) instruction is used, and the load
floating-point data, the entire data is simply instruction is reissued down the pipeline, this
moved to and from memory. For integer data, time retrieving the updated memory data. As
the extra NaT bit is written into a special reg- an important performance feature, the ld.c
ister (called UNaT, or user NaT) on spills, and instruction can be issued in parallel with
is read back on the load.fill instruction. The instructions dependent on the load result
UNaT register can also be written to memo- data. By allowing this optimization, the crit-
ry if more than 64 registers need to be spilled. ical load uses can be issued immediately,
In the case of implicit spills and fills generat- allowing the ld.c to effectively be a zero-cycle
ed by the register save engine, the engine col- operation. When the advanced load and its
lects the NaT bits into another special register dependent uses were boosted, an advanced
(called RNaT, or register NaT), which is then check-load (chk.a) instruction traps to a user-
spilled (or filled) once for every 64 register save specified handler for a special fix-up code that
engine stores (or loads). reissues the load instruction and the opera-
For data speculation, the software issues an tions dependent on the load. Thus, support
advanced load instruction. When the hard- for data speculation was added to the pipeline


Floating-point feature set

The FPU in the processor is quite advanced. The native 82-bit hard- ations can yield high throughput on division and square-root operations
ware provides efficient support for multiple numeric programming mod- common in 3D geometry codes.
els, including support for single, double, extended, and The machine also provides one hardware pipe for execution of FCMPs
mixed-mode-precision computations. The wide-range 17-bit exponent and other operations (such as FMERGE, FPACK, FSWAP, FLogicals, recip-
enables efficient support for extended-precision library functions as well rocal, and reciprocal square root). Latency of the FCMP operations is two
as fast emulation of quad-precision computations. The large 128-entry clock cycles; latency of the other floating-point operations is five clock
register file provides adequate register resources. The FPU execution cycles.
hardware is based on the floating-point multiply-add (FMAC) primitive,
which is an effective building block for scientific computation.1 The Operand bandwidth
machine provides execution hardware for four double-precision or eight Care has been taken to ensure that the high computational bandwidth
single-precision flops per clock. This abundant computation bandwidth is matched with operand feed bandwidth. See Figure B. The 128-entry
is balanced with adequate operand bandwidth from the registers and floating-point register file has eight read and four write ports. Every cycle,
memory subsystem. With judicious use of data prefetch instructions, as the eight read ports can feed two extended-precision FMACs (each with
well as cache locality and allocation management hints, the software can three operands) as well as two floating-point stores to memory. The four
effectively arrange the computation for sustained high utilization of the write ports can accommodate two extended-precision results from the
parallel hardware. two FMAC units and the results from two load instructions each clock.
To increase the effective write bandwidth into the FPU from memory, we
FMAC units divided the floating-point registers into odd and even banks. This enables
The FPU supports two fully pipelined, 82-bit FMAC units that can exe- the two physical write ports dedicated to load returns to be used to write
cute single, double, or extended-precision floating-point operations. This four values per clock to the register file (two to each bank), using two ldf-
delivers a peak of 4 double-precision flops/clock, or 3.2 Gflops at 800 pair instructions. The ldf-pair instructions must obey the restriction that
MHz. FMAC units execute FMA, FMS, FNMA, FCVTFX, and FCVTXF oper- the pair of consecutive memory operands being loaded in sends one
ations. When bypassed to one another, the latency of the FMAC arith- operand to an even register and the other to an odd register for proper
metic operations is five clock cycles. use of the banks.
The processor also provides support for executing two SIMD-floating- The earliest cache level to feed the FPU is the unified L2 cache (96
point instructions in parallel. Since each instruction issues two single- Kbytes). Two ldf-pair instructions can load four double-precision values
precision FMAC operations (or four single-precision flops), the peak from the L2 cache into the registers. The latency of loads from this cache
execution bandwidth is 8 single-precision flops/clock or 6.4 Gflops at 800 to the FPU is nine clock cycles. For data beyond the L2 cache, the band-
MHz. Two supplemental single-precision FMAC units support this com- width to the L3 cache is two double-precision operations/clock (one 64-
putation. (Since the read of an 82-bit register actually yields two single- byte line every four clock cycles).
precision SIMD operands, the second operand in each case is peeled off Obviously, to achieve the peak rating of four double-precision floating-point
and sent to the supplemental SIMD units for execution.) The high com- operations per clock cycle, one needs to feed the FMACs with six operands
putational rate on single precision is especially suitable for digital con- per clock. The L2 memory can feed a peak of four operands per clock. The
tent creation workloads. remaining two need to come from the register file. Hence, with the right
The divide operation is done in software and can take advantage of amount of data reuse, and with appropriate cache management strategies
the twin fully pipelined FMAC hardware. Software-pipelined divide oper- aimed at ensuring that the L2 cache is well primed to feed the FPU, many
workloads can deliver sustained per-
formance at near the peak floating-
2 stores/clock point operation rating. For data
6 × 82 bits
without locality, use of the NT2 and
NTA hints enables the data to appear
to virtually stream into the FPU
Even Register
file through the next level of memory.
L2 (128-entry,
cache 2 double-
cache Odd 82 bits) FPU and integer core
4 double-
ops/clock The floating-point pipeline is cou-
ops/clock 2 × 82 bits pled to the integer pipeline. Regis-
(2 × ldf-pair) ter file read occurs in the REG stage,
Figure B. FMAC units deliver 8 flops/clock. with seven stages of execution

extending beyond the REG stage, followed by floating-point write back. FPU controls
Safe instruction recognition (SIR) hardware enables delivery of precise The FPU controls for operating precision and rounding are derived from
exceptions on numeric computation. In the FP1 (or EXE) stage, an early exam- the floating-point status register (FPSR). This register also contains the
ination of operands is performed to determine the possibility of numeric numeric execution status of each operation. The FPSR also supports spec-
exceptions on the instructions being issued. If the instructions are unsafe ulation in floating-point computation. Specifically, the register contains
(have potential for raising exceptions), a special form of hardware microre- four parallel fields or tracks for both controls and flags to support three par-
play is incurred. This mechanism enables instructions in the floating-point allel speculative streams, in addition to the primary stream.
and integer pipelines to flow freely in all situations in which no exceptions Special attention has been placed on delivering high performance for
are possible. speculative streams. The FPU provides high throughput in cases where the
The FPU is coupled to the integer data path via transfer paths between FCLRF instruction is used to clear status from speculative tracks before
the integer and floating-point register files. These transfers (setf, getf) forking off a fresh speculative chain. No stalls are incurred on such
are issued on the memory ports and made to look like memory operations changes. In addition, the FCHKF instruction (which checks for exceptions
(since they need register ports on both the integer and floating-point reg- on speculative chains on a given track) is also supported efficiently. Inter-
isters). While setf can be issued on either M0 or M1 ports, getf can only locks on this instruction are track-granular, so that no interlock stalls are
be issued on the M0 port. Transfer latency from the FPU to the integer incurred if floating-point instructions in the pipeline are only targeting the
registers (getf) is two clocks. The latency for the reverse transfer (setf) is other tracks. However, changes to the control bits in the FPSR (made via
nine clocks, since this operation appears like a load from the L2 cache. the FSETC instruction or the MOV GR→FPSR instruction) have a latency
We enhanced the FPU to support integer multiply inside the FMAC of seven clock cycles.
hardware. Under software control, operands are transferred from the inte-
ger registers to the FPU using setf. After multiplication is complete, the FPU summary
result is transferred to the integer registers using getf. This sequence The FPU feature set is balanced to deliver high performance across a
takes a total of 18 clocks (nine for setf, seven for fmul to write the regis- broad range of computational workloads. This is achieved through the
ters, and two for getf). The FPU can execute two integer multiply-add combination of abundant execution resources, ample operand bandwidth,
(XMA) operations in parallel. This is very useful in cryptographic appli- and a rich programming environment.
cations. The presence of twin XMA pipelines at 800 MHz allows for over
1,000 decryptions per second on a 1,024-bit RSA using private keys (server-
side encryption/decryption). References
1. B.Olsson et al., “RISC System/6000 Floating-Point Unit,” IBM
RISC System/6000 Technology, IBM Corp., 1990, pp. 34-42.

in a straightforward manner, only needing

management of a small ALAT in hardware. RegID
As shown in Figure 9, the ALAT is imple- [3:0]
mented as a 32-entry, two-way set-associative (index)
structure. The array is looked up based on the 16 sets
advanced load’s destination register ID, and Way 0 Way 1
each entry contains an advanced load’s phys-
ical address, a special octet mask, and a valid
bit. The physical address is used to compare
against subsequent stores, with the octet mask
bits used to track which bytes have actually RegID tag
been advance loaded. These are used in case of Physical address of adv load
partial overlap or in cases where the load and Valid bit
store are different sizes. In case of a match, the
corresponding valid bit is cleared. The later Figure 9. ALAT organization.
check instruction then simply queries the
ALAT to examine if a valid ALAT entry still
exists for the ALAT. for control speculation, eliminates the two
The combination of a simple ALAT for the fundamental barriers that software has tradi-
data speculation, in conjunction with NaT tionally encountered when boosting instruc-
bits and small changes to the exception logic tions. By adding this modest hardware


support for speculation, the processor allows effects of later instructions are automatically
the software to take advantage of the compil- squashed within the branch execution unit
er’s large scheduling window to hide memo- itself, preventing any architectural state update
ry latency, without the need for complex from branches in the shadow of a taken
dynamic scheduling hardware. branch. Given that the powerful branch pre-
diction in the front end contains tailored sup-
Parallel zero-latency delay-executed branching. port for multiway branch prediction, minimal
Achieving the highest levels of performance pipeline disruptions can be expected due to
requires a robust control flow mechanism. The this parallel branch execution.
processor’s branch-handling strategy is based Finally, the processor optimizes for the com-
on three key directions. First, branch seman- mon case of a very short distance between the
tics providing more program details are need- branch and the instruction that generates the
ed to allow the software to convey complex branch condition. The IA-64 instruction set
control flow information to the hardware. Sec- architecture allows a conditional branch to be
ond, aggressive use of speculation and predi- issued concurrently with the integer compare
cation will progressively lead to an emptying that generates its condition code—no stop bit
out of basic blocks, leaving clusters of branch- is needed. To accommodate this important
es. Finally, since the data flow from compare to performance optimization, the processor
dependent branch is often very tight, special pipelines the compare-branch sequence. The
care needs to be taken to enable high perfor- compare instruction is performed in the
mance for this important case. The processor pipeline’s EXE stage, with the results being
optimizes across all three of these fronts. known by the end of the EXE clock. To
The processor efficiently implements the accommodate the delivery of this condition to
powerful branch vocabulary of the IA-64 the branch hardware, the processor executes
instruction set architecture. The hardware takes all branches in the DET stage. (Note that the
advantage of the new semantics for improved presence of the DET stage isn’t an overhead
branch handling. For example, the loop count needed solely from branching. This stage is
(LC) register indicates the number of iterations also used for exception collection and priori-
in a For-type loop, and the epilogue count (EC) tization, and for the second clock of execution
register indicates the number of epilogue stages for integer-SIMD operations.) Thus, any
in a software-pipelined loop. branch issued in parallel with the compare that
By using the loop count information, high generates the condition will be evaluated in
performance can be achieved by software the DET stage, using the predicate results cre-
pipelining all loops. Moreover, the imple- ated in the previous (EXE) stage. In this man-
mentation avoids pipeline flushes for the first ner, the processor can easily handle the case of
and last loop iterations, since the actual num- compare and dependent branches issued in
ber of iterations is effectively communicated parallel.
to the hardware. By examining the epilog As a result of branch execution in the DET
count register information, the processor stage, in the rare case of a full pipeline flush
automatically generates correct stage predi- due to a branch misprediction, the processor
cates for the epilogue iterations of the soft- will incur a branch misprediction penalty of
ware-pipelined loop. This step leverages the nine pipeline bubbles. Note that we expect
predicate-remapping hardware along with the this to occur rarely, given the aggressive mul-
branch prediction information from the loop titier branch prediction strategy in the front
count register-based branch predictor. end. Most branches should be predicted cor-
Unlike conventional processors, the Itani- rectly using one of the four progressive resteers
um processor can execute up to three parallel in the front end.
branches per clock. This is implemented by The combination of enhanced branch
examining the three controlling conditions semantics, three-wide parallel branch execu-
(either predicates or the loop count/epilog tion, and zero-cycle compare-to-branch laten-
count counter values) for the three parallel cy allows the processor to achieve high
branches, and performing a priority encode performance on control-flow-dominated
to determine the earliest taken branch. All side codes, in addition to its high performance on

Table 2. Implementation of cache hints.

Hint Semantics L1 response L2 response L3 response

NTA Nontemporal (all levels) Don’t allocate Allocate, mark as next replace Don’t allocate
NT2 Nontemporal (2 levels) Don’t allocate Allocate, mark as next replace Normal allocation
NT1 Nontemporal (1 level) Don’t allocate Normal allocation Normal allocation
T1 (default) Temporal Normal allocation Normal allocation Normal allocation
Bias Intent to modify Normal allocation Allocate into exclusive state Allocate into exclusive state

more computation-oriented data-flow-dom- ter file, using two parallel floating-point load-
inated workloads. pair instructions.
The third level of on-package cache is 4
Memory subsystem Mbytes in size, uses a 64-byte line size, and is
In addition to the high-performance core, four-way set-associative. It communicates
the Itanium processor provides a robust cache with the processor at core frequency (800
and memory subsystem, which accommo- MHz) using a 128-bit bus. This cache serves
dates a variety of workloads and exploits the the large workloads of server- and transaction-
memory hints of the IA-64 ISA. processing applications, and minimizes the
cache traffic on the frontside system bus. The
Three levels of on-package cache L3 cache also implements a MESI protocol
The processor provides three levels of on- for microprocessor coherence.
package cache for scalable performance across A two-level hierarchy of TLBs handles vir-
a variety of workloads. At the first level, instruc- tual address translations for data accesses. The
tion and data caches are split, each 16 Kbytes hierarchy consists of a 32-entry first-level and
in size, four-way set-associative, and with a 32- 96-entry second-level TLB, backed by a hard-
byte line size. The dual-ported data cache has ware page walker.
a load latency of two cycles, is write-through,
and is physically addressed and tagged. The L1 Optimal cache management
caches are effective on moderate-size workloads To enable optimal use of the cache hierar-
and act as a first-level filter for capturing the chy, the IA-64 instruction set architecture
immediate locality of large workloads. defines a set of memory locality hints used for
The second cache level is 96 Kbytes in size, better managing the memory capacity at spe-
is six-way set-associative, and uses a 64-byte cific hierarchy levels. These hints indicate the
line size. The cache can handle two requests per temporal locality of each access at each level of
clock via banking. This cache is also the level at hierarchy. The processor uses them to deter-
which ordering requirements and semaphore mine allocation and replacement strategies for
operations are implemented. The L2 cache uses each cache level. Additionally, the IA-64 archi-
a four-state MESI (modified, exclusive, shared, tecture allows a bias hint, indicating that the
and invalid) protocol for multiprocessor coher- software intends to modify the data of a given
ence. The cache is unified, allowing it to ser- cache line. The bias hint brings a line into the
vice both instruction and data side requests cache with ownership, thereby optimizing the
from the L1 caches. This approach allows opti- MESI protocol latency.
mal cache use for both instruction-heavy (serv- Table 2 lists the hint bits and their mapping
er) and data-heavy (numeric) workloads. Since to cache behavior. If data is hinted to be non-
floating-point workloads often have large data temporal for a particular cache level, that data
working sets and are used with compiler opti- is simply not allocated to the cache. (On the L2
mizations such as data blocking, the L2 cache cache, to simplify the control logic, the proces-
is the first point of service for floating-point sor implements this algorithm approximately.
loads. Also, because floating-point performance The data can be allocated to the cache, but the
requires high bandwidth to the register file, the least recently used, or LRU, bits are modified
L2 cache can provide four double-precision to mark the line as the next target for replace-
operands per clock to the floating-point regis- ment.) Note that the nearest cache level to feed


pletion on the bus. A deferred transaction on

The Itanium processor is the the bus can be completed without reusing the
address bus. This reduces data return latency
first IA-64 processor and is for deferred transactions and efficiently uses
the address bus. This feature is critical for scal-
designed to meet the ability beyond four-processor systems.
The 64-bit system bus uses a source-syn-
demanding needs of a broad chronous data transfer to achieve 266-Mtrans-
fers/s, which enables a bandwidth of 2.1
range of enterprise and Gbytes/s. The combination of these features
makes the Itanium processor system a scalable
scientific workloads. building block for large multiprocessor sys-

the floating-point unit is the L2 cache. Hence,

for floating-point loads, the behavior is modi-
T he Itanium processor is the first IA-64
processor and is designed to meet the
demanding needs of a broad range of enter-
fied to reflect this shift (an NT1 hint on a float- prise and scientific workloads. Through its use
ing-point access is treated like an NT2 hint on of EPIC technology, the processor funda-
an integer access, and so on). mentally shifts the balance of responsibilities
Allowing the software to explicitly provide between software and hardware. The software
high-level semantics of the data usage pattern performs global scheduling across the entire
enables more efficient use of the on-chip compilation scope, exposing ILP to the hard-
memory structures, ultimately leading to ware. The hardware provides abundant exe-
higher performance for any given cache size cution resources, manages the bookkeeping
and access bandwidth. for EPIC constructs, and focuses on dynam-
ic fetch and control flow optimizations to keep
System bus the compiled code flowing through the
The processor uses a multidrop, shared sys- pipeline at high throughput. The tighter cou-
tem bus to provide four-way glueless multi- pling and increased synergy between hardware
processor system support. No additional bridges and software enable higher performance with
are needed for building up to a four-way sys- a simpler and more efficient design.
tem. Systems with eight or more processors are Additionally, the Itanium processor deliv-
designed through clusters of these nodes using ers significant value propositions beyond just
high-speed interconnects. Note that multidrop performance. These include support for 64
buses are a cost-effective way to build high-per- bits of addressing, reliability for mission-crit-
formance four-way systems for commercial ical applications, full IA-32 instruction set
transaction processing and e-business work- compatibility in hardware, and scalability
loads. These workloads often have highly shared across a range of operating systems and mul-
writeable data and demand high throughput tiprocessor platforms. MICRO
and low latency on transfers of modified data
between caches of multiple processors. References
In a four-processor system, the transaction- 1. L. Gwennap, “Merced Shows Innovative
based bus protocol allows up to 56 pending Design,” Microprocessor Report, Micro-
bus transactions (including 32 read transac- Design Resources, Sunnyvale, Calif., Oct. 6,
tions) on the bus at any given time. An 1999, pp. 1, 6-10.
advanced MESI coherence protocol helps in 2. M.S. Schlansker and B.R. Rau, “EPIC:
reducing bus invalidation transactions and in Explicitly Parallel Instruction Computing,”
providing faster access to writeable data. The Computer, Feb. 2000, pp. 37-45.
cache-to-cache transfer latency is further 3. T.Y. Yeh and Y.N. Patt, “Two-Level Adaptive
improved by an enhanced “defer mechanism,” Training Branch Prediction,” Proc. 24th Ann.
which permits efficient out-of-order data Int’l Symp. Microarchitecture, ACM Press,
transfers and out-of-order transaction com- New York, Nov. 1991, pp. 51-61.

4. L. Gwennap, “New Algorithm Improves Ken Arora was the microarchitect of the exe-
Branch Prediction,” Microprocessor Report, cution and pipeline control of the EPIC core
Mar. 27, 1995, pp. 17-21. of the Itanium processor. He participated in
5. B.R. Rau et al., “The Cydra 5 Departmental the joint Intel/HP IA-64 EPIC ISA definition
Supercomputer: Design Philosophies, and helped develop the initial IA-64 simula-
Decisions, and Trade-Offs,” Computer, Jan. tion environment. Earlier, he was a designer
1989, pp. 12-35. and architect on the i486 and Pentium proces-
sors. Arora received a BS degree in computer
Harsh Sharangpani was Intel’s principal science and a master’s degree in electrical engi-
microarchitect on the joint Intel-HP IA-64 neering from Rice University. He holds 12
EPIC ISA definition. He managed the patents in the field of microprocessor archi-
microarchitecture definition and validation of tecture and design.
the EPIC core of the Itanium processor. He
has also worked on the 80386 and i486 proces- Direct questions about this article to Harsh
sors, and was the numerics architect of the Pen- Sharangpani, Intel Corporation, Mail Stop
tium processor. Sharangpani received an SC 12-402, 2200 Mission College Blvd.,
MSEE from the University of Southern Cali- Santa Clara, CA 95052; harsh.sharang-
fornia, Los Angeles and a BSEE from the Indi-
an Institute of Technology, Bombay. He holds
20 patents in the field of microprocessors.

Introducing the
IEEE Computer Society

Career Service Center

Career Advance your career
Service Search for jobs
Center Post a resume
• Certification
List a job opportunity
• Educational Activities
• Career Information
Post your company’s profile
• Career Resources Link to career services
• Student Activities
• Activities Board