
Itanium™ Processor Core

Intel® Itanium™ Processor Core

Harsh Sharangpani
Principal Engineer and IA-64 Microarchitecture Manager
Intel Corporation

Hot Chips, 15 August 2000



Itanium™ Processor Silicon

[Figure: die photo. Core processor die with labeled regions (IA-32 Control, FPU, IA-64 Control, TLB, Integer Units, Instr. Fetch & Decode, Bus, and two Cache regions), shown alongside the 4 x 1MB L3 cache.]


Machine Characteristics
Frequency                 800 MHz
Transistor Count          25.4M CPU; 295M L3
Process                   0.18 µm CMOS, 6 metal layers
Package                   Organic Land Grid Array
Machine Width             6 insts/clock (4 ALU/MM, 2 Ld/St, 2 FP, 3 Br)
Registers                 14-ported 128 GR & 128 FR; 64 Predicates
Speculation               32-entry ALAT, Exception Deferral
Branch Prediction         Multilevel 4-stage Prediction Hierarchy
FP Compute Bandwidth      3.2 GFlops (DP/EP); 6.4 GFlops (SP)
Memory -> FP Bandwidth    4 DP (8 SP) operands/clock
Virtual Memory Support    64-entry ITLB, 32/96 2-level DTLB, VHPT
L2/L1 Cache               Dual-ported 96K Unified & 16KD; 16KI
L2/L1 Latency             6 / 2 clocks
L3 Cache                  4MB, 4-way s.a., BW of 12.8 GB/sec
System Bus                2.1 GB/sec; 4-way Glueless MP;
                          scalable to large (512+ proc) systems
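
Several of the headline figures above follow from the clock rate and the unit counts. The sketch below redoes that arithmetic as a sanity check. The count of four single-precision FMACs comes from the Floating Point Features slide. The 64-bit, 266 million transfers/s system bus is an assumption chosen because it reproduces the quoted 2.1 GB/sec; it is not stated on this slide.

```python
# Back-of-envelope check of the headline figures in the table.
CLOCK_HZ = 800e6        # core frequency (from the table)
FLOPS_PER_FMAC = 2      # a fused multiply-add counts as two FLOPs
DP_FMACS = 2            # extended/double-precision FMACs ("2 FP" in Machine Width)
SP_FMACS = 4            # 2 EP/DP FMACs plus 2 additional SP FMACs
DP_BYTES = 8            # size of one double-precision operand


def gflops(fmacs):
    """Peak GFLOPs for a given number of FMAC units at CLOCK_HZ."""
    return fmacs * FLOPS_PER_FMAC * CLOCK_HZ / 1e9


dp_gflops = gflops(DP_FMACS)
sp_gflops = gflops(SP_FMACS)

# 4 DP operands per clock from memory into the FP registers, in GB/s.
mem_to_fp_gbs = 4 * DP_BYTES * CLOCK_HZ / 1e9

# Bytes per core clock implied by the 12.8 GB/sec L3 figure.
l3_bytes_per_clock = 12.8e9 / CLOCK_HZ

# Assumption: 8-byte-wide bus at 266 million transfers per second.
bus_gbs = 8 * 266e6 / 1e9

print(dp_gflops, sp_gflops, mem_to_fp_gbs, l3_bytes_per_clock, round(bus_gbs, 1))
```

Read this way, the L3 figure corresponds to 16 bytes per core clock, which is two double-precision operands per clock.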

EPIC compared to Dynamic Scheduled RISC


Bottleneck                Itanium EPIC Approach                 Dynamic RISC Approach

Scheduling Scope          Entire compilation scope              Traditional compiler + limited
                                                                hardware window

Memory Latency &          Control Speculation across            Hardware Scheduling across
Control Flow Barriers     compiler scope; Data Speculation      dynamic window, assisted by
                          for undisambiguated memory;           Memory Order Buffer
                          Extensive Memory Hints

Control Flow              Predication for flaky branches;       Large Dynamic Branch
Disruptions               Extensive branch/prefetch Hints;      Predictors; 1 branch/clock
                          Superscalar branching

Operand Delivery          Large Register File, with             Small Architectural File with
                          Stacking & Rotation                   Register Rename Tables

Interprocedural           Stacking for parameter passing        Require spill/fill to memory
Overhead                                                        or registers


Itanium™ EPIC Design Maximizes SW-HW Synergy


Architecture features programmed by compiler:
• Branch Hints
• Explicit Parallelism
• Register Stack & Rotation
• Predication
• Data & Control Speculation
• Memory Hints

Micro-architecture features in hardware:
• Fetch: Instruction Cache & Branch Predictors
• Issue: Fast, Simple 6-Issue
• Register Handling: 128 GR & 128 FR, Register Remap & Stack Engine
• Control: Bypasses & Dependencies
• Parallel Resources: 4 Integer + 4 MMX Units; 2 FMACs (4 for SSE); 2 LD/ST units; 32-entry ALAT
• Memory Subsystem: Three levels of cache (L1, L2, L3)
• Speculation Deferral Management


10 Stage In-Order Core Pipeline


Stage   Name                              Group
IPG     Instruction Pointer Generation    Front End
FET     Fetch                             Front End
ROT     Rotate                            Front End
EXP     Expand                            Instruction Delivery
REN     Rename                            Instruction Delivery
WLD     Word-Line Decode                  Operand Delivery
REG     Register Read                     Operand Delivery
EXE     Execute                           Execution
DET     Exception Detect                  Execution
WRB     Write-Back                        Execution

Front End
• Pre-fetch/Fetch of up to 6 instructions/cycle
• Hierarchy of branch predictors
• Decoupling buffer

Instruction Delivery
• Dispersal of up to 6 instructions on 9 ports
• Reg. remapping
• Reg. stack engine

Operand Delivery
• Reg read + Bypasses
• Register scoreboard
• Predicated dependencies

Execution
• 4 single cycle ALUs, 2 ld/str
• Advanced load control
• Predicate delivery & branch
• NaT/Exception/Retirement


Front End (IPG, FET, ROT)

• SW-triggered prefetch loads target code early using BRP hints
• I-Fetch of 32 Bytes/clock feeds an 8-bundle decoupling buffer
• Branch hints combine with predictor hierarchy to improve branch prediction, delivering up to four progressive resteers
  – 4 TARs (Target Address Registers) under compiler control
  – Adaptive 2-level predictor (512-entry 2-way + 64-entry Multiway); 64-entry Target Address Cache fed by hints; Return stack buffer
  – Perfect loop-exit predictor, BAC1, BAC2
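
To make the "adaptive 2-level predictor" in the list above concrete, here is a minimal sketch of the generic two-level scheme: a per-branch history register selects one of several 2-bit saturating counters. It is not the Itanium organization. The direct-mapped indexing, the 4-bit history length, and the class and method names are assumptions for illustration only.

```python
class TwoLevelPredictor:
    """Generic two-level adaptive predictor (illustrative sizes only)."""

    def __init__(self, entries=512, history_bits=4):
        self.entries = entries
        self.history_bits = history_bits
        # Level 1: one local history register per table entry.
        self.history = [0] * entries
        # Level 2: a row of 2-bit counters per entry, started at "weakly not taken".
        self.counters = [[1] * (1 << history_bits) for _ in range(entries)]

    def _index(self, ip):
        # Assumption: 16-byte bundles, direct-mapped lookup by bundle address.
        return (ip >> 4) % self.entries

    def predict(self, ip):
        i = self._index(ip)
        return self.counters[i][self.history[i]] >= 2

    def update(self, ip, taken):
        i = self._index(ip)
        h = self.history[i]
        c = self.counters[i][h]
        # Saturating increment on taken, decrement on not taken.
        self.counters[i][h] = min(3, c + 1) if taken else max(0, c - 1)
        mask = (1 << self.history_bits) - 1
        self.history[i] = ((h << 1) | int(taken)) & mask


def count_mispredicts(predictor, ip, outcomes, warmup):
    """Replay branch outcomes and count mispredictions after a warm-up period."""
    misses = 0
    for n, taken in enumerate(outcomes):
        if n >= warmup and predictor.predict(ip) != taken:
            misses += 1
        predictor.update(ip, taken)
    return misses
```

A branch that repeats taken, taken, taken, not-taken is learned after a few iterations, because each 4-bit history value is always followed by the same outcome. A single 2-bit counter without history would keep mispredicting the loop exit.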

[Figure: front-end block diagram spanning IPG, FET, ROT and EXP. Labeled blocks: IP MUX; 16KB I-Cache & ITLB; decoupling buffer (to dispersal); Return Stack Buffer; Target Address Registers; Adaptive 2-Level Predictor; Br Target Address Cache; Loop Exit Predictor; BAC 1 & BAC 2.]



Instruction Delivery (EXP, REN)

• Stop bits eliminate dependency checking; templates simplify routing; first-available dispersal from 6 syllables to 9 issue ports
• Stacking eliminates most register spill/fills
• Register remapping done via several parallel 7-bit adders
• Stack engine performs the few required spill/fills
• REN stage supports renaming for stacking & rotation
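
The remapping for stacking and rotation comes down to small modular additions on 7-bit register numbers, which is why a few parallel adders suffice. The sketch below is a simplified software model and not the hardware design. It assumes the IA-64 conventions that r0-r31 are static and r32-r127 are stacked, and the names `bof` (bottom of frame), `sor` (size of rotating region) and `rrb` (rotating register base) are used illustratively.

```python
NUM_STATIC = 32    # r0-r31 are never renamed
NUM_STACKED = 96   # r32-r127 map onto a circular physical register stack


def rename_gr(reg, bof, sor=0, rrb=0):
    """Map a virtual general register number to a physical register number.

    reg: virtual register number, 0..127
    bof: offset of the current frame's first register within the stacked file
    sor: size of the rotating region at the bottom of the frame (0 = none)
    rrb: rotating register base, added modulo sor inside the rotating region
    """
    if reg < NUM_STATIC:
        return reg                                     # static registers pass through
    offset = reg - NUM_STATIC
    if offset < sor:
        offset = (offset + rrb) % sor                  # rotation within the region
    return NUM_STATIC + (bof + offset) % NUM_STACKED   # stacking, with wrap-around


# Example: a frame starting at the last stacked slot wraps around to the first.
print(rename_gr(32, bof=95), rename_gr(33, bof=95))
```

A procedure call only has to change `bof`. No register contents move unless the physical stack overflows, and that is the case left to the stack engine.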

[Figure: instruction delivery across EXP, REN and WLD. Syllables S0-S5 enter the Dispersal Network, which feeds issue ports M0, M1, I0, I1, F0, F1, B0, B1, B2; Integer, FP, & Predicate Renamers; Stack Engine with Stall and Spill/Fill Injection.]

Operand Delivery (WLD, REG)

• Multiported register file + mux hierarchy delivers operands in REG
• Unique “Delayed Stall” mechanism used for register dependencies
  – Avoids pipeline flush or replay on unavailable data
  – Stall computed in REG, but core pipeline stalls only in EXE
  – Special Operand Latch Manipulation (OLM) captures data returns into operand latches, to mimic register file read
→ Retains benefits of “stall paradigm” on wide and high-frequency machine
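
The "stall paradigm" rests on a register scoreboard: an instruction waits until its source registers are available, and nothing is flushed or replayed. The sketch below models only that basic idea for a single-issue, in-order machine. It does not model the delayed-stall timing between REG and EXE or the OLM mechanism, and the function name and instruction encoding are hypothetical.

```python
def issue_cycles(program, latency):
    """Return the cycle at which each instruction issues under a scoreboard.

    program: list of (dest, sources, op) tuples in program order
    latency: dict mapping op name to result latency in clocks
    """
    ready_at = {}   # scoreboard: register -> first cycle its value is usable
    cycle = 0
    issued = []
    for dest, sources, op in program:
        # Stall until every source register has been produced.
        needed = max((ready_at.get(s, 0) for s in sources), default=0)
        cycle = max(cycle, needed)
        issued.append(cycle)
        ready_at[dest] = cycle + latency[op]
        cycle += 1   # simplification: one instruction issues per clock
    return issued


program = [
    ("r1", ["r2"], "ld"),    # load with the 2-clock L1 latency
    ("r3", ["r1"], "add"),   # depends on the load, so it stalls one clock
    ("r4", ["r5"], "add"),   # independent, but waits behind the stalled add
]
print(issue_cycles(program, {"ld": 2, "add": 1}))
```

The third instruction shows the cost of in-order stalling: it has no dependency on the load but still issues behind the stalled add.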

[Figure: operand delivery across WLD, REG and EXE. 128-entry Integer Register File (8R / 6W) feeding Bypass Muxes and ALUs; Dependency Control with Scoreboard Comparators (Src, Dst, Preds), OLM comparators and Delayed Stall.]

Execution Resources (EXE)

Memory and Integer Resources (ports M0, M1, I0, I1):

Instruction Class                              Ports (of 4)   Latency (clocks)
ALU (Add, shift-add, logical, addp4, cmp)      4              1
Sign/zero extend, MoveLong                     2              1
Fixed Extract/Deposit, TBit, TNaT              1              1
Multimedia ALU                                 4              2
MM Shift, Avg, Mix, Pack                       2              2
Move to/from BR/PR/ARs, Packed Multiply,       1              2
  PopCount
LD/ST/Prefetch/SetF/Cache Control              2              2+
Memory Mngmt/System/GetF                       1              2+

FP Resources (ports F0, F1):

Instruction Class                              Ports (of 2)   Latency (clocks)
FMAC, SIMD FMAC                                2              5
Fixed Multiply                                 2              7
Fset, Fchk                                     2              1
FCompare                                       1              2
FP Logicals/Class/Min/Max                      1              5

Branch Resources (ports B0, B1, B2):

Instruction Class                              Ports (of 3)
Cond/Uncond                                    3
Call/Ret/Indirect                              3
Branch.iA, EPC                                 3
Loop, BSW, Cover                               1
RFI                                            1
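
The port tables above are what the dispersal logic works against: each syllable must land on a free port of the right class. Below is a minimal sketch of a first-available policy over the nine ports. It is heavily simplified and the function name is hypothetical. It ignores stop bits, templates, instructions that may use either M or I ports, and the per-port restrictions listed in the tables.

```python
PORTS = ["M0", "M1", "I0", "I1", "F0", "F1", "B0", "B1", "B2"]


def disperse(syllables):
    """Send each syllable to the first free port of its type, in program order.

    syllables: up to 6 type letters drawn from "MIFB".
    Issue stops at the first syllable that finds no free port, because the
    core is in-order. Returns the (type, port) pairs issued this clock.
    """
    free = list(PORTS)
    issued = []
    for kind in syllables[:6]:
        port = next((p for p in free if p.startswith(kind)), None)
        if port is None:
            break                   # remaining syllables wait for the next clock
        free.remove(port)
        issued.append((kind, port))
    return issued


print(disperse("MFBMFB"))   # all six syllables find a port
print(disperse("MIIMII"))   # only two I ports, so issue stops after four
```

The second call shows why the instruction mix matters: two bundles that together need four I ports cannot issue in one clock on a machine with two.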

Predication Support (EXE)

• Basic strategy: All instructions read operands and execute
  – Canceled at retirement if predicates off
• Predicates generated in EXE (by cmps), delivered in DET, & feed into retirement, branch execution and dependency detection
• Smart control cancels false stalls on predicated dependencies
  – Special detection exists in REG for canceled producer or consumer
• Predication supported transparently: branches (& mispredicts) eliminated without introduction of spurious stalls
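
The basic strategy above, execute everything and cancel at retirement when the predicate is off, can be shown with a small software model. It illustrates the semantics only and says nothing about how the pipeline implements them. The tuple encoding and function names are hypothetical.

```python
def retire(instrs, regs, preds):
    """Execute every instruction, then commit only those whose predicate is on.

    instrs: list of (qualifying_predicate, dest, function, source_names)
    """
    for qp, dest, fn, srcs in instrs:
        result = fn(*(regs[s] for s in srcs))   # EXE: operands read, result computed
        if preds[qp]:                           # retirement: cancel if predicate off
            regs[dest] = result
    return regs


def if_converted(a, b):
    """Branch-free form of: x = a + 1 if a < b else b + 2."""
    regs = {"a": a, "b": b, "x": 0}
    p1 = a < b                                  # a compare writes a predicate pair
    preds = {"p1": p1, "p2": not p1}
    instrs = [
        ("p1", "x", lambda v: v + 1, ["a"]),    # (p1) x = a + 1
        ("p2", "x", lambda v: v + 2, ["b"]),    # (p2) x = b + 2
    ]
    return retire(instrs, regs, preds)["x"]


print(if_converted(3, 5), if_converted(7, 5))
```

Both arms are computed on every call, but only one result is committed, so there is no branch to mispredict.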

[Figure: predicate delivery across REG, EXE and DET. Predicate Register File Read, I-Cmps and F-Cmps feed Bypass Muxes, whose outputs go to Dependency Detect (x6), Branch Execution (x3) and Retirement (x6).]


Speculation Hardware (DET, WRB)

• Control Speculation support requires minimal hardware
  – Computed memory exception delivered with data as tokens (NaTs)
  – NaTs propagate through subsequent executions like source data
• Data Speculation enabled efficiently via ALAT structure
  – 32 outstanding advanced loads
  – Indexed by reg-ids, keeps partial physical address tag
  – 0 clk checks: dependent use can be issued in parallel with check
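
The ALAT behaviour described above can be modeled in a few lines: an advanced load records an entry keyed by its destination register, any later store to a matching address removes the entry, and a check succeeds only if the entry is still present. This is a sketch of the concept only. The tag width, the oldest-first replacement policy, the omission of access sizes and the method names are all assumptions.

```python
class ALAT:
    """Toy model of an advanced load address table."""

    def __init__(self, entries=32, tag_bits=20):
        self.entries = entries
        self.tag_mask = (1 << tag_bits) - 1   # assumption: illustrative tag width
        self.table = {}                       # register id -> partial address tag

    def advanced_load(self, reg, addr):
        """Record an advanced load into `reg` from `addr`."""
        self.table.pop(reg, None)             # a new load replaces any old entry
        if len(self.table) >= self.entries:
            oldest = next(iter(self.table))   # assumption: evict the oldest entry
            del self.table[oldest]
        self.table[reg] = addr & self.tag_mask

    def store(self, addr):
        """Every store removes entries whose partial tag matches its address."""
        tag = addr & self.tag_mask
        for reg in [r for r, t in self.table.items() if t == tag]:
            del self.table[reg]

    def check(self, reg):
        """True if the advanced load is still valid; False means recovery is needed."""
        return reg in self.table


alat = ALAT()
alat.advanced_load(4, 0x1000)   # load hoisted above a store it may alias
alat.store(0x2000)              # unrelated store: entry survives
print(alat.check(4))
alat.store(0x1000)              # conflicting store: entry removed
print(alat.check(4))
```

With a partial tag, two different addresses can match, which only causes an unnecessary recovery. A missing entry never yields a wrong result, so the table can stay small.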
[Figure: speculation hardware across EXE, DET and WRB. Labeled blocks: Address; TLB & Memory Subsystem; Physical Address; 32-entry ALAT; Adv Ld Status; Exception; Spec. Ld. Status (NaT); Check Instruction; Check & Exception Logic.]
Efficient elimination of memory bottlenecks

Floating Point Features


• Native 82-bit hardware provides support for multiple numeric models
• 2 extended-precision pipelined FMACs deliver 4 EP / DP FLOPs/cycle
• Balanced with plenty of operand bandwidth from registers / memory
• Tuned for 3D graphics: 2 additional SP FMACs deliver 8 SP FLOPs/cycle; software divide allows SW pipelining for high throughput
• FPU hardware used for twin integer multiply-add (>1000 RSA decrypts/sec)
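
Software divide means the quotient is built from a reciprocal estimate refined by a short chain of multiply-adds. Each step is an ordinary pipelined FMAC operation, so independent divides can be overlapped. The sketch below shows the Newton-Raphson structure only. The linear seed stands in for the hardware reciprocal approximation (`frcpa` in IA-64), Python floats are not fused, and the iteration count is an assumption suited to this seed, so it mirrors the shape of the computation and not its exact rounding behaviour.

```python
import math


def sw_divide(a, b, iterations=4):
    """Approximate a / b for b > 0 using a seed plus multiply-add steps."""
    if b <= 0:
        raise ValueError("this sketch handles positive divisors only")
    # Seed: linear estimate of 1/b after scaling b into [0.5, 1).
    m, e = math.frexp(b)                      # b == m * 2**e with 0.5 <= m < 1
    y = math.ldexp(48 / 17 - (32 / 17) * m, -e)
    for _ in range(iterations):
        err = 1.0 - b * y                     # multiply-add shaped: -b*y + 1
        y = y + y * err                       # multiply-add shaped: y*err + y
    q = a * y                                 # first quotient estimate
    r = a - b * q                             # remainder of that estimate
    return q + r * y                          # one correction step


print(sw_divide(10.0, 4.0), sw_divide(1.0, 3.0))
```

Each Newton-Raphson step roughly squares the relative error of the reciprocal, so a seed accurate to a few bits reaches double precision in a handful of steps.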

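[Figure: FP operand data path. 4 Mbyte L3 Cache to L2 Cache at 2 DP ops/clk; L2 Cache to the 128-entry 82-bit register file (even/odd) at 4 DP ops/clk (2 x Fld-pair); 2 stores/clk; 6 x 82-bit operands read and 2 x 82-bit results written per clock.]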


Intel® Itanium™ Processor Block Diagram


[Figure: processor block diagram. Labeled blocks: L1 Instruction Cache and Fetch/Pre-fetch Engine with ITLB; Branch Prediction; Instruction Queue (8 bundles); IA-32 Decode and Control; 9 Issue Ports (B B B M M I I F F); Register Stack Engine / Re-Mapping; Branch & Predicate Registers; 128 Integer Registers; 128 FP Registers; Scoreboard, Predicate, NaTs, Exceptions; Branch Units; Integer and MM Units; Dual-Port L1 Data Cache and DTLB; ALAT; Floating Point Units with two SIMD FMACs; L2 Cache; L3 Cache; Bus Controller. Several blocks are marked ECC.]


Itanium™ Processor Core Summary

• State-of-the-art processor for servers and workstations
  – Combines high performance with 64-bit addressing, reliability features for mission-critical applications, & full IA-32 compatibility in hardware
• Highly parallel and deeply pipelined hardware at 800 MHz
  – 6-wide, 10-stage pipeline at 800 MHz on 0.18 µm process
• EPIC technology increases Instruction Level Parallelism (ILP)
  – Speculation, Predication, Explicit Parallelism, Register Stacking, Rotation, & Branch/Memory Hints maximize hardware-software synergy
• Dynamic features enable high throughput on compiled schedule
  – Register scoreboard, non-blocking caches, decoupled instruction prefetch & aggressive branch prediction
• Supercomputer-level FP (3.2 GFLOPs) for technical workstations

