
Intel Pentium 4 Processor

(much slide content courtesy of Zhijian Lu and Steve Kelley)


Outline
 Introduction (Zhijian)
– Willamette (11/2000)
 Instruction Set Architecture (Zhijian)
 Instruction Stream (Steve)
 Data Stream (Zhijian)
 What went wrong (Steve)
 Pentium 4 revisions
– Northwood (1/2002)
– Xeon (Prestonia, ~2002)
– Prescott (2/2004)
 Dual Core
– Smithfield
Introduction

 Intel Pentium 4 processor
– Latest IA-32 processor, equipped with the full
set of IA-32 SIMD operations
 First implementation of Intel's new
microarchitecture, “NetBurst” (11/2000)
IA-32
 Intel architecture 32-bit (IA-32)
– 80386 instruction set (1985)
– CISC, 32-bit addresses
 “Flat” memory model
 Registers
– Eight 32-bit registers
– Eight FP stack registers
– Six segment registers
IA-32 (cont’d)
 Addressing modes
– Register indirect (mem[reg])
– Base + displacement (mem[reg + const])
– Base + scaled index (mem[reg + (2^scale x index)])
– Base + scaled index + displacement (mem[reg + (2^scale
x index) + displacement])
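The general base + scaled-index + displacement form above can be sketched as a small helper that computes the effective address (the function name and example values are illustrative, not Intel's):

```python
def effective_address(base=0, index=0, scale=0, displacement=0):
    """Compute an IA-32 effective address: base + (2**scale * index) + displacement.

    scale is the 2-bit scale field (0-3), so the index is
    multiplied by 1, 2, 4, or 8.
    """
    assert scale in (0, 1, 2, 3)
    return base + (2 ** scale) * index + displacement

# e.g. indexing element 5 of an array of 4-byte ints at 0x1000, plus an
# 8-byte struct offset: 0x1000 + 4*5 + 8 = 0x101C
addr = effective_address(base=0x1000, index=5, scale=2, displacement=8)
```

The scale field is only 2 bits wide, which is why the multiplier is limited to powers of two up to 8.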
 SIMD instruction sets
– MMX (Pentium II)
» Eight 64-bit MMX registers, integer ops only
– SSE (Streaming SIMD Extension, Pentium III)
» Eight 128-bit registers
Pentium III vs. Pentium 4 Pipeline
Comparison Between Pentium 3 and Pentium 4
Execution on MPEG4 Benchmarks @ 1 GHz
Instruction Set Architecture

 Pentium4 ISA =
Pentium3 ISA +
SSE2 (Streaming SIMD Extensions 2)
 SSE2 is an architectural enhancement to
the IA-32 architecture
SSE2
Extends MMX and the SSE extensions with
144 new instructions:
 128-bit SIMD integer arithmetic operations
 128-bit SIMD double precision floating
point operations
 Enhanced cache and memory management
operations
Comparison Between SSE and SSE2
 Both support operations on 128-bit XMM registers
 SSE only supports 4 packed single-precision
floating-point values
 SSE2 supports more:
– 2 packed double-precision floating-point values
– 16 packed byte integers
– 8 packed word integers
– 4 packed doubleword integers
– 2 packed quadword integers
– 1 double quadword
Packing
 128 bits (word = 2 bytes) can be packed as:
– 1 double quadword (128 bits)
– 2 quad words (64 bits each)
– 4 double words (32 bits each)
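The packing granularities above can be illustrated by viewing the same 128 bits at each width; this is a plain-Python sketch of the data layout, not real SIMD execution:

```python
import struct

value = bytes(range(16))  # one 128-bit (double quadword) value

# The same 16 bytes viewed at each SSE2 packing granularity
# (little-endian, as on IA-32):
bytes16  = struct.unpack('<16B', value)  # 16 packed byte integers
words8   = struct.unpack('<8H', value)   # 8 packed word (16-bit) integers
dwords4  = struct.unpack('<4I', value)   # 4 packed doubleword integers
qwords2  = struct.unpack('<2Q', value)   # 2 packed quadword integers
doubles2 = struct.unpack('<2d', value)   # 2 packed double-precision floats
```

Each view reinterprets the identical 128 bits; which one an SSE2 instruction uses is encoded in the opcode, not in the register.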
Hardware Support for SSE2
 Adder and multiplier units in the SSE2
engine are 128 bits wide, twice the width of
those in the Pentium 3
 Increased load/store bandwidth for
floating-point values
– Loads and stores are 128 bits wide
– One load plus one store can complete between
an XMM register and the L1 cache in one
clock cycle
SSE2 Instructions (1)
 Data movements
– Move data between XMM registers, and between
XMM registers and memory
 Double-precision floating-point operations
– Arithmetic instructions on both scalar and
packed values
 Logical instructions
– Perform logical operations on packed double-
precision floating-point values
SSE2 Instructions (2)
 Compare instructions
– Compare packed and scalar double-precision
floating-point values
 Shuffle and unpack instructions
– Shuffle or interleave double-precision floating-
point values in packed double-precision
operands
 Conversion instructions
– Convert between doubleword integers and
double-precision floating-point values, or
between single-precision and double-precision
floating-point values
SSE2 Instructions (3)
 Packed single-precision floating-point instructions
– Convert between single-precision floating-point
and doubleword integer operands
 128-bit SIMD integer instructions
– Operations on integers contained in XMM
registers
 Cacheability control and instruction ordering
– More operations for caching data when storing
from XMM registers to memory, and additional
control of instruction ordering for store operations
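As an illustration of the packed double-precision arithmetic described above, here is a plain-Python model of an ADDPD-style operation: one instruction adds both 64-bit lanes of two 128-bit operands at once (this simulates the semantics only, it is not SIMD code):

```python
def addpd(xmm_a, xmm_b):
    """Element-wise add of two packed double-precision operands.

    Each argument models one 128-bit XMM register holding 2 doubles.
    """
    assert len(xmm_a) == len(xmm_b) == 2
    return [a + b for a, b in zip(xmm_a, xmm_b)]

# One instruction produces two independent results:
result = addpd([1.5, 2.5], [0.5, 10.0])
```

The scalar variant (ADDSD) would operate only on the low lane and pass the high lane through unchanged.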
Conclusion

 Pentium4 is equipped with the full set of
IA-32 SIMD technology, so all existing
software can run correctly on it
 AMD has decided to embrace and
implement SSE and SSE2 in future CPUs
Instruction Stream
 What’s new?
– Added Trace Cache
– Improved branch predictor
 Terminology
– µop – micro-op: an already-decoded, RISC-like
internal instruction
– Front end – instruction fetch and issue
Front End
 Prefetches instructions that are likely to be
executed
 Fetches instructions that haven't been
prefetched
 Decodes instructions into µops
 Generates µops for complex instructions and
special-purpose code
 Predicts branches
Prefetch
 Three methods of prefetching:
– Instructions only – hardware
– Data only – software
– Code or data – hardware
Decoder
 Single decoder that can operate at a
maximum of 1 instruction per cycle
 Receives instructions from L2 cache 64 bits
at a time
 Some complex instructions must enlist the
help of the microcode ROM
Trace Cache
 The primary instruction cache in the
NetBurst architecture
 Stores decoded µops
 ~12K µop capacity
 On a Trace Cache miss, instructions are
fetched and decoded from the L2 cache
What is a Trace Cache?
I1 …
I2 br r2, L1
I3 …
I4 …
I5 …
L1: I6
I7 …
 Traditional instruction cache line:

I1 I2 I3 I4
 Trace cache line (with the branch at I2 predicted taken):

I1 I2 I6 I7
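The difference above can be sketched as follows: a conventional instruction cache holds instructions in static program order, while a trace cache packs them in predicted execution order across the taken branch (an illustrative model, not the real hardware structure):

```python
def build_trace(program, start, taken_branches, max_len=4):
    """Follow predicted control flow, packing instructions into one trace line.

    program: {addr: (text, branch_target_or_None)}
    taken_branches: set of addresses of branches predicted taken
    """
    trace, addr = [], start
    while addr in program and len(trace) < max_len:
        text, target = program[addr]
        trace.append(text)
        if addr in taken_branches and target is not None:
            addr = target      # continue the trace at the branch target
        else:
            addr = addr + 1    # fall through to the next instruction
    return trace

# The example from the slide: I2 is a branch to L1 (= I6), predicted taken.
program = {1: ('I1', None), 2: ('I2', 6), 3: ('I3', None), 4: ('I4', None),
           5: ('I5', None), 6: ('I6', None), 7: ('I7', None)}
static_line = [program[a][0] for a in (1, 2, 3, 4)]
trace_line = build_trace(program, 1, taken_branches={2})
```

With the branch predicted taken, the trace line contains I1 I2 I6 I7, so the taken branch costs no fetch discontinuity.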
Pentium 4 Trace Cache
 Has its own branch predictor that directs
where instruction fetching needs to go next
in the Trace Cache
 Removes
– Decoding costs on frequently decoded
instructions
– Extra latency to decode instructions upon
branch mispredictions
Microcode ROM
 Used for complex IA-32 instructions (> 4
µops), such as string move, and for fault
and interrupt handling
 When a complex instruction is encountered,
the Trace Cache jumps into the microcode
ROM, which then issues the µops
 After the microcode ROM finishes, the
front end of the machine resumes fetching
µops from the Trace Cache
Branch Prediction
 Predicts ALL near branches
– Includes conditional branches, unconditional
calls and returns, and indirect branches
 Does not predict far transfers
– Includes far calls, irets, and software interrupts
Branch Prediction
 Dynamically predicts the direction and target
of branches based on the PC, using the BTB
 If no dynamic prediction is available,
predicts statically:
– Taken for backward (looping) branches
– Not taken for forward branches
 Traces are built across predicted branches to
avoid branch penalties
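The static fallback rule above (backward taken, forward not taken) amounts to a single address comparison; a sketch:

```python
def static_predict(branch_pc, target_pc):
    """Static fallback used when the BTB has no entry for this branch.

    Backward branches (target below the branch) usually close loops,
    so they are predicted taken; forward branches are predicted not taken.
    """
    return target_pc < branch_pc

# A loop-closing branch at 0x40 jumping back to 0x10 is predicted taken;
# a forward branch at 0x40 skipping ahead to 0x80 is predicted not taken.
backward_taken = static_predict(0x40, 0x10)
forward_taken = static_predict(0x40, 0x80)
```

This heuristic works because loop back-edges are taken on every iteration but the last, while forward branches often guard rarely-executed paths.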
Branch Target Buffer
 Uses a branch history table and a branch
target buffer to predict
 Updating occurs when the branch is retired
Return Address Stack
 16 entries
 Predicts return addresses for procedure calls
 Allows branches and their targets to coexist
in a single cache line
– Increases parallelism since decode bandwidth is
not wasted
Branch Hints
 P4 permits software to provide hints to the
branch prediction and trace formation
hardware to enhance performance
 Take the form of prefixes to conditional
branch instructions
 Used only at trace-build time; they have no
effect on already-built traces
Out-of-Order Execution
 Designed to optimize performance by
handling the most common operations in
the most common context as fast as possible
 Up to 126 µops can be in flight at once
– Up to 48 loads and 24 stores
Issue
 Instructions are fetched and decoded by the
translation engine
 The translation engine builds the instructions
into sequences of µops
 The µops are stored in the trace cache
 The trace cache can issue 3 µops per cycle
Execution
 Can dispatch up to 6 µops per cycle
 Exceeds the trace cache and retirement µop
bandwidth of 3 µops per cycle
– Allows greater flexibility in issuing µops to
different execution units
Execution Units
Double-pumped ALUs
 The ALU executes an operation on both the
rising and falling edges of the clock cycle
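Double-pumping means the fast ALUs complete two simple operations per core clock, so their effective rate is twice the clock frequency; a quick worked example at the Willamette launch clock:

```python
core_clock_hz = 1.5e9   # 1.5 GHz core clock (launch-speed Willamette)
ops_per_clock = 2       # one op on the rising edge, one on the falling edge
effective_rate = core_clock_hz * ops_per_clock
# 1.5e9 * 2 = 3e9 simple ALU operations per second
```

Only the simple integer operations (add, subtract, logical ops) run on these fast ALUs; more complex operations use slower units.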
Retirement
 Can retire 3 µops per cycle
 Precise exceptions
 A reorder buffer organizes completed µops
 Also keeps track of branches and sends
updated branch information to the BTB
Execution Pipeline
Data Stream of Pentium 4 Processor
Register Renaming
Register Renaming (2)
 8-entry architectural register file
 128-entry physical register file
 2 RATs (Register Alias Tables):
– Frontend RAT and Retirement RAT
 Data does not need to be copied between
register files when an instruction retires
On-chip Caches
 L1 instruction cache (Trace Cache)
 L1 data cache
 L2 unified cache
 The caches are non-inclusive, and a
pseudo-LRU replacement algorithm is used
L1 Instruction Cache
 The Execution Trace Cache stores decoded
instructions
 Removes decoder latency from the main
execution loop
 Integrates the path of program execution
into a single cache line
L1 Data Cache
 Non-blocking
– Supports up to 4 outstanding load misses
 Load latency
– 2 clocks for integer
– 6 clocks for floating-point
 1 load and 1 store per clock
 Speculative loads
– Assume the access will hit the cache
– “Replay” the dependent instructions when a
miss happens
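The hit-speculation and replay behavior above can be modeled very simply: dependents are dispatched timed for a hit, and on a miss the same µops must be executed again (a deliberately simplified sketch; names are illustrative):

```python
def schedule_load(hit, dependents):
    """Dispatch a load's dependent µops assuming an L1 hit.

    The scheduler issues dependents speculatively, timed for a cache hit;
    if the load actually misses, those µops are replayed when the data
    finally arrives.
    """
    issued = list(dependents)                   # dispatched early, assuming a hit
    replayed = [] if hit else list(dependents)  # re-executed on a miss
    return issued, replayed

issued, replayed = schedule_load(hit=False, dependents=['add', 'mul'])
```

Speculating on a hit keeps the common case fast: the 2-clock integer load latency is only achievable if dependents launch before the tag check completes.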
L2 Cache
 Load latency
– Net load access latency of 7 cycles
 Non-blocking
 Bandwidth
– One load and one store per cycle
– A new cache operation can begin every 2 cycles
– 256-bit wide bus between L1 and L2
– 48 Gbytes per second @ 1.5 GHz
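The quoted bandwidth follows directly from the bus width and clock: 256 bits is 32 bytes per transfer, once per core cycle at 1.5 GHz:

```python
bus_width_bytes = 256 // 8   # 256-bit L1<->L2 bus = 32 bytes per transfer
clock_hz = 1.5e9             # 1.5 GHz core clock
bandwidth = bus_width_bytes * clock_hz
# 32 bytes/cycle * 1.5e9 cycles/s = 48e9 bytes/s = 48 GB/s
```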
Data Prefetcher in L2 Cache

 A hardware prefetcher monitors reference
patterns
 Brings cache lines in automatically
 Attempts to stay 256 bytes ahead of the
current data access location
 Prefetches for up to 8 simultaneous,
independent streams
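The prefetcher's behavior can be sketched as per-stream state that chases each access pattern 256 bytes ahead; the 64-byte prefetch granularity and all names here are assumptions for illustration, not the real P4 parameters:

```python
class StreamPrefetcher:
    """Sketch: stay LOOKAHEAD bytes ahead of each of up to 8 streams."""
    LOOKAHEAD = 256
    LINE = 64          # assumed prefetch granularity (illustrative)
    MAX_STREAMS = 8

    def __init__(self):
        self.streams = {}  # stream id -> next address to prefetch

    def access(self, stream, addr):
        """On a demand access, issue prefetches up to addr + LOOKAHEAD."""
        if stream not in self.streams:
            if len(self.streams) >= self.MAX_STREAMS:
                return []                        # all 8 stream slots busy
            self.streams[stream] = addr + self.LINE
        issued = []
        while self.streams[stream] <= addr + self.LOOKAHEAD:
            issued.append(self.streams[stream])
            self.streams[stream] += self.LINE
        return issued

p = StreamPrefetcher()
first = p.access(0, 0)      # ramps up to 256 bytes ahead of address 0
second = p.access(0, 128)   # the stream advances; only new lines are fetched
```

As the demand stream moves forward, the prefetcher issues only the lines needed to restore its 256-byte lead, so steady-state overhead is low.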
Store and Load
 Out-of-order store and load operations
– Stores are always in program order
 48 loads and 24 stores can be in flight
 Store buffers and load buffers are allocated
at the allocation stage
– 24 store buffers and 48 load buffers in total
Store
 Store operations are divided into two parts:
– Store data
– Store address
 Store data is dispatched to the fast ALU,
which operates twice per clock cycle
 The store address is dispatched to the store
AGU once per cycle
Store-to-Load Forwarding

 Forwards data from a pending store buffer
entry to a dependent load
 Load stalls still happen when the bytes
needed by the load are not all present in the
pending store buffer entry
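The forwarding condition above can be modeled as a containment check: every byte the load needs must be covered by the pending store (a simplified sketch of the rule as stated on the slide; the real P4 is more restrictive about alignment):

```python
def can_forward(store_addr, store_size, load_addr, load_size):
    """Can a pending store forward its data to this load?

    Simplified rule: forwarding succeeds only when all the load's bytes
    lie within the pending store's bytes; otherwise the load must stall
    until the store data reaches the cache.
    """
    return (load_addr >= store_addr and
            load_addr + load_size <= store_addr + store_size)

exact = can_forward(0x100, 8, 0x100, 8)    # exact match: forward
covered = can_forward(0x100, 8, 0x104, 4)  # fully covered: forward
wider = can_forward(0x100, 4, 0x100, 8)    # load wider than store: stall
```

The stall case matters in practice: writing a struct field-by-field and then loading the whole struct defeats forwarding and costs the full store-to-cache latency.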
System Bus
 Delivers data at 3.2 Gbytes/s
 64-bit wide bus
 Four data phases per bus clock (quad
pumped)
 100 MHz system bus clock
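The 3.2 GB/s figure is just the product of the three numbers above: 8 bytes per data phase, four phases per bus clock, 100 MHz:

```python
bus_bytes = 64 // 8        # 64-bit bus = 8 bytes per data phase
phases_per_clock = 4       # quad pumped
bus_clock_hz = 100e6       # 100 MHz system bus clock
bandwidth = bus_bytes * phases_per_clock * bus_clock_hz
# 8 * 4 * 100e6 = 3.2e9 bytes/s = 3.2 GB/s
```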
Conclusion

Reduced Cache Size
vs.
Increased Bandwidth and Lower Latency
What Went Wrong
No L3 cache
 Original plans called for a 1M cache
 Intel’s idea was to strap a separate memory
chip, perhaps an SDRAM, on the back of
the processor to act as the L3
 But that added another 100 pads to the
processor, and would have also forced Intel
to devise an expensive cartridge package to
contain the processor and cache memory
Small L1 Cache
 Only 8k!
– Doubled size of L2 cache to compensate
 Compare with
– AMD Athlon – 128k
– Alpha 21264 – 64k
– PIII – 32k
– Itanium – 16k
Loses consistently to AMD
 In terms of performance, the Pentium 4 is as
slow as, or slower than, existing Pentium III
and AMD Athlon processors
 In terms of price, an entry level Pentium 4
sells for about double the cost of a similar
Pentium III or AMD Athlon based system
 The 1.5 GHz clock rate is more hype than
substance
Northwood
 1/2002
 Differences from Willamette
– Socket 478
– 21 stage pipeline
– 512 KB L2 cache
– 2.0 GHz, 2.2 GHz clock frequency
– 0.13 µm fabrication process (130 nm)
» 55 million transistors
Prescott
 2/2004
 Differences
– 31 stage pipeline!
– 1MB L2 cache
– 3.8 GHz clock frequency
– 0.09 µm fabrication process (90 nm)
– SSE3
