
A Look Inside Intel®:

The Core (Nehalem)


Microarchitecture

Beeman Strong
Intel® Core™ microarchitecture
(Nehalem) Architect
Intel Corporation
Agenda
• Intel® Core™ Microarchitecture (Nehalem) Design
Overview
• Enhanced Processor Core
• Performance Features
• Intel® Hyper-Threading Technology
• New Platform
• New Cache Hierarchy
• New Platform Architecture
• Performance Acceleration
• Virtualization
• New Instructions
• Power Management Overview
• Minimizing Idle Power Consumption
• Performance when it counts

2
Scalable Cores
Same core for all segments. Common software optimization. Common feature set.

Intel® Core™ Microarchitecture (Nehalem), 45nm

Servers/Workstations: Energy Efficiency, Performance, Virtualization,
Reliability, Capacity, Scalability

Desktop: Performance, Graphics, Energy Efficiency, Idle Power, Security

Mobile: Battery Life, Performance, Energy Efficiency, Graphics, Security

Optimized cores to meet all market segments
3
The First Intel® Core™ Microarchitecture
(Nehalem) Processor

[Die diagram: four cores sharing an L3 cache, an integrated Memory Controller, Misc I/O blocks, a Queue, and two Intel QuickPath Interconnect links (QPI 0 and QPI 1)]

QPI: Intel® QuickPath Interconnect (Intel® QPI)

A Modular Design for Flexibility

4
Agenda
• Intel® Core™ Microarchitecture (Nehalem) Design
Overview
• Enhanced Processor Core
• Performance Features
• Intel® Hyper-Threading Technology
• New Platform
• New Cache Hierarchy
• New Platform Architecture
• Performance Acceleration
• Virtualization
• New Instructions
• Power Management Overview
• Minimizing Idle Power Consumption
• Performance when it counts

5
Intel® Core™ Microarchitecture Recap

• Wide Dynamic Execution


– 4-wide decode/rename/retire
• Advanced Digital Media Boost
– 128-bit wide SSE execution units
• Intel HD Boost
– New SSE4.1 Instructions
• Smart Memory Access
– Memory Disambiguation
– Hardware Prefetching
• Advanced Smart Cache
– Low latency, high BW shared L2 cache

Nehalem builds on the great Core microarchitecture

6
Designed for Performance

New features in this generation:
• New SSE4.2 Instructions
• Improved Lock Support
• Additional Caching Hierarchy
• Deeper Buffers
• Improved Loop Streaming
• Simultaneous Multi-Threading
• Faster Virtualization
• Better Branch Prediction

[Block diagram of the core: Instruction Fetch & L1 Cache, Decode & Microcode, Branch Prediction, Out-of-Order Scheduling & Retirement, Execution Units, Memory Ordering & Execution, Paging, L1 Data Cache, L2 Cache & Interrupt Servicing]

7
Macrofusion
• Introduced in Intel® Core™2 microarchitecture
• A TEST/CMP instruction followed by a conditional branch is treated
as a single instruction
– Decode/execute/retire as one instruction
• Higher performance & improved power efficiency
– Improves throughput / reduces execution latency
– Less processing required to accomplish the same work
• Supports all the cases in Intel® Core™2 microarchitecture PLUS
– CMP+Jcc macrofusion added for the following branch conditions
– JL/JNGE
– JGE/JNL
– JLE/JNG
– JG/JNLE
– Intel® Core™ microarchitecture (Nehalem) supports macrofusion
in both 32-bit and 64-bit modes
– Intel® Core™2 microarchitecture only supports macrofusion in 32-bit
mode

Increased macrofusion benefit on Intel®
Core™ microarchitecture (Nehalem)
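To make the CMP+Jcc pattern concrete, here is a hedged C sketch (not from the original deck); the function and loop are illustrative only:

/* Hypothetical example: the loop-closing compare and conditional branch
 * below typically compile to a CMP + JL pair, which is a macrofusion
 * candidate; on Nehalem the pair can fuse in 64-bit mode as well. */
long sum_first_n(const long *a, long n)
{
    long sum = 0;
    for (long i = 0; i < n; i++) {  /* back edge: cmp i,n ; jl loop_top */
        sum += a[i];
    }
    return sum;
}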

8
Intel® Core™ Microarchitecture
(Nehalem) Loop Stream Detector
• Loop Stream Detector identifies software loops
– Stream from Loop Stream Detector instead of normal path
– Disable unneeded blocks of logic for power savings
– Higher performance by removing instruction fetch limitations
• Higher performance: Expand the size of the loops detected (vs Core 2)
• Improved power efficiency: Disable even more logic (vs Core 2)

[Diagram: Intel® Core™ microarchitecture (Nehalem) Loop Stream Detector sits after Branch Prediction, Fetch, and Decode and holds 28 micro-ops]

9
Branch Prediction Improvements
• Focus on improving branch prediction accuracy with each CPU
generation
– Higher performance & lower power through more
accurate prediction
• Example Intel® Core™ microarchitecture (Nehalem)
improvements
– L2 Branch Predictor
– Improves accuracy for applications with large code size (e.g.
database applications)
– Advanced Renamed Return Stack Buffer (RSB)
– Removes branch mispredicts on the x86 RET instruction (function
returns) in the common case

Greater Performance through Branch Prediction

10
Execution Unit Overview

Execute 6 operations/cycle:
• 3 Memory Operations: 1 Load, 1 Store Address, 1 Store Data
• 3 "Computational" Operations

Unified Reservation Station:
• Schedules operations to Execution units
• Single Scheduler for all Execution Units
• Can be used by all integer, all FP, etc.

Issue ports from the Unified Reservation Station:
• Port 0: Integer ALU & Shift, FP Multiply, Divide, SSE Integer ALU, Integer Shuffles
• Port 1: Integer ALU & LEA, FP Add, Complex Integer, SSE Integer Multiply
• Port 2: Load
• Port 3: Store Address
• Port 4: Store Data
• Port 5: Integer ALU & Shift, Branch, FP Shuffle, SSE Integer ALU, Integer Shuffles

11
Increased Parallelism

• Goal: Keep powerful execution engine fed
• Nehalem increases size of the out-of-order window by 33%
• Must also increase other corresponding structures

[Chart: concurrent uops possible (1): Dothan, Merom, Nehalem; y-axis 0 to 128]

Structure            Intel® Core™ microarchitecture   Intel® Core™ microarchitecture   Comment
                     (formerly Merom)                 (Nehalem)
Reservation Station  32                               36                               Dispatches operations to execution units
Load Buffers         32                               48                               Tracks all load operations allocated
Store Buffers        20                               32                               Tracks all store operations allocated

Increased Resources for Higher Performance

1
Intel® Pentium® M processor (formerly Dothan)
Intel® Core™ microarchitecture (formerly Merom)
Intel® Core™ microarchitecture (Nehalem)

12
Enhanced Memory Subsystem

• Responsible for:
– Handling of memory operations (loads/stores)
• Key Intel® Core™2 Features
– Memory Disambiguation
– Hardware Prefetchers
– Advanced Smart Cache
• New Intel® Core™ Microarchitecture (Nehalem)
Features
– New TLB Hierarchy (new, low latency 2nd level unified TLB)
– Fast 16-Byte unaligned accesses
– Faster Synchronization Primitives

13
Intel® Hyper-Threading Technology
• Also known as Simultaneous Multi-Threading (SMT)
– Run 2 threads at the same time per core
• Take advantage of 4-wide execution engine
– Keep it fed with multiple threads
– Hide latency of a single thread
• Most power efficient performance feature
– Very low die area cost
– Can provide significant performance benefit
depending on application
– Much more efficient than adding an entire
core
• Intel® Core™ microarchitecture (Nehalem)
advantages
– Larger caches
– Massive memory BW

[Diagram: execution slots over time (proc. cycles), without SMT vs. with SMT; each box represents a processor execution unit]

Simultaneous multi-threading enhances
performance and energy efficiency

14
SMT Performance Chart

[Chart: performance gain with SMT enabled vs. disabled on an Intel® Core™ i7 processor; gains range from 7% to 34% across Floating Point, 3dsMax*, Integer, Cinebench* 10, POV-Ray* 3.7 beta25, and 3DMark* Vantage* CPU]

Floating Point is based on SPECfp_rate_base2006* estimate


Integer is based on SPECint_rate_base2006* estimate

SPEC, SPECint, SPECfp, and SPECrate are trademarks of the Standard Performance Evaluation Corporation.
For more information on SPEC benchmarks, see: http://www.spec.org

Source: Intel. Configuration: pre-production Intel® Core™ i7 processor with 3 channel DDR3 memory. Performance tests and ratings are measured
using specific computer systems and / or components and reflect the approximate performance of Intel products as measured by those tests. Any
difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information
to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the
performance of Intel products, visit http://www.intel.com/performance/

15
Agenda
• Intel® Core™ Microarchitecture (Nehalem) Design
Overview
• Enhanced Processor Core
• Performance Features
• Intel® Hyper-Threading Technology
• New Platform
• New Cache Hierarchy
• New Platform Architecture
• Performance Acceleration
• Virtualization
• New Instructions
• Power Management Overview
• Minimizing Idle Power Consumption
• Performance when it counts

16
Designed For Modularity

[Diagram: N cores plus the "Uncore": shared L3 Cache, integrated memory controller (IMC) to DRAM, Intel QPI links, and Power & Clock]

Intel® QPI: Intel® QuickPath Interconnect (Intel® QPI)

Differentiation in the "Uncore": # of cores, # of memory channels, # of QPI links,
size of cache, type of memory, power management, integrated graphics

2008 – 2009: Servers & Desktops

Optimal price / performance / energy efficiency
for server, desktop and mobile products
17
Intel® Smart Cache – 3rd Level Cache

• Shared across all cores
• Size depends on # of cores
– Quad-core: Up to 8MB (16-way)
– Scalability:
– Built to vary size with varied core counts
– Built to easily increase L3 size in future parts
• Perceived latency depends on frequency ratio between core &
uncore
• Inclusive cache policy for best performance
– Address residing in L1/L2 must be present in 3rd level cache

[Diagram: per-core L1 and L2 caches backed by a shared L3 cache]

18
Why Inclusive?

• Inclusive cache provides benefit of an on-die snoop filter


• Core Valid Bits
– 1 bit per core per cache line
– If line may be in a core, set core valid bit
– Snoop only needed if line is in L3 and core valid bit is set
– Guaranteed that line is not modified if multiple bits set
• Scalability
– Addition of cores/sockets does not increase snoop traffic seen by
cores
• Latency
– Minimize effective cache latency by eliminating cross-core snoops
in the common case
– Minimize snoop response time for cross-socket cases

19
Intel® Core™ Microarchitecture
(Nehalem-EP) Platform Architecture
• Integrated Memory Controller
– 3 DDR3 channels per socket
– Massive memory bandwidth
– Memory bandwidth scales with # of processors
– Very low memory latency
• Intel® QuickPath Interconnect (Intel® QPI)
– New point-to-point interconnect
– Socket to socket connections
– Socket to chipset connections
– Build scalable solutions
– Up to 6.4 GT/sec (12.8 GB/sec)
– Bidirectional (=> 25.6 GB/sec)

[Diagram: two Nehalem-EP sockets, each with local memory, connected to each other and to the Tylersburg-EP IOH via Intel QPI]

Significant performance leap from new platform

Intel® Core™ microarchitecture (Nehalem-EP)


Intel® Next Generation Server Processor Technology (Tylersburg-EP)

20
Non-Uniform Memory Access (NUMA)

• FSB architecture
– All memory in one location
• Starting with Intel® Core™ microarchitecture (Nehalem)
– Memory located in multiple places
• Latency to memory dependent on location
• Local memory has highest BW, lowest latency
• Remote memory still very fast

Ensure software is NUMA-optimized for best performance

[Chart: relative memory latency: Harpertown (FSB 1600) vs. Nehalem (DDR3-1067) local and remote]
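As a hedged sketch of the NUMA advice above (not part of the original deck), the Linux libnuma API can keep a thread and its data on the same node; the node number and buffer size below are arbitrary placeholders:

/* Minimal sketch using Linux libnuma (link with -lnuma). Keeps execution
 * and allocation on the same NUMA node so memory accesses stay local.
 * Node 0 and the 64 MiB size are arbitrary example values. */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return EXIT_FAILURE;
    }
    numa_run_on_node(0);                     /* pin execution to node 0   */
    size_t size = 64UL * 1024 * 1024;
    char *buf = numa_alloc_onnode(size, 0);  /* memory from node 0's DRAM */
    if (!buf)
        return EXIT_FAILURE;
    buf[0] = 1;                              /* first touch: local access */
    numa_free(buf, size);
    return EXIT_SUCCESS;
}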

Intel® Core™ microarchitecture (Nehalem-EP)


Intel® Next Generation Server Processor Technology (Tylersburg-EP)

21
Memory Bandwidth – Initial Intel® Core™
Microarchitecture (Nehalem) Products

• 3 memory channels per socket
• ≥ DDR3-1066 at launch
• Massive memory BW
• Scalability
– Design IMC and core to take advantage of BW
– Allow performance to scale with cores
– Core enhancements
– Support more cache misses per core
– Aggressive hardware prefetching w/ throttling enhancements
– Example IMC Features
– Independent memory channels
– Aggressive request reordering

[Chart: Stream bandwidth (Triad), MB/s (1): HTN 3.16 / BF1333 / 667 MHz mem: 6102; HTN 3.00 / SB1600 / 800 MHz mem: 9776; NHM 2.66 / 6.4 QPI / 1066 MHz mem: 33376 (3.4X)]
Source: Intel internal measurements, August 2008

Massive memory BW provides performance and scalability

1
HTN: Intel® Xeon® processor 5400 Series (Harpertown)
NHM: Intel® Core™ microarchitecture (Nehalem)
22
Agenda
• Intel® Core™ Microarchitecture (Nehalem) Design
Overview
• Enhanced Processor Core
• Performance Features
• Intel® Hyper-Threading Technology
• New Platform
• New Cache Hierarchy
• New Platform Architecture
• Performance Acceleration
• Virtualization
• New Instructions
• Power Management Overview
• Minimizing Idle Power Consumption
• Performance when it counts

23
Virtualization
• To get best virtualized performance
– Have best native performance
– Reduce transitions to/from virtual machine
– Reduce latency of transitions
• Intel® Core™ microarchitecture (Nehalem) virtualization features
– Reduced latency for transitions
– Virtual Processor ID (VPID) to reduce effective cost of transitions
– Extended Page Table (EPT) to reduce # of transitions

[Chart: round trip virtualization latency (1), relative: Merom, Penryn, Nehalem]

24
EPT Solution

[Diagram: Guest Linear Address -> Intel® 64 Page Tables (guest CR3) -> Guest Physical Address -> EPT Page Tables (EPT Base Pointer) -> Host Physical Address]

• Intel® 64 Page Tables


– Map Guest Linear Address to Guest Physical Address
– Can be read and written by the guest OS
• New EPT Page Tables under VMM Control
– Map Guest Physical Address to Host Physical Address
– Referenced by new EPT base pointer
• No VM Exits due to Page Faults, INVLPG or CR3
accesses

25
Extending Performance and Energy Efficiency
- Intel® SSE4.2 Instruction Set Architecture (ISA) Leadership in 2008

• Accelerated String and Text Processing (STTNI, e.g. XML)
– Novel parallel data matching and comparison operations
– Faster XML parsing, faster search and pattern matching
• Accelerated Searching & Pattern Recognition of Large Data Sets
– Improved performance for Genome Mining, handwriting recognition
– Fast Hamming distance / population count (POPCNT, e.g. Genome Mining)
• New Communications Capabilities
– Hardware based CRC instruction acceleration (CRC32, e.g. iSCSI)
– Accelerated network attached storage
– Improved power efficiency for software iSCSI, RDMA, and SCTP

SSE4 = SSE4.1 (45nm Penryn core) + SSE4.2 (Nehalem core); SSE4.2 adds STTNI
(String and Text New Instructions) and ATA (Application Targeted Accelerators)

What should the applications, OS and VMM vendors do?
Understand the benefits & take advantage of new instructions in 2008.
Provide us feedback on instructions ISVs would like to see for the
next generation of applications.

26
Agenda
• Intel® Core™ Microarchitecture (Nehalem) Design
Overview
• Enhanced Processor Core
• Performance Features
• Intel® Hyper-Threading Technology
• New Platform
• New Cache Hierarchy
• New Platform Architecture
• Performance Acceleration
• Virtualization
• New Instructions
• Power Management Overview
• Minimizing Idle Power Consumption
• Performance when it counts

27
Intel® Core™ Microarchitecture (Nehalem)
Design Goals
World class performance combined with superior energy efficiency –
Optimized for:
• Single Thread
• Multi-threads
• Existing Apps
• Emerging Apps
• All Usages
• Desktop / Mobile / Workstation / Server

Performance is dynamically scaled when needed to maximize energy
efficiency.
A single, scalable foundation optimized across each segment and power
envelope.

A Dynamic and Scalable Microarchitecture Design
28
Power Control Unit

• Integrated proprietary microcontroller (PCU)
• Shifts control from hardware to embedded firmware
• Real time sensors for temperature, current, power
• Flexibility enables sophisticated algorithms, tuned for
current operating conditions

[Diagram: PCU connected to per-core sensors and per-core Vcc/frequency/PLL controls, plus the Uncore/LLC and BCLK]

29
Minimizing Idle Power Consumption

• Operating system notifies CPU when no tasks are
ready for execution
– Execution of MWAIT instruction
• MWAIT arguments hint at expected idle duration
– Higher numbered C-states: lower power, but also
longer exit latency
• CPU idle states referred to as "C-States"

[Chart: idle power (W) vs. exit latency (us) for C0, C1, …, Cn]
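A minimal, heavily hedged sketch of the MWAIT flow described above (not from the deck): it assumes ring-0 execution, and the hint value that selects a deep C-state is platform-specific, so the constant below is only a stand-in.

/* Kernel-style sketch of an idle entry using MONITOR/MWAIT intrinsics.
 * Assumes ring-0; the hint encoding for a deep C-state is platform-specific
 * (consult the Intel SDM / ACPI tables), so MWAIT_HINT_DEEP is a placeholder. */
#include <pmmintrin.h>   /* _mm_monitor, _mm_mwait */

static volatile int wakeup_flag;        /* hypothetical wakeup address       */
#define MWAIT_HINT_DEEP 0x20u           /* assumed hint value, not normative */

static void idle_enter_deep_cstate(void)
{
    _mm_monitor((const void *)&wakeup_flag, 0, 0); /* arm the address monitor   */
    if (!wakeup_flag)                              /* re-check for pending work  */
        _mm_mwait(0, MWAIT_HINT_DEEP);             /* sleep until wake/interrupt */
}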

30
C6 on Intel® Core™ Microarchitecture
(Nehalem)

31
C6 on Intel® Core™ Microarchitecture
(Nehalem)
Cores 0, 1, 2, and 3 running applications.

[Chart: per-core power over time for Cores 0 through 3]

32
C6 on Intel® Core™ Microarchitecture
(Nehalem)
Task completes. No work waiting. OS executes
MWAIT(C6) instruction.

[Chart: per-core power over time for Cores 0 through 3]

33
C6 on Intel® Core™ Microarchitecture
(Nehalem)
Execution stops. Core architectural state saved. Core
clocks stopped. Cores 0, 1, and 3 continue execution
undisturbed.

[Chart: per-core power over time for Cores 0 through 3]

34
C6 on Intel® Core™ Microarchitecture
(Nehalem)
Core power gate turned off. Core voltage goes to 0.
Cores 0, 1, and 3 continue execution undisturbed.

[Chart: per-core power over time for Cores 0 through 3]

35
C6 on Intel® Core™ Microarchitecture
(Nehalem)
Task completes. No work waiting. OS executes
MWAIT(C6) instruction. Core 0 enters C6. Cores 1 and 3
continue execution undisturbed.

[Chart: per-core power over time for Cores 0 through 3]

36
C6 on Intel® Core™ Microarchitecture
(Nehalem)
Interrupt for Core 2 arrives. Core 2 returns to C0;
execution resumes at the instruction following
MWAIT(C6). Cores 1 and 3 continue execution
undisturbed.

[Chart: per-core power over time for Cores 0 through 3]

37
C6 on Intel® Core™ Microarchitecture
(Nehalem)
Interrupt for Core 0 arrives. Power gate turns on, core
clock turns on, core state restored, and the core resumes
execution at the instruction following MWAIT(C6). Cores 1,
2, and 3 continue execution undisturbed.

[Chart: per-core power over time for Cores 0 through 3]

Core independent C6 on Intel® Core™
microarchitecture (Nehalem) extends benefits
38
Intel® Core™ Microarchitecture (Nehalem)-
based Processor

• Significant logic outside core
– Integrated memory controller
– Large shared cache
– High speed interconnect
– Arbitration logic

[Diagram: die floorplan (four cores, shared L3 cache, Memory Controller, Misc I/O, QPI 0/1) next to a total CPU power breakdown: per-core clocks and logic, core clock distribution, core leakage (x N cores); uncore logic, I/O, uncore clock distribution, uncore leakage]

QPI = Intel® QuickPath Interconnect (Intel® QPI)

39
Intel® Core™ Microarchitecture
(Nehalem) Package C-State Support
• All cores in C6 state:
– Core power to ~0

[Chart: active CPU power breakdown: core clocks and logic, core clock distribution, core leakage (x N cores); uncore logic, I/O, uncore clock distribution, uncore leakage]

40
Intel® Core™ Microarchitecture
(Nehalem) Package C-State Support
• All cores in C6 state:
– Core power to ~0
• Package to C6 state:
– Uncore logic stops toggling

[Chart: remaining CPU power]

41
Intel® Core™ Microarchitecture
(Nehalem) Package C-State Support
• All cores in C6 state:
– Core power to ~0
• Package to C6 state:
– Uncore logic stops toggling
– I/O to lower power state

[Chart: remaining CPU power]

42
Intel® Core™ Microarchitecture
(Nehalem) Package C-State Support
• All cores in C6 state:
– Core power to ~0
• Package to C6 state:
– Uncore logic stops toggling
– I/O to lower power state
– Uncore clock grids stopped

[Chart: remaining CPU power]

Substantial reduction in
idle CPU power
43
Managing Active Power
• Operating system changes frequency as needed to
meet performance needs, minimize power
– Enhanced Intel SpeedStep® Technology
– Referred to as processor P-States
• PCU tunes voltage for given frequency, operating
conditions, and silicon characteristics

PCU automatically optimizes operating voltage
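As a hedged, Linux-specific illustration (not from the deck) of the OS-driven P-state control described above: the cpufreq sysfs interface exposes the governor that requests frequency changes; the path below assumes a standard Linux sysfs layout.

/* Sketch: read which cpufreq governor (the OS policy that requests P-states)
 * is active for cpu0 on Linux. Path assumes the standard sysfs layout. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor";
    char gov[64];
    FILE *f = fopen(path, "r");
    if (f && fgets(gov, sizeof gov, f))
        printf("cpu0 P-state governor: %s", gov);
    if (f)
        fclose(f);
    return 0;
}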

44
Turbo Mode: Key to Scalability Goal

• Intel® Core™ microarchitecture
(Nehalem) is a scalable architecture
– High frequency core for performance in
less constrained form factors
– Retain ability to use that frequency in
very small form factors
– Retain ability to use that frequency when
running lightly threaded or lower power
workloads
• Turbo utilizes available frequency:
– Maximizes both single-thread and multi-
thread performance in the same part

Turbo Mode provides performance


when you need it

45
Turbo Mode Enabling

• Turbo Mode exposed as additional Enhanced Intel


SpeedStep® Technology operating point
– Operating system treats as any other P-state, requesting
Turbo Mode when it needs more performance
– Performance benefit comes from higher operating
frequency – no need to enable or tune software
• Turbo Mode is transparent to system
– Frequency transitions handled completely in hardware
– PCU keeps silicon within existing operating limits
– Systems designed to same specs, with or without Turbo
Mode
Performance benefits with
existing applications and operating systems

46
Summary

• Intel® Core™ microarchitecture (Nehalem) – The 45nm Tock


• Designed for
– Power Efficiency
– Scalability
– Performance
• Key Innovations:
– Enhanced Processor Core
– Brand New Platform Architecture
– Sophisticated Power Management

High Performance When You Need It


Lower Power When You Don’t

47
Q&A

48
Legal Disclaimer
• INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO
LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL
PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS
AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER,
AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE
OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A
PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR
OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN
MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.
• Intel may make changes to specifications and product descriptions at any time, without notice.
• All products, dates, and figures specified are preliminary based on current expectations, and
are subject to change without notice.
• Intel, processors, chipsets, and desktop boards may contain design defects or errors known as
errata, which may cause the product to deviate from published specifications. Current
characterized errata are available on request.
• Merom, Penryn, Harpertown, Nehalem, Dothan, Westmere, Sandy Bridge, and other code names
featured are used internally within Intel to identify products that are in development and not
yet publicly announced for release. Customers, licensees and other third parties are not
authorized by Intel to use code names in advertising, promotion or marketing of any product or
services and any such use of Intel's internal code names is at the sole risk of the user
• Performance tests and ratings are measured using specific computer systems and/or
components and reflect the approximate performance of Intel products as measured by those
tests. Any difference in system hardware or software design or configuration may affect actual
performance.
• Intel, Intel Inside, Intel Core, Pentium, Intel SpeedStep Technology, and the Intel logo are
trademarks of Intel Corporation in the United States and other countries.
• *Other names and brands may be claimed as the property of others.
• Copyright © 2008 Intel Corporation.

49
Risk Factors
This presentation contains forward-looking statements that involve a number of risks and uncertainties. These
statements do not reflect the potential impact of any mergers, acquisitions, divestitures, investments or other
similar transactions that may be completed in the future. The information presented is accurate only as of
today’s date and will not be updated. In addition to any factors discussed in the presentation, the important
factors that could cause actual results to differ materially include the following: Demand could be different
from Intel's expectations due to factors including changes in business and economic conditions, including
conditions in the credit market that could affect consumer confidence; customer acceptance of Intel’s and
competitors’ products; changes in customer order patterns, including order cancellations; and changes in the
level of inventory at customers. Intel’s results could be affected by the timing of closing of acquisitions and
divestitures. Intel operates in intensely competitive industries that are characterized by a high percentage of
costs that are fixed or difficult to reduce in the short term and product demand that is highly variable and
difficult to forecast. Revenue and the gross margin percentage are affected by the timing of new Intel product
introductions and the demand for and market acceptance of Intel's products; actions taken by Intel's
competitors, including product offerings and introductions, marketing programs and pricing pressures and
Intel’s response to such actions; Intel’s ability to respond quickly to technological developments and to
incorporate new features into its products; and the availability of sufficient supply of components from
suppliers to meet demand. The gross margin percentage could vary significantly from expectations based on
changes in revenue levels; product mix and pricing; capacity utilization; variations in inventory valuation,
including variations related to the timing of qualifying products for sale; excess or obsolete inventory;
manufacturing yields; changes in unit costs; impairments of long-lived assets, including manufacturing,
assembly/test and intangible assets; and the timing and execution of the manufacturing ramp and associated
costs, including start-up costs. Expenses, particularly certain marketing and compensation expenses, vary
depending on the level of demand for Intel's products, the level of revenue and profits, and impairments of
long-lived assets. Intel is in the midst of a structure and efficiency program that is resulting in several actions
that could have an impact on expected expense levels and gross margin. Intel's results could be impacted by
adverse economic, social, political and physical/infrastructure conditions in the countries in which Intel, its
customers or its suppliers operate, including military conflict and other security risks, natural disasters,
infrastructure disruptions, health concerns and fluctuations in currency exchange rates. Intel's results could be
affected by adverse effects associated with product defects and errata (deviations from published
specifications), and by litigation or regulatory matters involving intellectual property, stockholder, consumer,
antitrust and other issues, such as the litigation and regulatory matters described in Intel's SEC reports. A
detailed discussion of these and other factors that could affect Intel’s results is included in Intel’s SEC filings,
including the report on Form 10-Q for the quarter ended June 28, 2008.
Backup Slides

51
Intel® Core™ Microarchitecture (Nehalem)
Design Goals
World class performance combined with superior energy efficiency –
Optimized for:
• Single Thread
• Multi-threads
• Existing Apps
• Emerging Apps
• All Usages
• Desktop / Mobile / Workstation / Server

Performance is dynamically scaled when needed to maximize energy
efficiency.
A single, scalable foundation optimized across each segment and power
envelope.

A Dynamic and Scalable Microarchitecture Design
52
Tick-Tock Development Model

Merom(1)           Penryn             Nehalem            Westmere           Sandy Bridge
NEW                NEW                NEW                NEW                NEW
Microarchitecture  Process            Microarchitecture  Process            Microarchitecture
65nm               45nm               45nm               32nm               32nm
TOCK               TICK               TOCK               TICK               TOCK

Forecast

1
Intel® Core™ microarchitecture (formerly Merom)
45nm next generation Intel® Core™ microarchitecture (Penryn)
Intel® Core™ Microarchitecture (Nehalem)
Intel® Microarchitecture (Westmere)
Intel® Microarchitecture (Sandy Bridge)

All dates, product descriptions, availability and plans are forecasts and
subject to change without notice.
53
Enhanced Processor Core

[Block diagram: Front End (ITLB, 32kB Instruction Cache, Pre Decode, Instruction Queue, 4-wide Decode) feeding the Execution Engine (Rename/Allocate, 4-wide Retirement Unit / ReOrder Buffer, Reservation Station, 6 Execution Units) and the Memory subsystem (DTLB, 32kB Data Cache, 2nd Level TLB, 256kB 2nd Level Cache, L3 and beyond)]

54
Front-end

• Responsible for feeding the compute engine


– Decode instructions
– Branch Prediction
• Key Intel® Core™2 microarchitecture Features
– 4-wide decode
– Macrofusion
– Loop Stream Detector
[Diagram: ITLB, Instruction Fetch and Pre Decode, 32kB Instruction Cache, Instruction Queue, Decode]

55
Loop Stream Detector Reminder
• Loops are very common in most software
• Take advantage of knowledge of loops in HW
– Decoding the same instructions over and over
– Making the same branch predictions over and over
• Loop Stream Detector identifies software loops
– Stream from Loop Stream Detector instead of normal path
– Disable unneeded blocks of logic for power savings
– Higher performance by removing instruction fetch limitations

[Diagram: Intel® Core™2 Loop Stream Detector sits after Branch Prediction and Fetch, before Decode, and holds 18 instructions]

56
Branch Prediction Reminder
• Goal: Keep powerful compute engine fed
• Options:
– Stall pipeline while determining branch direction/target
– Predict branch direction/target and correct if wrong
• Minimize amount of time wasted correcting from
incorrect branch predictions
– Performance:
– Through higher branch prediction accuracy
– Through faster correction when prediction is wrong
– Power efficiency: Minimize number of
speculative/incorrect micro-ops that are executed

Continued focus on branch


prediction improvements

57
L2 Branch Predictor

• Problem: Software with a large code footprint not


able to fit well in existing branch predictors
– Example: Database applications
• Solution: Use multi-level branch prediction scheme
• Benefits:
– Higher performance through improved branch prediction
accuracy
– Greater power efficiency through less mis-speculation

58
Advanced Renamed Return Stack Buffer (RSB)

• Instruction Reminder
– CALL: Entry into functions
– RET: Return from functions
• Classical Solution
– Return Stack Buffer (RSB) used to predict RET
– RSB can be corrupted by speculative path
• The Renamed RSB
– No RET mispredicts in the common case

59
Execution Engine

• Responsible for:
– Scheduling operations
– Executing operations
• Powerful Intel® Core™2 microarchitecture execution
engine
– Dynamic 4-wide Execution
– Intel® Advanced Digital Media Boost
– 128-bit wide SSE
– Super Shuffler (45nm next generation Intel® Core™
microarchitecture (Penryn))

60
Intel® Smart Cache – Core Caches
• New 3-level Cache Hierarchy
• 1st level caches
– 32kB Instruction Cache
– 32kB, 8-way Data Cache
– Support more L1 misses in parallel than
Intel® Core™2 microarchitecture
• 2nd level Cache
– New cache introduced in Intel® Core™
microarchitecture (Nehalem)
– Unified (holds code and data)
– 256 kB per core (8-way)
– Performance: Very low latency
– 10 cycle load-to-use
– Scalability: As core count increases,
reduce pressure on shared cache

[Diagram: each core has a 32kB L1 Data Cache, a 32kB L1 Instruction Cache, and a 256kB L2 Cache]

61
New TLB Hierarchy

• Problem: Applications continue to grow in data size


• Need to increase TLB size to keep the pace for performance
• Nehalem adds new low-latency unified 2nd level TLB

# of Entries
1st Level Instruction TLBs
Small Page (4k) 128
Large Page (2M/4M) 7 per thread
1st Level Data TLBs
Small Page (4k) 64
Large Page (2M/4M) 32
New 2nd Level Unified TLB
Small Page Only 512

62
Fast Unaligned Cache Accesses
• Two flavors of 16-byte SSE loads/stores exist
– Aligned (MOVAPS/D, MOVDQA) -- Must be aligned on a 16-byte boundary
– Unaligned (MOVUPS/D, MOVDQU) -- No alignment requirement
• Prior to Intel® Core™ microarchitecture (Nehalem)
– Optimized for Aligned instructions
– Unaligned instructions slower, lower throughput -- Even for aligned accesses!
– Required multiple uops (not energy efficient)
– Compilers would largely avoid unaligned load
– 2-instruction sequence (MOVSD+MOVHPD) was faster
• Intel Core microarchitecture (Nehalem) optimizes Unaligned instructions
– Same speed/throughput as Aligned instructions on aligned accesses
– Optimizations for making accesses that cross 64-byte boundaries fast
– Lower latency/higher throughput than Core 2
– Aligned instructions remain fast
• No reason to use aligned instructions on Intel Core microarchitecture (Nehalem)!
• Benefits:
– Compiler can now use unaligned instructions without fear
– Higher performance on key media algorithms
– More energy efficient than prior implementations
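A hedged illustration (not from the deck) of the guidance above: with Nehalem the unaligned load/store intrinsics (MOVDQU) can be used by default, since they match aligned-instruction speed on aligned data; the function below is a made-up example.

/* Sketch: process two int arrays 16 bytes at a time using unaligned SSE2
 * loads/stores (MOVDQU via _mm_loadu_si128), with no alignment requirement. */
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

void add_arrays(int *dst, const int *a, const int *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i)); /* MOVDQU */
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi32(va, vb));
    }
    for (; i < n; i++)        /* scalar tail */
        dst[i] = a[i] + b[i];
}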

63
Faster Synchronization Primitives

• Multi-threaded software becoming more prevalent
• Scalability of multi-threaded applications can be limited by
synchronization
• Synchronization primitives: LOCK prefix, XCHG
• Reduce synchronization latency for legacy software

[Chart: LOCK CMPXCHG performance (1), relative latency: Pentium 4, Core 2, Nehalem]

Greater thread scalability with Nehalem
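To show where the LOCK-prefixed primitives above appear in ordinary software, here is a hedged sketch (not from the deck) of a spinlock built on a GCC-style compare-and-swap builtin, which compiles to LOCK CMPXCHG:

/* Sketch: a trivial spinlock using GCC __sync builtins; the compare-and-swap
 * maps to LOCK CMPXCHG, the operation whose latency Nehalem reduces. */
typedef volatile int spinlock_t;

static void spin_lock(spinlock_t *lock)
{
    /* atomically change 0 -> 1; spin until we win the lock */
    while (__sync_val_compare_and_swap(lock, 0, 1) != 0)
        ;  /* a pause instruction could be added here */
}

static void spin_unlock(spinlock_t *lock)
{
    __sync_lock_release(lock);  /* store 0 with release semantics */
}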

1
Intel® Pentium® 4 processor
Intel® Core™2 Duo processor
Intel® Core™ microarchitecture (Nehalem)-based processor
64
Intel® Core™ Microarchitecture (Nehalem) SMT
Implementation Details

Policy                 Description                         Intel® Core™ Microarchitecture (Nehalem) Examples
Replicated             Duplicate logic per thread          Register State, Renamed RSB, Large Page ITLB
Partitioned            Statically allocated between        Load Buffer, Store Buffer, Reorder Buffer,
                       threads                             Small Page ITLB
Competitively Shared   Depends on thread's dynamic         Reservation Station, Caches, Data TLB,
                       behavior                            2nd level TLB
Unaware                No SMT impact                       Execution units

SMT efficient due to
minimal replication of logic
65
Feeding the Execution Engine

• Powerful 4-wide dynamic execution engine


• Need to keep providing fuel to the execution engine
• Intel® Core™ Microarchitecture (Nehalem) Goals
– Low latency to retrieve data
– Keep execution engine fed w/o stalling
– High data bandwidth
– Handle requests from multiple cores/threads seamlessly
– Scalability
– Design for increasing core counts
• Combination of great cache hierarchy and new platform

Intel® Core™ microarchitecture (Nehalem)


designed to feed the execution engine

66
Inclusive vs. Exclusive Caches –
Cache Miss

[Diagram: an Exclusive L3 Cache and an Inclusive L3 Cache, each shared by Cores 0 through 3]

Data request from Core 0 misses Core 0's L1 and L2
Request sent to the L3 cache

67
Inclusive vs. Exclusive Caches –
Cache Miss

[Diagram: both the Exclusive and the Inclusive L3 Cache report MISS]

Core 0 looks up the L3 Cache
Data not in the L3 Cache

68
Inclusive vs. Exclusive Caches –
Cache Miss

[Diagram: both the Exclusive and the Inclusive L3 Cache report MISS]

Exclusive: must check other cores
Inclusive: guaranteed data is not on-die

Greater scalability from inclusive approach

69
Inclusive vs. Exclusive Caches –
Cache Hit

[Diagram: both the Exclusive and the Inclusive L3 Cache report HIT]

Exclusive: no need to check other cores
Inclusive: data could be in another core BUT
Intel® Core™ microarchitecture
(Nehalem) is smart…

70
Inclusive vs. Exclusive Caches –
Cache Hit
Inclusive
• Maintain a set of "core valid" bits per cache line
in the L3 cache
• Each bit represents a core
• If the L1/L2 of a core may contain the cache line,
then its core valid bit is set to "1"
• No snoops of cores are needed if no bits are set
• If more than 1 bit is set, the line cannot be in
Modified state in any core

[Diagram: Inclusive L3 hit with all core valid bits 0]

Core valid bits limit
unnecessary snoops

71
Inclusive vs. Exclusive Caches –
Read from other core

[Diagram: the Exclusive L3 reports MISS; the Inclusive L3 reports HIT with the core valid bit set for Core 2]

Exclusive: must check all other cores
Inclusive: only need to check the core
whose core valid bit is set

72
Local Memory Access
• CPU0 requests cache line X, not present in any CPU0 cache
– CPU0 requests data from its DRAM
– CPU0 snoops CPU1 to check if data is present
• Step 2:
– DRAM returns data
– CPU1 returns snoop response
• Local memory latency is the maximum latency of the two responses
• Intel® Core™ microarchitecture (Nehalem) optimized to keep key latencies
close to each other

[Diagram: DRAM - CPU0 - Intel® QPI - CPU1 - DRAM]

Intel® QPI = Intel® QuickPath Interconnect

73
Remote Memory Access
• CPU0 requests cache line X, not present in any CPU0 cache
– CPU0 requests data from CPU1
– Request sent over Intel® QuickPath Interconnect (Intel® QPI) to
CPU1
– CPU1’s IMC makes request to its DRAM
– CPU1 snoops internal caches
– Data returned to CPU0 over Intel QPI
• Remote memory latency a function of having a low latency
interconnect

[Diagram: DRAM - CPU0 - Intel® QPI - CPU1 - DRAM]

74
Hardware Prefetching (HWP)

• HW Prefetching critical to hiding memory latency


• Structure of HWPs similar as in Intel® Core™2
microarchitecture
– Algorithmic improvements in Intel® Core™ microarchitecture
(Nehalem) for higher performance
• L1 Prefetchers
– Based on instruction history and/or load address pattern
• L2 Prefetchers
– Prefetches loads/RFOs/code fetches based on address pattern
– Intel Core microarchitecture (Nehalem) changes:
– Efficient Prefetch mechanism
– Remove the need for Intel® Xeon® processors to disable HWP
– Increase prefetcher aggressiveness
– Locks on to address streams quicker, adapts to change faster, issues more
prefetches more aggressively (when appropriate)

75
Today's Platform Architecture

Front-Side Bus Evolution

[Diagram: one, two, and four CPUs sharing a front-side bus to the MCH (with attached memory) and ICH]

76
Intel® QuickPath Interconnect
• Intel® Core™ microarchitecture
(Nehalem) introduces the new
Intel® QuickPath Interconnect
(Intel® QPI)
• High bandwidth, low latency
point to point interconnect
• Up to 6.4 GT/sec initially
– 6.4 GT/sec -> 12.8 GB/sec
– Bi-directional link -> 25.6 GB/sec per link
– Future implementations at even
higher speeds
• Highly scalable for systems
with varying # of sockets

[Diagram: Nehalem-EP sockets with local memory connected to each other and to IOHs via Intel QPI]

Intel® Core™ microarchitecture (Nehalem-EP)

77
Integrated Memory Controller (IMC)
• Memory controller optimized per
market segment
• Initial Intel® Core™
microarchitecture (Nehalem)
products
– Native DDR3 IMC
– Up to 3 channels per socket
– Massive memory bandwidth
– Designed for low latency
– Support RDIMM and UDIMM
– RAS Features
• Future products
– Scalability
– Vary # of memory channels
– Increase memory speeds
– Buffered and Non-Buffered solutions
– Market specific needs
– Higher memory capacity
– Integrated graphics

[Diagram: two Nehalem-EP sockets, each with DDR3 memory, connected to the Tylersburg-EP IOH]

Significant performance through new IMC

Intel® Core™ microarchitecture (Nehalem-EP)
Intel® Next Generation Server Processor Technology (Tylersburg-EP)

78
Memory Latency Comparison
• Low memory latency critical to high performance
• Design integrated memory controller for low latency
• Need to optimize both local and remote memory latency
• Intel® Core™ microarchitecture (Nehalem) delivers
– Huge reduction in local memory latency
– Even remote memory latency is fast
• Effective memory latency depends per application/OS
– Percentage of local vs. remote accesses
– Intel Core microarchitecture (Nehalem) has lower latency regardless of mix
[Chart: relative memory latency comparison (1): Harpertown (FSB 1600), Nehalem (DDR3-1067) local, Nehalem (DDR3-1067) remote]

1
Next generation Quad-Core Intel® Xeon® processor (Harpertown)
Intel® CoreTM microarchitecture (Nehalem)
79
Latency of Virtualization Transitions

• Microarchitectural
– Huge latency reduction
generation over generation
– Nehalem continues the trend
• Architectural
– Virtual Processor ID (VPID)
added in Intel® Core™
microarchitecture (Nehalem)
– Removes need to flush
TLBs on transitions

[Chart: round trip virtualization latency (1), relative: Merom, Penryn, Nehalem]

Higher Virtualization Performance Through
Lower Transition Latencies

1
Intel® Core™ microarchitecture (formerly Merom)
45nm next generation Intel® Core™ microarchitecture (Penryn)
Intel® Core™ microarchitecture (Nehalem)
80
Extended Page Tables (EPT) Motivation

• A VMM needs to protect physical memory
• Multiple Guest OSs share the same physical memory
• Protections are implemented through page-table virtualization
• Page table virtualization accounts for a significant portion of
virtualization overheads (VM Exits / Entries)
• The goal of EPT is to reduce these overheads

[Diagram: in VM1 the Guest OS's CR3 points to the Guest Page Table; guest page table changes cause exits into the VMM; the VMM maintains the Active Page Table (referenced by the real CR3), which is used by the CPU]

81
STTNI - STring & Text New Instructions
Operates on strings of bytes or words (16b)

• Equal Each Instruction
– True for each character in Src2 if the character in the same position in Src1 is equal
– Example: Src1: Test\tday, Src2: tad tseT, Mask: 01101111
• Equal Any Instruction
– True for each character in Src2 if it matches any character in Src1
– Example: Src1: Example\n, Src2: atad tsT, Mask: 10100000
• Ranges Instruction
– True if a character in Src2 is in at least one of up to 8 ranges in Src1
– Example: Src1: AZ'0'9zzz, Src2: taD tseT, Mask: 00100001
• Equal Ordered Instruction
– Finds the start of a substring (Src1) within another string (Src2)
– Example: Src1: ABCA0XYZ, Src2: S0BACBAB, Mask: 00000010

[STTNI model diagram: Source1 (XMM) is compared against Source2 (XMM/M128) in a comparison matrix; for Equal Each, each bit in the diagonal is checked to form IntRes1]

Projected 3.8x kernel speedup on XML parsing &
2.7x savings on instruction cycles
82
STTNI Model

[Diagrams: comparison matrices of Source1 (XMM) against Source2 (XMM/M128), showing how IntRes1 is formed for each mode]
• EQUAL ANY: OR the comparison results down each Source2 column (example IntRes1: 10100000)
• EQUAL EACH: check each bit in the diagonal (example IntRes1: 01101111)
• RANGES: the first compare does GE, the next does LE; AND GE/LE pairs of results, then OR those results (example IntRes1: 00100001)
• EQUAL ORDERED: AND the results along each diagonal (example IntRes1: 00000010)

83
ATA - Application Targeted Accelerators

CRC32
• Accumulates a CRC32 value using the iSCSI polynomial
• Operates on 8/16/32/64-bit source data; the destination register holds the old CRC and receives the new CRC
• One register maintains the running CRC value as a software loop iterates over data
• Fixed CRC polynomial = 11EDC6F41h
• Replaces complex instruction sequences for CRC in upper layer data protocols: iSCSI, RDMA, SCTP

POPCNT
• Determines the number of nonzero bits in the source (e.g. RAX = 010...0011 -> RBX = 0x3)
• Useful for speeding up fast matching in data mining workloads including DNA/Genome Matching and Voice Recognition
• ZF set if the result is zero; all other flags (C,S,O,A,P) reset

Enables enterprise class data assurance with high data rates
in networked storage in any user environment.
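A hedged sketch (not from the deck) of the CRC32 usage pattern described above, using the SSE4.2 intrinsic; the byte-at-a-time loop, initial value, and final inversion follow common CRC32-C (iSCSI) practice and are assumptions, not part of the slide:

/* Sketch: accumulate a CRC32-C value over a buffer with the CRC32
 * instruction via its intrinsic; requires SSE4.2 (e.g. compile with
 * -msse4.2). Initial value and final inversion follow common CRC32-C use. */
#include <nmmintrin.h>   /* _mm_crc32_u8 */
#include <stddef.h>
#include <stdint.h>

uint32_t crc32c_buffer(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        crc = _mm_crc32_u8(crc, data[i]);  /* one CRC32 step per byte */
    return ~crc;
}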

84
Tools Support of New Instructions
• Intel® Compiler 10.x supports the new instructions
– Nehalem specific compiler optimizations
– SSE4.2 supported via vectorization and intrinsics
– Inline assembly supported on both IA-32 and Intel® 64 architecture targets
– Necessary to include required header files in order to access intrinsics

• Intel® XML Software Suite


– High performance C++ and Java runtime libraries
– Version 1.0 (C++), version 1.01 (Java) available now
– Version 1.1 w/SSE4.2 optimizations planned for September 2008

• Microsoft Visual Studio* 2008 VC++


– SSE4.2 supported via intrinsics
– Inline assembly supported on IA-32 only
– Necessary to include required header files in order to access intrinsics
– VC++ 2008 tools masm, msdis, and debuggers recognize the new instructions

• Sun Studio Express* 7/08


– Supports Intel® CoreTM microarchitecture (Merom), 45nm next generation Intel® Core™ microarchitecture (Penryn), Intel® CoreTM
microarchitecture (Nehalem)
– SSE4.1, SSE4.2 through intrinsics
– Nehalem specific compiler optimizations

• GCC* 4.3.1
– Support Intel Core microarchitecture (Merom), 45nm next generation Intel Core microarchitecture (Penryn), Intel Core
microarchitecture (Nehalem)
– via –mtune=generic.
– Support SSE4.1 and SSE4.2 through vectorizer and intrinsics

Broad Software Support for Intel®


Core™ Microarchitecture (Nehalem)
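A small hedged example (not from the deck) of the intrinsics path the tools above expose; it assumes a compiler targeting SSE4.2 (e.g. -msse4.2 with GCC or the Intel Compiler) and the nmmintrin.h header:

/* Sketch: using SSE4.2 intrinsics after including the required header. */
#include <nmmintrin.h>   /* SSE4.2: _mm_crc32_u32, _mm_popcnt_u32 */
#include <stdio.h>

int main(void)
{
    unsigned crc  = _mm_crc32_u32(0xFFFFFFFFu, 0x12345678u); /* one CRC32-C step      */
    int      bits = _mm_popcnt_u32(0xF0F0F0F0u);             /* population count = 16 */
    printf("crc=%08x popcnt=%d\n", crc, bits);
    return 0;
}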
85
Software Optimization Guidelines

• Most optimizations for Intel® Core™


microarchitecture still hold
• Examples of new optimization guidelines:
– 16-byte unaligned loads/stores
– Enhanced macrofusion rules
– NUMA optimizations
• Intel® Core™ microarchitecture (Nehalem) SW
Optimization Guide will be published
• Intel® Compiler will support settings for Intel Core
microarchitecture (Nehalem) optimizations

86
Example Code For strlen()

Current Code (x86 assembly):

string  equ     [esp + 4]

        mov     ecx,string              ; ecx -> string
        test    ecx,3                   ; test if string is aligned on 32 bits
        je      short main_loop
str_misaligned:
        ; simple byte loop until string is aligned
        mov     al,byte ptr [ecx]
        add     ecx,1
        test    al,al
        je      short byte_3
        test    ecx,3
        jne     short str_misaligned
        add     eax,dword ptr 0         ; 5 byte nop to align label below

        align   16                      ; should be redundant
main_loop:
        mov     eax,dword ptr [ecx]     ; read 4 bytes
        mov     edx,7efefeffh
        add     edx,eax
        xor     eax,-1
        xor     eax,edx
        add     ecx,4
        test    eax,81010100h
        je      short main_loop
        ; found zero byte in the loop
        mov     eax,[ecx - 4]
        test    al,al                   ; is it byte 0
        je      short byte_0
        test    ah,ah                   ; is it byte 1
        je      short byte_1
        test    eax,00ff0000h           ; is it byte 2
        je      short byte_2
        test    eax,0ff000000h          ; is it byte 3
        je      short byte_3
        jmp     short main_loop         ; taken if bits 24-30 are clear and bit 31 is set
byte_3:
        lea     eax,[ecx - 1]
        mov     ecx,string
        sub     eax,ecx
        ret
byte_2:
        lea     eax,[ecx - 2]
        mov     ecx,string
        sub     eax,ecx
        ret
byte_1:
        lea     eax,[ecx - 3]
        mov     ecx,string
        sub     eax,ecx
        ret
byte_0:
        lea     eax,[ecx - 4]
        mov     ecx,string
        sub     eax,ecx
        ret
strlen  endp
        end

STTNI Version:

int sttni_strlen(const char * src)
{
    char eom_vals[32] = {1, 255, 0};

    __asm{
        mov     eax, src
        movdqu  xmm2, eom_vals
        xor     ecx, ecx
    topofloop:
        add     eax, ecx
        movdqu  xmm1, OWORD PTR[eax]
        pcmpistri xmm2, xmm1, imm8
        jnz     topofloop
    endofstring:
        add     eax, ecx
        sub     eax, src
        ret
    }
}

Current Code: Minimum of 11 instructions; inner loop processes 4 bytes with 8 instructions
STTNI Code: Minimum of 10 instructions; a single inner loop processes 16 bytes with only 4
instructions

87
CRC32 Preliminary Performance

CRC32 optimized code:

crc32c_sse42_optimized_version(uint32 crc, unsigned char const *p, size_t len)
{   // Assuming len is a multiple of 0x10
    asm("pusha");
    asm("mov %0, %%eax" :: "m" (crc));
    asm("mov %0, %%ebx" :: "m" (p));
    asm("mov %0, %%ecx" :: "m" (len));
    asm("1:");
    // Processing four bytes at a time: Unrolled four times:
    asm("crc32 %eax, 0x0(%ebx)");
    asm("crc32 %eax, 0x4(%ebx)");
    asm("crc32 %eax, 0x8(%ebx)");
    asm("crc32 %eax, 0xc(%ebx)");
    asm("add $0x10, %ebx");
    asm("sub $0x10, %ecx");
    asm("jecxz 2f");
    asm("jmp 1b");
    asm("2:");
    asm("mov %%eax, %0" : "=m" (crc));
    asm("popa");
    return crc;
}

• Preliminary tests involved kernel code implementing CRC algorithms commonly used by iSCSI drivers
• 32-bit and 64-bit versions of the kernel under test
– 32-bit version processes 4 bytes of data using 1 CRC32 instruction
– 64-bit version processes 8 bytes of data using 1 CRC32 instruction
• Input strings of sizes 48 bytes and 4KB used for the test

Speedup vs. software CRC32C:
                                32-bit     64-bit
Input data size = 48 bytes      6.53 X     9.85 X
Input data size = 4 KB          9.3 X      18.63 X

Preliminary results show the CRC32 instruction out-performing
the fastest CRC32C software algorithm by a big margin
88
Idle Power Matters

• Data center operating costs1


– 41M physical servers by 2010, average utilization < 10%
– $0.50 spent on power and cooling for every $1 spent on
server hardware
• Regulatory requirements affect all segments
– ENERGY STAR* and related requirements
• Environmental responsibility

Idle power consumption not just mobile concern

1. IDC’s Datacenter Trends Survey, January 2007

89
CPU Core Power Consumption

• High frequency processes are leaky
– Reduced via high-K metal gate process,
design technologies, manufacturing
optimizations

[Chart: core power: leakage]

90
CPU Core Power Consumption

• High frequency designs require high
performance global clock distribution
• High frequency processes are leaky
– Reduced via high-K metal gate process,
design technologies, manufacturing
optimizations

[Chart: core power: clock distribution, leakage]

91
CPU Core Power Consumption

• Remaining power in logic, local clocks
– Power efficient microarchitecture, good
clock gating minimize waste
• High frequency designs require high
performance global clock distribution
• High frequency processes are leaky
– Reduced via high-K metal gate process,
design technologies, manufacturing
optimizations

[Chart: total core power consumption: local clocks and logic, clock distribution, leakage]

Challenge – Minimize power when idle

92
C-State Support Before Intel® Core™
Microarchitecture (Nehalem)
• C0: CPU active state

[Chart: active core power: local clocks and logic, clock distribution, leakage]

93
C-State Support Before Intel® Core™
Microarchitecture (Nehalem)
• C0: CPU active state
• C1, C2 states (early 1990s):
• Stop core pipeline
• Stop most core clocks

[Chart: remaining core power]

94
C-State Support Before Intel® Core™
Microarchitecture (Nehalem)
• C0: CPU active state
• C1, C2 states (early 1990s):
• Stop core pipeline
• Stop most core clocks
• C3 state (mid 1990s):
• Stop remaining core clocks

[Chart: remaining core power]

95
C-State Support Before Intel® Core™
Microarchitecture (Nehalem)
• C0: CPU active state
• C1, C2 states (early 1990s):
• Stop core pipeline
• Stop most core clocks
• C3 state (mid 1990s):
• Stop remaining core clocks
• C4, C5, C6 states (mid 2000s):
• Drop core voltage, reducing leakage
• Voltage reduction via shared VR

[Chart: remaining core power]

Existing C-states significantly reduce idle power

96
C-State Support Before Intel® Core™
Microarchitecture (Nehalem)
• Cores share a single voltage plane
– All cores must be idle before voltage reduced
– Independent VR’s per core prohibitive from cost and form
factor perspective
• Deepest C-states have relatively long exit latencies
– System / VR handshake, ramp voltage, restore state,
restart pipeline, etc.

Deepest C-states available in mobile products

97
Intel® Core™ Microarchitecture
(Nehalem) Core C-State Support
• C0: CPU active state

[Chart: active core power: local clocks and logic, clock distribution, leakage]

98
Intel® Core™ Microarchitecture
(Nehalem) Core C-State Support
• C0: CPU active state
• C1 state:
• Stop core pipeline
• Stop most core clocks

[Chart: remaining core power]

99
Intel® Core™ Microarchitecture
(Nehalem) Core C-State Support
• C0: CPU active state
• C1 state:
• Stop core pipeline
• Stop most core clocks
• C3 state:
• Stop remaining core clocks

[Chart: remaining core power]

100
Intel® Core™ Microarchitecture
(Nehalem) Core C-State Support
• C0: CPU active state
• C1 state:
• Stop core pipeline
• Stop most core clocks
• C3 state:
• Stop remaining core clocks
• C6 state:
• Processor saves architectural state
• Turn off power gate, eliminating leakage

[Chart: remaining core power]

Core idle power goes to ~0

101
C6 Support on Intel® Core™2 Duo Mobile
Processor (Penryn)

102
C6 Support on Intel® Core™2 Duo Mobile
Processor (Penryn)

Cores running applications.

[Chart: power over time for Core 0 and Core 1]

103
C6 Support on Intel® Core™2 Duo Mobile
Processor (Penryn)

Task completes. No work waiting. OS executes MWAIT(C6)
instruction.

[Chart: power over time for Core 0 and Core 1]

104
C6 Support on Intel® Core™2 Duo Mobile
Processor (Penryn)

Execution stops. Core architectural state saved. Core
clocks stopped. Core 0 continues execution undisturbed.

[Chart: power over time for Core 0 and Core 1]

105
C6 Support on Intel® Core™2 Duo Mobile
Processor (Penryn)

Task completes. No work waiting. OS executes MWAIT(C6)
instruction. Core enters C6.

[Chart: power over time for Core 0 and Core 1]

106
C6 Support on Intel® Core™2 Duo Mobile
Processor (Penryn)

VR voltage reduced. Power drops.

[Chart: power over time for Core 0 and Core 1]

107
C6 Support on Intel® Core™2 Duo Mobile
Processor (Penryn)

Interrupt for Core 1 arrives. VR voltage increased. Core 1
clocks turn on, core state restored, and the core resumes
execution at the instruction following MWAIT(C6). Core 0
remains idle.

[Chart: power over time for Core 0 and Core 1]

108
C6 Support on Intel® Core™2 Duo Mobile
Processor (Penryn)

Interrupt for Core 0 arrives. Core 0 returns to C0 and resumes
execution at the instruction following MWAIT(C6). Core 1 continues
execution undisturbed.

[Chart: power over time for Core 0 and Core 1]

C6 significantly reduces idle power consumption

109
Reducing Platform Idle Power
• Dramatic improvements in CPU idle power increase
importance of platform improvements
• Memory power:
– Memory clocks stopped between requests at low utilization
– Memory to self refresh in package C3, C6
• Link power:
– Intel® QuickPath Interconnect links to lower power states
as CPU becomes less active
– PCI Express* links on chipset have similar behavior
• Hint to VR to reduce phases during periods of low
current demand

Intel® Core™ microarchitecture (Nehalem)


reduces CPU and platform power
110
Intel® Core™ Microarchitecture
(Nehalem): Integrated Power Gate

• Integrated power switch
between VR output and
core voltage supply
– Very low on-resistance
– Very high off-resistance
– Much faster voltage ramp
than external VR
• Enables per core C6 state
– Individual cores transition to
~0 power state
– Transparent to other cores,
platform, software, and VR

[Diagram: VCC feeding per-core power gates for Core0 through Core3; memory system, cache, and I/O supplied from VTT]

Close collaboration with process technology
to optimize device characteristics

111
Agenda
• Intel® Core™ microarchitecture (Nehalem)
power management overview
• Minimizing idle power consumption
• Performance when you need it

112
Turbo Mode Before Intel® Core™
Microarchitecture (Nehalem)

Clock Stopped: power reduction in inactive cores

No Turbo

[Chart: frequency (F) of Core 0 and Core 1 with a lightly threaded workload; the idle core's clock is stopped]

Workload Lightly Threaded

113
Turbo Mode Before Intel® Core™
Microarchitecture (Nehalem)

Clock Stopped: power reduction in inactive cores

Turbo Mode: in response to the workload, adds additional
performance bins within headroom

[Chart: frequency (F) of Core 0 and Core 1; the active core gains additional frequency bins while the idle core's clock is stopped]

Workload Lightly Threaded

114
Intel® Core™ Microarchitecture
(Nehalem) Turbo Mode

Power Gating: zero power for inactive cores

No Turbo

[Chart: frequency (F) of Core 0 through Core 3; inactive cores are power gated]

Workload Lightly Threaded
or < TDP

115
Intel® Core™ Microarchitecture
(Nehalem) Turbo Mode

Power Gating: zero power for inactive cores

Turbo Mode: in response to the workload, adds additional
performance bins within headroom

[Chart: frequency (F) of Core 0 through Core 3; active cores gain additional frequency bins while inactive cores are power gated]

Workload Lightly Threaded
or < TDP

116
Intel® Core™ Microarchitecture
(Nehalem) Turbo Mode

Power Gating: zero power for inactive cores

Turbo Mode: in response to the workload, adds additional
performance bins within headroom

[Chart: frequency (F) of Core 0 through Core 3; active cores gain additional frequency bins while inactive cores are power gated]

Workload Lightly Threaded
or < TDP

117
Intel® Core™ Microarchitecture
(Nehalem) Turbo Mode

Active cores running workloads < TDP

No Turbo

[Chart: frequency (F) of Core 0 through Core 3; all cores active at nominal frequency]

Workload Lightly Threaded
or < TDP

118
Intel® Core™ Microarchitecture
(Nehalem) Turbo Mode

Active cores running workloads < TDP

Turbo Mode: in response to the workload, adds additional
performance bins within headroom

[Chart: frequency (F) of Core 0 through Core 3; all cores gain additional frequency bins within the available headroom]

Workload Lightly Threaded
or < TDP

119
Intel® Core™ Microarchitecture
(Nehalem) Turbo Mode

Power Gating: zero power for inactive cores

Turbo Mode: in response to the workload, adds additional
performance bins within headroom

[Chart: frequency (F) of Core 0 through Core 3]

Workload Lightly Threaded
or < TDP

Dynamically Delivering Optimal Performance
and Energy Efficiency

120
Additional Sources of
Information on This Topic:
• Other Sessions / Chalk Talks / Labs:
– TCHS001: Next Generation Intel® Core™ Microarchitecture (Nehalem) Family of
Processors: Screaming Performance, Efficient Power (8/19, 3:00 – 3:50)
– DPTS001: High End Desktop Platform Design Overview for the Next Generation
Intel® Microarchitecture (Nehalem) Processor (8/20, 2:40 – 3:30)
– NGMS001: Next Generation Intel® Microarchitecture (Nehalem) Family:
Architectural Insights and Power Management (8/19, 4:00 – 5:50)
– NGMC001: Chalk Talk: Next Generation Intel® Microarchitecture (Nehalem)
Family (8/19, 5:50 – 6:30)
– NGMS002: Tuning Your Software for the Next Generation Intel®
Microarchitecture (Nehalem) Family (8/20, 11:10 – 12:00)
– PWRS003: Power Managing the Virtual Data Center with Windows Server* 2008
/ Hyper-V and Next Generation Processor-based Intel® Servers Featuring Intel®
Dynamic Power Technology (8/19, 3:00 – 3:50)
– PWRS005: Platform Power Management Options for Intel® Next Generation
Server Processor Technology (Tylersburg-EP) (8/21, 1:40 – 2:30)
– SVRS002: Overview of the Intel® QuickPath Interconnect (8/21, 11:10 –
12:00)

121
Session Presentations - PDFs

The PDF for this Session presentation is


available from our IDF Content Catalog at the
end of the day at:
www.intel.com/idf
or

https://intel.wingateweb.com/US08/scheduler/public.jsp

122
Please Fill out the
Session Evaluation Form

Place form in evaluation box


at the back of session room

Thank you for your input, we use it to improve


future Intel Developer Forum events

123
