A Look Inside Intel®: The Core (Nehalem) Microarchitecture

Beeman Strong Intel® Core™ microarchitecture (Nehalem) Architect Intel Corporation

Agenda
• Intel® Core™ Microarchitecture (Nehalem) Design Overview • Enhanced Processor Core
• Performance Features • Intel® Hyper-Threading Technology

• New Platform
• New Cache Hierarchy • New Platform Architecture

• Performance Acceleration
• Virtualization • New Instructions

• Power Management Overview
• Minimizing Idle Power Consumption • Performance when it counts

2

Scalable Cores
Same core for all segments Common software optimization Common feature set

Intel® Core™ Microarchitecture (Nehalem)

Servers/Workstations
Energy Efficiency, Performance, Virtualization, Reliability, Capacity, Scalability

45nm

Desktop
Performance, Graphics, Energy Efficiency, Idle Power, Security

Mobile
Battery Life, Performance, Energy Efficiency, Graphics, Security

Optimized cores to meet all market segments 3

The First Intel® Core™ Microarchitecture (Nehalem) Processor
Memory Controller M i s c I O Q P I 0 M i s c I O Q P I 1

Core

Core

Q u e u e

Core

Core

Shared L3 Cache

A Modular Design for Flexibility
4

QPI: Intel® QuickPath Interconnect (Intel® QPI)

Agenda
• Intel® Core™ Microarchitecture (Nehalem) Design Overview • Enhanced Processor Core
• Performance Features • Intel® Hyper-Threading Technology

• New Platform
• New Cache Hierarchy • New Platform Architecture

• Performance Acceleration
• Virtualization • New Instructions

• Power Management Overview
• Minimizing Idle Power Consumption • Performance when it counts

5

Intel® Core™ Microarchitecture Recap
• Wide Dynamic Execution
– 4-wide decode/rename/retire

• Advanced Digital Media Boost
– 128-bit wide SSE execution units

• Intel HD Boost
– New SSE4.1 Instructions

• Smart Memory Access
– Memory Disambiguation – Hardware Prefetching

• Advanced Smart Cache
– Low latency, high BW shared L2 cache Nehalem builds on the great Core microarchitecture

6

Designed for Performance
New SSE4.2 Instructions Improved Lock Support
L1 Data Cache Execution Units Memory Ordering & Execution

Additional Caching Hierarchy

L2 Cache & Interrupt Servicing Paging

Deeper Buffers
Out-of-Order Scheduling & Retirement

Instruction Decode & Microcode

Improved Loop Branch Prediction Streaming
Instruction Fetch & L1 Cache

Simultaneous Multi-Threading

Faster Virtualization

Better Branch Prediction

7

Macrofusion
• Introduced in Intel® Core™2 microarchitecture • TEST/CMP instruction followed by a conditional branch treated as a single instruction • Higher performance & improved power efficiency • Support all the cases in Intel Core 2 microarchitecture PLUS
– CMP+Jcc macrofusion added for the following branch conditions
– – – – JL/JNGE JGE/JNL JLE/JNG JG/JNLE

– Decode/execute/retire as one instruction

– Improves throughput/Reduces execution latency – Less processing required to accomplish the same work

– Intel® Core™ microarchitecture (Nehalem) supports macrofusion in both 32-bit and 64-bit modes
– Intel Core2 microarchitecture only supports macrofusion in 32-bit mode

Increased macrofusion benefit on Intel® Core™ microarchitecture (Nehalem)
8

Intel® Core™ Microarchitecture (Nehalem) Loop Stream Detector
• Loop Stream Detector identifies software loops – Stream from Loop Stream Detector instead of normal path – Disable unneeded blocks of logic for power savings – Higher performance by removing instruction fetch limitations • Higher performance: Expand the size of the loops detected (vs Core 2) • Improved power efficiency: Disable even more logic (vs Core 2)

Intel Core Microarchitecture (Nehalem) Loop Stream Detector
Branch Prediction

Fetch

Decode

Loop Stream Detector

28 Micro-Ops

9

Branch Prediction Improvements
• Focus on improving branch prediction accuracy each CPU generation
– Higher performance & lower power through more accurate prediction

• Example Intel® Core™ microarchitecture (Nehalem) improvements
– L2 Branch Predictor
– Improve accuracy for applications with large code size (ex. database applications)

– Advanced Renamed Return Stack Buffer (RSB)
– Remove branch mispredicts on x86 RET instruction (function returns) in the common case

Greater Performance through Branch Prediction

10

Execution Unit Overview
Unified Reservation Station • Schedules operations to Execution units • Single Scheduler for all Execution Units • Can be used by all integer, all FP, etc. Execute 6 operations/cycle • 3 Memory Operations • 1 Load • 1 Store Address • 1 Store Data • 3 “Computational” Operations

Unified Reservation Station
Port 0
Integer ALU & Shift

Port 4

Port 2

Integer ALU & LEA

Port 1

Load

Store Address

Port 3

Store Data

Integer ALU & Shift

Port 5

FP Multiply Divide
SSE Integer ALU Integer Shuffles

FP Add
Complex Integer SSE Integer Multiply

Branch FP Shuffle
SSE Integer ALU Integer Shuffles

11

Increased Parallelism
Concurrent uOps Possible
1

• Goal: Keep powerful execution engine fed • Nehalem increases size of out of order window by 33% • Must also increase other corresponding structures
Structure Intel® Core™ microarchitecture (formerly Merom)
32

128 112 96 80 64 48 32 16 0
Dothan Merom Nehalem

Intel® Core™ microarchitecture (Nehalem)
36

Comment

Reservation Station

Dispatches operations to execution units Tracks all load operations allocated Tracks all store operations allocated

Load Buffers Store Buffers

32 20

48 32

Increased Resources for Higher Performance
Intel® Pentium® M processor (formerly Dothan) Intel® Core™ microarchitecture (formerly Merom) Intel® Core™ microarchitecture (Nehalem)
1

12

Enhanced Memory Subsystem
• Responsible for:
– Handling of memory operations (loads/stores)

• Key Intel® Core™2 Features
– Memory Disambiguation – Hardware Prefetchers – Advanced Smart Cache

• New Intel® Core™ Microarchitecture (Nehalem) Features
– New TLB Hierarchy (new, low latency 2nd level unified TLB) – Fast 16-Byte unaligned accesses – Faster Synchronization Primitives

13

Intel® Hyper-Threading Technology
• Also known as Simultaneous MultiThreading (SMT)
– Run 2 threads at the same time per core

w/o SMT

SMT

• Take advantage of 4-wide execution engine Time (proc. cycles)
– Keep it fed with multiple threads – Hide latency of a single thread

• Most power efficient performance feature
– Very low die area cost – Can provide significant performance benefit depending on application – Much more efficient than adding an entire core

• Intel® Core™ microarchitecture (Nehalem) advantages
– Larger caches – Massive memory BW

Note: Each box represents a processor execution unit

Simultaneous multi-threading enhances performance and energy efficiency
14

SMT Performance Chart
40% 35% 30% 25% 20% 15% 10% 5% 0%
Floa tingPoint 3dsM ax* Inte ger Cineben 10POV-R y* 3.7 3DM ch* a ark* beta25 Vanta * C ge PU

Perform nce G SM enabledvs disabled a ain T
34% 29%

13% 7% 10%

16%

Intel® Core™ i7

Floating Point is based on SPECfp_rate_base2006* estimate Integer is based on SPECint_rate_base2006* estimate
SPEC, SPECint, SPECfp, and SPECrate are trademarks of the Standard Performance Evaluation Corporation. For more information on SPEC benchmarks, see: http://www.spec.org
Source: Intel. Configuration: pre-production Intel® Core™ i7 processor with 3 channel DDR3 memory. Performance tests and ratings are measured using specific computer systems and / or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/

15

Agenda
• Intel® Core™ Microarchitecture (Nehalem) Design Overview • Enhanced Processor Core
• Performance Features • Intel® Hyper-Threading Technology

• New Platform
• New Cache Hierarchy • New Platform Architecture

• Performance Acceleration
• Virtualization • New Instructions

• Power Management Overview
• Minimizing Idle Power Consumption • Performance when it counts

16

Designed For Modularity
C O R E C O R E C O R E

Core

DRAM

L3 Cache

IM C

Intel ® QPI


Intel QPI

Intel ® QPI

Power & Clock

Uncore

Differentiation in the “Uncore”:
# cores # mem channels #QPI Links Size of cache Type of Memory

Intel® QPI: Intel® QuickPath Interconnect (Intel® QPI) Power Management Integrated graphics

Optimal price / performance / energy efficiency for server, desktop and mobile products
17

2008 – 2009 Servers & Desktops

Intel® Smart Cache – 3rd Level Cache
• Shared across all cores • Size depends on # of cores
– Quad-core: Up to 8MB (16-ways) – Scalability:
– Built to vary size with varied core counts – Built to easily increase L3 size in future parts
Core Core Core
L1 Caches L1 Caches L1 Caches


L2 Cache L2 Cache L2 Cache

• Perceived latency depends on frequency ratio between core & uncore • Inclusive cache policy for best performance
– Address residing in L1/L2 must be present in 3rd level cache

L3 Cache

18

Why Inclusive?
• Inclusive cache provides benefit of an on-die snoop filter • Core Valid Bits
– 1 bit per core per cache line
– If line may be in a core, set core valid bit – Snoop only needed if line is in L3 and core valid bit is set – Guaranteed that line is not modified if multiple bits set

• Scalability
– Addition of cores/sockets does not increase snoop traffic seen by cores

• Latency
– Minimize effective cache latency by eliminating cross-core snoops in the common case – Minimize snoop response time for cross-socket cases

19

Intel® Core™ Microarchitecture (Nehalem-EP) Platform Architecture
• Integrated Memory Controller
– 3 DDR3 channels per socket – Massive memory bandwidth – Memory Bandwidth scales with # of processors – Very low memory latency
Nehalem EP Nehalem EP

• Intel® QuickPath Interconnect (Intel® QPI)
– – – – – – New point-to-point interconnect Socket to socket connections Socket to chipset connections Build scalable solutions Up to 6.4 GT/sec (12.8 GB/sec) Bidirectional (=> 25.6 GB/sec)

Tylersburg EP

memory

CPU

IOH

CPU

memory

Significant performance leap from new platform
Intel® Core™ microarchitecture (Nehalem-EP) Intel® Next Generation Server Processor Technology (Tylersburg-EP)

memory

CPU
IOH

CPU

memory

20

Non-Uniform Memory Access (NUMA)
• FSB architecture

• Starting with Intel Core™ microarchitecture (Nehalem)
®

– All memory in one location

Nehalem EP

Nehalem EP

Relative Memory Latency

• Latency to memory dependent on location • Local memory has highest BW, lowest latency • Remote Memory still very fast
Ensure software is NUMAoptimized for best performance

– Memory located in multiple places

Tylersburg EP

Relative Memory Latency Comparison
1.00 0.80 0.60 0.40 0.20 0.00 Harpertown (FSB 1600) Nehalem (DDR3-1067) Local Nehalem (DDR3-1067) Remote

Intel® Core™ microarchitecture (Nehalem-EP) Intel® Next Generation Server Processor Technology (Tylersburg-EP)

21

Memory Bandwidth – Initial Intel® Core™ Microarchitecture (Nehalem) Products
• • • • 3 memory channels per socket ≥ DDR3-1066 at launch Massive memory BW Scalability
– Design IMC and core to take advantage of BW – Allow performance to scale with cores
– Core enhancements
– Support more cache misses per core – Aggressive hardware prefetching w/ throttling enhancements
6102 9776

Stream Bandwidth – Mbytes/Sec (Triad)
33376

3.4X

– Example IMC Features
– Independent memory channels – Aggressive Request Reordering

HTN 3.16/ BF1333/ 667 MHz mem

HTN 3.00/ SB1600/ 800 MHz mem

NHM 2.66/ 6.4 QPI/ 1066 MHz mem

Source: Intel Internal measurements – August 20081

Massive memory BW provides performance and scalability
1

HTN: Intel® Xeon® processor 5400 Series (Harpertown)

NHM: Intel® Core™ microarchitecture (Nehalem)

22

Agenda
• Intel® Core™ Microarchitecture (Nehalem) Design Overview • Enhanced Processor Core
• Performance Features • Intel® Hyper-Threading Technology

• New Platform
• New Cache Hierarchy • New Platform Architecture

• Performance Acceleration
• Virtualization • New Instructions

• Power Management Overview
• Minimizing Idle Power Consumption • Performance when it counts

23

Virtualization
• To get best virtualized performance
– Have best native performance – Reduce transitions to/from virtual machine – Reduce latency of transitions
Round Trip Virtualization Latency
100%
Relative Latency
1

80% 60% 40% 20% 0%
Merom Penryn Nehalem

• Intel® Core™ microprocessor (Nehalem) virtualization features
– Reduced latency for transitions – Virtual Processor ID (VPID) to reduce effective cost of transitions – Extended Page Table (EPT) to reduce # of transitions

24

EPT Solution
EPT CR3 Guest Linear Address Base Pointer
Intel® 64 Page Tables

Guest Physical Address

EPT Page Tables

Host Physical Address

• Intel® 64 Page Tables – Map Guest Linear Address to Guest Physical Address – Can be read and written by the guest OS • New EPT Page Tables under VMM Control – Map Guest Physical Address to Host Physical Address – Referenced by new EPT base pointer • No VM Exits due to Page Faults, INVLPG or CR3 accesses

25

Extending Performance and Energy Efficiency
Faster XML parsing Faster search and pattern matching Novel parallel data matching and comparison operations Improved performance for Genome Mining, Handwriting recognition. Fast Hamming distance / Population count Hardware based CRC instruction Accelerated Network attached storage Improved power efficiency for Software I-SCSI, RDMA, and SCTP ATA

- Intel® SSE4.2 Instruction Set Architecture (ISA) Leadership in 2008

Accelerated String and Text Processing

STTNI

(45nm CPUs)

SSE4

Accelerated Searching & Pattern Recognition of Large Data Sets

(Penryn Core)

SSE4.1

(Nehalem Core)

SSE4.2

STTNI e.g. XML
e.g. XML acceleration

New Communications Capabilities

(Application Targeted Accelerators)

ATA

POPCNT
e.g. Genome Mining

CRC32
e.g. iSCSI Application

What should the applications, OS and VMM vendors do?: Understand the benefits & take advantage of new instructions in 2008. Provide us feedback on instructions ISV would like to see for next generation of applications

26

Agenda
• Intel® Core™ Microarchitecture (Nehalem) Design Overview • Enhanced Processor Core
• Performance Features • Intel® Hyper-Threading Technology

• New Platform
• New Cache Hierarchy • New Platform Architecture

• Performance Acceleration
• Virtualization • New Instructions

• Power Management Overview
• Minimizing Idle Power Consumption • Performance when it counts

27

Intel® Core™ Microarchitecture (Nehalem) Design Goals
World class performance combined with superior energy efficiency – Optimized for: Dynamically scaled performance when Single Thread Multithreads Existing Apps Emerging Apps All Usages

needed to maximize energy efficiency A single, scalable, foundation optimized across each segment and power envelope

Desktop / Mobile

A Dynamic and Design Scalable Microarchitecture
28

Workstation / Server

Power Control Unit
Vcc Core Vcc Freq. Freq. Sensors Core Vcc Freq. Freq. Sensors Core Vcc Freq. Freq. Sensors Core Vcc Freq. Freq. Sensors PLL Uncore, Uncore, LLC PLL PLL PLL PLL BCLK

PCU

Integrated proprietary microcontroller Shifts control from hardware to embedded firmware Real time sensors for temperature, current, power Flexibility enables sophisticated algorithms, tuned for current operating conditions

29

Minimizing Idle Power Consumption
• Operating system notifies CPU when no tasks are ready for execution
– Execution of MWAIT instruction

• MWAIT arguments hint at expected idle duration
Idle Power (W)

– Higher numbered C-states lower power, but also longer exit latency

C0

• CPU idle states referred to as “C-States”

C1

Cn

Exit Latency (us)

30

C6 on Intel® Core™ Microarchitecture (Nehalem)

31

C6 on Intel® Core™ Microarchitecture (Nehalem)
Core Power
Cores 0, 1, 2, and 3 running applications.

Core 3 Core 2

0

0

Core 1
0

Core 0
0

Time

32

C6 on Intel® Core™ Microarchitecture (Nehalem)
Core Power
Task completes. No work waiting. OS executes MWAIT(C6) instruction.

Core 3 Core 2

0

0

Core 1
0

Core 0
0

Time
33

C6 on Intel® Core™ Microarchitecture (Nehalem)
Core Power
Execution stops. Core architectural state saved. Core clocks stopped. Cores 0, 1, and 3 continue execution undisturbed.

Core 3 Core 2

0

0

Core 1
0

Core 0
0

Time

34

C6 on Intel® Core™ Microarchitecture (Nehalem)
Core Power
Core power gate turned off. Core voltage goes to 0. Cores 0, 1, and 3 continue execution undisturbed.

Core 3 Core 2

0

0

Core 1
0

Core 0
0

Time
35

C6 on Intel® Core™ Microarchitecture (Nehalem)
Core Power Core 3
0

Core 2
0 Task completes. No work waiting. OS executes MWAIT(C6) instruction. Core 0 enters C6. Cores 1 and 3 continue execution undisturbed.

Core 1 Core 0

0

0

Time
36

C6 on Intel® Core™ Microarchitecture (Nehalem)
Core Power
Interrupt for Core 2 arrives. Core 2 returns to C0, execution resumes at instruction following MWAIT(C6). Cores 1 and 3 continue execution undisturbed.

Core 3 Core 2

0

0

Core 1
0

Core 0
0

Time
37

C6 on Intel® Core™ Microarchitecture (Nehalem)
Core Power Core 3
0 Interrupt for Core 0 arrives. Power gate turns on, core clock turns on, core state restored, core resumes execution at instruction following MWAIT(C6). Cores 1, 2, and 3 continue execution undisturbed.

Core 2 Core 1 Core 0

0

0

0

Time Core independent C6 on Intel Core microarchitecture (Nehalem) extends benefits
38

Intel® Core™ Microarchitecture (Nehalem)based Processor
M i s c I O Q P I 0 Memory Controller M Total CPU Power Consumption i s c Core I O Q P I 1

Cores (x N)

Core

Core

Q u e u e

Core

Core

Clocks and Logic

Shared L3 Cache

Core Clock Distribution Core Leakage Uncore Logic I/O Uncore Clock Distribution Uncore Leakage

• Significant logic outside core
– – – – Integrated memory controller Large shared cache High speed interconnect Arbitration logic
39

QPI = Intel® QuickPath Interconnect (Intel® QPI)

Intel® Core™ Microarchitecture (Nehalem) Package C-State Support
• All cores in C6 state:
– Core power to ~0 Cores (x N) Active CPU Power Core Clocks and Logic Core Clock Distribution Core Leakage Uncore Logic I/O Uncore Clock Distribution Uncore Leakage

40

Intel® Core™ Microarchitecture (Nehalem) Package C-State Support
• All cores in C6 state:
– Core power to ~0 Active CPU Power

• Package to C6 state:
– Uncore logic stops toggling

Uncore Logic I/O Uncore Clock Distribution Uncore Leakage

41

Intel® Core™ Microarchitecture (Nehalem) Package C-State Support
• All cores in C6 state:
– Core power to ~0 Active CPU Power

• Package to C6 state:
– Uncore logic stops toggling – I/O to lower power state

I/O Uncore Clock Distribution Uncore Leakage

42

Intel® Core™ Microarchitecture (Nehalem) Package C-State Support
• All cores in C6 state:
– Core power to ~0 Active CPU Power

• Package to C6 state:
– Uncore logic stops toggling – I/O to lower power state – Uncore clock grids stopped

Substantial reduction in idle CPU power
43

I/O Uncore Clock Distribution Uncore Leakage

Managing Active Power
• Operating system changes frequency as needed to meet performance needs, minimize power
– Enhanced Intel SpeedStep® Technology – Referred to as processor P-States

• PCU tunes voltage for given frequency, operating conditions, and silicon characteristics

PCU automatically optimizes operating voltage

44

Turbo Mode: Key to Scalability Goal
• Intel® Core™ microarchitecture (Nehalem) is a scalable architecture
– High frequency core for performance in less constrained form factors – Retain ability to use that frequency in very small form factors – Retain ability to use that frequency when running lightly threaded or lower power workloads

Nehalem

• Turbo utilizes available frequency:
– Maximizes both single-thread and multithread performance in the same part

Turbo Mode provides performance when you need it
45

Turbo Mode Enabling
• Turbo Mode exposed as additional Enhanced Intel SpeedStep® Technology operating point
– Operating system treats as any other P-state, requesting Turbo Mode when it needs more performance – Performance benefit comes from higher operating frequency – no need to enable or tune software

• Turbo Mode is transparent to system
– Frequency transitions handled completely in hardware – PCU keeps silicon within existing operating limits – Systems designed to same specs, with or without Turbo Mode

Performance benefits with existing applications and operating systems
46

Summary
• Intel® Core™ microarchitecture (Nehalem) – The 45nm Tock • Designed for
– Power Efficiency – Scalability – Performance

• Key Innovations:
– Enhanced Processor Core – Brand New Platform Architecture – Sophisticated Power Management

High Performance When You Need It Lower Power When You Don’t

47

Q&A

48

Legal Disclaimer
• INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS. • Intel may make changes to specifications and product descriptions at any time, without notice. • All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice. • Intel, processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. • Merom, Penryn, Hapertown, Nehalem, Dothan, Westmere, Sandy Bridge, and other code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user • Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. • Intel, Intel Inside, Intel Core, Pentium, Intel SpeedStep Technology, and the Intel logo are trademarks of Intel Corporation in the United States and other countries. • *Other names and brands may be claimed as the property of others. • Copyright © 2008 Intel Corporation.

49

Risk Factors
This presentation contains forward-looking statements that involve a number of risks and uncertainties. These statements do not reflect the potential impact of any mergers, acquisitions, divestitures, investments or other similar transactions that may be completed in the future. The information presented is accurate only as of today’s date and will not be updated. In addition to any factors discussed in the presentation, the important factors that could cause actual results to differ materially include the following: Demand could be different from Intel's expectations due to factors including changes in business and economic conditions, including conditions in the credit market that could affect consumer confidence; customer acceptance of Intel’s and competitors’ products; changes in customer order patterns, including order cancellations; and changes in the level of inventory at customers. Intel’s results could be affected by the timing of closing of acquisitions and divestitures. Intel operates in intensely competitive industries that are characterized by a high percentage of costs that are fixed or difficult to reduce in the short term and product demand that is highly variable and difficult to forecast. Revenue and the gross margin percentage are affected by the timing of new Intel product introductions and the demand for and market acceptance of Intel's products; actions taken by Intel's competitors, including product offerings and introductions, marketing programs and pricing pressures and Intel’s response to such actions; Intel’s ability to respond quickly to technological developments and to incorporate new features into its products; and the availability of sufficient supply of components from suppliers to meet demand. The gross margin percentage could vary significantly from expectations based on changes in revenue levels; product mix and pricing; capacity utilization; variations in inventory valuation, including variations related to the timing of qualifying products for sale; excess or obsolete inventory; manufacturing yields; changes in unit costs; impairments of long-lived assets, including manufacturing, assembly/test and intangible assets; and the timing and execution of the manufacturing ramp and associated costs, including start-up costs. Expenses, particularly certain marketing and compensation expenses, vary depending on the level of demand for Intel's products, the level of revenue and profits, and impairments of long-lived assets. Intel is in the midst of a structure and efficiency program that is resulting in several actions that could have an impact on expected expense levels and gross margin. Intel's results could be impacted by adverse economic, social, political and physical/infrastructure conditions in the countries in which Intel, its customers or its suppliers operate, including military conflict and other security risks, natural disasters, infrastructure disruptions, health concerns and fluctuations in currency exchange rates. Intel's results could be affected by adverse effects associated with product defects and errata (deviations from published specifications), and by litigation or regulatory matters involving intellectual property, stockholder, consumer, antitrust and other issues, such as the litigation and regulatory matters described in Intel's SEC reports. A detailed discussion of these and other factors that could affect Intel’s results is included in Intel’s SEC filings, 50 including the report on Form 10-Q for the quarter ended June 28, 2008.

Backup Slides

51

Intel® Core™ Microarchitecture (Nehalem) Design Goals
World class performance combined with superior energy efficiency – Optimized for: Dynamically scaled performance when Single Thread Multithreads Existing Apps Emerging Apps All Usages

needed to maximize energy efficiency A single, scalable, foundation optimized across each segment and power envelope

Desktop / Mobile

Workstation / Server

A Dynamic and Design Scalable Microarchitecture
52

Tick-Tock Development Model
Merom1
NEW Microarchitecture

Penryn
NEW Process

Nehalem
NEW Microarchitecture

Westmere
NEW Process

Sandy Bridge
NEW Microarchitecture

65nm

45nm

45nm

32nm

32nm

TOCK

TICK

TOCK
Forecast

TICK

TOCK

Intel® Core™ microarchitecture (formerly Merom) 45nm next generation Intel® Core™ microarchitecture (Penryn) Intel® Core™ Microarchitecture (Nehalem) Intel® Microarchitecture (Westmere) 53 Intel® Microarchitecture (Sandy Bridge)
1

All dates, product descriptions, availability and plans are forecasts and subject to change without notice.

53

Enhanced Processor Core
Instruction Fetch and Pre Decode Instruction Queue Decode
ITLB

32kB Instruction Cache

Front End Execution Engine Memory

4
Rename/Allocate
Retirement Unit (ReOrder Buffer)

2nd Level TLB 256kB 2nd Level Cache

L3 and beyond

4

Reservation Station 6 Execution Units
DTLB

32kB Data Cache

54

Front-end
• Responsible for feeding the compute engine
– Decode instructions – Branch Prediction

• Key Intel® Core™2 microarchitecture Features
– 4-wide decode – Macrofusion – Loop Stream Detector
Instruction Fetch and Pre Decode Instruction Queue Decode
ITLB

32kB Instruction Cache

55

Loop Stream Detector Reminder
• Loops are very common in most software • Take advantage of knowledge of loops in HW
– Decoding the same instructions over and over – Making the same branch predictions over and over

• Loop Stream Detector identifies software loops
– Stream from Loop Stream Detector instead of normal path – Disable unneeded blocks of logic for power savings – Higher performance by removing instruction fetch limitations

Intel® Core™2 Loop Stream Detector
Loop Stream Detector

Branch Prediction

Fetch

Decode

18 Instructions

56

Branch Prediction Reminder
• Goal: Keep powerful compute engine fed • Options:
– Stall pipeline while determining branch direction/target – Predict branch direction/target and correct if wrong

• Minimize amount of time wasted correcting from incorrect branch predictions
– Performance:
– Through higher branch prediction accuracy – Through faster correction when prediction is wrong

– Power efficiency: Minimize number of speculative/incorrect micro-ops that are executed Continued focus on branch prediction improvements

57

L2 Branch Predictor
• Problem: Software with a large code footprint not able to fit well in existing branch predictors
– Example: Database applications

• Solution: Use multi-level branch prediction scheme • Benefits:
– Higher performance through improved branch prediction accuracy – Greater power efficiency through less mis-speculation

58

Advanced Renamed Return Stack Buffer (RSB) • Instruction Reminder
– CALL: Entry into functions – RET: Return from functions

• Classical Solution
– Return Stack Buffer (RSB) used to predict RET – RSB can be corrupted by speculative path

• The Renamed RSB
– No RET mispredicts in the common case

59

Execution Engine
• Responsible for:
– Scheduling operations – Executing operations

• Powerful Intel® Core™2 microarchitecture execution engine
– Dynamic 4-wide Execution – Intel® Advanced Digital Media Boost
– 128-bit wide SSE

– Super Shuffler (45nm next generation Intel® Core™ microarchitecture (Penryn))

60

Intel® Smart Cache – Core Caches
• New 3-level Cache Hierarchy • 1st level caches
– 32kB Instruction cache – 32kB, 8-way Data Cache
– Support more L1 misses in parallel than Intel® Core™2 microarchitecture

Core
32kB L1 32kB L1 Data Cache Inst. Cache

• 2nd level Cache
– New cache introduced in Intel® Core™ microarchitecture (Nehalem) – Unified (holds code and data) – 256 kB per core (8-way) – Performance: Very low latency
– 10 cycle load-to-use

256kB L2 Cache

– Scalability: As core count increases, reduce pressure on shared cache

61

New TLB Hierarchy
• Problem: Applications continue to grow in data size • Need to increase TLB size to keep the pace for performance • Nehalem adds new low-latency unified 2nd level TLB
# of Entries 1st Level Instruction TLBs Small Page (4k) Large Page (2M/4M) 1st Level Data TLBs Small Page (4k) Large Page (2M/4M) Small Page Only 64 32 512 128 7 per thread

New 2nd Level Unified TLB

62

Fast Unaligned Cache Accesses
• Two flavors of 16-byte SSE loads/stores exist
– Aligned (MOVAPS/D, MOVDQA) -- Must be aligned on a 16-byte boundary – Unaligned (MOVUPS/D, MOVDQU) -- No alignment requirement

• Prior to Intel® Core™ microarchitecture (Nehalem)
– Optimized for Aligned instructions – Unaligned instructions slower, lower throughput -- Even for aligned accesses!
– Required multiple uops (not energy efficient)

– Compilers would largely avoid unaligned load
– 2-instruction sequence (MOVSD+MOVHPD) was faster

• Intel Core microarchitecture (Nehalem) optimizes Unaligned instructions
– Same speed/throughput as Aligned instructions on aligned accesses – Optimizations for making accesses that cross 64-byte boundaries fast
– Lower latency/higher throughput than Core 2

– Aligned instructions remain fast

• No reason to use aligned instructions on Intel Core microarchitecture (Nehalem)! • Benefits:
– Compiler can now use unaligned instructions without fear – Higher performance on key media algorithms – More energy efficient than prior implementations

63

Faster Synchronization Primitives
• Multi-threaded software becoming more prevalent • Scalability of multi-thread applications can be limited by synchronization • Synchronization primitives: LOCK prefix, XCHG • Reduce synchronization latency for legacy software
LOCK CMPXCHG Perform ance
1

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0
Pentium 4 Core 2 Nehalem

Greater thread scalability with Nehalem
Intel® Pentium® 4 processor Intel® Core™2 Duo processor Intel® Core™ microarchitecture (Nehalem)-based processor
1

64

Relative Latency

Intel® Core™ Microarchitecture (Nehalem) SMT Implementation Details
Policy Description Intel® Core™ Microarchitecture (Nehalem) Examples
Register State Renamed RSB Large Page ITLB Load Buffer Store Buffer Reorder Buffer Small Page ITLB Reservation Station Caches Data TLB 2nd level TLB Execution units

Replicated Partitioned

Duplicate logic per thread Statically allocated between threads Depends on thread’s dynamic behavior No SMT impact

Competitively Shared

Unaware

SMT efficient due to minimal replication of logic
65

Feeding the Execution Engine
• Powerful 4-wide dynamic execution engine • Need to keep providing fuel to the execution engine • Intel® Core™ Microarchitecture (Nehalem) Goals
– Low latency to retrieve data
– Keep execution engine fed w/o stalling

– High data bandwidth
– Handle requests from multiple cores/threads seamlessly

– Scalability
– Design for increasing core counts

• Combination of great cache hierarchy and new platform

Intel® Core™ microarchitecture (Nehalem) designed to feed the execution engine

66

Inclusive vs. Exclusive Caches – Cache Miss
Exclusive Inclusive

L3 Cache

L3 Cache

Core 0

Core 1

Core 2

Core 3

Core 0

Core 1

Core 2

Core 3

Data request from Core 0 misses Core 0’s L1 and L2 Request sent to the L3 cache

67

Inclusive vs. Exclusive Caches – Cache Miss
Exclusive Inclusive

L3 Cache
MISS!

L3 Cache
MISS!

Core 0

Core 1

Core 2

Core 3

Core 0

Core 1

Core 2

Core 3

Core 0 looks up the L3 Cache Data not in the L3 Cache

68

Inclusive vs. Exclusive Caches – Cache Miss
Exclusive Inclusive

L3 Cache
MISS!

L3 Cache
MISS!

Core 0

Core 1

Core 2

Core 3

Core 0

Core 1

Core 2

Core 3

Must check other cores

Guaranteed data is not on-die

Greater scalability from inclusive approach

69

Inclusive vs. Exclusive Caches – Cache Hit
Exclusive Inclusive

L3 Cache
HIT!

L3 Cache
HIT!

Core 0

Core 1

Core 2

Core 3

Core 0

Core 1

Core 2

Core 3

No need to check other cores

Data could be in another core BUT Intel® CoreTM microarchitecture (Nehalem) is smart…

70

Inclusive vs. Exclusive Caches – Cache Hit
Inclusive • Maintain a set of “core valid” bits per cache line in the L3 cache • Each bit represents a core • If the L1/L2 of a core may contain the cache line, then core valid bit is set to “1” • No snoops of cores are needed if no bits are set • If more than 1 bit is set, line cannot be in Modified state in any core Core 0

L3 Cache
HIT! 0 0 0 0

Core 1

Core 2

Core 3

Core valid bits limit unnecessary snoops

71

Inclusive vs. Exclusive Caches – Read from other core
Exclusive Inclusive

L3 Cache
MISS!

L3 Cache
HIT! 0 0 1 0

Core 0

Core 1

Core 2

Core 3

Core 0

Core 1

Core 2

Core 3

Must check all other cores

Only need to check the core whose core valid bit is set

72

Local Memory Access
• CPU0 requests cache line X, not present in any CPU0 cache • Step 2:
– CPU0 requests data from its DRAM – CPU0 snoops CPU1 to check if data is present – DRAM returns data – CPU1 returns snoop response

• Local memory latency is the maximum latency of the two responses • Intel® Core™ microarchitecture (Nehalem) optimized to keep key latencies close to each other

Intel® DRAM CPU0 QPI CPU1 DRAM

Intel® QPI = Intel® QuickPath Interconnect

73

Remote Memory Access
• CPU0 requests cache line X, not present in any CPU0 cache
– CPU0 requests data from CPU1 – Request sent over Intel® QuickPath Interconnect (Intel® QPI) to CPU1 – CPU1’s IMC makes request to its DRAM – CPU1 snoops internal caches – Data returned to CPU0 over Intel QPI

• Remote memory latency a function of having a low latency interconnect Intel® DRAM CPU0 QPI CPU1 DRAM

74

Hardware Prefetching (HWP)
• HW Prefetching critical to hiding memory latency • Structure of HWPs similar as in Intel® Core™2 microarchitecture
– Algorithmic improvements in Intel® Core™ microarchitecture (Nehalem) for higher performance

• L1 Prefetchers
– Based on instruction history and/or load address pattern

• L2 Prefetchers
– Prefetches loads/RFOs/code fetches based on address pattern – Intel Core microarchitecture (Nehalem) changes:
– Efficient Prefetch mechanism
– Remove the need for Intel® Xeon® processors to disable HWP

– Increase prefetcher aggressiveness
– Locks on address streams quicker, adapts to change faster, issues more prefetchers more aggressively (when appropriate)

75

Today’s Platform Architecture
Front-Side Bus Evolution CPU CPU
MCH

memory

CPU CPU

memory

CPU CPU

memory

MCH

MCH

CPU CPU
ICH

CPU CPU
ICH

CPU CPU
ICH

76

Intel® QuickPath Interconnect
• Intel® Core™ microarchitecture (Nehalem) introduces new Intel® QuickPath Interconnect (Intel® QPI) • High bandwidth, low latency point to point interconnect • Up to 6.4 GT/sec initially
– 6.4 GT/sec -> 12.8 GB/sec – Bi-directional link -> 25.6 GB/sec per link – Future implementations at even higher speeds
Nehalem EP Nehalem EP

memory

CPU

IOH

CPU

memory

• Highly scalable for systems with varying # of sockets

memory

CPU
IOH

CPU

memory

Intel® CoreTM microarchitecture (Nehalem-EP) 77

Integrated Memory Controller (IMC)
• Memory controller optimized per market segment • Initial Intel® Core™ microarchitecture (Nehalem) products
– – – – – – Native DDR3 IMC Up to 3 channels per socket Massive memory bandwidth Designed for low latency Support RDIMM and UDIMM RAS Features
DDR3 Nehalem EP Nehalem EP

DDR3

• Future products
– Scalability

Tylersburg EP

– Market specific needs

– Vary # of memory channels – Increase memory speeds – Buffered and Non-Buffered solutions – Higher memory capacity – Integrated graphics

Significant performance through new IMC
Intel® Core™ microarchitecture (Nehalem-EP) Intel® Next Generation Server Processor Technology (Tylersburg-EP)

78

Memory Latency Comparison
• • • • Low memory latency critical to high performance Design integrated memory controller for low latency Need to optimize both local and remote memory latency Intel® Core™ microarchitecture (Nehalem) delivers
– Huge reduction in local memory latency – Even remote memory latency is fast

• Effective memory latency depends per application/OS

– Percentage of local vs. remote accesses – Intel Core microarchitecture (Nehalem) has lower latency regardless of mix
Relative M emory Latency Comparison
Relative Memory Latency 1.00 0.80 0.60 0.40 0.20 0.00 Harpertow n (FSB 1600) Nehalem (DDR3-1067) Local Nehalem (DDR3-1067) Remote
1

1

Next generation Quad-Core Intel® Xeon® processor (Harpertown)

Intel® CoreTM microarchitecture (Nehalem)

79

Latency of Virtualization Transitions
• Microarchitectural
– Huge latency reduction generation over generation – Nehalem continues the trend
Round Trip Virtualization Latency
100%
Relative Latency
1

80% 60% 40% 20% 0%
Merom Penryn Nehalem

• Architectural
– Virtual Processor ID (VPID) added in Intel® Core™ microarchitecture (Nehalem) – Removes need to flush TLBs on transitions

Higher Virtualization Performance Through Lower Transition Latencies
Intel® Core™ microarchitecture (formerly Merom) 45nm next generation Intel® Core™ microarchitecture (Penryn) Intel® Core™ microarchitecture (Nehalem)
1

80

Extended Page Tables (EPT) Motivation
VM1

Guest OS CR3
Guest Page Table

• A VMM needs to protect physical memory
• Multiple Guest OSs share the same physical memory • Protections are implemented through page-table virtualization

Guest page table changes cause exits into the VMM

VMM

CR3
Active Page Table

• Page table virtualization accounts for a significant portion of virtualization overheads
• VM Exits / Entries

VMM maintains the active page table, which is used by the CPU

• The goal of EPT is to reduce these overheads

81

STTNI - STring & Text New Instructions
Operates on strings of bytes or words (16b)
Equal Each Instruction True for each character in Src2 if same position in Src1 is equal Src1: Test\tday Src2: tad tseT Mask: 01101111
Bit 0

STTNI MODEL
Source2 (XMM / M128) Bit 0 Check each bit in the diagonal

Equal Any Instruction True for each character in Src2 if any character in Src1 matches Src1: Example\n Src2: atad tsT Mask: 10100000

Ranges Instruction True if a character in Src2 is in at least one of up to 8 ranges in Src1 Src1: AZ’0’9zzz Src2: taD tseT Mask: 00100001
Source1 (XMM)

Equal Ordered Instruction Finds the start of a substring (Src1) within another string (Src2) Src1: ABCA0XYZ Src2: S0BACBAB Mask: 00000010

T e s t \t d a y

t ad t s eT xxxx xxxT x x x x x xTx xxxx xTxx xxxxTxxx xxxFxxxx xxTx xxxx xTxx xxxx Fxxxxxxx 01101111

IntRes1

Projected 3.8x kernel speedup on XML parsing & 2.7x savings on instruction cycles
82

STTNI Model
EQUAL ANY
Source2 (XMM / M128) Bit 0

EQUAL EACH
Source2 (XMM / M128)

E x a m p l e \n

a F F T F F F F F

t ad FFF FFF FTF FFF FFF FFF FFF FFF

F F F F F F F F

t F F F F F F F F

s F F F F F F F F

T F F F F F F F F

Bit 0 OR results down each column
Bit 0

T e s t \t d a y

10100000
RANGES
Source2 (XMM / M128) Bit 0 Source1 (XMM)

IntRes1

t ad t seT xxxxxxxT xxxxxxTx xxxxxTxx xxxxTxxx xxxFxxxx xxTxxxxx xTxxxxxx Fxxxxxxx 01101111

Bit 0

Source1 (XMM)

Source1 (XMM)

Check each bit in the diagonal

IntRes1

EQUAL ORDERED
Source2 (XMM / M128)

t aD t s e T A FFTFFFTT Z F FTTF F FT ‘0’ T T T F T T T T 9 FFFTFFFF z FFFFFFFF z TTTTTTTT z FFFFFFFF z TTTTTTTT 00100001

Bit 0 •First Compare does GE, next does LE •OR those results Bit 0 Source1 (XMM)
Bit 0

S0BACBAB A B C A 0 X Y Z
fF fF F T F fF fF T F F fF fF F F T fF fF F T F fTfTfTfT x fTfTfT x x fTfT x x x F T F x x x x T F x x x x x F x x x x x x

Bit 0

AND the results

AND the along each results along each diagonal diagonal

•AND GE/LE pairs of results

Source1 (XMM)

IntRes1

fT x x x x x x x 00000010

IntRes1

83

ATA - Application Targeted Accelerators

CRC32
32 31 X Old CRC

POPCNT
POPCNT determines the number of nonzero bits in the source.
63 0 1 0 10 ... 0 0 1 1 Bit RAX

Accumulates a CRC32 value using the iSCSI polynomial

63 DST

0

SRC

Data 8/16/32/64 bit Σ 0 63 DST 0 32 31 New CRC =? 0 0 ZF 0 0x3 RBX

One register maintains the running CRC value as a software loop iterates over data. Fixed CRC polynomial = 11EDC6F41h Replaces complex instruction sequences for CRC in Upper layer data protocols: • iSCSI, RDMA, SCTP

POPCNT is useful for speeding up fast matching in data mining workloads including: • DNA/Genome Matching • Voice Recognition ZFlag set if result is zero. All other flags (C,S,O,A,P) reset

Enables enterprise class data assurance with high data rates in networked storage in any user environment.
84

Tools Support of New Instructions
• Intel® Compiler 10.x supports the new instructions
– – – – Nehalem specific compiler optimizations SSE4.2 supported via vectorization and intrinsics Inline assembly supported on both IA-32 and Intel® 64 architecture targets Necessary to include required header files in order to access intrinsics

Intel® XML Software Suite
– – – High performance C++ and Java runtime libraries Version 1.0 (C++), version 1.01 (Java) available now Version 1.1 w/SSE4.2 optimizations planned for September 2008

Microsoft Visual Studio* 2008 VC++
– – – – SSE4.2 supported via intrinsics Inline assembly supported on IA-32 only Necessary to include required header files in order to access intrinsics VC++ 2008 tools masm, msdis, and debuggers recognize the new instructions

Sun Studio Express* 7/08
– – – Supports Intel® CoreTM microarchitecture (Merom), 45nm next generation Intel® Core™ microarchitecture (Penryn), Intel® CoreTM microarchitecture (Nehalem) SSE4.1, SSE4.2 through intrinsics Nehalem specific compiler optimizations

GCC* 4.3.1
– – – Support Intel Core microarchitecture (Merom), 45nm next generation Intel Core microarchitecture (Penryn), Intel Core microarchitecture (Nehalem) via –mtune=generic. Support SSE4.1 and SSE4.2 through vectorizer and intrinsics

Broad Software Support for Intel® Core™ Microarchitecture (Nehalem)
85

Software Optimization Guidelines
• Most optimizations for Intel® Core™ microarchitecture still hold • Examples of new optimization guidelines:
– 16-byte unaligned loads/stores – Enhanced macrofusion rules – NUMA optimizations

• Intel® Core™ microarchitecture (Nehalem) SW Optimization Guide will be published • Intel® Compiler will support settings for Intel Core microarchitecture (Nehalem) optimizations

86

Example Code For strlen()
string equ [esp + 4] mov ecx,string ; ecx -> string je short byte_2 test ecx,3 ; test if string is aligned on 32 bits test eax,0ff000000h je short main_loop ; is it byte 3 str_misaligned: je short byte_3 ; simple byte loop until string is aligned jmp short main_loop mov al,byte ptr [ecx] ; taken if bits 24-30 are clear and bit add ecx,1 ; 31 is set test al,al byte_3: je short byte_3 lea eax,[ecx - 1] test ecx,3 mov ecx,string jne short str_misaligned add eax,dword ptr 0 ; 5 byte nop tosub label below align eax,ecx ret align 16 ; should be redundant byte_2: main_loop: lea eax,[ecx - 2] mov eax,dword ptr [ecx] ; read 4 bytes mov ecx,string mov edx,7efefeffh sub eax,ecx add edx,eax ret xor eax,-1 byte_1: xor eax,edx lea eax,[ecx - 3] add ecx,4 mov ecx,string test eax,81010100h sub eax,ecx je short main_loop ret ; found zero byte in the loop byte_0: mov eax,[ecx - 4] lea eax,[ecx - 4] test al,al ; is it byte 0 mov ecx,string je short byte_0 sub eax,ecx test ah,ah ; is it byte 1 ret je short byte_1 strlen endp test eax,00ff0000h ; is it byte 2 end

STTNI Version
int sttni_strlen(const char * src) { char eom_vals[32] = {1, 255, 0}; __asm{ mov movdqu xor eax, src xmm2, eom_vals ecx, ecx

topofloop: add movdqu eax, ecx xmm1, OWORD PTR[eax]

pcmpistri xmm2, xmm1, imm8 jnz topofloop

endofstring: add sub ret eax, ecx eax, src

Current Code:
instructions

} Minimum of 11 instructions; Inner loop processes 4 bytes with 8 instructions }

STTNI Code: Minimum of 10 instructions; A single inner loop processes 16 bytes with only 4
87

CRC32 Preliminary Performance
CRC32 optimized Code
crc32c_sse42_optimized_version(uint32 crc, unsigned char const *p, size_t len) { // Assuming len is a multiple of 0x10 asm("pusha"); asm("mov %0, %%eax" :: "m" (crc)); asm("mov %0, %%ebx" :: "m" (p)); asm("mov %0, %%ecx" :: "m" (len)); asm("1:"); // Processing four byte at a time: Unrolled four times: asm("crc32 %eax, 0x0(%ebx)"); asm("crc32 %eax, 0x4(%ebx)"); asm("crc32 %eax, 0x8(%ebx)"); asm("crc32 %eax, 0xc(%ebx)"); asm("add $0x10, %ebx")2; asm("sub $0x10, %ecx"); asm("jecxz 2f"); asm("jmp 1b"); asm("2:"); asm("mov %%eax, %0" : "=m" (crc)); asm("popa"); return crc; }}

 Preliminary tests involved Kernel code implementing CRC algorithms commonly used by iSCSI drivers.  32-bit and 64-bit versions of the Kernel under test  32-bit version processes 4 bytes of data using 1 CRC32 instruction  64-bit version processes 8 bytes of data using 1 CRC32 instruction  Input strings of sizes 48 bytes and 4KB used for the test

32 - bit Input Data Size = 48 bytes Input Data Size = 4 KB 6.53 X

64 - bit 9.85 X

9.3 X

18.63 X

Preliminary Results show CRC32 instruction out-performing the fastest CRC32C software algorithm by a big margin
88

Idle Power Matters
• Data center operating costs1
– 41M physical servers by 2010, average utilization < 10% – $0.50 spent on power and cooling for every $1 spent on server hardware

• Regulatory requirements affect all segments
– ENERGY STAR* and related requirements

• Environmental responsibility

Idle power consumption not just mobile concern
1. IDC’s Datacenter Trends Survey, January 2007

89

CPU Core Power Consumption

• High frequency processes are leaky
– Reduced via high-K metal gate process, design technologies, manufacturing optimizations

Leakage

90

CPU Core Power Consumption

• High frequency designs require high performance global clock distribution • High frequency processes are leaky
– Reduced via high-K metal gate process, design technologies, manufacturing optimizations

Clock Distribution Leakage

91

CPU Core Power Consumption
Total Core Power Consumption

• Remaining power in logic, local clocks
– Power efficient microarchitecture, good clock gating minimize waste

Local Clocks and Logic Clock Distribution Leakage

• High frequency designs require high performance global clock distribution • High frequency processes are leaky
– Reduced via high-K metal gate process, design technologies, manufacturing optimizations

Challenge – Minimize power when idle

92

C-State Support Before Intel® Core™ Microarchitecture (Nehalem)
• C0: CPU active state
Active Core Power

Local Clocks and Logic Clock Distribution Leakage

93

C-State Support Before Intel® Core™ Microarchitecture (Nehalem)
• C0: CPU active state • C1, C2 states (early 1990s):
• Stop core pipeline • Stop most core clocks Active Core Power

Local Clocks and Logic Clock Distribution Leakage

94

C-State Support Before Intel® Core™ Microarchitecture (Nehalem)
• C0: CPU active state • C1, C2 states (early 1990s):
• Stop core pipeline • Stop most core clocks Active Core Power

• C3 state (mid 1990s):
• Stop remaining core clocks Clock Distribution Leakage

95

C-State Support Before Intel® Core™ Microarchitecture (Nehalem)
• C0: CPU active state • C1, C2 states (early 1990s):
• Stop core pipeline • Stop most core clocks Active Core Power

• C3 state (mid 1990s):
• Stop remaining core clocks

• C4, C5, C6 states (mid 2000s):
• Drop core voltage, reducing leakage • Voltage reduction via shared VR Leakage

Existing C-states significantly reduce idle power

96

C-State Support Before Intel® Core™ Microarchitecture (Nehalem)
• Cores share a single voltage plane
– All cores must be idle before voltage reduced – Independent VR’s per core prohibitive from cost and form factor perspective

• Deepest C-states have relatively long exit latencies
– System / VR handshake, ramp voltage, restore state, restart pipeline, etc.

Deepest C-states available in mobile products

97

Intel® Core™ Microarchitecture (Nehalem) Core C-State Support
• C0: CPU active state
Active Core Power

Local Clocks and Logic Clock Distribution Leakage

98

Intel® Core™ Microarchitecture (Nehalem) Core C-State Support
• C0: CPU active state • C1 state:
• Stop core pipeline • Stop most core clocks Active Core Power

Local Clocks and Logic Clock Distribution Leakage

99

Intel® Core™ Microarchitecture (Nehalem) Core C-State Support
• C0: CPU active state • C1 state:
• Stop core pipeline • Stop most core clocks Active Core Power

• C3 state:
• Stop remaining core clocks Clock Distribution Leakage

100

Intel® Core™ Microarchitecture (Nehalem) Core C-State Support
• C0: CPU active state • C1 state:
• Stop core pipeline • Stop most core clocks Active Core Power

• C3 state:
• Stop remaining core clocks

• C6 state:
• Processor saves architectural state • Turn off power gate, eliminating leakage Leakage

Core idle power goes to ~0

101

C6 Support on Intel® Core™2 Duo Mobile Processor (Penryn)

102

C6 Support on Intel® Core™2 Duo Mobile Processor (Penryn)
Core Power
Cores running applications.

Core 1
0

Core 0
0

Time

103

C6 Support on Intel® Core™2 Duo Mobile Processor (Penryn)
Core Power
Task completes. No work waiting. OS executes MWAIT(C6) instruction.

Core 1
0

Core 0
0

Time

104

C6 Support on Intel® Core™2 Duo Mobile Processor (Penryn)
Core Power
Execution stops. Core architectural state saved. Core clocks stopped. Core 0 continues execution undisturbed.

Core 1

0

Core 0
0

Time

105

C6 Support on Intel® Core™2 Duo Mobile Processor (Penryn)
Core Power

Core 1
0 Task completes. No work waiting. OS executes MWAIT(C6) instruction. Core enters C6.

Core 0

0

Time

106

C6 Support on Intel® Core™2 Duo Mobile Processor (Penryn)
Core Power

Core 1
0 VR voltage reduced. Power drops.

Core 0
0

Time

107

C6 Support on Intel® Core™2 Duo Mobile Processor (Penryn)
Core Power
Interrupt for Core 1 arrives. VR voltage increased. Core 1 clocks turn on, core state restored, and core resumes execution at instruction following MWAIT(C6). Cores 0 remains idle.

Core 1

0

Core 0
0

Time

108

C6 Support on Intel® Core™2 Duo Mobile Processor (Penryn)
Core Power

0

Interrupt for Core 0 arrives. Core 0 returns to C0 and resumes execution at instruction following MWAIT(C6). Core 1 continues execution undisturbed.

Core 1

Core 0
0

C6 significantly reducesTime power consumption idle

109

Reducing Platform Idle Power
• Dramatic improvements in CPU idle power increase importance of platform improvements • Memory power:
– Memory clocks stopped between requests at low utilization – Memory to self refresh in package C3, C6

• Link power:
– Intel® QuickPath Interconnect links to lower power states as CPU becomes less active – PCI Express* links on chipset have similar behavior

• Hint to VR to reduce phases during periods of low current demand Intel® Core™ microarchitecture (Nehalem) reduces CPU and platform power
110

Intel® Core™ Microarchitecture (Nehalem): Integrated Power Gate
VCC

• Integrated power switch between VR output and core voltage supply
– – – Very low on-resistance Very high off-resistance Much faster voltage ramp than external VR Individual cores transition to ~0 power state Transparent to other cores, platform, software, and VR
Core0 Core1 Core2 Core3

Memory System, Cache, I/O

VTT

Enables per core C6 state
– –

Close collaboration with process technology to optimize device characteristics
111

111

Agenda
• Intel® Core™ microarchitecture (Nehalem) power management overview • Minimizing idle power consumption • Performance when you need it

112

Turbo Mode Before Intel® Core™ Microarchitecture (Nehalem)
Clock Stopped Power reduction in inactive cores

No Turbo
Frequency (F)

Frequency (F)

Core 0 Core 1

Workload Lightly Threaded

113

113

Core 0 Core 1

Turbo Mode Before Intel® Core™ Microarchitecture (Nehalem)
Clock Stopped Power reduction in inactive cores Turbo Mode In response to workload adds additional performance bins within headroom

No Turbo

Frequency (F)

Frequency (F)

Core 0 Core 1

Workload Lightly Threaded

114

114

Core 0

Intel® Core™ Microarchitecture (Nehalem) Turbo Mode
Power Gating Zero power for inactive cores

No Turbo
Frequency (F)

Frequency (F)

Core 0 Core 1 Core 2 Core 3

Workload Lightly Threaded or < TDP

115

115

Core 0 Core 1 Core 2 Core 3

Intel® Core™ Microarchitecture (Nehalem) Turbo Mode
Power Gating Zero power for inactive cores Turbo Mode In response to workload adds additional performance bins within headroom

No Turbo

Frequency (F)

Frequency (F)

Core 0 Core 1 Core 2 Core 3

Workload Lightly Threaded or < TDP

116

116

Core 0 Core 1

Intel® Core™ Microarchitecture (Nehalem) Turbo Mode
Power Gating Zero power for inactive cores Turbo Mode In response to workload adds additional performance bins within headroom

No Turbo

Frequency (F)

Frequency (F)

Core 0 Core 1 Core 2 Core 3

Workload Lightly Threaded or < TDP

117

117

Core 0 Core 1

Intel® Core™ Microarchitecture (Nehalem) Turbo Mode
Active cores running workloads < TDP

No Turbo
Frequency (F)

Frequency (F)

Core 0 Core 1 Core 2 Core 3

Workload Lightly Threaded or < TDP

118

118

Core 0 Core 1 Core 2 Core 3

Intel® Core™ Microarchitecture (Nehalem) Turbo Mode
Active cores running workloads < TDP Turbo Mode In response to workload adds additional performance bins within headroom

No Turbo

Frequency (F)

Frequency (F)

Core 0 Core 1 Core 2 Core 3

Workload Lightly Threaded or < TDP

119

119

Core 0 Core 1 Core 2 Core 3

Intel® Core™ Microarchitecture (Nehalem) Turbo Mode
Power Gating Zero power for inactive cores Turbo Mode In response to workload adds additional performance bins within headroom

No Turbo

Frequency (F)

Frequency (F)

Core 0 Core 1 Core 2 Core 3

Workload Lightly Threaded or < TDP

Dynamically Delivering Optimal Performance and Energy Efficiency
120

120

Core 0 Core 1 Core 2 Core 3

Additional Sources of Information on This Topic:
• Other Sessions / Chalk Talks / Labs:
– TCHS001: Next Generation Intel® Core™ Microarchitecture (Nehalem) Family of Processors: Screaming Performance, Efficient Power (8/19, 3:00 – 3:50) – DPTS001: High End Desktop Platform Design Overview for the Next Generation Intel® Microarchitecture (Nehalem) Processor (8/20, 2:40 – 3:30) – NGMS001: Next Generation Intel® Microarchitecture (Nehalem) Family: Architectural Insights and Power Management (8/19, 4:00 – 5:50) – NGMC001: Chalk Talk: Next Generation Intel® Microarchitecture (Nehalem) Family (8/19, 5:50 – 6:30) – NGMS002: Tuning Your Software for the Next Generation Intel® Microarchitecture (Nehalem) Family (8/20, 11:10 – 12:00) – PWRS003: Power Managing the Virtual Data Center with Windows Server* 2008 / Hyper-V and Next Generation Processor-based Intel® Servers Featuring Intel® Dynamic Power Technology (8/19, 3:00 – 3:50) – PWRS005: Platform Power Management Options for Intel® Next Generation Server Processor Technology (Tylersburg-EP) (8/21, 1:40 – 2:30) – SVRS002: Overview of the Intel® QuickPath Interconnect (8/21, 11:10 – 12:00)

121

Session Presentations - PDFs
The PDF for this Session presentation is available from our IDF Content Catalog at the end of the day at:
www.intel.com/idf
or

https://intel.wingateweb.com/US08/scheduler/public.jsp

122

Please Fill out the Session Evaluation Form
Place form in evaluation box at the back of session room
Thank you for your input, we use it to improve future Intel Developer Forum events

123

Sign up to vote on this title
UsefulNot useful