ARCHITECTURE OF PARALLEL SYSTEMS
SUPPORT:
Prepared within the framework of the project TÁMOP-4.1.2-08/2/A/KMR-2009-0053,
“Proactive IT module development (PRIM1): IT Service Management module and
Multithreaded processors and their programming module”
KEYWORDS:
multicore processors, manycore processors, homogeneous multicore processors,
heterogeneous multicore processors, master-slave type heterogeneous multicore processors,
add-on (coupled) type heterogeneous multicore processors, Core
2/Penryn/Nehalem/Nehalem-EX/Westmere/Westmere-EX/Sandy Bridge based Intel
architectures, consumer (private consumer) and enterprise oriented platforms, Intel’s vPro
platform, general purpose GPUs (GPGPUs), data parallel accelerators (DPAs), integrated
CPU/GPU architectures
SUMMARY:
In this course students receive an overview of the rapid development of processor
architectures in recent years. They become acquainted with the necessity of the emergence of
multicore processors, with the main classes of multicore/manycore processors – namely the
homogeneous and the heterogeneous multicore processors – as well as with their subclasses
and representative implementations.
The main families of Intel’s multicore processors and their main characteristics are presented,
namely the Core 2, Penryn, Nehalem, Nehalem-EX, Westmere, Westmere-EX and
Sandy Bridge based architectures. In the lectures students become acquainted with
multicore desktop computer platforms, in particular with the consumer and the enterprise
oriented (vPro) platforms and their specific features. Understanding of the material is aided
by the presentation of a large number of concrete implementations. The lectures then discuss
general purpose GPUs (GPGPUs) and data parallel accelerators (DPAs), which are spreading
ever more widely for computation-intensive applications. Finally, the architectures of the
representative Nvidia and AMD/ATI GPGPU families are presented, as well as the integrated
CPU/GPU architectures that appeared in the most recent phase of processor evolution,
together with their representative implementations.
Table of Contents
• Multicore-Manycore Processors
• Evolution of Intel’s Basic Microarchitectures
• Intel’s Desktop Platforms
• GPGPUs/DPAs Overview
• GPGPUs/DPAs 5.1
• GPGPUs/DPAs 5.2
• Integrated CPUs/GPUs
• References to all four sections of GPGPUs/DPAs
Dezső Sima
• 2. Homogeneous multicores
• 3. Heterogeneous multicores
• 4. Outlook
• 5. References
[Figure: Absolute integer performance (SPECint92) of Intel processors vs. year, 1979–2005, from the 8088/5 through the Pentium II/III up to the Pentium 4 Prescott (1M/2M); performance levels off in the last years shown]
Performance (Pa):
Pa = fC × IPC, i.e. performance = clock frequency × efficiency (efficiency = Pa/fC)
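To make the decomposition used in the plots explicit, the relation can be written out as follows (standard definitions; the numeric example is illustrative only):

```latex
% Absolute performance decomposed into clock frequency and efficiency (IPC)
P_a = f_C \cdot \mathrm{IPC}
\qquad\Longleftrightarrow\qquad
\mathrm{IPC} = \frac{P_a}{f_C}
% Example: f_C = 3.2\,\mathrm{GHz},\ \mathrm{IPC} = 1.5
% \Rightarrow P_a = 3.2\cdot 10^{9} \cdot 1.5 = 4.8\cdot 10^{9}\ \text{instructions/s}
```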
[Figure: Efficiency (SPECint_base2000/fC) of Intel processors vs. year, 1978–2002, from the 286 through the 386DX, 486DX and Pentium to the Pentium Pro/II/III (2nd generation superscalars); efficiency grew ~10×/10 years and then levelled off; eras marked: 1st gen. pipeline, 2nd gen. superscalar]
Figure 1.7: The actual rise of IC complexity in DRAMs and microprocessors [39]
IC fab technology (linear shrink of ~0.7×/2 years) → transistor counts double roughly every
two years (Moore’s law).
Key question: what is the best use of the ever increasing transistor count?
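As a simple illustration of the doubling rule stated above, the sketch below projects transistor counts from an assumed starting point (the ~42 million transistors of Willamette in 2000, taken from the roadmap data later in this material); it is only a back-of-the-envelope model, not a statement about any concrete product.

```c
#include <stdio.h>
#include <math.h>

/* Project transistor counts assuming a doubling every two years (Moore's law). */
int main(void)
{
    const double start_year  = 2000.0;   /* Willamette introduction            */
    const double start_count = 42e6;     /* ~42 million transistors (42 mtrs)  */

    for (int year = 2000; year <= 2010; year += 2) {
        double doublings = (year - start_year) / 2.0;     /* one doubling / 2 years */
        double count     = start_count * pow(2.0, doublings);
        printf("%d: ~%.0f million transistors\n", year, count / 1e6);
    }
    return 0;
}
```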
[Classification: Multicore processors → homogeneous multicores | heterogeneous multicores (diagram labels: MPC, CPU, GPU)]
2.1.1.1 Introduction
Servers
[Roadmap: Nehalem-EP (45 nm) and Westmere-EP (32 nm); Nehalem-EX (45 nm, 3/2010, Xeon 7500 ‘Beckton’: 1 × 8 C, ¼ MB L2/core, 24 MB L3) – MP server platforms]
Overview
Remark
For presenting a more complete view of the evolution of multicore MP server platforms,
we also include the single core (SC) 90 nm Pentium 4 Prescott based Xeon MP (Potomac)
processor, which was the first 64-bit MP server processor and gave rise to the Truland platform.
[Roadmap: MP server platforms based on Pentium 4 (90 nm and 65 nm), Core 2 and Penryn processors; ICH5; 4/2003]
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (3)
[Roadmap figure: the Xeon MP line from Foster-MP (3/02, 0.18 µ/42 mtrs) through Gallatin (11/02 and 3/04, 0.13 µ) to Potomac (2Q/05, 0.09 µ), with clock rates (1.4 GHz up to 3.8 GHz), on-die caches (256 KB L2 up to 1 MB L2 + 1 MB L3), FSB speeds (400–800 MHz) and sockets (µPGA 603/604); one planned 3.8 GHz step was cancelled 5/04]
[Roadmap figure: the desktop Pentium 4 line from Willamette (11/00, 0.18 µ/42 mtrs, 256 KB L2, 400 MHz FSB, µPGA 423) through Northwood-A/B/C (0.13 µ/55 mtrs, 512 KB L2, 400/533/800 MHz FSB, µPGA 478) and Prescott/Prescott-F (0.09 µ/125 mtrs, 1 MB L2, 800 MHz FSB, µPGA 478/LGA 775) to Tejas (4.0/4.2 GHz, cancelled 5/04); legend: cores supporting hyperthreading / cores with EM64T implemented but not enabled / cores supporting EM64T]
Figure 2.2: The Potomac processor as Intel’s first 64-bit Xeon MP processor, based on the third core (Prescott core) of the Pentium 4 family of processors
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (4)
[Platform diagram annotations:
XMB: eXternal Memory Bridge – provides a serial link with 5.33 GB/s inbound and 2.67 GB/s outbound bandwidth (simultaneously)
The 8500¹/8501 MCH connects over the FSB to the processors, over four XMBs to DDR-266/333 / DDR2-400 memory, and over HI 1.5 to the ICH5
HI 1.5 (Hub Interface 1.5): 8 bits wide, 66 MHz clock, QDR, 266 MB/s peak transfer rate]
¹ The E8500 MCH supports an FSB of 667 MT/s and consequently supports only the SC Xeon MP (Potomac) processor.
Example 1: Block diagram of an E8500 chipset based Truland MP server board [2]
Figure 2.4: Block diagram of an E8500 chipset based Truland MP server board [2]
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (7)
Example 2: Block diagram of the E8501 based Truland MP server platform [3]
[Platform diagram annotations:
Processors: Xeon DC MP 7000 (4/2005) or later DC/QC MP 7000 processors
IMI (Independent Memory Interface): serial link with 5.33 GB/s inbound and 2.67 GB/s outbound bandwidth (simultaneously)
XMB: intelligent memory controller, dual memory channels, DDR 266/333/400, 4 DIMMs/channel]
Figure 2.5: Intel’s E8501 chipset based Truland MP server platform (4/2006) [3]
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (8)
[Server board annotations: 2 + 2 XMBs, DDR2 DIMMs (up to 64 GB), Xeon DC 7000/7100 processors, E8501 NB, ICH5R SB]
Figure 2.6: Intel E8501 chipset based MP server board (Supermicro X6QT8) for the Xeon 7000/7100 DC MP processor families [4]
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (9)
Figure 2.7: Bandwidth bottlenecks in Intel’s 8501 based Truland MP server platform [5]
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (10)
Remark
Previous (first generation) MP servers made use of a symmetric topology including only a
single FSB that connects all 4 single core processors to the MCH (north bridge), as shown
below.
FSB
Preceding NBs
Figure 2.8: Previous Pentium 4 MP based MP server platform (for single core processors)
[Comparison diagram: a preceding MP platform (single FSB, preceding NB with e.g. DDR-200/266, preceding ICH over HI 1.5) vs. the Truland platform (two FSBs, 8500¹/8501 MCH, four XMBs with DDR-266/333 / DDR2-400, ICH5 over HI 1.5 at 266 MB/s)]
MP platforms: Caneland (9/2007, 9/2008)
[Platform diagram: 7300 (Clarksboro) MCH with 4 × FSB at 1066 MT/s; 4 FBDIMM memory channels (DDR2-533/667, 8 DIMMs/channel), up to 512 GB; ESI link to the south bridge]
[Roadmap: Pentium 4 based, Core 2 based (65 nm) and Penryn based (45 nm) MP processors; 631xESB/632xESB ICH; 5/2006]
2.1.1.3 The Core 2 based Caneland MP server platform (3)
Xeon MP (Potomac) 1C → Xeon 7000 (Paxville MP) 2×1C → Xeon 7100 (Tulsa) 2C → Xeon 7200 (Tigerton DC) 1×2C → Xeon 7300 (Tigerton QC) 2×2C → Xeon 7400 (Dunnington) 6C
[Comparison diagram: the Truland platform (8500¹/8501 MCH, two FSBs, four XMBs with DDR-266/333 / DDR2-400, ICH5 over HI 1.5) vs. the Caneland platform (7300 MCH, four FSBs, FBDIMM channels with DDR2-533/667, up to 8 DIMMs/channel, 631xESB/632xESB over ESI)]
Example 1: Intel’s Nehalem-EP based Tylersburg-EP DP server platform with a single IOH
[Board annotations: Xeon 7200 (Tigerton DC, Core 2, 2C) and Xeon 7300 (Tigerton QC, Core 2, QC) processors; FB-DIMM memory, 4 channels, 8 DIMMs/channel, up to 512 GB (192 GB DDR2 on the board shown); SBE2 SB]
Figure 2.11: Caneland MP Supermicro serverboard, with the 7300 (Clarksboro) chipset
for the Xeon 7200/7300 DC/QC MP processor families [4]
2.1.1.3 The Core 2 based Caneland MP server platform (6)
Figure 2.12: Performance comparison of the Caneland platform with a quad core Xeon (7300 family)
vs the Bensley platform with a dual core Xeon 7140M [8]
2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (1)
MP platforms: Boxboro-EX
[Platform diagram: Nehalem-EX (Beckton, 8C, 45 nm, 3/2010) and Westmere-EX (10C, 32 nm, 4/2011) processors; 7500 (Boxboro) IOH with 2 QPI links and 32 PCIe 2nd gen. lanes (0.5 GB/s/lane/direction); ESI link (1 GB/s/direction) to the ICH10 (6/2008)]
2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (3)
Figure 2.13: The 8-core Nehalem-EX (Xeon 7500, Beckton) MP server processor [9]
2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (4)
[Platform annotations: ESI link to the ICH10; SMI: serial link between the processor and the SMB; SMB: Scalable Memory Buffer (parallel/serial conversion); ME: Management Engine]
Wide range of scalability of the 7500/6500 IOH based Boxboro-EX platform [12]
[Recap diagram: the preceding single-FSB MP platform (preceding NB with e.g. DDR-200/266, preceding ICH over HI 1.5) vs. the Truland MP platform (two FSBs, 8500¹/8501 MCH, four XMBs with DDR-266/333 / DDR2-400, ICH5 over HI 1.5 at 266 MB/s)]
Evolution from the 90 nm Pentium 4 Prescott MP based Truland MP platform (up to 2 cores) to the
Core 2 based Caneland MP platform (up to 6 cores)
Xeon MP (Potomac) 1C → Xeon 7000 (Paxville MP) 2×1C → Xeon 7100 (Tulsa) 2C → Xeon 7200 (Tigerton DC) 1×2C → Xeon 7300 (Tigerton QC) 2×2C → Xeon 7400 (Dunnington) 6C
[Comparison diagram: the Truland platform (8500¹/8501 MCH, two FSBs, four XMBs with DDR-266/333 / DDR2-400, ICH5 over HI 1.5) vs. the Caneland platform (7300 MCH, four FSBs, FBDIMM channels with DDR2-533/667, up to 8 DIMMs/channel, 631xESB/632xESB over ESI)]
¹ The E8500 MCH supports an FSB of 667 MT/s and consequently supports only the SC Xeon MP (Potomac) processor.
2.1.1.5 Evolution of MP server platforms (4)
[Platform annotations: ESI link to the ICH10; SMI: serial link between the processor and the SMBs; SMB: Scalable Memory Buffer (parallel/serial converter); ME: Management Engine]
2.2.1 Larrabee
• Objectives:
High end graphics processing, HPC
Not a single product but a base architecture for a number of different products.
• Brief history:
Project started ~ 2005
First unofficial public presentation: 03/2006 (withdrawn)
First official public presentation: 08/2008 (SIGGRAPH)
Due in ~ 2009
• Performance (targeted):
2 TFlops
Basic architecture
Figure 2.16: Four socket MP server design with 24-core Larrabees connected by the CSI bus [41]
2.2.2 Intel’s Tile processor
Bisection bandwidth:
If the network is segmented into two equal parts,
this is the bandwidth between the two parts
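A minimal sketch of how this definition can be applied to a 2D mesh network such as the Tile processor’s on-die interconnect; the mesh dimensions and per-link bandwidth below are illustrative assumptions, not figures quoted from the slides.

```c
#include <stdio.h>

/* Bisection bandwidth of an R x C 2D mesh cut into two equal halves:
 * cutting across the longer dimension severs min(R, C) links
 * (one per row or column); each severed link contributes its bandwidth. */
static double mesh_bisection_bw(int rows, int cols, double link_bw_gbps)
{
    int cut_links = (rows < cols) ? rows : cols;   /* links crossing the cut */
    return cut_links * link_bw_gbps;
}

int main(void)
{
    /* Illustrative example: 8 x 10 mesh (80 tiles), 100 GB/s per link (assumed). */
    printf("Bisection bandwidth: %.0f GB/s\n", mesh_bisection_bw(8, 10, 100.0));
    return 0;
}
```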
Figure 2.18: Die photo and chip details of the Tile processor [14]
FP Multiply-Accumulate (A×B+C)
Performance at 4 GHz:
Peak SP FP: up to 1.28 TFlops (2 FPMA × 2 instr./cycle × 80 × 4 GHz = 1.28 TFlops)
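The peak figure above is simply the product of the four factors quoted in the formula; a quick arithmetic check (the factor names in the comments mirror the slide’s formula):

```c
#include <stdio.h>

int main(void)
{
    const double flops_per_fpma = 2.0;   /* A x B + C counts as 2 FP operations   */
    const double fpma_per_cycle = 2.0;   /* 2 FPMA instructions issued per cycle  */
    const double tiles          = 80.0;  /* 80 tiles on the chip                  */
    const double clock_hz       = 4.0e9; /* 4 GHz                                 */

    double peak = flops_per_fpma * fpma_per_cycle * tiles * clock_hz;
    printf("Peak SP FP: %.2f TFlops\n", peak / 1e12);   /* -> 1.28 TFlops */
    return 0;
}
```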
Figure 2.24: The full instruction set of the Tile processor [14]
VLIW
Figure 2.25: Instruction word and latencies of the Tile processor [14]
Figure 2.27: Instruction word and latencies of the Tile processor [14]
Figure 2.29: Lessons learned from the Tile processor (1) [14]
Figure 2.30: Lessons learned from the Tile processor (2) [14]
• 12/2009: Announced
• 9/2010: Many-core Application Research Project (MARC) initiative started on the SCC
platform
• Designed in Braunschweig and Bangalore
• 48 core, 2D-mesh system topology, message passing
Figure 3.3: Die shot of the Cell BE (221mm2, 234 mtrs) [44]
3.1.1 The Cell Processor (4)
• Cell BE - NIK
2007: Faculty Award (Cell 3D app./Teaching)
2008: IBM – NIK Research Agreement and Cooperation: Performance investigations
• IBM Böblingen Lab
• IBM Austin Lab
The Roadrunner
• 3.2.1 GPGPUs
Based on their FP32 computing capability and the large number of FP units available,
such graphics processors are also termed
GPGPUs (General Purpose GPUs)
or
cGPUs (computational GPUs).
Peak FP32/FP64 performance of Nvidia’s GPUs vs Intel’s P4 and Core 2 processors [17]
Evolution of the bandwidth of Nvidia’s GPUs vs Intel’s P4 and Core 2 processors [20]
Figure 3.14: Contrasting the utilization of the silicon area in CPUs and GPUs [21]
• Less area for control since GPGPUs have simplified control (same instruction for
all ALUs)
• Less area for caches since GPGPUs support massive multithreading to hide the
latency of long operations, such as memory accesses in case of cache misses.
[Evolution diagram of GPGPUs:
Nvidia: G80 (90 nm) → shrink → G92 (65 nm) → enhanced arch. → G200 (65 nm) → enhanced arch./shrink → GF100 (Fermi, 40 nm)
AMD/ATI: R600 (80 nm) → shrink → RV670 (55 nm) → enhanced arch. → RV770 (55 nm) → enhanced arch./shrink → RV870 (40 nm) → enhanced arch. → Cayman (40 nm)]
[Timeline figure: Nvidia releases 11/06, 10/07, 6/08; further releases (incl. the OpenCL standard) 6/07, 11/07, 6/08, 11/08; AMD/ATI releases 11/05, 5/07, 11/07, 5/08]
[Overview (continued):
Nvidia cores: GF100 (Fermi, 40 nm/3000 mtrs, 3/10), GF104 (Fermi, 40 nm/1950 mtrs, 07/10), GF110 (Fermi, 40 nm/3000 mtrs, 11/10)
Cards: GTX 470 (448 ALUs, 320-bit), GTX 480 (480 ALUs, 384-bit), GTX 460 (336 ALUs, 192/256-bit), GTX 580 (512 ALUs, 384-bit), GTX 560 Ti (1/11)
CUDA versions: 2.2, 2.3, 3.0, 3.1, 3.2, 4.0 (Beta)
AMD/ATI releases: 9/09, 10/10, 12/10]
Figure 3.18: Overview of GPGPUs and their basic software support (2)
3.2.1.3 Example 1: Nvidia’s Fermi GF 100 (1)
Announced: 30. Sept. 2009 at NVidia’s GPU Technology Conference, available: 1Q 2010 [22]
Sub-families of Fermi
Fermi includes three sub-families with the following representative cores and features:
[Table excerpt (sub-family row): GF110 – 11/2010, 16 SMs, 512 ALUs, 3000 mtrs, gen. 2.0, general purpose]
1 In the associated flagship card (GTX 480) however, one of the SMs has been disabled, due to overheating
problems, so it has actually only 15 SIMD cores, called Streaming Multiprocessors (SMs) by Nvidia and 480
FP32 EUs [69]
NVidia: 16 cores (Streaming Multiprocessors, SMs)
Remark
In the associated flagship card (GTX 480), however, one SM has been disabled due to
overheating problems, so it actually has 15 SMs and 480 ALUs [a]
Fermi GF100
Note
The high level microarchitecture of Fermi evolved from a graphics oriented structure
to a computation oriented one, complemented with the units needed for graphics processing.
SFU: Special Function Unit
1 SM includes 32 ALUs (called “CUDA cores” by NVidia)
SP FP: 32-bit
Remark
The Fermi line supports the Fused Multiply-Add (FMA) operation, rather than the Multiply-Add
operation performed in previous generations.
Previous lines
Fermi
Figure 3.21: Contrasting the Multiply-Add (MAD) and the Fused Multiply-Add (FMA) operations [27]
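The practical difference is that FMA computes a×b+c with a single final rounding, whereas MAD rounds the product before the addition. Below is a small host-side C sketch using the standard fmaf() from <math.h>, purely to illustrate this rounding behaviour (it is not Nvidia’s hardware path; the input values are chosen so the two results differ):

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    float a = 1.0f + ldexpf(1.0f, -12);     /* 1 + 2^-12                         */
    float b = a;
    float c = -(1.0f + ldexpf(1.0f, -11));  /* -(1 + 2^-11)                      */

    float prod  = a * b;                    /* product rounded to float first    */
    float mad   = prod + c;                 /* MAD: two roundings -> typically 0 */
    float fused = fmaf(a, b, c);            /* FMA: single rounding -> ~2^-24    */

    printf("MAD : %.10e\n", mad);
    printf("FMA : %.10e\n", fused);         /* ~5.96e-08 */
    return 0;
}
```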
Tesla cards
[Table: Flagship Tesla cards – C1060: peak FP64 perf./card = 30 × 1 × 2 × 1296 MHz ≈ 77.8 GFLOPS; C2070: 14 × 16 × 2 × 1150 MHz ≈ 515.2 GFLOPS]
¹ In their GPGPU Fermi cards Nvidia activates only 4 FP64 units from the available 16
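The two peak values in the table row are just products of SM count, FP64 units per SM, FLOPs per unit per cycle and shader clock; a small check of that arithmetic (the factor labels follow the table’s formula):

```c
#include <stdio.h>

/* Peak FP64 throughput = SMs x FP64 units/SM x FLOPs/unit/cycle x clock */
static double peak_fp64_gflops(int sms, int fp64_units, double clock_mhz)
{
    return sms * fp64_units * 2.0 * clock_mhz / 1e3;   /* 2 FLOPs per FMA/MAD */
}

int main(void)
{
    printf("C1060: %.1f GFLOPS\n", peak_fp64_gflops(30,  1, 1296.0));  /* ~77.8  */
    printf("C2070: %.1f GFLOPS\n", peak_fp64_gflops(14, 16, 1150.0));  /* ~515.2 */
    return 0;
}
```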
3.2.1.4 Example 2: Intel’s on-die integrated CPU/GPUs (1)
Integration to the chip
[Spec table for five Sandy Bridge desktop models (model columns not recovered):
DDR3: 1333 MHz for all models
L3 cache: 6 MB / 6 MB / 6 MB / 8 MB / 8 MB
Intel HD Graphics: 2000 / 2000 / 3000 / 2000 / 3000
GPU max freq: 1100 / 1100 / 1100 / 1350 / 1350 MHz
Hyper-Threading: No / No / No / Yes / Yes
Socket: LGA 1155 for all models]
[Sandy Bridge key features: Hyperthreading; 32K L1D (3 clk); AES instructions; AVX 256-bit, 4 operands; VMX Unrestricted; ~20 mm²/core]
[Comparison chart residue: AMD HD5570 (400 ALUs) vs. the integrated graphics of the i5/i7-2xxx (Sandy Bridge) and i5-6xx (Arrandale) processors]
4. Outlook
Heterogeneous multicores
• Master/slave type multicores (M(Ma) = M(CPU), M(S) = M(D))
• Add-on type multicores
Master-slave type multicores require much more intricate workflow control and
synchronization than add-on type multicores.
It can be expected that add-on type multicores will dominate the future of heterogeneous
multicores.
[1]: Gilbert J. D., Hunt S. H., Gunadi D., Srinivas G., The Tulsa Processor: A Dual Core Large
Shared-Cache Intel Xeon Processor 7000 Sequence for the MP Server Market Segment,
Aug 21 2006, http://www.hotchips.org/archives/hc18/3_Tues/HC18.S9/HC18.S9T1.pdf
[2]: Intel Server Board Set SE8500HW4, Technical Product Specification, Revision 1.0,
May 2005, ftp://download.intel.com/support/motherboards/server/sb/se8500hw4_board_
set_tpsr10.pdf
[3]: Intel® E8501 Chipset North Bridge (NB) Datasheet, May 2006,
http://www.intel.com/design/chipsets/e8501/datashts/309620.htm
[4]: Supermicro Motherboards, http://www.supermicro.com/products/motherboard/
[5]: Next-Generation AMD Opteron Processor with Direct Connect Architecture – 4P Server
Comparison, http://www.amd.com/us-en/assets/content_type/DownloadableAssets/4P_
Server_Comparison_PID_41461.pdf
[6]: Supermicro P4QH6 / P4QH8 User’s Manual, 2002,
http://www.supermicro.com/manuals/motherboard/GC-HE/MNL-0665.pdf
[7]: Intel® 7300 Chipset Memory Controller Hub (MCH) – Datasheet, Sept. 2007,
http://www.intel.com/design/chipsets/datashts/313082.htm
[8]: Quad-Core Intel® Xeon® Processor 7300 Series Product Brief, Intel, Nov. 2007
http://download.intel.com/products/processor/xeon/7300_prodbrief.pdf
[9]: Mitchell D., Intel Nehalem-EX review, PCPro,
http://www.pcpro.co.uk/reviews/processors/357709/intel-nehalem-ex
[10]: Nagaraj D., Kottapalli S.: Westmere-EX: A 20 thread server CPU, Hot Chips 2010
http://www.hotchips.org/uploads/archive22/HC22.24.610-Nagara-Intel-6-Westmere-EX.pdf
References (2)
[12]: Intel Xeon Processor 7500/6500 Series, Public Gold Presentation, March 30 2010,
http://cache-www.intel.com/cd/00/00/44/64/446456_446456.pdf
[14]: Mattson T., The Future of Many Core Computing: A tale of two processors, March 4 2010,
http://og-hpc.com/Rice2010/Slides/Mattson-OG-HPC-2010-Intel.pdf
[15]: Kirsch N., An Overview of Intel's Teraflops Research Chip, Febr. 13 2007, Legit Reviews,
http://www.legitreviews.com/article/460/1/
[18]: Chu M. M., GPU Computing: Past, Present and Future with ATI Stream Technology,
AMD, March 9 2010,
http://developer.amd.com/gpu_assets/GPU%20Computing%20-%20Past%20
Present%20and%20Future%20with%20ATI%20Stream%20Technology.pdf
[19]: Hwu W., Kirk D., Nvidia, Advanced Algorithmic Techniques for GPUs, Berkeley,
January 24-25 2011
http://iccs.lbl.gov/assets/docs/20110124/lecture1_computational_thinking_Berkeley_2011.pdf
[20]: Shrout R., Nvidia GT200 Revealed – GeForce GTX 280 and GTX 260 Review,
PC Perspective, June 16 2008,
http://www.pcper.com/article.php?aid=577&type=expert&pid=3
[21]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 2.0,
June 2008, Nvidia,
http://developer.download.nvidia.com/compute/cuda/2_0/docs/NVIDIA_CUDA_Programming
Guide_2.0.pdf
[22]: Next Gen CUDA GPU Architecture, Code-Named “Fermi”, Press Presentation at Nvidia’s
2009 GPU Technology Conference, (GTC), Sept. 30 2009,
http://www.nvidia.com/object/gpu_tech_conf_press_room.html
[23]: Nvidia’s Next Generation CUDATM Compute Architecture: FermiTM, Version 1.1, 2009
http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_
Architecture_Whitepaper.pdf
[24]: Kanter D., Inside Fermi: Nvidia's HPC Push, Real World Technologies Sept 30 2009,
http://www.realworldtech.com/includes/templates/articles.cfm?ArticleID=RWT093009110932&
mode=print
[25]: Kirsch N., NVIDIA GF100 Fermi Architecture and Performance Preview,
Legit Reviews, Jan 20 2010, http://www.legitreviews.com/article/1193/2/
[26]: Wasson S., Inside Fermi: Nvidia's 'Fermi' GPU architecture revealed,
Tech Report, Sept 30 2009, http://techreport.com/articles.x/17670/1
[27]: Glaskowsky P. N., Nvidia’s Fermi: The First Complete GPU Computing Architecture
Sept 2009, http://www.nvidia.com/content/PDF/fermi_white_papers/
P.Glaskowsky_NVIDIA's_Fermi-The_First_Complete_GPU_Architecture.pdf
[28]: Kanter D., “NVIDIA’s GT200: Inside a Parallel Processor,” Real World Technologies,
Sept. 8 2008, http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242
[29]: Hoenig M., Nvidia GeForce 580 Review, HardwareCanucks, Nov. 8, 2010,
http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/37789-nvidia-
geforce-gtx-580-review-5.html
[30]: Wasson S., Nvidia's GeForce GTX 480 and 470 graphics processors, Tech Report,
March 31 2010, http://techreport.com/articles.x/18682
[31]: Piazza T., Dr. Jiang H., Microarchitecture Codename Sandy Bridge: Processor Graphics,
Presentation ARCS002, IDF San Francisco, Sept. 2010
[32]: Kahn O., Valentine B., Microarchitecture Codename Sandy Bridge: New Processor
Innovations, Presentation ARCS001, IDF San Francisco Sept. 2010
[33]: Hagedoorn H. Mohammad S., Barling I. R., Core i5 2500K and Core i7 2600K review,
Jan. 3 2011,
http://www.guru3d.com/article/core-i5-2500k-and-core-i7-2600k-review/2
[36]: Shimpi A. L., The Sandy Bridge Review: Intel Core i7-2600K, i5-2500K and Core i3-2100
Tested, AnandTech, Jan. 3 2011,
http://www.anandtech.com/show/4083/the-sandy-bridge-review-intel-core-i7-600k-i5-2500k-
core-i3-2100-tested/11
[37]: Wall D. W.: Limits of Instruction Level Parallelism, WRL TN-15, Dec. 1990
[38]: Bhandarkar D.: „The Dawn of a New Era”, Presentation EMEA, May 11 2006.
[43]: Taylor M. B. et al.: Evaluation of the Raw Microprocessor, Proc. ISCA 2004
http://groups.csail.mit.edu/cag/raw/documents/raw_isca_2004.pdf
[44]: Wright C., Henning P.: Roadrunner Tutorial, An Introduction to Roadrunner and the Cell
Processor, Febr. 7 2008,
http://ebookbrowse.com/roadrunner-tutorial-session-1-web1-pdf-d34334105
[45]: Seguin S.: IBM Roadrunner Beats Cray’s Jaguar, Tom’s Hardware, Nov. 18 2008
http://www.tomshardware.com/news/IBM-Roadrunner-Top500-Supercomputer,6610.html
Dezső Sima
• 1. Introduction
• 2. Core 2
• 3. Penryn
• 4. Nehalem
• 5. Nehalem-EX
• 6. Westmere
• 7. Westmere-EX
• 8. Sandy Bridge
Remarks
1) To preserve a clear review the discussion of the basic architectures is restricted only to
Intel’s “standard voltage” basic lines.
Medium-voltage/low-voltage/ultra-low voltage processors are not included.
2) The release dates given relate to the first processors shipped in a considered line.
Subsequently shipped models of the lines are not taken into account in order to keep
the overviews comprehensible.
3) On the slides the core numbers reflect the max. number of cores.
Usually, manufacturers also provide processors with fewer than the max. number of cores.
[Tick-tock roadmap excerpt (2-year cadence):
TOCK: Pentium 4 / Willamette, 180 nm, 11/2000 – new microarch.
TOCK: Pentium 4 / Northwood, 130 nm, 01/2002 – adv. microarch., hyperthreading
TICK: 65 nm, 01/2006
TOCK: Core 2, 07/2006 – new microarch., 4-wide core, 128-bit SIMD, no hyperthreading
TICK: 45 nm, 11/2007
TICK: 32 nm, 01/2010]
2006: 65 nm – Core 2
2007: 45 nm – Penryn
2008: 45 nm – Nehalem
2010: 32 nm – Westmere
[Roadmap fragment (feature rows): process/transistor counts from 0.18 µ/42 mtrs to 0.09 µ/125 mtrs; clock rates from 1.4 GHz to 3.8 GHz; on-die caches from 256 KB L2 to 1 MB L2 (and 1 MB L3); FSB speeds 400–800 MHz; sockets µPGA 603/604; last step cancelled 5/04]
[Roadmap figure (desktop Pentium 4 line): Willamette (11/00, 0.18 µ/42 mtrs, 256 KB L2, 400 MHz FSB, µPGA 423) → Northwood-A/B/C (0.13 µ/55 mtrs, 512 KB L2, 400/533/800 MHz FSB, µPGA 478) → Prescott and Prescott-F (0.09 µ/125 mtrs, 1 MB L2, 800 MHz FSB, µPGA 478/LGA 775) → Tejas (4.0/4.2 GHz, cancelled 5/04); legend: cores supporting hyperthreading / cores with EM64T implemented but not enabled / cores supporting EM64T]
[Table: 90 nm / 112 mm² / 125 mtrs | 90 nm / 135 mm² / 169 mtrs | 65 nm / 81 mm² / 188 mtrs | 65 nm / 2 × 81 mm² / 2 × 188 mtrs]
¹ The original Prescott core included but did not activate the support of 64-bit operation (called EM64T) and used a µPGA 478 socket.
EM64T support was released later, about 6/2004, while changing to the LGA 775 socket.
Figure 1.4: Genealogy of the Cedar Mill core and the DC Presler processor
2.1 Introduction
Core 2 microarchitecture
The Pentium 4 line was cancelled due to the unmanageably high dissipation figures
of its third core (the 90 nm Prescott core),
caused primarily by the design philosophy of the line,
which preferred raising clock frequency over core efficiency for increasing performance.
For the development of the next processor line, dissipation became the key design issue.
The next line was therefore based primarily on the Pentium M – Intel’s first mobile line –
since for its designs dissipation reduction was a key issue.
(The Pentium M line was a 32-bit line with 3 subsequent cores, designed at
Intel’s Haifa Research and Development Center.)
Intel® Advanced Smart Cache
4-wide core
By contrast, both Intel’s previous Pentium 4 family and AMD’s K8 have 3-wide cores.
Figure 2.3: Block diagram of Intel’s Pentium 4 microarchitecture [5]
2.2 Wide execution (4)
Retire width: 3 instr./cycle
2.2 Wide execution (6)
2.2 Wide execution (8)
Figure 2.6: Issue ports and execution unit of the Pentium 4 [9]
Ports 0 and 1 can issue up to two microinstructions per cycle, allowing altogether up to
6 microinstr./cycle to be issued.
Remark
Both the Core’s and the Pentium 4’s schedulers can issue 6 operations per cycle, but
• Pentium 4’s schedulers have only 4 ports, with two double pumped simple ALUs,
• by contrast Core has a unified scheduler with 6 ports, allowing more flexibility
for issuing instructions.
Remark
IBM’s POWER4 and subsequent processors of this line have introduced 5-wide cores
with 8-wide out of order issue.
• Micro-op fusion can reduce the total number of micro-ops to be processed by more than
10 %.
Remark
Example
• AMD’s K8-based processors became the performance leader, first of all on the DP and MP
server market, where the 64-bit direct connect architecture has clear benefits
vs Intel’s 32-bit Pentium 4 based processors using shared FSBs to connect processors
to north bridges.
Figure 2.10: DP web server performance comparison: AMD Opteron 248 vs. Intel Xeon 2.8 [6]
2.2 Wide execution (21)
“In the extensive benchmark tests under Linux Enterprise Server 8 (32 bit as well as
64 bit), the AMD Opteron made a good impression. Especially in the server disciplines,
the benchmarks (MySQL, Whetstone, ARC 2D, NPB, etc.) show quite clearly that the
Dual Opteron puts the Dual Xeon in its place”.
• This situation completely changed in 2006 when Intel introduced their Core 2
microarchitecture, with a 4-wide front-end and retire unit compared to the 3-wide K8 or
the Pentium 4.
[Chart: Webserver performance of MSI K2-102A2M boards with Opteron 275 and Opteron 280, and Opteron 280 vs. an extrapolated 3 GHz Opteron and a 3 GHz Xeon 5160]
Figure 2.11: DP web server performance comparison: AMD Opteron 275/280 vs. Intel Xeon 5160 [8]
Remark
Both web-server benchmark results were published from the same source (AnandTech)
[Diagram: private L2 caches vs. a shared L2 cache; the shared L2 provides + 2× bandwidth to the L1 caches]
Figure 2.14: Data sharing in shared and private (independent) L2 cache implementations [11]
2.3 Smart L2 cache (5)
Trend
• Memory disambiguation
• Enhanced hardware prefetchers
Figure 2.15: Units involved in implementing memory disambiguation or hardware prefetching [12]
2.4 Smart memory accesses (2)
Memory disambiguation
Aim
Hiding memory latency through reordering of loads.
Example
Figure 2.16: Example for memory reordering (memory disambiguation), Loads may bypass
Stores [13]
2.4 Smart memory accesses (4)
[Table: Load reordering under strong vs. weak sequential consistency; typical examples of the strong model: Pentium Pro, Pentium II, III, Pentium 4; weak model: see later]
Example
Figure 2.18: Example for memory reordering (memory disambiguation), Loads may bypass
Stores [13]
2.4 Smart memory accesses (7)
• Loads may bypass only Stores whose target addresses differ from that of the Load,
else the Load would access a former, incorrect value, from the target address.
• However, Store addresses are not always known at the time when the scheduler needs
to decide whether or not the Load considered is allowed to bypass a Store.
• There are two options how to proceed when Store addresses are not yet known.
Deterministic Store bypassing:
Loads bypass Stores only if all respective Store addresses are known and the Load address
does not coincide with any of the Store addresses to be bypassed.
Speculative Store bypassing:
Loads may bypass Stores also in cases when respective Store addresses are not yet known,
that is not yet computed. Then the correctness of the speculative Load needs to be checked,
e.g. as follows: each calculated Store address is compared to all younger Load addresses;
for a hit, this Load and subsequent instructions are aborted and re-executed.
Examples
Figure 2.19: Introduction of Load reordering related to both Loads and Stores
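A tiny C fragment illustrating the hazard that both policies deal with: the load must not be allowed to bypass the preceding store when both refer to the same location, which the hardware may only discover once the store address has been computed.

```c
#include <stdio.h>

int main(void)
{
    int a[2] = { 10, 20 };
    int i = 0, j = 0;          /* imagine i and j being computed late, at run time   */

    a[i] = 42;                 /* Store: address a+i known only once i is ready      */
    int x = a[j];              /* Load : may it bypass the store above?              */
                               /* If i == j, bypassing would return the stale 10;
                                * speculative bypassing must detect the collision and
                                * replay the load (here x must end up as 42).        */
    printf("%d\n", x);
    return 0;
}
```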
2.4 Smart memory accesses (9)
Remark 1
Available literature sources for the UltraSPARC processor do not allow a clear determination
of the load bypassing option used. Based on these sources, deterministic load bypassing
was assumed.
Remark 2
x86 processors have much more complex addressing modes, requiring a number of
address additions, compared to RISC processors.
This is the reason why load reordering was introduced much later in x86 processors
than in RISC processors.
Assumed reason
Stores are less frequent than Loads (~ 1/3–1/4),
so the performance gain that could be obtained does not pay off the additional complexity
(and dissipation).
Figure 2.20: Assumed hardware structure to implement speculative Loads in Intel’s Core 2 [4]
2.4 Smart memory accesses (12)
• Actually, when a Load is issued from the Reservation Station’s scheduler to the Load
Buffer, a predictor is looked up.
• If the prediction is “non colliding”, a Store with an unknown store address may be bypassed,
otherwise not.
Figure 2.21: Principle of the implementation of speculative Loads in Intel’s Core 2 [12]
2.4 Smart memory accesses (14)
Source: [12]
Remark 1
There is a further option to hide memory latency, called Store to Load forwarding, or
Store Buffer forwarding.
Store to Load forwarding
Forwarding store data immediately from the Store buffer to a Load, without waiting
for the data to be written to the cache, in cases when the last Store writing the
same address as referenced by the Load is still available in the Store buffer.
Examples
Pentium 4 (2000)
(Core 2 2006)
Penryn (2007)
Merom (2008)
AMD Athlon64 (2003)
AMD Opteron (2003)
Remark 2
There is yet another option to reduce D cache load-use latency, called speculative Loads.
Speculative Loads
• Issuing a load-use dependent instruction before it turns out whether the data cache
hits or misses, optimistically, in expectation of a cache hit. Then data become
available typically 1-2 clock cycles earlier than with traditional processing.
• If the expectation turns out to be wrong, the execution of the instruction concerned
is aborted and re-executed after the cache miss is serviced.
[Overview figure rows: Pentium line; Pentium Pro, Pentium II/III lines; Pentium 4; Core 2; Penryn; Nehalem; AMD Athlon 64 and Opteron lines]
Figure 2.23: Overview of memory access reordering schemes and their use in x86 processors
2.4 Smart memory accesses (20)
Remarks
• Intel’s first on-die L2 cache debuted only about one year earlier (10/1999),
in the second core of the Pentium III line (called the Coppermine core,
built on 180 nm technology, with a size of 256 KB).
Figure 2.25: Widening the FP/SSE Execution Units from 64-bit to 128-bit [12]
2.5 Enhanced digital media support (3)
Single cycle 128-bit execution as a result of widening the FP/SSE Execution units
MultiMedia eXtensions
Northwood (Pentium 4)
8 MM registers (64-bit),
aliased on the FP Stack registers
Northwood (Pentium4)
Norhwood
Northwood (Pentium4)
Ivy Bridge
Figure 2.28: Intel’s x86 ISA extensions – the SIMD register space (based on [18])
2.5 Enhanced digital media support (7)
[Figure labels: Northwood (Pentium 4); DSP-oriented FP enhancements, enhanced thread manipulation; media acceleration (video encoding, MM, gaming)]
Figure 2.30: Intel’s x86 ISA extensions - the operations introduced (based on [17])
2.5 Enhanced digital media support (9)
Arithmetic:
addpd - Adds 2 64bit doubles.
addsd - Adds bottom 64bit doubles.
subpd - Subtracts 2 64bit doubles.
subsd - Subtracts bottom 64bit doubles.
mulpd - Multiplies 2 64bit doubles.
mulsd - Multiplies bottom 64bit doubles.
divpd - Divides 2 64bit doubles.
divsd - Divides bottom 64bit doubles.
maxpd - Gets largest of 2 64bit doubles for 2 sets.
maxsd - Gets largest of 2 64bit doubles to bottom set.
minpd - Gets smallest of 2 64bit doubles for 2 sets.
minsd - Gets smallest of 2 64bit values for bottom set.
paddb - Adds 16 8bit integers.
paddw - Adds 8 16bit integers.
paddd - Adds 4 32bit integers.
paddq - Adds 2 64bit integers.
paddsb - Adds 16 8bit integers with saturation.
paddsw - Adds 8 16bit integers using saturation.
paddusb - Adds 16 8bit unsigned integers using saturation.
paddusw - Adds 8 16bit unsigned integers using saturation.
psubb - Subtracts 16 8bit integers.
psubw - Subtracts 8 16bit integers.
psubd - Subtracts 4 32bit integers.
psubq - Subtracts 2 64bit integers.
psubsb - Subtracts 16 8bit integers using saturation.
psubsw - Subtracts 8 16bit integers using saturation.
psubusb - Subtracts 16 8bit unsigned integers using saturation.
psubusw - Subtracts 8 16bit unsigned integers using saturation.
pmaddwd - Multiplies 16bit integers into 32bit results and adds results.
pmulhw - Multiplies 16bit integers and returns the high 16bits of the result.
pmullw - Multiplies 16bit integers and returns the low 16bits of the result.
pmuludq - Multiplies 2 32bit pairs and stores 2 64bit results.
rcpps - Approximates the reciprocal of 4 32bit singles.
rcpss - Approximates the reciprocal of bottom 32bit single.
sqrtpd - Returns square root of 2 64bit doubles.
sqrtsd - Returns square root of bottom 64bit double.
Figure 2.31: Excerpt of Intel’s SSE2 SIMD ISA extension [19]
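A minimal host-code sketch of how a few of the packed-double instructions listed above are exposed through compiler intrinsics (the intrinsics below map to addpd, mulpd and maxpd on SSE2-capable compilers; the data values are just an example):

```c
#include <stdio.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

int main(void)
{
    __m128d a = _mm_set_pd(3.0, 1.0);   /* packs {1.0, 3.0}   */
    __m128d b = _mm_set_pd(4.0, 2.0);   /* packs {2.0, 4.0}   */

    __m128d sum = _mm_add_pd(a, b);     /* addpd: {3.0,  7.0} */
    __m128d prd = _mm_mul_pd(a, b);     /* mulpd: {2.0, 12.0} */
    __m128d mx  = _mm_max_pd(a, b);     /* maxpd: {2.0,  4.0} */

    double out[2];
    _mm_storeu_pd(out, sum); printf("addpd: %g %g\n", out[0], out[1]);
    _mm_storeu_pd(out, prd); printf("mulpd: %g %g\n", out[0], out[1]);
    _mm_storeu_pd(out, mx ); printf("maxpd: %g %g\n", out[0], out[1]);
    return 0;
}
```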
2.5 Enhanced digital media support (10)
[Figure labels: Northwood (Pentium 4) – 2 × 32-bit MMX and 2 × 32-bit SSE EUs; Larrabee – 24–32? × 512-bit FP/SSE EUs; processor labels up to Ivy Bridge]
Figure 2.32: SIMD execution resources in Intel’s basic processors (based on [18])
2.5 Enhanced digital media support (11)
Figure 2.33: Achieved performance boost in Core2 for gaming vs AMD’s Athlon 64 FX60 [13]
2.6 Intelligent power management (1)
Figure 2.34: The operation of the Ultra fine grained power control – an example [11].
2.6 Intelligent power management (3)
Bus splitting
Principle of operation
Most buses are sized for the worst case; activating only the needed bus width saves power.
Digital Thermal Sensors (DTS) on the dies, instead of analog diodes, providing digital data
scanned by dedicated logic.
Figure 2.37: Principle of the PECI-based platform fan speed control [42]
Remark
PECI reports the relative temperature values measured below the onset value of the
thermal control circuit (TCC)
Remark
In the Nehalem Intel modified the Loop Stream Detector as follows:
Figure 2.39: The modified loop Stream Detector in the Nehalem [1]
MP-Servers
72xx, 2C, Tigerton DC, (2xMP-enhanced SC Woodcrest) 9/2007
73xx, 2x2C, Tigerton QC, (2xMP-enhanced DC Woodcrest) 9/2007
Based on [43]
3. Penryn
3.1 Introduction
Penryn
Sub-threshold =
Source-Drain
Figure 3.1: Dynamic and static power dissipation trends in chips [21]
3.1 Introduction (3)
2 x Core 2 x Penryn
Figure 3.4: The 45 nm Penryn is a shrink of the 65 nm Core with a few enhancements [25]
3.1 Introduction (6)
Figure 3.5: Key enhancements introduced into Penryn’s microarchitecture vs the Core
(based on [25])
[Figure labels: Radix-r divider; QSL: Quotient Select; hybrid (producing Gs and Ps)]
Figure 3.6: Simplified block diagram and latency of Penryn’s radix-16 divider [27]
3.3 More advanced L2 cache (1)
Core 2 Penryn
SSE4.1 ISA extension (47 instructions): the largest set of ISA extensions introduced since 2000.
Fast access to graphics card memory, e.g. by using two threads for CPU-GPU sharing
Figure 3.14: Latency improvements achieved by Penryn’s Super Shuffle Engine [30]
Microarchitecture comparison
Figure 3.15: Performance improvements of Penryn vs Core at the same clock frequency [26]
3.4 More advanced digital media support (10)
Figure 3.16: Extending Intel’s performance leadership in main application segments [26]
(First introduced in the Core Duo, the 3rd core of the Pentium M line)
• Intelligent heuristics decide when to enter it (OS API WAIT)
Figure 3.19: Power reduction achieved by the Deep Power Down Technology [27]
3.5 More advanced power management (5)
Remark 1
A similar technique was already developed for the Montecito (dual core Itanium),
but not implemented, called the Foxton technology.
Remark 2
Intel’s next basic core, the Nehalem includes a more advanced technology than
the Enhanced Dynamic Acceleration Technology, called the Turbo Boost Technology for
increasing clock frequency in case of inactive cores or light workloads.
-DP
-MP (6 cores)
Desktops
Core 2 Duo E8xxx, Wolfdale, 2C, 1/2008
Core 2 Duo E7xxx, Wolfdale-3M, 2C, 4/2008
Core 2 Quad Q9xxx, Yorkfield-6M, 2x2C, (2x Wolfdale-3M), 3/2008
Core 2 Quad Q8xxx, Yorkfield-6M, 2x2C, (2x Wolfdale-3M), 8/2008
Core 2 Extreme QX9xxx, Yorkfield XE, 2x2C (2x Wolfdale), 11/2007
Servers
UP-Servers
E31xx Wolfdale, 2C, 1/2008
X33xx, Yorkfield-6M, 2x2C, (2xWolfdale), 1/2008
X33xx, Yorkfield, (2xWolfdale), 2x2C, 1/2008
DP-Servers
E52xx, Wolfdale, 2C, 11/2007
E54xx/X54xx, Harpertown 2x2C, (2xWolfdale), 11/2007
MP-Servers
E74xx, 4C/6C, Dunnington, 9/2008
4.1 Introduction
Nehalem
Experiences with HT
The design effort took about five years and required thousands of engineers
(Ronak Singhal, lead architect of Nehalem) [37].
Nehalem lines
Mobiles Mobiles
Core i7-9xxM Clarksfield 4C 9/2009
Core i7-8xxQM Clarksfield 4C 9/2009
Core i7-7xxQM Clarksfield 4C 9/2009
Desktops Desktops
Core i7-9xx (Bloomfield) 4C 11/2008 Core i7-8xx (Lynnfield) 4C 9/2009
Core i5-7xx (Lynnfield) 4C 9/2009
Servers Servers
UP-Servers UP-Servers
DP-Servers DP-Servers
55xx Gainestown (Nehalem-EP) 4C 3/2009 C55xx Jasper forest1 2C/4C 2/2010
Based on [44]
4.1 Introduction (4)
Bloomfield [45]
Note
• Both the Bloomfield (desktop) chip and the Gainestown (DP server) chip have the same
layout.
• On the Bloomfield die there are two QPI bus controllers; however, not both of them are
needed for this desktop part.
One of them is simply not used in the desktop version (Bloomfield) [45], but both are needed
in the DP alternative (Gainestown).
4.1 Introduction (5)
The 2. generation Lynnfield chip as a major redesign of the Bloomfield chip (1) [46]
• It is a cheaper and more effective two-chip system solution instead of the previous three
chip solution.
• It is connected to the P55 chipset by a DMI interface rather than by a QPI interface
used in the previous system solution to connect the Bloomfield chip to the X58 chipset.
Intel's Bloomfield Platform (X58 + LGA-1366) Intel's Lynnfield Platform (P55 + LGA-1156)
4.1 Introduction (6)
The 2. generation Lynnfield chip as a major redesign of the Bloomfield chip (2) [46]
• It provides PCIe 2.0 lanes (16 to 32 lanes) to attach a graphics card directly to the
processor rather than to the north bridge (by 36 lanes) as done in the previous solution.
Intel's Bloomfield Platform (X58 + LGA-1366) Intel's Lynnfield Platform (P55 + LGA-1156)
4.1 Introduction (7)
The Lynnfield chip as a major redesign of the Bloomfield chip (3) [46]
• It supports only two DDR3 memory channels instead of three as in the previous solution.
• Its socket needs fewer connections (LGA-1156) than the Bloomfield (LGA-1366).
• All in all the Lynnfield chip is a cheaper and more effective successor of the Bloomfield chip.
Intel's Bloomfield Platform (X58 + LGA-1366) Intel's Lynnfield Platform (P55 + LGA-1156)
4.1 Introduction (8)
Remarks [49]
• All 3 chips are basically the same design.
• In Jasper Forest the circled part is the QPI
controller, however this part remains blank
in the mobile and desktop versions as
these versions do not provide a QPI link.
Remark
The embedded DP server Jasper Forest (Xeon C5500) is not an “all QPI solution”.
It provides a single QPI bus along with a DMI bus for the 3420 chipset (Picket Post Platform).
• Native 4C
• Simultaneous Multithreading (SMT)
• New cache architecture
• SSE 4.2 ISA extension
• Integrated memory controller
• QuickPath Interconnect bus (QPI)
• Enhanced power management
• Advanced virtualization
• New socket
Figure 4.3: Overview of the major innovations of 1. generation Nehalem processors (based on [22])
(The die photo is that of the Bloomfield/Gainestown processor)
4.2 Simultaneous Multithreading (SMT) (1)
Benefits
Deeper buffers
[Cache comparison figure (Pentium M / Core / Nehalem): shared L2 cache (e.g. 4 MB for two cores) vs. private L2 caches (256 KB per core) plus an inclusive L3 cache of up to 8 MB]
• The L2 cache is private again rather than shared as in the Core and Penryn processors
Private L2: Pentium 4, Nehalem
Shared L2: Core, Penryn
Private caches allow a more effective hardware prefetching than shared ones.
Reason
• Hardware prefetchers look for memory access patterns.
• Private L2 caches have more easily detectable memory access patterns
than shared L2 caches.
Remark
[Table: private vs. shared L2 usage in IBM’s POWER4, POWER5 and POWER6 lines]
• with inclusive L3 caches an L3 cache miss means that the referenced data
doesn’t exist in any core’s L2 caches, thus no L2 snooping is needed.
• By contrast, with exclusive L3 caches the referenced data may exist in any
of the L2 caches, thus L2 snooping is required.
Demonstration example
Exclusive Inclusive
L3 Cache L3 Cache
Figure 4.10: Comparing exclusive and inclusive cache behavior in case of a L3 cache miss (1) [1]
Exclusive Inclusive
L3 Cache L3 Cache
MISS! MISS!
Figure 4.11: Comparing exclusive and inclusive cache behavior in case of a L3 cache miss (2) [1]
Exclusive Inclusive
L3 Cache L3 Cache
MISS! MISS!
Figure 4.12: Comparing exclusive and inclusive cache behavior in case of a L3 cache miss (3) [1]
Exclusive Inclusive
L3 Cache L3 Cache
HIT! HIT!
Figure 4.13: Comparing exclusive and inclusive cache behavior in case of a L3 cache hit (1) [1]
Inclusive
Figure 4.14: Comparing exclusive and inclusive cache behavior in case of a L3 cache hit (2) [1]
Main features
Figure 4.17: Non Uniform Memory Access (NUMA) in multi-socket servers [1]
4.5 Integrated memory controller (4)
Remark
Figure 4.20: Intel’s roadmap from 1999 showing Timna due to 2H 2000. [35]
[Link diagram: TX unidirectional link and RX unidirectional link, each with 20 lanes: 16 data, 2 protocol, 2 CRC]
• Each unidirectional link comprises 20 data lanes and a clock lane, with
each lane consisting of a pair of differential signals.
Note
(The QPI isn’t an I/O interface, the standard I/O interface remains the PCI-Express bus)
• Consists of 2 unidirectional links, one in each directions, called the TX and RX links.
• Each unidirectional link comprises 20 data lanes (16 data, 2 protocol, 2 CRC) and a clock lane,
with each lane consisting of a pair of differential signals.
Figure 4.23: QPI based DP and MP server system architectures [31], [33]
Fastest FSB:
400 MHz QDR × 8 bytes = 12.8 GB/s (bidirectional)
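The 12.8 GB/s figure follows directly from the 400 MHz base clock, quad data rate and the 8-byte wide data bus quoted above; a quick check:

```c
#include <stdio.h>

int main(void)
{
    double base_clock_hz = 400e6;   /* 400 MHz FSB base clock        */
    double qdr           = 4.0;     /* quad data rate -> 1600 MT/s   */
    double bus_bytes     = 8.0;     /* 64-bit (8 byte) wide data bus */

    double bw = base_clock_hz * qdr * bus_bytes;
    printf("Peak FSB bandwidth: %.1f GB/s\n", bw / 1e9);   /* 12.8 GB/s */
    return 0;
}
```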
The “Uncore”
Power switches
Utilization of the power headroom of inactive cores and that of active cores with light workload
for increasing clock frequency.
Remark
The Penryn core already introduced a less intricate technology for the same purpose,
termed the Enhanced Dynamic Acceleration Technology, which increases clock frequency
only on the mobile platform and only in case of inactive cores.
Understanding the notion of TDP and the related potential to boost performance (1) [50]
TDP (Thermal Design Power) is the maximum power consumed at realistic worst case
applications (TDP application).
The thermal solution (cooling system) needs to ensure that the junction temperature (Tj)
at maximum core frequency specified in connection with TDP does not exceed the
junction temperature limit (Tjmax) while the processor runs TDP applications.
Example
The mobile quad core Clarksfield processor i7-920XM has
• a TDP of 55 W
• ACPI P-states from 1.2 GHz (Low Frequency Mode) to 2.0 GHz (High Frequency Mode)
available to implement DBS (Demand Based Switching) of fc and Vcc, and
• allows to increase fc in turbo mode from 2.0 GHz up to 3.20 GHz.
The maximum clock frequency related to TDP (2 GHz in the above example) is determined
while running (worst case) TDP applications that intensively utilize all four cores such that
at this frequency dissipation still remains below TDP (i.e. 55 W in the above example).
4.7 More advanced power management (6)
Understanding the notion of TDP and the related potential to boost performance (2) [50]
Typical workloads however, are not intensive enough to push power consumption to the TDP limit.
The remaining power headroom can be utilized to increase fc if
The possible frequency increase depends on the intensity of the workload and the number of
active cores.
If the OS requests an active core to increase fc beyond the TDP limited maximum frequency
(i.e. to enter the PO state),
and there is an available power headroom
• either by having idle cores
• or a lightly threaded workload
the turbo mode controller will increase the core frequency of the active cores
provided that the power consumption of the socket and junction temperatures of the cores
do not exceed the given limits.
In turbo mode all active cores in the processor will operate at the same fc and voltage.
Figure 4.27: Turbo mode uses the available power headroom in processor package power limits [52]
4.7 More advanced power management (9)
For inactive cores, the turbo mode controller will increase fc to a maximum turbo frequency
that depends on the number of active cores provided that actual power and temperature values
remain below specified limits.
Maximum turbo frequencies are factory configured and kept in an internal register (MSR 1ADH).
E.g. in case of a single active core the Core i7-920XM will increase fc to 3.2 GHz,
which is 8 frequency bins higher than the TDP limited core frequency of 2.0 GHz.
Remark
In the above example the 133 MHz is the basic frequency that will be multiplied by the PLL
by an appropriate factor to generate the clock frequency fc.
Assuring that power and temperature values do not exceed specified limits [53], [50]
A precondition for increasing fc is that the power consumption of the package and
the junction temperature of the cores do not exceed given limits.
To assure this the turbo boost controller samples the current power consumption and
die temperatures in 5 ms intervals [53].
Power consumption is determined by monitoring the processor current at its input pin as well
as the associated voltage (Vcc) and calculating the power consumption as a moving average.
The junction temperatures of the cores are monitored by DTSs (Digital Thermal Sensors) with
an error of ± 5 % [50].
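A minimal sketch of the kind of moving-average power estimate described above; the 5 ms sampling period comes from the text, while the window length and the sample values are purely illustrative assumptions.

```c
#include <stdio.h>

#define WINDOW 8   /* number of 5 ms samples averaged (assumed window length) */

/* Simple moving average over the last WINDOW power samples (in watts). */
static double update_avg_power(double sample_w)
{
    static double hist[WINDOW];
    static int idx = 0, filled = 0;

    hist[idx] = sample_w;
    idx = (idx + 1) % WINDOW;
    if (filled < WINDOW) filled++;

    double sum = 0.0;
    for (int i = 0; i < filled; i++) sum += hist[i];
    return sum / filled;
}

int main(void)
{
    double samples[] = { 30.0, 35.0, 52.0, 54.0, 40.0, 38.0 };  /* I * Vcc, example */
    for (int i = 0; i < 6; i++)
        printf("t=%2d ms  avg power = %.1f W\n", i * 5, update_avg_power(samples[i]));
    return 0;
}
```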
5.1 Introduction
• Nehalem-EX based DP/MP servers are also designated as the Beckton family.
• They include only server processors.
• First Beckton processors were delivered in 3/2010.
Remark: The SMI interface was formerly designated as the Fully Buffered DIMM2 interface
5.5 Scalable platform configurations
DP-Servers
65xx Beckton (Nehalem-EX) 8C 3/2010
MP-Servers
75xx Beckton (Nehalem-EX) 8C 3/2010
Performance features of the 8-core Nehalem-EX based Xeon 7500 vs the Penryn based
6-core Xeon 7400 [67]
6.1 Introduction
• Westmere (formerly Nehalem-C) is the 32 nm die shrink of Nehalem.
• First Westmere-based processors were launched in 1/2010
Westmere family
Mobiles
Core i3-3xxM Arrandale 2C+G 1/2010
Core i5-4xxM Arrandale 2C+G 1/2010
Core i5-5xxM Arrandale 2C+G 1/2010
Core i7-6xxM Arrandale 2C+G 1/2010
Desktops
Core i3-5xx Clarkdale 2C+G 1/2010
Core i5-6xx Clarkdale 2C+G 1/2010
Core i7-9xx/9xxX Gulftown 6C 3/2010
Servers
UP-Servers
DP-Servers DP-Servers
56xx Gulftown (Westmere-EP) 6C 3/2010 E7-28xx Westmere-EX 10C 4/2011
MP-Servers
E7-48xx Westmere-EX 10C 4/2011
E7-88xx Westmere-EX 10C 4/2011
Data based on [44]
Westmere 2-core mobile and desktop platform Westmere 6-core UP/DP server platform
6.2 Native 6 cores with 12 MB L3 cache (LLC) for UP/DP servers [58]
6.3 In-package integrated CPU/GPU processors for the mobile and the
desktop segments
Example: mobile – i3-3xxM 2C, i5-4xx/5xxM 2C, i7-6xxM 2C; desktop – i3-5xx 2C, i5-6xx 2C
CPU/GPU components
CPU: Hillel (32 nm Westmere architecture)
(Enhanced 32 nm shrink of the
45 nm Nehalem architecture)
GPU: Ironlake (45 nm)
Shader model 4, DX10 support
32 nm CPU (Hillel)
(Mobile implementation of the Westmere
basic architecture,
which is the 32 nm shrink of the
45 nm Nehalem basic architecture) 45 nm GPU (Ironlake)
Intel’s GMA HD (Graphics Media Accelerator)
(12 Execution Units, Shader model 4, no OpenCL support)
6.3 In-package integrated CPU/GPU (4)
http://www.anandtech.com/show/2902
6.3 In-package integrated CPU/GPU (5)
Figure 6.1: The Clarkdale processor with in-package integrated graphics along with the H57 chipset [140] (reference from Part 4)
6.3 In-package integrated CPU/GPU (8)
Remark
In Jan. 2011 Intel replaced their in-package integrated CPU/GPU lines with the
on-die integrated Sandy Bridge line.
• The processor cores are operating at maximum thermal power level (which is greater
than their TDP) and the integrated graphics and the integrated memory controller
are operating at their minimum thermal power.
• The integrated graphics operates at its maximum thermal power level, while
the processor cores consume the remaining MCP package power limit.
• Processor core currents are monitored by the processor input pin and calculated using a
moving average.
• When the power limit is reached power sharing control will adaptively remove the turbo boost
states to remain within the MCP thermal power limit.
• Errors in power estimation or measurement can significantly impact or completely eliminate
the performance benefit of the turbo boost technology.
NHM/M
(WSM/M)
NHM/D (WSM/D)
• For the two point thermal design it must however be ensured that the
component Tjmax limits are not exceeded when either component is operating
at its extreme thermal power limit.
• The junction temperature of the cores, integrated graphics and memory controller are
monitored by their respective DTS (Digital Thermal Sensor).
A DTS outputs a temperature relative to the maximum supported junction temperature.
The error associated with DTS measurements will not exceed ± 5 % within the operating range.
7.1 Introduction
• Westmere-EX processors are 32 nm die shrinks of the 45 nm Nehalem-EX line.
• First Westmere-EX processors were shipped in 4/2011.
• They are socket compatible with the Nehalem-EX line (Xeon 75xx or Beckton line).
Native 10 cores with 30 MB of L3 cache (LLC) vs native 8 cores with 24 MB L3 cache (LLC),
in order to compete with AMD’s 2×6 core (dual chip) Magny-Cours processors.
UP-Servers
E7-28xx Westmere-EX 10C 4/2011
DP-Servers
E7-28xx Westmere-EX 10C 4/2011
MP-Servers
E7-48xx Westmere-EX 10C 4/2011
E7-88xx Westmere-EX 10C 4/2011
8.1 Introduction
• Sandy Bridge is Intel’s new microarchitecture using 32 nm line width.
• First delivered in 1/2011
[Sandy Bridge key features: Hyperthreading; 32K L1D (3 clk); AES instructions; AVX 256-bit, 4 operands; VMX Unrestricted; ~20 mm²/core]
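A small host-code sketch of the 256-bit, non-destructive multi-operand style that AVX exposes through compiler intrinsics; the values are arbitrary, and the example only assumes an AVX-capable compiler (compile e.g. with -mavx).

```c
#include <stdio.h>
#include <immintrin.h>   /* AVX intrinsics */

int main(void)
{
    /* Eight packed single-precision floats per 256-bit YMM register. */
    __m256 a = _mm256_set1_ps(1.5f);
    __m256 b = _mm256_set1_ps(2.0f);

    /* Non-destructive form: the result goes to a third register, a and b survive. */
    __m256 c = _mm256_add_ps(a, b);
    __m256 d = _mm256_mul_ps(c, b);

    float out[8];
    _mm256_storeu_ps(out, d);
    printf("first lane: %g (expected 7.0)\n", out[0]);   /* (1.5 + 2.0) * 2.0 */
    return 0;
}
```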
Key features and benefits of the Sandy Bridge line vs the 1. generation Nehalem line [61]
8.1 Introduction (5)
Mobiles
Core i3-23xxM, 2C, 2/2011
Core i5-24xxM//25xxM, 2C, 2/2011
Core i7-26xxQM/27xxQM/28xxQM, 4C, 1/2011
Core i7 Extreme-29xxXM , 4C, Q1 2011
Desktops
Core i3-21xx, 2C, 2/2011
Core i5-23xx/24xx/25xx, 4C, 1/2011
Core i7-26xx, 4C, 1/2011
Servers
UP-Servers
E3 12xx, Sandy Bridge-H2, 4C, 3/2011
DP-Servers
E5 2xxx, Sandy Bridge-EP, up to 8C, Q4/2011
MP-Servers
E5 4xxx, Sandy Bridge-EX, up to 8C, Q1/2012
Sandy Bridge
[Figure labels: 8 MM registers (64-bit), aliased on the FP stack registers; processor labels from Northwood (Pentium 4) up to Ivy Bridge]
Figure 8.2: Intel’s x86 ISA extensions – the SIMD register space (based on [18])
8.2 Advanced Vector Extension (AVX) (3)
1.5 K µops
Remark [65]
A cache for decoded µops was already introduced by Intel in the ill-fated Pentium 4 (2000),
designated as the Trace Cache (keeping 12 K µops).
[Comparison chart residue: integrated graphics (12 EUs) of the Sandy Bridge i5/i7-2xxx/3xxx and Arrandale i5-6xx processors vs. the AMD HD5570 (400 ALUs)]
The concept utilizes the real temperature response of processors to power changes
in order to increase the extent of overclocking [64]
Concept: Use thermal energy budget accumulated during idle periods to push the core
beyond the TDP for short periods of time (e.g. for 20 sec).
Multiple algorithms manage current, power and die temperature in parallel [64].
8.6 Enhanced turbo boost technology (3)
Intelligent power sharing between the cores and the integrated graphics [64]
Intelligent power sharing between the cores and the integrated graphics [68]
NHM/M WSM/M
NHM/D WSM/D
[61]
8.6 Enhanced turbo boost technology (6)
Remark
• Individual cores may run at different frequencies but all cores share the same power plane.
• Individual cores may be shut down if idle by power gates.
[Table 9.1 rows (flattened):
Core: microfusion, macrofusion; radix-16 divider; 2-way SMT → SMT with deeper buffers to support SMT
Cache architecture: private L2 caches (up to 2 MB/core) → shared L2 caches (4 MB/2 cores, then 6 MB/2 cores) → private L2 caches (256 KB/core) plus a shared L3 cache (up to 8 MB)
Memory accesses: Store to Load forwarding; Loads bypass only Loads → Loads bypass both Loads and Stores; single L2 prefetcher → 8 prefetchers in DC processors (2 × L1 D$, 1 × L1 I$ per core, 2 × L2)]
Table 9.1: Evolution of the main features of Intel’s basic cores (1)
9. Overview of the evolution (2)
Power management: detailed subsequently
Support of virtualization: not discussed
Table 9.2: Evolution of the main features of Intel’s basic cores (2)
(Process/socket columns: 180 nm/478 pins, 90 nm/775 pins, 90 nm, 90/65 nm, 65 nm, 45 nm, 45 nm)
Protection of overheating:
• Thermal Monitor 1 (TM1): hardware controlled turning the clock off and on (clock modulation)
• Thermal Monitor 2 (TM2): hardware controlled switching to a second operating state with reduced fc and VID (C1E state)
• Adaptive Thermal Monitor: first activate TM2 and, if not enough, activate also TM1
(Process/socket columns: 180 nm/478 pins, 90 nm/775 pins, 90 nm, 90/65 nm, 65 nm, 45 nm, 45 nm)
Reducing power consumption of active processors:
• EIST (Enhanced Intel SpeedStep Technology): OS controlled switching to multiple P-states (power states) in less active periods to reduce power consumption
• Ultra fine grained power control: shutting down not needed processor units
• Bus splitting: deactivating not needed bus lines
[7]: Völkel F., “Duel of the Titans: Opteron vs. Xeon : Hammer Time: AMD On The Attack,”
Tom’s hardware, Apr. 22. 2003,
http://www.tomshardware.com/reviews/duel-titans,620.html
[8]: De Gelas J., “Intel Woodcrest, AMD's Opteron and Sun's UltraSparc T1:
Server CPU Shoot-out,” AnandTech, June 17. 2006,
http://www.anandtech.com/IT/showdoc.aspx?i=2772&p=1
[9]: Hinton G. & al., “The Microarchitecture of the Pentium 4 Processor,” Intel Technology
Journal, Q1 2001, pp. 1-13
[10]: Wechsler O., “Inside Intel Core Microarchitecture,” White Paper, Intel, 2006
[11]: Lee V., “Inside the Intel Core Microarchitecture,” IDF, May 2006, Shenzhen,
http://www.prcidf.com.cn/sz/systems_conf/track_sz/SMC/Intel%20Core%20uArch.pdf
[12]: Doweck J., “Inside Intel Core Microarchitecture,” Hot Chips 18, 2006,
http://www.hotchips.org/archives/hc18/
[13]: Gruen H., “Intel’s new Core Microarchitecture,” Develop Brighton, AMD Technical
Day, July 2006,
http://ati.amd.com/developer/brighton/03%20Intel%20MicroArchitecture.pdf
[14]: Doweck J., “Intel Smart Memory access: Minimizing Latency on Intel Core
Microarchitecture, ” Technology @ intel Magazine, Sept. 2006, pp. 1-7,
ftp://download.intel.com/corporate/pressroom/emea/deu/fotos/06-10-Strategie_Tag/
Intel/Intel_Core2_Prozessoren/Texte/ENG-Smart_Memory_Access_Technology@
Intel_Magazine_Article.pdf
[15]: Sima D., Fountain T., Kacsuk P., Advanced Computer Architectures, Addison Wesley,
Harlow etc., 1997
[16]: Jafarjead B., “Intel Core Duo Processor,” Intel, 2006,
http://masih0111.persiangig.com/document/peresentation/behrooz%20jafarnejad.ppt
[17]: Pawlowski S. & Wechsler O., “Intel Core Microarchitecture,” IDF Spring, 2006,
http://www.intel.com/pressroom/kits/core2duo/pdf/ICM_tech_overview.pdf
[18]: Goto H., “Larrabee architecture can be integrated into CPU,” PC Watch, Oct. 06. 2008,
http://pc.watch.impress.co.jp/docs/2008/1006/kaigai470.htm
[19]: SIMD Instruction Sets, http://softpixel.com/~cwright/programming/simd/index.php
[20]: Platform Environment Control Interface,
http://en.wikipedia.org/wiki/Platform_Environment_Control_Interface
[21]: Kim N. S. et al., „Leakage Current: Moore’s Law Meets Static Power”, Computer,
Dec. 2003, pp. 68-75.
[22]: Ng P. K., “High End Desktop Platform Design Overview for the Next Generation
Intel Microarchitecture (Nehalem) Processor,” IDF Taipei, TDPS001, 2008,
http://intel.wingateweb.com/taiwan08/published/sessions/TDPS001/
FA08%20IDF-Taipei_TDPS001_100.pdf
[23]: Bohr M., Mistry K., Smith S., “Intel Demonstrates High-k + Metal Gate Transistor
Breakthrough in 45 nm Microprocessors,”, Intel, Jan. 2007,
http://download.intel.com/pressroom/kits/45nm/Press45nm107_FINAL.pdf
[24]: Scott D. S., “Toward Petascale and Beyond,” APAC Conference, Oct. 2007,
http://www.apac.edu.au/apac07/pages/program/presentations/
Tuesday%20Harbour%20A%20B/David_Scott.pdf
[25]: Smith S. L., “45nm Product Press Briefing,” IDF Fall, 2007,
http://download.intel.com/pressroom/kits/events/idffall_2007/BriefingSmith45nm.pdf
[26]: Fisher S., “Technical Overview of the 45nm Next Generation Intel Core Microarchitecture
(Penryn),” IPTS001, Fall IDF 2007, http://isdlibrary.intel-dispatch.com/isd/89/45nm.pdf
[27]: George V., 45nm Next Generation Intel Core Microarchitecture (Penryn),”
Hot Chips 19, 2007,
http://www.hotchips.org/archives/hc19/3_Tues/HC19.08/HC19.08.01.pdf
[28]: Foxton Technology, Wikipedia, http://en.wikipedia.org/wiki/Foxton_Technology
[29]: Coke J. & al., “Improvements in the Intel Core Penryn Processor Family Architecture
and Microarchitecture,” Intel Technology Journal, Vol. 12, No. 3, 2008, pp. 179-192
[30]: Fisher S., “Technical Overview of the 45nm Next Generation Intel Core Microarchitecture
(Penryn),” BMA S004, IDF 2007,
http://my.ocworkbench.com/bbs/attachment.php?attachmentid=318&d=1176911500
[35]: Smith T., “Timna - Intel's first system-on-a-chip, Before 'Tolapai', before 'Banias'.
Register Hardware, 6. February 2007,
http://www.reghardware.co.uk/2007/02/06/forgotten_tech_intel_timna/
[45]: Glaskowsky P.: Investigating Intel's Lynnfield mysteries, cnet News, Sept. 21. 2009,
http://news.cnet.com/8301-13512_3-10357328-23.html
References (6)
[46]: Shimpi A. L.: Intel's Core i7 870 & i5 750, Lynnfield: Harder, Better, Faster Stronger,
AnandTech, Sept. 8. 2009, http://www.anandtech.com/show/2832
[47]: Intel Xeon Processor C5500/C3500 Series, Datasheet – Volume 1, Febr. 2010,
http://download.intel.com/embedded/processor/datasheet/323103.pdf
[48]: Intel CoreTM i7-800 and i5-700 Desktop Processor Series Datasheet – Volume 1,
July 2010, http://download.intel.com/design/processor/datashts/322164.pdf
[49]: Glaskowsky P.: Intel's Lynnfield mysteries solved, cnet News, Sept. 28. 2009,
http://news.cnet.com/8301-13512_3-10362512-23.html
[50]: Intel CoreTM i7-900 Mobile Processor Extreme Edition Series, Intel Core i7-800 and
i7-700 Mobile Processor Series, Datasheet – Volume One, Sept. 2009
http://download.intel.com/design/processor/datashts/320765.pdf
[51]: Intel Turbo Boost Technology in Intel CoreTM Microarchitecture (Nehalem) Based
Processors, White Paper, Nov. 2008
http://download.intel.com/design/processor/applnots/320354.pdf
[52]: Power Management in Intel Architecture Servers, White Paper, April 2009
http://download.intel.com/support/motherboards/server/sb/power_management_of_intel
architecture_servers.pdf
[53]: Glaskowsky P.: Explaining Intel’s Turbo Boost technology, cnet News, Sept. 28. 2009,
http://news.cnet.com/8301-13512_3-10362882-23.html
[54]: Intel Xeon Processor 7500 Series, Datasheet – Volume 2, March 2010
http://www.intel.com/Assets/PDF/datasheet/323341.pdf
[55]: Pawlowski S.: Intelligent and Expandable High- End Intel Server Platform, Codenamed
Nehalem-EX, IDF 2009
[56]: Kottapalli S., Baxter J.: Nehalem-EX CPU Architecture, Hot Chips 2009, Sept. 10. 2009
http://www.hotchips.org/archives/hc21/2_mon/HC21.24.100.ServerSystemsI-Epub/HC21.24.
122-Kottapalli-Intel-NHM-EX.pdf
[57]: Kurd N. A. et al.: A Family of 32 nm IA Processors, IEEE Journal of Solid-State Circuits,
Vol. 46, Issue 1., Jan. 2011, pp. 119-130
[58]: Hill D., Chowdhury M.: Westmere Xeon-56xx „Tick” CPU, Hot Chips 2010
http://www.hotchips.org/uploads/archive22/HC22.24.620-Hill-Intel-WSM-EP-print.pdf
[59]: Intel CoreTM i7-600, i5-500, i5-400 and i3-300 Mobile Processor Series, Datasheet -
Vol.1, Jan. 2010, http://download.intel.com/design/processor/datashts/322812.pdf
[60]: Nagaraj D., Kottapalli S.: Westmere-EX: A 20 thread server CPU, Hot Chips 2010
http://www.hotchips.org/uploads/archive22/HC22.24.610-Nagara-Intel-6-Westmere-EX.pdf
[61]: Kahn O., Piazza T., Valentine B.: Technology Insight: Intel Next Generation
Microarchitecture Codename Sandy Bridge, IDF 2010, extreme.pcgameshardware.de/.../
281270d1288260884-bonusmaterial-pc-games-hardware-12-2010-sf10_spcs001_100.pdf
[64]: Kahn O., Valentine B.: Intel Next Generation Microarchitecture Codename Sandy Bridge:
New Processor Innovations, IDF 2010
References (8)
[65]: Shimpi A. L.: Intel Pentium 4 1.4GHz & 1.5GHz, AnandTech, Nov. 20. 2000
http://www.anandtech.com/show/661/5
[66]: Yuffe M., Knoll E., Mehalel M., Shor J., Kurts T.: A fully integrated multi-CPU, GPU and
memory controller 32nm processor, ISSCC, Febr. 20-24. 2011, pp. 264-266
[67]: Intel Xeon Processor 7500/6500 Series, Public Gold Presentation, Data Center Group,
March 30. 2010, http://cache-www.intel.com/cd/00/00/44/64/446456_446456.pdf
[68]: Tang H., Cheng H.: Intel Xeon Processor E3 Family Based Servers: A Smart Investment
for Managing Your Small Business, IDF 2011
[69]: Thomadakis M. E. PhD: The Architecture of the Nehalem Processor and Nehalem-EP SMP
Platforms, Texas A&M University, March 17. 2011
http://alphamike.tamu.edu/web_home/papers/perf_nehalem.pdf
[70]: Intel Xeon Processor E7-8800/4800/2800 Product Families, Datasheet Vol. 1 of 2,
April 2011, http://www.intel.com/Assets/PDF/datasheet/325119.pdf
Dezső Sima
• 1. Introduction to DT platforms
Platforms
Set of processors and associated chipsets capable of working together.
Traditional three-chip platforms consist of
• a single or multiple processors,
• an MCH (Memory Control Hub) and
• an ICH (I/O Control Hub).
(Figure residue: traditional DT platform vs. recent DT platform with the processor(s) and an IOH)
(Figure residue, Averill DT platform: processors E6xxx/E4xxx, X6800 (7/2006); Q6xxx, QX6xxx (11/2006);
MCH: 965 Series (Broadwater), 6/2006; south bridge: ICH8, 6/2006;
FSB: 1066/800/533 MT/s; 2 DDR2 channels, DDR2-800/667 MT/s, 2 DIMMs/channel, 8 GB max.)
Note
Main components
• Two register chips, for buffering the address- and command lines
• A PLL (Phase Locked Loop) unit for deskewing clock distribution.
ECC
Figure 1.1: Typical layout of a registered memory module with ECC [1]
SDRAM 168-pin
DDR 184-pin
DDR3 240-pin
FSB
Display 2 DIMMs/channel
2 DIMMs/channel
card
C-link
There are limitations on both the minimum width and the spacing of the copper traces.
• The minimum width of the copper traces is restricted by their non-zero trace impedance.
• The minimum spacing between traces is constrained by parasitic capacitance and crosstalk.
Given
• the huge number of connections to be implemented as copper traces connecting
particular parts to the MCH,
• and the large number of connections each DDR2 or DDR3 memory channel needs,
• as well as the physical restrictions implied on the copper traces,
recent 3-chip platforms are limited typically to 2 DDR2 or DDR3 memory channels.
By contrast, FBDIMM channels make use of serial links and need only about 80 lines,
as a consequence 3-chip platforms may have about 3 times more FBDIMM memory
channels than DDR2 or DDR3 channels.
Beginning with Intel’s Nehalem processor, however the memory controller moved onto the
processor chip, as shown below.
6.4 GT/s
Tylersburg
In this case the memory channels are attached to the processor, whereby the processor
has far fewer connections to other units than the MCH had in the previous design.
As a consequence, Nehalem-based or subsequent platforms may implement more than two
DDR2/DDR3 channels, as illustrated above.
Remark
Typical bandwidth of recent 2-channel memory interfaces
Let’s assume that a particular platform has dual memory channels with DDR3-1333 DIMMs.
Then the resulting memory bandwidth (BW) of the platform amounts to
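Assuming the usual 64-bit (8 byte) wide DDR3 channels, a straightforward calculation gives

BW = 2 channels x 1333 MT/s x 8 B/transfer ≈ 21.3 GB/s.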
As memory speeds increase or the number of DIMMs attached to each memory channel
increases, the operational tolerances of data transmissions over the memory channels
become narrower due to electrical effects, such as reflections, jitter, skews, crosstalk etc.
The operational margins are effective while capturing the transmitted data at the receiver.
They are defined by
• a size, that is the sum of the setup time (tS) and the hold time (tH), and
• a correct phase relative to the clock edge, satisfying both the tS and tH requirements.
(Figure residue: data valid window (DVW) between the voltage levels VHmin and VLmax, referenced to the clock CK, with setup time tS and hold time tH)
The size and fulfillment of the operational tolerances can be visualized by the eye diagram.
It shows the picture of a large number of overlaid data signals.
Figure 1.5: Eye diagram of a real signal showing both available DVW and requested voltage levels [4]
Reflections, jitter, skews, crosstalks and other disturbances narrow the operational margins of
the data transfer and limit thereby
• the transfer speed and
• the number of DIMMs allowed to be attached to each channel.
Reflections
At operational speeds of DDR2/DDR3 memories the connection lines, i.e. the copper traces
behave like transmission lines.
Transmission lines need to be terminated by their characteristic impedance (about 50-70 Ω for
copper traces on mainboards) if reflections should be avoided.
In case of a termination mismatch or existing inhomogeneities of the transmission line,
reflections arise and narrow the operational tolerances.
Termination of the transmission lines connecting the memory controller and the DRAM chips
Despite the fact that subsequent memory technologies (from SDRAM to DDR3) laid more and more
emphasis on the appropriate termination of the transmission lines (up to on-die, dynamically
adjusted termination of the lines in case of DDR3 memories), a certain termination mismatch
typically remains and reflections arise, as shown in the next slide.
Figure 1.6: Reflections shown on an eye diagram due to termination mismatch [5]
Inhomogeneity of the transmission lines connecting the memory controller and the DRAM chips
Transmission lines connecting the memory controller and the DRAM chips mounted on the
DIMMs are inherently inhomogeneous due to the kind of connection used.
The dataway that connects the memory controller and the DRAM chips
Motherboard trace
Figure 1.7: The copper traces connecting the memory controller and the DRAM chips behave
like transmission lines (based on [6])
Jitter
• It means phase uncertainty causing ambiguity in the rising and falling edges of a
signal, as shown in the figure below.
• It has a stochastic nature.
• Its main sources are:
• Crosstalk caused by coupling adjacent traces on the board or in the DRAM device,
• ISI (Inter-Symbol Interference) caused by cycling the bus faster than it can settle,
• Reflection noise due to mismatched termination of signal lines,
• EMI (Electromagnetic Interference) caused by electromagnetic radiation emitted
from external sources.
Skew
(Figure residue: skew between the signals CK-1 and CK-2)
Reflections, jitter, skews and further electrical disturbances reduce the operational tolerances
effective at the receiver end of the transmission lines connecting the memory controller and
the DRAM chips and limit the operational speed of the memory channels as well as
the number of DIMMs attachable per channel.
As a consequence in recent DT and also server platforms the number of DIMMs that can be
attached to DDR2 or DDR 3 memory channels is typically restricted to two.
This restriction roots in the parallel style of connecting traditional memory modules to
memory controllers.
By contrast, serially connected memory modules, such as FBDIMM modules, have much higher
operational tolerances and in this case more than two (typically up to 6 or 8) FBDIMM modules
can be attached to a memory channel.
Example
The DRAM capacity (C) of a DT platform having three memory channels with two 4 GB DIMMs
per channel amounts to
C = 3 x 2 x 4 GB = 24 GB
DT platforms
Target market
Enterprise computing
Main goals
• lowering TCO (Total Cost of Ownership)
• increasing system availability and
• enhancing security.
(Table residue: vPro generations and their platform technologies - AMT (Active Management Technology), VT / VT-d (Virtualization Technology / for Directed I/O), TXT (Trusted Execution Technology), Intel Turbo Memory Technology (TM), AT (Anti-Theft Technology), listed separately for client and server platforms; superscripts refer to the notes below)
1 AMT version 1.0 preceded vPro; it was introduced in 2005, based on the 945 chipset, the ICH7 and
a Gigabit Ethernet controller, and supported the dual-core P4 Smithfield processor
2Introduced in the 2. gen. vPro based on the Q35 in 2007
Example
Main hardware components of the 5. generation vPro
Remark
Note
Beyond the hardware requirements, vPro also needs firmware (BIOS) and OS support, not
detailed here. For details see e.g. [9].
FW: Firmware
Remark
Relocation of the ME while the DT system architecture evolved from the 3-chip solution to
the 2-chip solution (along with the Nehalem-based Lynnfield processors and their associated
5 Series PCHs) [24]
Previous
(Dedicated graphics
via graphics card)
(Series 5 PCH)
“Consumer
graphics
Operation of ME [30]
Intel Virtualization Technology for Directed I/O (VT-d) [31], [32], [33]
Virtualization technology in general
• consists of a set of hardware and software components that allow running multiple OSs
and applications in independent partitions.
• Each partition is isolated and protected from all other partitions.
• Virtualization enables among others
• Server consolidation
Substituting multiple dedicated servers by a single virtualized platform,
• Legacy software migration
Legacy software: software commonly used previously, often written in languages that are
no longer in common use (such as Cobol) and running under OSs or platforms that are
no longer in common use.
Legacy software migration: moving legacy software to a recent platform,
• Effective disaster recovery.
• The preceding designation for the TXT technology was the LaGrande technology.
• It provides hardware based security against hypervisor attacks, BIOS or other
firmware attacks, malicious root kit installations or other software attacks.
• It extends the Virtual Machine Extensions (VMX) of Intel's Virtualization Technology (VT)
with a Measured Launch Environment (MLE), providing a verifiably secure installation, launch and use of a hypervisor or OS.
• It consists of a number of hardware enhancements to allow the creation of multiple
separated execution environments or partitions.
• One of the components is the TPM (Trusted Platform Module), a special chip which allows for
secure key generation and storage and authenticated access to data encrypted by this key.
The TPM chip is usually connected to the LPC (Low Pin Count) bus.
• TXT became available for DT platforms beginning with the 3 Series MCH model Q35 in 2007.
• It is a disk caching technology using SSDs (Solid-State Drives), i.e. flash memory
placed on a card.
• The Turbo memory card provides 1 – 4 GB disk space and is connected to the PC via the
PCIe interface.
• It caches frequently used data or user selected applications.
• Expected results: faster access to data and lower power consumption.
• It was announced first for mobile platforms in 2005 and offered
later on also for DT platforms, along with the 3 and 4 Series chipsets
starting in 2007.
• The Turbo Memory Technology is supported by Microsoft
Windows Vista “Ready Drive” and “Ready Boost” technologies.
• According to related reviews the Turbo Memory technology
did not fulfill the expectations; it was costly and
not worth its price.
• In 2009 Intel announced a successor technology for the
5 Series mobile chipsets, called the Braidwood technology,
but subsequently withdrew it.
• In 2011 Intel introduced a disk caching mechanism in their Z68 chipset
(and mobile derivatives) of the Series 6 PCH family, providing
disk caching by a SATA SSD.
Table 3.1: Intel’s Core 2 based or more recent multicore desktop (DT) processors
DT platforms
Overview
Remark
• The X38 chipset may be considered as belonging to the 3 Series family of chipsets.
It does not support vPro.
• Similarly, the X48 chipset may be considered as belonging to the 4 Series family of chipsets.
It does not support vPro.
Features of different platforms are defined by the particular chipset they include.
E.g. different models of the Averill platform (Core 2/Penryn based platform incorporating the
965 chipset and the ICH7 south bridge) provide the following features [19]:
Core2/Penryn
(2C/2*2C)
proc.
FSB
DMI C-link1
ICH8/9/10
1 C-link (Controller link) is needed basically for loading the ME (Management Engine) firmware from the nonvolatile
system flash memory that is attached to the ICH.
The ME is used to implement particular platform features supported.
4. Intel’s Core 2 and Penryn based DT platforms (7)
1. Example: The Core 2/Penryn based private consumer oriented Averill DT platform
with the P965 MCH and ICH8 that does not provide an integrated display controller
[20]
card 2 DIMMs/channel
2 DIMMs/channel
C-link
Remark
2. Example: The Core 2/Penryn based private consumer oriented Averill DT platform
with the G965 MCH and ICH8 that provides an integrated display controller [2]
Display 2 DIMMs/channel
2 DIMMs/channel
card
C-link
b) Intel’s Core 2 and Penryn based enterprise oriented DT platforms (vPro platforms)
Overview
Core 2-based (65 nm)    Core 2 Quad-based (65 nm)    Penryn-based (45 nm)
1 Q8000: 8/2008
4. Intel’s Core 2 and Penryn based DT platforms (12)
Core2/Penryn
(2C/2*2C)
proc.
FSB
DDR2/DDR3 (depending on the MCH model used)
MCH: Q965/Q35/Q45
DMI C-link1
ICH8/9/10
1 C-link (Controller link) is needed basically for loading the ME (Management Engine) firmware from the nonvolatile
system flash memory that is attached to the ICH.
The ME is used to implement particular platform features, such as AMT.
Display
2 DIMMs/channel
card 2 DIMMs/channel
C-link
C-link
Remark
Overview-1
Private consumer oriented Enterprise oriented Private consumer oriented Enterprise oriented
DT platforms DT platforms DT platforms DT platforms
Overview-2
11/2008 9/2009
Kings Creek
Tylersburg Piketon (vPro)1
ICH10
1. gen. Nehalem-EP based (45 nm)    Westmere-EP based (32 nm)
1 Needs the Q57 PCH
5. Intel’s Nehalem based DT platforms (3)
1. gen. Nehalem (Bloomfield, 4C) /
Westmere (Gulftown, 6C) proc.
DDR3
QPI
X58 IOH
DMI    C-link1
ICH10
1 C-link (Controller link) is needed basically for loading the ME (Management Engine) firmware from the nonvolatile
system flash memory that is attached to the ICH.
The ME is used to implement particular platform features supported.
Example: 1. gen. Nehalem (Bloomfield, 4C) based private consumer oriented Tylersburg
DT platform [22]
2 DIMMs/channel
2 DIMMs/channel
2 DIMMs/channel
Processor:
• Nehalem-EP
(Bloomfield, 4C)
• Westmere-EP
(Gulftown, 6C)
Remark: The platform shown does not include an integrated display controller
5. Intel’s Nehalem based DT platforms (5)
[23]
These platforms introduced a new kind of system architecture that consists only of two chips.
Previous
(Dedicated graphics
via graphics card)
5 Series PCH
controller
“Consumer
graphics
Remarks
• Intel's PCH offerings address a number of market segments.
For each segment Intel provides a number of PCH lines for different use, among others
• the X line for extreme performance for home use (mostly for gamers),
• the P line for home use,
• the Q and H lines for business use etc.
Each line comprises usually a number of models with different feature sets.
E.g. the desktop segment includes the following line and models with the feature sets given.
5 series datasheet
5. Intel’s Nehalem based DT platforms (10)
Q57
5-Series
PCH
PCH
(w/AMT 6.0)
1 FDI is needed for an integrated display controller (included in all 5 Series PCHs except the P55).
Note
In the 2-chip system architecture the PCH includes the ME (Manageability Engine) and is
connected to the nonvolatile system flash memory that keeps the firmware to be read at
boot time.
So there is no need for an extra interface for loading the ME firmware from the nonvolatile system memory,
as was the case in the previous 3-chip system architecture.
2 DIMMs/channel
2 DIMMs/channel
Remark
1/2011 1/2011
i7-26xx-29xx i7-26xx-29xx
i5-23xx-25xx i5-23xx-25xx
i3-21xx-23xx i3-21xx-23xx
4C (12 EUs): 32 nm/915? mtrs/216 mm2 4C (12 EUs): 32 nm/915? mtrs/216 mm2
2C (12 EUs): 32 nm/624? mtrs/149 mm2 2C (12 EUs): 32 nm/624? mtrs/149 mm2
¼ MB L2/C ¼ MB L2/C
Up to 8 MB L3 Up to 8 MB L3
1 x DMI2 1 x DMI2
+ 1 x FDI (except the P67 PCH) + 1 x FDI (except the P67 PCH)
2 DDR3 channels 2 DDR3 channels
1066/1333 MT/s 1066/1333 MT/s
2 DIMMs/channel 2 DIMMs/channel
max. 32 GB max. 32 GB
LGA-1155 LGA-1155
1/2011 1/2011
6- Series Q67
PCH PCH
1 FDI is needed for integrated display controllers (included in all 6 Series PCHs except the P67).
Remark
The Z68 model supports disk caching that allows an SSD to be used to cache a SATA hard disk.
This technology is designated now as the Smart Response Technology.
It is analogous to Intel’s previous Turbo Memory Technology introduced along with the 3 Series
Chipsets but discontinued with the Series 5 PCHs.
DMI DMI
ICH8/9/10 ICH10
Remark
[1]: DDR SDRAM Registered DIMM Design Specification, JEDEC Standard No. 21-C,
Page 4.20.4-1, Jan. 2002, http://www.jedec.org
[5]: Allan G., „The outlook for DRAMs in consumer electronics”, EETIMES Europe Online,
01/12/2007, http://eetimes.eu/showArticle.jhtml?articleID=196901366&queryText=calibrated
[6]: Jacob B., Ng S. W., Wang D. T., Memory Systems, Elsevier, 2008
[7]: Intel Core versus AMD's K8 architecture
[8]: Kirstein B., „Practical timing analysis for 100-MHz digital design,”, EDN, Aug. 8, 2002,
www.edn.com
[9]: Izaguirre J., Building and Deploying Better Embedded Systems with Intel Active
Management Technology (Intel AMT), Intel Technology Journal, Vol. 13, Issue 1., 2009,
pp. 84-95
[10]: Technology Brief, 2nd generation Intel Core processor family, Intel Anti-Theft Technology,
2011, http://www.intel.com/technology/anti-theft/anti-theft-tech-brief.pdf
References (2)
[14]: Intel Lyndon Platform & Future TS, VR-Zone, Dec. 9 2004,
http://vr-zone.com/articles/intel-lyndon-platform--future-ts/1520.html
[15]: Greene J., Intel Trusted Execution Technology, White Paper, 2010,
http://www.intel.com/Assets/PDF/whitepaper/323586.pdf
[16]: Wikipedia: Trusted Execution Technology, 2011,
http://en.wikipedia.org/wiki/Trusted_Execution_Technology
[17]: Intel Turbo Memory supported chipsets, Intel Corporation,
http://www.intel.com/support/chipsets/itm/sb/CS-025854.htm
[19]: Intel 965 Express Chipset Family Datasheet, July 2006, http://ivanlef0u.fr/repo/ebooks/
intel_manuals/Intel%20%20965%20Express%20Chipset%20Family.pdf
[21]: Product Brief: Intel Q35 and Q33 Express Chipsets, 2007,
http://www.intel.com/Assets/PDF/prodbrief/317312.pdf
References (3)
[29]: Freeman V., Intel refreshes its vPro platform for 2010, March 9 2010, Hardware Central,
http://www.hardwarecentral.com/features/article.php/3869536/Intel-Refreshes-Its-
vPro-Platform-for-2010.htm
[30]: Marek J., Rasheed Y., Watts L., Technical Overview of Next Generation Intel vPro
Technology, PROS001, 2010
[31]: Neiger G., Santoni A., Leung F., Rodgers D., Uhlig R., Intel® Virtualization Technology:
Hardware support for efficient processor virtualization, Aug. 10 2006, Vol. 10, Issue 3,
http://www.intel.com/technology/itj/2006/v10i3/1-hardware/1-abstract.htm
Dezső Sima
Aim
Brief introduction and overview.
1. Introduction
3. Overview of GPGPUs
5. References
Vertex
Edge Surface
Vertices
• have three spatial coordinates
• carry supplementary information necessary to render the object, such as
• color
• texture
• reflectance properties
• etc.
Shaders
DirectX 8.1 (10/2001): pixel shader 1.2, 1.3, 1.4; vertex shader 1.0, 1.1; Windows XP / Windows Server 2003
DirectX 9.0 (12/2002): pixel shader 2.0; vertex shader 2.0
Table 1.1: Pixel/vertex shader models (SM) supported by subsequent versions of DirectX
and MS’s OSs [18], [21]
• Different instructions
• Different resources (e.g. registers)
Based on its FP32 computing capability and the large number of FP-units available
GPGPUs
(General Purpose GPUs)
or
cGPUs
(computational GPUs)
Peak FP32/FP64 performance of Nvidia's GPUs vs Intel's P4 and Core2 processors [43]
Evolution of the bandwidth of Nvidia's GPUs vs Intel's P4 and Core2 processors [43]
Figure 1.2: Contrasting the utilization of the silicon area in CPUs and GPUs [11]
• Less area for control since GPGPUs have simplified control (same instruction for
all ALUs)
• Less area for caches since GPGPUs support massive multithreading to hide the
latency of long operations, such as memory accesses in case of cache misses.
SIMD execution:
• One dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input vectors.
SIMT execution:
• Two dimensional data parallel execution, i.e. it performs the same operation on all elements of a given FX/FP input array (matrix).
In addition, SIMT execution
• is massively multithreaded,
and provides
• data dependent flow control as well as
• barrier synchronization
Remarks
1) SIMT execution is also termed SPMD (Single Program Multiple Data) execution (Nvidia).
2) The SIMT execution model is a low level execution model that needs to be complemented
with further models, such as the model of computational resources or the memory model,
not discussed here.
(Figure residue: the three levels of the programming model - HLL level, virtual machine level (pseudo ISA), object code level; HLL compilers: nvcc/nvopencc for Nvidia, brcc for AMD)
The compiled pseudo ISA code (PTX code/IL code) remains independent from the
actual hardware implementation of a target GPGPU, i.e. it is portable over different
GPGPU families.
Compiling a PTX/IL file to a GPGPU that lacks features assumed by the particular PTX/IL
version, however, may require emulation of the features not implemented in hardware.
This slows down execution.
2. Basics of the SIMT execution (5)
• Phase 2: Compiling the pseudo assembly code to GPU specific binary code
Nvidia AMD
The object code (GPGPU code, e.g. a CUBIN file) is forward portable, but forward portability
is typically provided only within major GPGPU versions, such as Nvidia's compute capability
versions 1.x or 2.x.
• The compiled pseudo ISA code (PTX code/IL code) remains independent from the
actual hardware implementation of a target GPGPU, i.e. it is portable over subsequent
GPGPU families.
Forward portability of the object code (GPGPU code, e.g. CUBIN code) is provided however,
typically only within major versions.
• Compiling a PTX/IL file to a GPGPU that lacks features assumed by the particular PTX/IL
version, however, may require emulation of the features not implemented in hardware.
This slows down execution.
• Portability of pseudo assembly code (Nvidia’s PTX code or AMD’s IL code) is highly
advantageous in the recent rapid evolution phase of GPGPU technology as it results in
less costs for code refactoring.
Code refactoring costs are a kind of software maintenance costs that arise when the user
switches from a given generation to a subsequent GPGPU generation (like from GT200
based devices to GF100 or GF110-based devices) or to a new software environment
(like from CUDA 1.x SDK to CUDA 2.x or from CUDA 3.x SDK to CUDA 4.x SDK).
Remark
The virtual machine concept underlying both Nvidia’s and AMD’s GPGPUs is similar to
the virtual machine concept underlying Java.
• For Java there is also an inherent pseudo ISA definition, called the Java bytecode.
• Applications written in Java will first be compiled to the platform independent Java bytecode.
• The Java bytecode will then either be interpreted by the Java Runtime Environment (JRE)
installed on the end user’s computer or compiled at runtime by the Just-In-Time (JIT)
compiler of the end user.
First, let’s discuss the basic structure of the underlying SIMD cores.
SIMD cores execute the same instruction stream on a number of ALUs (e.g. on 32 ALUs),
i.e. all ALUs perform typically the same operations in parallel.
Fetch/Decode
SIMD core
ALU ALU ALU ALU ALU
SIMD ALUs operate according to the load/store principle, like RISC processors i.e.
• they load operands from the memory,
• perform operations in the “register space” i.e.
• they take operands from the register file,
• perform the prescribed operations and
• store operation results again into the register file, and
• store (write back) final results into the memory.
The load/store principle of operation takes for granted the availability of a register file (RF)
for each ALU.
Load/Store
Memory RF
ALU
As a consequence of the chosen principle of execution each ALU is allocated a register file (RF)
that is a number of working registers.
Fetch/Decode
RF RF RF RF RF RF
Remark
The register sets (RF) allocated to each ALU are actually parts of a single, sufficiently large register file.
RF RF RF RF RF RF
Figure 2.6: Allocation of distinct parts of a large register file to the private register sets of the ALUs
Beyond the basic operations the SIMD cores provide a set of further computational capabilities,
such as
• FX32 operations,
• FP64 operations,
• FX/FP conversions,
• single precision trigonometric functions (to calculate reflections, shading etc.).
Note
Computational capabilities specified at the pseudo ISA level (intermediate level) are
• typically implemented in hardware.
Nevertheless, it is also possible to implement some compute capabilities
• by firmware (i.e. microcode),
• or even by emulation during the second phase of compilation.
SIMT cores are enhanced SIMD cores that provide an effective support of multithreading
Achieved by
• providing and maintaining separate contexts for each thread, and
• implementing a zero-cycle context switch mechanism.
SIMT cores
= SIMD cores with per thread register files (designated as CTX in the figure)
Fetch/Decode
SIMT core
CTX CTX CTX CTX CTX CTX
CTX CTX CTX CTX CTX CTX
CTX CTX CTX CTX CTX CTX
Actual context CTX
Register file (RF)
CTX CTX CTX CTX CTX
ALU
ALU ALU ALU ALU ALU ALU
Figure 2.7: SIMT cores are specific SIMD cores providing separate thread contexts for each thread
The final model of computational resources of GPGPUs at the virtual machine level
The GPGPU is assumed to have a number of SIMT cores and is connected to the host.
Fetch/Decode
SIMT
ALU ALU core
ALU ALU ALU ALU ALU ALU
Host
Fetch/Decode
Fetch/Decode SIMT
ALU ALU
core
ALU ALU ALU ALU ALU ALU
ALU ALU SIMT
ALU ALU ALU ALU ALU ALU
core
During SIMT execution 2-dimensional matrices will be mapped to the available SIMT cores.
2. Basics of the SIMT execution (20)
Remarks
1) The final model of computational resources of GPGPUs at the virtual machine level is similar
to the platform model of OpenCL, given below assuming multiple cards.
ALU
Card
SIMT core
Figure 2.10: Simplified block diagram of the Cayman core (that underlies the HD 69xx series) [99]
2. Basics of the SIMT execution (22)
The memory model at the virtual machine level declares all data spaces available at this level
along with their features, like their accessibility, access mode (read or write) access width etc.
SIMT 1
Local Memory
Instr.
ALU 1 ALU 2 ALU n
Unit
Constant Memory
Global Memory
Figure 2.12: Key components of available data spaces at the level of SIMT cores
Local memory
• On-die R/W data space that is accessible from all ALUs of a particular SIMT core.
• It allows sharing of data for the threads that are executing on the same SIMT core.
SIMT 1
Local Memory
Instr.
ALU 1 ALU 2 ALU n
Unit
Constant Memory
Global Memory
Figure 2.13: Key components of available data spaces at the level of SIMT cores
Constant Memory
• On-die Read only data space that is accessible from all SIMT cores.
• It can be written by the system memory and is used to provide constants for all threads
that are valid for the duration of a kernel execution with low access latency.
GPGPU
SIMT 1 SIMT m
Reg. Reg. Reg. Reg.
File 1 File n File 1 File n
Constant Memory
Global Memory
Figure 2.14: Key components of available data spaces at the level of the GPGPU
2. Basics of the SIMT execution (27)
Global Memory
• Off-die R/W data space that is accessible for all SIMT cores of a GPGPU.
• It can be accessed by the system memory and is used to hold all instructions and data
needed for executing kernels.
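As a brief illustrative sketch of how these data spaces appear at the HLL level in CUDA C (the kernel and array names below are made up for illustration; note that what is called Local Memory here corresponds to shared memory in CUDA C terminology):

// Hypothetical CUDA C fragment illustrating the data spaces discussed above
__constant__ float coeff[16];                  // Constant Memory: read-only in kernels, written from the host

__global__ void smooth(const float *in, float *out, int n)   // in/out point into Global Memory
{
    __shared__ float tile[256];                // per-SIMT-core Local Memory ("shared memory" in CUDA terms),
                                               // assuming 256-thread blocks
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // stage data from Global Memory into the on-die space
    __syncthreads();                              // make the staged data visible to all threads of the block
    if (i < n)
        out[i] = tile[threadIdx.x] * coeff[0];
}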
GPGPU
SIMT 1 SIMT m
Reg. Reg. Reg. Reg.
File 1 File n File 1 File n
Constant Memory
Global Memory
Figure 2.15: Key components of available data spaces at the level of the GPGPU
2. Basics of the SIMT execution (28)
Remarks
1. AMD introduced Local memories, designated as Local Data Share, only along with their
RV770-based HD 4xxx line in 2008.
2. Beyond the key data space elements available at the virtual machine level, discussed so far,
there may also be other kinds of memories declared at the virtual machine level,
such as AMD's Global Data Share, an on-chip Global memory introduced with their
RV770-based HD 4xxx line in 2008.
3. Traditional caches are not visible at the virtual machine level, as they are transparent for
program execution.
Nevertheless, more advanced GPGPUs allow an explicit cache management at the
virtual machine level, by providing e.g. data prefetching.
In these cases the memory model needs to be extended with these caches accordingly.
4. Max. sizes of particular data spaces are specified by the related instruction formats
of the intermediate language.
5. Actual sizes of particular data spaces are implementation dependent.
6. Nvidia and AMD designate their kinds of data spaces differently, as shown below.
Nvidia AMD
A set of ALUs
within the
SIMT cores
General Purpose Registers: R/W, per ALU; default: (127-2)*4 registers, since 2*4 registers are reserved as Clause Temporary Registers
Remarks
• Max. sizes of data spaces are specified along with the instructions formats of the
intermediate language.
• The actual sizes of the data spaces are implementation dependent.
Example: Simplified block diagram of the Cayman core (that underlies the HD 69xx series) [99]
Scalar execution: domain of execution: scalars, no indices; objects of execution: single data elements; supported by all processors.
SIMD execution: domain of execution: one-dimensional index space; objects of execution: data elements of vectors; supported by 2.G/3.G superscalars.
SIMT execution: domain of execution: two-dimensional index space; objects of execution: data elements of matrices; supported by GPGPUs/DPAs.
Figure 2.16: Domains of execution in case of scalar, SIMD and SIMT execution
2. Massive multithreading
For each element of the index space, called the domain of execution, the programmer creates
parallel executable threads that will be executed by the GPGPU or DPA.
Threads
(work items)
Domain of
execution
Figure 2.17: Parallel executable threads created and executed for each element of an execution
domain
The programmer describes the set of operations to be done over the entire domain of execution
by kernels.
Threads
(work items)
Kernels are specified at the HLL level and compiled to the intermediate level.
Specification of kernels
• A kernel is defined by
• Each thread that executes the kernel is given a unique identifier (thread ID, Work item ID)
that is accessible within the kernel.
Remark
During execution each thread is identified by a unique identifier that is
• int I in case of CUDA C, accessible through the threadIdx variable, and
• int id in case of OpenCL accessible through the built-in get_global_id() function.
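As a minimal illustrative sketch (the kernel name and sizes are made up, not taken from [43] or [144]), a CUDA C kernel typically combines the built-in blockIdx, blockDim and threadIdx variables into a global index:

// Hypothetical CUDA C kernel: each thread handles one element of a vector
__global__ void scale(float *data, float factor, int n)
{
    // unique global thread index composed from the block and thread indices
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               // guard threads that fall outside the domain of execution
        data[i] *= factor;
}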
Invocation of kernels
The kernel is invoked in CUDA C and OpenCL differently
• In CUDA C
by specifying the name of the kernel and the domain of execution [43] (a sketch follows after this list)
• In OpenCL
by specifying the name of the kernel and the related configuration arguments, not detailed
here [144].
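For illustration only, a hedged sketch of launching the above hypothetical kernel in CUDA C; the <<<grid, block>>> execution configuration specifies the domain of execution:

// Hypothetical host-side launch of the scale() kernel sketched above
int n = 512 * 512;                                    // size of the domain of execution
int threadsPerBlock = 256;
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);  // d_data: device pointer allocated e.g. by cudaMalloc()
cudaDeviceSynchronize();                              // wait until the kernel has finished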
• The domain of execution will be broken down into equal sized ranges, called
work allocation units (WAUs), i.e. units of work that will be allocated to the SIMT cores
as an entity.
Domain of execution Domain of execution
Global size m Global size m
WAU WAU
Global size n
Global size n
(0,0) (0,1)
WAU WAU
(1,0) (1,1)
Figure 2.19: Segmenting the domain of execution to work allocation units (WAUs)
E.g. Segmenting a 512 x 512 sized domain of execution into four 256 x 256 sized
work allocation units (WAUs).
WAU WAU
Global size n
Global size n
(0,0) (0,1)
WAU WAU
(1,0) (1,1)
Figure 2.20: Segmenting the domain of execution to work allocation units (WAUs)
Work allocation units will be assigned for execution to the available SIMT cores as entities
by the scheduler of the GPGPU/DPA.
Example: Assigning work allocation units to the SIMT cores in AMD’s Cayman GPGPU [93]
(0,0) (0,1)
SIMT cores
Serial kernel execution: the GPGPU scheduler assigns work allocation units only from a single kernel
to the available SIMT cores, i.e. the scheduler distributes work allocation units to available
SIMT cores for maximum parallel execution.
Concurrent kernel execution: the GPGPU scheduler is capable of assigning work allocation units
to SIMT cores from multiple kernels concurrently, with the constraint that the scheduler can
assign work allocation units to each particular SIMT core only from a single kernel.
• A global scheduler, called the Gigathread scheduler assigns work to each SIMT core.
• In Nvidia’s pre-Fermi GPGPU generations (G80-, G92-, GT200-based GPGPUs)
the global scheduler could only assign work to the SIMT cores from a single kernel
(serial kernel execution).
• By contrast, in Fermi-based GPGPUs the global scheduler is able to run up to 16 different
kernels concurrently, presumably one per SM (concurrent kernel execution).
Kernel 1: NDRange1
Global size 10
Kernel 2: NDRange2
Global size 20
(0,0) (0,1)
4.c Segmenting work allocation units into work scheduling units to be executed
on the execution pipelines of the SIMT cores-1
• Work scheduling units are parts of a work allocation unit that will be scheduled for execution
on the execution pipelines of a SIMT core as an entity.
• The scheduler of the GPGPU segments work allocation units into work scheduling units
of given size.
Example: Segmentation of a 16 x 16 sized Work Group into Subgroups of the size of 8x8
in AMD’s Cayman core [92]
Work Group
Wavefront of
64 elements
4.c Segmenting work allocation units into work scheduling units to be executed
on the execution pipelines of the SIMT cores-2
Work scheduling units are called warps by Nvidia or wavefronts by AMD.
Size of the work scheduling units
• In Nvidia’s GPGPUs the size of the work scheduling unit (called warp) is 32.
• AMD’s GPGPUs have different work scheduling sizes (called wavefront sizes)
• High performance GPGPU cards have typically wavefront sizes of 64, whereas
• lower performance cards may have wavefront sizes of 32 or even 16.
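As a simple worked example (sizes chosen for illustration only), a 16 x 16 work group contains 256 work items, which segment as follows:

256 work items / 32 threads per warp         = 8 warps      (Nvidia GPGPUs)
256 work items / 64 work items per wavefront = 4 wavefronts (high performance AMD cards)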
The scheduling units, created by segmentation, are then sent to the scheduler.
Example: Sending work scheduling units for execution to SIMT cores in AMD’s Cayman core [92]
Work Group
Subgroup of 64 elements
4.d Scheduling work scheduling units for execution to the execution pipelines of the
SIMT cores
The scheduler assigns work scheduling units to the execution pipelines of the SIMT cores
for execution according to a chosen scheduling policy (discussed in the case example parts
5.1.6 and 5.2.8).
Work scheduling units will be executed on the execution pipelines (ALUs) of the SIMT cores.
SIMT core
ALU ALU ALU ALU ALU
Note
Massive multithreading is a means to prevent stalls occurring during the execution of
work scheduling units due to long latency operations, such as memory accesses caused
by cache misses.
Example
Up to date (Fermi-based) Nvidia GPGPUs can maintain up to 48 work scheduling units,
called warps per SIMT core.
For instance, the GTX 580 includes 16 SIMT cores, with 48 warps per SIMT core and
32 threads per warp for a total number of 24576 threads.
• The model of data sharing declares the possibilities to share data between threads.
• This is not an orthogonal concept, but results from both
• the memory concept and
• the concept of assigning work to execution pipelines of the GPGPU.
Per-thread
reg. file
Local Memory
Domain of execution 1
Notes
Domain of execution 2
1) Work Allocation Units
are designated in the Figure
as Thread Block/Block
In SIMT processing both paths of a branch are executed one after the other, such that
for each path the prescribed operations are executed only on those data elements which
fulfill the data condition given for that path (e.g. xi > 0).
Example
First all ALUs meeting the condition execute the prescribed three operations,
then all ALUs missing the condition execute the next two operations.
Figure 2.23: Resuming instruction stream processing after executing a branch [24]
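A minimal CUDA C sketch of such a data dependent branch (the kernel and variable names are illustrative only); threads of a warp whose elements satisfy x[i] > 0 execute the first path, then the remaining threads of the warp execute the second path:

// Hypothetical CUDA C kernel with a data dependent branch
__global__ void branch_example(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] > 0.0f)             // executed first, only by threads whose element fulfills the condition
            y[i] = x[i] * 2.0f;
        else                         // executed afterwards by the remaining threads of the warp
            y[i] = 0.0f;
    }
}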
7. Barrier synchronization
Barrier synchronization
Synchronization of Synchronization of
thread execution memory read/writes
It is implemented
• in Nvidia’s PTX by the “membar” instruction [147] or
• in AMD’s IL by the “fence lds”/”fence memory” instructions [10].
Discussion of this topic assumes the knowledge of programming details therefore it is omitted.
Interested readers are referred to the related reference guides [147], [104], [105].
Example
• Evolution of the pseudo ISA of Nvidia’s GPGPUs and their support in real GPGPUs.
• Subsequent versions of both the pseudo- and real ISA are designated as compute capabilities.
2. Basics of the SIMT execution (68)
2. Basics of the SIMT execution (69)
2.1: GF108, GF106, GF104, GF114 cores; GT 420/30/40, GTS 450, GTX 450, GTX 460, GTX 550Ti, GTX 560Ti cards
(Figure residue: evolution of GPGPU cores and process nodes - Nvidia: G80 (90 nm), G92 (65 nm, shrink), G200 (65 nm, enhanced arch.), GF100/Fermi (40 nm);
AMD/ATI: R600 (80 nm), RV670 (55 nm, shrink), RV770 (55 nm, enhanced arch.), RV870 (40 nm), Cayman (40 nm);
OpenCL emerging as a common standard; Nvidia dates: 11/06, 10/07, 6/08; AMD/ATI dates: 11/05, 5/07, 11/07, 5/08; standard dates: 6/07, 11/07, 6/08, 11/08)
Figure 3.3: Overview of GPGPUs and their basic software support (1)
3. Overview of GPGPUs (4)
NVidia
Cores: GF100 (Fermi, 3/10, 40 nm/3000 mtrs), GF104 (Fermi, 07/10, 40 nm/1950 mtrs), GF110 (Fermi, 11/10, 40 nm/3000 mtrs)
Cards: GTX 470, GTX 480, GTX 460, GTX 580, GTX 560 Ti (1/11)
ALUs: 448, 480, 336, 512, 480
Memory bus: 320-bit, 384-bit, 192/256-bit, 384-bit, 384-bit
CUDA: Version 2.2, Version 2.3, Version 3.0, Version 3.1, Version 3.2, Version 4.0 Beta
AMD/ATI
9/09 10/10 12/10
Figure 3.4: Overview of GPGPUs and their basic software support (2)
3. Overview of GPGPUs (5)
Beginning with their Cypress-based HD 5xxx line and SDK v.2.0 AMD left Brook+
and started supporting OpenCL as their basic HLL programming language.
AMD/ATI
9/09 10/10 12/10
FP64 speed
• 1/2 of the FP32 speed for the Tesla 20-series
• 1/8 of the FP32 speed for the GeForce GTX 470/480/570/580 cards
• 1/12 of the FP32 speed for other GeForce GTX 4xx cards
ECC
available only on the Tesla 20-series
Memory size
Tesla 20 products have larger on board memory (3GB and 6GB)
Positioning Nvidia’s discussed GPGPU cards in their entire product portfolio [82]
3. Overview of GPGPUs (10)
b) Device parameters bound to the compute capability versions of Nvidia’s GPGPUs [81]
3. Overview of GPGPUs (11)
2.1: GF108, GF106, GF104, GF114 cores; GT 420/30/40, GTS 450, GTX 450, GTX 460, GTX 550Ti, GTX 560Ti cards
IC technology 90 nm 90 nm 65 nm 65 nm 65 nm
Nr. of transistors 681 mtrs 681 mtrs 754 mtrs 1400 mtrs 1400 mtrs
Die area 480 mm2 480 mm2 324 mm2 576 mm2 576 mm2
Core frequency 500 MHz 575 MHz 600 MHz 576 MHz 602 MHz
Computation
No of SMs (cores) 12 16 14 24 30
Shader frequency 1.2 GHz 1.35 GHz 1.512 GHz 1.242 GHz 1.296 GHz
Peak FP32 performance 230.4 GFLOPS 345.61 GFLOPS 508 GFLOPS 715 GFLOPS 933 GFLOPS
Mem. bandwidth 64 GB/s 86.4 GB/s 57.6 GB/s 111.9 GB/s 141.7 GB/s
Interface PCIe x16 PCIe x16 PCIe 2.0x16 PCIe 2.0x16 PCIe 2.0x16
1: Nvidia takes the FP32 capable Texture Processing Units also into consideration and calculates with 3 FP32 operations/cycle
Table 3.1: Main features of Nvidia's GPGPUs (1)
3. Overview of GPGPUs (13)
Remarks
In publications there are conflicting statements about whether the G80 makes use
of dual issue (a MAD and a MUL operation) within a period of four shader cycles.
Official specifications [22] declare the capability of dual issue, but other literature sources [64]
and even a textbook co-authored by one of the chief developers of the G80 (D. Kirk [65])
deny it.
A clarification can be found in a blog [66], revealing that the higher figure given in Nvidia's
specifications includes calculations made both by the ALUs in the SMs and by the texture
processing units (TPUs).
Nevertheless, the TPUs cannot be directly accessed by CUDA except for graphical tasks,
such as texture filtering.
Accordingly, in our discussion focusing on numerical calculations it is fair to take only
the MAD operations into account for specifying the peak numerical performance.
GTX 470 GTX 480 GTX 460 GTX 570 GTX 580
Core GF100 GF100 GF104 GF110 GF110
IC technology 40 nm 40 nm 40 nm 40 nm 40 nm
Nr. of transistors 3200 mtrs 3200 mtrs 1950 mtrs 3000 mtrs 3000 mtrs
Die area 529 mm2 529 mm2 367 mm2 520 mm2 520 mm2
Shader frequency 1215 MHz 1401 MHz 1350 MHz 1464 MHz 1544 MHz
Peak FP32 performance 1088 GFLOPS 1345 GFLOPS 907.2 GFLOPS 1405 GFLOPS 1581 GFLOPS
Peak FP64 performance 136 GFLOPS 168 GFLOPS 75.6 GFLOPS 175.6 GFLOPS 197.6 GFLOPS
Memory
Mem. transfer rate (eff) 3348 Mb/s 3698 Mb/s 3600 Mb/s 3800 Mb/s 4008 Mb/s
Mem. bandwidth 133.9 GB/s 177.4 GB/s 86.4/115.2 GB/s 152 GB/s 192.4 GB/s
Interface PCIe 2.0*16 PCIe 2.0*16 PCIe 2.0*16 PCIe 2.0*16 PCIe 2.0*16
MS Direct X 11 11 11 11 11
Remarks
IC technology 80 nm 55 nm 55 nm 55 nm 55 nm
Nr. of transistors 700 mtrs 666 mtrs 666 mtrs 956 mtrs 956 mtrs
Die area 408 mm2 192 mm2 192 mm2 260 mm2 260 mm2
Core frequency 740 MHz 670 MHz 775 MHz 625 MHz 750 MHz
Computation
No. of ALUs 320 320 320 800 800
Shader frequency 740 MHz 670 MHz 775 MHz 625 MHz 750 MHz
Peak FP32 performance 471.6 GFLOPS 429 GFLOPS 496 GFLOPS 1000 GFLOPS 1200 GFLOPS
Mem. bandwidth 105.6 GB/s 53.1 GB/s 72.0 GB/s 64 GB/s 118 GB/s
Mem. contr. Ring bus Ring bus Ring bus Crossbar Crossbar
System
Multi-GPU techn. CrossFire X CrossFire X CrossFire X CrossFire X CrossFire X
Interface PCIe x16 PCIe 2.0x16 PCIe 2.0x16 PCIe 2.0x16 PCIe 2.0x16
IC technology 40 nm 40 nm 40 nm
MS Direct X 11 11 11
IC technology 40 nm 40 nm
Mem. size 1 GB 1 GB
MS Direct X 11 11
IC technology 40 nm 40 nm 40 nm 40 nm
Nr. of transistors 2.64 billion 2.64 billion 2*2.64 billion 2*2.64 billion
Die area 389 mm2 389 mm2 2*389 mm2 2*389 mm2
Core frequency 800 MHz 880 MHz 830 MHz 880 MHz
Computation
No. of SIMD cores /VLIW4 ALUs 22/16 24/16 2*24/16 2*24/16
Shader frequency 800 MHz 880 MHz 830 MHz 880 MHz
Peak FP32 performance 2.25 TFLOPS 2.7 TFLOPS 5.1 TFLOPS 5.4 TFLOPS
Peak FP64 performance 0.5625 TFLOPS 0.683 TFLOPS 1.275 TFLOPS 1.35 TFLOPS
Memory
Mem. transfer rate (eff) 5000 Mb/s 5500 Mb/s 5000 Mb/s 5000 Mb/s
Mem. bandwidth 160 GB/s 176 GB/s 2*160 GB/s 2*160 GB/s
MS Direct X 11 11 11 11
Remark
The Radeon HD 5xxx line of cards is also designated as the Evergreen series and
the Radeon HD 6xxx line of cards as the Northern Islands series.
AMD HD 6990: 2 x ATI HD 6970 with slightly reduced memory and shader clock
Nvidia
GTX 570 ~ 350 $
GTX 580 ~ 500 $
AMD
HD 6970 ~ 400 $
HD 6990 ~ 700 $
(Dual 6970)
Trend: from on-card implementation (recent implementations) towards on-die integration (emerging implementations)
On-card accelerators
E.g. Nvidia Tesla C870 Nvidia Tesla D870 Nvidia Tesla S870
Nvidia Tesla C1060 Nvidia Tesla S1070
Nvidia Tesla C2070 Nvidia Tesla S2050/S2070
AMD FireStream 9170
AMD FireStream 9250
AMD FireStream 9370
G80-based GT200-based
6/07 6/08
6/07
Desktop D870
2*C870 incl.
3 GB GDDR3
SP: 691.2 GFLOPS
DP: -
6/07 6/08
1U Server S870 S1070
4*C870 incl. 4*C1060
6 GB GDDR3 16 GB GDDR3
SP: 1382 GFLOPS SP: 3732 GFLOPS
DP: - DP: 311 GFLOPS
2007 2008
Figure 4.4: Main functional units of Nvidia’s Tesla C870 card [2]
Figure 4.8: PCI-E x16 host adapter card of Nvidia’s Tesla D870 desktop [4]
Figure 4.11: Connection cable between Nvidia’s Tesla S870 1U rack and the adapter cards
inserted into PCI-E x16 slots of the host server [6]
4. Overview of data parallel accelerators (12)
11/09
Card C2050/C2070
3/6 GB GDDR5
SP: 1.03 TFLOPS1
DP: 0.515 TFLOPS
04/10 08/10
Module M2050/M2070 M2070Q
11/09
1U Server S2050/S2070
4*C2050/C2070
12/24 GB GDDR31
SP: 4.1 TFLOPS
DP: 2.06 TFLOPS
6/10
Remark
The M2070Q is an upgrade of the M2070 providing higher memory clock (introduced 08/2010)
4. Overview of data parallel accelerators (15)
Support of ECC
• Fermi based Tesla devices introduced the support of ECC.
• By contrast, at present neither Nvidia's regular GPGPU cards nor AMD's GPGPU or
DPA devices support ECC [76].
Tesla S2050/S2070 1U
The S2050/S2070 differ only in the memory size, the S2050 includes 12 GB, the S2070 24 GB.
GPU Specification
Number of processor cores: 448
Processor core clock: 1.15 GHz
Memory clock: 1.546 GHz
Memory interface: 384 bit
System Specification
Four Fermi GPUs
12.0/24.0 GB of GDDR5,
configured as 3.0/6.0 GB per GPU.
RV670-based RV770-based
11/07 6/08
6/08 10/08
9250 9250
1 GB GDDR3 Shipped
FP32: 1000 GFLOPS
FP64: ~300 GFLOPS
12/07 09/08
Stream Computing
SDK Version 1.0 Version 1.2
Brook+ Brook+
ACML (AMD Core Math Library) ACML (AMD Core Math Library)
CAL (Compute Abstraction Layer) CAL (Compute Abstraction Layer)
Rapid Mind
2007 2008
RV870-based
06/10 10/10
Core
Table 4.1: Main features of Nvidia’s data parallel accelerator cards (Tesla line) [73]
Core
Core frequency 800 MHz 625 MHz 700 MHz 825 MHz
ALU frequency 800 MHz 325 MHz 700 MHz 825 MHz
Peak FP32 performance 512 GFLOPS 1 TFLOPS 2016 GFLOPS 2640 GFLOPS
Peak FP64 performance ~200 GFLOPS ~250 GFLOPS 403.2 GFLOPS 528 GFLOPS
Memory
Mem. transfer rate (eff) 1600 Gb/s 1986 Gb/s 4000 Gb/s 4600 Gb/s
Mem. bandwidth 51.2 GB/s 63.5 GB/s 128 GB/s 147.2 GB/s
Mem. size 2 GB 1 GB 2 GB 4 GB
Table 4.2: Main features of AMD/ATI’s data parallel accelerator cards (FireStream line) [67]
Nvidia Tesla
C2050 ~ 2000 $
C2070 ~ 4000 $
S2050 ~ 13 000 $
S2070 ~ 19 000 $
NVidia GTX
GTX580 ~ 500 $
Dezső Sima
Aim
Brief introduction and overview.
Announced: 30. Sept. 2009 at NVidia’s GPU Technology Conference, available: 1Q 2010 [83]
Sub-families of Fermi
Fermi includes three sub-families with the following representative cores and features:
1 In the associated flagship card (GTX 480) however, one of the SMs has been disabled, due to overheating
problems, so it has actually only 15 SIMD cores, called Streaming Multiprocessors (SMs) by Nvidia and 480
FP32 EUs [69]
(Table residue listing the vendors' equivalent terms for the execution units: ALU (Arithmetic Logic Unit), VLIW4/VLIW5 ALU, Stream core (in OpenCL SDKs), Streaming Processor, Compute Unit Pipeline (6900 ISA), CUDA core, SIMD pipeline, Thread processor and Shader processor (pre-OpenCL terms))
These models are only outlined here, a detailed description can be found in the related
documentation [147].
Remark
The outlined four abstractions remained basically unchanged throughout the life span of PTX
(from version 1.0 (6/2007) to version 2.3 (3/2011)).
A set of ALUs
within the
SIMD cores
A set of ALUs
within the
SIMD cores
Compilation to
PTX pseudo ISA
instructions
Translation to
executable CUBIN file
at load time
CUBIN FILE
Source: [68]
5.1.2 Nvidia’s PTX Virtual Machine Concept (12)
The PTX virtual machine concept gives rise to a two phase compilation process.
1) First, the application, e.g. a CUDA or OpenCL program, will be compiled to a pseudo code,
also called PTX ISA code or PTX code, by the appropriate compiler.
The PTX code is a pseudo code since it is not directly executable and needs to be translated
to the actual ISA of a given GPGPU to become executable.
Application
(CUDA C/OpenCL file)
Two-phase compilation
CUDA C compiler
or
OpenCL compiler
• First phase:
Compilation to the PTX ISA format
(stored in text format)
pseudo ISA instructions)
CUDA driver
• Second phase (during loading):
JIT-compilation to
executable object code
(called CUBIN file).
CUBIN file (executable on the GPGPU)
5.1.2 Nvidia’s PTX Virtual Machine Concept (13)
2) In order to become executable the PTX code needs to be compiled to the actual ISA code of a
particular GPGPU, called the CUBIN file.
This compilation is performed by the CUDA driver during loading the program (Just-In-Time).
Application
(CUDA C/OpenCL file)
Two-phase compilation
CUDA C compiler
or
OpenCL compiler
• First phase:
Compilation to the PTX ISA format (pseudo ISA instructions, stored in text format)
CUDA driver
• The compiled pseudo ISA code (PTX code) remains in principle independent from the
actual hardware implementation of a target GPGPU, i.e. it is portable over subsequent
GPGPU families.
Porting a PTX file to a GPGPU of a lower compute capability level, however, may need emulation
of features not implemented in hardware, which slows down execution.
Forward portability of GPGPU code (CUBIN code) is provided however only within major
compute capability versions.
• Forward portability of PTX code is highly advantageous in the recent rapid evolution phase of
GPGPU technology as it results in less costs for code refactoring.
Code refactoring costs are a kind of software maintenance costs that arise when the user
switches from a given generation to a subsequent GPGPU generation (like from GT200
based devices to GF100 or GF110-based devices) or to a new software environment
(like from CUDA 1.x SDK to CUDA 2.x or CUDA 3.x SDK).
Remarks [149]
1) • Nvidia manages the evolution of their devices and programming environment by maintaining
compute capability versions of both
• their intermediate virtual PTX architectures (PTX ISA) and
• their real architectures (GPGPU ISA).
• Designation of the compute capability versions
• Subsequent versions of the intermediate PTX ISA are designated as PTX ISA 1.x or 2.x.
• Subsequent versions of GPGPU ISAs are designated as sm_1x/sm_2x or simply by 1.x/2.x.
• The first digit 1 or 2 denotes the major version number, the second or subsequent digit
denotes the minor version.
• Major versions of 1.x or 1x relate to pre-Fermi solutions whereas those of 2.x or 2x
to Fermi based solutions.
5.1.2 Nvidia’s PTX Virtual Machine Concept (18)
5.1.2 Nvidia’s PTX Virtual Machine Concept (19)
2.1: GF108, GF106, GF104, GF114 — GT 420/30/40, GTS 450, GTX 450, GTX 460, GTX 550Ti, GTX 560Ti
Remarks (cont.)
2) Contrasting the virtual machine concept with traditional computer technology
Whereas the PTX virtual machine concept is based on a forward compatible but not directly
executable compiler target code (pseudo code), in traditional computer technology
the compiled code, such as an x86 object code, is immediately executable by the processor.
Earlier CISC processors, like Intel's x86 processors up to the Pentium, executed x86 code
directly in hardware.
Subsequent CISCs, beginning with the 2nd generation superscalars (like the Pentium Pro) and
including current x86 processors, like Intel's Nehalem (2008) or AMD's Bulldozer (2011),
first map x86 CISC instructions during decoding to internally defined RISC instructions.
In these processors a ROM-based µcode engine (i.e. firmware) supports the decoding of
complex x86 instructions (instructions which need more than 4 RISC instructions).
The RISC core of the processor then executes the requested RISC operations directly.
Remarks (cont.)
3) Nvidia's CUDA compiler (nvcc) has been designated as the CUDA C compiler beginning with
CUDA version 3.0, to stress its support of C.
Remarks (cont.)
4) nvcc can be used to generate either architecture specific files (CUBIN files) or
forward compatible PTX versions of the kernels [52].
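As an illustration, the two kinds of output can be requested from nvcc roughly as follows (a minimal sketch;
the file name vecAdd.cu is a placeholder, and sm_20 stands for the Fermi GPGPU ISA):

  nvcc -ptx vecAdd.cu -o vecAdd.ptx      # forward compatible pseudo ISA (PTX) text file
  nvcc -cubin -arch=sm_20 vecAdd.cu      # architecture specific binary (CUBIN) for sm_20 devices

The PTX file can later be JIT-compiled by the CUDA driver for whatever GPGPU is present,
whereas the CUBIN file runs only on devices of the targeted major compute capability version.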
Remarks (cont.)
The virtual machine concept underlying both Nvidia’s and AMD’s GPGPUs is similar to
the virtual machine concept underlying Java.
• For Java there is also an inherent computational model and a pseudo ISA, called the
Java bytecode.
• Applications written in Java will first be compiled to the platform independent Java bytecode.
• The Java bytecode will then either be interpreted by the Java Runtime Environment (JRE)
installed on the end user's computer or compiled at runtime by the JRE's Just-In-Time (JIT)
compiler.
With PTX 2.0 Nvidia states that they have created a longevity ISA for GPUs,
much like the x86 ISA for CPUs.
Based on the key innovations and declared goals of Fermi's ISA (PTX 2.0), and considering
the significant innovations and enhancements made in the microarchitecture,
it can be expected that Nvidia's GPGPUs have entered a phase of relative consolidation.
These new features greatly improve GPU programmability, accuracy and performance.
a) Unified address space for all variables and pointers with a single set of
load/store instructions-1 [58]
• In PTX 1.0 there are three separate address spaces
(thread private local, block shared and global)
with specific load/store instructions to each one of the three address spaces.
• Programs could load or store values only in a particular target address space whose addresses
were known at compile time.
This made it difficult to fully implement C and C++ pointers, since a pointer's target address
may only be determined dynamically at run time.
a) Unified address space for all variables and pointers with a single set of
load/store instructions-2 [58]
• PTX 2.0 unifies all three address spaces into a single continuous address space that
can be accessed by a single set of load/store instructions.
• PTX 2.0 allows unified pointers to be used to pass objects in any memory space;
Fermi's hardware automatically maps pointer references to the correct memory space.
Thus the concept of the unified address space enables Fermi to support C++ programs.
• Nvidia’s previous generation GPGPUs (G80, G92, GT200) provide 32 bit addressing
for load/store instructions,
• PTX 2.0 extends the addressing capability to 64 bits for future growth;
recent Fermi implementations, however, use only 40-bit addresses, allowing an
address space of 1 Terabyte to be accessed.
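As a minimal sketch of what this unification makes convenient (assumed function and kernel names;
the fragment compiles as a separate .cu file for sm_20): on compute capability 2.x devices the same
device function can dereference a pointer regardless of whether it points into global or shared memory,
whereas on 1.x devices the compiler had to resolve the target address space statically.

  // p may point into global OR shared memory on sm_20 (generic addressing, PTX 2.0)
  __device__ float sum3(const float *p)
  {
      return p[0] + p[1] + p[2];
  }

  // launch with 64 threads per block; gdata is assumed to hold at least 64 + 2 elements
  __global__ void demo(const float *gdata, float *out)
  {
      __shared__ float sdata[64];
      sdata[threadIdx.x] = gdata[threadIdx.x];
      __syncthreads();
      // the same load/store path serves both the global and the shared pointer
      out[threadIdx.x] = sum3(gdata + threadIdx.x) + sum3(sdata);
  }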
Figure 5.1.2: Supported languages and APIs (as of CUDA version 3.0)
• CUDA C exposes the CUDA programming model as a minimal set of C language extensions
(e.g. CUDA C source files to be compiled with nvcc, or libraries such as CUBLAS).
• These extensions allow kernels to be defined along with the dimensions of the associated
grids and thread blocks.
• A CUDA C program must be compiled with nvcc.
• The CUDA Driver API is a lower level C API that allows kernels to be loaded and launched
as modules of binary or assembly CUDA code, and the platform to be managed.
• Binary and assembly codes are usually obtained by compiling kernels written in C.
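A minimal host-side sketch of this Driver API usage (assuming a CUDA 4.0-style driver API; the module
file vecAdd.cubin and the kernel name vecAdd are placeholders, and error checking is omitted):

  #include <cuda.h>

  int main(void)
  {
      CUdevice dev; CUcontext ctx; CUmodule mod; CUfunction fn;
      int n = 1 << 20;

      cuInit(0);
      cuDeviceGet(&dev, 0);
      cuCtxCreate(&ctx, 0, dev);                 // context on the first GPU
      cuModuleLoad(&mod, "vecAdd.cubin");        // load a precompiled CUBIN (or PTX) module
      cuModuleGetFunction(&fn, mod, "vecAdd");   // look up the kernel in the module

      CUdeviceptr dA, dB, dC;
      cuMemAlloc(&dA, n * sizeof(float));
      cuMemAlloc(&dB, n * sizeof(float));
      cuMemAlloc(&dC, n * sizeof(float));

      void *args[] = { &dA, &dB, &dC, &n };      // kernel arguments
      cuLaunchKernel(fn, n / 256, 1, 1,          // grid dimensions
                         256, 1, 1,              // thread block dimensions
                         0, 0, args, 0);         // shared memory, stream, arguments
      cuCtxSynchronize();
      return 0;
  }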
Remark
Compute capability dependent memory sizes of Nvidia’s GPGPUs
Overview
• CUDA C allows the programmer to define kernels as C functions that, when called,
are executed N times in parallel by N different CUDA threads, as opposed to only once,
like regular C functions.
• A kernel is defined by
• using the __global__ declaration specifier and
• declaring the instructions to be executed.
• The number of CUDA threads that execute the kernel for a given kernel call is specified
during kernel invocation by using the <<< … >>> execution configuration syntax.
• Each thread that executes the kernel is given a unique thread ID that is accessible within
the kernel through the built-in threadIdx variable.
The subsequent sample code illustrates a kernel that adds two vectors A and B of size N and
stores the result into vector C as well as its invocation.
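A minimal sketch of such a kernel and its invocation (host-side allocation and copies are included
for completeness; a vector addition example of this structure is given in the CUDA Programming Guide [11]):

  #include <cuda_runtime.h>
  #include <cstdio>

  #define N 256

  // Kernel definition: each of the N threads adds one element pair
  __global__ void VecAdd(const float *A, const float *B, float *C)
  {
      int i = threadIdx.x;                      // unique thread ID within the thread block
      C[i] = A[i] + B[i];
  }

  int main()
  {
      float hA[N], hB[N], hC[N];
      for (int i = 0; i < N; ++i) { hA[i] = i; hB[i] = 2.0f * i; }

      float *dA, *dB, *dC;
      cudaMalloc((void**)&dA, N * sizeof(float));
      cudaMalloc((void**)&dB, N * sizeof(float));
      cudaMalloc((void**)&dC, N * sizeof(float));
      cudaMemcpy(dA, hA, N * sizeof(float), cudaMemcpyHostToDevice);
      cudaMemcpy(dB, hB, N * sizeof(float), cudaMemcpyHostToDevice);

      // Kernel invocation: one thread block of N threads, specified with the
      // <<< ... >>> execution configuration syntax
      VecAdd<<<1, N>>>(dA, dB, dC);

      cudaMemcpy(hC, dC, N * sizeof(float), cudaMemcpyDeviceToHost);
      printf("C[10] = %f\n", hC[10]);

      cudaFree(dA); cudaFree(dB); cudaFree(dC);
      return 0;
  }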
c2) The allocation of threads and thread blocks to ALUs and SIMD cores
Available register spaces for threads, thread blocks and grids-1 [43]
Available register spaces for threads, thread blocks and grids-2 [43]
Per-thread
reg. space
Major innovations
a) Concurrent kernel execution
b) True two level cache hierarchy
c) Configurable shared memory/L1 cache per SM
d) ECC support
Major enhancements
a) Vastly increased FP64 performance
b) Greatly reduced context switching times
c) 10-20 times faster atomic memory operations
It protects
• DRAM memory
• register files
• shared memories
• L1 and L2 caches.
Remark
ECC support is provided only for Tesla devices.
1 The G80/G92 does not support FP64.
5.1.5 Major innovations and enhancements of Fermi’s microarchitecture (8)
Nvidia: 16 cores (Streaming Multiprocessors, SMs)
Remark
In the associated flagship card (GTX 480), however, one SM has been disabled
due to overheating problems, so it actually has 15 SMs and 480 ALUs [a]
Fermi GF100
Note
The high level microarchitecture of Fermi evolved from a graphics oriented structure
to a computation oriented one, complemented with the units needed for graphics processing.
SFU: Special
Function Unit
1 SM includes 32 ALUs
(called "CUDA cores" by Nvidia)
GT80 SM [57]
(Figure: a Streaming Multiprocessor with Instruction L1 cache, Instruction Fetch/Dispatch,
Shared Memory, 8 SPs and 2 SFUs)
G80 SM:
• 16 KB Shared Memory
• 8 K registers x 32-bit/SM
• up to 24 active warps/SM, up to 768 active threads/SM
• ~10 registers/thread on average
GT200 SM:
• 16 KB Shared Memory
• 16 K registers x 32-bit/SM
• up to 32 active warps/SM, up to 1 K active threads/SM
• ~16 registers/thread on average
GF100 SM:
• 64 KB Shared Memory/L1 Cache
• up to 48 active warps/SM, up to 1536 active threads/SM
• ~20 registers/thread on average
• 32 threads/warp
• 1 FMA FPU (not shown)
5.1.6 Microarchitecture of Fermi GF100 (6)
GF100 [70] GF104 [55]
Further evolution of the cores
(SMs) in Nvidia’s GPGPUs -2
GF104 [55]
Available specifications:
Data about
• the number of active warps/SM and
• the number of active threads/SM
are at present (March 2011) not available.
1 SM includes 32 ALUs
called “Cuda cores” by NVidia)
SP FP:32-bit
Remark
The Fermi line supports the Fused Multiply-Add (FMA) operation, rather than the Multiply-Add
operation performed in previous generations.
Previous lines
Fermi
Figure 5.1.7: Contrasting the Multiply-Add (MAD) and the Fused-Multiply-Add (FMA) operations
[56]
Host Device
• A global scheduler, called the Gigathread scheduler assigns work to each SM.
• In previous generations (G80, G92, GT200) the global scheduler could only assign work to the
SMs from a single kernel (serial kernel execution).
• The global scheduler of Fermi is able to run up to 16 different kernels concurrently, one per SM.
• A large kernel may be spread over multiple SMs.
The context switch time occurring between kernel switches is greatly reduced compared to
the previous generation, from about 250 µs to about 20 µs (needed for cleaning up TLBs,
dirty data in caches, registers etc.) [39].
Remark
The number of threads constituting a warp
is an implementation decision and not
part of the CUDA programming model.
E.g. in the G80 there are 24 warps per SM, whereas
in the GT200 there are 32 warps per SM.
Nvidia did not reveal details of the microarchitecture of Fermi so the subsequent
discussion of warp scheduling is based on assumptions given in the sources [39], [58].
Assumed block diagram of the Fermi GF100 microarchitecture and its operation
Remark
Fermi’s front end is similar to the basic building block of AMD’s Bulldozer core (2011)
that consists of two tightly coupled thin cores [85].
1 The G80/G92 does not support DP FP.
5.1.6 Microarchitecture of Fermi GF100 (25)
Official documentation reveals only that the Fermi GF100 has dual issue, zero overhead,
prioritized warp scheduling [58].
Remarks
D. Kirk, one of the developers of Nvidia’s GPGPUs details warp scheduling in [12],
but this publication includes two conflicting figures, one indicating to coarse grain and the
other to fine grain warp scheduling as shown below.
Underlying microarchitecture of warp scheduling in an SM of the G80
(Figure: instruction L1 cache (I$), multithreaded instruction buffer, register file (RF),
constant cache (C$ L1), shared memory, operand select, MAD and SFU units)
• The G80 fetches one warp instruction per issue cycle from the instruction L1 cache
into any instruction buffer slot.
• It issues one "ready-to-go" warp instruction per issue cycle from any warp instruction
buffer slot.
• Operand scoreboarding is used to prevent hazards:
an instruction becomes ready after all needed operand values are deposited;
cleared instructions become eligible for issue.
• Issue selection is based on round-robin/age of warp.
• The SM broadcasts the same instruction to the 32 threads of a warp.
Scheduling policy of warps in an SM of the G80 indicating coarse grain warp scheduling
(Figure: SM Warp Scheduling — TB1, W1 stall; TB2, W1 stall; TB3, W2 stall; …)
Note
The scheduling scheme shown indicates coarse grain warp scheduling.
Scheduling policy of warps in an SM of the G80 indicating fine grain warp scheduling
(Figure: issue sequence e.g. warp 3 instruction 95, …, warp 8 instruction 12, warp 3 instruction 96, …)
Figure 5.1.15: Warp scheduling in the G80 [12]
5.1.6 Microarchitecture of Fermi GF100 (30)
Key differences in the block diagrams of the microarchitectures of the GT200 and Fermi
(Assumed block diagrams [39] without showing result data paths)
Vastly increased
execution resources
5.1.7 Comparing the microarchitectures of Fermi GF100 and GT200 (2)
Throughput of arithmetic operations per clock cycle per SM in the GT200 [43]
5.1.7 Comparing the microarchitectures of Fermi GF100 and GT200 (4)
Latency, throughput and warp issue rate of the arithmetic pipelines of the GT200 [60]
• A global scheduler, called the Gigathread scheduler, assigns work to each SM.
• In the GT200 and all previous generations (G80, G92) the global scheduler could only assign
work to the SMs from a single kernel (serial kernel execution).
• By contrast, Fermi's global scheduler is able to run up to 16 different kernels concurrently,
presumably one per SM.
The global scheduler distributes the thread blocks of the running kernel to the available SMs,
typically assigning multiple blocks to each SM, as indicated in the Figure below.
Each thread block may consist of multiple warps, e.g. of 4 warps, as indicated in the Figure.
Two issue rates need to be distinguished:
• the maximal issue rate of warps to a particular group of pipelined execution units of an SM,
called the issue rate by Nvidia, and
• the maximal issue rate of warps to the execution pipeline of the SM.
The maximal issue rate of warps to a particular group of pipelined execution units
(called the warp issue rate (clocks per warp) in Nvidia's terminology)
depends on the number and throughput of the individual execution units in a group,
or, from another point of view, on the total throughput of all execution units
constituting a group of execution units, called arithmetic and flow control pipelines.
The issue rate of the arithmetic and flow control pipelines will be determined by the warp size
(32 threads) and the throughput (ops/clock) of the arithmetic or flow control pipelines,
as shown below for the arithmetic pipelines.
Table 5.1.3: Issue rate of the arithmetic pipelines in the GT 200 [60]
5.1.7 Comparing the microarchitectures of Fermi GF100 and GT200 (12)
Accordingly,
• the issue of an FP32 MUL or MAD warp needs 32/8 = 4 clock cycles, or
• the issue of an FP64 MUL or MAD warp needs 32/1 = 32 clock cycles.
In the GT200 FP64 instructions are executed at 1/8 rate of FP32 instructions.
The maximal issue rate of warps to the execution pipeline of the SM – 1 (based on [25])
As discussed previously, the Warp Schedulers of the SM can issue warps to the arithmetic
pipelines at the associated issue rates, e.g. in case of FX32 or FP32 warp instructions
in every fourth shader cycle to the FPU units.
Nevertheless, the scheduler of the GT200 is capable of issuing warp instructions in every second
shader cycle to unoccupied arithmetic pipelines of the SM if there are no dependencies
with previously issued instructions.
E.g. after issuing an FP32 MAD instruction to the FPU units, the Warp Scheduler can issue
an FP32 MUL instruction to the SFU units already two cycles later, if these units are not busy
and there is no data dependency between the two instructions.
The FP32 MUL warp instruction will then occupy the SFU units for 4 shader cycles.
5.1.7 Comparing the microarchitectures of Fermi GF100 and GT200 (15)
The maximal issue rate of warps to the execution pipeline of an SM – 2 (based on [25])
In this way the Warp Schedulers of the SMs may issue up to two warp instructions to the
single execution pipeline of the SM in every four shader cycles, provided that there are
no resource or data dependencies.
Figure 5.1.22: Dual issue of warp instructions in every 4 cycles in the GT200
(Based on [25])
PFP64 = 1/32 x 32 x 2 x fs x n
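Evaluating this formula for the GT200-based GTX 280 / Tesla C1060 (assuming their 30 SMs and
1.296 GHz shader frequency, figures inserted here only for illustration):

  PFP64 = 1/32 x 32 x 2 x 1.296 GHz x 30 ≈ 77.8 GFLOPS

which is consistent with the 30 x 1 x 2 x 1296 peak FP64 figure quoted for the C1060 card later in this section.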
Figure 5.1.23: Contrasting the overall structures of the GF104 and the GF100 [69]
5.1.8 Microarchitecture of Fermi GF104 (2)
Note
In the GF104 based GTX 460 flagship card Nvidia activated only 7 SMs rather than
all 8 SMs available, due to overheating.
Execution resources per SM in the GF104 vs the GF100:
                          GF100    GF104
No. of SP FX/FP32 ALUs      32       48
No. of L/S units            16       16
No. of SFUs                  4        8
No. of DP FP ALUs            8        4
Note
The modifications done in the GF104 vs the GF100 aim at increasing graphics performance per
SM at the expense of FP64 performance while halving the number of SMs in order to
reduce power consumption and price.
Peak computational performance data for the Fermi GF104 based GTX 460 card
According to the computational capability data [43] and in harmony with the figure on the
previous slide:
Peak FP32 performance of a GTX460 card while it executes FMA warp instructions:
Peak FP64 performance of a GTX 460 card while it executes FMA instructions:
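A worked evaluation (assuming the GTX 460's 7 active SMs and its 1.35 GHz shader clock,
figures inserted here only for illustration):

  Peak FP32 (FMA): 7 SMs x 48 FP32 ALUs x 2 operations x 1.35 GHz ≈ 907 GFLOPS
  Peak FP64 (FMA): 7 SMs x 4 FP64 ALUs x 2 operations x 1.35 GHz ≈ 75.6 GFLOPS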
Key differences between the GF100-based GTX 480 and the GF110-based GTX 580 cards
No. of SMs: 15 (GTX 480) vs 16 (GTX 580)
• Due to its higher shader frequency and increased number of SMs the GF110-based
GTX 580 card achieves ~10 % higher peak performance than the GF100-based GTX 480 card,
at somewhat reduced power consumption.
Tesla cards
Flagship Tesla card: C1060 / C2070
Peak FP64 perf./card: 30 x 1 x 2 x 1296 MHz ≈ 77.8 GFLOPS / 14 x 16 x 2 x 1150 MHz ≈ 515.2 GFLOPS
1 In their GPGPU Fermi cards Nvidia activates only 4 FP64 units out of the available 16.
GPGPUs/DPAs 5.2
Case example 2:
AMD’s Cayman core
Dezső Sima
Aim
Brief introduction and overview.
5.2.3 AMD’s high level data and task parallel programming model
Remarks
Cards
HD 6850 (Barts Pro): 10/2010
HD 6870 (Barts XT): 10/2010
HD 6950 (Cayman Pro): 12/2010
HD 6970 (Cayman XT): 12/2010
HD 6990 (Antilles): 3/2011
Remarks
1) The Barts core (underlying AMD’s HD 68xx cards) is named after Saint Barthélemy island.
2) The Cayman core (underlying AMD’s HD 69xx cards) is named after the Cayman island.
3) Cayman (AMD HD 69xx) was originally planned as a 32 nm device.
But both TSMC and Global Foundries canceled their 32 nm technology efforts (in 11/2009
and 4/2010, respectively) to focus on the 28 nm process, so AMD had to use the 40 nm feature size
for Cayman while eliminating some features already foreseen for that device [88].
• For their earlier GPGPUs, including the Evergreen series (HD 5xxx), AMD made use of the
ATI brand.
• But starting with the Northern Islands series AMD discontinued using the ATI brand
and began to use the AMD brand to emphasize the correlation with their computing
platforms.
• At the same time AMD also renamed the new version (v2.3) of their ATI Stream SDK
to AMD Accelerated Parallel Processing (APP).
Beginning with their Cypress-based HD 5xxx line and SDK v.2.0 AMD left Brook+
and started supporting OpenCL.
This had considerable implications for the microarchitecture of AMD's GPGPUs, for AMD IL and also
for the terminology used in connection with AMD's GPGPUs.
Remark
1) In their Pre-OpenCL and OpenCL publications AMD makes use of partly contradictory
terminology.
In Pre-OpenCL publications (relating to RV700 based HD4xxx cards or before)
AMD interprets the term “stream core” as the individual execution units within
the VLIW ALUs, whereas
in OpenCL terminology the same term designates the complete VLIW5 or VLIW4 ALU.
(SIMD core)
2) AMD designates their RV770 based HD4xxx cards as Terascale Graphics Engines [36]
referring to the fact that the HD4800 card reached a peak FP32 performance of 1 TFLOPS.
3) Beginning with the RV870 based Evergreen line (HD5xxx cards) AMD designated their
GPGPU architecture as the Terascale 2 Architecture referring to the fact that
the peak FP32 performance of the HD 5850/5870 cards surpassed the 2 TFLOPS mark.
Designations of the VLIW ALU and its roughly corresponding Nvidia unit:
• AMD: VLIW4/VLIW5 ALU, called Stream core (in OpenCL SDKs),
  Compute Unit Pipeline (6900 ISA),
  SIMD pipeline / Thread processor / Shader processor (Pre-OpenCL terms)
• Nvidia: ALU (Arithmetic Logic Unit), called Streaming Processor, now CUDA core
For their GPGPU technology AMD makes use of the virtual machine concept, like Nvidia with their
PTX virtual machine.
AMD’s virtual machine is composed of
• the pseudo ISA, called AMD IL and
• its underlying computational model.
Remarks
1) Originally, the IL (Intermediate Language) was based on Microsoft's DX9 shader language [104].
2) About 2008 AMD made a far reaching decision to replace their Brook+ software environment
with the OpenCL environment, as already mentioned in the previous Section.
Figure 5.2.2: AMD's Brook+ based programming environment [90]
Figure 5.2.3: AMD's OpenCL based programming environment [91]
Figure 5.2.4: Introduction of Local and Global Data Share memories (LDSs, GDS) in AMD’s HD
4800 [36]
3) AMD also provides a low level programming interface to their GPGPUs, called the
CAL (Compute Abstraction Layer) programming interface [106], [107].
Figure 5.2.5: AMD’s OpenCL based HLL and the low level CAL programming environment [91]
The CAL programming interface [104]
• is actually a low-level device-driver library that allows direct control of the hardware.
• The set of low-level APIs provided allows programmers to directly open devices,
allocate memory, transfer data and initiate kernel execution, and thus to optimize
performance.
• An integral part of the CAL interface is a JIT compiler for AMD IL.
5.2.2 AMD’s virtual machine concept (6)
CAL compilation
from AMD IL to the device specific ISA
Figure 5.2.6: Kernel compilation from AMD IL to Device-Specific ISA (disassembled) [148]
The AMD IL pseudo ISA and its underlying parallel computational model together constitute
a virtual machine.
From its conception on, this virtual machine, like Nvidia's, has evolved in many
aspects, but due to the lacking documentation of earlier AMD IL versions this evolution
cannot be tracked in detail.
The following brief overview is based on version 2.0e of the AMD Intermediate Language
Specification (Dec. 2010) [105].
The parallel computational model inherent in AMD IL is set up of three key abstractions:
a) The model of execution resources
b) The memory model, and
c) The parallel execution model (parallel machine model) including
c1) The allocation of execution objects to the execution pipelines
c2) The data sharing concept and
c3) The synchronization concept.
These abstractions will be outlined below very briefly and in a simplified form.
The execution resources include a set of SIMT cores, each incorporating a number of ALUs
that are able to perform a set of given computations.
Ideally, the same parallel execution model underlies all main components of the
programming environment, such as
• the real ISA of the GPGPU,
• the pseudo ISA and
• the HLL (like Brook+, OpenCL).
The interpretation of the notion “AMD’s data and task parallel programming model”
A peculiarity of GPGPU technology is that its high-level programming model is associated
with a dedicated high level language (HLL) programming environment, such as
CUDA, Brook+ or OpenCL.
(By contrast, the programming model of traditional CPU technology is associated with an
entire class of HLLs, the imperative languages, like Fortran, C, C++ etc.,
as these languages share the same high-level programming model.)
With their SDK 2.0 (Software Development Kit) AMD changed the supported high-level
programming environment from Brook+ to OpenCL in 2009.
Accordingly, AMD’s high-level data parallel programming model became that of OpenCL.
• So a distinction is needed between AMD’s Pre-OpenCL and OpenCL
programming models.
• The next section discusses the programming model of OpenCL.
Changing AMD’s GPGPU terminology with distinction between Pre-OpenCL and OpenCL
terminology
Along with changing their programming model AMD also changed their related terminology
by distinguishing between Pre-OpenCL and OpenCL terminology, as already discussed
in Section 5.2.1.
For example, in their Pre-OpenCL terminology AMD speaks about threads and thread groups,
whereas in OpenCL terminology these terms are designated as Work items and Work Groups.
For a summary of the terminology changes see [109].
OpenCL includes
• a language (resembling C99) to write kernels, which allows data parallelism to be utilized
by using GPGPUs, and
• APIs to control the platform and program execution.
Main components of AMD’s data and task parallel programming model of OpenCL [109]
The data and task parallel programming model is based on the following abstractions
a) The platform model
b) The memory model of the platform
c) The execution model
c1) Command queues
c2) The kernel concept as a means to utilize data parallelism
c3) The concept of NDRanges-1
c4) The concept of task graphs as a means to utilize task parallelism
c5) The scheme of allocation Work items and Work Groups to execution resources
of the platform model
c6) The data sharing concept
c7) The synchronization concept
An abstract, hierarchical model that allows a unified view of different kinds of processors.
• In this model a Host coordinates execution and data transfers to and from an array
of Compute Devices.
• A Compute Device may be a GPGPU or even a CPU.
• Each Compute Device is composed of an array of Compute Units
(e.g. VLIW cores in case of a GPGPU card),
whereas each Compute Unit incorporates an array of Processing Elements
(e.g. VLIW5 ALUs in case of AMD's GPGPUs).
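A minimal host-side sketch of how this platform model is addressed through the OpenCL API
(an illustration only; error handling is omitted): the host enumerates a platform and a Compute Device,
then creates a context and a command queue for it.

  #include <CL/cl.h>
  #include <stdio.h>

  int main(void)
  {
      cl_platform_id platform;
      cl_device_id   device;
      cl_int         err;

      clGetPlatformIDs(1, &platform, NULL);                            // first available platform
      clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);  // a GPGPU as Compute Device

      cl_context       ctx   = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
      cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err); // queue tied to this device

      printf("context and command queue created (err = %d)\n", err);
      return 0;
  }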
5.2.3 AMD’s high level data and task parallel programming model (7)
Card
SIMD core
• Each Compute Unit is assigned a Local memory that is typically implemented on-chip,
providing lower latency and higher bandwidth than the Global Memory.
• There is also an on-chip Constant memory available to all Compute Units, which
allows read-only parameters to be reused during computations.
• Finally, a Private Memory space, typically a small register space, is allocated to each
Processing Element, e.g. to each VLIW5 ALU of an AMD GPGPU.
The memory model including assigned Work items and Work Groups (Based on [94])
• Command queues coordinate data movements between the host and the Compute Devices
(e.g. GPGPU cards) and launch kernels.
• An OpenCL command queue is created by the developer and is associated with a specific
Compute Device.
• To target multiple OpenCL Compute Devices simultaneously the developer needs to
create multiple command queues.
• Command queues allow dependencies between tasks to be specified (in the form of a task graph),
ensuring that tasks will be executed in the specified order.
• The OpenCL runtime will execute tasks in parallel if their dependencies are
satisfied and the platform is capable of doing so.
• In this way command queues, as conceived, allow a task parallel execution
model to be implemented.
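As a sketch of how such a dependency can be expressed (a fragment continuing the host-side setup shown
earlier; the objects queue, buf, kernel, bytes and host_ptr are assumed to have been created already):
the event returned by one enqueued command is passed in the wait list of the next one, so the kernel
starts only after the preceding write has completed.

  cl_event write_done;

  // Task 1: copy input data to the device, returning an event
  clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, bytes, host_ptr, 0, NULL, &write_done);

  // Task 2: launch the kernel only after Task 1 has finished (one-element wait list)
  size_t global = 1024;
  clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 1, &write_done, NULL);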
Aims
• either to parallelize an application across multiple compute devices (SIMD cores),
• or to run multiple completely independent streams of computation (kernels) across
multiple compute devices (SIMD cores).
The latter possibility is available only on the Cayman core.
• Kernels are high level language constructs that allow data parallelism to be expressed and
utilized for speeding up computations by means of a GPGPU.
• Kernels are written in a C99-like language.
• OpenCL kernels are executed over an index space, which can be 1, 2 or 3 dimensional.
• The index space is designated as the NDRange (N-dimensional Range).
The subsequent Figure shows an example of a 2-dimensional index space, which has
Gx x Gy elements.
• Each Work item in the Work Group is assigned a Work Group id, labeled as wx, wy
in the Figure, as well as a local id, labeled as Sx, Sy in the Figure.
• Each Work item also gets a global id, which can be derived from its Work Group and
local ids.
5.2.3 AMD’s high level data and task parallel programming model (17)
Example for specification of the global work size and explicit specification of Work Groups [96]
Example for a simple kernel written for adding two vectors [144]
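A minimal sketch in the spirit of these two examples (assumed names vec_add, queue, kernel;
the global work size of 1024 and the Work Group size of 64 are chosen only for illustration):

  // Kernel (C99-like OpenCL C): each Work item adds one element pair
  __kernel void vec_add(__global const float *A,
                        __global const float *B,
                        __global float       *C)
  {
      int gid = get_global_id(0);     // global id within the 1-dimensional NDRange
      C[gid] = A[gid] + B[gid];
  }

  // Host side: global work size of 1024 Work items, explicit Work Groups of 64 Work items
  size_t global_size = 1024;
  size_t local_size  = 64;
  clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);

Choosing the Work Group size as a multiple of the wavefront size (64 on AMD's GPGPUs) keeps the SIMD
cores fully utilized.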
Remarks
1) Compilation of the kernel code delivers both the GPU code and the CPU code.
Remarks (cont)
Clauses
• Groups of instructions of the same clause type that will be executed without preemption,
like ALU instructions, texture instructions etc.
Example [100]
The subsequent example relates to a previous generation device (HD 58xx) that
includes 5 Execution Units, designated as the x, y, z, w and t unit.
By contrast, the discussed HD 69xx devices (belonging to the Cayman family) provide
only 4 Execution Units (x, y, z, w) per VLIW ALU.
• In the task graph arrows indicate dependencies. E.g. the kernel A is allowed to be executed
only after Write A and Write B have finished etc.
• The OpenCL runtime has the freedom to execute tasks given in the task graph in parallel
as long as the specified dependencies are fulfilled.
5.2.3 AMD’s high level data and task parallel programming model (23)
c5) The scheme of allocation Work items and Work Groups to execution resources
of the platform model
Work items
Work Groups
E.g. the Cayman core has 32 KB Local (Data Share) memories per SIMD core and a Global Data Share of 64 KB [99].
5.2.3 AMD’s high level data and task parallel programming model (25)
Synchronization of Work items within a Work Group
Synchronization of Work items being in different Work Groups
Task graphs
Discussed already in connection with parallel task execution.
Barrier synchronization
• Allows synchronization of Work items within a Work Group.
• Each Work item in a Work Group must first execute the barrier before execution is allowed
to proceed (see the SIMT execution model, discussed in Section 2).
Memory fences
synchronize memory operations (load/store sequences).
Atomic memory transactions
• Allow synchronization of Work items being in the same or in different Work Groups.
• Work items may e.g. append variable numbers of results to a shared queue
in global memory to coordinate execution of Work items in different Work Groups.
• Atomic memory transactions are OpenCL extensions supported by some OpenCL runtimes,
such as the ATI Stream SDK OpenCL runtime for x86 processors.
• The use of atomic memory transactions needs special care to avoid deadlocks and allow
scalability.
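A small kernel fragment sketching both mechanisms (assumed names; atomic_inc is an OpenCL 1.1 built-in,
available in OpenCL 1.0 only as an extension):

  __kernel void count_groups(__global const int *data,
                             __global int       *global_counter,
                             __local  int       *scratch)
  {
      int lid = get_local_id(0);

      scratch[lid] = data[get_global_id(0)];
      barrier(CLK_LOCAL_MEM_FENCE);     // every Work item of the Work Group reaches this point
                                        // before any of them proceeds (barrier synchronization)
      if (lid == 0)
          atomic_inc(global_counter);   // atomic memory transaction, visible also to Work items
                                        // in other Work Groups
  }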
Simplified block diagram of the Cayman core (used in the HD 69xx series) [97]
Comparing the block diagrams of the Cypress (HD 58xx) and the Cayman (HD 69xx)
cores [97]
Comparing the block diagrams of the Cayman (HD 69xx) and the Fermi cores [97]
GF110
Note
Fermi has read/write L1/L2 caches like CPUs.
16 VLIW4 ALUs/core
4 EUs/VLIW4 ALU
1536 EUs
5.2.4 Simplified block diagram of the Cayman core (5)
Simplified block diagram of the Cayman core (that underlies the HD 69xx series) [99]
The developer writes a set of commands that control the execution of the GPGPU program.
These commands
• configure the HD 69xx device (not detailed here),
• specify the data domain on which the HD 69xx device has to operate,
• command the HD 69xx device to copy programs and data between system memory and
device memory,
• cause the HD 69xx device to begin the execution of a GPGPU program (OpenCL program).
The host writes the commands to the memory mapped HD 69xx registers in the
system memory space.
5.2.5 Principle of operation of the Command Processor (2)
The Command Processor reads the commands that the host has written to the
memory mapped HD 69xx registers in the system-memory address space,
copies them into the Device Memory of the HD 6900 and launches their execution.
5.2.5 Principle of operation of the Command Processor (3)
The Data Parallel Processor Array (DPP) in the Cayman core [99]
Remark
The SIMD cores are also designated as
• Data Parallel Processors (DPP) or
• Compute Units.
SIMD core
Remark
Both the Cypress-based HD 5870 and the Cayman-based HD 6970 have the same basic structure
FP capability:
• 4xFP32 FMA
• 1XFP64 FMA
per clock.
• The VLIW4 ALUs are pipelined, having a throughput of 1 instruction per shader cycle for the
basic operations, i.e. they accept a new instruction every new cycle for the basic operations.
• ALU operations have a latency of 8 cycles, i.e. they require 8 cycles to be performed.
• The first 4 cycles are needed to read the operands from the register file,
one quarter wavefront at a time.
• The next four cycles are needed to execute the requested operation.
Contrasting AMD’s VLIW4 issue in the Cayman core (HD 69xx) with Nvidia’s scalar issue
(Based on [88])
AMD/ATI: VLIW issue; static dependency resolution, performed by the compiler.
Nvidia: scalar issue; dynamic dependency resolution, performed by the scoreboarded warp scheduler.
PFP32 = n x 16 x 4 x 1 x 2 x fs
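Evaluating this formula for the Cayman-based HD 6970 (assuming its 24 SIMD cores and 880 MHz engine
clock, figures inserted here only for illustration):

  PFP32 = 24 x 16 x 4 x 1 x 2 x 0.88 GHz ≈ 2.70 TFLOPS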
(= SIMD core)
Private memory: 16 K GPRs/SIMD core
4 x 32-bit/GPR
Main features of
the memory spaces
available in the
Cayman core [99]
Compliance of the Cayman core (HD 6900) memory hierarchy with the OpenCL
memory model
(= SIMD core)
5.2.7 The memory architecture (5)
(EU)
Example
A SIMD core provides 16 K GPRs of private memory and a wavefront consists of 64 work items.
If a kernel needs 30 GPRs (30 x 4 x 32-bit registers) per work item, a wavefront occupies
64 x 30 = 1920 GPRs, so 16384 / 1920 ≈ 8, i.e. up to 8 active wavefronts can run on
each SIMD core.
Comparing the Cayman’s (HD 69xx) and Fermi GF110’s (GTX 580) cache system [97]
GF110
Notes
1) Fermi has read/write L1/L2 caches like CPUs, whereas the Cayman core has read-only caches.
2) Fermi has optionally 16 KB L1 cache and 48 KB shared buffer or vice versa per SIMD core.
3) Fermi has larger L1/L2 caches.
(Figure: Work Groups of two kernels (Kernel 1: NDRange1, Kernel 2: NDRange2), each with its
global size, being allocated to the DPP Array.)
• The Ultra-Threaded Dispatch Processor allocates each Work Group of a kernel for execution
to a particular SIMD core.
• Up to 8 Work Groups of the same kernel can share the same SIMD core, if there are
enough resources to fit them in.
• Multiple kernels can run in parallel on different SIMD cores, in contrast to previous designs,
where only a single kernel was allowed to be active and its Work Groups were spread
over the available SIMD cores.
• Hardware barrier support is provided for up to 8 Work Groups per SIMD core.
• Work items (threads) of a Work Group may communicate with each other through a
local shared memory (LDS) provided by the SIMD core.
This feature has been available since the RV770-based HD 4xxx GPGPUs.
Remark
Barriers synchronize Work item (thread) execution, whereas memory fences
synchronize memory operations (load/store sequences).
SIMD core
Remark
Cayman’s SIMD cores have the same structure as the HD 5870
Wavefront size
SIMD core
• In Cayman each SIMD core has 16 VLIW4 (in earlier models VLIW5) ALUs, similarly to
the SIMD cores of the HD 5870, shown in the Figure [100].
• Both VLIW4 and VLIW5 ALUs provide 4 identical Execution Units, so
each group of 4 Work items, collectively called a quad, is processed on the same VLIW ALU.
The wavefront is composed of quads;
the number of quads is identical to the number of VLIW ALUs.
Building wavefronts
• The Ultra-Threaded Dispatch Processor segments Work Groups into wavefronts
and schedules them for execution on a single SIMD core.
• This segmentation is also called rasterization.
Example: Segmentation of a 16 x 16 sized Work Group into wavefronts of the size 8x8
and mapping them to SIMD cores [92]
Work Group
• AMD's GPGPUs have dual Ultra-Threaded Dispatch Processors, each responsible for
one of the two available issue slots.
• The main task of the Ultra-Threaded Dispatch Processors is to assign Work Groups of the
currently running kernels to the SIMD cores and to schedule wavefronts for execution.
• Each Ultra-Threaded Dispatch Processor has a dispatch pool of 248 wavefronts.
• Each Ultra-Threaded Dispatch Processor selects two wavefronts for execution for each
SIMD core and dispatches them to the SIMD cores.
• The two selected wavefronts are executed in an interleaved manner.
Figure 5.2.20: Simplified execution of Work items on a SIMD core with hiding memory stalls [93]
5.2.8 The Ultra-Threaded Dispatch Processor (15)
• When data requested from memory arrive (that is T0 becomes ready) the scheduler
will select T0 for execution again.
• If there are enough work-items ready for execution memory stall times can be hidden,
as shown in the Figure below.
Figure 5.2.21: Simplified execution of Work items on a SIMD core with hiding memory stalls [93]
5.2.8 The Ultra-Threaded Dispatch Processor (16)
Figure 5.2.22: Simplified execution of Work items on a SIMD core without hiding memory stalls [93]
5.2.9 Evolution of key features of AMD's GPGPU microarchitectures
Presumably in anticipation of this move AMD modified their IL, enhanced the
microarchitecture of their GPGPU devices with LDS and GDS (as discussed next),
and also changed their terminology, as given in Table xx.
The memory concept of Brook+ was decisive to the memory architecture of AMD’s first
R600-based GPGPUs, as shown next.
is part of the
System memory
that is accessible
by both the Host
and the GPGPU
The memory architecture of the R600 reflects the memory concept of Brook+.
5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (6)
By contrast, the memory model of OpenCL includes also Local and Global memory spaces
in order to allow data sharing among Work items running on the same SIMD core or even
on the GPGPU.
Figure 5.2.25: Introduction of LDSs and a GDS in RV770-based HD 4xxx GPGPUs [36]
Remark: AMD designates the RV770 core internally as the R700. (DPP: Data Parallel Processor)
5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (12)
The GDS memory became visible only in the ISA documentation of the Evergreen (HD 5xxx )
[112] and the Northern Island (HD 6xxx) families of GPGPUs [99].
Figure 5.2.28: Basic architecture of Cayman (that underlies both the HD 6950 and 6970 GPGPUs) [99]
5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (13)
LDS/SIMD core: – / – / 16 KB / 32 KB / 32 KB
GDS/GPGPU:   – / – / 16 KB / 64 KB / 64 KB
(presumably for the R600-, RV670-, RV770-, Evergreen- and Northern Islands-based families, respectively)
(Figure: domain of execution — Work Groups of Kernel 1 (NDRange1) and Kernel 2 (NDRange2),
each with its global size, segmented into wavefronts.)
The Ultra-Threaded Dispatch Processor
• allocates the Work Groups for execution to the SIMD cores,
• and segments the Work Groups into wavefronts.
After segmentation the Ultra-Threaded Dispatch Processor schedules the wavefronts
for execution on the SIMD cores.
5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (20)
e) Introducing FP64 capability and replacing VLIW5 ALUs with VLIW4 ALUs
Remark
Reasons for replacing VLIW5 ALUs with VLIW4 ALUs [97]
AMD/ATI chose the VLIW5 ALU design in connection with DX9, as it allowed a
4-component dot product (e.g. w, x, y, z) and a scalar component (e.g. lighting) to be calculated in parallel.
But in gaming applications with DX10/11 shaders the average slot utilization became only 3.4,
i.e. on average the 5th EU remains unused.
With Cayman AMD redesigned their ALU by
• removing the T-unit and
• enhancing 3 of the new EUs such that these units together became capable of
performing 1 transcendental operation per cycle as well as
• enhancing all 4 EUs to perform together an FP64 operation per cycle.
The new design can compute
• 4 FX32 or 4 FP32 operations or
• 1 FP64 operation or
• 1 transcendental + 1 FX32 or 1FP32 operation
per cycle, whereas
the previous design was able to calculate
• 5 FX32 or 5 FP32 operations or
• 1 FP64 operation or
• 1 transcendental + 4 FX/FP operation
per cycle.
5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (22)
Dezső Sima
Aim
Brief introduction and overview
General remark
Integrated CPU/GPU designs are not yet advanced enough to be employed as GPGPUs,
but they mark the direction of the evolution.
For this reason their discussion is included in the description of GPGPUs.
6. Integrated CPUs/GPUs
7. References
Remarks
• In early PCs, displays were connected to the system bus (first to the ISA then to the PCI bus)
via graphic cards.
• The spreading of the internet and of multimedia apps at the end of the 1990's called for enhanced
graphics support from the processors.
This led to the emergence of the 3rd generation superscalars, which already provided multimedia and
graphics support (by means of SIMD ISA enhancements).
• More demanding graphics processing, however, drove the evolution of the system
architecture away from the bus-based one to the hub-based one at the end of the 1990's.
Along with the appearance of the hub-based system architecture the graphics
controller (if provided) typically became integrated into the north bridge (MCH/GMCH),
as shown below.
(Figure panels: PCI architecture (processor, ISA and PCI buses, display) vs. hub architecture.)
Figure 6.1: Emergence of hub based system architectures at the end of the 1990’s
6.1 Introduction to integrated CPUs/GPUs (3)
Example
Integrated graphics controllers appeared
in Intel chipsets first in the 810 chipset
in 1999 [113].
Note
The 810 chipset does not provide an AGP
connection.
6.1 Introduction to integrated CPUs/GPUs (4)
Subsequent chipsets (e.g. those developed for P4 processors around 2000) then provided
both an integrated graphics controller intended for connecting the display and
a Host-to-AGP Bridge catering for an AGP-bus output, in order to achieve high quality
graphics for gaming apps by using a graphics card.
Display
AGP
bus
Figure 6.2: Conceptual block diagram of the north bridge of Intel’s 845G chipset [129]
Fusion line
Line of processors with on-die integrated CPU and GPU units, designated as APUs
(Accelerated Processing Units)
• Introduced in connection with the AMD/ATI merger in 10/2006.
• Originally planned to ship late 2008 or early 2009
• Actually shipped
• 11/2010: for OEMs
• 01/2011: for retail
UNB: Unbuffered North Bridge
UVD: Universal Video Decoder
SB: South Bridge
Source: [115]
Source: [115]
Source: [116]
Source: [115]
Table 6.2: Main features of AMD’s mainstream Zacate Fusion APU line [131]
Table 6.3: Main features of AMD’s low power Ontario Fusion APU line [131]
OpenCL programming support for both the Ontario and the Zacate lines
AMD's APP SDK 2.3 (1/2011) (formerly ATI Stream SDK) provides OpenCL support for
both lines.
Source: [115]
6.2 The AMD Fusion APU line (18)
In-breadth support
of dual issue
TSMC 40nm
(Taiwan Semiconductor Manufacturing Company)
~ 400 mtrs
6.2 The AMD Fusion APU line (25)
AMD HD 5570: 400 stream processors, 650 MHz, DDR3/GDDR5 memory at 1800/4000 MHz, 39 W
Floor plan of the CPU core (revamped Star core) of the Llano APU [133]
In 2011
Bulldozer CPU cores will be used as the basis for their desktop and server CPU processors,
In 2012
next generation Bulldozer CPUs are planned to be used both
Remark
No plans have been revealed to continue developing the Llano APU.
Bulldozer’s basic modules each consisting of two tightly coupled cores -1[135]
Bulldozer’s basic modules each consisting of two tightly coupled cores – 2 [136]
Bulldozer’s basic modules each consisting of two tightly coupled cores – 3 [135]
Similarity of the microarchitectures of Nvidia’s Fermi and AMD’s Bulldozer CPU core
Both are using tightly coupled dual pipelines with shared and dedicated units.
Floor plan of
Bulldozer [128]
6.2 The AMD Fusion APU line (39)
Arrandale: i3 3xx, i5 4xx/5xx, i7 6xx
Clarkdale: i3 5xx, i5 6xx
CPU/GPU components
CPU: Westmere architecture (32 nm)
(Enhanced 32 nm shrink of the
45 nm Nehalem architecture)
GPU: (45 nm)
Shader model 4, DX10 support
32 nm CPU
(Mobile implementation of the Westmere
basic architecture,
which is the 32 nm shrink of the
45 nm Nehalem basic architecture) 45 nm GPU
Intel’s GMA HD (Graphics Media Accelerator)
(12 Execution Units, Shader model 4, no OpenCL support)
6.3 Intel’s in-package integrated CPU/GPU lines (8)
6.3 Intel’s in-package integrated CPU/GPU lines (9)
Figure 6.3: The Clarkdale processor with in-package integrated graphics along with the H57 chipset
[140]
In Jan. 2011 Intel replaced their in-package integrated CPU/GPU lines with the on-die integrated
Sandy Bridge line.
DDR3:              1333 MHz / 1333 MHz / 1333 MHz / 1333 MHz / 1333 MHz
L3 Cache:          6 MB / 6 MB / 6 MB / 8 MB / 8 MB
Intel HD Graphics: 2000 / 2000 / 3000 / 2000 / 3000
GPU max. freq.:    1100 MHz / 1100 MHz / 1100 MHz / 1350 MHz / 1350 MHz
Hyper-Threading:   No / No / No / Yes / Yes
Socket:            LGA 1155 (all models)
(Figure labels: Sandy Bridge (i5/i7 2xxx) features — Hyperthreading, 32K L1D (3 clk), AES instructions,
AVX 256 bit, 4 operands, VMX Unrestricted, ~20 mm2/core; contrasted with the HD 5570 (400 ALUs)
and Arrandale (i5 6xx).)
Dezső Sima
[4]: NVIDIA Tesla D870 Deskside GPU Computing System, System Specification, Jan. 2008,
Nvidia, http://www.nvidia.com/docs/IO/43395/D870-SystemSpec-SP-03718-001_v01.pdf
[5]: Tesla S870 GPU Computing System, Specification, Nvidia, March 13 2008,
http://jp.nvidia.com/docs/IO/43395/S870-BoardSpec_SP-03685-001_v00b.pdf
[7]: R600-Family Instruction Set Architecture, Revision 0.31, May 2007, AMD
[8]: Zheng B., Gladding D., Villmow M., Building a High Level Language Compiler for GPGPU,
ASPLOS 2006, June 2008
[9]: Huddy R., ATI Radeon HD2000 Series Technology Overview, AMD Technology Day, 2007
http://ati.amd.com/developer/techpapers.html
[11]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 2.0,
June 2008, Nvidia,
http://developer.download.nvidia.com/compute/cuda/2_0/docs/NVIDIA_CUDA_Programming
Guide_2.0.pdf
[12]: Kirk D. & Hwu W. W., ECE498AL Lectures 7: Threading Hardware in G80, 2007,
University of Illinois, Urbana-Champaign, http://courses.ece.uiuc.edu/ece498/al1/
ectures/lecture7-threading%20hardware.ppt
[13]: Kogo H., R600 (Radeon HD2900 XT), PC Watch, June 26 2008,
http://pc.watch.impress.co.jp/docs/2008/0626/kaigai_3.pdf
[16]: Goto H., NVIDIA GT200 and AMD RV770, PC Watch, July 2 2008,
http://pc.watch.impress.co.jp/docs/2008/0702/kaigai451.htm
[17]: Shrout R., Nvidia GT200 Revealed – GeForce GTX 280 and GTX 260 Review,
PC Perspective, June 16 2008,
http://www.pcper.com/article.php?aid=577&type=expert&pid=3
[18]: http://en.wikipedia.org/wiki/DirectX
[20]: Microsoft DirectX 10: The Next-Generation Graphics API, Technical Brief, Nov. 2006,
Nvidia, http://www.nvidia.com/page/8800_tech_briefs.html
[21]: Patidar S. & al., “Exploiting the Shader Model 4.0 Architecture, Center for
Visual Information Technology, IIIT Hyderabad, March 2007,
http://research.iiit.ac.in/~shiben/docs/SM4_Skp-Shiben-Jag-PJN_draft.pdf
[22]: Nvidia GeForce 8800 GPU Architecture Overview, Vers. 0.1, Nov. 2006, Nvidia,
http://www.nvidia.com/page/8800_tech_briefs.html
[23]: Goto H., Graphics Pipeline Rendering History, Aug. 22 2008, PC Watch,
http://pc.watch.impress.co.jp/docs/2008/0822/kaigai_06.pdf
[24]: Fatahalian K., “From Shader Code to a Teraflop: How Shader Cores Work,”
Workshop: Beyond Programmable Shading: Fundamentals, SIGGRAPH 2008,
[25]: Kanter D., “NVIDIA’s GT200: Inside a Parallel Processor,” Real World Technologies,
Sept. 8 2008, http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242
[27]: Seiler L. & al., “Larrabee: A Many-Core x86 Architecture for Visual Computing,”
ACM Transactions on Graphics, Vol. 27, No. 3, Article No. 18, Aug. 2008
[29]: Shrout R., IDF Fall 2007 Keynote, PC Perspective, Sept. 18, 2007,
http://www.pcper.com/article.php?aid=453
[30]: Stokes J., Larrabee: Intel’s biggest leap ahead since the Pentium Pro,” Ars Technica,
Aug. 04. 2008, http://arstechnica.com/news.ars/post/20080804-larrabee-intels-biggest-leap-
ahead-since-the-pentium-pro.html
[32]: Hester P., “Multi_Core and Beyond: Evolving the x86 Architecture,” Hot Chips 19,
Aug. 2007, http://www.hotchips.org/hc19/docs/keynote2.pdf
[33]: AMD Stream Computing, User Guide, Oct. 2008, Rev. 1.2.1
http://ati.amd.com/technology/streamcomputing/Stream_Computing_User_Guide.pdf
[34]: Doggett M., Radeon HD 2900, Graphics Hardware Conf. Aug. 2007,
http://www.graphicshardware.org/previous/www_2007/presentations/doggett-
radeon2900-gh07.pdf
[35]: Mantor M., “AMD’s Radeon Hd 2900,” Hot Chips 19, Aug. 2007,
http://www.hotchips.org/archives/hc19/2_Mon/HC19.03/HC19.03.01.pdf
[36]: Houston M., "Anatomy of AMD's TeraScale Graphics Engine," SIGGRAPH 2008,
http://s08.idav.ucdavis.edu/houston-amd-terascale.pdf
[37]: Mantor M., “Entering the Golden Age of Heterogeneous Computing,” PEEP 2008,
http://ati.amd.com/technology/streamcomputing/IUCAA_Pune_PEEP_2008.pdf
[39]: Kanter D., Inside Fermi: Nvidia's HPC Push, Real World Technologies Sept 30 2009,
http://www.realworldtech.com/includes/templates/articles.cfm?
ArticleID=RWT093009110932&mode=print
[40]: Wasson S., Inside Fermi: Nvidia's 'Fermi' GPU architecture revealed,
Tech Report, Sept 30 2009, http://techreport.com/articles.x/17670/1
[44]: Hwu W., Kirk D., Nvidia, Advanced Algorithmic Techniques for GPUs, Berkeley,
January 24-25 2011
http://iccs.lbl.gov/assets/docs/2011-01-24/lecture1_computational_thinking_
Berkeley_2011.pdf
[45]: Wasson S., Nvidia's GeForce GTX 580 graphics processor
Tech Report, Nov 9 2010, http://techreport.com/articles.x/19934/1
[46]: Shrout R., Nvidia GeForce 8800 GTX Review – DX10 and Unified Architecture,
PC Perspective, Nov 8 2006
http://swfan.com/reviews/graphics-cards/nvidia-geforce-8800-gtx-review-dx10-
and-unified-architecture/g80-architecture
[47]: Wasson S., Nvidia's GeForce GTX 480 and 470 graphics processors
Tech Report, March 31 2010, http://techreport.com/articles.x/18682
[48]: Gangar K., Tianhe-1A from China is world’s fastest Supercomputer
Tech Ticker, Oct 28 2010, http://techtickerblog.com/2010/10/28/tianhe-1a-
from-china-is-worlds-fastest-supercomputer/
[49]: Smalley T., ATI Radeon HD 5870 Architecture Analysis, Bit-tech, Sept 30 2009,
http://www.bit-tech.net/hardware/graphics/2009/09/30/ati-radeon-hd-5870-
architecture-analysis/8
[50]: Nvidia Compute PTX: Parallel Thread Execution, ISA, Version 1.0, June 2007,
https://www.doc.ic.ac.uk/~wl/teachlocal/arch2/papers/nvidia-PTX_ISA_1.0.pdf
[51]: Kanter D., Intel's Sandy Bridge Microarchitecture, Real World Technologies,
Sept 25 2010 http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=4
[52]: Nvidia CUDATM FermiTM Compatibility Guide for CUDA Applications, Version 1.0,
February 2010, http://developer.download.nvidia.com/compute/cuda/3_0/
docs/NVIDIA_FermiCompatibilityGuide.pdf
[53]: Hallock R., Dissecting Fermi, NVIDIA’s next generation GPU, Icrontic, Sept 30 2009,
http://tech.icrontic.com/articles/nvidia_fermi_dissected/
[54]: Kirsch N., NVIDIA GF100 Fermi Architecture and Performance Preview,
Legit Reviews, Jan 20 2010, http://www.legitreviews.com/article/1193/2/
[55]: Hoenig M., NVIDIA GeForce GTX 460 SE 1GB Review, Hardware Canucks, Nov 21 2010,
http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/38178-
nvidia-geforce-gtx-460-se-1gb-review-2.html
[56]: Glaskowsky P. N., Nvidia’s Fermi: The First Complete GPU Computing Architecture
Sept 2009, http://www.nvidia.com/content/PDF/fermi_white_papers/
P.Glaskowsky_NVIDIA's_Fermi-The_First_Complete_GPU_Architecture.pdf
[57]: Kirk D. & Hwu W. W., ECE498AL Lectures 4: CUDA Threads – Part 2, 2007-2009,
University of Illinois, Urbana-Champaign, http://courses.engr.illinois.edu/ece498/
al/lectures/lecture4%20cuda%20threads%20part2%20spring%202009.ppt
[58]: Nvidia’s Next Generation CUDATM Compute Architecture: FermiTM, Version 1.1, 2009
http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_
Architecture_Whitepaper.pdf
[59]: Kirk D. & Hwu W. W., ECE498AL Lectures 8: Threading Hardware in G80, 2007-2009,
University of Illinois, Urbana-Champaign, http://courses.engr.illinois.edu/ece498/
al/lectures/lecture8-threading-hardware-spring-2009.ppt
[60]: Wong H., Papadopoulou M.M., Sadooghi-Alvandi M., Moshovos A., Demystifying GPU
Microarchitecture through Microbenchmarking, University of Toronto, 2010,
http://www.eecg.toronto.edu/~myrto/gpuarch-ispass2010.pdf
[61]: Pettersson J., Wainwright I., Radar Signal Processing with Graphics Processors
(GPUs), SAAB Technologies, Jan 27 2010,
http://www.hpcsweden.se/files/RadarSignalProcessingwithGraphicsProcessors.pdf
[62]: Smith R., NVIDIA’s GeForce GTX 460: The $200 King, AnandTech, July 11 2010,
http://www.anandtech.com/show/3809/nvidias-geforce-gtx-460-the-200-king/2
[63]: Angelini C., GeForce GTX 580 And GF110: The Way Nvidia Meant It To Be Played,
Tom’s Hardware, Nov 9 2010, http://www.tomshardware.com/reviews/geforce-
gtx-580-gf110-geforce-gtx-480,2781.html
[64]: NVIDIA G80: Architecture and GPU Analysis, Beyond3D, Nov. 8 2006,
http://www.beyond3d.com/content/reviews/1/11
[65]: D. Kirk and W. Hwu, Programming Massively Parallel Processors, 2008
Chapter 3: CUDA Threads, http://courses.engr.illinois.edu/ece498/al/textbook/
Chapter3-CudaThreadingModel.pdf
[72]: New NVIDIA Tesla GPUs Reduce Cost Of Supercomputing By A Factor Of 10,
Nvidia, Nov. 16 2009
http://www.nvidia.com/object/io_1258360868914.html
[73]: Nvidia Tesla, Wikipedia, http://en.wikipedia.org/wiki/Nvidia_Tesla
[74]: Tesla M2050 and Tesla M2070/M2070Q Dual-Slot Computing Processor Modules,
Board Specification, v. 03, Nvidia, Aug. 2010,
http://www.nvidia.asia/docs/IO/43395/BD-05238-001_v03.pdf
[75]: Tesla 1U GPU Computing System, Product Specification, v. 04, Nvidia, June 2009,
http://www.nvidia.com/docs/IO/43395/SP-04975-001-v04.pdf
[76]: Kanter D., The Case for ECC Memory in Nvidia's Next GPU, Real World Technologies,
Aug. 19 2009,
http://www.realworldtech.com/page.cfm?ArticleID=RWT081909212132
[77]: Hoenig M., Nvidia GeForce 580 Review, HardwareCanucks, Nov. 8, 2010,
http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/37789-nvidia-
geforce-gtx-580-review-5.html
[78]: Angelini C., AMD Radeon HD 6990 4 GB Review, Tom’s Hardware, March 8, 2011,
http://www.tomshardware.com/reviews/radeon-hd-6990-antilles-crossfire,2878.html
[79]: Tom’s Hardware Gallery,
http://www.tomshardware.com/gallery/two-cypress-gpus,0101-230369-
7179-0-0-0-jpg-.html
[80]: Tom’s Hardware Gallery,
http://www.tomshardware.com/gallery/Bare-Radeon-HD-5970,0101-230349-
7179-0-0-0-jpg-.html
[81]: CUDA, Wikipedia, http://en.wikipedia.org/wiki/CUDA
[82]: GeForce Graphics Processors, Nvidia, http://www.nvidia.com/object/geforce_family.html
[83]: Next Gen CUDA GPU Architecture, Code-Named “Fermi”, Press Presentation at
Nvidia’s 2009 GPU Technology Conference, (GTC), Sept. 30 2009,
http://www.nvidia.com/object/gpu_tech_conf_press_room.html
[88]: Smith R., AMD's Radeon HD 6970 & Radeon HD 6950: Paving The Future For AMD,
AnandTech, Dec. 15 2010,
http://www.anandtech.com/show/4061/amds-radeon-hd-6970-radeon-hd-6950
[89] Christian, AMD renames ATI Stream SDK, updates its with APU, OpenCL 1.1 support,
Jan. 27 2011, http://www.tcmagazine.com/tcm/news/software/34765/
amd-renames-ati-stream-sdk-updates-its-apu-opencl-11-support
[90]: User Guide: AMD Stream Computing, Revision 1.3.0, Dec. 2008,
http://www.ele.uri.edu/courses/ele408/StreamGPU.pdf
[91]: ATI Stream Computing Compute Abstraction Layer (CAL) Programming Guide,
Revision 2.01, AMD, March 2010, http://developer.amd.com/gpu_assets/ATI_Stream_
SDK_CAL_Programming_Guide_v2.0.pdf
http://developer.amd.com/gpu/amdappsdk/assets/AMD_CAL_Programming_Guide_v2.0.pdf
[92]: Technical Overview: AMD Stream Computing, Revision 1.2.1, Oct. 2008,
http://www.cct.lsu.edu/~scheinin/Parallel/StreamComputingOverview.pdf
[93]: AMD Accelerated Parallel Processing OpenCL Programming Guide, Rev. 1.2,
AMD, Jan. 2011, http://developer.amd.com/gpu/amdappsdk/assets/
AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
[97]: Kanter D., AMD's Cayman GPU Architecture, Real World Technologies, Dec. 14 2010,
http://www.realworldtech.com/page.cfm?ArticleID=RWT121410213827&p=3
[98]: Hoenig M., AMD Radeon HD 6970 and HD 6950 Review, Hardware Canucks,
Dec. 14 2010, http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/
38899-amd-radeon-hd-6970-hd-6950-review-3.html
[99]: Reference Guide: AMD HD 6900 Series Instruction Set Architecture, Revision 1.0,
Febr. 2011, http://developer.amd.com/gpu/AMDAPPSDK/assets/
AMD_HD_6900_Series_Instruction_Set_Architecture.pdf
[100]:Howes L., AMD and OpenCL, AMD Application Engineering, Dec. 2010,
http://www.many-core.group.cam.ac.uk/ukgpucc2/talks/Howes.pdf
[101]: ATI R700-Family Instruction Set Architecture Reference Guide, Revision 1.0a,
AMD, Febr. 2011, http://developer.amd.com/gpu_assets/R700-Family_Instruction_
Set_Architecture.pdf
[102]: Piazza T., Dr. Jiang H., Microarchitecture Codename Sandy Bridge: Processor
Graphics, Presentation ARCS002, IDF San Francisco, Sept. 2010
[103]: Bhaniramka P., Introduction to Compute Abstraction Layer (CAL),
http://coachk.cs.ucf.edu/courses/CDA6938/AMD_course/M5%20-
%20Introduction%20to%20CAL.pdf
[104]: Villmow M., ATI Stream Computing, ATI Intermediate Language (IL),
May 30 2008, http://developer.amd.com/gpu/amdappsdk/assets/ATI%20Stream
%20Computing%20-%20ATI%20Intermediate%20Language.ppt#547,9
[105]: AMD Accelerated Parallel Processing Technology,
AMD Intermediate Language (IL), Reference Guide, Revision 2.0e, March 2011,
http://developer.amd.com/gpu/AMDAPPSDK/assets/AMD_Intermediate_Language
_(IL)_Specification_v2.pdf
[106]: Hensley J., Hardware and Compute Abstraction Layers for Accelerated Computing
Using Graphics Hardware and Conventional CPUs, AMD, 2007,
http://www.ll.mit.edu/HPEC/agendas/proc07/Day3/10_Hensley_Abstract.pdf
[107]: Hensley J., Yang J., Compute Abstraction Layer, AMD, Febr. 1 2008,
http://coachk.cs.ucf.edu/courses/CDA6938/s08/UCF-2008-02-01a.pdf
[108]: AMD Accelerated Parallel Processing (APP) SDK, AMD Developer Central,
http://developer.amd.com/gpu/amdappsdk/pages/default.aspx
[109]: OpenCL™ and the AMD APP SDK v2.4, AMD Developer Central, April 6 2011,
http://developer.amd.com/documentation/articles/pages/OpenCL-and-the-AMD-APP-
SDK.aspx
[110]: Stone J., An Introduction to OpenCL, U. of Illinois at Urbana-Champaign, Dec. 2009,
http://www.ks.uiuc.edu/Research/gpu/gpucomputing.net
[111]: Introduction to OpenCL Programming, AMD, No. 137-41768-10, Rev. A, May 2010,
http://developer.amd.com/zones/OpenCLZone/courses/Documents/Introduction_
to_OpenCL_Programming%20Training_Guide%20(201005).pdf
[112]: Evergreen Family Instruction Set Architecture, Instructions and Microcode Reference
Guide, AMD, Febr. 2011, http://developer.amd.com/gpu/amdappsdk/assets/
AMD_Evergreen-Family_Instruction_Set_Architecture.pdf
[113]: Intel 810 Chipset: Intel 82810/82810-DC100 Graphics and Memory Controller Hub
(GMCH) Datasheet, June 1999
ftp://download.intel.com/design/chipsets/datashts/29065602.pdf
[114]: Huynh A.T., AMD Announces "Fusion" CPU/GPU Program, Daily Tech, Oct. 25 2006,
http://www.dailytech.com/article.aspx?newsid=4696
[115]: Grim B., AMD Fusion Family of APUs, Dec. 7 2010, http://www.mytechnology.eu/wp-
content/uploads/2011/01/AMD-Fusion-Press-Tour_EMEA.pdf
[116]: Newell D., AMD Financial Analyst Day, Nov. 9 2010,
http://www.rumorpedia.net/wp-content/uploads/2010/11/rumorpedia02.jpg
[117]: De Maesschalck T., AMD starts shipping Ontario and Zacate CPUs, DarkVision
Hardware, Nov. 10 2010, http://www.dvhardware.net/article46449.html
[118]: AMD Accelerated Parallel Processing (APP) SDK (formerly ATI Stream) with
OpenCL 1.1 Support, APP SDK 2.3, Jan. 2011
[119]: Burgess B., „Bobcat” AMD’s New Low Power x86 Core Architecture, Aug. 24 2010,
http://www.hotchips.org/uploads/archive22/HC22.24.730-Burgess-AMD-Bobcat-x86.pdf
[126]: Shimpi A. L., The Sandy Bridge Review: Intel Core i7-2600K, i5-2500K and Core
i3-2100 Tested, AnandTech, Jan. 3 2011,
http://www.anandtech.com/show/4083/the-sandy-bridge-review-intel-core-i7-2600k-i5-
2500k-core-i3-2100-tested/11
[127]: Marques T., AMD Ontario, Zacate Die Sizes - Take 2, Sept. 14 2010,
http://www.siliconmadness.com/2010/09/amd-ontario-zacate-die-sizes-take-2.html
[138]: Dodeja A., Intel Arrandale, High Performance for the Masses, Hot Hardware,
Review of the IDF San Francisco, Sept. 2009,
http://akshaydodeja.com/intel-arrandale-high-performance-for-the-mass
[139]: Shimpi A. L., Intel Arrandale: 32nm for Notebooks, Core i5 540M Reviewed,
AnandTech, Jan. 4 2010,
http://www.anandtech.com/show/2902
[140]: Chiappeta M., Intel Clarkdale Core i5 Desktop Processor Debuts, Hot Hardware,
Jan. 03 2010,
http://hothardware.com/Articles/Intel-Clarkdale-Core-i5-Desktop-Processor-Debuts/
[141]: Thomas S. L., Desktop Platform Design Overview for Intel Microarchitecture (Nehalem)
Based Platform, Presentation ARCS001, IDF 2009
[142]: Kahn O., Valentine B., Microarchitecture Codename Sandy Bridge: New Processor
Innovations, Presentation ARCS001, IDF San Francisco Sept. 2010
[145]: ATI Stream Computing OpenCL Programming Guide, Rev. 1.0b, AMD, March 2010,
http://www.ljll.math.upmc.fr/groupes/gpgpu/tutorial/ATI_Stream_SDK_OpenCL
Programming_Guide.pdf
[147]: Nvidia Compute PTX: Parallel Thread Execution ISA, Version 2.3, March 2011,
http://developer.download.nvidia.com/compute/cuda/4_0_rc2/toolkit/docs/ptx_isa_
2.3.pdf
[148]: ATI Stream Computing Compute Abstraction Layer (CAL) Programming Guide,
Revision 2.03, AMD, Dec. 2010
http://developer.amd.com/gpu/amdappsdk/assets/AMD_CAL_Programming_Guide_
v2.0.pdf
[149]: Wikipedia: Dolphin triangle mesh,
http://en.wikipedia.org/wiki/File:Dolphin_triangle_mesh.png