ARCHITECTURE OF PARALLEL SYSTEMS
SUPPORT:
Prepared within the framework of the project TÁMOP-4.1.2-08/2/A/KMR-2009-0053,
“Proactive IT module development (PRIM1): IT Service Management module and
Multithreaded processors and their programming module”
KEYWORDS:
multicore processors, manycore processors, homogeneous multicore processors,
heterogeneous multicore processors, master-slave type heterogeneous multicore processors,
add-on (coupled) type heterogeneous multicore processors, Core
2/Penryn/Nehalem/Nehalem-EX/Westmere/Westmere-EX/Sandy Bridge based Intel
architectures, consumer (private consumer) and enterprise oriented platforms, Intel’s vPro
platform, general purpose GPUs (GPGPUs), data parallel accelerators (DPAs), integrated
CPU/GPU architectures
SUMMARY:
In this course students receive an overview of the rapid development of processor
architectures in recent years. They become acquainted with the necessity of the emergence of
multicore processors, with the main classes of multicore/manycore processors – namely the
homogeneous and the heterogeneous multicore processors – as well as with their subclasses
and representative implementations.
The main families of Intel’s multicore processors and their main characteristics are presented,
namely the Core 2, Penryn, Nehalem, Nehalem-EX, Westmere, Westmere-EX and
Sandy Bridge based architectures. In the lectures students become acquainted with
multicore desktop computer platforms, in particular with the consumer and the enterprise
oriented (vPro) platforms and their specific features. Understanding of the material is aided
by the presentation of a large number of concrete implementations. The lectures then discuss
general purpose GPUs (GPGPUs) and data parallel accelerators (DPAs), which are spreading
ever more widely for computation-intensive applications. Finally, the architectures of the
representative Nvidia and AMD/ATI GPGPU families are presented, as well as the integrated
CPU/GPU architectures that appeared in the most recent phase of processor evolution,
together with their representative implementations.
Table of Contents
• Multicore-Manycore Processors
• Evolution of Intel’s Basic Microarchitectures
• Intel’s Desktop Platforms
• GPGPUs/DPAs Overview
• GPGPUs/DPAs 5.1
• GPGPUs/DPAs 5.2
• Integrated CPUs/GPUs
• References to all four sections of GPGPUs/DPAs
Dezső Sima
• 2. Homogeneous multicores
• 3. Heterogeneous multicores
• 4. Outlook
• 5. References
[Figure: Absolute integer performance (SPECint92) of Intel processors vs. year, 1979–2005, from the 8088/5 through the Pentium II/III up to the Pentium 4 Prescott (1M/2M); performance levels off in the last years shown]
Performance (Pa):
Pa = fC × IPC, i.e. performance = clock frequency × efficiency (efficiency = Pa/fC)
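To make the decomposition used in the plots explicit, the relation can be written out as follows (standard definitions; the numeric example is illustrative only):

```latex
% Absolute performance decomposed into clock frequency and efficiency (IPC)
P_a = f_C \cdot \mathrm{IPC}
\qquad\Longleftrightarrow\qquad
\mathrm{IPC} = \frac{P_a}{f_C}
% Example: f_C = 3.2\,\mathrm{GHz},\ \mathrm{IPC} = 1.5
% \Rightarrow P_a = 3.2\cdot 10^{9} \cdot 1.5 = 4.8\cdot 10^{9}\ \text{instructions/s}
```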
[Figure: Efficiency (SPECint_base2000/fC) of Intel processors vs. year, 1978–2002, from the 286 through the 386DX, 486DX and Pentium to the Pentium Pro/II/III (2nd generation superscalars); efficiency grew ~10×/10 years and then levelled off; eras marked: 1st gen. pipeline, 2nd gen. superscalar]
Figure 1.7: The actual rise of IC complexity in DRAMs and microprocessors [39]
IC fab technology (linear shrink of ~0.7×/2 years) → transistor counts double roughly every
two years (Moore’s law).
Key question: what is the best use of the ever increasing transistor count?
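As a simple illustration of the doubling rule stated above, the sketch below projects transistor counts from an assumed starting point (the ~42 million transistors of Willamette in 2000, taken from the roadmap data later in this material); it is only a back-of-the-envelope model, not a statement about any concrete product.

```c
#include <stdio.h>
#include <math.h>

/* Project transistor counts assuming a doubling every two years (Moore's law). */
int main(void)
{
    const double start_year  = 2000.0;   /* Willamette introduction            */
    const double start_count = 42e6;     /* ~42 million transistors (42 mtrs)  */

    for (int year = 2000; year <= 2010; year += 2) {
        double doublings = (year - start_year) / 2.0;     /* one doubling / 2 years */
        double count     = start_count * pow(2.0, doublings);
        printf("%d: ~%.0f million transistors\n", year, count / 1e6);
    }
    return 0;
}
```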
[Classification: Multicore processors → homogeneous multicores | heterogeneous multicores (diagram labels: MPC, CPU, GPU)]
2.1.1.1 Introduction
Servers
[Roadmap: Nehalem-EP (45 nm) and Westmere-EP (32 nm); Nehalem-EX (45 nm, 3/2010, Xeon 7500 ‘Beckton’: 1 × 8 C, ¼ MB L2/core, 24 MB L3) – MP server platforms]
Overview
Remark
For presenting a more complete view of the evolution of multicore MP server platforms,
we also include the single core (SC) 90 nm Pentium 4 Prescott based Xeon MP (Potomac)
processor, which was the first 64-bit MP server processor and gave rise to the Truland platform.
[Roadmap: MP server platforms based on Pentium 4 (90 nm and 65 nm), Core 2 and Penryn processors; ICH5; 4/2003]
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (3)
[Roadmap figure: the Xeon MP line from Foster-MP (3/02, 0.18 µ/42 mtrs) through Gallatin (11/02 and 3/04, 0.13 µ) to Potomac (2Q/05, 0.09 µ), with clock rates (1.4 GHz up to 3.8 GHz), on-die caches (256 KB L2 up to 1 MB L2 + 1 MB L3), FSB speeds (400–800 MHz) and sockets (µPGA 603/604); one planned 3.8 GHz step was cancelled 5/04]
[Roadmap figure: the desktop Pentium 4 line from Willamette (11/00, 0.18 µ/42 mtrs, 256 KB L2, 400 MHz FSB, µPGA 423) through Northwood-A/B/C (0.13 µ/55 mtrs, 512 KB L2, 400/533/800 MHz FSB, µPGA 478) and Prescott/Prescott-F (0.09 µ/125 mtrs, 1 MB L2, 800 MHz FSB, µPGA 478/LGA 775) to Tejas (4.0/4.2 GHz, cancelled 5/04); legend: cores supporting hyperthreading / cores with EM64T implemented but not enabled / cores supporting EM64T]
Figure 2.2: The Potomac processor as Intel’s first 64-bit Xeon MP processor, based on the third core (Prescott core) of the Pentium 4 family of processors
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (4)
[Platform diagram annotations:
XMB: eXternal Memory Bridge – provides a serial link with 5.33 GB/s inbound and 2.67 GB/s outbound bandwidth (simultaneously)
The 8500¹/8501 MCH connects over the FSB to the processors, over four XMBs to DDR-266/333 / DDR2-400 memory, and over HI 1.5 to the ICH5
HI 1.5 (Hub Interface 1.5): 8 bits wide, 66 MHz clock, QDR, 266 MB/s peak transfer rate]
¹ The E8500 MCH supports an FSB of 667 MT/s and consequently supports only the SC Xeon MP (Potomac) processor.
Example 1: Block diagram of an E8500 chipset based Truland MP server board [2]
Figure 2.4: Block diagram of an E8500 chipset based Truland MP server board [2]
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (7)
Example 2: Block diagram of the E8501 based Truland MP server platform [3]
[Platform diagram annotations:
Processors: Xeon DC MP 7000 (4/2005) or later DC/QC MP 7000 processors
IMI (Independent Memory Interface): serial link with 5.33 GB/s inbound and 2.67 GB/s outbound bandwidth (simultaneously)
XMB: intelligent memory controller, dual memory channels, DDR 266/333/400, 4 DIMMs/channel]
Figure 2.5: Intel’s E8501 chipset based Truland MP server platform (4/2006) [3]
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (8)
[Server board annotations: 2 + 2 XMBs, DDR2 DIMMs (up to 64 GB), Xeon DC 7000/7100 processors, E8501 NB, ICH5R SB]
Figure 2.6: Intel E8501 chipset based MP server board (Supermicro X6QT8) for the Xeon 7000/7100 DC MP processor families [4]
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (9)
Figure 2.7: Bandwidth bottlenecks in Intel’s 8501 based Truland MP server platform [5]
2.1.1.2 The Pentium 4 Prescott MP based Truland MP server platform (10)
Remark
Previous (first generation) MP servers made use of a symmetric topology including only a
single FSB that connects all 4 single core processors to the MCH (north bridge), as shown
below.
FSB
Preceding NBs
Figure 2.8: Previous Pentium 4 MP based MP server platform (for single core processors)
[Comparison diagram: a preceding MP platform (single FSB, preceding NB with e.g. DDR-200/266, preceding ICH over HI 1.5) vs. the Truland platform (two FSBs, 8500¹/8501 MCH, four XMBs with DDR-266/333 / DDR2-400, ICH5 over HI 1.5 at 266 MB/s)]
MP platforms: Caneland (9/2007, 9/2008)
[Platform diagram: 7300 (Clarksboro) MCH with 4 × FSB at 1066 MT/s; 4 FBDIMM memory channels (DDR2-533/667, 8 DIMMs/channel), up to 512 GB; ESI link to the south bridge]
[Roadmap: Pentium 4 based, Core 2 based (65 nm) and Penryn based (45 nm) MP processors; 631xESB/632xESB ICH; 5/2006]
2.1.1.3 The Core 2 based Caneland MP server platform (3)
Xeon MP (Potomac) 1C → Xeon 7000 (Paxville MP) 2×1C → Xeon 7100 (Tulsa) 2C → Xeon 7200 (Tigerton DC) 1×2C → Xeon 7300 (Tigerton QC) 2×2C → Xeon 7400 (Dunnington) 6C
[Comparison diagram: the Truland platform (8500¹/8501 MCH, two FSBs, four XMBs with DDR-266/333 / DDR2-400, ICH5 over HI 1.5) vs. the Caneland platform (7300 MCH, four FSBs, FBDIMM channels with DDR2-533/667, up to 8 DIMMs/channel, 631xESB/632xESB over ESI)]
Example 1: Intel’s Nehalem-EP based Tylersburg-EP DP server platform with a single IOH
[Board annotations: Xeon 7200 (Tigerton DC, Core 2, 2C) and Xeon 7300 (Tigerton QC, Core 2, QC) processors; FB-DIMM memory, 4 channels, 8 DIMMs/channel, up to 512 GB (192 GB DDR2 on the board shown); SBE2 SB]
Figure 2.11: Caneland MP Supermicro serverboard, with the 7300 (Clarksboro) chipset
for the Xeon 7200/7300 DC/QC MP processor families [4]
2.1.1.3 The Core 2 based Caneland MP server platform (6)
Figure 2.12: Performance comparison of the Caneland platform with a quad core Xeon (7300 family)
vs the Bensley platform with a dual core Xeon 7140M [8]
2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (1)
MP platforms: Boxboro-EX
[Platform diagram: Nehalem-EX (Beckton, 8C, 45 nm, 3/2010) and Westmere-EX (10C, 32 nm, 4/2011) processors; 7500 (Boxboro) IOH with 2 QPI links and 32 PCIe 2nd gen. lanes (0.5 GB/s/lane/direction); ESI link (1 GB/s/direction) to the ICH10 (6/2008)]
2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (3)
Figure 2.13: The 8-core Nehalem-EX (Xeon 7500, Beckton) MP server processor [9]
2.1.1.4 The Nehalem-EX based Boxboro-EX MP server platform (4)
[Platform annotations: ESI link to the ICH10; SMI: serial link between the processor and the SMB; SMB: Scalable Memory Buffer (parallel/serial conversion); ME: Management Engine]
Wide range of scalability of the 7500/6500 IOH based Boxboro-EX platform [12]
[Recap diagram: the preceding single-FSB MP platform (preceding NB with e.g. DDR-200/266, preceding ICH over HI 1.5) vs. the Truland MP platform (two FSBs, 8500¹/8501 MCH, four XMBs with DDR-266/333 / DDR2-400, ICH5 over HI 1.5 at 266 MB/s)]
Evolution from the 90 nm Pentium 4 Prescott MP based Truland MP platform (up to 2 cores) to the
Core 2 based Caneland MP platform (up to 6 cores)
Xeon MP (Potomac) 1C → Xeon 7000 (Paxville MP) 2×1C → Xeon 7100 (Tulsa) 2C → Xeon 7200 (Tigerton DC) 1×2C → Xeon 7300 (Tigerton QC) 2×2C → Xeon 7400 (Dunnington) 6C
[Comparison diagram: the Truland platform (8500¹/8501 MCH, two FSBs, four XMBs with DDR-266/333 / DDR2-400, ICH5 over HI 1.5) vs. the Caneland platform (7300 MCH, four FSBs, FBDIMM channels with DDR2-533/667, up to 8 DIMMs/channel, 631xESB/632xESB over ESI)]
¹ The E8500 MCH supports an FSB of 667 MT/s and consequently supports only the SC Xeon MP (Potomac) processor.
2.1.1.5 Evolution of MP server platforms (4)
[Platform annotations: ESI link to the ICH10; SMI: serial link between the processor and the SMBs; SMB: Scalable Memory Buffer (parallel/serial converter); ME: Management Engine]
2.2.1 Larrabee
• Objectives:
High end graphics processing, HPC
Not a single product but a base architecture for a number of different products.
• Brief history:
Project started ~ 2005
First unofficial public presentation: 03/2006 (withdrawn)
First official public presentation: 08/2008 (SIGGRAPH)
Due in ~ 2009
• Performance (targeted):
2 TFlops
Basic architecture
Figure 2.16: Four socket MP server design with 24-core Larrabees connected by the CSI bus [41]
2.2.2 Intel’s Tile processor
Bisection bandwidth:
If the network is segmented into two equal parts,
this is the bandwidth between the two parts
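A minimal sketch of how this definition can be applied to a 2D mesh network such as the Tile processor’s on-die interconnect; the mesh dimensions and per-link bandwidth below are illustrative assumptions, not figures quoted from the slides.

```c
#include <stdio.h>

/* Bisection bandwidth of an R x C 2D mesh cut into two equal halves:
 * cutting across the longer dimension severs min(R, C) links
 * (one per row or column); each severed link contributes its bandwidth. */
static double mesh_bisection_bw(int rows, int cols, double link_bw_gbps)
{
    int cut_links = (rows < cols) ? rows : cols;   /* links crossing the cut */
    return cut_links * link_bw_gbps;
}

int main(void)
{
    /* Illustrative example: 8 x 10 mesh (80 tiles), 100 GB/s per link (assumed). */
    printf("Bisection bandwidth: %.0f GB/s\n", mesh_bisection_bw(8, 10, 100.0));
    return 0;
}
```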
Figure 2.18: Die photo and chip details of the Tile processor [14]
FP Multiply-Accumulate (A×B+C)
Performance at 4 GHz:
Peak SP FP: up to 1.28 TFlops (2 FPMA × 2 instr./cycle × 80 × 4 GHz = 1.28 TFlops)
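The peak figure above is simply the product of the four factors quoted in the formula; a quick arithmetic check (the factor names in the comments mirror the slide’s formula):

```c
#include <stdio.h>

int main(void)
{
    const double flops_per_fpma = 2.0;   /* A x B + C counts as 2 FP operations   */
    const double fpma_per_cycle = 2.0;   /* 2 FPMA instructions issued per cycle  */
    const double tiles          = 80.0;  /* 80 tiles on the chip                  */
    const double clock_hz       = 4.0e9; /* 4 GHz                                 */

    double peak = flops_per_fpma * fpma_per_cycle * tiles * clock_hz;
    printf("Peak SP FP: %.2f TFlops\n", peak / 1e12);   /* -> 1.28 TFlops */
    return 0;
}
```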
Figure 2.24: The full instruction set of the Tile processor [14]
VLIW
Figure 2.25: Instruction word and latencies of the Tile processor [14]
Figure 2.27: Instruction word and latencies of the Tile processor [14]
Figure 2.29: Lessons learned from the Tile processor (1) [14]
Figure 2.30: Lessons learned from the Tile processor (2) [14]
• 12/2009: Announced
• 9/2010: Many-core Application Research Project (MARC) initiative started on the SCC
platform
• Designed in Braunschweig and Bangalore
• 48 core, 2D-mesh system topology, message passing
Figure 3.3: Die shot of the Cell BE (221mm2, 234 mtrs) [44]
3.1.1 The Cell Processor (4)
• Cell BE - NIK
2007: Faculty Award (Cell 3D app./Teaching)
2008: IBM – NIK Research Agreement and Cooperation: Performance investigations
• IBM Böblingen Lab
• IBM Austin Lab
The Roadrunner
• 3.2.1 GPGPUs
Based on their FP32 computing capability and the large number of FP units available,
such graphics processors are also termed
GPGPUs (General Purpose GPUs)
or
cGPUs (computational GPUs).
Peak FP32/FP64 performance of Nvidia’s GPUs vs Intel’s P4 and Core 2 processors [17]
Evolution of the bandwidth of Nvidia’s GPUs vs Intel’s P4 and Core 2 processors [20]
Figure 3.14: Contrasting the utilization of the silicon area in CPUs and GPUs [21]
• Less area for control since GPGPUs have simplified control (same instruction for
all ALUs)
• Less area for caches since GPGPUs support massive multithreading to hide the
latency of long operations, such as memory accesses in case of cache misses.
[Evolution diagram of GPGPUs:
Nvidia: G80 (90 nm) → shrink → G92 (65 nm) → enhanced arch. → G200 (65 nm) → enhanced arch./shrink → GF100 (Fermi, 40 nm)
AMD/ATI: R600 (80 nm) → shrink → RV670 (55 nm) → enhanced arch. → RV770 (55 nm) → enhanced arch./shrink → RV870 (40 nm) → enhanced arch. → Cayman (40 nm)]
[Timeline figure: Nvidia releases 11/06, 10/07, 6/08; further releases (incl. the OpenCL standard) 6/07, 11/07, 6/08, 11/08; AMD/ATI releases 11/05, 5/07, 11/07, 5/08]
[Overview (continued):
Nvidia cores: GF100 (Fermi, 40 nm/3000 mtrs, 3/10), GF104 (Fermi, 40 nm/1950 mtrs, 07/10), GF110 (Fermi, 40 nm/3000 mtrs, 11/10)
Cards: GTX 470 (448 ALUs, 320-bit), GTX 480 (480 ALUs, 384-bit), GTX 460 (336 ALUs, 192/256-bit), GTX 580 (512 ALUs, 384-bit), GTX 560 Ti (1/11)
CUDA versions: 2.2, 2.3, 3.0, 3.1, 3.2, 4.0 (Beta)
AMD/ATI releases: 9/09, 10/10, 12/10]
Figure 3.18: Overview of GPGPUs and their basic software support (2)
3.2.1.3 Example 1: Nvidia’s Fermi GF 100 (1)
Announced: 30. Sept. 2009 at NVidia’s GPU Technology Conference, available: 1Q 2010 [22]
Sub-families of Fermi
Fermi includes three sub-families with the following representative cores and features:
[Table excerpt (sub-family row): GF110 – 11/2010, 16 SMs, 512 ALUs, 3000 mtrs, gen. 2.0, general purpose]
1 In the associated flagship card (GTX 480) however, one of the SMs has been disabled, due to overheating
problems, so it has actually only 15 SIMD cores, called Streaming Multiprocessors (SMs) by Nvidia and 480
FP32 EUs [69]
NVidia: 16 cores (Streaming Multiprocessors, SMs)
Remark
In the associated flagship card (GTX 480), however, one SM has been disabled due to
overheating problems, so it actually has 15 SMs and 480 ALUs [a]
Fermi GF100
Note
The high level microarchitecture of Fermi evolved from a graphics oriented structure
to a computation oriented one, complemented with the units needed for graphics processing.
SFU: Special Function Unit
1 SM includes 32 ALUs (called “CUDA cores” by NVidia)
SP FP: 32-bit
Remark
The Fermi line supports the Fused Multiply-Add (FMA) operation, rather than the Multiply-Add
operation performed in previous generations.
Previous lines
Fermi
Figure 3.21: Contrasting the Multiply-Add (MAD) and the Fused Multiply-Add (FMA) operations [27]
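The practical difference is that FMA computes a×b+c with a single final rounding, whereas MAD rounds the product before the addition. Below is a small host-side C sketch using the standard fmaf() from <math.h>, purely to illustrate this rounding behaviour (it is not Nvidia’s hardware path; the input values are chosen so the two results differ):

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    float a = 1.0f + ldexpf(1.0f, -12);     /* 1 + 2^-12                         */
    float b = a;
    float c = -(1.0f + ldexpf(1.0f, -11));  /* -(1 + 2^-11)                      */

    float prod  = a * b;                    /* product rounded to float first    */
    float mad   = prod + c;                 /* MAD: two roundings -> typically 0 */
    float fused = fmaf(a, b, c);            /* FMA: single rounding -> ~2^-24    */

    printf("MAD : %.10e\n", mad);
    printf("FMA : %.10e\n", fused);         /* ~5.96e-08 */
    return 0;
}
```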
Tesla cards
[Table: Flagship Tesla cards – C1060: peak FP64 perf./card = 30 × 1 × 2 × 1296 MHz ≈ 77.8 GFLOPS; C2070: 14 × 16 × 2 × 1150 MHz ≈ 515.2 GFLOPS]
¹ In their GPGPU Fermi cards Nvidia activates only 4 FP64 units from the available 16
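The two peak values in the table row are just products of SM count, FP64 units per SM, FLOPs per unit per cycle and shader clock; a small check of that arithmetic (the factor labels follow the table’s formula):

```c
#include <stdio.h>

/* Peak FP64 throughput = SMs x FP64 units/SM x FLOPs/unit/cycle x clock */
static double peak_fp64_gflops(int sms, int fp64_units, double clock_mhz)
{
    return sms * fp64_units * 2.0 * clock_mhz / 1e3;   /* 2 FLOPs per FMA/MAD */
}

int main(void)
{
    printf("C1060: %.1f GFLOPS\n", peak_fp64_gflops(30,  1, 1296.0));  /* ~77.8  */
    printf("C2070: %.1f GFLOPS\n", peak_fp64_gflops(14, 16, 1150.0));  /* ~515.2 */
    return 0;
}
```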
3.2.1.4 Example 2: Intel’s on-die integrated CPU/GPUs (1)
Integration to the chip
[Spec table for five Sandy Bridge desktop models (model columns not recovered):
DDR3: 1333 MHz for all models
L3 cache: 6 MB / 6 MB / 6 MB / 8 MB / 8 MB
Intel HD Graphics: 2000 / 2000 / 3000 / 2000 / 3000
GPU max freq: 1100 / 1100 / 1100 / 1350 / 1350 MHz
Hyper-Threading: No / No / No / Yes / Yes
Socket: LGA 1155 for all models]
[Sandy Bridge key features: Hyperthreading; 32K L1D (3 clk); AES instructions; AVX 256-bit, 4 operands; VMX Unrestricted; ~20 mm²/core]
[Comparison chart residue: AMD HD5570 (400 ALUs) vs. the integrated graphics of the i5/i7-2xxx (Sandy Bridge) and i5-6xx (Arrandale) processors]
4. Outlook
Heterogeneous multicores
• Master/slave type multicores (M(Ma) = M(CPU), M(S) = M(D))
• Add-on type multicores
Master-slave type multicores require much more intricate workflow control and
synchronization than add-on type multicores.
It can be expected that add-on type multicores will dominate the future of heterogeneous
multicores.
[1]: Gilbert J. D., Hunt S. H., Gunadi D., Srinivas G., The Tulsa Processor: A Dual Core Large
Shared-Cache Intel Xeon Processor 7000 Sequence for the MP Server Market Segment,
Aug 21 2006, http://www.hotchips.org/archives/hc18/3_Tues/HC18.S9/HC18.S9T1.pdf
[2]: Intel Server Board Set SE8500HW4, Technical Product Specification, Revision 1.0,
May 2005, ftp://download.intel.com/support/motherboards/server/sb/se8500hw4_board_
set_tpsr10.pdf
[3]: Intel® E8501 Chipset North Bridge (NB) Datasheet, May 2006,
http://www.intel.com/design/chipsets/e8501/datashts/309620.htm
[4]: Supermicro Motherboards, http://www.supermicro.com/products/motherboard/
[5]: Next-Generation AMD Opteron Processor with Direct Connect Architecture – 4P Server
Comparison, http://www.amd.com/us-en/assets/content_type/DownloadableAssets/4P_
Server_Comparison_PID_41461.pdf
[6]: Supermicro P4QH6 / P4QH8 User’s Manual, 2002,
http://www.supermicro.com/manuals/motherboard/GC-HE/MNL-0665.pdf
[7]: Intel® 7300 Chipset Memory Controller Hub (MCH) – Datasheet, Sept. 2007,
http://www.intel.com/design/chipsets/datashts/313082.htm
[8]: Quad-Core Intel® Xeon® Processor 7300 Series Product Brief, Intel, Nov. 2007
http://download.intel.com/products/processor/xeon/7300_prodbrief.pdf
[9]: Mitchell D., Intel Nehalem-EX review, PCPro,
http://www.pcpro.co.uk/reviews/processors/357709/intel-nehalem-ex
[10]: Nagaraj D., Kottapalli S.: Westmere-EX: A 20 thread server CPU, Hot Chips 2010
http://www.hotchips.org/uploads/archive22/HC22.24.610-Nagara-Intel-6-Westmere-EX.pdf
References (2)
[12]: Intel Xeon Processor 7500/6500 Series, Public Gold Presentation, March 30 2010,
http://cache-www.intel.com/cd/00/00/44/64/446456_446456.pdf
[14]: Mattson T., The Future of Many Core Computing: A tale of two processors, March 4 2010,
http://og-hpc.com/Rice2010/Slides/Mattson-OG-HPC-2010-Intel.pdf
[15]: Kirsch N., An Overview of Intel's Teraflops Research Chip, Febr. 13 2007, Legit Reviews,
http://www.legitreviews.com/article/460/1/
[18]: Chu M. M., GPU Computing: Past, Present and Future with ATI Stream Technology,
AMD, March 9 2010,
http://developer.amd.com/gpu_assets/GPU%20Computing%20-%20Past%20
Present%20and%20Future%20with%20ATI%20Stream%20Technology.pdf
[19]: Hwu W., Kirk D., Nvidia, Advanced Algorithmic Techniques for GPUs, Berkeley,
January 24-25 2011
http://iccs.lbl.gov/assets/docs/20110124/lecture1_computational_thinking_Berkeley_2011.pdf
[20]: Shrout R., Nvidia GT200 Revealed – GeForce GTX 280 and GTX 260 Review,
PC Perspective, June 16 2008,
http://www.pcper.com/article.php?aid=577&type=expert&pid=3
[21]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 2.0,
June 2008, Nvidia,
http://developer.download.nvidia.com/compute/cuda/2_0/docs/NVIDIA_CUDA_Programming
Guide_2.0.pdf
[22]: Next Gen CUDA GPU Architecture, Code-Named “Fermi”, Press Presentation at Nvidia’s
2009 GPU Technology Conference, (GTC), Sept. 30 2009,
http://www.nvidia.com/object/gpu_tech_conf_press_room.html
[23]: Nvidia’s Next Generation CUDATM Compute Architecture: FermiTM, Version 1.1, 2009
http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_
Architecture_Whitepaper.pdf
[24]: Kanter D., Inside Fermi: Nvidia's HPC Push, Real World Technologies Sept 30 2009,
http://www.realworldtech.com/includes/templates/articles.cfm?ArticleID=RWT093009110932&
mode=print
[25]: Kirsch N., NVIDIA GF100 Fermi Architecture and Performance Preview,
Legit Reviews, Jan 20 2010, http://www.legitreviews.com/article/1193/2/
[26]: Wasson S., Inside Fermi: Nvidia's 'Fermi' GPU architecture revealed,
Tech Report, Sept 30 2009, http://techreport.com/articles.x/17670/1
[27]: Glaskowsky P. N., Nvidia’s Fermi: The First Complete GPU Computing Architecture
Sept 2009, http://www.nvidia.com/content/PDF/fermi_white_papers/
P.Glaskowsky_NVIDIA's_Fermi-The_First_Complete_GPU_Architecture.pdf
[28]: Kanter D., “NVIDIA’s GT200: Inside a Parallel Processor,” Real World Technologies,
Sept. 8 2008, http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242
[29]: Hoenig M., Nvidia GeForce 580 Review, HardwareCanucks, Nov. 8, 2010,
http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/37789-nvidia-
geforce-gtx-580-review-5.html
[30]: Wasson S., Nvidia's GeForce GTX 480 and 470 graphics processors, Tech Report,
March 31 2010, http://techreport.com/articles.x/18682
[31]: Piazza T., Dr. Jiang H., Microarchitecture Codename Sandy Bridge: Processor Graphics,
Presentation ARCS002, IDF San Francisco, Sept. 2010
[32]: Kahn O., Valentine B., Microarchitecture Codename Sandy Bridge: New Processor
Innovations, Presentation ARCS001, IDF San Francisco Sept. 2010
[33]: Hagedoorn H. Mohammad S., Barling I. R., Core i5 2500K and Core i7 2600K review,
Jan. 3 2011,
http://www.guru3d.com/article/core-i5-2500k-and-core-i7-2600k-review/2
[36]: Shimpi A. L., The Sandy Bridge Review: Intel Core i7-2600K, i5-2500K and Core i3-2100
Tested, AnandTech, Jan. 3 2011,
http://www.anandtech.com/show/4083/the-sandy-bridge-review-intel-core-i7-600k-i5-2500k-
core-i3-2100-tested/11
[37]: Wall D. W.: Limits of Instruction Level Parallelism, WRL TN-15, Dec. 1990
[38]: Bhandarkar D.: „The Dawn of a New Era”, Presentation EMEA, May 11 2006.
[43]: Taylor M. B. et al.: Evaluation of the Raw Microprocessor, Proc. ISCA 2004
http://groups.csail.mit.edu/cag/raw/documents/raw_isca_2004.pdf
[44]: Wright C., Henning P.: Roadrunner Tutorial, An Introduction to Roadrunner and the Cell
Processor, Febr. 7 2008,
http://ebookbrowse.com/roadrunner-tutorial-session-1-web1-pdf-d34334105
[45]: Seguin S.: IBM Roadrunner Beats Cray’s Jaguar, Tom’s Hardware, Nov. 18 2008
http://www.tomshardware.com/news/IBM-Roadrunner-Top500-Supercomputer,6610.html
Dezső Sima
• 1. Introduction
• 2. Core 2
• 3. Penryn
• 4. Nehalem
• 5. Nehalem-EX
• 6. Westmere
• 7. Westmere-EX
• 8. Sandy Bridge
Remarks
1) To preserve a clear review the discussion of the basic architectures is restricted only to
Intel’s “standard voltage” basic lines.
Medium-voltage/low-voltage/ultra-low voltage processors are not included.
2) The release dates given relate to the first processors shipped in a considered line.
Subsequently shipped models of the lines are not taken into account in order to keep
the overviews comprehensible.
3) On the slides the core numbers reflect the max. number of cores.
Usually, manufacturers also provide processors with fewer than the max. number of cores.
[Tick-tock roadmap excerpt (2-year cadence):
TOCK: Pentium 4 / Willamette, 180 nm, 11/2000 – new microarch.
TOCK: Pentium 4 / Northwood, 130 nm, 01/2002 – adv. microarch., hyperthreading
TICK: 65 nm, 01/2006
TOCK: Core 2, 07/2006 – new microarch., 4-wide core, 128-bit SIMD, no hyperthreading
TICK: 45 nm, 11/2007
TICK: 32 nm, 01/2010]
2006: 65 nm – Core 2
2007: 45 nm – Penryn
2008: 45 nm – Nehalem
2010: 32 nm – Westmere
[Roadmap fragment (feature rows): process/transistor counts from 0.18 µ/42 mtrs to 0.09 µ/125 mtrs; clock rates from 1.4 GHz to 3.8 GHz; on-die caches from 256 KB L2 to 1 MB L2 (and 1 MB L3); FSB speeds 400–800 MHz; sockets µPGA 603/604; last step cancelled 5/04]
[Roadmap figure (desktop Pentium 4 line): Willamette (11/00, 0.18 µ/42 mtrs, 256 KB L2, 400 MHz FSB, µPGA 423) → Northwood-A/B/C (0.13 µ/55 mtrs, 512 KB L2, 400/533/800 MHz FSB, µPGA 478) → Prescott and Prescott-F (0.09 µ/125 mtrs, 1 MB L2, 800 MHz FSB, µPGA 478/LGA 775) → Tejas (4.0/4.2 GHz, cancelled 5/04); legend: cores supporting hyperthreading / cores with EM64T implemented but not enabled / cores supporting EM64T]
[Table: 90 nm / 112 mm² / 125 mtrs | 90 nm / 135 mm² / 169 mtrs | 65 nm / 81 mm² / 188 mtrs | 65 nm / 2 × 81 mm² / 2 × 188 mtrs]
¹ The original Prescott core included but did not activate the support of 64-bit operation (called EM64T) and used a µPGA 478 socket.
EM64T support was released later, about 6/2004, while changing to the LGA 775 socket.
Figure 1.4: Genealogy of the Cedar Mill core and the DC Presler processor
2.1 Introduction
Core 2 microarchitecture
The Pentium 4 line was cancelled due to the unmanageably high dissipation figures
of its third core (the 90 nm Prescott core),
caused primarily by the design philosophy of the line,
which preferred raising clock frequency over core efficiency for increasing performance.
For the development of the next processor line, dissipation became the key design issue.
The next line was therefore based primarily on the Pentium M – Intel’s first mobile line –
since for its designs dissipation reduction was a key issue.
(The Pentium M line was a 32-bit line with 3 subsequent cores, designed at
Intel’s Haifa Research and Development Center.)
Intel® Advanced Smart Cache
4-wide core
By contrast, both Intel’s previous Pentium 4 family and AMD’s K8 have 3-wide cores.
Figure 2.3: Block diagram of Intel’s Pentium 4 microarchitecture [5]
2.2 Wide execution (4)
Retire width: 3 instr./cycle
2.2 Wide execution (6)
2.2 Wide execution (8)
Figure 2.6: Issue ports and execution unit of the Pentium 4 [9]
Ports 0 and 1 can issue up to two microinstructions per cycle, allowing altogether up to
6 microinstr./cycle to be issued.
Remark
Both the Core’s and the Pentium 4’s schedulers can issue 6 operations per cycle, but
• Pentium 4’s schedulers have only 4 ports, with two double pumped simple ALUs,
• by contrast Core has a unified scheduler with 6 ports, allowing more flexibility
for issuing instructions.
Remark
IBM’s POWER4 and subsequent processors of this line have introduced 5-wide cores
with 8-wide out of order issue.
• Micro-op fusion can reduce the total number of micro-ops to be processed by more than
10 %.
Remark
Example
• AMD’s K8-based processors became the performance leader, first of all on the DP and MP
server market, where the 64-bit direct connect architecture has clear benefits
vs Intel’s 32-bit Pentium 4 based processors using shared FSBs to connect processors
to north bridges.
Figure 2.10: DP web server performance comparison: AMD Opteron 248 vs. Intel Xeon 2.8 [6]
2.2 Wide execution (21)
“In the extensive benchmark tests under Linux Enterprise Server 8 (32 bit as well as
64 bit), the AMD Opteron made a good impression. Especially in the server disciplines,
the benchmarks (MySQL, Whetstone, ARC 2D, NPB, etc.) show quite clearly that the
Dual Opteron puts the Dual Xeon in its place”.
• This situation completely changed in 2006 when Intel introduced their Core 2
microarchitecture, with a 4-wide front-end and retire unit compared to the 3-wide K8 or
the Pentium 4.
[Chart: Webserver performance of MSI K2-102A2M boards with Opteron 275 and Opteron 280, and Opteron 280 vs. an extrapolated 3 GHz Opteron and a 3 GHz Xeon 5160]
Figure 2.11: DP web server performance comparison: AMD Opteron 275/280 vs. Intel Xeon 5160 [8]
Remark
Both web-server benchmark results were published from the same source (AnandTech)
[Diagram: private L2 caches vs. a shared L2 cache; the shared L2 provides + 2× bandwidth to the L1 caches]
Figure 2.14: Data sharing in shared and private (independent) L2 cache implementations [11]
2.3 Smart L2 cache (5)
Trend
• Memory disambiguation
• Enhanced hardware prefetchers
Figure 2.15: Units involved in implementing memory disambiguation or hardware prefetching [12]
2.4 Smart memory accesses (2)
Memory disambiguation
Aim
Hiding memory latency through reordering of loads.
Example
Figure 2.16: Example for memory reordering (memory disambiguation), Loads may bypass
Stores [13]
2.4 Smart memory accesses (4)
[Table: Load reordering under strong vs. weak sequential consistency; typical examples of the strong model: Pentium Pro, Pentium II, III, Pentium 4; weak model: see later]
Example
Figure 2.18: Example for memory reordering (memory disambiguation), Loads may bypass
Stores [13]
2.4 Smart memory accesses (7)
• Loads may bypass only Stores whose target addresses differ from that of the Load,
else the Load would access a former, incorrect value, from the target address.
• However, Store addresses are not always known at the time when the scheduler needs
to decide whether or not the Load considered is allowed to bypass a Store.
• There are two options how to proceed when Store addresses are not yet known.
Deterministic Store bypassing:
Loads bypass Stores only if all respective Store addresses are known and the Load address
does not coincide with any of the Store addresses to be bypassed.
Speculative Store bypassing:
Loads may bypass Stores also in cases when respective Store addresses are not yet known,
that is not yet computed. Then the correctness of the speculative Load needs to be checked,
e.g. as follows: each calculated Store address is compared to all younger Load addresses;
for a hit, this Load and subsequent instructions are aborted and re-executed.
Examples
Figure 2.19: Introduction of Load reordering related to both Loads and Stores
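A tiny C fragment illustrating the hazard that both policies deal with: the load must not be allowed to bypass the preceding store when both refer to the same location, which the hardware may only discover once the store address has been computed.

```c
#include <stdio.h>

int main(void)
{
    int a[2] = { 10, 20 };
    int i = 0, j = 0;          /* imagine i and j being computed late, at run time   */

    a[i] = 42;                 /* Store: address a+i known only once i is ready      */
    int x = a[j];              /* Load : may it bypass the store above?              */
                               /* If i == j, bypassing would return the stale 10;
                                * speculative bypassing must detect the collision and
                                * replay the load (here x must end up as 42).        */
    printf("%d\n", x);
    return 0;
}
```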
2.4 Smart memory accesses (9)
Remark 1
Available literature sources for the UltraSPARC processor do not allow a clear determination
of the load bypassing option used. Based on these sources, deterministic load bypassing
was assumed.
Remark 2
x86 processors have much more complex addressing modes, requiring a number of
address additions, compared to RISC processors.
This is the reason why load reordering was introduced much later in x86 processors
than in RISC processors.
Assumed reason
Stores are less frequent than Loads (~ 1/3–1/4),
so the performance gain that could be obtained does not pay off the additional complexity
(and dissipation).
Figure 2.20: Assumed hardware structure to implement speculative Loads in Intel’s Core 2 [4]
2.4 Smart memory accesses (12)
• Actually, when a Load is issued from the Reservation Station’s scheduler to the Load
Buffer, a predictor is looked up.
• If the prediction is “non colliding”, a Store with an unknown store address may be bypassed,
otherwise not.
Figure 2.21: Principle of the implementation of speculative Loads in Intel’s Core 2 [12]
2.4 Smart memory accesses (14)
Source: [12]
Remark 1
There is a further option to hide memory latency, called Store to Load forwarding, or
Store Buffer forwarding.
Store to Load forwarding
Forwarding store data immediately from the Store buffer to a Load, without waiting
for the data to be written to the cache, in cases when the last Store writing the
same address as referenced by the Load is still available in the Store buffer.
Examples
Pentium 4 (2000)
(Core 2 2006)
Penryn (2007)
Merom (2008)
AMD Athlon64 (2003)
AMD Opteron (2003)
Remark 2
There is yet another option to reduce D cache load-use latency, called speculative Loads.
Speculative Loads
• Issuing a load-use dependent instruction before it turns out whether the data cache
hits or misses, optimistically, in expectation of a cache hit. Then data become
available typically 1-2 clock cycles earlier than with traditional processing.
• If the expectation turns out to be wrong, the execution of the instruction concerned
is aborted and re-executed after the cache miss is serviced.
[Overview figure rows: Pentium line; Pentium Pro, Pentium II/III lines; Pentium 4; Core 2; Penryn; Nehalem; AMD Athlon 64 and Opteron lines]
Figure 2.23: Overview of memory access reordering schemes and their use in x86 processors
2.4 Smart memory accesses (20)
Remarks
• Intel’s first on-die L2 cache debuted only about one year earlier (10/1999),
in the second core of the Pentium III line (called the Coppermine core,
built on 180 nm technology, with a size of 256 KB).
Figure 2.25: Widening the FP/SSE Execution Units from 64-bit to 128-bit [12]
2.5 Enhanced digital media support (3)
Single cycle 128-bit execution as a result of widening the FP/SSE Execution units
MultiMedia eXtensions
Northwood (Pentium 4)
8 MM registers (64-bit),
aliased on the FP Stack registers
Northwood (Pentium4)
Norhwood
Northwood (Pentium4)
Ivy Bridge
Figure 2.28: Intel’s x86 ISA extensions – the SIMD register space (based on [18])
2.5 Enhanced digital media support (7)
[Figure labels: Northwood (Pentium 4); DSP-oriented FP enhancements, enhanced thread manipulation; media acceleration (video encoding, MM, gaming)]
Figure 2.30: Intel’s x86 ISA extensions - the operations introduced (based on [17])
2.5 Enhanced digital media support (9)
Arithmetic:
addpd - Adds 2 64bit doubles.
addsd - Adds bottom 64bit doubles.
subpd - Subtracts 2 64bit doubles.
subsd - Subtracts bottom 64bit doubles.
mulpd - Multiplies 2 64bit doubles.
mulsd - Multiplies bottom 64bit doubles.
divpd - Divides 2 64bit doubles.
divsd - Divides bottom 64bit doubles.
maxpd - Gets largest of 2 64bit doubles for 2 sets.
maxsd - Gets largest of 2 64bit doubles to bottom set.
minpd - Gets smallest of 2 64bit doubles for 2 sets.
minsd - Gets smallest of 2 64bit values for bottom set.
paddb - Adds 16 8bit integers.
paddw - Adds 8 16bit integers.
paddd - Adds 4 32bit integers.
paddq - Adds 2 64bit integers.
paddsb - Adds 16 8bit integers with saturation.
paddsw - Adds 8 16bit integers using saturation.
paddusb - Adds 16 8bit unsigned integers using saturation.
paddusw - Adds 8 16bit unsigned integers using saturation.
psubb - Subtracts 16 8bit integers.
psubw - Subtracts 8 16bit integers.
psubd - Subtracts 4 32bit integers.
psubq - Subtracts 2 64bit integers.
psubsb - Subtracts 16 8bit integers using saturation.
psubsw - Subtracts 8 16bit integers using saturation.
psubusb - Subtracts 16 8bit unsigned integers using saturation.
psubusw - Subtracts 8 16bit unsigned integers using saturation.
pmaddwd - Multiplies 16bit integers into 32bit results and adds results.
pmulhw - Multiplies 16bit integers and returns the high 16bits of the result.
pmullw - Multiplies 16bit integers and returns the low 16bits of the result.
pmuludq - Multiplies 2 32bit pairs and stores 2 64bit results.
rcpps - Approximates the reciprocal of 4 32bit singles.
rcpss - Approximates the reciprocal of bottom 32bit single.
sqrtpd - Returns square root of 2 64bit doubles.
sqrtsd - Returns square root of bottom 64bit double.
Figure 2.31: Excerpt of Intel’s SSE2 SIMD ISA extension [19]
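A minimal host-code sketch of how a few of the packed-double instructions listed above are exposed through compiler intrinsics (the intrinsics below map to addpd, mulpd and maxpd on SSE2-capable compilers; the data values are just an example):

```c
#include <stdio.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

int main(void)
{
    __m128d a = _mm_set_pd(3.0, 1.0);   /* packs {1.0, 3.0}   */
    __m128d b = _mm_set_pd(4.0, 2.0);   /* packs {2.0, 4.0}   */

    __m128d sum = _mm_add_pd(a, b);     /* addpd: {3.0,  7.0} */
    __m128d prd = _mm_mul_pd(a, b);     /* mulpd: {2.0, 12.0} */
    __m128d mx  = _mm_max_pd(a, b);     /* maxpd: {2.0,  4.0} */

    double out[2];
    _mm_storeu_pd(out, sum); printf("addpd: %g %g\n", out[0], out[1]);
    _mm_storeu_pd(out, prd); printf("mulpd: %g %g\n", out[0], out[1]);
    _mm_storeu_pd(out, mx ); printf("maxpd: %g %g\n", out[0], out[1]);
    return 0;
}
```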
2.5 Enhanced digital media support (10)
[Figure labels: Northwood (Pentium 4) – 2 × 32-bit MMX and 2 × 32-bit SSE EUs; Larrabee – 24–32? × 512-bit FP/SSE EUs; processor labels up to Ivy Bridge]
Figure 2.32: SIMD execution resources in Intel’s basic processors (based on [18])
2.5 Enhanced digital media support (11)
Figure 2.33: Achieved performance boost in Core2 for gaming vs AMD’s Athlon 64 FX60 [13]
2.6 Intelligent power management (1)
Figure 2.34: The operation of the Ultra fine grained power control – an example [11].
2.6 Intelligent power management (3)
Bus splitting
Principle of operation
Most buses are sized for the worst case; activating only the needed bus width saves power.
Digital Thermal Sensors (DTS) on the dies, instead of analog diodes, providing digital data
scanned by dedicated logic.
Figure 2.37: Principle of the PECI-based platform fan speed control [42]
Remark
PECI reports the relative temperature values measured below the onset value of the
thermal control circuit (TCC)
Remark
In the Nehalem Intel modified the Loop Stream Detector as follows:
Figure 2.39: The modified loop Stream Detector in the Nehalem [1]
MP-Servers
72xx, 2C, Tigerton DC, (2xMP-enhanced SC Woodcrest) 9/2007
73xx, 2x2C, Tigerton QC, (2xMP-enhanced DC Woodcrest) 9/2007
Based on [43]
3. Penryn
3.1 Introduction
Penryn
Sub-threshold =
Source-Drain
Figure 3.1: Dynamic and static power dissipation trends in chips [21]
3.1 Introduction (3)
2 x Core 2 x Penryn
Figure 3.4: The 45 nm Penryn is a shrink of the 65 nm Core with a few enhancements [25]
3.1 Introduction (6)
Figure 3.5: Key enhancements introduced into Penryn’s microarchitecture vs the Core
(based on [25])
[Figure labels: Radix-r divider; QSL: Quotient Select; hybrid (producing Gs and Ps)]
Figure 3.6: Simplified block diagram and latency of Penryn’s radix-16 divider [27]
3.3 More advanced L2 cache (1)
Core 2 Penryn
SSE4.1 ISA extension (47 instructions): the largest set of ISA extensions introduced since 2000.
Fast access to graphics card memory, e.g. by using two threads for CPU-GPU sharing
Figure 3.14: Latency improvements achieved by Penryn’s Super Shuffle Engine [30]
Microarchitecture comparison
Figure 3.15: Performance improvements of Penryn vs Core at the same clock frequency [26]
3.4 More advanced digital media support (10)
Figure 3.16: Extending Intel’s performance leadership in main application segments [26]
(First introduced in the Core Duo, the 3rd core of the Pentium M line)
• Intelligent heuristics decide when to enter it (OS API WAIT)
Figure 3.19: Power reduction achieved by the Deep Power Down Technology [27]
3.5 More advanced power management (5)
Remark 1
A similar technique was already developed for the Montecito (dual core Itanium),
but not implemented, called the Foxton technology.
Remark 2
Intel’s next basic core, the Nehalem includes a more advanced technology than
the Enhanced Dynamic Acceleration Technology, called the Turbo Boost Technology for
increasing clock frequency in case of inactive cores or light workloads.
-DP
-MP (6 cores)
Desktops
Core 2 Duo E8xxx, Wolfdale, 2C, 1/2008
Core 2 Duo E7xxx, Wolfdale-3M, 2C, 4/2008
Core 2 Quad Q9xxx, Yorkfield-6M, 2x2C, (2x Wolfdale-3M), 3/2008
Core 2 Quad Q8xxx, Yorkfield-6M, 2x2C, (2x Wolfdale-3M), 8/2008
Core 2 Extreme QX9xxx, Yorkfield XE, 2x2C (2x Wolfdale), 11/2007
Servers
UP-Servers
E31xx Wolfdale, 2C, 1/2008
X33xx, Yorkfield-6M, 2x2C, (2xWolfdale), 1/2008
X33xx, Yorkfield, (2xWolfdale), 2x2C, 1/2008
DP-Servers
E52xx, Wolfdale, 2C, 11/2007
E54xx/X54xx, Harpertown 2x2C, (2xWolfdale), 11/2007
MP-Servers
E74xx, 4C/6C, Dunnington, 9/2008
4.1 Introduction
Nehalem
Experiences with HT
The design effort took about five years and required thousands of engineers
(Ronak Singhal, lead architect of Nehalem) [37].
Nehalem lines
Mobiles Mobiles
Core i7-9xxM Clarksfield 4C 9/2009
Core i7-8xxQM Clarksfield 4C 9/2009
Core i7-7xxQM Clarksfield 4C 9/2009
Desktops Desktops
Core i7-9xx (Bloomfield) 4C 11/2008 Core i7-8xx (Lynnfield) 4C 9/2009
Core i5-7xx (Lynnfield) 4C 9/2009
Servers Servers
UP-Servers UP-Servers
DP-Servers DP-Servers
55xx Gainestown (Nehalem-EP) 4C 3/2009 C55xx Jasper forest1 2C/4C 2/2010
Based on [44]
4.1 Introduction (4)
Bloomfield [45]
Note
• Both the Bloomfield (desktop) chip and the Gainestown (DP server) chip have the same
layout.
• On the Bloomfield die there are two QPI bus controllers; however, not both of them are
needed for this desktop part.
One of them is simply not used in the desktop version (Bloomfield) [45], but both are needed
in the DP alternative (Gainestown).
4.1 Introduction (5)
The 2. generation Lynnfield chip as a major redesign of the Bloomfield chip (1) [46]
• It is a cheaper and more effective two-chip system solution instead of the previous three
chip solution.
• It is connected to the P55 chipset by a DMI interface rather than by a QPI interface
used in the previous system solution to connect the Bloomfield chip to the X58 chipset.
Intel's Bloomfield Platform (X58 + LGA-1366) Intel's Lynnfield Platform (P55 + LGA-1156)
4.1 Introduction (6)
The 2. generation Lynnfield chip as a major redesign of the Bloomfield chip (2) [46]
• It provides PCIe 2.0 lanes (16 to 32 lanes) to attach a graphics card directly to the
processor rather than to the north bridge (by 36 lanes) as done in the previous solution.
Intel's Bloomfield Platform (X58 + LGA-1366) Intel's Lynnfield Platform (P55 + LGA-1156)
4.1 Introduction (7)
The Lynnfield chip as a major redesign of the Bloomfield chip (3) [46]
• It supports only two DDR3 memory channels instead of three as in the previous solution.
• Its socket needs fewer connections (LGA-1156) than the Bloomfield (LGA-1366).
• All in all the Lynnfield chip is a cheaper and more effective successor of the Bloomfield chip.
Intel's Bloomfield Platform (X58 + LGA-1366) Intel's Lynnfield Platform (P55 + LGA-1156)
4.1 Introduction (8)
Remarks [49]
• All 3 chips are basically the same design.
• In Jasper Forest the circled part is the QPI
controller, however this part remains blank
in the mobile and desktop versions as
these versions do not provide a QPI link.
Remark
The embedded DP server Jasper Forest (Xeon C5500) is not an “all QPI solution”.
It provides a single QPI bus along with a DMI bus for the 3420 chipset (Picket Post Platform).
• Native 4C
• Simultaneous Multithreading (SMT)
• New cache architecture
• SSE 4.2 ISA extension
• Integrated memory controller
• QuickPath Interconnect bus (QPI)
• Enhanced power management
• Advanced virtualization
• New socket
Figure 4.3: Overview of the major innovations of 1. generation Nehalem processors (based on [22])
(The die photo is that of the Bloomfield/Gainestown processor)
4.2 Simultaneous Multithreading (SMT) (1)
Benefits
Deeper buffers
[Cache comparison figure (Pentium M / Core / Nehalem): shared L2 cache (e.g. 4 MB for two cores) vs. private L2 caches (256 KB per core) plus an inclusive L3 cache of up to 8 MB]
• The L2 cache is private again rather than shared as in the Core and Penryn processors
Private L2: Pentium 4, Nehalem
Shared L2: Core, Penryn
Private caches allow a more effective hardware prefetching than shared ones.
Reason
• Hardware prefetchers look for memory access patterns.
• Private L2 caches have more easily detectable memory access patterns
than shared L2 caches.
Remark
[Table: private vs. shared L2 usage in IBM’s POWER4, POWER5 and POWER6 lines]
• with inclusive L3 caches an L3 cache miss means that the referenced data
doesn’t exist in any core’s L2 caches, thus no L2 snooping is needed.
• By contrast, with exclusive L3 caches the referenced data may exist in any
of the L2 caches, thus L2 snooping is required.
Demonstration example
Exclusive Inclusive
L3 Cache L3 Cache
Figure 4.10: Comparing exclusive and inclusive cache behavior in case of a L3 cache miss (1) [1]
Exclusive Inclusive
L3 Cache L3 Cache
MISS! MISS!
Figure 4.11: Comparing exclusive and inclusive cache behavior in case of a L3 cache miss (2) [1]
Exclusive Inclusive
L3 Cache L3 Cache
MISS! MISS!
Figure 4.12: Comparing exclusive and inclusive cache behavior in case of a L3 cache miss (3) [1]
Exclusive Inclusive
L3 Cache L3 Cache
HIT! HIT!
Figure 4.13: Comparing exclusive and inclusive cache behavior in case of a L3 cache hit (1) [1]
Inclusive
Figure 4.14: Comparing exclusive and inclusive cache behavior in case of a L3 cache hit (2) [1]
Main features
Figure 4.17: Non Uniform Memory Access (NUMA) in multi-socket servers [1]
4.5 Integrated memory controller (4)
Remark
Figure 4.20: Intel’s roadmap from 1999 showing Timna due to 2H 2000. [35]
[Link diagram: TX unidirectional link and RX unidirectional link, each with 20 lanes: 16 data, 2 protocol, 2 CRC]
• Each unidirectional link comprises 20 data lanes and a clock lane, with
each lane consisting of a pair of differential signals.
Note
(The QPI isn’t an I/O interface, the standard I/O interface remains the PCI-Express bus)
• Consists of 2 unidirectional links, one in each directions, called the TX and RX links.
• Each unidirectional link comprises 20 data lanes (16 data, 2 protocol, 2 CRC) and a clock lane,
with each lane consisting of a pair of differential signals.
Figure 4.23: QPI based DP and MP server system architectures [31], [33]
Fastest FSB:
400 MHz QDR × 8 bytes = 12.8 GB/s (bidirectional)
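The 12.8 GB/s figure follows directly from the 400 MHz base clock, quad data rate and the 8-byte wide data bus quoted above; a quick check:

```c
#include <stdio.h>

int main(void)
{
    double base_clock_hz = 400e6;   /* 400 MHz FSB base clock        */
    double qdr           = 4.0;     /* quad data rate -> 1600 MT/s   */
    double bus_bytes     = 8.0;     /* 64-bit (8 byte) wide data bus */

    double bw = base_clock_hz * qdr * bus_bytes;
    printf("Peak FSB bandwidth: %.1f GB/s\n", bw / 1e9);   /* 12.8 GB/s */
    return 0;
}
```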
The “Uncore”
Power switches
Utilization of the power headroom of inactive cores and that of active cores with light workload
for increasing clock frequency.
Remark
The Penryn core already introduced a less intricate technology for the same purpose,
termed the Enhanced Dynamic Acceleration Technology, which increases clock frequency
only on the mobile platform and only in case of inactive cores.
Understanding the notion of TDP and the related potential to boost performance (1) [50]
TDP (Thermal Design Power) is the maximum power consumed at realistic worst case
applications (TDP application).
The thermal solution (cooling system) needs to ensure that the junction temperature (Tj)
at maximum core frequency specified in connection with TDP does not exceed the
junction temperature limit (Tjmax) while the processor runs TDP applications.
Example
The mobile quad core Clarksfield processor i7-920XM has
• a TDP of 55 W
• ACPI P-states from 1.2 GHz (Low Frequency Mode) to 2.0 GHz (High Frequency Mode)
available to implement DBS (Demand Based Switching) of fc and Vcc, and
• allows to increase fc in turbo mode from 2.0 GHz up to 3.20 GHz.
The maximum clock frequency related to TDP (2 GHz in the above example) is determined
while running (worst case) TDP applications that intensively utilize all four cores such that
at this frequency dissipation still remains below TDP (i.e. 55 W in the above example).
4.7 More advanced power management (6)
Understanding the notion of TDP and the related potential to boost performance (2) [50]
Typical workloads however, are not intensive enough to push power consumption to the TDP limit.
The remaining power headroom can be utilized to increase fc if
The possible frequency increase depends on the intensity of the workload and the number of
active cores.
If the OS requests an active core to increase fc beyond the TDP limited maximum frequency
(i.e. to enter the PO state),
and there is an available power headroom
• either by having idle cores
• or a lightly threaded workload
the turbo mode controller will increase the core frequency of the active cores
provided that the power consumption of the socket and junction temperatures of the cores
do not exceed the given limits.
In turbo mode all active cores in the processor will operate at the same fc and voltage.
Figure 4.27: Turbo mode uses the available power headroom in processor package power limits [52]
4.7 More advanced power management (9)
For inactive cores, the turbo mode controller will increase fc to a maximum turbo frequency
that depends on the number of active cores provided that actual power and temperature values
remain below specified limits.
Maximum turbo frequencies are factory configured and kept in an internal register (MSR 1ADH).
E.g. in case of a single active core the Core i7-920XM will increase fc to 3.2 GHz,
which is 8 frequency bins higher than the TDP limited core frequency of 2.0 GHz.
Remark
In the above example the 133 MHz is the basic frequency that will be multiplied by the PLL
by an appropriate factor to generate the clock frequency fc.
Assuring that power and temperature values do not exceed specified limits [53], [50]
A precondition for increasing fc is that the power consumption of the package and
the junction temperature of the cores do not exceed given limits.
To assure this the turbo boost controller samples the current power consumption and
die temperatures in 5 ms intervals [53].
Power consumption is determined by monitoring the processor current at its input pin as well
as the associated voltage (Vcc) and calculating the power consumption as a moving average.
The junction temperatures of the cores are monitored by DTSs (Digital Thermal Sensors) with
an error of ± 5 % [50].
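A minimal sketch of the kind of moving-average power estimate described above; the 5 ms sampling period comes from the text, while the window length and the sample values are purely illustrative assumptions.

```c
#include <stdio.h>

#define WINDOW 8   /* number of 5 ms samples averaged (assumed window length) */

/* Simple moving average over the last WINDOW power samples (in watts). */
static double update_avg_power(double sample_w)
{
    static double hist[WINDOW];
    static int idx = 0, filled = 0;

    hist[idx] = sample_w;
    idx = (idx + 1) % WINDOW;
    if (filled < WINDOW) filled++;

    double sum = 0.0;
    for (int i = 0; i < filled; i++) sum += hist[i];
    return sum / filled;
}

int main(void)
{
    double samples[] = { 30.0, 35.0, 52.0, 54.0, 40.0, 38.0 };  /* I * Vcc, example */
    for (int i = 0; i < 6; i++)
        printf("t=%2d ms  avg power = %.1f W\n", i * 5, update_avg_power(samples[i]));
    return 0;
}
```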
5.1 Introduction
• Nehalem-EX based DP/MP servers are also designated as the Beckton family.
• They include only server processors.
• First Beckton processors were delivered in 3/2010.
Remark: The SMI interface was formerly designated as the Fully Buffered DIMM2 interface
5.5 Scalable platform configurations
DP-Servers
65xx Beckton (Nehalem-EX) 8C 3/2010
MP-Servers
75xx Beckton (Nehalem-EX) 8C 3/2010
Performance features of the 8-core Nehalem-EX based Xeon 7500 vs the Penryn based
6-core Xeon 7400 [67]
6.1 Introduction
• Westmere (formerly Nehalem-C) is the 32 nm die shrink of Nehalem.
• First Westmere-based processors were launched in 1/2010
Westmere family
Mobiles
Core i3-3xxM Arrandale 2C+G 1/2010
Core i5-4xxM Arrandale 2C+G 1/2010
Core i5-5xxM Arrandale 2C+G 1/2010
Core i7-6xxM Arrandale 2C+G 1/2010
Desktops
Core i3-5xx Clarkdale 2C+G 1/2010
Core i5-6xx Clarkdale 2C+G 1/2010
Core i7-9xx/9xxX Gulftown 6C 3/2010
Servers
UP-Servers
DP-Servers DP-Servers
56xx Gulftown (Westmere-EP) 6C 3/2010 E7-28xx Westmere-EX 10C 4/2011
MP-Servers
E7-48xx Westmere-EX 10C 4/2011
E7-88xx Westmere-EX 10C 4/2011
Data based on [44]
Westmere 2-core mobile and desktop platform Westmere 6-core UP/DP server platform
6.2 Native 6 cores with 12 MB L3 cache (LLC) for UP/DP servers [58]
6.3 In-package integrated CPU/GPU processors for the mobile and the
desktop segments
Example: mobile – i3-3xxM 2C, i5-4xx/5xxM 2C, i7-6xxM 2C; desktop – i3-5xx 2C, i5-6xx 2C
CPU/GPU components
CPU: Hillel (32 nm Westmere architecture)
(Enhanced 32 nm shrink of the
45 nm Nehalem architecture)
GPU: Ironlake (45 nm)
Shader model 4, DX10 support
32 nm CPU (Hillel)
(Mobile implementation of the Westmere
basic architecture,
which is the 32 nm shrink of the
45 nm Nehalem basic architecture) 45 nm GPU (Ironlake)
Intel’s GMA HD (Graphics Media Accelerator)
(12 Execution Units, Shader model 4, no OpenCL support)
6.3 In-package integrated CPU/GPU (4)
http://www.anandtech.com/show/2902
6.3 In-package integrated CPU/GPU (5)
Figure 6.1: The Clarkdale processor with in-package integrated graphics along with the H57 chipset [140] (reference from Part 4)
6.3 In-package integrated CPU/GPU (8)
Remark
In Jan. 2011 Intel replaced their in-package integrated CPU/GPU lines with the
on-die integrated Sandy Bridge line.
• The processor cores are operating at maximum thermal power level (which is greater
than their TDP) and the integrated graphics and the integrated memory controller
are operating at their minimum thermal power.
• The integrated graphics operates at its maximum thermal power level, while
the processor cores consume the remaining MCP package power limit.
• Processor core currents are monitored by the processor input pin and calculated using a
moving average.
• When the power limit is reached power sharing control will adaptively remove the turbo boost
states to remain within the MCP thermal power limit.
• Errors in power estimation or measurement can significantly impact or completely eliminate
the performance benefit of the turbo boost technology.
NHM/M
(WSM/M)
NHM/D (WSM/D)
• For the two point thermal design it must however be ensured that the
component Tjmax limits are not exceeded when either component is operating
at its extreme thermal power limit.
• The junction temperature of the cores, integrated graphics and memory controller are
monitored by their respective DTS (Digital Thermal Sensor).
A DTS outputs a temperature relative to the maximum supported junction temperature.
The error associated with DTS measurements will not exceed ± 5 % within the operating range.
7.1 Introduction
• Westmere-EX processors are 32 nm die shrinks of the 45 nm Nehalem-EX line.
• First Westmere-EX processors were shipped in 4/2011.
• They are socket compatible with the Nehalem-EX line (Xeon 75xx or Beckton line).
Native 10 cores with 30 MB of L3 cache (LLC) vs native 8 cores with 24 MB L3 cache (LLC),
in order to compete with AMD’s 2×6 core (dual chip) Magny-Cours processors.
UP-Servers
E7-28xx Westmere-EX 10C 4/2011
DP-Servers
E7-28xx Westmere-EX 10C 4/2011
MP-Servers
E7-48xx Westmere-EX 10C 4/2011
E7-88xx Westmere-EX 10C 4/2011
8.1 Introduction
• Sandy Bridge is Intel’s new microarchitecture using 32 nm line width.
• First delivered in 1/2011
[Sandy Bridge key features: Hyperthreading; 32K L1D (3 clk); AES instructions; AVX 256-bit, 4 operands; VMX Unrestricted; ~20 mm²/core]
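A small host-code sketch of the 256-bit, non-destructive multi-operand style that AVX exposes through compiler intrinsics; the values are arbitrary, and the example only assumes an AVX-capable compiler (compile e.g. with -mavx).

```c
#include <stdio.h>
#include <immintrin.h>   /* AVX intrinsics */

int main(void)
{
    /* Eight packed single-precision floats per 256-bit YMM register. */
    __m256 a = _mm256_set1_ps(1.5f);
    __m256 b = _mm256_set1_ps(2.0f);

    /* Non-destructive form: the result goes to a third register, a and b survive. */
    __m256 c = _mm256_add_ps(a, b);
    __m256 d = _mm256_mul_ps(c, b);

    float out[8];
    _mm256_storeu_ps(out, d);
    printf("first lane: %g (expected 7.0)\n", out[0]);   /* (1.5 + 2.0) * 2.0 */
    return 0;
}
```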
Key features and benefits of the Sandy Bridge line vs the 1. generation Nehalem line [61]
8.1 Introduction (5)
Mobiles
Core i3-23xxM, 2C, 2/2011
Core i5-24xxM//25xxM, 2C, 2/2011
Core i7-26xxQM/27xxQM/28xxQM, 4C, 1/2011
Core i7 Extreme-29xxXM , 4C, Q1 2011
Desktops
Core i3-21xx, 2C, 2/2011
Core i5-23xx/24xx/25xx, 4C, 1/2011
Core i7-26xx, 4C, 1/2011
Servers
UP-Servers
E3 12xx, Sandy Bridge-H2, 4C, 3/2011
DP-Servers
E5 2xxx, Sandy Bridge-EP, up to 8C, Q4/2011
MP-Servers
E5 4xxx, Sandy Bridge-EX, up to 8C, Q1/2012
Sandy Bridge
[Figure labels: 8 MM registers (64-bit), aliased on the FP stack registers; processor labels from Northwood (Pentium 4) up to Ivy Bridge]
Figure 8.2: Intel’s x86 ISA extensions – the SIMD register space (based on [18])
8.2 Advanced Vector Extension (AVX) (3)
1.5 K µops
Remark [65]
A cache for decoded µops was already introduced by Intel in the ill-fated Pentium 4 (2000),
designated as the Trace Cache (keeping 12 K µops).
[Comparison chart residue: integrated graphics (12 EUs) of the Sandy Bridge i5/i7-2xxx/3xxx and Arrandale i5-6xx processors vs. the AMD HD5570 (400 ALUs)]
The concept utilizes the real temperature response of processors to power changes
in order to increase the extent of overclocking [64]
Concept: Use thermal energy budget accumulated during idle periods to push the core
beyond the TDP for short periods of time (e.g. for 20 sec).
Multiple algorithms manage current, power and die temperature in parallel [64].
8.6 Enhanced turbo boost technology (3)
Intelligent power sharing between the cores and the integrated graphics [64]
Intelligent power sharing between the cores and the integrated graphics [68]
NHM/M WSM/M
NHM/D WSM/D
[61]
8.6 Enhanced turbo boost technology (6)
Remark
• Individual cores may run at different frequencies but all cores share the same power plane.
• Individual cores may be shut down if idle by power gates.
[Table 9.1 rows (flattened):
Core: microfusion, macrofusion; radix-16 divider; 2-way SMT → SMT with deeper buffers to support SMT
Cache architecture: private L2 caches (up to 2 MB/core) → shared L2 caches (4 MB/2 cores, then 6 MB/2 cores) → private L2 caches (256 KB/core) plus a shared L3 cache (up to 8 MB)
Memory accesses: Store to Load forwarding; Loads bypass only Loads → Loads bypass both Loads and Stores; single L2 prefetcher → 8 prefetchers in DC processors (2 × L1 D$, 1 × L1 I$ per core, 2 × L2)]
Table 9.1: Evolution of the main features of Intel’s basic cores (1)
9. Overview of the evolution (2)
Power management: detailed subsequently
Support of virtualization: not discussed
Table 9.2: Evolution of the main features of Intel’s basic cores (2)
(Process/socket columns: 180 nm/478 pins, 90 nm/775 pins, 90 nm, 90/65 nm, 65 nm, 45 nm, 45 nm)
Protection of overheating:
• Thermal Monitor 1 (TM1): hardware controlled turning the clock off and on (clock modulation)
• Thermal Monitor 2 (TM2): hardware controlled switching to a second operating state with reduced fc and VID (C1E state)
• Adaptive Thermal Monitor: first activate TM2 and, if not enough, activate also TM1
(Process/socket columns: 180 nm/478 pins, 90 nm/775 pins, 90 nm, 90/65 nm, 65 nm, 45 nm, 45 nm)
Reducing power consumption of active processors:
• EIST (Enhanced Intel SpeedStep Technology): OS controlled switching to multiple P-states (power states) in less active periods to reduce power consumption
• Ultra fine grained power control: shutting down not needed processor units
• Bus splitting: deactivating not needed bus lines
[7]: Völkel F., “Duel of the Titans: Opteron vs. Xeon : Hammer Time: AMD On The Attack,”
Tom’s hardware, Apr. 22. 2003,
http://www.tomshardware.com/reviews/duel-titans,620.html
[8]: De Gelas J., “Intel Woodcrest, AMD's Opteron and Sun's UltraSparc T1:
Server CPU Shoot-out,” AnandTech, June 17. 2006,
http://www.anandtech.com/IT/showdoc.aspx?i=2772&p=1
[9]: Hinton G. & al., “The Microarchitecture of the Pentium 4 Processor,” Intel Technology
Journal, Q1 2001, pp. 1-13
[10]: Wechsler O., “Inside Intel Core Microarchitecture,” White Paper, Intel, 2006
[11]: Lee V., “Inside the Intel Core Microarchitecture,” IDF, May 2006, Shenzhen,
http://www.prcidf.com.cn/sz/systems_conf/track_sz/SMC/Intel%20Core%20uArch.pdf
[12]: Doweck J., “Inside Intel Core Microarchitecture,” Hot Chips 18, 2006,
http://www.hotchips.org/archives/hc18/
[13]: Gruen H., “Intel’s new Core Microarchitecture,” Develop Brighton, AMD Technical
Day, July 2006,
http://ati.amd.com/developer/brighton/03%20Intel%20MicroArchitecture.pdf
[14]: Doweck J., “Intel Smart Memory access: Minimizing Latency on Intel Core
Microarchitecture, ” Technology @ intel Magazine, Sept. 2006, pp. 1-7,
ftp://download.intel.com/corporate/pressroom/emea/deu/fotos/06-10-Strategie_Tag/
Intel/Intel_Core2_Prozessoren/Texte/ENG-Smart_Memory_Access_Technology@
Intel_Magazine_Article.pdf
[15]: Sima D., Fountain T., Kacsuk P., Advanced Computer Architectures, Addison Wesley,
Harlow etc., 1997
[16]: Jafarjead B., “Intel Core Duo Processor,” Intel, 2006,
http://masih0111.persiangig.com/document/peresentation/behrooz%20jafarnejad.ppt
[17]: Pawlowski S. & Wechsler O., “Intel Core Microarchitecture,” IDF Spring, 2006,
http://www.intel.com/pressroom/kits/core2duo/pdf/ICM_tech_overview.pdf
[18]: Goto H., “Larrabee architecture can be integrated into CPU,” PC Watch, Oct. 06. 2008,
http://pc.watch.impress.co.jp/docs/2008/1006/kaigai470.htm
[19]: SIMD Instruction Sets, http://softpixel.com/~cwright/programming/simd/index.php
[20]: Platform Environment Control Interface,
http://en.wikipedia.org/wiki/Platform_Environment_Control_Interface
[21]: Kim N. S. et al., „Leakage Current: Moore’s Law Meets Static Power”, Computer,
Dec. 2003, pp. 68-75.
[22]: Ng P. K., “High End Desktop Platform Design Overview for the Next Generation
Intel Microarchitecture (Nehalem) Processor,” IDF Taipei, TDPS001, 2008,
http://intel.wingateweb.com/taiwan08/published/sessions/TDPS001/
FA08%20IDF-Taipei_TDPS001_100.pdf
[23]: Bohr M., Mistry K., Smith S., “Intel Demonstrates High-k + Metal Gate Transistor
Breakthrough in 45 nm Microprocessors,”, Intel, Jan. 2007,
http://download.intel.com/pressroom/kits/45nm/Press45nm107_FINAL.pdf
[24]: Scott D. S., “Toward Petascale and Beyond,” APAC Conference, Oct. 2007,
http://www.apac.edu.au/apac07/pages/program/presentations/
Tuesday%20Harbour%20A%20B/David_Scott.pdf
[25]: Smith S. L., “45nm Product Press Briefing,” IDF Fall, 2007,
http://download.intel.com/pressroom/kits/events/idffall_2007/BriefingSmith45nm.pdf
[26]: Fisher S., “Technical Overview of the 45nm Next Generation Intel Core Microarchitecture
(Penryn),” IPTS001, Fall IDF 2007, http://isdlibrary.intel-dispatch.com/isd/89/45nm.pdf
[27]: George V., 45nm Next Generation Intel Core Microarchitecture (Penryn),”
Hot Chips 19, 2007,
http://www.hotchips.org/archives/hc19/3_Tues/HC19.08/HC19.08.01.pdf
[28]: Foxton Technology, Wikipedia, http://en.wikipedia.org/wiki/Foxton_Technology
[29]: Coke J. & al., “Improvements in the Intel Core Penryn Processor Family Architecture
and Microarchitecture,” Intel Technology Journal, Vol. 12, No. 3, 2008, pp. 179-192
[30]: Fisher S., “Technical Overview of the 45nm Next Generation Intel Core Microarchitecture
(Penryn),” BMA S004, IDF 2007,
http://my.ocworkbench.com/bbs/attachment.php?attachmentid=318&d=1176911500
[35]: Smith T., “Timna - Intel's first system-on-a-chip, Before 'Tolapai', before 'Banias'.
Register Hardware, 6. February 2007,
http://www.reghardware.co.uk/2007/02/06/forgotten_tech_intel_timna/
[45]: Glaskowsky P.: Investigating Intel's Lynnfield mysteries, cnet News, Sept. 21. 2009,
http://news.cnet.com/8301-13512_3-10357328-23.html
References (6)
[46]: Shimpi A. L.: Intel's Core i7 870 & i5 750, Lynnfield: Harder, Better, Faster Stronger,
AnandTech, Sept. 8. 2009, http://www.anandtech.com/show/2832
[47]: Intel Xeon Processor C5500/C3500 Series, Datasheet – Volume 1, Febr. 2010,
http://download.intel.com/embedded/processor/datasheet/323103.pdf
[48]: Intel CoreTM i7-800 and i5-700 Desktop Processor Series Datasheet – Volume 1,
July 2010, http://download.intel.com/design/processor/datashts/322164.pdf
[49]: Glaskowsky P.: Intel's Lynnfield mysteries solved, cnet News, Sept. 28. 2009,
http://news.cnet.com/8301-13512_3-10362512-23.html
[50]: Intel CoreTM i7-900 Mobile Processor Extreme Edition Series, Intel Core i7-800 and
i7-700 Mobile Processor Series, Datasheet – Volume One, Sept. 2009
http://download.intel.com/design/processor/datashts/320765.pdf
[51]: Intel Turbo Boost Technology in Intel CoreTM Microarchitecture (Nehalem) Based
Processors, White Paper, Nov. 2008
http://download.intel.com/design/processor/applnots/320354.pdf
[52]: Power Management in Intel Architecture Servers, White Paper, April 2009
http://download.intel.com/support/motherboards/server/sb/power_management_of_intel
architecture_servers.pdf
[53]: Glaskowsky P.: Explaining Intel’s Turbo Boost technology, cnet News, Sept. 28. 2009,
http://news.cnet.com/8301-13512_3-10362882-23.html
[54]: Intel Xeon Processor 7500 Series, Datasheet – Volume 2, March 2010
http://www.intel.com/Assets/PDF/datasheet/323341.pdf
[55]: Pawlowski S.: Intelligent and Expandable High- End Intel Server Platform, Codenamed
Nehalem-EX, IDF 2009
[56]: Kottapalli S., Baxter J.: Nehalem-EX CPU Architecture, Hot Chips 2009, Sept. 10. 2009
http://www.hotchips.org/archives/hc21/2_mon/HC21.24.100.ServerSystemsI-Epub/HC21.24.
122-Kottapalli-Intel-NHM-EX.pdf
[57]: Kurd N. A. et al.: A Family of 32 nm IA Processors, IEEE Journal of Solid-State Circuits,
Vol. 46, Issue 1., Jan. 2011, pp. 119-130
[58]: Hill D., Chowdhury M.: Westmere Xeon-56xx „Tick” CPU, Hot Chips 2010
http://www.hotchips.org/uploads/archive22/HC22.24.620-Hill-Intel-WSM-EP-print.pdf
[59]: Intel CoreTM i7-600, i5-500, i5-400 and i3-300 Mobile Processor Series, Datasheet -
Vol.1, Jan. 2010, http://download.intel.com/design/processor/datashts/322812.pdf
[60]: Nagaraj D., Kottapalli S.: Westmere-EX: A 20 thread server CPU, Hot Chips 2010
http://www.hotchips.org/uploads/archive22/HC22.24.610-Nagara-Intel-6-Westmere-EX.pdf
[61]: Kahn O., Piazza T., Valentine B.: Technology Insight: Intel Next Generation
Microarchitecture Codename Sandy Bridge, IDF 2010, extreme.pcgameshardware.de/.../
281270d1288260884-bonusmaterial-pc-games-hardware-12-2010-sf10_spcs001_100.pdf
[64]: Kahn O., Valentine B.: Intel Next Generation Microarchitecture Codename Sandy Bridge:
New Processor Innovations, IDF 2010
References (8)
[65]: Shimpi A. L.: Intel Pentium 4 1.4GHz & 1.5GHz, AnandTech, Nov. 20. 2000
http://www.anandtech.com/show/661/5
[66]: Yuffe M., Knoll E., Mehalel M., Shor J., Kurts T.: A fully integrated multi-CPU, GPU and
memory controller 32nm processor, ISSCC, Febr. 20-24. 2011, pp. 264-266
[67]: Intel Xeon Processor 7500/6500 Series, Public Gold Presentation, Data Center Group,
March 30. 2010, http://cache-www.intel.com/cd/00/00/44/64/446456_446456.pdf
[68]: Tang H., Cheng H.: Intel Xeon Processor E3 Family Based Servers: A Smart Investment
for Managing Your Small Business, IDF 2011
[69]: Thomadakis M. E. PhD: The Architecture of the Nehalem Processor and Nehalem-EP SMP
Platforms, Texas A&M University, March 17. 2011
http://alphamike.tamu.edu/web_home/papers/perf_nehalem.pdf
[70]: Intel Xeon Processor E7-8800/4800/2800 Product Families, Datasheet Vol. 1 of 2,
April 2011, http://www.intel.com/Assets/PDF/datasheet/325119.pdf
Dezső Sima
• 1. Introduction to DT platforms
Platforms
Set of processors and associated chipsets capable of working together.
Traditional three-chip platforms consist of
• a single or multiple processors,
• an MCH (Memory Control Hub) and
• an ICH (I/O Control Hub).
(Figure residue: traditional DT platform vs. recent DT platform with the processor(s) and an IOH)
(Figure residue, Averill DT platform: processors E6xxx/E4xxx, X6800 (7/2006); Q6xxx, QX6xxx (11/2006);
MCH: 965 Series (Broadwater), 6/2006; south bridge: ICH8, 6/2006;
FSB: 1066/800/533 MT/s; 2 DDR2 channels, DDR2-800/667 MT/s, 2 DIMMs/channel, 8 GB max.)
Note
Main components
• Two register chips, for buffering the address- and command lines
• A PLL (Phase Locked Loop) unit for deskewing clock distribution.
ECC
Figure 1.1: Typical layout of a registered memory module with ECC [1]
SDRAM 168-pin
DDR 184-pin
DDR3 240-pin
FSB
Display 2 DIMMs/channel
2 DIMMs/channel
card
C-link
There are limitations on both the minimum width and the spacing of the copper traces.
• The minimum width of the copper traces is restricted by their non-zero trace impedance.
• The minimum spacing between traces is constrained by parasitic capacitance and crosstalk.
Given
• the huge number of connections to be implemented as copper traces connecting
particular parts to the MCH,
• and the large number of connections each DDR2 or DDR3 memory channel needs,
• as well as the physical restrictions implied on the copper traces,
recent 3-chip platforms are limited typically to 2 DDR2 or DDR3 memory channels.
By contrast, FBDIMM channels make use of serial links and need only about 80 lines,
as a consequence 3-chip platforms may have about 3 times more FBDIMM memory
channels than DDR2 or DDR3 channels.
Beginning with Intel’s Nehalem processor, however the memory controller moved onto the
processor chip, as shown below.
6.4 GT/s
Tylersburg
In this case the memory channels are attached to the processor, whereby the processor
has far fewer connections to other units than the MCH had in the previous design.
As a consequence, Nehalem-based or subsequent platforms may implement more than two
DDR2/DDR3 channels, as illustrated above.
Remark
Typical bandwidth of recent 2-channel memory interfaces
Let’s assume that a particular platform has dual memory channels with DDR3-1333 DIMMs.
Then the resulting memory bandwidth (BW) of the platform amounts to
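Assuming the usual 64-bit (8 byte) wide DDR3 channels, a straightforward calculation gives

BW = 2 channels x 1333 MT/s x 8 B/transfer ≈ 21.3 GB/s.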
As memory speeds increase or the number of DIMMs attached to each memory channel
increases, the operational tolerances of data transmissions over the memory channels
become narrower due to electrical effects, such as reflections, jitter, skews, crosstalk etc.
The operational margins are effective while capturing the transmitted data at the receiver.
They are defined by
• a size, that is the sum of the setup time (tS) and the hold time (tH), and
• a correct phase relative to the clock edge, satisfying both the tS and tH requirements.
(Figure residue: data valid window (DVW) between the voltage levels VHmin and VLmax, referenced to the clock CK, with setup time tS and hold time tH)
The size and fulfillment of the operational tolerances can be visualized by the eye diagram.
It shows the picture of a large number of overlaid data signals.
Figure 1.5: Eye diagram of a real signal showing both available DVW and requested voltage levels [4]
Reflections, jitter, skews, crosstalks and other disturbances narrow the operational margins of
the data transfer and limit thereby
• the transfer speed and
• the number of DIMMs allowed to be attached to each channel.
Reflections
At operational speeds of DDR2/DDR3 memories the connection lines, i.e. the copper traces
behave like transmission lines.
Transmission lines need to be terminated by their characteristic impedance (about 50-70 Ω for
copper traces on mainboards) if reflections should be avoided.
In case of a termination mismatch or existing inhomogeneities of the transmission line,
reflections arise and narrow the operational tolerances.
Termination of the transmission lines connecting the memory controller and the DRAM chips
Despite the fact that subsequent memory technologies (from SDRAM to DDR3) laid more and more
emphasis on the appropriate termination of the transmission lines (up to on-die, dynamically
adjusted termination of the lines in case of DDR3 memories), a certain termination mismatch
typically remains and reflections arise, as shown in the next slide.
Figure 1.6: Reflections shown on an eye diagram due to termination mismatch [5]
Inhomogeneity of the transmission lines connecting the memory controller and the DRAM chips
Transmission lines connecting the memory controller and the DRAM chips mounted on the
DIMMs are inherently inhomogeneous due to the kind of connection used.
The dataway that connects the memory controller and the DRAM chips
Motherboard trace
Figure 1.7: The copper traces connecting the memory controller and the DRAM chips behave
like transmission lines (based on [6])
Jitter
• It means phase uncertainty causing ambiguity in the rising and falling edges of a
signal, as shown in the figure below.
• It has a stochastic nature.
• Its main sources are:
• Crosstalk caused by coupling adjacent traces on the board or in the DRAM device,
• ISI (Inter-Symbol Interference) caused by cycling the bus faster than it can settle,
• Reflection noise due to mismatched termination of signal lines,
• EMI (Electromagnetic Interference) caused by electromagnetic radiation emitted
from external sources.
Skew
(Figure residue: skew between the signals CK-1 and CK-2)
Reflections, jitter, skews and further electrical disturbances reduce the operational tolerances
effective at the receiver end of the transmission lines connecting the memory controller and
the DRAM chips and limit the operational speed of the memory channels as well as
the number of DIMMs attachable per channel.
As a consequence in recent DT and also server platforms the number of DIMMs that can be
attached to DDR2 or DDR 3 memory channels is typically restricted to two.
This restriction roots in the parallel style of connecting traditional memory modules to
memory controllers.
By contrast, serially connected memory modules, such as FBDIMM modules, have much higher
operational tolerances and in this case more than two (typically up to 6 or 8) FBDIMM modules
can be attached to a memory channel.
Example
The DRAM capacity (C) of a DT platform having three memory channels with two 4 GB DIMMs
per channel amounts to
C = 3 x 2 x 4 GB = 24 GB
DT platforms
Target market
Enterprise computing
Main goals
• lowering TCO (Total Cost of Ownership)
• increasing system availability and
• enhancing security.
(Table residue: vPro generations and their platform technologies - AMT (Active Management Technology), VT / VT-d (Virtualization Technology / for Directed I/O), TXT (Trusted Execution Technology), Intel Turbo Memory Technology (TM), AT (Anti-Theft Technology), listed separately for client and server platforms; superscripts refer to the notes below)
1 AMT version 1.0 preceded vPro; it was introduced in 2005, based on the 945 chipset, the ICH7 and
a Gigabit Ethernet controller, and supported the dual-core P4 Smithfield processor
2Introduced in the 2. gen. vPro based on the Q35 in 2007
Example
Main hardware components of the 5. generation vPro
Remark
Note
Beyond the hardware requirements, vPro also needs firmware (BIOS) and OS support, not
detailed here. For details see e.g. [9].
FW: Firmware
Remark
Relocation of the ME while the DT system architecture evolved from the 3-chip solution to
the 2-chip solution (along with the Nehalem-based Lynnfield processors and their associated
5 Series PCHs) [24]
Previous
(Dedicated graphics
via graphics card)
(Series 5 PCH)
“Consumer
graphics
Operation of ME [30]
Intel Virtualization Technology for Directed I/O (VT-d) [31], [32], [33]
Virtualization technology in general
• consists of a set of hardware and software components that allow running multiple OSs
and applications in independent partitions.
• Each partition is isolated and protected from all other partitions.
• Virtualization enables among others
• Server consolidation
Substituting multiple dedicated servers by a single virtualized platform,
• Legacy software migration
Legacy software: software commonly used previously, often written in languages that are
no longer in common use (such as Cobol) and running under OSs or platforms that are
no longer in common use.
Legacy software migration: moving legacy software to a recent platform,
• Effective disaster recovery.
• The preceding designation for the TXT technology was the LaGrande technology.
• It provides hardware based security against hypervisor attacks, BIOS or other
firmware attacks, malicious root kit installations or other software attacks.
• It extends the Virtual Machine Extensions (VMX) of Intel's Virtualization Technology (VT)
with a Measured Launch Environment (MLE), providing a verifiably secure installation, launch and use of a hypervisor or OS.
• It consists of a number of hardware enhancements to allow the creation of multiple
separated execution environments or partitions.
• One of the components is the TPM (Trusted Platform Module), a special chip which allows for
secure key generation and storage and authenticated access to data encrypted by this key.
The TPM chip is usually connected to the LPC (Low Pin Count) bus.
• TXT became available for DT platforms beginning with the 3 Series MCH model Q35 in 2007.
• It is a disk caching technology using SSDs (Solid-State Drives), i.e. flash memory
placed on a card.
• The Turbo memory card provides 1 – 4 GB disk space and is connected to the PC via the
PCIe interface.
• It caches frequently used data or user selected applications.
• Expected results: faster access to data and lower power consumption.
• It was announced first for mobile platforms in 2005 and offered
later on also for DT platforms, along with the 3 and 4 Series chipsets
starting in 2007.
• The Turbo Memory Technology is supported by Microsoft
Windows Vista “Ready Drive” and “Ready Boost” technologies.
• According to related reviews the Turbo Memory technology
did not fulfill the expectations; it was costly and
not worth its price.
• In 2009 Intel announced a successor technology for the
5 Series mobile chipsets, called the Braidwood technology,
but subsequently withdrew it.
• In 2011 Intel introduced a disk caching mechanism in their Z68 chipset
(and mobile derivatives) of the Series 6 PCH family, providing
disk caching by a SATA SSD.
Table 3.1: Intel’s Core 2 based or more recent multicore desktop (DT) processors
DT platforms
Overview
Remark
• The X38 chipset may be considered as belonging to the 3 Series family of chipsets.
It does not support vPro.
• Similarly, the X48 chipset may be considered as belonging to the 4 Series family of chipsets.
It does not support vPro.
Features of different platforms are defined by the particular chipset they include.
E.g. different models of the Averill platform (Core 2/Penryn based platform incorporating the
965 chipset and the ICH7 south bridge) provide the following features [19]:
Core2/Penryn
(2C/2*2C)
proc.
FSB
DMI C-link1
ICH8/9/10
1 C-link (Controller link) is needed basically for loading the ME (Management Engine) firmware from the nonvolatile
system flash memory that is attached to the ICH.
The ME is used to implement particular platform features supported.
4. Intel’s Core 2 and Penryn based DT platforms (7)
1. Example: The Core 2/Penryn based private consumer oriented Averill DT platform
with the P965 MCH and ICH8 that does not provide an integrated display controller
[20]
card 2 DIMMs/channel
2 DIMMs/channel
C-link
Remark
2. Example: The Core 2/Penryn based private consumer oriented Averill DT platform
with the G965 MCH and ICH8 that provides an integrated display controller [2]
Display 2 DIMMs/channel
2 DIMMs/channel
card
C-link
b) Intel’s Core 2 and Penryn based enterprise oriented DT platforms (vPro platforms)
Overview
Core 2-based (65 nm)    Core 2 Quad-based (65 nm)    Penryn-based (45 nm)
1 Q8000: 8/2008
4. Intel’s Core 2 and Penryn based DT platforms (12)
Core2/Penryn
(2C/2*2C)
proc.
FSB
DDR2/DDR3 (depending on the MCH model used)
MCH: Q965/Q35/Q45
DMI C-link1
ICH8/9/10
1 C-link (Controller link) is needed basically for loading the ME (Management Engine) firmware from the nonvolatile
system flash memory that is attached to the ICH.
The ME is used to implement particular platform features, such as AMT.
Display
2 DIMMs/channel
card 2 DIMMs/channel
C-link
C-link
Remark
Overview-1
Private consumer oriented Enterprise oriented Private consumer oriented Enterprise oriented
DT platforms DT platforms DT platforms DT platforms
Overview-2
11/2008 9/2009
Kings Creek
Tylersburg Piketon (vPro)1
ICH10
1. gen. Nehalem-EP based (45 nm)    Westmere-EP based (32 nm)
1 Needs the Q57 PCH
5. Intel’s Nehalem based DT platforms (3)
1. gen. Nehalem (Bloomfield, 4C) /
Westmere (Gulftown, 6C) proc.
DDR3
QPI
X58 IOH
DMI    C-link1
ICH10
1 C-link (Controller link) is needed basically for loading the ME (Management Engine) firmware from the nonvolatile
system flash memory that is attached to the ICH.
The ME is used to implement particular platform features supported.
Example: 1. gen. Nehalem (Bloomfield, 4C) based private consumer oriented Tylersburg
DT platform [22]
2 DIMMs/channel
2 DIMMs/channel
2 DIMMs/channel
Processor:
• Nehalem-EP
(Bloomfield, 4C)
• Westmere-EP
(Gulftown, 6C)
Remark: The platform shown does not include an integrated display controller
5. Intel’s Nehalem based DT platforms (5)
[23]
These platforms introduced a new kind of system architecture that consists only of two chips.
Previous
(Dedicated graphics
via graphics card)
5 Series PCH
controller
“Consumer
graphics
Remarks
• Intel's PCH offerings address a number of market segments.
For each segment Intel provides a number of PCH lines for different use, among others
• the X line for extreme performance for home use (mostly for gamers),
• the P line for home use,
• the Q and H lines for business use etc.
Each line comprises usually a number of models with different feature sets.
E.g. the desktop segment includes the following line and models with the feature sets given.
5 series datasheet
5. Intel’s Nehalem based DT platforms (10)
Q57
5-Series
PCH
PCH
(w/AMT 6.0)
1 FDI is needed for an integrated display controller (included in all 5 Series PCHs except the P55).
Note
In the 2-chip system architecture the PCH includes the ME (Manageability Engine) and is
connected to the nonvolatile system flash memory that keeps the firmware to be read at
boot time.
So there is no need for an extra interface for loading the ME firmware from the nonvolatile system memory,
as was the case in the previous 3-chip system architecture.
2 DIMMs/channel
2 DIMMs/channel
Remark
1/2011 1/2011
i7-26xx-29xx i7-26xx-29xx
i5-23xx-25xx i5-23xx-25xx
i3-21xx-23xx i3-21xx-23xx
4C (12 EUs): 32 nm/915? mtrs/216 mm2 4C (12 EUs): 32 nm/915? mtrs/216 mm2
2C (12 EUs): 32 nm/624? mtrs/149 mm2 2C (12 EUs): 32 nm/624? mtrs/149 mm2
¼ MB L2/C ¼ MB L2/C
Up to 8 MB L3 Up to 8 MB L3
1 x DMI2 1 x DMI2
+ 1 x FDI (except the P67 PCH) + 1 x FDI (except the P67 PCH)
2 DDR3 channels 2 DDR3 channels
1066/1333 MT/s 1066/1333 MT/s
2 DIMMs/channel 2 DIMMs/channel
max. 32 GB max. 32 GB
LGA-1155 LGA-1155
1/2011 1/2011
6- Series Q67
PCH PCH
1 FDI is needed for integrated display controllers (included in all 6 Series PCHs except the P67).
Remark
The Z68 model supports disk caching that allows an SSD to be used to cache a SATA hard disk.
This technology is designated now as the Smart Response Technology.
It is analogous to Intel’s previous Turbo Memory Technology introduced along with the 3 Series
Chipsets but discontinued with the Series 5 PCHs.
DMI DMI
ICH8/9/10 ICH10
Remark
[1]: DDR SDRAM Registered DIMM Design Specification, JEDEC Standard No. 21-C,
Page 4.20.4-1, Jan. 2002, http://www.jedec.org
[5]: Allan G., „The outlook for DRAMs in consumer electronics”, EETIMES Europe Online,
01/12/2007, http://eetimes.eu/showArticle.jhtml?articleID=196901366&queryText=calibrated
[6]: Jacob B., Ng S. W., Wang D. T., Memory Systems, Elsevier, 2008
[7]: Intel Core versus AMD's K8 architecture
[8]: Kirstein B., „Practical timing analysis for 100-MHz digital design,”, EDN, Aug. 8, 2002,
www.edn.com
[9]: Izaguirre J., Building and Deploying Better Embedded Systems with Intel Active
Management Technology (Intel AMT), Intel Technology Journal, Vol. 13, Issue 1., 2009,
pp. 84-95
[10]: Technology Brief, 2nd generation Intel Core processor family, Intel Anti-Theft Technology,
2011, http://www.intel.com/technology/anti-theft/anti-theft-tech-brief.pdf
References (2)
[14]: Intel Lyndon Platform & Future TS, VR-Zone, Dec. 9 2004,
http://vr-zone.com/articles/intel-lyndon-platform--future-ts/1520.html
[15]: Greene J., Intel Trusted Execution Technology, White Paper, 2010,
http://www.intel.com/Assets/PDF/whitepaper/323586.pdf
[16]: Wikipedia: Trusted Execution Technology, 2011,
http://en.wikipedia.org/wiki/Trusted_Execution_Technology
[17]: Intel Turbo Memory supported chipsets, Intel Corporation,
http://www.intel.com/support/chipsets/itm/sb/CS-025854.htm
[19]: Intel 965 Express Chipset Family Datasheet, July 2006, http://ivanlef0u.fr/repo/ebooks/
intel_manuals/Intel%20%20965%20Express%20Chipset%20Family.pdf
[21]: Product Brief: Intel Q35 and Q33 Express Chipsets, 2007,
http://www.intel.com/Assets/PDF/prodbrief/317312.pdf
References (3)
[29]: Freeman V., Intel refreshes its vPro platform for 2010, March 9 2010, Hardware Central,
http://www.hardwarecentral.com/features/article.php/3869536/Intel-Refreshes-Its-
vPro-Platform-for-2010.htm
[30]: Marek J., Rasheed Y., Watts L., Technical Overview of Next Generation Intel vPro
Technology, PROS001, 2010
[31]: Neiger G., Santoni A., Leung F., Rodgers D., Uhlig R., Intel® Virtualization Technology:
Hardware support for efficient processor virtualization, Aug. 10 2006, Vol. 10, Issue 3,
http://www.intel.com/technology/itj/2006/v10i3/1-hardware/1-abstract.htm
Dezső Sima
Aim
Brief introduction and overview.
1. Introduction
3. Overview of GPGPUs
5. References
Vertex
Edge Surface
Vertices
• have three spatial coordinates
• carry supplementary information necessary to render the object, such as
• color
• texture
• reflectance properties
• etc.
Shaders
DirectX 8.1 (10/2001): pixel shader 1.2, 1.3, 1.4; vertex shader 1.0, 1.1; Windows XP / Windows Server 2003
DirectX 9.0 (12/2002): pixel shader 2.0; vertex shader 2.0
Table 1.1: Pixel/vertex shader models (SM) supported by subsequent versions of DirectX
and MS’s OSs [18], [21]
• Different instructions
• Different resources (e.g. registers)
Based on its FP32 computing capability and the large number of FP-units available
GPGPUs
(General Purpose GPUs)
or
cGPUs
(computational GPUs)
Peak FP32/FP64 performance of Nvidia's GPUs vs Intel's P4 and Core2 processors [43]
Evolution of the bandwidth of Nvidia's GPUs vs Intel's P4 and Core2 processors [43]
Figure 1.2: Contrasting the utilization of the silicon area in CPUs and GPUs [11]
• Less area for control since GPGPUs have simplified control (same instruction for
all ALUs)
• Less area for caches since GPGPUs support massive multithreading to hide the
latency of long operations, such as memory accesses in case of cache misses.
SIMD execution:
• One dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input vectors.
SIMT execution:
• Two dimensional data parallel execution, i.e. it performs the same operation on all elements of a given FX/FP input array (matrix).
In addition, SIMT execution
• is massively multithreaded,
and provides
• data dependent flow control as well as
• barrier synchronization
Remarks
1) SIMT execution is also termed SPMD (Single Program Multiple Data) execution (Nvidia).
2) The SIMT execution model is a low level execution model that needs to be complemented
with further models, such as the model of computational resources or the memory model,
not discussed here.
(Figure residue: the three levels of the programming model - HLL level, virtual machine level (pseudo ISA), object code level; HLL compilers: nvcc/nvopencc for Nvidia, brcc for AMD)
The compiled pseudo ISA code (PTX code/IL code) remains independent from the
actual hardware implementation of a target GPGPU, i.e. it is portable over different
GPGPU families.
Compiling a PTX/IL file to a GPGPU that lacks features assumed by the particular PTX/IL
version, however, may require emulation of the features not implemented in hardware.
This slows down execution.
2. Basics of the SIMT execution (5)
• Phase 2: Compiling the pseudo assembly code to GPU specific binary code
Nvidia AMD
The object code (GPGPU code, e.g. a CUBIN file) is forward portable, but forward portability
is typically provided only within major GPGPU versions, such as Nvidia's compute capability
versions 1.x or 2.x.
• The compiled pseudo ISA code (PTX code/IL code) remains independent from the
actual hardware implementation of a target GPGPU, i.e. it is portable over subsequent
GPGPU families.
Forward portability of the object code (GPGPU code, e.g. CUBIN code) is provided however,
typically only within major versions.
• Compiling a PTX/IL file to a GPGPU that lacks features assumed by the particular PTX/IL
version, however, may require emulation of the features not implemented in hardware.
This slows down execution.
• Portability of pseudo assembly code (Nvidia’s PTX code or AMD’s IL code) is highly
advantageous in the recent rapid evolution phase of GPGPU technology as it results in
less costs for code refactoring.
Code refactoring costs are a kind of software maintenance costs that arise when the user
switches from a given generation to a subsequent GPGPU generation (like from GT200
based devices to GF100 or GF110-based devices) or to a new software environment
(like from CUDA 1.x SDK to CUDA 2.x or from CUDA 3.x SDK to CUDA 4.x SDK).
Remark
The virtual machine concept underlying both Nvidia’s and AMD’s GPGPUs is similar to
the virtual machine concept underlying Java.
• For Java there is also an inherent pseudo ISA definition, called the Java bytecode.
• Applications written in Java will first be compiled to the platform independent Java bytecode.
• The Java bytecode will then either be interpreted by the Java Runtime Environment (JRE)
installed on the end user’s computer or compiled at runtime by the Just-In-Time (JIT)
compiler of the end user.
First, let’s discuss the basic structure of the underlying SIMD cores.
SIMD cores execute the same instruction stream on a number of ALUs (e.g. on 32 ALUs),
i.e. all ALUs perform typically the same operations in parallel.
Fetch/Decode
SIMD core
ALU ALU ALU ALU ALU
SIMD ALUs operate according to the load/store principle, like RISC processors i.e.
• they load operands from the memory,
• perform operations in the “register space” i.e.
• they take operands from the register file,
• perform the prescribed operations and
• store operation results again into the register file, and
• store (write back) final results into the memory.
The load/store principle of operation takes for granted the availability of a register file (RF)
for each ALU.
Load/Store
Memory RF
ALU
As a consequence of the chosen principle of execution each ALU is allocated a register file (RF)
that is a number of working registers.
Fetch/Decode
RF RF RF RF RF RF
Remark
The register sets (RF) allocated to each ALU are actually parts of a single, sufficiently large register file.
RF RF RF RF RF RF
Figure 2.6: Allocation of distinct parts of a large register file to the private register sets of the ALUs
Beyond the basic operations the SIMD cores provide a set of further computational capabilities,
such as
• FX32 operations,
• FP64 operations,
• FX/FP conversions,
• single precision trigonometric functions (to calculate reflections, shading etc.).
Note
Computational capabilities specified at the pseudo ISA level (intermediate level) are
• typically implemented in hardware.
Nevertheless, it is also possible to implement some compute capabilities
• by firmware (i.e. microcode),
• or even by emulation during the second phase of compilation.
SIMT cores are enhanced SIMD cores that provide an effective support of multithreading
Achieved by
• providing and maintaining separate contexts for each thread, and
• implementing a zero-cycle context switch mechanism.
SIMT cores
= SIMD cores with per thread register files (designated as CTX in the figure)
Fetch/Decode
SIMT core
CTX CTX CTX CTX CTX CTX
CTX CTX CTX CTX CTX CTX
CTX CTX CTX CTX CTX CTX
Actual context CTX
Register file (RF)
CTX CTX CTX CTX CTX
ALU
ALU ALU ALU ALU ALU ALU
Figure 2.7: SIMT cores are specific SIMD cores providing separate thread contexts for each thread
The final model of computational resources of GPGPUs at the virtual machine level
The GPGPU is assumed to have a number of SIMT cores and is connected to the host.
Fetch/Decode
SIMT
ALU ALU core
ALU ALU ALU ALU ALU ALU
Host
Fetch/Decode
Fetch/Decode SIMT
ALU ALU
core
ALU ALU ALU ALU ALU ALU
ALU ALU SIMT
ALU ALU ALU ALU ALU ALU
core
During SIMT execution 2-dimensional matrices will be mapped to the available SIMT cores.
2. Basics of the SIMT execution (20)
Remarks
1) The final model of computational resources of GPGPUs at the virtual machine level is similar
to the platform model of OpenCL, given below assuming multiple cards.
ALU
Card
SIMT core
Figure 2.10: Simplified block diagram of the Cayman core (that underlies the HD 69xx series) [99]
2. Basics of the SIMT execution (22)
The memory model at the virtual machine level declares all data spaces available at this level
along with their features, like their accessibility, access mode (read or write) access width etc.
SIMT 1
Local Memory
Instr.
ALU 1 ALU 2 ALU n
Unit
Constant Memory
Global Memory
Figure 2.12: Key components of available data spaces at the level of SIMT cores
Local memory
• On-die R/W data space that is accessible from all ALUs of a particular SIMT core.
• It allows sharing of data for the threads that are executing on the same SIMT core.
SIMT 1
Local Memory
Instr.
ALU 1 ALU 2 ALU n
Unit
Constant Memory
Global Memory
Figure 2.13: Key components of available data spaces at the level of SIMT cores
Constant Memory
• On-die Read only data space that is accessible from all SIMT cores.
• It can be written by the system memory and is used to provide constants for all threads
that are valid for the duration of a kernel execution with low access latency.
GPGPU
SIMT 1 SIMT m
Reg. Reg. Reg. Reg.
File 1 File n File 1 File n
Constant Memory
Global Memory
Figure 2.14: Key components of available data spaces at the level of the GPGPU
2. Basics of the SIMT execution (27)
Global Memory
• Off-die R/W data space that is accessible for all SIMT cores of a GPGPU.
• It can be accessed by the system memory and is used to hold all instructions and data
needed for executing kernels.
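As a brief illustrative sketch of how these data spaces appear at the HLL level in CUDA C (the kernel and array names below are made up for illustration; note that what is called Local Memory here corresponds to shared memory in CUDA C terminology):

// Hypothetical CUDA C fragment illustrating the data spaces discussed above
__constant__ float coeff[16];                  // Constant Memory: read-only in kernels, written from the host

__global__ void smooth(const float *in, float *out, int n)   // in/out point into Global Memory
{
    __shared__ float tile[256];                // per-SIMT-core Local Memory ("shared memory" in CUDA terms),
                                               // assuming 256-thread blocks
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // stage data from Global Memory into the on-die space
    __syncthreads();                              // make the staged data visible to all threads of the block
    if (i < n)
        out[i] = tile[threadIdx.x] * coeff[0];
}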
GPGPU
SIMT 1 SIMT m
Reg. Reg. Reg. Reg.
File 1 File n File 1 File n
Constant Memory
Global Memory
Figure 2.15: Key components of available data spaces at the level of the GPGPU
2. Basics of the SIMT execution (28)
Remarks
1. AMD introduced Local memories, designated as Local Data Share, only along with their
RV770-based HD 4xxx line in 2008.
2. Beyond the key data space elements available at the virtual machine level, discussed so far,
there may also be other kinds of memories declared at the virtual machine level,
such as AMD's Global Data Share, an on-chip Global memory introduced with their
RV770-based HD 4xxx line in 2008.
3. Traditional caches are not visible at the virtual machine level, as they are transparent for
program execution.
Nevertheless, more advanced GPGPUs allow an explicit cache management at the
virtual machine level, by providing e.g. data prefetching.
In these cases the memory model needs to be extended with these caches accordingly.
4. Max. sizes of particular data spaces are specified by the related instruction formats
of the intermediate language.
5. Actual sizes of particular data spaces are implementation dependent.
6. Nvidia and AMD designate their kinds of data spaces differently, as shown below.
Nvidia AMD
A set of ALUs
within the
SIMT cores
General Purpose Registers: R/W, per ALU; default: (127-2)*4 registers, since 2*4 registers are reserved as Clause Temporary Registers
Remarks
• Max. sizes of data spaces are specified along with the instructions formats of the
intermediate language.
• The actual sizes of the data spaces are implementation dependent.
Example: Simplified block diagram of the Cayman core (that underlies the HD 69xx series) [99]
Scalar execution: domain of execution: scalars, no indices; objects of execution: single data elements; supported by all processors.
SIMD execution: domain of execution: one-dimensional index space; objects of execution: data elements of vectors; supported by 2.G/3.G superscalars.
SIMT execution: domain of execution: two-dimensional index space; objects of execution: data elements of matrices; supported by GPGPUs/DPAs.
Figure 2.16: Domains of execution in case of scalar, SIMD and SIMT execution
2. Massive multithreading
For each element of the index space, called the domain of execution, the programmer creates
parallel executable threads that will be executed by the GPGPU or DPA.
Threads
(work items)
Domain of
execution
Figure 2.17: Parallel executable threads created and executed for each element of an execution
domain
The programmer describes the set of operations to be done over the entire domain of execution
by kernels.
Threads
(work items)
Kernels are specified at the HLL level and compiled to the intermediate level.
Specification of kernels
• A kernel is defined by
• Each thread that executes the kernel is given a unique identifier (thread ID, Work item ID)
that is accessible within the kernel.
Remark
During execution each thread is identified by a unique identifier that is
• int I in case of CUDA C, accessible through the threadIdx variable, and
• int id in case of OpenCL accessible through the built-in get_global_id() function.
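As a minimal illustrative sketch (the kernel name and sizes are made up, not taken from [43] or [144]), a CUDA C kernel typically combines the built-in blockIdx, blockDim and threadIdx variables into a global index:

// Hypothetical CUDA C kernel: each thread handles one element of a vector
__global__ void scale(float *data, float factor, int n)
{
    // unique global thread index composed from the block and thread indices
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               // guard threads that fall outside the domain of execution
        data[i] *= factor;
}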
Invocation of kernels
The kernel is invoked in CUDA C and OpenCL differently
• In CUDA C
by specifying the name of the kernel and the domain of execution [43] (a sketch follows after this list)
• In OpenCL
by specifying the name of the kernel and the related configuration arguments, not detailed
here [144].
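For illustration only, a hedged sketch of launching the above hypothetical kernel in CUDA C; the <<<grid, block>>> execution configuration specifies the domain of execution:

// Hypothetical host-side launch of the scale() kernel sketched above
int n = 512 * 512;                                    // size of the domain of execution
int threadsPerBlock = 256;
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);  // d_data: device pointer allocated e.g. by cudaMalloc()
cudaDeviceSynchronize();                              // wait until the kernel has finished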
• The domain of execution will be broken down into equal sized ranges, called
work allocation units (WAUs), i.e. units of work that will be allocated to the SIMT cores
as an entity.
Domain of execution Domain of execution
Global size m Global size m
WAU WAU
Global size n
Global size n
(0,0) (0,1)
WAU WAU
(1,0) (1,1)
Figure 2.19: Segmenting the domain of execution to work allocation units (WAUs)
E.g. Segmenting a 512 x 512 sized domain of execution into four 256 x 256 sized
work allocation units (WAUs).
WAU WAU
Global size n
Global size n
(0,0) (0,1)
WAU WAU
(1,0) (1,1)
Figure 2.20: Segmenting the domain of execution to work allocation units (WAUs)
Work allocation units will be assigned for execution to the available SIMT cores as entities
by the scheduler of the GPGPU/DPA.
Example: Assigning work allocation units to the SIMT cores in AMD’s Cayman GPGPU [93]
(0,0) (0,1)
SIMT cores
Serial kernel execution: the GPGPU scheduler assigns work allocation units only from a single kernel
to the available SIMT cores, i.e. the scheduler distributes work allocation units to available
SIMT cores for maximum parallel execution.
Concurrent kernel execution: the GPGPU scheduler is capable of assigning work allocation units
to SIMT cores from multiple kernels concurrently, with the constraint that the scheduler can
assign work allocation units to each particular SIMT core only from a single kernel.
• A global scheduler, called the Gigathread scheduler assigns work to each SIMT core.
• In Nvidia’s pre-Fermi GPGPU generations (G80-, G92-, GT200-based GPGPUs)
the global scheduler could only assign work to the SIMT cores from a single kernel
(serial kernel execution).
• By contrast, in Fermi-based GPGPUs the global scheduler is able to run up to 16 different
kernels concurrently, presumably one per SM (concurrent kernel execution).
Kernel 1: NDRange1
Global size 10
Kernel 2: NDRange2
Global size 20
(0,0) (0,1)
4.c Segmenting work allocation units into work scheduling units to be executed
on the execution pipelines of the SIMT cores-1
• Work scheduling units are parts of a work allocation unit that will be scheduled for execution
on the execution pipelines of a SIMT core as an entity.
• The scheduler of the GPGPU segments work allocation units into work scheduling units
of given size.
Example: Segmentation of a 16 x 16 sized Work Group into Subgroups of the size of 8x8
in AMD’s Cayman core [92]
Work Group
Wavefront of
64 elements
4.c Segmenting work allocation units into work scheduling units to be executed
on the execution pipelines of the SIMT cores-2
Work scheduling units are called warps by Nvidia or wavefronts by AMD.
Size of the work scheduling units
• In Nvidia’s GPGPUs the size of the work scheduling unit (called warp) is 32.
• AMD’s GPGPUs have different work scheduling sizes (called wavefront sizes)
• High performance GPGPU cards have typically wavefront sizes of 64, whereas
• lower performance cards may have wavefront sizes of 32 or even 16.
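As a simple worked example (sizes chosen for illustration only), a 16 x 16 work group contains 256 work items, which segment as follows:

256 work items / 32 threads per warp         = 8 warps      (Nvidia GPGPUs)
256 work items / 64 work items per wavefront = 4 wavefronts (high performance AMD cards)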
The scheduling units, created by segmentation, are then sent to the scheduler.
Example: Sending work scheduling units for execution to SIMT cores in AMD’s Cayman core [92]
Work Group
Subgroup of 64 elements
4.d Scheduling work scheduling units for execution to the execution pipelines of the
SIMT cores
The scheduler assigns work scheduling units to the execution pipelines of the SIMT cores
for execution according to a chosen scheduling policy (discussed in the case example parts
5.1.6 and 5.2.8).
Work scheduling units will be executed on the execution pipelines (ALUs) of the SIMT cores.
SIMT core
ALU ALU ALU ALU ALU
Note
Massive multithreading is a means to prevent stalls occurring during the execution of
work scheduling units due to long latency operations, such as memory accesses caused
by cache misses.
Example
Up to date (Fermi-based) Nvidia GPGPUs can maintain up to 48 work scheduling units,
called warps per SIMT core.
For instance, the GTX 580 includes 16 SIMT cores, with 48 warps per SIMT core and
32 threads per warp for a total number of 24576 threads.
• The model of data sharing declares the possibilities to share data between threads.
• This is not an orthogonal concept, but results from both
• the memory concept and
• the concept of assigning work to execution pipelines of the GPGPU.
Per-thread
reg. file
Local Memory
Domain of execution 1
Notes
Domain of execution 2
1) Work Allocation Units
are designated in the Figure
as Thread Block/Block
In SIMT processing both paths of a branch are executed one after the other, such that
for each path the prescribed operations are executed only on those data elements which
fulfill the data condition given for that path (e.g. xi > 0).
Example
First all ALUs meeting the condition execute the prescribed three operations,
then all ALUs missing the condition execute the next two operations.
Figure 2.23: Resuming instruction stream processing after executing a branch [24]
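A minimal CUDA C sketch of such a data dependent branch (the kernel and variable names are illustrative only); threads of a warp whose elements satisfy x[i] > 0 execute the first path, then the remaining threads of the warp execute the second path:

// Hypothetical CUDA C kernel with a data dependent branch
__global__ void branch_example(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] > 0.0f)             // executed first, only by threads whose element fulfills the condition
            y[i] = x[i] * 2.0f;
        else                         // executed afterwards by the remaining threads of the warp
            y[i] = 0.0f;
    }
}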
7. Barrier synchronization
Barrier synchronization
Synchronization of Synchronization of
thread execution memory read/writes
It is implemented
• in Nvidia’s PTX by the “membar” instruction [147] or
• in AMD’s IL by the “fence lds”/”fence memory” instructions [10].
Discussion of this topic assumes the knowledge of programming details therefore it is omitted.
Interested readers are referred to the related reference guides [147], [104], [105].
Example
• Evolution of the pseudo ISA of Nvidia’s GPGPUs and their support in real GPGPUs.
• Subsequent versions of both the pseudo- and real ISA are designated as compute capabilities.
2. Basics of the SIMT execution (68)
2. Basics of the SIMT execution (69)
2.1: GF108, GF106, GF104, GF114 cores; GT 420/30/40, GTS 450, GTX 450, GTX 460, GTX 550Ti, GTX 560Ti cards
(Figure residue: evolution of GPGPU cores and process nodes - Nvidia: G80 (90 nm), G92 (65 nm, shrink), G200 (65 nm, enhanced arch.), GF100/Fermi (40 nm);
AMD/ATI: R600 (80 nm), RV670 (55 nm, shrink), RV770 (55 nm, enhanced arch.), RV870 (40 nm), Cayman (40 nm);
OpenCL emerging as a common standard; Nvidia dates: 11/06, 10/07, 6/08; AMD/ATI dates: 11/05, 5/07, 11/07, 5/08; standard dates: 6/07, 11/07, 6/08, 11/08)
Figure 3.3: Overview of GPGPUs and their basic software support (1)
3. Overview of GPGPUs (4)
NVidia
Cores: GF100 (Fermi, 3/10, 40 nm/3000 mtrs), GF104 (Fermi, 07/10, 40 nm/1950 mtrs), GF110 (Fermi, 11/10, 40 nm/3000 mtrs)
Cards: GTX 470, GTX 480, GTX 460, GTX 580, GTX 560 Ti (1/11)
ALUs: 448, 480, 336, 512, 480
Memory bus: 320-bit, 384-bit, 192/256-bit, 384-bit, 384-bit
CUDA: Version 2.2, Version 2.3, Version 3.0, Version 3.1, Version 3.2, Version 4.0 Beta
AMD/ATI
9/09 10/10 12/10
Figure 3.4: Overview of GPGPUs and their basic software support (2)
3. Overview of GPGPUs (5)
Beginning with their Cypress-based HD 5xxx line and SDK v.2.0 AMD left Brook+
and started supporting OpenCL as their basic HLL programming language.
AMD/ATI
9/09 10/10 12/10
FP64 speed
• 1/2 of the FP32 speed for the Tesla 20-series
• 1/8 of the FP32 speed for the GeForce GTX 470/480/570/580 cards
• 1/12 of the FP32 speed for other GeForce GTX 4xx cards
ECC
available only on the Tesla 20-series
Memory size
Tesla 20 products have larger on board memory (3GB and 6GB)
Positioning Nvidia’s discussed GPGPU cards in their entire product portfolio [82]
3. Overview of GPGPUs (10)
b) Device parameters bound to the compute capability versions of Nvidia’s GPGPUs [81]
3. Overview of GPGPUs (11)
2.1: GF108, GF106, GF104, GF114 cores; GT 420/30/40, GTS 450, GTX 450, GTX 460, GTX 550Ti, GTX 560Ti cards
IC technology 90 nm 90 nm 65 nm 65 nm 65 nm
Nr. of transistors 681 mtrs 681 mtrs 754 mtrs 1400 mtrs 1400 mtrs
Die area 480 mm2 480 mm2 324 mm2 576 mm2 576 mm2
Core frequency 500 MHz 575 MHz 600 MHz 576 MHz 602 MHz
Computation
No of SMs (cores) 12 16 14 24 30
Shader frequency 1.2 GHz 1.35 GHz 1.512 GHz 1.242 GHz 1.296 GHz
Peak FP32 performance 230.4 GFLOPS 345.61 GFLOPS 508 GFLOPS 715 GFLOPS 933 GFLOPS
Mem. bandwidth 64 GB/s 86.4 GB/s 57.6 GB/s 111.9 GB/s 141.7 GB/s
Interface PCIe x16 PCIe x16 PCIe 2.0x16 PCIe 2.0x16 PCIe 2.0x16
1: Nvidia takes the FP32 capable Texture Processing Units also into consideration and calculates with 3 FP32 operations/cycle
Table 3.1: Main features of Nvidia's GPGPUs (1)
3. Overview of GPGPUs (13)
Remarks
In publications there are conflicting statements about whether the G80 makes use
of dual issue (a MAD and a MUL operation) within a period of four shader cycles.
Official specifications [22] declare the capability of dual issue, but other literature sources [64]
and even a textbook co-authored by one of the chief developers of the G80 (D. Kirk [65])
deny it.
A clarification can be found in a blog [66], revealing that the higher figure given in Nvidia's
specifications includes calculations made both by the ALUs in the SMs and by the texture
processing units (TPUs).
Nevertheless, the TPUs cannot be directly accessed by CUDA except for graphical tasks,
such as texture filtering.
Accordingly, in our discussion focusing on numerical calculations it is fair to take only
the MAD operations into account for specifying the peak numerical performance.
GTX 470 GTX 480 GTX 460 GTX 570 GTX 580
Core GF100 GF100 GF104 GF110 GF110
IC technology 40 nm 40 nm 40 nm 40 nm 40 nm
Nr. of transistors 3200 mtrs 3200 mtrs 1950 mtrs 3000 mtrs 3000 mtrs
Die area 529 mm2 529 mm2 367 mm2 520 mm2 520 mm2
Shader frequency 1215 MHz 1401 MHz 1350 MHz 1464 MHz 1544 MHz
Peak FP32 performance 1088 GFLOPS 1345 GFLOPS 907.2 GFLOPS 1405 GFLOPS 1581 GFLOPS
Peak FP64 performance 136 GFLOPS 168 GFLOPS 75.6 GFLOPS 175.6 GFLOPS 197.6 GFLOPS
Memory
Mem. transfer rate (eff) 3348 Mb/s 3698 Mb/s 3600 Mb/s 3800 Mb/s 4008 Mb/s
Mem. bandwidth 133.9 GB/s 177.4 GB/s 86.4/115.2 GB/s 152 GB/s 192.4 GB/s
Interface PCIe 2.0*16 PCIe 2.0*16 PCIe 2.0*16 PCIe 2.0*16 PCIe 2.0*16
MS Direct X 11 11 11 11 11
Remarks
IC technology 80 nm 55 nm 55 nm 55 nm 55 nm
Nr. of transistors 700 mtrs 666 mtrs 666 mtrs 956 mtrs 956 mtrs
Die area 408 mm2 192 mm2 192 mm2 260 mm2 260 mm2
Core frequency 740 MHz 670 MHz 775 MHz 625 MHz 750 MHz
Computation
No. of ALUs 320 320 320 800 800
Shader frequency 740 MHz 670 MHz 775 MHz 625 MHz 750 MHz
Peak FP32 performance 471.6 GFLOPS 429 GFLOPS 496 GFLOPS 1000 GFLOPS 1200 GFLOPS
Mem. bandwidth 105.6 GB/s 53.1 GB/s 72.0 GB/s 64 GB/s 118 GB/s
Mem. contr. Ring bus Ring bus Ring bus Crossbar Crossbar
System
Multi-GPU techn. CrossFire X CrossFire X CrossFire X CrossFire X CrossFire X
Interface PCIe x16 PCIe 2.0x16 PCIe 2.0x16 PCIe 2.0x16 PCIe 2.0x16
IC technology 40 nm 40 nm 40 nm
MS Direct X 11 11 11
IC technology 40 nm 40 nm
Mem. size 1 GB 1 GB
MS Direct X 11 11
IC technology 40 nm 40 nm 40 nm 40 nm
Nr. of transistors 2.64 billion 2.64 billion 2*2.64 billion 2*2.64 billion
Die area 389 mm2 389 mm2 2*389 mm2 2*389 mm2
Core frequency 800 MHz 880 MHz 830 MHz 880 MHz
Computation
No. of SIMD cores /VLIW4 ALUs 22/16 24/16 2*24/16 2*24/16
Shader frequency 800 MHz 880 MHz 830 MHz 880 MHz
Peak FP32 performance 2.25 TFLOPS 2.7 TFLOPS 5.1 TFLOPS 5.4 TFLOPS
Peak FP64 performance 0.5625 TFLOPS 0.683 TFLOPS 1.275 TFLOPS 1.35 TFLOPS
Memory
Mem. transfer rate (eff) 5000 Mb/s 5500 Mb/s 5000 Mb/s 5000 Mb/s
Mem. bandwidth 160 GB/s 176 GB/s 2*160 GB/s 2*160 GB/s
MS Direct X 11 11 11 11
Remark
The Radeon HD 5xxx line of cards is also designated as the Evergreen series and
the Radeon HD 6xxx line of cards as the Northern Islands series.
AMD HD 6990: 2 x ATI HD 6970 with slightly reduced memory and shader clock
Nvidia
GTX 570 ~ 350 $
GTX 580 ~ 500 $
AMD
HD 6970 ~ 400 $
HD 6990 ~ 700 $
(Dual 6970)
Trend: from on-card implementation (recent implementations) towards on-die integration (emerging implementations)
On-card accelerators
E.g. Nvidia Tesla C870 Nvidia Tesla D870 Nvidia Tesla S870
Nvidia Tesla C1060 Nvidia Tesla S1070
Nvidia Tesla C2070 Nvidia Tesla S2050/S2070
AMD FireStream 9170
AMD FireStream 9250
AMD FireStream 9370
G80-based GT200-based
6/07 6/08
6/07
Desktop D870
2*C870 incl.
3 GB GDDR3
SP: 691.2 GFLOPS
DP: -
6/07 6/08
1U Server S870 S1070
4*C870 incl. 4*C1060
6 GB GDDR3 16 GB GDDR3
SP: 1382 GFLOPS SP: 3732 GFLOPS
DP: - DP: 311 GFLOPS
2007 2008
Figure 4.4: Main functional units of Nvidia’s Tesla C870 card [2]
Figure 4.8: PCI-E x16 host adapter card of Nvidia’s Tesla D870 desktop [4]
Figure 4.11: Connection cable between Nvidia’s Tesla S870 1U rack and the adapter cards
inserted into PCI-E x16 slots of the host server [6]
4. Overview of data parallel accelerators (12)
11/09
Card C2050/C2070
3/6 GB GDDR5
SP: 1.03 TFLOPS1
DP: 0.515 TFLOPS
04/10 08/10
Module M2050/M2070 M2070Q
11/09
1U Server S2050/S2070
4*C2050/C2070
12/24 GB GDDR31
SP: 4.1 TFLOPS
DP: 2.06 TFLOPS
6/10
Remark
The M2070Q is an upgrade of the M2070 providing higher memory clock (introduced 08/2010)
4. Overview of data parallel accelerators (15)
Support of ECC
• Fermi based Tesla devices introduced the support of ECC.
• By contrast, at present neither Nvidia's regular GPGPU cards nor AMD's GPGPU or
DPA devices support ECC [76].
Tesla S2050/S2070 1U
The S2050/S2070 differ only in the memory size, the S2050 includes 12 GB, the S2070 24 GB.
GPU Specification
Number of processor cores: 448
Processor core clock: 1.15 GHz
Memory clock: 1.546 GHz
Memory interface: 384 bit
System Specification
Four Fermi GPUs
12.0/24.0 GB of GDDR5,
configured as 3.0/6.0 GB per GPU.
RV670-based RV770-based
11/07 6/08
6/08 10/08
9250 9250
1 GB GDDR3 Shipped
FP32: 1000 GFLOPS
FP64: ~300 GFLOPS
12/07 09/08
Stream Computing
SDK Version 1.0 Version 1.2
Brook+ Brook+
ACML (AMD Core Math Library) ACML (AMD Core Math Library)
CAL (Compute Abstraction Layer) CAL (Compute Abstraction Layer)
Rapid Mind
2007 2008
RV870-based
06/10 10/10
Core
Table 4.1: Main features of Nvidia’s data parallel accelerator cards (Tesla line) [73]
Core
Core frequency 800 MHz 625 MHz 700 MHz 825 MHz
ALU frequency 800 MHz 325 MHz 700 MHz 825 MHz
Peak FP32 performance 512 GFLOPS 1 TFLOPS 2016 GFLOPS 2640 GFLOPS
Peak FP64 performance ~200 GFLOPS ~250 GFLOPS 403.2 GFLOPS 528 GFLOPS
Memory
Mem. transfer rate (eff) 1600 Gb/s 1986 Gb/s 4000 Gb/s 4600 Gb/s
Mem. bandwidth 51.2 GB/s 63.5 GB/s 128 GB/s 147.2 GB/s
Mem. size 2 GB 1 GB 2 GB 4 GB
Table 4.2: Main features of AMD/ATI’s data parallel accelerator cards (FireStream line) [67]
Nvidia Tesla
C2050 ~ 2000 $
C2070 ~ 4000 $
S2050 ~ 13 000 $
S2070 ~ 19 000 $
NVidia GTX
GTX580 ~ 500 $
Dezső Sima
Aim
Brief introduction and overview.
Announced: 30. Sept. 2009 at NVidia’s GPU Technology Conference, available: 1Q 2010 [83]
Sub-families of Fermi
Fermi includes three sub-families with the following representative cores and features:
1 In the associated flagship card (GTX 480) however, one of the SMs has been disabled, due to overheating
problems, so it has actually only 15 SIMD cores, called Streaming Multiprocessors (SMs) by Nvidia and 480
FP32 EUs [69]
(Table residue listing the vendors' equivalent terms for the execution units: ALU (Arithmetic Logic Unit), VLIW4/VLIW5 ALU, Stream core (in OpenCL SDKs), Streaming Processor, Compute Unit Pipeline (6900 ISA), CUDA core, SIMD pipeline, Thread processor and Shader processor (pre-OpenCL terms))
These models are only outlined here, a detailed description can be found in the related
documentation [147].
Remark
The outlined four abstractions remained basically unchanged throughout the life span of PTX
(from version 1.0 (6/2007) to version 2.3 (3/2011)).
A set of ALUs
within the
SIMD cores
A set of ALUs
within the
SIMD cores
Compilation to
PTX pseudo ISA
instructions
Translation to
executable CUBIN file
at load time
CUBIN FILE
Source: [68]
5.1.2 Nvidia’s PTX Virtual Machine Concept (12)
The PTX virtual machine concept gives rise to a two phase compilation process.
1) First, the application, e.g. a CUDA or OpenCL program, will be compiled to a pseudo code,
also called PTX ISA code or PTX code, by the appropriate compiler.
The PTX code is a pseudo code since it is not directly executable and needs to be translated
to the actual ISA of a given GPGPU to become executable.
Application
(CUDA C/OpenCL file)
Two-phase compilation
CUDA C compiler
or
OpenCL compiler
• First phase:
Compilation to the PTX ISA format
(stored in text format)
pseudo ISA instructions)
CUDA driver
• Second phase (during loading):
JIT-compilation to
executable object code
(called CUBIN file).
CUBIN file (executable on the GPGPU)
5.1.2 Nvidia’s PTX Virtual Machine Concept (13)
2) In order to become executable the PTX code needs to be compiled to the actual ISA code of a
particular GPGPU, called the CUBIN file.
This compilation is performed by the CUDA driver during loading the program (Just-In-Time).
Application
(CUDA C/OpenCL file)
Two-phase compilation
CUDA C compiler
or
OpenCL compiler
• First phase:
Compilation to the PTX ISA format (pseudo ISA instructions, stored in text format)
CUDA driver
• The compiled pseudo ISA code (PTX code) remains in principle independent from the
actual hardware implementation of a target GPGPU, i.e. it is portable over subsequent
GPGPU families.
Porting a PTX file to a GPGPU of a lower compute capability level, however, may need emulation
of features not implemented in hardware, which slows down execution.
Forward portability of GPGPU code (CUBIN code) is provided however only within major
compute capability versions.
• Forward portability of PTX code is highly advantageous in the recent rapid evolution phase of
GPGPU technology as it results in less costs for code refactoring.
Code refactoring costs are a kind of software maintenance costs that arise when the user
switches from a given generation to a subsequent GPGPU generation (like from GT200
based devices to GF100 or GF110-based devices) or to a new software environment
(like from CUDA 1.x SDK to CUDA 2.x or CUDA 3.x SDK).
Remarks [149]
1) • Nvidia manages the evolution of their devices and programming environment by maintaining
compute capability versions of both
• their intermediate virtual PTX architectures (PTX ISA) and
• their real architectures (GPGPU ISA).
• Designation of the compute capability versions
• Subsequent versions of the intermediate PTX ISA are designated as PTX ISA 1.x or 2.x.
• Subsequent versions of GPGPU ISAs are designated as sm_1x/sm_2x or simply by 1.x/2.x.
• The first digit 1 or 2 denotes the major version number, the second or subsequent digit
denotes the minor version.
• Major versions of 1.x or 1x relate to pre-Fermi solutions whereas those of 2.x or 2x
to Fermi based solutions.
5.1.2 Nvidia’s PTX Virtual Machine Concept (18)
5.1.2 Nvidia’s PTX Virtual Machine Concept (19)
2.1: GF108, GF106, GF104, GF114 — GT 420/30/40, GTS 450, GTX 450, GTX 460, GTX 550Ti, GTX 560Ti
Remarks (cont.)
2) Contrasting the virtual machine concept with traditional computer technology
Whereas the PTX virtual machine concept is based on a forward compatible but not directly
executable compiler target code (pseudo code), in traditional computer technology
the compiled code, such as an x86 object code, is immediately executable by the processor.
Earlier CISC processors, like Intel's x86 processors up to the Pentium, executed x86 code
directly in hardware.
Subsequent CISCs, beginning with the 2nd generation superscalars (like the Pentium Pro) and
including current x86 processors, like Intel's Nehalem (2008) or AMD's Bulldozer (2011),
first map x86 CISC instructions during decoding to internally defined RISC instructions.
In these processors a ROM-based µcode engine (i.e. firmware) supports the decoding of
complex x86 instructions (instructions which need more than 4 RISC instructions).
The RISC core of the processor then executes the requested RISC operations directly.
Remarks (cont.)
3) Nvidia's CUDA compiler (nvcc) has been designated as the CUDA C compiler beginning with
CUDA version 3.0, to stress its support of C.
Remarks (cont.)
4) nvcc can be used to generate either architecture specific files (CUBIN files) or
forward compatible PTX versions of the kernels [52].
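As an illustration, the two kinds of output can be requested from nvcc roughly as follows (a minimal sketch;
the file name vecAdd.cu is a placeholder, and sm_20 stands for the Fermi GPGPU ISA):

  nvcc -ptx vecAdd.cu -o vecAdd.ptx      # forward compatible pseudo ISA (PTX) text file
  nvcc -cubin -arch=sm_20 vecAdd.cu      # architecture specific binary (CUBIN) for sm_20 devices

The PTX file can later be JIT-compiled by the CUDA driver for whatever GPGPU is present,
whereas the CUBIN file runs only on devices of the targeted major compute capability version.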
Remarks (cont.)
The virtual machine concept underlying both Nvidia’s and AMD’s GPGPUs is similar to
the virtual machine concept underlying Java.
• For Java there is also an inherent computational model and a pseudo ISA, called the
Java bytecode.
• Applications written in Java will first be compiled to the platform independent Java bytecode.
• The Java bytecode will then either be interpreted by the Java Runtime Environment (JRE)
installed on the end user's computer or compiled at runtime by the JRE's Just-In-Time (JIT)
compiler.
With PTX 2.0 Nvidia states that they have created a longevity ISA for GPUs,
much like the x86 ISA for CPUs.
Based on the key innovations and declared goals of Fermi's ISA (PTX 2.0), and considering
the significant innovations and enhancements made in the microarchitecture,
it can be expected that Nvidia's GPGPUs have entered a phase of relative consolidation.
These new features greatly improve GPU programmability, accuracy and performance.
a) Unified address space for all variables and pointers with a single set of
load/store instructions-1 [58]
• In PTX 1.0 there are three separate address spaces
(thread private local, block shared and global)
with specific load/store instructions to each one of the three address spaces.
• Programs could load or store values only in a particular target address space whose addresses
were known at compile time.
This made it difficult to fully implement C and C++ pointers, since a pointer's target address
may only be determined dynamically at run time.
a) Unified address space for all variables and pointers with a single set of
load/store instructions-2 [58]
• PTX 2.0 unifies all three address spaces into a single continuous address space that
can be accessed by a single set of load/store instructions.
• PTX 2.0 allows unified pointers to be used to pass objects in any memory space;
Fermi's hardware automatically maps pointer references to the correct memory space.
Thus the concept of the unified address space enables Fermi to support C++ programs.
• Nvidia’s previous generation GPGPUs (G80, G92, GT200) provide 32 bit addressing
for load/store instructions,
• PTX 2.0 extends the addressing capability to 64 bits for future growth;
recent Fermi implementations, however, use only 40-bit addresses, allowing an
address space of 1 Terabyte to be accessed.
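As a minimal sketch of what this unification makes convenient (assumed function and kernel names;
the fragment compiles as a separate .cu file for sm_20): on compute capability 2.x devices the same
device function can dereference a pointer regardless of whether it points into global or shared memory,
whereas on 1.x devices the compiler had to resolve the target address space statically.

  // p may point into global OR shared memory on sm_20 (generic addressing, PTX 2.0)
  __device__ float sum3(const float *p)
  {
      return p[0] + p[1] + p[2];
  }

  // launch with 64 threads per block; gdata is assumed to hold at least 64 + 2 elements
  __global__ void demo(const float *gdata, float *out)
  {
      __shared__ float sdata[64];
      sdata[threadIdx.x] = gdata[threadIdx.x];
      __syncthreads();
      // the same load/store path serves both the global and the shared pointer
      out[threadIdx.x] = sum3(gdata + threadIdx.x) + sum3(sdata);
  }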
Figure 5.1.2: Supported languages and APIs (as of CUDA version 3.0)
• CUDA C exposes the CUDA programming model as a minimal set of C language extensions
(e.g. CUDA C source files to be compiled with nvcc, or libraries such as CUBLAS).
• These extensions allow kernels to be defined along with the dimensions of the associated
grids and thread blocks.
• A CUDA C program must be compiled with nvcc.
• The CUDA Driver API is a lower level C API that allows kernels to be loaded and launched
as modules of binary or assembly CUDA code, and the platform to be managed.
• Binary and assembly codes are usually obtained by compiling kernels written in C.
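A minimal host-side sketch of this Driver API usage (assuming a CUDA 4.0-style driver API; the module
file vecAdd.cubin and the kernel name vecAdd are placeholders, and error checking is omitted):

  #include <cuda.h>

  int main(void)
  {
      CUdevice dev; CUcontext ctx; CUmodule mod; CUfunction fn;
      int n = 1 << 20;

      cuInit(0);
      cuDeviceGet(&dev, 0);
      cuCtxCreate(&ctx, 0, dev);                 // context on the first GPU
      cuModuleLoad(&mod, "vecAdd.cubin");        // load a precompiled CUBIN (or PTX) module
      cuModuleGetFunction(&fn, mod, "vecAdd");   // look up the kernel in the module

      CUdeviceptr dA, dB, dC;
      cuMemAlloc(&dA, n * sizeof(float));
      cuMemAlloc(&dB, n * sizeof(float));
      cuMemAlloc(&dC, n * sizeof(float));

      void *args[] = { &dA, &dB, &dC, &n };      // kernel arguments
      cuLaunchKernel(fn, n / 256, 1, 1,          // grid dimensions
                         256, 1, 1,              // thread block dimensions
                         0, 0, args, 0);         // shared memory, stream, arguments
      cuCtxSynchronize();
      return 0;
  }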
Remark
Compute capability dependent memory sizes of Nvidia’s GPGPUs
Overview
• CUDA C allows the programmer to define kernels as C functions that, when called,
are executed N times in parallel by N different CUDA threads, as opposed to only once,
like regular C functions.
• A kernel is defined by
• using the __global__ declaration specifier and
• declaring the instructions to be executed.
• The number of CUDA threads that execute the kernel for a given kernel call is specified
during kernel invocation by using the <<< … >>> execution configuration syntax.
• Each thread that executes the kernel is given a unique thread ID that is accessible within
the kernel through the built-in threadIdx variable.
The subsequent sample code illustrates a kernel that adds two vectors A and B of size N and
stores the result into vector C as well as its invocation.
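A minimal sketch of such a kernel and its invocation (host-side allocation and copies are included
for completeness; a vector addition example of this structure is given in the CUDA Programming Guide [11]):

  #include <cuda_runtime.h>
  #include <cstdio>

  #define N 256

  // Kernel definition: each of the N threads adds one element pair
  __global__ void VecAdd(const float *A, const float *B, float *C)
  {
      int i = threadIdx.x;                      // unique thread ID within the thread block
      C[i] = A[i] + B[i];
  }

  int main()
  {
      float hA[N], hB[N], hC[N];
      for (int i = 0; i < N; ++i) { hA[i] = i; hB[i] = 2.0f * i; }

      float *dA, *dB, *dC;
      cudaMalloc((void**)&dA, N * sizeof(float));
      cudaMalloc((void**)&dB, N * sizeof(float));
      cudaMalloc((void**)&dC, N * sizeof(float));
      cudaMemcpy(dA, hA, N * sizeof(float), cudaMemcpyHostToDevice);
      cudaMemcpy(dB, hB, N * sizeof(float), cudaMemcpyHostToDevice);

      // Kernel invocation: one thread block of N threads, specified with the
      // <<< ... >>> execution configuration syntax
      VecAdd<<<1, N>>>(dA, dB, dC);

      cudaMemcpy(hC, dC, N * sizeof(float), cudaMemcpyDeviceToHost);
      printf("C[10] = %f\n", hC[10]);

      cudaFree(dA); cudaFree(dB); cudaFree(dC);
      return 0;
  }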
c2) The allocation of threads and thread blocks to ALUs and SIMD cores
Available register spaces for threads, thread blocks and grids-1 [43]
Available register spaces for threads, thread blocks and grids-2 [43]
Per-thread
reg. space
Major innovations
a) Concurrent kernel execution
b) True two level cache hierarchy
c) Configurable shared memory/L1 cache per SM
d) ECC support
Major enhancements
a) Vastly increased FP64 performance
b) Greatly reduced context switching times
c) 10-20 times faster atomic memory operations
It protects
• DRAM memory
• register files
• shared memories
• L1 and L2 caches.
Remark
ECC support is provided only for Tesla devices.
1 The G80/G92 does not support FP64.
5.1.5 Major innovations and enhancements of Fermi’s microarchitecture (8)
Nvidia: 16 cores (Streaming Multiprocessors, SMs)
Remark
In the associated flagship card (GTX 480), however, one SM has been disabled
due to overheating problems, so it actually has 15 SMs and 480 ALUs [a]
Fermi GF100
Note
The high level microarchitecture of Fermi evolved from a graphics oriented structure
to a computation oriented one, complemented with the units needed for graphics processing.
SFU: Special
Function Unit
1 SM includes 32 ALUs
(called "CUDA cores" by Nvidia)
GT80 SM [57]
(Figure: a Streaming Multiprocessor with Instruction L1 cache, Instruction Fetch/Dispatch,
Shared Memory, 8 SPs and 2 SFUs)
G80 SM:
• 16 KB Shared Memory
• 8 K registers x 32-bit/SM
• up to 24 active warps/SM, up to 768 active threads/SM
• ~10 registers/thread on average
GT200 SM:
• 16 KB Shared Memory
• 16 K registers x 32-bit/SM
• up to 32 active warps/SM, up to 1 K active threads/SM
• ~16 registers/thread on average
GF100 SM:
• 64 KB Shared Memory/L1 Cache
• up to 48 active warps/SM, up to 1536 active threads/SM
• ~20 registers/thread on average
• 32 threads/warp
• 1 FMA FPU (not shown)
5.1.6 Microarchitecture of Fermi GF100 (6)
GF100 [70] GF104 [55]
Further evolution of the cores
(SMs) in Nvidia’s GPGPUs -2
GF104 [55]
Available specifications:
Data about
• the number of active warps/SM and
• the number of active threads/SM
are at present (March 2011) not available.
1 SM includes 32 ALUs
called “Cuda cores” by NVidia)
SP FP:32-bit
Remark
The Fermi line supports the Fused Multiply-Add (FMA) operation, rather than the Multiply-Add
operation performed in previous generations.
Previous lines
Fermi
Figure 5.1.7: Contrasting the Multiply-Add (MAD) and the Fused-Multiply-Add (FMA) operations
[56]
Host Device
• A global scheduler, called the Gigathread scheduler assigns work to each SM.
• In previous generations (G80, G92, GT200) the global scheduler could only assign work to the
SMs from a single kernel (serial kernel execution).
• The global scheduler of Fermi is able to run up to 16 different kernels concurrently, one per SM.
• A large kernel may be spread over multiple SMs.
The context switch time occurring between kernel switches is greatly reduced compared to
the previous generation, from about 250 µs to about 20 µs (needed for cleaning up TLBs,
dirty data in caches, registers etc.) [39].
Remark
The number of threads constituting a warp
is an implementation decision and not
part of the CUDA programming model.
E.g. in the G80 there are 24 warps per SM, whereas
in the GT200 there are 32 warps per SM.
Nvidia did not reveal details of the microarchitecture of Fermi so the subsequent
discussion of warp scheduling is based on assumptions given in the sources [39], [58].
Assumed block diagram of the Fermi GF100 microarchitecture and its operation
Remark
Fermi’s front end is similar to the basic building block of AMD’s Bulldozer core (2011)
that consists of two tightly coupled thin cores [85].
1 The G80/G92 does not support DP FP.
5.1.6 Microarchitecture of Fermi GF100 (25)
Official documentation reveals only that the Fermi GF100 has dual issue, zero overhead,
prioritized warp scheduling [58].
Remarks
D. Kirk, one of the developers of Nvidia’s GPGPUs details warp scheduling in [12],
but this publication includes two conflicting figures, one indicating to coarse grain and the
other to fine grain warp scheduling as shown below.
Underlying microarchitecture of warp scheduling in an SM of the G80
(Figure: instruction L1 cache (I$), multithreaded instruction buffer, register file (RF),
constant cache (C$ L1), shared memory, operand select, MAD and SFU units)
• The G80 fetches one warp instruction per issue cycle from the instruction L1 cache
into any instruction buffer slot.
• It issues one "ready-to-go" warp instruction per issue cycle from any warp instruction
buffer slot.
• Operand scoreboarding is used to prevent hazards:
an instruction becomes ready after all needed operand values are deposited;
cleared instructions become eligible for issue.
• Issue selection is based on round-robin/age of warp.
• The SM broadcasts the same instruction to the 32 threads of a warp.
Scheduling policy of warps in an SM of the G80 indicating coarse grain warp scheduling
(Figure: SM Warp Scheduling — TB1, W1 stall; TB2, W1 stall; TB3, W2 stall; …)
Note
The scheduling scheme shown indicates coarse grain warp scheduling.
Scheduling policy of warps in an SM of the G80 indicating fine grain warp scheduling
(Figure: issue sequence e.g. warp 3 instruction 95, …, warp 8 instruction 12, warp 3 instruction 96, …)
Figure 5.1.15: Warp scheduling in the G80 [12]
5.1.6 Microarchitecture of Fermi GF100 (30)
Key differences in the block diagrams of the microarchitectures of the GT200 and Fermi
(Assumed block diagrams [39] without showing result data paths)
Vastly increased
execution resources
5.1.7 Comparing the microarchitectures of Fermi GF100 and GT200 (2)
Throughput of arithmetic operations per clock cycle per SM in the GT200 [43]
5.1.7 Comparing the microarchitectures of Fermi GF100 and GT200 (4)
Latency, throughput and warp issue rate of the arithmetic pipelines of the GT200 [60]
• A global scheduler, called the Gigathread scheduler, assigns work to each SM.
• In the GT200 and all previous generations (G80, G92) the global scheduler could only assign
work to the SMs from a single kernel (serial kernel execution).
• By contrast, Fermi's global scheduler is able to run up to 16 different kernels concurrently,
presumably one per SM.
The global scheduler distributes the thread blocks of the running kernel to the available SMs,
typically assigning multiple blocks to each SM, as indicated in the Figure below.
Each thread block may consist of multiple warps, e.g. of 4 warps, as indicated in the Figure.
Two issue rates need to be distinguished:
• the maximal issue rate of warps to a particular group of pipelined execution units of an SM,
called the issue rate by Nvidia, and
• the maximal issue rate of warps to the execution pipeline of the SM.
The maximal issue rate of warps to a particular group of pipelined execution units
(called the warp issue rate (clocks per warp) in Nvidia's terminology)
depends on the number and throughput of the individual execution units in a group,
or, from another point of view, on the total throughput of all execution units
constituting a group of execution units, called arithmetic and flow control pipelines.
The issue rate of the arithmetic and flow control pipelines will be determined by the warp size
(32 threads) and the throughput (ops/clock) of the arithmetic or flow control pipelines,
as shown below for the arithmetic pipelines.
Table 5.1.3: Issue rate of the arithmetic pipelines in the GT 200 [60]
5.1.7 Comparing the microarchitectures of Fermi GF100 and GT200 (12)
Accordingly,
• the issue of an FP32 MUL or MAD warp needs 32/8 = 4 clock cycles, or
• the issue of an FP64 MUL or MAD warp needs 32/1 = 32 clock cycles.
In the GT200 FP64 instructions are executed at 1/8 rate of FP32 instructions.
The maximal issue rate of warps to the execution pipeline of the SM – 1 (based on [25])
As discussed previously, the Warp Schedulers of the SM can issue warps to the arithmetic
pipelines at the associated issue rates, e.g. in case of FX32 or FP32 warp instructions
in every fourth shader cycle to the FPU units.
Nevertheless, the scheduler of the GT200 is capable of issuing warp instructions in every second
shader cycle to unoccupied arithmetic pipelines of the SM if there are no dependencies
with previously issued instructions.
E.g. after issuing an FP32 MAD instruction to the FPU units, the Warp Scheduler can issue
an FP32 MUL instruction to the SFU units already two cycles later, if these units are not busy
and there is no data dependency between the two instructions.
The FP32 MUL warp instruction will then occupy the SFU units for 4 shader cycles.
5.1.7 Comparing the microarchitectures of Fermi GF100 and GT200 (15)
The maximal issue rate of warps to the execution pipeline of an SM – 2 (based on [25])
In this way the Warp Schedulers of the SMs may issue up to two warp instructions to the
single execution pipeline of the SM in every four shader cycles, provided that there are
no resource or data dependencies.
Figure 5.1.22: Dual issue of warp instructions in every 4 cycles in the GT200
(Based on [25])
PFP64 = 1/32 x 32 x 2 x fs x n
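Evaluating this formula for the GT200-based GTX 280 / Tesla C1060 (assuming their 30 SMs and
1.296 GHz shader frequency, figures inserted here only for illustration):

  PFP64 = 1/32 x 32 x 2 x 1.296 GHz x 30 ≈ 77.8 GFLOPS

which is consistent with the 30 x 1 x 2 x 1296 peak FP64 figure quoted for the C1060 card later in this section.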
Figure 5.1.23: Contrasting the overall structures of the GF104 and the GF100 [69]
5.1.8 Microarchitecture of Fermi GF104 (2)
Note
In the GF104 based GTX 460 flagship card Nvidia activated only 7 SMs rather than
all 8 SMs available, due to overheating.
Execution resources per SM in the GF104 vs the GF100:
                          GF100    GF104
No. of SP FX/FP32 ALUs      32       48
No. of L/S units            16       16
No. of SFUs                  4        8
No. of DP FP ALUs            8        4
Note
The modifications done in the GF104 vs the GF100 aim at increasing graphics performance per
SM at the expense of FP64 performance while halving the number of SMs in order to
reduce power consumption and price.
Peak computational performance data for the Fermi GF104 based GTX 460 card
According to the computational capability data [43] and in harmony with the figure on the
previous slide:
Peak FP32 performance of a GTX460 card while it executes FMA warp instructions:
Peak FP64 performance of a GTX 460 card while it executes FMA instructions:
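A worked evaluation (assuming the GTX 460's 7 active SMs and its 1.35 GHz shader clock,
figures inserted here only for illustration):

  Peak FP32 (FMA): 7 SMs x 48 FP32 ALUs x 2 operations x 1.35 GHz ≈ 907 GFLOPS
  Peak FP64 (FMA): 7 SMs x 4 FP64 ALUs x 2 operations x 1.35 GHz ≈ 75.6 GFLOPS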
Key differences between the GF100-based GTX 480 and the GF110-based GTX 580 cards
No. of SMs: 15 (GTX 480) vs 16 (GTX 580)
• Due to its higher shader frequency and increased number of SMs the GF110-based
GTX 580 card achieves ~10 % higher peak performance than the GF100-based GTX 480 card,
at somewhat reduced power consumption.
Tesla cards
Flagship Tesla card: C1060 / C2070
Peak FP64 perf./card: 30 x 1 x 2 x 1296 MHz ≈ 77.8 GFLOPS / 14 x 16 x 2 x 1150 MHz ≈ 515.2 GFLOPS
1 In their GPGPU Fermi cards Nvidia activates only 4 FP64 units out of the available 16.
GPGPUs/DPAs 5.2
Case example 2:
AMD’s Cayman core
Dezső Sima
Aim
Brief introduction and overview.
5.2.3 AMD’s high level data and task parallel programming model
Remarks
Cards
HD 6850 (Barts Pro): 10/2010
HD 6870 (Barts XT): 10/2010
HD 6950 (Cayman Pro): 12/2010
HD 6970 (Cayman XT): 12/2010
HD 6990 (Antilles): 3/2011
Remarks
1) The Barts core (underlying AMD’s HD 68xx cards) is named after Saint Barthélemy island.
2) The Cayman core (underlying AMD’s HD 69xx cards) is named after the Cayman island.
3) Cayman (AMD HD 69xx) was originally planned as a 32 nm device.
But both TSMC and Global Foundries canceled their 32 nm technology efforts (in 11/2009
and 4/2010, respectively) to focus on the 28 nm process, so AMD had to use the 40 nm feature size
for Cayman while eliminating some features already foreseen for that device [88].
• For their earlier GPGPUs, including the Evergreen series (HD 5xxx), AMD made use of the
ATI brand.
• But starting with the Northern Islands series AMD discontinued using the ATI brand
and began to use the AMD brand to emphasize the correlation with their computing
platforms.
• At the same time AMD also renamed the new version (v2.3) of their ATI Stream SDK
to AMD Accelerated Parallel Processing (APP).
Beginning with their Cypress-based HD 5xxx line and SDK v.2.0 AMD left Brook+
and started supporting OpenCL.
This had considerable implications for the microarchitecture of AMD's GPGPUs, for AMD IL and also
for the terminology used in connection with AMD's GPGPUs.
Remark
1) In their Pre-OpenCL and OpenCL publications AMD makes use of partly contradictory
terminology.
In Pre-OpenCL publications (relating to RV700 based HD4xxx cards or before)
AMD interprets the term “stream core” as the individual execution units within
the VLIW ALUs, whereas
in OpenCL terminology the same term designates the complete VLIW5 or VLIW4 ALU.
(SIMD core)
2) AMD designates their RV770 based HD4xxx cards as Terascale Graphics Engines [36]
referring to the fact that the HD4800 card reached a peak FP32 performance of 1 TFLOPS.
3) Beginning with the RV870 based Evergreen line (HD5xxx cards) AMD designated their
GPGPU architecture as the Terascale 2 Architecture referring to the fact that
the peak FP32 performance of the HD 5850/5870 cards surpassed the 2 TFLOPS mark.
Designations of the VLIW ALU and its roughly corresponding Nvidia unit:
• AMD: VLIW4/VLIW5 ALU, called Stream core (in OpenCL SDKs),
  Compute Unit Pipeline (6900 ISA),
  SIMD pipeline / Thread processor / Shader processor (Pre-OpenCL terms)
• Nvidia: ALU (Arithmetic Logic Unit), called Streaming Processor, now CUDA core
For their GPGPU technology AMD makes use of the virtual machine concept, like Nvidia with their
PTX virtual machine.
AMD’s virtual machine is composed of
• the pseudo ISA, called AMD IL and
• its underlying computational model.
Remarks
1) Originally, the IL (Intermediate Language) was based on Microsoft's DX9 shader language [104].
2) About 2008 AMD made a far reaching decision to replace their Brook+ software environment
with the OpenCL environment, as already mentioned in the previous Section.
Figure 5.2.2: AMD's Brook+ based programming environment [90]
Figure 5.2.3: AMD's OpenCL based programming environment [91]
Figure 5.2.4: Introduction of Local and Global Data Share memories (LDSs, GDS) in AMD’s HD
4800 [36]
3) AMD also provides a low level programming interface to their GPGPUs, called the
CAL (Compute Abstraction Layer) programming interface [106], [107].
Figure 5.2.5: AMD’s OpenCL based HLL and the low level CAL programming environment [91]
The CAL programming interface [104]
• is actually a low-level device-driver library that allows direct control of the hardware.
• The set of low-level APIs provided allows programmers to directly open devices,
allocate memory, transfer data and initiate kernel execution, and thus to optimize
performance.
• An integral part of the CAL interface is a JIT compiler for AMD IL.
5.2.2 AMD’s virtual machine concept (6)
CAL compilation
from AMD IL to the device specific ISA
Figure 5.2.6: Kernel compilation from AMD IL to Device-Specific ISA (disassembled) [148]
The AMD IL pseudo ISA and its underlying parallel computational model together constitute
a virtual machine.
From its conception on, this virtual machine, like Nvidia's, has evolved in many
aspects, but due to the lacking documentation of earlier AMD IL versions this evolution
cannot be tracked in detail.
The following brief overview is based on version 2.0e of the AMD Intermediate Language
Specification (Dec. 2010) [105].
The parallel computational model inherent in AMD IL is set up of three key abstractions:
a) The model of execution resources
b) The memory model, and
c) The parallel execution model (parallel machine model) including
c1) The allocation of execution objects to the execution pipelines
c2) The data sharing concept and
c3) The synchronization concept.
These abstractions will be outlined below very briefly and in a simplified form.
The execution resources include a set of SIMT cores, each incorporating a number of ALUs
that are able to perform a set of given computations.
Ideally, the same parallel execution model underlies all main components of the
programming environment, such as
• the real ISA of the GPGPU,
• the pseudo ISA and
• the HLL (like Brook+, OpenCL).
The interpretation of the notion “AMD’s data and task parallel programming model”
A peculiarity of GPGPU technology is that its high-level programming model is associated
with a dedicated high level language (HLL) programming environment, such as
CUDA, Brook+ or OpenCL.
(By contrast, the programming model of traditional CPU technology is associated with an
entire class of HLLs, the imperative languages, like Fortran, C, C++ etc.,
as these languages share the same high-level programming model.)
With their SDK 2.0 (Software Development Kit) AMD changed the supported high-level
programming environment from Brook+ to OpenCL in 2009.
Accordingly, AMD’s high-level data parallel programming model became that of OpenCL.
• So a distinction is needed between AMD’s Pre-OpenCL and OpenCL
programming models.
• The next section discusses the programming model of OpenCL.
Changing AMD’s GPGPU terminology with distinction between Pre-OpenCL and OpenCL
terminology
Along with changing their programming model AMD also changed their related terminology
by distinguishing between Pre-OpenCL and OpenCL terminology, as already discussed
in Section 5.2.1.
For example, in their Pre-OpenCL terminology AMD speaks about threads and thread groups,
whereas in OpenCL terminology these terms are designated as Work items and Work Groups.
For a summary of the terminology changes see [109].
OpenCL includes
• a language (resembling C99) to write kernels, which allows data parallelism to be utilized
by using GPGPUs, and
• APIs to control the platform and program execution.
Main components of AMD’s data and task parallel programming model of OpenCL [109]
The data and task parallel programming model is based on the following abstractions
a) The platform model
b) The memory model of the platform
c) The execution model
c1) Command queues
c2) The kernel concept as a means to utilize data parallelism
c3) The concept of NDRanges-1
c4) The concept of task graphs as a means to utilize task parallelism
c5) The scheme of allocation Work items and Work Groups to execution resources
of the platform model
c6) The data sharing concept
c7) The synchronization concept
An abstract, hierarchical model that allows a unified view of different kinds of processors.
• In this model a Host coordinates execution and data transfers to and from an array
of Compute Devices.
• A Compute Device may be a GPGPU or even a CPU.
• Each Compute Device is composed of an array of Compute Units
(e.g. VLIW cores in case of a GPGPU card),
whereas each Compute Unit incorporates an array of Processing Elements
(e.g. VLIW5 ALUs in case of AMD's GPGPUs).
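A minimal host-side sketch of how this platform model is addressed through the OpenCL API
(an illustration only; error handling is omitted): the host enumerates a platform and a Compute Device,
then creates a context and a command queue for it.

  #include <CL/cl.h>
  #include <stdio.h>

  int main(void)
  {
      cl_platform_id platform;
      cl_device_id   device;
      cl_int         err;

      clGetPlatformIDs(1, &platform, NULL);                            // first available platform
      clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);  // a GPGPU as Compute Device

      cl_context       ctx   = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
      cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err); // queue tied to this device

      printf("context and command queue created (err = %d)\n", err);
      return 0;
  }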
5.2.3 AMD’s high level data and task parallel programming model (7)
Card
SIMD core
• Each Compute Unit is assigned a Local memory that is typically implemented on-chip,
providing lower latency and higher bandwidth than the Global Memory.
• There is also an on-chip Constant memory available to all Compute Units, which
allows read-only parameters to be reused during computations.
• Finally, a Private Memory space, typically a small register space, is allocated to each
Processing Element, e.g. to each VLIW5 ALU of an AMD GPGPU.
The memory model including assigned Work items and Work Groups (Based on [94])
• Command queues coordinate data movements between the host and the Compute Devices
(e.g. GPGPU cards) and launch kernels.
• An OpenCL command queue is created by the developer and is associated with a specific
Compute Device.
• To target multiple OpenCL Compute Devices simultaneously the developer needs to
create multiple command queues.
• Command queues allow dependencies between tasks to be specified (in the form of a task graph),
ensuring that tasks will be executed in the specified order.
• The OpenCL runtime will execute tasks in parallel if their dependencies are
satisfied and the platform is capable of doing so.
• In this way command queues, as conceived, allow a task parallel execution
model to be implemented.
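As a sketch of how such a dependency can be expressed (a fragment continuing the host-side setup shown
earlier; the objects queue, buf, kernel, bytes and host_ptr are assumed to have been created already):
the event returned by one enqueued command is passed in the wait list of the next one, so the kernel
starts only after the preceding write has completed.

  cl_event write_done;

  // Task 1: copy input data to the device, returning an event
  clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, bytes, host_ptr, 0, NULL, &write_done);

  // Task 2: launch the kernel only after Task 1 has finished (one-element wait list)
  size_t global = 1024;
  clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 1, &write_done, NULL);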
Aims
• either to parallelize an application across multiple compute devices (SIMD cores),
• or to run multiple completely independent streams of computation (kernels) across
multiple compute devices (SIMD cores).
The latter possibility is available only on the Cayman core.
• Kernels are high level language constructs that allow data parallelism to be expressed and
utilized for speeding up computations by means of a GPGPU.
• Kernels are written in a C99-like language.
• OpenCL kernels are executed over an index space, which can be 1, 2 or 3 dimensional.
• The index space is designated as the NDRange (N-dimensional Range).
The subsequent Figure shows an example of a 2-dimensional index space, which has
Gx x Gy elements.
• Each Work item in the Work Group is assigned a Work Group id, labeled as wx, wy
in the Figure, as well as a local id, labeled as Sx, Sy in the Figure.
• Each Work item also gets a global id, which can be derived from its Work Group and
local ids.
5.2.3 AMD’s high level data and task parallel programming model (17)
Example for specification of the global work size and explicit specification of Work Groups [96]
Example for a simple kernel written for adding two vectors [144]
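A minimal sketch in the spirit of these two examples (assumed names vec_add, queue, kernel;
the global work size of 1024 and the Work Group size of 64 are chosen only for illustration):

  // Kernel (C99-like OpenCL C): each Work item adds one element pair
  __kernel void vec_add(__global const float *A,
                        __global const float *B,
                        __global float       *C)
  {
      int gid = get_global_id(0);     // global id within the 1-dimensional NDRange
      C[gid] = A[gid] + B[gid];
  }

  // Host side: global work size of 1024 Work items, explicit Work Groups of 64 Work items
  size_t global_size = 1024;
  size_t local_size  = 64;
  clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);

Choosing the Work Group size as a multiple of the wavefront size (64 on AMD's GPGPUs) keeps the SIMD
cores fully utilized.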
Remarks
1) Compilation of the kernel code delivers both the GPU code and the CPU code.
Remarks (cont)
Clauses
• Groups of instructions of the same clause type that will be executed without preemption,
like ALU instructions, texture instructions etc.
Example [100]
The subsequent example relates to a previous generation device (HD 58xx) that
includes 5 Execution Units, designated as the x, y, z, w and t unit.
By contrast, the discussed HD 69xx devices (belonging to the Cayman family) provide
only 4 Execution Units (x, y, z, w) per VLIW ALU.
• In the task graph arrows indicate dependencies. E.g. the kernel A is allowed to be executed
only after Write A and Write B have finished etc.
• The OpenCL runtime has the freedom to execute tasks given in the task graph in parallel
as long as the specified dependencies are fulfilled.
5.2.3 AMD’s high level data and task parallel programming model (23)
c5) The scheme of allocation Work items and Work Groups to execution resources
of the platform model
Work items
Work Groups
E.g. the Cayman core has 32 KB Local (Data Share) memories per SIMD core and a Global Data Share of 64 KB [99].
5.2.3 AMD’s high level data and task parallel programming model (25)
Synchronization of Work items within a Work Group
Synchronization of Work items being in different Work Groups
Task graphs
Discussed already in connection with parallel task execution.
Barrier synchronization
• Allows synchronization of Work items within a Work Group.
• Each Work item in a Work Group must first execute the barrier before execution is allowed
to proceed (see the SIMT execution model, discussed in Section 2).
Memory fences
synchronize memory operations (load/store sequences).
Atomic memory transactions
• Allow synchronization of Work items being in the same or in different Work Groups.
• Work items may e.g. append variable numbers of results to a shared queue
in global memory to coordinate execution of Work items in different Work Groups.
• Atomic memory transactions are OpenCL extensions supported by some OpenCL runtimes,
such as the ATI Stream SDK OpenCL runtime for x86 processors.
• The use of atomic memory transactions needs special care to avoid deadlocks and allow
scalability.
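A small kernel fragment sketching both mechanisms (assumed names; atomic_inc is an OpenCL 1.1 built-in,
available in OpenCL 1.0 only as an extension):

  __kernel void count_groups(__global const int *data,
                             __global int       *global_counter,
                             __local  int       *scratch)
  {
      int lid = get_local_id(0);

      scratch[lid] = data[get_global_id(0)];
      barrier(CLK_LOCAL_MEM_FENCE);     // every Work item of the Work Group reaches this point
                                        // before any of them proceeds (barrier synchronization)
      if (lid == 0)
          atomic_inc(global_counter);   // atomic memory transaction, visible also to Work items
                                        // in other Work Groups
  }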
Simplified block diagram of the Cayman core (used in the HD 69xx series) [97]
Comparing the block diagrams of the Cypress (HD 58xx) and the Cayman (HD 69xx)
cores [97]
Comparing the block diagrams of the Cayman (HD 69xx) and the Fermi cores [97]
GF110
Note
Fermi has read/write L1/L2 caches like CPUs.
16 VLIW4 ALUs/core
4 EUs/VLIW4 ALU
1536 EUs
5.2.4 Simplified block diagram of the Cayman core (5)
Simplified block diagram of the Cayman core (that underlies the HD 69xx series) [99]
The developer writes a set of commands that control the execution of the GPGPU program.
These commands
• configure the HD 69xx device (not detailed here),
• specify the data domain on which the HD 69xx device has to operate,
• command the HD 69xx device to copy programs and data between system memory and
device memory,
• cause the HD 69xx device to begin the execution of a GPGPU program (OpenCL program).
The host writes the commands to the memory mapped HD 69xx registers in the
system memory space.
5.2.5 Principle of operation of the Command Processor (2)
The Command Processor reads the commands that the host has written to the
memory mapped HD 69xx registers in the system-memory address space,
copies them into the Device Memory of the HD 6900 and launches their execution.
5.2.5 Principle of operation of the Command Processor (3)
The Data Parallel Processor Array (DPP) in the Cayman core [99]
Remark
The SIMD cores are also designated as
• Data Parallel Processors (DPP) or
• Compute Units.
SIMD core
Remark
Both the Cypress-based HD 5870 and the Cayman-based HD 6970 have the same basic structure
FP capability:
• 4xFP32 FMA
• 1XFP64 FMA
per clock.
• The VLIW4 ALUs are pipelined, having a throughput of 1 instruction per shader cycle for the
basic operations, i.e. they accept a new instruction every new cycle for the basic operations.
• ALU operations have a latency of 8 cycles, i.e. they require 8 cycles to be performed.
• The first 4 cycles are needed to read the operands from the register file,
one quarter wavefront at a time.
• The next four cycles are needed to execute the requested operation.
Contrasting AMD’s VLIW4 issue in the Cayman core (HD 69xx) with Nvidia’s scalar issue
(Based on [88])
AMD/ATI: VLIW issue; static dependency resolution, performed by the compiler.
Nvidia: scalar issue; dynamic dependency resolution, performed by the scoreboarded warp scheduler.
PFP32 = n x 16 x 4 x 1 x 2 x fs
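Evaluating this formula for the Cayman-based HD 6970 (assuming its 24 SIMD cores and 880 MHz engine
clock, figures inserted here only for illustration):

  PFP32 = 24 x 16 x 4 x 1 x 2 x 0.88 GHz ≈ 2.70 TFLOPS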
(= SIMD core)
Private memory: 16 K GPRs/SIMD core
4 x 32-bit/GPR
Main features of
the memory spaces
available in the
Cayman core [99]
Compliance of the Cayman core (HD 6900) memory hierarchy with the OpenCL
memory model
(= SIMD core)
5.2.7 The memory architecture (5)
(EU)
Example
A SIMD core provides 16 K GPRs of private memory and a wavefront consists of 64 work items.
If a kernel needs 30 GPRs (30 x 4 x 32-bit registers) per work item, a wavefront occupies
64 x 30 = 1920 GPRs, so 16384 / 1920 ≈ 8, i.e. up to 8 active wavefronts can run on
each SIMD core.
Comparing the Cayman’s (HD 69xx) and Fermi GF110’s (GTX 580) cache system [97]
GF110
Notes
1) Fermi has read/write L1/L2 caches like CPUs, whereas the Cayman core has read-only caches.
2) Fermi has optionally 16 KB L1 cache and 48 KB shared buffer or vice versa per SIMD core.
3) Fermi has larger L1/L2 caches.
(Figure: Work Groups of two kernels (Kernel 1: NDRange1, Kernel 2: NDRange2), each with its
global size, being allocated to the DPP Array.)
• The Ultra-Threaded Dispatch Processor allocates each Work Group of a kernel for execution
to a particular SIMD core.
• Up to 8 Work Groups of the same kernel can share the same SIMD core, if there are
enough resources to fit them in.
• Multiple kernels can run in parallel on different SIMD cores, in contrast to previous designs,
where only a single kernel was allowed to be active and its Work Groups were spread
over the available SIMD cores.
• Hardware barrier support is provided for up to 8 Work Groups per SIMD core.
• Work items (threads) of a Work Group may communicate with each other through a
local shared memory (LDS) provided by the SIMD core.
This feature has been available since the RV770-based HD 4xxx GPGPUs.
Remark
Barriers synchronize Work item (thread) execution, whereas memory fences
synchronize memory operations (load/store sequences).
SIMD core
Remark
Cayman’s SIMD cores have the same structure as the HD 5870
Wavefront size
SIMD core
• In Cayman each SIMD core has 16 VLIW4 (in earlier models VLIW5) ALUs, similarly to
the SIMD cores of the HD 5870, shown in the Figure [100].
• Both VLIW4 and VLIW5 ALUs provide 4 identical Execution Units, so
each group of 4 Work items, collectively called a quad, is processed on the same VLIW ALU.
The wavefront is composed of quads;
the number of quads is identical to the number of VLIW ALUs.
Building wavefronts
• The Ultra-Threaded Dispatch Processor segments Work Groups into wavefronts
and schedules them for execution on a single SIMD core.
• This segmentation is also called rasterization.
Example: Segmentation of a 16 x 16 sized Work Group into wavefronts of the size 8x8
and mapping them to SIMD cores [92]
Work Group
• AMD's GPGPUs have dual Ultra-Threaded Dispatch Processors, each responsible for
one of the two available issue slots.
• The main task of the Ultra-Threaded Dispatch Processors is to assign Work Groups of the
currently running kernels to the SIMD cores and to schedule wavefronts for execution.
• Each Ultra-Threaded Dispatch Processor has a dispatch pool of 248 wavefronts.
• Each Ultra-Threaded Dispatch Processor selects two wavefronts for execution for each
SIMD core and dispatches them to the SIMD cores.
• The two selected wavefronts are executed in an interleaved manner.
Figure 5.2.20: Simplified execution of Work items on a SIMD core with hiding memory stalls [93]
5.2.8 The Ultra-Threaded Dispatch Processor (15)
• When data requested from memory arrive (that is T0 becomes ready) the scheduler
will select T0 for execution again.
• If there are enough work-items ready for execution memory stall times can be hidden,
as shown in the Figure below.
Figure 5.2.21: Simplified execution of Work items on a SIMD core with hiding memory stalls [93]
5.2.8 The Ultra-Threaded Dispatch Processor (16)
Figure 5.2.22: Simplified execution of Work items on a SIMD core without hiding memory stalls [93]
5.2.9 Evolution of key features of AMD's GPGPU microarchitectures
Presumably in anticipation of this move AMD modified their IL, enhanced the
microarchitecture of their GPGPU devices with LDS and GDS (as discussed next),
and also changed their terminology, as given in Table xx.
The memory concept of Brook+ was decisive to the memory architecture of AMD’s first
R600-based GPGPUs, as shown next.
is part of the
System memory
that is accessible
by both the Host
and the GPGPU
The memory architecture of the R600 reflects the memory concept of Brook+.
5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (6)
By contrast, the memory model of OpenCL includes also Local and Global memory spaces
in order to allow data sharing among Work items running on the same SIMD core or even
on the GPGPU.
Figure 5.2.25: Introduction of LDSs and a GDS in RV770-based HD 4xxx GPGPUs [36]
Remark: AMD designates the RV770 core internally as the R700. (DPP: Data Parallel Processor)
5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (12)
The GDS memory became visible only in the ISA documentation of the Evergreen (HD 5xxx )
[112] and the Northern Island (HD 6xxx) families of GPGPUs [99].
Figure 5.2.28: Basic architecture of Cayman (that underlies both the HD 6950 and 6970 GPGPUs) [99]
5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (13)
LDS/SIMD core: – / – / 16 KB / 32 KB / 32 KB
GDS/GPGPU:   – / – / 16 KB / 64 KB / 64 KB
(presumably for the R600-, RV670-, RV770-, Evergreen- and Northern Islands-based families, respectively)
(Figure: domain of execution — Work Groups of Kernel 1 (NDRange1) and Kernel 2 (NDRange2),
each with its global size, segmented into wavefronts.)
The Ultra-Threaded Dispatch Processor
• allocates the Work Groups for execution to the SIMD cores,
• and segments the Work Groups into wavefronts.
After segmentation the Ultra-Threaded Dispatch Processor schedules the wavefronts
for execution on the SIMD cores.
5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (20)
e) Introducing FP64 capability and replacing VLIW5 ALUs with VLIW4 ALUs
Remark
Reasons for replacing VLIW5 ALUs with VLIW4 ALUs [97]
AMD/ATI chose the VLIW5 ALU design in connection with DX9, as it allowed a
4-component dot product (e.g. w, x, y, z) and a scalar component (e.g. lighting) to be calculated in parallel.
But in gaming applications with DX10/11 shaders the average slot utilization became only 3.4,
i.e. on average the 5th EU remains unused.
With Cayman AMD redesigned their ALU by
• removing the T-unit and
• enhancing 3 of the new EUs such that these units together became capable of
performing 1 transcendental operation per cycle as well as
• enhancing all 4 EUs to perform together an FP64 operation per cycle.
The new design can compute
• 4 FX32 or 4 FP32 operations or
• 1 FP64 operation or
• 1 transcendental + 1 FX32 or 1FP32 operation
per cycle, whereas
the previous design was able to calculate
• 5 FX32 or 5 FP32 operations or
• 1 FP64 operation or
• 1 transcendental + 4 FX/FP operation
per cycle.
5.2.9 Evolution of key features of AMD’s GPGPU microarchitectures (22)
Dezső Sima
Aim
Brief introduction and overview
General remark
Integrated CPU/GPU designs are not yet advanced enough to be employed as GPGPUs,
but they mark the direction of the evolution.
For this reason their discussion is included in the description of GPGPUs.
6. Integrated CPUs/GPUs
7. References
Remarks
• In early PCs, displays were connected to the system bus (first to the ISA then to the PCI bus)
via graphic cards.
• The spreading of the internet and of multimedia apps at the end of the 1990's called for enhanced
graphics support from the processors.
This led to the emergence of the 3rd generation superscalars, which already provided multimedia and
graphics support (by means of SIMD ISA enhancements).
• More demanding graphics processing, however, drove the evolution of the system
architecture away from the bus-based one to the hub-based one at the end of the 1990's.
Along with the appearance of the hub-based system architecture the graphics
controller (if provided) typically became integrated into the north bridge (MCH/GMCH),
as shown below.
(Figure panels: PCI architecture (processor, ISA and PCI buses, display) vs. hub architecture.)
Figure 6.1: Emergence of hub based system architectures at the end of the 1990’s
6.1 Introduction to integrated CPUs/GPUs (3)
Example
Integrated graphics controllers appeared
in Intel chipsets first in the 810 chipset
in 1999 [113].
Note
The 810 chipset does not provide an AGP
connection.
6.1 Introduction to integrated CPUs/GPUs (4)
Subsequent chipsets (e.g. those developed for P4 processors around 2000) then provided
both an integrated graphics controller intended for connecting the display and
a Host-to-AGP Bridge catering for an AGP-bus output, in order to achieve high quality
graphics for gaming apps by using a graphics card.
Display
AGP
bus
Figure 6.2: Conceptual block diagram of the north bridge of Intel’s 845G chipset [129]
Fusion line
Line of processors with on-die integrated CPU and GPU units, designated as APUs
(Accelerated Processing Units)
• Introduced in connection with the AMD/ATI merger in 10/2006.
• Originally planned to ship late 2008 or early 2009
• Actually shipped
• 11/2010: for OEMs
• 01/2011: for retail
UNB: Unbuffered North Bridge
UVD: Universal Video Decoder
SB: South Bridge
Source: [115]
Source: [115]
Source: [116]
Source: [115]
Table 6.2: Main features of AMD’s mainstream Zacate Fusion APU line [131]
Table 6.3: Main features of AMD’s low power Ontario Fusion APU line [131]
OpenCL programming support for both the Ontario and the Zacate lines
AMD's APP SDK 2.3 (1/2011) (formerly ATI Stream SDK) provides OpenCL support for
both lines.
Source: [115]
6.2 The AMD Fusion APU line (18)
In-breadth support
of dual issue
TSMC 40nm
(Taiwan Semiconductor Manufacturing Company)
~ 400 mtrs
6.2 The AMD Fusion APU line (25)
AMD HD 5570: 400 stream processors, 650 MHz, DDR3/GDDR5 memory at 1800/4000 MHz, 39 W
Floor plan of the CPU core (revamped Star core) of the Llano APU [133]
In 2011
Bulldozer CPU cores will be used as the basis for their desktop and server CPU processors,
In 2012
next generation Bulldozer CPUs are planned to be used both
Remark
No plans have been revealed to continue developing the Llano APU.
Bulldozer’s basic modules each consisting of two tightly coupled cores -1[135]
Bulldozer’s basic modules each consisting of two tightly coupled cores – 2 [136]
Bulldozer’s basic modules each consisting of two tightly coupled cores – 3 [135]
Similarity of the microarchitectures of Nvidia’s Fermi and AMD’s Bulldozer CPU core
Both are using tightly coupled dual pipelines with shared and dedicated units.
Floor plan of
Bulldozer [128]
6.2 The AMD Fusion APU line (39)
Arrandale: i3 3xx, i5 4xx/5xx, i7 6xx
Clarkdale: i3 5xx, i5 6xx
CPU/GPU components
CPU: Westmere architecture (32 nm)
(Enhanced 32 nm shrink of the
45 nm Nehalem architecture)
GPU: (45 nm)
Shader model 4, DX10 support
32 nm CPU
(Mobile implementation of the Westmere
basic architecture,
which is the 32 nm shrink of the
45 nm Nehalem basic architecture) 45 nm GPU
Intel’s GMA HD (Graphics Media Accelerator)
(12 Execution Units, Shader model 4, no OpenCL support)
6.3 Intel’s in-package integrated CPU/GPU lines (8)
6.3 Intel’s in-package integrated CPU/GPU lines (9)
Figure 6.3: The Clarkdale processor with in-package integrated graphics along with the H57 chipset
[140]
In Jan. 2011 Intel replaced their in-package integrated CPU/GPU lines with the on-die integrated
Sandy Bridge line.
DDR3:              1333 MHz / 1333 MHz / 1333 MHz / 1333 MHz / 1333 MHz
L3 Cache:          6 MB / 6 MB / 6 MB / 8 MB / 8 MB
Intel HD Graphics: 2000 / 2000 / 3000 / 2000 / 3000
GPU max. freq.:    1100 MHz / 1100 MHz / 1100 MHz / 1350 MHz / 1350 MHz
Hyper-Threading:   No / No / No / Yes / Yes
Socket:            LGA 1155 (all models)
(Figure labels: Sandy Bridge (i5/i7 2xxx) features — Hyperthreading, 32K L1D (3 clk), AES instructions,
AVX 256 bit, 4 operands, VMX Unrestricted, ~20 mm2/core; contrasted with the HD 5570 (400 ALUs)
and Arrandale (i5 6xx).)
Dezső Sima
[4]: NVIDIA Tesla D870 Deskside GPU Computing System, System Specification, Jan. 2008,
Nvidia, http://www.nvidia.com/docs/IO/43395/D870-SystemSpec-SP-03718-001_v01.pdf
[5]: Tesla S870 GPU Computing System, Specification, Nvidia, March 13 2008,
http://jp.nvidia.com/docs/IO/43395/S870-BoardSpec_SP-03685-001_v00b.pdf
[7]: R600-Family Instruction Set Architecture, Revision 0.31, May 2007, AMD
[8]: Zheng B., Gladding D., Villmow M., Building a High Level Language Compiler for GPGPU,
ASPLOS 2006, June 2008
[9]: Huddy R., ATI Radeon HD2000 Series Technology Overview, AMD Technology Day, 2007
http://ati.amd.com/developer/techpapers.html
[11]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 2.0,
June 2008, Nvidia,
http://developer.download.nvidia.com/compute/cuda/2_0/docs/NVIDIA_CUDA_Programming
Guide_2.0.pdf
[12]: Kirk D. & Hwu W. W., ECE498AL Lectures 7: Threading Hardware in G80, 2007,
University of Illinois, Urbana-Champaign, http://courses.ece.uiuc.edu/ece498/al1/
ectures/lecture7-threading%20hardware.ppt
[13]: Kogo H., R600 (Radeon HD2900 XT), PC Watch, June 26 2008,
http://pc.watch.impress.co.jp/docs/2008/0626/kaigai_3.pdf
[16]: Goto H., NVIDIA GT200 and AMD RV770, PC Watch, July 2 2008,
http://pc.watch.impress.co.jp/docs/2008/0702/kaigai451.htm
[17]: Shrout R., Nvidia GT200 Revealed – GeForce GTX 280 and GTX 260 Review,
PC Perspective, June 16 2008,
http://www.pcper.com/article.php?aid=577&type=expert&pid=3
[18]: http://en.wikipedia.org/wiki/DirectX
[20]: Microsoft DirectX 10: The Next-Generation Graphics API, Technical Brief, Nov. 2006,
Nvidia, http://www.nvidia.com/page/8800_tech_briefs.html
[21]: Patidar S. & al., “Exploiting the Shader Model 4.0 Architecture, Center for
Visual Information Technology, IIIT Hyderabad, March 2007,
http://research.iiit.ac.in/~shiben/docs/SM4_Skp-Shiben-Jag-PJN_draft.pdf
[22]: Nvidia GeForce 8800 GPU Architecture Overview, Vers. 0.1, Nov. 2006, Nvidia,
http://www.nvidia.com/page/8800_tech_briefs.html
[23]: Goto H., Graphics Pipeline Rendering History, Aug. 22 2008, PC Watch,
http://pc.watch.impress.co.jp/docs/2008/0822/kaigai_06.pdf
[24]: Fatahalian K., “From Shader Code to a Teraflop: How Shader Cores Work,”
Workshop: Beyond Programmable Shading: Fundamentals, SIGGRAPH 2008,
[25]: Kanter D., “NVIDIA’s GT200: Inside a Parallel Processor,” Real World Technologies,
Sept. 8 2008, http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242
[27]: Seiler L. & al., “Larrabee: A Many-Core x86 Architecture for Visual Computing,”
ACM Transactions on Graphics, Vol. 27, No. 3, Article No. 18, Aug. 2008
[29]: Shrout R., IDF Fall 2007 Keynote, PC Perspective, Sept. 18, 2007,
http://www.pcper.com/article.php?aid=453
[30]: Stokes J., Larrabee: Intel’s biggest leap ahead since the Pentium Pro,” Ars Technica,
Aug. 04. 2008, http://arstechnica.com/news.ars/post/20080804-larrabee-intels-biggest-leap-
ahead-since-the-pentium-pro.html
[32]: Hester P., “Multi_Core and Beyond: Evolving the x86 Architecture,” Hot Chips 19,
Aug. 2007, http://www.hotchips.org/hc19/docs/keynote2.pdf
[33]: AMD Stream Computing, User Guide, Oct. 2008, Rev. 1.2.1
http://ati.amd.com/technology/streamcomputing/Stream_Computing_User_Guide.pdf
[34]: Doggett M., Radeon HD 2900, Graphics Hardware Conf. Aug. 2007,
http://www.graphicshardware.org/previous/www_2007/presentations/doggett-
radeon2900-gh07.pdf
[35]: Mantor M., “AMD’s Radeon Hd 2900,” Hot Chips 19, Aug. 2007,
http://www.hotchips.org/archives/hc19/2_Mon/HC19.03/HC19.03.01.pdf
[36]: Houston M., "Anatomy of AMD's TeraScale Graphics Engine," SIGGRAPH 2008,
http://s08.idav.ucdavis.edu/houston-amd-terascale.pdf
[37]: Mantor M., “Entering the Golden Age of Heterogeneous Computing,” PEEP 2008,
http://ati.amd.com/technology/streamcomputing/IUCAA_Pune_PEEP_2008.pdf
[39]: Kanter D., Inside Fermi: Nvidia's HPC Push, Real World Technologies Sept 30 2009,
http://www.realworldtech.com/includes/templates/articles.cfm?
ArticleID=RWT093009110932&mode=print
[40]: Wasson S., Inside Fermi: Nvidia's 'Fermi' GPU architecture revealed,
Tech Report, Sept 30 2009, http://techreport.com/articles.x/17670/1
[44]: Hwu W., Kirk D., Nvidia, Advanced Algorithmic Techniques for GPUs, Berkeley,
January 24-25 2011
http://iccs.lbl.gov/assets/docs/2011-01-24/lecture1_computational_thinking_
Berkeley_2011.pdf
[45]: Wasson S., Nvidia's GeForce GTX 580 graphics processor
Tech Report, Nov 9 2010, http://techreport.com/articles.x/19934/1
[46]: Shrout R., Nvidia GeForce 8800 GTX Review – DX10 and Unified Architecture,
PC Perspective, Nov 8 2006
http://swfan.com/reviews/graphics-cards/nvidia-geforce-8800-gtx-review-dx10-
and-unified-architecture/g80-architecture
[47]: Wasson S., Nvidia's GeForce GTX 480 and 470 graphics processors
Tech Report, March 31 2010, http://techreport.com/articles.x/18682
[48]: Gangar K., Tianhe-1A from China is world’s fastest Supercomputer
Tech Ticker, Oct 28 2010, http://techtickerblog.com/2010/10/28/tianhe-1a-
from-china-is-worlds-fastest-supercomputer/
[49]: Smalley T., ATI Radeon HD 5870 Architecture Analysis, Bit-tech, Sept 30 2009,
http://www.bit-tech.net/hardware/graphics/2009/09/30/ati-radeon-hd-5870-
architecture-analysis/8
[50]: Nvidia Compute PTX: Parallel Thread Execution, ISA, Version 1.0, June 2007,
https://www.doc.ic.ac.uk/~wl/teachlocal/arch2/papers/nvidia-PTX_ISA_1.0.pdf
[51]: Kanter D., Intel's Sandy Bridge Microarchitecture, Real World Technologies,
Sept 25 2010 http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=4
[52]: Nvidia CUDATM FermiTM Compatibility Guide for CUDA Applications, Version 1.0,
February 2010, http://developer.download.nvidia.com/compute/cuda/3_0/
docs/NVIDIA_FermiCompatibilityGuide.pdf
[53]: Hallock R., Dissecting Fermi, NVIDIA’s next generation GPU, Icrontic, Sept 30 2009,
http://tech.icrontic.com/articles/nvidia_fermi_dissected/
[54]: Kirsch N., NVIDIA GF100 Fermi Architecture and Performance Preview,
Legit Reviews, Jan 20 2010, http://www.legitreviews.com/article/1193/2/
[55]: Hoenig M., NVIDIA GeForce GTX 460 SE 1GB Review, Hardware Canucks, Nov 21 2010,
http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/38178-
nvidia-geforce-gtx-460-se-1gb-review-2.html
[56]: Glaskowsky P. N., Nvidia’s Fermi: The First Complete GPU Computing Architecture
Sept 2009, http://www.nvidia.com/content/PDF/fermi_white_papers/
P.Glaskowsky_NVIDIA's_Fermi-The_First_Complete_GPU_Architecture.pdf
[57]: Kirk D. & Hwu W. W., ECE498AL Lectures 4: CUDA Threads – Part 2, 2007-2009,
University of Illinois, Urbana-Champaign, http://courses.engr.illinois.edu/ece498/
al/lectures/lecture4%20cuda%20threads%20part2%20spring%202009.ppt
[58]: Nvidia’s Next Generation CUDATM Compute Architecture: FermiTM, Version 1.1, 2009
http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_
Architecture_Whitepaper.pdf
[59]: Kirk D. & Hwu W. W., ECE498AL Lectures 8: Threading Hardware in G80, 2007-2009,
University of Illinois, Urbana-Champaign, http://courses.engr.illinois.edu/ece498/
al/lectures/lecture8-threading-hardware-spring-2009.ppt
[60]: Wong H., Papadopoulou M.M., Sadooghi-Alvandi M., Moshovos A., Demystifying GPU
Microarchitecture through Microbenchmarking, University of Toronto, 2010,
http://www.eecg.toronto.edu/~myrto/gpuarch-ispass2010.pdf
[61]: Pettersson J., Wainwright I., Radar Signal Processing with Graphics Processors
(GPUs), SAAB Technologies, Jan 27 2010,
http://www.hpcsweden.se/files/RadarSignalProcessingwithGraphicsProcessors.pdf
[62]: Smith R., NVIDIA’s GeForce GTX 460: The $200 King, AnandTech, July 11 2010,
http://www.anandtech.com/show/3809/nvidias-geforce-gtx-460-the-200-king/2
[63]: Angelini C., GeForce GTX 580 And GF110: The Way Nvidia Meant It To Be Played,
Tom’s Hardware, Nov 9 2010, http://www.tomshardware.com/reviews/geforce-
gtx-580-gf110-geforce-gtx-480,2781.html
[64]: NVIDIA G80: Architecture and GPU Analysis, Beyond3D, Nov. 8 2006,
http://www.beyond3d.com/content/reviews/1/11
[65]: D. Kirk and W. Hwu, Programming Massively Parallel Processors, 2008
Chapter 3: CUDA Threads, http://courses.engr.illinois.edu/ece498/al/textbook/
Chapter3-CudaThreadingModel.pdf
[72]: New NVIDIA Tesla GPUs Reduce Cost Of Supercomputing By A Factor Of 10,
Nvidia, Nov. 16 2009
http://www.nvidia.com/object/io_1258360868914.html
[73]: Nvidia Tesla, Wikipedia, http://en.wikipedia.org/wiki/Nvidia_Tesla
[74]: Tesla M2050 and Tesla M2070/M2070Q Dual-Slot Computing Processor Modules,
Board Specification, v. 03, Nvidia, Aug. 2010,
http://www.nvidia.asia/docs/IO/43395/BD-05238-001_v03.pdf
[75]: Tesla 1U GPU Computing System, Product Specification, v. 04, Nvidia, June 2009,
http://www.nvidia.com/docs/IO/43395/SP-04975-001-v04.pdf
[76]: Kanter D., The Case for ECC Memory in Nvidia's Next GPU, Real World Technologies,
Aug. 19 2009,
http://www.realworldtech.com/page.cfm?ArticleID=RWT081909212132
[77]: Hoenig M., Nvidia GeForce 580 Review, HardwareCanucks, Nov. 8, 2010,
http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/37789-nvidia-
geforce-gtx-580-review-5.html
[78]: Angelini C., AMD Radeon HD 6990 4 GB Review, Tom’s Hardware, March 8, 2011,
http://www.tomshardware.com/reviews/radeon-hd-6990-antilles-crossfire,2878.html
[79]: Tom’s Hardware Gallery,
http://www.tomshardware.com/gallery/two-cypress-gpus,0101-230369-
7179-0-0-0-jpg-.html
[80]: Tom’s Hardware Gallery,
http://www.tomshardware.com/gallery/Bare-Radeon-HD-5970,0101-230349-
7179-0-0-0-jpg-.html
[81]: CUDA, Wikipedia, http://en.wikipedia.org/wiki/CUDA
[82]: GeForce Graphics Processors, Nvidia, http://www.nvidia.com/object/geforce_family.html
[83]: Next Gen CUDA GPU Architecture, Code-Named “Fermi”, Press Presentation at
Nvidia’s 2009 GPU Technology Conference, (GTC), Sept. 30 2009,
http://www.nvidia.com/object/gpu_tech_conf_press_room.html
[88]: Smith R., AMD's Radeon HD 6970 & Radeon HD 6950: Paving The Future For AMD,
AnandTech, Dec. 15 2010,
http://www.anandtech.com/show/4061/amds-radeon-hd-6970-radeon-hd-6950
[89] Christian, AMD renames ATI Stream SDK, updates its with APU, OpenCL 1.1 support,
Jan. 27 2011, http://www.tcmagazine.com/tcm/news/software/34765/
amd-renames-ati-stream-sdk-updates-its-apu-opencl-11-support
[90]: User Guide: AMD Stream Computing, Revision 1.3.0, Dec. 2008,
http://www.ele.uri.edu/courses/ele408/StreamGPU.pdf
[91]: ATI Stream Computing Compute Abstraction Layer (CAL) Programming Guide,
Revision 2.01, AMD, March 2010, http://developer.amd.com/gpu_assets/ATI_Stream_
SDK_CAL_Programming_Guide_v2.0.pdf
http://developer.amd.com/gpu/amdappsdk/assets/AMD_CAL_Programming_Guide_v2.0.pdf
[92]: Technical Overview: AMD Stream Computing, Revision 1.2.1, Oct. 2008,
http://www.cct.lsu.edu/~scheinin/Parallel/StreamComputingOverview.pdf
[93]: AMD Accelerated Parallel Processing OpenCL Programming Guide, Rev. 1.2,
AMD, Jan. 2011, http://developer.amd.com/gpu/amdappsdk/assets/
AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
[97]: Kanter D., AMD's Cayman GPU Architecture, Real World Technologies, Dec. 14 2010,
http://www.realworldtech.com/page.cfm?ArticleID=RWT121410213827&p=3
[98]: Hoenig M., AMD Radeon HD 6970 and HD 6950 Review, Hardware Canucks,
Dec. 14 2010, http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/
38899-amd-radeon-hd-6970-hd-6950-review-3.html
[99]: Reference Guide: AMD HD 6900 Series Instruction Set Architecture, Revision 1.0,
Febr. 2011, http://developer.amd.com/gpu/AMDAPPSDK/assets/
AMD_HD_6900_Series_Instruction_Set_Architecture.pdf
[100]:Howes L., AMD and OpenCL, AMD Application Engineering, Dec. 2010,
http://www.many-core.group.cam.ac.uk/ukgpucc2/talks/Howes.pdf
[101]: ATI R700-Family Instruction Set Architecture Reference Guide, Revision 1.0a,
AMD, Febr. 2011, http://developer.amd.com/gpu_assets/R700-Family_Instruction_
Set_Architecture.pdf
[102]: Piazza T., Dr. Jiang H., Microarchitecture Codename Sandy Bridge: Processor
Graphics, Presentation ARCS002, IDF San Francisco, Sept. 2010
[103]: Bhaniramka P., Introduction to Compute Abstraction Layer (CAL),
http://coachk.cs.ucf.edu/courses/CDA6938/AMD_course/M5%20-
%20Introduction%20to%20CAL.pdf
[104]: Villmow M., ATI Stream Computing, ATI Intermediate Language (IL),
May 30 2008, http://developer.amd.com/gpu/amdappsdk/assets/ATI%20Stream
%20Computing%20-%20ATI%20Intermediate%20Language.ppt#547,9
[105]: AMD Accelerated Parallel Processing Technology,
AMD Intermediate Language (IL), Reference Guide, Revision 2.0e, March 2011,
http://developer.amd.com/gpu/AMDAPPSDK/assets/AMD_Intermediate_Language
_(IL)_Specification_v2.pdf
[106]: Hensley J., Hardware and Compute Abstraction Layers for Accelerated Computing
Using Graphics Hardware and Conventional CPUs, AMD, 2007,
http://www.ll.mit.edu/HPEC/agendas/proc07/Day3/10_Hensley_Abstract.pdf
[107]: Hensley J., Yang J., Compute Abstraction Layer, AMD, Febr. 1 2008,
http://coachk.cs.ucf.edu/courses/CDA6938/s08/UCF-2008-02-01a.pdf
[108]: AMD Accelerated Parallel Processing (APP) SDK, AMD Developer Central,
http://developer.amd.com/gpu/amdappsdk/pages/default.aspx
[109]: OpenCL™ and the AMD APP SDK v2.4, AMD Developer Central, April 6 2011,
http://developer.amd.com/documentation/articles/pages/OpenCL-and-the-AMD-APP-
SDK.aspx
[110]: Stone J., An Introduction to OpenCL, U. of Illinois at Urbana-Champaign, Dec. 2009,
http://www.ks.uiuc.edu/Research/gpu/gpucomputing.net
[111]: Introduction to OpenCL Programming, AMD, No. 137-41768-10, Rev. A, May 2010,
http://developer.amd.com/zones/OpenCLZone/courses/Documents/Introduction_
to_OpenCL_Programming%20Training_Guide%20(201005).pdf
[112]: Evergreen Family Instruction Set Architecture, Instructions and Microcode Reference
Guide, AMD, Febr. 2011, http://developer.amd.com/gpu/amdappsdk/assets/
AMD_Evergreen-Family_Instruction_Set_Architecture.pdf
[113]: Intel 810 Chipset: Intel 82810/82810-DC100 Graphics and Memory Controller Hub
(GMCH) Datasheet, June 1999
ftp://download.intel.com/design/chipsets/datashts/29065602.pdf
[114]: Huynh A.T., AMD Announces "Fusion" CPU/GPU Program, Daily Tech, Oct. 25 2006,
http://www.dailytech.com/article.aspx?newsid=4696
[115]: Grim B., AMD Fusion Family of APUs, Dec. 7 2010, http://www.mytechnology.eu/wp-
content/uploads/2011/01/AMD-Fusion-Press-Tour_EMEA.pdf
[116]: Newell D., AMD Financial Analyst Day, Nov. 9 2010,
http://www.rumorpedia.net/wp-content/uploads/2010/11/rumorpedia02.jpg
[117]: De Maesschalck T., AMD starts shipping Ontario and Zacate CPUs, DarkVision
Hardware, Nov. 10 2010, http://www.dvhardware.net/article46449.html
[118]: AMD Accelerated Parallel Processing (APP) SDK (formerly ATI Stream) with
OpenCL 1.1 Support, APP SDK 2.3, Jan. 2011
[119]: Burgess B., „Bobcat” AMD’s New Low Power x86 Core Architecture, Aug. 24 2010,
http://www.hotchips.org/uploads/archive22/HC22.24.730-Burgess-AMD-Bobcat-x86.pdf
[126]: Shimpi A. L., The Sandy Bridge Review: Intel Core i7-2600K, i5-2500K and Core
i3-2100 Tested, AnandTech, Jan. 3 2011,
http://www.anandtech.com/show/4083/the-sandy-bridge-review-intel-core-i7-2600k-i5-
2500k-core-i3-2100-tested/11
[127]: Marques T., AMD Ontario, Zacate Die Sizes - Take 2, Sept. 14 2010,
http://www.siliconmadness.com/2010/09/amd-ontario-zacate-die-sizes-take-2.html
[138]: Dodeja A., Intel Arrandale, High Performance for the Masses, Hot Hardware,
Review of the IDF San Francisco, Sept. 2009,
http://akshaydodeja.com/intel-arrandale-high-performance-for-the-mass
[139]: Shimpi A. L., Intel Arrandale: 32nm for Notebooks, Core i5 540M Reviewed,
AnandTech, Jan. 4 2010,
http://www.anandtech.com/show/2902
[140]: Chiappeta M., Intel Clarkdale Core i5 Desktop Processor Debuts, Hot Hardware,
Jan. 03 2010,
http://hothardware.com/Articles/Intel-Clarkdale-Core-i5-Desktop-Processor-Debuts/
[141]: Thomas S. L., Desktop Platform Design Overview for Intel Microarchitecture (Nehalem)
Based Platform, Presentation ARCS001, IDF 2009
[142]: Kahn O., Valentine B., Microarchitecture Codename Sandy Bridge: New Processor
Innovations, Presentation ARCS001, IDF San Francisco Sept. 2010
[145]: ATI Stream Computing OpenCL Programming Guide, Rev. 1.0b, AMD, March 2010,
http://www.ljll.math.upmc.fr/groupes/gpgpu/tutorial/ATI_Stream_SDK_OpenCL
Programming_Guide.pdf
[147]: Nvidia Compute PTX: Parallel Thread Execution ISA, Version 2.3, March 2011,
http://developer.download.nvidia.com/compute/cuda/4_0_rc2/toolkit/docs/ptx_isa_
2.3.pdf
[148]: ATI Stream Computing Compute Abstraction Layer (CAL) Programming Guide,
Revision 2.03, AMD, Dec. 2010
http://developer.amd.com/gpu/amdappsdk/assets/AMD_CAL_Programming_Guide_
v2.0.pdf
[149]: Wikipedia: Dolphin triangle mesh,
http://en.wikipedia.org/wiki/File:Dolphin_triangle_mesh.png